colrev.record.record_pdf.PDFRecord

class PDFRecord(data, path)

Bases: Record

The PDFRecord class provides a range of convenience functions for PDF handling

Methods

add_field_provenance

Add a field provenance, including source and note (based on a key)

add_field_provenance_note

Add a field provenance note (based on a key)

add_provenance_all

Add a data provenance (source) to all fields

align_provenance

Remove unnecessary provenance information and add missing provenance information

change_entrytype

Change the ENTRYTYPE

complete_provenance

Complete provenance information for indexing

copy_prep_rec

Copy the record object (as a PrepRecord)

defects

Get a list of defects for a field

extract_pages

Extract pages from the PDF

extract_pages_from_pdf

Extract pages from the PDF at pdf_path

extract_text_by_page

Extract the text from the PDF for a given number of pages

format_bib_style

Simple formatter for bibliography-style output

get_citation_format

Get the record as a citation

get_colrev_id

Get the colrev_id of the record

get_colrev_pdf_id

Generate the colrev_pdf_id

get_container_title

Get the record's container title (journal name, booktitle, etc.)

get_data

Get the record data

get_diff

Get diff between record objects

get_field_provenance

Get the provenance for a selected field (key)

get_field_provenance_notes

Get field provenance notes based on a key

get_field_provenance_source

Get the provenance source for a selected field (key)

get_pdf_hash

Get the PDF image hash

get_record_change_score

Determine how much records changed

get_record_similarity

Determine the similarity between two records (their masterdata)

get_tei_filename

Get the TEI filename associated with the file (PDF)

get_toc_key

Get the record's toc-key

get_value

Get a record value (based on the key parameter)

has_pdf_defects

Check whether the PDF has quality defects

has_quality_defects

Check whether a record (or specific field/key) has quality defects

ignore_defect

Ignore a defect for a field

ignored_defect

Check whether a defect is ignored for a field

is_retracted

Check for potential retractions

masterdata_is_curated

Check whether the record masterdata is curated

merge

General-purpose record merging for preparation, curated/non-curated records and records with origins

prescreen_exclude

Prescreen-exclude a record

print_citation_format

Print the record as a citation

remove_field

Remove a field

remove_field_provenance_note

Remove field provenance notes based on a key

rename_field

Rename a field

require_prov

Ensure that provenance fields are available

reset_pdf_provenance_notes

Reset the PDF (file) provenance notes

run_pdf_quality_model

Run the PDF quality model

run_quality_model

Update the masterdata provenance

set_masterdata_complete

Set the masterdata to complete

set_masterdata_consistent

Set the masterdata to consistent

set_masterdata_curated

Set record masterdata to curated

set_nr_pages_in_pdf

Set the pages_in_file field based on the PDF

set_status

Set the record status

set_text_from_pdf

Set the text_from_pdf field based on the PDF

update_by_record

Update all data of a record object based on another record

update_field

Update a record field (including provenance information)

Attributes

pp

data

Dictionary containing the record data

path

Path to the repository (record.data[Fields.File] is relative to path)

add_field_provenance(*, key, source, note='')

Add a field provenance, including source and note (based on a key)

Return type:

None
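
The bare * in this signature (and in most signatures on this page) makes every parameter after it keyword-only. A minimal sketch of what that means for callers, using a hypothetical stand-in class ProvenanceDemo rather than colrev's own Record; the dictionary layout and the source string are assumptions for illustration only:

```python
class ProvenanceDemo:
    """Hypothetical stand-in, not colrev's Record class."""

    def __init__(self) -> None:
        # Assumed layout: one {"source", "note"} entry per field key.
        self.provenance = {}

    def add_field_provenance(self, *, key: str, source: str, note: str = "") -> None:
        """Store source and note for a field; all arguments are keyword-only."""
        self.provenance[key] = {"source": source, "note": note}


demo = ProvenanceDemo()
demo.add_field_provenance(key="title", source="crossref.bib/001")  # made-up source

try:
    demo.add_field_provenance("title", "crossref.bib/001")  # positional call
    positional_ok = True
except TypeError:
    positional_ok = False  # the bare * rejects positional arguments

print(demo.provenance)
print(positional_ok)
```

Calling with key=... and source=... succeeds, while the positional call raises TypeError. The same calling convention applies to any method listed here whose signature starts with *.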

add_field_provenance_note(*, key, note)

Add a field provenance note (based on a key)

Return type:

None

add_provenance_all(*, source)

Add a data provenance (source) to all fields

Return type:

None

align_provenance()

Remove unnecessary provenance information and add missing provenance information

Return type:

None

change_entrytype(new_entrytype, *, qm)

Change the ENTRYTYPE

Return type:

None

complete_provenance(*, source_info)

Complete provenance information for indexing

Return type:

bool

copy_prep_rec()

Copy the record object (as a PrepRecord)

Return type:

PrepRecord

data

Dictionary containing the record data

defects(key)

Get a list of defects for a field

Return type:

List[str]

extract_pages(*, pages, save_to_path=None)

Extract pages from the PDF

Return type:

None

classmethod extract_pages_from_pdf(*, pages, pdf_path, save_to_path=None)

Extract pages from the PDF at pdf_path

Return type:

None

extract_text_by_page(*, pages=None)

Extract the text from the PDF for a given number of pages

Return type:

str

format_bib_style()

Simple formatter for bibliography-style output

Return type:

str

get_citation_format()

Get the record as a citation

Return type:

str

get_colrev_id(*, assume_complete=False)

Get the colrev_id of the record

Return type:

str

classmethod get_colrev_pdf_id(pdf_path)

Generate the colrev_pdf_id

Return type:

str

get_container_title(*, na_string='NA')

Get the record’s container title (journal name, booktitle, etc.)

Return type:

str

get_data()

Get the record data

Return type:

dict

get_diff(other_record, *, identifying_fields_only=True)

Get diff between record objects

Return type:

list

get_field_provenance(*, key, default_source='ORIGINAL')

Get the provenance for a selected field (key)

Return type:

dict

get_field_provenance_notes(key)

Get field provenance notes based on a key

Return type:

list

get_field_provenance_source(key)

Get the provenance source for a selected field (key)

Return type:

str

get_pdf_hash(*, page_nr, hash_size=32)

Get the PDF image hash

Return type:

str
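
This entry does not say which image hash is computed. As general background only, one common kind of image hash is the average hash: each pixel of a small grayscale image contributes one bit, set when the pixel is brighter than the mean. The sketch below is a pure-Python illustration of that idea on a hand-written 4x4 grid; the function name average_hash and the input are hypothetical, and nothing here is taken from colrev's implementation:

```python
from typing import List


def average_hash(pixels: List[List[int]]) -> str:
    """Return a hex string with one bit per pixel: 1 if brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = "".join("1" if value > mean else "0" for value in flat)
    width = (len(bits) + 3) // 4  # number of hex digits needed
    return f"{int(bits, 2):0{width}x}"


# Hypothetical 4x4 grayscale "page" (values 0-255).
page = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
print(average_hash(page))  # prints "33cc"
```

In hashes of this kind, a size parameter usually sets the side length of the downscaled image, so a larger value yields a longer, more discriminating hash. Whether hash_size plays exactly that role here is an assumption.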

classmethod get_record_change_score(record_a, record_b)

Determine how much records changed

This method is less sensitive than get_record_similarity, especially when fields are missing. For example, if the journal field is missing in both records, get_record_similarity will return a value below 1.0, whereas get_record_change_score will return 0.0 (if all other fields are equal).

Return type:

float
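
To illustrate how a field that is missing in both records can affect a similarity-style score while leaving a change-style score at 0.0, here is a toy sketch. The functions toy_similarity and toy_change_score, the field set, and the scoring rules are all hypothetical simplifications built on difflib; they are not colrev's algorithms:

```python
from difflib import SequenceMatcher

FIELDS = ("author", "title", "journal", "year")  # hypothetical field set


def field_ratio(a: str, b: str) -> float:
    """String similarity in [0, 1] for two field values."""
    return SequenceMatcher(None, a, b).ratio()


def toy_similarity(rec_a: dict, rec_b: dict) -> float:
    """Average per-field similarity; a field missing on either side scores 0."""
    scores = []
    for key in FIELDS:
        a, b = rec_a.get(key, ""), rec_b.get(key, "")
        scores.append(field_ratio(a, b) if a and b else 0.0)
    return sum(scores) / len(scores)


def toy_change_score(rec_a: dict, rec_b: dict) -> float:
    """Average per-field change; equal values (including both missing) score 0."""
    changes = []
    for key in FIELDS:
        a, b = rec_a.get(key, ""), rec_b.get(key, "")
        changes.append(0.0 if a == b else 1.0 - field_ratio(a, b))
    return sum(changes) / len(changes)


# Two identical records that both lack a journal field.
rec_a = {"author": "Smith, J.", "title": "On reviews", "year": "2020"}
rec_b = dict(rec_a)

print(toy_similarity(rec_a, rec_b))    # 0.75
print(toy_change_score(rec_a, rec_b))  # 0.0
```

In this toy version the missing journal field costs a quarter of the similarity score, yet the change score stays at 0.0 because nothing differs between the two records.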

classmethod get_record_similarity(record_a, record_b)

Determine the similarity between two records (their masterdata)

Return type:

float

get_tei_filename()

Get the TEI filename associated with the file (PDF)

Return type:

Path

get_toc_key()

Get the record’s toc-key

Return type:

str

get_value(key, *, default=None)

Get a record value (based on the key parameter)

Return type:

str

has_pdf_defects()

Check whether the PDF has quality defects

Return type:

bool

has_quality_defects(*, key='')

Check whether a record (or specific field/key) has quality defects

Return type:

bool

ignore_defect(*, key, defect)

Ignore a defect for a field

Return type:

None

ignored_defect(*, key, defect)

Check whether a defect is ignored for a field

Return type:

bool

is_retracted()

Check for potential retractions

Return type:

bool

masterdata_is_curated()

Check whether the record masterdata is curated

Return type:

bool

merge(merging_record, *, default_source, preferred_masterdata_source_prefixes=None)

General-purpose record merging for preparation, curated/non-curated records and records with origins

Applies quality heuristics to fuse the best fields from both records

Return type:

None

path

Path to the repository (record.data[Fields.File] is relative to path)
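
Because the file field is stored relative to path, the location of the PDF is obtained by joining the two. A minimal sketch with pathlib; the key name "file" (assumed to be the value of Fields.File), the repository root, and the record data are all hypothetical:

```python
from pathlib import PurePosixPath

FILE_KEY = "file"  # assumed value of Fields.File

# Hypothetical repository root and record data.
repo_path = PurePosixPath("/tmp/review_project")
data = {"ID": "Smith2020", FILE_KEY: "data/pdfs/Smith2020.pdf"}

# Join the repository path with the relative file entry.
pdf_path = repo_path / data[FILE_KEY]
print(pdf_path)  # /tmp/review_project/data/pdfs/Smith2020.pdf
```

PurePosixPath is used only to keep the printed result identical across platforms; with a real file system, pathlib.Path would be the usual choice.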

prescreen_exclude(*, reason, print_warning=False)

Prescreen-exclude a record

Return type:

None

print_citation_format()

Print the record as a citation

Return type:

None

remove_field(*, key, not_missing_note=False, source='')

Remove a field

Return type:

None

remove_field_provenance_note(*, key, note)

Remove field provenance notes based on a key

Return type:

None

rename_field(*, key, new_key)

Rename a field

Return type:

None

require_prov()

Ensure that provenance fields are available

Return type:

None

reset_pdf_provenance_notes()

Reset the PDF (file) provenance notes

Return type:

None

run_pdf_quality_model(pdf_qm, *, set_prepared=False)

Run the PDF quality model

Return type:

None

run_quality_model(quality_model, *, set_prepared=False)

Update the masterdata provenance

Return type:

None

set_masterdata_complete(*, source, masterdata_repository, replace_source=True)

Set the masterdata to complete

Return type:

None

set_masterdata_consistent()

Set the masterdata to consistent

Return type:

None

set_masterdata_curated(source)

Set record masterdata to curated

Return type:

None

set_nr_pages_in_pdf()

Set the pages_in_file field based on the PDF

Return type:

None

set_status(target_state, *, force=False)

Set the record status

Return type:

None

set_text_from_pdf()

Set the text_from_pdf field based on the PDF

Return type:

None

update_by_record(update_record)

Update all data of a record object based on another record

Return type:

None

update_field(*, key, value, source, note='', keep_source_if_equal=True, append_edit=True)

Update a record field (including provenance information)

Return type:

None