colrev.env.tei_parser.TEIParser

class colrev.env.tei_parser.TEIParser(*, pdf_path=None, tei_path=None)

Bases: object

Environment service for TEI parsing.

Creates or reads a TEI file. The mode of operation depends on which arguments are passed:

- pdf_path: create the TEI and store it temporarily in self.data
- pdf_path and tei_path: create the TEI and save it to tei_path
- tei_path: read the TEI from file
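The argument combinations above can be summarized as a small decision function. This is only an illustrative sketch: describe_mode is a hypothetical helper, not part of colrev, and raising ValueError when neither path is given is an assumption, since the behavior for that case is not documented here.

```python
from pathlib import Path
from typing import Optional


def describe_mode(
    *, pdf_path: Optional[Path] = None, tei_path: Optional[Path] = None
) -> str:
    """Hypothetical helper: name the documented mode for a pair of arguments."""
    if pdf_path is not None and tei_path is not None:
        return "create TEI from PDF and save it to tei_path"
    if pdf_path is not None:
        return "create TEI from PDF and keep it in memory"
    if tei_path is not None:
        return "read TEI from file"
    # Assumption: the source does not say what happens without any path.
    raise ValueError("pass pdf_path, tei_path, or both")


print(describe_mode(pdf_path=Path("paper.pdf")))
print(describe_mode(pdf_path=Path("paper.pdf"), tei_path=Path("paper.tei.xml")))
print(describe_mode(tei_path=Path("paper.tei.xml")))
```

Both arguments are keyword-only, matching the `*` in the class signature.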

Methods

get_abstract

Get the abstract.

get_author_details

Get the author details.

get_citations_per_section

Get a dict of section-names and list-of-citations.

get_grobid_version

Get the GROBID version used for TEI creation.

get_metadata

Get the metadata of the PDF (title, author, ...) as a dict.

get_paper_keywords

Get the keywords.

get_references

Get the bibliography (references section) as a list of record dicts.

get_tei_str

Get the TEI string.

iter_paragraphs

Iterate over body paragraphs in reading order.

mark_references

Mark references with the additional record ID.

Attributes

ns

nsmap

get_abstract()

Get the abstract.

Return type:

str

get_author_details()

Get the author details.

Return type:

list

get_citations_per_section()

Get a dict of section-names and list-of-citations.

Return type:

dict
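To make the shape of a section-to-citations mapping concrete, here is a minimal standalone sketch using only the standard library. It is not colrev's implementation. It assumes GROBID-style TEI in which each section is a div with a head, and in-text citations are ref elements with type="bibr". The function name, the sample document, and the choice to report target IDs are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample, shaped like GROBID TEI output.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>Introduction</head><p>Prior work <ref type="bibr" target="#b0">[1]</ref> and <ref type="bibr" target="#b2">[3]</ref>.</p></div>
<div><head>Method</head><p>We follow <ref type="bibr" target="#b0">[1]</ref>; see <ref type="figure" target="#fig_0">Fig. 1</ref>.</p></div>
</body></text></TEI>"""


def citations_per_section_sketch(tei_xml: str) -> dict:
    """Map each section heading to the bibliography targets cited in it."""
    root = ET.fromstring(tei_xml)
    result = {}
    for div in root.iter(f"{NS}div"):
        head = div.find(f"{NS}head")
        if head is None:
            continue  # skip sections without a heading
        name = "".join(head.itertext()).strip()
        # Keep only bibliographic references; figure/table refs are ignored.
        result[name] = [
            ref.get("target", "").lstrip("#")
            for ref in div.iter(f"{NS}ref")
            if ref.get("type") == "bibr"
        ]
    return result


print(citations_per_section_sketch(SAMPLE))
```

The keys and values actually returned by get_citations_per_section() may differ from this sketch.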

get_grobid_version()

Get the GROBID version used for TEI creation.

Return type:

str

get_metadata()

Get the metadata of the PDF (title, author, …) as a dict.

Return type:

dict

get_paper_keywords()

Get the keywords.

Return type:

list

get_references(*, add_intext_citation_count=False)

Get the bibliography (references section) as a list of record dicts.

Return type:

list

get_tei_str()

Get the TEI string.

Return type:

str

iter_paragraphs(*, min_chars=40, exclude_sections=('references', 'reference', 'bibliography', 'acknowledgment', 'acknowledgement', 'appendix'))

Iterate over body paragraphs in reading order.

Return type:

Iterator[str]

Args:

min_chars: Minimum number of characters for a paragraph to be yielded. Shorter chunks (e.g., headings, figure labels) are skipped.

exclude_sections: Section names (case-insensitive) whose paragraphs should be excluded (e.g., references, appendix).

Yields:

Cleaned paragraph texts in the order they appear in the body.
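The effect of min_chars and exclude_sections can be illustrated with a standalone sketch built on the standard library. This is not colrev's code. iter_paragraphs_sketch and the sample document are hypothetical, "cleaning" is assumed to mean whitespace normalization, and section names are assumed to be matched exactly after lowercasing. Paragraphs in sub-sections nested under an excluded section are not handled.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Sequence

NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample, shaped like GROBID TEI output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>Introduction</head>
<p>Figure 1</p>
<p>Literature reviews synthesize prior research   and identify gaps for future work.</p>
</div>
<div><head>References</head>
<p>This paragraph sits under an excluded heading and should never be yielded.</p>
</div>
</body></text></TEI>"""


def iter_paragraphs_sketch(
    tei_xml: str,
    *,
    min_chars: int = 40,
    exclude_sections: Sequence[str] = ("references", "appendix"),
) -> Iterator[str]:
    """Yield whitespace-normalized body paragraphs in document order."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{NS}body")
    if body is None:
        return
    excluded = {name.lower() for name in exclude_sections}
    for div in body.iter(f"{NS}div"):
        head = div.find(f"{NS}head")
        title = "" if head is None else " ".join("".join(head.itertext()).split())
        if title.lower() in excluded:
            continue  # drop every paragraph directly under an excluded section
        for para in div.findall(f"{NS}p"):
            text = " ".join("".join(para.itertext()).split())
            if len(text) >= min_chars:  # skip headings, figure labels, etc.
                yield text


for paragraph in iter_paragraphs_sketch(SAMPLE_TEI):
    print(paragraph)
```

With the defaults, only the long introduction paragraph is yielded. Lowering min_chars to 1 also yields the "Figure 1" label, while the paragraph under "References" stays excluded.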

mark_references(*, records)

Mark references with the additional record ID.