colrev.env.tei_parser.TEIParser

class colrev.env.tei_parser.TEIParser(*, pdf_path=None, tei_path=None)

Bases: object

Environment service for TEI parsing

Creates or reads a TEI file. Modes of operation:

- pdf_path: create TEI and temporarily store it in self.data
- pdf_path and tei_path: create TEI and save it in tei_path
- tei_path: read TEI from file
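The three modes depend only on which keyword arguments are passed. The following is a minimal sketch of that mapping, not the class's actual constructor logic: `describe_mode` is a hypothetical helper introduced here for illustration, and the behavior when neither argument is given is an assumption.

```python
from pathlib import Path
from typing import Optional


def describe_mode(
    *, pdf_path: Optional[Path] = None, tei_path: Optional[Path] = None
) -> str:
    """Hypothetical helper: name the documented mode for an argument combination."""
    if pdf_path is not None and tei_path is not None:
        # e.g. TEIParser(pdf_path=..., tei_path=...)
        return "create TEI from the PDF and save it in tei_path"
    if pdf_path is not None:
        # e.g. TEIParser(pdf_path=...)
        return "create TEI from the PDF and keep it in memory"
    if tei_path is not None:
        # e.g. TEIParser(tei_path=...)
        return "read existing TEI from tei_path"
    # Assumption: the documentation does not say what happens in this case.
    raise ValueError("pass pdf_path, tei_path, or both")


print(describe_mode(pdf_path=Path("paper.pdf")))
print(describe_mode(pdf_path=Path("paper.pdf"), tei_path=Path("paper.tei.xml")))
print(describe_mode(tei_path=Path("paper.tei.xml")))
```

Both arguments are keyword-only in the signature above, so they have to be passed by name.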

Methods

get_abstract

Get the abstract

get_author_details

Get the author details

get_citations_per_section

Get a dict of section-names and list-of-citations

get_grobid_version

Get the GROBID version used for TEI creation

get_metadata

Get the metadata of the PDF (title, author, ...) as a dict

get_paper_keywords

Get the keywords

get_references

Get the bibliography (references section) as a list of record dicts

get_tei_str

Get the TEI string

iter_paragraphs

Iterate over body paragraphs in reading order.

mark_references

Mark references with the additional record ID

Attributes

ns

nsmap

get_abstract()

Get the abstract

Return type:

str

get_author_details()

Get the author details

Return type:

list

get_citations_per_section()

Get a dict of section-names and list-of-citations

Return type:

dict
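To make the shape of a "section name to list of citations" mapping concrete, here is a minimal sketch using only the standard library. It is not the method's implementation: `sketch_citations_per_section` and `SAMPLE_TEI` are hypothetical, and the sketch assumes that in-text citations appear as `<ref type="bibr" target="#...">` elements inside body `<div>` sections. The keys and values the real method returns may differ.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written TEI fragment for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>Introduction</head>
<p>Prior work <ref type="bibr" target="#b0">[1]</ref> and
<ref type="bibr" target="#b1">[2]</ref> motivates this study.</p></div>
<div><head>Method</head>
<p>We follow <ref type="bibr" target="#b0">[1]</ref>; see
<ref type="figure" target="#fig_0">Fig. 1</ref>.</p></div>
</body></text></TEI>"""


def sketch_citations_per_section(tei_xml: str) -> Dict[str, List[str]]:
    """Map each lower-cased section heading to the bibliography targets cited in it."""
    root = ET.fromstring(tei_xml)
    result: Dict[str, List[str]] = {}
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        name = "".join(head.itertext()).strip().lower() if head is not None else ""
        # Only bibliographic references count; figure and table refs are ignored.
        targets = [
            ref.get("target", "").lstrip("#")
            for ref in div.iterfind(".//tei:ref[@type='bibr']", TEI_NS)
        ]
        result.setdefault(name, []).extend(targets)
    return result


print(sketch_citations_per_section(SAMPLE_TEI))
```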

get_grobid_version()

Get the GROBID version used for TEI creation

Return type:

str

get_metadata()

Get the metadata of the PDF (title, author, …) as a dict

Return type:

dict

get_paper_keywords()

Get the keywords

Return type:

list

get_references(*, add_intext_citation_count=False)

Get the bibliography (references section) as a list of record dicts

Return type:

list

get_tei_str()

Get the TEI string

Return type:

str

iter_paragraphs(*, min_chars=40, exclude_sections=('references', 'reference', 'bibliography', 'acknowledgment', 'acknowledgement', 'appendix'))

Iterate over body paragraphs in reading order.

Return type:

Iterator[str]

Args:

min_chars: Minimum number of characters for a paragraph to be yielded. Shorter chunks (e.g., headings, figure labels) are skipped.

exclude_sections: Section names (case-insensitive) whose paragraphs should be excluded (e.g., references, appendix).

Yields:

Cleaned paragraph texts in the order they appear in the body.
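The filtering described above can be illustrated with a small standalone sketch. It is not the library's implementation: `sketch_iter_paragraphs` and `SAMPLE_TEI` are hypothetical, and the sketch assumes that body sections are `<div>` elements with a `<head>` title, that section names are compared by exact case-insensitive match, and that "cleaned" means whitespace-normalized.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Sequence

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written TEI fragment for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>Introduction</head>
<p>Short.</p>
<p>This   paragraph is long enough to pass the default
minimum length filter.</p></div>
<div><head>Appendix</head>
<p>This appendix paragraph is long enough but its section is excluded.</p></div>
</body></text></TEI>"""


def sketch_iter_paragraphs(
    tei_xml: str,
    *,
    min_chars: int = 40,
    exclude_sections: Sequence[str] = ("references", "appendix"),
) -> Iterator[str]:
    """Yield whitespace-normalized body paragraphs in document order."""
    root = ET.fromstring(tei_xml)
    excluded = {name.lower() for name in exclude_sections}
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = " ".join("".join(head.itertext()).split()).lower() if head is not None else ""
        if title in excluded:
            continue  # skip every paragraph of an excluded section
        for para in div.iterfind("tei:p", TEI_NS):
            text = " ".join("".join(para.itertext()).split())
            if len(text) >= min_chars:
                yield text


for paragraph in sketch_iter_paragraphs(SAMPLE_TEI):
    print(paragraph)
```

With the defaults, only the long introduction paragraph is yielded. Passing `min_chars=1` and `exclude_sections=()` yields all three paragraphs.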

mark_references(*, records)

Mark references with the additional record ID