colrev.env.tei_parser.TEIParser
- class colrev.env.tei_parser.TEIParser(*, pdf_path=None, tei_path=None)[source]
Bases:
object
Environment service for TEI parsing
Creates a TEI file. Modes of operation:
- pdf_path: create the TEI and temporarily store it in self.data
- pdf_path and tei_path: create the TEI and save it in tei_path
- tei_path: read the TEI from file
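The three modes can be summarized as a small dispatch on which arguments are set. The following is only an illustrative sketch: `describe_mode` is a hypothetical helper, not part of colrev, and raising an error when neither path is given is an assumption rather than documented behavior.

```python
from pathlib import Path
from typing import Optional


def describe_mode(
    *, pdf_path: Optional[Path] = None, tei_path: Optional[Path] = None
) -> str:
    """Return which documented mode a combination of arguments selects.

    Hypothetical helper for illustration; it does not call colrev.
    """
    if pdf_path is not None and tei_path is not None:
        return "create TEI from the PDF and save it in tei_path"
    if pdf_path is not None:
        return "create TEI from the PDF and keep it in memory"
    if tei_path is not None:
        return "read TEI from file"
    # Assumption: calling with neither argument is treated as an error here.
    raise ValueError("provide pdf_path, tei_path, or both")


print(describe_mode(pdf_path=Path("paper.pdf")))
print(describe_mode(pdf_path=Path("paper.pdf"), tei_path=Path("paper.tei.xml")))
print(describe_mode(tei_path=Path("paper.tei.xml")))
```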
Methods
- get_abstract: Get the abstract
- get_author_details: Get the author details
- get_citations_per_section: Get a dict mapping section names to lists of citations
- get_grobid_version: Get the GROBID version used for TEI creation
- get_metadata: Get the metadata of the PDF (title, author, ...) as a dict
- get_paper_keywords: Get the keywords
- get_references: Get the bibliography (references section) as a list of record dicts
- get_tei_str: Get the TEI string
- iter_paragraphs: Iterate over body paragraphs in reading order.
- mark_references: Mark references with the additional record ID
Attributes
ns
nsmap
- get_citations_per_section()[source]
Get a dict mapping section names to lists of citations
- Return type:
dict
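To make the shape of such a mapping concrete, here is a minimal standalone sketch that builds a section-to-citations dict from a TEI string using only the standard library. It is not colrev's implementation: `citations_per_section`, the sample document, and the choice of reference targets as list items are assumptions, as is the GROBID-style markup (`<div>`/`<head>` for sections, `<ref type="bibr">` for in-text citations).

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written sample in GROBID-like TEI.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Introduction</head>
      <p>Prior work <ref type="bibr" target="#b0">[1]</ref>
         and <ref type="bibr" target="#b1">[2]</ref>.</p>
    </div>
    <div><head>Method</head>
      <p>We follow <ref type="bibr" target="#b0">[1]</ref>.</p>
    </div>
  </body></text>
</TEI>"""


def citations_per_section(tei_str: str) -> dict:
    """Map each top-level body section heading to its cited reference IDs."""
    root = ET.fromstring(tei_str)
    result = {}
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        name = "".join(head.itertext()).strip() if head is not None else ""
        # Collect the targets of bibliographic references inside this section.
        result[name] = [
            ref.get("target", "").lstrip("#")
            for ref in div.iterfind(".//tei:ref[@type='bibr']", TEI_NS)
        ]
    return result


print(citations_per_section(SAMPLE_TEI))
```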
- get_references(*, add_intext_citation_count=False)[source]
Get the bibliography (references section) as a list of record dicts
- Return type:
list
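As a rough illustration of what an in-text citation count could add to each record, the sketch below parses a hand-written TEI sample with the standard library. It does not reproduce colrev's record format: `references_with_counts` and the dict keys `tei_id`, `title`, and `nr_intext_citations` are hypothetical names chosen for this example, and the markup layout (`<listBibl>`/`<biblStruct xml:id=...>`) is assumed to follow GROBID-style TEI.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:id under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: two references, the first cited twice in the body.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>
<body><div><head>Introduction</head>
<p>See <ref type="bibr" target="#b0">[1]</ref>,
<ref type="bibr" target="#b1">[2]</ref>,
and again <ref type="bibr" target="#b0">[1]</ref>.</p></div></body>
<back><div type="references"><listBibl>
<biblStruct xml:id="b0"><analytic><title>First paper</title></analytic></biblStruct>
<biblStruct xml:id="b1"><analytic><title>Second paper</title></analytic></biblStruct>
</listBibl></div></back>
</text></TEI>"""


def references_with_counts(tei_str: str, *, add_intext_citation_count: bool = False) -> list:
    """Return one dict per bibliography entry, optionally with citation counts."""
    root = ET.fromstring(tei_str)
    counts = Counter()
    body = root.find(".//tei:body", TEI_NS)
    if body is not None:
        # Count how often each reference ID is cited in the body text.
        for ref in body.iterfind(".//tei:ref[@type='bibr']", TEI_NS):
            counts[ref.get("target", "").lstrip("#")] += 1
    records = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        ref_id = bibl.get(XML_ID, "")
        title = bibl.findtext(".//tei:title", default="", namespaces=TEI_NS)
        record = {"tei_id": ref_id, "title": title}  # hypothetical keys
        if add_intext_citation_count:
            record["nr_intext_citations"] = counts[ref_id]
        records.append(record)
    return records


for rec in references_with_counts(SAMPLE_TEI, add_intext_citation_count=True):
    print(rec)
```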
- iter_paragraphs(*, min_chars=40, exclude_sections=('references', 'reference', 'bibliography', 'acknowledgment', 'acknowledgement', 'appendix'))[source]
Iterate over body paragraphs in reading order.
- Return type:
Iterator[str]
- Args:
- min_chars: Minimum number of characters for a paragraph to be yielded.
Shorter chunks (e.g., headings, figure labels) are skipped.
- exclude_sections: Section names (case-insensitive) whose paragraphs
should be excluded (e.g., references, appendix).
- Yields:
Cleaned paragraph texts in the order they appear in the body.
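The filtering described by `min_chars` and `exclude_sections` can be sketched as a standalone generator over a TEI string. This is an approximation under stated assumptions, not the library's code: `iter_body_paragraphs` is a hypothetical function, sections are assumed to be top-level `<div>` elements with a `<head>`, and "cleaned" is interpreted here as collapsing whitespace.

```python
import re
import xml.etree.ElementTree as ET
from typing import Iterator, Sequence

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample with one short chunk and one excluded section.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>Introduction</head>
<p>Short.</p>
<p>This paragraph is long enough to pass the
   default minimum length filter.</p>
</div>
<div><head>Appendix</head>
<p>This appendix paragraph is also long enough but its section is excluded.</p>
</div>
</body></text></TEI>"""


def iter_body_paragraphs(
    tei_str: str,
    *,
    min_chars: int = 40,
    exclude_sections: Sequence[str] = ("references", "appendix"),
) -> Iterator[str]:
    """Yield whitespace-normalized body paragraphs in document order."""
    excluded = {name.lower() for name in exclude_sections}
    root = ET.fromstring(tei_str)
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        section = "".join(head.itertext()).strip().lower() if head is not None else ""
        if section in excluded:
            continue  # skip every paragraph of an excluded section
        for para in div.iterfind("tei:p", TEI_NS):
            text = re.sub(r"\s+", " ", "".join(para.itertext())).strip()
            if len(text) >= min_chars:
                yield text


for paragraph in iter_body_paragraphs(SAMPLE_TEI):
    print(paragraph)
```

With the defaults, only the long introduction paragraph is yielded; lowering `min_chars` and passing an empty `exclude_sections` yields all three.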