colrev pdf-prep

In the colrev pdf-prep operation, records transition from pdf_imported to pdf_prepared or pdf_needs_manual_preparation. Depending on the settings, this operation may involve any of the following:

  • Check whether the PDF is machine readable and apply OCR if necessary

  • Identify and remove additional pages and decorations, which may interfere with machine learning tools

  • Validate whether the PDF matches the record metadata and whether the PDF is complete (its page count matches the number of pages given in the record)

  • Create unique PDF identifiers (PDF hashes) that can be used for retrieval and validation (e.g., in crowdsourcing)
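The completeness check in the list above can be illustrated with a minimal sketch. It assumes the record stores its page range in a BibTeX-style pages field such as "123--145"; the function names expected_page_count and matches_page_count are hypothetical and are not part of the CoLRev API. CoLRev's own validation may handle more cases (for example, roman numerals or article numbers).

```python
import re
from typing import Optional


def expected_page_count(pages: str) -> Optional[int]:
    """Return the page count implied by a pages field such as "123--145".

    Returns None when the field is not a plain numeric page or range
    (hypothetical helper, not part of CoLRev).
    """
    # Numeric range with a single or double hyphen, e.g. "123-145" or "123--145"
    match = re.fullmatch(r"\s*(\d+)\s*-{1,2}\s*(\d+)\s*", pages)
    if match:
        first, last = int(match.group(1)), int(match.group(2))
        if last < first:
            return None
        return last - first + 1
    # A single numeric page, e.g. "7"
    if re.fullmatch(r"\s*\d+\s*", pages):
        return 1
    return None


def matches_page_count(pdf_page_count: int, pages: str) -> Optional[bool]:
    """Compare the PDF's page count with the record's pages field.

    Returns None when the pages field cannot be interpreted.
    """
    expected = expected_page_count(pages)
    if expected is None:
        return None
    return pdf_page_count == expected
```

A result of False would suggest that the PDF is incomplete or contains extra pages (such as a cover page), while None means the metadata does not allow a decision.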

By default, CoLRev keeps a backup of PDFs that are changed by the pdf-prep operation. To change this behavior, modify the keep_backup_of_pdfs option in the pdf_prep settings. The operation is invoked as follows:

colrev pdf-prep [options]
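As an illustration of the keep_backup_of_pdfs option mentioned above, the following sketch toggles the option programmatically. It assumes that the project settings are stored in a JSON file with a top-level pdf_prep object; the file location, this layout, and the helper names are assumptions and should be checked against the actual project.

```python
import json
from pathlib import Path


def with_backup_setting(settings: dict, keep: bool) -> dict:
    """Return a copy of the settings with pdf_prep.keep_backup_of_pdfs set.

    Assumes a top-level "pdf_prep" object (an assumption about the layout).
    """
    updated = dict(settings)
    pdf_prep = dict(updated.get("pdf_prep", {}))
    pdf_prep["keep_backup_of_pdfs"] = keep
    updated["pdf_prep"] = pdf_prep
    return updated


def update_settings_file(settings_path: Path, keep: bool) -> None:
    """Rewrite the JSON settings file with the new backup option."""
    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    updated = with_backup_setting(settings, keep)
    settings_path.write_text(
        json.dumps(updated, indent=4) + "\n", encoding="utf-8"
    )
```

For example, update_settings_file(Path("settings.json"), keep=False) would disable backups under these assumptions. Editing the settings file by hand achieves the same result.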

The following options for pdf-prep are available:

Identifier                Description                         Status
colrev.grobid_tei         GROBID TEI (instructions)           MATURING
colrev.ocrmypdf           OCRMyPDF (instructions)             MATURING
colrev.remove_coverpage   Remove Cover Page (instructions)    MATURING
colrev.remove_last_page   Remove Last Page (instructions)     MATURING

The colrev pdf-prep-man operation provides an interactive convenience function for PDFs that cannot be prepared automatically, with records transitioning from pdf_needs_manual_preparation to pdf_prepared.

colrev pdf-prep-man [options]
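The record state transitions described for pdf-prep and pdf-prep-man can be summarized in a small sketch. The state names are taken from the text above; the RecordState class, the ALLOWED_TRANSITIONS table, and the is_valid_transition function are hypothetical illustrations and do not reflect how CoLRev models states internally.

```python
from enum import Enum


class RecordState(Enum):
    """Record states mentioned for the two PDF preparation operations."""

    pdf_imported = "pdf_imported"
    pdf_needs_manual_preparation = "pdf_needs_manual_preparation"
    pdf_prepared = "pdf_prepared"


# Operation -> source state -> set of possible target states
ALLOWED_TRANSITIONS = {
    "pdf-prep": {
        RecordState.pdf_imported: {
            RecordState.pdf_prepared,
            RecordState.pdf_needs_manual_preparation,
        },
    },
    "pdf-prep-man": {
        RecordState.pdf_needs_manual_preparation: {RecordState.pdf_prepared},
    },
}


def is_valid_transition(
    operation: str, source: RecordState, target: RecordState
) -> bool:
    """Check whether an operation may move a record from source to target."""
    targets = ALLOWED_TRANSITIONS.get(operation, {}).get(source, set())
    return target in targets
```

In this sketch, a record that pdf-prep cannot prepare automatically ends up in pdf_needs_manual_preparation, and only pdf-prep-man moves it on to pdf_prepared.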

The following options for pdf-prep-man are available:

Identifier                       Description                                Status
colrev.colrev_cli_pdf_prep_man   Prep PDFs manually (CLI) (instructions)    MATURING