colrev pdf-prep

In the colrev pdf-prep operation, records transition from pdf_imported to pdf_prepared or pdf_needs_manual_preparation. Depending on the settings, this operation may involve any of the following:

Check whether the PDF is machine readable and apply OCR if necessary
Identify and remove additional pages and decorations (which may interfere with machine learning tools)
Validate whether the PDF matches the record metadata and whether the PDF is complete (its page count matches the expected number of pages)
Create unique PDF identifiers (PDF hashes) that can be used for retrieval and validation (e.g., in crowdsourcing)
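The idea of a PDF identifier for retrieval and validation can be illustrated with a minimal sketch. It assumes a plain SHA-256 digest over the file's bytes, which is a stand-in chosen for illustration and not necessarily how CoLRev computes its PDF hashes. The function name pdf_sha256 and the sample file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def pdf_sha256(pdf_path: Path, chunk_size: int = 65536) -> str:
    """Return a hex SHA-256 digest of the file's bytes (illustrative stand-in)."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as handle:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical placeholder file, not a valid PDF document.
        sample = Path(tmp) / "sample.pdf"
        sample.write_bytes(b"%PDF-1.4\n% placeholder content\n")
        print(pdf_sha256(sample))
```

A byte-level digest changes whenever the file changes, so in this sketch it would only be meaningful when computed after any pages have been removed.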

By default, CoLRev keeps a backup of PDFs that are changed by the pdf-prep operation. To change this behavior, modify the keep_backup_of_pdfs option of the pdf_prep settings. The operation is run as follows:

colrev pdf-prep [options]
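As a rough sketch, the keep_backup_of_pdfs option could be switched off by editing the project settings programmatically. The example assumes that the settings are stored as JSON in a file named settings.json and that keep_backup_of_pdfs sits directly under a pdf_prep key; both are assumptions to check against your own project. The helper set_keep_backup is hypothetical and not part of CoLRev.

```python
import json
import tempfile
from pathlib import Path


def set_keep_backup(settings_path: Path, keep: bool) -> dict:
    """Set pdf_prep.keep_backup_of_pdfs in a JSON settings file; return the settings."""
    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    # setdefault keeps the sketch usable even if the pdf_prep section is absent.
    settings.setdefault("pdf_prep", {})["keep_backup_of_pdfs"] = keep
    settings_path.write_text(json.dumps(settings, indent=4) + "\n", encoding="utf-8")
    return settings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Assumed file name and layout; a real project file holds many more keys.
        path = Path(tmp) / "settings.json"
        path.write_text(
            json.dumps({"pdf_prep": {"keep_backup_of_pdfs": True}}),
            encoding="utf-8",
        )
        print(set_keep_backup(path, keep=False)["pdf_prep"])
```

Editing the file by hand achieves the same result; the script form is mainly useful when the same change has to be applied to several projects.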

The following options for pdf-prep are available:

Identifier	Description	Status
colrev.grobid_tei	GROBID TEI
colrev.ocrmypdf	OCRMyPDF
colrev.remove_coverpage	Remove Cover Page
colrev.remove_last_page	Remove Last Page

The colrev pdf-prep-man operation provides an interactive convenience function for PDFs that cannot be prepared automatically, with records transitioning from pdf_needs_manual_preparation to pdf_prepared.
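The record status transitions of the two operations can be summarized in a small model. This is only an illustrative sketch built from the status names in the text; PDFState, TRANSITIONS, and is_allowed are hypothetical names and do not reflect CoLRev's internal API.

```python
from enum import Enum


class PDFState(Enum):
    PDF_IMPORTED = "pdf_imported"
    PDF_NEEDS_MANUAL_PREPARATION = "pdf_needs_manual_preparation"
    PDF_PREPARED = "pdf_prepared"


# Allowed source -> target transitions per operation, as described in the text.
TRANSITIONS = {
    "pdf-prep": {
        PDFState.PDF_IMPORTED: {
            PDFState.PDF_PREPARED,
            PDFState.PDF_NEEDS_MANUAL_PREPARATION,
        },
    },
    "pdf-prep-man": {
        PDFState.PDF_NEEDS_MANUAL_PREPARATION: {PDFState.PDF_PREPARED},
    },
}


def is_allowed(operation: str, source: PDFState, target: PDFState) -> bool:
    """Return True if the operation may move a record from source to target."""
    return target in TRANSITIONS.get(operation, {}).get(source, set())


if __name__ == "__main__":
    print(is_allowed("pdf-prep", PDFState.PDF_IMPORTED, PDFState.PDF_PREPARED))
    print(is_allowed("pdf-prep-man", PDFState.PDF_IMPORTED, PDFState.PDF_PREPARED))
```

In this model, a record that pdf-prep cannot prepare automatically lands in pdf_needs_manual_preparation, and only pdf-prep-man moves it on to pdf_prepared.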

colrev pdf-prep-man [options]

The following options for pdf-prep-man are available:

Identifier	Description	Status
colrev.colrev_cli_pdf_prep_man	Prep PDFs manually (CLI)