colrev pdf-prep
In the colrev pdf-prep operation, records transition from pdf_imported to pdf_prepared or pdf_needs_manual_preparation.
Depending on the settings, this operation may involve any of the following:
- Check whether the PDF is machine readable and apply OCR if necessary
- Identify and remove additional pages and decorations (which may interfere with machine learning tools)
- Validate whether the PDF matches the record metadata and whether the PDF is complete (i.e., its page count matches the number of pages in the metadata)
- Create unique PDF identifiers (PDF hashes) that can be used for retrieval and validation (e.g., in crowdsourcing)
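The completeness check mentioned in the list above can be illustrated with a short sketch. This is not CoLRev's implementation: the function names `expected_page_count` and `pdf_is_complete` are hypothetical, and the sketch assumes that the record's pages field holds a numeric range such as `123--145` and that the PDF's page count has already been obtained by other means.

```python
import re
from typing import Optional


def expected_page_count(pages_field: str) -> Optional[int]:
    """Return the page count implied by a range like '123--145', else None.

    Hypothetical helper: assumes a numeric 'first--last' (or 'first-last') range.
    """
    match = re.fullmatch(r"\s*(\d+)\s*-{1,2}\s*(\d+)\s*", pages_field)
    if match is None:
        return None  # e.g., article numbers or roman numerals: cannot decide
    first, last = int(match.group(1)), int(match.group(2))
    if last < first:
        return None  # implausible range, treat as undecidable
    return last - first + 1


def pdf_is_complete(pages_field: str, pdf_page_count: int) -> Optional[bool]:
    """Compare the PDF's page count with the count implied by the metadata.

    Returns None when the metadata does not allow a decision.
    """
    expected = expected_page_count(pages_field)
    if expected is None:
        return None
    return pdf_page_count == expected
```

Under these assumptions, a PDF with one page more than expected (for example, because of a cover page) would fail the check, which is one reason to remove additional pages before validating.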
By default, CoLRev keeps a backup of PDFs that are changed by the pdf-prep operation. To change this behavior, modify the keep_backup_of_pdfs option in the pdf_prep settings.
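As a rough illustration, the option could be toggled programmatically as sketched below. The sketch assumes that the project settings are stored as JSON with a `pdf_prep` section containing `keep_backup_of_pdfs`; the storage format and the helper name `set_keep_backup` are assumptions, not confirmed by this page.

```python
import json


def set_keep_backup(settings_text: str, keep: bool) -> str:
    """Return the settings JSON with pdf_prep.keep_backup_of_pdfs set to `keep`.

    Assumption: settings are a JSON object with a 'pdf_prep' section.
    All other keys are left untouched.
    """
    settings = json.loads(settings_text)
    settings.setdefault("pdf_prep", {})["keep_backup_of_pdfs"] = keep
    return json.dumps(settings, indent=4)


# Hypothetical, minimal settings content used only for illustration.
example = '{"pdf_prep": {"keep_backup_of_pdfs": true}, "other": 1}'
updated = set_keep_backup(example, keep=False)
```

Editing the settings file by hand achieves the same result; the function only makes the intended change explicit.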
colrev pdf-prep [options]
The following options for pdf-prep are available:
| Identifier | Description | Status |
|---|---|---|
| colrev.grobid_tei | GROBID TEI (instructions) | |
| colrev.ocrmypdf | OCRMyPDF (instructions) | |
| colrev.remove_coverpage | Remove Cover Page (instructions) | |
| colrev.remove_last_page | Remove Last Page (instructions) | |
The colrev pdf-prep-man operation provides an interactive convenience function for PDFs that cannot be prepared automatically, with records transitioning from pdf_needs_manual_preparation to pdf_prepared.
colrev pdf-prep-man [options]
The following options for pdf-prep-man are available:
| Identifier | Description | Status |
|---|---|---|
| colrev.colrev_cli_pdf_prep_man | Prep PDFs manually (CLI) (instructions) | |
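The record state transitions described for pdf-prep and pdf-prep-man can be summarized in a small model. This is only a sketch for orientation: the state and operation names come from the text above, while the `PdfState` enum, the `TRANSITIONS` table, and `is_allowed` are hypothetical and do not reflect CoLRev's internal data structures.

```python
from enum import Enum


class PdfState(Enum):
    PDF_IMPORTED = "pdf_imported"
    PDF_NEEDS_MANUAL_PREPARATION = "pdf_needs_manual_preparation"
    PDF_PREPARED = "pdf_prepared"


# Allowed transitions per operation, as described in the text.
TRANSITIONS = {
    "pdf-prep": {
        PdfState.PDF_IMPORTED: {
            PdfState.PDF_PREPARED,
            PdfState.PDF_NEEDS_MANUAL_PREPARATION,
        },
    },
    "pdf-prep-man": {
        PdfState.PDF_NEEDS_MANUAL_PREPARATION: {PdfState.PDF_PREPARED},
    },
}


def is_allowed(operation: str, source: PdfState, target: PdfState) -> bool:
    """Check whether `operation` may move a record from `source` to `target`."""
    return target in TRANSITIONS.get(operation, {}).get(source, set())
```

In this model, pdf-prep-man only picks up records that pdf-prep could not prepare automatically, which mirrors the division of labor between the two operations.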