Step 4: PDF retrievalΒΆ
The step of PDF retrieval refers to the activities to acquire PDF documents (or documents in other formats) as well as to ascertain or improve their quality. It should ensure that PDF documents correspond to their associated metadata (no mismatches), that they are machine readable (OCR and semantically annotated), and that unnecessary materials (such as cover pages) are removed.
PDFs are stored in the data/pdfs
directory.
Per default, PDF documents are not versioned by git to ensure that CoLRev repositories can be published without violating copyright restrictions.
It is recommended to share PDFs through file synchronization clients and to create a symlink pointing from the colrev-repository/data/pdfs
to the respective directory used for file sharing.
PDF retrieval is a high-level operation consisting of the following operations:
The
pdf-get
operation, which retrieves PDFs from sources like the local_index or unpaywall. It may refer users topdf-get-man
as the manual fallback when the record state is set topdf_needs_manual_retrieval
.The
pdf-prep
operation, which refers to the preparation of PDF documents. It may refer users topdf-prep-man
as the manual fallback when the record state is set topdf_needs_manual_preparation
.
To start the pdfs
operation, run:
colrev pdfs