colrev.ocrmypdf

Version Version: 0.1.0

Maintainer Maintainer: Gerit Wagner

Licencse License: MIT

Git repository Repository: CoLRev-Environment/colrev

Endpoint

Status

Add

pdf_prep

MATURING

colrev pdf-prep --add colrev.ocrmypdf

Summary

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched. Its main features are:

  • Generates a searchable PDF/A file from a regular PDF

  • Places OCR text accurately below the image to ease copy / paste

  • Keeps the exact resolution of the original embedded images

  • When possible, inserts OCR information as a “lossless” operation without disrupting any other content

  • Optimizes PDF images, often producing files smaller than the input file

  • If requested, deskews and/or cleans the image before performing OCR

  • Validates input and output files

  • Distributes work across all available CPU cores

  • Uses Tesseract OCR engine to recognize more than 100 languages

  • Keeps your private data private.

  • Scales properly to handle files with thousands of pages.

  • Battle-tested on millions of PDFs.

Running OCRmyPDF requires Docker.

pdf-prep

OCRmyPDF is contained in the default setup. To add it, run

colrev pdf-prep -a colrev.ocrmypddf