Free software solutions for Linux that can run OCR on PDF documents and convert them to searchable PDF.
By searchable PDF, we mean a scanned PDF document that contains invisible OCR'ed text over the scanned image. The text should have the right size so that it sits over the text portions of the image: every word in the text layer should overlay exactly the portion of the image that contains that word.
Here are two software solutions that are able to create searchable PDFs. One is a native Linux OCR engine and the other is a free PDF reader with OCR capabilities running in Wine.
1. Tesseract & PDFsandwich
Tesseract is the first and currently the only OCR engine for Linux that supports direct searchable PDF output (starting from version 3.03). The only problem is that it accepts only image input, so you can't feed it a PDF document. You can install it on APT-based Linux distributions (like Ubuntu) using the following command:
sudo apt-get install tesseract-ocr tesseract-ocr-all
If you have a bunch of images produced by a scanner, you can write a simple script that will OCR each image into a single-page searchable PDF and then join the pages into a single PDF document:
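#!/bin/bash
# Not named LANG: that variable already holds the shell's locale.
OCR_LANG=eng  # replace with your language code
shopt -s nullglob
for f in *.tif; do
    echo "Running OCR on $f"
    # Tesseract 4 and later spell this option --psm
    tesseract -psm 1 -l "$OCR_LANG" "$f" "$f" pdf
done
echo "Joining files into single PDF..."
pdftk *.pdf cat output ../outdocument.pdf
rm -f -- *.pdf
This script takes all .tif files from the directory where it is run and processes them with tesseract. To use it, you also need pdftk installed. Copy the above snippet into a new file ocr.sh, make it executable (chmod +x ocr.sh), then place it in the folder with scanned images and run it. The joined document is written to the parent folder as outdocument.pdf, and every PDF in the working folder is deleted afterwards, so run it in a folder that contains only the scans.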
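Since the script above depends on both tesseract and pdftk, it can help to check for them before starting a long OCR run. The following is only a sketch: the function name missing_tools is made up for this example and is not part of either tool.

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Possible use near the top of ocr.sh:
#   missing=$(missing_tools tesseract pdftk)
#   [ -z "$missing" ] || { echo "Please install: $missing" >&2; exit 1; }
```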
Things get complicated if you already have a PDF document that you want to make searchable. To use tesseract, its pages must first be exported to images, and to do that you need to know the resolution of the scanned images. This can be a problem if you didn't scan the document yourself and have no idea what resolution was used.
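If the resolution is unknown, it can usually be estimated from the PDF itself. The sketch below assumes the poppler-utils package is installed (it provides pdfimages and pdfinfo, neither of which is mentioned in the original post) and that each scanned image fills its whole page. The helper name estimate_dpi is hypothetical. Recent poppler releases also print x-ppi and y-ppi columns in the pdfimages -list output, in which case the calculation is not needed.

```shell
# Estimate the scan resolution of a page.
#   $1: image width in pixels (the "width" column of: pdfimages -list input.pdf)
#   $2: page width in points  (the "Page size" line of: pdfinfo input.pdf)
# A PDF point is 1/72 inch, so DPI = pixels / (points / 72).
estimate_dpi() {
    awk -v px="$1" -v pt="$2" 'BEGIN { printf "%d\n", px * 72 / pt + 0.5 }'
}

# Example: a 2480-pixel-wide scan on an A4 page (595.276 pt wide)
estimate_dpi 2480 595.276    # prints 300
```

The result is rounded to the nearest integer; values close to 150, 200, 300 or 600 are typical scanner settings.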
In this situation, you can use the pdfsandwich script by Tobias Elze. Not only does it extract all pages from the PDF as images, it also pre-processes them for OCR using multiple threads. You can download the DEB package from the website and install it with GDebi. It's easy to use, but some command line arguments need attention:
- -nopreproc is useful when the PDF already contains processed images and you don't want any other processing. Note that by default, this script will convert your document to black and white! Using this option you avoid any kind of conversion.
- -resolution has a default value of 300 DPI. This is used when converting PDF pages to images and 300 is a good value. But if your document contains small text and you know/believe it may have been scanned at a higher DPI, specify it.
- -lang must always be specified if you need to OCR in a language other than English. This parameter is passed to tesseract. The availability of languages depends on the installed tesseract-ocr-<lang_code> packages.
pdfsandwich -lang eng input_document.pdf
The result will be input_document_ocr.pdf in the same folder as the initial document.
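To process a whole folder of PDFs, the command above can be wrapped in a loop. This is a minimal sketch that relies only on the -lang option and the _ocr.pdf naming described in this post; the function names ocr_output_name and ocr_all_pdfs are invented for the example.

```shell
# Name of the file pdfsandwich writes for a given input (input.pdf -> input_ocr.pdf).
ocr_output_name() {
    printf '%s\n' "${1%.pdf}_ocr.pdf"
}

# Run pdfsandwich on every PDF in the current directory that has no OCR'ed copy yet.
#   $1: tesseract language code (defaults to eng)
ocr_all_pdfs() {
    for f in *.pdf; do
        [ -e "$f" ] || continue                  # no PDFs at all: the glob stays literal
        case $f in *_ocr.pdf) continue ;; esac   # skip files that are already results
        [ -e "$(ocr_output_name "$f")" ] && continue
        pdfsandwich -lang "${1:-eng}" "$f"
    done
}
```

Running ocr_all_pdfs deu in a folder of German scans would then call pdfsandwich once per unprocessed file. Options such as -nopreproc or -resolution can be added to the pdfsandwich line as needed.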
2. PDF X-Change Viewer
This is a free PDF reader with many additional functions, provided by Tracker Software. It is a Windows-only application that runs in Wine. I tested the viewer in Wine 1.6, 1.7 and 1.8 and it worked great in all these versions, yet the OCR engine only worked with Wine 1.8, which is available from a PPA.
To install it in Linux, you must have Wine 1.8 installed (wine1.8:i386 package) and download the following files from Tracker Software:
- Portable PDF Viewer archive: Portable version (ZIP) | 8 MB
- Portable PDF Viewer OCR engine: Portable Version (OCR Lang Files) | 8 MB
- Additional OCR languages: choose a package that contains the language(s) you are interested in.
Extract the ZIP file by right clicking it and choosing Extract Here. You should get a folder PDFX_Vwr_Port. Extract the OCR Lang files archive and you will get an ocrdats folder. Put this folder in the PDFX_Vwr_Port folder. You can now start PDFXCview.exe with wine and you can OCR English, German, French and Spanish documents.
To add another language, unpack the additional languages installer with innoextract, for example:
innoextract OCRAdditionalLangsEU.exe
You will get two folders (code:SetAppFolder|inst and code:SetEditorFolder|inst) with identical content. A language pack consists of two files: <lang>.lng and <lang>_pxvocr.dat. You need to copy both files to the ocrdats folder. For example, to run OCR in Romanian, I copied rom.lng and ron_pxvocr.dat from one of those two folders.
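The copy step can be scripted as well. Note that the folder names produced by innoextract contain the characters : and |, so they must be quoted on the command line. The helper below is a sketch; its name install_ocr_lang is hypothetical, and it simply copies the files it is given after checking that they all exist.

```shell
# Copy language files into the viewer's ocrdats folder.
#   $1: folder produced by innoextract
#   $2: path of the ocrdats folder
#   remaining arguments: names of the files to copy
install_ocr_lang() (
    src=$1; dest=$2; shift 2
    for name in "$@"; do
        if [ ! -f "$src/$name" ]; then
            echo "missing: $src/$name" >&2
            exit 1          # leaves only the function's subshell
        fi
    done
    for name in "$@"; do
        cp -- "$src/$name" "$dest/"
    done
)

# Possible use, with the Romanian files mentioned above:
#   install_ocr_lang 'code:SetAppFolder|inst' PDFX_Vwr_Port/ocrdats rom.lng ron_pxvocr.dat
```

Nothing is copied unless every requested file is present, which avoids ending up with half a language pack.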
Screenshot: OCR in PDF X-Change Viewer
Notes: in Wine 1.6, PDF X-Change Viewer crashed when launching OCR (on clicking the OK button). In Wine 1.7 it crashed after reaching 99% OCR progress. In Wine 1.8 it works without issues.
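Given these differences, it may be worth checking the installed Wine version before trying OCR. The sketch below parses the string printed by wine --version. It treats every release from 1.8 onwards as acceptable, which is an assumption: only 1.6, 1.7 and 1.8 were tested for this post. The function name wine_version_ok is invented for the example.

```shell
# Succeed if a "wine --version" string denotes Wine 1.8 or later.
wine_version_ok() (
    ver=${1#wine-}              # "wine-1.8" -> "1.8"
    ver=${ver%% *}              # drop any packaging suffix after a space
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%.*}
    major=${major%%[!0-9]*}     # keep leading digits only ("8-rc1" -> "8")
    minor=${minor%%[!0-9]*}
    [ "${major:-0}" -gt 1 ] || { [ "${major:-0}" -eq 1 ] && [ "${minor:-0}" -ge 8 ]; }
)

# Possible use:
#   wine_version_ok "$(wine --version)" || echo "OCR may crash with this Wine version" >&2
```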
Thanks a lot for this post! I've been struggling to find a way to convert a scanned pdf into a searchable one, and this is something that finally works for me.
I desperately needed to convert an image-pdf file into a searchable one. Thank you very much!