PDF tweaking software alternatives in Linux: insert, remove, reorder pages, OCR, make searchable, edit metadata, attach files to documents.
Throughout this tutorial I will use only free software (most of it is probably open source too). I will avoid the command line as much as possible, because most new Linux users are afraid of it. I will, however, point out which backends perform certain operations.
If you have a bunch of scanned pages as images and you want to turn them into a PDF, I already discussed that in the post High quality scanning vs. small file size. For now I will focus on what you can do with an existing PDF.
The most used GUI software for this tutorial will be: PDF Mod, PDF-Shuffler, PDF Chain, PDFSAM and gImageReader. The CLI software will be: gs, pdftk, tesseract, pdfsandwich, pdfimages. In order to install them, run on Ubuntu:
sudo apt-get install pdfmod pdfshuffler pdfchain pdftk ghostscript poppler-utils tesseract-ocr gimagereader imagemagick cups-pdf
I recommend getting PDFSAM, pdfsandwich, BRISS and PDF Quench from their websites. You will also need Java for some of these.
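If you want to confirm that the command line backends really ended up on your PATH, a small check like the following may help. It is only a sketch; the helper name missing_tools is made up for this example and is not part of any package.

```shell
# Print the name of every listed command that is NOT found on the PATH.
# missing_tools is a hypothetical helper name used only in this sketch.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: check the CLI backends used in this tutorial.
# missing_tools gs pdftk tesseract pdfsandwich pdfimages convert
```

No output means everything in the list was found.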
1. Document
a. Concatenate
The easiest to use here is PDF Chain (pdftk is the backend). When you first launch it you will see the Concatenate tab. All you have to do is add the files you want. Note that upon adding the files it reads them to find out the total number of pages and any security restrictions (protected documents). The open dialog may appear frozen, but all you have to do is wait. When it is ready you will be presented with the list of files you selected. You can:
- un-tick the checkbox of a file if you don't want it in the final result anymore;
- drag and drop items to change their order;
- click on any value in the Page selection, Rotation and Password fields to modify or set it. Note that Page selection takes two values: a numeric range and an even/odd filter.
When you are ready, click Save As. Note that there is no progress indicator and this dialog seems to freeze too. Just let it finish and then look for the result.
Screenshot: PDF Chain concatenate files
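For readers who don't mind the terminal: since PDF Chain drives pdftk, the same concatenation can be sketched directly with it. The file names and the helper name concat_pdfs below are placeholders I made up for illustration.

```shell
# Join several PDFs, in the order given, into one output file.
# Usage: concat_pdfs OUTPUT.pdf INPUT1.pdf INPUT2.pdf ...
# concat_pdfs is a hypothetical helper; pdftk does the real work.
concat_pdfs() {
    dst=$1
    shift
    pdftk "$@" cat output "$dst"
}

# Page selection works through handles: pages 1-5 of the first file,
# followed by only the even pages of the second one.
# pdftk A=first.pdf B=second.pdf cat A1-5 B1-endeven output combined.pdf
```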
b. Split
Splitting means creating multiple PDF documents from a single one based on a criterion. It is not the same as extracting pages (described below in 2.a).
A basic form of splitting can be achieved with PDF Chain too. It calls it Burst and splits a document by each page, thus the result is a number of single page PDFs equal to the number of pages of the input document.
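The pdftk operation behind Burst can also be called by hand. A minimal sketch, assuming you want files named page_0001.pdf and so on (the pattern and the helper name burst_pdf are my own choices):

```shell
# Split a document into one PDF per page (pdftk calls this "burst").
# The %04d in the output pattern becomes the zero-padded page number.
# Usage: burst_pdf DOCUMENT.pdf [PATTERN]
burst_pdf() {
    pdftk "$1" burst output "${2:-page_%04d.pdf}"
}

# burst_pdf document.pdf    # writes page_0001.pdf, page_0002.pdf, ...
```

pdftk usually also leaves a doc_data.txt report next to the pages; you can delete it if you don't need it.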
PDFSAM is better suited for this task. The version in some Linux repos (Ubuntu) is very old, so I suggest downloading the current version from its website. It requires Java to be installed on your system. You may also find it useful in other situations. For now, just use the Split plugin. After you've configured it click Run.
Screenshot: PDFSAM split options
c. Attach file
This is another one for PDF Chain (pdftk). The interface is very intuitive, yet the apparent freeze problem persists (when loading a document, the application analyzes its properties, and if the document is large this may take a while). You choose a document, add the desired files as attachments, choose whether to attach them to a page or to the document, and save. The attached files can be viewed with Evince.
Screenshot: PDF Chain attach a file to document
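The two choices in the dialog (attach to document, attach to page) correspond to two forms of the pdftk attach_files operation. Here is a rough sketch; the helper names and file names are invented for the example.

```shell
# Attach files to the document as a whole.
# Usage: attach_to_document IN.pdf OUT.pdf FILE...
attach_to_document() {
    src=$1 dst=$2
    shift 2
    pdftk "$src" attach_files "$@" output "$dst"
}

# Attach files to a single page instead (PAGE is a page number).
# Usage: attach_to_page IN.pdf OUT.pdf PAGE FILE...
attach_to_page() {
    src=$1 dst=$2 page=$3
    shift 3
    pdftk "$src" attach_files "$@" to_page "$page" output "$dst"
}

# Attachments can be pulled back out with unpack_files:
# pdftk with_attachments.pdf unpack_files output ./attachments/
```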
d. Create (virtual printer)
If you need a virtual PDF printer, all you have to do is install the cups-pdf package with sudo apt-get install cups-pdf. Now, from any application, just print using the PDF printer.
Screenshot: Virtual printer cups-pdf
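The virtual printer can be used from a terminal as well, through the standard CUPS commands. This sketch assumes the queue is named PDF, which is what I would expect on Ubuntu but may differ on your system; print_to_pdf is a made-up helper name.

```shell
# List the configured printers to find the name of the cups-pdf queue.
# lpstat -p

# Send a file to the virtual printer.
# The default queue name "PDF" is an assumption; pass the real name
# as the second argument if yours differs.
# Usage: print_to_pdf FILE [QUEUE]
print_to_pdf() {
    lp -d "${2:-PDF}" "$1"
}
```

On Ubuntu the resulting file typically lands in a PDF folder in your home directory, but check your cups-pdf configuration if it does not show up there.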
e. Edit metadata
PDF Mod can do that. Load a document and click the Properties icon (or use the File - Properties menu). Save the document after you finish.
Screenshot: Metadata editing with PDF Mod
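If you prefer a text file over a dialog, pdftk offers an alternative route that sidesteps PDF Mod entirely: dump the metadata, edit it, write it back. The helper names below are hypothetical.

```shell
# Step 1: dump the current metadata into a text file you can edit.
# Usage: dump_metadata IN.pdf META.txt
dump_metadata() {
    pdftk "$1" dump_data output "$2"
}

# Step 2: after editing the InfoKey / InfoValue pairs in META.txt
# (Title, Author and so on), write them into a new document.
# Usage: apply_metadata IN.pdf META.txt OUT.pdf
apply_metadata() {
    pdftk "$1" update_info "$2" output "$3"
}
```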
f. Bookmarks
Again, PDF Mod. It can read, add, edit and delete bookmarks.
Screenshot: PDF Mod bookmarks editor
2. Pages
a. Extract
You have probably seen this option in PDFSAM. You probably know you can do it with PDF Chain if you burst and then concatenate into a new document. But I will use PDF Mod. Load a document, wait for the thumbnails to be generated, then Ctrl+Click or Shift+Click to select pages. Right click the selection and choose Extract (x pages).
Screenshot: PDF Mod page extraction
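For comparison, extraction on the command line does not need the burst and concatenate detour: pdftk can copy just the pages you name. A sketch with invented file names and a hypothetical helper:

```shell
# Copy only the listed pages or ranges into a new document.
# Usage: extract_pages IN.pdf OUT.pdf RANGE...
extract_pages() {
    src=$1 dst=$2
    shift 2
    pdftk "$src" cat "$@" output "$dst"
}

# extract_pages document.pdf excerpt.pdf 3-5 9    # pages 3, 4, 5 and 9
```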
b. Delete
I'm sure you know how it is done with PDF Mod. You can see it in the above screenshot (Remove x pages). But I will present PDF Shuffler. The interface looks similar to PDF Mod, yet there are some differences in features, as you will see soon. Don't forget that both of these and PDFSAM can reorder pages visually. Just drag and drop the thumbnails as you wish.
Screenshot: PDF Shuffler delete pages
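pdftk has no delete operation as such; deleting means keeping everything else. The sketch below works out the ranges to keep when a block of consecutive pages is removed. Both helper names are made up, and the total page count has to be supplied by you.

```shell
# Print the page ranges that remain after deleting pages FROM..TO
# from a document with TOTAL pages.
# Usage: ranges_without FROM TO TOTAL
ranges_without() {
    from=$1 to=$2 total=$3
    ranges=""
    if [ "$from" -gt 1 ]; then
        ranges="1-$((from - 1))"
    fi
    if [ "$to" -lt "$total" ]; then
        ranges="$ranges $((to + 1))-$total"
    fi
    # Unquoted on purpose: this collapses a stray leading space.
    echo $ranges
}

# Delete pages FROM..TO of IN.pdf and save the rest as OUT.pdf.
# Usage: delete_pages IN.pdf OUT.pdf FROM TO TOTAL
delete_pages() {
    # The command substitution is unquoted so each range is its own argument.
    pdftk "$1" cat $(ranges_without "$3" "$4" "$5") output "$2"
}

# delete_pages document.pdf trimmed.pdf 5 5 10    # drop page 5 of 10
```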
c. Rotate
Nothing new to show here. Just use whatever you want: PDFSAM, PDF Mod, PDF Shuffler and even PDF Chain when concatenating documents.
d. Crop
A difference between PDF Mod and PDF Shuffler is that only one of them can crop pages. Yes, it is PDF Shuffler, and unfortunately it accepts only percentages as input. Right click a selection of pages and select Crop to see the dialog.
Better tools for this job are BRISS and PDF Quench. Both can do visual cropping and are easy to use. BRISS is Java based. If you go for PDF Quench, you need to install python-pygoocanvas python-pypdf2 python-poppler.
e. Insert
If you want to insert pages in a document you must have them as PDF. PDF Mod does that but always puts the newly inserted pages at the beginning, so you have to reorder them. PDF Shuffler does it too, and puts them at the end. Not nice, but currently these are the options.
A combination of Delete and Insert can be used to replace pages in a document.
f. Header / footer / watermark / background
PDF Chain (via pdftk) can do that. First you must prepare the document containing the layer. You can do that with whatever software you want (LibreOffice Writer, GIMP, Inkscape, Scribus - anything that can export a page of adjustable size to PDF). Make sure the result is just one page and that Use multiple pages is unchecked. For predictable results, the layer page should be the same size as the pages of the source document (the one you want to add the layer to).
Use the Background/Stamp tab to select the source document and the document containing the layer to be added.
Screenshot: Adding layers to document with PDF Chain
What is the difference between Background and Stamp? Stamp is always the topmost layer. Background is behind all contents. Use Stamp whenever you must be sure it is always visible.
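The two modes map onto two pdftk operations with the same names. A minimal sketch, with hypothetical helper names and a one-page layer file:

```shell
# Put the one-page LAYER.pdf behind the contents of every page.
# Usage: add_background IN.pdf LAYER.pdf OUT.pdf
add_background() {
    pdftk "$1" background "$2" output "$3"
}

# Put the one-page LAYER.pdf on top of the contents of every page.
# Usage: add_stamp IN.pdf LAYER.pdf OUT.pdf
add_stamp() {
    pdftk "$1" stamp "$2" output "$3"
}

# pdftk also has multibackground and multistamp, which pair page N of
# the layer file with page N of the document. That is most likely what
# the "Use multiple pages" checkbox switches on (my assumption).
```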
3. OCR
a. Get text
If you only need to copy unformatted text from a non-searchable PDF document, gImageReader can help you. It is a GUI for tesseract. Simply load a document (or even an image), then choose a page (1), increase the resolution if needed (2), select the language (3), select the content (4) or let it be detected automatically (5), start OCR (6), remove unnecessary line breaks (7) and save the result after you check it.
Screenshot: gImageReader main window
b. Make searchable
That is a bit more difficult and there is no GUI for it. Starting with version 3.03, tesseract can output searchable PDF. The drawback is that it only accepts image input (including multipage TIFF). The command is simple (replace <lang> with the language code):
tesseract -l <lang> in.tiff result pdf
Yet you will have to write a script that saves the document as a bunch of images, runs OCR on each of them to produce a PDF, then concatenates the PDF files into the final document.
I talked about searchable PDFs in detail at How to OCR to searchable PDF in Linux.
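To give an idea of what such a script could look like, here is a rough sketch that chains gs, tesseract and pdftk. The function name ocr_pdf, the 300 DPI resolution and the grayscale TIFF format are my assumptions for the example, not fixed requirements.

```shell
# Make a searchable copy of a PDF: render pages, OCR them, join the results.
# Usage: ocr_pdf INPUT.pdf OUTPUT.pdf LANG    (LANG e.g. eng)
ocr_pdf() {
    src=$1 dst=$2 lang=$3
    work=$(mktemp -d) || return 1
    rc=0

    # 1. Render every page to an 8-bit grayscale TIFF at 300 DPI.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300x300 \
       -sOutputFile="$work/page%04d.tif" "$src" || rc=1

    # 2. OCR each page image into a one-page searchable PDF.
    if [ "$rc" -eq 0 ]; then
        for img in "$work"/page*.tif; do
            tesseract -l "$lang" "$img" "${img%.tif}" pdf || { rc=1; break; }
        done
    fi

    # 3. Join the per-page PDFs; zero-padded names keep them in order.
    if [ "$rc" -eq 0 ]; then
        pdftk "$work"/page*.pdf cat output "$dst" || rc=1
    fi

    rm -rf "$work"
    return "$rc"
}

# ocr_pdf scanned.pdf searchable.pdf eng
```

Note that rendering to images and back means the output pages are images with a text layer, whatever the input looked like.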
A better option is pdfsandwich. It automatically converts color to grayscale (yet you can use it for color pages with the right arguments). The results are pretty good. It calls tesseract for OCR. More than that, it supports PDF input. I recommend it over simply calling tesseract for every image.
pdfsandwich -lang <lang> -rgb input_document.pdf
4. Other formats
a. Save as image(s)
If you need a page from a PDF document as an image (or a few pages) you can simply open the document with GIMP. It will prompt you to select the pages to import and you can set the resolution.
Screenshot: Import PDF pages as images in GIMP
To convert all the pages at once from the command line, use Ghostscript (gs):
gs -sDEVICE=pnggray -r150x150 -sOutputFile="/path/to/image%04d.png" -dNOPAUSE -dBATCH "/path/to/document.pdf"
Let me explain some arguments:
- -rNxN sets resolution where N is the DPI value (e.g. -r300x300 or -r72x72)
- -sDEVICE sets output format and colorspace as follows:
  - -sDEVICE=png16m stands for 24-bit color PNG;
  - -sDEVICE=pnggray stands for 8-bit grayscale PNG;
  - -sDEVICE=pngmono stands for 1-bit monochrome PNG;
  - -sDEVICE=tiff24nc stands for 24-bit color TIFF;
  - -sDEVICE=tiffgray stands for 8-bit grayscale TIFF;
  - -sDEVICE=tiffg4 stands for 1-bit monochrome TIFF;
  - -sDEVICE=jpeg stands for 24-bit color JPEG;
  - -sDEVICE=jpeggray stands for 8-bit grayscale JPEG;
  - -sDEVICE=bmp16m stands for 24-bit color bitmap;
  - -sDEVICE=bmpgray stands for 8-bit grayscale bitmap;
  - -sDEVICE=bmpmono stands for 1-bit monochrome bitmap;
  - -sDEVICE=psdrgb stands for 24-bit color Photoshop document.
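If you don't want to remember the device names, the list above can be wrapped in a small lookup. This is just a convenience sketch: gs_device and pdf_to_images are names I invented, the output file name pattern is arbitrary, and psdrgb is left out for brevity.

```shell
# Map a format (png, tiff, jpeg, bmp) and a color mode (color, gray, mono)
# to the Ghostscript device name from the list above.
# Usage: gs_device FORMAT MODE
gs_device() {
    case "$1-$2" in
        png-color)  echo png16m ;;
        png-gray)   echo pnggray ;;
        png-mono)   echo pngmono ;;
        tiff-color) echo tiff24nc ;;
        tiff-gray)  echo tiffgray ;;
        tiff-mono)  echo tiffg4 ;;
        jpeg-color) echo jpeg ;;
        jpeg-gray)  echo jpeggray ;;
        bmp-color)  echo bmp16m ;;
        bmp-gray)   echo bmpgray ;;
        bmp-mono)   echo bmpmono ;;
        *) echo "unsupported combination: $1 $2" >&2; return 1 ;;
    esac
}

# Render every page of a PDF into the current folder.
# Usage: pdf_to_images DOCUMENT.pdf FORMAT MODE DPI
pdf_to_images() {
    device=$(gs_device "$2" "$3") || return 1
    gs -sDEVICE="$device" -r"${4}x${4}" -sOutputFile="page%04d.$2" \
       -dNOPAUSE -dBATCH "$1"
}

# pdf_to_images document.pdf png gray 150
```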
b. Export images
PDF Mod can export images embedded in a PDF document, but I believe it only handles JPEG. The complete solution is the pdfimages command line tool. It can export images as they are stored in the PDF or convert them to PNG or TIFF. Here is an example:
pdfimages -p -all /path/to/document.pdf ./
It saves the images in the folder from which it is called! The -all argument tells it to export images as they are; if you want them as PNG use -png instead, and if you want them as TIFF use -tiff instead of -all.
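Two small variations may be handy: listing the images before extracting anything, and sending the output to a separate folder so it does not clutter the current one. The helper name export_images and the img file name prefix are my own choices.

```shell
# Show a table of the embedded images without extracting anything.
# pdfimages -list /path/to/document.pdf

# Extract the images in their original format into TARGET_DIR.
# The last pdfimages argument is a file name prefix, so the files
# end up as TARGET_DIR/img-<page>-<number>.<extension>.
# Usage: export_images DOCUMENT.pdf TARGET_DIR
export_images() {
    mkdir -p "$2" || return 1
    pdfimages -p -all "$1" "$2/img"
}

# export_images /path/to/document.pdf ./extracted
```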
c. Images into PDF
There is no direct way of doing this. You can do it from the command line, using convert to turn each image into a PDF page and then concatenating the pages.
A faster way is to use the leptonica library. It can read TIFF, PNG and JPEG. How to write a small utility named joinpdf and use it for that purpose is described here, at the bottom of the post. If you can't or don't know how to compile software from source, user dingodog from the DIY Book Scanner forum compiled it for you. Get it here.
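The convert route could be scripted roughly as follows, with pdftk doing the concatenation. This is a sketch, not the joinpdf utility mentioned above; images_to_pdf is a hypothetical name. On some distributions ImageMagick's security policy may refuse to write PDF files, in which case convert will report an error.

```shell
# Turn each image into a one-page PDF, then join the pages in order.
# Usage: images_to_pdf OUTPUT.pdf IMAGE...
images_to_pdf() {
    dst=$1
    shift
    work=$(mktemp -d) || return 1
    n=0
    rc=0
    for img in "$@"; do
        n=$((n + 1))
        # Zero-padded page names keep the original image order.
        convert "$img" "$work/$(printf 'page%04d.pdf' "$n")" || { rc=1; break; }
    done
    if [ "$rc" -eq 0 ]; then
        pdftk "$work"/page*.pdf cat output "$dst" || rc=1
    fi
    rm -rf "$work"
    return "$rc"
}

# images_to_pdf scans.pdf scan1.png scan2.png scan3.png
```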
Are you trying to do something else that is not shown here? Your feedback is welcome. Tell us what it is and we'll try to find an answer.