Turning a (possibly password-protected) PDF into OCR’d text on Linux, OSX, and Windows.
1. Make sure you have ImageMagick, tesseract, and pdftk installed. (ImageMagick hands PDF rendering off to Ghostscript, so that needs to be installed as well.)
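A quick way to confirm the tools are on your PATH before going further is sketched below. The function name missing_tools is made up for this example, and the command names (convert, tesseract, pdftk, seq) are the ones the later steps call. On some installs ImageMagick's command is named magick rather than convert, so treat the list as an assumption to adjust.

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is NOT on the PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return $missing
}

# Check the commands used in this guide (seq is used by the script in step 3).
missing_tools convert tesseract pdftk seq || echo "Install the tools listed above before continuing."
```

If nothing is printed, everything was found.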
2. If the PDF is password-protected (and you have the password!), run:
$ pdftk InputFileName.pdf output OutputFileName.pdf do_ask
(The “do_ask” option makes pdftk prompt for the owner’s password so it can unlock the file.)
The resulting OutputFileName.pdf has the password removed; its only purpose is to feed into the next step.
In this case, I used:
$ pdftk DeadPhilosophersCafe_publisher_file_DND.pdf output dpc.pdf
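If you would rather not answer a prompt (say, when unlocking several files in a row), pdftk also accepts the password on the command line through its input_pw keyword. A minimal sketch, with the function name unlock_pdf and the placeholder password invented for illustration. Bear in mind that a password typed this way can end up in your shell history.

```shell
#!/bin/sh
# Hypothetical wrapper: unlock INPUT into OUTPUT, passing the password
# with pdftk's input_pw keyword instead of answering a prompt.
unlock_pdf() {
    input=$1
    output=$2
    password=$3
    pdftk "$input" input_pw "$password" output "$output"
}

# Example call (placeholder password):
# unlock_pdf InputFileName.pdf OutputFileName.pdf 'the-password'
```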
3. Save the following as “pdf-ocr.sh” in the same directory as the pdf:
#!/bin/sh
STARTPAGE=5 # set to pagenumber of the first page of PDF you wish to convert
ENDPAGE=176 # set to pagenumber of the last page of PDF you wish to convert
SOURCE=dpc.pdf # set to the file name of the PDF
OUTPUT=DeadPhilosophersCafe_publisher_file_DND.txt # set to the final output file
RESOLUTION=600 # set to the resolution the scanner used (the higher, the better)
touch $OUTPUT
for i in `seq $STARTPAGE $ENDPAGE`; do
    convert -monochrome -density $RESOLUTION $SOURCE\[$(($i - 1 ))\] page.tif
    echo processing page $i
    tesseract page.tif tempoutput
    cat tempoutput.txt >> $OUTPUT
done
4. Edit the variables at the top of the script (STARTPAGE, ENDPAGE, SOURCE, OUTPUT, RESOLUTION) to match your PDF.
5. Run it:
$ sh pdf-ocr.sh
It converts the PDF to a .tif image one page at a time, OCRs that image, and appends the OCR’d text to the output text file.
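The one piece of the loop that tends to trip people up is the page arithmetic: ImageMagick counts the pages of a PDF from zero, so page 5 of the document is selected as dpc.pdf[4]. The sketch below isolates just that mapping so you can check a page range before committing to a long OCR run. The function names page_selector and list_selectors are invented for this example, and it uses a plain while loop in place of seq.

```shell
#!/bin/sh
# Print the ImageMagick input selector for a 1-based page number.
# ImageMagick numbers PDF pages from 0, hence the "- 1".
page_selector() {
    pdf=$1
    page=$2
    echo "${pdf}[$((page - 1))]"
}

# Print the selector for every page from FIRST to LAST (inclusive).
list_selectors() {
    pdf=$1
    first=$2
    last=$3
    i=$first
    while [ "$i" -le "$last" ]; do
        page_selector "$pdf" "$i"
        i=$((i + 1))
    done
}

list_selectors dpc.pdf 5 7
# prints:
# dpc.pdf[4]
# dpc.pdf[5]
# dpc.pdf[6]
```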
6. Delete the un-password-protected publisher’s pdf
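The script from step 3 also leaves its scratch files (page.tif and tempoutput.txt) behind, so it can be convenient to clear those out at the same time. A small sketch, assuming the file names used above. The function name cleanup_intermediates is made up for this example, and it only removes files that actually exist.

```shell
#!/bin/sh
# Hypothetical helper: remove each named file if it exists,
# reporting what was deleted.
cleanup_intermediates() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            rm -f -- "$f"
            echo "removed $f"
        fi
    done
}

# Example call, using the file names from the steps above:
# cleanup_intermediates dpc.pdf page.tif tempoutput.txt
```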
7. Hand off the output (text) file to your OCR clean-up crew.
From David Christensen, February 2017.