PDF to OCR'd Text

Turning a (possibly password-protected) PDF into OCR’d text on Linux, OS X, and Windows.

1. Make sure you have ImageMagick (which provides the convert command), tesseract (the OCR engine), and pdftk (used to remove the password) installed.
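
One way to check step 1 before going further is sketched below. It assumes ImageMagick is installed under its traditional command name, convert (some newer installs use magick instead). The name missing_tools is a hypothetical helper, not part of any of these packages.

```shell
#!/bin/sh
# Print the name of every command given as an argument that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# convert comes from ImageMagick; tesseract and pdftk share their package names.
missing=$(missing_tools convert tesseract pdftk)
if [ -n "$missing" ]; then
  echo "Not installed or not on PATH:" $missing
else
  echo "All three tools found."
fi
```

If anything is reported missing, install it with your system's package manager before continuing.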

2. If the PDF is password-protected (and you have the password!), run:

 $ pdftk InputFileName.pdf output OutputFileName.pdf do_ask

The “do_ask” option lets pdftk prompt for the owner’s password when it needs one to unlock the file.

The resulting OutputFileName.pdf has the password removed. It exists only to feed into the next step.

In this case, I used:

 $ pdftk DeadPhilosophersCafe_publisher_file_DND.pdf output dpc.pdf

3. Save the following as “pdf-ocr.sh” in the same directory as the PDF:

  #!/bin/sh
  STARTPAGE=5 # page number of the first page of the PDF to convert
  ENDPAGE=176 # page number of the last page of the PDF to convert
  SOURCE=dpc.pdf # file name of the PDF
  OUTPUT=DeadPhilosophersCafe_publisher_file_DND.txt # final output file
  RESOLUTION=600 # resolution the scanner used (the higher, the better)
  touch "$OUTPUT"
  for i in $(seq "$STARTPAGE" "$ENDPAGE"); do
    echo "processing page $i"
    # ImageMagick counts pages from 0, so page $i is index $i - 1.
    # -density goes before the input file so it applies when the page is rasterized.
    convert -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" -monochrome page.tif
    # tesseract adds .txt to the output base name, so this writes tempoutput.txt
    tesseract page.tif tempoutput
    cat tempoutput.txt >> "$OUTPUT"
  done

4. Edit the variables at the top of the script (STARTPAGE, ENDPAGE, SOURCE, OUTPUT, RESOLUTION) to match your PDF.
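
To find a value for ENDPAGE without opening the PDF, one option is pdftk's dump_data operation, which prints a NumberOfPages line among its metadata. The sketch below assumes that output format. The name page_count is a hypothetical helper, and the sample input is made up for illustration.

```shell
#!/bin/sh
# Read pdftk dump_data output on stdin and print the page count.
page_count() {
  awk '/^NumberOfPages:/ { print $2 }'
}

# Intended use (requires pdftk and the unlocked PDF):
#   pdftk dpc.pdf dump_data | page_count

# Stand-in input so the helper can be tried without a PDF:
printf 'InfoKey: Title\nNumberOfPages: 176\n' | page_count
```

The number printed is the total page count. STARTPAGE and ENDPAGE can still be narrowed to skip front or back matter.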

5. Run it:

 $ sh pdf-ocr.sh

The script converts one page of the PDF at a time to a .tif image, OCRs that image, and appends the OCR’d text to the output text file.
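
The script subtracts 1 from each page number because ImageMagick selects pages of a multi-page file by a zero-based index in square brackets. A minimal sketch of that mapping follows. The name page_spec is a hypothetical helper, and dpc.pdf is just the example file name from step 2.

```shell
#!/bin/sh
# Build the "file[index]" argument ImageMagick expects for a 1-based page number.
page_spec() {
  # $1 = PDF file name, $2 = 1-based page number
  printf '%s[%d]\n' "$1" "$(( $2 - 1 ))"
}

# Show the argument that would be passed to convert for a few pages.
for page in 5 6 7; do
  echo "page $page -> $(page_spec dpc.pdf "$page")"
done
```

Printed page 5 of the PDF is therefore read as dpc.pdf[4].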

6. Delete the unlocked copy of the publisher’s PDF created in step 2.

7. Hand off the output (text) file to your OCR clean-up crew.

From David Christensen, February 2017.

public/nnels/etext/pdf-to-ocr-text.txt · Last modified: 2018/08/16 22:06 (external edit)