Horribly obscure character recognition



So you have a scanned document and you want to produce a searchable PDF from the images. In my case, I want to digitize some of my books. There are other tutorials out there, but none of them worked for me. Here's what I did; maybe it helps you.

First, get a Debian box and install the packages tesseract (this is the OCR software), xsltproc (dark magic), exactimage (for hocr2pdf) and of course pdftk. Then take your scan and use scantailor to split it into several neat black-and-white .tif files, one for each page.

Now, create a file called fix-hocr.xsl and put this in it:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- use on hocr file to fix for hocr2pdf 0.8.9 textbox placement -->
<xsl:template match="/html">
   <xsl:text>&#13;</xsl:text>
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
</xsl:template>
<xsl:template match="node()|@*">
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
</xsl:template>
<xsl:template match="span[@class='ocr_line']">
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
   <xsl:element name="br">&#13;</xsl:element>
</xsl:template>
</xsl:stylesheet>
Don't ponder on that. It is a dark conjuring that fixes a bug in hocr2pdf. For every .tif file, we now do the following:

* Perform OCR on it and record where the letters are located in the image. That's what hOCR is all about.
* Do some voodoo with fix-hocr.xsl on the hOCR output from tesseract, because tesseract's output causes problems with hocr2pdf, at least it did for me.
* Use the hocr2pdf tool to create a PDF document which contains two layers: one layer of text information and the original .tif image above it.

And here's how we do that.
for pg in *.tif; do
  tesseract -l eng -psm 1 "$pg" stdout hocr |      # OCR, emit hOCR on stdout
  xsltproc -html -nonet -novalid fix-hocr.xsl - |  # apply the fix from above
  hocr2pdf -i "$pg" -o "${pg%%.tif}.pdf"           # text layer + scan -> one-page PDF
done
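If you want to sanity-check the pipeline on a single page before running the whole batch, the same three commands work by hand (page-001.tif is just a placeholder name here, use any one of your scans):
tesseract -l eng -psm 1 page-001.tif stdout hocr |
  xsltproc -html -nonet -novalid fix-hocr.xsl - |
  hocr2pdf -i page-001.tif -o test.pdf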
Ignore the warnings about non-closing <?xml> tags; they're bogus. Now you should have tons of searchable PDF pages, so let's merge them into one document.
pdftk *.pdf cat output book.pdf;
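To convince yourself that the text layer actually made it in, you can dump it with pdftotext from the poppler-utils package (just a quick check, not part of the pipeline):
pdftotext book.pdf - | less
If the OCR worked, you should see the recognized text of your pages scroll by.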
The document will be huge. Compressing it is a whole different story of pain and awe. What worked pretty well for me was to convert the PDF to PS, then from PS back to PDF, and then from that PDF to DJVU:
pdf2ps book.pdf book.ps;
gs -dCompatibilityLevel=1.4 -dBATCH -dNOPAUSE -dPDFSETTINGS=/ebook -dPDFA=2 -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=/RGB -dUseCIEColor -sPDFACompatibilityPolicy=2 -sOutputFile=book2.pdf book.ps;
pdf2djvu --loss-level=200 --dpi=299 --verbose --monochrome -o book.djvu book2.pdf;
These settings are for 300 dpi files. Somehow, setting the DPI to 299 for pdf2djvu shrank the file size absurdly, as opposed to 300. My wild guess is that 299 somehow allowed pdf2djvu to actually use its lossy compression, while keeping it at 300 did not. As you can see, I have no real idea what is happening here, so you will probably have to toy with the options a bit, like I did. Another side note: the PDF remained huge no matter what I tried, but I got the DJVU down to about 10 KB per page, which would be 2 MB for 200 pages.
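If you want to see the difference for yourself, a quick size comparison of the files from above does the job:
du -h book.pdf book.ps book2.pdf book.djvu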
