So you have a scanned document, and you want to produce a searchable PDF from these images. In my case, I want to digitalize some of my books. There are other tutorials out there, but none of them worked for me. Here's what I did, maybe it helps you. First, get a debian box and install the packages tesseract (this is the OCR software), xsltproc (dark magic), exactimage (for hocr2pdf) and of course pdftk. Now, take your scan and use scantailor to split it into several neat black and white .tif files, one for each page. Now, create a file called fix-hocr.xsl and put this in it:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- use on hocr file to fix for hocr2pdf 0.8.9 textbox placement -->
<xsl:template match="/html">
   <xsl:text>&#13;</xsl:text>
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
</xsl:template>
<xsl:template match="node()|@*">
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
</xsl:template>
<xsl:template match="span[@class='ocr_line']">
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
   <xsl:element name="br">&#13;</xsl:element>
</xsl:template>
</xsl:stylesheet>
Don't ponder on that. It is a dark conjuring that fixes a bug in hocr2pdf. For every .tif file, we now do the following: * Perform OCR on it, and record the information of where the letters are located in the image. That's what HOCR is all about. * Do some vodoo with fix-hocr.xsl on the HOCR information from tesseract because tesseract's output causes problems with hocr2pdf, at least it did for me. * Use the hocr2pdf tool to create a PDF document which contains two layers: One layer of text information and the original tif image above it. And here's how we do that.
for pg in \$(ls *.tif); do 
 tesseract -l eng -psm 1 $pg stdout hocr |
 xsltproc -html -nonet -novalid fix-hocr.xsl - |
 hocr2pdf -i $pg -o "\${pg%%.tif}.pdf"; 
done
Ignore the warnings about nonclosing ?xml tags, they're bogus. Now you should have tons of searchable pdf pages, let's merge them into a document.
pdftk *.pdf cat output book.pdf;
The document will be huge. Compressing it is a whole different story of pain and awe. What worked pretty good for me was to convert the PDF to PS and then from PS back to PDF and then from that PDF to DJVU:
pdf2ps book.pdf book.ps;
gs -dCompatibilityLevel=1.4 -dBATCH -dNOPAUSE -dPDFSETTINGS=/ebook -dPDFA=2 -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=/RGB -dUseCIEColor -sPDFACompatibilityPolicy=2 -sOutputFile=book2.pdf book.ps;
pdf2djvu --loss-level=200 --dpi=299 --verbose --monochrome -o book.djvu book2.pdf;
For 300 dpi files. Somehow, setting the DPI to 299 for pdf2djvu shrunk the file size absurdly, as opposed to 300. My wild guess is that 299 somehow allowed pdf2djvu to actually use its lossy compression, while maintaining the 300 did not. As you can see, I have no actual idea what is happening here, so you will probably have to toy with the options a bit like I did. Another side note: The PDF remained huge no matter what I tried, but I got the DJVU down to about 10kb per page, which would be 2mb for 200 pages.


Nikolai asked me how to open a Python interpreter prompt from the console with a certain module already imported. For the record, the Python command line switches tell us that python -i -m module is the way to start the prompt and load module.py. That made me wonder whether I can stuff it all into one batch file, and I came up with the following test.bat:
REM = '''
@COPY %0.bat %0.py
@python %0.py
@DEL %0.py 
@goto :eof ::'''
del REM
for k in range(70):
    print(k)
That script will ignore the first line because it is a comment, then copy itself to test.py, then launch python with this argument. Afterwards, test.py is deleted and the script terminates without looking at any of the following lines. Note that :: is yet another way to comment in Batch. Python, however, will see a script where the variable REM is defined as a multi-line string and deleted right after that. After this little stub, you can put any python code you want. Well. I thought it was funky.


If you like electronic music, you obviously lurk getworkdonemusic.com all the time with jDownloader running, so you can copy the link to any nice track on the air and download it, slowly aquiring a groovy selection of electronic music for the lonely train hours where you have to get work done, offline style. Sometimes, however, these tracks are in wave format, for reasons that are completely beyond me. A quick google search for »convert WAV to MP3« points you towards a huge selection of shitware, when it could be so much easier: Fire up VLC, the best media player in the world1 and go * Media, Convert/Save (Ctrl+R) * Add your file to the list. * Click the button that says Convert/Save. * Select a destination file. * Set the profile to Audio - MP3 and hit Start. Also, make sure you don't have VLC set to loop, because otherwise it'll spend the next few hours overwriting that file with the same converted MP3 again and again. True story.
  1. This post refers to VLC media player 2.0.5 Twoflower. []