pdf – nullteilerfrei

Horribly obscure character recognition

rattle technology 2015-10-05

So you have a scanned document, and you want to produce a searchable PDF from these images. In my case, I want to digitalize some of my books. There are other tutorials out there, but none of them worked for me. Here's what I did, maybe it helps you. First, get a debian box and install the packages tesseract (this is the OCR software), xsltproc (dark magic), exactimage (for hocr2pdf) and of course pdftk. Now, take your scan and use scantailor to split it into several neat black and white .tif files, one for each page. Now, create a file called fix-hocr.xsl and put this in it:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- use on hocr file to fix for hocr2pdf 0.8.9 textbox placement -->
<xsl:template match="/html">
   <xsl:text>&#13;</xsl:text>
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
</xsl:template>
<xsl:template match="node()|@*">
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
</xsl:template>
<xsl:template match="span[@class='ocr_line']">
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
   <xsl:element name="br">&#13;</xsl:element>
</xsl:template>
</xsl:stylesheet>

Don't ponder on that. It is a dark conjuring that fixes a bug in hocr2pdf. For every .tif file, we now do the following: * Perform OCR on it, and record the information of where the letters are located in the image. That's what HOCR is all about. * Do some vodoo with fix-hocr.xsl on the HOCR information from tesseract because tesseract's output causes problems with hocr2pdf, at least it did for me. * Use the hocr2pdf tool to create a PDF document which contains two layers: One layer of text information and the original tif image above it. And here's how we do that.

for pg in \$(ls *.tif); do 
 tesseract -l eng -psm 1 $pg stdout hocr |
 xsltproc -html -nonet -novalid fix-hocr.xsl - |
 hocr2pdf -i $pg -o "\${pg%%.tif}.pdf"; 
done

Ignore the warnings about nonclosing ?xml tags, they're bogus. Now you should have tons of searchable pdf pages, let's merge them into a document.

pdftk *.pdf cat output book.pdf;

The document will be huge. Compressing it is a whole different story of pain and awe. What worked pretty good for me was to convert the PDF to PS and then from PS back to PDF and then from that PDF to DJVU:

pdf2ps book.pdf book.ps;
gs -dCompatibilityLevel=1.4 -dBATCH -dNOPAUSE -dPDFSETTINGS=/ebook -dPDFA=2 -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=/RGB -dUseCIEColor -sPDFACompatibilityPolicy=2 -sOutputFile=book2.pdf book.ps;
pdf2djvu --loss-level=200 --dpi=299 --verbose --monochrome -o book.djvu book2.pdf;

For 300 dpi files. Somehow, setting the DPI to 299 for pdf2djvu shrunk the file size absurdly, as opposed to 300. My wild guess is that 299 somehow allowed pdf2djvu to actually use its lossy compression, while maintaining the 300 did not. As you can see, I have no actual idea what is happening here, so you will probably have to toy with the options a bit like I did. Another side note: The PDF remained huge no matter what I tried, but I got the DJVU down to about 10kb per page, which would be 2mb for 200 pages.

Latex unter Windows mit Vorschau und deinem Lieblingseditor

born latex, technology 2013-07-12

Mein Eindruck ist, dass viele vernünftige Leute das Problem haben, unter Windows texen zu wollen und bei schrecklichen Programmen wie LaTeX Editor oder TeXnicCenter hängen bleiben. Prinzipiell funktionieren die ja auch. Vielleicht nicht perfekt und manchmal sind sie hier oder da etwas unpraktisch, oder stürzen ab (beim Editiern von Text) - aber sie tun ihre Arbeit und, hey, es gibt ja auch nichts besseres. Do you want to know more?

Animated LaTeX

rattle latex, mathematics 2013-03-01

After discovering the toggle command available in MathJax, I immediately went to asked the capable people of tex.stackexchange whether this could be done inside a PDF file. And indeed: It can be done!

\documentclass{article}
\usepackage{animate}

\begin{document}
\begin{animateinline}[step]{1}
  \strut$x=1$
\newframe
  \strut$x=2$
\newframe
  \strut$x=3$
\end{animateinline}
\end{document}

Now, what do I want with this package? I want to do abstract nonsense. A diagram-based proof should, in my opinion, be a slideshow. You start with the diagram that is your assumption and by simply interacting with the diagram (clicking it), in each step, a new arrow is constructed from some universal property. I wanted to write a neat animated PDF with an abstract nonsense proof of the famous Snake Lemma, and there is a great book by Francis Borceux containing a proof, but unfortunately, I was unable to overcome a difficulty with the proof, so that will have to wait until someone answers my question.