a.k.a. "My Bank does not support CSVs".
When I asked my bank for "machine readable" versions of my bank statements, they where like:
<img fetchpriority="high" decoding="async" src="https://blag.nullteilerfrei.de/wp-content/uploads/2019/04/wat-low.jpg" alt="WAT!" width="640" height="425" class="aligncenter size-full wp-image-4701" srcset="https://blag.nullteilerfrei.de/wp-content/uploads/2019/04/wat-low.jpg 640w, https://blag.nullteilerfrei.de/wp-content/uploads/2019/04/wat-low-300x199.jpg 300w" sizes="(max-width: 640px) 100vw, 640px" />
Their website has a CSV-export function. But only data from the last three months can be exported. Of course, it would have been smart to have performed this export every two months or so, but let's talk about something else.
<span id="more-4694"></span>
I do have the electronic version of the bank statements for the whole time. These electronic bank statements come as PDFs and each page is made up of a header and a table containing the actual transactions. My goal was to extract some meta data from the header - like the time-frame the statement is for - and a list of all transactions containing amount, date and subject.
A main purpose of extracting meta data from the header is to be able to sanity-check the result:
* is there a statement missing or do the processed statements cover a larger time frame without any gaps?
* is the account number the same for all processed statements?
* is the statement number increasing over time?
* is the difference of end balance and start balance the same as the sum of all transactions?
After trying out some standalone executables and libraries I decided to follow the following process:
1. <a href="http://qpdf.sourceforge.net/">`qpdf`</a> to "unlock" PDFs
2. `pdftotext.exe` from Xpdf to convert the data to readable text
3. custom text parsing in Python to ultimately turn the data into a CSV
4. import CSV into Excel/Calc and perform some manual cleanup/filtering
WTF is "unlock" you may ask: PDFs can have "restrictions" on what you can do with them. Things like "it is not allowed to print this PDF" or "it is not allowed to copy text from this PDF to the clipboard". The weird thing is that all sort of PDF readers and as well as parsing libraries adhere to these restrictions.
<img decoding="async" src="https://blag.nullteilerfrei.de/wp-content/uploads/2019/04/doc-props.png" alt="Acrobat Reader Screenshot" width="818" height="653" class="aligncenter size-full wp-image-4708" srcset="https://blag.nullteilerfrei.de/wp-content/uploads/2019/04/doc-props.png 818w, https://blag.nullteilerfrei.de/wp-content/uploads/2019/04/doc-props-300x239.png 300w, https://blag.nullteilerfrei.de/wp-content/uploads/2019/04/doc-props-768x613.png 768w" sizes="(max-width: 818px) 100vw, 818px" />
Even the Russia-developed PDF reader SumatraPDF tells me that it cannot copy text from such a PDF when the corresponding protection is in place.
Not so <a href="http://qpdf.sourceforge.net/">QPDF</a>:
```bash
qpdf.exe --decrypt statement.pdf temp.pdf
```
yields a file `temp.pdf` without any of the restrictions. The tool `pdftotext.exe` from the Xpdf toolset is then able to convert the data into text:
```bash
pdftotext.exe -table -enc UTF-8 temp.pdf -
```
will output the text contained in `temp.pdf` in UTF-8 encoding to stdout while assuming that the PDF contains data formated as a table.
The rest was pretty straight forward: write a few regular expressions to match on the text resulting from the PDFs and do some hacky line-based parsing to read transactions.