Extract Transactions from Bank Statement in PDF form



a.k.a. "My Bank does not support CSVs". When I asked my bank for "machine readable" versions of my bank statements, they where like: WAT! Their website has a CSV-export function. But only data from the last three months can be exported. Of course, it would have been smart to have performed this export every two months or so, but let's talk about something else. I do have the electronic version of the bank statements for the whole time. These electronic bank statements come as PDFs and each page is made up of a header and a table containing the actual transactions. My goal was to extract some meta data from the header - like the time-frame the statement is for - and a list of all transactions containing amount, date and subject. A main purpose of extracting meta data from the header is to be able to sanity-check the result: * is there a statement missing or do the processed statements cover a larger time frame without any gaps? * is the account number the same for all processed statements? * is the statement number increasing over time? * is the difference of end balance and start balance the same as the sum of all transactions? After trying out some standalone executables and libraries I decided to follow the following process: 1. qpdf to "unlock" PDFs 2. pdftotext.exe from Xpdf to convert the data to readable text 3. custom text parsing in Python to ultimately turn the data into a CSV 4. import CSV into Excel/Calc and perform some manual cleanup/filtering WTF is "unlock" you may ask: PDFs can have "restrictions" on what you can do with them. Things like "it is not allowed to print this PDF" or "it is not allowed to copy text from this PDF to the clipboard". The weird thing is that all sort of PDF readers and as well as parsing libraries adhere to these restrictions. Acrobat Reader Screenshot Even the Russia-developed PDF reader SumatraPDF tells me that it cannot copy text from such a PDF when the corresponding protection is in place. Not so QPDF:
qpdf.exe --decrypt statement.pdf temp.pdf
yields a file temp.pdf without any of the restrictions. The tool pdftotext.exe from the Xpdf toolset is then able to convert the data into text:
pdftotext.exe -table -enc UTF-8 temp.pdf -
will output the text contained in temp.pdf in UTF-8 encoding to stdout while assuming that the PDF contains data formated as a table. The rest was pretty straight forward: write a few regular expressions to match on the text resulting from the PDFs and do some hacky line-based parsing to read transactions.

Leave a Reply

Your email address will not be published.