Obviously, my bank does not provide a REST API for downloading the transactions on my accounts. After I asked for "machine parseable" data, they told me that I can download CSV files. Awesome! So I wrote a parser and lived happily ever after. Except that they change their CSV format without notice every few months, and at some point they started to mix different encodings in the same file. So I lived unhappily, regularly fixing the script that reads the CSV file.

What has not changed for around 10 years now are their bank statements. This also holds for the PDFs you have to download if you want to avoid getting them via snail mail (and paying for the postage, of course). I decided to parse the PDFs instead, and this went pretty well for many years... until recently, when something changed (it may have been a software update of the parser I use or something on their side).

PDFs contain metadata that controls which actions a viewer application should allow the user to perform. Programs like Adobe Reader show this information on the Security tab of the document properties dialog. And since we are talking about bank statements here, my bank seems to have decided that one should not be able to do much with those documents apart from printing them. Exiftool has read this information from the PDF under the "UserAccess" tag since 2010 ((http://u88.n24.queensu.ca/exiftool/forum/index.php?topic=2908.0)).
    $ exiftool statement.pdf
    ExifTool Version Number : 9.74
    [...]
    User Access             : Print, Assemble, Print high-res
    [...]

The annoying thing is that Adobe seems to have bullied a lot of library and parser developers into honoring this metadata. That's what they did to PDFMiner as well ((https://github.com/euske/pdfminer)). If you just try to call PDFMiner with such a PDF, it throws a PDFTextExtractionNotAllowed exception. Here is a patch to fix this bug:
    diff --git a/pdfminer/pdfpage.py b/pdfminer/pdfpage.py
    index a48767c..bd0d9df 100644
    --- a/pdfminer/pdfpage.py
    +++ b/pdfminer/pdfpage.py
    @@ -8,7 +8,7 @@ from .pdftypes import list_value
     from .pdftypes import dict_value
     from .pdfparser import PDFParser
     from .pdfdocument import PDFDocument
    -from .pdfdocument import PDFTextExtractionNotAllowed
     
     # some predefined literals and keywords.
     LITERAL_PAGE = LIT('Page')
    @@ -121,8 +121,8 @@ class PDFPage(object):
             # Create a PDF document object that stores the document structure.
             doc = PDFDocument(parser, password=password, caching=caching)
             # Check if the document allows text extraction. If not, abort.
    -        if check_extractable and not doc.is_extractable:
    -            raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
             # Process each page contained in the document.
             for (pageno, page) in enumerate(klass.create_pages(doc)):
                 if pagenos and (pageno not in pagenos):
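
For the curious: the permissions that viewers and Exiftool report are stored as a bit field in the PDF's encryption dictionary (the /P entry in the PDF specification). The following is only a rough sketch of how such a value can be decoded. The names decode_permissions and allows_text_extraction and the example value -828 are made up for illustration, and the labels are merely chosen to resemble Exiftool's output. That the copy/extract bit is what PDFMiner's is_extractable looks at is an assumption, not something the patch itself shows.

```python
# Permission bits of the /P entry in a PDF encryption dictionary.
# Bit positions follow the PDF specification (bit 1 is the lowest bit).
# The labels are an assumption, chosen to resemble Exiftool's "User Access" output.
PERMISSION_BITS = [
    (3, "Print"),
    (4, "Modify"),
    (5, "Copy"),
    (6, "Annotate"),
    (9, "Fill forms"),
    (10, "Extract"),
    (11, "Assemble"),
    (12, "Print high-res"),
]


def decode_permissions(p):
    """Return the labels of all permissions granted by the flags value p."""
    return [label for bit, label in PERMISSION_BITS if p & (1 << (bit - 1))]


def allows_text_extraction(p):
    """Return True if the copy/extract bit (bit 5, value 16) is set."""
    return bool(p & 16)


if __name__ == "__main__":
    # Hypothetical flags value that grants only printing, assembling
    # and high-resolution printing (stored as a signed 32-bit integer).
    flags = -828
    print(", ".join(decode_permissions(flags)))
    print(allows_text_extraction(flags))
```

With the example value, only the print, assemble and high-res print bits are set, so text extraction counts as "not allowed". The condition removed by the patch also suggests that passing check_extractable=False to the page iterator may skip the check without touching the library, but that depends on the PDFMiner version in use.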