
Extracting numbers from PDF sometimes scrambles them

P: 6
I need to extract text from PDF files. Usually it works fine, but in some files the extraction process replaces most (though not all) digits with other digits. For example, "1966" becomes "6611", "015785645" becomes "465665105", and so on. The ridiculous part is that the rest of the text is perfectly fine; only the digits are affected (but the number of digits remains the same). I tried various extraction methods, even ones as simple as manual copy-pasting, and I tried different computers with different operating systems - I always get the same result. While the numbers are displayed properly in Acrobat Reader or browser plugins, any attempt to copy or extract them results in such digit substitutions. The substitution isn't random at all - it produces self-consistent and identical results every time, anywhere, but other than that the substitution "rules" have no logic and no purpose I can think of. There is also nothing special about the affected files - they're normal PDFs with no restrictions, no protections, and no special versions or software that created them.
Has anyone ever encountered such a phenomenon? Why is this happening? And how can this be fixed or at least detected programmatically?
1 Week Ago #1
6 Replies


Expert 100+
P: 1,026
I would say: "more info needed".
Would it be possible to post an example of such a PDF somewhere?
1 Week Ago #2

P: 6
Here's one example: http://miranor.co.il/page27.pdf
For convenience I extracted just one page with quite a few numbers. Interestingly, the online extraction tool preserved both the correct presentation and the problem. The text is in Hebrew, but the numbers are recognizable. Right in the first couple of lines, try to copy-paste 5.9 and 1995 - they become 1.0 and 1001, at least for me. Maybe this problem exists only in Hebrew files, but I see no reason why.
5 Days Ago #3

zmbd
Expert Mod 5K+
P: 5,331
This persists in a non-Adobe PDF reader as well...

I am not overly familiar with reverse engineering PDF files; however, I did use my trusty hex-editor and took a look under the hood. While I did find the spark plugs, I didn't find the engine! Anyway, my best guess, looking at the object declarations, is that the cause lies in how the font is encoded within the PDF file structure.
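If the font-encoding guess is right, one place to look would be the font's ToUnicode CMap, which is the table PDF viewers consult when text is copied or extracted. The sketch below is only an illustration under assumptions: it expects the CMap stream to be already decompressed into plain text, it handles only `bfchar` entries (not `bfrange`), and `SAMPLE_CMAP` and `digit_collisions` are hypothetical names with made-up data, not anything taken from the file in this thread. It reports digits that more than one glyph code maps to, which would explain different displayed digits being copied as the same digit.

```python
import re
from collections import defaultdict

# Hypothetical, already-decompressed ToUnicode CMap fragment (made-up data).
SAMPLE_CMAP = """
3 beginbfchar
<0013> <0039>
<0014> <0031>
<0015> <0031>
endbfchar
"""

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def digit_collisions(cmap_text):
    """Return {digit: [glyph codes]} for digits that several glyph codes map to."""
    by_digit = defaultdict(list)
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in BFCHAR_ENTRY.findall(block):
            if len(dst) % 2:
                continue  # skip malformed hex strings
            # Destination values in a ToUnicode CMap are UTF-16BE code units.
            text = bytes.fromhex(dst).decode("utf-16-be", errors="replace")
            if len(text) == 1 and text in "0123456789":
                by_digit[text].append(src.upper())
    return {d: codes for d, codes in by_digit.items() if len(codes) > 1}


collisions = digit_collisions(SAMPLE_CMAP)
print(collisions)
```

An empty result would not rule out an encoding problem, since a wrong but one-to-one table would not show up as a collision.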

You might want to list the methods you've tried, and if you have used any code such as JavaScript you might want to post it as well (please use the [CODE/] format tool for such scripts) so that our other experts have a foundation to work with.

Best of luck
5 Days Ago #4

P: 6
I didn't write any code. I used the 'pdftotext' CLI utility as well as several online converters. It doesn't matter, since the problem exists everywhere.
4 Days Ago #5

Expert 100+
P: 1,026
You failed to mention that the text in your PDF is Hebrew, which is read from right to left...

I think this bug is at play:
https://bugs.freedesktop.org/show_bug.cgi?id=32522
The status of this bug is 'solved', so it seems you need an update.
4 Days Ago #6

P: 6
No, that bug is about letters being in the wrong order. In my case the letters are perfectly fine and in the right order. The digits are in the right order too; they're just substituted with completely different digits (sometimes). In this example, 2000 becomes 1999 and 2017 becomes 1910. From this and other examples we can derive the substitution map: 0->9, 1->1, 2->1, 3->1, 4->1, 5->1, 6->9, 7->0, 8->2, 9->0. Sadly, these rules apply only to this file. In other affected files the mapping is totally different, but always degenerate (several digits collapse into one), so reverse mapping is impossible anyway and the numbers are damaged beyond repair. Interestingly, though, the page number, 27, isn't affected, so the error seems to be localized to the page body (not the header or footer), or perhaps this is indeed a font problem (the page number is written in a different font). In any case, the problem is not in 'pdftotext' - it appears everywhere, including Acrobat Reader itself (every version I checked), which displays the numbers correctly but copy-pastes them wrong.
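To make the "reverse mapping is impossible" point concrete, here is a rough sketch that derives the map from (displayed, extracted) pairs and checks whether it could be inverted. The pairs are the ones quoted in this thread for page27.pdf; the function names `derive_digit_map` and `is_invertible` are just illustrative.

```python
def derive_digit_map(pairs):
    """Build {displayed_digit: set of extracted digits} from string pairs."""
    mapping = {}
    for shown, extracted in pairs:
        if len(shown) != len(extracted):
            raise ValueError(f"length mismatch: {shown!r} vs {extracted!r}")
        for s, e in zip(shown, extracted):
            if s.isdigit() and e.isdigit():
                mapping.setdefault(s, set()).add(e)
    return mapping


def is_invertible(mapping):
    """True only if the map is one-to-one, so extracted digits could be undone."""
    images = []
    for targets in mapping.values():
        if len(targets) != 1:
            return False  # one displayed digit extracted in several ways
        images.append(next(iter(targets)))
    # Two displayed digits sharing one extracted digit cannot be told apart.
    return len(images) == len(set(images))


# (displayed, extracted) pairs quoted in this thread for the same file
pairs = [("2000", "1999"), ("2017", "1910"), ("5.9", "1.0"), ("1995", "1001")]
digit_map = derive_digit_map(pairs)
print(digit_map)
print("invertible:", is_invertible(digit_map))
```

With these pairs, 1, 2 and 5 all come out as 1, and 7 and 9 both come out as 0, so the check reports that the map cannot be inverted.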
4 Days Ago #7
