Extracting numbers from PDF sometimes scrambles them

I need to extract text from PDF files. Usually it works fine, but in some files the extraction process replaces most (though not all) digits with other digits. For example, "1966" becomes "6611", "015785645" becomes "465665105", and so on. The ridiculous part is that the rest of the text is perfectly fine; only the digits are affected, and the number of digits stays the same. I tried various extraction methods, even something as simple as manual copy-pasting, and I tried different computers with different operating systems. I always get the same result. The numbers display properly in Acrobat Reader and in browser plugins, but any attempt to copy or extract them results in these digit substitutions. The substitution isn't random: within a given file it produces identical, self-consistent results every time, on every machine. Other than that, the substitution "rules" have no logic and no purpose I can think of. There is also nothing special about the affected files: they're normal PDFs with no restrictions, no protection, and no unusual PDF version or producing software.
Has anyone ever encountered such a phenomenon? Why is this happening? And how can this be fixed or at least detected programmatically?
May 11 '19 #1
1,043 Expert 1GB
I would say: "more info needed".
Would it be possible to post an example of such a PDF somewhere?
May 11 '19 #2
Here's one example: http://miranor.co.il/page27.pdf
For convenience I extracted just one page with quite a few numbers. Interestingly, the online extraction tool preserved both the correct presentation and the problem. The text is in Hebrew, but the numbers are recognizable. Right in the first couple of lines, try to copy-paste 5.9 and 1995: they become 1.0 and 1001, at least for me. Maybe this problem exists only in Hebrew files, but I see no reason why it would.
May 19 '19 #3
5,400 Expert Mod 4TB
This persists in a non-Adobe PDF reader as well...

I am not overly familiar with reverse engineering PDF files; however, I did use my trusty hex editor and took a look under the hood. While I did find the spark plugs, I didn't find the engine! Anyway, my best guess, looking at the object declarations, is that the problem lies in how the font is encoded within the PDF file structure.
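One way to probe the font-encoding guess, offered only as a rough sketch: PDF fonts can carry a ToUnicode CMap, a small table that copy/extract tools use to turn glyph codes back into characters. Assuming the affected font has such a table (nothing in this thread confirms that), a table in which several glyph codes map to the same digit would be a warning sign. The CMap fragment below is invented for illustration and is not taken from page27.pdf; `digit_collisions` is a hypothetical helper name, and it reads only `bfchar` entries, not `bfrange`.

```python
import re
from collections import defaultdict

# Hypothetical ToUnicode CMap fragment, made up for this example.
# Two different glyph codes (0013 and 0014) both claim to be the digit "1".
SAMPLE_CMAP = """
3 beginbfchar
<0013> <0031>
<0014> <0031>
<0015> <0039>
endbfchar
"""

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
DIGITS = set("0123456789")


def digit_collisions(cmap_text):
    """Return {digit: [glyph codes]} for digits that more than one code maps to."""
    targets = defaultdict(list)
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for code, uni_hex in BFCHAR_ENTRY.findall(block):
            try:
                # Destination values in a ToUnicode CMap are UTF-16BE.
                char = bytes.fromhex(uni_hex).decode("utf-16-be")
            except (ValueError, UnicodeDecodeError):
                continue  # skip entries this sketch cannot decode
            if char in DIGITS:
                targets[char].append(code)
    return {digit: codes for digit, codes in targets.items() if len(codes) > 1}


print(digit_collisions(SAMPLE_CMAP))
```

A collision is only a hint, not proof: a font may legitimately contain several glyphs for the same character. Also, in a real file the CMap stream is usually compressed, so it would have to be decoded with a PDF library before text like this can be searched.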

You might want to list the methods you've tried, and if you have used any code such as JavaScript you might want to post it as well (please use the [CODE/] format tool for such scripts) so that our other experts have a foundation to work with.

Best of luck
May 20 '19 #4
I didn't write any code. I used the 'pdftotext' CLI utility as well as several online converters. It doesn't matter, since the problem exists everywhere.
May 20 '19 #5
1,043 Expert 1GB
You failed to mention that the text in your PDF is Hebrew, which reads from right to left...

I think this bug applies here:
The status of this bug is 'solved', so it seems you need an update.
May 20 '19 #6
No, that bug is about letters being in the wrong order. In my case the letters are perfectly fine and in the right order. The digits are in the right order too; they're just replaced (sometimes) with completely different digits. In this example, 2000 becomes 1999 and 2017 becomes 1910. From this and other examples we can derive the substitution map: 0->9, 1->1, 2->1, 3->1, 4->1, 5->1, 6->9, 7->0, 8->2, 9->0. Sadly, these rules apply only to this file. In other affected files the mapping is totally different, but it is always degenerate (several digits collapse to the same digit), so reverse mapping is impossible and the extracted numbers are damaged beyond repair. Interestingly, though, the page number, 27, isn't affected, so the error seems to be localized to the page body (not the header or footer), or perhaps this is indeed a font problem (the page number is set in a different font). In any case, the problem is not in 'pdftotext': it appears everywhere, including Acrobat Reader itself (every version I checked), which displays the numbers correctly but copy-pastes them wrong.
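To make the mapping concrete, here is a small Python sketch that applies the map quoted above and groups the original digits by what they turn into. The names `SUBSTITUTION`, `scramble` and `preimages` are just illustrative, and the map is valid for this one file only.

```python
# Substitution map reported for this one file:
# digit as displayed -> digit as it comes out when copied or extracted.
SUBSTITUTION = {
    "0": "9", "1": "1", "2": "1", "3": "1", "4": "1",
    "5": "1", "6": "9", "7": "0", "8": "2", "9": "0",
}


def scramble(text):
    """Apply the observed map to every digit; all other characters pass through."""
    return "".join(SUBSTITUTION.get(ch, ch) for ch in text)


def preimages(mapping):
    """Group displayed digits by the extracted digit they collapse to."""
    groups = {}
    for shown, extracted in mapping.items():
        groups.setdefault(extracted, []).append(shown)
    return groups


for sample in ("2000", "2017", "5.9", "1995"):
    print(sample, "->", scramble(sample))

# Any extracted digit with more than one source digit cannot be reversed.
for extracted, shown in sorted(preimages(SUBSTITUTION).items()):
    print(extracted, "<-", shown)
```

Ten displayed digits collapse to only four extracted ones (0, 1, 2 and 9), and an extracted "1" could have been any of 1 through 5, which is why the original numbers cannot be recovered from the extracted text alone.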
May 20 '19 #7
