Bytes | Software Development & Data Engineering Community

Extracting numbers from PDF sometimes scrambles them

I need to extract text from PDF files. Usually it works fine, but in some files the extraction process replaces most (though not all) digits with other digits. For example, "1966" becomes "6611", "015785645" becomes "465665105", and so on. The ridiculous part is that the rest of the text is perfectly fine; only the digits are affected, and the number of digits stays the same. I tried various extraction methods, even plain manual copy-pasting, and I tried different computers with different operating systems. I always get the same result. The numbers appear properly in Acrobat Reader and in browser plugins, but any attempt to copy or extract them produces these digit substitutions. The substitution isn't random at all: it gives identical, self-consistent results every time, everywhere. Other than that, the substitution "rules" have no logic and no purpose I can think of. There is also nothing special about the affected files. They're normal PDFs with no restrictions, no protection, and no unusual PDF version or producing software.
Has anyone ever encountered such a phenomenon? Why is this happening? And how can this be fixed or at least detected programmatically?
May 11 '19 #1
Luuk
1,047 Expert 1GB
I would say: "more info needed".
Would it be possible to post an example of such a PDF somewhere?
May 11 '19 #2
Here's one example: http://miranor.co.il/page27.pdf
For convenience I extracted just one page with quite a few numbers. Interestingly, the online extraction tool preserved both the correct presentation and the problem. The text is in Hebrew, but the numbers are recognizable. In the first couple of lines, try to copy-paste 5.9 and 1995: they become 1.0 and 1001, at least for me. Maybe this problem exists only in Hebrew files, but I see no reason why.
May 19 '19 #3
zmbd
5,501 Expert Mod 4TB
This persists in a non-Adobe PDF reader as well...

I am not overly familiar with reverse engineering PDF files; however, I did use my trusty hex editor and took a look under the hood. While I did find the spark plugs, I didn't find the engine! Anyway, my best guess, from looking at the object declarations, is that the cause lies in how the font is encoded within the PDF file structure.
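One way to test the font-encoding guess: extractors turn a font's character codes into Unicode text using the font's ToUnicode CMap when one is present, so a CMap that sends several different glyph codes to the same digit would give exactly this kind of output. Nobody in this thread has confirmed that this is what happens in page27.pdf. The sketch below is plain Python and only handles the simple `beginbfchar` ... `endbfchar` sections (not `bfrange`). The sample CMap text and its codes are hypothetical, and the function names are mine.

```python
import re
from collections import defaultdict
from typing import Dict, List

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
ASCII_DIGITS = set("0123456789")


def parse_bfchar(cmap_text: str) -> Dict[int, str]:
    """Return {character code: Unicode text} for every bfchar entry."""
    table: Dict[int, str] = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for code_hex, uni_hex in BFCHAR_ENTRY.findall(block):
            # Destination values in a ToUnicode CMap are UTF-16BE.
            table[int(code_hex, 16)] = bytes.fromhex(uni_hex).decode("utf-16-be")
    return table


def colliding_digits(table: Dict[int, str]) -> Dict[str, List[int]]:
    """Return digits that more than one character code maps to."""
    by_digit: Dict[str, List[int]] = defaultdict(list)
    for code, text in table.items():
        if text in ASCII_DIGITS:
            by_digit[text].append(code)
    return {d: sorted(codes) for d, codes in by_digit.items() if len(codes) > 1}


# Hypothetical CMap fragment: three different codes all claim to be "1".
SAMPLE_CMAP = """
4 beginbfchar
<0013> <0039>
<0014> <0031>
<0015> <0031>
<0016> <0031>
endbfchar
"""

print(colliding_digits(parse_bfchar(SAMPLE_CMAP)))
```

If a real font's CMap shows collisions like this, the original digits cannot be recovered from the text layer alone, because the glyph codes are different but the extracted characters are the same.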

You might want to list the methods you've tried, and if you have used any code such as JavaScript you might want to post it as well (please use the [CODE/] format tool for such scripts) so that our other experts have a foundation to work with.

Best of luck
May 20 '19 #4
I didn't write any code. I used the 'pdftotext' CLI utility as well as several online converters. The choice of tool doesn't matter, since the problem exists everywhere.
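On the detection question from the first post, one possible approach is to compare the numbers in the extracted text with the numbers in a trusted reading of the same page, for example OCR of the rendered page image. The sketch below covers only the comparison step and takes both texts as plain strings. How the reference text is obtained is left open, and the function names are made up. It also assumes both texts list the numbers in the same order, which may not hold for right-to-left text.

```python
import re
from typing import List, Tuple

# A run of ASCII digits, optionally with one decimal part (e.g. "5.9").
DIGIT_RUN = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def digit_runs(text: str) -> List[str]:
    """Return all numbers found in the text, in order of appearance."""
    return DIGIT_RUN.findall(text)


def mismatched_runs(extracted_text: str, reference_text: str) -> List[Tuple[str, str]]:
    """Return (reference, extracted) pairs for numbers that differ."""
    extracted = digit_runs(extracted_text)
    reference = digit_runs(reference_text)
    if len(extracted) != len(reference):
        raise ValueError("texts contain a different count of numbers")
    return [(r, e) for r, e in zip(reference, extracted) if r != e]


# Illustrative strings built from the numbers quoted in this thread.
reference = "year 1995, value 5.9, page 27"
extracted = "year 1001, value 1.0, page 27"
print(mismatched_runs(extracted, reference))
```

Any non-empty result flags the file as suspect. The unaffected page number 27 produces no mismatch.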
May 20 '19 #5
Luuk
1,047 Expert 1GB
You failed to mention that the text in your PDF is Hebrew, which is read from right to left...

I think you are hitting this bug:
https://bugs.freedesktop.org/show_bug.cgi?id=32522
Its status is 'solved', so it seems you need to update.
May 20 '19 #6
No, that bug is about letters being in the wrong order. In my case the letters are perfectly fine and in the right order. The digits are in the right order too; they are just (sometimes) substituted with completely different digits. In this example, 2000 becomes 1999 and 2017 becomes 1910. From this and other examples we can derive the substitution map: 0->9, 1->1, 2->1, 3->1, 4->1, 5->1, 6->9, 7->0, 8->2, 9->0. Sadly, these rules apply only to this file. In other affected files the mapping is totally different but always degenerate (several digits collapse onto the same output digit), so reverse mapping is impossible and the numbers are damaged beyond repair. Interestingly, the page number, 27, isn't affected, so the error seems to be localized to the page body (not the header or footer), or perhaps this is indeed a font problem (the page number is written in a different font). In any case, the problem is not in 'pdftotext': it appears everywhere, including Acrobat Reader itself (every version I checked), which displays the numbers correctly but copy-pastes them wrong.
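The substitution map can be rebuilt mechanically from pairs of numbers as displayed and as copied. Here is a minimal Python sketch. The helper names `derive_digit_map` and `is_reversible` are made up for illustration, and the pairs are the ones quoted in this thread for page27.pdf. It only recovers the digits that occur in those pairs, so the result is a partial map.

```python
from typing import Dict, Iterable, Tuple


def derive_digit_map(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Build a shown-digit -> extracted-digit map from (shown, extracted) pairs."""
    mapping: Dict[str, str] = {}
    for shown, extracted in pairs:
        if len(shown) != len(extracted):
            raise ValueError(f"length mismatch: {shown!r} vs {extracted!r}")
        for s, e in zip(shown, extracted):
            if not s.isdigit():
                continue  # skip separators such as the "." in "5.9"
            # Record the first sighting; complain if a later one disagrees.
            if mapping.setdefault(s, e) != e:
                raise ValueError(f"inconsistent mapping for digit {s}")
    return mapping


def is_reversible(mapping: Dict[str, str]) -> bool:
    """True only if no two shown digits collapse onto the same extracted digit."""
    return len(set(mapping.values())) == len(mapping)


# (as displayed, as copied) pairs quoted in this thread.
pairs = [("5.9", "1.0"), ("1995", "1001"), ("2000", "1999"), ("2017", "1910")]
digit_map = derive_digit_map(pairs)
print(digit_map)
print(is_reversible(digit_map))
```

For these four pairs the map sends 1, 2 and 5 to 1 and both 7 and 9 to 0, so `is_reversible` returns False. A consistency error raised by `derive_digit_map` would instead suggest that more than one mapping is in play, for example one per font.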
May 20 '19 #7
