pdf to text

tubby

I know this question comes up a lot, so here goes again. I want to read
text from a PDF file, run re searches on the text, etc. I do not care
about layout, fonts, borders, etc. I just want the text. I've been
reading Adobe's PDF Reference Guide and I'm beginning to develop a
better understanding of PDF in general, but I need a bit of help... this
seems like it should be easier than it is. Here's some code:

import zlib

fp = open('test.pdf', 'rb')
bytes = []
while 1:
    byte = fp.read(1)
    #print byte
    bytes.append(byte)
    if not byte:
        break

for byte in bytes:

    op = open('pdf.txt', 'a')

    dco = zlib.decompressobj()

    try:
        s = dco.decompress(byte)
        #print >op, s
        print s
    except Exception, e:
        print e

    op.close()

fp.close()

I know the text is compressed... that it would have stream and endstream
markers and BT (Begin Text) and ET (End Text) and that the uncompressed
text is enclosed in parentheses (this is my text). Has anyone here done
this in a simple fashion? I've played with the pyPdf library some, but
it seems overly complex for my needs (merge PDFs, write PDFs, etc). I
just want a simple PDF text extractor.

Thanks

Jan 25 '07 #1
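
The approach described in the post above (find each stream ... endstream block, inflate it with zlib, then pick out the parenthesised strings between BT and ET) can be sketched roughly as follows. This is only a sketch in Python 3 syntax, not code from the thread: the function names are made up for illustration, and it assumes Flate-compressed content streams holding plain literal strings. Hex strings are not picked up, TJ arrays come out as fragments, nested unescaped parentheses and custom font encodings are not handled, so real-world PDFs will often give incomplete or garbled output.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A BT ... ET text object inside a decoded content stream.
TEXT_OBJECT_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A literal string in parentheses, allowing backslash escapes (no nesting).
LITERAL_RE = re.compile(rb"\((?:\\.|[^\\()])*\)")


def inflate_streams(pdf_bytes):
    """Yield the decoded bytes of every stream that zlib can inflate."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not Flate-compressed (image, font, other filter): skip it.
            continue
        yield data


def unescape(literal):
    """Undo only the escapes for parentheses and backslashes."""
    return re.sub(rb"\\([()\\])", rb"\1", literal)


def extract_text(pdf_bytes):
    """Return the literal strings found inside BT ... ET, one per line."""
    pieces = []
    for content in inflate_streams(pdf_bytes):
        for text_object in TEXT_OBJECT_RE.finditer(content):
            for literal in LITERAL_RE.finditer(text_object.group(1)):
                pieces.append(unescape(literal.group(0)[1:-1]))
    return b"\n".join(pieces).decode("latin-1")
```

Typical use would be extract_text(open('test.pdf', 'rb').read()). Note that the whole stream is handed to zlib at once; feeding the file to a decompressor one byte at a time, each byte with a fresh decompressobj, cannot work because every call starts without the zlib header.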

Nils Oliver Kröger

Have a look at PDFlib (www.pdflib.com). Their Text Extraction
Toolkit might be what you are looking for, though I'm not sure whether
you can use it separately from PDFlib itself.

hth

Nils

Jan 25 '07 #2

David Boddie

On Thursday 25 January 2007 22:05, tubby wrote:

> I know this question comes up a lot, so here goes again. I want to read
> text from a PDF file, run re searches on the text, etc. I do not care
> about layout, fonts, borders, etc. I just want the text. I've been
> reading Adobe's PDF Reference Guide and I'm beginning to develop a
> better understanding of PDF in general, but I need a bit of help... this
> seems like it should be easier than it is.

It _seems_ that way. ;-)

One of the more promising suggestions for a way to solve this came
up in a comp.lang.python thread last year:

http://groups.google.com/group/comp....9?dmode=source

Basically, if you have access to the pdftotext command on a system
that supports xpdf, you should be able to get something reasonable
out of a PDF file.

> I know the text is compressed... that it would have stream and endstream
> markers and BT (Begin Text) and ET (End Text) and that the uncompressed
> text is enclosed in parentheses (this is my text). Has anyone here done
> this in a simple fashion? I've played with the pyPdf library some, but
> it seems overly complex for my needs (merge PDFs, write PDFs, etc). I
> just want a simple PDF text extractor.

The pdftotext tool may do what you want:

http://www.foolabs.com/xpdf/download.html

Let us know how you get on with it.

David

Jan 25 '07 #3

tubby

David Boddie wrote:

> The pdftotext tool may do what you want:
>
> http://www.foolabs.com/xpdf/download.html
>
> Let us know how you get on with it.

I have used this tool. However, I need to be able to read PDFs on Windows
and Linux and, in the future, Macs. pdftotext works great on Linux, but
poorly on Windows (100% sustained CPU usage, etc).

Thank you for the suggestion. I'll keep hammering away at a simple
Python solution to this. Over the years, I have come to loathe Adobe's
Portable Document Format!

Jan 25 '07 #4

tubby

David Boddie wrote:

> The pdftotext tool may do what you want:
>
> http://www.foolabs.com/xpdf/download.html
>
> Let us know how you get on with it.
>
> David

Perhaps I'm just using pdftotext wrong? Here's how I was using it:

import os

f = filename  # path of the PDF, set elsewhere

try:
    # "-" tells pdftotext to write the text to stdout
    sout = os.popen('pdftotext "%s" - ' %f)
    data = sout.read().strip()
    print data
    sout.close()

except Exception, e:
    print e

Jan 25 '07 #5

Lee Harr

> Perhaps I'm just using pdftotext wrong? Here's how I was using it:
>
> sout = os.popen('pdftotext "%s" - ' %f)

If you are having trouble with popen (not unlikely)
how about just writing to a temporary file and
reading the text from there?

I've used pdftotext several times in the past few
weeks (but not on Windows). It was a major
time saver for me.

Jan 25 '07 #6
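
The temporary-file suggestion above could look something like the following. This is a rough sketch in Python 3 syntax rather than anything posted in the thread: the function name and the injectable run parameter are illustrative choices, it assumes pdftotext is on the PATH, and it passes -enc UTF-8 so the output encoding is known when the file is read back.

```python
import os
import subprocess
import tempfile


def pdf_to_text(pdf_path, run=subprocess.run):
    """Have pdftotext write to a temporary file, then read the file back.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be tried out without pdftotext installed.
    """
    fd, txt_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # pdftotext opens the path itself
    try:
        # Argument list and no shell: the filename needs no quoting
        # and no % formatting.
        run(["pdftotext", "-enc", "UTF-8", pdf_path, txt_path], check=True)
        with open(txt_path, encoding="utf-8", errors="replace") as handle:
            return handle.read().strip()
    finally:
        os.remove(txt_path)
```

With check=True a non-zero exit status from pdftotext raises subprocess.CalledProcessError instead of silently returning an empty string.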

Dieter Deyke

tubby writes:

> David Boddie wrote:
>
>> The pdftotext tool may do what you want:
>>
>> http://www.foolabs.com/xpdf/download.html
>>
>> Let us know how you get on with it.
>>
>> David
>
> Perhaps I'm just using pdftotext wrong? Here's how I was using it:
>
> f = filename
>
> try:
>     sout = os.popen('pdftotext "%s" - ' %f)
>     data = sout.read().strip()
>     print data
>     sout.close()
>
> except Exception, e:
>     print e

I am using pdftotext on Windows with cygwin on a regular basis without
any problem.

Your program above should read:

sout = os.popen('pdftotext "%s" - ' % (f,))

--
Dieter Deyke

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Jan 25 '07 #7

tubby

Dieter Deyke wrote:

>> sout = os.popen('pdftotext "%s" - ' %f)
>
> Your program above should read:
>
> sout = os.popen('pdftotext "%s" - ' % (f,))

What is the significance of doing it this way?

Jan 29 '07 #8

Steve Holden

tubby wrote:

> Dieter Deyke wrote:
>
>>> sout = os.popen('pdftotext "%s" - ' %f)
>>
>> Your program above should read:
>>
>> sout = os.popen('pdftotext "%s" - ' % (f,))
>
> What is the significance of doing it this way?

It's actually just nit-picking - as long as you know f is never going to
be a tuple then it's perfectly acceptable to use a single value as the
right-hand operand.

Of course, if f ever *is* a tuple (with more than one element) then you
will get an error:

>>> for f in ['string', ('one-element tuple', ), ("two-element", "tuple")]:
...     print 'Nit: pdftotext "%s" - ' % (f,)
...     print 'You: pdftotext "%s" - ' %f
...
Nit: pdftotext "string" -
You: pdftotext "string" -
Nit: pdftotext "('one-element tuple',)" -
You: pdftotext "one-element tuple" -
Nit: pdftotext "('two-element', 'tuple')" -
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
TypeError: not all arguments converted during string formatting
>>>

So there is potentially some value to it. But we often don't bother.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Jan 30 '07 #9
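
The same comparison can be written as a small script. The version below is a sketch in Python 3 syntax, with two made-up helper names, that mirrors the two spellings discussed above so each case can be checked separately.

```python
def format_bare(f):
    # Mirrors the original call: if f is a tuple, its items are used
    # as the arguments for the format string.
    return 'pdftotext "%s" - ' % f


def format_wrapped(f):
    # Mirrors the suggested fix: f is always the single value for %s.
    return 'pdftotext "%s" - ' % (f,)


if __name__ == "__main__":
    for f in ["string", ("one-element tuple",), ("two-element", "tuple")]:
        print("wrapped:", format_wrapped(f))
        try:
            print("bare:   ", format_bare(f))
        except TypeError as exc:
            print("bare:    TypeError:", exc)
```

For a plain string both spellings agree; the difference only shows up when the value happens to be a tuple.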
