
Read and extract text from pdf

Hi,
I have a problem: I just want to extract text from a PDF file with
Python. There are several libraries for that, but they don't work reliably.

I tried pyPdf and pdfTools. I don't know why, but they don't work with
some PDFs. For example, space characters are deleted from the text.
PDF Playground: I don't understand how it works.

If you have an idea, a tutorial, a library, or anything else that can
help me do that, please let me know.

Apr 21 '06 #1
Julien ARNOUX:
I have a problem: I just want to extract text from a PDF file with
Python. There are several libraries for that, but they don't work...

pyPdf and pdfTools: I don't know why, but they don't work with some
PDFs...


Text can be represented in different ways in PDF: as tagged text, bitmap
and vector images, and even algorithms (IIRC). Most tools will only be
able to retrieve text represented as tagged text. So some tools may work
on some texts in some files and fail on others.

--
René Pijlman

What do you want to learn? http://www.leren.nl
Apr 21 '06 #2
You can use Ghostscript for that purpose. Look at the ps2ascii script (or
batch file) in the Ghostscript distribution. You can either call
Ghostscript from the command line or use its DLL (I don't know whether a
Python binding already exists...). The limitations the previous poster
mentioned, however, still apply.
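
As a rough illustration of calling Ghostscript from the command line in Python, here is a minimal sketch. It does not use the ps2ascii script mentioned above; instead it assumes a reasonably recent Ghostscript that provides the txtwrite output device and that the executable is named gs (on Windows it is typically gswin64c or similar). The function names are made up for this example.

```python
import shutil
import subprocess


def ghostscript_text_command(pdf_path, gs_binary="gs"):
    """Build a Ghostscript command line that writes a PDF's text to stdout."""
    return [
        gs_binary,
        "-q",                 # suppress the startup banner
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit once the input has been processed
        "-sDEVICE=txtwrite",  # text extraction device (assumed to be available)
        "-sOutputFile=-",     # send the extracted text to stdout
        pdf_path,
    ]


def extract_text_with_ghostscript(pdf_path, gs_binary="gs"):
    """Run Ghostscript and return the extracted text as a string."""
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_binary}")
    result = subprocess.run(
        ghostscript_text_command(pdf_path, gs_binary),
        capture_output=True,
        check=True,  # raise CalledProcessError if Ghostscript reports a failure
    )
    # The output encoding is assumed to be UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text_with_ghostscript("some_file.pdf") would return whatever text Ghostscript can recover. A PDF that stores its text as images will still come back empty or nearly so.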

Avishay

Apr 21 '06 #3
Jim
There is a pdftotext executable, at least on Linux.
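
For anyone who wants to drive pdftotext from Python, a minimal sketch follows. It assumes the pdftotext executable from Xpdf or Poppler is on the PATH; the function names are invented for this example. Passing "-" as the output file makes pdftotext write to stdout, and the -layout option asks it to preserve the physical layout, which may help when spaces go missing, though that is not guaranteed.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=False, binary="pdftotext"):
    """Build a pdftotext command line that writes the text to stdout."""
    cmd = [binary]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the original physical layout
    # Request UTF-8 output explicitly; "-" as the output file means stdout.
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    return cmd


def extract_text_with_pdftotext(pdf_path, keep_layout=False, binary="pdftotext"):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"pdftotext executable not found: {binary}")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout, binary),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For example, extract_text_with_pdftotext("some_file.pdf", keep_layout=True) would return the text of the whole document in one string.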

Apr 21 '06 #4
