Fw: PDF library for reading PDF files

Hi!

I am looking for a library in Python that would read PDF files and I could extract information from the PDF with it. I have searched with Google, but only found libraries that can be used to write PDF files.

Any ideas?

Peter
Jul 18 '05 #1
> I am looking for a library in Python that would read PDF files and I
> could extract information from the PDF with it. I have searched with
> google, but only found libraries that can be used to write PDF files.


reportlab has a lib called pagecatcher; it is fully supported with python,
it is not free.

Harald
Jul 18 '05 #2
"Peter Galfi" <ga****@freestart.hu> wrote in message news:<ma**************************************@pyt hon.org>...
> I am looking for a library in Python that would read PDF files and I
> could extract information from the PDF with it. I have searched with
> google, but only found libraries that can be used to write PDF files.
>
> Any ideas?


I quickly searched back through Google, but I knew exactly what I was
looking for: ;-)

http://groups.google.com/groups?selm...ing.google.com

The page referred to is here:

http://www.boddie.org.uk/david/Proje...thon/pdftools/

The module is very much a "work in progress". You can probably get
some text and bitmap images out of a few documents, but that's
probably all you can expect unless you want to improve it (and
submit patches).

Good luck!

David
Jul 18 '05 #3
In article <Xn**********************************@62.153.159.134>,
Harald Massa <cp*********@spamgourmet.com> wrote:
>> I am looking for a library in Python that would read PDF files and I
>> could extract information from the PDF with it. I have searched with
>> google, but only found libraries that can be used to write PDF files.
>
> reportlab has a lib called pagecatcher; it is fully supported with python,
> it is not free.
>
> Harald


ReportLab's libraries are great things--but they do not "extract
information from the PDF" in the sense I believe the original
questioner intended. As Andreas suggested, he's probably best
off using existing stand-alone applications as separate processes,
controlled from Python.
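
A minimal sketch of that approach, for illustration only. It assumes the stand-alone application is `pdftotext` (shipped with Xpdf and Poppler), which nobody in the thread names, and that it is on the PATH. The function name `run_converter` and its `argv_prefix` parameter are hypothetical; any tool that accepts an input path followed by `-` for standard output could be substituted.

```python
import subprocess


def run_converter(pdf_path, argv_prefix=("pdftotext", "-layout")):
    """Run an external PDF-to-text tool and return what it prints.

    Assumption: the tool takes the input file followed by "-" to mean
    "write the text to standard output", as pdftotext does.
    """
    argv = [*argv_prefix, pdf_path, "-"]
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run(argv, capture_output=True, check=True)
    # The output encoding depends on the tool; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `run_converter("report.pdf")` would return the text of a (hypothetical) `report.pdf` as one string, ready for post-processing in Python.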
--

Cameron Laird <cl****@phaseit.net>
Business: http://www.Phaseit.net
Jul 18 '05 #4
Cameron Laird wrote:
In article <Xn**********************************@62.153.159.134>,
Harald Massa <cp*********@spamgourmet.com> wrote:
I am looking for a library in Python that would read PDF files and I
could extract information from the PDF with it. I have searched with
google, but only found libraries that can be used to write PDF files.


reportlab has a lib called pagecatcher; it is fully supported with python,
it is not free.

Harald

ReportLab's libraries are great things--but they do not "extract
information from the PDF" in the sense I believe the original
questioner intended.


No, but ReportLab (the company) has a product separate from reportlab
(the package) called PageCatcher that does exactly what the OP asked
for. It is not open source, however, and costs a chunk of change.
Jul 18 '05 #5
In article <ox*******************@twister.socal.rr.com>,
Robert Kern <rk***@ucsd.edu> wrote:
Cameron Laird wrote:
In article <Xn**********************************@62.153.159.134>,
Harald Massa <cp*********@spamgourmet.com> wrote:
I am looking for a library in Python that would read PDF files and I
could extract information from the PDF with it. I have searched with
google, but only found libraries that can be used to write PDF files.

reportlab has a lib called pagecatcher; it is fully supported with python,
it is not free.

Harald

ReportLab's libraries are great things--but they do not "extract
information from the PDF" in the sense I believe the original
questioner intended.


No, but ReportLab (the company) has a product separate from reportlab
(the package) called PageCatcher that does exactly what the OP asked
for. It is not open source, however, and costs a chunk of change.


Let's take this one step farther. Two posts now have
quite clearly recommended ReportLab's PageCatcher <URL:
http://reportlab.com/docs/pagecatcher-ds.pdf >. I
completely understand and agree that ReportLab supports
a mix of open-source, no-fee, and for-fee products, and
that PageCatcher carries a significant license fee. I
entirely agree that PageCatcher "read[s] PDF files ...
and ... extract[s] information from the PDF with it."

HOWEVER, I suspect that what the original questioner
meant by his words was some sort of PDF-to-text "extraction"
(true?) and, unless PageCatcher has changed a lot
since I got my last copy, PDF-to-text is NOT one of its
functions.
--

Cameron Laird <cl****@phaseit.net>
Business: http://www.Phaseit.net
Jul 18 '05 #6
In article <10*************@corp.supernews.com>, Cameron Laird
<cl****@lairds.com> writes
......
No, but ReportLab (the company) has a product separate from reportlab
(the package) called PageCatcher that does exactly what the OP asked
for. It is not open source, however, and costs a chunk of change.


Let's take this one step farther. Two posts now have
quite clearly recommended ReportLab's PageCatcher <URL:
http://reportlab.com/docs/pagecatcher-ds.pdf >. I
completely understand and agree that ReportLab supports
a mix of open-source, no-fee, and for-fee products, and
that PageCatcher carries a significant license fee. I
entirely agree that PageCatcher "read[s] PDF files ...
and ... extract[s] information from the PDF with it."

HOWEVER, I suspect that what the original questioner
meant by his words was some sort of PDF-to-text "extraction"
(true?) and, unless PageCatcher has changed a lot
since I got my last copy, PDF-to-text is NOT one of its
functions.

I suspect Cameron is right. ReportLab does have a product called
PageCatcher, but its main function is to grab individual pages for
reuse. I believe it could be extended to go deeper and mess about with
text streams, but it certainly doesn't do that now, and doing it properly
would take some effort, as text can be complicated in PDF (or PostScript).
--
Robin Becker
Jul 18 '05 #7
Aloha,
Peter Galfi wrote:
> I am looking for a library in Python that would read PDF files and I
> could extract information from the PDF with it. I have searched with
> google, but only found libraries that can be used to write PDF files.
> Any ideas?


Use file, split, zlib and a broad knowledge of the PDF-spec...

Accessing certain objects in the .pdf is not that complicated if,
for example, you try to read the /Info dictionary. Getting text from
actual page content could be very complicated.
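
The "file, split, zlib" recipe can be sketched roughly like this. It is an illustration and not a PDF parser: it ignores each stream's /Length and /Filter entries, simply tries zlib on everything between `stream` and `endstream`, and skips whatever does not inflate. The function name `inflate_streams` is made up for this example.

```python
import re
import zlib

# Everything between the "stream" keyword (plus its end-of-line) and the
# "endstream" keyword. A real reader would use the /Length entry instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the inflated bodies of all streams that zlib can decompress."""
    bodies = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj() tolerates the end-of-line bytes that trail
            # the compressed data before "endstream".
            bodies.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Uncompressed stream, or one using a filter other than Flate.
            continue
    return bodies
```

For a real file this would be used as `inflate_streams(open(path, "rb").read())`. What comes back is raw content-stream data (operators and operands), not readable text, which is where the "very complicated" part starts.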

Can you explain your 'information' further?

Wishing a happy day
LOBI
Jul 18 '05 #8
Cameron Laird wrote:
In article <ox*******************@twister.socal.rr.com>,
Robert Kern <rk***@ucsd.edu> wrote:
Cameron Laird wrote:
In article <Xn**********************************@62.153.159.134>,
Harald Massa <cp*********@spamgourmet.com> wrote:
>I am looking for a library in Python that would read PDF files and I
>could extract information from the PDF with it. I have searched with
>google, but only found libraries that can be used to write PDF files.

reportlab has a lib called pagecatcher; it is fully supported with python,
it is not free.

Harald
ReportLab's libraries are great things--but they do not "extract
information from the PDF" in the sense I believe the original
questioner intended.


No, but ReportLab (the company) has a product separate from reportlab
(the package) called PageCatcher that does exactly what the OP asked
for. It is not open source, however, and costs a chunk of change.

Let's take this one step farther. Two posts now have
quite clearly recommended ReportLab's PageCatcher <URL:
http://reportlab.com/docs/pagecatcher-ds.pdf >. I
completely understand and agree that ReportLab supports
a mix of open-source, no-fee, and for-fee products, and
that PageCatcher carries a significant license fee. I
entirely agree that PageCatcher "read[s] PDF files ...
and ... extract[s] information from the PDF with it."

HOWEVER, I suspect that what the original questioner
meant by his words was some sort of PDF-to-text "extraction"
(true?) and, unless PageCatcher has changed a lot
since I got my last copy, PDF-to-text is NOT one of its
functions.


Rereading http://www.reportlab.com/PageCatchIntro.html , you're right.
My apologies. I thought you were talking about the open source reportlab
package and not PageCatcher specifically.
Jul 18 '05 #9
Thanks. I am studying the PDF spec; it just does not seem to be that easy,
having to implement all the decompressions, etc. The "information" I am
trying to extract from the PDF file is the text, specifically in a way to
keep the original paragraphs of the text. I have seen so far one shareware
standalone tool that extracts the text (and a lot of other formatting
garbage) into an RTF document keeping the paragraphs as well. I would need
only the text.

Any suggestions?

Peter

----- Original Message -----
From: "Andreas Lobinger" <an**************@netsurf.de>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Monday, January 19, 2004 5:02 PM
Subject: Re: Fw: PDF library for reading PDF files
Aloha,
Peter Galfi wrote:
I am looking for a library in Python that would read PDF files and I
could extract information from the PDF with it. I have searched with
google, but only found libraries that can be used to write PDF files.
Any ideas?


Use file, split, zlib and a broad knowledge of the PDF-spec...

Accessing certain objects in the .pdf is not that complicated if,
for example, you try to read the /Info dictionary. Getting text from
actual page content could be very complicated.

Can you explain your 'information' further?

Wishing a happy day
LOBI
Jul 18 '05 #10
> Thanks. I am studying the PDF spec, it just does not seem to be that easy
> having to implement all the decompressions, etc. The "information" I am
> trying to extract from the PDF file is the text, specifically in a way to
> keep the original paragraphs of the text. I have seen so far one shareware
> standalone tool that extracts the text (and a lot of other formatting
> garbage) into an RTF document keeping the paragraphs as well. I would need
> only the text.
>
> Any suggestions?


Peter,

Suggestion: extract the document to RTF using that other tool, then use
any one of the few dozen RTF parsers to convert them into plaintext.

- Josiah
Jul 18 '05 #11
Aloha,

Peter Galfi wrote:
> Thanks. I am studying the PDF spec, it just does not seem to be that easy
> having to implement all the decompressions, etc. The "information" I am
> trying to extract from the PDF file is the text, specifically in a way to
> keep the original paragraphs of the text. I have seen so far one shareware
> standalone tool that extracts the text (and a lot of other formatting
> garbage) into an RTF document keeping the paragraphs as well. I would need
> only the text.


As others wrote here, the simplest solution is to use an external
pdf-to-text program and postprocess the data. Read comp.text.pdf.
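
The post-processing step might look something like the sketch below, assuming the external program separates paragraphs with blank lines and hard-wraps the lines inside them (not every tool does). The function name `rebuild_paragraphs` is hypothetical.

```python
import re


def rebuild_paragraphs(raw_text):
    """Turn hard-wrapped converter output into one string per paragraph."""
    paragraphs = []
    # One or more blank lines are treated as a paragraph boundary.
    for block in re.split(r"\n\s*\n", raw_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        text = lines[0]
        for line in lines[1:]:
            if text.endswith("-"):
                # Rejoin a word that was hyphenated across a line break.
                text = text[:-1] + line
            else:
                text += " " + line
        paragraphs.append(text)
    return paragraphs
```

The hyphen rule is a heuristic: it also glues together words whose hyphen is genuine and merely happened to fall at the end of a line, so real documents may need a dictionary check or manual review.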

There is no simple and consistent way to extract text from a .pdf
because there are many ways to set text. The visual appearance
of a paragraph may not be represented by a similar command structure
in the .pdf.
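
To illustrate the point, here is a rough sketch that pulls literal strings out of a decompressed content stream. It assumes only the two common text-showing operators, `Tj` (one string) and `TJ` (an array of strings and spacing adjustments), and it does not decode font encodings, hex strings, or strings containing nested parentheses or `]`. The function name and the sample stream in the usage note are invented for this example.

```python
import re

# A literal string: "(...)" with backslash escapes allowed inside.
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)")
# A text-showing operation: "(string) Tj" or "[ ... ] TJ".
SHOW_RE = re.compile(rb"\((?:\\.|[^\\()])*\)\s*Tj|\[[^\]]*\]\s*TJ")


def shown_strings(content):
    """Return the bytes shown by each Tj/TJ operation, in stream order."""
    chunks = []
    for op in SHOW_RE.finditer(content):
        # A TJ array may split one word into several strings; join them.
        chunks.append(b"".join(STRING_RE.findall(op.group(0))))
    return chunks
```

With the made-up stream `b"BT /F1 12 Tf (Hello) Tj [(Wo) -30 (rld)] TJ ET"`, the word "World" is stored as two fragments with a spacing adjustment between them. Nothing in the stream says where a line or paragraph ends; that has to be inferred from positioning operators, which this sketch ignores.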

Adobe recognized the difficulties for document reuse and introduced
tagged PDF in version 1.4 of the format. With tagged PDF it is possible
to insert structural information in the .pdf. If you are interested in
using this, contact me.

Wishing a happy day
LOBI
Jul 18 '05 #12
In article <40***************@netsurf.de>,
Andreas Lobinger <an**************@netsurf.de> wrote:
Aloha,

Peter Galfi wrote:

Jul 18 '05 #13
On Tue, 20 Jan 2004 08:59:03 +0100, "Peter Galfi" <ga****@freestart.hu>
declaimed the following in comp.lang.python:

> Any suggestions?

Configure a text-only printer with "print to file" capability,
and "print" the PDF file to it. Then read the printout.
--
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG
wu******@dm.net | Bestiaria Support Staff
Home Page: <http://www.dm.net/~wulfraed/>
Overflow Page: <http://wlfraed.home.netcom.com/>

Jul 18 '05 #14
Peter Galfi wrote:
> .... The "information" I am trying to extract from the PDF file is the text,
> specifically in a way to keep the original paragraphs of the text. ....
> Any suggestions?


Ghostscript has an Extract Text capability that I have used
successfully on some pdf files (but not on some others):
http://www.cs.wisc.edu/~ghost/
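
One way to drive Ghostscript from Python, as a sketch only. It assumes a Ghostscript build that includes the `txtwrite` output device and is installed under the name `gs` (Windows builds use a different executable name). Both function names are hypothetical. Because extraction works on some files and not others, the wrapper returns `None` instead of raising when Ghostscript fails or produces no text.

```python
import subprocess


def ghostscript_text_argv(pdf_path, gs="gs"):
    """Build the command line for a text-only Ghostscript run."""
    return [gs, "-q", "-dNOPAUSE", "-dBATCH",
            "-sDEVICE=txtwrite", "-sOutputFile=-", pdf_path]


def ghostscript_text(pdf_path, gs="gs"):
    """Return the extracted text, or None if nothing usable came back."""
    try:
        result = subprocess.run(ghostscript_text_argv(pdf_path, gs),
                                capture_output=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # Ghostscript is missing, or it rejected this particular file.
        return None
    text = result.stdout.decode("utf-8", errors="replace")
    return text if text.strip() else None
```

A caller can then fall back to another tool whenever `ghostscript_text(path)` returns `None`.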

Thanks,
Jeff Sandys
Jul 18 '05 #15