searching pdf files for certain info

rbt
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

Thanks, rbt
Jul 18 '05 #1
rbt wrote:
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.


There is a commercial tool, PDFlib, available that might help. It has a free
evaluation version and Python bindings.

If it's only about text, maybe pdf2text helps.
--
Regards,

Diez B. Roggisch
Jul 18 '05 #2
Aloha,

rbt wrote:
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.


First of all,
http://groups.google.de/groups?selm=...&output=gplain
still applies here.

If you can deal with a very basic implementation of a pdf-lib you
might be interested in
http://sourceforge.net/projects/pdfplayground

In the CVS (or the current snapshot) you can find in
ppg/Doc/text_extract.txt an example for text extraction.
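import pdffile
import pages
import pdftool  # assumed: the module providing parse_content(), not imported in the original
import zlib

pf = pdffile.pdffile('../pdf-testset1/a.pdf')
pp = pages.pages(pf)
# Inflate the content stream of the first page.
c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
op = pdftool.parse_content(c)
# Keep the operands of the text-showing operators ' and Tj.
sop = [x[1] for x in op if x[0] in ["'", "Tj"]]
for a in sop:
    print(a[0])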

Wishing a happy day
LOBI
Jul 18 '05 #3
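For readers without pdfplayground at hand, the idea behind the snippet above can be illustrated with the standard library alone. This is a minimal sketch, not the pdfplayground API: it assumes a Flate-compressed content stream whose text is shown as literal strings with the Tj or ' operators. The names TEXT_OP and extract_shown_text, and the sample stream, are made up for illustration. Real PDFs often use other filters, hex strings, TJ arrays and custom font encodings, none of which this handles.

```python
import re
import zlib

# A literal string operand followed by a text-showing operator,
# e.g. "(Hello) Tj" or "(Hello) '". Backslash escapes are tolerated;
# nested unescaped parentheses and hex strings are not handled.
TEXT_OP = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*(?:Tj|')")


def extract_shown_text(compressed_stream):
    """Inflate a Flate-compressed content stream and return its Tj/' operands."""
    content = zlib.decompress(compressed_stream)
    return [m.group(1).decode("latin-1") for m in TEXT_OP.finditer(content)]


# Hypothetical page content, compressed the way a /FlateDecode stream would be.
sample = zlib.compress(b"BT /F1 12 Tf 72 720 Td (Patient Face Sheet) Tj ET")
print(extract_shown_text(sample))  # ['Patient Face Sheet']
```

The operands come back still escaped (for example `\)` stays as two characters), so a search for strings containing parentheses would need an extra unescaping step.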
rbt
Andreas Lobinger wrote:
Aloha,

rbt wrote:
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

First of all,
http://groups.google.de/groups?selm=...&output=gplain

still applies here.

If you can deal with a very basic implementation of a pdf-lib you
might be interested in
http://sourceforge.net/projects/pdfplayground

In the CVS (or the current snapshot) you can find in
ppg/Doc/text_extract.txt an example for text extraction.
>>> import pdffile
>>> import pages
>>> import pdftool  # assumed: the module providing parse_content(), not imported in the original
>>> import zlib
>>> pf = pdffile.pdffile('../pdf-testset1/a.pdf')
>>> pp = pages.pages(pf)
>>> c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
>>> op = pdftool.parse_content(c)
>>> sop = [x[1] for x in op if x[0] in ["'", "Tj"]]
>>> for a in sop:
...     print(a[0])

Wishing a happy day
LOBI


Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?
Jul 18 '05 #4
Aloha,

rbt wrote:
Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?


Not really...
The classic PS drivers (e.g. Acroread 4 on Unix, print -> PS) simply
define the PDF graphics and text operators as PS commands and
copy the PDF content directly.

Wishing a happy day
LOBI
Jul 18 '05 #5
rbt
Andreas Lobinger wrote:
Aloha,

rbt wrote:
Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?

Not really...
The classical PS Drivers (f.e. Acroread4-Unix print-> ps) simply
define the pdf graphics and text operators as PS commands and
copy the pdf content directly.

Wishing a happy day
LOBI


I downloaded ghostscript for Win32 and added it to my PATH
(C:\gs\gs8.15\lib AND C:\gs\gs8.15\bin). I found that ps2ascii works
well on PDF files and it's entirely free.

Usage:

ps2ascii PDF_file.pdf > ASCII_file.txt

However, bundling a 9+ MB package with a 5K script and convincing users
to install it is another matter altogether.
Jul 18 '05 #6
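The ps2ascii command above can also be driven from a script instead of a shell redirect. A minimal sketch, assuming Python 3.7 or later and that ps2ascii is on PATH and writes to stdout when given only an input file (as the redirect in the usage line implies). The helper names pdf_to_text and matching_lines are hypothetical. On Windows the Ghostscript launcher may be a batch file, in which case the converter name may need adjusting.

```python
import subprocess


def pdf_to_text(pdf_path, converter="ps2ascii"):
    """Run the converter on pdf_path and return what it writes to stdout."""
    result = subprocess.run(
        [converter, pdf_path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    # latin-1 maps every byte to a character, so decoding cannot fail.
    return result.stdout.decode("latin-1")


def matching_lines(text, needle):
    """Return (line_number, line) pairs for lines that contain needle."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line
    ]
```

A call such as `matching_lines(pdf_to_text("PDF_file.pdf"), "some string")` would then list the hits without creating an intermediate text file.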
I tried that for something not python related and I was getting
sporadic spaces everywhere.

I am assuming this is not the case in your experience?
On Tue, 22 Feb 2005 10:45:09 -0500, rbt <rb*@athop1.ath.vt.edu> wrote:
Andreas Lobinger wrote:
Aloha,

rbt wrote:
Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?

Not really...
The classical PS Drivers (f.e. Acroread4-Unix print-> ps) simply
define the pdf graphics and text operators as PS commands and
copy the pdf content directly.

Wishing a happy day
LOBI


I downloaded ghostscript for Win32 and added it to my PATH
(C:\gs\gs8.15\lib AND C:\gs\gs8.15\bin). I found that ps2ascii works
well on PDF files and it's entirely free.

Usage:

ps2ascii PDF_file.pdf > ASCII_file.txt

However, bundling a 9+ MB package with a 5K script and convincing users
to install it is another matter altogether.
--
http://mail.python.org/mailman/listinfo/python-list

--
Thomas G. Willis
http://paperbackmusic.net
Jul 18 '05 #7
rbt
Tom Willis wrote:
I tried that for something not python related and I was getting
sporadic spaces everywhere.

I am assuming this is not the case in your experience?
On Tue, 22 Feb 2005 10:45:09 -0500, rbt <rb*@athop1.ath.vt.edu> wrote:
Andreas Lobinger wrote:
Aloha,

rbt wrote:
Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?
Not really...
The classical PS Drivers (f.e. Acroread4-Unix print-> ps) simply
define the pdf graphics and text operators as PS commands and
copy the pdf content directly.

Wishing a happy day
LOBI


I downloaded ghostscript for Win32 and added it to my PATH
(C:\gs\gs8.15\lib AND C:\gs\gs8.15\bin). I found that ps2ascii works
well on PDF files and it's entirely free.

Usage:

ps2ascii PDF_file.pdf > ASCII_file.txt

However, bundling a 9+ MB package with a 5K script and convincing users
to install it is another matter altogether.



For my purpose, it works fine. I'm searching for certain strings that
might be in the document... all I need is a readable file. Layout, fonts
and/or presentation is unimportant to me.
Jul 18 '05 #8
rbt said the following on 2/22/2005 8:53 AM:
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

Thanks, rbt


Hi,

Try pdftotext, which is part of the Xpdf project. pdftotext extracts
textual information from a PDF file to an output text file of your
choice. I have used it in the past (not with Python) to do what you are
attempting. It is a small program, and you can invoke it from Python and
search its output for the string/pattern you want.

You can download it for your OS from:
http://www.foolabs.com/xpdf/download.html

Thanks,
-Kartic
Jul 18 '05 #9
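Kartic's suggestion of invoking pdftotext from Python might look like the following. This is a sketch only, assuming Python 3.7 or later, pdftotext on PATH, and that passing - as the output file sends the text to stdout. The -enc UTF-8 option fixes the output encoding, and -layout asks pdftotext to keep the physical layout as best it can. The names pdftotext_command and search_pdf are hypothetical.

```python
import re
import subprocess


def pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext argument list; '-' as the output file means stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def search_pdf(pdf_path, pattern, layout=False):
    """Extract the text of pdf_path and return re.findall(pattern, text)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    text = result.stdout.decode("utf-8", errors="replace")
    return re.findall(pattern, text)
```

Keeping the command construction separate from the call makes it easy to inspect or log the exact command line before running it on a batch of files.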
Well, sporadic spaces in strings would cause problems, would they not?

An example:
The String: "Patient Face Sheet"--->pdftotext--->"P a tie n t Face Sheet"

I'm just curious if you see anything like that, since I really have no
clue about PS or PDF etc., but I have a strong desire to replace a
really flaky commercial tool. And if I can do it with free stuff, all
the better; my boss will love me.
On Tue, 22 Feb 2005 11:31:16 -0500, rbt <rb*@athop1.ath.vt.edu> wrote:
Tom Willis wrote:
I tried that for something not python related and I was getting
sporadic spaces everywhere.

I am assuming this is not the case in your experience?
On Tue, 22 Feb 2005 10:45:09 -0500, rbt <rb*@athop1.ath.vt.edu> wrote:
Andreas Lobinger wrote:

Aloha,

rbt wrote:
>Thanks guys... what if I convert it to PS via printing it to a file or
>something? Would that make it easier to work with?
Not really...
The classical PS Drivers (f.e. Acroread4-Unix print-> ps) simply
define the pdf graphics and text operators as PS commands and
copy the pdf content directly.

Wishing a happy day
LOBI

I downloaded ghostscript for Win32 and added it to my PATH
(C:\gs\gs8.15\lib AND C:\gs\gs8.15\bin). I found that ps2ascii works
well on PDF files and it's entirely free.

Usage:

ps2ascii PDF_file.pdf > ASCII_file.txt

However, bundling a 9+ MB package with a 5K script and convincing users
to install it is another matter altogether.



For my purpose, it works fine. I'm searching for certain strings that
might be in the document... all I need is a readable file. Layout, fonts
and/or presentation is unimportant to me.

--
Thomas G. Willis
http://paperbackmusic.net
Jul 18 '05 #10
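If the extracted text does contain stray spaces like the "P a tie n t Face Sheet" example above, one possible workaround when only searching (not when layout matters) is to compare with all whitespace removed. A minimal sketch; squeeze and contains_ignoring_spaces are made-up names, and the approach is an assumption about what would help, not something tested against that commercial tool's PDFs.

```python
import re


def squeeze(text):
    """Drop every whitespace character, so 'P a tie n t' becomes 'Patient'."""
    return re.sub(r"\s+", "", text)


def contains_ignoring_spaces(haystack, phrase):
    """Case-insensitive substring test that ignores all whitespace."""
    return squeeze(phrase).lower() in squeeze(haystack).lower()


print(contains_ignoring_spaces("P a tie n t Face Sheet", "Patient Face Sheet"))  # True
```

The trade-off is that word boundaries are lost, so a phrase can match across two unrelated words; it suits distinctive phrases better than short common ones.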
rbt
Tom Willis wrote:
Well sporadic spaces in strings would cause problems would it not?

an example....
The String: "Patient Face Sheet"--->pdftotext--->"P a tie n t Face Sheet"

I'm just curious if you see anything like that, since I really have no
clue about ps or pdf etc...but I have a strong desire to replace a
really flaky commercial tool. And if I can do it with free stuff, all
the better my boss will love me.


No, I do not see that type of behavior. I'm looking for strings that
resemble SS numbers. So my strings look like this: nnn-nn-nnnn.

The ps2ascii util in ghostscript reproduces strings in the format that I
expect. BTW, I'm not using pdftotext. I'm using *ps2ascii*.
Jul 18 '05 #11
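The nnn-nn-nnnn search rbt describes can be expressed as a regular expression over the converted text. A small sketch, assuming the text has already been extracted (for example with ps2ascii); SSN_LIKE and find_ssn_like are hypothetical names, and the pattern only checks the shape of the string, not whether it is a valid number.

```python
import re

# Three digits, two digits, four digits, separated by hyphens.
# The lookarounds reject matches embedded in longer digit/hyphen runs.
SSN_LIKE = re.compile(r"(?<![\d-])\d{3}-\d{2}-\d{4}(?![\d-])")


def find_ssn_like(text):
    """Return every substring of text shaped like nnn-nn-nnnn."""
    return SSN_LIKE.findall(text)


print(find_ssn_like("ID 123-45-6789, phone 555-123-4567"))  # ['123-45-6789']
```

Phone numbers and longer identifiers are skipped because their digit groups do not fit the 3-2-4 shape.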
Ah, that makes sense. I only see the behavior in pdftotext. ps2ascii
doesn't give me the layout, which, for my purposes, I certainly need.

Thanks for the info. Looks like I'll keep searching for that silver bullet. :(
On Tue, 22 Feb 2005 20:07:50 -0500, rbt <rb*@athop1.ath.vt.edu> wrote:
Tom Willis wrote:
Well sporadic spaces in strings would cause problems would it not?

an example....
The String: "Patient Face Sheet"--->pdftotext--->"P a tie n t Face Sheet"

I'm just curious if you see anything like that, since I really have no
clue about ps or pdf etc...but I have a strong desire to replace a
really flaky commercial tool. And if I can do it with free stuff, all
the better my boss will love me.


No, I do not see that type of behavior. I'm looking for strings that
resemble SS numbers. So my strings look like this: nnn-nn-nnnn.

The ps2ascii util in ghostscript reproduces strings in the format that I
expect. BTW, I'm not using pdftotext. I'm using *ps2ascii*.

--
Thomas G. Willis
http://paperbackmusic.net
Jul 18 '05 #12
rbt <rb*@athop1.ath.vt.edu> wrote in message news:<cv**********@solaris.cc.vt.edu>...
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.


I've had success with both:

<http://www.boddie.org.uk/david/Projects/Python/pdftools/>

<http://www.adaptive-enterprises.com.au/~d/software/pdffile/pdffile.py>

although my preference is for the latter as it transparently handles
decryption. (I've previously posted an enhancement to the `pdftools`
utility that adds decryption handling to it, but now use the `pdffile`
library as it handles it better.)

The ease of text extraction depends a lot on how the PDFs have been
created.

--Phil.
Jul 18 '05 #13
