searching pdf files for certain info

rbt

Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

Thanks, rbt

Jul 18 '05 #1


Diez B. Roggisch

rbt wrote:

Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

There is a commercial tool, PDFlib, available that might help. It has a free
evaluation version and Python bindings.

If it's only about text, maybe pdf2text helps.
--
Regards,

Diez B. Roggisch

Jul 18 '05 #2

Andreas Lobinger

Aloha,

rbt wrote:

Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

First of all,
http://groups.google.de/groups?selm=...&output=gplain
still applies here.

If you can deal with a very basic implementation of a pdf-lib you
might be interested in
http://sourceforge.net/projects/pdfplayground

In the CVS (or the current snapshot) you can find in
ppg/Doc/text_extract.txt an example for text extraction.

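import zlib

import pdffile
import pages
import pdftool  # needed for parse_content below

# Open the PDF and build its page list
pf = pdffile.pdffile('../pdf-testset1/a.pdf')
pp = pages.pages(pf)
# Decompress the content stream of the first page
c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
# Parse the stream into (operator, operands) pairs
op = pdftool.parse_content(c)
# Keep the operands of the text-showing operators ' and Tj
sop = [x[1] for x in op if x[0] in ["'", "Tj"]]
for a in sop:
    print(a[0])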

Wishing a happy day
LOBI

Jul 18 '05 #3

rbt

Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?

Jul 18 '05 #4

Andreas Lobinger

Aloha,

rbt wrote:

Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?

Not really...
The classical PS drivers (e.g. Acroread4-Unix print -> ps) simply
define the PDF graphics and text operators as PS commands and
copy the PDF content directly.

Wishing a happy day
LOBI

Jul 18 '05 #5

rbt

Andreas Lobinger wrote:

Aloha,

rbt wrote:
Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?

Not really...
The classical PS drivers (e.g. Acroread4-Unix print -> ps) simply
define the PDF graphics and text operators as PS commands and
copy the PDF content directly.

Wishing a happy day
LOBI

I downloaded ghostscript for Win32 and added it to my PATH
(C:\gs\gs8.15\lib AND C:\gs\gs8.15\bin). I found that ps2ascii works
well on PDF files and it's entirely free.

Usage:

ps2ascii PDF_file.pdf > ASCII_file.txt

However, bundling a 9+ MB package with a 5K script and convincing users
to install it is another matter altogether.
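
For anyone who wants to skip the intermediate text file, here is a minimal
sketch of calling ps2ascii from Python and capturing its output directly. It
assumes ps2ascii is on the PATH (on Win32 it may be a batch file, in which
case the executable name probably needs adjusting) and Python 3.7 or later.
The names ps2ascii_command and pdf_to_text, and the run parameter, are made
up for this illustration.

```python
import subprocess


def ps2ascii_command(pdf_path, executable="ps2ascii"):
    """Build the argument list for the command line shown above."""
    return [executable, pdf_path]


def pdf_to_text(pdf_path, run=subprocess.run):
    """Run ps2ascii and return whatever it writes to standard output."""
    result = run(
        ps2ascii_command(pdf_path),
        capture_output=True,  # collect stdout instead of redirecting to a file
        text=True,            # decode the output bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling pdf_to_text("PDF_file.pdf") should return the same text that the
redirect above would have written to ASCII_file.txt. The run parameter is
only there so the function can be exercised without Ghostscript installed.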

Jul 18 '05 #6

Tom Willis

I tried that for something not Python-related and I was getting
sporadic spaces everywhere.

I am assuming this is not the case in your experience?

--
Thomas G. Willis
http://paperbackmusic.net

Jul 18 '05 #7

rbt

Tom Willis wrote:

I tried that for something not python related and I was getting
sporadic spaces everywhere.

I am assuming this is not the case in your experience?

For my purpose, it works fine. I'm searching for certain strings that
might be in the document... all I need is a readable file. Layout, fonts
and/or presentation is unimportant to me.

Jul 18 '05 #8

Kartic

rbt said the following on 2/22/2005 8:53 AM:

Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

Thanks, rbt

Hi,

Try pdftotext, which is part of the Xpdf project. pdftotext extracts
textual information from a PDF file to an output text file of your
choice. I have used it in the past (not with Python) to do what you are
attempting. It is a small program, and you can invoke it from Python and
search for the string/pattern you want.

You can download for your OS from:
http://www.foolabs.com/xpdf/download.html
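
A rough sketch of the "invoke from Python and search" idea, assuming
pdftotext is on the PATH and Python 3.7 or later. Passing "-" as the output
file name makes pdftotext write to standard output. The function names and
the run parameter are invented for this example.

```python
import re
import subprocess


def pdftotext_command(pdf_path, executable="pdftotext"):
    # A "-" in place of the output file name sends the text to stdout.
    return [executable, pdf_path, "-"]


def search_pdf(pdf_path, pattern, run=subprocess.run):
    """Return the lines of the extracted text that match `pattern`."""
    result = run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    regex = re.compile(pattern)
    return [line for line in result.stdout.splitlines() if regex.search(line)]
```

search_pdf("report.pdf", r"invoice") would then list every extracted line
containing "invoice" (the file name and pattern are placeholders).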

Thanks,
-Kartic

Jul 18 '05 #9

Tom Willis

Well, sporadic spaces in strings would cause problems, would they not?

An example:
The string: "Patient Face Sheet"--->pdftotext--->"P a tie n t Face Sheet"

I'm just curious if you see anything like that, since I really have no
clue about PS or PDF etc., but I have a strong desire to replace a
really flaky commercial tool. And if I can do it with free stuff, all
the better; my boss will love me.
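
One possible workaround when the extracted text has stray spaces like this
(a sketch, not something any of the tools mentioned here provide): search
with a regular expression that allows optional whitespace between every
character of the phrase. The helper name spacing_tolerant_pattern is made up.

```python
import re


def spacing_tolerant_pattern(phrase):
    """Compile a regex matching `phrase` with whitespace allowed anywhere inside it."""
    # Drop the phrase's own spaces, then allow any amount of whitespace
    # (including none) between each pair of remaining characters.
    chars = [re.escape(ch) for ch in phrase if not ch.isspace()]
    return re.compile(r"\s*".join(chars))


pattern = spacing_tolerant_pattern("Patient Face Sheet")
print(bool(pattern.search("P a tie n t Face Sheet")))  # True
```

The price is that word boundaries are ignored, so "PatientFaceSheet" matches
as well; whether that matters depends on the documents.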

--
Thomas G. Willis
http://paperbackmusic.net

Jul 18 '05 #10

rbt

Tom Willis wrote:

Well, sporadic spaces in strings would cause problems, would they not?

An example:
The string: "Patient Face Sheet"--->pdftotext--->"P a tie n t Face Sheet"

I'm just curious if you see anything like that, since I really have no
clue about PS or PDF etc., but I have a strong desire to replace a
really flaky commercial tool. And if I can do it with free stuff, all
the better; my boss will love me.

No, I do not see that type of behavior. I'm looking for strings that
resemble SS numbers. So my strings look like this: nnn-nn-nnnn.

The ps2ascii util in ghostscript reproduces strings in the format that I
expect. BTW, I'm not using pdftotext. I'm using *ps2ascii*.
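
For the nnn-nn-nnnn format, a search over the extracted text could look
roughly like this. It is only a sketch: the name find_ssn_like is
illustrative, and the pattern checks the shape of the string, not whether
the number is a valid SSN.

```python
import re

# Three digits, two digits, four digits, separated by hyphens, and not
# embedded in a longer run of digits on either side.
SSN_LIKE = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def find_ssn_like(text):
    """Return every substring of `text` shaped like nnn-nn-nnnn."""
    return SSN_LIKE.findall(text)


print(find_ssn_like("id 123-45-6789, phone 555-1234"))  # ['123-45-6789']
```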

Jul 18 '05 #11

Tom Willis

Ah, that makes sense. I only see the behavior in pdftotext. ps2ascii
doesn't give me the layout, which for my purposes I certainly need.

Thanks for the info. Looks like I'll keep searching for that silver bullet. :(

--
Thomas G. Willis
http://paperbackmusic.net

Jul 18 '05 #12

Follower

rbt <rb*@athop1.ath.vt.edu> wrote in message news:<cv**********@solaris.cc.vt.edu>...

Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

I've had success with both:

<http://www.boddie.org.uk/david/Projects/Python/pdftools/>

<http://www.adaptive-enterprises.com.au/~d/software/pdffile/pdffile.py>

although my preference is for the latter as it transparently handles
decryption. (I've previously posted an enhancement to the `pdftools`
utility that adds decryption handling to it, but now use the `pdffile`
library as it handles it better.)

The ease of text extraction depends a lot on how the PDFs have been
created.

--Phil.

Jul 18 '05 #13
