Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.
Thanks, rbt
rbt wrote: Not really a Python question... but here goes: Is there a way to read the content of a PDF file and decode it with Python? I'd like to read PDF's, decode them, and then search the data for certain strings.
There is a commercial tool, pdflib, available that might help. It has a free
evaluation version and Python bindings.
If it's only about text, maybe pdf2text helps.
--
Regards,
Diez B. Roggisch
Aloha,
rbt wrote: Not really a Python question... but here goes: Is there a way to read the content of a PDF file and decode it with Python? I'd like to read PDF's, decode them, and then search the data for certain strings.
First of all, http://groups.google.de/groups?selm=...&output=gplain
still applies here.
If you can deal with a very basic implementation of a pdf-lib you
might be interested in http://sourceforge.net/projects/pdfplayground
In the CVS (or the current snapshot) you can find in
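ppg/Doc/text_extract.txt an example for text extraction.
    import pdffile
    import pages
    # not in the original snippet; needed for pdftool.parse_content below
    import pdftool
    import zlib
    pf = pdffile.pdffile('../pdf-testset1/a.pdf')
    pp = pages.pages(pf)
    # decompress the content stream of the first page
    c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
    op = pdftool.parse_content(c)
    # keep only the operands of the text-showing operators ' and Tj
    sop = [x[1] for x in op if x[0] in ["'", "Tj"]]
    for a in sop:
        print a[0]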
Wishing a happy day
LOBI
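For readers without pdfplayground at hand, the idea behind that example (decompress a page's content stream, then collect the operands of the text-showing operators `Tj` and `'`) can be sketched with the standard library alone. This is a simplified illustration in Python 3, not the pdfplayground parser: the names `SHOW_TEXT`, `shown_strings` and `strings_from_flate` are made up here, and the regex only handles plain literal strings. It ignores `TJ` arrays, hex strings, nested parentheses and font encodings.

```python
import re
import zlib

# A literal string "(...)" followed by one of the text-showing operators
# Tj or '. Backslash escapes such as \( and \) may appear inside the string.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*(?:Tj|')")


def shown_strings(content_stream):
    """Return the raw string operands of Tj and ' in a decompressed stream."""
    return [match.group(1) for match in SHOW_TEXT.finditer(content_stream)]


def strings_from_flate(compressed_stream):
    """Decompress a Flate-encoded content stream, then extract its strings."""
    return shown_strings(zlib.decompress(compressed_stream))
```

The results are raw bytes, escapes included. Turning them into readable text still depends on how the PDF's fonts encode their characters.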
Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?
Aloha,
rbt wrote: Thanks guys... what if I convert it to PS via printing it to a file or something? Would that make it easier to work with?
Not really...
The classical PS drivers (e.g. Acroread4-Unix print -> ps) simply
define the PDF graphics and text operators as PS commands and
copy the PDF content directly.
Wishing a happy day
LOBI
I downloaded ghostscript for Win32 and added it to my PATH
(C:\gs\gs8.15\lib AND C:\gs\gs8.15\bin). I found that ps2ascii works
well on PDF files and it's entirely free.
Usage:
ps2ascii PDF_file.pdf > ASCII_file.txt
However, bundling a 9+ MB package with a 5K script and convincing users
to install it is another matter altogether.
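Driving ps2ascii from a script instead of the shell could look roughly like the sketch below. It is written for Python 3.7+ and assumes that `ps2ascii` is on PATH and writes its text to stdout, as the redirection in the usage line implies. The function names are illustrative only, and on Windows, where Ghostscript ships ps2ascii as a batch file, the command name may need adjusting.

```python
import subprocess


def ps2ascii_command(pdf_path):
    """Build the argument list for Ghostscript's ps2ascii script."""
    # With a single input file, ps2ascii sends the extracted text to stdout,
    # which is what "ps2ascii PDF_file.pdf > ASCII_file.txt" relies on.
    return ["ps2ascii", pdf_path]


def pdf_to_text(pdf_path):
    """Run ps2ascii and return its output as a string (raises on failure)."""
    result = subprocess.run(
        ps2ascii_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def lines_containing(text, needle):
    """Return (line_number, line) pairs for the lines that contain needle."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line
    ]
```

A call such as `lines_containing(pdf_to_text("PDF_file.pdf"), "some string")` would then list the matching lines without creating an intermediate text file.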
I tried that for something not python related and I was getting
sporadic spaces everywhere.
I am assuming this is not the case in your experience?
On Tue, 22 Feb 2005 10:45:09 -0500, rbt <rb*@athop1.ath.vt.edu> wrote:
I downloaded ghostscript for Win32 and added it to my PATH (C:\gs\gs8.15\lib AND C:\gs\gs8.15\bin). I found that ps2ascii works well on PDF files and it's entirely free.
Usage:
ps2ascii PDF_file.pdf > ASCII_file.txt
However, bundling a 9+ MB package with a 5K script and convincing users to install it is another matter altogether.
--
Thomas G. Willis http://paperbackmusic.net
Tom Willis wrote: I tried that for something not python related and I was getting sporadic spaces everywhere.
I am assuming this is not the case in your experience?
For my purpose, it works fine. I'm searching for certain strings that
might be in the document... all I need is a readable file. Layout, fonts
and/or presentation is unimportant to me.
rbt said the following on 2/22/2005 8:53 AM: Not really a Python question... but here goes: Is there a way to read the content of a PDF file and decode it with Python? I'd like to read PDF's, decode them, and then search the data for certain strings.
Thanks, rbt
Hi,
Try pdftotext, which is part of the Xpdf project. pdftotext extracts
textual information from a PDF file to an output text file of your
choice. I have used it in the past (not with Python) to do what you are
attempting. It is a small program, and you can invoke it from Python and
search for the string/pattern you want.
You can download for your OS from: http://www.foolabs.com/xpdf/download.html
Thanks,
-Kartic
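Invoking pdftotext from Python and searching its output might be sketched as follows (Python 3.7+). The sketch assumes the `pdftotext` binary from Xpdf is on PATH. It uses two of its command-line conventions: a lone `-` as the output name sends the text to stdout, and `-layout` keeps the physical page layout. The helper names `pdftotext_command` and `pdf_matches` are invented for this example.

```python
import re
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build a pdftotext command line that writes the text to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        command.append("-layout")
    # A lone "-" as the output file name sends the text to stdout.
    return command + [pdf_path, "-"]


def pdf_matches(pdf_path, pattern):
    """Return every substring of the PDF's text that matches the regex."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return [match.group(0) for match in re.finditer(pattern, result.stdout)]
```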
Well, sporadic spaces in strings would cause problems, would they not?
An example:
The string: "Patient Face Sheet"--->pdftotext--->"P a tie n t Face Sheet"
I'm just curious if you see anything like that, since I really have no
clue about PS or PDF etc., but I have a strong desire to replace a
really flaky commercial tool. And if I can do it with free stuff, all
the better; my boss will love me.
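If the extracted text does come back with stray spaces like that, one possible workaround is to search with a pattern that tolerates whitespace between any two characters of the phrase. The sketch below (Python 3) is only one way to do it, and the names `spaced_pattern` and `contains_phrase` are made up for the illustration.

```python
import re


def spaced_pattern(phrase):
    """Compile a regex that matches phrase even with stray whitespace inside."""
    # Drop the phrase's own whitespace, then allow any amount of whitespace
    # between each pair of the remaining characters.
    chars = [re.escape(ch) for ch in phrase if not ch.isspace()]
    return re.compile(r"\s*".join(chars))


def contains_phrase(text, phrase):
    """True if phrase occurs in text, however its whitespace is distributed."""
    return spaced_pattern(phrase).search(text) is not None
```

The trade-off is that word boundaries are ignored entirely, so "PatientFaceSheet" matches as well. That may or may not be acceptable depending on the documents.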
On Tue, 22 Feb 2005 11:31:16 -0500, rbt <rb*@athop1.ath.vt.edu> wrote:
For my purpose, it works fine. I'm searching for certain strings that might be in the document... all I need is a readable file. Layout, fonts and/or presentation is unimportant to me.
Tom Willis wrote: Well sporadic spaces in strings would cause problems would it not?
an example....
The String: "Patient Face Sheet"--->pdftotext--->"P a tie n t Face Sheet"
I'm just curious if you see anything like that, since I really have no clue about ps or pdf etc...but I have a strong desire to replace a really flaky commercial tool. And if I can do it with free stuff, all the better my boss will love me.
No, I do not see that type of behavior. I'm looking for strings that
resemble SS numbers. So my strings look like this: nnn-nn-nnnn.
The ps2ascii util in ghostscript reproduces strings in the format that I
expect. BTW, I'm not using pdftotext. I'm using *ps2ascii*.
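A search for strings of the form nnn-nn-nnnn in the extracted text could be sketched like this (Python 3). The pattern and the name `find_ssn_like` are illustrative assumptions. It checks only the shape of the string, not whether the number is valid.

```python
import re

# nnn-nn-nnnn, not embedded in a longer run of digits.
SSN_LIKE = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def find_ssn_like(text):
    """Return every substring of text shaped like nnn-nn-nnnn."""
    return SSN_LIKE.findall(text)
```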
Ah, that makes sense. I only see the behavior in pdftotext. ps2ascii
doesn't give me the layout, which, for my purposes, I certainly need.
Thanks for the info. Looks like I'll keep searching for that silver bullet. :(
On Tue, 22 Feb 2005 20:07:50 -0500, rbt <rb*@athop1.ath.vt.edu> wrote:
No, I do not see that type of behavior. I'm looking for strings that resemble SS numbers. So my strings look like this: nnn-nn-nnnn.
The ps2ascii util in ghostscript reproduces strings in the format that I expect. BTW, I'm not using pdftotext. I'm using *ps2ascii*.
rbt <rb*@athop1.ath.vt.edu> wrote in message news:<cv**********@solaris.cc.vt.edu>... Not really a Python question... but here goes: Is there a way to read the content of a PDF file and decode it with Python? I'd like to read PDF's, decode them, and then search the data for certain strings.
I've had success with both:
<http://www.boddie.org.uk/david/Projects/Python/pdftools/>
<http://www.adaptive-enterprises.com.au/~d/software/pdffile/pdffile.py>
although my preference is for the latter as it transparently handles
decryption. (I've previously posted an enhancement to the `pdftools`
utility that adds decryption handling to it, but now use the `pdffile`
library as it handles it better.)
The ease of text extraction depends a lot on how the PDFs have been
created.
--Phil.