Analysis of PDF (or EPS?)

Hi,

Are there any Python packages to analyse or get information out of
a PDF document?

For example: where the text is placed, what text is placed, fonts, embedded
PDFs/fonts/images, etc.

Please let me know :)

Regards,
Johan

Jul 18 '05 #1
Johan Holst Nielsen wrote:

> Are there any Python packages to analyse or get information out of
> a PDF document?
>
> For example: where the text is placed, what text is placed, fonts, embedded
> PDFs/fonts/images, etc.
>
> Please let me know :)


I believe the not-for-free version of ReportLab has this sort of capability,
at least in some sense.

-Peter
Jul 18 '05 #2
Peter Hansen wrote:
> Johan Holst Nielsen wrote:
>> Are there any Python packages to analyse or get information out of
>> a PDF document?
>>
>> For example: where the text is placed, what text is placed, fonts, embedded
>> PDFs/fonts/images, etc.
>
> I believe the not-for-free version of ReportLab has this sort of capability,
> at least in some sense.


Aah, you are thinking of the product "PageCatcher", right? :)

I haven't seen it yet :) I will contact ReportLab for further details,
thanks :)

Please let me know if others know of any alternatives ;) (in case I
cannot use ReportLab's version)

Regards,
Johan

Jul 18 '05 #3
Johan Holst Nielsen wrote:
> Peter Hansen wrote:
>> Johan Holst Nielsen wrote:
>>> Are there any Python packages to analyse or get information out of
>>> a PDF document?
>>>
>>> For example: where the text is placed, what text is placed, fonts, embedded
>>> PDFs/fonts/images, etc.
>>
>> I believe the not-for-free version of ReportLab has this sort of
>> capability, at least in some sense.
>
> Aah, you are thinking of the product "PageCatcher", right? :)


Just found the pricing :( I think USD 25,000 is way out of my budget :(
I hope someone has some alternatives :)

Regards,
Johan

Jul 18 '05 #4
Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
>> Hi,
>>
>> Are there any Python packages to analyse or get information out of
>> a PDF document?
>>
>> For example: where the text is placed, what text is placed, fonts,
>> embedded PDFs/fonts/images, etc.
>>
>> Please let me know :)
>>
>> Regards,
>> Johan
>
> http://www.trisoft.com.pl/~mak/wxpdf.zip
>
> My first attempt to decode PDFs with SWIG-ged xpdf; it requires the sources of
> Python and wxPython. Binaries for python22 (Windows) are included.


Hmmm
http://www.trisoft.com.pl/~mak/wxpdf.zip
Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.
:( Did I get the wrong URL :(

Regards,
Johan

Jul 18 '05 #5
David Boddie wrote:
>>>> Are there any Python packages to analyse or get information out of
>>>> a PDF document?
>>>>
>>>> For example: where the text is placed, what text is placed, fonts, embedded
>>>> PDFs/fonts/images, etc.
>>>
>>> It depends on the type of images (bitmap vs. vector).
>>
>> Yes, I know - but the vector-based images should be extracted just as they
>> are, and bitmaps as self-contained files :=)
>>
>>> IIRC you can get the full specs of pdf and eps at the adobe site.
>
> The full PDF specification is not exactly short, but it's fairly readable.

Yep... I tried it... but there is no reason to do exactly the same work if
other people have already done it. And time is an issue too ;)

>>> Some stuff is easy to get at, some may be compressed and/or encrypted,
>>> and not so easy.
>
> Although the FlateDecode compression format is straightforward with existing
> libraries, some of the other compression techniques may be less accessible.

Well, no problem with the compression/encryption. It is for an internal
application, so people simply must not encrypt or secure the document.

>>> Conforming docs are supposed to be structured so that it is relatively easy
>>> to grab chunks of document and do the kinds of things printing business s/w does,
>>> like rotating and scaling and reordering pages, etc.
>
> I have a Python library which is able to identify a lot of the structure in simple
> documents, including basic text extraction, but I've become pretty disillusioned
> with it because so much work is required to extract more complex information.
>
> Maybe it's time to stick a license on it and upload it somewhere.

Well, let me know ;) Maybe I could get a demo or something? That would
be nice :)

Regards,
Johan

Jul 18 '05 #6
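
The remark above that FlateDecode is straightforward with existing libraries can be illustrated with Python's standard zlib module, since FlateDecode streams use the zlib/deflate format. This is only a minimal sketch in Python 3: `decode_flate_object` is a hypothetical helper, and it assumes a stream object whose /Length is a direct integer (not an indirect reference), with no /DecodeParms predictor and no encryption.

```python
import re
import zlib


# Hypothetical helper for illustration; not taken from any library in this thread.
def decode_flate_object(obj: bytes) -> bytes:
    """Decompress the stream body of one PDF object that uses /FlateDecode.

    Assumes /Length is a direct integer and that no predictor is applied.
    """
    header, separator, rest = obj.partition(b"stream\n")
    if not separator:
        raise ValueError("no stream keyword found")
    match = re.search(rb"/Length\s+(\d+)", header)
    if match is None:
        raise ValueError("no /Length entry found")
    length = int(match.group(1))
    # FlateDecode data is an ordinary zlib stream.
    return zlib.decompress(rest[:length])


# Build a stand-in object so the example runs without a real PDF file.
content = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET"
compressed = zlib.compress(content)
sample_object = (
    b"<< /Length " + str(len(compressed)).encode("ascii")
    + b" /Filter /FlateDecode >>\nstream\n"
    + compressed
    + b"\nendstream"
)

print(decode_flate_object(sample_object).decode("ascii"))
```

Other filters mentioned in the PDF specification (for example LZW or the image-specific ones) would need their own decoders and are not covered by this sketch.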
Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
>> Are there any Python packages to analyse or get information out of
>> a PDF document?
>>
>> For example: where the text is placed, what text is placed, fonts,
>> embedded PDFs/fonts/images, etc.
>>
>> Please let me know :)
>
> http://www.trisoft.com.pl/~mak/wxpdf.zip
>
> My first attempt to decode PDFs with SWIG-ged xpdf; it requires the sources of
> Python and wxPython. Binaries for python22 (Windows) are included.


Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.

:( Can you please try to upload it again?

Regards,
Johan

Jul 18 '05 #7
Johan Holst Nielsen wrote:
> [...]
> Not Found
> The requested URL /~mak/wxpdf.zip was not found on this server.
>
> Can you please try to upload it again?
>
> Johan


Sorry for the missing link, this one works:

http://www.trisoft.com.pl/mak/wxpdf.zip

Regards,
Grzegorz Makarewicz
Jul 18 '05 #8
Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
> [...]
>> Not Found
>> The requested URL /~mak/wxpdf.zip was not found on this server.
>>
>> Can you please try to upload it again?
>>
>> Johan
>
> Sorry for the missing link, this one works:
>
> http://www.trisoft.com.pl/mak/wxpdf.zip


Thanks Grzegorz, I will look at it next week. If you want a reply
about whether I can use it, please send me a message at tcr480 ( a t )
yahoo.dk
Regards,
Johan

Jul 18 '05 #9
Johan Holst Nielsen wrote:
> David Boddie wrote:
>
>> The full PDF specification is not exactly short, but it's fairly readable.
>
> Yep... I tried it... but there is no reason to do exactly the same work if
> other people have already done it. And time is an issue too ;)

Time is always an issue. How much of it do you have? ;-)

>> I have a Python library which is able to identify a lot of the structure in simple
>> documents, including basic text extraction, but I've become pretty disillusioned
>> with it because so much work is required to extract more complex information.
>>
>> Maybe it's time to stick a license on it and upload it somewhere.
>
> Well, let me know ;) Maybe I could get a demo or something? That would
> be nice :)


You may be disappointed, but here it is:

http://www.boddie.org.uk/david/Proje...thon/pdftools/

The core of the library was written in a hurry over two years ago; later refinements
make it only slightly more robust. It was never really intended for anything other
than exploring the structure of PDF files.

Basic use:

import pdftools

file = "MyFile.pdf"
doc = pdftools.PDFdocument(file)

print "Document uses PDF format version", doc.document_version()

pages = doc.count_pages()
print "Document contains %i pages." % pages

if pages > 123:

    page123 = doc.read_page(123)
    contents123 = page123.read_contents()

    print "The objects found in this page:"
    print
    print contents123.contents
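
The PDF format records its version in a `%PDF-x.y` header comment near the start of the file, so a rough version check is also possible without any library. The following is a minimal sketch in Python 3 (unlike the Python 2 snippet above); `pdf_header_version` is a hypothetical helper, not part of pdftools, and it ignores the optional /Version entry in the document catalog that can override the header in PDF 1.4 and later.

```python
import re


# Hypothetical helper for illustration; not part of pdftools.
def pdf_header_version(data: bytes) -> str:
    """Return the version string from a PDF header such as b'%PDF-1.4'."""
    # Some files carry stray bytes before the header, so search the first
    # 1024 bytes instead of insisting on offset 0.
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    if match is None:
        raise ValueError("no %PDF-x.y header found")
    return match.group(1).decode("ascii")


# With a real file this would be: open("MyFile.pdf", "rb").read(1024)
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print("Document uses PDF format version", pdf_header_version(sample))
```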

I've not really dealt with the coordinate system very well. Ideally, it would be
trivial to extract all the device-independent positioning information but,
whenever I start to look at this, I get distracted. :-)
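
To make the positioning problem concrete: in the PDF specification, coordinates in a content stream are mapped through a transformation matrix written as six numbers `[a b c d e f]`, and default user space uses units of 1/72 inch. The sketch below (Python 3) shows only that arithmetic. The `Matrix` class and `points_to_inches` are hypothetical names, not part of pdftools, and the sketch does not parse any operators from a real page.

```python
from typing import NamedTuple, Tuple


class Matrix(NamedTuple):
    """PDF transformation matrix [a b c d e f] (hypothetical helper class)."""
    a: float
    b: float
    c: float
    d: float
    e: float
    f: float

    def apply(self, x: float, y: float) -> Tuple[float, float]:
        # x' = a*x + c*y + e ; y' = b*x + d*y + f
        return (self.a * x + self.c * y + self.e,
                self.b * x + self.d * y + self.f)

    def then(self, other: "Matrix") -> "Matrix":
        """Return the matrix that applies self first, then other."""
        return Matrix(
            self.a * other.a + self.b * other.c,
            self.a * other.b + self.b * other.d,
            self.c * other.a + self.d * other.c,
            self.c * other.b + self.d * other.d,
            self.e * other.a + self.f * other.c + other.e,
            self.e * other.b + self.f * other.d + other.f,
        )


def points_to_inches(value: float) -> float:
    """Convert default user space units (1/72 inch) to inches."""
    return value / 72.0


# Scale by 2, then move the origin to (72, 720).
scale = Matrix(2, 0, 0, 2, 0, 0)
translate = Matrix(1, 0, 0, 1, 72, 720)
combined = scale.then(translate)

# A `cm` operator with matrix m would update the current matrix as
# m.then(current_matrix), following the same composition rule.
x, y = combined.apply(10, 5)
print(x, y, points_to_inches(x), points_to_inches(y))
```

Tracking the current matrix through nested graphics states, and the separate text matrix used for text positioning, is the part that takes real work and is deliberately left out here.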

Have fun, and don't expect too much,

David
Jul 18 '05 #10
