
Analyse of PDF (or EPS?)

Johan Holst Nielsen
#1: Jul 18 '05
Hi,

Are there any Python packages to analyse or get some information out of
a PDF document...

Like where the text is placed - what text is placed - fonts, embedded
PDFs/fonts/images etc.

Please let me know :)

Regards,
Johan

Peter Hansen
#2: Jul 18 '05

re: Analyse of PDF (or EPS?)


Johan Holst Nielsen wrote:
>
> Are there any Python packages to analyse or get some information out of
> a PDF document...
>
> Like where the text is placed - what text is placed - fonts, embedded
> PDFs/fonts/images etc.
>
> Please let me know :)

I believe the not-for-free version of ReportLab has this sort of capability,
at least in some sense.

-Peter
Johan Holst Nielsen
#3: Jul 18 '05

re: Analyse of PDF (or EPS?)


Peter Hansen wrote:
> Johan Holst Nielsen wrote:
>
>>Are there any Python packages to analyse or get some information out of
>>a PDF document...
>>
>>Like where the text is placed - what text is placed - fonts, embedded
>>PDFs/fonts/images etc.
>>
>
> I believe the not-for-free version of ReportLab has this sort of capability,
> at least in some sense.

Aah, you are thinking of the product "PageCatcher", right? :)

I haven't seen it yet :) I will contact ReportLab for further details,
thanks :)

Please let me know if others know of any alternatives ;) (in case I
cannot use ReportLab's version)

Regards,
Johan

Johan Holst Nielsen
#4: Jul 18 '05

re: Analyse of PDF (or EPS?)


Johan Holst Nielsen wrote:
> Peter Hansen wrote:
>
>> Johan Holst Nielsen wrote:
>>
>>> Are there any Python packages to analyse or get some information out of
>>> a PDF document...
>>>
>>> Like where the text is placed - what text is placed - fonts, embedded
>>> PDFs/fonts/images etc.
>>>
>>
>> I believe the not-for-free version of ReportLab has this sort of capability,
>> at least in some sense.
>
>
> Aah, you are thinking of the product "PageCatcher", right? :)

Just found the pricing :( I think USD 25,000 is way out of my budget :(
I hope someone has some alternatives :)

Regards,
Johan

Johan Holst Nielsen
#5: Jul 18 '05

re: Analyse of PDF (or EPS?)


Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
>
>> Hi,
>>
>> Are there any Python packages to analyse or get some information out of
>> a PDF document...
>>
>> Like where the text is placed - what text is placed - fonts,
>> embedded PDFs/fonts/images etc.
>>
>> Please let me know :)
>>
>> Regards,
>> Johan
>>
>
> http://www.trisoft.com.pl/~mak/wxpdf.zip
>
> My first attempt to decode PDFs with SWIG-ged xpdf; it requires the sources
> of Python and wxPython - binaries for python22 (Windows) are included.

Hmmm
http://www.trisoft.com.pl/~mak/wxpdf.zip
Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.
:( Did I get the wrong URL? :(

Regards,
Johan

Johan Holst Nielsen
#6: Jul 18 '05

re: Analyse of PDF (or EPS?)


David Boddie wrote:
>>>Are there any Python packages to analyse or get some information out of
>>>a PDF document...
>>>
>>>Like where the text is placed - what text is placed - fonts, embedded
>>>PDFs/fonts/images etc.
>
> It depends on the type of images (bitmap vs. vector).

Yes, I know - but vector-based images should be extracted just as they
are, and bitmaps as self-contained files :=)

>
>>IIRC you can get the full specs of PDF and EPS at the Adobe site.
>
> The full PDF specification is not exactly short, but it's fairly readable.

Yep... I tried it... but there is no reason to do exactly the same work if
other people have already done it. And time is an issue too ;)

>
>>Some stuff is easy to get at, some may be compressed and/or encrypted,
>>and not so easy.
>
> Although the FlateDecode compression format is straightforward to handle with
> existing libraries, some of the other compression techniques may be less accessible.

Well, no problem with the compression/encryption. It is for an internal
application - so people will just HAVE to leave the documents unencrypted and unsecured.
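
For what it is worth, the FlateDecode case mentioned in the quote above can be tried with nothing but Python's standard zlib module. This is only a minimal sketch, assuming the raw stream bytes have already been cut out of the file (the data between the stream and endstream keywords) and that no /DecodeParms predictor is in use. The helper name inflate_stream and the sample content are made up for illustration and do not come from any library in this thread.

```python
import zlib


def inflate_stream(raw):
    """Decompress the raw bytes of a PDF stream that uses /FlateDecode.

    Assumes no /DecodeParms predictor has been applied to the data.
    """
    return zlib.decompress(raw)


# Stand-in for stream bytes taken from a real file: compress a tiny
# content stream ourselves, then check that it round-trips.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
raw = zlib.compress(content)
restored = inflate_stream(raw)
print(restored.decode("ascii"))
```

The other filters a PDF may use (LZW, CCITT fax, JBIG2 and so on) are not covered by zlib, which is probably what "less accessible" refers to.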
>
>>Conforming docs are supposed to be structured so that it is relatively easy
>>to grab chunks of document and do the kinds of things printing business s/w does,
>>like rotating and scaling and reordering pages, etc.
>
> I have a Python library which is able to identify a lot of the structure in simple
> documents, including basic text extraction, but I've become pretty disillusioned
> with it because so much work is required to extract more complex information.
>
> Maybe it's time to stick a license on it and upload it somewhere.

Well, let me know ;) Maybe I could get a demo or something? That would
be nice :)

Regards,
Johan

Johan Holst Nielsen
#7: Jul 18 '05

re: Analyse of PDF (or EPS?)


Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
>> Are there any Python packages to analyse or get some information out of
>> a PDF document...
>>
>> Like where the text is placed - what text is placed - fonts,
>> embedded PDFs/fonts/images etc.
>>
>> Please let me know :)
>
> http://www.trisoft.com.pl/~mak/wxpdf.zip
>
> My first attempt to decode PDFs with SWIG-ged xpdf; it requires the sources
> of Python and wxPython - binaries for python22 (Windows) are included.

Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.

:( Can you please try to upload it again?

Regards,
Johan

Grzegorz Makarewicz
#8: Jul 18 '05

re: Analyse of PDF (or EPS?)


Johan Holst Nielsen wrote:
[...]
> Not Found
> The requested URL /~mak/wxpdf.zip was not found on this server.
>
> Can you please try to upload it again?
>
> Johan
>

Sorry for the missing link, this one works:

http://www.trisoft.com.pl/mak/wxpdf.zip

Regards,
Grzegorz Makarewicz


Johan Holst Nielsen
#9: Jul 18 '05

re: Analyse of PDF (or EPS?)


Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
> [...]
>
> > Not Found
> > The requested URL /~mak/wxpdf.zip was not found on this server.
> >
> > Can you please try to upload it again?
> >
> > Johan
> >
>
> Sorry for the missing link, this one works:
>
> http://www.trisoft.com.pl/mak/wxpdf.zip

Thanks Grzegorz, I will look at it next week. If you want a reply
about whether I can use it, please send me a message at tcr480 ( a t )
yahoo.dk


Regards,
Johan

David Boddie
#10: Jul 18 '05

re: Analyse of PDF (or EPS?)


Johan Holst Nielsen <johan@weknowthewayout.com> wrote in message news:<3fbe00e8$0$95070$edfadb0f@dread11.news.tele.dk>...
> David Boddie wrote:
>
> > The full PDF specification is not exactly short, but it's fairly readable.
>
> Yep... I tried it... but there is no reason to do exactly the same work if
> other people have already done it. And time is an issue too ;)

Time is always an issue. How much of it do you have? ;-)

> > I have a Python library which is able to identify a lot of the structure in simple
> > documents, including basic text extraction, but I've become pretty disillusioned
> > with it because so much work is required to extract more complex information.
> >
> > Maybe it's time to stick a license on it and upload it somewhere.
>
> Well, let me know ;) Maybe I could get a demo or something? That would
> be nice :)

You may be disappointed, but here it is:

http://www.boddie.org.uk/david/Proje...thon/pdftools/

The core of the library was written in a hurry over two years ago; later refinements
make it only slightly more robust. It was never really intended for anything other
than exploring the structure of PDF files.

Basic use:

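import pdftools

# Path to the document to inspect
path = "MyFile.pdf"
doc = pdftools.PDFdocument(path)

# print(...) with a single argument behaves the same in Python 2 and 3
print("Document uses PDF format version " + str(doc.document_version()))

pages = doc.count_pages()
print("Document contains %i pages." % pages)

# Only read page 123 if the document is long enough to have one
if pages > 123:

    page123 = doc.read_page(123)
    contents123 = page123.read_contents()

    print("The objects found in this page:")
    print("")
    print(contents123.contents)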

I've not really dealt with the coordinate system very well. Ideally, it would be
trivial to extract all the device-independent positioning information but,
whenever I start to look at this, I get distracted. :-)
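
On the coordinate question, the PDF specification describes each transformation as six numbers [a b c d e f], and the cm operator premultiplies the current transformation matrix by its operand. The following is only a rough sketch of that arithmetic, assuming the matrices have already been parsed out of the content stream. The helpers concat and apply_matrix are hypothetical names for illustration and are not part of pdftools.

```python
def concat(m1, m2):
    """Return the matrix that applies m1 first, then m2.

    Each matrix is a 6-tuple (a, b, c, d, e, f) in PDF order.
    """
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply_matrix(m, x, y):
    """Map the point (x, y) through matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


IDENTITY = (1, 0, 0, 1, 0, 0)

# Simulate two "cm" operators seen in sequence in a content stream.
ctm = IDENTITY
ctm = concat((2, 0, 0, 2, 0, 0), ctm)    # "2 0 0 2 0 0 cm": scale by 2
ctm = concat((1, 0, 0, 1, 10, 20), ctm)  # "1 0 0 1 10 20 cm": translate

print(apply_matrix(ctm, 1, 1))  # (22, 42)
```

Text positioning adds its own text matrix on top of this, so the sketch covers only the graphics-state part of the problem.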

Have fun, and don't expect too much,

David