Analyse of PDF (or EPS?)

Johan Holst Nielsen

Hi,

Are there any Python packages to analyse or get some information out of
a PDF document?

For example: where the text is placed, what text is placed, fonts, embedded
PDFs/fonts/images, etc.

Please let me know :)

Regards,
Johan

Jul 18 '05 #1


Peter Hansen

Johan Holst Nielsen wrote:

> Are there any Python packages to analyse or get some information out of
> a PDF document?
>
> For example: where the text is placed, what text is placed, fonts, embedded
> PDFs/fonts/images, etc.
>
> Please let me know :)

I believe the not-for-free version of ReportLab has this sort of capability,
at least in some sense.

-Peter

Jul 18 '05 #2

Johan Holst Nielsen

Peter Hansen wrote:

> Johan Holst Nielsen wrote:
>> Are there any Python packages to analyse or get some information out of
>> a PDF document?
>>
>> For example: where the text is placed, what text is placed, fonts, embedded
>> PDFs/fonts/images, etc.
>
> I believe the not-for-free version of ReportLab has this sort of capability,
> at least in some sense.

Aah, you are thinking of the product "PageCatcher", right? :)

I haven't seen it yet :) I will contact ReportLab for further details,
thanks :)

Please let me know if others know of any alternatives ;) (in case I
cannot use ReportLab's version)

Regards,
Johan

Jul 18 '05 #3

Johan Holst Nielsen

Johan Holst Nielsen wrote:

> Peter Hansen wrote:
>> Johan Holst Nielsen wrote:
>>> Are there any Python packages to analyse or get some information out of
>>> a PDF document?
>>>
>>> For example: where the text is placed, what text is placed, fonts, embedded
>>> PDFs/fonts/images, etc.
>>
>> I believe the not-for-free version of ReportLab has this sort of
>> capability, at least in some sense.
>
> Aah, you are thinking of the product "PageCatcher", right? :)

Just found the pricing :( I think USD 25,000 is way out of my budget :(
I hope someone has some alternatives :)

Regards,
Johan

Jul 18 '05 #4

Johan Holst Nielsen

Grzegorz Makarewicz wrote:

> Johan Holst Nielsen wrote:
>> Hi,
>>
>> Are there any Python packages to analyse or get some information out of
>> a PDF document?
>>
>> For example: where the text is placed, what text is placed, fonts,
>> embedded PDFs/fonts/images, etc.
>>
>> Please let me know :)
>>
>> Regards,
>> Johan
>
> http://www.trisoft.com.pl/~mak/wxpdf.zip
>
> My first attempt to decode PDFs with SWIG-ged xpdf. It requires the sources of
> Python and wxPython; binaries for python22 (Windows) are included.

Hmmm

http://www.trisoft.com.pl/~mak/wxpdf.zip
Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.

:( Did I get the wrong URL? :(

Regards,
Johan

Jul 18 '05 #5

Johan Holst Nielsen

David Boddie wrote:

> Are there any Python packages to analyse or get some information out of
> a PDF document?
>
> For example: where the text is placed, what text is placed, fonts, embedded
> PDFs/fonts/images, etc.
>
> It depends on the type of images (bitmap vs. vector).

Yes, I know, but the vector-based images should be extracted just as they
are, and bitmaps as self-contained files :=)

> IIRC you can get the full specs of pdf and eps at the adobe site.
>
> The full PDF specification is not exactly short, but it's fairly readable.

Yep... I tried it... but there is no reason to do exactly the same if
other people have already done it. And time is an issue too ;)

> Some stuff is easy to get at, some may be compressed and/or encrypted,
> and not so easy.
>
> Although the FlateDecode compression format is straightforward with existing
> libraries, some of the other compression techniques may be less accessible.

Well, no problem with the compression/encryption. It is for an internal
application, so people simply must not encrypt or secure the documents.

> Conforming docs are supposed to be structured so that it is relatively easy
> to grab chunks of document and do the kinds of things printing business s/w does,
> like rotating and scaling and reordering pages, etc.
>
> I have a Python library which is able to identify a lot of the structure in simple
> documents, including basic text extraction, but I've become pretty disillusioned
> with it because so much work is required to extract more complex information.
>
> Maybe it's time to stick a license on it and upload it somewhere.

Well, let me know ;) Maybe I could get a demo or something? That would
be nice :)

Regards,
Johan

Jul 18 '05 #6
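
On the FlateDecode point quoted above: FlateDecode streams use the zlib/deflate format, so Python's standard zlib module should be enough for the simple case. The following is only a minimal sketch under that assumption; the helper name inflate_stream is made up for illustration, and it does not handle streams that also use a predictor (/DecodeParms) or any other filter.

```python
import zlib


def inflate_stream(raw: bytes) -> bytes:
    """Decompress the raw bytes of a PDF stream whose /Filter is /FlateDecode.

    Hypothetical helper: assumes the caller has already located the bytes
    between the 'stream' and 'endstream' keywords and that no predictor
    is in use.
    """
    return zlib.decompress(raw)


# Round-trip demo with a made-up content stream (not taken from a real file).
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(content)
print(inflate_stream(compressed) == content)
```

Finding the stream boundaries and the /Filter entry in the first place is the harder part, and is not shown here.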

Johan Holst Nielsen

Grzegorz Makarewicz wrote:

> Johan Holst Nielsen wrote:
>> Are there any Python packages to analyse or get some information out of
>> a PDF document?
>>
>> For example: where the text is placed, what text is placed, fonts,
>> embedded PDFs/fonts/images, etc.
>>
>> Please let me know :)
>
> http://www.trisoft.com.pl/~mak/wxpdf.zip
>
> My first attempt to decode PDFs with SWIG-ged xpdf. It requires the sources of
> Python and wxPython; binaries for python22 (Windows) are included.

Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.

:( Can you please try to upload it again?

Regards,
Johan

Jul 18 '05 #7

Grzegorz Makarewicz

Johan Holst Nielsen wrote:
> [...]
>
> Not Found
> The requested URL /~mak/wxpdf.zip was not found on this server.
>
> Can you please try to upload it again?
>
> Johan

Sorry for the missing link, this one works:

http://www.trisoft.com.pl/mak/wxpdf.zip

Regards,
Grzegorz Makarewicz

Jul 18 '05 #8

Johan Holst Nielsen

Grzegorz Makarewicz wrote:

> Johan Holst Nielsen wrote:
>> [...]
>> Not Found
>> The requested URL /~mak/wxpdf.zip was not found on this server.
>>
>> Can you please try to upload it again?
>>
>> Johan
>
> Sorry for the missing link, this one works:
>
> http://www.trisoft.com.pl/mak/wxpdf.zip

Thanks Grzegorz, I will look at it next week. If you want a reply
about whether I can use it, please send a message to me at tcr480 ( a t )
yahoo.dk

Regards,
Johan

Jul 18 '05 #9

David Boddie

Johan Holst Nielsen <jo***@weknowthewayout.com> wrote in message news:<3f***********************@dread11.news.tele. dk>...

> David Boddie wrote:
>
>> The full PDF specification is not exactly short, but it's fairly readable.
>
> Yep... I tried it... but there is no reason to do exactly the same if
> other people have already done it. And time is an issue too ;)

Time is always an issue. How much of it do you have? ;-)

>> I have a Python library which is able to identify a lot of the structure in simple
>> documents, including basic text extraction, but I've become pretty disillusioned
>> with it because so much work is required to extract more complex information.
>>
>> Maybe it's time to stick a license on it and upload it somewhere.
>
> Well, let me know ;) Maybe I could get a demo or something? That would
> be nice :)

You may be disappointed, but here it is:

http://www.boddie.org.uk/david/Proje...thon/pdftools/

The core of the library was written in a hurry over two years ago; later refinements
make it only slightly more robust. It was never really intended for anything other
than exploring the structure of PDF files.

Basic use:

import pdftools

# Path to the document to inspect
path = "MyFile.pdf"
doc = pdftools.PDFdocument(path)

print("Document uses PDF format version " + str(doc.document_version()))

pages = doc.count_pages()
print("Document contains %i pages." % pages)

# Only read page 123 if the document is long enough
if pages > 123:

    page123 = doc.read_page(123)
    contents123 = page123.read_contents()

    print("The objects found in this page:")
    print("")
    print(contents123.contents)

I've not really dealt with the coordinate system very well. Ideally, it would be
trivial to extract all the device-independent positioning information but,
whenever I start to look at this, I get distracted. :-)
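
To make the positioning issue a bit more concrete, here is a minimal sketch of the arithmetic involved. It is not part of pdftools, and the names Matrix, concat and apply_matrix are made up for illustration. It assumes the convention in the PDF specification: a transformation is six numbers (a, b, c, d, e, f), points are row vectors, and a matrix given to the cm operator is multiplied in front of the current transformation matrix.

```python
from typing import Tuple

# A PDF transformation matrix written as the six numbers (a, b, c, d, e, f).
Matrix = Tuple[float, float, float, float, float, float]

IDENTITY: Matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m: Matrix, n: Matrix) -> Matrix:
    """Return the product m x n, i.e. apply m first and then n."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply_matrix(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Map the point (x, y) through m: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


# Scale by 2, then translate by (10, 20); the values are arbitrary examples.
scale: Matrix = (2.0, 0.0, 0.0, 2.0, 0.0, 0.0)
translate: Matrix = (1.0, 0.0, 0.0, 1.0, 10.0, 20.0)
ctm = concat(scale, translate)
print(apply_matrix(ctm, 3.0, 4.0))  # (16.0, 28.0)
```

Tracking the text matrix and the graphics state stack on top of this is where most of the work would be.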

Have fun, and don't expect too much,

David

Jul 18 '05 #10
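
For anyone who only needs the most basic structural facts and does not want to depend on a library, the format version and the location of the cross-reference table can be read straight from the file bytes. This is a rough sketch using only the standard library, assuming a conforming file with a "%PDF-x.y" header near the start and a "startxref" keyword near the end. The function names are made up for illustration and are unrelated to pdftools.

```python
import re
from typing import Optional


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version from a '%PDF-x.y' header, or None if none is found."""
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def startxref_offset(data: bytes) -> Optional[int]:
    """Return the byte offset given after the last 'startxref' keyword, if any."""
    matches = re.findall(rb"startxref\s+(\d+)", data[-1024:])
    return int(matches[-1]) if matches else None


# Synthetic bytes shaped like the start and end of a PDF file (not a valid document).
sample = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\nstartxref\n1234\n%%EOF\n"
print(pdf_header_version(sample))
print(startxref_offset(sample))
```

With a real file you would pass in the result of open(path, "rb").read(). Following the offset to the cross-reference table and on to the page tree is the next step, and is left out here.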
