Analysis of PDF (or EPS?)

Hi,

Are there any Python packages to analyse or get some information out of
a PDF document...

Like where the text is placed - what text is placed - fonts, embedded
PDFs/fonts/images etc.

Please let me know :)

Regards,
Johan

Jul 18 '05 #1
Johan Holst Nielsen wrote:
> Are there any Python packages to analyse or get some information out of
> a PDF document...
>
> Like where the text is placed - what text is placed - fonts, embedded
> PDFs/fonts/images etc.
>
> Please let me know :)


I believe the not-for-free version of ReportLab has this sort of capability,
at least in some sense.

-Peter
Jul 18 '05 #2
Peter Hansen wrote:
> Johan Holst Nielsen wrote:
>> Are there any Python packages to analyse or get some information out of
>> a PDF document...
>>
>> Like where the text is placed - what text is placed - fonts, embedded
>> PDFs/fonts/images etc.
>
> I believe the not-for-free version of ReportLab has this sort of capability,
> at least in some sense.


Aah, you are thinking of the product "PageCatcher", right? :)

I haven't seen it yet :) I will contact ReportLab for further details,
thanks :)

Please let me know if others know of any alternatives ;) (in case I
cannot use ReportLab's version)

Regards,
Johan

Jul 18 '05 #3
Johan Holst Nielsen wrote:
> Peter Hansen wrote:
>> Johan Holst Nielsen wrote:
>>> Are there any Python packages to analyse or get some information out of
>>> a PDF document...
>>>
>>> Like where the text is placed - what text is placed - fonts, embedded
>>> PDFs/fonts/images etc.
>>
>> I believe the not-for-free version of ReportLab has this sort of
>> capability, at least in some sense.
>
> Aah, you are thinking of the product "PageCatcher", right? :)


Just found the pricing :( I think USD 25,000 is way out of my budget :(
I hope someone has some alternatives :)

Regards,
Johan

Jul 18 '05 #4
Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
>> Hi,
>>
>> Are there any Python packages to analyse or get some information out of
>> a PDF document...
>>
>> Like where the text is placed - what text is placed - fonts,
>> embedded PDFs/fonts/images etc.
>>
>> Please let me know :)
>>
>> Regards,
>> Johan
>
> http://www.trisoft.com.pl/~mak/wxpdf.zip
>
> My first attempt to decode PDFs with SWIG-ged xpdf; it requires the sources of
> Python and wxPython - binaries for python22 (Windows) are included.


Hmmm
http://www.trisoft.com.pl/~mak/wxpdf.zip
Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.
:( Did I get the wrong URL :(

Regards,
Johan

Jul 18 '05 #5
David Boddie wrote:
Are there any Python packages to analyse or get some information out of
a PDF document...

Like where the text is placed - what text is placed - fonts, embedded
PDFs/fonts/images etc.

It depends on the type of images (bitmap vs. vector).

Yes I know - but the vector based images should be extracted just as they
are - bitmaps as self-contained files :=)

IIRC you can get the full specs of pdf and eps at the adobe site.

The full PDF specification is not exactly short, but it's fairly readable.


Yep... I tried it... but there is no reason to do exactly the same - if
other people have already done that. And time is an issue too ;)

Some stuff is easy to get at, some may be compressed and/or encrypted,
and not so easy.


Although the FlateDecode compression format is straightforward with existing
libraries, some of the other compression techniques may be less accessible.
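
To illustrate the FlateDecode point: a stream with that filter is zlib/deflate data, so Python's standard zlib module can usually unpack it. The sketch below is only an illustration, not code from anyone in this thread. The function name flate_decode and the sample bytes are made up for the example, and it assumes the stream has no predictor set in its decode parameters, which would need an extra un-filtering step.

```python
import zlib


def flate_decode(data: bytes) -> bytes:
    """Decompress the raw bytes of a stream whose filter is FlateDecode.

    Assumes no predictor is in use; predictor handling is not shown here.
    """
    return zlib.decompress(data)


# Stand-in for the bytes found between "stream" and "endstream" in a PDF.
# A real tool would locate those bytes by parsing the file's objects.
original = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(original)

print(flate_decode(compressed))
```

Other filters (LZW, CCITT fax, JBIG2 and so on) have no counterpart in the standard library, which matches the remark that they may be less accessible.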


Well, no problem with the compression/encrypting. It is for an internal
application - so people just HAVE to not encrypt or secure the document.

Conforming docs are supposed to be structured so that it is relatively easy
to grab chunks of document and do the kinds of things printing business s/w does,
like rotating and scaling and reordering pages, etc.


I have a Python library which is able to identify a lot of the structure in simple
documents, including basic text extraction, but I've become pretty disillusioned
with it because so much work is required to extract more complex information.

Maybe it's time to stick a license on it and upload it somewhere.


Well, let me know ;) Maybe I could get a demo or something? That would
be nice :)

Regards,
Johan

Jul 18 '05 #6
Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
>> Are there any Python packages to analyse or get some information out of
>> a PDF document...
>>
>> Like where the text is placed - what text is placed - fonts,
>> embedded PDFs/fonts/images etc.
>>
>> Please let me know :)
>
> http://www.trisoft.com.pl/~mak/wxpdf.zip
>
> My first attempt to decode PDFs with SWIG-ged xpdf; it requires the sources of
> Python and wxPython - binaries for python22 (Windows) are included.


Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.

:( Can you please try to upload it again?

Regards,
Johan

Jul 18 '05 #7
Johan Holst Nielsen wrote:
> [...]
> Not Found
> The requested URL /~mak/wxpdf.zip was not found on this server.
>
> Can you please try to upload it again?
>
> Johan


Sorry for the missing link, this one works:

http://www.trisoft.com.pl/mak/wxpdf.zip

Regards,
Grzegorz Makarewicz
Jul 18 '05 #8
Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
> [...]
>> Not Found
>> The requested URL /~mak/wxpdf.zip was not found on this server.
>>
>> Can you please try to upload it again?
>>
>> Johan
>
> Sorry for the missing link, this one works:
>
> http://www.trisoft.com.pl/mak/wxpdf.zip


Thanks Grzegorz, I will look at it next week. If you want a reply
about whether I can use it, please send a message to me at tcr480 ( a t )
yahoo.dk
Regards,
Johan

Jul 18 '05 #9
Johan Holst Nielsen <jo***@weknowthewayout.com> wrote in message news:<3f***********************@dread11.news.tele.dk>...
> David Boddie wrote:
>> The full PDF specification is not exactly short, but it's fairly readable.
>
> Yep... I tried it... but there is no reason to do exactly the same - if
> other people have already done that. And time is an issue too ;)

Time is always an issue. How much of it do you have? ;-)

>> I have a Python library which is able to identify a lot of the structure in simple
>> documents, including basic text extraction, but I've become pretty disillusioned
>> with it because so much work is required to extract more complex information.
>>
>> Maybe it's time to stick a license on it and upload it somewhere.
>
> Well, let me know ;) Maybe I could get a demo or something? That would
> be nice :)


You may be disappointed, but here it is:

http://www.boddie.org.uk/david/Proje...thon/pdftools/

The core of the library was written in a hurry over two years ago; later refinements
make it only slightly more robust. It was never really intended for anything other
than exploring the structure of PDF files.

Basic use:

import pdftools

# Path to the PDF to inspect.
path = "MyFile.pdf"
doc = pdftools.PDFdocument(path)

print("Document uses PDF format version %s" % doc.document_version())

pages = doc.count_pages()
print("Document contains %i pages." % pages)

# Only look at page 123 if the document has more than 123 pages.
if pages > 123:

    page123 = doc.read_page(123)
    contents123 = page123.read_contents()

    print("The objects found in this page:")
    print("")
    print(contents123.contents)
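
For comparison, the version number alone can be read without any library, because a PDF file starts with a header comment such as %PDF-1.4. This is a rough standard-library sketch and not part of pdftools; the names pdf_header_version and read_pdf_version are invented for the example. It assumes the header sits within the first 1024 bytes, and it ignores the Version entry that the document catalog of newer files may use to override the header.

```python
import re


def pdf_header_version(data: bytes):
    """Return the version from a '%PDF-x.y' header, or None if absent."""
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    if match is None:
        return None
    return match.group(1).decode("ascii")


def read_pdf_version(path):
    """Read only the start of the file and report its header version."""
    with open(path, "rb") as handle:
        return pdf_header_version(handle.read(1024))


# Works on any bytes object, so it can be tried without a real file.
print(pdf_header_version(b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"))
```

Calling read_pdf_version("MyFile.pdf") would do the same for a file on disk.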

I've not really dealt with the coordinate system very well. Ideally, it would be
trivial to extract all the device-independent positioning information but,
whenever I start to look at this, I get distracted. :-)
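
On the coordinate system: positions in a page's content stream are expressed in user space and modified by six-number matrices [a b c d e f], where a point (x, y) maps to (a*x + c*y + e, b*x + d*y + f). The sketch below shows only that arithmetic. It is not taken from pdftools, the function names are invented for the example, and it assumes the default unit of 1/72 inch.

```python
def apply_matrix(matrix, x, y):
    """Map the point (x, y) through a PDF matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


def points_to_inches(value):
    """Convert default user-space units (1/72 inch) to inches."""
    return value / 72.0


# A pure translation: move the origin 72 units right and 720 units up.
translate = (1, 0, 0, 1, 72, 720)
x, y = apply_matrix(translate, 0, 0)

print(x, y)
print(points_to_inches(x), points_to_inches(y))
```

A full extractor would also have to track the stack of nested matrices and the separate text matrix, which is where most of the work lies.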

Have fun, and don't expect too much,

David
Jul 18 '05 #10
