Hi,
Are there any Python packages to analyse or get some information out of
a PDF document?
Like where the text is placed, what text is placed, fonts, embedded
PDFs/fonts/images, etc.
Please let me know :)
Regards,
Johan
Johan Holst Nielsen wrote:
> Are there any Python packages to analyse or get some information out of a PDF document?
> Like where the text is placed, what text is placed, fonts, embedded PDFs/fonts/images, etc.
> Please let me know :)
I believe the not-for-free version of ReportLab has this sort of capability,
at least in some sense.
-Peter
Peter Hansen wrote:
> Johan Holst Nielsen wrote:
>> Are there any Python packages to analyse or get some information out of a PDF document?
>> Like where the text is placed, what text is placed, fonts, embedded PDFs/fonts/images, etc.
> I believe the not-for-free version of ReportLab has this sort of capability, at least in some sense.
Aah, you are thinking of the product "PageCatcher", right? :)
I haven't seen it yet :) I will contact ReportLab for further details,
thanks :)
Please let me know if anyone knows of any alternatives ;) (in case I
cannot use ReportLab's version)
Regards,
Johan
Johan Holst Nielsen wrote:
> Peter Hansen wrote:
>> Johan Holst Nielsen wrote:
>>> Are there any Python packages to analyse or get some information out of a PDF document?
>>> Like where the text is placed, what text is placed, fonts, embedded PDFs/fonts/images, etc.
>> I believe the not-for-free version of ReportLab has this sort of capability, at least in some sense.
> Aah, you are thinking of the product "PageCatcher", right? :)
Just found the pricing :( I think USD 25,000 is way out of my budget :(
I hope someone has some alternatives :)
Regards,
Johan
Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
>> Hi,
>> Are there any Python packages to analyse or get some information out of a PDF document?
>> Like where the text is placed, what text is placed, fonts, embedded PDFs/fonts/images, etc.
>> Please let me know :)
>> Regards, Johan
> http://www.trisoft.com.pl/~mak/wxpdf.zip
> My first attempt to decode PDFs with SWIG-ged xpdf. It requires the sources of Python and wxPython; binaries for python22 (Windows) are included.
Hmmm, http://www.trisoft.com.pl/~mak/wxpdf.zip gives:
Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.
:( Did I get the wrong URL? :(
Regards,
Johan
David Boddie wrote:
> Are there any Python packages to analyse or get some information out of a PDF document?
> Like where the text is placed, what text is placed, fonts, embedded PDFs/fonts/images, etc.
> It depends on the type of images (bitmap vs. vector).
Yes, I know, but the vector-based images should be extracted just as they
are, and bitmaps as self-contained files :=)
> IIRC you can get the full specs of PDF and EPS at the Adobe site.
> The full PDF specification is not exactly short, but it's fairly readable.
Yep... I tried it... but there is no reason to do exactly the same if
other people have already done that. And time is an issue too ;)
> Some stuff is easy to get at, some may be compressed and/or encrypted, and not so easy.
> Although the FlateDecode compression format is straightforward with existing libraries, some of the other compression techniques may be less accessible.
Well, no problem with the compression/encryption. It is for an internal
application, so people just HAVE to not encrypt or secure the document.
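A rough illustration of the FlateDecode remark: FlateDecode streams are zlib (deflate) data, so Python's standard zlib module can inflate them. This is only a sketch, assuming an unencrypted file whose streams use FlateDecode alone (no predictor, no chain of filters). The helper name inflate_streams is made up for this example, and a real parser would honour each stream's /Length entry instead of scanning for the endstream keyword.

```python
import re
import zlib

# Hypothetical helper, not part of any library mentioned in this thread.
# A stream body sits between the keywords "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_streams(pdf_bytes):
    """Return the inflated body of every stream that is complete zlib data."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        inflater = zlib.decompressobj()
        try:
            data = inflater.decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data (other filter, or encrypted); skip it
        if inflater.eof:  # keep only streams that inflated to the end
            results.append(data)
    return results

# Tiny hand-built object standing in for a real PDF file.
sample = (b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
          + zlib.compress(b"BT (Hello) Tj ET")
          + b"\nendstream\nendobj\n")
print(inflate_streams(sample))
```

Using a decompressobj rather than zlib.decompress leaves the end-of-line bytes before endstream in unused_data, so they do not need to be trimmed first.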
> Conforming docs are supposed to be structured so that it is relatively easy to grab chunks of document and do the kinds of things printing business s/w does, like rotating and scaling and reordering pages, etc.
> I have a Python library which is able to identify a lot of the structure in simple documents, including basic text extraction, but I've become pretty disillusioned with it because so much work is required to extract more complex information.
> Maybe it's time to stick a license on it and upload it somewhere.
Well, let me know ;) Maybe I could get a demo or something? That would
be nice :)
Regards,
Johan
Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
>> Are there any Python packages to analyse or get some information out of a PDF document?
>> Like where the text is placed, what text is placed, fonts, embedded PDFs/fonts/images, etc.
>> Please let me know :)
> http://www.trisoft.com.pl/~mak/wxpdf.zip
> My first attempt to decode PDFs with SWIG-ged xpdf. It requires the sources of Python and wxPython; binaries for python22 (Windows) are included.
Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.
:( Can you please try to upload it again?
Regards,
Johan
Johan Holst Nielsen wrote:
> [...]
> Not Found
> The requested URL /~mak/wxpdf.zip was not found on this server.
> Can you please try to upload it again?
> Johan
Sorry for the missing link, this one works: http://www.trisoft.com.pl/mak/wxpdf.zip
Regards,
Grzegorz Makarewicz
Grzegorz Makarewicz wrote:
> Johan Holst Nielsen wrote:
>> [...]
>> Not Found
>> The requested URL /~mak/wxpdf.zip was not found on this server.
>>
>> Can you please try to upload it again?
>>
>> Johan
> Sorry for the missing link, this one works:
> http://www.trisoft.com.pl/mak/wxpdf.zip
Thanks Grzegorz, I will look at it next week. If you want a reply
about whether I can use it, please send a message to me at tcr480 ( a t )
yahoo.dk
Regards,
Johan
Johan Holst Nielsen <jo***@weknowthewayout.com> wrote in message news:<3f***********************@dread11.news.tele.dk>...
> David Boddie wrote:
>> The full PDF specification is not exactly short, but it's fairly readable.
> Yep... I tried it... but there is no reason to do exactly the same if other people have already done that. And time is an issue too ;)
Time is always an issue. How much of it do you have? ;-)
>> I have a Python library which is able to identify a lot of the structure in simple documents, including basic text extraction, but I've become pretty disillusioned with it because so much work is required to extract more complex information.
>> Maybe it's time to stick a license on it and upload it somewhere.
> Well, let me know ;) Maybe I could get a demo or something? That would be nice :)
You may be disappointed, but here it is: http://www.boddie.org.uk/david/Proje...thon/pdftools/
The core of the library was written in a hurry over two years ago; later refinements
make it only slightly more robust. It was never really intended for anything other
than exploring the structure of PDF files.
Basic use:
import pdftools

# Open the document and report its format version and page count.
file = "MyFile.pdf"
doc = pdftools.PDFdocument(file)
print "Document uses PDF format version", doc.document_version()
pages = doc.count_pages()
print "Document contains %i pages." % pages

# Dump the objects found on page 123, if the document is long enough.
if pages > 123:
    page123 = doc.read_page(123)
    contents123 = page123.read_contents()
    print "The objects found in this page:"
    print
    print contents123.contents
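For comparison, the format version can also be read without any library, because a PDF file starts with a header comment of the form %PDF-1.4. The sketch below is not how pdftools does it (that is not shown in this thread); pdf_version is a made-up name, and it assumes the header appears within the first 1024 bytes. Files written to PDF 1.4 or later may also override the header with a /Version entry in the document catalog, which this sketch ignores.

```python
import re

def pdf_version(header_bytes):
    """Return the version from a '%PDF-x.y' header as a string, or None."""
    # Hypothetical helper: only the start of the file is searched.
    match = re.search(rb"%PDF-(\d+\.\d+)", header_bytes[:1024])
    return match.group(1).decode("ascii") if match else None

# The second line imitates the binary comment many PDF writers emit.
print(pdf_version(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"))
```

With a real file, pass the first kilobyte read in binary mode, for example open("MyFile.pdf", "rb").read(1024).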
I've not really dealt with the coordinate system very well. Ideally, it would be
trivial to extract all the device-independent positioning information but,
whenever I start to look at this, I get distracted. :-)
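To give an idea of what the positioning work involves: in a page's content stream, each cm operator supplies six numbers (a b c d e f) that are concatenated onto the current transformation matrix, and a point (x, y) maps to (a*x + c*y + e, b*x + d*y + f). The sketch below only shows that arithmetic. It is not pdftools code, the names concat and apply are made up, and it ignores text matrices, graphics state save/restore (q/Q) and page rotation.

```python
IDENTITY = (1, 0, 0, 1, 0, 0)

def concat(m, n):
    """Return the matrix product m x n for PDF matrices (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def apply(m, x, y):
    """Map the point (x, y) through the matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

# Two cm operands in the order they would appear in a content stream:
# a translation by (100, 200) followed by a uniform scale of 2.
ctm = IDENTITY
for operand in [(1, 0, 0, 1, 100, 200), (2, 0, 0, 2, 0, 0)]:
    ctm = concat(operand, ctm)  # each new matrix is applied before the old CTM

print(apply(ctm, 10, 10))
```

The point is scaled first and then translated, because the most recent cm operand is the first transformation a coordinate passes through.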
Have fun, and don't expect too much,
David