Johan Holst Nielsen <jo***@weknowthewayout.com> wrote in message news:<3f***********************@dread11.news.tele.dk>...
David Boddie wrote:
The full PDF specification is not exactly short, but it's fairly readable.
Yep... I tried it... but there is no reason to do exactly the same thing if
other people have already done it. And time is an issue too ;)
Time is always an issue. How much of it do you have? ;-)
I have a Python library which is able to identify a lot of the structure in simple
documents, including basic text extraction, but I've become pretty disillusioned
with it because so much work is required to extract more complex information.
Maybe it's time to stick a license on it and upload it somewhere.
Well, let me know ;) Maybe I could get a demo or something? That would
be nice :)
You may be disappointed, but here it is:
http://www.boddie.org.uk/david/Proje...thon/pdftools/
The core of the library was written in a hurry over two years ago; later refinements
make it only slightly more robust. It was never really intended for anything other
than exploring the structure of PDF files.
Basic use:
import pdftools
file = "MyFile.pdf"
doc = pdftools.PDFdocument(file)
print "Document uses PDF format version", doc.document_version()
pages = doc.count_pages()
print "Document contains %i pages." % pages
if pages > 123:
    page123 = doc.read_page(123)
    contents123 = page123.read_contents()
    print "The objects found in this page:"
    print
    print contents123.contents
I've not really dealt with the coordinate system very well. Ideally, it would be
trivial to extract all the device-independent positioning information but,
whenever I start to look at this, I get distracted. :-)
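For what it's worth, the arithmetic behind the positioning information is simple enough; the hard part is tracking graphics state while walking the content stream. Here is a rough sketch of the matrix handling, assuming you have already pulled the six operands of each "cm" operator out of the page contents yourself. None of these names exist in pdftools; they are made up for illustration, and the default unit of 1/72 inch is the one the PDF specification gives for default user space.

```python
# Hypothetical helpers for illustration only; not part of pdftools.
# A PDF matrix [a b c d e f] is written here as a 6-tuple.

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)

def concat(m, ctm):
    """Return m x ctm: the new current transformation matrix (CTM)
    after a 'cm' operator with operands m is applied to ctm."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def to_default_space(point, ctm):
    """Map a point from the current user space to default user space."""
    x, y = point
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)

def units_to_inches(value):
    """Default user space units are 1/72 inch."""
    return value / 72.0

# Example: "2 0 0 2 0 0 cm" followed by "1 0 0 1 10 20 cm".
ctm = IDENTITY
ctm = concat((2.0, 0.0, 0.0, 2.0, 0.0, 0.0), ctm)    # scale by 2
ctm = concat((1.0, 0.0, 0.0, 1.0, 10.0, 20.0), ctm)  # then translate
position = to_default_space((1.0, 1.0), ctm)
```

In a real page you would also have to save and restore the matrix on "q" and "Q", and combine it with the text matrix inside text objects, which is exactly where I tend to get distracted.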
Have fun, and don't expect too much,
David