
Reading PDF file

Hello,

My company needs to compare two versions of a PDF file. I found a lot
of .NET components that generate and merge PDFs, but none of them could
simply read the text. The PDF file is in German and contains many photos
and text boxes, not just plain text (but only the text has to be compared).
Can anybody help me find a .NET component (or a COM one) that extracts only
the text from such a file?

Thanks a lot in advance
Nov 16 '05 #1
1 Reply


Extracting text from PDF streams is an extremely complicated task.
I won't go deeper into that here, but you may read chapters 4-5 of the
PDF specification to get an idea.
There is not just one but many different formats that can be used to
describe text in PDF documents.

I am sure there is nothing free that will suit your needs; only
commercial products provide such a professional text extraction
implementation.
Maybe this one could give you a start, though:
http://www.codeproject.com/cpp/ExtractPDFText.asp
It provides an open source text extraction implementation that can handle
Flate-compressed streams of text in some formats (Flate, i.e. zlib/deflate
rather than GZIP proper, is one of at least 3 common compression algorithms
used for text).
But it is far from a complete PDF text extraction tool.
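
To give an idea of what handling just that one case involves, here is a minimal sketch in Python (chosen only for brevity; it is not the CodeProject code and not a .NET component). It assumes the text sits in Flate-compressed content streams and is shown with the `Tj` and `TJ` operators as simple literal strings. The names `extract_text` and `unescape` are made up for this example, and decoding as Latin-1 is an assumption that will not hold for every font encoding.

```python
import re
import zlib

# Raw bytes between "stream" and "endstream". A robust reader would use the
# /Length entry of the stream dictionary instead of searching for keywords.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)

# A literal string operand "(...)". Simplification: nested, unescaped
# parentheses (which PDF allows) are not handled.
LITERAL_RE = re.compile(rb"\((?:\\.|[^\\()])*\)")

# Two text-showing operations, matched in document order:
#   (text) Tj            and            [(te) -20 (xt)] TJ
SHOW_RE = re.compile(
    rb"(\((?:\\.|[^\\()])*\))\s*Tj"
    rb"|\[((?:\\.|[^\\\]])*)\]\s*TJ",
    re.DOTALL,
)


def unescape(literal):
    """Strip the outer parentheses and undo the \\( \\) \\\\ escapes only."""
    body = re.sub(rb"\\([()\\])", rb"\1", literal[1:-1])
    # Assumption: single-byte, Latin-1 compatible encoding.
    return body.decode("latin-1")


def extract_text(pdf_bytes):
    """Return the strings shown by Tj/TJ in every Flate-compressed stream."""
    chunks = []
    for stream in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left before "endstream".
            content = zlib.decompressobj().decompress(stream.group(1))
        except zlib.error:
            continue  # not Flate data (image, LZW, ...): skipped in this sketch
        for op in SHOW_RE.finditer(content):
            if op.group(1) is not None:
                chunks.append(unescape(op.group(1)))
            else:
                parts = LITERAL_RE.findall(op.group(2))
                chunks.append("".join(unescape(p) for p in parts))
    return chunks
```

Even this leaves out hex strings, octal escapes, custom font encodings, other filters and text positioning, which is why it is far from a complete tool.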

If the text to be compared is stored in separate content streams within the
documents, maybe you could do it by comparing those streams at byte level.
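
A rough sketch of that byte-level idea, again in Python and only under assumptions: both versions contain their streams in the same order, and streams are either Flate-compressed or stored raw. The function names `content_streams` and `changed_streams` are hypothetical. Streams are decompressed first so that a different compression level alone does not show up as a change.

```python
import re
import zlib

# Raw bytes between "stream" and "endstream" (keyword search, not /Length).
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def content_streams(pdf_bytes):
    """Return every stream body, inflated when it is Flate-compressed."""
    bodies = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            bodies.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not Flate data: keep the raw bytes without the trailing end-of-line.
            bodies.append(raw.rstrip(b"\r\n"))
    return bodies


def changed_streams(old_pdf, new_pdf):
    """Return the indexes of streams whose bytes differ between two versions."""
    old, new = content_streams(old_pdf), content_streams(new_pdf)
    changed = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    # Streams that exist in only one version also count as differences.
    changed.extend(range(min(len(old), len(new)), max(len(old), len(new))))
    return changed
```

This only says which streams changed, not what changed in the text, and a tool that reorders objects or re-encodes fonts on save would make every stream look different.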


--
Regards,
Dennis JD Myrén
Oslo Kodebureau
"Stefan Rosi" <in*****@invalid.com> wrote in message
news:%2***************@TK2MSFTNGP09.phx.gbl...

Nov 16 '05 #2
