Vyz wrote:
I am looking for a PDF to text script. I am working with multibyte
language PDFs on Windows XP. I need to batch convert them to text and
feed the result into an encoding converter program.
Thanks for any help in this regard.
Multibyte languages are not easy. I do text extraction from PDF, but 1)
I do it on Linux and 2) I only need English text. The utility I use is
pdftotext, which comes as part of the Xpdf package for *nix.
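For the batch part of the question, here is a minimal sketch in Python
that runs pdftotext over every PDF in a directory. It assumes a
pdftotext binary is installed and on the PATH, and that the "-enc"
option with "UTF-8" is a suitable output encoding; whether that works
for a particular multibyte language is something to check, not a
given. The function names build_command and batch_convert are made up
for this example.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, encoding="UTF-8"):
    """Return the pdftotext argument list for one PDF.

    The text file is written next to the PDF with a .txt extension.
    """
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    # -enc selects the output text encoding (assumed: UTF-8 is wanted).
    return ["pdftotext", "-enc", encoding, str(pdf_path), str(txt_path)]


def batch_convert(directory, encoding="UTF-8"):
    """Run pdftotext on every .pdf in a directory.

    Returns the list of PDFs for which pdftotext exited with an error.
    A zero exit code does not mean the text is usable, only that the
    tool did not complain.
    """
    failed = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        result = subprocess.run(build_command(pdf_path, encoding))
        if result.returncode != 0:
            failed.append(pdf_path)
    return failed
```

Calling batch_convert on a folder path leaves one .txt file per PDF in
that folder, and those files can then be fed to the encoding converter.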
The other problem, however, is not with the extraction itself but with
the fact that the extracted text might not look very good. In other
words, the extraction program will never complain but will nevertheless
produce garbage. Then you have to process the result yourself. For
example, whitespace is not consistent: sometimes there will be extra
whitespace and sometimes there won't be enough, for example " S o m e
w ordsloo l i k e t his" and so on...
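As a rough illustration of that post-processing, the sketch below
handles only the easy half of the problem: it squeezes extra whitespace
and flags lines that look letter-spaced so they can be reviewed by
hand. It does not try to restore missing spaces, which would need a
word list for the language in question. The function names and the 0.6
threshold are arbitrary choices for this example, and the single-letter
heuristic assumes space-separated words, so it may not suit every
script.

```python
import re


def collapse_whitespace(text):
    """Squeeze runs of spaces and tabs to one space and trim each line."""
    lines = []
    for line in text.splitlines():
        lines.append(re.sub(r"[ \t]+", " ", line).strip())
    return "\n".join(lines)


def looks_letter_spaced(line, threshold=0.6):
    """Guess whether a line came out as 'S o m e' style single letters.

    Returns True when at least `threshold` of the whitespace-separated
    tokens are one character long. Very short lines are never flagged.
    """
    tokens = line.split()
    if len(tokens) < 4:
        return False
    singles = sum(1 for token in tokens if len(token) == 1)
    return singles / len(tokens) >= threshold
```

Lines for which looks_letter_spaced returns True are candidates for
manual checking rather than automatic repair.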
The real answer is that PDF text extraction is pretty hard. It is
1000x better to get hold of the original source...
Nick V.