
Extracting text from pdf

Hi,

I have to index the text of a pdf document.

Do any of you know of a PHP script/extension or a binary that is able
to extract the text?

The pdf extension mentioned in the php.net docs seems to be for
_creation_ of documents only, is that so? The same goes for all the
PHP classes I have found.

Regards,
Johnny

--
Never express yourself more clearly than you are able to think.
- Niels Bohr
Jul 17 '05 #1
5 Replies


*** JustinCase wrote (25 Oct 2004 16:09:36 GMT):
Does any of you know of a PHP script/extension or a binary that is able
to extract the text ?


There's a Unix program that might help you: ps2ascii

--
-- Álvaro G. Vicario - Burgos, Spain
-- Thank you for not e-mailing me your questions
--
Jul 17 '05 #2

On 25-10-2004 Alvaro G Vicario wrote:
*** JustinCase wrote (25 Oct 2004 16:09:36 GMT):
Does any of you know of a PHP script/extension or a binary that is
able to extract the text ?


There's a Unix program that might help you: ps2ascii


Thanks for the pointer,
I'll have a look

/Johnny

--
He's turned his life around. He used to be depressed and miserable. Now
he's miserable and depressed.
- David Frost
Jul 17 '05 #3

On 25-10-2004 Alvaro G Vicario wrote:
*** JustinCase wrote (25 Oct 2004 16:09:36 GMT):
Does any of you know of a PHP script/extension or a binary that is
able to extract the text ?


There's a Unix program that might help you: ps2ascii


Does anyone know of any other tool for PDF text extraction?
ps2ascii cannot seem to parse all of the pdf file. I tried the pstotext
tool too, but with the same result.
I suspect it has something to do with my ghostscript version being
too old (7.05, newest is 8.14).

Unfortunately I have no experience in installing/upgrading unix stuff
(having spent half an evening trying in vain and confusion).
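
For anyone wanting to confirm which Ghostscript is actually installed before upgrading, here is a minimal sketch. It assumes the Ghostscript executable is on the PATH under the name `gs` and that `gs --version` prints a dotted version number; Python is used only for illustration, and the function names are made up for this example.

```python
import subprocess


def parse_version(text):
    """Turn a dotted version string such as "7.05" into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def installed_ghostscript_version():
    """Ask the gs binary (assumed to be on PATH) for its version number."""
    result = subprocess.run(
        ["gs", "--version"], capture_output=True, text=True, check=True
    )
    return parse_version(result.stdout)


def is_older_than(installed, wanted):
    """Compare two parsed versions, e.g. (7, 5) against (8, 14)."""
    return installed < wanted
```

With the numbers from this post, `is_older_than(parse_version("7.05"), parse_version("8.14"))` is true, which only confirms the install is behind; it does not prove the old version is what breaks the extraction.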

Regards,
Johnny

--
In the beginning the Universe was created. This has made a lot of
people very angry and been widely regarded as a bad move.
- Douglas Adams
Jul 17 '05 #4



xpdf will do this
http://www.foolabs.com/xpdf/

I use it with the namazu search tool (http://www.namazu.org/) to
provide search capabilities on websites that span web pages, office
docs, and PDF files.
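
The xpdf package includes a command-line tool, pdftotext, which does the actual text extraction. A rough sketch of calling it from a script follows. It assumes pdftotext is installed and on the PATH; the function names are hypothetical, and Python is used only for illustration (any language that can run an external command would do, PHP included).

```python
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext argument list; "-" as output means stdout."""
    command = ["pdftotext"]
    if layout:
        command.append("-layout")  # try to keep the original page layout
    command += ["-enc", "UTF-8", str(pdf_path), "-"]
    return command


def extract_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string.

    Raises FileNotFoundError if pdftotext is not installed and
    subprocess.CalledProcessError if it cannot read the file.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Something like `extract_text("manual.pdf")` (the file name is just a placeholder) would then return the text ready to hand to an indexer. Scanned PDFs that contain only page images will come back empty, since there is no text layer to extract.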

Jul 17 '05 #5

On 26-10-2004 Darien Kruss wrote:


xpdf will do this
http://www.foolabs.com/xpdf/

I use it with the namazu search tool (http://www.namazu.org/) to
provide search capabilities on websites that span web pages, office
docs, and PDF files.

Hi Darien,

Perfect.

Funny though. I'd been to the site a few times in my search but had
somehow concluded that xpdf was not what I wanted. Looking too hard can
make you miss the obvious, eh !? So many hairs could still be resting
comfortably on my head. :)

Thanks,
Johnny

--
The universe is a big place, perhaps the biggest.
- Kilgore Trout
Jul 17 '05 #6
