Extracting text from pdf

Hi,

I have to index the text of a pdf document.

Do any of you know of a PHP script/extension or a binary that is able
to extract the text?

The pdf extension mentioned in the php.net docs seems to indicate that
it's for _creation_ of documents only, is that so? Same with all the
PHP classes I have found.

Regards,
Johnny

--
Never express yourself more clearly than you are able to think.
- Niels Bohr
Jul 17 '05 #1
*** JustinCase wrote/escribió (25 Oct 2004 16:09:36 GMT):
Does any of you know of a PHP script/extension or a binary that is able
to extract the text ?


There's a Unix program that might help you: ps2ascii
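
For anyone who wants to try it from a script, here is a minimal sketch in Python, assuming ps2ascii (it ships with Ghostscript) is on the PATH. The function names are made up for illustration; the same command could be run from PHP with shell_exec().

```python
import shutil
import subprocess


def ps2ascii_command(pdf_path):
    """Build the argument list for Ghostscript's ps2ascii script."""
    # With a single file argument, ps2ascii writes the text to standard output.
    return ["ps2ascii", pdf_path]


def extract_with_ps2ascii(pdf_path):
    """Return the text ps2ascii extracts, or raise if the tool is missing."""
    if shutil.which("ps2ascii") is None:
        raise RuntimeError("ps2ascii not found on PATH (it ships with Ghostscript)")
    result = subprocess.run(
        ps2ascii_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if ps2ascii exits with an error
    )
    # Decode leniently: the output encoding is not guaranteed to be UTF-8.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_with_ps2ascii("document.pdf") should give back whatever text the tool manages to pull out; how complete that is depends on the PDF and on the Ghostscript version.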

--
-- Álvaro G. Vicario - Burgos, Spain
-- Thank you for not e-mailing me your questions
--
Jul 17 '05 #2
On 25-10-2004 Alvaro G Vicario wrote:
*** JustinCase wrote/escribió (25 Oct 2004 16:09:36 GMT):
Does any of you know of a PHP script/extension or a binary that is
able to extract the text ?


There's a Unix program that might help you: ps2ascii


Thanks for the pointer,
I'll have a look

/Johnny

--
He's turned his life around. He used to be depressed and miserable. Now
he's miserable and depressed.
- David Frost
Jul 17 '05 #3
On 25-10-2004 Alvaro G Vicario wrote:
*** JustinCase wrote/escribió (25 Oct 2004 16:09:36 GMT):
Does any of you know of a PHP script/extension or a binary that is
able to extract the text ?


There's a Unix program that might help you: ps2ascii


Does anyone know of any other tool for PDF text extraction?
ps2ascii cannot seem to parse all of the PDF file. I tried the pstotext
tool too, but with the same result.
I figure that it has something to do with my Ghostscript version being
too old (7.05; the newest is 8.14).

Unfortunately I have no experience in installing/upgrading Unix stuff
(having spent half an evening trying in vain and confusion).

Regards,
Johnny

--
In the beginning the Universe was created. This has made a lot of
people very angry and been widely regarded as a bad move.
- Douglas Adams
Jul 17 '05 #4


xpdf will do this
http://www.foolabs.com/xpdf/

I use it with the namazu search tool (http://www.namazu.org/) to
provide search capabilities on websites that span web pages, office
docs, and PDF files.
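
As a rough sketch of the extract-then-index idea (this is not how namazu does it), the Python below calls the pdftotext program that comes with xpdf and builds a simple word-to-page index. It assumes pdftotext is on the PATH; the function names and the page-based index layout are my own choices for illustration.

```python
import re
import subprocess
from collections import defaultdict


def pdftotext_command(pdf_path):
    """Build the pdftotext argument list for writing UTF-8 text to stdout."""
    # "-" as the output file name tells pdftotext to write to standard output.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def build_index(text):
    """Map each lowercase word to the sorted list of pages it appears on."""
    index = defaultdict(set)
    # pdftotext separates pages with a form feed character by default.
    for page_number, page in enumerate(text.split("\f"), start=1):
        for word in re.findall(r"\w+", page.lower()):
            index[word].add(page_number)
    return {word: sorted(pages) for word, pages in index.items()}
```

Typical use would be build_index(extract_text("document.pdf")). A real search setup needs more than this (stop words, stemming, storage), which is what a tool like namazu takes care of.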
Jul 17 '05 #5
On 26-10-2004 Darien Kruss wrote:


xpdf will do this
http://www.foolabs.com/xpdf/

I use it with the namazu search tool (http://www.namazu.org/) to
provide search capabilities on websites that span web pages, office
docs, and PDF files.

Hi Darien,

Perfect.

Funny though. I'd been to the site a few times in my search but had
somehow concluded that xpdf was not what I wanted. Looking too hard can
make you miss the obvious, eh !? So many hairs could still be resting
comfortably on my head. :)

Thanks,
Johnny

--
The universe is a big place, perhaps the biggest.
- Kilgore Trout
Jul 17 '05 #6
