Extracting text from pdf

Hi,

I have to index the text of a pdf document.

Does any of you know of a PHP script/extension or a binary that is able
to extract the text ?

The pdf extension mentioned in the php.net docs seems to indicate that
it's for _creation_ of documents only, is that so? Same with all the
PHP classes I have found.

Regards,
Johnny

--
Never express yourself more clearly than you are able to think.
- Niels Bohr
Jul 17 '05 #1
*** JustinCase wrote/escribió (25 Oct 2004 16:09:36 GMT):
Does any of you know of a PHP script/extension or a binary that is able
to extract the text ?


There's a Unix program that might help you: ps2ascii

--
-- Álvaro G. Vicario - Burgos, Spain
-- Thank you for not e-mailing me your questions
--
Jul 17 '05 #2
On 25-10-2004 Alvaro G Vicario wrote:
*** JustinCase wrote/escribió (25 Oct 2004 16:09:36 GMT):
Does any of you know of a PHP script/extension or a binary that is
able to extract the text ?


There's a Unix program that might help you: ps2ascii


Thanks for the pointer,
I'll have a look

/Johnny

--
He's turned his life around. He used to be depressed and miserable. Now
he's miserable and depressed.
- David Frost
Jul 17 '05 #3
On 25-10-2004 Alvaro G Vicario wrote:
*** JustinCase wrote/escribió (25 Oct 2004 16:09:36 GMT):
Does any of you know of a PHP script/extension or a binary that is
able to extract the text ?


There's a Unix program that might help you: ps2ascii


Does anyone know of any other tool for PDF text extraction?
ps2ascii cannot seem to parse all of the pdf file. I tried the pstotext
tool too, but with the same result.
I figured that it has something to do with my ghostscript version being
too old (7.05, newest is 8.14).

Unfortunately I have no experience in installing/upgrading unix stuff
(having spent half an evening trying in vain and confusion).

Regards,
Johnny

--
In the beginning the Universe was created. This has made a lot of
people very angry and been widely regarded as a bad move.
- Douglas Adams
Jul 17 '05 #4


xpdf will do this
http://www.foolabs.com/xpdf/

I use it with the namazu search tool (http://www.namazu.org/) to
provide search capabilities on websites that span web pages, office
docs, and PDF files.
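
For anyone wanting to script this, here is a minimal sketch of calling xpdf's pdftotext command-line tool and capturing its output, assuming the binary is installed and on the PATH. It is written in Python purely for illustration; the helper names `build_pdftotext_command` and `extract_text` are made up for this example, and the same command line could be run from PHP through its shell-execution functions.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Return the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]  # ask for UTF-8 encoded output
    if layout:
        cmd.append("-layout")  # try to keep the page's physical layout
    # "-" as the output file tells pdftotext to send the text to stdout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=False):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install xpdf first")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("manual.pdf")` (a hypothetical file name) would then return the text, ready to be handed to whatever indexer is in use.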

Jul 17 '05 #5
On 26-10-2004 Darien Kruss wrote:


xpdf will do this
http://www.foolabs.com/xpdf/

I use it with the namazu search tool (http://www.namazu.org/) to
provide search capabilities on websites that span web pages, office
docs, and PDF files.

Hi Darien,

Perfect.

Funny though. I'd been to the site a few times in my search but had
somehow concluded that xpdf was not what I wanted. Looking too hard can
make you miss the obvious, eh !? So many hairs could still be resting
comfortably on my head. :)

Thanks,
Johnny

--
The universe is a big place, perhaps the biggest.
- Kilgore Trout
Jul 17 '05 #6
