Help with pdf to text

I am trying to use the sample code posted by thodge at ipswich dot qld
dot gov dot au, found here:

http://au2.php.net/pdf

I want to use it to convert a PDF file to a string. I am currently
testing with this document:
http://www.tececo.com/files/appraisa...oAppraisal.pdf
but other files fail in the same way. The file read itself works, since
echoing $content after this point:

$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);
works fine. However, echo pdf2string($sourcefile) produces blank
output. Can anyone suggest what the problem could be in the way I am
using it, or point me to another easy-to-use, cross-platform script that
will extract the text from PDF files?

The entire script is copied here for easy reference (sorry, but I am not
sure what is going wrong; I have no experience with this):
<?php
function pdf2string($sourcefile) {

    $fp = fopen($sourcefile, 'rb');
    $content = fread($fp, filesize($sourcefile));
    fclose($fp);

    // echo $content; // debug only: dumps the raw PDF bytes

    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $pos = 0;
    $pos2 = 0;
    $startpos = 0;

    while ($pos !== false && $pos2 !== false) {

        $pos = strpos($content, $searchstart, $startpos);
        $pos2 = strpos($content, $searchend, $startpos + 1);

        if ($pos !== false && $pos2 !== false) {

            // The data begins after the "stream" keyword and the
            // CRLF or LF that follows it.
            $datastart = $pos + strlen($searchstart);
            if (substr($content, $datastart, 2) === "\r\n") {
                $datastart += 2;
            } else if (substr($content, $datastart, 1) === "\n") {
                $datastart++;
            }

            // The data ends before the CRLF or LF that precedes
            // "endstream".
            $dataend = $pos2;
            if (substr($content, $dataend - 2, 2) === "\r\n") {
                $dataend -= 2;
            } else if (substr($content, $dataend - 1, 1) === "\n") {
                $dataend--;
            }

            $textsection = substr(
                $content,
                $datastart,
                max(0, $dataend - $datastart)
            );
            // gzuncompress() returns false for data that is not
            // zlib-compressed; pdfExtractText() then returns ''.
            $data = @gzuncompress($textsection);
            $pdfText .= pdfExtractText($data);
            $startpos = $pos2 + strlen($searchend) - 1;

        }
    }

    return preg_replace('/(\s)+/', ' ', $pdfText);

}

function pdfExtractText($psData) {

    if (!is_string($psData)) {
        return '';
    }

    $text = '';

    // Handle brackets in the text stream that could be mistaken for
    // the end of a text field. I'm sure you can do this as part of the
    // regular expression, but my skills aren't good enough yet.
    $psData = str_replace('\)', '##ENDBRACKET##', $psData);
    $psData = str_replace('\]', '##ENDSBRACKET##', $psData);

    // Match against the function argument, $psData.
    preg_match_all(
        '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
        $psData,
        $matches
    );
    for ($i = 0; $i < count($matches[0]); $i++) {
        if ($matches[3][$i] != '') {
            // Run another match over the contents.
            preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
            foreach ($subMatches[1] as $subMatch) {
                $text .= $subMatch;
            }
        } else if ($matches[4][$i] != '') {
            $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
        }
    }

    // Translate special characters and put back brackets.
    $trans = array(
        '...' => '&hellip;',
        '\205' => '&hellip;',
        '\221' => chr(145),
        '\222' => chr(146),
        '\223' => chr(147),
        '\224' => chr(148),
        '\226' => '-',
        '\267' => '&bull;',
        '\(' => '(',
        '\[' => '[',
        '##ENDBRACKET##' => ')',
        '##ENDSBRACKET##' => ']',
        chr(133) => '-',
        chr(141) => chr(147),
        chr(142) => chr(148),
        chr(143) => chr(145),
        chr(144) => chr(146),
    );
    $text = strtr($text, $trans);

    return $text;

}

echo pdf2string('GlasserTecEcoAppraisal.pdf');

?>
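
For reference, the stream-decoding idea the script relies on can be tried in isolation. The following is only a minimal sketch, not the php.net code: sketchExtractFirstStreamText() is a hypothetical helper, the PDF fragment is built by hand, and it assumes a single Flate-compressed stream whose /Length is a direct number in the same dictionary. Unlike the script, it takes the byte count from /Length rather than searching for "endstream".

```php
<?php
// Hypothetical helper (not part of the original script): return the text
// shown by Tj operators in the first Flate-compressed stream of $pdf.
function sketchExtractFirstStreamText($pdf) {
    // Capture the declared /Length, then skip "stream" and its CRLF or LF.
    $pattern = '/\/Length\s+(\d+)[^>]*>>\s*stream(\r\n|\n)/';
    if (!preg_match($pattern, $pdf, $m, PREG_OFFSET_CAPTURE)) {
        return '';
    }
    $length = (int) $m[1][0];
    $start = $m[0][1] + strlen($m[0][0]);
    $raw = substr($pdf, $start, $length);

    // FlateDecode data is zlib-wrapped, which gzuncompress() reads.
    $data = @gzuncompress($raw);
    if ($data === false) {
        return '';
    }

    // Collect every "(...) Tj" string; escaped brackets are ignored here.
    preg_match_all('/\(([^)]*)\)\s*Tj/', $data, $matches);
    return implode(' ', $matches[1]);
}

// Hand-built fragment that stands in for a real PDF object.
$stream = gzcompress('BT /F1 12 Tf (Hello) Tj (world) Tj ET');
$pdf = "4 0 obj\n<< /Length " . strlen($stream) . " /Filter /FlateDecode >>\n"
     . "stream\n" . $stream . "\nendstream\nendobj\n";

echo sketchExtractFirstStreamText($pdf), "\n"; // prints "Hello world"
```

Real files often store /Length as an indirect reference or use other filters, so this sketch will return an empty string for many documents.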

Sep 3 '06 #1
"noodle_snacks" <no***********@yahoo.com.au> wrote:
>
>I am trying to use the sample of code posted by thodge at ipswich dot
>qld dot gov dot au found here:
>
>http://au2.php.net/pdf
>
>In order to convert a PDF file to a string. I am currently trying with
>this document:
>http://www.tececo.com/files/appraisa...oAppraisal.pdf
>however others fail in the same fashion. Basically the file read works,
>since echoing $content after this point:
>
>$fp = fopen($sourcefile, 'rb');
>$content = fread($fp, filesize($sourcefile));
>fclose($fp);
>
>Works fine, however using echo pdf2string($sourcefile) the final
>result of this script is blank output. Can anyone suggest what could be
>the problem in the way I am using it, or another easy to use, cross
>platform script that will extract the text from PDF files?

That PDF file is compressed, as most PDF files are. The script you
provided only works on uncompressed PDF files. You can use the "pdftk"
tool to uncompress it.
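
A rough sketch of driving pdftk from PHP follows, assuming pdftk is installed and on the PATH. buildUncompressCommand() and the output file name are hypothetical; the only pdftk feature used is its "uncompress" output option. Note that the script as posted passes every stream through gzuncompress(), which returns false for data that is not compressed, so with an uncompressed file the stream contents would have to be handed to pdfExtractText() directly.

```php
<?php
// Hypothetical helper: build the pdftk command line that writes an
// uncompressed copy of $in to $out.
function buildUncompressCommand($in, $out) {
    return 'pdftk ' . escapeshellarg($in)
         . ' output ' . escapeshellarg($out)
         . ' uncompress';
}

$command = buildUncompressCommand(
    'GlasserTecEcoAppraisal.pdf',
    'uncompressed.pdf' // assumed output name
);
echo $command, "\n";

// To actually run it (requires pdftk):
// exec($command, $outputLines, $exitCode);
```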
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Sep 4 '06 #2
