
Help with pdf to text

noodle_snacks
#1: Sep 3 '06
I am trying to use the sample code posted by thodge at ipswich dot qld
dot gov dot au, found here:

http://au2.php.net/pdf

I want to use it to convert a PDF file to a string. I am currently trying
with this document:
http://www.tececo.com/files/appraisa...oAppraisal.pdf
However, others fail in the same fashion. Basically, the file read works,
since echoing $content after this point:

$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);


works fine. However, with echo pdf2string($sourcefile), the final output
of this script is blank. Can anyone suggest what the problem could be in
the way I am using it, or recommend another easy-to-use, cross-platform
script that will extract the text from PDF files?


The entire script is copied here for easy reference (sorry, but I am not
sure what is going wrong; I have no experience with this):


<?php


function pdf2string($sourcefile) {

    $fp = fopen($sourcefile, 'rb');
    $content = fread($fp, filesize($sourcefile));
    fclose($fp);

    // Debug output: confirms that the file was read.
    echo $content;

    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $startpos = 0;

    while (true) {

        // Find the next "stream" keyword and the "endstream" that closes it.
        $pos = strpos($content, $searchstart, $startpos);
        if ($pos === false) {
            break;
        }
        $pos2 = strpos($content, $searchend, $pos + 1);
        if ($pos2 === false) {
            break;
        }

        // Skip the end-of-line marker that follows the "stream" keyword.
        $dataStart = $pos + strlen($searchstart);
        if (substr($content, $dataStart, 2) === "\r\n") {
            $dataStart += 2;
        } else if (substr($content, $dataStart, 1) === "\n") {
            $dataStart++;
        }

        // Drop the end-of-line marker that precedes "endstream".
        $dataEnd = $pos2;
        if (substr($content, $dataEnd - 2, 2) === "\r\n") {
            $dataEnd -= 2;
        } else if (substr($content, $dataEnd - 1, 1) === "\n") {
            $dataEnd--;
        }

        $textsection = substr(
            $content,
            $dataStart,
            max(0, $dataEnd - $dataStart)
        );

        // gzuncompress() returns false for data it cannot inflate;
        // pdfExtractText() then contributes an empty string.
        $data = @gzuncompress($textsection);
        $pdfText .= pdfExtractText($data);
        $startpos = $pos2 + strlen($searchend);

    }

    return preg_replace('/(\s)+/', ' ', $pdfText);

}

function pdfExtractText($psData) {

    if (!is_string($psData)) {
        return '';
    }

    $text = '';

    // Handle brackets in the text stream that could be mistaken for
    // the end of a text field. I'm sure you can do this as part of the
    // regular expression, but my skills aren't good enough yet.
    $psData = str_replace('\)', '##ENDBRACKET##', $psData);
    $psData = str_replace('\]', '##ENDSBRACKET##', $psData);

    preg_match_all(
        '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
        $psData,
        $matches
    );
    for ($i = 0; $i < sizeof($matches[0]); $i++) {
        if ($matches[3][$i] != '') {
            // Run another match over the contents.
            preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
            foreach ($subMatches[1] as $subMatch) {
                $text .= $subMatch;
            }
        } else if ($matches[4][$i] != '') {
            $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
        }
    }

    // Translate special characters and put back brackets.
    $trans = array(
        '...' => '&hellip;',
        '\205' => '&hellip;',
        '\221' => chr(145),
        '\222' => chr(146),
        '\223' => chr(147),
        '\224' => chr(148),
        '\226' => '-',
        '\267' => '&bull;',
        '\(' => '(',
        '\[' => '[',
        '##ENDBRACKET##' => ')',
        '##ENDSBRACKET##' => ']',
        chr(133) => '-',
        chr(141) => chr(147),
        chr(142) => chr(148),
        chr(143) => chr(145),
        chr(144) => chr(146),
    );
    $text = strtr($text, $trans);

    return $text;

}

echo pdf2string('GlasserTecEcoAppraisal.pdf');

?>
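
One way to narrow this down might be to look at the stream step on its own. The following is only a minimal sketch, assuming the zlib extension is loaded: it isolates each stream ... endstream body and reports which ones gzuncompress() can inflate. The function name inspectStreams and the synthetic input are made up for illustration and are not part of the php.net sample; a body that does not inflate may be stored uncompressed or use a different filter.

```php
<?php

// Hypothetical helper: returns one entry per stream body found in $content,
// holding the inflated data, or null when gzuncompress() cannot inflate it.
function inspectStreams($content) {
    $results = array();
    // "stream" is followed by CRLF or LF; the body runs up to "endstream".
    preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $content, $matches);
    foreach ($matches[1] as $body) {
        $inflated = @gzuncompress($body);
        $results[] = ($inflated === false) ? null : $inflated;
    }
    return $results;
}

// Synthetic input, not a real PDF: one zlib-compressed stream, one plain one.
$fake = "1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
      . gzcompress("BT (Hello) Tj ET")
      . "\nendstream\nendobj\n"
      . "2 0 obj\n<< >>\nstream\nBT (Plain) Tj ET\nendstream\nendobj\n";

$streams = inspectStreams($fake);

foreach ($streams as $i => $data) {
    echo "stream $i: ", ($data === null ? 'did not inflate' : $data), "\n";
}
```

Running the same function over the contents of the real file would show how many of its streams the script has any chance of reading.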


Tim Roberts
#2: Sep 5 '06



"noodle_snacks" <noodle_snacks@yahoo.com.au> wrote:
>
>I am trying to use the sample of code posted by thodge at ipswich dot
>qld dot gov dot au found here:
>
>http://au2.php.net/pdf
>
>In order to convert a PDF file to a string. I am currently trying with
>this document:
>http://www.tececo.com/files/appraisa...oAppraisal.pdf
>however others fail in the same fashion. Basically the file read works,
>since echoing $content after this point:
>
>$fp = fopen($sourcefile, 'rb');
>$content = fread($fp, filesize($sourcefile));
>fclose($fp);
>
>Works fine, however using echo pdf2string($sourcefile) the final
>result of this script is blank output. Can anyone suggest what could be
>the problem in the way I am using it, or another easy to use, cross
>platform script that will extract the text from PDF files?

That PDF file is compressed, as most PDF files are. The script you
provided only works on uncompressed PDF files. You can use the "pdftk"
tool to uncompress it.
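
As a rough sketch of driving pdftk from PHP, assuming pdftk is installed and on the PATH, the command line could be built as below. The helper name buildUncompressCommand and the output file name uncompressed.pdf are made up for illustration.

```php
<?php

// Hypothetical helper: builds the pdftk command line that writes a copy of
// $in to $out with its page streams uncompressed.
function buildUncompressCommand($in, $out) {
    return 'pdftk ' . escapeshellarg($in)
         . ' output ' . escapeshellarg($out)
         . ' uncompress';
}

$cmd = buildUncompressCommand('GlasserTecEcoAppraisal.pdf', 'uncompressed.pdf');
echo $cmd, "\n";

// To actually run it (requires pdftk and the input file):
// exec($cmd, $outputLines, $exitCode);
// An $exitCode of 0 would indicate that pdftk finished without error.
```

The resulting file can then be passed to the script in place of the original.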
--
- Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.