convert pdf to text content in db

Hi,
I want to convert a PDF to text content before storing it in the database: when a PDF file is uploaded, it should be converted to text before it is saved. I am using the pdf2text.class below, but it is not converting anything.
Please help me.
pdf2text.class
<?php

class PDF2Text {
    // Some settings
    var $multibyte = 2; // Use setUnicode(TRUE|FALSE)
    var $convertquotes = ENT_QUOTES; // ENT_COMPAT (double-quotes), ENT_QUOTES (Both), ENT_NOQUOTES (None)

    // Variables
    var $filename = '';
    var $decodedtext = '';

    function setFilename($filename) {
        // Reset
        $this->decodedtext = '';
        $this->filename = $filename;
    }

    function output($echo = false) {
        if ($echo) echo $this->decodedtext;
        else return $this->decodedtext;
    }

    function setUnicode($input) {
        // 4 for unicode. But 2 should work in most cases just fine
        if ($input == true) $this->multibyte = 4;
        else $this->multibyte = 2;
    }

    function decodePDF() {
        // Read the data from the pdf file. file_get_contents() is binary-safe and
        // its second parameter is use_include_path, so FILE_BINARY is dropped.
        $infile = is_readable($this->filename) ? file_get_contents($this->filename) : "";
        if (empty($infile))
            return "";

        // Get all text data.
        $transformations = array();
        $texts = array();

        // Get the list of all objects.
        preg_match_all("#obj[\r\n](.*)endobj[\r\n]#ismU", $infile, $objects);
        $objects = $objects[1];

        // Select objects with streams.
        for ($i = 0; $i < count($objects); $i++) {
            $currentObject = $objects[$i];

            // Check if an object includes data stream.
            if (preg_match("#stream[\r\n](.*)endstream[\r\n]#ismU", $currentObject, $stream)) {
                $stream = ltrim($stream[1]);

                // Check object parameters and look for text data.
                $options = $this->getObjectOptions($currentObject);

                if (!(empty($options["Length1"]) && empty($options["Type"]) && empty($options["Subtype"])))
                    continue;

                // Hack, length doesn't always seem to be correct
                unset($options["Length"]);

                // So, we have text data. Decode it.
                $data = $this->getDecodedStream($stream, $options);

                if (strlen($data)) {
                    if (preg_match_all("#BT[\r\n](.*)ET[\r\n]#ismU", $data, $textContainers)) {
                        $textContainers = $textContainers[1];
                        $this->getDirtyTexts($texts, $textContainers);
                    } else
                        $this->getCharTransformations($transformations, $data);
                }
            }
        }

        // Analyze text blocks taking into account character transformations and return results.
        $this->decodedtext = $this->getTextUsingTransformations($texts, $transformations);
    }


    function decodeAsciiHex($input) {
        $output = "";

        $isOdd = true;
        $isComment = false;
        $length = strlen($input);

        for ($i = 0, $codeHigh = -1; $i < $length && $input[$i] != '>'; $i++) {
            $c = $input[$i];

            if ($isComment) {
                // Escapes need double quotes; '\r' would be a backslash followed by "r"
                if ($c == "\r" || $c == "\n")
                    $isComment = false;
                continue;
            }

            switch ($c) {
                case "\0": case "\t": case "\r": case "\f": case "\n": case ' ': break;
                case '%':
                    $isComment = true;
                break;

                default:
                    // Reject anything that is not a hex digit
                    if (!ctype_xdigit($c))
                        return "";
                    $code = hexdec($c);

                    if ($isOdd)
                        $codeHigh = $code;
                    else
                        $output .= chr($codeHigh * 16 + $code);

                    $isOdd = !$isOdd;
                break;
            }
        }

        // The data must end with the '>' end-of-data marker
        if ($i >= $length || $input[$i] != '>')
            return "";

        // $isOdd is false when a high nibble is still pending:
        // treat the unpaired digit as if it were followed by 0
        if (!$isOdd)
            $output .= chr($codeHigh * 16);

        return $output;
    }

    function decodeAscii85($input) {
        $output = "";

        $ords = array();

        for ($i = 0, $state = 0; $i < strlen($input) && $input[$i] != '~'; $i++) {
            $c = $input[$i];

            // Skip whitespace (double quotes so the escapes are interpreted).
            // '%' is a valid ASCII85 digit, so it is not treated as a comment here.
            if ($c == "\0" || $c == "\t" || $c == "\r" || $c == "\f" || $c == "\n" || $c == ' ')
                continue;
            if ($c == 'z' && $state === 0) {
                $output .= str_repeat(chr(0), 4);
                continue;
            }

            // Compare byte values explicitly: valid digits are '!' (33) to 'u' (117)
            $code = ord($c);
            if ($code < ord('!') || $code > ord('u'))
                return "";

            $ords[$state++] = $code - ord('!');

            if ($state == 5) {
                $state = 0;
                for ($sum = 0, $j = 0; $j < 5; $j++)
                    $sum = $sum * 85 + $ords[$j];
                for ($j = 3; $j >= 0; $j--)
                    $output .= chr(($sum >> ($j * 8)) & 0xFF);
            }
        }
        if ($state === 1)
            return "";
        elseif ($state > 1) {
            for ($i = 0, $sum = 0; $i < $state; $i++)
                $sum += ($ords[$i] + (int) ($i == $state - 1)) * pow(85, 4 - $i);
            // Was "$ouput", which silently dropped the final partial group
            for ($i = 0; $i < $state - 1; $i++)
                $output .= chr(($sum >> ((3 - $i) * 8)) & 0xFF);
        }

        return $output;
    }

    function decodeFlate($input) {
        // gzuncompress() returns false on corrupt data
        $data = gzuncompress($input);
        return $data === false ? "" : $data;
    }

    function getObjectOptions($object) {
        $options = array();

        if (preg_match("#<<(.*)>>#ismU", $object, $options)) {
            $options = explode("/", $options[1]);
            array_shift($options);

            $o = array();
            for ($j = 0; $j < count($options); $j++) {
                $options[$j] = preg_replace("#\s+#", " ", trim($options[$j]));
                if (strpos($options[$j], " ") !== false) {
                    $parts = explode(" ", $options[$j]);
                    $o[$parts[0]] = $parts[1];
                } else
                    $o[$options[$j]] = true;
            }
            $options = $o;
            unset($o);
        }

        return $options;
    }

    function getDecodedStream($stream, $options) {
        $data = "";
        if (empty($options["Filter"]))
            $data = $stream;
        else {
            $length = !empty($options["Length"]) ? $options["Length"] : strlen($stream);
            $_stream = substr($stream, 0, $length);

            foreach ($options as $key => $value) {
                if ($key == "ASCIIHexDecode")
                    $_stream = $this->decodeAsciiHex($_stream);
                if ($key == "ASCII85Decode")
                    $_stream = $this->decodeAscii85($_stream);
                if ($key == "FlateDecode")
                    $_stream = $this->decodeFlate($_stream);
                if ($key == "Crypt") { // TO DO
                }
            }
            $data = $_stream;
        }
        return $data;
    }

    function getDirtyTexts(&$texts, $textContainers) {
        for ($j = 0; $j < count($textContainers); $j++) {
            if (preg_match_all("#\[(.*)\]\s*TJ[\r\n]#ismU", $textContainers[$j], $parts))
                $texts = array_merge($texts, $parts[1]);
            elseif (preg_match_all("#T[dwmf]\s*(\(.*\))\s*Tj[\r\n]#ismU", $textContainers[$j], $parts))
                $texts = array_merge($texts, $parts[1]);
            elseif (preg_match_all("#T[dwmf]\s*(\[.*\])\s*Tj[\r\n]#ismU", $textContainers[$j], $parts))
                $texts = array_merge($texts, $parts[1]);
        }
    }

    function getCharTransformations(&$transformations, $stream) {
        preg_match_all("#([0-9]+)\s+beginbfchar(.*)endbfchar#ismU", $stream, $chars, PREG_SET_ORDER);
        preg_match_all("#([0-9]+)\s+beginbfrange(.*)endbfrange#ismU", $stream, $ranges, PREG_SET_ORDER);

        for ($j = 0; $j < count($chars); $j++) {
            $count = $chars[$j][1];
            $current = explode("\n", trim($chars[$j][2]));
            for ($k = 0; $k < $count && $k < count($current); $k++) {
                if (preg_match("#<([0-9a-f]{2,4})>\s+<([0-9a-f]{4,512})>#is", trim($current[$k]), $map))
                    $transformations[str_pad($map[1], 4, "0")] = $map[2];
            }
        }
        for ($j = 0; $j < count($ranges); $j++) {
            $count = $ranges[$j][1];
            $current = explode("\n", trim($ranges[$j][2]));
            for ($k = 0; $k < $count && $k < count($current); $k++) {
                if (preg_match("#<([0-9a-f]{4})>\s+<([0-9a-f]{4})>\s+<([0-9a-f]{4})>#is", trim($current[$k]), $map)) {
                    $from = hexdec($map[1]);
                    $to = hexdec($map[2]);
                    $_from = hexdec($map[3]);

                    for ($m = $from, $n = 0; $m <= $to; $m++, $n++)
                        $transformations[sprintf("%04X", $m)] = sprintf("%04X", $_from + $n);
                } elseif (preg_match("#<([0-9a-f]{4})>\s+<([0-9a-f]{4})>\s+\[(.*)\]#ismU", trim($current[$k]), $map)) {
                    $from = hexdec($map[1]);
                    $to = hexdec($map[2]);
                    $parts = preg_split("#\s+#", trim($map[3]));

                    for ($m = $from, $n = 0; $m <= $to && $n < count($parts); $m++, $n++)
                        $transformations[sprintf("%04X", $m)] = sprintf("%04X", hexdec($parts[$n]));
                }
            }
        }
    }

    function getTextUsingTransformations($texts, $transformations) {
        $document = "";
        for ($i = 0; $i < count($texts); $i++) {
            $isHex = false;
            $isPlain = false;

            $hex = "";
            $plain = "";
            for ($j = 0; $j < strlen($texts[$i]); $j++) {
                $c = $texts[$i][$j];
                switch ($c) {
                    case "<":
                        $hex = "";
                        $isHex = true;
                    break;
                    case ">":
                        $hexs = str_split($hex, $this->multibyte); // 2 or 4 (UTF8 or ISO)
                        for ($k = 0; $k < count($hexs); $k++) {
                            $chex = str_pad($hexs[$k], 4, "0"); // Add trailing zeros
                            if (isset($transformations[$chex]))
                                $chex = $transformations[$chex];
                            $document .= html_entity_decode("&#x".$chex.";");
                        }
                        $isHex = false;
                    break;
                    case "(":
                        $plain = "";
                        $isPlain = true;
                    break;
                    case ")":
                        $document .= $plain;
                        $isPlain = false;
                    break;
                    case "\\":
                        // Guard against a backslash at the very end of the string
                        $c2 = ($j + 1 < strlen($texts[$i])) ? $texts[$i][$j + 1] : "";
                        if (in_array($c2, array("\\", "(", ")"))) $plain .= $c2;
                        // Double quotes, so the real control characters are appended
                        elseif ($c2 == "n") $plain .= "\n";
                        elseif ($c2 == "r") $plain .= "\r";
                        elseif ($c2 == "t") $plain .= "\t";
                        elseif ($c2 == "b") $plain .= chr(8);
                        elseif ($c2 == "f") $plain .= "\f";
                        elseif ($c2 !== "" && strpos("01234567", $c2) !== false) {
                            // Octal escape: take the leading run of up to three octal digits
                            preg_match("#^[0-7]{1,3}#", substr($texts[$i], $j + 1, 3), $m);
                            $oct = $m[0];
                            $j += strlen($oct) - 1;
                            $plain .= html_entity_decode("&#".octdec($oct).";", $this->convertquotes);
                        }
                        $j++;
                    break;

                    default:
                        if ($isHex)
                            $hex .= $c;
                        if ($isPlain)
                            $plain .= $c;
                    break;
                }
            }
            $document .= "\n";
        }

        return $document;
    }
}
?>
Oct 31 '16 #1
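A minimal sketch of the flow described in the question (extract the text first, then insert the text rather than the PDF bytes). It assumes PHP 5.4 or later, PDO, and a hypothetical `pdf_texts` table with `filename` and `content` columns; none of these appear in the post. The extractor is passed in as a callable so the sketch does not depend on any particular PDF class.

```php
<?php
/**
 * Extract text from a PDF and store the text, not the binary file.
 * The table name `pdf_texts` and its columns are hypothetical.
 */
function storePdfText(PDO $db, $filename, callable $extract)
{
    $text = $extract($filename);
    if ($text === '' || $text === null || $text === false) {
        // Nothing was extracted: refuse to store an empty row.
        return false;
    }
    // A prepared statement keeps the extracted text from being parsed as SQL.
    $stmt = $db->prepare('INSERT INTO pdf_texts (filename, content) VALUES (?, ?)');
    return $stmt->execute(array(basename($filename), $text));
}
```

With the class from the question, the callable would wrap setFilename(), decodePDF() and output() and return the resulting string. A `false` return value then points at the extraction step rather than the database.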
Dormilich
are you using PHP 4?

what error messages do you get?
Nov 1 '16 #2
Yes, I am using PHP 4.

I don't get any error, but the PDF files are not being converted.
Nov 1 '16 #3
Dormilich
then you need to debug your code. check at suitable positions whether the variables hold the values you expect.

also remove the @s and turn up error reporting.
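The two suggestions above can be sketched as follows. `debugCheckpoint()` is a hypothetical helper, not part of the class, and where to call it (for example right after the file is read, or after each stream is decoded) is a judgment call.

```php
<?php
// Report every notice, warning and error while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');

/**
 * Hypothetical helper: print a labelled checkpoint and return the
 * string length, so empty intermediate values stand out.
 */
function debugCheckpoint($label, $value)
{
    $length = is_string($value) ? strlen($value) : 0;
    echo $label . ': ' . gettype($value) . ' (' . $length . " bytes)\n";
    return $length;
}
```

Calling `debugCheckpoint('infile', $infile)` inside decodePDF() would show whether the file was read at all, which the `@` operator otherwise hides.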

note: this if ($c < '!' || $c > 'u') looks odd. PHP compares two numeric strings as numbers when used with < or >, so comparing ord($c) against ord('!') and ord('u') would make the intent explicit.
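One way to take PHP's string-comparison rules out of the picture is to compare byte values with ord(). This is a small sketch; `isAscii85Char()` is a hypothetical name, and the range '!' (33) to 'u' (117) is the ASCII85 digit range the class checks for.

```php
<?php
/**
 * True when $c is a single byte in the ASCII85 digit range,
 * '!' (33) through 'u' (117).
 */
function isAscii85Char($c)
{
    if (!is_string($c) || strlen($c) !== 1) {
        return false;
    }
    $code = ord($c);
    return $code >= ord('!') && $code <= ord('u');
}
```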
Nov 1 '16 #4
Hi,
The PDF files should be converted when they go into my database, not into a folder.

I am inserting PDF files into the DB, but they are stored as PDF only and not converted to text. How do I do this?
Nov 1 '16 #5
Dormilich
you need to debug your code. simple as that.
Nov 1 '16 #6
Yes, but I am getting an error when I upload PDF files.
Nov 2 '16 #7
This is the error I am getting: Allowed memory size of 134217728 bytes exhausted (tried to allocate 7168 bytes)
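The 134217728 bytes in that message is a 128 MB memory_limit. The class reads the whole file into one string and then copies parts of it through regular expressions, so a large PDF is one plausible cause, though the thread does not confirm it. A rough sketch of a size guard follows; both function names are hypothetical and the factor of 4 is a guess, not a measured value.

```php
<?php
/**
 * Convert a php.ini size such as "128M" to bytes; -1 means unlimited.
 */
function iniSizeToBytes($value)
{
    $value = trim($value);
    if ($value === '-1') {
        return -1;
    }
    $number = (int) $value;
    switch (strtoupper(substr($value, -1))) {
        case 'G': return $number * 1024 * 1024 * 1024;
        case 'M': return $number * 1024 * 1024;
        case 'K': return $number * 1024;
        default:  return $number;
    }
}

/**
 * Rough guard: assume decoding needs about $factor times the file size
 * on top of the memory the script already uses.
 */
function fitsInMemory($fileBytes, $limitBytes, $factor = 4)
{
    if ($limitBytes < 0) {
        return true; // no limit configured
    }
    return $fileBytes * $factor <= $limitBytes - memory_get_usage();
}
```

A call such as `fitsInMemory(filesize($path), iniSizeToBytes(ini_get('memory_limit')))` before decoding would let the upload script reject oversized files with a clear message instead of a fatal error.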
Nov 2 '16 #8
