PDF2Txt

How do I use the File::Extract::PDF module to extract the text from a PDF? I need guidance on writing the program.

Thank you
Oct 8 '07 #1
Kelicula
I do not have that module installed, and there is not much documentation for it.
But from a look at the source, it seems this would print each line of the extracted text:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use File::Extract::PDF;

    my $target = new File::Extract::PDF;

    $target->extract(FH, "pdfdocument.pdf") or die;

    while <FH> {
        print "$_\n";
    }

    close(FH);

Unfortunately I can't test it.
I am only hoping to get the ball rolling, and hope to learn from this myself.

If (or when) this doesn't work, post any errors you may get.

Good day
Oct 8 '07 #2
numberwhun
In addition, if you want to write the lines to a text file instead of the screen, use the open() function to open the text file and then add its filehandle to the print statement, like so:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use File::Extract::PDF;

    open(NEWFILE, ">./newfile.txt");
    my $target = new File::Extract::PDF;

    $target->extract(FH, "pdfdocument.pdf") or die;

    while <FH> {
        print NEWFILE "$_\n";
    }

    close(FH);
    close(NEWFILE);

Regards,

Jeff
Oct 8 '07 #3
Thank you for your reply.
When I run this code I get the following error. How can I fix it?
Bareword "FH" not allowed

Waiting for your reply.
Regards,
kamalatanvi
Oct 9 '07 #4
numberwhun
I would have to say that this module (at version 0.06, well below 1.0) probably has many issues, as it looks to be fairly new. The extract function may not be fully debugged yet.

You have a couple of options here.

1. I would go through the module code and ensure that the way you are using it is completely correct.
2. If it is, you could always email the author and see what their input is.
3. You could always implement your own solution to this (a much longer route).

This is generally the problem with modules that are so very new. They tend to be "not ready for primetime" but are available on CPAN. If you check, there is NO documentation on CPAN for this module either.
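
On the error itself: the full message is normally 'Bareword "FH" not allowed while "strict subs" in use', and it comes from Perl rather than from the module. Under "use strict", a bare name such as FH cannot be passed as an ordinary argument, and the loop in the posted code also needs parentheses around its condition. Whether extract() accepts a filehandle at all is not clear from this thread. The following is only a minimal sketch in core Perl showing lexical filehandles and the corrected loop. The copy_lines helper and the file names are made up for illustration, and it deliberately does not call File::Extract::PDF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): copy every line of one
# text file into another, using lexical filehandles instead of barewords.
sub copy_lines {
    my ($in_path, $out_path) = @_;

    # Three-argument open with lexical handles; check both for failure.
    open(my $in,  '<', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";

    my $count = 0;

    # The loop condition must be in parentheses: while (...) { ... }
    while (my $line = <$in>) {
        print {$out} $line;    # $line still carries its own newline
        $count++;
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";

    return $count;             # number of lines copied
}
```

A call such as copy_lines('extracted.txt', 'newfile.txt') returns the number of lines written. Because each line read from the handle already ends in a newline, printing "$_\n" as in the earlier code would double-space the output.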

Regards,

Jeff
Oct 9 '07 #5
Hi, I'm quite new to the Perl world, but I think I can help, though by using the CAM::PDF module instead. I found I could extract text from PDF pages with the following statements:

    .........
    use CAM::PDF;

    .........

    my $pdf = CAM::PDF->new($filename);

    print ARCHIVO ( CAM::PDF::PageText->render($pdf->getPageContentTree($numpage)));

    ........

This should print the text of the PDF page, as plain text, to the filehandle ARCHIVO, which is associated with a *.txt file in my program. Because the method returns a string, you can process it further before printing. Hope this helps.
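
For anyone who wants a complete starting point, here is a rough sketch that loops over every page and writes the text to a file. It assumes CAM::PDF is installed from CPAN and uses its numPages() and getPageText() methods. The pdf_to_text name and the file names are made up for illustration, and how clean the extracted text is will depend on the PDF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: write the text of every page of a PDF to a text file.
# Assumes the CAM::PDF module from CPAN is installed.
sub pdf_to_text {
    my ($pdf_path, $txt_path) = @_;

    # Load CAM::PDF at run time so a missing module fails with a clear error.
    require CAM::PDF;

    # new() returns a false value if the file cannot be read as a PDF.
    my $pdf = CAM::PDF->new($pdf_path)
        or die "Cannot open $pdf_path as a PDF\n";

    open(my $out, '>', $txt_path) or die "Cannot write $txt_path: $!";

    # Pages are numbered from 1.
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        next unless defined $text;    # skip pages with no extractable text
        print {$out} $text, "\n";
    }

    close($out) or die "Cannot close $txt_path: $!";
    return;
}
```

It would be called as pdf_to_text('pdfdocument.pdf', 'newfile.txt'). The lexical filehandle $out plays the role of ARCHIVO in the snippet above.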
Oct 10 '07 #6
