PDF2Txt

How do I use the File::Extract::PDF module to extract the text from a PDF? I need guidance on writing the program.

Thank you.
Oct 8 '07 #1
Kelicula
176 Expert 100+
I do not have that module loaded, and there is not a lot of documentation on it.
But from taking a look at the source, it seems this would print each line in the entire file:
    #!/usr/bin/perl

    use strict;
    use warnings;

    use File::Extract::PDF;

    # Untested guess at the interface, based on a quick read of the module source.
    my $target = File::Extract::PDF->new;

    $target->extract(FH, "pdfdocument.pdf") or die;

    while (<FH>) {
        print;    # $_ already ends with a newline
    }

    close(FH);

Unfortunately I can't test it.
I am only hoping to get the ball rolling, and hope to learn from this myself.

If (or when) this doesn't work, post any errors you may get.

Good day.
Oct 8 '07 #2
numberwhun
3,503 Expert Mod 2GB
In addition, if you want to write the lines to a text file instead of the screen, use the open() function to open the text file and then add the filehandle to the print statement, like so:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use File::Extract::PDF;

    # Open the output file, stopping with a message if that fails.
    open(NEWFILE, ">./newfile.txt") or die "Cannot open newfile.txt: $!";
    my $target = File::Extract::PDF->new;

    $target->extract(FH, "pdfdocument.pdf") or die;

    while (<FH>) {
        print NEWFILE $_;    # $_ already ends with a newline
    }

    close(FH);
    close(NEWFILE);

Regards,

Jeff
Oct 8 '07 #3
Thank you for your reply.
When I run this code I get the following error. How do I fix it?
Bareword "FH" not allowed

Waiting for your reply.
Regards,
kamalatanvi
Oct 9 '07 #4
numberwhun
3,503 Expert Mod 2GB
I would have to say that this module (at version 0.06, which is WELL below version 1.0) probably has many issues, as it looks to be fairly new. It may be that the extract function is not yet fully debugged.

You have a couple of options here.

1. I would go through the module code and ensure that the way you are using it is completely correct.
2. If it is, you could always email the author and see what their input is.
3. You could always implement your own solution to this (a much longer route).

This is generally the problem with modules that are so very new. They tend to be "not ready for primetime" but are available on CPAN. If you check, there is NO documentation on CPAN for this module either.
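
Separately from the module's maturity, the message `Bareword "FH" not allowed` is what Perl itself reports under `use strict` when a bare name such as FH is passed as an ordinary argument to a function or method. The usual way around it is a lexical filehandle (`my $fh`). The sketch below uses only core Perl to show the pattern; `write_lines` is a hypothetical helper made up for illustration, not part of File::Extract::PDF, and it says nothing about what that module's extract method actually expects.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper standing in for any routine that writes to a handle
# it is given. It is NOT part of File::Extract::PDF.
sub write_lines {
    my ($fh, @lines) = @_;
    print {$fh} "$_\n" for @lines;
    return scalar @lines;
}

# A lexical filehandle opened on an in-memory buffer (core Perl, no modules).
my $buffer = '';
open(my $out, '>', \$buffer) or die "Cannot open in-memory handle: $!";

# Passing $out is fine under strict. Passing a bareword such as FH in this
# position fails to compile with:
#   Bareword "FH" not allowed while "strict subs" in use
my $count = write_lines($out, 'first line', 'second line');
close($out) or die "Cannot close handle: $!";

print "wrote $count lines:\n$buffer";
```

Replacing the in-memory buffer with a file name in the open() call gives the same pattern for a real output file.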

Regards,

Jeff
Oct 9 '07 #5
Hi, I'm quite new to the Perl world, but I think I can help, though by using the CAM::PDF module instead. I found I could extract text from PDF pages with the following statements:

    # ...
    use CAM::PDF;
    use CAM::PDF::PageText;    # provides render(); CAM::PDF does not load it up front

    # ...

    my $pdf = CAM::PDF->new($filename);

    print ARCHIVO ( CAM::PDF::PageText->render($pdf->getPageContentTree($numpage)));

    # ...

This should print the text of the PDF page, as plain text, to the filehandle ARCHIVO, which is associated with a *.txt file in my program. Because the method returns a string, you can also process it further. Hope this helps.
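
For anyone who wants the whole document rather than a single page, here is a minimal sketch of the same idea as a complete script. It assumes CAM::PDF is installed and uses its `numPages` and `getPageText` methods; as far as I know, `getPageText` is a convenience wrapper around the same `getPageContentTree` and `CAM::PDF::PageText->render` pair shown above. The sub name `write_pdf_text` and the script and file names in the usage comment are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Write the text of every page of $pdf to the open handle $out.
# $pdf is expected to behave like a CAM::PDF object (numPages, getPageText).
# Returns the number of pages that yielded text.
sub write_pdf_text {
    my ($pdf, $out) = @_;
    my $pages_with_text = 0;
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        next unless defined $text && length $text;    # nothing extractable on this page
        print {$out} $text, "\n";
        $pages_with_text++;
    }
    return $pages_with_text;
}

# Usage (names are placeholders): perl pdf2txt.pl input.pdf output.txt
if (@ARGV == 2) {
    my ($pdf_file, $txt_file) = @ARGV;
    require CAM::PDF;    # loaded only when a file is actually converted
    my $pdf = CAM::PDF->new($pdf_file)
        or die "Cannot open PDF '$pdf_file'\n";
    open(my $out, '>', $txt_file)
        or die "Cannot write '$txt_file': $!\n";
    my $count = write_pdf_text($pdf, $out);
    close($out) or die "Cannot close '$txt_file': $!\n";
    print "Extracted text from $count page(s)\n";
}
```

Keeping the page loop in its own sub means it can be tried out with any object that offers those two methods, and the lexical filehandle avoids the bareword problem reported earlier in the thread.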
Oct 10 '07 #6