Bytes IT Community

HELP: parsing unicode web sites

I need help parsing Unicode web pages and downloading JPEG image files with Perl scripts.

I read http://www.cs.utk.edu/cs594ipm/perl/crawltut.html about using the LWP and HTTP modules and the get($url) function, but the content returned is always garbled. When I use get($url) on a non-Unicode web page, the content comes back as clean ASCII.

Now I want to parse http://www.tom365.com/movie_2004/html/5507.html, but the page I get back is garbled. I have read about the Encode module but don't know how to use it.
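For what it's worth, tom365.com pages of that era declared charset=gb2312, so the "garbled" bytes are most likely GBK-encoded Chinese rather than broken Unicode. A minimal sketch of what Encode does with such bytes (the byte string below is an illustrative assumption: it is the GBK encoding of the two characters 中文):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# "\xD6\xD0\xCE\xC4" is the GBK byte sequence for the two characters 中文.
my $bytes = "\xD6\xD0\xCE\xC4";

# decode() turns encoded bytes into a Perl character string.
my $text = decode('gbk', $bytes);

printf "characters: %d\n", length($text);   # 2 characters, not 4 bytes

# Re-encode to UTF-8 on the way out so the terminal shows real text.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

LWP's decoded_content() does this same decode step for you, using the charset the page declares.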

I need a Perl script that parses the above page and extracts the image URL from this pattern:

<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" />
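For reference, once the page is decoded, pulling the URL out of that snippet is a single capturing match. A sketch against the literal snippet above, using a non-greedy (.*?) so the match stops at the first closing quote:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" />';

# Capture everything between src=" and the next " (non-greedy).
if ($html =~ /<div class="movie"><img src="(.*?)" class="mp" \/>/) {
    print "$1\n";   # http://pic.tom365.com/imgs/tongjifan.jpg
}
```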

If anyone knows how to parse Unicode web pages like this, I'd be very grateful.

Thank you
Jul 31 '08 #1
1 Reply


Thanks to those who helped. Here's my working script:
```perl
#!/usr/bin/perl
# tom365crawl2.pl
# References:
# http://www.cs.utk.edu/cs594ipm/perl/crawltut.html
# http://perldoc.perl.org/Encode.html
# http://juerd.nl/site.plp/perluniadvice
# http://www.perlmonks.org/?node_id=620068

use warnings;
use strict;

use LWP::UserAgent;
use HTTP::Request;

my $site1   = "http://www.tom365.com/"; # full URL like http://www.tom365.com/movie_2004/html/????.html
my $folder1 = "movie_2004/html/";
my $start1  = 1000;
my $end1    = 1000;

my $browser1 = LWP::UserAgent->new();
$browser1->timeout(10);

for my $count ($start1 .. $end1) {
    my $url1 = $site1 . $folder1 . $count . ".html";
    printf "Downloading %s\n", $url1;

    my $request1  = HTTP::Request->new(GET => $url1);
    my $response1 = $browser1->request($request1);
    if ($response1->is_error()) {
        printf "%s\n", $response1->status_line;
        next;
    }

    # decoded_content() converts the raw bytes to Perl characters using the
    # charset the page declares, so the HTML is no longer garbled.
    my $contents1 = $response1->decoded_content();

    # Non-greedy (.*?) so the match stops at the first closing quote.
    if ($contents1 =~ /<div class="movie"><img src="(.*?)" class="mp" \/>/) {
        my $image1 = $1;
        printf "Downloading %s\n", $image1;
        # Pass wget's arguments as a list so the shell never sees the
        # URL unquoted (safer than backticks with interpolation).
        system('wget', '-q', '-O', "$count.jpg", $image1);
    } else {
        printf "No image found in %s\n", $url1;
    }
}
```
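To save each image under its original remote filename instead of a counter, the basename can be captured from the image URL. A small sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $image1 = 'http://pic.tom365.com/imgs/tongjifan.jpg';

# Everything after the last slash is the remote filename.
my ($name) = $image1 =~ m{/([^/]+)$};
print "$name\n";   # tongjifan.jpg
```

The captured $name can then replace "$count.jpg" in the wget call.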
Aug 4 '08 #2
