
HELP: parsing unicode web sites

I need help parsing Unicode web pages and downloading JPEG image files with Perl scripts.

I read http://www.cs.utk.edu/cs594ipm/perl/crawltut.html about using LWP, HTTP, or the get($url) function, but the content returned is always garbled. When I use get($url) on a non-Unicode web page, the content comes back as clean ASCII.

Now I want to parse http://www.tom365.com/movie_2004/html/5507.html, and the page I get back is garbled because of its encoding. I have read about Encode but don't know how to use it.
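For readers unsure how Encode fits in, here is a minimal sketch. It assumes the page is served in GB2312, which is only a guess for a Chinese site; the helper name decode_page and the hard-coded sample bytes are hypothetical and stand in for a real response body.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw response bytes into a Perl character string.
# $charset is an assumption; use whatever the page actually declares.
sub decode_page {
    my ($raw_bytes, $charset) = @_;
    return decode($charset, $raw_bytes);
}

# Stand-in for a downloaded body: the word for "movie" as GB2312 bytes.
my $raw_bytes = "\xB5\xE7\xD3\xB0";
my $text      = decode_page($raw_bytes, 'euc-cn');   # Encode's name for GB2312

# Encode again before output so the terminal or file receives well-formed UTF-8.
my $utf8_bytes = encode('UTF-8', $text);
printf "%d bytes in, %d characters, %d UTF-8 bytes out\n",
    length($raw_bytes), length($text), length($utf8_bytes);
```

The general idea is to decode bytes once on input, work with characters inside the program, and encode once on output.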

I need a Perl script that parses the page above and extracts the image URL from markup matching this pattern:

<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" />
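A pattern like the one above can be matched with a single capturing regex. The sketch below is one possible approach, not a general HTML parser; the helper name extract_movie_image is hypothetical, and it assumes the markup always appears exactly in this form.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first movie image URL found in $html,
# or undef when the expected markup is absent.
sub extract_movie_image {
    my ($html) = @_;
    # [^"]+ captures everything up to the closing quote of the src attribute.
    return $html =~ m{<div class="movie"><img src="([^"]+)" class="mp" />}
        ? $1
        : undef;
}

my $sample = '<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" />';
my $url    = extract_movie_image($sample);
print defined $url ? "$url\n" : "no match\n";
```

Using m{...} as the delimiter avoids having to escape the slashes and quotes in the HTML.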

If anyone knows how to parse Unicode web pages like this, I'd be very grateful.

Thank you
Jul 31 '08 #1
Thanks to those who helped. Here's my working script:
#!/usr/bin/perl
# tom365crawl2.pl
# References:
#   http://www.cs.utk.edu/cs594ipm/perl/crawltut.html
#   http://perldoc.perl.org/Encode.html
#   http://juerd.nl/site.plp/perluniadvice
#   http://www.perlmonks.org/?node_id=620068

use warnings;
use strict;

use LWP::UserAgent;
use HTTP::Request;

my $site1   = "http://www.tom365.com/";  # Full URL looks like http://www.tom365.com/movie_2004/html/????.html
my $folder1 = "movie_2004/html/";
my $start1  = 1000;                       # first page number to fetch
my $end1    = 1000;                       # last page number to fetch

my $browser1 = LWP::UserAgent->new();
$browser1->timeout(10);

for my $count ($start1 .. $end1) {
  my $url1 = $site1 . $folder1 . $count . ".html";
  printf "Downloading %s\n", $url1;

  my $request1  = HTTP::Request->new(GET => $url1);
  my $response1 = $browser1->request($request1);
  if ($response1->is_error()) {
    printf "%s\n", $response1->status_line;
    next;  # nothing to parse when the request failed
  }

  # decoded_content() undoes any Content-Encoding and decodes textual bodies
  # into Perl characters, which is what fixed the garbled output.
  my $contents1 = $response1->decoded_content();
  next unless defined $contents1;

  # [^"]+ stops at the closing quote of src; the earlier (.*) was greedy
  # and could run past it to a later class="mp" on the same line.
  if ($contents1 =~ m{<div class="movie"><img src="([^"]+)" class="mp" />}) {
    my $image1 = $1;
    printf "Downloading %s\n", $image1;
    # The list form of system() bypasses the shell, so characters in the URL
    # cannot be interpreted as shell syntax.
    system('wget', '-q', '-O', "$count.jpg", $image1) == 0
      or warn "wget failed for $image1\n";
  } else {
    warn "No movie image found in $url1\n";
  }
}
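
The script above shells out to wget for the image download. As an alternative sketch, the same LWP::UserAgent object can write the file itself, which removes the dependency on an external program. The helper name save_image is hypothetical and not part of the original script.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: save $url to $path using LWP instead of wget.
# Returns true on success and false otherwise.
sub save_image {
    my ($ua, $url, $path) = @_;
    # ':content_file' asks LWP to write the response body straight to $path.
    my $response = $ua->get($url, ':content_file' => $path);
    warn "Failed to fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->is_success;
}

my $ua = LWP::UserAgent->new(timeout => 10);
# Inside the crawl loop the call would look like:
#   save_image($ua, $image1, "$count.jpg");
```

Because the body is written as raw bytes, no decoding step is involved for the JPEG data.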
Aug 4 '08 #2
