By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
443,836 Members | 2,097 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 443,836 IT Pros & Developers. It's quick & easy.

Adapting Site Scraping Script

P: 5
"Newbie needs help"

Hi all,

I had a programmer do a site scraping script for me.. the aim was to scrape data from 5 different sites and upload directly into my website databse. I started to study the five .pl files and found that there is only small changes in the files e.g. ( the url from the site to be scraped changes for obvious reasons etc. ) There is also a few other lines of code which I dont understand. I would like to be able to add more sites by making additional .pl files but not sure what info to change.

One sample file section, where the code changes for each site to be scraped is

Expand|Select|Wrap|Line Numbers
  1. write_to_logArray("\nPinging http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
  2. $agent->get("http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
  3. write_to_logArray("Status of the above ping : ".$agent->status."\n");
  4.  
  5. my @links = $agent->find_all_links(url_regex => qr/item.php\?s=s\&count=/);
  6. my $len = scalar(@links);
  7. if ( $len == 0) {
  8.     $foundurl = -1;
  9.     next;
  10. }
  11.  
  12. ## the above find_all may get duplicates, eliminate the same by not pinging until we get a new link
  13. my %linkURLs;
  14. foreach my $link (@links) {
  15.     $linkURLs{$link->url_abs()} = ();
  16. }
  17.  
  18. @links = keys %linkURLs;
  19.  
  20.  
  21.  
  22. # write 5 links to runinfo;
  23. $count = 0;
  24.  
  25. #write_to_logArray( "Connecting to the database ".$properties{$dbname}." for the user ".$properties{$dbusername}."\n");
  26. #write_to_logArray( "Connected successfully \n");
  27. #login();
  28.  
  29.  
  30. foreach my $link (@links) {
  31.     my $url = $link;
  32.     write_to_logArray("\nPinging ".$url);
  33.     $agent->get($url);
  34.     write_to_logArray("Status of the above ping : ".$agent->status."\n");
  35.     my $line = $agent->content;
  36.     get_values($line);
  37.     get_images();
  38.     %images = ();
  39. }
  40. # Disconnect from the database.
  41.  
Would some one be kind enough to advise me what values etc will change if I make a new .pl file for this url

" http://autos.blue-sock.com/clients/5/list.php?count= "
Mar 5 '07 #1
Share this Question
Share on Google+
9 Replies


KevinADC
Expert 2.5K+
P: 4,059
change these:

/12/

to:

/5/

and give it a try.
Mar 5 '07 #2

P: 5
I thought of that and did change it but it doesnt work...
the .pl file runs under cmd but no images, log file data is gathered.
Thats why i assumed that the lines of code below the url bit needs fixing etc.
Mar 5 '07 #3

KevinADC
Expert 2.5K+
P: 4,059
I don't see why it wouldn't work. Nothing after the URL's at the top of the script look like they should be altered. Maybe someone else can spot something.
Mar 5 '07 #4

P: 5
here is the complete file
Expand|Select|Wrap|Line Numbers
  1. #!/usr/bin/perl -w
  2. use strict;
  3. #use LWP::Debug qw(+);
  4.  
  5. use WWW::Mechanize;
  6. use HTML::TokeParser;
  7. use DBConnect::db;
  8. use Date::Manip;
  9. use Cwd;
  10. use Net::FTP;
  11.  
  12.  
  13. sub loadlinks;
  14. sub write_to_log;
  15. sub write_to_logArray;
  16. sub get_values;
  17.  
  18. my @logLines;
  19. my $logCounter=0;
  20.  
  21. my %images;
  22. my $agent = WWW::Mechanize->new(agent=>"Mozilla/4.0 (compatible; MSIE 5.0b2; Windows NT)") ;
  23. my $db = DBConnect::db->new();
  24. my $ftp;
  25.  
  26. loadlinks();
  27. write_to_log;
  28.  
  29.  
  30.  
  31. sub write_to_log
  32. {
  33.         open(LOG,">>NN2.log") || die "Cannot open NN2.log";
  34.         my $temp=localtime(time());
  35.         print LOG ("Time:$temp\n");
  36.         foreach my $line (@logLines)
  37.         {
  38.          print LOG $line."\n";
  39.         }
  40.         close LOG;
  41. }
  42.  
  43. sub write_to_logArray
  44. {
  45.   push @logLines,$_[0];
  46.   #print "in the array log\n";
  47.   $logCounter++;
  48.   if ($logCounter > 10)
  49.   {
  50.         write_to_log;
  51.         $logCounter=0;
  52.         @logLines="";
  53.  
  54.   }
  55. }
  56.  
  57. sub loadlinks()
  58. {
  59.   my $count = 0;
  60.   my $sleepTime;
  61.   my $startTime = UnixDate("today","%Y/%m/%d %H:%M:%S");  
  62.  
  63.  
  64.   my $foundurl = 1;
  65.   my $pg = 0;
  66.  
  67.   est_ftp();
  68.   $db->connect();
  69.  
  70.   while ($foundurl == 1)
  71.   {
  72.           my $urlPg = $pg*30;
  73.           $pg ++;
  74.           ## remove code before shipping
  75.           #if ($pg > 1)
  76.           #{
  77.       #  $foundurl = -1;
  78.       #  next;
  79.           #}
  80.           write_to_logArray("\nPinging http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
  81.           $agent->get("http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
  82.           write_to_logArray("Status of the above ping : ".$agent->status."\n");
  83.  
  84.         my @links = $agent->find_all_links(url_regex => qr/item.php\?s=s\&count=/);
  85.       my $len = scalar(@links);
  86.       if ( $len == 0)
  87.       {
  88.         $foundurl = -1;
  89.         next;
  90.       }
  91.  
  92.       ## the above find_all may get duplicates, eliminate the same by not pinging until we get a new link
  93.       my %linkURLs;
  94.       foreach my $link (@links)
  95.       {
  96.         $linkURLs{$link->url_abs()} = ();
  97.       }
  98.  
  99.       @links = keys %linkURLs;
  100.  
  101.  
  102.  
  103.           # write 5 links to runinfo;
  104.           $count = 0;
  105.  
  106.           #write_to_logArray( "Connecting to the database ".$properties{$dbname}." for the user ".$properties{$dbusername}."\n");
  107.           #write_to_logArray( "Connected successfully \n");
  108.           #login();
  109.  
  110.           foreach my $link (@links)
  111.           {
  112.             my $url = $link; 
  113.             write_to_logArray("\nPinging ".$url);
  114.             $agent->get($url);
  115.             write_to_logArray("Status of the above ping : ".$agent->status."\n");
  116.         my $line = $agent->content;
  117.           get_values($line);
  118.             get_images();
  119.             %images = ();
  120.           }
  121.           # Disconnect from the database.
  122.    }
  123.  
  124.    $ftp->quit();
  125.    ## delete those records that are not updated now
  126.    #$db->connect();
  127.    $db->deleteRecords($startTime,'NN2');
  128.    $db->disconnect();
  129.  
  130. }
  131.  
  132. sub get_values
  133. {
  134.     my $count;
  135.  
  136.         my $line = $_[0];
  137.         $line =~ s/\n//g;
  138.     $_ = $line;
  139.  
  140.     my @pat = m{Ref\s*<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  141.     my $refNum = join('',@pat);
  142.  
  143.         @pat = m{Reg\sYear\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  144.     my $year = join('',@pat);
  145.  
  146.     @pat = m{CC\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  147.     my $engine = join('',@pat);
  148.  
  149.     @pat = m{Price<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*\&pound\;\s*(.*?)\s*</td>}gis;
  150.     my $price = join('',@pat);
  151.  
  152.     @pat = m{Fuel\sType<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*(.*?)\s*</td>}gis;
  153.     my $fuel = join('',@pat);
  154.  
  155.     @pat = m{Colour\s<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  156.     my $color = join('',@pat);
  157.  
  158.     @pat = m{Mileage\s<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  159.     my $milage = join('',@pat);
  160.  
  161.     @pat = m{Category<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*(.*?)\s*</td>}gis;
  162.     my $category = join('',@pat);
  163.  
  164.     @pat = m{Damage\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  165.     my $description = join('',@pat);
  166.         $description =~ s/<BR>/ /g;
  167.  
  168.         @pat = m{Model<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  169.         my $make = join('',@pat);
  170.         $make =~ s/[\s]+/ /g;
  171.         $make =~ s/\&nbsp\;/ /g;
  172.         $make =~ s/\.//g;
  173.         $make = uc($make);
  174.  
  175.         my $itemname = "";
  176.  
  177.     ##images
  178.     my @allimages = m{src=\'(.*?)\'}gis;
  179.     my $image = join(',',@allimages);
  180.     $images{$refNum} = $image;
  181.     #print "\nNN2,".$refNum.",".$make.",".$year.",".$engine.",".$price.",".$fuel.",".$color.",".$milage.",".$category.",".$description;
  182.  
  183.     # Write to Database
  184.     $db->updateDB('NN2',$make,
  185.     $refNum,$year,$engine,$price,$fuel,$color,$milage,$category,$description);
  186.  
  187.  
  188.     write_to_logArray( "Written to the database successfully");
  189.  
  190. }
  191.  
  192. sub est_ftp
  193. {
  194.   $ftp = Net::FTP->new("deleted.com", Debug => 0)
  195.       or die "Cannot connect for FTPing to deleted.com: $@";
  196.   $ftp->login("deleted",'deleted')
  197.       or die "Cannot login ", $ftp->message;
  198.   $ftp->mkdir("www/scripts/NN2");
  199.   $ftp->cwd("www/scripts/NN2/")
  200.       or die "Cannot change directory ", $ftp->message;
  201.   mkdir "NN2";
  202.   $ftp->binary;
  203. }
  204.  
  205. sub get_images
  206. {
  207.   my $mainURL = "http://autos.blue-sock.com/clients/12/";
  208.   foreach my $image (keys %images)
  209.   {
  210.     my @imgs = split(',', $images{$image});
  211.     my $imgcnt = 0;
  212.     foreach my $img (@imgs)
  213.     {
  214.       $imgcnt++;
  215.       my $imgname = $image."_".$imgcnt.".jpg";
  216.       $agent->get($mainURL.$img, ":content_file" => "NN2/$imgname");
  217.       $ftp->put("NN2/".$imgname);
  218.       $img =~ s/small/large/;
  219.       $imgname = $image."_L_".$imgcnt.".jpg";
  220.       $agent->get($mainURL.$img, ":content_file" => "NN2/$imgname");
  221.       $ftp->put("NN2/".$imgname);
  222.     }
  223.   }
  224.  
  225. }
  226.  
  227. sub get_random_number
  228. {
  229.  my $min = $_[0];
  230.  my $max = $_[1]; ;
  231.  
  232.  my $randomnumber = int(rand($max))+$min ;
  233.  #print ("Random Number is $randomnumber \n");
  234.  return  $randomnumber;
  235. }
  236.  
  237.  
in addition to changing the client No. to 5, I also changed the ( NN2 ) to ( NN6 ) as this would be the new image and log file name for the new host.

As they website layout for the 2 clients ( 12 ) and ( 5 ) is slightly different, surely there needs to be some additional changes for this to work??
Mar 6 '07 #5

P: 5
This is the logfile message when I run the new .pl file

Pinging http://autos.blue-sock.com/clients/5/list.php?count=0
Status of the above ping : 200
Mar 6 '07 #6

KevinADC
Expert 2.5K+
P: 4,059
As they website layout for the 2 clients ( 12 ) and ( 5 ) is slightly different, surely there needs to be some additional changes for this to work??
Maybe, maybe not. I don't know. But I honestly doubt anyone is going to go and research that for you, and most likely nobody is going to read through all that code and try and figure it out. Most people, myself included, are willing to help up to a point. I'm not going to wade through all that code and look at the website pages to try and figure it out. Thats more than I am willing to do, I hope you understand.
Mar 6 '07 #7

miller
Expert 100+
P: 1,089
Maybe, maybe not. I don't know. But I honestly doubt anyone is going to go and research that for you, and most likely nobody is going to read through all that code and try and figure it out. Most people, myself included, are willing to help up to a point. I'm not going to wade through all that code and look at the website pages to try and figure it out. Thats more than I am willing to do, I hope you understand.
Ditto.

I certainly respect that you're someone new to perl who is trying to learn so that you can do this project yourself. But it feels like the only way to help you now is the do it for you instead of helping you to learn. I don't have the time to read through all that code and decypher what needs to be changed and what the possible pitfalls are.

I will tell you one thing though. This feels to be like a script that should be generalized in some way. The fact that you currently have 5 copies that are slightly modified for each site is not the best approach. Instead a generalized script should be created that can do all the parsing based off of a data file or some other indicator for which sites need to be scraped.

To accomplish that, you might need to invest in another perl programmer for a time. I do not expect that it will be a big job at all, but it is probably one that will require more than a beginners level of experience.

- Miller
Mar 6 '07 #8

P: 5
I understand.. Thanks for your help.
Looks like I just gotta pay my programmer another $200 to add 5 more sites
could anyone do it cheaper?
Mar 8 '07 #9

KevinADC
Expert 2.5K+
P: 4,059
check on freelance programming sites and post a job and take bids. $200 sounds reasonable to me though.
Mar 8 '07 #10

Post your reply

Sign in to post your reply or Sign up for a free account.