473,398 Members | 2,393 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,398 software developers and data experts.

Adapting Site Scraping Script

5
"Newbie needs help"

Hi all,

I had a programmer do a site scraping script for me.. the aim was to scrape data from 5 different sites and upload directly into my website databse. I started to study the five .pl files and found that there is only small changes in the files e.g. ( the url from the site to be scraped changes for obvious reasons etc. ) There is also a few other lines of code which I dont understand. I would like to be able to add more sites by making additional .pl files but not sure what info to change.

One sample file section, where the code changes for each site to be scraped is

Expand|Select|Wrap|Line Numbers
  1. write_to_logArray("\nPinging http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
  2. $agent->get("http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
  3. write_to_logArray("Status of the above ping : ".$agent->status."\n");
  4.  
  5. my @links = $agent->find_all_links(url_regex => qr/item.php\?s=s\&count=/);
  6. my $len = scalar(@links);
  7. if ( $len == 0) {
  8.     $foundurl = -1;
  9.     next;
  10. }
  11.  
  12. ## the above find_all may get duplicates, eliminate the same by not pinging until we get a new link
  13. my %linkURLs;
  14. foreach my $link (@links) {
  15.     $linkURLs{$link->url_abs()} = ();
  16. }
  17.  
  18. @links = keys %linkURLs;
  19.  
  20.  
  21.  
  22. # write 5 links to runinfo;
  23. $count = 0;
  24.  
  25. #write_to_logArray( "Connecting to the database ".$properties{$dbname}." for the user ".$properties{$dbusername}."\n");
  26. #write_to_logArray( "Connected successfully \n");
  27. #login();
  28.  
  29.  
  30. foreach my $link (@links) {
  31.     my $url = $link;
  32.     write_to_logArray("\nPinging ".$url);
  33.     $agent->get($url);
  34.     write_to_logArray("Status of the above ping : ".$agent->status."\n");
  35.     my $line = $agent->content;
  36.     get_values($line);
  37.     get_images();
  38.     %images = ();
  39. }
  40. # Disconnect from the database.
  41.  
Would some one be kind enough to advise me what values etc will change if I make a new .pl file for this url

" http://autos.blue-sock.com/clients/5/list.php?count= "
Mar 5 '07 #1
9 2248
KevinADC
4,059 Expert 2GB
change these:

/12/

to:

/5/

and give it a try.
Mar 5 '07 #2
perls
5
I thought of that and did change it but it doesnt work...
the .pl file runs under cmd but no images, log file data is gathered.
Thats why i assumed that the lines of code below the url bit needs fixing etc.
Mar 5 '07 #3
KevinADC
4,059 Expert 2GB
I don't see why it wouldn't work. Nothing after the URL's at the top of the script look like they should be altered. Maybe someone else can spot something.
Mar 5 '07 #4
perls
5
here is the complete file
Expand|Select|Wrap|Line Numbers
  1. #!/usr/bin/perl -w
  2. use strict;
  3. #use LWP::Debug qw(+);
  4.  
  5. use WWW::Mechanize;
  6. use HTML::TokeParser;
  7. use DBConnect::db;
  8. use Date::Manip;
  9. use Cwd;
  10. use Net::FTP;
  11.  
  12.  
  13. sub loadlinks;
  14. sub write_to_log;
  15. sub write_to_logArray;
  16. sub get_values;
  17.  
  18. my @logLines;
  19. my $logCounter=0;
  20.  
  21. my %images;
  22. my $agent = WWW::Mechanize->new(agent=>"Mozilla/4.0 (compatible; MSIE 5.0b2; Windows NT)") ;
  23. my $db = DBConnect::db->new();
  24. my $ftp;
  25.  
  26. loadlinks();
  27. write_to_log;
  28.  
  29.  
  30.  
  31. sub write_to_log
  32. {
  33.         open(LOG,">>NN2.log") || die "Cannot open NN2.log";
  34.         my $temp=localtime(time());
  35.         print LOG ("Time:$temp\n");
  36.         foreach my $line (@logLines)
  37.         {
  38.          print LOG $line."\n";
  39.         }
  40.         close LOG;
  41. }
  42.  
  43. sub write_to_logArray
  44. {
  45.   push @logLines,$_[0];
  46.   #print "in the array log\n";
  47.   $logCounter++;
  48.   if ($logCounter > 10)
  49.   {
  50.         write_to_log;
  51.         $logCounter=0;
  52.         @logLines="";
  53.  
  54.   }
  55. }
  56.  
  57. sub loadlinks()
  58. {
  59.   my $count = 0;
  60.   my $sleepTime;
  61.   my $startTime = UnixDate("today","%Y/%m/%d %H:%M:%S");  
  62.  
  63.  
  64.   my $foundurl = 1;
  65.   my $pg = 0;
  66.  
  67.   est_ftp();
  68.   $db->connect();
  69.  
  70.   while ($foundurl == 1)
  71.   {
  72.           my $urlPg = $pg*30;
  73.           $pg ++;
  74.           ## remove code before shipping
  75.           #if ($pg > 1)
  76.           #{
  77.       #  $foundurl = -1;
  78.       #  next;
  79.           #}
  80.           write_to_logArray("\nPinging http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
  81.           $agent->get("http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
  82.           write_to_logArray("Status of the above ping : ".$agent->status."\n");
  83.  
  84.         my @links = $agent->find_all_links(url_regex => qr/item.php\?s=s\&count=/);
  85.       my $len = scalar(@links);
  86.       if ( $len == 0)
  87.       {
  88.         $foundurl = -1;
  89.         next;
  90.       }
  91.  
  92.       ## the above find_all may get duplicates, eliminate the same by not pinging until we get a new link
  93.       my %linkURLs;
  94.       foreach my $link (@links)
  95.       {
  96.         $linkURLs{$link->url_abs()} = ();
  97.       }
  98.  
  99.       @links = keys %linkURLs;
  100.  
  101.  
  102.  
  103.           # write 5 links to runinfo;
  104.           $count = 0;
  105.  
  106.           #write_to_logArray( "Connecting to the database ".$properties{$dbname}." for the user ".$properties{$dbusername}."\n");
  107.           #write_to_logArray( "Connected successfully \n");
  108.           #login();
  109.  
  110.           foreach my $link (@links)
  111.           {
  112.             my $url = $link; 
  113.             write_to_logArray("\nPinging ".$url);
  114.             $agent->get($url);
  115.             write_to_logArray("Status of the above ping : ".$agent->status."\n");
  116.         my $line = $agent->content;
  117.           get_values($line);
  118.             get_images();
  119.             %images = ();
  120.           }
  121.           # Disconnect from the database.
  122.    }
  123.  
  124.    $ftp->quit();
  125.    ## delete those records that are not updated now
  126.    #$db->connect();
  127.    $db->deleteRecords($startTime,'NN2');
  128.    $db->disconnect();
  129.  
  130. }
  131.  
  132. sub get_values
  133. {
  134.     my $count;
  135.  
  136.         my $line = $_[0];
  137.         $line =~ s/\n//g;
  138.     $_ = $line;
  139.  
  140.     my @pat = m{Ref\s*<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  141.     my $refNum = join('',@pat);
  142.  
  143.         @pat = m{Reg\sYear\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  144.     my $year = join('',@pat);
  145.  
  146.     @pat = m{CC\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  147.     my $engine = join('',@pat);
  148.  
  149.     @pat = m{Price<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*\&pound\;\s*(.*?)\s*</td>}gis;
  150.     my $price = join('',@pat);
  151.  
  152.     @pat = m{Fuel\sType<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*(.*?)\s*</td>}gis;
  153.     my $fuel = join('',@pat);
  154.  
  155.     @pat = m{Colour\s<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  156.     my $color = join('',@pat);
  157.  
  158.     @pat = m{Mileage\s<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  159.     my $milage = join('',@pat);
  160.  
  161.     @pat = m{Category<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*(.*?)\s*</td>}gis;
  162.     my $category = join('',@pat);
  163.  
  164.     @pat = m{Damage\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  165.     my $description = join('',@pat);
  166.         $description =~ s/<BR>/ /g;
  167.  
  168.         @pat = m{Model<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
  169.         my $make = join('',@pat);
  170.         $make =~ s/[\s]+/ /g;
  171.         $make =~ s/\&nbsp\;/ /g;
  172.         $make =~ s/\.//g;
  173.         $make = uc($make);
  174.  
  175.         my $itemname = "";
  176.  
  177.     ##images
  178.     my @allimages = m{src=\'(.*?)\'}gis;
  179.     my $image = join(',',@allimages);
  180.     $images{$refNum} = $image;
  181.     #print "\nNN2,".$refNum.",".$make.",".$year.",".$engine.",".$price.",".$fuel.",".$color.",".$milage.",".$category.",".$description;
  182.  
  183.     # Write to Database
  184.     $db->updateDB('NN2',$make,
  185.     $refNum,$year,$engine,$price,$fuel,$color,$milage,$category,$description);
  186.  
  187.  
  188.     write_to_logArray( "Written to the database successfully");
  189.  
  190. }
  191.  
  192. sub est_ftp
  193. {
  194.   $ftp = Net::FTP->new("deleted.com", Debug => 0)
  195.       or die "Cannot connect for FTPing to deleted.com: $@";
  196.   $ftp->login("deleted",'deleted')
  197.       or die "Cannot login ", $ftp->message;
  198.   $ftp->mkdir("www/scripts/NN2");
  199.   $ftp->cwd("www/scripts/NN2/")
  200.       or die "Cannot change directory ", $ftp->message;
  201.   mkdir "NN2";
  202.   $ftp->binary;
  203. }
  204.  
  205. sub get_images
  206. {
  207.   my $mainURL = "http://autos.blue-sock.com/clients/12/";
  208.   foreach my $image (keys %images)
  209.   {
  210.     my @imgs = split(',', $images{$image});
  211.     my $imgcnt = 0;
  212.     foreach my $img (@imgs)
  213.     {
  214.       $imgcnt++;
  215.       my $imgname = $image."_".$imgcnt.".jpg";
  216.       $agent->get($mainURL.$img, ":content_file" => "NN2/$imgname");
  217.       $ftp->put("NN2/".$imgname);
  218.       $img =~ s/small/large/;
  219.       $imgname = $image."_L_".$imgcnt.".jpg";
  220.       $agent->get($mainURL.$img, ":content_file" => "NN2/$imgname");
  221.       $ftp->put("NN2/".$imgname);
  222.     }
  223.   }
  224.  
  225. }
  226.  
  227. sub get_random_number
  228. {
  229.  my $min = $_[0];
  230.  my $max = $_[1]; ;
  231.  
  232.  my $randomnumber = int(rand($max))+$min ;
  233.  #print ("Random Number is $randomnumber \n");
  234.  return  $randomnumber;
  235. }
  236.  
  237.  
in addition to changing the client No. to 5, I also changed the ( NN2 ) to ( NN6 ) as this would be the new image and log file name for the new host.

As they website layout for the 2 clients ( 12 ) and ( 5 ) is slightly different, surely there needs to be some additional changes for this to work??
Mar 6 '07 #5
perls
5
This is the logfile message when I run the new .pl file

Pinging http://autos.blue-sock.com/clients/5/list.php?count=0
Status of the above ping : 200
Mar 6 '07 #6
KevinADC
4,059 Expert 2GB
As they website layout for the 2 clients ( 12 ) and ( 5 ) is slightly different, surely there needs to be some additional changes for this to work??
Maybe, maybe not. I don't know. But I honestly doubt anyone is going to go and research that for you, and most likely nobody is going to read through all that code and try and figure it out. Most people, myself included, are willing to help up to a point. I'm not going to wade through all that code and look at the website pages to try and figure it out. Thats more than I am willing to do, I hope you understand.
Mar 6 '07 #7
miller
1,089 Expert 1GB
Maybe, maybe not. I don't know. But I honestly doubt anyone is going to go and research that for you, and most likely nobody is going to read through all that code and try and figure it out. Most people, myself included, are willing to help up to a point. I'm not going to wade through all that code and look at the website pages to try and figure it out. Thats more than I am willing to do, I hope you understand.
Ditto.

I certainly respect that you're someone new to perl who is trying to learn so that you can do this project yourself. But it feels like the only way to help you now is the do it for you instead of helping you to learn. I don't have the time to read through all that code and decypher what needs to be changed and what the possible pitfalls are.

I will tell you one thing though. This feels to be like a script that should be generalized in some way. The fact that you currently have 5 copies that are slightly modified for each site is not the best approach. Instead a generalized script should be created that can do all the parsing based off of a data file or some other indicator for which sites need to be scraped.

To accomplish that, you might need to invest in another perl programmer for a time. I do not expect that it will be a big job at all, but it is probably one that will require more than a beginners level of experience.

- Miller
Mar 6 '07 #8
perls
5
I understand.. Thanks for your help.
Looks like I just gotta pay my programmer another $200 to add 5 more sites
could anyone do it cheaper?
Mar 8 '07 #9
KevinADC
4,059 Expert 2GB
check on freelance programming sites and post a job and take bids. $200 sounds reasonable to me though.
Mar 8 '07 #10

Sign in to post your reply or Sign up for a free account.

Similar topics

4
by: David Jones | last post by:
Hi, I'm interested in learning about web scraping/site scraping using Python. Does anybody know of some online resources or have any modules that are available to help out. O'Reilly published an...
5
by: Christopher Brandsdal | last post by:
Hi! I'm stuck on a little problem... I want to get te article heading-text and teaser from http://www.avisa-valdres.no and display it on another page using asp code... An example on this:...
9
by: boclair | last post by:
In a discussion between AJ Flavel and Spartanicus, The later said. "Layouts that work well on the desktop and on small screen devices are single column layouts. Layouts that use more than one...
4
by: Roland Hall | last post by:
Am I correct in assuming screen scraping is just the response text sent to the browser? If so, would that mean that this could not be screen scraped? function moi() { var tag = '<a href='; var...
2
by: Lance Geeck | last post by:
Is there an example someplace of how I can turn a web site into a string of text so I can parse it? I am trying to extract a returned value from an existing website that I have no control over. ...
2
by: Alan Silver | last post by:
Hello, I would like to pull some information off a site that requires a log in. I have a subscription to a premium content site, and I would like to be able to do a few automatic requests...
1
by: onceuponapriori | last post by:
Greetings gents. I'm a Railser working on a django app that needs to do some scraping to gather its data. I need to programatically access a site that requires a username and password. Once I...
3
by: bruce | last post by:
Hi... got a short test app that i'm playing with. the goal is to get data off the page in question. basically, i should be able to get a list of "tr" nodes, and then to iterate/parse them....
4
by: BryanA | last post by:
I was wondering where I would start to try and recreate something http://goohackle.com/scripts/google_parser.php where it just lists the urls. I would be using it to check what pages of my site are...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.