Adapting Site Scraping Script

5 New Member
"Newbie needs help"

Hi all,

I had a programmer write a site-scraping script for me. The aim was to scrape data from 5 different sites and upload it directly into my website database. I started to study the five .pl files and found that there are only small differences between them, e.g. the URL of the site to be scraped changes, for obvious reasons. There are also a few other lines of code which I don't understand. I would like to be able to add more sites by making additional .pl files, but I'm not sure what info to change.

One sample section, where the code changes for each site to be scraped, is:

write_to_logArray("\nPinging http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
$agent->get("http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
write_to_logArray("Status of the above ping : ".$agent->status."\n");

my @links = $agent->find_all_links(url_regex => qr/item.php\?s=s\&count=/);
my $len = scalar(@links);
if ( $len == 0) {
    $foundurl = -1;
    next;
}

## the above find_all may get duplicates, eliminate the same by not pinging until we get a new link
my %linkURLs;
foreach my $link (@links) {
    $linkURLs{$link->url_abs()} = ();
}

@links = keys %linkURLs;

# write 5 links to runinfo;
$count = 0;

#write_to_logArray( "Connecting to the database ".$properties{$dbname}." for the user ".$properties{$dbusername}."\n");
#write_to_logArray( "Connected successfully \n");
#login();

foreach my $link (@links) {
    my $url = $link;
    write_to_logArray("\nPinging ".$url);
    $agent->get($url);
    write_to_logArray("Status of the above ping : ".$agent->status."\n");
    my $line = $agent->content;
    get_values($line);
    get_images();
    %images = ();
}
# Disconnect from the database.

Would someone be kind enough to advise me which values will change if I make a new .pl file for this URL?

" http://autos.blue-sock.com/clients/5/list.php?count= "
Mar 5 '07 #1
KevinADC
4,059 Recognized Expert Specialist
Change these:

/12/

to:

/5/

and give it a try.
Mar 5 '07 #2
perls
5 New Member
I thought of that and did change it, but it doesn't work.
The .pl file runs under cmd, but no images or log file data are gathered.
That's why I assumed that the lines of code below the URL part need fixing.
Mar 5 '07 #3
KevinADC
4,059 Recognized Expert Specialist
I don't see why it wouldn't work. Nothing after the URLs at the top of the script looks like it should be altered. Maybe someone else can spot something.
Mar 5 '07 #4
perls
5 New Member
Here is the complete file:

#!/usr/bin/perl -w
use strict;
#use LWP::Debug qw(+);

use WWW::Mechanize;
use HTML::TokeParser;
use DBConnect::db;
use Date::Manip;
use Cwd;
use Net::FTP;

# Forward declarations so the subs can be called before they are defined.
sub loadlinks;
sub write_to_log;
sub write_to_logArray;
sub get_values;

my @logLines;
my $logCounter = 0;

my %images;
my $agent = WWW::Mechanize->new(agent => "Mozilla/4.0 (compatible; MSIE 5.0b2; Windows NT)");
my $db = DBConnect::db->new();
my $ftp;

loadlinks();
write_to_log();    # flush whatever is still buffered

# Append the buffered log lines to NN2.log under a timestamp.
sub write_to_log
{
    open(my $log, '>>', 'NN2.log') or die "Cannot open NN2.log: $!";
    my $temp = localtime(time());
    print {$log} "Time:$temp\n";
    foreach my $line (@logLines)
    {
        print {$log} $line, "\n";
    }
    close $log;
}

# Buffer one log message; flush to disk once more than 10 are queued.
sub write_to_logArray
{
    push @logLines, $_[0];
    #print "in the array log\n";
    $logCounter++;
    if ($logCounter > 10)
    {
        write_to_log();
        $logCounter = 0;
        @logLines = ();    # empty the buffer after flushing
    }
}

sub loadlinks
{
    my $count = 0;
    my $sleepTime;
    my $startTime = UnixDate("today","%Y/%m/%d %H:%M:%S");

    my $foundurl = 1;
    my $pg = 0;

    est_ftp();
    $db->connect();

    # Walk the list pages (count advances by 30 per page) until a page has no item links.
    while ($foundurl == 1)
    {
        my $urlPg = $pg*30;
        $pg++;
        ## remove code before shipping
        #if ($pg > 1)
        #{
        #    $foundurl = -1;
        #    next;
        #}
        write_to_logArray("\nPinging http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
        $agent->get("http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
        write_to_logArray("Status of the above ping : ".$agent->status."\n");

        my @links = $agent->find_all_links(url_regex => qr/item.php\?s=s\&count=/);
        my $len = scalar(@links);
        if ( $len == 0)
        {
            # No item links on this page: stop paging.
            $foundurl = -1;
            next;
        }

        ## the above find_all may get duplicates, eliminate the same by not pinging until we get a new link
        my %linkURLs;
        foreach my $link (@links)
        {
            $linkURLs{$link->url_abs()} = ();
        }

        @links = keys %linkURLs;

        # write 5 links to runinfo;
        $count = 0;

        #write_to_logArray( "Connecting to the database ".$properties{$dbname}." for the user ".$properties{$dbusername}."\n");
        #write_to_logArray( "Connected successfully \n");
        #login();

        # Fetch each item page, store its fields, then download its images.
        foreach my $link (@links)
        {
            my $url = $link;
            write_to_logArray("\nPinging ".$url);
            $agent->get($url);
            write_to_logArray("Status of the above ping : ".$agent->status."\n");
            my $line = $agent->content;
            get_values($line);
            get_images();
            %images = ();
        }
        # Disconnect from the database.
    }

    $ftp->quit();
    ## delete those records that are not updated now
    #$db->connect();
    $db->deleteRecords($startTime,'NN2');
    $db->disconnect();
}

# Pull the item fields out of one item page and write them to the database.
# Every pattern below is tied to the exact table markup of the client 12 pages.
sub get_values
{
    my $line = $_[0];
    $line =~ s/\n//g;
    local $_ = $line;    # the m{...} matches below run against $_

    my @pat = m{Ref\s*<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $refNum = join('',@pat);

    @pat = m{Reg\sYear\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $year = join('',@pat);

    @pat = m{CC\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $engine = join('',@pat);

    @pat = m{Price<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*\&pound\;\s*(.*?)\s*</td>}gis;
    my $price = join('',@pat);

    @pat = m{Fuel\sType<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $fuel = join('',@pat);

    @pat = m{Colour\s<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $color = join('',@pat);

    @pat = m{Mileage\s<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $milage = join('',@pat);

    @pat = m{Category<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $category = join('',@pat);

    @pat = m{Damage\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $description = join('',@pat);
    $description =~ s/<BR>/ /g;

    @pat = m{Model<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $make = join('',@pat);
    $make =~ s/[\s]+/ /g;
    $make =~ s/\&nbsp\;/ /g;
    $make =~ s/\.//g;
    $make = uc($make);

    my $itemname = "";

    ##images
    my @allimages = m{src=\'(.*?)\'}gis;
    my $image = join(',',@allimages);
    $images{$refNum} = $image;
    #print "\nNN2,".$refNum.",".$make.",".$year.",".$engine.",".$price.",".$fuel.",".$color.",".$milage.",".$category.",".$description;

    # Write to Database
    $db->updateDB('NN2',$make,
    $refNum,$year,$engine,$price,$fuel,$color,$milage,$category,$description);

    write_to_logArray( "Written to the database successfully");
}

# Open the FTP connection and create the remote and local NN2 image directories.
sub est_ftp
{
    $ftp = Net::FTP->new("deleted.com", Debug => 0)
        or die "Cannot connect for FTPing to deleted.com: $@";
    $ftp->login("deleted",'deleted')
        or die "Cannot login ", $ftp->message;
    $ftp->mkdir("www/scripts/NN2");
    $ftp->cwd("www/scripts/NN2/")
        or die "Cannot change directory ", $ftp->message;
    mkdir "NN2";
    $ftp->binary;
}

# Download the small and large version of every image found by get_values and upload both by FTP.
sub get_images
{
    my $mainURL = "http://autos.blue-sock.com/clients/12/";
    foreach my $image (keys %images)
    {
        my @imgs = split(',', $images{$image});
        my $imgcnt = 0;
        foreach my $img (@imgs)
        {
            $imgcnt++;
            my $imgname = $image."_".$imgcnt.".jpg";
            $agent->get($mainURL.$img, ":content_file" => "NN2/$imgname");
            $ftp->put("NN2/".$imgname);
            $img =~ s/small/large/;
            $imgname = $image."_L_".$imgcnt.".jpg";
            $agent->get($mainURL.$img, ":content_file" => "NN2/$imgname");
            $ftp->put("NN2/".$imgname);
        }
    }
}

# Random integer from $min to $max (assuming an inclusive range is intended; this sub is never called).
sub get_random_number
{
    my $min = $_[0];
    my $max = $_[1];

    my $randomnumber = int(rand($max - $min + 1)) + $min;
    #print ("Random Number is $randomnumber \n");
    return $randomnumber;
}

In addition to changing the client number to 5, I also changed ( NN2 ) to ( NN6 ), as this would be the new image and log file name for the new host.

As the website layout for the 2 clients ( 12 ) and ( 5 ) is slightly different, surely there need to be some additional changes for this to work?
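One way to see why a layout difference could matter: the patterns in get_values spell out the exact table-cell markup of the client 12 pages, so even a small change in attributes can make a field come back empty. The sketch below is only an illustration. The pattern is copied from get_values, but both HTML fragments and the extract_ref helper are hypothetical and are not taken from either client's real pages.

```perl
use strict;
use warnings;

# "Ref" pattern copied from get_values(); it expects one specific cell markup.
my $ref_pattern = qr{Ref\s*<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}is;

# Hypothetical fragments (assumptions for illustration, not real page source).
my %sample = (
    client_12_style => q{<td>Ref</td> <td valign="top" nowrap class='search2'> AB123 </td>},
    other_style     => q{<td>Ref</td> <td valign="top" class="search2">AB123</td>},
);

# Hypothetical helper: return the captured reference, or undef if the markup differs.
sub extract_ref {
    my ($html) = @_;
    return $html =~ $ref_pattern ? $1 : undef;
}

for my $name (sort keys %sample) {
    my $ref = extract_ref($sample{$name});
    printf "%-16s %s\n", $name, defined $ref ? "matched '$ref'" : "no match";
}
```

If the real client 5 item pages differ from client 12 in this way, each pattern in get_values would need to be checked against the page source of the new site.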
Mar 6 '07 #5
perls
5 New Member
This is the log file output when I run the new .pl file:

Pinging http://autos.blue-sock.com/clients/5/list.php?count= 0
Status of the above ping : 200
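
A note on reading this log: loadlinks writes a "Pinging ..." line for every item link it follows, so a log that stops after the list page's status 200 suggests that find_all_links matched no links on the client 5 list page (assuming the log shown is complete). The sketch below shows that kind of check in isolation. The link filter is copied from loadlinks; the two HTML snippets and the matching_links helper are hypothetical stand-ins, not the real page source.

```perl
use strict;
use warnings;

# Link filter copied from loadlinks().
my $item_link = qr/item.php\?s=s\&count=/;

# Hypothetical helper: collect every href in the HTML and keep those the filter accepts.
sub matching_links {
    my ($html) = @_;
    my @hrefs = $html =~ /href\s*=\s*["']([^"']+)["']/gi;
    return grep { $_ =~ $item_link } @hrefs;
}

# Invented list pages: one uses the link shape the filter expects, one does not.
my $page_12_style = q{<a href="item.php?s=s&count=0&id=7">A</a> <a href="list.php?count=30">next</a>};
my $page_other    = q{<a href="details.php?id=7">A</a> <a href="list.php?count=30">next</a>};

for my $page ($page_12_style, $page_other) {
    my @found = matching_links($page);
    print scalar(@found), " item link(s) matched\n";
}
```

In the real script, the same check could be made by logging $_->url for each entry of $agent->links right after the list page is fetched, and comparing those URLs with the filter.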
Mar 6 '07 #6
KevinADC
4,059 Recognized Expert Specialist
As the website layout for the 2 clients ( 12 ) and ( 5 ) is slightly different, surely there need to be some additional changes for this to work?
Maybe, maybe not. I don't know. But I honestly doubt anyone is going to go and research that for you, and most likely nobody is going to read through all that code and try to figure it out. Most people, myself included, are willing to help up to a point. I'm not going to wade through all that code and look at the website pages to try to figure it out. That's more than I am willing to do; I hope you understand.
Mar 6 '07 #7
miller
1,089 Recognized Expert Top Contributor
Maybe, maybe not. I don't know. But I honestly doubt anyone is going to go and research that for you, and most likely nobody is going to read through all that code and try to figure it out. Most people, myself included, are willing to help up to a point. I'm not going to wade through all that code and look at the website pages to try to figure it out. That's more than I am willing to do; I hope you understand.
Ditto.

I certainly respect that you're someone new to Perl who is trying to learn so that you can do this project yourself. But it feels like the only way to help you now is to do it for you instead of helping you to learn. I don't have the time to read through all that code and decipher what needs to be changed and what the possible pitfalls are.

I will tell you one thing though. This feels to me like a script that should be generalized in some way. Having 5 copies that are each slightly modified for one site is not the best approach. Instead, a single generalized script should be created that does all the parsing based on a data file or some other indicator of which sites need to be scraped.
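
As a rough illustration of that idea (a sketch only, not a tested replacement for the script): everything that differs between the copies could live in one table keyed by site tag. The %sites layout and the list_url and extract_fields helpers are hypothetical names. The NN2 entry reuses values from the posted script, while the NN6 entry is a placeholder that would have to be filled in from client 5's actual pages.

```perl
use strict;
use warnings;

# Hypothetical per-site table: base URL, item-link filter, and field patterns.
my %sites = (
    NN2 => {
        base_url => 'http://autos.blue-sock.com/clients/12/',
        link_re  => qr/item.php\?s=s\&count=/,
        fields   => {
            refNum => qr{Ref\s*<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}is,
        },
    },
    NN6 => {
        base_url => 'http://autos.blue-sock.com/clients/5/',
        link_re  => qr/item.php\?s=s\&count=/,    # placeholder, unverified for client 5
        fields   => {},                           # patterns still to be written
    },
);

# Build the list-page URL for a site; the script steps count by 30 per page.
sub list_url {
    my ($site, $page) = @_;
    return $sites{$site}{base_url} . 'list.php?count=' . ($page * 30);
}

# Apply every field pattern of a site to one item page and return a hash ref of values.
sub extract_fields {
    my ($site, $html) = @_;
    my %row;
    my $fields = $sites{$site}{fields};
    for my $name (keys %$fields) {
        my $re   = $fields->{$name};
        my @hits = $html =~ /$re/g;
        $row{$name} = join '', @hits;
    }
    return \%row;
}

print list_url('NN6', 0), "\n";
```

With a table like this, adding a site would mean adding one entry instead of copying the whole file, and the main loop would take the site tag as its only parameter.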

To accomplish that, you might need to invest in another Perl programmer for a time. I do not expect that it will be a big job at all, but it is probably one that will require more than a beginner's level of experience.

- Miller
Mar 6 '07 #8
perls
5 New Member
I understand. Thanks for your help.
Looks like I'll just have to pay my programmer another $200 to add 5 more sites.
Could anyone do it cheaper?
Mar 8 '07 #9
KevinADC
4,059 Recognized Expert Specialist
Check freelance programming sites, post a job and take bids. $200 sounds reasonable to me though.
Mar 8 '07 #10
