"Newbie needs help"
Hi all,
I had a programmer write a site-scraping script for me. The aim was to scrape data from 5 different sites and upload it directly into my website database. I started to study the five .pl files and found that there are only small changes between the files (e.g. the URL of the site to be scraped changes, for obvious reasons). There are also a few other lines of code which I don't understand. I would like to be able to add more sites by making additional .pl files, but I'm not sure what info to change.
One sample file section, where the code changes for each site to be scraped is -
write_to_logArray("\nPinging http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
$agent->get("http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
write_to_logArray("Status of the above ping : ".$agent->status."\n");

my @links = $agent->find_all_links(url_regex => qr/item.php\?s=s\&count=/);
my $len = scalar(@links);
if ( $len == 0) {
    $foundurl = -1;
    next;
}

## the above find_all may get duplicates, eliminate the same by not pinging until we get a new link
my %linkURLs;
foreach my $link (@links) {
    $linkURLs{$link->url_abs()} = ();
}

@links = keys %linkURLs;

# write 5 links to runinfo;
$count = 0;

#write_to_logArray( "Connecting to the database ".$properties{$dbname}." for the user ".$properties{$dbusername}."\n");
#write_to_logArray( "Connected successfully \n");
#login();

foreach my $link (@links) {
    my $url = $link;
    write_to_logArray("\nPinging ".$url);
    $agent->get($url);
    write_to_logArray("Status of the above ping : ".$agent->status."\n");
    my $line = $agent->content;
    get_values($line);
    get_images();
    %images = ();
}
# Disconnect from the database.

Would someone be kind enough to advise me which values will change if I make a new .pl file for this URL:
" http://autos.blue-sock.com/clients/5/list.php?count= "
change these:
/12/
to:
/5/
and give it a try.
I thought of that and did change it, but it doesn't work...
The .pl file runs under cmd, but no images or log file data are gathered.
That's why I assumed that the lines of code below the URL bit need fixing.
I don't see why it wouldn't work. Nothing after the URLs at the top of the script looks like it should be altered. Maybe someone else can spot something.
here is the complete file -
#!/usr/bin/perl -w

use strict;
#use LWP::Debug qw(+);

use WWW::Mechanize;
use HTML::TokeParser;
use DBConnect::db;
use Date::Manip;
use Cwd;
use Net::FTP;

sub loadlinks;
sub write_to_log;
sub write_to_logArray;
sub get_values;

my @logLines;
my $logCounter=0;

my %images;
my $agent = WWW::Mechanize->new(agent=>"Mozilla/4.0 (compatible; MSIE 5.0b2; Windows NT)") ;
my $db = DBConnect::db->new();
my $ftp;

loadlinks();
write_to_log;

sub write_to_log
{
    open(LOG,">>NN2.log") || die "Cannot open NN2.log";
    my $temp=localtime(time());
    print LOG ("Time:$temp\n");
    foreach my $line (@logLines)
    {
        print LOG $line."\n";
    }
    close LOG;
}

sub write_to_logArray
{
    push @logLines,$_[0];
    #print "in the array log\n";
    $logCounter++;
    if ($logCounter > 10)
    {
        write_to_log;
        $logCounter=0;
        @logLines="";

    }
}

sub loadlinks()
{
    my $count = 0;
    my $sleepTime;
    my $startTime = UnixDate("today","%Y/%m/%d %H:%M:%S");

    my $foundurl = 1;
    my $pg = 0;

    est_ftp();
    $db->connect();

    while ($foundurl == 1)
    {
        my $urlPg = $pg*30;
        $pg ++;
        ## remove code before shipping
        #if ($pg > 1)
        #{
        # $foundurl = -1;
        # next;
        #}
        write_to_logArray("\nPinging http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
        $agent->get("http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
        write_to_logArray("Status of the above ping : ".$agent->status."\n");

        my @links = $agent->find_all_links(url_regex => qr/item.php\?s=s\&count=/);
        my $len = scalar(@links);
        if ( $len == 0)
        {
            $foundurl = -1;
            next;
        }

        ## the above find_all may get duplicates, eliminate the same by not pinging until we get a new link
        my %linkURLs;
        foreach my $link (@links)
        {
            $linkURLs{$link->url_abs()} = ();
        }

        @links = keys %linkURLs;

        # write 5 links to runinfo;
        $count = 0;

        #write_to_logArray( "Connecting to the database ".$properties{$dbname}." for the user ".$properties{$dbusername}."\n");
        #write_to_logArray( "Connected successfully \n");
        #login();

        foreach my $link (@links)
        {
            my $url = $link;
            write_to_logArray("\nPinging ".$url);
            $agent->get($url);
            write_to_logArray("Status of the above ping : ".$agent->status."\n");
            my $line = $agent->content;
            get_values($line);
            get_images();
            %images = ();
        }
        # Disconnect from the database.
    }

    $ftp->quit();
    ## delete those records that are not updated now
    #$db->connect();
    $db->deleteRecords($startTime,'NN2');
    $db->disconnect();

}

sub get_values
{
    my $count;

    my $line = $_[0];
    $line =~ s/\n//g;
    $_ = $line;

    my @pat = m{Ref\s*<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $refNum = join('',@pat);

    @pat = m{Reg\sYear\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $year = join('',@pat);

    @pat = m{CC\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $engine = join('',@pat);

    @pat = m{Price<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*\&pound\;\s*(.*?)\s*</td>}gis;
    my $price = join('',@pat);

    @pat = m{Fuel\sType<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $fuel = join('',@pat);

    @pat = m{Colour\s<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $color = join('',@pat);

    @pat = m{Mileage\s<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $milage = join('',@pat);

    @pat = m{Category<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $category = join('',@pat);

    @pat = m{Damage\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $description = join('',@pat);
    $description =~ s/<BR>/ /g;

    @pat = m{Model<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $make = join('',@pat);
    $make =~ s/[\s]+/ /g;
    $make =~ s/\&nbsp\;/ /g;
    $make =~ s/\.//g;
    $make = uc($make);

    my $itemname = "";

    ##images
    my @allimages = m{src=\'(.*?)\'}gis;
    my $image = join(',',@allimages);
    $images{$refNum} = $image;
    #print "\nNN2,".$refNum.",".$make.",".$year.",".$engine.",".$price.",".$fuel.",".$color.",".$milage.",".$category.",".$description;

    # Write to Database
    $db->updateDB('NN2',$make,
        $refNum,$year,$engine,$price,$fuel,$color,$milage,$category,$description);

    write_to_logArray( "Written to the database successfully");

}

sub est_ftp
{
    $ftp = Net::FTP->new("deleted.com", Debug => 0)
        or die "Cannot connect for FTPing to deleted.com: $@";
    $ftp->login("deleted",'deleted')
        or die "Cannot login ", $ftp->message;
    $ftp->mkdir("www/scripts/NN2");
    $ftp->cwd("www/scripts/NN2/")
        or die "Cannot change directory ", $ftp->message;
    mkdir "NN2";
    $ftp->binary;
}

sub get_images
{
    my $mainURL = "http://autos.blue-sock.com/clients/12/";
    foreach my $image (keys %images)
    {
        my @imgs = split(',', $images{$image});
        my $imgcnt = 0;
        foreach my $img (@imgs)
        {
            $imgcnt++;
            my $imgname = $image."_".$imgcnt.".jpg";
            $agent->get($mainURL.$img, ":content_file" => "NN2/$imgname");
            $ftp->put("NN2/".$imgname);
            $img =~ s/small/large/;
            $imgname = $image."_L_".$imgcnt.".jpg";
            $agent->get($mainURL.$img, ":content_file" => "NN2/$imgname");
            $ftp->put("NN2/".$imgname);
        }
    }

}

sub get_random_number
{
    my $min = $_[0];
    my $max = $_[1]; ;

    my $randomnumber = int(rand($max))+$min ;
    #print ("Random Number is $randomnumber \n");
    return $randomnumber;
}

In addition to changing the client number to 5, I also changed NN2 to NN6, as this would be the new image and log file name for the new host.
As the website layout for the two clients (12 and 5) is slightly different, surely there need to be some additional changes for this to work?
This is the logfile message when I run the new .pl file
Pinging http://autos.blue-sock.com/clients/5/list.php?count=0
Status of the above ping : 200
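A note on reading that log: in the posted script, the only step between logging the list-page status and logging the next ping is the `find_all_links` call with `qr/item.php\?s=s\&count=/`. If it returns no links, `$foundurl` is set to -1 and the loop ends without logging anything more. A log that stops after a single `200` is therefore consistent with the client 5 list page using item links of a different shape. Below is a minimal sketch for checking that offline. The sample HTML and its `details.php` link shape are hypothetical placeholders, not taken from the real site; substitute the saved source of the real list page.

```perl
use strict;
use warnings;

# Hypothetical sample of a list page; replace with the real saved HTML.
my $html = <<'HTML';
<a href="details.php?id=101">Car A</a>
<a href="details.php?id=102">Car B</a>
<a href="list.php?count=30">Next</a>
HTML

# The link pattern the original script uses for client 12.
my $old_pattern = qr/item.php\?s=s\&count=/;

# Collect every href so the actual link shape can be inspected by eye.
my @hrefs = $html =~ m{href=["']([^"']+)["']}gi;

# Keep only the links the old pattern would have accepted.
my @matches = grep { $_ =~ $old_pattern } @hrefs;

print "All links found:\n";
print "  $_\n" for @hrefs;
printf "Links matching the old pattern: %d\n", scalar @matches;
```

If the match count is zero on the real page, the `url_regex` needs adjusting for that client, and the field patterns in `get_values` probably need the same kind of check against a saved item page.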
As the website layout for the two clients (12 and 5) is slightly different, surely there need to be some additional changes for this to work?
Maybe, maybe not. I don't know. But I honestly doubt anyone is going to go and research that for you, and most likely nobody is going to read through all that code and try to figure it out. Most people, myself included, are willing to help up to a point. I'm not going to wade through all that code and look at the website pages to try to figure it out. That's more than I am willing to do; I hope you understand.
Maybe, maybe not. I don't know. But I honestly doubt anyone is going to go and research that for you, and most likely nobody is going to read through all that code and try to figure it out. Most people, myself included, are willing to help up to a point. I'm not going to wade through all that code and look at the website pages to try to figure it out. That's more than I am willing to do; I hope you understand.
Ditto.
I certainly respect that you're someone new to Perl who is trying to learn so that you can do this project yourself. But it feels like the only way to help you now is to do it for you instead of helping you to learn. I don't have the time to read through all that code and decipher what needs to be changed and what the possible pitfalls are.
I will tell you one thing, though. This feels like a script that should be generalized in some way. Having 5 copies that are each slightly modified for one site is not the best approach. Instead, a generalized script should be created that does all the parsing based on a data file or some other indicator of which sites need to be scraped.
To accomplish that, you might need to invest in another Perl programmer for a time. I do not expect that it will be a big job at all, but it is probably one that will require more than a beginner's level of experience.
- Miller
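The data-driven approach Miller describes could look roughly like the sketch below. It is only an outline under stated assumptions: the `@sites` table, its key names, and the `list_url` helper are hypothetical names introduced for illustration, and the link pattern shown for client 5 is a placeholder that has not been checked against the real page. The 30-per-page step comes from `$pg*30` in the posted `loadlinks`.

```perl
use strict;
use warnings;

# One entry per site. 'tag' is the log/image/database label that the
# original script hard-codes as 'NN2'. All key names are hypothetical.
my @sites = (
    {
        tag          => 'NN2',
        base_url     => 'http://autos.blue-sock.com/clients/12/',
        link_pattern => qr/item\.php\?s=s&count=/,
    },
    {
        tag          => 'NN6',
        base_url     => 'http://autos.blue-sock.com/clients/5/',
        # Assumption: unverified placeholder, copied from client 12.
        link_pattern => qr/item\.php\?s=s&count=/,
    },
);

# Build the list-page URL for a given page number
# (30 items per page, as in the original loadlinks).
sub list_url {
    my ($site, $page) = @_;
    return $site->{base_url} . 'list.php?count=' . ($page * 30);
}

for my $site (@sites) {
    print "$site->{tag}: ", list_url($site, 0), "\n";
    # A real run would fetch list_url($site, $pg) with the agent here
    # and pass $site->{link_pattern} as the url_regex.
}
```

With this shape, adding a site means adding one table entry rather than copying a whole .pl file. The field patterns in `get_values` and the `NN2` strings in `write_to_log`, `est_ftp`, and `get_images` would need to be read from the same entry.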
I understand. Thanks for your help.
Looks like I've just got to pay my programmer another $200 to add 5 more sites.
Could anyone do it cheaper?
Check freelance programming sites, post a job, and take bids. $200 sounds reasonable to me, though.