"Newbie needs help"
Hi all,
I had a programmer do a site scraping script for me.. the aim was to scrape data from 5 different sites and upload directly into my website databse. I started to study the five .pl files and found that there is only small changes in the files e.g. ( the url from the site to be scraped changes for obvious reasons etc. ) There is also a few other lines of code which I dont understand. I would like to be able to add more sites by making additional .pl files but not sure what info to change.
One sample file section, where the code changes for each site to be scraped, is:
write_to_logArray("\nPinging http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
$agent->get("http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
write_to_logArray("Status of the above ping : ".$agent->status."\n");

my @links = $agent->find_all_links(url_regex => qr/item.php\?s=s\&count=/);
my $len = scalar(@links);
if ( $len == 0) {
    $foundurl = -1;
    next;
}

## the above find_all may get duplicates, eliminate the same by not pinging until we get a new link
my %linkURLs;
foreach my $link (@links) {
    $linkURLs{$link->url_abs()} = ();
}
@links = keys %linkURLs;

# write 5 links to runinfo;
$count = 0;

#write_to_logArray( "Connecting to the database ".$properties{$dbname}." for the user ".$properties{$dbusername}."\n");
#write_to_logArray( "Connected successfully \n");
#login();

foreach my $link (@links) {
    my $url = $link;
    write_to_logArray("\nPinging ".$url);
    $agent->get($url);
    write_to_logArray("Status of the above ping : ".$agent->status."\n");
    my $line = $agent->content;
    get_values($line);
    get_images();
    %images = ();
}
# Disconnect from the database.
-
Would someone be kind enough to advise me what values etc. will change if I make a new .pl file for this URL:
" http://autos.blue-sock.com/clients/5/list.php?count= "
KevinADC 4,059
Recognized Expert Specialist
change these:
/12/
to:
/5/
and give it a try.
I thought of that and did change it, but it doesn't work...
The .pl file runs under cmd, but no images or log-file data are gathered.
That's why I assumed that the lines of code below the URL bit need fixing.
KevinADC 4,059
Recognized Expert Specialist
I don't see why it wouldn't work. Nothing after the URLs at the top of the script looks like it should be altered. Maybe someone else can spot something.
Here is the complete file:

#!/usr/bin/perl -w
use strict;
#use LWP::Debug qw(+);

use WWW::Mechanize;
use HTML::TokeParser;
use DBConnect::db;
use Date::Manip;
use Cwd;
use Net::FTP;

sub loadlinks;
sub write_to_log;
sub write_to_logArray;
sub get_values;

my @logLines;
my $logCounter=0;

my %images;
my $agent = WWW::Mechanize->new(agent=>"Mozilla/4.0 (compatible; MSIE 5.0b2; Windows NT)") ;
my $db = DBConnect::db->new();
my $ftp;

loadlinks();
write_to_log;

sub write_to_log
{
    open(LOG,">>NN2.log") || die "Cannot open NN2.log";
    my $temp=localtime(time());
    print LOG ("Time:$temp\n");
    foreach my $line (@logLines)
    {
        print LOG $line."\n";
    }
    close LOG;
}

sub write_to_logArray
{
    push @logLines,$_[0];
    #print "in the array log\n";
    $logCounter++;
    if ($logCounter > 10)
    {
        write_to_log;
        $logCounter=0;
        @logLines="";
    }
}

sub loadlinks()
{
    my $count = 0;
    my $sleepTime;
    my $startTime = UnixDate("today","%Y/%m/%d %H:%M:%S");

    my $foundurl = 1;
    my $pg = 0;

    est_ftp();
    $db->connect();

    while ($foundurl == 1)
    {
        my $urlPg = $pg*30;
        $pg ++;
        ## remove code before shipping
        #if ($pg > 1)
        #{
        #    $foundurl = -1;
        #    next;
        #}
        write_to_logArray("\nPinging http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
        $agent->get("http://autos.blue-sock.com/clients/12/list.php?count=".$urlPg);
        write_to_logArray("Status of the above ping : ".$agent->status."\n");

        my @links = $agent->find_all_links(url_regex => qr/item.php\?s=s\&count=/);
        my $len = scalar(@links);
        if ( $len == 0)
        {
            $foundurl = -1;
            next;
        }

        ## the above find_all may get duplicates, eliminate the same by not pinging until we get a new link
        my %linkURLs;
        foreach my $link (@links)
        {
            $linkURLs{$link->url_abs()} = ();
        }
        @links = keys %linkURLs;

        # write 5 links to runinfo;
        $count = 0;

        #write_to_logArray( "Connecting to the database ".$properties{$dbname}." for the user ".$properties{$dbusername}."\n");
        #write_to_logArray( "Connected successfully \n");
        #login();

        foreach my $link (@links)
        {
            my $url = $link;
            write_to_logArray("\nPinging ".$url);
            $agent->get($url);
            write_to_logArray("Status of the above ping : ".$agent->status."\n");
            my $line = $agent->content;
            get_values($line);
            get_images();
            %images = ();
        }
        # Disconnect from the database.
    }

    $ftp->quit();
    ## delete those records that are not updated now
    #$db->connect();
    $db->deleteRecords($startTime,'NN2');
    $db->disconnect();
}

sub get_values
{
    my $count;

    my $line = $_[0];
    $line =~ s/\n//g;
    $_ = $line;

    my @pat = m{Ref\s*<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $refNum = join('',@pat);

    @pat = m{Reg\sYear\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $year = join('',@pat);

    @pat = m{CC\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $engine = join('',@pat);

    @pat = m{Price<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*\£\;\s*(.*?)\s*</td>}gis;
    my $price = join('',@pat);

    @pat = m{Fuel\sType<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $fuel = join('',@pat);

    @pat = m{Colour\s<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $color = join('',@pat);

    @pat = m{Mileage\s<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $milage = join('',@pat);

    @pat = m{Category<\/td>\s*<td\svalign\=\"top\"\s*nowrap\sclass=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $category = join('',@pat);

    @pat = m{Damage\s*<\/td>\s*<td\swidth=\"48\%\"\s*valign\=\"top\"\s*\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $description = join('',@pat);
    $description =~ s/<BR>/ /g;

    @pat = m{Model<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}gis;
    my $make = join('',@pat);
    $make =~ s/[\s]+/ /g;
    $make =~ s/\ \;/ /g;
    $make =~ s/\.//g;
    $make = uc($make);

    my $itemname = "";

    ##images
    my @allimages = m{src=\'(.*?)\'}gis;
    my $image = join(',',@allimages);
    $images{$refNum} = $image;
    #print "\nNN2,".$refNum.",".$make.",".$year.",".$engine.",".$price.",".$fuel.",".$color.",".$milage.",".$category.",".$description;

    # Write to Database
    $db->updateDB('NN2',$make,
        $refNum,$year,$engine,$price,$fuel,$color,$milage,$category,$description);

    write_to_logArray( "Written to the database successfully");
}

sub est_ftp
{
    $ftp = Net::FTP->new("deleted.com", Debug => 0)
        or die "Cannot connect for FTPing to deleted.com: $@";
    $ftp->login("deleted",'deleted')
        or die "Cannot login ", $ftp->message;
    $ftp->mkdir("www/scripts/NN2");
    $ftp->cwd("www/scripts/NN2/")
        or die "Cannot change directory ", $ftp->message;
    mkdir "NN2";
    $ftp->binary;
}

sub get_images
{
    my $mainURL = "http://autos.blue-sock.com/clients/12/";
    foreach my $image (keys %images)
    {
        my @imgs = split(',', $images{$image});
        my $imgcnt = 0;
        foreach my $img (@imgs)
        {
            $imgcnt++;
            my $imgname = $image."_".$imgcnt.".jpg";
            $agent->get($mainURL.$img, ":content_file" => "NN2/$imgname");
            $ftp->put("NN2/".$imgname);
            $img =~ s/small/large/;
            $imgname = $image."_L_".$imgcnt.".jpg";
            $agent->get($mainURL.$img, ":content_file" => "NN2/$imgname");
            $ftp->put("NN2/".$imgname);
        }
    }
}

sub get_random_number
{
    my $min = $_[0];
    my $max = $_[1];

    my $randomnumber = int(rand($max))+$min;
    #print ("Random Number is $randomnumber \n");
    return $randomnumber;
}
In addition to changing the client no. to 5, I also changed NN2 to NN6, as this would be the new image and log file name for the new host.
As the website layout for the two clients (12 and 5) is slightly different, surely there need to be some additional changes for this to work??
This is the logfile message when I run the new .pl file
Pinging http://autos.blue-sock.com/clients/5/list.php?count=0
Status of the above ping : 200
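If the two clients' page layouts differ, the most likely culprit is the hard-coded patterns in get_values(): they match client 12's exact markup (column widths, class names, label text), and if client 5's HTML differs even slightly they capture nothing, so no data or images follow. A quick way to check is to save an item page's source from the new site and test one pattern against it. The HTML fragment below is invented for illustration; substitute the real saved page:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample markup -- replace with the real source of one item.php
# page saved from client 5.
my $html = q{Ref </td><td valign="top" nowrap class='search2'> AB123 </td>};

# One of the patterns from get_values(), copied verbatim from the script
if ($html =~ m{Ref\s*<\/td>\s*<td\s*valign\=\"top\"\s*nowrap\sclass\=\'search2\'>\s*(.*?)\s*</td>}is) {
    print "matched: $1\n";    # the captured Ref value
} else {
    print "no match - this pattern needs adjusting for the new site\n";
}
```

Running this against each pattern in turn shows exactly which of the Ref/Reg Year/CC/Price/etc. regexes break on the new layout.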
KevinADC 4,059
Recognized Expert Specialist
As the website layout for the two clients (12 and 5) is slightly different, surely there need to be some additional changes for this to work??
Maybe, maybe not. I don't know. But I honestly doubt anyone is going to go and research that for you, and most likely nobody is going to read through all that code and try and figure it out. Most people, myself included, are willing to help up to a point. I'm not going to wade through all that code and look at the website pages to try and figure it out. That's more than I am willing to do, I hope you understand.
miller 1,089
Recognized Expert Top Contributor
Maybe, maybe not. I don't know. But I honestly doubt anyone is going to go and research that for you, and most likely nobody is going to read through all that code and try and figure it out. Most people, myself included, are willing to help up to a point. I'm not going to wade through all that code and look at the website pages to try and figure it out. That's more than I am willing to do, I hope you understand.
Ditto.
I certainly respect that you're someone new to Perl who is trying to learn so that you can do this project yourself. But it feels like the only way to help you now is to do it for you instead of helping you to learn. I don't have the time to read through all that code and decipher what needs to be changed and what the possible pitfalls are.
I will tell you one thing, though. This feels to me like a script that should be generalized in some way. The fact that you currently have 5 copies that are slightly modified for each site is not the best approach. Instead, a generalized script should be created that can do all the parsing based off of a data file or some other indicator of which sites need to be scraped.
To accomplish that, you might need to invest in another Perl programmer for a time. I don't expect it will be a big job at all, but it is probably one that will require more than a beginner's level of experience.
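A very rough sketch of what that generalization could look like: put the per-site differences in one data structure and run the same code over every entry, instead of copying the whole file per site. The site keys and client numbers below are just examples taken from this thread; real entries would also carry the per-site regexes or a parsing hint, since the layouts differ:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical per-site configuration: each entry holds everything that
# currently differs between the copied .pl files.
my %sites = (
    NN2 => { client => 12, list_url => "http://autos.blue-sock.com/clients/12/list.php?count=" },
    NN6 => { client => 5,  list_url => "http://autos.blue-sock.com/clients/5/list.php?count="  },
);

foreach my $name (sort keys %sites) {
    my $cfg = $sites{$name};
    # every URL, log file and image directory is derived from the config
    my $first_page = $cfg->{list_url} . 0;
    my $log_file   = "$name.log";
    print "$name: would fetch $first_page and log to $log_file\n";
    # scrape_site($name, $cfg);   # one shared routine replaces the 5 copies
}
```

Adding a sixth site then means adding one hash entry (or one line of a data file) rather than editing a new copy of the script.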
- Miller
I understand. Thanks for your help.
Looks like I just have to pay my programmer another $200 to add 5 more sites.
Could anyone do it cheaper?
KevinADC 4,059
Recognized Expert Specialist
Check freelance programming sites, post a job, and take bids. $200 sounds reasonable to me, though.