Bytes | Developer Community

Retrieving Web Page

Hello,

I wrote a small piece of code to retrieve web pages. I'm building a search engine that looks up sailboat ads on selected websites (not the Web at large).

It works OK, but some sites won't open. My original code was this:

use LWP::Simple;        # get() is exported by LWP::Simple
use LWP::UserAgent;
use HTTP::Response;
use URI::Heuristic;

unless (defined ($content = get($URL))) {
    print "could not get page at: $URL <br>";
}
Then $content was stored in a file. But for these two sites, www.babord.ca and sailboatlistings.com, I always get the error message: could not get page at: $URL

I also tried this code:

require LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $response = $ua->get($URL);

if ($response->is_success) {
    print $response->decoded_content;  # or whatever
}
else {
    print "$response->status_line<p>";
}
But I was getting this error message: HTTP::Response=HASH(0xa05109c)->status_line


I'm kind of new at Perl. Can someone help?

A) Could you tell me what is wrong with these two sites, while all the others work?

B) Is there some kind of code or debugger to help me track down problems when dealing with web pages?

All help appreciated. Thanks,

yjulien
Feb 13 '11 #1

✓ answered by miller (see post #2 below)

miller
You can debug this by using the more verbose way of getting pages, like the following:
use LWP::UserAgent;

use strict;

my $url = 'http://www.babord.ca/';
# my $url = 'http://www.sailboatlistings.com/';

my $ua = LWP::UserAgent->new;

my $response = $ua->get($url);

if (!$response->is_success) {
    die $response->status_line;
}

my $html = $response->decoded_content;
Doing this reveals a status line of "406 Not Acceptable". This could mean many things, but most likely the server is looking at the User-Agent header and rejecting any client that doesn't look like a standard web browser.

You will therefore probably need to use a fake user agent. This is normally considered bad practice, as the server needs to be able to differentiate between spiders and regular browsers. But as you'll see in the code below, the first website works when using a user agent string that represents Firefox 3.6.13 on Windows 7.
use LWP::UserAgent;

use strict;

my $url = 'http://www.babord.ca/';
# my $url = 'http://www.sailboatlistings.com/';

my $ua = LWP::UserAgent->new(
    agent => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 GTB7.1",
);

my $response = $ua->get($url);

if (!$response->is_success) {
    die $response->status_line;
}

my $html = $response->decoded_content;

print $html;
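A side note about the error message you got from your second snippet, HTTP::Response=HASH(0xa05109c)->status_line: inside a double-quoted string, Perl only interpolates the $response variable itself, not the ->status_line method call, so the object gets stringified and the rest is printed literally. Call the method outside the string instead, for example:

    print $response->status_line . "<p>";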
Please try to follow the terms of use of whatever website you're parsing for information.

- Miller
Feb 14 '11 #2
yjulien
Hello Miller,

Thanks for your answer.

I tried your code and it did work for the first site but not the second. I'm not getting any error message either. It just spins and spins and finally dies down... ??? I don't understand.

How can I fix this? What are the various parameters that might need to be adjusted? And where can I find those rules you were referring to? Is searching their pages against their policies? How can I know that?

Finally, how is the weather in San Francisco? I used to live there a long time ago (Sunset District). That was back in '85...

Regards,

Yves
Feb 14 '11 #3
miller
Question 1) The second URL still not working.

That's strange, because it works fine for me when I comment out the first my $url statement and uncomment the second. When I comment out the user agent fix, I get the exact same error code as well, so I can't think of what else to suggest. This might be a strange question, but are you "sure" it's not working?

Question 2) Terms of Service

I don't know what you're scraping this information for, but I just wanted to give you the standard advice to follow the terms of service of whatever website you're gathering information from.

Spidering for information is a pretty standard practice, but there are good policies, like respecting the server's bandwidth, that all programmers should follow. I suggest you research whatever other good practices are suggested out there; otherwise you might find your spider explicitly blocked by the websites you're gathering information from.
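If you want a head start on that, libwww-perl ships with LWP::RobotUA, a subclass of LWP::UserAgent that checks each site's robots.txt and waits between requests to the same host. A rough sketch (untested here; the bot name and contact address are only placeholders):

use strict;
use LWP::RobotUA;

# Robot-aware user agent: honours robots.txt and rate-limits requests.
my $ua = LWP::RobotUA->new(
    agent => 'SailboatAdBot/0.1',    # placeholder bot name
    from  => 'you@example.com',      # placeholder contact address
);
$ua->delay(1);    # wait at least 1 minute between requests to the same host

my $response = $ua->get('http://www.example.com/');
print $response->is_success
    ? $response->decoded_content
    : $response->status_line;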

Question 3) San Francisco

It's great, man. Feel free to come back anytime, although I'd probably have left as well if all I saw was the fog of the Sunset. I just don't love surfing enough to want to live there. ;)

If there's anything else that I can help ya with, just let me know.

- Miller
Feb 14 '11 #4
yjulien
Very strange it is!

On my "new" test server (which has no domain name yet), it doesn't work, and on my regular one it does. Maybe it's because I have not purchased a domain name yet for that new server of mine. Could that be it?

As for your question about what I want to do with this code, it's simple. I'm putting together a boat-oriented "search engine" for selected websites only. The websites selected are classified-ad sites in North America only. This is to avoid useless searches on forums and other places, or outside the continent. There is no point in getting ads from England or Germany when you live in the States or Canada...

Thanks for your help.

Yves
Feb 14 '11 #5
yjulien
Last night I went to bed asking myself why your script would work on my regular server and not on my new one. My regular site is currently hosted at pair.com. They have a very strict user policy (no spam, no overusage, etc.), which is a good policy but sometimes too strict. They asked me to shut down our forum because they said it was taking up too many resources on their shared server. They offered to rent me a dedicated server, but at a lot of extra $$$: from $80/month for a shared server to $300/month for a dedicated one. Since my forum is a free-access forum like most, I could not afford that. So I shopped around and found a dedicated server for $100/month. I'm currently in the process of moving all my sites there.

My question: is it possible that the new provider gave me IP addresses that have been blocked because previous users were spamming the Web? If so, where can I check whether this is the case?

That would explain why the script was not working on this new "useless" server...

Thanks,

Yves
Feb 14 '11 #6
miller
The fact that those websites don't allow the default user agent of "libwww-perl/5.836", or whatever version of LWP you're using, tells me that they either have their website configured in a strange way or are intentionally trying to block spiders for whatever reason. It is certainly possible that they have also acquired a blacklist of IP addresses to block as well.

However, the more likely problem is that the server you're trying to use doesn't allow outbound HTTP connections, or doesn't have LWP::UserAgent installed, or some such situation.

That is one problem with using a hosting service: you can't always count on them allowing non-standard usage of the servers. And some places consider anything but basic HTML to be non-standard.

It seems like you've been able to get the code to work correctly on your other server, so you'll just have to continue debugging by determining whether this new server allows you to connect to other websites like Yahoo or Google. If those fail, then obviously it's your hosting environment.
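A quick way to check is a tiny script along these lines (the target URLs are just well-known sites to test against):

use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 10);

# If these fail too, the problem is the hosting environment,
# not the websites you are trying to spider.
for my $url ('http://www.google.com/', 'http://www.yahoo.com/') {
    my $response = $ua->get($url);
    print "$url => ", $response->status_line, "\n";
}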

Good luck,
- Miller
Feb 14 '11 #7
yjulien
Hi Miller,

Thank you for your reply.

I now doubt that my IP is blacklisted. I got it a few years ago and actually never used it. Unless the previous owner did something bad, it should be OK. I found several sites that keep lists of blacklisted IPs, and none of mine are on them. Why would they be? I never did anything bad with them...

As for my code, it works for every site on the Pair server, but it refuses to work for that last site only on the new server. The LWP::UserAgent module is installed on both servers and seems to work fine with all but one site. So I doubt that's it.

Maybe that specific site has done something to block most spiders.

Regards,

Yves
Feb 14 '11 #8
miller
OK, a couple of other possible solutions.

If the script and specific site work correctly on your local machine, but do not work on the specific server that you're using for hosting, then I can propose two solutions that will potentially work independently of whatever the actual problem is.

1) Routing LWP::UserAgent through a proxy.

Proxies are kind of the rebel children of the internet and will probably introduce a whole set of new problems if you try to use one. Nevertheless, if it's the specific IP or IP block that you're on that is creating the problem, then you can avoid that by routing your requests through a proxy server.

Just search Google for some public proxies if you want to give that a try.
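For reference, pointing LWP at a proxy looks roughly like this (the proxy address below is only a placeholder; substitute one you actually have access to):

use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 10);

# Route all http requests through the proxy (placeholder address).
$ua->proxy(['http'], 'http://proxy.example.com:8080/');

my $response = $ua->get('http://www.sailboatlistings.com/');
print $response->is_success ? "fetched ok\n" : $response->status_line . "\n";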

2) Use Google's cached results

A more surefire way of getting it to work would be to avoid the specific website entirely and use Google's cached results instead. If you do a Google search for the specific website, next to the first result you'll see links to 'Cached' and 'Similar'. Clicking on Cached will bring up the copy of the website that Google used for its most recent spidering, most likely within the last 24 hours.

As long as you can work with a slightly outdated copy of the website, this could work as a means to avoid whatever problem you're having that is specific to this server. Yes, it's a hack, but sometimes that's what's required. :)

Good luck,
- Miller
Feb 14 '11 #9
