
Is my web crawler being blocked?

I am trying to write a web crawler (for academic research purposes) that grabs the number of inbound links each website/domain name has from other websites, as reported by Google. For example, to get the number of websites linking to YouTube, you could type 'link:youtube.com' into Google and get 11,100. I have a list of websites in a spreadsheet and would like to output the link count for each website in the sheet.

When I run the below script, it seems to run fast and accurately for a minute or so, or for several hundred websites. Then the crawler slows down, outputting that 0 links were found for all websites.

Am I getting this result because Google is detecting my crawler and blocking it? If so, is there anything I can do, such as telling my crawler to sleep and try again later? (I've tried to sketch something out in the lower batch of code below.)

Thank you.
import re
import string
import time
import urllib2

for i in range(len(lines) - 1):
    searchTerm = "link:" + searchTerms[i]
    br = MakeBrowser()  # helper (defined elsewhere) that returns a mechanize Browser
    br.open('http://www.google.com')
    br.select_form(name='f')
    br['q'] = searchTerm
    resp = br.submit().readlines()
    # Split out the HTML tags and grab the count preceding ' linking to '
    htmlSplitter = re.compile('<.*?>')
    names = []
    y = htmlSplitter.split(resp[0])
    value = ''
    for j in range(0, len(y) - 1):
        if y[j] == ' linking to ':
            value = string.replace(y[j - 1], ",", "")

# ...

# Retry version: sleep and try again whenever the request fails
br = MakeBrowser()
haveResp = False
while not haveResp:
    try:
        br.open('http://inventory.overture.com/d/' +
                'searchinventory/suggestion/')
        br.select_form(name='stst')
        br['term'] = searchTerm
        resp = br.submit().readlines()
        haveResp = True
    except urllib2.URLError:
        time.sleep(10)
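The snippet assumes lines and searchTerms have already been filled from the spreadsheet; a minimal sketch of doing that from a CSV export (the file name and column layout here are assumptions):

import csv

# Hypothetical loader: expects a sites.csv export with one site per row,
# site name in the first column.
searchTerms = []
with open('sites.csv', 'rb') as f:
    for row in csv.reader(f):
        searchTerms.append(row[0])
lines = searchTerms  # the main loop iterates over range(len(lines) - 1)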
Nov 17 '07 #1
oler1s
Don't be surprised that your program is being blocked. You are, after all, hammering their servers with inefficient requests, at an unsustainable rate. Keep this up and your IP will be banned.

Obviously, the rate at which you issue requests is a problem, and using the web interface is inherently inefficient. Worse, I doubt you coded your program to behave like a good client: accepting gzip compression of the page, caching results, and so on. I haven't looked into it myself, but see if Google has a more efficient way of issuing these requests, such as an API.

At the very least, don't keep requesting the Google search page over and over; you can construct the search URL directly, along the lines of the sketch below.
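A minimal sketch of that (Python 2 style, to match the code above; the query parameters follow Google's public search endpoint, and whether Google serves the page to a non-browser User-Agent is a separate question):

import urllib
import urllib2

def fetch_link_count_page(site):
    # Build the results URL directly instead of loading and submitting the form
    query = urllib.urlencode({'q': 'link:' + site})
    return urllib2.urlopen('http://www.google.com/search?' + query).read()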
Nov 18 '07 #2
dazzler
If a site is completely blocking your app, your web crawler can also pretend to be someone else by reporting a browser's User-Agent string:

import urllib

class AppURLopener(urllib.FancyURLopener):
    # Report a browser version string instead of urllib's default.
    # (I got this version string from Wikipedia; I don't remember the
    # article or whether it still exists.)
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.1; nl; rv:1.8) Gecko/20051107 Firefox/1.5"

# ...

urllib._urlopener = AppURLopener()
Of course, this isn't exactly a recommended approach ;D

And if I remember correctly, a site is supposed to declare whether it allows access to crawlers (in its robots.txt file, or a robots meta tag in the HTML), so your app "should" check for that first.
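A minimal sketch of that check using the standard library's robotparser module (the Python 2 name; it is urllib.robotparser in Python 3, and the user-agent string here is a made-up example):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.google.com/robots.txt')
rp.read()
# Google's robots.txt disallows /search for generic crawlers, so this prints False
print rp.can_fetch('MyCrawler/0.1', 'http://www.google.com/search?q=link:youtube.com')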
Nov 23 '07 #3
mh121
Thank you very much for your response. I have tried using urlopen on the Google search URL directly, but am finding that Google blocks this as well:

z = urlopen("http://www.google.com/search?q=link%3A" + www_site +
            "&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=1&startPage=1")
q = z.readlines()

I've found an API called PyGoogle that may work, though it seems to be used for more complicated tasks like getting a webpage's search ranking.

To behave more like a realistic, good client, are there any simple ways to allow for gzip compression or caching?
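(For reference, a minimal sketch of requesting and decompressing a gzip-encoded response with urllib2, in the same Python 2 style as the rest of the thread; the URL is a placeholder:)

import gzip
import urllib2
from StringIO import StringIO

req = urllib2.Request('http://www.example.com/')
req.add_header('Accept-Encoding', 'gzip')  # advertise gzip support to the server
resp = urllib2.urlopen(req)
body = resp.read()
if resp.info().get('Content-Encoding') == 'gzip':
    # The server honored the header; decompress the payload
    body = gzip.GzipFile(fileobj=StringIO(body)).read()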

Thanks.
Jan 6 '08 #4
