
Is my web crawler being blocked?

P: 7
I am trying to write a web crawler (for academic research purposes) that grabs the number of inbound links that different websites/domain names have from other sites, as reported by Google (for example, to get the number of websites linking to YouTube, you can type 'link:youtube.com' into Google and get 11,100). I have a list of websites in a spreadsheet and would like to output the number of links for each site in the sheet.

When I run the script below, it runs quickly and accurately for a minute or so, i.e. for several hundred websites. Then the crawler slows down and reports that 0 links were found for every site.

Am I getting this result because Google is detecting my crawler and blocking it? If so, is there anything I can do, such as telling my crawler to sleep and try again later? (I've tried to sketch something out in the lower batch of code below.)

Thank you.
import re
import string
import time
import urllib2

# MakeBrowser() is a helper (defined elsewhere) that returns a mechanize-style Browser.
for i in range(len(lines) - 1):
    searchTerm = "link:" + searchTerms[i]
    br = MakeBrowser()
    br.open('http://www.google.com')
    br.select_form(name='f')
    br['q'] = searchTerm
    resp = br.submit().readlines()
    # Strip the HTML tags, then look for the ' linking to ' phrase;
    # the token just before it is the link count (commas removed).
    htmlSplitter = re.compile('<.*?>')
    y = htmlSplitter.split(resp[0])
    value = ''
    for j in range(1, len(y)):
        if y[j] == ' linking to ':
            value = string.replace(y[j - 1], ",", "")

# ...

# Lower batch: retry the request after a pause if it fails.
br = MakeBrowser()
haveResp = False
while not haveResp:
    try:
        br.open('http://inventory.overture.com/d/' +
                'searchinventory/suggestion/')
        br.select_form(name='stst')
        br['term'] = searchTerm
        resp = br.submit().readlines()
        haveResp = True
    except urllib2.URLError:
        time.sleep(10)
Nov 17 '07 #1
3 Replies


Expert 100+
P: 671
Don't be surprised that your program is being blocked. You are, after all, hammering their servers with inefficient requests, at an unsustainable rate. Keep this up and your IP will be banned.

Obviously, the rate at which you issue requests is a problem, and using the web interface is inherently inefficient. Even worse, I doubt you coded your program to behave like a good client, such as accepting gzip compression of the page, caching responses, and so on. I haven't looked into it myself, but see if Google has a more efficient way of issuing these requests, through some API or other.

At the very least, don't keep re-requesting the Google search page and re-submitting the form over and over; you can construct the search query in the URL directly.
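For instance, here is a rough sketch of issuing the query straight from a constructed URL, using only the standard urllib/urllib2 modules (the example sites and the 5-second pause are arbitrary placeholders, not values Google specifies):

import time
import urllib
import urllib2

def search_url(site):
    # URL-encode the whole "link:example.com" query string.
    return 'http://www.google.com/search?' + urllib.urlencode({'q': 'link:' + site})

for site in ('youtube.com', 'wikipedia.org'):
    html = urllib2.urlopen(search_url(site)).read()
    # ... extract the "linking to" figure from html, as in your current parser ...
    time.sleep(5)  # crude politeness delay between requests

That avoids loading and re-submitting the search form for every single site.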
Nov 18 '07 #2

P: 75
If a site is blocking your app outright, your web crawler can also pretend to be someone else:

import urllib

class AppURLopener(urllib.FancyURLopener):
    # Report a browser-like user-agent string instead of urllib's default.
    # (I got this version string from Wikipedia; I don't remember the
    # article or whether it still exists.)
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.1; nl; rv:1.8) Gecko/20051107 Firefox/1.5"

# ...

urllib._urlopener = AppURLopener()
Of course, this isn't exactly a recommended approach ;D

And if I remember correctly, a site declares whether it allows access to crawlers (via robots.txt, or a robots meta tag in the HTML), so your app "should" check that first.
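Something along these lines, with the standard robotparser module, could do that check (the crawler name is just a placeholder):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.google.com/robots.txt')
rp.read()

# Only fetch if robots.txt permits it for our (made-up) user-agent name.
if rp.can_fetch('MyResearchCrawler', 'http://www.google.com/search?q=link:youtube.com'):
    pass  # go ahead and issue the request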
Nov 23 '07 #3

P: 7
Thank you very much for your response. I have tried using urlopen on the URL directly with Google, but I'm finding that Google blocks this as well:

z = urlopen("http://www.google.com/search?q=link%3A" + www_site +
            "&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=1&startPage=1")
q = z.readlines()

I've found an API called PyGoogle that may work, though it seems to be used for more complicated tasks like getting a webpage's search ranking.

To get more realistic searching / good-client behavior, are there any simple ways to allow for gzip compression or caching?
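For example, would something along these lines be a reasonable way to ask for gzip? (Just a sketch using the standard urllib2/gzip/StringIO modules; I haven't confirmed that Google honors it for these requests.)

import gzip
import StringIO
import urllib2

request = urllib2.Request('http://www.google.com/search?q=link%3Ayoutube.com')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
data = response.read()

# The server may ignore the header, so check before decompressing.
if response.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()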

Thanks.
Jan 6 '08 #4
