
need to write a simple web crawler

Question posted by: Pradeep Vasudevan (Newbie) on September 16th, 2006 09:19 PM
Hi,

I am a student and need to write a simple web crawler in Python, and I need some guidance on how to start. I need to crawl web pages using both BFS and DFS, one using a queue and the other using a stack.

I will only try it on obsolete web pages, so that I can learn how to do it. I am taking a course called Search Engines and need some help with this.

Help of any kind would be appreciated.

Thank you
kudos
Expert
72 Posts
September 17th, 2006
09:34 AM
#2

Re: need to write a simple web crawler
It's quite easy, actually. You need a way to parse an HTML page (one is included in the Python standard library) and, as you pointed out in your post, breadth-first search (BFS) and depth-first search (DFS). You also need some kind of structure to determine whether you have visited a certain page before (maybe a hash table or a set?).

Let's assume that we use BFS with a queue (a Python deque), and that you start on a certain page (www.thescripts.com ?:)

Code: ( text )
from collections import deque

start = "http://www.thescripts.com"
visited = {start}            # pages we have already seen
queue = deque([start])

while queue:
    currpage = queue.popleft()    # popleft() gives BFS; pop() would give DFS
    links = findlinks(currpage)   # placeholder you write yourself: returns all the links on the page
    # here you can do whatever you need, like finding some text,
    # downloading some image, etc.
    # put all the links we have not seen yet on the queue
    for l in links:
        if l not in visited:
            visited.add(l)
            queue.append(l)

This was strictly pseudocode, since I haven't got a Python interpreter here. If you still need it, I could write you a simple crawler.
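
To make the queue-versus-stack difference concrete, here is a minimal sketch that runs without any network access. The `LINKS` dictionary and the `crawl` function are made up for illustration: the dictionary stands in for real pages and the links found on them.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links on that page.
LINKS = {
    "home": ["about", "news"],
    "about": ["team", "home"],
    "news": ["story", "about"],
    "team": [],
    "story": ["home"],
}


def crawl(start, breadth_first=True):
    """Return the pages reachable from start, in the order they are visited."""
    seen = {start}              # pages already queued, to avoid loops
    frontier = deque([start])   # pages waiting to be visited
    order = []
    while frontier:
        # Taking from the left treats the deque as a queue (BFS);
        # taking from the right treats it as a stack (DFS).
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


bfs_order = crawl("home", breadth_first=True)
dfs_order = crawl("home", breadth_first=False)
print("BFS:", bfs_order)
print("DFS:", dfs_order)
```

The only line that differs between the two searches is the one that takes the next page off the frontier, so a real crawler can switch strategies the same way.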

-kudos



squzer
Newbie
3 Posts
June 18th, 2007
11:05 AM
#3

Re: need to write a simple web crawler
Hi friend, I am also working on developing a crawler. Please share the ideas you have.

kudos
Expert
72 Posts
June 18th, 2007
01:56 PM
#4

Re: need to write a simple web crawler
Hi, what do you want to get from your crawl?

-kudos

mike171562
Newbie
1 Posts
August 6th, 2007
07:54 PM
#5

Re: need to write a simple web crawler
I am looking for one that will read a list of URLs, crawl them for certain words, and then list the results.

technoashis
Newbie
2 Posts
November 12th, 2007
10:28 AM
#6

Re: need to write a simple web crawler
I am also trying to do that, but my crawler takes a very long time to crawl. I have written it in Python. Can you folks give me some clues?

dazzler
Member
75 Posts
November 12th, 2007
12:29 PM
#7

Re: need to write a simple web crawler
I have also written a crawler that parses URLs from HTML. I think that Python's HTML parser modules only work with clean, valid HTML code... and the net is full of dirty HTML! So get ready to write your own HTML parser =)
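
Before writing a parser from scratch, it may be worth testing the standard library on messy markup. As far as I know, `html.parser.HTMLParser` in current Python 3 does not check that tags are properly nested or closed, so it copes with a fair amount of dirty HTML. A small sketch, where the `LinkCollector` class and the `messy` string are made up for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Deliberately sloppy markup: unclosed tags, a stray </b>,
# an unquoted attribute value and upper-case tag names.
messy = "<html><body><a href=\"/a\">one<p><a href=/b>two</b><A HREF='c.html'>three"

collector = LinkCollector()
collector.feed(messy)
collector.close()
print(collector.links)
```

Pages that are broken badly enough may still trip it up, so treat this as a starting point rather than a guarantee.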

heiro
Member
53 Posts
November 24th, 2007
02:20 PM
#8

Re: need to write a simple web crawler
I'm very interested in how a web crawler works. Would you mind if I asked for some sample code, so that I can study it and later write my own?

helena pap
Newbie
1 Posts
March 29th, 2008
09:51 AM
#9

Re: need to write a simple web crawler
Hi, I am trying to write a crawler that finds the most frequent keywords across the pages of one site. Any ideas?
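
One possible starting point for the counting step is sketched below, assuming the pages have already been downloaded and stripped of markup. The `PAGES` dictionary, the `STOPWORDS` set and the `keyword_counts` function are all hypothetical names used only for this illustration.

```python
import re
from collections import Counter

# Hypothetical page texts standing in for downloaded pages (markup already removed).
PAGES = {
    "/index": "Python crawler tutorial: write a crawler in Python",
    "/about": "About this crawler project",
}

# Assumed stop-word list; a real one would be much longer.
STOPWORDS = {"a", "in", "this", "the", "about"}


def keyword_counts(pages):
    """Count how often each non-stop-word occurs across all pages."""
    counts = Counter()
    for text in pages.values():
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


counts = keyword_counts(PAGES)
print(counts.most_common(2))
```

`Counter.most_common(n)` returns the `n` most frequent words with their counts, which is usually all that is needed once the crawl has collected the text.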

urgent
Newbie
1 Posts
April 4th, 2008
06:05 AM
#10

Re: need to write a simple web crawler
Hi, I need to write a simple crawler too. It must be able to capture web pages from a certain site, for example www.CNN.com,

and it must also parse those HTML pages. I urgently need some sample code to help me with my project, please.

chaosAD
Newbie
8 Posts
April 4th, 2008
06:58 AM
#11

Re: need to write a simple web crawler
A simple HTML parser: it looks for thumbnail images (<img> tags inside links) and prints the thumbnail information.

Code: ( text )
import urllib.request
from html.parser import HTMLParser


class ThumbnailScraper(HTMLParser):

    def __init__(self):
        super().__init__()
        self.href = ''   # target of the <a> tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # remember where the enclosing link points
            for name, value in attrs:
                if name == 'href':
                    self.href = value or ''
        elif tag == 'img' and self.href:
            # an image inside a link: treat it as a thumbnail
            print("#####################################")
            print("IMAGE URL: " + self.href)
            for name, value in attrs:
                value = value or ''   # attributes without a value arrive as None
                if name == 'src':
                    print("THUMBNAIL SRC: " + value)
                elif name == "width":
                    print("THUMBNAIL WIDTH: " + value)
                elif name == "height":
                    print("THUMBNAIL HEIGHT: " + value)
                elif name == "alt":
                    print("THUMBNAIL NAME: " + value)
                elif name == "border":
                    print("THUMBNAIL BORDER: " + value)
            print("#####################################\n")

    def handle_endtag(self, tag):
        if tag == 'a':
            self.href = ''


url = "http://bytes.com/"

# download the page and decode it to text
with urllib.request.urlopen(url) as sock:
    page = sock.read().decode("utf-8", errors="replace")

parser = ThumbnailScraper()
parser.feed(page)
parser.close()

varun1985
Newbie
1 Posts
July 1st, 2008
04:26 PM
#12

Re: need to write a simple web crawler
Hi kudos,

I want to write a crawler that fetches data such as company name, turnover, and the products each company works on, and stores it in my database.

Actually, I have to submit a project. I have made a simple crawler based on HTML tags, but I want to make a simple dynamic web crawler.

Your help is required!

Thanks in advance!

Varun

kudos
Expert
72 Posts
July 19th, 2008
10:25 PM
#13

Re: need to write a simple web crawler
OK, with webcrawlers there are usually a lot of 'ifs', but I have sketched out a very simple webcrawler that illustrates the idea (with comments!)

Code: ( text )
# webcrawler
# this is basically a shell, illustrating the "breadth-first" type of webcrawler
# you have to add the code for extracting the actual info from the webpage yourself
# all it currently does is print the url of each page, and the number of candidates to visit

import urllib.request
from collections import deque

page = "http://bytes.com"  # start page
queue = deque([page])      # taking pages from the front of the queue gives breadth-first order
visit = {page: 1}          # keeps track of pages that we have seen, to avoid loops
stopvar = 5                # exit after visiting this many pages; obviously we do not want to visit every page on the internet :)

while queue and stopvar > 0:
    stopvar -= 1
    cpage = queue.popleft()
    try:
        with urllib.request.urlopen(cpage) as f:
            html = f.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        continue  # skip pages that cannot be fetched
    sp = 'a href="'

    # you want to extract things from the html code (such as images, text, etc.) around here
    # the rest of the loop extracts hyperlinks and puts them into the queue, so we can
    # continue to visit pages

    i = html.find(sp)
    while i != -1:
        start = i + len(sp)
        end = html.find('"', start)
        if end == -1:
            break  # no closing quote: stop scanning this page
        url = html[start:end]
        # is our link a local link, or a global link? i leave local links as an exercise :)
        if url[0:4] == "http" and url not in visit:
            queue.append(url)
            visit[url] = 1
        i = html.find(sp, end)
    print(str(len(queue)) + " " + cpage)
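
For the local-link exercise, one possible approach (a sketch, not the only way) is to resolve every href against the URL of the page it was found on, using `urljoin` from the standard library. The `absolute_link` helper and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin, urldefrag


def absolute_link(page_url, href):
    """Resolve href against the page it was found on and drop any #fragment."""
    url, _fragment = urldefrag(urljoin(page_url, href))
    return url


# Hypothetical page address and hrefs, for illustration only.
base = "http://bytes.com/topic/python/"
hrefs = ["answers.html", "/about", "../java/", "http://example.com/x#top"]
resolved = [absolute_link(base, h) for h in hrefs]
for h, r in zip(hrefs, resolved):
    print(h, "->", r)
```

Dropping the fragment means that `page.html#top` and `page.html` count as the same page in the visited set, which avoids fetching one page twice.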


-kudos
