Hi,
I am a student and need to write a simple web crawler in Python, and I need some guidance on how to start. I need to crawl web pages using both BFS and DFS: one using a stack and the other using a queue.
I will try it on old web pages only, so that I can learn how it is done. I have taken a course called Search Engines and need some help with this.
Help of any kind would be appreciated.
Thank you
It's quite easy, actually. You need one thing: a way to parse an HTML page (which is found in the Python standard library), and, as you pointed out in your post, breadth-first search (BFS) and depth-first search (DFS). You also need some kind of structure to record whether you have visited a certain page before (a dict or set, say).
Let's assume that we use BFS, with a collections.deque as the queue (a plain list with append/pop gives you a stack, i.e. DFS), and that you start on a certain page (www.thescripts.com ?:)

from collections import deque

visited = set()
queue = deque()
queue.append("http://www.thescripts.com")
while len(queue) > 0:
    currpage = queue.popleft()   # pop from the front: FIFO, so breadth-first
    if currpage in visited:
        continue
    visited.add(currpage)        # mark it as visited
    links = findlinks(currpage)  # this function finds all the links on the page
    # here you can do whatever you need, like finding some text, downloading
    # some images, etc.
    # append all unvisited links to the queue
    for l in links:
        if l not in visited:
            queue.append(l)
This was strictly pseudocode (findlinks is left to you), since I haven't got a Python interpreter here. If you still need it, I could write you a simple crawler.
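Since the original question asked for both traversals, here is a minimal runnable sketch of the same idea showing that BFS and DFS differ only in which end of the frontier you pop from. The link graph and findlinks below are made-up stand-ins so it runs offline; in a real crawler findlinks would fetch and parse the page.

```python
from collections import deque

# Toy link graph standing in for real pages (hypothetical URLs).
GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}

def findlinks(page):
    # Stub: a real version would download `page` and extract its links.
    return GRAPH.get(page, [])

def crawl(start, dfs=False):
    """Return pages in visit order. dfs=False pops the front (queue, BFS);
    dfs=True pops the back (stack, DFS)."""
    frontier = deque([start])
    visited = set()
    order = []
    while frontier:
        page = frontier.pop() if dfs else frontier.popleft()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for link in findlinks(page):
            if link not in visited:
                frontier.append(link)
    return order

print(crawl("a"))            # BFS: ['a', 'b', 'c', 'd', 'e']
print(crawl("a", dfs=True))  # DFS: ['a', 'c', 'e', 'b', 'd']
```

One deque serves both roles, which makes the symmetry between the two searches obvious.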
-kudos
Hi friend, I am also developing a crawler. Please share the ideas you got.
Hi, what do you want to get from your crawl?
-kudos
I am looking for one that will read from a list of URLs, crawl them for certain text words, and then list the results.
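A minimal sketch of that idea: scan a list of URLs for given words and report which pages matched. The fetch function here is a hard-coded stub (with made-up URLs) so the sketch runs offline; in real use it would be something like urllib.request.urlopen(url).read().

```python
def fetch(url):
    # Stand-in for a real download; hypothetical URLs and contents.
    pages = {
        "http://example.com/a": "python crawler tutorial",
        "http://example.com/b": "java servlet basics",
    }
    return pages.get(url, "")

def search_urls(urls, words):
    """Return {url: [matched words]} for every URL whose text contains
    at least one of the given words (case-insensitive)."""
    results = {}
    for url in urls:
        text = fetch(url).lower()
        hits = [w for w in words if w.lower() in text]
        if hits:
            results[url] = hits
    return results

urls = ["http://example.com/a", "http://example.com/b"]
print(search_urls(urls, ["python", "crawler"]))
# {'http://example.com/a': ['python', 'crawler']}
```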
I am also trying to do that, but my crawler takes a very long time to crawl. I have done it in Python. Can you folks give me some clue?
I have also written a crawler which parses URLs out of HTML. I think that Python's HTML parser modules only work with clean and valid HTML code... and the net is full of dirty HTML! So get ready to write your own HTML parser =)
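For what it's worth, Python 3's html.parser.HTMLParser is fairly tolerant of sloppy markup (unclosed tags, unquoted attribute values), so you may not need a hand-rolled parser. A small sketch, fed deliberately messy HTML:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags; HTMLParser does not choke
    on unclosed tags or unquoted attributes."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Deliberately dirty HTML: unquoted attribute, no closing tags.
dirty = '<p><a href=http://example.com>one <a href="/two">two'
parser = LinkExtractor()
parser.feed(dirty)
print(parser.links)  # ['http://example.com', '/two']
```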
I'm very interested in how a web crawler works. Would you mind if I ask for some sample code, so that I could study it and later make my own?
Hi, I am trying to make a crawler and get the most frequent keywords from the pages of one site... any ideas?
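Once you have the page text (from any of the crawlers in this thread), counting keywords is a job for collections.Counter. A sketch, with a tiny made-up stopword list you would want to extend:

```python
import re
from collections import Counter

# Minimal illustrative stopword list; real ones are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}

def top_keywords(text, n=3):
    """Return the n most frequent non-stopword words in the page text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

page_text = "The crawler visits pages. The crawler counts words on pages."
print(top_keywords(page_text, 2))  # [('crawler', 2), ('pages', 2)]
```

To rank keywords for a whole site, feed every crawled page's text into one shared Counter instead of a fresh one per page.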
Hi, I need to write a simple crawler too. It must be able to capture web pages from a certain site, for example www.CNN.com, and it must also parse those HTML pages. I need any sample code please, urgently, to help me with my project.
A simple HTML parser (using Python 3's html.parser, which replaced sgmllib) that looks for thumbnail tags and prints the thumbnail information -

from html.parser import HTMLParser
from urllib.request import urlopen

class ThumbnailScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.href = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # remember the link we are inside, so images can report it
            for name, value in attrs:
                if name == 'href':
                    self.href = value
        elif tag == 'img' and self.href:
            print("#####################################")
            print("IMAGE URL: " + self.href)
            labels = {'src': 'SRC', 'width': 'WIDTH', 'height': 'HEIGHT',
                      'alt': 'NAME', 'border': 'BORDER'}
            for name, value in attrs:
                if name in labels and value:
                    print("THUMBNAIL " + labels[name] + ": " + value)
            print("#####################################\n")

    def handle_endtag(self, tag):
        if tag == 'a':
            self.href = ''

url = "http://bytes.com/"
page = urlopen(url).read().decode('utf-8', errors='replace')
parser = ThumbnailScraper()
parser.feed(page)
parser.close()
Hi kudos,
I want to write a crawler which will fetch data like company name, turnover, and the products they work on, and store it in my database.
Actually, I have to submit a project. I have made a simple HTML-tag-based crawler, but I want to make a dynamic simple web crawler.
Your help is required!!!
Thanks in advance!!!
Varun
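For the "store into my database" part, the standard library's sqlite3 module is enough for a project. A sketch under assumptions: the table columns follow the fields named above (name, turnover, product), the rows are hypothetical, and a real crawler would produce them by parsing the target pages.

```python
import sqlite3

# In-memory database for the sketch; pass a file path in real use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, turnover TEXT, product TEXT)")

def store(record):
    """record is a (name, turnover, product) tuple scraped from a page."""
    conn.execute("INSERT INTO companies VALUES (?, ?, ?)", record)
    conn.commit()

# Hypothetical scraped rows, just to exercise the schema.
store(("Acme Ltd", "2M", "widgets"))
store(("Globex", "5M", "gadgets"))

rows = conn.execute("SELECT name, product FROM companies").fetchall()
print(rows)  # [('Acme Ltd', 'widgets'), ('Globex', 'gadgets')]
```

The ? placeholders matter: they let sqlite3 quote the scraped values for you, so stray quotes in page text cannot break the SQL.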
OK, with web crawlers there are usually a lot of "ifs", but I have sketched out a very simple web crawler that illustrates the idea (with comments!) -

# webcrawler
# this is basically a shell, illustrating the "breadth-first" type of web crawler
# you have to add the code for extracting the actual info from the web pages yourself
# all it currently does is print the URL of each page and the number of candidates to visit

from collections import deque
from urllib.request import urlopen

page = "http://bytes.com"  # start page
queue = deque()
queue.append(page)
visited = {}  # keeps track of pages that we visited, to avoid loops
visited[page] = 1
stopvar = 5  # exit after visiting x pages; obviously we do not want to visit all pages of the internet :)

while stopvar >= 0 and queue:
    stopvar -= 1
    cpage = queue.popleft()  # popping from the front (FIFO) gives breadth-first order
    f = urlopen(cpage)
    html = f.read().decode('utf-8', errors='replace')
    sp = 'a href="'

    # you want to extract things from the html code (such as images, text, etc.) around here
    # the rest extracts hyperlinks and puts them into the queue, so we can
    # continue to visit pages

    i = html.find(sp)
    while i != -1:
        start = i + len(sp)
        end = html.find('"', start)
        if end == -1:
            break
        url = html[start:end]
        # is our link a local link, or a global link? I leave local links as an exercise :)
        if url[0:4] == "http" and url not in visited:
            queue.append(url)
            visited[url] = 1
        i = html.find(sp, end)

    print(str(len(queue)) + " " + cpage)

-kudos
Try Scrapy, a very powerful (and well documented) framework for writing web crawlers (and screen scrapers) in Python.