need to write a simple web crawler

1 New Member
Hi,

I am a student and need to write a simple web crawler using Python, and I need some guidance on how to start. I need to crawl web pages using both BFS and DFS, one using a stack and the other using a queue.

I will try it on obsolete web pages only, so that I can learn how to do it. I have taken a course called Search Engines and need some help with this.

Help of any kind would be appreciated.

Thank you
Sep 16 '06 #1
kudos
127 Recognized Expert New Member
It's quite easy, actually. You need a way to parse an HTML page (there is one in the Python standard library) and, as you pointed out in your post, breadth-first search (BFS) and depth-first search (DFS). You also need some kind of structure to record whether you have visited a certain page before (maybe a hash table?).

Let's assume that we use BFS with Python's built-in list, and that you start on a certain page (www.thescripts.com ?:)

visited = set()                      # pages we have already queued
queue = ["www.thescripts.com"]
visited.add("www.thescripts.com")

while len(queue) > 0:
    currpage = queue.pop(0)          # pop(0) takes the oldest entry, giving BFS; a plain pop() would give DFS
    links = findlinks(currpage)      # findlinks() is a placeholder: it should return all the links on the page
    # here you can do what you would do, like finding some text, downloading
    # some image etc etc
    # queue all the links we have not seen before
    for l in links:
        if l not in visited:
            visited.add(l)
            queue.append(l)

This was strictly pseudocode, since I haven't got a Python interpreter here. If you still need it, I could write you a simple crawler.

-kudos
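
To see how the choice between a queue and a stack changes the crawl order, here is a minimal sketch that runs the same loop over a tiny made-up link graph. The `LINKS` dictionary, the page names and the `crawl` function are all hypothetical stand-ins for real pages and a real link finder, so treat it as an illustration rather than a finished crawler.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": ["F"],
    "F": ["A"],  # a link back to the start, to show why visited pages are tracked
}

def crawl(start, breadth_first=True):
    """Return the pages in the order they are visited."""
    visited = {start}
    frontier = deque([start])
    order = []
    while frontier:
        # popleft() takes the oldest entry (queue, BFS);
        # pop() takes the newest entry (stack, DFS).
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order

bfs_order = crawl("A", breadth_first=True)
dfs_order = crawl("A", breadth_first=False)
print("BFS:", bfs_order)
print("DFS:", dfs_order)
```

The only difference between the two traversals is which end of the frontier the next page is taken from; the visited set is what stops the link from F back to A from looping forever.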



Sep 17 '06 #2
squzer
3 New Member
Hi friend.. I am also involved in developing a crawler.. please share the ideas you got........
Jun 18 '07 #3
kudos
127 Recognized Expert New Member
Hi, what do you want to get from your crawl?

-kudos
Jun 18 '07 #4
mike171562
1 New Member
I am looking for one that will read from a list of urls and crawl them for certain text words and then list the results.
Aug 6 '07 #5
technoashis
2 New Member
I am also trying to do that, but my crawler takes a hell of a lot of time to crawl. I have done it in Python. Can you folks give me some clues?
Nov 12 '07 #6
dazzler
75 New Member
I have also written a crawler which parses URLs from HTML. I think that Python's HTML parser modules only work with clean & valid HTML code... and the net is full of dirty HTML! So get ready to write your own HTML parser =)
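
For what it's worth, the `html.parser` module in current Python 3 is documented as tolerant of invalid markup, so it may be worth trying before writing a parser by hand. Below is a minimal sketch of a link finder built on it. The names `LinkCollector` and `find_links` and the sample HTML are made up for illustration, and how well this copes with any particular real-world page is an assumption you would need to check.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def find_links(html, base_url):
    """Return all link targets found in an HTML string (hypothetical helper)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links

# Deliberately sloppy HTML: unclosed tags, uppercase names, an unquoted attribute.
messy = '<HTML><body><p>Hello <A HREF=/news>news<p><a href="http://example.com/x">x</a><b>bold'
found = find_links(messy, "http://example.com/index.html")
print(found)
```

A function like this could play the role of a link finder inside a crawl loop, with the page text fetched separately.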
Nov 12 '07 #7
heiro
56 New Member

I'm very interested in how a web crawler works. Would you mind if I ask for some sample code, so that I can study it and later make my own?
Nov 24 '07 #8
helena pap
1 New Member
Hi, I am trying to make a crawler that finds the most frequent keywords on the pages of one site... any ideas?
Mar 29 '08 #9
urgent
1 New Member
Hi, I need to write a simple crawler too. It must be able to capture web pages from a certain site, for example www.CNN.com,

and it must also parse those HTML pages. I need some sample code please, urgently, to help me with my project.
Apr 4 '08 #10
