
spidering script

Hello..

I'm looking for a script (perl, python, sh...) or program (such as wget)
that will help me get a list of ALL the links on a website.

For example, ./magicscript.pl www.yahoo.com would output the links to a file;
it would be kind of like spidering software.

Any suggestions would be appreciated.

David
Jan 18 '07 #1
On Thursday 18 January 2007 11:57, David Waizer wrote:
> I'm looking for a script (perl, python, sh...) or program (such as wget)
> that will help me get a list of ALL the links on a website.

David, this is a touchy topic but whatever :P Look into sgmllib, and you can
filter on the "A" tag. The book 'Dive Into Python' covers it quite nicely:
http://www.diveintopython.org/html_p...ing/index.html
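
A minimal sketch of that approach (Python 2 only, since sgmllib was removed in Python 3; the URLLister name and the example URL are placeholders, not anything from Jonathan's post):

import urllib
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # attrs is a list of (name, value) pairs for each <a> tag.
        self.urls.extend([value for name, value in attrs if name == 'href'])

page = urllib.urlopen('http://www.yahoo.com/')
parser = URLLister()
parser.feed(page.read())
parser.close()
page.close()

for url in parser.urls:
    print url

Redirect stdout to a file (python links.py > links.txt) and you have roughly what the original post asked for.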

Jonathan
Jan 18 '07 #2
Check out the Quick Start section of the Beautiful Soup documentation:
http://www.crummy.com/software/BeautifulSoup/
Wes
Jan 18 '07 #3
4 easy steps to get the links:

1. Download BeautifulSoup and import it in your script file.
2. Use urllib2 to download the HTML of the URL.
3. Mash the HTML using BeautifulSoup.
4. Loop over the anchor tags:
for tag in BeautifulSoupisedHTML.findAll('a'):
    print tag

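Putting steps 2-4 together, a rough end-to-end sketch (assumes Python 2 with BeautifulSoup 3.x importable as below; restricting findAll to tags that actually carry an href is my addition, not part of the steps above):

import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://www.yahoo.com/'
html = urllib2.urlopen(url).read()        # step 2: fetch the page
soup = BeautifulSoup(html)                # step 3: mash the HTML into a tree
for tag in soup.findAll('a', href=True):  # step 4: walk the anchor tags
    print tag['href']

Run it with the output redirected (python links.py > links.txt) to get the file David asked for.
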
David Waizer wrote:
> I'm looking for a script (perl, python, sh...) or program (such as wget)
> that will help me get a list of ALL the links on a website.
Jan 19 '07 #4
In article <8N******************************@fdn.com>,
"David Waizer" <dw*****@noreply.comwrote:
> I'm looking for a script (perl, python, sh...) or program (such as wget)
> that will help me get a list of ALL the links on a website.

David,
In addition to others' suggestions about Beautiful Soup, you might also
want to look at the HTMLData module:

http://oregonstate.edu/~barnesc/htmldata/

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Jan 20 '07 #5
In <comp.os.linux.misc> David Waizer <dw*****@noreply.com> wrote:
> I'm looking for a script (perl, python, sh...) or program (such as wget)
> that will help me get a list of ALL the links on a website.

lynx -dump (look at the bottom)
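
For example, something along these lines (the exact flags depend on your lynx build; -listonly in particular is an assumption about a reasonably recent version):

lynx -dump http://www.yahoo.com/ > page.txt
lynx -dump -listonly http://www.yahoo.com/ > links.txt

The first form prints the rendered page with a numbered link list at the bottom under "References"; the second keeps only that list.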

--
William Park <op**********@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
Jan 23 '07 #6
