Hello..
I'm looking for a script (perl, python, sh...) or program (such as wget)
that will help me get a list of ALL the links on a website.
For example, ./magicscript.pl www.yahoo.com would output them to a file; it
would be kind of like spidering software.
Any suggestions would be appreciated.
David
On Thursday 18 January 2007 11:57, David Waizer wrote:
> I'm looking for a script (perl, python, sh...) or program (such as wget)
> that will help me get a list of ALL the links on a website.
David, this is a touchy topic but whatever :P Look into sgmllib, and you can
filter on the "A" tag. The book 'Dive Into Python' covers it quite nicely: http://www.diveintopython.org/html_p...ing/index.html
Jonathan
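A minimal sketch of the "filter on the A tag" idea. sgmllib was removed in Python 3, so this swaps in the standard library's html.parser, which works the same way: subclass the parser and override the start-tag handler. The names LinkCollector and extract_links are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this,
        # so <A HREF=...> and <a href=...> are handled alike.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html_text):
    """Return the href values of all anchors in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p><A HREF="/news">News</A> and <a href="http://example.com/">home</a></p>'
    for link in extract_links(sample):
        print(link)
```

This only handles one page that is already in memory. Anchors without an href (such as named anchors) are skipped, and relative links are returned as written.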
Check out the quick start section in the documentation at Beautiful
Soup http://www.crummy.com/software/BeautifulSoup/
Wes
4 easy steps to get the links:
1. Download BeautifulSoup and import it in your script file.
2. Use urllib2 to download the HTML of the URL.
3. Parse the HTML using BeautifulSoup.
4. Loop over the 'a' tags and print each one:
    for tag in BeautifulSoupisedHTML.findAll('a'):
        print tag
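The steps above cover a single page. To get the links for a whole site, as the original question asks, the same download-and-parse loop can be repeated for every same-host link that turns up. Here is a rough sketch under a few assumptions: it uses only the Python 3 standard library (urllib.request in place of urllib2, html.parser in place of BeautifulSoup), the names crawl, fetch_html and max_pages are hypothetical, and it ignores robots.txt and rate limiting, which a polite spider should respect.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _AnchorParser(HTMLParser):
    """Gather href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_html(url):
    """Download a page and decode it; returns '' for non-HTML responses."""
    with urlopen(url, timeout=10) as response:
        if "html" not in response.headers.get_content_type():
            return ""
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl of one host; returns every http(s) link seen, sorted."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    found = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            page = fetch(url)
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it
        parser = _AnchorParser()
        parser.feed(page)
        for href in parser.hrefs:
            # Resolve relative links and drop #fragments so pages are not revisited.
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).scheme not in ("http", "https"):
                continue  # mailto:, javascript:, etc.
            found.add(absolute)
            # Only follow links that stay on the starting host.
            if urlparse(absolute).netloc == host and absolute not in visited:
                queue.append(absolute)
    return sorted(found)
```

Calling crawl("http://www.example.com/") returns a sorted list that can be written to a file one link per line. Links to other hosts are recorded but not followed, and max_pages caps how many pages are downloaded.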
In article <8N******************************@fdn.com>,
"David Waizer" <dw*****@noreply.com> wrote:
> I'm looking for a script (perl, python, sh...) or program (such as wget)
> that will help me get a list of ALL the links on a website.
David,
In addition to others' suggestions about Beautiful Soup, you might also
want to look at the HTMLData module: http://oregonstate.edu/~barnesc/htmldata/
--
Philip http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
In <comp.os.linux.misc> David Waizer <dw*****@noreply.com> wrote:
> I'm looking for a script (perl, python, sh...) or program (such as wget)
> that will help me get a list of ALL the links on a website.
lynx -dump (look at the bottom)
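To collect that list from a script instead of reading it off the screen, something like the sketch below might do. It assumes lynx is installed and on the PATH, and that the dump ends with a "References" heading followed by numbered URLs, which is the usual layout but may differ between lynx versions. The function names are made up for illustration.

```python
import re
import subprocess

# A numbered reference line such as "   3. http://example.com/page"
_REF_LINE = re.compile(r"^\s*\d+\.\s+(\S+)\s*$")


def links_from_dump(dump_text):
    """Return the URLs listed under the final 'References' heading."""
    lines = dump_text.splitlines()
    start = None
    for index, line in enumerate(lines):
        if line.strip() == "References":
            start = index  # keep the last occurrence
    if start is None:
        return []
    links = []
    for line in lines[start + 1:]:
        match = _REF_LINE.match(line)
        if match:
            links.append(match.group(1))
    return links


def lynx_links(url):
    """Run `lynx -dump` on url (requires lynx on PATH) and parse its output."""
    result = subprocess.run(
        ["lynx", "-dump", url],
        capture_output=True, text=True, check=True,
    )
    return links_from_dump(result.stdout)
```

Like the command itself, this lists the links of one page only. If your lynx build supports it, adding -listonly to the command should print just the link list without the page text.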
--
William Park <op**********@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell http://freshmeat.net/projects/bashdiff/