
Saving a webpage's links to the hard disk

Is there a good place to look for code that will help me save a
webpage's links to the local drive, after I have used urllib2 to
retrieve the page? Many times I have to view these pages when I do
not have access to the internet.
Jun 27 '08 #1
6 Replies


On Sun, 04 May 2008 01:33:45 -0300, Jetus <st********@gmail.com> wrote:
> Is there a good place to look for code that will help me save a
> webpage's links to the local drive, after I have used urllib2 to
> retrieve the page? Many times I have to view these pages when I do
> not have access to the internet.
Don't reinvent the wheel: use wget (http://en.wikipedia.org/wiki/Wget).
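If you just want a browsable offline copy, a minimal sketch of shelling
out to wget from Python (assuming wget is installed and on the PATH):

import subprocess

# --page-requisites fetches the images/CSS the page needs to display;
# --convert-links rewrites the saved HTML so its links point at the
# local copies.
subprocess.call(['wget', '--page-requisites', '--convert-links',
                 'http://python.org/'])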

--
Gabriel Genellina

Jun 27 '08 #2

On May 4, 12:33 am, "Gabriel Genellina" <gagsl-...@yahoo.com.ar> wrote:
> Don't reinvent the wheel: use wget (http://en.wikipedia.org/wiki/Wget).
A lot of the functionality is already present.

import urllib
import urlparse
from htmllib import HTMLParser
from formatter import NullFormatter

# Fetch the page and save it locally.
urllib.urlretrieve('http://python.org/', 'main.htm')

# Parse the saved HTML. NullFormatter discards the rendered text; the
# parser still collects every <a href> value in its .anchorlist.
parser = HTMLParser(NullFormatter())
parser.feed(open('main.htm').read())

# Resolve each (possibly relative) link against the base URL.
for a in parser.anchorlist:
    print urlparse.urljoin('http://python.org/', a)

Output snipped:

...
http://python.org/psf/
http://python.org/dev/
http://python.org/links/
http://python.org/download/releases/2.5.2
http://docs.python.org/
http://python.org/ftp/python/2.5.2/python-2.5.2.msi
...
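If the goal is literally to save the list of links to disk, a small
extension of the loop above writes them to a file instead of printing
them (the name 'links.txt' is an arbitrary choice):

# Sketch: persist the resolved links for offline reference.
out = open('links.txt', 'w')
for a in parser.anchorlist:
    out.write(urlparse.urljoin('http://python.org/', a) + '\n')
out.close()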
Jun 27 '08 #3

On May 4, 7:22 am, castiro...@gmail.com wrote:
> A lot of the functionality is already present.
> [code and output snipped]
How can I modify or add to the above code, so that the file references
are saved to specified local directories, AND the saved webpage makes
reference to the new saved files in the respective directories?
Thanks for your help in advance.
Jun 27 '08 #4

On May 7, 1:40 am, Jetus <stevegi...@gmail.com> wrote:
> How can I modify or add to the above code, so that the file references
> are saved to specified local directories, AND the saved webpage makes
> reference to the new saved files in the respective directories?
You'd have to convert each URL in the loop to a file-system path,
creating directories as needed with os.makedirs(). You'd also have to
rewrite the links inside the saved HTML to point at the local copies;
alternatively, prefix them with localhost and serve the saved tree from
a small local HTTP server. A rough sketch of the first approach follows.
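A rough sketch, assuming Python 2 as used elsewhere in the thread. The
helper names local_path and save_and_relink are made up for this
example, and the naive href regex will miss single-quoted or unquoted
attributes; it is an illustration, not a robust crawler:

import os
import re
import urllib
import urlparse

BASE = 'http://python.org/'

def local_path(url):
    # Map an absolute URL to a relative file-system path, e.g.
    # http://python.org/psf/ -> python.org/psf/index.html
    host, path = urlparse.urlparse(url)[1:3]
    if not path or path.endswith('/'):
        path += 'index.html'
    return os.path.join(host, *path.strip('/').split('/'))

def save_and_relink(match):
    url = urlparse.urljoin(BASE, match.group(1))
    dest = local_path(url)
    dirname = os.path.dirname(dest)
    if dirname and not os.path.isdir(dirname):
        os.makedirs(dirname)           # create the local directories
    try:
        urllib.urlretrieve(url, dest)  # save the linked file locally
    except IOError:
        return match.group(0)          # on failure, leave the link as is
    return 'href="%s"' % dest          # point the page at the local copy

urllib.urlretrieve(BASE, 'main.htm')
html = open('main.htm').read()
html = re.sub(r'href="([^"]+)"', save_and_relink, html)
open('main_local.htm', 'w').write(html)

To browse afterwards, open main_local.htm directly, or serve the saved
tree with `python -m SimpleHTTPServer` and browse it via localhost.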
Jun 27 '08 #5

Jetus wrote:
> How can I modify or add to the above code, so that the file references
> are saved to specified local directories, AND the saved webpage makes
> reference to the new saved files in the respective directories?
> Thanks for your help in advance.
How about you *try* to do so, and if you have actual problems, you come
back and ask for help? Alternatively, there's always guru.com.

Diez
Jun 27 '08 #6

On May 7, 8:36 am, "Diez B. Roggisch" <de...@nospam.web.de> wrote:
> How about you *try* to do so, and if you have actual problems, you
> come back and ask for help? Alternatively, there's always guru.com.
I've tried, to no avail. How does the open-source plug-in for Python
look/work? Firefox was able to spawn Python in a toolbar in a distant
land. Does it still? I believe that under the DOM you could return a
file named X that contains a list of changes to make to the page, or
put it at the top of one, to be removed by Firefox. At that point, X
would pretty much be the last lexically-sorted file in a
pre-established directory. Files are really easy to create and add
syntax to, if you create a bunch of them. Sector size was bouncing,
though, which brings that all the way up to the file system.

// Pseudocode: walk the document's links; wherever one matches an entry
// in pythonfileA's list, substitute the corresponding entry from
// pythonfileB.
int pyID = 0;
for (int docID = 0; docID < doc.links.length; docID++) {
    if (doc.links[docID] == pythonfileA.links[pyID]) {
        doc.links[docID].anchor = pythonfileB.links[pyID];
        pyID++;
    }
}
Jun 27 '08 #7

This discussion thread is closed; replies have been disabled.