
Parsing links within an HTML file

Hello,
I have an HTML file named guide_ind.html that contains links to
other HTML files, such as guides.html#outline. How do I point
BeautifulSoup (the module I want to use) at guides.html#outline?
Thanks
Shriphani P.
Jan 14 '08 #1
1 Reply


Try Mark Pilgrim's excellent example at:
http://www.diveintopython.org/http_w...ces/index.html

From the above link, you can retrieve openanything.py which I use in
my example:

# list_url.py
# created by Hai Vu on 1/16/2008

from openanything import fetch
from sgmllib import SGMLParser

class RetrieveURLs(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attributes):
        url = [v for k, v in attributes if k.lower() == 'href']
        self.urls.extend(url)
        print '\t%s' % (url)

# --------------------------------------------------------------------------------------------------------------
# main
def main():
    site = 'http://www.google.com'

    result = fetch(site)
    if result['status'] == 200:
        # Extracts a list of URLs off the top page
        parser = RetrieveURLs()
        parser.feed(result['data'])
        parser.close()

        # Display the URLs we just retrieved
        print '\nURL retrieved from %s' % (site)
        print '\t' + '\n\t'.join(parser.urls)
    else:
        print 'Error (%d) retrieving %s' % (result['status'], site)

if __name__ == '__main__':
    main()
Jan 17 '08 #2
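
The example above is Python 2 and relies on sgmllib, which was removed in Python 3. It also leaves the "#outline" part of the link untouched. As a rough sketch only, assuming Python 3 and nothing beyond the standard library (so neither BeautifulSoup nor openanything.py), the same idea can be written with html.parser, with urllib.parse splitting each link into the file to open and the fragment to look for. The names LinkCollector and split_links are made up for this illustration and do not come from the thread.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this,
        # so 'HREF' in the page arrives here as 'href'.
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href' and v)


def split_links(html, base):
    """Return a (document_url, fragment) pair for each link found in html.

    base is the URL or file name of the page the links came from; it is
    only used to resolve relative links.
    """
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    # urljoin resolves the link against base; urldefrag strips off "#...".
    return [tuple(urldefrag(urljoin(base, href))) for href in parser.urls]


if __name__ == '__main__':
    # Hypothetical page content standing in for guide_ind.html.
    sample = ('<p><a href="guides.html#outline">Outline</a> '
              '<a HREF="faq.html">FAQ</a></p>')
    for document, fragment in split_links(sample, 'guide_ind.html'):
        print(document, fragment or '(no fragment)')
```

The document part (guides.html) is what would be opened and handed to whichever parser is preferred, BeautifulSoup included. The fragment (outline) usually names an element in that file through an id or name attribute, so it is something to search for after parsing rather than something a parser can be pointed at directly.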
