Getting URL's

How do I get all the URL's in a page?

May 19 '06 #1
Use HTMLParser or a regular expression.
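
Both suggestions can be sketched with the standard library alone. The block below is a minimal illustration and not the poster's code: it assumes Python 3, where the parser lives in `html.parser`, and the names `LinkCollector`, `links_with_parser`, `links_with_regex`, `HREF_RE` and `SAMPLE` are made up for the example. The regular expression is deliberately naive and only matches quoted `href` values.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names are lowercased by HTMLParser before they get here.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_with_parser(html):
    """Return the href values of all <a> tags, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Naive pattern (an assumption for this sketch): quoted href values only.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def links_with_regex(html):
    """Return every quoted href value, whatever tag it belongs to."""
    return HREF_RE.findall(html)


SAMPLE = (
    '<p><a href="/about.html">About</a> '
    "<A HREF='http://example.com/'>Example</A></p>"
)
print(links_with_parser(SAMPLE))
print(links_with_regex(SAMPLE))
```

The parser version only looks inside `<a>` tags, while the regex version will also pick up `href` text inside comments or other tags, which is one reason a real parser is usually the safer choice.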

May 19 '06 #2

May 19 '06 #3
"defcon8" <de*****@gmail.com> wrote in message
news:11**********************@y43g2000cwc.googlegroups.com...
> How do I get all the URL's in a page?

pyparsing comes with a simple example that does this, too.

-- Paul
Download pyparsing at http://sourceforge.net/projects/pyparsing
May 19 '06 #4
it is difficult to get all URL's in a page
You can use the sgmllib module to parse HTML files
and extract the standard href attributes.
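
One reason "all" URLs is harder than the standard href is that links also live in other attributes. The sketch below illustrates that idea under some assumptions: it targets Python 3, where `sgmllib` no longer exists, so `html.parser` stands in for it, and the `URL_ATTRS` table, `UrlCollector`, `collect_urls` and `PAGE` are hypothetical names with a deliberately incomplete tag list.

```python
from html.parser import HTMLParser

# Assumed, non-exhaustive map from tag name to the attribute holding a URL.
URL_ATTRS = {
    "a": "href",
    "link": "href",
    "img": "src",
    "script": "src",
    "iframe": "src",
    "form": "action",
}


class UrlCollector(HTMLParser):
    """Collect (tag, url) pairs for every tag listed in URL_ATTRS."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append((tag, value))


def collect_urls(html):
    """Return (tag, url) pairs in document order."""
    parser = UrlCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


PAGE = """
<html><head><link rel="stylesheet" href="/site.css">
<script src="/app.js"></script></head>
<body><a href="/intl/en/about.html">About</a>
<img src="logo.gif" alt="logo"></body></html>
"""
for tag, url in collect_urls(PAGE):
    print(tag, url)
```

URLs that appear only in inline JavaScript, CSS or plain text are still missed by this approach.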

May 19 '06 #5
"softwindow" <so********@gmail.com> wrote in message
news:11*********************@j73g2000cwa.googlegroups.com...
> it is difficult to get all URL's in a page


Is this really so hard?:

from pyparsing import Literal,Suppress,CharsNotIn,CaselessLiteral,\
        makeHTMLTags,SkipTo
import urllib

# extract all <a> anchor tags - makeHTMLTags defines a
# fairly robust pair of match patterns, not just "<tag>","</tag>"
linkOpenTag,linkCloseTag = makeHTMLTags("a")
link = linkOpenTag + \
       SkipTo(linkCloseTag).setResultsName("body") + \
       linkCloseTag.suppress()

# read the HTML source from some random URL
serverListPage = urllib.urlopen( "http://www.google.com" )
htmlText = serverListPage.read()
serverListPage.close()

# use the link grammar to scan the HTML source
for toks,strt,end in link.scanString(htmlText):
    print toks.startA.href,"->",toks.body

/url?sa=p&pref=ig&pval=2&q=http://www.google.com/ig%3Fhl%3Den -> Personalized Home
https://www.google.com/accounts/Logi...gle.com/&hl=en -> Sign in
/imghp?hl=en&tab=wi&ie=UTF-8 -> Images
http://groups.google.com/grphp?hl=en&tab=wg&ie=UTF-8 -> Groups
http://news.google.com/nwshp?hl=en&tab=wn&ie=UTF-8 -> News
http://froogle.google.com/frghp?hl=en&tab=wf&ie=UTF-8 -> Froogle
/maphp?hl=en&tab=wl&ie=UTF-8 -> Maps
/intl/en/options/ -> more&nbsp;&raquo;
/advanced_search?hl=en -> Advanced Search
/preferences?hl=en -> Preferences
/language_tools?hl=en -> Language Tools
/intl/en/ads/ -> Advertising&nbsp;Programs
/services/ -> Business Solutions
/intl/en/about.html -> About Google
-- Paul
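
Many of the hrefs in the listing above are relative to the page, and some link bodies still contain HTML entities. A possible follow-up step, sketched here for Python 3 with the standard library only, resolves each href against the page URL and decodes the entities. `BASE_URL`, `ROWS` and `normalize` are names invented for this illustration; the rows are copied from the listing.

```python
from html import unescape
from urllib.parse import urljoin

# The page fetched in the post; relative hrefs are resolved against it.
BASE_URL = "http://www.google.com"

# A few (href, body) pairs copied from the listing.
ROWS = [
    ("/imghp?hl=en&tab=wi&ie=UTF-8", "Images"),
    ("http://groups.google.com/grphp?hl=en&tab=wg&ie=UTF-8", "Groups"),
    ("/intl/en/options/", "more&nbsp;&raquo;"),
]


def normalize(rows, base=BASE_URL):
    """Return (absolute_url, plain_text) pairs for (href, body) rows."""
    result = []
    for href, body in rows:
        # urljoin leaves absolute URLs alone and resolves relative ones.
        absolute = urljoin(base, href)
        # unescape turns &nbsp; into U+00A0; swap it for a plain space.
        text = unescape(body).replace("\xa0", " ")
        result.append((absolute, text))
    return result


for url, text in normalize(ROWS):
    print(url, "->", text)
```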
May 19 '06 #6
