
How can I extract all URLs in a string by using re.findall()?

I want to retrieve all URLs in a string. When I use re.findall, I get
a list of tuples.
My code is below:

    import re

    url = unicode(r"((http|ftp)://)?(((([\d]+\.)+){3}[\d]+(/[\w./]+)?)|([a-z]\w*((\.\w+)+){2,})([/][\w.~]*)*)")
    m = re.findall(url, html)
    for i in m:
        print i
html is a string variable that contains many URLs.
The code prints many tuples, and none of them seems to represent
a URL. For example, one of them is:

(u'http://', u'http', u'', u'',
u'', u'', u'', u'', u'.com', u'.com',

Why are there two "http" entries in it, and why are there so many empty
strings in the tuple above? It's obviously not a URL. How can I get the URLs?

Thanks in advance.
Jul 18 '05 #1
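
The behaviour described in the question can be reproduced with a much smaller pattern. When a pattern contains more than one capturing group, re.findall returns one tuple per match with one item per group, and a group that did not take part in the match shows up as an empty string. A nested group such as ((http|ftp)://) therefore contributes both "http://" and "http". The sketch below is only an illustration under stated assumptions: the sample text and the simplified pattern are hypothetical and are not the poster's pattern, and it is written for Python 3 rather than the Python 2 used in the post. It shows two common ways to get whole-match strings instead: non-capturing groups (?:...) and re.finditer with group(0).

```python
import re

# Hypothetical sample text, not taken from the original post.
text = "Mirrors: http://example.com and ftp://files.example.org/pub are both up."

# Simplified illustrative pattern with four capturing groups.
capturing = re.compile(r"((http|ftp)://)([\w.-]+)(/[\w./~-]*)?")

# With several capturing groups, findall yields one tuple per match.
# The optional path group is '' when the URL has no path.
tuples = capturing.findall(text)
print(tuples)

# Option 1: make every group non-capturing, so findall returns whole matches.
non_capturing = re.compile(r"(?:(?:http|ftp)://)(?:[\w.-]+)(?:/[\w./~-]*)?")
urls = non_capturing.findall(text)
print(urls)

# Option 2: keep the groups and take group(0), the whole match, from finditer.
urls_via_iter = [m.group(0) for m in capturing.finditer(text)]
print(urls_via_iter)
```

The same two options should apply to the longer pattern in the question: either change each "(" that opens a group to "(?:", or leave the pattern alone and collect group(0) from re.finditer.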