On 24 Jan 2007 11:07:49 -0800, Paul McGuire <pt***@austin.rr.com> wrote:
On Jan 24, 10:20 am, "Johny" <pyt...@hope.cz> wrote:
Does anyone know about a good regular expression for URL extracting?
J.
Google turns this up:
http://geekswithblogs.net/casualjim/.../01/61722.aspx
But I've seen other re's for this problem that are hundreds of
characters long.
-- Paul
These are the regexps that gnome-terminal uses for its URL
auto-recognition, and I have shamelessly stolen them for use in one of
my own apps:
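import re

urlfinders = [
    # URL or IPv4 host, optional port, followed by a path
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]"),
    # URL or IPv4 host with optional port, no path
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?"),
    # Local file paths starting with ~/, / or ./ (backslash-space counts as an escaped space)
    re.compile("(~/|/|\\./)([-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]|\\\\ )+"),
    # Email addresses. Python's re has no "\<" word-start anchor and reads it as
    # a literal "<", so as written this only matches text that begins with '<
    re.compile("'\\<((mailto:)|)[-A-Za-z0-9\\.]+@[-A-Za-z0-9\\.]+"),
]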
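
For what it's worth, here is a rough sketch of how patterns like the first two above might be applied to a piece of text. The names HOST, PATH and find_urls are made up for this example and are not from gnome-terminal or my app. It relies on the first pattern being the second one plus a path suffix. One deliberate change, which is my assumption about the intent: the final character class also excludes whitespace (the added \\s), because as quoted it would accept a trailing space as the last character of a URL.

```python
import re

# Host part shared by the first two patterns above: an IPv4 address, or a
# scheme:// or www./ftp. prefix followed by a host name, then an optional port.
HOST = ("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}"
        "|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)"
        "[-A-Za-z0-9\\.]+)(:[0-9]*)?")

# Path suffix from the first pattern. ASSUMPTION: "\\s" is added to the final
# class so that a match cannot end in whitespace; the quoted pattern lacks it.
PATH = "/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"\\s]"

URL_WITH_PATH = re.compile(HOST + PATH)
URL_HOST_ONLY = re.compile(HOST)


def find_urls(text):
    """Return URL-like substrings of text in order of appearance.

    The more specific pattern runs first; a later match is skipped if it
    overlaps a span that an earlier pattern already claimed.
    """
    spans = []
    for pattern in (URL_WITH_PATH, URL_HOST_ONLY):
        for m in pattern.finditer(text):
            overlaps = any(m.start() < end and start < m.end()
                           for start, end in spans)
            if not overlaps:
                spans.append(m.span())
    return [text[start:end] for start, end in sorted(spans)]


if __name__ == "__main__":
    sample = "See http://www.python.org/doc/ and www.example.com for details."
    print(find_urls(sample))
```

Running the more specific pattern first matters here: the host-only pattern would otherwise claim just the "http://www.python.org" prefix and the path would be lost.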