better regular expression?

Hi,

I am trying to construct a regular expression using the re module that
matches:
1. URLs on my hostname
2. root-relative URLs (those starting with "/"), including just "/"
3. relative URLs.

Basically I want the pattern to not match URLs that are not on my
host.

The following statement satisfies numbers 1 and 2, but not 3:

line = re.sub(r'(href=")(http[s]?://'+hostname+'[/]?|/)([^"]*?)(")',r'\1\2\3'+sInfo+r'\4',line)

An improvement that also partially satisfies number 3 is

line = re.sub(r'(href=")(http[s]?://'+hostname+'[/]?|/|[^h][^t][^t][^p][^:][^/][^/])([^"]*?)(")',r'\1\2\3'+sInfo+r'\4',line)

This is not complete because if the relative URL is shorter than seven
characters, then it will not match.

Any suggestions?

Thanx.

Jul 18 '05 #1
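
One possible way to cover case 3 with a regex alone is a negative lookahead instead of the seven negated character classes. The sketch below is only an illustration under stated assumptions: the function name append_to_local_hrefs and the parameter s_info (standing in for sInfo) are made up here, the hostname is passed through re.escape so its dots match literally, and a replacement function is used so that backslashes in s_info are not interpreted. It treats anything that starts with a scheme (such as "mailto:") or with "//" as not local.

```python
import re

def append_to_local_hrefs(line, hostname, s_info):
    """Append s_info to each href value that stays on `hostname` (hypothetical helper)."""
    host = re.escape(hostname)  # keep '.' in the hostname literal
    pattern = re.compile(
        r'(href=")'
        r'('
        r'https?://' + host + r'(?=[/"])/?'   # 1. absolute URL on this host
        r'|/(?!/)'                             # 2. root-relative, but not "//other.host"
        r'|(?![a-zA-Z][a-zA-Z0-9+.-]*:|/)'     # 3. relative: no scheme, no leading "/"
        r')'
        r'([^"]*)(")'
    )

    def repl(m):
        # Rebuild the attribute with s_info inserted before the closing quote.
        return m.group(1) + m.group(2) + m.group(3) + s_info + m.group(4)

    return pattern.sub(repl, line)

print(append_to_local_hrefs('<a href="a.html">', 'example.com', '?sid=1'))
```

Because alternative 3 matches the empty string when the lookahead succeeds, a relative URL of any length is accepted, including ones shorter than seven characters.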

Check out the 'urlparse' module, in the standard library, unless for
some reason you *have* to use regular expressions.

/arg

On Dec 6, 2004, at 7:46 PM, Vivek wrote:
Hi,

I am trying to construct a regular expression using the re module that
matches:
1. URLs on my hostname
2. root-relative URLs (those starting with "/"), including just "/"
3. relative URLs.

Basically I want the pattern to not match URLs that are not on my
host.

The following statement satisfies numbers 1 and 2, but not 3:

line = re.sub(r'(href=")(http[s]?://'+hostname+'[/]?|/)([^"]*?)(")',r'\1\2\3'+sInfo+r'\4',line)

An improvement that also partially satisfies number 3 is

line = re.sub(r'(href=")(http[s]?://'+hostname+'[/]?|/|[^h][^t][^t][^p][^:][^/][^/])([^"]*?)(")',r'\1\2\3'+sInfo+r'\4',line)

This is not complete because if the relative URL is shorter than seven
characters, then it will not match.

Any suggestions?

Thanx.

Jul 18 '05 #2
"Vivek" <vi***********@gmail.com> wrote:
Hi,

I am trying to construct a regular expression using the re module that
matches:
1. URLs on my hostname
2. root-relative URLs (those starting with "/"), including just "/"
3. relative URLs.


Is your goal to learn more about regexes, or to parse URLs? If the
latter, my suggestion would be to look at the urlparse module; the hard
work has already been done for you.
Jul 18 '05 #3
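
The urlparse suggestion made in both replies could look roughly like the sketch below. It assumes a current Python, where the urlparse module lives at urllib.parse; the helper names is_local and tag_local_links and the parameter s_info (standing in for sInfo) are invented for illustration. A simple regex still locates the href attributes, but the decision about whether a URL is on the host is left to urlsplit.

```python
import re
from urllib.parse import urlsplit  # in Python 2 this was: from urlparse import urlsplit

# Only used to find href="..." attributes; the URL itself is parsed by urlsplit.
HREF = re.compile(r'(href=")([^"]*)(")')

def is_local(url, hostname):
    """Return True if url is on hostname, root-relative, or relative (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme in ('http', 'https'):
        # parts.hostname is lowercased and excludes any port or userinfo.
        return parts.hostname == hostname.lower()
    # No scheme and no network location: "/", "/a/b", "a.html", "../x", ...
    return parts.scheme == '' and parts.netloc == ''

def tag_local_links(line, hostname, s_info):
    """Append s_info to every local href value in line (hypothetical helper)."""
    def repl(m):
        url = m.group(2)
        if is_local(url, hostname):
            url += s_info
        return m.group(1) + url + m.group(3)
    return HREF.sub(repl, line)

print(tag_local_links('<a href="b">x</a> <a href="http://other.org/">y</a>',
                      'example.com', '?s=1'))
```

Splitting the check out of the pattern keeps the regex trivial and makes cases such as "//other.org/x" or "mailto:" links explicit rather than a side effect of character classes.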
