Regex question

pedrito
#1: Aug 30 '07
I have a regex question and it never occurred to me to ask here, until I saw
Jesse Houwing's quick response to Phil for his Regex question.

I have some filenames that I'm trying to parse out of URLs.

(href=("|')http://.www\.thesite\.com/.{1,7}/)(?<filename>.[^"|'])

This generally works, but the problem is some of the image files have
.th.jpg at the end to indicate thumbnails. I want to exclude those. I just
want the ones that don't have .th. before the file extension.

I've tried using forward and reverse negative lookups, but I guess I'm not
using them correctly. Any help on how to get a non-match for a .th. would be
great.




Kevin Spencer
#2: Aug 30 '07

re: Regex question


(href=(?:"|')http://www\.thesite\.com/.+?/)(?<filename>(?:.(?!\.th\.))+?)["|']

A few words of explanation:

First, your regular expression has a "." prior to the "www". This indicates
that a single non-line-break character *must* precede the "www". I don't
know if that's intentional, but I took it out.

Second, I'm not sure why you specified any non-line-break character repeated
between 1 and 7 times, followed by a slash. That seemed excessive. Regardless
of the length, the important thing is that the length is non-zero, and it is
followed by a slash. So, I used the dot (non-line-break) with a lazy plus
("+?").

Lazy matching is an important aspect of regular expressions. It means that the
match should be repeated as *few* times as possible. The default is "greedy"
matching, which indicates that the match should be repeated as *many* times
as possible. In a regular expression where a lazy match is followed by some
other match, the lazy match stops at the first incidence of the next match.
This means that the sequence:

.+?/

indicates "any non-line-break character repeated as few times as possible,
until a slash is reached, followed by a slash."
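
To make the lazy/greedy difference concrete, here is a minimal sketch in Python (not the .NET engine discussed in this thread, though both behave the same way for this simple case). The sample string is made up purely for illustration.

```python
import re

# Hypothetical sample text, chosen only to show the difference.
text = "a/b/c/d.jpg"

lazy = re.match(r".+?/", text)    # as few characters as possible before a slash
greedy = re.match(r".+/", text)   # as many characters as possible before a slash

print(lazy.group(0))    # a/
print(greedy.group(0))  # a/b/c/
```

The lazy version stops at the first slash it can reach, while the greedy version runs to the last one.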

It wasn't so important in that section of the regular expression, but it's
critical to the latter part:

(?<filename>(?:.(?!\.th\.))+?)["|']

This is your filename group. You can use quantifiers with un-named groups.
So, what the last part says about the "filename" group is:

Match any single non-line-break character that is *not* followed by the
".th." character sequence, as few times as possible.

This is followed by ["|'] - which indicates that the next match is either a
single quote or a double quote. Therefore, the matching of the "filename"
group halts when the quote is reached. In other words, as the group is at
the end of the match, it will be followed by a single or double quote, and I
use that to indicate the end of that matching group, by using a lazy
quantifier.
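
As a rough check of the expression above, here is a sketch in Python. Two assumptions: the named group is written `(?P<filename>...)` because Python's `re` module does not accept the .NET `(?<filename>...)` spelling, and the two sample strings are invented for the test.

```python
import re

# The expression from this post, with only the named-group syntax changed for Python.
KEVIN = re.compile(
    r"""(href=(?:"|')http://www\.thesite\.com/.+?/)(?P<filename>(?:.(?!\.th\.))+?)["|']"""
)

# Hypothetical inputs: one full-size image and one thumbnail.
full = 'href="http://www.thesite.com/img/photo.jpg"'
thumb = 'href="http://www.thesite.com/img/photo.th.jpg"'

print(KEVIN.search(full).group("filename"))  # photo.jpg
print(KEVIN.search(thumb))                   # None
```

On these isolated attributes the thumbnail is rejected because the filename group cannot get past the character just before ".th." and so never reaches the closing quote.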

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net


Jesse Houwing
#3: Aug 30 '07

re: Regex question


Hello Kevin, Pedrito,

I've got a few things to add to what Kevin already explained.

First, if you're using character sets, there's no need to use an or (|) in
there. So either a single quote or a double quote can be written as ["'].
Writing ["|'] would actually allow a | in that position as well.
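
A quick way to see this, sketched in Python (character classes work the same way there as in .NET):

```python
import re

loose = re.compile(r"""["|']""")   # the pipe is a literal member of the set
strict = re.compile(r"""["']""")   # only the two quote characters

print(bool(loose.fullmatch("|")))   # True
print(bool(strict.fullmatch("|")))  # False
print(bool(strict.fullmatch("'")))  # True
```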

Second, on the .+? between the two paths:

href=["']http://www\.thesite\.com/.+?/)(?<filename>(?:.(?!\.th\.)

I'd probably use a more precise expression:

/([^/'"]+/)+

This will capture all directories until it can no longer find a trailing
slash. It does not use a reluctant (+?) quantifier and that usually makes things
faster. Also note that I excluded the quotes from the expression. Otherwise
there is a chance it might run past the closing quote and try to capture the
rest of the HTML file.
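
The runaway match is easy to reproduce. Below is a small Python sketch; the HTML snippet is hypothetical and was picked because its URL has no directory part, which forces the dot to look for a slash somewhere else.

```python
import re

# Hypothetical snippet: a link whose URL has no directory part at all.
html = '<a href="http://www.thesite.com/home">Home</a>'

lazy_dot = re.compile(r"""href=["']http://www\.thesite\.com/.+?/""")
no_quotes = re.compile(r"""href=["']http://www\.thesite\.com/([^/'"]+/)+""")

runaway = lazy_dot.search(html)
print(runaway.group(0))        # href="http://www.thesite.com/home">Home</
print(no_quotes.search(html))  # None
```

The lazy dot happily crosses the closing quote to reach the slash in the closing tag, while the class that excludes quotes simply fails to match.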

And last, on the excluding of thumbnails:

You're using (?:.(?!\.th\.)) to skip the thumbnail extension. It's probably faster
and easier to put the negative lookahead in front of the character in this case:

((?!\.th\.)[^'"])+

It will capture any character that isn't a single or double quote. It also
makes sure every character captured isn't the start of .th.. But here your
mileage may vary. There are a few variants which will all accomplish the
same thing here:

[^'"]+(?<!\.th)\.jpg
[^'"]+(?<!\.th\.jpg)
(?![^'"]+\.th\.)[^'"]+

You might want to test all possibilities to see which one is faster, or to
see which one you find the easiest to read. Also note that negative
lookbehinds aren't supported by all regex libraries (the standard .NET one
handles them just fine; JavaScript, at the time of writing, does not).
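
Here is a hedged Python sketch for comparing such variants side by side. It assumes the intended forms have the dots escaped and use a negated class inside the lookahead, and it uses `re.fullmatch` so that each pattern must cover the whole filename (unanchored, a pattern like these can still succeed on a shorter piece of the name). The filenames are invented.

```python
import re

# Hypothetical filenames to classify.
names = ["photo.jpg", "photo.th.jpg"]

variants = {
    "lookbehind before extension": r"""[^'"]+(?<!\.th)\.jpg""",
    "lookbehind at end": r"""[^'"]+(?<!\.th\.jpg)""",
    "lookahead at start": r"""(?![^'"]+\.th\.)[^'"]+""",
}

for label, pattern in variants.items():
    kept = [n for n in names if re.fullmatch(pattern, n)]
    print(label, kept)  # each variant keeps only ['photo.jpg']
```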

Also, if you aren't using this code client-side in JavaScript, you might want
to see if System.IO.Path or System.Uri might provide a better way
to parse the paths. In that case you can just capture everything that follows
href= and do the rest of the validation in code (might be faster, though I'm
not sure).
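
A sketch of that "capture loosely, validate in code" idea, written in Python with `urllib.parse` and `posixpath` standing in for the .NET classes. The function name `full_size_filenames`, the `host` parameter and the sample HTML are all hypothetical.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Deliberately simple: grab whatever sits between the quotes after href=.
HREF = re.compile(r"""href=["']([^'"]+)["']""")

def full_size_filenames(html, host="www.thesite.com"):
    """Return filenames of links on `host` that are not .th. thumbnails."""
    names = []
    for url in HREF.findall(html):
        parts = urlsplit(url)
        if parts.netloc != host:
            continue  # link points at some other site
        name = posixpath.basename(parts.path)
        if name and ".th." not in name:
            names.append(name)
    return names

html = ('<a href="http://www.thesite.com/img/photo.jpg">'
        '<a href="http://www.thesite.com/img/photo.th.jpg">'
        "<a href='http://other.example/x/skip.jpg'>")
print(full_size_filenames(html))  # ['photo.jpg']
```

The regex stays trivial, and the thumbnail rule becomes an ordinary string test that is easy to change later.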

Which would lead to the following expression:

href=["']http://www\.thesite\.com/([^/'"]+/)+(?<filename>((?!\.th\.)[^'"])+)["']

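Note that the last ["'] is not actually optional: the filename expression
stops before a quote, but it also stops just before a .th., so without the
closing quote a thumbnail would still match with a truncated filename.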
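
One caveat about the closing quote, checked with a Python port of the expression (named group written as `(?P<filename>...)`, sample strings invented): in this port at least, the quote appears to be what actually rejects thumbnails.

```python
import re

BASE = r"""href=["']http://www\.thesite\.com/([^/'"]+/)+(?P<filename>((?!\.th\.)[^'"])+)"""
with_quote = re.compile(BASE + r"""["']""")
without_quote = re.compile(BASE)

# Hypothetical inputs: one full-size image and one thumbnail.
full = 'href="http://www.thesite.com/img/photo.jpg"'
thumb = 'href="http://www.thesite.com/img/photo.th.jpg"'

print(with_quote.search(full).group("filename"))      # photo.jpg
print(with_quote.search(thumb))                       # None
print(without_quote.search(thumb).group("filename"))  # photo
```

Without the trailing quote the filename group just stops before ".th." and the thumbnail link still matches, with "photo" as its filename.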

--
Jesse Houwing
jesse.houwing at sogeti.nl



Kevin Spencer
#4: Aug 31 '07

re: Regex question


Hi Jesse,

You're of course correct about the character set for the quotes. I
overlooked that one in my conversion process. I'm so embarrassed!

--
;-),

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net


Jesse Houwing
#5: Aug 31 '07

re: Regex question


Hello Kevin,

> Hi Jesse,
>
> You're of course correct about the character set for the quotes. I
> overlooked that one in my conversion process. I'm so embarrassed!

No sweat... Happens to the best of us.

Jesse


