More RegEx Questions

Greetings,

Let's say I have the following expression:

(<A|ABBR|ADDRESS|APPLET(\s){1,}(.*?)>(.*?)</A|ABBR|ADDRESS|APPLET)

It is meant to match any HTML element that opens with one of the tags
listed above (the list is shortened for brevity), together with its
closing tag.

Assuming I have a list of possible opening tags, how can I tell the RegEx
to match only the closing tag that corresponds to the opening tag it
matched first? That is:

<A ...>...</A> should match, as in the example above, but
<ADDRESS>...</A> should not, while <ADDRESS>...<A>...</A>...</ADDRESS>
should match.

Do I have to enter and maintain each of the hundreds of HTML tags
separately?

Thanks,
Shawn
Apr 16 '07 #1
<(A|ABBR|ADDRESS|APPLET)((\s)+(.*?))?>(.*?)</\1>

The \1 refers to the first captured group (the first pair of parentheses,
which holds the tag name), so the closing tag must repeat whichever tag
name the opening tag matched.
Note that I also made the attribute list optional.
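
To see the backreference at work, here is a minimal sketch in Python (an assumption; the thread does not name a language). The helper name first_match is made up for illustration. re.DOTALL is added so that the content may span lines; the pattern itself is unchanged.

```python
import re

# Group 1 captures the tag name; \1 demands the same name in the closing tag.
PATTERN = re.compile(
    r"<(A|ABBR|ADDRESS|APPLET)((\s)+(.*?))?>(.*?)</\1>",
    re.DOTALL,
)

def first_match(html):
    """Return the text of the first matching element, or None."""
    m = PATTERN.search(html)
    return m.group(0) if m else None

# Same tag name on both ends: matches.
print(first_match('<A href="x">link</A>'))
# Mismatched closing tag: no match.
print(first_match('<ADDRESS>text</A>'))
# An <A> nested inside <ADDRESS>: the outer element matches as a whole.
print(first_match('<ADDRESS>one <A href="#">two</A> three</ADDRESS>'))
```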

Hans Kesting
Apr 17 '07 #2
OK, I love regular expressions, so I fiddled with this (really difficult)
problem a bit. First, a more succinct and extensible version of Hans'
solution:

(?i)(?s)<(\w+)([^>]*)>(.*?)</\1>

The first two items are simply inline options for "case-insensitive" and
"dot matches newline." After that, I substituted "\w+" for the list of tag
names, since HTML tag names consist only of word characters. This will
capture ANY tag.

However, this does not address the (common) problem of nested tags. Consider
the following HTML, which you can use to test all of these:

<table id="outer">
<tr>
<td>Outer</td>
<td>Outer</td>
<td>Outer</td>
</tr>
<tr>
<td>Outer</td>
<td>
<table id="inner">
<tr>
<td>Inner</td>
<td>Inner</td>
</tr>
<tr>
<td>Inner</td>
<td>Inner</td>
</tr>
</table>
</td>
<td>Outer</td>
</tr>
<tr>
<td>Outer</td>
<td>Outer</td>
<td>Outer</td>
</tr>
</table>
<table><tr><td></td></tr></table>

The regular expression above will capture from the first <table> tag to the
first </table> tag, and those tags belong to 2 different tables, the outer
and the inner.
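
A short Python sketch of that behaviour, using a condensed version of the nested tables (the SAMPLE string and variable names are illustrative, and Python itself is an assumption). The inline options are passed as flags instead of (?i)(?s).

```python
import re

GENERIC = re.compile(r"<(\w+)([^>]*)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)

# Condensed stand-in for the nested tables in the test HTML.
SAMPLE = (
    '<table id="outer"><tr><td>'
    '<table id="inner"><tr><td>Inner</td></tr></table>'
    '</td></tr></table>'
)

m = GENERIC.search(SAMPLE)
matched = m.group(0)

# The match opens with the outer <table> but the lazy (.*?) stops at the
# first </table>, which closes the inner table.
print(matched)
print(matched.count("<table"), matched.count("</table>"))
```

The match contains two opening <table> tags but only one closing tag, so it is not a well-formed element.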

So, here's a solution that ensures that nested elements are captured:

(?i)(?s)<(\w+)[^>]*>(?(?=</\1)</\1>|.+</\1>)

The problem with this one is that it doesn't necessarily stop at the end of
the first element; because ".+" is greedy, it runs to the LAST closing tag
with the same name as the first element. In the example, it will capture
the entire text as a single match.
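
The greedy behaviour can be reproduced in Python, with one caveat: Python's re module has no lookahead conditional of the form (?(?=...)yes|no), so the sketch below substitutes a plain alternation. That is an approximation chosen for illustration, not the pattern above, although it behaves the same way on this input. The SAMPLE string is made up.

```python
import re

# Approximation: (?:</\1>|.+</\1>) stands in for the lookahead conditional.
GREEDY = re.compile(r"<(\w+)[^>]*>(?:</\1>|.+</\1>)", re.IGNORECASE | re.DOTALL)

# Two sibling tables with unrelated markup between them.
SAMPLE = (
    '<table id="a"><tr><td>1</td></tr></table>'
    '<p>between</p>'
    '<table id="b"><tr><td>2</td></tr></table>'
)

m = GREEDY.search(SAMPLE)

# The greedy .+ runs to the last </table>, so both tables and the
# paragraph between them end up in a single match.
print(m.group(0) == SAMPLE)
```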

This solution captures only inner-most nested tags:

(?i)(?s)<(\w+)[^>]*>[^>]*</\1>

The problem with this one is that it doesn't capture any tags containing
nested tags.
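
A quick Python check of the innermost-only pattern (Python and the SAMPLE string are assumptions for illustration):

```python
import re

INNERMOST = re.compile(r"<(\w+)[^>]*>[^>]*</\1>", re.IGNORECASE)

SAMPLE = "<table><tr><td>Inner</td><td>Other</td></tr></table>"

# finditer is used so that the whole matched text is collected,
# not just the captured tag name.
leaves = [m.group(0) for m in INNERMOST.finditer(SAMPLE)]
print(leaves)
```

Only the two <td> elements are reported; <tr> and <table> contain other tags, so the content part [^>]* cannot span them.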

So, depending upon the requirements, it would probably be necessary to
combine regular expressions with processing code, to do something like the
following:

1. Find innermost nested tags.
2. Remove the matches.
3. Repeat (recursively) until no matches are found.
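
The three steps above could be sketched in Python roughly as follows. This is only one possible reading: it uses a loop instead of recursion, and the function name peel_elements is hypothetical.

```python
import re

INNERMOST = re.compile(r"<(\w+)[^>]*>[^>]*</\1>", re.IGNORECASE)

def peel_elements(html):
    """Collect elements innermost-first, deleting each pass's matches."""
    found = []
    while True:
        # Step 1: find the innermost elements in what is left.
        matches = [m.group(0) for m in INNERMOST.finditer(html)]
        if not matches:
            # Step 3: stop once a pass finds nothing.
            return found
        found.extend(matches)
        # Step 2: remove the matches so their parents become innermost.
        html = INNERMOST.sub("", html)

print(peel_elements("<div><p>a</p><p>b</p></div>"))
```

Because the inner elements are removed before the outer ones are matched, each outer element is reported without its original children (here the div comes back as an empty element). Keeping the children would need placeholders or position bookkeeping, which this sketch leaves out.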

--
HTH,

Kevin Spencer
Microsoft MVP

Apr 17 '07 #3
