Bytes IT Community

Regex, HTML string modification

I have a regex that is designed to help improve the readability of an HTML document.
    "(?=((?!<\/?em).)*<\/em>)
The purpose of this regex is to take " marks out of <EM>-emphasized text by closing the <EM> before each " and reopening it afterwards. Example:

Before: <P>This "is" <EM>a <STRONG>"Test"</STRONG></EM></P>
After: <P>This "is" <EM>a <STRONG></EM>"<EM>Test</EM>"<EM></STRONG></EM></P>

Note that the regex only affects " marks inside <EM> elements. My problem is that I need to modify the regex so that it skips " marks that are part of a tag, as in <EM CLASS="a1"> or <STRONG CLASS="a1">.

With the current regex those " marks will be modified. Any help in stopping that from happening would be appreciated.
Dec 8 '09 #1
1 Reply

The regular expression you gave matches a double quotation mark only if a "</em>" follows it with no "<em" or "</em" in between (that is what the negative lookahead inside the positive lookahead checks). This approach has many errors:
- It does not account for "em" tags inside other "em" tags: in '<em> "hello" <em> you </em> </em>' the quotation marks would not be matched.
- It does not account for an "<em" that is not an "em" tag: in '<em>"hello"<area>let x<emap</area></em>' the quotation marks would not be matched.
- The error you found yourself: it matches inside tags.
- A match inside a child element destroys the HTML structure, because tags must not overlap. Look at your "After:" line: the "strong" and "em" tags are now illegally nested!
- It corrupts embedded JavaScript: the quotation marks in '<em><script>x="Hello"</script></em>' would be matched!
- There are more errors that I have no time to describe now.
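
The cases listed above can be checked with a short Python sketch. It assumes the pattern is applied case-insensitively (so that it also sees <EM>) and that the replacement text is '</EM>"<EM>', as the "After:" line suggests; the helper names count_matches and apply_replacement are made up for this illustration.

```python
import re

# The pattern from the question. IGNORECASE is an assumption, so that the
# pattern also recognises the upper-case <EM> tags used in the example.
PATTERN = re.compile(r'"(?=((?!<\/?em).)*<\/em>)', re.IGNORECASE)


def count_matches(html):
    """Return how many double quotes the question's pattern matches."""
    return sum(1 for _ in PATTERN.finditer(html))


def apply_replacement(html):
    """Replace each matched quote the way the 'After:' line does."""
    return PATTERN.sub('</EM>"<EM>', html)


cases = {
    "nested em": '<em> "hello" <em> you </em> </em>',
    "'<em' that is not a tag": '<em>"hello"<area>let x<emap</area></em>',
    "quotes inside a tag": '<em class="a1">text</em>',
    "embedded script": '<em><script>x="Hello"</script></em>',
}

for label, html in cases.items():
    print(f"{label}: {count_matches(html)} quote(s) matched")

print(apply_replacement('<P>This "is" <EM>a <STRONG>"Test"</STRONG></EM></P>'))
```

The first two cases report no matches even though the quotes sit inside an "em" element, and the last two report matches that should never be touched.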

Fiddling with the erroneous expression until it becomes something new and error-free is very difficult and very lengthy, and the result is even harder for other programmers to understand. One idea to fix your "no replacement inside tags" problem is to count the number of ">" between the double quotation mark and the "</em>". If it is odd, you are still inside a tag; if it is even, you are outside and allowed to match. This is only a rough heuristic: it assumes the tags after the quotation mark come in pairs, so unpaired tags such as <br> or quotes inside the attributes of a nested element change the parity.
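
As a rough illustration of the "no replacement inside tags" idea, here is a Python sketch that uses a simpler check than the counting described above: a quotation mark is treated as being inside a tag when the next angle bracket after it is a ">". This is a swapped-in variant, not the counting method itself. The name find_em_quotes and the IGNORECASE flag are assumptions made for this example, and the check still fails when text or attribute values contain a literal "<" or ">".

```python
import re

# The question's pattern with one extra negative lookahead in front:
# (?![^<>]*>) rejects a quote when the next angle bracket is '>',
# which means the quote sits inside a tag such as <EM CLASS="a1">.
# IGNORECASE is an assumption so that upper-case <EM> is recognised.
GUARDED = re.compile(r'"(?![^<>]*>)(?=((?!<\/?em).)*<\/em>)', re.IGNORECASE)


def find_em_quotes(html):
    """Return the positions of the quotes the guarded pattern matches."""
    return [m.start() for m in GUARDED.finditer(html)]


html = '<P>This "is" <EM CLASS="a1">a <STRONG CLASS="a1">"Test"</STRONG></EM></P>'
for pos in find_em_quotes(html):
    # Show each matched quote with a little context on either side.
    print(pos, html[max(0, pos - 8):pos + 8])
```

This only addresses the attribute problem; the other errors listed above (nested "em" tags, overlapping tags in the result, embedded scripts) remain.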

So I would recommend a different and safer way to a solution:
Read the whole HTML page with a DOM parser. Then walk through the DOM object and search for "em" nodes. Look at the text inside them (not attributes or other nodes) and search for double quotation marks in this text. Then replace as you wish.
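
A minimal Python sketch of the parser-based approach follows. It uses the standard library's html.parser.HTMLParser, which is an event-based parser rather than a true DOM parser, and tracks by hand whether the current text is inside an "em" element. The class name EmQuoteRewriter, the function rewrite_em_quotes and the default replacement "&quot;" are all made up for this illustration; put in whatever replacement you need. Note that end tags are re-emitted in lower case.

```python
from html.parser import HTMLParser


class EmQuoteRewriter(HTMLParser):
    """Re-emit HTML, changing '"' only in text that is inside <em>."""

    def __init__(self, replacement):
        super().__init__(convert_charrefs=False)  # pass entities through as-is
        self.replacement = replacement
        self.em_depth = 0    # number of <em> elements currently open
        self.raw_depth = 0   # inside <script>/<style>: leave text alone
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "em":
            self.em_depth += 1
        elif tag in ("script", "style"):
            self.raw_depth += 1
        # Emit the tag exactly as written, so attribute quotes stay intact.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> open nothing.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "em" and self.em_depth:
            self.em_depth -= 1
        elif tag in ("script", "style") and self.raw_depth:
            self.raw_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text nodes reach this method, never attributes.
        if self.em_depth and not self.raw_depth:
            data = data.replace('"', self.replacement)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_em_quotes(html, replacement="&quot;"):
    """Return html with '"' replaced in text inside <em> elements only."""
    parser = EmQuoteRewriter(replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(rewrite_em_quotes('<p>This "is" <em class="a1">a <strong>"Test"</strong></em></p>'))
```

Because the replacement happens only in text nodes, tag attributes and script contents are never touched, and nested "em" elements are handled by the depth counter.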
Dec 9 '09 #2
