Regex, HTML string modification

I have a regex that is designed to help improve the readability of an HTML document.
  "(?=((?!<\/?em).)*<\/em>)
The purpose of this regex is to take " marks out of the emphasis in <EM>-affected sentences, by closing the <EM> before each mark and reopening it afterwards. Example:

Before: <P>This "is" <EM>a <STRONG>"Test"</STRONG></EM></P>
After: <P>This "is" <EM>a <STRONG></EM>"<EM>Test</EM>"<EM></STRONG></EM></P>

Note that the regex only affects " marks inside <EM> elements. My problem is that I need to modify the regex to account for " marks inside tags, such as <EM CLASS="a1"> or <STRONG CLASS="a1">.

With the current regex those " marks will be modified. Any help in stopping that from happening would be appreciated.
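
The behaviour described in this post can be reproduced with a short Python sketch. The post does not name a language, a replacement string, or regex flags, so those are assumptions here: the replacement `</EM>"<EM>` is inferred from the "After:" line, and case-insensitive matching is assumed because the pattern spells "em" in lower case while the sample markup uses <EM>. The function name is made up for illustration.

```python
import re

# The pattern from the post. IGNORECASE is an assumption, needed because the
# sample markup uses upper-case <EM> while the pattern spells "em" in lower case.
QUOTE_IN_EM = re.compile(r'"(?=((?!<\/?em).)*<\/em>)', re.IGNORECASE)


def split_em_at_quotes(html: str) -> str:
    """Close <EM> before each matched quote and reopen it afterwards.

    The replacement string is inferred from the "After:" line in the post.
    """
    return QUOTE_IN_EM.sub('</EM>"<EM>', html)


before = '<P>This "is" <EM>a <STRONG>"Test"</STRONG></EM></P>'
print(split_em_at_quotes(before))

# The reported problem: quotes that delimit attribute values are matched too,
# which breaks the opening tag.
print(split_em_at_quotes('<EM CLASS="a1">x</EM>'))
```

The first call prints the "After:" line from the post; the second shows the CLASS attribute being torn apart.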
Dec 8 '09 #1
chaarmann
The regular expression you gave uses a negative lookahead to check whether there is an "<em" or "</em" between the double-quotation mark and the next "</em>". If there is, it won't match; otherwise it matches. This approach has many errors:
- It doesn't account for "em" tags nested inside other "em" tags: in '<em> "hello" <em> you </em> </em>', for example, the quotation marks would not be matched.
- It doesn't account for "<em" sequences that are not an "em" tag: in '<em>"hello"<area>let x<emap</area></em>', for example, the quotation marks would not be matched.
- The error you found yourself: it matches inside tags.
- A match inside a sub-tag destroys the HTML structure, because the result contains overlapping tags, which is not allowed. Look at your "After:" line: the "strong" and "em" tags are illegally nested now!
- It destroys embedded JavaScript: in '<em><script>x="Hello"</script></em>' the quotation marks would be matched!
- Some more errors which I have no time to describe now.
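
Three of the cases listed above can be checked with a small Python sketch. It assumes the original pattern is applied case-insensitively (the thread does not state the flags), and the helper name is made up for illustration.

```python
import re

# The original pattern from the question; IGNORECASE is an assumption.
PATTERN = re.compile(r'"(?=((?!<\/?em).)*<\/em>)', re.IGNORECASE)


def count_matches(html: str) -> int:
    """Count how many quotation marks the pattern would touch."""
    return sum(1 for _ in PATTERN.finditer(html))


samples = {
    "nested em": '<em> "hello" <em> you </em> </em>',
    "'<em' that is not an em tag": '<em>"hello"<area>let x<emap</area></em>',
    "embedded script": '<em><script>x="Hello"</script></em>',
}

for label, html in samples.items():
    print(f"{label}: {count_matches(html)} quote(s) matched")
```

The first two samples report no matches even though the quotation marks sit inside an "em" element, while the script sample reports two matches inside the JavaScript code.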

Fiddling with the erroneous expression to transform it into something new and error-free is very difficult and very lengthy, and the result would be even harder for other programmers to understand. One idea to fix your "no replacement inside tags" problem is to count the angle brackets ("<" and ">") between the double-quotation mark and the "</em>". If the count is odd, you know you are still inside a tag; if it is even, you are outside and allowed to match.
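
A rough Python sketch of the "am I inside a tag?" test follows. It is a swapped-in simplification rather than the counting scheme itself: instead of counting, it looks at the first angle bracket after the quotation mark. If that bracket is ">", the mark sits inside a tag; if it is "<", the mark is in text. It assumes there are no stray "<" or ">" characters in text or attribute values, that matching is case-insensitive, and that the replacement is the `</EM>"<EM>` seen in the question. The function name is hypothetical.

```python
import re

# Extra lookahead (?=[^<>]*<): the next angle bracket after the quote must be
# an opening "<", i.e. the quote is not inside a tag. The second lookahead is
# the original pattern from the question. IGNORECASE is an assumption.
TEXT_QUOTE_IN_EM = re.compile(
    r'"(?=[^<>]*<)(?=((?!<\/?em).)*<\/em>)', re.IGNORECASE
)


def split_em_at_text_quotes(html: str) -> str:
    """Rewrite only the quotes that are in text, not in attribute values."""
    return TEXT_QUOTE_IN_EM.sub('</EM>"<EM>', html)


print(split_em_at_text_quotes('<EM CLASS="a1">say "hi"</EM>'))
print(split_em_at_text_quotes('<EM>a <STRONG CLASS="a1">"Test"</STRONG></EM>'))
```

This only addresses the attribute problem. The other errors in the list (nested "em" tags, overlapping tags in the result, embedded scripts) remain.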

So I would recommend a different and more robust way to a solution:
Read the whole HTML page with a DOM parser. Then walk through the DOM object and search for "<em>" nodes. Look for text (not attributes or other nodes) inside them and search for double-quotation marks inside this text. Then replace as you wish.
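
A minimal sketch of the parser-based approach in Python, under some assumptions. It uses the standard library's event-based `html.parser.HTMLParser` rather than a full DOM parser, so it tracks "am I inside an em element?" with a counter instead of walking a tree. The class name, function name, and the default replacement `&quot;` are placeholders made up for illustration; substitute whatever replacement you actually need.

```python
from html.parser import HTMLParser

RAW_TEXT_TAGS = {"script", "style"}  # content that must never be rewritten


class EmQuoteRewriter(HTMLParser):
    """Re-emit the document, replacing '"' only in text inside <em>."""

    def __init__(self, replacement: str):
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.replacement = replacement
        self.out = []
        self.em_depth = 0   # > 0 while inside one or more <em> elements
        self.raw_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "em":
            self.em_depth += 1
        elif tag in RAW_TEXT_TAGS:
            self.raw_depth += 1
        self.out.append(self.get_starttag_text())  # original tag, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "em":
            self.em_depth = max(0, self.em_depth - 1)
        elif tag in RAW_TEXT_TAGS:
            self.raw_depth = max(0, self.raw_depth - 1)
        self.out.append(f"</{tag}>")  # note: the tag name comes back lower-cased

    def handle_data(self, data):
        if self.em_depth and not self.raw_depth:
            data = data.replace('"', self.replacement)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_quotes_in_em(html: str, replacement: str = "&quot;") -> str:
    parser = EmQuoteRewriter(replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(replace_quotes_in_em('<p>This "is" <em class="a1">a <strong>"Test"</strong></em></p>'))
print(replace_quotes_in_em('<em><script>x="Hello"</script></em>'))
```

Because only text events are rewritten, attribute values and script content are left alone, and nested "em" elements are handled by the depth counter. End tags are re-emitted in lower case, which a real implementation may want to preserve differently.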
Dec 9 '09 #2
