Regular Expressions Difficulty


I am writing a function to have its argument, HTML-containing string,
return a DOM 1 Document Fragment, and so it seems the use of regular
expressions (REs) is a natural.

My problem is that the browsers (IE and Mozilla) that I am using to write
and debug have a different idea about parsing strings using REs. Here is
the starting example:

stringPtr = "<div id=\"errblock\" style=\"color:red;\">" +
"<p>This is a simple doc frag";
elem = stringPtr.match(/<(.+)>/);

This is only the LATEST in SEVERAL different revisions of the RE for
'elem'. What the debugger (Venkman, but IE does the same) keeps returning
in elem[1], the variable of interest, is a string that includes the DIV and
the P element. I made a table of all the REs I have tried and their
results:

RE: /\<(.+)\>/
elem[0]: "<div id=\"something\" style=\"color:red;\"><p>"
elem[1]: "div id=\"something\" style=\"color:red;\"><p"

RE: /(\<.+\>)/
elem[0]: "<div id=\"something\" style=\"color:red;\"><p>"
elem[1]: "<div id=\"something\" style=\"color:red;\"><p>"

RE: /<(\w+)>/
elem[0]: "<p>"
elem[1]: "p"

RE: /<(\S+\s*\S*)>/
elem[0]: "<p>"
elem[1]: "p"

All of these seem wrong to me. So long as what occurs between the '<' and
'>' matches the criteria, the parser should return JUST the first element
(the DIV) and look for elements that may or may not contain
attributes, depending upon the RE within the parenthesized subexpression of
the RE.

The problem is that it is matching on the P element, ignoring the '><'
that occurs in between. It should not matter whether whitespace precedes
the P element, since it is not required and browsers can make sense of it.

My intention is to have an RE that recognizes elements with and without
attributes, and also to deal with container text as well.


//============== contents of dom1.js ============

/* Note, at least half of the lines in the code are UNTESTED and
almost certainly RIDDLED WITH ERROR and EXCEPTION, and
so the code is likely to change, and especially to make use of
optimizations to get around slow performance */

var nonEtagoElements = [ "input", "br", "img", "hr", "col", "frame",
"meta", "link", "param", "base", "basefont" ];

var RequiredEtagoElements = {
a: [ "a" , "area", "applet", "address", "abbr", "acronym" ],
b: [ "b", "body", "blockquote", "big", "bdo" ],
c: [ "center", "caption", "cite", "code" ],
d: [ "div", "dfn", "dl", "del", "dir" ],
e: [ "em" ],
f: [ "form", "font", "fieldset" ],
i: [ "i", "iframe", "ins", "inindex" ],
k: [ "kbd" ],
l: [ "label", "legend" ],
m: [ "map", "menu" ],
n: [ "noscript", "noframes" ],
o: [ "ol", "optgroup", "object" ],
p: [ "pre" ],
q: [ "q" ],
s: [ "span", "strong", "sub", "sup", "script", "select", "style",
"small", "samp", "strike", "s" ],
t: [ "table" , "title", "tt" ],
u: [ "ul", "u" ],
v: [ "var" ]
};

var OptionalEtagoElements = [ "p", "tr", "td" , "th", "li",
"colgroup" , "option", "dd", "dt", "thead", "tfoot" ];

var ImpliedElements = [ "tbody", "head", "html" ];

function verifyElem(elemStr, option)
{
var i, j, x;
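x = elemStr.charAt(0).toLowerCase();
// Not every initial letter has an entry (e.g. "h"), so guard the lookup.
if (RequiredEtagoElements[x] && (j = RequiredEtagoElements[x].length) > 0)
for (i = 0; i < j; i++)
if (elemStr.toLowerCase() == RequiredEtagoElements[x][i])
return (true);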
for (i = 0; i < OptionalEtagoElements.length; i++)
if (elemStr == OptionalEtagoElements[i])
return (true);
for (i = 0; i < ImpliedElements.length; i++)
if (elemStr == ImpliedElements[i])
return (true);
if (option == 1)
return (false);
for (i = 0; i < nonEtagoElements.length; i++)
if (elemStr == nonEtagoElements[i])
return (true);
return (false);
}

function isContainer(elemStr)
{
return (verifyElem(elemStr, 1));
}

function makeHTMLDocFrag(HTMLstring)
{
var i, j, etago, elem, elemNode, attrs, txt, tag;
var levelTagName = new Array(25);
var level = 0;
if (typeof(HTMLstring) == "undefined")
return (null);
var docFrag = document.createDocumentFragment();
var levelNode = docFrag;
var stringPtr = HTMLstring;
debugger;
while ((i = stringPtr.search(/<*\w+/)) >= 0)
{
if (stringPtr.charAt(i) == '<')
{
if (stringPtr.charAt(i + 1) == '/') // end tag
{
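etago = stringPtr.match(/<\/(\w+)/); // \w+ keeps the '>' out of the name
// the most recently opened tag is stored at level - 1 (see level++ below)
if (etago[1] == levelTagName[level - 1] &&
levelNode.parentNode != null)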
{
levelNode = levelNode.parentNode;
level--;
}
}
else if (stringPtr.search(/<[hH][1-6]\s+/) == 0)
{ // special case of the header
elem = stringPtr.match(/<([hH][1-6])\s+/);
elemNode = document.createElement(elem[1]);
if (levelNode != null)
levelNode.appendChild(elemNode);
levelTagName[level++] = elem[1];
}
else // element that is not header
{
elem = stringPtr.match(/(\<.+\>)/);
tag = elem[1].match(/(\w+)/)[1]; // match() returns an array; keep the name
if (verifyElem(tag) == true)
{
elemNode = document.createElement(tag);
if (levelNode != null)
levelNode.appendChild(elemNode);
if (isContainer(tag) == true)
{
levelNode = elemNode;
levelTagName[level++] = tag;
}
// With the g flag, match() returns whole "name=value" strings, not groups.
if ((attrs = elem[1].match(/\w+=("[^"]*"|\w+)/g)) != null)
for (j = 0; j < attrs.length; j++)
{
i = attrs[j].indexOf("="); // i is reassigned below, so it is free here
elemNode.setAttribute(attrs[j].substring(0, i),
attrs[j].substring(i + 1).replace(/"/g, ""));
}
}
}
i = stringPtr.search(/>/);
}
else
{
txt = stringPtr.match(/(.*)</);
levelNode.appendChild(document.createTextNode(txt[1]));
i = stringPtr.search(/</);
}
stringPtr = stringPtr.substr(i, stringPtr.length - 1);
}
return (docFrag);
}
Jul 23 '05 #1


Befuddled wrote:
I am writing a function to have its argument, HTML-containing string,
return a DOM 1 Document Fragment, and so it seems the use of regular
expressions (REs) is a natural.


HTML browsers have HTML parsing built in, so why do you need regular
expressions to parse HTML? Why don't you simply create an element, set
its innerHTML to the HTML snippet and then read out the child nodes as
needed:
var div = document.createElement('div');
div.innerHTML = htmlString;
Now build a document fragment if needed and simply move the child nodes
of the div to the fragment if you want.
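
A minimal sketch of that suggestion, assuming a browser that supports both innerHTML and createDocumentFragment. The name htmlToFragment and the optional doc parameter are not from this thread; they are illustrative only, and doc exists so the function can be exercised outside a browser.

```javascript
// Sketch: let the browser's own HTML parser do the work, then move the
// resulting nodes into a DocumentFragment.
// htmlToFragment and the doc parameter are hypothetical names.
function htmlToFragment(htmlString, doc) {
  doc = doc || document; // default to the page's document in a browser
  var container = doc.createElement('div');
  container.innerHTML = htmlString;

  var fragment = doc.createDocumentFragment();
  // appendChild MOVES a node, so container.firstChild advances on each
  // pass and the loop ends when the div has been emptied.
  while (container.firstChild) {
    fragment.appendChild(container.firstChild);
  }
  return fragment;
}
```

In a page this would be called as htmlToFragment('<p>This is a simple doc frag</p>'), and the returned fragment can be passed to appendChild on any element.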

--

Martin Honnen
http://JavaScript.FAQTs.com/
Jul 23 '05 #2
Martin Honnen <ma*******@yahoo.de> wrote in news:41baedaf$0$16044
$9*******@newsread4.arcor-online.net:


Befuddled wrote:
I am writing a function to have its argument, HTML-containing string,
return a DOM 1 Document Fragment, and so it seems the use of regular
expressions (REs) is a natural.
HTML browsers have HTML parsing built in, so why do you need regular
expressions to parse HTML? Why don't you simply create an element, set
its innerHTML to the HTML snippet and then read out the child nodes as
needed:
var div = document.createElement('div');
div.innerHTML = htmlString;


I was avoiding the property 'innerHTML' because I did not know if it was
standardized in DOM at any level. I am ABSOLUTELY avoiding the use of
extensions beyond the standard (or more modestly put forth as a
"recommendation"), no matter how many browsers have the functionality to
interpret it, even if it is 99.999% of all browsers used on the planet.

If 'innerHTML' is now standardized, that saves a lot of
work/coding/function writing. Searches of the specifications for DOM
(and JavaScript for that matter) that I have in my possession for the
property 'innerHTML' produce ZERO results. Please provide a URL to the
DOM and/or JavaScript specification that I am missing so that I can make
use of that information. Thanks.
Now build a document fragment if needed and simply move the child nodes
of the div to the fragment if you want.

Jul 23 '05 #3


Befuddled wrote:
Martin Honnen <ma*******@yahoo.de> wrote

HTML browsers have HTML parsing built in, so why do you need regular
expressions to parse HTML? Why don't you simply create an element, set
its innerHTML to the HTML snippet and then read out the child nodes as
needed:
var div = document.createElement('div');
div.innerHTML = htmlString;

I was avoiding the property 'innerHTML' because I did not know if it was
standardized in DOM at any level. I am ABSOLUTELY avoiding the use of
extensions beyond the standard (or more modestly put forth as a
"recommendation")


So you would prefer createDocumentFragment for instance to innerHTML
because createDocumentFragment is in the W3C recommendation but
innerHTML is not? For instance, IE 5.5 doesn't support
createDocumentFragment so your code will not work there. innerHTML
certainly has far greater support than createDocumentFragment.
But anyway, as for your regular expression problem: matching is greedy by
default, meaning as much as possible is matched, so your expression
correctly consumes characters up to the last > it can find.
If you want non-greedy matching, you can use ? after the quantifier, e.g.
.+?
but that is only supported in ECMAScript edition 3 compatible
implementations; in older browsers such a construct is unlikely to
give the desired result.
There are workarounds such as
/<([^>]+)>/
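
To make the difference concrete, here is a short sketch that runs the three patterns against the string from the original post. The variable names are illustrative only.

```javascript
var html = '<div id="errblock" style="color:red;">' +
           '<p>This is a simple doc frag';

// Greedy: .+ runs on to the LAST '>' it can find, so both tags are captured.
var greedy = html.match(/<(.+)>/);

// Non-greedy (ECMAScript edition 3 and later): .+? stops at the FIRST '>'.
var lazy = html.match(/<(.+?)>/);

// Negated character class: same result without needing non-greedy support.
var negated = html.match(/<([^>]+)>/);

// greedy[1]  is 'div id="errblock" style="color:red;"><p'
// lazy[1]    is 'div id="errblock" style="color:red;"'
// negated[1] is 'div id="errblock" style="color:red;"'
```

One caveat: [^>]+ stops at the first '>' even when it sits inside a quoted attribute value, so a tag such as <a title="a>b"> would be cut short. For the markup in this thread that does not arise.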
--

Martin Honnen
http://JavaScript.FAQTs.com/
Jul 23 '05 #4
Martin Honnen <ma*******@yahoo.de> wrote in
news:41***********************@newsread4.arcor-online.net:


Befuddled wrote:
Martin Honnen <ma*******@yahoo.de> wrote
HTML browsers have HTML parsing built in, so why do you need regular
expressions to parse HTML? Why don't you simply create an element,
set its innerHTML to the HTML snippet and then read out the child
nodes as needed:
var div = document.createElement('div');
div.innerHTML = htmlString;

I was avoiding the property 'innerHTML' because I did not know if it
was standardized in DOM at any level. I am ABSOLUTELY avoiding the
use of extensions beyond the standard (or more modestly put forth as
a "recommendation")


So you would prefer createDocumentFragment for instance to innerHTML
because createDocumentFragment is in the W3C recommendation but
innerHTML is not? For instance, IE 5.5 doesn't support
createDocumentFragment so your code will not work there. innerHTML
certainly has far greater support than createDocumentFragment.


You're right. I was hasty in my explanation of adhering to the standard.
I should have said that while my first duty is to the standard and to get
its code in place, after writing its code, I attempt to include
browser-dependent code, where possible, to accommodate browsers that don't happen
to understand the standard. Sorry for being misleading, sounding
impractical, and standing too adamantly.
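
A sketch of that "standard first, fallback second" approach, using feature detection. The name createContainer and the choice of a detached div as the fallback are assumptions for illustration, not something prescribed in this thread.

```javascript
// Sketch: prefer the W3C DOM method, and fall back to a plain element
// where createDocumentFragment is missing (reported above for IE 5.5).
// createContainer is a hypothetical helper name.
function createContainer(doc) {
  if (doc.createDocumentFragment) {
    return doc.createDocumentFragment(); // standard path
  }
  // Fallback (assumption): a detached <div> stands in for the fragment.
  return doc.createElement('div');
}
```

The caller has to remember that the fallback container is a real element: appending it to the page inserts the div itself, whereas appending a fragment inserts only its children.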
But anyway, as for your regular expression problem: matching is greedy
by default, meaning as much as possible is matched, so your
expression correctly consumes characters up to the last > it can find.
I suppose there was a good reason why the original developers of regular
expressions wanted them to consume as much text as possible in matching
criteria, rather than grabbing what was minimal (working from left to
right, rather than right to left). I would love to know their reasoning.
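
One practical illustration, offered as an example rather than as an account of the original designers' reasoning: with greedy matching a quantified token takes the whole run it applies to, which is usually what is wanted. Either way the match still starts at the leftmost possible position and proceeds left to right; greediness only decides how far it extends. The sample string is made up.

```javascript
var text = 'width=1024px';

// Greedy: \d+ takes the whole run of digits.
var greedyDigits = text.match(/\d+/)[0];   // '1024'

// Non-greedy: \d+? is already satisfied by a single digit.
var lazyDigits = text.match(/\d+?/)[0];    // '1'
```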
If you want non-greedy matching, you can use ? after the
quantifier, e.g.
.+?
but that is only supported in ECMAScript edition 3 compatible
implementations; in older browsers such a construct is unlikely to
give the desired result.
There are workarounds such as
/<([^>]+)>/


Your solution appears to be working nicely. Thanks for all your good
information.
--

http://hume.realisticpolitics.com/
Jul 23 '05 #5
