Regular Expressions Challenge

Patient Guy

Coding patterns for regular expressions is completely unintuitive, as far
as I can see. I have been trying to write a script that produces an array
of attribute components within an HTML element.

Consider the example of the HTML element TABLE with the following
attributes producing sufficient complexity within the element:

<table id="machines" class="noborders inred"
style="margin:2em 4em;background-color:#ddd;">

Note that the HTML was created as a string in code, and thus there are NO
newlines ('\n') in the string, as there might be if a file were parsed, so
newlines are not an issue. The only whitespace is the space character ' '
itself, required to delimit the element components.

I want to write an RE containing parenthesized substring matching that
neatly orders attribute components. The resulting array, after the
execution of the string .match() method upon the example, should look as
follows:

attrs = [ "id", "machines", "class", "noborders inred", "style",
"margin:2em 4em;background-color:#ddd;" ]

I can then march down the array (in steps of 2) setting attributes
(name=value) to the element using standard DOM interface methods, right?
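
A minimal sketch of that marching step, assuming the array really does hold flat name/value pairs starting at index 0. The helper name applyAttrs and the stand-in object fakeElem are hypothetical; in a browser, elemNode would be a real element and setAttribute() the standard DOM method.

```javascript
// Hypothetical helper: walk a flat [name, value, name, value, ...] array
// in steps of 2 and set each pair on the element.
function applyAttrs(elemNode, attrs) {
  for (var j = 0; j + 1 < attrs.length; j += 2) {
    elemNode.setAttribute(attrs[j], attrs[j + 1]);
  }
  return elemNode;
}

// Stand-in for a DOM element so the sketch also runs outside a browser.
var fakeElem = {
  set: {},
  setAttribute: function (name, value) { this.set[name] = value; }
};

applyAttrs(fakeElem, [
  "id", "machines",
  "class", "noborders inred",
  "style", "margin:2em 4em;background-color:#ddd;"
]);
```

In a page, the same call would be applyAttrs(document.createElement("table"), attrs).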

In approaching the writing of the RE, I have to take into account the
characters permitted to form the attribute name and the attribute value.

I assume a start to the RE pattern as:

<attribute name>=<attribute value>

I then try to find the right RE pattern for <attribute name>, keeping in
mind what the legal characters are for attribute names according to the
HTML standard ("recommendation"):

[A-Za-z0-9\-]+

I believe this pattern conforms to the standard for attribute values:

[,;'":!%A-Za-z0-9\s\.\-]+

That pattern tries to be more exclusive than inclusive, although I think
just about every character on the planet, including a newline, is
acceptable in an attribute value, at least the kind one might see in an
HTML document.

I also have to take into account that the <attribute value> may be
delimited by appropriate characters, the single quote and double quote
(which it should be, according to the HTML "recommendation").

So with all this information, assuming it is correct, writing the RE
should be as easy, if not painless, as falling off a chair:

attrsRE = /([A-Za-z0-9\-]+)=['"]?([,;'":!%A-Za-z0-9\s\.\-]+)/ig;

This was only the first of the tens of variations I have been writing on
the RE to make it work, which it has not, up to now. I have included
special expression controls, such as '?=' and '?:' only recently
introduced in JS1.5, but I would prefer not to include RE special
characters that will break in interpreters not doing version 1.5. The
above variation actually completely ignores the parenthesized substring
matching: it will produce an array that looks like this:

attrs = [ "id=\"machines", "class=\"noborders inred",
"style=\"margin:2em 4em;background-color:#ddd;" ]

I have come to the conclusion that perhaps the use of the global flag
(/.../g) and parenthesized substring matching does not really work, or is
mutually exclusive, because I don't recall ever seeing examples of its
use in the official JavaScript guide or reference. I suppose as a
general rule, it is best not to push the ability of the interpreter to
handle extremely complex tasks in a single JS statement, but to break
them down into simpler tasks in multiple JS statements, right?
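
For what it is worth, the global flag and parenthesized substrings are not mutually exclusive. String .match() with /g returns only the whole matches, while calling .exec() repeatedly on the same global RE returns one match at a time together with its parenthesized substrings. The sketch below assumes every value is quoted with either double or single quotes; the helper name collectAttrs and the pattern itself are only one possibility, written without '?:' or '?='.

```javascript
// Hypothetical helper: collect [name, value, name, value, ...] from the
// text of a start tag by calling exec() until it returns null.
function collectAttrs(tagText) {
  // Group 1: name; group 3: double-quoted value; group 4: single-quoted value.
  var re = /([A-Za-z0-9\-]+)=("([^"]*)"|'([^']*)')/g;
  var attrs = [];
  var m;
  while ((m = re.exec(tagText)) !== null) {
    attrs.push(m[1], m[3] !== undefined ? m[3] : m[4]);
  }
  return attrs;
}

var sample = 'table id="machines" class="noborders inred" ' +
  'style="margin:2em 4em;background-color:#ddd;"';

// exec() loop: flat name/value pairs.
var attrs = collectAttrs(sample);

// match() with /g: whole matches only, no parenthesized substrings.
var wholeOnly = sample.match(/([A-Za-z0-9\-]+)="([^"]*)"/g);
```

Here attrs comes out as the six-entry array wanted above, while wholeOnly holds three strings of the form name="value".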

Anyway, the code fragment with numbered lines below represents my code
that is supposed to deal with finding a start tag (end tags are
identified in code preceding this fragment) and handling its attributes.
I have thrown up my hands after hours and hours (over several days)
reading and reading, searching the Internet, and trying to find
variations that work.

1: elem = stringPtr.match(/<([^>]+)/);
2: tag = elem[1].match(/(\w+)/);
3: if (verifyElem(tag[1]) == true)
4: {
5: elemNode = document.createElement(tag[1]);
6: if (levelNode != null)
7: levelNode.appendChild(elemNode);
8: if (isContainer(tag[1]) == true)
9: {
10: levelNode = elemNode;
11: levelTagName[level++] = tag[1];
12: }
13: if ((attrs = elem[1].match(attrsRE)) != null)
14: for (j = 1; j < attrs.length; j += 2)
15: elemNode.setAttribute(attrs[j], attrs[j + 1]);
16: }

NOTES
Line 1 contains a completely unintuitive RE that matches one and only one
tag, and every character inside it. It was kindly provided by Martin
Honnen.
The element name itself is taken in line 2, its validity determined in a
function call in line 3 (function not shown), and the DOM element node
created and made a part of the document fragment in lines 5-7. If
the element can contain text and elements, an administrative procedure is
done in lines 8-12.
Then it's on to dealing with attributes in lines 13-15.
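
To make lines 1 and 2 a little less mysterious, here is a small sketch of what the two match() calls return. The sample strings are made up for illustration; the two patterns are the ones from the fragment.

```javascript
var stringPtr = '<table id="machines" class="noborders inred">cells</table>';

// Line 1: from the first "<", capture everything up to (not including) ">".
var elem = stringPtr.match(/<([^>]+)/);
// elem[1] is now: table id="machines" class="noborders inred"

// Line 2: the first run of word characters is the element name.
var tag = elem[1].match(/(\w+)/);
// tag[1] is now: table

// Caveat: a ">" inside a quoted value ends the capture early.
var cut = '<a title="x > y">'.match(/<([^>]+)/);
// cut[1] is now: a title="x  (the rest of the tag is lost)
```

Note that without the global flag, index 0 of the result is the whole match and the parenthesized substrings start at index 1.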

Jul 23 '05 #1

Matthew Lock

> Coding patterns for regular expressions is completely unintuitive, as far
> as I can see.

Regular expressions are unintuitive because pattern matching is
unintuitive.

I can't recommend the following book enough. After I read the first 3
chapters I have never struggled with regex since:
http://www.oreilly.com/catalog/regex/

> [A-Za-z0-9\-]+

You can represent the above as [\w-]+ (although \w also matches the
underscore, so the two are not strictly identical).

> I believe this pattern conforms to the standard for attribute values:
> [,;'":!%A-Za-z0-9\s\.\-]+
>
> That pattern tries to be more exclusive than inclusive, although I think
> just about every character on the planet, including a newline, is
> acceptable in an attribute value, at least the kind one might see in an
> HTML document.

Don't forget the hash/pound symbol "#" for hex colour values,
like:

<body bgcolor="#ffffff">
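
One way to see the effect of the missing "#" is to run the pattern from the original post against the sample tag. In this sketch (the variable names tagText and found are made up), the style value is cut off at the "#". Because the value class also admits quotes and whitespace, the first match runs on past the closing quote and swallows the name of the next attribute, so the class attribute never gets a match of its own.

```javascript
// The pattern exactly as posted.
var attrsRE = /([A-Za-z0-9\-]+)=['"]?([,;'":!%A-Za-z0-9\s\.\-]+)/ig;

var tagText = 'table id="machines" class="noborders inred" ' +
  'style="margin:2em 4em;background-color:#ddd;"';

// With /g, match() returns the whole matches only.
var found = tagText.match(attrsRE);
// found[0] is: id="machines" class
// found[1] is: style="margin:2em 4em;background-color:
```

Restricting a quoted value to "anything except the closing quote", for example "([^"]*)", sidesteps both problems.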

Parsing HTML by hand with regex is notoriously difficult to get right.
If you are doing it to analyse HTML in the wild I would stick with
letting the browser's DOM parse it.

Good luck

Jul 23 '05 #2

osfameron

Matthew Lock wrote:

> I can't recommend the following book enough. After I read the first 3
> chapters I have never struggled with regex since:
> http://www.oreilly.com/catalog/regex/

Seconded. Of course, depending on your needs, an introductory chapter
on Regexes in any Perl, JavaScript or similar book might do for you.
(Though if you're trying to parse HTML with regular expressions, you may
not fall into that category)

> Parsing HTML by hand with regex is notoriously difficult to get right.
> If you are doing it to analyse HTML in the wild I would stick with
> letting the browser's DOM parse it.

Seconded. Actually, people tend to say it's impossible. I think the
O'Reilly book goes into why. You'd be better off writing an HTML parser
(which could of course make heavy use of regexes internally). This is
the advice that is regularly brought up on Perl newsgroups. (And bear
in mind that Perl hackers tend to love regexes, and love doing twisted,
clever things with them).

The advice to give up and use another parser (the browser's DOM, as
above) is a good idea.

(Don't give up on regular expressions though - for a certain class of
problems that don't necessarily include HTML, they are indispensable).

--
osfameron

Jul 23 '05 #3

Fred Oz

Patient Guy wrote:

> Coding patterns for regular expressions is completely unintuitive, as far
> as I can see. I have been trying to write script that produces an array
> of attribute components within an HTML element.

[...]

Why are you parsing HTML? Are you reading HTML from somewhere,
then replacing the HTML with DOM create element commands?

Are you reading from the current document? If so, every element
has an "attributes" property that returns a collection (a NamedNodeMap,
not a true array) of all the attributes on an element.
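
A rough sketch of reading that collection, assuming a DOM where each entry exposes name and value. The helper name listAttrs and the stand-in object fakeTable are hypothetical; in a browser you would pass a real element, for example one returned by document.getElementById().

```javascript
// Hypothetical helper: copy an element's attributes into a flat
// [name, value, name, value, ...] array.
function listAttrs(element) {
  var out = [];
  var map = element.attributes; // a NamedNodeMap in a browser
  for (var i = 0; i < map.length; i++) {
    out.push(map[i].name, map[i].value);
  }
  return out;
}

// Stand-in with the same shape, so the sketch also runs outside a browser.
var fakeTable = {
  attributes: [
    { name: "id", value: "machines" },
    { name: "class", value: "noborders inred" }
  ]
};

var pairs = listAttrs(fakeTable);
```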

If you are dealing with HTML as ASCII text, how will you deal
with single word attributes such as "checked"?

What's the point?
--
Fred

Jul 23 '05 #4

Matthew Lock

osfameron wrote:

> Seconded. Actually, people tend to say it's impossible. I think the
> O'Reilly book goes into why.

Yeah one of the reasons it's impossible is that keeping track of HTML
comments and possibly nested structures requires a parser that keeps its
own state, which a single regular expression cannot do. Other things which
are pretty difficult with regex are javascript blocks, and attributes with
escaped quotes in them.

Jul 23 '05 #5

Patient Guy

Fred Oz <oz****@iinet.net.auau> wrote in
news:42***********************@per-qv1-newsreader-01.iinet.net.au:

> Patient Guy wrote:
>> Coding patterns for regular expressions is completely unintuitive, as
>> far as I can see. I have been trying to write script that produces
>> an array of attribute components within an HTML element.
>>
>> [...]
>
> Why are you parsing HTML? Are you reading HTML from somewhere,
> then replacing the HTML with DOM create element commands?
>
> Are you reading from the current document? If so, every element
> has an "attributes" parameter that returns an array of all the
> attributes on an element.
>
> If you are dealing with HTML as ASCII text, how will you deal
> with single word attributes such as "checked"?
>
> What's the point?

Okay, now that it has been asked, here's what I am trying to do.

I am trying to write a completely client-side script that allows one to
create/write questions in making an examination and which allows a
test-taker to take the exam. The script clearly involves accessing the
filesystem, and the user will deal with that.

The interface has a title/banner (rendered in HTML) that never changes.
The dynamic parts of the interface will be the list of files and
directories (folders) on the file system, rendered as a table, with cells
that highlight on mouseovers and are clickable to select either changes
to directories or to open files. (ActiveX in IE and XPConnect in Mozilla
can be used to invoke interfaces that call open/save file dialog boxes,
but that actually might be confusing, and thus unnecessarily scary, for
the user.)

In constructing this table in HTML, because it is dynamic, it is
generated as a string.

Now, if I wanted to get away with this cheaply, I would just take the
string of HTML text and assign it as a value to the 'innerHTML' property
of a DIV element.

Voila, I'm done. Time for hot coffee and cookies.

But as we all know, the 'innerHTML' property has not been standardized or
"recommended" by the W3C or any other authoritative body, unless you
acknowledge Microsoft as an (the?) authoritative body, one sufficiently
authoritative, if not also intimidating, that even Mozilla now
recognizes the 'innerHTML' extension.

I have seen many a careful programmer advise me not to rely on
extensions, so now I am in the nasty habit of writing functions that
conform to the standards ("recommendations") and which handle exceptions
that occur when nonstandard properties/methods are used.

Now if it is possible to create an HTML Document Fragment (set of nodes
on a tree) that reads a string of HTML text and does NOT make use of
nonstandard actions, such as assigning that HTML text string to a
property that magically renders it, then I would very much like to see
that possibility.

But from my limited information, the only way I can see clear to using
standardized coding is to:

1) make a HTML Document Fragment (or root DIV element for browsers that
break on the DOM standard)
2) hang all containing nodes and text off that
3) find the code that must read the parts of the HTML text string and
make sense of it, short of doing a character-by-character reading of the
string
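
Step 3 can be sketched with one global RE and an exec() loop that splits the string into start tags, end tags and text, without reading it character by character. This is only an illustration for simple, well-formed markup generated by the script itself: the function name tokenize and the token shape are made up, and it ignores comments, scripts and ">" inside attribute values.

```javascript
// Hypothetical tokenizer: returns an array of
// {type: "start" | "end" | "text", ...} objects.
function tokenize(html) {
  // Groups 1-3: optional "/", element name, rest of the tag.
  // Group 4: a run of text between tags.
  var re = /<(\/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>|([^<]+)/g;
  var tokens = [];
  var m;
  while ((m = re.exec(html)) !== null) {
    if (m[4] !== undefined) {
      tokens.push({ type: "text", text: m[4] });
    } else if (m[1] === "/") {
      tokens.push({ type: "end", name: m[2].toLowerCase() });
    } else {
      tokens.push({ type: "start", name: m[2].toLowerCase(), attrText: m[3] });
    }
  }
  return tokens;
}

var tokens = tokenize('<b class="x">Hi</b> there');
```

The attrText of each start token can then be handed to a separate attribute RE, which keeps each pattern small.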

Someone has done it (such as the programmers who wrote that part of the
code that parses the text for browsers). You are correct about single
word attributes, and this probably makes the construction of the regular
expression pattern enormously difficult, but does it make it impossible?
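
Single-word attributes need not make it impossible, at least for simple input. A sketch, assuming the text passed in is what follows the element name (otherwise the element name itself would be reported as an attribute): the "=value" part is made optional, and a bare name is given its own name as value, which is only one possible convention. The helper name collectAllAttrs is hypothetical.

```javascript
// Hypothetical helper: like a name="value" collector, but also accepts
// unquoted values and bare names such as "checked".
function collectAllAttrs(attrText) {
  // 1: name; 2: the "=value" part, if present; 4: double-quoted value;
  // 5: single-quoted value; 6: unquoted value.
  var re = /([A-Za-z0-9\-]+)(=("([^"]*)"|'([^']*)'|([^\s"'>]+)))?/g;
  var attrs = [];
  var m;
  while ((m = re.exec(attrText)) !== null) {
    var value;
    if (m[2] === undefined) {
      value = m[1];            // bare name: reuse the name as the value
    } else if (m[4] !== undefined) {
      value = m[4];
    } else if (m[5] !== undefined) {
      value = m[5];
    } else {
      value = m[6];
    }
    attrs.push(m[1], value);
  }
  return attrs;
}

var inputAttrs = collectAllAttrs('type="checkbox" checked value=yes');
```

For this input the result holds the pairs type/checkbox, checked/checked and value/yes.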

Jul 23 '05 #6

Matthew Lock

> I am trying to write a completely client-side script that allows one to
> create/write questions in making an examination and which allows a
> test-taker to take the exam. The script involves accessing the
> filesystem clearly, and the user will deal with that.

Be careful with a completely client-side approach to exams, as all the
answers will have to be stored in the test-taker's browser, making it
possible for the test-taker to cheat.

> 1) make a HTML Document Fragment (or root DIV element for browsers that
> break on the DOM standard)
> 2) hang all containing nodes and text off that
> 3) find the code that must read the parts of the HTML text string and
> make sense of it, short of doing a character-by-character reading of the
> string

You have lost me somewhere, what do you want to do exactly? Allow the
exam writer to specify HTML code that will be attached to the document
at some stage?

> Someone has done it (such as the programmers who wrote that part of the
> code that parses the text for browsers). You are correct about single
> word attributes, and this probably makes the construction of the regular
> expression pattern enormously difficult, but does it make it impossible?

Yes but when the browser makers did it, they probably used a recursive
descent parser rather than regular expressions. Besides, just because
*some* programmers have done it, doesn't mean that it is within the
reach of you or me.

I would say that a "parser" that can parse real world HTML would be
practically impossible with just regular expressions.

Jul 23 '05 #7

Patient Guy

"Matthew Lock" <lo******@gmail.com> wrote in news:1107915720.640884.87420
@z14g2000cwz.googlegroups.com:

>> I am trying to write a completely client-side script that allows one
>> to create/write questions in making an examination and which allows a
>> test-taker to take the exam. The script involves accessing the
>> filesystem clearly, and the user will deal with that.
>
> Be careful with a completely client-side approach to exams, as all the
> answers will have to be stored in the test-taker's browser, making it
> possible for the test-taker to cheat.

Actually, I was being confusing here by saying it is client-side only
because the browser's functionality (ability to render HTML, interpret
scripts, style the content) is basically being used as a stand-alone
application on the system.

What will be done here is that the teacher will create the exam (see next
response paragraph for details of the interface) on the computer. Then
the exam will be opened by the teacher or the examinee, and the examinee
will be set in front of the computer and take the examination.

The test-writing/creating feature is as follows: HTML coding is used to
produce standard form controls (textbox or textarea, radio, checkboxes,
buttons) that ask the teacher to indicate the type of question (multiple
choice, fill-in or short answer), the correct answer to the question (if
a choice type question), and a textbox for the question itself. The test
writer hits a button, the form input data is properly formatted (possibly
encrypted), and stored to permanent media. There are at least two levels
to the interface: one for general settings and options and a broad view
of the file being worked (list of questions), and the specific interface
level just described for composing the exam question.

On the test-taking side, the test file is accessed (possibly decrypted),
and presented in the format using a form (with controls) that accepts the
answer of the user. Only one question and its place for answer appears
on a screen, with standard 'next' and 'previous' buttons for the examinee
to go from question to question. A timer might be started either to
measure the amount of time an examinee takes to answer a question, or to
impose a limit on total exam time. Sure the timer might be defeated by
sophisticated users in various ways, but if it is really called for, I
can write features that timestamp the viewing/opening of a question and
its answering. This test module is not intended to withstand every
assault by the most sophisticated user. My original motivation for writing
this whole thing was as a tool to assist in the education of my
9-year-old daughter, who should use the computer for more than playing
"Spiderman 2."

When the examinee finishes, the test can be automatically scored if it is
one in which all the answers can be determined by the system. Besides
the examinee's score, I may also present the examiner the total time used
to take the test, as well as the time for each question, as this can be
an indicator of sticking points.

>> 1) make a HTML Document Fragment (or root DIV element for browsers
>> that break on the DOM standard)
>> 2) hang all containing nodes and text off that
>> 3) find the code that must read the parts of the HTML text string and
>> make sense of it, short of doing a character-by-character reading of
>> the string
>
> You have lost me somewhere, what do you want to do exactly? Allow the
> exam writer to specify HTML code that will be attached to the document
> at some stage?

The exam writer will interact with a standard browser form (rendered in
HTML). The script will read in the form control settings from the HTML
form (including the textbox that contains text---I allow for the exam
writer to include tags to format his text, such as with bold,
super/subscript, etc.), and store the written exam content in a file on
disk, the file format of my own creation, much like any database program
creator makes a database (or data) file having its own specialized
format. I may even encrypt the disk-stored data with a simple encryption
algorithm (the key embedded in the script itself rather than given by the
user), assuming that is really a concern.

The script on the test-taking side reads in the file and holds its
contents, then presents the questions in the HTML browser, formatting
appearance using HTML/CSS according to the examiner's options settings when
the test was created. Thus a block of the page (DIV, DocFrag) is
dynamically updated, and thus a function is necessary to build a document
fragment tree with element nodes and content.
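
Such a tree-building function might look like the sketch below. It assumes the HTML string has already been split into simple token objects of the form {type: "start" | "end" | "text", name, text} (a made-up shape), that the markup is well nested, and that doc is the browser's document object. The name buildFragment and the short VOID_ELEMS list are hypothetical; only standard DOM calls (createDocumentFragment, createElement, createTextNode, appendChild) are used.

```javascript
// Elements that never contain children, so they are not pushed on the stack.
// Illustrative list only, not complete.
var VOID_ELEMS = { br: true, hr: true, img: true, input: true };

// Hypothetical builder: turn a token array into a document fragment.
function buildFragment(tokens, doc) {
  var root = doc.createDocumentFragment();
  var stack = [root]; // innermost open container is the last entry
  for (var i = 0; i < tokens.length; i++) {
    var t = tokens[i];
    var parent = stack[stack.length - 1];
    if (t.type === "text") {
      parent.appendChild(doc.createTextNode(t.text));
    } else if (t.type === "start") {
      var node = doc.createElement(t.name);
      parent.appendChild(node);
      if (!VOID_ELEMS[t.name]) {
        stack.push(node); // later tokens become children of this node
      }
    } else if (t.type === "end" && stack.length > 1) {
      stack.pop(); // close the innermost open container
    }
  }
  return root;
}
```

In a page, the returned fragment could be attached with something like someDiv.appendChild(buildFragment(tokens, document)).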

>> Someone has done it (such as the programmers who wrote that part of
>> the code that parses the text for browsers). You are correct about
>> single word attributes, and this probably makes the construction of the
>> regular expression pattern enormously difficult, but does it make it
>> impossible?
>
> Yes but when the browser makers did it, they probably used a recursive
> descent parser rather than regular expressions. Besides, just because
> *some* programmers have done it, doesn't mean that it is within the
> reach of you or I.
>
> I would say that a "parser" that can parse real world HTML would be
> practically impossible with just regular expressions.

Based on the we-don't-want-to-even-think-of-going-there responses I am
getting about trying to build document fragments and plant them in an
existing document, I am thinking of trying another approach. The advice
here is that I should use the browser's own built-in capabilities of
presenting HTML, meaning that I have to use document.write() statements,
right? Well document.write() statements are explicitly or implicitly
preceded by document.open() and followed by document.close() statements,
correct? And they erase any of the previous contents of a window,
correct? So if I have a browser's presentation area ("client window
area"), I have to figure a way to make multiple windows out of it, with
the content of some of the windows being static---that is, content I
always want to be present on screen---and the content of one or more
other windows being dynamic. I think the use of HTML frames works in this
case, and I'll just have to put a <noframes> warning that the
mini-application will not work on frames-incapable browsers.

What do you think? Is that a reasonable solution, or is it worth it to
write a simple HTML parser that uses ONE or MORE regular expressions to
divide up the work of reading HTML?

Jul 23 '05 #8
