Making a smart regex

Chris Lieb

I am trying to write a regex that will parse BBcode into HTML using
JavaScript. Everything was going smoothly using the String replace()
method with regexes until I got to the list tag. Implementing the list
tag itself was fairly easy. What was not easy was handling the list
items. For some reason, BBcode does not define an end tag for a list
item. I guess that they designed it with bad old HTML 3.2 in mind,
where you could make a list by using:

<ul>
<li>item 1
<li>item2
</ul>

However, I need to make this XHTML compliant, so I need to add the
</li> tag into the mix. Unfortunately, the only way to find where to
put it is to find the next [*] (<li>) tag, an open list tag (in the
case of nested lists), or a close list tag. I was trying to get a rule
that handles the list items to work, but it only matches the first item
in any list. Here is the line of code:

bbcode =
bbcode.replace(/\[list(\=1|\=a|)\](.*?)\[\*\](.*?)(\[\*\]|\[list\]|\[\/list\])/g,
'[list$1]$2<li>$3</li>$4');

First, I check to make sure that the list item is inside a list. Then,
I match the [*] tag to find the start of the item, then I match either
the next [*], [list], or [/list] to determine the end of the item.
This successfully prevents a list item outside of a list from being
made into a <li> element, but only matches the first list item in a
list. Is there any way to make this match all occurrences of this
pattern without looping over the statement until the pattern can no
longer be found?

Your help is much appreciated

Chris Lieb

Dec 14 '05 #1
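
A note on the question above: a likely reason only the first item is converted is that the pattern itself consumes the opening [list], so with the g flag the search resumes after the first match and the later items no longer have a [list] in front of them. A minimal sketch of one way around that, assuming lists are not nested: match each whole list once, then split its body on [*] inside a replacement function. The name convertLists and the choice of output for [list=1] and [list=a] are illustrative assumptions, not something defined in this thread.

```javascript
// Sketch only: handles flat (non-nested) lists.
// convertLists is a hypothetical helper name.
function convertLists(bbcode) {
  // [\s\S] is used instead of . so that a list body may span lines.
  var listRx = /\[list(=1|=a)?\]([\s\S]*?)\[\/list\]/g;

  return bbcode.replace(listRx, function (whole, type, body) {
    // Text before the first [*] belongs to no item, so index 0 is skipped.
    var parts = body.split('[*]');
    var items = '';
    for (var i = 1; i < parts.length; i++) {
      // Trim surrounding whitespace from each item.
      items += '<li>' + parts[i].replace(/^\s+|\s+$/g, '') + '</li>';
    }

    // Assumed mapping: [list] gives <ul>, [list=1] and [list=a] give <ol>.
    if (type === '=1') {
      return '<ol>' + items + '</ol>';
    }
    if (type === '=a') {
      return '<ol style="list-style-type: lower-alpha">' + items + '</ol>';
    }
    return '<ul>' + items + '</ul>';
  });
}

var html = convertLists('[list][*]item 1[*]item2[/list] and a stray [*]');
// html is '<ul><li>item 1</li><li>item2</li></ul> and a stray [*]'
```

Because the [*] markers are only handled inside a matched list body, a [*] outside any list is left untouched, which is the behaviour the original pattern was aiming for.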

Chris Lieb

I've decided not to allow nested lists since even the "gold standard"
of sorts, phpBB, does not interpret them correctly. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. Hello world!!!

Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/

to test for a [b] within a [b] does prevent a [b] from being placed
in a [b]. However, it also prevents a second bold from being used.
I need to test if two open bold tags ([b]) occur without a close bold
tag ([/b]) between them. Only problem is that I have no idea how to
accomplish this.

Chris Lieb

Dec 14 '05 #2

Thomas 'PointedEars' Lahn

Chris Lieb wrote:

I've decided not to allow nested lists since even the "gold standard"
of sorts, phpBB, does not interpret them correctly. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. Hello world!!!

Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/

to test for a [b] within a [b] does prevent a [b] from being placed
in a [b]. However, it also prevents a second bold from being used.
I need to test if two open bold tags ([b]) occur without a close bold
tag ([/b]) between them. Only problem is that I have no idea how to
accomplish this.

It is not possible to parse such _context-free_ languages with _one_
application of _one_ _Regular_ Expression, which also was your initial
problem.

<URL:http://en.wikipedia.org/wiki/Chomsky_hierarchy>

There are several ways to work around that:

1. Use a loop to parse the ... recursively.
2. Define the level up to which the ... may be nested and
design the RE accordingly. Be aware that even a nesting
of three levels is already a pain.
3. Use a PDA (Push-Down Automaton) implementation for parsing
(which can make use of RE to speed-up parsing).
HTH
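
Workaround 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: it handles a single tag ([b]), the helper name convertBold is made up for the example, and the sample input is invented rather than taken from the thread. The idea is to convert only the innermost pairs on each pass and repeat until the string stops changing.

```javascript
// Sketch of the loop approach: convert the innermost [b]...[/b] pairs first
// and repeat until nothing changes. convertBold is a hypothetical name.
function convertBold(bbcode) {
  // (?:(?!\[\/?b\])[\s\S])* matches text that contains neither [b] nor [/b],
  // so only pairs with no bold tag inside them can match on a given pass.
  var innermost = /\[b\]((?:(?!\[\/?b\])[\s\S])*)\[\/b\]/g;
  var previous;
  do {
    previous = bbcode;
    bbcode = bbcode.replace(innermost, '<b>$1</b>');
  } while (bbcode !== previous);
  return bbcode;
}

var out = convertBold('[b]Hello [b]world[/b]!!![/b]');
// out is '<b>Hello <b>world</b>!!!</b>'
```

Each pass is still a single regular expression; the nesting is handled by the loop around it, which is why unbalanced tags are simply left in place.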

PointedEars

Dec 14 '05 #3

RobG

Chris Lieb wrote:
[...]

didn't bother defining an end tag for a list item. I guess that they
designed it with bad old HTML 3.2 in mind where you could make a list
by using:

<ul>
<li>item 1
<li>item2
</ul>

Thomas answered your question, but the above is valid HTML 4 too -
closing tags are optional for LIs.

<URL:http://www.w3.org/TR/html401/struct/lists.html#edef-UL>

[...]

--
Rob

Dec 14 '05 #4

Chris Lieb

RobG wrote:

Chris Lieb wrote:
[...]
didn't bother defining an end tag for a list item. I guess that they
designed it with bad old HTML 3.2 in mind where you could make a list
by using:

<ul>
<li>item 1
<li>item2
</ul>

Thomas answered your question, but the above is valid HTML 4 too -
closing tags are optional for LIs.

<URL:http://www.w3.org/TR/html401/struct/lists.html#edef-UL>

[...]

--
Rob

You have a point, but I hate that syntax. It makes me wonder why, back
in the days of limited processing power, they made a language where so
much had to be inferred by the interpreter. I look at XML-based
languages and they make sense; every tag that opens must be closed. It
makes it much easier when parsing the source to know that you have a
closing tag that will always be there instead of having to infer that
the element (ex. <li>) should end simply because it had a sibling
start. Of course, this is a discussion more suited for alt.html or
c.i.w.a.h., so I'll get off of my pedestal now.

Chris Lieb

Dec 14 '05 #5

Chris Lieb

Thomas 'PointedEars' Lahn wrote:

Chris Lieb wrote:
I've decided not to allow nested lists since even the "gold standard"
of sorts, phpBB, does not interpret them correctly. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. Hello world!!!

Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/

to test for a [b] within a [b] does prevent a [b] from being placed
in a [b]. However, it also prevents a second bold from being used.
I need to test if two open bold tags ([b]) occur without a close bold
tag ([/b]) between them. Only problem is that I have no idea how to
accomplish this.

It is not possible to parse such _context-free_ languages with _one_
application of _one_ _Regular_ Expression, which also was your initial
problem.

<URL:http://en.wikipedia.org/wiki/Chomsky_hierarchy>

There are several ways to work around that:

1. Use a loop to parse the ... recursively.
2. Define the level up to which the ... may be nested and
design the RE accordingly. Be aware that even a nesting
of three levels is already a pain.
3. Use a PDA (Push-Down Automaton) implementation for parsing
(which can make use of RE to speed-up parsing).
HTH

PointedEars

I'm sorry, but I really can't seem to make heads or tails of all of
that theory. My head is still spinning half an hour later. I
originally designed the BBcode parser using stacks, so nesting worked
out OK, but my string tokenizing forced me to be very restrictive with
my input, otherwise the script would go into an infinite loop,
sometimes causing Firefox's memory usage to soar. With my current
regex-based engine, the infinite loop problem is fixed and I am no
longer concerned with the user's input. The only problem, like I
mentioned before, is that it does not handle nested elements very well.

I am revisiting using a stack, but I am not sure how to implement it.
If you have any advice for making a parsing engine capable of handling
nested elements, would you mind sharing them with me? I have never
written a parser that is this complicated before. Normally, parsing to
me is simply splitting a string on a delimiter. I figured that this
would be an easy place to start since it is such a small language and
it directly maps to a well-established language, so all I should have
to do is translate.

Chris Lieb

Dec 14 '05 #6

Vic Sowers

Chris Lieb wrote:

I've decided not to allow nested lists since even the "gold standard"
of sorts, phpBB, does not interpret them correctly. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. Hello world!!!

Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/

to test for a [b] within a [b] does prevent a [b] from being placed
in a [b]. However, it also prevents a second bold from being used.
I need to test if two open bold tags ([b]) occur without a close bold
tag ([/b]) between them. Only problem is that I have no idea how to
accomplish this.

Oddly, this seems to work...

/\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/

Dec 14 '05 #7

Thomas 'PointedEars' Lahn

Chris Lieb wrote:

Thomas 'PointedEars' Lahn wrote:
3. Use a PDA (Push-Down Automaton) implementation for parsing
(which can make use of RE to speed-up parsing).

[...]
I am revisiting using a stack, but I am not sure how to implement it.
If you have any advice for making a parsing engine capable of handling
nested elements, would you mind sharing them with me?

Not at all. Here's an example:

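var
  depth = -1,      // -1 means "no brace seen yet"
  match = null,
  rx = /[{}]/g;

// `text` is the string to scan; stop as soon as the braces balance out
while (depth != 0 && (match = rx.exec(text)))
{
  // no brace before: start counting from zero
  if (depth < 0)
  {
    depth = 0;
  }

  switch (match[0])
  {
    case '{':
      depth++;
      break;

    case '}':
      depth--;
  }
}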
PointedEars

Dec 14 '05 #8

Thomas 'PointedEars' Lahn

Vic Sowers wrote:

Chris Lieb wrote:
I need to test if two open bold tags ([b]) occur without a close bold
tag ([/b]) between them. Only problem is that I have no idea how to
accomplish this.

Oddly, this seems to work...

/\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/
      ^^^^^^^^^^^^^^^
      |||          |greedy: longest match wins
      |||          end of the range
      ||each of these characters ('(', '?', ..., ')') is not matched
      |inverse range, hence
      start of the range

Example where it fails: "foobar:"
^
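
The point about the bracket expression can be checked directly. A small sketch, with sample strings invented for the illustration: because [^(?:\[\/b\])] is a negated character class and not a group, it rejects any single one of the characters ( ? : [ / b ] ), including a plain "b" in ordinary text.

```javascript
// The pattern posted above, unchanged.
var vic = /\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/;

// Nested bold where the text before the inner [b] happens to avoid the
// excluded characters: the nesting is detected.
var caught = vic.test('[b]foo[b]bar[/b]baz[/b]');   // true

// Same nesting, but the leading text contains a "b", which the bracket
// expression excludes: the nesting goes undetected.
var missed = vic.test('[b]abc[b]x[/b]y[/b]');       // false
```

So the pattern only appears to work for inputs whose text avoids those characters.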

PointedEars

Dec 15 '05 #9

Lasse Reichstein Nielsen

"Chris Lieb" <ch********@gmail.com> writes:

I've decided not to allow nested lists since even the "gold standard"
of sorts, phpBB, does not interpret them correctly. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. Hello world!!!

This is a somewhat difficult problem, but not impossible.

What you have to test for is two start tags without an end tag in
between. For now, I'll assume that all start tags have end tags (but I
expect images not to):

We try to match a tag [<word>...] followed by non-['s
and ['s not followed by /<word>, and finally followed by [<word>

/\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/

This uses negative lookahead to do a generic test. Negative matching
is generally a pain, but modern versions of ECMAScript have the option
of using negative lookahead.
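
A quick way to see what the pattern does is to run it against a few inputs. The sample strings and the variable names below are made up for this sketch; the regex is the one given above, unchanged.

```javascript
// \1 refers back to the tag name captured by (\w+); the lookahead
// (?!\/\1\b) refuses to step over a closing tag of that same name.
var selfNested = /\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/;

// A [b] inside a [b]: reported as nested.
var nestedBold = selfNested.test('[b]Hello [b]world[/b]!!![/b]');   // true

// Two separate bold spans: not reported.
var siblingBold = selfNested.test('[b]one[/b] and [b]two[/b]');     // false

// A different tag nested inside [b]: not reported either, since only
// a tag inside itself is being looked for.
var mixedTags = selfNested.test('[b]x[i]y[/i]z[/b]');               // false

// The captured group tells you which tag was nested in itself.
var which = selfNested.exec('[b]Hello [b]world[/b]!!![/b]')[1];     // 'b'
```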
Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/

That matches [b] followed by whatever, followed by [b], followed by
whatever, followed by [/b]. Since the first whatever could contain [/b],
this also matches two separate bold spans that are not nested.
The negative match above is to make sure there is no [/b] between
[b]'s.

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'

Dec 15 '05 #10

Chris Lieb

Lasse Reichstein Nielsen wrote:

"Chris Lieb" <ch********@gmail.com> writes:
I've decided not to allow nested lists since even the "gold standard"
of sorts, phpBB, does not interpret them correctly. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. Hello world!!!

This is a somewhat difficult problem, but not impossible.

What you have to test for is two start tags without an end tag in
between. For now, I'll assume that all start tags have end tags (but I
expect images not to):

We try to match a tag [<word>...] followed by non-['s
and ['s not followed by /<word>, and finally followed by [<word>

/\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/

This uses negative lookahead to do a generic test. Negative matching
is generally a pain, but modern versions of ECMAScript have the option
of using negative lookahead.

Thanks. It works like a charm. Took me a while to try to figure out
what was happening since I've never used lookaround before. It really
is amazing what can be accomplished with regular expressions.

Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/

That matches [b] followed by whatever, followed by [b], followed by
whatever, followed by [/b]. Since the first whatever could contain [/b],
this also matches two separate bold spans that are not nested.

The negative match above is to make sure there is not [/b] between
[b]'s.

Yep. Didn't take me long to figure out that that didn't work.

Chris Lieb

Dec 15 '05 #11

Chris Lieb

Thomas 'PointedEars' Lahn wrote:

Chris Lieb wrote:
Thomas 'PointedEars' Lahn wrote:
3. Use a PDA (Push-Down Automaton) implementation for parsing
(which can make use of RE to speed-up parsing).

[...]
I am revisiting using a stack, but I am not sure how to implement it.
If you have any advice for making a parsing engine capable of handling
nested elements, would you mind sharing them with me?

Not at all. Here's an example:

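var
  depth = -1,      // -1 means "no brace seen yet"
  match = null,
  rx = /[{}]/g;

// `text` is the string to scan; stop as soon as the braces balance out
while (depth != 0 && (match = rx.exec(text)))
{
  // no brace before: start counting from zero
  if (depth < 0)
  {
    depth = 0;
  }

  switch (match[0])
  {
    case '{':
      depth++;
      break;

    case '}':
      depth--;
  }
}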

That is close to what I was doing before. However, since BBcode is a
tag-based language like HTML instead of a curly brace language like C
or JavaScript, I had to modify it a little. Whenever I encountered an
open tag, I placed it on the stack. When I encountered a close tag, I
checked to see if it matched the tag that was on the stack. If they
did not match, then there was improper nesting.
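
The stack check described above might look roughly like this. It is a sketch rather than the actual code: the function name isProperlyNested is invented, and it assumes every tag made of word characters has a matching close tag, so [*] is ignored because it has no word characters.

```javascript
// Sketch of the stack-based nesting check described above.
// isProperlyNested is a hypothetical name; the real parser is not shown
// in this thread.
function isProperlyNested(bbcode) {
  // Group 1 is "/" for a close tag, group 2 is the tag name.
  // Attributes such as the "=1" in [list=1] are skipped by [^\]]*.
  var tagRx = /\[(\/?)(\w+)[^\]]*\]/g;
  var stack = [];
  var m;

  while ((m = tagRx.exec(bbcode)) !== null) {
    var name = m[2].toLowerCase();
    if (m[1] === '') {
      // Open tag: remember it.
      stack.push(name);
    } else if (stack.pop() !== name) {
      // Close tag that does not match the most recent open tag
      // (or there was no open tag at all).
      return false;
    }
  }

  // Anything left on the stack was never closed.
  return stack.length === 0;
}

var ok = isProperlyNested('[b]x[i]y[/i][/b]');        // true
var crossed = isProperlyNested('[b]x[i]y[/b][/i]');   // false
var unclosed = isProperlyNested('[b]x');              // false
```

The regex here only finds the tags; the stack is what keeps track of the nesting.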

Now, everything seems to be working. The only feature that I would
like to add at this point is the ability to nest lists. My current
regex-based version does not handle the nesting. An example:

item 1
item 2
- item a
- item b
item 3

produces
<ul><li>item 1</li><li>item 2</li>

<li>item a</li><li>item
b</li></ul><li>item 3</li>

instead of
<ul><li>item 1</li><li>item 2</li><ul><li>item a</li><li>item
b</li></ul><li>item 3</li></ul>

I'll get around to fixing that later. It shouldn't be that hard since
I already have a loop that does some pre-processing on the lists.

Chris Lieb

Dec 15 '05 #12

John W. Kennedy

Chris Lieb wrote:

You have a point, but I hate that syntax. It makes me wonder why, back
in the days of limited processing power, they made a language where so
much had to be inferred by the interpreter.

Probably because the origins of HTML go back to GML, a language for
typesetting on IBM mainframes in the 1970's.

--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
-- Charles Williams. "Judgement at Chelmsford"

Dec 20 '05 #13

Thomas 'PointedEars' Lahn

John W. Kennedy wrote:

Chris Lieb wrote:
You have a point, but I hate that syntax. It makes me wonder why, back
in the days of limited processing power, they made a language where so
much had to be inferred by the interpreter.

Probably because the origins of HTML go back to GML, a language for
typesetting on IBM mainframes in the 1970's.

I wonder what the one would have to do with the other.

AIUI, it is because HTML is an SGML application[1], and the SGML declaration
of HTML[2] enables OMITTAG. But again, you do not _need_ to omit optional
end tags; in fact, declaring HTML 4.01 Strict or ISO HTML forces you to use
end tags for all elements with a non-EMPTY content model[3] (if you want
the markup to be Valid.[4])
X-Post & F'up2 ciwah

PointedEars
___________
[1] <URL:http://www.w3.org/TR/REC-html32#sgml>
<URL:http://www.w3.org/TR/html4/intro/sgmltut.html>
<URL:https://www.cs.tcd.ie/15445/15445.html#INTRO>
[2] <URL:http://www.w3.org/TR/REC-html32#sgmldecl>
<URL:http://www.w3.org/TR/html4/sgml/sgmldecl.html>
<URL:https://www.cs.tcd.ie/15445/15445.html#DCL>
[3] <URL:http://www.w3.org/TR/html4/sgml/dtd.html>
<URL:https://www.cs.tcd.ie/15445/15445.html#INTRO>
[4] <URL:http://validator.w3.org/>

Dec 20 '05 #14
