Making a smart regex

I am trying to write a regex that will parse BBcode into HTML using
JavaScript. Everything was going smoothly using the String replace()
method with regexes until I got to the list tag. Implementing the list
tag itself was fairly easy; handling the list items was not. For some
reason, BBcode doesn't define an end tag for a list item. I guess that
it was designed with bad old HTML 3.2 in mind, where you could make a
list by using:

<ul>
<li>item 1
<li>item2
</ul>

However, I need to make this XHTML compliant, so I need to add the
</li> tag into the mix. Unfortunately, the only way to find where to
put it is to find the next [*] (<li>) tag, an open list tag (in the case
of nested lists), or a close list tag. I was trying to get a rule that
handles the list items to work, but it only matches the first item in
any list. Here is the line of code:

bbcode = bbcode.replace(
    /\[list(\=1|\=a|)\](.*?)\[\*\](.*?)(\[\*\]|\[list\]|\[\/list\])/g,
    '[list$1]$2<li>$3</li>$4');

First, I check to make sure that the list item is inside a list. Then
I match the [*] tag to find the start of the item, and then either the
next [*], [list], or [/list] to determine the end of the item.
This successfully prevents a list item outside of a list from being
made into a <li> element, but it only matches the first list item in a
list. Is there any way to make this match all occurrences of this
pattern without looping over the statement until the pattern can no
longer be found?
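
One way to avoid re-running the whole pattern is to match each [list]...[/list] block once and convert its items inside a replacement callback. The following is only a sketch under some assumptions: the helper name convertListItems is made up for illustration, lists are not nested, and the [list] tags themselves are left in place for a separate rule, as in the line of code above.

```javascript
// Hypothetical helper (not from the original post): wraps every [*] item
// of a non-nested [list] block in <li>...</li>.
function convertListItems(bbcode) {
  // [\s\S]*? lets the list body span several lines.
  var listBlock = /(\[list(?:=1|=a)?\])([\s\S]*?)(\[\/list\])/g;
  return bbcode.replace(listBlock, function (whole, open, body, close) {
    var parts = body.split('[*]');
    var out = parts[0]; // text before the first [*] is kept unchanged
    for (var i = 1; i < parts.length; i++) {
      out += '<li>' + parts[i] + '</li>';
    }
    return open + out + close;
  });
}

var converted = convertListItems('[list][*]a[*]b[/list] [*]c');
// converted is '[list]<li>a</li><li>b</li>[/list] [*]c'
```

Because the callback only ever sees text that sits between [list] and [/list], a stray [*] outside a list stays untouched, and all items of a list are handled in one pass.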

Your help is much appreciated

Chris Lieb

Dec 14 '05 #1
I've decided not to allow nested lists, since even the "gold standard"
of sorts, phpBB, does not correctly interpret them. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. [b]Hello [b]world[/b]!!![/b]

Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/

to test for a [b] within a [b]...[/b] does prevent a [b] from being placed
in a [b]...[/b]. However, it also prevents a second bold from being used.
I need to test if two open bold tags ([b]) occur without a close bold
tag ([/b]) between them. The only problem is that I have no idea how to
accomplish this.

Chris Lieb

Dec 14 '05 #2
It is not possible to parse such _context-free_ languages with _one_
application of _one_ _Regular_ Expression, which also was your initial
problem.

<URL:http://en.wikipedia.org/wiki/Chomsky_hierarchy>

There are several ways to work around that:

1. Use a loop to parse the ... recursively.
2. Define the level up to which the ... may be nested and
   design the RE accordingly. Be aware that even a nesting
   of three levels is already a pain.
3. Use a PDA (Push-Down Automaton) implementation for parsing
   (which can make use of RE to speed up parsing).
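
The first workaround can be sketched roughly as follows. This is an illustration only: it assumes the tag in question is [b] and that it maps to <strong>, and the function name convertBold is hypothetical. The regex matches only an innermost pair (one with no [b] or [/b] inside), and the loop repeats the replacement until nothing changes, so nested pairs are resolved from the inside out.

```javascript
// Hypothetical helper: converts [b]...[/b] pairs, innermost first.
function convertBold(bbcode) {
  // (?!\[\/?b\]) refuses to step over another [b] or [/b],
  // so only pairs without nested bold tags can match.
  var innermost = /\[b\]((?:(?!\[\/?b\])[\s\S])*)\[\/b\]/g;
  var previous;
  do {
    previous = bbcode;
    bbcode = bbcode.replace(innermost, '<strong>$1</strong>');
  } while (bbcode !== previous); // stop once a pass changes nothing
  return bbcode;
}

var html = convertBold('[b]a [b]b[/b] c[/b]');
// html is '<strong>a <strong>b</strong> c</strong>'
```

The loop always terminates because every successful pass removes at least one tag pair from the string.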
HTH

PointedEars
Dec 14 '05 #3
Chris Lieb wrote:
[...]
didn't bother defining an end tag for a list item. I guess that they
designed it with bad old HTML 3.2 in mind where you could make a list
by using:

<ul>
<li>item 1
<li>item2
</ul>


Thomas answered your question, but the above is valid HTML 4 too -
closing tags are optional for LIs.

<URL:http://www.w3.org/TR/html401/struct/lists.html#edef-UL>

[...]

--
Rob
Dec 14 '05 #4
You have a point, but I hate that syntax. It makes me wonder why, back
in the days of limited processing power, they made a language where so
much had to be inferred by the interpreter. I look at XML-based
languages and they make sense: every tag that opens must be closed. It
makes it much easier when parsing the source to know that you have a
closing tag that will always be there instead of having to infer that
the element (e.g. <li>) should end simply because a sibling
started. Of course, this is a discussion more suited for alt.html or
c.i.w.a.h., so I'll get off of my pedestal now.

Chris Lieb

Dec 14 '05 #5
I'm sorry, but I really can't seem to make heads or tails of all of
that theory. My head is still spinning half an hour later. I
originally designed the BBcode parser using stacks, so nesting worked
out OK, but my string tokenizing forced me to be very restrictive with
my input; otherwise the script would go into an infinite loop,
sometimes causing Firefox's memory usage to soar. With my current
regex-based engine, the infinite loop problem is fixed and I am no
longer concerned with the user's input. The only problem, like I
mentioned before, is that it does not handle nested elements very well.

I am revisiting using a stack, but I am not sure how to implement it.
If you have any advice for making a parsing engine capable of handling
nested elements, would you mind sharing them with me? I have never
written a parser that is this complicated before. Normally, parsing to
me is simply splitting a string on a delimiter. I figured that this
would be an easy place to start since it is such a small language and
it maps directly to a well-established language, so all I should have
to do is translate.

Chris Lieb

Dec 14 '05 #6
Chris Lieb wrote:
I need to test if two open bold tags ([b]) occur without a close bold
tag ([/b]) between them. Only problem is that I have no idea of how to
accomplish this.


Oddly, this seems to work...

/\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/
Dec 14 '05 #7
Chris Lieb wrote:
Thomas 'PointedEars' Lahn wrote:
3. Use a PDA (Push-Down Automaton) implementation for parsing
(which can make use of RE to speed-up parsing).


[...]
I am revisiting using a stack, but I am not sure how to implement it.
If you have any advice for making a parsing engine capable of handling
nested elements, would you mind sharing them with me?


Not at all. Here's an example:

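var
  depth = -1,      // -1 means: no brace seen yet
  match = null,
  rx = /[{}]/g;    // `text` is the string to scan (defined elsewhere)

// Stop once the first brace has been closed again (depth back to 0).
while (depth != 0 && (match = rx.exec(text)))
{
  // no brace before
  if (depth < 0)
  {
    depth = 0;
  }

  switch (match[0])
  {
    case '{':
      depth++;
      break;

    case '}':
      depth--;
  }
}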
PointedEars
Dec 14 '05 #8
Vic Sowers wrote:
Chris Lieb wrote:
I need to test if two open bold tags ([b]) occur without a close bold
tag ([/b]) between them. Only problem is that I have no idea of how to
accomplish this.


Oddly, this seems to work...

/\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/
      ^^^^^^^^^^^^^^^
      |||          |greedy: longest match wins
      |||          end of the range
      ||each of these characters ('(', '?', ..., ')') is not matched
      |inverse range, hence
      start of the range

Example where it fails: "foobar:"
^
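
The effect of that character class can be checked with a small experiment. The sample strings below are made up for illustration: because `[^(?:\[\/b\])]` is a negated character class rather than a group, any letter "b" (or ":", "(", "/" and so on) between the outer [b] and the nested [b] stops the match.

```javascript
// The regex exactly as posted above.
var pattern = /\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/;

// "foo " contains none of the excluded characters, so the nesting is found.
var detectsFoo = pattern.test('[b]foo [b]x[/b] y[/b]'); // true

// "abc " contains a "b", which the class refuses to consume,
// so the same nesting goes undetected.
var detectsAbc = pattern.test('[b]abc [b]x[/b] y[/b]'); // false
```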

PointedEars
Dec 15 '05 #9
"Chris Lieb" <ch********@gmail.com> writes:
I've decided not to allow nested lists, since even the "gold standard"
of sorts, phpBB, does not correctly interpret them. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. [b]Hello [b]world[/b]!!![/b]

This is a somewhat difficult problem, but not impossible.

What you have to test for is two start tags without an end tag in between.
For now, I'll assume that all start tags have end tags (but I expect
images not to):

We try to match a tag [<word>...] followed by non-['s
and ['s not followed by /<word>, and finally followed by [<word>

/\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/

This uses negative lookahead to do a generic test. Negative matching
is generally a pain, but modern versions of ECMAScript have the option of
using negative lookahead.
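
As a quick check of how this behaves, the regex can be wrapped in a small test function. The function name hasSelfNesting and the sample strings are assumptions made for this sketch; the pattern itself is the one given above.

```javascript
// Hypothetical wrapper around the regex from this post.
function hasSelfNesting(bbcode) {
  // \1 refers back to the tag name captured by (\w+); the negative
  // lookahead (?!\/\1\b) rejects any "[" that starts the matching end tag.
  var nested = /\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/;
  return nested.test(bbcode);
}

var nestedBold = hasSelfNesting('[b]Hello [b]world[/b]!!![/b]'); // true
var siblingBold = hasSelfNesting('[b]one[/b] and [b]two[/b]');   // false
```

Two bold spans in a row are accepted because the scan from the first [b] has to stop at its [/b], while a second [b] before that end tag triggers a match.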
Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/


This matches [b] followed by whatever followed by [b] followed by whatever
followed by [/b]. Since the first whatever could be [/b], this also matches
two separate bold spans, such as [b]...[/b]...[b]...[/b].
The negative match above is to make sure there is no [/b] between
[b]'s.

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Dec 15 '05 #10
Lasse Reichstein Nielsen wrote:
"Chris Lieb" <ch********@gmail.com> writes:
I've decided not to allow nested lists, since even the "gold standard"
of sorts, phpBB, does not correctly interpret them. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. [b]Hello [b]world[/b]!!![/b]


This is a somewhat difficult problem, but not impossible.

What you have to test for is two start tags without an end tag in between.
For now, I'll assume that all start tags have end tags (but I expect
images not to):

We try to match a tag [<word>...] followed by non-['s
and ['s not followed by /<word>, and finally followed by [<word>

/\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/

This uses negative lookahead to do a generic test. Negative matching
is generally a pain, but modern versions of ECMAScript have the option of
using negative lookahead.


Thanks. It works like a charm. It took me a while to figure out
what was happening since I've never used lookaround before. It really
is amazing what can be accomplished with regular expressions.
Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/


This matches [b] followed by whatever followed by [b] followed by whatever
followed by [/b]. Since the first whatever could be [/b], this also matches
two separate bold spans, such as [b]...[/b]...[b]...[/b].


The negative match above is to make sure there is no [/b] between
[b]'s.


Yep. Didn't take me long to figure out that that didn't work.

Chris Lieb

Dec 15 '05 #11
Thomas 'PointedEars' Lahn wrote:
Chris Lieb wrote:
Thomas 'PointedEars' Lahn wrote:
3. Use a PDA (Push-Down Automaton) implementation for parsing
(which can make use of RE to speed-up parsing).


[...]
I am revisiting using a stack, but I am not sure how to implement it.
If you have any advice for making a parsing engine capable of handling
nested elements, would you mind sharing them with me?


Not at all. Here's an example:

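var
  depth = -1,      // -1 means: no brace seen yet
  match = null,
  rx = /[{}]/g;    // `text` is the string to scan (defined elsewhere)

// Stop once the first brace has been closed again (depth back to 0).
while (depth != 0 && (match = rx.exec(text)))
{
  // no brace before
  if (depth < 0)
  {
    depth = 0;
  }

  switch (match[0])
  {
    case '{':
      depth++;
      break;

    case '}':
      depth--;
  }
}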


That is close to what I was doing before. However, since BBcode is a
tag-based language like HTML instead of a curly-brace language like C
or JavaScript, I had to modify it a little. Whenever I encountered an
open tag, I placed it on the stack. When I encountered a close tag, I
checked to see if it matched the tag on top of the stack. If they
did not match, then there was improper nesting.
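
The stack check described here might look roughly like the sketch below. It is not the code from the actual parser: the function name isProperlyNested is hypothetical, every tag is assumed to need a closing tag, and [*] is ignored because it is not a word-like tag name.

```javascript
// Hypothetical checker: true when every [tag] is closed in the right order.
function isProperlyNested(bbcode) {
  // Group 1 is "/" for a close tag; group 2 is the tag name.
  // [^\]]* skips any attribute part such as "=1" in [list=1].
  var tagPattern = /\[(\/?)(\w+)[^\]]*\]/g;
  var stack = [];
  var match;
  while ((match = tagPattern.exec(bbcode)) !== null) {
    var name = match[2].toLowerCase();
    if (match[1] !== '/') {
      stack.push(name);            // open tag: remember it
    } else if (stack.pop() !== name) {
      return false;                // close tag does not match the top
    }
  }
  return stack.length === 0;       // leftover open tags are errors too
}

var ok = isProperlyNested('[b]bold [i]both[/i][/b]');       // true
var crossed = isProperlyNested('[b]bold [i]both[/b][/i]');  // false
```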

Now, everything seems to be working. The only feature that I would
like to add at this point is the ability to nest lists. My current
regex-based version does not handle the nesting. An example:
[list]
[*]item 1
[*]item 2
[list]
[*]item a
[*]item b
[/list]
[*]item 3
[/list]
produces
<ul><li>item 1</li><li>item 2</li>[list]<li>item a</li><li>item
b</li></ul><li>item 3</li>[/list]
instead of
<ul><li>item 1</li><li>item 2</li><ul><li>item a</li><li>item
b</li></ul><li>item 3</li></ul>

I'll get around to fixing that later. It shouldn't be that hard since
I already have a loop that does some pre-processing on the lists.
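
For what it's worth, nested lists can also be handled in a single pass by tracking, per open list, whether an <li> is currently open. The sketch below rests on several assumptions: the name convertNestedLists is hypothetical, only plain [list] is handled (not [list=1] or [list=a]), and unclosed lists are not repaired. It also places the inner <ul> inside the preceding <li> rather than directly inside the outer <ul>, since XHTML only allows <li> as a child of <ul>.

```javascript
// Hypothetical converter: handles nested [list] blocks in one pass.
function convertNestedLists(bbcode) {
  var token = /\[list\]|\[\/list\]|\[\*\]/g;
  var out = '';
  var openItem = []; // one flag per open list: is an <li> currently open?
  var last = 0;
  var m;
  while ((m = token.exec(bbcode)) !== null) {
    out += bbcode.slice(last, m.index); // copy text between tags
    last = token.lastIndex;
    var top = openItem.length - 1;
    if (m[0] === '[list]') {
      out += '<ul>';
      openItem.push(false);
    } else if (m[0] === '[*]' && top >= 0) {
      if (openItem[top]) out += '</li>'; // close the previous item first
      out += '<li>';
      openItem[top] = true;
    } else if (m[0] === '[/list]' && top >= 0) {
      if (openItem[top]) out += '</li>';
      out += '</ul>';
      openItem.pop();
    } else {
      out += m[0]; // stray [*] or [/list] outside a list stays as-is
    }
  }
  return out + bbcode.slice(last);
}

var nestedHtml = convertNestedLists(
  '[list][*]item 1[*]item 2[list][*]item a[*]item b[/list][*]item 3[/list]');
// nestedHtml is '<ul><li>item 1</li><li>item 2<ul><li>item a</li>' +
//               '<li>item b</li></ul></li><li>item 3</li></ul>'
```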

Chris Lieb

Dec 15 '05 #12
Chris Lieb wrote:
You have a point, but I hate that syntax. It makes me wonder why, back
in the days of limited processing power, they made a language where so
much had to be inferred by the interpreter.


Probably because the origins of HTML go back to GML, a language for
typesetting on IBM mainframes in the 1970's.

--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
-- Charles Williams. "Judgement at Chelmsford"
Dec 20 '05 #13
John W. Kennedy wrote:
Chris Lieb wrote:
You have a point, but I hate that syntax. It makes me wonder why, back
in the days of limited processing power, they made a language where so
much had to be inferred by the interpreter.


Probably because the origins of HTML go back to GML, a language for
typesetting on IBM mainframes in the 1970's.


I wonder what the one would have to do with the other.

AIUI, it is because HTML is an SGML application[1], and the SGML declaration
of HTML[2] enables OMITTAG. But again, you do not _need_ to omit optional
end tags; in fact, declaring HTML 4.01 Strict or ISO HTML forces you to use
end tags for all elements with a non-EMPTY content model[3] (if you want
the markup to be Valid.[4])
X-Post & F'up2 ciwah

PointedEars
___________
[1] <URL:http://www.w3.org/TR/REC-html32#sgml>
<URL:http://www.w3.org/TR/html4/intro/sgmltut.html>
<URL:https://www.cs.tcd.ie/15445/15445.html#INTRO>
[2] <URL:http://www.w3.org/TR/REC-html32#sgmldecl>
<URL:http://www.w3.org/TR/html4/sgml/sgmldecl.html>
<URL:https://www.cs.tcd.ie/15445/15445.html#DCL>
[3] <URL:http://www.w3.org/TR/html4/sgml/dtd.html>
<URL:https://www.cs.tcd.ie/15445/15445.html#INTRO>
[4] <URL:http://validator.w3.org/>
Dec 20 '05 #14
