Making a smart regex

I am trying to write a regex that will parse BBcode into HTML using
JavaScript. Everything was going smoothly using the String replace()
method with regexes until I got to the list tag. Implementing the list
tag itself was fairly easy. Handling the list items was not. For some
reason, BBcode does not define an end tag for a list item. I guess that
it was designed with bad old HTML 3.2 in mind, where you could make a
list by using:

<ul>
<li>item 1
<li>item2
</ul>

However, I need to make this XHTML compliant, so I need to add the
</li> tag into the mix. Unfortunately, the only way to find where to
put it is to find the next [*] (<li>) tag, an open list tag (in the case
of nested lists), or a close list tag. I was trying to get a rule that
handles the list items to work, but it only matches the first item in
any list. Here is the line of code:

bbcode =
bbcode.replace(/\[list(\=1|\=a|)\](.*?)\[\*\](.*?)(\[\*\]|\[list\]|\[\/list\])/g,
'[list$1]$2<li>$3</li>$4');

First, I check to make sure that the list item is inside a list. Then
I match the [*] tag to find the start of the item, and then I match either
the next [*], [list], or [/list] to determine the end of the item.
This successfully prevents a list item outside of a list from being
made into a <li> element, but it only matches the first list item in a
list. Is there any way to make this match all occurrences of this
pattern without looping over the statement until the pattern can no
longer be found?

Your help is much appreciated

Chris Lieb

Dec 14 '05 #1
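
The pattern above requires a [list...] tag in front of every item, and that tag is consumed by the first match, which is why only the first item is converted. One way around this without looping over the statement, offered as a minimal sketch rather than a definitive fix, is to match each whole [list]...[/list] block once and build the <li> elements inside a replacement callback. The function name convertLists is hypothetical, nested lists are assumed not to occur, and the =1 and =a types are both collapsed into a plain <ol> for brevity.

```javascript
// Sketch only: convertLists is an illustrative name, not part of the thread.
// Assumes lists are not nested.
function convertLists(bbcode) {
  var listPattern = /\[list(=1|=a|)\]([\s\S]*?)\[\/list\]/g;

  return bbcode.replace(listPattern, function (whole, kind, body) {
    // Everything before the first [*] is not a list item and is dropped.
    var parts = body.split('[*]');
    var items = '';
    for (var i = 1; i < parts.length; i++) {
      items += '<li>' + parts[i] + '</li>';
    }

    // Simplification: "=1" and "=a" both become <ol>.
    var tag = kind === '' ? 'ul' : 'ol';
    return '<' + tag + '>' + items + '</' + tag + '>';
  });
}
```

Because only text inside a matched [list] block is split on [*], a stray [*] outside of any list is left untouched.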
I've decided not to allow nested lists, since even the "gold standard"
of sorts, phpBB, does not interpret them correctly. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. [b]Hello [b]world[/b]!!![/b]

Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/

to test for a [b] within a [b] does prevent a [b] from being placed
in a [b]. However, it also prevents a second bold from being used.
I need to test whether two open bold tags ([b]) occur without a close
bold tag ([/b]) between them. The only problem is that I have no idea
how to accomplish this.

Chris Lieb

Dec 14 '05 #2
It is not possible to parse such _context-free_ languages with _one_
application of _one_ _Regular_ Expression, which also was your initial
problem.

<URL:http://en.wikipedia.org/wiki/Chomsky_hierarchy>

There are several ways to work around that:

1. Use a loop to parse the [b]...[/b] recursively.
2. Define the level up to which the [b]...[/b] may be nested and
design the RE accordingly. Be aware that even a nesting
of three levels is already a pain.
3. Use a PDA (Push-Down Automaton) implementation for parsing
(which can make use of RE to speed up parsing).

HTH

PointedEars
Dec 14 '05 #3
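
The first workaround in the list above (a loop) can be sketched as follows. This is an illustration under stated assumptions, not code from the thread: the function name convertNestedLists is hypothetical, only the plain [list] form is handled, and each pass converts just the innermost lists, i.e. those whose body contains no further [list] tag.

```javascript
// Sketch only: convertNestedLists is an illustrative name.
function convertNestedLists(text) {
  // An innermost list: [list], a body with no further "[list]", then [/list].
  var innermost = /\[list\]((?:(?!\[list\])[\s\S])*?)\[\/list\]/g;
  var previous;

  do {
    previous = text;
    text = text.replace(innermost, function (whole, body) {
      var parts = body.split('[*]');
      var items = '';
      for (var i = 1; i < parts.length; i++) {
        items += '<li>' + parts[i] + '</li>';
      }
      return '<ul>' + items + '</ul>';
    });
  } while (text !== previous); // stop when a pass changes nothing

  return text;
}
```

Every successful replacement removes one [list] tag from the string, so the loop terminates. Inner lists are already HTML by the time the enclosing list is split on [*], so their items are not split a second time.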
Chris Lieb wrote:
[...]
didn't bother defining an end tag for a list item. I guess that they
designed it with bad old HTML 3.2 in mind where you could make a list
by using:

<ul>
<li>item 1
<li>item2
</ul>


Thomas answered your question, but the above is valid HTML 4 too -
closing tags are optional for LIs.

<URL:http://www.w3.org/TR/html401/struct/lists.html#edef-UL>

[...]

--
Rob
Dec 14 '05 #4
You have a point, but I hate that syntax. It makes me wonder why, back
in the days of limited processing power, they made a language where so
much had to be inferred by the interpreter. I look at XML-based
languages and they make sense: every tag that opens must be closed. It
makes it much easier when parsing the source to know that you have a
closing tag that will always be there instead of having to infer that
the element (e.g. <li>) should end simply because a sibling
started. Of course, this is a discussion more suited for alt.html or
c.i.w.a.h., so I'll get off of my pedestal now.

Chris Lieb

Dec 14 '05 #5
I'm sorry, but I really can't seem to make heads or tails of all of
that theory. My head is still spinning half an hour later. I
originally designed the BBcode parser using stacks, so nesting worked
out OK, but my string tokenizing forced me to be very restrictive with
my input; otherwise the script would go into an infinite loop,
sometimes causing Firefox's memory usage to soar. With my current
regex-based engine, the infinite loop problem is fixed and I am no
longer concerned with the user's input. The only problem, as I
mentioned before, is that it does not handle nested elements very well.

I am revisiting using a stack, but I am not sure how to implement it.
If you have any advice for making a parsing engine capable of handling
nested elements, would you mind sharing it with me? I have never
written a parser that is this complicated before. Normally, parsing to
me is simply splitting a string on a delimiter. I figured that this
would be an easy place to start since it is such a small language and
it maps directly to a well-established language, so all I should have
to do is translate.

Chris Lieb

Dec 14 '05 #6
Oddly, this seems to work...

/\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/
Dec 14 '05 #7
Chris Lieb wrote:
Thomas 'PointedEars' Lahn wrote:
3. Use a PDA (Push-Down Automaton) implementation for parsing
(which can make use of RE to speed-up parsing).


[...]
I am revisiting using a stack, but I am not sure how to implement it.
If you have any advice for making a parsing engine capable of handling
nested elements, would you mind sharing them with me?


Not at all. Here's an example:

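// `text` is the string to scan.
var
  depth = -1,   // -1 means "no brace seen yet"
  match = null,
  rx = /[{}]/g;

// Stop as soon as the first opening brace has been balanced again;
// `match` then refers to its closing brace.
while (depth != 0 && (match = rx.exec(text)))
{
  // no brace before
  if (depth < 0)
  {
    depth = 0;
  }

  switch (match[0])
  {
    case '{':
      depth++;
      break;

    case '}':
      depth--;
  }
}

PointedEars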
Dec 14 '05 #8
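
The depth counter above generalises to a stack of open tag names, which is enough to detect a tag nested inside itself. The following is a minimal sketch, not a full BBcode parser: the function name findSelfNesting and the tag set (b, i, u) are assumptions made for illustration, and closing tags that do not match the top of the stack are simply ignored.

```javascript
// Sketch only: findSelfNesting and the supported tag set are illustrative.
function findSelfNesting(text) {
  var tagPattern = /\[(\/?)(b|i|u)\]/g; // the regex acts as the tokenizer
  var stack = [];                        // names of currently open tags
  var match;

  while ((match = tagPattern.exec(text)) !== null) {
    var closing = match[1] === '/';
    var name = match[2];

    if (!closing) {
      if (stack.indexOf(name) !== -1) {
        // The same tag is already open somewhere up the stack.
        return { tag: name, index: match.index };
      }
      stack.push(name);
    } else if (stack.length > 0 && stack[stack.length - 1] === name) {
      stack.pop();
    }
    // Unmatched or misordered closing tags are ignored in this sketch.
  }

  return null; // no tag is nested inside itself
}
```

Since exec() always advances through the string, the loop cannot run forever, whatever the input is.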
Vic Sowers wrote:
Chris Lieb wrote:
I need to test if two open bold tags ([b]) occur without a close bold
tag ([/b]) between them. Only problem is that I have no idea of how to
accomplish this.


Oddly, this seems to work...

/\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/
      ^^^^^^^^^^^^^^^
      |||           |greedy: longest match wins
      |||          end of the range
      ||each of these characters ('(', '?', ..., ')') is not matched
      |inverse range, hence
      start of the range

Example where it fails: "foobar:"
^

PointedEars
Dec 15 '05 #9
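
The point of the annotation above can be checked directly. In this small sketch (the variable and function names are hypothetical), the bracket expression is a negated character class, so it excludes the single characters (, ?, :, [, /, b, ] and ) rather than the sequence [/b]. Any outer text that contains the letter b, for example, is enough to make the test miss a nested tag.

```javascript
// Sketch only: names are illustrative. The pattern is the one quoted above.
var classPattern = /\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/;

function detectsNesting(text) {
  return classPattern.test(text);
}

// Outer text "Hello " contains none of the excluded characters.
var caught = detectsNesting('[b]Hello [b]world[/b]!!![/b]');

// Outer text "bold " starts with an excluded character, 'b'.
var missed = detectsNesting('[b]bold [b]nested[/b] text[/b]');
```

Here caught is true and missed is false, although both inputs contain a [b] nested inside a [b].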
"Chris Lieb" <ch********@gma il.com> writes:
I've decided not to allow nested lists, since even the "gold standard"
of sorts, phpBB, does not interpret them correctly. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. [b]Hello [b]world[/b]!!![/b]

This is a somewhat difficult problem, but not impossible.

What you have to test for is two start tags without an end tag in between.
For now, I'll assume that all start tags have end tags (but I expect
images not to):

We try to match a tag [<word>...] followed by non-['s
and ['s not followed by /<word>, and finally followed by [<word>

/\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/

This uses negative lookahead to do a generic test. Negative matching
is generally a pain, but modern versions of ECMAScript have the option
of using negative lookahead.
Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/

This matches [b] followed by whatever followed by [b] followed by
whatever followed by [/b]. Since the first whatever could be [/b],
this also matches a second, separate bold.
The negative match above is to make sure there is no [/b] between
the [b]'s.

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Dec 15 '05 #10
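
The two patterns discussed in this post can be compared on the cases raised earlier in the thread. This is a small sketch with hypothetical variable and function names; the regexes themselves are copied from the posts.

```javascript
// Sketch only: names are illustrative.
// The simple pattern: [b] ... [b] ... [/b], with nothing excluded in between.
var simplePattern = /\[b\](.*?)\[b\](.*?)\[\/b\]/;

// The lookahead pattern: a start tag, then any text in which no "[" begins
// the matching end tag, then the same start tag again.
var selfNestedPattern = /\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/;

function isSelfNested(text) {
  return selfNestedPattern.test(text);
}

var nested = '[b]Hello [b]world[/b]!!![/b]';
var siblings = '[b]one[/b] and [b]two[/b]';
var mixed = '[b]x[i]y[/i]z[/b]';
```

With these inputs, simplePattern.test(siblings) is true (the false positive described above), while isSelfNested reports true only for nested and false for both siblings and mixed.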
