Making a smart regex

I am trying to write a regex that will parse BBcode into HTML using
JavaScript. Everything was going smoothly using the String replace()
method with regexes until I got to the list tag. Implementing the list
tag itself was fairly easy; handling the list items was not. For some
reason, BBcode does not define an end tag for a list item. I guess that
it was designed with bad old HTML 3.2 in mind, where you could make a
list by using:

<ul>
<li>item 1
<li>item2
</ul>

However, I need to make this XHTML compliant, so I need to add the
</li> tag into the mix. Unfortunately, the only way to find where to
put it is to find the next [*] (<li>) tag, an open list tag (in the
case of nested lists), or a close list tag. I was trying to get a rule
that handles the list items to work, but it only matches the first item
in any list. Here is the line of code:

bbcode =
bbcode.replace(/\[list(\=1|\=a|)\](.*?)\[\*\](.*?)(\[\*\]|\[list\]|\[\/list\])/g,
'[list$1]$2<li>$3</li>$4');

First, I check to make sure that the list item is inside a list. Then
I match the [*] tag to find the start of the item, and then I match
either the next [*], [list], or [/list] to determine the end of the
item. This successfully prevents a list item outside of a list from
being made into a <li> element, but it only matches the first list item
in a list. Is there any way to make this match all occurrences of this
pattern without looping over the statement until the pattern can no
longer be found?

Your help is much appreciated

Chris Lieb

Dec 14 '05 #1
I've decided not to allow nested lists, since even the "gold standard"
of sorts, phpBB, does not interpret them correctly. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. Hello world!!!

Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/

to test for a [b] within a [b] does prevent a [b] from being placed
in a [b]. However, it also prevents a second bold from being used.
I need to test whether two open bold tags ([b]) occur without a close
bold tag ([/b]) between them. The only problem is that I have no idea
how to accomplish this.

Chris Lieb

Dec 14 '05 #2
It is not possible to parse such _context-free_ languages with _one_
application of _one_ _Regular_ Expression, which also was your initial
problem.

<URL:http://en.wikipedia.org/wiki/Chomsky_hierarchy>

There are several ways to work around that:

1. Use a loop to parse the [b]...[/b] recursively.
2. Define the level up to which the [b]...[/b] may be nested and
design the RE accordingly. Be aware that even a nesting
of three levels is already a pain.
3. Use a PDA (Push-Down Automaton) implementation for parsing
(which can make use of REs to speed up parsing).
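
A minimal sketch of workaround 1, applied to the list markup from the original question. It assumes that only [list], [list=1], [list=a], [*] and [/list] need handling; the names innermostList and convertLists are made up for illustration. Each pass converts one list that contains no other list, so nested lists are resolved from the inside out:

```javascript
// One [list]...[/list] block whose body contains no further "[list",
// i.e. an innermost list. [\s\S] lets the body span several lines.
var innermostList = /\[list(=1|=a)?\]((?:(?!\[list)[\s\S])*?)\[\/list\]/;

function convertLists(bbcode) {
  var previous;
  do {
    previous = bbcode;
    bbcode = bbcode.replace(innermostList, function (whole, type, body) {
      // Simplification (assumption): both =1 and =a become a plain <ol>.
      var tag = type ? 'ol' : 'ul';
      // Everything before the first [*] belongs to no item and is dropped.
      var items = body.split('[*]').slice(1);
      var html = '';
      for (var i = 0; i < items.length; i++) {
        html += '<li>' + items[i].trim() + '</li>';
      }
      return '<' + tag + '>' + html + '</' + tag + '>';
    });
  } while (bbcode !== previous); // stop once a pass changes nothing
  return bbcode;
}

console.log(convertLists('[list][*]item 1[*]item2[/list]'));
// <ul><li>item 1</li><li>item2</li></ul>
```

A [*] outside any list is left untouched, because only text between [list] and [/list] is ever rewritten. Every pass removes one "[list", so the loop cannot run forever.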
HTH

PointedEars
Dec 14 '05 #3
Chris Lieb wrote:
[...]
didn't bother defining an end tag for a list item. I guess that they
designed it with bad old HTML 3.2 in mind where you could make a list
by using:

<ul>
<li>item 1
<li>item2
</ul>


Thomas answered your question, but the above is valid HTML 4 too -
closing tags are optional for LIs.

<URL:http://www.w3.org/TR/html401/struct/lists.html#edef-UL>

[...]

--
Rob
Dec 14 '05 #4
You have a point, but I hate that syntax. It makes me wonder why, back
in the days of limited processing power, they made a language where so
much had to be inferred by the interpreter. I look at XML-based
languages and they make sense: every tag that opens must be closed. It
makes it much easier when parsing the source to know that a closing tag
will always be there, instead of having to infer that an element
(e.g. <li>) should end simply because a sibling started. Of course,
this is a discussion more suited to alt.html or c.i.w.a.h., so I'll get
off my pedestal now.

Chris Lieb

Dec 14 '05 #5
I'm sorry, but I really can't make heads or tails of all of that
theory. My head is still spinning half an hour later. I originally
designed the BBcode parser using stacks, so nesting worked out OK, but
my string tokenizing forced me to be very restrictive with my input;
otherwise the script would go into an infinite loop, sometimes causing
Firefox's memory usage to soar. With my current regex-based engine, the
infinite loop problem is fixed and I am no longer concerned with the
user's input. The only problem, as I mentioned before, is that it does
not handle nested elements very well.

I am revisiting the stack approach, but I am not sure how to implement
it. If you have any advice for making a parsing engine capable of
handling nested elements, would you mind sharing it with me? I have
never written a parser this complicated before. Normally, parsing to me
is simply splitting a string on a delimiter. I figured that this would
be an easy place to start, since it is such a small language and it
maps directly to a well-established language, so all I should have to
do is translate.

Chris Lieb

Dec 14 '05 #6
Oddly, this seems to work...

/\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/
Dec 14 '05 #7
Chris Lieb wrote:
Thomas 'PointedEars' Lahn wrote:
3. Use a PDA (Push-Down Automaton) implementation for parsing
(which can make use of RE to speed-up parsing).


[...]
I am revisiting using a stack, but I am not sure how to implement it.
If you have any advice for making a parsing engine capable of handling
nested elements, would you mind sharing them with me?


Not at all. Here's an example:

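var
  depth = -1,    // -1 means that no brace has been seen yet
  match = null,
  rx = /[{}]/g;  // `text` is the string to be scanned

// stop as soon as the first opening brace has been balanced
while (depth != 0 && (match = rx.exec(text)))
{
  // no brace before
  if (depth < 0)
  {
    depth = 0;
  }

  switch (match[0])
  {
    case '{':
      depth++;
      break;

    case '}':
      depth--;
      break;
  }
}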
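
The same idea can be carried over to BBcode by keeping a stack of open elements instead of a single depth counter. What follows is only a sketch under stated assumptions: the tag table ELEMENTS, the function name bbcodeToHtml and the restriction to [b], [i], [list] and [*] are all made up for illustration, and the text between tags is not HTML-escaped here.

```javascript
// Hypothetical tag table: BBcode name -> XHTML element name.
var ELEMENTS = { b: 'strong', i: 'em', list: 'ul' };

function bbcodeToHtml(text) {
  var token = /\[(\/?)(b|i|list)\]|\[\*\]/g; // the only tags recognised here
  var stack = [];  // names of the elements open right now, innermost last
  var out = '';
  var last = 0;    // end of the previous token in `text`
  var m, el, li, ul;

  // Pop and close elements until `name` itself has been closed.
  // Only called when `name` is known to be on the stack.
  function closeThrough(name) {
    var top;
    do {
      top = stack.pop();
      out += '</' + top + '>';
    } while (top !== name);
  }

  while ((m = token.exec(text)) !== null) {
    out += text.slice(last, m.index); // plain text before this token
    last = token.lastIndex;

    if (m[0] === '[*]') {
      li = stack.lastIndexOf('li');
      ul = stack.lastIndexOf('ul');
      if (ul === -1) {
        out += m[0];                     // not inside a list: keep it literal
      } else {
        if (li > ul) closeThrough('li'); // end the previous item of this list
        stack.push('li');
        out += '<li>';
      }
    } else {
      el = ELEMENTS[m[2]];
      if (m[1] === '') {                 // opening tag
        stack.push(el);
        out += '<' + el + '>';
      } else if (stack.indexOf(el) !== -1) {
        closeThrough(el);                // closing tag: also ends open children
      } else {
        out += m[0];                     // stray closing tag: keep it literal
      }
    }
  }

  out += text.slice(last);
  while (stack.length) out += '</' + stack.pop() + '>'; // close leftovers
  return out;
}

console.log(bbcodeToHtml('[list][*]item 1[*]item2[/list]'));
// <ul><li>item 1</li><li>item2</li></ul>
```

Because the input is scanned exactly once and every token is at least three characters long, this cannot loop forever on bad input. The stack is what lets a [*] or [/list] supply the missing </li>.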
PointedEars
Dec 14 '05 #8
Vic Sowers wrote:
Chris Lieb wrote:
I need to test whether two open bold tags ([b]) occur without a close
bold tag ([/b]) between them. The only problem is that I have no idea
how to accomplish this.


Oddly, this seems to work...

/\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/
      ^^^^^^^^^^^^^^^
      |||           |greedy: longest match wins
      |||          end of the range
      ||each of these characters ('(', '?', ..., ')') is not matched
      |inverse range, hence
      start of the range

Example where it fails: "foobar:"
^
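
To see that the bracketed part is a character class rather than a "not [/b]" group, the pattern can be tried on two inputs. The test strings below are my own illustrations, not the example from the post above:

```javascript
var rx = /\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/;

// Nested bold is found when the text before the inner [b] avoids the
// characters ( ? : [ / b ] ) that the class excludes ...
console.log(rx.test('[b]x [b]inner[/b][/b]'));    // true

// ... but it is missed as soon as that text contains one of them,
// here the letter "b" in "bold".
console.log(rx.test('[b]bold [b]inner[/b][/b]')); // false
```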

PointedEars
Dec 15 '05 #9
"Chris Lieb" <ch********@gma il.com> writes:
I've decided not to allow nested lists, since even the "gold standard"
of sorts, phpBB, does not interpret them correctly. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.

ex. Hello world!!!
This is a somewhat difficult problem, but not impossible.

What you have to test for is two start tags without an end tag in
between. For now, I'll assume that all start tags have end tags (but I
expect images not to).

We try to match a tag [<word>...], followed by non-['s and ['s not
followed by /<word>, and finally followed by [<word>:

/\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/

This uses negative lookahead to do a generic test. Negative matching
is generally a pain, but modern versions of ECMAScript have the option
of using negative lookahead.

Using the regex

/\[b\](.*?)\[b\](.*?)\[\/b\]/


This matches [b] followed by whatever followed by [b] followed by
whatever followed by [/b]. Since the first whatever could contain
[/b], this also matches two separate, non-nested bold spans.
The negative match above is to make sure there is no [/b] between the
[b]'s.
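
As a quick check of the lookahead pattern above, here is a small sketch. The input strings and the variable name nestedInItself are my own, chosen only for illustration:

```javascript
var nestedInItself = /\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/;

// Two bold spans in a row: a [/b] sits between the two [b]'s.
console.log(nestedInItself.test('[b]one[/b] and [b]two[/b]')); // false

// A [b] opened again before the first one was closed.
console.log(nestedInItself.test('[b]outer [b]inner[/b][/b]')); // true

// A different tag nested inside is not reported.
console.log(nestedInItself.test('[b]bold [i]both[/i][/b]'));   // false
```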

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Dec 15 '05 #10
