I am trying to write a regex that will parse BBcode into HTML using
JavaScript. Everything was going smoothly using the String replace()
method with regexes until I got to the list tag. Implementing the list
tag itself was fairly easy. What was not easy was handling the list
items. For some reason, BBcode does not define an end tag for a list
item. I guess that it was designed with bad old HTML 3.2 in mind,
where you could make a list by using:
<ul>
<li>item 1
<li>item2
</ul>
However, I need to make this XHTML compliant, so I need to add the
</li> tag into the mix. Unfortunately, the only way to find where to
put it is to find the next [*] (<li>) tag, an open list tag (in the
case of nested lists), or a close list tag. I was trying to get a rule
that handles the list items to work, but it only matches the first
item in any list. Here is the line of code:
bbcode =
bbcode.replace(/\[list(\=1|\=a|)\](.*?)\[\*\](.*?)(\[\*\]|\[list\]|\[\/list\])/g,
'[list$1]$2<li>$3</li>$4');
First, I check to make sure that the list item is inside a list. Then
I match the [*] tag to find the start of the item, and then I match
the next [*], [list], or [/list] to determine the end of the item.
This successfully prevents a list item outside of a list from being
made into a <li> element, but it only matches the first list item in a
list. Is there any way to make this match all occurrences of this
pattern without looping over the statement until the pattern can no
longer be found?
Your help is much appreciated.
Chris Lieb
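The rule above stops after the first item because every match has to start at a [list] tag, and the g flag resumes searching after the previous match, where no further [list] precedes the remaining [*] tags. One way around that without an explicit retry loop is to match each whole list once and split its body inside a replacement callback. The following is only a sketch under stated assumptions: the helper name convertListItems is hypothetical, lists are assumed not to be nested, and the [list] tags themselves are left for a separate rule, as in the original post.

```javascript
// Hypothetical helper (not from the thread): wraps every [*] item of a
// non-nested [list]...[/list] block in <li>...</li>.
function convertListItems(bbcode) {
  // Match one whole list at a time; [\s\S] lets the body span line breaks.
  var listRx = /(\[list(?:=1|=a)?\])([\s\S]*?)(\[\/list\])/g;
  return bbcode.replace(listRx, function (whole, open, body, close) {
    var parts = body.split('[*]');
    var html = parts[0]; // text before the first [*] is kept unchanged
    for (var i = 1; i < parts.length; i++) {
      html += '<li>' + parts[i] + '</li>';
    }
    // The list tags are passed through for a later rule to translate.
    return open + html + close;
  });
}

console.log(convertListItems('[list][*]item 1[*]item 2[/list] [*]stray'));
```

A [*] outside of any list is never seen by the callback, so it stays untouched, which keeps the "only inside a list" check from the original rule.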
I've decided not to allow nested lists, since even the "gold standard"
of sorts, phpBB, does not interpret them correctly. To enforce this, I
am trying to make a regex that will detect a tag that is nested inside
itself, since none of these are parsed correctly by regexes. Basically,
I need the regex to detect when a tag is placed within itself.
ex. Hello world!!!
Using the regex
/\[b\](.*?)\[b\](.*?)\[\/b\]/
to test for a [b] within a [b] does prevent a [b] from being placed
in a [b]. However, it also prevents a second bold from being used.
I need to test if two open bold tags ([b]) occur without a close bold
tag ([/b]) between them. Only problem is that I have no idea of how to
accomplish this.
Chris Lieb
It is not possible to parse such _context-free_ languages with _one_
application of _one_ _Regular_ Expression, which also was your initial
problem.
<URL:http://en.wikipedia.org/wiki/Chomsky_hierarchy>
There are several ways to work around that:
1. Use a loop to parse the ... recursively.
2. Define the level up to which the ... may be nested and
design the RE accordingly. Be aware that even a nesting
of three levels is already a pain.
3. Use a PDA (Push-Down Automaton) implementation for parsing
(which can make use of RE to speed up parsing).
HTH
PointedEars
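Workaround 1 can be sketched for the bold tag as follows. This is an illustration, not code from the thread: the helper name convertBold is hypothetical, and the idea is to convert only innermost [b]...[/b] pairs (those containing no further [b] or [/b]) and to repeat until a pass changes nothing.

```javascript
// Hypothetical helper: converts nested [b]...[/b] from the inside out.
function convertBold(bbcode) {
  // Matches a [b]...[/b] pair whose content holds no other [b] or [/b].
  var innermost = /\[b\]((?:(?!\[\/?b\])[\s\S])*)\[\/b\]/g;
  var previous;
  do {
    previous = bbcode;
    bbcode = bbcode.replace(innermost, '<b>$1</b>');
  } while (bbcode !== previous); // stop once a pass changes nothing
  return bbcode;
}

console.log(convertBold('[b]Hello [b]world[/b]!!![/b]'));
```

Each pass peels off one nesting level, so the number of passes grows with the nesting depth rather than with the number of tags. Unbalanced tags are simply left in place.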
Chris Lieb wrote:
[...] didn't bother defining an end tag for a list item. I guess that they designed it with bad old HTML 3.2 in mind where you could make a list by using:
<ul> <li>item 1 <li>item2 </ul>
Thomas answered your question, but the above is valid HTML 4 too -
closing tags are optional for LIs.
<URL:http://www.w3.org/TR/html401/struct/lists.html#edef-UL>
[...]
--
Rob
You have a point, but I hate that syntax. It makes me wonder why, back
in the days of limited processing power, they made a language where so
much had to be inferred by the interpreter. I look at XML-based
languages and they make sense: every tag that opens must be closed. It
makes it much easier when parsing the source to know that you have a
closing tag that will always be there instead of having to infer that
the element (e.g. <li>) should end simply because a sibling
started. Of course, this is a discussion more suited for alt.html or
c.i.w.a.h., so I'll get off of my pedestal now.
Chris Lieb
Thomas 'PointedEars' Lahn wrote:
It is not possible to parse such _context-free_ languages with _one_ application of _one_ _Regular_ Expression, which also was your initial problem.
<URL:http://en.wikipedia.org/wiki/Chomsky_hierarchy>
There are several ways to work around that:
1. Use a loop to parse the ... recursively.
2. Define the level up to which the ... may be nested and design the RE accordingly. Be aware that even a nesting of three levels is already a pain.
3. Use a PDA (Push-Down Automaton) implementation for parsing (which can make use of RE to speed up parsing).
I'm sorry, but I really can't seem to make heads or tails of all of
that theory. My head is still spinning half an hour later. I
originally designed the BBcode parser using stacks, so nesting worked
out OK, but my string tokenizing forced me to be very restrictive with
my input; otherwise the script would go into an infinite loop,
sometimes causing Firefox's memory usage to soar. With my current
regex-based engine, the infinite loop problem is fixed and I am no
longer concerned with the user's input. The only problem, like I
mentioned before, is that it does not handle nested elements very well.
I am revisiting using a stack, but I am not sure how to implement it.
If you have any advice for making a parsing engine capable of handling
nested elements, would you mind sharing it with me? I have never
written a parser that is this complicated before. Normally, parsing to
me is simply splitting a string on a delimiter. I figured that this
would be an easy place to start since it is such a small language and
it maps directly to a well-established language, so all I should have
to do is translate.
Chris Lieb
Chris Lieb wrote: I need to test if two open bold tags ([b]) occur without a close bold tag ([/b]) between them. Only problem is that I have no idea of how to accomplish this.
Oddly, this seems to work...
/\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/
Chris Lieb wrote: Thomas 'PointedEars' Lahn wrote: 3. Use a PDA (Push-Down Automaton) implementation for parsing (which can make use of RE to speed-up parsing).
[...] I am revisiting using a stack, but I am not sure how to implement it. If you have any advice for making a parsing engine capable of handling nested elements, would you mind sharing them with me?
Not at all. Here's an example:
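var
  text = "a { b { c } d } e",  // sample input (an assumption, not from the post)
  depth = -1,                  // -1 means no brace has been seen yet
  match = null,
  rx = /[{}]/g;
// stop as soon as the first opened brace has been closed again
while (depth != 0 && (match = rx.exec(text)))
{
  // no brace before
  if (depth < 0)
  {
    depth = 0;
  }
  switch (match[0])
  {
    case '{':
      depth++;
      break;
    case '}':
      depth--;
  }
}
// match is null here if the braces never balanced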
PointedEars
Vic Sowers wrote: Chris Lieb wrote: I need to test if two open bold tags ([b]) occur without a close bold tag ([/b]) between them. Only problem is that I have no idea of how to accomplish this.
Oddly, this seems to work...
/\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/
      ^^^^^^^^^^^^^^^
      |||          |greedy: longest match wins
      |||          end of the range
      ||each of these characters ('(', '?', ..., ')') is not matched
      |inverse range, hence
      start of the range
Example where it fails: " foobar:"
^
PointedEars
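The point of the annotation can be checked directly. In the sketch below, the variable names and the sample strings are assumptions chosen for illustration: because [^(?:\[\/b\])] is a character class, it excludes the single characters (, ?, :, [, /, b, ] and ) rather than the sequence [/b], so the pattern misses a nested tag whenever one of those characters, such as the letter b, appears before it.

```javascript
// Vic's pattern, copied from the post above.
var vic = /\[b\][^(?:\[\/b\])]*(\[b\].*?\[\/b\]).*?\[\/b\]/;

// The bracket expression is a negated character class, not a group:
var cls = /[^(?:\[\/b\])]/;
console.log(cls.test('x')); // true: 'x' is not one of the excluded characters
console.log(cls.test('b')); // false: 'b' is excluded on its own

// Nested bold is detected when the text before the inner tag avoids
// every excluded character...
console.log(vic.test('[b]one [b]two[/b] three[/b]'));  // true
// ...but missed as soon as that text contains the letter 'b'.
console.log(vic.test('[b]bold [b]two[/b] three[/b]')); // false
```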
"Chris Lieb" <ch********@gmail.com> writes: I've decided to not allow nested lists since even the "gold standard" of sorts, phpBB does not correctly interpret them. To enforce this, I am trying to make a regex that will detect a tag that is nested inside itself, since none of there are parsed correctly by regex's. Basically I need the regex to detect when a tag is placed within itself.
ex. Hello world!!!
This is a somewhat difficult problem, but not impossible.
What you have to test for is two start tags without an end tag in
between. For now, I'll assume that all start tags have end tags (but I
expect images not to):
We try to match a tag [<word>...] followed by non-['s
and ['s not followed by /<word>, and finally followed by [<word>
/\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/
This uses negative lookahead to do a generic test. Negative matching
is generally a pain, but modern versions of ECMAScript have the option
of using negative lookahead.
Using the regex
/\[b\](.*?)\[b\](.*?)\[\/b\]/
That matches [b] followed by whatever followed by [b] followed by
whatever followed by [/b]. Since the first whatever could contain
[/b], this also matches two separate, non-nested bold spans.
The negative match above is to make sure there is no [/b] between
[b]'s.
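The generic pattern can be tried out as follows. The variable name nestedInItself and the sample strings are assumptions made for this sketch. The pattern is used without the g flag, so repeated test() calls do not carry state.

```javascript
// Lasse's pattern, copied from the post: a start tag, then any run of
// text and '[' characters that do not begin the matching end tag, then
// a second start tag with the same name.
var nestedInItself = /\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/;

// A tag nested inside itself is reported.
console.log(nestedInItself.test('[b]Hello [b]world[/b]!!![/b]')); // true
// Two separate bold spans are fine.
console.log(nestedInItself.test('[b]Hello[/b] [b]world[/b]'));    // false
// Different tags nested in each other are fine as well.
console.log(nestedInItself.test('[b]Hello [i]world[/i]!!![/b]')); // false
```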
/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Lasse Reichstein Nielsen wrote:
[...]
/\[(\w+)[^\]]*\]([^[]*\[(?!\/\1\b))*[^[]*\[\1\b/
This uses negative lookahead to do a generic test. [...]
Thanks. It works like a charm. It took me a while to figure out
what was happening since I've never used lookaround before. It really
is amazing what can be accomplished with regular expressions.
Using the regex
/\[b\](.*?)\[b\](.*?)\[\/b\]/
That matches [b] followed by whatever followed by [b] followed by whatever followed by [/b]. Since the first whatever could contain [/b], this also matches two separate, non-nested bold spans.
The negative match above is to make sure there is no [/b] between [b]'s.
Yep. Didn't take me long to figure out that that didn't work.
Chris Lieb
Thomas 'PointedEars' Lahn wrote: Chris Lieb wrote:
[...] I am revisiting using a stack, but I am not sure how to implement it. If you have any advice for making a parsing engine capable of handling nested elements, would you mind sharing it with me?
Not at all. Here's an example:
[...]
That is close to what I was doing before. However, since BBcode is a
tag-based language like HTML instead of a curly brace language like C
or JavaScript, I had to modify it a little. Whenever I encountered an
open tag, I placed it on the stack. When I encountered a close tag, I
checked to see if it matched the tag that was on top of the stack. If
they did not match, then there was improper nesting.
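The stack approach described above might look roughly like this. It is a minimal sketch, not the poster's actual code: the function name isProperlyNested is hypothetical, tag names are compared case-insensitively as an assumption, and the [*] item tag is skipped because it has no closing counterpart.

```javascript
// Hypothetical helper: true when every close tag matches the most
// recently opened tag and nothing is left open at the end.
function isProperlyNested(bbcode) {
  var stack = [];
  // Matches [tag], [tag=value] and [/tag]; [*] is skipped since '*' is not \w.
  var tagRx = /\[(\/?)(\w+)[^\]]*\]/g;
  var match;
  while ((match = tagRx.exec(bbcode))) {
    var name = match[2].toLowerCase();
    if (match[1] !== '/') {
      stack.push(name);            // open tag: remember it
    } else if (stack.pop() !== name) {
      return false;                // close tag does not match the top of the stack
    }
  }
  return stack.length === 0;       // leftover open tags are also an error
}

console.log(isProperlyNested('[b]bold [i]both[/i][/b]')); // true
console.log(isProperlyNested('[b]bold [i]both[/b][/i]')); // false
```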
Now, everything seems to be working. The only feature that I would
like to add at this point is the ability to nest lists. My current
regex-based version does not handle the nesting. For example, a list
nested inside another list produces
<ul><li>item 1</li><li>item 2</li> - <li>item a</li><li>item
b</li></ul><li>item 3</li>
instead of
<ul><li>item 1</li><li>item 2</li><ul><li>item a</li><li>item
b</li></ul><li>item 3</li></ul>
I'll get around to fixing that later. It shouldn't be that hard since
I already have a loop that does some pre-processing on the lists.
Chris Lieb
Chris Lieb wrote: You have a point, but I hate that syntax. It makes me wonder why, back in the days of limited processing power, they made a language where so much had to be inferred by the interpreter.
Probably because the origins of HTML go back to GML, a language for
typesetting on IBM mainframes in the 1970's.
--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
-- Charles Williams. "Judgement at Chelmsford"
John W. Kennedy wrote: Chris Lieb wrote: You have a point, but I hate that syntax. It makes me wonder why, back in the days of limited processing power, they made a language where so much had to be inferred by the interpreter.
Probably because the origins of HTML go back to GML, a language for typesetting on IBM mainframes in the 1970's.
I wonder what those two would have to do with one another.
AIUI, it is because HTML is an SGML application[1], and the SGML declaration
of HTML[2] enables OMITTAG. But again, you do not _need_ to omit optional
end tags; in fact, declaring HTML 4.01 Strict or ISO HTML forces you to use
end tags for all elements with a non-EMPTY content model[3] (if you want
the markup to be Valid.[4])
X-Post & F'up2 ciwah
PointedEars
___________
[1] <URL:http://www.w3.org/TR/REC-html32#sgml>
<URL:http://www.w3.org/TR/html4/intro/sgmltut.html>
<URL:https://www.cs.tcd.ie/15445/15445.html#INTRO>
[2] <URL:http://www.w3.org/TR/REC-html32#sgmldecl>
<URL:http://www.w3.org/TR/html4/sgml/sgmldecl.html>
<URL:https://www.cs.tcd.ie/15445/15445.html#DCL>
[3] <URL:http://www.w3.org/TR/html4/sgml/dtd.html>
<URL:https://www.cs.tcd.ie/15445/15445.html#INTRO>
[4] <URL:http://validator.w3.org/>