On Jun 27, 1:32 pm, "David C. Ullrich" <dullr...@sprynet.comwrote:
Quote:
In article
<62f752f3-d840-42de-a414-0d56d15d7...@w4g2000prd.googlegroups.com>,
Jonathan Gardner <jgard...@jonathangardner.netwrote:
>
Quote:
On Jun 26, 3:22 pm, MRAB <goo...@mrabarnett.plus.comwrote:
Quote:
Try something like:
>
Quote:
Quote:
re.compile(r'<table\b.*?>.*?</table>', re.DOTALL)
>
Quote:
So you would pick up strings like "<table><tr><td><table><tr><td>foo</
td></tr></table>"? I doubt that is what oyster wants.
>
I asked a question recently - nobody answered, I think
because they assumed it was just a rhetorical question:
>
(i) It's true, isn't it, that it's impossible for the
formal CS notion of "regular expression" to correctly
parse nested open/close delimiters?
Yes. For the proof, you want to look at the pumping lemma found in
your favorite Theory of Computation textbook.
Quote:
>
(ii) The regexes in languages like Python and Perl include
features that are not part of the formal CS notion of
"regular expression". Do they include something that
does allow parsing nested delimiters properly?
So, I think most of the extensions fall into syntactic sugar
(certainly all the character classes \b \s \w, etc). The ability to
look at input without consuming it is more than syntactic sugar, but
my intuition is that it could be pretty easily modeled by a
nondeterministic finite state machine, which is of equivalent power to
REs. The only thing I can really think of that is completely non-
regular is the \1 \2, etc syntax to match previously match strings
exactly. But since you can't to an arbitrary number of them, I don't
think its actually context free. (I'm not prepared to give a proof
either way). Needless to say that even if you could, it would be
highly impractical to match parentheses using those.
So, yeah, to match arbitrary nested delimiters, you need a real
context free parser.
Quote:
>
--
David C. Ullrich
-Dan