
Re: ask for a RE pattern to match TABLE in html

On Thursday 26 June 2008 15:53:06, oyster wrote:
that is, there is no TABLE tag between a TABLE, for example
<table >something with out table tag</table>
what is the RE pattern? thanks

the following is not right
<table.*?>[^table]*?</table>
The construct [abc] does not match a whole word but only one char, so
[^table] means "any char which is not t, a, b, l or e".
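The point about character classes is easy to check interactively. The snippet below is only an illustration; the sample string 'stable data' is made up and does not come from the original question.

```python
import re

# [^table] is a negated character class: it matches ONE character that is
# not 't', 'a', 'b', 'l' or 'e'. Letter order inside the brackets is irrelevant.
not_in_set = re.findall(r'[^table]', 'stable data')
print(not_in_set)  # ['s', ' ', 'd']

# The same class written with its letters sorted behaves identically.
same_set = re.findall(r'[^abelt]', 'stable data')
print(same_set == not_in_set)  # True
```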

Anyway the inside table word won't match your pattern, as there are '<'
and '>' in it, and these chars have to be escaped when used as simple text.
So this should work:

re.compile(r'<table(|[ ].*)>.*</table>')

The (|[ ].*) group is there to avoid matching a tag name that merely
starts with "table" (like <table_ext>).

--
Cédric Lucantis
Jun 27 '08 #1

In article <ma*************************************@python.org>,
Cédric Lucantis <om**@no-log.org> wrote:
On Thursday 26 June 2008 15:53:06, oyster wrote:
that is, there is no TABLE tag between a TABLE, for example
<table >something with out table tag</table>
what is the RE pattern? thanks

the following is not right
<table.*?>[^table]*?</table>

The construct [abc] does not match a whole word but only one char, so
[^table] means "any char which is not t, a, b, l or e".

Anyway the inside table word won't match your pattern, as there are '<'
and '>' in it, and these chars have to be escaped when used as simple text.
So this should work:

re.compile(r'<table(|[ ].*)>.*</table>')

The (|[ ].*) group is there to avoid matching a tag name that merely
starts with "table" (like <table_ext>).

Doesn't work - for example it matches '<table></table><table></table>'
(and in fact if the html contains any number of tables it's going
to match the string starting at the start of the first table and
ending at the end of the last one.)
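The over-matching described here can be reproduced with a short test. This is a minimal sketch; the sample HTML string is invented for the demonstration.

```python
import re

# The pattern proposed earlier in the thread.
greedy = re.compile(r'<table(|[ ].*)>.*</table>')

# Two sibling tables with unrelated markup between them (made-up input).
html = '<table>a</table><p>x</p><table>b</table>'

match = greedy.search(html)

# The greedy .* runs to the LAST </table>, so the single match covers
# both tables and everything in between.
spans_everything = match.group(0) == html
print(spans_everything)  # True
```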

--
David C. Ullrich
Jun 27 '08 #2

In article
<62**********************************@w4g2000prd.googlegroups.com>,
Jonathan Gardner <jg******@jonathangardner.net> wrote:
On Jun 26, 3:22 pm, MRAB <goo...@mrabarnett.plus.com> wrote:
Try something like:

re.compile(r'<table\b.*?>.*?</table>', re.DOTALL)

So you would pick up strings like "<table><tr><td><table><tr><td>foo</td></tr></table>"?
I doubt that is what oyster wants.
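To make the objection concrete, the sketch below runs the non-greedy pattern on a nested table. It then tries one possible alternative that nobody in the thread proposed: a negative lookahead that refuses to step over another opening <table. The sample string is a completed version of the one quoted above, and the name `tempered` is purely illustrative.

```python
import re

# One table nested inside another (closing tags added to the quoted example).
nested = ('<table><tr><td><table><tr><td>foo</td></tr></table>'
          '</td></tr></table>')

# The non-greedy pattern suggested in the thread.
lazy = re.compile(r'<table\b.*?>.*?</table>', re.DOTALL)
lazy_match = lazy.search(nested).group(0)
# Runs from the OUTER opening tag to the INNER closing tag.
print(lazy_match)

# Assumed alternative: consume one character at a time, but only while the
# next characters do not begin another <table tag.
tempered = re.compile(r'<table\b[^>]*>(?:(?!<table\b).)*?</table>', re.DOTALL)
innermost = tempered.findall(nested)
print(innermost)  # ['<table><tr><td>foo</td></tr></table>']
```

This only finds tables that contain no other table, which is what the original question asked for. It is still pattern matching on text rather than HTML parsing, so comments or attribute values containing '>' can confuse it.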
I asked a question recently - nobody answered, I think
because they assumed it was just a rhetorical question:

(i) It's true, isn't it, that it's impossible for the
formal CS notion of "regular expression" to correctly
parse nested open/close delimiters?

(ii) The regexes in languages like Python and Perl include
features that are not part of the formal CS notion of
"regular expression". Do they include something that
does allow parsing nested delimiters properly?

--
David C. Ullrich
Jun 27 '08 #3

Dan
On Jun 27, 1:32 pm, "David C. Ullrich" <dullr...@sprynet.com> wrote:
In article
<62f752f3-d840-42de-a414-0d56d15d7...@w4g2000prd.googlegroups.com>,
Jonathan Gardner <jgard...@jonathangardner.net> wrote:
On Jun 26, 3:22 pm, MRAB <goo...@mrabarnett.plus.com> wrote:
Try something like:
re.compile(r'<table\b.*?>.*?</table>', re.DOTALL)
So you would pick up strings like "<table><tr><td><table><tr><td>foo</td></tr></table>"?
I doubt that is what oyster wants.

I asked a question recently - nobody answered, I think
because they assumed it was just a rhetorical question:

(i) It's true, isn't it, that it's impossible for the
formal CS notion of "regular expression" to correctly
parse nested open/close delimiters?
Yes. For the proof, you want to look at the pumping lemma found in
your favorite Theory of Computation textbook.
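For readers without a textbook at hand, the argument can be sketched roughly as follows. Writing a for an opening delimiter and b for the matching closing delimiter is shorthand chosen here, not notation from the thread.

```latex
Let $D$ be the set of correctly nested strings over $\{a, b\}$.
Suppose $D$ were regular, with pumping length $p$, and take
$w = a^p b^p \in D$. Every split $w = xyz$ with $|xy| \le p$ and
$|y| \ge 1$ forces $y = a^k$ for some $k \ge 1$, so
\[
  x y^2 z = a^{p+k} b^p \notin D ,
\]
because $k$ opening delimiters are never closed. This contradicts the
pumping lemma, hence $D$ is not regular.
```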
(ii) The regexes in languages like Python and Perl include
features that are not part of the formal CS notion of
"regular expression". Do they include something that
does allow parsing nested delimiters properly?
So, I think most of the extensions fall into syntactic sugar
(certainly all the character classes \b \s \w, etc). The ability to
look at input without consuming it is more than syntactic sugar, but
my intuition is that it could be pretty easily modeled by a
nondeterministic finite state machine, which is of equivalent power to
REs. The only thing I can really think of that is completely
non-regular is the \1, \2, etc. syntax to match previously matched
strings exactly. But since you can't do an arbitrary number of them, I
don't think it's actually context free. (I'm not prepared to give a
proof either way.) Needless to say, even if you could, it would be
highly impractical to match parentheses using those.
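A small experiment shows a backreference doing something no formal regular expression can. The pattern below accepts exactly the strings made of n a's, one b, then the same n a's, which is a standard example of a non-regular language. The pattern and test strings are illustrative choices, not taken from the thread.

```python
import re

# \1 must repeat exactly what group 1 captured, so both runs of 'a'
# are forced to have the same length.
same_count = re.compile(r'(a+)b\1')

equal_runs = same_count.fullmatch('aaabaaa')    # 3 a's, b, 3 a's
unequal_runs = same_count.fullmatch('aabaaa')   # 2 a's, b, 3 a's

print(equal_runs is not None)  # True
print(unequal_runs is None)    # True
```

Note that this still relies on a fixed number of groups, which is the limitation raised above: it does not give a way to track arbitrarily deep nesting.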

So, yeah, to match arbitrary nested delimiters, you need a real
context free parser.
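As a rough illustration of what such a parser has to do, here is a minimal sketch that uses a regex only to find the tags and an explicit stack to track nesting. The names `TAG`, `table_spans` and `innermost_tables` are hypothetical helpers written for this example, and the approach assumes reasonably well-formed markup.

```python
import re

# Tokenizer: recognises only opening and closing table tags.
TAG = re.compile(r'<table\b[^>]*>|</table\s*>', re.IGNORECASE)

def table_spans(html):
    """Return (start, end, has_nested) for every balanced table element."""
    stack = []   # one [start_offset, has_nested] entry per open table
    spans = []
    for m in TAG.finditer(html):
        if m.group().startswith('</'):
            if not stack:
                continue  # stray closing tag; ignored in this sketch
            start, has_nested = stack.pop()
            spans.append((start, m.end(), has_nested))
            if stack:
                stack[-1][1] = True  # the enclosing table contains a table
        else:
            stack.append([m.start(), False])
    return spans

def innermost_tables(html):
    """Return the source text of every table that contains no other table."""
    return [html[s:e] for s, e, nested in table_spans(html) if not nested]

sample = ('<table><tr><td><table><tr><td>foo</td></tr></table>'
          '</td></tr></table><table>bar</table>')
print(innermost_tables(sample))
```

The stack is what the regex alone lacks: its depth can grow without bound, whereas a finite state machine has only a fixed number of states.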

-Dan
Jun 27 '08 #4

In article
<50**********************************@56g2000hsm.googlegroups.com>,
Dan <th********@gmail.com> wrote:
On Jun 27, 1:32 pm, "David C. Ullrich" <dullr...@sprynet.com> wrote:
In article
<62f752f3-d840-42de-a414-0d56d15d7...@w4g2000prd.googlegroups.com>,
Jonathan Gardner <jgard...@jonathangardner.net> wrote:
On Jun 26, 3:22 pm, MRAB <goo...@mrabarnett.plus.com> wrote:
Try something like:
re.compile(r'<table\b.*?>.*?</table>', re.DOTALL)
So you would pick up strings like "<table><tr><td><table><tr><td>foo</td></tr></table>"?
I doubt that is what oyster wants.
I asked a question recently - nobody answered, I think
because they assumed it was just a rhetorical question:

(i) It's true, isn't it, that it's impossible for the
formal CS notion of "regular expression" to correctly
parse nested open/close delimiters?

Yes. For the proof, you want to look at the pumping lemma found in
your favorite Theory of Computation textbook.
Ah, thanks. Don't have a favorite text, not having any at all.
But wikipedia works - what I found at

http://en.wikipedia.org/wiki/Pumping...ular_languages

was pretty clear. (Yes, it's exactly that \1, \2 stuff that
convinced me I really don't understand what one can do with
a Python regex.)

(ii) The regexes in languages like Python and Perl include
features that are not part of the formal CS notion of
"regular expression". Do they include something that
does allow parsing nested delimiters properly?

So, I think most of the extensions fall into syntactic sugar
(certainly all the character classes \b \s \w, etc). The ability to
look at input without consuming it is more than syntactic sugar, but
my intuition is that it could be pretty easily modeled by a
nondeterministic finite state machine, which is of equivalent power to
REs. The only thing I can really think of that is completely
non-regular is the \1, \2, etc. syntax to match previously matched
strings exactly. But since you can't do an arbitrary number of them, I
don't think it's actually context free. (I'm not prepared to give a
proof either way.) Needless to say, even if you could, it would be
highly impractical to match parentheses using those.

So, yeah, to match arbitrary nested delimiters, you need a real
context free parser.

--
David C. Ullrich
Jun 30 '08 #5

On Jun 27, 10:32 am, "David C. Ullrich" <dullr...@sprynet.com> wrote:
(ii) The regexes in languages like Python and Perl include
features that are not part of the formal CS notion of
"regular expression". Do they include something that
does allow parsing nested delimiters properly?
In perl, there are some pretty wild extensions to the regex syntax,
features that make it much more than a regular expression engine.

Yes, it is possible to match parentheses and other nested structures
(such as HTML), and the regex to do so isn't incredibly difficult.
Note that Python doesn't support this extension.
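Without that extension, the usual workaround in Python is a few lines of ordinary code. The sketch below pulls out top-level balanced groups with a depth counter; `balanced_groups` is a hypothetical helper written for this example, not a standard library function.

```python
def balanced_groups(text, open_ch='(', close_ch=')'):
    """Return each top-level balanced group, delimiters included."""
    depth = 0
    start = None
    groups = []
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i          # remember where the outermost group begins
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:         # the outermost group has just closed
                groups.append(text[start:i + 1])
    return groups

found = balanced_groups('f(a(b)c) + g(d)')
print(found)  # ['(a(b)c)', '(d)']
```

A group that is opened but never closed is simply not reported, and unmatched closing characters are skipped.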

See http://www.perl.com/pub/a/2003/08/21/perlcookbook.html
Jun 30 '08 #6

In article
<87**********************************@p39g2000prm.googlegroups.com>,
Jonathan Gardner <jg******@jonathangardner.net> wrote:
On Jun 27, 10:32 am, "David C. Ullrich" <dullr...@sprynet.com> wrote:
(ii) The regexes in languages like Python and Perl include
features that are not part of the formal CS notion of
"regular expression". Do they include something that
does allow parsing nested delimiters properly?

In perl, there are some pretty wild extensions to the regex syntax,
features that make it much more than a regular expression engine.

Yes, it is possible to match parentheses and other nested structures
(such as HTML), and the regex to do so isn't incredibly difficult.
Note that Python doesn't support this extension.
Huh. My evidently misinformed impression was that the regexes
in Perl and Python were essentially equivalent. (I hope nobody takes
that as a complaint...)
See http://www.perl.com/pub/a/2003/08/21/perlcookbook.html
--
David C. Ullrich
Jul 1 '08 #7
