Re: ask for a RE pattern to match TABLE in html

On Thursday 26 June 2008 at 15:53:06, oyster wrote:
that is, there is no TABLE tag between a TABLE, for example
<table >something with out table tag</table>
what is the RE pattern? thanks

the following is not right
<table.*?>[^table]*?</table>
The construct [abc] does not match a whole word but only one char, so
[^table] means "any char which is not t, a, b, l or e".
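
A quick way to check this in an interpreter (a minimal sketch; the sample strings are made up for illustration and are not from the original question):

```python
import re

# [^table] is a negated character class: it matches ONE character that is
# not 't', 'a', 'b', 'l' or 'e'. It knows nothing about the word "table".
not_table_chars = re.compile(r'[^table]*')

# Matching stops at the first forbidden letter, here the 'b' of "blue".
prefix = not_table_chars.match('sky blue').group()

# The pattern from the question therefore rejects perfectly ordinary
# content, simply because "one cell" contains the letters 'e' and 'l'.
question_pattern = re.compile(r'<table.*?>[^table]*?</table>')
rejected = question_pattern.search('<table>one cell</table>')

# Content that happens to avoid all five letters is accepted.
accepted = question_pattern.search('<table>sky: 100%</table>')
```

So the class filters single letters, not the tag name.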

Anyway the inside table word won't match your pattern, as there are '<'
and '>' in it, and these chars have to be escaped when used as simple text.
So this should work:

re.compile(r'<table(|[ ].*)>.*</table>')
                   ^ this is to avoid matching a tag name starting with table
                     (like <table_ext>)

--
Cédric Lucantis
Jun 27 '08 #1
In article <ma*************************************@python.org>,
Cédric Lucantis <om**@no-log.org> wrote:
On Thursday 26 June 2008 at 15:53:06, oyster wrote:
that is, there is no TABLE tag between a TABLE, for example
<table >something with out table tag</table>
what is the RE pattern? thanks

the following is not right
<table.*?>[^table]*?</table>

The construct [abc] does not match a whole word but only one char, so
[^table] means "any char which is not t, a, b, l or e".

Anyway the inside table word won't match your pattern, as there are '<'
and '>' in it, and these chars have to be escaped when used as simple text.
So this should work:

re.compile(r'<table(|[ ].*)>.*</table>')
                   ^ this is to avoid matching a tag name starting with table
                     (like <table_ext>)

Doesn't work - for example it matches '<table></table><table></table>'
(and in fact if the html contains any number of tables it's going
to match the string starting at the start of the first table and
ending at the end of the last one.)
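
This is easy to reproduce. The sketch below uses a made-up two-table string; the lazy variant is an assumption added for comparison and is not one of the patterns posted so far:

```python
import re

html = '<table>A</table><p>between</p><table>B</table>'

# The pattern as posted: the greedy .* runs to the end of the input and
# backtracks only as far as the LAST </table>.
greedy = re.compile(r'<table(|[ ].*)>.*</table>')
greedy_match = greedy.search(html).group()

# Hypothetical lazy variant: .*? stops at the FIRST </table> it can reach.
lazy = re.compile(r'<table(|[ ].*?)>.*?</table>')
lazy_matches = [m.group() for m in lazy.finditer(html)]
```

The greedy version returns the whole string as a single match, while the lazy one returns the two tables separately. The lazy version still does nothing about tables nested inside other tables.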

--
David C. Ullrich
Jun 27 '08 #2
In article
<62**********************************@w4g2000prd.googlegroups.com>,
Jonathan Gardner <jg******@jonathangardner.net> wrote:
On Jun 26, 3:22 pm, MRAB <goo...@mrabarnett.plus.com> wrote:
Try something like:

re.compile(r'<table\b.*?>.*?</table>', re.DOTALL)

So you would pick up strings like
"<table><tr><td><table><tr><td>foo</td></tr></table>"? I doubt that is what oyster wants.

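
For reference, here is that behaviour on a complete nested document, next to one possible way to get only tables that contain no other table. This is a sketch under assumptions: the `innermost` pattern with its negative lookahead is not from any post in this thread, and it ignores case, comments and other real-world HTML quirks.

```python
import re

nested = ('<table><tr><td><table><tr><td>foo</td></tr></table>'
          '</td></tr></table>')

# The suggested pattern: starts at the OUTER <table> and stops at the
# first </table>, which closes the INNER table.
mrab = re.compile(r'<table\b.*?>.*?</table>', re.DOTALL)
mrab_match = mrab.search(nested).group()

# Assumed alternative: consume one character at a time, but refuse to step
# onto the start of another "<table", so a match can never contain a
# nested opening tag.
innermost = re.compile(
    r'<table\b[^>]*>(?:(?!<table\b).)*?</table>', re.DOTALL)
inner_match = innermost.search(nested).group()
```

The first pattern yields an unbalanced fragment; the second yields only the inner table, which is what the original question seems to ask for.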
I asked a question recently - nobody answered, I think
because they assumed it was just a rhetorical question:

(i) It's true, isn't it, that it's impossible for the
formal CS notion of "regular expression" to correctly
parse nested open/close delimiters?

(ii) The regexes in languages like Python and Perl include
features that are not part of the formal CS notion of
"regular expression". Do they include something that
does allow parsing nested delimiters properly?

--
David C. Ullrich
Jun 27 '08 #3
Dan
On Jun 27, 1:32 pm, "David C. Ullrich" <dullr...@sprynet.com> wrote:
In article
<62f752f3-d840-42de-a414-0d56d15d7...@w4g2000prd.googlegroups.com>,
Jonathan Gardner <jgard...@jonathangardner.net> wrote:
On Jun 26, 3:22 pm, MRAB <goo...@mrabarnett.plus.com> wrote:
Try something like:
re.compile(r'<table\b.*?>.*?</table>', re.DOTALL)
So you would pick up strings like
"<table><tr><td><table><tr><td>foo</td></tr></table>"? I doubt that is what oyster wants.

I asked a question recently - nobody answered, I think
because they assumed it was just a rhetorical question:

(i) It's true, isn't it, that it's impossible for the
formal CS notion of "regular expression" to correctly
parse nested open/close delimiters?

Yes. For the proof, you want to look at the pumping lemma found in
your favorite Theory of Computation textbook.

(ii) The regexes in languages like Python and Perl include
features that are not part of the formal CS notion of
"regular expression". Do they include something that
does allow parsing nested delimiters properly?

So, I think most of the extensions fall into syntactic sugar
(certainly all the character classes \b \s \w, etc). The ability to
look at input without consuming it is more than syntactic sugar, but
my intuition is that it could be pretty easily modeled by a
nondeterministic finite state machine, which is of equivalent power to
REs. The only thing I can really think of that is completely
non-regular is the \1 \2, etc syntax to match previously matched strings
exactly. But since you can't do an arbitrary number of them, I don't
think it's actually context free. (I'm not prepared to give a proof
either way). Needless to say that even if you could, it would be
highly impractical to match parentheses using those.

So, yeah, to match arbitrary nested delimiters, you need a real
context free parser.
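
In Python, one way to get that effect for the table case is to let the standard library tokenize the HTML and keep the nesting stack by hand. This is a minimal sketch, not a full grammar-based parser: the class name `LeafTableFinder` and the sample input are made up for illustration, and it collects only the text content of each table that has no table nested inside it.

```python
from html.parser import HTMLParser


class LeafTableFinder(HTMLParser):
    """Collect the text of every <table> that contains no nested <table>."""

    def __init__(self):
        super().__init__()
        self.stack = []        # one entry per currently open <table>
        self.leaf_tables = []  # text of tables without nested tables

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            if self.stack:
                # The enclosing table now has a nested table.
                self.stack[-1]['has_child'] = True
            self.stack.append({'has_child': False, 'text': []})

    def handle_data(self, data):
        if self.stack:
            # Text belongs to the innermost open table.
            self.stack[-1]['text'].append(data)

    def handle_endtag(self, tag):
        if tag == 'table' and self.stack:
            table = self.stack.pop()
            if not table['has_child']:
                self.leaf_tables.append(''.join(table['text']))


finder = LeafTableFinder()
finder.feed('<table><tr><td>outer<table><tr><td>foo</td></tr></table>'
            '</td></tr></table><table><tr><td>bar</td></tr></table>')
finder.close()
leaf_tables = finder.leaf_tables
```

The explicit stack is what a regular expression in the formal sense cannot provide: it remembers how many tables are open at any depth.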

-Dan
Jun 27 '08 #4
In article
<50**********************************@56g2000hsm.googlegroups.com>,
Dan <th********@gmail.com> wrote:
On Jun 27, 1:32 pm, "David C. Ullrich" <dullr...@sprynet.com> wrote:
In article
<62f752f3-d840-42de-a414-0d56d15d7...@w4g2000prd.googlegroups.com>,
Jonathan Gardner <jgard...@jonathangardner.net> wrote:
On Jun 26, 3:22 pm, MRAB <goo...@mrabarnett.plus.com> wrote:
Try something like:
re.compile(r'<table\b.*?>.*?</table>', re.DOTALL)
So you would pick up strings like
"<table><tr><td><table><tr><td>foo</td></tr></table>"? I doubt that is what oyster wants.

I asked a question recently - nobody answered, I think
because they assumed it was just a rhetorical question:

(i) It's true, isn't it, that it's impossible for the
formal CS notion of "regular expression" to correctly
parse nested open/close delimiters?

Yes. For the proof, you want to look at the pumping lemma found in
your favorite Theory of Computation textbook.

Ah, thanks. Don't have a favorite text, not having any at all.
But Wikipedia works - what I found at

http://en.wikipedia.org/wiki/Pumping...ular_languages

was pretty clear. (Yes, it's exactly that \1, \2 stuff that
convinced me I really don't understand what one can do with
a Python regex.)
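
As a small illustration of why backreferences go beyond the formal notion, here is a sketch (the sample strings are made up). The set of "doubled" strings, some string followed by an exact copy of itself, is a standard textbook example of a language that is not regular, yet one backreference recognizes it:

```python
import re

# (.+)\1 : any non-empty string, immediately followed by an exact copy.
doubled = re.compile(r'(.+)\1')

is_doubled = doubled.fullmatch('abcabc') is not None   # 'abc' + 'abc'
not_doubled = doubled.fullmatch('abcabd') is not None  # no such split
```

That is a different kind of power from what nested delimiters need, though: a backreference repeats text it has already seen, it does not count open and close tags.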

--
David C. Ullrich
Jun 30 '08 #5
On Jun 27, 10:32 am, "David C. Ullrich" <dullr...@sprynet.com> wrote:
(ii) The regexes in languages like Python and Perl include
features that are not part of the formal CS notion of
"regular expression". Do they include something that
does allow parsing nested delimiters properly?

In Perl, there are some pretty wild extensions to the regex syntax,
features that make it much more than a regular expression engine.

Yes, it is possible to match parentheses and other nested structures
(such as HTML), and the regex to do so isn't incredibly difficult.
Note that Python doesn't support this extension.

See http://www.perl.com/pub/a/2003/08/21/perlcookbook.html
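
Since the standard `re` module has no recursive patterns, the usual fallback in Python for plain balanced delimiters is a few lines of ordinary code. A minimal sketch, where the function name `is_balanced` and its parameters are made up for illustration:

```python
def is_balanced(text, open_ch='(', close_ch=')'):
    """Return True if every open_ch in text has a matching close_ch."""
    depth = 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:
                # A closer appeared before its opener.
                return False
    # Anything still open at the end is unbalanced.
    return depth == 0


balanced = is_balanced('(a(b)c)')
unclosed = is_balanced('(()')
wrong_order = is_balanced(')(')
```

The counter plays the role of the unbounded memory that a finite state machine lacks.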
Jun 30 '08 #6
In article
<87**********************************@p39g2000prm.googlegroups.com>,
Jonathan Gardner <jg******@jonathangardner.net> wrote:
On Jun 27, 10:32 am, "David C. Ullrich" <dullr...@sprynet.com> wrote:
(ii) The regexes in languages like Python and Perl include
features that are not part of the formal CS notion of
"regular expression". Do they include something that
does allow parsing nested delimiters properly?

In Perl, there are some pretty wild extensions to the regex syntax,
features that make it much more than a regular expression engine.

Yes, it is possible to match parentheses and other nested structures
(such as HTML), and the regex to do so isn't incredibly difficult.
Note that Python doesn't support this extension.

Huh. My evidently misinformed impression was that the regexes
in Perl and Python were essentially equivalent. (I hope nobody takes
that as a complaint...)

See http://www.perl.com/pub/a/2003/08/21/perlcookbook.html

--
David C. Ullrich
Jul 1 '08 #7
