Bytes IT Community

Regular expression fun. Repeated matching of a group Q

P: n/a
Hi all,

I've spent all morning trying to work this one out:

I've got the following string:

<td>04/01/2006</td><td>Wednesday</td><td>&nbsp;</td><td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:14</td>

from which I'm attempting to extract the date and the five times
into a list. Only the very last time is guaranteed to be there, so it
should also work for a line like:

<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:00</td>

My Python regular expression to match that is currently:

digs = re.compile(
r'<td>(\d{2}\/\d{2}\/\d{4})</td>.*?(?:<td>(\d+\:\d+)</td>).*$' )

which first extracts the date into group 1
then matches the tags between the date and the first instance of a time
into group 2
then matches the first instance of a time into group 3
but then group 4 grabs all the remaining string.

I've tried changing the time pattern into

(?:<td>(\d+\:\d+)</td>)+

but that doesn't seem to mean "grab one or more cases of the previous
regexp."
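
For reference, the behaviour described here can be reproduced in a few lines. In Python's re module a capturing group that is repeated keeps only the text of its last repetition, so the "+" does match every time cell but group 1 ends up holding just the final one. The sample string `row` below is a shortened, made-up version of the line above; `re.findall` is shown as one possible way to collect every repetition, not necessarily the only one.

```python
import re

# Shortened, hypothetical sample: three time cells in a row.
row = "<td>09:14</td><td>12:44</td><td>12:50</td>"

# The repeated group matches all three cells, but the capture is
# overwritten on each repetition, so only the last one survives.
m = re.match(r'(?:<td>(\d+:\d+)</td>)+', row)
last_only = m.group(1)

# findall applies the un-repeated pattern over and over instead,
# returning one captured string per match.
all_times = re.findall(r'<td>(\d+:\d+)</td>', row)

print(last_only)
print(all_times)
```

So a fixed pattern can only hand back as many values as it has groups, which is why the repeated group appears to "lose" the earlier times.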

Any Python regexp gurus with a hint would be greatly appreciated.

M@

Feb 24 '06 #1
7 Replies


P: n/a
There's more to re than just sub. How about:

sanesplit = re.split(r"</td><td>|<td>|</td>", text)
date = sanesplit[1]
times = [time for time in sanesplit if re.match(r"\d\d:\d\d", time)]

... then "date" contains the date at the beginning of the line and
"times" contains all your times.

Feb 24 '06 #2

P: n/a
Thanks,

The date = sanesplit[1] line complains about the "list index being out
of range", which is probably due to the fact that not all lines have
the <td> in them, something I didn't explain in the previous post.

I'd need some way of ensuring, as with the pattern I'd concocted, that
a valid line actually starts with a <td> containing a / separated date
tag.

As an aside, is it not actually possible to do what I was trying with a
single pattern or is it just not practical?

M@

Feb 24 '06 #3

P: n/a
You can check len(sanesplit) to see how big your list is. If it is <
2, then there were no <td>'s, so move on to the next line.
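
Putting the length check together with the split might look roughly like the sketch below. The function name `parse_row` and the two compiled patterns are made up for illustration; the extra test on the first cell is there because the original poster wants a valid line to start with a <td> holding a /-separated date.

```python
import re

# Hypothetical helper names; only re.split and re.match come from the thread.
DATE = re.compile(r"\d{2}/\d{2}/\d{4}$")
TIME = re.compile(r"\d\d:\d\d$")

def parse_row(line):
    """Return (date, times) for a valid row, or None for any other line."""
    cells = re.split(r"</td><td>|<td>|</td>", line)
    # A line with no <td> splits into a single element, and a row that
    # does not open with a date cell is not one we care about either.
    if len(cells) < 2 or not DATE.match(cells[1]):
        return None
    times = [cell for cell in cells if TIME.match(cell)]
    return cells[1], times

row = ("<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td>"
       "<td>&nbsp;</td><td>08:00</td>")
print(parse_row(row))
print(parse_row("<tr>"))
```

Returning None for unwanted lines lets the calling loop skip them with a plain "if result is None: continue".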

It is probably possible to do the whole thing with a regular
expression. It is probably not wise to do so. Regular expressions are
difficult to read, and, as you discovered, difficult to program and
debug. In many cases, Python code that relies on regular expressions
for lots of program logic runs slower than code that uses normal
Python.

Suppose "words" contains all the words in English. Compare these two
lines:

foobarwords1 = [x for x in words if re.search("foo|bar", x) ]
foobarwords2 = [x for x in words if "foo" in x or "bar" in x ]

I haven't tested this with 2.4, but as of a few years ago it was a safe
bet that foobarwords2 would be calculated much, much faster. Also, I
think you will agree, foobarwords2 is a lot easier to read.
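
Anyone who wants to check the speed claim on their own interpreter could time the two versions with the standard timeit module, along the lines of the sketch below. The word list here is a small made-up stand-in for "all the words in English", and the function names are invented for the example; the actual numbers will depend on the Python version and the data, so no result is quoted.

```python
import re
import timeit

# Made-up stand-in for a real word list.
words = ["food", "rebar", "python", "barfoo", "spam"] * 1000

def with_regex():
    return [x for x in words if re.search("foo|bar", x)]

def with_in():
    return [x for x in words if "foo" in x or "bar" in x]

# Run each version 20 times and report the total wall-clock seconds.
regex_seconds = timeit.timeit(with_regex, number=20)
in_seconds = timeit.timeit(with_in, number=20)

print("regex: %.4f s" % regex_seconds)
print("in:    %.4f s" % in_seconds)
```

Both functions return the same list, so whichever is faster on a given machine, the comparison is like for like.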

Feb 24 '06 #4

P: n/a
Yes, it's easier to read without a doubt. I just wondered if I was
failing to do what I was trying to do because it couldn't be done or
because I hadn't properly understood what I was doing. Alas, it was
probably the latter.

Thanks for your help,

M@

Feb 24 '06 #5

P: n/a
Here's a (surprise!) pyparsing solution. -- Paul
(Get pyparsing at http://pyparsing.sourceforge.net.)

data = [
"""<td>04/01/2006</td><td>Wednesday</td><td>&nbsp;</td><td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:14</td>""",
"""<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:00</td>"""
]

from pyparsing import *

startTD,endTD = makeHTMLTags("TD")
startTD = startTD.suppress()
endTD = endTD.suppress()
dayOfWeek = oneOf("Sunday Monday Tuesday Wednesday Thursday Friday Saturday")
nbsp = Literal("&nbsp;")
time = Combine(Word(nums,exact=2) + ":" + Word(nums,exact=2))
date = Combine(Word(nums,exact=2) + "/" + Word(nums,exact=2) + "/" +
               Word(nums,exact=4))

# One table row: date cell, day-of-week cell, a name-or-blank cell,
# then any number of cells that hold either a time or &nbsp;.
entry = ( startTD + date.setResultsName("date") + endTD +
          startTD + dayOfWeek.setResultsName("dayOfWeek") + endTD +
          startTD + ( Suppress(nbsp) |
                      Word(alphanums+"_").setResultsName("name") ) + endTD +
          OneOrMore(startTD + (Suppress(nbsp) | time) + endTD
                    ).setResultsName("dates")
        )

for d in data:
    res = entry.parseString(d)
    print(res.date)
    print(res.dayOfWeek)
    print(res.name)
    print(res.dates)
    print("")
Returns:

04/01/2006
Wednesday

['09:14', '12:44', '12:50', '17:58', '08:14']

03/01/2006
Tuesday
Annual_Holiday
['08:00']

Feb 24 '06 #6

P: n/a
Doesn't this do what you want?

import re

DATE_TIME_RE = re.compile(
    r'<td>((\d{2}\/\d{2}\/\d{4})|(\d{2}:\d{2}))<\/td>')

test = '<td>04/01/2006</td>' \
'<td>Wednesday</td>' \
'<td>&nbsp;</td>' \
'<td>09:14</td>' \
'<td>12:44</td>' \
'<td>12:50</td>' \
'<td>17:58</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>08:14</td>'

out = [m[0] for m in DATE_TIME_RE.findall(test)]

for m in out:
    print(m)
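
If the date and the times are wanted separately rather than in one flat list, the same pattern can be reused, since findall returns one tuple per match with a slot for each of the three groups (whole cell, date, time). The following is only a sketch: the test string is shortened, and the variable names `matches`, `date` and `times` are made up for the example.

```python
import re

DATE_TIME_RE = re.compile(
    r'<td>((\d{2}\/\d{2}\/\d{4})|(\d{2}:\d{2}))<\/td>')

# Shortened version of the test row used above.
test = ('<td>04/01/2006</td><td>Wednesday</td><td>&nbsp;</td>'
        '<td>09:14</td><td>12:44</td><td>&nbsp;</td><td>08:14</td>')

# Each tuple is (whole cell, date or '', time or '').
matches = DATE_TIME_RE.findall(test)

# Take the first non-empty date slot and every non-empty time slot.
date = next(d for _, d, _ in matches if d)
times = [t for _, _, t in matches if t]

print(date)
print(times)
```

Note that `next` raises StopIteration when a line has no date cell at all, so lines without one would need to be filtered out first.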

Feb 24 '06 #7

P: n/a
This works:

import BeautifulSoup

test = '<td>04/01/2006</td>' \
'<td>Wednesday</td>' \
'<td>&nbsp;</td>' \
'<td>09:14</td>' \
'<td>12:44</td>' \
'<td>12:50</td>' \
'<td>17:58</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>08:14</td>'

c=BeautifulSoup.BeautifulSoup(test)
times=[]
for i in c.childGenerator():
    if i.contents[0] == "&nbsp;": continue
    times.append(i.contents[0])

date=times.pop(0)
day=times.pop(0)

print "date=", date
print "day=", day
print "times=", times

-Larry Bates
Feb 25 '06 #8
