473,406 Members | 2,343 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,406 software developers and data experts.

Regular expression fun. Repeated matching of a group Q

Hi all,

I've spent all morning trying to work this one out:

I've got the following string:

<td>04/01/2006</td><td>Wednesday</td><td>&nbsp;</td><td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:14</td>

from which I'm attempting to extract the date, and the five times from
into a list. Only the very last time is guaranteed to be there so it
should also work for a line like:

<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:00</td>

My Python regular expression to match that is currently:

digs = re.compile(
r'<td>(\d{2}\/\d{2}\/\d{4})</td>.*?(?:<td>(\d+\:\d+)</td>).*$' )

which first extracts the date into group 1
then matches the tags between the date and the first instance of a time
into group 2
then matches the first instance of a time into group 3
but then group 4 grabs all the remaining string.

I've tried changing the time pattern into

(?:<td>(\d+\:\d+)</td>)+

but that doesn't seem to mean "grab one or more cases of the previous
regexp."

Any Python regexp gurus with a hint would be greatly appreciated.

M@

Feb 24 '06 #1
7 1684
There's more to re than just sub. How about:

sanesplit = re.split(r"</td><td>|<td>|</td>", text)
date = sanesplit[1]
times = times = [time for time in sanesplit if re.match("\d\d:\d\d",
time)]

.... then "date" contains the date at the beginning of the line and
"times" contains all your times.

Feb 24 '06 #2
Thanks,

The date = sanesplit[1] line complains about the "list index being out
of range", which is probably due to the fact that not all lines have
the <td> in them, something i didn't explain in the previous post.

I'd need some way of ensuring, as with the pattern I'd concocted, that
a valid line actually starts with a <td> containing a / separated date
tag.

As an aside, is it not actually possible to do what I was trying with a
single pattern or is it just not practical?

M@

Feb 24 '06 #3
You can check len(sanesplit) to see how big your list is. If it is <
2, then there were no <td>'s, so move on to the next line.

It is probably possible to do the whole thing with a regular
expression. It is probably not wise to do so. Regular expressions are
difficult to read, and, as you discovered, difficult to program and
debug. In many cases, Python code that relies on regular expressions
for lots of program logic runs slower than code that uses normal
Python.

Suppose "words" contains all the words in English. Compare these two
lines:

foobarwords1 = [x for x in words if re.search("foo|bar", x) ]
foobarwords2 = [x for x in words if "foo" in x or "bar" in x ]

I haven't tested this with 2.4, but as of a few years ago it was a safe
bet that foobarwords2 will be calculated much, much faster. Also, I
think you will agree, foobarwords2 is a lot easier to read.

Feb 24 '06 #4
Yes, it's easier to read without a doubt. I just wondered if i was
failing to do what i was trying to do because it couldn't be done or
because i hadn't properly understood what i was doing. Alas, it was
probably the latter.

Thanks for your help,

M@

Feb 24 '06 #5
Here's a (surprise!) pyparsing solution. -- Paul
(Get pyparsing at http://pyparsing.sourceforge.net.)

data = [
"""<td>04/01/2006</td><td>Wednesday</td><td>&nbsp;</td><td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:14</td>""",
"""<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:00</td>"""
]

from pyparsing import *

startTD,endTD = makeHTMLTags("TD")
startTD = startTD.suppress()
endTD = endTD.suppress()
dayOfWeek = oneOf("Sunday Monday Tuesday Wednesday Thursday Friday
Saturday")
nbsp = Literal("&nbsp;")
time = Combine(Word(nums,exact=2) + ":" + Word(nums,exact=2))
date = Combine(Word(nums,exact=2) + "/" + Word(nums,exact=2) + "/" +
Word(nums,exact=4))

entry = ( startTD + date.setResultsName("date") + endTD +
startTD + dayOfWeek.setResultsName("dayOfWeek") + endTD +
startTD + ( Suppress(nbsp) |
Word(alphanums+"_").setResultsName("name") ) + endTD +
OneOrMore(startTD + (Suppress(nbsp) | time) + endTD
).setResultsName("dates")
)

for d in data:
res = entry.parseString(d)
print res.date
print res.dayOfWeek
print res.name
print res.dates
print
Returns:

04/01/2006
Wednesday

['09:14', '12:44', '12:50', '17:58', '08:14']

03/01/2006
Tuesday
Annual_Holiday
['08:00']

Feb 24 '06 #6
Doesn't this do what you want?

import re

DATE_TIME_RE =
re.compile(r'<td>((\d{2}\/\d{2}\/\d{4})|(\d{2}:\d{2}))<\/td>')

test = '<td>04/01/2006</td>' \
'<td>Wednesday</td>' \
'<td>&nbsp;</td>' \
'<td>09:14</td>' \
'<td>12:44</td>' \
'<td>12:50</td>' \
'<td>17:58</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>08:14</td>'

out = [m[0] for m in DATE_TIME_RE.findall(test)]

for m in out:
print m

Feb 24 '06 #7
ma***********@gmail.com wrote:
Hi all,

I've spent all morning trying to work this one out:

I've got the following string:

<td>04/01/2006</td><td>Wednesday</td><td>&nbsp;</td><td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:14</td>

from which I'm attempting to extract the date, and the five times from
into a list. Only the very last time is guaranteed to be there so it
should also work for a line like:

<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:00</td>

My Python regular expression to match that is currently:

digs = re.compile(
r'<td>(\d{2}\/\d{2}\/\d{4})</td>.*?(?:<td>(\d+\:\d+)</td>).*$' )

which first extracts the date into group 1
then matches the tags between the date and the first instance of a time
into group 2
then matches the first instance of a time into group 3
but then group 4 grabs all the remaining string.

I've tried changing the time pattern into

(?:<td>(\d+\:\d+)</td>)+

but that doesn't seem to mean "grab one or more cases of the previous
regexp."

Any Python regexp gurus with a hint would be greatly appreciated.

M@

This works:

import BeautifulSoup

test = '<td>04/01/2006</td>' \
'<td>Wednesday</td>' \
'<td>&nbsp;</td>' \
'<td>09:14</td>' \
'<td>12:44</td>' \
'<td>12:50</td>' \
'<td>17:58</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>08:14</td>'

c=BeautifulSoup.BeautifulSoup(test)
times=[]
for i in c.childGenerator():
if i.contents[0] == "&nbsp;": continue
times.append(i.contents[0])

date=times.pop(0)
day=times.pop(0)

print "date=", date
print "day=", day
print "times=", times

-Larry Bates
Feb 25 '06 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
by: kaptain kernel | last post by:
can anyone translate this into plain english? preg_match_all("/(\w+)+/U", $text, $words);
1
by: Kenneth McDonald | last post by:
I'm working on the 0.8 release of my 'rex' module, and would appreciate feedback, suggestions, and criticism as I work towards finalizing the API and feature sets. rex is a module intended to make...
4
by: Neri | last post by:
Some document processing program I write has to deal with documents that have headers and footers that are unnecessary for the main processing part. Therefore, I'm using a regular expression to go...
7
by: Billa | last post by:
Hi, I am replaceing a big string using different regular expressions (see some example at the end of the message). The problem is whenever I apply a "replace" it makes a new copy of string and I...
15
by: Mark Rae | last post by:
Hi, I'm trying to construct a RegEx pattern which will validate a string so that it can contain: only the numerical characters from 0 to 9 i.e. no decimal points, negative signs, exponentials...
25
by: Mike | last post by:
I have a regular expression (^(.+)(?=\s*).*\1 ) that results in matches. I would like to get what the actual regular expression is. In other words, when I apply ^(.+)(?=\s*).*\1 to " HEART...
2
by: Tedmond | last post by:
Dear all, I want to read a file data block by block using regular expression. The file contents is like MWH ........ ................. ..................... MWH ....................
10
by: venugopal.sjce | last post by:
Hi Friends, I'm constructing a regular expression for validating an expression which looks as any of the following forms: 1. =4*++2 OR 2. =Sum()*6 Some of the samples I have constructed...
9
by: netimen | last post by:
I have a text containing brackets (or what is the correct term for '>'?). I'd like to match text in the uppermost level of brackets. So, I have sth like: 'aaaa 123 < 1 aaa < t bbb < a <tt ff 2...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.