Regular expression fun. Repeated matching of a group Q

matteosartori

Hi all,

I've spent all morning trying to work this one out:

I've got the following string:

<td>04/01/2006</td><td>Wednesday</td><td> </td><td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td><td> </td><td> </td><td> </td><td> </td><td>08:14</td>

from which I'm attempting to extract the date and the five times
into a list. Only the very last time is guaranteed to be there, so it
should also work for a line like:

<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td>08:00</td>

My Python regular expression to match that is currently:

digs = re.compile(
r'<td>(\d{2}\/\d{2}\/\d{4})</td>.*?(?:<td>(\d+\:\d+)</td>).*$' )

which first extracts the date into group 1
then matches the tags between the date and the first instance of a time
into group 2
then matches the first instance of a time into group 3
but then group 4 grabs all the remaining string.

I've tried changing the time pattern into

(?:<td>(\d+\:\d+)</td>)+

but that doesn't seem to mean "grab one or more cases of the previous
regexp."
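A minimal sketch of what seems to be happening here, assuming a shortened sample row (the `row` string and variable names below are illustrative, not from the original post). In Python's `re` module, a capturing group that sits inside a repetition is matched on every pass, but the match object only keeps the text from the last pass; `re.findall` is one way to collect every occurrence instead.

```python
import re

# Shortened, hypothetical sample row containing three times.
row = '<td>09:14</td><td>12:44</td><td>12:50</td>'

# The group is matched three times under "+", but the match object
# retains only the text captured on the final repetition.
repeated = re.compile(r'(?:<td>(\d+:\d+)</td>)+')
m = repeated.match(row)
last_only = m.groups()

# findall() rescans the string and returns the capture from every match.
all_times = re.findall(r'<td>(\d+:\d+)</td>', row)

print(last_only)
print(all_times)
```

So the `+` does repeat the sub-pattern; it is the capture that gets overwritten each time round.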

Any Python regexp gurus with a hint would be greatly appreciated.

M@

Feb 24 '06 #1

johnzenger

There's more to re than just sub. How about:

import re
sanesplit = re.split(r"</td><td>|<td>|</td>", text)
date = sanesplit[1]
times = [time for time in sanesplit if re.match(r"\d\d:\d\d", time)]

... then "date" contains the date at the beginning of the line and
"times" contains all your times.

Feb 24 '06 #2

matteosartori

Thanks,

The date = sanesplit[1] line complains about the "list index being out
of range", which is probably due to the fact that not all lines have
the <td> in them, something I didn't explain in the previous post.

I'd need some way of ensuring, as with the pattern I'd concocted, that
a valid line actually starts with a <td> containing a / separated date
tag.

As an aside, is it not actually possible to do what I was trying with a
single pattern or is it just not practical?

M@

Feb 24 '06 #3

johnzenger

You can check len(sanesplit) to see how big your list is. If it is <
2, then there were no <td>'s, so move on to the next line.
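The length check can be combined with a test that the first cell really is a date. A rough sketch, assuming the row layout from the original post; the helper name `parse_line` and the two compiled patterns are made up for illustration:

```python
import re

# Hypothetical helper patterns: a dd/mm/yyyy date and an hh:mm time.
DATE = re.compile(r"\d{2}/\d{2}/\d{4}$")
TIME = re.compile(r"\d\d:\d\d$")

def parse_line(text):
    """Return (date, times) for a table row, or None for any other line."""
    cells = re.split(r"</td><td>|<td>|</td>", text)
    # Fewer than two pieces means no <td> was found; a non-date first
    # cell means this is not one of the rows we care about.
    if len(cells) < 2 or not DATE.match(cells[1]):
        return None
    return cells[1], [c for c in cells if TIME.match(c)]

sample = ('<td>04/01/2006</td><td>Wednesday</td><td> </td>'
          '<td>09:14</td><td>12:44</td><td> </td><td>08:14</td>')
print(parse_line(sample))
print(parse_line('<tr class="header">'))
```

Lines that don't qualify come back as None, so the calling loop can simply skip them.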

It is probably possible to do the whole thing with a regular
expression. It is probably not wise to do so. Regular expressions are
difficult to read, and, as you discovered, difficult to program and
debug. In many cases, Python code that relies on regular expressions
for lots of program logic runs slower than code that uses normal
Python.

Suppose "words" contains all the words in English. Compare these two
lines:

foobarwords1 = [x for x in words if re.search("foo|bar", x) ]
foobarwords2 = [x for x in words if "foo" in x or "bar" in x ]

I haven't tested this with 2.4, but as of a few years ago it was a safe
bet that foobarwords2 would be calculated much, much faster. Also, I
think you will agree, foobarwords2 is a lot easier to read.
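Anyone who wants to check the speed claim on their own interpreter could try something along these lines. It is only a sketch: the tiny `words` list stands in for a real dictionary, the function names are invented, and no particular timing result is implied.

```python
import re
import timeit

# Stand-in word list (an assumption; substitute a real dictionary file).
words = ["food", "rebar", "python", "barfoo", "spam"] * 1000
pattern = re.compile("foo|bar")

def with_regex():
    return [x for x in words if pattern.search(x)]

def with_in():
    return [x for x in words if "foo" in x or "bar" in x]

# Both approaches should select exactly the same words.
assert with_regex() == with_in()

regex_seconds = timeit.timeit(with_regex, number=20)
plain_seconds = timeit.timeit(with_in, number=20)
print("regex: %.4fs  plain: %.4fs" % (regex_seconds, plain_seconds))
```

The numbers will vary with the Python version and the word list, so it is worth measuring rather than assuming.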

Feb 24 '06 #4

matteosartori

Yes, it's easier to read without a doubt. I just wondered if I was
failing to do what I was trying to do because it couldn't be done or
because I hadn't properly understood what I was doing. Alas, it was
probably the latter.

Thanks for your help,

M@

Feb 24 '06 #5

Paul McGuire

Here's a (surprise!) pyparsing solution. -- Paul
(Get pyparsing at http://pyparsing.sourceforge.net.)

data = [
"""<td>04/01/2006</td><td>Wednesday</td><td> </td><td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td><td> </td><td> </td><td> </td><td> </td><td>08:14</td>""",
"""<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td>08:00</td>"""
]

from pyparsing import *

startTD,endTD = makeHTMLTags("TD")
startTD = startTD.suppress()
endTD = endTD.suppress()
dayOfWeek = oneOf("Sunday Monday Tuesday Wednesday Thursday Friday Saturday")
nbsp = Literal(" ")
time = Combine(Word(nums,exact=2) + ":" + Word(nums,exact=2))
date = Combine(Word(nums,exact=2) + "/" + Word(nums,exact=2) + "/" +
Word(nums,exact=4))

entry = ( startTD + date.setResultsName("date") + endTD +
startTD + dayOfWeek.setResultsName("dayOfWeek") + endTD +
startTD + ( Suppress(nbsp) |
Word(alphanums+"_").setResultsName("name") ) + endTD +
OneOrMore(startTD + (Suppress(nbsp) | time) + endTD
).setResultsName("dates")
)

for d in data:
    res = entry.parseString(d)
    print res.date
    print res.dayOfWeek
    print res.name
    print res.dates
    print

Returns:

04/01/2006
Wednesday

['09:14', '12:44', '12:50', '17:58', '08:14']

03/01/2006
Tuesday
Annual_Holiday
['08:00']

Feb 24 '06 #6

plahey

Doesn't this do what you want?

import re

DATE_TIME_RE = re.compile(r'<td>((\d{2}\/\d{2}\/\d{4})|(\d{2}:\d{2}))<\/td>')

test = '<td>04/01/2006</td>' \
'<td>Wednesday</td>' \
'<td> </td>' \
'<td>09:14</td>' \
'<td>12:44</td>' \
'<td>12:50</td>' \
'<td>17:58</td>' \
'<td> </td>' \
'<td> </td>' \
'<td> </td>' \
'<td> </td>' \
'<td>08:14</td>'

out = [m[0] for m in DATE_TIME_RE.findall(test)]

for m in out:
    print m

Feb 24 '06 #7

Larry Bates

This works:

import BeautifulSoup

test = '<td>04/01/2006</td>' \
'<td>Wednesday</td>' \
'<td> </td>' \
'<td>09:14</td>' \
'<td>12:44</td>' \
'<td>12:50</td>' \
'<td>17:58</td>' \
'<td> </td>' \
'<td> </td>' \
'<td> </td>' \
'<td> </td>' \
'<td>08:14</td>'

c=BeautifulSoup.BeautifulSoup(test)
times=[]
for i in c.childGenerator():
    if i.contents[0] == " ": continue
    times.append(i.contents[0])

date=times.pop(0)
day=times.pop(0)

print "date=", date
print "day=", day
print "times=", times

-Larry Bates

Feb 25 '06 #8
