By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
443,818 Members | 1,313 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 443,818 IT Pros & Developers. It's quick & easy.

a RegEx puzzle

P: n/a
I'm still shaky on some of sre's syntax. Here's the task: I've got
strings (never longer than about a dozen characters) that are
guaranteed to be made only of characters 'x' and '/'. In each string I
want to find the longest continuous stretch of pairs whose first
character is 'x' and the second is either mark. So if marks =
'/xx/xxx///', the "winning" stretch begins at position 2 and is 6
characters long ('x/xxx/'), which requires finding a second match that
overlaps the first match (which is just 'xx' in position 1). (When
there are multiple "winning" stretches, I might want to adjudicate
among them, but that's a separate problem.) I hope this is clear
enough.

Here's the best I've come up with so far:

pat = sre.compile('(x[x/])+')
(longest, startlongest) = max([(fnd.end()-fnd.start(), fnd.start()) for
i in range(len(marks))
for fnd in pat.finditer(marks,i)])

It seems to work, but it's not very smart. Given the string
'//////xx//', it returns a list whose first seven tuples are all
'(2,6)', followed by a single '(2,7)', which gets returned because of
max's secondary attention to the second element in the tuple. In other
words this code searches from position 0 and finds (2,6), then searches
from position 1 and finds (2,6) . . .

Obviously it ought to adjust the second argument to pat.finditer
according to what it finds, instead of blindly stepping. (By the way, I
can't give 'range' a stepsize of 2 because the "winning" stretch could
begin with an odd-numbered character.) That sounds like a job for a
generator. But (1) I haven't been able to work out the details (I'm
even shakier on generators than I am on regexes!) and (2) that *seems*
likely to be less efficient, maybe by a large enough factor that the
obnoxiousness of getting the first search-return seven times is
something I should just swallow.

What magic am I missing?

Charles Hartman
Professor of English, Poet in Residence
http://cherry.conncoll.edu/cohar
http://villex.blogspot.com

Jul 18 '05 #1
Share this Question
Share on Google+
4 Replies


P: n/a
Charles Hartman <ch*************@conncoll.edu> wrote:
I'm still shaky on some of sre's syntax. Here's the task: I've got
strings (never longer than about a dozen characters) that are
guaranteed to be made only of characters 'x' and '/'.
One possibility is to cheat completely, and depending on memory constraints
and how often you need to do this, it might even make sense to do so.

There's only two possible values for each character, so you're really
dealing with binary numbers. The max length is 12 characters (I'm
deliberately being oblivious to where you said "about" :-)), so there are
only 2^12 or 4096 possible strings. You could probably afford to
pre-compute all 4096 strings, use them as dictionary keys, and store a
(start, end) tuple for each key. Computation then becomes a simple
dictionary lookup.
In each string I
want to find the longest continuous stretch of pairs whose first
character is 'x' and the second is either mark. So if marks =
'/xx/xxx///', the "winning" stretch begins at position 2 and is 6
characters long ('x/xxx/'), which requires finding a second match that
overlaps the first match (which is just 'xx' in position 1). (When
there are multiple "winning" stretches, I might want to adjudicate
among them, but that's a separate problem.) I hope this is clear
enough.
Unfortunately, no, it's not very clear. In fact, I have no clue what
you're trying to do. Could you try explaining it again?
Charles Hartman
Professor of English, Poet in Residence


Hmmm. Are you, by any chance, looking for meter patterns in verse?
Jul 18 '05 #2

P: n/a
Charles Hartman wrote:
pat = sre.compile('(x[x/])+')
(longest, startlongest) = max([(fnd.end()-fnd.start(), fnd.start()) for
i in range(len(marks))
for fnd in pat.finditer(marks,i)])


If I'm understanding that correctly, the only way for you to get different
best matches are at offsets 0 and 1; offset 2 will yield the same matches
as 0, with the possibility of excluding the first two characters -- i. e.
any different matches should be guaranteed to be shorter. Therefore

.... for i in range(2) ...

instead of

.... for i in range(len(marks)) ...

should be sufficient.

Peter

Jul 18 '05 #3

P: n/a
On Fri, 11 Mar 2005 08:38:36 -0500, Charles Hartman wrote:
I'm still shaky on some of sre's syntax. Here's the task: I've got
strings (never longer than about a dozen characters) that are
guaranteed to be made only of characters 'x' and '/'. In each string I
want to find the longest continuous stretch of pairs whose first
character is 'x' and the second is either mark. So if marks =
'/xx/xxx///', the "winning" stretch begins at position 2 and is 6
characters long ('x/xxx/')


I don't think regexes are a good match for this task. They just aren't
optimized for this very well.

Here's what I have, though I don't know if it's *exactly* what you want:

def xCounter(s, initialOffset):
offset = initialOffset
count = 0
slen = len(s)
while s[offset] == 'x':
count = count + 1
offset += 2
if offset > slen:
break
return count, s[initialOffset:offset]

def longestRun(s):
return max([xCounter(s, i) for i in range(len(s))])[1]

In other words, start at all the positions and find the max, sort of "by
hand".

(In particular, I don't know how this will handle two runs of equal size,
if it will prefer the first or the last. I don't know how *you* handle
that, either, so I'm not sweating it for now. If you clarify this I can
help, the easiest way is to add another number into the tuple that
xCounter generates for max to work on.)

This is not optimized, and if you're processing millions of these, this
may be too slow. What I'd suggest is that if this isn't fast enough,
memoize the function, i.e., with this:

longestRunResults = {}

def memoizedLongestRun(s):
try:
return longestRunResults[s]
except KeyError: pass

result = longestRun(s)
longestRunResults[s] = result
return result

You'll probably want to change the names of things to better fit.

This will get you the benefits of pre-computation without paying the
up-front costs (you only compute what you need, not all possible
combinations), and this ought to be plenty fast to process a whole lotta
data really quickly.

(This is a great example of why readable code can be more important than
fast code; the speedup can be added later, it's more important that you
can satisfy yourself that the code is correct, and a "clever" algorithm
might make that hard.)
Jul 18 '05 #4

P: n/a
Charles Hartman wrote:
I'm still shaky on some of sre's syntax. Here's the task: I've got
strings (never longer than about a dozen characters) that are guaranteed
to be made only of characters 'x' and '/'. In each string I want to find
the longest continuous stretch of pairs whose first character is 'x' and
the second is either mark. So if marks = '/xx/xxx///', the "winning"
stretch begins at position 2 and is 6 characters long ('x/xxx/'), which
requires finding a second match that overlaps the first match (which is
just 'xx' in position 1). (When there are multiple "winning" stretches,
I might want to adjudicate among them, but that's a separate problem.) I
hope this is clear enough.

Here's the best I've come up with so far:

pat = sre.compile('(x[x/])+')
(longest, startlongest) = max([(fnd.end()-fnd.start(), fnd.start()) for
i in range(len(marks))
for fnd in pat.finditer(marks,i)])


It's pretty simple to put re.search() into a loop where subsequent searches start from the character
after where the previous one matched. Here is a solution that uses a general-purpose longest match
function:

import re

# RE solution
def longestMatch(rx, s):
''' Find the longest match for rx in s.
Returns (start, length) for the match or (None, None) if no match found.
'''

start = length = current = 0

while True:
m = rx.search(s, current)
if not m:
break

mStart, mEnd = m.span()
current = mStart + 1

if (mEnd - mStart) > length:
start = mStart
length = mEnd - mStart

if length:
return start, length

return None, None
pairsRe = re.compile(r'(x[x/])+')

for s in [ '/xx/xxx///', '//////xx//' ]:
print s, longestMatch(pairsRe, s)
Jul 18 '05 #5

This discussion thread is closed

Replies have been disabled for this discussion.