
problem with regular expression?

I'm trying to scan a (binary) file for a string matching a particular
pattern, and am getting unexpected results. I don't know if this is a bug
or just my own misunderstanding of regular expressions.

The string I'm searching for is a "versioned file name" of the form
"AMS_epXXXx.flt", where 'XXX' is one to three digits, the 'x' is a lower case
letter 'a-z', and the '_' and 'ep' are each optional. In other words, the
following are examples that match:

AMSep12a.flt
ams_ep101b.flt
ams_123z.flt
ams12z.flt

The regular expression pattern I'm using is:

prefix = 'ams'
pat = re.compile(prefix + r'(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

I'm using the parenthesized groups to conditionally process the match, i.e.,
if there is no '_' or 'ep' in the name, I still want the match but handle
it differently. In my pattern above, group 1 is the (_) group, group 2 is
the (ep) group, and group 3 is the "version string" group.
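
To make the intended use of the three groups concrete, here is a minimal sketch that runs the same pattern over the four example names and reports which optional parts were present. The helper name `describe` and the dictionary keys are made up for illustration, and the code is written for Python 3 with text strings rather than the binary data discussed below.

```python
import re

prefix = 'ams'
pat = re.compile(prefix + r'(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

def describe(name):
    """Hypothetical helper: report which optional parts matched in `name`."""
    m = pat.search(name)
    if m is None:
        return None
    underscore, ep, version = m.groups()
    return {
        'has_underscore': underscore is not None,  # group 1: (_)
        'has_ep': ep is not None,                  # group 2: (ep)
        'version': version,                        # group 3: digits + letter
    }

for name in ['AMSep12a.flt', 'ams_ep101b.flt', 'ams_123z.flt', 'ams12z.flt']:
    print(name, describe(name))
```

An optional group that did not take part in the match comes back as `None`, which is what the conditional handling relies on.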

The problem: for the following bytes (hex data from a file I'm scanning), the
search reports a '_' in match group 1, even though that character lies outside
the filename that the pattern correctly matches.

Here's a code snippet to illustrate:
#============================================================================
import binascii, re

prefix = 'ams'
#...
pat = re.compile(prefix + r'(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

#...scan file...

#-------------------------
# bytes in problem string (note that this section is arbitrary and not part
# of the actual problem; it's just my attempt at converting the output of
# a hexdump file utility into a Python string so as to illustrate the problem
# in a self-contained test case):

# problem data in file:
#
# 000a 0004 0002 0020 414d 535f 6a75 6c00 ....... AMS_jul.
# 0000 0000 0000 0000 0000 0000 0000 0000 ................
# 0000 0000 000a 0004 003f 00d8 414d 5365 .........?..AMSe
# 7031 3031 692e 666c 7400 0000 0000 0000 p101i.flt.......

bytes = ('000a000400020020414d535f6a756c00'
         '00000000000000000000000000000000'
         '00000000000a0004003f00d8414d5365'
         '7031303169' '2e666c7400000000000000')

ascii = binascii.a2b_hex(bytes)
#-------------------------

m = pat.search(ascii)
print m.groups()
print m.span(0), m.span(1), m.span(2), m.span(3)

#output: ('_', 'ep', '101i')
# (44, 57) (11, 12) (47, 49) (49, 53)
#
# Note that the '_' reported at position 11 in "AMS_jul" is outside the
# range of the "real" matched string "AMSep101i.flt" at positions (44-57)!
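
One defensive way to cope with a stray group like this, whatever its cause, is to ignore any group whose span does not lie inside the overall match. The sketch below is only an illustration under assumptions: the helper `groups_within_match` and the variable names `hex_data` and `data` are hypothetical, and the code targets Python 3, where the data and the pattern are both bytes objects.

```python
import binascii
import re

pat = re.compile(br'ams(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

# Same 64 bytes as in the hexdump above, one 16-byte row per literal.
hex_data = (
    b'000a000400020020414d535f6a756c00'
    b'00000000000000000000000000000000'
    b'00000000000a0004003f00d8414d5365'
    b'7031303169' b'2e666c7400000000000000'
)
data = binascii.a2b_hex(hex_data)

def groups_within_match(m):
    """Hypothetical helper: like m.groups(), but any group whose span is
    unset or falls outside the overall match is reported as None."""
    lo, hi = m.span(0)
    result = []
    for i in range(1, m.re.groups + 1):
        start, end = m.span(i)
        inside = start != -1 and lo <= start and end <= hi
        result.append(m.group(i) if inside else None)
    return tuple(result)

m = pat.search(data)
print(m.span(0), groups_within_match(m))
```

With this filter, the '_' at position 11 is discarded because it precedes the match start at 44, while the 'ep' and version groups are kept.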

Jul 18 '05 #1