
problem with regular expression?

I'm trying to scan a (binary) file for a string matching a particular
pattern, and am getting unexpected results. I don't know if this is a bug
or just my own misunderstanding of regular expressions.

The string I'm searching for is a "versioned file name" of the form
"AMS_epXXXx.flt", where 'XXX' is 1 to 3 digits, the 'x' is a letter 'a-z',
and the '_' and 'ep' are each optional. In other words, the following are
examples that match:

AMSep12a.flt
ams_ep101b.flt
ams_123z.flt
ams12z.flt

The regular expression pattern I'm using is:

prefix = 'ams'
pat = re.compile(prefix + r'(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

I'm using the parenthesized groups to conditionally process the match, i.e.,
if there is no '_' or 'ep' in the name, I still want the match but handle
it differently. In my pattern above, group 1 is the (_) group, group 2 is
the (ep) group, and group 3 is the "version string" group.
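As a sanity check on the pattern itself, here is a minimal Python 3 sketch that runs it against the four example names and reports which optional parts took part in the match. The helper name `describe` is hypothetical, and `fullmatch` is used only because each test string is a bare file name; none of this is from the original post.

```python
import re

prefix = 'ams'
pat = re.compile(prefix + r'(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

def describe(name):
    """Hypothetical helper: (has_underscore, has_ep, version) or None."""
    m = pat.fullmatch(name)
    if m is None:
        return None
    underscore, ep, version = m.groups()
    # A group that did not take part in the match is reported as None.
    return (underscore is not None, ep is not None, version)

for name in ('AMSep12a.flt', 'ams_ep101b.flt', 'ams_123z.flt', 'ams12z.flt'):
    print(name, describe(name))
```

All four names match, so the pattern does what the description says; the question is only about what group 1 reports when the '_' is absent.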

The problem I'm having is that the following string of bytes (hex data from
a file I'm scanning) returns a '_' in match group 1 even though it is
outside the filename pattern that is properly detected:

Here's a code snippet to illustrate:
#==============================================================================
import binascii, re

prefix = 'ams'
#...
pat = re.compile(prefix + r'(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

#...scan file...

#-------------------------
# bytes in problem string (note that this section is arbitrary and not part
# of the actual problem; it's just my attempt at converting the output of
# a hexdump file utility into a python string so as to illustrate the problem
# in a self-contained test case):

# problem data in file:
#
# 000a 0004 0002 0020 414d 535f 6a75 6c00  ....... AMS_jul.
# 0000 0000 0000 0000 0000 0000 0000 0000  ................
# 0000 0000 000a 0004 003f 00d8 414d 5365  .........?..AMSe
# 7031 3031 692e 666c 7400 0000 0000 0000  p101i.flt.......

# one string literal per hexdump row above
bytes = ('000a000400020020414d535f6a756c00'
         '00000000000000000000000000000000'
         '00000000000a0004003f00d8414d5365'
         '70313031692e666c7400000000000000')

ascii = binascii.a2b_hex(bytes)
#-------------------------

m = pat.search(ascii)
print m.groups()
print m.span(0), m.span(1), m.span(2), m.span(3)

#output: ('_', 'ep', '101i')
# (44, 57) (11, 12) (47, 49) (49, 53)
#
# Note that the '_' reported at position 11 in "AMS_jul" is outside the
# range of the "real" matched string "AMSep101i.flt" at positions (44-57)!
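For comparison, here is a sketch of the same test case in Python 3 syntax. Searching binary data there needs a bytes pattern (`rb'...'`), and `bytes.fromhex` stands in for `binascii.a2b_hex`. The helper `groups_inside_match` is a hypothetical addition, not part of the original post: it checks that every group that took part in the match lies inside the overall match span, which is what regular-expression semantics require.

```python
import re

# Same pattern as in the post, written as a bytes pattern for binary data.
pat = re.compile(rb'ams(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

# One string literal per hexdump row.
data = bytes.fromhex(
    '000a000400020020414d535f6a756c00'
    '00000000000000000000000000000000'
    '00000000000a0004003f00d8414d5365'
    '70313031692e666c7400000000000000'
)

def groups_inside_match(m):
    """Hypothetical check: participating groups must lie within span(0)."""
    start, end = m.span(0)
    for i in range(1, m.re.groups + 1):
        s, e = m.span(i)
        # span() is (-1, -1) for a group that did not take part in the match.
        if (s, e) != (-1, -1) and not (start <= s <= e <= end):
            return False
    return True

m = pat.search(data)
print(m.groups())
print(m.span(0), m.span(1), m.span(2), m.span(3))
print(groups_inside_match(m))
```

On a current CPython this should report group 1 as `None` with span `(-1, -1)`, because the attempt that consumed the '_' at offset 11 failed and the successful match at offset 44 contains no '_'. A group 1 span of (11, 12) alongside an overall span of (44, 57) therefore looks like capture state left over from the failed attempt rather than a flaw in the pattern, and a check like `groups_inside_match` is one way to detect it.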

Jul 18 '05 #1
