Request for Feedback; a module making it easier to use regular expressions.

I'm working on the 0.8 release of my 'rex' module, and would appreciate
feedback, suggestions, and criticism as I work towards finalizing the
API and feature sets. rex is a module intended to make regular expressions
easier to create and use (and in my experience as a regular expression
user, it makes them MUCH easier to create and use.)

I'm still working on formal documentation, and in any case, such
documentation isn't necessarily the easiest way to learn rex. So, I've
appended below a rex interactive session which was then annotated
with explanations of the features being used. I believe it presents a
reasonably good view of what rex can do. If you have time, please
read it, and then send your feedback via email. Unfortunately, I do not
currently have time to keep track of everything on comp.lang.python.

Ken McDonald

=============== =============== ===============

What follows is an illustration by example of how the 'rex' module works, for those already knowledgeable about regular expressions as used in Python's 're' (or a similar) regular expression package. It consists of a quick explanation of a rex feature, followed by an interactive demo of that feature. You need to understand a couple of quick points to understand rex and the demo.

1) To distinguish between standard regular expressions as constructed by hand and used with the 're' package, and regular expressions constructed by and used in 'rex', I'll call the former 'regexps', and the latter 'rexps'.

2) The Rexp class, of which every rexp is an instance, is simply a subclass of Python's regular string class, with some modified functionality (for example, the __add__ method has been changed to modify the action of the '+' operation), and many more operators and methods. I'm not sure this was the wisest thing to do, but it sure helps when trying to relate rexps to regexps; just construct a rexp interactively or in a program and print it, and in either case you'll see the underlying string that is passed to the 're' module functions and methods.

On to the tutorial.

'rex' is designed to have few public names, so the easiest way to use
it is to import all the names:
>>> from rex import *
The most basic rex function is PATTERN, which simply takes a string or strings, and produces a rexp which will match exactly the argument strings when used to match or search text. As mentioned above, what you see printed as the result of executing PATTERN is the string that will be (invisibly) passed to 're' as a regexp string.
>>> PATTERN("abc")
'abc'

If given more than one argument, PATTERN will concatenate them into a single rexp.
>>> PATTERN("abc", "d")
'abcd'

The other rex function which converts standard strings to rexps is CHARSET, which produces patterns which match a single character in searched text if that character is in a set of characters defined by the CHARSET operation. This is the equivalent of the regexp [...] notation. Every character in a string passed to CHARSET will end up in the resulting set of characters.
>>> CHARSET("ab")
'[ab]'

If CHARSET is passed more than one string, all characters in all arguments are included in the result rexp.
>>> CHARSET("ab", "cd")
'[abcd]'

If an argument to CHARSET is a two-tuple of characters, it is taken as indicating the range of characters between and including those two characters. This is the same as the regexp [a-z] type notation. For example, this defines a rexp matching any single consonant.
>>> CHARSET('bcd', 'fgh', ('j', 'n'), ('p', 't'), 'vwxz')
'[bcdfghj-np-tvwxz]'

When using CHARSET (or any other rexp operation), you do _not_ need to worry about escaping any characters which have special meanings in regexps; that is handled automatically. For example, in the following character set containing square brackets, a - sign, and a backslash, we have to escape the backslash only because it has a special meaning in normal Python strings. Raw strings often avoid this (though not here, since a raw string cannot end in a single backslash). The other three characters, which have special meaning in regexps, would have to be escaped if building this character set by hand.
>>> CHARSET('[]-\\')
'[\\[\\]\\-\\\\]'
The result above is what you'd need to type using re and regexps to directly define this character set. Think you can get it right the first time?
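For readers who want to see the escaping spelled out, here is a minimal sketch of a CHARSET-like helper written against the standard 're' module. It is not rex's implementation: the function name charset and its behaviour are assumptions modelled on the examples above, and it is written for Python 3.

```python
import re

def charset(*parts):
    """Build a regexp character class from strings and (low, high) range tuples.

    Hypothetical helper, modelled on the CHARSET examples in the post.
    """
    pieces = []
    for part in parts:
        if isinstance(part, tuple):
            # A two-tuple denotes an inclusive range such as j-n.
            low, high = part
            pieces.append(re.escape(low) + "-" + re.escape(high))
        else:
            # Every character of a plain string joins the set, escaped if needed.
            pieces.extend(re.escape(ch) for ch in part)
    return "[" + "".join(pieces) + "]"

consonants = charset("bcd", "fgh", ("j", "n"), ("p", "t"), "vwxz")
specials = charset("[]-\\")
print(consonants)
print(specials)
```

The point of the sketch is only that re.escape can do the tedious part; the caller never types a regexp backslash.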

CHARSET provides a number of useful attributes defining commonly used character sets. Some of these are defined using special sequences defined in regexp syntax, others are defined as standard character sets. In all cases, the common factor is that CHARSET attributes all define patterns matching a _single_ character. Here are a few examples:
>>> CHARSET.digit
'\\d'
>>> CHARSET.alphanum
'\\w'
>>> CHARSET.uspunctuation
'[~`!@#$%\\^&*()_\\-+={\\[}\\]|\\\\:;"\'<,>.?/]'

Character sets can be negated using the '~' operator. Here is a rexp which matches anything _except_ a digit.
>>> ~CHARSET(('0','9'))
'[^0-9]'

Remember from above that PATTERN constructs rexps out of literals, and also concatenates multiple arguments to form a rexp which matches if all of those arguments match in sequence. However, the arguments to PATTERN don't have to be just strings; they can be other rexps, which are concatenated correctly to produce a new rexp. The following expression produces a rexp which matches the string 'abc' followed by any of 'd', 'e', or 'f'.
>>> PATTERN("abc", CHARSET("def"))
'abc[def]'

Instead of passing multiple arguments to PATTERN to obtain concatenation, you can simply use the '+' operator, which has exactly the same effect, but in many circumstances may produce easier-to-read code. However, if '+' is used in this way, its left operand _must_ be a rexp; a plain string won't work.
>>> PATTERN("abc") + CHARSET("def")
'abc[def]'

To obtain an alternation rexp--one which matches if any one of several other rexps match--we use the ANYONEOF function. This is equivalent to the "|" character in regexp notation.
>>> ANYONEOF("a", "b", "c")
'a|b|c'

As with PATTERN and '+', the '|' operator may be used in place of ANYONEOF to obtain alternation. As usual, the left-hand operand must be a rexp:
>>> PATTERN("a") | "b" | "c"
'a|b|c'
Note in the above that only the _first_ operand needs to be a rexp; this is because the first and second operands combine to form a rexp, and that rexp then becomes the left operand for the second '|' operator.

Now we come to a very significant difference between regexps and rexps: the ability to combine smaller expressions into larger expressions. Below are two regexps, the first matching any one of 'a' or 'b' or 'c', the second matching 'd', 'e', or 'f'. It would be nice if there were an easy way to combine them to match strings of the form ('a' or 'b' or 'c') followed by ('d' or 'e' or 'f') using simple string addition:
>>> "a|b|c" + "d|e|f"
'a|b|cd|e|f'
Unfortunately, this produces a regexp which matches any one of 'a', 'b', 'cd', 'e', or 'f'. The simplest way I know of to achieve the desired result in this case is something like "("+"a|b|c"+")("+"d|e|f"+")". This is not exactly pretty, or easy to type. Something like this isn't necessary when dealing with all string literals as above, but what if the two operands were other regexps? Then you would have to type something like "("+X+")("+Y+")".

This is much clearer using rexps:
>>> PATTERN(ANYONEOF("a", "b", "c"), ANYONEOF("d", "e", "f"))
'(?:a|b|c)(?:d|e|f)'
or, shortening the expression using the '+' operator:
>>> ANYONEOF("a", "b", "c") + ANYONEOF("d", "e", "f")
'(?:a|b|c)(?:d|e|f)'
Note that when rexps are put together like this, the parentheses used for grouping are 'numberless' parentheses--they will not be considered when extracting match subresults using numbered groups. Since the insertion of these parentheses in the produced regexp is invisible to the rex user, this is exactly what is desired.
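The difference is easy to check with the standard 're' module alone. The short Python 3 sketch below uses only documented 're' calls; the variable names naive and grouped are just for illustration.

```python
import re

# Plain string addition glues 'c' and 'd' into one alternative.
naive = "a|b|c" + "d|e|f"
# Non-capturing groups keep the two alternations separate.
grouped = "(?:" + "a|b|c" + ")(?:" + "d|e|f" + ")"

print(bool(re.fullmatch(naive, "ad")))    # the naive pattern rejects 'ad'
print(bool(re.fullmatch(naive, "cd")))    # ...but accepts 'cd'
print(bool(re.fullmatch(grouped, "ad")))  # the grouped pattern accepts 'ad'
print(re.compile(grouped).groups)         # (?:...) adds no numbered groups
```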

Precedence works as you might expect, with '+' having higher precedence than '|' (though the example below is rather simple as an illustration of this).
>>> PATTERN("a") + "b" | "e" + "f"
'ab|ef'
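To make the operator behaviour concrete, here is a toy sketch of how a str subclass could provide '+' and '|' with automatic grouping. Everything in it (the class name MiniRexp, the is_alternation flag, the _grouped helper) is hypothetical and is not taken from rex's source; it only reproduces the outputs shown above, in Python 3.

```python
import re

class MiniRexp(str):
    """Toy stand-in for a rexp: a string that knows how to combine itself."""

    def __new__(cls, text, is_alternation=False):
        obj = super().__new__(cls, text)
        # Remember whether the top level of this pattern is an alternation.
        obj.is_alternation = is_alternation
        return obj

    @staticmethod
    def _coerce(other):
        # Plain strings are treated as literals and escaped.
        return other if isinstance(other, MiniRexp) else MiniRexp(re.escape(other))

    def _grouped(self):
        # Only alternations need protecting when they are concatenated.
        return "(?:%s)" % str(self) if self.is_alternation else str(self)

    def __add__(self, other):
        return MiniRexp(self._grouped() + self._coerce(other)._grouped())

    def __or__(self, other):
        return MiniRexp(str(self) + "|" + str(self._coerce(other)), is_alternation=True)

abc = MiniRexp("a") | "b" | "c"
def_ = MiniRexp("d") | "e" | "f"
print(abc + def_)
print(MiniRexp("a") + "b" | "e" + "f")
```

Python's own operator precedence does the rest: '+' binds tighter than '|', so no extra work is needed to get 'ab|ef' from the last line.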

To match a pattern 0 or more times, use the ZEROORMORE function. This is analogous to the regexp '*' character. Note that parentheses are inserted to ensure the function applies to all of what you pass in.
>>> ZEROORMORE("a")
'(?:a)*'
>>> ZEROORMORE("abc")
'(?:abc)*'

ONEORMORE matches a sequence of one or more rexps, and is like the "+" regexp operator.
>>> ONEORMORE("abc")
'(?:abc)+'

The short way of obtaining repetition, and of matching more limited repetitions of a pattern, is to use the "*" operator. This expression is the same as ZEROORMORE("abc"):
>>> PATTERN("abc")*0
'(?:abc)*'

...and this is the same as ONEORMORE("abc"):
>>> PATTERN("abc")*1
'(?:abc)+'

If a negative sign precedes the match number, it indicates that the resulting rexp should match _no more_ than that (positive) number of repetitions. This matches anywhere from 0 to 3 repetitions of "abc":
>>> PATTERN("abc")*-3
'(?:abc){0,3}'
Use a two-tuple to specify both a lower and an upper bound. This matches anywhere from 2 to 5 repetitions of "abc":
>>> PATTERN("abc")*(2,5)
'(?:abc){2,5}'

The OPTIONAL function indicates that the argument rexp is optional (the containing pattern will match whether or not the rexp produced by OPTIONAL matches).
>>> OPTIONAL("-")
'(?:\\-)?'
There is no shorthand form for OPTIONAL. However, the following is semantically identical, though it produces a different regexp:
>>> PATTERN("-")*(0,1)
'(?:\\-){0,1}'
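The repetition conventions above can be summarised as a small mapping from a count to a regexp suffix. The Python 3 function below is a hypothetical sketch, not rex code: the name repeat is invented, the pattern is assumed to be already escaped, and a count of 2 or more is treated as "at least that many" (inferred from the **3 example later in this post) and written with the open-ended {n,} form that 're' accepts.

```python
import re

def repeat(pattern, count):
    """Wrap `pattern` in a non-capturing group and apply a rex-style count."""
    if isinstance(count, tuple):
        low, high = count                 # (2, 5) -> {2,5}
        suffix = "{%d,%d}" % (low, high)
    elif count < 0:
        suffix = "{0,%d}" % -count        # -3 -> at most three
    elif count == 0:
        suffix = "*"                      # zero or more
    elif count == 1:
        suffix = "+"                      # one or more
    else:
        suffix = "{%d,}" % count          # assumption: n or more
    return "(?:%s)%s" % (pattern, suffix)

for count in (0, 1, -3, (2, 5)):
    print(repeat("abc", count))
```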

Let's look a bit more at how easy it is to combine rexps into more complex rexps. PATTERN provides an attribute defining a rexp which matches floating-point numbers (without exponent):
>>> PATTERN.float
'(?:\\+|\\-)?\\d+(?:\\.\\d*)?'

Using this to build a complex number matcher (assuming no whitespace) is trivial:
>>> PATTERN.float + ANYONEOF("+", "-") + PATTERN.float + "i"
'(?:\\+|\\-)?\\d+(?:\\.\\d*)?(?:\\+|\\-)(?:\\+|\\-)?\\d+(?:\\.\\d*)?i'
I think the rexp construct is a little easier to understand and modify than the produced regexp :-)

What if we want to extract the real and imaginary parts of any complex number we happen to match? To do this, we name the rexp subpatterns which match the numeric portions of the complex number, by indexing them with the desired name. This corresponds to re's named groups facility for regexps, using the (?P<name>...) notation.
>>> complexrexp = PATTERN.float['re'] + ANYONEOF("+", "-") + PATTERN.float['im'] + "i"
>>> complexresult = complexrexp.match("-3.14+2.17i")
By the way, here's the regexp resulting from the above rexp.
>>> complexrexp
'(?P<re>(?:\\+|\\-)?\\d+(?:\\.\\d*)?)(?:\\+|\\-)(?P<im>(?:\\+|\\-)?\\d+(?:\\.\\d*)?)i'
Would you really like to write it out by hand?

To extract what matched the named group, we simply index the match result:
>>> complexresult['re']
'-3.14'

I highly recommend using named groups when constructing rexps; it makes code more readable and less error-prone. However, if you do want to use a numbered group for some reason, use the group() method on an existing rexp:
>>> PATTERN.float.group() + ANYONEOF("+", "-") + PATTERN.float.group() + "i"
'((?:\\+|\\-)?\\d+(?:\\.\\d*)?)(?:\\+|\\-)((?:\\+|\\-)?\\d+(?:\\.\\d*)?)i'
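For comparison, this is roughly what the same extraction looks like when the generated regexps are used directly with 're'. The sketch is Python 3 and uses only standard 're' calls; FLOAT simply holds the float regexp printed earlier, and the other variable names are illustrative.

```python
import re

# The float regexp printed by PATTERN.float, as a raw string.
FLOAT = r"(?:\+|\-)?\d+(?:\.\d*)?"

# Named groups: the regexp printed for complexrexp.
named = re.compile(r"(?P<re>%s)(?:\+|\-)(?P<im>%s)i" % (FLOAT, FLOAT))
m = named.match("-3.14+2.17i")
print(m.group("re"), m.group("im"))

# Numbered groups: the regexp printed for the group() variant.
numbered = re.compile(r"(%s)(?:\+|\-)(%s)i" % (FLOAT, FLOAT))
print(numbered.match("-3.14+2.17i").groups())
```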

If a match fails, we get a MatchResult which evaluates to False when used as a boolean:
>>> complexresult = complexrexp.match("-3.14*2.17i")
>>> complexresult
<rex.MatchResult object at 0x6af50>
>>> bool(complexresult)
False

Attempting to extract a subgroup from a failed match raises a KeyError with an appropriate error message.
>>> complexresult['re']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/local/python/packages/rex/__init__.py", line 508, in __getitem__
    return self.get(key)
  File "/local/python/packages/rex/__init__.py", line 518, in get
    else: raise KeyError, "Invalid group index: "+ `key` + " (a failed match result only has one group, indexed by 0)."
KeyError: "Invalid group index: 're' (a failed match result only has one group, indexed by 0)."

However, extracting group 0 (which in a successful match always represents the entirety of the matched text) of a failed MatchResult still results in the entire string against which the match was attempted. This may seem pointless now, but will be very useful when we get into iterative searches using rexps.
>>> complexresult[0]
'-3.14*2.17i'

We can do some nice things with named groups which cannot be accomplished with standard regexps. The keys() method returns the names of all named subgroups which participated in a match (but does not return the names of subgroups which did _not_ participate in the match, such as a subgroup contained in a failed OPTIONAL rexp):
>>> complexresult.keys()
['re', 'im']

In addition, if a named group matches the _entire_ matched string, then the name of that group can be obtained with the 'getname' method. This is useful for determining which of a number of top-level alternative rexps matched.
>>> altrexp = PATTERN.float['number'] | PATTERN.word['symbol']
>>> altrexp.match("3.14").getname()
'number'
>>> altrexp.match("abc").getname()
'symbol'
Note that if more than one named group matches the entire matched substring, then getname() will return one of the appropriate names, but which one is not predictable.
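With plain 're', something similar can be approximated for top-level alternatives through the match object's lastgroup attribute. The Python 3 sketch below is an approximation, not a description of how getname() works: the definition of PATTERN.word is not shown in this post, so \w+ is assumed in its place, and the helper name which is made up.

```python
import re

FLOAT = r"(?:\+|\-)?\d+(?:\.\d*)?"
# Assumption: \w+ stands in for PATTERN.word, whose definition is not shown.
alt = re.compile(r"(?P<number>%s)|(?P<symbol>\w+)" % FLOAT)

def which(text):
    """Return the name of the top-level alternative matching all of `text`."""
    m = alt.fullmatch(text)
    # lastgroup is the name of the last capturing group that matched.
    return m.lastgroup if m else None

print(which("3.14"))
print(which("abc"))
```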

Lesser-used pattern matching facilities have not been neglected. Non-greedy repetition can be expressed in the same way as standard (greedy) repetition, by using the ** operator in place of *:
>>> PATTERN("a")**0
'(?:a)*?'
>>> PATTERN("a")**1
'(?:a)+?'
In this next example, the big number in the resulting regexp is sys.maxint. This is the closest I know how to get to expressing "three to infinity" in a regexp pattern.
>>> PATTERN("a")**3
'(?:a){3,2147483647}?'
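The effect of the trailing '?' is easiest to see by running the generated regexps through 're'. The Python 3 sketch below also uses the open-ended form {3,}, which 're' documents as "three or more" and which may be a way to avoid spelling out the large upper bound; whether rex could emit it is left open here.

```python
import re

text = "aaaaa"
greedy = re.match(r"(?:a)*", text).group()         # takes as much as possible
lazy = re.match(r"(?:a)*?", text).group()          # takes as little as possible
lazy_three = re.match(r"(?:a){3,}?", text).group() # at least three, lazily

print(repr(greedy), repr(lazy), repr(lazy_three))
```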

Lookahead and lookbehind assertions are supported with the '+' and '-' unary operators:
>>> +PATTERN("a")
'(?=a)'
>>> -PATTERN("a")
'(?<=a)'

Both types of assertions can be negated by prepending a tilde, as can be done with CHARSET rexps:
>>> ~-PATTERN("a")
'(?<!a)'
>>> ~+PATTERN("a")
'(?!a)'
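As a reminder of what the four generated forms do, here is a short Python 3 sketch using 're' directly. The sample strings and variable names are invented for illustration.

```python
import re

prices = "10 dollars, 20 euros"
tagged = "$10 and 20"

# (?=...)  lookahead: digits that are followed by ' dollars'
ahead = re.findall(r"\d+(?= dollars)", prices)
# (?!...)  negative lookahead: whole numbers not followed by ' dollars'
not_ahead = re.findall(r"\b\d+\b(?! dollars)", prices)
# (?<=...) lookbehind: digits that are preceded by '$'
behind = re.findall(r"(?<=\$)\d+", tagged)
# (?<!...) negative lookbehind: whole numbers not preceded by '$'
not_behind = re.findall(r"(?<!\$)\b\d+", tagged)

print(ahead, not_ahead, behind, not_behind)
```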

Any regular expression can be considered as denoting the set of all strings which it matches. (Or, for those who've taken a formal class on REs and finite automata, the set of strings which it "generates".) So, matching a piece of text against a regular expression is really the same thing as asking if that text is in the set of strings "generated" by the regular expression. Rexps provide a nice way of doing this using Python's "in" operator. The examples below ask if a couple of strings are in the set of strings consisting of a sequence of one or more 'a' characters:
>>> "aa" in PATTERN("a")*1
True
>>> "ab" in PATTERN("a")*1
False
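One way to get this behaviour in plain Python is to define __contains__ in terms of a whole-string match. The Python 3 sketch below is hypothetical: the class name Language is invented, and the whole-string semantics are inferred from the fact that "ab" is reported as not being in the set.

```python
import re

class Language:
    """Treat a regexp as the set of strings it matches in full."""

    def __init__(self, regexp):
        self._compiled = re.compile(regexp)

    def __contains__(self, text):
        # Membership requires the whole string to match, not just a prefix.
        return self._compiled.fullmatch(text) is not None

one_or_more_a = Language(r"(?:a)+")
print("aa" in one_or_more_a)
print("ab" in one_or_more_a)
```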

Searching text is done using a rexp's search() method. Let's find the string "cd" in the text "abcdef":
>>> searchresult = PATTERN("cd").search("abcdef")
We know the search succeeded by evaluating the MatchResult as a boolean...
>>> bool(searchresult)
True
...and we can easily extract the start and end positions of the matched string, and the string itself (which might be useful if the search rexp was not a literal):
>>> searchresult.start(0)
2
>>> searchresult.end(0)
4
>>> searchresult[0]
'cd'

Iterative searching--that is, searching for _all_ instances in a piece of text matched by a regular expression--can be a bit awkward when using regexps. It is very easy when using rexps. The example below uses the fact that __str__ in a MatchResult object is defined so that str(matchresult) returns the entire substring matched by the MatchResult; str() is mapped over the sequence of MatchResult instances generated by itersearch() to get a list of the matched substrings.
>>> map(str, CHARSET.digit.itersearch("ab0c9de7"))
['0', '9', '7']

'itersearch' is a generator function, which means that it only computes and returns MatchResult instances as they are requested by the enclosing loop. So, itersearch() can be used in a memory-efficient manner even on very large pieces of text.

'itersearch' can also be used more flexibly. It defines an optional parameter named 'matched' which defaults to True and indicates that only successful MatchResults should be returned. If we perform a search with this parameter set to False, then only _failing_ MatchResults will be returned...
>>> map(str, CHARSET.digit.itersearch("ab0c9de7", matched=False))
['ab', 'c', 'de']
...and if None is passed as the value of 'matched', then both successful and failed MatchResults will be returned:
>>> map(str, CHARSET.digit.itersearch("ab0c9de7", matched=None))
['ab', '0', 'c', '9', 'de', '7']
We can still determine which of these results are failures and which are successes by using the MatchResults as booleans:
>>> map(bool, CHARSET.digit.itersearch("ab0c9de7", matched=None))
[False, True, False, True, False, True]

This leads to a great little idiom for going through _all_ the text of a string, and processing each part as appropriate (the bit of Python code below is not part of the interactive session):

for result in myRexp.itersearch(myText, matched=None):
    if result: ...process the successful match...
    else: ...process the failed match...
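The same walk over matched and unmatched stretches can be sketched on top of re.finditer. This Python 3 generator is only an approximation of the behaviour described above: it yields plain (is_match, substring) pairs rather than MatchResult objects, it does not try to handle zero-length matches, and its name merely mirrors the rex method.

```python
import re

def itersearch(regexp, text, matched=True):
    """Yield (is_match, substring) pairs from left to right.

    matched=True  -> only matching substrings
    matched=False -> only the stretches between matches
    matched=None  -> both, in order
    """
    position = 0
    for m in re.finditer(regexp, text):
        if m.start() > position and matched is not True:
            yield (False, text[position:m.start()])
        if matched is not False:
            yield (True, m.group())
        position = m.end()
    if position < len(text) and matched is not True:
        # Trailing text after the last match.
        yield (False, text[position:])

for is_match, piece in itersearch(r"\d", "ab0c9de7", matched=None):
    print("digit" if is_match else "other", piece)
```

Because it is a generator, pieces are produced only as the loop asks for them, which is the memory property the post describes.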

Rexps also have a 'replace' method, to replace found text with other text. Let's replace all digits in a string with the word "DIGIT":
>>> CHARSET.digit.replace("DIGIT", "ab0c9de7")
'abDIGITcDIGITdeDIGIT'

More specific replacements can be achieved by passing in a dictionary as the replace argument. Any matched substring must have a key defined in the dictionary (else a KeyError will be raised), and is replaced with the value associated with that key:
>>> CHARSET.digit.replace({"9":"NINE", "0":"ZERO", "7":"SEVEN"}, "ab0c9de7")
'abZEROcNINEdeSEVEN'

For the ultimate in flexibility, we can pass in a function as the replace argument. Whenever a match is found, its MatchResult will be passed as an argument to the function, and the result of the function will be used as the replacement value. Here's an example which increments the integer interpretation of each digit in some text by 1.
>>> def incr(matchresult):
...     return str(1+int(matchresult[0]))
...
>>> CHARSET.digit.replace(incr, "ab0c9de7")
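For readers without rex at hand, the three kinds of replacement can be tried with re.sub, which accepts either a string or a function. The Python 3 sketch below uses only standard 're' calls; the dictionary case is expressed as a small lambda, and the variable names are illustrative.

```python
import re

text = "ab0c9de7"
digit = re.compile(r"\d")

# 1. Fixed replacement string.
fixed = digit.sub("DIGIT", text)

# 2. Dictionary lookup; a missing key raises KeyError, as described above.
words = {"9": "NINE", "0": "ZERO", "7": "SEVEN"}
looked_up = digit.sub(lambda m: words[m.group()], text)

# 3. Arbitrary function of the match object.
def incr(m):
    return str(1 + int(m.group()))

incremented = digit.sub(incr, text)

print(fixed, looked_up, incremented)
```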

Jul 18 '05 #1

Ken> rex is a module intended to make regular expressions easier to
Ken> create and use...

Have you checked out Ping's rxb module?


Jul 18 '05 #2
