473,545 Members | 2,009 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Looking for a regexp generator based on a set of known string representative of a string set

Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea

Sep 8 '06 #1
9 1920
Ant

vb******@gmail. com wrote:
Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea
def getRegex(list_o f_strings):
return ".*"

;-)

Sep 8 '06 #2

vb******@gmail. com wrote:
I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression
(s1|s2|s3|s4|s5 )
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this

Sep 8 '06 #3
vb******@gmail. com:
I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine patterns, i.e. families of strings...).
This may be a very simple starting point:
>>import re
strings = ["foo", "bar", "$spam"]
strings2 = "|".join(re.esc ape(s) for s in strings)
strings2
'foo|bar|\\$spa m'
>>finds = re.compile(stri ngs2)
But I don't know how well this may work with many longer strings.
If that's not enoug we can think about more complex solutions.

Bye,
bearophile

Sep 8 '06 #4
"Andy Dingley" <di*****@codesm iths.comwrote in message
news:11******** **************@ i3g2000cwc.goog legroups.com...
>
vb******@gmail. com wrote:
>I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression

(s1|s2|s3|s4|s5 )
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this
It turns out this problem is a little trickier, especially when one of your
strings is a leading subset of another. For instance, let's say we are
looking for comparison operators, one of <, >, =, >=, <=, or !=. Simply
concatenating with intervening '|' characters gives this regexp:

"<|>|=|<=|>=|!= "

However, the leading '<' and '>' alternatives mask the later '<=', '<>', and
'>=' alternatives, and so the regexp never matches the longer options (I was
not able to find a greediness switch that would override this). So when
searching "a >= b" we get this:
>>re.findall("< |>|=|<=|>=|!=" , "a >= b")
['>', '=']

By moving the longer option to the front of the regexp, the longer option is
no longer masked by the shorter:
>>re.findall("> =|<|>|=|<=|!=", "a >= b")
['>=']
You also can't just concatenate input strings, since it is very likely they
will contain one of the magic re symbols ()[]?*./\+, etc. So re.escape
needs to be called to add the necessary '\'s.

Here is an extract from pyparsing's oneOf function that does something
similar, that handles the leading substring masking problem, and escapes the
input strings, before concatenating them to a valid regexp. Of course, the
simplest approach would be to just sort the symbols by descending length,
but you may have some a priori knowledge that 'X' is a very common match,
and want that option tested as early as possible. So this method only
reorders strings if there is a masking problem.
def createREfrom( symbols ): #symbols is a list of strings
isequal = ( lambda a,b: a == b )
masks = ( lambda a,b: b.startswith(a) )
i = 0
while i < len(symbols)-1:
cur = symbols[i]
for j,other in enumerate(symbo ls[i+1:]):
if ( isequal(other, cur) ):
del symbols[i+j+1]
break
elif ( masks(cur, other) ):
del symbols[i+j+1]
symbols.insert( i,other)
cur = other
break
else:
i += 1
return "|".join( [ re.escape(sym) for sym in symbols] )
>>print createREfrom(["ABC","ABCDEF", "ABCGHI"])
ABCDEF|ABCGHI|A BC
>>print createREfrom("< = <= >= != << ><<< >>>".split())
\>\=|\>\>\>|\>\ >|\>|\<\=|\<\<\ <|\<\<|\<|\=|\! \=
>>re.findall( createREfrom("< = <= >= != << ><<< >>>".split()) , "a <=
b")
['<=']
Note, this does not do any optimization, such as collapsing "ABCDEF|ABC GHI"
to "ABC(DEF|GH I)". I think there are some recipes in the Python cookbook
for such optimization.

-- Paul
Sep 8 '06 #5
vb******@gmail. com schrieb:
Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).
For matching the exact set, of course a concatenation can be used.
However, this is of limited use, because then simple string find will
suffice.

In general, this can't be done. It might be possible that the underlying
structure of the language isn't regular at all - for example, the simple
language of palindromes isn't.

And even if it was - the search space is just to large to explore. You
could generate a bazillion matching expressions, but .* will always
match - so how do you prevent the generator from converging towards that?

If all you have are those strings, you are better off trying to infer
some structure yourself.

Diez
Sep 8 '06 #6
vb******@gmail. com wrote:
Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea
I'm not sure your application, but Genomicists and Proteomicists have
found that Hidden Markov Models can be very powerful for developing
pattern models. Perhaps have a look at "Biological Sequence Analysis" by
Durbin et al.

Also, a very cool regex based algorithm was developed at IBM:

http://cbcsrv.watson.ibm.com/Tspd.html

But I think HMMs are the way to go. Check out HMMER at WUSTL by Sean
Eddy and colleagues:

http://hmmer.janelia.org/

http://selab.janelia.org/people/eddys/

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Sep 9 '06 #7

James Stroud a écrit :
vb******@gmail. com wrote:
Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea

I'm not sure your application, but Genomicists and Proteomicists have
found that Hidden Markov Models can be very powerful for developing
pattern models. Perhaps have a look at "Biological Sequence Analysis" by
Durbin et al.

Also, a very cool regex based algorithm was developed at IBM:

http://cbcsrv.watson.ibm.com/Tspd.html
Indeed, this seems cool! Thanks for the suggestion

I have tried their online Text-symbol Pattern Discovery
with these input values:

cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
>
But I think HMMs are the way to go. Check out HMMER at WUSTL by Sean
Eddy and colleagues:

http://hmmer.janelia.org/

http://selab.janelia.org/people/eddys/
I will look at that more precisely, but at my first look
it seems this is more specialized and less accessible
for the common mortal...
>
James
Thanks. This may help me.

In addition I continue to look for other ideas, notably
because I want code that I can change myself,
and exclusively python code

>
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Sep 9 '06 #8
vb******@gmail. com wrote:
I have tried their online Text-symbol Pattern Discovery
with these input values:

cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
Well, in the realm of sequence analysis, it is trivial to devise a regex
for these values because they are already aligned and of fixed length.
This is a couple of more lines than it needs to be, only so its easier
to follow the logic. This uses throw-away groups to avoid bracketed
sets, becuase you have dashes in your items. You might need to tweak the
following code if you have characters special to regex in your sequences
or if you want to use bracketed sets. The place to do this is in the
_joiner() function.
values = """
cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
""".split()

# python 2.5 has new list comp features to shorten this test, but
# the resulting list comp can begin to look ugly if the alternatives
# are complicated
def _joiner(c):
if len(c) == 1:
# will raise KeyError for empty column
return c.pop()
else:
return "(?:%s)" % '|'.join(c)

columns = [set(c) for c in zip(*values)]
col_strs = [_joiner(c) for c in columns]
rgx_str = "".join(col_str s)
exact_rgx_str = "^%s$" % rgx_str

# '(?:c|n)(?:p|s) (?:k|u|d)g-3(?:1|0)0(?:A|0 )(?:A|B|1|0|3|2 |6|8)'
print rgx_str
But, if you get much more complicated that this, you will definitely
want to check out hmmer, especially if you can align your sequences.

James
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Sep 9 '06 #9

"Paul McGuire" <pt***@austin.r r._bogus_.comwr ote in message
news:OT******** **********@torn ado.texas.rr.co m...
"Andy Dingley" <di*****@codesm iths.comwrote in message
news:11******** **************@ i3g2000cwc.goog legroups.com...
>>
vb******@gmail. com wrote:
>>I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression

(s1|s2|s3|s4|s5 )
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this

It turns out this problem is a little trickier, especially when one of
your strings is a leading subset of another. For instance, let's say we
are looking for comparison operators, one of <, >, =, >=, <=, or !=.
Simply concatenating with intervening '|' characters gives this regexp:

"<|>|=|<=|>=|!= "

However, the leading '<' and '>' alternatives mask the later '<=', '<>',
and '>=' alternatives, and so the regexp never matches the longer options
(I was not able to find a greediness switch that would override this). So
when searching "a >= b" we get this:
>>>re.findall(" <|>|=|<=|>=|!=" , "a >= b")
['>', '=']

By moving the longer option to the front of the regexp, the longer option
is no longer masked by the shorter:
>>>re.findall(" >=|<|>|=|<=|!=" , "a >= b")
['>=']
You also can't just concatenate input strings, since it is very likely
they will contain one of the magic re symbols ()[]?*./\+, etc. So
re.escape needs to be called to add the necessary '\'s.

Here is an extract from pyparsing's oneOf function that does something
similar, that handles the leading substring masking problem, and escapes
the input strings, before concatenating them to a valid regexp. Of
course, the simplest approach would be to just sort the symbols by
descending length, but you may have some a priori knowledge that 'X' is a
very common match, and want that option tested as early as possible. So
this method only reorders strings if there is a masking problem.
def createREfrom( symbols ): #symbols is a list of strings
isequal = ( lambda a,b: a == b )
masks = ( lambda a,b: b.startswith(a) )
i = 0
while i < len(symbols)-1:
cur = symbols[i]
for j,other in enumerate(symbo ls[i+1:]):
if ( isequal(other, cur) ):
del symbols[i+j+1]
break
elif ( masks(cur, other) ):
del symbols[i+j+1]
symbols.insert( i,other)
cur = other
break
else:
i += 1
return "|".join( [ re.escape(sym) for sym in symbols] )
>>>print createREfrom(["ABC","ABCDEF", "ABCGHI"])
ABCDEF|ABCGHI|A BC
>>>print createREfrom("< = <= >= != << ><<< >>>".split())
\>\=|\>\>\>|\>\ >|\>|\<\=|\<\<\ <|\<\<|\<|\=|\! \=
>>>re.findall ( createREfrom("< = <= >= != << ><<< >>>".split()) , "a <=
b")
['<=']
Note, this does not do any optimization, such as collapsing
"ABCDEF|ABC GHI" to "ABC(DEF|GH I)". I think there are some recipes in the
Python cookbook for such optimization.

-- Paul


Sep 15 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

19
2166
by: Magnus Lie Hetland | last post by:
I'm working on a project (Atox) where I need to match quite a few regular expressions (several hundred) in reasonably large text files. I've found that this can easily get rather slow. (There are many things that slow Atox down -- it hasn't been designed for speed, and any optimizations will entail quite a bit of refactoring.) I've tried to...
10
7664
by: Andrew DeFaria | last post by:
I was reading my O'Reilly JavaScript The Definitive Guide when I came across RegExp and thought I could tighten up my JavaScript code that checks for a valid email address. Why does the following not appear to work: var email_address = "Joe@Schmoe"; var email_regex = new RegExp ("^(\\w+)(\@)(\\w+)(\.)(\\w+)$"); var result = email_regex.exec...
8
2455
by: Aaron | last post by:
I need some help writing this function the function returns a list of random non-duplicate integers in a string, 1 parameter indicating the maximum range. string randGen(int maxRange) maxRange=1000 the output would be:
1
1116
by: Chris Dunaway | last post by:
I have a string: 555-1234,12345,"Jones, John" I want to split the string based on a comma as a delimiter. Since the name portion has a comma and is delimited with quotation marks, I need to ignore or strip them out. I also want to reverse the last and first names so that "Jones, John" becomes John Jones (no quotation marks or commas, last...
7
1899
by: arno | last post by:
Hi, I want to search a substring within a string : fonction (str, substr) { if (str.search(substr) != -1) { // do something } }
23
1560
by: walterbyrd | last post by:
Way back when, I got a lot of training and experience in highly structued software development. These days, I dabble with web-development, but I may become more serious. I consider php to be an abombination, the backward compatibility issues alone are reason enough to make me hate it. Rail looks promising, but it's difficult to find...
9
6839
by: =?ISO-8859-1?Q?BJ=F6rn_Lindqvist?= | last post by:
With regexps you can search for strings matching it. For example, given the regexp: "foobar\d\d\d". "foobar123" would match. I want to do the reverse, from a regexp generate all strings that could match it. The regexp: "{3}\d{3}" should generate the strings "AAA000", "AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999". Is this...
3
1479
by: Paddy | last post by:
Lets say i have a generator running that generates successive characters of a 'string' characters then I would have to 'freeze' the generator and pass the characters so far to re.search. It is expensive to create successive characters, but caching could be used for past characters. is it possible to wrap the generator in a class, possibly...
5
3629
by: George Sakkis | last post by:
Is there any package that parses regular expressions and returns an AST ? Something like: Regex('i ', Or('love', 'hate'), ' h', Or('is', 'er'), ' ', Or('cat', 'dog'), Optional('s'), ZeroOrMore(r'\s'), OneOrMore('!')) Given such a structure, I want to create a generator that can generate all strings matched by this regexp. Obviously if the...
0
7475
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main...
0
7664
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. ...
1
7437
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For...
0
5982
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then...
0
4958
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert...
0
3446
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
1900
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
1
1023
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
0
720
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.