473,320 Members | 2,158 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,320 software developers and data experts.

Looking for a regexp generator based on a set of known string representative of a string set

Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea

Sep 8 '06 #1
9 1907
Ant

vb******@gmail.com wrote:
Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea
def getRegex(list_of_strings):
return ".*"

;-)

Sep 8 '06 #2

vb******@gmail.com wrote:
I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression
(s1|s2|s3|s4|s5)
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this

Sep 8 '06 #3
vb******@gmail.com:
I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine patterns, i.e. families of strings...).
This may be a very simple starting point:
>>import re
strings = ["foo", "bar", "$spam"]
strings2 = "|".join(re.escape(s) for s in strings)
strings2
'foo|bar|\\$spam'
>>finds = re.compile(strings2)
But I don't know how well this may work with many longer strings.
If that's not enoug we can think about more complex solutions.

Bye,
bearophile

Sep 8 '06 #4
"Andy Dingley" <di*****@codesmiths.comwrote in message
news:11**********************@i3g2000cwc.googlegro ups.com...
>
vb******@gmail.com wrote:
>I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression

(s1|s2|s3|s4|s5)
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this
It turns out this problem is a little trickier, especially when one of your
strings is a leading subset of another. For instance, let's say we are
looking for comparison operators, one of <, >, =, >=, <=, or !=. Simply
concatenating with intervening '|' characters gives this regexp:

"<|>|=|<=|>=|!="

However, the leading '<' and '>' alternatives mask the later '<=', '<>', and
'>=' alternatives, and so the regexp never matches the longer options (I was
not able to find a greediness switch that would override this). So when
searching "a >= b" we get this:
>>re.findall("<|>|=|<=|>=|!=", "a >= b")
['>', '=']

By moving the longer option to the front of the regexp, the longer option is
no longer masked by the shorter:
>>re.findall(">=|<|>|=|<=|!=", "a >= b")
['>=']
You also can't just concatenate input strings, since it is very likely they
will contain one of the magic re symbols ()[]?*./\+, etc. So re.escape
needs to be called to add the necessary '\'s.

Here is an extract from pyparsing's oneOf function that does something
similar, that handles the leading substring masking problem, and escapes the
input strings, before concatenating them to a valid regexp. Of course, the
simplest approach would be to just sort the symbols by descending length,
but you may have some a priori knowledge that 'X' is a very common match,
and want that option tested as early as possible. So this method only
reorders strings if there is a masking problem.
def createREfrom( symbols ): #symbols is a list of strings
isequal = ( lambda a,b: a == b )
masks = ( lambda a,b: b.startswith(a) )
i = 0
while i < len(symbols)-1:
cur = symbols[i]
for j,other in enumerate(symbols[i+1:]):
if ( isequal(other, cur) ):
del symbols[i+j+1]
break
elif ( masks(cur, other) ):
del symbols[i+j+1]
symbols.insert(i,other)
cur = other
break
else:
i += 1
return "|".join( [ re.escape(sym) for sym in symbols] )
>>print createREfrom(["ABC","ABCDEF","ABCGHI"])
ABCDEF|ABCGHI|ABC
>>print createREfrom("< = <= >= != << ><<< >>>".split())
\>\=|\>\>\>|\>\>|\>|\<\=|\<\<\<|\<\<|\<|\=|\!\=
>>re.findall( createREfrom("< = <= >= != << ><<< >>>".split()), "a <=
b")
['<=']
Note, this does not do any optimization, such as collapsing "ABCDEF|ABCGHI"
to "ABC(DEF|GHI)". I think there are some recipes in the Python cookbook
for such optimization.

-- Paul
Sep 8 '06 #5
vb******@gmail.com schrieb:
Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).
For matching the exact set, of course a concatenation can be used.
However, this is of limited use, because then simple string find will
suffice.

In general, this can't be done. It might be possible that the underlying
structure of the language isn't regular at all - for example, the simple
language of palindromes isn't.

And even if it was - the search space is just to large to explore. You
could generate a bazillion matching expressions, but .* will always
match - so how do you prevent the generator from converging towards that?

If all you have are those strings, you are better off trying to infer
some structure yourself.

Diez
Sep 8 '06 #6
vb******@gmail.com wrote:
Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea
I'm not sure your application, but Genomicists and Proteomicists have
found that Hidden Markov Models can be very powerful for developing
pattern models. Perhaps have a look at "Biological Sequence Analysis" by
Durbin et al.

Also, a very cool regex based algorithm was developed at IBM:

http://cbcsrv.watson.ibm.com/Tspd.html

But I think HMMs are the way to go. Check out HMMER at WUSTL by Sean
Eddy and colleagues:

http://hmmer.janelia.org/

http://selab.janelia.org/people/eddys/

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Sep 9 '06 #7

James Stroud a écrit :
vb******@gmail.com wrote:
Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea

I'm not sure your application, but Genomicists and Proteomicists have
found that Hidden Markov Models can be very powerful for developing
pattern models. Perhaps have a look at "Biological Sequence Analysis" by
Durbin et al.

Also, a very cool regex based algorithm was developed at IBM:

http://cbcsrv.watson.ibm.com/Tspd.html
Indeed, this seems cool! Thanks for the suggestion

I have tried their online Text-symbol Pattern Discovery
with these input values:

cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
>
But I think HMMs are the way to go. Check out HMMER at WUSTL by Sean
Eddy and colleagues:

http://hmmer.janelia.org/

http://selab.janelia.org/people/eddys/
I will look at that more precisely, but at my first look
it seems this is more specialized and less accessible
for the common mortal...
>
James
Thanks. This may help me.

In addition I continue to look for other ideas, notably
because I want code that I can change myself,
and exclusively python code

>
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Sep 9 '06 #8
vb******@gmail.com wrote:
I have tried their online Text-symbol Pattern Discovery
with these input values:

cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
Well, in the realm of sequence analysis, it is trivial to devise a regex
for these values because they are already aligned and of fixed length.
This is a couple of more lines than it needs to be, only so its easier
to follow the logic. This uses throw-away groups to avoid bracketed
sets, becuase you have dashes in your items. You might need to tweak the
following code if you have characters special to regex in your sequences
or if you want to use bracketed sets. The place to do this is in the
_joiner() function.
values = """
cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
""".split()

# python 2.5 has new list comp features to shorten this test, but
# the resulting list comp can begin to look ugly if the alternatives
# are complicated
def _joiner(c):
if len(c) == 1:
# will raise KeyError for empty column
return c.pop()
else:
return "(?:%s)" % '|'.join(c)

columns = [set(c) for c in zip(*values)]
col_strs = [_joiner(c) for c in columns]
rgx_str = "".join(col_strs)
exact_rgx_str = "^%s$" % rgx_str

# '(?:c|n)(?:p|s)(?:k|u|d)g-3(?:1|0)0(?:A|0)(?:A|B|1|0|3|2|6|8)'
print rgx_str
But, if you get much more complicated that this, you will definitely
want to check out hmmer, especially if you can align your sequences.

James
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Sep 9 '06 #9

"Paul McGuire" <pt***@austin.rr._bogus_.comwrote in message
news:OT******************@tornado.texas.rr.com...
"Andy Dingley" <di*****@codesmiths.comwrote in message
news:11**********************@i3g2000cwc.googlegro ups.com...
>>
vb******@gmail.com wrote:
>>I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression

(s1|s2|s3|s4|s5)
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this

It turns out this problem is a little trickier, especially when one of
your strings is a leading subset of another. For instance, let's say we
are looking for comparison operators, one of <, >, =, >=, <=, or !=.
Simply concatenating with intervening '|' characters gives this regexp:

"<|>|=|<=|>=|!="

However, the leading '<' and '>' alternatives mask the later '<=', '<>',
and '>=' alternatives, and so the regexp never matches the longer options
(I was not able to find a greediness switch that would override this). So
when searching "a >= b" we get this:
>>>re.findall("<|>|=|<=|>=|!=", "a >= b")
['>', '=']

By moving the longer option to the front of the regexp, the longer option
is no longer masked by the shorter:
>>>re.findall(">=|<|>|=|<=|!=", "a >= b")
['>=']
You also can't just concatenate input strings, since it is very likely
they will contain one of the magic re symbols ()[]?*./\+, etc. So
re.escape needs to be called to add the necessary '\'s.

Here is an extract from pyparsing's oneOf function that does something
similar, that handles the leading substring masking problem, and escapes
the input strings, before concatenating them to a valid regexp. Of
course, the simplest approach would be to just sort the symbols by
descending length, but you may have some a priori knowledge that 'X' is a
very common match, and want that option tested as early as possible. So
this method only reorders strings if there is a masking problem.
def createREfrom( symbols ): #symbols is a list of strings
isequal = ( lambda a,b: a == b )
masks = ( lambda a,b: b.startswith(a) )
i = 0
while i < len(symbols)-1:
cur = symbols[i]
for j,other in enumerate(symbols[i+1:]):
if ( isequal(other, cur) ):
del symbols[i+j+1]
break
elif ( masks(cur, other) ):
del symbols[i+j+1]
symbols.insert(i,other)
cur = other
break
else:
i += 1
return "|".join( [ re.escape(sym) for sym in symbols] )
>>>print createREfrom(["ABC","ABCDEF","ABCGHI"])
ABCDEF|ABCGHI|ABC
>>>print createREfrom("< = <= >= != << ><<< >>>".split())
\>\=|\>\>\>|\>\>|\>|\<\=|\<\<\<|\<\<|\<|\=|\!\=
>>>re.findall( createREfrom("< = <= >= != << ><<< >>>".split()), "a <=
b")
['<=']
Note, this does not do any optimization, such as collapsing
"ABCDEF|ABCGHI" to "ABC(DEF|GHI)". I think there are some recipes in the
Python cookbook for such optimization.

-- Paul


Sep 15 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

19
by: Magnus Lie Hetland | last post by:
I'm working on a project (Atox) where I need to match quite a few regular expressions (several hundred) in reasonably large text files. I've found that this can easily get rather slow. (There are...
10
by: Andrew DeFaria | last post by:
I was reading my O'Reilly JavaScript The Definitive Guide when I came across RegExp and thought I could tighten up my JavaScript code that checks for a valid email address. Why does the following...
8
by: Aaron | last post by:
I need some help writing this function the function returns a list of random non-duplicate integers in a string, 1 parameter indicating the maximum range. string randGen(int maxRange) ...
1
by: Chris Dunaway | last post by:
I have a string: 555-1234,12345,"Jones, John" I want to split the string based on a comma as a delimiter. Since the name portion has a comma and is delimited with quotation marks, I need to...
7
by: arno | last post by:
Hi, I want to search a substring within a string : fonction (str, substr) { if (str.search(substr) != -1) { // do something } }
23
by: walterbyrd | last post by:
Way back when, I got a lot of training and experience in highly structued software development. These days, I dabble with web-development, but I may become more serious. I consider php to be an...
9
by: =?ISO-8859-1?Q?BJ=F6rn_Lindqvist?= | last post by:
With regexps you can search for strings matching it. For example, given the regexp: "foobar\d\d\d". "foobar123" would match. I want to do the reverse, from a regexp generate all strings that could...
3
by: Paddy | last post by:
Lets say i have a generator running that generates successive characters of a 'string' characters then I would have to 'freeze' the generator and pass the characters so far to re.search. It is...
5
by: George Sakkis | last post by:
Is there any package that parses regular expressions and returns an AST ? Something like: Regex('i ', Or('love', 'hate'), ' h', Or('is', 'er'), ' ', Or('cat', 'dog'), Optional('s'),...
0
by: DolphinDB | last post by:
The formulas of 101 quantitative trading alphas used by WorldQuant were presented in the paper 101 Formulaic Alphas. However, some formulas are complex, leading to challenges in calculation. Take...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
0
by: jfyes | last post by:
As a hardware engineer, after seeing that CEIWEI recently released a new tool for Modbus RTU Over TCP/UDP filtering and monitoring, I actively went to its official website to take a look. It turned...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...
1
by: PapaRatzi | last post by:
Hello, I am teaching myself MS Access forms design and Visual Basic. I've created a table to capture a list of Top 30 singles and forms to capture new entries. The final step is a form (unbound)...
0
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.