Looking for a regexp generator based on a set of known string representative of a string set

vbfoobar

Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea

Sep 8 '06 #1

Subscribe Reply

1920

Ant

vb******@gmail. com wrote:

Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea

def getRegex(list_o f_strings):
return ".*"

;-)

Sep 8 '06 #2

Andy Dingley

vb******@gmail. com wrote:

I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression

(s1|s2|s3|s4|s5 )
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this

Sep 8 '06 #3

bearophileHUGS

vb******@gmail. com:

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine patterns, i.e. families of strings...).

This may be a very simple starting point:

>>import re
strings = ["foo", "bar", "$spam"]
strings2 = "|".join(re.esc ape(s) for s in strings)
strings2

'foo|bar|\\$spa m'

>>finds = re.compile(stri ngs2)

But I don't know how well this may work with many longer strings.
If that's not enoug we can think about more complex solutions.

Bye,
bearophile

Sep 8 '06 #4

Paul McGuire

"Andy Dingley" <di*****@codesm iths.comwrote in message
news:11******** **************@ i3g2000cwc.goog legroups.com...

>
vb******@gmail. com wrote:

>I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression

(s1|s2|s3|s4|s5 )
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this

It turns out this problem is a little trickier, especially when one of your
strings is a leading subset of another. For instance, let's say we are
looking for comparison operators, one of <, >, =, >=, <=, or !=. Simply
concatenating with intervening '|' characters gives this regexp:

"<|>|=|<=|>=|!= "

However, the leading '<' and '>' alternatives mask the later '<=', '<>', and
'>=' alternatives, and so the regexp never matches the longer options (I was
not able to find a greediness switch that would override this). So when
searching "a >= b" we get this:

>>re.findall("< |>|=|<=|>=|!=" , "a >= b")

['>', '=']

By moving the longer option to the front of the regexp, the longer option is
no longer masked by the shorter:

>>re.findall("> =|<|>|=|<=|!=", "a >= b")

['>=']
You also can't just concatenate input strings, since it is very likely they
will contain one of the magic re symbols ()[]?*./\+, etc. So re.escape
needs to be called to add the necessary '\'s.

Here is an extract from pyparsing's oneOf function that does something
similar, that handles the leading substring masking problem, and escapes the
input strings, before concatenating them to a valid regexp. Of course, the
simplest approach would be to just sort the symbols by descending length,
but you may have some a priori knowledge that 'X' is a very common match,
and want that option tested as early as possible. So this method only
reorders strings if there is a masking problem.
def createREfrom( symbols ): #symbols is a list of strings
isequal = ( lambda a,b: a == b )
masks = ( lambda a,b: b.startswith(a) )
i = 0
while i < len(symbols)-1:
cur = symbols[i]
for j,other in enumerate(symbo ls[i+1:]):
if ( isequal(other, cur) ):
del symbols[i+j+1]
break
elif ( masks(cur, other) ):
del symbols[i+j+1]
symbols.insert( i,other)
cur = other
break
else:
i += 1
return "|".join( [ re.escape(sym) for sym in symbols] )

>>print createREfrom(["ABC","ABCDEF", "ABCGHI"])

ABCDEF|ABCGHI|A BC

>>print createREfrom("< = <= >= != << ><<< >>>".split())

\>\=|\>\>\>|\>\ >|\>|\<\=|\<\<\ <|\<\<|\<|\=|\! \=

>>re.findall( createREfrom("< = <= >= != << ><<< >>>".split()) , "a <=
b")

['<=']
Note, this does not do any optimization, such as collapsing "ABCDEF|ABC GHI"
to "ABC(DEF|GH I)". I think there are some recipes in the Python cookbook
for such optimization.

-- Paul

Sep 8 '06 #5

Diez B. Roggisch

vb******@gmail. com schrieb:

Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

For matching the exact set, of course a concatenation can be used.
However, this is of limited use, because then simple string find will
suffice.

In general, this can't be done. It might be possible that the underlying
structure of the language isn't regular at all - for example, the simple
language of palindromes isn't.

And even if it was - the search space is just to large to explore. You
could generate a bazillion matching expressions, but .* will always
match - so how do you prevent the generator from converging towards that?

If all you have are those strings, you are better off trying to infer
some structure yourself.

Diez

Sep 8 '06 #6

James Stroud

vb******@gmail. com wrote:

Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea

I'm not sure your application, but Genomicists and Proteomicists have
found that Hidden Markov Models can be very powerful for developing
pattern models. Perhaps have a look at "Biological Sequence Analysis" by
Durbin et al.

Also, a very cool regex based algorithm was developed at IBM:

http://cbcsrv.watson.ibm.com/Tspd.html

But I think HMMs are the way to go. Check out HMMER at WUSTL by Sean
Eddy and colleagues:

http://hmmer.janelia.org/

http://selab.janelia.org/people/eddys/

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Sep 9 '06 #7

vbfoobar

James Stroud a écrit :

vb******@gmail. com wrote:
Hello

I am looking for python code that takes as input a list of strings
(most similar,
but not necessarily, and rather short: say not longer than 50 chars)
and that computes and outputs the python regular expression that
matches
these string values (not necessarily strictly, perhaps the code is able
to determine
patterns, i.e. families of strings...).

Thanks for any idea

I'm not sure your application, but Genomicists and Proteomicists have
found that Hidden Markov Models can be very powerful for developing
pattern models. Perhaps have a look at "Biological Sequence Analysis" by
Durbin et al.

Also, a very cool regex based algorithm was developed at IBM:

http://cbcsrv.watson.ibm.com/Tspd.html

Indeed, this seems cool! Thanks for the suggestion

I have tried their online Text-symbol Pattern Discovery
with these input values:

cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3

>
But I think HMMs are the way to go. Check out HMMER at WUSTL by Sean
Eddy and colleagues:

http://hmmer.janelia.org/

http://selab.janelia.org/people/eddys/

I will look at that more precisely, but at my first look
it seems this is more specialized and less accessible
for the common mortal...

>
James

Thanks. This may help me.

In addition I continue to look for other ideas, notably
because I want code that I can change myself,
and exclusively python code

>
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Sep 9 '06 #8

James Stroud

vb******@gmail. com wrote:

I have tried their online Text-symbol Pattern Discovery
with these input values:

cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3

Well, in the realm of sequence analysis, it is trivial to devise a regex
for these values because they are already aligned and of fixed length.
This is a couple of more lines than it needs to be, only so its easier
to follow the logic. This uses throw-away groups to avoid bracketed
sets, becuase you have dashes in your items. You might need to tweak the
following code if you have characters special to regex in your sequences
or if you want to use bracketed sets. The place to do this is in the
_joiner() function.
values = """
cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
""".split()

# python 2.5 has new list comp features to shorten this test, but
# the resulting list comp can begin to look ugly if the alternatives
# are complicated
def _joiner(c):
if len(c) == 1:
# will raise KeyError for empty column
return c.pop()
else:
return "(?:%s)" % '|'.join(c)

columns = [set(c) for c in zip(*values)]
col_strs = [_joiner(c) for c in columns]
rgx_str = "".join(col_str s)
exact_rgx_str = "^%s$" % rgx_str

# '(?:c|n)(?:p|s) (?:k|u|d)g-3(?:1|0)0(?:A|0 )(?:A|B|1|0|3|2 |6|8)'
print rgx_str
But, if you get much more complicated that this, you will definitely
want to check out hmmer, especially if you can align your sequences.

James
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Sep 9 '06 #9

Paul McGuire

"Paul McGuire" <pt***@austin.r r._bogus_.comwr ote in message
news:OT******** **********@torn ado.texas.rr.co m...

"Andy Dingley" <di*****@codesm iths.comwrote in message
news:11******** **************@ i3g2000cwc.goog legroups.com...
>>
vb******@gmail. com wrote:

>>I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression

(s1|s2|s3|s4|s5 )
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this

It turns out this problem is a little trickier, especially when one of
your strings is a leading subset of another. For instance, let's say we
are looking for comparison operators, one of <, >, =, >=, <=, or !=.
Simply concatenating with intervening '|' characters gives this regexp:

"<|>|=|<=|>=|!= "

However, the leading '<' and '>' alternatives mask the later '<=', '<>',
and '>=' alternatives, and so the regexp never matches the longer options
(I was not able to find a greediness switch that would override this). So
when searching "a >= b" we get this:

>>>re.findall(" <|>|=|<=|>=|!=" , "a >= b")

['>', '=']

By moving the longer option to the front of the regexp, the longer option
is no longer masked by the shorter:

>>>re.findall(" >=|<|>|=|<=|!=" , "a >= b")

['>=']
You also can't just concatenate input strings, since it is very likely
they will contain one of the magic re symbols ()[]?*./\+, etc. So
re.escape needs to be called to add the necessary '\'s.

Here is an extract from pyparsing's oneOf function that does something
similar, that handles the leading substring masking problem, and escapes
the input strings, before concatenating them to a valid regexp. Of
course, the simplest approach would be to just sort the symbols by
descending length, but you may have some a priori knowledge that 'X' is a
very common match, and want that option tested as early as possible. So
this method only reorders strings if there is a masking problem.
def createREfrom( symbols ): #symbols is a list of strings
isequal = ( lambda a,b: a == b )
masks = ( lambda a,b: b.startswith(a) )
i = 0
while i < len(symbols)-1:
cur = symbols[i]
for j,other in enumerate(symbo ls[i+1:]):
if ( isequal(other, cur) ):
del symbols[i+j+1]
break
elif ( masks(cur, other) ):
del symbols[i+j+1]
symbols.insert( i,other)
cur = other
break
else:
i += 1
return "|".join( [ re.escape(sym) for sym in symbols] )

>>>print createREfrom(["ABC","ABCDEF", "ABCGHI"])

ABCDEF|ABCGHI|A BC

>>>print createREfrom("< = <= >= != << ><<< >>>".split())

\>\=|\>\>\>|\>\ >|\>|\<\=|\<\<\ <|\<\<|\<|\=|\! \=

>>>re.findall ( createREfrom("< = <= >= != << ><<< >>>".split()) , "a <=
b")

['<=']
Note, this does not do any optimization, such as collapsing
"ABCDEF|ABC GHI" to "ABC(DEF|GH I)". I think there are some recipes in the
Python cookbook for such optimization.

-- Paul

Sep 15 '06 #10

Similar topics

2166

Regexp optimization question

by: Magnus Lie Hetland | last post by:

I'm working on a project (Atox) where I need to match quite a few regular expressions (several hundred) in reasonably large text files. I've found that this can easily get rather slow. (There are many things that slow Atox down -- it hasn't been designed for speed, and any optimizations will entail quite a bit of refactoring.) I've tried to...

Python

7664

RegExp to validate an email address

by: Andrew DeFaria | last post by:

I was reading my O'Reilly JavaScript The Definitive Guide when I came across RegExp and thought I could tighten up my JavaScript code that checks for a valid email address. Why does the following not appear to work: var email_address = "Joe@Schmoe"; var email_regex = new RegExp ("^(\\w+)(\@)(\\w+)(\.)(\\w+)$"); var result = email_regex.exec...

Javascript

2455

random number generator

by: Aaron | last post by:

I need some help writing this function the function returns a list of random non-duplicate integers in a string, 1 parameter indicating the maximum range. string randGen(int maxRange) maxRange=1000 the output would be:

.NET Framework

1116

Help with RegExp

by: Chris Dunaway | last post by:

I have a string: 555-1234,12345,"Jones, John" I want to split the string based on a comma as a delimiter. Since the name portion has a comma and is delimited with quotation marks, I need to ignore or strip them out. I also want to reverse the last and first names so that "Jones, John" becomes John Jones (no quotation marks or commas, last...

Visual Basic .NET

1899

RegExp and special caracters

by: arno | last post by:

Hi, I want to search a substring within a string : fonction (str, substr) { if (str.search(substr) != -1) { // do something } }

Javascript

1560

Looking for a language/framework

by: walterbyrd | last post by:

Way back when, I got a lot of training and experience in highly structued software development. These days, I dabble with web-development, but I may become more serious. I consider php to be an abombination, the backward compatibility issues alone are reason enough to make me hate it. Rail looks promising, but it's difficult to find...

Python

6839

Generating all permutations from a regexp

by: =?ISO-8859-1?Q?BJ=F6rn_Lindqvist?= | last post by:

With regexps you can search for strings matching it. For example, given the regexp: "foobar\d\d\d". "foobar123" would match. I want to do the reverse, from a regexp generate all strings that could match it. The regexp: "{3}\d{3}" should generate the strings "AAA000", "AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999". Is this...

Python

1479

regexp search on infinite string?

by: Paddy | last post by:

Lets say i have a generator running that generates successive characters of a 'string' characters then I would have to 'freeze' the generator and pass the characters so far to re.search. It is expensive to create successive characters, but caching could be used for past characters. is it possible to wrap the generator in a class, possibly...

Python

3629

Regexp parser and generator

by: George Sakkis | last post by:

Is there any package that parses regular expressions and returns an AST ? Something like: Regex('i ', Or('love', 'hate'), ' h', Or('is', 'er'), ' ', Or('cat', 'dog'), Optional('s'), ZeroOrMore(r'\s'), OneOrMore('!')) Given such a structure, I want to create a generator that can generate all strings matched by this regexp. Obviously if the...

Python

7475

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main...

General

7664

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. ...

C / C++

7437

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For...

Windows Server

5982

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then...

Career Advice

4958

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert...

C# / C Sharp

3446

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

1900

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

C# / C Sharp

1023

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP

720

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...

General