473,385 Members | 1,907 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,385 software developers and data experts.

convert list of strings to set of regexes; convert list of strings to trie

Hello,

I need a function that converts a list into a set of regexes. Like so:

string_list = ["blaa", "blab", "raaa", "rabb"]

print string_list2regexes(string_list)
This should return something like:

["bla(a|b)", "ra(aa|bb)"]

I am aware of the fact that converting the list to a *trie* would almost do
the job. But I couldn't find anything about Python modules that produce tries.

Are there any modules that I could use? Or do I have to implement it myself?

Klaus
Jul 18 '05 #1
7 2928
[Klaus Neuner]
I need a function that converts a list into a set of regexes. Like so:
string_list = ["blaa", "blab", "raaa", "rabb"]
print string_list2regexes(string_list)
This should return something like:
["bla(a|b)", "ra(aa|bb)"]


I once wrote code for doing something very close, but not finding it,
I guess I deleted it; so presumably, it would not be that difficult
redoing from scratch. By the way, the regexp compiler in Python library
already turns, at least logically, `(blaa)|(blab)|(raaa)|(rabb)' into
`(bla(a|b))|(ra(aa|bb))' -- disregarding of course how groups are being
saved by parentheses. So, you may also peek in this direction.

--
François Pinard http://www.iro.umontreal.ca/~pinard
Jul 18 '05 #2
François Pinard <pi****@iro.umontreal.ca> wrote in message news:<ma*************************************@pyth on.org>...
By the way, the regexp compiler in Python library
already turns, at least logically, `(blaa)|(blab)|(raaa)|(rabb)' into
`(bla(a|b))|(ra(aa|bb))' -- disregarding of course how groups are being
saved by parentheses. So, you may also peek in this direction.


What do you mean exactly by "logically"? That Python translates

(1) (blaa)|(blab)|(raaa)|(rabb)

to something (logically equivalent to)

(2) (bla(a|b))|(ra(aa|bb))

before matching it? Does this mean that (1) (taken as normal regex) is always
exactly as fast as (2) (taken as normal regex)?

Klaus
Jul 18 '05 #3
Klaus Neuner wrote:
Hello,

I need a function that converts a list into a set of regexes. Like so:

string_list = ["blaa", "blab", "raaa", "rabb"]

print string_list2regexes(string_list)
This should return something like:

["bla(a|b)", "ra(aa|bb)"]
So, I guess you just want to isolate common prefixes.
I am aware of the fact that converting the list to a *trie* would almost do
the job. But I couldn't find anything about Python modules that produce tries.


Maybe
http://cvs.bioperl.org/cgi-bin/viewc...root=biopython
can be of some use for you, even if I didn't have a look at it.

--
Ciao,
Matteo
Jul 18 '05 #4
Klaus Neuner wrote:
What do you mean exactly by "logically"? That Python translates

(1) (blaa)|(blab)|(raaa)|(rabb)

to something (logically equivalent to)

(2) (bla(a|b))|(ra(aa|bb))

before matching it? Does this mean that (1) (taken as normal regex) is always
exactly as fast as (2) (taken as normal regex)?


Looks like (2) is a little bit more efficient in case of not matching
strings: edited from an interactive session,

In [22]: timer = timeit.Timer("re1.match('blorb')", "import re;re1 =
re.compile(r'(blaa)|(blab)|(raaa)|(rabb)')")

In [23]: timer.timeit()
Out[23]: 1.4223082065582275

In [24]: timer2 = timeit.Timer("re2.match('blorb')", "import re;re2 =
re.compile(r'(bla(a|b))|(ra(aa|bb))')")

In [25]: timer2.timeit()
Out[25]: 1.2273669242858887

In [28]: timer3 = timeit.Timer("re3.match('blab')", "import re;re3 =
re.compile(r'(bla(a|b))|(ra(aa|bb))')")

In [29]: timer3.timeit()
Out[29]: 1.4784541130065918

In [30]: timer4 = timeit.Timer("re4.match('blab')", "import re;re4 =
re.compile(r'(blaa)|(blab)|(raaa)|(rabb)')")

In [31]: timer4.timeit()
Out[31]: 1.4766991138458252

--
Ciao,
Matteo
Jul 18 '05 #5
[Klaus Neuner]
[François Pinard]
By the way, the regexp compiler in Python library already
turns, at least logically, `(blaa)|(blab)|(raaa)|(rabb)' into
`(bla(a|b))|(ra(aa|bb))' -- disregarding of course how groups
are being saved by parentheses. So, you may also peek in this
direction.

What do you mean exactly by "logically"? That Python translates (1) to
something (logically equivalent to) (2) before matching it? Does this
mean that (1) (taken as normal regex) is always exactly as fast as (2)
(taken as normal regex)?


This is what I understand, yes: it is not worth optimising regexps the
above way ourselves (as we sometimes do for Emacs Lisp regexps, say),
before giving them to the Python regexp compiler.

--
François Pinard http://www.iro.umontreal.ca/~pinard
Jul 18 '05 #6
Klaus Neuner <kl************@yahoo.de> wrote:
I need a function that converts a list into a set of regexes. Like so:

string_list = ["blaa", "blab", "raaa", "rabb"]

print string_list2regexes(string_list)
This should return something like:

["bla(a|b)", "ra(aa|bb)"]


The program below does exactly this... Run it like this

$ ./words-to-regexp.pl 5
Extracting 5 letter words
blaa
blab
raaa
rabb

(Giving the list of words to stdin)

It produced this regexp which matches just the above strings.

Re: bla[ab]|ra(?:aa|bb)

For your input. Eg find a regexp to match all the 1&2 letter words
in the dictionary...

../words-to-regexp.pl 2 < /usr/share/dict/words

Re: [aa]|a[cdghlmmnrssttuy]|[bb]|b[aeeikry]|[cc]|c[adfilmorssu]|[dd]|d[bor]|[ee]|e[dhmrssux]|[ff]|f[aemr]|[gg]|g[adeeos]|[hh]|h[aeefgiooz]|[ii]|i[dfnnorstt]|[jj]|j[or]|[kk]|k[crsw]|[ll]|l[aaeiorstu]|[mm]|m[abdeginorsstuy]|[nn]|n[abdeiopu]|[oo]|o[bfhknrswxz]|[pp]|p[aabdhimotu]|[qqrr]|r[abdeehnsux]|[ss]|s[bcehimnort]|[tt]|t[abchiilmosy]|[uu]|u[hprs]|[vv]|v[as]|[ww]|w[emu]|[xx]|xe|[yy]|y[abe]|[zz]|z[nr]

Yes its in perl, but its almost entirely regexps! I haven't got the
time to pythonise it at the moment - hope you enjoy a challenge ;-)

#!/usr/bin/perl -w
#
# The challenge - to write a function which given a list of words
# returns a regexp which will match those and only those words.

use strict;
my $LIMIT = shift || 3;
$|=1;

print "Extracting $LIMIT letter words\n";

my @list = ();
while (<>)
{
chomp;
next if $_ eq "";
push @list, lc($_) if length($_) <= $LIMIT;
}

print "Extracted ", scalar(@list), " words\n";

my ($re, $old_re) = ("", "1");

while ($re ne $old_re)
{
$old_re = $re;
print "-" x 60, "\n";
$re = word_list_to_regexp( @list );
check_word_list_to_regexp($re, \@list);
print "Length: ", length($re), "\n";
}
exit;

################################################## ##########
# Converts a list of words into a regexp which will
# match those words and those numbers only.
#
# It does this by constructing a regexp and then progressively
# simplifying it - recursively if necessary. It uses regexp's to
# transform the regexp of course! This is almost a general purpose
# regexp optimiser.
#
# We assume that the caller will bound the regexp with ^( and )$ or
# \W(?: and )\W or whatever takes their fancy
#
# Set $DEBUG to 1 if you want to print lots of info and check the
# regexp works after each transformation.
#
# Warning: code contains heavy regexps - lift with care ;-)
# Caution: Code may use exponential time and space ;-(
################################################## ##########

sub word_list_to_regexp
{
my (@list) = @_;
my $DEBUG = 1;

# The basic regexp with |'s on the start and end to make our life
# easier
# Should uniq here too...
$re = join("|", sort {
#length($a) <=> length($b) ||
$a cmp $b
} @list);

$re = "|$re|";

# Transform the regexp in stages, making sure at all time the
# regexp is correct if $DEBUG is set

check_word_list_to_regexp($re, \@list) if $DEBUG;

# 1) Concatenate all the single characters a|b|c into [abc]'s
$re =~ s{ \| ( \w (?: \| \w )+ ) (?= \| ) }
{
my ( $string ) = ( $1 );
print "string = '$string'\n" if $DEBUG;
"|[" . join("", split m{\|}, $string) . "]"
}gex;

check_word_list_to_regexp($re, \@list) if $DEBUG;

# 2) Find all the Xa|Xb|Xc and change to X(?:a|b|c)]
$re =~ s{ \| ( (\w+)(\w+) (?: \| \2\w+ )+ ) (?= \| ) }
{
my ( $string, $prefix ) = ( $1, $2 );
print "prefix = '$prefix', string = '$string'\n" if $DEBUG;
"|$prefix\(?:" . join("|", map { substr($_, length $prefix) } split m{\|}, $string) . ")"
}gex;

check_word_list_to_regexp($re, \@list) if $DEBUG;

# 3) Find all the aX|bX|cX and change to (a|b|c)X]
$re =~ s{ \| ( (\w+?)(.+) (?: \| \w+\3 )+ ) (?= \| ) }
{
my ( $string, $postfix ) = ( $1, $3 );
print "postfix = '$postfix', string = '$string'\n" if $DEBUG;
$string =~ s{ \Q$postfix\E (?= \| | $ ) }{}gx;
print "...string = '$string'\n" if $DEBUG;
"|(?:$string)$postfix"
}gex;

check_word_list_to_regexp($re, \@list) if $DEBUG;

# 4) Change (?:a|b|c) into [abc]
$re =~ s{ \(\?\: ( \w (?: \| \w )+ ) \) }
{
my ( $string ) = ( $1 );
print "string = '$string'\n" if $DEBUG;
"[" . join("", split m{\|}, $string) . "]"
}gex;

check_word_list_to_regexp($re, \@list) if $DEBUG;

# 5) Optimise [abc] into [a-c] or \d
# This doesn't optimise all the cases only the complete continuous
# range in the [ ... ]
# $re =~ s{ \[ ( \w{3,} ) \] }
# {
# my ( $string, $start, $end ) = ( $1, substr($1, 0, 1), substr($1, -1, 1) );
# print "match ['$string']...range [$start-$end]\n" if $DEBUG;
# if ($end - $start + 1 == length $string)
# {
# $start == 0 && $end == 9 ? '\d' : "[$start-$end]";
# }
# else
# {
# "[$string]";
# }
# }gex;

check_word_list_to_regexp($re, \@list) if $DEBUG;

my $re_length;

do
{
$re_length = length($re);

# 6) recurse on any sequences left (?:ab|cd|ef)
$re =~ s{ \(\?\: ( \w+ (?: \| \w+ )+ ) \) }
{
my ( $string ) = ( $1 );
if (length($string) < length($re) - 4)
{
print "**** Recursing on '$string'\n" if $DEBUG;
"(?:" . word_list_to_regexp(split m{\|}, $string) . ")";
}
else
{
"(?:$string)";
}
}gex;

# 6a) recurse on any sequences left |ab|cd|ef|
$re =~ s{ \| ( \w+ (?: \| \w+ )+ ) \| }
{
my ( $string ) = ( $1 );
if (length($string) < length($re) - 2)
{
print "**** Recursing on '$string'\n" if $DEBUG;
"|" . word_list_to_regexp(split m{\|}, $string) . "|";
}
else
{
"|$string|";
}
}gex;
}
until (length($re) == $re_length);

check_word_list_to_regexp($re, \@list) if $DEBUG;

# 7) fix the | on each end
$re =~ s{^\|}{};
$re =~ s{\|$}{};

print "**** Returning '$re'\n" if $DEBUG;

return $re;
}

################################################## ##########
# Test subroutine to check the regexp performs as advertised
#
# Call with a regexp and a reference to a list of numbers
# it will check that the regexp matches all the list and
# doesn't match some others (obviously it can't check them
# all can it!) die-ing on any failures.
################################################## ##########

sub check_word_list_to_regexp
{
my ($re, $list) = @_;
my %list = map { $_ => 1 } @$list;
print "Re: $re\n";

# Put some other test cases in
$list{$_} += 0 for (0..999);
$list{int(rand()*1000)} += 0 for (0..99);
$list{int(rand()*10000)} += 0 for (0..99);
$list{int(rand()*100000)} += 0 for (0..99);

# print join(", ", map {"$_ => $list{$_}"} keys %list), "\n";
$re =~ s{^\|}{}; # fix | on start and end
$re =~ s{\|$}{};
$re = "^(?:$re)\$"; # put in ^(?: ... )$
$re = qr{$re}; # compile the regexp for speed

# Check all the keys in list against the regexp - some should pass
# and some should fail
for my $item (keys %list)
{
if ($list{$item} xor ($item =~ /$re/))
{
die "*** FAILED '$re' for '$item' ShouldMatch: $list{$item}\n";
}
else
{
# print "OK '$re' for '$item'\n";
}
}
}

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #7
Thanks to all who participated in this thread.

Klaus
Jul 18 '05 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: Joseph | last post by:
Hi all, I want to build a compressed suffix trie from a string for string matching.instead of doing like: input:a string with 10 chars insert char array 10 insert char array 9,10 insert...
4
by: Ney André de Mello Zunino | last post by:
Hello. Given a sorted collection of strings, what would a good (the best?) strategy be to allow fast access to an item, based on a search substring which should match the beginning of the...
7
by: drewnoakes | last post by:
I have an application that performs custom deserialisation of object state from byte arrays. This happens very regularly, so needs to be fast. In addition, most of the strings repeat, meaning I'm...
4
by: aevans1108 | last post by:
expanding this message to microsoft.public.dotnet.xml Greetings Please direct me to the right group if this is an inappropriate place to post this question. Thanks. I want to format a...
0
by: Polar | last post by:
Hi! Do you know a C library for trie data structures (a particular type of tree)? And some link about trie with C language? Thank you Polar
3
by: Paul | last post by:
Does anyone have an example of a trie data structure implemented in c#? I'm looking for a string trie to use to hold the word list for my spell checker. Any examples or suggestions would be great....
19
by: jason_box | last post by:
I'm alittle new at C and I'm trying to write a simple program that will record the frequency of words and just print it out. It is suppose to take stdin and I heard it's only a few lines but I'm...
9
by: Thomas Ploch | last post by:
Hello fellow pythonists, I have a question concerning posting code on this list. I want to post source code of a module, which is a homework for university (yes yes, I know, please read...
5
by: iaminsik | last post by:
I implemented a TRIE in memory by myself in C. But, as you know, when a size of data increases, loading time of data becomes longer. I want to implement a TRIE in file, which means division of...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.