473,703 Members | 3,300 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

convert list of strings to set of regexes; convert list of strings to trie

Hello,

I need a function that converts a list into a set of regexes. Like so:

string_list = ["blaa", "blab", "raaa", "rabb"]

print string_list2reg exes(string_lis t)
This should return something like:

["bla(a|b)", "ra(aa|bb)"]

I am aware of the fact that converting the list to a *trie* would almost do
the job. But I couldn't find anything about Python modules that produce tries.

Are there any modules that I could use? Or do I have to implement it myself?

Klaus
Jul 18 '05 #1
7 2961
[Klaus Neuner]
I need a function that converts a list into a set of regexes. Like so:
string_list = ["blaa", "blab", "raaa", "rabb"]
print string_list2reg exes(string_lis t)
This should return something like:
["bla(a|b)", "ra(aa|bb)"]


I once wrote code for doing something very close, but not finding it,
I guess I deleted it; so presumably, it would not be that difficult
redoing from scratch. By the way, the regexp compiler in Python library
already turns, at least logically, `(blaa)|(blab)| (raaa)|(rabb)' into
`(bla(a|b))|(ra (aa|bb))' -- disregarding of course how groups are being
saved by parentheses. So, you may also peek in this direction.

--
François Pinard http://www.iro.umontreal.ca/~pinard
Jul 18 '05 #2
François Pinard <pi****@iro.umo ntreal.ca> wrote in message news:<ma******* *************** *************** @python.org>...
By the way, the regexp compiler in Python library
already turns, at least logically, `(blaa)|(blab)| (raaa)|(rabb)' into
`(bla(a|b))|(ra (aa|bb))' -- disregarding of course how groups are being
saved by parentheses. So, you may also peek in this direction.


What do you mean exactly by "logically" ? That Python translates

(1) (blaa)|(blab)|( raaa)|(rabb)

to something (logically equivalent to)

(2) (bla(a|b))|(ra( aa|bb))

before matching it? Does this mean that (1) (taken as normal regex) is always
exactly as fast as (2) (taken as normal regex)?

Klaus
Jul 18 '05 #3
Klaus Neuner wrote:
Hello,

I need a function that converts a list into a set of regexes. Like so:

string_list = ["blaa", "blab", "raaa", "rabb"]

print string_list2reg exes(string_lis t)
This should return something like:

["bla(a|b)", "ra(aa|bb)"]
So, I guess you just want to isolate common prefixes.
I am aware of the fact that converting the list to a *trie* would almost do
the job. But I couldn't find anything about Python modules that produce tries.


Maybe
http://cvs.bioperl.org/cgi-bin/viewc...root=biopython
can be of some use for you, even if I didn't have a look at it.

--
Ciao,
Matteo
Jul 18 '05 #4
Klaus Neuner wrote:
What do you mean exactly by "logically" ? That Python translates

(1) (blaa)|(blab)|( raaa)|(rabb)

to something (logically equivalent to)

(2) (bla(a|b))|(ra( aa|bb))

before matching it? Does this mean that (1) (taken as normal regex) is always
exactly as fast as (2) (taken as normal regex)?


Looks like (2) is a little bit more efficient in case of not matching
strings: edited from an interactive session,

In [22]: timer = timeit.Timer("r e1.match('blorb ')", "import re;re1 =
re.compile(r'(b laa)|(blab)|(ra aa)|(rabb)')")

In [23]: timer.timeit()
Out[23]: 1.4223082065582 275

In [24]: timer2 = timeit.Timer("r e2.match('blorb ')", "import re;re2 =
re.compile(r'(b la(a|b))|(ra(aa |bb))')")

In [25]: timer2.timeit()
Out[25]: 1.2273669242858 887

In [28]: timer3 = timeit.Timer("r e3.match('blab' )", "import re;re3 =
re.compile(r'(b la(a|b))|(ra(aa |bb))')")

In [29]: timer3.timeit()
Out[29]: 1.4784541130065 918

In [30]: timer4 = timeit.Timer("r e4.match('blab' )", "import re;re4 =
re.compile(r'(b laa)|(blab)|(ra aa)|(rabb)')")

In [31]: timer4.timeit()
Out[31]: 1.4766991138458 252

--
Ciao,
Matteo
Jul 18 '05 #5
[Klaus Neuner]
[François Pinard]
By the way, the regexp compiler in Python library already
turns, at least logically, `(blaa)|(blab)| (raaa)|(rabb)' into
`(bla(a|b))|(ra (aa|bb))' -- disregarding of course how groups
are being saved by parentheses. So, you may also peek in this
direction.

What do you mean exactly by "logically" ? That Python translates (1) to
something (logically equivalent to) (2) before matching it? Does this
mean that (1) (taken as normal regex) is always exactly as fast as (2)
(taken as normal regex)?


This is what I understand, yes: it is not worth optimising regexps the
above way ourselves (as we sometimes do for Emacs Lisp regexps, say),
before giving them to the Python regexp compiler.

--
François Pinard http://www.iro.umontreal.ca/~pinard
Jul 18 '05 #6
Klaus Neuner <kl************ @yahoo.de> wrote:
I need a function that converts a list into a set of regexes. Like so:

string_list = ["blaa", "blab", "raaa", "rabb"]

print string_list2reg exes(string_lis t)
This should return something like:

["bla(a|b)", "ra(aa|bb)"]


The program below does exactly this... Run it like this

$ ./words-to-regexp.pl 5
Extracting 5 letter words
blaa
blab
raaa
rabb

(Giving the list of words to stdin)

It produced this regexp which matches just the above strings.

Re: bla[ab]|ra(?:aa|bb)

For your input. Eg find a regexp to match all the 1&2 letter words
in the dictionary...

../words-to-regexp.pl 2 < /usr/share/dict/words

Re: [aa]|a[cdghlmmnrssttuy]|[bb]|b[aeeikry]|[cc]|c[adfilmorssu]|[dd]|d[bor]|[ee]|e[dhmrssux]|[ff]|f[aemr]|[gg]|g[adeeos]|[hh]|h[aeefgiooz]|[ii]|i[dfnnorstt]|[jj]|j[or]|[kk]|k[crsw]|[ll]|l[aaeiorstu]|[mm]|m[abdeginorsstuy]|[nn]|n[abdeiopu]|[oo]|o[bfhknrswxz]|[pp]|p[aabdhimotu]|[qqrr]|r[abdeehnsux]|[ss]|s[bcehimnort]|[tt]|t[abchiilmosy]|[uu]|u[hprs]|[vv]|v[as]|[ww]|w[emu]|[xx]|xe|[yy]|y[abe]|[zz]|z[nr]

Yes its in perl, but its almost entirely regexps! I haven't got the
time to pythonise it at the moment - hope you enjoy a challenge ;-)

#!/usr/bin/perl -w
#
# The challenge - to write a function which given a list of words
# returns a regexp which will match those and only those words.

use strict;
my $LIMIT = shift || 3;
$|=1;

print "Extracting $LIMIT letter words\n";

my @list = ();
while (<>)
{
chomp;
next if $_ eq "";
push @list, lc($_) if length($_) <= $LIMIT;
}

print "Extracted ", scalar(@list), " words\n";

my ($re, $old_re) = ("", "1");

while ($re ne $old_re)
{
$old_re = $re;
print "-" x 60, "\n";
$re = word_list_to_re gexp( @list );
check_word_list _to_regexp($re, \@list);
print "Length: ", length($re), "\n";
}
exit;

############### ############### ############### ###############
# Converts a list of words into a regexp which will
# match those words and those numbers only.
#
# It does this by constructing a regexp and then progressively
# simplifying it - recursively if necessary. It uses regexp's to
# transform the regexp of course! This is almost a general purpose
# regexp optimiser.
#
# We assume that the caller will bound the regexp with ^( and )$ or
# \W(?: and )\W or whatever takes their fancy
#
# Set $DEBUG to 1 if you want to print lots of info and check the
# regexp works after each transformation.
#
# Warning: code contains heavy regexps - lift with care ;-)
# Caution: Code may use exponential time and space ;-(
############### ############### ############### ###############

sub word_list_to_re gexp
{
my (@list) = @_;
my $DEBUG = 1;

# The basic regexp with |'s on the start and end to make our life
# easier
# Should uniq here too...
$re = join("|", sort {
#length($a) <=> length($b) ||
$a cmp $b
} @list);

$re = "|$re|";

# Transform the regexp in stages, making sure at all time the
# regexp is correct if $DEBUG is set

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 1) Concatenate all the single characters a|b|c into [abc]'s
$re =~ s{ \| ( \w (?: \| \w )+ ) (?= \| ) }
{
my ( $string ) = ( $1 );
print "string = '$string'\n" if $DEBUG;
"|[" . join("", split m{\|}, $string) . "]"
}gex;

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 2) Find all the Xa|Xb|Xc and change to X(?:a|b|c)]
$re =~ s{ \| ( (\w+)(\w+) (?: \| \2\w+ )+ ) (?= \| ) }
{
my ( $string, $prefix ) = ( $1, $2 );
print "prefix = '$prefix', string = '$string'\n" if $DEBUG;
"|$prefix\( ?:" . join("|", map { substr($_, length $prefix) } split m{\|}, $string) . ")"
}gex;

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 3) Find all the aX|bX|cX and change to (a|b|c)X]
$re =~ s{ \| ( (\w+?)(.+) (?: \| \w+\3 )+ ) (?= \| ) }
{
my ( $string, $postfix ) = ( $1, $3 );
print "postfix = '$postfix', string = '$string'\n" if $DEBUG;
$string =~ s{ \Q$postfix\E (?= \| | $ ) }{}gx;
print "...string = '$string'\n" if $DEBUG;
"|(?:$string)$p ostfix"
}gex;

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 4) Change (?:a|b|c) into [abc]
$re =~ s{ \(\?\: ( \w (?: \| \w )+ ) \) }
{
my ( $string ) = ( $1 );
print "string = '$string'\n" if $DEBUG;
"[" . join("", split m{\|}, $string) . "]"
}gex;

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 5) Optimise [abc] into [a-c] or \d
# This doesn't optimise all the cases only the complete continuous
# range in the [ ... ]
# $re =~ s{ \[ ( \w{3,} ) \] }
# {
# my ( $string, $start, $end ) = ( $1, substr($1, 0, 1), substr($1, -1, 1) );
# print "match ['$string']...range [$start-$end]\n" if $DEBUG;
# if ($end - $start + 1 == length $string)
# {
# $start == 0 && $end == 9 ? '\d' : "[$start-$end]";
# }
# else
# {
# "[$string]";
# }
# }gex;

check_word_list _to_regexp($re, \@list) if $DEBUG;

my $re_length;

do
{
$re_length = length($re);

# 6) recurse on any sequences left (?:ab|cd|ef)
$re =~ s{ \(\?\: ( \w+ (?: \| \w+ )+ ) \) }
{
my ( $string ) = ( $1 );
if (length($string ) < length($re) - 4)
{
print "**** Recursing on '$string'\n" if $DEBUG;
"(?:" . word_list_to_re gexp(split m{\|}, $string) . ")";
}
else
{
"(?:$string )";
}
}gex;

# 6a) recurse on any sequences left |ab|cd|ef|
$re =~ s{ \| ( \w+ (?: \| \w+ )+ ) \| }
{
my ( $string ) = ( $1 );
if (length($string ) < length($re) - 2)
{
print "**** Recursing on '$string'\n" if $DEBUG;
"|" . word_list_to_re gexp(split m{\|}, $string) . "|";
}
else
{
"|$string|" ;
}
}gex;
}
until (length($re) == $re_length);

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 7) fix the | on each end
$re =~ s{^\|}{};
$re =~ s{\|$}{};

print "**** Returning '$re'\n" if $DEBUG;

return $re;
}

############### ############### ############### ###############
# Test subroutine to check the regexp performs as advertised
#
# Call with a regexp and a reference to a list of numbers
# it will check that the regexp matches all the list and
# doesn't match some others (obviously it can't check them
# all can it!) die-ing on any failures.
############### ############### ############### ###############

sub check_word_list _to_regexp
{
my ($re, $list) = @_;
my %list = map { $_ => 1 } @$list;
print "Re: $re\n";

# Put some other test cases in
$list{$_} += 0 for (0..999);
$list{int(rand( )*1000)} += 0 for (0..99);
$list{int(rand( )*10000)} += 0 for (0..99);
$list{int(rand( )*100000)} += 0 for (0..99);

# print join(", ", map {"$_ => $list{$_}"} keys %list), "\n";
$re =~ s{^\|}{}; # fix | on start and end
$re =~ s{\|$}{};
$re = "^(?:$re)\$ "; # put in ^(?: ... )$
$re = qr{$re}; # compile the regexp for speed

# Check all the keys in list against the regexp - some should pass
# and some should fail
for my $item (keys %list)
{
if ($list{$item} xor ($item =~ /$re/))
{
die "*** FAILED '$re' for '$item' ShouldMatch: $list{$item}\n" ;
}
else
{
# print "OK '$re' for '$item'\n";
}
}
}

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #7
Thanks to all who participated in this thread.

Klaus
Jul 18 '05 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
4372
by: Joseph | last post by:
Hi all, I want to build a compressed suffix trie from a string for string matching.instead of doing like: input:a string with 10 chars insert char array 10 insert char array 9,10 insert char array 8,9,10
4
1658
by: Ney André de Mello Zunino | last post by:
Hello. Given a sorted collection of strings, what would a good (the best?) strategy be to allow fast access to an item, based on a search substring which should match the beginning of the searched item? The actual goal is to implement a functionality similar to that found in help indices, where one can locate an item by gradually typing its initial characters. I expect that some kind of tree structure be present in the solution, but I...
7
1744
by: drewnoakes | last post by:
I have an application that performs custom deserialisation of object state from byte arrays. This happens very regularly, so needs to be fast. In addition, most of the strings repeat, meaning I'm deserialising the same sequence of bytes repeatedly, giving the same output string. Let's ignore the text encoding method, as it's not relevant to my question. Right now, I'm using BinaryReader.ReadString() which gives the correct result,...
4
9757
by: aevans1108 | last post by:
expanding this message to microsoft.public.dotnet.xml Greetings Please direct me to the right group if this is an inappropriate place to post this question. Thanks. I want to format a numeric value according to an arbitrary regular expression.
0
1421
by: Polar | last post by:
Hi! Do you know a C library for trie data structures (a particular type of tree)? And some link about trie with C language? Thank you Polar
3
3557
by: Paul | last post by:
Does anyone have an example of a trie data structure implemented in c#? I'm looking for a string trie to use to hold the word list for my spell checker. Any examples or suggestions would be great. thanks Paul
19
3138
by: jason_box | last post by:
I'm alittle new at C and I'm trying to write a simple program that will record the frequency of words and just print it out. It is suppose to take stdin and I heard it's only a few lines but I'm not sure where to start. The only example I have to work with is if you ran the program say: File list contains: This is a test.
9
1292
by: Thomas Ploch | last post by:
Hello fellow pythonists, I have a question concerning posting code on this list. I want to post source code of a module, which is a homework for university (yes yes, I know, please read on...). It is a web crawler (which I will *never* let out into the wide world) which uses regular expressions (and yes, I know, thats not good, too). I have finished it (as far as I can), but since I need a good mark to
5
3391
by: iaminsik | last post by:
I implemented a TRIE in memory by myself in C. But, as you know, when a size of data increases, loading time of data becomes longer. I want to implement a TRIE in file, which means division of INDEX & DATA in file and whenever a user asks a data, TRIE accesses a disk according to INDEX.
0
8670
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
9252
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
9122
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
0
8966
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
7872
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
5922
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
4433
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4687
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
3
2070
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.