convert list of strings to set of regexes; convert list of strings to trie

Klaus Neuner

Hello,

I need a function that converts a list into a set of regexes. Like so:

string_list = ["blaa", "blab", "raaa", "rabb"]

print string_list2reg exes(string_lis t)
This should return something like:

["bla(a|b)", "ra(aa|bb)"]

I am aware of the fact that converting the list to a *trie* would almost do
the job. But I couldn't find anything about Python modules that produce tries.

Are there any modules that I could use? Or do I have to implement it myself?

Klaus

Jul 18 '05 #1

Subscribe Reply

2966

François Pinard

[Klaus Neuner]

I need a function that converts a list into a set of regexes. Like so:
string_list = ["blaa", "blab", "raaa", "rabb"]
print string_list2reg exes(string_lis t)
This should return something like:
["bla(a|b)", "ra(aa|bb)"]

I once wrote code for doing something very close, but not finding it,
I guess I deleted it; so presumably, it would not be that difficult
redoing from scratch. By the way, the regexp compiler in Python library
already turns, at least logically, `(blaa)|(blab)| (raaa)|(rabb)' into
`(bla(a|b))|(ra (aa|bb))' -- disregarding of course how groups are being
saved by parentheses. So, you may also peek in this direction.

--
François Pinard http://www.iro.umontreal.ca/~pinard

Jul 18 '05 #2

Klaus Neuner

François Pinard <pi****@iro.umo ntreal.ca> wrote in message news:<ma******* *************** *************** @python.org>...

By the way, the regexp compiler in Python library
already turns, at least logically, `(blaa)|(blab)| (raaa)|(rabb)' into
`(bla(a|b))|(ra (aa|bb))' -- disregarding of course how groups are being
saved by parentheses. So, you may also peek in this direction.

What do you mean exactly by "logically" ? That Python translates

(1) (blaa)|(blab)|( raaa)|(rabb)

to something (logically equivalent to)

(2) (bla(a|b))|(ra( aa|bb))

before matching it? Does this mean that (1) (taken as normal regex) is always
exactly as fast as (2) (taken as normal regex)?

Klaus

Jul 18 '05 #3

Matteo Dell'Amico

Klaus Neuner wrote:

Hello,

I need a function that converts a list into a set of regexes. Like so:

string_list = ["blaa", "blab", "raaa", "rabb"]

print string_list2reg exes(string_lis t)
This should return something like:

["bla(a|b)", "ra(aa|bb)"]
So, I guess you just want to isolate common prefixes.
I am aware of the fact that converting the list to a *trie* would almost do
the job. But I couldn't find anything about Python modules that produce tries.

Maybe
http://cvs.bioperl.org/cgi-bin/viewc...root=biopython
can be of some use for you, even if I didn't have a look at it.

--
Ciao,
Matteo

Jul 18 '05 #4

Matteo Dell'Amico

Klaus Neuner wrote:

What do you mean exactly by "logically" ? That Python translates

(1) (blaa)|(blab)|( raaa)|(rabb)

to something (logically equivalent to)

(2) (bla(a|b))|(ra( aa|bb))

before matching it? Does this mean that (1) (taken as normal regex) is always
exactly as fast as (2) (taken as normal regex)?

Looks like (2) is a little bit more efficient in case of not matching
strings: edited from an interactive session,

In [22]: timer = timeit.Timer("r e1.match('blorb ')", "import re;re1 =
re.compile(r'(b laa)|(blab)|(ra aa)|(rabb)')")

In [23]: timer.timeit()
Out[23]: 1.4223082065582 275

In [24]: timer2 = timeit.Timer("r e2.match('blorb ')", "import re;re2 =
re.compile(r'(b la(a|b))|(ra(aa |bb))')")

In [25]: timer2.timeit()
Out[25]: 1.2273669242858 887

In [28]: timer3 = timeit.Timer("r e3.match('blab' )", "import re;re3 =
re.compile(r'(b la(a|b))|(ra(aa |bb))')")

In [29]: timer3.timeit()
Out[29]: 1.4784541130065 918

In [30]: timer4 = timeit.Timer("r e4.match('blab' )", "import re;re4 =
re.compile(r'(b laa)|(blab)|(ra aa)|(rabb)')")

In [31]: timer4.timeit()
Out[31]: 1.4766991138458 252

--
Ciao,
Matteo

Jul 18 '05 #5

François Pinard

[Klaus Neuner]

[François Pinard]
By the way, the regexp compiler in Python library already
turns, at least logically, `(blaa)|(blab)| (raaa)|(rabb)' into
`(bla(a|b))|(ra (aa|bb))' -- disregarding of course how groups
are being saved by parentheses. So, you may also peek in this
direction.

What do you mean exactly by "logically" ? That Python translates (1) to
something (logically equivalent to) (2) before matching it? Does this
mean that (1) (taken as normal regex) is always exactly as fast as (2)
(taken as normal regex)?

This is what I understand, yes: it is not worth optimising regexps the
above way ourselves (as we sometimes do for Emacs Lisp regexps, say),
before giving them to the Python regexp compiler.

--
François Pinard http://www.iro.umontreal.ca/~pinard

Jul 18 '05 #6

Nick Craig-Wood

Klaus Neuner <kl************ @yahoo.de> wrote:

I need a function that converts a list into a set of regexes. Like so:

string_list = ["blaa", "blab", "raaa", "rabb"]

print string_list2reg exes(string_lis t)
This should return something like:

["bla(a|b)", "ra(aa|bb)"]

The program below does exactly this... Run it like this

$ ./words-to-regexp.pl 5
Extracting 5 letter words
blaa
blab
raaa
rabb

(Giving the list of words to stdin)

It produced this regexp which matches just the above strings.

Re: bla[ab]|ra(?:aa|bb)

For your input. Eg find a regexp to match all the 1&2 letter words
in the dictionary...

../words-to-regexp.pl 2 < /usr/share/dict/words

Re: [aa]|a[cdghlmmnrssttuy]|[bb]|b[aeeikry]|[cc]|c[adfilmorssu]|[dd]|d[bor]|[ee]|e[dhmrssux]|[ff]|f[aemr]|[gg]|g[adeeos]|[hh]|h[aeefgiooz]|[ii]|i[dfnnorstt]|[jj]|j[or]|[kk]|k[crsw]|[ll]|l[aaeiorstu]|[mm]|m[abdeginorsstuy]|[nn]|n[abdeiopu]|[oo]|o[bfhknrswxz]|[pp]|p[aabdhimotu]|[qqrr]|r[abdeehnsux]|[ss]|s[bcehimnort]|[tt]|t[abchiilmosy]|[uu]|u[hprs]|[vv]|v[as]|[ww]|w[emu]|[xx]|xe|[yy]|y[abe]|[zz]|z[nr]

Yes its in perl, but its almost entirely regexps! I haven't got the
time to pythonise it at the moment - hope you enjoy a challenge ;-)

#!/usr/bin/perl -w
#
# The challenge - to write a function which given a list of words
# returns a regexp which will match those and only those words.

use strict;
my $LIMIT = shift || 3;
$|=1;

print "Extracting $LIMIT letter words\n";

my @list = ();
while (<>)
{
chomp;
next if $_ eq "";
push @list, lc($_) if length($_) <= $LIMIT;
}

print "Extracted ", scalar(@list), " words\n";

my ($re, $old_re) = ("", "1");

while ($re ne $old_re)
{
$old_re = $re;
print "-" x 60, "\n";
$re = word_list_to_re gexp( @list );
check_word_list _to_regexp($re, \@list);
print "Length: ", length($re), "\n";
}
exit;

############### ############### ############### ###############
# Converts a list of words into a regexp which will
# match those words and those numbers only.
#
# It does this by constructing a regexp and then progressively
# simplifying it - recursively if necessary. It uses regexp's to
# transform the regexp of course! This is almost a general purpose
# regexp optimiser.
#
# We assume that the caller will bound the regexp with ^( and )$ or
# \W(?: and )\W or whatever takes their fancy
#
# Set $DEBUG to 1 if you want to print lots of info and check the
# regexp works after each transformation.
#
# Warning: code contains heavy regexps - lift with care ;-)
# Caution: Code may use exponential time and space ;-(
############### ############### ############### ###############

sub word_list_to_re gexp
{
my (@list) = @_;
my $DEBUG = 1;

# The basic regexp with |'s on the start and end to make our life
# easier
# Should uniq here too...
$re = join("|", sort {
#length($a) <=> length($b) ||
$a cmp $b
} @list);

$re = "|$re|";

# Transform the regexp in stages, making sure at all time the
# regexp is correct if $DEBUG is set

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 1) Concatenate all the single characters a|b|c into [abc]'s
$re =~ s{ \| ( \w (?: \| \w )+ ) (?= \| ) }
{
my ( $string ) = ( $1 );
print "string = '$string'\n" if $DEBUG;
"|[" . join("", split m{\|}, $string) . "]"
}gex;

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 2) Find all the Xa|Xb|Xc and change to X(?:a|b|c)]
$re =~ s{ \| ( (\w+)(\w+) (?: \| \2\w+ )+ ) (?= \| ) }
{
my ( $string, $prefix ) = ( $1, $2 );
print "prefix = '$prefix', string = '$string'\n" if $DEBUG;
"|$prefix$ ?:" . join("|", map { substr($_, length $prefix) } split m{\|}, $string) . ")"
}gex;

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 3) Find all the aX|bX|cX and change to (a|b|c)X]
$re =~ s{ \| ( (\w+?)(.+) (?: \| \w+\3 )+ ) (?= \| ) }
{
my ( $string, $postfix ) = ( $1, $3 );
print "postfix = '$postfix', string = '$string'\n" if $DEBUG;
$string =~ s{ \Q$postfix\E (?= \| | $ ) }{}gx;
print "...string = '$string'\n" if $DEBUG;
"|(?:$string)$p ostfix"
}gex;

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 4) Change (?:a|b|c) into [abc]
$re =~ s{ \(\?\: ( \w (?: \| \w )+ ) $ }
{
my ( $string ) = ( $1 );
print "string = '$string'\n" if $DEBUG;
"[" . join("", split m{\|}, $string) . "]"
}gex;

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 5) Optimise [abc] into [a-c] or \d
# This doesn't optimise all the cases only the complete continuous
# range in the [ ... ]
# $re =~ s{ \[ ( \w{3,} ) \] }
# {
# my ( $string, $start, $end ) = ( $1, substr($1, 0, 1), substr($1, -1, 1) );
# print "match ['$string']...range [$start-$end]\n" if $DEBUG;
# if ($end - $start + 1 == length $string)
# {
# $start == 0 && $end == 9 ? '\d' : "[$start-$end]";
# }
# else
# {
# "[$string]";
# }
# }gex;

check_word_list _to_regexp($re, \@list) if $DEBUG;

my $re_length;

do
{
$re_length = length($re);

# 6) recurse on any sequences left (?:ab|cd|ef)
$re =~ s{ $\?\: ( \w+ (?: \| \w+ )+ ) $ }
{
my ( $string ) = ( $1 );
if (length($string ) < length($re) - 4)
{
print "**** Recursing on '$string'\n" if $DEBUG;
"(?:" . word_list_to_re gexp(split m{\|}, $string) . ")";
}
else
{
"(?:$string )";
}
}gex;

# 6a) recurse on any sequences left |ab|cd|ef|
$re =~ s{ \| ( \w+ (?: \| \w+ )+ ) \| }
{
my ( $string ) = ( $1 );
if (length($string ) < length($re) - 2)
{
print "**** Recursing on '$string'\n" if $DEBUG;
"|" . word_list_to_re gexp(split m{\|}, $string) . "|";
}
else
{
"|$string|" ;
}
}gex;
}
until (length($re) == $re_length);

check_word_list _to_regexp($re, \@list) if $DEBUG;

# 7) fix the | on each end
$re =~ s{^\|}{};
$re =~ s{\|$}{};

print "**** Returning '$re'\n" if $DEBUG;

return $re;
}

############### ############### ############### ###############
# Test subroutine to check the regexp performs as advertised
#
# Call with a regexp and a reference to a list of numbers
# it will check that the regexp matches all the list and
# doesn't match some others (obviously it can't check them
# all can it!) die-ing on any failures.
############### ############### ############### ###############

sub check_word_list _to_regexp
{
my ($re, $list) = @_;
my %list = map { $_ => 1 } @$list;
print "Re: $re\n";

# Put some other test cases in
$list{$_} += 0 for (0..999);
$list{int(rand( )*1000)} += 0 for (0..99);
$list{int(rand( )*10000)} += 0 for (0..99);
$list{int(rand( )*100000)} += 0 for (0..99);

# print join(", ", map {"$_ => $list{$_}"} keys %list), "\n";
$re =~ s{^\|}{}; # fix | on start and end
$re =~ s{\|$}{};
$re = "^(?:$re)\$ "; # put in ^(?: ... )$
$re = qr{$re}; # compile the regexp for speed

# Check all the keys in list against the regexp - some should pass
# and some should fail
for my $item (keys %list)
{
if ($list{$item} xor ($item =~ /$re/))
{
die "*** FAILED '$re' for '$item' ShouldMatch: $list{$item}\n" ;
}
else
{
# print "OK '$re' for '$item'\n";
}
}
}

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick

Jul 18 '05 #7

Klaus Neuner

Thanks to all who participated in this thread.

Klaus

Jul 18 '05 #8

Similar topics

4376

compressed suffix trie

by: Joseph | last post by:

Hi all, I want to build a compressed suffix trie from a string for string matching.instead of doing like: input:a string with 10 chars insert char array 10 insert char array 9,10 insert char array 8,9,10

C / C++

1663

Strategy for locating strings based on initial characters

by: Ney André de Mello Zunino | last post by:

Hello. Given a sorted collection of strings, what would a good (the best?) strategy be to allow fast access to an item, based on a search substring which should match the beginning of the searched item? The actual goal is to implement a functionality similar to that found in help indices, where one can locate an item by gradually typing its initial characters. I expect that some kind of tree structure be present in the solution, but I...

C / C++

1749

Fast deserialisation of strings from byte[]

by: drewnoakes | last post by:

I have an application that performs custom deserialisation of object state from byte arrays. This happens very regularly, so needs to be fast. In addition, most of the strings repeat, meaning I'm deserialising the same sequence of bytes repeatedly, giving the same output string. Let's ignore the text encoding method, as it's not relevant to my question. Right now, I'm using BinaryReader.ReadString() which gives the correct result,...

.NET Framework

9768

Convert.ToString( double ) + xsd:pattern + RegEx == ?

by: aevans1108 | last post by:

expanding this message to microsoft.public.dotnet.xml Greetings Please direct me to the right group if this is an inappropriate place to post this question. Thanks. I want to format a numeric value according to an arbitrary regular expression.

.NET Framework

1426

Trie

by: Polar | last post by:

Hi! Do you know a C library for trie data structures (a particular type of tree)? And some link about trie with C language? Thank you Polar

C / C++

3563

trie data structure

by: Paul | last post by:

Does anyone have an example of a trie data structure implemented in c#? I'm looking for a string trie to use to hold the word list for my spell checker. Any examples or suggestions would be great. thanks Paul

C# / C Sharp

3144

frequency list

by: jason_box | last post by:

I'm alittle new at C and I'm trying to write a simple program that will record the frequency of words and just print it out. It is suppose to take stdin and I heard it's only a few lines but I'm not sure where to start. The only example I have to work with is if you ran the program say: File list contains: This is a test.

C / C++

1294

Question concerning this list

by: Thomas Ploch | last post by:

Hello fellow pythonists, I have a question concerning posting code on this list. I want to post source code of a module, which is a homework for university (yes yes, I know, please read on...). It is a web crawler (which I will *never* let out into the wide world) which uses regular expressions (and yes, I know, thats not good, too). I have finished it (as far as I can), but since I need a good mark to

Python

3393

implementation of trie & file index

by: iaminsik | last post by:

I implemented a TRIE in memory by myself in C. But, as you know, when a size of data increases, loading time of data becomes longer. I want to implement a TRIE in file, which means division of INDEX & DATA in file and whenever a user asks a data, TRIE accesses a disk according to INDEX.

C / C++

9589

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

9423

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...

Windows Server

10049

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

9865

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

6675

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

5310

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

5448

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

3565

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP

2815

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General