Efficient String Lookup?

I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?
Jul 18 '05 #1
Chris S. wrote:
I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?


If it were me, I would store them as compiled regular expressions.

See the re module documentation and use re.compile().
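
For example, a minimal sketch (the helper name is made up, and it assumes
'#' stands for exactly one character -- adjust the translation if it can
match more than that):

import re

def compile_wildcard(pattern):
    # Escape the literal pieces and turn each '#' into the regexp '.',
    # anchoring the end so the whole test string must be consumed.
    pieces = [re.escape(piece) for piece in pattern.split('#')]
    return re.compile('.'.join(pieces) + '$')

pat = compile_wildcard('abc#e#')
print pat.match('abcdef') is not None    # True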

If you want a better solution, it might help if you supply a little more
information about your problem and why this solution is unsuitable
(maybe it is :]).
--
Michael Hoffman
Jul 18 '05 #2
Chris S. wrote:
I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?


As a compiled regular expression, I guess - you don't give much info here,
so maybe there is a better way. But to me it looks like a classic regexp
thing. Maybe if your wildcards are equivalent to .*, then using subsequent
string searches like this helps you:

pattern = 'abc#e#'.split('#')
s = 'abcdef'
found = True
pos = 0
for p in pattern:
    h = s.find(p, pos)      # search from where the previous piece ended
    if h != -1:
        pos = h + len(p)
    else:
        found = False
        break
That might be faster, if the string.find operation uses something else than
simple brute force linear searching - but I don't know enough about the
internals of Python's string implementation to give a definite answer
here.

But to be honest: I don't think regexps are easy to beat, unless your
use case is modeled in a way that makes it prone to other approaches.

--
Regards,

Diez B. Roggisch
Jul 18 '05 #3
I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?


Start with a Trie, and virtually merge branches as necessary.
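
Something like this, as a rough sketch (assuming '#' is a single-character
wildcard; the patterns live in a dict-of-dicts trie, and the lookup follows
both the literal branch and the '#' branch at every node):

def add(trie, pattern, data):
    # Store a wildcard pattern; the associated data sits under the
    # None key of the final node.
    node = trie
    for ch in pattern:
        node = node.setdefault(ch, {})
    node[None] = data

def lookup(node, s, i=0):
    # Collect the data of every stored pattern that matches s exactly.
    if i == len(s):
        if None in node:
            return [node[None]]
        return []
    results = []
    if s[i] in node:                  # literal branch
        results.extend(lookup(node[s[i]], s, i + 1))
    if '#' in node:                   # wildcard branch, "merged" with the literals
        results.extend(lookup(node['#'], s, i + 1))
    return results

trie = {}
add(trie, 'abc#e#', 'data 1')
add(trie, 'ab####', 'data 2')
print lookup(trie, 'abcdef')          # ['data 1', 'data 2']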

- Josiah

Jul 18 '05 #4
Josiah Carlson wrote:
I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?

Start with a Trie, and virtually merge branches as necessary.

- Josiah


Yup, you might also have a look at the Aho-Corasick algorithm, which can
match a test string against a big number of strings quite efficiently:

http://www-sr.informatik.uni-tuebing...ler/AC/AC.html

You'll have to adapt the algorithm so that it can handle your wildcard,
though. I found it easy to implement the '.' (one character wildcard),
but the '.*' (zero or more character wildcard) forces you to have
backtracking.
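
For the curious, here is a bare-bones sketch of the plain, wildcard-free
algorithm (the function and variable names are my own); the wildcard
adaptation discussed above is left out:

from collections import deque

def build(patterns):
    # goto[s] maps a character to the next state, output[s] lists the
    # patterns ending at state s, fail[s] is the failure link.
    goto, fail, output = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pat)
    # Breadth-first pass to compute the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] += output[fail[nxt]]
    return goto, fail, output

def search(text, goto, fail, output):
    # Report (end index, pattern) for every occurrence in the text.
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in output[state]:
            hits.append((i, pat))
    return hits

goto, fail, output = build(['he', 'she', 'his', 'hers'])
print search('ushers', goto, fail, output)
# -> [(3, 'she'), (3, 'he'), (5, 'hers')]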

But if you can use the regexp engine, do not hesitate, it will save you
a lot of headaches. Unless of course you're a student and your teacher
asked you this question ;).

Regards,

Nicolas
Jul 18 '05 #5
Michael Hoffman wrote:
Chris S. wrote:
I have a number of strings, containing wildcards (e.g. 'abc#e#' where
# is anything), which I want to match with a test string (e.g
'abcdef'). What would be the best way for me to store my strings so
lookup is as fast as possible?

If it were me, I would store them as compiled regular expressions.

See the re module documentation and use re.compile().

If you want a better solution, it might help if you supply a little more
information about your problem and why this solution is unsuitable
(maybe it is :]).


The problem is I want to associate some data with my pattern, as in a
dictionary. Basically, my application consists of a number of
conditions, represented as strings with wildcards. Associated to each
condition is arbitrary data explaining "what I must do". My task is to
find this data by matching a state string against these condition strings.
Of course, the brute-force approach is to just add each pattern to a
dictionary and linearly search every key for a match. To improve on this,
I considered a trie, implemented as a special dictionary:

class Trie(dict):
    '''Implements a traditional Patricia-style Trie.
    Keys must be sequence types. The None key represents a value.'''

    def __init__(self):
        dict.__init__(self)

    def __setitem__(self, key, value):
        assert key, 'invalid key '+str(key)
        d = self
        last = None
        for n in key:
            if n not in d:
                dict.__setitem__(d, n, {})
            last = (d, n)
            d = dict.__getitem__(d, n)
        (d, n) = last
        dict.__getitem__(d, n)[None] = value

    def __getitem__(self, key):
        d = self
        for n in key:
            assert n in d, 'invalid key '+str(key)
            d = dict.__getitem__(d, n)
        assert None in d, 'missing value for key '+str(key)
        return dict.__getitem__(d, None)

    def __delitem__(self, key):
        previous = []
        d = self
        for n in key:
            assert n in d, 'invalid key '+str(key)
            previous.append((d, n))
            d = dict.__getitem__(d, n)
        assert None in d, 'missing value for key '+str(key)
        # remove value
        dict.__delitem__(d, None)
        # find and remove empty keys
        while len(previous):
            (d, n) = previous.pop()
            if not len(dict.__getitem__(d, n)):
                dict.__delitem__(d, n)
            else:
                break

However, I'm uncertain about the efficiency of this approach. I'd like
to use regexps, but how would I associate data with each pattern?
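
The most obvious thing I can think of is a flat list of (compiled pattern,
data) pairs that gets scanned in full; a sketch, with made-up condition
data:

import re

# Hypothetical conditions, just for illustration.
conditions = [
    ('abc#e#', 'do this'),
    ('ab####', 'do that'),
]

compiled = [(re.compile(p.replace('#', '.') + '$'), data)
            for (p, data) in conditions]

def actions_for(state):
    # Linear scan: test the state string against every condition.
    return [data for (regexp, data) in compiled if regexp.match(state)]

print actions_for('abcdef')    # ['do this', 'do that']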
Jul 18 '05 #6
Diez B. Roggisch wrote:
That might be faster, if the string.find operation uses something else than
simple brute force linear searching - but I don't know enough about the
internals of Python's string implementation to give a definite answer
here.

But to be honest: I don't think regexps are easy to beat, unless your
use case is modeled in a way that makes it prone to other approaches.


The problem is, I want to find all patterns that match my test string,
so even with re I'd still have to search through every pattern, which is
what I'm trying to avoid. Something like a trie might be better, but
they don't seem very efficient when implemented in Python.
Jul 18 '05 #7
Chris S. wrote:
The problem is I want to associate some data with my pattern, as in a
dictionary. Basically, my application consists of a number of
conditions, represented as strings with wildcards. Associated to each
condition is arbitrary data explaining "what I must do". ... However, I'm uncertain about the efficiency of this approach. I'd like
to use regexps, but how would I associate data with each pattern?


One way is with groups. Make each pattern into a regexp
pattern, then concatenate them as
(pat1)|(pat2)|(pat3)| ... |(patN)

Do the match and find which group has the non-None value.

You may need to tack a "$" on the end of the string (in which
case remember to enclose everything in a () so the $ doesn't
affect only the last pattern).

One thing to worry about is that you can only have 99 groups
in a pattern.

Here's example code.
import re

config_data = [
    ("abc#e#", "Reactor meltdown imminent"),
    ("ab##", "Antimatter containment field breach"),
    ("b####f", "Coffee too strong"),
]

as_regexps = ["(%s)" % pattern.replace("#", ".")
              for (pattern, text) in config_data]

full_regexp = "|".join(as_regexps) + "$"
pat = re.compile(full_regexp)
input_data = [
    "abadb",
    "abcdef",
    "zxc",
    "abcq",
    "b1234f",
]

for text in input_data:
    m = pat.match(text)
    if not m:
        print "%s? That's okay." % (text,)
    else:
        for i, val in enumerate(m.groups()):
            if val is not None:
                print "%s? We've got a %r warning!" % (text,
                                                       config_data[i][1],)

Here's the output I got when I ran it
abadb? We've got a 'Antimatter containment field breach' warning!
abcdef? We've got a 'Reactor meltdown imminent' warning!
zxc? That's okay.
abcq? We've got a 'Antimatter containment field breach' warning!
b1234f? We've got a 'Coffee too strong' warning!
Andrew
da***@dalkescientific.com
Jul 18 '05 #8
On Sat, 16 Oct 2004 09:11:37 GMT, "Chris S." <ch*****@NOSPAM.udel.edu> wrote:
I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?


Insufficient info. But 'fast as possible' suggests putting your strings in
a flex grammar and generating a parser in c. See
http://www.gnu.org/software/flex/
Defining a grammar is a good exercise in precise definition of the problem anyway ;-)

If you want to do it in python, you still need to be more precise...

- is # a single character? any number of characters?
- if your test string were 'abcdefabcdef' would you want 'abc#e#' to match the whole thing?
or match abcdef twice?
- if one wild card string matches, does that "use up" the test string so other wild card strings
mustn't match? If so, what has priority? Longest? shortest? Other criterion?
- etc etc

Regards,
Bengt Richter
Jul 18 '05 #9
Bengt Richter wrote:
On Sat, 16 Oct 2004 09:11:37 GMT, "Chris S." <ch*****@NOSPAM.udel.edu> wrote:

I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?

Insufficient info. But 'fast as possible' suggests putting your strings in
a flex grammar and generating a parser in c. See
http://www.gnu.org/software/flex/
Defining a grammar is a good exercise in precise definition of the problem anyway ;-)

If you want to do it in python, you still need to be more precise...

- is # a single character? any number of characters?
- if your test string were 'abcdefabcdef' would you want 'abc#e#' to match the whole thing?
or match abcdef twice?
- if one wild card string matches, does that "use up" the test string so other wild card strings
mustn't match? If so, what has priority? Longest? shortest? Other criterion?
- etc etc


Sorry for the ambiguity. My case is actually pretty simple. '#'
represents any single character, so it's essentially the same as re's
'.'. The match must be exact, so the string and pattern must be of equal
lengths. Each wildcard is independent of other wildcards. For example,
suppose we restricted the possible characters to 1 and 0, then the
pattern '##' would only match '00', '01', '10', and '11'. This pattern
would not match '0', '111', etc. I feel that a trie would work well, but
I'm concerned that for large patterns, the overhead in the Python
implementation would be too inefficient.
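
One idea I'm toying with is a positional index (a sketch, with made-up
patterns): since the pattern and the test string always have the same
length, index every literal (position, character) pair; a pattern then
matches exactly when all of its literal positions agree, and the wildcard
positions never have to be looked at:

# Hypothetical patterns; '#' matches exactly one character.
patterns = {'abc#e#': 'action 1', 'ab####': 'action 2', '#bc#e#': 'action 3'}

index = {}           # (length, position, char) -> patterns with that literal
literal_counts = {}  # pattern -> number of non-wildcard positions
for pat in patterns:
    count = 0
    for pos, ch in enumerate(pat):
        if ch != '#':
            index.setdefault((len(pat), pos, ch), []).append(pat)
            count += 1
    literal_counts[pat] = count   # all-wildcard patterns need separate handling

def matches(s):
    # Count how many literal positions of each candidate the string hits;
    # a pattern matches when every one of its literals is satisfied.
    hits = {}
    for pos, ch in enumerate(s):
        for pat in index.get((len(s), pos, ch), []):
            hits[pat] = hits.get(pat, 0) + 1
    return [patterns[pat] for pat, n in hits.items() if n == literal_counts[pat]]

print sorted(matches('abcdef'))   # ['action 1', 'action 2', 'action 3']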
Jul 18 '05 #10
