By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
448,507 Members | 1,185 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 448,507 IT Pros & Developers. It's quick & easy.

Fast way to determine if a string contains a member of a list of strings

P: n/a
I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of strings.
So, think of it in terms of this: I have a string such as the following:

"Smith said the dog belongs (or did belong) to the secretary, an employee of
the company."

Then the list contains the following (and this list is MUCH larger in the
real situation):

Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu

I would need to stop the processing and return (true) as soon as Smith was
found. On the other hand, if the string was changed to the following, there
would be no match and I would return (false):

"Smitherson said the dog belongs (or did belong) to the secretary, an
employee of the company."

In the given string, do know that the matches should begin at a given point
(zero position), but I need to keep processing until I have exhausted the
candidate string in the list - as shown above - to prevent a false match.

I have played around with some different data structures, such as prefix and
suffix trees, an these work in the case that you have a string that you are
trying to match in a list, not vice versa. The approach is required to be
very performant because it will be evaluated millions of times. I am also
okay with an unsafe code approach that works. I just need the evaluations to
terminate as soon as possible rather than looping through every single item
in the list. Even an IndexOf is too slow.
Mar 5 '08 #1
Share this Question
Share on Google+
14 Replies


P: n/a
I think RegEx will help you out here. As far as not iterating over the
entire list, I've no clue how you can selectively not fall through given the
only match in the list is the last item in the list or when there is no
match at all.
"Karch" <news.microsoft.comwrote in message
news:e9**************@TK2MSFTNGP06.phx.gbl...
>I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of strings.
So, think of it in terms of this: I have a string such as the following:

"Smith said the dog belongs (or did belong) to the secretary, an employee
of the company."

Then the list contains the following (and this list is MUCH larger in the
real situation):

Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu

I would need to stop the processing and return (true) as soon as Smith was
found. On the other hand, if the string was changed to the following,
there would be no match and I would return (false):

"Smitherson said the dog belongs (or did belong) to the secretary, an
employee of the company."

In the given string, do know that the matches should begin at a given
point (zero position), but I need to keep processing until I have
exhausted the candidate string in the list - as shown above - to prevent a
false match.

I have played around with some different data structures, such as prefix
and suffix trees, an these work in the case that you have a string that
you are trying to match in a list, not vice versa. The approach is
required to be very performant because it will be evaluated millions of
times. I am also okay with an unsafe code approach that works. I just need
the evaluations to terminate as soon as possible rather than looping
through every single item in the list. Even an IndexOf is too slow.

Mar 5 '08 #2

P: n/a
Karch wrote:
I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of strings.
So, think of it in terms of this: I have a string such as the following:

"Smith said the dog belongs (or did belong) to the secretary, an employee of
the company."

Then the list contains the following (and this list is MUCH larger in the
real situation):

Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu
Could you give us some idea of the likely length of the list? And
whether it's static or dynamic ?

jd
Mar 5 '08 #3

P: n/a
On Wed, 05 Mar 2008 09:05:22 -0800, Karch <news.microsoft.comwrote:
I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of
strings.
I don't know if RegEx has optimizations for dealing with this sort of
thing. It could, but given how long a search string for RegEx could be if
you concatenate all of your possible matches, maybe it doesn't.

That'd be my first try though. It seems like the actual RegEx search
string would be simple (just "or" all of your possible matches together).
If it performs well enough, there you go.

If not, I'd guess there's a fair amount of research into algorithms like
this, but a fairly basic approach is a state graph. It's memory
intensive, but rewards you with good performance. This came up awhile ago
and I posted some sample code to get someone else started. You can find
that post here:
http://groups.google.com/group/micro...06f696d4500b77

I've referred to this post a couple of other times, and while I've never
had anyone say it actually turned out to be useful, they've never said it
wasn't either. :)

I suppose if I'm going to keep referring people to it, maybe I ought to
clean it up a little bit more. But what's there does work and should be
enough to get you pointed in the right direction.

Pete
Mar 5 '08 #4

P: n/a
One problem I see with this is that you apparently are not doing
simply a string search, but some kind of word search, without defining
what a word is.

You say the "Smithperson said ..." should not have any matches, so
while the string Smith is a substring, it does not match.

Would it match:
(Smith
(Smith)
Smith,
Smith;
Smith-
smith

etc.

The answer to this will probably constrian your possible algorithms.
If it didn't get too many false hits, doing a string search first and
then re-checking to see if the matches are valid word matches might be
OK.

On Wed, 5 Mar 2008 11:05:22 -0600, "Karch" <news.microsoft.comwrote:
>I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of strings.
So, think of it in terms of this: I have a string such as the following:

"Smith said the dog belongs (or did belong) to the secretary, an employee of
the company."

Then the list contains the following (and this list is MUCH larger in the
real situation):

Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu

I would need to stop the processing and return (true) as soon as Smith was
found. On the other hand, if the string was changed to the following, there
would be no match and I would return (false):

"Smitherson said the dog belongs (or did belong) to the secretary, an
employee of the company."

In the given string, do know that the matches should begin at a given point
(zero position), but I need to keep processing until I have exhausted the
candidate string in the list - as shown above - to prevent a false match.

I have played around with some different data structures, such as prefix and
suffix trees, an these work in the case that you have a string that you are
trying to match in a list, not vice versa. The approach is required to be
very performant because it will be evaluated millions of times. I am also
okay with an unsafe code approach that works. I just need the evaluations to
terminate as soon as possible rather than looping through every single item
in the list. Even an IndexOf is too slow.
Mar 6 '08 #5

P: n/a
The list is static (read into memory during initialization) and could be up
to approximately 1000 items (hence the need for a fast method)

"John Daragon" <jo**@argv.co.ukwrote in message
news:47***********************@news.zen.co.uk...
Karch wrote:
>I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of
strings. So, think of it in terms of this: I have a string such as the
following:

"Smith said the dog belongs (or did belong) to the secretary, an employee
of the company."

Then the list contains the following (and this list is MUCH larger in the
real situation):

Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu

Could you give us some idea of the likely length of the list? And whether
it's static or dynamic ?

jd

Mar 6 '08 #6

P: n/a
I think the StateGraph approach is going to work best. I have an
implementation up and running and should be able to get some timings done
today. I don't think Regex is an option for me because of performance and
the fact that, although the list itself is static, it will be loaded into
memory at run-time. So, the expression building would be complex, not to
mention eating cycles just to prep.

"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...
On Wed, 05 Mar 2008 09:05:22 -0800, Karch <news.microsoft.comwrote:
>I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of
strings.

I don't know if RegEx has optimizations for dealing with this sort of
thing. It could, but given how long a search string for RegEx could be if
you concatenate all of your possible matches, maybe it doesn't.

That'd be my first try though. It seems like the actual RegEx search
string would be simple (just "or" all of your possible matches together).
If it performs well enough, there you go.

If not, I'd guess there's a fair amount of research into algorithms like
this, but a fairly basic approach is a state graph. It's memory
intensive, but rewards you with good performance. This came up awhile ago
and I posted some sample code to get someone else started. You can find
that post here:
http://groups.google.com/group/micro...06f696d4500b77

I've referred to this post a couple of other times, and while I've never
had anyone say it actually turned out to be useful, they've never said it
wasn't either. :)

I suppose if I'm going to keep referring people to it, maybe I ought to
clean it up a little bit more. But what's there does work and should be
enough to get you pointed in the right direction.

Pete

Mar 6 '08 #7

P: n/a
On Thu, 06 Mar 2008 10:08:22 -0800, Karch <news.microsoft.comwrote:
I think the StateGraph approach is going to work best. I have an
implementation up and running and should be able to get some timings done
today.
Have you tried the dictionary approach suggested by a couple of others?

I wasn't paying attention when I first read your question and failed to
note that your search is delimited by spaces. Or, at least in the
examples you provided it was.

I think that if your data is actually restricted like that, the dictionary
lookup might be as fast as a state graph and it would be a lot simpler in
terms of implementation.

Just a thought. Obviously I really like state graphs :), but when there
is information you know about the input data that allows you to constrain
the problem a bit (e.g. by using spaces to identify the beginning and
ending of any possible match), it's often possible to solve the problem
efficiently without using a state graph.

Pete
Mar 6 '08 #8

P: n/a
On Thu, 06 Mar 2008 10:12:51 -0800, Karch <news.microsoft.comwrote:
The Hashtable would work great alongside a tokenized string, but the
problem
is that I don't have a distinct delimiter since the string to be matched
can
span multiple words. It's really a question of "do any of these strings
appear in this source string", irrespective of whitespace.
Then how do you distinguish "Smith" from "Smitherson", as in your
example? Can you at least make a requirement that the search pattern will
always begin and end on a space, even if it includes spaces within?

Pete
Mar 6 '08 #9

P: n/a
Mufaka wrote:
Karch wrote:
>The Hashtable would work great alongside a tokenized string, but the
problem is that I don't have a distinct delimiter since the string to
be matched can span multiple words. It's really a question of "do any
of these strings appear in this source string", irrespective of
whitespace.
<snip>
This is a little different than your original post, but if you can't
delimit search string, you might try something like the following:

Hash the static list of words
Keep a list of unique lengths for those words
for each unique length, iterate the string pulling out substrings of
that length
see if that substring is in your hashed list of words

It at least reduces the amount of InString's you'd have to do.

I'd be interested in seeing how this performs against large strings.
Here's my test code:
<snip>
>
if (text.Length < length) break;
<snip>
That should be continue instead of break.
Mar 6 '08 #10

P: n/a
WOW! Really disappointing timings for the state machine method. Here are the
results:

Method: Looping through 359 strings doing an if (IndexOf 0) and a source
string of length = 47
Execution time: 0.055008 seconds.
Method: Using the StateMachine method on 359 strings and a source string of
length = 47
Execution time: 2.359223 seconds.

"Karch" <news.microsoft.comwrote in message
news:eX**************@TK2MSFTNGP03.phx.gbl...
>I think the StateGraph approach is going to work best. I have an
implementation up and running and should be able to get some timings done
today. I don't think Regex is an option for me because of performance and
the fact that, although the list itself is static, it will be loaded into
memory at run-time. So, the expression building would be complex, not to
mention eating cycles just to prep.

"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...
>On Wed, 05 Mar 2008 09:05:22 -0800, Karch <news.microsoft.comwrote:
>>I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of
strings.

I don't know if RegEx has optimizations for dealing with this sort of
thing. It could, but given how long a search string for RegEx could be
if you concatenate all of your possible matches, maybe it doesn't.

That'd be my first try though. It seems like the actual RegEx search
string would be simple (just "or" all of your possible matches together).
If it performs well enough, there you go.

If not, I'd guess there's a fair amount of research into algorithms like
this, but a fairly basic approach is a state graph. It's memory
intensive, but rewards you with good performance. This came up awhile
ago and I posted some sample code to get someone else started. You can
find that post here:
http://groups.google.com/group/micro...06f696d4500b77

I've referred to this post a couple of other times, and while I've never
had anyone say it actually turned out to be useful, they've never said it
wasn't either. :)

I suppose if I'm going to keep referring people to it, maybe I ought to
clean it up a little bit more. But what's there does work and should be
enough to get you pointed in the right direction.

Pete


Mar 7 '08 #11

P: n/a
Karch wrote:
Snip 8<
>
Yeah, this would work really good - but one point that I didn't mention is
that I can't really tokenize the string since the strings to match could
span multiple words and whitepace. I looked at the data and I couldn't find
a realiable way to tokenize.
Snip 8<

Is the data in your list arbitrarily complex? I guess what I'm getting
at is: if the majority of the strings to match are simple strings with
no embedded spaces or other special characters, it may well be possible
to build a hashtable (fixated, me?) that caters for the simplest
(tokenised) case, but which returns an object identifying whether this
string can be standalone or is a prefix for a more complex one.
Everything apart from the simple case would have to be handled
exceptionally, but if they're few and far between you may well still be
a long way ahead of the game adopting a technique that is based on hashing.
jd
Mar 7 '08 #12

P: n/a
WOW! Really disappointing timings for the state machine method. Here are the
results:

Method: Looping through 359 strings doing an if (IndexOf 0) and a source
string of length = 47
Execution time: 0.055008 seconds.
Method: Using the StateMachine method on 359 strings and a source string of
length = 47
Execution time: 2.359223 seconds.

In my case, I can terminate the search as soon as ONE of the strings is
found because the list is mutually exclusive. In the sample code that you
provided, at what point do I know I have a match and can terminate the
traversal?

Also, to answer your question, I do not have space delimiters - the strings
can span multiple words, etc.

"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...
On Thu, 06 Mar 2008 10:08:22 -0800, Karch <news.microsoft.comwrote:
>I think the StateGraph approach is going to work best. I have an
implementation up and running and should be able to get some timings done
today.

Have you tried the dictionary approach suggested by a couple of others?

I wasn't paying attention when I first read your question and failed to
note that your search is delimited by spaces. Or, at least in the
examples you provided it was.

I think that if your data is actually restricted like that, the dictionary
lookup might be as fast as a state graph and it would be a lot simpler in
terms of implementation.

Just a thought. Obviously I really like state graphs :), but when there
is information you know about the input data that allows you to constrain
the problem a bit (e.g. by using spaces to identify the beginning and
ending of any possible match), it's often possible to solve the problem
efficiently without using a state graph.

Pete

Mar 13 '08 #13

P: n/a
On Thu, 13 Mar 2008 16:35:05 -0700, Karch <news.microsoft.comwrote:
I only create the state graph one time and I only populate the list of
strings one time. So, those are not interfering with proper timings; I am
only timing the searches. But, just for fun, here is the same test with
1M
iterations. The ratio is about the same.
What is your input data? Does the input string contain any of the search
strings? If so, where is that search string in the list of search
strings? Are you doing 1 million identical searches, or are you somehow
randomizing the input?

Basically, that's a fairly minimal improvement relative to the potential
of the state graph versus a brute force IndexOf() solution. So either
your test case is somehow favorable to the IndexOf(), or there's still
some remaining problem with the state graph implementation.

Pete
Mar 13 '08 #14

P: n/a
Yes, you are right; my input data is not as random as it could be. I just
wanted to get a quick snapshot, so it's likely that the improvement would be
greater over a larger and more varied set. I'll put together a more
realistic set of test data and report back.

"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...
On Thu, 13 Mar 2008 16:35:05 -0700, Karch <news.microsoft.comwrote:
>I only create the state graph one time and I only populate the list of
strings one time. So, those are not interfering with proper timings; I am
only timing the searches. But, just for fun, here is the same test with
1M
iterations. The ratio is about the same.

What is your input data? Does the input string contain any of the search
strings? If so, where is that search string in the list of search
strings? Are you doing 1 million identical searches, or are you somehow
randomizing the input?

Basically, that's a fairly minimal improvement relative to the potential
of the state graph versus a brute force IndexOf() solution. So either
your test case is somehow favorable to the IndexOf(), or there's still
some remaining problem with the state graph implementation.

Pete

Mar 14 '08 #15

This discussion thread is closed

Replies have been disabled for this discussion.