
How to efficiently determine if a string contains any one of many strings

I am interested in scanning web pages for content of interest, and then
auto-classifying that content. I have tables of metadata that I can use for
the classification, e.g. : "John P. Jones" "Jane T. Smith" "Fred Barzowsky"
"Department of Oncology" "Office of Student Affairs" "Lewis Hall" etc. etc.
etc.

I am wondering what an efficient way to do this in code might be. The dumb,
brute-force way would be to loop through the content once for each of
thousands of these metadata-items-turned-search-terms. That sounds very
slow. I am wondering if there is a way to compare against all of them
simultaneously.

(Though I'm interested in the answer to the code question above, I'm
wondering if a better approach would be to use an existing indexing tool
(SQL Server or Indexing Server, maybe?) as a short-term intermediary. )

Thanks in advance!
-KF
Jun 20 '07 #1
9 Replies


<ke*****@nospam.nospam> wrote:
Well, what are your performance criteria? Have you checked whether the
simplest possible solution *is* actually too slow? That would be my
first port of call.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Jun 20 '07 #2

Many thanks, Jon. It may indeed not be too slow, especially since I have
throwaway servers I can dedicate to this task.

My purposes in asking included:
a) to learn if there is an obvious, language-specific way to do a
"simultaneous comparison" of a dictionary of search terms against another
string
b) to learn common solutions that other devs might apply to this sort of
problem.

Always trying (and needing) to learn more.

Thanks,
-KF

Jun 20 '07 #3

<ke*****@nospam.nospam> wrote:
That's fair enough - so long as "check that you've got a performance
problem before you try to resolve it with more complicated code" is
something you've already learned :)

There's not a language-specific way to do it. A regular expression may
well optimise it as far as it can be done - that would be the next
step, I'd say.
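
For instance, the whole term list can be folded into one alternation pattern, something like this sketch (`Regex.Escape` keeps the dots in names like "John P. Jones" from acting as wildcards; the terms shown are just the examples from the original post):

```csharp
using System;
using System.Text.RegularExpressions;

class RegexScanDemo
{
    static void Main()
    {
        string[] terms = { "John P. Jones", "Jane T. Smith", "Department of Oncology" };

        // Escape each term and join with '|' into a single alternation
        string pattern = string.Join("|", Array.ConvertAll(terms, Regex.Escape));
        Regex re = new Regex(pattern, RegexOptions.Compiled);

        string page = "Contact John P. Jones at the Department of Oncology.";
        foreach (Match m in re.Matches(page))
        {
            Console.WriteLine(m.Value);   // each matched term, in document order
        }
    }
}
```

One wrinkle: .NET alternation tries the branches left to right, so if one term is a prefix of another ("Lewis Hall" vs. "Lewis Hall Annex"), put the longer term first or you'll only ever see the shorter match.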

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Jun 20 '07 #4

On Wed, 20 Jun 2007 11:48:55 -0700, <ke*****@nospam.nospam> wrote:
Since we've already covered the "don't bother optimizing until you know
you need to" part of the question... :)

I would expect that this is a standard algorithm, akin to spell-checking.
I admit, I tend not to be as up on the "science" side of "computer
science" as I ought to be, but I'd guess you'd find the solution in a book
like "Programming Pearls", etc.

Ignoring the likelihood that someone's already figured out a good,
efficient algorithm to do this... :)

To me it seems like a good opportunity to use a state diagram sort of
thing. Based on your input strings, you'd build a state diagram in which
the next state from a given state is determined by the next input
character. For example, starting from the base state, reading a "J" would
send you to the next state that is shared by the possible input strings
"John P. Jones" and "Jane T. Smith". From that state, reading the next
character would send you to one of two states, one used by "John P. Jones"
and the other used by "Jane T. Smith".

Of course, assuming more input strings to match, you'd have a lot more
shared states. But the general idea is the same. At some point, you'd
reach a terminal state that would identify a matched string. If you want
to allow matches where some of your search strings contain other search
strings, then you'd have to identify those cases and have a given terminal
state identify multiple matches. If you want to allow matches to overlap,
then things get really complicated. :) I think at that point, you'd have
to provide for the ability to have multiple state diagrams active
concurrently (not difficult, just more complicated).

Using a mechanism like that, you'd only have to scan through the string
once, using each character to drive the state diagram(s).

Note that this technique would work best when you have a constant set of
strings to match, which you apply repeatedly to a large number of input
texts. It sounds like that's the scenario you're in, but if not then
perhaps a brute-force, or indexing-based solution would work as well, or
even better. The state diagram solution is fairly memory hungry, and
requires some processing time up-front to create the state diagram.
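
For what it's worth, this is essentially the idea behind the Aho-Corasick algorithm. A bare-bones sketch of the trie part (without the "failure" links that make full Aho-Corasick a strict single pass, this version restarts a traversal at each position, like the "multiple diagrams" variant described above):

```csharp
using System;
using System.Collections.Generic;

// A minimal character-trie scanner illustrating the state-machine idea.
class TrieNode
{
    public Dictionary<char, TrieNode> Next = new Dictionary<char, TrieNode>();
    public string Terminal;   // non-null when a search string ends here
}

class TrieScan
{
    static TrieNode Build(string[] terms)
    {
        TrieNode root = new TrieNode();
        foreach (string term in terms)
        {
            TrieNode node = root;
            foreach (char c in term)
            {
                if (!node.Next.ContainsKey(c)) node.Next[c] = new TrieNode();
                node = node.Next[c];
            }
            node.Terminal = term;   // terminal state identifies the match
        }
        return root;
    }

    static void Main()
    {
        TrieNode root = Build(new string[] { "John P. Jones", "Jane T. Smith" });
        string text = "Ask Jane T. Smith or John P. Jones.";

        for (int i = 0; i < text.Length; i++)   // start a traversal at each position
        {
            TrieNode node = root;
            for (int j = i; j < text.Length; j++)
            {
                if (!node.Next.TryGetValue(text[j], out node)) break;
                if (node.Terminal != null) Console.WriteLine(node.Terminal);
            }
        }
    }
}
```

The restart-at-each-position loop is what makes overlapping and contained matches fall out for free, at the cost of the worst-case rescanning that the failure links would eliminate.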

Pete
Jun 20 '07 #5

Thanks Peter. Your response is most generous and appreciated, and is exactly
the kind of insight I was looking for. As a Fine Arts major, "computer
science" was not something I had the opportunity to study rigorously.
Learning about it now is pretty darned interesting.

Perhaps I could press you or others to speculate on one other CS-ish thing.
I am interested in machine classification of unknown texts. Some kinds of
classifications lend themselves to searches for string literals: "Department
of Oncology", or "Peter L. Ward" or "Lewis Hall." Other kinds of
classifications are less clear via exact string matching: we have things
like "Physical Science" or "Campus" or "Sports."

Spam filters somehow analyze texts and come up with a numeric score that
assesses an unknown piece of content's likelihood of being spam -- let's
call it "Spamminess". I am interested in analyzing unknown text and being
able to assess their "Campus-ness" or "Physical Science-ness" or
"Sports-ness".

I could imagine this working at least two ways:
1) Develop a series of rules for something like the category "Science" --
e.g. if the unknown article contains X and Y and Z metadata terms, give one
point for each instance of the term and divide by the number of words. In
this scenario you spend a lot of time thinking about and programming good
rules.

2) I have large collections of known texts which are already classified with
metadata pertaining to topics. Turn the machine on those texts and have it
derive rules and observations from that pre-existing base of content. Use
those rules for scans of unknown texts.

#2 sounds a lot more interesting but I have no idea how to proceed, what
classics of CS literature I should be looking at, or whether it is even
viable. If you have a starting point, I'd be very interested.

Thanks again for your help, much appreciated.

-KF
Jun 20 '07 #6

Ken,
You might consider approaching this with DotLucene. It will do a very nice
job of indexing your pages, and you can then get very rapid search results
based on your thousands of "metadata search terms". The product is written in
C# (ported from the original Java implementation) and has a lot of
flexibility.

-- Peter
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
Short urls & more: http://ittyurl.net


Jun 20 '07 #7

If you are looking to apply an algorithm similar to determining what is
spam (in your case, the same algorithm, but using it to classify it as
something else), then you might want to google the term "bayesian". This is
the algorithm that is used in most heuristic spam filters.

Even better, search for "bayesian spam" and the second item that is
returned (on google) should give you some open source projects you can use
to implement the algorithm and tweak it to your needs. Here is the direct
link, in case you can't find it:

http://spambayes.sourceforge.net/windows.html
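
To get a feel for what such filters compute, here is a toy sketch of the naive-Bayes scoring at the heart of it. All the word counts below are invented for illustration, and SpamBayes itself uses a more robust chi-squared combining scheme, but the shape is the same:

```csharp
using System;
using System.Collections.Generic;

// Toy naive-Bayes "Sports-ness" score with add-one smoothing.
class BayesSketch
{
    static void Main()
    {
        // word frequencies observed in known Sports vs. non-Sports texts
        Dictionary<string, int> sports = new Dictionary<string, int>();
        sports["game"] = 40; sports["campus"] = 5;
        Dictionary<string, int> other = new Dictionary<string, int>();
        other["game"] = 10; other["campus"] = 30;
        int sportsTotal = 100, otherTotal = 100;

        string[] unknownText = { "game", "game", "campus" };

        // log-odds of Sports vs. Other, assuming equal priors
        double logOdds = 0.0;
        foreach (string w in unknownText)
        {
            int cs, co;
            sports.TryGetValue(w, out cs);
            other.TryGetValue(w, out co);
            // add-one smoothing keeps unseen words from zeroing the product
            logOdds += Math.Log((cs + 1.0) / (sportsTotal + 2.0))
                     - Math.Log((co + 1.0) / (otherTotal + 2.0));
        }
        Console.WriteLine(logOdds > 0 ? "Sports" : "Other");   // prints "Sports"
    }
}
```

Your approach #2 amounts to exactly this: learning the per-category word counts from the pre-classified collection instead of writing scoring rules by hand.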

Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Jun 20 '07 #8

On Wed, 20 Jun 2007 14:06:23 -0700, <ke*****@nospam.nospam> wrote:
Depending on what your background in computer science is, given that it
wasn't your major in college, you may or may not have run into the concept
of a state diagram or state graph. I'm guessing that if not, by now you
may have already done some research on the topic. It's not a complicated
idea, so if you looked it up you may already have a handle on the problem
by now.

However, because it seemed like an interesting question and because I'm
pretty good at avoiding real work by distracting myself with tinkering, I
went ahead and put together a simple state graph class that you might be
able to use. I did it for my own edification, but you are welcome to use
it if you like.

It's a generic class, and you have to create it using two types: a
"TransitionKey" which is the type that will be used to determine which
state to transition to from a given state, and a "Terminal" which is
whatever kind of data you want to get back after matching something.

In your case, I would guess that "TransitionKey" would be "char" and
"Terminal" might simply be "string" (if all you need is the actual text
found), or perhaps some kind of index used in your matching (so "int" or
some other kind of indexing value). There are methods in the class then
that take sequences of transition keys and do things with them: add a
sequence to the graph, remove a sequence from the graph, and search in a
given sequence for sub-sequences already in the graph.

As an example, let's say you had an array of strings ("rgstrFind") that
you want to find within some text ("strText"). You might use the graph
class like this:

int[] IndicesSearchText(string strText, string[] rgstrFind)
{
    StateGraph<char, int> graph = new StateGraph<char, int>();

    for (int istr = 0; istr < rgstrFind.Length; istr++)
    {
        graph.Add(rgstrFind[istr], istr);
    }

    return graph.RgobjFromCollection(strText);
}

The function would return an array of indices, corresponding to the
entries in "rgstrFind", one for each matching string (with duplicates, if
a given string matched more than once).

In the class I also included the functionality I alluded to with respect
to finding overlapping strings. There's a different method, called
RgobjFromCollectionParallel() that does the same thing as above, except
that it maintains multiple concurrent states, starting a new traversal
with each character in the string. By doing that, it will find all
instances of your search strings, even if they overlap or are contained in
one another. Again, duplicates will be in the results if any given input
string matches more than once.

I make no guarantees that the code is perfect or bug-free. I made no
effort to optimize performance, being satisfied to simply address the
high-level problem. There are a number of useful features that I didn't
bother to implement (for example, avoiding duplicates or calling a
delegate on a match instead of adding data to an array), mainly because to
me it was an academic exercise. It is entirely possible (likely, even)
that if you do a Google search on .NET and state graphs, you'll find much
better implementations than this. :) If you find problems, feel free to
comment on them here, and of course feel free to expand on the class as
you see fit (duh :)).

Here you go:

using System;
using System.Collections.Generic;
using System.Text;
using System.Collections;

namespace TestStateGraph
{
    public class StateGraph<TransitionKey, Terminal>
    {
        #region Public Methods

        #region Graph Maintenance Methods

        // Adds a new object to a given sequence of transition keys in the
        // graph.  Traverses existing transitions and adds new states and
        // transitions as necessary.
        public void Add(IEnumerable<TransitionKey> rgtkTerminal, Terminal obj)
        {
            StateNode state = _stateRoot;

            foreach (TransitionKey tk in rgtkTerminal)
            {
                state = state.NextState(tk, true);
            }

            state.AddTerminalObj(obj);
        }

        // Removes a given sequence of transition keys from the graph.
        public void Remove(TransitionKey[] rgtkTerminal, Terminal obj)
        {
            _StateRemove(rgtkTerminal, 0, obj, _stateRoot);
        }

        // Clears the entire graph
        public void Clear()
        {
            _stateRoot = new StateNode();
        }

        #endregion

        #region Graph Traversal Methods

        // Reset the current state of the graph.  Used only in conjunction
        // with the RgobjFromNextState() method.
        public void ResetGraphState()
        {
            _stateCur = _stateRoot;
        }

        // Transition the graph to the next state given a key, and return
        // the array of objects associated with the new state
        public Terminal[] RgobjFromNextState(TransitionKey tk)
        {
            _stateCur = _NodeNextState(tk, _stateCur);

            return _stateCur.RgobjItems;
        }

        // Given a sequence of transition keys, traverses the state graph
        // linearly and returns all of the objects associated with each state
        // visited during the traversal.
        public Terminal[] RgobjFromCollection(IEnumerable<TransitionKey> rgtk)
        {
            List<Terminal> lobj = new List<Terminal>();

            // start from the root (also guards the first call, when
            // _stateCur has not yet been initialized)
            ResetGraphState();

            foreach (TransitionKey tk in rgtk)
            {
                lobj.AddRange(RgobjFromNextState(tk));
            }

            return lobj.ToArray();
        }

        // Given a sequence of transition keys, returns all matching sequences
        // of transition keys represented in the graph.  This method will
        // correctly find all overlapping transition sequences, including
        // transition sequences contained by other transition sequences.
        public Terminal[] RgobjFromCollectionParallel(IEnumerable<TransitionKey> rgtk)
        {
            Queue<StateNode> qstate = new Queue<StateNode>();
            List<Terminal> lobj = new List<Terminal>();

            foreach (TransitionKey tk in rgtk)
            {
                qstate.Enqueue(_stateRoot);
                for (int i = qstate.Count; i > 0; i--)
                {
                    StateNode state = qstate.Dequeue();

                    state = _NodeNextState(tk, state);
                    lobj.AddRange(state.RgobjItems);

                    if (state != _stateRoot && state.Clink > 0)
                    {
                        qstate.Enqueue(state);
                    }
                }
            }

            return lobj.ToArray();
        }

        #endregion

        #endregion

        #region Private Members

        // The initial state of the graph
        private StateNode _stateRoot = new StateNode();

        // The current state of the graph.  Used only when the client uses
        // the "one state at a time" RgobjFromNextState() method.
        private StateNode _stateCur = null;

        // An individual node within the state graph.
        private class StateNode
        {
            // Lookup table to find the next state, given a transition key
            private Dictionary<TransitionKey, StateNode> _mpchstate =
                new Dictionary<TransitionKey, StateNode>();

            // List of objects associated with this state.  Empty for any
            // state that doesn't terminate a string.
            private List<Terminal> _lobj = new List<Terminal>();

            // Gets a copy of the list of objects associated with this state
            public Terminal[] RgobjItems
            {
                get { return _lobj.ToArray(); }
            }

            // Gets the number of objects associated with this state
            public int CobjTerminal
            {
                get { return _lobj.Count; }
            }

            // Gets the number of non-root states reachable by transition
            // from this state.
            public int Clink
            {
                get { return _mpchstate.Count; }
            }

            // Adds a new object associated with this state
            public void AddTerminalObj(Terminal obj)
            {
                _lobj.Add(obj);
            }

            // Removes an object associated with this state
            public void RemoveTerminalObj(Terminal obj)
            {
                _lobj.Remove(obj);
            }

            // Given a transition key, find the next state.  Optionally,
            // add a new state when the key isn't yet a valid transition
            // for this state.
            public StateNode NextState(TransitionKey tk, bool fAdd)
            {
                StateNode state;

                // leaves 'state' null when the key has no transition
                _mpchstate.TryGetValue(tk, out state);

                if (state == null && fAdd)
                {
                    state = new StateNode();
                    _mpchstate.Add(tk, state);
                }
                return state;
            }

            // Remove a given transition to another state from this state
            public void RemoveLink(TransitionKey tk)
            {
                _mpchstate.Remove(tk);
            }
        }

        // Removes a terminal object associated with a given sequence of
        // transition keys from the graph.  This will also remove any state
        // nodes that as a result wind up without any useful information
        // in them (either a transition to a new state, or a list of
        // objects associated with a terminal state).
        private StateNode _StateRemove(TransitionKey[] rgtkTerminal, int itk,
            Terminal obj, StateNode stateCur)
        {
            if (itk < rgtkTerminal.Length)
            {
                TransitionKey tk = rgtkTerminal[itk];
                StateNode stateNext = stateCur.NextState(tk, false);

                if (stateNext == null)
                {
                    throw new ArgumentException("String not found in graph",
                        "rgtkTerminal");
                }

                stateNext = _StateRemove(rgtkTerminal, itk + 1, obj, stateNext);

                if (stateNext == null)
                {
                    stateCur.RemoveLink(tk);
                    if (stateCur.Clink == 0)
                    {
                        stateCur = null;
                    }
                }
            }
            else
            {
                if (stateCur.CobjTerminal == 0)
                {
                    throw new ArgumentException("String not found in graph",
                        "rgtkTerminal");
                }

                stateCur.RemoveTerminalObj(obj);
                if (stateCur.CobjTerminal == 0)
                {
                    stateCur = null;
                }
            }

            return stateCur;
        }

        private StateNode _NodeNextState(TransitionKey tk, StateNode state)
        {
            StateNode stateNew = state.NextState(tk, false);

            if (stateNew == null)
            {
                stateNew = _stateRoot.NextState(tk, false);
            }

            return stateNew != null ? stateNew : _stateRoot;
        }

        #endregion
    }
}
Jun 24 '07 #9


Peter,

Just getting back to this thread. I appreciate this very much. Thanks for
taking the time to write this class out for me - super-helpful. I'll study
it at length later today.

-KF

"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...
On Wed, 20 Jun 2007 14:06:23 -0700, <ke*****@nospam.nospamwrote:
Thanks Peter. Your response is most generous and appreciated, and is
exactly
the kind of insight I was looking for. As a Fine Arts major, "computer
science" was not something I had the opportunity to study rigorously.
Learning about it now is pretty darned interesting.
Depending on what your background in computer science is, given that it
wasn't your major in college, you may or may not have run into the concept
of a state diagram or state graph. I'm guessing that if not, by now you
may have already done some research on the topic. It's not a complicated
idea, so if you looked it up you may already have a handle on the problem
by now.

However, because it seemed like an interesting question and because I'm
pretty good at avoiding real work by distracting myself with tinkering, I
went ahead and put together a simple state graph class that you might be
able to use. I did it for my own edification, but you are welcome to use
it if you like.

It's a generic class, and you have to create it using two types: a
"TransitionKey" which is the type that will be used to determine which
state to transition to from a given state, and a "Terminal" which is
whatever kind of data you want to get back after matching something.

In your case, I would guess that "TransitionKey" would be "char" and
"Terminal" might simply be "string" (if all you need is the actual text
found), or perhaps some kind of index used in your matching (so "int" or
some other kind of indexing value). There are methods in the class then
that take sequences of transition keys and do things with them: add a
sequence to the graph, remove a sequence from the graph, and search in a
given sequence for sub-sequences already in the graph.

As an example, let's say you had an array of strings ("rgstrFind") that
you want to find within some text ("strText"). You might use the graph
class like this:

int[] IndicesSearchText(string strText, string[] rgstrFind)
{
StateGraph<char, intgraph = new StateGraph<char, int>();

for (int istr = 0; istr < rgstrFind.Length; istr++)
{
graph.Add(rgstrFind[istr], istr);
}

return RgobjFromCollection(strText);
}

The function would return an array of indices, corresponding to the
entries in "rgstrFind", one for each matching string (with duplicates, if
a given string matched more than once).

In the class I also included the functionality I alluded to with respect
to finding overlapping strings. A second method,
RgobjFromCollectionParallel(), does the same thing as above, except
that it maintains multiple concurrent states, starting a new traversal
at each character in the string. By doing that, it will find all
instances of your search strings, even if they overlap or are contained in
one another. Again, duplicates will be in the results if any given input
string matches more than once.
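As a concrete illustration of what "overlapping" means here (again a standalone brute-force sketch, independent of the class below, with invented patterns): searching "hershey" for "he", "her", and "she" should yield four hits, because "her" contains "he" and "she" overlaps the second "he":

```csharp
using System;
using System.Collections.Generic;

public class OverlapDemo
{
    // Returns every (possibly overlapping or nested) occurrence of any
    // pattern in the text, by testing each pattern at each start position.
    public static List<string> AllMatches(string text, string[] patterns)
    {
        List<string> found = new List<string>();
        for (int start = 0; start < text.Length; start++)
        {
            foreach (string p in patterns)
            {
                if (p.Length <= text.Length - start &&
                    string.CompareOrdinal(text, start, p, 0, p.Length) == 0)
                {
                    found.Add(p);
                }
            }
        }
        return found;
    }

    static void Main()
    {
        List<string> hits = AllMatches("hershey", new string[] { "he", "her", "she" });
        Console.WriteLine(string.Join(",", hits)); // prints "he,her,she,he"
    }
}
```

A single linear traversal that always follows its current state would miss some of these, which is exactly the case the parallel method is meant to handle.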

I make no guarantees that the code is perfect or bug-free. I made no
effort to optimize performance, being satisfied to simply address the
high-level problem. There are a number of useful features that I didn't
bother to implement (for example, avoiding duplicates or calling a
delegate on a match instead of adding data to an array), mainly because to
me it was an academic exercise. It is entirely possible (likely, even)
that if you do a Google search on .NET and state graphs, you'll find much
better implementations than this. :) If you find problems, feel free to
comment on them here, and of course feel free to expand on the class as
you see fit (duh :) ).

Here you go:

using System;
using System.Collections.Generic;

namespace TestStateGraph
{
    public class StateGraph<TransitionKey, Terminal>
    {
        #region Public Methods

        #region Graph Maintenance Methods

        // Adds a new object to a given sequence of transition keys in the
        // graph. Traverses existing transitions and adds new states and
        // transitions as necessary.
        public void Add(IEnumerable<TransitionKey> rgtkTerminal, Terminal obj)
        {
            StateNode state = _stateRoot;

            foreach (TransitionKey tk in rgtkTerminal)
            {
                state = state.NextState(tk, true);
            }

            state.AddTerminalObj(obj);
        }

        // Removes a given sequence of transition keys from the graph.
        public void Remove(TransitionKey[] rgtkTerminal, Terminal obj)
        {
            _StateRemove(rgtkTerminal, 0, obj, _stateRoot);
        }

        // Clears the entire graph
        public void Clear()
        {
            _stateRoot = new StateNode();
        }

        #endregion

        #region Graph Traversal Methods

        // Reset the current state of the graph to the root. Call this
        // before using the RgobjFromNextState() method directly.
        public void ResetGraphState()
        {
            _stateCur = _stateRoot;
        }

        // Transition the graph to the next state given a key, and return
        // the array of objects associated with the new state
        public Terminal[] RgobjFromNextState(TransitionKey tk)
        {
            _stateCur = _NodeNextState(tk, _stateCur);

            return _stateCur.RgobjItems;
        }

        // Given a sequence of transition keys, traverses the state graph
        // linearly and returns all of the objects associated with each
        // state visited during the traversal.
        public Terminal[] RgobjFromCollection(IEnumerable<TransitionKey> rgtk)
        {
            List<Terminal> lobj = new List<Terminal>();

            // Start each traversal from the root; without this, _stateCur
            // would still be null on the first call.
            ResetGraphState();

            foreach (TransitionKey tk in rgtk)
            {
                lobj.AddRange(RgobjFromNextState(tk));
            }

            return lobj.ToArray();
        }

        // Given a sequence of transition keys, returns all matching
        // sequences of transition keys represented in the graph. This
        // method will correctly find all overlapping transition sequences,
        // including transition sequences contained by other transition
        // sequences.
        public Terminal[] RgobjFromCollectionParallel(IEnumerable<TransitionKey> rgtk)
        {
            Queue<StateNode> qstate = new Queue<StateNode>();
            List<Terminal> lobj = new List<Terminal>();

            foreach (TransitionKey tk in rgtk)
            {
                qstate.Enqueue(_stateRoot);
                for (int i = qstate.Count; i > 0; i--)
                {
                    StateNode state = qstate.Dequeue();

                    state = _NodeNextState(tk, state);
                    lobj.AddRange(state.RgobjItems);

                    if (state != _stateRoot && state.Clink > 0)
                    {
                        qstate.Enqueue(state);
                    }
                }
            }

            return lobj.ToArray();
        }

        #endregion

        #endregion

        #region Private Members

        // The initial state of the graph
        private StateNode _stateRoot = new StateNode();

        // The current state of the graph. Used only when the client uses
        // the "one state at a time" RgobjFromNextState() method.
        private StateNode _stateCur = null;

        // An individual node within the state graph.
        private class StateNode
        {
            // Lookup table to find the next state, given a transition key
            private Dictionary<TransitionKey, StateNode> _mpchstate =
                new Dictionary<TransitionKey, StateNode>();

            // List of objects associated with this state. Empty for any
            // state that doesn't terminate a sequence.
            private List<Terminal> _lobj = new List<Terminal>();

            // Gets a copy of the list of objects associated with this state
            public Terminal[] RgobjItems
            {
                get { return _lobj.ToArray(); }
            }

            // Gets the number of objects associated with this state
            public int CobjTerminal
            {
                get { return _lobj.Count; }
            }

            // Gets the number of non-root states reachable by transition
            // from this state.
            public int Clink
            {
                get { return _mpchstate.Count; }
            }

            // Adds a new object associated with this state
            public void AddTerminalObj(Terminal obj)
            {
                _lobj.Add(obj);
            }

            // Removes an object associated with this state
            public void RemoveTerminalObj(Terminal obj)
            {
                _lobj.Remove(obj);
            }

            // Given a transition key, find the next state. Optionally,
            // add a new state when the key isn't yet a valid transition
            // for this state.
            public StateNode NextState(TransitionKey tk, bool fAdd)
            {
                StateNode state;

                // TryGetValue avoids the cost of catching
                // KeyNotFoundException; "state" is null when the key
                // is absent.
                _mpchstate.TryGetValue(tk, out state);

                if (state == null && fAdd)
                {
                    state = new StateNode();
                    _mpchstate.Add(tk, state);
                }
                return state;
            }

            // Remove a given transition to another state from this state
            public void RemoveLink(TransitionKey tk)
            {
                _mpchstate.Remove(tk);
            }
        }

        // Removes a terminal object associated with a given sequence of
        // transition keys from the graph. This will also remove any state
        // nodes that as a result wind up without any useful information
        // in them (either a transition to a new state, or a list of
        // objects associated with a terminal state).
        private StateNode _StateRemove(TransitionKey[] rgtkTerminal, int itk,
            Terminal obj, StateNode stateCur)
        {
            if (itk < rgtkTerminal.Length)
            {
                TransitionKey tk = rgtkTerminal[itk];
                StateNode stateNext = stateCur.NextState(tk, false);

                if (stateNext == null)
                {
                    throw new ArgumentException("String not found in graph",
                        "rgtkTerminal");
                }

                stateNext = _StateRemove(rgtkTerminal, itk + 1, obj, stateNext);

                if (stateNext == null)
                {
                    stateCur.RemoveLink(tk);

                    // Only prune this node if it carries no remaining
                    // links and no terminal objects.
                    if (stateCur.Clink == 0 && stateCur.CobjTerminal == 0)
                    {
                        stateCur = null;
                    }
                }
            }
            else
            {
                if (stateCur.CobjTerminal == 0)
                {
                    throw new ArgumentException("String not found in graph",
                        "rgtkTerminal");
                }

                stateCur.RemoveTerminalObj(obj);

                // Only prune this node if it carries no remaining
                // terminal objects and no links.
                if (stateCur.CobjTerminal == 0 && stateCur.Clink == 0)
                {
                    stateCur = null;
                }
            }

            return stateCur;
        }

        private StateNode _NodeNextState(TransitionKey tk, StateNode state)
        {
            StateNode stateNew = state.NextState(tk, false);

            if (stateNew == null)
            {
                stateNew = _stateRoot.NextState(tk, false);
            }

            return stateNew != null ? stateNew : _stateRoot;
        }

        #endregion
    }
}

Jun 25 '07 #10
