471,325 Members | 1,702 Online
Bytes | Software Development & Data Engineering Community
Post +

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 471,325 software developers and data experts.

Checking if a list of names appears in a body of text.

I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Any hints would be much appreciated!

--Brent
Jun 27 '08 #1
10 1272
On Fri, 02 May 2008 17:23:21 -0700, Brent <wr********@gmail.comwrote:
I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?
This comes up here surprisingly often. :)

For the general case, so far I've yet to see someone suggest a better
solution that the one I've proposed in the past: using a state graph.
This approach assumes that you've got a static list that you can use to
initialize your state graph once, and then use the same graph to process
multiple input over and over. The cost of creating the graph would be
prohibitive if you had to create it for each iteration of input.

In an earlier thread, I posted some code that provided a generic
implementation of a state graph. I'm not claiming it's the best
implementation, but it does work. :) You can find that message here:
http://groups.google.com/group/micro...06f696d4500b77

VERY IMPORTANT! That code had a performance bug in it that severely
limited its usefulness. The original person I'd posted the code for never
bothered to comment on it, so I never even noticed until much later, when
I recommended the same code to someone else and they complained that it
wasn't as fast as I'd said it should be. You can find my follow-up post
in which I included the bug-fix for the earlier code here:
http://groups.google.com/group/micro...50505b568a75fd

If you care about performance (and obviously you do :) ), do NOT use the
code I originally posted without including the later bug-fix.

You may want to review both threads for comments from other people. A
number of other folks contributed to the threads, and they had insightful
and useful comments, especially pertaining to special cases where you can
get good performance by knowing something particular about the input. The
state graph performs well as a general-purpose solution, but sometimes you
can get equal or better performance with a different solution that takes
advantage of some particular known characteristic of the input.

Pete
Jun 27 '08 #2
Very interesting concept, Peter. I'll be playing with this over the weekend.
Peter
"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...
On Fri, 02 May 2008 17:23:21 -0700, Brent <wr********@gmail.comwrote:
>I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

This comes up here surprisingly often. :)

For the general case, so far I've yet to see someone suggest a better
solution that the one I've proposed in the past: using a state graph.
This approach assumes that you've got a static list that you can use to
initialize your state graph once, and then use the same graph to process
multiple input over and over. The cost of creating the graph would be
prohibitive if you had to create it for each iteration of input.

In an earlier thread, I posted some code that provided a generic
implementation of a state graph. I'm not claiming it's the best
implementation, but it does work. :) You can find that message here:
http://groups.google.com/group/micro...06f696d4500b77

VERY IMPORTANT! That code had a performance bug in it that severely
limited its usefulness. The original person I'd posted the code for never
bothered to comment on it, so I never even noticed until much later, when
I recommended the same code to someone else and they complained that it
wasn't as fast as I'd said it should be. You can find my follow-up post
in which I included the bug-fix for the earlier code here:
http://groups.google.com/group/micro...50505b568a75fd

If you care about performance (and obviously you do :) ), do NOT use the
code I originally posted without including the later bug-fix.

You may want to review both threads for comments from other people. A
number of other folks contributed to the threads, and they had insightful
and useful comments, especially pertaining to special cases where you can
get good performance by knowing something particular about the input. The
state graph performs well as a general-purpose solution, but sometimes you
can get equal or better performance with a different solution that takes
advantage of some particular known characteristic of the input.

Pete
Jun 27 '08 #3
Brent wrote:
I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).
Try something like:

public static string[] Find2(string[] lst, string txt)
{
HashSet<stringhs = new HashSet<string>(txt.Split(' ', ',', '.'));
return Array.FindAll<string>(lst, (string s) =hs.Contains(s));
}

assuming you are on 3.5 !

Arne
Jun 27 '08 #4
have you considered a decision tree built from your list of company names?

Colin =^.^=

"Brent" <wr********@gmail.comwrote in message
news:41**********************************@y22g2000 prd.googlegroups.com...
>I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Any hints would be much appreciated!

--Brent

Jun 27 '08 #5

"Brent" <wr********@gmail.comwrote in message
news:41**********************************@y22g2000 prd.googlegroups.com...
>I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Any hints would be much appreciated!
Finding the phrase
How do we use the MakePattern method to find our phrase? Let's suppose that
we aren't interested in where the phrase occurs, or whether it occurs
several times, but just whether or not it appears at all. So our approach
will be to take the original phrase, turn it into a pattern, match the
pattern, and return true if the pattern has been matched:

public Boolean PhraseFound(String argPhrase, String argText)

{

String strPattern = MakePattern(argPhrase);

Match match = Regex.Match(argText, strPattern);

return match.Success;

}

I used the Regex.Match to find the occurrence of a word in a text field or
variable. You can also use the features of Regex that will find the
positions of the words so that can use something like a RichTextBox and
position to the word or words in the textbox and highlight all the words.

Jun 27 '08 #6
On Fri, 2 May 2008 17:23:21 -0700 (PDT), Brent <wr********@gmail.com>
wrote:
>I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Any hints would be much appreciated!

--Brent
As an alternative have a look at the Rabin-Karp string searching
algorithm. http://en.wikipedia.org/wiki/Rabin-Karp

It performs well when searching a text for multiple targets, as in
your case.

rossum

Jun 27 '08 #7
On Sat, 03 May 2008 06:13:52 -0700, rossum <ro******@coldmail.comwrote:
As an alternative have a look at the Rabin-Karp string searching
algorithm. http://en.wikipedia.org/wiki/Rabin-Karp

It performs well when searching a text for multiple targets, as in
your case.
However, it still has to visit each character in the input string many
times (for interior characters -- i.e. those not at the very beginning or
end of the input, M times where M is the length of the longest match
string). Also, to deal with match strings of varying length, it needs to
compute a hash value for every substring of the input string of every
possible match length.

If you're implementing your own hash function and table, this isn't so
bad, since you'd easily be able to use the intermediate results as you
work your way toward the longest possible match length. However, in the
context of .NET the natural approach would be to use the built-in string
hashing, requiring a new hash computation for each possible substring.
That could be prohibitive.

In case it wasn't clear (and as I noted in one of the previous threads I
referenced, I admit that the code I posted isn't necessarily the most
clear), one of the state graph functions (RgobjFromCollectionParallel())
will return all possible matches for multiple match strings, even when
there might be overlapping results.

I agree it's not a clear win over the Rabin-Karp approach, since the
"parallel" traversal winds up having to update multiple graph positions
per input string character, even though it only visits each input string
character once. But even the worst case would be no worse than
Rabin-Karp, since in that algorithm you're assured of nearly N*M
operations (where N is the input string string and M is the maximum match
string length), whereas with the state graph, you only wind up with that
many if by some miracle you have to traverse the entire graph starting
with each input character (that would be pretty weird input data :) ).

One reason I like the state graph is that it's conceptually fairly easy to
understand. (Well, _I_ think it is anyway :) ). Also, if you know that
your input string has no overlapping matches, and you know that none of
your match strings contain any other match string, you can even process
the input in a single pass (rather than running parallel traversals, even
as efficient though that can be).

I think it's entirely possible that a fully-optimized, completely custom
Rabin-Karp implementation could well out-perform the generic state graph
code I posted. But this is .NET. Who wants to rewrite all that hash
table stuff and squeeze the last bit of performance out of the
implementation when a nice, .NET-friendly algorithm works fine. :)

Pete
Jun 27 '08 #8
On Sat, 03 May 2008 10:03:52 -0700, "Peter Duniho"
<Np*********@nnowslpianmk.comwrote:
>On Sat, 03 May 2008 06:13:52 -0700, rossum <ro******@coldmail.comwrote:
>As an alternative have a look at the Rabin-Karp string searching
algorithm. http://en.wikipedia.org/wiki/Rabin-Karp

It performs well when searching a text for multiple targets, as in
your case.

However, it still has to visit each character in the input string many
times (for interior characters -- i.e. those not at the very beginning or
end of the input, M times where M is the length of the longest match
string). Also, to deal with match strings of varying length, it needs to
compute a hash value for every substring of the input string of every
possible match length.
Yes, you do need multiple hashes, but you do not need to visit
characters more than once. Since you know in advance the lengths you
will need, you can retain just enough characters to calculate the
required number of rolling hashes, say 3 chars, 4 chars and 5 chars
from the starting character. That just needs three character
variables.
>
If you're implementing your own hash function and table, this isn't so
bad, since you'd easily be able to use the intermediate results as you
work your way toward the longest possible match length.
That is essential in Rabin-Karp. You use a rolling hash so it is easy
to calculate a hash without having to recalculate the entire hash, you
just calculate the difference between the old hash and the new hash.
>However, in the
context of .NET the natural approach would be to use the built-in string
hashing, requiring a new hash computation for each possible substring.
That could be prohibitive.
It would be, which is why the R-K algorithm avoids it. You can use
any reasonable hash function that you want, and the rolling hash is a
resonable hash function.
[snip]
>
I think it's entirely possible that a fully-optimized, completely custom
Rabin-Karp implementation could well out-perform the generic state graph
code I posted. But this is .NET. Who wants to rewrite all that hash
table stuff and squeeze the last bit of performance out of the
implementation when a nice, .NET-friendly algorithm works fine. :)
Why would you need to rewrite Hashtable? Just calculate your own
integer rolling hash value and pass it to a Hashtable as the key
value.

rossum
>
Pete
Jun 27 '08 #9
On Sat, 03 May 2008 13:02:27 -0700, rossum <ro******@coldmail.comwrote:
Yes, you do need multiple hashes, but you do not need to visit
characters more than once. Since you know in advance the lengths you
will need, you can retain just enough characters to calculate the
required number of rolling hashes, say 3 chars, 4 chars and 5 chars
from the starting character. That just needs three character
variables.
I see...I missed that part of the description the first time I read the
Wikipedia article.

That said, it still looks to me as though you need to keep a list of
hashes, one for each possible length of match string (each of which still
needs to be updated for each new character in the input string). I remain
unconvinced that this is going to be significantly more performant than a
simple state graph, and it seems more complicated to understand. And in
either case, it seems like we've simply traded visiting characters in the
input string multiple times for visiting individual rolling hash codes
multiple times (to update each one once per input character).

But that's just me. With the rolling hashes, I can see how it has a
different implementation (one that doesn't require a custom hash table),
if not less cost, than I previously assumed, and if it seems appropriate
to the OP, far be it from me to argue. :)

Thanks for the clarification.

Pete
Jun 27 '08 #10
Tomas Petricek has a very nice article with sample code along these lines
here:
http://tomasp.net/articles/ahocorasick.aspx
Peter
"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...
On Fri, 02 May 2008 17:23:21 -0700, Brent <wr********@gmail.comwrote:
>I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

This comes up here surprisingly often. :)

For the general case, so far I've yet to see someone suggest a better
solution that the one I've proposed in the past: using a state graph.
This approach assumes that you've got a static list that you can use to
initialize your state graph once, and then use the same graph to process
multiple input over and over. The cost of creating the graph would be
prohibitive if you had to create it for each iteration of input.

In an earlier thread, I posted some code that provided a generic
implementation of a state graph. I'm not claiming it's the best
implementation, but it does work. :) You can find that message here:
http://groups.google.com/group/micro...06f696d4500b77

VERY IMPORTANT! That code had a performance bug in it that severely
limited its usefulness. The original person I'd posted the code for never
bothered to comment on it, so I never even noticed until much later, when
I recommended the same code to someone else and they complained that it
wasn't as fast as I'd said it should be. You can find my follow-up post
in which I included the bug-fix for the earlier code here:
http://groups.google.com/group/micro...50505b568a75fd

If you care about performance (and obviously you do :) ), do NOT use the
code I originally posted without including the later bug-fix.

You may want to review both threads for comments from other people. A
number of other folks contributed to the threads, and they had insightful
and useful comments, especially pertaining to special cases where you can
get good performance by knowing something particular about the input. The
state graph performs well as a general-purpose solution, but sometimes you
can get equal or better performance with a different solution that takes
advantage of some particular known characteristic of the input.

Pete
Jun 27 '08 #11

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

5 posts views Thread by Andrew V. Romero | last post: by
6 posts views Thread by Haines Brown | last post: by
6 posts views Thread by dam_fool_2003 | last post: by
10 posts views Thread by Fredrik Tolf | last post: by
6 posts views Thread by ahart | last post: by
reply views Thread by rosydwin | last post: by

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.