Checking if a list of names appears in a body of text.

Brent

I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Any hints would be much appreciated!

--Brent

Jun 27 '08 #1

Subscribe Post Reply

1379

Peter Duniho

On Fri, 02 May 2008 17:23:21 -0700, Brent <wr********@gmail.comwrote:

I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

This comes up here surprisingly often. :)

For the general case, so far I've yet to see someone suggest a better
solution that the one I've proposed in the past: using a state graph.
This approach assumes that you've got a static list that you can use to
initialize your state graph once, and then use the same graph to process
multiple input over and over. The cost of creating the graph would be
prohibitive if you had to create it for each iteration of input.

In an earlier thread, I posted some code that provided a generic
implementation of a state graph. I'm not claiming it's the best
implementation, but it does work. :) You can find that message here:
http://groups.google.com/group/micro...06f696d4500b77

VERY IMPORTANT! That code had a performance bug in it that severely
limited its usefulness. The original person I'd posted the code for never
bothered to comment on it, so I never even noticed until much later, when
I recommended the same code to someone else and they complained that it
wasn't as fast as I'd said it should be. You can find my follow-up post
in which I included the bug-fix for the earlier code here:
http://groups.google.com/group/micro...50505b568a75fd

If you care about performance (and obviously you do :) ), do NOT use the
code I originally posted without including the later bug-fix.

You may want to review both threads for comments from other people. A
number of other folks contributed to the threads, and they had insightful
and useful comments, especially pertaining to special cases where you can
get good performance by knowing something particular about the input. The
state graph performs well as a general-purpose solution, but sometimes you
can get equal or better performance with a different solution that takes
advantage of some particular known characteristic of the input.

Pete

Jun 27 '08 #2

Peter Bromberg [C# MVP]

Very interesting concept, Peter. I'll be playing with this over the weekend.
Peter
"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...

On Fri, 02 May 2008 17:23:21 -0700, Brent <wr********@gmail.comwrote:

>I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

This comes up here surprisingly often. :)

For the general case, so far I've yet to see someone suggest a better
solution that the one I've proposed in the past: using a state graph.
This approach assumes that you've got a static list that you can use to
initialize your state graph once, and then use the same graph to process
multiple input over and over. The cost of creating the graph would be
prohibitive if you had to create it for each iteration of input.

In an earlier thread, I posted some code that provided a generic
implementation of a state graph. I'm not claiming it's the best
implementation, but it does work. :) You can find that message here:
http://groups.google.com/group/micro...06f696d4500b77

VERY IMPORTANT! That code had a performance bug in it that severely
limited its usefulness. The original person I'd posted the code for never
bothered to comment on it, so I never even noticed until much later, when
I recommended the same code to someone else and they complained that it
wasn't as fast as I'd said it should be. You can find my follow-up post
in which I included the bug-fix for the earlier code here:
http://groups.google.com/group/micro...50505b568a75fd

If you care about performance (and obviously you do :) ), do NOT use the
code I originally posted without including the later bug-fix.

You may want to review both threads for comments from other people. A
number of other folks contributed to the threads, and they had insightful
and useful comments, especially pertaining to special cases where you can
get good performance by knowing something particular about the input. The
state graph performs well as a general-purpose solution, but sometimes you
can get equal or better performance with a different solution that takes
advantage of some particular known characteristic of the input.

Pete

Jun 27 '08 #3

=?ISO-8859-1?Q?Arne_Vajh=F8j?=

Brent wrote:

I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Try something like:

public static string[] Find2(string[] lst, string txt)
{
HashSet<stringhs = new HashSet<string>(txt.Split(' ', ',', '.'));
return Array.FindAll<string>(lst, (string s) =hs.Contains(s));
}

assuming you are on 3.5 !

Arne

Jun 27 '08 #4

colin

have you considered a decision tree built from your list of company names?

Colin =^.^=

"Brent" <wr********@gmail.comwrote in message
news:41**********************************@y22g2000 prd.googlegroups.com...

>I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Any hints would be much appreciated!

--Brent

Jun 27 '08 #5

Mr. Arnold

"Brent" <wr********@gmail.comwrote in message
news:41**********************************@y22g2000 prd.googlegroups.com...

>I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Any hints would be much appreciated!

Finding the phrase
How do we use the MakePattern method to find our phrase? Let's suppose that
we aren't interested in where the phrase occurs, or whether it occurs
several times, but just whether or not it appears at all. So our approach
will be to take the original phrase, turn it into a pattern, match the
pattern, and return true if the pattern has been matched:

public Boolean PhraseFound(String argPhrase, String argText)

{

String strPattern = MakePattern(argPhrase);

Match match = Regex.Match(argText, strPattern);

return match.Success;

}

I used the Regex.Match to find the occurrence of a word in a text field or
variable. You can also use the features of Regex that will find the
positions of the words so that can use something like a RichTextBox and
position to the word or words in the textbox and highlight all the words.

Jun 27 '08 #6

rossum

On Fri, 2 May 2008 17:23:21 -0700 (PDT), Brent <wr********@gmail.com>
wrote:

>I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Any hints would be much appreciated!

--Brent

As an alternative have a look at the Rabin-Karp string searching
algorithm. http://en.wikipedia.org/wiki/Rabin-Karp

It performs well when searching a text for multiple targets, as in
your case.

rossum

Jun 27 '08 #7

Peter Duniho

On Sat, 03 May 2008 06:13:52 -0700, rossum <ro******@coldmail.comwrote:

As an alternative have a look at the Rabin-Karp string searching
algorithm. http://en.wikipedia.org/wiki/Rabin-Karp

It performs well when searching a text for multiple targets, as in
your case.

However, it still has to visit each character in the input string many
times (for interior characters -- i.e. those not at the very beginning or
end of the input, M times where M is the length of the longest match
string). Also, to deal with match strings of varying length, it needs to
compute a hash value for every substring of the input string of every
possible match length.

If you're implementing your own hash function and table, this isn't so
bad, since you'd easily be able to use the intermediate results as you
work your way toward the longest possible match length. However, in the
context of .NET the natural approach would be to use the built-in string
hashing, requiring a new hash computation for each possible substring.
That could be prohibitive.

In case it wasn't clear (and as I noted in one of the previous threads I
referenced, I admit that the code I posted isn't necessarily the most
clear), one of the state graph functions (RgobjFromCollectionParallel())
will return all possible matches for multiple match strings, even when
there might be overlapping results.

I agree it's not a clear win over the Rabin-Karp approach, since the
"parallel" traversal winds up having to update multiple graph positions
per input string character, even though it only visits each input string
character once. But even the worst case would be no worse than
Rabin-Karp, since in that algorithm you're assured of nearly N*M
operations (where N is the input string string and M is the maximum match
string length), whereas with the state graph, you only wind up with that
many if by some miracle you have to traverse the entire graph starting
with each input character (that would be pretty weird input data :) ).

One reason I like the state graph is that it's conceptually fairly easy to
understand. (Well, _I_ think it is anyway :) ). Also, if you know that
your input string has no overlapping matches, and you know that none of
your match strings contain any other match string, you can even process
the input in a single pass (rather than running parallel traversals, even
as efficient though that can be).

I think it's entirely possible that a fully-optimized, completely custom
Rabin-Karp implementation could well out-perform the generic state graph
code I posted. But this is .NET. Who wants to rewrite all that hash
table stuff and squeeze the last bit of performance out of the
implementation when a nice, .NET-friendly algorithm works fine. :)

Pete

Jun 27 '08 #8

rossum

On Sat, 03 May 2008 10:03:52 -0700, "Peter Duniho"
<Np*********@nnowslpianmk.comwrote:

>On Sat, 03 May 2008 06:13:52 -0700, rossum <ro******@coldmail.comwrote:

>As an alternative have a look at the Rabin-Karp string searching
algorithm. http://en.wikipedia.org/wiki/Rabin-Karp

It performs well when searching a text for multiple targets, as in
your case.

However, it still has to visit each character in the input string many
times (for interior characters -- i.e. those not at the very beginning or
end of the input, M times where M is the length of the longest match
string). Also, to deal with match strings of varying length, it needs to
compute a hash value for every substring of the input string of every
possible match length.

Yes, you do need multiple hashes, but you do not need to visit
characters more than once. Since you know in advance the lengths you
will need, you can retain just enough characters to calculate the
required number of rolling hashes, say 3 chars, 4 chars and 5 chars
from the starting character. That just needs three character
variables.

>
If you're implementing your own hash function and table, this isn't so
bad, since you'd easily be able to use the intermediate results as you
work your way toward the longest possible match length.

That is essential in Rabin-Karp. You use a rolling hash so it is easy
to calculate a hash without having to recalculate the entire hash, you
just calculate the difference between the old hash and the new hash.

>However, in the
context of .NET the natural approach would be to use the built-in string
hashing, requiring a new hash computation for each possible substring.
That could be prohibitive.

It would be, which is why the R-K algorithm avoids it. You can use
any reasonable hash function that you want, and the rolling hash is a
resonable hash function.
[snip]

>
I think it's entirely possible that a fully-optimized, completely custom
Rabin-Karp implementation could well out-perform the generic state graph
code I posted. But this is .NET. Who wants to rewrite all that hash
table stuff and squeeze the last bit of performance out of the
implementation when a nice, .NET-friendly algorithm works fine. :)

Why would you need to rewrite Hashtable? Just calculate your own
integer rolling hash value and pass it to a Hashtable as the key
value.

rossum

>
Pete

Jun 27 '08 #9

Peter Duniho

On Sat, 03 May 2008 13:02:27 -0700, rossum <ro******@coldmail.comwrote:

Yes, you do need multiple hashes, but you do not need to visit
characters more than once. Since you know in advance the lengths you
will need, you can retain just enough characters to calculate the
required number of rolling hashes, say 3 chars, 4 chars and 5 chars
from the starting character. That just needs three character
variables.

I see...I missed that part of the description the first time I read the
Wikipedia article.

That said, it still looks to me as though you need to keep a list of
hashes, one for each possible length of match string (each of which still
needs to be updated for each new character in the input string). I remain
unconvinced that this is going to be significantly more performant than a
simple state graph, and it seems more complicated to understand. And in
either case, it seems like we've simply traded visiting characters in the
input string multiple times for visiting individual rolling hash codes
multiple times (to update each one once per input character).

But that's just me. With the rolling hashes, I can see how it has a
different implementation (one that doesn't require a custom hash table),
if not less cost, than I previously assumed, and if it seems appropriate
to the OP, far be it from me to argue. :)

Thanks for the clarification.

Pete

Jun 27 '08 #10

Peter Bromberg [C# MVP]

Tomas Petricek has a very nice article with sample code along these lines
here:
http://tomasp.net/articles/ahocorasick.aspx
Peter
"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...

On Fri, 02 May 2008 17:23:21 -0700, Brent <wr********@gmail.comwrote:

>I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

This comes up here surprisingly often. :)

For the general case, so far I've yet to see someone suggest a better
solution that the one I've proposed in the past: using a state graph.
This approach assumes that you've got a static list that you can use to
initialize your state graph once, and then use the same graph to process
multiple input over and over. The cost of creating the graph would be
prohibitive if you had to create it for each iteration of input.

In an earlier thread, I posted some code that provided a generic
implementation of a state graph. I'm not claiming it's the best
implementation, but it does work. :) You can find that message here:
http://groups.google.com/group/micro...06f696d4500b77

VERY IMPORTANT! That code had a performance bug in it that severely
limited its usefulness. The original person I'd posted the code for never
bothered to comment on it, so I never even noticed until much later, when
I recommended the same code to someone else and they complained that it
wasn't as fast as I'd said it should be. You can find my follow-up post
in which I included the bug-fix for the earlier code here:
http://groups.google.com/group/micro...50505b568a75fd

If you care about performance (and obviously you do :) ), do NOT use the
code I originally posted without including the later bug-fix.

You may want to review both threads for comments from other people. A
number of other folks contributed to the threads, and they had insightful
and useful comments, especially pertaining to special cases where you can
get good performance by knowing something particular about the input. The
state graph performs well as a general-purpose solution, but sometimes you
can get equal or better performance with a different solution that takes
advantage of some particular known characteristic of the input.

Pete

Jun 27 '08 #11

Similar topics

How to Convert text list into 2d array

by: Andrew V. Romero | last post by:

At work we have an excel file that contains the list of medications and their corresponding strengths. I would like to save the excel file as a text list and paste this list into a javascript...

Javascript

list-style-image margin problem

by: Haines Brown | last post by:

I find that when I use list-style-image with galeon or mozilla, padding is inserted between the symbol image and the following list text, while under IE 5.0 it seems to be inserted before the image...

HTML / CSS

Linked list with void *

by: dam_fool_2003 | last post by:

Hai, I thank those who helped me to create a single linked list with int type. Now I wanted to try out for a void* type. Below is the code: #include<stdlib.h> #include<stdio.h>...

C / C++

Combining two text field into a long list

by: justin tyme | last post by:

Hello Experts! I would like to combine (which may not be the correct technical term) two text fields from the same table in a query. Specifically, text field A and text field B are both lists of...

Microsoft Access / VBA

Checking function calls

by: Fredrik Tolf | last post by:

If I have a variable which points to a function, can I check if certain argument list matches what the function wants before or when calling it? Currently, I'm trying to catch a TypeError when...

Python

list/classmethod problems

by: ahart | last post by:

I'm pretty new to python and am trying to write a fairly small application to learn more about the language. I'm noticing some unexpected behavior in using lists in some classes to hold child...

Python

Checking individual Radio Buttons with the same GroupName doesn't work

by: Shadow Lynx | last post by:

In standard HTML, the <INPUT type="radio" name="x" /control only allows one radio button to be checked at a time. When more than one are set as checked="true" then only the last one rendered...

ASP.NET

Failed to connect to mail server error for mailing list using PHP

by: valeberry | last post by:

//Index.php <html><head><title>Mailing List Administration</title></head><body> <br> <center><H1>Mailing List Administration</H1></center> Send an email to a mailing list: <form method=post...

PHP

checking one checkbox with other Checkbox present in different views of a multiview

by: cvijaykrishna | last post by:

i am having a web based application and i am having a problem with it pls check it Explanationi am sending a sample code plese see it in VS-2005 FOR BETTER UNFERSTANDING I have a main page...

.NET Framework

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice