I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of strings.
So, think of it in terms of this: I have a string such as the following:
"Smith said the dog belongs (or did belong) to the secretary, an employee of
the company."
Then the list contains the following (and this list is MUCH larger in the
real situation):
Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu
I would need to stop the processing and return (true) as soon as Smith was
found. On the other hand, if the string was changed to the following, there
would be no match and I would return (false):
"Smitherson said the dog belongs (or did belong) to the secretary, an
employee of the company."
In the given string, do know that the matches should begin at a given point
(zero position), but I need to keep processing until I have exhausted the
candidate string in the list - as shown above - to prevent a false match.
I have played around with some different data structures, such as prefix and
suffix trees, an these work in the case that you have a string that you are
trying to match in a list, not vice versa. The approach is required to be
very performant because it will be evaluated millions of times. I am also
okay with an unsafe code approach that works. I just need the evaluations to
terminate as soon as possible rather than looping through every single item
in the list. Even an IndexOf is too slow. 14 2227
I think RegEx will help you out here. As far as not iterating over the
entire list, I've no clue how you can selectively not fall through given the
only match in the list is the last item in the list or when there is no
match at all.
"Karch" <news.microsoft .comwrote in message
news:e9******** ******@TK2MSFTN GP06.phx.gbl...
>I need to find the fastest way in terms of storage and searching to determine if a given string contains one of a member of a list of strings. So, think of it in terms of this: I have a string such as the following:
"Smith said the dog belongs (or did belong) to the secretary, an employee
of the company."
Then the list contains the following (and this list is MUCH larger in the
real situation):
Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu
I would need to stop the processing and return (true) as soon as Smith was
found. On the other hand, if the string was changed to the following,
there would be no match and I would return (false):
"Smitherson said the dog belongs (or did belong) to the secretary, an
employee of the company."
In the given string, do know that the matches should begin at a given
point (zero position), but I need to keep processing until I have
exhausted the candidate string in the list - as shown above - to prevent a
false match.
I have played around with some different data structures, such as prefix
and suffix trees, an these work in the case that you have a string that
you are trying to match in a list, not vice versa. The approach is
required to be very performant because it will be evaluated millions of
times. I am also okay with an unsafe code approach that works. I just need
the evaluations to terminate as soon as possible rather than looping
through every single item in the list. Even an IndexOf is too slow.
Karch wrote:
I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of strings.
So, think of it in terms of this: I have a string such as the following:
"Smith said the dog belongs (or did belong) to the secretary, an employee of
the company."
Then the list contains the following (and this list is MUCH larger in the
real situation):
Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu
Could you give us some idea of the likely length of the list? And
whether it's static or dynamic ?
jd
On Wed, 05 Mar 2008 09:05:22 -0800, Karch <news.microsoft .comwrote:
I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of
strings.
I don't know if RegEx has optimizations for dealing with this sort of
thing. It could, but given how long a search string for RegEx could be if
you concatenate all of your possible matches, maybe it doesn't.
That'd be my first try though. It seems like the actual RegEx search
string would be simple (just "or" all of your possible matches together).
If it performs well enough, there you go.
If not, I'd guess there's a fair amount of research into algorithms like
this, but a fairly basic approach is a state graph. It's memory
intensive, but rewards you with good performance. This came up awhile ago
and I posted some sample code to get someone else started. You can find
that post here: http://groups.google.com/group/micro...06f696d4500b77
I've referred to this post a couple of other times, and while I've never
had anyone say it actually turned out to be useful, they've never said it
wasn't either. :)
I suppose if I'm going to keep referring people to it, maybe I ought to
clean it up a little bit more. But what's there does work and should be
enough to get you pointed in the right direction.
Pete
One problem I see with this is that you apparently are not doing
simply a string search, but some kind of word search, without defining
what a word is.
You say the "Smithperso n said ..." should not have any matches, so
while the string Smith is a substring, it does not match.
Would it match:
(Smith
(Smith)
Smith,
Smith;
Smith-
smith
etc.
The answer to this will probably constrian your possible algorithms.
If it didn't get too many false hits, doing a string search first and
then re-checking to see if the matches are valid word matches might be
OK.
On Wed, 5 Mar 2008 11:05:22 -0600, "Karch" <news.microsoft .comwrote:
>I need to find the fastest way in terms of storage and searching to determine if a given string contains one of a member of a list of strings. So, think of it in terms of this: I have a string such as the following:
"Smith said the dog belongs (or did belong) to the secretary, an employee of the company."
Then the list contains the following (and this list is MUCH larger in the real situation):
Adams Alan Jones Jacobs Smith Thaxton Vela Zulu
I would need to stop the processing and return (true) as soon as Smith was found. On the other hand, if the string was changed to the following, there would be no match and I would return (false):
"Smitherson said the dog belongs (or did belong) to the secretary, an employee of the company."
In the given string, do know that the matches should begin at a given point (zero position), but I need to keep processing until I have exhausted the candidate string in the list - as shown above - to prevent a false match.
I have played around with some different data structures, such as prefix and suffix trees, an these work in the case that you have a string that you are trying to match in a list, not vice versa. The approach is required to be very performant because it will be evaluated millions of times. I am also okay with an unsafe code approach that works. I just need the evaluations to terminate as soon as possible rather than looping through every single item in the list. Even an IndexOf is too slow.
The list is static (read into memory during initialization) and could be up
to approximately 1000 items (hence the need for a fast method)
"John Daragon" <jo**@argv.co.u kwrote in message
news:47******** *************** @news.zen.co.uk ...
Karch wrote:
>I need to find the fastest way in terms of storage and searching to determine if a given string contains one of a member of a list of strings. So, think of it in terms of this: I have a string such as the following:
"Smith said the dog belongs (or did belong) to the secretary, an employee of the company."
Then the list contains the following (and this list is MUCH larger in the real situation):
Adams Alan Jones Jacobs Smith Thaxton Vela Zulu
Could you give us some idea of the likely length of the list? And whether
it's static or dynamic ?
jd
I think the StateGraph approach is going to work best. I have an
implementation up and running and should be able to get some timings done
today. I don't think Regex is an option for me because of performance and
the fact that, although the list itself is static, it will be loaded into
memory at run-time. So, the expression building would be complex, not to
mention eating cycles just to prep.
"Peter Duniho" <Np*********@nn owslpianmk.comw rote in message
news:op******** *******@petes-computer.local. ..
On Wed, 05 Mar 2008 09:05:22 -0800, Karch <news.microsoft .comwrote:
>I need to find the fastest way in terms of storage and searching to determine if a given string contains one of a member of a list of strings.
I don't know if RegEx has optimizations for dealing with this sort of
thing. It could, but given how long a search string for RegEx could be if
you concatenate all of your possible matches, maybe it doesn't.
That'd be my first try though. It seems like the actual RegEx search
string would be simple (just "or" all of your possible matches together).
If it performs well enough, there you go.
If not, I'd guess there's a fair amount of research into algorithms like
this, but a fairly basic approach is a state graph. It's memory
intensive, but rewards you with good performance. This came up awhile ago
and I posted some sample code to get someone else started. You can find
that post here: http://groups.google.com/group/micro...06f696d4500b77
I've referred to this post a couple of other times, and while I've never
had anyone say it actually turned out to be useful, they've never said it
wasn't either. :)
I suppose if I'm going to keep referring people to it, maybe I ought to
clean it up a little bit more. But what's there does work and should be
enough to get you pointed in the right direction.
Pete
On Thu, 06 Mar 2008 10:08:22 -0800, Karch <news.microsoft .comwrote:
I think the StateGraph approach is going to work best. I have an
implementation up and running and should be able to get some timings done
today.
Have you tried the dictionary approach suggested by a couple of others?
I wasn't paying attention when I first read your question and failed to
note that your search is delimited by spaces. Or, at least in the
examples you provided it was.
I think that if your data is actually restricted like that, the dictionary
lookup might be as fast as a state graph and it would be a lot simpler in
terms of implementation.
Just a thought. Obviously I really like state graphs :), but when there
is information you know about the input data that allows you to constrain
the problem a bit (e.g. by using spaces to identify the beginning and
ending of any possible match), it's often possible to solve the problem
efficiently without using a state graph.
Pete
On Thu, 06 Mar 2008 10:12:51 -0800, Karch <news.microsoft .comwrote:
The Hashtable would work great alongside a tokenized string, but the
problem
is that I don't have a distinct delimiter since the string to be matched
can
span multiple words. It's really a question of "do any of these strings
appear in this source string", irrespective of whitespace.
Then how do you distinguish "Smith" from "Smitherson ", as in your
example? Can you at least make a requirement that the search pattern will
always begin and end on a space, even if it includes spaces within?
Pete
Mufaka wrote:
Karch wrote:
>The Hashtable would work great alongside a tokenized string, but the problem is that I don't have a distinct delimiter since the string to be matched can span multiple words. It's really a question of "do any of these strings appear in this source string", irrespective of whitespace.
<snip>
This is a little different than your original post, but if you can't
delimit search string, you might try something like the following:
Hash the static list of words
Keep a list of unique lengths for those words
for each unique length, iterate the string pulling out substrings of
that length
see if that substring is in your hashed list of words
It at least reduces the amount of InString's you'd have to do.
I'd be interested in seeing how this performs against large strings.
Here's my test code:
<snip>
>
if (text.Length < length) break;
<snip>
That should be continue instead of break. This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics |
by: Christopher W. Douglas |
last post by:
I am writing a VB.NET application in Visual Studio 2003. I have written a
method that handles several events, such as closing a form and changing the
visible status of a form. I have some code that applies to all these
events, but I need to have specific code execute when the form closes. The
properties for this method are sender (the originator) and e (event
arguments). I know how to get typeof (sender) to determine what form or...
|
by: HumanJHawkins |
last post by:
The following query works perfectly (returning all words on a list called
"Dolch" that do not contain a form of "doing"):
SELECT 'Dolch' AS , dbo.Dolch.vchWord
FROM dbo.Dolch LEFT OUTER JOIN
dbo.CombinedLexicons ON CONTAINS(dbo.Dolch.vchWord,
'FORMSOF(INFLECTIONAL, "doing")')
WHERE (dbo.CombinedLexicons.vchWord IS NULL)
However, what I really want to do requires me to piece two strings together,
|
by: Chris Simmons |
last post by:
I know that a String is immutable, but I don't understand why this
piece of code fails in nUnit:
// BEGIN CODE
using System;
class Test
{
public static void Main( String args )
|
by: Chad Myers |
last post by:
I've been perf testing an application of mine and I've noticed that there
are a lot (and I mean A LOT -- megabytes and megabytes of 'em) System.String
instances being created.
I've done some analysis and I'm led to believe (but can't yet quantitatively
establish as fact) that the two basic culprits are a lot of calls to:
1.) if( someString.ToLower() == "somestring" )
and
|
by: Tom |
last post by:
Is there such a thing as a CONTAINS for a string variable in VB.NET? For
instance, I want to do something like the following:
If strTest Contains ("A","B", "C") Then
Debug.WriteLine("Found characters")
Else
Debug.WriteLine("Did NOT find characters!")
End If
| |
by: hstagni |
last post by:
Where can I find a library to created text-based windows applications?
Im looking for a library that can make windows and buttons inside
console.. Many old apps were make like this, i guess
____________________________________
| |
| ------------------ |
| | BUTTON | |
| ...
|
by: =?Utf-8?B?bWFnZWxsYW4=?= |
last post by:
Hi,
I'd appreciae any advice on how to do this in VB 2005.
I have a simple array;e..g, a list of States, e.g., CA, WA, ID, AL, and
etc...
I like to determine how many unqiue "States" are in this array...ie.,
duplicate entires, are igorned...
|
by: |
last post by:
I am interested in scanning web pages for content of interest, and then
auto-classifying that content. I have tables of metadata that I can use for
the classification, e.g. : "John P. Jones" "Jane T. Smith" "Fred Barzowsky"
"Department of Oncology" "Office of Student Affairs" "Lewis Hall" etc. etc.
etc.
I am wondering what the efficient way to do this in code might be. The dumb
and brute-force way would be to loop through the content...
|
by: jacob navia |
last post by:
Abstract:
Continuing the discussion about abstract data types, in this
discussion group, a string collection data type is presented,
patterned after the collection in C# and similar languages (Java).
It stores character strings, and resizes itself to accommodate
new strings when needed.
Interface:
----------
|
by: Karch |
last post by:
I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of strings.
So, think of it in terms of this: I have a string such as the following:
"Smith said the dog belongs (or did belong) to the secretary, an employee of
the company."
Then the list contains the following (and this list is MUCH larger in the
real situation):
|
by: Oralloy |
last post by:
Hello folks,
I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>".
The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed.
This is as boiled down as I can make it.
Here is my compilation command:
g++-12 -std=c++20 -Wnarrowing bit_field.cpp
Here is the code in...
| |
by: Hystou |
last post by:
Overview:
Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
|
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
|
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own....
Now, this would greatly impact the work of software developers. The idea...
|
by: isladogs |
last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM).
In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules.
He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms.
Adolph will...
|
by: conductexam |
last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one.
At the time of converting from word file to html my equations which are in the word document file was convert into image.
Globals.ThisAddIn.Application.ActiveDocument.Select();...
|
by: adsilva |
last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
| |
by: 6302768590 |
last post by:
Hai team
i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
|
by: bsmnconsultancy |
last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...
| |