Bytes IT Community

Searching 18,000 strings of an average size of 10Kb

What would be the fastest way to search 18,000 strings with an average size of 10Kb? I can have all the strings in memory; should I simply do an instr on all of the strings, or is there a faster way? I would like to have a kind of search like Google, where you can enter several words to search for; I guess that calls for a regular expression "word1|word2|word3|...".
Is there any kind of indexing tool available for this kind of thing? I have my data in an MSSQL 2000 database, and it's not updated very often: some hundred new records are added a day, spread over the whole day, and some hundred records are removed a day in one go.

I don't mind using 200MB of memory to hold everything in memory at all times, but if there is a better, more efficient way to do it I would like some pointers.

Kind Regards,
Allan Ebdrup.
Jan 11 '07 #1
8 Replies


This seems interesting:
http://www.codeproject.com/useritems/damnfastsearch.asp

"Allan Ebdrup" <eb****@noemail.noemail> wrote in message
news:OW*************@TK2MSFTNGP06.phx.gbl...
[original question snipped]

Jan 11 '07 #2

If you can sort the list of strings, use a binary search. A binary search can find a given string among, say, 500,000 strings in roughly 19 comparisons, and if you doubled the size to 1,000,000 it would only require 1 more step to find it.

Essentially, a binary search halves the data it needs to search at each step, but this requires you to work from an array that you can guarantee to be sorted.

.NET has this functionality built in and it's wicked fast.
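For reference, the built-in search this reply refers to looks roughly like the sketch below. The thread is about .NET (where the equivalent is `Array.BinarySearch`), so this is an illustrative translation into Java, not the poster's code:

```java
import java.util.Arrays;

public class BinarySearchDemo {
    public static void main(String[] args) {
        String[] data = { "cherry", "apple", "banana", "date" };
        Arrays.sort(data);                      // binary search requires sorted input
        int idx = Arrays.binarySearch(data, "banana");
        System.out.println(idx);                // prints 1: index in the sorted array
        // A sorted array of n entries needs at most ~log2(n) comparisons:
        // roughly 19 for 500,000 entries, roughly 20 for 1,000,000.
    }
}
```

Note that `Arrays.binarySearch` (like .NET's `Array.BinarySearch`) only finds an exact element, not a substring inside an element, which is why it does not fit the original question directly.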

-R

Allan Ebdrup wrote:
[original question snipped]
Jan 12 '07 #3

_DD
On Thu, 11 Jan 2007 13:31:16 +0100, "Allan Ebdrup"
<eb****@noemail.noemail> wrote:
[original question snipped]
Tough to say from your description whether you are looking for
something fairly simple, fast, and direct, or whether you need to
search on multiple fields.

If the latter, and if the data is already in a SQL db, then a SQL
query may be the best way to do that, no? People spend years trying to
optimize db queries, so it would be hard to get something running
faster, especially as you are probably doing a query to get the data
in the first place. You would have no appreciable memory hit. Of
course you would not be asking if it were that simple.

So...if it's the former, and you need blazing speed at the expense of
loading everything into ram, then check into Boyer-Moore. I believe
that is still the basis of the fastest linear string search
algorithms.

Boyer Moore:
http://www.cs.utexas.edu/users/moore...ing/index.html

That should get you started.
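A minimal sketch of the Boyer-Moore idea, using the simplified Horspool variant (again in Java for illustration; this is not the code from the linked page). On a mismatch it skips ahead by a precomputed shift based on the last character of the current window, so it usually examines far fewer characters than a naive scan:

```java
public class Horspool {
    // Returns the index of the first occurrence of needle in haystack, or -1.
    static int indexOf(String haystack, String needle) {
        int n = haystack.length(), m = needle.length();
        if (m == 0) return 0;
        // Shift table over all char values (256KB per call; fine for a sketch,
        // a real implementation would precompute this once per pattern).
        int[] shift = new int[65536];
        java.util.Arrays.fill(shift, m);            // default: skip the whole pattern
        for (int i = 0; i < m - 1; i++)
            shift[needle.charAt(i)] = m - 1 - i;    // distance from the last position
        int pos = 0;
        while (pos <= n - m) {
            int j = m - 1;                          // compare right to left
            while (j >= 0 && haystack.charAt(pos + j) == needle.charAt(j)) j--;
            if (j < 0) return pos;                  // full match
            pos += shift[haystack.charAt(pos + m - 1)];
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("børnehavepædagog", "pædagog")); // prints 9
    }
}
```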
Jan 12 '07 #4

"Deckarep" <de******@gmail.com> wrote in message
news:11**********************@v45g2000cwv.googlegroups.com...
[binary-search suggestion snipped]
I know about binary searches.
However, I'm searching for the string to be somewhere inside the 10Kb of each of the 18,000 strings, so I can't sort the data; I need some kind of indexing.
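One common way to index for substring (rather than whole-word) matches, which the thread never spells out, is a character n-gram inverted index: index every 3-character window of each document, intersect the posting lists for the query's trigrams, then verify candidates with a plain contains(). The sketch below is a hypothetical illustration in Java (the class and method names are mine, not from any library mentioned here):

```java
import java.util.*;

public class TrigramIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    void add(String doc) {
        int id = docs.size();
        docs.add(doc);
        String text = doc.toLowerCase();
        for (int i = 0; i + 3 <= text.length(); i++)    // every 3-char window
            index.computeIfAbsent(text.substring(i, i + 3),
                                  k -> new HashSet<>()).add(id);
    }

    // Candidates must contain every trigram of the query; verify with contains().
    List<Integer> search(String query) {
        String q = query.toLowerCase();
        Set<Integer> candidates = null;
        for (int i = 0; i + 3 <= q.length(); i++) {
            Set<Integer> hit = index.getOrDefault(q.substring(i, i + 3),
                                                  Collections.emptySet());
            if (candidates == null) candidates = new HashSet<>(hit);
            else candidates.retainAll(hit);             // intersect posting lists
        }
        List<Integer> result = new ArrayList<>();
        if (candidates == null) return result;          // query shorter than 3 chars
        for (int id : candidates)
            if (docs.get(id).toLowerCase().contains(q)) result.add(id);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        TrigramIndex ix = new TrigramIndex();
        ix.add("børnehavepædagog søges");               // doc 0
        ix.add("pædagogmedhjælper søges");              // doc 1
        ix.add("kontorassistent søges");                // doc 2
        System.out.println(ix.search("pædagog"));       // prints [0, 1]
    }
}
```

Unlike a full-text word index, this finds "pædagog" inside "børnehavepædagog", at the cost of index size proportional to the total text.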

Kind Regards,
Allan Ebdrup
Jan 12 '07 #5

"_DD" <_D*@nospam.com> wrote in message
news:e2********************************@4ax.com...
On Thu, 11 Jan 2007 13:31:16 +0100, "Allan Ebdrup"
<eb****@noemail.noemail> wrote:
>Tough to say from your description whether you are looking for
>something fairly simple, fast, and direct, or whether you need to
>search on multiple fields.
Simple, fast, and direct; just one field.
>If the latter, and if the data is already in a SQL db, then a SQL
>query may be the best way to do that, no? People spend years trying to
>optimize db queries, so it would be hard to get something running
>faster, especially as you are probably doing a query to get the data
>in the first place. You would have no appreciable memory hit. Of
>course you would not be asking if it were that simple.
LIKE runs too slow, and I can't use a fulltext index because it splits the text up into whole words, and I want a search where searching for "pædagog" also returns results with "børnehavepædagog" or "pædagogmedhjælper".
>So...if it's the former, and you need blazing speed at the expense of
>loading everything into ram, then check into Boyer-Moore. I believe
>that is still the basis of the fastest linear string search
>algorithms.
>
>Boyer Moore:
>http://www.cs.utexas.edu/users/moore...ing/index.html
This is probably the kind of thing used in Regex; it's blazing fast compared to .IndexOf, which has surprisingly poor performance.
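The multi-word alternation the original question proposed ("word1|word2|word3") can be sketched as follows. The thread is about .NET's Regex, so this Java version with java.util.regex is only an illustrative translation; quoting each term keeps regex metacharacters in the search words from being misinterpreted:

```java
import java.util.*;
import java.util.regex.Pattern;

public class MultiWordSearch {
    // Build one alternation pattern from several search terms.
    static Pattern compile(List<String> terms) {
        StringJoiner alt = new StringJoiner("|");
        for (String t : terms)
            alt.add(Pattern.quote(t));   // escape metacharacters in each term
        return Pattern.compile(alt.toString(), Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = compile(Arrays.asList("pædagog", "lærer"));
        // Substring semantics: matches inside longer words too.
        System.out.println(p.matcher("søger en børnehavepædagog").find()); // true
        System.out.println(p.matcher("søger en kok").find());              // false
    }
}
```

Because find() matches anywhere in the text, this gives the "pædagog" matches "børnehavepædagog" behavior a whole-word fulltext index cannot.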

I found this plugin for MSSQL:
http://www.surfray.com/en/index.php/products/overloook
It seems to be just the kind of thing I want: add an index and get fast searching. I'll try it out and let you know.

Kind Regards,
Allan Ebdrup
Jan 12 '07 #6

_DD
On Fri, 12 Jan 2007 09:30:25 +0100, "Allan Ebdrup"
<eb****@noemail.noemail> wrote:
>"_DD" <_D*@nospam.com> wrote in message
>news:e2********************************@4ax.com...
>... I want a search where searching for "pædagog" also
>returns results with "børnehavepædagog" or "pædagogmedhjælper"
Gotta appreciate anyone who knows how to do that combined 'ae' thing!
>So...if it's the former, and you need blazing speed at the expense of
>loading everything into ram, then check into Boyer-Moore. I believe
>that is still the basis of the fastest linear string search
>algorithms.
>
>Boyer Moore:
>http://www.cs.utexas.edu/users/moore...ing/index.html

>This is probably the kind of thing used in Regex; it's blazing fast
>compared to .IndexOf, which has surprisingly poor performance.
String functions in general do not seem to be optimal performers. The
reason that I suggested a raw Boyer-Moore algorithm is that if you are
indeed doing a linear search as you describe, then it will probably be
the most direct method. You may be able to optimize for your
particular data format.

I don't doubt that Regex often uses Boyer-Moore, but I'm not sure how much overhead a Regex implementation would add, since it is more generalized.
>I found this plugin for MSSQL:
>http://www.surfray.com/en/index.php/products/overloook
>It seems to be just the kind of thing I want: add an index and get fast
>searching. I'll try it out and let you know.
1800 times faster than an MS SQL 2005 search? I take back what I said
about the SQL DB team working around the clock to optimize queries.
Seriously, I guess I'd be skeptical about that, but who knows? Please
do report back if you use it.

Jan 12 '07 #7

I don't have a specific answer, but you might want to take a look at
Bioinformatics code or ask a Bioinformatics group. This sort of search is
pretty common and I am sure it has been optimized to death.
Ethan
"Allan Ebdrup" <eb****@noemail.noemail> wrote in message
news:OW*************@TK2MSFTNGP06.phx.gbl...
[original question snipped]

Jan 12 '07 #8

_DD
On Fri, 12 Jan 2007 13:28:12 -0600, "Ethan Strauss" <ethan dot strauss
at Promega dot com> wrote:
>I don't have a specific answer, but you might want to take a look at
>Bioinformatics code or ask a Bioinformatics group. This sort of search is
>pretty common and I am sure it has been optimized to death.
>Ethan
Gene-matching algorithms: that's certainly an interesting thought. Not
only are they optimized; they build custom hardware for that. But I thought
there would be a fuzzy aspect to most of their matching algorithms. Is
that not the case?
Jan 12 '07 #9
