Fastest way to search text file for string

Julie

What is the *fastest* way in .NET to search large on-disk text files (100+ MB)
for a given string.

The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted.

I don't want to load the entire file into physical memory, memory-mapped files
are ok (and preferred). Speed/performance is a requirement -- the target is to
locate the string in 10 seconds or less for a 100 MB file. The search string
is typically 10 characters or less. Finally, I don't want to spawn out to an
external executable (e.g. grep), but include the algorithm/method directly in
the .NET code base. For the first rev, wildcard support is not a requirement.

Thanks for any pointers!

Nov 16 '05

Subscribe Post Reply

48940

Willy Denoyette [MVP]

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...

"Willy Denoyette [MVP]" wrote:
Did you try to flush the File System cache first?
I'm pretty sure the file was (partly) cached in the FS cache when you did
your test.
At first I thought the same thing, but I ran the test & timing on several
machines w/ similar results on the first run.

If there's no difference between the first run and the second run, it
indicates that the file is read from the FS cache.

Do you know of a way to programmatically flush the cache from .NET? If
so,
I'll try it again w/ a forced flush and see if it changes my results.

You could try to fill the memory with data, this way you will reduce the FS
cache (though, you can't flush all of the cache), but also the Working Set
of all processes, so your system will become real slow.

byte[] bytes = new byte[100 * 5000000]; // allocate 500Mb
(assumes 512 Mb system RAM)
for(int i=0; i<bytes.Length; i++)
bytes[i] = 0;

Willy.

Nov 16 '05 #51

Drebin

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...

Answer me one question: how can you determine that those requirements are
unreasonable?
Because with the exception of aerospace and gov't work, the private sector
can't tolerate gross inefficiencies like you described. So either you don't
care about being inefficient or you are not smart enough to know that it
is!! Either way, it's bad and either way, you are too stubborn to see people
here are trying to actually help you! Imagine that? Perfect strangers still
willing to help someone who has been a jackass and jerky the whole time.
Oh wise sage, I work in neither of those disciplines.
Why the constant attacks - I'm on the newsgroup to help people, and you're
still just being an ass - and to someone who is trying to help you!! Not
cool.
How about this, for your future reference: try answering the question posed, don't spend so much time trying to read more into it than exists. If you have questions about the requirements, ask them _after_ you have answered the
original question. Otherwise, you come off as a know-it-all.
Ya know what, I don't know where you get this sense of entitlement, everyone
that posted, did so because they WANT to. We don't owe you anything! You
aren't entitled to help! So you know what, if I see any more posts from you,
I won't bother to respond. And if you have further technical problems, take
it someplace else. Newsgroups are filled with people of varying levels of
experience and expertise - some answers are helpful, some - not-so-much (my
own answers included) - but ya know what, when you have a problem, maybe a
combination of several things you've read will help you come to a solution.
If people don't give you the answer you want, you don't blast them! Point
is, you are grossly missing the point of newsgroups and I, for one - think
it's kind of fucked up.

I'm even more annoyed that I wasted this much time on you and this retarded
topic!
Finally, my requirements were well defined, you just don't happen to want to believe them.

I'm completely underwhelmed. I still think you're an idiot.

Nov 16 '05 #52

Julie

Drebin wrote:

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
Answer me one question: how can you determine that those requirements are
unreasonable?
Because with the exception of aerospace and gov't work, the private sector
can't tolerate gross inefficiencies like you described. So either you don't
care about being inefficient or you are not smart enough to know that it
is!! Either way, it's bad and either way, you are too stubborn to see people
here are trying to actually help you! Imagine that? Perfect strangers still
willing to help someone who has been a jackass and jerky the whole time.

Can you please explain to me how I've 'tolerated gross inefficiencies'?

Oh wise sage, I work in neither of those disciplines.

Why the constant attacks - I'm on the newsgroup to help people, and you're
still just being an ass - and to someone who is trying to help you!! Not
cool.

I haven't deliberately attacked anyone. I've been simply trying to clarify for
a few posters specifically what is required.

As near as I can tell, you don't want to believe me that I know what the
requirements are.

I've tried to calmly and effectively re-iterate my requirements. Again, you
may not want to agree with them, you may not understand them, and you many not
like them. But the fact remains, the requirements have been carefully examined
on our end, and what I originally posted is what is still required.

I *have* looked at a database solution, and the previous implementation *DID*
use secondary hash indexes into the data. This is *exactly* what we are trying
to move away from due to the added complexity, maintenance issues, etc. that
just aren't justified for this component. Meeting the goal of < 10 second
searches w/ a simple/maintainable solution is (in our case!) infinitely more
desirable than sub-second searches at the expense of increased complexity and
maintainability. Honestly, it is that simple, why do you want to make it so
much more complicated?

How about this, for your future reference: try answering the question

posed,
don't spend so much time trying to read more into it than exists. If you

have
questions about the requirements, ask them _after_ you have answered the
original question. Otherwise, you come off as a know-it-all.

Ya know what, I don't know where you get this sense of entitlement, everyone
that posted, did so because they WANT to. We don't owe you anything! You
aren't entitled to help!

Sense of entitlement? I'm baffled by that one. I had a question about a
particular requirement, got a lot more comments about the _requirements_ and
very few responses to the original question.

One of the more important things that I've learned about answering questions in
this and other forums is this:

1) don't assume

2) have faith that the op knows what they want (unless they indicate otherwise)

3) if the op isn't clear, ask a follow-up question

4) when it is clear what the op is after, specifically answer that question

5) if there are other methods or alternatives available but differ slightly
from the stated requirements, add that as a post script

Here is an example:

Jill: I need to search an unformatted 100 MB text file for a string in less
than 10 seconds, what is the fastest way to do this in C#?

Jack: I'm not aware of anything in C# other than opening the file as a text
stream and just searching over the data (brute force). You could use B-M (and
related) searching techniques to speed up the actual string search. Can you
index the data or store it in a database?

Jill: Thanks for the info. I was hoping there was something native to the
language/framework and more efficient that brute force, but maybe it will work
in this case. I'll give it a shot and let you know what kind of results I
get. BTW: my target system is 1.3 GHz w/ 500 MB RAM. Regarding
databases/indexing: we specifically do not want to go down this route (we
already implemented it that way, and for the specific needs, it turned out to
be a *huge* pain to maintain, the benefit just wasn't there for this
component).

Jill: Tried out the brute force method, and it works just great. Times are
around 5 seconds, implementation is simple and it will be easily maintained.
Thanks!

Jack: No problem, glad to help out.

So you know what, if I see any more posts from you,
I won't bother to respond. And if you have further technical problems, take
it someplace else.
"Take it someplace else"??? Sorry, I wasn't aware that you were the moderator
of this newsgroup.
Newsgroups are filled with people of varying levels of
experience and expertise - some answers are helpful, some - not-so-much (my
own answers included) - but ya know what, when you have a problem, maybe a
combination of several things you've read will help you come to a solution.
If people don't give you the answer you want, you don't blast them!
I didn't blast anyone. I simply tried to convey my requirements, asking for
possible solutions, you kept attacking my requirements. Do you disagree with
that?
I'm even more annoyed that I wasted this much time on you and this retarded
topic!
Finally, my requirements were well defined, you just don't happen to want

to
believe them.

I'm completely underwhelmed. I still think you're an idiot.

Sorry to hear that. I don't think that you are an idiot, and even if I did, I
wouldn't post it in a public forum.

Nov 16 '05 #53

Drebin

Alright.. last one and I'll let you have the last word if you want it..

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...

Can you please explain to me how I've 'tolerated gross inefficiencies'?
Flat-files are the most inefficient and error-prone way to store, search and
sort data. Without structure that is enforced, you can't guarantee the data
integrity. Without data integrity, you completely undermine the usefullness
of your entire System! Without structure that is enforced, you can't rely on
more efficient ways to search and sort.

Databases were invented because they simplified what everyone was doing
manually again and again. After you've written a sorting algorithm enough
times, you think "Damn, I wish I didn't have to keep doing this over and
over". You start "leveraging" technology and reusability.

Imagine there is a "program" that stores data in a structured way, can
enforce the integrity of that data, allows for various ways to increase the
efficiency of searching and can sort very quickly. Sounds great! But it's
"called" a RDBMS.. and now all of a sudden you hate it again??

All I'm saying - is that when someone CHOOSES to stay with a text file, in
that native format, it's bad for several reasons:

-Structure is not enforced
-No guarantee of data integrity
-No efficient, native way of searching
-No efficient, native way of sorting

In other words, this is the most inefficient and useless form of data. You
have to write from scratch, anything you want to do to the data.

When this data is in a database, all of these are taken care of for free.
And when talking about price and managability - I'll take managing a bunch
of databases over managing free-form text files all day long.

The bottom line of this, is that when someone is fired-up about working in
text-files - that spells disaster for me, because you are having a logical
computer rely on data that could very well be inconsistent/illogical, and
you have to write even the most basic utility (like searching). And I call
that gross inefficiency. In fact, you can't GET any more inefficient!!
As near as I can tell, you don't want to believe me that I know what the
requirements are.
I STILL question that you understand the implications of writing a System in
the year 2004 based on raw text files. I can't imagine a computer person
that understands the implications of doing this - and being OK with it..
nay, thinking that it's the BETTER way to do things!! I can't imagine not
leveraging the technologies that are available. This is not the first or
last time someone has needed to search a file - someone has already invented
the wheel for searching, USE IT!!
I *have* looked at a database solution, and the previous implementation *DID* use secondary hash indexes into the data. This is *exactly* what we are trying to move away from due to the added complexity, maintenance issues, etc. that just aren't justified for this component. Meeting the goal of < 10 second
searches w/ a simple/maintainable solution is (in our case!) infinitely more desirable than sub-second searches at the expense of increased complexity and maintainability. Honestly, it is that simple, why do you want to make it so much more complicated?
I consider text-files to be much more complicated, because they typically
need so much care and feeding. A database is pretty trouble-free. I can't
imagine the problems you'd run across to make you resort to going back to
text files! This is not a reasonable argument at all (given the information
that you've given).
Sense of entitlement? I'm baffled by that one. I had a question about a
particular requirement, got a lot more comments about the _requirements_ and very few responses to the original question.
Because again, *many* people who post, don't understand the implications of
what they are saying. Most people don't post questions about what they
already know about - they post about something they are not as familiar
with. And if they aren't familiar, then they likely don't know of X and Y
shortcuts or easier ways to do what they are doing.

And you do come off as arrogant when you come around asking for comments,
then criticize the people who respond. You come around asking for help and
then criticize the way in which it's provided!
One of the more important things that I've learned about answering questions in this and other forums is this:

1) don't assume
2) have faith that the op knows what they want (unless they indicate otherwise)

I completely disagree. Most do not. Newsgroups are used by people during the
development process, not after the project is done and they are presenting a
demo. Many times, nay, MOST times - people just want to accomplish a fixed
task "I want to search through 100MB of data in < 10s".. one would expect
that people are going to say "There isn't anything that you can write or
implement that is going to be quicker and search faster than a database"
3) if the op isn't clear, ask a follow-up question
4) when it is clear what the op is after, specifically answer that question 5) if there are other methods or alternatives available but differ slightly from the stated requirements, add that as a post script
OK, so after your little lesson, I got an 80% woo hoo!!! :-)
"Take it someplace else"??? Sorry, I wasn't aware that you were the moderator of this newsgroup.
Just my opinion.
Sorry to hear that. I don't think that you are an idiot, and even if I did, I wouldn't post it in a public forum.

hahah oh this is different, taking the high road now, eh? Nice.

Well, this has been fun.

Nov 16 '05 #54

Julie

Drebin wrote:

Flat-files are the most inefficient and error-prone way to store, search and
sort data. Without structure that is enforced, you can't guarantee the data
integrity. Without data integrity, you completely undermine the usefullness
of your entire System! Without structure that is enforced, you can't rely on
more efficient ways to search and sort.
As I've previously said, the data source is an external piece of equipment of
which I (and our entire team) don't have control over.

Sorting is not a requirement.
Databases were invented because they simplified what everyone was doing
manually again and again. After you've written a sorting algorithm enough
times, you think "Damn, I wish I didn't have to keep doing this over and
over". You start "leveraging" technology and reusability.
I realize all of that. As I've said, a database simply isn't required for this
component for about 10 reasons that I posted in a separate thread.
Imagine there is a "program" that stores data in a structured way, can
enforce the integrity of that data, allows for various ways to increase the
efficiency of searching and can sort very quickly. Sounds great! But it's
"called" a RDBMS.. and now all of a sudden you hate it again??
Nope, I use database(s) all of the time, and it is used as well in the project
I'm working on. As I've stated, this particular component simply doesn't have
the requirement for database (structured) access, and as an average, actually
decreases performance/productivity when factoring in all costs associated with
a database.
All I'm saying - is that when someone CHOOSES to stay with a text file, in
that native format, it's bad for several reasons:
As I've said, I'm not _choosing_ to stay with a text file, that the the
requirement.
-Structure is not enforced
Not required
-No guarantee of data integrity
Basic file system integrity is sufficient
-No efficient, native way of searching
Nothing more than 1 hit in a 100 MB file in 10 seconds or less on a 1.3 GHz/500
MB RAM is required.
-No efficient, native way of sorting
Not required.
In other words, this is the most inefficient and useless form of data. You
have to write from scratch, anything you want to do to the data.
In my case, the 'writing from scratch' amounts to 10 lines of code.
When this data is in a database, all of these are taken care of for free.
That is where you are just plain wrong. You are not considering the entire
'database' picture:

- implementation and development costs
- licensing costs
- installation requirements/costs
- initial/subsequent processing of data costs (how is the data going to get
into the database? Remember, this is *not* structured data, forcing it into
columns would essentially dictate that either a new table is created for each
(of *many*) variant formats, or all of the data is stored in a single blob)
- maintenance costs

None of that is free, and *must* be factored into the analysis of the problem.
I can say that all of these items were considered, /prior/ to the formal
definition of the requirements.
And when talking about price and managability - I'll take managing a bunch
of databases over managing free-form text files all day long.
That sounds great for your project. For ours, this isn't the case.
The bottom line of this, is that when someone is fired-up about working in
text-files - that spells disaster for me, because you are having a logical
computer rely on data that could very well be inconsistent/illogical, and
you have to write even the most basic utility (like searching). And I call
that gross inefficiency. In fact, you can't GET any more inefficient!!
Not at all fired up about using text files, however that is what the
requirement is, and I can therefore work within those bounds.

Personally, I prefer to use the most simple solution that solves the problem,
nothing more.

As near as I can tell, you don't want to believe me that I know what the
requirements are.

I STILL question that you understand the implications of writing a System in
the year 2004 based on raw text files. I can't imagine a computer person
that understands the implications of doing this - and being OK with it..
nay, thinking that it's the BETTER way to do things!! I can't imagine not
leveraging the technologies that are available. This is not the first or
last time someone has needed to search a file - someone has already invented
the wheel for searching, USE IT!!

And that is what I was asking for. Instead of providing an answer to my
question, you changed the question.

Sense of entitlement? I'm baffled by that one. I had a question about a
particular requirement, got a lot more comments about the _requirements_

and
very few responses to the original question.

Because again, *many* people who post, don't understand the implications of
what they are saying. Most people don't post questions about what they
already know about - they post about something they are not as familiar
with. And if they aren't familiar, then they likely don't know of X and Y
shortcuts or easier ways to do what they are doing.

I understand that. I do not fall into that category.
And you do come off as arrogant when you come around asking for comments,
then criticize the people who respond. You come around asking for help and
then criticize the way in which it's provided!
Please show me where I've been critical. Some respondents started by saying
that a database is a solution to the problem. I simply stated that a database
(and indexing/sorting) is not a viable solution.
"There isn't anything that you can write or
implement that is going to be quicker and search faster than a database"

I'm not asking for anything quicker than a database, I'm asking for the
simplest solution to the problem as defined by the requirements.

Sorry to hear that. I don't think that you are an idiot, and even if I

did, I
wouldn't post it in a public forum.

hahah oh this is different, taking the high road now, eh? Nice.

No, taking the same road. Your where the one that called me an idiot.

Nov 16 '05 #55

Daniel O'Connell [C# MVP]

For what its worth, by everything you've written in this thread, a flat file
search was almost certainly the right way to go.

Let me guess, this was a semi-structured tag based file, perhaps something
like

SMPL 1848.44 48174 85771
MP21 181.11 87

etc?

Nov 16 '05 #56

Julie

"Daniel O'Connell [C# MVP]" wrote:

For what its worth, by everything you've written in this thread, a flat file
search was almost certainly the right way to go.

Let me guess, this was a semi-structured tag based file, perhaps something
like

SMPL 1848.44 48174 85771
MP21 181.11 87

etc?

Yes, pretty close. The datafiles contain protein information in a very loose
format. The objective is to search the file for a particular protein.

Nov 16 '05 #57

Bill Woodruff

Why not stop feeding this troll opportunities for further bandwidth waste ?

Nov 16 '05 #58

John Price

> Why not stop feeding this troll opportunities for further bandwidth waste

?

No, please don't stop. This is a fascinating insight into the behaviour of a
troll who has not yet realised he's totally out classed by the shear
professionalism of an expert (Julie). I look forward to a good laugh and
just wonder if he will every get the message and realise how ridicules his
stance is. But somehow I don't think he ever will ... as he's a troll by
nature.

John.

Nov 16 '05 #59

Daniel O'Connell [C# MVP]

"Bill Woodruff" <billw at dotscience dot com> wrote in message
news:Or*************@tk2msftngp13.phx.gbl...

Why not stop feeding this troll opportunities for further bandwidth waste
?

Where exactly do you see trolls?

Everyone paticipating in this thread are rather common posters.

Nov 16 '05 #60

Michael C

Interesting. 5 seconds was your worst case scenario? (i.e., Rebooted
machine to ensure there are no files cached, and the string you're searching
for is at the very end of the 100 MB file?) Congrats! Sometimes the
simplest answer is the best.

Now if you had no line breaks (bad assumption on my part when I saw the word
'unstructured'), you'd probably have to read in a chunk of the file, search
it, read in the next chunk, scan it, etc., and of course in between chunks
you'd have to save the last n characters of a chunk and append them to the
front of the next chunk (where n is the size of the search string minus one
character) to ensure that you don't miss any matches that cross boundaries.

Thanks,
Michael C., MCDBA

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...

"Willy Denoyette [MVP]" wrote:
Did you try to flush the File System cache first?
I'm pretty sure the file was (partly) cached in the FS cache when you did your test.
At first I thought the same thing, but I ran the test & timing on several
machines w/ similar results on the first run.

Do you know of a way to programmatically flush the cache from .NET? If

so, I'll try it again w/ a forced flush and see if it changes my results.

Nov 16 '05 #61

Similar topics