Bytes IT Community

Fastest way to search text file for string

What is the *fastest* way in .NET to search large on-disk text files (100+ MB)
for a given string.

The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted.

I don't want to load the entire file into physical memory, memory-mapped files
are ok (and preferred). Speed/performance is a requirement -- the target is to
locate the string in 10 seconds or less for a 100 MB file. The search string
is typically 10 characters or less. Finally, I don't want to spawn out to an
external executable (e.g. grep), but include the algorithm/method directly in
the .NET code base. For the first rev, wildcard support is not a requirement.

Thanks for any pointers!
Nov 16 '05 #1
60 Replies


I would suggest that you have a look at the Regex implementation; I think Regex
is the fastest when it comes to scanning.
You might need to use a FileStream to load the file, so I don't think it's the
most appropriate answer.
Anyway, make a local copy of one of those files and give Regex a try. See if
it comes anywhere near the 10 sec mark.
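For what it's worth, a minimal line-by-line sketch of this idea: a StreamReader feeding a compiled Regex. The file path and pattern in Main are placeholders, and this assumes the search string never spans a line break:

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

class RegexScan
{
    // Returns true if any line of the file matches the (pre-compiled) pattern.
    // Reads line by line, so the whole file is never held in memory at once.
    static bool FileContains(string path, Regex pattern)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (pattern.IsMatch(line))
                    return true;
            }
        }
        return false;
    }

    static void Main()
    {
        // Placeholder data for illustration only.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "alpha\nbeta\ngamma\n");
        Regex needle = new Regex("gamma", RegexOptions.Compiled);
        Console.WriteLine(FileContains(path, needle)); // prints True
        File.Delete(path);
    }
}
```

For a plain literal (no wildcards), a compiled Regex mostly buys you nothing over IndexOf, so treat this as a baseline to benchmark, not a recommendation.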

--

Regards,

Hermit Dave
(http://hdave.blogspot.com)

Nov 16 '05 #2

I wouldn't spend any more time on seeing *if* you can do this, until you find
out *why* you have to do this!

100 MB flat file?? This is exactly the reason why relational databases were
made and are still used for just about everything. Without knowing more
about your app, I'd rather take the 2 minutes to load this into a SQL table,
build an index - and then what you want to do, suddenly becomes quick
(sub-second), simple and will support wildcards later. Maybe bulk-load your
file at night - and have your front-end hit the database during the day?

I don't think you will be happy with just about any solution. Every response
you will get to this is either going to be way too slow -or- way too
complicated. You're re-inventing the wheel!!

My $ .02


Nov 16 '05 #3

Just load the text file into a large string and use string.IndexOf(); this
should be even faster than Regex.
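Since loading a 100 MB string is off the table here, a chunked variant of the IndexOf idea might look like this: read the file in fixed-size buffers and carry the last (needle length - 1) characters forward so a match straddling a buffer boundary isn't missed. A sketch; the 4 KB chunk size and ASCII encoding are assumptions:

```csharp
using System;
using System.IO;
using System.Text;

class ChunkSearch
{
    // Scans the file in fixed-size chunks, keeping an overlap of
    // (needle.Length - 1) chars so a match straddling two chunks is found.
    // Assumes an ASCII-compatible single-byte encoding.
    static long IndexOfInFile(string path, string needle, int chunkSize)
    {
        using (StreamReader reader = new StreamReader(path, Encoding.ASCII))
        {
            char[] buffer = new char[chunkSize + needle.Length - 1];
            int carried = 0;          // chars carried over from the previous chunk
            long consumed = 0;        // absolute offset of buffer[0] in the file
            int read;
            while ((read = reader.Read(buffer, carried, chunkSize)) > 0)
            {
                string window = new string(buffer, 0, carried + read);
                int hit = window.IndexOf(needle);
                if (hit >= 0)
                    return consumed + hit;
                // Carry the tail forward so boundary matches are not missed.
                carried = Math.Min(needle.Length - 1, window.Length);
                consumed += window.Length - carried;
                window.CopyTo(window.Length - carried, buffer, 0, carried);
            }
            return -1;
        }
    }

    static void Main()
    {
        // Placeholder data: the needle sits at offset 10000.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, new string('x', 10000) + "needle" + new string('y', 5000));
        Console.WriteLine(IndexOfInFile(path, "needle", 4096)); // prints 10000
        File.Delete(path);
    }
}
```

The overlap trick is the only subtle part: without it, a needle split across two reads is silently skipped.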

--
cody

[Freeware, Games and Humor]
www.deutronium.de.vu || www.deutronium.tk

Nov 16 '05 #4

Given the size of the file, probably the only way would be to use a
file byte array: load the file in as bytes one at a time and convert them to
chars by creating an indexer. You will need to work out a way of checking
if a series of chars makes up the string you are looking for.

I would start here.

http://msdn.microsoft.com/library/de...us/csref/html/
vcwlkindexerstutorial.asp

However, I very much doubt you will manage to scan 100 meg of data in 10
seconds.
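One way to sketch the byte-at-a-time check is a Knuth-Morris-Pratt matcher, which conveniently never re-reads input, so it works on a forward-only stream. ASCII pattern assumed; in practice you'd wrap the FileStream in a BufferedStream so ReadByte isn't a system call per byte:

```csharp
using System;
using System.IO;

class StreamMatch
{
    // Knuth-Morris-Pratt over a byte stream: the input is consumed strictly
    // one byte at a time and never backtracked, which suits reading a huge
    // file sequentially. Returns the offset of the first match, or -1.
    static long Find(Stream input, byte[] pattern)
    {
        // failure[i] = length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of it.
        int[] failure = new int[pattern.Length];
        int k = 0;
        for (int i = 1; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[k] != pattern[i]) k = failure[k - 1];
            if (pattern[k] == pattern[i]) k++;
            failure[i] = k;
        }

        int matched = 0;   // how many pattern bytes currently match
        long pos = 0;      // absolute position of the current byte
        int b;
        while ((b = input.ReadByte()) != -1)
        {
            while (matched > 0 && pattern[matched] != b)
                matched = failure[matched - 1];
            if (pattern[matched] == b) matched++;
            if (matched == pattern.Length)
                return pos - pattern.Length + 1;   // offset where the match starts
            pos++;
        }
        return -1;
    }

    static void Main()
    {
        // Placeholder data: "abc" first occurs at offset 5.
        byte[] data = System.Text.Encoding.ASCII.GetBytes("xxaababcxx");
        byte[] pat = System.Text.Encoding.ASCII.GetBytes("abc");
        Console.WriteLine(Find(new MemoryStream(data), pat)); // prints 5
    }
}
```

A naive "count how many chars have matched so far" scan is tempting but wrong (it misses overlapping prefixes, e.g. "aab" in "aaab"); the failure table is what makes single-pass streaming correct.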

--
Regards

John Timney
Microsoft Regional Director
Microsoft MVP

Nov 16 '05 #5

Drebin wrote:

I wouldn't spend anymore time on see *if* you can do this, until you find
out *why* you have to do this!
All requirements have been defined at this point by project management. This
isn't just a blind decision, but the result of the examination of the domain
and expected results.
100mb flat file?? This is exactly the reason why relational databases were
made and are still used for just about everything. Without knowing more
about your app, I'd rather take the 2 minutes to load this into a SQL table,
build an index - and then what you want to do, suddenly becomes quick
(sub-second), simple and will support wildcards later. Maybe bulk-load your
file at night - and have your front-end hit the database during the day?
Yes, 100+ MB flat files. These are loosely formatted datafiles from external
laboratory instruments.

Remember, proper implementation dictates that you implement what is required,
nothing more. The current requirements are simple access & management of these
text files that allows immediate searching that averages 10 seconds or less.
WinGrep accomplishes this in 6 seconds on our target system (1.3 GHz, 500 MB
RAM).

Future requirements *may* dictate that additional time constraints are imposed,
which would then lead to db or other external indexing of the files. But, that
will be implemented when and if necessary. If you have questions about this
approach, you may want to look into the (industrial) extreme programming
paradigm, which is what our shop successfully uses.
I don't think you will be happy with just about any solution. Every response
you will get to this is either going to be way to slow -or- way too
complicated. You're re-inventing the wheel!!
Nay, my good friend, not re-inventing the wheel, but asking where the wheel
is. Text-searching of large files isn't uncommon or inappropriate. I'm just
looking into comments on such searches in .Net; this stuff is fairly trivial in
C++/Win32 (I'd prefer *not* to drop down to managed/unmanaged C++ for this
project).


Nov 16 '05 #6

"John Timney (Microsoft MVP)" wrote:

Given the size of the file, probably the only way would be to use a
filebytearray, load the file in as bytes 1 at a time and convert them to
chars by creating an indexer. You will need to work out a way of checking
if a series of chars make up the string you are looking for.

I would start here.

http://msdn.microsoft.com/library/de...us/csref/html/
vcwlkindexerstutorial.asp

However, I very much doubt you will manage to scan 100 meg of data in 10
seconds.
Thanks, I'll look into that.

WinGrep performs the search in 6 seconds on our target system (1.3 GHz, 500 MB
RAM). (WinGrep is not open source; it's C++/Win32.)



Nov 16 '05 #7

Julie wrote:
Nay, my good friend, not re-inventing the wheel, but asking where the wheel
is. Text-searching of large files isn't uncommon or inappropriate. I'm
just looking into comments on such searches in .Net; this stuff is fairly
trivial in C++/Win32 (I'd prefer not to drop down to managed/unmanaged C++
for this project).


Searching text in text blocks should use the text-search algorithm by
Knuth-Morris-More or the Boyer-Moore variant. These algorithms are much
faster than the brute-force search implemented in the string class.

Algorithms in C by Sedgewick contains a description of these algorithms, and
I'm sure you'll find some descriptions on the internet.

Basically they come down to this:
string: ababababcababababacababababacbabababacbabababc

If you now try to find the string abc, do not start comparing at the first
character, but at the last. So in the string above, the 3rd character is an
'a', not a 'c', so abc cannot end at that position and several characters can
be skipped at once. It works with skip arrays and is quite clever; it will
tremendously speed up string search, especially with large texts.
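The skip-array idea reads roughly like this in C#. This is a Boyer-Moore-Horspool sketch (the simplified Boyer-Moore that uses only the bad-character table), shown over an in-memory byte array for clarity rather than a file:

```csharp
using System;
using System.Text;

class Horspool
{
    // Boyer-Moore-Horspool: compare the pattern right to left; on a mismatch,
    // shift by skip[last text byte under the pattern]. Returns first match
    // offset, or -1.
    static int Search(byte[] text, byte[] pat)
    {
        int m = pat.Length;
        int[] skip = new int[256];
        for (int i = 0; i < 256; i++) skip[i] = m;           // default: full shift
        for (int i = 0; i < m - 1; i++) skip[pat[i]] = m - 1 - i;

        int pos = 0;
        while (pos <= text.Length - m)
        {
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pat[j]) j--;   // compare right to left
            if (j < 0) return pos;                           // full match
            pos += skip[text[pos + m - 1]];
        }
        return -1;
    }

    static void Main()
    {
        // Placeholder data: "abc" first occurs at offset 6.
        byte[] text = Encoding.ASCII.GetBytes("ababababcababab");
        byte[] pat = Encoding.ASCII.GetBytes("abc");
        Console.WriteLine(Search(text, pat)); // prints 6
    }
}
```

On a 100 MB file you'd run this over large buffered reads (with an overlap of pattern length - 1 bytes between buffers); the longer the pattern, the bigger the average skip and the faster the scan.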

Frans.
--
Get LLBLGen Pro, productive O/R mapping for .NET: http://www.llblgen.com
My .NET Blog: http://weblogs.asp.net/fbouma
Microsoft C# MVP
Nov 16 '05 #8

Julie wrote:
What is the *fastest* way in .NET to search large on-disk text files
(100+ MB) for a given string.

I don't want to load the entire file into physical memory,
memory-mapped files are ok (and preferred).


The problem is that fast access to a large file requires direct access
to memory, which is antithetical to managed code. Your best choice would be
to isolate the search in an unmanaged C++ function, which is called by the
managed C# app. Accessing the file as a memory-mapped file is fairly easy in
unmanaged code, but I don't believe it's possible at all in a managed app.

The best algorithm for searching your file is probably the Boyer-Moore
method. Moore himself has a cool webpage graphically demonstrating it:
http://www.cs.utexas.edu/users/moore...ing/index.html.
Googling "Boyer-Moore" will provide any number of implementations, such as
this one: http://www.dcc.uchile.cl/~rbaeza/han...3b.srch.c.html

--
Truth,
James Curran [MVP]
www.NJTheater.com (Professional)
www.NovelTheory.com (Personal)


Nov 16 '05 #9

If you want to jam 'em out in C#, I for one would really appreciate it and
keep them for future use. Or post at CodeProject. TIA.

--
William Stacey, MVP

"Frans Bouma [C# MVP]" <pe******************@xs4all.nl> wrote in message
news:xn***************@msnews.microsoft.com...
Julie wrote:
Nay, my good friend, not re-inventing the wheel, but asking where the wheel is. Text-searching of large files isn't uncommon or inappropriate. I'm

....

Nov 16 '05 #10

William Stacey [MVP] wrote:
If you want to jam em out in c#, I for one would really appreaciate it and
keep for future use. Or post at CodeProject. TIA.


I made a typo: it's Knuth-Morris-Pratt, not Knuth-Morris-Moore.

Here is a link to a lot of string-search algorithms. Both of the ones I
mentioned are there, with C code and an explanation. It's pretty straightforward :)
http://www-igm.univ-mlv.fr/~lecroq/string/

Frans.

--
Get LLBLGen Pro, productive O/R mapping for .NET: http://www.llblgen.com
My .NET Blog: http://weblogs.asp.net/fbouma
Microsoft C# MVP
Nov 16 '05 #11

Drebin wrote:

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
Remember, proper implementation dictates that you implement what is
required,
nothing more.
Wow. Most every other developer would argue that "proper" implementation
dictates: functionality that is required and then any other structural or
design implementations that will likely be used in the future.


I was in that group of developers 6 months ago. I was raised old school where
requirements, specs, and definitions are done before coding starts.

Believe me, the paradigm switch *wasn't* easy or intuitive. However, I have
to say that there is a lot to IXP that really impacts the production quality
*and* performance of the entire project. I'm no expert, but I'm making
progress, and probably won't switch back to the old way.

And I guarantee that I wasn't excited about the switch. However, I committed
to trying it for 6 months, and then reevaluate. So far, so good.
That's like saying "my customers told me to build a house and they want 3
bedrooms".
Right -- and that is what you build. The foresight is to build it in such a
fashion that future changes can be easily accomplished. One of the
primary keys is constant refactoring of code and TDD. That leaves the code in
a much more stable and flexible state. It doesn't sound efficient, but when
you are proficient at it, it really is more effective.
You need to have the foresight and wisdom to anticipate that
later they are going to say "we want a kitchen and bathrooms too!".
In your hypothetical example, what if they never want a kitchen/bathroom? Then
you have wasted work.
If you
don't, you will *always* be doing way more work than you need to! Customers
often don't know what they want - it's up to your expertise to help them
define that.

For your app, common sense tells me: if they want to search this data, they
will want to sort it, and they will want to browse and filter results. Going
your route, EACH step will likely take just as long. If you did it the right
way once, in the beginning - then each additional thing the users wanted
would be practically free.
See, your common sense is already wrong, and if followed, leads down the wrong
path and direction for this application.

Anyway, the thread isn't to discuss the merits of IXP, merely looking for a
quick text search implementation in C# -- nothing more, nothing less.
If you have questions about this
approach, you may want to look into the (industrial) extreme programming
paradigm, which is what our shop successfully uses.


You can call it what you want and justify it however, but I'm really glad I
don't work where you work. :-)

Nov 16 '05 #12

Michael C wrote:
That's like saying "my customers told me to build a house and they want 3
bedrooms". You need to have the foresight and wisdom to anticipate that
later they are going to say "we want a kitchen and bathrooms too!". If you
don't, you will *always* be doing way more work than you need to! Customers
often don't know what they want - it's up to your expertise to help them
define that.


If you're storing 100MB of data in a flat file, you're probably not running
very efficiently to begin with. Like Drebin said, you need to look at your
ultimate goals and you'll probably find much better solutions out there
(i.e., SQL, etc.)


Like I indicated in another follow-up: "These are loosely formatted datafiles
from external laboratory instruments." The requirement is to work w/ these
files, not change the file format.
That being said, you need to have the foresight, wisdom and *discipline* to
define all requirements *up-front*. Adding extra kitchens and bathrooms to
the floorplan halfway through the house building process jacks the cost - in
time and $$$'s - *way* up. It's true that a lot of users don't understand
the requirements definition process, and many aren't aware of the features
they'll want down the road; that's where you step in and help guide them
during the planning process. If you're constantly re-writing your
applications because the user requirements keep changing during the
implementation phase, you might want to look at improving your own planning
process.


Thank you for the compliment! Yes, we definitely do have foresight, wisdom and
discipline. The conclusion of our investigation was the following
requirements:

Search an approx. 100 MB flat text file for a given string in an average of 10
seconds or less, no wildcards.

Now, if I can just have the *discipline* to implement something that meets
_those_ requirements, I get the job done, am appreciated by the team, we
release the product, I get paid, and everyone is happy.
Nov 16 '05 #13

James Curran wrote:

Julie wrote:
What is the *fastest* way in .NET to search large on-disk text files
(100+ MB) for a given string.

I don't want to load the entire file into physical memory,
memory-mapped files are ok (and preferred).


The problem is that fast access to a large file requires direct access
to memory, which is antithetical to managed code. Your best choice would be
to isolate the search in a unmanaged C++ function, which is coded by the
managed C# app. Access the file as a memory mapped file is fairly easy in
unmanaged code, but I don't believe it's possible at all in a managed app.

The best algorithm for searching your file is probably the Boyer-Moore
method. Moore himself has a cool webpage graphically demostrating it:
http://www.cs.utexas.edu/users/moore...ing/index.html.
Googling "Boyer-Moore" will provide any number of implementations, such as
this one: http://www.dcc.uchile.cl/~rbaeza/han...3b.srch.c.html


Thanks for the tips. I wasn't aware that mm file support wasn't available in
.NET, seems short-sighted to me.

Managed/unmanaged code really isn't a possibility, part of the requirements is
that it is implemented in C#. I may be able to get away w/ using interop
though for the mm file support.
Nov 16 '05 #14

Julie wrote:

Managed/unmanaged code really isn't a possibility, part of the requirements is
that it is implemented in C#. I may be able to get away w/ using interop
though for the mm file support.


Not interop, but P/Invoke...
Nov 16 '05 #15

Use Perl.

Nov 16 '05 #16

Michael C wrote:

Use Perl.


?!
Nov 16 '05 #17

You got some serious specifications, but haven't given enough information to
really help you find a solution. So I suggested using a language that was
designed specifically to perform text processing. Maybe you could provide
more information, like:

You specify 10 seconds to locate text matches in a 100 MB flat text file.
Is the file already loaded into memory? Do you want to load the whole file
into memory first? Does the load time count against your 10 seconds, or is
it in addition to? If it counts against your 10 seconds, can your
recommended system configuration load it in 10 seconds? (If not, the whole
point is moot). How many matches are you searching for? One match, every
match? Is the file structured in such a way that its format can be
leveraged to speed up the process? Are there certain fields that are
searched more than others in searches?

Assuming I was *stuck* with a 100 MB flat text file, and no option to
utilize a SQL database or other method of access, I suppose I'd have to
*reinvent the wheel* and create a separate index file to retrieve the data
in reasonable time frames. Of course that may not be an option for you.

Thanks,
Michael C., MCDBA


Nov 16 '05 #18

How would you load a 100MB txt file into a DB and then search it for a
word? How would that work?
Nov 16 '05 #19

Create a loader program/BCP/DTS job to split the records accordingly and
load them into a table structure (say one named "customer"). Then you could do
something like:

select customerid from customer where customer_lastname = 'smith'

assuming you had customer data to load. This is pretty standard stuff - did
you mean this in a different way? SQL is already a really, really powerful
tool for loading, sorting and searching for data. So for someone to want to,
and think they could do a better job - is ambitious to say the least. SQL
concepts (like indexing and searching) are time-tested, I can't picture
challenging it and thinking I could do a better job!!
"Phill" <wa********@yahoo.com> wrote in message
news:ac**************************@posting.google.com...
How would you load a 100MB txt file into a DB and then search it for a
word? How would that work?

Nov 16 '05 #20

Julie wrote:
Thanks for the tips. I wasn't aware that mm file support wasn't
available in .NET, seems short-sighted to me.


What good is a memory mapped file in an environment where you cannot
directly access memory?

--
Truth,
James Curran [MVP]
www.NJTheater.com (Professional)
www.NovelTheory.com (Personal)


Nov 16 '05 #21

Julie wrote:
Like I indicated in another follow-up: "These are loosely formatted
datafiles from external laboratory instruments." The requirement is
to work w/ these files, not change the file format.


Well, admittedly I don't know the specifics of the contract, but it
seems that you are taking that a bit too literally. As far as I can see,
you are required to:
A) Read in the 100 MB flat text file, and
B) Spit out results found within that file.

Exactly how you get from A) to B) is strictly your concern, and if you want
to implement it by loading it into a SQL database or otherwise indexing it,
no one else needs even know about it.

--
Truth,
James Curran [MVP]
www.NJTheater.com (Professional)
www.NovelTheory.com (Personal)


Nov 16 '05 #22

Exactly my point. The only way I can think of to search 100 MB files
quickly would require a separate index, preferably kept in a separate file
so that it didn't have to be re-created from scratch every darn time you run
the program. That being said, that's exactly what SQL does; any type of
separate indexing method would basically be re-inventing the wheel. And
heck, they're giving away MSDE for free, so you don't even have to buy SQL
Server. Any solution short of using a separate index of some sort is going
to be very non-scalable and comparatively slow. But alack and alas, to each
her own...
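A bare-bones sketch of such a separate index: one pass recording, for each whitespace-separated token, the byte offsets of the lines containing it. This uses .NET 2.0 generics for brevity; persisting the dictionary to a side file, '\r\n' handling, and streaming the one-time pass (File.ReadAllLines holds every line in memory) are all left out:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class SideIndex
{
    // One-time scan of the file: token -> byte offsets of the lines that
    // contain it. Lookups are then dictionary hits instead of full scans.
    // Assumes '\n' line terminators and space/tab-separated tokens.
    static Dictionary<string, List<long>> Build(string path)
    {
        Dictionary<string, List<long>> index = new Dictionary<string, List<long>>();
        long offset = 0;
        foreach (string line in File.ReadAllLines(path))
        {
            foreach (string token in line.Split(' ', '\t'))
            {
                if (token.Length == 0) continue;
                if (!index.ContainsKey(token))
                    index[token] = new List<long>();
                index[token].Add(offset);   // offset of the containing line
            }
            offset += line.Length + 1;      // +1 for the '\n' terminator
        }
        return index;
    }

    static void Main()
    {
        // Placeholder data: two lines, "beta" on both, second line at offset 11.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "alpha beta\ngamma beta\n");
        Dictionary<string, List<long>> idx = Build(path);
        Console.WriteLine(idx["beta"].Count);    // prints 2
        Console.WriteLine(idx["gamma"][0]);      // prints 11
        File.Delete(path);
    }
}
```

This only pays off if the index outlives many searches; for the thread's one-shot "is this string present" requirement, a single sequential scan is simpler and avoids the rebuild cost.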

Thanks,
Michael C., MCDBA

Nov 16 '05 #23

James Curran wrote:

Julie wrote:
Thanks for the tips. I wasn't aware that mm file support wasn't
available in .NET, seems short-sighted to me.


What good is a memory mapped file in an environment where you cannot
directly access memory?


Can't directly access memory? Sure you can.

In this case, it could be an abstraction layer on something like char[] or
byte[].
Nov 16 '05 #24

Michael C wrote:

Exactly my point. The only way I can think of to search 100 MB files
quickly would require a separate index, preferably kept in a separate file
so that it didn't have to be re-created from scratch every darn time you run
the program. That being said, that's exactly what SQL does; any type of
separate indexing method would basically be re-inventing the wheel. And
heck, they're giving away MSDE for free, so you don't even have to buy SQL
Server. Any solution short of using a separate index of some sort is going
to be very non-scalable and comparatively slow. But alack and alas, to each
her own...
As I indicated in my original post, indexing isn't an option (and this
therefore excludes any database access).

Apparently you aren't aware of any methods in .NET to accomplish what I need --
thanks for your input.

Nov 16 '05 #25

Michael C wrote:

You got some serious specifications, but haven't given enough information to
really help you find a solution.
I wasn't asking for a solution, I was asking:

"What is the *fastest* way in .NET to search large on-disk text files
(100+ MB) for a given string."
So I suggested using a language that was
designed specifically to perform text processing.
I'm not doing text processing, I'm searching a text file for a given string --
that is all.
Maybe you could provide more information, like:

You specify 10 seconds to locate text matches in a 100 MB flat text file.
Is the file already loaded into memory?
No, the file is on disk, as I indicated in my original post:

"I don't want to load the entire file into physical memory"
Do you want to load the whole file into memory first?
No: "I don't want to load the entire file into physical memory"
Does the load time count against your 10 seconds, or is
it in addition to? If it counts against your 10 seconds, can your
recommended system configuration load it in 10 seconds? (If not, the whole
point is moot).
N/A: "I don't want to load the entire file into physical memory"
How many matches are you searching for? One match, every
match?
Typically 0 or 1 (i.e. found/not found).
Is the file structured in such a way that its format can be
leveraged to speed up the process?
For the purposes of my requirements, the file is unordered, unstructured,
essentially random text.
Are there certain fields that are
searched more than others in searches?
No.
Assuming I was *stuck* with a 100 MB flat text file, and no option to
utilize a SQL database or other method of access, I suppose I'd have to
*reinvent the wheel* and create a separate index file to retrieve the data
in reasonable time frames. Of course that may not be an option for you.


Not stuck, that is the requirement. Or, do you presume the original
requirements for grep to be 'stuck'?

Again, I'm not interested in reinventing the wheel as you put it, I'm simply
after: "What is the *fastest* way in .NET to search large on-disk text files
(100+ MB) for a given string."
Nov 16 '05 #26

P: n/a
Drebin wrote:

Create a loader program/BCP/DTS job to split the records accordingly and
load them into a table structure (say named "customer").. then you could do
something like:

select customerid from customer where customer_lastname = 'smith'
Please explain how to do this w/ unstructured text?

"The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted."
assuming you had customer data to load. This is pretty standard stuff - did
you mean this in a different way? SQL is already a really, really powerful
tool for loading, sorting and searching for data. So for someone to want to,
and think they could do a better job - is ambitious to say the least. SQL
concepts (like indexing and searching) are time-tested, I can't picture
challenging it and thinking I could do a better job!!

"Phill" <wa********@yahoo.com> wrote in message
news:ac**************************@posting.google.c om...
How would you load a 100MB txt file into a DB and then search it for a
word? How would that work?

Nov 16 '05 #27

P: n/a
James Curran wrote:

Julie wrote:
Like I indicated in another follow-up: "These are loosely formatted
datafiles from external laboratory instruments." The requirement is
to work w/ these files, not change the file format.


Well, admittedly I don't know the specifics of the contract, but it
seems that you are taking that a bit too literally. As far as I can see,
you are required to:
A) Read in the 100 MB flat text file, and
B) Spit out results found within that file.

Exactly how you get from A) to B) is strictly your concern, and if you want
to implement it by loading it into a SQL database or otherwise indexing it,
no one else needs even know about it.


Nope, sorry, I'm not taking it too literally, but thanks for implying that I'm
incapable of understanding requirements.

Assume the input/output requirements to be the same as grep.
Nov 16 '05 #28

P: n/a
Julie,

Please don't take offense; I have no desire to get into a pissing contest -
I am just flabbergasted by your viewpoint. I can honestly say I've never met
anyone who thinks like this. So after this, I'll back off :-)

First, to talk to what you said before about "getting solid requirements
first" - the problem with getting requirements isn't that people are
incompetent; it's very much human nature to not be able to know what you
want until you see some of your project come to life. Using the analogy of
building a house, you may not realize until after the 2nd floor is in
place that you have a really cool view, and it would've been nice to build
a balcony. So - I think that people many times are UNABLE to make solid
choices on requirements because you just can't see that far ahead.

So I believe human nature makes it quite impossible to make 100% accurate
requirements. It's just not possible (for most large projects).

As to your point below, computers are based on structure. If you want to
KEEP your data unstructured, you are going to be fighting the computer the
whole way. So if you are asking me, I would be spending all of my time right
now getting this unstructured data into some sort of structure. If your
source for this data gives it to you unstructured, then you need to get
yourself a new data source or build a converter of some sort. You alluded to
this being data from an electronic device of some sort. It seems to me, if
it just gave you an array of numbers and values that were random - this
would be completely useless.

So - bottom line, if you have to deal with data that doesn't have structure,
you should first address that. You are going to spend 3x as much time trying
to do any little thing with this data, versus if you just get this
straightened out in the beginning. If you do this, then you can leverage a
TON of technology and products (like SQL Server) rather than writing a
custom version of them for this specific problem. "Write-once-use-once"
software is soooo "mid-90's".. "Write-once-use-many-times" is what things
have evolved to.

You seem pretty ingrained in this mindset of yours, so I'm not trying to
convince you of anything - just giving you the computer science take on what
you are trying to do. Anyhow, good luck with this!! :-)
"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
Drebin wrote:

Create a loader program/BCP/DTS job to split the records accordingly and
load them into a table structure (say named "customer").. then you could do something like:

select customerid from customer where customer_lastname = 'smith'
Please explain how to do this w/ unstructured text?

"The files are unindexed and unsorted, and for the purposes of my

immediate requirements, can't be indexed/sorted."
assuming you had customer data to load. This is pretty standard stuff - did you mean this in a different way? SQL is already a really, really powerful tool for loading, sorting and searching for data. So for someone to want to, and think they could do a better job - is ambitious to say the least. SQL concepts (like indexing and searching) are time-tested, I can't picture
challenging it and thinking I could do a better job!!

"Phill" <wa********@yahoo.com> wrote in message
news:ac**************************@posting.google.c om...
How would you load a 100MB txt file into a DB and then search it for a
word? How would that work?

Nov 16 '05 #29

P: n/a
Oh, that's simple then. You point your browser at
http://www.thecodeproject.com/csharp...asp#xx825897xx and contact
Jean-Michel Bezeau for a copy of his Grep stand-alone class. Then you
implement it and see if it does the job in 10 seconds or less.

Too easy.

You're welcome,
Michael C., MCDBA

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
Michael C wrote:

You got some serious specifications, but haven't given enough information to really help you find a solution.
I wasn't asking for a solution, I was asking:

"What is the *fastest* way in .NET to search large on-disk text files
(100+ MB) for a given string."
So I suggested using a language that was
designed specifically to perform text processing.


I'm not doing text processing, I'm searching a text file for a given

string -- that is all.
Maybe you could provide more information, like:

You specify 10 seconds to locate text matches in a 100 MB flat text file. Is the file already loaded into memory?
No, the file is on disk, as I indicated in my original post:

"I don't want to load the entire file into physical memory"
Do you want to load the whole file into memory first?


No: "I don't want to load the entire file into physical memory"
Does the load time count against your 10 seconds, or is
it in addition to? If it counts against your 10 seconds, can your
recommended system configuration load it in 10 seconds? (If not, the whole point is moot).


N/A: "I don't want to load the entire file into physical memory"
How many matches are you searching for? One match, every
match?


Typically 0 or 1 (i.e. found/not found).
Is the file structured in such a way that its format can be
leveraged to speed up the process?


For the purposes of my requirements, the file is unordered, unstructured,
essentially random text.
Are there certain fields that are
searched more than others in searches?


No.
Assuming I was *stuck* with a 100 MB flat text file, and no option to
utilize a SQL database or other method of access, I suppose I'd have to
*reinvent the wheel* and create a separate index file to retrieve the data in reasonable time frames. Of course that may not be an option for you.


Not stuck, that is the requirement. Or, do you presume the original
requirements for grep to be 'stuck'?

Again, I'm not interested in reinventing the wheel as you put it, I'm

simply after: "What is the *fastest* way in .NET to search large on-disk text files (100+ MB) for a given string."

Nov 16 '05 #30

P: n/a
The real problem is finding the fastest way to read the data into memory.
Reading a 100 MB file will take something like 5 - 10 seconds depending on
the IO subsystem used (RAID 0, single 7200 RPM drive) and assuming the data
is not cached.
Depending on the measured file load time, you can decide to use a
naive algorithm or opt for a faster algorithm like the Boyer-Moore
algorithm.
Consider that:
- the Boyer-Moore algorithm (with a pattern length of 10) is about 4 - 6
times faster than a naive algorithm,
- a naive algorithm should be able to search a 10 char pattern in less than
one sec on a decent system (P4 - 2.8 GHz).

So it's really important to know exactly how much time will be spent to
bring the data in memory before you decide upon the searching algorithm.
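For reference, a Boyer-Moore-Horspool search over a byte buffer might look something like the sketch below (class and method names are my own, and it assumes a single-byte text encoding such as ASCII; full Boyer-Moore with the good-suffix rule can shift somewhat further still):

```csharp
using System;

class HorspoolSearch
{
    // Build the bad-character skip table: for each byte value, how far the
    // pattern may be shifted when that byte appears under its last position.
    static int[] BuildSkipTable(byte[] pattern)
    {
        int[] skip = new int[256];
        for (int i = 0; i < 256; i++)
            skip[i] = pattern.Length;
        for (int i = 0; i < pattern.Length - 1; i++)
            skip[pattern[i]] = pattern.Length - 1 - i;
        return skip;
    }

    // Returns the offset of the first match in 'text', or -1 if not found.
    public static int IndexOf(byte[] text, byte[] pattern)
    {
        if (pattern.Length == 0) return 0;
        int[] skip = BuildSkipTable(pattern);
        int pos = 0;
        while (pos <= text.Length - pattern.Length)
        {
            // Compare right-to-left, as Boyer-Moore variants do.
            int j = pattern.Length - 1;
            while (j >= 0 && text[pos + j] == pattern[j])
                j--;
            if (j < 0) return pos;
            // Shift by the skip value of the byte under the pattern's tail.
            pos += skip[text[pos + pattern.Length - 1]];
        }
        return -1;
    }
}
```

The skip table costs one 256-entry array per pattern, so it pays off when the pattern is reused across buffers or the haystack is large, which is exactly the 100 MB case here.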

Willy.
"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
What is the *fastest* way in .NET to search large on-disk text files (100+
MB)
for a given string.

The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted.

I don't want to load the entire file into physical memory, memory-mapped
files
are ok (and preferred). Speed/performance is a requirement -- the target
is to
locate the string in 10 seconds or less for a 100 MB file. The search
string
is typically 10 characters or less. Finally, I don't want to spawn out to
an
external executable (e.g. grep), but include the algorithm/method directly
in
the .NET code base. For the first rev, wildcard support is not a
requirement.

Thanks for any pointers!

Nov 16 '05 #31

P: n/a
Simple. Since you require a simple Grep-like function, you find one that's
already made, and modify it to your taste as opposed to re-inventing the
wheel from scratch. If the Code Project link I already gave you doesn't fit
your needs, I would recommend Googling "C#.NET Grep". But I'm sure you've
already done that.

You're welcome,
Michael C., MCDBA
"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
Michael C wrote:

Exactly my point. The only way I can think of to search 100 MB files
quickly would require a separate index, preferably kept in a separate file so that it didn't have to be re-created from scratch every darn time you run the program. That being said, that's exactly what SQL does; any type of
separate indexing method would basically be re-inventing the wheel. And
heck, they're giving away MSDE for free, so you don't even have to buy SQL Server. Any solution short of using a separate index of some sort is going to be very non-scalable and comparatively slow. But alack and alas, to each her own...
As I indicated in my original post, indexing isn't an option (and this
therefore excludes any database access).

Apparently you aren't aware of any methods in .NET to accomplish what I

need -- thanks for your input.

Thanks,
Michael C., MCDBA

"Drebin" <th*******@hotmail.com> wrote in message
news:bP*****************@newssvr33.news.prodigy.co m...
Create a loader program/BCP/DTS job to split the records accordingly and load them into a table structure (say named "customer").. then you could
do
something like:

select customerid from customer where customer_lastname = 'smith'

assuming you had customer data to load. This is pretty standard
stuff - did
you mean this in a different way? SQL is already a really, really
powerful tool for loading, sorting and searching for data. So for someone to want to,
and think they could do a better job - is ambitious to say the least.

SQL concepts (like indexing and searching) are time-tested, I can't picture challenging it and thinking I could do a better job!!
"Phill" <wa********@yahoo.com> wrote in message
news:ac**************************@posting.google.c om...
> How would you load a 100MB txt file into a DB and then search it for a > word? How would that work?

Nov 16 '05 #32

P: n/a
Julie wrote:
Assume the input/output requirements to be the same as grep.


And it's *completely acceptable* for GREP to be implemented using a SQL
server behind the scenes. Granted, that probably wouldn't be a very good
implementation, but as long as the *inputs* and *outputs* are as
expected, the implementation is irrelevant.

--
Truth,
James Curran [MVP]
www.NJTheater.com (Professional)
www.NovelTheory.com (Personal)


Nov 16 '05 #33

P: n/a
Drebin,
I have no desire to get into a pissing contest -
But that is exactly what you are doing. You are trying to prove you are
right and she is wrong. Just accept that she knows what her requirements are
and that (rightly or wrongly) a database is not suitable. Why does it matter
so much to you if she is wrong?

Her original question was quite clear and explicit, even to the point of
stating the file could not be indexed or sorted! Unfortunately, sometimes in
life you have to deal with what you are given, and you can't always demand
how information is provided.

Just answer the question as asked and don't start criticising people just
because you believe their approach is wrong. By all means offer an
alternative solution but don't take offence when somebody says it is not
suitable. And don't be so arrogant to assume you know their requirements or
circumstances better they do.

Regards

John

p.s. By all means start ranting and raving at me as well, but do so knowing I
won't be reading any more of your replies. You have a major attitude problem.
I suggest you refrain from contributing until you can control it.

"Drebin" <th*******@hotmail.com> wrote in message
news:T5******************@newssvr19.news.prodigy.c om... Julie,

Please don't take offense, I have no desire to get into a pissing
contest -
I am just flabbergasted on your viewpoint. I can honestly say I've never
met
anyone that thinks like this. So after this, I'll back off :-)

First, to talk to what you said before about "getting solid requirements
first" - the problem with getting requirements isn't that people are
incompetent, it's very much human nature to not being able to know what
you
want, until you see some of your project come to life. Using the analogy
of
building a house, you may not realize until after the 2nd floor is in
place - that you have a really cool view, and it would've been nice to
build
a balcony. So - I think that people many times are UNABLE to make solid
choices on requirements because you just can see that far ahead.

So I believe human nature makes it quite impossible to make 100% accurate
requirements. It's just not possible (for most large projects).

As to your point below, computers are based on structure. If you want to
KEEP your data unstructured, you are going to be fighting the computer the
whole way. So if you are asking me, I would be spending all of my time
right
now, getting this unstructured data - into some sort of structure. If you
source for this data gives you it unstructured, then you need to get
yourself a new data source or build a converter of some sort. You eluded
to
this being data from an electronic device of some sort. It seems to me, if
it just gave you an array of numbers and values that were random - this
would be completely useless.

So - bottom line, if you have to deal with data that doesn't have
structure,
you should first address that. You are going to spend 3x as much time
trying
to do any little thing with this data. Versus - if you just get this
straightened out in the beginning. If you do this, then you can leverage a
TON of technology and products (like SQL server) rather than writting a
custom-version of them for this specific problem. "Write-once-use-once"
software is soooo "mid-90's".. "Write-once-use-many-times" is what things
have evolved to.

You seem pretty engrained in this mindset of yours, so I'm not trying to
convince you of anything - just giving you the computer science take on
what
you are trying to do. Anyhow, good luck with this!! :-)
"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
Drebin wrote:
>
> Create a loader program/BCP/DTS job to split the records accordingly
> and
> load them into a table structure (say named "customer").. then you
> could do > something like:
>
> select customerid from customer where customer_lastname = 'smith'


Please explain how to do this w/ unstructured text?

"The files are unindexed and unsorted, and for the purposes of my

immediate
requirements, can't be indexed/sorted."
> assuming you had customer data to load. This is pretty standard stuff - did > you mean this in a different way? SQL is already a really, really powerful > tool for loading, sorting and searching for data. So for someone to
> want to, > and think they could do a better job - is ambitious to say the least. SQL > concepts (like indexing and searching) are time-tested, I can't picture
> challenging it and thinking I could do a better job!!
>
> "Phill" <wa********@yahoo.com> wrote in message
> news:ac**************************@posting.google.c om...
> > How would you load a 100MB txt file into a DB and then search it for
> > a
> > word? How would that work?


Nov 16 '05 #34

P: n/a

"James Curran" <Ja*********@mvps.org> wrote in message
news:u3**************@TK2MSFTNGP09.phx.gbl...
Julie wrote:
Assume the input/output requirements to be the same as grep.
And it's *completely acceptable* for GREP to be implemented using a SQL
server behind the scenes. Granted that probably wouldn't be a very good
implementation, but it as long as the *inputs* and *outputs* are as
expected, the implementation is irrelevant.


It would be ok if the grep *ENGINE* can accept any input; a grep
implementation that uses a database is valid, but close to useless
considering grep is built strictly to search files. What good is a program
that imports all of my files into a database and then tosses the database
out afterwards? Would you *WANT* that? Just because some of you think it's a
really great idea because it means you don't have to do any work?

For what it's worth, this heavy push for a database is probably foolish. If
the search only occurs a couple of times, there is a pretty good chance that
the time spent loading the database and the extra space used up will be a
huge waste.

An index makes sense, but only if a lot of searches will occur over the same
files. For one-off searches I would imagine creating the index would take
more time and a lot more space than just searching the file.
--
Truth,
James Curran [MVP]
www.NJTheater.com (Professional)
www.NovelTheory.com (Personal)

Nov 16 '05 #35

P: n/a
Michael C wrote:

Oh that's simple then. You point your browser at
http://www.thecodeproject.com/csharp...asp#xx825897xx and contact
Jean-Michel Bezeau for a copy of his Grep stand-alone class. Then you
implement and see if it does the job in 10 seconds or less.
Excellent -- thanks!

Too easy.

Your welcome,
Michael C., MCDBA

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
Michael C wrote:

You got some serious specifications, but haven't given enough information to really help you find a solution.


I wasn't asking for a solution, I was asking:

"What is the *fastest* way in .NET to search large on-disk text files
(100+ MB) for a given string."
So I suggested using a language that was
designed specifically to perform text processing.


I'm not doing text processing, I'm searching a text file for a given

string --
that is all.
Maybe you could provide more information, like:

You specify 10 seconds to locate text matches in a 100 MB flat text file. Is the file already loaded into memory?


No, the file is on disk, as I indicated in my original post:

"I don't want to load the entire file into physical memory"
Do you want to load the whole file into memory first?


No: "I don't want to load the entire file into physical memory"
Does the load time count against your 10 seconds, or is
it in addition to? If it counts against your 10 seconds, can your
recommended system configuration load it in 10 seconds? (If not, the whole point is moot).


N/A: "I don't want to load the entire file into physical memory"
How many matches are you searching for? One match, every
match?


Typically 0 or 1 (i.e. found/not found).
Is the file structured in such a way that its format can be
leveraged to speed up the process?


For the purposes of my requirements, the file is unordered, unstructured,
essentially random text.
Are there certain fields that are
searched more than others in searches?


No.
Assuming I was *stuck* with a 100 MB flat text file, and no option to
utilize a SQL database or other method of access, I suppose I'd have to
*reinvent the wheel* and create a separate index file to retrieve the data in reasonable time frames. Of course that may not be an option for you.


Not stuck, that is the requirement. Or, do you presume the original
requirements for grep to be 'stuck'?

Again, I'm not interested in reinventing the wheel as you put it, I'm

simply
after: "What is the *fastest* way in .NET to search large on-disk text

files
(100+ MB) for a given string."

Nov 16 '05 #36

P: n/a
"Willy Denoyette [MVP]" wrote:

The real problem is to find the fastest way to read the data into memory.
Reading a 100Mb file will take something like 5 - 10 seconds depending on
the IO subsystem used (RAID 0, single 7200 RPM drive) and assuming the data
is not cached.
Depending on the results of the file data load time, you can decide to use a
naive algorithm or opt for a faster algorithm like the Boyer-Moore
algorithm.
Consider that:
- the Boyer-Moore algorithm (with a pattern length of 10) is about 4 - 6
times faster than a naive algorithm,
- a naive algorithm should be able to search a 10 char pattern in less than
one sec on a decent system (P4 - 2.8 GHz).

So it's really important to know exactly how much time will be spent to
bring the data in memory before you decide upon the searching algorithm.
Yes, thanks for the comments on B-M searching.

As for loading the file into memory, I specifically do *not* want to do that.
Win32 has offered memory-mapped files for quite some time -- exactly what I'm
after in the .NET world.
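For what it's worth, the 1.x framework has no managed memory-mapped file API (that would take P/Invoke to CreateFileMapping/MapViewOfFile), so one way to search without pulling the whole file into memory at once is a plain buffered scan that carries a pattern-length overlap between reads. A rough sketch, with illustrative names:

```csharp
using System;
using System.IO;
using System.Text;

class StreamSearch
{
    // Scans the file in fixed-size chunks, keeping patternLength-1 bytes of
    // overlap between chunks so a match straddling a boundary isn't missed.
    // Returns the byte offset of the first match, or -1 if not found.
    public static long Find(string path, string pattern)
    {
        byte[] needle = Encoding.ASCII.GetBytes(pattern);
        const int ChunkSize = 1 << 20; // 1 MB per read
        byte[] buffer = new byte[ChunkSize + needle.Length - 1];
        using (FileStream fs = new FileStream(path, FileMode.Open,
                   FileAccess.Read, FileShare.Read, 1 << 16))
        {
            long basePos = 0; // file offset of buffer[0]
            int carried = 0;  // overlap bytes carried from the last chunk
            int read;
            while ((read = fs.Read(buffer, carried, buffer.Length - carried)) > 0)
            {
                int valid = carried + read;
                // Naive scan; the comparison loop could be swapped for a
                // Boyer-Moore variant without changing the chunking logic.
                for (int i = 0; i + needle.Length <= valid; i++)
                {
                    int j = 0;
                    while (j < needle.Length && buffer[i + j] == needle[j]) j++;
                    if (j == needle.Length) return basePos + i;
                }
                // Slide the tail of this chunk down to the front.
                carried = Math.Min(needle.Length - 1, valid);
                Array.Copy(buffer, valid - carried, buffer, 0, carried);
                basePos += valid - carried;
            }
        }
        return -1;
    }
}
```

Peak managed memory stays around one chunk regardless of file size, which meets the "don't load the whole file" constraint even without real memory mapping.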


Willy.

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
What is the *fastest* way in .NET to search large on-disk text files (100+
MB)
for a given string.

The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted.

I don't want to load the entire file into physical memory, memory-mapped
files
are ok (and preferred). Speed/performance is a requirement -- the target
is to
locate the string in 10 seconds or less for a 100 MB file. The search
string
is typically 10 characters or less. Finally, I don't want to spawn out to
an
external executable (e.g. grep), but include the algorithm/method directly
in
the .NET code base. For the first rev, wildcard support is not a
requirement.

Thanks for any pointers!

Nov 16 '05 #37

P: n/a
Drebin wrote:

Julie,

Please don't take offense, I have no desire to get into a pissing contest -
I am just flabbergasted on your viewpoint. I can honestly say I've never met
anyone that thinks like this. So after this, I'll back off :-)
I'll gladly take that as a compliment.
First, to talk to what you said before about "getting solid requirements
first" - the problem with getting requirements isn't that people are
incompetent, it's very much human nature to not being able to know what you
want, until you see some of your project come to life. Using the analogy of
building a house, you may not realize until after the 2nd floor is in
place - that you have a really cool view, and it would've been nice to build
a balcony. So - I think that people many times are UNABLE to make solid
choices on requirements because you just can see that far ahead.

So I believe human nature makes it quite impossible to make 100% accurate
requirements. It's just not possible (for most large projects).
Well, you may want to consider that it is possible -- this is actually a port
of an existing C++ application to C#. Requirements were
defined/refined/reevaluated years ago. My task is simple, as I originally
posted. Please don't assume that you know more about my requirements than I
do.
As to your point below, computers are based on structure. If you want to
KEEP your data unstructured, you are going to be fighting the computer the
whole way. So if you are asking me, I would be spending all of my time right
now, getting this unstructured data - into some sort of structure. If you
source for this data gives you it unstructured, then you need to get
yourself a new data source or build a converter of some sort. You eluded to
this being data from an electronic device of some sort. It seems to me, if
it just gave you an array of numbers and values that were random - this
would be completely useless.
Right -- the source of the data is a 3rd party $100,000 laboratory instrument
(a mass spectrometer). I'll just contact the manufacturer and tell them that
they are doing everything wrong and to output their data in some other format.
If they have any questions as to why, I'll refer them to you.
So - bottom line, if you have to deal with data that doesn't have structure,
you should first address that. You are going to spend 3x as much time trying
to do any little thing with this data. Versus - if you just get this
straightened out in the beginning. If you do this, then you can leverage a
TON of technology and products (like SQL server) rather than writting a
custom-version of them for this specific problem. "Write-once-use-once"
software is soooo "mid-90's".. "Write-once-use-many-times" is what things
have evolved to.
So, apparently, you implement more than was asked for. Interesting. I prefer
to implement what was asked for.

You seem pretty engrained in this mindset of yours, so I'm not trying to
convince you of anything - just giving you the computer science take on what
you are trying to do. Anyhow, good luck with this!! :-)

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
Drebin wrote:

Create a loader program/BCP/DTS job to split the records accordingly and
load them into a table structure (say named "customer").. then you could do something like:

select customerid from customer where customer_lastname = 'smith'


Please explain how to do this w/ unstructured text?

"The files are unindexed and unsorted, and for the purposes of my

immediate
requirements, can't be indexed/sorted."
assuming you had customer data to load. This is pretty standard stuff - did you mean this in a different way? SQL is already a really, really powerful tool for loading, sorting and searching for data. So for someone to want to, and think they could do a better job - is ambitious to say the least. SQL concepts (like indexing and searching) are time-tested, I can't picture
challenging it and thinking I could do a better job!!

"Phill" <wa********@yahoo.com> wrote in message
news:ac**************************@posting.google.c om...
> How would you load a 100MB txt file into a DB and then search it for a
> word? How would that work?

Nov 16 '05 #38

P: n/a
Julie <ju***@nospam.com> wrote:
Yes, thanks for the comments on B-M searching.

As for loading the file into memory, I specifically do *not* want to do that.
Win32 has offered memory-mapped files for quite some time -- exactly what I'm
after in the .Net world.


Willy wasn't suggesting (as I read it) loading the whole file into
memory in one go - but you need to accept that if you're going to
search through every byte of the original data, all of it will need to
be loaded from disk into memory at that stage. It would be well worth
finding out how long it takes *just* for the loading part on the target
system (taking the cache into account) before looking at the searching
part, IMO.
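A throwaway harness along these lines would isolate the raw read cost (Environment.TickCount is coarse-grained, but adequate at the multi-second scale being discussed, and unlike Stopwatch it exists on the 1.x framework):

```csharp
using System;
using System.IO;

class ReadBenchmark
{
    // Measures how long it takes just to pull every byte of the file
    // through a FileStream, with no searching at all. Returns the byte
    // count so the caller can sanity-check that the whole file was read.
    public static long Time(string path)
    {
        byte[] buffer = new byte[1 << 20]; // 1 MB read buffer
        int start = Environment.TickCount;
        long total = 0;
        using (FileStream fs = new FileStream(path, FileMode.Open,
                   FileAccess.Read, FileShare.Read, 1 << 16))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                total += read;
        }
        int elapsed = Environment.TickCount - start;
        Console.WriteLine("{0} bytes in {1} ms", total, elapsed);
        return total;
    }
}
```

Running it twice in a row also shows how much the OS file cache helps, which matters when deciding whether the 10-second budget is dominated by I/O or by the search itself.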

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 16 '05 #39

P: n/a


"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
"Willy Denoyette [MVP]" wrote:
Yes, thanks for the comments on B-M searching.

As for loading the file into memory, I specifically do *not* want to do
that.
Win32 has offered memory-mapped files for quite some time -- exactly what
I'm
after in the .Net world.


Willy.

Julie,

Like Jon said, I wasn't suggesting loading the whole file in memory at once,
but as you need to search the whole file you will have to transfer all file
data to memory at some point in time.
Also, using memory mapped files doesn't make sense here because:
- The total IO time will be the same as if you read the data directly into
your process space: you have to create a "file mapping object" with a size
equal to the file size, then create individual file views to do the search,
but as you need to search the whole file you effectively load the whole file
into (virtual) memory anyway.
- You don't share the file mapping object with other processes, which is one
of the main reasons to use mapped files.

Willy.
Nov 16 '05 #40

P: n/a
Daniel O'Connell [C# MVP] wrote:
For what its worth, this heavy push for a database is probably
foolish.


But I'm not pushing heavily for a database. I'm pushing heavily in
favor of looking beyond a very naive reading of the requirements, and
putting some knowledge of the domain space behind the problem.

For example, I have trouble believing that the input data here
("datafiles from external laboratory instruments") is completely free-form.
There are probably recognizable rows & columns. Further, I don't believe
searching would generally be limited to one search per file, so reading,
indexing, doing several indexed searches, and then expiring the index would
probably be faster than doing several simple text searches.
--
Truth,
James Curran [MVP]
www.NJTheater.com (Professional)
www.NovelTheory.com (Personal)


Nov 16 '05 #41

P: n/a

"James Curran" <Ja*********@mvps.org> wrote in message
news:%2****************@TK2MSFTNGP15.phx.gbl...
Daniel O'Connell [C# MVP] wrote:
For what its worth, this heavy push for a database is probably
foolish.


But I'm not pushing heavily for a database. I'm pushing heavily in
favor of looking beyond a very naive reading of the requirements, and
putting some knowledge of the domain space behind the problem.

For example, I have trouble believing that the input data here
("datafiles from external laboratory instruments") are completely
free-form. There's probably recognizable rows & columns. Further, I don't
believe searching would generally be limited to one search per file, so
reading, indexing, do several indexed searches, and then expiring the
indexing would probably be faster than doing several simple text searches.


Perhaps it has rows and columns, or perhaps it's a flat file that simply has
periodic data; perhaps the data is closer to an XML file than it is a
database, having tagged segments that may exist in any order. Perhaps the
file isn't constant; perhaps the equipment or network setup changes the
file commonly. Rows & columns are only one possible representation, and
static content is only one possibility.

Also, we have no way of knowing how quickly this data is flowing in. What if
this particular piece of equipment is generating hundreds of gigs a week?
Does a database still make sense? Does an index? Or does each file get
processed once and sent along to permanent storage? Is this query only run
against data sometimes, when something else occurs, or is it run dozens of
times an hour? Are the searches automated or are they something an
individual user does when they have reason to? Does the search utility have
to operate over millions of potential files or just one?

Without all of this information, what right do you have to call the OP's
choice naive? As anything we state is probably far less informed than what
theirs is, I can't really believe you would consider your own stance to be
the less naive of the bunch.
Whatever our personal experiences may be, that doesn't mean that our
experiences with any particular thing are the only ones possible.

Your suggestion may well be the absolute wrong thing to do. Or it may be the
right one; however, I just don't think you're in any better position than I
am to gauge that. One would hope the person who wrote the requirements *was*,
however.
Nov 16 '05 #42

James Curran wrote:

Daniel O'Connell [C# MVP] wrote:
For what it's worth, this heavy push for a database is probably
foolish.
But I'm not pushing heavily for a database. I'm pushing heavily in
favor of looking beyond a very naive reading of the requirements, and
putting some knowledge of the domain space behind the problem.


Thanks for the insult of calling me naive.

I really find it hard to believe that you think that I can't understand what my
requirements are.
For example, I have trouble believing that the input data here
("datafiles from external laboratory instruments") are completely free-form.
There's probably recognizable rows & columns. Further, I don't believe
searching would generally be limited to one search per file, so reading,
indexing, do several indexed searches, and then expiring the indexing would
probably be faster than doing several simple text searches.


I still don't see where I'm asking for anything faster than 1 hit in a 100 MB
file in 10 seconds or less. That is all that the performance requirement
dictates. Anything more is completely wasted effort.

I appreciate your comments and suggestions, but please, don't become so focused
on what you think is necessary that you completely ignore what I know to be the
requirements.

Thanks
Nov 16 '05 #43

Julie,

Just one other random thought, if I may. I guess the reason why you are
finding so much resistance is that pretty much everyone - except for you -
finds "solutions" in their jobs and remains open-minded in their work. Very
rarely are we asked to blindly "fill these requirements". Our industry is
such that, most companies simply can't afford to work that inefficiently.
Instead, most companies give a developer a "problem" and we are to use our
expertise to find the most efficient solution. "Efficient" means not only
the fastest to implement, but also the most scalable and easiest to
change.

In other words, when we hear such unreasonable requirements as you've
vaguely defined, everyone's first reaction is to get a handle on those,
because they don't sound reasonable. When someone comes to me with
outrageous requirements, I don't even BOTHER trying to answer the original
question, because 99% of the time, they don't have a handle on the problem.
But I'm gathering from your reaction and style that you likely work in the
gov't or aerospace - and I know that is a completely different mindset
there.

Anyhow - for future reference, it would probably be a big time-saver if you
were to actually share (in some level of detail) what your requirements are.
What IS the format of this data? Because without that, you come off as an
inexperienced and stubborn developer that is not pushing back on
unreasonable requirements when you CLEARLY should be (in our minds). So, if
you want people to stop reacting to your requirements, it might be best to
actually go into some detail about why they are so immutable so people can
get past it - and start helping with your actual problem.

For whatever it's worth..

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
James Curran wrote:

Daniel O'Connell [C# MVP] wrote:
For what it's worth, this heavy push for a database is probably
foolish.
But I'm not pushing heavily for a database. I'm pushing heavily in
favor of looking beyond a very naive reading of the requirements, and
putting some knowledge of the domain space behind the problem.


Thanks for the insult of calling me naive.

I really find it hard to believe that you think that I can't understand

what my requirements are.
For example, I have trouble believing that the input data here
("datafiles from external laboratory instruments") are completely free-form. There's probably recognizable rows & columns. Further, I don't believe
searching would generally be limited to one search per file, so reading,
indexing, do several indexed searches, and then expiring the indexing would probably be faster than doing several simple text searches.
I still don't see where I'm asking for anything faster than 1 hit in a 100

MB file in 10 seconds or less. That is all that the performance requirement
dictates. Anything more is completely wasted effort.

I appreciate your comments and suggestions, but please, don't become so focused on what you think is necessary that you completely ignore what I know to be the requirements.

Thanks

Nov 16 '05 #44

Drebin wrote:

Julie,

Just one other random thought, if I may. I guess the reason why you are
finding so much resistance is that pretty much everyone - except for you -
finds "solutions" in their jobs and remains open-minded in their work. Very
rarely are we asked to blindly "fill these requirements". Our industry is
such that, most companies simply can't afford to work that inefficiently.
Instead, most companies give a developer a "problem" and we are to use our
expertise to find the most efficient solution. "Efficient" means not only
the fastest to implement, but also the the most scalable and easiest to
change.
Your points would be valid if you were talking to someone who didn't know
what they were doing and/or who asked for comments on process. Neither of those
applies to me.

I'm *very* open-minded, I examine all of the potential solutions that I can,
and make informed decisions. I did that work before posing the original
question. I challenge you to be a little more open-minded and realize that
simple-text searching can be (and IS! in this case) a valid solution to the
problem posed.

I work very closely w/ those that define the requirements, I know exactly what
they want, and they are informed as to the details, costs, issues, etc.

Honestly, if I had gone your route and implemented this using a database, it would
have completely changed the disposition of the components that I'm working on,
the installation requirements, licensing requirements, base system
requirements, implementation time frame, complexity, maintainability, version
issues, etc., etc., etc. Absolutely none of that can be tolerated on the
project for a relatively simple part of the component that I'm working on.
In other words, when we hear such unreasonable requirements as you've
vaguely defined, everyone's first reaction is to get a handle on those,
because they don't sound reasonable.
Answer me one question: how can you determine that those requirements are
unreasonable?
When someone comes to me with
outrageous requirements, I don't even BOTHER trying to answer the original
question, because 99% of the time, they don't have a handle on the problem.
But I'm gathering from your reaction and style that you likely work in the
gov't or aerospace - and I know that is a completely different mindset
there.
Oh wise sage, I work in neither of those disciplines.
Anyhow - for future reference, it would probably be a big time-saver if you
were to actually share (in some level of detail) what your requirements are.
What IS the format of this data? Because without that, you come off as an
inexperienced and stubborn developer that is not pushing back on
unreasonable requirements when you CLEARLY should be (in our minds). So, if
you want people to stop reacting to your requirements, it might be best to
actually go into some detail about why they are so immutable so people can
get past it - and start helping with your actual problem.
How about this, for your future reference: try answering the question posed,
don't spend so much time trying to read more into it than exists. If you have
questions about the requirements, ask them _after_ you have answered the
original question. Otherwise, you come off as a know-it-all.

Finally, my requirements were well defined, you just don't happen to want to
believe them.

For whatever it's worth..

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
James Curran wrote:

Daniel O'Connell [C# MVP] wrote:

> For what it's worth, this heavy push for a database is probably
> foolish.

But I'm not pushing heavily for a database. I'm pushing heavily in
favor of looking beyond a very naive reading of the requirements, and
putting some knowledge of the domain space behind the problem.


Thanks for the insult of calling me naive.

I really find it hard to believe that you think that I can't understand

what my
requirements are.
For example, I have trouble believing that the input data here
("datafiles from external laboratory instruments") are completely free-form. There's probably recognizable rows & columns. Further, I don't believe
searching would generally be limited to one search per file, so reading,
indexing, do several indexed searches, and then expiring the indexing would probably be faster than doing several simple text searches.


I still don't see where I'm asking for anything faster than 1 hit in a 100

MB
file in 10 seconds or less. That is all that the performance requirement
dictates. Anything more is completely wasted effort.

I appreciate your comments and suggestions, but please, don't become so

focused
on what you think is necessary that you completely ignore what I know to

be the
requirements.

Thanks

Nov 16 '05 #45

Julie wrote:

What is the *fastest* way in .NET to search large on-disk text files (100+ MB)
for a given string.

The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted.

I don't want to load the entire file into physical memory, memory-mapped files
are ok (and preferred). Speed/performance is a requirement -- the target is to
locate the string in 10 seconds or less for a 100 MB file. The search string
is typically 10 characters or less. Finally, I don't want to spawn out to an
external executable (e.g. grep), but include the algorithm/method directly in
the .NET code base. For the first rev, wildcard support is not a requirement.


Thanks to all those that replied.

I spent a little time researching some of the access and search methods
proposed, and the funny thing is that the simplest and most straightforward
implementation actually turned out to be quite sufficient.

As indicated, my requirement was to search a 100 MB text file for a string in
10 seconds or less. My initial results (debug, unoptimized) are right around 5
seconds on the target system, presumably the release/optimized build will be a
bit faster.

Implementation is essentially opening a text stream (StreamReader) and reading
the contents line by line, looking for the search string. Total implementation
is about 10 lines of code.
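A minimal version of that line-by-line scan would look something like the following (a sketch consistent with Julie's description, not her actual code; the class and method names are invented for illustration):

```csharp
using System;
using System.IO;

class FileSearch
{
    // Streams the file line by line and returns true as soon as the
    // search string is found. Assumes (as confirmed later in the
    // thread) that a match never spans a line boundary.
    public static bool ContainsString(string path, string search)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.IndexOf(search) >= 0)
                    return true;
            }
        }
        return false;
    }

    static void Main(string[] args)
    {
        Console.WriteLine(ContainsString(args[0], args[1]) ? "found" : "not found");
    }
}
```

The early return matters for average-case timing: a hit near the start of the file finishes almost immediately, so the ~5-second figure presumably reflects the worst case of scanning the full 100 MB.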
Nov 16 '05 #46

Julie,

Purely out of interest - how are you handling a match that spans two lines?

Regards

John Timney
Microsoft Regional Director
Microsoft MVP
"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
Julie wrote:

What is the *fastest* way in .NET to search large on-disk text files (100+ MB) for a given string.

The files are unindexed and unsorted, and for the purposes of my immediate requirements, can't be indexed/sorted.

I don't want to load the entire file into physical memory, memory-mapped files are ok (and preferred). Speed/performance is a requirement -- the target is to locate the string in 10 seconds or less for a 100 MB file. The search string is typically 10 characters or less. Finally, I don't want to spawn out to an external executable (e.g. grep), but include the algorithm/method directly in the .NET code base. For the first rev, wildcard support is not a
requirement.
Thanks to all those that replied.

I spent a little time researching some of the access and search methods
proposed, and the funny thing is that the most simple and straightforward
implementation actually turned out to be quite sufficient.

As indicated, my requirement was to search a 100 MB text file for a string in 10 seconds or less. My initial results (debug, unoptimized) are right around 5 seconds on the target system, presumably the release/optimized build will be a bit faster.

Implementation is essentially opening a text stream (StreamReader) and reading the contents, line by line looking for the search string. Total implementation is about 10 lines of code.

Nov 16 '05 #47


"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
Julie wrote:

What is the *fastest* way in .NET to search large on-disk text files
(100+ MB)
for a given string.

The files are unindexed and unsorted, and for the purposes of my
immediate
requirements, can't be indexed/sorted.

I don't want to load the entire file into physical memory, memory-mapped
files
are ok (and preferred). Speed/performance is a requirement -- the target
is to
locate the string in 10 seconds or less for a 100 MB file. The search
string
is typically 10 characters or less. Finally, I don't want to spawn out
to an
external executable (e.g. grep), but include the algorithm/method
directly in
the .NET code base. For the first rev, wildcard support is not a
requirement.


Thanks to all those that replied.

I spent a little time researching some of the access and search methods
proposed, and the funny thing is that the most simple and straightforward
implementation actually turned out to be quite sufficient.

As indicated, my requirement was to search a 100 MB text file for a string
in
10 seconds or less. My initial results (debug, unoptimized) are right
around 5
seconds on the target system, presumably the release/optimized build will
be a
bit faster.

Implementation is essentially opening a text stream (StreamReader) and
reading
the contents, line by line looking for the search string. Total
implementation
is about 10 lines of code.


Did you try to flush the File System cache first?
I'm pretty sure the file was (partly) cached in the FS cache when you did
your test.

Willy.

Nov 16 '05 #48

"John Timney (Microsoft MVP)" wrote:

Julie,

Purely out of interest - how are you handling a match that spans two lines?
As a matter of definition of the file format, the search string cannot span
lines, so no extra processing required.

"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com...
Julie wrote:

What is the *fastest* way in .NET to search large on-disk text files (100+ MB) for a given string.

The files are unindexed and unsorted, and for the purposes of my immediate requirements, can't be indexed/sorted.

I don't want to load the entire file into physical memory, memory-mapped files are ok (and preferred). Speed/performance is a requirement -- the target is to locate the string in 10 seconds or less for a 100 MB file. The search string is typically 10 characters or less. Finally, I don't want to spawn out to an external executable (e.g. grep), but include the algorithm/method directly in the .NET code base. For the first rev, wildcard support is not a

requirement.

Thanks to all those that replied.

I spent a little time researching some of the access and search methods
proposed, and the funny thing is that the most simple and straightforward
implementation actually turned out to be quite sufficient.

As indicated, my requirement was to search a 100 MB text file for a string

in
10 seconds or less. My initial results (debug, unoptimized) are right

around 5
seconds on the target system, presumably the release/optimized build will

be a
bit faster.

Implementation is essentially opening a text stream (StreamReader) and

reading
the contents, line by line looking for the search string. Total

implementation
is about 10 lines of code.

Nov 16 '05 #49

"Willy Denoyette [MVP]" wrote:
Did you try to flush the File System cache first?
I'm pretty sure the file was (partly) cached in the FS cache when you did
your test.


At first I thought the same thing, but I ran the test & timing on several
machines w/ similar results on the first run.

Do you know of a way to programmatically flush the cache from .NET? If so,
I'll try it again w/ a forced flush and see if it changes my results.
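For reference, there is no documented managed API for flushing the file-system cache. One workaround for benchmarking is to take the cache out of the measurement altogether by reading through a handle opened with FILE_FLAG_NO_BUFFERING. The Win32 flag and its sector-alignment rules are real; the P/Invoke helper below is a Windows-only sketch, and the helper name and chunk size are arbitrary choices:

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

// Reads a whole file through FILE_FLAG_NO_BUFFERING, bypassing the
// file-system cache so the reads come straight from disk.
class UnbufferedRead
{
    const uint GENERIC_READ = 0x80000000;
    const uint FILE_SHARE_READ = 0x00000001;
    const uint OPEN_EXISTING = 3;
    const uint FILE_FLAG_NO_BUFFERING = 0x20000000;
    const uint MEM_COMMIT = 0x1000, MEM_RELEASE = 0x8000;
    const uint PAGE_READWRITE = 0x04;

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
    static extern IntPtr CreateFile(string name, uint access, uint share,
        IntPtr security, uint disposition, uint flags, IntPtr template);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool ReadFile(IntPtr handle, IntPtr buffer, uint toRead,
        out uint read, IntPtr overlapped);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool CloseHandle(IntPtr handle);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr addr, UIntPtr size, uint type, uint protect);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool VirtualFree(IntPtr addr, UIntPtr size, uint type);

    public static void ReadWholeFileUncached(string path)
    {
        IntPtr h = CreateFile(path, GENERIC_READ, FILE_SHARE_READ,
            IntPtr.Zero, OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, IntPtr.Zero);
        if (h == new IntPtr(-1)) throw new Win32Exception();

        // NO_BUFFERING requires sector-aligned buffers and transfer sizes;
        // VirtualAlloc memory is page-aligned, which satisfies that, and
        // 64 KB is a multiple of any common sector size.
        uint chunk = 64 * 1024;
        IntPtr buf = VirtualAlloc(IntPtr.Zero, (UIntPtr)chunk, MEM_COMMIT, PAGE_READWRITE);
        try
        {
            uint read;
            while (ReadFile(h, buf, chunk, out read, IntPtr.Zero) && read > 0) { }
        }
        finally
        {
            VirtualFree(buf, UIntPtr.Zero, MEM_RELEASE);
            CloseHandle(h);
        }
    }
}
```

Note this bypasses the cache for its own reads rather than evicting pages already cached from an earlier run, so for a truly cold first-run number the low-tech options remain: reboot, or read enough unrelated data to push the file out of the cache.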
Nov 16 '05 #50
