What is the *fastest* way in .NET to search large on-disk text files (100+ MB)
for a given string.
The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted.
I don't want to load the entire file into physical memory, memory-mapped files
are ok (and preferred). Speed/performance is a requirement -- the target is to
locate the string in 10 seconds or less for a 100 MB file. The search string
is typically 10 characters or less. Finally, I don't want to spawn out to an
external executable (e.g. grep), but include the algorithm/method directly in
the .NET code base. For the first rev, wildcard support is not a requirement.
Thanks for any pointers!
I would suggest that you have a look at the Regex implementation. I think Regex
is the fastest when it comes to scanning.
You might need to use a FileStream to load the file, so I don't think it's the
most appropriate answer.
Anyway, make a local copy of one of those files and give Regex a try. See if
it comes anywhere near the 10 sec mark.
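A minimal trial along those lines might look like the following sketch; the file path and search string are placeholders supplied by the caller, and the whole file is read into memory for the test run only:

```csharp
using System.IO;
using System.Text.RegularExpressions;

// Trial sketch: scan a local copy of one file with Regex.
// Reads the whole file into memory for the benchmark run only.
static class RegexScan
{
    // Returns the index of the first match, or -1 if absent.
    public static int Find(string path, string needle)
    {
        string text = File.ReadAllText(path);
        // Escape the needle so it is treated as a literal, not a pattern.
        Match m = Regex.Match(text, Regex.Escape(needle));
        return m.Success ? m.Index : -1;
    }
}
```

Wrapping the call in a System.Diagnostics.Stopwatch on a 100 MB copy should show quickly whether the 10-second budget holds.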
--
Regards,
Hermit Dave
( http://hdave.blogspot.com)
"Julie" <ju***@nospam.com> wrote in message news:41***************@nospam.com... [original post snipped]
I wouldn't spend any more time on seeing *if* you can do this until you find
out *why* you have to do this!
100mb flat file?? This is exactly the reason why relational databases were
made and are still used for just about everything. Without knowing more
about your app, I'd rather take the 2 minutes to load this into a SQL table,
build an index - and then what you want to do, suddenly becomes quick
(sub-second), simple and will support wildcards later. Maybe bulk-load your
file at night - and have your front-end hit the database during the day?
I don't think you will be happy with just about any solution. Every response
you get to this is either going to be way too slow -or- way too
complicated. You're re-inventing the wheel!!
My $ .02
"Julie" <ju***@nospam.com> wrote in message news:41***************@nospam.com... [original post snipped]
Just load the text file into a large string and use string.IndexOf(); this
should be even faster than Regex.
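A sketch of that approach, with one caveat added: an ordinal comparison avoids the culture-sensitive matching that makes the default IndexOf overloads slow on large strings. The path and needle below are placeholders, and (as noted elsewhere in the thread) this does load the whole file:

```csharp
using System;
using System.IO;

// Sketch of the IndexOf suggestion. The ordinal comparison matters:
// the default culture-aware IndexOf is far slower on big strings.
static class PlainIndexOf
{
    // Returns the index of the first occurrence, or -1 if absent.
    public static int Find(string path, string needle)
    {
        string text = File.ReadAllText(path); // loads the whole file
        return text.IndexOf(needle, StringComparison.Ordinal);
    }
}
```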
--
cody
[Freeware, Games and Humor] www.deutronium.de.vu || www.deutronium.tk
"Julie" <ju***@nospam.com> schrieb im Newsbeitrag news:41***************@nospam.com... [original post snipped]
Given the size of the file, probably the only way would be to use a
file byte array: load the file in as bytes, one at a time, and convert them to
chars by creating an indexer. You will need to work out a way of checking
whether a series of chars makes up the string you are looking for.
I would start here. http://msdn.microsoft.com/library/de...us/csref/html/
vcwlkindexerstutorial.asp
However, I very much doubt you will manage to scan 100 meg of data in 10
seconds.
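A buffered variant of that idea may be worth trying before giving up on the 10-second target: read the file in large blocks rather than one byte at a time, carrying the tail of each block forward so a match that straddles a block boundary isn't missed. The block size, names, and single-byte-text assumption below are all illustrative:

```csharp
using System;
using System.IO;
using System.Text;

// Buffered search: read the file in blocks, carrying the last
// (pattern.Length - 1) bytes forward so a hit that straddles a block
// boundary is still found. Assumes single-byte (ASCII/ANSI) text.
static class ChunkedSearch
{
    // Returns the byte offset of the first occurrence, or -1.
    public static long Find(string path, string needle, int blockSize = 1 << 20)
    {
        byte[] pattern = Encoding.ASCII.GetBytes(needle);
        byte[] buffer = new byte[blockSize + pattern.Length - 1];
        int carried = 0;   // bytes carried over from the previous block
        long offset = 0;   // file offset of buffer[0]

        using (FileStream fs = File.OpenRead(path))
        {
            int read;
            while ((read = fs.Read(buffer, carried, blockSize)) > 0)
            {
                int valid = carried + read;
                // Brute-force scan of this block; a skip-table search
                // could be swapped in here later.
                for (int i = 0; i + pattern.Length <= valid; i++)
                {
                    int j = 0;
                    while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                    if (j == pattern.Length) return offset + i;
                }
                carried = Math.Min(pattern.Length - 1, valid);
                Array.Copy(buffer, valid - carried, buffer, 0, carried);
                offset += valid - carried;
            }
        }
        return -1;
    }
}
```

For a file this size, sequential I/O is likely to dominate, so larger reads should help more than micro-optimizing the inner comparison loop.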
--
Regards
John Timney
Microsoft Regional Director
Microsoft MVP
"Julie" <ju***@nospam.com> wrote in message news:41***************@nospam.com... [original post snipped]
Drebin wrote: I wouldn't spend any more time on seeing *if* you can do this, until you find out *why* you have to do this!
All requirements have been defined at this point by project management. This
isn't just a blind decision, but the result of the examination of the domain
and expected results.
100mb flat file?? This is exactly the reason why relational databases were made and are still used for just about everything. Without knowing more about your app, I'd rather take the 2 minutes to load this into a SQL table, build an index - and then what you want to do, suddenly becomes quick (sub-second), simple and will support wildcards later. Maybe bulk-load your file at night - and have your front-end hit the database during the day?
Yes, 100+ MB flat files. These are loosely formatted datafiles from external
laboratory instruments.
Remember, proper implementation dictates that you implement what is required,
nothing more. The current requirements are simple access & management of these
text files that allows immediate searching that averages 10 seconds or less.
WinGrep accomplishes this in 6 seconds on our target system (1.3 GHz, 500 MB
RAM).
Future requirements *may* dictate that additional time constraints are imposed,
which would then lead to db or other external indexing of the files. But, that
will be implemented when and if necessary. If you have questions about this
approach, you may want to look into the (industrial) extreme programming
paradigm, which is what our shop successfully uses.
I don't think you will be happy with just about any solution. Every response you will get to this is either going to be way too slow -or- way too complicated. You're re-inventing the wheel!!
Nay, my good friend, not re-inventing the wheel, but asking where the wheel
is. Text-searching of large files isn't uncommon or inappropriate. I'm just
looking into comments on such searches in .Net; this stuff is fairly trivial in
C++/Win32 (I'd prefer *not* to drop down to managed/unmanaged C++ for this
project).
"Julie" <ju***@nospam.com> wrote in message news:41***************@nospam.com... [original post snipped]
"John Timney (Microsoft MVP)" wrote: Given the size of the file, probably the only way would be to use a filebytearray, load the file in as bytes 1 at a time and convert them to chars by creating an indexer. You will need to work out a way of checking if a series of chars make up the string you are looking for.
I would start here.
http://msdn.microsoft.com/library/de...us/csref/html/ vcwlkindexerstutorial.asp
However, I very much doubt you will manage to scan 100 meg of data in 10 seconds.
Thanks, I'll look into that.
WinGrep performs the search in 6 seconds on our target system (1.3 GHz, 500 MB
RAM). (WinGrep is not open source, and C++/Win32.) -- Regards
John Timney Microsoft Regional Director Microsoft MVP
"Julie" <ju***@nospam.com> wrote in message news:41***************@nospam.com... [original post snipped]
Julie wrote: Nay, my good friend, not re-inventing the wheel, but asking where the wheel is. Text-searching of large files isn't uncommon or inappropriate. I'm just looking into comments on such searches in .Net; this stuff is fairly trivial in C++/Win32 (I'd prefer not to drop down to managed/unmanaged C++ for this project).
Searching text in text blocks should use the text-search algorithm by
Knuth-Morris-More or the Boyer-Moore variant. These algorithms are much
faster than the brute-force search implemented in the string class.
Algorithms in C by Sedgewick contains a description of these algorithms, and
I'm sure you'll find some descriptions on the internet.
Basically they come down to this:
string: ababababcababababacababababacbabababacbabababc
If you now try to find the string 'abc', don't start comparing at the first
character of the pattern, but at the last. So compare the 3rd character of the
text, an 'a', against the pattern's last character, 'c'. They don't match, and
because of where 'a' sits in the pattern you can shift the pattern ahead
without ever examining the first characters.
It works with skip arrays and is quite clever; it will tremendously speed up
string search, especially with large texts.
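The skip-array idea can be sketched in C# using the simplified Boyer-Moore-Horspool variant: compare from the end of the pattern and, on a mismatch, shift by how far the text byte under the pattern's last position is from the pattern's end. The byte/ASCII framing here is an assumption, not from the thread:

```csharp
using System;

// Boyer-Moore-Horspool sketch: a 256-entry skip table lets the search
// jump ahead by up to pattern.Length bytes on each mismatch.
static class Horspool
{
    // Returns the index of the first occurrence, or -1 if absent.
    public static int IndexOf(byte[] text, byte[] pattern)
    {
        if (pattern.Length == 0) return 0;
        int[] skip = new int[256];
        for (int i = 0; i < 256; i++) skip[i] = pattern.Length;
        for (int i = 0; i < pattern.Length - 1; i++)
            skip[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos + pattern.Length <= text.Length)
        {
            int j = pattern.Length - 1;           // compare right-to-left
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;                // full match
            pos += skip[text[pos + pattern.Length - 1]];
        }
        return -1;
    }
}
```

The same scan could replace the inner loop of a block-at-a-time file reader, so the file never has to be held in memory at once.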
Frans.
--
Get LLBLGen Pro, productive O/R mapping for .NET: http://www.llblgen.com
My .NET Blog: http://weblogs.asp.net/fbouma
Microsoft C# MVP
Julie wrote: What is the *fastest* way in .NET to search large on-disk text files (100+ MB) for a given string.
I don't want to load the entire file into physical memory, memory-mapped files are ok (and preferred).
The problem is that fast access to a large file requires direct access
to memory, which is antithetical to managed code. Your best choice would be
to isolate the search in an unmanaged C++ function, which is called by the
managed C# app. Accessing the file as a memory-mapped file is fairly easy in
unmanaged code, but I don't believe it's possible at all in a managed app.
The best algorithm for searching your file is probably the Boyer-Moore
method. Moore himself has a cool webpage graphically demonstrating it: http://www.cs.utexas.edu/users/moore...ing/index.html.
Googling "Boyer-Moore" will provide any number of implementations, such as
this one: http://www.dcc.uchile.cl/~rbaeza/han...3b.srch.c.html
--
Truth,
James Curran [MVP] www.NJTheater.com (Professional) www.NovelTheory.com (Personal)
If you want to jam them out in C#, I for one would really appreciate it and
keep it for future use. Or post at CodeProject. TIA.
--
William Stacey, MVP
"Frans Bouma [C# MVP]" <pe******************@xs4all.nl> wrote in message
news:xn***************@msnews.microsoft.com... Julie wrote: Nay, my good friend, not re-inventing the wheel, but asking where the
wheel is. Text-searching of large files isn't uncommon or inappropriate. I'm
....
William Stacey [MVP] wrote: If you want to jam them out in C#, I for one would really appreciate it and keep it for future use. Or post at CodeProject. TIA.
I made a typo: it's Knuth-Morris-Pratt, not Knuth-Morris-Moore.
Here is a link to a lot of string search algorithms. Both I mentioned are
there with C code and explanation. It's pretty straightforward :) http://www-igm.univ-mlv.fr/~lecroq/string/
Frans.
--
Get LLBLGen Pro, productive O/R mapping for .NET: http://www.llblgen.com
My .NET Blog: http://weblogs.asp.net/fbouma
Microsoft C# MVP
Drebin wrote: "Julie" <ju***@nospam.com> wrote in message news:41***************@nospam.com... Remember, proper implementation dictates that you implement what is required, nothing more. Wow. Most every other developer would argue that "proper" implementation dictates: functionality that is required and then any other structural or design implementations that will likely be used in the future.
I was in that group of developers 6 months ago. I was raised old school where
requirements, specs, and definitions are done before coding starts.
Believe me, the paradigm switch *wasn't* easy nor intuitive. However, I have
to say, that there is a lot to IXP that really impacts the production quality
*and* performance of the entire project. I'm no expert, but I'm making
progress, and probably won't switch back to the old way.
And I guarantee that I wasn't excited about the switch. However, I committed
to trying it for 6 months, and then reevaluate. So far, so good.
That's like saying "my customers told me to build a house and they want 3 bedrooms".
Right -- and that is what you build. The foresight is to build it in such a
fashion that future changes can be easily accomplished. One of the
primary keys is constant refactoring of code and TDD. That leaves the code in
a much more stable and flexible state. It doesn't sound efficient, but when
you are proficient at it, it really is more effective.
You need to have the foresight and wisdom to anticipate that later they are going to say "we want a kitchen and bathrooms too!".
In your hypothetical example, what if they never want a kitchen/bathroom? Then
you have wasted work.
If you don't, you will *always* be doing way more work than you need to! Customers often don't know what they want - it's up to your expertise to help them define that.
For your app, common sense tells me: if they want to search this data, they will want to sort it, and they will want to browse and filter results. Going your route, EACH step will likely take just as long. If you did it the right way once, in the beginning - then each additional thing the users wanted would be practically free.
See, your common sense is already wrong, and if followed, leads down the wrong
path and direction for this application.
Anyway, the thread isn't to discuss the merits of IXP, merely looking for a
quick text search implementation in C# -- nothing more, nothing less.
Julie wrote: If you have questions about this approach, you may want to look into the (industrial) extreme programming paradigm, which is what our shop successfully uses.
You can call it what you want and justify it however, but I'm really glad I don't work where you work. :-)
Michael C wrote: That's like saying "my customers told me to build a house and they want 3 bedrooms". You need to have the foresight and wisdom to anticipate that later they are going to say "we want a kitchen and bathrooms too!". If you don't, you will *always* be doing way more work than you need to! Customers often don't know what they want - it's up to your expertise to help them define that.
If you're storing 100MB of data in a flat file, you're probably not running very efficiently to begin with. Like Drebin said, you need to look at your ultimate goals and you'll probably find much better solutions out there (i.e., SQL, etc.)
Like I indicated in another follow-up: "These are loosely formatted datafiles
from external laboratory instruments." The requirement is to work w/ these
files, not change the file format.
That being said, you need to have the foresight, wisdom and *discipline* to define all requirements *up-front*. Adding extra kitchens and bathrooms to the floorplan halfway through the house building process jacks the cost - in time and $$$'s - *way* up. It's true that a lot of users don't understand the requirements definition process, and many aren't aware of the features they'll want down the road; that's where you step in and help guide them during the planning process. If you're constantly re-writing your applications because the user requirements keep changing during the implementation phase, you might want to look at improving your own planning process.
Thank you for the compliment! Yes, we definitely do have foresight, wisdom and
discipline. The conclusion of our investigation was the following
requirements:
Search an approx. 100 MB flat text file for a given string in an average of 10
seconds or less, no wildcards.
Now, if I can just have the *discipline* to implement something that meets
_those_ requirements, I get the job done, am appreciated by the team, we
release the product, I get paid, and everyone is happy.
James Curran wrote: Julie wrote: What is the *fastest* way in .NET to search large on-disk text files (100+ MB) for a given string.
I don't want to load the entire file into physical memory, memory-mapped files are ok (and preferred).
The problem is that fast access to a large file requires direct access to memory, which is antithetical to managed code. Your best choice would be to isolate the search in a unmanaged C++ function, which is coded by the managed C# app. Access the file as a memory mapped file is fairly easy in unmanaged code, but I don't believe it's possible at all in a managed app.
The best algorithm for searching your file is probably the Boyer-Moore method. Moore himself has a cool webpage graphically demostrating it: http://www.cs.utexas.edu/users/moore...ing/index.html. Googling "Boyer-Moore" will provide any number of implementations, such as this one: http://www.dcc.uchile.cl/~rbaeza/han...3b.srch.c.html
Thanks for the tips. I wasn't aware that mm file support wasn't available in
.NET; seems short-sighted to me.
Managed/unmanaged code really isn't a possibility, part of the requirements is
that it is implemented in C#. I may be able to get away w/ using interop
though for the mm file support.
Julie wrote: Managed/unmanaged code really isn't a possibility, part of the requirements is that it is implemented in C#. I may be able to get away w/ using interop though for the mm file support.
Not interop, but P/Invoke...
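For what it's worth, a P/Invoke sketch might look like the following. The Win32 calls are the standard CreateFileMapping/MapViewOfFile pair; Marshal.ReadByte is used only to keep the sketch compilable without /unsafe -- a real version would scan with unsafe pointers and a smarter algorithm. Windows-only, with minimal error handling:

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// P/Invoke memory-mapped search sketch (Windows only).
// Marshal.ReadByte keeps the example safe-code-only; swap in unsafe
// pointer access for real throughput.
static class MappedFind
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateFileMapping(IntPtr hFile, IntPtr lpAttributes,
        uint flProtect, uint dwMaxSizeHigh, uint dwMaxSizeLow, string lpName);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr MapViewOfFile(IntPtr hMap, uint dwDesiredAccess,
        uint dwFileOffsetHigh, uint dwFileOffsetLow, IntPtr dwNumberOfBytesToMap);

    [DllImport("kernel32.dll")]
    static extern bool UnmapViewOfFile(IntPtr lpBaseAddress);

    [DllImport("kernel32.dll")]
    static extern bool CloseHandle(IntPtr handle);

    const uint PAGE_READONLY = 0x02;
    const uint FILE_MAP_READ = 0x04;

    // Returns the byte offset of the first occurrence, or -1.
    public static int Find(string path, byte[] pattern)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            int len = (int)fs.Length;
            IntPtr map = CreateFileMapping(fs.SafeFileHandle.DangerousGetHandle(),
                IntPtr.Zero, PAGE_READONLY, 0, 0, null);
            IntPtr view = MapViewOfFile(map, FILE_MAP_READ, 0, 0, IntPtr.Zero);
            try
            {
                for (int i = 0; i + pattern.Length <= len; i++)
                {
                    int j = 0;
                    while (j < pattern.Length &&
                           Marshal.ReadByte(view, i + j) == pattern[j]) j++;
                    if (j == pattern.Length) return i;
                }
                return -1;
            }
            finally { UnmapViewOfFile(view); CloseHandle(map); }
        }
    }
}
```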
Use Perl.
"Julie" <ju***@nospam.com> wrote in message news:41***************@nospam.com... [quoted text snipped]
Michael C wrote: Use Perl.
?!
You've got some serious specifications, but haven't given enough information to
really help you find a solution. So I suggested using a language that was
designed specifically to perform text processing. Maybe you could provide
more information, like:
You specify 10 seconds to locate text matches in a 100 MB flat text file.
Is the file already loaded into memory? Do you want to load the whole file
into memory first? Does the load time count against your 10 seconds, or is
it in addition to? If it counts against your 10 seconds, can your
recommended system configuration load it in 10 seconds? (If not, the whole
point is moot). How many matches are you searching for? One match, every
match? Is the file structured in such a way that its format can be
leveraged to speed up the process? Are there certain fields that are
searched more than others in searches?
Assuming I was *stuck* with a 100 MB flat text file, and no option to
utilize a SQL database or other method of access, I suppose I'd have to
*reinvent the wheel* and create a separate index file to retrieve the data
in reasonable time frames. Of course that may not be an option for you.
Thanks,
Michael C., MCDBA
"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com... Michael C wrote: Use Perl.
?!
How would you load a 100MB txt file into a DB and then search it for a
word? How would that work?
Create a loader program/BCP/DTS job to split the records accordingly and
load them into a table structure (say, named "customer")... then you could do
something like:
select customerid from customer where customer_lastname = 'smith'
assuming you had customer data to load. This is pretty standard stuff - did
you mean this in a different way? SQL is already a really, really powerful
tool for loading, sorting and searching for data. So for someone to want to,
and think they could do a better job - is ambitious to say the least. SQL
concepts (like indexing and searching) are time-tested, I can't picture
challenging it and thinking I could do a better job!!
"Phill" <wa********@yahoo.com> wrote in message
news:ac**************************@posting.google.c om... How would you load a 100MB txt file into a DB and then search it for a word? How would that work?
Julie wrote: Thanks for the tips. I wasn't aware that mm file support wasn't available in .NET, seems short-sighted to me.
What good is a memory mapped file in an environment where you cannot
directly access memory?
--
Truth,
James Curran [MVP] www.NJTheater.com (Professional) www.NovelTheory.com (Personal)
Julie wrote: Like I indicated in another follow-up: "These are loosely formatted datafiles from external laboratory instruments." The requirement is to work w/ these files, not change the file format.
Well, admittedly I don't know the specifics of the contract, but it
seems that you are taking that a bit too literally. As far as I can see,
you are required to:
A) Read in the 100 MB flat text file, and
B) Spit out results found within that file.
Exactly how you get from A) to B) is strictly your concern, and if you want
to implement it by loading it into a SQL database or otherwise indexing it,
no one else even needs to know about it.
--
Truth,
James Curran [MVP] www.NJTheater.com (Professional) www.NovelTheory.com (Personal)
Exactly my point. The only way I can think of to search 100 MB files
quickly would require a separate index, preferably kept in a separate file
so that it didn't have to be re-created from scratch every darn time you run
the program. That being said, that's exactly what SQL does; any type of
separate indexing method would basically be re-inventing the wheel. And
heck, they're giving away MSDE for free, so you don't even have to buy SQL
Server. Any solution short of using a separate index of some sort is going
to be very non-scalable and comparatively slow. But alack and alas, to each
her own...
Thanks,
Michael C., MCDBA
"Drebin" <th*******@hotmail.com> wrote in message news:bP*****************@newssvr33.news.prodigy.co m... [quoted text snipped]
James Curran wrote: Julie wrote: Thanks for the tips. I wasn't aware that mm file support wasn't available in .NET, seems short-sighted to me.
What good is a memory mapped file in an environment where you cannot directly access memory?
Can't directly access memory? Sure you can.
In this case, it could be an abstraction layer on something like char[] or
byte[].
Michael C wrote: Exactly my point. The only way I can think of to search 100 MB files quickly would require a separate index, preferably kept in a separate file so that it didn't have to be re-created from scratch every darn time you run the program. [rest of quoted text snipped]
As I indicated in my original post, indexing isn't an option (and this
therefore excludes any database access).
Apparently you aren't aware of any methods in .NET to accomplish what I need --
thanks for your input. Thanks, Michael C., MCDBA
"Drebin" <th*******@hotmail.com> wrote in message news:bP*****************@newssvr33.news.prodigy.co m... Create a loader program/BCP/DTS job to split the records accordingly and load them into a table structure (say named "customer").. then you could do something like:
select customerid from customer where customer_lastname = 'smith'
assuming you had customer data to load. This is pretty standard stuff - did you mean this in a different way? SQL is already a really, really powerful tool for loading, sorting and searching for data. So for someone to want to, and think they could do a better job - is ambitious to say the least. SQL concepts (like indexing and searching) are time-tested, I can't picture challenging it and thinking I could do a better job!!
"Phill" <wa********@yahoo.com> wrote in message news:ac**************************@posting.google.c om... How would you load a 100MB txt file into a DB and then search it for a word? How would that work?
Michael C wrote: You got some serious specifications, but haven't given enough information to really help you find a solution.
I wasn't asking for a solution, I was asking:
"What is the *fastest* way in .NET to search large on-disk text files
(100+ MB) for a given string."
So I suggested using a language that was designed specifically to perform text processing.
I'm not doing text processing, I'm searching a text file for a given string --
that is all.
Maybe you could provide more information, like:
You specify 10 seconds to locate text matches in a 100 MB flat text file. Is the file already loaded into memory?
No, the file is on disk, as I indicated in my original post:
"I don't want to load the entire file into physical memory"
Do you want to load the whole file into memory first?
No: "I don't want to load the entire file into physical memory"
Does the load time count against your 10 seconds, or is it in addition to? If it counts against your 10 seconds, can your recommended system configuration load it in 10 seconds? (If not, the whole point is moot).
N/A: "I don't want to load the entire file into physical memory"
How many matches are you searching for? One match, every match?
Typically 0 or 1 (i.e. found/not found).
Is the file structured in such a way that its format can be leveraged to speed up the process?
For the purposes of my requirements, the file is unordered, unstructured,
essentially random text.
Are there certain fields that are searched more than others in searches?
No.
Assuming I was *stuck* with a 100 MB flat text file, and no option to utilize a SQL database or other method of access, I suppose I'd have to *reinvent the wheel* and create a separate index file to retrieve the data in reasonable time frames. Of course that may not be an option for you.
Not stuck, that is the requirement. Or, do you presume the original
requirements for grep to be 'stuck'?
Again, I'm not interested in reinventing the wheel as you put it, I'm simply
after: "What is the *fastest* way in .NET to search large on-disk text files
(100+ MB) for a given string."
Drebin wrote: Create a loader program/BCP/DTS job to split the records accordingly and load them into a table structure (say named "customer").. then you could do something like:
select customerid from customer where customer_lastname = 'smith'
Please explain how to do this w/ unstructured text?
"The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted."
assuming you had customer data to load. This is pretty standard stuff - did you mean this in a different way? SQL is already a really, really powerful tool for loading, sorting and searching for data. So for someone to want to, and think they could do a better job - is ambitious to say the least. SQL concepts (like indexing and searching) are time-tested, I can't picture challenging it and thinking I could do a better job!!
"Phill" <wa********@yahoo.com> wrote in message news:ac**************************@posting.google.c om... How would you load a 100MB txt file into a DB and then search it for a word? How would that work?
James Curran wrote: Julie wrote: Like I indicated in another follow-up: "These are loosely formatted datafiles from external laboratory instruments." The requirement is to work w/ these files, not change the file format.
Well, admittedly I don't know the specifics of the contract, but it seems that you are taking that a bit too literally. As far as I can see, you are required to: A) Read in the 100 MB flat text file, and B) Spit out results found within that file.
Exactly how you get from A) to B) is strictly your concern, and if you want to implement it by loading it into a SQL database or otherwise indexing it, no one else needs even know about it.
Nope, sorry, I'm not taking it too literally, but thanks for implying that I'm
incapable of understanding requirements.
Assume the input/output requirements to be the same as grep.
Julie,
Please don't take offense, I have no desire to get into a pissing contest -
I am just flabbergasted on your viewpoint. I can honestly say I've never met
anyone that thinks like this. So after this, I'll back off :-)
First, to talk to what you said before about "getting solid requirements
first" - the problem with getting requirements isn't that people are
incompetent, it's very much human nature not to know what you want until you
see some of your project come to life. Using the analogy of building a house,
you may not realize until after the 2nd floor is in place - that you have a
really cool view, and it would've been nice to build a balcony. So - I think
that people many times are UNABLE to make solid choices on requirements
because you just can't see that far ahead.
So I believe human nature makes it quite impossible to make 100% accurate
requirements. It's just not possible (for most large projects).
As to your point below, computers are based on structure. If you want to
KEEP your data unstructured, you are going to be fighting the computer the
whole way. So if you are asking me, I would be spending all of my time right
now, getting this unstructured data - into some sort of structure. If your
source for this data gives it to you unstructured, then you need to get
yourself a new data source or build a converter of some sort. You alluded to
this being data from an electronic device of some sort. It seems to me, if
it just gave you an array of numbers and values that were random - this
would be completely useless.
So - bottom line, if you have to deal with data that doesn't have structure,
you should first address that. You are going to spend 3x as much time trying
to do any little thing with this data. Versus - if you just get this
straightened out in the beginning. If you do this, then you can leverage a
TON of technology and products (like SQL server) rather than writing a
custom-version of them for this specific problem. "Write-once-use-once"
software is soooo "mid-90's".. "Write-once-use-many-times" is what things
have evolved to.
You seem pretty engrained in this mindset of yours, so I'm not trying to
convince you of anything - just giving you the computer science take on what
you are trying to do. Anyhow, good luck with this!! :-)
Oh that's simple then. You point your browser at http://www.thecodeproject.com/csharp...asp#xx825897xx and contact
Jean-Michel Bezeau for a copy of his Grep stand-alone class. Then you
implement and see if it does the job in 10 seconds or less.
Too easy.
You're welcome,
Michael C., MCDBA
"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com... Michael C wrote: You got some serious specifications, but haven't given enough
information to really help you find a solution. I wasn't asking for a solution, I was asking:
"What is the *fastest* way in .NET to search large on-disk text files (100+ MB) for a given string."
So I suggested using a language that was designed specifically to perform text processing.
I'm not doing text processing, I'm searching a text file for a given
string -- that is all.
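Julie's found/not-found requirement with a short pattern can be met with the plainest .NET approach: stream the file and call IndexOf per line. A minimal sketch (names are illustrative; it assumes the search string contains no newline and the file is in an encoding StreamReader can decode, e.g. ASCII/UTF-8):

```
using System;
using System.IO;

static class NaiveSearch
{
    // Returns true on the first hit; we never hold more than one line in memory.
    public static bool Contains(string path, string pattern)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.IndexOf(pattern) >= 0)
                    return true;    // 0-or-1 match is all the requirement asks for
            }
        }
        return false;
    }
}
```

Whether this fits in the 10-second budget is dominated by disk throughput, not the string comparison, so it is worth timing before reaching for anything cleverer.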
The real problem is to find the fastest way to read the data into memory.
Reading a 100 MB file will take something like 5 - 10 seconds depending on
the IO subsystem used (RAID 0, single 7200 RPM drive) and assuming the data
is not cached.
Depending on the results of the file data load time, you can decide to use a
naive algorithm or opt for a faster algorithm like the Boyer-Moore
algorithm.
Consider that:
- the Boyer-Moore algorithm (with a pattern length of 10) is about 4 - 6
times faster than a naive algorithm,
- a naive algorithm should be able to search a 10 char pattern in less than
one sec on a decent system (P4 - 2.8 GHz).
So it's really important to know exactly how much time will be spent to
bring the data in memory before you decide upon the searching algorithm.
Willy.
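Willy doesn't post code, but a Boyer-Moore-Horspool scan run over the file in fixed-size chunks might look like the sketch below. This is an illustration, not a tested implementation: it assumes a single-byte encoding (e.g. ASCII) for the pattern, and it overlaps chunks by pattern.Length - 1 bytes so a match straddling a buffer boundary isn't missed.

```
using System;
using System.IO;
using System.Text;

static class HorspoolSearch
{
    // Returns the byte offset of the first match, or -1 if not found.
    public static long IndexOf(string path, string patternText)
    {
        byte[] pattern = Encoding.ASCII.GetBytes(patternText);
        int m = pattern.Length;

        // Bad-character shift table: how far we may skip on a mismatch.
        int[] shift = new int[256];
        for (int i = 0; i < 256; i++) shift[i] = m;
        for (int i = 0; i < m - 1; i++) shift[pattern[i]] = m - 1 - i;

        byte[] buffer = new byte[1 << 20];   // scan the file 1 MB at a time
        using (FileStream fs = new FileStream(path, FileMode.Open,
                   FileAccess.Read, FileShare.Read, 1 << 16))
        {
            long basePos = 0;   // absolute file offset of buffer[0]
            int carried = 0;    // tail bytes kept from the previous chunk
            int read;
            while ((read = fs.Read(buffer, carried, buffer.Length - carried)) > 0)
            {
                int len = carried + read;
                int i = m - 1;
                while (i < len)              // Horspool scan of this chunk
                {
                    int j = 0;
                    while (j < m && pattern[m - 1 - j] == buffer[i - j]) j++;
                    if (j == m) return basePos + i - m + 1;
                    i += shift[buffer[i]];   // skip by the table entry
                }
                // Keep the last m-1 bytes so boundary-spanning matches survive.
                carried = Math.Min(m - 1, len);
                Array.Copy(buffer, len - carried, buffer, 0, carried);
                basePos += len - carried;
            }
        }
        return -1;
    }
}
```

The shift table is what gives Horspool its speedup over a naive scan: on a mismatch the window jumps by up to the full pattern length instead of one byte, which is where Willy's 4-6x figure for a 10-character pattern comes from.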
Simple. Since you require a simple Grep-like function, you find one that's
already made, and modify it to your taste as opposed to re-inventing the
wheel from scratch. If the Code Project link I already gave you doesn't fit
your needs, I would recommend Googling "C#.NET Grep". But I'm sure you've
already done that.
Your welcome,
Michael C., MCDBA
"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com... Michael C wrote: Exactly my point. The only way I can think of to search 100 MB files quickly would require a separate index, preferably kept in a separate
file so that it didn't have to be re-created from scratch every darn time you
run the program. That being said, that's exactly what SQL does; any type of separate indexing method would basically be re-inventing the wheel. And heck, they're giving away MSDE for free, so you don't even have to buy
SQL Server. Any solution short of using a separate index of some sort is
going to be very non-scalable and comparatively slow. But alack and alas, to
each her own... As I indicated in my original post, indexing isn't an option (and this therefore excludes any database access).
Apparently you aren't aware of any methods in .NET to accomplish what I
need -- thanks for your input.
Thanks, Michael C., MCDBA
Julie wrote: Assume the input/output requirements to be the same as grep.
And it's *completely acceptable* for GREP to be implemented using a SQL
server behind the scenes. Granted that probably wouldn't be a very good
implementation, but as long as the *inputs* and *outputs* are as
expected, the implementation is irrelevant.
--
Truth,
James Curran [MVP] www.NJTheater.com (Professional) www.NovelTheory.com (Personal)
Drebin, I have no desire to get into a pissing contest -
But that is exactly what you are doing. You are trying to prove you are
right and she is wrong. Just accept that she knows what her requirements are
and that (rightly or wrongly) a database is not suitable. Why does it matter
so much to you if she is wrong?
Her original question was quite clear and explicit, even to the point of
stating the file could not be indexed or sorted! Unfortunately sometimes in
life you have to deal with what you are given and you can't always demand
how information is provided.
Just answer the question as asked and don't start criticising people just
because you believe their approach is wrong. By all means offer an
alternative solution but don't take offence when somebody says it is not
suitable. And don't be so arrogant as to assume you know their requirements
or circumstances better than they do.
Regards
John
p.s. By all means start ranting and raving at me as well but do so knowing I
won't be reading any more of your replies. You have a major attitude problem.
I suggest you refrain from contributing until you can control it.
"James Curran" <Ja*********@mvps.org> wrote in message
news:u3**************@TK2MSFTNGP09.phx.gbl...
It would be ok if the grep *ENGINE* could accept any input; a grep
implementation that uses a database is valid, but close to useless
considering grep is built strictly to search files. What good is a program
that imports all of my files into a database and then tosses the database
out afterwards? Would you *WANT* that? Just because some of you think it's a
really great idea because it means you don't have to do any work?
For what it's worth, this heavy push for a database is probably foolish. If
the search only occurs a couple of times, there is a pretty good chance that
the time spent loading the database and the extra space used up will be a
huge waste.
An index makes sense, but only if a lot of searches will occur over the same
files. For one-off searches I would imagine creating the index would take
more time and a lot more space than just searching the file.
--
Truth,
James Curran [MVP] www.NJTheater.com (Professional) www.NovelTheory.com (Personal)
Michael C wrote: Oh that's simple then. You point your browser at http://www.thecodeproject.com/csharp...asp#xx825897xx and contact Jean-Michel Bezeau for a copy of his Grep stand-alone class. Then you implement and see if it does the job in 10 seconds or less.
Excellent -- thanks!
"Willy Denoyette [MVP]" wrote: So it's really important to know exactly how much time will be spent to bring the data in memory before you decide upon the searching algorithm.
Yes, thanks for the comments on B-M searching.
As for loading the file into memory, I specifically do *not* want to do that.
Win32 has offered memory-mapped files for quite some time -- exactly what I'm
after in the .Net world.
Drebin wrote: Julie,
Please don't take offense, I have no desire to get into a pissing contest - I am just flabbergasted on your viewpoint. I can honestly say I've never met anyone that thinks like this. So after this, I'll back off :-)
I'll gladly take that as compliment.
First, to talk to what you said before about "getting solid requirements first" - the problem with getting requirements isn't that people are incompetent, it's very much human nature not to know what you want until you see some of your project come to life. Using the analogy of building a house, you may not realize until after the 2nd floor is in place - that you have a really cool view, and it would've been nice to build a balcony. So - I think that people many times are UNABLE to make solid choices on requirements because you just can't see that far ahead.
So I believe human nature makes it quite impossible to make 100% accurate requirements. It's just not possible (for most large projects).
Well, you may want to consider that it is possible -- this is actually a port
of an existing C++ application to C#. Requirements were
defined/refined/reevaluated years ago. My task is simple, as I originally
posted. Please don't assume that you know more about my requirements than I
do.
As to your point below, computers are based on structure. If you want to KEEP your data unstructured, you are going to be fighting the computer the whole way. So if you are asking me, I would be spending all of my time right now, getting this unstructured data - into some sort of structure. If your source for this data gives it to you unstructured, then you need to get yourself a new data source or build a converter of some sort. You alluded to this being data from an electronic device of some sort. It seems to me, if it just gave you an array of numbers and values that were random - this would be completely useless.
Right -- The source of the data is a 3rd party $100,000 laboratory instrument
(mass spectrometer). I'll just contact the manufacturer and tell them that
they are doing everything wrong and to output their data in some other format.
If they have any questions as to why, I'll refer them to you.
So - bottom line, if you have to deal with data that doesn't have structure, you should first address that. You are going to spend 3x as much time trying to do any little thing with this data. Versus - if you just get this straightened out in the beginning. If you do this, then you can leverage a TON of technology and products (like SQL server) rather than writing a custom-version of them for this specific problem. "Write-once-use-once" software is soooo "mid-90's".. "Write-once-use-many-times" is what things have evolved to.
So, apparently, you implement more than was asked for. Interesting. I prefer
to implement what was asked for. You seem pretty engrained in this mindset of yours, so I'm not trying to convince you of anything - just giving you the computer science take on what you are trying to do. Anyhow, good luck with this!! :-)
Julie <ju***@nospam.com> wrote: Yes, thanks for the comments on B-M searching.
As for loading the file into memory, I specifically do *not* want to do that. Win32 has offered memory-mapped files for quite some time -- exactly what I'm after in the .Net world.
Willy wasn't suggesting (as I read it) loading the whole file into
memory in one go - but you need to accept that if you're going to
search through every byte of the original data, all of it will need to
be loaded from disk into memory at that stage. It would be well worth
finding out how long it takes *just* for the loading part on the target
system (taking the cache into account) before looking at the searching
part, IMO.
--
Jon Skeet - <sk***@pobox.com> http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
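One way to measure just the loading part Jon describes: read the file sequentially in large blocks and discard the data. A rough sketch (file name taken from the command line; run it twice to see the difference the OS file cache makes, as Jon notes):

```
using System;
using System.IO;

static class ReadTimer
{
    public static void Main(string[] args)
    {
        byte[] buffer = new byte[1 << 20];   // 1 MB per read
        DateTime start = DateTime.UtcNow;
        long total = 0;
        using (FileStream fs = new FileStream(args[0], FileMode.Open,
                   FileAccess.Read, FileShare.Read, 1 << 16))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                total += read;   // touch nothing else: pure I/O cost
        }
        TimeSpan elapsed = DateTime.UtcNow - start;
        Console.WriteLine("{0} bytes in {1:F1} s ({2:F1} MB/s)",
            total, elapsed.TotalSeconds,
            total / (1024.0 * 1024.0) / elapsed.TotalSeconds);
    }
}
```

If this alone eats most of the 10-second budget on the target hardware, no choice of search algorithm will rescue it; if it finishes in a second or two, even a naive scan will comfortably meet the requirement.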
"Julie" <ju***@nospam.com> wrote in message
news:41***************@nospam.com... As for loading the file into memory, I specifically do *not* want to do that. Win32 has offered memory-mapped files for quite some time -- exactly what I'm after in the .Net world.
Julie,
Like Jon said, I wasn't suggesting loading the whole file in memory at once,
but as you need to search the whole file you will have to transfer all file
data to memory at some point in time.
Also, using memory mapped files doesn't make sense because:
- The total IO time will be the same as you would read the data directly
into your process space.
Simply because you have to create a "file mapping object" with a size equal
to the file size, then you can create individual file views to do the
search, but as you need to search the whole file you effectively load the
whole file in (virtual) memory.
- You don't share the file mapping object with other processes, which is one
main reason to use mapped files.
Willy.
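For the record, the managed wrapper Julie is after didn't exist yet when this thread was written; .NET 4.0 later added System.IO.MemoryMappedFiles. A sketch of a naive scan over a mapped view, which illustrates Willy's point that every byte still has to be paged in from disk:

```
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedSearch
{
    // Returns the byte offset of the first match, or -1. Requires .NET 4.0+.
    public static long IndexOf(string path, byte[] pattern)
    {
        long length = new FileInfo(path).Length;
        using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (MemoryMappedViewAccessor view = mmf.CreateViewAccessor(
                   0, length, MemoryMappedFileAccess.Read))
        {
            // The OS pages the file in on demand, so the whole file is
            // still read from disk exactly as Willy describes.
            for (long i = 0; i + pattern.Length <= length; i++)
            {
                int j = 0;
                while (j < pattern.Length && view.ReadByte(i + j) == pattern[j]) j++;
                if (j == pattern.Length) return i;
            }
        }
        return -1;
    }
}
```

ReadByte per position keeps the sketch safe but is slow; real code would copy a block at a time (or use an unsafe pointer to the view) and apply a proper search such as Boyer-Moore over it. The mapping saves a managed buffer copy, not the disk I/O.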
Daniel O'Connell [C# MVP] wrote: For what it's worth, this heavy push for a database is probably foolish.
But I'm not pushing heavily for a database. I'm pushing heavily in
favor of looking beyond a very naive reading of the requirements, and
putting some knowledge of the domain space behind the problem.
For example, I have trouble believing that the input data here
("datafiles from external laboratory instruments") are completely free-form.
There's probably recognizable rows & columns. Further, I don't believe
searching would generally be limited to one search per file, so reading,
indexing, doing several indexed searches, and then expiring the index would
probably be faster than doing several simple text searches.
--
Truth,
James Curran [MVP] www.NJTheater.com (Professional) www.NovelTheory.com (Personal)
"James Curran" <Ja*********@mvps.org> wrote in message
news:%2****************@TK2MSFTNGP15.phx.gbl...
Perhaps it has rows and columns, or perhaps it's a flat file that simply has
periodic data; perhaps the data is closer to an XML file than it is a
database, having tagged segments that may exist in any order. Perhaps the
file isn't constant; perhaps the equipment or network setup changes the
file commonly. Rows & columns are only one possible representation and
static content is only one possibility.
Also, we have no way of knowing how quickly this data is flowing in. What if
this particular piece of equipment is generating hundreds of gigs a week?
Does a database still make sense? Does an index? Or does each file get
processed once and sent along to permanent storage? Is this query only run
against data sometimes, when something else occurs, or is it run dozens of
times an hour? Are the searches automated or are they something an
individual user does when they have reason to? Does the search utility have
to operate over millions of potential files or just one?
Without all of this information, what right do you have to call the OP's
choice naive? As anything we state is probably far less informed than
theirs, I can't really believe you would consider your own stance to be
the less naive of the bunch.
Whatever our personal experiences may be, that doesn't mean that our
experiences with any particular thing are the only ones possible.
Your suggestion may well be the absolute wrong thing to do. Or it may be the
right one; however, I just don't think you're in any better position than I
am to gauge that. One would hope the person who wrote the requirements *was*,
however.
James Curran wrote: Daniel O'Connell [C# MVP] wrote:
For what its worth, this heavy push for a database is probably foolish. But I'm not pushing heavily for a database. I'm pushing heavily in favor of looking beyond a very naive reading of the requirements, and putting some knowledge of the domain space behind the problem.
Thanks for the insult of calling me naive.
I really find it hard to believe that you think that I can't understand what my
requirements are.
For example, I have trouble believing that the input data here ("datafiles from external laboratory instruments") are completely free-form. There's probably recognizable rows & columns. Further, I don't believe searching would generally be limited to one search per file, so reading, indexing, do several indexed searches, and then expiring the indexing would probably be faster than doing several simple text searches.
I still don't see where I'm asking for anything faster than 1 hit in a 100 MB
file in 10 seconds or less. That is all that the performance requirement
dictates. Anything more is completely wasted effort.
I appreciate your comments and suggestions, but please, don't become so focused
on what you think is necessary that you completely ignore what I know to be the
requirements.
Thanks
Julie,
Just one other random thought, if I may. I guess the reason why you are
finding so much resistance is that pretty much everyone - except for you -
finds "solutions" in their jobs and remains open-minded in their work. Very
rarely are we asked to blindly "fill these requirements". Our industry is
such that, most companies simply can't afford to work that inefficiently.
Instead, most companies give a developer a "problem" and we are to use our
expertise to find the most efficient solution. "Efficient" means not only
the fastest to implement, but also the most scalable and easiest to
change.
In other words, when we hear such unreasonable requirements as you've
vaguely defined, everyone's first reaction is to get a handle on those,
because they don't sound reasonable. When someone comes to me with
outrageous requirements, I don't even BOTHER trying to answer the original
question, because 99% of the time, they don't have a handle on the problem.
But I'm gathering from your reaction and style that you likely work in the
gov't or aerospace - and I know that is a completely different mindset
there.
Anyhow - for future reference, it would probably be a big time-saver if you
were to actually share (in some level of detail) what your requirements are.
What IS the format of this data? Because without that, you come off as an
inexperienced and stubborn developer that is not pushing back on
unreasonable requirements when you CLEARLY should be (in our minds). So, if
you want people to stop reacting to your requirements, it might be best to
actually go into some detail about why they are so immutable so people can
get past it - and start helping with your actual problem.
For whatever it's worth..
Drebin wrote: Julie,
Just one other random thought, if I may. I guess the reason why you are finding so much resistance is that pretty much everyone - except for you - finds "solutions" in their jobs and remains open-minded in their work. Very rarely are we asked to blindly "fill these requirements". Our industry is such that, most companies simply can't afford to work that inefficiently. Instead, most companies give a developer a "problem" and we are to use our expertise to find the most efficient solution. "Efficient" means not only the fastest to implement, but also the most scalable and easiest to change.
Your points would be valid if you were talking to someone that didn't know
what they were doing, or to someone who had asked for comments on process.
Neither of those applies to me.
I'm *very* open minded, I examine all of the potential solutions that I can,
and make informed decisions. I did that work before posing the original
question. I challenge you to be a little more open minded and realize that
simple-text searching can be (and IS! in this case) a valid solution to the
problem posed.
I work very closely w/ those that define the requirements, I know exactly what
they want, and they are informed as to the details, costs, issues, etc.
Honestly, had I gone your route and implemented this using a database, it would
have completely changed the disposition of the components that I'm working on,
the installation requirements, licensing requirements, base system
requirements, implementation time frame, complexity, maintainability, version
issues, etc., etc., etc. Absolutely none of that can be tolerated on the
project for a relatively simple part of the component that I'm working on.
In other words, when we hear such unreasonable requirements as you've vaguely defined, everyone's first reaction is to get a handle on those, because they don't sound reasonable.
Answer me one question: how can you determine that those requirements are
unreasonable?
When someone comes to me with outrageous requirements, I don't even BOTHER trying to answer the original question, because 99% of the time, they don't have a handle on the problem. But I'm gathering from your reaction and style that you likely work in the gov't or aerospace - and I know that is a completely different mindset there.
Oh wise sage, I work in neither of those disciplines.
Anyhow - for future reference, it would probably be a big time-saver if you were to actually share (in some level of detail) what your requirements are. What IS the format of this data? Because without that, you come off as an inexperienced and stubborn developer that is not pushing back on unreasonable requirements when you CLEARLY should be (in our minds). So, if you want people to stop reacting to your requirements, it might be best to actually go into some detail about why they are so immutable so people can get past it - and start helping with your actual problem.
How about this, for your future reference: try answering the question posed,
don't spend so much time trying to read more into it than exists. If you have
questions about the requirements, ask them _after_ you have answered the
original question. Otherwise, you come off as a know-it-all.
Finally, my requirements were well defined; you just don't happen to want to
believe them.
Julie wrote: What is the *fastest* way in .NET to search large on-disk text files (100+ MB) for a given string.
The files are unindexed and unsorted, and for the purposes of my immediate requirements, can't be indexed/sorted.
I don't want to load the entire file into physical memory, memory-mapped files are ok (and preferred). Speed/performance is a requirement -- the target is to locate the string in 10 seconds or less for a 100 MB file. The search string is typically 10 characters or less. Finally, I don't want to spawn out to an external executable (e.g. grep), but include the algorithm/method directly in the .NET code base. For the first rev, wildcard support is not a requirement.
Thanks to all those that replied.
I spent a little time researching some of the access and search methods
proposed, and the funny thing is that the most simple and straightforward
implementation actually turned out to be quite sufficient.
As indicated, my requirement was to search a 100 MB text file for a string in
10 seconds or less. My initial results (debug, unoptimized) are right around 5
seconds on the target system, presumably the release/optimized build will be a
bit faster.
Implementation is essentially opening a text stream (StreamReader) and reading
the contents line by line, looking for the search string. Total implementation
is about 10 lines of code.
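The implementation described reads, in essence, like the following sketch (the class and method names are illustrative, not Julie's actual code):

```csharp
using System;
using System.IO;

static class TextSearch
{
    // Returns the 1-based line number of the first line containing
    // needle, or -1 if no line does. Reading line by line means the
    // whole file is never held in memory at once.
    public static long FindFirst(TextReader reader, string needle)
    {
        long lineNumber = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            if (line.IndexOf(needle, StringComparison.Ordinal) >= 0)
                return lineNumber;
        }
        return -1;
    }
}
```

Wrapping a file is then just `using (var sr = new StreamReader(path)) { TextSearch.FindFirst(sr, "needle"); }` - StreamReader's internal buffering handles the sequential I/O efficiently, which is why this simple approach meets the 10-second target.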
Julie,
Purely out of interest - how are you checking if the string doesn't exist
over two lines?
Regards
John Timney
Microsoft Regional Director
Microsoft MVP
Did you try to flush the File System cache first?
I'm pretty sure the file was (partly) cached in the FS cache when you did
your test.
Willy.
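One way to get a cold-cache timing without rebooting between runs is to bypass the cache for the benchmark read itself. This is a sketch of a known Win32 trick, not code from the thread: FILE_FLAG_NO_BUFFERING (0x20000000) is not a named FileOptions member, the cast is undocumented behavior that happens to work on NT-family Windows, and reads through such a handle must use sector-aligned buffer sizes.

```csharp
using System;
using System.IO;

static class ColdCache
{
    // Raw Win32 FILE_FLAG_NO_BUFFERING; asks the OS to bypass the
    // file-system cache so timings reflect actual disk throughput.
    public const FileOptions NoBuffering = (FileOptions)0x20000000;

    public static FileStream OpenUncached(string path)
    {
        // 4096 is a multiple of the sector size on typical volumes,
        // satisfying the alignment requirement of unbuffered I/O.
        return new FileStream(path, FileMode.Open, FileAccess.Read,
                              FileShare.Read, 4096, NoBuffering);
    }
}
```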
"John Timney (Microsoft MVP)" wrote: Julie,
Purely out of interest - how are you checking if the string doesn't exist over two lines?
As a matter of definition of the file format, the search string cannot span
lines, so no extra processing is required.
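For file formats where the search string *can* cross a line break (not the case here, per Julie's answer), one option is a chunked scan that carries an overlap of needle.Length - 1 characters between chunks so a boundary-straddling match is still found. A hedged sketch, not code from the thread:

```csharp
using System;
using System.IO;

static class SpanningSearch
{
    // Scans a reader in fixed-size chunks, keeping the tail of each
    // chunk as overlap so a match that straddles a chunk boundary is
    // still detected.
    public static bool Contains(TextReader reader, string needle,
                                int chunkSize = 64 * 1024)
    {
        if (needle.Length == 0) return true;
        var buffer = new char[chunkSize + needle.Length - 1];
        int carried = 0;
        int read;
        while ((read = reader.Read(buffer, carried,
                                   buffer.Length - carried)) > 0)
        {
            int total = carried + read;
            if (new string(buffer, 0, total)
                    .IndexOf(needle, StringComparison.Ordinal) >= 0)
                return true;
            // Keep the last needle.Length - 1 characters so a match
            // split across the boundary survives into the next pass.
            carried = Math.Min(needle.Length - 1, total);
            Array.Copy(buffer, total - carried, buffer, 0, carried);
        }
        return false;
    }
}
```

The per-chunk string allocation keeps the sketch short; a production version would search the char buffer directly.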
"Willy Denoyette [MVP]" wrote: Did you try to flush the File System cache first? I'm pretty sure the file was (partly) cached in the FS cache when you did your test.
At first I thought the same thing, but I ran the test & timing on several
machines w/ similar results on the first run.
Do you know of a way to programmatically flush the cache from .NET? If so,
I'll try it again w/ a forced flush and see if it changes my results.