473,387 Members | 3,781 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,387 software developers and data experts.

Search a file and find all occurences

Hi All,

I have a process where I'd like to search the contents of a file(in a dir)
for all occurences (or the count of) of a given string. My goal is to focus
more on performance, as some of the files could be upwards of 25mb in size
and time is important. I don't want to take the route of loading the text of
the file into a giant string and searching it, but would rather focus on a
performance-minded solution. Any sugesstions for a performance - oriented
"find all occurences" type of search against a file?

Thanks in advance!
--
Dameon
Dec 13 '06 #1
4 3344

Dameon wrote:
Hi All,

I have a process where I'd like to search the contents of a file(in a dir)
for all occurences (or the count of) of a given string. My goal is to focus
more on performance, as some of the files could be upwards of 25mb in size
and time is important.
I don't want to take the route of loading the text of the file into a giant string and
searching it, but would rather focus on a performance-minded solution.
First of all you have to get clear in your mind what your requirements
are. First you state that you want a performant (for real time)
solution, then you state that you don't want to load the text of the
file into a string and search it. Well, if the file is up to 25Mb then
that wouldn't be feasible... nonetheless, are you concerned with time
or memory? Often you have to trade one to get the other. Where along
this scale are you:

1. Care about speed and memory consumption be damned.
2. Want speed without using unnecessarily large amounts of memory.
3. Want a compact footprint and you're willing to sacrifice some speed.
4. Want a compact footprint and speed be damned.

Obviously you're not near the bottom of the list, but are you willing
to pay for speed with memory? That may help limit possible solutions.
Any sugesstions for a performance - oriented "find all occurences" type of search against a file?
I don't have time to search for the code or the precise algorithm, but
I believe that what you want is a state-machine driven model. For a
search string you can build a state machine that indicates what to do
at each mismatch point. Then you can read the file as a sequence of
bytes and apply the state machine to the stream.

As I said, I don't have time to search for the code, but here is an
idea. Let's say you're searching for "abracadabra" in the stream
"abracabracadabra":

1. a matches a. Move along.
2. b matches b. Move along.
3. r matches r. Move along.
4. a matches a. Move along.
5. c matches c. Move along.
6. a matches a. Move along.
7. b does not match d. At this point, the state machine knows that you
don't have to back up all the way to the preceding "b". You can back up
to the 4th check, the "a" and start comparing again.
....and so on...

The idea is that you pre-process the search string into a state
machine. The states and transitions are unique to the search string and
allow you to avoid re-checking characters you don't have to. The
technique becomes more effective the longer the search string.

Can anyone point the OP to a link for this algorithm? It's an oldie...
I'm sure there are examples out there.

Dec 13 '06 #2

Bruce Wood wrote:
Dameon wrote:
Hi All,

I have a process where I'd like to search the contents of a file(in a dir)
for all occurences (or the count of) of a given string. My goal is to focus
more on performance, as some of the files could be upwards of 25mb in size
and time is important.

Can anyone point the OP to a link for this algorithm? It's an oldie...
I'm sure there are examples out there.
I found it. It's called the Boyer-Moore Algorithm, and I had it
backward... it tests from the end of the search string to the start.
It's the Knuth-Pratt-Morris Algorithm that searches in the way I
described.

You can just Google those names. I found this page with some examples
of the algorithms in action:

http://www.cs.utexas.edu/users/moore...ing/index.html

Dec 13 '06 #3
"Bruce Wood" <br*******@canada.comwrote:
>You can just Google those names. I found this page with some examples
of the algorithms in action:
http://www.cs.utexas.edu/users/moore...ing/index.html
Thank you! The Byers-Moore algorithm on that page is very clever, and
the web page is a very nice way to present it!

--
Lucian
Dec 13 '06 #4

Dameon wrote:
>
Any advice on how to implement Boyer-Moore in C# for the following scenario:

1. I have an automated process which needs to add a closing tag to a file
2. The close tag cannot be added until the file has X amount of occurences
of a specified string
3. The automated process will call methodX to check for X amount of
occurences in the specified file
4. When the correct # of occurences are found, the closing tag will be added.
The first question is, how many of the files are likely to need the
closing tag added?
1. Most of them
2. A few of them

This governs your solution. If the answer is "a few of them" then you
can tolerate an algorithm that first has to read the file to see if it
is a match, and then has to read it again to actually append the tag.
This is easier. If the answer is "most of them" then you have to search
and append the tag all in one operation, which is harder.
Unfortunately, there's no free lunch, either: if you choose the second,
harder algorithm, then its performance will suck if you have to append
to only a few files. If you choose the first, easier algorithm, then
its performance will suck if you have to append to most files.

In either case, in order to do the search, I would create an array of
characters in memory that is the same size as or larger than the search
string. I would then wrap it in a class that gives you "circular
buffer" semantics with indices that work backward (where index 0 is the
last character you just read, 1 is the one before that, 2 is the one
before that, etc):

public class CircularBuffer
{
private char[] _buffer;
private int _length;
private int _zeroIndex;

public CircularBuffer(int capacity)
{
this._buffer = new char[capacity];
this._length = 0;
this._zeroIndex = 0;
}

public void Add(char newChar)
{
this._zeroIndex -= 1;
if (this._zeroIndex < 0)
{
this._zeroIndex += this._buffer.Length;
}
this._buffer[this._zeroIndex] = newChar;
if (this._length < this._buffer.Length)
{
this._length += 1;
}
}

public char this[int index]
{
if (index < 0 || index >= this._length)
{
throw new IndexOutOfRangeException(...);
}
int trueIndex = this._zeroIndex + index;

// Yes, one could use a modulus operator to do this, but I
would expect
// a test followed by a subtraction to be slightly faster than
a modulus,
// which requires an integer division operation.
if (trueIndex 0)
{
trueIndex -= this._buffer.Length;
}
return this._buffer[trueIndex];
}
}

Basically what this gives you is a buffer that holds the last
"capacity" characters that you have read from a stream. As you read
each character you can "Add" it to the buffer, which makes one
character "fall off" the "start" of the array to make room for the new
character.

With this, it should be easy enough to program Boyer-Moore: you have a
rolling record of enough characters from the stream that you can move
far enough backward through the stream to make the required
comparisons.

Now the only problem is adding the closing tag.

Here the only question is whether you want to be writing an output file
as you are doing Boyer-Moore, and throw away the copied file if it
turns out that the file fails the test, or whether you want to do
Boyer-Moore first and then, if the file passes the test, read the file
again, copying it into a new file and appending the closing tag.

In the first case, you will waste time writing files you don't need if
the Boyer-Moore test eventually reveals that the file isn't a match. In
the second case, you have to read the file again if the test reveals
that the file is a match. So, if few files will be a match, then the
second algorithm (read to test / read again to copy) is obviously more
performant. On the other hand, if most files will be a match, then the
first algorithm (always write new file / delete if not needed) will be
more performant. This first algorithm is, by the way, slightly easier
to write, as you can write two completely separate bits of code, each
of which has one job to do. However, it really depends upon your mix of
files which is the best way to do it.

Dec 14 '06 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

10
by: Anand Pillai | last post by:
To search a word in a group of words, say a paragraph or a web page, would a string search or a regexp search be faster? The string search would of course be, if str.find(substr) != -1:...
0
by: todd | last post by:
here is a search tool SP I wrote. How many times have you wanted to search all your Stored procs or views (in a database) for a keyword but couldn't!? Well now you can! THis can makes life a...
4
by: Jane Doe | last post by:
Hi, I need to search and replace patterns in web pages, but I can't find a way even after reading the ad hoc chapter in New Rider's "Inside JavaScript". Here's what I want to do: function...
79
by: pinkfloydhomer | last post by:
I want to scan a file byte for byte for occurences of the the four byte pattern 0x00000100. I've tried with this: # start import sys numChars = 0 startCode = 0 count = 0
3
by: Liz Malcolm | last post by:
Hello and TIA for guidance. I am building a reusable search procedure (thanks go to Graham Thorpe for his example that set me on my way). Everything works up until the 2nd match is found, the...
6
by: DC | last post by:
Hi, I am programming a search catalogue with 200000 items (and growing). I am currently using the SQL Server 2000 fulltext engine for this task but it does not fit the requirements anymore. ...
8
by: Daneel | last post by:
Hello! I'm looking for an algorithm which finds all occurences of a bit sequence (e.g., "0001") in a file. This sequence can start at any bit in the file (it is not byte aligned). I have some...
18
by: Neehar | last post by:
Hello For one of the interviews I took recently, I was given an offline programming quiz. In 30 minutes I had to write code in C++ to counts the number of times each unique word appears in a...
3
by: SLauren | last post by:
Hi, I am trying to find the records based on a search string which can be anything and can match any of the varchar fields in the table. Right now i am getting those records by using the...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.