473,748 Members | 4,065 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Search a file and find all occurences

Hi All,

I have a process where I'd like to search the contents of a file(in a dir)
for all occurences (or the count of) of a given string. My goal is to focus
more on performance, as some of the files could be upwards of 25mb in size
and time is important. I don't want to take the route of loading the text of
the file into a giant string and searching it, but would rather focus on a
performance-minded solution. Any sugesstions for a performance - oriented
"find all occurences" type of search against a file?

Thanks in advance!
--
Dameon
Dec 13 '06 #1
4 3379

Dameon wrote:
Hi All,

I have a process where I'd like to search the contents of a file(in a dir)
for all occurences (or the count of) of a given string. My goal is to focus
more on performance, as some of the files could be upwards of 25mb in size
and time is important.
I don't want to take the route of loading the text of the file into a giant string and
searching it, but would rather focus on a performance-minded solution.
First of all you have to get clear in your mind what your requirements
are. First you state that you want a performant (for real time)
solution, then you state that you don't want to load the text of the
file into a string and search it. Well, if the file is up to 25Mb then
that wouldn't be feasible... nonetheless, are you concerned with time
or memory? Often you have to trade one to get the other. Where along
this scale are you:

1. Care about speed and memory consumption be damned.
2. Want speed without using unnecessarily large amounts of memory.
3. Want a compact footprint and you're willing to sacrifice some speed.
4. Want a compact footprint and speed be damned.

Obviously you're not near the bottom of the list, but are you willing
to pay for speed with memory? That may help limit possible solutions.
Any sugesstions for a performance - oriented "find all occurences" type of search against a file?
I don't have time to search for the code or the precise algorithm, but
I believe that what you want is a state-machine driven model. For a
search string you can build a state machine that indicates what to do
at each mismatch point. Then you can read the file as a sequence of
bytes and apply the state machine to the stream.

As I said, I don't have time to search for the code, but here is an
idea. Let's say you're searching for "abracadabr a" in the stream
"abracabracadab ra":

1. a matches a. Move along.
2. b matches b. Move along.
3. r matches r. Move along.
4. a matches a. Move along.
5. c matches c. Move along.
6. a matches a. Move along.
7. b does not match d. At this point, the state machine knows that you
don't have to back up all the way to the preceding "b". You can back up
to the 4th check, the "a" and start comparing again.
....and so on...

The idea is that you pre-process the search string into a state
machine. The states and transitions are unique to the search string and
allow you to avoid re-checking characters you don't have to. The
technique becomes more effective the longer the search string.

Can anyone point the OP to a link for this algorithm? It's an oldie...
I'm sure there are examples out there.

Dec 13 '06 #2

Bruce Wood wrote:
Dameon wrote:
Hi All,

I have a process where I'd like to search the contents of a file(in a dir)
for all occurences (or the count of) of a given string. My goal is to focus
more on performance, as some of the files could be upwards of 25mb in size
and time is important.

Can anyone point the OP to a link for this algorithm? It's an oldie...
I'm sure there are examples out there.
I found it. It's called the Boyer-Moore Algorithm, and I had it
backward... it tests from the end of the search string to the start.
It's the Knuth-Pratt-Morris Algorithm that searches in the way I
described.

You can just Google those names. I found this page with some examples
of the algorithms in action:

http://www.cs.utexas.edu/users/moore...ing/index.html

Dec 13 '06 #3
"Bruce Wood" <br*******@cana da.comwrote:
>You can just Google those names. I found this page with some examples
of the algorithms in action:
http://www.cs.utexas.edu/users/moore...ing/index.html
Thank you! The Byers-Moore algorithm on that page is very clever, and
the web page is a very nice way to present it!

--
Lucian
Dec 13 '06 #4

Dameon wrote:
>
Any advice on how to implement Boyer-Moore in C# for the following scenario:

1. I have an automated process which needs to add a closing tag to a file
2. The close tag cannot be added until the file has X amount of occurences
of a specified string
3. The automated process will call methodX to check for X amount of
occurences in the specified file
4. When the correct # of occurences are found, the closing tag will be added.
The first question is, how many of the files are likely to need the
closing tag added?
1. Most of them
2. A few of them

This governs your solution. If the answer is "a few of them" then you
can tolerate an algorithm that first has to read the file to see if it
is a match, and then has to read it again to actually append the tag.
This is easier. If the answer is "most of them" then you have to search
and append the tag all in one operation, which is harder.
Unfortunately, there's no free lunch, either: if you choose the second,
harder algorithm, then its performance will suck if you have to append
to only a few files. If you choose the first, easier algorithm, then
its performance will suck if you have to append to most files.

In either case, in order to do the search, I would create an array of
characters in memory that is the same size as or larger than the search
string. I would then wrap it in a class that gives you "circular
buffer" semantics with indices that work backward (where index 0 is the
last character you just read, 1 is the one before that, 2 is the one
before that, etc):

public class CircularBuffer
{
private char[] _buffer;
private int _length;
private int _zeroIndex;

public CircularBuffer( int capacity)
{
this._buffer = new char[capacity];
this._length = 0;
this._zeroIndex = 0;
}

public void Add(char newChar)
{
this._zeroIndex -= 1;
if (this._zeroInde x < 0)
{
this._zeroIndex += this._buffer.Le ngth;
}
this._buffer[this._zeroIndex] = newChar;
if (this._length < this._buffer.Le ngth)
{
this._length += 1;
}
}

public char this[int index]
{
if (index < 0 || index >= this._length)
{
throw new IndexOutOfRange Exception(...);
}
int trueIndex = this._zeroIndex + index;

// Yes, one could use a modulus operator to do this, but I
would expect
// a test followed by a subtraction to be slightly faster than
a modulus,
// which requires an integer division operation.
if (trueIndex 0)
{
trueIndex -= this._buffer.Le ngth;
}
return this._buffer[trueIndex];
}
}

Basically what this gives you is a buffer that holds the last
"capacity" characters that you have read from a stream. As you read
each character you can "Add" it to the buffer, which makes one
character "fall off" the "start" of the array to make room for the new
character.

With this, it should be easy enough to program Boyer-Moore: you have a
rolling record of enough characters from the stream that you can move
far enough backward through the stream to make the required
comparisons.

Now the only problem is adding the closing tag.

Here the only question is whether you want to be writing an output file
as you are doing Boyer-Moore, and throw away the copied file if it
turns out that the file fails the test, or whether you want to do
Boyer-Moore first and then, if the file passes the test, read the file
again, copying it into a new file and appending the closing tag.

In the first case, you will waste time writing files you don't need if
the Boyer-Moore test eventually reveals that the file isn't a match. In
the second case, you have to read the file again if the test reveals
that the file is a match. So, if few files will be a match, then the
second algorithm (read to test / read again to copy) is obviously more
performant. On the other hand, if most files will be a match, then the
first algorithm (always write new file / delete if not needed) will be
more performant. This first algorithm is, by the way, slightly easier
to write, as you can write two completely separate bits of code, each
of which has one job to do. However, it really depends upon your mix of
files which is the best way to do it.

Dec 14 '06 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

10
39352
by: Anand Pillai | last post by:
To search a word in a group of words, say a paragraph or a web page, would a string search or a regexp search be faster? The string search would of course be, if str.find(substr) != -1: domything() And the regexp search assuming no case restriction would be,
0
3663
by: todd | last post by:
here is a search tool SP I wrote. How many times have you wanted to search all your Stored procs or views (in a database) for a keyword but couldn't!? Well now you can! THis can makes life a lot easier for developers. Would you email me and let me know which part of the world my code is running in?
4
10200
by: Jane Doe | last post by:
Hi, I need to search and replace patterns in web pages, but I can't find a way even after reading the ad hoc chapter in New Rider's "Inside JavaScript". Here's what I want to do: function filter() { var items = new Array("John", "Jane");
79
5269
by: pinkfloydhomer | last post by:
I want to scan a file byte for byte for occurences of the the four byte pattern 0x00000100. I've tried with this: # start import sys numChars = 0 startCode = 0 count = 0
3
2224
by: Liz Malcolm | last post by:
Hello and TIA for guidance. I am building a reusable search procedure (thanks go to Graham Thorpe for his example that set me on my way). Everything works up until the 2nd match is found, the command doesn't go to the third match, it exits out of the loop. I've read everything I could find in NG's, but I can't figure out where I'm going wrong. Here is my code. Function cmdSearch(ctlSearchText As Control, ctlFoundText As Control,...
6
3332
by: DC | last post by:
Hi, I am programming a search catalogue with 200000 items (and growing). I am currently using the SQL Server 2000 fulltext engine for this task but it does not fit the requirements anymore. The products typically do have a verbose name, "canadian superapples: red tasty juicy macintosh apple from toronto" and the like. If a customer is looking for "canadian apple" this product needs to match, but also if he is looking for "juicy mac".
8
5654
by: Daneel | last post by:
Hello! I'm looking for an algorithm which finds all occurences of a bit sequence (e.g., "0001") in a file. This sequence can start at any bit in the file (it is not byte aligned). I have some ideas of how to approach the problem (1) reading file into unsigned char buffer, 2) defining bit structure, 3) comparing the first 4 bits of the buffer with the bit structure, 4) shifting the char buffer one left, 5) repeat at step 3)) but I'm...
18
3871
by: Neehar | last post by:
Hello For one of the interviews I took recently, I was given an offline programming quiz. In 30 minutes I had to write code in C++ to counts the number of times each unique word appears in a given file. I tried my level best even after the quiz to come up with a solution but cudnt find an efficient one. :( This is what I did.
3
2475
by: SLauren | last post by:
Hi, I am trying to find the records based on a search string which can be anything and can match any of the varchar fields in the table. Right now i am getting those records by using the combinations of "LIKE" and "OR" operators.But as per the requirements i also need to show the number of occurences of the search string that could match different varchar fields in a record?? Can anyone please suggest me how can i prepare this query?? ...
0
8987
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
9366
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
9316
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9241
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
8239
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
6793
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6073
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
1
3303
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
3
2211
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.