Bytes | Software Development & Data Engineering Community
optimizing file i/o

Hello,

I have written a small app to parse web log files and extract certain
lines to another file. There is also functionality to count all the
items that are being filtered out.

I wrote this in C# instead of in Perl because the log files are 3-4GB
and I want faster processing than Perl would typically provide. And,
I'm learning C#.

There are two issues I would like to address: improving the speed of the
file I/O and controlling the processing load. Right now, this app takes about
20 minutes to process a 3GB file on a laptop with a 2GHz processor and 2GB
RAM. Processing means running a method that both filters and counts. Also,
it pegs my CPU while it's running.

Below are the filtering and filtering/counting methods.

Thanks.

mp

using System;
using System.Collections;
using System.IO;
using System.Text.RegularExpressions;

public class parseLines
{
    static string fileIdentifiers =
        @"\.gif\s|\.js\s|\.png\s|\.css\s|\.jpg\s";
    static Regex reAll = new Regex(fileIdentifiers);
    string fileName;

    public parseLines(string fileName)
    {
        this.fileName = fileName;
    }

    public void getLines()
    {
        // print nonmatching lines to stdout (body omitted)
    }

    public Hashtable countMatches()
    {
        // count individual matches (body omitted)
        return null;
    }

    public void filterLines()
    {
        string newFileName = fileName + ".modified.log";

        StreamReader sr = new StreamReader(fileName);
        StreamWriter wr = new StreamWriter(newFileName);
        string nextLine = sr.ReadLine();
        while (nextLine != null)
        {
            Match myMatch = reAll.Match(nextLine);
            if (!myMatch.Success)
            {
                wr.WriteLine(nextLine);
            }
            nextLine = sr.ReadLine();
        }
        sr.Close();
        wr.Close();
    }

    public Hashtable filterAndCountLines()
    {
        string newFileName = fileName + ".modified.log";
        Hashtable ht = new Hashtable();
        char[] sep = { '|' };
        string[] newTypeArray = fileIdentifiers.Split(sep);
        Regex[] newMatchArray = new Regex[newTypeArray.Length];

        for (int i = 0; i < newTypeArray.Length; i++)
        {
            newMatchArray[i] = new Regex(newTypeArray[i]);
        }
        foreach (string item in newTypeArray)
        {
            ht.Add(item, 0);
        }
        ht.Add("total Match", 0);
        ht.Add("total No Match", 0);

        StreamReader sr = new StreamReader(fileName);
        StreamWriter wr = new StreamWriter(newFileName);

        string nextLine = sr.ReadLine();
        while (nextLine != null)
        {
            Match myMatch = reAll.Match(nextLine);
            if (!myMatch.Success)
            {
                wr.WriteLine(nextLine);
                ht["total No Match"] = (int)ht["total No Match"] + 1;
            }
            else
            {
                foreach (Regex itemRegex in newMatchArray)
                {
                    Match arrMatch = itemRegex.Match(nextLine);
                    if (arrMatch.Success)
                    {
                        ht[itemRegex.ToString()] =
                            (int)ht[itemRegex.ToString()] + 1;
                        break;
                    }
                }
                ht["total Match"] = (int)ht["total Match"] + 1;
            }
            nextLine = sr.ReadLine();
        }
        sr.Close();
        wr.Close();
        return ht;
    }
}

class MainClass
{
    public static void Main(string[] args)
    {
        parseLines pl = new parseLines(args[0]);
        Hashtable count = pl.filterAndCountLines();

        IDictionaryEnumerator countEnumerator = count.GetEnumerator();
        while (countEnumerator.MoveNext())
        {
            Console.WriteLine(countEnumerator.Key.ToString() + " : " +
                countEnumerator.Value.ToString());
        }
        Console.WriteLine("finished");
    }
}

--
Michael Powe mi*****@trollope.org Waterbury CT
ENOSIG: signature file is empty
Nov 17 '05 #1
Michael Powe <mi***********@trollope.org> wrote:
I have written a small app to parse web log files and extract certain
lines to another file. There is also functionality to count all the
items that are being filtered out.

I wrote this in c# instead of in perl because the log files are 3-4GB
and I want faster processing than perl would typically provide. And,
I'm learning c#.

There are two issues I would like to address: improve the speed of the
file i/o and control the processing. Right now, this app takes about 20
min to process a 3GB file on a laptop with a 2Ghz proc and 2GB RAM.
Processing is implementing a method that both filters and counts. Also,
it pegs my CPU while it's running.

Below are the filtering and filtering/counting methods.


If the CPU is pegged (which I can understand, given the code), then the
I/O speed isn't the problem.

Some suggestions:

1) Don't create the regular expressions freshly each time. I don't know
whether you've got a lot of small files or just a few big ones, but it
would make more sense to create them once, as you don't need to change
them.

2) Use the option to compile the regular expressions when you create
them. This could improve things enormously.

3) Rather than using a hashtable, consider having an array of ints
along with your array of regular expressions. You could then iterate
through the regular expression array by index rather than by value, and
just increment the relevant int - no hashtable lookup, no unboxing and
then reboxing.

4) If most lines in the file will match one of the filters, try getting
rid of the "all" regular expression, working out the result just by
running all the others. It may not help, but it's worth a try.

Finally, use using statements for your stream readers and writers -
that way, if an exception is thrown, you'll still close the file
immediately.
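
Putting those suggestions together, a rough sketch might look like the
following. (Untested; the class and method names are mine, and the filters
are copied from your post. The per-line decision is pulled out into a
Classify method so it's easy to check on its own.)

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

class FilterSketch
{
    // Suggestions 1 and 2: build each Regex exactly once, compiled to IL.
    static readonly string[] patterns =
        { @"\.gif\s", @"\.js\s", @"\.png\s", @"\.css\s", @"\.jpg\s" };
    static readonly Regex[] regexes = BuildRegexes();

    static Regex[] BuildRegexes()
    {
        Regex[] r = new Regex[patterns.Length];
        for (int i = 0; i < patterns.Length; i++)
            r[i] = new Regex(patterns[i], RegexOptions.Compiled);
        return r;
    }

    // Suggestion 4: no "all" regex - just try each filter in turn.
    // Returns the index of the first matching filter, or -1 for no match.
    public static int Classify(string line)
    {
        for (int i = 0; i < regexes.Length; i++)
            if (regexes[i].IsMatch(line))
                return i;
        return -1;
    }

    static void Main(string[] args)
    {
        // Suggestion 3: a parallel int[] instead of a Hashtable -
        // no lookup, no unboxing/reboxing, just an indexed increment.
        int[] counts = new int[patterns.Length];
        int totalMatch = 0, totalNoMatch = 0;

        // using statements close both files even if an exception is thrown.
        using (StreamReader sr = new StreamReader(args[0]))
        using (StreamWriter wr = new StreamWriter(args[0] + ".modified.log"))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                int i = Classify(line);
                if (i >= 0)
                {
                    counts[i]++;
                    totalMatch++;
                }
                else
                {
                    wr.WriteLine(line);
                    totalNoMatch++;
                }
            }
        }

        for (int i = 0; i < patterns.Length; i++)
            Console.WriteLine(patterns[i] + " : " + counts[i]);
        Console.WriteLine("total Match : " + totalMatch);
        Console.WriteLine("total No Match : " + totalNoMatch);
    }
}
```

The Hashtable report can be reconstructed at the end from the two parallel
arrays if you need it; the point is to keep boxing out of the per-line loop.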

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #2
>>>>> "Jon" == Jon Skeet [C# MVP] <Jon> writes:

Jon> Michael Powe <mi***********@trollope.org> wrote:
I have written a small app to parse web log files and extract
certain lines to another file. There is also functionality to
count all the items that are being filtered out.

I wrote this in c# instead of in perl because the log files are
3-4GB and I want faster processing than perl would typically
provide. And, I'm learning c#.

There are two issues I would like to address: improve the speed
of the file i/o and control the processing. Right now, this
app takes about 20 min to process a 3GB file on a laptop with a
2Ghz proc and 2GB RAM. Processing is implementing a method
that both filters and counts. Also, it pegs my CPU while it's
running.

Below are the filtering and filtering/counting methods.


Jon> If the CPU is pegged (which I can understand, given the
Jon> code), then the I/O speed isn't the problem.

Jon> Some suggestions:

Jon> 1) Don't create the regular expressions freshly each time. I
Jon> don't know whether you've got a lot of small files or just a
Jon> few big ones, but it would make more sense to create them
Jon> once, as you don't need to change them.

Jon> 2) Use the option to compile the regular expressions when you
Jon> create them. This could improve things enormously.

Jon> 3) Rather than using a hashtable, consider having an array of
Jon> ints along with your array of regular expressions. You could
Jon> then iterate through the regular expression array by index
Jon> rather than by value, and just increment the relevant int -
Jon> no hashtable lookup, no unboxing and then reboxing.

Jon> 4) If most lines in the file will match one of the filters,
Jon> try getting rid of the "all" regular expression, working out
Jon> the result just by running all the others. It may not help,
Jon> but it's worth a try.

Jon> Finally, use using statements for your stream readers and
Jon> writers - that way, if an exception is thrown, you'll still
Jon> close the file immediately.

Jon> -- Jon Skeet - <sk***@pobox.com> http://www.pobox.com/~skeet
Jon> If replying to the group, please do not mail me too

Thanks very much for the clues, I will follow up. As I mentioned, the
files are large -- 3 to 4 GB, which is why I'm trying C# instead of
using Perl.

Your help is much appreciated.

mp

--
'cat' is not recognized as an internal or external command,
operable program or batch file.
Nov 17 '05 #3

This thread has been closed and replies have been disabled. Please start a new discussion.
