optimizing file i/o

Hello,

I have written a small app to parse web log files and extract certain
lines to another file. There is also functionality to count all the
items that are being filtered out.

I wrote this in C# instead of in Perl because the log files are 3-4GB
and I want faster processing than Perl would typically provide. And,
I'm learning C#.

There are two issues I would like to address: improving the speed of the
file i/o and controlling the processing. Right now, this app takes about 20
min to process a 3GB file on a laptop with a 2GHz processor and 2GB RAM.
That run uses the method that both filters and counts. Also,
it pegs my CPU while it's running.

Below are the filtering and filtering/counting methods.

Thanks.

mp

using System;
using System.Collections;
using System.IO;
using System.Text.RegularExpressions;

public class parseLines
{
    // Static resources (images, scripts, stylesheets) to filter out.
    static string fileIdentifiers =
        @"\.gif\s|\.js\s|\.png\s|\.css\s|\.jpg\s";
    static Regex reAll = new Regex(fileIdentifiers);
    string fileName;

    public parseLines(string fileName)
    {
        this.fileName = fileName;
    }

    public void getLines()
    {
        // print nonmatching lines to stdout
    }

    public Hashtable countMatches()
    {
        // count individual matches
        return null; // body omitted in the post; placeholder so the class compiles
    }

    public void filterLines()
    {
        string newFileName = fileName + ".modified.log";

        StreamReader sr = new StreamReader(fileName);
        StreamWriter wr = new StreamWriter(newFileName);
        string nextLine = sr.ReadLine();
        while (nextLine != null)
        {
            Match myMatch = reAll.Match(nextLine);
            if (!myMatch.Success)
            {
                wr.WriteLine(nextLine);
            }
            nextLine = sr.ReadLine();
        }
        sr.Close();
        wr.Close();
    }

    public Hashtable filterAndCountLines()
    {
        string newFileName = fileName + ".modified.log";
        Hashtable ht = new Hashtable();
        char[] sep = {'|'};
        string[] newTypeArray = fileIdentifiers.Split(sep);
        Regex[] newMatchArray = new Regex[5];

        // One regex per file type, used to work out which type matched.
        for (int i = 0; i < newTypeArray.Length; i++)
        {
            Regex item = new Regex(newTypeArray[i]);
            newMatchArray[i] = item;
        }
        foreach (string item in newTypeArray)
        {
            ht.Add(item, 0);
        }
        ht.Add("total Match", 0);
        ht.Add("total No Match", 0);

        StreamReader sr = new StreamReader(fileName);
        StreamWriter wr = new StreamWriter(newFileName);

        string nextLine = sr.ReadLine();
        while (nextLine != null)
        {
            Match myMatch = reAll.Match(nextLine);
            if (!myMatch.Success)
            {
                // Not a static resource: keep the line.
                wr.WriteLine(nextLine);
                ht["total No Match"] =
                    (int)ht["total No Match"] + 1;
            }
            else
            {
                // Count the first file type that matches.
                foreach (Regex itemRegex in newMatchArray)
                {
                    Match arrMatch = itemRegex.Match(nextLine);
                    if (arrMatch.Success)
                    {
                        ht[itemRegex.ToString()] =
                            (int)ht[itemRegex.ToString()] + 1;
                        break;
                    }
                }
                ht["total Match"] = (int)ht["total Match"] + 1;
            }
            nextLine = sr.ReadLine();
        }
        sr.Close();
        wr.Close();
        return ht;
    }
}

class MainClass
{
    public static void Main(string[] args)
    {
        Hashtable count;
        IDictionaryEnumerator countEnumerator;

        parseLines pl = new parseLines(args[0]);
        count = pl.filterAndCountLines();

        countEnumerator = count.GetEnumerator();
        while (countEnumerator.MoveNext())
        {
            Console.WriteLine(countEnumerator.Key.ToString() + " : " +
                countEnumerator.Value.ToString());
        }
        Console.WriteLine("finished");
    }
}

--
Michael Powe mi*****@trollope.org Waterbury CT
ENOSIG: signature file is empty
Nov 17 '05 #1

If the CPU is pegged (which I can understand, given the code), then the
I/O speed isn't the problem.
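
One rough way to check this is to time a pass that only reads the file and does no regex work, then compare it with the full 20-minute run. The sketch below is an illustration, not code from the original post: the class name ReadTimer and the method CountLines are made up, and the log path is assumed to come in as the first command-line argument. Note that a second run over the same file may be faster simply because the OS has cached part of it.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class ReadTimer
{
    // Reads every line and discards it, so the elapsed time approximates
    // the cost of file I/O plus line splitting, with no regex work at all.
    public static long CountLines(string path)
    {
        long lines = 0;
        using (StreamReader reader = new StreamReader(path))
        {
            while (reader.ReadLine() != null)
            {
                lines++;
            }
        }
        return lines;
    }

    public static void Main(string[] args)
    {
        // args[0] is assumed to be the path of the log file.
        Stopwatch watch = Stopwatch.StartNew();
        long lines = CountLines(args[0]);
        watch.Stop();
        Console.WriteLine("{0} lines read in {1} ms",
            lines, watch.ElapsedMilliseconds);
    }
}
```

If the read-only pass takes a small fraction of the full run, the remaining time is going into matching and counting rather than into the disk.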

Some suggestions:

1) Don't create the regular expressions freshly each time. I don't know
whether you've got a lot of small files or just a few big ones, but it
would make more sense to create them once, as you don't need to change
them.

2) Use the option to compile the regular expressions when you create
them. This could improve things enormously.

3) Rather than using a hashtable, consider having an array of ints
along with your array of regular expressions. You could then iterate
through the regular expression array by index rather than by value, and
just increment the relevant int - no hashtable lookup, no unboxing and
then reboxing.

4) If most lines in the file will match one of the filters, try getting
rid of the "all" regular expression, working out the result just by
running all the others. It may not help, but it's worth a try.
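
Suggestions 1 to 4 might be combined along the following lines. This is only a sketch under stated assumptions, not the poster's code: the names LogFilter, FirstMatch and FilterAndCount are invented for illustration, the five patterns are copied from the fileIdentifiers string in the original post, and the ".modified.log" output name is the one the original uses. Whether dropping the combined regex actually helps would need measuring on a real log.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class LogFilter
{
    // The five patterns from the original post, kept as separate strings.
    static readonly string[] Patterns =
        { @"\.gif\s", @"\.js\s", @"\.png\s", @"\.css\s", @"\.jpg\s" };

    // Created once for the whole process and compiled (suggestions 1 and 2).
    static readonly Regex[] Filters = BuildFilters();

    static Regex[] BuildFilters()
    {
        Regex[] filters = new Regex[Patterns.Length];
        for (int i = 0; i < Patterns.Length; i++)
        {
            filters[i] = new Regex(Patterns[i], RegexOptions.Compiled);
        }
        return filters;
    }

    // Returns the index of the first pattern that matches, or -1 if none
    // does. There is no combined "all" regex here (suggestion 4).
    public static int FirstMatch(string line)
    {
        for (int i = 0; i < Filters.Length; i++)
        {
            if (Filters[i].IsMatch(line))
            {
                return i;
            }
        }
        return -1;
    }

    // Copies non-matching lines to the writer. The returned array has one
    // counter per pattern, plus a final slot for lines that matched nothing
    // (suggestion 3: plain ints, no Hashtable lookup or boxing).
    public static int[] FilterAndCount(TextReader reader, TextWriter writer)
    {
        int[] counts = new int[Filters.Length + 1];
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int hit = FirstMatch(line);
            if (hit < 0)
            {
                writer.WriteLine(line);
                counts[Filters.Length]++;
            }
            else
            {
                counts[hit]++;
            }
        }
        return counts;
    }

    public static void Main(string[] args)
    {
        // args[0] is assumed to be the path of the log file.
        using (StreamReader reader = new StreamReader(args[0]))
        using (StreamWriter writer = new StreamWriter(args[0] + ".modified.log"))
        {
            int[] counts = FilterAndCount(reader, writer);
            for (int i = 0; i < Patterns.Length; i++)
            {
                Console.WriteLine(Patterns[i] + " : " + counts[i]);
            }
            Console.WriteLine("total No Match : " + counts[Patterns.Length]);
        }
    }
}
```

The "total Match" figure from the original is the sum of the first five counters, so it does not need a counter of its own.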

Finally, use using statements for your stream readers and writers -
that way, if an exception is thrown, you'll still close the file
immediately.
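
As a sketch of that last point, the filterLines method from the original post could be written with using statements roughly as follows. The class name SafeFilter and the two-argument FilterLines signature are assumptions made for this example; the pattern and the ".modified.log" suffix are taken from the post.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class SafeFilter
{
    // Same combined pattern as the original post's fileIdentifiers.
    static readonly Regex StaticFiles = new Regex(
        @"\.gif\s|\.js\s|\.png\s|\.css\s|\.jpg\s", RegexOptions.Compiled);

    public static void FilterLines(string inputPath, string outputPath)
    {
        // Both objects are disposed when the block exits, whether it
        // finishes normally or an exception propagates out of the loop.
        using (StreamReader reader = new StreamReader(inputPath))
        using (StreamWriter writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (!StaticFiles.IsMatch(line))
                {
                    writer.WriteLine(line);
                }
            }
        }
    }

    public static void Main(string[] args)
    {
        // args[0] is assumed to be the path of the log file.
        FilterLines(args[0], args[0] + ".modified.log");
    }
}
```

With explicit Close() calls at the end of the method, an exception thrown inside the loop would skip them and leave both files open until the garbage collector finalizes the streams.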

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #2

Thanks very much for the clues; I will follow up. As I mentioned, the
files are large -- 3 to 4 GB, which is why I'm trying C# instead of
using Perl.

Your help is much appreciated.

mp

--
'cat' is not recognized as an internal or external command,
operable program or batch file.
Nov 17 '05 #3
