Count Lines in (Huge) Text Files

What's the best way to count the lines? I'm using the following code
at the moment:
public long GetNumberOfLines(string fileName) {
    int buffSize = 65536;
    int streamSize = 65536;

    long numOfLines = 0;
    byte[] bArr = new byte[buffSize];
    using (FileStream br = new FileStream(fileName, FileMode.Open,
            FileAccess.Read, FileShare.None, streamSize,
            FileOptions.RandomAccess))
        for (int i = br.Read(bArr, 0, buffSize); i > 0; i = br.Read(bArr, 0, buffSize))
            if (i == buffSize)
                foreach (byte A in bArr) { if (A == (byte)'\n') numOfLines++; }
            else
                for (int ii = 0; ii < i; ii++)
                    if (bArr[ii] == (byte)'\n') numOfLines++;
    return numOfLines;
}

I'm able to count lines in files under 300MB (100MB in about 2 or 3
seconds, 300MB in about 20 seconds). My problem is that I need to
count lines in really big files (1GB-2GB). Are there any
suggestions to make that possible? Or is the above basically as good as
it gets?

--Notes--
1. I've tried a lot of combinations of buffSize/streamSize, and 65kb/
65kb seems to work fastest for files above 100MB. Making them higher
doesn't seem to increase speed much, and streamSize can be ~8kb without
noticeable changes; however, I figured it doesn't hurt to keep them the
same.

2. I've tried other classes/methods like StreamReader, LINQ "Count" on
byte[], ReadByte(), etc.; they are all usually a lot slower than the
above.

3. foreach seems to work a lot faster than the "for(...)" statement,
which is why I do both. Is there a cleaner way to do this?

4. FileOptions.RandomAccess seems to work better than
"FileOptions.SequentialScan" for me, especially if I call
"CountLines()" twice in a row. E.g. for a 300MB file, 20 seconds for the
first call and ~5 seconds for the second. Is this normal? Or should
SequentialScan be better for the above? Is disposing what's messing that up?

The first scan is what I'm mostly trying to achieve, since I can play
around with the caching options after I'm able to count once in a
decent amount of time.

Thanks
Aug 14 '08 #1
On Thu, 14 Aug 2008 16:45:40 -0700, NvrBst <nv****@gmail.com> wrote:
[...]
I'm able to count lines in files under 300MB (100MB in about 2 or 3
seconds, 300MB in about 20 seconds). My problem is that I need to
count lines in really big files (1GB-2GB). Are there any
suggestions to make that possible? Or is the above basically as good as
it gets?
On what kind of hard drive are you testing?

Looking at the data at http://www.storagereview.com/, it appears to me
that high-end drives top out at around 130MB/s, there's another class at
around 90MB/s, and then a third at around 60MB/s. Note that these
benchmarks are run using high-end hardware that can ensure that the drive
is the only bottleneck.

Depending on what hardware you're actually using, 50MB/s may be about as
good as you get. I have no idea why you would be able to read 100MB "in
about 2 or 3 seconds" but then take ten times as long to read only 3 times
as much data. I'd look into that before I worried about finer
implementation details.

As far as your specific comments go...
--Notes--
1. I've tried a lot of combinations of buffSize/streamSize, and 65kb/
65kb seems to work fastest for files above 100MB. Making them higher
doesn't seem to increase speed much, and streamSize can be ~8kb without
noticeable changes; however, I figured it doesn't hurt to keep them the
same.
I would expect best results to come where "streamSize" is set to a value
some integral multiple of "buffSize", and where each is some integral
multiple of 4096 (the default size of a page in memory).

A few years ago, I was doing some high-throughput disk stuff and my
recollection is that I found the same thing you did: larger buffers only
helped up to about 8K or so, and past that any improvement was minimal.
2. I've tried other classes/methods like StreamReader, LINQ "Count" on
byte[], ReadByte(), etc.; they are all usually a lot slower than the
above.
I don't understand why StreamReader would be "a lot slower". It seems to
me that with appropriate settings for its buffer, it should perform
better, since it ought to be optimized for line-based i/o. Did you try
different buffer sizes for it?

Though, now that I think about it, StreamReader does do a lot of
superfluous work, like allocating the string that's returned from
ReadLine() and doing character decoding. So maybe if all you care about
is the count of lines, a raw scan is best.
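
For reference, a rough sketch of what a larger-buffer StreamReader test might
look like; the encoding choice and the 1MB buffer here are just assumptions to
experiment with, not measured recommendations:

// Hypothetical StreamReader variant with an explicit buffer size.
// needs: using System.IO; using System.Text;
public long GetNumberOfLinesReader(string fileName) {
    long numOfLines = 0;
    // Assumes the log is effectively single-byte text; 1MB buffer is arbitrary.
    using (StreamReader sr = new StreamReader(fileName, Encoding.ASCII, false, 1 << 20))
        while (sr.ReadLine() != null)
            numOfLines++;
    return numOfLines;
}

If that still comes out slower, it would support the superfluous-work theory above.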
3. foreach seems to work a lot faster than the "for(...)" statement,
which is why I do both. Is there a cleaner way to do this?
I suppose that depends on why there's the performance difference in the
first place. But you could write a single loop like this:

int ib = 0;

foreach (byte b in bArr)
{
    if (ib++ < i)
    {
        if (b == (byte)'\n')
        {
            numOfLines++;
        }
    }
    else
    {
        break;
    }
}

Assuming what's hurting you in the explicit for() loop is the retrieval of
the data and not the counter increment, the above should perform basically
as well as a plain foreach() loop.

I'm surprised there's such a significant difference though. After all,
foreach() has to allocate an enumerator and repeatedly call into it, where
a bunch of branching happens (granted, the CPU will predict most of that
correctly), state has to be saved, and the new value has to be returned.
That it should cost so much to index the byte array explicitly that some
optimization in the enumerator makes up for all the extra work seems
really weird to me.

But then, I guess that's why we do performance testing, huh? :)
4. FileOptions.RandomAccess seems to work better than
"FileOptions.SequentialScan" for me, especially if I call
"CountLines()" twice in a row. E.g. for a 300MB file, 20 seconds for the
first call and ~5 seconds for the second. Is this normal? Or should
SequentialScan be better for the above? Is disposing what's messing that up?
I would expect SequentialScan to be better for one-time access. After
that, I'd be surprised if one was significantly better than the other.
The speedup you're seeing is Windows caching the data from the file. It
shows that for the first access, your code really isn't the bottleneck;
the disk is.

I have to admit, I'm surprised that you are able to alter the
first-time-scan performance much if at all by variations in the code.
Once the file data's been cached it makes sense that code differences
would show up. But for the first-time-scan scenario, only really bad code
should be slower than the disk.

Pete
Aug 15 '08 #2
On Aug 15, 3:45 am, NvrBst <nvr...@gmail.com> wrote:
What's the best way to count the lines? I'm using the following code
at the moment:

public long GetNumberOfLines(string fileName) {
    int buffSize = 65536;
    int streamSize = 65536;

    long numOfLines = 0;
    byte[] bArr = new byte[buffSize];
    using (FileStream br = new FileStream(fileName, FileMode.Open,
            FileAccess.Read, FileShare.None, streamSize,
            FileOptions.RandomAccess))
        for (int i = br.Read(bArr, 0, buffSize); i > 0; i = br.Read(bArr, 0, buffSize))
            if (i == buffSize)
                foreach (byte A in bArr) { if (A == (byte)'\n') numOfLines++; }
            else
                for (int ii = 0; ii < i; ii++)
                    if (bArr[ii] == (byte)'\n') numOfLines++;
    return numOfLines;
}
Unless your files are restricted to ASCII or Latin-1, the above is
wrong. For some multibyte encodings, '\n' can occur as part of
another character, and shouldn't be treated as a newline in that
case.
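
To make that concrete, here is a small sketch; the specific character is an
arbitrary example I picked, not something from the original post:

// In UTF-16LE, plenty of characters contain an 0x0A byte even though no
// newline is present, so a raw byte scan for '\n' over-counts lines.
// needs: using System; using System.Text;
byte[] bytes = Encoding.Unicode.GetBytes("\u670A");  // an arbitrary CJK character
Console.WriteLine(BitConverter.ToString(bytes));     // prints "0A-67"

For plain ASCII or UTF-8 logs this isn't an issue, since in those encodings the
byte 0x0A only ever appears as an actual line feed.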
2. I've tried other classes/methods like StreamReader, LINQ "Count" on
byte[], ReadByte(), etc.; they are all usually a lot slower than the
above.
StreamReader is probably slower because it opens files as UTF-8 by
default, so it has to do the conversions. Even if it never actually
encounters a multibyte sequence, the checks are still made for every
char read.

LINQ Count is slower because it uses IEnumerable to access the
array's elements. This means two virtual method calls for every
element.

ReadByte() is slower because you forego caching then, but I wonder
what it would look like if you'd wrap your FileStream into
BufferedStream before using it.
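
For what it's worth, a minimal sketch of that BufferedStream experiment might
look like this (the buffer size is arbitrary):

// Hypothetical ReadByte() variant with an explicit BufferedStream wrapper.
// needs: using System.IO;
public long GetNumberOfLinesBuffered(string fileName) {
    long numOfLines = 0;
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    using (BufferedStream bs = new BufferedStream(fs, 65536)) {
        int b;
        while ((b = bs.ReadByte()) != -1)
            if (b == '\n')
                numOfLines++;
    }
    return numOfLines;
}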
3. foreach seems to work a lot faster than the "for(...)" statement,
which is why I do both. Is there a cleaner way to do this?
Yes - try using the generic 3-argument static method Array.IndexOf() in a
loop.
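
A minimal sketch of that suggestion, borrowing bArr and i from the original
post, and using the overload that also takes a count so the search stays
within the bytes actually read:

// Count '\n' bytes in the first i bytes of bArr via Array.IndexOf.
int pos = Array.IndexOf<byte>(bArr, (byte)'\n', 0, i);
while (pos != -1) {
    numOfLines++;
    pos = pos + 1 < i
        ? Array.IndexOf<byte>(bArr, (byte)'\n', pos + 1, i - pos - 1)
        : -1;
}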
Aug 15 '08 #3
On Thu, 14 Aug 2008 16:45:40 -0700 (PDT), NvrBst <nv****@gmail.com>
wrote:
>What's the best way to count the lines? I'm using the following code
at the moment: [...]
How accurate do you need your count to be? You could count lines in
the first 10% of the file and multiply by 10 to get an approximation
of the full count.
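
A rough sketch of that sampling idea (the names and the scan fraction are
placeholders; it scales the sampled count by the bytes actually read):

// Approximate the line count by scanning only the first part of the file.
// needs: using System.IO;
public long EstimateNumberOfLines(string fileName, double fraction) {
    long sampled = 0, read = 0;
    byte[] buf = new byte[65536];
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read)) {
        long bytesToScan = (long)(fs.Length * fraction);
        int n;
        while (read < bytesToScan && (n = fs.Read(buf, 0, buf.Length)) > 0) {
            for (int j = 0; j < n; j++)
                if (buf[j] == (byte)'\n')
                    sampled++;
            read += n;
        }
        // Scale up by how much of the file was actually scanned.
        return read > 0 ? (long)(sampled * ((double)fs.Length / read)) : 0;
    }
}

Scaling by the bytes actually read (rather than a fixed factor of 10) keeps the
estimate honest if the loop reads slightly past the target.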

rossum

Aug 15 '08 #4
What about:

int count = 0;
using (StreamReader sr = new StreamReader(filename))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        count++;
    }
}

or if that doesn't help, then use your code but instead of using foreach,
use IndexOf. Let us know what you find.

Hilton
Aug 15 '08 #5
Thank you all for your suggestions
On what kind of hard drive are you testing?
WDC WD800JD-60LSA5 (http://www.wdc.com/en/products/productspecs.asp?
driveid=83). It's a SATA HD, and the site says 150MB/s transfer. My HD
hasn't been defragmented in a while, so that might be why the large
files perform so poorly on the initial load.

I made a mistake about 20sec vs. 3sec. With re-testing, the very initial
load is about 60 seconds on a 270MB file, and ~0.5 seconds after that.
I was looking at ticks initially and was misreading them by about a
factor of 10.

I should have posted some numbers instead of saying "a lot". What I
noticed was the following (averaging 10 results, calling CountLines()
twice right after each other, calling the EXE 10 times; times are in
seconds):

--FileOptions.RandomAccess-- ForEach
1st Min(0.5) Max(0.656) Avg(0.656)
2nd Min(0.328) Max(0.406) Avg(0.3592)

--FileOptions.RandomAccess-- For Loop (11-25% slower than ForEach)
1st Min(0.609) Max(0.625) Avg(0.619)
2nd Min(0.453) Max(0.484) Avg(0.467)

--FileOptions.RandomAccess-- For /w Array.IndexOf (20-25% faster than
ForEach)
1st Min(0.406) Max(0.422) Avg(0.414)
2nd Min(0.266) Max(0.312) Avg(0.284)

--FileOptions.Sequential-- For /w Array.IndexOf
This turned out weird for me. Basically 1st and 2nd are always about
the same, but sometimes it does "~4s/~4s" and sometimes "~0.5s/~0.5s".
The very initial load shows up as "~60s/~60s" vs. the "~60s/0.284s" that
RandomAccess does. Again, I didn't look into this much; maybe it's
because I'm disposing the file handle after I'm done with it. When it
is low (under 1s), its average is about 5% higher than the RandomAccess
version.

I also noticed that if I changed my two for loops to a "while((i = XX) >
0)" type thing, there was a ~10% drop in speed, which seemed strange to
me (I thought they'd be the same).

Yes - try using the generic 3-argument static method Array.IndexOf() in a loop.
Ahh, this implementation was faster than all the ones I've tried
before, hands down. Thanks.
How accurate do you need your count to be?
What I was doing was playing around with the DataGridView / ListView
classes (in virtual mode), but I needed to set the VirtualListSize. I
was thinking counting the lines should be quick enough, and then I'd
have the exact amount needed. Since the (very first) initial load
time is too slow for 1GB-2GB files, I'm going to play around with other
approaches. Basically I was thinking one of the following:

a) Have a Background Worker counting the lines, and a timer that
updates the VirtualListSize every 5 or so seconds.
b) A "Memory Mode" type thing where I base everything on the file size
and treat the Index as a percent (e.g. the Index is at 22% of the size,
so Seek to that position and build a few pages of the cache; see the
rough sketch below). When out of range of my own cache, rebuild at the
current Index %.
c) Pure would be harder if I don't have the seek position of the line
I need to process for the DGV/LV. This would turn out to be similar
to b), except at the element level instead of the page level.

I'm leaning towards b) (maybe just for files above 50 or 100MB), but
a) might be okay too and work for all file sizes. I'll play around
with them; thanks for all your suggestions.
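
A very rough sketch of what the b) idea might look like; the alignment step
and all the names here are my own guesses, not working code from the project:

// Hypothetical page load for option b): seek to a position proportional to
// the requested index, skip the partial line we land in, then read a page.
// needs: using System.Collections.Generic; using System.IO;
public List<string> LoadPageAt(string fileName, double percent, int linesPerPage) {
    var page = new List<string>();
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    using (StreamReader sr = new StreamReader(fs)) {
        fs.Seek((long)(fs.Length * percent), SeekOrigin.Begin);
        sr.DiscardBufferedData();      // don't let the reader reuse stale buffered data
        if (fs.Position > 0)
            sr.ReadLine();             // throw away the (likely partial) line we landed in
        string line;
        while (page.Count < linesPerPage && (line = sr.ReadLine()) != null)
            page.Add(line);
    }
    return page;
}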

In the end, the best line-count implementation I got was the following,
in case it is useful for anyone or someone can suggest an improvement.
readonly int buffSize = 65536, streamSize = 65536;

public long GetNumberOfLines(string fileName) {
    long numOfLines = 0;
    byte[] bArr = new byte[buffSize];
    using (FileStream fs = new FileStream(fileName, FileMode.Open,
            FileAccess.Read, FileShare.Read, streamSize,
            FileOptions.RandomAccess))
        for (int i = fs.Read(bArr, 0, buffSize); i > 0; i = fs.Read(bArr, 0, buffSize))
            // Search only the i bytes actually read, so a short final read
            // doesn't count stale '\n' bytes left over in the buffer.
            for (int ii = Array.IndexOf<byte>(bArr, (byte)'\n', 0, i); ii != -1;
                 ii = Array.IndexOf<byte>(bArr, (byte)'\n', ii + 1, i - ii - 1))
                numOfLines++;
    return numOfLines;
}

Aug 15 '08 #6
On Aug 15, 11:02 am, "Hilton" <nos...@nospam.com> wrote:
What about: [...]

or if that doesn't help, then use your code but instead of using foreach,
use IndexOf. Let us know what you find.

Hilton
Ahh, I tried StreamReader before my first post. Basically it's about the
same for the initial load (~1 min for a 270MB file), but the runs after
that are about 3x longer than the Array.IndexOf implementation above.
Aug 15 '08 #7
On Fri, 15 Aug 2008 11:55:48 -0700, NvrBst <nv****@gmail.com> wrote:
Thank you all for your suggestions
>On what kind of hard drive are you testing?
WDC WD800JD-60LSA5 (http://www.wdc.com/en/products/productspecs.asp?
driveid=83). It's a SATA HD, and the site says 150MB/s transfer. My HD
hasn't been defragmented in a while, so that might be why the large
files perform so poorly on the initial load.
Keep in mind that the transfer rate described in that sheet is the
interface speed, not the actual throughput of the drive. For example, from
the storagereview.com web site again, the big brother to that drive (WD3200JD)
has a _maximum_ transfer rate of about 66MB/s, and drops to as low as
40MB/s at the slowest part of the drive (with a constant rotational speed,
the linear speed of the disk media depends on whether data's being read
from the inner or outer parts of the drive platter).

The interface rate is useful when you're reading small amounts of data,
when it's been prefetched by the drive into its internal buffer. But when
you start reading large files, you run into the actual physical limits of
the drive, and you'll never see the published transfer rate.
I made a mistake about 20sec vs. 3sec. With re-testing, the very initial
load is about 60 seconds on a 270MB file, and ~0.5 seconds after that.
I was looking at ticks initially and was misreading them by about a
factor of 10.

I should have posted some numbers instead of saying "a lot". What I
noticed was the following (averaging 10 results, calling CountLines()
twice right after each other, calling the EXE 10 times; times are in
seconds): [...]
I don't understand those numbers. The first read should be a lot slower
than the second, not just a little slower. The numbers you posted suggest
that the file you're reading is so small that the disk transfer limits
don't wind up being an issue. If that's true, you'll want to be very
careful extrapolating those results to larger files.

If those are in fact numbers from very large files, then perhaps you could
elaborate on why a) the times are so fast even for the first read, and b)
why the second read is only a little faster than the first.

Seems odd to me.

I think it's great that Pavel's suggestion to use IndexOf() produced a
significant improvement. But with your elaboration, I'm not sure you're
measuring something that will turn out to be significant.

For one, the really interesting case is the slow one for a large file.
For another, since you're trying to implement a virtual view in a control,
you're going to need a quick way to go to specific lines, not just know
how many lines there are. I suspect that the code that scans input data
for line-breaks just won't be the bottleneck. For the first read of a
file, the disk itself is going to slow you down, and if you're creating an
index of lines as you read the file, then the work of managing that data
structure is likely to swamp the effort of just scanning the bytes.

It seems to me that your implementation is fine, and it's great that it
performs well. But I'm wondering what's going to happen when you try to
use this with some actual large file and the necessary work to avoid
having to keep scanning the file looking for specific lines as the user
scrolls through the control.

Pete
Aug 15 '08 #8
On Aug 15, 12:44 pm, "Peter Duniho" <NpOeStPe...@nnowslpianmk.com>
wrote:
On Fri, 15 Aug 2008 11:55:48 -0700, NvrBst <nvr...@gmail.com> wrote:
[...]
I don't understand those numbers. The first read should be a lot slower
than the second, not just a little slower. [...] If those are in fact
numbers from very large files, then perhaps you could elaborate on why
a) the times are so fast even for the first read, and b) why the second
read is only a little faster than the first.
[...]
Pete
Sorry, I should have elaborated more. For those numbers, the initial (very
first) load was always about 60 seconds (50-80 seconds) in all cases.
After that, a batch script ran the EXE 10 times to get the resulting
numbers, then gave the Min, Max, and Average of those 10 runs. So the
improvements in the numbers I gave were all relative to the file's initial
opening being done already. The "1st"/"2nd" refers to the program calling
"GetLineCount()" twice in a row, so the 1st call gave the "1st" results
and the 2nd call gave the "2nd" results (one EXE run). RandomAccess seems
to make the 2nd call a lot better, while others (e.g. Sequential) keep
"1st" and "2nd" the same.

Since the initial load is taking so long, the small improvements probably
won't matter, but eventually I'm going to be processing the log files
(i.e. picking out certain lines, etc.). The logs are mostly TCP/IP
traffic, so I will be filtering data from certain IP addresses a lot.
So this might help that.

What I was doing was adding the position of each newline to a
"List<uint>" (counting was just a simpler way for me to convey the
problem), and using that position to get the item in the
"GetVirtualItem(Index#Requested)" call. With my 300MB log files, the
List only grows to about 3MB so far, but potentially this could be
bad if, say, there are only about 10 characters per line in a 1GB file
(I don't think I will run into this, though; the lines are usually very
long in my files). I missed a word in "c)", which was supposed to say
"Pure Estimation"; that type of thing is what I was talking
about with "don't have the seek position".

Using the "Memory Type" approach, I wouldn't have to store seek
positions, which is why it'd be good for huge (1G+) files. Right now
in my implementation (using a backgroundworker + ProgressChange to
update VirtualListSize) seems to work okay. On the 300MB files it
takes about 1.5 mins to finish processing, but the list is populated
with data almost instantly, so I'm able to look at it and scroll while
data is still being added. I havn't done much testing with the real
1-2GB files yet though, and havn't attempt the "Memory Type" approach.
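
A bare-bones sketch of that wiring, assuming a ListView named listView1 in
virtual mode and a fileName variable in scope; real code would throttle the
updates, e.g. with the 5-second timer mentioned earlier:

// Count lines on a worker thread; push the running total to the UI thread.
// needs: using System.ComponentModel; using System.IO; using System.Windows.Forms;
BackgroundWorker worker = new BackgroundWorker { WorkerReportsProgress = true };

worker.DoWork += (s, e) => {
    int lines = 0;
    byte[] buf = new byte[65536];
    using (FileStream fs = new FileStream((string)e.Argument, FileMode.Open, FileAccess.Read)) {
        int n;
        while ((n = fs.Read(buf, 0, buf.Length)) > 0) {
            for (int j = 0; j < n; j++)
                if (buf[j] == (byte)'\n')
                    lines++;
            worker.ReportProgress(0, lines);   // unthrottled here for brevity
        }
    }
    e.Result = lines;
};

worker.ProgressChanged += (s, e) =>
    listView1.VirtualListSize = (int)e.UserState;   // grow the virtual list as counting proceeds

worker.RunWorkerAsync(fileName);   // fileName: path of the log being loaded (assumed)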
Aug 15 '08 #9
On Fri, 15 Aug 2008 13:14:31 -0700, NvrBst <nv****@gmail.com> wrote:
Sorry, I should have elaborated more. For those numbers, the initial (very
first) load was always about 60 seconds (50-80 seconds) in all cases.
Ah. Right. So, that's the true time it takes to read the file. Unless
you are scanning the same file over and over, the other numbers are
irrelevant. They are only that fast because the OS has cached the file
after the first time.

Note that measuring the difference between RandomAccess and Sequential
also is irrelevant except for the first time the file is scanned. The
whole point of Sequential is to provide a hint to the OS that it doesn't
have to keep around older data in the file as it's being read. Of course,
the OS is free to ignore the hint, but from your description it sounds
like at least some of the time it doesn't and the file data winds up _not_
being cached (hence slower speeds).

There's some (minimal) overhead to caching, so providing that hint could
be important when you really are reading a lot of files, all of them just
once through. But that doesn't seem to be what you're doing here.
RandomAccess is probably the right choice, but mainly because Sequential
might be a lie (since you go back and revisit the file). :)

And finally, note that the actual "reading the file" time is orders of
magnitude longer than when it's already in memory. This means that
whatever performance difference exists in your code, it's completely
hidden by the cost of actually reading the data from the disk. With the
exception of any truly awful implementations (if you found a version of
the code that was 10x or more slower, for example), there really is no
point trying to optimize this section of code. The only time it's slow,
it's slow for reasons completely unrelated to (and unfixable by) the code
itself.
[...]
What I was doing was adding the position of each newline to a
"List<uint>" (counting was just a simpler way for me to convey the
problem), and using that position to get the item in the
"GetVirtualItem(Index#Requested)" call. With my 300MB log files, the
List only grows to about 3MB so far, but potentially this could be
bad if, say, there are only about 10 characters per line in a 1GB file
(I don't think I will run into this, though; the lines are usually very
long in my files).
I think using a List<T> of indices is fine. Even a 50MB data structure is
probably not a huge problem on modern PCs, assuming you don't need to have
a lot of them all at once. And of course, it has the benefit of being
completely precise.

But it does add even more overhead to the scanning code, making it even
more pointless to optimize the actual "look for line breaks" part. :)

Your estimation-based implementation options both obviate any need to scan
the file at all, of course. So not only are they memory-lightweight, if
you find that scanning the file is an unacceptable cost, that would be the
way to go. The scroll bar will wind up behaving funny though, because the
control is going to expect a direct mapping between the scroll bar and
rows in the data, which you won't have.

Pete
Aug 16 '08 #10
