fastest way to parse a file; Most efficient way to store the data?

hoopsho

Hi Everyone,
I am trying to write a program that does a few things very fast
and with efficient use of memory...

a) I need to parse a space-delimited file that is really large,
upwards fo a million lines.
b) I need to store the contents into a unique hash.
c) I need to then sort the data on a specific field.
d) I need to pull out certain fields and report them to the user.

So my questions are as follows:

o What is the most effiecient way to pull fields out of a file?
It does not appear that "tokenizing" the string of each line
into an array via split is very efficient.... is there another way?
o I need to take the id of each field and ensure that it does not
already exist in my "collection". Is using the arraylist the best way
to do this? Should I take the data from each line of the file and make
it into an object or a STRUCT? Which is faster and better from a
memory standpoint?
o Which collection has the best sorting capabilities?
o I assume that a for/next type of loop is the best way to
iterate through this collection to output the data... is there a better
way?

Thanks for any help.

Nov 16 '05 #1

Subscribe Post Reply

3579

Ignacio Machin \( .NET/ C# MVP \)

Hi,

A million of rows is big for anything less of RDBMS so it will be
intensive no matter what you do.

a) I need to parse a space-delimited file that is really large,
upwards fo a million lines.
b) I need to store the contents into a unique hash.
c) I need to then sort the data on a specific field.
d) I need to pull out certain fields and report them to the user.

So my questions are as follows:

o What is the most effiecient way to pull fields out of a file?
It does not appear that "tokenizing" the string of each line
into an array via split is very efficient.... is there another way?
You will do it, eventually, either in the fly ( if you keep the entire
line) or just split it at reading time and keep a separate instance for each
token
o I need to take the id of each field and ensure that it does not
already exist in my "collection". Is using the arraylist the best way
to do this? Should I take the data from each line of the file and make
it into an object or a STRUCT? Which is faster and better from a
memory standpoint?
I don;t think that an arraylist is the best for this, ( as well as for
sorting ) you should use either a b-tree or a binary tree ( IIRC a b-tree is
better )
o Which collection has the best sorting capabilities?
Depends, if you use a binary tree for example you will build it sorted
already, so all you have to do is a ( preorder? ) iteration and will get
the items sorted

Now if the sorting criteria change on the fly that is another history :)
o I assume that a for/next type of loop is the best way to
iterate through this collection to output the data... is there a better
way?

Any way you go it will be processor intensive as well as memory intensive,
1 million is a lot of records after all.

Cheers,

--
Ignacio Machin,
ignacio.machin AT dot.state.fl.us
Florida Department Of Transportation

Nov 16 '05 #2

Nicholas Paldino [.NET/C# MVP]

hoopsho,

What I would do is use a FileStream (perhaps with a BufferedStream on
top to reduce reads to the disk) to read the contents of the file. Since
you say the Split method is too slow, I recommend that you read character by
character, tokenizing the lines yourself.

If you need to make sure that every line is unique, then you would want
to use a struct. However, storing these in a Hashtable will mean that you
have to box and unbox the value every time you want to access this, and
that's extremely inefficient.

The reason you would use a structure is that it would provide you with a
hashvalues that are equal if all of the hashvalues are the same.

To get around this, I would recommend creating a structure that has all
the fields you will populate. Then, create a class wrapper around the
structure, which has a private field of the type of the structure. Override
the GetHashCode to return the hash code from the structure. Override Equals
as well to return true when the hash codes match.

Then, I would store that in the hashtable. When adding a new row, check
to see if there is a value in the hashtable already. If there is, then
ignore it, if not, then store it.

Then you have the issue of sorting the data on a specific field. The
hashtable isn't going to give you anything, so then I would recommend a
DataSet, or use an ArrayList that holds references to the same items that
are in the Hashtable. Storing a million row table in a data set is going to
crush your server (unless you have a TON of memory that you can spare), and
generally isn't a good idea. The hashtable has the same drawbacks. If you
don't want to use the DataSet and you know the field(s) you want to sort on,
then I say that as you add the items to the hashtable, perform a binary
search through the ArrayList and insert the new value at the position in the
arraylist where it should go. This is an issue as well. Say you have to
insert a reference at the beginning of the list for the millionth record.
Moving 999,999 records forward is not going to be quick.

It is this kind of scenario where you might want to consider unsafe
code, allocating a buffer of memory, and moving the records around yourself.

In the end, I would recommend that you use a RDBMS to store the values
temporarily. The hashtable scheme used before can help here. If you detect
that you don't have a row with those values, then you can insert it,
otherwise, do nothing. Granted, the hashtable is going to incur overhead,
and a lot. However, it would reduce the number of operations against the
database that you have to perform, depending on the number of duplicates you
have (on average).

Finally, when you want to output the results, you can just perform a
query, selecting the fields you want, and sorting it how you wish.

You are ultimately going to make your decision based on a number of
factors. The biggest one is the amount of RAM that you have on the machine.
If you have a good amount that could support this much information in your
structure in memory, then by all means, use a DataSet. However, if you do
not, or the number of duplicate records is high, then go with the
Hashtabe/DB solution, as it would be much faster than looping through all of
the items yourself, comparing fields, trying to sort, etc, etc.

Hope this helps.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

"hoopsho" <ho*****@gmail.com> wrote in message
news:11**********************@c13g2000cwb.googlegr oups.com...

Hi Everyone,
I am trying to write a program that does a few things very fast
and with efficient use of memory...

a) I need to parse a space-delimited file that is really large,
upwards fo a million lines.
b) I need to store the contents into a unique hash.
c) I need to then sort the data on a specific field.
d) I need to pull out certain fields and report them to the user.

So my questions are as follows:

o What is the most effiecient way to pull fields out of a file?
It does not appear that "tokenizing" the string of each line
into an array via split is very efficient.... is there another way?
o I need to take the id of each field and ensure that it does not
already exist in my "collection". Is using the arraylist the best way
to do this? Should I take the data from each line of the file and make
it into an object or a STRUCT? Which is faster and better from a
memory standpoint?
o Which collection has the best sorting capabilities?
o I assume that a for/next type of loop is the best way to
iterate through this collection to output the data... is there a better
way?

Thanks for any help.

Nov 16 '05 #3

Bruce Wood

First off, let me point out that in computing, you are usually facing a
tradeoff between memory and speed. The question, "What is the fastest,
most memory-efficient way to do this?" is like asking, "What is the
quickest, cheapest way to get to Paris?" (Assuming, of course, that
you're not already in Paris. :) Sometimes the fastest way is the
cheapest way, but more often than not you have to make a tradeoff
between speed and cost: airplanes cost more than freighters. So it is
with speed and memory: sometimes the fastest way is also the most
memory-efficient, but more often than not you have to trade one for the
other.

That said, your solution depends very much upon whether the field
you're sorting on is always also the field that you require be unique.
If it is, then I suggest that you use some form of tree structure
(research the already-available .NET collections), which will sort your
items on the fly and give you some indication of uniqueness at the same
time. Since you have to sort anyway, you might as well do that and your
uniqueness check all at once.

However, if you could potentially be determining uniqueness on one
field and sorting on a different field, then there's no value in
determining uniqueness using anything other than a hash table. A hash
table will give you lightning-fast lookup capabilities to determine if
you've already seen a key. There's only one thing faster, which is a
very, very big B-tree, but it uses up tons of memory so I wouldn't go
that way. Hash tables are robust and fast.

As for sorting, you should either build a tree structure or use the
quicksort algorithm. Both methods are reasonably quick. I wouldn't
suggest using insertion sort, which is what Nicholas was suggesting
(sorry) because with a million records you'll _definitely_ notice a
performance difference. The Array class contains a Sort method, but it
doesn't mention which algorithm it uses, although I must suppose that
if the MS people who wrote the Framework didn't use quicksort, or
something even faster (yes, there are a few faster algorithms) then
they're not too sharp.

Finally, there's the problem of storage. Yes, you can parse each line
and blow it out into an array of strings, but then if you have to write
it out again. As well, if you're doing a quick sort, you have to
shuffle (potentially) large records around in memory.

Another way to solve the problem is to create a small class containing
a string, an offset, and a length. If you use short integers for the
offset and the length you can pare this down to 64 bits. When you read
in a line, and you want to represent field #15, for example, make a new
one of these objects, set the string pointer to your line you read in,
and the offset and the length to indicate where your field starts and
how long it is.

Now if you write an IComparer for this structure:

public class FieldComparer : IComparer
{
public int Compare(object x, object y)
{
MyField field1 = (MyField)x;
MyField field2 = (MyField)y;

if (field1.Length != field2.Length)
{
return field1.Length - field2.Length;
}
else
{
return String.Compare(field1.String, field1.Offset,
field2.String, field2.Offset, field2.Length);
}
}
}

I'm using a class for Field rather than a struct to avoid boxing and
unboxing in the standard Compare method.

So, assuming that you have to determine uniqueness on one field and
sort on another field, here is how I would do it.

Make a Hashtable that will have Field objects as its keys. The values
stored in the Hasthable don't matter, so you might as well use the
lines you're reading from the file. This won't result in any extra
storage for the lines, because if you're storing them as strings then
the runtime will share points so long as the input lines themselves
never change.

Make an ArrayList that will hold the FIeld objects for the fields you
want to eventually sort on.

Read each line, find the field you need to sort on, and the field you
need to verify as unique. Create a Field object for each of them. Check
to see if the Field object for your unique field is already stored in
the Hashtable, and add it if it isn't. Add the Field object for your
sort field to the ArrayList using the Add method. You didn't say
whether you want the record in the output set only if the unique field
is the first occurrence, but you can determine that here because you
already tried to put in the hash table.

When you've read all the lines, use ArrayList.Sort to sort the array
list using an instance of the IComparer class that you created above.
This will take a while, but it's faster than insertion sort or any
other sort method that you might roll yourself.

Run through the array list and feed the records one by one to some sort
of output object, which will know which fields to pick out and display
to the user. Since your Field class contains the original string
pointer for the line, you can recover the input line and scan it for
the output fields that you want.

The only extra overhead that this introduces is that you scan the input
line twice: once to get your unique / sort fields, and once to get your
output fields. However, I doubt that this will create a significant
performance hit. Not after all of that sorting.
Anyway, there's my solution! Good luck!

Nov 16 '05 #4

Bruce Wood

Nov 16 '05 #5

Bruce Wood

Nov 16 '05 #6

Bruce Wood

Nov 16 '05 #7

Bruce Wood

Nov 16 '05 #8

Chris Hyde

hoopsho wrote:

Hi Everyone,
I am trying to write a program that does a few things very fast
and with efficient use of memory...

a) I need to parse a space-delimited file that is really large,
upwards fo a million lines.
b) I need to store the contents into a unique hash.
c) I need to then sort the data on a specific field.
d) I need to pull out certain fields and report them to the user.

I would think this is a good candidate for a DTS package (assuming of
course you have access to DTS and SQL Server). You can define the
package to the format of the file (assuming (again!) that the format
remains the same each time). Use this package to pump your information
into a properly keyed and indexed table...with a "custom" package you
could even create the hash on the fly and store it in a record field.

From there, you can use straight SQL to query for the data AND sort it
all at once...here is where the bottleneck appears in this approach: if
you try to return all 1 million+ lines at once, you are going to bring
even a robust server to its knees (can you say "memory consumption"?
:-). You say though that you want to "report" only certain fields to
the user...I would then suggest that you implement a solid paging
scheme, so that you only query for "pages" of records at a time.

Now, I know nothing of the "time requirements" of your application, as
in how often does this file need to be parsed/stored/sorted. You do
have scheduling capabilities with DTS packages though, so perhaps you
could take advantage of that as well.

HTH...

Chris

Nov 16 '05 #9

Bruce Wood

Nov 16 '05 #10

Bruce Wood

I apologize if this is a re-post. I'm using Google Groups Beta (it's
veeery beta) and it keeps losing its little mind.

First off, let me point out that in computing, you are usually facing a
tradeoff between memory and speed. The question, "What is the fastest,
most memory-efficient way to do this?" is like asking, "What is the
quickest, cheapest way to get to Paris?" (Assuming, of course, that
you're not already in Paris. :) Sometimes the fastest way is the
cheapest way, but more often than not you have to make a tradeoff
between speed and cost: airplanes cost more than freighters. So it is
with speed and memory: sometimes the fastest way is also the most
memory-efficient, but more often than not you have to trade one for the
other.

That said, your solution depends very much upon whether the field
you're sorting on is always also the field that you require be unique.
If it is, then I suggest that you use some form of tree structure
(research the already-available .NET collections), which will sort your
items on the fly and give you some indication of uniqueness at the same
time. Since you have to sort anyway, you might as well do that and your
uniqueness check all at once.

However, if you could potentially be determining uniqueness on one
field and sorting on a different field, then there's no value in
determining uniqueness using anything other than a hash table. A hash
table will give you lightning-fast lookup capabilities to determine if
you've already seen a key. There's only one thing faster, which is a
very, very big B-tree, but it uses up tons of memory so I wouldn't go
that way. Hash tables are robust and fast.

As for sorting, you should either build a tree structure or use the
quicksort algorithm. Both methods are reasonably quick. I wouldn't
suggest using insertion sort, which is what Nicholas was suggesting
(sorry) because with a million records you'll _definitely_ notice a
performance difference. The Array class contains a Sort method, but it
doesn't mention which algorithm it uses, although I must suppose that
if the MS people who wrote the Framework didn't use quicksort, or
something even faster (yes, there are a few faster algorithms) then
they're not too sharp.

Finally, there's the problem of storage. Yes, you can parse each line
and blow it out into an array of strings, but if the number of fields
you have to work with (including output fields) is a small fraction of
the total number of fields in the record, you could be doing a bunch of
extra work for nothing.

Another way to solve the problem is to create a small class containing
a string, an offset, and a length. If you use short integers for the
offset and the length you can pare this down to 64 bits. When you read
in a line, and you want to represent field #15, for example, make a new
one of these objects, set the string pointer to your line you read in,
and the offset and the length to indicate where your field starts and
how long it is.

Now if you write an IComparer for this class:

public class FieldComparer : IComparer
{
public int Compare(object x, object y)
{
MyField field1 = (MyField)x;
MyField field2 = (MyField)y;

if (field1.Length != field2.Length)
{
return field1.Length - field2.Length;
}
else
{
return String.Compare(field1.String, field1.Offset,
field2.String, field2.Offset, field2.Length);
}
}
}

I'm using a class for Field rather than a struct to avoid boxing and
unboxing in the standard Compare method.

So, assuming that you have to determine uniqueness on one field and
sort on another field, here is how I would do it.

Make a Hashtable that will have Field objects as its keys. Again, using
a Field class rather than a Field struct avoids boxing here. The values
stored in the Hasthable don't matter, so you might as well use the same
Field object you used as a key, or the lines you're reading from the
file. This won't result in any extra storage for the lines, because if
you're storing them as strings then the runtime will share references
so long as the input lines themselves never change.

Make an ArrayList that will hold the FIeld objects for the fields you
want to eventually sort on.

Read each line, find the field you need to sort on, and the field you
need to verify as unique. Create a Field object for each of them. Check
to see if the Field object for your unique field is already stored in
the Hashtable, and add it if it isn't. Add the Field object for your
sort field to the ArrayList using the Add method. You didn't say
whether you want the record in the output set only if the unique field
is the first occurrence, but you can determine that here because you
already tried to put in the hash table.

When you've read all the lines, use ArrayList.Sort to sort the array
list using an instance of the IComparer class that you created above.
This will take a while, but it's faster than insertion sort or any
other sort method that you might roll yourself.

Run through the array list and feed the records one by one to some sort
of output object, which will know which fields to pick out and display
to the user. Since your Field class contains the original string
pointer for the line, you can recover the input line and scan it for
the output fields that you want.

The only extra overhead that this introduces is that you scan the input
line twice: once to get your unique / sort fields, and once to get your
output fields. However, I doubt that this will create a significant
performance hit. Not after all of that sorting.
Anyway, there's my solution! Good luck!

Nov 16 '05 #11

Bruce Wood

I apologize if this is a repost. Google Groups Beta went out of its
mind yesterday. (It's beta... very beta). I don't see my post here, so
I'm trying again.

First off, let me point out that in computing, you are usually facing a
tradeoff between memory and speed. The question, "What is the fastest,
most memory-efficient way to do this?" is like asking, "What is the
quickest, cheapest way to get to Paris?" (Assuming, of course, that
you're not already in Paris. :) Sometimes the fastest way is the
cheapest way, but more often than not you have to make a tradeoff
between speed and cost: airplanes cost more than freighters. So it is
with speed and memory: sometimes the fastest way is also the most
memory-efficient, but more often than not you have to trade one for the
other.

That said, your solution depends very much upon whether the field
you're sorting on is always also the field that you require be unique.
If it is, then I suggest that you use some form of tree structure
(research the already-available .NET collections), which will sort your
items on the fly and give you some indication of uniqueness at the same
time. Since you have to sort anyway, you might as well do that and your
uniqueness check all at once.

However, if you could potentially be determining uniqueness on one
field and sorting on a different field, then there's no value in
determining uniqueness using anything other than a hash table. A hash
table will give you lightning-fast lookup capabilities to determine if
you've already seen a key. There's only one thing faster, which is a
very, very big B-tree, but it uses up tons of memory so I wouldn't go
that way. Hash tables are robust and fast.

As for sorting, you should either build a tree structure or use the
quicksort algorithm. Both methods are reasonably quick. I wouldn't
suggest using insertion sort, which is what Nicholas was suggesting
(sorry) because with a million records you'll _definitely_ notice a
performance difference. The Array class contains a Sort method, but it
doesn't mention which algorithm it uses, although I must suppose that
if the MS people who wrote the Framework didn't use quicksort, or
something even faster (yes, there are a few faster algorithms) then
they're not too sharp.

Finally, there's the problem of storage. Yes, you can parse each line
and blow it out into an array of strings, but then you're dong a lot
more work then you have to if the number of output fields is a small
fraction of the total number of fields.

Another way to solve the problem is to create a small class containing
a string, an offset, and a length. If you use short integers for the
offset and the length you can pare this down to 64 bits. When you read
in a line, and you want to represent field #15, for example, make a new
one of these objects, set the string pointer to your line you read in,
and the offset and the length to indicate where your field starts and
how long it is.

Now if you write an IComparer for this class:

public class FieldComparer : IComparer
{
public int Compare(object x, object y)
{
MyField field1 = (MyField)x;
MyField field2 = (MyField)y;

if (field1.Length != field2.Length)
{
return field1.Length - field2.Length;
}
else
{
return String.Compare(field1.String, field1.Offset,
field2.String, field2.Offset, field2.Length);
}
}
}

I'm using a class for Field rather than a struct to avoid boxing and
unboxing in the standard Compare method.

So, assuming that you have to determine uniqueness on one field and
sort on another field, here is how I would do it.

Make a Hashtable that will have Field objects as its keys. Again, using
a class here will avoid boxing. The values stored in the Hasthable
don't matter, so you might as well use the same Field objects you're
using as keys, or lines you're reading from the file. This won't result
in any extra storage for the lines, because if you're storing them as
strings then the runtime will share references so long as the input
lines themselves never change.

Make an ArrayList that will hold the Field objects for the fields you
want to eventually sort on.

Read each line, find the field you need to sort on, and the field you
need to verify as unique. Create a Field object for each of them. Check
to see if the Field object for your unique field is already stored in
the Hashtable, and add it if it isn't. Add the Field object for your
sort field to the ArrayList using the Add method. You didn't say
whether you want the record in the output set only if the unique field
is the first occurrence, but you can determine that here because you
already tried to put in the hash table.

When you've read all the lines, use ArrayList.Sort to sort the array
list using an instance of the IComparer class that you created above.
This will take a while, but it's faster than insertion sort or any
other sort method that you might roll yourself.

Run through the array list and feed the records one by one to some sort
of output object, which will know which fields to pick out and display
to the user. Since your Field class contains the original string
pointer for the line, you can recover the input line and scan it for
the output fields that you want.

The only extra overhead that this introduces is that you scan the input
line twice: once to get your unique / sort fields, and once to get your
output fields. However, I doubt that this will create a significant
performance hit. Not after all of that sorting.
Anyway, there's my solution! Good luck!

Nov 16 '05 #12

fastest way to parse a file; Most efficient way to store the data?

Similar topics