
what is the best datatype for..

I need some sort of data type that will hold a list of IDs and
their counts/frequencies. As I do some processing I get an ID back,
which I need to store while keeping an accurate count of how many
times that ID has appeared. For example, I might get a list of these
numbers:
1,4,6,23,6,2,54,1,6,23,2,4,1,6

I will get those numbers one at a time and need to build something
like
1,3
4,2
6,4
23,2
2,2
54,1

In above number X,Y:
X- is the number
Y- is the count of the number

I then need to sort the data on the Y (the count) column:
6,4
1,3
4,2
23,2
2,2
54,1

Can anyone suggest how best to do this? Any suggestions or help would
be much appreciated!

Thanks,

Calvert
Nov 19 '07 #1
I would store this in a Dictionary<int, int> where the key is the id and
the value is the count.

Then, when you need to sort the keys, I would enumerate through the
KeyValuePair instances and sort those (Dictionary<TKey, TValue> implements
IEnumerable<KeyValuePair<TKey, TValue>>). You can simply put them into an
array and then call the static Sort method on the Array class, passing in an
anonymous method for the Comparison<T> delegate (which in this case would
be Comparison<KeyValuePair<int, int>>) to compare the values based
on the count.
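
For illustration, here's a rough, untested sketch of that approach
(C# 2.0 style; the class and variable names are just placeholders):

using System;
using System.Collections.Generic;

class FrequencyDemo
{
    static void Main()
    {
        // The ids as they arrive, one at a time.
        int[] ids = { 1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6 };

        // Key = id, value = how many times it has been seen.
        Dictionary<int, int> counts = new Dictionary<int, int>();
        foreach (int id in ids)
        {
            int current;
            counts.TryGetValue(id, out current); // current stays 0 for a new id
            counts[id] = current + 1;
        }

        // Copy the pairs into an array and sort by count, descending.
        KeyValuePair<int, int>[] pairs =
            new List<KeyValuePair<int, int>>(counts).ToArray();
        Array.Sort(pairs,
            delegate(KeyValuePair<int, int> a, KeyValuePair<int, int> b)
            { return b.Value.CompareTo(a.Value); });

        foreach (KeyValuePair<int, int> pair in pairs)
        {
            Console.WriteLine("{0},{1}", pair.Key, pair.Value);
        }
    }
}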

You could also get away with using a DataTable with an id column and a
count column, and then just placing a data view on the data table which is
sorted by the count.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

<ca**********@gmail.com> wrote in message
news:89**********************************@d61g2000hsa.googlegroups.com...
>I need some sort of data type that will hold a list of IDs and
>their counts/frequencies.
<snip>

Nov 19 '07 #2
On Nov 19, 9:53 am, calvert4r...@gmail.com wrote:
> I need some sort of data type that will hold a list of IDs and
> their counts/frequencies.
> <snip>
As for data types, a general rule of thumb would be that if you plan
to do some calculations with it (like addition, subtraction, etc.)
then it should be some sort of numeric type (int, double, decimal).
If you are only going to store it, then use a string. For example,
even though a phone number is all digits, you should use a string
since you don't do calculations with a phone number.

For your problem, I think your best bet would be to use a generic
dictionary. Use the ID as the key and the count as the value. Read
each value in the array and check to see if it already exists in the
dictionary. If it does, then just increment the count; otherwise, add
it to the dictionary.

Dim values() As Integer = {1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6}
Dim idCount As New Dictionary(Of Integer, Integer)()

For Each i As Integer In values          ' Check each value in the array
    If idCount.ContainsKey(i) Then       ' If the value already exists in the dictionary
        idCount(i) += 1                  ' increment the count for that value
    Else                                 ' otherwise
        idCount.Add(i, 1)                ' add the value to the dictionary
    End If
Next
Hope this helps

Chris
Nov 19 '07 #3
Thank you both very much. That was exactly the information I was
looking for. I am going to give it a shot now. Thanks again.
Nov 19 '07 #4
Hi,

Being a CF engineer, I'm always looking to optimize, so even though hashes,
dictionaries, etc. will work, the whole process is inefficient. Do you have a
known range of values? Let's say you know the values will be 0-1000:
define a struct with two fields, one being the value, the other being the
count. Create an array of those structs and off you go. Alternatively you
could use two separate arrays.

Again, this is for performance reasons and I'll bet it'll be way faster than
dictionaries etc. Of course, only do this extra work if you need the
performance and you're not just running a one-off process.

I get this feeling of pending flames - it always seems to happen when I
suggest more efficient techniques. :)

Hilton
<ca**********@gmail.com> wrote in message
news:89**********************************@d61g2000hsa.googlegroups.com...
>I need some sort of data type that will hold a list of IDs and
>their counts/frequencies.
<snip>

Nov 19 '07 #5
Peter,

>A dictionary implementation will be faster than your suggested
>implementation.

You misunderstood, probably because I never explained it well enough.

>If I understand your suggestion correctly, you are proposing creating an
>array that will be scanned searching for the index value, at which point
>the count can be updated.

Nope, no scanning whatsoever.

If you know you'll have say 1000 possible indices, then define an array of
1000 structs, then when you see a value of 605, you simply do
"array[605].count++" (i.e. your basic histogram code). I bet that is faster
than any dictionary implementation. In fact, you don't even need the struct
if most of the work is 'reading' the value and the sorting doesn't need to
be 'high-performance' - then you can define an array of ints and not structs
and spend a little more time doing the sorting. To decide, we'd need to
understand more about the nature of the data.
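
To make that concrete, a minimal sketch of the direct-indexing idea
(assuming the incoming values are in an int[] called ids and are known
to fall in 0-999):

int[] counts = new int[1000];
foreach (int id in ids)
{
    counts[id]++;   // the id itself is the array index - no hashing, no lookup
}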

Hilton
Nov 20 '07 #6
Err, isn't that VB? Correct me if I'm wrong, but the "Dim" was the
biggest clue, and the fact that the comments start with ', among other
things.

Just thought it was weird to see it in the csharp newsgroup :).

Regards,
Wingot

-----Original Message-----
From: Chris Dunaway [mailto:du******@gmail.com]
Posted At: Tuesday, 20 November 2007 1:14 AM
Posted To: microsoft.public.dotnet.languages.csharp
Conversation: what is the best datatype for..
Subject: Re: what is the best datatype for..

*snip*

For you problem, I think your best bet would be to use a generic
dictionary. Use the ID as the key and the value as the count. Read
each value in the array, check to see if it already exists in the
dictionary. If it does, then just increment the count, otherwise, add
it to the dictionary.

Dim values() As Integer = {1,4,6,23,6,2,54,1,6,23,2,4,1,6}
Dim idCount As New Dictionary(Of String, Integer)()

For Each i As Integer In values 'Check
each value in the array
If idCount.ContainsKey(i.ToString()) Then 'If the value
already exists in the dictionary
idCount(i.ToString) += 1
'increment the count for that value

Else
'otherwise
idCount.Add(i.ToString(), 1) 'add
the value to the dictionary
End If
Next
Hope this helps

Chris

Nov 20 '07 #7
On Nov 20, 12:56 am, "Hilton" <nos...@nospam.com> wrote:

<snip>
> Nope, no scanning whatsoever.
>
> If you know you'll have say 1000 possible indices, then define an array of
> 1000 structs, then when you see a value of 605, you simply do
> "array[605].count++" (i.e. your basic histogram code). I bet that is faster
> than any dictionary implementation.
That depends on how many samples there are. Suppose there's only one
sample, but a range of size 100,000 - with the dictionary approach,
you'd only have a single entry to sort, whereas with your approach
you'd have to sort 100,000 entries. Which of those do you think would
be quicker?

(Note that using a struct may well be slower than using a class, by
the way - when sorting, each swap would be swapping more data. I
wouldn't like to say without testing though.)

Personally I'd go for the simplest code first and only optimise when
it's shown to be necessary - but at that point I'd definitely be
benchmarking instead of guessing. I would *try* using a Dictionary to
collect the values, then converting it to a list and sorting the list.
That way you wouldn't pay a penalty for unused values, although
there's a higher constant factor due to more copying being involved.

Jon
Nov 20 '07 #8
On 2007-11-20 00:00:01 -0800, "Hilton" <no****@nospam.com> said:
> Dude, chill. The guy asked how to write a histogram, I said use an array of
> integers.
If that's _all_ you'd said, we'd not have a problem.

It's the other crap you insist on writing that causes the issues.

Stick to the facts and you won't get flamed.

Nov 20 '07 #9
As a semi-retraction... for reference, looking deeper into the code it
appears that Enumerable.GroupBy() uses a Lookup<TKey,TElement>
internally, which adds the elements to each grouping; as such, it
probably isn't suited to cases where you want to forget about the
contents and just keep the aggregates...

But nevertheless, worth knowing. Now to see if I can write a more
efficient LINQ aggregator ;-p
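
For reference, the stock Enumerable.GroupBy usage under discussion looks
like this in C# 3.0 (given a using System.Linq directive and the incoming
values in an int[] called ids; note the caveat above that the group
contents get buffered internally):

var query = ids.GroupBy(id => id)
               .Select(g => new { Id = g.Key, Count = g.Count() })
               .OrderByDescending(x => x.Count);

foreach (var x in query)
{
    Console.WriteLine("{0},{1}", x.Id, x.Count);
}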

Marc
Nov 20 '07 #10
On Nov 19, 8:33 pm, "Wingot" <win...@newsgroup.nospam> wrote:
> Err, isn't that VB? Correct me if I'm wrong, but the "Dim" was the
> biggest clue, and the fact that the comments start with ', among other
> things.
> <snip>
Oops! I had been looking at the VB group and obviously forgot where I
was! I'm too young to have a "senior moment"!

Chris
Nov 20 '07 #11
OK.... I'm not looking to get into a pissing match here, so this will be my
last post to this thread, but for the record, I'll correct just one of the
many incorrect things in your post.

You said: "Frankly, when you write stuff like "Being a CF engineer, I'm
always looking to optimize", you are REALLY insulting to those of us who
don't
write CF code. As if we aren't concerned with optimization as well."

Peter, this logic is so flawed that it is laughable. If a red car has
wheels, then a blue car doesn't? Secondly, it is pretty obvious that
running code on a small device with a computationally limited 200MHz CPU
and a tiny bus to slow memory, etc. requires some additional optimizations
that a dual-core desktop may not.

Well, there you go.

Hilton

"Peter Duniho" <Np*********@NnOwSlPiAnMk.comwrote in message
news:2007112002384575249-NpOeStPeAdM@NnOwSlPiAnMkcom...
On 2007-11-20 00:00:01 -0800, "Hilton" <no****@nospam.comsaid:
>Dude, chill. The guy asked how to write a histogram, I said use an array
of
integers.

If that's _all_ you'd said, we'd not have a problem.

It's the other crap you insist on writing that causes the issues.

Stick to the facts and you won't get flamed.

Nov 20 '07 #12
Jon,

>>Nope, no scanning whatsoever.
>>
>>If you know you'll have say 1000 possible indices, then define an array of
>>1000 structs, then when you see a value of 605, you simply do
>>"array[605].count++" (i.e. your basic histogram code). I bet that is
>>faster than any dictionary implementation.
>
>That depends on how many samples there are. Suppose there's only one
>sample, but a range of size 100,000 - with the dictionary approach,
>you'd only have a single entry to sort, whereas with your approach
>you'd have to sort 100,000 entries. Which of those do you think would
>be quicker?

I agree 100% - the nature of the data is significant. BTW: I wouldn't have
to sort 100,000; the algorithm could, for example, remove the unused counts
by creating another array, copying over only the used counts, and converting
from an int to a struct which would include the index.
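
For concreteness, a rough sketch of that compaction step (counts is the
raw int[] histogram from the array approach; IdCount is a hypothetical
value/count struct, not anything from the framework):

struct IdCount
{
    public readonly int Id;
    public readonly int Count;
    public IdCount(int id, int count) { Id = id; Count = count; }
}

// Copy only the ids that actually occurred, then sort by count, descending.
List<IdCount> used = new List<IdCount>();
for (int id = 0; id < counts.Length; id++)
{
    if (counts[id] > 0)
        used.Add(new IdCount(id, counts[id]));
}
used.Sort(delegate(IdCount a, IdCount b)
    { return b.Count.CompareTo(a.Count); });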

>(Note that using a struct may well be slower than using a class, by
>the way - when sorting, each swap would be swapping more data. I
>wouldn't like to say without testing though.)
The "int count, index" struct would have double the data to swap so you are
correct (instead of just a reference) - assuming that we use ints (again,
depends on the nature of the data - we might require bytes or longs).
However, since arrays of structs are contiguous, cache misses would/might be
reduced and that might end up being significant (or not). On the flip side,
creating (and disposing) the objectes incurs a penalty. There are several
'exercises for the reader' here; e.g. how much more expensive are objects
than structs during creation and swapping, how much penalty is incurred
during boxing and unboxing (if required), faster to swap structs or
references etc. Another thing, seems like accessing a x[y].z should be
faster when using structs than objects (I think).

>Personally I'd go for the simplest code first and only optimise when
>it's shown to be necessary - but at that point I'd definitely be
>benchmarking instead of guessing. I would *try* using a Dictionary to
>collect the values, then converting it to a list and sorting the list.
>That way you wouldn't pay a penalty for unused values, although
>there's a higher constant factor due to more copying being involved.
I think it really all depends on the nature of the data and the nature of
the app. With large amounts of data (e.g. large bitmap or network streams)
I would definitely favor the array of ints, but the techniques you suggest
definitely should be part of the same toolkit to solve the problem.

Jon, do you have any URLs with information on the internals of
"Dictionary<int, int>"? I'm particularly interested in boxing, unboxing,
casting, hashing, etc. Performance info would be great too.

Thanks,

Hilton
Nov 20 '07 #13
Hilton <no****@nospam.com> wrote:

>>>If you know you'll have say 1000 possible indices, then define an array of
>>>1000 structs, then when you see a value of 605, you simply do
>>>"array[605].count++" (i.e. your basic histogram code). I bet that is
>>>faster than any dictionary implementation.
>>
>>That depends on how many samples there are. Suppose there's only one
>>sample, but a range of size 100,000 - with the dictionary approach,
>>you'd only have a single entry to sort, whereas with your approach
>>you'd have to sort 100,000 entries. Which of those do you think would
>>be quicker?
>
>I agree 100% - the nature of the data is significant. BTW: I wouldn't have
>to sort 100,000; the algorithm could, for example, remove the unused counts
>by creating another array, copying over only the used counts, and converting
>from an int to a struct which would include the index.
Removing unused counts is still O(R) where R is the theoretical range -
that may be *much* larger than the set of actual numbers used.
>>(Note that using a struct may well be slower than using a class, by
>>the way - when sorting, each swap would be swapping more data. I
>>wouldn't like to say without testing though.)
>
>The "int count, index" struct would have double the data to swap, so you are
>correct (instead of just a reference) <snip> Another thing: it seems like
>accessing x[y].z should be faster when using structs than objects (I think).
Possibly. My bigger point was that it's not a good idea to claim
performance benefits without testing, which is what you were doing.
Optimization is a very subtle art - and a time consuming exercise,
which should only be attempted where really needed.
>>Personally I'd go for the simplest code first and only optimise when
>>it's shown to be necessary - but at that point I'd definitely be
>>benchmarking instead of guessing. <snip>
>
>I think it really all depends on the nature of the data and the nature of
>the app. With large amounts of data (e.g. large bitmap or network streams)
>I would definitely favor the array of ints, but the techniques you suggest
>definitely should be part of the same toolkit to solve the problem.
Why would you "definitely favour the array of ints"? You've already
suggested that with a range of size 100,000 you wouldn't use an array -
that could be the case with large streams.

In particular, would you "definitely favour the array of ints" without
benchmarking? If so, that's foolish.
>Jon, do you have any URLs with information on the internals of
>"Dictionary<int, int>"? I'm particularly interested in boxing, unboxing,
>casting, hashing, etc. Performance info would be great too.
There'd be no boxing involved, due to the way that generics are handled
by the JIT. General performance info is given in MSDN - lookup on a
Dictionary<K,V> approaches an O(1) operation; adding is O(1) unless
extra capacity is required, in which case it becomes O(n).

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Nov 20 '07 #14
"Jon Skeet [C# MVP]" <sk***@pobox.comwrites:
On Nov 20, 12:56 am, "Hilton" <nos...@nospam.comwrote:

<snip>
>Nope, no scanning whatsoever.

If you know you'll have say 1000 possible indices, then define an array of
1000 structs, then when you see a value of 605, you simply do
"array[605].count++" (i.e. your basic histogram code). I bet that is faster
than any dictionary implementation.

That depends on how many samples there are. Suppose there's only one
sample, but a range of size 100,000
Just for the OP's reference (and some Google keywords), this is a
standard tradeoff, and it depends on whether the data is "dense" or
"sparse". For dense data the array approach is generally more
efficient; for sparse data the dictionary is likely to be best. But
of course, performance specifics will depend on a lot of factors.

Since both ways are fairly simple, if performance matters it would be
straightforward to write it both ways and benchmark, as Jon suggests.
It would also give some useful data for the next time you have to make
this tradeoff.
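
A bare-bones skeleton for such a benchmark might look like the following
(the two Count* methods and MakeData are hypothetical stand-ins for the
implementations sketched earlier; timing uses System.Diagnostics.Stopwatch):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class HistogramBenchmark
{
    static void CountWithArray(int[] data, int range)
    {
        int[] counts = new int[range];
        foreach (int i in data)
            counts[i]++;
    }

    static void CountWithDictionary(int[] data)
    {
        Dictionary<int, int> counts = new Dictionary<int, int>();
        foreach (int i in data)
        {
            int n;
            counts.TryGetValue(i, out n);
            counts[i] = n + 1;
        }
    }

    static int[] MakeData(int samples, int range)
    {
        Random rng = new Random(12345);   // fixed seed for repeatable runs
        int[] data = new int[samples];
        for (int i = 0; i < data.Length; i++)
            data[i] = rng.Next(range);
        return data;
    }

    static void Main()
    {
        const int Range = 100000;
        int[] dense = MakeData(1000000, 256);   // many samples, tiny range
        int[] sparse = MakeData(100, Range);    // few samples, huge range

        foreach (int[] data in new int[][] { dense, sparse })
        {
            Stopwatch sw = Stopwatch.StartNew();
            CountWithArray(data, Range);
            Console.WriteLine("array:      {0} ticks", sw.ElapsedTicks);

            sw = Stopwatch.StartNew();
            CountWithDictionary(data);
            Console.WriteLine("dictionary: {0} ticks", sw.ElapsedTicks);
        }
    }
}

The dense run should favour the array; the sparse run makes the up-front
cost of the 100,000-slot array visible.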

Good luck!

----Scott.
Nov 20 '07 #15
Jon wrote:
>>>Personally I'd go for the simplest code first and only optimise when
>>>it's shown to be necessary - but at that point I'd definitely be
>>>benchmarking instead of guessing. I would *try* using a Dictionary to
>>>collect the values, then converting it to a list and sorting the list.
>>>That way you wouldn't pay a penalty for unused values, although
>>>there's a higher constant factor due to more copying being involved.
>>
>>I think it really all depends on the nature of the data and the nature of
>>the app. With large amounts of data (e.g. large bitmap or network streams)
>>I would definitely favor the array of ints, but the techniques you suggest
>>definitely should be part of the same toolkit to solve the problem.
>
>Why would you "definitely favour the array of ints"? You've already
>suggested that with a range of size 100,000 you wouldn't use an array -
>that could be the case with large streams.
>
>In particular, would you "definitely favour the array of ints" without
>benchmarking? If so, that's foolish.
You're combining two of my comments into one. When I refer to 'range size',
I mean: are all the values 0-255, or 0-65535, etc.? So a huge network stream
will only have a range size of 0-255 if you look at the individual bytes.
The same goes for calculating a histogram of an 8-bit indexed image, e.g. GIF
or PNG8, which is a very common operation - look at (almost) any image
application. Do you think they use an array or a dictionary? If I had a huge
input stream and was only looking at bytes (i.e. a range of 0-255), would I
"definitely favour the array of ints"? Absolutely. Would using a
dictionary be faster than "x[y]++" where x is an array of ints? And which
would be clearer to read? Jon, Peter, take that as my throwing down a
challenge. :)

Hilton
Nov 20 '07 #16
On 2007-11-20 15:14:02 -0800, "Hilton" <no****@nospam.com> said:
>You're combining two of my comments into one. When I refer to 'range size',
>I mean: are all the values 0-255, or 0-65535, etc.? <snip> If I had a huge
>input stream and was only looking at bytes (i.e. a range of 0-255), would I
>"definitely favour the array of ints"? Absolutely. Would using a
>dictionary be faster than "x[y]++" where x is an array of ints? And which
>would be clearer to read? Jon, Peter, take that as my throwing down a
>challenge. :)
Challenge? Looks more like a straw man to me.

Jon's point appears to me to be that you wrote "with large amounts of
data I would definitely favor the array of ints", but there's really
nothing about the amount of data that predisposes it to being better
solved by an array of ints when that amount is large.

He's not combining two comments. The single statement is by itself
flawed. Even if you have a large amount of data, if the range is large
and the population within that range is sparse, an array of ints is not
likely to be a good solution.

I have already agreed that in many situations, I would prefer an array
of ints for its simplicity. Raising that as an arguing point doesn't
make any sense; we're already in agreement there.

The point of contention, which you seem to keep overlooking, is that
your original claim was that an array of ints is efficient while a
dictionary is NOT efficient. The claim is wrong twice: first, because
the dictionary isn't inefficient at all even in cases where it's slower
than the array of ints, and second because there are situations in
which a dictionary solution is actually more efficient than the array
of ints.

As far as readability goes, which is more readable depends on context.
While I think that the array of ints is likely to be the simpler, more
clear implementation in most cases, that doesn't mean it would be in
all cases. For example, using a SortedDictionary produces very simple
code IMHO because you never even have to have a separate section of
code to sort the data. Even a Dictionary may be found to be simpler
because of the ability to do compile-time type checking, allowing for
it to be used in a simpler way elsewhere.

Context is everything, and there's not enough context to claim a priori
that the array of ints is in fact the clearer implementation.

But besides, your contention has been that the array solution is
superior due to _performance_, not readability. And on that basis,
your claim has no firm foundation.

Pete

Nov 20 '07 #17
Hilton <no****@nospam.com> wrote:

>>>I think it really all depends on the nature of the data and the nature of
>>>the app. With large amounts of data (e.g. large bitmap or network streams)
>>>I would definitely favor the array of ints, but the techniques you suggest
>>>definitely should be part of the same toolkit to solve the problem.
>>
>>Why would you "definitely favour the array of ints"? You've already
>>suggested that with a range of size 100,000 you wouldn't use an array -
>>that could be the case with large streams.
>>
>>In particular, would you "definitely favour the array of ints" without
>>benchmarking? If so, that's foolish.
>
>You're combining two of my comments into one.
Um, you said in a single post:

"With large amounts of data (e.g. large bitmap or network streams) I
would definitely favour the array of ints".

How is that combining two comments?
>When I refer to 'range size',
>I mean: are all the values 0-255, or 0-65535, etc.? So a huge network stream
>will only have a range size of 0-255 if you look at the individual bytes.
You never suggested that "huge network stream" implies "we're only
examining a single byte at a time". Why can't a network stream be a
stream of longs, for instance?
>The same goes for calculating a histogram of an 8-bit indexed image, e.g. GIF
>or PNG8, which is a very common operation - look at (almost) any image
>application. Do you think they use an array or a dictionary? If I had a huge
>input stream and was only looking at bytes (i.e. a range of 0-255), would I
>"definitely favour the array of ints"? Absolutely. Would using a
>dictionary be faster than "x[y]++" where x is an array of ints? And which
>would be clearer to read? Jon, Peter, take that as my throwing down a
>challenge. :)
In the case of only 256 entries, I'd probably go with the array as
well. That's not always the case with large network streams, nor with
images (which may well *not* be 8-bit indexed).

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Nov 20 '07 #18
Jon,

Yet another thread where being in the same room would resolve it in 2
minutes, but NG posts result in ambiguity, etc. I agree I wasn't specific
enough about the large network stream; that's why I clarified it. You and I
are agreeing here that small range, lots of data would imply array-of-int
usage; sparse and/or big-range data would imply a Dictionary or other
non-array solution. And as we both keep saying/implying, the nature of the
data ultimately determines the course of action.

Cheers,

Hilton
Nov 21 '07 #19
"Hilton" <no****@nospam.comwrote
Again, this is for performance reasons and I'll bet it'll be way faster
than dictionaries etc. Of course, only do this extra work if you need the
performance and you're not just running a one-off process.
Well, as both Jon and Peter have pointed out, your array of ints has some
potential problems. It's not a clear-cut "this is better across the board"
type of decision. It's likely better in many cases, but not all. Sadly, that
always seems to be the case...

Here's one more piece of fuel for the fire:
If you allocate a large array of ints (anything over 85,000 bytes, i.e.
roughly 21,000 ints), it's going to be allocated on the Large Object Heap.
This may have some long-term implications for your process, as the LOH is
never compacted. In the limited memory environment of a mobile platform,
this impact may be much larger than it would be on a desktop system. For
long-term stability, something more linked-list based may be better, as it
won't have the appearance of leaking memory...
>I get this feeling of pending flames - it always seems to happen when I
>suggest more efficient techniques. :)
Ya know, the most efficient technique would be to scan the values and see if
you can further compact them. You should run something like the PunyCode
algorithm over them, so as to guarantee that all of your values are in a
default range. Then you can fit, on average, 2 values into the space that
was taken by 1. You could then (in memory) ZIP the punycoded array, and
compact it by a factor of 5:1. This would give you an overall memory saving
of nearly 10:1.

For further savings, you could dump the values to the screen, and ask the
end-user to remember them for a brief amount of time. Then run a sliding
visual window through, and have the user enter the values. This would let
you completely eliminate the array from memory, thereby significantly
decreasing your hardware specs and opening up your target market.

For long-term storage, you should use the built-in IR channel to dump the
array to a printer in a 3D barcode format. When you need to recover the
data, use the built-in camera on the phone as a scanner, and voila, you have
reduced the footprint even more.

If you're really brave, see if you can set up a harmonic based on the 802.11
receivers built into many of the Windows Mobile devices. With proper
encoding, you could offload everything to the wireless....

(It's a slow day. Honestly though, there is no cut-and-dried answer.... :) )

--
Chris Mullins
Nov 21 '07 #20
-----Original Message-----
*snip*
>For further savings, you could dump the values to the screen, and ask the
>end-user to remember them for a brief amount of time. Then run a sliding
>visual window through, and have the user enter the values. This would let
>you completely eliminate the array from memory, thereby significantly
>decreasing your hardware specs and opening up your target market.
>
>(It's a slow day. Honestly though, there is no cut-and-dried answer.... :) )
>
>--
>Chris Mullins
Nice one Chris, I love that idea. :)

You could even pretend it's a game for the user to test his memory :P.

Nov 21 '07 #21
On 2007-11-20 16:20:42 -0800, "Hilton" <no****@nospam.com> said:
>Jon,
>
>Yet another thread where being in the same room would resolve it in 2
>minutes, but NG posts result in ambiguity, etc.
I disagree that that's the root of the disagreement here. Hopefully
you can handle the recursion. :)
>I agree I wasn't specific
>enough about the large network stream; that's why I clarified it.
No, you accused Jon of "combining two of [your] comments into one".
That's not clarification, that's finger-pointing.
>You and I are
>agreeing here that small range, lots of data would imply array-of-int usage;
I'm not sure that Jon would agree that "lots of data" has anything to
do with it. I know I don't. The amount of data is essentially
irrelevant; the distribution of the data is what matters, along with
code maintainability (something that is almost always in direct
conflict with squeezing the last few percentage points of performance
out of some code).
>sparse and/or big-range data would imply a Dictionary or other non-array
>solution. And as we both keep saying/implying, the nature of the data
>ultimately determines the course of action.
No disagreement there. But you continue to refuse to acknowledge and
retract your false statement at the outset, claiming that using a
dictionary implementation would be inefficient. You've simply shifted
your stance to a point where you can claim a lack of disagreement,
silently abandoning your original premise.

There are some scenarios in which an array implementation would be
slightly more efficient (not "way faster" as you wrote), but that's not
a given and in no case would a dictionary implementation actually be
considered _inefficient_.

Pete

Nov 21 '07 #22
Chris wrote:
>Hilton wrote:
>>Again, this is for performance reasons and I'll bet it'll be way faster
>>than dictionaries etc. Of course, only do this extra work if you need
>>the performance and you're not just running a one-off process.
>
>Well, as both Jon and Peter have pointed out, your array of ints has some
>potential problems. It's not a clear-cut "this is better across the board"
>type of decision. It's likely better in many cases, but not all. Sadly,
>that always seems to be the case...
Chris, I totally agree (as mentioned in previous posts). If you look at the
OP, the numbers mentioned were: 1,4,6,23,6,2,54,1,6,23,2,4,1,6, hence my
suggestion for using the array.

I think that if there was always one best solution, our skill/art/work
wouldn't be as challenging and rewarding. :)

Hilton
Nov 23 '07 #23
(I'm assuming that this is a single-pass counter that discards the
elements and keeps the aggregates)

The group scenario is such a common case that an *efficient*
aggregator probably is worthy of a place in MiscUtil. The more general
case (multiple concurrent aggregates) gets very scrappy very quickly,
but a single-purpose "does one thing well" is a good thing. And IMO
perhaps a very good educational aid for people wanting to understand
LINQ-to-objects. As for the names - it stumped me too ;-p I didn't
stress over it...

It almost seems a shame that this common case didn't make it into the
base-libs so that the same code could port to the database too... but
I guess it helps to keep things simple.

Now... to get some sleep...

Marc

Nov 24 '07 #24
Marc Gravell <ma**********@gmail.com> wrote:
>(I'm assuming that this is a single-pass counter that discards the
>elements and keeps the aggregates)
Yup.
>The group scenario is such a common case that an *efficient*
>aggregator probably is worthy of a place in MiscUtil. The more general
>case (multiple concurrent aggregates) gets very scrappy very quickly,
>but a single-purpose "does one thing well" is a good thing.
Right. I may get round to doing sum/min/max etc at some point. Not just
yet though :)
>And IMO perhaps a very good educational aid for people wanting to understand
>LINQ-to-objects. As for the names - it stumped me too ;-p I didn't
>stress over it...
:)
>It almost seems a shame that this common case didn't make it into the
>base-libs so that the same code could port to the database too... but
>I guess it helps to keep things simple.
Indeed. Another thing that I'm rather amazed isn't in the framework is
a way of reading a file, line by line, as an IEnumerable<string>. It's
simple to implement, but *so* handy. Again, it'll be in MiscUtil next
time I do a release (the code's there already, and is more flexible
than just files, of course).
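
(For reference, a minimal version of that idea - an iterator block over a
StreamReader, not the actual MiscUtil code - would be something like the
following, given the usual System.IO and System.Collections.Generic usings:

static IEnumerable<string> ReadLines(string path)
{
    using (StreamReader reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            yield return line;
    }
}

The using statement means the file gets closed even if the caller stops
iterating early.)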

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Nov 24 '07 #25
First off - apologies to the group; this is turning into a bit more
free-thought/discussion than question/answer. As such, it occurs that
I should really start blogging this stuff instead (as a more
appropriate forum). If anybody can recommend an appropriate blog site,
I'd be interested...

I looked into the Expression stuff above, and upon consideration I
believe it is a pretty good solution for that old chestnut: how to do
maths with generics. The following works, and is easily extended to
other operators... the generated delegate is fully cached etc...

This has some application to the LINQ / grouping scenario above, but I
suspect it has a lot *more* uses, and it looks like it should
automatically support "JoesRandomStruct" etc...

Marc

using System;
using System.Linq.Expressions;

namespace ConsoleApplication1 {
    class Program {
        static void Main() {
            var data = new[] { 1.2, 3.1, 2.9 };
            Test(data);
        }
        static void Test<T>(T[] args) {
            T sum = default(T);
            int count = 0;
            foreach (T t in args) {
                sum = MathOperator<T>.Add(sum, t);
                count++;
            }
            T avg = MathOperator<T>.DivideInt32(sum, count);
        }
    }
}

static class MathOperator<T> {
    private static Func<T, T, T> add, divide;
    private static Func<T, int, T> divideInt32;

    static Func<TLhs, TRhs, TReturn> BuildBinary<TLhs, TRhs, TReturn>(
            Func<Expression, Expression, BinaryExpression> method) {
        ParameterExpression lhs = Expression.Parameter(typeof(TLhs), "lhs"),
            rhs = Expression.Parameter(typeof(TRhs), "rhs");
        try {
            return Expression.Lambda<Func<TLhs, TRhs, TReturn>>(
                method(lhs, rhs), lhs, rhs).Compile();
        } catch (InvalidOperationException) {
            // try casting rhs to lhs
            Expression castRhs = Expression.Convert(rhs, typeof(TLhs));
            return Expression.Lambda<Func<TLhs, TRhs, TReturn>>(
                method(lhs, castRhs), lhs, rhs).Compile();
        }
    }
    public static T Add(T arg1, T arg2) {
        if (add == null) {
            add = BuildBinary<T, T, T>(Expression.Add);
        }
        return add(arg1, arg2);
    }
    public static T Divide(T arg1, T arg2) {
        if (divide == null) {
            divide = BuildBinary<T, T, T>(Expression.Divide);
        }
        return divide(arg1, arg2);
    }
    public static T DivideInt32(T arg1, int arg2) {
        if (divideInt32 == null) {
            divideInt32 = BuildBinary<T, int, T>(Expression.Divide);
        }
        return divideInt32(arg1, arg2);
    }
}
Nov 24 '07 #26
Marc Gravell <ma**********@gmail.com> wrote:
>First off - apologies to the group; this is turning into a bit more
>free-thought/discussion than question/answer.
Fine by me - although I've got nothing to add to the particular
suggestion you've presented here.

What I've been thinking about is a sort of "Split" method which could
take a further query expression as its parameter, and apply that query
expression to each group within a set of results. That way we wouldn't
need to write "group and count", "group and sum" etc - just split,
passing count, sum or whatever as the parameter.

Unfortunately, I can't get this to work in my head with IEnumerable<T>
without creating one thread per group - we get into stack difficulties,
effectively. (Or you have to read the whole lot and cache it first,
which is what we're trying to avoid.) It may not be possible, but that
would certainly be a pity... it's *logically* feasible, but LINQ
doesn't help us as much as we might want.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Nov 24 '07 #27
Marc Gravell <ma**********@gmail.com> wrote:

<snip>
>but unless I've misunderstood your post we are
>thinking along the same lines.
<snip>

I don't *think* so. As far as I can see, your aggregation works on a
single value at a time, so you can't use the existing extension
methods. I was thinking of something like:

var query = data.Split(x => x.Count());

foreach (var group in query)
{
    Console.WriteLine(group.Key + ": " + group.Result);
}

"x" in the first part would either be an IEnumerable<T> of the same
kind as data, or possibly IGrouping<T> to allow the lower levels to
access the key if they wanted - but Count() is the standard
Enumerable.Count().

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Nov 25 '07 #28
As an aside - with some luck it could also elect *not* to start a new
thread until the group's stack is full and needs pumping... then at the
end the primary thread could mop up any groups without threads (or
even share them among the threads that *do* exist, but this is more
complex).

That way, you don't get idling threads for things that you only see
once or twice...

Marc
Nov 25 '07 #29
Marc Gravell <ma**********@gmail.com> wrote:
>As an aside - with some luck it could also elect *not* to start a new
>thread until the group's stack is full and needs pumping... then at the
>end the primary thread could mop up any groups without threads (or
>even share them among the threads that *do* exist, but this is more
>complex).

>That way, you don't get idling threads for things that you only see
>once or twice...
That's an interesting idea - I suspect it would make the implementation
pretty nasty though.

I think I'll have a go at implementing it the naive "thread per group"
way, for the sake of interest, and write a blog post about it... when
I've time, of course.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Nov 25 '07 #30
