Bytes IT Community

Comment on how to uniquely track your objects in C# / hash tables / GetHashCode

A quick sanity check, and I think I am correct, but just to make
sure: if you have a bunch of objects that are very much like one
another, you can uniquely track them simply by using an ArrayList or
Array, correct? An example: create the object, create an array, then
stuff the object into the array. Later on, assume the object is
mutable; the object changes, but you can still find it (and delete it,
etc.) if you have enough state information to uniquely identify it,
which ideally is stored on the object itself. But to find the object
in the array or array list, you must traverse the entire list with
foreach: worst case O(n) in big-O notation, right?

Now this method is foolproof but slow, no? Especially if you're going
to be sorting and doing lots of finding. So for this reason, we use
hash tables to speed up searching and sorting of objects, and to
"find" or "contains" objects, yes?

But, as pointed out in the links below, hash functions can be non-
unique, or give collisions, with two different objects having the same
hash (rare, but it happens). Furthermore, you have to write your own
GetHashCode, such as if you override the Equals implementation for
objects stored in an ArrayList, for example.

I've seen examples of overriding GetHashCode and understand how it
works (kind of; at least it made sense), but my question/comment is
that if you're not concerned with performance, then a simple index-
based retrieval scheme is best.

Correct me if I'm wrong.

RL

http://www.interact-sw.co.uk/iangblo...21/gethashcode (good
rant on how hash is not unique)

http://msdn.microsoft.com/en-us/libr...thashcode.aspx
(how to overload gethash and other stuff for your classes, since you
cannot rely on the default implementations)
Aug 4 '08 #1
23 Replies


raylopez99 wrote:
A quick sanity check, and I think I am correct, but just to make
sure: if you have a bunch of objects that are very much like one
another, you can uniquely track them simply by using an ArrayList or
Array, correct? An example: create the object, create an array, then
stuff the object into the array. Later on, assume the object is
mutable; the object changes, but you can still find it (and delete it,
etc.) if you have enough state information to uniquely identify it,
which ideally is stored on the object itself. But to find the object
in the array or array list, you must traverse the entire list with
foreach: worst case O(n) in big-O notation, right?

Now this method is foolproof but slow, no? Especially if you're going
to be sorting and doing lots of finding. So for this reason, we use
hash tables to speed up searching and sorting of objects, and to
"find" or "contains" objects, yes?
Yes.
But, as pointed out in the links below, hash functions can be non-
unique, or give collisions, with two different objects having the same
hash (rare, but it happens). Furthermore, you have to write your own
GetHashCode, such as if you override the Equals implementation for
objects stored in an ArrayList, for example.

I've seen examples of overriding GetHashCode and understand how it
works (kind of; at least it made sense), but my question/comment is
that if you're not concerned with performance, then a simple index-
based retrieval scheme is best.

Correct me if I'm wrong.
If you had to write your own Hashtable or Dictionary<> class, then
there would be a point.

But writing a GetHashCode method is not that difficult, so if you
need to look up by something, I would recommend Dictionary<>
anyway.

Arne
Aug 4 '08 #2
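A minimal sketch of the Dictionary<> approach Arne recommends. The RectKey type, its fields, and the hash formula are illustrative assumptions (not from the thread); the GetHashCode/Equals pairing is the standard pattern:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical immutable key: look objects up by their identifying state.
class RectKey
{
    public readonly string Name;
    public readonly int Width, Height;

    public RectKey(string name, int width, int height)
    {
        Name = name; Width = width; Height = height;
    }

    public override bool Equals(object obj)
    {
        RectKey other = obj as RectKey;
        return other != null && Name == other.Name
            && Width == other.Width && Height == other.Height;
    }

    // Combine the fields that define equality. Collisions are allowed:
    // the dictionary falls back on Equals to resolve them.
    public override int GetHashCode()
    {
        int h = 17;
        h = h * 31 + Name.GetHashCode();
        h = h * 31 + Width;
        h = h * 31 + Height;
        return h;
    }
}

class Demo
{
    static void Main()
    {
        var table = new Dictionary<RectKey, string>();
        table[new RectKey("a", 2, 3)] = "first";

        // A *different* instance with the same state finds the same entry.
        Console.WriteLine(table[new RectKey("a", 2, 3)]); // prints "first"
    }
}
```

The dictionary hashes the key to pick a bucket, then calls Equals to confirm the match, which is why two distinct instances with the same state retrieve the same entry.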

Also be aware that the fact you want to mutate your object can cause
problems in the dictionary/whatever container, if the fields used to
calculate the hash are mutable themselves.

For instance, say your object is a class with only a string in it, and
the hashcode returned by the object is the string's hashcode. Store
that object in the dictionary, change the string, and your object is
lost in the dictionary (that is, you cannot find it anymore, as the
lookup is based on the new hashcode).
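That failure mode can be reproduced in a few lines. The Name wrapper class here is a hypothetical stand-in for "a class with only a string in it":

```csharp
using System;
using System.Collections.Generic;

// A deliberately bad key: its hash depends on a mutable field.
class Name
{
    public string Value;
    public Name(string value) { Value = value; }
    public override int GetHashCode() { return Value.GetHashCode(); }
    public override bool Equals(object o)
    {
        Name other = o as Name;
        return other != null && Value == other.Value;
    }
}

class Demo
{
    static void Main()
    {
        var table = new Dictionary<Name, int>();
        Name key = new Name("alice");
        table[key] = 1;

        key.Value = "bob"; // mutate a field used in the hash...

        // ...and the entry is stranded: it sits in the bucket for the
        // old hash, so neither the old state nor the new one finds it.
        Console.WriteLine(table.ContainsKey(key));               // False
        Console.WriteLine(table.ContainsKey(new Name("alice"))); // False
    }
}
```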

In short, all the fields involved in calculating the hashcode must be
immutable (that's why the .NET framework uses the internal reference
number of the object: it's not ideal in terms of spreading across the
integer range, but it's immutable).

If all of your fields can mutate, then you're pretty much stuck with
the ArrayList solution, and the manual lookups, and the O(n). But if
you're not concerned with performance, then go ahead!

Michel
On Aug 5, 01:18, Arne Vajhøj <a...@vajhoej.dk> wrote:
raylopez99 wrote:
A quick sanity check, and I think I am correct, but just to make
sure: if you have a bunch of objects that are very much like one
another, you can uniquely track them simply by using an ArrayList or
Array, correct? An example: create the object, create an array, then
stuff the object into the array. Later on, assume the object is
mutable; the object changes, but you can still find it (and delete it,
etc.) if you have enough state information to uniquely identify it,
which ideally is stored on the object itself. But to find the object
in the array or array list, you must traverse the entire list with
foreach: worst case O(n) in big-O notation, right?
Now this method is foolproof but slow, no? Especially if you're going
to be sorting and doing lots of finding. So for this reason, we use
hash tables to speed up searching and sorting of objects, and to
"find" or "contains" objects, yes?

Yes.
But, as pointed out in the links below, hash functions can be non-
unique, or give collisions, with two different objects having the same
hash (rare, but it happens). Furthermore, you have to write your own
GetHashCode, such as if you override the Equals implementation for
objects stored in an ArrayList, for example.
I've seen examples of overriding GetHashCode and understand how it
works (kind of; at least it made sense), but my question/comment is
that if you're not concerned with performance, then a simple index-
based retrieval scheme is best.
Correct me if I'm wrong.

If you had to write your own Hashtable or Dictionary<> class, then
there would be a point.

But writing a GetHashCode method is not that difficult, so if you
need to look up by something, I would recommend Dictionary<>
anyway.

Arne
Aug 5 '08 #3

On Aug 5, 2:25 am, raylopez99 <raylope...@yahoo.com> wrote:
A quick sanity check, and I think I am correct, but just to make
sure: if you have a bunch of objects that are very much like one
another you can uniquely track them simply by using an ArrayList or
Array, correct?
You can uniquely track an instance of any reference type, no matter
what data is inside, because it has identity, which can be compared
against by using Object.ReferenceEquals(). It is this identity that
serves as a basis for default implementations of GetHashCode() and
Equals(), and more often than not it is good enough.
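For example (the Blob class here is an arbitrary illustration, not from the thread):

```csharp
using System;

class Blob { public int Data; }

class Demo
{
    static void Main()
    {
        Blob a = new Blob { Data = 42 };
        Blob b = new Blob { Data = 42 }; // same content, distinct identity
        Blob c = a;                      // second reference to the same object

        Console.WriteLine(Object.ReferenceEquals(a, b)); // False
        Console.WriteLine(Object.ReferenceEquals(a, c)); // True

        a.Data = 99; // mutating the object does not change its identity
        Console.WriteLine(Object.ReferenceEquals(a, c)); // True
    }
}
```

Identity survives mutation, which is exactly what makes it usable for tracking mutable objects.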
But, as pointed out in the links below, hash functions can be non-
unique, or give collisions, with two different objects having the same
hash (rare, but it happens),
This is largely irrelevant to the issue of performance, since hash
tables work correctly even with collisions - just slower. You're still
guaranteed to have O(1) on average on random input, though
pathological cases can become O(n).
I've seen examples of overriding GetHashCode and understand how it
works (kind of; at least it made sense), but my question/comment is
that if you're not concerned with performance, then a simple index-
based retrieval scheme is best.
Maintaining a hash table also has some additional overhead, both on
insertion and on retrieval (computing hashes, balancing buckets etc).
In practice, particularly for small collections (roughly ~100 items),
where both insertions and lookups happen frequently, a plain list with
linear search can be more efficient than a hashtable both in terms of
speed and memory. For larger stuff, hashtable is worth considering -
though even then, if inserts are frequent and lookups are very rare, a
list is probably still better (but it is not a usual case).

On the other hand, if you fill the collection once, and then just do
lookups, never modifying it, then the most efficient implementation is
pre-sorting the collection and doing binary search on it
(Array.BinarySearch or List<T>.BinarySearch, as needed - but if a
collection is immutable, might as well make it an array for the slight
performance increase that gives).
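A sketch of that fill-once-then-search pattern (the sample data is arbitrary):

```csharp
using System;
using System.Collections.Generic;

class Demo
{
    static void Main()
    {
        // Fill once, sort once, then do O(log n) lookups.
        List<int> ids = new List<int> { 42, 7, 19, 3, 88 };
        ids.Sort(); // now 3, 7, 19, 42, 88

        int found = ids.BinarySearch(19);
        Console.WriteLine(found); // 2: the index of 19 in the sorted list

        // A miss returns the bitwise complement of the insertion point.
        int missing = ids.BinarySearch(20);
        Console.WriteLine(missing < 0); // True
        Console.WriteLine(~missing);    // 3: index where 20 would be inserted
    }
}
```

BinarySearch only works on a sorted collection, which is why this is a good fit only when the collection is filled once and then left unmodified.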
Aug 5 '08 #4

On Aug 4, 4:18 pm, Arne Vajhøj <a...@vajhoej.dk> wrote:
raylopez99 wrote:
If you had to write your own Hashtable or Dictionary<> class, then
there would be a point.

But writing a GetHashCode method is not that difficult, so if you
need to look up by something, I would recommend Dictionary<>
anyway.

Arne
Yes, I have come to the same conclusion. For example, if you have a
bunch of rectangles, which are commonly used in Windows programming
for identifying areas, with the same Size, all stored in an array as
part of a class variable MyRectangleClass, and you want to search to
find and replace one, if you use Find you'll get the first rectangle,
which might not be the object (of MyRectangleClass) that you want. So
the solution is to simply add a GetHashCode to the lookup, and if the
object in the ArrayList is the same, then you 'know' you found the
same variable object. As Pavel Minaev points out, the
Object.ReferenceEquals() method also implicitly uses GetHashCode.

But, here is a philosophical problem to consider: how often does the
IL engine (sorry in advance for any wrong terminology) assign the same
hash code id to two different objects? Very, very rare, but it
happens. Then, when this happens, you'll get an error when you use
GetHashCode or Object.ReferenceEquals(), no? Has anybody thought
about this problem? I suppose there is no way around it. At first,
when first confronted with this problem, I think a solution is that
you can write an override for "Equals" for ArrayList, and a 'try/
catch' block so that if you get a match between hashcodes for two
different objects (rectangles) that do not match in Size, you can
'catch' the error and assign equality anyway. But, the more I think
about it, the less sure I am this solution is good. First off, it
must be true that hashcode collisions are very rare, probably less
than 0.000001% (guessing). Some blogs had some numbers, and they were
small after several million iterations. But, suppose you do find an
error as above--then what? Your ArrayList of MyRectangleClass
variables cannot be searched for the 'true' or 'real' rectangle,
since, if it is very large, it is likely to have several variables
with rectangles that match the rectangle you are comparing them to.
Which one do you pick? So, while try/catch will allow you to spot
errors returned by GetHashCode, it won't give a solution to actually
finding the object you are looking for.

A similar problem comes up with database design, which I dabbled with
last year. They use GUIDs (basically hashcode IDs it seems) to
uniquely identify a row in a table. At the time I thought GUIDs were
100% unique--and nobody in the database world suggested otherwise,
though perhaps I didn't read enough--but some database gurus did not
like, for database design purposes (so-called 5 Rules of dB
Normalization), using GUIDs since they are not a good identifier for
Normalization purposes. Then the problem becomes--how do I uniquely
identify a row, if you cannot use a GUID? You have to pick a
combination of values from the row, to identify that row. In the case
of the MyRectanglesClass example, you can perhaps have other stuff to
identify the rectangles, such as the position on the screen, or
something else, like their filled properties, Brush color, or so
forth. But doing so is tough because 'unique identifiers' are
mutable. In the database world, a person can be assigned a GUID (like
a national identity number or Social Security number in the USA),
which is good, but if you don't use the GUID, what combination of
attributes is assignable to any person, to identify that person, that
is not mutable? I can't think of any (maybe their entire DNA
sequence?!--but that is not a short ID). Certainly not name, address,
height, weight, age, date of birth, in any combination, since they can
change or are often duplicates, especially with nearly 7 billion
people on the planet.

RL

Aug 5 '08 #5

On Aug 5, 12:24 am, michel.desang...@gmail.com wrote:
Also be aware that the fact you want to mutate your object can cause
problems in the dictionary/whatever container, if the fields used to
calculate the hash are mutable themselves.

For instance, say your object is a class with only a string in it, and
the hashcode returned by the object is the string's hashcode. Store
that object in the dictionary, change the string, and your object is
lost in the dictionary (that is, you cannot find it anymore, as the
lookup is based on the new hashcode).
Yes, I'll remember this--don't change the keys in a dictionary!
In short, all the fields involved in calculating the hashcode must be
immutable (that's why the .NET framework uses the internal reference
number of the object: it's not ideal in terms of spreading across the
integer range, but it's immutable).
See my philosophical post in reply to Arne in this thread--even the
internal reference number of the object, while immutable while the
object is alive, is not guaranteed (says Microsoft) to be unique.
Very rare, but it happens one in XYZ times.
If all of your fields can mutate, then you're pretty much stuck with
the ArrayList solution, and the manual lookups, and the O(n). But if
you're not concerned with performance, then go ahead!
But, as per the reply to Arne, even the ArrayList solution, or its
generic equivalent, is not free of problems, it seems.

RL
Aug 5 '08 #6

raylopez99 <ra********@yahoo.com> wrote:
But, here is a philosophical problem to consider: how often does the
IL engine (sorry in advance for any wrong terminology) assign the same
hash code id to two different objects? Very, very rare, but it
happens. Then, when this happens, you'll get an error when you use
GetHashCode or Object.ReferenceEquals(), no? Has anybody thought
about this problem?
Yes, they certainly have. When you look something up in a hashtable,
first you need to find all the items that have the same hashcode as the
key that you're trying to look up. Then you need to call Equals to find
out whether each candidate entry is *actually* a match for the key
you're using.

Hash collisions slow things down (because you need to call Equals on
more candidate entries) but they don't stop a hashtable from actually
working correctly. Just don't try to treat a hashcode as a unique
identifier - it isn't.
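A sketch of why collisions don't break correctness. The CollidingKey type is a deliberately pathological illustration (every instance hashes to the same value), not anything from the thread:

```csharp
using System;
using System.Collections.Generic;

// Worst-case key: every instance collides, so lookups degrade to a
// linear scan within one bucket -- yet they still return the right value.
class CollidingKey
{
    public readonly int Id;
    public CollidingKey(int id) { Id = id; }
    public override int GetHashCode() { return 1; } // every key collides
    public override bool Equals(object o)
    {
        CollidingKey k = o as CollidingKey;
        return k != null && k.Id == Id;
    }
}

class Demo
{
    static void Main()
    {
        var map = new Dictionary<CollidingKey, string>();
        for (int i = 0; i < 100; i++)
            map[new CollidingKey(i)] = "value" + i;

        // All 100 keys share one hash, but Equals disambiguates each one.
        Console.WriteLine(map[new CollidingKey(57)]); // prints "value57"
        Console.WriteLine(map.Count);                 // 100
    }
}
```

This is the hash-then-Equals sequence described above, taken to its extreme: correctness is preserved, only the O(1) speed is lost.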

--
Jon Skeet - <sk***@pobox.com>
Web site: http://www.pobox.com/~skeet
Blog: http://www.msmvps.com/jon.skeet
C# in Depth: http://csharpindepth.com
Aug 5 '08 #7

On Aug 5, 1:08 am, Pavel Minaev <int...@gmail.com> wrote:
On Aug 5, 2:25 am, raylopez99 <raylope...@yahoo.com> wrote:
A quick sanity check, and I think I am correct, but just to make
sure: if you have a bunch of objects that are very much like one
another you can uniquely track them simply by using an ArrayList or
Array, correct?

You can uniquely track an instance of any reference type, no matter
what data is inside, because it has identity, which can be compared
against by using Object.ReferenceEquals(). It is this identity that
serves as a basis for default implementations of GetHashCode() and
Equals(), and more often than not it is good enough.
Thanks, I did not think about Object.ReferenceEquals(). I was
concentrating on using a generic List and writing my own
"Equals" (override) {public override bool Equals(Object obj) }. But I
like .ReferenceEquals since it's a library function and I trust code
written by librarians--I just hobby code, so, like a beginner chess
player who memorizes grandmaster moves, I prefer to use professional
code whenever possible.

But, as pointed out in the links below, hash functions can be non-
unique, or give collisions, with two different objects having the same
hash (rare, but it happens),

This is largely irrelevant to the issue of performance, since hash
tables work correctly even with collisions - just slower. You're still
guaranteed to have O(1) on average on random input, though
pathological cases can become O(n).
Yes, hash tables may work correctly with collisions, but see my
philosophical thread to Arne--you can have problems if a collision
occurs in a Array or List when you are trying to find a 'unique'
object that is not so unique.
Maintaining a hash table also has some additional overhead, both on
insertion and on retrieval (computing hashes, balancing buckets etc).
In practice, particularly for small collections (roughly ~100 items),
where both insertions and lookups happen frequently, a plain list with
linear search can be more efficient than a hashtable both in terms of
speed and memory.
I didn't know the small collections cutoff was that small! I thought
it would be closer to 10k items. It shows the librarians who wrote
the code for hash table did a very good job, to get the number that
low.

For larger stuff, hashtable is worth considering -
though even then, if inserts are frequent and lookups are very rare, a
list is probably still better (but it is not a usual case).
I wish I had a table to consult to see what container to use for what
type situation... probably there's a table somewhere on the net.

On the other hand, if you fill the collection once, and then just do
lookups, never modifying it, then the most efficient implementation is
pre-sorting the collection and doing binary search on it
(Array.BinarySearch or List<T>.BinarySearch, as needed - but if a
collection is immutable, might as well make it an array for the slight
performance increase that gives).
Very interesting. I'll keep this in mind: use binary search for
collections you will not modify much, which is akin to what I heard
about fast lookups in balanced red/black binary trees. But I think
the philosophical question I posed to Arne in this thread will also
apply to binary trees. Probably it's a problem you cannot escape
from.

RL
Aug 5 '08 #8

On Aug 5, 2:53 am, Jon Skeet [C# MVP] <sk...@pobox.com> wrote:
raylopez99 <raylope...@yahoo.com> wrote:
But, here is a philosophical problem to consider: how often does the
IL engine (sorry in advance for any wrong terminology) assign the same
hash code id to two different objects? Very, very rare, but it
happens. Then, when this happens, you'll get an error when you use
GetHashCode or Object.ReferenceEquals(), no? Has anybody thought
about this problem?

Yes, they certainly have. When you look something up in a hashtable,
first you need to find all the items that have the same hashcode as the
key that you're trying to look up. Then you need to call Equals to find
out whether each candidate entry is *actually* a match for the key
you're using.
Two questions: what about an array or generic List or ArrayList
(which apparently is deprecated after C# 2.0, it seems)? Do you need
to look up the hashcode to see if there are duplicates?

Second question: with Equals, and I've seen examples of customized
overridden Equals from the MSDN, how do you find a so-called "actual"
match, as you say? As I said for rectangles, what is unique? Seems
this is a bug or feature in computer science you cannot avoid.

Seems then I discovered it, de novo and all by myself.

Logically, seems that I am a budding or actual genius, thank you very
much.

Hash collisions slow things down (because you need to call Equals on
more candidate entries) but they don't stop a hashtable from actually
working correctly. Just don't try to treat a hashcode as a unique
identifier - it isn't.
Yes, I see your point about hash collisions... like you say, the
'buckets' may have more than one entry in them. But it doesn't solve
the problem of finding your 'unique' or not so unique object /
variable, if you have a nearly infinite sea of such variables.

RL
Aug 5 '08 #9

raylopez99 <ra********@yahoo.com> wrote:
Yes, they certainly have. When you look something up in a hashtable,
first you need to find all the items that have the same hashcode as the
key that you're trying to look up. Then you need to call Equals to find
out whether each candidate entry is *actually* a match for the key
you're using.

Two questions: what about an array or generic List or ArrayList
(which apparently is deprecated after C# 2.0, it seems)? Do you need
to look up the hashcode to see if there are duplicates?
List and ArrayList don't use hashcodes at all.
Second question: with Equals, and I've seen examples of customized
overridden Equals from the MSDN, how do you find a so-called "actual"
match, as you say? As I said for rectangles, what is unique? Seems
this is a bug or feature in computer science you cannot avoid.
Sometimes there isn't any useful concept of uniqueness. If you've got
two references to equal strings, for example, you can only tell the
difference between them using object.ReferenceEquals (or comparison
using == having cast up to object). But if they're equal, you rarely
need to distinguish between them in the first place.
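The string case can be demonstrated directly. The second string is built at runtime so it is not the interned literal instance (this detail about interning is an assumption worth noting; with two identical literals, the compiler would hand back the same object):

```csharp
using System;

class Demo
{
    static void Main()
    {
        string a = "hello";
        // Constructed at runtime, so it is a distinct object from the literal.
        string b = new string(new char[] { 'h', 'e', 'l', 'l', 'o' });

        Console.WriteLine(a.Equals(b));                  // True  (same content)
        Console.WriteLine(Object.ReferenceEquals(a, b)); // False (distinct objects)
        Console.WriteLine((object)a == (object)b);       // False (reference compare)
    }
}
```

Note that `a == b` without the casts would print True, because string overloads == to compare content; casting up to object forces the reference comparison Jon describes.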
Hash collisions slow things down (because you need to call Equals on
more candidate entries) but they don't stop a hashtable from actually
working correctly. Just don't try to treat a hashcode as a unique
identifier - it isn't.

Yes, I see your point about hash collisions... like you say, the
'buckets' may have more than one entry in them. But it doesn't solve
the problem of finding your 'unique' or not so unique object /
variable, if you have a nearly infinite sea of such variables.
The key is usually to think about why you're interested in uniqueness
in the first place.

--
Jon Skeet - <sk***@pobox.com>
Web site: http://www.pobox.com/~skeet
Blog: http://www.msmvps.com/jon.skeet
C# in Depth: http://csharpindepth.com
Aug 5 '08 #10

On Aug 5, 3:22 am, Jon Skeet [C# MVP] <sk...@pobox.com> wrote:
The key is usually to think about why you're interested in uniqueness
in the first place.
This is true. But suppose you know they are unique, but just can't
prove it easily... like a message packet that gets routed by mistake
to a different location because it has the same packet ID (header ID)
number... you can "drill down" and examine the payload of the packet
to see that in fact it contains different text than another packet
having the same header ID, but that takes a lot of computational time
and effort. So you just have to live with collisions on occasion.

For this particular example, I've concluded that the chances of
getting the same hash code for an array having even several thousand
elements is vanishingly small, and not worth worrying about. Probably
your hardware will crash more often than getting a collision between
object hash codes for any moderately sized array. But if you really
worry about collisions, and find several variables having the same
hash (i.e., "FindAll" method), then you can "drill down" and try and
identify them further, maybe with a try/catch block.

Thanks for your input.

RL

Aug 5 '08 #11

On Aug 5, 14:46, raylopez99 <raylope...@yahoo.com> wrote:
On Aug 5, 3:22 am, Jon Skeet [C# MVP] <sk...@pobox.com> wrote:
The key is usually to think about why you're interested in uniqueness
in the first place.

This is true. But suppose you know they are unique, but just can't
prove it easily... like a message packet that gets routed by mistake
to a different location because it has the same packet ID (header ID)
number... you can "drill down" and examine the payload of the packet
to see that in fact it contains different text than another packet
having the same header ID, but that takes a lot of computational time
and effort. So you just have to live with collisions on occasion.

For this particular example, I've concluded that the chances of
getting the same hash code for an array having even several thousand
elements is vanishingly small, and not worth worrying about. Probably
your hardware will crash more often than getting a collision between
object hash codes for any moderately sized array. But if you really
worry about collisions, and find several variables having the same
hash (i.e., "FindAll" method), then you can "drill down" and try and
identify them further, maybe with a try/catch block.

Thanks for your input.

RL
Forgive me if this is already completely obvious to you, I just want
to make sure we agree on one thing: hashcodes are made to *locate* an
object quickly, but have nothing to do whatsoever with whether two (or
more) objects are the same. Say you have a container (List, Array...)
with several objects (not structs, reference objects). If you do
something like this:

foreach (object o in myArray)
    if (myReferenceObject == o)
        // do something with it

...the comparison between myReferenceObject and each of the objects in
the array will not involve comparing the respective hashcodes, nor
comparing the actual values of the objects, whatever they are (it
would compare the values if the objects were value types, see
http://www.yoda.arachsys.com/csharp/parameters.html for a thorough
explanation on value types and reference types, by the same Jon Skeet
who's writing in this very thread).

It won't involve comparing an internal reference number either, it
will only compare the pointers to the objects, that is, their
reference. Again, if you know this already, consider it as an exercise
for newcomers : each object has a unique place in memory, and the
reference is simply the address of that place. So, if two objects, o1
and o2, are stored exactly at the same place, then they are the same
object (each place can store only one object at one time).

So, to answer a part of your philosophical question, "Then, when this
happens, you'll get an error when you use
GetHashCode or Object.ReferenceEquals(), no?", the answer is no when
you use GetHashCode (because then the CLR will switch to comparing
pointers, see Jon's answer), and no when you use ReferenceEquals
because in this case, the reference we're talking about is not the
"internal reference number" used by the CLR to track objects (for the
GC, I believe, though I could be wrong), but the reference pointer to
the address in memory where the objects are stored.

Having the same hashcode is never ever a guarantee that two objects
are the same or have identical content, just some sort of shortcut to
find an object quicker than by comparing them one by one (think
dichotomy).

To sum up: two objects can BE the same, or CONTAIN the same thing.
Hashcodes (usually) refer to their content, but if that's not enough
to distinguish between two objects, then the reference is used. As
Pavel pointed out, for lists above a hundred items, a hashcode lookup
is faster, even with the default GetHashCode CLR implementation, than
a pure reference-based comparison, that is, an array traversal (though
I'm not sure if SortedList uses dichotomy to speed up the find
process).

Except if you do:

public override int GetHashCode()
{
    return 1; // every object collides: lookups degrade to linear scans
}

:)

Michel

Aug 5 '08 #12

raylopez99 wrote:
On Aug 4, 4:18 pm, Arne Vajhøj <a...@vajhoej.dk> wrote:
>raylopez99 wrote:
But writing a GetHashCode method is not that difficult, so if you
need to look up by something, I would recommend Dictionary<>
anyway.
But, here is a philosophical problem to consider: how often does the
IL engine (sorry in advance for any wrong terminology) assign the same
hash code id to two different objects? Very, very rare, but it
happens. Then, when this happens, you'll get an error when you use
GetHashCode or Object.ReferenceEquals(), no?
No.

GetHashCode returns the same value.

ReferenceEquals returns false.

Dictionary<> is able to handle it because it uses GetHashCode first
and then Equals.
A similar problem comes up with database design, which I dabbled with
last year. They use GUIDs (basically hashcode IDs it seems) to
uniquely identify a row in a table. At the time I thought GUIDs were
100% unique--and nobody in the database world suggested otherwise,
though perhaps I didn't read enough--but some database gurus did not
like, for database design purposes (so-called 5 Rules of dB
Normalization), using GUIDs since they are not a good identifier for
Normalization purposes. Then the problem becomes--how do I uniquely
identify a row, if you cannot use a GUID? You have to pick a
combination of values from the row, to identify that row.
Auto number / auto increment / identity columns and sequences were
invented to solve that problem.

Arne
Aug 5 '08 #13

On Aug 5, 10:37 am, michel.desang...@gmail.com wrote:
Forgive me if this is already completely obvious to you, I just want
to make sure we agree on one thing: hashcodes are made to *locate* an
object quickly, but have nothing to do whatsoever with whether two (or
more) objects are the same. Say you have a container (List, Array...)
with several objects (not structs, reference objects). If you do
something like this:

foreach (object o in myArray)
    if (myReferenceObject == o)
        // do something with it

...the comparison between myReferenceObject and each of the objects in
the array will not involve comparing the respective hashcodes, nor
comparing the actual values of the objects, whatever they are (it
would compare the values if the objects were value types, see
http://www.yoda.arachsys.com/csharp/parameters.html for a thorough
explanation on value types and reference types, by the same Jon Skeet
who's writing in this very thread).

It won't involve comparing an internal reference number either, it
will only compare the pointers to the objects, that is, their
reference. Again, if you know this already, consider it as an exercise
for newcomers : each object has a unique place in memory, and the
reference is simply the address of that place. So, if two objects, o1
and o2, are stored exactly at the same place, then they are the same
object (each place can store only one object at one time).
Actually I did not know this--that "It won't involve comparing an
internal reference number either". That's interesting. Maybe (or no
doubt, after Pavel's point) the CLR / runtime engine uses a hash
table, and, if there's a collision, then resorts to comparing
reference locations.
So, to answer a part of your philosophical question, "Then, when this
happens, you'll get an error when you use
GetHashCode or Object.ReferenceEquals(), no?", the answer is no when
you use GetHashCode (because then the CLR will switch to comparing
pointers, see Jon's answer),
http://www.yoda.arachsys.com/csharp/parameters.html did not contain
keywords "pointers" or "hash" so this is not discussed by Jon. But
interesting, if true.
and no when you use ReferenceEquals
because in this case, the reference we're talking about is not the
"internal reference number" used by the CLR to track objects (for the
GC, I believe, though I could be wrong), but the reference pointer to
the address in memory where the objects are stored.
New information--interesting.
Having the same hashcode is never ever a guarantee that two objects
are the same or have identical content, just some sort of shortcut to
find an object quicker than by comparing them one by one (think
dichotomy).

To sum up : two objects can BE the same, or CONTAIN the same thing.
Hashcodes (usually) refer to their content, but if that's not enough to
distinguish between two objects, then the reference is used.
This is interesting. As an aside, the Help documentation for .Equals
(Obj1.Equals(Obj2)) I read yesterday indicates that the default (which
you can override) Equals is simply a comparison of references, to see
if they point to the same object. If they do, then they are "equal",
which for most people is good enough. But, suppose you have an Array
of Circles, then "equals" might not mean the same reference pointed to
(like Object myObject = new Object(); Object myObject2 = myObject;
therefore, "myObject2 = myObject"). Instead, you might want to
compare areas of two different Circles, and if they have the same
area, they are "equal", even though they are pointed to by different
references (or however you say that).
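That Circle idea can be sketched as follows (the Circle class, its Radius property, and area-based equality are hypothetical, made up here for illustration):

```csharp
using System;

// Hypothetical Circle class: "equal" means equal area, not same reference.
class Circle
{
    public double Radius { get; private set; }
    public Circle(double radius) { Radius = radius; }

    public double Area { get { return Math.PI * Radius * Radius; } }

    public override bool Equals(object obj)
    {
        Circle other = obj as Circle;
        return other != null && Area == other.Area;
    }

    // Objects that compare equal must return the same hash code.
    public override int GetHashCode() { return Area.GetHashCode(); }
}

class Program
{
    static void Main()
    {
        Circle c1 = new Circle(2.0);
        Circle c2 = new Circle(2.0);
        Console.WriteLine(object.ReferenceEquals(c1, c2)); // False - distinct objects
        Console.WriteLine(c1.Equals(c2));                  // True  - same area
    }
}
```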

RL
Aug 6 '08 #14

P: n/a
On Aug 5, 4:39 pm, Arne Vajhøj <a...@vajhoej.dk> wrote:
A similar problem comes up with database design, which I dabbled with
last year. They use GUIDs (basically hashcode IDs, it seems) to
uniquely identify a row in a table. At the time I thought GUIDs were
100% unique--and nobody in the database world suggested otherwise,
though perhaps I didn't read enough--but some database gurus did not
like using GUIDs for database design purposes (the so-called 5 Rules
of dB Normalization), since they are not a good identifier for
normalization purposes. Then the problem becomes--how do I uniquely
identify a row, if you cannot use a GUID? You have to pick a
combination of values from the row to identify that row.

auto number / auto increment / identity and sequences was
invented to solve that problem.

Arne
I'm hardly an authority on dBs, having spent about a month learning
them (but did get a functional program using Visual Basic and Access
Jet as the dB), but I believe auto number / auto increment / identity
will be the same thing as a GUID, except shorter, so you're back to
square one, in the sense that you could have a collision between two
tables that use the same auto number and table name, but contain
distinct rows.

Anyway, I'll let the people at Oracle who deal with petabytes and
billion user databases deal with it.

RL

Aug 6 '08 #15

On 6 août, 11:57, raylopez99 <raylope...@yahoo.com> wrote:
On Aug 5, 10:37 am, michel.desang...@gmail.com wrote:

Forgive me if this is already completely obvious to you, I just want
to make sure we agree on one thing : hashcodes are made to *locate* an
object quickly, but have nothing to do whatsoever with whether two (or
more) objects are the same. Say you have a container (List, Array...)
with several objets (not structs, reference objects). If you do
something like this :
foreach (object o in myArray)
    if (myReferenceObject == o)
        // do something with it
...the comparison between myReferenceObject and each of the objects in
the array will not involve comparing the respective hashcodes, nor
comparing the actual values of the objects, whatever they are (it
would compare the values if the objects were value types, see

http://www.yoda.arachsys.com/csharp/parameters.html for a thorough
explanation on value types and reference types, by the same Jon Skeet
who's writing in this very thread).
It won't involve comparing an internal reference number either, it
will only compare the pointers to the objects, that is, their
reference. Again, if you know this already, consider it as an exercise
for newcomers : each object has a unique place in memory, and the
reference is simply the address of that place. So, if two objects, o1
and o2, are stored exactly at the same place, then they are the same
object (each place can store only one object at one time).

Actually, I did not know this--that "it won't involve comparing an
internal reference number either". That's interesting. Maybe (or, no
doubt, after Pavel's point) the CLR / runtime engine uses a hash
table and, if there's a collision, then resorts to comparing
reference locations.
Er... When Jon, Pavel and Arne concur, it's a pretty safe bet that
they're right! ;)

And as Jon noted, it only uses a hashtable for hash-based containers
(like Dictionary), not for Arrays.
So, to answer a part of your philosophical question, "Then, when this
happens, you'll get an error when you use
GetHashCode or Object.ReferenceEquals(), no?", the answer is no when
you use GetHashCode (because then the CLR will switch to comparing
pointers, see Jon's answer),

http://www.yoda.arachsys.com/csharp/parameters.html did not contain
the keywords "pointers" or "hash", so this is not discussed by Jon. But
interesting, if true.
Yes, sorry, I meant that it contains an explanation of the difference
between value types and reference types. The pointers thing is the
mechanism by which you can "identify" a reference object (to keep
things simple). Usually, explanations about pointers are found
together with explanations about "unsafe code" (one of the advantages
of managed code, compared to C++, being that the raw pointers are
hidden from you and encapsulated into safe and easier-to-handle
references).
and no when you use ReferenceEquals
because in this case, the reference we're talking about is not the
"internal reference number" used by the CLR to track objects (for the
GC, I believe, though I could be wrong), but the reference pointer to
the address in memory where the objects are stored.

New information--interesting.
Having the same hashcode is never ever a guarantee that two objects
are the same or have identical content, just some sort of shortcut to
find an object quicker than by comparing them one by one (think
dichotomy).
To sum up : two objects can BE the same, or CONTAIN the same thing.
Hashcodes (usually) refer to their content, but if that's not enough to
distinguish between two objects, then the reference is used.

This is interesting. As an aside, the Help documentation for .Equals
(Obj1.Equals(Obj2)) I read yesterday indicates that the default (which
you can override) Equals is simply a comparison of references, to see
if they point to the same object. If they do, then they are "equal",
which for most people is good enough. But, suppose you have an Array
of Circles, then "equals" might not mean the same reference pointed to
(like Object myObject = new Object(); Object myObject2 = myObject;
therefore, "myObject2 = myObject"). Instead, you might want to
compare areas of two different Circles, and if they have the same
area, they are "equal", even though they are pointed to by different
references (or however you say that).
That's what Jon meant when he said that it depends on what you mean by
"equality", and there are numerous pitfalls here. For instance, there
are two versions of Equals: one is static (in the Object class), the
other is an instance method of your own objects. It is customary to
never override the first (so that object.Equals(o1, o2) will always
tell you whether the object is indeed the same object - read : has the
same place in memory, and will be equivalent to ReferenceEquals for
reference types), and occasionally override the second (often together
with the == operator) to accommodate the case you're stating
(comparing surfaces of two circles).

Another pitfall : ReferenceEquals will always return false for value
types. i.e. :

int a = 5;
int b = a;
Console.WriteLine(object.ReferenceEquals(a,b));

...will yield "false", because of the boxing that occurs on value
types in the ReferenceEquals method.

The thing is, as you can see, "being equal" can have several meanings.
That's why I usually never override Equals or ==, but provide a method
with a stupid/evocative name like "Circle.AreSurfacesEqual(o1, o2)".
It has the advantage of removing the ambiguity you're talking about.
That's just me, though.
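That naming approach can be sketched like this (the Circle class and its Surface property are hypothetical; only the explicitly named comparison method is the point):

```csharp
using System;

// Hypothetical Circle with an explicitly named comparison instead of an
// overridden Equals/==, as suggested above: the method name says exactly
// what is being compared.
class Circle
{
    public double Radius { get; private set; }
    public Circle(double radius) { Radius = radius; }

    public double Surface { get { return Math.PI * Radius * Radius; } }

    public static bool AreSurfacesEqual(Circle c1, Circle c2)
    {
        // No ambiguity: this compares surfaces, nothing else.
        return c1 != null && c2 != null && c1.Surface == c2.Surface;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(Circle.AreSurfacesEqual(new Circle(1), new Circle(1))); // True
        Console.WriteLine(Circle.AreSurfacesEqual(new Circle(1), new Circle(2))); // False
    }
}
```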
I'm hardly an authority on dB's, having spent about a month learning
them (but did get a functional program using Visual Basic and Access
Jet as the dB), but I believe auto number / auto increment / identify
will be the same thing as GUID, except shorter, so you're back to
square one, in the sense that you could have collision between two
tables that use the same auto number and table name, but contain
distinct rows.
Yes, but as you can still discriminate the rows through the table name/
number, there won't be any collisions, in the same way you know that a
dog called "Snoop" is different from a cat named "Snoop", even though
they have the same name, the same number of legs, eyes... It's handled
in exactly the same way as the filesystem, if you have two files with
identical names in an identical directory structure, like this :

C:\Docs\SomeFolder\File1.txt
D:\Docs\SomeFolder\File1.txt

...you still know they're different files, because the full path
differs (even if only in the drive letter). If the *full* path was the
same, then it would be the same file. It's the same principle for DBs
and for memory management of variables.

HTH,

Michel
Aug 6 '08 #16

On Aug 6, 1:57 pm, raylopez99 <raylope...@yahoo.com> wrote:
It won't involve comparing an internal reference number either, it
will only compare the pointers to the objects, that is, their
reference. Again, if you know this already, consider it as an exercise
for newcomers : each object has a unique place in memory, and the
reference is simply the address of that place. So, if two objects, o1
and o2, are stored exactly at the same place, then they are the same
object (each place can store only one object at one time).

Actually, I did not know this--that "it won't involve comparing an
internal reference number either". That's interesting. Maybe (or, no
doubt, after Pavel's point) the CLR / runtime engine uses a hash
table and, if there's a collision, then resorts to comparing
reference locations.
If you mean how ReferenceEquals() works - well, as far as I know, it
just compares the raw addresses of the objects. Since every object
occupies a distinct region of memory, and no object representation can
be shorter than 1 byte, every object's address is unique.

As for Hashtable/Dictionary - it does not resort to comparing
locations. It compares GetHashCode() values first; if they match, it
second-checks using the instance Equals() method. The default
implementation of Equals() - as inherited from System.Object - just
does ReferenceEquals(), but the hashtable itself is not aware of this.
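A small sketch of that two-step lookup (the Key class here is hypothetical, with a deliberately degenerate GetHashCode() so that every instance collides and only Equals() can tell them apart):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type: every instance lands in the same hash bucket,
// so the dictionary must fall back on Equals() to distinguish keys.
class Key
{
    public string Name { get; private set; }
    public Key(string name) { Name = name; }

    public override int GetHashCode() { return 0; } // deliberate collision
    public override bool Equals(object obj)
    {
        Key other = obj as Key;
        return other != null && Name == other.Name;
    }
}

class Program
{
    static void Main()
    {
        var d = new Dictionary<Key, int>();
        d[new Key("a")] = 1;
        d[new Key("b")] = 2;                // same hash code, different key

        Console.WriteLine(d[new Key("a")]); // 1 - Equals() resolved the collision
        Console.WriteLine(d.Count);         // 2 - both entries survived
    }
}
```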
This is interesting. As an aside, the Help documentation for .Equals
(Obj1.Equals(Obj2)) I read yesterday indicates that the default (which
you can override) Equals is simply a comparison of references, to see
if they point to the same object. If they do, then they are "equal",
which for most people is good enough. But, suppose you have an Array
of Circles, then "equals" might not mean the same reference pointed to
(like Object myObject = new Object(); Object myObject2 = myObject;
therefore, "myObject2 = myObject"). Instead, you might want to
compare areas of two different Circles, and if they have the same
area, they are "equal", even though they are pointed to by different
references (or however you say that).
This is correct. In general, Equals() is supposed to be a value
comparison (for strings, for example, it compares the characters). If
the developer of the class had nothing to say regarding value equality
- by not overriding Equals() - then the only thing that is safe to
say is that any object is equal to itself; which is what the default
Equals() does.
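For example, with the default Equals() inherited from System.Object, only identity counts:

```csharp
using System;

class Program
{
    static void Main()
    {
        object o1 = new object();
        object o2 = o1;              // same instance
        object o3 = new object();    // distinct instance

        // Default Equals() is an identity comparison:
        Console.WriteLine(o1.Equals(o2)); // True
        Console.WriteLine(o1.Equals(o3)); // False
    }
}
```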
Aug 6 '08 #17

On Aug 6, 5:59 pm, michel.desang...@gmail.com wrote:
That what Jon meant when he said that it depends on what you mean by
"equality", and there are numerous pitfalls here. For instance, there
are two version of Equals, one is static (in the Object class), the
other is an instance method of your own objects.
The static version is really just a helper to avoid checking for null:
a.Equals(b) will throw if a==null, but object.Equals(a, b) will not -
it will consider two nulls equal, and one null and one non-null
unequal.
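A quick sketch of that null-handling behaviour:

```csharp
using System;

class Program
{
    static void Main()
    {
        string a = null;
        string b = "x";

        // a.Equals(b) would throw a NullReferenceException here,
        // but the static helper handles nulls:
        Console.WriteLine(object.Equals(null, null)); // True  - two nulls are equal
        Console.WriteLine(object.Equals(a, b));       // False - null vs non-null
        Console.WriteLine(object.Equals(b, "x"));     // True  - falls through to b.Equals("x")
    }
}
```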
It is customary to never override the first (so that object.Equals(o1, o2) will always
tell you whether the object is indeed the same object - read : has the
same place in memory, and will be equivalent to ReferenceEquals for
reference types)
It isn't customary so much as it is impossible - as the two-argument
Equals() is a static method, it cannot be overridden.
Another pitfall : ReferenceEquals will always return false for value
types. i.e. :

int a = 5;
int b = a;
Console.WriteLine(object.ReferenceEquals(a,b));

...will yield "false", because of the boxing that occurs on value
types in the ReferenceEquals method.
This actually makes sense, because value types do not have the notion
of identity. Is this "1" the same (not just equal, but the same
instance) as that other "1"? It is really a meaningless question to
ask.
The thing is, as you can see, "being equal" can have several meanings.
That's why I usually never override Equals or ==, but provide a method
with a stupid/evocative name like "Circle.AreSurfacesEqual(o1, o2)".
It has the advantage of removing the ambiguity you're talking about.
That's just me, though.
That's probably not a good idea, because, for the most part, those
methods all have well-defined meanings. You override Equals() when
your class supports the notion of value equality. You override
operator== when, in addition, it does not have meaningful identity
(strings, for example - you never really care whether
ReferenceEquals(s1, s2)==true - you only care if the characters are
the same, because no sane code would behave differently if they are
but identities differ). Note that if a class has reason to override
operator==, it effectively becomes a value type in disguise - so it
may make sense to convert it to a struct at that point.
Aug 6 '08 #18

On 6 août, 16:29, Pavel Minaev <int...@gmail.com> wrote:
On Aug 6, 5:59 pm, michel.desang...@gmail.com wrote:
That what Jon meant when he said that it depends on what you mean by
"equality", and there are numerous pitfalls here. For instance, there
are two version of Equals, one is static (in the Object class), the
other is an instance method of your own objects.

The static version is really just a helper to avoid checking for null:
a.Equals(b) will throw if a==null, but object.Equals(a, b) will not -
it will consider two nulls equal, and one null and one non-null
unequal.
It is customary to never override the first (so that object.Equals(o1, o2) will always
tell you whether the object is indeed the same object - read : has the
same place in memory, and will be equivalent to ReferenceEquals for
reference types)

It isn't customary so much as it is impossible - as the two-argument
Equals() is a static method, it cannot be overridden.
I meant redefine it (public static new bool Equals(object objA, object objB)).
Another pitfall : ReferenceEquals will always return false for value
types. i.e. :
int a = 5;
int b = a;
Console.WriteLine(object.ReferenceEquals(a,b));
...will yield "false", because of the boxing that occurs on value
types in the ReferenceEquals method.

This actually makes sense, because value types do not have the notion
of identity. Is this "1" the same (not just equal, but the same
instance) as that other "1"? It is really a meaningless question to
ask.
The thing is, as you can see, "being equal" can have several meanings.
That's why I usually never override Equals or ==, but provide a method
with a stupid/evocative name like "Circle.AreSurfacesEqual(o1, o2)".
It has the advantage of removing the ambiguity you're talking about.
That's just me, though.

That's probably not a good idea, because, for the most part, those
methods all have well-defined meanings. You override Equals() when
your class supports the notion of value equality. You override
operator== when, in addition, it does not have meaningful identity
(strings, for example - you never really care whether
ReferenceEquals(s1, s2)==true - you only care if the characters are
the same, because no sane code would behave differently if they are,
but identities differ). Note that if a class has reason to override
operator==, it effectively becomes a value type in disguise - so it
may make sense to convert it to struct at that point.
The ambiguity I was refering about is the remainder of "for the most
part". :)

Michel
Aug 6 '08 #19

On Aug 6, 6:59 am, michel.desang...@gmail.com wrote:
This is interesting. As an aside, the Help documentation for .Equals
(Obj1.Equals(Obj2)) I read yesterday indicates that the default (which
you can override) Equals is simply a comparison of references, to see
if they point to the same object. If they do, then they are "equal",
which for most people is good enough. But, suppose you have an Array
of Circles, then "equals" might not mean the same reference pointed to
(like Object myObject = new Object(); Object myObject2 = myObject;
therefore, "myObject2 = myObject"). Instead, you might want to
compare areas of two different Circles, and if they have the same
area, they are "equal", even though they are pointed to by different
references (or however you say that).

That's what Jon meant when he said that it depends on what you mean by
"equality", and there are numerous pitfalls here. For instance, there
are two versions of Equals: one is static (in the Object class), the
other is an instance method of your own objects. It is customary to
never override the first (so that object.Equals(o1, o2) will always
tell you whether the object is indeed the same object - read : has the
same place in memory, and will be equivalent to ReferenceEquals for
reference types), and occasionally override the second (often together
with the == operator) to accommodate the case you're stating
(comparing surfaces of two circles).

Another pitfall : ReferenceEquals will always return false for value
types. i.e. :

int a = 5;
int b = a;
Console.WriteLine(object.ReferenceEquals(a,b));

...will yield "false", because of the boxing that occurs on value
types in the ReferenceEquals method.

The thing is, as you can see, "being equal" can have several meanings.
That's why I usually never override Equals or ==, but provide a method
with a stupid/evocative name like "Circle.AreSurfacesEqual(o1, o2)".
It has the advantage of removing the ambiguity you're talking about.
That's just me, though.
Yes, good stuff here, I've added this quote of yours to my library of
C# stuff that I keyword search.
I'm hardly an authority on dBs, having spent about a month learning
them (but did get a functional program using Visual Basic and Access
Jet as the dB), but I believe auto number / auto increment / identity
will be the same thing as a GUID, except shorter, so you're back to
square one, in the sense that you could have a collision between two
tables that use the same auto number and table name, but contain
distinct rows.

Yes, but as you can still discriminate the rows through the table name/
number, there won't be any collisions, in the same way you know that a
dog called "Snoop" is different from a cat named "Snoop", even though
they have the same name, the same number of legs, eyes... It's handled
in exactly the same way as the filesystem, if you have two files with
identical names in an identical directory structure, like this :

C:\Docs\SomeFolder\File1.txt
D:\Docs\SomeFolder\File1.txt

...you still know they're different files, because the full path
differs (even if only in the drive letter). If the *full* path was the
same, then it would be the same file. It's the same principle for DBs
and for memory management of variables.
Yes, sorry, my example was bad since it was not a true collision. But
my point is that unless you use GUIDs or auto-increment or some such
number as your primary key, it's hard to define uniqueness. However,
more relevant to C#, after going through this thread I feel better
that C# does indeed distinguish between data (variables, objects,
whatever) stored in different parts of memory for the .Equals method
and for ReferenceEquals (which I didn't even know existed as a method),
and doesn't rely on an internal number and/or hash number. And I
agree that overloading the "==" or Equals or "+" operator, as was the
fashion in C++, as mere eye candy / syntactic sugar is not a good idea,
so I like your suggestion of Circle.AreSurfacesEqual.

RL
Aug 6 '08 #20

raylopez99 wrote:
On Aug 5, 4:39 pm, Arne Vajhøj <a...@vajhoej.dk> wrote:
>>A similar problem comes up with database design, which I dabbled with
last year. They use GUIDs (basically hashcode IDs it seems) to
uniquely identify a row in a table. At the time I thought GUIDs were
100% unique--and nobody in the database world suggested otherwise,
though perhaps I didn't read enough--but some database gurus did not
like, for database design purposes (so-called 5 Rules of dB
Normalization), using GUIDs since they are not a good identifier for
Normalization purposes. Then the problem becomes--how do I uniquely
identify a row, if you cannot use a GUID? You have to pick a
combination of values from the row, to identify that row.
auto number / auto increment / identity and sequences was
invented to solve that problem.

I'm hardly an authority on dB's, having spent about a month learning
them (but did get a functional program using Visual Basic and Access
Jet as the dB), but I believe auto number / auto increment / identify
will be the same thing as GUID, except shorter, so you're back to
square one, in the sense that you could have collision between two
tables that use the same auto number and table name, but contain
distinct rows.
No.

Auto increment etc. just starts at 1 and adds 1 for each row
inserted. When the data type cannot hold a bigger value you get an
exception on insert (in some DBs at least), and if you use
a sufficiently large data type you will run out of disk
space before that. There will never be a collision.

GUIDs that are UUID version 4 have a collision probability
of 1/2^122. Which is very small, but still greater than zero.

They are different.
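As a quick sketch of the GUID side (on .NET, Guid.NewGuid() produces a version 4, i.e. random, GUID):

```csharp
using System;

class Program
{
    static void Main()
    {
        // Guid.NewGuid() returns a version 4 (random) UUID on .NET.
        Guid g1 = Guid.NewGuid();
        Guid g2 = Guid.NewGuid();

        // Guid equality is a value comparison of all 128 bits; two
        // freshly generated GUIDs are distinct for all practical
        // purposes, though the collision probability is not zero.
        Console.WriteLine(g1 == g2); // False, overwhelmingly likely
    }
}
```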

Arne

Aug 6 '08 #21

On Aug 6, 3:50 pm, Arne Vajhøj <a...@vajhoej.dk> wrote:
Auto increment etc. just start with 1 and add 1 one for each row
inserted. And when the data type can not have a bigger value you get an
exception when inserting (some DB's at least) and if you use
a sufficiently large data type you will run out of disk
space before that. There will never be a collision.
Right, but does anybody use "auto increment" for a primary key? I
guess they do, same as a GUID, but you have to add a unique table name,
I guess. But, at the time I wrote this, I was thinking of so-called
"natural keys" and using them as your primary key, and the difficulty
of uniqueness using natural keys. And, at the time I wrote this, I
did not realize that "==" or ReferenceEquals is a mechanical
comparison that is 100% guaranteed not to produce collisions--I
thought that it involved GetHashCode / hash codes, so there was a 1/2^122
or whatever chance of a collision.
GUIDs that are UUID version 4 have a collision probability
of 1/2^122. Which is very small, but still greater than zero.
Thanks for clearing that up.

RL
Aug 7 '08 #22

raylopez99 wrote:
On Aug 6, 3:50 pm, Arne Vajhøj <a...@vajhoej.dk> wrote:
>Auto increment etc. just starts at 1 and adds 1 for each row
inserted. When the data type cannot hold a bigger value you get an
exception on insert (in some DBs at least), and if you use
a sufficiently large data type you will run out of disk
space before that. There will never be a collision.

Right, but does anybody use "auto increment" for a primary key?
Oh yes.

Try inviting 10 database people to dinner and asking them whether
they prefer natural keys or surrogate keys; then you will see
a fight break out. Be sure to remove any guns and knives from
them before the fun starts.

Arne

Aug 8 '08 #23

On Aug 8, 5:40 am, Arne Vajhøj <a...@vajhoej.dk> wrote:
Try inviting 10 database people to dinner and asking them whether
they prefer natural keys or surrogate keys; then you will see
a fight break out. Be sure to remove any guns and knives from
them before the fun starts.
Then ask whether GUIDs do indeed make good primary keys, but be ready
to duck, since all attention will be immediately diverted to you.
Aug 11 '08 #24
