How to implement a Hash Table in C

ravi

Hi
can anybody tell me that which ds will be best suited to implement a
hash table in C/C++

thanx. in advanced

Aug 11 '07

Subscribe Reply

139

14240

« First
<
6
7
8
9
10
>
Last »

Richard Heathfield

James Dow Allen said:

On Aug 13, 8:50 pm, Richard Heathfield <r...@see.sig.i nvalidwrote:

>But seriously, there are also advantages in making the table size an
integer power of 2 (which, in all but one case, fails the primality
test with a vengeance), since the hash value can then be reduced to
the table size more rapidly.

Gak! You're not coding ala Falconer are you?

No, I don't think so. I don't actually use hash tables a great deal
(which is not to say that I never use them), but when I *do* use them,
I don't probe at all. I just use stretchy buckets instead (using trees
to handle the stretchy bit).

<snip>

--
Richard Heathfield <http://www.cpax.org.uk >
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Aug 17 '07 #71

Eric Sosman

James Dow Allen wrote:

[...]
Speaking of Falconer, his method (linear probing
with second hash value used for probe increment)
*requires* (even more so than quadratic probing,
see upthread) prime table size.

Well, not exactly: It requires only that the increment
between probes be relatively prime to the table size. A prime
table size makes this criterion easy to satisfy, but it's not
the only way. A table size equal to a power of two, for example,
works just fine if the probe increment is odd. (Degenerate
special case: pure linear probing with an increment of unity
works with any table size at all, prime or composite.)

Some people seem very interested in the relative speed
of AND vs MOD for this purpose, and on most machines AND
is faster or at the very least no slower. I think, though,
that the time savings must be viewed as just one component
of the search cost -- another piece (sometimes rather large)
is computing the key's hash value in the first place, not
to mention the cost of equality comparisons on the table
items encountered along the way.

(Slight digression): It's often suggested that those
equality comparisons can be made cheaper by storing hash
codes in the table entries along with the payload. Before
making a possibly expensive strcmp() or whatever, one can
first do a (fast) arithmetic comparison of the hashes: if
the hashes don't match, the objects mush be unequal and
it is unnecessary to make the expensive comparison. But this
stratagem doesn't seem to me to be a "Well, doh!" proposition;
at least two counter-arguments exist:

- If the table stores pointers to items rather than item
instances, storing hash values in the table roughly
doubles the amount of memory required. If the same
amount of memory were used for a pointers-only table,
the number of table slots would double and the load
factor would be halved. This should lead to shorter
searches -- and it is not at all clear _a priori_ that
more fast probes will outrace fewer slow probes.

- Storing the hash codes saves time whenever a probe hits
a slot occupied by the "wrong" key (it saves no time when
the probe finds the "right" key). If the time savings is
significant, it must mean that the key comparisons are
very expensive or that the probe sequences are very long.
If we've taken care to construct good hash functions and
have reason to believe that searches take two-point-mumble
probes, on the average, the savings is only one-point-mumble
comparisons per search.

As always when one is interested in speed, the gold standard
is the familiar "Measure, measure, measure!"

--
Eric Sosman
es*****@ieee-dot-org.invalid

Aug 17 '07 #72

Richard Harter

<OT>
On Fri, 17 Aug 2007 08:43:06 -0400, Eric Sosman
<es*****@ieee-dot-org.invalidwrot e:

>James Dow Allen wrote:
>[...]
Speaking of Falconer, his method (linear probing
with second hash value used for probe increment)
*requires* (even more so than quadratic probing,
see upthread) prime table size.

Well, not exactly: It requires only that the increment
between probes be relatively prime to the table size. A prime
table size makes this criterion easy to satisfy, but it's not
the only way. A table size equal to a power of two, for example,
works just fine if the probe increment is odd. (Degenerate
special case: pure linear probing with an increment of unity
works with any table size at all, prime or composite.)

In a thread in comp.programmin g (IIRCC) Paul Hsieh pointed out
that always using the same probe increment in a power of two
table size also has bad properties. His solution is to use the
unused bits of the full hash to select an increment size from a
table of primes.

[snip]

re: counterargument to storing the hash code.

- If the table stores pointers to items rather than item
instances, storing hash values in the table roughly
doubles the amount of memory required. If the same
amount of memory were used for a pointers-only table,
the number of table slots would double and the load
factor would be halved. This should lead to shorter
searches -- and it is not at all clear _a priori_ that
more fast probes will outrace fewer slow probes.

This doesn't sound right - ordinarily a table has to hold a
key/value pair. Adding a hash would increase the load factor by
50% - for a given amount of memory.

>
- Storing the hash codes saves time whenever a probe hits
a slot occupied by the "wrong" key (it saves no time when
the probe finds the "right" key). If the time savings is
significant, it must mean that the key comparisons are
very expensive or that the probe sequences are very long.
If we've taken care to construct good hash functions and
have reason to believe that searches take two-point-mumble
probes, on the average, the savings is only one-point-mumble
comparisons per search.

As always when one is interested in speed, the gold standard
is the familiar "Measure, measure, measure!"

</OT>

Aug 17 '07 #73

Richard Harter

On Fri, 17 Aug 2007 01:01:27 -0700, user923005
<dc*****@connx. comwrote:

>/* Here is that cuckoo hash function converted to C */
/* Dann Corbit did the quick & dirty C conversion. */

<snip code>

Thanks for posting the code - much appreciated.

Aug 17 '07 #74

CBFalconer

Richard Harter wrote:

Eric Sosman <es*****@ieee-dot-org.invalidwrot e:
>James Dow Allen wrote:

>>[...]
Speaking of Falconer, his method (linear probing with second
hash value used for probe increment) *requires* (even more so
than quadratic probing, see upthread) prime table size.

Well, not exactly: It requires only that the increment
between probes be relatively prime to the table size. A prime
table size makes this criterion easy to satisfy, but it's not
the only way. A table size equal to a power of two, for example,
works just fine if the probe increment is odd. (Degenerate
special case: pure linear probing with an increment of unity
works with any table size at all, prime or composite.)

In a thread in comp.programmin g (IIRCC) Paul Hsieh pointed out
that always using the same probe increment in a power of two
table size also has bad properties. His solution is to use the
unused bits of the full hash to select an increment size from a
table of primes.

That is dangerous. The purpose of the multiple hash is to have
different search patterns for different objects that have the same
primary hash pattern. This avoids clustering, and tends to keep
the search chain short. The very iffy thing in hashlib is the
reduction of the range of the secondary hash, which encourages
light clustering in nearly empty tables. The system attempts to
maintain tables at between 44 and 88% of capacity.

Also, I want the table action to be independent of the hash
functions (apart from speed), so the functions are computed every
time needed. The system functions with all the hash functions
replaced by the constant 1 (unity) (tested in the test suite) at a
considerable speed penalty. So the action is hash function
independent, but the speed is highly dependent.

The system is intended to isolate all these decisions in the hash
table code, so the user need not worry about them. Similarly the
algorithms for table expansion, etc. The result is a storage and
retrieval black box.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home .att.net>

--
Posted via a free Usenet account from http://www.teranews.com

Aug 17 '07 #75

Malcolm McLean

"CBFalconer " <cb********@yah oo.comwrote in message
news:46******** *******@yahoo.c om...

Malcolm McLean wrote:
>>
... snip ...
>>
/* Initializes DEQUE as an empty deque of elements ELEM_SIZE
bytes in size, with an initial capacity of at least
CAPACITY. Returns the initial deque data array. */
void *
deque_init (struct deque *deque, size_t capacity, size_t elem_size)
{
void *data = NULL;

deque_init_null (deque);
if (capacity 0) {
deque->capacity = 1;
while (deque->capacity < capacity)
deque->capacity <<= 1;
data = xnmalloc (deque->capacity, elem_size);
}
return data;
}

Makes no sense, lacking descriptions of struct deque,
deque_init_null , xnmalloc.

For it to be OK dequeue->capacity would have to be type guaranteed to be
wider than a size_t. There is no such type in standard C. So you can see
that the code is incorrect.
Actually calling with ~0 is a mistake a calling programmer is not too
unlikely to make. If the he knows he needs N-1 spaces in his queue for a
dental surgery, one patient on the chair and thus not in queue, then he
might pass in N -1 and forget for check for N == 0.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 17 '07 #76

user923005

On Aug 17, 2:00 pm, "Malcolm McLean" <regniz...@btin ternet.comwrote :
[snip]

There are many generic queue libraries. None has gained wide acceptance.
Therefore the bar for "usable" is in fact very high. This is not true of
other languages, but it is true of C.

These queues have a large following:
http://www.microsoft.com/windowsserv...q/default.mspx
http://www-306.ibm.com/software/integration/wmq/

Here is a list of queues:
http://sourceforge.net/search/?type_...ft&words=queue

I like this one:
http://sourceforge.net/projects/safmq/
It's C++ though, and is a lot more that most people probably need for
a simple fifo.

For those with simple needs, here is a thread-safe fifo:
http://fifoembed.sourceforge.net/
It's POSIX oriented, but it should be easy enough to convert to
generic use (POSIX threads for Windows could be used for that OS, for
instance).

Aug 17 '07 #77

Malcolm McLean

"Richard Harter" <cr*@tiac.netwr ote in message
news:46******** *******@news.sb tc.net...

This doesn't sound right - ordinarily a table has to hold a
key/value pair. Adding a hash would increase the load factor by
50% - for a given amount of memory.

Normally you can derive the keys from the data item. So in C you would pass
in a function pointer to extract a key from a datum, or if you want to be a
little faster but less flexible, pass in a length and offset.
Unfortunately it make the table harder to use.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 17 '07 #78

Richard Heathfield

Ben Bacarisse said:

<snip>

I'd have add() and remove() do these tests, but
if you don't do that they need to be provided as queue operations.

I pointed out the name clash with <stdio.h>'s remove() some days - or
possibly even weeks - ago. That should have been plenty of time to
correct the electronic version of the book (given that MM seems to have
plenty of time to discuss it, so I presume he has enough time to fix it
too - the more important task).

So I went to have a look, to see if it had been fixed yet. (This is the
first time I've looked at the chapter.)

MM's A123 XYZ / B123 CDE example is wrong, in that it suggests checking
Slot 134. If that's empty, MM says, B123 CDE goes goes in Slot 124. I
can see no justification whatsoever for checking slot (S + 10) before
deciding whether to use Slot S.

I was puzzled to find that hash table entries were not abstracted as the
table itself was. "Divide and conquer" is too useful a strategy to be
ignored lightly, especially in teaching.

Oddly, MM seems to be using int rather than size_t for recording lengths
and sizes (neither of which can be negative). Presumably this is a
typo, since no experienced C programmer would deliberately be that
daft.

His malloc is not ideal - answer = malloc(sizeof *answer); would be
preferable. He repeats this mistake later in the (poorly named)
hashtable() function.

Good functions perform a task. They are doers. And they need "doing
words" (which we sometimes call "verbs") in their names. For example,
stdio's "remove" and "rename" and "fopen" and "printf" - all reasonably
good names (given their creator's habit of terseness).

The indentation is appalling, and led me to believe at first that
hashtable() always returns 0. I had to double back to check. Blech.

There's a reference to a strdup call, but as far as I can see there is
neither a definition of a strdup function nor indeed a call to one.
There is a call to a mystrdup function, but that isn't the same thing.
(Nor is it defined, so we have no idea what it does.)

I wonder what "menas" means.

MM seems to be thinking that "quadratic probing" means "square the hash
value" (and that's a direct quote). Either he's misunderstood quadratic
probing completely, or I have. As far as I'm concerned, quadratic
probing has nothing whatsoever to do with the hash value except as a
place to start looking for a slot. My understanding is that, if the
hash value h has so far resulted in n collisions in a table of size s,
then the place to look next is slot (h + n*n) % s - and most certainly
NOT (h * h) % s!

At this point, I started to lose interest, since it seems evident that
the chapter is so badly broken as to be beyond recovery - and I started
to skim. At random, I discovered an invasion of implementation
namespace, a correct claim that the primality test was interesting (but
alas, it is interesting for all the wrong reasons), an incorrect claim
that trial division is the only guaranteed way to determine whether a
number is prime, and a prime number function that wrongly decides 2 is
not a prime number.

Overall, I found the chapter to be of disappointingly low quality, and I
could not possibly recommend it. If that chapter is the advertisement,
the product itself must be dire indeed.

--
Richard Heathfield <http://www.cpax.org.uk >
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Aug 18 '07 #79

Malcolm McLean

"Richard Heathfield" <rj*@see.sig.in validwrote in message
news:34******** *************@b t.com...

Ben Bacarisse said:

<snip>

>I'd have add() and remove() do these tests, but
if you don't do that they need to be provided as queue operations.

I pointed out the name clash with <stdio.h>'s remove() some days - or
possibly even weeks - ago. That should have been plenty of time to
correct the electronic version of the book (given that MM seems to have
plenty of time to discuss it, so I presume he has enough time to fix it
too - the more important task).

Corrrections are being stored up for an edition 2, which will be released
shortly.

MM's A123 XYZ / B123 CDE example is wrong, in that it suggests
checking Slot 134. If that's empty, MM says, B123 CDE goes goes in Slot
124. I can see no justification whatsoever for checking slot (S + 10)
before
deciding whether to use Slot S.

I was puzzled to find that hash table entries were not abstracted as the
table itself was. "Divide and conquer" is too useful a strategy to be
ignored lightly, especially in teaching.

There are two tables. One stores data items of any size, one stores
pointers.

>
Oddly, MM seems to be using int rather than size_t for recording lengths
and sizes (neither of which can be negative). Presumably this is a
typo, since no experienced C programmer would deliberately be that
daft.

It's an algorithms book. The algorithms work on integers. The hash function
then has to to return a size_t, which is confusing to people using the book
to understand how to port to another language.

>
His malloc is not ideal - answer = malloc(sizeof *answer); would be
preferable. He repeats this mistake later in the (poorly named)
hashtable() function.

Too hard to read. malloc( sizeof *ptr) is a bad clc ism that you will seldom
see in another environment. The reader has to perform a mental dereference
which makes it harder to see if the size of correct.

>
Good functions perform a task. They are doers. And they need "doing
words" (which we sometimes call "verbs") in their names. For example,
stdio's "remove" and "rename" and "fopen" and "printf" - all reasonably
good names (given their creator's habit of terseness).

My convention is to have an opaque structure, and a constructor with the
name of the structure in lower case. This is analagous to C++. As you say,
create_xxx would have the advantage of being a verb.

>
The indentation is appalling, and led me to believe at first that
hashtable() always returns 0. I had to double back to check. Blech.

Read the actual book - preview is available. The html is dumped form the
Open Office book files, unfortunately it doesn't understand code formatting.
I really ought to be able to find a way around that, I know.

>
There's a reference to a strdup call, but as far as I can see there is
neither a definition of a strdup function nor indeed a call to one.
There is a call to a mystrdup function, but that isn't the same thing.
(Nor is it defined, so we have no idea what it does.)

strdup() is discussed in chapter 1.

>
I wonder what "menas" means.

typo.

>
MM seems to be thinking that "quadratic probing" means "square the hash
value" (and that's a direct quote). Either he's misunderstood quadratic
probing completely, or I have. As far as I'm concerned, quadratic
probing has nothing whatsoever to do with the hash value except as a
place to start looking for a slot. My understanding is that, if the
hash value h has so far resulted in n collisions in a table of size s,
then the place to look next is slot (h + n*n) % s - and most certainly
NOT (h * h) % s!

I'll have to check on this one.

>
At this point, I started to lose interest, since it seems evident that
the chapter is so badly broken as to be beyond recovery - and I started
to skim. At random, I discovered an invasion of implementation
namespace, a correct claim that the primality test was interesting (but
alas, it is interesting for all the wrong reasons), an incorrect claim
that trial division is the only guaranteed way to determine whether a
number is prime, and a prime number function that wrongly decides 2 is
not a prime number.

Overall, I found the chapter to be of disappointingly low quality, and I
could not possibly recommend it. If that chapter is the advertisement,
the product itself must be dire indeed.

You seem determined to pick faults. The problem it then becomes distorting,
like Ben Bacarisse's comments. It is then impossible to know whether the
criticisms actually represent anything legitimate or are just fussing.

abstracted - is an entry of arbitrary size abstract enough? The keys, yes, I
admit it might have been better to go for non-strings.

size_t - yes, there is an obvious case, but you don't seem to have
considered the implications.

malloc(size *ptr) - that's not the convention in most places. The argument
for it is based on the needs of the compiler rather than the human reader.

verbs for function names. There is a case, but it is not obvious that you
are right. C++ doesn't agree.

indentations. Yes, it's my fault for not getting a decent HTML dumper. If I
edit the html

strdup - discussed in chapter 1.

menas - unambiguously valid criticism.

quadratic probing - you might be right here.

prime stuff - yes, that needs a bit of fixing, I will agree.

So actually most of the criticism dissolves. I am not saying that there are
not things you can say against the book, or that it couldn't have been done
better. However can you point to a better introductory web page on hash
tables? That's the real test.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 18 '07 #80

Similar topics