
My hash table is in need of peer review

I would be grateful if someone had a minute or two to review my hash
table implementation. It's not yet commented but hopefully it's short
and idiomatic enough to be readable. Some of the code (i.e. the
get_hash function) is borrowed from various snippets I found on the
net. The free function could probably use some love. I have been
thinking about having a second linked list of all entries so that the
cost of freeing is in proportion to the number of entries.

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 193

struct entry {
char *key;
char *value;
struct entry *next;
};

struct hash {
struct entry *bucket[TABLE_SIZE];
};

static unsigned long get_hash(const char *key)
{
unsigned long hash = 0;

while (*key) {
hash = (hash * 191 + *key) % TABLE_SIZE;
key++;
}

return hash;
}

char *hash_lookup(struct hash *table, const char *key)
{
unsigned long hash = get_hash(key);
struct entry *p = table->bucket[hash];

while (p) {
if (!strcmp(key, p->key))
break;
p = p->next;
}

if (!p)
return NULL;

return p->value;
}

int hash_insert(struct hash *table, const char *key, char *value)
{
unsigned long hash = get_hash(key);
struct entry *p = table->bucket[hash];

while (p) {
if (!strcmp(key, p->key))
break;
p = p->next;
}

if (!p) {
p = malloc(sizeof(*p));
if (!p)
return -1;

p->key = malloc(strlen(key) + 1);
if (!p->key) {
free(p);
return -1;
}

strcpy(p->key, key);
p->next = table->bucket[hash];
table->bucket[hash] = p;
}

p->value = value;

return 0;
}

struct hash *hash_new()
{
struct hash *table = malloc(sizeof(*table));
if (!table)
return NULL;

memset(table, 0, sizeof(*table));

return table;
}

void hash_free(struct hash *table)
{
struct entry *p, *tmp;

for (int i = 0; i < TABLE_SIZE; i++) {
p = table->bucket[i];
while (p) {
tmp = p;
p = p->next;
free(tmp->key);
free(tmp);
}
}

free(table);
}
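
A minimal driver showing the intended usage (illustrative only; the keys
and values below are arbitrary, and since the values are string literals
only the table itself needs freeing):

#include <stdio.h>

int main(void)
{
    struct hash *h = hash_new();

    if (!h)
        return 1;

    hash_insert(h, "apple", "fruit");
    hash_insert(h, "carrot", "vegetable");

    printf("apple -> %s\n", hash_lookup(h, "apple"));   /* prints "fruit" */
    printf("plum  -> %s\n", hash_lookup(h, "plum")
                            ? hash_lookup(h, "plum") : "(not found)");

    hash_free(h);
    return 0;
}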

Aug 1 '06 #1
In article <11**********************@m73g2000cwd.googlegroups.com>,
Johan Tibell <jo**********@gmail.com> wrote:
hash = (hash * 191 + *key) % TABLE_SIZE;
I haven't read the rest of your code, but I recommend not building the
table size into the hash function. Use a more general hash function that
produces an int-sized value, and reduce it to the table size (with %)
only when needed to choose the bucket.

Apart from making your hash function more reusable, you can then store
the (large) hash value in the entry, and avoid many (quite likely all)
strcmp()s by comparing the hash values first.
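
A rough sketch of that shape (the multiplier and the field name below are
illustrative, not something this post prescribes):

static unsigned long get_hash(const char *key)
{
    unsigned long hash = 0;

    while (*key)
        hash = hash * 31 + (unsigned char) *key++;   /* no "% TABLE_SIZE" here */

    return hash;
}

/*
 * At the call sites:
 *     unsigned long full = get_hash(key);
 *     struct entry *p = table->bucket[full % TABLE_SIZE];
 * and, with the unreduced hash stored in each entry (say, p->full_hash),
 * most strcmp() calls can be skipped:
 *     if (p->full_hash == full && strcmp(p->key, key) == 0) ...
 */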

-- Richard
Aug 1 '06 #2
ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
>Johan Tibell <jo**********@gmail.com> wrote:
> hash = (hash * 191 + *key) % TABLE_SIZE;

I haven't read the rest of your code, but I recommend not building the
table size into the hash function.
Also consider a routine to resize the array if the ratio of entries to
table size grows too large or too small. You will need to recalculate
the hash, unless you store the hash before doing % TABLE_SIZE. If so,
a table size which is a power of 2 might be better.
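
A rough sketch of such a resize, assuming the table is changed to carry a
dynamically sized bucket array (the nbuckets/bucket fields and the
per-entry stored hash are hypothetical, along the lines suggested above):

static int hash_resize(struct hash *table, size_t new_size)
{
    struct entry **new_bucket = calloc(new_size, sizeof *new_bucket);
    size_t i;

    if (!new_bucket)
        return -1;

    for (i = 0; i < table->nbuckets; i++) {
        struct entry *p = table->bucket[i];
        while (p) {
            struct entry *next = p->next;
            size_t idx = p->hash % new_size;   /* reuse the stored hash */
            p->next = new_bucket[idx];
            new_bucket[idx] = p;
            p = next;
        }
    }

    free(table->bucket);
    table->bucket = new_bucket;
    table->nbuckets = new_size;
    return 0;
}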
Apart from making your hash function more reusable, you can then store
the (large) hash value in the entry, and avoid many (quite likely all)
strcmp()s by comparing the hash values first.
Well, successful lookups will need at least one strcmp().

Might use a "max 32 bits" type for the hash value. (From <stdint.h>
with C99, or use #include <limits.h> / #if UINT_MAX >= 0xffffffff /
typedef ... to support C90.)

Another little hack:
struct entry {
char *value;
struct entry *next;
char key[]; /* in C99, or key[1] in C89 */
};
Now allocate entries large enough to hold both the struct and the
string. The C89 way, the compiler may "know" that keys are just
1 byte wide, so the standard-compliant way to access 'key' is
(char *)entry_ptr + offsetof(struct entry, key)
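
For instance, a sketch of allocating one such entry this way (entry_new is
my name for it; stdlib.h and string.h as in the original code):

#include <stddef.h>   /* offsetof */

static struct entry *entry_new(const char *key, char *value)
{
    size_t keylen = strlen(key) + 1;
    struct entry *e = malloc(offsetof(struct entry, key) + keylen);

    if (!e)
        return NULL;

    e->value = value;
    e->next = NULL;
    memcpy((char *) e + offsetof(struct entry, key), key, keylen);
    return e;
}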

--
Hallvard
Aug 1 '06 #3
In article <hb**************@bombur.uio.no>,
Hallvard B Furuseth <h.**********@usit.uio.no> wrote:
>Well, successful lookups will need at least one strcmp().
Good point. Though with a good 64-bit hash you could easily get the
99.99% reliability mentioned in another thread :-)

-- Richard
Aug 1 '06 #4


Johan Tibell wrote On 08/01/06 12:59,:
I would be grateful if someone had a minute or two to review my hash
table implementation.
Based on "a minute or two," I see a few miscellaneous
improvements that might be made and a few rather odd design
decisions that may be worth reconsidering.
It's not yet commented but hopefully it's short
and idiomatic enough to be readable. Some of the code (i.e. the
get_hash function) is borrowed from various snippets I found on the
net. The free function could probably use some love. I have been
thinking about having a second linked list of all entries so that the
cost of freeing is in proportion to the number of entries.

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 193

struct entry {
char *key;
char *value;
Some `const' qualifiers would be welcome here.
struct entry *next;
};

struct hash {
struct entry *bucket[TABLE_SIZE];
};

static unsigned long get_hash(const char *key)
{
unsigned long hash = 0;

while (*key) {
hash = (hash * 191 + *key) % TABLE_SIZE;
Why reduce modulo TABLE_SIZE at every single character?
You'll get as good a hash code by accumulating the whole
thing at full `unsigned long' width and taking just one
remainder at the very end. (Given all that remaindering,
`unsigned long' seems a peculiar choice ...)

Others have suggested having get_hash() return an
"unreduced" or "full-width" hash code, and remaindering
it only when you're ready to access the bucket array; the
idea is to store the unreduced hash in the entry and maybe
save a strcmp() call. That's a classic trade-off: more
memory for the possibility of greater speed, but you would
need to try it both ways and make measurements to be sure
of which works better for your application.
key++;
}

return hash;
}

char *hash_lookup(struct hash *table, const char *key)
{
unsigned long hash = get_hash(key);
struct entry *p = table->bucket[hash];

while (p) {
if (!strcmp(key, p->key))
break;
p = p->next;
}

if (!p)
return NULL;

return p->value;
I see you're an adherent of some kind of fundamentalist
control-flow orthodoxy. Wouldn't this be easier to read
if the successful strcmp() were followed by `return p->value'
instead of by `break'? By insisting on `break' you've made
yourself write an extraneous test just to figure out which
of the two possible loop exit conditions caused the loop to
stop iterating. You already had that information while you
were inside the loop; why did you throw it away only to go
back and ask a second time?
}

int hash_insert(struct hash *table, const char *key, char *value)
Why is the value not `const'?
{
unsigned long hash = get_hash(key);
struct entry *p = table->bucket[hash];

while (p) {
if (!strcmp(key, p->key))
break;
p = p->next;
}
As before, if you used `return' instead of `break' you
wouldn't need the extra test.
if (!p) {
p = malloc(sizeof(*p));
if (!p)
return -1;

p->key = malloc(strlen(key) + 1);
if (!p->key) {
free(p);
return -1;
}

strcpy(p->key, key);
p->next = table->bucket[hash];
table->bucket[hash] = p;
This puts the newest item at the front of its bucket's
list. You might want to consider keeping the lists sorted
by key; the insertion is just a wee bit trickier, but an
unsuccessful search takes only half as many strcmp() calls,
on the average.
}

p->value = value;
This seems weird. You've gone to some trouble to make
a duplicate copy of the key, but then you just use the very
same value pointer the caller hands you. Copying both or
copying neither would make sense (depending on the usage),
but treating them differently strikes me as peculiar.
return 0;
As written, the function returns -1 for "ghastly failure"
and 0 for "anything else." Consider splitting the "else"
into two different reported conditions: item inserted vs.
item not inserted because of duplicate key. The caller might
be able to make productive use of that knowledge (for example,
he could figure out whether the value pointer is or is not
expendable).
}

struct hash *hash_new()
`(void)' would be a little better. If you use an
automated tool to generate function declarations for header
files, it may be able to generate a better declaration
if the definition prototypes the argument list.
{
struct hash *table = malloc(sizeof(*table));
if (!table)
return NULL;

memset(table, 0, sizeof(*table));
Likely to work, but not guaranteed. All-bits-zero is
not necessarily a null pointer, although that's a common
practice. Write a simple loop -- or, if you're really a
fan of zeroes-are-nulls, consider using calloc() instead
of malloc()-plus-memset().
return table;
}
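
For example, the simple-loop version could look like this sketch:

struct hash *hash_new(void)
{
    struct hash *table = malloc(sizeof *table);
    int i;

    if (table != NULL)
        for (i = 0; i < TABLE_SIZE; i++)
            table->bucket[i] = NULL;   /* genuine null pointers, not all-bits-zero */

    return table;
}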

void hash_free(struct hash *table)
{
struct entry *p, *tmp;

for (int i = 0; i < TABLE_SIZE; i++) {
p = table->bucket[i];
while (p) {
tmp = p;
p = p->next;
free(tmp->key);
free(tmp);
}
}

free(table);
You made mention earlier of keeping a second list so
the "cost of freeing is in proportion to the number of
entries." That's pretty much the case as things already
stand; there's the overhead of looping through TABLE_SIZE
list headers, but that won't amount to much. Notice that
with N entries you'd still make 2*N+1 free() calls anyhow.
Also, maintaining the extra list adds a tiny amount to the
cost of each insertion. "A *tiny* amount," you may say,
but consider: How does the cost of N list insertions (two
pointer stores each) compare to that of TABLE_SIZE list-
head inspections?
}
General stylistic comments (you're welcome to ignore
these; they're all "eye of the beholder" matters):

- I'd prefer testing `p == NULL' to testing `!p'. The
two mean the same thing, but the former seems to me
to read a little more directly: "Have we arrived at
the special sentinel value?" is the question being
asked, so a comparison to that value seems natural.

- In the same vein, I *really* hate `!strcmp(...)' and
*strongly* suggest `strcmp(...) == 0' instead. The
first form looks like a test for a comparison that
has failed; in fact, the test actually fires when the
comparison succeeds. (I have fixed more than one bug
that ensued when somebody mis-read the `!' and wrote
the then and else clauses backwards. Sometimes, the
poor misguided sucker was the same person who wrote
the `if (!strcmp(...))' to begin with!)

- Several of your `while' loops would probably do better
as `for' loops instead. There is no rule that says
`for' is only for counting; it's perfectly good as a
linked-list-chaser, too.
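
Putting several of those points (and the earlier remark about returning
directly from the loop) together, the lookup might read like this sketch:

char *hash_lookup(struct hash *table, const char *key)
{
    struct entry *p;

    for (p = table->bucket[get_hash(key)]; p != NULL; p = p->next)
        if (strcmp(key, p->key) == 0)
            return p->value;   /* early return instead of break-and-retest */

    return NULL;
}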

--
Er*********@sun.com

Aug 1 '06 #5
Johan Tibell wrote:
I would be grateful if someone had a minute or two to review my hash
table implementation. It's not yet commented but hopefully it's short
and idiomatic enough to be readable.
It looks ok, so my review will be based on performance. Keep in mind
that the only reason to use a hash table in the first place is because
of a concern for performance. So while my comments may seem unusual,
it really goes the heart of matter, which is to try to reduce the hash
table overhead as much as possible.
[...] Some of the code (i.e. the
get_hash function) is borrowed from various snippets I found on the
net. The free function could probably use some love. I have been
thinking about having a second linked list of all entries so that the
cost of freeing is in proportion to the number of entries.

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 193

struct entry {
char *key;
char *value;
struct entry *next;
};
So this is a linked entry (or "Open hash table") design. The problem
with choosing a single value 193 as the index size is that the hash
table will only be effective up to a limited number of entries. In
general, it's preferable to make the size a run-time value rather than
a compile-time constant.

Another point is that you should use a power of 2 for the size, not
some random odd number. This way you can use the expression (rawhash &
(TABLE_SIZE-1)) rather than using the very slow "%" operator.
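
For example, with TABLE_SIZE redefined as a power of two such as 256 (my
number, not the post's), the bucket index can be computed as:

unsigned long bucket_index(unsigned long rawhash)
{
    return rawhash & (TABLE_SIZE - 1);   /* same result as rawhash % TABLE_SIZE
                                            when TABLE_SIZE is a power of 2 */
}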

Oh, and you are making a classical mistake that people make with hash
table implementations. Remember you are investing lots of CPU cycles
computing hash values out of your keys. You want to get as much as
possible out of them. So, what you should do is also store the hash
value along with the entry:

struct entry {
char *key;
unsigned int hashvalue; /* <-- important */
char *value;
struct entry *next;
};

I will explain why this is so useful in my comments below.
struct hash {
struct entry *bucket[TABLE_SIZE];
};

static unsigned long get_hash(const char *key)
{
unsigned long hash = 0;

while (*key) {
hash = (hash * 191 + *key) % TABLE_SIZE;
key++;
}

return hash;
}
G'aah! That's going to end up being a pretty terrible hash function.
You should use either mine or Bob Jenkins which you can find in these
links:

http://www.azillionmonkeys.com/qed/hash.html
http://burtleburtle.net/bob/hash/doobs.html

Even if you want to stick with your own inline thingy, you should
design something with the "% TABLE_SIZE" hoisted out of the inner loop.
Relatively speaking the cost of "%" is *enormous*, and you should pull
those out of inner loops when you can.

In general, what you should do is let the hash function overflow, and
just accumulate into massive noisy bit outputs which use all the bits
of your unsigned int. You only perform the modulo just before its
being used as an index for your buckets. So at a minimum, for now,
just remove the "% TABLE_SIZE", then when you have some spare time go
read the links above to see how real hash functions are designed.
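
As an illustration of "accumulate at full width, reduce only at the end",
here is a sketch using the well-known 32-bit FNV-1a constants (this is not
the hash from either of the links above):

static unsigned long get_hash(const char *key)
{
    unsigned long hash = 2166136261UL;        /* 32-bit FNV offset basis */

    while (*key) {
        hash ^= (unsigned char) *key++;
        hash *= 16777619UL;                   /* 32-bit FNV prime */
        hash &= 0xffffffffUL;                 /* stay within 32 bits if long is wider */
    }

    return hash;    /* apply "% TABLE_SIZE" (or a mask) only at the bucket lookup */
}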
char *hash_lookup(struct hash *table, const char *key)
{
unsigned long hash = get_hash(key);
struct entry *p = table->bucket[hash];

while (p) {
if (!strcmp(key, p->key))
break;
p = p->next;
}

if (!p)
return NULL;

return p->value;
}
Ok, here's where the extra stored hash value becomes useful. We first
compare the hash value against the previously computed hash key for
each entry before invoking the call to strcmp. If your hash function
is good this will lead to dramatic reductions in the number of calls
to strcmp, where you should expect your bottleneck to be in many
situations. So:

char *hash_lookup(struct hash *table, const char *key) {
unsigned long hash = get_hash(key);
struct entry *p;

for (p = table->bucket[hash % TABLE_SIZE]; p; p = p->next) {
if (p->hashvalue == hash && !strcmp(key, p->key))
return p->value;
}

return NULL;
}

Notice how the extra "% TABLE_SIZE" operation is added here, with the
expectation that the hash value itself is large (typically well above
193 or whatever your index limit is). The other big change is the
extra condition p->hashvalue == hash. This way, multiple entries which
hash to the same hash index (a value from 0 to 192 inclusive), but not
the same actual full hash value (an arbitrary unsigned long) do not
need to be checked with a full strcmp call. This relies on the fact
that if the hashvalues don't match then the original keys don't match
either. Notice how this further leverages the value of the
(potentially very) expensive hash function call.
int hash_insert(str uct hash *table, const char *key, char *value)
{
unsigned long hash = get_hash(key);
Add the following:

unsigned long tableIdx = hash % TABLE_SIZE;
struct entry *p = table->bucket[hash];
So I would change this to:

struct entry *p = table->bucket[tableIdx];

With the different hash function.
while (p) {
if (!strcmp(key, p->key))
Again here we can do the prescreen test again:

if (p->hashvalue == hash && !strcmp(key, p->key))
break;
p = p->next;
}

if (!p) {
p = malloc(sizeof(*p));
if (!p)
return -1;

p->key = malloc(strlen(key) + 1);
if (!p->key) {
free(p);
return -1;
}

strcpy(p->key, key);
p->next = table->bucket[hash];
Change the above to:

p->next = table->bucket[tableIdx];

And add the following:

p->hashvalue = hash;
table->bucket[hash] = p;
Changed to:

table->bucket[tableIdx] = p;
}

p->value = value;
Ok, so you are assuming that the memory for "value"s is maintained by
the call site, but the keys are maintained by the hash table itself. Is
that what you intend?
return 0;
}

struct hash *hash_new()
{
struct hash *table = malloc(sizeof(*table));
if (!table)
return NULL;

memset(table, 0, sizeof(*table));
Oh ... this is apparently technically not completely portable.
Apparently there exist some apocryphal systems on which null pointers
are not the same as a packed sequence of 0 bytes. So you should make
this a for loop and store NULL pointers in each entry.
return table;
}

void hash_free(struct hash *table)
{
struct entry *p, *tmp;

for (int i = 0; i < TABLE_SIZE; i++) {
p = table->bucket[i];
while (p) {
tmp = p;
p = p->next;
free(tmp->key);
free(tmp);
}
}

free(table);
}
Obviously from a code implementation point of view you seem to know
what you are doing. So let me propose a few changes to your core
*design* that will help with flexibility and performance:

struct tuple {
unsigned int hashvalue; /* for prescreening */
char *value;
char key[1]; /* Inline, as suggested by others */
};

struct entriesInIndex {
size_t allocCount, length;
struct tuple ** tuples; /* A periodically realloced buffer */
};

struct hashTable {
size_t bucketCount;
struct entriesInIndex bucket[1]; /* Inline */
};

The way you allocate the inline entries is using the "struct hack" as
the snippets below show:

#include <stddef.h>
/* ... */
struct hashTable *ht = (struct hashTable *) malloc (offsetof (struct
hashTable, bucket) + sizeof (struct entriesInIndex) * qty);
/* ... */
size_t keylength = strlen (key) + 1;
struct tuple *t = (struct tuple *) malloc (offsetof (struct tuple,
key) + sizeof (char) * keylength);
memcpy (t->key, key, keylength);

The tuples field is an array instead of a sequence of linked entries.
Arrays are generally just faster than linked lists, and we can leverage
decreased allocation overhead as a result as well. So you perform a
tuple insert as follows:

struct hashTable * ht;
struct tuple * t;
unsigned long hash, tableIdx;
/* ... */
/* t is created and initialized appropriately */
if (ht->bucket[tableIdx].length >= ht->bucket[tableIdx].allocCount)
{
size_t len = ht->bucket[tableIdx].allocCount ?
2 * ht->bucket[tableIdx].allocCount : 1;
struct tuple ** ntuples = (struct tuple **) realloc
(ht->bucket[tableIdx].tuples, len * sizeof (struct tuple *));
if (!ntuples) { /* out of memory error */ }
ht->bucket[tableIdx].tuples = ntuples;
ht->bucket[tableIdx].allocCount = len;
}
ht->bucket[tableIdx].tuples[ht->bucket[tableIdx].length] = t;
ht->bucket[tableIdx].length++;

So you end up with one allocation per tuple, plus one allocation per
bucket, and one overall allocation for the hash table. So it's still
three levels of allocation, but the vast majority of them are the tuple
allocations, which I don't see an easy way of reducing.

If you don't even plan on adding a deletion function to your hash
table, then you can make your own special allocator just for the
tuples. This different allocator would be more stack-like than
heap-like. It would support allocations (like slicing sections of cheese
sticks or sausages) and a "free all" which just purged the whole thing
in one shot. This vastly improves performance, and asymptotically
reduces average allocation space overhead.
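
A minimal sketch of that kind of allocator (names and chunk size are mine;
a real version would also round n up for alignment):

#include <stdlib.h>

#define ARENA_CHUNK (64 * 1024)

struct arena {
    struct arena *prev;                 /* previously filled chunk */
    size_t used;
    unsigned char mem[ARENA_CHUNK];
};

/* Slice n bytes off the current chunk, adding a new chunk when it is full. */
void *arena_alloc(struct arena **ap, size_t n)
{
    struct arena *a = *ap;

    if (n > ARENA_CHUNK)
        return NULL;                    /* too big for a single chunk */

    if (!a || a->used + n > ARENA_CHUNK) {
        struct arena *na = malloc(sizeof *na);
        if (!na)
            return NULL;
        na->prev = a;
        na->used = 0;
        *ap = a = na;
    }

    a->used += n;
    return a->mem + a->used - n;
}

/* "Free all": release every chunk in one pass. */
void arena_free_all(struct arena **ap)
{
    while (*ap) {
        struct arena *prev = (*ap)->prev;
        free(*ap);
        *ap = prev;
    }
}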

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 1 '06 #6


we******@gmail.com wrote On 08/01/06 15:32:
Johan Tibell wrote:
>>I would be grateful if someone had a minute or two to review my hash
table implementation. It's not yet commented but hopefully it's short
and idiomatic enough to be readable.


It looks ok, so my review will be based on performance. Keep in mind
that the only reason to use a hash table in the first place is because
of a concern for performance. So while my comments may seem unusual,
it really goes the heart of matter, which is to try to reduce the hash
table overhead as much as possible.

>>[...] Some of the code (i.e. the
get_hash function) is borrowed from various snippets I found on the
net. The free function could probably use some love. I have been
thinking about having a second linked list of all entries so that the
cost of freeing is in proportion to the number of entries.

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 193

struct entry {
char *key;
char *value;
struct entry *next;
};


So this is a linked entry (or "Open hash table") design.
The nomenclature seems at variance with common practice;
can you tell us where you've encountered this usage? FWIW,
the code as given uses what Knuth calls "collision resolution
by separate chaining." The "open addressing" methods he
describes are quite different -- no links, for starters.
Another point is that you should use a power of 2 for the size, not
some random odd number. This way you can use the expression (rawhash &
(TABLE_SIZE-1)) rather than using the very slow "%" operator.
The choice of TABLE_SIZE interacts with the collision
resolution method, and also influences the sensitivity to
the "goodness" of the hash function. Saving a division may
or may not be a win; saving a division at the cost of worse
bucket distribution and X% additional strcmp() calls is quite
likely a poor trade. Measure.
Oh, and you are making a classical mistake that people make with hash
table implementations . Remember you are investing lots of CPU cycles
computing hash values out of your keys. You want to get as much as
possible out of them. So, what you should do is also store the hash
value along with the entry:

struct entry {
char *key;
unsigned int hashvalue; /* <-- important */
char *value;
struct entry *next;
};

I will explain why this is so useful in my comments below.
Again, this can be a win or a loss. It spends ~33% more
memory (on typical machines), increasing the "pressure" on
cache lines, TLBs, and the like. (How much? Measure.) In
some schemes (open addressing among them) you might be better
off using that extra memory to get more entries into the table
and decrease the load factor, thus decreasing the number of
probes per search. Is it better to have faster probes or
to have fewer probes? Measure.

One of the fascinating things about hash tables is that
the phrase "hash table" covers such a wide variety! The
techniques and trade-offs are seemingly endless: separate
chaining, coalesced chaining, open addressing, double hashing,
bucket arrangements, dynamic hashing, combination strategies
that segregate the data in different ways, ... Given the
richness of possibility, it is difficult or perhaps even
impossible to make blanket recommendations. Everything Paul
wrote is true of some problem domain or other; my only issue
is that he seems to be saying his suggestions are the cat's
meow in all situations, while I'm uttering the cautionary
"Bless you, it all depends!"

--
Er*********@sun.com

Aug 1 '06 #7
Eric Sosman wrote:
we******@gmail.com wrote On 08/01/06 15:32:
Johan Tibell wrote:
>I would be grateful if someone had a minute or two to review my hash
table implementation. It's not yet commented but hopefully it's short
and idiomatic enough to be readable.
It looks ok, so my review will be based on performance. Keep in mind
that the only reason to use a hash table in the first place is because
of a concern for performance. So while my comments may seem unusual,
it really goes the heart of matter, which is to try to reduce the hash
table overhead as much as possible.
>[...] Some of the code (i.e. the
get_hash function) is borrowed from various snippets I found on the
net. The free function could probably use some love. I have been
thinking about having a second linked list of all entries so that the
cost of freeing is in proportion to the number of entries.

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 193

struct entry {
char *key;
char *value;
struct entry *next;
};
So this is a linked entry (or "Open hash table") design.

The nomenclature seems at variance with common practice;
can you tell us where you've encountered this usage? FWIW,
the code as given uses what Knuth calls "collision resolution
by separate chaining." The "open addressing" methods he
describes are quite different -- no links, for starters.
I learned this nomenclature (open hash, not open address) in university
and I have seen it cited elsewhere since. I did not study it via
Knuth.
Another point is that you should use a power of 2 for the size, not
some random odd number. This way you can use the expression (rawhash &
(TABLE_SIZE-1)) rather than using the very slow "%" operator.

The choice of TABLE_SIZE interacts with the collision
resolution method, and also influences the sensitivity to
the "goodness" of the hash function. Saving a division may
or may not be a win; saving a division at the cost of worse
bucket distribution and X% additional strcmp() calls is quite
likely a poor trade. Measure.
I *have* measured this. I wonder if you have. The only really bad
hash functions that are affected by power of 2 table sizes are things
like CRCs. But there are other hash functions available, as I gave
links to some. The multiply then add class of hash functions are going
to be poor regardless, but they are not especially adversely affected
by the choice of power 2 or not for the table size.

Classically the choice has been, power of 2 table sizes with odd
reprobe deltas, and prime table sizes with arbitrary reprobe deltas.
You can attack the reprobe problem using quadratic probing, however, it
turns out that power of 2 table sizes and *prime number* (greater than
the expected maximum reprobe sequence length) reprobe deltas completely
solve the problem and removes the % overhead. I.e., the alternatives
can't match the immediate performance while reprobing correctly.

But in this case, he's chaining, so reprobing doesn't matter. I.e.,
power of 2 sizes is an easy choice that has no complications
whatsoever.
Oh, and you are making a classical mistake that people make with hash
table implementations. Remember you are investing lots of CPU cycles
computing hash values out of your keys. You want to get as much as
possible out of them. So, what you should do is also store the hash
value along with the entry:

struct entry {
char *key;
unsigned int hashvalue; /* <-- important */
char *value;
struct entry *next;
};

I will explain why this is so useful in my comments below.

Again, this can be a win or a loss. It spends ~33% more
memory (on typical machines), increasing the "pressure" on
cache lines, TLBs, and the like. (How much? Measure.) In
some schemes (open addressing among them) you might be better
off using that extra memory to get more entries into the table
and decrease the load factor, thus decreasing the number of
probes per search. Is it better to have faster probes or
to have fewer probes? Measure.
I *have* measured it. You didn't even measure well enough to even
realize that that 33% figure is completely bogus. Notice that he is
maintaining char * key values for each struct entry that gets filled in.
You still think this additional field is costing him that much?

This is why I made this recommendation completely unqualified. He
isn't asking for a general hash table solution. He's clearly
specifically looking for a string -> string hash map. This narrows
things considerably, and it means the pre-screening stored hash value
isn't going to cost anywhere close to 33% extra overhead. So it's
almost certainly just going to be worth it to do pre-screening.
One of the fascinating things about hash tables is that
the phrase "hash table" covers such a wide variety! The
techniques and trade-offs are seemingly endless: separate
chaining, coalesced chaining, open addressing, double hashing,
bucket arrangements, dynamic hashing, combination strategies
that segregate the data in different ways, ... Given the
richness of possibility, it is difficult or perhaps even
impossible to make blanket recommendations.
Except that it isn't. I've tried a number of those variations on
different scenarios, including string -> <something> hashing (and
remember that the strings I use are faster than the strings he or you
are using, thus increasing the emphasis on hash table overhead) and the
results, interestingly (and intuition-defyingly) enough, are that the
kind of hash table basically doesn't matter. (The one obvious
exception being that you only go with chaining when you need
hard-realtime and can't go with an auto-resizing in-place hash
table.)

Once you remove all primary bottlenecks (via pre-screening the hash
before raw compare, a mask instead of a modulo, and any reasonably
sound design) the hash table overhead itself is just nothing. It all
comes down to the speed of the hash function and the speed of the
comparison function (and the copy/update functions, if yours happen to
be slow for some reason.) On average matches you have to pay for 1
hash and 1 compare, with probability of 1-epsilon. On average misses
you pay just for the 1 hash with a probability of 1-epsilon (and I
don't remember the probes, but it was less than 1.0 of course.) These
observations are only true if you use pre-screening. With in-place
hash tables (no chaining, but reprobing) you see an average of about
1.4 reprobes that basically cost nothing since they are just integer
compares inside of a loop. With chaining, obviously it depends on how
well your index table size was chosen, and how well your hash function
works -- little else matters. Once you realize this, you quickly see
that all this variation just isn't buying you anything real. The
*purpose* of a hash function is to be an improvement in *speed*. In the
face of these observations, this means our considerations are dominated
by micro-optimizations.

Remember that the theoretical problems that Knuth ran into were at an
age where good hash functions just did not exist. It's not bad for 1969
when he originally wrote that stuff, but what do you want; the world
(including Knuth himself) has moved on since then.
[...] Everything Paul
wrote is true of some problem domain or other; my only issue
is that he seems to be saying his suggestions are the cat's
meow in all situations, while I'm uttering the cautionary
"Bless you, it all depends!"
Yes, because these positions are of course compatible with the fact
that you are ignorant and I am not. Because, guess what, I *HAVE*
measured these things. The "scenarios" you are suggesting include the
OP's scenario (string->string hashing.)

And as is usual around here, of course, you are reading into things
that I have not said -- the OP's hashing scenario is very particular,
which you might have understood if you read his code. So the design
affecting micro-optimization techniques I suggest, besides being
applicable to many scenarios, are easily covered by the OP's very
narrow single scenario. So its a very specific answer to the very
specific question he had. If you still think I seem to suggest
otherwise, I can't help you; go see a psychiatrist, or an English
teacher or something.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 1 '06 #8
