I've been doing some investigation of Python hashtables and the hash
function used in Python.

It turns out that when running pystone.py, as much time is spent in
string_hash() as is spent in lookdict_string(). If you look at
Python's hash function, shown below, you can get an idea why: a
multiply is used for every character in the string.

static long
string_hash1(register char *p, int size)
{
    register int len;
    register long x;

    len = size;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= size;
    return x;
}
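For anyone who wants to experiment with this loop outside the interpreter, here is a standalone copy that compiles on its own (the wrapper name is mine, and I've used CPython's 1000003 multiplier):

```c
#include <stddef.h>

/* Standalone copy of the interpreter's multiply-and-xor string hash.
   One multiply per byte is the cost being measured.  p must point at
   at least `size` valid bytes (a NUL-terminated C string is fine). */
static long
py_style_hash(const char *p, size_t size)
{
    long x;
    size_t len = size;

    x = *p << 7;                           /* seed from first char */
    while (len-- > 0)
        x = (1000003 * x) ^ (unsigned char)*p++;
    x ^= (long)size;                       /* mix the length in */
    return x;
}
```

Note that because the last character is xored in after an identical prefix, two equal-length strings differing only in the final byte are guaranteed to hash differently.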

Looking around the web, I found the following hash function, which is
much faster and seems to distribute better than the Python hash
function.

static long
string_hash3(register char *p, register int size)
{
    register long x;
    register char c;

    x = 0xd2d84a61;
    while ((c = *p++) != 0)
        x ^= (x<<7) + c + (x>>5);
    return x;
}
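A self-contained version of this shift/add/xor loop, for experimenting (the name is mine). One thing worth noting: it walks the buffer until it hits a NUL byte and never consults the size parameter, so unlike the interpreter's hash it cannot handle strings with embedded '\0' characters. I've also made the accumulator unsigned so the right shift is well defined:

```c
/* Standalone copy of the shift/add/xor hash.  x is unsigned so that
   x >> 5 is a plain logical shift (right-shifting a negative signed
   long is implementation-defined).  Stops at the first NUL byte. */
static long
shift_xor_hash(const char *p)
{
    unsigned long x = 0xd2d84a61UL;
    char c;

    while ((c = *p++) != '\0')
        x ^= (x << 7) + c + (x >> 5);
    return (long)x;
}
```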

I also came up with a hash function of my own, which seems to
distribute near-perfectly (in the sense of being comparable to
assigning random numbers as hashes to unique strings) and to be
faster yet (at least on my P4 box).

static long
string_hash6(register unsigned short *p, int size)
{
    register int len;
    register unsigned long x = 0;

    if (size == 0) return 0;
    len = (size+1) >> 1;
    while (len--) {
        x += (x>>14) + (*p++ * 0xd2d84a61);
    }
    x += (x>>14) + (size*0xd2d84a61);
    return x;
}
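Since this one consumes the string two bytes at a time as unsigned shorts, here is a compilable sketch (name mine). Two caveats worth flagging: the result depends on byte order, and for odd sizes the final short picks up the byte after the string as padding, so the buffer should be NUL-terminated (and suitably aligned for short access):

```c
#include <stddef.h>

/* Standalone copy of the two-bytes-at-a-time hash.  Reading the
   buffer as unsigned shorts makes the result endian-dependent; for
   odd sizes the last chunk includes the terminating NUL as padding. */
static long
pair_hash(const unsigned short *p, size_t size)
{
    unsigned long x = 0;
    size_t len;

    if (size == 0)
        return 0;
    len = (size + 1) >> 1;        /* number of 16-bit chunks */
    while (len--)
        x += (x >> 14) + (*p++ * 0xd2d84a61UL);
    x += (x >> 14) + size * 0xd2d84a61UL;   /* mix in the length */
    return (long)x;
}
```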

I've tested these functions by hashing a large set of strings generated
by instrumenting the Python interpreter to emit a string every time a
string is added to a dictionary. These strings are hashed and thrown
into the buckets of various sized tables. I then calculate sigma
statistics (sum((observed-expected)^2)/(N-1)) for the number of items
in the buckets of those tables.
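For concreteness, the statistic can be computed from the bucket counts like this (a sketch with my own naming; I'm reading N as the number of buckets and taking expected = keys/buckets):

```c
#include <stddef.h>

/* Bucket-distribution statistic: sum((observed - expected)^2) over
   all buckets, divided by (nbuckets - 1), where expected is the
   per-bucket average nkeys / nbuckets.  A perfectly even spread
   gives 0; lumpier distributions give larger values. */
static double
bucket_sigma(const long *counts, size_t nbuckets, long nkeys)
{
    double expected = (double)nkeys / (double)nbuckets;
    double sum = 0.0;
    size_t i;

    for (i = 0; i < nbuckets; i++) {
        double d = (double)counts[i] - expected;
        sum += d * d;
    }
    return sum / (double)(nbuckets - 1);
}
```

A hash that spreads keys like random assignment should produce values close to the expected bucket-count variance; pathological hashes show up as much larger numbers.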

I'm not sure what other tests to try out. I'm hoping that someone on
c.l.py has some experience in testing hash functions and can suggest
some additional tests and/or tweaks.