go****@yahoo.com wrote:

> Hi there,
>
> There's a classic hash function to hash strings, where MULT is
> defined as "31":
>
>     //from programming pearls
>     unsigned int hash(char *ptr)
>     {   unsigned int h = 0;
>         unsigned char *p = ptr;
>         int n;
>         for (n = k; n > 0; p++) {
>             h = MULT * h + *p;
>             if (*p == 0)
>                 n--;
>         }
>         return h % NHASH;
>     }

It looks like this code was probably mis-transcribed.  There's
this undefined quantity `k' floating around, and the behavior at
end-of-string can only be called broken.  But that doesn't seem
central to your question:

> Why MULT defined as 31? ( How about 29? 24? or 26? )

It's a mixture of superstition and good sense.

First, the superstition: People who write code having to do
with hash tables apparently recall that prime numbers are
particularly "good" for them.  It seems they don't always remember
just what the "goodness" was or in what connection, but they'll
throw prime numbers into the mix whenever they can.  They'll throw
in prime numbers even if they're not too sure what a prime number
is!  A colleague of mine once ran across this little coding gem:

    #define HASHSIZE 51 /* a smallish prime */

Second, the good sense: Suppose MULT were 26, and consider
hashing a hundred-character string.  How much influence does the
string's first character have on the final value of `h', just
before the mod operation?  The first character's value will have
been multiplied by MULT 99 times, so if the arithmetic were done
in infinite precision the value would consist of some jumble of
bits followed by 99 low-order zero bits -- each time you multiply
by MULT you introduce another low-order zero, right?  The
computer's finite arithmetic just chops away all the excess
high-order bits, so the first character's actual contribution to
`h' is ... precisely zero!  The `h' value depends only on the
rightmost 32 string characters (assuming a 32-bit int), and even
then things are not wonderful: the first of those final 32 bytes
influences only the leftmost bit of `h' and has no effect on the
remaining 31.  Clearly, an even-valued MULT is a poor idea.

Need MULT be prime?  Not as far as I know (I don't know
everything); any odd value ought to suffice.  31 may be attractive
because it is close to a power of two, and it may be easier for
the compiler to replace a possibly slow multiply instruction with
a shift and subtract (31*x == (x << 5) - x) on machines where it
makes a difference.  Setting MULT one greater than a power of two
(e.g., 33) would also be easy to optimize, but might produce too
"simple" an arrangement: mostly a juxtaposition of two copies of
the original set of bits, with a little mixing in the middle.  So
you want an odd MULT that has plenty of one-bits.

You'd also like a MULT that "smears" its operand bits across
as much of `h' as you can manage, so MULT shouldn't be too small
(consider the degenerate case of MULT==1).  Also, MULT shouldn't
be too large -- to put it differently, UINT_MAX-MULT shouldn't be
too small.  How small is "too small" depends to some extent on
the expected lengths of the strings; I suspect 31 is "too small"
if the strings are short (the values won't have time to invade
the high-order part of `h').  I think it would be wiser to choose
MULT somewhere between sqrt(UINT_MAX) and UINT_MAX-sqrt(UINT_MAX),
making sure it's odd and has lots of one-bits.  Primality doesn't
seem important here -- but perhaps someone else may offer a good
reason to choose a prime.  Sometimes superstition has valid roots.

--
Eric Sosman
es*****@acm-dot-org.invalid