Hash Function (str -> 32 bits)

barcaroller

I couldn't find a message-digest newsgroup, so I posted here. I have a C
function that converts a string of arbitrary length to a 32-bit hash value.
I realize this is overkill but I used OpenSSL's sha1() to convert the string
to a SHA-1 160-bit message digest.

The question is: how do I use these 160 bits to get my final 32 bits?
Should I use the first 32 bits or the last 32 bits or is there yet another
method?

Note: I understand that the 32-bit result is much more likely to cause
collisions than the SHA-1 message digest.

Feb 19 '06 #1

Subscribe Reply

3844

websnarf

barcaroller wrote:

I couldn't find a message-digest newsgroup, so I posted here.
sci.crypt is the group you want. But comp.programming would have
worked just as well.
[...] I have a C
function that converts a string of arbitrary length to a 32-bit hash value.
I realize this is overkill but I used OpenSSL's sha1() to convert the string
to a SHA-1 160-bit message digest.
Yes that is overkill. If all you want is 32 bits, then you can use the
function found here:

http://www.pobox.com/~qed/hash.html

Its a smaller and much faster piece of code.
The question is: how do I use these 160 bits to get my final 32 bits?
Should I use the first 32 bits or the last 32 bits or is there yet another
method?

You may pick *any* of the 32 bits in the digest. That's kind of the
point of it.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Feb 19 '06 #2

barcaroller

<we******@gmail.com> wrote in message
news:11**********************@o13g2000cwo.googlegr oups.com...

Yes that is overkill. If all you want is 32 bits, then you can use the
function found here:

http://www.pobox.com/~qed/hash.html

Its a smaller and much faster piece of code.

You may pick *any* of the 32 bits in the digest. That's kind of the
point of it.

Thanks for your response. Two questions for you:

1. Does the function you mentioned cause less collisions than SHA-1
converted to 32 bits?

2. Are you saying that choosing the last 32 bits or the first 32 bits in the
SHA-1 message digest would (on average) result in the same number of
collisions?

Feb 19 '06 #3

websnarf

barcaroller wrote:

<we******@gmail.com> wrote in message
news:11**********************@o13g2000cwo.googlegr oups.com...
Yes that is overkill. If all you want is 32 bits, then you can use the
function found here:

http://www.pobox.com/~qed/hash.html

Its a smaller and much faster piece of code.

You may pick *any* of the 32 bits in the digest. That's kind of the
point of it.
Thanks for your response. Two questions for you:

1. Does the function you mentioned cause less collisions than SHA-1
converted to 32 bits?

It depends on the situation. If you are trying to formulaically attack
the hash, then SHA-1 is clearly better, since you require a brute force
attack for *each* collision (my hash will fall to deterministic attacks
which can be generalized to any size). If you are expecting random,
uniform, or otherwise typical data, then they will behave practically
the same.
2. Are you saying that choosing the last 32 bits or the first 32 bits in the
SHA-1 message digest would (on average) result in the same number of
collisions?

Yes -- minimal.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Feb 19 '06 #4

CBFalconer

barcaroller wrote:

I couldn't find a message-digest newsgroup, so I posted here. I
have a C function that converts a string of arbitrary length to a
32-bit hash value. I realize this is overkill but I used OpenSSL's
sha1() to convert the string to a SHA-1 160-bit message digest.

The question is: how do I use these 160 bits to get my final 32
bits? Should I use the first 32 bits or the last 32 bits or is
there yet another method?

You can find some simple yet effective string hashing functions in
the source code for hashlib, together with some references to other
possible functions. Hashlib is at:

<http://cbfalconer.home.att.net/download/hashlib.zip>

Since hashlib was originally developed to try out hash functions,
it can still be used for the same purpose. It collects suitable
statistics during a run.

Hashlib is implemented in C, but this general subject is better
suited to comp.programming.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>

Feb 19 '06 #5

Michael Wojcik

In article <11**********************@g44g2000cwa.googlegroups .com>, we******@gmail.com writes:

barcaroller wrote:
2. Are you saying that choosing the last 32 bits or the first 32 bits in the
SHA-1 message digest would (on average) result in the same number of
collisions?

As Paul already noted, this is off-topic for comp.lang.c, and SHA-1
is probably not well-suited to your needs.
Yes -- minimal.

More formally, you would have a collision with probability 1/2 after
hashing 2**16 objects in this fashion, per the "birthday paradox".
That's the best you can do for a pseudorandom hash, or for any hash
over random input. (If you know your input is biased, you can bias
your hash accordingly and reduce chances of collision further.)

Note this assumes that SHA-1 is not broken, or is broken in a fashion
that does not bias it.

SHA-1 is a cryptographic message digest. Any non-broken crypto
digest of length N bits must have the full avalanche property, or
it's advertising more bits of strength than it actually has. That
means each input bits affects each output bit independently with
probability 1/2. And that means that the output bits in the final
result are independent.

It doesn't matter how you reduce those N bits (assuming N > 32) to 32
bits, so long as you allow each of your output bits to be fairly
determined by at least bit of entropy from those N bits. You can
take any 32 bits from the digest; you can take 64 bits as two groups
of 32 and XOR them together. Makes no difference.

Obviously you can misapply the input bits and reduce entropy, eg by
ORing two groups of 32 bits together, which will bias bits toward 1.
But any 32 bits picked at random from the output of a non-broken
digest will be equally well-distributed. They each have value 1 with
probability 1/2.
--
Michael Wojcik mi************@microfocus.com

Web 2.0 is ... like talking to people - without the pesky annoyance of
other people. -- Martin Wood

Feb 21 '06 #6

Walter Roberson

In article <dt*******@news3.newsguy.com>,
Michael Wojcik <mw*****@newsguy.com> wrote:

In article <11**********************@g44g2000cwa.googlegroups .com>, we******@gmail.com writes:
barcaroller wrote:
> 2. Are you saying that choosing the last 32 bits or the first 32 bits in the
> SHA-1 message digest would (on average) result in the same number of
> collisions?
As Paul already noted, this is off-topic for comp.lang.c, and SHA-1
is probably not well-suited to your needs. More formally, you would have a collision with probability 1/2 after
hashing 2**16 objects in this fashion, per the "birthday paradox".

Minor OT correction: you would need 1L<<16 * pow(2*log(2),1/2.)
That's about 1.1774 times larger than your count.
--
I was very young in those days, but I was also rather dim.
-- Christopher Priest

Feb 21 '06 #7

Similar topics

python hash function

by: Damien Morton | last post by:

Ive been doing some investigation of python hashtables and the hash function used in Python. It turns out, that when running pystone.py, as much time is spent in string_hash() as is spent in...