crc32 to be used as hash

Weiguang Shi

Hi there,

I'm thinking of using binascii.crc32 as a hash-function when I read in
the reference
http://www.python.org/doc/current/li...binascii.html:

crc32( data[, crc])

Compute CRC-32, the 32-bit checksum of data, starting with an
initial crc. This is consistent with the ZIP file checksum.
Since the algorithm is designed for use as a checksum algorithm,
it is not suitable for use as a general hash algorithm.

CRC32 has been shown in the (Internetworking) literature that it can
be used as a good hash function. I'm wondering what's the concern
here.

Thanks
Weiguang

Jul 18 '05 #1

Subscribe Post Reply

5807

Tim Peters

[Weiguang Shi]

I'm thinking of using binascii.crc32 as a hash-function when I read in
the reference
http://www.python.org/doc/current/li...binascii.html:

crc32( data[, crc])

Compute CRC-32, the 32-bit checksum of data, starting with an
initial crc. This is consistent with the ZIP file checksum.
Since the algorithm is designed for use as a checksum algorithm,
it is not suitable for use as a general hash algorithm.

CRC32 has been shown in the (Internetworking) literature that it can
be used as a good hash function. I'm wondering what's the concern
here.

Sorry, I've no idea what "(Internetworking) literature" means.

PythonLabs ran large-scale statistical tests of string hashing
algorithms in 2000, using millions of strings taken from a large email
corpus. binascii.crc32 delivered collision rates hundreds of standard
deviations worse than a "truly random" 32-bit hash would have
delivered. That's the concern.

In the same tests, Python's builtin hash() was statistically
indistinguishable from a random 32-bit hash function wrt # of
collisions.

BTW, it's possible to design CRC polynomials with better hash
statistics, but that's not what binascii.crc32 was designed for. Like
most CRC polynomials in wide use, it was designed to catch common
forms of data corruption (single-bit errors, bursts of 0 or 1 bits at
either end, stuff like that).

Jul 18 '05 #2

Weiguang Shi

Thanks for the explanation.

The literature I was refering to includes:

.. R. Jain, A Comparison of Hashing Schemes for Address Lookup in
Computer Networks, IEEE Trans. on Communications, 1992, v40, n3.

.. Z. Cao, Z. Wang, E. Zegura, Performance of Hashing-based Schemes for
Internet Load Balancing, IEEE INFOCOM 2000, p332-341.

Weiguang

Jul 18 '05 #3

Tim Peters

[Weiguang Shi]

The literature I was refering to includes:

. R. Jain, A Comparison of Hashing Schemes for Address Lookup in
Computer Networks, IEEE Trans. on Communications, 1992, v40, n3.

. Z. Cao, Z. Wang, E. Zegura, Performance of Hashing-based Schemes for
Internet Load Balancing, IEEE INFOCOM 2000, p332-341.

So they're doing hashing of "small" address-related inputs, squashing
them down a lot, and using a variety of different 32- and 16-bit CRC
algorithms. In one case, they found that simply xor'ing fields
together did very well -- which implies that they weren't facing a
very demanding problem.

There's no reason to suppose, a priori, that's got much to do with the
32-bit hash distribution of strings-in-email PythonLabs tested.

Since you haven't said what it is you're trying to hash, it's anyone's
guess how relevant what anyone has done to what you want to do. So
the thing for you to do is to write a little app and *try* different
ways. Details can matter, a lot.

For example, we didn't run any tests on subfields of 32-bit hashes,
and some bits may be "more random" than others. One of the papers you
cited found that it wasn't so in what they were testing. But here's
an example showing that's not a reliable conclusion in general:

"""
def drive():
from binascii import crc32
mask = 0x1fffff
d = {}
count = 0
bytes = map(chr, range(256))
for i in bytes[-32:]:
for j in bytes:
prefix = i+j
count += len(bytes)
for k in bytes:
d[crc32(prefix + k) & mask] = 1
print mask+1, "bins", count, "balls", len(d), "occupied"

drive()
"""

This is the output:

2097152 bins 2097152 balls 1048576 occupied

Amazingly enough, crc32 on this data, and taking the last 21 bits,
managed to miss half the bins entirely. For a truly random hash
yielding 21 bits, the expected # of occupied bins is about 1325653
when tossing 2**21 balls into 2**21 buckets, with a standard deviation
of about 452. So this is more than 600 sdevs "worse than random".

If we replace crc32 with hash, the output changes to

2097152 bins 2097152 balls 1490432 occupied

which is a few hundred sdevs "better than random". The reason
Python's hash can be "better than random" for bin problems is
explained in the comments in dictobject.c -- some of the internal hash
functions are specifically trying to give better-than-random collision
statistics when fed dense inputs, because hashing is used in Python
mostly to drive dict lookups.

This isn't 100% reliable either, though, and you can fiddle the above
to find subfields and inputs where crc32 does better than the builtin
hash. For example, in the above, if the hash is shifted right by 11
before taking the last 21 bits, the builtin hash does horribly,
filling only 8K bins, but crc32 doesn't do any worse then than it did
before.

Jul 18 '05 #4

Nick Craig-Wood

Tim Peters <ti********@gmail.com> wrote:
[snip interesting info about CRC tests]

BTW, it's possible to design CRC polynomials with better hash
statistics, but that's not what binascii.crc32 was designed for.
Like most CRC polynomials in wide use, it was designed to catch
common forms of data corruption (single-bit errors, bursts of 0 or
1 bits at either end, stuff like that).

[Digression] Amongst other properties, data with CRC32(data) appended
is guaranteed a Hamming distance of 3. You can actually use the CRC32
to correct single bit errors (at the cost of some of its error
detection properties). That isn't usually worth it though! (I spent a
long time (a long time ago now) analysing and coding CRC (and FEC)
polynomials ;-)

Anyway, another reason why you'd not want to use it as a hash is that
its linear and its trivial to find things which have the same crc32
value. Its really easy to run backwards and forwards. This may or may
not be important for the application depending on whether you are
expecting malicious attack on the hashes. Probably better safe than
sorry though!

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick

Jul 18 '05 #5

Weiguang Shi

Tim Peters wrote:

...
together did very well -- which implies that they weren't facing a
very demanding problem. I don't know about that. But all the experiments were trace-driven and
showed CRC32 is an excellent hash function (for the problem). Since
there haven't been counter-results reported, we might be able to
reject the null hypothesis: CRC32 is _not_ good for hashing address
traces.

There's no reason to suppose, a priori, that's got much to do with the
32-bit hash distribution of strings-in-email PythonLabs tested.
...
And I didn't say it did. Different workloads.

It should be apparent by now I was trying to do the similar things the
cited works did. As I gladly found there is a CRC32 function, upon my
first program in Python, I was alerted by the reference page saying
it's not a good hash function and wondered for what particular
reasons.
...

This isn't 100% reliable either, though, and you can fiddle the above
to find subfields and inputs where crc32 does better than the builtin
hash. For example, in the above, if the hash is shifted right by 11
before taking the last 21 bits, the builtin hash does horribly,
filling only 8K bins, but crc32 doesn't do any worse then than it did
before.

It's good to know. According to Jain's paper, CRC32 _is_ relatively
bit-wise random (again, for the network workload). That's the property
I'm after.

Thanks for the discussion and the program.

Weiguang

Jul 18 '05 #6

Mike Meyer

wg***@namao.cs.ualberta.ca (Weiguang Shi) writes:

It's good to know. According to Jain's paper, CRC32 _is_ relatively
bit-wise random (again, for the network workload). That's the property
I'm after.

Since no one else has mentioned it - have you considered using md5,
which is also built into Python?

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Jul 18 '05 #7

Similar topics

Calculating CRC32 for uploaded files

by: Ricky Romaya | last post by:

Hi, I'm working on a file upload script. I need to calculate the CRC32 of the file(s) which are successfully uploaded. How can I do this? PHP only have CRC32 function for strings. However, the...

PHP

c to java CRC32 code conversion

by: PaullyB | last post by:

Hi, I am attempting to convert the following code written in c to equivalent java code. This is the CRC32 algorithm used by a GPS received I am interfacing with. Unfortunately, the CRC32 class...

Java

Non-math-geek needs help with CRCs (binascii.crc32)

by: nobody | last post by:

1) Does anyone know if the CRC32 algorithm in binascii has a name? There seem to be a TON of different CRC32 methods; different polynomials, different byte orders, different seeds, some flip the...

Python

Creating a MD5 or CRC32 hash from a file

by: Ericko | last post by:

I want to create a hash of file in order to determine if 2 files are the same, I can't find anything in the standard libraries. Is there some class I've overlooked? or do I have to get third party...

C / C++

CRC32

by: Don | last post by:

Hi NG. Does anyone know of a place where I could download/get a C implementation of a CRC32 check. I would like a simple function that, for example, had a pointer to where the data to be CRC32...

C / C++

looking for a complete CRC32 program

by: UnixUser | last post by:

I am looking for some source code to run on Linux that will enable me to calculate and return a CRC32 value from a string of text. I have found one from snippets.org, but I cannot get it to...

C / C++

CRC32 Fun - How do I initialize a const static int array in an un-managed C++ class (VS2003/C++)

by: Russell Mangel | last post by:

Hi, I would like to encapsulate the following CRC32 routine inside a C++ un-mananged class. I am having difficulty initializing the "const static unsigned int table" inside an un-managed C++...

.NET Framework

CRC32 User Function Library

by: Paul M. | last post by:

Hello, does anyone have either a User Function Library (or the source for one) to create a CRC32 checksum for a given string? I want to use the function in a crystal formula thus: formula =...

Visual Basic .NET

binascii.crc32 results not matching

by: Larry Bates | last post by:

I'm trying to get the results of binascii.crc32 to match the results of another utility that produces 32 bit unsigned CRCs. binascii.crc32 returns results in the range of -2**31-1 and 2**21-1....

Python

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice