473,396 Members | 1,693 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,396 software developers and data experts.

crc32 to be used as hash

Hi there,

I'm thinking of using binascii.crc32 as a hash-function when I read in
the reference
http://www.python.org/doc/current/li...binascii.html:

crc32( data[, crc])

Compute CRC-32, the 32-bit checksum of data, starting with an
initial crc. This is consistent with the ZIP file checksum.
Since the algorithm is designed for use as a checksum algorithm,
it is not suitable for use as a general hash algorithm.

CRC32 has been shown in the (Internetworking) literature that it can
be used as a good hash function. I'm wondering what's the concern
here.

Thanks
Weiguang
Jul 18 '05 #1
6 5807
[Weiguang Shi]
I'm thinking of using binascii.crc32 as a hash-function when I read in
the reference
http://www.python.org/doc/current/li...binascii.html:

crc32( data[, crc])

Compute CRC-32, the 32-bit checksum of data, starting with an
initial crc. This is consistent with the ZIP file checksum.
Since the algorithm is designed for use as a checksum algorithm,
it is not suitable for use as a general hash algorithm.

CRC32 has been shown in the (Internetworking) literature that it can
be used as a good hash function. I'm wondering what's the concern
here.


Sorry, I've no idea what "(Internetworking) literature" means.

PythonLabs ran large-scale statistical tests of string hashing
algorithms in 2000, using millions of strings taken from a large email
corpus. binascii.crc32 delivered collision rates hundreds of standard
deviations worse than a "truly random" 32-bit hash would have
delivered. That's the concern.

In the same tests, Python's builtin hash() was statistically
indistinguishable from a random 32-bit hash function wrt # of
collisions.

BTW, it's possible to design CRC polynomials with better hash
statistics, but that's not what binascii.crc32 was designed for. Like
most CRC polynomials in wide use, it was designed to catch common
forms of data corruption (single-bit errors, bursts of 0 or 1 bits at
either end, stuff like that).
Jul 18 '05 #2
Thanks for the explanation.

The literature I was refering to includes:

.. R. Jain, A Comparison of Hashing Schemes for Address Lookup in
Computer Networks, IEEE Trans. on Communications, 1992, v40, n3.

.. Z. Cao, Z. Wang, E. Zegura, Performance of Hashing-based Schemes for
Internet Load Balancing, IEEE INFOCOM 2000, p332-341.

Weiguang
Jul 18 '05 #3
[Weiguang Shi]
The literature I was refering to includes:

. R. Jain, A Comparison of Hashing Schemes for Address Lookup in
Computer Networks, IEEE Trans. on Communications, 1992, v40, n3.

. Z. Cao, Z. Wang, E. Zegura, Performance of Hashing-based Schemes for
Internet Load Balancing, IEEE INFOCOM 2000, p332-341.


So they're doing hashing of "small" address-related inputs, squashing
them down a lot, and using a variety of different 32- and 16-bit CRC
algorithms. In one case, they found that simply xor'ing fields
together did very well -- which implies that they weren't facing a
very demanding problem.

There's no reason to suppose, a priori, that's got much to do with the
32-bit hash distribution of strings-in-email PythonLabs tested.

Since you haven't said what it is you're trying to hash, it's anyone's
guess how relevant what anyone has done to what you want to do. So
the thing for you to do is to write a little app and *try* different
ways. Details can matter, a lot.

For example, we didn't run any tests on subfields of 32-bit hashes,
and some bits may be "more random" than others. One of the papers you
cited found that it wasn't so in what they were testing. But here's
an example showing that's not a reliable conclusion in general:

"""
def drive():
from binascii import crc32
mask = 0x1fffff
d = {}
count = 0
bytes = map(chr, range(256))
for i in bytes[-32:]:
for j in bytes:
prefix = i+j
count += len(bytes)
for k in bytes:
d[crc32(prefix + k) & mask] = 1
print mask+1, "bins", count, "balls", len(d), "occupied"

drive()
"""

This is the output:

2097152 bins 2097152 balls 1048576 occupied

Amazingly enough, crc32 on this data, and taking the last 21 bits,
managed to miss half the bins entirely. For a truly random hash
yielding 21 bits, the expected # of occupied bins is about 1325653
when tossing 2**21 balls into 2**21 buckets, with a standard deviation
of about 452. So this is more than 600 sdevs "worse than random".

If we replace crc32 with hash, the output changes to

2097152 bins 2097152 balls 1490432 occupied

which is a few hundred sdevs "better than random". The reason
Python's hash can be "better than random" for bin problems is
explained in the comments in dictobject.c -- some of the internal hash
functions are specifically trying to give better-than-random collision
statistics when fed dense inputs, because hashing is used in Python
mostly to drive dict lookups.

This isn't 100% reliable either, though, and you can fiddle the above
to find subfields and inputs where crc32 does better than the builtin
hash. For example, in the above, if the hash is shifted right by 11
before taking the last 21 bits, the builtin hash does horribly,
filling only 8K bins, but crc32 doesn't do any worse then than it did
before.
Jul 18 '05 #4
Tim Peters <ti********@gmail.com> wrote:
[snip interesting info about CRC tests]
BTW, it's possible to design CRC polynomials with better hash
statistics, but that's not what binascii.crc32 was designed for.
Like most CRC polynomials in wide use, it was designed to catch
common forms of data corruption (single-bit errors, bursts of 0 or
1 bits at either end, stuff like that).


[Digression] Amongst other properties, data with CRC32(data) appended
is guaranteed a Hamming distance of 3. You can actually use the CRC32
to correct single bit errors (at the cost of some of its error
detection properties). That isn't usually worth it though! (I spent a
long time (a long time ago now) analysing and coding CRC (and FEC)
polynomials ;-)

Anyway, another reason why you'd not want to use it as a hash is that
its linear and its trivial to find things which have the same crc32
value. Its really easy to run backwards and forwards. This may or may
not be important for the application depending on whether you are
expecting malicious attack on the hashes. Probably better safe than
sorry though!

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #5
Tim Peters wrote:
...
together did very well -- which implies that they weren't facing a
very demanding problem. I don't know about that. But all the experiments were trace-driven and
showed CRC32 is an excellent hash function (for the problem). Since
there haven't been counter-results reported, we might be able to
reject the null hypothesis: CRC32 is _not_ good for hashing address
traces.

There's no reason to suppose, a priori, that's got much to do with the
32-bit hash distribution of strings-in-email PythonLabs tested.
...
And I didn't say it did. Different workloads.

It should be apparent by now I was trying to do the similar things the
cited works did. As I gladly found there is a CRC32 function, upon my
first program in Python, I was alerted by the reference page saying
it's not a good hash function and wondered for what particular
reasons.
...

This isn't 100% reliable either, though, and you can fiddle the above
to find subfields and inputs where crc32 does better than the builtin
hash. For example, in the above, if the hash is shifted right by 11
before taking the last 21 bits, the builtin hash does horribly,
filling only 8K bins, but crc32 doesn't do any worse then than it did
before.


It's good to know. According to Jain's paper, CRC32 _is_ relatively
bit-wise random (again, for the network workload). That's the property
I'm after.

Thanks for the discussion and the program.

Weiguang
Jul 18 '05 #6
wg***@namao.cs.ualberta.ca (Weiguang Shi) writes:
It's good to know. According to Jain's paper, CRC32 _is_ relatively
bit-wise random (again, for the network workload). That's the property
I'm after.


Since no one else has mentioned it - have you considered using md5,
which is also built into Python?

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Jul 18 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
by: Ricky Romaya | last post by:
Hi, I'm working on a file upload script. I need to calculate the CRC32 of the file(s) which are successfully uploaded. How can I do this? PHP only have CRC32 function for strings. However, the...
1
by: PaullyB | last post by:
Hi, I am attempting to convert the following code written in c to equivalent java code. This is the CRC32 algorithm used by a GPS received I am interfacing with. Unfortunately, the CRC32 class...
2
by: nobody | last post by:
1) Does anyone know if the CRC32 algorithm in binascii has a name? There seem to be a TON of different CRC32 methods; different polynomials, different byte orders, different seeds, some flip the...
1
by: Ericko | last post by:
I want to create a hash of file in order to determine if 2 files are the same, I can't find anything in the standard libraries. Is there some class I've overlooked? or do I have to get third party...
14
by: Don | last post by:
Hi NG. Does anyone know of a place where I could download/get a C implementation of a CRC32 check. I would like a simple function that, for example, had a pointer to where the data to be CRC32...
9
by: UnixUser | last post by:
I am looking for some source code to run on Linux that will enable me to calculate and return a CRC32 value from a string of text. I have found one from snippets.org, but I cannot get it to...
3
by: Russell Mangel | last post by:
Hi, I would like to encapsulate the following CRC32 routine inside a C++ un-mananged class. I am having difficulty initializing the "const static unsigned int table" inside an un-managed C++...
6
by: Paul M. | last post by:
Hello, does anyone have either a User Function Library (or the source for one) to create a CRC32 checksum for a given string? I want to use the function in a crystal formula thus: formula =...
12
by: Larry Bates | last post by:
I'm trying to get the results of binascii.crc32 to match the results of another utility that produces 32 bit unsigned CRCs. binascii.crc32 returns results in the range of -2**31-1 and 2**21-1....
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.