473,750 Members | 2,596 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

General Hash Functions In Python

Hi all,

I've ported various hash functions to python if anyone is interested:

def RSHash(key):
a = 378551
b = 63689
hash = 0
for i in range(len(key)) :
hash = hash * a + ord(key[i])
a = a * b
return (hash & 0x7FFFFFFF)
def JSHash(key):
hash = 1315423911
for i in range(len(key)) :
hash ^= ((hash << 5) + ord(key[i]) + (hash >2))
return (hash & 0x7FFFFFFF)
def PJWHash(key):
BitsInUnsignedI nt = 4 * 8
ThreeQuarters = long((BitsInUns ignedInt * 3) / 4)
OneEighth = long(BitsInUnsi gnedInt / 8)
HighBits = (0xFFFFFFFF) << (BitsInUnsigned Int - OneEighth)
hash = 0
test = 0

for i in range(len(key)) :
hash = (hash << OneEighth) + ord(key[i])
test = hash & HighBits
if test != 0:
hash = (( hash ^ (test >ThreeQuarters) ) & (~HighBits));
return (hash & 0x7FFFFFFF)
def ELFHash(key):
hash = 0
x = 0
for i in range(len(key)) :
hash = (hash << 4) + ord(key[i])
x = hash & 0xF0000000
if x != 0:
hash ^= (x >24)
hash &= ~x
return (hash & 0x7FFFFFFF)
def BKDRHash(key):
seed = 131 # 31 131 1313 13131 131313 etc..
hash = 0
for i in range(len(key)) :
hash = (hash * seed) + ord(key[i])
return (hash & 0x7FFFFFFF)
def SDBMHash(key):
hash = 0
for i in range(len(key)) :
hash = ord(key[i]) + (hash << 6) + (hash << 16) - hash;
return (hash & 0x7FFFFFFF)
def DJBHash(key):
hash = 5381
for i in range(len(key)) :
hash = ((hash << 5) + hash) + ord(key[i])
return (hash & 0x7FFFFFFF)
def DEKHash(key):
hash = len(key);
for i in range(len(key)) :
hash = ((hash << 5) ^ (hash >27)) ^ ord(key[i])
return (hash & 0x7FFFFFFF)
def APHash(key):
hash = 0
for i in range(len(key)) :
if ((i & 1) == 0):
hash ^= ((hash << 7) ^ ord(key[i]) ^ (hash >3))
else:
hash ^= (~((hash << 11) ^ ord(key[i]) ^ (hash >5)))
return (hash & 0x7FFFFFFF)
print RSHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
print JSHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
print PJWHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
print ELFHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
print BKDRHash('abcde fghijklmnopqrst uvwxyz123456789 0')
print SDBMHash('abcde fghijklmnopqrst uvwxyz123456789 0')
print DJBHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
print DEKHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
print APHash ('abcdefghijklm nopqrstuvwxyz12 34567890')

Arash Partow
_______________ _______________ _______________ ___________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net

Jul 17 '06 #1
12 7016
"Arash Partow" <pa****@gmail.c omwrites:
I've ported various hash functions to python if anyone is interested:
Are these useful for any particular interoperabilit y purposes?
Otherwise I'd say just one such hash is plenty, or maybe don't even
bother, since Python's built-in dicts are sufficient for most
applications that call for hashing, and the cryptographic hashes are
available for when you really care about uniformity of the hash
output.
for i in range(len(key)) :
hash = hash * a + ord(key[i])
a = a * b
As a stylistic issue, I'd write this:

for c in key:
hash = hash * a + ord(c)
a *= b

and similarly for the other similar loops.
Jul 17 '06 #2
Hi Paul,

For different data types different hash functions work
better/worse aka fewer or more collisions.

I believe the more choice people have and also the more
ways people see how a particular thing can be done, then
the easier it will be for them to come up with their own
specific efficient solutions.

That said, I believe at least one (most likely more) of
the hash functions in the group above will most always work
better (ala less collisions) than the standard default hash
available in the built-in dict for any random set of strings.

Please feel free to prove me wrong :)


Arash Partow
_______________ _______________ _______________ ___________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net
Paul Rubin wrote:
"Arash Partow" <pa****@gmail.c omwrites:
I've ported various hash functions to python if anyone is interested:

Are these useful for any particular interoperabilit y purposes?
Otherwise I'd say just one such hash is plenty, or maybe don't even
bother, since Python's built-in dicts are sufficient for most
applications that call for hashing, and the cryptographic hashes are
available for when you really care about uniformity of the hash
output.
for i in range(len(key)) :
hash = hash * a + ord(key[i])
a = a * b

As a stylistic issue, I'd write this:

for c in key:
hash = hash * a + ord(c)
a *= b

and similarly for the other similar loops.
Jul 17 '06 #3
"Arash Partow" <pa****@gmail.c omwrites:
For different data types different hash functions work
better/worse aka fewer or more collisions.
But you give no indication of which of those hashes works best for
what kind of data. How is the user supposed to figure out which one
to choose?
Jul 17 '06 #4
Arash Partow wrote:
That said, I believe at least one (most likely more) of
the hash functions in the group above will most always work
better (ala less collisions) than the standard default hash
available in the built-in dict for any random set of strings.

Please feel free to prove me wrong :)
You have it backwards. You are the one making the positive assertion. You get to
prove yourself correct.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Jul 17 '06 #5
Trial and error - how else? :)

I believe finding a perfect hash function for
every kind and combination of data is a very
very time consuming operation. Also there is
the common case where the data is online
(ie: stateless) that said it doesn't mean you
can't make assumptions about the kind of data.

Arash Partow
_______________ _______________ _______________ ___________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net
Paul Rubin wrote:
"Arash Partow" <pa****@gmail.c omwrites:
For different data types different hash functions work
better/worse aka fewer or more collisions.

But you give no indication of which of those hashes works best for
what kind of data. How is the user supposed to figure out which one
to choose?
Jul 17 '06 #6
That is true, but I'm not about to do something
that might potentially prove my point wrong... :)

Arash Partow
_______________ _______________ _______________ ___________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net
Robert Kern wrote:
>
You have it backwards. You are the one making the positive assertion. You get to
prove yourself correct.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
Jul 17 '06 #7
Arash Partow wrote:
I've ported various hash functions to python if anyone is interested:
[snip]
Ok, so if you think they are useful, what about writing up an article for the
Python Cookbook that describes their usage and specific advantages/disadvantages?

http://aspn.activestate.com/ASPN/Python/Cookbook/

Stefan
Jul 17 '06 #8
On 17/07/2006 5:52 PM, Arash Partow wrote:
Hi Paul,

For different data types different hash functions work
better/worse aka fewer or more collisions.

I believe the more choice people have and also the more
ways people see how a particular thing can be done, then
the easier it will be for them to come up with their own
specific efficient solutions.
Who is likely to bother? In timbot we trust. Have you read the comments
at the start of Objects/dictobject.c?
>
That said, I believe at least one (most likely more) of
the hash functions in the group above will most always work
better (ala less collisions) than the standard default hash
available in the built-in dict for any random set of strings.
[Aside: it's not "in the built-in dict". Any type which wants its
instances to be hashable defines its own hash method, one that suits the
type.]

This belief would be based on:
(a) actual testing by you
or (b) a refereed academic paper which did such tests on hash functions
(including the Python "standard default hash")
or (c) ...???

What is the Python "standard default hash", anyway? It is written in C.
Wouldn't it have been a good idea to provide a Python function for it,
so that people who are going to "come up with their own specific
efficient solutions" had something to compare with? Or perhaps they are
intended to "port" your functions back into C?

A few more questions:

Why have you truncated the answers to 31 bits?

Have you tested that these functions produce the same output (apart from
the 31-bit thing) as the originals that you worked from? The reason for
asking is that Python unlike C doesn't lose any bits in the <<
operation; if this is followed by a >you may be shifting some unwanted
1-bits back in again.

Talking about not losing bits: For your 36-byte example input, the
SDBMHash (with its << 16) is up to about 566 bits before you truncate it
to 31. A little over the top, perhaps. Maybe not the fastest way of
doing it.

What is the purpose of the calls to long() in PJWHash?

And the $64K question: What is the quintessential difference between
PJWHash and ELFHash?

Cheers,
John
Jul 17 '06 #9
John Machin wrote:
Who is likely to bother? In timbot we trust. Have you read the comments
at the start of Objects/dictobject.c?
No I haven't probably wont be anytime soon, as far as time, well
people interested, as is how started my original port, would be more
than willing to try/assess the routines for sets of strings that they
wish to hash etc.. this site may help explain plus has some code
snippets that may help you understand what I mean.

http://www.partow.net/programming/ha...ons/index.html

[Aside: it's not "in the built-in dict". Any type which wants its
instances to be hashable defines its own hash method, one that suits the
type.]

This belief would be based on:
(a) actual testing by you
or (b) a refereed academic paper which did such tests on hash functions
(including the Python "standard default hash")
or (c) ...???
Experience has shown me that the belief than one "default" way of
hashing is generally not the optimal way for the overwhelming
majority of data, but then again....
What is the Python "standard default hash", anyway? It is written in C.
Wouldn't it have been a good idea to provide a Python function for it,
so that people who are going to "come up with their own specific
efficient solutions" had something to compare with? Or perhaps they are
intended to "port" your functions back into C?
The above link provides a very simple hash test framework, the
"standard default hash" can be placed in there and tested amongst
the other algorithms.

A few more questions:

Why have you truncated the answers to 31 bits?
algorithms were adapted from unsigned int, i should have removed that
final truncation in python.

Have you tested that these functions produce the same output (apart from
the 31-bit thing) as the originals that you worked from? The reason for
asking is that Python unlike C doesn't lose any bits in the <<
operation; if this is followed by a >you may be shifting some unwanted
1-bits back in again.
Some of them do others don't (not really important unless you are
trying to be compatible with other implementations which this is
not), I am aware of python not truncating/wrapping of values under
various operations, I believe its a nice little side effect from
python which gives more bits to play with as long as you don't
truncate them as I have.

Talking about not losing bits: For your 36-byte example input, the
SDBMHash (with its << 16) is up to about 566 bits before you truncate it
to 31. A little over the top, perhaps. Maybe not the fastest way of
doing it.
Possibly, do you have a better solution I'm very keen to learn...

What is the purpose of the calls to long() in PJWHash?
trying to cast to long, looking at it now its rather superfluous.

And the $64K question: What is the quintessential difference between
PJWHash and ELFHash?
Nothing, elf is essentially pjw, its just optimised for 32-bit systems
in that the calculation for th's etc are static where has pjw
required sizeof to calc the th's which i couldn't find a way of doing,
so i fudged it in the hope that maybe sometime in the future a work
around of sorts could be developed.

Again this was just a simple posting, I hoped to maybe gets some
comments and may present some ideas to people, I don't expect anyone
to drop everything and start using these, just thought it might be
something interesting for this NG. btw I'm not a very good python
programmer ATM.


Arash Partow
_______________ _______________ _______________ ___________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net

Jul 17 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
14188
by: Damien Morton | last post by:
Ive been doing some investigation of python hashtables and the hash function used in Python. It turns out, that when running pystone.py, as much time is spent in string_hash() as is spent in lookdict_string(). If you look at python's hash function, shown below, you can get an idea why - a multiply is used for every character in the string. static long string_hash1(register char *p, int size)
14
2935
by: Todd Johnson | last post by:
I am creating a dialog in wxPython for log in purposes. Basically when the user clicks the ok button, the dialog box saves the user name and password as class attributes. Then as long as the dialog exists calling MyDialog.GetUserName() and MyDialog.GetPassword() returns them. This seems insecure to me. Is there a better way to go about this or is it safe as long as I destroy the dialog as soon as I am done with it?
1
1988
by: Steven D'Aprano | last post by:
Do people often use hash() on built-in types? What do you find it useful for? How about on custom classes? Can anyone give me some good tips or hints for writing and using hash functions in Python? Thank you, --
2
3790
by: Bryan Olson | last post by:
The current Python standard library provides two cryptographic hash functions: MD5 and SHA-1 . The authors of MD5 originally stated: It is conjectured that it is computationally infeasible to produce two messages having the same message digest. That conjecture is false, as demonstrated by Wang, Feng, Lai and Yu in 2004 . Just recently, Wang, Yu, and Lin showed a short- cut solution for finding collisions in SHA-1 . Their result
6
1353
by: pycraze | last post by:
Hi , I am using Fedora Core -3 and my Python version is 2.4 . kernel version - 2.6.9-1.667smp There is some freakish thing that happens with python hashes when i run a python script my python file is basically : myhash = {} def summa():
10
2454
by: Qiangning Hong | last post by:
I'm writing a spider. I have millions of urls in a table (mysql) to check if a url has already been fetched. To check fast, I am considering to add a "hash" column in the table, make it a unique key, and use the following sql statement: insert ignore into urls (url, hash) values (newurl, hash_of_newurl) to add new url. I believe this will be faster than making the "url" column unique key and doing string comparation. Right?
24
4307
by: kdotsky | last post by:
Hello, I am using some very large dictionaries with keys that are long strings (urls). For a large dictionary these keys start to take up a significant amount of memory. I do not need access to these keys -- I only need to be able to retrieve the value associated with a certain key, so I do not want to have the keys stored in memory. Could I just hash() the url strings first and use the resulting integer as the key? I think what I'm...
5
2178
by: Mathias Panzenboeck | last post by:
Hi. I wrote a small hashlib for C. Because I'm new to hashes I looked at pythons implementation and reused *some* of the code... or more the mathematical "hash-function", not really the code. In particular I looked at pythons hash and lookup functions, so I came up with this (see the code underneath). So, can this code be considered as derived and do I have to put my code under the GPL? I'd like to publish it under something less...
16
7380
by: Robin Becker | last post by:
Is the any way to get an efficient 16bit hash in python? -- Robin Becker
0
9001
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
8838
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
9396
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
0
8263
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
6808
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6081
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
4716
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4888
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
3
2226
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.