Hello,
I am using some very large dictionaries with keys that are long strings
(urls). For a large dictionary these keys start to take up a
significant amount of memory. I do not need access to these keys -- I
only need to be able to retrieve the value associated with a certain
key, so I do not want to have the keys stored in memory. Could I just
hash() the url strings first and use the resulting integer as the key?
I think what I'm after here is more like a traditional hash table. If I
do it this way am I going to get the memory savings I am after? Will
the hash function always generate unique keys? Also, would the same
technique work for a set?
Any other thoughts or considerations are appreciated.
Thank You.
kdot...@gmail.com wrote:
Hello,
I am using some very large dictionaries with keys that are long strings
(urls). For a large dictionary these keys start to take up a
significant amount of memory. I do not need access to these keys -- I
only need to be able to retrieve the value associated with a certain
key, so I do not want to have the keys stored in memory. Could I just
hash() the url strings first and use the resulting integer as the key?
I think what I'm after here is more like a traditional hash table. If I
do it this way am I going to get the memory savings I am after? Will
the hash function always generate unique keys? Also, would the same
technique work for a set?
I just realized that of course the hash is not always going to be
unique, so this wouldn't really work. And it seems a hash table would
still need to store the keys (as strings) so that string comparisons
can be done when a collision occurs. I guess there's no avoiding
storing the keys?
kd*****@gmail.com wrote:
Will the hash function always generate unique keys?
no. hash() is designed for dictionaries (hash tables), not for use as a
cryptographic hash.
depending on your application, a bloom filter might be good enough: http://en.wikipedia.org/wiki/Bloom_filter
(see the links section for a Python implementation)
</F>
kd*****@gmail.com wrote:
I just realized that of course the hash is not always going to be
unique, so this wouldn't really work. And it seems a hash table would
still need to store the keys (as strings) so that string comparisons
can be done when a collision occurs.
btw, Python's dictionary type *is* a highly-optimized implementation of
a "traditional hash table".
</F>
kd*****@gmail.com wrote:
Hello,
I am using some very large dictionaries with keys that are long strings
(urls). For a large dictionary these keys start to take up a
significant amount of memory. I do not need access to these keys -- I
only need to be able to retrieve the value associated with a certain
key, so I do not want to have the keys stored in memory. Could I just
hash() the url strings first and use the resulting integer as the key?
I think what I'm after here is more like a traditional hash table.
python dictionaries are "traditional" hash-tables.
If I
do it this way am I going to get the memory savings I am after? Will
the hash function always generate unique keys? Also, would the same
technique work for a set?
Any other thoughts or considerations are appreciated.
You could try creating an MD5 sum of your strings and using that as the
key. It _should_ be good enough, but I'm no crypto expert, so take that
with a grain of salt.
Diez
It should be enough but it might be a little slower than hash(string).
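For reference, in today's Python the md5 module lives in hashlib; a minimal sketch of the MD5-as-key idea (the url, value, and md5_key helper are invented for illustration):

```python
import hashlib

def md5_key(url):
    # a fixed 16-byte binary digest stands in for an arbitrarily long url
    return hashlib.md5(url.encode("utf-8")).digest()

d = {}
url = "http://example.com/some/very/long/path?with=many&query=params"
d[md5_key(url)] = 42

# retrieval re-hashes the url; the url string itself is never stored
value = d[md5_key(url)]
```

Collisions are still possible in principle, but at 128 bits they are astronomically unlikely for any realistic number of urls.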
Diez B. Roggisch wrote:
kd*****@gmail.com wrote:
Hello,
I am using some very large dictionaries with keys that are long strings
(urls). For a large dictionary these keys start to take up a
significant amount of memory. I do not need access to these keys -- I
only need to be able to retrieve the value associated with a certain
key, so I do not want to have the keys stored in memory. Could I just
hash() the url strings first and use the resulting integer as the key?
I think what I'm after here is more like a traditional hash table.
python dictionaries are "traditional" hash-tables.
If I
do it this way am I going to get the memory savings I am after? Will
the hash function always generate unique keys? Also, would the same
technique work for a set?
Any other thoughts or considerations are appreciated.
You could try creating an MD5 sum of your strings and using that as the
key. It _should_ be good enough, but I'm no crypto expert, so take that
with a grain of salt.
Diez
Dictionaries are hash tables in Python.
If you don't really want to store the keys, just use
dic[hash(key)] = value; this way the dictionary will have the same shape
and distribution of keys as dic[key] = value, because
hash('abc') == hash(hash('abc')), but the long url strings themselves are
not referenced by the dictionary.
Now you couldn't do dic.keys() and see your urls, and every time you
want to do dic['abc'] you would get a KeyError exception.
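Concretely, the scheme above looks like this (the urls and values are made up):

```python
# two invented urls standing in for the real keys
urls = {"http://example.com/page/one": 1, "http://example.com/page/two": 2}

dic = {}
for url, val in urls.items():
    dic[hash(url)] = val   # only the integer hash is kept as the key

# lookups must re-hash the url; dic["http://..."] itself raises KeyError
value = dic[hash("http://example.com/page/one")]
```

Note that two distinct urls that happen to hash alike would silently overwrite each other here, which is the collision problem discussed below.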
Hope this helps,
Nick Vatamaniuc
kd*****@gmail.com wrote:
kdot...@gmail.com wrote:
Hello,
I am using some very large dictionaries with keys that are long strings
(urls). For a large dictionary these keys start to take up a
significant amount of memory. I do not need access to these keys -- I
only need to be able to retrieve the value associated with a certain
key, so I do not want to have the keys stored in memory. Could I just
hash() the url strings first and use the resulting integer as the key?
I think what I'm after here is more like a traditional hash table. If I
do it this way am I going to get the memory savings I am after? Will
the hash function always generate unique keys? Also, would the same
technique work for a set?
I just realized that of course the hash is not always going to be
unique, so this wouldn't really work. And it seems a hash table would
still need to store the keys (as strings) so that string comparisons
can be done when a collision occurs. I guess there's no avoiding
storing the keys?
Nick Vatamaniuc wrote:
If you don't really want to store the keys, just use
dic[hash(key)]=value, this way the dictionary will have the same shape
and distribution of keys as dic[key]=value because
hash('abc')=hash(hash('abc')) but the long string of actual keys are
not referenced by the dictionary.
how come you're so sure that there will never be any collisions ?
</F>
how come you're so sure that there will never be any collisions ?
because none of his strings want their insurance to go up...
:*)
-tkc
Fred,
I never said there will be no collisions. For clarity, can you quote
the exact phrase where I mentioned that?
To say that there will be no collisions is nonsense because the # of
possible long url strings is certainly larger than 2^32; applying the
pigeonhole principle you could easily show that there will be
collisions.
The point is to make collisions as unlikely as possible for
human-readable text (that is, make the hashes as uniform as possible);
that is why creating good cryptographic hashes is not easy. Not that the
Python hash function works as hard as an MD5 (it is probably a
multiplication and a modulo if I had to guess). But this is a topic
beyond the scope of this thread.
The original post just said that he wanted to index some values by
their urls and didn't really care about the urls themselves, so I
suggested that he just use the hash of the key as the key. The
dictionary will then take a hash of that [note that
hash(string) == hash(hash(string))], so the dictionary will not keep
a reference to the urls and GC will collect them. So instead of
dic[urlstring] = value he will do dic[hash(urlstring)] = value. But then
to retrieve he will have to do old_value = dic[hash(urlstring)].
Hopefully this makes my point more clear,
Nick V.
Fredrik Lundh wrote:
Nick Vatamaniuc wrote:
If you don't really want to store the keys, just use
dic[hash(key)] = value; this way the dictionary will have the same shape
and distribution of keys as dic[key] = value, because
hash('abc') == hash(hash('abc')), but the long url strings themselves are
not referenced by the dictionary.
how come you're so sure that there will never be any collisions ?
</F>
Nick Vatamaniuc wrote:
I never said there will be no collisions. For clarity, can you quote
the exact phrase where I mentioned that?
the text I quoted is only true if you can guarantee that there will be
no collisions.
</F>
Please don't top post. I had to fix that below.
>>Any other thoughts or considerations are appreciated.
You could try creating an MD5 sum of your strings and using that as the key. It _should_ be good enough, but I'm no crypto expert, so take that with a grain of salt.
It should be enough but it might be a little slower than hash(string).
hash alone won't be sufficient, as it is guaranteed to have too many
collisions. The OP already figured that out, btw.
Diez
kd*****@gmail.com writes:
do it this way am I going to get the memory savings I am after? Will
the hash function always generate unique keys? Also, would the same
technique work for a set?
A hash function that always generates unique hashes is called a
perfect hash. They tend to be slow, and of course, to generate one,
you need to know all the keys in advance.
If you use a long cryptographic hash like md5 or sha, the chance of a
collision is very small. That's probably your best bet.
In general, if your hash is N bits long, you will probably get
collisions once you have around 2**(N/2) keys. sha is a 160
bit hash so getting collisions by accident is extremely unlikely.
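The 2**(N/2) rule of thumb is the birthday bound; while the result is small, the probability of any collision among n keys under an N-bit hash is roughly n**2 / 2**(N+1). A quick sketch (the function name and sample sizes are invented):

```python
# birthday bound: approximate probability of at least one collision
# among n uniformly random N-bit hashes (valid while the result is small)
def collision_probability(n, bits):
    return min(1.0, n ** 2 / 2.0 ** (bits + 1))

p_64 = collision_probability(10 ** 9, 64)    # a billion keys, 64-bit hash
p_160 = collision_probability(10 ** 9, 160)  # same keys, sha-1 sized hash
```

With a billion keys, a 64-bit hash already gives a few percent chance of a collision, while a 160-bit hash keeps it vanishingly small.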
depending on your application, a bloom filter might be a good enough:
> http://en.wikipedia.org/wiki/Bloom_filter
Thanks (everyone) for the comments. I like the idea of the bloom
filter or using an md5 hash, since a rare collision will not be a
show-stopper in my case.
Nick Vatamaniuc wrote:
The original post just said that he wanted to index some values by
their urls and didn't really care about the urls themselves, so I
suggested that he just use the hash of the key as the key. The
dictionary will then take a hash of that [note that:
hash(string)=hash(hash(string)) ] then the dictionary will not keep
the reference to the urls and GC will collect them. So instead of
dic[urlstring] = value he will do dic[hash(urlstring)] = value. But then
to retrieve he will have to do old_value = dic[hash(urlstring)].
Note that it is trivial to catch collisions on entry and correct them:
key = hash(url_string)
i = 0
if d.has_key(key):
    while d.has_key(key):
        i += 1
    d[key + repr(i)] = val
else:
    d[key] = val
or something along those lines. No need to keep the original string.
Obviously this is only efficient if collisions are rare.
I am a little surprised that hash(hash(s)) == hash(s), is that actually
true?
Cheers,
Terry
--
Terry Hancock (ha*****@AnansiSpaceworks.com)
Anansi Spaceworks http://www.AnansiSpaceworks.com
>>>>> "Terry" == Terry Hancock <ha*****@anansispaceworks.com> writes:
Note that it is trivial to catch collisions on entry and correct them:
key = hash(url_string)
i = 0
if d.has_key(key):
    while d.has_key(key):
        i += 1
hash is a number. It's sufficient to do
while d.has_key(key):
    key += 1
I am a little surprised that hash(hash(s)) == hash(s), is that actually
true?
>>> hash(42)
42
Ganesan
--
Ganesan Rajagopal
Ganesan Rajagopal <rg******@myrealbox.com> writes:
hash is a number. It's sufficient to do
while d.has_key(key):
    key += 1
This is called linear probing and is not considered that great a
collision resolution strategy for hash tables. Really, if you want an
exhaustive study about hashing, see Knuth vol 3. If you're just
trying to get the application doing something reasonable, let the
database do its job.
Yeah, hash(hash(immutable)) == hash(immutable), it seems. Not sure this
is a specification, but it happens that way:
------------------------------------------
>>> hash('abc')
-1600925533
>>> hash(hash('abc'))
-1600925533
>>> hash(hash(hash('abc')))
-1600925533
>>>
-----------------------------------------
Of course then hash(0) == 0, hash(1) == 1, and also hash(2**32) == 1, all
this in standard CPython on IA32.
Nick Vatamaniuc
Ganesan Rajagopal wrote:
>>>>> "Terry" == Terry Hancock <ha*****@anansispaceworks.com> writes:
Note that it is trivial to catch collisions on entry and correct them:
key = hash(url_string)
i = 0
if d.has_key(key):
    while d.has_key(key):
        i += 1
hash is a number. It's sufficient to do
while d.has_key(key):
    key += 1
I am a little surprised that hash(hash(s)) == hash(s), is that actually
true?
>>> hash(42)
42
Ganesan
--
Ganesan Rajagopal
I am a little surprised that hash(hash(s)) == hash(s), is that actually
true?
As hashes are integers & the hash of an integer is the integer itself,
this isn't too surprising.
Diez
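A caveat for modern readers: in current CPython, hash() for numbers reduces modulo the Mersenne prime 2**61 - 1 on 64-bit builds, and -1 is reserved as a C-level error code, so "an integer hashes to itself" only holds below the modulus (a sketch, assuming a 64-bit CPython build):

```python
# CPython hashes numbers modulo the Mersenne prime 2**61 - 1 (64-bit builds)
M = 2 ** 61 - 1

assert hash(42) == 42          # small ints hash to themselves...
assert hash(M - 1) == M - 1
assert hash(M + 1) == 1        # ...larger ones wrap around the modulus
assert hash(-1) == -2          # -1 is reserved as a C-level error code
```

Since string hashes can exceed the modulus on 64-bit builds, hash(hash(s)) == hash(s) is no longer guaranteed in today's Python, though it held in the 32-bit interpreters of the time.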
Terry Hancock wrote:
Nick Vatamaniuc wrote:
> The original post just said that he wanted to index some values by
> their urls and didn't really care about the urls themselves, so I
> suggested that he just use the hash of the key as the key. The
> dictionary will then take a hash of that [note that
> hash(string) == hash(hash(string))], so the dictionary will not keep
> a reference to the urls and GC will collect them. So instead of
> dic[urlstring] = value he will do dic[hash(urlstring)] = value. But then
> to retrieve he will have to do old_value = dic[hash(urlstring)].
Note that it is trivial to catch collisions on entry and correct them:
key = hash(url_string)
i = 0
if d.has_key(key):
while d.has_key(key):
i += 1
d[key + repr(i)] = val
else:
d[key] = val
or something along those lines. No need to keep the original string.
How do you intend to get the value back out of the dictionary?
candidates = []
for key in xrange(hash(url), sys.maxint):
    try:
        candidates.append(d[key])
    except KeyError:
        break
Instead, I would put all values with the same hash into a container, along
the lines of
d.setdefault(hash(url), []).append(value)
Here is a simple implementation that can be used to put the idea to a test.
It keeps all but the first url for a cluster with identical hashes.
class Cluster(object):
    """A cluster of dictionary values with the same hash(key).

    Helper for Dict.
    """
    def __init__(self, default):
        self.pairs = {}
        self.default = default

    def __setitem__(self, key, value):
        assert key not in self.pairs
        self.pairs[key] = value

    def __getitem__(self, key):
        return self.pairs.get(key, self.default)

    def __repr__(self):
        return "Cluster(default=%r, pairs=%r)" % (self.default, self.pairs)

    def itervalues(self):
        yield self.default
        for value in self.pairs.itervalues():
            yield value

class Dict(object):
    """A dictionary that stores hashes instead of keys as long as there are
    no collisions.

    The value entered first becomes the default for a cluster of values with
    the same hash. It follows that you cannot implement a reliable
    containment test and must make sure by exterior means that only existing
    keys are looked up.
    """
    def __init__(self):
        self.dict = {}

    def __setitem__(self, key, value):
        short_key = hash(key)
        try:
            cluster = self.dict[short_key]
        except KeyError:
            self.dict[short_key] = value
        else:
            if isinstance(cluster, Cluster):
                cluster[key] = value
            else:
                cluster = Cluster(cluster)
                cluster[key] = value
                self.dict[short_key] = cluster

    def __getitem__(self, key):
        cluster = self.dict[hash(key)]
        if not isinstance(cluster, Cluster):
            return cluster
        return cluster[key]

    def itervalues(self):
        for value in self.dict.itervalues():
            if isinstance(value, Cluster):
                for v in value.itervalues():
                    yield v
            else:
                yield value

    def __repr__(self):
        return "Dict(%s)" % repr(self.dict)[1:-1]

if __name__ == "__main__":
    class BadHash(object):
        """Key with user-defined hash.

        Allows enforcing hash collisions.
        """
        def __init__(self, value, hash):
            self.value = value
            self.hash = hash

        def __hash__(self):
            return hash(self.hash)

        def __eq__(self, other):
            return self.value == other.value

        def __repr__(self):
            return "BadHash(value=%r, hash=%r)" % (self.value, self.hash)

    d = Dict()
    for i, v in enumerate("alpha beta gamma delta".split()):
        d[BadHash(v, i)] = v
    for v in "sigma tau omega".split():
        d[BadHash(v, 42)] = v
    print d

    assert d[BadHash("gamma", 2)] == "gamma"
    assert d[BadHash("sigma", 42)] == "sigma"
    assert d[BadHash("tau", 42)] == "tau"
    # but be warned that
    assert d[BadHash("whatever", 42)] == "sigma"

    expected = sorted("alpha beta gamma delta sigma tau omega".split())
    assert list(sorted(d.itervalues())) == expected
I fear that the memory footprint of a url is not large enough to warrant the
approach, though.
Peter
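That closing doubt is easy to quantify with sys.getsizeof (the sample url is invented): the per-key saving is just the difference between the string's footprint and an int's.

```python
import sys

# an invented ~230-character url versus its integer hash
url = "http://example.com/" + "path-segment/" * 16
saving_per_key = sys.getsizeof(url) - sys.getsizeof(hash(url))
```

The saving is real but modest, on the order of the url's length in bytes per key, so it only matters for dictionaries with many millions of entries.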
kd*****@gmail.com wrote:
>depending on your application, a bloom filter might be a good enough:
http://en.wikipedia.org/wiki/Bloom_filter
Thanks (everyone) for the comments. I like the idea of the bloom
filter or using an md5 hash, since a rare collision will not be a
show-stopper in my case.
btw, since my brain is on vacation, could anyone who already has all the necessary
background information in their head (Paul?) tell me if using a chopped-up MD5 or
SHA-1 (or whatever) digest as the hash functions for a bloom filter is a good idea...
i.e. something like:
import sha

try:
    all
except NameError:
    def all(S):
        for x in S:
            if not x:
                return False
        return True

def digest(data):
    d = sha.sha(data).digest()
    return [d[i:i+4] for i in range(0, len(d), 4)]

class bloom(object):
    def __init__(self):
        self.filter = [set() for i in range(len(digest("")))]
    def add(self, data):
        for s, p in zip(self.filter, digest(data)):
            s.add(p)
    def __contains__(self, data):
        return all(p in s for s, p in zip(self.filter, digest(data)))

b = bloom()
b.add("showbiz")
b.add("absolution")

print "hullaballo" in b
print "showbiz" in b
print "origin of symmetry" in b
(and if this isn't a good idea, why it isn't).
</F>
"Fredrik Lundh" <fr*****@pythonware.com> writes:
btw, since my brain is on vacation, could anyone who already has all
the necessary background information in their head (Paul?) tell me
if using a chopped-up MD5 or SHA-1 (or whatever) digest as the hash
functions for a bloom filter is a good idea...
Erm, my brain isn't on vacation, it's just plain missing ;-). However
I'd say there's nothing wrong in principle with chopping up sha/md5
output if you want to get several hashes as you're doing, assuming the
total hash length is enough. The advantage over doing separate hashes
for each piece is optimization: you don't compute as many hashes.
But there's more flexibility with something like (untested):
def multi_hash(data, i, size):  # find i'th slice of hash
    return sha.new('%x,%s' % (i, data)).digest()[:size]
Also, the idea of a Bloom filter is to use less memory than a hash
table, by not storing the keys. In your implementation you chop the
sha/md5 output into 4-byte pieces, which means about 4 billion "slots"
in each set, which might be too many for a typical computer. Also,
you end up storing each of those 4-byte values as a Python string. So
I think in practice you'd want smaller sets, and to represent them as
bit vectors instead of as Python sets (untested):
import array

class bit_vector(object):
    def __init__(self, size_in_bits):
        nbytes = (size_in_bits + 7) // 8
        self.vec = array.array('B', '\0' * nbytes)

    def __getitem__(self, i):  # get i'th bit from vector
        a, b = divmod(i, 8)
        return bool(self.vec[a] & (1 << b))

    def __setitem__(self, i, value):  # set i'th bit in vector
        a, b = divmod(i, 8)
        v = int(bool(value))  # the actual bit to set, 1 or 0
        self.vec[a] = (self.vec[a] & ~(1 << b)) | (v << b)
If we use a bit vector representation we want the hash value to be an
integer (untested):
def bloom_hash(data, i, size_in_bits):
    # my favorite way to get an integer value from sha
    h = int(sha.new('%x,%s' % (i, data)).hexdigest(), 16)
    return h % size_in_bits
We'd then implement the Bloom filter (untested):
class bloom(object):
    def __init__(self, nlayers, layersize):
        self.nlayers, self.layersize = nlayers, layersize
        self.filter = [bit_vector(layersize) for i in xrange(nlayers)]
    def add(self, data):
        for i in xrange(self.nlayers):
            self.filter[i][bloom_hash(data, i, self.layersize)] = True
    def __contains__(self, data):
        return False not in \
            (self.filter[i][bloom_hash(data, i, self.layersize)]
             for i in xrange(self.nlayers))
Finally, I think to use Bloom filters effectively you have to choose
nlayers and layersize carefully, according to the number of keys you
expect to see, the amount of memory you have available, and/or the
probability of false matches that you're willing to accept. The
standard references (whatever they are) about Bloom filters probably
have some suitable formulas for computing these things.
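For what it's worth, the standard formulas are: for n keys and acceptable false-positive rate p, the optimal bit count is m = -n*ln(p)/(ln 2)**2 and the optimal hash count is k = (m/n)*ln 2. A sketch (function name and sample numbers invented):

```python
import math

def bloom_parameters(n, p):
    """Optimal bit-array size m and hash count k for n keys at
    false-positive rate p."""
    m = -n * math.log(p) / math.log(2) ** 2
    k = m / n * math.log(2)
    return int(math.ceil(m)), int(round(k))

# ten million urls at a 0.1% false-positive rate
m, k = bloom_parameters(10 ** 7, 0.001)
```

This works out to roughly 14.4 bits per key and 10 hash functions, i.e. about 18 MB for ten million urls, far less than storing the strings.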
Paul Rubin <http://ph****@NOSPAM.invalid> writes:
Finally, I think to use Bloom filters effectively you have to choose
nlayers and layersize carefully, according to the number of keys you
expect to see, the amount of memory you have available, and/or the
probability of false matches that you're willing to accept. The
standard references (whatever they are) about Bloom filters probably
have some suitable formulas for computing these things.
Heh, there's an online calculator: http://www.cc.gatech.edu/fac/Pete.Ma...alculator.html
Link is from http://en.wikipedia.org/wiki/Bloom_filter
which is a good article. As described in the article, there's only
one "layer" in the Bloom filter (you use n different hashes but they
all touch the same bit map) which corresponds to how the ancient Unix
spell checker worked, I think. It would take some analysis to show
which way is better. I may try working out the formulas as an
exercise but I'm too sleepy for that right now.
>>>>> kd*****@gmail.com (k) wrote:
k> Hello,
k> I am using some very large dictionaries with keys that are long strings
k> (urls). For a large dictionary these keys start to take up a
k> significant amount of memory. I do not need access to these keys -- I
k> only need to be able to retrieve the value associated with a certain
k> key, so I do not want to have the keys stored in memory. Could I just
k> hash() the url strings first and use the resulting integer as the key?
k> I think what I'm after here is more like a traditional hash table. If I
k> do it this way am I going to get the memory savings I am after? Will
k> the hash function always generate unique keys? Also, would the same
k> technique work for a set?
Maybe a Berkeley DB hash file would be a good alternative. It can contain
all your key,value pairs but will only keep a small amount in memory.
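In today's standard library the closest equivalent is the dbm module (shelve adds pickling on top); a sketch with an invented path and url:

```python
import dbm
import os
import tempfile

# keys and values live in an on-disk hash file, not in process memory
path = os.path.join(tempfile.mkdtemp(), "urls.db")
with dbm.open(path, "c") as db:      # "c" creates the file if needed
    db["http://example.com/page"] = "42"    # dbm stores str/bytes pairs
    stored = db["http://example.com/page"]  # comes back as bytes
```

This trades lookup speed for memory, but it keeps the real keys, so there is no collision problem at all.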
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: pi**@vanoostrum.org