Bytes | Software Development & Data Engineering Community

16-bit hash

Is there any way to get an efficient 16-bit hash in Python?
--
Robin Becker

Jun 27 '07 #1
Robin Becker wrote:
Is there any way to get an efficient 16-bit hash in Python?
hash(obj)&65535
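For illustration, a small sketch of this masking approach (the sample values are arbitrary): any value of Python's built-in hash can be folded into 16 bits with a bitwise AND.

```python
# Fold Python's built-in hash into 16 bits by masking off the low bits.
for obj in ("hello", 1234, (1, 2), frozenset({3})):
    h16 = hash(obj) & 0xFFFF  # 0xFFFF == 65535, i.e. keep 16 bits
    assert 0 <= h16 <= 0xFFFF
```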

- Josiah
Jun 27 '07 #2
Josiah Carlson wrote:
Robin Becker wrote:
>Is there any way to get an efficient 16-bit hash in Python?

hash(obj)&65535

- Josiah
Yes, I thought of that, but I cannot figure out whether the internal hash really
distributes the bits evenly, particularly since it seems to treat integers etc.
as special cases:
>>> hash(1)
1
>>> hash(2)
2
>>> hash('1234')
1723328704
>>>
--
Robin Becker

Jun 27 '07 #3
Robin Becker <ro***@reportlab.com> wrote:
Is there any way to get an efficient 16-bit hash in Python?
Here is a 32 bit crc of which you could use the bottom 16 bits as a 16
bit hash...
>>> import binascii
>>> binascii.crc32("hello world")
222957957
>>> crc = binascii.crc32("hello")
>>> crc = binascii.crc32(" world", crc)
>>> crc
222957957
>>>
And in case you are wondering how fast it is...

$ python -m timeit -s 's="."*1024*1024; from binascii import crc32' 'crc32(s)'
100 loops, best of 3: 4.25 msec per loop

Which is 235 MB/s

Ignore the bit in the binascii docs which says "Since the algorithm is
designed for use as a checksum algorithm, it is not suitable for use
as a general hash algorithm". Actually CRCs make pretty good hash
algorithms if you want something for making a hash table. They make
very bad cryptographic hash generators since they are linear and thus
easily forgeable. That said you aren't going to be doing any serious
crypto with only 16 bits.
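A sketch of this CRC idea in modern Python, where crc32 takes bytes and always returns an unsigned value (the helper name is made up for illustration):

```python
from binascii import crc32

def crc_hash16(data: bytes) -> int:
    """16-bit hash taken from the low 16 bits of a CRC-32."""
    return crc32(data) & 0xFFFF

# As shown above, a CRC can also be computed incrementally by feeding
# the running value back in as the second argument.
crc = crc32(b"hello")
crc = crc32(b" world", crc)
assert crc == crc32(b"hello world")
assert 0 <= crc_hash16(b"hello world") <= 0xFFFF
```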

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jun 27 '07 #4
Robin Becker wrote:
Josiah Carlson wrote:
>Robin Becker wrote:
>>Is there any way to get an efficient 16-bit hash in Python?

hash(obj)&65535

- Josiah
Yes, I thought of that, but I cannot figure out whether the internal hash really
distributes the bits evenly, particularly since it seems to treat integers etc.
as special cases:
>>> hash(1)
1
>>> hash(2)
2
>>> hash('1234')
1723328704
>>>
I'm not sure if it helps in your case, but this is pretty nifty:

http://www.amk.ca/python/code/perfect-hash

Thomas

Jun 27 '07 #5
On 2007-06-27, Robin Becker <ro***@reportlab.com> wrote:
Is there any way to get an efficient 16-bit hash in Python?
int(md5.new(s).hexdigest()[:4], 16) ?
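In Python 3 the old `md5` module is gone; a minimal sketch of the same truncation using `hashlib` (the function name is made up for illustration):

```python
import hashlib

def md5_hash16(s: str) -> int:
    # The first four hex digits of the MD5 digest give exactly 16 bits.
    return int(hashlib.md5(s.encode()).hexdigest()[:4], 16)

assert 0 <= md5_hash16("hello") <= 0xFFFF
assert md5_hash16("hello") == md5_hash16("hello")  # deterministic
```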
Jun 27 '07 #6
On Jun 27, 12:11 pm, Robin Becker <r...@reportlab.com> wrote:
Josiah Carlson wrote:
Robin Becker wrote:
Is there any way to get an efficient 16-bit hash in Python?
hash(obj)&65535
- Josiah

Yes, I thought of that, but I cannot figure out whether the internal hash really
distributes the bits evenly, particularly since it seems to treat integers etc.
as special cases:
>>> hash(1)
1
>>> hash(2)
2
>>> hash('1234')
1723328704
>>>
And then hash(-1) is a SPECIAL special case.
>>> hash(-1)
-2

Jun 28 '07 #7
Robin Becker <ro***@reportlab.com> writes:
Is there any way to get an efficient 16-bit hash in Python?
hash(obj)&65535
- Josiah
Yes, I thought of that, but I cannot figure out whether the internal hash
really distributes the bits evenly, particularly since it seems to treat
integers etc. as special cases.
The built-in hash function doesn't attempt an even distribution; it tries
to make sure that inputs with small differences hash to distinct
values. If you want something really near-random, try

import sha
hash = int(sha.new(repr(obj)).hexdigest()[:4], 16)

It won't be super-fast but hey, you're using Python.
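The `sha` module was removed in Python 3; a sketch of the same idea with `hashlib.sha1` (the helper name is made up for illustration):

```python
import hashlib

def sha_hash16(obj) -> int:
    # repr() gives a reproducible byte string for simple objects.
    digest = hashlib.sha1(repr(obj).encode()).hexdigest()
    return int(digest[:4], 16)  # first 4 hex digits == 16 bits

assert 0 <= sha_hash16((1, "a")) <= 0xFFFF
```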
Jun 28 '07 #8
Is there any way to get an efficient 16-bit hash in Python?

Unfortunately, that problem is underspecified. The simplest
and most efficient 16-bit hash I can come up with is

def super_efficient_hash(o):
    return 0

Now, you say: this gives collisions.
So yes, it does - but you didn't ask for it to be
perfect, just efficient.

But I don't think you are looking for a perfect hash,
but one that distributes the values evenly.

Unfortunately, that *still* would be underspecified:
you need to specify the distribution of the input values
to be able to tell whether the distribution of hash
values is even.

More generally: for most 16-bit hash functions H that
are capable of yielding all 65536 hash values, and for
any number N, there is a set S1 of N objects where
H distributes evenly, and a set S2 of N objects where
all hash values collide.
(the only exceptions are hash functions which produce
some output values only for a finite number of inputs)

So: what are your input data, and what is the
distribution among them?

Regards,
Martin

Jun 28 '07 #9
"Martin v. Löwis" <ma****@v.loewis.de> writes:
So: what are your input data, and what is the
distribution among them?
With good enough hash functions one shouldn't need to care about
the input distribution. Basically functions like SHA can be
used as extractors:

http://en.wikipedia.org/wiki/Extractor

If there's a concern that the input distribution is specially
concocted to give nonuniform results with some known hash function,
then use one unknown to the input provider, e.g.

import hmac
def hash(obj, key='some string unknown to the input source'):
    return int(hmac.HMAC(key, repr(obj)).hexdigest()[:4], 16)

Anyway I don't have the impression that the OP is concerned with this
type of issue. Otherwise s/he'd want much longer hashes than 16 bits.
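A runnable sketch of this keyed-hash idea against the modern `hmac`/`hashlib` interface, which takes bytes and an explicit digest (the function name and key are placeholders):

```python
import hashlib
import hmac

def keyed_hash16(obj, key=b"placeholder secret key"):
    # An HMAC keyed with a value the input source doesn't know prevents
    # an adversary from deliberately constructing colliding inputs.
    mac = hmac.new(key, repr(obj).encode(), hashlib.sha256)
    return int(mac.hexdigest()[:4], 16)  # truncate to 16 bits

assert 0 <= keyed_hash16("anything") <= 0xFFFF
```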
Jun 28 '07 #10
Martin v. Löwis wrote:

0 the ideal hash

:)

can't be argued with
>.......
So: what are your input data, and what is the
distribution among them?

Regards,
Martin
I'm trying to create UniqueIDs for dynamic postscript fonts. According to my
resources we don't actually need to use these, but if they are required by a
particular postscript program (perhaps to make a print run efficient) then the
private range of these IDs is 4000000 <= UID <= 4999999, i.e. a range of one million.

So I probably really need an 18-bit hash.

The data going into the font consists of

fontBBox '[-415 -431 2014 2033]'
charmaps ['dup (\000) 0 get /C0 put',......]
metrics ['/C0 1251 def',.....]
bboxes ['/C29 [0 0 512 0] def',.......]
chardefs ['/C0 {newpath 224 418 m 234 336 ......def}',......]

i.e. a bunch of lists of strings which are eventually joined together and written
out with a template to make the postscript definition.

The UniqueID is used by PS interpreters to avoid recreating particular glyphs so
ideally I would number these fonts sequentially using a global count, but in
practice several processes separated by application and time can produce
postscript which eventually gets merged back together.

If the UID's clash then the printer produces very strange output.

I'm fairly sure there's no obvious Python way to ensure the separated processes
can communicate except via the printer. So either I use a Python-based scheme
which reduces the risk of clashes (i.e. a random or data-based hash scheme), or I
attempt to produce a postscript solution like looking for a private global
sequence number.

I'm not sure my postscript is really good enough to do the latter, so I hoped to
pursue a Python-based approach which has a low probability of busting.
Originally I thought the range was a 16-bit number, which is why I started with
16-bit hashes.
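The clash risk being accepted here can be quantified with the birthday bound; a sketch (the function name is made up, and the one-million figure comes from the UID range above; an 18-bit hash would use space=2**18, i.e. 262144 slots):

```python
# Probability of at least one UniqueID clash when n fonts are hashed
# uniformly into a space of one million private IDs (birthday problem).
def collision_prob(n: int, space: int = 1_000_000) -> float:
    # Exact product form: 1 - prod_{k=0}^{n-1} (space - k) / space
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (space - k) / space
    return 1.0 - p_all_distinct

# Around 1,200 fonts already give roughly even odds of a clash.
```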
--
Robin Becker

Jun 28 '07 #11
Robin Becker wrote:
........
>I'm not sure my postscript is really good enough to do the latter, so I
>hoped to pursue a Python-based approach which has a low probability of
>busting. Originally I thought the range was a 16-bit number, which is
>why I started with 16-bit hashes.

For identifying something, I suggest you use a hash function like SHA-1,
truncated to as many bits as you can use, similar to what Jon Ribbens
suggested.
Jun 28 '07 #12
Thomas Jollans wrote:
Robin Becker wrote:
........
>I'm not sure my postscript is really good enough to do the latter so I
hoped to pursue a python based approach which has a low probability of
busting. Originally I thought the range was a 16bit number which is why
I started with 16bit hashes.


For identifying something, I suggest you use a hash function like SHA-1,
truncated to as many bits as you can use, similar to what Jon Ribbens
suggested.
That is in fact what I'm doing; my function looks like this:

4000000+(reduce(operator.xor,struct.unpack('i'*4,md5.md5(s).digest()))&0x3ffff)

whether it's any better than using the lowest bits I have no real idea. I
suppose (sha being flavour of the month) I should really use

4000000+(reduce(operator.xor,struct.unpack('i'*5,sha.sha(s).digest()))&0x3ffff)
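For reference, a self-contained Python 3 version of the SHA variant of this one-liner (the `sha`/`md5` modules are replaced by `hashlib`; the function name is made up):

```python
import hashlib
import operator
import struct
from functools import reduce

def font_uid(data: bytes) -> int:
    # XOR-fold the five 32-bit words of the SHA-1 digest, keep 18 bits,
    # then shift the result into the private UniqueID range.
    words = struct.unpack('5i', hashlib.sha1(data).digest())
    return 4000000 + (reduce(operator.xor, words) & 0x3FFFF)

uid = font_uid(b"/C0 1251 def")
assert 4000000 <= uid <= 4000000 + 0x3FFFF
```

Note that the 18-bit fold only reaches 262144 of the one million available IDs, so the effective collision space is smaller than the full private range.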

--
Robin Becker

Jun 28 '07 #13
I'm trying to create UniqueID's for dynamic postscript fonts. According
to my resources we don't actually need to use these, but if they are
required by a particular postscript program (perhaps to make a print run
efficient) then the private range of these ID's is 4000000<=UID<=4999999
ie a range of one million.

So I probably really need an 18 bit hash
I don't fully understand the example, but ISTM that "no, you don't need
a hash at all. You need a unique identification".
The data going into the font consists of

fontBBox '[-415 -431 2014 2033]'
charmaps ['dup (\000) 0 get /C0 put',......]
metrics ['/C0 1251 def',.....]
bboxes ['/C29 [0 0 512 0] def',.......]
chardefs ['/C0 {newpath 224 418 m 234 336 ......def}',......]

ie a bunch of lists of strings which are eventually joined together and
written out with a template to make the postscript definition.
And the UniqueID should be unique within this file, right?

Why don't you just use a serial number then?

Regards,
Martin
Jun 28 '07 #14
Martin v. Löwis wrote:
...........
>ie a bunch of lists of strings which are eventually joined together and
written out with a template to make the postscript definition.

And the UniqueID should be unique within this file, right?

Why don't you just use a serial number then?
.......
I do need a unique identifier for each font; unfortunately they are required to
be unique across more than one file. Effectively these fonts are dynamically
generated using the used glyphs rather than just dumping the whole of a ttf into
the postscript.

If the UniqueID is used it must be unique when used in the printer (where I
cannot control which other UniqueID's might be present).

If I knew enough about postscript I might generate a UniqueID at the point of
use (by inspecting some global state); unfortunately my postscript is poor, so I'm
attempting to give the fonts an id that is likely to be unique by obtaining an
integer dependent on the font data and then shifting it into the private range.

Luckily the cheap option of not using the UniqueID at all is available, but
chances are some printer's ps interpreter will barf if it's not present, and then I
need a fairly robust way to generate reasonable candidates.
--
Robin Becker

Jun 28 '07 #15
If the UniqueID is used it must be unique when used in the printer
(where I cannot control which other UniqueID's might be present).
I don't know much about PostScript; googling around gives

http://fontforge.sourceforge.net/UniqueID.html

that says that you should not use them anymore, and

http://www.adobe.com/devnet/opentype/archives/id.html

says you should not make up values, but officially register them;
a certain range is available for testing without registration;
that should not be used in production mode.
Luckily the cheap option of not using the UniqueID at all is available,
but chances are some printer ps interpreter will barf if it's not
present and then I need a fairly robust way to generate reasonable
candidates.
See above article. It was always the case that a printer should not
barf; it just might have to re-render all glyphs. However, it seems
that today's printers have advanced caching, making the mechanism
irrelevant.

Regards,
Martin
Jun 28 '07 #16
Robin Becker <ro***@reportlab.com> writes:
whether it's any better than using the lowest bits I have no real
idea. I suppose (sha being flavour of the month) I should really use
I think if you have more than a few fonts you really have to assign
the id's uniquely instead of using hashes, to avoid collisions.
Jun 29 '07 #17
