Bytes | Software Development & Data Engineering Community

16-bit hash

Is there any way to get an efficient 16-bit hash in Python?
--
Robin Becker

Jun 27 '07 #1
Robin Becker wrote:
Is there any way to get an efficient 16-bit hash in Python?
hash(obj)&65535
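For illustration, a small sketch of this masking approach (the sample values are arbitrary): any value of Python's built-in hash can be folded into 16 bits with a bitwise AND.

```python
# Fold Python's built-in hash into 16 bits by masking off the low bits.
for obj in ("hello", 1234, (1, 2), frozenset({3})):
    h16 = hash(obj) & 0xFFFF  # 0xFFFF == 65535, i.e. keep 16 bits
    assert 0 <= h16 <= 0xFFFF
```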

- Josiah
Jun 27 '07 #2
Josiah Carlson wrote:
Robin Becker wrote:
>Is there any way to get an efficient 16-bit hash in Python?

hash(obj)&65535

- Josiah
Yes, I thought of that, but I cannot figure out whether the internal hash really
distributes the bits evenly, particularly since it seems to treat integers etc.
as special cases:
>>> hash(1)
1
>>> hash(2)
2
>>> hash('1234')
1723328704
>>>
--
Robin Becker

Jun 27 '07 #3
Robin Becker <ro***@reportlab.com> wrote:
Is there any way to get an efficient 16-bit hash in Python?
Here is a 32 bit crc of which you could use the bottom 16 bits as a 16
bit hash...
>>> import binascii
>>> binascii.crc32("hello world")
222957957
>>> crc = binascii.crc32("hello")
>>> crc = binascii.crc32(" world", crc)
>>> crc
222957957
>>>
And in case you are wondering how fast it is...

$ python -m timeit -s 's="."*1024*1024; from binascii import crc32' 'crc32(s)'
100 loops, best of 3: 4.25 msec per loop

Which is 235 MB/s

Ignore the bit in the binascii docs which says "Since the algorithm is
designed for use as a checksum algorithm, it is not suitable for use
as a general hash algorithm". Actually CRCs make pretty good hash
algorithms if you want something for making a hash table. They make
very bad cryptographic hash generators since they are linear and thus
easily forgeable. That said you aren't going to be doing any serious
crypto with only 16 bits.
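A sketch of this CRC idea in modern Python, where crc32 takes bytes and always returns an unsigned value (the helper name is made up for illustration):

```python
from binascii import crc32

def crc_hash16(data: bytes) -> int:
    """16-bit hash taken from the low 16 bits of a CRC-32."""
    return crc32(data) & 0xFFFF

# As shown above, a CRC can also be computed incrementally by feeding
# the running value back in as the second argument.
crc = crc32(b"hello")
crc = crc32(b" world", crc)
assert crc == crc32(b"hello world")
assert 0 <= crc_hash16(b"hello world") <= 0xFFFF
```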

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jun 27 '07 #4
Robin Becker wrote:
Josiah Carlson wrote:
>Robin Becker wrote:
>>Is there any way to get an efficient 16-bit hash in Python?

hash(obj)&65535

- Josiah
Yes, I thought of that, but I cannot figure out whether the internal hash really
distributes the bits evenly, particularly since it seems to treat integers etc.
as special cases:
>>> hash(1)
1
>>> hash(2)
2
>>> hash('1234')
1723328704
>>>
I'm not sure if it helps in your case, but this is pretty nifty:

http://www.amk.ca/python/code/perfect-hash

Thomas

Jun 27 '07 #5
On 2007-06-27, Robin Becker <ro***@reportlab.com> wrote:
Is there any way to get an efficient 16-bit hash in Python?
int(md5.new(s).hexdigest()[:4], 16) ?
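In Python 3 the old `md5` module is gone; a minimal sketch of the same truncation using `hashlib` (the function name is made up for illustration):

```python
import hashlib

def md5_hash16(s: str) -> int:
    # The first four hex digits of the MD5 digest give exactly 16 bits.
    return int(hashlib.md5(s.encode()).hexdigest()[:4], 16)

assert 0 <= md5_hash16("hello") <= 0xFFFF
assert md5_hash16("hello") == md5_hash16("hello")  # deterministic
```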
Jun 27 '07 #6
On Jun 27, 12:11 pm, Robin Becker <r...@reportlab.com> wrote:
Josiah Carlson wrote:
Robin Becker wrote:
Is there any way to get an efficient 16-bit hash in Python?
hash(obj)&65535
- Josiah

Yes, I thought of that, but I cannot figure out whether the internal hash really
distributes the bits evenly, particularly since it seems to treat integers etc.
as special cases:
>>> hash(1)
1
>>> hash(2)
2
>>> hash('1234')
1723328704
>>>
And then hash(-1) is a SPECIAL special case.
>>> hash(-1)
-2

Jun 28 '07 #7
Robin Becker <ro***@reportlab.com> writes:
Is there any way to get an efficient 16-bit hash in Python?
hash(obj)&65535
- Josiah
Yes, I thought of that, but I cannot figure out whether the internal hash
really distributes the bits evenly, particularly since it seems to treat
integers etc. as special cases.
The built-in hash function doesn't attempt an even distribution; it tries
to make sure that inputs with small differences hash to distinct
values. If you want something really near-random, try

import sha
hash = int(sha.new(repr(obj)).hexdigest()[:4], 16)

It won't be super-fast but hey, you're using Python.
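The `sha` module was removed in Python 3; a sketch of the same idea with `hashlib.sha1` (the helper name is made up for illustration):

```python
import hashlib

def sha_hash16(obj) -> int:
    # repr() gives a reproducible byte string for simple objects.
    digest = hashlib.sha1(repr(obj).encode()).hexdigest()
    return int(digest[:4], 16)  # first 4 hex digits == 16 bits

assert 0 <= sha_hash16((1, "a")) <= 0xFFFF
```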
Jun 28 '07 #8
Is there any way to get an efficient 16-bit hash in Python?

Unfortunately, that problem is underspecified. The simplest
and most efficient 16-bit hash I can come up with is

def super_efficient_hash(o):
    return 0

Now, you say: this gives collisions.
So yes, it does - but you didn't ask for it to be
perfect, just efficient.

But I don't think you are looking for a perfect hash,
but one that distributes the values evenly.

Unfortunately, that *still* would be underspecified:
you need to specify the distribution of the input values
to be able to tell whether the distribution of hash
values is even.

More generally: for most 16-bit hash functions H that
are capable of yielding all 65536 hash values, and for
any number N, there is a set S1 of N objects where
H distributes evenly, and a set S2 of N objects where
all hash values collide.
(the only exceptions are hash functions which produce
some output values only for a finite number of inputs)

So: what are your input data, and what is the
distribution among them?

Regards,
Martin

Jun 28 '07 #9
"Martin v. Löwis" <ma****@v.loewis.de> writes:
So: what are your input data, and what is the
distribution among them?
With good enough hash functions one shouldn't need to care about
the input distribution. Basically functions like SHA can be
used as extractors:

http://en.wikipedia.org/wiki/Extractor

If there's a concern that the input distribution is specially
concocted to give nonuniform results with some known hash function,
then use one unknown to the input provider, e.g.

import hmac
def hash(obj, key='some string unknown to the input source'):
    return int(hmac.HMAC(key, repr(obj)).hexdigest()[:4], 16)

Anyway I don't have the impression that the OP is concerned with this
type of issue. Otherwise s/he'd want much longer hashes than 16 bits.
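A runnable sketch of this keyed-hash idea against the modern `hmac`/`hashlib` interface, which takes bytes and an explicit digest (the function name and key are placeholders):

```python
import hashlib
import hmac

def keyed_hash16(obj, key=b"placeholder secret key"):
    # An HMAC keyed with a value the input source doesn't know prevents
    # an adversary from deliberately constructing colliding inputs.
    mac = hmac.new(key, repr(obj).encode(), hashlib.sha256)
    return int(mac.hexdigest()[:4], 16)  # truncate to 16 bits

assert 0 <= keyed_hash16("anything") <= 0xFFFF
```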
Jun 28 '07 #10
Martin v. Löwis wrote:

0 the ideal hash

:)

can't be argued with
>.......
So: what are your input data, and what is the
distribution among them?

Regards,
Martin
I'm trying to create UniqueIDs for dynamic postscript fonts. According to my
resources we don't actually need to use these, but if they are required by a
particular postscript program (perhaps to make a print run efficient) then the
private range of these IDs is 4000000 <= UID <= 4999999, i.e. a range of one million.

So I probably really need an 18-bit hash.

The data going into the font consists of

fontBBox '[-415 -431 2014 2033]'
charmaps ['dup (\000) 0 get /C0 put',......]
metrics ['/C0 1251 def',.....]
bboxes ['/C29 [0 0 512 0] def',.......]
chardefs ['/C0 {newpath 224 418 m 234 336 ......def}',......]

i.e. a bunch of lists of strings which are eventually joined together and written
out with a template to make the postscript definition.

The UniqueID is used by PS interpreters to avoid recreating particular glyphs so
ideally I would number these fonts sequentially using a global count, but in
practice several processes separated by application and time can produce
postscript which eventually gets merged back together.

If the UID's clash then the printer produces very strange output.

I'm fairly sure there's no obvious Python way to ensure the separated processes
can communicate except via the printer. So either I use a Python-based scheme
which reduces the risk of clashes (i.e. a random or data-based hash scheme), or I
attempt to produce a postscript solution like looking for a private global
sequence number.

I'm not sure my postscript is really good enough to do the latter, so I hoped to
pursue a Python-based approach which has a low probability of busting.
Originally I thought the range was a 16-bit number, which is why I started with
16-bit hashes.
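The clash risk being accepted here can be quantified with the birthday bound; a sketch (the function name is made up, and the one-million figure comes from the UID range above; an 18-bit hash would use space=2**18, i.e. 262144 slots):

```python
# Probability of at least one UniqueID clash when n fonts are hashed
# uniformly into a space of one million private IDs (birthday problem).
def collision_prob(n: int, space: int = 1_000_000) -> float:
    # Exact product form: 1 - prod_{k=0}^{n-1} (space - k) / space
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (space - k) / space
    return 1.0 - p_all_distinct

# Around 1,200 fonts already give roughly even odds of a clash.
```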
--
Robin Becker

Jun 28 '07 #11
Robin Becker wrote:
........
>I'm not sure my postscript is really good enough to do the latter, so I
>hoped to pursue a Python-based approach which has a low probability of
>busting. Originally I thought the range was a 16-bit number, which is
>why I started with 16-bit hashes.

For identifying something, I suggest you use a hash function like SHA-1,
truncated to as many bits as you can use, similar to what Jon Ribbens
suggested.
Jun 28 '07 #12
Thomas Jollans wrote:
Robin Becker wrote:
........
>I'm not sure my postscript is really good enough to do the latter so I
hoped to pursue a python based approach which has a low probability of
busting. Originally I thought the range was a 16bit number which is why
I started with 16bit hashes.


For identifying something, I suggest you use a hash function like SHA-1,
truncated to as many bits as you can use, similar to what Jon Ribbens
suggested.
That is in fact what I'm doing; my function looks like this:

4000000+(reduce(operator.xor,struct.unpack('i'*4,md5.md5(s).digest()))&0x3ffff)

whether it's any better than using the lowest bits I have no real idea. I
suppose (sha being flavour of the month) I should really use

4000000+(reduce(operator.xor,struct.unpack('i'*5,sha.sha(s).digest()))&0x3ffff)
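For reference, a self-contained Python 3 version of the SHA variant of this one-liner (the `sha`/`md5` modules are replaced by `hashlib`; the function name is made up):

```python
import hashlib
import operator
import struct
from functools import reduce

def font_uid(data: bytes) -> int:
    # XOR-fold the five 32-bit words of the SHA-1 digest, keep 18 bits,
    # then shift the result into the private UniqueID range.
    words = struct.unpack('5i', hashlib.sha1(data).digest())
    return 4000000 + (reduce(operator.xor, words) & 0x3FFFF)

uid = font_uid(b"/C0 1251 def")
assert 4000000 <= uid <= 4000000 + 0x3FFFF
```

Note that the 18-bit fold only reaches 262144 of the one million available IDs, so the effective collision space is smaller than the full private range.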

--
Robin Becker

Jun 28 '07 #13
I'm trying to create UniqueID's for dynamic postscript fonts. According
to my resources we don't actually need to use these, but if they are
required by a particular postscript program (perhaps to make a print run
efficient) then the private range of these ID's is 4000000<=UID<=4999999
ie a range of one million.

So I probably really need an 18 bit hash
I don't fully understand the example, but ISTM that "no, you don't need
a hash at all. You need a unique identification".
The data going into the font consists of

fontBBox '[-415 -431 2014 2033]'
charmaps ['dup (\000) 0 get /C0 put',......]
metrics ['/C0 1251 def',.....]
bboxes ['/C29 [0 0 512 0] def',.......]
chardefs ['/C0 {newpath 224 418 m 234 336 ......def}',......]

ie a bunch of lists of strings which are eventually joined together and
written out with a template to make the postscript definition.
And the UniqueID should be unique within this file, right?

Why don't you just use a serial number then?

Regards,
Martin
Jun 28 '07 #14
Martin v. Löwis wrote:
...........
>ie a bunch of lists of strings which are eventually joined together and
written out with a template to make the postscript definition.

And the UniqueID should be unique within this file, right?

Why don't you just use a serial number then?
.......
I do need a unique identifier for each font; unfortunately they are required to
be unique across more than one file. Effectively these fonts are dynamically
generated using the used glyphs rather than just dumping the whole of a ttf into
the postscript.

If the UniqueID is used it must be unique when used in the printer (where I
cannot control which other UniqueID's might be present).

If I knew enough about postscript I might generate a UniqueID at the point of
use (by inspecting some global state); unfortunately my postscript is poor, so I'm
attempting to give the fonts an id that is likely to be unique by obtaining an
integer dependent on the font data and then shifting it into the private range.

Luckily the cheap option of not using the UniqueID at all is available, but
chances are some printer's ps interpreter will barf if it's not present, and then I
need a fairly robust way to generate reasonable candidates.
--
Robin Becker

Jun 28 '07 #15
If the UniqueID is used it must be unique when used in the printer
(where I cannot control which other UniqueID's might be present).
I don't know much about PostScript; googling around gives

http://fontforge.sourceforge.net/UniqueID.html

that says that you should not use them anymore, and

http://www.adobe.com/devnet/opentype/archives/id.html

says you should not make up values, but officially register them;
a certain range is available for testing without registration;
that should not be used in production mode.
Luckily the cheap option of not using the UniqueID at all is available,
but chances are some printer ps interpreter will barf if it's not
present and then I need a fairly robust way to generate reasonable
candidates.
See above article. It was always the case that a printer should not
barf; it just might have to re-render all glyphs. However, it seems
that today's printers have advanced caching, making the mechanism
irrelevant.

Regards,
Martin
Jun 28 '07 #16
Robin Becker <ro***@reportlab.com> writes:
whether it's any better than using the lowest bits I have no real
idea. I suppose (sha being flavour of the month) I should really use
I think if you have more than a few fonts you really have to assign
the id's uniquely instead of using hashes, to avoid collisions.
Jun 29 '07 #17
