473,396 Members | 1,814 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,396 software developers and data experts.

16bit hash

Is the any way to get an efficient 16bit hash in python?
--
Robin Becker

Jun 27 '07 #1
16 7296
Robin Becker wrote:
Is the any way to get an efficient 16bit hash in python?
hash(obj)&65535

- Josiah
Jun 27 '07 #2
Josiah Carlson wrote:
Robin Becker wrote:
>Is the any way to get an efficient 16bit hash in python?

hash(obj)&65535

- Josiah
yes I thought of that, but cannot figure out if the internal hash really
distributes the bits evenly. Particularly since it seems to treat integers etc
as special cases
>>hash(1)
1
>>hash(2)
2
>>hash('1234')
1723328704
>>>
--
Robin Becker

Jun 27 '07 #3
Robin Becker <ro***@reportlab.comwrote:
Is the any way to get an efficient 16bit hash in python?
Here is a 32 bit crc of which you could use the bottom 16 bits as a 16
bit hash...
>>import binascii
binascii.crc32("hello world")
222957957
>>crc = binascii.crc32("hello")
crc = binascii.crc32(" world", crc)
crc
222957957
>>>
And in case you are wondering how fast it is...

$ python -m timeit -s 's="."*1024*1024; from binascii import crc32' 'crc32(s)'
100 loops, best of 3: 4.25 msec per loop

Which is 235 MB/s

Ignore the bit in the binascii docs which says "Since the algorithm is
designed for use as a checksum algorithm, it is not suitable for use
as a general hash algorithm". Actually CRCs make pretty good hash
algorithms if you want something for making a hash table. They make
very bad cryptographic hash generators since they are linear and thus
easily forgeable. That said you aren't going to be doing any serious
crypto with only 16 bits.

--
Nick Craig-Wood <ni**@craig-wood.com-- http://www.craig-wood.com/nick
Jun 27 '07 #4
Robin Becker schrieb:
Josiah Carlson wrote:
>Robin Becker wrote:
>>Is the any way to get an efficient 16bit hash in python?

hash(obj)&65535

- Josiah
yes I thought of that, but cannot figure out if the internal hash really
distributes the bits evenly. Particularly since it seems to treat integers etc
as special cases
>>hash(1)
1
>>hash(2)
2
>>hash('1234')
1723328704
>>>
I'm not sure if it helps in your case, but this is pretty nifty:

http://www.amk.ca/python/code/perfect-hash

Thomas

Jun 27 '07 #5
On 2007-06-27, Robin Becker <ro***@reportlab.comwrote:
Is the any way to get an efficient 16bit hash in python?
int(md5.new(s).hexdigest()[:4], 16) ?
Jun 27 '07 #6
On Jun 27, 12:11 pm, Robin Becker <r...@reportlab.comwrote:
Josiah Carlson wrote:
Robin Becker wrote:
Is the any way to get an efficient 16bit hash in python?
hash(obj)&65535
- Josiah

yes I thought of that, but cannot figure out if the internal hash really
distributes the bits evenly. Particularly since it seems to treat integers etc
as special cases
>>hash(1)
1
>>hash(2)
2
>>hash('1234')
1723328704
>>>
And then hash(-1) is a SPECIAL special case.
>>hash(-1)
-2

Jun 28 '07 #7
Robin Becker <ro***@reportlab.comwrites:
Is the any way to get an efficient 16bit hash in python?
hash(obj)&65535
- Josiah
yes I thought of that, but cannot figure out if the internal hash
really distributes the bits evenly. Particularly since it seems to
treat integers etc as special cases
The built-in hash function doesn't attempt even distribution, it tries
to make sure that inputs with small differences hash to distinct
values. If you want something really near-random, try

import sha
hash = int(sha.new(repr(obj)).hexdigest()[:4], 16)

It won't be super-fast but hey, you're using Python.
Jun 28 '07 #8
Is the any way to get an efficient 16bit hash in python?

Unfortunately, that problem is underspecified. The simplest
and most efficient 16-bit hash I can come up with is

def super_efficient_hash(o):
return 0

Now, you say: this gives collisions.
So yes, it does - but you didn't ask for it to be
perfect, just efficient.

But I don't think you are looking for a perfect hash,
but one that distributes the values evenly.

Unfortunately, that *still* would be underspecified:
you need to specify the distribution of the input values
to be able to tell whether the distribution of hash
values is even.

More generally: for most 16-bit hash functions H that
are capable of yielding all 65536 hash values, and for
any number N, there is a set S1 of N objects where
H distributes even, and a set S2 of N objects where
all hash values collide.
(the only exceptions are hash functions which produce
some output values only for a finite number of inputs)

So: what are your input data, and what is the
distribution among them?

Regards,
Martin

Jun 28 '07 #9
"Martin v. Löwis" <ma****@v.loewis.dewrites:
So: what are your input data, and what is the
distribution among them?
With good enough hash functions one shouldn't need to care about
the input distribution. Basically functions like SHA can be
used as extractors:

http://en.wikipedia.org/wiki/Extractor

If there's a concern that the input distribution is specially
concocted to give nonuniform results with some known hash function,
then use one unknown to the input provider, e.g.

import hmac
def hash(obj, key='some string unknown to the input source'):
return int(hmac.HMAC(key,repr(obj)).hexdigest()[:4], 16)

Anyway I don't have the impression that the OP is concerned with this
type of issue. Otherwise s/he'd want much longer hashes than 16 bits.
Jun 28 '07 #10
Martin v. Löwis wrote:

0 the ideal hash

:)

can't be argued with
>.......
So: what are your input data, and what is the
distribution among them?

Regards,
Martin
I'm trying to create UniqueID's for dynamic postscript fonts. According to my
resources we don't actually need to use these, but if they are required by a
particular postscript program (perhaps to make a print run efficient) then the
private range of these ID's is 4000000<=UID<=4999999 ie a range of one million.

So I probably really need an 18 bit hash

The data going into the font consists of

fontBBox '[-415 -431 2014 2033]'
charmaps ['dup (\000) 0 get /C0 put',......]
metrics ['/C0 1251 def',.....]
bboxes ['/C29 [0 0 512 0] def',.......]
chardefs ['/C0 {newpath 224 418 m 234 336 ......def}',......]

ie a bunch of lists of strings which are eventually joined together and written
out with a template to make the postscript definition.

The UniqueID is used by PS interpreters to avoid recreating particular glyphs so
ideally I would number these fonts sequentially using a global count, but in
practice several processes separated by application and time can produce
postscript which eventually gets merged back together.

If the UID's clash then the printer produces very strange output.

I'm fairly sure there's no obvious python way to ensure the separated processes
can communicate except via the printer. So either I use a python based scheme
which reduces the risk of clashes ie random or some data based hash scheme or I
attempt to produce a postscript solution like looking for a private global
sequence number.

I'm not sure my postscript is really good enough to do the latter so I hoped to
pursue a python based approach which has a low probability of busting.
Originally I thought the range was a 16bit number which is why I started with
16bit hashes.
--
Robin Becker

Jun 28 '07 #11
Robin Becker wrote:
Martin v. Löwis wrote:

0 the ideal hash

:)

can't be argued with
>.......
So: what are your input data, and what is the
distribution among them?

Regards,
Martin
I'm trying to create UniqueID's for dynamic postscript fonts. According
to my resources we don't actually need to use these, but if they are
required by a particular postscript program (perhaps to make a print run
efficient) then the private range of these ID's is 4000000<=UID<=4999999
ie a range of one million.

So I probably really need an 18 bit hash

The data going into the font consists of

fontBBox '[-415 -431 2014 2033]'
charmaps ['dup (\000) 0 get /C0 put',......]
metrics ['/C0 1251 def',.....]
bboxes ['/C29 [0 0 512 0] def',.......]
chardefs ['/C0 {newpath 224 418 m 234 336 ......def}',......]

ie a bunch of lists of strings which are eventually joined together and
written out with a template to make the postscript definition.

The UniqueID is used by PS interpreters to avoid recreating particular
glyphs so ideally I would number these fonts sequentially using a global
count, but in practice several processes separated by application and
time can produce postscript which eventually gets merged back together.

If the UID's clash then the printer produces very strange output.

I'm fairly sure there's no obvious python way to ensure the separated
processes can communicate except via the printer. So either I use a
python based scheme which reduces the risk of clashes ie random or some
data based hash scheme or I attempt to produce a postscript solution
like looking for a private global sequence number.

I'm not sure my postscript is really good enough to do the latter so I
hoped to pursue a python based approach which has a low probability of
busting. Originally I thought the range was a 16bit number which is why
I started with 16bit hashes.

For identifying something, I suggest you use a hash function like sha1
truncating it to as much as you can use, similarly to what Jon Ribbens
suggested.
Jun 28 '07 #12
Thomas Jollans wrote:
Robin Becker wrote:
........
>I'm not sure my postscript is really good enough to do the latter so I
hoped to pursue a python based approach which has a low probability of
busting. Originally I thought the range was a 16bit number which is why
I started with 16bit hashes.


For identifying something, I suggest you use a hash function like sha1
truncating it to as much as you can use, similarly to what Jon Ribbens
suggested.
that is in fact what I'm doing; my function looks like this

4000000+(reduce(operator.xor,struct.unpack('i'*4,m d5.md5(s).digest()))&0x3ffff)

whether it's any better than using the lowest bits I have no real idea. I
suppose (sha being flavour of the month) I should really use

4000000+(reduce(operator.xor,struct.unpack('i'*5,s ha.sha(s).digest()))&0x3ffff)

--
Robin Becker

Jun 28 '07 #13
I'm trying to create UniqueID's for dynamic postscript fonts. According
to my resources we don't actually need to use these, but if they are
required by a particular postscript program (perhaps to make a print run
efficient) then the private range of these ID's is 4000000<=UID<=4999999
ie a range of one million.

So I probably really need an 18 bit hash
I don't fully understand the example, but ISTM that "no, you don't need
a hash at all. You need a unique identification".
The data going into the font consists of

fontBBox '[-415 -431 2014 2033]'
charmaps ['dup (\000) 0 get /C0 put',......]
metrics ['/C0 1251 def',.....]
bboxes ['/C29 [0 0 512 0] def',.......]
chardefs ['/C0 {newpath 224 418 m 234 336 ......def}',......]

ie a bunch of lists of strings which are eventually joined together and
written out with a template to make the postscript definition.
And the UniqueID should be unique within this file, right?

Why don't you just use a serial number then?

Regards,
Martin
Jun 28 '07 #14
Martin v. Löwis wrote:
...........
>ie a bunch of lists of strings which are eventually joined together and
written out with a template to make the postscript definition.

And the UniqueID should be unique within this file, right?

Why don't you just use a serial number then?
.......
I do need a unique identifier for each font, unfortunately they are required to
be unique across more than one file. Effectively these fonts are dynamically
generated using the used glyphs rather than just dumping the whole of a ttf into
the postscript.

If the UniqueID is used it must be unique when used in the printer (where I
cannot control which other UniqueID's might be present).

If I knew enough about postscript I might generate a UniqueID at the point of
use (by inspecting some global state) unfortunately my postscript is poor so I'm
attempting to give the fonts an id that is likely to be unique by obtaining an
integer dependant on the font data and then shifting into the private range.

Luckily the cheap option of not using the UniqueID at all is available, but
chances are some printer ps interpreter will barf if it's not present and then I
need a fairly robust way to generate reasonable candidates.
--
Robin Becker

Jun 28 '07 #15
If the UniqueID is used it must be unique when used in the printer
(where I cannot control which other UniqueID's might be present).
I don't know much about PostScript; googling around gives

http://fontforge.sourceforge.net/UniqueID.html

that says that you should not use them anymore, and

http://www.adobe.com/devnet/opentype/archives/id.html

says you should not make up values, but officially register them;
a certain range is available for testing without registration;
that should not be used in production mode.
Luckily the cheap option of not using the UniqueID at all is available,
but chances are some printer ps interpreter will barf if it's not
present and then I need a fairly robust way to generate reasonable
candidates.
See above article. It was always the case that a printer should not
barf; it just might have to re-render all glyphs. However, it seems
that today's printers have advanced caching, making the mechanism
irrelevant.

Regards,
Martin
Jun 28 '07 #16
Robin Becker <ro***@reportlab.comwrites:
whether it's any better than using the lowest bits I have no real
idea. I suppose (sha being flavour of the month) I should really use
I think if you have more than a few fonts you really have to assign
the id's uniquely instead of using hashes, to avoid collisions.
Jun 29 '07 #17

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: McBooCzech | last post by:
Hi all, before I decided to bother you by e-mail, I spent days (not kidding) searching on the Internet. I am looking for Python binaries for DOS-16bit!!!! Not for Win-16bit or DOS-32 which are...
3
by: Murali | last post by:
I have a requirement where I have to use two unsigned ints as a key in a STL hash map. A couple of ways to do this is 1. create a struct with two unsigned ints and use that as key (write my own...
2
by: Bryan Olson | last post by:
The current Python standard library provides two cryptographic hash functions: MD5 and SHA-1 . The authors of MD5 originally stated: It is conjectured that it is computationally infeasible to...
0
by: Greg | last post by:
Hello, I'm trying to load a 16bit multiframe TIFF-file in a bitmap, but it doesn't work. Error message: 'System.Runtime.InteropServices.ExternalException' 'A generic error occurred in GDI+' ...
5
by: Neeraj | last post by:
hi I am working on dicom viewer. 8 bit raw dicom image working properly.But 16 bit Dicom image is not showing.What Should i have to do.You have any idea about That to convert it in 8bit ....
24
by: kdotsky | last post by:
Hello, I am using some very large dictionaries with keys that are long strings (urls). For a large dictionary these keys start to take up a significant amount of memory. I do not need access to...
12
by: Arash Partow | last post by:
Hi all, I've ported various hash functions to python if anyone is interested: def RSHash(key): a = 378551 b = 63689 hash = 0
21
by: Johan Tibell | last post by:
I would be grateful if someone had a minute or two to review my hash table implementation. It's not yet commented but hopefully it's short and idiomatic enough to be readable. Some of the code...
18
by: beginner | last post by:
Hi All. I'd like to do the following in more succint code: if k in b: a=b else: a={} b=a
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.