Bytes IT Community

SHA-based subclass for random module

http://www.nightsong.com/phr/python/sharandom.c

This is intended to be less predictable/have fewer correlations than
the default Mersenne Twister or Wichmann-Hill generators. Comments
are appreciated.
Jul 18 '05 #1
28 Replies


[Paul Rubin]
This is intended to be less predictable/have fewer correlations than
the default Mersenne Twister or Wichmann-Hill generators. Comments
are appreciated.


Because SHA-1 is a digest function, additional input can be tacked on
without impacting invertibility. That gives you an opportunity to
incorporate the Mersenne Twister to take advantage of its provably
long period.

Here is a rough cut that extracts a 53 bit float on every pass and
leaves the remaining 107 bits of state hidden:

import struct
import sha
import random

def srandom(seed):
    gen = sha.new(seed)
    digest = gen.digest
    update = gen.update
    random.seed(seed)
    mt = random.random
    unpack = struct.unpack
    tofloat = 1.0 / 2 ** 53
    while 1:
        state = digest()
        randint = unpack('Q', state[:8])[0] >> 11   # grab 53 bits
        yield randint * tofloat
        update(state)
        update(str(mt()))   # twister never hurts, but can help period

r = srandom('the quick brown fox jumped')
for i in xrange(10):
    print r.next()
Raymond Hettinger
Jul 18 '05 #2

py****@rcn.com (Raymond Hettinger) writes:
Because SHA-1 is a digest function, additional input can be tacked on
without impacting invertibility. That gives you an opportunity to
incorporate the Mersenne Twister to take advantage of its provably
long period.


I don't see any point to including Mersenne Twister output in the SHA
update. If you want to make sure of a long period, it's easier to
just mix in a simple counter. Note that the generator I posted should
have period about 2**160 since in order to repeat, two different SHA
states would have to repeat simultaneously.
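The counter suggestion can be made concrete. The sketch below is hypothetical, not the posted code: it renders the same idea (SHA-1 run in output-feedback mode, with a sequential counter mixed into each update) in modern Python using `hashlib` in place of the old `sha` module, and the name `sha_counter_random` is invented for illustration.

```python
import hashlib
import struct

def sha_counter_random(seed):
    """Sketch: SHA-1 in OFB mode with a sequential counter mixed into
    every update, so the hash input stream can never repeat and no
    short output cycle is possible within any feasible run length."""
    state = hashlib.sha1(seed.encode()).digest()
    counter = 0
    while True:
        counter += 1
        state = hashlib.sha1(state + struct.pack('<Q', counter)).digest()
        # reveal only 53 of the 160 state bits per output
        randint = struct.unpack('<Q', state[:8])[0] >> 11
        yield randint / 2.0 ** 53

gen = sha_counter_random('the quick brown fox jumped')
sample = [next(gen) for _ in range(5)]
```

Because the counter input is distinct on every step, two SHA states can never line up the same way twice, which is the period guarantee being described.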
Jul 18 '05 #3

[Paul Rubin]
This is intended to be less predictable/have fewer correlations than
the default Mersenne Twister or Wichmann-Hill generators. Comments
are appreciated.


I offer this as an alternative:
from random import Random
from struct import unpack
import md5

class MD5Random(Random):
    def newrandom(self, tofloat = 1.0 / 2 ** 53):
        plaintxt = str(Random.random(self))
        ciphertxt = md5.new(plaintxt).digest()
        randint = unpack('<Q', ciphertxt[:8])[0] >> 11   # grab 53 bits
        return randint * tofloat
Advantages over the original:

* Much faster

* Much shorter code

* More readable code

* Threadsafe

* Takes advantage of MT's proven equidistribution, of its passing
known tests for randomness, and of its known period (in contrast,
SHA-1 was designed as a digest that makes it computationally
intractable to find a different message giving the same signature --
that does *not* imply the non-existence of short cycles for some
keys).

* It uses both MD5 and the Mersenne Twister as they were designed (in
contrast, my google searches show *no* research on using SHA-1 in OFB
and reapplying SHA-1 again to conceal the output).
Raymond Hettinger
Jul 18 '05 #4

Correction to the last posting.
The method name should have been random() instead of newrandom().
Raymond
Jul 18 '05 #5

py****@rcn.com (Raymond Hettinger) writes:
I offer this as an alternative:

> from random import Random
> from struct import unpack
> import md5

Why md5 instead of sha?

> class MD5Random(Random):
>     def newrandom(self, tofloat = 1.0 / 2 ** 53):
>         plaintxt = str(Random.random(self))

What's Random.random(self) supposed to do?

>         ciphertxt = md5.new(plaintxt).digest()

I think you mean update.

>         randint = unpack('<Q', ciphertxt[:8])[0] >> 11 # grab 53 bits
>         return randint * tofloat

Advantages over the original:

* Much faster
Yes, other version wasn't optimized. Both call the hash function
approx once per output though.
* Much shorter code
But doesn't support all the operations specified in the Random API.
What's up with that?
* More readable code
Perhaps so. It took me a little while to figure out in the previous
version that the security came from revealing only part of the output
of each hash (107 secret bits for SHA). In the md5 version there's
only 75 secret bits, which is cutting things a little close. It may
be harder to do a concrete security proof for this construction too
(I'm not sure, I've never done one).
* Threadsafe
Always a plus.
* Takes advantage of MT's proven equidistribution, of its passing
known tests for randomness, and of its known period (in contrast,
SHA-1 was designed as a digest that makes it computationally
intractable to find a different message giving the same signature --
that does *not* imply the non-existence of short cycles for some
keys).
I'm missing something here--if you're just trying to avoid possible
cycles, why do you want to use MT instead of a sequential counter?
The counter is guaranteed to never cycle (unless you overflow the 512
bit compression function input, which won't happen in the life of the
universe).

Note that while there's probably some short cycles in the double-SHA
construction, the chance of finding one is infinitesimally small (it
means finding a 320 bit birthday collision). Hitting one by accident
is comparable to guessing an AES key by accident. Finding one by
brute force or by cryptanalysis is again comparable to breaking AES.
* It uses both MD5 and the Mersenne Twister as they were designed (in
contrast, my google searches show *no* research on using SHA-1 in OFB
and reapplying SHA-1 again to conceal the output).


Applying a collision resistant hash function to itself is a standard
construction for making a PRF and is the basis of HMAC. I believe
the relevant paper is

Bellare, M., Canetti, R., and Krawczyk, H., "Pseudorandom Functions
Revisited: The Cascade Construction". Proc. of the 37th IEEE Symp. on
Foundation of Computer Science, 1996, webbed (PostScript) at:

http://www.research.ibm.com/security/bck1.ps

but I haven't slogged all the way through it yet. It has a concrete
security proof and I've always found those things incredibly fussy and
pedantic compared to normal math papers. I haven't yet figured out
the reason why they're like that, or if they really need to be.
I'd like to attempt to do one for this generator though, once we
settle on an algorithm.

See also the other references to RFC 2104, notably Krawczyk et al's
original HMAC paper from Crypto 96.

My guess is the reason you haven't found any publications specific to
SHA1 with this construction is that nobody has found any results
against SHA1. As far as anyone can tell, the output is
indistinguishable from random.
Jul 18 '05 #6

Since the size of plaintext is only 2^53, can't I just calculate
all 2^53 md5 values in advance, and invert the output of MD5Random to
get MT outputs, then attack MT just like any LFSR?

A time-space tradeoff could be made between the number of precalculated
hashes and the number of outputs from the MD5Random PRNG needed, too.
A quick search on the internet indicates that about 2n bits are needed
to find the initial state of an n-bit LFSR[1], so something like 40000 bits
would be needed to find the internal state of MT, or 380 53-bit outputs.

So let's say you can precalculate 2^30 MD5 inverses, the chance of any
particular output being in the table is 2^-23. So you'd need something
like 2^31 outputs from MD5Random(). I'm sure the analysis is also
complicated by the fact that your 2n bits won't be consecutive outputs,
but will be widely separated, but it's likely to remain possible.

2^53 is a big number, but it's well less than the 2^128 range for
all MD5 signatures or the 2^64 signatures needed to find a collision.
2^30 storage and 2^31 calls to random is something many of us could do at
work, if not at home.

IANAC,
Jeff
[1] http://www.ciphersbyritter.com/RES/COMBCORR.HTM
"(The Berlekamp-Massey algorithm will recover the unknown state of a
simple n-bit LFSR, and its feedback polynomial, with just 2n known
bits.)"
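Jeff's table attack can be demonstrated at toy scale. This is a hedged sketch, not his actual computation: the 2^53 float input space is shrunk to 2^16 so the precomputed table fits in memory, and the names `table` and `observed` are invented, but the principle (enumerate the small input space, build a digest-to-preimage map, invert the hash by lookup) is the one he describes.

```python
import hashlib

# Precompute MD5 of every possible input (here 2**16 exact fractions,
# standing in for the infeasible 2**53 float inputs).
table = {}
for k in range(2 ** 16):
    digest = hashlib.md5(str(k / 2.0 ** 16).encode()).digest()
    table[digest[:8]] = k          # truncated digest -> hidden input

# Suppose we observe one PRNG output whose hidden inner value was 12345:
observed = hashlib.md5(str(12345 / 2.0 ** 16).encode()).digest()[:8]
recovered = table[observed]        # "inverts" MD5 by table lookup
```

Once enough inner values are recovered this way, they can be fed to a state-recovery attack on the underlying generator, which is the second stage of the attack described above.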

Jul 18 '05 #7

> What's Random.random(self) supposed to do?

Generate a random number using the Mersenne Twister random number generator.
ciphertxt = md5.new(plaintxt).digest()
I think you mean update.


Perhaps yes, perhaps no. Certainly the digests of both are dependent on
two inputs (the current internal state of MD5 and the random number
generated by MT). However, unless you can store the series of updates
to MD5, then getstate() followed by setstate() would not be sufficient
to get the same series of random numbers. This is also a "possible
issue" shared by the double-sha algorithm.
But doesn't support all the operations specified in the Random API.
What's up with that?
That is what the subclass was for: class MD5Random(Random):


- Josiah
Jul 18 '05 #8

Josiah Carlson <jc******@nospam.uci.edu> writes:
What's Random.random(self) supposed to do?


Generate a random number using the Mersenne Twister random number
generator.


According to the Random doc, Random.random() doesn't take an argument,
so I'm asking what the "self" is for.
ciphertxt = md5.new(plaintxt).digest()

I think you mean update.


Perhaps yes, perhaps no. Certainly the digests of both are dependent
on two inputs (the current internal state of MD5 and the random number
generated by MT). However, unless you can store the series of updates
to MD5, then getstate() followed by setstate() would not be sufficient
to get the same series of random numbers. This is also a "possible
issue" shared by the double-sha algorithm.


Without the update, all that's happening is you're reading an MT output
and hashing it, so there's no secret state without some security assumption
on MT. And the only security assumption I'm willing to make about MT
is that it's completely insecure ;-).
But doesn't support all the operations specified in the Random API.
What's up with that?


That is what the subclass was for:
> class MD5Random(Random):


But the methods in the superclass don't do anything to manage the state
of subclass instances.
Jul 18 '05 #9

>>>What's Random.random(self) supposed to do?

Generate a random number using the Mersenne Twister random number
generator.


According to the Random doc, Random.random() doesn't take an argument,
so I'm asking what the "self" is for.

But doesn't support all the operations specified in the Random API.
What's up with that?


That is what the subclass was for:
> class MD5Random(Random):
But the methods in the superclass don't do anything to manage the state
of subclass instances.


Perhaps you should spend some more time learning how Python classes work.

Random is a class.
Random.random is an unbound method.
Random.random(self) calls the unbound method 'Random.random' with
the first argument being the MD5Random instance.

You should satisfy yourself that the below is correct behavior for
Python classes.
>>> class A:
...     def blah(self):
...         print self.__class__
...
>>> class B(A):
...     def blah(self):
...         print "I'm in B, calling A"
...         A.blah(self)
...
>>> a = A()
>>> a.blah()
__main__.A
>>> b = B()
>>> b.blah()
I'm in B, calling A
__main__.B

Can you figure out why the following is also correct?

>>> class C:
...     def blah(self):
...         print "I'm in C, calling A"
...         A.blah(self)
...
>>> c = C()
>>> c.blah()
I'm in C, calling A
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 4, in blah
TypeError: unbound method blah() must be called with A instance as first
argument (got C instance instead)
Perhaps now you understand why he can use Random.random(self) in the
earlier version, and Random.getrandbits(self, 128) in the current version.

- Josiah
Jul 18 '05 #10

> Why md5 instead of sha?

You could go either way. Make the decision based on how many bits you
need (128 should be plenty) for generating a single 53-bit float and
then take into account speed considerations (MD5 vs SHA-1) and the
time to generate the bits going into the message digest (it takes
longer to generate 160 bits than 128).

What's Random.random(self) supposed to do?
It's called subclassing ;-)

In this case, it inherits all the other random functionality (being
sure to include my corrected method name from newrandom() to
random()).

ciphertxt = md5.new(plaintxt).digest()


I think you mean update.


Nope, I meant digest(), but it could be done with update as well.
But doesn't support all the operations specified in the Random API.
What's up with that?
Sure it does. That is what the subclassing is for. Again be sure to
incorporate the two submitted revisions resulting in:
from random import Random
from struct import unpack
import md5

class MD5Random(Random):
    def random(self, tofloat = 1.0 / 2 ** 53):
        plaintxt = str(Random.getrandbits(self, 128))
        ciphertxt = md5.new(plaintxt).digest()
        randint = unpack('<Q', ciphertxt[:8])[0] >> 11   # grab 53 bits
        return randint * tofloat
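For readers on a modern interpreter, here is a hedged Python 3 rendering of the class above; `hashlib.md5` stands in for the long-removed `md5` module, the MT bits are explicitly encoded to bytes before hashing, and the variable names are carried over from the post.

```python
import hashlib
import struct
from random import Random

class MD5Random(Random):
    # Hash 128 Mersenne Twister bits with MD5 and reveal 53 digest
    # bits as a float in [0, 1), concealing the raw MT output.
    def random(self):
        plaintxt = str(Random.getrandbits(self, 128)).encode()
        ciphertxt = hashlib.md5(plaintxt).digest()
        randint = struct.unpack('<Q', ciphertxt[:8])[0] >> 11  # 53 bits
        return randint / 2.0 ** 53

rng = MD5Random(42)            # seeding goes through Random as usual
values = [rng.random() for _ in range(5)]
```

Because only `random()` is overridden, seeding, `getstate()`, and `setstate()` are inherited from `Random` unchanged, which is the subclassing advantage being claimed here.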

* Takes advantage of MT's proven equidistribution, of its passing
known tests for randomness, and of its known period (in contrast,
SHA-1 was designed as a digest that makes it computationally
intractable to find a different message giving the same signature --
that does *not* imply the non-existence of short cycles for some
keys).


I'm missing something here--if you're just trying to avoid possible
cycles, why do you want to use MT instead of a sequential counter?


Either my version or your original requires something that generates a
sequence of randoms from a seed (mine does it with MT and yours with
s0=sha1(s0)). The advantages of MT are proven properties, and it can
more efficiently generate an arbitrary number of bits (thanks to
getrandbits()) without the messy pure Python code for extracting a
byte at a time. It was also nice to not have to override jumpahead(),
getstate(), setstate(), etc.

For the base generator, MT or s0-sha1(s0), no cryptography is
necessary; you're getting that from the second application of sha1 or
md5.

If you use sha1 for the base generator, you gain cryptographic
strength to the left which is useless for your application (lottery
numbers are selected in a way that is strong to the left and right,
but no one cares about inferring last weeks numbers from this weeks,
you make money only if you can predict the following week). It is
doubly useless, because the second application of SHA1 gives you what
you need.

* It uses both MD5 and the Mersenne Twister as they were designed (in
contrast, my google searches show *no* research on using SHA-1 in OFB
and reapplying SHA-1 again to conceal the output).


Applying a collision resistant hash function to itself is a standard
construction for making a PRF and is the basis of HMAC.


Re-reading your sentence carefully, does it say that SHA-1 is
collision resistant? The design goal for SHA-1, of necessity,
produces collisions for messages over 160 bits in length and AFAICT
there are no citations showing collision resistance at 160 bits.
I believe
the relevant paper is

Bellare, M., Canetti, R., and Krawczyk, H., "Pseudorandom Functions
Revisited: The Cascade Construction". Proc. of the 37th IEEE Symp. on
Foundation of Computer Science, 1996, webbed (PostScript) at:

http://www.research.ibm.com/security/bck1.ps
Good paper, but read it carefully. Basically what it starts out
saying is that not all trapdoor functions make good PRFs, and then it
sets about how to construct functions that are both trapdoor and
collision resistant (with positive implications for the minimum
period).

My guess is the reason you haven't found any publications specific to
SHA1 with this construction is that nobody has found any results
against SHA1. As far as anyone can tell, the output is
indistinguishable from random.


I wish you would abandon this line of reasoning. In security
applications, proofs rule and the absence of published flaws means
nothing (consider the case of Enigma, where the designers relied on
"nobody has found any results against" their cipher).

In this particular case, finding a short cycle would not really be
a major flaw; the function still meets its design goal of producing a
message digest for a given message such that it is difficult to
produce another message with the same digest. A few known 160-bit
collisions would not be an issue except for those exact 160 bit
messages. IOW, even if some collisions are known, don't expect to
find a paper on it.

All of that is moot, MT is fast, subclasses nicely, is threadsafe, has
proven distribution properties, and has a known (large) period. Why
not digest it and produce your shielded random numbers? Or is it just
a matter of being glued to your originally published code (which would
be correct if collision resistance is known)?
Raymond Hettinger
Jul 18 '05 #11

py****@rcn.com (Raymond Hettinger) wrote:
from random import Random
from struct import unpack
import md5

class MD5Random(Random):
    def random(self, tofloat = 1.0 / 2 ** 53):
        plaintxt = str(Random.getrandbits(self, 128))
        ciphertxt = md5.new(plaintxt).digest()
        randint = unpack('<Q', ciphertxt[:8])[0] >> 11   # grab 53 bits
        return randint * tofloat

I wonder why it is necessary to do the bit twiddling once one has
chosen to multiply by a fraction instead of directly putting the bits
inside a float. If 1.0 / 2**53 is just some arbitrary value (and why
not: a float's smallest value is a lot smaller than that?) then we can
generate integers of other sizes and divide them by other floats, for
example:

from random import Random
import md5

class MD5Random(Random):

    def random(self, tofloat = 1.0/2**128, prf = md5.new()):
        prf.update(repr(Random.random(self)))
        return int(prf.hexdigest(), 16) * tofloat

def test():
    random = MD5Random().random
    for i in range(10):
        print random()

if __name__ == '__main__':
    test()

Anton
Jul 18 '05 #12

py****@rcn.com (Raymond Hettinger) writes:
Why md5 instead of sha?
You could go either way. Make the decision based on how many bits you
need (128 should be plenty) for generating a single 53-bit float and
then take into account speed considerations (MD5 vs SHA-1) and the
time to generate the bits going into the message digest (it takes
longer to generate 160 bits than 128).


I think it's best to stick with SHA, because of its longer hash,
because it's a US federal standard, and because MD5 is already known
to not meet its security goals (see Dobbertin's paper on finding
collisions in MD5's compression function). Python has already screwed
up choosing its PRNG twice out of speed considerations, first with
WHrandom, and when that wasn't random enough, then with MT, which
still isn't random enough. If we're selecting a PRNG for a third
time, let's stop cutting corners and take care of the problem once and
for all. Even the very non-optimized code that I posted runs at
reasonable speed.
What's Random.random(self) supposed to do?


It's called subclassing ;-)


OK, I misunderstood at first what was going on.
But doesn't support all the operations specified in the Random API.
What's up with that?


Sure it does. That is what the subclassing is for. Again be sure to
incorporate the two submitted revisions resulting in:


Yes, I understand now. You're simply taking MT output and hashing it.
I'm missing something here--if you're just trying to avoid possible
cycles, why do you want to use MT instead of a sequential counter?


Either my version or your original requires something that generates a
sequence of randoms from a seed (mine does it with MT and yours with
s0=sha1(s0)). The advantages of MT are proven properties, and it can
more efficiently generate an arbitrary number of bits (thanks to
getrandbits()) without the messy pure Python code for extracting a
byte at a time. It was also nice to not have to override jumpahead(),
getstate(), setstate(), etc.


But MT has no security properties at all. It's equivalent to just
generating the whole output stream by hashing a sequential counter
with a secret seed.
If you use sha1 for the base generator, you gain cryptographic
strength to the left which is useless for your application (lottery
numbers are selected in a way that is strong to the left and right,
but no one cares about inferring last weeks numbers from this weeks,
No, I wanted cryptographic strength to both the left and right,
otherwise I would have just hashed a counter. I do care about
inferring last week's numbers from this week's. If you're playing
poker with someone, you not only don't want him to see your cards
unless he bets against you, you also don't want him to see the cards
you were holding in the last hand when he folded.

See Bruce Schneier's paper on Yarrow for desirable PRNG criteria:

http://www.schneier.com/paper-yarrow.html
Applying a collision resistant hash function to itself is a standard
construction for making a PRF and is the basis of HMAC.


Re-reading your sentence carefully, does it say that SHA-1 is
collision resistant? The design goal for SHA-1, of necessity,
produces collisions for messages over 160 bits in length and AFAICT
there are no citations showing collision resistance at 160 bits.


Collision resistance means there's no practical way to FIND
collisions. Obviously it doesn't mean that no collision exists. It's
just like in cryptography, you don't want the adversary to be able to
find the secret key. That doesn't mean there is no key.
My guess is the reason you haven't found any publications specific to
SHA1 with this construction is that nobody has found any results
against SHA1. As far as anyone can tell, the output is
indistinguishable from random.


I wish you would abandon this line of reasoning. In security
applications, proofs rule and the absence of published flaws means
nothing (consider the case of Enigma, where the designers relied on
"nobody has found any results against" their cipher).


In that case, where's your proof that MT feeding SHA is secure? I
haven't even seen any assertion made that MT even preserves all the
entropy in the seed that you supply. The only thing I've heard about
MT is that it passes some statistical tests and is guaranteed to have
a long period because of its large internal state. For all I know, it
takes the seed the you give it and crunches it down to 32 bits before
initializing that state. Maybe this question can be answered with a
detailed analysis of MT, but it's simpler if we can avoid the need for
such analysis.
In this particular case, finding a short cycle would not really be
a major flaw; the function still meets its design goal of producing a
message digest for a given message such that it is difficult to
produce another message with the same digest. A few known 160-bit
collisions would not be an issue except for those exact 160 bit
messages. IOW, even if some collisions are known, don't expect to
find a paper on it.
What do you mean by that? Finding a collision in SHA would be a major
cryptanalysis result, like the result against MD5. Even a
distinguisher (i.e. given N gigabytes of output, a way to tell, with
probability greater than 50%, whether it's SHA output or true random
output) would be a publishable result (there are now several such
results published against RC4 and so RC4 is now deprecated for new
applications).

Or do you mean that if a collision were found by a spook agency, then
we wouldn't hear about it? That will be true of any function we come
up with. At least in SHA's case, if the NSA finds some problem,
they'll probably advise us to stop using SHA (without telling us the
reason), like they did with SHA-0. They sure won't tell us that for
some home-cooked function.

Anyway, the current version of the double SHA generator (which
incorporates a sequential counter into the input of the first hash) is
guaranteed to have no short cycles, because of the sequential counter.

Note that the MT-MD5 generator has a distinguisher after about 2**64
calls, because the MD5 call can see birthday collisions on both its
input and its output rather than only on its output. So it will
repeat outputs slightly more often than expected.
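The birthday effect behind this distinguisher can be made visible at small scale. This is a hedged illustration, not a statement about the real 53-bit/128-bit parameters: the "inner" generator is shrunk to a 10-bit space so that its input collisions, and the extra repeated outputs they cause, show up within about a thousand samples.

```python
import hashlib

def truncated_md5(data, bits):
    """Hash into a small output space so collisions appear quickly."""
    return int.from_bytes(hashlib.md5(data).digest()[:4], 'big') % (2 ** bits)

n = 1024
# Counter-fed hash: the inputs are all distinct, so repeated outputs
# can only come from collisions in the hash output itself.
counter_fed = [truncated_md5(str(i).encode(), 16) for i in range(n)]

# Inner-generator-fed hash: the inner values live in a tiny 10-bit
# space and collide among themselves, so the outer hash repeats its
# outputs noticeably more often.
inner = [truncated_md5(str(i).encode(), 10) for i in range(n)]
inner_fed = [truncated_md5(str(x).encode(), 16) for x in inner]

repeats_counter = n - len(set(counter_fed))
repeats_inner = n - len(set(inner_fed))
```

Scaled back up, the same reasoning says a hash fed by a finite-output inner generator sees birthday collisions on its input as well as its output, which is the distinguisher claimed above.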
All of that is moot, MT is fast, subclasses nicely, is threadsafe, has
proven distribution properties, and has a known (large) period. Why
not digest it and produce your shielded random numbers? Or is it just
a matter of being glued to your originally published code (which would
be correct if collision resistance is known)?


I have no attachment to that code, which was provided as a sample
implementation--a real version should be much better optimized, and
maybe even written in C. But I have to ask you the same question: are
you glued to MT for some reason? I'm in favor of staying independent
of MT, for these reasons:

1) I'd rather use primitives that were designed from the beginning as
security primitives; and also to use as few primitives as
possible, so that there's fewer parts to go wrong.

2) The double SHA construction is surely easier to analyze for
security, at least if we assume SHA meets its design criteria. MT
doesn't have any such criteria for us to assume.

3) The PRNG's goal is to generate a reproducible sequence, and we
should take that to mean that if someone wants to port an
application to some other system, they should be able to generate
the same sequence. It's one thing to depend on SHA, which is a
standard and available in lots of environments, but if MT is
needed then its implementation needs to be ported, and its code is
a mess. I don't know if even Jython supports MT, much less
whether any non-Python systems support it. It's best to keep the
generator simple and free-standing, that maybe can even be used as
a standard.

Basically making the PRNG depend on a hairy and obscure function like
MT for no reason other than a little implementation convenience
strikes me as "worse is better" in a rather pure form. If we decide
we don't care about "leftward" security, it's simplest to just feed
SHA with a sequential counter.
Jul 18 '05 #13

an***@vredegoor.doge.nl (Anton Vredegoor) writes:
I wonder why it is necessary to do the bit twiddling once one has
chosen to multiply by a fraction instead of directly putting the bits
inside a float. If 1.0 / 2**53 is just some arbitrary value (and why
not: a float's smallest value is a lot smaller than that?) then we can
generate integers of other sizes and divide them by other floats, for
example:


I think life gets very complicated once you start trying to generate
denormalized floats. They aren't uniformly distributed on the unit
interval, so if you make random numbers by dividing uniformly
distributed integers by some constant, you'll get way too many
denorms. If your application is using so many random numbers that it
has much chance of ever seeing one below 2**-53, maybe you need to use
extended precision or something. Note this whole thing gets into the
precise properties of machine floats, and Python in principle is not
supposed to be IEEE-dependent. (The magic number 2**-53 is an IEEE
artifact).
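The 53-bit choice being discussed can be checked directly; a small sketch of the IEEE-754 facts involved (assuming standard double-precision floats, which CPython uses):

```python
# Why the magic number is 2**53: an IEEE-754 double carries a 53-bit
# significand, so every integer k in [0, 2**53) maps to a distinct,
# exactly representable, evenly spaced float k / 2**53.
assert float(2 ** 53 - 1) != float(2 ** 53)   # 53 bits: still distinct
assert float(2 ** 53 + 1) == float(2 ** 53)   # a 54th bit is rounded away

# Dividing a much wider integer by 2**128 instead forces rounding, so
# many distinct integers collapse onto the same float value.
assert (2 ** 127 + 1) / 2.0 ** 128 == (2 ** 127 + 2) / 2.0 ** 128
```

So multiplying a 53-bit integer by 1.0 / 2**53 gives uniformly spaced, exactly representable outputs, whereas wider integers would be silently rounded before the division.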
Jul 18 '05 #14

[Paul Rubin]
I think it's best to stick with SHA, because of its longer hash,
because it's a US federal standard, and because MD5 is already known
to not meet its security goals (see Dobbertin's paper on finding
collisions in MD5's compression function).
Okay, that is reasonable.
Python has already screwed
up choosing its PRNG twice out of speed considerations, first with
WHrandom, and when that wasn't random enough, then with MT, which
still isn't random enough.
The cause won't be advanced by becoming a zealot or by insulting
people volunteering time to review your patch. The three principal
maintainers of the module have been me, Tim, and Guido -- bashing all
three in one sentence is an unusual strategy ;-)

Also, the criticism of MT is baseless.
If we're selecting a PRNG for a third
time, let's stop cutting corners and take care of the problem once and
for all.
No corners were cut. MT is a state-of-the-art pseudorandom number
generator. The problem is that you want encryption.

Haskell, Perl, Ruby, and SML also have not gone down your suggested
path. A post to their newsgroups telling them that they are screwing
up and cutting corners likely will not be well received. Of course,
that could just be part of the unusual strategy ;-)
But MT has no security properties at all. It's equivalent to just
generating the whole output stream by hashing a sequential counter
with a secret seed.
Right. All of the security comes from the hashing step.
No, I wanted cryptographic strength to both the left and right,
otherwise I would have just hashed a counter. I do care about
inferring last week's numbers from this week's. If you're playing
poker with someone, you not only don't want him to see your cards
unless he bets against you, you also don't want him to see the cards
you were holding in the last hand when he folded.


That makes sense. Still, the hashing step protects you. Given a
series of 10,000 bits from the generator, do you have any means of
deducing either a) the state of MT, b) what came before, or c) what
comes after?
I'm keeping an open mind but will abandon this patch unless the tone
improves.

Raymond
Jul 18 '05 #15

py****@rcn.com (Raymond Hettinger) writes:
If we're selecting a PRNG for a third
time, let's stop cutting corners and take care of the problem once and
for all.
No corners were cut. MT is a state-of-the-art pseudorandom number
generator. The problem is that you want encryption.


This sha prng module was motivated partly by your own remark in sf bug
#901285 that MT output may not be sufficiently correlation-free if you
take pairs of 16-bit integers from it, in a context having nothing to
do with encryption. Anyway, despite how it may sound, I don't have
any desire to use the random module for encryption (I'd use different
API's for that). I just want to be able to write applications that
use random numbers without having any doubts about whether the random
numbers are good enough. Teukolsky et al in "Numerical Recipes",
which is not a cryptography book, explain that a good way to do that
is to use cryptographic primitives, so I'm proposing a module that
does it that way.

The application in numerics could be something like: I run some
half-hour simulation using MT, and get results that subjectively seem
a bit peculiar. So I run it again using the crypto-based generator
and maybe that takes three hours, but if I still see similarly
peculiar-looking results, I now can be pretty confident that it's not
some consequence of the RNG internals. It subjectively seems to me
that the Gnome version of Freecell is easier to play than the Windows
version (i.e. it deals easier hands), so that's a semi-real-world
example of this situation. (In this instance I suspect Freecell is
using some pathetic linear PRNG or something like that, but I haven't
yet bothered to look at the code).

There's also some applications (like online games) that need secure
random numbers, as mentioned in 917055. The people implementing these
things usually aren't even thinking about cryptography.
No, I wanted cryptographic strength to both the left and right,


That makes sense. Still, the hashing step protects you. Given a
series of 10,000 bits from the generator, do you have any means of
deducing either a) the state of MT, b) what came before, or c) what
comes after?


Cryptographic strength to the left means I shouldn't be able to
recover what came before, even given the current output of getstate()
on the generator (see the Yarrow paper for some rationale for wanting
this protection). I haven't seen any claim that MT was designed to
resist that.

I'm not adamant that this is a crucial property of a deterministic
PRNG; I implemented it because I saw it as straightforward to do and
figured it might as well be included, so that we can say that we did
the most thorough possible job, followed the Yarrow paper's
prescriptions, etc. If there's a deliberate decision to abandon the
property, that can make the generator a bit simpler and faster, and
I'm willing to code one that way for purposes of comparison. I just
don't think such decisions should be driven by implementation
accidents. (They should also be documented).
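For readers following along, here is a minimal sketch of the "cryptographic strength to the left" property being discussed. This is not Paul's actual sharandom code; SHA-256 from hashlib is used purely for illustration, and the class and prefix names are made up. Each output block is derived from the state by hashing, and the state is then replaced by a one-way hash of itself, so a later getstate() cannot be run backwards to recover earlier outputs:

```python
import hashlib

class ForwardSecureGen:
    """Toy forward-secure generator sketch: outputs are hashes of the
    state, and the state is advanced by a one-way hash after every
    output, so earlier outputs cannot be recovered from a later state."""

    def __init__(self, seed):
        self.state = hashlib.sha256(b'seed:' + seed).digest()

    def getrandbytes(self, n):
        out = b''
        while len(out) < n:
            # the output block is derived from the state but, assuming
            # SHA-256 is preimage-resistant, does not reveal it
            out += hashlib.sha256(b'out:' + self.state).digest()
            # one-way update: the previous state is unrecoverable
            self.state = hashlib.sha256(b'next:' + self.state).digest()
        return out[:n]

    def getstate(self):
        # exposes only the *current* state; past outputs stay protected
        return self.state
```

The distinct prefixes keep the output stream and the state-update stream domain-separated, in the spirit of the Yarrow paper's prescriptions.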
I'm keeping an open mind but will abandon this patch unless the tone
improves.


I don't mean to bash anyone and I'm sorry if it came across that way.
As I see it, the Random module provides an extensible API so that it's
easy to add different kinds of generators that the user can select
from depending on his or her needs. It then supplies the WH generator
(I guess for backwards compatibility) and the MT generator (for
reasonably good output and very high performance). The philosophy of
this sha module (as I see it) is to supply a third alternative in that
same flexible spirit: one that goes all-out to make the best possible
output, even if that results in only reasonable performance, while
also having increased portability through use of a standard underlying
primitive. Do you also see that as an appropriate goal? If not, we
should try to identify exactly what we're trying to do instead.
Jul 18 '05 #16

P: n/a
On 21 Mar 2004 15:16:02 -0800, Paul Rubin
<http://ph****@NOSPAM.invalid> wrote:
py****@rcn.com (Raymond Hettinger) writes:

[ snip discussion on Pseudo Random Number Generators (PRNG) and
Mersenne Twister (MT) ]
No corners were cut. MT is a state-of-the-art pseudo random number
generator. The problem is that you want encryption.


This sha prng module was motivated partly by your own remark in sf bug
#901285 that MT output may not be sufficiently correlation-free if you
take pairs of 16-bit integers from it, in a context having nothing to
do with encryption. Anyway, despite how it may sound, I don't have
any desire to use the random module for encryption (I'd use different
API's for that). I just want to be able to write applications that
use random numbers without having any doubts about whether the random
numbers are good enough. Teukolsky et al in "Numerical Recipes",
which is not a cryptography book, explain that a good way to do that
is to use cryptographic primitives, so I'm proposing a module that
does it that way.

The application in numerics could be something like: I run some
half-hour simulation using MT, and get results that subjectively seem
a bit peculiar. So I run it again using the crypto-based generator
and maybe that takes three hours, but if I still see similarly
peculiar-looking results, I now can be pretty confident that it's not
some consequence of the RNG internals. It subjectively seems to me
that the Gnome version of Freecell is easier to play than the Windows
version (i.e. it deals easier hands), so that's a semi-real-world
example of this situation. (In this instance I suspect Freecell is
using some pathetic linear PRNG or something like that, but I haven't
yet bothered to look at the code).


This reminds me of a similar discussion I had with a bunch of PhD
mathematicians who wanted to sell my company a better statistics tool.
Their claim was that the tool we were using, which had some simple
PRNG, could give us bad results because of the correlations, short
cycles, etc. that were known to mathematicians. This turned into a
heated debate, involving our R&D Director and other company
management, with me ( a dumb engineer who only understands simple
statistics ) vs a team of PhDs who could fill the a blackboard with
incomprehesible math proving I was wrong. We ended up keeping the old
tool *and* buying theirs, when we should have made a clean decision.

I'm still curious as to the answer to my very simple and basic
question: Are there any examples that show a *non-subjective*
difference in a physically meaningful result between one PRNG and
another? By "physically meaningful" I mean to include things like
means, standard deviations, correlations between two distributions,
but rule out some measure that would only mean something to a
mathematician. By "non-subjective" I mean a difference that can be
observed in a repeatable fashion, where the statistical significance
is not always at the edge when you repeat the experiment with more
trials.

The only answer I got from the earlier discussion was a reference to
some paper on a German website, but when I looked, the website was
gone.

-- Dave

Jul 18 '05 #17

P: n/a
On Mon, 22 Mar 2004 13:41:46 -0500, "Tim Peters" <ti*****@comcast.net>
wrote:
[David MacQuigg]
...
I'm still curious as to the answer to my very simple and basic
question:


It is basic, but there's nothing simple about it.
Are there any examples that show a *non-subjective* difference in a
physically meaningful result between one PRNG and another?


Yes, but different RNGs have different weaknesses, and whether the
weaknesses of a specific RNG *matter* to you depend on detailed analysis of
the specific app you have in mind.

The primary weakness of linear congruential generators (still the most
common kind) is dramatically illustrated all over the web; e.g., look at the
3D plot at:

http://www.unf.edu/ccec/cis/CIShtml/RANDUinfo.html

If you need to generate "random" points in 3D space, does it matter that the
particular LCG plotted there systematically generates points that fall on
one of 15 planes, no matter how many points you generate? There's no answer
to that short of analyzing the specifics of what you're trying to accomplish
by generating those points. For example, if your app relies on the triples
generated being approximately equally distributed throughout a 3D cube, that
particular LCG is an obvious disaster for that particular application. But
if your app only cares that the projection of the points onto the x axis is
approximately equally distributed, probably no problem, despite that it's
obvious "by eyeball" that there's nothing even remotely "random" about their
distribution in 3D space.

You're not going to get an easy answer. Well, here's one: the Mersenne
Twister in Python 2.3 is probably the best and most widely tested fast PRNG
in existence now, and has provably good equidistribution properties in high
dimensions (a plot like the one above would have no visible patterns if the
Twister had been used instead).


This is the first good answer I have heard, and it has changed my
thinking on this question. I understand now that the problem is not
just something that might be found in the tenth decimal place. The
clustering of random points into very thin planes could radically
affect a simulation depending on the distances between those points.

The problem I am working on (statistical simulation of electronic
circuits) involves a large number of 1D distributions, but with
carefully controlled correlations between these distributions. We are
probably just lucky that the PRNG we are using has not caused a
noticeable problem, but seeing just one clear example like the above
has made me decide to be much more cautious in selecting a PRNG.

Thanks for the enlightenment.

-- Dave
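The 15-planes failure Tim describes is easy to reproduce. RANDU is x' = 65539 * x mod 2**31, and since 65539 = 2**16 + 3, squaring the multiplier gives x[n+2] == 6*x[n+1] - 9*x[n] (mod 2**31), so every consecutive triple satisfies a single linear relation:

```python
def randu(seed, n):
    # RANDU: x' = 65539 * x mod 2**31 (the LCG in the plot above)
    x = seed
    out = []
    for _ in range(n):
        x = (65539 * x) % 2 ** 31
        out.append(x)
    return out

xs = randu(1, 3000)
# every consecutive triple lies on one of a handful of planes:
# x[n+2] - 6*x[n+1] + 9*x[n] == 0 (mod 2**31)
for a, b, c in zip(xs, xs[1:], xs[2:]):
    assert (c - 6 * b + 9 * a) % 2 ** 31 == 0
```

Reducing that relation mod 2**31 collapses the scaled triples onto at most 15 parallel planes in the unit cube, which is exactly the picture on the page linked above.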

Jul 18 '05 #18

P: n/a
Paul Rubin <http://ph****@NOSPAM.invalid> wrote in message news:<7x************@ruckus.brouhaha.com>...
http://www.nightsong.com/phr/python/sharandom.c

This is intended to be less predicable/have fewer correlations than
the default Mersenne Twister or Wichmann-Hill generators. Comments
are appreciated.


Slightly OT, but...

What about a subclass of Random that provides an interface to
platform-specific secure RNGs:
- /dev/urandom on some unixes
- CryptGenRandom on windows
- SecureRandom on Java

I think you've suggested this yourself, a few times. Is any work
being done in this direction?
Trevor
Jul 18 '05 #19

P: n/a
tr***@trevp.net (Trevor Perrin) writes:
Slightly OT, but...

What about a subclass of Random that provides an interface to
platform-specific secure RNGs:
- /dev/urandom on some unixes
- CryptGenRandom on windows
- SecureRandom on Java

I think you've suggested this yourself, a few times. Is any work
being done in this direction?


I think it's extremely desirable and important to add such an
interface, and maybe it should go in the random module, but it
shouldn't be a subclass of Random. Random is intended to be a
deterministic generator.

I have a pretty dim view of the built-in SecureRandom that comes with
Java (the one that works by counting thread switches per second) but I
guess it can also be subclassed to use better sources of entropy.
Jul 18 '05 #20

P: n/a
Paul Rubin wrote:
tr***@trevp.net (Trevor Perrin) writes:

What about a subclass of Random that provides an interface to
platform-specific secure RNGs:
[...]
I think it's extremely desirable and important to add such an
interface, and maybe it should go in the random module, but it
shouldn't be a subclass of Random. Random is intended to be a
deterministic generator.

I see that seed(), getstate() and setstate() imply determinism.

Aside from those, the nice thing about subclassing Random is you'd only
have to implement random() and getrandbits() and you'd get randrange(),
randint(), choice(), shuffle(), and sample() for free. So it might be
worth it, even if you have to raise NotImplementedError for a few methods.

Actually, Random could be made even easier to subclass. getrandbits
(which is new in 2.4) returns a long. It would be better for us if
there was a getrandstring() function that an underlying generator could
implement. This would also be a helpful addition to the API, at least
for cryptographic uses.
Trevor
http://www.python.org/dev/doc/devel/...le-random.html
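A sketch of what such a subclass might look like (the class name is made up, and it assumes an os.urandom-style entropy source, which was only being proposed for the stdlib around this time): only random() and getrandbits() are supplied, the derived methods come free from the base class, and the deterministic-only methods raise NotImplementedError as suggested above.

```python
import os
import random

class SystemRandomSketch(random.Random):
    """Nondeterministic Random subclass backed by the OS entropy pool.
    randrange(), randint(), choice(), shuffle(), sample() all come
    free from the base class."""

    def getrandbits(self, k):
        numbytes = (k + 7) // 8
        x = int.from_bytes(os.urandom(numbytes), 'big')
        return x >> (numbytes * 8 - k)  # trim the excess bits

    def random(self):
        # 53 random bits scaled into [0.0, 1.0)
        return self.getrandbits(53) * 2 ** -53

    def seed(self, *args, **kwds):
        pass  # entropy comes from the OS; nothing to seed

    def getstate(self):
        raise NotImplementedError('system entropy source has no state')

    def setstate(self, state):
        raise NotImplementedError('system entropy source has no state')
```

This is, in effect, the design that later shipped in the stdlib as random.SystemRandom.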
Jul 18 '05 #21

P: n/a
Trevor Perrin <tr********@trevp.net> writes:
Actually, Random could be made even easier to subclass. getrandbits
(which is new in 2.4) returns a long. It would be better for us if
there was a getrandstring() function that an underlying generator
could implement. This would also be a helpful addition to the API, at
least for cryptographic uses.


I agree with this, and was going to propose the same thing.
getrandbits can be defined easily in terms of getrandstring, and the
current sample sharandom implementation at

http://www.nightsong.com/phr/python/sharandom.py

does something along those lines.
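For reference, the derivation is only a few lines. getrandstring here is the proposed byte-string primitive, stubbed out for illustration; the helper name is made up:

```python
import binascii

def getrandbits_from_string(getrandstring, k):
    # build a k-bit integer from the proposed byte-string primitive
    numbytes = (k + 7) // 8
    n = int(binascii.hexlify(getrandstring(numbytes)), 16)
    return n >> (numbytes * 8 - k)  # drop the excess low-order bits

# stub standing in for a real generator's getrandstring():
allones = lambda numbytes: b'\xff' * numbytes
assert getrandbits_from_string(allones, 10) == 2 ** 10 - 1
```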
Jul 18 '05 #22

P: n/a
David MacQuigg <dm*@gain.com> writes:
I'm still curious as to the answer to my very simple and basic
question: Are there any examples that show a *non-subjective*
difference in a physically meaningful result between one PRNG and
another? By "physically meaningful" I mean to include things like
means, standard deviations, correlations between two distributions,
but rule out some measure that would only mean something to a
mathematician. By "non-subjective" I mean a difference that can be
observed in a repeatable fashion, where the statistical significance
is not always at the edge when you repeat the experiment with more
trials.


Tim gave a good answer for stuff like physics simulations. Another
answer might be in terms of something like a computer game: you're
playing poker online with the cards dealt by a server programmed in
Python using some PRNG. Both players trust that the server isn't
cheating, and both have a complete copy of the server source code and
all the data that was on it at the start of the game except the PRNG's
initial seed, and are permitted to use that info in any way they can.
Since you're playing for a US$ 250,000 jackpot (sites like
partypoker.com routinely run online tournaments with prizes that
size), you must assume that your opponent has a team of mathematicians
analyzing the PRNG algorithm and all its previous observed output, in
order to get some probabilistic edge in guessing what your cards are.
If they fail to get that probabilistic edge, your chances are even,
but if they succeed, they take you to the cleaners. I would call that
a non-subjective difference ;-).

Another simulation example: you're trying to program a computer to
play poker or backgammon, using neural nets, genetic algorithms, or
whatever the AI buzzword of the week is. You train the AI by having
it play a zillion simulated games against itself and eventually it's
playing very well. Has it really tuned itself for the game, or has it
somehow modelled the PRNG and figured out how to guess the next card?
Simulating systems with this kind of emergent behavior can make
heavier demands on PRNG's than simulating random gas molecules.

The point is, a pure statistical analysis of the swirling particles in
the universe may predict the existence of stars, galaxies, planets,
and maybe even living organisms, but normally won't treat emergent
characteristics that lead to things like computers and trombones. If
you're trying to build a laser cannon to protect your house from
meteors, it's one thing to calculate the number of stray asteroids
whose trajectories might randomly intersect your house and build your
laser accordingly, but a really good defense system has to allow that
someone may be aiming stuff at you deliberately. Similarly, a really
good PRNG has to resist deliberate attempts to predict its output or
even to notice that the output isn't perfectly random.

The modern definition of a good PRNG basically says there should be no
polynomial-time algorithm for distinguishing the PRNG from a true
random source with probability better than chance. There's a body of
theoretical CS work that's been done on this subject. See the
pseudorandomness chapter of Goldreich's "Foundations of Cryptography" at

ftp://theory.lcs.mit.edu/pub/people/...Frag/part3N.ps

if you want to see the details.

Of course nobody has ever been able to demonstrate the existence of a
good PRNG, since that runs smack into the P vs NP problem. However,
there are solid results showing how to build PRNG's from secure
primitives (e.g. SHA) if we're willing to assume the security claims
for those primitives. That seems to be about what the current state
of the art is.
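As a concrete (if extreme) illustration of a polynomial-time distinguisher: for any LCG with a power-of-two modulus and odd multiplier and increment, the lowest output bit simply alternates, so a test that checks bit-0 alternation tells it apart from a true random source almost immediately. (The 69069 multiplier below is just a classic example constant.)

```python
def lcg_low_bits(seed, n, a=69069, c=1, m=2 ** 32):
    # collect the least significant bit of each LCG output
    x = seed
    bits = []
    for _ in range(n):
        x = (a * x + c) % m
        bits.append(x & 1)
    return bits

bits = lcg_low_bits(12345, 100)
# with a odd and c odd, x' mod 2 == (x + 1) mod 2: the low bit
# strictly alternates -- a trivial distinguisher
assert all(bits[i] != bits[i + 1] for i in range(len(bits) - 1))
```

A true random source produces a strictly alternating low bit over 100 samples with probability 2**-99, so this one check succeeds with overwhelming probability.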
Jul 18 '05 #23

P: n/a
On 22 Mar 2004 20:14:47 -0800, Paul Rubin
<http://ph****@NOSPAM.invalid> wrote:
David MacQuigg <dm*@gain.com> writes:
I'm still curious as to the answer to my very simple and basic
question: Are there any examples that show a *non-subjective*
difference in a physically meaningful result between one PRNG and
another? By "physically meaningful" I mean to include things like
means, standard deviations, correlations between two distributions,
but rule out some measure that would only mean something to a
mathematician. By "non-subjective" I mean a difference that can be
observed in a repeatable fashion, where the statistical significance
is not always at the edge when you repeat the experiment with more
trials.
Tim gave a good answer for stuff like physics simulations. Another
answer might be in terms of something like a computer game: you're
playing poker online with the cards dealt by a server programmed in
Python using some PRNG. Both players trust that the server isn't
cheating, and both have a complete copy of the server source code and
all the data that was on it at the start of the game except the PRNG's
initial seed, and are permitted to use that info in any way they can.
Since you're playing for a US$ 250,000 jackpot (sites like
partypoker.com routinely run online tournaments with prizes that
size), you must assume that your opponent has a team of mathematicians
analyzing the PRNG algorithm and all its previous observed output, in
order to get some probabilistic edge in guessing what your cards are.
If they fail to get that probabilistic edge, your chances are even,
but if they succeed, they take you to the cleaners. I would call that
a non-subjective difference ;-).

Another simulation example: you're trying to program a computer to
play poker or backgammon, using neural nets, genetic algorithms, or
whatever the AI buzzword of the week is. You train the AI by having
it play a zillion simulated games against itself and eventually it's
playing very well. Has it really tuned itself for the game, or has it
somehow modelled the PRNG and figured out how to guess the next card?
Simulating systems with this kind of emergent behavior can make
heavier demands on PRNG's than simulating random gas molecules.

The point is, a pure statistical analysis of the swirling particles in
the universe may predict the existence of stars, galaxies, planets,
and maybe even living organisms, but normally won't treat emergent
characteristics that lead to things like computers and trombones. If
you're trying to build a laser cannon to protect your house from
meteors, it's one thing to calculate the number of stray asteroids
whose trajectories might randomly intersect your house and build your
laser accordingly, but a really good defense system has to allow that
someone may be aiming stuff at you deliberately. Similarly, a really
good PRNG has to resist deliberate attempts to predict its output or
even to notice that the output isn't perfectly random.

The modern definition of a good PRNG basically says there should be no
polynomial-time algorithm for distinguishing the PRNG from a true
random source with probability better than chance. There's a body of
theoretical CS work that's been done on this subject. See the
pseudorandomness chapter of Goldreich's "Foundations of Cryptography" at

ftp://theory.lcs.mit.edu/pub/people/...Frag/part3N.ps

if you want to see the details.


These are all good examples of *cryptographic* weaknesses in simple
PRNGs, which is really a different problem than weaknesses which would
lead to an incorrect result in a physical or engineering simulation.
In my earlier debate with the statistical tools salesmen, I was aware
of these cryptographic weaknesses. I wasn't aware of the "clustering"
kind of phenomena, like the one Tim pointed out, that *could* have an
effect on our circuit simulations.
Of course nobody has ever been able to demonstrate the existence of a
good PRNG, since that runs smack into the P vs NP problem. However,
there are solid results showing how to build PRNG's from secure
primitives (e.g. SHA) if we're willing to assume the security claims
for those primitives. That seems to be about what the current state
of the art is.


Looks like the engineering problem should be simpler than the
cryptographic problem, which ultimately comes down to -- can you
generate a sequence so complex that no human mind or computational
process can ever 'decipher' it. I'm guessing now that if the
distributions appear 'smooth' in every direction, we are probably OK
in the engineering sense. I'm not going to rely on that guess,
however. For now, I'll just go with an algorithm like Mersenne
Twister, that lots of smart people have looked at for a long time and
found nothing that would impact an engineering simulation.

-- Dave

Jul 18 '05 #24

P: n/a
Paul,

I've been wanting to email you for a couple days, but I've been having
difficulty in finding your email address online anywhere, or even
derived from the "Sender" header (though I didn't try too hard other
than to strip out the -nospam portion). If you would email me directly,
I'll reply with what I've been wanting to say.

Thanks,
- Josiah
Jul 18 '05 #25

P: n/a
Josiah Carlson <jc******@nospam.uci.edu> writes:
I've been wanting to email you for a couple days, but I've been having
difficulty in finding your email address online anywhere, or even
derived from the "Sender" header (though I didn't try too hard other
than to strip out the -nospam portion). If you would email me
directly, I'll reply with what I've been wanting to say.


My From: header gives a url (http://phr.cx) where you can leave me a
message. It looks like it's having DNS problems though.
http://paulrubin.com also works.
Jul 18 '05 #26

P: n/a
[Trevor Perrin]
Actually, Random could be made even easier to subclass. getrandbits
(which is new in 2.4) returns a long. It would be better for us if
there was a getrandstring() function that an underlying generator
could implement. This would also be a helpful addition to the API, at
least for cryptographic uses.

[Paul Rubin] I agree with this, and was going to propose the same thing.
getrandbits can be defined easily in terms of getrandstring, and the
current sample sharandom implementation at

http://www.nightsong.com/phr/python/sharandom.py

does something along those lines.


If you're sure you need this, please put a feature request on SF. It
needs to have a use case other than feeding message digest functions --
that need is readily met with str(x) or the faster hex(x).

The new getrandbits() method was designed around the use case of
making randrange() work over very large intervals which was necessary
for creating large primes for RSA work. In addition, all of the
methods internal to the Random module work with numbers instead of
byte strings.

Also, getrandbits() was selected because it is straightforward to
implement for many different random number generators including those
that generate a bit at a time.
Raymond
Jul 18 '05 #27

P: n/a
Raymond Hettinger wrote:
[Trevor Perrin]
Actually, Random could be made even easier to subclass. getrandbits
(which is new in 2.4) returns a long. It would be better for us if
there was a getrandstring() function that an underlying generator
could implement. This would also be a helpful addition to the API, at
least for cryptographic uses.


[Paul Rubin]
I agree with this, and was going to propose the same thing.
getrandbits can be defined easily in terms of getrandstring, and the
current sample sharandom implementation at

http://www.nightsong.com/phr/python/sharandom.py

does something along those lines.

If you're sure you need this, please put a feature request on SF. It
needs to have a use case other than feeding message digest functions --
that need is readily met with str(x) or the faster hex(x).


One use case is cryptographic things like symmetric keys, IVs, nonces,
random padding, etc.
The new genrandbits() method was designed around the use case of
making randrange() work over very large intervals which was necessary
for creating large primes for RSA work. In addition, all of the
methods internal to the Random module work with numbers instead of
byte strings.


Even still, I think getrandstring() would be a more convenient primitive
than getrandbits():
- cryptographic generators (which you'll want for RSA) are more
naturally defined in terms of byte strings
- it's not hard to convert byte-strings to longs or floats, but it
would be better for this conversion to be done in the Random base class,
instead of forcing every subclass to do it
- getrandstring() is a useful addition to the API, whereas
getrandbits() is trivially done by calling randint(0, 2**k - 1).
Trevor
Jul 18 '05 #28

P: n/a
Paul Rubin wrote:
py****@rcn.com (Raymond Hettinger) writes:
I was thinking the same. How about something like:

bigNumber = long(bigNumberString, 256)

bigNumberString = bigNumber.tostring()


If this is something you want, file a feature request on SF.

I made a request like that a long while ago. Guido suggested that it
be added to the binascii module, if I remember right. See SF # 465045
but there may have been another one elsewhere as well.


yup, I'm following your footsteps again! :-).

It looks like people agreed it was a good idea, it just needed a patch.
I gave it a shot, we'll see what happens:

#923643
http://sourceforge.net/tracker/index...70&atid=305470

Trevor
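Until such methods land (the feature request above proposes them), the round trip can be done today with binascii; the helper names here are made up for the sketch:

```python
import binascii

def string_to_long(s):
    # big-endian byte string -> nonnegative integer
    return int(binascii.hexlify(s), 16)

def long_to_string(n, length):
    # inverse: integer -> big-endian byte string of fixed width
    return binascii.unhexlify('%0*x' % (2 * length, n))

assert string_to_long(b'\x01\x00') == 256
assert long_to_string(256, 2) == b'\x01\x00'
```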


Jul 18 '05 #29
