Does shuffle() produce a uniform result?

tooru honda

Hi,

I have read the source code of the built-in random module, random.py.
After also reading the Wikipedia article on the Knuth shuffle algorithm,
I wonder if the shuffle method implemented in random.py produces results
with modulo bias.

The reasoning is as follows: Because the method random() only produces
finitely many possible results, we get modulo bias when the number of
possible results is not divisible by the size of the shuffled list.

1. Does shuffle() produce a uniform result?

2. If not, is there a fast and uniform shuffle() available somewhere?

Thanks!

-tooru honda

Aug 24 '07 #1


Steve Holden

tooru honda wrote:

> Hi,
>
> I have read the source code of the built-in random module, random.py.
> After also reading Wiki article on Knuth Shuffle algorithm, I wonder if
> the shuffle method implemented in random.py produces results with modulo
> bias.
>
> The reasoning is as follows: Because the method random() only produces
> finitely many possible results, we get modulo bias when the number of
> possible results is not divisible by the size of the shuffled list.
>
> 1. Does shuffle() produce uniform result ?

Given the cycle length of the Mersenne twister algorithm that generates
the underlying random numbers, it would have to be a pretty long list to
see a significant bias, don't you think? Have you done any calculations
on the length of list you propose to use?
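
The calculation asked for here can be sketched in a few lines. This is a rough check, not part of the original posts: it assumes the Mersenne Twister period of 2**19937 - 1, and the helper name is made up for illustration. A shuffle driven by a generator can reach at most as many orderings as the generator has states in its cycle, so n! has to stay below the period for every ordering to be reachable at all.

```python
import math

MT_PERIOD = 2**19937 - 1  # period of the Mersenne Twister (MT19937)


def largest_fully_coverable_length(period=MT_PERIOD):
    """Return the largest n such that n! does not exceed the period."""
    n, fact = 1, 1
    # Invariant: fact == n!
    while fact * (n + 1) <= period:
        n += 1
        fact *= n
    return n


if __name__ == "__main__":
    print("longest list whose orderings fit in one period:",
          largest_fully_coverable_length())
    print("bits needed to index every ordering of 416 cards:",
          math.factorial(416).bit_length())
```

Under that assumption the cutoff is a list of length 2080, and the 416-card case mentioned later in the thread needs only about 3,000 bits, far below 19937. Fitting within the period is a necessary condition for uniformity, not a proof of it.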

> 2. If not, is there a fast and uniform shuffle() available somewhere ?

Frankly I don't think you need to worry.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden

Aug 24 '07 #2

Hrvoje Niksic

tooru honda <to*********@fast-mail.org> writes:

> I have read the source code of the built-in random module,
> random.py. After also reading Wiki article on Knuth Shuffle
> algorithm, I wonder if the shuffle method implemented in random.py
> produces results with modulo bias.

It doesn't have modulo bias because it doesn't use modulo to produce a
random index; it multiplies the floating point value with the desired
range. I'm not sure if that method produces any measurable bias.

Aug 24 '07 #3

Mark Dickinson

On Aug 24, 8:54 am, Hrvoje Niksic <hnik...@xemacs.org> wrote:

> tooru honda <tooru_ho...@fast-mail.org> writes:
>> I have read the source code of the built-in random module,
>> random.py. After also reading Wiki article on Knuth Shuffle
>> algorithm, I wonder if the shuffle method implemented in random.py
>> produces results with modulo bias.
>
> It doesn't have modulo bias because it doesn't use modulo to produce a
> random index; it multiplies the floating point value with the desired
> range. I'm not sure if that method produces any measurable bias.

It produces exactly the same level of bias as the 'modulo bias'
obtained by reducing a random integer in [0, 2**53). For example,
suppose we're trying to produce an integer x in the range 0 through 6
inclusive. If n is a random variable whose values are uniformly
distributed across range(0, 2**53) then:

x = n % 7

will produce 0, 1, 2 and 3 with probability (2**53//7+1)/2**53, and 4,
5 and 6 with probability (2**53//7)/2**53, while

x = floor((n/2**53)*7)

will produce 0, 1, 3 and 5 with probability (2**53//7+1)/2**53, and 2,
4 and 6 with probability (2**53//7)/2**53.

Either way, you'd have a very hard time detecting such a tiny bias.
At the other end of the scale, if you're trying to produce a value in
[0, 2**53-2] (for example) then it looks worse: with either method,
one of the values occurs exactly twice as often as all of the others.
But since there are now so many values, you'd again have problems
detecting any bias.
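
The two counts above can be checked without enumerating 2**53 values. The sketch below is an illustration only: it uses exact integer arithmetic, so it models the idealised floor((n/2**53)*7) and deliberately ignores any floating-point rounding, and the function names are invented for this example.

```python
def modulo_counts(bits, k):
    """Count how many n in range(2**bits) land on each value of n % k."""
    q, r = divmod(2 ** bits, k)
    # The first r residues each receive one extra value.
    return [q + 1 if x < r else q for x in range(k)]


def scaled_counts(bits, k):
    """Count how many n in range(2**bits) land on each value of
    (n * k) // 2**bits, computed exactly with integers."""
    total = 2 ** bits

    def ceil_div(a, b):
        return -(-a // b)

    # n maps to x exactly when x*total/k <= n < (x+1)*total/k.
    return [ceil_div((x + 1) * total, k) - ceil_div(x * total, k)
            for x in range(k)]


def favoured(counts):
    """Return the outputs that occur most often."""
    top = max(counts)
    return [x for x, c in enumerate(counts) if c == top]


if __name__ == "__main__":
    print("n % 7 favours:", favoured(modulo_counts(53, 7)))
    print("exact scaling favours:", favoured(scaled_counts(53, 7)))
```

In both cases four outputs receive 2**53//7 + 1 values and three receive 2**53//7, so the relative difference between any two outputs is on the order of 7/2**53.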

Steve Holden wrote:

> Frankly I don't think you need to worry.

What he said.

Mark

Aug 25 '07 #4

Paul Rubin

tooru honda <to*********@fast-mail.org> writes:

> The reasoning is as follows: Because the method random() only produces
> finitely many possible results, we get modulo bias when the number of
> possible results is not divisible by the size of the shuffled list.
>
> 1. Does shuffle() produce uniform result ?

The nonuniformity is too small to matter. But what is the
application? If you are doing something like implementing online
poker for real money, you shouldn't use the built-in RNG. It is not
designed for what we call adversarial indistinguishability from true
randomness. Instead, use the random byte stream available from
os.urandom() and derive your random numbers from that.
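
One low-effort way to follow this advice is random.SystemRandom, which the standard library documents as drawing its numbers from os.urandom(). The sketch below is only an illustration: the 8-deck shoe, the card labels and the function names are assumptions made for the example, not anything prescribed in the post.

```python
import random


def new_shoe(decks=8):
    """Build an ordered shoe of `decks` standard 52-card decks."""
    ranks = "A23456789TJQK"
    suits = "SHDC"
    return [rank + suit
            for _ in range(decks)
            for suit in suits
            for rank in ranks]


def shuffled_shoe(decks=8):
    """Return a shoe shuffled with the OS-provided random source."""
    shoe = new_shoe(decks)
    # SystemRandom ignores seeding and reads from os.urandom instead
    # of the Mersenne Twister state.
    random.SystemRandom().shuffle(shoe)
    return shoe


if __name__ == "__main__":
    print(shuffled_shoe()[:5])
```

The trade-off is that results are not reproducible from a seed, which matters for debugging a simulation but not for dealing cards.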

Aug 25 '07 #5

Mark Dickinson

On Aug 24, 9:30 pm, Mark Dickinson <dicki...@gmail.com> wrote:

> x = floor((n/2**53)*7)
>
> will produce 0, 1, 3 and 5 with probability (2**53//7+1)/2**53, and 2,
> 4 and 6 with probability (2**53//7)/2**53.

Oops---I lied; I forgot to take into account the rounding implicit in
the (n/2**53)*7 multiplication. A bit of experimentation shows that
it's 0, 2, 4 and 6 that occur more often, with 1, 3 and 5 less likely
by a minuscule amount (at least on an IEEE-754 system).
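
The experiment can be reproduced exactly rather than by sampling. The sketch below assumes IEEE-754 doubles with round-to-nearest-even, which is what CPython uses on mainstream platforms; the function names are made up for this example. Because int((n / 2**53) * 7) never decreases as n grows, a binary search finds the first n that reaches each output value, and the gaps between those boundaries are the counts.

```python
TOTAL = 2 ** 53


def first_n_reaching(x, total=TOTAL, k=7):
    """Smallest n in [0, total] with int((n / total) * k) >= x.

    Returns `total` itself when no n below total reaches x.
    """
    lo, hi = 0, total
    while lo < hi:
        mid = (lo + hi) // 2
        # Floating-point evaluation, including the rounding in "* k".
        if int((mid / total) * k) >= x:
            hi = mid
        else:
            lo = mid + 1
    return lo


def float_scaled_counts(total=TOTAL, k=7):
    """Count how many n in range(total) produce each output 0..k-1."""
    bounds = [first_n_reaching(x, total, k) for x in range(k + 1)]
    return [bounds[x + 1] - bounds[x] for x in range(k)]


if __name__ == "__main__":
    counts = float_scaled_counts()
    top = max(counts)
    print("favoured outputs:",
          [x for x, c in enumerate(counts) if c == top])
```

On such a system the favoured outputs come out as 0, 2, 4 and 6, each with one more value than 1, 3 and 5, in line with the correction above.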

Mark

Aug 25 '07 #6

tooru honda

Hi, First of all, my thanks to all of you who replied.

I am writing a gambling simulation to convince my friend that his
"winning strategy" doesn't work. I use the shuffle method of a
random.SystemRandom instance to shuffle 8 decks of cards.

As the number of cards is quite small (416), the nonuniformity doesn't
matter, as most of you have already said. Just to avoid arguments from
my friend, though, I am considering writing my own randint and shuffle
methods based on os.urandom().

-tooru honda

Aug 25 '07 #7

tooru honda

In the end, I think it is worthwhile to implement my own shuffle and
random methods based on os.urandom. Not only does the resulting code
get rid of the minuscule bias, but the program also runs much faster.

When using random.SystemRandom.shuffle, the posix.open and posix.close
calls made by os.urandom account for almost half of the total execution
time of my program. By implementing my own random and getting a much
larger chunk of random bytes from os.urandom each time, I am able to
reduce the total execution time by half.

-tooru honda

P.S. I use python 2.5.1 on MacOSX 10.4.10 (PowerPC).

Aug 25 '07 #8

Alex Martelli

tooru honda <to*********@fast-mail.org> wrote:

> At the end, I think it is worthwhile to implement my own shuffle and
> random methods based on os.urandom. Not only does the resulting code
> gets rid of the minuscule bias, but the program also runs much faster.
>
> When using random.SystemRandom.shuffle, posix.open and posix.close from
> calling os.urandom account for almost half of the total execution time
> for my program. By implementing my own random and getting a much larger
> chunk of random bytes from os.urandom each time, I am able to reduce the
> total execution time by half.

If I were in your shoes, I would optimize by subclassing
random.SystemRandom and overriding the random method to use os.urandom
with some large block size and then parcel it out, instead of the
_urandom(7) that it now uses. E.g., something like:

    import binascii
    import os
    import random

    class SystemBlockRandom(random.SystemRandom):

        def __init__(self):
            random.SystemRandom.__init__(self)
            def rand7():
                # Fetch 1024 seven-byte values per os.urandom call.
                while True:
                    randata = os.urandom(7*1024)
                    for i in xrange(0, 7*1024, 7):
                        yield long(binascii.hexlify(randata[i:i+7]), 16)
            self.rand7 = rand7().next

        def random(self):
            """Get the next random number in the range [0.0, 1.0)."""
            # Keep the top 53 of the 56 random bits.
            return (self.rand7() >> 3) * random.RECIP_BPF

(untested code). No need to reimplement anything else, it seems to me.
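
For readers on Python 3, where xrange, long and the generator .next method no longer exist, a rough equivalent of the same idea might look like the sketch below. It is an adaptation, not the code from this post: the CHUNKS_PER_READ attribute and the locally defined RECIP_BPF constant are assumptions made to keep the example self-contained.

```python
import os
import random

RECIP_BPF = 2.0 ** -53  # a float carries 53 bits of precision


class SystemBlockRandom(random.SystemRandom):
    """SystemRandom variant that reads os.urandom in large blocks."""

    CHUNKS_PER_READ = 1024  # seven-byte values fetched per read

    def __init__(self):
        super().__init__()
        self._chunks = self._rand7()

    def _rand7(self):
        """Yield 56-bit integers carved out of one big urandom read."""
        while True:
            data = os.urandom(7 * self.CHUNKS_PER_READ)
            for i in range(0, len(data), 7):
                yield int.from_bytes(data[i:i + 7], "big")

    def random(self):
        """Get the next random number in the range [0.0, 1.0)."""
        # Drop 3 of the 56 bits so that exactly 53 remain.
        return (next(self._chunks) >> 3) * RECIP_BPF


if __name__ == "__main__":
    rng = SystemBlockRandom()
    print(rng.random())
```

Which of the inherited methods end up calling random() differs between Python versions, so any speed-up for shuffle() should be measured rather than assumed.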
Alex

Aug 25 '07 #9

tooru honda

By incorporating Alex's code, I got another performance boost of 20%.
It is mostly due to Alex's more efficient implementation of block random
than my own version.

-tooru honda

Below is the code I have now:

    import random
    from binascii import hexlify
    from os import urandom

    class rcRandomC(random.SystemRandom):

        def __init__(self):
            random.SystemRandom.__init__(self)

            def rand2():
                while True:
                    randata = urandom(2*1024)
                    for i in xrange(0, 2*1024, 2):
                        # integer in [0,65535]
                        yield int(hexlify(randata[i:i+2]), 16)

            self.rand2_M = rand2().next

        # modified from random._randbelow
        def randrange(self, startN, stopN):
            """Choose a random integer from range(startN, stopN).

            widthN <= 65536
            """
            widthN = stopN - startN
            left_over_N = 65536 % widthN
            upper_bound_N = 65535 - left_over_N

            # Reject values above upper_bound_N so that the values kept
            # split evenly among the widthN possible results.
            random_number = self.rand2_M()
            while random_number > upper_bound_N:
                random_number = self.rand2_M()

            r = random_number % widthN
            return startN + r

        def shuffle(self, x):
            """Shuffle list x in place; return None."""
            randrange = self.randrange
            for i in reversed(xrange(1, len(x))):
                # pick an element in x[:i+1] with which to exchange x[i]
                j = randrange(0, i+1)
                x[i], x[j] = x[j], x[i]
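
The rejection step is what removes the modulo bias: 16-bit values above the cutoff are thrown away, so the values that remain divide evenly among the possible results. A standalone Python 3 sketch of the same technique follows. It is an illustration with invented function names, and it takes the 16-bit source as a parameter only so that the rejection logic can be exercised with known inputs.

```python
import os


def words16(block_words=1024):
    """Yield integers in [0, 65535], reading os.urandom in blocks."""
    while True:
        data = os.urandom(2 * block_words)
        for i in range(0, len(data), 2):
            yield int.from_bytes(data[i:i + 2], "big")


def accept_limit(width):
    """Largest 16-bit value that is kept for a range of this width."""
    return 65535 - (65536 % width)


def randbelow(width, source):
    """Return an unbiased integer in range(width), for width <= 65536."""
    if not 1 <= width <= 65536:
        raise ValueError("width must be between 1 and 65536")
    limit = accept_limit(width)
    value = next(source)
    while value > limit:  # reject the uneven tail
        value = next(source)
    return value % width


def shuffle(items, source=None):
    """Fisher-Yates shuffle of a list in place; return None."""
    if source is None:
        source = words16()
    for i in reversed(range(1, len(items))):
        # Pick an element in items[:i+1] to exchange with items[i].
        j = randbelow(i + 1, source)
        items[i], items[j] = items[j], items[i]


if __name__ == "__main__":
    cards = list(range(416))
    shuffle(cards)
    print(cards[:10])
```

For any valid width, accept_limit(width) + 1 is a multiple of width, which is the property that makes each result equally likely given a uniform 16-bit source.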

Aug 26 '07 #10

Alex Martelli

tooru honda <to*********@fast-mail.org> wrote:
...

>     def rand2():
>         while True:
>             randata = urandom(2*1024)
>             for i in xrange(0, 2*1024, 2):
>                 # integer in [0,65535]
>                 yield int(hexlify(randata[i:i+2]), 16)

Another equivalent possibility, which might be faster:

    import array
    ...
    def rand2():
        while True:
            x = array.array("H")
            x.fromstring(urandom(2*4000))
            for i in x: yield i
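
On Python 3 the fromstring method is gone and frombytes takes its place. A hedged sketch of the same generator is shown below; the block_words parameter is an addition for illustration, and the code assumes that the "H" type code is two bytes wide, which it checks before use.

```python
import array
import os


def rand2(block_words=4000):
    """Yield unsigned 16-bit integers decoded from os.urandom blocks."""
    if array.array("H").itemsize != 2:
        raise RuntimeError("type code 'H' is not 2 bytes on this platform")
    while True:
        block = array.array("H")
        # Native byte order is fine here: every byte is random anyway.
        block.frombytes(os.urandom(2 * block_words))
        yield from block


if __name__ == "__main__":
    gen = rand2()
    print([next(gen) for _ in range(5)])
```

Letting the array module decode a whole block at once avoids one Python-level conversion per value, which is the likely source of any gain.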
Alex

Aug 26 '07 #11

Antoon Pardon

On 2007-08-26, tooru honda <to*********@fast-mail.org> wrote:

> By incorporating Alex's code, I got another performance boost of 20%.
> It is mostly due to Alex's more efficient implementation of block random
> than my own version.

If I understand correctly that you are using urandom as a random
generator, I wouldn't rely too much on this performance. Urandom
uses the system-wide entropy pool. If other programs need this pool
too, your performance can drop spectacularly. If you are using a Linux
machine, just try to execute "od -x /dev/random" at the same time
as your program.

--
Antoon Pardon

Sep 3 '07 #12

Paul Rubin

Antoon Pardon <ap*****@forel.vub.ac.be> writes:

> If I understand correctly that you are using urandom as a random
> generator I wouldn't trust too much on this performance. Urandom
> uses the systemwide entropy-pool. If other programs need this pool
> too, your performance can drop spectacularly.

No, the idea is that once there's enough entropy in the pool to make
one encryption key (say 128 bits), the output of /dev/urandom is
computationally indistinguishable from random output no matter how
much data you read from it.

Sep 3 '07 #13

Antoon Pardon

On 2007-09-03, Paul Rubin wrote:

> Antoon Pardon <ap*****@forel.vub.ac.be> writes:
>> If I understand correctly that you are using urandom as a random
>> generator I wouldn't trust too much on this performance. Urandom
>> uses the systemwide entropy-pool. If other programs need this pool
>> too, your performance can drop spectacularly.
>
> No, the idea is that once there's enough entropy in the pool to make
> one encryption key (say 128 bits), the output of /dev/urandom is
> computationally indistinguishable from random output no matter how
> much data you read from it.

If you were talking about /dev/random I would agree. But this is what
the man page on my system says about /dev/urandom:

    A read from the /dev/urandom device will not block waiting for
    more entropy. As a result, if there is not sufficient entropy in
    the entropy pool, the returned values are theoretically vulnerable
    to a cryptographic attack on the algorithms used by the driver.
    Knowledge of how to do this is not available in the current
    non-classified literature, but it is theoretically possible that
    such an attack may exist. If this is a concern in your
    application, use /dev/random instead.

And reading from /dev/random can block if there is not enough entropy.

--
Antoon Pardon

Sep 4 '07 #14

Paul Rubin

Antoon Pardon <ap*****@forel.vub.ac.be> writes:

>> No, the idea is that once there's enough entropy in the pool to make
>> one encryption key (say 128 bits), the output of /dev/urandom is
>> computationally indistinguishable from random output no matter how
>> much data you read from it.
>
> If you were talking about /dev/random I would agree. But this is what
> the man page on my system says about /dev/urandom. ...
> the returned values are theoretically vulnerable to a
> cryptographic attack on the algorithms used by the driver.

Right. The idea is that those attacks don't exist and therefore the
output is computationally indistinguishable from random. Of course,
whether the idea is correct is an unproven conjecture, but it looks
pretty good; certainly finding any problem with the specific
algorithms in urandom would be a significant research discovery and
not likely to affect the application being discussed. Finding a
general attack that couldn't be fixed with some simple tweak would be
a major crypto breakthrough that would probably reshape a lot of
complexity theory. This is unlike the situation with Mersenne
Twister, which was not designed for indistinguishability against an
opponent who knew what to look for.

In short, using /dev/random is fairly silly once you know there's
enough entropy in the randomness pool to make a good key. If
/dev/urandom's algorithms are broken then whatever you're doing with
the /dev/random output is probably also broken.

Sep 4 '07 #15

Steven D'Aprano

On Mon, 03 Sep 2007 23:42:56 -0700, Paul Rubin wrote:

> Antoon Pardon <ap*****@forel.vub.ac.be> writes:
>
>>> No, the idea is that once there's enough entropy in the pool to make
>>> one encryption key (say 128 bits), the output of /dev/urandom is
>>> computationally indistinguishable from random output no matter how
>>> much data you read from it.
>>
>> If you were talking about /dev/random I would agree. But this is what
>> the man page on my system says about /dev/urandom. ...
>> the returned values are theoretically vulnerable to a
>> cryptographic attack on the algorithms used by the driver.
>
> Right. The idea is that those attacks don't exist and therefore the
> output is computationally indistinguishable from random.

It is a huge leap from what the man page says, that they don't exist in
the unclassified literature at the time the docs were written, to what
you're saying, that they don't exist.

The man page is clear: there is a possible vulnerability in /dev/urandom.
Any cryptographer worth his salt (pun intended) would be looking to close
that vulnerability BEFORE an attack is made public, and not just wait for
the attack to trickle down from the NSA to the script kiddies. The time
to close the stable door is _before_ the horse gets away.

> Of course, whether the idea is correct is an unproven conjecture, but
> it looks pretty good; certainly finding any problem with the specific
> algorithms in urandom would be a significant research discovery and
> not likely to affect the application being discussed.

I agree that this flaw doesn't sound like it will affect the application
being discussed, but a problem has already been found and a solution is
already known: block until there's enough entropy. That's what
/dev/random does.

[snip]

> In short, using /dev/random is fairly silly once you know there's enough
> entropy in the randomness pool to make a good key. If /dev/urandom's
> algorithms are broken then whatever you're doing with the /dev/random
> output is probably also broken.

That doesn't follow. Antoon is specifically warning that /dev/urandom is
non-blocking. If you knew there was enough entropy available, you
wouldn't need /dev/random -- but how do you know there's enough entropy?

(I suppose you could look in /proc/sys/kernel/random/entropy_avail.)
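
A minimal sketch of that check follows. It is Linux-specific and illustrative only: the function name is invented, and returning None when the file is missing or unreadable is a design choice made for this example. What the reported number means depends on the kernel.

```python
from pathlib import Path

ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def entropy_available(path=ENTROPY_AVAIL):
    """Return the kernel's entropy estimate in bits, or None if the
    file cannot be read (for example on a non-Linux system)."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    bits = entropy_available()
    if bits is None:
        print("entropy estimate not available on this system")
    else:
        print("kernel entropy estimate:", bits, "bits")
```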

For this specific application, it probably doesn't matter -- using
/dev/urandom is surely overkill, and on a single-user Linux desktop
you're unlikely to have vast numbers of applications reading
/dev/urandom without your knowledge. But why not use /dev/random?
What's the downside?

--
Steven.

Sep 4 '07 #16

Steven D'Aprano

On Tue, 04 Sep 2007 22:01:47 -0700, Paul Rubin wrote:

> OK. /dev/random vs /dev/urandom is a perennial topic in sci.crypt and
> there are endless long threads about it there, so I tried to give you
> the short version, but will give a somewhat longer version here.

Thank you. Your points are taken, in particular:

> 4) The man page is fairly seriously bogus because it doesn't explain
> the real situation with either /dev/urandom or /dev/random.

--
Steven.

Sep 6 '07 #17

Lawrence D'Oliveiro

In message <13*************@corp.supernews.com>, Steven D'Aprano wrote:

> Any cryptographer worth his salt (pun intended) would be looking to close
> that vulnerability BEFORE an attack is made public, and not just wait for
> the attack to trickle down from the NSA to the script kiddies.

Except that the NSA's reputation has taken a dent since they failed to
anticipate the attacks on MD5 and SHA-1.

Sep 9 '07 #18

Paul Rubin

Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:

> Except that the NSA's reputation has taken a dent since they failed to
> anticipate the attacks on MD5 and SHA-1.

NSA had nothing to do with MD5, and it's to NSA's credit that SHA-1
held up for as long as it did.

Sep 9 '07 #19

Bryan Olson

Paul Rubin wrote:

> Lawrence D'Oliveiro writes:
>> Except that the NSA's reputation has taken a dent since they failed to
>> anticipate the attacks on MD5 and SHA-1.
>
> NSA had nothing to do with MD5, and it's to NSA's credit that SHA-1
> held up for as long as it did.

I haven't kept up. Has anyone exhibited a SHA-1 collision?
--
--Bryan

Sep 9 '07 #20

Lawrence D'Oliveiro

In message <7x************@ruckus.brouhaha.com>, Paul Rubin wrote:

> Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:
>
>> Except that the NSA's reputation has taken a dent since they failed to
>> anticipate the attacks on MD5 and SHA-1.
>
> NSA had nothing to do with MD5 ...

Nevertheless, it was their job to anticipate attacks on it. After all, they
call themselves the "National _Security_ Agency", don't they?

> ... and it's to NSA's credit that SHA-1 held up for as long as it did.

But they have no convincing proposal for a successor. That means the gap
between the classified and non-classified state of the art has shrunk down
to insignificance.

Sep 9 '07 #21

Paul Rubin

Bryan Olson <fa*********@nowhere.org> writes:

> I haven't kept up. Has anyone exhibited a SHA-1 collision?

I don't think anyone has shown an actual collision, but apparently
there is now a known way to find them in around 2**63 operations. I
don't know if it parallelizes as well as a brute force attack does.
If it does, then it's presumably within reach of the distributed
attacks like the ones used against DES in the late 1990's, given the
hardware speedups that have occurred since then. NIST is trying to
phase out SHA-1 by 2010.

http://en.wikipedia.org/wiki/SHA1#Cr...lysis_of_SHA-1
http://csrc.nist.gov/hash_standards_comments.pdf

Sep 9 '07 #22

Paul Rubin

Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:

>> ... and it's to NSA's credit that SHA-1 held up for as long as it did.
>
> But they have no convincing proposal for a successor. That means the gap
> between the classified and non-classified state of the art has shrunk down
> to insignificance.

The successor is SHA-2.

Sep 9 '07 #23

Lawrence D'Oliveiro

In message <7x************@ruckus.brouhaha.com>, Paul Rubin wrote:

> Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:
>
>>> ... and it's to NSA's credit that SHA-1 held up for as long as it did.
>>
>> But they have no convincing proposal for a successor. That means the gap
>> between the classified and non-classified state of the art has shrunk
>> down to insignificance.
>
> The successor is SHA-2.

According to this <http://en.wikipedia.org/wiki/SHA-1>, the family of
algorithms collectively described as "SHA-2" is by no means a definitive
successor to SHA-1.

Sep 10 '07 #24

Paul Rubin

Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:

> According to this <http://en.wikipedia.org/wiki/SHA-1>, the family of
> algorithms collectively described as "SHA-2" is by no means a definitive
> successor to SHA-1.

See <http://csrc.nist.gov/hash_standards_comments.pdf>:

However, due to advances in technology, NIST plans to phase out of
SHA-1 in favor of the larger and stronger hash functions (SHA-224,
SHA-256, SHA-384 and SHA-512) by 2010. SHA-1 and the larger hash
functions are specified in FIPS 180-2. For planning purposes by
Federal agencies and others, note also that the use of other
cryptographic algorithms of similar strength to SHA-1 will also be
phased out in 2010. SHA-1 and the stronger hash functions in FIPS
180-2 are all NIST approved.

This may also be of interest:

http://www.csrc.nist.gov/pki/HashWorkshop/index.html

Sep 10 '07 #25

Lawrence D'Oliveiro

In message <13*************@corp.supernews.com>, Steven D'Aprano wrote:

> On Sun, 09 Sep 2007 18:53:32 +1200, Lawrence D'Oliveiro wrote:
>
>> In message <7x************@ruckus.brouhaha.com>, Paul Rubin wrote:
>>
>>> Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:
>>>
>>>> Except that the NSA's reputation has taken a dent since they failed to
>>>> anticipate the attacks on MD5 and SHA-1.
>>>
>>> NSA had nothing to do with MD5 ...
>>
>> Nevertheless, it was their job to anticipate attacks on it. After all,
>> they call themselves the "National _Security_ Agency", don't they?
>
> The NSA has many jobs, and doing public research in crypto is only one of
> them -- and a particularly small one at that. For all we know, they had
> an attack on MD5 ten years before anyone else and didn't tell anyone
> because keeping it secret made it useful for one of their other jobs.

Yes, but they're supposed to look after US _National_ security, not their
own security. Since people in strategic jobs make so much use of hash
functions in crypto, that means it is most certainly an important part of
the NSA's function to ensure that there are good hash functions available.
They've fallen down on that job.

>>> ... and it's to NSA's credit that SHA-1 held up for as long as it did.
>>
>> But they have no convincing proposal for a successor. That means the gap
>> between the classified and non-classified state of the art has shrunk
>> down to insignificance.
>
> I don't see how that follows.

Because previously, the NSA has done things that it took open researchers
years, even decades, to figure out. But not any more.

Sep 10 '07 #26
