Hi All,
I'm looking for some help in developing a way to test for the uniqueness
of an item in a large list. To illustrate what I mean, the code below is an
indicator of what I want to do, but I'm hoping for a faster way than the
one shown. Basically, I have a list of 20 integers and I want to create
another list of 200000 unique subsets of 12 integers from this list. What I
have done here is to use the sample() function from the random module
and then compare the result to every item in the ints list to check for
uniqueness - as you can guess, this takes an interminable amount of time to
grind through. Can someone please enlighten me as to how I can do this and
keep the amount of time to do it to a minimum?
Thanks for taking the time to read this and doubly so if you're able to
help.
Cheers,
Lol McBride
#!/usr/bin/python
from random import *

# seq is the pool to choose from
seq = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
# this is how many times to choose a unique selection
rlen = 200000
counter = 100
ints = [0]
while 1:
    # if this is the last value for rlen then we stop
    if rlen == 0:
        break
    intLen = len(ints)
    cnt = 0
    testVal = sample(seq, 12)
    testVal.sort()
    if len(ints) >= counter:
        print len(ints)
        counter = counter + 100
    while 1:
        if ints.count(testVal) == 1:
            cnt = cnt + 1
            break
        intLen = intLen - 1
        if intLen == 0:
            break
    if cnt == 1:
        continue
    else:
        ints.append(testVal)
        rlen = rlen - 1
On Wed, 2004-10-20 at 11:01 +0100, Lol McBride wrote: [original post and code snipped]
Disregarding more efficient methods of generating your list (of which
I'm sure there are many), part of the "interminable time" is probably
taken up in the infinite loop you have ;) Set rlen to a much smaller
number and you'll see that the program still never terminates.
You'll notice that:
ints = [0]
while 1:
    ...
    intLen = len(ints)  # aka 1
    ...
    while 1:
        ...
        intLen = intLen - 1  # aka 0
        if intLen == 0:      # always true
            break
...
Basically this means that nothing is ever appended to ints nor is rlen
ever decremented (and these are the tests by which the loops terminate).
I hear that Python 3000 will complete infinite loops in half the time
however ;)
Regards,
Cliff
--
Cliff Wells <cl************@comcast.net>
Lol McBride <ne******@lolmc.com> wrote: [original post snipped]
One word: dictionaries!
Untested, but should work...:
import random   # don't use from...import *, on general grounds

def picks(seq=xrange(1, 21), rlen=200000, picklen=12):
    results = {}
    while len(results) < rlen:
        pick = random.sample(seq, picklen)
        pick.sort()
        pick = tuple(pick)
        results[pick] = 1
    return results.keys()
this returns a list of tuples; if you need a list of lists,
    return [list(pick) for pick in results]
In 2.4, you could use "results = set()" instead of {}, results.add(pick) to
maybe add a new pick, and possibly "return list(results)" if you need a
list of tuples as the function's overall return value (the list
comprehension for the case in which you need a list of lists stays OK).
But that's just an issue of readability, not speed. Slight speedups can
be had by hoisting some lookups out of the while loop, but nothing
major, I think -- e.g. sample = random.sample just before the while, and
use sample(seq, picklen) within the while. Putting all the code in a
function and using local variables IS a major speed win -- local
variable setting and access is WAY faster than globals'.
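[Editor's note: a sketch of the 2.4-style set variant with the sample lookup hoisted into a local, written with range/sorted so it also runs on modern Python. The name picks_set is mine, and rlen must stay below the 125,970 possible sorted subsets, a limit raised later in this thread.]

```python
import random

def picks_set(seq=range(1, 21), rlen=100000, picklen=12):
    # set-based variant of the dict approach; the set itself
    # guarantees uniqueness, so no explicit membership scan is needed
    sample = random.sample   # hoist the attribute lookup out of the loop
    results = set()
    while len(results) < rlen:
        # a sorted tuple is hashable and order-insensitive
        results.add(tuple(sorted(sample(seq, picklen))))
    return list(results)
```

Calling picks_set(rlen=2000) returns 2000 distinct sorted 12-tuples almost instantly; the slowdown only appears as rlen approaches the total number of subsets.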
Alex
On Wed, 20 Oct 2004 04:07:20 -0700, Cliff Wells wrote: [snip]
Doh! And once again, with feeling: Doh!
Thank you for pointing that out to me
Lol
On Wed, 20 Oct 2004 13:37:32 +0200, Alex Martelli wrote: [snip]
Thank you Alex - i'll try your method and see how i get on
Lol
Alex Martelli's solution is *very* good, but there is a sampling
problem. Putting a simple print inside his program:

    if not (len(results) % 5000):
        print len(results)

you can see that it slows down a lot when the dictionary contains
about 100000-120000 different sequences, because there are many
collisions, and it keeps slowing down. A small speed-up of
this code probably cannot help; a different algorithm may be needed.
I don't know... direct sequence generation doesn't seem possible.
Maybe a partially direct generation can be okay.
Hugs,
be************@lycos.com (bearophile) wrote: [snip]
There are only 125,970 unique sequences of 12 items from 20 where order
does not matter (20!)/(12!8!), which is what Alex was generating, and
what Lol McBride asked for.
Removing the "ordering does not matter" constraint (by removing the
list.sort() call) will let it complete much faster, as the space to select
from becomes 60,339,831,552,000 in size.
Math saves the day!
- Josiah
Josiah: There are only 125,970 unique sequences of 12 items from 20 where order does not matter (20!)/(12!8!),
Right! I was thinking the same thing *after* writing my last post ^_-
With random sampling, the generation of the last 10 combinations is a
bit slow; it requires a couple of minutes.
There are direct ways to compute them: http://www.netlib.org/toms/382
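[Editor's note: to illustrate the idea of direct generation, here is my own sketch of lexicographic unranking, not the TOMS 382 algorithm itself. Draw rlen distinct ranks in [0, C(n, k)) with random.sample, then map each rank straight to its combination, so collisions never occur.]

```python
from math import factorial

def comb(n, k):
    # binomial coefficient C(n, k); zero when k is out of range
    if k < 0 or k > n:
        return 0
    return factorial(n) // (factorial(k) * factorial(n - k))

def unrank_combination(rank, n, k):
    # return the k-subset of 1..n at the given 0-based index in
    # lexicographic order, by skipping whole blocks of combinations
    result = []
    x = 1
    while k > 0:
        block = comb(n - x, k - 1)  # combos whose next element is x
        if rank < block:
            result.append(x)
            k -= 1
        else:
            rank -= block
        x += 1
    return result
```

With this, sampling rlen distinct integers below comb(20, 12) and unranking each one yields rlen distinct sorted 12-subsets with no rejection loop at all.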
Bye,
bearophile
"bearophile" <be************@lycos.com> wrote in message
news:5c**************************@posting.google.com... [snip]
Is part of this slowdown the fact that the loop will exit only when rlen
reaches 200,000, when you have just shown that it can never be greater than
125,970?
Perhaps some heuristic would opt for building sets instead of calling
random.sample repeatedly if rlen approaches some threshold fraction of
len(seq)!/(picklen!*(len(seq)-picklen)!), perhaps 0.8 or so. There should be a
crossover point at which it is cheaper to build a set of all 125,970
combinations, and then select 100,000 of them using set.pop() (which selects
an arbitrary element from the set, and removes it).
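[Editor's note: a sketch of that crossover strategy using itertools.combinations, which arrived in Python 2.6, after this thread, and random.sample over the full enumeration rather than set.pop(); the function name is mine.]

```python
import itertools
import random

def picks_exhaustive(seq, rlen, picklen):
    # enumerate every picklen-subset exactly once, then sample rlen
    # of them; cheaper than rejection sampling when rlen is a large
    # fraction of the total number of combinations
    all_combos = list(itertools.combinations(seq, picklen))
    if rlen > len(all_combos):
        raise ValueError("rlen exceeds the number of distinct combinations")
    return random.sample(all_combos, rlen)
```

For seq of 20 and picklen of 12 the full list holds only 125,970 tuples, so enumerating it is cheap, and the result is uniform without any uniqueness test.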
-- Paul
"Paul McGuire" <pt***@austin.rr._bogus_.com> wrote in message
news:sR*******************@fe1.texas.rr.com... [snip]
.... and of course, raising an error if rlen >
len(seq)!/(picklen!*(len(seq)-picklen)!)
-- Paul