472,143 Members | 1,154 Online
Bytes | Software Development & Data Engineering Community
Post +

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,143 software developers and data experts.

Generating random strings without duplicates

In a program randomly generating 10 000 000 alphanumeric codes of 16
characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an
efficient way to ensure that I do not generate duplicates?

STL set, map?
Could you give me a little code example?

Thank you.

Jul 23 '05 #1
7 6889
On 7 Jul 2005 06:49:29 -0700, er*********@loto-quebec.com wrote:
In a program randomly generating 10 000 000 alphanumeric codes of 16
characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an
efficient way to ensure that I do not generate duplicates?

STL set, map?
Could you give me a little code example?


160 MB with 1-byte characters, or at least twice that for wchar_t ...
hmmm?

Store them in a database, and put a unique index on that column. True
randomness does not exclude duplicate values.

--
Bob Hairgrove
No**********@Home.com
Jul 23 '05 #2
Bob Hairgrove wrote:
On 7 Jul 2005 06:49:29 -0700, er*********@loto-quebec.com wrote:
In a program randomly generating 10 000 000 alphanumeric codes of 16
characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an
efficient way to ensure that I do not generate duplicates?

STL set, map?
Could you give me a little code example?


160 MB with 1-byte characters, or at least twice that for wchar_t ...
hmmm?

Store them in a database, and put a unique index on that column. True
randomness does not exclude duplicate values.


Depending on what sort of "randomness" you require, you might be
able to avoid any kind of storage/lookup at all. Certain pseudo-
random number generators have the property that the same number
never occurs more than once for the duration of the randomizer's
"period" (after which the entire sequence of numbers repeats).
The linear-congruential algorith used by many implementations of
rand() has this property -- but unfortunately, from what I've
read most tend to have periods of only 65,000 or so.

If you could find a version with a period of 10,000,000 or more
then your problem is 99% solved. All you then have to do is use
each random number to generate a 16-character string in such a
way that no two numbers yield the same string. (This should be
easy enough to guarantee. If you limit yourselves to characters
A-Z and 0-9 this gives you up to 36^16 possible strings which
is more than enough.)

This will give you an algorithm for generating up to N unique
strings (N being the period of the randomizer), where uniqueness
is guaranteed by the algorithm with no lookup required.

Jul 23 '05 #3
ni*****@microsoft.com wrote:
Depending on what sort of "randomness" you require, you might be
able to avoid any kind of storage/lookup at all. Certain pseudo-
random number generators have the property that the same number
never occurs more than once for the duration of the randomizer's
"period" (after which the entire sequence of numbers repeats).
The linear-congruential algorith used by many implementations of
rand() has this property -- but unfortunately, from what I've
read most tend to have periods of only 65,000 or so.

If you could find a version with a period of 10,000,000 or more
then your problem is 99% solved.


Google: mersenne twister
Bart.

Jul 23 '05 #4
On 7 Jul 2005 06:49:29 -0700, er*********@loto-quebec.com wrote:
In a program randomly generating 10 000 000 alphanumeric codes of 16
characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an
efficient way to ensure that I do not generate duplicates?

STL set, map?
Could you give me a little code example?

Thank you.


An alternative:

1 Generate a random string.
2 Hash the string to a value in a range > 0 - 9 999 999.
3 If this hash value has been used before then goto 1.
4 If the hash value is a new one then the string is also new.

Keep track of which hash values have appeared by using a bit array,
one bit per value in the hash range.

If you pick the hash range to be 25% larger than the number of strings
then on average you will only have to generate 4 strings to find a new
one at the end of the run.
Another alternative:

1 Generate a file of N > 10 000 000 unique strings in advance.
2 Pick a random number R in the range 1 .. N, both inclusive..
3 If R is already picked then pick another.
4 If R is not already picked then output the Rth string.

Again a bit array will keep track of the numbers that you have already
picked.

By generating the strings in advance you can take as long as you want
to get them right. You can load the file from disk for the actual run.
Yet another alternative:

1 Generate a file of N >= 10 000 000 unique strings in advance.
2 For IX = N downto 1
3 Pick a random number R in the range 1 .. IX, both inclusive.
4 Output string R.
5 Copy string IX over string R
6 Endfor

rossum
The ultimate truth is that there is no ultimate truth
Jul 23 '05 #5


Bob Hairgrove wrote:
On 7 Jul 2005 06:49:29 -0700, er*********@loto-quebec.com wrote:
In a program randomly generating 10 000 000 alphanumeric codes of 16
characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an
efficient way to ensure that I do not generate duplicates?

STL set, map?
Could you give me a little code example?


160 MB with 1-byte characters, or at least twice that for wchar_t ...
hmmm?

Store them in a database, and put a unique index on that column. True
randomness does not exclude duplicate values.

--
Bob Hairgrove
No**********@Home.com


Try "uuidgen"

Greg

Jul 23 '05 #6
In article <11**********************@g43g2000cwa.googlegroups .com>,
<er*********@loto-quebec.com> wrote:
In a program randomly generating 10 000 000 alphanumeric codes of 16
characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an
efficient way to ensure that I do not generate duplicates?


More information would make it possible to answer in something other
than guesses.

Why are you generating all these strings (how will they be used)? Why
is duplication an issue? etc.
--
Mark Ping
em****@soda.CSUA.Berkeley.EDU
Jul 23 '05 #7

"E. Mark Ping" <em****@soda.csua.berkeley.edu> wrote in message
news:da**********@agate.berkeley.edu...
In article <11**********************@g43g2000cwa.googlegroups .com>,
<er*********@loto-quebec.com> wrote:
In a program randomly generating 10 000 000 alphanumeric codes of 16
characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an
efficient way to ensure that I do not generate duplicates?


More information would make it possible to answer in something other
than guesses.

Why are you generating all these strings (how will they be used)? Why
is duplication an issue? etc.
--
Mark Ping
em****@soda.CSUA.Berkeley.EDU


And why can't you just use a UUID generator which is guaranteed to be
unique?
Jul 23 '05 #8

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

21 posts views Thread by Andreas Lobinger | last post: by
4 posts views Thread by Phillo | last post: by
7 posts views Thread by Mary | last post: by
16 posts views Thread by Leon | last post: by
2 posts views Thread by Simon Wittber | last post: by
6 posts views Thread by Mike P | last post: by
reply views Thread by leo001 | last post: by

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.