I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.
I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer
Do you have any advice?
Here a code sample that I used to ensure uniqueness.
typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;
for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}
Thank you 13 12186
Eric wrote: I need an effective way (time is my main concern here) to generate 10 000 000 unique alphanumeric strings of 16 characters each.
I used STL set and map but after about 5 000 000 entries, it becomes very slow, even if I still have enough RAM available on my computer
Do you have any advice?
Here a code sample that I used to ensure uniqueness.
typedef pair<string, bool> StringBool_pair; map<string, bool> MapValues; pair<map<string,bool>::iterator,bool> pr;
for (int iii=0; iii<10000000; iii++) { strCode = random16CharacterCode(); pr = MapValues.insert(StringBool_pair(strCode, true)); if(pr.second == true) { // Accepted... } else { // Rejected... } }
Thank you
See if your compiler has an implementation of hash_set. It is
nonstandard, but many compilers have it as an extension. For example,
in g++ you access it in the namespace __gnu_cxx:
#include <ext/hash_set>
#include <string>
int main()
{
__gnu_cxx::hash_set<std::string> h ;
// Do interesting things here.
return 0 ;
}
This should have the same interface as std::set, but uses a hash table
instead of a balanced binary tree (the usual implementation of
std::set), therefore giving you expected O(1) insertion and lookup,
instead of O(lg n).
Here is some good documentation for hash_set: http://www.sgi.com/tech/stl/hash_set.html
-Alan
Eric wrote: I need an effective way (time is my main concern here) to generate 10 000 000 unique alphanumeric strings of 16 characters each.
I used STL set and map but after about 5 000 000 entries, it becomes very slow, even if I still have enough RAM available on my computer
Do you have any advice?
Can you be more specific about any other constraints on your strings?
If the sole requirement is that they be unique, then you can just list
the first 10^7 in alphabetical order, starting with aaa...aaa.
If they need to be random then you can do various sneaky things that
won't affect the randomness too much (for most purposes). For example,
you can fix the first character of the string and generate 15 subsequent
random characters. Do this 10^7 / (number of distinct characters) times
for each distinct character. That will let you work with substantially
smaller data sets individually.
Other possibilities exist too, it just depends upon how much of your
system's built-in randomness you're willing to sacrifice.
Eric wrote: I need an effective way (time is my main concern here) to generate 10 000 000 unique alphanumeric strings of 16 characters each.
I used STL set and map but after about 5 000 000 entries, it becomes very slow, even if I still have enough RAM available on my computer
Do you have any advice?
Here a code sample that I used to ensure uniqueness.
typedef pair<string, bool> StringBool_pair; map<string, bool> MapValues; pair<map<string,bool>::iterator,bool> pr;
for (int iii=0; iii<10000000; iii++) { strCode = random16CharacterCode(); pr = MapValues.insert(StringBool_pair(strCode, true)); if(pr.second == true) { // Accepted... } else { // Rejected... } }
Thank you
In addition to the ideas others have mentioned (e.g. generating strings
deterministicaly using some one or another listing algorithm, or using a
random string generator that does not generate duplicates over a period
large enough for your purposes) you might want to try a trie. This is a
data structure for storing sets of strings. It has the following
properties: insertion of a new string takes time proportional to the length
of the string (and not the number of strings in the set); testing whether
or not a given string is already a member of the set takes time
proportional to the length of the string (and not the number of strings in
the set).
So what's a "trie"? The name supposedly derives from "reTRIEval". Suppose
you have the set of strings {"HIS", "HERS", "HIM", "HE"} Then the trie
would look like (ascii art coming, hope you have a monospaced font):
+
|
+-H--+
|
+-E
| |
| +-$
| |
| +-R-S-$
|
+-I
|
+-M-$
|
+-S-$
Here "$" denotes end of string. Any path from the root to a leaf spells out
a unique string in the stored set. See how it works? See how you could use
it?
- Michael
Eric wrote: I need an effective way (time is my main concern here) to generate 10 000 000 unique alphanumeric strings of 16 characters each.
I used STL set and map but after about 5 000 000 entries, it becomes very slow, even if I still have enough RAM available on my computer
Which compiler (version), Standard library, operating system?
Do you have any advice?
Here a code sample that I used to ensure uniqueness.
typedef pair<string, bool> StringBool_pair; map<string, bool> MapValues; pair<map<string,bool>::iterator,bool> pr;
for (int iii=0; iii<10000000; iii++) { strCode = random16CharacterCode(); pr = MapValues.insert(StringBool_pair(strCode, true)); if(pr.second == true) { // Accepted... } else { // Rejected... } }
Maybe your string template uses dynamic memory allocation. Try a key
class like this:
class key {
public:
key (const char* s) { ... }
inline friend bool operator< (const key& left, const key& right){...}
private:
char buf[17];
//...
};
The strings are randomly generated, so I can't start with a...a, then
a...b, etc.
Eric wrote: I need an effective way (time is my main concern here) to generate 10 000 000 unique alphanumeric strings of 16 characters each.
I used STL set and map but after about 5 000 000 entries, it becomes very slow, even if I still have enough RAM available on my computer
Do you have any advice?
Here a code sample that I used to ensure uniqueness.
typedef pair<string, bool> StringBool_pair; map<string, bool> MapValues; pair<map<string,bool>::iterator,bool> pr;
for (int iii=0; iii<10000000; iii++) { strCode = random16CharacterCode(); pr = MapValues.insert(StringBool_pair(strCode, true)); if(pr.second == true) { // Accepted... } else { // Rejected... } }
If the template library provided with your compiler has a hash_set or a
hash_map you might want to use that.
If not you might want to ditch the map if you are only using it to
ensure uniqueness of the inserted strings and go for a vector.
The risk that a few of the values already exist is small (but is
there).
You want only 1.0E+7 values, the total number of available values is
2.23007E+43 (or 4.52313E+74 if case sensitive). The chance of getting
the same string (if you use a decent pRNG) is only 1 divided by
2.23E+36 (or 1 divided 4.52E+67 if case sensitive).
If this is to risky since you must have a guarantee for uniqueness
another trick is to generate 36 (or 62) different maps and assign 10
million divided by 36 (or 62) values to each map, where each map stores
only string starting with one specific alpha numerical value. Then do a
random access to one of the submaps when you need a value. You'll want
to generate more values for each map to decrease the chance that one of
the submaps will run out of values when you access it.
This trick reduces the number of values you need to insert into one map
thereby replacing the time it takes to go through the tree for several
million values by a constant time during creation of the strings and a
constant time increase when retrieving a value.
Eric wrote: The strings are randomly generated, so I can't start with a...a, then a...b, etc.
Well, how about the other idea I suggested then? In fact, you can do
even better on it and not sacrifice any randomness. Begin by assigning
10^7 balls into 26 bins (or however many chars you choose from). You
can make an array count[26] and increment a random element 10^7 times.
Then for each bin i, construct count[i] random, distinct 15 char strings
and prepend to each char[i]. This makes the individual problems
substantially smaller and still retains randomness.
More generally you can use on the order of sqrt(10^7) bins and compute
for each on the order of sqrt(10^7) strings. For example, a bin for
every possible first two characters. ve*********@hotmail.com wrote: You want only 1.0E+7 values, the total number of available values is 2.23007E+43 (or 4.52313E+74 if case sensitive). The chance of getting the same string (if you use a decent pRNG) is only 1 divided by 2.23E+36 (or 1 divided 4.52E+67 if case sensitive).
No, it's more than that though still very small. Look up the birthday
problem.
True, I should have used P(k)= 1- n!/(n-k)!*n^k for accuracy.
Now to just to write a program that can handle the numbers involved :)
Eric wrote: I need an effective way (time is my main concern here) to generate 10 000 000 unique alphanumeric strings of 16 characters each.
I used STL set and map but after about 5 000 000 entries, it becomes very slow, even if I still have enough RAM available on my computer
Do you have any advice?
Here a code sample that I used to ensure uniqueness.
typedef pair<string, bool> StringBool_pair; map<string, bool> MapValues; pair<map<string,bool>::iterator,bool> pr;
for (int iii=0; iii<10000000; iii++) { strCode = random16CharacterCode(); pr = MapValues.insert(StringBool_pair(strCode, true)); if(pr.second == true) { // Accepted... } else { // Rejected... } }
You're thinking way too hard on this one. All you need to do is make a
function to generate a non-repeating random number in the range [0,
some number in the range 10**7-1 and 16**6-1] and then convert it to a
fixed length hex string of length 6 and call that function 10**7 times.
"Pseudo-Random Traversal of a Set" on http://www.developer.com/tech/articl...0923_3496381_3 shows one
easy way of doing this.
On 2005-07-07 22:13:41 +0100, "Eric" <er*********@loto-quebec.com> said: I need an effective way (time is my main concern here) to generate 10 000 000 unique alphanumeric strings of 16 characters each.
Not too sure why people should help you generate 10 millions fake
credit card numbers ...
--
JFB
verec wrote: On 2005-07-07 22:13:41 +0100, "Eric" <er*********@loto-quebec.com> said:
I need an effective way (time is my main concern here) to generate 10 000 000 unique alphanumeric strings of 16 characters each.
Not too sure why people should help you generate 10 millions fake credit card numbers ... -- JFB
Alphanumeric credit-card numbers?
Looks more like TAN codes to me, but whatever :-)
On 2005-07-13 00:28:57 +0100, Paul Groke
<sw**********@der-ball-ist-rund.net> said: I need an effective way (time is my main concern here) to generate 10 000 000 unique alphanumeric strings of 16 characters each.
Not too sure why people should help you generate 10 millions fake credit card numbers ... -- JFB
Alphanumeric credit-card numbers? Looks more like TAN codes to me, but whatever :-)
Hmmm. How would _you_ go about asking for help without
revealing too much of your private agenda?
In case it didn't occur to you, alphanumeric is a proper
superset of numeric ... and thus, any help he can get on
the first, immediately applies to the second.
Call me paranoid :)
--
JFB This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics
by: steve |
last post by:
Hi, I have researched but have not found a good solution to this
problem.
I am importing large amounts of data (over 50 Meg) into a new mysql db
that I set up. I use
>mysql dbname <...
|
by: Mike |
last post by:
This is a general question on the best way to import a large amount of data
to a MS-SQL DB.
I can have the data in just about any format I need to, I just don't know
how to import the data. I...
|
by: typingcat |
last post by:
First of all, I'm an Asian and I need to input Japanese, Korean and so
on. I've tried many PHP IDEs today, but almost non of them supported
Unicode (UTF-8) file.
I've found that the only Unicode...
|
by: Bart |
last post by:
Dear all,
I would like to encrypt a large amount of data by using public/private keys,
but I read on MSDN:
"Symmetric encryption is performed on streams and is therefore useful to
encrypt large...
|
by: Macca |
last post by:
Hi,
I'm writing an application that will pass a large amount of data between
classes/functions.
In C++ it was more efficient to send a pointer to the object, e.g structure
rather than passing...
|
by: kamran |
last post by:
Hi,
I have a web service that may return a very large amount of data. I want
that data to return in chunks, like first return 10% of data than return the
next 10% and so on, until all is...
|
by: ShawnD |
last post by:
I'm having some issues when trying to read input off of a pipe using a python script. I'm trying to process packet data from tcpdump in real-time, so it's a filter that needs to read data while the...
|
by: =?Utf-8?B?TW9iaWxlTWFu?= |
last post by:
Hello everyone:
I am looking for everyone's thoughts on moving large amounts (actually, not
very large, but large enough that I'm throwing exceptions using the default
configurations).
We're...
|
by: Jack |
last post by:
I need to process large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try...
|
by: trpost |
last post by:
I am looking for a way to send data from one page to another as POST
data without using forms or cURL.
I have a php script that is passing a list of cases from on page to
another when a link is...
|
by: Charles Arthur |
last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
|
by: emmanuelkatto |
last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud.
Please let me know.
Thanks!
Emmanuel
|
by: nemocccc |
last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
|
by: Sonnysonu |
last post by:
This is the data of csv file
1 2 3
1 2 3
1 2 3
1 2 3
2 3
2 3
3
the lengths should be different i have to store the data by column-wise with in the specific length.
suppose the i have to...
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
|
by: Hystou |
last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
|
by: jinu1996 |
last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
|
by: Hystou |
last post by:
Overview:
Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
|
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
| |