Banfa,
For more than 20,000,000 short data sets (example: Fred Flintstone 4321 Rock Creek),
it looks like a multi-core (multi-physical-thread) Radix Sort
should work.
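To make the short-data-set case concrete, here is a minimal single-threaded sketch of an LSD Radix Sort on short string keys. The padding scheme and the assumption of ASCII/Latin-1 characters are mine, not from the discussion; a multi-core version would partition the input across physical cores and combine per-core bucket counts.

```python
def radix_sort_records(records):
    """LSD Radix Sort on short strings, padded to equal length.

    Sorts one character position per pass, from last to first; each
    pass is a stable bucket pass, so earlier positions dominate.
    Assumes ASCII/Latin-1 range characters.
    """
    if not records:
        return []
    width = max(len(r) for r in records)
    keys = [r.ljust(width, "\0") for r in records]  # pad to a fixed width
    for pos in range(width - 1, -1, -1):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[min(ord(k[pos]), 255)].append(k)
        keys = [k for bucket in buckets for k in bucket]
    return [k.rstrip("\0") for k in keys]
```

A parallel version would give each core its own slice of the input, have each core count its bucket sizes, then combine the counts before scattering the records.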
For more than 20,000,000 longer data sets (example: Communication Avenue: Phone/Land Line, Time Stamp: 123456789, Location Origination: USA/Maryland, VISA #1234567890123456, Order #0987654321),
it looks like a multi-core (multi-physical-thread) combination should work: Radix Sort first (on maybe the leading 20% of the data length), then Bitonic Sort to complete the process.
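That combination could look something like the sketch below: a single Radix partitioning pass on the leading character stands in for the "first 20%" stage, and a textbook Bitonic Sort (padded to a power of two with a sentinel) finishes each bucket. The bucket count, the sentinel value, and partitioning on only one character are my simplifying assumptions.

```python
def bitonic_sort(values):
    """Textbook Bitonic Sort; pads to a power of two with a sentinel."""
    n = 1
    while n < len(values):
        n *= 2
    data = values + ["\uffff"] * (n - len(values))  # sentinel sorts last

    def merge(lo, cnt, up):
        if cnt > 1:
            k = cnt // 2
            for i in range(lo, lo + k):
                if (data[i] > data[i + k]) == up:
                    data[i], data[i + k] = data[i + k], data[i]
            merge(lo, k, up)
            merge(lo + k, k, up)

    def sort(lo, cnt, up):
        if cnt > 1:
            k = cnt // 2
            sort(lo, k, True)        # ascending first half
            sort(lo + k, k, False)   # descending second half
            merge(lo, cnt, up)       # merge the bitonic sequence

    sort(0, n, True)
    return data[:len(values)]

def hybrid_sort(records):
    """Radix-partition on the first character, then bitonic-sort each bucket."""
    buckets = [[] for _ in range(256)]
    for r in records:
        buckets[min(ord(r[0]), 255) if r else 0].append(r)
    out = []
    for b in buckets:
        out.extend(bitonic_sort(b))
    return out
```

In a multi-core version, the buckets produced by the Radix stage are independent, so each bucket's Bitonic stage could run on its own physical core.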
Thank you for that information.
For a very fast-changing data load (example: a "massively multiplayer" game as defined by the originator of that term back in the early 1990s: a minimum of 1,000 concurrent users all in the same game, all playing against each other at the same time, not sectioned off; full interaction, one to 1,000), and especially if the data sets are long (more than 150 characters each), I think that the sort which I described in my question could be the best.
Example: sort 10% of the data with each of 10 different physical CPU cores, then step forward 5% of the total data and sort from there in 10% sections, maybe. The actual sorts could be Radix for short data sets and Bitonic for longer ones, or a combination as previously described. Do this non-stop for as long as the data is changing very fast.
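A toy, sequential version of that stepped-window idea might look like the sketch below; the window is derived as twice the step (the 10%/5% figures above), and the convergence loop plus the even/odd phase split are my own additions to make the scheme safe for parallel execution. Windows sorted within one phase never overlap, so each could go to its own physical core.

```python
def overlapping_window_sort(data, step_pct=5):
    """Repeatedly sort stepped, overlapping windows until the list settles.

    Window = 2 * step, so windows starting at even multiples of the step
    never overlap each other (likewise for odd multiples); each phase's
    windows could therefore be sorted concurrently on separate cores.
    Sorting a window never increases the total inversion count, so the
    loop terminates, and adjacent overlapping windows guarantee a fully
    sorted result when nothing changes.
    """
    n = len(data)
    step = max(1, n * step_pct // 100)   # 5% step
    window = 2 * step                    # 10% window
    changed = True
    while changed:
        changed = False
        for parity in (0, 1):            # even-start phase, then odd-start phase
            for start in range(parity * step, n, 2 * step):
                chunk = data[start:start + window]
                ordered = sorted(chunk)
                if ordered != chunk:
                    data[start:start + window] = ordered
                    changed = True
    return data
```

For the continuously changing load described above, the outer loop would simply never stop; each pass re-settles whatever the incoming updates disturbed.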
An addition could be to compartmentalize across multiple servers as needed.
etcetera:
In case someone is interested in my observations on the paper reported by Banfa, see the following:
"The paper concludes that an implementation of radix sort is the fastest for GPUs, by a great margin." [pdf page 4].
But they note that this can vary based upon the randomness and presortedness of the data at the time of the sort [pdf page 5+]. That makes sense.
In that paper it says,
"As the data sets grows larger, the speedup increases pending on the number of threads" [pdf page 11].
That was expected.
"Results. Multithreading does improve performance, with two threads in average providing a speedup of up to 2x and four threads up to 3x. However, the potential speedup is bound to the available physical threads of the CPU and dependent of balancing the workload." [pdf page 3].
The 2x was expected. The 3x I reject as inconclusive due to insufficient data size; I am guessing that the load on the four threads was not sufficient to achieve results approaching 4x. But the 2x for two threads gave me an answer.
"On a single-core machine, there is little improvement due to multithreading, since it only has one core and can only run one thread at the time." [pdf page 10]
I did not notice where they specifically stated that a multi-threaded single core was to be counted only by its actual core count. They seem to completely discount the "multi-thread" ability; it seems to be a premise. They use the term "physical threads", so I think that it was a condition they were using. It is exactly a part of what I based my question upon. Thank you, Banfa, for finding this and reporting it.
I had not gotten far in reading this report when I noticed a potential error in their testing logic. I wanted to know about multi-core threading. They reported on what seemed to be multi-core threading, but if I read the report correctly, it is not [specifically not] the multi-core threading I was looking for. For multi-core threading, I would have used the reported core count plus one. The plus one would have been for the operating system and the program being executed, apart from the threads being tested. The threads being tested would then have run on further, separate cores. For a two-core test: use a minimum of a three-core system. Operate on core one; test on core two and core three, with the constraint that cores two and three are kept from being used for any other purpose.
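On Linux, the pinning described above can be sketched with the standard library's process-affinity calls; `os.sched_setaffinity` is Linux-only, hence the guard, and a real benchmark harness would additionally pin each tested thread to its own further core.

```python
import os

def pin_current_to_one_core():
    """Pin the caller to a single allowed core (Linux-only; no-op elsewhere).

    In the three-core methodology above: pin the OS/driver work to core
    one and each thread under test to its own further core, so the
    tested threads never share a core with anything else.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity control is unavailable on this OS
    allowed = sorted(os.sched_getaffinity(0))  # cores we may run on
    os.sched_setaffinity(0, {allowed[0]})      # restrict to the first of them
    return os.sched_getaffinity(0)

pinned = pin_current_to_one_core()
```

After pinning, the process is confined to exactly one core, so any timing it does no longer competes with the threads under test on the other cores.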
I have now finished the report, and I think that I conditionally agree with them and with you about the sort type to use, based upon the randomness of the data and the average individual data set size.
Thank you, Banfa.