Jerry Coffin wrote:[color=blue]
>
> In article <40176F2B.CE976BD4@doe.carleton.ca>,
fma@doe.carleton.ca
> says...
>
> [ ... ]
>[color=green]
> > In my case, I will have N items in the container,
> > will look up n (n<N) items, do some processing,
> > then replace the n worst items in the container
> > with the n new items from processing. Here, worse
> > is defined by the sorting function. The problem
> > is that n can vary from 1 to N, making the advantages
> > of vector vs. map very unclear. For small n,
> > map is better because I'm basically interspersing
> > 1 modification of the database with 1 lookup.
> > For large n, vector is better because it's faster
> > to sort a whole bunch of objects at once rather
> > than building a map in a piecemeal fashion.[/color]
>
> Finding n smallest (or largest) items in a container can be done
> considerably more quickly than simply sorting the container and then
> taking the first (or last) n items. The latter requires a sort, which
> is O(N lg N), where the former can be done in approximately O(n lg N)
> time -- a big advantage if n is quite a bit smaller than N, and a
> progressively smaller one as n approaches N.
>
> The usual way of doing this is basically a modification of a Quick Sort.
> In a Quick Sort, you pick a pivot, partition your elements, and then
> recursively do the same with each partition.
>
> To pick n elements instead, you simply pick the partition those elements
> will fall into, and only recurse into the partition, ignoring the other.
> For example, assume you want the 10 smallest elements out of 100. For
> the moment I'll assume you pick perfect pivot elements every time, so
> the first time you divide it into the 50 smallest and the 50 largest.
> Clearly the 10 smallest aren't in the 50 largest, so you simply ignore
> the right hand partition, and continue with the left partition. The
> next time, you divide it into two groups of 25 elements apiece. Again,
> you ignore the larger elements, and concentrate only on the smaller
> ones. The next time around, you're down to 12 elements in the left
> partition -- from there, it's probably easiest to just search linearly
> to find the to largest of what's left, and discard them, leaving the 10
> that you care about.
>
> IIRC, whether this is better or worse than a heap depends on the
> relative frequency of updating elements vs. inserting entirely new
> elements. If (for example) you start with 100 elements, and every time
> around, you insert 50 entirely new elements, find the 10 worst out of
> the entire array, and so on iteratively, then chances are that this will
> work better than a heap. At the opposite extreme, if the only updating
> on the data is due to the processing you do after finding your n
> elements, then chances are much better that a heap will work better.
>
> A heap puts in a bit more work up-front, and does a partial sorting on
> the entire array, so if you can put that to use, it will usually be a
> win. OTOH, if you just search for n elements, you put less work into
> arranging the data (so it's faster the first time) but you're leaving
> the rest of the array relatively random, so you haven't gained much for
> the next iteration.[/color]
Jerry, I am replacing the worst n of N elements (n<N) with new elements,
and I realize that partitioning is the way to go for that step (I am
going to use the partition() in STL for convenience). But prior to
this step, I lookup n out of N elements, and these are not the same
as the n worst elements. They can occur anywhere. I also neglected
to mention details which ultimately decided that binary searching (either
by a tree like a map, or binary search) is not the best way to perform
those lookups. That is, n will average N/4. What I will do is sort
the n elements to be looked up, then look for each one in turn by linearly
scanning the N elements of the lookup table. When one of the n elements
is found, the next one will be scanned for from that position in the lookup
table. So there will be an average of 3 skipped elements in the lookup
table for every one found. I take advantage of locality in the cache
more so than using a binary search, since the lookup table is traversed
sequentially, and the overhead of 3 skipped elements is acceptable because
of fast incrementing between adjacent vector elements.
I guess I'm exploiting a situation specific detail to get around the
larger question of profiling vector/map implementations for the time
being. The heap was a bit of a step which I prefer to find out more
about, but in the future. Reason being, my background is not CS, and
I posed my original choice as a vector/map decision because it's
readily available in standardized form via the STL, and I've got a
nice explanatory book about the formalities of their use in Josuttis.
So restricting my choices is a tradeoff; certainly, I won't have the
best solution, but I can choose the best one for which I can allot
the time.
Thanks for explanation of partitioning, and examples comparing it to
the heap (though I have to admit, I need to follow it up with more
textbook reading--I've briefly read about heaps in the past when
pondering the sort/lookup problem).
Fred
--
Fred Ma
Dept. of Electronics, Carleton University
1125 Colonel By Drive, Ottawa, Ontario
Canada, K1S 5B6