Connecting Tech Pros Worldwide Forums | Help | Site Map

Logic check please

Rich S.
Guest
 
Posts: n/a
#1: Sep 21 '05
I'm reading through records in a file creating bitset objects from them.
Each bitset object has a score and I'm trying to collect just the top 2000
scores from over a million records.

I tihnk the best way to go is to create a vector, add the objects to it in
ascending order while keeping track of the size and lowest score. If the
size reaches the 2000 mark then evaluate the score and if it's lower than
the lowest score, discard the object otherwise traverse the vector and
insert it at its appropriate space and finally chop off the beginning of the
vector so that we're still in the 2000 range.

When I say best way I mean the way I thought of given my somewhat limited
experience with stl containers. I was also thinking about determining if the
object belongs in the 2000 set based upon it's score and then tossing it in
the end of the vector and doing a vector sort but I figured I'd skip the
sort and insert it myself at the proper position.

I didn't see anything that sets a vectors size as fixed and only inserts if
above this range type of thing but I guess that's where you override the
allocater? I haven't been down that road but nows a good time to jump in I
guess.

Thanks

This convoluted method below if you care to read and process is what I came
up with.

iSize = m_BitSetVector.size();

// if the vector is empty then just add it and collect the score
if(iSize == 0)
{
iLowestTotal = currentSet.GetScore();
m_BitSetVector.push_back(currentSet);
continue;
}


bool bAtFullCapacity = false;
if(m_BitSetVector.size() > m_iTotalScoresToKeep)
{
bAtFullCapacity = true;

// if the score is less than the lowest in the vector and
// we're at full capacity then continue
if(currentSet.GetScore() < m_BitSetVector.begin()->GetScore())
continue;
}

BitSetVector::iterator p;
bool bInserted = false;

for(p = m_BitSetVector.begin(); p != m_BitSetVector.end(); ++p)
{
if(currentSet.GetScore() < p->GetScore())
{
// insert at this location
m_BitSetVector.insert(p, currentSet);
bInserted = true;

if(m_BitSetVector.size() > m_iTotalScoresToKeep)
bAtFullCapacity = true;

// now chop off the end
if(bAtFullCapacity)
{
m_BitSetVector.erase(m_BitSetVector.begin());
iLowestTotal = m_BitSetVector.begin()->GetScore();
}

break;
}
}

// if it's not inserted yet then
if(!bInserted)
{
m_BitSetVector.push_back(currentSet);

if(m_BitSetVector.size() > m_iTotalScoresToKeep)
bAtFullCapacity = true;

// now chop off the end
if(bAtFullCapacity)
{
m_BitSetVector.erase(m_BitSetVector.begin());
iLowestTotal = m_BitSetVector.begin()->GetScore();
}
}



niklasb@microsoft.com
Guest
 
Posts: n/a
#2: Sep 21 '05

re: Logic check please


Rich S. wrote:[color=blue]
> I'm reading through records in a file creating bitset objects from them.
> Each bitset object has a score and I'm trying to collect just the top 2000
> scores from over a million records.[/color]

A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

Greg
Guest
 
Posts: n/a
#3: Sep 22 '05

re: Logic check please


Rich S. wrote:[color=blue]
> I'm reading through records in a file creating bitset objects from them.
> Each bitset object has a score and I'm trying to collect just the top 2000
> scores from over a million records.
>
> I tihnk the best way to go is to create a vector, add the objects to it in
> ascending order while keeping track of the size and lowest score. If the
> size reaches the 2000 mark then evaluate the score and if it's lower than
> the lowest score, discard the object otherwise traverse the vector and
> insert it at its appropriate space and finally chop off the beginning of the
> vector so that we're still in the 2000 range.
>
> When I say best way I mean the way I thought of given my somewhat limited
> experience with stl containers. I was also thinking about determining if the
> object belongs in the 2000 set based upon it's score and then tossing it in
> the end of the vector and doing a vector sort but I figured I'd skip the
> sort and insert it myself at the proper position.
>
> I didn't see anything that sets a vectors size as fixed and only inserts if
> above this range type of thing but I guess that's where you override the
> allocater? I haven't been down that road but nows a good time to jump in I
> guess.
>
> Thanks
>
> This convoluted method below if you care to read and process is what I came
> up with.
>
> iSize = m_BitSetVector.size();
>
> // if the vector is empty then just add it and collect the score
> if(iSize == 0)
> {
> iLowestTotal = currentSet.GetScore();
> m_BitSetVector.push_back(currentSet);
> continue;
> }
>
>
> bool bAtFullCapacity = false;
> if(m_BitSetVector.size() > m_iTotalScoresToKeep)
> {
> bAtFullCapacity = true;
>
> // if the score is less than the lowest in the vector and
> // we're at full capacity then continue
> if(currentSet.GetScore() < m_BitSetVector.begin()->GetScore())
> continue;
> }
>
> BitSetVector::iterator p;
> bool bInserted = false;
>
> for(p = m_BitSetVector.begin(); p != m_BitSetVector.end(); ++p)
> {
> if(currentSet.GetScore() < p->GetScore())
> {
> // insert at this location
> m_BitSetVector.insert(p, currentSet);
> bInserted = true;
>
> if(m_BitSetVector.size() > m_iTotalScoresToKeep)
> bAtFullCapacity = true;
>
> // now chop off the end
> if(bAtFullCapacity)
> {
> m_BitSetVector.erase(m_BitSetVector.begin());
> iLowestTotal = m_BitSetVector.begin()->GetScore();
> }
>
> break;
> }
> }
>
> // if it's not inserted yet then
> if(!bInserted)
> {
> m_BitSetVector.push_back(currentSet);
>
> if(m_BitSetVector.size() > m_iTotalScoresToKeep)
> bAtFullCapacity = true;
>
> // now chop off the end
> if(bAtFullCapacity)
> {
> m_BitSetVector.erase(m_BitSetVector.begin());
> iLowestTotal = m_BitSetVector.begin()->GetScore();
> }
> }[/color]

Just call std::nth_element() to group the 2,000 highest items together
at the end of their container.

Here's a program to illustrate:

#include <vector>
#include <algorithm>
#include <iostream>

int main()
{
const int kTop2K = 2000;

std::vector<int> numberV;

// add numbers 1 - 1,000,000
for (int i=0; i < 10000000; i++)
numberV.push_back( i);

// shuffle the vector's contents
std::random_shuffle( numberV.begin(), numberV.end());

// visually confirm items are in random order
std::cout << "Last 2,000 elements (shuffled) " << std::endl;
for ( int i= numberV.size() - kTop2K; i < numberV.size() ; i++)
std::cout << numberV[i] << " ";
std::cout << std::endl << std::endl;

assert( numberV.size() >= kTop2K);

// use nth_element with the nth element being the 2,000th
// position from the end of the container

std::nth_element(numberV.begin(),
numberV.end()-kTop2K,
numberV.end());

// the top 2,000 items have been moved after the nth element
// all lower ranked items are are now before nth element.
// Note that neither range has been completely sorted

std::cout << "Last 2,000 elements (final): " << std::endl;

// visually inspect results.
// all numbers listed should be > 998,000

for (int i=numberV.size() - kTop2K; i < numberV.size() ; i++)
std::cout << numberV[i] << " ";
}

Greg

Rich S.
Guest
 
Posts: n/a
#4: Sep 22 '05

re: Logic check please



<niklasb@microsoft.com> wrote in message
news:1127322559.129297.309840@g49g2000cwa.googlegr oups.com...[color=blue]
> Rich S. wrote:[color=green]
>> I'm reading through records in a file creating bitset objects from them.
>> Each bitset object has a score and I'm trying to collect just the top
>> 2000
>> scores from over a million records.[/color]
>
> A more efficient way than what you describe would be
> to use priority_queue. Define the predicate in such
> a way that the top() always returns the lowest scoring
> object.
>
> Let q be the priority_queue and suppose you want to
> find the top N scores (where N is 2000 in your example),
> the pseudo-code would be something like this:
>
> q.reserve(N);
>
> // add the first N elements to the priority queue
> for (i = 0; i < N; ++i)
> {
> if (!ReadRecord(&obj))
> break;
>
> q.push(obj);
> }
>
> // invariant: q contains highest N elements
> // q.top() is the lowest of the highest N elements
> while (ReadRecord(&obj))
> {
> if (obj > q.top())
> {
> q.pop();
> q.push(obj);
> }
> }
>[/color]


I tried it and it worked but the only down side is that when I go to dump
the 2000 they're are in ascending order and I wish them to be in descending
order. There must be a way to dump the p queue into a vector quickly and
sort that and then dump it.

Thanks, the logic is much cleaner now.


Rich S.
Guest
 
Posts: n/a
#5: Sep 22 '05

re: Logic check please



"Greg" <greghe@pacbell.net> wrote in message
news:1127356164.673892.113320@z14g2000cwz.googlegr oups.com...[color=blue]
> Rich S. wrote:[color=green]
>> I'm reading through records in a file creating bitset objects from them.
>> Each bitset object has a score and I'm trying to collect just the top
>> 2000
>> scores from over a million records.
>>
>> I tihnk the best way to go is to create a vector, add the objects to it
>> in
>> ascending order while keeping track of the size and lowest score. If the
>> size reaches the 2000 mark then evaluate the score and if it's lower than
>> the lowest score, discard the object otherwise traverse the vector and
>> insert it at its appropriate space and finally chop off the beginning of
>> the
>> vector so that we're still in the 2000 range.
>>
>> When I say best way I mean the way I thought of given my somewhat limited
>> experience with stl containers. I was also thinking about determining if
>> the
>> object belongs in the 2000 set based upon it's score and then tossing it
>> in
>> the end of the vector and doing a vector sort but I figured I'd skip the
>> sort and insert it myself at the proper position.
>>
>> I didn't see anything that sets a vectors size as fixed and only inserts
>> if
>> above this range type of thing but I guess that's where you override the
>> allocater? I haven't been down that road but nows a good time to jump in
>> I
>> guess.
>>
>> Thanks
>>
>> This convoluted method below if you care to read and process is what I
>> came
>> up with.
>>
>> iSize = m_BitSetVector.size();
>>
>> // if the vector is empty then just add it and collect the score
>> if(iSize == 0)
>> {
>> iLowestTotal = currentSet.GetScore();
>> m_BitSetVector.push_back(currentSet);
>> continue;
>> }
>>
>>
>> bool bAtFullCapacity = false;
>> if(m_BitSetVector.size() > m_iTotalScoresToKeep)
>> {
>> bAtFullCapacity = true;
>>
>> // if the score is less than the lowest in the vector and
>> // we're at full capacity then continue
>> if(currentSet.GetScore() < m_BitSetVector.begin()->GetScore())
>> continue;
>> }
>>
>> BitSetVector::iterator p;
>> bool bInserted = false;
>>
>> for(p = m_BitSetVector.begin(); p != m_BitSetVector.end(); ++p)
>> {
>> if(currentSet.GetScore() < p->GetScore())
>> {
>> // insert at this location
>> m_BitSetVector.insert(p, currentSet);
>> bInserted = true;
>>
>> if(m_BitSetVector.size() > m_iTotalScoresToKeep)
>> bAtFullCapacity = true;
>>
>> // now chop off the end
>> if(bAtFullCapacity)
>> {
>> m_BitSetVector.erase(m_BitSetVector.begin());
>> iLowestTotal = m_BitSetVector.begin()->GetScore();
>> }
>>
>> break;
>> }
>> }
>>
>> // if it's not inserted yet then
>> if(!bInserted)
>> {
>> m_BitSetVector.push_back(currentSet);
>>
>> if(m_BitSetVector.size() > m_iTotalScoresToKeep)
>> bAtFullCapacity = true;
>>
>> // now chop off the end
>> if(bAtFullCapacity)
>> {
>> m_BitSetVector.erase(m_BitSetVector.begin());
>> iLowestTotal = m_BitSetVector.begin()->GetScore();
>> }
>> }[/color]
>
> Just call std::nth_element() to group the 2,000 highest items together
> at the end of their container.
>
> Here's a program to illustrate:
>
> #include <vector>
> #include <algorithm>
> #include <iostream>
>
> int main()
> {
> const int kTop2K = 2000;
>
> std::vector<int> numberV;
>
> // add numbers 1 - 1,000,000
> for (int i=0; i < 10000000; i++)
> numberV.push_back( i);
>
> // shuffle the vector's contents
> std::random_shuffle( numberV.begin(), numberV.end());
>
> // visually confirm items are in random order
> std::cout << "Last 2,000 elements (shuffled) " << std::endl;
> for ( int i= numberV.size() - kTop2K; i < numberV.size() ; i++)
> std::cout << numberV[i] << " ";
> std::cout << std::endl << std::endl;
>
> assert( numberV.size() >= kTop2K);
>
> // use nth_element with the nth element being the 2,000th
> // position from the end of the container
>
> std::nth_element(numberV.begin(),
> numberV.end()-kTop2K,
> numberV.end());
>
> // the top 2,000 items have been moved after the nth element
> // all lower ranked items are are now before nth element.
> // Note that neither range has been completely sorted
>
> std::cout << "Last 2,000 elements (final): " << std::endl;
>
> // visually inspect results.
> // all numbers listed should be > 998,000
>
> for (int i=numberV.size() - kTop2K; i < numberV.size() ; i++)
> std::cout << numberV[i] << " ";
> }
>
> Greg
>[/color]

That's good and I'll definately keep that snippet for later use but I don't
think I want to store all of the objects in memory at the same time because
I'll quickly run out of memory.



Karl Heinz Buchegger
Guest
 
Posts: n/a
#6: Sep 22 '05

re: Logic check please


"Rich S." wrote:[color=blue]
>
> I tried it and it worked but the only down side is that when I go to dump
> the 2000 they're are in ascending order and I wish them to be in descending
> order. There must be a way to dump the p queue into a vector quickly[/color]

A loop ?

std::vector<T> Result;
Result.reserve( N );

while( !q.empty() )
Result.push_back( q.top() );
q.pop();
}
[color=blue]
> and
> sort that[/color]

No need to sort it.
The items poped from the queue already are sorted. Just iterate
through the vector in reverse order.
[color=blue]
> and then dump it.
>[/color]

If you don't want to create that vector, you could eg. use
a recursive function to flip the order:

void Dump( priority_queue< T, std::vector<int>, MyLess<int> > q )
{
if( q.empty() )
return;

T Value = q.top();
q.pop();

Dump( q );
cout << Value << endl;
}

but I doubt, that this is faster, clearer or uses less memory space then
the vector.

--
Karl Heinz Buchegger
kbuchegg@gascad.at
Closed Thread