473,408 Members | 1,730 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,408 software developers and data experts.

? about priority_queue how make objects unique

Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.
Oct 5 '05 #1
9 3811
Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.


Are you getting duplicates because there are duplicates in your file(s)
or because there is an error in the program? (If the former, you might
consider using a std::unique, perhaps in conjunction with a
std::vector, to find the first N unique records. See the docs here:
http://www.sgi.com/tech/stl/unique.html .)

Cheers! --M

Oct 5 '05 #2
Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data
records keeping the top 2000 (where the top values are based upon a field
in the record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT
greater than the top and then pushing obj but I'm convinced thats the best
way.


If you care about uniqueness, then maybe std::set is the way to go. Consider
someting like this:
#include <set>

template < typename T, unsigned long N >
struct top_N {

std::set<T> elements;

void insert ( T const & t ) {
elements.insert( t );
if ( elements.size() > N ) {
elements.erase( elements.begin() );
}
}

}; // struct top_N
Best

Kai-Uwe Bux
Oct 5 '05 #3
Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so
[snip old message]
This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.
What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.
I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.


That's too much work IMHO, and probably inefficient. Just let the set
do that for you.

Kristo

Oct 5 '05 #4
Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.


I agree with previous replies that a std::set is a good approach (just
make sure to check the return value of insert to see if an item was
actually inserted, if you want to ensure always 2000 records). Another
option is a hashtable to record which elts. are currently in the queue.
Oct 6 '05 #5
Kristo wrote:
Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so


[snip old message]
This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.


What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.


I believe the underlying container for priority_queue must be a
sequence container rather than an associative container. However,
you could use set _instead_ of priority_queue. The algorithm
doesn't change very much.

As before, this is just off the top of my head. I haven't compiled
it, much lest tested it, so use at your own risk and all that.

std::set<MyType> highest;
MyType obj;
while (ReadRecord(&obj))
{
// Is obj one of the highest so far?
if (highest.count() < count || obj > *highest.begin())
{
// Insert it into the set; this has no effect if the set
// already contains an identical value.
highest.insert(obj);

// If too many elements discard the lowest one.
if (highest.count() > count)
{
highest.erase(highest.begin());
}
}
}

Oct 6 '05 #6

"mlimber" <ml*****@gmail.com> wrote in message
news:11**********************@g47g2000cwa.googlegr oups.com...
Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data
records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT
greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.


Are you getting duplicates because there are duplicates in your file(s)
or because there is an error in the program? (If the former, you might
consider using a std::unique, perhaps in conjunction with a
std::vector, to find the first N unique records. See the docs here:
http://www.sgi.com/tech/stl/unique.html .)

Cheers! --M

yes duplicates in my data which I expected. I switched to a set based
solution and it works good so far.

Thanks for the tip about unique that'll be handy one day.
Oct 6 '05 #7

<ni*****@microsoft.com> wrote in message
news:11*********************@f14g2000cwb.googlegro ups.com...
Kristo wrote:
Rich S. wrote:
> Hi
>
> In an earlier posting I was asking how to read thru millions of data
> records
> keeping the top 2000 (where the top values are based upon a field in
> the
> record) and niklasb suggested using a priority_queue like so


[snip old message]
> This is working well for me but I'm getting cases where I have
> duplicate
> records in the top 2000 and I'm not quite sure how to fix that problem.


What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.


I believe the underlying container for priority_queue must be a
sequence container rather than an associative container. However,
you could use set _instead_ of priority_queue. The algorithm
doesn't change very much.

As before, this is just off the top of my head. I haven't compiled
it, much lest tested it, so use at your own risk and all that.

std::set<MyType> highest;
MyType obj;
while (ReadRecord(&obj))
{
// Is obj one of the highest so far?
if (highest.count() < count || obj > *highest.begin())
{
// Insert it into the set; this has no effect if the set
// already contains an identical value.
highest.insert(obj);

// If too many elements discard the lowest one.
if (highest.count() > count)
{
highest.erase(highest.begin());
}
}
}


I switched to a set based solution and it works pretty good.

Thanks
Oct 6 '05 #8

"Mark P" <us****@fall2005REMOVE.fastmailCAPS.fm> wrote in message
news:_l*****************@newssvr27.news.prodigy.ne t...
Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data
records keeping the top 2000 (where the top values are based upon a field
in the record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT
greater than the top and then pushing obj but I'm convinced thats the
best way.

Thanks for any help.

Regards.


I agree with previous replies that a std::set is a good approach (just
make sure to check the return value of insert to see if an item was
actually inserted, if you want to ensure always 2000 records). Another
option is a hashtable to record which elts. are currently in the queue.


I switched to a set but was thinking about a map solution for speed.
Oct 6 '05 #9

"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:di**********@murdoch.acc.Virginia.EDU...
Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data
records keeping the top 2000 (where the top values are based upon a field
in the record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT
greater than the top and then pushing obj but I'm convinced thats the
best
way.


If you care about uniqueness, then maybe std::set is the way to go.
Consider
someting like this:
#include <set>

template < typename T, unsigned long N >
struct top_N {

std::set<T> elements;

void insert ( T const & t ) {
elements.insert( t );
if ( elements.size() > N ) {
elements.erase( elements.begin() );
}
}

}; // struct top_N
Best

Kai-Uwe Bux


Thanks, I did switch to a set solution and it works well.
Oct 6 '05 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

7
by: Dave | last post by:
Hello all, I'm pondering why the default underlying container for std::priority_queue<> is std::vector<>. It would seem that inserts are liable to happen anywhere, which would make std::list<>...
11
by: Aguilar, James | last post by:
Is there any way to use a priority_queue with pointers? The reason why I ask is that I want a data structure that contains pointers to objects, not copies of objects. Also, I want them to be...
3
by: zl2k | last post by:
hi, all Here is what I want to do: to wrap my self defined class in a shared_ptr and then insert it into the priority_queue. The priority_queue should pump the least element of the self defined...
2
by: robin.chauhan | last post by:
I'm stuck while trying to make this C++ app multi-threaded! Any advice would be appreciated. Basically I have a very big graph data structure in memory. And I want to run an algorithm on it,...
6
by: Eric Lilja | last post by:
Hello, I have a the following priority_queue: priority_queue<pair<int, string pq; AFAICT, priority_queues doesn't support iterators. My question: is there a way to print its contents without...
3
by: thomas | last post by:
I want to use a priority_queue like STL data structure. But I found that priority_queue cannot update element value: only pop/ push is supported. I'm using priority_queue to implement the prim...
8
by: thomas | last post by:
priority_queue usually uses the greater<intpredicate function. But as you know, we don't always use priority_queue<int>. Actually we may need the "priority_queue<pair<int,int>,...
24
by: Joe, G.I. | last post by:
Can anyone help me w/ a priority_queue. I'm generating MyEvent classes and I put them on a priority_queue, but I don't know how to get them in priority. The priority is the event w/ the smallest...
5
card
by: card | last post by:
I was wondering if anyone has a definitive answer on whether the STL priority_queue is dynamic? What I mean by this is best illustrated by a simple example. Suppose I have a priority queue of class...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.