? about priority_queue how make objects unique

Rich S.

Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.

Oct 5 '05 #1

Subscribe Post Reply

3811

mlimber

Rich S. wrote:

Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.

Are you getting duplicates because there are duplicates in your file(s)
or because there is an error in the program? (If the former, you might
consider using a std::unique, perhaps in conjunction with a
std::vector, to find the first N unique records. See the docs here:
http://www.sgi.com/tech/stl/unique.html .)

Cheers! --M

Oct 5 '05 #2

Kai-Uwe Bux

Rich S. wrote:

Hi

In an earlier posting I was asking how to read thru millions of data
records keeping the top 2000 (where the top values are based upon a field
in the record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT
greater than the top and then pushing obj but I'm convinced thats the best
way.

If you care about uniqueness, then maybe std::set is the way to go. Consider
someting like this:
#include <set>

template < typename T, unsigned long N >
struct top_N {

std::set<T> elements;

void insert ( T const & t ) {
elements.insert( t );
if ( elements.size() > N ) {
elements.erase( elements.begin() );
}
}

}; // struct top_N
Best

Kai-Uwe Bux

Oct 5 '05 #3

Kristo

Rich S. wrote:

Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so
[snip old message]
This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.
What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.
I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

That's too much work IMHO, and probably inefficient. Just let the set
do that for you.

Kristo

Oct 5 '05 #4

Mark P

Rich S. wrote:

Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.

I agree with previous replies that a std::set is a good approach (just
make sure to check the return value of insert to see if an item was
actually inserted, if you want to ensure always 2000 records). Another
option is a hashtable to record which elts. are currently in the queue.

Oct 6 '05 #5

niklasb

Kristo wrote:

Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so

[snip old message]
This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.

I believe the underlying container for priority_queue must be a
sequence container rather than an associative container. However,
you could use set _instead_ of priority_queue. The algorithm
doesn't change very much.

As before, this is just off the top of my head. I haven't compiled
it, much lest tested it, so use at your own risk and all that.

std::set<MyType> highest;
MyType obj;
while (ReadRecord(&obj))
{
// Is obj one of the highest so far?
if (highest.count() < count || obj > *highest.begin())
{
// Insert it into the set; this has no effect if the set
// already contains an identical value.
highest.insert(obj);

// If too many elements discard the lowest one.
if (highest.count() > count)
{
highest.erase(highest.begin());
}
}
}

Oct 6 '05 #6

Rich S.

"mlimber" <ml*****@gmail.com> wrote in message
news:11**********************@g47g2000cwa.googlegr oups.com...

Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data
records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT
greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.

Are you getting duplicates because there are duplicates in your file(s)
or because there is an error in the program? (If the former, you might
consider using a std::unique, perhaps in conjunction with a
std::vector, to find the first N unique records. See the docs here:
http://www.sgi.com/tech/stl/unique.html .)

Cheers! --M

yes duplicates in my data which I expected. I switched to a set based
solution and it works good so far.

Thanks for the tip about unique that'll be handy one day.

Oct 6 '05 #7

Rich S.

<ni*****@microsoft.com> wrote in message
news:11*********************@f14g2000cwb.googlegro ups.com...

Kristo wrote:
Rich S. wrote:
> Hi
>
> In an earlier posting I was asking how to read thru millions of data
> records
> keeping the top 2000 (where the top values are based upon a field in
> the
> record) and niklasb suggested using a priority_queue like so

[snip old message]
> This is working well for me but I'm getting cases where I have
> duplicate
> records in the top 2000 and I'm not quite sure how to fix that problem.

What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.

I believe the underlying container for priority_queue must be a
sequence container rather than an associative container. However,
you could use set _instead_ of priority_queue. The algorithm
doesn't change very much.

As before, this is just off the top of my head. I haven't compiled
it, much lest tested it, so use at your own risk and all that.

std::set<MyType> highest;
MyType obj;
while (ReadRecord(&obj))
{
// Is obj one of the highest so far?
if (highest.count() < count || obj > *highest.begin())
{
// Insert it into the set; this has no effect if the set
// already contains an identical value.
highest.insert(obj);

// If too many elements discard the lowest one.
if (highest.count() > count)
{
highest.erase(highest.begin());
}
}
}

I switched to a set based solution and it works pretty good.

Thanks

Oct 6 '05 #8

Rich S.

"Mark P" <us****@fall2005REMOVE.fastmailCAPS.fm> wrote in message
news:_l*****************@newssvr27.news.prodigy.ne t...

Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data
records keeping the top 2000 (where the top values are based upon a field
in the record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT
greater than the top and then pushing obj but I'm convinced thats the
best way.

Thanks for any help.

Regards.

I agree with previous replies that a std::set is a good approach (just
make sure to check the return value of insert to see if an item was
actually inserted, if you want to ensure always 2000 records). Another
option is a hashtable to record which elts. are currently in the queue.

I switched to a set but was thinking about a map solution for speed.

Oct 6 '05 #9

Rich S.

"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:di**********@murdoch.acc.Virginia.EDU...

Rich S. wrote:
Hi

In an earlier posting I was asking how to read thru millions of data
records keeping the top 2000 (where the top values are based upon a field
in the record) and niklasb suggested using a priority_queue like so
------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT
greater than the top and then pushing obj but I'm convinced thats the
best
way.

If you care about uniqueness, then maybe std::set is the way to go.
Consider
someting like this:
#include <set>

template < typename T, unsigned long N >
struct top_N {

std::set<T> elements;

void insert ( T const & t ) {
elements.insert( t );
if ( elements.size() > N ) {
elements.erase( elements.begin() );
}
}

}; // struct top_N
Best

Kai-Uwe Bux

Thanks, I did switch to a set solution and it works well.

Oct 6 '05 #10

Similar topics

priority_queue<> underlying container

by: Dave | last post by:

Hello all, I'm pondering why the default underlying container for std::priority_queue<> is std::vector<>. It would seem that inserts are liable to happen anywhere, which would make std::list<>...

C / C++

Using priority_queue

by: Aguilar, James | last post by:

Is there any way to use a priority_queue with pointers? The reason why I ask is that I want a data structure that contains pointers to objects, not copies of objects. Also, I want them to be...

C / C++

Q: STL priority_queue with boost shared_ptr

by: zl2k | last post by:

hi, all Here is what I want to do: to wrap my self defined class in a shared_ptr and then insert it into the priority_queue. The priority_queue should pump the least element of the self defined...

C / C++

multi-threaded priority_queue in C++?

by: robin.chauhan | last post by:

I'm stuck while trying to make this C++ app multi-threaded! Any advice would be appreciated. Basically I have a very big graph data structure in memory. And I want to run an algorithm on it,...

C / C++

Printing priority_queue without emptying it?

by: Eric Lilja | last post by:

Hello, I have a the following priority_queue: priority_queue<pair<int, string pq; AFAICT, priority_queues doesn't support iterators. My question: is there a way to print its contents without...

C / C++

priority_queue cannot update element value

by: thomas | last post by:

I want to use a priority_queue like STL data structure. But I found that priority_queue cannot update element value: only pop/ push is supported. I'm using priority_queue to implement the prim...

C / C++

priority_queue predicate

by: thomas | last post by:

priority_queue usually uses the greater<intpredicate function. But as you know, we don't always use priority_queue<int>. Actually we may need the "priority_queue<pair<int,int>,...

C / C++

priority_queue help

by: Joe, G.I. | last post by:

Can anyone help me w/ a priority_queue. I'm generating MyEvent classes and I put them on a priority_queue, but I don't know how to get them in priority. The priority is the event w/ the smallest...

C / C++

Is a priority_queue dynamic?

by: card | last post by:

I was wondering if anyone has a definitive answer on whether the STL priority_queue is dynamic? What I mean by this is best illustrated by a simple example. Suppose I have a priority queue of class...

C / C++

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA