
Postgresql vs. aggregates

jao
I have an application with a table that tracks objects with a "size"
attribute. What we want to do is to periodically report on the number
of these objects and the sum of the object sizes. The table will
typically have 1-5 million rows.

I know that the obvious "select count(*), sum(size) ..." is not a good
idea, and we have seen the impact such a query has on performance.

Then we tried creating triggers to maintain a table with object_count
and sum_size columns:

create table summary(object_count bigint, sum_size bigint)

We populate the table with exactly one row, initialized to (0, 0).
A trigger runs an update like this:

update summary
set object_count = object_count + 1,
sum_size = sum_size + new.size
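
For reference, the whole trigger amounts to this (a sketch; the
tracked table is called "objects" here, and the dollar-quoted
function body assumes a reasonably recent server):

create or replace function bump_summary() returns trigger as $$
begin
    -- fold the new object into the single summary row
    update summary
    set object_count = object_count + 1,
        sum_size = sum_size + new.size;
    return new;
end;
$$ language plpgsql;

create trigger objects_summary
after insert on objects
for each row execute procedure bump_summary();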

The problem is that our application has periods where objects are
being created at a rapid rate. Updates to the summary table slow down
over time, and the slowdown is accompanied by very low CPU idle% as
reported by vmstat.

I found that if I have a thread which vacuums the summary table every
few seconds (e.g. 30), then update times stay reasonable. We're
starting to reconsider this approach, wondering if the vacuuming or
the buildup of row versions (in those 30 seconds) is behind some
variability in performance that we're seeing. To that end, we are
measuring the impact of more frequent and less frequent
vacuuming. I'll skip the details here since this is not the main point
of this email.

Another idea we had is to maintain object_count and sum_size in
sequences. The trigger would use currval and setval to maintain the
values. The problem with this is that when we try to examine the
sequence on another connection, we can't use currval before bumping
the value with nextval, which would interfere with maintenance of the
totals. (The error message for a sequence s is "ERROR: s.currval
is not yet defined in this session".)
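
To make the dead end concrete, the sequence variant would look
something like this (a sketch; the sequence name is invented):

create sequence sum_size_seq minvalue 0;
select setval('sum_size_seq', 0);

-- in the trigger, bump the running total:
select setval('sum_size_seq', currval('sum_size_seq') + new.size);

-- but a second connection that only wants to read the total gets:
select currval('sum_size_seq');
ERROR:  sum_size_seq.currval is not yet defined in this session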

Yet another idea is to write C functions which maintain counts in
shared memory, simulating what we can't quite do with sequences.

I understand why asking for aggregate values computed over rapidly
changing data sets is not a completely meaningful thing to do. Yet we
have the requirement. Users of our product need to be able to look at
object_count and sum_size to get a sense of how an object load is
proceeding.

This shouldn't be so difficult. I've seen, on this mailing list,
the advice "don't do that, use a trigger" when someone wants to do
select count(*). But I haven't seen any discussion of the consequences
of using a trigger.

Can anyone recommend a good way to maintain aggregates using a
trigger? Is vacuuming every few seconds really the only way to go?

Jack Orenstein

Nov 23 '05 #1
On Wed, 2004-06-09 at 16:32, ja*@geophile.com wrote:
> I have an application with a table that tracks objects with a "size"
> attribute. What we want to do is to periodically report on the number
> of these objects and the sum of the object sizes. The table will
> typically have 1-5 million rows.
>
> I know that the obvious "select count(*), sum(size) ..." is not a good
> idea, and we have seen the impact such a query has on performance.
>
> Then we tried creating triggers to maintain a table with object_count
> and sum_size columns:
>
> create table summary(object_count bigint, sum_size bigint)
>
> We populate the table with exactly one row, initialized to (0, 0).
> A trigger runs an update like this:
>
> update summary
> set object_count = object_count + 1,
> sum_size = sum_size + new.size
>
> The problem is that our application has periods where objects are
> being created at a rapid rate. Updates to the summary table slow down
> over time, and the slowdown is accompanied by very low CPU idle% as
> reported by vmstat.


I'm wondering if these data are time sensitive, and if so, how
sensitive. If they could be up to, say, a minute old, a materialized
view of some sort might work well here. Or a temp table could hold
all the incoming data until it needs to be dumped into the real
table.

Just a thought, might not work for you.

Also, do you have a lot of parallel updates to the same rows, or are
they all different rows? If you're updating the same rows over and
over, any database is going to be somewhat slow, whether from
vacuuming, as in PostgreSQL, or from lock contention, as in
row-level-locking databases.

If they're all different, then this should be no different for
PostgreSQL than for a row-locking database, since each tuple will
only have two or three versions, instead of a smaller fraction of
rows having hundreds of versions.

Just wondering.

Nov 23 '05 #2
jao
Quoting Scott Marlowe <sm******@qwest.net>:
> On Wed, 2004-06-09 at 16:32, ja*@geophile.com wrote:
> > Then we tried creating triggers to maintain a table with object_count
> > and sum_size columns:
> >
> > create table summary(object_count bigint, sum_size bigint)
> >
> > We populate the table with exactly one row, initialized to (0, 0).
> > A trigger runs an update like this:
> >
> > update summary
> > set object_count = object_count + 1,
> > sum_size = sum_size + new.size
> >
> > The problem is that our application has periods where objects are
> > being created at a rapid rate. Updates to the summary table slow down
> > over time, and the slowdown is accompanied by very low CPU idle% as
> > reported by vmstat.
>
> I'm wondering if these data are time sensitive, and if so, how
> sensitive. If they could be up to, say, a minute old, a materialized
> view of some sort might work well here. Or a temp table could hold
> all the incoming data until it needs to be dumped into the real
> table.


So once a minute I'd run a stored procedure to compute statistics
and copy rows to the main table? That might be doable, but then
queries would have to go through a view combining the main table and
the temp table.

Can you expand on the use of materialized views to solve this?

> Also, do you have a lot of parallel updates to the same rows, or are
> they all different rows? If you're updating the same rows over and
> over, any database is going to be somewhat slow, whether from
> vacuuming, as in PostgreSQL, or from lock contention, as in
> row-level-locking databases.
>
> If they're all different, then this should be no different for
> PostgreSQL than for a row-locking database, since each tuple will
> only have two or three versions, instead of a smaller fraction of
> rows having hundreds of versions.


The summary table has exactly one row, and whenever an object
is added to the main table, this one summary row is updated. An
odd feature of my application is that there is only one object
creation going on at a time. The summary row would not be a concurrency
hotspot in a non-MVCC database system.

But that raises an interesting idea. Suppose that instead of one
summary row, I had, let's say, 1000. When my application creates
an object, I choose one summary row at random (or round-robin) and update
it. So now, instead of one row with many versions, I have 1000 with 1000x
fewer versions each. When I want object counts and sizes, I'd sum up across
the 1000 summary rows. Would that allow me to maintain performance
for summary updates with less frequent vacuuming?
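
Concretely, I'm picturing something like this (a sketch; 1000 slots,
with the slot chosen randomly, and generate_series standing in for a
1000-row initialization script):

create table summary(slot int primary key,
                     object_count bigint, sum_size bigint);
insert into summary select i, 0, 0 from generate_series(0, 999) i;

-- trigger body: spread the updates across the slots
update summary
set object_count = object_count + 1,
    sum_size = sum_size + new.size
where slot = (random() * 999)::int;

-- readers roll the slots back up
select sum(object_count), sum(sum_size) from summary;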

I'd find all this much easier to reason about if I understood how
the versions of a row are organized and accessed. How does PostgreSQL
locate the correct version of a row?

Jack Orenstein


Nov 23 '05 #4
ja*@geophile.com writes:
> I'd find all this much easier to reason about if I understood how
> the versions of a row are organized and accessed. How does PostgreSQL
> locate the correct version of a row?


It doesn't, particularly. A seqscan will of course visit all the
versions of a row, and an indexscan will visit all the versions
matching the indexscan key-column conditions. It's up to the "time
qualification" tests (tqual.c) to accept only the version that is
visible to your transaction.

There are two or three implementation choices in the btree index
routines that are intended to increase the chances that you'll hit the
version you want sooner rather than later. But they're only heuristics.
The bottom line is that we check all the versions till we find the
right one.
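
(You can see a bit of this from SQL, incidentally: every row carries
system columns recording which transactions created and deleted that
version, e.g.

select xmin, xmax, object_count, sum_size from summary;

and any version your snapshot can't see is simply never returned.)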

regards, tom lane


Nov 23 '05 #6
ja*@geophile.com wrote:
> But that raises an interesting idea. Suppose that instead of one
> summary row, I had, let's say, 1000. When my application creates
> an object, I choose one summary row at random (or round-robin) and update
> it. So now, instead of one row with many versions, I have 1000 with 1000x
> fewer versions each. When I want object counts and sizes, I'd sum up across
> the 1000 summary rows. Would that allow me to maintain performance
> for summary updates with less frequent vacuuming?


Perhaps the simplest approach might be to define the summary table as
containing a SERIAL and your count.
Every time you add another object, insert (nextval(...), 1).
Every 10s, summarise the table (i.e. replace 10 rows all "scored" 1
with 1 row scored 10).
Use sum() over the much smaller table to find your total.
Vacuum regularly.
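
Something like this, say (a sketch; the names are invented, the
dollar-quoted body assumes a reasonably recent server, and the
collapse should run in a serializable transaction so rows committing
mid-collapse can't be deleted before they're summed):

create table size_log (
    id bigserial primary key,
    object_count bigint,
    sum_size bigint
);

-- per new object, from the trigger: an append, no hot row
insert into size_log (object_count, sum_size) values (1, new.size);

-- run every ~10s: fold everything up to the current high id into
-- one replacement row; later rows wait for the next pass
create or replace function collapse_size_log() returns void as $$
declare
    m bigint;
begin
    select max(id) into m from size_log;
    if m is null then
        return;
    end if;
    insert into size_log (object_count, sum_size)
        select sum(object_count), sum(sum_size)
        from size_log where id <= m;
    delete from size_log where id <= m;
end;
$$ language plpgsql;

-- readers:
select sum(object_count), sum(sum_size) from size_log;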
--
Richard Huxton
Archonet Ltd


Nov 23 '05 #8
Hi,

----- Original Message -----
From: "Richard Huxton" <de*@archonet.com>
To: <ja*@geophile.com>
Cc: "Scott Marlowe" <sm******@qwest.net>; <pg***********@postgresql.org>
Sent: Thursday, June 10, 2004 8:03 AM
Subject: Re: [GENERAL] Postgresql vs. aggregates

> ja*@geophile.com wrote:
> > But that raises an interesting idea. Suppose that instead of one
> > summary row, I had, let's say, 1000. When my application creates
> > an object, I choose one summary row at random (or round-robin) and
> > update it. So now, instead of one row with many versions, I have
> > 1000 with 1000x fewer versions each. When I want object counts and
> > sizes, I'd sum up across the 1000 summary rows. Would that allow me
> > to maintain performance for summary updates with less frequent
> > vacuuming?
>
> Perhaps the simplest approach might be to define the summary table as
> containing a SERIAL and your count.
> Every time you add another object, insert (nextval(...), 1).
> Every 10s, summarise the table (i.e. replace 10 rows all "scored" 1
> with 1 row scored 10).
> Use sum() over the much smaller table to find your total.
> Vacuum regularly.


Something along these lines, except using a SUM instead of a COUNT:

http://archives.postgresql.org/pgsql...1/msg00059.php
Nick



Nov 23 '05 #10
> I understand why asking for aggregate values computed over rapidly
> changing data sets is not a completely meaningful thing to do. Yet we
> have the requirement. Users of our product need to be able to look at
> object_count and sum_size to get a sense of how an object load is
> proceeding.

If the customer just needs to know how things are *proceeding*,
won't it suffice to look at reasonable estimates?
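
The planner's statistics are one source of cheap estimates (a sketch;
"objects" stands in for the real table, and the number is only as
fresh as the last VACUUM or ANALYZE):

select reltuples::bigint as approx_object_count
from pg_class
where relname = 'objects';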

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346


Nov 23 '05 #12
