Bytes IT Community

Help with query: indexes on timestamps

Ok, I've tried a number of things here and I know I'm missing something, but at
this point my head is spinning (lack of sleep, too much coffee, etc...)

My environment is PG 7.4.3 on Linux with 512 MB of RAM and swap. This was just
upgraded from 7.4 (just to make sure I'm current). Some of my settings in
postgresql.conf are giving fatal errors, but I don't think that issue is related
to my query problems. I also have a laptop running with the same basic specs (no
RAID, slower processor).

I use a recent pgadmin-III as my client.

We're also running this query in MS-SQL.

I have a table with 1 million records in it. Here is the definition:

CREATE TABLE report
(
match int4,
action varchar(16),
stamp timestamptz,
account varchar(32),
ipaddress inet,
profile varchar(16),
rating text,
url text
)
WITHOUT OIDS;

There is one index:

CREATE INDEX stamp_idx
ON report
USING btree
(stamp);

The query I'm running is:

SELECT date_part('hour'::text, report.stamp) AS "hour", count(*) AS count
FROM report
GROUP BY date_part('hour'::text, report.stamp)
ORDER BY date_part('hour'::text, report.stamp);

Here is the plan I get:

QUERY PLAN
----------------------------------------------------------------------------
Sort (cost=47420.64..47421.14 rows=200 width=8)
Sort Key: date_part('hour'::text, stamp)
-> HashAggregate (cost=47412.00..47413.00 rows=200 width=8)
-> Seq Scan on report (cost=0.00..42412.00 rows=1000000 width=8)
(4 rows)
Now, from what I understand, the index I created would not be used, since every
row has to be visited to compute the date_part. The query under 7.4 ran in
about 8 seconds. In 7.4.3, it's taking 37 seconds for the same plan (which is
fine, since the system isn't tuned yet). On my laptop it's taking 6 seconds.
MS-SQL is taking 8 seconds. These runs are after I do vacuum full, vacuum
analyze, and reindex on the database and table respectively.

My question: how can I get this query to use an index built on the date_part
function? On the MS-SQL side, creating a computed column with the date part and
then an index on that column brought the query down to 2 seconds.

I tried creating this function:

CREATE OR REPLACE FUNCTION whathour(timestamptz)
RETURNS int4 AS
'begin
return date_part(\'hour\',$1);
end;'
LANGUAGE 'plpgsql' IMMUTABLE;

and then an index:

CREATE INDEX hour_idx
ON report
USING btree
(stamp)
WHERE whathour(stamp) >= 0 AND whathour(stamp) <= 23;

but I get the same plan, which makes sense to me because I'm again inspecting
quite a few rows. I'm sure I'm missing something...

I couldn't see from the docs how to make a column equal a function (like
MS-SQL's computed column), but it seems to me that I should not have to do
something like that, since it really is wasting space in the table. I'm hoping a
partial index or a functional index will solve this and be just as efficient.
However, the computed-column method **does** work. Is there a better way?
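For what it's worth, PostgreSQL 7.4 supports indexes on arbitrary expressions, so one could index the result of the whathour() function directly rather than using a partial index (index name below is illustrative). Note, though, that an expression index only helps when a WHERE clause restricts which rows are fetched; a GROUP BY over the entire table still has to read every row:

```sql
-- Expression index on the function's result (double parentheses required):
CREATE INDEX report_hour_idx ON report ((whathour(stamp)));

-- The planner can use it when a WHERE clause matches the expression,
-- e.g. counting rows for a single hour:
SELECT count(*) FROM report WHERE whathour(stamp) = 12;

-- But the original GROUP BY over all one million rows still visits
-- every row, so it will still choose a sequential scan.
```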

Thanks to all in advance.

--
Keith C. Perry, MS E.E.
Director of Networks & Applications
VCSN, Inc.
http://vcsn.com

____________________________________
This email account is being hosted by:
VCSN, Inc : http://vcsn.com

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 23 '05 #1
2 Replies


Keith C. Perry wrote:

I have a table with 1 million records in it. Here is the definition:

CREATE TABLE report
(
match int4,
action varchar(16),
stamp timestamptz,
account varchar(32),
ipaddress inet,
profile varchar(16),
rating text,
url text
)
WITHOUT OIDS;

There is one index:

CREATE INDEX stamp_idx
ON report
USING btree
(stamp);

The query I'm running is:

SELECT date_part('hour'::text, report.stamp) AS "hour", count(*) AS count
FROM report
GROUP BY date_part('hour'::text, report.stamp)
ORDER BY date_part('hour'::text, report.stamp);


You will always get a sequential scan with this query - there is no
other way to count the rows.

With PostgreSQL being MVCC based, you can't know whether a row is
visible to you without checking it - visiting the index won't help. Even
if it could, you'd still have to visit every row in the index.

Assuming the table is a log, with always increasing timestamps, I'd
create a summary table and query that.
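A summary table along these lines could look like the following sketch (table and column names are illustrative, not from the thread). Since the log only ever grows, the table can be refreshed periodically rather than on every query:

```sql
-- Pre-aggregated counts, one row per hour of day:
CREATE TABLE report_hourly
(
  hour int4 PRIMARY KEY,  -- 0..23
  cnt  int8
);

-- Populate it once, or re-run after each monthly load:
INSERT INTO report_hourly
SELECT date_part('hour', stamp)::int4, count(*)
FROM report
GROUP BY 1;

-- The original 37-second query then becomes a 24-row lookup:
SELECT hour, cnt FROM report_hourly ORDER BY hour;
```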

--
Richard Huxton
Archonet Ltd


Nov 23 '05 #2

Quoting Richard Huxton <de*@archonet.com>:
Keith C. Perry wrote:

I have a table with 1 million records in it. Here is the definition:

CREATE TABLE report
(
match int4,
action varchar(16),
stamp timestamptz,
account varchar(32),
ipaddress inet,
profile varchar(16),
rating text,
url text
)
WITHOUT OIDS;

There is one index:

CREATE INDEX stamp_idx
ON report
USING btree
(stamp);

The query I'm running is:

SELECT date_part('hour'::text, report.stamp) AS "hour", count(*) AS count
FROM report
GROUP BY date_part('hour'::text, report.stamp)
ORDER BY date_part('hour'::text, report.stamp);


You will always get a sequential scan with this query - there is no
other way to count the rows.

With PostgreSQL being MVCC based, you can't know whether a row is
visible to you without checking it - visiting the index won't help. Even
if it could, you'd still have to visit every row in the index.

Assuming the table is a log, with always increasing timestamps, I'd
create a summary table and query that.


Yeah, actually it's a proxy server log; each month the database grows by 500k
records. I have two months loaded just to put some stress on the server. So
every month I'm loading the data so I can do some analysis. The optimization
question came up when one of the other database folks wanted to play with the
database in MS-SQL server.

How can I add a column that represents a function returning just the
date_part? I'm wondering if that will increase the speed of the query in a
similar fashion to what it did on MS-SQL.
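PostgreSQL (as of 7.4) has no computed columns, but a plain column kept in sync by a trigger gives the same effect. A sketch, with illustrative names (column, index, function, and trigger names are assumptions, not from the thread):

```sql
-- Add the column, backfill it, and index it:
ALTER TABLE report ADD COLUMN stamp_hour int4;
UPDATE report SET stamp_hour = date_part('hour', stamp);
CREATE INDEX report_stamp_hour_idx ON report (stamp_hour);

-- Keep it in sync on insert/update with a trigger
-- (single-quoted body with escaped quotes, as 7.4 requires):
CREATE OR REPLACE FUNCTION set_stamp_hour() RETURNS trigger AS
'begin
  NEW.stamp_hour := date_part(\'hour\', NEW.stamp);
  return NEW;
end;'
LANGUAGE 'plpgsql';

CREATE TRIGGER report_stamp_hour_trg
  BEFORE INSERT OR UPDATE ON report
  FOR EACH ROW EXECUTE PROCEDURE set_stamp_hour();
```

Note that grouping on stamp_hour avoids recomputing date_part for every row, which may account for part of the MS-SQL speedup, but the GROUP BY over the whole table will still be a sequential scan.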

I hadn't thought about the MVCC vs. file locking issue. The MS-SQL server does
not have any load on it, and I'm sure that if other users were hitting the same
table with the same query, PG would perform better.

--
Keith C. Perry, MS E.E.
Director of Networks & Applications
VCSN, Inc.
http://vcsn.com



Nov 23 '05 #3
