
MDC Design question

Michel Esber
DB2 LUW v8 FP15 Linux.

Consider the table T (ID varchar, FIELD_TIME timestamp, Field3
integer). There are approx. 1k different IDs.

Each minute, one application inserts rows for each ID (avg. 200, max
5k rows per ID). That makes about 200k rows per minute, and 12M rows
per hour.

As soon as at least 1h of data has been inserted for each ID, another
application summarizes the rows and then deletes that 'hour',
exploiting existing indexes (delete from T where ID=? and FIELD_TIME
between ? and ?).

Depending on the number of rows each 'hour' has, I break the delete
statement down into smaller ranges (5min/10min/20min/30min/60min
'chunks'), as in the example below.
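
For example, one 10-minute chunk for a single ID (the literal values
here are made up; the real statement uses parameter markers):

   delete from T
   where ID = 'ID0042'
     and FIELD_TIME between '2007-11-13-15.00.00.000000'
                        and '2007-11-13-15.10.00.000000';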

Given the above scenario, what is a good design for an MDC table?
Should I use ID as one dimension, and HOUR(FIELD_TIME) [generated
field] as the other? This table can be quite big.
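
In DDL terms, what I have in mind is roughly this (the column sizes
and the FIELD_HOUR name are placeholders):

   create table T (
      ID         varchar(32) not null,
      FIELD_TIME timestamp   not null,
      FIELD3     integer,
      FIELD_HOUR integer generated always as (HOUR(FIELD_TIME))
   ) organize by dimensions (ID, FIELD_HOUR);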

Thanks for your inputs.

-M

Nov 13 '07 #1
4 Replies


Serge Rielau
Michel Esber wrote:
> [...]
> Should I use ID as one dimension, and HOUR(FIELD_TIME) [generated
> field] as the other? This table can be quite big.

You want to keep the rollup monotonic:
BIGINT(FIELD_TIME) / 10000 will give you the date down to the hour.
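
As DDL, a minimal sketch (untested; the worked values in the comments
just show the truncation, and the FIELD_HOUR name is a placeholder):

   create table T (
      ID         varchar(32) not null,
      FIELD_TIME timestamp   not null,
      FIELD3     integer,
      -- bigint('2007-11-13-15.23.45.000000') = 20071113152345
      -- 20071113152345 / 10000               = 2007111315 (date + hour)
      FIELD_HOUR bigint generated always as (bigint(FIELD_TIME) / 10000)
   ) organize by dimensions (ID, FIELD_HOUR);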

What do you query on?
200 rows per block is pretty sparse. You may end up with a lot of wasted
space.

Cheers
Serge

--
Serge Rielau
DB2 Solutions Development
IBM Toronto Lab
Nov 13 '07 #2

Michel Esber
On 13 nov, 13:47, Serge Rielau <srie...@ca.ibm.com> wrote:
> [original question snipped]
>
> You want to keep the rollup monotonic:
> BIGINT(FIELD_TIME) / 10000 will give you the date down to the hour.
>
> What do you query on?
> 200 rows per block is pretty sparse. You may end up with a lot of
> wasted space.

Hi Serge,

Thanks for the quick reply.

My application (C++ stored proc) retrieves all rows that belong to one
ID and one given hour. Example:

select * from T where ID=? and field_time between
'2007-11-13-15.00.00.000000' and '2007-11-13-16.00.00.000000'.

In 1 hour, there can be 12k rows (200*60) on average, and up to 300k
rows (5k*60), for each ID. There are around 500-1k different IDs.

Wasted space is a concern. By the way, this is an OLTP system. Data is
inserted and removed all the time. Is MDC a good approach for this?

Thanks

Nov 13 '07 #3

Serge Rielau
Michel Esber wrote:
> [...]
> Wasted space is a concern. By the way, this is an OLTP system. Data is
> inserted and removed all the time. Is MDC a good approach for this?

Yes. I may not follow the mainstream here, but I think MDC is good for
OLTP despite having been designed for warehousing.

So you have an average of 12,000 rows per MDC "square", with 300,000
max. How many rows fit on a page? From there you can pick your block
size. The trick is to make sure that your near-empty squares don't chew
up too much space.
I think it's feasible to do (id, BIGINT(time)/10000).
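
Back-of-the-envelope, with assumed numbers (4 KB pages and ~30-byte
rows; substitute your real row width):

   -- ~130 rows fit on a 4 KB page (assuming ~30-byte rows)
   -- extentsize 16 => one block = 16 pages ~= 2,000 rows
   -- avg cell of 12,000 rows  => ~6 blocks
   -- max cell of 300,000 rows => ~150 blocks
   -- a near-empty cell still occupies one full 16-page block
   create tablespace TS_MDC
      pagesize 4 K
      managed by system using ('/db2/ts_mdc')  -- path is a placeholder
      extentsize 16;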

There is an MDC advisor btw.
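
A typical invocation (the database name and workload file are
placeholders; -m C asks the Design Advisor for MDC recommendations):

   db2advis -d MYDB -i workload.sql -m C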

Cheers
Serge
--
Serge Rielau
DB2 Solutions Development
IBM Toronto Lab
Nov 13 '07 #4

Ian
Michel Esber wrote:
> [...]
> Wasted space is a concern. By the way, this is an OLTP system. Data is
> inserted and removed all the time. Is MDC a good approach for this?

Based on the limited information you've provided, MDC sounds like a
reasonable approach, because it can really help by eliminating the
requirement to reorg on a regular basis. Plus, a block index on the
time dimension may be more efficient than a regular row-level (RID)
index.

Just make sure that the extent size for the tablespace is small enough
that you don't waste space, as Serge suggested.

Also look into the DB2_MDC_ROLLOUT registry variable.

One other warning: There is an open APAR (LI72585) for the query
rewrite code that prevents DB2 from automatically adding predicates
against the generated column when the source column is a timestamp.
You might run into this, so check your explain plans.
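
A quick way to check (the ID literal and database name are
placeholders; the explain tables must already exist, e.g. created from
sqllib/misc/EXPLAIN.DDL):

   explain plan for
      select * from T
      where ID = 'ID0042'
        and FIELD_TIME between '2007-11-13-15.00.00.000000'
                           and '2007-11-13-16.00.00.000000';
   -- then format with: db2exfmt -d <dbname> -1
   -- and look for a derived predicate on the generated hour column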
Nov 13 '07 #5
