Bytes IT Community

Will splitting up a table improve our select performance?

[DB2 8.2.1 Workgroup on SuSE 8.1 on a dual Xeon with 2.5GB of RAM, 6
discs in RAID 5]

One of the larger tables in our database is now 6.8 million rows (1 per
financial transaction since 2000).

Every time an amendment is made to a booking, new rows are added to the
table for each transaction, so in general we never have any call to
update old rows. Usually, we only deal with analysis on transactions
in the current financial year or the previous one, but *occasionally*
we'll want to go back further.

So I'm thinking as follows:

Put every row with transaction_date earlier than 1 April 2004 into
transactions_to_end_mar04.
Put every row with a later transaction_date into transactions_current.
Put a clustering index on each table on transaction_date.
Create a view as follows:
CREATE VIEW transactions AS
SELECT * FROM transactions_to_end_mar04
UNION ALL
SELECT * FROM transactions_current;
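Sketched as DDL, assuming the existing table is called transactions_all (that name and the index names are placeholders for illustration):

```sql
-- Sketch only: "transactions_all" stands in for the current 6.8M-row table.
CREATE TABLE transactions_to_end_mar04 LIKE transactions_all;
CREATE TABLE transactions_current      LIKE transactions_all;

-- One clustering index per table on transaction_date.
CREATE INDEX ix_tx_old ON transactions_to_end_mar04 (transaction_date) CLUSTER;
CREATE INDEX ix_tx_cur ON transactions_current      (transaction_date) CLUSTER;
```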
My analysts will 99 times out of a hundred use the transaction_date
within the WHERE clause of any SQL they write against this view. Will
the clustering indexes be efficacious? Is the optimiser going to
realise it doesn't need to scan transactions_to_end_mar04 when my
analyst asks just for bookings from 1 April 05 onward?

We're trying this now and getting ready to put bits of SQL through the
explain, but I'm not sure if there's anything fundamental that I've
missed. So any comments/warnings/expressions of disdain would be
gratefully received.

Further to this: currently every table sits in the same SMS tablespace.
Would this setup be more efficient in multiple tablespaces, or would
that require DMS tablespaces? (DBA time and resource are costly to us
and we prefer not to use too much of them.)

Thanks

Nov 12 '05 #1
13 Replies


bka
The simplest way to split the table would be to move to Enterprise
Edition and use DPF (the Database Partitioning Feature) to hash the
table across multiple partitions. If you want to pursue the UNION ALL
approach, see this:
http://www-128.ibm.com/developerwork...m-0202zuzarte/
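For comparison, under DPF the hash distribution is declared when the table is created; a rough sketch in DB2 v8 syntax, with hypothetical column names:

```sql
-- Hypothetical columns; assumes a multi-partition database partition group.
CREATE TABLE transactions (
    trans_id         INTEGER       NOT NULL,
    transaction_date DATE          NOT NULL,
    amount           DECIMAL(12,2)
)
PARTITIONING KEY (trans_id) USING HASHING;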

Nov 12 '05 #2

James Conrad St.John Foreman wrote:
[original question quoted in full; snipped]

Make sure you set INTRA_PARALLEL to YES in the Database Manager config.
This should make the best use of your processors when processing the
UNION ALL.

Also, DMS table spaces with multiple containers might be a better
option. You can have multiple containers (= directories) with SMS, but
only at tablespace creation time. With DMS you can add containers after
the tablespace is created.
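Roughly, from the CLP (tablespace name, container paths, and sizes are made up; INTRA_PARALLEL takes effect after an instance restart):

```sql
UPDATE DBM CFG USING INTRA_PARALLEL YES
-- then: db2stop / db2start

-- A DMS tablespace with two containers; sizes are in pages.
CREATE TABLESPACE ts_trans
  MANAGED BY DATABASE
  USING (FILE '/db2/cont1/ts_trans.dat' 10000,
         FILE '/db2/cont2/ts_trans.dat' 10000);

-- Unlike SMS, containers can be added after creation:
ALTER TABLESPACE ts_trans ADD (FILE '/db2/cont3/ts_trans.dat' 10000);
```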

--
Anton Versteeg
IBM Netherlands
Nov 12 '05 #3

Well, thinking purely in SQL, and not partitioning, hardware, etc.:
perhaps you can make use of a multidimensionally clustered table, with
the year as a clustering dimension. That way you get a multidimensional
layout. I don't know for sure whether this will increase overall
performance, but as far as I can see it will improve index scan
performance. I don't know from which fixpak this feature was available.

-R-
Nov 12 '05 #4


James Conrad St.John Foreman wrote:
Is the optimiser going to
realise it doesn't need to scan transactions_to_end_mar04 when my
analyst asks just for bookings from 1 April 05 onward?


Short answer is yes, as long as proper check constraints are in place.

See this (somewhat dated, many improvements since) document for more
details on UNION ALL views processing in DB2:
http://www.ibm.com/developerworks/db...rte/index.html
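For branch elimination to work, each table in the UNION ALL needs a range constraint the optimizer can trust; a sketch for the tables described above (constraint names invented):

```sql
ALTER TABLE transactions_to_end_mar04
  ADD CONSTRAINT ck_tx_old CHECK (transaction_date <  '2004-04-01');
ALTER TABLE transactions_current
  ADD CONSTRAINT ck_tx_cur CHECK (transaction_date >= '2004-04-01');
```

With those in place, a predicate such as WHERE transaction_date >= '2005-04-01' contradicts the check on transactions_to_end_mar04, so that leg can be pruned at compile time.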

Regards,
Miro.

Nov 12 '05 #5

Thanks.

Now putting this through some testing. We've got some pretty nasty
queries to test with, and with the largest so far, the results are:

Single table, no clustered index: 3,083,170 timerons
Single table, clustered index on transaction date: 3,083,169 timerons
Two tables, both clustered on transaction date, view over the top so it
still looks to the end user like 1 table: 1,209,692 timerons

So if the optimiser is estimating accurately, we should see some good
time savings (the task takes 57 minutes under the current single,
unclustered table configuration).

One thing confuses me. We've set INTRA_PARALLEL = YES in the DBM
config. However, the access plan graph generated has Parallelism :
None. I find that a bit odd, because looking at the graph generated,
it has two large subqueries (one 1,151,000 timerons, and the other
56,000 timerons) that I would assume would benefit from being run in
parallel. Is this an unwarranted assumption? Or is it that because
there's such a large difference in the estimate for each of the two
subqueries that the optimiser is choosing not to run in parallel,
because there's unlikely to be benefit? Is parallelism going to be of
no use with current disc config (RAID5) or space management (SMS) ?

Nov 12 '05 #6

Just a thought here. With INTRA_PARALLEL on, did you also check
MAX_QUERYDEGREE in the dbm cfg? It should be -1 (ANY) or a value <= the
number of CPUs. Also check DFT_DEGREE in the db cfg. It normally
defaults to 1 and should be set like MAX_QUERYDEGREE. If it is at the
default, your access plan will show no parallelism, as the optimizer
always selects the least of SET RUNTIME DEGREE, DFT_DEGREE and
MAX_QUERYDEGREE: in your case 1.
HTH, Pierre.
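In CLP terms, something like this (the database name is a placeholder):

```sql
UPDATE DBM CFG USING MAX_QUERYDEGREE ANY
UPDATE DB CFG FOR mydb USING DFT_DEGREE ANY
```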

--
Pierre Saint-Jacques
SES Consultants Inc.
514-737-4515
"James Conrad St.John Foreman" <ja***********@abcmail.co.uk> wrote in
message news: 11*********************@g43g2000cwa.googlegroups.com...
[quoted text snipped]


Nov 12 '05 #7

If it's taking 57 minutes to scan 6.8 million rows in a normal table,
and the estimated cost drops to half with the union-all approach, I'd
suggest keeping going if possible! :-)

Personally, I'm most fond of MDC. It eats up more space, and can slow
down loads (and probably inserts too), but there's no maintenance
required, and if you can identify 1-2 key attributes typically used in
these selects then it can provide vast performance improvements. And
depending on the table, you might not have to make any changes to it,
so it can be quite a lot less work than DPF, union-all, etc.

Just as a reference point: I've got seven-year-old hardware running a
database with 300 million rows returning queries in 1-2 seconds. This
is due to MDC, parallelism, and lots of disks.
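A minimal MDC sketch, with hypothetical columns; a generated year column keeps the clustering dimension coarse enough to avoid wasting block space:

```sql
-- Columns are illustrative; ORGANIZE BY DIMENSIONS is what makes it MDC.
CREATE TABLE transactions_mdc (
    trans_id         INTEGER NOT NULL,
    transaction_date DATE    NOT NULL,
    amount           DECIMAL(12,2),
    trans_year       INTEGER GENERATED ALWAYS AS (YEAR(transaction_date))
)
ORGANIZE BY DIMENSIONS (trans_year);
```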

buck

Nov 12 '05 #8

Ian
James Conrad St.John Foreman wrote:
[snip]

One thing confuses me. We've set INTRA_PARALLEL = YES in the DBM
config. However, the access plan graph generated has Parallelism :
None. I find that a bit odd, because looking at the graph generated,
it has two large subqueries (one 1,151,000 timerons, and the other
56,000 timerons) that I would assume would benefit from being run in
parallel. Is this an unwarranted assumption? Or is it that because
there's such a large difference in the estimate for each of the two
subqueries that the optimiser is choosing not to run in parallel,
because there's unlikely to be benefit? Is parallelism going to be of
no use with current disc config (RAID5) or space management (SMS) ?


Obviously without seeing your query plan this is a guess, but are you
seeing two legs of a nested loop join? (i.e. is the operator sitting
above the two legs NLJOIN?) This type of join usually won't take
advantage of intra-partition parallelism because of how the join works
(search for nested loop join in the documentation).
On Windows all of the EDUs are threads within the db2syscs.exe process.
These include the agents that handle queries, the prefetchers & page
cleaners that read and write to the disks, the log writer, etc. (On
Unix/Linux, these are each separate processes.) These threads will
run concurrently regardless of whether intra-partition parallelism
(INTRA_PARALLEL) has been enabled or not.
Nov 12 '05 #9

Thanks: MAX_QUERYDEGREE was already at ANY, but our DFT_DEGREE was set
to 1.

Bit confused about RUNTIME DEGREE - is that in DB CFG or DBM CFG? (may
just be a problem with my sight whilst looking through all of these
parameters)

Nov 12 '05 #10

Query plan gives us back two legs, but the operator above them is
UNION; one's costing 1.2m timerons and the other 50k timerons, so I
suppose executing in parallel isn't going to be a significant benefit
here anyway. Will see what other queries I've got that will be more
likely to benefit from parallelism...

Buck, when you say lots of discs, how many is that? The hardware
department are looking to acquire a SAN soon; is that going to be a
magic bullet that will get rid of my speed problems, or are some
configurations better than others for a database? (We've got about
70Gb of data at the moment, probably going to do no more than double
that in the next 12 months)

Nov 12 '05 #11

> Buck, when you say lots of discs, how many is that? The hardware
department are looking to acquire a SAN soon; is that going to be a
magic bullet that will get rid of my speed problems, or are some
configurations better than others for a database? (We've got about
70Gb of data at the moment, probably going to do no more than double
that in the next 12 months)


The server in question has two RAID 10 arrays, each with 16 drives
(minus two drives on each to support hot-swap). So, it's 32 drives,
though we only get the benefit of 14, and they're spinning at 10,000
rpm. Each is attached to a separate SSA disk controller with 128
MB of cache. All data, tempspace, and logs are on these two
arrays. I'd have more flexibility if I didn't need to keep everything
on a RAID array, but availability is very important to this app. Its
replacement is in the works, and will have 4 arrays of 73 GB disks
spinning at 15k rpm. That will be nice. I'm still colocating the temp
space with the data, assuming that spreading each over the same four
arrays is better than splitting the arrays up.

And I wouldn't assume that a new SAN will eliminate your speed
problems. It may help, but that depends on how well balanced your
system is, and how well you configure the SAN. Personally, I'd like
to use a SAN for my application, but I don't want to deal with the
folks running our SAN (who have been responsible for massive outages
with it), and I'm getting fine performance without the extra cost. But
I find that the data model is more important than the hardware most of
the time anyway.

Nov 12 '05 #12

James Conrad St.John Foreman wrote:
[snip]
Buck, when you say lots of discs, how many is that? The hardware
department are looking to acquire a SAN soon; is that going to be a
magic bullet that will get rid of my speed problems, or are some
configurations better than others for a database? (We've got about
70Gb of data at the moment, probably going to do no more than double
that in the next 12 months)

It might help if you don't make the same mistake that I have seen some
people make: they split the SAN into many small parts, so that it was
impossible to tell whether you were on separate drives or not.
You also couldn't predict which other applications were accessing your
physical drives that way. My suggestion is to map the LUNs to complete
physical drives and not cut them into pieces.

--
Anton Versteeg
IBM Netherlands
Nov 12 '05 #13

It is part of SET RUNTIME DEGREE FOR application (XXX).
It allows you to dynamically change the degree for a running
application. Any statement already running will use the CURRENT DEGREE
setting; from the SET command on, the application will use whatever
value you set.
HTH, Pierre.
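For example (the application handle 41 is made up; real handles come from LIST APPLICATIONS):

```sql
LIST APPLICATIONS
SET RUNTIME DEGREE FOR (41) TO 4
```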

--
Pierre Saint-Jacques
SES Consultants Inc.
514-737-4515
"James Conrad St.John Foreman" <ja***********@abcmail.co.uk> wrote in
message news: 11**********************@g49g2000cwa.googlegroups.com...
Thanks: MAX_QUERYDEGREE was already at ANY, but our DFT_DEGREE was set
to 1.

Bit confused about RUNTIME DEGREE - is that in DB CFG or DBM CFG? (may
just be a problem with my sight whilst looking through all of these
parameters)


Nov 12 '05 #14

This discussion thread is closed.