Bytes | Software Development & Data Engineering Community

Will splitting up a table improve our select performance?

[DB2 8.2.1 Workgroup on SuSE 8.1 on a dual Xeon with 2.5 GB of RAM, 6
discs in RAID 5]

One of the larger tables in our database is now 6.8 million rows (1 per
financial transaction since 2000).

Every time an amendment is made to a booking, new rows are added to the
table for each transaction, so in general we never have any call to
update old rows. Usually, we only deal with analysis on transactions
in the current financial year or the previous one, but *occasionally*
we'll want to go back further.

So I'm thinking as follows:

Put every row with transaction_date earlier than 1 April 2004 into
transactions_to_end_mar04.
Put every row with a later transaction_date into transactions_current.
Put a clustering index on each table on transaction_date.
Create a view as follows:
CREATE VIEW transactions AS
SELECT * FROM transactions_to_end_mar04
UNION ALL
SELECT * FROM transactions_current;
My analysts will, 99 times out of a hundred, use transaction_date
within the WHERE clause of any SQL they write against this view. Will
the clustering indexes be efficacious? Is the optimiser going to
realise it doesn't need to scan transactions_to_end_mar04 when my
analyst asks just for bookings from 1 April 2005 onward?
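
Concretely, the clustering indexes I have in mind look like this
(index names invented; as I understand it, existing rows only become
clustered after a reorg):

```sql
-- One clustering index per table, both on transaction_date
CREATE INDEX ix_hist_tdate ON transactions_to_end_mar04 (transaction_date) CLUSTER;
CREATE INDEX ix_curr_tdate ON transactions_current (transaction_date) CLUSTER;

-- Existing rows are only physically clustered after a reorg, e.g.:
-- REORG TABLE transactions_to_end_mar04 INDEX ix_hist_tdate
-- REORG TABLE transactions_current INDEX ix_curr_tdate
```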

We're trying this now and getting ready to put bits of SQL through the
explain, but I'm not sure if there's anything fundamental that I've
missed. So any comments/warnings/expressions of disdain would be
gratefully received.

Further to this, currently every table sits in the same SMS tablespace.
Would this setup be more efficient in multiple tablespaces, or would
that require DMS tablespaces? (DBA time and resource are costly to us
and we prefer not to use too much of them.)

Thanks

Nov 12 '05 #1
bka
The simplest way to split the table would be to move to Enterprise
Edition and use DPF (the Database Partitioning Feature) to hash the
table across multiple nodes. If you want to pursue the UNION ALL
approach, see this:
http://www-128.ibm.com/developerwork...m-0202zuzarte/

Nov 12 '05 #2
James Conrad St.John Foreman wrote:
[original post quoted in full; snipped]

Make sure you set INTRA_PARALLEL to YES in the Database Manager Config.
This should make the best use of your processors when processing the
UNION ALL.

Also, DMS table spaces with multiple containers might be a better option.
You can have multiple containers (directories) for SMS, but only when
you create the tablespace. With DMS you can add containers after it is
created.
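
For example, a DMS sketch along these lines (paths and page counts are
invented):

```sql
-- DMS table space with two file containers (50000 pages each)
CREATE TABLESPACE ts_trans
  MANAGED BY DATABASE
  USING (FILE '/db2/cont0/ts_trans.dat' 50000,
         FILE '/db2/cont1/ts_trans.dat' 50000);

-- Unlike SMS, a container can be added later; DB2 rebalances the data
ALTER TABLESPACE ts_trans
  ADD (FILE '/db2/cont2/ts_trans.dat' 50000);
```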

--
Anton Versteeg
IBM Netherlands
Nov 12 '05 #3
Well, thinking purely in SQL, and not partitioning, hardware, etc.:
perhaps you can make use of a clustered table, using year as a cluster
dimension. This way you'll get a 3D matrix. I don't know for sure if
this will increase overall performance, but as far as I can see it
will improve index scan performance. I don't know from which FixPak
this feature was available.

-R-
Nov 12 '05 #4

James Conrad St.John Foreman wrote:
Is the optimiser going to
realise it doesn't need to scan transactions_to_end_mar04 when my
analyst asks just for bookings from 1 April 05 onward?


Short answer is yes, as long as proper check constraints are in place.

See this (somewhat dated, many improvements since) document for more
details on UNION ALL views processing in DB2:
http://www.ibm.com/developerworks/db...rte/index.html
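
In other words, with check constraints along these lines (dates taken
from your post; constraint names and the amount column are invented),
the optimiser can prove one branch of the view returns nothing for a
given predicate:

```sql
ALTER TABLE transactions_to_end_mar04
  ADD CONSTRAINT ck_hist CHECK (transaction_date <  '2004-04-01');
ALTER TABLE transactions_current
  ADD CONSTRAINT ck_curr CHECK (transaction_date >= '2004-04-01');

-- This predicate contradicts ck_hist, so the optimiser can skip the
-- transactions_to_end_mar04 branch of the UNION ALL view entirely
SELECT SUM(amount)
FROM   transactions
WHERE  transaction_date >= '2005-04-01';
```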

Regards,
Miro.

Nov 12 '05 #5
Thanks.

Now putting this through some testing. We've got some pretty nasty
queries to test with, and with the largest so far, the results are:

Single table, no clustered index: 3,083,170 timerons
Single table, clustered index on transaction date: 3,083,169 timerons
Two tables, both clustered on transaction date, view over the top so it
still looks to the end user like 1 table: 1,209,692 timerons

So if the optimiser is estimating accurately, we should see some good
time savings (the task takes 57 minutes under the current single,
unclustered-table configuration).

One thing confuses me. We've set INTRA_PARALLEL = YES in the DBM
config. However, the access plan graph generated has Parallelism :
None. I find that a bit odd, because looking at the graph generated,
it has two large subqueries (one 1,151,000 timerons, and the other
56,000 timerons) that I would assume would benefit from being run in
parallel. Is this an unwarranted assumption? Or is it that because
there's such a large difference in the estimate for each of the two
subqueries that the optimiser is choosing not to run in parallel,
because there's unlikely to be benefit? Is parallelism going to be of
no use with current disc config (RAID5) or space management (SMS) ?

Nov 12 '05 #6
Just a thought here. With INTRA_PARALLEL on, did you also check the dbm
cfg parameter MAX_QUERYDEGREE? It should be -1 (ANY) or a value <= the
number of CPUs. Also check DFT_DEGREE in the db cfg. It normally
defaults to 1 and should be set like MAX_QUERYDEGREE. If it is at the
default, your access plan will show no parallelism, as the optimizer
always selects the least of SET RUNTIME DEGREE, DFT_DEGREE, and
MAX_QUERYDEGREE; in your case 1.
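
In CLP terms, that's roughly (database name "mydb" invented; the dbm
cfg changes need an instance restart to take effect):

```sql
-- Enable intra-partition parallelism and let the optimizer pick the degree
UPDATE DBM CFG USING INTRA_PARALLEL YES
UPDATE DBM CFG USING MAX_QUERYDEGREE ANY

-- Per-database default degree for applications that don't set their own
UPDATE DB CFG FOR mydb USING DFT_DEGREE ANY
```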
HTH, Pierre.

--
Pierre Saint-Jacques
SES Consultants Inc.
514-737-4515
"James Conrad St.John Foreman" <ja***********@abcmail.co.uk> wrote in
news message 11*********************@g43g2000cwa.googlegroups.com...
[quoted post snipped]

Nov 12 '05 #7
If it's taking 57 minutes to scan 6.8 million rows in a normal table,
and the estimated cost drops to well under half with the union-all
approach, I'd suggest you keep going if possible! :-)

Personally, I'm most fond of MDC. It eats up more space, and can slow
down loads (and probably inserts too). But there's no maintenance
required, and if you can identify one or two key attributes typically
used in these selects, then it can provide vast performance
improvements. And depending on the table, you might not have to make
any changes to it, so it can be quite a lot less work than DPF,
union-all, etc.
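
A minimal sketch of what I mean (table and column names invented; the
ORGANIZE BY DIMENSIONS clause is what buys the block-level clustering):

```sql
-- MDC table clustered on a generated year-month column. Each distinct
-- trans_month value gets its own blocks on disk, so date-range scans
-- only touch the relevant blocks.
CREATE TABLE transactions_mdc (
  booking_id       INTEGER       NOT NULL,
  transaction_date DATE          NOT NULL,
  amount           DECIMAL(12,2),
  trans_month      INTEGER GENERATED ALWAYS AS
                   (YEAR(transaction_date) * 100 + MONTH(transaction_date))
)
ORGANIZE BY DIMENSIONS (trans_month);
```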

Just as a reference point: I've got seven-year-old hardware running a
database with 300 million rows, returning queries in 1-2 seconds. This
is due to MDC, parallelism, and lots of disks.

buck

Nov 12 '05 #8
Ian
James Conrad St.John Foreman wrote:
[quoted post snipped]

Obviously, without seeing your query plan this is a guess, but are you
seeing two legs of a nested loop join (i.e. is the operator sitting
above the two legs NLJOIN)? This type of join usually won't take
advantage of intra-partition parallelism because of how the join works
(search for nested loop join in the documentation).
On Windows, all of the EDUs are threads within the db2syscs.exe process.
These include the agents that handle queries, the prefetchers and page
cleaners that read from and write to the disks, the log writer, etc. (On
Unix/Linux, these are each separate processes.) These threads will
run concurrently regardless of whether intra-partition parallelism
(INTRA_PARALLEL) has been enabled or not.
Nov 12 '05 #9
Thanks: MAX_QUERYDEGREE was already at ANY, but our DFT_DEGREE was set
to 1.

Bit confused about RUNTIME DEGREE - is that in DB CFG or DBM CFG? (may
just be a problem with my sight whilst looking through all of these
parameters)

Nov 12 '05 #10
Query plan gives us back two legs, but the operator above them is
UNION; one's costing 1.2m timerons and the other 50k timerons, so I
suppose executing in parallel isn't going to be a significant benefit
here anyway. Will see what other queries I've got that will be more
likely to benefit from parallelism...

Buck, when you say lots of discs, how many is that? The hardware
department are looking to acquire a SAN soon; is that going to be a
magic bullet that will get rid of my speed problems, or are some
configurations better than others for a database? (We've got about
70Gb of data at the moment, probably going to do no more than double
that in the next 12 months)

Nov 12 '05 #11
> Buck, when you say lots of discs, how many is that?
> [rest of quoted post snipped]

The server in question has two RAID 10 arrays, each with 16 drives
(minus two drives on each to support hot-swap). So it's 32 drives,
though we only get the benefit of 14, and they're spinning at 10,000
rpm. Each array is attached to a separate SSA disk controller with 128
MB of cache. All data, tempspace, and logs are on these two arrays.
I'd have more flexibility if I didn't need to keep everything on a
RAID array, but availability is very important to this app. Its
replacement is in the works, and will have 4 arrays of 73 GB disks
spinning at 15k rpm. That will be nice. I'm still colocating the temp
space with the data, on the assumption that spreading each over the
same four arrays is better than splitting the arrays up.

And I wouldn't assume that a new SAN will eliminate your speed
problems. It may help, but that depends on how well balanced your
system is, and how well you configure the SAN. Personally, I'd like
to use a SAN for my application, but I don't want to deal with the
folks running our SAN (who have been responsible for massive outages
with it), and I'm getting fine performance without the extra cost. But
I find that the model is more important than the hardware most of the
time anyway.

Nov 12 '05 #12
James Conrad St.John Foreman wrote:
[quoted post snipped]

It might help, as long as you don't make the same mistake that I have
seen some people make. They split the SAN into so many small parts that
it was impossible to tell whether you were on separate drives or not.
You also couldn't predict which other applications were accessing your
physical drives that way. My suggestion is to map the LUNs to complete
physical drives and not cut them into pieces.

--
Anton Versteeg
IBM Netherlands
Nov 12 '05 #13
It is part of SET RUNTIME DEGREE FOR (application-handle).
It allows you to dynamically change the degree for a running
application: any statement already running uses the CURRENT DEGREE
setting, and after the SET command the application will use whatever
value you set.
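
For example, from the CLP (the application handle 42 is invented; use
LIST APPLICATIONS to find the real one):

```sql
-- Find the application handle of the running application
LIST APPLICATIONS

-- Raise that application's runtime degree to 4
SET RUNTIME DEGREE FOR ( 42 ) TO 4
```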
HTH, Pierre.

--
Pierre Saint-Jacques
SES Consultants Inc.
514-737-4515
"James Conrad St.John Foreman" <ja***********@abcmail.co.uk> wrote in
news message 11**********************@g49g2000cwa.googlegroups.com...
[quoted post snipped]


Nov 12 '05 #14
