473,385 Members | 1,693 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,385 software developers and data experts.

Need help with denormalizing query

Hello,

I am currently working with a data mart. The business wants a view of
the data that shows columns from the main table along with up to 6
codes from a related table (not related by the full PK of the main
table though). They don't care about the order of the codes or how they
are assigned to their columns. Here is an example, which will hopefully
make this clearer:

CREATE TABLE dbo.Main_Table (
my_id INT NOT NULL,
line_number SMALLINT NOT NULL,
some_other_columns VARCHAR(20) NULL )
GO

ALTER TABLE dbo.Main_Table
ADD CONSTRAINT PK_Main_Table PRIMARY KEY CLUSTERED (my_id, line_number)
GO

CREATE TABLE dbo.Related_Table (
my_id INT NOT NULL,
my_code CHAR(6) NOT NULL )
GO

ALTER TABLE dbo.Related_Table
ADD CONSTRAINT PK_Related_Table PRIMARY KEY CLUSTERED (my_id, my_code)
GO

-- The two tables above are related by my_id

INSERT INTO dbo.Main_Table (my_id, line_number) VALUES (1, 1)
INSERT INTO dbo.Main_Table (my_id, line_number) VALUES (1, 2)
INSERT INTO dbo.Main_Table (my_id, line_number) VALUES (1, 3)
INSERT INTO dbo.Main_Table (my_id, line_number) VALUES (2, 1)
INSERT INTO dbo.Main_Table (my_id, line_number) VALUES (2, 2)

INSERT INTO dbo.Related_Table (my_id, my_code) VALUES (1, '22.63')
INSERT INTO dbo.Related_Table (my_id, my_code) VALUES (1, '73.09')
INSERT INTO dbo.Related_Table (my_id, my_code) VALUES (1, '51.23')
INSERT INTO dbo.Related_Table (my_id, my_code) VALUES (2, '26.42')
GO

The results that they would want to see are:

my_id line_number my_code_1 my_code_2 my_code_3 my_code_4
1 1 22.63 73.09 51.23 NULL
1 2 22.63 73.09 51.23 NULL
1 3 22.63 73.09 51.23 NULL
2 1 26.42 NULL NULL NULL
2 2 26.42 NULL NULL NULL

I'm pretty sure that I will need to load a table with the data in this
format (or something similar) since generating this output on the fly
for the users will not work performance-wise (the main table is a
partitioned table with 6 partitions, each containting 35M+ rows, while
the related table is a single table that contains about 2M rows. There
is additional logic that has to be done with a similar table of 90M
rows. So, I will try to load a table with this denormalized view of the
world. I could probably add some sort of sequence number to
Related_Table if you think that might help (then it just becomes a
simple series of left outer joins).

Any thoughts?

Thanks!
-Tom.

Jul 10 '06 #1
8 5045
Thomas R. Hummel (to********@hotmail.com) writes:
I am currently working with a data mart. The business wants a view of
the data that shows columns from the main table along with up to 6
codes from a related table (not related by the full PK of the main
table though). They don't care about the order of the codes or how they
are assigned to their columns. Here is an example, which will hopefully
make this clearer:
...
I'm pretty sure that I will need to load a table with the data in this
format (or something similar) since generating this output on the fly
for the users will not work performance-wise (the main table is a
partitioned table with 6 partitions, each containting 35M+ rows, while
the related table is a single table that contains about 2M rows. There
is additional logic that has to be done with a similar table of 90M
rows. So, I will try to load a table with this denormalized view of the
world. I could probably add some sort of sequence number to
Related_Table if you think that might help (then it just becomes a
simple series of left outer joins).
Without a sequence number in Related_Table it becomes a lot more difficult.
So I changed Related_Table to:

CREATE TABLE dbo.Related_Table (
my_id INT NOT NULL,
seq_no tinyint NOT NULL CHECK (seq_no BETWEEN 1 AND 4),
my_code CHAR(6) NOT NULL )

and produced this query (with a single LEFT JOIN):

SELECT M.my_id, M.line_number,
code1 = MIN(CASE R.seq_no WHEN 1 THEN R.my_code END),
code2 = MIN(CASE R.seq_no WHEN 2 THEN R.my_code END),
code3 = MIN(CASE R.seq_no WHEN 3 THEN R.my_code END),
code4 = MIN(CASE R.seq_no WHEN 4 THEN R.my_code END)
FROM Main_Table M
LEFT JOIN Related_Table R ON M.my_id = R.my_id
GROUP BY M.my_id, M.line_number
ORDER BY M.my_id, M.line_number

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
Jul 10 '06 #2
Erland Sommarskog wrote:
Without a sequence number in Related_Table it becomes a lot more difficult.
So I changed Related_Table to:

CREATE TABLE dbo.Related_Table (
my_id INT NOT NULL,
seq_no tinyint NOT NULL CHECK (seq_no BETWEEN 1 AND 4),
my_code CHAR(6) NOT NULL )

and produced this query (with a single LEFT JOIN):

SELECT M.my_id, M.line_number,
code1 = MIN(CASE R.seq_no WHEN 1 THEN R.my_code END),
code2 = MIN(CASE R.seq_no WHEN 2 THEN R.my_code END),
code3 = MIN(CASE R.seq_no WHEN 3 THEN R.my_code END),
code4 = MIN(CASE R.seq_no WHEN 4 THEN R.my_code END)
FROM Main_Table M
LEFT JOIN Related_Table R ON M.my_id = R.my_id
GROUP BY M.my_id, M.line_number
ORDER BY M.my_id, M.line_number
Thanks for the interesting solution Erland. In the real life case of
course, I will have many more columns than just my_id and line_number.
I realize that everything is dependent on the specific data, etc., but
in general, would you expect this to perform better than:

SELECT M.my_id,
M.line_number,
R1.my_code AS code1,
R2.my_code AS code2,
R3.my_code AS code3,
R4.my_code AS code4
FROM dbo.Main_Table M
LEFT OUTER JOIN dbo.Related_Table R1 ON R1.my_id = M.my_id AND
R1.seq_no = 1
LEFT OUTER JOIN dbo.Related_Table R2 ON R1.my_id = M.my_id AND
R1.seq_no = 2
LEFT OUTER JOIN dbo.Related_Table R3 ON R1.my_id = M.my_id AND
R1.seq_no = 3
LEFT OUTER JOIN dbo.Related_Table R4 ON R1.my_id = M.my_id AND
R1.seq_no = 4

I'll experiment with both solutions if I can add a sequence number, but
I was just a little surprised to see your query once the seq_no was
added.

Thanks,
-Tom.

Jul 10 '06 #3
Thomas R. Hummel (to********@hotmail.com) writes:
Thanks for the interesting solution Erland. In the real life case of
course, I will have many more columns than just my_id and line_number.
I realize that everything is dependent on the specific data, etc., but
in general, would you expect this to perform better than:
I will have to admit not having done any benchmark on the two solutions.
But my gut feeling is that the solution with MIN and GROUP BY generally
is better. The other seems to give more chances for the optimizer to go
astray.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
Jul 10 '06 #4
Erland Sommarskog wrote:
Thomas R. Hummel (to********@hotmail.com) writes:
Thanks for the interesting solution Erland. In the real life case of
course, I will have many more columns than just my_id and line_number.
I realize that everything is dependent on the specific data, etc., but
in general, would you expect this to perform better than:

I will have to admit not having done any benchmark on the two solutions.
But my gut feeling is that the solution with MIN and GROUP BY generally
is better. The other seems to give more chances for the optimizer to go
astray.
I'll be sure to post some results here once I have both solutions in
place for testing. Might be a few days though.

-Tom.

Jul 11 '06 #5
Erland Sommarskog wrote:
Thomas R. Hummel (to********@hotmail.com) writes:
Thanks for the interesting solution Erland. In the real life case of
course, I will have many more columns than just my_id and line_number.
I realize that everything is dependent on the specific data, etc., but
in general, would you expect this to perform better than:

I will have to admit not having done any benchmark on the two solutions.
But my gut feeling is that the solution with MIN and GROUP BY generally
is better. The other seems to give more chances for the optimizer to go
astray.
As promised... I had a chance today to run some tests with each method.
When I used a smaller table for the Main_Table (~17M rows) both had the
same estimated cost for their query plans, although the multiple joins
method ran a bit faster (3m49s vs. 5m11s). With a larger Main_Table
(~39M rows) the estimated cost for the query plan with the group by was
a bit higher than the multiple join method. In that case the group by
method ran 20m23s vs. 12m45s for the multiple join method.

Of course, this may vary depending on server memory, IO speed, data
frequencies, etc., etc., but for my case it looks like doing multiple
joins works out best.

Thanks,
-Tom.

Jul 14 '06 #6
Tom,

you might want to sort and aggregate a narrow result set first, then
join:

SELECT M.my_id, M.line_number,
code1,
code2,
code3,
code4
FROM Main_Table M
LEFT JOIN (
select my_id,
MIN(CASE R.seq_no WHEN 1 THEN R.my_code END) code1,
MIN(CASE R.seq_no WHEN 2 THEN R.my_code END) code2,
MIN(CASE R.seq_no WHEN 3 THEN R.my_code END) code3,
MIN(CASE R.seq_no WHEN 4 THEN R.my_code END) code4
from Related_Table r
group by my_id
) R
ON M.my_id = R.my_id
In many cases it is dramatically faster. We were describing it in
article named "The Less SQL Server Sorts, the Faster It Responds".

Jul 14 '06 #7
Alexander Kuznetsov wrote:
Tom,

you might want to sort and aggregate a narrow result set first, then
join:

SELECT M.my_id, M.line_number,
code1,
code2,
code3,
code4
FROM Main_Table M
LEFT JOIN (
select my_id,
MIN(CASE R.seq_no WHEN 1 THEN R.my_code END) code1,
MIN(CASE R.seq_no WHEN 2 THEN R.my_code END) code2,
MIN(CASE R.seq_no WHEN 3 THEN R.my_code END) code3,
MIN(CASE R.seq_no WHEN 4 THEN R.my_code END) code4
from Related_Table r
group by my_id
) R
ON M.my_id = R.my_id
In many cases it is dramatically faster. We were describing it in
article named "The Less SQL Server Sorts, the Faster It Responds".
Thanks for the suggestion. When I checked a solution based on this the
estimated cost in the query plan was indeed lower and it ran in just
over 10m vs. the 12m45s of the multiple join solution. I may end up
sticking with the multiple join solution because I think that it's more
obvious to future coders what the code is trying to do and the
performance difference is not a big factor, but I'll definitely put
this solution in my toolbox for future use.

Thanks!
-Tom.

Jul 14 '06 #8
Tom,

In such cases I just love to figure out what components add up to those
10 or 12 minutes, for instance:
How much time does it take to just scan the main table?
select count(*) from main_table where col1+col2=17
(the table is scanned and a very minor overhead to calculate
"col1+col2")

How much time does it take to run the inner subquery?

and so on.

IMO it is important. For instance, if it takes 8 minutes just to scan
the main table, then the difference between these 2 approaches is not
10' vs.12'45'', it is in fact closer to 2' vs. 4'45''.

Makes sense?

Jul 16 '06 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
by: lawrence | last post by:
I've been bad about documentation so far but I'm going to try to be better. I've mostly worked alone so I'm the only one, so far, who's suffered from my bad habits. But I'd like other programmers...
9
by: netpurpose | last post by:
I need to extract data from this table to find the lowest prices of each product as of today. The product will be listed/grouped by the name only, discarding the product code - I use...
1
by: database student | last post by:
Hi, I am a database Ph.D. student who is currently looking for research topics for my thesis. I am very interested in identifying and solving real problems faced by database users....
3
by: pw | last post by:
Hi, I am having a mental block trying to figure out how to code this. Two tables: "tblQuestions" (fields = quesnum, questype, question) "tblAnswers" (fields = clientnum, quesnum, questype,...
7
by: K. Crothers | last post by:
I administer a mechanical engineering database. I need to build a query which uses the results from a subquery as its input or criterion. I am attempting to find all of the component parts of...
3
by: google | last post by:
I have a database with four table. In one of the tables, I use about five lookup fields to get populate their dropdown list. I have read that lookup fields are really bad and may cause problems...
10
by: L. R. Du Broff | last post by:
I own a small business. Need to track a few hundred pieces of rental equipment that can be in any of a few dozen locations. I'm an old-time C language programmer (UNIX environment). If the only...
1
by: John Bode | last post by:
I need a way to fake reflection in C++ code that makes as few assumptions about the data types involved as possible. I suspect there is no good answer for what I need to do, but I'll present the...
3
by: pbd22 | last post by:
Hi. I need some help with structuring my query strings. I have a form with a search bar and some links. Each link is a search type (such as "community"). The HREF for the link's anchor looks...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.