473,902 Members | 5,194 Online

# Creating Deciles

Hello,

I want to set up a query in SQL Server that is "grouped by" a variable
PRICE. Since PRICE takes on continuous decimal values, I want to
create deciles based on this variable and then display the average
price in each decile. As background, PRICE is a calculated quantity:
I divide a table field SLS_DLRS by a field SLS_UNTS to get it. I also
want to include an average SLS_UNTS for each decile.

So essentially, I want the result to be something like:
DECILE AVG_PRICE AVG_SLS_UNTS
1 0.50 5.2
2 1.50 4.7
.... ... ...
10 9.50 1.1

Is there a way to do this in an SQL statement?

Indraneel
Jul 20 '05 #1
3 15033
"Indraneel Sheorey" <in************ ***@dartmouth.e du> wrote in message
news:35******** *************** **@posting.goog le.com...
Hello,

I want to set up a query in SQL Server that is "grouped by" a variable
PRICE. Since PRICE takes on continuous decimal values, I want to
create deciles based on this variable and then display the average
price in each decile. As background, PRICE is a calculated quantity:
I divide a table field SLS_DLRS by a field SLS_UNTS to get it. I also
want to include an average SLS_UNTS for each decile.

So essentially, I want the result to be something like:
DECILE AVG_PRICE AVG_SLS_UNTS
1 0.50 5.2
2 1.50 4.7
... ... ...
10 9.50 1.1

Is there a way to do this in an SQL statement?

Indraneel

There are different methods, with varying degrees of accuracy, to
calculate percentiles. Here's one reasonable method described through
an example. Consider the 50th percentile (5th decile or median) for the
following numbers, arranged in increasing order:

3, 5, 7, 8, 9, 11, 13, 15

1. Compute the rank R of the 50th percentile.
R = P / 100 * (N + 1)
P = the desired percentile
N = number of numbers in the collection

R = 50 / 100 * (8 + 1) = 9 / 2 = 4.5
2. When R is an integer, the Pth percentile would be the number with
rank R.
3. When R is not an integer, as in this case, the Pth percentile is
computed by interpolation as follows:
a. Define IR as the integer portion of R. For this example, IR is 4.
b. Define FR as the fractional portion of R. For this example, FR is .5.
c. Find the values in the sequence with rank IR and IR + 1. For example,
this means the value with rank 4 and the value with rank 5, which
are 8 and 9, respectively.
d. Interpolate by multiplying the difference between the scores by FR
and adding the result to the lower score. For this example, this is
.5 * (9 - 8) + 8 = 8.5

Therefore, the 50th percentile is 8.5.

CREATE TABLE T
(
sls_dlrs DECIMAL (8, 2) NOT NULL CHECK (sls_dlrs > 0),
sls_unts INT NOT NULL CHECK (sls_unts > 0)
)

CREATE VIEW TPrices (sls_dlrs, sls_unts, price)
AS
SELECT sls_dlrs, sls_unts, sls_dlrs / sls_unts
FROM T

-- Rank prices in increasing order. For a given price P in an ordered collection,
-- give both the rank of the first occurrence of P and the rank of the last
-- occurrence of P. The first price P in an ordered collection has rank 1.
CREATE VIEW IncreasingOrder Ranks (price, rank_first, rank_last)
AS
SELECT P1.price,
COUNT(*) - P1.occurrences + 1,
COUNT(*)
FROM (SELECT price, COUNT(*) AS occurrences
FROM TPrices
GROUP BY price) AS P1
INNER JOIN
TPrices AS P2
ON P2.price <= P1.price
GROUP BY P1.price, P1.occurrences

CREATE TABLE Digits
(
d INT NOT NULL PRIMARY KEY CHECK (d BETWEEN 0 AND 9)
)

INSERT INTO Digits (d)
VALUES (0)
INSERT INTO Digits (d)
VALUES (1)
INSERT INTO Digits (d)
VALUES (2)
INSERT INTO Digits (d)
VALUES (3)
INSERT INTO Digits (d)
VALUES (4)
INSERT INTO Digits (d)
VALUES (5)
INSERT INTO Digits (d)
VALUES (6)
INSERT INTO Digits (d)
VALUES (7)
INSERT INTO Digits (d)
VALUES (8)
INSERT INTO Digits (d)
VALUES (9)

-- Need to generate natural numbers up to 100 to represent whole number
-- percentile ranks
CREATE VIEW PercentileRanks (rank)
AS
SELECT Ones.d + 10 * Tens.d + 1
FROM Digits AS Ones
CROSS JOIN
Digits AS Tens

CREATE VIEW Percentiles (percentile, rank)
AS
SELECT CASE WHEN PR.rank * (RN.rank + 1) / 100.0 < R1.rank
THEN R1.price
WHEN PR.rank * (RN.rank + 1) / 100.0 > RN.rank
THEN RN.price
WHEN PR.rank * (RN.rank + 1) / 100.0 >
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
THEN
(SELECT CASE WHEN R1.price = R2.price
THEN R1.price
ELSE (PR.rank * (RN.rank + 1) / 100.0 -
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)) *
(R2.price - R1.price) + R1.price
END
FROM IncreasingOrder Ranks AS R1
INNER JOIN
IncreasingOrder Ranks AS R2
ON CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
BETWEEN R1.rank_first AND R1.rank_last AND
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) + 1
BETWEEN R2.rank_first AND R2.rank_last)
ELSE (SELECT price
FROM IncreasingOrder Ranks
WHERE CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
BETWEEN rank_first AND rank_last)
END,
PR.rank
FROM PercentileRanks AS PR
CROSS JOIN
(SELECT 1, price
FROM IncreasingOrder Ranks
WHERE rank_first = 1) AS R1(rank, price)
CROSS JOIN
(SELECT MAX(rank_last), MAX(price)
FROM IncreasingOrder Ranks) AS RN(rank, price)

CREATE VIEW Deciles (decile, rank)
AS
SELECT percentile, rank / 10
FROM Percentiles
WHERE rank % 10 = 0

-- The desired query
SELECT D1.rank, AVG(P.price) AS avg_price, AVG(P.sls_unts) AS avg_sls_unts
FROM Deciles AS D1
LEFT OUTER JOIN
Deciles AS D2
ON D2.rank = D1.rank - 1
INNER JOIN
TPrices AS P
ON P.price <= D1.decile AND
(D2.decile IS NULL OR P.price > D2.decile)
GROUP BY D1.rank
ORDER BY D1.rank

Regards,
jag
Jul 20 '05 #2
"John Gilson" <ja*@acm.org> wrote in message news:<RE******* *************@t wister.nyc.rr.c om>...
"Indraneel Sheorey" <in************ ***@dartmouth.e du> wrote in message
news:35******** *************** **@posting.goog le.com...
Hello,

I want to set up a query in SQL Server that is "grouped by" a variable
PRICE. Since PRICE takes on continuous decimal values, I want to
create deciles based on this variable and then display the average
price in each decile. As background, PRICE is a calculated quantity:
I divide a table field SLS_DLRS by a field SLS_UNTS to get it. I also
want to include an average SLS_UNTS for each decile.

So essentially, I want the result to be something like:
DECILE AVG_PRICE AVG_SLS_UNTS
1 0.50 5.2
2 1.50 4.7
... ... ...
10 9.50 1.1

Is there a way to do this in an SQL statement?

Indraneel

There are different methods, with varying degrees of accuracy, to
calculate percentiles. Here's one reasonable method described through
an example. Consider the 50th percentile (5th decile or median) for the
following numbers, arranged in increasing order:

3, 5, 7, 8, 9, 11, 13, 15

1. Compute the rank R of the 50th percentile.
R = P / 100 * (N + 1)
P = the desired percentile
N = number of numbers in the collection

R = 50 / 100 * (8 + 1) = 9 / 2 = 4.5
2. When R is an integer, the Pth percentile would be the number with
rank R.
3. When R is not an integer, as in this case, the Pth percentile is
computed by interpolation as follows:
a. Define IR as the integer portion of R. For this example, IR is 4.
b. Define FR as the fractional portion of R. For this example, FR is .5.
c. Find the values in the sequence with rank IR and IR + 1. For example,
this means the value with rank 4 and the value with rank 5, which
are 8 and 9, respectively.
d. Interpolate by multiplying the difference between the scores by FR
and adding the result to the lower score. For this example, this is
.5 * (9 - 8) + 8 = 8.5

Therefore, the 50th percentile is 8.5.

CREATE TABLE T
(
sls_dlrs DECIMAL (8, 2) NOT NULL CHECK (sls_dlrs > 0),
sls_unts INT NOT NULL CHECK (sls_unts > 0)
)

CREATE VIEW TPrices (sls_dlrs, sls_unts, price)
AS
SELECT sls_dlrs, sls_unts, sls_dlrs / sls_unts
FROM T

-- Rank prices in increasing order. For a given price P in an ordered collection,
-- give both the rank of the first occurrence of P and the rank of the last
-- occurrence of P. The first price P in an ordered collection has rank 1.
CREATE VIEW IncreasingOrder Ranks (price, rank_first, rank_last)
AS
SELECT P1.price,
COUNT(*) - P1.occurrences + 1,
COUNT(*)
FROM (SELECT price, COUNT(*) AS occurrences
FROM TPrices
GROUP BY price) AS P1
INNER JOIN
TPrices AS P2
ON P2.price <= P1.price
GROUP BY P1.price, P1.occurrences

CREATE TABLE Digits
(
d INT NOT NULL PRIMARY KEY CHECK (d BETWEEN 0 AND 9)
)

INSERT INTO Digits (d)
VALUES (0)
INSERT INTO Digits (d)
VALUES (1)
INSERT INTO Digits (d)
VALUES (2)
INSERT INTO Digits (d)
VALUES (3)
INSERT INTO Digits (d)
VALUES (4)
INSERT INTO Digits (d)
VALUES (5)
INSERT INTO Digits (d)
VALUES (6)
INSERT INTO Digits (d)
VALUES (7)
INSERT INTO Digits (d)
VALUES (8)
INSERT INTO Digits (d)
VALUES (9)

-- Need to generate natural numbers up to 100 to represent whole number
-- percentile ranks
CREATE VIEW PercentileRanks (rank)
AS
SELECT Ones.d + 10 * Tens.d + 1
FROM Digits AS Ones
CROSS JOIN
Digits AS Tens

CREATE VIEW Percentiles (percentile, rank)
AS
SELECT CASE WHEN PR.rank * (RN.rank + 1) / 100.0 < R1.rank
THEN R1.price
WHEN PR.rank * (RN.rank + 1) / 100.0 > RN.rank
THEN RN.price
WHEN PR.rank * (RN.rank + 1) / 100.0 >
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
THEN
(SELECT CASE WHEN R1.price = R2.price
THEN R1.price
ELSE (PR.rank * (RN.rank + 1) / 100.0 -
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)) *
(R2.price - R1.price) + R1.price
END
FROM IncreasingOrder Ranks AS R1
INNER JOIN
IncreasingOrder Ranks AS R2
ON CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
BETWEEN R1.rank_first AND R1.rank_last AND
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) + 1
BETWEEN R2.rank_first AND R2.rank_last)
ELSE (SELECT price
FROM IncreasingOrder Ranks
WHERE CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
BETWEEN rank_first AND rank_last)
END,
PR.rank
FROM PercentileRanks AS PR
CROSS JOIN
(SELECT 1, price
FROM IncreasingOrder Ranks
WHERE rank_first = 1) AS R1(rank, price)
CROSS JOIN
(SELECT MAX(rank_last), MAX(price)
FROM IncreasingOrder Ranks) AS RN(rank, price)

CREATE VIEW Deciles (decile, rank)
AS
SELECT percentile, rank / 10
FROM Percentiles
WHERE rank % 10 = 0

-- The desired query
SELECT D1.rank, AVG(P.price) AS avg_price, AVG(P.sls_unts) AS avg_sls_unts
FROM Deciles AS D1
LEFT OUTER JOIN
Deciles AS D2
ON D2.rank = D1.rank - 1
INNER JOIN
TPrices AS P
ON P.price <= D1.decile AND
(D2.decile IS NULL OR P.price > D2.decile)
GROUP BY D1.rank
ORDER BY D1.rank

Regards,
jag

trouble creating the Percentiles view, however. I am running SQL
Server locally on a desktop computer, so I think this command may be
too much for it -- I get timeout messages. Is there a way I can split
up this command to create the Percentiles view?

Thanks again,
Indraneel
Jul 20 '05 #3
"Indraneel Sheorey" <in************ ***@dartmouth.e du> wrote in message
news:35******** *************** **@posting.goog le.com...
"John Gilson" <ja*@acm.org> wrote in message news:<RE******* *************@t wister.nyc.rr.c om>...
"Indraneel Sheorey" <in************ ***@dartmouth.e du> wrote in message
news:35******** *************** **@posting.goog le.com...
Hello,

I want to set up a query in SQL Server that is "grouped by" a variable
PRICE. Since PRICE takes on continuous decimal values, I want to
create deciles based on this variable and then display the average
price in each decile. As background, PRICE is a calculated quantity:
I divide a table field SLS_DLRS by a field SLS_UNTS to get it. I also
want to include an average SLS_UNTS for each decile.

So essentially, I want the result to be something like:
DECILE AVG_PRICE AVG_SLS_UNTS
1 0.50 5.2
2 1.50 4.7
... ... ...
10 9.50 1.1

Is there a way to do this in an SQL statement?

Indraneel

There are different methods, with varying degrees of accuracy, to
calculate percentiles. Here's one reasonable method described through
an example. Consider the 50th percentile (5th decile or median) for the
following numbers, arranged in increasing order:

3, 5, 7, 8, 9, 11, 13, 15

1. Compute the rank R of the 50th percentile.
R = P / 100 * (N + 1)
P = the desired percentile
N = number of numbers in the collection

R = 50 / 100 * (8 + 1) = 9 / 2 = 4.5
2. When R is an integer, the Pth percentile would be the number with
rank R.
3. When R is not an integer, as in this case, the Pth percentile is
computed by interpolation as follows:
a. Define IR as the integer portion of R. For this example, IR is 4.
b. Define FR as the fractional portion of R. For this example, FR is .5.
c. Find the values in the sequence with rank IR and IR + 1. For example,
this means the value with rank 4 and the value with rank 5, which
are 8 and 9, respectively.
d. Interpolate by multiplying the difference between the scores by FR
and adding the result to the lower score. For this example, this is
.5 * (9 - 8) + 8 = 8.5

Therefore, the 50th percentile is 8.5.

CREATE TABLE T
(
sls_dlrs DECIMAL (8, 2) NOT NULL CHECK (sls_dlrs > 0),
sls_unts INT NOT NULL CHECK (sls_unts > 0)
)

CREATE VIEW TPrices (sls_dlrs, sls_unts, price)
AS
SELECT sls_dlrs, sls_unts, sls_dlrs / sls_unts
FROM T

-- Rank prices in increasing order. For a given price P in an ordered collection,
-- give both the rank of the first occurrence of P and the rank of the last
-- occurrence of P. The first price P in an ordered collection has rank 1.
CREATE VIEW IncreasingOrder Ranks (price, rank_first, rank_last)
AS
SELECT P1.price,
COUNT(*) - P1.occurrences + 1,
COUNT(*)
FROM (SELECT price, COUNT(*) AS occurrences
FROM TPrices
GROUP BY price) AS P1
INNER JOIN
TPrices AS P2
ON P2.price <= P1.price
GROUP BY P1.price, P1.occurrences

CREATE TABLE Digits
(
d INT NOT NULL PRIMARY KEY CHECK (d BETWEEN 0 AND 9)
)

INSERT INTO Digits (d)
VALUES (0)
INSERT INTO Digits (d)
VALUES (1)
INSERT INTO Digits (d)
VALUES (2)
INSERT INTO Digits (d)
VALUES (3)
INSERT INTO Digits (d)
VALUES (4)
INSERT INTO Digits (d)
VALUES (5)
INSERT INTO Digits (d)
VALUES (6)
INSERT INTO Digits (d)
VALUES (7)
INSERT INTO Digits (d)
VALUES (8)
INSERT INTO Digits (d)
VALUES (9)

-- Need to generate natural numbers up to 100 to represent whole number
-- percentile ranks
CREATE VIEW PercentileRanks (rank)
AS
SELECT Ones.d + 10 * Tens.d + 1
FROM Digits AS Ones
CROSS JOIN
Digits AS Tens

CREATE VIEW Percentiles (percentile, rank)
AS
SELECT CASE WHEN PR.rank * (RN.rank + 1) / 100.0 < R1.rank
THEN R1.price
WHEN PR.rank * (RN.rank + 1) / 100.0 > RN.rank
THEN RN.price
WHEN PR.rank * (RN.rank + 1) / 100.0 >
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
THEN
(SELECT CASE WHEN R1.price = R2.price
THEN R1.price
ELSE (PR.rank * (RN.rank + 1) / 100.0 -
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)) * (R2.price - R1.price) + R1.price
END
FROM IncreasingOrder Ranks AS R1
INNER JOIN
IncreasingOrder Ranks AS R2
ON CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
BETWEEN R1.rank_first AND R1.rank_last AND
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) + 1
BETWEEN R2.rank_first AND R2.rank_last)
ELSE (SELECT price
FROM IncreasingOrder Ranks
WHERE CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
BETWEEN rank_first AND rank_last)
END,
PR.rank
FROM PercentileRanks AS PR
CROSS JOIN
(SELECT 1, price
FROM IncreasingOrder Ranks
WHERE rank_first = 1) AS R1(rank, price)
CROSS JOIN
(SELECT MAX(rank_last), MAX(price)
FROM IncreasingOrder Ranks) AS RN(rank, price)

CREATE VIEW Deciles (decile, rank)
AS
SELECT percentile, rank / 10
FROM Percentiles
WHERE rank % 10 = 0

-- The desired query
SELECT D1.rank, AVG(P.price) AS avg_price, AVG(P.sls_unts) AS avg_sls_unts
FROM Deciles AS D1
LEFT OUTER JOIN
Deciles AS D2
ON D2.rank = D1.rank - 1
INNER JOIN
TPrices AS P
ON P.price <= D1.decile AND
(D2.decile IS NULL OR P.price > D2.decile)
GROUP BY D1.rank
ORDER BY D1.rank

Regards,
jag

trouble creating the Percentiles view, however. I am running SQL
Server locally on a desktop computer, so I think this command may be
too much for it -- I get timeout messages. Is there a way I can split
up this command to create the Percentiles view?

Thanks again,
Indraneel

Some appropriate storing and indexing of intermediate results should help.
Apply the following revisions to the above code.

CREATE TABLE TPricesBag
(
sls_dlrs DECIMAL (8, 2) NOT NULL,
sls_unts INT NOT NULL,
price DECIMAL (8, 2) NOT NULL
)

INSERT INTO TPricesBag (sls_dlrs, sls_unts, price)
SELECT sls_dlrs, sls_unts, price
FROM TPrices

CREATE TABLE TPricesSet
(
price DECIMAL (8, 2) NOT NULL,
occurrences INT NOT NULL,
PRIMARY KEY (price)
)

INSERT INTO TPricesSet (price, occurrences)
SELECT price, COUNT(*)
FROM TPricesBag
GROUP BY price

-- *** REPLACE ***
-- Rank prices in increasing order. For a given price P in an ordered collection,
-- give both the rank of the first occurrence of P and the rank of the last
-- occurrence of P. The first price P in an ordered collection has rank 1.
CREATE VIEW IncreasingOrder Ranks (price, rank_first, rank_last)
AS
SELECT P1.price,
COUNT(*) - P1.occurrences + 1,
COUNT(*)
FROM TPricesSet AS P1
INNER JOIN
TPricesBag AS P2
ON P2.price <= P1.price
GROUP BY P1.price, P1.occurrences

CREATE TABLE Ranks
(
price DECIMAL (8, 2) NOT NULL,
rank_first INT NOT NULL CHECK (rank_first >= 1),
rank_last INT NOT NULL CHECK (rank_last >= 1),
CHECK (rank_last >= rank_first),
PRIMARY KEY (rank_first)
)

INSERT INTO Ranks (price, rank_first, rank_last)
SELECT price, rank_first, rank_last
FROM IncreasingOrder Ranks

-- *** REPLACE ***
CREATE VIEW Percentiles (percentile, rank)
AS
SELECT CASE
WHEN PR.rank * (RN.rank + 1) / 100.0 < R1.rank
THEN R1.price
WHEN PR.rank * (RN.rank + 1) / 100.0 > RN.rank
THEN RN.price
WHEN PR.rank * (RN.rank + 1) / 100.0 >
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
THEN
(SELECT CASE WHEN R1.price = R2.price
THEN R1.price
ELSE (PR.rank * (RN.rank + 1) / 100.0 -
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)) *
(R2.price - R1.price) + R1.price
END
FROM Ranks AS R1
INNER JOIN
Ranks AS R2
ON CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
BETWEEN R1.rank_first AND R1.rank_last AND
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) + 1
BETWEEN R2.rank_first AND R2.rank_last)
ELSE (SELECT price
FROM Ranks
WHERE CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
BETWEEN rank_first AND rank_last)
END,
PR.rank
FROM PercentileRanks AS PR
CROSS JOIN
(SELECT 1, price
FROM Ranks
WHERE rank_first = 1) AS R1(rank, price)
CROSS JOIN
(SELECT MAX(rank_last), MAX(price)
FROM Ranks) AS RN(rank, price)

-- *** REPLACE ***
-- The desired query
SELECT D1.rank, AVG(P.price) AS avg_price,
AVG(P.sls_unts) AS avg_sls_unts
FROM Deciles AS D1
LEFT OUTER JOIN
Deciles AS D2
ON D2.rank = D1.rank - 1
INNER JOIN
TPricesBag AS P
ON P.price <= D1.decile AND
(D2.decile IS NULL OR P.price > D2.decile)
GROUP BY D1.rank
ORDER BY D1.rank

Given a table T with 10,000 rows (randomly generated data) on a 1.7 GHz
P4 with 500 MB RAM, the above query took a bit over 1 minute. Not
blazing, hopefully acceptable.

Regards,
jag
Jul 20 '05 #4

This thread has been closed and replies have been disabled. Please start a new discussion.