Working with date ranges

Thomas R. Hummel

Hello,

I am importing data that lists rates for particular coverages for a
particular period of time. Unfortunately, the data source isn't very
clean. I've come up with some rules that I think will work to clean the
data, but I'm having trouble putting those rules into efficient SQL.
The table that I'm dealing with has just under 9M rows and I may need
to use similar logic on an even larger table, so I'd like something
that can be made efficient to some degree using indexes if necessary.

Here is some sample (simplified) code:

CREATE TABLE Coverage_Rates (
rate_id INT IDENTITY NOT NULL,
coverage_id INT NOT NULL,
start_date SMALLDATETIME NOT NULL,
end_date SMALLDATETIME NOT NULL,
rate MONEY NOT NULL )
GO

INSERT INTO Coverage_Rates VALUES (1, '2004-01-01', '2004-06-01',
40.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-03-01', '2004-08-01',
20.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-06-01', '2004-08-01',
30.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-01-01', '9999-12-31',
90.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-03-01', '2004-08-01',
20.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-08-01', '2004-08-01',
30.00)
GO
The rule is basically this... for any given period of time, for a
particular coverage, always use the coverage with the highest rate. So,
given the rows above, I would want the results to be:

coverage_id start_dt end_dt rate
----------- ---------- ---------- --------
1 2004-01-01 2004-06-01 40.00
1 2004-06-01 2004-08-01 30.00
2 2004-01-01 9999-12-31 90.00

There can be any combination of start and end dates in the source, but
in my final results I would like to be able to have only one distinct
row for any given time and coverage ID. So, given any date @my_date,

SELECT coverage_id, COUNT(*)
FROM <results>
WHERE @my_date >= start_dt
AND @my_date < end_dt
GROUP BY coverage_id
HAVING COUNT(*) > 1

the above query should return 0 rows.

Thanks for any help!

-Tom.

Jul 23 '05 #1

Subscribe Post Reply

2617

Thomas R. Hummel

Please change the start_date and end_date to DATETIME instead of
SMALLDATETIME. Sorry about that, I made the change after testing my
script but before doing the cut and paste.

-Tom.

Jul 23 '05 #2

--CELKO--

If you had avoided proprietary data types, pulled out that silly
non-relational IDENTITY, followed ISO-11179 naming rules, used a key on
the table and had some constraints, would the table look like this?

CREATE TABLE CoverageRates
(coverage_id CHAR(15) NOT NULL,
coverage_start_date DATETIME NOT NULL,
coverage_end_date DATETIME NOT NULL,
CHECK (coverage_start_date < coverage_end_date)
coverage_rate DECIMAL(5,3) NOT NULL,
PRIMARY KEY (coverage_id, coverage_start_date));

I am importing data that lists rates for particular coverages for a

particular period of time. Unfortunately, the data source isn't very
clean .. The table that I'm dealing with has just under 9M rows and I
may need to use similar logic on an even larger table ..<<

Let's start with a Calendar table -- just a simple list of dates. The
first shot at this is to build a daily rate card for all the coverages,
thus:

CREATE VIEW DailyCoverageRates (coverage_id, cal_date, coverage_rate)
AS
SELECT R.coverage_id, C.cal_date, MAX(coverage_rate)
FROM Calendar AS C, CoverageRates AS R
WHERE C.cal_date BETWEEN R.coverage_start_date
AND C.coverage_end_date
GROUP BY R.coverage_id, C.cal_date;

You could just use this and keep your stinky data.

I would add a MIN(coverage_rate) and scrub with this query:

SELECT R.coverage_id, C.cal_date,
MIN(coverage_rate) AS low_rate,
MAX(coverage_rate) AS high_rate
FROM Calendar AS C, CoverageRates AS R
WHERE C.cal_date BETWEEN R.coverage_start_date
AND C.coverage_end_date
GROUP BY R.coverage_id, C.cal_date
HAVING MIN(coverage_rate) < MAX(coverage_rate);

This will tell you where you have conflicts. Hey, if you got lucky,
there might be only a few conflicts that can be fixed by hand. Hey, I
might win the lottery!

Here is the query to group the daily rates into runs. Clustered index
on dates and coverage_id is a BIG help, if you can do it.

SELECT X.coverage_id, X.coverage_start_date,
MAX(X.coverage_end_date) AS coverage_end_date,
X.coverage_rate
FROM (SELECT S.coverage_id, MIN(S.cal_date),
E.cal_date, E.coverage_rate
FROM DailyCoverageRates AS S, DailyCoverageRates AS E
WHERE S.coverage_id = E.coverage_id
AND S.coverage_rate = E.coverage_rate
GROUP BY S.coverage_id, E.cal_date, E.coverage_rate)
AS X(coverage_id, coverage_start_date, coverage_end_date,
coverage_rate)
GROUP BY X.coverage_id, X.coverage_start_date, X.coverage_rate;

If you have the SQL-2005 OLAP stuff look at the OVER() operator from
SQL-99.

I do not say this often, but you might also consider a cursor to make a
single pass thru the data and build the ranges that way, since it is a
one-shot deal (I hope). .

Jul 23 '05 #3

Thomas R. Hummel

--CELKO-- wrote:

If you had avoided proprietary data types, pulled out that silly
non-relational IDENTITY, followed ISO-11179 naming rules, used a key on the table and had some constraints, would the table look like this?
It might, but if I were posting a sample to an MS SQL Server specific
newsgroup and my only concern was getting across the idea of my problem
then it would look exactly as I have it.
Let's start with a Calendar table -- just a simple list of dates. The first shot at this is to build a daily rate card for all the coverages, thus:

CREATE VIEW DailyCoverageRates (coverage_id, cal_date, coverage_rate)
AS
SELECT R.coverage_id, C.cal_date, MAX(coverage_rate)
FROM Calendar AS C, CoverageRates AS R
WHERE C.cal_date BETWEEN R.coverage_start_date
AND C.coverage_end_date
GROUP BY R.coverage_id, C.cal_date;

You could just use this and keep your stinky data.
While this is an elegant and useful solution for a lot of problems like
this, unfortunately it turns my 9M rows into 13.5B rows (even if they
are virtual rows). Since I am writing a load process I don't actually
constrain the results, so I would be dealing with the full 13.5B rows,
which is too unwieldly. I've never seen a query with an estimate cost
of over 1M before though. :-)
I would add a MIN(coverage_rate) and scrub with this query:

SELECT R.coverage_id, C.cal_date,
MIN(coverage_rate) AS low_rate,
MAX(coverage_rate) AS high_rate
FROM Calendar AS C, CoverageRates AS R
WHERE C.cal_date BETWEEN R.coverage_start_date
AND C.coverage_end_date
GROUP BY R.coverage_id, C.cal_date
HAVING MIN(coverage_rate) < MAX(coverage_rate);

This will tell you where you have conflicts. Hey, if you got lucky,
there might be only a few conflicts that can be fixed by hand. Hey, I might win the lottery!
Using another method I've found that there are already far too many
conflicts for me to fix them manually.
Here is the query to group the daily rates into runs. Clustered index on dates and coverage_id is a BIG help, if you can do it.

SELECT X.coverage_id, X.coverage_start_date,
MAX(X.coverage_end_date) AS coverage_end_date,
X.coverage_rate
FROM (SELECT S.coverage_id, MIN(S.cal_date),
E.cal_date, E.coverage_rate
FROM DailyCoverageRates AS S, DailyCoverageRates AS E
WHERE S.coverage_id = E.coverage_id
AND S.coverage_rate = E.coverage_rate
GROUP BY S.coverage_id, E.cal_date, E.coverage_rate)
AS X(coverage_id, coverage_start_date, coverage_end_date,
coverage_rate)
GROUP BY X.coverage_id, X.coverage_start_date, X.coverage_rate;

If you have the SQL-2005 OLAP stuff look at the OVER() operator from
SQL-99.
I don't have this, but I'll look into OVER() anyway, thanks.
I do not say this often, but you might also consider a cursor to make a single pass thru the data and build the ranges that way, since it is a one-shot deal (I hope). .

Unfortunately, this is for a monthly load. This is one of those dreaded
situations where I have no control over the source data, but it will be
constantly changing and I need to import it into my database on a
monthly basis.

I'm currently looking into a method of using multiple passes over the
data - an insert followed by a series of updates/deletes. I'll compare
the performance of that with a cursor and see which works best until I
can convince the people in charge of my data source to fix their data.

Thanks,
-Tom.

Jul 23 '05 #4

Thomas R. Hummel

--CELKO-- wrote:

Here is the query to group the daily rates into runs. Clustered index on dates and coverage_id is a BIG help, if you can do it.

SELECT X.coverage_id, X.coverage_start_date,
MAX(X.coverage_end_date) AS coverage_end_date,
X.coverage_rate
FROM (SELECT S.coverage_id, MIN(S.cal_date),
E.cal_date, E.coverage_rate
FROM DailyCoverageRates AS S, DailyCoverageRates AS E
WHERE S.coverage_id = E.coverage_id
AND S.coverage_rate = E.coverage_rate
GROUP BY S.coverage_id, E.cal_date, E.coverage_rate)
AS X(coverage_id, coverage_start_date, coverage_end_date,
coverage_rate)
GROUP BY X.coverage_id, X.coverage_start_date, X.coverage_rate;

After looking at this more closely, I don't see how this gets ranges.
For example, using a series of numbers instead of sequential dates, if
I had: 1, 2, 4, 5, 6, 8, 9 then I would expect to have returned 1-2,
4-6, 8-9. Modifying the query above to fit the simpler model:

CREATE TABLE Series (my_int INT NOT NULL)
GO

INSERT INTO Series VALUES (1)
INSERT INTO Series VALUES (2)
INSERT INTO Series VALUES (4)
INSERT INTO Series VALUES (5)
INSERT INTO Series VALUES (6)
INSERT INTO Series VALUES (8)
INSERT INTO Series VALUES (9)
GO

SELECT SQ.start_my_int, MAX(end_my_int) AS end_my_int
FROM (SELECT MIN(S.my_int) AS start_my_int, E.my_int AS end_my_int
FROM Series S, Series E
GROUP BY E.my_int) AS SQ
GROUP BY SQ.start_my_int
GO

This returns only 1-9, which is clearly wrong. Did I miss something in
simplifying the query?

Thanks,
-Tom.

Jul 23 '05 #5

--CELKO--

You have nothing to tell you that you have anything missing. In the
case of numbers, it the formula that

(end -start +1) = COUNT of numbers between start and end.

I assume that every day had at least one rate and the problem was
finding more than one. Gaps make life harder and you probably want to
have a rule to fill them in.

Jul 23 '05 #6

Thomas R. Hummel

--CELKO-- wrote:

You have nothing to tell you that you have anything missing. In the
case of numbers, it the formula that

(end -start +1) = COUNT of numbers between start and end.

I assume that every day had at least one rate and the problem was
finding more than one. Gaps make life harder and you probably want to have a rule to fill them in.

Every day may have a rate, but the rate might go up and down, so that
the periods for one particular rate might not be contiguous. I don't
have access to a SQL Server at the moment to write and test a test case
here, but to expand on my example with numbers, let's add rates into
it:

CREATE TABLE Series (my_int INT NOT NULL, rate MONEY NOT NULL)
GO

INSERT INTO Series VALUES (1, 10.0)
INSERT INTO Series VALUES (2, 10.0)
INSERT INTO Series VALUES (3, 20.0)
INSERT INTO Series VALUES (4, 10.0)
INSERT INTO Series VALUES (5, 10.0)
INSERT INTO Series VALUES (6, 10.0)
INSERT INTO Series VALUES (7, 20.0)
INSERT INTO Series VALUES (8, 10.0)
INSERT INTO Series VALUES (9, 10.0)
GO

SELECT SQ.rate, SQ.start_num, MAX(SQ.end_num)
FROM (SELECT MIN(S.my_int) AS start_num, E.my_int AS end_num, E.rate
FROM Series AS S, Series AS E
WHERE S.rate = E.rate
GROUP BY E.my_int, E.rate) AS SQ
GROUP BY SQ.rate, SQ.start_num
GO

Now every number has a row and there are no gaps, but the same problem
arises since I will now see:

start_num end_num rate
--------- ------- -------
1 9 10.0000
3 7 20.0000

instead of:

start_num end_num rate
--------- ------- -------
1 2 10.0000
3 3 20.0000
4 6 10.0000
7 7 20.0000
8 9 10.0000
-Tom.

Jul 23 '05 #7

Erland Sommarskog

[posted and mailed]

Thomas R. Hummel (to********@hotmail.com) writes:

The rule is basically this... for any given period of time, for a
particular coverage, always use the coverage with the highest rate. So,
given the rows above, I would want the results to be:

coverage_id start_dt end_dt rate
----------- ---------- ---------- --------
1 2004-01-01 2004-06-01 40.00
1 2004-06-01 2004-08-01 30.00
2 2004-01-01 9999-12-31 90.00

Below are two variants. Since you objected to Celko's use of a calendar
table, I thought a little about and realised that you don't need a full
calendar table. You only need one with the actual dates in the table.
This is why there are two alternatives. The first builds a calendar by
coverage_id; the second builds a global calendar. The latter is probably
the better bet, if start and end dates are typically at the start of
months, and thus shared by many coverage ids. The latter solution
supplements the input with sentinel rows with a zero rate from 1900-01-01
to 9999-1231 for each coverage id. (The eager student who want to know
why, can get a hint by looking at the code for the first alternive.)

The calendar itself appears only as a derived table. However, there is a
temp table to hold intermediate results. Just to tease Celko :-) there is
an IDENTITY column in it, but it serves to make the next DELETE a little
easier. In SQL 2005, I would have used the ROW_NUMBER() function instead.
I've also added indexes on the table, with my idea of which will give
the best performance.

Note also that the final SELECT where we return the data, comes in two
different variations. They are not related to the choice of whether
the calendar is global or by coverage_id.

How well this scales, I have no idea. Thomas who have the data will
have to test that part. :-) There are a few more possible variations
on queries, that I hope Thomas will be able to work out, using the
queries as a starting point.

It goes without saying that there may be ugly cases I have overlooked.
I did check one thing: if there is a gap for a coverage_id, this is
handled well. Such a case is added to the repro below.

Finally, looking at Thomas's desired result, I realise that I may
be one day off with regards to end_date. I believe Thomas will be
able to handle that as well.

CREATE TABLE Coverage_Rates (
rate_id INT IDENTITY NOT NULL,
coverage_id INT NOT NULL,
start_date DATETIME NOT NULL,
end_date DATETIME NOT NULL,
rate MONEY NOT NULL )
GO
INSERT INTO Coverage_Rates VALUES (1, '2004-01-01', '2004-06-01', 40.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-03-01', '2004-08-01', 20.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-06-01', '2004-08-01', 30.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-09-01', '2004-10-01', 25.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-01-01', '9999-12-31', 90.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-03-01', '2004-08-01', 20.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-08-01', '2004-08-01', 30.00)
GO
CREATE TABLE #max_rates(ident int IDENTITY,
coverage_id int NOT NULL,
date datetime NOT NULL,
rate money NOT NULL,
PRIMARY KEY CLUSTERED (ident),
UNIQUE NONCLUSTERED (coverage_id, date))
go
-- First get max rate on all days where this a change, and the days
-- before and after the changes. Note that we must include the day
-- before a period as a day with a zero rate, in case there were no
-- rates at all this day.
INSERT #max_rates (coverage_id, date, rate)
SELECT C.coverage_id, C.date, coalesce(MAX(R.rate), 0)
FROM Coverage_Rates R
RIGHT JOIN
(SELECT coverage_id, date = start_date
FROM Coverage_Rates
UNION
SELECT coverage_id, date = dateadd(DAY, -1, start_date)
FROM Coverage_Rates
UNION
SELECT coverage_id, end_date
FROM Coverage_Rates
UNION
SELECT coverage_id, dateadd(DAY, +1, end_date)
FROM Coverage_Rates
WHERE end_date < '99991231') C
ON C.coverage_id = R.coverage_id
AND C.date BETWEEN R.start_date AND R.end_date
GROUP BY C.coverage_id, C.date
ORDER BY C.coverage_id, C.date
go
SELECT * FROM #max_rates ORDER BY ident
go
-- Delete adjacent rows with the same rate.
DELETE #max_rates
FROM #max_rates a
WHERE EXISTS (SELECT *
FROM #max_rates b
WHERE a.coverage_id = b.coverage_id
AND a.ident - 1 = b.ident
AND a.rate = b.rate)
go
SELECT * FROM #max_rates ORDER BY ident
go
-- And then transform #max_rates back to the Coverage_rates format.
SELECT a.coverage_id, start_date = a.date,
end_date = coalesce(dateadd(DAY, -1, b.date), '99991231'),
a.rate
FROM #max_rates a
LEFT JOIN #max_rates b
ON a.coverage_id = b.coverage_id
AND b.date = (SELECT MIN(c.date)
FROM #max_rates c
WHERE a.coverage_id = c.coverage_id
AND c.date > a.date)
WHERE a.rate > 0
ORDER BY a.coverage_id, a.date
go
DROP TABLE Coverage_Rates, #max_rates
------------------------------------------------------------------------
go
CREATE TABLE Coverage_Rates (
rate_id INT IDENTITY NOT NULL,
coverage_id INT NOT NULL,
start_date DATETIME NOT NULL,
end_date DATETIME NOT NULL,
rate MONEY NOT NULL )
GO
INSERT INTO Coverage_Rates VALUES (1, '2004-01-01', '2004-06-01', 40.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-03-01', '2004-08-01', 20.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-06-01', '2004-08-01', 30.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-09-01', '2004-10-01', 25.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-01-01', '9999-12-31', 90.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-03-01', '2004-08-01', 20.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-08-01', '2004-08-01', 30.00)
GO
CREATE TABLE #max_rates(ident int IDENTITY,
coverage_id int NOT NULL,
date datetime NOT NULL,
rate money NOT NULL,
PRIMARY KEY CLUSTERED (ident),
UNIQUE NONCLUSTERED (coverage_id, date))
go
INSERT #max_rates (coverage_id, date, rate)
SELECT R.coverage_id, C.date, MAX(R.rate)
FROM (SELECT coverage_id, start_date, end_date, rate
FROM Coverage_Rates
UNION
SELECT coverage_id, '19000101', '99991231', 0
FROM Coverage_Rates) AS R
JOIN (SELECT date = start_date
FROM Coverage_Rates
UNION
SELECT date = dateadd(DAY, -1, start_date)
FROM Coverage_Rates
UNION
SELECT end_date
FROM Coverage_Rates
UNION
SELECT dateadd(DAY, +1, end_date)
FROM Coverage_Rates
WHERE end_date < '99991231') C
ON C.date BETWEEN R.start_date AND R.end_date
GROUP BY R.coverage_id, C.date
ORDER BY R.coverage_id, C.date
go
SELECT * FROM #max_rates ORDER BY ident
go
DELETE #max_rates
FROM #max_rates a
WHERE EXISTS (SELECT *
FROM #max_rates b
WHERE a.coverage_id = b.coverage_id
AND a.ident - 1 = b.ident
AND a.rate = b.rate)
go
SELECT * FROM #max_rates ORDER BY ident
go
SELECT a.coverage_id, start_date = a.date,
end_date = coalesce(dateadd(DAY, -1,
(SELECT MIN(b.date) FROM #max_rates b
WHERE a.coverage_id = b.coverage_id
AND b.date > a.date)), '99991231'),
a.rate
FROM #max_rates a
WHERE a.rate > 0
ORDER BY a.coverage_id, a.date
go
DROP TABLE Coverage_Rates, #max_rates

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp

Jul 23 '05 #8

Thomas R. Hummel

Thanks for the reply Erland!

I had tried something similar at the end of last week, using only the
necessary dates for my calendar, although I actually created a calendar
table and selected the dates into it rather than using a derived table.
Performance with that method was still pretty bad, but then late on
Friday I was told that the business rules were slightly different from
what I was originally told.

Now they have decided that the coverage rate with the later start date
always has precedence when there is an overlap. If two rates have the
same exact start date then I have to use the higher rate. With these
new business rules I was able to quickly come up with an acceptable
solution using a view and insert:

-- This view will narrow down any coverage rates with identical start
dates
-- Priority is given to the higher rate, but if those are the same then
-- priority is given to the longer-running rate. DISTINCT eliminates
exact
-- duplicates
CREATE VIEW Coverage_Rates_Deduped
AS
SELECT DISTINCT
T1.coverage_id,
T1.start_date,
T1.end_date,
T1.rate
FROM Coverage_Rates T1
LEFT OUTER JOIN Coverage_Rates T2 ON T2.coverage_id = T1.coverage_id
AND T2.start_date = T1.start_date
AND (T2.rate > T1.rate
OR (T2.rate = T1.rate AND
T2.end_date > T1.end_date))
WHERE T2.coverage_id IS NULL
GO

SELECT T1.coverage_id,
T1.start_date,
ISNULL((SELECT DATEADD(dy, -1, MIN(T2.start_date))
FROM Coverage_Rates_Deduped T2
WHERE T2.coverage_id = T1.coverage_id
AND T2.start_date > T1.start_date
AND T2.start_date < T1.end_date), T1.end_date),
T1.rate
FROM Coverage_Rates_Deduped T1
GO

Of course, the results will no longer match what was in my original
post because the business rules have been changed.

Thanks again though for the time spent on your response Erland, and I
will file that trick away for the future.

Thanks,
-Tom.

Jul 23 '05 #9

Erland Sommarskog

Thomas R. Hummel (to********@hotmail.com) writes:

I had tried something similar at the end of last week, using only the
necessary dates for my calendar, although I actually created a calendar
table and selected the dates into it rather than using a derived table.
Performance with that method was still pretty bad, but then late on
Friday I was told that the business rules were slightly different from
what I was originally told.

Now they have decided that the coverage rate with the later start date
always has precedence when there is an overlap. If two rates have the
same exact start date then I have to use the higher rate. With these
new business rules I was able to quickly come up with an acceptable
solution using a view and insert:

Ain't life great! First you put a lot of effort into a tricky problem -
and then you tell that you've been working from the wrong presumptions.
It was fortunate that the new rules made the problem simpler this time.
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp

Jul 23 '05 #10

Similar topics