Conditional summaring

Ilik

Hi all
I'm trying to create a summery table of rain data that holds a record
for every 0.1mm of rain in the following format:

Station Name, Time, Value
A, 2002-12-03 14:44:41.000, 0.1
A, 2002-12-03 14:45:45.000, 0.1
A, 2002-12-03 14:49:45.000, 0.1
A, 2002-12-09 05:30:35.000, 0.1
A, 2002-12-09 05:30:37.000, 0.1
A, 2002-12-09 05:33:05.000, 0.1
B, 2002-12-09 05:32:47.000, 0.1
B, 2002-12-09 05:33:00.000, 0.1
B, 2002-12-09 05:35:00.000, 0.1
B, 2002-12-09 05:37:48.000, 0.1
....
B, 2003-02-09 01:32:47.000, 0.1
B, 2003-02-09 05:32:47.000, 0.1

The problem is that I need to count rain events for each station and
produce the starting and ending time of the event (based on the Time
field) and to summarize the amount of rain that fell during the event.
An "event" for that matter is a case where there is a gap of at least
12 hours with no rain before and after a series of rain measurements
(i.e., a rain storm). The data in the table is written in such a way
that each row hold the time where an amount on 0.1mm of rain was
counted by the rain gauge.

For example, the above date should produce:

Station Name, Start Time, End Time, Total rain
A, 2002-12-03 14:44:41.000, 2002-12-03 14:49:45.000, 0.3
A, 2002-12-09 05:30:35.000, 2002-12-09 05:33:05.000, 0.3
B, 2002-12-09 05:32:47.000, 2002-12-09 05:37:48.000, 0.3
B, 2003-02-09 01:32:47.000, 2003-02-09 05:32:47.000, 0.2

As a newbie to SQL I don't know if it can be done.

Many thanks,
Ilik

Jul 23 '05 #1

Subscribe Post Reply

1990

Hugo Kornelis

On 14 Feb 2005 08:00:24 -0800, Ilik wrote:

Hi all
I'm trying to create a summery table of rain data that holds a record
for every 0.1mm of rain in the following format:

Station Name, Time, Value
A, 2002-12-03 14:44:41.000, 0.1
A, 2002-12-03 14:45:45.000, 0.1
A, 2002-12-03 14:49:45.000, 0.1
A, 2002-12-09 05:30:35.000, 0.1
A, 2002-12-09 05:30:37.000, 0.1
A, 2002-12-09 05:33:05.000, 0.1
B, 2002-12-09 05:32:47.000, 0.1
B, 2002-12-09 05:33:00.000, 0.1
B, 2002-12-09 05:35:00.000, 0.1
B, 2002-12-09 05:37:48.000, 0.1
...
B, 2003-02-09 01:32:47.000, 0.1
B, 2003-02-09 05:32:47.000, 0.1

The problem is that I need to count rain events for each station and
produce the starting and ending time of the event (based on the Time
field) and to summarize the amount of rain that fell during the event.
An "event" for that matter is a case where there is a gap of at least
12 hours with no rain before and after a series of rain measurements
(i.e., a rain storm). The data in the table is written in such a way
that each row hold the time where an amount on 0.1mm of rain was
counted by the rain gauge.

For example, the above date should produce:

Station Name, Start Time, End Time, Total rain
A, 2002-12-03 14:44:41.000, 2002-12-03 14:49:45.000, 0.3
A, 2002-12-09 05:30:35.000, 2002-12-09 05:33:05.000, 0.3
B, 2002-12-09 05:32:47.000, 2002-12-09 05:37:48.000, 0.3
B, 2003-02-09 01:32:47.000, 2003-02-09 05:32:47.000, 0.2

As a newbie to SQL I don't know if it can be done.

Hi Ilik,

It can be done, but it's quite complicated. You have to use several
tricks, then combine them into one query.

Trick 1: Find the start of a rain period. Each row for which no other row
exists with a time value in the preceding 12 hours marks the start of a
rain period. You can find these using NOT EXISTS or using an OUTER JOIN.
SELECT Station
, Time
FROM MyTable AS a
WHERE NOT EXISTS (SELECT *
FROM MyTable AS b
WHERE b.Station = a.Station
AND b.Time < a.Time
AND b.Time >= DATEADD(hour, -12, a.Time))
or
SELECT a.Station
, a.Time
FROM MyTable AS a
LEFT OUTER JOIN MyTable AS b
ON b.Station = a.Station
AND b.Time < a.Time
AND b.Time >= DATEADD(hour, -12, a.Time)
I'll use the NOT EXISTS version in the remainder of this message; if
performance is important for you, I'd advise you to test both versions.

Trick 2: Find the end of a rain period. This is basically the same
technique as trick 1.

Trick 3: Match up the start and end times of each rain period. For each
start, the matching end is the FIRST of all end's that occur AFTER that
start.
SELECT a.Station
, a.Time AS StartTime
, MIN(b.Time) AS EndTime
FROM MyTable AS a
INNER JOIN MyTable AS b
ON b.Station = a.Station
AND b.Time > a.Time
WHERE NOT EXISTS (SELECT *
FROM MyTable AS c
WHERE c.Station = a.Station
AND c.Time < a.Time
AND c.Time >= DATEADD(hour, -12, a.Time))
AND NOT EXISTS (SELECT *
FROM MyTable AS d
WHERE d.Station = b.Station
AND d.Time > b.Time
AND d.Time <= DATEADD(hour, +12, b.Time))
GROUP BY a.Station
, a.Time

Trick 4: Find all rows between start and end time of a rain period. This
can be done by using the result of the previous step as a derived table
and joining that to the original table, but in this case, I decided to
extend the result of the previous step with one correlated subquery.
SELECT a.Station
, a.Time AS StartTime
, MIN(b.Time) AS EndTime
,(SELECT SUM(e.Value)
FROM MyTable AS e
WHERE e.Station = a.Station
AND e.Time >= a.Time
AND e.Time <= MIN(b.Time)) AS TotalRain
FROM MyTable AS a
INNER JOIN MyTable AS b
ON b.Station = a.Station
AND b.Time > a.Time
WHERE NOT EXISTS (SELECT *
FROM MyTable AS c
WHERE c.Station = a.Station
AND c.Time < a.Time
AND c.Time >= DATEADD(hour, -12, a.Time))
AND NOT EXISTS (SELECT *
FROM MyTable AS d
WHERE d.Station = b.Station
AND d.Time > b.Time
AND d.Time <= DATEADD(hour, +12, b.Time))
GROUP BY a.Station
, a.Time

Note: all the queries above are untested. To get tested replies, you need
to post SQL that will recreate your tables and data on my system, as shown
in www.aspfaq.com/5006.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

Jul 23 '05 #2

Erland Sommarskog

[posted and mailed, please reply in news]

Ilik (le***@walla.co.il) writes:

The problem is that I need to count rain events for each station and
produce the starting and ending time of the event (based on the Time
field) and to summarize the amount of rain that fell during the event.
An "event" for that matter is a case where there is a gap of at least
12 hours with no rain before and after a series of rain measurements
(i.e., a rain storm). The data in the table is written in such a way
that each row hold the time where an amount on 0.1mm of rain was
counted by the rain gauge.

By no means a trivial problem. As Hugo said, CREATE TABLE and INSERT
statements are alwyas a good thing. Then again, the table was easy
to compose in this case, as were the INSERT statement. And your
narrative and sample data, was very good. Thanks!

Here is a multi-step approach. Maybe not the most elegant, but it
gives the correct result. If data volumes are large, performance
could be an issue, though.

A particular note on the datediff exercise. I do datediff per minute,
not per hour. This is because datediff(HOUR, '00:15:00', '12:10:10')
is 12, although it's not fully 12 hours. datediff always counts
cross boundaries. Also, I found that coalesce needed special care.
Originally I had 19000101, and 99991231, but that gave me overflow.

CREATE TABLE data(station char(1) NOT NULL,
time datetime NOT NULL,
value float NOT NULL,
CONSTRAINT pk_data PRIMARY KEY (station, time))
go
INSERT data (station, time, value)
SELECT 'A', '20021203 14:44:41.000', 0.1 UNION
SELECT 'A', '20021203 14:45:45.000', 0.1 UNION
SELECT 'A', '20021203 14:49:45.000', 0.1 UNION
SELECT 'A', '20021209 05:30:35.000', 0.1 UNION
SELECT 'A', '20021209 05:30:37.000', 0.1 UNION
SELECT 'A', '20021209 05:33:05.000', 0.1 UNION
SELECT 'B', '20021209 05:32:47.000', 0.1 UNION
SELECT 'B', '20021209 05:33:00.000', 0.1 UNION
SELECT 'B', '20021209 05:35:00.000', 0.1 UNION
SELECT 'B', '20021209 05:37:48.000', 0.1 UNION
SELECT 'B', '20030209 01:32:47.000', 0.1 UNION
SELECT 'B', '20030209 05:32:47.000', 0.1 UNION
SELECT 'B', '20030209 18:32:47.000', 0.1 UNION
SELECT 'B', '20030212 05:32:47.000', 0.1
go
CREATE TABLE #temp (station char(1) NOT NULL,
time datetime NOT NULL,
n int NOT NULL,
starttime datetime NULL,
endtime datetime NULL,
value float NOT NULL,
PRIMARY KEY (station, n))
go
INSERT #temp (station, time, value, n)
SELECT a.station, a.time, a.value,
(SELECT COUNT(*) + 1 FROM data b
WHERE a.station = b.station AND a.time > b.time)
FROM data a
go
UPDATE b
SET starttime = CASE WHEN datediff(minute,
coalesce(a.time, dateadd(DAY, -1, b.time)),
b.time) > 60*12
THEN b.time
END,
endtime = CASE WHEN datediff(minute,
b.time,
coalesce(c.time, dateadd(DAY, 1, b.time))) > 60*12
THEN b.time
END
FROM #temp b
LEFT JOIN #temp a ON a.station = b.station AND b.n - 1 = a.n
LEFT JOIN #temp c ON b.station = c.station AND b.n + 1 = c.n
go
UPDATE #temp
SET starttime = (SELECT MAX(b.starttime)
FROM #temp b
WHERE b.station = a.station
AND b.starttime < a.time)
FROM #temp a
WHERE a.starttime IS NULL
go
SELECT station, starttime, MAX(endtime), SUM(value)
FROM #temp
GROUP BY station, starttime
go
DROP TABLE #temp
DROP TABLE data
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp

Jul 23 '05 #3

Hugo Kornelis

On Tue, 15 Feb 2005 21:28:13 +0000 (UTC), Erland Sommarskog wrote:

By no means a trivial problem. As Hugo said, CREATE TABLE and INSERT
statements are alwyas a good thing. Then again, the table was easy
to compose in this case, as were the INSERT statement.
Hi Erland,

You're absolutely right. I guess this just shows that I'm lazy ;-)

I just stole your work to test my query and found one small error. The
corrected version (adapted to the table and column names you've chosen) is

SELECT a.station
, a.time AS Starttime
, MIN(b.time) AS Endtime
,(SELECT SUM(e.value)
FROM data AS e
WHERE e.station = a.station
AND e.time >= a.time
AND e.time <= MIN(b.time)) AS TotalRain
FROM data AS a
INNER JOIN data AS b
ON b.station = a.station
AND b.time >= a.time -- The change is in this line
WHERE NOT EXISTS (SELECT *
FROM data AS c
WHERE c.station = a.station
AND c.time < a.time
AND c.time >= DATEADD(hour, -12, a.time))
AND NOT EXISTS (SELECT *
FROM data AS d
WHERE d.station = b.station
AND d.time > b.time
AND d.time <= DATEADD(hour, +12, b.time))
GROUP BY a.station
, a.time
And your
narrative and sample data, was very good. Thanks!
I can only second that.

Here is a multi-step approach. Maybe not the most elegant, but it
gives the correct result. If data volumes are large, performance
could be an issue, though.
I ran some quick tests for a first impression. Based on the difference
from start to end time for both queries, using the sample data provided,
it was too close to call. Increasing the amount of test data (copying the
insert two more times and changing the station names to C, D, E and F)
resulted in a small advantage for my query, but with so little test data,
that is not really saying anything.

With set statistics io on, my query used

Table 'data'. Scan count 151, logical reads 302, physical reads 0,
read-ahead reads 0.

And your code used

Table '#temp__(snip)'. Scan count 0, logical reads 86, physical reads 0,
read-ahead reads 0.
Table 'data'. Scan count 43, logical reads 86, physical reads 0,
read-ahead reads 0.

Table '#temp__(snip)'. Scan count 85, logical reads 254, physical reads 0,
read-ahead reads 0.
Table 'Worktable'. Scan count 1, logical reads 44, physical reads 0,
read-ahead reads 0.

Table '#temp__(snip)'. Scan count 25, logical reads 98, physical reads 0,
read-ahead reads 0.
Table 'Worktable'. Scan count 1, logical reads 26, physical reads 0,
read-ahead reads 0.

Table '#temp__(snip)'. Scan count 1, logical reads 2, physical reads 0,
read-ahead reads 0.

Adding up these numbers shows that your version appears to do quite some
more work than my version.

I'd be interested to hear from Ilik how the versions compare on his
system, with real data.

A particular note on the datediff exercise. I do datediff per minute,
not per hour. This is because datediff(HOUR, '00:15:00', '12:10:10')
is 12, although it's not fully 12 hours. datediff always counts
cross boundaries. Also, I found that coalesce needed special care.
Originally I had 19000101, and 99991231, but that gave me overflow.

Why didn't you use DATEADD(hour, 12, a.time) > b.time, then? Wouldn't that
have solved all these issues?

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

Jul 23 '05 #4

--CELKO--

I have to walk two dogs and install a Pergo floor this morning, so I
cannot work out the code in detail.

Can I assume that the measurements are collected by device that is
giving you a collection time? This means the data comes over time, and
it is a few microseconds behind the current time.

Instead of modeling the data collection form, model the final results:

CREATE TABLE Rainfall
(station_name CHAR(5) NOT NULL,
start_time DATETIME NOT NULL,
end_time DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
rain_accumulation DECIMAL (5,2) DEFAULT 0.00 NOT NULL,
CHECK (start_time < end_time),
PRIMARY KEY (station_name, start_time));

When a new reading comes in:

1) If it is within 12 hours of the start_time of the current row (i.e.
MAX(end_time) , then add 0.1 to the rain_accumulation and the end_time
CURRENT_TIMESTAMP in an UPDATE.

2) If it is more than 12 hours after the start_time of the current row,
then

2.1) update end_time to CURRENT_TIMESTAMP to close out the current
period.
2.2) insert new row with the new start_time, etc.

Something like this in pseudo-code.

CREATE PROCEDURE PostNewReading(@my_station_name, @new_reading_time)
AS
BEGIN
UPDATE Rainfall
SET end_time
= CASE WHEN <<less than 12 hours
FROM most recent>>
THEN @new_reading_time
ELSE end_time END,
rain_accumulation
= CASE WHEN <<less than 12 hours
FROM most recent>>
THEN rain_accumulation + 0.1
ELSE rain_accumulation END;

INSERT INTO Rainfall
SELECT @my_station_name, @reading_time,
CURRENT_TIMESTAMP, 0.1
FROM Rainfall
WHERE <<reading more than 12 hours
FROM most recent>>;
END;

The idea is to put the procedural logic into CASE expressions in an
UPDATE and an INSERT statement. Use ELSE clauses to leave data
unchanged. No proprietary T-SQL control of flow stuff at all. No
tricky self-joins or cursors.

Jul 23 '05 #5

Ilik

Hi all
First, my sincere apologies for not replying earlier, but I was busy
over my head in the last couple of days with other projects.
In reply to CELKO, you're assuming that the data is being collected at
real time, but this is not the case. The data goes back a few decades
in some cases (starting at 1965), so I cannot evaluate it on entry.

As for Erland's solution, I run it only for a few minutes because of
his remark that performance can be an issue there. I tried running the
Hugo's query with the "NOT EXISTS" option. It run for two hours before
I stopped it... The second option ("LEFT OUTER JOIN ") I stopped after
15min .....
The table holding the data has over 1.3 million records, that besides
rain gauge data also has other data types (e.g., water conductivity,
water level, water chemistry etc.) in the "Class_Name" column. I didn't
mention this in my original post because I thought I could deal with it
after I'll get your help with the basic query, but now I see it is much
more complex then I thought. I added this to Hugo's query as you can
see below. So the efficiency of the query is an important issue here.
If you have any ideas on how to make it run faster?

Here is the query as I run it with both options:

--START
SELECT a.Location_Name
, a.Time AS StartTime
, MIN(b.Time) AS EndTime
,(SELECT SUM(e.Value)
FROM V_Cell AS e
WHERE e.Location_Name =
a.Location_Name
AND e.Class_Name = 'Rain'
AND e.Time >= a.Time
AND e.Time <= MIN(b.Time)) AS
TotalRain
FROM V_Cell AS a
INNER JOIN V_Cell AS b
ON b.Location_Name = a.Location_Name
AND b.Class_Name = 'Rain'
AND b.Time >= a.Time
-- WHERE NOT EXISTS (SELECT *
-- FROM V_Cell AS c
-- WHERE c.Location_Name =
a.Location_Name
LEFT OUTER JOIN V_Cell AS c
ON c.Location_Name = a.Location_Name
AND c.Class_Name = 'Rain'
AND c.Time < a.Time
AND c.Time >= DATEADD(hour, -12, a.Time)
-- AND NOT EXISTS (SELECT *
-- FROM V_Cell AS d
-- WHERE d.Location_Name =
b.Location_Name
LEFT OUTER JOIN V_Cell AS d
ON d.Location_Name = b.Location_Name
AND d.Class_Name = 'Rain'
AND d.Time > b.Time
AND d.Time <= DATEADD(hour, +12, b.Time)
GROUP BY a.Location_Name
, a.Time

--END

Many thanks,
Ilik

Jul 23 '05 #6

Hugo Kornelis

On 17 Feb 2005 01:06:22 -0800, Ilik wrote:

(snip)

If you have any ideas on how to make it run faster?

Hi Ilik,

I don't think there's much room for simplification of the query, so I'd
look into indexes. I don't expect my query to return results in a fraction
of a second, but two hours appears to be too much. I'm afraid that you
have no indexes on your table that can be used for this query.

In order for us to advise on this, we need to know the actual table
structure (as CREATE TABLE statement, **INCLUDING** all constraints [esp
primary key and unique!]) and all extra indexes you might have defined.
You might wish to run sp_help V_Cell to get a full report on all indexes
and constraints.

Another thing you could do is to run the index tuning wizard (ITW): load
my query in a QA window, select "Query" / "Index Tuning Wizard" and see
what it advises you. ITW is certainly not always perfect, but it never
hurts to check out what it comes up with.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

Jul 23 '05 #7

Erland Sommarskog

Ilik (le***@walla.co.il) writes:

As for Erland's solution, I run it only for a few minutes because of
his remark that performance can be an issue there. I tried running the
Hugo's query with the "NOT EXISTS" option. It run for two hours before
I stopped it... The second option ("LEFT OUTER JOIN ") I stopped after
15min .....
The table holding the data has over 1.3 million records, that besides
rain gauge data also has other data types (e.g., water conductivity,
water level, water chemistry etc.) in the "Class_Name" column. I didn't
mention this in my original post because I thought I could deal with it
after I'll get your help with the basic query, but now I see it is much
more complex then I thought. I added this to Hugo's query as you can
see below. So the efficiency of the query is an important issue here.
If you have any ideas on how to make it run faster?

Your data volumes are certainly in par with what I expected. (No one
would have a table like that with only a few thousands of rows!) But
without test data available, it's difficult to post a solution that
tested for performance.

For my query batch, it could be an idea to let it run for a little
longer, and see what actually takes time.

Certainly there is all reason to review the indexing of the table, and
as Hugo said, posting the complete CREATE TABLE and CREATE INDEX
statement for the table may help.

One solution that cannot be ruled out completely, is to run a cursor
over the lot. That is by no means not going to be blinding fast, but
you can quite soon get a grip of when it will complete. (As you would
be able to see how much have been processed.) But that would really be
a last-ditch attempt if nothing else works.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp

Jul 23 '05 #8

--CELKO--

>> The data goes back a few decades in some cases (starting at 1965),
so I cannot evaluate it on entry .. The table holding the data has over
1.3 million records, that besides rain gauge data also has other data
types (e.g., water conductivity, water level, water chemistry etc.) in
the "Class_Name" column. <<

Ouch! I think is a case where I would go with a cursor, since it
sounds like you will do this only once. That gives you one scan thru
the whole table.

Any other method I can think of would use a self-join to get the start
and finish times of each collection period. Even with indexing,
self-joins take time.

Jul 23 '05 #9

Ronnie Chee

"--CELKO--" <jc*******@earthlink.net> wrote in message
news:11*********************@c13g2000cwb.googlegro ups.com...

The data goes back a few decades in some cases (starting at 1965),

so I cannot evaluate it on entry .. The table holding the data has over
1.3 million records, that besides rain gauge data also has other data
types (e.g., water conductivity, water level, water chemistry etc.) in
the "Class_Name" column. <<

Ouch! I think is a case where I would go with a cursor, since it
sounds like you will do this only once. That gives you one scan thru
the whole table.

Any other method I can think of would use a self-join to get the start
and finish times of each collection period. Even with indexing,
self-joins take time.

<Uninformed musings>
As others have stated, it is difficult to gauge the efficacy of the
suggestions without knowing more about the table structure.

Is it possible for the OP to extract the "Rain" records and create a new
table with appropriate indexing? He stated there are 1.3 million records
TOTAL but how many are "Rain" records and will it reduce the work to an
acceptable level for one of the aforementioned queries to complete?

How about a hybrid approach? What if you cursor-ed through the data by
"Station Name" and use one of the queries with the additional condition of
"Station Name"? I presume a table/list of the station names exist otherwise
you could "SELECT DISTINCT". As Hugo K. has demonstrated, the queries work
well enough for a small case which this would be especially if there is an
index on "station Name".
</Uninformed musings>

Jul 23 '05 #10

Ilik

Hi all,
Thanks for your suggestions.
After a long consideration with my DB admin, we've decided to use the
curser approach as you suggest.
Currently, our db holds only about 9,000 records of type Rain but this
number is expected to grow
in the future since only a small part of the data has been introduced
into the database.
So using an all-at-once query of the data will be a very slow process.
Since I'm fairly new to SQLServer, I don't know how to run a script on
it directly. But since anyway
I only need to produce a web-report, I'm using MathLab for this (thou
VB or asp could do the job just as easy).

again, many thanks for your help,
Ilik

Jul 23 '05 #11

Similar topics