This was an example in SQL FOR SMARTIES about ten years ago.

24.01. Finding Sub-regions of Size (n)

This example is adapted from SQL and Its Applications (Lorie and

Daudenarde 1991). You are given a table of theater seats, defined by

CREATE TABLE Theater

(seat_nbr INTEGER NOT NULL PRIMARY KEY, -- sequencing number

occupancy_status CHAR(1) NOT NULL -- values

CONSTRAINT valid_occupancy_status

CHECK (occupancy_status IN ('A', 'S'));

where an occupancy_status code of 'A' means available and 'S' means

sold. Your problem is to write a query that will return the subregions

of (n) consecutive seats still available. Assume that consecutive

seat_nbrs means that the seats are also consecutive for a moment,

ignoring rows of seating where seat_nbr(n) and seat_nbr((n) + 1) might

be on different physical theater rows. For (n) = 3, we can write a

self-JOIN query, thus:

SELECT T1.seat_nbr, T2.seat_nbr, T3.seat_nbr

FROM Theater AT T1, Theater AT T2, Theater AT T3

WHERE T1.occupancy_status = 'A'

AND T2.occupancy_status = 'A'

AND T3.occupancy_status = 'A'

AND T2.seat_nbr = T1.seat_nbr + 1

AND T3.seat_nbr = T2.seat_nbr + 1;

The trouble with this answer is that it works only for (n = 3) and for

nothing else. This pattern can be extended for any (n), but what we

really want is a generalized query where we can use (n) as a parameter

to the query.

The solution given by Lorie and Daudenarde starts with a given seat_nbr

and looks at all the available seats between it and ((n) - 1) seats

further up. The real trick is switching from the English-language

statement "All seats between here and there are available" to the

passive-voice version, "Available is the occupancy_status of all the

seats between here and there", so that you can see the query.

SELECT seat_nbr, ' thru ', (seat_nbr + (:(n) - 1))

FROM Theater AS T1

WHERE occupancy_status = 'A'

AND 'A' = ALL (SELECT occupancy_status

FROM Theater AS T2

WHERE T2.seat_nbr > T1.seat_nbr

AND T2.seat_nbr <= T1.seat_nbr + (:(n) - 1));

Please notice that this returns subregions. That is, if seats

(1, 2, 3, 4, 5) are available, this query will return (1, 2, 3),

(2, 3, 4), and (3, 4, 5) as its result set.

24.02. Numbering Regions

Instead of looking for a region, we want to number the regions in the

order in which they appear. For example, given a view or table with a

payment history we want to break it into grouping of behavior, say

whether or not the payments were on time or late.

CREATE TABLE PaymentHistory

(payment_nbr INTEGER NOT NULL PRIMARY KEY,

paid_on_time CHAR(1) DEFAULT 'Y' NOT NULL

CHECK(paid_on_time IN ('Y', 'N')));

INSERT INTO PaymentHistory

VALUES (1006, 'Y'), (1005, 'Y'),

(1004, 'N'),

(1003, 'Y'), (1002, 'Y'), (1001, 'Y'),

(1000, 'N');

The results we want assign a grouping number to each run of

on-time/late payments, thus.

Results

grping payment_nbr paid_on_time

===============================

1 1006 'Y'

1 1005 'Y'

2 1004 'N'

3 1003 'Y'

3 1002 'Y'

3 1001 'Y'

4 1000 'N'

A solution by Hugo Kornelis depends on the payments always being

numbered consecutively.

SELECT (SELECT COUNT(*)

FROM PaymentHistory AS H2,

PaymentHistory AS H3

WHERE H3.payment_nbr = H2.payment_nbr + 1

AND H3.paid_on_time <> H2.paid_on_time

AND H2.payment_nbr >= H1.payment_nbr) + 1 AS grping,

payment_nbr, paid_on_time

FROM PaymentHistory AS H1;

This can be modified for more types of behavior.

24.03. Finding Regions of Maximum Size

A query to find a region, rather than a subregion of a known size, of

seats was presented in SQL Forum (Rozenshtein, Abramovich, and Birger

1993).

SELECT T1.seat_nbr, ' thru ', T2.seat_nbr

FROM Theater AS T1, Theater AS T2

WHERE T1.seat_nbr < T2.seat_nbr

AND NOT EXISTS

(SELECT *

FROM Theater AS T3

WHERE (T3.seat_nbr BETWEEN T1.seat_nbr AND T2.seat_nbr

AND T3.occupancy_status <> 'A')

OR (T3.seat_nbr = T2.seat_nbr + 1

AND T3.occupancy_status = 'A')

OR (T3.seat_nbr = T1.seat_nbr - 1

AND T3.occupancy_status = 'A'));

The trick here is to look for the starting and ending seats in the

region. The starting seat_nbr of a region is to the right of a sold

seat_nbr and the ending seat_nbr is to the left of a sold seat_nbr. No

seat_nbr between the start and the end has been sold.

If you only keep the available seat_nbr's in a table, the solution is a

bit easier. It is also a more general problem that applies to any

table of sequential, possibly non-contiguous, data:

CREATE TABLE AvailableSeating

(seat_nbr INTEGER NOT NULL

CONSTRAINT valid_seat_nbr

CHECK (seat_nbr BETWEEN 001 AND 999));

INSERT INTO Seatings

VALUES (199), (200), (201), (202), (204),

(210), (211), (212), (214), (218);

where you need to create a result which will show the start and finish

values of each sequence in the table, thus:

Results

start finish

============

199 202

204 204

210 212

214 214

218 218

This is a common way of finding the missing values in a sequence of

tickets sold, unaccounted for invoices and so forth. Imagine a number

line with closed dots for the numbers that are in the table and open

dots for the numbers that are not. What you see about a sequence?

Well, we can start with a fact that anyone who has done inventory

knows. The number of elements in a sequence is equal to the ending

sequence number minus the starting sequence number plus one. This is a

basic property of ordinal numbers:

(finish - start + 1) = Length of open seats

This tells us that we need to have a self-JOIN with two copies of the

table, one for the starting value and one for the ending value of each

sequence. Once we have those two items, we can compute the length with

our formula and see if it is equal to the count of the items between

the start and finish.

SELECT S1.seat_nbr, MAX(S2.seat_nbr) -- start and rightmost item

FROM AvailableSeating AS S1

INNER JOIN

AvailableSeating AS S2 -- self-join

ON S1.seat_nbr <= S2.seat_nbr

AND (S2.seat_nbr - S1.seat_nbr + 1) -- formula for length

= (SELECT COUNT(*) -- items in the sequence

FROM AvailableSeating AS S3

WHERE S3.seat_nbr BETWEEN S1.seat_nbr AND S2.seat_nbr)

AND NOT EXISTS (SELECT *

FROM AvailableSeating AS S4

WHERE S1.seat_nbr - 1 =

S4.seat_nbr)

GROUP BY S1.seat_nbr;

Finally, we need to be sure that we have the furthest item to the right

as the end item. Each sequence of (n) items has (n) sub-sequences that

all start with the same item. So we finally do a GROUP BY on the

starting item and use a MAX() to get the rightmost value.

However, there is a faster version with three tables. This solution is

based on another property of the longest possible sequences. If you

look to the right of the last item, you do not find anything.

Likewise, if you look to the left of the first item, you do not find

anything either. These missing items that are "just over the border"

define a sequence by framing it. There also cannot be any "gaps" --

missing items -- inside those borders. That translates into SQL as:

SELECT S1.seat_nbr, MIN(S2.seat_nbr) --start and leftmost border

FROM AvailableSeating AS S1, AvailableSeating AS S2

WHERE S1.seat_nbr <= S2.seat_nbr

AND NOT EXISTS -- border items of the sequence

(SELECT *

FROM AvailableSeating AS S3

WHERE S3.seat_nbr NOT BETWEEN S1.seat_nbr AND S2.seat_nbr

AND (S3.seat_nbr = S1.seat_nbr - 1

OR S3.seat_nbr = S2.seat_nbr + 1))

GROUP BY S1.seat_nbr;

We do not have to worry about getting the rightmost item in the

sequence, but we do have to worry about getting the leftmost border.

Once we do a GROUP BY, but use a MIN() to get what we want.

Since the second approach uses only three copies of the original table,

it should be a bit faster. Also, the EXISTS() predicates can often

take advantage of indexing and thus run faster than subquery

expressions which require a table scan.

Michel Walsh came up with two novel ways of getting the range of seat

numbers that have been used in the table. He saw that the difference

between the value and its rank is a constant for all values in the same

consecutive sequence, so we just have to group, and count, on the value

minus its rank to get the various consecutive runs (or just keep the

maximum). It is so simple, an example will show every thing:

data = {1, 2, 5, 6, 7, 8, 9, 11, 12, 22}

data rank (data_rank) AS absent

================================

1 1 0

2 2 0

5 3 2

6 4 2

7 5 2

8 6 2

9 7 2

11 8 3

12 9 3

22 10 12

absent COUNT(*)

================

0 2

2 5

3 2

12 1

so, the maximum contiguous sequence is 5 (for rows having (data - rank)

= 2). Rank is defined as how many values are less or equal to the

actual value, with the assumption of a set of integers without repeated

values. This is the query.

SELECT X.absent, COUNT(*)

FROM (SELECT my_data,

(SELECT COUNT(*)

FROM Foobar AS F2

WHERE F2.my_data <= F1.my_data),

(SELECT COUNT(*)

FROM Foobar AS F2

WHERE F2.my_data <= F1.my_data) - F1.my_data

FROM Foobar AS F1)

AS X(my_data, rank, absent);

Playing with this basic idea, Mr. Walsh came up with this second query.

SELECT MIN(Z.seat_nbr), MAX(Z.seat_nbr)

FROM (SELECT S1.seat_nbr,

S1.seat_nbr

- (SELECT COUNT(*)

FROM Seating AS S2

WHERE S2.seat_nbr <= S1.seat_nbr)

FROM Seating AS S1)

AS Z (seat_nbr, dif_rank)

GROUP BY Z.dif_rank;

The derived table finds the lengths of the blocks of seats to the left

of each seat_nbr and uses that length to form groups.