Query help: need at least one row for each value in each of two columns

Ross Presser

The purpose, as you can probably guess, is to produce a set of sample
documents from a large document run. The data row has a CLUB column and an
IFC column; I want a set of samples that contains at least one of each CLUB
and at least one of each IFC, but no more than necessary.

Example schema and data:

CREATE TABLE mDATA (
ID INTEGER,
CLUB CHAR(7),
IFC CHAR(4)
);

INSERT INTO mDATA (ID,CLUB,IFC) values (6401715,'AARPRAT','IC17')
INSERT INTO mDATA (ID,CLUB,IFC) values (1058337,'AARPRAT','IC17')
INSERT INTO mDATA (ID,CLUB,IFC) values (459443,'AARPPRT','IC25')
INSERT INTO mDATA (ID,CLUB,IFC) values (4018210,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (2430656,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (6802081,'AARPPRD','IG29')
INSERT INTO mDATA (ID,CLUB,IFC) values (4236511,'AARPPRD','IG29')
INSERT INTO mDATA (ID,CLUB,IFC) values (2162104,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (2073679,'AARPPRD','IG29')
INSERT INTO mDATA (ID,CLUB,IFC) values (8148891,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (1868445,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (6749213,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (8363621,'AARPPUP','IG29')

Nov 12 '05 #1

Subscribe Post Reply

1570

David Portas

This will produce a sample containing at least one row of each value in the
two columns. It isn't guaranteed to give the *minimum* subset but it should
contain at most C + I - 1 rows, where C is the number of distinct Clubs and
I the number of distinct IFCs.

SELECT id,club,ifc
FROM mdata AS M
WHERE id
IN
(SELECT MIN(id)
FROM mdata
GROUP BY club
UNION ALL
SELECT MIN(id)
FROM mdata AS M2
GROUP BY ifc
HAVING NOT EXISTS
(SELECT *
FROM mdata
WHERE id IN
(SELECT MIN(id)
FROM mdata
GROUP BY club)
AND ifc = M2.ifc))

I'm assuming that ID is unique and that there are no NULLs in any of the
columns.

I'm not sure it's possible to find the minimum subset every time unless you
use an iterative solution.

--
David Portas
SQL Server MVP
--

Nov 12 '05 #2

Ross Presser

On Mon, 10 May 2004 17:29:20 +0100, David Portas wrote:

This will produce a sample containing at least one row of each value in the
two columns. It isn't guaranteed to give the *minimum* subset but it should
contain at most C + I - 1 rows, where C is the number of distinct Clubs and
I the number of distinct IFCs.
[snip]
I'm not sure it's possible to find the minimum subset every time unless you
use an iterative solution.

Thanks. Can you give me an idea as to what an iterative solution might look
like? I played around with the 6-step system below, and found that the
number of rows resulting depended on both the order of the insert queries
AND on whether I used MAX() or MIN(). I deduce that this so far is very
dependent on the exact data involved.

That said, the resulting row count only varied from 403 to 410, for a set
where C=325, I=117... so I'm bettering the worst-case (441 by your formula)
but not by a whole lot. Do you think it could theoretically be better?

(table name changed from mData to mmm; hope that's not confusing)

/* first step: unique clubs */
SELECT Min(mmm.RandomID) AS RandomID, mmm.CLUB, Min(mmm.IFC) AS IFC
INTO AAA
FROM mmm
GROUP BY mmm.CLUB
HAVING (((Count(*))=1));

/* second step: unique IFCs where club not already there*/
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS RandomID, Min(mmm.CLUB) AS CLUB, mmm.IFC AS IFC
FROM mmm
GROUP BY mmm.IFC
HAVING (((Min(mmm.CLUB)) Not In (select club from aaa)) AND
((Count(*))=1));

/* intermezzo: views for missing IFCs, missing Clubs */
CREATE VIEW A_MissingClubs AS
SELECT mmm.CLUB
FROM mmm LEFT JOIN AAA ON mmm.CLUB = AAA.CLUB
WHERE (((AAA.CLUB) Is Null))
GROUP BY mmm.CLUB;

CREATE VIEW A_MissingIFCs AS
SELECT mmm.IFC
FROM mmm LEFT JOIN AAA ON mmm.IFC = AAA.IFC
WHERE (((AAA.IFC) Is Null))
GROUP BY mmm.IFC;

/* third step: distinct missing clubs that are also missing IFCs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, mmm.CLUB, Min(mmm.IFC) AS IFC
FROM (mmm INNER JOIN A_MissingClubs ON mmm.CLUB = A_MissingClubs.CLUB)
INNER JOIN A_MissingIFCs ON mmm.IFC = A_MissingIFCs.IFC
GROUP BY mmm.CLUB;

/* fourth step: distinct missing IFCs that are also missing clubs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, Min(mmm.CLUB) AS CLUB, mmm.IFC
AS IFC
FROM (mmm INNER JOIN A_MissingClubs ON mmm.CLUB = A_MissingClubs.CLUB)
INNER JOIN A_MissingIFCs ON mmm.IFC = A_MissingIFCs.IFC
GROUP BY mmm.IFC;

/* fifth step: remaining missing IFCs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, Min(mmm.CLUB) AS CLUB, mmm.IFC
AS IFC
FROM mmm INNER JOIN A_MissingIFCs ON mmm.IFC = A_MissingIFCs.IFC
GROUP BY mmm.IFC;

/* sixth step: remaining missing CLUBs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, mmm.CLUB, Min(mmm.IFC) AS IFC
FROM mmm INNER JOIN A_MissingClubs ON mmm.CLUB = A_MissingClubs.CLUB
GROUP BY mmm.CLUB;

Nov 12 '05 #3

--CELKO--

I assume that you meant to have a real table, with keys and
constraints instead of what you posted, so let's fix it.

CREATE TABLE Mdata
(m_id INTEGER NOT NULL PRIMARY KEY,
club CHAR(7) NOT NULL,
ifc CHAR(4) NOT NULL);

CREATE TABLE Samples
(m_id INTEGER NOT NULL PRIMARY KEY,
club CHAR(7) NOT NULL,
ifc CHAR(4) NOT NULL);

DELETE FROM Mdata;
INSERT INTO Mdata VALUES (6401715, 'aarprat', 'ic17');
INSERT INTO Mdata VALUES (1058337, 'aarprat', 'ic17');
INSERT INTO Mdata VALUES (459443, 'aarpprt', 'ic25');
INSERT INTO Mdata VALUES (4018210, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (2430656, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (6802081, 'aarpprd', 'ig29');
INSERT INTO Mdata VALUES (4236511, 'aarpprd', 'ig29');
INSERT INTO Mdata VALUES (2162104, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (2073679, 'aarpprd', 'ig29');
INSERT INTO Mdata VALUES (8148891, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (1868445, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (6749213, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (8363621, 'aarppup', 'ig29');
INSERT INTO Mdata VALUES (9999, 'aarppup', 'ic17'); -- redudant

I want a set of samples that contains at least one of each CLUB

and at least one of each IFC, but no more than necessary. <<

There can be more than one minimal solution, but if we go with the
minimum m_id number, that will give us a start which we can prune.
INSERT INTO Samples
SELECT MIN(m_id), club, ifc
FROM Mdata
GROUP BY club, ifc;

=============================
9999 aarppup ic17 <== redundant row
1058337 aarprat ic17 <== ifc
459443 aarpprt ic25
1868445 aarpbas ig21
2073679 aarpprd ig29
8363621 aarppup ig29 <== club

a VALUES (9999, 'aarppup', 'ic17'); -- redudant

We can find the candidate rows for redundancy removal:

SELECT MIN(m_id), ifc FROM Sample
GROUP BY ifc
HAVING COUNT(*) > 1

and likewise extra clubs:

SELECT MIN(m_id), club FROM Sample
GROUP BY club
HAVING COUNT(*) > 1;

Put it together:

SELECT m_id
FROM (SELECT MIN(m_id) FROM Sample GROUP BY ifc
HAVING COUNT(*) > 1
UNION ALL
SELECT MIN(m_id)FROM Sample GROUP BY club
HAVING COUNT(*) > 1) AS R(m_id)
GROUP BY R.m_id
HAVING COUNT(*) > 1;

and the final nightmare query:

BEGIN
INSERT INTO Sample
SELECT MIN(m_id) AS m_id, club, ifc
FROM Mdata
GROUP BY club, ifc;
SELECT MIN(m_id) AS m_id, club, ifc
FROM Mdata
GROUP BY club, ifc
HAVING MIN(m_id)
NOT IN
(SELECT R.m_id
FROM (SELECT MIN(m_id) FROM Sample GROUP BY ifc
HAVING COUNT(*) > 1
UNION ALL
SELECT MIN(m_id)FROM Sample GROUP BY club
HAVING COUNT(*) > 1) AS R(m_id)
GROUP BY R.m_id
HAVING COUNT(*) > 1);
END;

Warning: I have not tested this for all cases!! If we had INTERSECT
and some other SQL-92 operators, this could be more compact. I also
have no proof this is minimal, either.

Nov 12 '05 #4

Ross Presser

On 11 May 2004 13:52:25 -0700, --CELKO-- wrote:

Warning: I have not tested this for all cases!! If we had INTERSECT
and some other SQL-92 operators, this could be more compact. I also
have no proof this is minimal, either.

Thanks for the help and the effort, but no, it's not minimal; not even
close. Using the 6-step method I outlined in my message Monday
(news:1f*****************************@40tude.net) on my real run of 52776
invoices, I ended up with a set of 403 samples. Your first INSERT step
gave 648 samples, which the "nightmare" query reduced to 629 samples.
However, in order to get it to run on my MS SQL 6.5 server, I had to modify
it a bit and split it up; I may have made an error. Here was my final
query batch; is it equivalent to what you were writing?

SELECT m_id into #celko
FROM (SELECT MIN(Sample.m_id) "m_id" FROM Sample GROUP BY ifc
HAVING COUNT(*) > 1
UNION ALL
SELECT MIN(m_id)FROM Sample GROUP BY club
HAVING COUNT(*) > 1) "Z" group by m_id having count(*) > 1
go
SELECT MIN(m_id) AS m_id, club, ifc
FROM Mdata
GROUP BY club, ifc
HAVING MIN(m_id)
NOT IN (select m_id from #celko)
Interestingly, David Portas's solution produced a result of 405 samples,
which was almost as good as my 6-step solution.

The fact that my final app needs to run with an MS Access backend (please
control your gagging :) ) rather than a SQL Server backend will probably
make me end up with my 6-step solution, unless anyone comes with something
both better and simpler. :) But I thank you both for all your time and
suggestions; this has been educational.

Nov 12 '05 #5

Query help: need at least one row for each value in each of two columns

Similar topics