Query help: need at least one row for each value in each of two columns

Ross Presser

The purpose, as you can probably guess, is to produce a set of sample
documents from a large document run. The data row has a CLUB column and an
IFC column; I want a set of samples that contains at least one of each CLUB
and at least one of each IFC, but no more than necessary.

Example schema and data:

CREATE TABLE mDATA (
ID INTEGER,
CLUB CHAR(7),
IFC CHAR(4)
);

INSERT INTO mDATA (ID,CLUB,IFC) values (6401715,'AARPRAT','IC17')
INSERT INTO mDATA (ID,CLUB,IFC) values (1058337,'AARPRAT','IC17')
INSERT INTO mDATA (ID,CLUB,IFC) values (459443,'AARPPRT','IC25')
INSERT INTO mDATA (ID,CLUB,IFC) values (4018210,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (2430656,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (6802081,'AARPPRD','IG29')
INSERT INTO mDATA (ID,CLUB,IFC) values (4236511,'AARPPRD','IG29')
INSERT INTO mDATA (ID,CLUB,IFC) values (2162104,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (2073679,'AARPPRD','IG29')
INSERT INTO mDATA (ID,CLUB,IFC) values (8148891,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (1868445,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (6749213,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (8363621,'AARPPUP','IG29')

Jul 20 '05 #1

Subscribe Post Reply

3319

David Portas

This will produce a sample containing at least one row of each value in the
two columns. It isn't guaranteed to give the *minimum* subset but it should
contain at most C + I - 1 rows, where C is the number of distinct Clubs and
I the number of distinct IFCs.

SELECT id,club,ifc
FROM mdata AS M
WHERE id
IN
(SELECT MIN(id)
FROM mdata
GROUP BY club
UNION ALL
SELECT MIN(id)
FROM mdata AS M2
GROUP BY ifc
HAVING NOT EXISTS
(SELECT *
FROM mdata
WHERE id IN
(SELECT MIN(id)
FROM mdata
GROUP BY club)
AND ifc = M2.ifc))

I'm assuming that ID is unique and that there are no NULLs in any of the
columns.

I'm not sure it's possible to find the minimum subset every time unless you
use an iterative solution.

--
David Portas
SQL Server MVP
--

Jul 20 '05 #2

Ross Presser

On Mon, 10 May 2004 17:29:20 +0100, David Portas wrote:

This will produce a sample containing at least one row of each value in the
two columns. It isn't guaranteed to give the *minimum* subset but it should
contain at most C + I - 1 rows, where C is the number of distinct Clubs and
I the number of distinct IFCs.
[snip]
I'm not sure it's possible to find the minimum subset every time unless you
use an iterative solution.

Thanks. Can you give me an idea as to what an iterative solution might look
like? I played around with the 6-step system below, and found that the
number of rows resulting depended on both the order of the insert queries
AND on whether I used MAX() or MIN(). I deduce that this so far is very
dependent on the exact data involved.

That said, the resulting row count only varied from 403 to 410, for a set
where C=325, I=117... so I'm bettering the worst-case (441 by your formula)
but not by a whole lot. Do you think it could theoretically be better?

(table name changed from mData to mmm; hope that's not confusing)

/* first step: unique clubs */
SELECT Min(mmm.RandomID) AS RandomID, mmm.CLUB, Min(mmm.IFC) AS IFC
INTO AAA
FROM mmm
GROUP BY mmm.CLUB
HAVING (((Count(*))=1));

/* second step: unique IFCs where club not already there*/
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS RandomID, Min(mmm.CLUB) AS CLUB, mmm.IFC AS IFC
FROM mmm
GROUP BY mmm.IFC
HAVING (((Min(mmm.CLUB)) Not In (select club from aaa)) AND
((Count(*))=1));

/* intermezzo: views for missing IFCs, missing Clubs */
CREATE VIEW A_MissingClubs AS
SELECT mmm.CLUB
FROM mmm LEFT JOIN AAA ON mmm.CLUB = AAA.CLUB
WHERE (((AAA.CLUB) Is Null))
GROUP BY mmm.CLUB;

CREATE VIEW A_MissingIFCs AS
SELECT mmm.IFC
FROM mmm LEFT JOIN AAA ON mmm.IFC = AAA.IFC
WHERE (((AAA.IFC) Is Null))
GROUP BY mmm.IFC;

/* third step: distinct missing clubs that are also missing IFCs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, mmm.CLUB, Min(mmm.IFC) AS IFC
FROM (mmm INNER JOIN A_MissingClubs ON mmm.CLUB = A_MissingClubs.CLUB)
INNER JOIN A_MissingIFCs ON mmm.IFC = A_MissingIFCs.IFC
GROUP BY mmm.CLUB;

/* fourth step: distinct missing IFCs that are also missing clubs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, Min(mmm.CLUB) AS CLUB, mmm.IFC
AS IFC
FROM (mmm INNER JOIN A_MissingClubs ON mmm.CLUB = A_MissingClubs.CLUB)
INNER JOIN A_MissingIFCs ON mmm.IFC = A_MissingIFCs.IFC
GROUP BY mmm.IFC;

/* fifth step: remaining missing IFCs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, Min(mmm.CLUB) AS CLUB, mmm.IFC
AS IFC
FROM mmm INNER JOIN A_MissingIFCs ON mmm.IFC = A_MissingIFCs.IFC
GROUP BY mmm.IFC;

/* sixth step: remaining missing CLUBs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, mmm.CLUB, Min(mmm.IFC) AS IFC
FROM mmm INNER JOIN A_MissingClubs ON mmm.CLUB = A_MissingClubs.CLUB
GROUP BY mmm.CLUB;

Jul 20 '05 #3

--CELKO--

I assume that you meant to have a real table, with keys and
constraints instead of what you posted, so let's fix it.

CREATE TABLE Mdata
(m_id INTEGER NOT NULL PRIMARY KEY,
club CHAR(7) NOT NULL,
ifc CHAR(4) NOT NULL);

CREATE TABLE Samples
(m_id INTEGER NOT NULL PRIMARY KEY,
club CHAR(7) NOT NULL,
ifc CHAR(4) NOT NULL);

DELETE FROM Mdata;
INSERT INTO Mdata VALUES (6401715, 'aarprat', 'ic17');
INSERT INTO Mdata VALUES (1058337, 'aarprat', 'ic17');
INSERT INTO Mdata VALUES (459443, 'aarpprt', 'ic25');
INSERT INTO Mdata VALUES (4018210, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (2430656, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (6802081, 'aarpprd', 'ig29');
INSERT INTO Mdata VALUES (4236511, 'aarpprd', 'ig29');
INSERT INTO Mdata VALUES (2162104, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (2073679, 'aarpprd', 'ig29');
INSERT INTO Mdata VALUES (8148891, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (1868445, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (6749213, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (8363621, 'aarppup', 'ig29');
INSERT INTO Mdata VALUES (9999, 'aarppup', 'ic17'); -- redudant

I want a set of samples that contains at least one of each CLUB

and at least one of each IFC, but no more than necessary. <<

There can be more than one minimal solution, but if we go with the
minimum m_id number, that will give us a start which we can prune.
INSERT INTO Samples
SELECT MIN(m_id), club, ifc
FROM Mdata
GROUP BY club, ifc;

=============================
9999 aarppup ic17 <== redundant row
1058337 aarprat ic17 <== ifc
459443 aarpprt ic25
1868445 aarpbas ig21
2073679 aarpprd ig29
8363621 aarppup ig29 <== club

a VALUES (9999, 'aarppup', 'ic17'); -- redudant

We can find the candidate rows for redundancy removal:

SELECT MIN(m_id), ifc FROM Sample
GROUP BY ifc
HAVING COUNT(*) > 1

and likewise extra clubs:

SELECT MIN(m_id), club FROM Sample
GROUP BY club
HAVING COUNT(*) > 1;

Put it together:

SELECT m_id
FROM (SELECT MIN(m_id) FROM Sample GROUP BY ifc
HAVING COUNT(*) > 1
UNION ALL
SELECT MIN(m_id)FROM Sample GROUP BY club
HAVING COUNT(*) > 1) AS R(m_id)
GROUP BY R.m_id
HAVING COUNT(*) > 1;

and the final nightmare query:

BEGIN
INSERT INTO Sample
SELECT MIN(m_id) AS m_id, club, ifc
FROM Mdata
GROUP BY club, ifc;
SELECT MIN(m_id) AS m_id, club, ifc
FROM Mdata
GROUP BY club, ifc
HAVING MIN(m_id)
NOT IN
(SELECT R.m_id
FROM (SELECT MIN(m_id) FROM Sample GROUP BY ifc
HAVING COUNT(*) > 1
UNION ALL
SELECT MIN(m_id)FROM Sample GROUP BY club
HAVING COUNT(*) > 1) AS R(m_id)
GROUP BY R.m_id
HAVING COUNT(*) > 1);
END;

Warning: I have not tested this for all cases!! If we had INTERSECT
and some other SQL-92 operators, this could be more compact. I also
have no proof this is minimal, either.

Jul 20 '05 #4

Ross Presser

On 11 May 2004 13:52:25 -0700, --CELKO-- wrote:

Warning: I have not tested this for all cases!! If we had INTERSECT
and some other SQL-92 operators, this could be more compact. I also
have no proof this is minimal, either.

Thanks for the help and the effort, but no, it's not minimal; not even
close. Using the 6-step method I outlined in my message Monday
(news:1f*****************************@40tude.net) on my real run of 52776
invoices, I ended up with a set of 403 samples. Your first INSERT step
gave 648 samples, which the "nightmare" query reduced to 629 samples.
However, in order to get it to run on my MS SQL 6.5 server, I had to modify
it a bit and split it up; I may have made an error. Here was my final
query batch; is it equivalent to what you were writing?

SELECT m_id into #celko
FROM (SELECT MIN(Sample.m_id) "m_id" FROM Sample GROUP BY ifc
HAVING COUNT(*) > 1
UNION ALL
SELECT MIN(m_id)FROM Sample GROUP BY club
HAVING COUNT(*) > 1) "Z" group by m_id having count(*) > 1
go
SELECT MIN(m_id) AS m_id, club, ifc
FROM Mdata
GROUP BY club, ifc
HAVING MIN(m_id)
NOT IN (select m_id from #celko)
Interestingly, David Portas's solution produced a result of 405 samples,
which was almost as good as my 6-step solution.

The fact that my final app needs to run with an MS Access backend (please
control your gagging :) ) rather than a SQL Server backend will probably
make me end up with my 6-step solution, unless anyone comes with something
both better and simpler. :) But I thank you both for all your time and
suggestions; this has been educational.

Jul 20 '05 #5

Joe Celko

>> Interestingly, David Portas's solution produced a result of 405
samples, which was almost as good as my 6-step solution. <<

He usually comes up with a good solution.

The fact that my final app needs to run with an MS Access backend (please c ontrol your gagging :)) <<

I would never make fun of those who suffer unjustly :)
.. will probably make me end up with my 6-step solution, unless anyone comes with something both better and simpler. <<

That will probably be the easiest way in ACCESS and the fastest.
this has been educational <<

Ditto. This is a good general problem. Dennis Shasha had a Scientific
American Games column on this kind of problem in the last 2-3 years. It
was worded as a combination lock in which you had to produce a set of
vectors which had all possible arrangements in certain positions.

Might want to drop him a line for some leads in the literature. I am
going to be swamped for the next two weeks, so I will not have time to
look at it. About as far as I got was that the lower limit has to be
the highest number of distinct values among all the columns (pretty
obvious).

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #6

Similar topics

Need help with query

by: netpurpose | last post by:

I need to extract data from this table to find the lowest prices of each product as of today. The product will be listed/grouped by the name only, discarding the product code - I use...

Microsoft SQL Server

UPDATE query in SQL Server

by: Lauren Quantrell | last post by:

In VBA, I constructed the following to update all records in tblmyTable with each records in tblmyTableTEMP having the same UniqueID: UPDATE tblMyTable RIGHT JOIN tblMyTableTEMP ON...

Microsoft SQL Server

Query help: need at least one row for each value in each of two columns

by: Ross Presser | last post by:

The purpose, as you can probably guess, is to produce a set of sample documents from a large document run. The data row has a CLUB column and an IFC column; I want a set of samples that contains...

Microsoft Access / VBA

Query Drop box, etc Help will pay for your time!!

by: Rated R1 | last post by:

I wrote this before in the NGs, so I am going to paste the responses that I got and see if someone can please help me. Email me and we can set something up as Id even be willing to pay for your...

Microsoft Access / VBA

1 SP with dynamic input parameters and multiple rows as the source of the query

by: serge | last post by:

How can I run a single SP by asking multiple sales question either by using the logical operator AND for all the questions; or using the logical operator OR for all the questions. So it's always...

Microsoft SQL Server

SQL query - duplicate records - different dates - how to get only latest information?

by: clare at snyder.on.ca | last post by:

I have a SQL query I need to design to select name and email addresses for policies that are due and not renewed in a given time period. The problem is, the database keeps the information for every...

Microsoft SQL Server

Complex query calculating sales - question explained

by: Wired Hosting News | last post by:

I tried to be breif and give a scenario so as not to be overlooked because it was soooo long. Let me give you real world. I am a manufacturer of goods and produce 11 items that are distributed...

Microsoft Access / VBA

SQL Query Error Help

by: RCapps | last post by:

When running the below SQL Query I keep getting the following error: Server: Msg 4924, Level 16, State 1, Line 1 ALTER TABLE DROP COLUMN failed because column 'ContractDef' does not exist in table...

Microsoft SQL Server

SQL Query to transpose data from rows into columns

by: Haas C | last post by:

Hi all, I have a table with two columns, labeled Year and Loss. In the Year field, I have the numbers 1 to 10,000, each which can or cannot repeat. In the Loss column, i have numbers...

Microsoft Access / VBA

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Basic Javascript concepts

by: aa123db | last post by:

Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...

Javascript

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing