473,387 Members | 1,456 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,387 software developers and data experts.

Query help: need at least one row for each value in each of two columns

The purpose, as you can probably guess, is to produce a set of sample
documents from a large document run. The data row has a CLUB column and an
IFC column; I want a set of samples that contains at least one of each CLUB
and at least one of each IFC, but no more than necessary.

Example schema and data:

CREATE TABLE mDATA (
ID INTEGER,
CLUB CHAR(7),
IFC CHAR(4)
);

INSERT INTO mDATA (ID,CLUB,IFC) values (6401715,'AARPRAT','IC17')
INSERT INTO mDATA (ID,CLUB,IFC) values (1058337,'AARPRAT','IC17')
INSERT INTO mDATA (ID,CLUB,IFC) values (459443,'AARPPRT','IC25')
INSERT INTO mDATA (ID,CLUB,IFC) values (4018210,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (2430656,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (6802081,'AARPPRD','IG29')
INSERT INTO mDATA (ID,CLUB,IFC) values (4236511,'AARPPRD','IG29')
INSERT INTO mDATA (ID,CLUB,IFC) values (2162104,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (2073679,'AARPPRD','IG29')
INSERT INTO mDATA (ID,CLUB,IFC) values (8148891,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (1868445,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (6749213,'AARPBAS','IG21')
INSERT INTO mDATA (ID,CLUB,IFC) values (8363621,'AARPPUP','IG29')

Jul 20 '05 #1
5 3320
This will produce a sample containing at least one row of each value in the
two columns. It isn't guaranteed to give the *minimum* subset but it should
contain at most C + I - 1 rows, where C is the number of distinct Clubs and
I the number of distinct IFCs.

SELECT id,club,ifc
FROM mdata AS M
WHERE id
IN
(SELECT MIN(id)
FROM mdata
GROUP BY club
UNION ALL
SELECT MIN(id)
FROM mdata AS M2
GROUP BY ifc
HAVING NOT EXISTS
(SELECT *
FROM mdata
WHERE id IN
(SELECT MIN(id)
FROM mdata
GROUP BY club)
AND ifc = M2.ifc))

I'm assuming that ID is unique and that there are no NULLs in any of the
columns.

I'm not sure it's possible to find the minimum subset every time unless you
use an iterative solution.

--
David Portas
SQL Server MVP
--
Jul 20 '05 #2
On Mon, 10 May 2004 17:29:20 +0100, David Portas wrote:
This will produce a sample containing at least one row of each value in the
two columns. It isn't guaranteed to give the *minimum* subset but it should
contain at most C + I - 1 rows, where C is the number of distinct Clubs and
I the number of distinct IFCs.
[snip]
I'm not sure it's possible to find the minimum subset every time unless you
use an iterative solution.


Thanks. Can you give me an idea as to what an iterative solution might look
like? I played around with the 6-step system below, and found that the
number of rows resulting depended on both the order of the insert queries
AND on whether I used MAX() or MIN(). I deduce that this so far is very
dependent on the exact data involved.

That said, the resulting row count only varied from 403 to 410, for a set
where C=325, I=117... so I'm bettering the worst-case (441 by your formula)
but not by a whole lot. Do you think it could theoretically be better?

(table name changed from mData to mmm; hope that's not confusing)

/* first step: unique clubs */
SELECT Min(mmm.RandomID) AS RandomID, mmm.CLUB, Min(mmm.IFC) AS IFC
INTO AAA
FROM mmm
GROUP BY mmm.CLUB
HAVING (((Count(*))=1));

/* second step: unique IFCs where club not already there*/
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS RandomID, Min(mmm.CLUB) AS CLUB, mmm.IFC AS IFC
FROM mmm
GROUP BY mmm.IFC
HAVING (((Min(mmm.CLUB)) Not In (select club from aaa)) AND
((Count(*))=1));

/* intermezzo: views for missing IFCs, missing Clubs */
CREATE VIEW A_MissingClubs AS
SELECT mmm.CLUB
FROM mmm LEFT JOIN AAA ON mmm.CLUB = AAA.CLUB
WHERE (((AAA.CLUB) Is Null))
GROUP BY mmm.CLUB;

CREATE VIEW A_MissingIFCs AS
SELECT mmm.IFC
FROM mmm LEFT JOIN AAA ON mmm.IFC = AAA.IFC
WHERE (((AAA.IFC) Is Null))
GROUP BY mmm.IFC;

/* third step: distinct missing clubs that are also missing IFCs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, mmm.CLUB, Min(mmm.IFC) AS IFC
FROM (mmm INNER JOIN A_MissingClubs ON mmm.CLUB = A_MissingClubs.CLUB)
INNER JOIN A_MissingIFCs ON mmm.IFC = A_MissingIFCs.IFC
GROUP BY mmm.CLUB;

/* fourth step: distinct missing IFCs that are also missing clubs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, Min(mmm.CLUB) AS CLUB, mmm.IFC
AS IFC
FROM (mmm INNER JOIN A_MissingClubs ON mmm.CLUB = A_MissingClubs.CLUB)
INNER JOIN A_MissingIFCs ON mmm.IFC = A_MissingIFCs.IFC
GROUP BY mmm.IFC;

/* fifth step: remaining missing IFCs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, Min(mmm.CLUB) AS CLUB, mmm.IFC
AS IFC
FROM mmm INNER JOIN A_MissingIFCs ON mmm.IFC = A_MissingIFCs.IFC
GROUP BY mmm.IFC;

/* sixth step: remaining missing CLUBs */
INSERT INTO AAA ( RandomID, CLUB, IFC )
SELECT Min(mmm.RandomID) AS MinOfRandomID, mmm.CLUB, Min(mmm.IFC) AS IFC
FROM mmm INNER JOIN A_MissingClubs ON mmm.CLUB = A_MissingClubs.CLUB
GROUP BY mmm.CLUB;
Jul 20 '05 #3
I assume that you meant to have a real table, with keys and
constraints instead of what you posted, so let's fix it.

CREATE TABLE Mdata
(m_id INTEGER NOT NULL PRIMARY KEY,
club CHAR(7) NOT NULL,
ifc CHAR(4) NOT NULL);

CREATE TABLE Samples
(m_id INTEGER NOT NULL PRIMARY KEY,
club CHAR(7) NOT NULL,
ifc CHAR(4) NOT NULL);

DELETE FROM Mdata;
INSERT INTO Mdata VALUES (6401715, 'aarprat', 'ic17');
INSERT INTO Mdata VALUES (1058337, 'aarprat', 'ic17');
INSERT INTO Mdata VALUES (459443, 'aarpprt', 'ic25');
INSERT INTO Mdata VALUES (4018210, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (2430656, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (6802081, 'aarpprd', 'ig29');
INSERT INTO Mdata VALUES (4236511, 'aarpprd', 'ig29');
INSERT INTO Mdata VALUES (2162104, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (2073679, 'aarpprd', 'ig29');
INSERT INTO Mdata VALUES (8148891, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (1868445, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (6749213, 'aarpbas', 'ig21');
INSERT INTO Mdata VALUES (8363621, 'aarppup', 'ig29');
INSERT INTO Mdata VALUES (9999, 'aarppup', 'ic17'); -- redudant
I want a set of samples that contains at least one of each CLUB

and at least one of each IFC, but no more than necessary. <<

There can be more than one minimal solution, but if we go with the
minimum m_id number, that will give us a start which we can prune.
INSERT INTO Samples
SELECT MIN(m_id), club, ifc
FROM Mdata
GROUP BY club, ifc;

=============================
9999 aarppup ic17 <== redundant row
1058337 aarprat ic17 <== ifc
459443 aarpprt ic25
1868445 aarpbas ig21
2073679 aarpprd ig29
8363621 aarppup ig29 <== club

a VALUES (9999, 'aarppup', 'ic17'); -- redudant

We can find the candidate rows for redundancy removal:

SELECT MIN(m_id), ifc FROM Sample
GROUP BY ifc
HAVING COUNT(*) > 1

and likewise extra clubs:

SELECT MIN(m_id), club FROM Sample
GROUP BY club
HAVING COUNT(*) > 1;

Put it together:

SELECT m_id
FROM (SELECT MIN(m_id) FROM Sample GROUP BY ifc
HAVING COUNT(*) > 1
UNION ALL
SELECT MIN(m_id)FROM Sample GROUP BY club
HAVING COUNT(*) > 1) AS R(m_id)
GROUP BY R.m_id
HAVING COUNT(*) > 1;

and the final nightmare query:

BEGIN
INSERT INTO Sample
SELECT MIN(m_id) AS m_id, club, ifc
FROM Mdata
GROUP BY club, ifc;
SELECT MIN(m_id) AS m_id, club, ifc
FROM Mdata
GROUP BY club, ifc
HAVING MIN(m_id)
NOT IN
(SELECT R.m_id
FROM (SELECT MIN(m_id) FROM Sample GROUP BY ifc
HAVING COUNT(*) > 1
UNION ALL
SELECT MIN(m_id)FROM Sample GROUP BY club
HAVING COUNT(*) > 1) AS R(m_id)
GROUP BY R.m_id
HAVING COUNT(*) > 1);
END;

Warning: I have not tested this for all cases!! If we had INTERSECT
and some other SQL-92 operators, this could be more compact. I also
have no proof this is minimal, either.
Jul 20 '05 #4
On 11 May 2004 13:52:25 -0700, --CELKO-- wrote:
Warning: I have not tested this for all cases!! If we had INTERSECT
and some other SQL-92 operators, this could be more compact. I also
have no proof this is minimal, either.


Thanks for the help and the effort, but no, it's not minimal; not even
close. Using the 6-step method I outlined in my message Monday
(news:1f*****************************@40tude.net) on my real run of 52776
invoices, I ended up with a set of 403 samples. Your first INSERT step
gave 648 samples, which the "nightmare" query reduced to 629 samples.
However, in order to get it to run on my MS SQL 6.5 server, I had to modify
it a bit and split it up; I may have made an error. Here was my final
query batch; is it equivalent to what you were writing?

SELECT m_id into #celko
FROM (SELECT MIN(Sample.m_id) "m_id" FROM Sample GROUP BY ifc
HAVING COUNT(*) > 1
UNION ALL
SELECT MIN(m_id)FROM Sample GROUP BY club
HAVING COUNT(*) > 1) "Z" group by m_id having count(*) > 1
go
SELECT MIN(m_id) AS m_id, club, ifc
FROM Mdata
GROUP BY club, ifc
HAVING MIN(m_id)
NOT IN (select m_id from #celko)
Interestingly, David Portas's solution produced a result of 405 samples,
which was almost as good as my 6-step solution.

The fact that my final app needs to run with an MS Access backend (please
control your gagging :) ) rather than a SQL Server backend will probably
make me end up with my 6-step solution, unless anyone comes with something
both better and simpler. :) But I thank you both for all your time and
suggestions; this has been educational.
Jul 20 '05 #5
>> Interestingly, David Portas's solution produced a result of 405
samples, which was almost as good as my 6-step solution. <<

He usually comes up with a good solution.
The fact that my final app needs to run with an MS Access backend (please c ontrol your gagging :)) <<

I would never make fun of those who suffer unjustly :)
.. will probably make me end up with my 6-step solution, unless anyone comes with something both better and simpler. <<

That will probably be the easiest way in ACCESS and the fastest.
this has been educational <<


Ditto. This is a good general problem. Dennis Shasha had a Scientific
American Games column on this kind of problem in the last 2-3 years. It
was worded as a combination lock in which you had to produce a set of
vectors which had all possible arrangements in certain positions.

Might want to drop him a line for some leads in the literature. I am
going to be swamped for the next two weeks, so I will not have time to
look at it. About as far as I got was that the lower limit has to be
the highest number of distinct values among all the columns (pretty
obvious).

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!
Jul 20 '05 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

9
by: netpurpose | last post by:
I need to extract data from this table to find the lowest prices of each product as of today. The product will be listed/grouped by the name only, discarding the product code - I use...
8
by: Lauren Quantrell | last post by:
In VBA, I constructed the following to update all records in tblmyTable with each records in tblmyTableTEMP having the same UniqueID: UPDATE tblMyTable RIGHT JOIN tblMyTableTEMP ON...
4
by: Ross Presser | last post by:
The purpose, as you can probably guess, is to produce a set of sample documents from a large document run. The data row has a CLUB column and an IFC column; I want a set of samples that contains...
5
by: Rated R1 | last post by:
I wrote this before in the NGs, so I am going to paste the responses that I got and see if someone can please help me. Email me and we can set something up as Id even be willing to pay for your...
7
by: serge | last post by:
How can I run a single SP by asking multiple sales question either by using the logical operator AND for all the questions; or using the logical operator OR for all the questions. So it's always...
24
by: clare at snyder.on.ca | last post by:
I have a SQL query I need to design to select name and email addresses for policies that are due and not renewed in a given time period. The problem is, the database keeps the information for every...
5
by: Wired Hosting News | last post by:
I tried to be breif and give a scenario so as not to be overlooked because it was soooo long. Let me give you real world. I am a manufacturer of goods and produce 11 items that are distributed...
0
by: RCapps | last post by:
When running the below SQL Query I keep getting the following error: Server: Msg 4924, Level 16, State 1, Line 1 ALTER TABLE DROP COLUMN failed because column 'ContractDef' does not exist in table...
4
by: Haas C | last post by:
Hi all, I have a table with two columns, labeled Year and Loss. In the Year field, I have the numbers 1 to 10,000, each which can or cannot repeat. In the Loss column, i have numbers...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.