
How can I remove duplicate entries in a SQL query?

I have a database being populated by hits to a program on a server.
The problem is each client connection may require a few hits in a 1-2
second time frame. This is resulting in multiple database entries -
all exactly the same, except the event_id field, which is
auto-numbered.

I need a way to query the record w/out duplicates. That is, any
records exactly the same except event_id should only return one record.

Is this possible??

Thank you,

Barry

Mar 2 '06 #1
das
try this, not tested :-)

SELECT *
FROM table1 a,
     (SELECT min(id) AS id FROM table1) AS b
WHERE a.id = b.id
BarrySDCA wrote:
I need a way to query the record w/out duplicates. That is, any
records exactly the same except event_id should only return one
record.
Mar 2 '06 #2
hrm....no luck. I ran this:
SELECT * FROM `playback_log` a,
  (SELECT min(EVENT_ID) AS EVENT_ID FROM `playback_log`) AS b
WHERE a.EVENT_ID = b.EVENT_ID
Can you see anything? I appreciate the help a bunch. thank you

Mar 2 '06 #3
Haven't tested this:

SELECT *
FROM playback_log a
WHERE a.event_id = (select min(event_id) from playback_log b
where a.field1 = b.field1)
;

You will need to list, in the WHERE clause of the sub-select, as many
fields as are needed to match the identical rows.
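For example, if the rows that count as duplicates share client_ip,
request_url and hit_time (placeholder names -- substitute your real
columns):

SELECT *
FROM playback_log a
WHERE a.event_id = (select min(event_id) from playback_log b
                    where a.client_ip = b.client_ip
                      and a.request_url = b.request_url
                      and a.hit_time = b.hit_time)
;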

Thomas
BarrySDCA wrote on 03.03.2006 00:33:
hrm....no luck. I ran this: (query quoted above)
Can you see anything? I appreciate the help a bunch. thank you

Mar 3 '06 #4
"Thomas Kellerer" <WV**********@spammotel.com> wrote in message
news:46************@individual.net...
SELECT *
FROM playback_log a
WHERE a.event_id = (select min(event_id) from playback_log b
where a.field1 = b.field1)


Here's a similar possibility, without using a correlated subquery:

SELECT a.*
FROM playback_log AS a
WHERE a.event_id IN (
SELECT MIN(b.event_id)
FROM playback_log AS b
GROUP BY b.field1, b.field2, b.field3, ...)

What I've seen missing in the several solutions proposed is any use of GROUP
BY. You'll need to GROUP BY all the fields of the table _except_ for
event_id.
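As a quick sanity check, a HAVING clause will show which groups
actually contain duplicates (using the same placeholder field names):

SELECT field1, field2, field3, COUNT(*) AS copies
FROM playback_log
GROUP BY field1, field2, field3
HAVING COUNT(*) > 1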

Regards,
Bill K.
Mar 3 '06 #5
Bill Karwin (bi**@karwin.com) writes:
"Thomas Kellerer" <WV**********@spammotel.com> wrote in message
news:46************@individual.net...
SELECT *
FROM playback_log a
WHERE a.event_id = (select min(event_id) from playback_log b
where a.field1 = b.field1)


Here's a similar possibility, without using a correlated subquery:

SELECT a.*
FROM playback_log AS a
WHERE a.event_id IN (
SELECT MIN(b.event_id)
FROM playpack_log AS b
GROUP BY b.field1, b.field2, b.field3, ...)

What I've seen missing in the several solutions proposed is any use of
GROUP BY. You'll need to GROUP BY all the fields of the table _except_
for event_id.


One more variation, using a derived table:

SELECT a.*
FROM playback_log AS a
JOIN (SELECT MIN(event_id) AS event_id
      FROM playback_log
      GROUP BY field1, field2, ...) AS b ON a.event_id = b.event_id


--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
Mar 3 '06 #6
I'm not sure why you need a nested query as others have suggested.

I see two options:

Select Distinct blah1, blah2, [all fields except event_id], blah99
From LoginTable

or

Select blah1, blah2, [all fields except event_id], blah99
From LoginTable
Group by blah1, blah2, [all fields except event_id], blah99

if you want an event_id, just for kicks, you could add an aggregator:

Select blah1, blah2, [all fields except event_id], blah99, max(event_id)
From LoginTable
Group by blah1, blah2, [all fields except event_id], blah99

Mar 3 '06 #7

BarrySDCA wrote:
I need a way to query the record w/out duplicates. That is, any
records exactly the same except event_id should only return one
record.


Hopefully you found the suggestions on how to filter out the duplicates
helpful.

I just want to point out that this shows the weakness of using a
pseudokey as the primary key of a table. If they really are the same
events, they shouldn't be duplicated.

Actually, I don't think you have duplicate events, since at least some
of the connection hits must have failed or timed out. Otherwise why
were there retries? So they are not really the same. Are you maybe
missing a status or result attribute?

(but the ID field is still a poor crutch for a possibly weak design.)
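If the natural attributes (plus a status or result column, if you add
one) really do identify an event, then once the existing duplicates
are cleaned up, a unique constraint would stop new ones at the source.
A rough sketch, with invented column names:

ALTER TABLE playback_log
  ADD CONSTRAINT uq_playback_hit
  UNIQUE (client_ip, request_url, hit_time, status);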

just some food for thought.
Ed.

Mar 3 '06 #8
>(but the ID field is still a poor crutch for a possibly weak design.)

hmmmm. I'd prefer to think of it as: the table stores ALL hits, but
this particular data consumer only wants to see ONE. The way I think
about it, your natural key would be the composite of all fields in the
table; perhaps "natural", but certainly awkward.

I like figital's solution.

Mar 3 '06 #9
It's exactly what I'm doing.....I'm recording hits to an advertising
engine, before they get to the media server. Some client players hit
more than once before they connect, causing a duplicate entry. I just
want to see it sorted out w/out the duplicates. I have the
distribution server logs for accurate playback reporting, I'm only
interested in seeing cleaned up hits.

I set up code to remove the duplicates, but it's not foolproof. I will
try the suggestions here and let the group know how it goes. Here is a
link to what I'm doing:

http://www.sundiegolive.com/advertising.htm. Notice that if two
duplicates come in now and are staggered between each other, it still
shows twice. I will try and let you all know...
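One thing I may try for the cleanup pass (untested, and the column
names are placeholders for my real ones) is a self-join DELETE that
keeps only the lowest event_id in each group. Since it matches on the
columns rather than on adjacent rows, it should catch the staggered
duplicates too:

DELETE a
FROM playback_log a
JOIN playback_log b
  ON a.client_ip = b.client_ip
 AND a.request_url = b.request_url
 AND a.hit_time = b.hit_time
 AND a.event_id > b.event_id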

I would like to say that I didn't expect such great help from the
group. I've thrown a few questions out to different groups but usually
don't have so many great replies. Thank you everyone!

Mar 4 '06 #10
Have you looked at Kx and Stonebraker's StreamBase? They are databases
designed for a high temporal flow of data. SQL is not always the
right tool, in spite of what Bill Gates tells you.
Mar 5 '06 #11
On Sat, 04 Mar 2006 11:01:08 -0800, BarrySDCA wrote:
I just want to see it sorted out w/out the duplicates. I'm only
interested in seeing cleaned up hits.


If everything is the same except the timestamp, you can round that up to
the nearest second and use a SELECT DISTINCT.

Check your SQL server's date conversion options to find out how to get
different values from the timestamp field.
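In MySQL, for example, DATE_FORMAT will reduce a datetime to one value
per second, so duplicates within the same second collapse (column
names here are only placeholders):

SELECT DISTINCT client_ip, request_url,
       DATE_FORMAT(hit_time, '%Y-%m-%d %H:%i:%s') AS hit_second
FROM playback_log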

Mar 6 '06 #12

Doug wrote:
(but the ID field is still a poor crutch for a possibly weak design.)
hmmmm. i'd prefer to think of it as the table stores ALL hits to the
table, but this particular data consumer only wants to see ONE.


It isn't clear that this is ONE of several data consumers. Even so,
the ID pseudokey might not be the best choice.
The
way I think about it, your natural key would be the composite of all
fields in the table; perhaps "natural", but certainly awkward.

I like figital's solution.

It is not clear from the previous posts that the natural key would
require ALL the columns of that table. If the table is really storing
events, then the key might be as simple as a timestamp, perhaps with an
event type. It doesn't have to be ALL of the columns.

My point is that too many people immediately jump to ID fields as the
PK, many times before they even sit down to think about what entity
the table represents. If you never look for the natural key, you are
sure not to find it.

HTH
ed

Mar 7 '06 #13
"Ed Prochak" <ed*******@gmail.com> wrote in message
news:11**********************@i39g2000cwa.googlegroups.com...
My point is that too many people immediately jump to ID fields as the PK.


That's certainly a good point. However, I've worked on projects in which
the decision-makers wouldn't commit to _any_ combination of attributes that
would uniquely identify the entity. There were always cases where the value
in any column could be either non-unique, or else have no value specified
(i.e. NULL). Neither would they commit to any attributes that could be
reasonably stable and unchanging (though I understand that this is not
strictly necessary for a key).

So in those kinds of situations, I felt I had to create pseudokeys to have
any chance of the application working. Even if we know the best practices
for database modeling, the project on which we are working may have
constraints that don't allow us to follow those best practices.

Regards,
Bill K.
Mar 7 '06 #14

Bill Karwin wrote:
That's certainly a good point. However, I've worked on projects in
which the decision-makers wouldn't commit to _any_ combination of
attributes that would uniquely identify the entity. [...]


It's not necessarily something the end users or even managers should
make the final decision upon. On one really big project I had the good
fortune of working with a real Data Modeler. No, she did no
programming, only modelling. She interviewed the users to find out
what data they had, what data they wanted, and how they used it. She
organized it and created the ER model. The model included enough
information to design the DB model, which is when the PKs were
determined. Almost all of the cases you mentioned were encountered,
and each was handled by the DB designers. There are times when you
create a pseudokey, and times when you split the logical table:

- maybe, to model a more generic entity, you create a pair of
parent/child tables, where the parent lacks one of the fields that
might be NULL if only the child table existed, or

- maybe the entity really was 2 entities that looked similar, such as
an address that represents the customer's location and the address
that represents the billing address. An application that needed both
would be poorly served by one ADDRESS table.

What happens internally to the DB does not matter as long as you can
present the right data in the right combinations.
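As a rough sketch of that second case (table and column names invented
purely for illustration):

CREATE TABLE customer_location (
    customer_id INT NOT NULL PRIMARY KEY,
    street      VARCHAR(80) NOT NULL,
    city        VARCHAR(40) NOT NULL
);

CREATE TABLE customer_billing_address (
    customer_id INT NOT NULL PRIMARY KEY,
    street      VARCHAR(80) NOT NULL,
    city        VARCHAR(40) NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customer_location (customer_id)
);

Two specific tables, one row of each per customer, instead of one
generic ADDRESS table trying to serve both roles.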

It is our job to engineer the software, not the end users and managers.

So in those kinds of situations, I felt I had to create pseudokeys to
have any chance of the application working. [...]


Sounds like you take pseudokeys as the last resort, which IMO is
exactly what they are. That approach is best practice. Keep it up.

Ed

Mar 7 '06 #15

This thread has been closed and replies have been disabled. Please start a new discussion.
