473,414 Members | 2,030 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,414 software developers and data experts.

Removing duplicates

26
Okay, I have a table with the following fields (amongst others):
[ID] (an autonumber)
[Date] (YYYY-MM-DD HH:MM:SS)
[Title] (255 character text)
[xcount] (Numeric)

What I want to do is delete the duplicate posts. They are duplicate ONLY if [title] is the same, and the [Date] is within 3 days of each other. I want to delete the lower numbered posts, or those with the lowest [xcount].

Anyone have a good idea for me?

Thanks!

TIger
Oct 29 '08 #1
7 2120
coolsti
310 100+
I was just going to give you a suggestion of how you might go about this, but then I realize there is a specific problem in your problem specification.

You are saying that two rows are "duplicate" if the ID is the same and the date field is within 3 days of each other. I have a problem deciding when to decide that rows are duplicates.

Take this case: Let us say you have 10 rows in your table with the same title, and with different values in the date field, but where these values are 2 days apart from each other. In other words, the total date range here spans about 20 days. Now, which rows can be considered duplicates to which other rows?

So your basic problem is not really well defined.
Oct 31 '08 #2
TVining
26
Sorry.

I get about 500 records per day. These are press releases.

Sometimes, I have a person who has family members in 2 seperate states, so I get 2 news releases. I only want to keep one. I don't care WHICH one. Those would be duplicates.

Sometimes, the news agency has an issue and re-releases the same news 2 days in a row. Those would be duplicates.

Sometimes "Mike Smith" graduates school, then 2 months later he gets promoted, those would not be duplicates, but would have the same title.

Here's an example. I would like the system to look at this and come up with 2 posts, one for "arrived for duty" and one for "graduated BMT": http://www.bmtgrad.com/HOMETOWNNEWS/...ip+D.+Smith%22

[xcount] is if they have any comments. If someone has commented, I would like to keep THAT post.

That make more sense?
Nov 4 '08 #3
r035198x
13,262 8TB
See if this helps.
Nov 4 '08 #4
coolsti
310 100+
There is still the conceptual problem of knowing when to delete what, since even as you describe it, you can still have a long series of rows in your table that have adjacent entries with dates within 3 days of each other, making it difficult to know what is a duplicate or not.

I think what you might better want to do is to not think in terms of "deleting" what you consider to be duplicate, but to instead refuse to "insert" a new row if it can be considered a duplicate. This would remove the problem.

Let us say a new entry is to be added and it has the same title to another entry that was made 2 days ago, and does not include comments or whatever that would make it a non-duplicate. Then you simply do not do the insert. It is never entered and does not need to be deleted. Then 2 days later, another entry is to be made with the same title. But now a simple query finds that the latest entry with the same title is now from 4 days ago, and so you decide this is not a duplicate, and you do the insert.

Maybe this helps? It sometimes is easier to "not insert in the first place" than to decide what to "delete afterwards".
Nov 12 '08 #5
TVining
26
I appreciate the help on this. I'm down to only about 2,000 duplicate posts by deleting a few a day! :-P

The problem on this solution is that I use an automated script to make the [TITLE] Field. All records are inserted with the same title. [NEWS RELEASE], as that is how they arrive via email.
I then run a script which renames them based upon the first several words of the release. "Mr John B Smith has graduated from Johnson College", "Mr Micheal Jones has been promoted to senior sales engineer" My script goes through and finds everything after "Mr, Mrs or Ms" and takes the words up to "Has" so the title would be "John B Smith", "Michael Jones", etc.
Arrgh, If my head wasn't shaven, I'd pull my hair out.
Jan 7 '09 #6
TVining
26
@r035198x
That does help a lot, but how would I write in the part about them being posted within 72 hours.

Tony
Jan 7 '09 #7
r035198x
13,262 8TB
First transform the problem to that of selecting records such that their date is within three days of another record. You then run the delete against that result set.
Jan 7 '09 #8

Sign in to post your reply or Sign up for a free account.

Similar topics

2
by: Iain | last post by:
Hi I have inherited a web app with the following table structure, and need to produce a table without any duplicates. Email seems like the best unique identifier - so only one of each e-mail...
20
by: Rubinho | last post by:
I've a list with duplicate members and I need to make each entry unique. I've come up with two ways of doing it and I'd like some input on what would be considered more pythonic (or at least...
6
by: M B HONG 20 | last post by:
Hi all - I was wondering if Javascript has a way to easily remove duplicates from a string. For example, if I had a string: "car truck car truck truck tree post post tree" it should turn...
4
by: Drew | last post by:
I have a permission tracking app that I am working on, and I have made the insert page for it. I am having issues on how to prevent duplicates from getting entered. Currently the interface for...
0
by: makthar | last post by:
In your query use DISTINCT SELECT DISTINCT CITY FROM <tablename> WHERE STATE='<state name>'. This will bring only one of each city from the table. >-----Original Message----- >I'm getting a...
16
by: tyrfboard | last post by:
I've been searching for awhile now on how to remove duplicates from a table within an Access db and have found plenty of articles on finding or deleting duplicates. All I want to do is remove them...
6
by: Niyazi | last post by:
Hi all, What is fastest way removing duplicated value from string array using vb.net? Here is what currently I am doing but the the array contains over 16000 items. And it just do it in 10 or...
2
by: sjlung | last post by:
I apologise if this is a trivial question but I have appended three tables in access and within this table, there are duplicate entries. I have tried to set my reference number for this table to be...
7
by: vsgdp | last post by:
I have a container of pointers. It is possible for two pointers to point to the same element. I want to remove duplicates. I am open to which container is best for this. I thought of using...
4
by: Mokita | last post by:
Hello, I am working with Taverna to build a workflow. Taverna has a beanshell where I can program in java. I am having some problems in writing a script, where I want to eliminate the duplicates...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.