problem w/ duplicates - MySQL Database

Mark

Let's say I have a table of users, and each user has a list of
categories. I could store each user's categories as TEXT with
delimiters, like "cat1|cat2|cat3".

But then I need to be able to get a full list of everyone's categories,
without duplicates. Retrieving all the categories, exploding them, and
then removing the duplicates is a bit slow. Is there a better method?
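
For reference, the "explode and de-duplicate" approach described above amounts to roughly the following. This is only a sketch: Python is used for illustration (the thread does not say which language the application is written in), the sample rows are made up, and `distinct_categories` is a hypothetical helper name.

```python
# Made-up sample data; in practice these strings would come from
# something like: SELECT categories FROM users
rows = ["cat1|cat2|cat3", "cat2|cat4", "", "cat1"]


def distinct_categories(delimited_rows, sep="|"):
    """Split every delimited string and collect the unique category names."""
    seen = set()
    for value in delimited_rows:
        for name in value.split(sep):
            if name:  # skip empty pieces from empty or malformed strings
                seen.add(name)
    return sorted(seen)


print(distinct_categories(rows))
```

Every row has to be fetched and split each time the list is needed, so the cost grows with the number of users rather than with the number of distinct categories.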

Jul 8 '06 #1

Rich Ryan

Your solution violates first normal form and leads to many problems. Create a
table with two columns, userid and categoryid, and make these two columns
together the key.

Rich

Jul 8 '06 #2
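
A minimal sketch of the two-column table suggested above, under some assumptions: SQLite (through Python's standard `sqlite3` module) stands in for MySQL so the example runs on its own, and the table name `user_category` and the sample rows are made up for illustration.

```python
import sqlite3

# SQLite stands in for MySQL here; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_category (
        userid     INTEGER NOT NULL,
        categoryid INTEGER NOT NULL,
        PRIMARY KEY (userid, categoryid)  -- both columns together form the key
    )
""")
conn.executemany(
    "INSERT INTO user_category (userid, categoryid) VALUES (?, ?)",
    [(0, 1), (0, 2), (1, 1), (1, 3), (2, 2)],
)

# The composite key rejects a repeated (userid, categoryid) pair.
try:
    conn.execute("INSERT INTO user_category VALUES (0, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The full list of categories in use, without duplicates.
category_ids = [row[0] for row in conn.execute(
    "SELECT DISTINCT categoryid FROM user_category ORDER BY categoryid"
)]
print(duplicate_rejected, category_ids)
```

With this layout the database does the de-duplication, both when rows are written (the key) and when they are read (`SELECT DISTINCT`).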

Mark

Rich Ryan wrote:

Your solution violates first normal form and leads to many problems. Create a
table with two columns, userid and categoryid, and make these two columns
together the key.

Rich

Jul 9 '06 #3

zac.carey

users (userID is the primary key)
| userID | user   |
| 01     | john   |
| 02     | paul   |
| 03     | george |
| 04     | ringo  |

categories (categoryID is the primary key)
| categoryID | category       |
| 01         | lead vocals    |
| 02         | lead guitar    |
| 03         | keyboard       |
| 04         | harmonica      |
| 05         | backing vocals |
| 06         | drums          |
| 07         | rhythm guitar  |
| 08         | bass guitar    |

userID_categoryID (the primary key is made from both columns together!)
| userID | categoryID |
| 01     | 01         |
| 01     | 03         |
| 01     | 04         |
| 01     | 05         |
| 01     | 07         |
| 02     | 01         |
| 02     | 05         |
| 02     | 07         |
| 02     | 08         |
| 03     | 02         |
| 03     | 05         |
| etc    | etc        |

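
The three tables above could be created and queried along these lines. Treat it as a sketch under assumptions: SQLite (through Python's standard `sqlite3` module) stands in for MySQL so it runs on its own, the column types are guesses, and only the link rows actually listed above are loaded (the "etc" rows are left out).

```python
import sqlite3

# SQLite stands in for MySQL; table and column names follow the tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        userID INTEGER PRIMARY KEY,
        user   TEXT NOT NULL
    );
    CREATE TABLE categories (
        categoryID INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    CREATE TABLE userID_categoryID (
        userID     INTEGER NOT NULL REFERENCES users (userID),
        categoryID INTEGER NOT NULL REFERENCES categories (categoryID),
        PRIMARY KEY (userID, categoryID)
    );
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "john"), (2, "paul"), (3, "george"), (4, "ringo")])
conn.executemany("INSERT INTO categories VALUES (?, ?)", [
    (1, "lead vocals"), (2, "lead guitar"), (3, "keyboard"), (4, "harmonica"),
    (5, "backing vocals"), (6, "drums"), (7, "rhythm guitar"), (8, "bass guitar"),
])
conn.executemany("INSERT INTO userID_categoryID VALUES (?, ?)", [
    (1, 1), (1, 3), (1, 4), (1, 5), (1, 7),
    (2, 1), (2, 5), (2, 7), (2, 8),
    (3, 2), (3, 5),
])

# Every category assigned to at least one user, each listed once.
categories_in_use = [name for _, name in conn.execute("""
    SELECT DISTINCT c.categoryID, c.category
    FROM categories AS c
    JOIN userID_categoryID AS uc ON uc.categoryID = c.categoryID
    ORDER BY c.categoryID
""")]

# One user's categories, resolved back to names through the link table.
johns_categories = [row[0] for row in conn.execute("""
    SELECT c.category
    FROM users AS u
    JOIN userID_categoryID AS uc ON uc.userID = u.userID
    JOIN categories AS c ON c.categoryID = uc.categoryID
    WHERE u.user = ?
    ORDER BY c.categoryID
""", ("john",))]

print(categories_in_use)
print(johns_categories)
```

If the goal is simply every category that exists, whether or not anyone uses it, `SELECT category FROM categories` is enough and no de-duplication is needed at all.
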
Mark wrote:

make what the key? the "these" columns?? i don't understand what you
mean.

and if i did it that way, wouldn't there be a lot of redundant data?

userid | categories
0 | life
0 | work
0 | web
1 | life
1 | work
1 | starcraft
2 | work
2 | starcraft
2 | programming

something like that..? i mean..i guess it works, but it seems like
wasted space. thought there might be a way to group everyone who has
the same category.

Jul 9 '06 #4

Mark

hm. thank you for clearing that up. is having 3 tables faster/more
efficient/save more space than having two tables? i guess there are
less strings stored, but there are twice as many rows necessary..

Jul 11 '06 #5

strawberry

Mark wrote:

hm. thank you for clearing that up. is having 3 tables faster/more
efficient/save more space than having two tables? i guess there are
less strings stored, but there are twice as many rows necessary..

Well, I'm definitely not qualified to comment on efficiency, but I'm
sure there must be lots out there on performance comparisons of
flat tables vs. normalized databases.

In this simple example, there's probably not a lot in it. By putting
the categories in a separate table, I'm reducing the risk of errors in
user input - or at least making those errors more consistent! If the
categories also had descriptions, for instance, then the performance
benefits would become more apparent.
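
As a rough illustration of the point about input errors: once categories live in their own table, a foreign key lets the database refuse a link to a category that does not exist. The sketch below assumes SQLite (via Python's standard `sqlite3` module) as a stand-in for MySQL, where foreign keys are enforced by the InnoDB storage engine; `assign` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE categories (
        categoryID INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    CREATE TABLE userID_categoryID (
        userID     INTEGER NOT NULL,
        categoryID INTEGER NOT NULL REFERENCES categories (categoryID),
        PRIMARY KEY (userID, categoryID)
    );
    INSERT INTO categories VALUES (1, 'lead vocals'), (2, 'lead guitar');
""")


def assign(user_id, category_id):
    """Try to link a user to a category; return False if the database rejects the row."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO userID_categoryID VALUES (?, ?)",
                (user_id, category_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


known = assign(1, 2)     # category 2 exists
unknown = assign(1, 99)  # category 99 does not exist
print(known, unknown)
```

With free-text category strings, by contrast, "lead guitar" and a misspelling of it would simply be stored as two different categories.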

But the db I've suggested is a poor example for demonstrating the real
benefits of (at least some degree of) normalization. I'd write a better
one - but there are SO many well-written tutorials already out there on
db construction and normalization that it hardly seems worth it :-)

Jul 11 '06 #6

Shawn Hamzee

Indeed, in the example provided, the three-table design is faster. Overall,
though, the performance of your tables depends on the number of rows. If the
number of rows in any of the tables is not going to surpass maybe 400,000 or
500,000, the denormalized solution is the way to go; however, there is a
point of diminishing returns that you need to be aware of, which you can find
by examining the logs and the performance of your database.

Hope this helps.

Jul 11 '06 #7

Skarjune

Mark wrote:

hm. thank you for clearing that up. is having 3 tables faster/more
efficient/save more space than having two tables? i guess there are
less strings stored, but there are twice as many rows necessary..

Mark,

As Rich first pointed out, normalization is important for data quality
and performance. The number of tables and rows alone does not determine
performance. Proper primary and foreign keys between related tables,
along with indexes on columns used for conditions in WHERE clauses, have
a greater effect on performance, since they control how the query is
planned and how the data engine fetches the data. By contrast, hacking
delimiters for nested values will tend to slow things down.
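
To make the point about indexes a little more concrete, here is a small sketch. It assumes SQLite (via Python's standard `sqlite3` module) in place of MySQL, and `idx_category` is a made-up index name. The printed plan is whatever the local SQLite version reports; MySQL's counterpart is `EXPLAIN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE userID_categoryID (
        userID     INTEGER NOT NULL,
        categoryID INTEGER NOT NULL,
        PRIMARY KEY (userID, categoryID)
    );
    -- The primary key already helps lookups by userID; this extra index
    -- helps lookups by categoryID.
    CREATE INDEX idx_category ON userID_categoryID (categoryID);
""")
conn.executemany(
    "INSERT INTO userID_categoryID VALUES (?, ?)",
    [(1, 1), (1, 5), (2, 1), (2, 8), (3, 2), (3, 5)],
)

query = """
    SELECT userID
    FROM userID_categoryID
    WHERE categoryID = ?
    ORDER BY userID
"""
users_with_category_5 = [row[0] for row in conn.execute(query, (5,))]
print(users_with_category_5)

# Ask the engine how it intends to run the query.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (5,)):
    print(row)
```

A delimited TEXT column offers nothing comparable: finding everyone with a given category means scanning and pattern-matching every row.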

Imagine that you went to the library and the books were simply stacked
on the shelves in whatever order crammed the most books onto the
fewest shelves using the fewest staff. That would make filing easy for
the library, but finding a book would be a hassle for patrons...

-DHS-

Jul 11 '06 #8

Mark

Thanks for explaining all this to me guys :) I knew there was a proper
or better method to approach this problem, but I guess I'm still sort
of new to databases, and wasn't sure what it was. Perhaps I'll google
some stuff on normalization and find out more. Anyways, this should
solve my problem. Thanks a ton!

Jul 17 '06 #9
