eliminating redundant data

Edit: this came out longer than I thought; any comments about anything
here are greatly appreciated. Thank you for reading.

My system stores millions of records, each with fields like first name,
last name, email address, city, state, and zip, along with any number of
user-defined fields. The application allows users to define message
templates with variables. They can then select a template and, for each
variable in the template, type in a value or select a field.

The system allows you to query for the messages you've sent by specifying
criteria for the variables (not the fields).

This requirement has made it difficult to normalize my data model at all
for speed. What I have is this:

[fieldindex]
id int PK
name nvarchar
type datatype

[recordindex]
id int PK
....

[recordvalues]
recordid int PK
fieldid int PK
value nvarchar
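
(recordid and fieldid together form a composite primary key; as DDL, with
the nvarchar length just for illustration, it's along the lines of:)

create table recordvalues (
    recordid int not null,
    fieldid int not null,
    value nvarchar(400) not null,
    primary key (recordid, fieldid)
)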

Whenever messages are sent, I store which fields were mapped to which
variables for that deployment. So a query with variable criteria looks
like this:

select coalesce(vm.value, rv.value)
from sentmessages sm
inner join variablemapping vm on vm.deploymentid=sm.deploymentid
left outer join recordvalues rv on
rv.recordid=sm.recordid and rv.fieldid=vm.fieldid
where coalesce(vm.value, rv.value) ....
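
(variablemapping holds, per deployment, either the literal value the user
typed or the field they picked for each variable; the column names here
are approximate:)

[variablemapping]
deploymentid int PK
variableid int PK
fieldid int (null when a literal value was typed)
value nvarchar (null when a field was selected)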

This model works pretty well for searching messages with variable
criteria and for looking up variable values for a particular message. The
big problem I have is that the recordvalues table is HUGE: 1 million
records with 50 fields each = 50 million recordvalues rows. The value
column, the two int columns, plus the two indexes I have on the table
make it a beast. Importing data takes forever. Querying the records
(with field criteria) also takes longer than it should.

Makes sense; the performance was largely IO bound.

I decided to try to cut into that IO. Looking at a recordvalues table
with over 100 million rows in it, there were only about 3 million unique
values, so I split the recordvalues table into two tables:

[recordvalues]
recordid int PK
fieldid int PK
valueid int

[valueindex]
id int PK
value nvarchar (unique)

Now valueindex holds 3 million unique values and recordvalues references
them by id. To my surprise, this shaved only 500MB off a 4GB database!
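
For reference, the split boils down to something like this (simplified;
'recordvalues_old' stands in for the original table, and valueindex.id is
assumed to be an identity column):

insert into valueindex (value)
select distinct value from recordvalues_old

insert into recordvalues (recordid, fieldid, valueid)
select rv.recordid, rv.fieldid, vi.id
from recordvalues_old rv
inner join valueindex vi on vi.value = rv.value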

Importing didn't get any faster either; although it's no longer IO bound,
it appears the CPU became the new bottleneck and outweighed the old IO
bottleneck. This is probably because I haven't optimized the queries for
the new tables (I was hoping it wouldn't be so hard without the IO
problem).
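
With the split tables, the variable-criteria query now needs an extra
join to get at the actual value, something like this (a sketch, not my
exact query):

select coalesce(vm.value, vi.value)
from sentmessages sm
inner join variablemapping vm on vm.deploymentid=sm.deploymentid
left outer join recordvalues rv on
rv.recordid=sm.recordid and rv.fieldid=vm.fieldid
left outer join valueindex vi on vi.id=rv.valueid
where coalesce(vm.value, vi.value) ....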

Is there a better way to accomplish what I'm trying to do (eliminate the
redundant data)? Does SQL have built-in constructs to do stuff like this?
It seems like maybe I'm trying to duplicate functionality at a high level
that may already exist at a lower level.

IO is becoming a serious bottleneck. The million-record, 50-field CSV
file is only 500MB. I would've thought that after eliminating all the
redundant first names, cities, last names, etc. it would be less data,
not 8x more!

-
Gordon

Jul 20 '05 #1
No database vendor, version, platform... Why are you using tables as indexes
when you can use indexes as indexes? Normalizing a data model generally
slows it down. Denormalizing generally speeds things up.
"gordy" <go***@dynamicsdirect.com> wrote in message
news:40**********@Usenet.com...
<snip - original post quoted in full>

Jul 20 '05 #2
> No database vendor, version, platform... Why are you using tables as indexes
> when you can use indexes as indexes? Normalizing a data model generally
> slows it down. Denormalizing generally speeds things up.


Sorry, I'm using MS SQL 2000.

How can I use indexes as indexes? I mean, in the example I posted, can
you give an example?

Jul 20 '05 #3
Well, I don't know SQL Server, but in Oracle, you create an index using the
CREATE INDEX statement. I suspect it works the same or similar in SQL
Server.

Here's an Oracle example that creates an index called
"asearch_client_id_idx" on the client_id field in a table called
"alphasearch" owned by user "alphasearch":

CREATE INDEX ALPHASEARCH.ASEARCH_CLIENT_ID_IDX
ON ALPHASEARCH.ALPHASEARCH(CLIENT_ID);

Now, I am about to lie a little bit for simplicity's sake, but here goes...

When a query is executed that uses client_id as a search or sort criterion,
the Oracle optimizer will decide whether or not to use the index. If it
does, it looks up the values needed in the index and retrieves their row
ids, which in turn are essentially pointers to the location of the data in
data blocks on disk, so it goes directly to that location on disk and
retrieves the data out of the blocks. It does not need to go logically into
the table. Note that what I refer to as a row id in Oracle may not be the
same concept in SQL Server.

Hope you get the general idea, and you should consult your documentation
about indexes.
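
If SQL Server follows the same basic syntax (I believe it does), an index
to back your field-criteria searches might look something like this (the
index name and column choice are just an example, not a recommendation
for your exact workload):

CREATE INDEX IX_recordvalues_field_value
ON recordvalues (fieldid, value);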

"gordy" <go***@dynamicsdirect.com> wrote in message
news:40**********@Usenet.com...
<snip>

Jul 20 '05 #4
> Well, I don't know SQL Server, but in Oracle, you create an index using the
> CREATE INDEX statement. I suspect it works the same or similar in SQL
> Server.
>
> Here's an Oracle example that creates an index called
> "asearch_client_id_idx" on the client_id field in a table called
> "alphasearch" owned by user "alphasearch":
>
> CREATE INDEX ALPHASEARCH.ASEARCH_CLIENT_ID_IDX
> ON ALPHASEARCH.ALPHASEARCH(CLIENT_ID);


Wow, what a concept ;)

I appreciate the criticism... after all, that's the intent of my original
post. However, I would prefer it to be of the constructive variety.

In my system, there are 'fields' and there are 'variables'. The user
creates the relationships between them whenever they send a message. In
order to search for messages by 'variable' values, SQL needs a
relationship of its own to translate between them.

This has kept me from being able to use the obvious:
[records]
id,field1,field2,field3,...

because in a query for 'variable1', depending on the message it may have
to look at 'field3' or 'field4' for the value. This requirement is why I
have the tables I have now (recordindex, fieldindex and recordvalues).

I realize this makes for very large indexes... and like you said, the
table itself is nothing more than a big index. This is the problem I'd
like to solve. In my original post I explained how I attempted to
eliminate redundant data, but I only eliminated 500MB (of 4GB), because
the majority of the volume in this db isn't the data itself, but the
index size.
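
(sp_spaceused is handy for checking that; it breaks out data vs. index
size per table:)

exec sp_spaceused 'recordvalues'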

Jul 20 '05 #5

"gordy" <go***@dynamicsdirect.com> wrote in message
news:40**********@Usenet.com...
> > CREATE INDEX ALPHASEARCH.ASEARCH_CLIENT_ID_IDX
> > ON ALPHASEARCH.ALPHASEARCH(CLIENT_ID);
>
> Wow, what a concept ;)
>
> I appreciate the criticism... after all, that's the intent of my original
> post. However, I would prefer it to be of the constructive variety.

It wasn't criticism. I thought you really didn't know about RDBMS indexes,
being that you essentially created your own.

> In my system, there are 'fields' and there are 'variables'. The user
> creates the relationships between them whenever they send a message. In
> order to search for messages by 'variable' values, SQL needs a
> relationship of its own to translate between them.
>
> <snip>
>
> I realize this makes for very large indexes... and like you said, the
> table itself is nothing more than a big index. This is the problem I'd
> like to solve. In my original post I explained how I attempted to
> eliminate redundant data, but I only eliminated 500MB (of 4GB), because
> the majority of the volume in this db isn't the data itself, but the
> index size.

The only similar situation I've seen like this (home-brew index constructs)
is with a document imaging system called FileNET. In that case, the vendor
actually created its own mini RDBMS to handle just these index/table
constructs. It was very fast, but, of course, proprietary.

It's hard to tell exactly what you are trying to do, though. Could you get
into the business requirements a bit? It would help me to understand what
you need to do. I get the feeling from the solution you came up with that
you are a programmer, not a DBA.


Jul 20 '05 #6
