edit: this came out longer than I thought; any comments about anything
here are greatly appreciated. thank you for reading
My system stores millions of records, each with fields like first name,
last name, email address, city, state, and zip, along with any number of
user-defined fields. The application allows users to define message
templates with variables. They can then select a template and, for each
variable in the template, type in a value or select a field.
The system allows you to query for messages you've sent by specifying
criteria for the variables (not the fields).
This requirement has made it difficult to normalize my data model at all
without sacrificing speed. What I have is this:
[fieldindex]
id int PK
name nvarchar
type datatype
[recordindex]
id int PK
....
[recordvalues]
recordid int PK
fieldid int PK
value nvarchar
whenever messages are sent, I store which fields were mapped to what
variables for that deployment. So the query with a variable criteria
looks like this:
select coalesce(vm.value, rv.value)
from sentmessages sm
inner join variablemapping vm on vm.deploymentid=sm.deploymentid
left outer join recordvalues rv on
rv.recordid=sm.recordid and rv.fieldid=vm.fieldid
where coalesce(vm.value, rv.value) ....
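to make the precedence in that coalesce concrete, here's a miniature of
the same join in Python/sqlite3 rather than SQL Server. the table and
column names follow the post; the extra `variable` column on
variablemapping, the 'city'/'promo' variables, and the sample values are
my own assumptions for illustration. a variable given a literal value at
send time wins; a variable mapped to a field falls through to the
record's value:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE sentmessages   (id INTEGER PRIMARY KEY, deploymentid INT, recordid INT);
CREATE TABLE variablemapping (deploymentid INT, variable TEXT, fieldid INT, value TEXT);
CREATE TABLE recordvalues   (recordid INT, fieldid INT, value TEXT,
                             PRIMARY KEY (recordid, fieldid));
""")

# deployment 1: variable 'city' mapped to field 10, variable 'promo'
# given a literal value at send time (so vm.value is non-NULL for it)
cur.execute("INSERT INTO variablemapping VALUES (1, 'city', 10, NULL)")
cur.execute("INSERT INTO variablemapping VALUES (1, 'promo', NULL, 'SAVE20')")
cur.execute("INSERT INTO recordvalues VALUES (7, 10, 'Boston')")
cur.execute("INSERT INTO sentmessages VALUES (1, 1, 7)")

rows = cur.execute("""
SELECT vm.variable, COALESCE(vm.value, rv.value)
FROM sentmessages sm
JOIN variablemapping vm ON vm.deploymentid = sm.deploymentid
LEFT JOIN recordvalues rv
  ON rv.recordid = sm.recordid AND rv.fieldid = vm.fieldid
ORDER BY vm.variable
""").fetchall()
# 'city' resolves through the record, 'promo' from the literal mapping
```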
this model works pretty well for searching messages with variable
criteria and looking up variable values for a particular message. the
big problem I have is that the recordvalues table is HUGE: 1 million
records with 50 fields each = 50 million recordvalues rows. The value
column, the two int columns, plus the two indexes I have on the table
make it into a beast. Importing data takes forever. Querying the records
(with a field criteria) also takes longer than it should.
makes sense, the performance was largely IO bound.
I decided to try and cut into that IO. looking at a recordvalues table
with over 100 million rows in it, there were only about 3 million unique
values. so I split the recordvalues table into two tables:
[recordvalues]
recordid int PK
fieldid int PK
valueid int
[valueindex]
id int PK
value nvarchar (unique)
now, valueindex holds 3 million unique values and recordvalues
references them by id. to my surprise this shaved only 500mb off a 4gb
database!
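for what it's worth, here is that split in miniature (again
Python/sqlite3 rather than SQL Server; the store() helper and the sample
values are mine). duplicates collapse into one valueindex row, and
resolving a record's values costs one extra join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE valueindex  (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
CREATE TABLE recordvalues (recordid INT, fieldid INT, valueid INT,
                           PRIMARY KEY (recordid, fieldid));
""")

def store(recordid, fieldid, value):
    # insert the value at most once; the UNIQUE constraint makes
    # repeats a no-op, then we look up the surviving id
    cur.execute("INSERT OR IGNORE INTO valueindex (value) VALUES (?)", (value,))
    (valueid,) = cur.execute("SELECT id FROM valueindex WHERE value = ?",
                             (value,)).fetchone()
    cur.execute("INSERT INTO recordvalues VALUES (?, ?, ?)",
                (recordid, fieldid, valueid))

# two records sharing a city: 'Boston' is stored only once
store(1, 10, "Boston")
store(2, 10, "Boston")
store(1, 11, "Smith")

(unique_values,) = cur.execute("SELECT COUNT(*) FROM valueindex").fetchone()

# reading values back is one extra join against valueindex
resolved = cur.execute("""
SELECT rv.recordid, vi.value
FROM recordvalues rv JOIN valueindex vi ON vi.id = rv.valueid
WHERE rv.fieldid = 10 ORDER BY rv.recordid
""").fetchall()
```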
importing didn't get any faster either; although it's no longer IO
bound, the CPU has become the new bottleneck, and that outweighed the
old IO bottleneck. this is probably because I haven't optimized the
queries for the new tables (I was hoping it wouldn't be so hard once the
IO problem was gone).
is there a better way to accomplish what I'm trying to do (eliminate the
redundant data)? does SQL Server have built-in constructs to do stuff
like this? It seems like maybe I'm trying to duplicate, at a high level,
functionality that may already exist at a lower level.
IO is becoming a serious bottleneck.
the million-record, 50-field csv file is only 500mb. I would've thought
that after eliminating all the redundant first names, cities, last
names, etc. it would be less data, not 8x more!
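a back-of-envelope sketch of where the 8x could come from. every byte
count below is an assumption on my part (roughly 9 bytes of per-row
overhead, 4-byte ints, nvarchar stored as UTF-16 at 2 bytes/char for an
average 10-character value), not a measurement, but even crude numbers
land in the observed ballpark:

```python
# all byte counts here are assumed, not measured
rows = 1_000_000 * 50          # 1M records x 50 fields

ROW_OVERHEAD = 9               # assumed per-row header/null-bitmap cost
INT = 4                        # recordid, fieldid
AVG_VALUE = 10 * 2             # ~10 chars as UTF-16 (nvarchar doubles ascii)

base_table = rows * (2 * INT + AVG_VALUE + ROW_OVERHEAD)

# each secondary index entry repeats its key columns plus overhead
index_entry = 2 * INT + ROW_OVERHEAD
two_indexes = 2 * rows * index_entry

total_gb = (base_table + two_indexes) / 1024**3
csv_gb = 0.5                   # the source file, for comparison
blowup = total_gb / csv_gb
```

even with these rough numbers the table plus indexes comes out in the
3-4gb range: the keys, per-row overhead, and index entries dwarf the
raw character data, and nvarchar alone doubles a one-byte-per-char csv.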
-
Gordon