Should I split this 175 Million record table?

Hello,

We maintain a 175 million record database table for our customer.
This is an extract of some data collected for them by a third party
vendor, who sends us regular updates to that data (monthly).

The original data for the table came in the form of a single, large
text file, which we imported.

This table contains name and address information on potential
customers.

It is a maintenance nightmare for us, as prior to this the largest
table we maintained was about 10 million records, with less
complicated updates required.

Here is the problem:
* In order to do the searching we need to do on the table it has 8 of
its 20 columns indexed.
* It takes hours and hours to do anything to the table.
* I'd like to cut down as much as possible the time required to update
the file.

We receive monthly one file containing 10 million records that are
new, and can just be appended to the table (no problem, simple import
into SQL Server).

We also receive monthly one file containing 10 million records that
are updates of information in the table. This is the tricky one. The
only way to uniquely pair up a record in the update file with a record
in the full database table is by a combination of individual_id, zip,
and zip_plus4.

There can be multiple records in the database for any given
individual, because that individual could have a history that includes
multiple addresses.

How would you recommend handling this update? So far I have mostly
tried a number of execution plans involving deleting out the records
in the table that match those in the text file, so I can then import
the text file, but the best of those plans takes well over 6 hours to
run.

My latest thought: Would it help in any way to partition the table
into a number of smaller tables, with a view used to reference them?

We have no performance issues querying the table, but I need some
thoughts on how to better maintain it.

One more thing, we do have 2 copies of the table on the server at all
times so that one can be actively used in production while we run
updates on the other one, so I can certainly try out some suggestions
over the next week.

Regards,

Warren Wright
Dallas
Jul 20 '05 #1
On 1 Apr 2004 10:21:35 -0800, Warren Wright wrote:
> We maintain a 175 million record database table for our customer.
> This is an extract of some data collected for them by a third party
> vendor, who sends us regular updates to that data (monthly). (snip)

If this is such a performance nightmare, my first thought is whether
all this data is really needed. Maybe you can move all data older than
a set number of months to a history table?

> How would you recommend handling this update? So far I have mostly
> tried a number of execution plans involving deleting out the records
> in the table that match those in the text file, so I can then import
> the text file, but the best of those plans takes well over 6 hours to
> run.


This might be a situation where SQL Server's proprietary syntax wins
over the ANSI standard. I'd try first importing the text file into a
temp work table, then doing the update as follows:

UPDATE BigTable
SET    Column1 = WorkTable.Column1,
       Column2 = WorkTable.Column2,
       (...)
       ColumnN = WorkTable.ColumnN
FROM   BigTable
       INNER JOIN WorkTable
         ON  WorkTable.individual_id = BigTable.individual_id
         AND WorkTable.zip           = BigTable.zip
         AND WorkTable.zip_plus4     = BigTable.zip_plus4

Of course, you would need an index on individual_id, zip and zip_plus4
in BigTable. It would be ideal if that were the clustered index (but
un-clustering another index might hurt performance elsewhere, so do
double-check). Check the execution plan - WorkTable should be
table-scanned; BigTable should be accessed through said index.
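To be concrete, the index I have in mind would be something like this
(the index name is just an example):

CREATE CLUSTERED INDEX IX_BigTable_UpdateKey
    ON BigTable (individual_id, zip, zip_plus4)

Bear in mind that building or rebuilding a clustered index on a table
this size takes considerable time and free disk space, so try it on
the copy you run updates on first.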

That being said, I still think moving most of the rows out of BigTable
into HistoryTable is probably your best bet.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Jul 20 '05 #2
Warren Wright (wa***********@us.scorex.com) writes:
> We also receive monthly one file containing 10 million records that
> are updates of information in the table. This is the tricky one. The
> only way to uniquely pair up a record in the update file with a record
> in the full database table is by a combination of individual_id, zip,
> and zip_plus4.
>
> There can be multiple records in the database for any given
> individual, because that individual could have a history that includes
> multiple addresses.
>
> How would you recommend handling this update? So far I have mostly
> tried a number of execution plans involving deleting out the records
> in the table that match those in the text file, so I can then import
> the text file, but the best of those plans takes well over 6 hours to
> run.
>
> My latest thought: Would it help in any way to partition the table
> into a number of smaller tables, with a view used to reference them?


Partitioned views are helpful for the case where you import new rows.
You can import the new data into an empty table, set up indexes on
that table, and then as a last step redefine the view to include the
new table. Don't forget the partitioning constraint.
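To sketch what that looks like (all names, data types and the
partitioning column here are made up - you would have one such member
table per month, or per whatever range fits your data):

CREATE TABLE BigTable_200404 (
    load_month    char(6) NOT NULL
        CHECK (load_month = '200404'),
    individual_id int     NOT NULL,
    zip           char(5) NOT NULL,
    zip_plus4     char(4) NOT NULL,
    -- ...the remaining columns...
    CONSTRAINT PK_BigTable_200404
        PRIMARY KEY (load_month, individual_id, zip, zip_plus4)
)
GO

CREATE VIEW BigTableView AS
    SELECT * FROM BigTable_200403
    UNION ALL
    SELECT * FROM BigTable_200404
    -- ...one SELECT per member table
GO

For the view to work as a partitioned view, the partitioning column
must be part of the primary key and each member table must have a
CHECK constraint that makes the ranges mutually exclusive.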

Assuming that the updates can hit rows all over this big table,
partitioning does not help you work with a smaller table for the
update.

However, if you have plenty of disks and disk controllers available, you
can spread out the partitions on different filegroups on different disks.
This can improve speed somewhat, but since the transaction log is
unchanged, I would guess that this does not give any fantastic effects.

In any case, deleting rows only to insert them again does not sound
like the most efficient way. Hugo's UPDATE statement seems like a
better way to go. You can also break up the UPDATE into batches, so
that you only update half a million rows at a time or so. This is not
necessarily faster, but it can ease the strain on the transaction log.
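A rough sketch of such a batched update, building on Hugo's statement
(this assumes WorkTable may be consumed as you go - run it against a
copy if you need to keep the imported data):

WHILE EXISTS (SELECT * FROM WorkTable)
BEGIN
    -- take the next slice of keys from the work table
    SET ROWCOUNT 500000
    SELECT individual_id, zip, zip_plus4
    INTO   #Batch
    FROM   WorkTable
    SET ROWCOUNT 0

    -- apply this slice to the big table
    UPDATE b
    SET    Column1 = w.Column1,
           Column2 = w.Column2
           -- ...and so on for the other columns
    FROM   BigTable b
    JOIN   WorkTable w ON  w.individual_id = b.individual_id
                       AND w.zip           = b.zip
                       AND w.zip_plus4     = b.zip_plus4
    JOIN   #Batch c    ON  c.individual_id = w.individual_id
                       AND c.zip           = w.zip
                       AND c.zip_plus4     = w.zip_plus4

    -- remove the processed rows so the loop terminates
    DELETE w
    FROM   WorkTable w
    JOIN   #Batch c ON  c.individual_id = w.individual_id
                    AND c.zip           = w.zip
                    AND c.zip_plus4     = w.zip_plus4

    DROP TABLE #Batch
END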

--
Erland Sommarskog, SQL Server MVP, so****@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #3
Hugo & Erland,

Thanks for the advice. I will try the update option, and I'll drop the
partitioned view idea for now.

Since the new records that come each month would most likely be spread
randomly across the partitions, it wouldn't help much...and the
imports aren't what is really killing us right now.

One thing I thought was odd yesterday as I was working on these updates:

I was just trying to get the simple import of new records files out of
the way. There are two of them, and each contains about 10 million
records.

Before importing them I first dropped all indexes from the 175 million
record table so that the import operation should be a simple append, and
shouldn't take much time (a few hours).
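For reference, the kind of plain append I mean is basically just a
bulk load along these lines (the file name and delimiters here are
only illustrative, not our real ones):

BULK INSERT dbo.BigTable
FROM 'D:\extracts\new_records_200404.txt'
WITH (FIELDTERMINATOR = '|',
      ROWTERMINATOR   = '\n',
      TABLOCK,
      BATCHSIZE = 100000)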

However, unlike previous imports where I did that, this one really
crawled...only incrementing the row count by a thousand every 7 or 8
seconds, and this would have taken forever.

Anything that could have been keeping the import from just acting as a
simple append?

Regards,

Warren Wright
Dallas

Jul 20 '05 #4
As part of a performance impact study I would examine whether
pre-sorting the input data and having an appropriately ordered index
(based on the same sort order) gives rise to any performance
improvements due to improved local caching of the data by SQL Server
(ie. improving the cache hit ratio).
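For instance, if the big table were clustered on (individual_id, zip,
zip_plus4) as suggested earlier in the thread, you could cluster the
imported work table the same way (index name invented):

CREATE CLUSTERED INDEX IX_WorkTable_Key
    ON WorkTable (individual_id, zip, zip_plus4)

With both tables ordered on the join key, the optimizer at least has
the option of a merge join, which reads each side sequentially rather
than jumping around the big table at random.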

If you have 175 million records and all 11 million result in updates,
that's roughly 6% of the database, or about 1 in 16 records, but you
don't know which 6%. Depending on the length of your records, if you
have more than 16 records per page (8Kb in SQL Server 2000), then
having the input records ordered improves your chances of hitting the
same page several times in succession, benefiting from the cache.
Otherwise, with 175 million records which are randomly hit, the cache
will probably improve the index performance but the data pages are
likely to continue to be read from disk, unless you have a seriously
large amount of RAM (ie. enough for 175 million records).
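To put rough numbers on that: if a row averaged, say, 400 bytes, you
would fit about 20 rows on an 8Kb page, so 175 million rows comes to
something like 9 million pages, or roughly 70Gb of data pages - far
more than is likely to stay cached.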

You get 11 million records every month but you don't mention what
proportion of those 11 million records are new records versus updated
records.

I also agree with the comments made about archiving the older data. If
a particular customer record has multiple addresses listed, I would
keep the current address in a Current table and all but the current
address in a separate table.

Doing this *might* also allow you to reduce the 3-part id/zip/zip+4
key lookup to just an id lookup, and the update process could then
"sanity check" the other 2 parts as part of its processing.
This assumes the ID you refer to is like a unique customer id that
remains with the customer even if they change address - which you
don't describe.
Jul 20 '05 #5
ke**********@hotmail.com (Kevin Frey) wrote
> As part of a performance impact study I would examine whether
> pre-sorting the input data and having an appropriately ordered index
> (based on the same sort order) gives rise to any performance
> improvements due to improved local caching of the data by SQL Server
> (ie. improving the cache hit ratio).
Good idea. I'll try that as well.
> You get 11 million records every month but you don't mention what
> proportion of those 11 million records are new records versus updated
> records.
I get 11 million records that are updates of old records, and another
11 million that are entirely new. I also get a list of around 6
million indiv_id's that are individuals whose records can be
completely dropped (it ends up dropping around 10 million records
since individuals often have multiple records).
> I also agree with the comments made about archiving the older data. If
> a particular customer record has multiple addresses listed, I would
> keep the current address in a Current table and all but the current
> address in a separate table.
The purpose of the table is to be able to, based on a customer's
name/address/ssn_last_4, find a match in the table, and get the
indiv_id associated with that person.

There are multiple addresses for each individual because it increases
the chance there will be a match...if the customer gives us an address
other than his most recent one, or if he has multiple addresses even
discounting old ones.
> Doing this *might* also allow you to reduce the 3-part id/zip/zip+4
> key lookup to just an id lookup, and the update process could then
> "sanity check" the other 2 parts as part of its processing. This
> assumes the ID you refer to is like a unique customer id that remains
> with the customer even if they change address - which you don't
> describe.


I really wish I could drop it down to having an indiv_id primary key.
You are right that it would make life easier.

Do you think it would be useful to set up a true primary key on the
three fields indiv_id, zip, and zip4? I already have to have each of
the 3 independently indexed so I can do lookups on only one of them as
part of various queries into the database.

If I create a primary key index across all three, I still need an
individual index on each of them separately, right? Assuming my
queries need to sometimes use one of the three, but not all three?

Thanks,

Warren
Jul 20 '05 #6
On 6 Apr 2004 10:18:59 -0700, Warren Wright wrote:
> If I create a primary key index across all three, I still need an
> individual index on each of them separately, right? Assuming my
> queries need to sometimes use one of the three, but not all three?


In that case, you can drop one of the individual indexes. Which
depends on the order in which you specify the columns in the PK.

If, for example, you use the order indiv_id/zip/zip4, this index can
also be used to search on indiv_id or indiv_id + zip, but not to
search on zip, on zip4 or on zip + zip4. If you search on indiv_id +
zip4, the PK index can only be used to do the first part of the search
(the indiv_id), not the last (zip4). In this case, you no longer need
a separate index on indiv_id.

Of course, if you change the column order on the PK index, you should
keep the index on indiv_id and drop one of the others.
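To make it concrete (the index name is invented - use whatever yours
is actually called): with the primary key declared in the order
indiv_id/zip/zip4, the separate single-column index on indiv_id
becomes redundant and can be dropped:

DROP INDEX BigTable.IX_BigTable_indiv_id

The separate indexes on zip and on zip4 should stay.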

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Jul 20 '05 #7
> The purpose of the table is to be able to, based on a customer's
> name/address/ssn_last_4, find a match in the table, and get the
> indiv_id associated with that person.
Ummm - in your first email you said name/zip/zip+4. What's this one
about?
> Good idea. I'll try that as well.


This is one of those tables that doesn't have a "natural" primary key
from what I can tell. You could synthesize a primary key using an
IDENTITY column, but as you suggest it is probably better to use the
combination of indiv_id, zip, zip+4 and make that a clustered index.

I would then take the input file of 11 million update records and
pre-sort those into indiv_id,zip,zip+4 to match the clustered index.

Having indiv_id first is also important: with your 6 million delete
records (which produce about 10 million records to delete from the
DB), the server can use the clustered index (since the leading part of
the key is indiv_id) to find all the matching records, and I would
suspect quite efficiently.
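A rough sketch of that delete, assuming the 6 million ids are first
loaded into a staging table (names invented):

DELETE b
FROM   BigTable b
JOIN   DeleteList d ON d.individual_id = b.individual_id

You would probably want to batch this as well, the same way as the
updates, to keep the transaction log under control.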

As the other poster mentioned you don't need three separate indexes
for indiv_id,zip,zip+4 once you have this primary key index. You'll
only need zip and zip+4 indexes since the primary key index satisfies
indiv_id queries (by itself).

Is the actual customer information in this table as well (eg. their
name?) or is that stored in a separate table? If it's stored in this
table that seems like duplication, unless there is good reason for it.

With the archiving concept that has been suggested, you could still
investigate (if possible) how many of your update records "hit" the
"current address" of a customer versus updates that hit "former"
addresses. Assuming you know what the current address is. But I would
not personally go down this track until applying the clustered index
on indiv_id,zip,zip+4, pre-sorting the input data, and doing the
updates to see what happens.

Your original approach of deleting the matching records before
performing an import of the replacement data might also benefit from
having the records pre-sorted to match the clustered key ordering.

Implementation-wise, assuming you have some kind of update application
already written: I'm not sure if UPDATE statements can be prepared,
but if they can, you might achieve a performance improvement by using
prepared statements (I've never personally tried a prepared UPDATE
statement). This should save the server from having to parse each
update statement; instead you only supply the values.
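If they can, the usual route in SQL Server is sp_executesql with
parameters - a rough sketch, where all column and parameter names are
only placeholders:

EXEC sp_executesql
    N'UPDATE BigTable
      SET    address_line1 = @addr1
      WHERE  individual_id = @id
        AND  zip           = @zip
        AND  zip_plus4     = @zip4',
    N'@addr1 varchar(60), @id int, @zip char(5), @zip4 char(4)',
    @addr1 = '123 Main St', @id = 42, @zip = '75201', @zip4 = '1234'

Whether 11 million row-by-row updates can compete with the set-based
UPDATE suggested earlier is another question, of course.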

Cheers

Kevin
Jul 20 '05 #8
