
Should I split this 175 Million record table?

Hello,

We maintain a 175 million record database table for our customer.
This is an extract of some data collected for them by a third party
vendor, who sends us regular updates to that data (monthly).

The original data for the table came in the form of a single, large
text file, which we imported.

This table contains name and address information on potential
customers.

It is a maintenance nightmare for us, as prior to this the largest
table we maintained was about 10 million records, with less
complicated updates required.

Here is the problem:
* In order to do the searching we need to do on the table it has 8 of
its 20 columns indexed.
* It takes hours and hours to do anything to the table.
* I'd like to cut down as much as possible the time required to update
the file.

We receive monthly one file containing 10 million records that are
new, and can just be appended to the table (no problem, simple import
into SQL Server).

We also receive monthly one file containing 10 million records that
are updates of information in the table. This is the tricky one. The
only way to uniquely pair up a record in the update file with a record
in the full database table is by a combination of individual_id, zip,
and zip_plus4.

There can be multiple records in the database for any given
individual, because that individual could have a history that includes
multiple addresses.

How would you recommend handling this update? So far I have mostly
tried a number of execution plans involving deleting out the records
in the table that match those in the text file, so I can then import
the text file, but the best of those plans takes well over 6 hours to
run.

My latest thought: Would it help in any way to partition the table
into a number of smaller tables, with a view used to reference them?

We have no performance issues querying the table, but I need some
thoughts on how to better maintain it.

One more thing, we do have 2 copies of the table on the server at all
times so that one can be actively used in production while we run
updates on the other one, so I can certainly try out some suggestions
over the next week.

Regards,

Warren Wright
Dallas
Jul 20 '05 #1
On 1 Apr 2004 10:21:35 -0800, Warren Wright wrote:
We maintain a 175 million record database table for our customer.
This is an extract of some data collected for them by a third party
vendor, who sends us regular updates to that data (monthly). (snip)

If this is such a performance nightmare, my first thought is whether
all this data is really needed. Maybe you can move all data older than
a set number of months to a history table?
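
For illustration, a minimal sketch of such an archive step, assuming a
hypothetical load_date column that records when each row was loaded (the
original post doesn't mention one):

-- Assumes a load_date column and a 24-month retention window (both hypothetical)
INSERT INTO HistoryTable
SELECT * FROM BigTable
WHERE load_date < DATEADD(month, -24, GETDATE())

DELETE FROM BigTable
WHERE load_date < DATEADD(month, -24, GETDATE())

On a table this size both statements would themselves have to run in
batches to keep the transaction log manageable.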

How would you recommend handling this update? So far I have mostly
tried a number of execution plans involving deleting out the records
in the table that match those in the text file, so I can then import
the text file, but the best of those plans takes well over 6 hours to
run.


This might be a situation where SQL Server's proprietary syntax wins
over the ANSI standard. I'd try first importing the text file into a temp
work table, then doing the update as follows:

UPDATE BigTable
SET Column1 = WorkTable.Column1,
    Column2 = WorkTable.Column2,
    (...)
    ColumnN = WorkTable.ColumnN
FROM BigTable
INNER JOIN WorkTable
  ON WorkTable.individual_id = BigTable.individual_id
 AND WorkTable.zip = BigTable.zip
 AND WorkTable.zip_plus4 = BigTable.zip_plus4

Of course, you would need an index on individual_id, zip and zip_plus4
in BigTable. It would be ideal if that were the clustered index
(but unclustering another index might hurt performance elsewhere, so
do double-check). Check the execution plan: WorkTable should be
table-scanned; BigTable should be accessed through said index.
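
As a rough sketch of that work-table step (the file path, delimiters and
column definitions below are placeholders, not from the original post):

-- Work table matching the layout of the monthly update file (columns illustrative)
CREATE TABLE WorkTable (
    individual_id int         NOT NULL,
    zip           char(5)     NOT NULL,
    zip_plus4     char(4)     NOT NULL,
    Column1       varchar(50) NULL
    -- ... remaining columns from the update file
)

-- Load the text file; TABLOCK can allow a minimally logged load,
-- depending on the database recovery model
BULK INSERT WorkTable
FROM 'D:\data\monthly_updates.txt'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', TABLOCK)

-- Supporting index on the join columns in BigTable, if not already present
CREATE INDEX IX_BigTable_id_zip ON BigTable (individual_id, zip, zip_plus4)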

That being said, I still think moving most of the rows out of BigTable
into HistoryTable is probably your best bet.

Best, Hugo
Jul 20 '05 #2
Warren Wright (wa***********@us.scorex.com) writes:
We also receive monthly one file containing 10 million records that
are updates of information in the table. This is the tricky one. The
only way to uniquely pair up a record in the update file with a record
in the full database table is by a combination of individual_id, zip,
and zip_plus4.

There can be multiple records in the database for any given
individual, because that individual could have a history that includes
multiple addresses.

How would you recommend handling this update? So far I have mostly
tried a number of execution plans involving deleting out the records
in the table that match those in the text file, so I can then import
the text file, but the best of those plans takes well over 6 hours to
run.

My latest thought: Would it help in any way to partition the table
into a number of smaller tables, with a view used to reference them?


Partitioned views are helpful for the case where you import new rows.
Then you can import the new data into an empty table, set up indexes
on that table, and then as a last step reorganize the view. Don't
forget the partitioning constraint.
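
A bare-bones sketch of the shape of such a partitioned view (the
load_month partitioning column and the table names are hypothetical,
purely for illustration):

-- One member table per monthly load; the CHECK constraint is the partitioning constraint
CREATE TABLE Prospects_200403 (
    individual_id int     NOT NULL,
    zip           char(5) NOT NULL,
    zip_plus4     char(4) NOT NULL,
    load_month    char(6) NOT NULL CHECK (load_month = '200403'),
    -- ... remaining columns
    CONSTRAINT PK_Prospects_200403
        PRIMARY KEY (load_month, individual_id, zip, zip_plus4)
)
GO

-- The view is redefined each month to include the new member table
CREATE VIEW Prospects AS
SELECT * FROM Prospects_200402
UNION ALL
SELECT * FROM Prospects_200403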

Assuming that the updates can hit rows anywhere in this big table,
partitioning cannot help you in the sense of letting you work with a
smaller table for the update.

However, if you have plenty of disks and disk controllers available, you
can spread out the partitions on different filegroups on different disks.
This can improve speed somewhat, but since the transaction log is
unchanged, I would guess that this does not give any fantastic effects.

In any case, deleting rows only to insert them again does not sound like
the most efficient way. Hugo's UPDATE statement seems like a better way
to go. You can also break up the UPDATE into batches, so that you only
update half a million rows at a time or so. This is not necessarily
faster, but it can ease the strain on the transaction log.
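
One way to sketch that batching is to step through ranges of
individual_id (the range size and variable names are purely illustrative):

DECLARE @lo int, @hi int, @max int
SELECT @lo = MIN(individual_id), @max = MAX(individual_id) FROM WorkTable

WHILE @lo <= @max
BEGIN
    SET @hi = @lo + 100000              -- tune the range size as needed

    UPDATE b
    SET    Column1 = w.Column1          -- ... remaining columns
    FROM   BigTable b
    INNER JOIN WorkTable w
        ON w.individual_id = b.individual_id
       AND w.zip = b.zip
       AND w.zip_plus4 = b.zip_plus4
    WHERE  w.individual_id >= @lo
      AND  w.individual_id < @hi

    SET @lo = @hi
END

Each batch then commits on its own, so the log never has to hold the
whole 10 million row update in one open transaction.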

--
Erland Sommarskog, SQL Server MVP, so****@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #3
Hugo & Erland,

Thanks for the advice. I will try the update option, and I'll drop the
partitioned view idea for now.

Since the new records that come each month would most likely be spread
randomly across the partitions, it wouldn't help much...and the imports
aren't what is really killing us right now.

One thing I thought was odd yesterday as I was working on these updates:

I was just trying to get the simple import of new records files out of
the way. There are two of them, and each contains about 10 million
records.

Before importing them I first dropped all indexes from the 175 million
record table so that the import operation should be a simple append, and
shouldn't take much time (a few hours).

However, unlike previous imports where I did that, this one really
crawled...only incrementing the row count by a thousand every 7 or 8
seconds, and this would have taken forever.

Is there anything that could have kept the import from just acting as a
simple append?

Regards,

Warren Wright
Dallas

Jul 20 '05 #4
As part of a performance impact study I would examine whether
pre-sorting the input data and having an appropriately ordered index
(based on the same sort order) gives rise to any performance
improvements due to improved local caching of the data by SQL Server
(ie. improving the cache hit ratio).

If you have 175 million records and all 11 million result in updates,
that's roughly 6% of the database (about 1 in 16 records), but you don't
know which 6%. Depending on the length of your records, if you have more
than 16 records per page (8 KB in SQL Server 2000), then having the
input records ordered improves your chances of hitting multiple pages
in succession and benefiting from the cache. Otherwise, with 175 million
records hit at random, the cache will probably improve the index
performance, but the data pages are likely to continue to be read
from disk, unless you have a seriously large amount of RAM (ie. enough
for 175 million records).

You get 11 million records every month but you don't mention what
proportion of those 11 million records are new records versus updated
records.

I also agree with the comments made about archiving the older data. If
a particular customer record has multiple addresses listed, I would
keep the current address in a Current table and all but the current
address in a separate table.

Doing this *might* also allow you to reduce the 3-part
id/zip/zip+4 key lookup to just an id lookup, and the update process
could then "sanity check" the other 2 parts as part of its processing.
This assumes the ID you refer to is like a unique customer id that
remains with the customer even if they change address - which you
don't describe.
Jul 20 '05 #5
ke**********@hotmail.com (Kevin Frey) wrote
As part of a performance impact study I would examine whether
pre-sorting the input data and having an appropriately ordered index
(based on the same sort order) gives rise to any performance
improvements due to improved local caching of the data by SQL Server
(ie. improving the cache hit ratio).
Good idea. I'll try that as well.
You get 11 million records every month but you don't mention what
proportion of those 11 million records are new records versus updated
records.
I get 11 million records that are updates of old records, and another
11 million that are entirely new. I also get a list of around 6
million indiv_id's that are individuals whose records can be
completely dropped (it ends up dropping around 10 million records
since individuals often have multiple records).
I also agree with the comments made about archiving the older data. If
a particular customer record has multiple addresses listed, I would
keep the current address in a Current table and all but the current
address in a separate table.
The purpose of the table is to be able to, based on a customer's
name/address/ssn_last_4, find a match in the table, and get the
indiv_id associated with that person.

There are multiple addresses for each individual because it increases
the chance there will be a match...if the customer gives us an address
other than his most recent one, or if he has multiple addresses even
discounting old ones.
In doing this, this *might* also allow you to reduce the 3-part
id/zip/zip+4 key lookup to just an id lookup, and the update process
could then "sanity check" the other 2 parts as part of its processing.
This assumes the ID you refer to is like a unique customer id that
remains with the customer even if they change address - which you
don't describe.


I really wish I could drop it down to having an indiv_id primary key.
You are right that it would make life easier.

Do you think it would be useful to set up a true primary key on the
three fields indiv_id, zip, and zip4? I already have to have each of
the 3 independently indexed so I can do lookups on only one of them as
part of various queries into the database.

If I create a primary key index across all three, I still need an
individual index on each of them separately, right? Assuming my
queries need to sometimes use one of the three, but not all three?

Thanks,

Warren
Jul 20 '05 #6
On 6 Apr 2004 10:18:59 -0700, Warren Wright wrote:
If I create a primary key index across all three, I still need an
individual index on each of them separately, right? Assuming my
queries need to sometimes use one of the three, but not all three?


In that case, you can drop one of the individual indexes. Which
depends on the order in which you specify the columns in the PK.

If, for example, you use the order indiv_id/zip/zip4, this index can
also be used to search on indiv_id or indiv_id + zip, but not to
search on zip, on zip4 or on zip + zip4. If you search on indiv_id +
zip4, the PK index can only be used to do the first part of the search
(the indiv_id), not the last (zip4). In this case, you no longer need
a separate index on indiv_id.

Of course, if you change the column order on the PK index, you should
keep the index on indiv_id and drop one of the others.

Best, Hugo
Jul 20 '05 #7
> The purpose of the table is to be able to, based on a customer's
name/address/ssn_last_4, find a match in the table, and get the
indiv_id associated with that person.
Ummm - in your first email you said name/zip/zip+4. What's this one
about?
Good idea. I'll try that as well.


This is one of those tables that doesn't have a "natural" primary key
from what I can tell. You could synthesize a primary key using an
IDENTITY column, but as you suggest it is probably better to use the
combination of indiv_id, zip, zip+4 and make that a clustered index.

I would then take the input file of 11 million update records and
pre-sort those into indiv_id, zip, zip+4 to match the clustered index.
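
Using the column names from the first post, a sketch of that clustered
key (the constraint name is illustrative, and note that building a
clustered index on a 175 million row table is itself a long-running
operation, so it belongs on the offline copy):

ALTER TABLE BigTable
ADD CONSTRAINT PK_BigTable
    PRIMARY KEY CLUSTERED (individual_id, zip, zip_plus4)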

Having indiv_id as the leading key column is also important: with your 6
million delete records (which produce about 10 million rows to delete
from the DB), the server can use the clustered index, since the first
part of the key is indiv_id, to find all the matching records, and I
would suspect quite efficiently.
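
As a sketch, if the 6 million ids from the delete file are loaded into a
work table (DropList below is hypothetical), the delete becomes a join on
the leading column of that clustered key:

-- DropList is a hypothetical work table holding the indiv_id values to purge
DELETE b
FROM   BigTable b
INNER JOIN DropList d
    ON d.individual_id = b.individual_id
-- On this scale the delete may be worth batching too, as with the UPDATE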

As the other poster mentioned, you don't need three separate indexes
for indiv_id, zip, zip+4 once you have this primary key index. You'll
only need zip and zip+4 indexes, since the primary key index satisfies
indiv_id queries (by itself).

Is the actual customer information in this table as well (eg. their
name?) or is that stored in a separate table? If it's stored in this
table that seems like duplication, unless there is good reason for it.

With the archiving concept that has been suggested, you could still
investigate (if possible) how many of your update records "hit" the
"current address" of a customer versus updates that hit "former"
addresses. Assuming you know what the current address is. But I would
not personally go down this track until applying the clustered index
on indiv_id, zip, zip+4, pre-sorting the input data, and doing the
updates to see what happens.

Your original approach of deleting the matching records before
performing an import of the replacement data might also benefit from
having the records pre-sorted to match the clustered key ordering.

Implementation-wise, assuming you have some kind of update application
already written: I'm not sure if UPDATE statements can be prepared,
but if they can, you might achieve a performance improvement by using
prepared statements (I've never personally tried a prepared UPDATE
statement). This should eliminate the server needing to parse each
update statement and instead you only supply the values.
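
In T-SQL itself the closest thing is a parameterized statement through
sp_executesql, which lets the server compile one plan and reuse it for
every row; a minimal sketch, with the non-key column names and values as
placeholders:

EXEC sp_executesql
    N'UPDATE BigTable
      SET    Column1 = @c1
      WHERE  individual_id = @id
        AND  zip = @zip
        AND  zip_plus4 = @zip4',
    N'@c1 varchar(50), @id int, @zip char(5), @zip4 char(4)',
    @c1 = 'new value', @id = 12345, @zip = '75201', @zip4 = '1234'

That said, a single set-based UPDATE from a work table will normally beat
millions of individual statements, however well they are prepared, so
this mainly matters if the update application has to stay row-by-row.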

Cheers

Kevin
Jul 20 '05 #8
