
Size of SQLServer databases with indices

Hello,

I am starting to examine using SQLServer instead of Oracle. I went
through the SQLServer import utility to copy tables from Oracle into
SQLServer. I then checked the size of the database.

I then started to specify indices for the tables. I did not see the
database file size grow all that much. I kept on using the shrink
command to make sure that wasted space was removed. I even made sure
that minimal free space was allowed.

I was rather expecting (from experience with Oracle) that there would
be a large growth in the database when adding indices. So what did I
miss? Are the indices stored elsewhere and I missed them? Does
SQLServer handle indices differently so they don't bloat the database
size?

The size of our data in SQLServer is a key factor in whether we move
from Oracle, so I need to understand how indices affect storage size.

Thanks for any help.

Bart Torbert
ba*******@comcast.net
Jul 20 '05 #1
7 Replies


Bart Torbert (ba*******@comcast.net) writes:
> I am starting to examine using SQLServer instead of Oracle. I went
> through the SQLServer import utility to copy tables from Oracle into
> SQLServer. I then checked the size of the database.
>
> I then started to specify indices for the tables. I did not see the
> database file size grow all that much. I kept on using the shrink
> command to make sure that wasted space was removed. I even made sure
> that minimal free space was allowed.
>
> I was rather expecting (from experience with Oracle) that there would
> be a large growth in the database when adding indices. So what did I
> miss? Are the indices stored elsewhere and I missed them? Does
> SQLServer handle indices differently so they don't bloat the database
> size?


I cannot compare with Oracle, since I don't have any experience of
Oracle.

In SQL Server, yes, indexes take up space. But it depends a little on
what sort of index you add. A clustered index has the data as its
leaf pages, which means that only the index tree itself takes up extra space.

To find out how much space you use on indexes for a certain table, you
can do:

sp_spaceused tbl, TRUE

(The second parameter forces an update of usage data.)

To see the overall database size, it's better to use sp_spaceused with
NULL as the first parameter. The second result set gives you four
numbers:

reserved: this is how much space has been reserved for the table.

data: data + clustered index if there is one.

index: non-clustered indexes.

unused: reserved space, which is not occupied by data or indexes yet.
Since SQL Server allocates space in extents of eight pages at a time,
there is always some unused space. But with heavy fragmentation, this
value can grow.

reserved = data + indexes + unused.
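As a quick sketch of both calls (dbo.Orders is just a placeholder table name):

```sql
-- Whole-database totals; the second result set has
-- reserved, data, index_size and unused
EXEC sp_spaceused;

-- Per-table breakdown; @updateusage = 'TRUE' forces the
-- usage counters to be refreshed first
EXEC sp_spaceused 'dbo.Orders', @updateusage = 'TRUE';
```

Run the per-table form before and after adding an index, and the difference in the index_size column is what the index costs you.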

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #2


"Bart Torbert" <ba*******@comcast.net> wrote in message
news:67**************************@posting.google.com...
> Hello,

Erland has done a decent job (as usual :-) of explaining some factors here,
I just wanted to comment:

> The size of our data in SQLServer is a key factor in whether we move
> from Oracle, so I need to understand how indices affect storage size.
Given the price of disk space these days, I'm going to have to suggest (w/o
knowing your details of course) that this is perhaps not a very good metric
to go by.

Especially if one considers cost of disk space vs. rewriting code, etc.

I'd be really curious to know more about your app and how this metric was
determined. So, if you don't mind sharing, I'd love to hear it, as I suspect
I'll learn something, which in my book is always good.


> Thanks for any help.
>
> Bart Torbert
> ba*******@comcast.net

Jul 20 '05 #3

There are a couple of aspects of our app that make size usage a big
factor. First of all, we have a database with about 5 million
individual entities. These entities have many child tables. These
child tables can have multiple children per entity. So the child
tables can have 10 million plus rows. We have a lot of data to deal
with.

On top of this, we are trying to see if we can install the database
onto our customer's system and do periodic updates of their data. The
smaller the size the easier the support will be, the less new hardware
we force our customers to buy, etc. We have many hundreds of
customers, so the data update issue is really a bottleneck.

Which brings up a couple more questions.

Can someone explain the clustering of indices? How does this affect
performance and storage size?

On one of our tables with 3 million rows, I tried to apply an index on
a text field. It took 45 minutes for the index to process. I did not
turn on clustering. Is this normal, or do I not have something set up
right? The PC I am running on has 1 GB of memory and I think a 1 GHz
CPU. Do we need more horsepower for SQLServer with this large of a
database? Right now, without too many indices, the MDF file is about
23 GB.

Which brings up another question. Is there some threshold within
SQLServer, as far as number of rows, number of tables, or overall DB size,
at which performance is known to start taking a hit?

Thanks for the help.

Bart Torbert
ba**********@ihsenergy.com

"Greg D. Moore \(Strider\)" <mo****************@greenms.com> wrote in message news:<lF*******************@twister.nyroc.rr.com>...
> [snip]

Jul 20 '05 #4


"Bart Torbert" <ba*******@comcast.net> wrote in message
news:67**************************@posting.google.com...
> There are a couple of aspects of our app that make size usage a big
> factor. First of all, we have a database with about 5 million
> individual entities. These entities have many child tables. These
> child tables can have multiple children per entity. So the child
> tables can have 10 million plus rows. We have a lot of data to deal
> with.
Actually that's not all that big. :-) (I have a database that we insert
that many rows a day into. :-)

> On top of this, we are trying to see if we can install the database
> onto our customer's system and do periodic updates of their data. The
> smaller the size the easier the support will be, the less new hardware
> we force our customers to buy, etc. We have many hundreds of
> customers, so the data update issue is really a bottleneck.
Understood. I would just think getting them to convert to SQL over Oracle
would be a bigger issue.


> Which brings up a couple more questions.
>
> Can someone explain the clustering of indices? How does this affect
> performance and storage size?

Basically you can have one clustered index per table. It really has no net
effect on the size of the table. It simply ensures that the data is
physically in the same order as the index. (i.e. if you were indexed on a
SSN, that would match the physical order on the disk.) This permits you to
quickly select data based on that key. (Note there's a bit more to this and
I'm sure others will jump in.)

So generally you want your clustered index on your most useful key.
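As a minimal sketch (the table and column names are made up):

```sql
-- One clustered index per table: this physically orders
-- the rows of dbo.Person by SSN on disk
CREATE CLUSTERED INDEX IX_Person_SSN ON dbo.Person (SSN);
```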

> On one of our tables with 3 million rows, I tried to apply an index on
> a text field. It took 45 minutes for the index to process. I did not
> turn on clustering.
That's not out of the range of normal. Clustering probably would have
taken longer, since it would have to physically re-order the data. (A
non-clustered index is basically a lookup table, so it just has to create
that.)
> Is this normal, or do I not have something set up
> right? The PC I am running on has 1 GB of memory and I think a 1 GHz
> CPU. Do we need more horsepower for SQLServer with this large of a
> database? Right now, without too many indices, the MDF file is about
> 23 GB.
In my experience, cache on the CPU makes a BIG difference in processing.

Another factor is disk I/O. What sort of disks do you have?

Incidentally, these days 23 GB isn't that large of a database any more. :-)

One thing we do on some of our databases is create our indices in a separate
file. This permits SQL Server to write data to one set of disks, update
the index on another, and the logs on a third.
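Something along these lines should do it (the filegroup name, file path, and table names are all hypothetical):

```sql
-- Add a filegroup on a separate set of disks for the indexes
ALTER DATABASE MyDb ADD FILEGROUP INDEXES;
ALTER DATABASE MyDb ADD FILE
    (NAME = MyDb_ix, FILENAME = 'E:\data\MyDb_ix.ndf', SIZE = 1GB)
    TO FILEGROUP INDEXES;

-- Place a non-clustered index on that filegroup
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    ON [INDEXES];
```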

> Which brings up another question. Is there some threshold within
> SQLServer, as far as number of rows, number of tables, or overall DB size,
> at which performance is known to start taking a hit?
It really depends on usage and coding. Like I said, we've got one database
where we insert millions of rows a day. That's running on a quad Xeon
700 MHz box with 2 MB cache. Prior to a code change, this service required
almost 2x the horsepower.

We've got several databases that are over 23 GB in size and we get great
performance on them.

BTW, I recommend you pick up a copy of Kalen Delaney's book "Inside SQL
Server 2000". She's done a far better job of explaining this than I ever
will.


> Thanks for the help.
>
> Bart Torbert
> ba**********@ihsenergy.com

> "Greg D. Moore \(Strider\)" <mo****************@greenms.com> wrote in message news:<lF*******************@twister.nyroc.rr.com>...
> [snip]

Jul 20 '05 #5

Bart Torbert (ba*******@comcast.net) writes:
> There are a couple of aspects of our app that make size usage a big
> factor. First of all, we have a database with about 5 million
> individual entities. These entities have many child tables. These
> child tables can have multiple children per entity. So the child
> tables can have 10 million plus rows. We have a lot of data to deal
> with.
This far, it by no means sounds like a huge database. You later
mention 23 GB. As Greg said, that's not a big database these days.
I would call it mid-sized.

And, I echo Greg and say that size alone is not the measure you should
go with. If you can add an indexed view which takes up 1GB, but which
changes response time of a critical query from 1 hour to 5 seconds, it
would be a folly not to add the view.

Disk is cheap these days, and 23 GB or 70 GB for the database is no
big issue. (Had you been in for 23 TB it would have been another matter!)
The main cost for a bigger database is not in the hardware, but longer
times for backup and restore.
> Can someone explain the clustering of indices? How does this affect
> performance and storage size?
In a clustered index, the leaf pages are the data pages. It is considered
best practice to always have a clustered index on a table in SQL Server.
There may be situations where it is better to have a heap (as a table
without a clustered index is called), but when in doubt, assume clustering
is the way to go.

From this follows that if the only index of your table is the primary key,
this index should be clustered.

If you have more than one index, it takes some judgement to find the
best index to cluster. Usually, the clustered index should be on a column
on which you do range queries. For instance, for an Orders table, the
OrderID is a poor candidate to cluster on, because you usually look at
orders one by one, and rarely at orders 10222 to 11762. The customer ID
or the order date may be better choices. On the other hand in an
OrderDetails table with a PK of (OrderID, Rowno), this PK may be an
excellent choice to cluster on, because you often want to see all rows
for an order.

One thing to keep in mind is that you should keep your clustered index
small bytewise, because in the non-clustered indexes, the clustering
key serves as the row pointer to the data. So a wide clustering key makes the
non-clustered indexes wide too. And the main reason to keep down the size of
the non-clustered indexes is not to conserve disk space, but to speed up
response time. The narrower the index, the more index nodes on a page,
and the faster the index can be searched.

If you go for a non-unique clustered index, SQL Server will add a 4-byte
uniquifier to the clustered index. For this reason, it may be better to
add the PK to the clustered index to make it unique, if the PK is a
four-byte value.
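To make this concrete, a hypothetical OrderDetails table along the lines above might look like:

```sql
-- Cluster OrderDetails on its PK (OrderID, Rowno), since all
-- rows for one order are usually read together
CREATE TABLE dbo.OrderDetails (
    OrderID  int         NOT NULL,
    Rowno    smallint    NOT NULL,
    Product  varchar(40) NOT NULL,
    CONSTRAINT PK_OrderDetails
        PRIMARY KEY CLUSTERED (OrderID, Rowno)
);

-- Any non-clustered index carries the clustering key as its row
-- pointer, so a narrow clustering key keeps this index small too
CREATE NONCLUSTERED INDEX IX_OrderDetails_Product
    ON dbo.OrderDetails (Product);
```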

> On one of our tables with 3 million rows, I tried to apply an index on
> a text field. It took 45 minutes for the index to process. I did not
> turn on clustering. Is this normal, or do I not have something set up
> right? The PC I am running on has 1 GB of memory and I think a 1 GHz
> CPU. Do we need more horsepower for SQLServer with this large of a
> database? Right now, without too many indices, the MDF file is about
> 23 GB.
45 minutes to add a non-clustered index on a three-million-row table
sounds like a long time to me, but you did not mention the size of the column.
You said "text field", but you cannot add an index on a column of the
text datatype, so I assume that it was a varchar column. It might also
take longer to add an index on a heap; I rarely do that, so I don't
have experience there.

Changing the clustered index on a table (including adding one, or dropping
one) can take considerable time, since all data has to be moved, and non-
clustered indexes must be updated. Thankfully you can change a clustered
index by adding WITH DROP_EXISTING when you create the new definition,
so you don't have to move data twice.
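A sketch of that (the index and table names are hypothetical; DROP_EXISTING requires the new index to have the same name as the existing one):

```sql
-- Rebuild the clustered index with a new definition in one pass,
-- instead of dropping it and then creating it again
CREATE CLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate)
    WITH DROP_EXISTING;
```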
> Which brings up another question. Is there some threshold within
> SQLServer, as far as number of rows, number of tables, or overall DB size,
> at which performance is known to start taking a hit?


No, there are no such magic numbers.

However, for some database applications, there can indeed be some magic
number over which performance dips considerably. Consider a database
where the active data is small enough to fit entirely in cache. Response
time will be good, maybe even if some queries are less-than-optimal.
Now, if the portion of active data outgrows the cache, and on top of
that, the access pattern is somewhat cyclic, there will be a lot more
disk accesses, because data falls out of the cache. The decisive factor
is of course how much main memory you have in the machine.

How much the active data is depends on the application, its business
needs, and how well-written the queries are. If a sloppy programmer
introduces a table scan over a big table with historic data, which
normally is dormant, you can see the performance dip all over,
because of the big table taking over the cache. I would assume that
Oracle would behave in the same way.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #6

Next question,

What has been anyone's experience with multiple users hitting a single
database?
Is there some practical limit as to how many users SQLServer can
support? Is the situation helped by heftier hardware (more memory,
faster CPUs, faster disks)?

Has anyone played with putting different tables on different disks to
improve performance?

Bart Torbert
ba**********@ihsenrgy.com
Erland Sommarskog <es****@sommarskog.se> wrote in message news:<Xn*********************@127.0.0.1>...
> [snip]
Jul 20 '05 #7


"Bart Torbert" <ba*******@comcast.net> wrote in message
news:67**************************@posting.google.com...
> Next question,
>
> What has been anyone's experience with multiple users hitting a single
> database?
Umm, fairly common.

> Is there some practical limit as to how many users SQLServer can
> support? Is the situation helped by heftier hardware (more memory,
> faster CPUs, faster disks)?
Again, I'd recommend Kalen Delaney's book Inside SQL Server 2000. I don't
remember the details, but I believe each USER requires about 1K of memory
(it might be per connection, in which case it doesn't make much difference
if you split those connections over multiple users). However, generally this
is NOT where most of your memory is going. Caching data will in
almost every case be the dominant use of memory. More memory is always good
for SQL Server. In fact, it's generally the first place I'd play with.


> Has anyone played with putting different tables on different disks to
> improve performance?

Tables no, but we do break out our indices for that reason. I don't
remember the improvement we saw, but we did see one.

> Bart Torbert
> ba**********@ihsenrgy.com
> Erland Sommarskog <es****@sommarskog.se> wrote in message news:<Xn*********************@127.0.0.1>...
> [snip]

Jul 20 '05 #8

This discussion thread is closed
