design issues with large amounts of data

I'm tinkering around with a data collection system, and have come up
with a very hackish way to store my data. For reference, I'm
anticipating collecting at least 100 million different DataId records
per year, possibly many more.

---366 data tables (one for each day of the year), each row being
assigned a unique DataId (unique across all 366 tables too)
---100 data_map tables, table 0 having all DataIds ending in 00, table
99 having all DataIds ending in 99, and so on.

This is mostly because a friend of mine who works with MySQL said it
is very slow to index large tables, even if you work with mostly
integers.
However, I've read that MySQL can handle millions of rows without a problem, so it
seems my basic design is overly complicated and will lead to tons of
slowdowns thanks to all the joins.
Another friend of mine suggested using file partitioning (though he
uses MSSQL), so is that another option?

Any advice?
Jul 4 '08 #1
On Thu, 3 Jul 2008 18:55:08 -0700 (PDT), nflacco
<ma*********@gmail.com> wrote:

A table per day is a REALLY BAD IDEA.
Remember that indexed lookups are very efficient: 1 record out of 4
billion can be found using about 32 comparisons, since 2^32 is roughly 4.3 billion.
Sure, indexing that table (if there was no index before) might take a
while, but it doesn't have to be done more than once.

Why don't you run some performance tests on realistic hardware with a
realistic data set?
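For example, something along these lines would give you a first feel for
indexed lookup performance; the table and column names are just placeholders:

-- Sketch only: SensorData, ReadingDay and Value are made-up names.
CREATE TABLE dbo.SensorData (
    DataId     BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ReadingDay DATETIME NOT NULL,
    Value      FLOAT    NOT NULL
);

-- Generate a million or so synthetic rows spread over a year.
INSERT INTO dbo.SensorData (ReadingDay, Value)
SELECT TOP (1000000)
       DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 366, '20080101'),
       RAND(CHECKSUM(NEWID()))
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;

-- Time a point lookup that uses the index on DataId.
SET STATISTICS TIME ON;
SELECT * FROM dbo.SensorData WHERE DataId = 123456;
SET STATISTICS TIME OFF;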

-Tom.

>[ SNIP ]
Jul 4 '08 #2
Another friend of mine suggested using file partitioning (though he
uses MSSQL), so is that another option?
Partitioning is good for managing very large tables because you can rebuild
individual partition indexes without touching the entire table. This
reduces rebuild time and intermediate space requirements. Be aware that the
partitioning feature is available only in Enterprise and Developer editions.

With a good indexing strategy, response time should ideally be proportional
to the amount of data retrieved (barring cached data), regardless of whether
or not partitioning is used. Partitioning by date can facilitate certain
processes, like incremental data loads and purge/archival, as well as certain
types of queries. However, with or without partitioning, indexing is the
key from a performance perspective.
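As a rough illustration (Enterprise/Developer Edition only; the table name,
boundary values and partition names below are invented for the example), a
date-based layout might look like this:

-- Sketch only: pfByMonth, psByMonth and dbo.SensorData are made-up names.
CREATE PARTITION FUNCTION pfByMonth (DATETIME)
AS RANGE RIGHT FOR VALUES ('20080201', '20080301', '20080401');

CREATE PARTITION SCHEME psByMonth
AS PARTITION pfByMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.SensorData (
    DataId     BIGINT   NOT NULL,
    ReadingDay DATETIME NOT NULL,
    Value      FLOAT    NOT NULL,
    CONSTRAINT PK_SensorData
        PRIMARY KEY CLUSTERED (ReadingDay, DataId)
) ON psByMonth (ReadingDay);

-- Rebuild one partition's index without touching the rest of the table.
ALTER INDEX PK_SensorData ON dbo.SensorData REBUILD PARTITION = 2;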

--
Hope this helps.

Dan Guzman
SQL Server MVP
http://weblogs.sqlteam.com/dang/

"nflacco" <ma*********@gmail.comwrote in message
news:9e**********************************@d19g2000 prm.googlegroups.com...
[ SNIP ]
Jul 4 '08 #3
On Jul 3, 8:14 pm, "Dan Guzman" <guzma...@nospam-online.sbcglobal.net>
wrote:
[ SNIP ]
The re-indexing is what worries me. I'll be constantly adding new data
to the main table (formerly the 366 day tables) if we follow the
don't-use-too-many-tables scheme, as well as to the processed-data tables.
Jul 4 '08 #4
The re-indexing is what worries me. I'll be constantly adding new data
to the main table (formerly the 366 day tables) if we follow the
don't-use-too-many-tables scheme, as well as to the processed-data tables.
I think partitioning by date will address the indexing issue since you can
rebuild individual partitions (assuming Enterprise Edition is available for
your application). In the case of separate tables, I would go with Erland's
suggestion of monthly tables and a partitioned view; one table per day
seems like overkill to me.
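A rough sketch of the monthly-table approach (the table and column names are
only illustrative) would be:

-- Sketch only: Data_2008_01 / Data_2008_02 and their columns are made up.
CREATE TABLE dbo.Data_2008_01 (
    ReadingDay DATETIME NOT NULL
        CHECK (ReadingDay >= '20080101' AND ReadingDay < '20080201'),
    DataId     BIGINT   NOT NULL,
    Value      FLOAT    NOT NULL,
    CONSTRAINT PK_Data_2008_01 PRIMARY KEY (ReadingDay, DataId)
);

CREATE TABLE dbo.Data_2008_02 (
    ReadingDay DATETIME NOT NULL
        CHECK (ReadingDay >= '20080201' AND ReadingDay < '20080301'),
    DataId     BIGINT   NOT NULL,
    Value      FLOAT    NOT NULL,
    CONSTRAINT PK_Data_2008_02 PRIMARY KEY (ReadingDay, DataId)
);
GO

-- The CHECK constraints let the optimizer skip tables whose date range
-- cannot match a query's WHERE clause (partition elimination).
CREATE VIEW dbo.DataAll
AS
SELECT ReadingDay, DataId, Value FROM dbo.Data_2008_01
UNION ALL
SELECT ReadingDay, DataId, Value FROM dbo.Data_2008_02;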

--
Hope this helps.

Dan Guzman
SQL Server MVP
http://weblogs.sqlteam.com/dang/

"nflacco" <ma*********@gmail.comwrote in message
news:be**********************************@v1g2000p ra.googlegroups.com...
[ SNIP ]
Jul 4 '08 #5
"Erland Sommarskog" <es****@sommarskog.sewrote in message
news:Xn**********************@127.0.0.1...
[ SNIP ]
You seem to be concerned with indexing, and it cannot be denied that
there might be a problem. Then again, if you build the index once,
and new data is added in such a way that there is little fragmentation,
it's not an issue at all. We would need to know more about what the table
looks like, and how it's loaded, to give more specific advice.
I've been reading this thread with interest. I have never had to deal with
monster databases, for some suitable definition of "monster" (you'll know
one when you see one, I guess). I started doing some research, which is
ongoing; for an intermediate-level guy like myself, these were very useful:

http://www.sqljunkies.com/article/f4...7eb0a5835.scuk

and

http://msdn.microsoft.com/en-us/library/aa964133.aspx

Several questions I have related to indexing are:

1) I saw that MySQL has a delayed insert feature, that is, inserts wait
until the table is not in use by any other thread, plus inserts get written
as a block. I can see this being useful if there are lots of selects, and an
insert can take a fair bit of time. Does SQL Server have anything like this?

2) If the new data is being delivered in batches (large batches) it seems to
me that scheduling a drop index - bulk insert - create index would be
better, at least in many cases, since the index would get rebuilt from
scratch. Is this correct?

AHS
Jul 4 '08 #6
Arved Sandstrom (as********@accesswave.ca) writes:
1) I saw that MySQL has a delayed insert feature, that is, inserts wait
until the table is not in use by any other thread, plus inserts get
written as a block. I can see this being useful if there are lots of
selects, and an insert can take a fair bit of time. Does SQL Server have
anything like this?
One wonders what happens with the connection that performs the INSERT.
Does it block while the INSERT is delayed? Or does this mean that if a
process inserts, commits and then selects for its inserted data, the data may
not be there? That looks like a likely violation of the ACID principle to me.

No, there is nothing like that in SQL Server. (There is a deferred delete
model, where rows are not always physically deleted immediately, but
are freed up by a cleanup process; the rows are logically deleted,
so this is just an implementation detail.)
2) If the new data is being delivered in batches (large batches) it
seems to me that scheduling a drop index - bulk insert - create index
would be better, at least in many cases, since the index would get
rebuilt from scratch. Is this correct?
As always when it comes to performance issues: you will have to benchmark.
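If you do benchmark it, the pattern to compare against a plain load with the
index in place would be roughly this; the table, index and file names are
placeholders:

-- Sketch only: dbo.SensorData, IX_SensorData_Day and the path are made up.
DROP INDEX IX_SensorData_Day ON dbo.SensorData;

BULK INSERT dbo.SensorData
FROM 'C:\load\sensordata.dat'
WITH (TABLOCK, BATCHSIZE = 100000);

-- Rebuilding from scratch sorts the whole key set once instead of
-- maintaining the B-tree row by row during the load.
CREATE INDEX IX_SensorData_Day
ON dbo.SensorData (ReadingDay)
WITH (SORT_IN_TEMPDB = ON);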
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
Jul 4 '08 #7
"Erland Sommarskog" <es****@sommarskog.sewrote in message
news:Xn**********************@127.0.0.1...
Arved Sandstrom (as********@accesswave.ca) writes:
>1) I saw that MySQL has a delayed insert feature, that is, inserts wait
until the table is not in use by any other thread, plus inserts get
written as a block. I can see this being useful if there are lots of
selects, and an insert can take a fair bit of time. Does SQL Server have
anything like this?

One wonders what happens with the connection that performs the INSERT.
Does it block while the INSERT is delayed? Or does this mean that if a
process inserts, commits and then selects for its inserted data, the data may
not be there? That looks like a likely violation of the ACID principle to me.
The INSERT DELAYED in MySQL returns to the client immediately; there is no
blocking. The insert data are queued in memory until the insert. This means
that delayed rows are not visible to SELECT statements until the actual
insert. The MySQL documentation
(http://dev.mysql.com/doc/refman/6.0/...t-delayed.html) has a lot of
caveats and explanations, which upon further perusal sort of answer my
question: no, one should not normally use the feature.
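For reference, the usage difference is just the keyword, something like this
(MyISAM tables only; the table and columns here are made up):

-- MySQL sketch; sensor_data and its columns are made-up names.
INSERT DELAYED INTO sensor_data (data_id, reading_day, value)
VALUES (42, '2008-07-04', 3.14);

-- The count of queued-but-unwritten rows can be checked before
-- relying on a SELECT against the table.
SHOW STATUS LIKE 'Not_flushed_delayed_rows';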
No, there is nothing like that in SQL Server. (There is a deferred deleted
model, where rows are not always physically deleted immediately, but
are freed up by a cleanup process, but the rows are logically deleted,
so this is just an implementation detail.)
>2) If the new data is being delivered in batches (large batches) it
seems to me that scheduling a drop index - bulk insert - create index
would be better, at least in many cases, since the index would get
rebuilt from scratch. Is this correct?

As always when it comes to performance issues: you will have to benchmark.
True enough.
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
AHS
Jul 4 '08 #8
Arved Sandstrom (as********@accesswave.ca) writes:
The INSERT DELAYED in MySQL returns to the client immediately...there is
no blocking. The insert data are queued in memory until the insert. This
means that delayed rows are not visible to SELECT statements until the
actual insert.
And if the server goes down before that, the rows are never persisted?
sort of answers my question: no, one should not normally use the
feature.
Certainly! Possibly they should rename it to INSERT MAYBE? :-)

Then again, with SQL Server (and surely other products as well) some client
APIs offer asynchronous calls, so that you can send off an SQL statement
and pick up the result later. The difference here is that the caller is
actually compelled to pick up the acknowledgement, and cannot submit a new
command on that connection until this has been done. And most of all, the
RDBMS as such does not violate ACID.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
Jul 5 '08 #9
"Erland Sommarskog" <es****@sommarskog.sewrote in message
news:Xn**********************@127.0.0.1...
Arved Sandstrom (as********@accesswave.ca) writes:
>The INSERT DELAYED in MySQL returns to the client immediately...there is
no blocking. The insert data are queued in memory until the insert. This
means that delayed rows are not visible to SELECT statements until the
actual insert.

And if the server goes down before that, the rows are never persisted?
That is correct.
>sort of answers my question: no, one should not normally use the
feature.

Certainly! Possibly they should rename it to INSERT MAYBE? :-)
One should treat it that way. MySQL also has INSERT LOW_PRIORITY, which
causes the client to wait until the insert is done. It appears not to have
an asynchronous call (such as you mention below for SQL Server) that
_compels_ the client not to do anything (command-wise) on that connection
until an ack; however, I'd assume that in the use cases where one feels
that INSERT DELAYED is called for, one would do a SHOW STATUS and check the
value of the 'not_flushed_delayed_rows' variable before doing a SELECT on
the affected table.
Then again, with SQL Server (and surely other products as well) some
client
API offers asynchronous call, so that you can send away an SQL statement,
and pick the result later.
Yes, I'd seen that in ADO.NET 2.0.
The difference here is that the caller is
actually compelled to pick up the acknowledgement, and cannot submit a new
command on that connection until this has been done. And most of all, the
RDBMS as such does not violate ACID.
PostgreSQL has this too. PQsendQuery (in the C libpq library, for example),
along with its variants, sends a command without waiting for the result.
Invoking PQgetResult (again, the C function) returns a null pointer when
the command is complete. You cannot call PQsendQuery again on the same
connection until you get that null pointer from PQgetResult. There is also
asynchronous notification.

AHS
Jul 5 '08 #10
nflacco wrote:
>
The re-indexing is what worries me. I'll be constantly adding new data
to the main table (formerly the 366 day tables) if we follow the
don't-use-too-many-tables scheme, as well as to the processed-data tables.
If you have a unique clustered index on (Day, DataId), where both are
incrementing, then there's no reason to reindex. As I understand your
proposal, the data will always be appended to the end of the table rather
than causing page splits.
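Roughly like this (the table and column names are just placeholders):

-- Sketch only: dbo.SensorData and its columns are made-up names.
CREATE TABLE dbo.SensorData (
    ReadingDay DATETIME NOT NULL,
    DataId     BIGINT   NOT NULL,
    Value      FLOAT    NOT NULL,
    CONSTRAINT PK_SensorData PRIMARY KEY CLUSTERED (ReadingDay, DataId)
);

-- Rows arriving in (day, id) order land on the last page of the clustered
-- index, so inserts append rather than split pages, and no periodic
-- re-indexing is needed.
INSERT INTO dbo.SensorData (ReadingDay, DataId, Value)
VALUES ('20080707', 100000001, 3.14);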
Jul 7 '08 #11
