
design issues with large amounts of data

I'm tinkering around with a data collection system and have come up
with a very hackish way to store my data. For reference, I'm
anticipating collecting at least 100 million distinct DataId records
per year, possibly many more.

--- 366 data tables (one for each day of the year), each row assigned
a DataId that is unique across all 366 tables
--- 100 data_map tables: table 0 holding all DataIds ending in 00,
table 99 holding all DataIds ending in 99, and so on

This is mostly because a friend of mine who works with MySQL said it
is very slow to index large tables, even if you work mostly with
integers. However, I've read that MySQL can handle millions of rows
with no problem, so it seems my basic design is overly complicated and
will lead to tons of slowdowns thanks to all the joins.

Another friend of mine suggested using file partitioning (though he
uses MSSQL), so is that another option?

Any advice?
Jul 4 '08 #1
On Thu, 3 Jul 2008 18:55:08 -0700 (PDT), nflacco
<ma*********@gmail.com> wrote:

A table per day is a REALLY BAD IDEA.
Remember that indexed lookups are very efficient: one record out of
4 billion can be found using 32 comparisons (2^32 is about 4 billion).
Sure, indexing that table (if there was no index before) might take a
while, but it doesn't have to be done more than once.

Why don't you run some performance tests on realistic hardware with a
realistic data set?
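
For instance, a single table with a clustered index on the collection
timestamp might look like the sketch below (the column names are just
placeholders, not from the original post); new rows land at the end of
the clustered index, so ongoing inserts cause little fragmentation:

    -- One table instead of 366; names are hypothetical.
    CREATE TABLE dbo.DataPoint (
        DataId      bigint   IDENTITY(1,1) NOT NULL,
        CollectedAt datetime NOT NULL,
        Value       float    NOT NULL,
        CONSTRAINT PK_DataPoint PRIMARY KEY NONCLUSTERED (DataId)
    );

    -- Clustering on the timestamp keeps inserts append-only and makes
    -- date-range scans sequential.
    CREATE CLUSTERED INDEX CIX_DataPoint_CollectedAt
        ON dbo.DataPoint (CollectedAt);

    -- One day's data is then a simple range seek, no joins needed:
    SELECT DataId, CollectedAt, Value
    FROM dbo.DataPoint
    WHERE CollectedAt >= '20080701' AND CollectedAt < '20080702';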

-Tom.

[ SNIP - quoted original post ]
Jul 4 '08 #2
> Another friend of mine suggested using file partitioning (though he
> uses MSSQL), so is that another option?
Partitioning is good for managing very large tables because you can rebuild
individual partition indexes without touching the entire table. This
reduces rebuild time and intermediate space requirements. Be aware that the
partitioning feature is available only in Enterprise and Developer editions.

With a good indexing strategy, response time should ideally be proportional
to the amount of data retrieved (barring cached data), regardless of
whether partitioning is used. Partitioning by date can facilitate certain
processes, such as incremental data loads and purge/archival, as well as
certain types of queries. However, with or without partitioning, indexing
is the key from a performance perspective.
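
For illustration, a date-based partition setup might look something like
this sketch (the monthly boundary values, filegroup mapping, and table
name are assumptions for the example, not a recommendation):

    -- Partition function: one partition per month (boundaries are
    -- examples).
    CREATE PARTITION FUNCTION pfByMonth (datetime)
    AS RANGE RIGHT FOR VALUES ('20080101', '20080201', '20080301');

    -- Map all partitions to PRIMARY to keep the sketch simple; in
    -- practice you might spread them over several filegroups.
    CREATE PARTITION SCHEME psByMonth
    AS PARTITION pfByMonth ALL TO ([PRIMARY]);

    -- The table is created on the scheme, keyed by the date column.
    CREATE TABLE dbo.DataPointPartitioned (
        DataId      bigint   NOT NULL,
        CollectedAt datetime NOT NULL,
        Value       float    NOT NULL
    ) ON psByMonth (CollectedAt);

An individual partition's index can then be rebuilt with
ALTER INDEX ... REBUILD PARTITION = n, leaving the rest of the table
untouched.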

--
Hope this helps.

Dan Guzman
SQL Server MVP
http://weblogs.sqlteam.com/dang/

"nflacco" <ma*********@gm ail.comwrote in message
news:9e******** *************** ***********@d19 g2000prm.google groups.com...
[ SNIP - quoted original post ]
Jul 4 '08 #3
On Jul 3, 8:14 pm, "Dan Guzman" <guzma...@nospam-online.sbcglobal.net>
wrote:
> Partitioning is good for managing very large tables because you can
> rebuild individual partition indexes without touching the entire table.
[ SNIP ]
The re-indexing is what worries me. I'll be constantly adding new data
to the main table (formerly the 366 day tables) if we follow the
don't-use-too-many-tables scheme, as well as to the processed-data tables.
Jul 4 '08 #4
> The re-indexing is what worries me. I'll be constantly adding new data
> to the main table (formerly the 366 day tables) if we follow the
> don't-use-too-many-tables scheme, as well as to the processed-data
> tables.
I think partitioning by date will address the indexing issue, since you can
rebuild individual partitions (assuming Enterprise Edition is available for
your application). If you must use separate tables, I would go with
Erland's suggestion of monthly tables and a partitioned view. One table per
day seems like overkill to me.
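
A minimal sketch of the partitioned-view approach (two example months;
table and column names are invented): the CHECK constraints are what let
the optimizer skip tables that cannot match a date filter.

    -- One table per month, each with a CHECK constraint on its range.
    CREATE TABLE dbo.Data_200807 (
        DataId      bigint   NOT NULL PRIMARY KEY,
        CollectedAt datetime NOT NULL
            CHECK (CollectedAt >= '20080701' AND CollectedAt < '20080801'),
        Value       float    NOT NULL
    );

    CREATE TABLE dbo.Data_200808 (
        DataId      bigint   NOT NULL PRIMARY KEY,
        CollectedAt datetime NOT NULL
            CHECK (CollectedAt >= '20080801' AND CollectedAt < '20080901'),
        Value       float    NOT NULL
    );
    GO

    -- Queries filtered on CollectedAt touch only the tables whose
    -- constraints can match.
    CREATE VIEW dbo.DataAll
    AS
    SELECT DataId, CollectedAt, Value FROM dbo.Data_200807
    UNION ALL
    SELECT DataId, CollectedAt, Value FROM dbo.Data_200808;

(To make such a view updatable, SQL Server additionally requires the
partitioning column to be part of each table's primary key; the sketch
above is read-only.)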

--
Hope this helps.

Dan Guzman
SQL Server MVP
http://weblogs.sqlteam.com/dang/

"nflacco" <ma*********@gm ail.comwrote in message
news:be******** *************** ***********@v1g 2000pra.googleg roups.com...
[ SNIP - quoted message ]
Jul 4 '08 #5
"Erland Sommarskog" <es****@sommars kog.sewrote in message
news:Xn******** **************@ 127.0.0.1...
[ SNIP ]
> You seem to be concerned with indexing, and it cannot be denied that
> there might be a problem. Then again, if you build the index once,
> and new data is added in such a way that there is little fragmentation,
> it's not an issue at all. We would need to know more about what the
> table looks like, and how it's loaded, to give more specific advice.
I've been reading this thread with interest. I have never had to deal with
monster databases, for some suitable definition of "monster" (you'll know
one when you see one, I guess). I started doing some research, which is
ongoing; for an intermediate-level guy like myself, these were helpful:

http://www.sqljunkies.com/article/f4...7eb0a5835.scuk

and

http://msdn.microsoft.com/en-us/library/aa964133.aspx

Several questions I have related to indexing:

1) I saw that MySQL has a delayed insert feature: inserts wait until the
table is not in use by any other thread, and queued rows get written as a
block. I can see this being useful if there are lots of selects and an
insert can take a fair bit of time. Does SQL Server have anything like
this?

2) If the new data is being delivered in large batches, it seems to me
that scheduling a drop index / bulk insert / create index sequence would
be better, at least in many cases, since the index would get rebuilt from
scratch. Is this correct?
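
Something like this is the pattern I mean (table, index, and file names
are made up):

    -- Drop the index so the bulk load doesn't maintain it row by row.
    DROP INDEX IX_DataPoint_CollectedAt ON dbo.DataPoint;

    -- Load the batch; TABLOCK allows minimal logging under the simple
    -- or bulk-logged recovery model.
    BULK INSERT dbo.DataPoint
    FROM 'C:\loads\datapoints_20080704.dat'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

    -- Rebuild the index from scratch after the load.
    CREATE INDEX IX_DataPoint_CollectedAt ON dbo.DataPoint (CollectedAt);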

AHS
Jul 4 '08 #6
Arved Sandstrom (as********@accesswave.ca) writes:
> 1) I saw that MySQL has a delayed insert feature: inserts wait until the
> table is not in use by any other thread, and queued rows get written as
> a block. I can see this being useful if there are lots of selects and an
> insert can take a fair bit of time. Does SQL Server have anything like
> this?

One wonders what happens with the connection that performs the INSERT.
Does it block while the INSERT is delayed? Or does this mean that if a
process inserts, commits, and then selects its inserted data, it may not
be there? That looks like a likely violation of the ACID principle to me.

No, there is nothing like that in SQL Server. (There is a deferred delete
model, where rows are not always physically deleted immediately but are
freed up by a cleanup process; however, the rows are logically deleted, so
this is just an implementation detail.)
> 2) If the new data is being delivered in large batches, it seems to me
> that scheduling a drop index / bulk insert / create index sequence would
> be better, at least in many cases, since the index would get rebuilt
> from scratch. Is this correct?

As always when it comes to performance issues: you will have to benchmark.
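
A minimal first cut in T-SQL, assuming a table like the one discussed
above: SET STATISTICS TIME and IO report elapsed time and page reads per
statement, though a real test still needs realistic data volumes.

    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    SELECT COUNT(*)
    FROM dbo.DataPoint
    WHERE CollectedAt >= '20080701' AND CollectedAt < '20080702';

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;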
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
Jul 4 '08 #7
"Erland Sommarskog" <es****@sommars kog.sewrote in message
news:Xn******** **************@ 127.0.0.1...
> Arved Sandstrom (as********@accesswave.ca) writes:
>> 1) I saw that MySQL has a delayed insert feature: inserts wait until
>> the table is not in use by any other thread, and queued rows get
>> written as a block. I can see this being useful if there are lots of
>> selects and an insert can take a fair bit of time. Does SQL Server
>> have anything like this?
>
> One wonders what happens with the connection that performs the INSERT.
> Does it block while the INSERT is delayed? Or does this mean that if a
> process inserts, commits, and then selects its inserted data, it may
> not be there? That looks like a likely violation of the ACID principle
> to me.

The INSERT DELAYED in MySQL returns to the client immediately... there is
no blocking. The rows are queued in memory until the actual insert. This
means that delayed rows are not visible to SELECT statements until that
insert happens. The MySQL documentation
(http://dev.mysql.com/doc/refman/6.0/...t-delayed.html) has a lot of
caveats and explanations which, upon further perusal, sort of answer my
question: no, one should not normally use the feature.
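
For reference, the syntax is just an extra keyword on the INSERT (the
table and columns here are hypothetical, and the feature applies only to
certain storage engines such as MyISAM):

    -- MySQL: the client gets control back immediately; the row sits in
    -- a per-table queue until a handler thread writes it.
    INSERT DELAYED INTO sensor_log (sensor_id, reading, logged_at)
    VALUES (42, 3.14, NOW());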
> No, there is nothing like that in SQL Server. (There is a deferred
> delete model, where rows are not always physically deleted immediately
> but are freed up by a cleanup process; however, the rows are logically
> deleted, so this is just an implementation detail.)
>
>> 2) If the new data is being delivered in large batches, it seems to me
>> that scheduling a drop index / bulk insert / create index sequence
>> would be better, at least in many cases, since the index would get
>> rebuilt from scratch. Is this correct?
>
> As always when it comes to performance issues: you will have to
> benchmark.
True enough.
AHS
Jul 4 '08 #8
Arved Sandstrom (as********@accesswave.ca) writes:
> The INSERT DELAYED in MySQL returns to the client immediately... there
> is no blocking. The rows are queued in memory until the actual insert.
> This means that delayed rows are not visible to SELECT statements until
> that insert happens.
And if the server goes down before that, the rows are never persisted?
> ...sort of answers my question: no, one should not normally use the
> feature.
Certainly! Possibly they should rename it to INSERT MAYBE? :-)

Then again, with SQL Server (and surely other products as well), some
client APIs offer asynchronous calls, so that you can send off an SQL
statement and pick up the result later. The difference here is that the
caller is actually compelled to pick up the acknowledgement, and cannot
submit a new command on that connection until this has been done. And most
of all, the RDBMS as such does not violate ACID.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
Jul 5 '08 #9
"Erland Sommarskog" <es****@sommars kog.sewrote in message
news:Xn******** **************@ 127.0.0.1...
> Arved Sandstrom (as********@accesswave.ca) writes:
>> The INSERT DELAYED in MySQL returns to the client immediately... there
>> is no blocking. The rows are queued in memory until the actual insert.
>> This means that delayed rows are not visible to SELECT statements
>> until that insert happens.
>
> And if the server goes down before that, the rows are never persisted?
That is correct.
>> ...sort of answers my question: no, one should not normally use the
>> feature.
>
> Certainly! Possibly they should rename it to INSERT MAYBE? :-)
One should treat it that way. MySQL also has INSERT LOW_PRIORITY, which
causes the client to wait until the insert is done. MySQL appears not to
have an asynchronous call (such as you mention below for SQL Server) that
_compels_ the client not to do anything (command-wise) on that connection
until an ack; however, I'd assume that in those use cases where one feels
INSERT DELAYED is called for, one would do a SHOW STATUS and check the
value of the 'Not_flushed_delayed_rows' variable before doing a SELECT on
the affected table.
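
That check would be something like:

    -- MySQL: number of rows still waiting in delayed-insert queues;
    -- 0 means all delayed rows have been written to their tables.
    SHOW STATUS LIKE 'Not_flushed_delayed_rows';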
> Then again, with SQL Server (and surely other products as well), some
> client APIs offer asynchronous calls, so that you can send off an SQL
> statement and pick up the result later.
Yes, I'd seen that in ADO.NET 2.0.
> The difference here is that the caller is actually compelled to pick up
> the acknowledgement, and cannot submit a new command on that connection
> until this has been done. And most of all, the RDBMS as such does not
> violate ACID.
PostgreSQL has this too. PQsendQuery (in the C libpq library, for
example), along with its variants, sends a command without waiting for the
result. Invoking PQgetResult (again, the C function) returns a null
pointer once the command is complete. You cannot call PQsendQuery again on
the same connection until you get that null pointer from PQgetResult.
There is also asynchronous notification.

AHS
Jul 5 '08 #10
