Need design advice. What's my best approach for storing this data?

Hi,

I am trying to build a tool that analyzes stock data. Therefore I am
going to download and store quite a vast amount of it. Just for a
general number - assuming there are about 7000 listed stocks on the two
major markets plus some extras, 255 trading days a year for 20 years,
that is about 36 million entries.
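
As a quick check of that estimate, the multiplication can be written out. The three figures are the ones given in the post; nothing else is assumed.

```python
# Back-of-the-envelope row count from the figures in the post.
stocks = 7000
trading_days_per_year = 255
years = 20

entries = stocks * trading_days_per_year * years
print(f"{entries:,} entries")  # 35,700,000, i.e. roughly 36 million
```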

Obviously a database is a logical choice for that. However I've never
used one, nor do I know what benefits I would get from using one. I am
worried about speed, memory usage, and disk space.

My initial thought was to put the data in large dictionaries and shelve
them (and possibly zip them to save storage space until the data is
needed). However, these are huge files. Based on ones that I have
already done, I estimated at least 5 gigs for storage this way. My
structure for these files was a three-layered dictionary:
[Market][Stock][Date](Data List). That allows me to easily access any
data for any date or stock in a particular market. Therefore I wasn't
really concerned about the organizational aspects of a db since this
would serve me fine.
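
A minimal sketch of that layout with the standard-library shelve module follows. The sample rows, the (open, high, low, close, volume) ordering, and the helper names store and lookup are made up for illustration. One deliberate change from the [Market][Stock][Date] nesting: a shelf pickles each top-level value as a whole, so keying by market alone would load an entire market to read one day. The sketch instead uses one shelf entry per "market/symbol" pair, which is an assumption about what would work better, not a measured result.

```python
import os
import shelve
import tempfile

# Hypothetical sample rows: [open, high, low, close, volume].
sample = {
    "NYSE": {
        "IBM": {"2006-03-16": [83.1, 83.9, 82.8, 83.2, 5_200_000],
                "2006-03-17": [83.2, 83.6, 82.9, 83.0, 6_100_000]},
    },
    "NASDAQ": {
        "MSFT": {"2006-03-17": [27.3, 27.6, 27.2, 27.5, 60_000_000]},
    },
}

def store(path, data):
    """Write one shelf entry per (market, stock) pair."""
    with shelve.open(path) as db:
        for market, stocks in data.items():
            for symbol, history in stocks.items():
                db[f"{market}/{symbol}"] = history

def lookup(path, market, symbol, date):
    """Unpickle only the requested stock's history, then pick one day."""
    with shelve.open(path) as db:
        return db[f"{market}/{symbol}"][date]

workdir = tempfile.mkdtemp()
shelf_path = os.path.join(workdir, "quotes")
store(shelf_path, sample)
row = lookup(shelf_path, "NYSE", "IBM", "2006-03-17")
print(row)
```

Even with this keying, every lookup unpickles a stock's full history, and any query across stocks (for example, all closes on one date) has to open every entry.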

But before I put this all together I wanted to ask around to see if
this is a good approach. Will it be faster to use a database over a
structured dictionary? And will I get a lot of overhead if I go with a
database? I'm hoping people who have dealt with such large data before
can give me a little advice.

Thanks ahead of time,
Marc

Mar 17 '06 #1
6 Replies



"Mudcat" <mn******@gmail.com> wrote in message
news:11*********************@i40g2000cwc.googlegroups.com...
> Hi,
>
> I am trying to build a tool that analyzes stock data. Therefore I am
> going to download and store quite a vast amount of it. Just for a
> general number - assuming there are about 7000 listed stocks on the two
> major markets plus some extras, 255 trading days a year for 20 years,
> that is about 36 million entries.


On a different tack, to avoid thinking about any db issues, consider
subscribing to TC2000 (tc2000.com)... they already have all that data,
in a database which takes about 900 MB when fully installed.
They also have an API which allows you full access to the database
(including from Python via COM). The API is pretty robust and allows
you to do pre-filtering (e.g. give me the last 20 years of all stocks over
$50 with average daily volume > 100k) at the db level, meaning you can
focus on using Python for analysis. The database is also updated daily.

If you don't need daily updates, then subscribe (first 30 days free) and
cancel, and you've got a snapshot db of all the data you need.

They also used to send out an evaluation CD which had all
the history data barring the last 3 months or so which is certainly
good enough for analysis and testing. Not sure if they still do that.

HTH.
Mar 17 '06 #2

Mudcat:
> My initial thought was to put the data in large dictionaries and shelve
> them (and possibly zip them to save storage space until the data is
> needed). However, these are huge files.


ZODB solves that problem for you.
http://www.zope.org/Wikis/ZODB/FrontPage

More in particular "5.3 BTrees Package":
http://www.zope.org/Wikis/ZODB/guide...00000000000000

But I've only used ZODB for small databases compared to yours. It's
supposed to scale very well, but I can't speak from experience.

--
René Pijlman
Mar 17 '06 #3

> On a different tack, to avoid thinking about any db issues, consider
> subscribing to TC2000 (tc2000.com)... they already have all that data,
> in a database which takes about 900 MB when fully installed.


That is an interesting option also. I had actually looked for ready
made databases and didn't come across this one. Although, I don't
understand how they can fit all that info into 900 MB.

I like this option, but I guess if I decide to keep using this database
then I need to keep up my subscription. The thing I liked about
downloading everything from Yahoo was that I didn't have to pay anyone
for the data.

Does anyone know the best way to compress this data? Or do any of these
databases handle compression automatically? 5 GB will be hard for any
computer to deal with, even in a database.
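
One way to see how 36 million entries could fit in under a gigabyte is to pack each entry into a fixed-width binary record instead of a pickled Python list. The sketch below is only an estimate under an assumed layout (a 16-bit day number, four 32-bit float prices, a 32-bit volume); it is not TC2000's actual format, which the thread does not describe. The sample series is made up and far more regular than real prices, so the compression ratio it prints says nothing about real data.

```python
import struct
import zlib

# Assumed record layout (hypothetical): day number as uint16, then
# open/high/low/close as 32-bit floats, then volume as uint32.
RECORD = struct.Struct("<H4fI")

entries = 7000 * 255 * 20
raw_bytes = entries * RECORD.size
print(RECORD.size, "bytes per record")          # 22
print(round(raw_bytes / 1e6), "MB uncompressed")  # 785

# Made-up, highly regular series, used only to exercise the compressor.
rows = [(day, 50.0, 50.5, 49.5, 50.25, 100_000 + day)
        for day in range(255 * 20)]
packed = b"".join(RECORD.pack(*r) for r in rows)
compressed = zlib.compress(packed, 9)
print(len(packed), "->", len(compressed), "bytes")
```

At 22 bytes per record the full data set comes to about 785 MB before any compression, which is in the same range as the 900 MB figure quoted above.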

Mar 17 '06 #4

In doing a little research I ran across PyTables, which according to
the documentation does this: "PyTables is a hierarchical database
package designed to efficiently manage very large amounts of data." It
also deals with compression and various other handy things. Zope also
seems to be designed to handle large amounts of data with compression
in mind.

Does anyone know which of these two apps would better fit my purpose? I
don't know if either of these has limitations that might not work out
well for what I'm trying to do. I really need to try and compress the
data as much as possible without making the access times really slow.

Thanks

Mar 19 '06 #5

Mudcat wrote:
> In doing a little research I ran across PyTables, which according to
> the documentation does this: "PyTables is a hierarchical database
> package designed to efficiently manage very large amounts of data." It
> also deals with compression and various other handy things. Zope also
> seems to be designed to handle large amounts of data with compression
> in mind.
>
> Does anyone know which of these two apps would better fit my purpose? I
> don't know if either of these has limitations that might not work out
> well for what I'm trying to do. I really need to try and compress the
> data as much as possible without making the access times really slow.


PyTables is exactly suited to storing large amounts of numerical data
arranged in tables and arrays. The ZODB is not.

--
Robert Kern
ro*********@gmail.com

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Mar 19 '06 #6

Mudcat wrote:
> I am trying to build a tool that analyzes stock data. Therefore I am
> going to download and store quite a vast amount of it. Just for a
> general number - assuming there are about 7000 listed stocks on the two
> major markets plus some extras, 255 trading days a year for 20 years,
> that is about 36 million entries.
>
> Obviously a database is a logical choice for that. However I've never
> used one, nor do I know what benefits I would get from using one. I am
> worried about speed, memory usage, and disk space.


This is a typical use case for relational database systems.
With something like DB2 or Oracle here, you can take advantage
of more than 20 years of work by lots of developers trying to
solve the kind of problems you will run into.

You haven't really stated all the facts needed to decide what product
to choose, though. Will this be a multi-user application?
Do you foresee a client/server application? What operating
system(s) do you need to support?

With relational databases, it's plausible to move some of
the hard work in the data analysis into the server. Using
this well means that you need to learn a bit about how
relational databases work, but I think it's worth the trouble.
It could mean that much less data ever needs to reach your
Python program for processing, and that will mean a lot for
your performance. Relational databases are very good at
searching, sorting and simple aggregations of data. SQL is
a declarative language, and in principle, your SQL code
will just declare the correct queries and manipulations that
you want to achieve, and tuning will be a separate activity,
which doesn't need to involve program changes. In reality,
there are certainly cases where changes in SQL code will
influence performance, but to a very large extent, you can
achieve good performance through building indices and by
letting the database gather statistics and analyze the
queries your programs contain. As a bonus, you also have
advanced systems for security, transactional safety, online
backup, replication etc.
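
To make the "move the work into the server" point concrete, here is a small sketch using Python's built-in sqlite3 module. SQLite is swapped in only because it ships with Python and needs no server; it is not one of the products discussed in this post. The table name, column names, and sample rows are all hypothetical. The query does the filtering and aggregation inside the database, so only the matching summary rows reach Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE quotes (
        market TEXT NOT NULL,
        symbol TEXT NOT NULL,
        day    TEXT NOT NULL,      -- ISO date, sorts correctly as text
        close  REAL NOT NULL,
        volume INTEGER NOT NULL,
        PRIMARY KEY (market, symbol, day)
    )
""")

# Made-up sample rows.
rows = [
    ("NYSE",   "IBM",  "2006-03-16", 83.2, 5_200_000),
    ("NYSE",   "IBM",  "2006-03-17", 83.0, 6_100_000),
    ("NASDAQ", "MSFT", "2006-03-17", 27.5, 60_000_000),
    ("NASDAQ", "TINY", "2006-03-17", 60.0, 10_000),
]
conn.executemany("INSERT INTO quotes VALUES (?, ?, ?, ?, ?)", rows)

# Tuning step kept separate from the query text: an index for by-date access.
conn.execute("CREATE INDEX idx_quotes_day ON quotes (day)")

# Declarative filter: stocks that closed above $50 at some point and
# average more than 100k shares a day.
query = """
    SELECT symbol, AVG(volume) AS avg_volume, MAX(close) AS max_close
    FROM quotes
    GROUP BY symbol
    HAVING AVG(volume) > ? AND MAX(close) > ?
    ORDER BY symbol
"""
result = conn.execute(query, (100_000, 50)).fetchall()
print(result)
```

Adding or dropping the index changes how the database executes the query but not the SQL text or the Python code around it.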

You don't get these advantages with any other data storage
systems.

I'd get Chris Fehily's "SQL Visual Quickstart Guide", which
is as good as his Python book. As for the database, it depends a bit
on the platform you work with. I'd avoid MySQL. Some friends
of mine have used it for needs similar to yours, and they are
now running into its severe shortcomings. (I did warn them.)

For Windows, I think the single user version of SQL Server
(MSDE?) is gratis. For both Windows and Linux/Unix, there are
(I think) gratis versions of Oracle 10g, IBM DB2 UDB and
Mimer SQL. Mimer SQL is easy to install, Oracle is a pain, and
I think DB2 is somewhere in between. PostgreSQL is also a good
option.

Either way, it certainly seems natural to learn relational
databases and SQL if you want to work with financial software.
Mar 20 '06 #7
