473,835 Members | 1,832 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Need design advice. What's my best approach for storing this data?

Hi,

I am trying to build a tool that analyzes stock data. Therefore I am
going to download and store quite a vast amount of it. Just for a
general number - assuming there are about 7000 listed stocks on the two
major markets plus some extras, 255 tradying days a year for 20 years,
that is about 36 million entries.

Obviously a database is a logical choice for that. However I've never
used one, nor do I know what benefits I would get from using one. I am
worried about speed, memory usage, and disk space.

My initial thought was to put the data in large dictionaries and shelve
them (and possibly zipping them to save storage space until the data is
needed). However, these are huge files. Based on ones that I have
already done, I estimated at least 5 gigs for storage this way. My
structure for this files was a 3 layered dictionary.
[Market][Stock][Date](Data List). That allows me to easily access any
data for any date or stock in a particular market. Therefore I wasn't
really concerned about the organizational aspects of a db since this
would serve me fine.

But before I put this all together I wanted to ask around to see if
this is a good approach. Will it be faster to use a database over a
structured dictionary? And will I get a lot of overhead if I go with a
database? I'm hoping people who have dealt with such large data before
can give me a little advice.

Thanks ahead of time,
Marc

Mar 17 '06 #1
6 2508

"Mudcat" <mn******@gmail .com> wrote in message
news:11******** *************@i 40g2000cwc.goog legroups.com...
Hi,

I am trying to build a tool that analyzes stock data. Therefore I am
going to download and store quite a vast amount of it. Just for a
general number - assuming there are about 7000 listed stocks on the two
major markets plus some extras, 255 tradying days a year for 20 years,
that is about 36 million entries.


On a different tack, to avoid thinking about any db issues, consider
subscribing
to TC2000 (tc2000.com)... they already have all that data,
in a database which takes about 900Mb when fully installed.
They also have an API which allows you full access to the database
(including from Python via COM). The API is pretty robust and allows
you do pre-filtering (e.g. give me last 20 years of all stocks over $50
with ave daily vol > 100k) at the db level meaning you can focus on using
Python for analysis. The database is also updated daily.

If you don't need daily updates, then subscribe (first 30 days free) and
cancel, and you've got a snapshot db of all the data you need.

They also used to send out an evaluation CD which had all
the history data barring the last 3 months or so which is certainly
good enough for analysis and testing. Not sure if they still do that.

HTH.
Mar 17 '06 #2
Mudcat:
My initial thought was to put the data in large dictionaries and shelve
them (and possibly zipping them to save storage space until the data is
needed). However, these are huge files.


ZODB solves that problem for you.
http://www.zope.org/Wikis/ZODB/FrontPage

More in particular "5.3 BTrees Package":
http://www.zope.org/Wikis/ZODB/guide...00000000000000

But I've only used ZODB for small databases compared to yours. It's
supposed to scale very well, but I can't speak from experience.

--
René Pijlman
Mar 17 '06 #3
>On a different tack, to avoid thinking about any db issues, consider
subscribing
to TC2000 (tc2000.com)... they already have all that data,
in a database which takes about 900Mb when fully installed.


That is an interesting option also. I had actually looked for ready
made databases and didn't come across this one. Although, I don't
understand how they can fit all that info into 900Mb.

I like this option, but I guess if I decide to keep using this database
then I need to keep up my subcription. The thing I liked about
downloading everything from Yahoo was that I didn't have to pay anyone
for the data.

Does anyone know the best way to compress this data? or do any of these
databases handle compression automatically? 5gig will be hard for any
computer to deal with, even in a database.

Mar 17 '06 #4
In doing a little research I ran across PyTables, which according to
the documentation does this: "PyTables is a hierarchical database
package designed to efficiently manage very large amounts of data." It
also deals with compression and various other handy things. Zope also
seems to be designed to handle large amounts of data with compression
in mind.

Does any know which of these two apps would better fit my purpose? I
don't know if either of these has limitations that might not work out
well for what I'm trying to do. I really need to try and compress the
data as much as possible without making the access times really slow.

Thanks

Mar 19 '06 #5
Mudcat wrote:
In doing a little research I ran across PyTables, which according to
the documentation does this: "PyTables is a hierarchical database
package designed to efficiently manage very large amounts of data." It
also deals with compression and various other handy things. Zope also
seems to be designed to handle large amounts of data with compression
in mind.

Does any know which of these two apps would better fit my purpose? I
don't know if either of these has limitations that might not work out
well for what I'm trying to do. I really need to try and compress the
data as much as possible without making the access times really slow.


PyTables is exactly suited to storing large amounts of numerical data aranged in
tables and arrays. The ZODB is not.

--
Robert Kern
ro*********@gma il.com

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Mar 19 '06 #6
Mudcat wrote:
I am trying to build a tool that analyzes stock data. Therefore I am
going to download and store quite a vast amount of it. Just for a
general number - assuming there are about 7000 listed stocks on the two
major markets plus some extras, 255 tradying days a year for 20 years,
that is about 36 million entries.

Obviously a database is a logical choice for that. However I've never
used one, nor do I know what benefits I would get from using one. I am
worried about speed, memory usage, and disk space.


This is a typical use case for relational database systems.
With something like DB2 or Oracle here, you can take advantage
of more than 20 years of work by lots of developers trying to
solve the kind of problems you will run into.

You haven't really stated all the facts to decide what product
to choose though. Will this be a multi-user applications?
Do you forsee a client/server application? What operating
system(s) do you need to support?

With relational databases, it's plausible to move some of
the hard work in the data analysis into the server. Using
this well means that you need to learn a bit about how
relational databases work, but I think it's with the trouble.
It could mean that much less data ever needs to reach your
Python program for processing, and that will mean a lot for
your performance. Relational databases are very good at
searching, sorting and simple aggregations of data. SQL is
a declarative language, and in principle, your SQL code
will just declare the correct queries and manipulations that
you want to achieve, and tuning will be a separate activity,
which doesn't need to involve program changes. In reality,
there are certainly cases where changes in SQL code will
influence performance, but to a very large extent, you can
achieve good performance through building indices and by
letting the database gather statistics and analyze the
queries your programs contain. As a bonus, you also have
advanced systems for security, transactional safety, on-
line backup, replication etc.

You don't get these advantages with any other data storage
systems.

I'd get Chris Fehily's "SQL Visual Quickstart Guide", which
is as good as his Python book. As database, it depends a bit
on your platform you work with. I'd avoid MySQL. Some friends
of mine have used it for needs similar to yours, and they are
now running into its severe shortcomings. (I did warn them.)

For Windows, I think the single user version of SQL Server
(MSDE?) is gratis. For both Windows and Linux/Unix, there are
(I think) gratis versions of both Oracle 10g, IBM DB2 UDB and
Mimer SQL. Mimer SQL is easy to install, Oracle is a pain, and
I think DB2 is somewhere in between. PostgreSQL is also a good
option.

Either way, it certainly seems natural to learn relational
databases and SQL if you want to work with financial software.
Mar 20 '06 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
1695
by: Ken Fine | last post by:
I periodically receive a 5+ MB XML document that I hand-load into SQL Server using SQLXML running under a DTS process. Unfortunately, the document is human-created, and (very unfortunately) often has invalid elements, which breaks the bulk load. I've been managing this problem by loading the document into Visual Studio and using it to identify the offending line numbers, and fixing it by hand. Given my truly minimal coding skills when I...
9
2943
by: sk | last post by:
I have an applicaton in which I collect data for different parameters for a set of devices. The data are entered into a single table, each set of name, value pairs time-stamped and associated with a device. The definition of the table is as follows: CREATE TABLE devicedata ( device_id int NOT NULL REFERENCES devices(id), -- id in the device
3
10679
by: google | last post by:
I have a database with four table. In one of the tables, I use about five lookup fields to get populate their dropdown list. I have read that lookup fields are really bad and may cause problems that are hard to find. The main problem I am having right now is that I have a report that is sorted by one of these lookup fields and it only displays the record's ID number. When I add the source table to the query it makes several records...
3
1201
by: bardo | last post by:
Hello all, I want to make an application withs centers around 1 windowform. This form I want to give the same appearance as the Outlook. I have a listbox on the left-handside then a splitter. The rightside of the splitter I want to fill with different things. Now I would like have some advice on how what is the best approach to program the rightside. The main purpose of the program is entering data that will be stored in a
15
4654
by: Cheryl Langdon | last post by:
Hello everyone, This is my first attempt at getting help in this manner. Please forgive me if this is an inappropriate request. I suddenly find myself in urgent need of instruction on how to communicate with a MySQL database table on a web server, from inside of my company's Access-VBA application. I know VBA pretty well but have never before needed to do this HTTP/XML/MySQL type functions.
7
2241
by: Gladen Blackshield | last post by:
Hello All! Still very new to PHP and I was wondering about the easiest and simplest way to go about doing something for a project I am working on. I would simply like advice on what I'm asking so I can go and learn it myself through doing (best way for me). I am building a card game as a learning-project as it involves many (to most of the) things that I would like to learn to do with PHP.
12
7030
by: nyathancha | last post by:
Hi, I have a question regarding best practices in database design. In a relational database, is it wise/necessary to sometimes create tables that are not related to other tables through a foreign Key relationship or does this always indicate some sort of underlying design flaw. Something that requires a re evaluation of the problem domain? The reason I ask is because in our application, the user can perform x
6
1669
by: jason | last post by:
Hello, I have a question about what kind of datastructure to use. I'm reading collumn based data in the form of: 10\t12\t9\t11\n 24\t11\t4\t10\n ..... I now have a structure which allows me to access the data like this: x->row.coll.value.d;
6
2198
by: shapper | last post by:
Hello, I am creating a form that includes a few JQuery scripts and TinyMCE Editor: http://www.27lamps.com/Beta/Form/Form.html I am having a few problems with my CSS: 1. Restyling the Select
0
9808
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
10523
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10561
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10235
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
9346
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7766
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6966
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5804
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
3
3089
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.