
Tricky schema question - dimension can split and combine over time

Hi all,

I'm working on the schema for a database that must represent data about stock
& bond funds over time. My conundrum is that, for any of several dimension
fields, including the fund name itself, the dimension may be represented in
different ways over time, and may split or combine from one period to the
next.

When querying from the database for an arbitrary time period, I need the data
to be rolled up to the smallest extent possible so that apples can be compared
to apples. For instance, if the North America region becomes 2 regions, USA
and Canada, and I query a time period that spans the period in which this
split occurred, I should roll up USA and Canada for the records in the period
that has both, and I should call the result something like "(North
America)(USA/Canada)" in the output. The client specifies that the dimension
output must represent all dimensions that went into the input.
Of course, I have to account for more complex possibilities as well, e.g.
Fund-A splits into Fund-B and Fund-C, then Fund-C merges into Fund-D producing
(Fund-A/Fund-D)(Fund-B/Fund-C/Fund-D)(Fund-B/Fund-D)
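
To make the requirement concrete, here's a rough sketch (Python, with invented names, and simplified to whole-name mappings even though real mappings could be partial) of the kind of roll-up I'm describing: treat each user-entered mapping as an edge, union the connected names for any remapping that occurs inside the query window, then label each group with one parenthesised term per distinct representation:

```python
from collections import defaultdict

class DSU:
    """Tiny union-find over dimension names."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def rollup_labels(mappings, names_by_period, start, end):
    """mappings: (period, old_name, new_name) rows saying old_name's data
    maps into new_name as of the given period.
    names_by_period: {period: set of names in effect that period}.
    Returns {label: set of member names} for the window [start, end]."""
    dsu = DSU()
    for period, old, new in mappings:
        if start < period <= end:   # only remappings inside the window force a roll-up
            dsu.union(old, new)
    groups = defaultdict(set)
    for period in range(start, end + 1):
        for name in names_by_period.get(period, ()):
            groups[dsu.find(name)].add((period, name))
    labels = {}
    for members in groups.values():
        by_period = defaultdict(set)
        for period, name in members:
            by_period[period].add(name)
        seen, terms = set(), []
        for period in sorted(by_period):   # one term per distinct representation
            term = "/".join(sorted(by_period[period]))
            if term not in seen:
                seen.add(term)
                terms.append(term)
        labels["".join(f"({t})" for t in terms)] = {n for _, n in members}
    return labels

mappings = [(2, "North America", "USA"), (2, "North America", "Canada")]
names = {1: {"North America", "Europe"}, 2: {"USA", "Canada", "Europe"}}
print(sorted(rollup_labels(mappings, names, 1, 2)))
# ['(Europe)', '(North America)(Canada/USA)']
```

A window that doesn't span a remapping leaves those names unmerged, which is the behaviour I need.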

I can think of several ways to handle this issue, and they're all
extraordinarily complex and ugly. Any suggestions?

Thanks,

- Steve Jorgensen
Jul 20 '05 #1
9 Replies



"Steve Jorgensen" <no****@nospam.nospam> wrote in message
news:hp********************************@4ax.com...
> Hi all,
>
> I'm working on the schema for a database that must represent data about
> stock & bond funds over time. My conundrum is that, for any of several
> dimension fields, including the fund name itself, the dimension may be
> represented in different ways over time, and may split or combine from
> one period to the next.


Apologies if this sounds obvious, but you seem to be combining two questions
into one.

When you are storing / processing your data (under a SQL database), you
will usually find it best to store data down to the finest possible level of
detail.

Beyond this, the data can be aggregated (using SQL Analysis Services
(OLAP)) into one or more dimensions.

---

There are two simple options that you can choose from ...

1 - Only use SQL
2 - Use SQL + OLAP
Common
======
Store the data in its most detailed form in tables in SQL
Create "Lookup-tables" for things like product-types, funds, etc.
- these lookup tables will allow you to define a hierarchical structure
for regions or funds
e.g. Reading is in Berkshire is in England is in the United Kingdom
is in Europe

(Post again to this newsgroup for any advice on creating such a schema)
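
As a rough illustration of such a lookup table (shown here in Python/SQLite rather than SQL Server, with invented table and column names), a single self-referencing table can hold the whole hierarchy and be walked with a recursive query (in SQL Server you could use a self-join per level instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# A single self-referencing lookup table holds the whole hierarchy.
cur.executescript("""
CREATE TABLE region (
    region_id INTEGER PRIMARY KEY,
    name      TEXT,
    parent_id INTEGER REFERENCES region(region_id)
);
INSERT INTO region VALUES
    (1, 'Europe', NULL),
    (2, 'United Kingdom', 1),
    (3, 'England', 2),
    (4, 'Berkshire', 3),
    (5, 'Reading', 4);
""")
# Walk from leaf to root with a recursive query.
cur.execute("""
WITH RECURSIVE chain(region_id, name, parent_id) AS (
    SELECT region_id, name, parent_id FROM region WHERE name = 'Reading'
    UNION ALL
    SELECT r.region_id, r.name, r.parent_id
    FROM region r JOIN chain c ON r.region_id = c.parent_id
)
SELECT name FROM chain
""")
chain_names = [row[0] for row in cur.fetchall()]
print(chain_names)  # ['Reading', 'Berkshire', 'England', 'United Kingdom', 'Europe']
```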

-------

Create a list of questions that you want the database to answer

e.g. What were the total sales figures per country for Fund X?

1 - Create a SQL View / Proc / Function that queries the data directly to
return the data that you want

2 - Create dimensions in OLAP, and build one or more cubes to display this
(e.g. a fund dimension, region dimension, time dimension)

....

This is only very brief, but hopefully it has given you some ideas of where
to start.

Steven
SQL Only
=======
Create views or stored procedures that query and aggregate the detail tables directly.
Jul 20 '05 #2

I guess I didn't make myself clear. The finest level of detail has a
different meaning from month to month, and there is a mapping between the
meanings. When querying across a range of months, I need a system that
automatically figures out what is the finest level of meaning that can be
used. Within the same request, there could be a region called North America
in some months, and 2 regions called USA and Canada in other months. I need
to have the data automatically sum USA and Canada together for the months in
which that data occurs because the other months do not contain that same level
of detail. I need the system to figure out exactly the finest level of detail
it can use and still map apples to apples for any given date range.

Across the same dimension, some dimension names may merge, and others may
split at various points within the timespan of a request. Perhaps Region is
not the best example. Another dimension is Market Sector. At one point, it
was decided that Technology would now be Telecom, Software, Electronics, etc.
There is no meaningful finest possible granularity that could be known in
advance because the meaningful market sectors in the world are not known until
they arise.

On Sat, 3 Jan 2004 12:58:42 -0000, "Steven Wilmot" <st*********@wilmot.me.uk>
wrote:

"Steve Jorgensen" <no****@nospam.nospam> wrote in message
news:hp********************************@4ax.com.. .
Hi all,

I'm working on the schema for a database that must represent data about

stock
& bond funds over time. My connundrum is that, for any of several

dimension
fields, including the fund name itself, the dimension may be represented

in
different ways over time, and may split or combine from one period to the
next.


Apologies if this sounds obvious, but you seem to be combining two questions
into one.

When you are storing / processing your data (under a SQL database) , you
will usually find it best to store data down to the finest possible level of
detail.

Beyond this, the data can be aggregated (using SQL Analysis Services (
OLAP ) ) into one or more dimensions.

---

There are two simple options that you can choose from ...

1 - Only use SQL
2 - Use SQL + OLAP
Common
======
Store the data in its most detailed form in tables in SQL
Create "Lookup-tables" for things like product-types, funds, etc.
- these lookup tables will allow you to define a hierarchical structure
for regions or funds
e.g. Reading is in Berkshire is in England is in the United Kingdom
is in Europe

(Post again to this newsgroup for any advice on creating such a schema)

-------

Create a list of questions that you want the database to answer

e.g. What was the total sales figures per country for Fund X

1 - Create a SQL View / Proc / Function that queries the data directly to
return the data that you want

2 - Create dimensions in OLAP, and build one or more cubes to display this
(e.g. a fund dimension, region dimension, time dimension)

...

This is only very brief, but hopefully it has given you some ideas of where
to start.

Steven
SQL Only
=======
Create views or stored procedures t


Jul 20 '05 #3

Hi Steve,

The problem description provided is unclear to me. Here is my best
guess at what you're trying to do. I think you're trying to come up
with a schema for mutual funds. I assume the funds are made up of
individual stocks. But the fund's mix of stocks changes on a daily
basis.

I would create:
a) a fund table, which simply identifies the different funds
b) a stock table, which simply identifies the different underlying
stocks
c) a fund2stock table; for a given fund and date, it has a row for
every underlying stock and the amount or proportion it is of the fund.
For example, suppose a unit of the "Fast Food Fund" consists of 0.5
McDonald's stock, 0.25 BurgerKing stock, and 0.25 Church's Chicken
stock. The fund2stock table will have 3 rows for the Fast Food Fund
for that particular date and the amount(s) will be 0.5, 0.25 & 0.25.
d) a stock performance table, which has a row for every stock and
date. The rows have a price column.

I would then create a stored procedure that accepts as parameters:
fund and date range. The purpose of this "fundMixSp" is to return the
underlying stocks. It returns rows containing: date, stock, & amount.
Call this fundMixSp using an INSERT EXEC statement and join it
against the stock performance table as needed.

For example, to calculate the average price of a unit of the "Fast Food
Fund" from 1Dec03 to 15Dec03, the code would look like this:
create table #t(...)
insert into #t
exec fundMixSp @fund='fast food fund', @startdate='2003/12/1',
@enddate='2003/12/15'

select sum(a.amount*b.price) / count(distinct b.date) as avgPrice
from #T as a
join stockPerformance as b
on a.stock=b.stock and a.date=b.date
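
Here is a runnable toy version of that join (Python/SQLite, with made-up prices, and the fundMixSp output stubbed as a plain table), dividing by the number of distinct dates so the result is an average daily unit price:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE fundMix (date TEXT, stock TEXT, amount REAL);  -- stands in for fundMixSp's output
CREATE TABLE stockPerformance (stock TEXT, date TEXT, price REAL);
""")
cur.executemany("INSERT INTO fundMix VALUES (?,?,?)",
                [("2003-12-01", "McDonald's", 0.5),
                 ("2003-12-01", "BurgerKing", 0.25),
                 ("2003-12-01", "Church's Chicken", 0.25)])
cur.executemany("INSERT INTO stockPerformance VALUES (?,?,?)",
                [("McDonald's", "2003-12-01", 20.0),
                 ("BurgerKing", "2003-12-01", 8.0),
                 ("Church's Chicken", "2003-12-01", 4.0)])
# Sum of weighted prices over the window, divided by the number of days,
# gives the average daily unit price.
cur.execute("""
SELECT SUM(a.amount * b.price) * 1.0 / COUNT(DISTINCT a.date) AS avgPrice
FROM fundMix AS a
JOIN stockPerformance AS b
  ON a.stock = b.stock AND a.date = b.date
""")
avg_price = cur.fetchone()[0]
print(avg_price)  # 0.5*20 + 0.25*8 + 0.25*4 = 13.0
```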

Anyways, good luck -- Louis
Jul 20 '05 #4

On 3 Jan 2004 23:46:38 -0800, lo************@hotmail.com (louis nguyen) wrote:
> Hi Steve,
>
> The problem description provided is unclear to me. Here is my best
> guess at what you're trying to do. I think you're trying to come up
> with a schema for mutual funds. I assume the funds are made up of
> individual stocks. But the fund's mix of stocks changes on a daily
> basis.
Well, it could be that my explanation is to blame, but I obviously haven't
communicated the problem well enough yet, so - trying again...

Yes, I'm talking about funds, but the input data is already rolled up. I
don't have information about the stocks comprising the funds or information on
the criteria for categorizing the stocks. Furthermore, it's not just that
stocks can change categorizations, it's that the categorizations that are
meaningful may actually change over time, and these categorizations may simply
appear, simply vanish, or be mergers or splits of prior categorizations.

Changes in categorization breakdowns affect the entire data set when they
occur, so for instance, in one month, the data may be broken down by industry
sector including a sector called Technology, and the next month, what was
previously represented collectively as Technology is now broken down into
Telecom, Software, etc. The application users will be responsible for
entering data describing how the old breakdown maps to the new breakdown, so
this information will be available to the application.

When the user wants to query data from before the change, it should simply
show the earlier breakdown. When querying a period after the change, it
should show the later breakdown. The hard part is that when a query spans the
period in which the split occurred, the newer data should be summed together
so it can meaningfully coexist with the older data. In this example, the
sector shown for output would have to be called
"(Technology)(Telecom/Software/...)" Other sectors that didn't change during
the query time span would show as normal.

[quoted text snipped]
Jul 20 '05 #5

I have worked on a very similar problem before.

I think you really need to have TWO schemas. One to store the raw data
that you receive from your source. The second one to store the
sanitised data that gets reclassified by your users (or perhaps
automatically).

The second schema is the one that is then used for the bulk of your
data analysis. This should be reasonably straightforward. The hardest
part is the reclassifying that needs to be done on new raw data or on
data that has already been classified. In certain circumstances, it
may be easier to reclassify all again. This model gives you the
flexibility to do that.

If you try to do the analysis on the raw data then this could be very
complex and become a huge maintenance issue.

Please let me know what you think of this solution and how it relates
to your problem domain, as I may have misunderstood your issue.
Jul 20 '05 #6

> Furthermore, it's not just that
> stocks can change categorizations, it's that the categorizations that are
> meaningful may actually change over time, and these categorizations may simply
> appear, simply vanish, or be mergers or splits of prior categorizations.

Hi Steve,

I don't see how this is workable w/o the raw data. For example,
suppose in Jan you have 3 sectors: Telecom, Software & Hardware. In
Feb, instead of 3 numbers you only receive 1 number for the combined
sector Technology. I don't see how you can map 1 number into 3
different numbers. If you only have access to the rolled up data, I
would tell your clients that it's not really programmable. They'd be
better off letting some analyst keep track of it in Excel. I ran into a
similar situation once. Out in the field, a worker would count the
number of parcels by type. The application took the counts and
multiplied them against a pre-determined conversion factor. (For this
parcel type, an average parcel weighs this much). The app then summed
up the weights and stored it as a single number. I was then asked to
magically break out the single number again. Geeeeeeezzzz.
Jul 20 '05 #7

On 5 Jan 2004 10:29:41 -0800, lo************@hotmail.com (louis nguyen) wrote:
>> Furthermore, it's not just that stocks can change categorizations, it's
>> that the categorizations that are meaningful may actually change over
>> time, and these categorizations may simply appear, simply vanish, or be
>> mergers or splits of prior categorizations.
>
> Hi Steve,
>
> I don't see how this is workable w/o the raw data. For example,
> suppose in Jan you have 3 sectors: Telecom, Software & Hardware. In
> Feb, instead of 3 numbers you only receive 1 number for the combined
> sector Technology. I don't see how you can map 1 number into 3
> different numbers. If you only have access to the rolled up data, I


The client has spec'd that out. I simply need to roll all those values into
one for the data in months when all 3 exist, so that the rolled up data
becomes orthogonal to the non-rolled up data. The trick is that this needs to
happen if and only if the remapping occurs somewhere within the time period
being analyzed.

Jul 20 '05 #8

On 5 Jan 2004 05:36:03 -0800, Pr************@hotmail.com (Mystery Man) wrote:
[quoted text snipped]


I think you understand the question. The solution I've decided to implement,
if no one knows of a better one, is to first build a set of interval-dimension
records for a specified interval, with a 1-M relationship from
interval-dimension records to simple dimension records, then use the
interval-dimension records as the dimensions for any criteria and output.

I think this will work, but it means you have to explicitly prepare the
mapping records for a given interval before you can do any analysis of data in
that interval. It also means a very large schema. Every dimension type also
has a translation node table, a translation item table, an interval dimension
table, and an interval dimension/dimension table. That's 4 extra tables for
every single dimension type.
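
For concreteness, here's a toy version of the interval-dimension idea (Python/SQLite; the table and column names are just placeholders, and the interval-dimension rows are prepared by hand here rather than generated):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE sector (sector_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE interval_dim (idim_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE interval_dim_member (idim_id INTEGER, sector_id INTEGER);
CREATE TABLE fact (month TEXT, sector_id INTEGER, value REAL);
""")
cur.executemany("INSERT INTO sector VALUES (?,?)",
                [(1, "Technology"), (2, "Telecom"), (3, "Software")])
# For a window spanning the split, the three sectors collapse into one
# interval-dimension row, prepared before any analysis runs.
cur.execute("INSERT INTO interval_dim VALUES (1, '(Technology)(Telecom/Software)')")
cur.executemany("INSERT INTO interval_dim_member VALUES (?,?)",
                [(1, 1), (1, 2), (1, 3)])
cur.executemany("INSERT INTO fact VALUES (?,?,?)",
                [("2004-01", 1, 100.0),   # before the split
                 ("2004-02", 2, 60.0),    # after the split
                 ("2004-02", 3, 55.0)])
# All criteria and output then group by the interval dimension.
cur.execute("""
SELECT d.label, SUM(f.value)
FROM fact AS f
JOIN interval_dim_member AS m ON m.sector_id = f.sector_id
JOIN interval_dim AS d ON d.idim_id = m.idim_id
GROUP BY d.label
""")
rows = cur.fetchall()
print(rows)  # [('(Technology)(Telecom/Software)', 215.0)]
```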
Jul 20 '05 #9

Hi Steve,

Based on what you described, I think your app is better suited for an
OO application than a database. In the OO world because everything is
loaded in memory, you can evaluate things like an attribute of an
attribute of an attribute of a key, and all the attributes are get/set
at run-time.
Jul 20 '05 #10
