
Tricky schema question - dimension can split and combine over time

Hi all,

I'm working on the schema for a database that must represent data about stock
& bond funds over time. My conundrum is that, for any of several dimension
fields, including the fund name itself, the dimension may be represented in
different ways over time, and may split or combine from one period to the
next.

When querying from the database for an arbitrary time period, I need the data
to be rolled up to the smallest extent possible so that apples can be compared
to apples. For instance, if the North America region becomes 2 regions, USA
and Canada, and I query a time period that spans the period in which this
split occurred, I should roll up USA and Canada for the records in the months
that have both, and I should call the result something like
"(North America)(USA/Canada)" in the output. The client specifies that the
dimension output must represent all dimensions that went into the input.
Of course, I have to account for more complex possibilities as well, e.g.
Fund-A splits into Fund-B and Fund-C, then Fund-C merges into Fund-D producing
(Fund-A/Fund-D)(Fund-B/Fund-C/Fund-D)(Fund-B/Fund-D)

I can think of several ways to handle this issue, and they're all
extraordinarily complex and ugly. Any suggestions?

Thanks,

- Steve Jorgensen
Jul 20 '05 #1

"Steve Jorgensen" <no****@nospam. nospam> wrote in message
news:hp******** *************** *********@4ax.c om...
> Hi all,
>
> I'm working on the schema for a database that must represent data about
> stock & bond funds over time. My conundrum is that, for any of several
> dimension fields, including the fund name itself, the dimension may be
> represented in different ways over time, and may split or combine from
> one period to the next.


Apologies if this sounds obvious, but you seem to be combining two questions
into one.

When you are storing / processing your data (in a SQL database), you
will usually find it best to store data down to the finest possible level of
detail.

Beyond this, the data can be aggregated (using SQL Analysis Services
(OLAP)) across one or more dimensions.

---

There are two simple options that you can choose from ...

1 - Only use SQL
2 - Use SQL + OLAP
Common
======
Store the data in its most detailed form in tables in SQL
Create "Lookup-tables" for things like product-types, funds, etc.
- these lookup tables will allow you to define a hierarchical structure
for regions or funds
e.g. Reading is in Berkshire is in England is in the United Kingdom
is in Europe

(Post again to this newsgroup for any advice on creating such a schema)
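
As a very rough sketch, such a lookup table could be self-referencing, with
each row pointing at its parent. All names here are invented for
illustration, not from any spec:

-- Hierarchical region lookup: Reading -> Berkshire -> England ->
-- United Kingdom -> Europe is just a chain of parent_id links.
CREATE TABLE Region (
    region_id   int         NOT NULL PRIMARY KEY,
    region_name varchar(50) NOT NULL,
    parent_id   int         NULL REFERENCES Region(region_id)
);

INSERT INTO Region VALUES (1, 'Europe', NULL);
INSERT INTO Region VALUES (2, 'United Kingdom', 1);
INSERT INTO Region VALUES (3, 'England', 2);
INSERT INTO Region VALUES (4, 'Berkshire', 3);
INSERT INTO Region VALUES (5, 'Reading', 4);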

-------

Create a list of questions that you want the database to answer

e.g. What were the total sales figures per country for Fund X?

1 - Create a SQL View / Proc / Function that queries the data directly to
return the data that you want (a sketch follows below)

2 - Create dimensions in OLAP, and build one or more cubes to display this
(e.g. a fund dimension, region dimension, time dimension)
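
To illustrate option 1, here is a hypothetical view answering the sample
question above. It assumes a fact table sales(fund_id, region_id, amount),
a fund lookup table, and the Region table sketched earlier; none of these
names come from Steve's spec:

-- Total sales per country for each fund, assuming Region rows exist at
-- country level; climbing the parent_id hierarchy is left out for brevity.
CREATE VIEW fund_sales_by_country AS
SELECT f.fund_name,
       r.region_name AS country,
       SUM(s.amount) AS total_sales
FROM sales AS s
JOIN fund   AS f ON f.fund_id   = s.fund_id
JOIN Region AS r ON r.region_id = s.region_id
GROUP BY f.fund_name, r.region_name;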

....

This is only very brief, but hopefully it has given you some ideas of where
to start.

SQL Only
=======
Create views or stored procedures t

Steven
Jul 20 '05 #2
I guess I didn't make myself clear. The finest level of detail has a
different meaning from month to month, and there is a mapping between the
meanings. When querying across a range of months, I need a system that
automatically figures out what is the finest level of meaning that can be
used. Within the same request, there could be a region called North America
in some months, and 2 regions called USA and Canada in other months. I need
the data to automatically sum USA and Canada together for the months in which
that data occurs, because the other months do not contain that same level of
detail. I need the system to figure out exactly the finest level of detail
it can use and still map apples to apples for any given date range.

Across the same dimension, some dimension names may merge, and others may
split at various points within the timespan of a request. Perhaps Region is
not the best example. Another dimension is Market Sector. At one point, it
was decided that Technology would now be Telecom, Software, Electronics, etc.
There is no meaningful finest possible granularity that could be known in
advance, because the meaningful market sectors in the world are not known
until they arise.
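
To make that concrete, the user-entered mapping data could live in something
like the table below. This is only a sketch of one dimension type, and every
name is invented. A split produces several rows sharing one old member; a
merge produces several rows sharing one new member:

-- One row per old-member/new-member pair, effective from a given month.
CREATE TABLE SectorMap (
    old_sector      varchar(50) NOT NULL,  -- member valid before the change
    new_sector      varchar(50) NOT NULL,  -- member valid from effective_month on
    effective_month datetime    NOT NULL,  -- first month the new breakdown applies
    PRIMARY KEY (old_sector, new_sector, effective_month)
);

-- The Technology split described above:
INSERT INTO SectorMap VALUES ('Technology', 'Telecom', '20040201');
INSERT INTO SectorMap VALUES ('Technology', 'Software', '20040201');
INSERT INTO SectorMap VALUES ('Technology', 'Electronics', '20040201');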

Jul 20 '05 #3
Hi Steve,

The problem description provided is unclear to me. Here is my best
guess at what you're trying to do. I think you're trying to come up
with a schema for mutual funds. I assume the funds are made up of
individual stocks, but the fund's mix of stocks changes on a daily
basis.

I would create:
a) a fund table, which simply identifies the different funds
b) a stock table, which simply identifies the different underlying
stocks
c) a fund2stock table; for a given fund and date, it has a row for
every underlying stock and the amount or proportion it is of the fund.
For example, suppose a unit of the "Fast Food Fund" consists of 0.5
McDonald's stock, 0.25 BurgerKing stock, and 0.25 Church's Chicken
stock. The fund2stock table will have 3 rows for the Fast Food Fund
for that particular date and the amount(s) will be 0.5, 0.25 & 0.25.
d) a stock performance table, which has a row for every stock and
date. The rows have a price column.
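
As a rough sketch (column names and types are my guesses), the four tables
might look like this:

-- a) and b): identity tables for funds and stocks
CREATE TABLE fund  (fund_id  int NOT NULL PRIMARY KEY, fund_name  varchar(50) NOT NULL);
CREATE TABLE stock (stock_id int NOT NULL PRIMARY KEY, stock_name varchar(50) NOT NULL);

-- c) fund composition: one row per fund, date, and underlying stock
CREATE TABLE fund2stock (
    fund   int          NOT NULL REFERENCES fund(fund_id),
    [date] datetime     NOT NULL,
    stock  int          NOT NULL REFERENCES stock(stock_id),
    amount decimal(9,4) NOT NULL,  -- proportion of one fund unit, e.g. 0.25
    PRIMARY KEY (fund, [date], stock)
);

-- d) one price per stock per date
CREATE TABLE stockPerformance (
    stock  int      NOT NULL REFERENCES stock(stock_id),
    [date] datetime NOT NULL,
    price  money    NOT NULL,
    PRIMARY KEY (stock, [date])
);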

I would then create a stored procedure that accepts as parameters a
fund and a date range. This "fundMixSp" procedure's purpose is to return the
underlying stocks; it returns rows containing: date, stock, & amount.
Call fundMixSp using an INSERT EXEC statement and join the result
against the stock performance table as needed.

For example, to calculate the average price of a unit of the "Fast Food
Fund" from 1Dec03 to 15Dec03, the code would look like this:

create table #t(...)
insert into #t
exec fundMixSp @fund='fast food fund', @startdate='2003/12/1',
  @enddate='2003/12/15'

-- average price of one fund unit across the stock-day rows
select sum(a.amount*b.price) / count(*) as avgPrice
from #t as a
join stockPerformance as b
  on a.stock = b.stock and a.date = b.date
Anyways, good luck -- Louis
Jul 20 '05 #4
On 3 Jan 2004 23:46:38 -0800, lo************@hotmail.com (louis nguyen) wrote:
> Hi Steve,
>
> The problem description provided is unclear to me. Here is my best
> guess at what you're trying to do. I think you're trying to come up
> with a schema for mutual funds. I assume the funds are made up of
> individual stocks, but the fund's mix of stocks changes on a daily
> basis.

Well, it could be that my explanation is to blame, but I obviously haven't
communicated the problem well enough yet, so - trying again...

Yes, I'm talking about funds, but the input data is already rolled up. I
don't have information about the stocks comprising the funds or information
on the criteria for categorizing the stocks. Furthermore, it's not just that
stocks can change categorizations; it's that the categorizations that are
meaningful may actually change over time, and these categorizations may
simply appear, simply vanish, or be mergers or splits of prior
categorizations.

Changes in categorization breakdowns affect the entire data set when they
occur. So, for instance, in one month the data may be broken down by industry
sector including a sector called Technology, and the next month, what was
previously represented collectively as Technology is now broken down into
Telecom, Software, etc. The application users will be responsible for
entering data describing how the old breakdown maps to the new breakdown, so
this information will be available to the application.

When the user wants to query data from before the change, it should simply
show the earlier breakdown. When querying a period after the change, it
should show the later breakdown. The hard part is that when a query spans the
period in which the split occurred, the newer data should be summed together
so it can meaningfully coexist with the older data. In this example, the
sector shown for output would have to be called
"(Technology)(Telecom/Software/...)". Other sectors that didn't change during
the query time span would show as normal.

Jul 20 '05 #5
I have worked on a very similar problem before.

I think you really need to have TWO schemas. One to store the raw data
that you receive from your source. The second one to store the
sanitised data that gets reclassified by your users (or perhaps
automatically).

The second schema is the one that is then used for the bulk of your
data analysis. This should be reasonably straightforward. The hardest
part is the reclassifying that needs to be done on new raw data or on
data that has already been classified. In certain circumstances, it
may be easier to reclassify everything again. This model gives you the
flexibility to do that.
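
A rough sketch of that layout (every name here is made up for illustration):

-- Schema 1: raw figures exactly as received, never modified.
CREATE TABLE raw_monthly_figures (
    period       datetime    NOT NULL,  -- month the figure applies to
    sector_label varchar(50) NOT NULL,  -- label as supplied by the source
    amount       money       NOT NULL,
    PRIMARY KEY (period, sector_label)
);

-- Schema 2: the sanitised data, keyed to a user-maintained sector list.
CREATE TABLE clean_sector (
    sector_id   int         NOT NULL PRIMARY KEY,
    sector_name varchar(50) NOT NULL
);

CREATE TABLE clean_monthly_figures (
    period    datetime NOT NULL,
    sector_id int      NOT NULL REFERENCES clean_sector(sector_id),
    amount    money    NOT NULL,
    PRIMARY KEY (period, sector_id)
);

-- User-maintained reclassification: raw label -> clean sector.  Re-running
-- the load against this table reclassifies everything from scratch.
CREATE TABLE sector_reclass (
    sector_label varchar(50) NOT NULL PRIMARY KEY,
    sector_id    int         NOT NULL REFERENCES clean_sector(sector_id)
);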

If you try to do the analysis on the raw data then this could be very
complex and become a huge maintenance issue.

Please let me know what you think of this solution and how it relates
to your problem domain, as I may have misunderstood your issue.
Jul 20 '05 #6
> Furthermore, it's not just that
> stocks can change categorizations; it's that the categorizations that are
> meaningful may actually change over time, and these categorizations may
> simply appear, simply vanish, or be mergers or splits of prior
> categorizations.

Hi Steve,

I don't see how this is workable w/o the raw data. For example,
suppose in Jan you have 3 sectors: Telecom, Software & Hardware. In
Feb, instead of 3 numbers you only receive 1 number for the combined
sector Technology. I don't see how you can map 1 number into 3
different numbers. If you only have access to the rolled-up data, I
would tell your clients that it's not really programmable. They'd be
better off letting some analyst keep track of it in Excel. I ran into a
similar situation once. Out in the field, a worker would count the
number of parcels by type. The application took the counts and
multiplied them against a pre-determined conversion factor. (For this
parcel type, an average parcel weighs this much.) The app then summed
up the weights and stored it as a single number. I was then asked to
magically break out the single number again. Geeeeeeezzzz.
Jul 20 '05 #7
On 5 Jan 2004 10:29:41 -0800, lo************@hotmail.com (louis nguyen) wrote:
> > Furthermore, it's not just that
> > stocks can change categorizations; it's that the categorizations that
> > are meaningful may actually change over time, and these categorizations
> > may simply appear, simply vanish, or be mergers or splits of prior
> > categorizations.
>
> Hi Steve,
>
> I don't see how this is workable w/o the raw data. For example,
> suppose in Jan you have 3 sectors: Telecom, Software & Hardware. In
> Feb, instead of 3 numbers you only receive 1 number for the combined
> sector Technology. I don't see how you can map 1 number into 3
> different numbers. If you only have access to the rolled-up data, I
> would tell your clients that it's not really programmable.


The client has spec'd that out. I simply need to roll all those values into
one for the data in months when all 3 exist, so that the rolled-up data
becomes orthogonal to the non-rolled-up data. The trick is that this needs
to happen if and only if the remapping occurs somewhere within the time
period being analyzed.
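
The "if and only if" test could be expressed against a mapping table like
the SectorMap sketched earlier in the thread. Again, all the names are
hypothetical, and this is only the detection step, not the label building:

-- Members that must share one output bucket for the window @start..@end:
-- exactly those involved in a remapping that takes effect inside it.
DECLARE @start datetime, @end datetime
SET @start = '20040101'
SET @end   = '20040601'

SELECT DISTINCT m.old_sector, m.new_sector
FROM SectorMap AS m
WHERE m.effective_month >  @start  -- change happened after the window opened
  AND m.effective_month <= @end    -- ...and no later than its close
-- Sectors absent from this result keep their normal names in the output.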

Jul 20 '05 #8
On 5 Jan 2004 05:36:03 -0800, Pr************@hotmail.com (Mystery Man) wrote:
> Please let me know what you think of this solution and how it relates
> to your problem domain, as I may have misunderstood your issue.

I think you understand the question. The solution I've decided to implement,
if no one knows of a better one, is to first build a set of
interval-dimension records for a specified interval, with a 1-M relationship
from interval-dimension records to simple dimension records, then use the
interval-dimension records as the dimensions for any criteria and output.

I think this will work, but it means you have to explicitly prepare the
mapping records for a given interval before you can do any analysis of data
in that interval. It also means a very large schema. Every dimension type
also has a translation node table, a translation item table, an interval
dimension table, and an interval dimension/dimension table. That's 4 extra
tables for every single dimension type.
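
For one dimension type, say Sector, my reading of that four-table layout
would be roughly the following. Table and column names are only a sketch,
and a simple Sector member table is assumed:

CREATE TABLE Sector (
    sector_id   int         NOT NULL PRIMARY KEY,
    sector_name varchar(50) NOT NULL
);

-- Translation node: one remapping event (a split or merge) for this type.
CREATE TABLE SectorTransNode (
    node_id         int      NOT NULL PRIMARY KEY,
    effective_month datetime NOT NULL
);

-- Translation item: the old/new member pairs belonging to a node.
CREATE TABLE SectorTransItem (
    node_id    int NOT NULL REFERENCES SectorTransNode(node_id),
    old_sector int NOT NULL REFERENCES Sector(sector_id),
    new_sector int NOT NULL REFERENCES Sector(sector_id),
    PRIMARY KEY (node_id, old_sector, new_sector)
);

-- Interval dimension: one output bucket per prepared interval.
CREATE TABLE SectorInterval (
    interval_id  int          NOT NULL,
    bucket_id    int          NOT NULL,
    bucket_label varchar(200) NOT NULL,  -- e.g. '(Technology)(Telecom/Software)'
    PRIMARY KEY (interval_id, bucket_id)
);

-- Interval dimension/dimension: which simple members roll into each bucket.
CREATE TABLE SectorIntervalMember (
    interval_id int NOT NULL,
    bucket_id   int NOT NULL,
    sector_id   int NOT NULL REFERENCES Sector(sector_id),
    PRIMARY KEY (interval_id, bucket_id, sector_id),
    FOREIGN KEY (interval_id, bucket_id)
        REFERENCES SectorInterval(interval_id, bucket_id)
);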
Jul 20 '05 #9
Hi Steve,

Based on what you described, I think your app is better suited to an
OO application than a database. In the OO world, because everything is
loaded in memory, you can evaluate things like an attribute of an
attribute of an attribute of a key, and all the attributes are get/set
at run-time.
Jul 20 '05 #10

This thread has been closed and replies have been disabled. Please start a new discussion.
