
OLAP Proposal for MySQL

Hi all,

Please tell me if any of this makes sense. Any pointers to relevant
projects/articles will be much appreciated.

Philip Stoev
http://www.stoev.org/pivot/manifest.htm

===================================

OLAP PROPOSAL FOR MYSQL

The goal is to create an OLAP engine, coupled with a presentation layer, that is easy enough for ordinary people to use, with no MDX experience required. Wal-Mart may well have 70 GB of data, but most people do not have such data sets, so the goal is reasonable performance on reasonably sized datasets. Most people do not join 30 tables together either. Likewise, while Wal-Mart presumably engages in extra-complex calculations to determine business strategies, most people are content to know "How much did I sell yesterday?"

I. OLAP ENGINE AND CACHING

The OLAP "engine" takes a standard SQL query with GROUP BY clauses and aggregate functions, executes it, and saves the entire resulting dataset in the cache. A cache index entry is then created, recording the source tables, the GROUP BY columns, the aggregate functions, and the WHERE conditions that were used.

Upon execution of further queries, the OLAP engine checks the cache for a cached dataset that can answer the query immediately. A hit occurs in any of the following cases:

1. The query's GROUP BY columns are equal to, or a subset of, the cached query's.
So, a query like:
    SELECT salesman, state, SUM(sales) FROM company.sales GROUP BY salesman, state
provides the answer for:
    SELECT salesman, SUM(sales) FROM company.sales GROUP BY salesman

2. The query's WHERE clause is equal to, or more restrictive than, the WHERE clause of a cached query, and filters on columns that were GROUP BY-ed.
A query like:
    SELECT date, salesman, SUM(sales) FROM company.sales WHERE date > '2003-01-01' GROUP BY date, salesman
provides the answer for:
    SELECT date, salesman, SUM(sales) FROM company.sales WHERE date > '2003-01-01' AND date > '2003-06-01' GROUP BY date, salesman
Obviously, a human would not write a WHERE clause with such a redundant condition; however, a graphical pivot tool may be explicitly designed to emit such a query when drilling down, so that a cache hit is scored.

3. The query's source tables are equal to, or a subset of, the cached query's source tables.
So, the query:
    SELECT salesman, gender, SUM(sales) FROM company.sales INNER JOIN salesman USING (salesman_id) GROUP BY salesman, gender
or even something very complex with 10 joined tables, can be used to answer:
    SELECT salesman, SUM(sales) FROM company.sales GROUP BY salesman
or something still fairly complex with 5 of those joined tables.

4. The query's aggregate functions are equal to, or a subset of, the cached query's. Certain aggregate functions, like COUNT(DISTINCT ...), may not be cacheable, and others require special care (AVG(value) must be rewritten as SUM(value)/COUNT(value) so that it can be re-aggregated correctly).
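Taken together, the four rules reduce to set comparisons. The sketch below is a simplification under stated assumptions: WHERE clauses are treated as sets of atomic conditions, so rule 2's "more restrictive" test only recognizes the case where the extra condition is appended verbatim, and all names are invented:

```python
def can_answer(cached, query):
    """Return True when the cached result set can answer `query`.
    Both arguments are dicts of frozensets; purely illustrative."""
    return (
        query["group_by"] <= cached["group_by"]          # rule 1: GROUP BY subset
        and query["where"] >= cached["where"]            # rule 2: at least as restrictive
        and query["tables"] <= cached["tables"]          # rule 3: source tables subset
        and query["aggregates"] <= cached["aggregates"]  # rule 4: aggregates subset
    )

# The cached query and candidate query from rule 1's example:
cached = {"group_by": frozenset({"salesman", "state"}),
          "where": frozenset(),
          "tables": frozenset({"company.sales"}),
          "aggregates": frozenset({"SUM(sales)"})}
query = {"group_by": frozenset({"salesman"}),
         "where": frozenset(),
         "tables": frozenset({"company.sales"}),
         "aggregates": frozenset({"SUM(sales)"})}
```

Here `can_answer(cached, query)` is True, matching rule 1's example; grouping by a column the cache never grouped by would make it False.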

The benefit of such a cache implementation is that it is data-independent: you do not have to describe your data before executing your queries. It also does not rely on a custom cache structure or a custom cache index; a few tables can hold the cache index and can themselves be queried with SQL to determine a hit.

If an interactive pivoting tool is executing these queries, the cache should (hopefully) soon fill with entries that allow most, if not all, of the queries resulting from interactive browsing to be served from the cache. Additionally, the tool can pre-fetch relevant data by drilling down a bit further than the user has requested, producing a cache hit when the user indeed drills deeper. The tool also does not have to cache data itself just to sort it, since queries that differ only in their ORDER BY are served from the same cached result. A further enhancement would be the ability to serve a hit from the cache using more than one cached table.

Example:

A. No cache hit, so we just populate the cache
Initial query:
    SELECT salesman, state, COUNT(*) FROM sales GROUP BY salesman, state
The server does:
    CREATE TABLE `1234567` SELECT salesman, state, COUNT(*) FROM sales GROUP BY salesman, state
    SELECT * FROM `1234567`

B. A cache hit
Initial query:
    SELECT state, COUNT(*) FROM sales GROUP BY state
The server does:
    SELECT state, SUM(`COUNT(*)`) AS `COUNT(*)` FROM `1234567` GROUP BY state
[`COUNT(*)` being a valid column name for table `1234567`]
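The two steps can be demonstrated end to end. The snippet below uses SQLite rather than MySQL purely for portability, and the cache table name is invented; the point is that the per-(salesman, state) counts in the cache roll up to per-state counts by summing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (salesman TEXT, state TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("ann", "CA"), ("ann", "CA"), ("bob", "CA"), ("bob", "NV")])

# Step A: query A's result is materialized as the cache table.
con.execute("""CREATE TABLE cache_1234567 AS
               SELECT salesman, state, COUNT(*) AS cnt
               FROM sales GROUP BY salesman, state""")

# Step B: query B is answered from the cache by summing the partial counts.
from_cache = con.execute("""SELECT state, SUM(cnt) FROM cache_1234567
                            GROUP BY state ORDER BY state""").fetchall()
direct = con.execute("""SELECT state, COUNT(*) FROM sales
                        GROUP BY state ORDER BY state""").fetchall()
# Both queries yield CA=3, NV=1.
```

The equality of `from_cache` and `direct` is exactly the property the cache relies on: COUNT(*) is re-aggregable via SUM.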

II. DATA DESCRIPTION AND MANIPULATION

1. In my humble opinion, people do not think in MDX. Instead, they think in terms of GROUP BY. So, for most uses, it should be sufficient to let the user construct his own GROUP BY clause and specify the aggregate functions he is interested in, rather than asking him to create a cube, an axis, a view, a measure, and so on.

2. People also think in everyday phrases, like "last 7 days" or "all Mondays". A pre-compiled dictionary of such phrases would be immensely useful, as would the ability to define new ones. People also like to be able to ask for "call duration in 5-minute intervals", which is not available in Microsoft Excel when working with columns of type "time".
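A sketch of such a dictionary, mapping a couple of phrases to SQL fragments. The function names and phrase set are invented; the only MySQL fact assumed is that DAYOFWEEK() returns 2 for Monday:

```python
from datetime import date, timedelta

def phrase_to_where(phrase, column, today=None):
    """Translate an everyday phrase into a SQL WHERE fragment.
    A tiny illustrative dictionary, not a full parser."""
    today = today or date.today()
    if phrase == "last 7 days":
        start = today - timedelta(days=7)
        return f"{column} >= '{start.isoformat()}'"
    if phrase == "all Mondays":
        return f"DAYOFWEEK({column}) = 2"   # MySQL: 1 = Sunday, 2 = Monday
    raise ValueError(f"unknown phrase: {phrase}")

def five_minute_bucket(column):
    """Group a duration stored in seconds into 5-minute (300 s) intervals."""
    return f"FLOOR({column} / 300) * 300"
```

For example, `phrase_to_where("last 7 days", "date", today=date(2003, 6, 8))` yields `"date >= '2003-06-01'"`, and `five_minute_bucket("duration")` can be used directly as a GROUP BY expression.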

3. Normal people do not expect all of their columns to be available for
analysis, and they do not want their report to have either 2 or 2000 rows.

For example, if you have a date column and you build a Microsoft Excel PivotTable, you first have to select that column from a list containing a bunch of other fields, then wait for the table to be generated with a row for each date, and then group or sort the dates somehow to arrive at the numbers that interest you. Other tools (at least in their example scenarios), when facing a date column, will start with the data grouped by year; you then have to expand to months (often shown as numbers), and from there to weeks and days, and the table has to refresh and recalculate a dozen times for your convenience.

Instead, a person should have a list of phrases to use as rows and columns, like "last 7 days per day", "all months since January by week", etc. She will then be able to arrive precisely at the data that she wants to see. Only one SQL query will be required.

4. Data is not always perfect

If you store your data as 1 and 0, and your boss wants to see "yes" and "no", this should be possible. If sales > $5000 means a pro salesman, then the user should not have to display the raw sales number in a column, group on figures below $5000 and figures above $5000, and then separately account for the salesmen hired too recently to be able to score. Months and days of the week have names. Times of day may be morning, afternoon and evening, not (0..23:0..59:0..59). Times that are skewed by time zones can be adjusted on the fly, without jeopardizing the company software that relies on the data staying as it is.
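One way to sketch the 1/0-to-"yes"/"no" mapping is to generate a SQL CASE expression on the fly, leaving the stored data untouched. The helper name is invented and real code would need proper SQL quoting:

```python
def label_case(column, mapping, default="NULL"):
    """Build a SQL CASE expression mapping stored codes to display labels.
    Illustrative only; no SQL escaping is attempted."""
    arms = " ".join(f"WHEN {column} = {code!r} THEN {label!r}"
                    for code, label in mapping.items())
    return f"CASE {arms} ELSE {default} END"

yes_no = label_case("active", {1: "yes", 0: "no"})
# "CASE WHEN active = 1 THEN 'yes' WHEN active = 0 THEN 'no' ELSE NULL END"
```

The "pro salesman" bucket would be built the same way, just with a comparison instead of equality: CASE WHEN sales > 5000 THEN 'pro' ... END.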

III. PRESENTATION

A mod_perl GUI is envisioned that will allow you to view and rotate your data as you see fit. In particular, the following goals have been set:
1. Fully bookmarkable URLs that people can mail around to others so that they too can see the same report;
2. Use of the phrases described in Section II to make access to the most relevant portions of the report easier;
3. Sorting, drilling up and down, expanding, contracting, hiding, showing, axis-swapping, grouping and ungrouping, coloring, etc.;
4. Tabs instead of drop-down lists, e.g. a tab for January, a tab for February, etc.;
5. Access control, full logging, etc.;
6. Speed, speed, speed. Anything slower than Microsoft Excel on a comparable dataset should be optimized. Data may be queried (and retrieved) in portions to provide concurrency and instant feedback to the user. For example, if we have a table keyed by date, we can always retrieve January, show it to the user, and then proceed to retrieve the other months, displaying them as they arrive (which, as a side effect, may let other queries slip in between, improving performance for everyone, at least perceptually). Any query known to run long (based on the timing of previous invocations) should have a progress bar.
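The portion-by-portion retrieval in goal 6 can be sketched as a loop that hands each month's rows to the display layer as soon as they arrive. SQLite stands in for MySQL for portability, and the table schema and callback are invented:

```python
import sqlite3

def fetch_in_portions(con, months, render):
    """Query a date-keyed report one month at a time, handing each chunk
    to `render` immediately so the user sees early months without waiting
    for the whole result."""
    for month in months:
        rows = con.execute(
            "SELECT day, total FROM report WHERE month = ? ORDER BY day",
            (month,)).fetchall()
        render(month, rows)   # display now; later months follow

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE report (month TEXT, day INTEGER, total REAL)")
con.executemany("INSERT INTO report VALUES (?, ?, ?)",
                [("Jan", 1, 10.0), ("Jan", 2, 5.0), ("Feb", 1, 7.0)])

shown = []
fetch_in_portions(con, ["Jan", "Feb"], lambda m, rows: shown.append((m, rows)))
```

Because each month is a separate short query, other users' queries can interleave between portions, which is exactly the perceptual-performance side effect described above.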
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/my***********...ie.nctu.edu.tw

Jul 19 '05 #1