Hi,
I'm working on a pivot table. I would like to write it in Python. I
know I should be doing it in C, but I would like to create a
cross-platform version that can deal with smaller databases (no more
than a million facts).
The data is first imported from a CSV file: the user selects which
columns contain dimension and measure data (and which columns to
ignore). In the next step I would like to build up a database that is
efficient enough to be used for making pivot tables. Here is my idea for
the database:
Original CSV file with column header and values:
"Color","Year","Make","Price","VMax"
Yellow,2000,Ferrari,100000,254
Blue,2003,Volvo,50000,210
Using the GUI, it is converted to this:
dimensions = [
    { 'name': 'Color', 'colindex': 0,
      'values': [ 'Red', 'Blue', 'Green', 'Yellow' ] },
    { 'name': 'Year', 'colindex': 1,
      'values': [ 1995, 1999, 2000, 2001, 2002, 2003, 2007 ] },
    { 'name': 'Make', 'colindex': 2,
      'values': [ 'Ferrari', 'Volvo', 'Ford', 'Lamborghini' ] },
]
measures = [
    { 'name': 'Price', 'colindex': 3 },
    { 'name': 'Vmax', 'colindex': 4 },
]
facts = [
    ( (3,2,0), (100000.0,254.0) ),  # ( dimension_value_indexes, measure_values )
    ( (1,5,1), (50000.0,210.0) ),
    ....  # some million rows or less
]
The core of the idea is that, because each dimension has only a small
number of possible values, the facts table becomes significantly
smaller and easier to process. (Processing the facts means: iterate
over them, filter some out, and compute statistics on the measures,
grouped by dimensions.)
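To make this concrete, here is a minimal sketch of the encoding and of
a group-by-sum pass. It assumes the dimensions and measures structures
above, and that each dimension's 'values' list holds the values in the
same string form the csv module yields them (or that both sides are
converted consistently); encode_row, group_sums, and the file name are
illustrative, not an existing API.

import csv
from collections import defaultdict

# Value -> index lookup table for each dimension, built once.
dim_lookup = [
    dict((v, i) for i, v in enumerate(d['values']))
    for d in dimensions
]

def encode_row(row):
    # Turn one raw CSV row into (dimension_value_indexes, measure_values).
    dims = tuple(dim_lookup[k][row[d['colindex']]]
                 for k, d in enumerate(dimensions))
    meas = tuple(float(row[m['colindex']]) for m in measures)
    return dims, meas

def group_sums(facts, group_by):
    # Sum all measures and count facts, grouped by the given dimension
    # positions, e.g. group_by=(0,) groups by Color.
    totals = defaultdict(lambda: [0] + [0.0] * len(measures))
    for dims, meas in facts:
        key = tuple(dims[g] for g in group_by)
        bucket = totals[key]
        bucket[0] += 1                    # fact count
        for i, v in enumerate(meas):      # running sum per measure
            bucket[i + 1] += v
    return totals

with open('cars.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)                          # skip the header row
    facts = [encode_row(row) for row in reader]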
The facts table cannot be kept in memory because it is too big. I need
to store it on disk, read it incrementally, and compute statistics. In
most cases, the "statistic" will be a simple sum of the measures plus a
count of the facts affected. To be efficient, reading the facts from
disk should not involve complex conversions; for this reason, storing
them in CSV, XML, or any other textual format would be bad. I'm
thinking about a binary format, but how can I interface that with
Python?
I already looked at:
- xdrlib, which throws a DeprecationWarning when I store certain integers
- struct, which uses a format string for each read operation; I'm
concerned about its speed
What else can I use?
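On the struct speed concern: the format string can be compiled once
with struct.Struct, so repeated reads avoid re-parsing it on every
call. A minimal sketch, assuming a fixed record layout of three
unsigned-short dimension indexes plus two double measures; the layout,
file handling, and helper names are assumptions for illustration:

import struct

# '<HHHdd' = 3 uint16 dimension indexes + 2 float64 measures,
# little-endian, 22 bytes per fact; parsed once, reused for every record.
FACT = struct.Struct('<HHHdd')

def write_facts(path, facts):
    with open(path, 'wb') as f:
        for (color, year, make), (price, vmax) in facts:
            f.write(FACT.pack(color, year, make, price, vmax))

def read_facts(path):
    # Incremental read: yields one fact at a time instead of
    # loading the whole file into memory.
    with open(path, 'rb') as f:
        while True:
            rec = f.read(FACT.size)
            if len(rec) < FACT.size:
                break
            color, year, make, price, vmax = FACT.unpack(rec)
            yield (color, year, make), (price, vmax)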
Thanks,
Laszlo
On Aug 7, 1:41 pm, Laszlo Nagy <gand...@shopzeus.com> wrote:
[snip]
Take a look at the mmap module. It gives you direct memory access,
backed by the file system. struct + mmap might do it, if you keep your
strings small.
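A rough sketch of that suggestion, reusing the same assumed 22-byte
record layout (three uint16 dimension indexes plus two float64
measures); the file name and function are illustrative:

import mmap
import struct

FACT = struct.Struct('<HHHdd')   # assumed fixed-size record layout

def sum_first_measure(path):
    # The OS pages the file in on demand; unpack_from() decodes records
    # in place instead of copying each one into a separate bytes object.
    total, count = 0.0, 0
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for i in range(len(mm) // FACT.size):
                color, year, make, price, vmax = FACT.unpack_from(mm, i * FACT.size)
                total += price
                count += 1
        finally:
            mm.close()
    return total, count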
Laszlo Nagy <ga*****@shopzeus.com> writes:
[snip]
What else can I use?
pytables (<http://www.pytables.org/>) looks like the right kind of
thing.
-M-
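For reference, the fact table might look roughly like this in PyTables;
this is a sketch against the tables 3.x API, with column names simply
mirroring the example data and facts as in the original post:

import tables

class Fact(tables.IsDescription):
    # dimension value indexes
    color = tables.UInt16Col()
    year = tables.UInt16Col()
    make = tables.UInt16Col()
    # measures
    price = tables.Float64Col()
    vmax = tables.Float64Col()

h5 = tables.open_file('facts.h5', mode='w')
table = h5.create_table('/', 'facts', Fact, title='pivot facts')

row = table.row
for (color, year, make), (price, vmax) in facts:
    row['color'], row['year'], row['make'] = color, year, make
    row['price'], row['vmax'] = price, vmax
    row.append()
table.flush()

# In-kernel query: sum of Price over all Ferraris (Make index 0).
ferrari_total = sum(r['price'] for r in table.where('make == 0'))
h5.close()

HDF5 gives you chunked, optionally compressed on-disk storage, and
where() evaluates the filter in-kernel rather than pulling every row
into Python first.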