Re: Fastest way to store ints and floats on disk

On 2008-08-07 20:41, Laszlo Nagy wrote:
Hi,

I'm working on a pivot table. I would like to write it in Python. I
know I should be doing that in C, but I would like to create a
cross-platform version which can deal with smaller databases (not more
than a million facts).

The data is first imported from a csv file: the user selects which
columns contain dimension and measure data (and which columns to
ignore). In the next step I would like to build up a database that is
efficient enough to be used for making pivot tables. Here is my idea for
the database:

Original CSV file with column header and values:

"Color","Year","Make","Price","VMax"
Yellow,2000,Ferrari,100000,254
Blue,2003,Volvo,50000,210

Using the GUI, it is converted to this:

dimensions = [
    { 'name': 'Color', 'colindex': 0,
      'values': [ 'Red', 'Blue', 'Green', 'Yellow' ] },
    { 'name': 'Year', 'colindex': 1,
      'values': [ 1995, 1999, 2000, 2001, 2002, 2003, 2007 ] },
    { 'name': 'Make', 'colindex': 2,
      'values': [ 'Ferrari', 'Volvo', 'Ford', 'Lamborghini' ] },
]
measures = [
    { 'name': 'Price', 'colindex': 3 },
    { 'name': 'VMax', 'colindex': 4 },
]
facts = [
    ( (3,2,0), (100000.0,254.0) ),  # ( dimension_value_indexes, measure_values )
    ( (1,5,1), (50000.0,210.0) ),
    # ... some million rows or less
]
The core of the idea is that, when using a relatively small number of
possible values for each dimension, the facts table becomes
significantly smaller and easier to process. (Processing the facts would
be: iterate over facts, filter out some of them, create statistical
values of the measures, grouped by dimensions.)
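
A minimal sketch of that processing, assuming the facts are kept as
(dimension_value_indexes, measure_values) tuples as above, grouping by
the Color dimension and summing the Price measure:

from collections import defaultdict

# Example facts in the format shown above.
facts = [
    ((3, 2, 0), (100000.0, 254.0)),
    ((1, 5, 1), (50000.0, 210.0)),
]

sums = defaultdict(float)    # sum of Price per Color value index
counts = defaultdict(int)    # number of facts per Color value index
for dim_idx, meas in facts:
    color = dim_idx[0]       # index into dimensions[0]['values']
    sums[color] += meas[0]   # Price is the first measure
    counts[color] += 1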

The facts table cannot be kept in memory because it is too big. I need
to store it on disk, read it back incrementally, and compute statistics.
In most cases, the "statistic" will be a simple sum of the measures and
a count of the number of facts affected. To be efficient, reading the
facts from disk should not involve complex conversions. For this reason,
storing in CSV, XML, or any textual format would be bad. I'm thinking
about a binary format, but how can I interface that with Python?

I already looked at:

- xdrlib, which throws a DeprecationWarning when I store some integers
- struct, which uses a format string for each read operation; I'm
concerned about its speed

What else can I use?
>>> import marshal
>>> marshal.dump(1, open('test.db', 'wb'))
>>> marshal.load(open('test.db', 'rb'))
1

It is also very fast at dumping/loading lists, tuples, dictionaries,
floats, etc.
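
For example, a minimal sketch (the file name and fact format are just
placeholders) of dumping the facts one per call and reading them back
incrementally; marshal.load() returns one object per call and raises
EOFError at end of file:

import marshal

# Write each fact as a separate marshalled object, so the whole table
# never has to sit in memory at once.
with open('facts.db', 'wb') as f:
    for fact in (((3, 2, 0), (100000.0, 254.0)),
                 ((1, 5, 1), (50000.0, 210.0))):
        marshal.dump(fact, f)

# Read them back one at a time.
with open('facts.db', 'rb') as f:
    while True:
        try:
            fact = marshal.load(f)
        except EOFError:
            break
        # ... update sums / counts here ...

Keep in mind that marshal's on-disk format is specific to the CPython
version, so it is better suited to a scratch file than to a long-lived
database.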

--
Marc-Andre Lemburg
eGenix.com

Aug 7 '08 #1
On Aug 7, 2:27 pm, "M.-A. Lemburg" <m...@egenix.com> wrote:
[... original question and marshal example snipped ...]
Depending on how hard-core you want to be, store the int, float,
string, and long C structures directly to disk, at a given offset.
Either use fixed-length strings, or implement (or find) a memory
manager. Anyone have a good alloc-realloc-free library, C or Python?
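
A minimal sketch of the fixed-size-record variant using struct; a
precompiled struct.Struct avoids re-parsing the format string on every
read, which answers the speed concern raised earlier (the format and
file name are only examples):

import struct

# One fact per fixed-size record: three dimension indexes (unsigned
# ints) followed by two measures (doubles): 3*4 + 2*8 = 28 bytes.
record = struct.Struct('<IIIdd')

with open('facts.bin', 'wb') as f:
    f.write(record.pack(3, 2, 0, 100000.0, 254.0))
    f.write(record.pack(1, 5, 1, 50000.0, 210.0))

# Random access: seek to record_number * record.size and unpack.
with open('facts.bin', 'rb') as f:
    f.seek(1 * record.size)                    # second fact
    print(record.unpack(f.read(record.size)))  # (1, 5, 1, 50000.0, 210.0)
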
Aug 9 '08 #2
On Aug 10, 4:58 am, castironpi <castiro...@gmail.com> wrote:
[... original question and marshal example snipped ...]
Depending on how hard-core you want to be, store the int, float,
string, and long C structures directly to disk, at a given offset.
Either use fixed-length strings, or implement (or find) a memory
manager. Anyone have a good alloc-realloc-free library, C or Python?
A long time ago, when I last needed to bother about such things (to
override the memory allocator in the DJGPP RTL), Doug Lea's malloc did
the trick.

A memory allocator written in Python? That's a novel concept.
Aug 9 '08 #3
On Aug 9, 4:43 pm, John Machin <sjmac...@lexicon.net> wrote:
[... earlier quoted messages snipped ...]

A long time ago, when I last needed to bother about such things (to
override the memory allocator in the DJGPP RTL), Doug Lea's malloc did
the trick.

A memory allocator written in Python? That's a novel concept.
For strings and longs, you need variable-length records. Wrap Lea's
malloc in Python calls, and design a Python class where Year, Price,
and VMax are stored as ints at given offsets from the start of the
file, and Color and Make are stored as strings at given offsets.
Don't bother to cache, just seek and read. Part 1 of the file, or
File 1, looks like:

Car 1 color_offset year_offset make_offset price_offset vmax_offset
Car 2 color_offset year_offset make_offset price_offset vmax_offset

Store them directly as bytes, not string reps. of numbers.

1024 1050 1054 1084 1088
1092 1112 1116 1130 1134

Part 2 looks like

1024 Yell
1028 ow
1050 2000
1054 Ferr
1058 ari
1084 100000
1088 254
1092 Blue
1112 2003
1116 Volv

and so on.
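
A minimal sketch of writing that kind of layout, assuming a single file
with a fixed-size offset table followed by a heap of raw values (the
record layout and file name here are made up for the example):

import struct

# Fixed-size offset table: five offsets per car
# (color, year, make, price, vmax).
entry = struct.Struct('<5I')

cars = [(b'Yellow', 2000, b'Ferrari', 100000, 254),
        (b'Blue',   2003, b'Volvo',   50000,  210)]

table = bytearray()
heap = bytearray()
heap_base = len(cars) * entry.size   # the heap starts right after the table

def put(raw):
    # Append raw bytes to the heap and return their absolute file offset.
    off = heap_base + len(heap)
    heap.extend(raw)
    return off

for color, year, make, price, vmax in cars:
    table.extend(entry.pack(put(color),
                            put(struct.pack('<i', year)),
                            put(make),
                            put(struct.pack('<i', price)),
                            put(struct.pack('<i', vmax))))

with open('cars.bin', 'wb') as f:
    f.write(table)
    f.write(heap)

Reading a string back needs its length as well as its offset (taken
from the next offset, or stored explicitly), which is exactly where a
small allocator such as Lea's earns its keep.
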
Aug 10 '08 #4
