Bytes IT Community

Numpy array to gzip file

I have a set of numpy arrays which I would like to save to a gzip
file. Here is an example without gzip:

b=numpy.ones(1000000,dtype=numpy.uint8)
a=numpy.zeros(1000000,dtype=numpy.uint8)
fd = file('test.dat','wb')
a.tofile(fd)
b.tofile(fd)
fd.close()

This works fine. However, this does not:

fd = gzip.open('test.dat','wb')
a.tofile(fd)

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: first argument must be a string or open file

In the bigger picture, I want to be able to write multiple numpy
arrays with some metadata to a binary file for very fast reading, and
these arrays are pretty compressible (strings of small integers), so I
can probably benefit in speed and file size by gzipping.

Thanks,
Sean
Jun 27 '08 #1
3 Replies


On Jun 11, 9:17 am, Sean Davis <seand...@gmail.com> wrote:
> I have a set of numpy arrays which I would like to save to a gzip
> file. Here is an example without gzip:
>
> b=numpy.ones(1000000,dtype=numpy.uint8)
> a=numpy.zeros(1000000,dtype=numpy.uint8)
> fd = file('test.dat','wb')
> a.tofile(fd)
> b.tofile(fd)
> fd.close()
>
> This works fine. However, this does not:
>
> fd = gzip.open('test.dat','wb')
> a.tofile(fd)
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> IOError: first argument must be a string or open file
>
> In the bigger picture, I want to be able to write multiple numpy
> arrays with some metadata to a binary file for very fast reading, and
> these arrays are pretty compressible (strings of small integers), so I
> can probably benefit in speed and file size by gzipping.
>
> Thanks,
> Sean
Use

fd.write(a)

The documentation says that gzip simulates most of the methods of a
file object. Apparently that means it does not subclass file, and
numpy's tofile() wants a real file object. Or something like that.
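A minimal round-trip sketch of this advice, in modern Python 3 spelling (the explicit conversion is arr.tobytes(); in the Python 2 of this thread it was tostring()):

```python
import gzip
import numpy

a = numpy.zeros(1000000, dtype=numpy.uint8)
b = numpy.ones(1000000, dtype=numpy.uint8)

# gzip file objects accept bytes, so convert each array explicitly;
# tofile() bypasses the Python-level write() and needs a real file.
with gzip.open('test.dat.gz', 'wb') as fd:
    fd.write(a.tobytes())
    fd.write(b.tobytes())

# Read everything back and wrap the bytes in an array without copying.
with gzip.open('test.dat.gz', 'rb') as fd:
    data = numpy.frombuffer(fd.read(), dtype=numpy.uint8)

print(data.shape)  # (2000000,)
```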
Jun 27 '08 #2

On Jun 11, 12:42 pm, "drobi...@gmail.com" <drobi...@gmail.com> wrote:
> On Jun 11, 9:17 am, Sean Davis <seand...@gmail.com> wrote:
>> I have a set of numpy arrays which I would like to save to a gzip
>> file. Here is an example without gzip:
>> b=numpy.ones(1000000,dtype=numpy.uint8)
>> a=numpy.zeros(1000000,dtype=numpy.uint8)
>> fd = file('test.dat','wb')
>> a.tofile(fd)
>> b.tofile(fd)
>> fd.close()
>> This works fine. However, this does not:
>> fd = gzip.open('test.dat','wb')
>> a.tofile(fd)
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> IOError: first argument must be a string or open file
>> In the bigger picture, I want to be able to write multiple numpy
>> arrays with some metadata to a binary file for very fast reading, and
>> these arrays are pretty compressible (strings of small integers), so I
>> can probably benefit in speed and file size by gzipping.
>> Thanks,
>> Sean
>
> Use
> fd.write(a)
That seems to work fine. Just to add to the answer a bit, one can
then use:

b=numpy.frombuffer(fd.read(),dtype=numpy.uint8)

to get the array back as a numpy uint8 array.
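For the two-array case in the original post, frombuffer's count and offset arguments can carve the concatenated buffer back apart, assuming the lengths are recorded somewhere (a sketch, not the thread's exact code):

```python
import gzip
import numpy

a = numpy.zeros(1000000, dtype=numpy.uint8)
b = numpy.ones(1000000, dtype=numpy.uint8)

with gzip.open('ab.dat.gz', 'wb') as fd:
    fd.write(a.tobytes())
    fd.write(b.tobytes())

with gzip.open('ab.dat.gz', 'rb') as fd:
    buf = fd.read()

# count/offset slice the arrays back out of the combined buffer;
# the lengths themselves must be stored as metadata elsewhere.
a2 = numpy.frombuffer(buf, dtype=numpy.uint8, count=1000000)
b2 = numpy.frombuffer(buf, dtype=numpy.uint8, offset=1000000)
```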

Thanks for the help.

Sean
Jun 27 '08 #3

Sean Davis wrote:
> I have a set of numpy arrays which I would like to save to a gzip
> file. Here is an example without gzip:
>
> b=numpy.ones(1000000,dtype=numpy.uint8)
> a=numpy.zeros(1000000,dtype=numpy.uint8)
> fd = file('test.dat','wb')
> a.tofile(fd)
> b.tofile(fd)
> fd.close()
>
> This works fine. However, this does not:
>
> fd = gzip.open('test.dat','wb')
> a.tofile(fd)
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> IOError: first argument must be a string or open file
As drobinow says, the .tofile() method needs an actual file object with a real
FILE* pointer underneath it. You will need to call fd.write() on strings (or
buffers) made from the arrays instead. If your arrays are large (as they must be
if compression helps), then you will probably want to split them into chunks. Use
numpy.array_split() to do this. For example:

In [13]: import numpy

In [14]: a=numpy.zeros(1000000,dtype=numpy.uint8)

In [15]: chunk_size = 256*1024

In [17]: import gzip

In [18]: fd = gzip.open('foo.gz', 'wb')

In [19]: for chunk in numpy.array_split(a, len(a) // chunk_size):
....: fd.write(buffer(chunk))
....:
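The same chunked write as a plain script (in Python 3 spelling; the uneven final chunk is fine, since numpy.array_split does not require an exact division):

```python
import gzip
import numpy

a = numpy.zeros(1000000, dtype=numpy.uint8)
chunk_size = 256 * 1024

with gzip.open('foo.gz', 'wb') as fd:
    # Compress roughly chunk_size bytes at a time instead of
    # materializing the whole array as one bytes object.
    for chunk in numpy.array_split(a, max(1, len(a) // chunk_size)):
        fd.write(chunk.tobytes())
```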
> In the bigger picture, I want to be able to write multiple numpy
> arrays with some metadata to a binary file for very fast reading, and
> these arrays are pretty compressible (strings of small integers), so I
> can probably benefit in speed and file size by gzipping.
File size perhaps, but I suspect the speed gains you get will be swamped by the
Python-level manipulation you will have to do to reconstruct the array. You will
have to read in (partial!) strings and then put the data into an array. If you
think compression will really help, look into PyTables. It uses the HDF5 library
which includes the ability to compress arrays with gzip and other compression
schemes. All of the decompression happens in C, so you don't have to do all of
the manipulations at the Python level. If you stand to gain anything from
compression, this is the best way to find out and probably the best way to
implement it, too.
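For a rough sense of that Python-level reconstruction cost, the plain-gzip read path (partial byte strings reassembled into one array) might be sketched as follows, with an arbitrarily chosen chunk size:

```python
import gzip
import numpy

a = numpy.zeros(1000000, dtype=numpy.uint8)
with gzip.open('chunks.gz', 'wb') as fd:
    fd.write(a.tobytes())

pieces = []
with gzip.open('chunks.gz', 'rb') as fd:
    while True:
        chunk = fd.read(256 * 1024)
        if not chunk:
            break
        # Each partial read becomes its own small array...
        pieces.append(numpy.frombuffer(chunk, dtype=numpy.uint8))

# ...and concatenate copies them all into one contiguous result;
# this is the Python-level reassembly work described above.
result = numpy.concatenate(pieces)
```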

http://www.pytables.org

If you have more numpy questions, you will probably want to ask on the numpy
mailing list:

http://www.scipy.org/Mailing_Lists

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Jun 27 '08 #4
