Bytes IT Community

Numpy array to gzip file

I have a set of numpy arrays which I would like to save to a gzip
file. Here is an example without gzip:

b=numpy.ones(1000000,dtype=numpy.uint8)
a=numpy.zeros(1000000,dtype=numpy.uint8)
fd = file('test.dat','wb')
a.tofile(fd)
b.tofile(fd)
fd.close()

This works fine. However, this does not:

fd = gzip.open('test.dat','wb')
a.tofile(fd)

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: first argument must be a string or open file

In the bigger picture, I want to be able to write multiple numpy
arrays with some metadata to a binary file for very fast reading, and
these arrays are pretty compressible (strings of small integers), so I
can probably benefit in speed and file size by gzipping.

Thanks,
Sean
Jun 27 '08 #1
3 Replies


On Jun 11, 9:17 am, Sean Davis <seand...@gmail.com> wrote:
> I have a set of numpy arrays which I would like to save to a gzip
> file. Here is an example without gzip:
>
> b=numpy.ones(1000000,dtype=numpy.uint8)
> a=numpy.zeros(1000000,dtype=numpy.uint8)
> fd = file('test.dat','wb')
> a.tofile(fd)
> b.tofile(fd)
> fd.close()
>
> This works fine. However, this does not:
>
> fd = gzip.open('test.dat','wb')
> a.tofile(fd)
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> IOError: first argument must be a string or open file
>
> In the bigger picture, I want to be able to write multiple numpy
> arrays with some metadata to a binary file for very fast reading, and
> these arrays are pretty compressible (strings of small integers), so I
> can probably benefit in speed and file size by gzipping.
>
> Thanks,
> Sean
Use

fd.write(a)

The documentation says that gzip simulates most of the methods of a
file object. Apparently that means it does not subclass file, and
numpy's tofile() wants a real file object. Or something like that.
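A minimal round-trip sketch of this advice, in modern Python 3 spelling (the explicit conversion is arr.tobytes(); in the Python 2 of this thread it was tostring()):

```python
import gzip
import numpy

a = numpy.zeros(1000000, dtype=numpy.uint8)
b = numpy.ones(1000000, dtype=numpy.uint8)

# gzip file objects accept bytes, so convert each array explicitly;
# tofile() bypasses the Python-level write() and needs a real file.
with gzip.open('test.dat.gz', 'wb') as fd:
    fd.write(a.tobytes())
    fd.write(b.tobytes())

# Read everything back and wrap the bytes in an array without copying.
with gzip.open('test.dat.gz', 'rb') as fd:
    data = numpy.frombuffer(fd.read(), dtype=numpy.uint8)

print(data.shape)  # (2000000,)
```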
Jun 27 '08 #2

On Jun 11, 12:42 pm, "drobi...@gmail.com" <drobi...@gmail.com> wrote:
> On Jun 11, 9:17 am, Sean Davis <seand...@gmail.com> wrote:
>> I have a set of numpy arrays which I would like to save to a gzip
>> file. Here is an example without gzip:
>> b=numpy.ones(1000000,dtype=numpy.uint8)
>> a=numpy.zeros(1000000,dtype=numpy.uint8)
>> fd = file('test.dat','wb')
>> a.tofile(fd)
>> b.tofile(fd)
>> fd.close()
>> This works fine. However, this does not:
>> fd = gzip.open('test.dat','wb')
>> a.tofile(fd)
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> IOError: first argument must be a string or open file
>> In the bigger picture, I want to be able to write multiple numpy
>> arrays with some metadata to a binary file for very fast reading, and
>> these arrays are pretty compressible (strings of small integers), so I
>> can probably benefit in speed and file size by gzipping.
>> Thanks,
>> Sean
>
> Use
> fd.write(a)
That seems to work fine. Just to add to the answer a bit, one can
then use:

b=numpy.frombuffer(fd.read(),dtype=numpy.uint8)

to get the array back as a numpy uint8 array.
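For the two-array case in the original post, frombuffer's count and offset arguments can carve the concatenated buffer back apart, assuming the lengths are recorded somewhere (a sketch, not the thread's exact code):

```python
import gzip
import numpy

a = numpy.zeros(1000000, dtype=numpy.uint8)
b = numpy.ones(1000000, dtype=numpy.uint8)

with gzip.open('ab.dat.gz', 'wb') as fd:
    fd.write(a.tobytes())
    fd.write(b.tobytes())

with gzip.open('ab.dat.gz', 'rb') as fd:
    buf = fd.read()

# count/offset slice the arrays back out of the combined buffer;
# the lengths themselves must be stored as metadata elsewhere.
a2 = numpy.frombuffer(buf, dtype=numpy.uint8, count=1000000)
b2 = numpy.frombuffer(buf, dtype=numpy.uint8, offset=1000000)
```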

Thanks for the help.

Sean
Jun 27 '08 #3

Sean Davis wrote:
> I have a set of numpy arrays which I would like to save to a gzip
> file. Here is an example without gzip:
>
> b=numpy.ones(1000000,dtype=numpy.uint8)
> a=numpy.zeros(1000000,dtype=numpy.uint8)
> fd = file('test.dat','wb')
> a.tofile(fd)
> b.tofile(fd)
> fd.close()
>
> This works fine. However, this does not:
>
> fd = gzip.open('test.dat','wb')
> a.tofile(fd)
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> IOError: first argument must be a string or open file
As drobinow says, the .tofile() method needs an actual file object with a real
FILE* pointer underneath it. You will need to call fd.write() on strings (or
buffers) made from the arrays instead. If your arrays are large (as they must be
if compression helps), then you will probably want to split them into chunks. Use
numpy.array_split() to do this. For example:

In [13]: import numpy

In [14]: a=numpy.zeros(1000000,dtype=numpy.uint8)

In [15]: chunk_size = 256*1024

In [17]: import gzip

In [18]: fd = gzip.open('foo.gz', 'wb')

In [19]: for chunk in numpy.array_split(a, len(a) // chunk_size):
....: fd.write(buffer(chunk))
....:
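The same chunked write as a plain script (in Python 3 spelling; the uneven final chunk is fine, since numpy.array_split does not require an exact division):

```python
import gzip
import numpy

a = numpy.zeros(1000000, dtype=numpy.uint8)
chunk_size = 256 * 1024

with gzip.open('foo.gz', 'wb') as fd:
    # Compress roughly chunk_size bytes at a time instead of
    # materializing the whole array as one bytes object.
    for chunk in numpy.array_split(a, max(1, len(a) // chunk_size)):
        fd.write(chunk.tobytes())
```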
> In the bigger picture, I want to be able to write multiple numpy
> arrays with some metadata to a binary file for very fast reading, and
> these arrays are pretty compressible (strings of small integers), so I
> can probably benefit in speed and file size by gzipping.
File size perhaps, but I suspect the speed gains you get will be swamped by the
Python-level manipulation you will have to do to reconstruct the array. You will
have to read in (partial!) strings and then put the data into an array. If you
think compression will really help, look into PyTables. It uses the HDF5 library
which includes the ability to compress arrays with gzip and other compression
schemes. All of the decompression happens in C, so you don't have to do all of
the manipulations at the Python level. If you stand to gain anything from
compression, this is the best way to find out and probably the best way to
implement it, too.
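For a rough sense of that Python-level reconstruction cost, the plain-gzip read path (partial byte strings reassembled into one array) might be sketched as follows, with an arbitrarily chosen chunk size:

```python
import gzip
import numpy

a = numpy.zeros(1000000, dtype=numpy.uint8)
with gzip.open('chunks.gz', 'wb') as fd:
    fd.write(a.tobytes())

pieces = []
with gzip.open('chunks.gz', 'rb') as fd:
    while True:
        chunk = fd.read(256 * 1024)
        if not chunk:
            break
        # Each partial read becomes its own small array...
        pieces.append(numpy.frombuffer(chunk, dtype=numpy.uint8))

# ...and concatenate copies them all into one contiguous result;
# this is the Python-level reassembly work described above.
result = numpy.concatenate(pieces)
```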

http://www.pytables.org

If you have more numpy questions, you will probably want to ask on the numpy
mailing list:

http://www.scipy.org/Mailing_Lists

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Jun 27 '08 #4
