
How to read gzipped utf8 file in Python?

I have a large (gigabytes) file which is encoded in UTF-8 and then
compressed with gzip. I'd like to read it with the "gzip" module
and "utf8" decoding. The obvious approach is

fd = gzip.open(fname, 'rb',encoding='utf8')

But "gzip.open" doesn't support an "encoding" parameter. (It
probably should, for consistency.) Is there some way to do this?
Is it possible to express "unzip, then decode utf8" via
"codecs.open"?

John Nagle
Nov 22 '07 #1


> I have a large (gigabytes) file which is encoded in UTF-8 and then
> compressed with gzip. I'd like to read it with the "gzip" module
> and "utf8" decoding.
You didn't specify the processing you want to perform. For example,
this should work just fine

fd = gzip.open(fname, 'rb')
for line in fd:
    pass

For that processing, it is not even necessary to know what the encoding
of the file is, except that it is an ASCII superset (which UTF-8 is).
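A minimal, self-contained sketch of that byte-level loop, assuming Python 3. The helper name count_lines and the sample file contents are made up for illustration and are not from the original post. Because UTF-8 never uses the newline byte inside a multi-byte character, splitting on it without decoding is safe.

```python
import gzip
import os
import tempfile


def count_lines(fname):
    """Count lines in a gzipped file without decoding its contents."""
    count = 0
    with gzip.open(fname, 'rb') as fd:
        for line in fd:  # each item is a bytes object, one per line
            count += 1
    return count


# Build a small throwaway sample file (hypothetical content).
tmpdir = tempfile.mkdtemp()
sample_path = os.path.join(tmpdir, 'sample.txt.gz')
with gzip.open(sample_path, 'wb') as out:
    out.write('caf\u00e9\nna\u00efve\nplain\n'.encode('utf-8'))

line_count = count_lines(sample_path)
```

The file is decompressed incrementally as the loop advances, so this should also be usable on files too large to hold in memory.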
> The obvious approach is
>
> fd = gzip.open(fname, 'rb',encoding='utf8')
>
> But "gzip.open" doesn't support an "encoding" parameter. (It
> probably should, for consistency.)
I think I disagree. The builtin open function does not support an
encoding argument, either (in Python 2.x). Conceptually, gzip operates
on byte streams, not character streams.
> Is it possible to express "unzip, then decode utf8" via
> "codecs.open"?
If that's the processing you want to do - sure

import codecs
import gzip

fd0 = gzip.open(fname, 'rb')
fd = codecs.getreader("utf-8")(fd0)
data = fd.readline()

You can combine that to

fd = codecs.getreader("utf-8")(gzip.open(fname))
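For completeness, here is a rough, runnable sketch of the combined form, assuming Python 3. The generator name read_utf8_gzip_lines and the sample data are hypothetical, chosen only for this example; the technique is the same codecs.getreader wrapper around gzip.open shown above.

```python
import codecs
import gzip
import os
import tempfile


def read_utf8_gzip_lines(fname):
    """Yield decoded text lines from a gzipped UTF-8 file, one at a time."""
    # Wrap the decompressed byte stream in a UTF-8 StreamReader.
    reader = codecs.getreader("utf-8")(gzip.open(fname, 'rb'))
    try:
        for line in reader:  # each item is a str, line ending kept
            yield line
    finally:
        reader.close()  # delegates to the underlying gzip file


# Build a small throwaway sample file (hypothetical content).
tmpdir = tempfile.mkdtemp()
sample_path = os.path.join(tmpdir, 'sample.txt.gz')
with gzip.open(sample_path, 'wb') as out:
    out.write('caf\u00e9\nna\u00efve\n'.encode('utf-8'))

lines = list(read_utf8_gzip_lines(sample_path))
```

In Python 3.3 and later, gzip.open(fname, 'rt', encoding='utf-8') should give a comparable text-mode file object directly, which was not available when this thread was written.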

HTH,
Martin
Nov 22 '07 #2
