I have a large (gigabytes) file which is encoded in UTF-8 and then
compressed with gzip. I'd like to read it with the "gzip" module
and "utf8" decoding.
You didn't specify the processing you want to perform. For example,
this should work just fine:
    import gzip
    fd = gzip.open(fname, 'rb')
    for line in fd:  # iterate over the file; readline() returns only one line
        pass
For that processing, it is not even necessary to know what the encoding
of the file is, except that it is an ASCII superset (which UTF-8 is).
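To illustrate why the encoding does not matter for line splitting, here is a minimal sketch. The sample text and the temporary file path are made up for the demonstration. It relies on the fact that in UTF-8 the newline byte never occurs inside a multi-byte sequence, so every raw line is itself valid UTF-8.

```python
import gzip
import os
import tempfile

# Hypothetical sample data: two lines containing non-ASCII characters.
text = u"gr\u00fc\u00dfe\n\u65e5\u672c\u8a9e\n"

# Write the sample as gzip-compressed UTF-8 into a temporary file.
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wb") as out:
    out.write(text.encode("utf-8"))

# Read raw bytes line by line; no decoding happens at this point.
with gzip.open(path, "rb") as fd:
    raw_lines = [line for line in fd]

# Each raw line can be decoded on its own, because splitting on the
# newline byte cannot cut a multi-byte UTF-8 sequence in half.
decoded = [line.decode("utf-8") for line in raw_lines]
```

For a file of several gigabytes you would process each line inside the loop rather than collect them in a list; the lists here only keep the demonstration short.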
The obvious approach is
    fd = gzip.open(fname, 'rb', encoding='utf8')
But "gzip.open" doesn't support an "encoding" parameter. (It
probably should, for consistency.)
I think I disagree. The builtin open function does not support an
encoding argument, either (in Python 2.x). Conceptually, gzip operates
on byte streams, not character streams.
Is it possible to express "unzip, then decode utf8" via
"codecs.open"?
If that's the processing you want to do, sure:
    import codecs, gzip
    fd0 = gzip.open(fname, 'rb')         # decompressed byte stream
    fd = codecs.getreader("utf-8")(fd0)  # decodes the bytes to unicode
    data = fd.readline()
You can combine that into a single expression:
    fd = codecs.getreader("utf-8")(gzip.open(fname))
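For a large file you would probably want to iterate rather than call readline() once. The following is only a sketch of how that might look. The helper name iter_utf8_lines and the sample file are my own inventions for the example, not part of gzip or codecs.

```python
import codecs
import gzip
import os
import tempfile


def iter_utf8_lines(fname):
    """Yield decoded lines from a gzip-compressed UTF-8 file, one at a time."""
    # Hypothetical helper: wrap the decompressed byte stream in a UTF-8 reader.
    fd = codecs.getreader("utf-8")(gzip.open(fname, "rb"))
    try:
        for line in fd:
            yield line
    finally:
        # close() is delegated to the underlying gzip file object.
        fd.close()


# Build a small sample file so the sketch is self-contained.
sample = u"caf\u00e9\n\u00fcber\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wb") as out:
    out.write(sample.encode("utf-8"))

lines = list(iter_utf8_lines(path))
```

Because the generator yields one line at a time, the whole file never has to be held in memory. On Python 3.3 and later, gzip.open(fname, 'rt', encoding='utf-8') should give a comparable text-mode file object without going through codecs.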
HTH,
Martin