How to read gzipped utf8 file in Python?

I have a large (gigabytes) file which is encoded in UTF-8 and then
compressed with gzip. I'd like to read it with the "gzip" module
and "utf8" decoding. The obvious approach is

fd = gzip.open(fname, 'rb',encoding='utf8')

But "gzip.open" doesn't support an "encoding" parameter. (It
probably should, for consistency.) Is there some way to do this?
Is it possible to express "unzip, then decode utf8" via
"codecs.open"?

John Nagle
Nov 22 '07 #1
> I have a large (gigabytes) file which is encoded in UTF-8 and then
> compressed with gzip. I'd like to read it with the "gzip" module
> and "utf8" decoding.

You didn't specify the processing you want to perform. For example,
this should work just fine

import gzip

fd = gzip.open(fname, 'rb')
for line in fd:  # iterate over the file object; fd.readline() returns only one line
    pass

For that processing, it is not even necessary to know what the encoding
of the file is, except that it is an ASCII superset (which UTF-8 is).
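A minimal sketch of that byte-oriented approach, in case decoded text is still wanted for each line: the file is split into lines as raw bytes, and each line is decoded on its own. This is safe for UTF-8 because the newline byte never occurs inside a multi-byte sequence. The function name `read_utf8_lines` is made up for illustration; for a file of several gigabytes you would process each line inside the loop rather than collect them in a list.

```python
import gzip
import os
import tempfile


def read_utf8_lines(path):
    """Split a gzip file into lines as bytes, then decode each line."""
    lines = []
    with gzip.open(path, "rb") as fd:
        for raw in fd:  # raw is a bytes object, normally ending in b"\n"
            lines.append(raw.decode("utf-8"))
    return lines


if __name__ == "__main__":
    # Hypothetical sample data written to a temporary file for the demo.
    handle, sample_path = tempfile.mkstemp(suffix=".gz")
    os.close(handle)
    try:
        with gzip.open(sample_path, "wb") as out:
            out.write("héllo\nwörld\n".encode("utf-8"))
        for text in read_utf8_lines(sample_path):
            print(repr(text))
    finally:
        os.remove(sample_path)
```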
> The obvious approach is
>
> fd = gzip.open(fname, 'rb',encoding='utf8')
>
> But "gzip.open" doesn't support an "encoding" parameter. (It
> probably should, for consistency.)

I think I disagree. The builtin open function does not support an
encoding argument, either (in Python 2.x). Conceptually, gzip operates
on byte streams, not character streams.
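For readers on Python 3 (an assumption; the discussion here concerns Python 2.x), the situation is different: since Python 3.3, `gzip.open` accepts a text mode such as `'rt'` together with an `encoding` argument, and wraps the byte stream in a text decoder for you. A small sketch, with `read_text_lines` as a made-up helper name:

```python
import gzip
import os
import tempfile


def read_text_lines(path):
    """Read a gzip-compressed UTF-8 file in text mode (Python 3.3+)."""
    with gzip.open(path, "rt", encoding="utf-8") as fd:
        return list(fd)  # each item is a decoded str


if __name__ == "__main__":
    # Hypothetical sample data written to a temporary file for the demo.
    handle, sample_path = tempfile.mkstemp(suffix=".gz")
    os.close(handle)
    try:
        with gzip.open(sample_path, "wb") as out:
            out.write("grüß\ndich\n".encode("utf-8"))
        print(read_text_lines(sample_path))
    finally:
        os.remove(sample_path)
```

The underlying gzip layer still works on bytes; text mode only adds a decoding wrapper on top of it.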
> Is it possible to express "unzip, then decode utf8" via
> "codecs.open"?

If that's the processing you want to do, sure:

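import codecs
import gzip

fd0 = gzip.open(fname, 'rb')          # decompressed byte stream
fd = codecs.getreader("utf-8")(fd0)   # wrap it in a UTF-8 decoding reader
data = fd.readline()                  # one decoded line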

You can combine that into a single expression:

fd = codecs.getreader("utf-8")(gzip.open(fname))
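The combined form can be tried end to end with a small self-contained sketch. It writes a tiny sample file first so that it runs anywhere; the helper names `open_gzip_utf8` and `demo` and the sample text are made up for illustration.

```python
import codecs
import gzip
import os
import tempfile


def open_gzip_utf8(path):
    """Return a reader yielding decoded text from a gzip-compressed UTF-8 file."""
    return codecs.getreader("utf-8")(gzip.open(path, "rb"))


def demo():
    # Create a temporary gzip file holding UTF-8 encoded sample text.
    handle, path = tempfile.mkstemp(suffix=".gz")
    os.close(handle)
    try:
        with gzip.open(path, "wb") as out:
            out.write("naïve\ncafé\n".encode("utf-8"))
        reader = open_gzip_utf8(path)
        try:
            # Iterating the reader yields one decoded line at a time.
            return [line for line in reader]
        finally:
            reader.close()  # forwarded to the underlying gzip file
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Because the reader decodes line by line as you iterate, the same pattern should suit a multi-gigabyte file as long as the lines are processed one at a time instead of being collected in a list as `demo` does.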

HTH,
Martin
Nov 22 '07 #2
