How to read gzipped utf8 file in Python?

John Nagle

I have a large (gigabytes) file which is encoded in UTF-8 and then
compressed with gzip. I'd like to read it with the "gzip" module
and "utf8" decoding. The obvious approach is

fd = gzip.open(fname, 'rb',encoding='utf8')

But "gzip.open" doesn't support an "encoding" parameter. (It
probably should, for consistency.) Is there some way to do this?
Is it possible to express "unzip, then decode utf8" via
"codecs.open"?

John Nagle

Nov 22 '07 #1


Martin v. Löwis

> I have a large (gigabytes) file which is encoded in UTF-8 and then
> compressed with gzip. I'd like to read it with the "gzip" module
> and "utf8" decoding.

You didn't specify the processing you want to perform. For example,
this should work just fine:

import gzip

fd = gzip.open(fname, 'rb')
for line in fd:  # iterate over the lines; fd.readline() would return only the first one
    pass

For that processing, it is not even necessary to know what the encoding
of the file is, except that it is an ASCII superset (which UTF-8 is).
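
A minimal sketch of that point, assuming Python 3 and a throwaway temporary file (the sample text and the helper name `count_lines` are made up for illustration). The lines are split on the raw newline byte, and decoding happens afterwards, if at all:

```python
import gzip
import os
import tempfile

def count_lines(path):
    """Count lines in a gzipped file without decoding it (hypothetical helper)."""
    n = 0
    with gzip.open(path, 'rb') as fd:
        for line in fd:  # each item is a bytes object
            n += 1
    return n

# Build a small gzipped UTF-8 sample to run against (illustrative data).
sample = "größe\nnaïve\n日本語\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, 'wb') as out:
    out.write(sample.encode("utf-8"))

line_count = count_lines(path)

# Decoding each line separately also works, because the byte 0x0A never
# occurs inside a multi-byte UTF-8 sequence.
with gzip.open(path, 'rb') as fd:
    decoded = [raw.decode("utf-8") for raw in fd]
```

Splitting on bytes first keeps the loop cheap for a multi-gigabyte file when only some lines ever need decoding.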

> The obvious approach is
>
> fd = gzip.open(fname, 'rb',encoding='utf8')
>
> But "gzip.open" doesn't support an "encoding" parameter. (It
> probably should, for consistency.)

I think I disagree. The builtin open function does not support an
encoding argument, either (in Python 2.x). Conceptually, gzip operates
on byte streams, not character streams.
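
For readers on Python 3.3 or later, where the Python 2.x limitation mentioned here no longer applies: `gzip.open` accepts a text mode together with an `encoding` argument, which layers a decoder over the byte stream. A small sketch, with a made-up sample file:

```python
import gzip
import io
import os
import tempfile

# Write an illustrative gzipped UTF-8 file to read back.
sample = "Löwis\nträume\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, 'wb') as out:
    out.write(sample.encode("utf-8"))

# Text mode ('rt') decodes on the fly and yields str lines.
with gzip.open(path, 'rt', encoding='utf-8') as fd:
    text_lines = list(fd)

# The same layering spelled out explicitly: a text wrapper around
# the decompressed byte stream.
with gzip.open(path, 'rb') as raw:
    wrapper = io.TextIOWrapper(raw, encoding='utf-8')
    wrapped_lines = list(wrapper)
```

The byte-stream view still holds underneath: the text layer is a separate wrapper around the decompressor, not part of gzip itself.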

> Is it possible to express "unzip, then decode utf8" via
> "codecs.open"?

If that's the processing you want to do, sure:

import codecs
import gzip

fd0 = gzip.open(fname, 'rb')
fd = codecs.getreader("utf-8")(fd0)  # wrap the byte stream in a UTF-8 decoding reader
data = fd.readline()

You can combine that into a single expression:

fd = codecs.getreader("utf-8")(gzip.open(fname))
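
A self-contained sketch of the `codecs.getreader` approach, round-tripping a small sample file. The helper name `read_gzipped_utf8_lines`, the sample text, and the temporary path are assumptions made for illustration:

```python
import codecs
import gzip
import os
import tempfile

def read_gzipped_utf8_lines(fname):
    """Yield decoded lines from a gzipped UTF-8 file (hypothetical helper)."""
    fd = codecs.getreader("utf-8")(gzip.open(fname, 'rb'))
    try:
        for line in fd:  # the stream reader decodes as it reads
            yield line
    finally:
        fd.close()  # close() is delegated to the underlying gzip file

# Write an illustrative gzipped UTF-8 file.
sample = u"caf\u00e9\nna\u00efve\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
out = gzip.open(path, 'wb')
out.write(sample.encode("utf-8"))
out.close()

lines = list(read_gzipped_utf8_lines(path))
```

Because the reader pulls data incrementally, the file is never held in memory as a whole, which matters for the gigabyte-sized input in the question.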

HTH,
Martin

Nov 22 '07 #2
