
Python vs. Java gzip performance

I've written a small program that, in part, reads in a file and parses
it. Sometimes, the file is gzipped. The code that I use to get the
file object is like so:

if filename.endswith(".gz"):
    file = GzipFile(filename)
else:
    file = open(filename)

Then I parse the contents of the file in the usual way (for line in
file:...)

The equivalent Java code goes like this:

if (isZipped(aFile)) {
    input = new BufferedReader(new InputStreamReader(new
        GZIPInputStream(new FileInputStream(aFile))));
} else {
    input = new BufferedReader(new FileReader(aFile));
}

Then I parse the contents similarly to the Python version (while
nextLine = input.readLine()...)

The Java version of this code is roughly 2x-3x faster than the Python
version. I can get around this problem by replacing the Python
GzipFile object with an os.popen call to gzcat, but then I sacrifice
portability. Is there something that can be improved in the Python
version?

Thanks -- Bill.

Mar 17 '06 #1
Bill wrote:
The Java version of this code is roughly 2x-3x faster than the Python
version. I can get around this problem by replacing the Python
GzipFile object with an os.popen call to gzcat, but then I sacrifice
portability. Is there something that can be improved in the Python
version?


Don't use readline/readlines. Instead, read in larger chunks, and break
it into lines yourself. For example, if you think the entire file should
fit into memory, read it at once.

If that helps, try editing gzip.py to incorporate that approach.
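
For instance, a rough sketch of that idea (assuming the whole file fits
comfortably in memory; the helper and file name are placeholders, not part
of the original code):

import gzip

def read_lines(filename):
    # Decompress (or read) everything in one go, then let the C-level
    # splitlines() do the line splitting instead of repeated readline() calls.
    if filename.endswith(".gz"):
        data = gzip.GzipFile(filename).read()
    else:
        data = open(filename, "rb").read()
    return data.splitlines(True)    # True keeps the line endings

for line in read_lines("yourfile.txt.gz"):
    pass    # parse the line here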

Regards,
Martin
Mar 17 '06 #2
I tried this:

from timeit import *

# Try readlines
print Timer('import gzip;lines=gzip.GzipFile("gztest.txt.gz").readlines();[i+"1" for i in lines]').timeit(200)  # This is one line
# Try file object - uses buffering?
print Timer('import gzip;[i+"1" for i in gzip.GzipFile("gztest.txt.gz")]').timeit(200)  # This is one line

Produces:

3.90938591957
3.98982691765

Doesn't seem much difference, probably because the test file easily
gets into memory, and so disk buffering has no effect. The file
"gztest.txt .gz" is a gzipped file with 1000 lines, each being "This is
a test file".

Mar 17 '06 #3
Caleb Hattingh wrote:
I tried this:

from timeit import *

# Try readlines
print Timer('import gzip;lines=gzip.GzipFile("gztest.txt.gz").readlines();[i+"1" for i in lines]').timeit(200)  # This is one line
# Try file object - uses buffering?
print Timer('import gzip;[i+"1" for i in gzip.GzipFile("gztest.txt.gz")]').timeit(200)  # This is one line

Produces:

3.90938591957
3.98982691765

Doesn't seem much difference, probably because the test file easily
gets into memory, and so disk buffering has no effect. The file
"gztest.txt .gz" is a gzipped file with 1000 lines, each being "This is
a test file".


$ python -c"file('tmp.txt ', 'w').writelines ('%d This is a test\n' % n for n
in range(1000))"
$ gzip tmp.txt

Now, if you follow Martin's advice:

$ python -m timeit -s"from gzip import GzipFile" "GzipFile('tmp.txt.gz').readlines()"
10 loops, best of 3: 20.4 msec per loop

$ python -m timeit -s"from gzip import GzipFile" "GzipFile('tmp.txt.gz').read().splitlines(True)"
1000 loops, best of 3: 534 usec per loop

Factor 38. Not bad, I'd say :-)

Peter
Mar 17 '06 #4
Bill wrote:
I've written a small program that, in part, reads in a file and parses
it. Sometimes, the file is gzipped. The code that I use to get the
file object is like so:

if filename.endswith(".gz"):
    file = GzipFile(filename)
else:
    file = open(filename)

Then I parse the contents of the file in the usual way (for line in
file:...)

The equivalent Java code goes like this:

if (isZipped(aFile)) {
    input = new BufferedReader(new InputStreamReader(new
        GZIPInputStream(new FileInputStream(aFile))));
} else {
    input = new BufferedReader(new FileReader(aFile));
}

Then I parse the contents similarly to the Python version (while
nextLine = input.readLine()...)

The Java version of this code is roughly 2x-3x faster than the Python
version. I can get around this problem by replacing the Python
GzipFile object with an os.popen call to gzcat, but then I sacrifice
portability. Is there something that can be improved in the Python
version?


The gzip module is implemented in Python on top of the zlib module. If
you peruse its source (particularly the readline() method of the GzipFile
class) you might get an idea of what's going on.
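
(A quick, purely illustrative way to locate and look at that code:)

import gzip
import inspect

print gzip.__file__                                # where the module lives
print inspect.getsource(gzip.GzipFile.readline)    # the readline() implementation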

popen()ing a gzcat source achieves better performance by shifting the
decompression to an asynchronous execution stream (separate process)
while allowing the standard Python file object's optimised readline()
implementation (in C) to do the line splitting (which is done in Python
code in GzipFile).

I suspect the Java version probably does something similar under the
covers, using threads.

Short of rewriting the gzip module in C, you may get some better
throughput by using a slightly lower-level approach to parsing the file
(here z is the GzipFile object):

import gzip

z = gzip.GzipFile(filename)
while 1:
    line = z.readline(size=4096)
    if not line:
        break
    # process line here

This is probably only likely to be of use for files (such as log files)
with lines longer than the 100 character default in the readline()
method. More intricate approaches using z.readlines(sizehint=<size>)
might also work.

If you can afford the memory, approaches that read large chunks from the
gzipped stream and then split the lines in one low-level operation (so
that the line splitting is mostly done in C code) are the only way to lift
performance.
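
A rough sketch of that chunked idea (the helper name and chunk size are
mine, purely illustrative); it keeps any incomplete trailing line and
carries it over into the next chunk:

import gzip

def iter_lines(filename, chunk_size=256 * 1024):
    # Read big blocks from the GzipFile and let splitlines() (C code) do
    # the splitting; a partial trailing line is carried into the next block.
    z = gzip.GzipFile(filename)
    leftover = ""
    while 1:
        chunk = z.read(chunk_size)
        if not chunk:
            break
        lines = (leftover + chunk).splitlines(True)
        if not lines[-1].endswith("\n"):
            leftover = lines.pop()
        else:
            leftover = ""
        for line in lines:
            yield line
    if leftover:
        yield leftover

Usage would then simply be: for line in iter_lines("big.log.gz"): ...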

To me, if the performance matters, using popen() (or better: the
subprocess module) isn't so bad; it is actually quite portable
except for the dependency on gzip (probably better to use "gzip -dc"
rather than "gzcat" to maximise portability though). gzip is available
for most systems, and the approach is easily modified to use bzip2 as
well (though Python's bz2 module is implemented totally in C, and so
probably doesn't have the performance issues that gzip has).
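
For reference, a hedged sketch of that route with the subprocess module
(the helper name is mine; no error handling, and it assumes a gzip
executable on the PATH):

import subprocess

def gzip_lines(filename):
    # Let an external gzip process do the decompression; Python then reads
    # ordinary lines from the pipe via its optimised C-level file object.
    proc = subprocess.Popen(["gzip", "-dc", filename],
                            stdout=subprocess.PIPE)
    return proc.stdout

for line in gzip_lines("yourfile.txt.gz"):
    pass    # parse the line here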

-------------------------------------------------------------------------
Andrew I MacIntyre "These thoughts are mine alone..."
E-mail: an*****@bullseye.apana.org.au (pref) | Snail: PO Box 370
        an*****@pcug.org.au (alt)            |        Belconnen ACT 2616
Web: http://www.andymac.org/ | Australia
Mar 18 '06 #5
Bill wrote:
Is there something that can be improved in the Python version?


Seems like GzipFile.readlines is not optimized; the plain file object's
readlines works fine:

C:\py>python -c "file('tmp.txt', 'w').writelines('%d This is a test\n' % n for n in range(10000))"

C:\py>python -m timeit "open('tmp.txt').readlines()"
100 loops, best of 3: 2.72 msec per loop

C:\py>python -m timeit "open('tmp.txt').readlines(1000000)"
100 loops, best of 3: 2.74 msec per loop

C:\py>python -m timeit "open('tmp.txt').read().splitlines(True)"
100 loops, best of 3: 2.79 msec per loop

Workaround has been posted already.

-- Serge.

Mar 18 '06 #6
Hi Peter

Clearly I misunderstood what Martin was saying :) I was comparing
operations on lines via the file generator against first loading the
file's lines into memory, and then performing the concatenation.

What does ".readlines ()" do differently that makes it so much slower
than ".read().splitl ines(True)"? To me, the "one obvious way to do it"
is ".readlines ()".

Caleb

Mar 21 '06 #7
Caleb Hattingh wrote:
What does ".readlines ()" do differently that makes it so much slower
than ".read().splitl ines(True)"? To me, the "one obvious way to do it"
is ".readlines ()".


readlines reads 100 bytes (at most) at a time. I'm not sure why it
does that (probably in order not to read further ahead than necessary
to get a line (*)), but for gzip, that is terribly inefficient. I
believe the gzip algorithm uses a window size much larger than that -
not sure how the gzip library deals with small reads.

One interpretation would be that gzip decompresses the current block
over and over again if the caller only requests 100 bytes each time.
This is a pure guess - you would need to read the zlib source code
to find out.

Anyway, decompressing the entire file at once lets zlib operate at the
highest efficiency.

Regards,
Martin

(*) Guessing further, it might be that "read a lot" fails to work well
on a socket, as you would have to wait for the complete data before
even returning the first line.

P.S. Contributions to improve this are welcome.
Mar 21 '06 #8
Hello,

I'm very new to Python programming. I've just written a program of a few
hundred lines.
Now I'd like to go a step further and make a disk cataloger. There are
plenty for Windows, but few for Linux, so I'd like to write one that works
on both Windows and Linux.
I'm actually a bit stuck on how to collect information regarding disk names
(CD-ROMs or USB HDs).
The matter is rather difficult if the program is supposed to run on Linux
as well as it does on MS Windows.

Suggestions are very welcome.

Fulvio
Mar 22 '06 #9
On Wed, 2006-03-22 at 00:47 +0100, "Martin v. Löwis" wrote:
Caleb Hattingh wrote:
What does ".readlines ()" do differently that makes it so much slower
than ".read().splitl ines(True)"? To me, the "one obvious way to do it"
is ".readlines ()".
[snip] Anyway, decompressing the entire file at one lets zlib operate at the
highest efficiency.


Then there should be a fast-path on readlines like this:

def readlines(self, sizehint=None):
    if sizehint is None:
        return self.read().splitlines(True)
    # ...

Is it okay? Or is there some embedded problem that I'm not seeing?
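
One way to try that out without editing gzip.py itself would be a small
subclass (the class name is just for illustration):

import gzip

class FastGzipFile(gzip.GzipFile):
    def readlines(self, sizehint=None):
        # Fast path: no sizehint given, so decompress everything at once
        # and split with the C-level splitlines().
        if sizehint is None:
            return self.read().splitlines(True)
        return gzip.GzipFile.readlines(self, sizehint)

lines = FastGzipFile("tmp.txt.gz").readlines()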

--
Felipe.

Mar 22 '06 #10
