Unexpected timing results with file I/O

Steven D'Aprano

After reading an earlier thread about opening and closing lots of files,
I thought I'd do a little experiment.

Suppose you have a whole lot of files, and you need to open each one,
append a string, then close them. There's two obvious ways to do it:
group your code by file, or group your code by procedure.

# Method one: grouped by file.
for each file:
    open the file, append the string, then close it

# Method two: grouped by procedure.
for each file:
    open the file
for each open file:
    append the string
for each open file:
    close the file

If you have N files, both methods make the same number of I/O calls: N
opens, N writes, N closes. Which is faster?
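
For readers who want something runnable, the two groupings can be sketched in Python roughly as follows. This is only an illustration: the function names and the `make_names` helper are hypothetical and not part of the original post. Note that method two keeps every file open at once, so it is bounded by the per-process open-file limit.

```python
import os
import tempfile


def method_one(names, text="xyz\n"):
    """Grouped by file: open, append, and close each file in turn."""
    for name in names:
        with open(name, "a") as fp:
            fp.write(text)


def method_two(names, text="xyz\n"):
    """Grouped by procedure: open all files, write to all, then close all."""
    files = [open(name, "a") for name in names]
    try:
        for fp in files:
            fp.write(text)
    finally:
        # Close every file even if one of the writes raised.
        for fp in files:
            fp.close()


def make_names(directory, count):
    """Hypothetical helper: build `count` file paths inside `directory`."""
    return [os.path.join(directory, "afile%d" % n) for n in range(count)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        names = make_names(tmp, 10)
        method_one(names)
        method_two(names)
        print("appended twice to %d files" % len(names))
```

Both functions perform the same N opens, N writes, and N closes; only the order of the calls differs.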

Intuitively, the first method has *got* to be faster, right? It's got one
loop instead of three and it doesn't build an intermediate list of open
file objects. It's so *obviously* going to be faster that it is hardly
worth bothering to check it with timeit, right?

Well, I wouldn't be writing unless that intuitive result was wrong. So
here's my test results:

Method 1:

>>> import timeit
>>> names = ['afile' + str(n) for n in range(1000)]
>>> T = timeit.Timer('''for name in names:
...     fp = open(name, 'a'); fp.write('xyz\\n'); fp.close()
... ''', 'from __main__ import names')
>>> min(T.repeat(6, 500))
17.391216039657593

Method 2:

>>> for name in names:  # reset the files to an empty state.
...     fp = open(name, 'w'); fp.close()
...
>>> T = timeit.Timer('''files = [open(name, 'a') for name in names]
... for fp in files:
...     fp.write('xyz\\n')
... for fp in files:
...     fp.close()
... ''', '''from __main__ import names''')
>>> min(T.repeat(6, 500))
16.823362112045288

Surprisingly, Method 2 is a smidgen faster, by about half a second over
500,000 open-write-close cycles. It's not much faster, but it's
consistent, over many tests, changing many of the parameters (e.g. the
number of files, the number of runs per timeit test, etc.).
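
A parameterised version of this kind of comparison might look like the sketch below, so that the number of files and the number of runs can be varied easily. It assumes Python 3 (where `timeit.repeat` accepts a callable) and writes into a temporary directory; the function names and default values are illustrative choices, not the ones used for the figures above.

```python
import os
import tempfile
import timeit


def append_by_file(names):
    """Method 1: open, write, and close one file at a time."""
    for name in names:
        fp = open(name, "a")
        fp.write("xyz\n")
        fp.close()


def append_by_procedure(names):
    """Method 2: open every file, then write to each, then close each."""
    files = [open(name, "a") for name in names]
    for fp in files:
        fp.write("xyz\n")
    for fp in files:
        fp.close()


def reset(names):
    """Truncate every file so each method starts from empty files."""
    for name in names:
        open(name, "w").close()


def best_time(func, names, repeat, number):
    """Return the fastest of `repeat` trials of `number` calls each."""
    reset(names)
    return min(timeit.repeat(lambda: func(names), repeat=repeat, number=number))


def compare(file_count=100, repeat=3, number=5):
    """Time both methods on `file_count` files in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        names = [os.path.join(tmp, "afile%d" % n) for n in range(file_count)]
        return {
            "Method 1": best_time(append_by_file, names, repeat, number),
            "Method 2": best_time(append_by_procedure, names, repeat, number),
        }


if __name__ == "__main__":
    for label, seconds in sorted(compare().items()):
        print("%s: %.4f s" % (label, seconds))
```

Taking the minimum of several repeats, as in the session above, reduces the influence of other activity on the machine, but results will still depend on the filesystem and hardware.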

I'm using Linux and Python 2.5.

So, what's going on? Can anyone explain why the code which does more work
takes less time?

--
Steven

Feb 4 '08 #1


Christian Heimes

Steven D'Aprano wrote:

> So, what's going on? Can anyone explain why the code which does more work
> takes less time?

Short answer: CPU and RAM are much faster than hard disks.

The three loops and the creation of a list costs only a few CPU cycles
compared to flushing the new data to disk.
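
One rough way to see how small the extra bookkeeping is: run Method 2's control flow (a list plus three loops) against objects that do no I/O at all. The sketch below is an assumption-laden illustration; `FakeFile` is a hypothetical stand-in, not a real file type, and the numbers it produces only indicate the order of magnitude of the pure-Python overhead.

```python
import timeit


class FakeFile:
    """Hypothetical stand-in for a file: its methods do nothing."""

    def write(self, text):
        pass

    def close(self):
        pass


def bookkeeping_only(count):
    """Build a list and loop over it twice more, with no real I/O."""
    files = [FakeFile() for _ in range(count)]
    for fp in files:
        fp.write("xyz\n")
    for fp in files:
        fp.close()
    return len(files)


def overhead_seconds(count=1000, number=100):
    """Average seconds per pass over `count` fake files."""
    total = timeit.timeit(lambda: bookkeeping_only(count), number=number)
    return total / number


if __name__ == "__main__":
    print("per pass over 1000 fake files: %.6f s" % overhead_seconds())
```

Comparing this figure with the per-pass time of the real file version gives a feel for how much of the total is loop overhead and how much is I/O.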

Christian

Feb 4 '08 #2

Marc 'BlackJack' Rintsch

On Mon, 04 Feb 2008 15:17:18 +0000, Steven D'Aprano wrote:

> # Method one: grouped by file.
> for each file:
>     open the file, append the string, then close it
>
> # Method two: grouped by procedure.
> for each file:
>     open the file
> for each open file:
>     append the string
> for each open file:
>     close the file
>
> Method 1:
>
> 17.391216039657593
>
> Method 2:
>
> 16.823362112045288
>
> Surprisingly, Method 2 is a smidgen faster, by about half a second over
> 500,000 open-write-close cycles. It's not much faster, but it's
> consistent, over many tests, changing many of the parameters (e.g. the
> number of files, the number of runs per timeit test, etc.).
>
> I'm using Linux and Python 2.5.
>
> So, what's going on? Can anyone explain why the code which does more work
> takes less time?

Can't confirm this (Linux, Python 2.5):

Method 1: 15.380897998809814
Method 2: 18.085366010665894

I guess it's really all about the disk IO as my system monitor applet
shows that almost all of the time is spent in the kernel and very little
in user space.
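
The same split can be checked from within Python instead of a monitor applet. The sketch below uses `os.times()` to compare user and system CPU time around a workload; the helper names are made up for illustration, and system CPU time does not include time spent blocked waiting for the disk, so it is only a partial picture.

```python
import os
import tempfile


def cpu_split(func):
    """Run `func` and return the (user, system) CPU seconds it consumed."""
    before = os.times()
    func()
    after = os.times()
    return after.user - before.user, after.system - before.system


def append_to_files(directory, count=200):
    """Open, append to, and close `count` files inside `directory`."""
    for n in range(count):
        with open(os.path.join(directory, "afile%d" % n), "a") as fp:
            fp.write("xyz\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        user, system = cpu_split(lambda: append_to_files(tmp))
        print("user: %.3f s, system: %.3f s" % (user, system))
```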

Ciao,
Marc 'BlackJack' Rintsch

Feb 4 '08 #3

rdahlstrom

On Feb 4, 10:17 am, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:

> After reading an earlier thread about opening and closing lots of files,
> I thought I'd do a little experiment.
>
> Suppose you have a whole lot of files, and you need to open each one,
> append a string, then close them. There's two obvious ways to do it:
> group your code by file, or group your code by procedure.
>
> # Method one: grouped by file.
> for each file:
>     open the file, append the string, then close it
>
> # Method two: grouped by procedure.
> for each file:
>     open the file
> for each open file:
>     append the string
> for each open file:
>     close the file
>
> If you have N files, both methods make the same number of I/O calls: N
> opens, N writes, N closes. Which is faster?
>
> Intuitively, the first method has *got* to be faster, right? It's got one
> loop instead of three and it doesn't build an intermediate list of open
> file objects. It's so *obviously* going to be faster that it is hardly
> worth bothering to check it with timeit, right?
>
> Well, I wouldn't be writing unless that intuitive result was wrong. So
> here's my test results:
>
> Method 1:
>
> >>> import timeit
> >>> names = ['afile' + str(n) for n in range(1000)]
> >>> T = timeit.Timer('''for name in names:
> ...     fp = open(name, 'a'); fp.write('xyz\\n'); fp.close()
> ... ''', 'from __main__ import names')
> >>> min(T.repeat(6, 500))
> 17.391216039657593
>
> Method 2:
>
> >>> for name in names:  # reset the files to an empty state.
> ...     fp = open(name, 'w'); fp.close()
> ...
> >>> T = timeit.Timer('''files = [open(name, 'a') for name in names]
> ... for fp in files:
> ...     fp.write('xyz\\n')
> ... for fp in files:
> ...     fp.close()
> ... ''', '''from __main__ import names''')
> >>> min(T.repeat(6, 500))
> 16.823362112045288
>
> Surprisingly, Method 2 is a smidgen faster, by about half a second over
> 500,000 open-write-close cycles. It's not much faster, but it's
> consistent, over many tests, changing many of the parameters (e.g. the
> number of files, the number of runs per timeit test, etc.).
>
> I'm using Linux and Python 2.5.
>
> So, what's going on? Can anyone explain why the code which does more work
> takes less time?
>
> --
> Steven

The code that does more work takes more time. The second one does
quite a bit less work. Think of it like this:

You have 500,000 people to fit through a door. Here are your options:

1. For each person, open the door, walk through the door, then close
the door.
2. Open the door, allow everyone to walk through, then close the
door.

Which one would you say would be a more efficient way to fit 500,000
people through the door?

Feb 4 '08 #4

Gabriel Genellina

On Mon, 04 Feb 2008 15:53:11 -0200, rdahlstrom <ro*************@gmail.com> wrote:

> On Feb 4, 10:17 am, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
>
>> Suppose you have a whole lot of files, and you need to open each one,
>> append a string, then close them. There's two obvious ways to do it:
>> group your code by file, or group your code by procedure.
>>
>> # Method one: grouped by file.
>> for each file:
>>     open the file, append the string, then close it
>>
>> # Method two: grouped by procedure.
>> for each file:
>>     open the file
>> for each open file:
>>     append the string
>> for each open file:
>>     close the file
>>
>> If you have N files, both methods make the same number of I/O calls: N
>> opens, N writes, N closes. Which is faster?
>
> The code that does more work takes more time. The second one does
> quite a bit less work. Think of it like this:
>
> You have 500,000 people to fit through a door. Here are your options:
>
> 1. For each person, open the door, walk through the door, then close
>    the door.
> 2. Open the door, allow everyone to walk through, then close the
>    door.
>
> Which one would you say would be a more efficient way to fit 500,000
> people through the door?

Mmmm, no, the second one should be:

2. Create 500,000 doors and open them.
Make each person enter the room -one at a time- using its own door.
Close each of the 500,000 doors.

--
Gabriel Genellina

Feb 4 '08 #5

Steven D'Aprano

On Mon, 04 Feb 2008 17:08:02 +0000, Marc 'BlackJack' Rintsch wrote:

>> Surprisingly, Method 2 is a smidgen faster, by about half a second over
>> 500,000 open-write-close cycles. It's not much faster, but it's
>> consistent, over many tests, changing many of the parameters (e.g. the
>> number of files, the number of runs per timeit test, etc.).
>>
>> I'm using Linux and Python 2.5.
>>
>> So, what's going on? Can anyone explain why the code which does more
>> work takes less time?
>
> Can't confirm this (Linux, Python 2.5):
>
> Method 1: 15.380897998809814
> Method 2: 18.085366010665894

Hmmm... does your system use software RAID? Mine does. I wonder if that's
a relevant factor?

> I guess it's really all about the disk IO as my system monitor applet
> shows that almost all of the time is spent in the kernel and very little
> in user space.

I wouldn't be surprised if it was something to do with the OS caching
writes to disk. And saying that is really just me doing a lot of
hand-waving and saying "it's magic of what we know naught".
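
One way to probe the caching guess, offered only as a sketch: time the same appends with and without forcing each write out of the OS cache. The code below assumes Python 3 and uses `os.fsync`, which asks the operating system to push a file's data to the storage device; the `append` and `elapsed` helpers are hypothetical names, and how large the difference is will depend heavily on the filesystem and hardware.

```python
import os
import tempfile
import time


def append(name, text="xyz\n", sync=False):
    """Append `text`; with sync=True, also force it out of the OS cache."""
    with open(name, "a") as fp:
        fp.write(text)
        if sync:
            fp.flush()             # move data from Python's buffer to the OS
            os.fsync(fp.fileno())  # ask the OS to write it to the device


def elapsed(names, sync):
    """Wall-clock seconds to append once to every file in `names`."""
    start = time.perf_counter()
    for name in names:
        append(name, sync=sync)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        names = [os.path.join(tmp, "afile%d" % n) for n in range(100)]
        print("cached: %.4f s" % elapsed(names, sync=False))
        print("synced: %.4f s" % elapsed(names, sync=True))
```

If the synced run is much slower, that suggests the unsynced timings mostly measure writes into the cache and not writes to the disk itself.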

--
Steven

Feb 4 '08 #6
