Hi!
I have a really long binary file that I want to read.
The way I am doing it now is:
for i in xrange(N):   # N is about 10,000,000
    time = struct.unpack('=HHHH', infile.read(8))
    # do something
    tdc = struct.unpack('=LiLiLiLi', self.lmf.read(32))
    # do something
Each loop takes about 0.2 ms on my computer, which means the whole for loop
takes about 2000 seconds.
I would like it to run faster.
Do you have any suggestions?
Thank you very much.
On Mon, 30 Apr 2007 00:45:22 -0700, OhKyu Yoon wrote:
> Hi!
> I have a really long binary file that I want to read.
> The way I am doing it now is:
>
> for i in xrange(N):   # N is about 10,000,000
>     time = struct.unpack('=HHHH', infile.read(8))
>     # do something
>     tdc = struct.unpack('=LiLiLiLi', self.lmf.read(32))
I assume that is supposed to be infile.read()
>     # do something
>
> Each loop takes about 0.2 ms on my computer, which means the whole for loop
> takes about 2000 seconds.
You're reading 400 million bytes, or 400MB, in about half an hour. Whether
that's fast or slow depends on what the "do something" lines are doing.
> I would like it to run faster.
> Do you have any suggestions?
Disk I/O is slow, so don't read from files in tiny little chunks. Read a
bunch of records into memory, then process them.
# UNTESTED!
rsize = 8 + 32   # record size
for i in xrange(N // 1000):
    buffer = infile.read(rsize * 1000)   # read 1000 records at once
    for j in xrange(1000):               # process each record
        offset = j * rsize
        time = struct.unpack('=HHHH', buffer[offset:offset+8])
        # do something
        tdc = struct.unpack('=LiLiLiLi', buffer[offset+8:offset+rsize])
        # do something
(Now I'm just waiting for somebody to tell me that file.read() already
buffers reads...)
--
Steven D'Aprano
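A further tweak along the same lines, offered only as an untested sketch (it is not in the post above): Python 2.5's struct.Struct lets the 40-byte format be compiled once, so the two unpacks per record become a single call on top of the chunked reads, and unpack_from avoids building a slice for every record.

import struct

# Untested sketch: one precompiled 40-byte format instead of two unpacks
# per record ('=HHHH' followed by '=LiLiLiLi'); needs Python 2.5+.
record = struct.Struct('=HHHHLiLiLiLi')   # record.size == 40
CHUNK = 1000                              # records per read, as above

def process(infile, N):
    for i in xrange(N // CHUNK):
        buf = infile.read(record.size * CHUNK)
        for j in xrange(CHUNK):
            fields = record.unpack_from(buf, j * record.size)
            time = fields[:4]   # the '=HHHH' values
            tdc = fields[4:]    # the '=LiLiLiLi' values
            # do something
    # (any leftover N % CHUNK records would need one final, smaller read)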
On Apr 30, 9:41 am, Steven D'Aprano <s...@REMOVEME.cybersource.com.au>
wrote:
> [...]
> (Now I'm just waiting for somebody to tell me that file.read() already
> buffers reads...)
I think the file.read() already buffers reads... :)
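It is, as far as it goes: CPython 2.x file objects sit on top of C stdio, so read() is buffered, and the buffer size can even be set with open()'s optional third argument. A minimal sketch (not from the thread; the file name is invented for illustration):

fn = "data.bin"                            # hypothetical file, for illustration only
open(fn, "wb").write("\0" * 40000)         # small dummy file so the sketch runs
f = open(fn, "rb", 1024 * 1024)            # third argument sets the stdio buffer size
data = f.read(40)                          # still goes through the ~1 MB buffer
f.close()

Buffering saves system calls, though; it does not remove the cost of ten million Python-level read()/unpack() calls, which is what the chunked version above avoids.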
On Tue, 01 May 2007 05:22:49 -0300, eC <er*********@gmail.com> wrote:
>> [...]
>> (Now I'm just waiting for somebody to tell me that file.read() already
>> buffers reads...)
> I think the file.read() already buffers reads... :)
Now we need someone to actually measure it, to confirm the expected
behavior... Done.
--- begin code ---
import struct, timeit, os

fn = r"c:\temp\delete.me"
fsize = 1000000

if not os.path.isfile(fn):
    f = open(fn, "wb")
    f.write("\0" * fsize)
    f.close()
os.system("sync")

def smallreads(fn):
    rsize = 40
    N = fsize // rsize
    f = open(fn, "rb")
    for i in xrange(N):
        time = struct.unpack('=HHHH', f.read(8))
        tdc = struct.unpack('=LiLiLiLi', f.read(32))
    f.close()

def bigreads(fn):
    rsize = 40
    N = fsize // rsize
    f = open(fn, "rb")
    for i in xrange(N // 1000):
        buffer = f.read(rsize * 1000)   # read 1000 records at once
        for j in xrange(1000):          # process each record
            offset = j * rsize
            time = struct.unpack('=HHHH', buffer[offset:offset+8])
            tdc = struct.unpack('=LiLiLiLi', buffer[offset+8:offset+rsize])
    f.close()

print "smallreads", timeit.Timer("smallreads(fn)",
    "from __main__ import fn,smallreads,fsize").repeat(3, 1)
print "bigreads", timeit.Timer("bigreads(fn)",
    "from __main__ import fn,bigreads,fsize").repeat(3, 1)
--- end code ---
Output:
smallreads [4.2534193777646663, 4.126013885559789, 4.2389176672125458]
bigreads [1.2897319939456011, 1.3076018578892405, 1.2703250635695138]
So in this sample case, reading in big chunks is about 3 times faster than
reading many tiny pieces.
--
Gabriel Genellina
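For completeness, a different route that nobody in the thread took (a sketch only: NumPy is assumed to be installed, and the file name and field names are invented for illustration): numpy.fromfile can load the whole file into a structured array in one call, with no per-record Python loop at all.

import numpy as np

# Record layout matching '=HHHH' + '=LiLiLiLi' in native byte order:
# four uint16 values, then four (uint32, int32) pairs; 40 bytes per record.
rec_dtype = np.dtype([('time', np.uint16, (4,)),
                      ('tdc', [('hi', np.uint32), ('lo', np.int32)], (4,))])
assert rec_dtype.itemsize == 40

open("data.bin", "wb").write("\0" * (40 * 1000))     # dummy file so the sketch runs
records = np.fromfile("data.bin", dtype=rec_dtype)   # (count= plus an open file object
                                                     #  can read a big file in slices)
times = records['time']    # (n, 4) uint16 array of the HHHH fields
tdcs = records['tdc']      # (n, 4) structured array of (hi, lo) pairs

Whether this is worth the extra dependency depends on what the "do something" steps are; if they can be written as array operations, the per-record loop disappears entirely.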
Wow, thank you all!
"Gabriel Genellina" <ga*******@yahoo.com.arwrote in message
news:op***************@furufufa-ec0e13.cpe.telecentro.com.ar...
> [...]