
random writing access to a file in Python


I have a 250 Gbyte file (occupies the whole hard drive space) and want
to change only eight bytes in this file at a given offset of appr. 200
Gbyte (all other data in that file should remain unchanged).

How can I do that in Python?

Claudio Grondi
Aug 25 '06 #1
16 Replies


[Claudio Grondi]
I have a 250 Gbyte file (occupies the whole hard drive space)
Then where is Python stored ;-)?
and want to change only eight bytes in this file at a given offset of appr. 200
Gbyte (all other data in that file should remain unchanged).

How can I do that in Python?
Same way you'd do it in C, really:

f = open(PATH_TO_FILE, "r+b")
f.seek(OFFSET_IN_BYTES)  # e.g. appr. 200 Gbyte
f.write(A_STRING_CONTAINING_YOUR_NEW_8_BYTES)
f.close()

This depends on your operating system and file system and platform C
library supporting seeks to very large offsets. For example, Windows
using NTFS does. Try it. Python should complain (raise an exception
during the seek() attempt) if your box refuses to cooperate. Use as
recent a released version of Python as you can.
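A complete, runnable version of this sketch (the demo file name, offset and payload are made up for illustration; a real use would seek to the ~200 GB offset):

```python
import os

def patch_bytes(path, offset, data):
    """Overwrite len(data) bytes at the given offset; the rest stays intact."""
    with open(path, "r+b") as f:   # 'r+b': update in place, no truncation
        f.seek(offset)
        f.write(data)

# small demo file standing in for the 250 GB one
with open("demo.bin", "wb") as f:
    f.write(b"\x00" * 512)

patch_bytes("demo.bin", 200, b"ABCDEFGH")   # eight bytes at offset 200
```

The file keeps its original size; only the eight bytes at the offset change.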
Aug 25 '06 #2

Tim Peters wrote:
> [Claudio Grondi]
> > I have a 250 Gbyte file (occupies the whole hard drive space)
> Then where is Python stored ;-)?
> [...]
> Same way you'd do it in C, really:
>
> f = open(PATH_TO_FILE, "r+b")
> f.seek(OFFSET_IN_BYTES)  # e.g. appr. 200 Gbyte
> f.write(A_STRING_CONTAINING_YOUR_NEW_8_BYTES)
> f.close()
> [...]
Thank you very much for the quick response.

The core of my problem was ... trying to use 'wb' or 'w+b' ... (stupid
me ...)

Claudio Grondi
Aug 25 '06 #3

Dennis Lee Bieber wrote:
On Fri, 25 Aug 2006 16:39:14 +0200, Claudio Grondi
<cl************@freenet.de> declaimed the following in comp.lang.python:

>>The core of my problem was ... trying to use 'wb' or 'w+b' ... (stupid
me ...)


> Ouch... How many times did you have to restore that massive file
> from backup?
I was smart enough to try it first on a very small file, wondering what
was happening. The Python documentation and even a Google search for
'random file access in Python' were not helpful, as there was no example
and no hint available.

The only hint about random file access in Python I found with Google was
the table of contents of the "Python Cookbook" from O'Reilly:
http://www.oreilly.com/catalog/pythoncook2/toc.html
and hints about random reading access.

I was stupid enough to forget about 'r+' (I used it many times in C/C++
a decade ago, but not yet in Python), thinking just too much in the
Pythonic way:

===============================================================
if I want to write, I don't open for reading (plus or not plus)
===============================================================

Actually my file was 'only' 42 GByte, but I wanted to phrase the
question so as to make it impossible to suggest using an intermediate
file.

In the meantime I have chosen a totally new approach, as random writing
to the hard disk seems to actually move the disk head on each seek;
apparently no cache is used to sort the pieces to be written to the
disk, so if there are many of them there is no other choice than to
collect them in memory first before writing them to the file. This makes
straightforward, intuitive programming a bit complicated, because to
work on large files it is necessary to work in chunks and to discard
some processing results when they don't fill the gaps. I suppose I am
still not on the right path, so by the way:

Is there a ready-to-use (free, ideally open source) tool able to sort
the lines (each appr. 20 bytes long) of an XXX GByte large text file
(i.e. in place), taking full advantage of available memory to speed up
the process as much as possible?

Claudio Grondi
Aug 25 '06 #4

===============================================================
if I want to write, I don't open for reading (plus or not plus)
===============================================================
I was thinking the same thing. I know Python is only following
C, which is good, but the flags are confusing. I'll bet that
when fopen() was first written, they had only 'r' and 'w'.
Then, when that was working they thought, "but what if we
want to read and write on the same handle". So they tacked
on the '+' to mean, "Let me do the other thing too".
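A quick demonstration of the difference (file name is made up; this is just the standard open() mode semantics the posts above describe):

```python
path = "modes_demo.bin"
with open(path, "wb") as f:
    f.write(b"0123456789")

with open(path, "r+b") as f:   # update mode: existing bytes preserved
    f.write(b"AB")             # overwrites only the first two bytes
with open(path, "rb") as f:
    after_update = f.read()

with open(path, "wb") as f:    # 'w' truncates the file to zero length first
    f.write(b"AB")
with open(path, "rb") as f:
    after_truncate = f.read()
```

After the 'r+b' write the file is still ten bytes long; after the 'wb' write only the two new bytes remain.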


Aug 25 '06 #5

> Is there a ready to use (free, best Open Source) tool able to sort lines
> (each line appr. 20 bytes long) of a XXX GByte large text file (i.e. in
> place) taking full advantage of available memory to speed up the process
> as much as possible?
Sounds like an occasion to use a merge-sort. The pseudo-code
would be:

break up the file into bite-sized chunks (maybe a couple megs
each).
Sort each of them linewise.
Write them out to intermediate files

Once you have these pieces, open each file

read the first line of each one.

[here] Find the "earliest" of each of those lines according to
your sort-order.
write it to your output file
read the next line from that particular file
return to [here]

There are some optimizations that can be had on this as
well...you can find the "earliest" *and* the "next earliest" of
those lines/files, and just read from the "earliest" file until
the current line of it passes "next earliest"...lather, rinse,
repeat shifting "next earliest" to be the "earliest" and then
find the new "next earliest".

I don't know if the GNU "sort" utility does this, but I've thrown
some rather large files at it and haven't choked it yet.
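The pseudo-code above can be sketched in Python roughly as follows, using heapq.merge for the k-way merge step (the function name and chunk size are illustrative, and it assumes newline-terminated lines):

```python
import heapq
import itertools
import os
import tempfile

def external_sort(infile, outfile, lines_per_chunk=100000):
    """Split infile into sorted runs on disk, then k-way merge into outfile."""
    runs = []
    with open(infile) as f:
        while True:
            chunk = list(itertools.islice(f, lines_per_chunk))
            if not chunk:
                break
            chunk.sort()                        # in-memory sort of one run
            tmp = tempfile.NamedTemporaryFile("w+", delete=False)
            tmp.writelines(chunk)
            tmp.seek(0)                         # rewind for the merge phase
            runs.append(tmp)
    with open(outfile, "w") as out:
        out.writelines(heapq.merge(*runs))      # k-way merge of sorted runs
    for run in runs:
        run.close()
        os.remove(run.name)
```

heapq.merge reads the run files lazily, so the merge phase only keeps one line per run in memory, which is exactly the "read the first line of each" loop described above.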

-tkc

Aug 25 '06 #6

Claudio Grondi wrote:
> I was smart enough to try it first on a very small file wondering what
> was happening. Python documentation and even Google search after 'random
> file access in Python' were not helpful as there was no example and no
> hint available.
one would think that the repeated use of the word "truncate" in the
documentation for open/file would have provided the necessary hints:

"'w' for writing (truncating an existing file)"

"Modes 'r+', 'w+' and 'a+' open the file for updating
(note that 'w+' truncates the file)"

</F>

Aug 25 '06 #7

Claudio Grondi <cl************@freenet.de> writes:
> Is there a ready to use (free, best Open Source) tool able to sort
> lines (each line appr. 20 bytes long) of a XXX GByte large text file
> (i.e. in place) taking full advantage of available memory to speed up
> the process as much as possible?
Try the standard Unix/Linux sort utility. Use the --buffer-size=SIZE
to tell it how much memory to use.
Aug 26 '06 #8

Paul Rubin wrote:
Claudio Grondi <cl************@freenet.de> writes:
>>Is there a ready to use (free, best Open Source) tool able to sort
lines (each line appr. 20 bytes long) of a XXX GByte large text file
(i.e. in place) taking full advantage of available memory to speed up
the process as much as possible?


> Try the standard Unix/Linux sort utility. Use the --buffer-size=SIZE
> to tell it how much memory to use.
I am on Windows and it seems that Windows XP SP2 'sort' can work with
the file, but not without a temporary file plus space for the resulting
file, so triple the size of the file to sort must be provided.
Windows XP 'sort' uses a constant appr. 300 MByte of memory and can't
use 100% of the CPU all the time, probably due to I/O operations over
USB (25 MByte/s is the top data transfer speed I have experienced).
I can't tell yet if it succeeded, as the sorting of the appr. 80 GByte
file with fixed-length records of 20 bytes is still in progress (eleven
CPU hours / 18 wall-clock hours so far).
I am not sure whether my own program would be much faster than the
system's own sort (I haven't yet tried setting the amount of memory to
use in the options to e.g. 1.5 GByte, as the sort help says it is
better not to specify it). My machine is a Pentium 4, 2.8 GHz with 2.0
GByte RAM.
I would be glad to hear whether the sorting time I am currently
experiencing is as expected for this kind of task, or whether there is
still much room for improvement.

Claudio Grondi
Aug 26 '06 #9

Claudio Grondi <cl************@freenet.de> writes:
> > Try the standard Unix/Linux sort utility. Use the --buffer-size=SIZE
> > to tell it how much memory to use.
> I am on Windows and it seems, that Windows XP SP2 'sort' can work with
> the file, but not without a temporary file and space for the resulting
> file, so triple of space of the file to sort must be provided.
Oh, sorry, I didn't see the earlier parts of the thread. Anyway,
depending on the application, it's probably not worth the hassle of
coding something yourself, instead of just throwing more disk space at
the Unix utility. But especially if the fields are fixed size, you
could just mmap the file and then do quicksort on disk. Simplest
would be to just let the OS paging system take care of caching stuff;
if you wanted to get fancy, you could sort in memory once the sorting
regions got below a certain size.

A huge amount of stuff has been written (e.g. about half of Knuth vol
3) about how to sort. Remember too, that traditionally large-scale
sorting was done on serial media like tape drives, so random access
isn't that vital.
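A minimal sketch of the mmap idea for fixed-size records (the record length and function name are illustrative; note that this simple version still pulls all the records into memory for the sort, unlike a true on-disk quicksort, so it only demonstrates accessing the file through a mapping):

```python
import mmap

RECLEN = 20  # fixed record length, as in the thread

def sort_records_inplace(path, reclen=RECLEN):
    """Sort fixed-size records in place through a memory mapping."""
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)     # map the whole file
        n = len(mm) // reclen
        # slice records out of the mapping, sort, write them back in order
        recs = sorted(mm[i * reclen:(i + 1) * reclen] for i in range(n))
        for i, rec in enumerate(recs):
            mm[i * reclen:(i + 1) * reclen] = rec
        mm.flush()                        # push changes back to the file
        mm.close()
```

For a file larger than RAM one would instead let the quicksort partitions work directly on the mapping and rely on the OS paging system, as suggested above.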
Aug 26 '06 #10

Paul Rubin wrote:
Claudio Grondi <cl************@freenet.de> writes:
> [...]

[...] But especially if the fields are fixed size, you could just mmap
the file and then do quicksort on disk. [...]

A huge amount of stuff has been written (e.g. about half of Knuth vol
3) about how to sort. Remember too, that traditionally large-scale
sorting was done on serial media like tape drives, so random access
isn't that vital.
Does it mean that, in the case of very large files,
the size of available memory for the sorting operation (making it
possible to work on larger chunks of data in memory) has less impact on
the actual sorting speed than
the speed of the data transfer from/to the storage device(s)?
So the most effective measure toward shortening the time required for
sorting very large files would be to use faster hard drives (e.g. 10,000
rpm instead of 7,600 rpm) and faster interfaces for the data transfer
(e.g. E-IDE or S-ATA instead of USB), right?

Claudio Grondi
Aug 26 '06 #11

Claudio Grondi <cl************@freenet.de> writes:
> Does it mean, that in case of very large files:
> the size of available memory for the sorting operation (making it
> possible to work on larger chunks of data in memory) has less impact
> on the actual sorting speed than
> the speed of the data transfer from/to storage device(s)
Transfer speed and parallelism matter most, if the CPU can keep up.
Parallelism means you want multiple drives that you can do i/o to
simultaneously without having to seek. Random access helps simulate
this, but only somewhat. Large database installations always want
lots of parallel disks for similar reasons to this.

The basic method of sorting large files has traditionally been:

1) Read the file in pieces p(1),p(2),...,p(n) where each piece is as
big as will fit in memory. Sort each piece with your favorite
in-memory sort, and write out sorted pieces that we'll designate
r(1,1),...r(1,n). The r(a,b)'s are called "runs". Use multiple
output devices (tape or disk drives) to write out the runs, i.e. each
output tape will contain its own bunch of runs.

2) Read back a bunch of the runs in parallel, i.e. say you have
written r(1,1),r(1,2),...,r(1,5) to five separate tape drives. Read
them back simultaneously and merge them (requires very little external
memory) into a new "second level" run called r(2,1). Similarly merge
r(1,6),...,r(1,10) to the second level run r(2,2), and so forth.

3) Merge the second level runs into third level runs, and so forth
recursively until there's only one giant run. That is the sorted
output.

The amount of memory determines the size of the initial "first level"
runs, which determines how many recursive merges are needed, so more
memory helps.

The number of levels of recursion also depends on the "merge order",
i.e. the number of runs merged at each merge step. You really want
each merge to read from M physical input devices that are separate
from the output device(s).

There are obviously a lot of optimizations that the above description
is missing, and there's a lot of strategy about how to organize the
merges, e.g. if you have 8 drives, do you use 4 for input and 4 for
output, or 5 in and 3 out, or what?

Is this some kind of production deployment problem you're working on,
or do you just have a big pile of data you need to sort once? If you
need to deploy across multiple machines (i.e. you can't just buy a
giant machine if you need to do this sorting frequently), then I'd
suggest reading up on the subject, starting with Knuth vol 3.
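A back-of-envelope helper for the description above: for R initial runs merged M at a time, the number of merge levels is ceil(log_M(R)). The function name and the example figures below are illustrative, not from the thread:

```python
import math

def merge_levels(total_records, records_per_run, merge_order):
    """Number of recursive merge levels for an external merge sort."""
    runs = math.ceil(total_records / records_per_run)  # first-level runs
    if runs <= 1:
        return 0            # everything fit in one in-memory run
    return math.ceil(math.log(runs, merge_order))

# e.g. ~4.3e9 records, 1e8 records per in-memory run, 16-way merge:
# 43 first-level runs, so 2 merge levels are enough.
```

This makes the trade-off in the text concrete: more memory means fewer, larger runs, and a higher merge order means fewer levels, at the cost of more simultaneous input streams.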
Aug 27 '06 #12

Paul Rubin wrote:
Claudio Grondi <cl************@freenet.de> writes:
> [...]

[...]
There are obviously a lot of optimizations that the above description
is missing, and there's a lot of strategy about how to organize the
merges, e.g. if you have 8 drives, do you use 4 for input and 4 for
output, or 5 in and 3 out, or what?
The Windows XP SP 2 '/sort' (sorting four Gigs of 20-byte records took
12 CPU hours and 18 wall-clock hours) has, from what I could observe in
the task manager, done the job in only two runs of 'copying':
1. first to a temporary file,
2. then to the final file.
In the second run the size of the chunks processed between
reading/writing was on the order of tens of Megabytes, where in the
first run it was on the order of hundreds of Megabytes. I suppose that
the procedure behind it was as follows:
1. decide the size of the chunks to split the file into, and choose the
amount of memory required
2. process the chunks, sorting each in memory and writing it to the
temporary file
3. decide the size of the buffers for merge-sorting the chunks into the
final file, so that they all fit into the 300 MByte of used memory
4. open as many 'pipes' as there were chunks, filling all of the pipe
buffers up whenever one of them runs out of data
5. continuously switch between reading from and writing to the hard
disk, writing the results of the merge sort to the final file whenever
one of the buffers runs out of data, and then filling up all of the
buffers for the next cycle (concluded from the observed scattered
reading and smooth writing)
> Is this some kind of production deployment problem you're working on,
> or do you just have a big pile of data you need to sort once?
The latter.
One of the intermediate reasons for doing it is an attempt to get a
better intuitive understanding of the hardware and software limits of
brute-force approaches to solving problems in the area of AI, language
processing and data compression.
> If you need to deploy across multiple machines (i.e. you can't just
> buy a giant machine if you need to do this sorting frequently), then
> I'd suggest reading up on the subject, starting with Knuth vol 3.
Thank you very much for your detailed response.

Claudio Grondi
Aug 27 '06 #13

Claudio Grondi <cl************@freenet.de> writes:
> The Windows XP SP 2 '/sort' (sorting of four Gigs of 20 byte records
> took 12 CPU and 18 usual hours) has, from what I could observe on the
> task manager, done the job in only two runs of 'copying':
That is terrible; on a reasonably fast machine these days it should
take just a few minutes.
> > Is this some kind of production deployment problem you're working on,
> > or do you just have a big pile of data you need to sort once?
> The latter.
> One of the intermediate reasons behind doing it, is an attempt to get
> more and better intuitive understanding what are the hardware and
> software limits of brute force based approaches to solving problems in
> the area of AI, language processing and data compression.
This makes no sense; the methods you want are from the ancient days of
"data processing", nothing to do with AI. Remember that the mainframe
computers of the 1960's that this stuff was developed on, had less
computing power and memory than a modern mobile phone. But phone
companies, tax agencies, and so forth, still managed to run these big
sorting tasks on those mainframes in order to send people their phone
bills, identify the tax cheaters, etc. There is a rich and wonderful
history to it.
Aug 27 '06 #14

Paul Rubin wrote:
Claudio Grondi <cl************@freenet.de> writes:
>>The Windows XP SP 2 '/sort' (sorting of four Gigs of 20 byte records
took 12 CPU and 18 usual hours) has, from what I could observe on the
task manager, done the job in only two runs of 'copying' :


> That is terrible; on a reasonably fast machine these days it should
> take just a few minutes.
Ok, I see - the misunderstanding is that there were 4,294,967,296
records, each 20 bytes long, which makes the actual file 85,899,345,920
bytes large (I just used 'Gigs' for the number of records, not the size
of the file).
Still not an acceptable sorting time?

Claudio Grondi
Aug 27 '06 #15

Claudio Grondi <cl************@freenet.de> writes:
> > The Windows XP SP 2 '/sort' (sorting of four Gigs of 20 byte records
> > took 12 CPU and 18 usual hours)....
> Ok, I see - the misunderstanding is, that there were 4.294.967.296
> records each 20 bytes long, what makes the actual file 85.899.345.920
> bytes large (I just used 'Gigs' for telling the number of records, not
> the size of the file).
> Still not acceptable sorting time?
I think that's not so bad, though probably still not optimal. 85 GB
divided by 18 hours is 1.3 MB/sec, which means if the program is
reading the file 8 times, it's getting 10 MB/sec through the Windows
file system, which is fairly reasonable throughput.
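Checking that arithmetic (the pass count of 8 is the assumption stated above, not a measured figure):

```python
seconds = 18 * 3600                  # 18 wall-clock hours
mb_per_sec = 85e9 / seconds / 1e6    # overall rate for the 85 GB file
passes = 8                           # assumed number of times the data is read
effective_mb_per_sec = mb_per_sec * passes
```

That gives roughly 1.3 MB/sec overall, or about 10 MB/sec through the file system if the data really is read 8 times.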

If you know something about the distribution of the data (e.g. the
records are random 20-byte hashes) you might be able to sort it in
essentially linear time (radix sorting). But even with a general
purpose algorithm, if you have a few hundred MB of ram and some
scratch disk space to work with, you should be able to sort that much
data in much less than 18 hours.
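A tiny sketch of the radix/bucket idea for random fixed-size records: partition on the leading byte, then sort each small bucket independently (the function name is illustrative; a real 85 GB run would write the buckets out to scratch files rather than hold them in memory):

```python
from collections import defaultdict

def bucket_sort(records):
    """Partition byte-string records by leading byte, sort each bucket."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[0]].append(rec)       # first byte chooses the bucket
    out = []
    for key in sorted(buckets):           # emit buckets in key order
        out.extend(sorted(buckets[key]))  # each bucket is small and fast
    return out
```

Because the first byte dominates the lexicographic order, concatenating the sorted buckets in key order yields the fully sorted sequence, and for near-uniform data (like hashes) each bucket is about 1/256 of the input.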

But if you only had to do it once and it's finished now, why do you
still care how long it took?
Aug 27 '06 #16

Dennis Lee Bieber wrote:
On 27 Aug 2006 15:06:07 -0700, Paul Rubin <http://ph****@NOSPAM.invalid>
declaimed the following in comp.lang.python:
> > I think that's not so bad, though probably still not optimal. 85 GB
> > divided by 18 hours is 1.3 MB/sec, which means if the program is
> > reading the file 8 times, it's getting 10 MB/sec through the Windows
> > file system, which is fairly reasonable throughput.

> Especially if, as implied, this was on a USB drive <G>
Don't underestimate external USB drives (I have heard there are great
differences between them depending on the controller used).

If the requested file is not scattered over the drive due to
fragmentation and an appropriate reading procedure is used, I have seen
(e.g. just yesterday with the 80 Gig file) a constant throughput of 28
MBytes/second, which compared to the maximum of 40 MBytes/second I have
seen on E-IDE is not as bad as your posting might suggest.

Thanks to Paul Rubin for the hint on radix sorting (even if it came a
bit too late). I had already used this kind of approach yesterday in
another context on the file I was sorting, and can therefore estimate
quite well the total time such a sort would take on my system using this
method: it will take no more than 3 hours (a very big improvement
compared to 18 hours). I suppose the problem with the Windows XP 'sort'
is that it can't take advantage of the constant record size, as there is
no option available which could be used to pass this hint to it.

"But if you only had to do it once and it's finished now, why do you
still care how long it took?"
Because one of my usual goals in doing things like this is to get some
feeling for them, gaining experience that makes me capable, in similar
future cases, of improving by _intuitive_ selection of the best path to
the solution known to me (learning by doing).
It is one thing to _know_ that there are various different sorting
algorithms and that it is necessary to choose the right one to speed up
sorting, and quite another to actually _experience_ having to wait 18
hours for your results while the machine is so busy that it is hard to
use it for other tasks at the same time. If the sorting took less than
one hour I would probably never have made the effort to give it some
serious thought.

Claudio Grondi
Aug 28 '06 #17
