Creating/modifying sparse files on Linux

draghuram

Hi,

Is there any special support for sparse file handling in Python? My
initial search (not a thorough one) didn't bring up much. I wrote
the following piece of code:

options.size = 6442450944
options.ranges = ["4096,1024","30000,314572800"]
fd = open("testfile", "w")
fd.seek(options.size-1)
fd.write("a")
for drange in options.ranges:
    off = int(drange.split(",")[0])
    len = int(drange.split(",")[1])
    print "off =", off, " len =", len
    fd.seek(off)
    for x in range(len):
        fd.write("a")

fd.close()

This piece of code takes a very long time, and in fact I had to kill it
because the Linux system started doing a lot of swapping. Am I doing
something wrong here? Is there a better way to create/modify sparse files?

Thanks,
Raghu.

Aug 17 '05 #1

Trent Mick

[dr*******@gmail.com wrote]

> Hi,
>
> Is there any special support for sparse file handling in Python? My
> initial search (not a thorough one) didn't bring up much. I wrote
> the following piece of code:
>
> options.size = 6442450944
> options.ranges = ["4096,1024","30000,314572800"]
> fd = open("testfile", "w")
> fd.seek(options.size-1)
> fd.write("a")
> for drange in options.ranges:
>     off = int(drange.split(",")[0])
>     len = int(drange.split(",")[1])
>     print "off =", off, " len =", len
>     fd.seek(off)
>     for x in range(len):
>         fd.write("a")
>
> fd.close()
>
> This piece of code takes a very long time, and in fact I had to kill it
> because the Linux system started doing a lot of swapping. Am I doing
> something wrong here? Is there a better way to create/modify sparse files?

test_largefile.py in the Python test suite does this kind of thing and
doesn't take very long for me to run on Linux (SuSE 9.0 box).

Trent

--
Trent Mick
Tr****@ActiveState.com

Aug 17 '05 #2

Marc 'BlackJack' Rintsch

In <11**********************@f14g2000cwb.googlegroups.com>,
dr*******@gmail.com wrote:

> options.size = 6442450944
> options.ranges = ["4096,1024","30000,314572800"]
> fd = open("testfile", "w")
> fd.seek(options.size-1)
> fd.write("a")
> for drange in options.ranges:
>     off = int(drange.split(",")[0])
>     len = int(drange.split(",")[1])
>     print "off =", off, " len =", len
>     fd.seek(off)
>     for x in range(len):
>         fd.write("a")
>
> fd.close()
>
> This piece of code takes a very long time, and in fact I had to kill it
> because the Linux system started doing a lot of swapping. Am I doing
> something wrong here? Is there a better way to create/modify sparse files?

`range(len)` creates a list of size `len` *in memory* so you are trying to
build a list with 314,572,800 numbers. That seems to eat up all your RAM
and causes the swapping.

You can use `xrange(len)` instead which uses a constant amount of memory.
But be prepared to wait some time because now you are writing 314,572,800
characters *one by one* into the file. It would be faster to write larger
strings in each step.

Ciao,
Marc 'BlackJack' Rintsch

Aug 17 '05 #3
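
The suggestion in post #3 to write larger strings in each step might look like the following in Python 3. This is a minimal sketch, not code from the thread: `write_fill`, `CHUNK_SIZE`, and the 1 MiB chunk size are illustrative assumptions, and the file is assumed to be opened in binary mode.

```python
# Python 3 sketch; all names here are hypothetical, chosen for illustration.
CHUNK_SIZE = 1024 * 1024  # bytes handed to each write() call


def write_fill(f, offset, length, fill=b"a", chunk_size=CHUNK_SIZE):
    """Write `length` copies of the single byte `fill` starting at `offset`."""
    chunk = fill * chunk_size          # built once, reused for every full chunk
    f.seek(offset)
    remaining = length
    while remaining > 0:
        n = min(remaining, chunk_size)
        f.write(chunk[:n])             # the final chunk may be shorter
        remaining -= n
```

With a file opened as `open("testfile", "wb")`, a call such as `write_fill(f, 30000, 314572800)` would issue about 300 writes instead of 314,572,800.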

Terry Reedy

<dr*******@gmail.com> wrote in message
news:11**********************@f14g2000cwb.googlegroups.com...

> Is there any special support for sparse file handling in python?

Since I have not heard of such in several years, I suspect not. CPython,
normally compiled, uses the standard C stdio lib. If your system+C has a
sparseIO lib, you would probably have to compile specially to use it.

> options.size = 6442450944
> options.ranges = ["4096,1024","30000,314572800"]

options.ranges = [(4096,1024),(30000,314572800)] # makes below nicer

> fd = open("testfile", "w")
> fd.seek(options.size-1)
> fd.write("a")
> for drange in options.ranges:
> off = int(drange.split(",")[0])
> len = int(drange.split(",")[1])

off,len = map(int, drange.split(",")) # or
off,len = [int(s) for s in drange.split(",")] # or, for tuples as suggested above:
off,len = drange

> print "off =", off, " len =", len
> fd.seek(off)
> for x in range(len):

If I read the above right, the 2nd len is 300,000,000+, making the space
needed for the range list a few gigabytes. I suspect this is where you
started thrashing ;-). Instead:

for x in xrange(len): # this is what xrange is for ;-)

> fd.write("a")

Without indent, this is a syntax error, so if your code ran at all, this
cannot be an exact copy. Even with the xrange fix, 300,000,000 writes will be
slow. I would expect that a real application should create or accumulate
chunks larger than single chars.

> fd.close()
>
> This piece of code takes very long time and in fact I had to kill it as
> the linux system started doing lot of swapping. Am I doing something
> wrong here?

See above.

> Is there a better way to create/modify sparse files?

Unless you can access built-in facilities, create your own mapping index.

Terry J. Reedy

Aug 17 '05 #4
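
The range-parsing variants in post #4 can be collected into one helper. The sketch below assumes Python 3, where `range` is already lazy in the way Python 2's `xrange` is; `parse_ranges`, `offset`, and `length` are illustrative names that also avoid reusing the built-in name `len`.

```python
def parse_ranges(specs):
    """Turn ["offset,length", ...] strings into [(offset, length), ...] tuples."""
    ranges = []
    for spec in specs:
        offset, length = (int(part) for part in spec.split(","))
        ranges.append((offset, length))
    return ranges


# Tuple unpacking in the loop header replaces the two split() calls.
for offset, length in parse_ranges(["4096,1024", "30000,314572800"]):
    print("off =", offset, " len =", length)
```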

draghuram

Thanks for the info on xrange. Writing a single char at a time was just to
get going quickly; I knew that I would have to improve on that. I would
like to write chunks of 1MB, which would require that I have a 1MB string
to write. Is there any simple way of generating this 1MB string (other
than appending to a string until it reaches 1MB in length)? I don't care
about the actual value of the string itself.

Thanks,
Raghu.

Aug 17 '05 #5

Terry Reedy

<dr*******@gmail.com> wrote in message
news:11**********************@z14g2000cwz.googlegroups.com...

> Thanks for the info on xrange. Writing single char is just to get going
> quickly. I knew that I would have to improve on that. I would like to
> write chunks of 1MB which would require that I have 1MB string to
> write. Is there any simple way of generating this 1MB string

megastring = 1000000*'a' # t < 1 sec on my machine

> (other than keep appending to a string until it reaches 1MB len)?

You mean like (unexecuted)

s = ''
for i in xrange(1000000): s += 'a' #?

This will allocate, copy, and deallocate 1000000 successively longer
temporary strings and is a noticeable O(n**2) operation. Since strings are
immutable, you cannot 'append' to them the way you can to lists.

Terry J. Reedy

Aug 17 '05 #6
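
For reference, a few ways to build a fixed-size buffer without repeated appending, sketched in Python 3 with bytes objects. The name `MIB` and the choice of 1,048,576 bytes (rather than the 1,000,000 used in post #6) are assumptions made for illustration.

```python
MIB = 1024 * 1024  # 1 MiB; post #6 uses 1000000 instead

# Repetition: a single allocation of the final size.
by_repetition = b"a" * MIB

# Zero-filled buffer, usable when the content really does not matter.
zero_filled = bytes(MIB)

# join(): a linear-time way to combine many pieces that may differ.
by_join = b"".join([b"a"] * MIB)
```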

François Pinard

[dr*******@gmail.com]

> Is there any simple way of generating this 1MB string (other than keep
> appending to a string until it reaches 1MB len)?

You might of course use 'x' * 1000000 to generate, fairly quickly, a
single string holding one million 'x' characters.

Yet, your idea of generating a sparse file is interesting. I never
tried it with Python, but would not see why Python would not allow
it. Has anyone ever played with sparse files in Python? (One problem
with sparse files is that it is next to impossible for a normal user to
create an exact copy. There is no fast way to read them either.)

--
François Pinard http://pinard.progiciels-bpi.ca

Aug 17 '05 #7

Bengt Richter

On 17 Aug 2005 11:53:39 -0700, "dr*******@gmail.com" <dr*******@gmail.com> wrote:

> Hi,
>
> Is there any special support for sparse file handling in Python? My
> initial search (not a thorough one) didn't bring up much. I wrote
> the following piece of code:
>
> options.size = 6442450944
> options.ranges = ["4096,1024","30000,314572800"]
> fd = open("testfile", "w")
> fd.seek(options.size-1)
> fd.write("a")
> for drange in options.ranges:
>     off = int(drange.split(",")[0])
>     len = int(drange.split(",")[1])
>     print "off =", off, " len =", len
>     fd.seek(off)
>     for x in range(len):
>         fd.write("a")
>
> fd.close()
>
> This piece of code takes a very long time, and in fact I had to kill it
> because the Linux system started doing a lot of swapping. Am I doing
> something wrong here? Is there a better way to create/modify sparse files?
>
> Thanks

I'm unclear as to what your goal is. Do you just need an object that provides
an interface like a file object, but internally is more efficient than
a normal file object when you access it as above[1], or do you need to create
a real file and record all the bytes in full (with what default for gaps?)
on disk, so that it can be opened by another program and read as an ordinary file?

Some operating system file systems may have some support for virtual zero-block runs
and lazy allocation/representation of non-zero blocks in files. It's easy to imagine
the rudiments, but I don't know of such a file system, not having looked ;-)

You could write your own "sparse-file"-representation object, and maybe use pickle
for persistence. Or maybe you could use zipfiles. The kind of data you are creating above
would probably compress really well ;-)

[1] writing 314+ million identical bytes one by one is silly, of course ;-)
BTW, len is a built-in function, and using built-in names for variables
is frowned upon as a bug-prone practice.

Regards,
Bengt Richter

Aug 18 '05 #8
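
One possible shape for the "sparse-file"-representation object mentioned in post #8 is sketched below in Python 3. Everything here is hypothetical: the class name `SparseBuffer`, its methods, and the zero-byte default for gaps are assumptions, and the design favors brevity over efficiency (overlapping writes are simply replayed in order on each read).

```python
class SparseBuffer:
    """Hypothetical in-memory sparse byte store: only written extents are kept."""

    def __init__(self, size, fill=b"\x00"):
        self.size = size
        self.fill = fill          # single byte returned for unwritten gaps
        self.extents = []         # (offset, data) pairs in write order

    def write(self, offset, data):
        if offset < 0 or offset + len(data) > self.size:
            raise ValueError("write outside buffer bounds")
        self.extents.append((offset, bytes(data)))

    def read(self, offset, length):
        """Return up to `length` bytes from `offset`, using `fill` for gaps."""
        length = max(0, min(length, self.size - offset))
        out = bytearray(self.fill * length)
        # Replay extents in write order so later writes win on overlap.
        for start, data in self.extents:
            lo = max(start, offset)
            hi = min(start + len(data), offset + length)
            if lo < hi:
                out[lo - offset:hi - offset] = data[lo - start:hi - start]
        return bytes(out)
```

Because the state consists only of integers and bytes objects, an instance of a class like this could in principle be persisted with pickle, as the post suggests.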

Benji York

Terry Reedy wrote:

> megastring = 1000000*'a' # t < 1 sec on my machine
>
>> (other than keep appending to a string until it reaches 1MB len)?
>
> You mean like (unexecuted)
> s = ''
> for i in xrange(1000000): s += 'a' #?
>
> This will allocate, copy, and deallocate 1000000 successively longer
> temporary strings and is a noticeable O(n**2) operation.

Not exactly. CPython 2.4 added an optimization of "+=" for strings.
The for loop above takes about 1 second to execute on my machine. You
are correct in that it will take *much* longer on 2.3.
--
Benji York

Aug 18 '05 #9

draghuram

My goal is very simple: to have a mechanism to create sparse files and
modify them by writing arbitrary ranges of bytes at arbitrary offsets.
I did get the information I wanted (xrange instead of range, and a simple
way to generate a 1MB string in memory). Thanks for pointing out the
problem with using "len" as a variable name. It is indeed silly.

My only assumption about the underlying OS/file system is that if I seek
past the end of the file and write some data, it doesn't allocate blocks
for the data in between. This is indeed true on Linux (I tested on ext3).

Thanks,
Raghu.

Aug 18 '05 #10
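
The assumption in post #10 (seeking past the end and writing does not allocate the blocks in between) can be checked by comparing a file's apparent size with its allocated size. The following Python 3 sketch assumes a Unix-like system where `os.stat` reports `st_blocks` in 512-byte units; `make_sparse` and `allocated_bytes` are illustrative names, and the result depends on the file system in use.

```python
import os
import tempfile


def make_sparse(path, size):
    """Create a file of `size` bytes by writing a single byte at the very end."""
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.write(b"\x00")


def allocated_bytes(path):
    """Bytes allocated on disk, assuming st_blocks counts 512-byte units."""
    return os.stat(path).st_blocks * 512


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "testfile")
    make_sparse(path, 16 * 1024 * 1024)
    print("apparent size:", os.path.getsize(path))
    print("allocated:    ", allocated_bytes(path))
```

On a file system that supports sparse files, the allocated figure should be far smaller than the apparent size.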

Mike Meyer

"dr*******@gmail.com" <dr*******@gmail.com> writes:

> My goal is very simple: to have a mechanism to create sparse files and
> modify them by writing arbitrary ranges of bytes at arbitrary offsets.
> I did get the information I wanted (xrange instead of range, and a simple
> way to generate a 1MB string in memory). Thanks for pointing out the
> problem with using "len" as a variable name. It is indeed silly.
>
> My only assumption about the underlying OS/file system is that if I seek
> past the end of the file and write some data, it doesn't allocate blocks
> for the data in between. This is indeed true on Linux (I tested on ext3).

This better be true for anything claiming to be Unix. The results on
systems that break this aren't pretty.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Aug 19 '05 #11
