using mmap on large (> 2 Gig) files

myeates

Hi
Anyone ever done this? It looks like Python2.4 won't take a length
arg > 2 Gig since it's not seen as an int.

Mathew

Oct 23 '06 #1

Martin v. Löwis

my*****@jpl.nasa.gov wrote:

> Anyone ever done this? It looks like Python2.4 won't take a length
> arg > 2 Gig since it's not seen as an int.

What architecture are you on? On a 32-bit architecture, it's likely
impossible to map in 2GiB, anyway (since it likely won't fit into the
available address space).

On a 64-bit architecture, this is a known limitation of Python 2.4:
you can't have containers with more than 2Gi items. This limitation
was removed in Python 2.5, so I recommend upgrading. Notice that
the code has seen little testing, due to lack of proper hardware,
so I suggest that you review the mmap code first before using
it (or just test it out and report bugs as you find them).
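
Whether a mapping larger than 2 GiB is possible at all depends on the interpreter build. A minimal sketch, assuming a current Python 3 interpreter (`sys.maxsize` does not exist in Python 2.4), that reports the pointer width and whether a single object can be indexed past 2 GiB. The name `can_index_past_2gib` is made up for illustration:

```python
import struct
import sys

# Width of a C pointer in bits: 32 on a 32-bit build, 64 on a 64-bit build.
pointer_bits = struct.calcsize("P") * 8

# Largest length or index a container can have on this build.
max_index = sys.maxsize

TWO_GIB = 2 * 1024 ** 3


def can_index_past_2gib():
    """Return True if one object can hold or index more than 2 GiB of items."""
    return max_index >= TWO_GIB


print(pointer_bits, can_index_past_2gib())
```

On a 32-bit build `sys.maxsize` is 2**31 - 1, which is just below 2 GiB, so the check fails there regardless of the mmap module itself.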

Regards,
Martin

Oct 23 '06 #2

Travis E. Oliphant

Martin v. Löwis wrote:

> my*****@jpl.nasa.gov wrote:
>> Anyone ever done this? It looks like Python2.4 won't take a length
>> arg > 2 Gig since it's not seen as an int.
>
> What architecture are you on? On a 32-bit architecture, it's likely
> impossible to map in 2GiB, anyway (since it likely won't fit into the
> available address space).
>
> On a 64-bit architecture, this is a known limitation of Python 2.4:
> you can't have containers with more than 2Gi items. This limitation
> was removed in Python 2.5, so I recommend upgrading. Notice that
> the code has seen little testing, due to lack of proper hardware,

NumPy uses the mmap object and I saw a paper at SciPy 2006 that used
Python 2.5 + mmap + numpy to do some pretty nice and relatively fast
manipulations of very large data sets.

So, the very useful changes by Martin have seen more testing than he is
probably aware of.

-Travis

Oct 23 '06 #3

sturlamolden

my*****@jpl.nasa.gov wrote:

> Anyone ever done this? It looks like Python2.4 won't take a length arg

http://docs.python.org/lib/module-mmap.html

It seems that Python does take a length argument, but not an offset
argument (unlike Windows' CreateFileMapping/MapViewOfFile and UNIX'
mmap), so you always map from the beginning of the file. Of course if
you have ever worked with memory mapping files in C, you will probably
have experienced that mapping a large file from beginning to end is a
major slowdown. And if the file is big enough, it does not even fit
inside the 32-bit address space of your process. Thus you have to limit
the portion of the file that is mapped, using the offset and the length
arguments.

But the question remains whether Python's "mmap" qualifies as a "memory
mapping" at all. Memory mapping a file means that the file is "mapped"
into the process address space. So if you access a certain address
(using a pointer type in C), you will actually read from or write to
the file. On Windows, this mechanism is even used to access "files"
that do not live on the file system. E.g. if CreateFileMapping is
called with the file handle set to INVALID_HANDLE_VALUE, it creates a
file mapping backed by the OS paging file. That is, you actually obtain
a shared memory segment, e.g. usable for inter-process communication.
How would you use Python's mmap for something like this?
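
For reference, a minimal sketch of a mapping that is not backed by a named file. It assumes a current Python 3 interpreter, where passing -1 as the file descriptor requests anonymous memory; this says nothing about how Python 2.4 behaved. The variable names are illustrative only:

```python
import mmap

# Passing -1 instead of a real file descriptor requests an anonymous
# mapping, i.e. memory that is not backed by a named file.
shared = mmap.mmap(-1, 4096)

shared[0:5] = b"hello"         # slice assignment writes into the mapping
greeting = bytes(shared[0:5])  # slicing reads it back as bytes
shared.close()

print(greeting)
```

To actually share such a region, another process has to get at the same mapping, for example a child created with fork on UNIX, or a process that opens the same tag name on Windows.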

I haven't looked at the source, but I'd be surprised if Python actually
maps the file into the process image when mmap is called. I believe
Python is not memory mapping at all; rather, it just opens a file in
the file system and uses fseek to move around. That is, you can use
slicing operators on Python's "memory mapped file object" as if it were
a list or a string, but it's not really memory mapping, it's just a
syntactic convenience. Because of this, you even need to manually
"flush" the memory mapping object. If you were talking to a real memory
mapped file, flushing would obviously not be required.

This probably means that your problem is irrelevant. Even if the file
is too large to fit inside a 32 bit process image, Python's memory
mapping would not be affected by this, as it is not memory mapping the
file when "mmap" is called.

Oct 23 '06 #4

sturlamolden

Martin v. Löwis wrote:

> What architecture are you on? On a 32-bit architecture, it's likely
> impossible to map in 2GiB, anyway (since it likely won't fit into the
> available address space).

Indeed. But why does Python's memory mapping need to be flushed? And
why doesn't Python's mmap take an offset argument to handle large
files? Is Python actually memory mapping with mmap or just faking it
with fseek? If Python isn't memory mapping, there would be no limit
imposed by the 32 bit address space.

Oct 24 '06 #5

sturlamolden

my*****@jpl.nasa.gov wrote:

> Hi
> Anyone ever done this? It looks like Python2.4 won't take a length
> arg > 2 Gig since it's not seen as an int.

Looking at Python's source (mmapmodule.c), it seems that "mmap.mmap"
always sets the offset argument in Windows' MapViewOfFile and UNIX'
mmap to 0. This means that it is always mapping from the beginning of
the file. Thus, Python's mmap module is useless for large files. This
is really bad coding. The one who wrote mmapmodule.c didn't consider
the possibility that a 64-bit file system like NTFS can harbour files
too large to fit in a 32-bit address space. Thus, mmapmodule.c needs to
be fixed before it can be used for large files.

Oct 24 '06 #8

myeates

Well, compiling Python 2.5 on Solaris 10 on an x86 is no walk in the
park. pyconfig.h seems to think SIZEOF_LONG is 4 and I SEGV during my
build, even after modifying the Makefile and pyconfig.h.

Mathew

Martin v. Löwis wrote:

> my*****@jpl.nasa.gov wrote:
>> Anyone ever done this? It looks like Python2.4 won't take a length
>> arg > 2 Gig since it's not seen as an int.
>
> What architecture are you on? On a 32-bit architecture, it's likely
> impossible to map in 2GiB, anyway (since it likely won't fit into the
> available address space).
>
> On a 64-bit architecture, this is a known limitation of Python 2.4:
> you can't have containers with more than 2Gi items. This limitation
> was removed in Python 2.5, so I recommend upgrading. Notice that
> the code has seen little testing, due to lack of proper hardware,
> so I suggest that you review the mmap code first before using
> it (or just test it out and report bugs as you find them).
>
> Regards,
> Martin

Oct 24 '06 #9

Fredrik Lundh

sturlamolden wrote:

> Looking at Python's source (mmapmodule.c), it seems that "mmap.mmap"
> always sets the offset argument in Windows' MapViewOfFile and UNIX'
> mmap to 0. This means that it is always mapping from the beginning of
> the file. Thus, Python's mmap module is useless for large files. This
> is really bad coding. The one who wrote mmapmodule.c didn't consider
> the possibility that a 64-bit file system like NTFS can harbour files
> too large to fit in a 32-bit address space. Thus, mmapmodule.c needs to
> be fixed before it can be used for large files.

if you've gotten that far, maybe you could come up with a patch, instead
of stating that someone else "needs to fix it" ?

</F>

Oct 24 '06 #10

Donn Cave

In article <11**********************@k70g2000cwa.googlegroups.com>,
"sturlamolden" <st**********@yahoo.no> wrote:
....

> It seems that Python does take a length argument, but not an offset
> argument (unlike Windows' CreateFileMapping/MapViewOfFile and UNIX'
> mmap), so you always map from the beginning of the file. Of course if
> you have ever worked with memory mapping files in C, you will probably
> have experienced that mapping a large file from beginning to end is a
> major slowdown.

I certainly have not experienced that. mmap itself takes nearly
no time, there should be no I/O. Access to mapped pages may
require I/O, but there is no way around that in any case.

> I haven't looked at the source, but I'd be surprised if Python actually
> maps the file into the process image when mmap is called. I believe
> Python is not memory mapping at all; rather, it just opens a file in
> the file system and uses fseek to move around.

Wow, you're sure a wizard! Most people would need to look before
making statements like that.

Donn Cave, do**@u.washington.edu

Oct 24 '06 #11

Martin v. Löwis

sturlamolden wrote:

> Martin v. Löwis wrote:
>
>> What architecture are you on? On a 32-bit architecture, it's likely
>> impossible to map in 2GiB, anyway (since it likely won't fit into
>> the available address space).
>
> Indeed. But why does Python's memory mapping need to be flushed?

It doesn't need to, why do you think it does?

> And why doesn't Python's mmap take an offset argument to handle large
> files?

I don't know exactly; the most likely reason is that nobody has
contributed code to make it support that. That's, in turn, probably
because nobody had the problem yet, or none of those who did
cared enough to implement and contribute a patch.

> Is Python actually memory mapping with mmap or just faking it
> with fseek?

Read the source, Luke. It uses mmap or MapViewOfFile, depending
on the platform.

Regards,
Martin

Oct 24 '06 #12

Martin v. Löwis

sturlamolden wrote:

> Looking at Python's source (mmapmodule.c), it seems that "mmap.mmap"
> always sets the offset argument in Windows' MapViewOfFile and UNIX'
> mmap to 0. This means that it is always mapping from the beginning of
> the file. Thus, Python's mmap module is useless for large files. This
> is really bad coding. The one who wrote mmapmodule.c didn't consider
> the possibility that a 64-bit file system like NTFS can harbour files
> too large to fit in a 32-bit address space. Thus, mmapmodule.c needs to
> be fixed before it can be used for large files.

You know this isn't true in general. It is true for a 32-bit address
space only.

Regards,
Martin

Oct 24 '06 #13

sturlamolden

Fredrik Lundh wrote:

>> too large to fit in a 32-bit address space. Thus, mmapmodule.c needs to
>> be fixed before it can be used for large files.
>
> if you've gotten that far, maybe you could come up with a patch, instead
> of stating that someone else "needs to fix it" ?

I did not say "someone else" needs to fix it. I can patch it, but I am
busy until next weekend. This is a typical job for a cold, rainy
Saturday afternoon. Also I am not in a hurry to patch mmapmodule.c for
my own projects, as I am not using it (but I am going to).

A patch would involve a new object, say, "mmap.mmap2" that takes the
additional offset parameter. I don't want it to break any code
dependent on the existing "mmap.mmap" object. Also, I think mmap.mmap2
should allow the file object to be None, and in that case return a
shared memory segment backed by the OS' paging file. Calling
CreateFileMapping with the filehandle set to INVALID_HANDLE_VALUE is
how shared memory for IPC is created on Windows.

Oct 24 '06 #14

Martin v. Löwis

sturlamolden wrote:

> A patch would involve a new object, say, "mmap.mmap2" that takes the
> additional offset parameter. I don't want it to break any code
> dependent on the existing "mmap.mmap" object. Also, I think mmap.mmap2
> should allow the file object to be None, and in that case return a
> shared memory segment backed by the OS' paging file. Calling
> CreateFileMapping with the filehandle set to INVALID_HANDLE_VALUE is
> how shared memory for IPC is created on Windows.

Python has default parameters for that. Just add a new parameter,
and make it have a default value of 0. No need to add new functions
(let alone types).
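
An offset parameter with a default of 0 is what later releases ended up with: from Python 2.6 on, `mmap.mmap` accepts `offset`, which must be a multiple of `mmap.ALLOCATIONGRANULARITY`. A minimal sketch against that later API, not the 2.4/2.5 one discussed in this thread. `map_window` is a hypothetical helper name:

```python
import mmap
import os
import tempfile


def map_window(path, start, length):
    """Map `length` bytes of the file at `path`, beginning at byte `start`.

    The offset handed to mmap must be a multiple of
    mmap.ALLOCATIONGRANULARITY, so round down and report the slack.
    """
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (start // gran) * gran
    slack = start - aligned
    with open(path, "rb") as f:
        m = mmap.mmap(f.fileno(), length + slack,
                      access=mmap.ACCESS_READ, offset=aligned)
    return m, slack


# Demo on a small temporary file; the call is the same for huge files.
size = 3 * mmap.ALLOCATIONGRANULARITY
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * size)
    tmp.seek(size - 4)
    tmp.write(b"tail")
    demo_path = tmp.name

window, slack = map_window(demo_path, size - 4, 4)
tail = window[slack:slack + 4]   # the requested bytes start after the slack
window.close()
os.remove(demo_path)
```

Only the window is mapped, so a 32-bit process can walk through a file larger than its address space one window at a time.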

In any case, take as much time as you need. Python 2.6 won't be
released until 2008.

Regards,
Martin

Oct 24 '06 #15

sturlamolden

Martin v. Löwis wrote:

> You know this isn't true in general. It is true for a 32-bit address
> space only.

Yes, but there are two other aspects:

1. Many of us use 32-bit architectures. The one who wrote the module
should have considered why UNIX' mmap and Windows' MapViewOfFile take
an offset parameter. As it is now, "mmap.mmap" can be considered
inadequate on 32-bit architectures.

2. The OS may be stupid. Mapping a large file may be a major slowdown
simply because the memory mapping is implemented suboptimally inside
the OS. For example it may try to load and synchronise huge portions of
the file that you don't need. This will deplete the amount of free RAM,
and perhaps result in excessive swapping. "mmap.mmap" is therefore a
potential "tarpit" on any architecture. Thus, memory mapping more than
you need is not intelligent, even if you do have a 64-bit processor.
The missing offset argument is essential for getting adequate
performance from a memory-mapped file object.

Oct 24 '06 #16

sturlamolden

Donn Cave wrote:

> Wow, you're sure a wizard! Most people would need to look before
> making statements like that.

I know, but your news-server doesn't honour cancel messages. :)

Python's mmap does indeed memory map the file into the process image.
It does not fake memory mapping by means of file seek operations.

However, "memory mapping" a file by means of fseek() is probably more
efficient than using UNIX' mmap() or Windows'
CreateFileMapping()/MapViewOfFile(). In Python, we don't always need
the file memory mapped, we normally just want to use slicing-operators,
for-loops and other goodies on the file object -- i.e. we just want to
treat the file as a Python container object. There are many ways of
achieving that.

We can implement a container object backed by a binary file just as
efficiently (and possibly even more efficiently) without using the OS'
memory mapping facilities. The major advantage is that we can
"pseudo-memory map" a lot more than a 32 bit address space can harbour.
However - as I wrote in another posting - memory-mapping may also be
used to create shared memory on Windows, and that doesn't fit easily
into the fseek scheme. But apart from that, I don't see why true memory
mapping has any real advantage on Python. As long as slicing operators
work, users will probably not be able to tell the difference.
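
A rough sketch of the kind of seek-based container described here, written for a current Python 3 interpreter. It only illustrates the idea of slicing a file without mapping it and makes no claim about relative efficiency. `SeekBackedBytes` is a hypothetical class name, and it is read-only for brevity:

```python
import os
import tempfile


class SeekBackedBytes:
    """Hypothetical read-only, sequence-like view of a file using seek/read."""

    def __init__(self, path):
        self._f = open(path, "rb")
        self._size = os.path.getsize(path)

    def __len__(self):
        return self._size

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(self._size)
            if step != 1:
                raise ValueError("only contiguous slices are supported")
            self._f.seek(start)
            return self._f.read(max(0, stop - start))
        if key < 0:
            key += self._size
        if not 0 <= key < self._size:
            raise IndexError("index out of range")
        self._f.seek(key)
        return self._f.read(1)

    def close(self):
        self._f.close()


# Small demo file standing in for a large one.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    demo_path = tmp.name

view = SeekBackedBytes(demo_path)
middle = view[3:6]
last = view[-1]
view.close()
os.remove(demo_path)
```

Every access here goes through a seek and a read call on the file object, which is the cost the later replies in this thread focus on.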

There is in any case room for improving Python's mmap object.

Oct 24 '06 #17

sturlamolden

Martin v. Löwis wrote:

Your news server doesn't honour cancel as well...

> It doesn't need to, why do you think it does?

This was an extremely stupid question on my side. It needs to be
flushed after a write because that's how the memory pages mapping the
file are synchronized with the file. Write ops to the memory mapping
addresses aren't immediately synchronized with the file on disk. Both
Windows and UNIX require this. I should think before I write, but I
realized this after posting and my cancel didn't reach you.
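
A minimal sketch of the write-then-flush sequence, assuming a current Python 3 interpreter and a small temporary file. `flush()` asks the OS to write modified pages back to the file; without it there is no guarantee about when that happens:

```python
import mmap
import os
import tempfile

# Create a small file to map; an empty file cannot be mapped.
fd, path = tempfile.mkstemp()
os.write(fd, b"-" * 16)

m = mmap.mmap(fd, 16, access=mmap.ACCESS_WRITE)
m[0:7] = b"flushed"   # modifies the mapped pages in memory
m.flush()             # request write-back of the modified pages
m.close()
os.close(fd)

# Read the file back through ordinary file I/O.
with open(path, "rb") as f:
    on_disk = f.read()
os.remove(path)

print(on_disk)
```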

> Read the source, Luke. It uses mmap or MapViewOfFile, depending
> on the platform.

Yes, indeed.

Oct 24 '06 #18

Steve Holden

sturlamolden wrote:
[...]

> This was an extremely stupid question on my side.

I take my hat off to anyone who's prepared to admit this. We all do it,
but most of us try to ignore the fact.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Oct 24 '06 #19

Martin v. Löwis

sturlamolden wrote:

> 2. The OS may be stupid. Mapping a large file may be a major slowdown
> simply because the memory mapping is implemented suboptimally inside
> the OS. For example it may try to load and synchronise huge portions of
> the file that you don't need.

Can you give an example of an operating system that behaves that way?
To my knowledge, all current systems integrate memory mapping somehow
with the page/buffer caches, using various strategies to write back
(or just discard in case of no writes) pages that haven't been used
for a while.

> The missing offset argument is essential for getting adequate
> performance from a memory-mapped file object.

I very much question that statement. Do you have any numbers to
prove it?

Regards,
Martin

Oct 25 '06 #20

Tim Roberts

"sturlamolden" <st**********@yahoo.no> wrote:

> However, "memory mapping" a file by means of fseek() is probably more
> efficient than using UNIX' mmap() or Windows'
> CreateFileMapping()/MapViewOfFile().

My goodness, do I disagree with that! At least on Windows, I/O on a file
mapped with MapViewOfFile uses the virtual memory pager -- the same
mechanism used by the swap file. Because it is so heavily used, that is
some of the most well-optimized code in the system.

> We can implement a container object backed by a binary file just as
> efficiently (and possibly even more efficiently) without using the OS'
> memory mapping facilities. The major advantage is that we can
> "pseudo-memory map" a lot more than a 32 bit address space can harbour.

Both the Unix mmap and the Win32 MapViewOfFile allow a starting byte
offset. It wouldn't be rocket science to extend Python's mmap to allow
that.

> There is in any case room for improving Python's mmap object.

Here we agree.
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.

Oct 26 '06 #21

Paul Rubin

"sturlamolden" <st**********@yahoo.no> writes:

> However, "memory mapping" a file by means of fseek() is probably more
> efficient than using UNIX' mmap() or Windows'
> CreateFileMapping()/MapViewOfFile().

Why on earth would you think that?! It is counterintuitive. fseek beyond
whatever is buffered in stdio (usually no more than 1kbyte or so)
requires a system call, while mmap is just a memory access.

> In Python, we don't always need the file memory mapped, we normally
> just want to use slicing-operators, for-loops and other goodies on
> the file object -- i.e. we just want to treat the file as a Python
> container object. There are many ways of achieving that.

Some of the time we want to share the region with other processes.
Sometimes we just want random access to a big file on disk without
having to do a lot of context switches seeking around in the file.

> There is in any case room for improving Python's mmap object.

IMO it should have some kind of IPC locking mechanism added, in
addition to the offset stuff suggested.

Oct 26 '06 #22

Chetan

Paul Rubin <http://ph****@NOSPAM.invalid> writes:

> "sturlamolden" <st**********@yahoo.no> writes:
>> However, "memory mapping" a file by means of fseek() is probably more
>> efficient than using UNIX' mmap() or Windows'
>> CreateFileMapping()/MapViewOfFile().
>
> Why on earth would you think that?! It is counterintuitive. fseek beyond
> whatever is buffered in stdio (usually no more than 1kbyte or so)
> requires a system call, while mmap is just a memory access.

And the buffer copy required with every I/O from/to the application.

>> In Python, we don't always need the file memory mapped, we normally
>> just want to use slicing-operators, for-loops and other goodies on
>> the file object -- i.e. we just want to treat the file as a Python
>> container object. There are many ways of achieving that.
>
> Some of the time we want to share the region with other processes.
> Sometimes we just want random access to a big file on disk without
> having to do a lot of context switches seeking around in the file.
>
>> There is in any case room for improving Python's mmap object.
>
> IMO it should have some kind of IPC locking mechanism added, in
> addition to the offset stuff suggested.

The type of IPC required differs depending on who is using the shared region -
either another Python process or another external program. Apart from the
spinlock primitives, other types of synchronization mechanisms are provided by
the OS. However, I do see value in providing a shared memory based spinlock
mechanism. These services can be built on top of the shared memory
infrastructure. I am not sure what kind of real-world Python applications use
it.

-Chetan

Oct 26 '06 #23

Paul Rubin

Chetan <pa*************@xspam.sbcglobal.net> writes:

>> Why on earth would you think that?! It is counterintuitive. fseek beyond
>> whatever is buffered in stdio (usually no more than 1kbyte or so)
>> requires a system call, while mmap is just a memory access.
> And the buffer copy required with every I/O from/to the application.

Even that can probably be avoided since the mmap region has to start
on a page boundary, but anyway regular I/O definitely has to copy the
data. For mmap, I'm thinking mostly of the case where the entire file
is paged in through most of the program's execution though. That
obviously wouldn't apply to every application.

>> IMO it should have some kind of IPC locking mechanism added, in
>> addition to the offset stuff suggested.
> The type of IPC required differs depending on who is using the
> shared region - either another Python process or another external
> program. Apart from the spinlock primitives, other types of
> synchronization mechanisms are provided by the OS. However, I do see
> value in providing a shared memory based spinlock mechanism.

I mean just have an interface to OS locks (Linux futex and whatever
the Windows counterpart is) and maybe also a utility function to do a
compare-and-swap in user space.
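
Nothing like this is built into the mmap object. As a stand-in, here is a rough sketch that guards a mapped region with a POSIX advisory file lock from the standard `fcntl` module. This is neither a futex nor a compare-and-swap, it is UNIX-only, and `locked_increment` is a hypothetical helper name:

```python
import fcntl  # POSIX only; not available on Windows
import mmap
import os
import struct
import tempfile


def locked_increment(path):
    """Increment a 64-bit counter stored in the first 8 bytes of a file."""
    with open(path, "r+b") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)   # block until we hold the lock
        try:
            m = mmap.mmap(f.fileno(), 8)
            (value,) = struct.unpack("<Q", m[0:8])
            m[0:8] = struct.pack("<Q", value + 1)
            m.flush()
            m.close()
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)
    return value + 1


# Demo: a temporary file holding a zeroed counter.
fd, counter_path = tempfile.mkstemp()
os.write(fd, struct.pack("<Q", 0))
os.close(fd)

first = locked_increment(counter_path)
second = locked_increment(counter_path)
os.remove(counter_path)
```

The lock is advisory: it only protects the counter if every cooperating process takes the same lock before touching the region.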

Oct 26 '06 #24

Chetan

Paul Rubin <http://ph****@NOSPAM.invalid> writes:

> I mean just have an interface to OS locks (Linux futex and whatever
> the Windows counterpart is) and maybe also a utility function to do a
> compare-and-swap in user space.

There is code for spinlocks, but it allocates the lockword in the process
memory. This can be used for thread synchronization, but not for IPC with
external Python or non-Python processes.
I found a PyIPC IPC package that seems to provide an interface to Sys V shared
memory and semaphores - but I just found it, so cannot comment on it at this
time.

Oct 26 '06 #25

nnorwitz

Martin v. Löwis wrote:

> sturlamolden wrote:
>
>> And why doesn't Python's mmap take an offset argument to handle large
>> files?
>
> I don't know exactly; the most likely reason is that nobody has
> contributed code to make it support that. That's, in turn, probably
> because nobody had the problem yet, or none of those who did
> cared enough to implement and contribute a patch.

Or because no one cared enough to test a patch that was produced 2.5
years ago (not directed at Martin, just pointing out why the patch
stalled).

http://python.org/sf/708374

With just a little community support, this can go in. I suppose now
that we have the buildbots, we can check in untested code and test it
that way. The patch should be reviewed.

n

Oct 28 '06 #26

Chetan

"nn******@gmail.com" <nn******@gmail.com> writes:

> Martin v. Löwis wrote:
>> sturlamolden wrote:
>>
>>> And why doesn't Python's mmap take an offset argument to handle large
>>> files?
>> I don't know exactly; the most likely reason is that nobody has
>> contributed code to make it support that. That's, in turn, probably
>> because nobody had the problem yet, or none of those who did
>> cared enough to implement and contribute a patch.
>
> Or because no one cared enough to test a patch that was produced 2.5
> years ago (not directed at Martin, just pointing out why the patch
> stalled).
>
> http://python.org/sf/708374
>
> With just a little community support, this can go in. I suppose now
> that we have the buildbots, we can check in untested code and test it
> that way. The patch should be reviewed.
>
> n

I made the changes before I saw this. However, the patch seems to be quite
dated and some of the changes are very interesting, especially if they were
tested for the special conditions they are supposed to handle and
if they were made after some discussion.
I can submit my patch as it is, but I am working on making some of the other
changes I had in mind for the mmap to be useful.
Some of the other changes would make more sense for py3k, if it supports a byte
array object, but I haven't looked at py3k at all.

Chetan

Oct 28 '06 #27
