Bytes | Software Development & Data Engineering Community

2.6, 3.0, and truly independent interpreters

Dear Python dev community,

I'm CTO at a small software company that makes music visualization
software (you can check us out at www.soundspectrum.com). About two
years ago we made the decision to use embedded python in a couple of
our new products, given all the great things about python. We were
close to using lua but for various reasons we decided to go with
python. However, over the last two years, there's been one area of
grief that sometimes makes me think twice about our decision to go
with python...

Some background first... Our software is used for entertainment and
centers around real time, high-performance graphics, so python's
performance, embedded flexibility, and stability are the most
important issues for us. Our software targets a large cross section
of hardware and we currently ship products for Win32, OS X, and the
iPhone and since our customers are end users, our products have to be
robust, have a tidy install footprint, and be foolproof. Basically,
we use embedded python and use it to wrap our high performance C++
class set which wraps OpenGL, DirectX and our own software renderer.
In addition to wrapping our C++ frameworks, we use python to perform
various "worker" tasks on worker threads (e.g. image loading and
processing). However, we require *true* thread/interpreter
independence, so python 2 has been frustrating at times, to say the
least. Please don't start with "but really, python supports multiple
interpreters" because I've been there many many times with people.
And, yes, I'm aware of the multiprocessing module added in 2.6, but
that stuff isn't lightweight and isn't suitable at all for many
environments (including ours). The bottom line is that if you want to
perform independent processing (in python) on different threads, using
the machine's multiple cores to the fullest, then you're out of luck
under python 2.

Sadly, the only way we could get truly independent interpreters was to
put python in a dynamic library, have our installer make a *duplicate*
copy of it during the installation process (e.g. python.dll/.bundle ->
python2.dll/.bundle) and load each one explicitly in our app, so we
can get truly independent interpreters. In other words, we load a
fresh dynamic lib for each thread-independent interpreter (you can't
reuse the same dynamic library because the OS will just reference the
already-loaded one).

From what I gather from the python community, the basis for not
offering "real" multi-threaded support is that it'd add too much
internal overhead--and I couldn't agree more. As a high performance C
and C++ guy, I fully agree that thread safety should be at the high
level, not at the low level. BUT, the lack of truly independent
interpreters is what ultimately prevents using python in cool,
powerful ways. This shortcoming alone has caused game developers--
both large and small--to choose other embedded interpreters over
python (e.g. Blizzard chose lua over python). For example, Apple's
QuickTime API is powerful in that high-level instance objects can
leverage performance gains associated with multi-threaded processing.
Meanwhile, the QuickTime API simply lists the responsibilities of the
caller regarding thread safety and that's all it needs to do. In
other words, CPython doesn't need to step in and provide a threadsafe
environment; it just needs to establish the rules and make sure that
its own implementation supports those rules.

More than once, I had actually considered expending company resources
to develop a high performance, truly independent interpreter
implementation of the python core language and modules but in the end
estimated that the size of that project would just be too much, given
our company's current resources. Should such an implementation ever
be developed, it would be very attractive for companies to support,
fund, and/or license. The truth is, we just love python as a
language, but its lack of true interpreter independence (in an
interpreter as well as in a thread sense) remains a *huge* liability.

So, my question becomes: is python 3 ready for true multithreaded
support?? Can we finally abandon our Frankenstein approach of loading
multiple identical dynamic libs to achieve truly independent
interpreters?? I've reviewed all the new python 3 C API module stuff,
and all I have to say is: whew--better late than never!! So, although
that solves modules offering truly independent interpreter support,
the following questions remain:

- In python 3, the C module API now supports true interpreter
independence, but have all the modules in the python codebase been
converted over? Are they all now truly compliant? It will only take
a single static/global state variable in a module to potentially cause
no end of pain in a multiple interpreter environment! Yikes!

- How close is python 3 really to true multithreaded use? The
assumption here is that the caller ensures safety (e.g. ensuring that
neither interpreter is in use when serializing data from one to
another).

I believe that true python independent thread/interpreter support is
paramount and should become the top priority because this is the key
consideration used by developers when they're deciding which
interpreter to embed in their app. Until there's a hello world that
demonstrates running independent python interpreters on multiple app
threads, lua will remain the clear choice over python. Python 3 needs
true interpreter independence and multi-threaded support!
Thanks,
Andy O'Meara
Oct 22 '08
>>As far as I can tell, it seems
>>CPython's current state can't do CPU-bound parallelization in the same
>>address space.
That's not true.

Um... So let's say you have a opaque object ref from the OS that
represents hundreds of megs of data (e.g. memory-resident video). How
do you get that back to the parent process without serialization and
IPC?
What parent process? I thought you were talking about multi-threading?
What should really happen is just use the same address space so
just a pointer changes hands. THAT's why I'm saying that a separate
address space is generally a deal breaker when you have large or
intricate data sets (ie. when performance matters).
Right. So use a single address space, multiple threads, and perform the
heavy computations in C code. I don't see how Python is in the way at
all. Many people do that, and it works just fine. That's what
Jesse (probably) meant with his remark
>A c-level module, on the other hand, can sidestep/release
>the GIL at will, and go on its merry way and process away.
Please reconsider this; it might be a solution to your problem.
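For instance, here is a minimal sketch of that pattern (an illustration only, not code from the thread; it relies on the fact that CPython's hashlib does its digest work in C and releases the GIL for large buffers):

```python
import hashlib
import threading

# Each thread hashes its own large buffer. The C-level digest code
# releases the GIL while it runs, so the two threads can genuinely
# execute on two cores -- no extra interpreters required.
def digest(buf, out, i):
    out[i] = hashlib.sha256(buf).hexdigest()

bufs = [b"a" * 10**7, b"b" * 10**7]
results = [None, None]
threads = [threading.Thread(target=digest, args=(b, results, i))
           for i, b in enumerate(bufs)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Same answers as the plain sequential computation.
assert results == [hashlib.sha256(b).hexdigest() for b in bufs]
```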

Regards,
Martin
Oct 26 '08 #61

Grrr... I posted a ton of lengthy replies to you and other recent
posts here using Google and none of them made it, argh. Poof. There's
nothing that fires me up more than lost work, so I'll have to
revert to short and simple answers for the time being. Argh, damn.
On Oct 25, 1:26 am, greg <g...@cosc.canterbury.ac.nz> wrote:
Andy O'Meara wrote:
I would definitely agree if there was a context (i.e. environment)
object passed around then perhaps we'd have the best of all worlds.

Moreover, I think this is probably the *only* way that
totally independent interpreters could be realized.

Converting the whole C API to use this strategy would be
a very big project. Also, on the face of it, it seems like
it would render all existing C extension code obsolete,
although it might be possible to do something clever with
macros to create a compatibility layer.

Another thing to consider is that passing all these extra
pointers around everywhere is bound to have some effect
on performance.

I'm with you on all counts, so no disagreement there. On the "passing
a ptr everywhere" issue, perhaps one idea is that all objects could
have an additional field that would point back to their parent context
(ie. their interpreter). So the only prototypes that would have to be
modified to contain the context ptr would be the ones that don't
inherently operate on objects (e.g. importing a module).
On Oct 25, 1:54 am, greg <g...@cosc.canterbury.ac.nz> wrote:
Andy O'Meara wrote:
- each worker thread makes its own interpreter, pops scripts off a
work queue, and manages exporting (and then importing) result data to
other parts of the app.

I hope you realize that starting up one of these interpreters
is going to be fairly expensive. It will have to create its
own versions of all the builtin constants and type objects,
and import its own copy of all the modules it uses.
Yeah, for sure. And I'd say that's a pretty well established
convention already out there for any industry package. The pattern
I'd expect to see is where the app starts worker threads, starts
interpreters in one or more of each, and throws jobs to different ones
(and the interpreter would persist to move on to subsequent jobs).
One wonders if it wouldn't be cheaper just to fork the
process. Shared memory can be used to transfer large lumps
of data if needed.
As I mentioned, when you're talking about intricate data structures, OS
opaque objects (i.e. ones that have their own internal allocators), or huge
data sets, even a shared memory region unfortunately can't fit the
bill.
Andy
Oct 27 '08 #62
On Oct 24, 9:52 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
A c-level module, on the other hand, can sidestep/release
the GIL at will, and go on it's merry way and process away.
...Unless part of the C module execution involves the need to do CPU-
bound work on another thread through a different python interpreter,
right?

Wrong.

Let's take a step back and remind ourselves of the big picture. The
goal is to have independent interpreters running in pthreads that the
app starts and controls. No interpreter at any point is doing
any thread-related stuff in any way. For example, each script job
just does meat-and-potatoes CPU work, using callbacks that, say,
programmatically use OS APIs to edit and transform frame data.

So I think the disconnect here is that maybe you're envisioning
threads being created *in* python. To be clear, we're talking about
making threads at the app level and making it a given for the app to
take its safety into its own hands.
>As far as I can tell, it seems
>CPython's current state can't do CPU-bound parallelization in the same
>address space.

That's not true.
Well, when you're talking about large, intricate data structures
(which include opaque OS object refs that use process-associated
allocators), even a shared memory region between the child process and
the parent can't do the job. Otherwise, please describe in detail how
I'd get an opaque OS object (e.g. an OS ref that refers to memory-
resident video) from the child process back to the parent process.

Again, the big picture that I'm trying to plant here is that there
really is a serious need for truly independent interpreters/contexts
in a shared address space. Consider stuff like libpng, zlib, libjpeg,
or whatever--the use pattern is always the same: make a context
object, do your work in the context, and take it down. For most
industry-caliber packages, the expectation and convention (unless
documented otherwise) is that the app can make as many contexts as it
wants in whatever threads it wants because the convention is that the
app must (a) never use one context's objects in another context,
and (b) never use a context at the same time from more than one
thread. That's all I'm really trying to look at here.
Andy


Oct 27 '08 #63

And in the case of hundreds of megs of data

... and I would be surprised at someone that would embed hundreds of
megs of data into an object such that it had to be serialized... seems
like the proper design is to point at the data, or a subset of it, in a
big buffer. Then data transfers would just transfer the offset/length
and the reference to the buffer.
and/or thousands of data structure instances,

... and this is another surprise! You have thousands of objects (data
structure instances) to move from one thread to another?
I think we miscommunicated there--I'm actually agreeing with you. I
was trying to make the same point you were: that intricate and/or
large structures are meant to be passed around by a top-level pointer,
not using serialization/messaging. This is what I've been trying
to explain to others here; that IPC and shared memory unfortunately
aren't viable options, leaving app threads (rather than child
processes) as the solution.

Of course, I know that data get large, but typical multimedia streams
are large, binary blobs. I was under the impression that processing
them usually proceeds along the lines of keeping offsets into the blobs,
and interpreting, etc. Editing is usually done by making a copy of a
blob, transforming it or a subset in some manner during the copy
process, resulting in a new, possibly different-sized blob.

Your instincts are right. I'd only add that when you're talking
about data structures associated with an intricate video format, the
complexity and depth of the data structures is insane -- the LAST
thing you want to burn cycles on is serializing and unserializing that
stuff (so IPC is out)--again, we're already on the same page here.

I think at one point you made the comment that shared memory is a
solution to handle large data sets between a child process and the
parent. Although this is certainly true in principle, it doesn't hold
up in practice since complex data structures often contain 3rd party
and OS API objects that have their own allocators. For example, in
video encoding, there's TONS of objects that comprise memory-resident
video from all kinds of APIs, so the idea of having them allocated
from a shared/mapped memory block isn't even possible. Again, I only
raise this to offer evidence that doing real-world work in a child
process is a deal breaker--a shared address space is just way too much
to give up.
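The zero-copy pattern we're both describing can be sketched in python terms (a toy illustration of mine, not from the thread: memoryview slices reference the underlying buffer rather than copying it, so workers can be handed (buffer, offset, length) instead of serialized copies):

```python
# One big buffer holds many "frames"; a worker receives only a view
# into it, never a serialized copy of the data.
frame_size = 4096
blob = bytearray(frame_size * 100)   # the shared buffer
view = memoryview(blob)

def handle_frame(frames, index):
    # Slicing a memoryview makes no copy; writes go straight
    # through to the underlying bytearray.
    frame = frames[index * frame_size:(index + 1) * frame_size]
    frame[0] = 0xFF
    return frame

handle_frame(view, 3)
assert blob[3 * frame_size] == 0xFF   # the worker's write is visible
```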
Andy
Oct 27 '08 #64
On Mon, Oct 27, 2008 at 12:03 PM, Andy O'Meara <an****@gmail.com> wrote:
I think we miscommunicated there--I'm actually agreeing with you. I
was trying to make the same point you were: that intricate and/or
large structures are meant to be passed around by a top-level pointer,
not using and serialization/messaging. This is what I've been trying
to explain to others here; that IPC and shared memory unfortunately
aren't viable options, leaving app threads (rather than child
processes) as the solution.
Andy,

Why don't you just use a temporary file
system (ram disk) to store the data that
your app is manipulating? All you need to
pass around then is a file descriptor.

--JamesMills

--
--
-- "Problems are solved by method"
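A minimal sketch of that idea (mine, purely illustrative; it assumes a RAM-backed tmpfs such as /dev/shm is available and falls back to an ordinary temp directory otherwise):

```python
import os
import tempfile

# Producer: write the blob once to a (preferably RAM-backed) file;
# only the small file descriptor gets handed around afterwards.
ramdisk = "/dev/shm" if os.path.isdir("/dev/shm") else None
fd, path = tempfile.mkstemp(dir=ramdisk)
os.write(fd, b"frame-data" * 1000)

# Consumer: rewind and read via the same descriptor -- no pickling.
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 10 * 1000)
os.close(fd)
os.unlink(path)
assert data == b"frame-data" * 1000
```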
Oct 27 '08 #65
Andy O'Meara wrote:
On Oct 24, 9:52 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>>>A c-level module, on the other hand, can sidestep/release
the GIL at will, and go on its merry way and process away.
...Unless part of the C module execution involves the need to do CPU-
bound work on another thread through a different python interpreter,
right?
Wrong.
[...]
>
So I think the disconnect here is that maybe you're envisioning
threads being created *in* python. To be clear, we're talking about
making threads at the app level and making it a given for the app to
take its safety into its own hands.
No. Whether or not threads are created by Python or the application
does not matter for my "Wrong" evaluation: in either case, C module
execution can easily side-step/release the GIL.
>>As far as I can tell, it seems
>>CPython's current state can't do CPU-bound parallelization in the same
>>address space.
That's not true.

Well, when you're talking about large, intricate data structures
(which include opaque OS object refs that use process-associated
allocators), even a shared memory region between the child process and
the parent can't do the job. Otherwise, please describe in detail how
I'd get an opaque OS object (e.g. an OS ref that refers to memory-
resident video) from the child process back to the parent process.
WHAT PARENT PROCESS? "In the same address space", to me, means
"a single process only, not multiple processes, and no parent process
anywhere". If you have just multiple threads, the notion of passing
data from a "child process" back to the "parent process" is
meaningless.
Again, the big picture that I'm trying to plant here is that there
really is a serious need for truly independent interpreters/contexts
in a shared address space.
I understand that this is your mission in this thread. However, why
is that your problem? Why can't you just use the existing (limited)
multiple-interpreters machinery, and solve your problems with that?
For most
industry-caliber packages, the expectation and convention (unless
documented otherwise) is that the app can make as many contexts as it
wants in whatever threads it wants because the convention is that the
app must (a) never use one context's objects in another context,
and (b) never use a context at the same time from more than one
thread. That's all I'm really trying to look at here.
And that's indeed the case for Python, too. The app can make as many
subinterpreters as it wants to, and it must not pass objects from one
subinterpreter to another one, nor should it use a single interpreter
from more than one thread (although that is actually supported by
Python - but it surely won't hurt if you restrict yourself to a single
thread per interpreter).

Regards,
Martin
Oct 27 '08 #66
On Oct 26, 6:57 pm, "Andy O'Meara" <and...@gmail.com> wrote:
[...]
I'm with you on all counts, so no disagreement there. On the "passing
a ptr everywhere" issue, perhaps one idea is that all objects could
have an additional field that would point back to their parent context
(i.e. their interpreter). So the only prototypes that would have to be
modified to contain the context ptr would be the ones that don't
inherently operate on objects (e.g. importing a module).
Trying to directly share objects like this is going to create
contention. The refcounting becomes the sequential portion of
Amdahl's Law. This is why safethread doesn't scale very well: it
shares a massive number of objects.

An alternative, actually simpler, is to create proxies to your real
object. The proxy object has a pointer to the real object and the
context containing it. When you call a method it serializes the
arguments, acquires the target context's GIL (while releasing yours),
and deserializes in the target context. Once the method returns it
reverses the process.

There are two reasons why this may perform well for you: First,
operations done purely in C may cheat (if so designed). A copy from
one memory buffer to another memory buffer may be given two proxies as
arguments, but then operate directly on the target objects (i.e. without
serialization).

Second, if a target context is idle you can enter it (acquiring its
GIL) without any context switch.

Of course that scenario is full of "maybes", which is why I have
little interest in it.

An even better scenario is if your memory buffer's methods are in pure
C and it's a simple object (no pointers). You can stick the memory
buffer in shared memory and have multiple processes manipulate it from
C. More "maybes".
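Roughly, the proxy idea looks like this in pure python (a hypothetical sketch with invented names; each "context" is a worker thread owning its own objects, and the proxy pickles arguments across the boundary):

```python
import pickle
import queue
import threading

class Context:
    """A worker thread that owns its objects; calls arrive as jobs."""
    def __init__(self):
        self.jobs = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            obj, name, args, reply = self.jobs.get()
            # Deserialize in the owning context, run the method there,
            # then serialize the result back out.
            result = getattr(obj, name)(*pickle.loads(args))
            reply.put(pickle.dumps(result))

class Proxy:
    """Holds a pointer to the real object plus its owning context."""
    def __init__(self, obj, ctx):
        self._obj, self._ctx = obj, ctx

    def call(self, name, *args):
        reply = queue.Queue()
        self._ctx.jobs.put((self._obj, name, pickle.dumps(args), reply))
        return pickle.loads(reply.get())

ctx = Context()
buf = Proxy(bytearray(16), ctx)          # real object lives "in" ctx
buf.call("__setitem__", 0, 65)           # mutate via the proxy
assert buf.call("__getitem__", 0) == 65
```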

An evil trick if you need pointers, but control the allocation, is to
take advantage of the fork model. Have a master process create a
bunch of blank files (temp files if linux doesn't allow /dev/zero),
mmap them all using MAP_SHARED, then fork and utilize. The addresses
will be inherited from the master process, so any pointers within them
will be usable across all processes. If you ever want to return
memory to the system you can close that file, then have all processes
use MAP_SHARED|MAP_FIXED to overwrite it. Evil, but should be
disturbingly effective, and still doesn't require modifying CPython.
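The fork trick might look roughly like this in python (a POSIX-only sketch of mine: it assumes os.fork and MAP_SHARED semantics, and only shows the inheritance part, not the MAP_FIXED reclaim step):

```python
import mmap
import os
import tempfile

# Master maps a blank file MAP_SHARED, then forks. The mapping is
# inherited at the same address in the child, so pointers stored
# inside the buffer stay valid across processes.
size = 4096
f = tempfile.TemporaryFile()
f.truncate(size)
buf = mmap.mmap(f.fileno(), size, mmap.MAP_SHARED)

pid = os.fork()
if pid == 0:                      # child: write through shared pages
    buf[0:5] = b"hello"
    os._exit(0)
os.waitpid(pid, 0)
assert bytes(buf[0:5]) == b"hello"   # parent sees the child's write
```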
Oct 28 '08 #67
Glenn Linderman wrote:
so a 3rd party library might be called to decompress the stream into a
set of independently allocated chunks, each containing one frame (each
possibly consisting of several allocations of memory for associated
metadata) that is independent of other frames
We use a combination of a dictionary + RGB data for this purpose. Using a
dictionary works out pretty nicely for the metadata, and obviously one
attribute holds the frame data as a binary blob.

http://www.kamaelia.org/Components/p...Codec.YUV4MPEG gives some
idea of the structure and usage. The example given there is this:

Pipeline( RateControlledFileReader("video.dirac", readmode="bytes", ...),
          DiracDecoder(),
          FrameToYUV4MPEG(),
          SimpleFileWriter("output.yuv4mpeg")
).run()

Now all of those components are generator components.

That's useful since:
a) we can structure the code to show what it does more clearly, and it
still runs efficiently inside a single process
b) We can change this over to using multiple processes trivially:

ProcessPipeline(
    RateControlledFileReader("video.dirac", readmode="bytes", ...),
    DiracDecoder(),
    FrameToYUV4MPEG(),
    SimpleFileWriter("output.yuv4mpeg")
).run()

This version uses multiple processes (under the hood using Paul Boddie's
pprocess library, since this support predates the multiprocessing module
support in python).

The big issue with *this* version however is that due to pprocess (and
friends) pickling data to be sent across OS pipes, the data throughput on
this would be lousy. Specifically in this example, if we could change it
such that the high level API was this:

ProcessPipeline(
    RateControlledFileReader("video.dirac", readmode="bytes", ...),
    DiracDecoder(),
    FrameToYUV4MPEG(),
    SimpleFileWriter("output.yuv4mpeg"),
    use_shared_memory_IPC = True,
).run()

That would be pretty useful, for some hopefully obvious reasons. I suppose
ideally we'd just use shared_memory_IPC for everything and just go back to
this:

ProcessPipeline(
    RateControlledFileReader("video.dirac", readmode="bytes", ...),
    DiracDecoder(),
    FrameToYUV4MPEG(),
    SimpleFileWriter("output.yuv4mpeg")
).run()

But essentially for us, this is an optimisation problem, not a "how do I
even begin to use this" problem. Since it is an optimisation problem, it
also strikes me as reasonable to consider it OK to special purpose and
specialise such links until you get an approach that's reasonable for
general purpose data.

In theory, poshmodule.sourceforge.net, with a bit of TLC, would be a good
candidate, or a good starting point, for that optimisation work
(since it does work in Linux, contrary to a reply in the thread - I've not
tested it under windows :).

If someone's interested in building that, then someone redoing our MiniAxon
tutorial using processes & shared memory IPC rather than generators would
be a relatively gentle/structured approach to dealing with this:

* http://www.kamaelia.org/MiniAxon/

The reason I suggest that is because any time we think about fiddling and
creating a new optimisation approach or concurrency approach, we tend to
build a MiniAxon prototype to flesh out the various issues involved.
Michael
--
http://www.kamaelia.org/Home

Oct 28 '08 #68
Philip Semanchuk wrote:
On Oct 25, 2008, at 7:53 AM, Michael Sparks wrote:
>Glenn Linderman wrote:
>>In the module multiprocessing environment could you not use shared
memory, then, for the large shared data items?

If the poshmodule had a bit of TLC, it would be extremely useful for
this... http://poshmodule.sourceforge.net/

Last time I checked that was Windows-only. Has that changed?
I've only tested it under Linux where it worked, but does clearly need a bit
of work :)
The only IPC modules for Unix that I'm aware of are one which I
adopted (for System V semaphores & shared memory) and one which I
wrote (for POSIX semaphores & shared memory).

http://NikitaTheSpider.com/python/shm/
http://semanchuk.com/philip/posix_ipc/
I'll take a look at those - poshmodule does need a bit of TLC and doesn't
appear to be maintained.
If anyone wants to wrap POSH cleverness around them, go for it! If
not, maybe I'll make the time someday.
I personally don't have the time to do this, but I'd be very interested in
hearing of someone building an up-to-date version. (Indeed, something like
this would be extremely useful for everyone to have in the standard library
now that the multiprocessing library is in the standard library)
Michael.
--
http://www.kamaelia.org/Home

Oct 28 '08 #69
On Oct 26, 10:11 pm, "James Mills" <prolo...@shortcircuit.net.au>
wrote:
On Mon, Oct 27, 2008 at 12:03 PM, Andy O'Meara <and...@gmail.com> wrote:
[...]

Andy,

Why don't you just use a temporary file
system (ram disk) to store the data that
your app is manipulating. All you need to
pass around then is a file descriptor.

--JamesMills
Unfortunately, it's the penalty of serialization and unserialization.
When you're talking about stuff like memory-resident images and video
(complete with their intricate and complex codecs), then the only
option is to be passing around a couple pointers rather than take the
hit of serialization (which is huge for video, for example). I've
gone into more detail in some other posts but I could have missed
something.
Andy

Oct 28 '08 #70
