Bytes | Software Development & Data Engineering Community

2.6, 3.0, and truly independent interpreters

Dear Python dev community,

I'm CTO at a small software company that makes music visualization
software (you can check us out at www.soundspectrum.com). About two
years ago we went with decision to use embedded python in a couple of
our new products, given all the great things about python. We were
close to using lua but for various reasons we decided to go with
python. However, over the last two years, there's been one area of
grief that sometimes makes me think twice about our decision to go
with python...

Some background first... Our software is used for entertainment and
centers around real time, high-performance graphics, so python's
performance, embedded flexibility, and stability are the most
important issues for us. Our software targets a large cross section
of hardware and we currently ship products for Win32, OS X, and the
iPhone and since our customers are end users, our products have to be
robust, have a tidy install footprint, and be foolproof. Basically,
we use embedded python and use it to wrap our high performance C++
class set which wraps OpenGL, DirectX and our own software renderer.
In addition to wrapping our C++ frameworks, we use python to perform
various "worker" tasks on worker threads (e.g. image loading and
processing). However, we require *true* thread/interpreter
independence, so python 2 has been frustrating at times, to say the
least. Please don't start with "but really, python supports multiple
interpreters" because I've been there many many times with people.
And, yes, I'm aware of the multiprocessing module added in 2.6, but
that stuff isn't lightweight and isn't suitable at all for many
environments (including ours). The bottom line is that if you want to
perform independent processing (in python) on different threads, using
the machine's multiple cores to the fullest, then you're out of luck
under python 2.

Sadly, the only way we could get truly independent interpreters was to
put python in a dynamic library, have our installer make a *duplicate*
copy of it during the installation process (e.g. python.dll/.bundle ->
python2.dll/.bundle) and load each one explicitly in our app. In
other words, we load a
fresh dynamic lib for each thread-independent interpreter (you can't
reuse the same dynamic library because the OS will just reference the
already-loaded one).
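The OS behavior forcing the duplicate-copy trick has a close analogy inside Python itself: just as dlopen() hands back the already-loaded image when asked for the same library twice, the import machinery hands back the cached module object on a repeat import. A rough illustrative sketch (pure Python analogy only; the actual workaround operates on python.dll/.bundle files):

```python
import importlib

# Importing the same module twice yields the *same* object -- state is
# shared, not duplicated, exactly like dlopen() on an already-loaded lib.
m1 = importlib.import_module("json")
m2 = importlib.import_module("json")
assert m1 is m2

# Hence the workaround in the post: to get genuinely separate state you
# must load a physically distinct copy (python2.dll alongside python.dll).
```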

From what I gather from the python community, the basis for not
offering "real" multi-threaded support is that it'd add too much
internal overhead--and I couldn't agree more. As a high performance C
and C++ guy, I fully agree that thread safety should be at the high
level, not at the low level. BUT, the lack of truly independent
interpreters is what ultimately prevents using python in cool,
powerful ways. This shortcoming alone has caused game developers--
both large and small--to choose other embedded interpreters over
python (e.g. Blizzard chose lua over python). For example, Apple's
QuickTime API is powerful in that high-level instance objects can
leverage performance gains associated with multi-threaded processing.
Meanwhile, the QuickTime API simply lists the responsibilities of the
caller regarding thread safety and that's all it needs to do. In
other words, CPython doesn't need to step in and provide a threadsafe
environment; it just needs to establish the rules and make sure that
its own implementation supports those rules.

More than once, I had actually considered expending company resources
to develop a high performance, truly independent interpreter
implementation of the python core language and modules but in the end
estimated that the size of that project would just be too much, given
our company's current resources. Should such an implementation ever
be developed, it would be very attractive for companies to support,
fund, and/or license. The truth is, we just love python as a
language, but its lack of true interpreter independence (in an
interpreter as well as in a thread sense) remains a *huge* liability.

So, my question becomes: is python 3 ready for true multithreaded
support?? Can we finally abandon our Frankenstein approach of loading
multiple identical dynamic libs to achieve truly independent
interpreters?? I've reviewed all the new python 3 C API module stuff,
and all I have to say is: whew--better late than never!! So, although
that solves modules offering truly independent interpreter support,
the following questions remain:

- In python 3, the C module API now supports true interpreter
independence, but have all the modules in the python codebase been
converted over? Are they all now truly compliant? It will only take
a single static/global state variable in a module to potentially cause
no end of pain in a multiple interpreter environment! Yikes!

- How close is python 3 really to true multithreaded use? The
assumption here is that the caller ensures safety (e.g. ensuring that
neither interpreter is in use when serializing data from one to
another).
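The first concern above can be sketched in pure Python: a single piece of module-level state (standing in here for a C extension's static variable; the module and names are hypothetical) silently leaks between two logical contexts.

```python
import types

# Stand-in for a C extension module with one static/global variable.
ext = types.ModuleType("fake_ext")
ext.last_error = None

def run_in_context(name):
    # Each "interpreter" believes it owns this module, but the
    # static state is process-wide.
    ext.last_error = name
    return ext.last_error

run_in_context("interpreter-A")
seen_by_B = ext.last_error  # interpreter B observes A's state: the bug
assert seen_by_B == "interpreter-A"
```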

I believe that true thread/interpreter independence in python is
paramount and should become the top priority, because this is the key
consideration used by developers when they're deciding which
interpreter to embed in their app. Until there's a hello world that
demonstrates running independent python interpreters on multiple app
threads, lua will remain the clear choice over python. Python 3 needs
true interpreter independence and multi-threaded support!
Thanks,
Andy O'Meara
Oct 22 '08
These discussions pop up every year or so and I think that most of them
are not really all that necessary, since the GIL isn't all that bad.

Some pointers into the past:

* http://effbot.org/pyfaq/can-t-we-get...reter-lock.htm
Fredrik on the GIL

* http://mail.python.org/pipermail/pyt...il/003605.html
Greg Stein's proposal to move forward on free threading

* http://www.sauria.com/~twl/conferenc...20Google.notes
(scroll down to the Q&A section)
Greg Stein on whether the GIL really does matter that much

Furthermore, there are lots of ways to tune the CPython VM to make
it more or less responsive to thread switches via the various sys.set*()
functions in the sys module.

Most computing or I/O intense C extensions, built-in modules and object
implementations already release the GIL for you, so it usually doesn't
get in the way all that often.

So you have the option of using a single process with multiple
threads, allowing efficient sharing of data. Or you use multiple
processes and OS mechanisms to share data (shared memory, memory
mapped files, message passing, pipes, shared file descriptors, etc.).

Both have their pros and cons.

There's no general answer to the
problem of how to make best use of multi-core processors, multiple
linked processors or any of the more advanced parallel processing
mechanisms (http://en.wikipedia.org/wiki/Parallel_computing).
The answers will always have to be application specific.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 25 2008)
>>Python/Zope Consulting and Support ... http://www.egenix.com/
mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________

:::: Try mxODBC.Zope.DA for Windows, Linux, Solaris, MacOSX for free ! ::::
eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
Oct 25 '08 #51
Glenn Linderman wrote:
On approximately 10/24/2008 8:39 PM, came the following characters from
the keyboard of Terry Reedy:
>Glenn Linderman wrote:
>>For example, Python presently has a rather stupid algorithm for
string concatenation.
Yes, CPython2.x, x<=5 did.
>Python the language has syntax and semantics. Python implementations
have algorithms that fulfill the defined semantics.

I can buy that, but when Python is not qualified, CPython should be
assumed, as it predominates.
People do that, and it sometimes leads to unnecessary confusion. As to
the present discussion, is it about
* changing Python, the language
* changing all Python implementations
* changing CPython, the leading implementation
* branching CPython with a compiler switch, much as there was one for
including Unicode or not.
* forking CPython
* modifying an existing module
* adding a new module
* making better use of the existing facilities
* some combination of the above
Of course, the latest official release
should probably also be assumed, but that is so recent,
People do that, and it sometimes leads to unnecessary confusion. People
routinely post version-specific problems and questions without
specifying the version (or platform when relevant). In a month or so,
there will be *2* latest official releases. There will be more
confusion without qualification.
few have likely
upgraded as yet... I should have qualified the statement.
* Is the target of this discussion 2.7 or 3.1 (some changes would be 3.1
only).

[diversion to the side topic]
>If there is more than one reference to a guaranteed immutable object,
such as a string, the 'stupid' algorithm seems necessary to me.
In-place modification of a shared immutable would violate semantics.

Absolutely. But after the first iteration, there is only one reference
to string.
Which is to say, 'string' is the only reference to the object it refers
to. You are right, so I presume that the optimization described would
then kick in. But I have not read the code, and CPython optimizations
are not part of the *language* reference.

[back to the main topic]

There is some discussion/debate/confusion about how much of the stdlib
is 'standard Python library' versus 'standard CPython library'. [And
there is some feeling that standard Python modules should have a default
Python implementation that any implementation can use until it
optionally replaces it with a faster compiled version.] Hence my
question about the target of this discussion and the first three options
listed above.

Terry Jan Reedy

Oct 25 '08 #52

On Oct 25, 2008, at 7:53 AM, Michael Sparks wrote:
Glenn Linderman wrote:
>In the module multiprocessing environment could you not use shared
memory, then, for the large shared data items?

If the posh module had a bit of TLC, it would be extremely useful for
this, since it does (surprisingly) still work with python 2.5, but it
does need a bit of TLC to make it usable.
http://poshmodule.sourceforge.net/
Last time I checked that was Windows-only. Has that changed?

The only IPC modules for Unix that I'm aware of are one which I
adopted (for System V semaphores & shared memory) and one which I
wrote (for POSIX semaphores & shared memory).

http://NikitaTheSpider.com/python/shm/
http://semanchuk.com/philip/posix_ipc/
If anyone wants to wrap POSH cleverness around them, go for it! If
not, maybe I'll make the time someday.

Cheers
Philip
Oct 25 '08 #53
>There are a number of problems with that approach. The biggest one is
>that it is theoretical.

Not theoretical. Used successfully in Perl.
Perhaps it is indeed what Perl does, I know nothing about that.
However, it *is* theoretical for Python. Please trust me that
there are many many many many pitfalls in it, each needing a
separate solution, most likely with no equivalent in Perl.

If you had a working patch, *then* it would be practical.
Granted Perl is quite a
different language than Python, but then there are some basic
similarities in the concepts.
Yes - just as much as both are implemented in C :-(
Perhaps you should list the problems, instead of vaguely claiming that
there are a number of them. Hard to respond to such a vague claim.
As I said: go implement it, and you will find out. Unless you are
really going at an implementation, I don't want to spend my time
explaining it to you.
But the approach is sound; nearly any monolithic
program can be turned into a multithreaded program containing one
monolith per thread using such a technique.
I'm not debating that. I just claim that it is far from simple.

Regards,
Martin
Oct 25 '08 #54
On Oct 24, 9:52 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
A c-level module, on the other hand, can sidestep/release
the GIL at will, and go on it's merry way and process away.
...Unless part of the C module execution involves the need to do CPU-
bound work on another thread through a different python interpreter,
right?

Wrong.
(even if the interpreter is 100% independent, yikes).

Again, wrong.
For
example, have a python C module designed to programmatically generate
images (and video frames) in RAM for immediate and subsequent use in
animation. Meanwhile, we'd like to have a pthread with its own
interpreter with an instance of this module and have it dequeue jobs
as they come in (in fact, there'd be one of these threads for each
excess core present on the machine).

I don't understand how this example involves multiple threads. You
mention a single thread (running the module), and you mention designing
a module. Where is the second thread?
Glenn seems to be following me here... The point is to have as many
threads as the app wants, each in its own world, running without
restriction (performance wise). Maybe the app wants to run a thread
for each extra core on the machine.

Perhaps the disconnect here is that when I've been saying "start a
thread", I mean the app starts an OS thread (e.g. pthread) with the
given that any contact with other threads is managed at the app level
(as opposed to starting threads through python). So, as far as python
knows, there's zero mention or use of threading in any way,
*anywhere*.

As far as I can tell, it seems
CPython's current state can't support CPU-bound parallelization in the
same address space.

That's not true.
Um... So let's say you have an opaque object ref from the OS that
represents hundreds of megs of data (e.g. memory-resident video). How
do you get that back to the parent process without serialization and
IPC? What should really happen is just use the same address space so
just a pointer changes hands. THAT's why I'm saying that a separate
address space is generally a deal breaker when you have large or
intricate data sets (ie. when performance matters).

Andy
Oct 25 '08 #55
On Oct 24, 9:40 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
It seems to me that the very simplest move would be to remove global
static data so the app could provide all thread-related data, which
Andy suggests through references to the QuickTime API. This would
suggest compiling python without thread support so as to leave it up
to the application.

I'm not sure whether you realize that this is not simple at all.
Consider this fragment

    if (string == Py_None || index >= state->lastmark ||
        !state->mark[index] || !state->mark[index+1]) {
        if (empty)
            /* want empty string */
            i = j = 0;
        else {
            Py_INCREF(Py_None);
            return Py_None;

The way to think about it is that, ideally in PyC, there are never any
global variables. Instead, all "globals" are now part of a context
(i.e. an interpreter) and it would presumably be illegal to ever use
them in a different context. I'd say this is already the expectation
and convention for any modern, industry-grade software package
marketed as extension for apps. Industry app developers just want to
drop in a 3rd party package, make as many contexts as they want (in as
many threads as they want), and expect to use each context without
restriction (since they're ensuring contexts never interact with each
other). For example, if I use zlib, libpng, or libjpg, I can make as
many contexts as I want and put them in whatever threads I want. In
the app, the only thing I'm on the hook for is to: (a) never use
objects from one context in another context, and (b) ensure that I
never make any calls into a module from more than one thread at the
same time. Both of these requirements are trivial to follow in the
"embarrassingly easy" parallelization scenarios, and that's why I
started this thread in the first place. :^)
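Andy's zlib example can be made concrete from Python: each thread builds and uses its own compression context, never touching another thread's, which satisfies both of his requirements (a) and (b). A minimal sketch:

```python
import threading
import zlib

def compress_job(data, out, i):
    ctx = zlib.compressobj()          # a private context per thread: (a)
    out[i] = ctx.compress(data) + ctx.flush()

data = b"spam" * 1000
out = [None, None]
threads = [threading.Thread(target=compress_job, args=(data, out, i))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each context produced a complete, valid, independent stream.
assert zlib.decompress(out[0]) == data
assert zlib.decompress(out[1]) == data
```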

Andy

Oct 25 '08 #56
On Oct 24, 10:24 pm, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
>
And in the case of hundreds of megs of data

... and I would be surprised at someone that would embed hundreds of
megs of data into an object such that it had to be serialized... seems
like the proper design is to point at the data, or a subset of it, in a
big buffer. Then data transfers would just transfer the offset/length
and the reference to the buffer.
and/or thousands of data structure instances,

... and this is another surprise! You have thousands of objects (data
structure instances) to move from one thread to another?
Heh, no, we're actually in agreement here. I'm saying that in the
case where the data sets are large and/or intricate, a single top-
level pointer changing hands is *always* the way to go rather than
serialization. For example, suppose you had some nifty python code
and C procs that were doing lots of image analysis, outputting tons of
intricate and rich data structures. Once the thread is done with that
job, all that output is trivially transferred back to the appropriate
thread by a pointer changing hands.
>
Of course, I know that data get large, but typical multimedia streams
are large, binary blobs. I was under the impression that processing
them usually proceeds along the lines of keeping offsets into the blobs,
and interpreting, etc. Editing is usually done by making a copy of a
blob, transforming it or a subset in some manner during the copy
process, resulting in a new, possibly different-sized blob.
No, you're definitely right-on, with the additional point that the
representation of multimedia usually employs intricate and diverse
data structures (imagine the data structure representation of a movie
encoded in a modern codec, such as H.264, complete with paths, regions,
pixel flow, geometry, transformations, and textures). As we both
agree, that's something that you *definitely* want to move around via
a single pointer (and not in a serialized form). Hence, my position
that apps that use python can't be forced to go through IPC or else:
(a) there's a performance/resource waste to serialize and unserialize
large or intricate data sets, and (b) they're required to write and
maintain serialization code that otherwise doesn't serve any other
purpose.
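The cost asymmetry being argued here is easy to demonstrate: handing over a reference (the "pointer changing hands") copies nothing, while IPC-style serialization duplicates every byte. A small sketch (the blob is a stand-in for the video frames discussed above):

```python
import pickle

frame = bytes(1_000_000)       # stands in for a large in-memory video frame

view = memoryview(frame)       # "pointer changing hands": zero copy
assert view.obj is frame       # still the very same buffer

wire = pickle.dumps(frame)     # what per-process IPC forces on you
assert len(wire) >= len(frame) # the serialized form duplicates the payload
```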

Andy

Oct 25 '08 #57
Andy O'Meara wrote:
I would definitely agree if there was a context (i.e. environment)
object passed around then perhaps we'd have the best of all worlds.

Moreover, I think this is probably the *only* way that
totally independent interpreters could be realized.

Converting the whole C API to use this strategy would be
a very big project. Also, on the face of it, it seems like
it would render all existing C extension code obsolete,
although it might be possible to do something clever with
macros to create a compatibility layer.

Another thing to consider is that passing all these extra
pointers around everywhere is bound to have some effect
on performance.

Good points--I would agree with you on all counts there. On the
"passing a context everywhere" performance hit, perhaps one idea is
that all objects could have an additional field that would point back
to their parent context (ie. their interpreter). So the only
prototypes that would have to be modified to contain the context ptr
would be the ones that inherently don't take any objects. This would
conveniently and generally correspond to procs associated with
interpreter control (e.g. importing modules, shutting down modules,
etc).
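The back-pointer idea can be sketched in Python for brevity (the real change would be to CPython's C object header; all names here are hypothetical):

```python
class Context:
    """Stands in for an interpreter/context."""
    def __init__(self, name):
        self.name = name

class ObjectSketch:
    """Every object carries a pointer back to its owning context."""
    def __init__(self, ctx, value):
        self.ctx = ctx      # the extra per-object field Andy proposes
        self.value = value

worker = Context("worker-1")
obj = ObjectSketch(worker, 42)

# Any API that receives an object can recover its interpreter from it,
# so only object-free entry points would need an explicit context arg.
assert obj.ctx.name == "worker-1"
```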

Andy O'Meara wrote:
- each worker thread makes its own interpreter, pops scripts off a
work queue, and manages exporting (and then importing) result data to
other parts of the app.

I hope you realize that starting up one of these interpreters
is going to be fairly expensive.
Absolutely. I had just left that issue out in an effort to keep the
discussion pointed, but it's a great point to raise. My response is
that, like any 3rd party industry package, I'd say this is the
expectation (that context startup and shutdown is non-trivial and
should be minimized for performance reasons). For simplicity, my
examples didn't talk about this issue but in practice, it'd be typical
for apps to have their "worker" interpreters persist as they chew
through jobs.
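The persistent-worker pattern described above, sketched with plain threads and queues (the real design would give each worker its own embedded interpreter; this pure-Python stand-in only shows the lifecycle):

```python
import queue
import threading

def worker(jobs, results):
    state = {}                   # per-worker state, never shared
    while True:
        job = jobs.get()
        if job is None:          # shutdown sentinel
            break
        state[job] = job * job   # "chew through" the job
        results.put((job, state[job]))

jobs, results = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(jobs, results))
t.start()                        # started once, persists across jobs
for j in (2, 3, 4):
    jobs.put(j)
jobs.put(None)
t.join()
out = dict(results.get() for _ in range(3))
assert out == {2: 4, 3: 9, 4: 16}
```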
Andy
Oct 25 '08 #58
On Oct 25, 12:29 am, greg <g...@cosc.canterbury.ac.nz> wrote:
Rhamphoryncus wrote:
A list
is not shareable, so it can only be used within the monitor it's
created within, but the list type object is shareable.

Type objects contain dicts, which allow arbitrary values
to be stored in them. What happens if one thread puts
a private object in there? It becomes visible to other
threads using the same type object. If it's not safe
for sharing, bad things happen.

Python's data model is not conducive to making a clear
distinction between "private" and "shared" objects,
except at the level of an entire interpreter.
shareable type objects (enabled by a __future__ import) use a
shareddict, which requires all keys and values to themselves be
shareable objects.

Although it's a significant semantic change, in many cases it's easy
to deal with: replace mutable (unshareable) global constants with
immutable ones (i.e. list -> tuple, set -> frozenset). If you've got
some global state you move it into a monitor (which doesn't scale, but
that's your design). The only time this really fails is when you're
deliberately storing arbitrary mutable objects from any thread, and
later inspecting them from any other thread (such as our new ABC
system's cache). If you want to store an object, but only to give it
back to the original thread, I've got a way to do that.
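The rewrite Rhamphoryncus describes, in miniature. The shareability machinery itself is hypothetical; the immutable replacements are ordinary Python:

```python
# Before: mutable module-level "constants" -- unshareable, since any
# thread could mutate them behind another's back.
FORMATS = ["png", "jpg"]
DEFAULTS = {"gamma", "dither"}

# After: immutable equivalents, safe to share between interpreters.
FORMATS = ("png", "jpg")                    # list -> tuple
DEFAULTS = frozenset({"gamma", "dither"})   # set  -> frozenset

# Read access is unchanged for callers.
assert "png" in FORMATS
assert "gamma" in DEFAULTS
```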
Oct 25 '08 #59
Glenn Linderman wrote:
On approximately 10/25/2008 12:01 AM, came the following characters from
the keyboard of Martin v. Löwis:
>If None remains global, then type(None) also remains global, and
type(None).__bases__[0]. Then type(None).__bases__[0].__subclasses__()
will yield "interesting" results. This is essentially the status quo.

I certainly don't grok the implications of what you say above,
as I barely grok the semantics of it.
Not only is there a link from a class to its base classes, there
is a link to all its subclasses as well.

Since every class is ultimately a subclass of 'object', this means
that starting from *any* object, you can work your way up the
__bases__ chain until you get to 'object', then walk the subclass
hierarchy and find every class in the system.

This means that if any object at all is shared, then all class
objects, and any object reachable from them, are shared as well.
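Greg's point is easy to demonstrate: starting from an arbitrary object, __mro__/__bases__ leads up to object, and __subclasses__() leads back down to every class in the process.

```python
x = 3.14                           # any object at all
root = type(x).__mro__[-1]         # walk up: float -> object
assert root is object

reachable = root.__subclasses__()  # walk down: every direct subclass
assert list in reachable           # unrelated classes are now in reach
assert type(None) in reachable     # ...including NoneType itself
```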

--
Greg
Oct 26 '08 #60

This thread has been closed and replies have been disabled. Please start a new discussion.
