Bytes IT Community

Python does not play well with others

The major complaint I have about Python is that the packages
which connect it to other software components all seem to have
serious problems. As long as you don't need to talk to anything
outside the Python world, you're fine. But once you do, things
go downhill. MySQLdb has version and platform compatibility
problems. So does M2Crypto. The built-in SSL support is weak.
Even basic sockets don't quite work right; the socket module
encapsulates the timeout mechanism but doesn't get it right.

In the Perl, Java, PHP, and C/C++ worlds, the equivalent
functions just work. That's because, in those worlds, either the
development team for the language or the development team
for the subsystem takes responsibility for making them work.
Only Python doesn't do that.

Python has been around long enough that this should have
been fixed by now.

John Nagle
Jan 24 '07
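For context, the timeout mechanism being criticised here is the one exposed through `socket.settimeout()`. A minimal sketch of the documented behaviour (this illustrates the API itself, not the specific bug the poster alludes to):

```python
import socket

# A listening socket with a timeout set: accept() gives up with
# socket.timeout instead of blocking forever.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0: let the kernel pick one
server.listen(1)
server.settimeout(0.1)          # timeout in seconds

try:
    server.accept()             # nobody connects, so this times out
    timed_out = False
except socket.timeout:
    timed_out = True
server.close()
```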
113 Replies


On Feb 6, 5:39 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
John Nagle <n...@animats.com> writes:
The GIL doesn't affect separate processes, and any large server that
cares about stability is going to be running a pre-forking MPM no
matter what language they're supporting.
Pre-forking doesn't reduce load; it just improves responsiveness.
You still pay for loading all the modules on every request. For
many AJAX apps, the loading cost tends to dominate the transaction.

I think the idea is that each pre-forked subprocess has its own
mod_python that services multiple requests serially.
And where 'worker' MPM is used, each child process can be handling
multiple concurrent requests at the same time. Similarly on Windows
although there is only one process.
New to me is the idea that you can have multiple separate Python
interpreters in a SINGLE process (mentioned in another post). I'd
thought that being limited to one interpreter per process was a
significant and hard-to-fix limitation of the current CPython
implementation that's unlikely to be fixed earlier than 3.0.
No such limitation exists with mod_python as it does all the
interpreter creation and management at the Python C API level. The one
interpreter per process limitation is only when using the standard
'python' runtime executable and you are doing everything in Python
code.

Graham
Feb 5 '07 #101

On Feb 5, 12:52 pm, John Nagle <n...@animats.com> wrote:
sjdevn...@yahoo.com wrote:
John Nagle wrote:
>Graham Dumpleton wrote:
>>On Feb 4, 1:05 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
>>>"Paul Boddie" <p...@boddie.org.uk> writes:
Realistically, mod_python is a dead end for large servers,
because Python isn't really multi-threaded. The Global Python
Lock means that a multi-core CPU won't help performance.
The GIL doesn't affect separate processes, and any large server that
cares about stability is going to be running a pre-forking MPM no
matter what language they're supporting.

Pre-forking doesn't reduce load; it just improves responsiveness.
You still pay for loading all the modules on every request.
No, you don't. Each server is persistent and serves many requests--
it's not at all like CGI, and it reuses the loaded Python image.

So if you have, say, an expensive to load Python module, that will
only be executed once for each server you start...e.g. if you have
Apache configured to accept up to 50 connections, the module will be
run at most 50 times; once each of the 50 processes has started up,
they stick around until you restart Apache, unless you've configured
apache to only serve X requests in one process before restarting it.
(The one major feature that mod_python _is_ missing is the ability to
do some setup in the Python module prior to forking. That would make
restarting Apache somewhat nicer).

The major advantage of pre-forking is that you have memory protection
between servers, so a bug in one won't take down the whole apache
server (just the connection(s) that are affected by that bug). Most
shared hosting providers use pre-forking just for these stability
reasons.

A nice side effect of the memory protection is that you have
completely separate Python interpreters in each process--while each
one is reused between connections, they run in independent processes
and the GIL doesn't come into play at all.

Feb 5 '07 #102
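The "loaded once per process, reused across requests" behaviour described above can be sketched with a bare `os.fork()` loop. This is a minimal POSIX illustration, not Apache's actual MPM code; `BIG_TABLE` stands in for an expensive module load:

```python
import os

# Expensive setup done once, at startup: each forked child inherits
# the already-built data instead of recomputing it per request.
BIG_TABLE = {n: n * n for n in range(100_000)}  # stand-in for module loading

def serve(worker_id):
    # Each worker reuses BIG_TABLE across many requests; nothing is
    # reloaded per request, which is the point being made above.
    return BIG_TABLE[worker_id]

children = []
for wid in (1, 2, 3):
    pid = os.fork()
    if pid == 0:          # child: serve, then exit without cleanup
        serve(wid)
        os._exit(0)
    children.append(pid)  # parent: remember child pids

for pid in children:
    os.waitpid(pid, 0)
```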

"Graham Dumpleton" <gr*****@dscpl.com.au> writes:
No such limitation exists with mod_python as it does all the
interpreter creation and management at the Python C API level. The one
interpreter per process limitation is only when using the standard
'python' runtime executable and you are doing everything in Python code.
Oh cool, I thought CPython used global and/or static variables or had
other obstacles to supporting multiple interpreters. Is there a
separate memory pool for each interpreter when you have multiple ones?
So each one has its own copies of 0,1,2,None,..., etc? How big is the
memory footprint per interpreter? I guess there's no way to timeshare
(i.e. with microthreads) between interpreters but that's o.k.
Feb 5 '07 #103

"Graham Dumpleton" <gr*****@dscpl.com.au> writes:
Yes, these per VirtualHost interpreter instances will only be created
on demand in the child process when a request arrives which
necessitates it be created and so there is some first time setup for
that specific interpreter instance at that point, but the main Python
initialisation has already occurred so this is minor.
Well ok, but what if each of those interpreters wants to load, say,
the cookie module? Do you have separate copies of the cookie module
in each interpreter? Does each one take the overhead of loading the
cookie module? It would be neat if there was a way of including
frequently used modules in the shared text segment of the
interpreters, as created during the initial build process. GNU Emacs
used to do something like that with a contraption called "unexec" (it
could dump out parts of its data segment into a pure (shared)
executable that you could then run without the overhead of loading all
those modules) but the capability went away as computers got faster
and it became less common to have a lot of Emacs instances weighing
down timesharing systems. Maybe it's time for a revival of those
techniques.
Feb 5 '07 #104

On Feb 6, 8:57 am, "sjdevn...@yahoo.com" <sjdevn...@yahoo.com> wrote:
On Feb 5, 12:52 pm, John Nagle <n...@animats.com> wrote:
sjdevn...@yahoo.com wrote:
John Nagle wrote:
>>Graham Dumpleton wrote:
>>>On Feb 4, 1:05 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
>>>>"Paul Boddie" <p...@boddie.org.uk> writes:
> Realistically, mod_python is a dead end for large servers,
>>because Python isn't really multi-threaded. The Global Python
>>Lock means that a multi-core CPU won't help performance.
The GIL doesn't affect separate processes, and any large server that
cares about stability is going to be running a pre-forking MPM no
matter what language they're supporting.
Pre-forking doesn't reduce load; it just improves responsiveness.
You still pay for loading all the modules on every request.

No, you don't. Each server is persistent and serves many requests--
it's not at all like CGI, and it reuses the loaded Python image.

So if you have, say, an expensive to load Python module, that will
only be executed once for each server you start...e.g. if you have
Apache configured to accept up to 50 connections, the module will be
run at most 50 times; once each of the 50 processes has started up,
they stick around until you restart Apache, unless you've configured
apache to only serve X requests in one process before restarting it.
(The one major feature that mod_python _is_ missing is the ability to
do some setup in the Python module prior to forking. That would make
restarting Apache somewhat nicer).
There would be a few issues with preloading modules before the main
Apache parent process performs the fork.

The first is whether it would be possible for code to be run with
elevated privileges given that the main Apache process usually is
started as root. I'm not sure at what point it switches to the special
user Apache generally runs as and whether in the main process the way
this switch is done is enough to prevent code getting back root
privileges in some way, so would need to be looked into.

The second issue is that there can be multiple Python interpreters
ultimately created depending on how URLs are mapped, thus it isn't
just an issue with loading a module once, you would need to create all
the interpreters you think might need it and preload it into each. All
this will blow out the memory size of the main Apache process.

There is also much more possibility for code, if it runs up extra
threads, to interfere with the operation of the Apache parent process.
One particular area which could be a problem is where Apache wants to
do a restart, as it will attempt to unload the mod_python module and
reload it. Right now this may not be an issue as mod_python does the
wrong thing and doesn't shutdown Python allowing it to be
reinitialised when mod_python is reloaded, but in mod_wsgi (when
mod_python isn't also being loaded), it will shutdown Python. If there
is user code executing in a thread within the parent process this may
actually stop mod_wsgi from cleanly shutting down Python thus causing
Apache to hang.

All up, the risks of loading extra modules in the parent process
aren't worth it and could just result in things being less stable.

Graham

Feb 5 '07 #105

On Feb 6, 9:15 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
"Graham Dumpleton" <grah...@dscpl.com.au> writes:
Yes, these per VirtualHost interpreter instances will only be created
on demand in the child process when a request arrives which
necessitates it be created and so there is some first time setup for
that specific interpreter instance at that point, but the main Python
initialisation has already occurred so this is minor.

Well ok, but what if each of those interpreters wants to load, say,
the cookie module? Do you have separate copies of the cookie module
in each interpreter? Does each one take the overhead of loading the
cookie module?
Each interpreter instance will have its own copy of any Python based
code modules. You can't avoid this as Python code is so modifiable
that they have to be separate else you would be modifying the same
instance as used by a different interpreter which could screw up the
other applications view of the world. The whole point of having
separate interpreters is to avoid applications trampling on each
other. If you really are concerned about multiple loading, use the
PythonInterpreter directive to specifically say that applications
running under different VirtualHost containers should use the same
interpreter.

Note though, that although you can run multiple applications in one
interpreter in many cases, it may not be able to be done in others.
For example, it is not possible to run two instances of Django within
the one interpreter instance. The first reason as to why this can't be
done is that Django expects certain information about its
configuration to come from os.environ. Since there is only one
os.environ it can't have two different values for each application at
the same time. Some may argue that in 'prefork' you could just change
os.environ to be correct for the application for the current request
and this effectively is what the mod_python adapter for Django does,
but this will fail when 'worker' MPM or Windows is used. I suspect
this is where the idea that Django can't be run on 'worker' MPM
came from. Although the documentation for Django suggests it is a
mod_python problem, it is actually a Django problem. This use of
os.environ by Django also means that Django isn't a well behaved WSGI
application component. :-(
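The os.environ problem described above is easy to see in miniature: there is exactly one environment per process, so a second site's setting clobbers the first. The settings-module names here are hypothetical, purely for illustration:

```python
import os

# Site A's adapter sets the process-wide environment variable...
os.environ["DJANGO_SETTINGS_MODULE"] = "site_a.settings"
# ...and anything serving site A now reads the right value.

# But as soon as site B is configured in the same process, the
# single shared value is overwritten: any thread still serving a
# request for site A now sees site B's settings.
os.environ["DJANGO_SETTINGS_MODULE"] = "site_b.settings"

current = os.environ["DJANGO_SETTINGS_MODULE"]
```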
It would be neat if there was a way of including
frequently used modules in the shared text segment of the
interpreters, as created during the initial build process. GNU Emacs
used to do something like that with a contraption called "unexec" (it
could dump out parts of its data segment into a pure (shared)
executable that you could then run without the overhead of loading all
those modules) but the capability went away as computers got faster
and it became less common to have a lot of Emacs instances weighing
down timesharing systems. Maybe it's time for a revival of those
techniques.
I don't see it as being applicable. Do note that provided there are
precompiled byte code files for .py files then load time is at least
reduced because Python doesn't have to recompile the code. This
actually can be quite significant.

Graham

Feb 5 '07 #106
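Graham's point about precompiled byte code can be checked directly with the standard `py_compile` module. The toy module written to a temp directory is purely illustrative:

```python
import os
import py_compile
import tempfile

# Write a trivial module and byte-compile it ahead of time, as a
# build step would; importing it later skips the compile phase.
src = os.path.join(tempfile.mkdtemp(), "heavy_module.py")
with open(src, "w") as f:
    f.write("VALUE = 42\n")

# py_compile.compile() returns the path of the .pyc it produced.
pyc = py_compile.compile(src)
```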

"Graham Dumpleton" <gr*****@dscpl.com.au> writes:
The first is whether it would be possible for code to be run with
elevated privileges given that the main Apache process usually is
started as root. I'm not sure at what point it switches to the special
user Apache generally runs as and whether in the main process the way
this switch is done is enough to prevent code getting back root
privileges in some way, so would need to be looked into.
It switches very early, I think. It starts as root so it can listen
on port 80.
There is also much more possibility for code, if it runs up extra
threads, to interfere with the operation of the Apache parent process.
Certainly launching any new threads should be postponed til after the
fork.
Feb 5 '07 #107
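The pattern Paul alludes to, bind the privileged port while still root, then drop to an unprivileged user, can be sketched in Python. The uid/gid of 65534 (conventionally "nobody") is an assumption, and the demo binds an ephemeral port so it also runs unprivileged, where the drop is simply skipped:

```python
import os
import socket

def bind_then_drop_privileges(port, uid=65534, gid=65534):
    # Bind while (possibly) privileged: ports below 1024 need root.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    s.listen(5)
    if os.getuid() == 0:
        # Order matters: drop the group first; after setuid() we
        # would no longer have permission to change groups.
        os.setgid(gid)
        os.setuid(uid)
    return s

# Demo: port 0 asks the kernel for an ephemeral port.
listener = bind_then_drop_privileges(0)
bound_port = listener.getsockname()[1]
listener.close()
```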

On Feb 6, 10:15 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
"Graham Dumpleton" <grah...@dscpl.com.au> writes:
There is also much more possibility for code, if it runs up extra
threads, to interfere with the operation of the Apache parent process.

Certainly launching any new threads should be postponed til after the
fork.
Except that you can't outright prevent it from being done as a Python
module could create the threads as a side effect of the module import
itself. I guess though if you load a module which does that and it
screws things up, then you have brought it on yourself as it would
have been your choice to make mod_python load it in the first place if
the feature was there. :-)

Feb 5 '07 #108

"Graham Dumpleton" <gr*****@dscpl.com.au> writes:
Certainly launching any new threads should be postponed til after the
fork.

Except that you can't outright prevent it from being done as a Python
module could create the threads as a side effect of the module import
itself.
Yeah, the preload would have to be part of the server configuration,
requiring appropriate care in choosing the preloaded modules (they'd
normally be stdlib modules which rarely do uncivilized things like
launch new threads on import). It wouldn't do to let random user
scripts into the preload.

One could imagine languages in which this could be enforced by a
static type system. Hmm.
Feb 6 '07 #109

On Feb 5, 5:45 pm, "Graham Dumpleton" <grah...@dscpl.com.au> wrote:
On Feb 6, 8:57 am, "sjdevn...@yahoo.com" <sjdevn...@yahoo.com> wrote:
On Feb 5, 12:52 pm, John Nagle <n...@animats.com> wrote:
sjdevn...@yahoo.com wrote:
John Nagle wrote:
>Graham Dumpleton wrote:
>>On Feb 4, 1:05 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
>>>"Paul Boddie" <p...@boddie.org.uk> writes:
Realistically, mod_python is a dead end for large servers,
>because Python isn't really multi-threaded. The Global Python
>Lock means that a multi-core CPU won't help performance.
The GIL doesn't affect separate processes, and any large server that
cares about stability is going to be running a pre-forking MPM no
matter what language they're supporting.
Pre-forking doesn't reduce load; it just improves responsiveness.
You still pay for loading all the modules on every request.
No, you don't. Each server is persistent and serves many requests--
it's not at all like CGI, and it reuses the loaded Python image.
So if you have, say, an expensive to load Python module, that will
only be executed once for each server you start...e.g. if you have
Apache configured to accept up to 50 connections, the module will be
run at most 50 times; once each of the 50 processes has started up,
they stick around until you restart Apache, unless you've configured
apache to only serve X requests in one process before restarting it.
(The one major feature that mod_python _is_ missing is the ability to
do some setup in the Python module prior to forking. That would make
restarting Apache somewhat nicer).

There would be a few issues with preloading modules before the main
Apache parent process performs the fork.

The first is whether it would be possible for code to be run with
elevated privileges given that the main Apache process usually is
started as root. I'm not sure at what point it switches to the special
user Apache generally runs as and whether in the main process the way
this switch is done is enough to prevent code getting back root
privileges in some way, so would need to be looked into.
In our case, the issue is this: we load a ton of info at server
restart, from the database. Some of it gets processed a bit based on
configuration files and so forth. If this were done in my own C
server, I'd do all of that and set up the (read-only) runtime data
structures prior to forking. That would mean that:
a) The processing time would be lower since you're just doing the pre-
processing once; and
b) The memory footprint could be lower if large data structures were
created prior to fork; they'd be in shared copy-on-write pages.

b) isn't really possible in Python as far as I can tell (you're going
to wind up touching the reference counts when you get pointers to
objects in the page, so everything's going to get copied into your
process eventually), but a) would be very nice to have.
The second issue is that there can be multiple Python interpreters
ultimately created depending on how URLs are mapped, thus it isn't
just an issue with loading a module once, you would need to create all
the interpreters you think might need it and preload it into each. All
this will blow out the memory size of the main Apache process.
It'll blow out the children, too, though. Most real-world
implementations I've seen just use one interpreter, so even a solution
that didn't account for this would be very useful in practice.
There is also much more possibility for code, if it runs up extra
threads, to interfere with the operation of the Apache parent process.
Yeah, you don't want to run threads in the parent (I'm not sure many
big mission-critical sites use multiple threads anyway, certainly none
of the 3 places I've worked at did). You don't want to allow
untrusted code. You have to be careful, and you should treat anything
run there as part of the server configuration.

But it would still be mighty nice. We're considering migrating to
another platform (still Python-based) because of this issue, but
that's only because we've gotten big enough (in terms of "many big fat
servers sucking up CPU on one machine", not "tons of traffic") that
it's finally an issue. mod_python is still very nice and frankly if
our startup coding was a little less piggish it might not be an issue
even now--on the other hand, we've gotten a lot of flexibility out of
our approach, and the code base is up to 325,000 lines of python or
so. We might be able to refactor things to cut down on startup costs,
but in general a way to call startup code only once seems like the
Right Thing(TM).

Feb 6 '07 #110

"sj*******@yahoo.com" <sj*******@yahoo.com> writes:
In our case, the issue is this: we load a ton of info at server
restart, from the database. Some of it gets processed a bit based on
configuration files and so forth. If this were done in my own C
server, I'd do all of that and set up the (read-only) runtime data
structures prior to forking. That would mean that:
a) The processing time would be lower since you're just doing the pre-
processing once; and
b) The memory footprint could be lower if large data structures were
created prior to fork; they'd be in shared copy-on-write pages.
If you completely control the server, write an apache module that
dumps this data into a file on startup, then mmap it into your Python app.
Feb 6 '07 #111
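Paul's mmap suggestion, sketched with the standard library. The file name and contents are made up for illustration; in the real setup an Apache module would produce the file at startup:

```python
import mmap
import os
import tempfile

# Producer side: dump the preprocessed startup data to a file once.
path = os.path.join(tempfile.mkdtemp(), "startup.dat")
with open(path, "wb") as f:
    f.write(b"preprocessed-startup-data")

# Consumer side (each child): map the file read-only. The OS shares
# the mapped pages between processes, so the data occupies physical
# memory only once no matter how many children map it.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

prefix = mapped[:12]
```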

On Feb 6, 4:27 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
"sjdevn...@yahoo.com" <sjdevn...@yahoo.com> writes:
In our case, the issue is this: we load a ton of info at server
restart, from the database. Some of it gets processed a bit based on
configuration files and so forth. If this were done in my own C
server, I'd do all of that and set up the (read-only) runtime data
structures prior to forking. That would mean that:
a) The processing time would be lower since you're just doing the pre-
processing once; and
b) The memory footprint could be lower if large data structures were
created prior to fork; they'd be in shared copy-on-write pages.

If you completely control the server, write an apache module that
dumps this data into a file on startup, then mmap it into your Python app.
The final data after loading is in the form of a bunch of python
objects in a number of complex data structures, so that's not really a
good solution as far as I can tell. We read in a bunch of data from
the database and build a data layer describing all the various classes
(and some kinds of global configuration data, etc) used by the various
applications in the system.

It's possible that we could build it all in a startup module and then
pickle everything we've built into a file that each child would
unpickle, but I'm a bit leery about that approach.

Feb 6 '07 #112
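The pickle-per-child idea mentioned above, as a minimal sketch (the path and data are illustrative). Note the caveat that matches the poster's leeriness: each child gets its own unpickled copy, so this saves startup processing time but not memory:

```python
import os
import pickle
import tempfile

# Parent / build step: construct the expensive structures once
# (here a stand-in dict) and pickle them to disk.
startup_data = {"classes": ["Invoice", "Customer"],
                "config": {"debug": False}}
path = os.path.join(tempfile.mkdtemp(), "startup.pickle")
with open(path, "wb") as f:
    pickle.dump(startup_data, f)

# Each child: unpickle at startup instead of re-reading the
# database and redoing the preprocessing.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```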

"sj*******@yahoo.com" <sj*******@yahoo.com> writes:
It's possible that we could build it all in a startup module and then
pickle everything we've built into a file that each child would
unpickle, but I'm a bit leery about that approach.
Yeah, that's not so great. You could look at POSH.
Feb 6 '07 #113

You have to understand that Python is not like other languages. I have
been working with it for 3 months, and in that time I learned more than
through C, C++, Java, or PHP. And what is PHP? A language developed
primarily for web applications. Take Zope and you have the same thing;
besides that, Zope will do fantastic things when combined with other
modules.
C and C++ are incredible languages, and if I am not wrong, C is the way
to make applications that run in minimum time (not counting assembler).
It has been developed for quite a long time, and the research has been
sponsored even by governments. The development of PHP continues while
Rasmus Lerdorf works for Yahoo. The Python community is not as big as
the PHP one (at least in my country), but people like me, who fell in
love with Python, work on our projects; we know what we want, and if it
doesn't work, we make it work. I am working on a project and had no
filters I needed for signal processing, so I wrote them. My next step,
when they are finished and work well, is to send them to the
maintainers of PIL. This is open source, and we help each other.

That's open source.

You try Python and you like it or not; you keep using it or not. If
something doesn't work, be so kind and make it work, but please don't
expect someone else to do your "homework". If there are problems,
contact the maintainers and offer them your help.
On Jan 25, 6:17 pm, John Nagle <n...@animats.com> wrote:
Paul Boddie wrote:
On 25 Jan, 12:01, "Ben Sizer" <kylo...@gmail.com> wrote:
I think that is why many of the SIGs are stagnant, why the standard library
has so much fluff yet still lacks in key areas such as multimedia and web
development, etc.
... I think this is also a good insight into why things are as they are
within the core development section of the community, although one can wonder
what some people actively developing the language are actually doing with it
if they are satisfied with the state of some of the standard library
solutions. However, there are lots of factors which motivate people and the
proliferation (or otherwise) of solutions to common problems: whether one
develops one's own solutions as separate projects and/or tries to push for a
consensus, whether one cares about other people using such solutions, whether
one aspires to contributing to the standard library.
Over the years, people have tended towards building their own communities
around their projects rather than attempting to engage the wider Python
community, and I think a motivation behind that has been the intractability
of improving parts of the standard library.

Yes. Working on "frameworks" is perceived as cooler than working
on libraries. Things like Ruby on Rails, Struts, Zope, and Twisted
get attention. There are papers and conferences on these things.
It's hard to get people excited about overhauling
the CGI library, or making mod_python work securely in shared-hosting
environments.

The key distinction between a framework and a library is that users
are expected to make their code fit the framework. In particular,
frameworks aren't expected to play well with each other. If you need
something from Zope and something from Twisted, you're probably not
going to be able to make it work. Libraries, on the other hand,
are expected to play well together. Which means that they have to
handle the hard cases correctly, not just the easy ones.
True. It also doesn't address the issue of development priorities and their
role in defining the platform's own standards
...
I do wonder whether the interests of language/runtime project developers
eventually become completely aligned with the needs of such projects, making
things like "multimedia and web development" seem irrelevant, uninteresting
or tangential. This has worrying implications for the perceived relevance of
Python with regard to certain kinds of solutions, despite the wealth of
independently produced libraries available for the language.

Something like that happened to the C++ standards committee.
The committee was captured by the template fanatics, and most new
standards work involves doing computation at compile time via template
expansion. That's seldom done in production code, yet most of the
standards effort is devoted to making cool template hacks work better.
Meanwhile, real problems, like doing something about memory leaks and buffer
overflows, are ignored by the C++ committee. As a result, C++ is
being displaced by Java and C#, which don't have elaborate templates but do have
memory safety.

I'm not sure how the Python development community will deal with this
problem. But what's happened in the C++ standards world has clearly
been very bad for users of the language. Learn from the mistakes there.

My main concern is with glue code to major packages. The connections
to OpenSSL, MySQL, and Apache (i.e. mod_python) all exist, but have major
weaknesses. If you're doing web applications, those are standard pieces
which need to work right. There's a tendency to treat those as abandonware
and re-implement them as event-driven systems in Twisted. Yet the
main packages aren't seriously broken. It's just that the learning curve
to make a small fix to any of them is substantial, so nobody new takes
on the problem.

John Nagle
Animats

Feb 7 '07 #114


This discussion thread is closed; replies have been disabled.