Bytes | Developer Community

The reliability of python threads

Hey everyone, I have a question about python threads. Before anyone
goes further, this is not a debate about threads vs. processes, just a
question.

With that, are Python threads reliable? Or rather, are they safe? I've
had some strange errors in the past; I use threading.Lock for my
critical sections, but I wonder if that is really good enough.
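For reference, my locking is basically the textbook pattern; a stripped-down sketch (the names here are invented for illustration, not my real code):

```python
import threading

lock = threading.Lock()
shared_state = {}  # stand-in for the real shared data

def update(key, value):
    # critical section: only one thread touches shared_state at a time
    lock.acquire()
    try:
        shared_state[key] = value
    finally:
        lock.release()
```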

Does anyone have any conclusive evidence that python threads/locks are
safe or unsafe?

Thanks,

Carl

--

Carl J. Van Arsdall
cv*********@mvista.com
Build and Release
MontaVista Software

Jan 24 '07 #1

In article <ma***************************************@python.org>,
"Carl J. Van Arsdall" <cv*********@mvista.com> writes:
|Hey everyone, I have a question about python threads. Before anyone
|goes further, this is not a debate about threads vs. processes, just a
|question.
|>
|With that, are Python threads reliable? Or rather, are they safe? I've
|had some strange errors in the past, I use threading.Lock for my
|critical sections, but I wonder if that is really good enough.
|>
|Does anyone have any conclusive evidence that python threads/locks are
|safe or unsafe?

Unsafe. They are built on top of unsafe primitives (POSIX, Microsoft
etc.) Python will shield you from some problems, but not all.

There is precious little that you can do, because the root cause is
that the standards and specifications are hopelessly flawed.
Regards,
Nick Maclaren.
Jan 24 '07 #2
On 24 Jan 2007 17:12:19 GMT, Nick Maclaren <nm**@cus.cam.ac.uk> wrote:
>
In article <ma***************************************@python.org>,
"Carl J. Van Arsdall" <cv*********@mvista.com> writes:
|Hey everyone, I have a question about python threads. Before anyone
|goes further, this is not a debate about threads vs. processes, just a
|question.
|>
|With that, are Python threads reliable? Or rather, are they safe? I've
|had some strange errors in the past, I use threading.Lock for my
|critical sections, but I wonder if that is really good enough.
|>
|Does anyone have any conclusive evidence that python threads/locks are
|safe or unsafe?

Unsafe. They are built on top of unsafe primitives (POSIX, Microsoft
etc.) Python will shield you from some problems, but not all.

There is precious little that you can do, because the root cause is
that the standards and specifications are hopelessly flawed.
This is sufficiently inaccurate that I would call it FUD. Using
threads from Python, as from any other language, requires knowledge of
the tradeoffs and limitations of threading, but claiming that their
usage is *inherently* unsafe isn't true. It is almost certain that
your code and locking are flawed, not that the threads underneath you
are buggy.
Jan 24 '07 #3

In article <ma***************************************@python.org>,
"Chris Mellon" <ar*****@gmail.com> writes:
| |>
| |Does anyone have any conclusive evidence that python threads/locks are
| |safe or unsafe?
|
| Unsafe. They are built on top of unsafe primitives (POSIX, Microsoft
| etc.) Python will shield you from some problems, but not all.
|
| There is precious little that you can do, because the root cause is
| that the standards and specifications are hopelessly flawed.
|>
|This is sufficiently inaccurate that I would call it FUD. Using
|threads from Python, as from any other language, requires knowledge of
|the tradeoffs and limitations of threading, but claiming that their
|usage is *inherently* unsafe isn't true. It is almost certain that
|your code and locking are flawed, not that the threads underneath you
|are buggy.

I suggest that you find out rather more about the ill-definition of
the POSIX threading memory model, to name one of the better documented
aspects. A Web search should provide you with more information on
the ghastly mess than any sane person wants to know.

And that is only one of many aspects :-(
Regards,
Nick Maclaren.
Jan 24 '07 #4
On 24 Jan 2007 18:21:38 GMT, Nick Maclaren <nm**@cus.cam.ac.uk> wrote:
>
In article <ma***************************************@python.org>,
"Chris Mellon" <ar*****@gmail.com> writes:
| |>
| |Does anyone have any conclusive evidence that python threads/locks are
| |safe or unsafe?
|
| Unsafe. They are built on top of unsafe primitives (POSIX, Microsoft
| etc.) Python will shield you from some problems, but not all.
|
| There is precious little that you can do, because the root cause is
| that the standards and specifications are hopelessly flawed.
|>
|This is sufficiently inaccurate that I would call it FUD. Using
|threads from Python, as from any other language, requires knowledge of
|the tradeoffs and limitations of threading, but claiming that their
|usage is *inherently* unsafe isn't true. It is almost certain that
|your code and locking are flawed, not that the threads underneath you
|are buggy.

I suggest that you find out rather more about the ill-definition of
POSIX threading memory model, to name one of the better documented
aspects. A Web search should provide you with more information on
the ghastly mess than any sane person wants to know.

And that is only one of many aspects :-(
I'm aware of the issues with the POSIX threading model. I still stand
by my statement - bringing up the problems with the provability of
correctness in the POSIX model amounts to FUD in a discussion of
actual problems with actual code.

Logic and programming errors in user code are far more likely to be
the cause of random errors in a threaded program than theoretical
(I've never come across a case in practice) issues with the POSIX
standard.

Emphasizing this means that people will tend to ignore bugs as being
"the fault of POSIX" rather than either auditing their code more
carefully, or avoiding threads entirely (the second being what I
suspect your goal is).

As a last case, I should point out that while the POSIX memory model
can't be proven safe, concrete implementations do not necessarily
suffer from this problem.
Jan 24 '07 #5
Chris Mellon wrote:
On 24 Jan 2007 18:21:38 GMT, Nick Maclaren <nm**@cus.cam.ac.uk> wrote:
>[snip]


I'm aware of the issues with the POSIX threading model. I still stand
by my statement - bringing up the problems with the provability of
correctness in the POSIX model amounts to FUD in a discussion of
actual problems with actual code.

Logic and programming errors in user code are far more likely to be
the cause of random errors in a threaded program than theoretical
(I've never come across a case in practice) issues with the POSIX
standard.
Yea, typically I would think that. The problem I am seeing is
incredibly intermittent: a simple Pyro server that gives me a
problem maybe every three or four months. Just something funky will
happen to the state of the whole thing, some bad data. I'm having an
issue tracking it down, and some more experienced programmers mentioned
that it's most likely a race condition. The thing is, I'm really not
doing anything too crazy, so I'm having difficulty tracking it down. I
had heard in the past that there may be issues with threads, so I
thought to investigate this side of things.

It still proves difficult, but reassurance about the threading model helps
me focus my efforts.
Emphasizing this means that people will tend to ignore bugs as being
"the fault of POSIX" rather than either auditing their code more
carefully, or avoiding threads entirely (the second being what I
suspect your goal is).

As a last case, I should point out that while the POSIX memory model
can't be proven safe, concrete implementations do not necessarily
suffer from this problem.
Would you consider the Linux implementation of threads to be concrete?

-carl

--

Carl J. Van Arsdall
cv*********@mvista.com
Build and Release
MontaVista Software

Jan 24 '07 #6

In article <ma***************************************@python.org>,
"Carl J. Van Arsdall" <cv*********@mvista.com> writes:
|Chris Mellon wrote:
|
| Logic and programming errors in user code are far more likely to be
| the cause of random errors in a threaded program than theoretical
| (I've never come across a case in practice) issues with the POSIX
| standard.
|
|Yea, typically I would think that. The problem I am seeing is
|incredibly intermittent: a simple Pyro server that gives me a
|problem maybe every three or four months. Just something funky will
|happen to the state of the whole thing, some bad data. I'm having an
|issue tracking it down, and some more experienced programmers mentioned
|that it's most likely a race condition. The thing is, I'm really not
|doing anything too crazy, so I'm having difficulty tracking it down. I
|had heard in the past that there may be issues with threads, so I
|thought to investigate this side of things.

I have seen that many dozens of times on half a dozen Unices, but have
only tracked down the cause in a handful of cases. Of those,
implementation defects that are sanctioned by the standards have
accounted for about half.

Note that the term "race condition" is accurate but misleading! One
of the worst problems with POSIX is that it does not define how
non-memory global state is synchronised. For example, it is possible
for a memory update and an associated signal to occur on different
sides of a synchronisation boundary. Similarly, it is possible for
I/O to sidestep POSIX's synchronisation boundaries. I have seen both.

Perhaps the nastiest is that POSIX leaves it unclear whether the
action of synchronisation is transitive. So, if A synchronises with
B, and then B with C, A may not have synchronised with C. Again, I
have seen that. It can happen on Intel systems, according to the
experts I know.

|Would you consider the Linux implementation of threads to be concrete?

In this sort of area, Linux tends to be saner than most systems, but
remember that it has had MUCH less stress testing on threaded codes
than many other Unices. In fact, it was only a few years ago that
Linux threads became stable enough to be worth using.

Note that failures due to implementation defects and flaws in the
standards are likely to show up in very obscure ways; ones due to
programmer error tend to be much simpler.

If you want to contact me by Email, and can describe technically
what you are doing and (most importantly) what you are assuming, I
may be able to give some hints. But no promises.
Regards,
Nick Maclaren.
Jan 24 '07 #7
In article <ma***************************************@python.org>,
Carl J. Van Arsdall <cv*********@mvista.com> wrote:
>
Hey everyone, I have a question about python threads. Before anyone
goes further, this is not a debate about threads vs. processes, just a
question.

With that, are Python threads reliable? Or rather, are they safe? I've
had some strange errors in the past, I use threading.Lock for my
critical sections, but I wonder if that is really good enough.

Does anyone have any conclusive evidence that python threads/locks are
safe or unsafe?
My response is that you're asking the wrong questions here. Our database
server locked up hard Sunday morning, and we still have no idea why (the
machine itself, not just the database app). I think it's more important
to focus on whether you have done all that is reasonable to make your
application reliable -- and then put your efforts into making your app
recoverable.

I'm particularly making this comment in the context of your later point
about the bug showing up only every three or four months.

Side note: without knowing what error messages you're getting, there's
not much anybody can say about your programs or the reliability of
threads for your application.
--
Aahz (aa**@pythoncraft.com) <* http://www.pythoncraft.com/

Help a hearing-impaired person: http://rule6.info/hearing.html
Jan 24 '07 #8

In article <ep**********@panix3.panix.com>,
aa**@pythoncraft.com (Aahz) writes:
|>
|My response is that you're asking the wrong questions here. Our database
|server locked up hard Sunday morning, and we still have no idea why (the
|machine itself, not just the database app). I think it's more important
|to focus on whether you have done all that is reasonable to make your
|application reliable -- and then put your efforts into making your app
|recoverable.

Absolutely! Shit happens. In a well-designed world, that would not be
the case, but we don't live in one. Until you have identified the cause,
you can't tell if threading has anything to do with the failure - given
what we know, it seems likely, but what Aahz says is how to tackle the
problem WHATEVER the cause.
Regards,
Nick Maclaren.
Jan 24 '07 #9


On Jan 24, 6:43 pm, "Carl J. Van Arsdall" <cvanarsd...@mvista.com>
wrote:
Chris Mellon wrote:
On 24 Jan 2007 18:21:38 GMT, Nick Maclaren <n...@cus.cam.ac.uk> wrote:
[snip]
I'm aware of the issues with the POSIX threading model. I still stand
by my statement - bringing up the problems with the provability of
correctness in the POSIX model amounts to FUD in a discussion of
actual problems with actual code.
Logic and programming errors in user code are far more likely to be
the cause of random errors in a threaded program than theoretical
(I've never come across a case in practice) issues with the POSIX
standard.
Yea, typically I would think that. The problem I am seeing is
incredibly intermittent: a simple Pyro server that gives me a
problem maybe every three or four months. Just something funky will
happen to the state of the whole thing, some bad data. I'm having an
issue tracking it down, and some more experienced programmers mentioned
that it's most likely a race condition. The thing is, I'm really not
doing anything too crazy, so I'm having difficulty tracking it down. I
had heard in the past that there may be issues with threads, so I
thought to investigate this side of things.

It still proves difficult, but reassurance of the threading model helps
me focus my efforts.
<SNIP>
-carl
Three to four months before `strange errors`? I'd spend some time
correlating logs; not just for your program, but for everything running
on the server. Then I'd expect to cut my losses and arrange to safely
re-start the program every TWO months.
(I'd arrange the re-start after collecting logs but before their
analysis. Life is too short).
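(For what it's worth, the scheduled re-start can be a one-line cron job; the paths and script names below are invented for illustration:)

```shell
# Hypothetical crontab entry: at 03:00 on the 1st of every second month,
# rotate the logs for later analysis, then restart the server.
0 3 1 */2 * /usr/local/bin/rotate-logs.sh && /etc/init.d/pyroserver restart
```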

- Paddy.

Jan 24 '07 #10
Carl J. Van Arsdall wrote:
Chris Mellon wrote:
>On 24 Jan 2007 18:21:38 GMT, Nick Maclaren <nm**@cus.cam.ac.uk> wrote:

>>[snip]



I'm aware of the issues with the POSIX threading model. I still stand
by my statement - bringing up the problems with the provability of
correctness in the POSIX model amounts to FUD in a discussion of
actual problems with actual code.

Logic and programming errors in user code are far more likely to be
the cause of random errors in a threaded program than theoretical
(I've never come across a case in practice) issues with the POSIX
standard.

Yea, typically I would think that. The problem I am seeing is
incredibly intermittent: a simple Pyro server that gives me a
problem maybe every three or four months. Just something funky will
happen to the state of the whole thing, some bad data. I'm having an
issue tracking it down, and some more experienced programmers mentioned
that it's most likely a race condition.
Right. You're at MontaVista, which does real-time Linux systems
and support. There will be people there who thoroughly understand
thread issues. (I've used QNX for real time, but MontaVista has
made progress in recent years.)

The Python thread documentation is kind of vague about how
well the Python primitives are protected against concurrency problems.
For example, do you have to protect basic types like lists
and hashes against concurrent access? Is "pop" atomic?
(It is in "deque", but what about regular lists?)
Can you crash Python from within Python via concurrency errors?
Does the garbage collector run concurrently or does it freeze
all threads? What's different depending upon whether you're using
real OS threads or simulated Python threads?
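One way to poke at the list question empirically (a stress test, not a proof, and specific to CPython):

```python
import threading

def hammer(target, count=10000):
    # Each thread appends `count` items. If list.append were not
    # effectively atomic under the GIL, we would expect lost updates.
    for i in range(count):
        target.append(i)

shared = []
workers = [threading.Thread(target=hammer, args=(shared,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(shared))  # 40000 if no appends were lost
```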

John Nagle
Jan 24 '07 #11




On Jan 24, 10:43 am, "Carl J. Van Arsdall" <cvanarsd...@mvista.com>
wrote:
Yea, typically I would think that. The problem I am seeing is
incredibly intermittent: a simple Pyro server that gives me a
problem maybe every three or four months. Just something funky will
happen to the state of the whole thing, some bad data. I'm having an
issue tracking it down, and some more experienced programmers mentioned
that it's most likely a race condition. The thing is, I'm really not
doing anything too crazy, so I'm having difficulty tracking it down. I
had heard in the past that there may be issues with threads, so I
thought to investigate this side of things.
POSIX issues aside, Python's threading model should be less susceptible
to memory-barrier problems that are possible in other languages (this
is due to the GIL). Double-checked locking, frinstance, is safe in
python even though it isn't in java.
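For anyone who hasn't seen it, double-checked locking looks roughly like this (a sketch; in CPython the unlocked first read is safe because the GIL keeps the reference assignment atomic, whereas Java's pre-1.5 memory model famously made the same pattern unsafe):

```python
import threading

_instance = None
_lock = threading.Lock()

def get_instance():
    """Lazily create a shared singleton, locking only on first use."""
    global _instance
    if _instance is None:             # first check, without the lock
        with _lock:
            if _instance is None:     # second check, under the lock
                _instance = object()  # stand-in for expensive construction
    return _instance
```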

Are you ever relying solely on the GIL to access shared data?

-Mike

Jan 25 '07 #14
"Klaas" <mi********@gmail.com> writes:
POSIX issues aside, Python's threading model should be less susceptible
to memory-barrier problems that are possible in other languages (this
is due to the GIL).
But the GIL is not part of Python's threading model; it's just a
particular implementation artifact. Programs that rely on it are
asking for trouble.
Double-checked locking, frinstance, is safe in python even though it
isn't in java.
What's that?
Are you ever relying solely on the GIL to access shared data?
I think a lot of programs do that, which is probably unwise in the
long run.
Jan 25 '07 #15
On Jan 24, 4:11 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
"Klaas" <mike.kl...@gmail.com> writes:
POSIX issues aside, Python's threading model should be less susceptible
to memory-barrier problems that are possible in other languages (this
is due to the GIL).
But the GIL is not part of Python's threading model; it's just a
particular implementation artifact. Programs that rely on it are
asking for trouble.
CPython is more than "a particular implementation" of Python, and the
GIL is more than an "artifact". It is a central tenet of threaded
Python programming.

I don't advocate relying on the GIL to manage shared data when
threading, but 1) it is useful for the reasons I mention 2) the OP's
question was almost certainly about an application written for and run
on CPython.
Double-checked locking, frinstance, is safe in python even though it
isn't in java.
What's that?
google.com

-Mike

Jan 25 '07 #16
"Klaas" <mi********@gmail.com> writes:
CPython is more than "a particular implementation" of Python,
It's precisely a particular implementation of Python. Other
implementations include Jython, PyPy, and IronPython.
and the GIL is more than an "artifact". It is a central tenet of
threaded python programming.
If it's a central tenet of threaded python programming, why is it not
mentioned at all in the language or library manual? The threading
module documentation describes the right way to handle thread
synchronization in Python, and that module implements traditional
locking approaches without reference to the GIL.
I don't advocate relying on the GIL to manage shared data when
threading, but 1) it is useful for the reasons I mention 2) the OP's
question was almost certainly about an application written for and run
on CPython.
Possibly true.
Jan 25 '07 #17
> and the GIL is more than an "artifact". It is a central tenet of
>threaded python programming.

If it's a central tenet of threaded python programming, why is it not
mentioned at all in the language or library manual? The threading
module documentation describes the right way to handle thread
synchronization in Python, and that module implements traditional
locking approaches without reference to the GIL.
And we all hope the GIL will one day die its natural death ...
maybe... probably... hopefully ;)
--
damjan
Jan 25 '07 #18
On Jan 24, 5:18 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
"Klaas" <mike.kl...@gmail.com> writes:
CPython is more that "a particular implementation" of python,
It's precisely a particular implementation of Python. Other
implementations include Jython, PyPy, and IronPython.
I did not deny that it is an implementation of Python. I deny that it
is but an implementation of Python.

Jython: several versions behind, used primarily for interfacing with
Java
PyPy: years away from being a practical platform for replacing CPython
IronPython: best example you've given, but still probably three or four
orders of magnitude less significant than CPython
and the GIL is more than an "artifact". It is a central tenet of
threaded python programming.
If it's a central tenet of threaded python programming, why is it not
mentioned at all in the language or library manual?
The same reason why IE CSS quirks are not delineated in the HTML 4.01
spec. This doesn't mean that they aren't central to css web
programming (they are).

How could the GIL, which limits a single process to running Python
code in only one thread at a time, NOT be a central part of
threaded Python programming?
The threading
module documentation describes the right way to handle thread
synchronization in Python, and that module implements traditional
locking approaches without reference to the GIL.
No-one has argued that the GIL should be used instead of
threading-based locking. How could they? The two concepts are not
interchangeable and while they affect each other, are two different
things entirely. In the post you responded to and quoted I said:
I don't advocate relying on the GIL to manage shared data when
threading,
-Mike

Jan 25 '07 #19

In article <11********************@a34g2000cwb.googlegroups.com>,
"Paddy" <pa*******@netscape.net> writes:
|>
|Three to four months before `strange errors`? I'd spend some time
|correlating logs; not just for your program, but for everything running
|on the server. Then I'd expect to cut my losses and arrange to safely
|re-start the program every TWO months.
|(I'd arrange the re-start after collecting logs but before their
|analysis. Life is too short).

Forget it. That strategy is fine in general, but is a waste of time
where threading issues are involved (or signal handling, or some types
of communication problem, for that matter). There are three unrelated
killer facts that interact:

Such failures are usually probabilistic ("Poisson process"), and
so have no "history".

The expected number is usually proportional to the square of the
activity, sometimes a higher power.

Virtually nothing involved does any routine logging, or even has
options to log relevant events.

The first means that the strategy of restarting doesn't help. All
three mean that current logs are almost never any use.
Regards,
Nick Maclaren.
Jan 25 '07 #20


On Jan 25, 9:26 am, n...@cus.cam.ac.uk (Nick Maclaren) wrote:
In article <1169675599.502726.5...@a34g2000cwb.googlegroups.com>, "Paddy" <paddy3...@netscape.net> writes:
|>
|Three to four months before `strange errors`? I'd spend some time
|correlating logs; not just for your program, but for everything running
|on the server. Then I'd expect to cut my losses and arrange to safely
|re-start the program every TWO months.
|(I'd arrange the re-start after collecting logs but before their
|analysis. Life is too short).

Forget it. That strategy is fine in general, but is a waste of time
where threading issues are involved (or signal handling, or some types
of communication problem, for that matter).
Nah, it's a great strategy. It keeps you up and running when all you
know for sure is that you will most likely be able to keep things
together for three months normally.
The OP only thinks it's a threading problem - it doesn't matter what the
true fix will be, as long as arranging to re-start the server well
before it's likely to go down doesn't take too long, compared to your
exploration of the problem, and, of course, you have to be able to
afford the glitch in availability.
There are three unrelated
killer facts that interact:

Such failures are usually probabilistic ("Poisson process"), and
so have no "history".

The expected number is usually proportional to the square of the
activity, sometimes a higher power.

Virtually nothing involved does any routine logging, or even has
options to log relevant events.

The first means that the strategy of restarting doesn't help. All
three mean that current logs are almost never any use.

Regards,
Nick Maclaren.
Jan 25 '07 #21
Aahz wrote:
[snip]

My response is that you're asking the wrong questions here. Our database
server locked up hard Sunday morning, and we still have no idea why (the
machine itself, not just the database app). I think it's more important
to focus on whether you have done all that is reasonable to make your
application reliable -- and then put your efforts into making your app
recoverable.
Well, I assume that I have done all I can to make it reliable. This
list is usually my last resort, or a place where I come hoping to find
ideas that aren't coming to me naturally. The only other thing I
thought to come up with was that there might be network errors. But
I've gone back and forth on that, because TCP should handle that for me
and I shouldn't have to deal with it directly in Pyro, although I've
added (and continue to add) checks in places that appear appropriate
(and in some cases, checks because I prefer to be paranoid about errors).

I'm particularly making this comment in the context of your later point
about the bug showing up only every three or four months.

Side note: without knowing what error messages you're getting, there's
not much anybody can say about your programs or the reliability of
threads for your application.
Right, I wasn't coming here to get someone to debug my app, I'm just
looking for ideas. I constantly am trying to find new ways to improve
my software and new ways to reduce bugs, and when I get really stuck,
new ways to track bugs down. The exception won't mean much, but I can
say that the error appears to me as bad data. I do checks prior to
performing actions on any data; if the data doesn't look like what it
should look like, then the system flags an exception.

The problem I'm having is determining how the data went bad. In
tracking down the problem a couple of guys mentioned that problems like
that usually are a race condition. From here I examined my code,
checked out all the locking stuff, made sure it was good, and wasn't
able to find anything. Being that there's one lock and the critical
sections are well defined, I'm having difficulty. One idea I have to
try and get a better understanding might be to check data before it's
stored. Again, I still don't know how it would get messed up, nor can I
reproduce the error on my own.

Do any of you think that would be a good practice for trying to track
this down? (Check the data after reading it, check the data before
saving it)
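Concretely, what I had in mind is wrapping the store and load with the same validation pass (a sketch; `looks_valid` stands in for whatever invariants my real data actually has):

```python
def looks_valid(record):
    # Stand-in for the real invariant checks on the data.
    return isinstance(record, dict) and "id" in record

def checked_store(store, key, record):
    # Validate before saving, so bad data is caught on the write side.
    if not looks_valid(record):
        raise ValueError("bad data on write: %r" % (record,))
    store[key] = record

def checked_load(store, key):
    # Validate after reading, so corruption shows up on the read side.
    record = store[key]
    if not looks_valid(record):
        raise ValueError("bad data on read: %r" % (record,))
    return record
```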

--

Carl J. Van Arsdall
cv*********@mvista.com
Build and Release
MontaVista Software

Jan 25 '07 #22

In article <11*********************@j27g2000cwj.googlegroups.com>,
"Paddy" <pa*******@netscape.net> writes:
|>
| |Three to four months before `strange errors`? I'd spend some time
| |correlating logs; not just for your program, but for everything running
| |on the server. Then I'd expect to cut my losses and arrange to safely
| |re-start the program every TWO months.
| |(I'd arrange the re-start after collecting logs but before their
| |analysis. Life is too short).
|
| Forget it. That strategy is fine in general, but is a waste of time
| where threading issues are involved (or signal handling, or some types
| of communication problem, for that matter).
|>
|Nah, Its a great strategy. it keeps you up and running when all you
|know for sure is that you will most likely be able to keep things
|together for three months normally.
|>
|The OP only thinks its a threading problem - it doesn't matter what the
|true fix will be, as long as arranging to re-start the server well
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|before its likely to go down doesn't take too long, compared to your
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|exploration of the problem, and, of course, you have to be able to
|afford the glitch in availability.

Consider the marked phrase in the context of a Poisson process failure
model, and laugh. If you don't understand why I say that, I suggest
finding out the properties of the Poisson process!
Regards,
Nick Maclaren.
Jan 25 '07 #23


On Jan 25, 7:36 pm, n...@cus.cam.ac.uk (Nick Maclaren) wrote:
In article <1169751828.986583.47...@j27g2000cwj.googlegroups.com>,
"Paddy" <paddy3...@netscape.net> writes:
|>
| |Three to four months before `strange errors`? I'd spend some time
| |correlating logs; not just for your program, but for everything running
| |on the server. Then I'd expect to cut my losses and arrange to safely
| |re-start the program every TWO months.
| |(I'd arrange the re-start after collecting logs but before their
| |analysis. Life is too short).
|
| Forget it. That strategy is fine in general, but is a waste of time
| where threading issues are involved (or signal handling, or some types
| of communication problem, for that matter).
|>
|Nah, Its a great strategy. it keeps you up and running when all you
|know for sure is that you will most likely be able to keep things
|together for three months normally.
|>
|The OP only thinks its a threading problem - it doesn't matter what the
|true fix will be, as long as arranging to re-start the server well
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|before its likely to go down doesn't take too long, compared to your
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|exploration of the problem, and, of course, you have to be able to
|afford the glitch in availability.

Consider the marked phrase in the context of a Poisson process failure
model, and laugh. If you don't understand why I say that, I suggest
finding out the properties of the Poisson process!

Regards,
Nick Maclaren.
No, you should think of the service that needs to be up. You seem to be
talking about how it can't be fixed rather than looking for ways to
keep things going. A little learning is fine but "it can't
theoretically be fixed" is no solution.
With a program that stays up for that long, the situation will usually
work out for the better when either software versions are upgraded, or
OS and drivers are upgraded. (Sometimes as a result of the analysis,
sometimes not).

Keep your eye on the goal and you're more likely to score!

- Paddy.

Jan 25 '07 #24
"Paddy" <pa*******@netscape.net> writes:
No, you should think of the service that needs to be up. You seem to be
talking about how it can't be fixed rather than looking for ways to
keep things going.
But you're proposing cargo cult programming. There is no reason
whatsoever to expect that restarting the server now and then will help
the problem in the slightest. Nick used the fancy term Poisson
process but it just means that the probability of failure at any
moment is independent of what's happened in the past, like the
spontaneous radioactive decay of an atom. It's not like a mechanical
system where some part gradually gets worn out and eventually breaks,
so you can prevent the failure by replacing the part every so often.
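Paul's memorylessness point can be checked numerically. The failure rate below is a made-up number (one failure per roughly 3.5 months), not a measurement from the OP's server; the sketch just shows that for exponential (Poisson-process) failure times, having survived s months tells you nothing about the next month, so a scheduled restart buys nothing:

```python
import math
import random

lam = 1.0 / 3.5          # assumed rate: one failure per ~3.5 months

def survival(t, lam=lam):
    # P(T > t) for an exponential failure time T
    return math.exp(-lam * t)

s, t = 2.0, 1.0          # up for 2 months already; look 1 month ahead
conditional = survival(s + t) / survival(s)   # P(T > s+t | T > s)
unconditional = survival(t)                   # P(T > t)

# Monte Carlo check of the same identity (deterministic under a seed)
random.seed(0)
samples = [random.expovariate(lam) for _ in range(200_000)]
past_s = [x for x in samples if x > s]
est_cond = sum(x > s + t for x in past_s) / len(past_s)

print(conditional, unconditional, est_cond)
```

The analytic identity exp(-lam*(s+t))/exp(-lam*s) == exp(-lam*t) is exact; the Monte Carlo estimate agrees to sampling error. Contrast with wear-out failures, where the conditional probability really does grow with uptime.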
A little learning is fine but "it can't theoretically be fixed" is
no solution.
The best you can do is identify the unfixable situations precisely and
work around them. Precision is important.

The next best thing is have several servers running simultaneously,
with failure detection and automatic failover.

If a server is failing at random every few months, trying to prevent
that by restarting it every so often is just shooting in the dark.
Think of your server stopping now and then because there's a power
failure, where you get power failures every few months on the average.
Shutting down your server once a month, unplugging it, and plugging it
back in will do nothing to prevent those outages. You need to either
identify and fix whatever is causing the power outages, or install a
backup generator.
Jan 25 '07 #25


On Jan 25, 8:00 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
"Paddy" <paddy3...@netscape.netwrites:
No, you should think of the service that needs to be up. You seem to be
talking about how it can't be fixed rather than looking for ways to
keep things going.
But you're proposing cargo cult programming.
I don't know that term. What I'm proposing is that if, for example, a
process stops running three times in a year at roughly three to four
month intervals, and it should have stayed up, then restart the
server sooner, at a time of your choosing, whilst taking other
measures to investigate the error.
There is no reason
whatsoever to expect that restarting the server now and then will help
the problem in the slightest.
That's where we most likely differ. The problem is only indirectly the
program failing; the customer wants reliable service, which you can get
from unreliable components. It happens all the time in
firmware-controlled systems that periodically reboot themselves as a
matter of course.
Nick used the fancy term Poisson
process but it just means that the probability of failure at any
moment is independent of what's happened in the past, like the
spontaneous radioactive decay of an atom. It's not like a mechanical
system where some part gradually gets worn out and eventually breaks,
so you can prevent the failure by replacing the part every so often.
Whilst you sit agreeing on how many fairies can dance on the end of a
pin or not, your company could be losing customers. You and Nick seem
to be saying it *must* be Poisson, therefore we can't do...
>
A little learning is fine but "it can't theoretically be fixed" is
no solution.
The best you can do is identify the unfixable situations precisely and
work around them. Precision is important.
I'm sorry, but your argument reminds me of when Western statistical
quality control first met with the Japanese Zero defects methodologies.
We had argued ourselves into accepting a certain amount of defective
cars getting out to customers as the result of our theories. The
Japanese practices emphasized *no* defects were acceptable at the
customer, and they seemed to deliver better made cars.
>
The next best thing is have several servers running simultaneously,
with failure detection and automatic failover.
Yah, finally. I can work with that
>
If a server is failing at random every few months, trying to prevent
that by restarting it every so often is just shooting in the dark.
"at random" - "every few months"
Me thinking it happens "every few months" allows me to search for a
fix.
If thinking it happens "at random" leads you to a brick wall, then
switch!
Think of your server stopping now and then because there's a power
failure, where you get power failures every few months on the average.
Shutting down your server once a month, unplugging it, and plugging it
back in will do nothing to prevent those outages. You need to either
identify and fix whatever is causing the power outages, or install a
backup generator.
Yep. I also know that a mad bloke entering the server room with a
hammer every three to four months is also not likely to be fixed by
restarting the server every two months ;-)

- Paddy.

Jan 25 '07 #26
"Paddy" <pa*******@netscape.net> writes:
But you're proposing cargo cult programming.
i don't know that term.
http://en.wikipedia.org/wiki/Cargo_cult_programming
What I'm proposing is that if, for example, a process stops running
three times in a year at roughly three to four months intervals ,
and it should have stayed up; then restart the server sooner, at aa
time of your choosing,
What makes you think that restarting the server will make it less
likely to fail? It sounds to me like there's zero evidence of that,
since you say "roughly three or four month intervals" and talk about
threading and race conditions. If it's failing every 3 months, 15
days and 2.43 hours like clockwork, that's different, sure, restart it
every three months. But the description I see so far sounds like a
random failure caused by some events occurring with low enough
probability that they only happen on average every few months of
operation. That kind of thing is very common and is often best
diagnosed by instrumenting the hell out of the code.
There is no reason whatsoever to expect that restarting the server
now and then will help the problem in the slightest.
Thats where we most likely differ.
Do you think there is a reason to expect that restarting the server
will help the problem in the slightest? I realize you seem to expect
that, but you have not given a REASON. That's what I mean by cargo
cult programming.
Whilst you sit agreeing on how many fairys can dance on the end of a
pin or not Your company could be loosing customers. You and Nick seem
to be saying it *must* be Poisson, therefore we can't do...
I dunno about Nick, I'm saying it's best to assume that it's Poisson
and do whatever is necessary to diagnose and fix the bug, and that the
voodoo measure you're proposing is not all that likely to help and it
will take years to find out whether it helps or not (i.e. restarting
after 3 months and going another 3 months without a failure proves
nothing).
I'm sorry, but your argument reminds me of when Western statistical
quality control first met with the Japanese Zero defects methodologies.
We had argued ourselves into accepting a certain amount of defective
cars getting out to customers as the result of our theories. The
Japanese practices emphasized *no* defects were acceptable at the
customer, and they seemed to deliver better made cars.
I don't see your point. You're the one who wants to keep operating
defective software instead of fixing it.
"at random" - "every few months"
Me thinking it happens "every few months" allows me to search for a
fix. If thinking it happens "at random" leads you to a brick wall,
then switch!
But you need evidence before you can say it happens every few months.
Do you have, say, a graph of the exact dates and times of failure, the
number of requests processed so far, etc.? If it happened at some
exact or almost exact uniform time interval or precisely once every
1.273 million requests or whatever, that tells you something. But the
earlier description didn't sound like that. Restarting the server is
not much better than carrying a lucky rabbit's foot.
Jan 25 '07 #27

In article <11**********************@q2g2000cwa.googlegroups.com>,
"Paddy" <pa*******@netscape.net> writes:
|>
|No, you should think of the service that needs to be up. You seem to be
|talking about how it can't be fixed rather than looking for ways to
|keep things going. A little learning is fine but "it can't
|theoretically be fixed" is no solution.

I suggest that you do invest in a little learning and look up Poisson
processes.

|Keep your eye on the goal and your more likely to score!

And, if you have your eye on the wrong goal, you would generally be
better off not scoring :-)
Regards,
Nick Maclaren.
Jan 25 '07 #28

Paul> I dunno about Nick, I'm saying it's best to assume that it's
Paul> Poisson and do whatever is necessary to diagnose and fix the bug,
Paul> and that the voodoo measure you're proposing is not all that
Paul> likely to help and it will take years to find out whether it helps
Paul> or not (i.e. restarting after 3 months and going another 3 months
Paul> without a failure proves nothing).

What makes you think Paddy indicated he wouldn't try to solve the problem?
Here's what he wrote:

What I'm proposing is that if, for example, a process stops running
three times in a year at roughly three to four months intervals , and it
should have stayed up; then restart the server sooner, at aa time of
your choosing, whilst taking other measures to investicate the error.

I see nothing wrong with trying to minimize the chances of a problem rearing
its ugly head while at the same time trying to investigate its cause (and
presumably solve it).

Skip

Jan 26 '07 #29
sk**@pobox.com writes:
What makes you think Paddy indicated he wouldn't try to solve the problem?
Here's what he wrote:

What I'm proposing is that if, for example, a process stops running
three times in a year at roughly three to four months intervals , and it
should have stayed up; then restart the server sooner, at aa time of
your choosing, whilst taking other measures to investicate the error.
Well, ok, that's better than just rebooting every so often and leaving
it at that, like the firmware systems he cited.
I see nothing wrong with trying to minimize the chances of a problem
I think a measure to minimize the chance of some problem is only valid
if there's some plausible theory that it WILL decrease the chance of
the problem (e.g. if there's reason to think that the problem is
caused by a very slow resource leak, but that hasn't been suggested).
That's the part that I'm missing from this story.

One thing I'd certainly want to do is set up a test server under a
much heavier load than the real server sees, and check whether the
problem occurs faster.
Jan 26 '07 #30
"Carl J. Van Arsdall" <cv*********@mvista.com> wrote:
Right, I wasn't coming here to get someone to debug my app, I'm just
looking for ideas. I constantly am trying to find new ways to improve
my software and new ways to reduce bugs, and when i get really stuck,
new ways to track bugs down. The exception won't mean much, but I can
say that the error appears to me as bad data. I do checks prior to
performing actions on any data, if the data doesn't look like what it
should look like, then the system flags an exception.

The problem I'm having is determining how the data went bad. In
tracking down the problem a couple guys mentioned that problems like
that usually are a race condition. From here I examined my code,
checked out all the locking stuff, made sure it was good, and wasn't
able to find anything. Being that there's one lock and the critical
sections are well defined, I'm having difficulty. One idea I have to
Are you 100% rock bottom gold plated guaranteed sure that there is
not something else that is also critical that you just haven't realised is?

This stuff is never obvious before the fact - and always seems stupid
afterward, when you have found it. Your best (some would say only)
weapon is your imagination, fueled by scepticism...
try and get a better understanding might be to check data before its
stored. Again, I still don't know how it would get messed up nor can I
reproduce the error on my own.

Do any of you think that would be a good practice for trying to track
this down? (Check the data after reading it, check the data before
saving it)
Nothing wrong with doing that to find a bug - not as a general
practice, of course - that would be too pessimistic.

In hard to find bugs - doing anything to narrow the time and place
of the error down is fair game - the object is to get you to read
some code that you *know works* with new eyes...

I build in a global boolean variable that I call trace, and when it's on
I do all sort of weird stuff, giving a running commentary (either by
print or in some log like file) of what the programme is doing,
like read this, wrote that, received this, done that here, etc.
A bare useful minimum is a "we get here" indicator like the routine
name, but the data helps a lot too.

Compared to an assert, it does not stop the execution, and you
could get lucky by cross correlating such "traces" from different
threads. - or better, if you use a queue or a pipe for the "log",
you might see the timing relationships directly.
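Hendrik's trace-flag-plus-queue idea might look like this in outline. The message format, thread names, and worker are all invented for illustration; the point is that a Queue is thread-safe and preserves arrival order, so the interleaving of events from different threads is visible:

```python
import queue
import threading
import time

TRACE = True                       # the global boolean Hendrik describes
trace_q = queue.Queue()            # thread-safe; preserves arrival order

def trace(msg):
    if TRACE:
        # timestamp + thread name let you cross-correlate threads later
        trace_q.put((time.monotonic(), threading.current_thread().name, msg))

def worker(n):
    trace(f"read item {n}")        # "we got here" plus the data
    trace(f"done item {n}")

threads = [threading.Thread(target=worker, args=(i,), name=f"w{i}")
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

events = []                        # drain the log after the run
while not trace_q.empty():
    events.append(trace_q.get())
```

Flipping TRACE to False removes almost all the overhead, which matters if the act of logging changes the timing enough to hide the race.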

But this in itself is fraught with danger, as you can hit file size
limits, or slow the whole thing down to unusability.

On the other hand it does not generate the volume that a genuine
trace does, it is easier to read, and you can limit it to the bits that
you are currently suspicious of.

Programming is such fun...

hth - Hendrik

Jan 26 '07 #31

In article <ma***************************************@python. org>,
sk**@pobox.com writes:
|>
|What makes you think Paddy indicated he wouldn't try to solve the problem?
|Here's what he wrote:
|>
| What I'm proposing is that if, for example, a process stops running
| three times in a year at roughly three to four months intervals , and it
| should have stayed up; then restart the server sooner, at aa time of
| your choosing, whilst taking other measures to investicate the error.
|>
|I see nothing wrong with trying to minimize the chances of a problem rearing
|its ugly head while at the same time trying to investigate its cause (and
|presumably solve it).

No, nor do I, but look more closely. His quote makes it quite clear that
he has got it firmly in his mind that this is a degradation problem, and
so regular restarting will improve the reliability. Well, it could also
be one where failure becomes LESS likely the longer the server stays up
(i.e. the "settling down" problem).

No problem is as hard to find as one where you are firmly convinced that
it is somewhere other than where it is.
Regards,
Nick Maclaren.
Jan 26 '07 #32


On 26 Jan, 09:05, n...@cus.cam.ac.uk (Nick Maclaren) wrote:
In article <mailman.3176.1169771514.32031.python-l...@python.org>,
s...@pobox.com writes:
|>
|What makes you think Paddy indicated he wouldn't try to solve the problem?
|Here's what he wrote:
|>
| What I'm proposing is that if, for example, a process stops running
| three times in a year at roughly three to four months intervals , and it
| should have stayed up; then restart the server sooner, at aa time of
| your choosing, whilst taking other measures to investicate the error.
|>
|I see nothing wrong with trying to minimize the chances of a problem rearing
|its ugly head while at the same time trying to investigate its cause (and
|presumably solve it).

No, nor do I, but look more closely. His quote makes it quite clear that
he has got it firmly in his mind that this is a degradation problem, and
so regular restarting will improve the reliability. Well, it could also
be one where failure becomes LESS likely the longer the server stays up
(i.e. the "settling down" problem).
If in the past year the settling down problem did not rear its head
when the server crashed after three to four months and was restarted,
then why not implement a regular, notified downtime, whilst also
looking into the problem in more depth?

* You are already having to restart.
* restarts last for 3-4 months.
Why burden yourself with "Oh but it could fail once in three hours,
you've not proved that it can't, we'll have to stop everything whilst
we do a thorough investigation. Is it Poisson? Is it 'settling down'?
Just wait whilst I prepare my next doctoral thesis... "

- Okay, the last was extreme. but cathartic :-)
>
No problem is as hard to find as one where you are firmly convinced that
it is somewhere other than where it is.
Amen!
>
Regards,
Nick Maclaren.
- Paddy.

Jan 26 '07 #33
Hendrik van Rooyen wrote:
"Carl J. Van Arsdall" <cv*********@mvista.com> wrote:

>[snip]

Are you 100% rock bottom gold plated guaranteed sure that there is
not something else that is also critical that you just haven't realised is?
100%? No, definitely not. I know myself, as I explore this option and
other options, I will of course be going into and out of the code,
looking for that small piece I might have missed. But I'm like a modern
operating system, I do lots of things at once. So after being unable to
solve it the first few times, I thought to pose a question, but as I
pose the question that never means that I'm done looking at my code and
hoping I missed something. I'd much rather have this be my fault...
that means I have a much higher probability of fixing it. But I sought
to explore some tips given to me. Ah, but the day I could be 100%
sure, that would be a good day (hell, I'd go ask for a raise for being
the best coder ever!)
This stuff is never obvious before the fact - and always seems stupid
afterward, when you have found it. Your best (some would say only)
weapon is your imagination, fueled by scepticism...

Yea, seriously!

>try and get a better understanding might be to check data before its
stored. Again, I still don't know how it would get messed up nor can I
reproduce the error on my own.

Do any of you think that would be a good practice for trying to track
this down? (Check the data after reading it, check the data before
saving it)

Nothing wrong with doing that to find a bug - not as a general
practice, of course - that would be too pessimistic.

In hard to find bugs - doing anything to narrow the time and place
of the error down is fair game - the object is to get you to read
some code that you *know works* with new eyes...

I really like that piece of wisdom, I'll add that to my list of coding
mantras. Thanks!
I build in a global boolean variable that I call trace, and when its on
I do all sort of weird stuff, giving a running commentary (either by
print or in some log like file) of what the programme is doing,
like read this, wrote that, received this, done that here, etc.
A bare useful minimum is a "we get here" indicator like the routine
name, but the data helps a lot too.

Yea, I do some of that too. I use that with conditional print
statements to stderr when I'm doing my validation against my test
cases. But I could definitely do more of them. The thing will be
simulating the failure. In the production server, thousands of printed
messages would be bad.

I've done short but heavy simulations, but to no avail. For example,
I'll have a couple systems infinitely loop and beat on the system. This
is a much heavier load than the system will ever normally face, as its
hit a lot at once and then idles for a while. The test environment
constantly hits it, and I let that run for several days. Maybe a longer
run is needed, but how long is reasonable before determining that its
something beyond my control?
Compared to an assert, it does not stop the execution, and you
could get lucky by cross correlating such "traces" from different
threads. - or better, if you use a queue or a pipe for the "log",
you might see the timing relationships directly.
Ah, store the logs in a rotating queue of fixed size? That would work
pretty well to maintain control on a large run, thanks!
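The rotating fixed-size log is essentially one line with collections.deque. This sketch (the check is a stand-in for the OP's real validation) keeps only the last N events in memory and dumps the recent history only when the bad-data check fires, so a long production run never hits a file-size limit:

```python
from collections import deque

ring = deque(maxlen=100)           # oldest entries fall off automatically

def trace(msg):
    # deque.append is atomic in CPython, so workers can call this freely
    ring.append(msg)

def check(data):
    if not isinstance(data, int):  # stand-in for the real validation
        history = "\n".join(ring)  # dump recent history only on failure
        raise ValueError(f"bad data {data!r}; last events:\n{history}")
    return data

for i in range(250):
    trace(f"stored {check(i)}")
```

After 250 events the ring still holds exactly the last 100, which is usually enough context around the failure point without the volume of a full trace.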
But this in itself is fraught with danger, as you can hit file size
limits, or slow the whole thing down to unusability.

On the other hand it does not generate the volume that a genuine
trace does, it is easier to read, and you can limit it to the bits that
you are currently suspicious of.

Programming is such fun...
Yea, I'm one of those guys who really gets a sense of satisfaction out
of coding. Thanks for the tips.

-carl

--

Carl J. Van Arsdall
cv*********@mvista.com
Build and Release
MontaVista Software

Jan 26 '07 #34
"Carl J. Van Arsdall" <cv*********@mvista.com> wrote:
Hendrik van Rooyen wrote:
"Carl J. Van Arsdall" <cv*********@mvista.com> wrote:
8< ---------------------------------------------------
Yea, I do some of that too. I use that with conditional print
statements to stderr when i'm doing my validation against my test
cases. But I could definitely do more of them. The thing will be
When I read this - I thought - probably your stuff is working
perfectly - on your test cases - you could try to send it some
random data and to see what happens - seeing as you have a test
server, throw the kitchen sink at it.

Possibly "random" here means something that "looks like" data
but that is malformed in some way. Kind of try to "trick" the
system to get it to break reliably.
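Hendrik's "kitchen sink" suggestion is basically fuzzing. A crude version (the validate function and record shape are stand-ins, not the OP's code) feeds mangled variants of known-good data at the checks; notably, a bit-flipped integer still passes a pure type check, which is exactly the kind of corruption type checking misses:

```python
import random

def validate(record):
    # Stand-in for the real check: a dict carrying an int "id".
    return isinstance(record, dict) and isinstance(record.get("id"), int)

def mangle(record, rng):
    # Produce data that "looks like" the real thing but is malformed.
    bad = dict(record)
    choice = rng.randrange(3)
    if choice == 0:
        bad["id"] = str(bad["id"])                        # wrong type
    elif choice == 1:
        del bad["id"]                                     # missing field
    else:
        bad["id"] ^= 1 << rng.randrange(31)               # silent bit flip
    return bad

rng = random.Random(0)                 # seeded, so runs are repeatable
good = {"id": 42, "payload": "x"}
rejected = sum(not validate(mangle(good, rng)) for _ in range(1000))
```

Roughly a third of the mangled records (the bit flips) sail straight through, so passing the validator is not the same as the data being right.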

I'm sorry I can't be more specific - it sounds so weak, and you
probably already have test cases that "must fail" but I don't
know how to put it any better...

- Hendrik
Jan 27 '07 #35
Hendrik van Rooyen wrote:
[snip]
>could definitely do more of them. The thing will be

When I read this - I thought - probably your stuff is working
perfectly - on your test cases - you could try to send it some
random data and to see what happens - seeing as you have a test
server, throw the kitchen sink at it.

Possibly "random" here means something that "looks like" data
but that is malformed in some way. Kind of try to "trick" the
system to get it to break reliably.

I'm sorry I can't be more specific - it sounds so weak, and you
probably already have test cases that "must fail" but I don't
know how to put it any better...
Well, sometimes a weak analogy is the best thing because it allows me to
fill in the blanks "How can I throw a kitchen sink at it in a way I
never have before"

And away my mind goes, so thank you.

-carl

--

Carl J. Van Arsdall
cv*********@mvista.com
Build and Release
MontaVista Software

Jan 29 '07 #36
In article <ma***************************************@python.org>,
Carl J. Van Arsdall <cv*********@mvista.com> wrote:
>Aahz wrote:
>>
My response is that you're asking the wrong questions here. Our database
server locked up hard Sunday morning, and we still have no idea why (the
machine itself, not just the database app). I think it's more important
to focus on whether you have done all that is reasonable to make your
application reliable -- and then put your efforts into making your app
recoverable.
Well, I assume that I have done all I can to make it reliable. This
list is usually my last resort, or a place where I come hoping to find
ideas that aren't coming to me naturally. The only other thing I
thought to come up with was that there might be network errors. But
i've gone back and forth on that, because TCP should handle that for me
and I shouldn't have to deal with it directly in pyro, although I've
added (and continue to add) checks in places that appear appropriate
(and in some cases, checks because I prefer to be paranoid about errors).
My point is that an app that dies only once every few months under load
is actually pretty damn stable! That is not the kind of problem that
you are likely to stimulate.
>I'm particularly making this comment in the context of your later point
about the bug showing up only every three or four months.

Side note: without knowing what error messages you're getting, there's
not much anybody can say about your programs or the reliability of
threads for your application.

Right, I wasn't coming here to get someone to debug my app, I'm just
looking for ideas. I constantly am trying to find new ways to improve
my software and new ways to reduce bugs, and when i get really stuck,
new ways to track bugs down. The exception won't mean much, but I can
say that the error appears to me as bad data. I do checks prior to
performing actions on any data, if the data doesn't look like what it
should look like, then the system flags an exception.

The problem I'm having is determining how the data went bad. In
tracking down the problem a couple guys mentioned that problems like
that usually are a race condition. From here I examined my code,
checked out all the locking stuff, made sure it was good, and wasn't
able to find anything. Being that there's one lock and the critical
sections are well defined, I'm having difficulty. One idea I have to
try and get a better understanding might be to check data before its
stored. Again, I still don't know how it would get messed up nor can I
reproduce the error on my own.

Do any of you think that would be a good practice for trying to track
this down? (Check the data after reading it, check the data before
saving it)
What we do at my company is maintain log files. When we think we have
identified a potential choke point for problems, we add a log call.
Tracking this down will involve logging the changes to your data until
you can figure out where it goes wrong -- once you know where it goes
wrong, you have an excellent chance of figuring out why.
--
Aahz (aa**@pythoncraft.com) <* http://www.pythoncraft.com/

"I disrespectfully agree." --SJM
Jan 30 '07 #37
Carl J. Van Arsdall wrote:
Aahz wrote:
>[snip]

My response is that you're asking the wrong questions here. Our database
server locked up hard Sunday morning, and we still have no idea why (the
machine itself, not just the database app). I think it's more important
to focus on whether you have done all that is reasonable to make your
application reliable -- and then put your efforts into making your app
recoverable.
Well, I assume that I have done all I can to make it reliable. This
list is usually my last resort, or a place where I come hoping to find
ideas that aren't coming to me naturally. The only other thing I
thought to come up with was that there might be network errors. But
i've gone back and forth on that, because TCP should handle that for me
and I shouldn't have to deal with it directly in pyro, although I've
added (and continue to add) checks in places that appear appropriate
(and in some cases, checks because I prefer to be paranoid about errors).

>I'm particularly making this comment in the context of your later point
about the bug showing up only every three or four months.

Side note: without knowing what error messages you're getting, there's
not much anybody can say about your programs or the reliability of
threads for your application.
Right, I wasn't coming here to get someone to debug my app, I'm just
looking for ideas. I constantly am trying to find new ways to improve
my software and new ways to reduce bugs, and when i get really stuck,
new ways to track bugs down. The exception won't mean much, but I can
say that the error appears to me as bad data. I do checks prior to
performing actions on any data, if the data doesn't look like what it
should look like, then the system flags an exception.

The problem I'm having is determining how the data went bad. In
tracking down the problem a couple guys mentioned that problems like
that usually are a race condition. From here I examined my code,
checked out all the locking stuff, made sure it was good, and wasn't
able to find anything. Being that there's one lock and the critical
sections are well defined, I'm having difficulty. One idea I have to
try and get a better understanding might be to check data before its
stored. Again, I still don't know how it would get messed up nor can I
reproduce the error on my own.

Do any of you think that would be a good practice for trying to track
this down? (Check the data after reading it, check the data before
saving it)
Are you using memory with built-in error detection and correction?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Jan 30 '07 #38
Aahz wrote:
In article <ma***************************************@python.org>,
Carl J. Van Arsdall <cv*********@mvista.com> wrote:
My point is that an app that dies only once every few months under load
is actually pretty damn stable! That is not the kind of problem that
you are likely to stimulate.
This has all been so vague. How does it die?

It would be useful if Python detected obvious deadlock. If all threads
are blocked on mutexes, you're stuck, and at that point, it's time
to abort and do tracebacks on all threads. You shouldn't have to
run under a debugger to detect that.

Then a timer, so that if the Global Python Lock
stays locked for more than N seconds, you get an abort and a traceback.
That way, if you get stuck in some C library, it gets noticed.

Those would be some good basic facilities to have in thread support.

In real-time work, you usually have a high-priority thread which
wakes up periodically and checks that a few flags have been set
indicating progress of the real time work, then clears the flags.
Throughout the real time code, flags are set indicating progress
for the checking thread to notice. All serious real time systems
have some form of stall timer like that; there's often a stall
timer in hardware.
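John's stall-timer scheme translates fairly directly to Python threads. This is only a sketch with invented names: workers tick a progress flag, a checking pass reports flags that went stale, and sys._current_frames lets you dump every thread's stack without a debugger:

```python
import sys
import threading
import traceback

progress = {}                      # thread name -> ticked this window?
_plock = threading.Lock()

def tick():
    # Called from inside worker loops to report forward progress.
    with _plock:
        progress[threading.current_thread().name] = True

def check_and_clear():
    # One checking pass: return stalled thread names, reset all flags.
    with _plock:
        stalled = [name for name, ok in progress.items() if not ok]
        for name in progress:
            progress[name] = False
    return stalled

def dump_all_stacks(out=sys.stderr):
    # Traceback for every live thread; no debugger required.
    for ident, frame in sys._current_frames().items():
        print(f"--- thread {ident} ---", file=out)
        traceback.print_stack(frame, file=out)

# A real deployment would run check_and_clear() from a high-priority
# timer thread every N seconds and call dump_all_stacks() (and perhaps
# abort) whenever the stalled list is non-empty.
```

This doesn't give you the GIL-hold timeout John asks for, but it does catch the "all threads blocked on mutexes" case from outside.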

John Nagle
Jan 30 '07 #39
Steve Holden wrote:
[snip]

Are you using memory with built-in error detection and correction?

You mean in the hardware? I'm not really sure; I'd assume so, but is
there any way I can check on this? If the hardware isn't doing that, is
there anything I can do with my software to offer more stability?

--

Carl J. Van Arsdall
cv*********@mvista.com
Build and Release
MontaVista Software

Jan 30 '07 #40
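On Linux there is a rough software-side answer to Steve's question: the kernel's EDAC (Error Detection And Correction) subsystem exposes ECC-capable memory controllers under /sys, and `dmidecode -t memory` (run as root) also reports each module's ECC type. A minimal sketch, assuming a Linux system; note that an empty result may only mean no EDAC driver is loaded, not that the RAM lacks ECC:

```python
import os

def ecc_memory_visible():
    """Return True if the Linux EDAC subsystem reports at least one memory
    controller (mc0, mc1, ...), i.e. the kernel can see ECC-capable RAM.
    A False result is inconclusive: the EDAC driver may simply not be loaded."""
    edac_dir = "/sys/devices/system/edac/mc"
    try:
        return any(name.startswith("mc") for name in os.listdir(edac_dir))
    except OSError:
        # Directory absent: no EDAC support visible to the kernel.
        return False

print(ecc_memory_visible())
```

As Steve says, if the hardware genuinely lacks ECC there is no software substitute for it; checks like the checksumming discussed earlier can only detect corruption, not correct it.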
John Nagle wrote:
Aahz wrote:
>In article <ma***************************************@python.org>,
Carl J. Van Arsdall <cv*********@mvista.com> wrote:
My point is that an app that dies only once every few months under load
is actually pretty damn stable! That is not the kind of problem that
you are likely to stimulate.

This has all been so vague. How does it die?
Well, before operating on most of the data I perform type checks; if a
type check fails, my system flags an exception. Now I'm in the process
of finding out how the data went bad. I have to wait at this point,
though, so I was investigating possibilities so I could find a new way
of throwing the kitchen sink at it.

It would be useful if Python detected obvious deadlock. If all threads
are blocked on mutexes, you're stuck, and at that point, it's time
to abort and do tracebacks on all threads. You shouldn't have to
run under a debugger to detect that.

Then a timer, so that if the Global Interpreter Lock
stays locked for more than N seconds, you get an abort and a traceback.
That way, if you get stuck in some C library, it gets noticed.

Those would be some good basic facilities to have in thread support.
I agree. That would be incredibly useful, although doesn't it reopen
the debate on threads killing threads? From what I understand, this
is frowned upon (and was removed from Java because it was dangerous).
Still, I think that if there were a master or control thread that
watched the state of the system and could intervene, that would be
powerful. One way to do this could be to use processes, where each
process could catch a kill signal if it appears to be stalled, although
I am absolutely sure there is more to it than that. I don't think this
could be done at all with Python threads, but as a fan of Python
threads and their ease of use, it would be a nice and powerful feature
to have.
-carl
--

Carl J. Van Arsdall
cv*********@mvista.com
Build and Release
MontaVista Software

Jan 30 '07 #41
Carl J. Van Arsdall wrote:
Steve Holden wrote:
>[snip]

Are you using memory with built-in error detection and correction?

You mean in the hardware? I'm not really sure; I'd assume so, but is
there any way I can check on this? If the hardware isn't doing that, is
there anything I can do with my software to offer more stability?
You might be able to check using the OS features (have you said what OS
you are using?) - alternatively Google for information from the system
supplier.

If you don't have that feature in hardware you are up sh*t creek without
a paddle, as it can't be emulated.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Feb 1 '07 #42