Socket BeginSend and disconnections

A while back I posted regarding a problem we were having with one of our
applications which was randomly crashing. Monitoring memory usage revealed
a spike in nonpaged pool memory just prior to the crash each time.

We finally think we have narrowed down the cause of this to a user (located
semi-remotely) who would connect into our system and disconnect
"ungracefully" (literally, by pulling his network cable). Connections here
are all TCP/IP sockets.

So we're now trying to work out how to properly protect our application
against this type of error in the future.

I have two questions (so far, I think):
1) What determines the length of time between this remote user yanking his
cable out and when we actually start seeing SocketExceptions saying the
connection was closed? Does it just depend on the network pieces involved
(i.e., not consistent)?

2) Is it safe (and maybe good practice?) to call multiple BeginSends on a
socket? Right now we do, and that is what we think is causing our
problem. Any time our application needs to send something to a particular
client, it basically just calls BeginSend and moves on. What happened in
this case is that the BeginSends never COMPLETED (our callbacks never got
invoked) after the user's cable was pulled. With the amount of traffic in
our system, the number of "pending sends" sometimes reached into the
thousands. At that point one of two things happened: either the network finally
figured out the remote user wasn't there and all of a sudden all the pending
callbacks completed at once (with SocketExceptions, which is good) - this
was rare. More frequently, these pending calls just built up more and more
until the application finally croaked (we think this is the nonpaged pool
memory spike we are seeing).
So, is there a different approach we should be using to send data out to
clients that would prevent this type of failure in the future?

Advice/comments appreciated. Thanks

-Adam
Jun 27 '08 #1
On May 23, 7:20 pm, "Adam Clauss" <acla...@swri.org> wrote:
[...]
You should use the socket KeepAlive option (set the keep-alive time to 4-5
seconds) so you are notified of the disconnection.
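
For reference, a minimal sketch of what enabling keep-alive could look like
with the .NET Socket class (the helper name and the 4-second/1-second values
are assumptions based on the suggestion above; this uses the Windows-specific
SIO_KEEPALIVE_VALS control code via Socket.IOControl):

    using System;
    using System.Net.Sockets;

    static class KeepAlive
    {
        // Enables TCP keep-alive probes on a connected socket.
        // Times are in milliseconds (Windows-specific control code).
        public static void Enable(Socket socket, uint idleTimeMs, uint intervalMs)
        {
            // Layout of the native tcp_keepalive struct:
            // u_long onoff, u_long keepalivetime, u_long keepaliveinterval.
            byte[] values = new byte[12];
            BitConverter.GetBytes(1u).CopyTo(values, 0);          // turn keep-alive on
            BitConverter.GetBytes(idleTimeMs).CopyTo(values, 4);  // idle time before first probe
            BitConverter.GetBytes(intervalMs).CopyTo(values, 8);  // interval between probes
            socket.IOControl(IOControlCode.KeepAliveValues, values, null);
        }
    }

    // e.g. KeepAlive.Enable(clientSocket, 4000, 1000);  // probe after ~4 s of silence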

Ali
Jun 27 '08 #2
On Fri, 23 May 2008 07:20:02 -0700, Adam Clauss <ac*****@swri.org> wrote:
[...]
> I have two questions (so far, I think):
> 1) What determines the length of time between this remote user yanking his
> cable out and when we actually start seeing SocketExceptions saying the
> connection was closed? Does it just depend on the network pieces involved
> (i.e., not consistent)?
_Usually_, if there's a broken connection, you'll see exceptions shortly
after you start trying to send data, if not as soon as you try to send
data. It does depend somewhat on how the connection is broken and where.
Depending on the network configuration, it is theoretically possible to
_never_ get an exception.
> 2) Is it safe (and maybe good practice?) to call multiple BeginSends on a
> socket? Right now we do, and that is what we think is causing our
> problem.
That very well could be. It points to two issues:

* You may want to put an upper bound on how many send operations you
perform for any connection without getting a response, or at least a
completion of the call to BeginSend() (which indicates the data's been
buffered locally, not that the remote endpoint has received it); see the
sketch after these two points. In theory, you should be able to queue as
many sends as you need, especially if you've set the socket buffer itself
to 0 (so that all buffering is managed by your own allocated buffers you
pass to BeginSend()). But it does sound like you are somehow sending data
_so_ quickly that you fill up the available memory before the network
layer can detect the broken connection.

* You may have a bug where you don't handle an out-of-memory condition
gracefully. Whether this is really a bug depends on your intended
design. But it seems to me that for a server-class application, it makes
sense to catch all exceptions and try to continue gracefully if possible.
You may find that the program becomes useless until some clean-up is done,
but at the very minimum it would allow you to report the specific problem,
and possibly you could include logic that allows you to start pruning your
client list until things start working again.
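
A minimal sketch of the first idea, capping the number of outstanding
BeginSend() calls per connection and handling the failed completions. The
class name, the cap value, and the refuse-on-overflow policy are all
assumptions, not part of the original discussion:

    using System;
    using System.Net.Sockets;
    using System.Threading;

    class ClientConnection
    {
        const int MaxPendingSends = 100;   // hypothetical cap; tune for your traffic
        readonly Socket _socket;
        int _pendingSends;                 // BeginSend calls whose callback hasn't run yet

        public ClientConnection(Socket socket) { _socket = socket; }

        public bool TrySend(byte[] data)
        {
            // Refuse (or queue/drop/disconnect -- the policy is up to you) once too
            // many sends are outstanding; this bounds memory when a peer goes dead.
            if (Interlocked.Increment(ref _pendingSends) > MaxPendingSends)
            {
                Interlocked.Decrement(ref _pendingSends);
                return false;
            }

            try
            {
                _socket.BeginSend(data, 0, data.Length, SocketFlags.None,
                                  OnSendComplete, null);
                return true;
            }
            catch (SocketException)
            {
                Interlocked.Decrement(ref _pendingSends);
                return false;
            }
        }

        void OnSendComplete(IAsyncResult ar)
        {
            Interlocked.Decrement(ref _pendingSends);
            try
            {
                _socket.EndSend(ar);
            }
            catch (SocketException)
            {
                // Broken connection detected on send -- clean up this client here.
            }
            catch (ObjectDisposedException)
            {
                // Socket was already closed elsewhere.
            }
        }
    }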

Given that the problem occurs when you try to send data, I'm not convinced
that enabling keep-alive is going to be useful. In scenarios where
keep-alive would detect a problem, so too should trying to send data. I'm
curious: how long does this failure take to occur, from the time
that the connection is broken until the time that your application starts
seeing errors (either exceptions on the socket or simply failing)? I
admit, I'm surprised to hear that you're able to make enough attempts to
send that you run out of resources before the socket itself reports an
error. It seems like on a modern computer, you shouldn't be able to
allocate memory fast enough to cause that to happen.

Pete
Jun 27 '08 #3
"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...
On Fri, 23 May 2008 07:20:02 -0700, Adam Clauss <ac*****@swri.orgwrote:

> _Usually_, if there's a broken connection, you'll see exceptions shortly
> after you start trying to send data, if not as soon as you try to send
> data. It does depend somewhat on how the connection is broken and where.
> Depending on the network configuration, it is theoretically possible to
> _never_ get an exception.
For whatever reason (based on my testing in our development environment),
I see about a minute go by before a SocketException gets thrown and the
disconnect is recognized.
> * You may want to put an upper bound on how many send operations you
> perform for any connection without getting a response, or at least a
> completion of the call to BeginSend() (which indicates the data's been
> buffered locally, not that the remote endpoint has received it). In theory,
> you should be able to queue as many sends as you need, especially if
> you've set the socket buffer itself to 0 (so that all buffering is managed
> by your own allocated buffers you pass to BeginSend()). But it does sound
> like you are somehow sending data _so_ quickly that you fill up the
> available memory before the network layer can detect the broken
> connection.
The associated error at the time of crash tends to be
EVENT_SRV_NO_NONPAGED_POOL in Event Viewer (i.e., we ran out of nonpaged
pool memory).
Is setting the socket buffer to 0 a non-default setting? (Right now we do
not explicitly set that value.) If set that way, would it maybe use our
application memory (which obviously has a much larger pool to allocate
from) rather than nonpaged memory, and possibly give the socket enough
time to recognize the disconnect?
The socket traffic is XML messages (typically one-way to the client). They
can range from maybe a couple hundred bytes to the largest being several
hundred KB. I don't remember what the cap is on nonpaged memory, but we
put some counters in to compare the number of calls to BeginSend vs. the
number of callbacks received, and the difference quickly grew into the
thousands during this minute-or-so time period.
> * You may have a bug where you don't handle an out-of-memory condition
> gracefully. Whether this is really a bug depends on your intended
> design. But it seems to me that for a server-class application, it makes
> sense to catch all exceptions and try to continue gracefully if possible.
> You may find that the program becomes useless until some clean-up is done,
> but at the very minimum it would allow you to report the specific problem,
> and possibly you could include logic that allows you to start pruning your
> client list until things start working again.
>
> Given that the problem occurs when you try to send data, I'm not convinced
> that enabling keep-alive is going to be useful. In scenarios where
> keep-alive would detect a problem, so too should trying to send data. I'm
> curious: how long does this failure take to occur, from the time
> that the connection is broken until the time that your application starts
> seeing errors (either exceptions on the socket or simply failing)? I
> admit, I'm surprised to hear that you're able to make enough attempts to
> send that you run out of resources before the socket itself reports an
> error. It seems like on a modern computer, you shouldn't be able to
> allocate memory fast enough to cause that to happen.
In our test setup it takes about a minute. However, our test setup also
doesn't crash. Watching nonpaged pool memory with perfmon, I do see a spike
begin after I yank the cord, but it does not crash. A minute goes by, a
"logout" (socket disconnection) gets logged by our application, and memory
falls back to normal. It seems one of two things is happening in the
production setup:
1) They have FAR more data flowing than we do in our test setup, causing the
spike to be of greater magnitude and big enough to crash the application
before the minute goes by (this is almost certainly true - they DO have more
data); and/or:
2) Their network configuration is not registering the disconnection for a
time period longer than a minute - I am still working to verify exactly how
long it took the application to crash after uncompleted operations started
stacking up.

- Adam
Jun 27 '08 #4
On Fri, 23 May 2008 10:34:45 -0700, Adam Clauss <ac*****@swri.org> wrote:
> For whatever reason (based on my testing in our development environment),
> I see about a minute go by before a SocketException gets thrown and the
> disconnect is recognized.
Yuck. For what it's worth, I've never seen disconnects take that long to
detect when actually sending data (obviously, they can take indefinitely
longer if you don't try to send anything :) ). I typically see the
disconnect within a second or two.

It might be worth trying to explore what makes it take so long. I have
little enough experience with the lowest levels of networking that I can't
suggest specifics in that regard.

The only higher-level thing that comes to mind is the possibility that
there's some thread hogging all the CPU time, which is limiting how
quickly your i/o thread(s) get to process things. In that
scenario, the network driver itself would be detecting the disconnect
almost immediately, but wouldn't get a chance to report it until much
later.

But it'd be hard for me to say for sure even with a code sample. Without
one, it's just pure speculation. That said, if you have any code that's
raising thread priorities, you might consider disabling it to see if that
helps (hopefully you don't...it's almost never the right thing to do :)
). And if you have a thread that is compute-intensive, you might consider
_lowering_ that thread's priority so that in times of high i/o load, it
doesn't get in the way.
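
If it helps to see it spelled out, a trivial sketch of lowering the priority
of a compute-heavy thread (the class and the DoHeavyRecalculation method are
hypothetical placeholders):

    using System.Threading;

    class Worker
    {
        static void DoHeavyRecalculation()
        {
            // Placeholder for the compute-intensive work (hypothetical).
        }

        public static Thread StartLowPriority()
        {
            // Run the CPU-heavy work at lowered priority so socket completion
            // callbacks on the thread pool aren't starved during high I/O load.
            var worker = new Thread(DoHeavyRecalculation)
            {
                IsBackground = true,
                Priority = ThreadPriority.BelowNormal
            };
            worker.Start();
            return worker;
        }
    }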
[...]
> The associated error at the time of crash tends to be
> EVENT_SRV_NO_NONPAGED_POOL in Event Viewer (i.e., we ran out of nonpaged
> pool memory).
> Is setting the socket buffer to 0 a non-default setting? (Right now we do
> not explicitly set that value.) If set that way, would it maybe use our
> application memory (which obviously has a much larger pool to allocate
> from) rather than nonpaged memory, and possibly give the socket enough
> time to recognize the disconnect?
Maybe. However, I'm not really sure why the non-paged pool is involved.
Typically, the network driver is going to have a fixed-size buffer. I
wouldn't expect it to try to expand that buffer or add new ones. Instead,
it will either reject an attempt to queue new data (non-blocking i/o) or
it will force the attempt to wait until there is space (blocking i/o).

AFAIK a 0-sized buffer for your socket is not the default, and it has the
effect of telling the driver to not buffer at all, but rather to use the
buffer you provide. This is common for IOCP implementations of sockets,
and since the async Socket API uses IOCP, it's something to try. The main
advantage is actually one of performance -- it avoids one copy of the data
-- but I suppose if there's something about the network layer where it's
trying to allocate non-paged memory as you queue data, telling it not to
buffer might improve things.
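
For what it's worth, setting that up in .NET is just a property on the
socket. A sketch, assuming an already-created Socket; whether it actually
helps with the non-paged pool behaviour is an open question, as discussed
above:

    using System.Net.Sockets;

    static void UseCallerBuffers(Socket socket)
    {
        // Ask the stack not to keep its own send-side copy; the byte[] handed
        // to BeginSend() is then used directly until that send completes.
        socket.SendBufferSize = 0;
    }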

Again, I'm not actually clear myself why non-paged memory would be getting
allocated at this point. But then, that's as likely just a gap in my
knowledge as it is an indication that that's abnormal and/or unrelated to
your problem. :)

I apologize for the vagueness in my comments. The bulk of my socket
programming experience is with the unmanaged Winsock API. Inasmuch as the
.NET Socket class is built on that, my previous knowledge is applicable,
but there may be details specific to .NET that I'm unaware of.

Pete
Jun 27 '08 #5
