Thread Pool versus Dedicated Threads

一首诗

Hi all,

I recently got a new coworker, and we disagree about something.

The last company he worked for used a particular network programming
model. They split the business logic into modules and gave each
module a dedicated thread. Modules exchanged information through an
in-memory message queue.

In my opinion, such a model means very complicated asynchronous
interaction between modules. A simple function call between modules
would require a timer to avoid waiting for an answer forever.
And if a module were blocked by I/O (such as a db query), the other
modules that depend on it would have to wait for it.

For example, if module A wants to query the db, it would

1. save its state in a list
2. send a message to the db-adapter module (a thread dedicated to db
operations)
3. start a timer
4. if the response message arrives in time, retrieve the state from the
list and go on
5. if the timer fires, log an error message, cancel the operation, and
send an error notification to the user
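
The steps above can be sketched roughly as follows. This is only a minimal sketch under assumptions: the names DbRequest, DbQueue, run_db_adapter and query_with_timeout are hypothetical, the "database" is a stand-in that echoes the query, and the caller simply blocks for up to the timeout instead of saving state in a list and arming a separate timer.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical request message: the query plus a promise for the reply.
struct DbRequest {
    std::string query;
    std::promise<std::string> reply;
};

// Mailbox of the (hypothetical) db-adapter module.
class DbQueue {
public:
    void post(DbRequest req) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_ && "post() after close()");
            items_.push_back(std::move(req));
        }
        ready_.notify_one();
    }
    // Blocks until a request arrives; returns false once closed and drained.
    bool take(DbRequest& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<DbRequest> items_;
    bool closed_ = false;
};

// Body of the dedicated db-adapter thread. The "database" is a stand-in
// that sleeps for simulated_latency and then echoes the query.
void run_db_adapter(DbQueue& db, std::chrono::milliseconds simulated_latency) {
    DbRequest req;
    while (db.take(req)) {
        std::this_thread::sleep_for(simulated_latency);
        req.reply.set_value("rows for: " + req.query);
    }
}

// Steps 2 to 5 as seen from module A: post the request, wait with a
// deadline, then either continue with the rows or give up.
bool query_with_timeout(DbQueue& db, const std::string& query,
                        std::chrono::milliseconds timeout, std::string& rows) {
    DbRequest req;
    req.query = query;
    std::future<std::string> answer = req.reply.get_future();
    db.post(std::move(req));
    if (answer.wait_for(timeout) != std::future_status::ready)
        return false;  // step 5: the caller logs the error and notifies the user
    rows = answer.get();
    return true;
}

// Example wiring: one adapter thread, one query, orderly shutdown.
bool run_example(std::string& rows) {
    DbQueue db;
    std::thread adapter(run_db_adapter, std::ref(db), std::chrono::milliseconds(0));
    bool ok = query_with_timeout(db, "select 1", std::chrono::milliseconds(2000), rows);
    db.close();
    adapter.join();
    return ok;
}
```

Blocking on the future keeps the sketch short; a fully asynchronous module A would keep the pending request in its own table and handle the reply or the timeout as ordinary messages.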

My new coworker has written 300,000 lines of code in this model and
claims it is the simplest way to write a network application.
He says we could implement a message queue in half a day, and that
messages would make the interfaces much clearer.

I think it would be easier if modules interacted through function calls
under a thread/process pool model, in which no thread or process has a
dedicated job but handles whatever the master thread gives it.

But as I don't have much experience in this area, I am not quite
sure.

What do you think? Are there any successful projects that could
prove which model is **right**?

Aug 14 '08 #1


Ian Collins

一首诗 wrote:

Hi all,

<snip>

While interesting, there isn't really a C++ question in there. You
would get more insight on comp.programming.threads.

--
Ian Collins.

Aug 14 '08 #2

James Kanze

On Aug 14, 8:20 am, 一首诗 <newpt...@gmail.com> wrote:

Recently I had a new coworker. There is some dispute between us.

The last company he worked for has a special networking
programming model. They split the business logic into
different modules, and have a dedicated thread for the each
module. Modules exchanged info through a in-memory message
queue.

In my opinion, such a model means very complicated
asynchronous interaction between module.

If there's common data, there's always a more or less
complicated asynchronous interaction between modules. The
dedicated thread model normally reduces the "common data" to
just the message queue, which makes things significantly
simpler.

A simple function call between modules would require a timer
to avoid waiting for answer forever. And if a module was
blocked by IO (such as db query), other modules depends on
would have to wait for it.

Yup. That's the downside. The single, dedicated thread is (or
can be) a bottleneck. Of course, such bottlenecks can occur
anyway; if you're manipulating a shared resource, for example,
which needs locking.

For example, if module A want to query db, it would

1. save states in a list
2 .sending a message to db-adapter module (a thread dedicated for db
operation)
3. start a timer
4. if response message arrived on time, retrieve states from the
list, and go on
5. if timer fires, log an error message and cancel the operation,
send an error notify to user...

I'd put the time-out in the DB adapter module. Other than this:
what's the difference between putting the request in a single
block and posting it to the message queue, and passing the
information as arguments to a function?

My new coworker had written 300,000 lines of code in this
model and claimed this is the most simple way to write a
network application. He said we could implement a message
queue in half-a day and message would make interface much more
clear.

It's typically easier to get the code right using the message
queue, but it's not a silver bullet. You can still end up with
deadlocks. But you're much less likely to have problems due to
two threads accessing the same data without sufficient
synchronization.
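
To illustrate the point, here is a minimal sketch of a module whose data is touched only by its own thread, so that the mailbox is the only thing needing a lock. The class SessionModule and its members are hypothetical names chosen for the example, not anything from the poster's code base.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical module that owns its data; only its own thread touches it.
class SessionModule {
public:
    SessionModule() : worker_([this] { run(); }) {}

    ~SessionModule() {
        post([this] { stopping_ = true; });  // the stop request is just another message
        worker_.join();
    }

    // Fire-and-forget message.
    void login(const std::string& user) {
        post([this, user] { ++logins_[user]; });
    }

    // Request/reply message: blocks the caller until the module thread answers.
    int login_count(const std::string& user) {
        auto reply = std::make_shared<std::promise<int>>();
        std::future<int> answer = reply->get_future();
        post([this, user, reply] {
            auto it = logins_.find(user);
            reply->set_value(it == logins_.end() ? 0 : it->second);
        });
        return answer.get();
    }

private:
    void post(std::function<void()> msg) {
        assert(msg && "empty message");
        {
            std::lock_guard<std::mutex> lock(mutex_);
            mailbox_.push_back(std::move(msg));
        }
        ready_.notify_one();
    }

    void run() {
        while (!stopping_) {
            std::function<void()> msg;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return !mailbox_.empty(); });
                msg = std::move(mailbox_.front());
                mailbox_.pop_front();
            }
            msg();  // runs on the module thread, outside the mailbox lock
        }
    }

    // Shared between threads, protected by mutex_.
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::function<void()>> mailbox_;

    // Private to the module thread: no lock needed.
    std::map<std::string, int> logins_;
    bool stopping_ = false;

    std::thread worker_;  // declared last so everything above exists before it starts
};
```

Because messages are handled in FIFO order, a login_count posted after two login calls for the same user sees both of them. The deadlock risk mentioned above is still there: calling login_count from inside a message handled by the same module would wait forever.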

I think if module interact with each other through function
calls and a thread/process pool model would be more easier, in
which each thread/ process has no dedicated job but handle
whatever the master thread give it.

A "thread/process pool model" doesn't mean anything. I'm not
sure what real alternative you're suggesting. Most places I've
worked at use a thread per client connection; on receiving a
request, the thread either grabs whatever locks it needs and
does the work, or forwards it to the dedicated thread (which
then doesn't need any locks, because it is the only thread which
accesses the information). Both models work. Which one is
better depends on the application.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 14 '08 #3

Chris Becke

"James Kanze" <ja*********@gmail.com> wrote:

A "thread/process pool model" doesn't mean anything. I'm not
sure what real alternative you're suggesting.

The real alternative is to create a similar message queue design, but completely break the relationship of client connections to threads. Client connections exist on as many or as few threads as needed by the scalability of the comms library. Requests coming in are packaged and posted to a message queue.
A pool of worker threads, proportional to the number of virtual CPUs in the server (rather than the number of client connections), pulls requests from the queue, processes each request, and then goes back to see if there's anything in the queue to process.

This sort of design can ultimately be far better tuned to keep the CPU cores as busy as possible, while minimising needless context switches from having an "active" thread count far in excess of CPU availability. Given a database server that can, likewise, process multiple requests at once using asynchronous file I/O, this design will keep the database busy, rather than continually bottlenecking in the single DB "object" thread.
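
A minimal sketch of such a pool, assuming standard C++11 threads rather than any particular comms library. WorkerPool and handle_requests are hypothetical names, and sizing the pool with std::thread::hardware_concurrency() is one plausible reading of "proportional to the number of virtual CPUs".

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical pool: no worker has a dedicated job, each one handles
// whatever request is next in the shared queue.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t threads = default_thread_count()) {
        assert(threads > 0);
        for (std::size_t i = 0; i < threads; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Lets the workers drain the queue, then joins them.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
        for (std::thread& worker : workers_) worker.join();
    }

    void post(std::function<void()> request) {
        assert(request && "empty request");
        {
            std::lock_guard<std::mutex> lock(mutex_);
            requests_.push_back(std::move(request));
        }
        ready_.notify_one();
    }

    static std::size_t default_thread_count() {
        unsigned n = std::thread::hardware_concurrency();  // may be 0 if unknown
        return n == 0 ? 1 : n;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> request;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return closed_ || !requests_.empty(); });
                if (requests_.empty()) return;  // closed and drained
                request = std::move(requests_.front());
                requests_.pop_front();
            }
            request();  // process outside the lock, then go back to the queue
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::function<void()>> requests_;
    bool closed_ = false;
    std::vector<std::thread> workers_;
};

// Example: post n trivial requests and report how many were handled.
int handle_requests(int n) {
    std::atomic<int> handled(0);
    {
        WorkerPool pool;
        for (int i = 0; i < n; ++i)
            pool.post([&handled] { ++handled; });
    }  // leaving the scope drains the queue and joins the workers
    return handled.load();
}
```

Note that any data the requests share now needs its own synchronization (here an atomic counter), which is exactly the trade-off against the dedicated-thread model.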

Aug 14 '08 #4

Erik Wikström

On 2008-08-14 08:20, 一首诗 wrote:

Hi all,

Recently I had a new coworker. There is some dispute between us.

The last company he worked for has a special networking programming
model. They split the business logic into different modules, and have
a dedicated thread for the each module. Modules exchanged info
through a in-memory message queue.

In my opinion, such a model means very complicated asynchronous
interaction between module. A simple function call between modules
would require a timer to avoid waiting for answer forever.
And if a module was blocked by IO (such as db query), other modules
depends on would have to wait for it.

What do u think about it? Is there any successful projects that could
prove which model is **right**?

In general I think you might be right, but when dealing with networking
there is usually a very layered architecture with one-way communication
between the layers (i.e. a lower layer passing the processed data up to
a higher layer). In that case the message-passing model makes a lot of
sense, since it models the actual workings very well and makes each layer
simple to implement (if there are any packets in the in-queue, you
process them and put the results in the out-queue; if there are none,
you wait until there are).
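
As a rough illustration of that loop, here is a sketch of two layers connected by queues. Channel, run_layer and the two toy processing functions are hypothetical; a real protocol stack would do actual parsing in each layer.

```cpp
#include <cassert>
#include <cctype>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// One-way queue between two layers.
template <typename T>
class Channel {
public:
    void put(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_ && "put() after close()");
            items_.push_back(std::move(item));
        }
        ready_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Waits for the next item; returns false when closed and empty.
    bool get(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<T> items_;
    bool closed_ = false;
};

// A layer: take from the in-queue, process, put into the out-queue.
template <typename In, typename Out, typename Fn>
void run_layer(Channel<In>& in, Channel<Out>& out, Fn process) {
    In item;
    while (in.get(item)) out.put(process(item));
    out.close();  // pass end-of-stream up to the next layer
}

// Toy lower layer: strip the line terminator from a frame.
std::string strip_crlf(const std::string& frame) {
    std::string::size_type end = frame.find_last_not_of("\r\n");
    return end == std::string::npos ? std::string() : frame.substr(0, end + 1);
}

// Toy upper layer: normalise the command to upper case.
std::string to_upper(const std::string& text) {
    std::string out = text;
    for (char& c : out)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return out;
}

// Wires the two layers together, each on its own thread.
std::vector<std::string> run_two_layers(const std::vector<std::string>& frames) {
    Channel<std::string> wire, lines, commands;
    std::thread lower([&] { run_layer(wire, lines, strip_crlf); });
    std::thread upper([&] { run_layer(lines, commands, to_upper); });
    for (const std::string& frame : frames) wire.put(frame);
    wire.close();
    std::vector<std::string> result;
    std::string item;
    while (commands.get(item)) result.push_back(item);
    lower.join();
    upper.join();
    return result;
}
```

Each layer sees only its in-queue and out-queue, so no layer needs to know which thread runs its neighbours.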

For other kinds of tasks it might be easier to let one thread handle the
work-package in all the steps (and modules). Of course there are other
models and combinations, and which one is the best for a given purpose
is not always clear until you have tried a few.

--
Erik Wikström

Aug 14 '08 #5

James Kanze

On Aug 14, 4:24 pm, "Chris Becke" <chris.be...@gmail.com> wrote:

"James Kanze" <james.ka...@gmail.com> wrote:
A "thread/process pool model" doesn't mean anything. I'm not
sure what real alternative you're suggesting.

The real alternative is to create a similar message queue
design, but completely break the relationship of client
connections to threads.

That's a valid solution if there is no client specific data.
That's not always the case, however.

Client connections exist on as many or as few threads as
needed by the scalability of the comms library. Requests
coming in are packaged and posted to a message queue. A pool
of worker threads, proportional to the number of virtual CPUs
in the server (rather than the number of client connections)
pull requests from the queue, process the request, and then go
back to see if there's anything in the queue to process.

Again, it depends. If requests constantly use shared data,
there's no point in having more than one thread to handle them.
If requests never use shared data, there's no point in not
handling them immediately in the receiving thread.

This sort of design can ultimately be far better tuned to keep
the CPU cores as busy as possible, while minimising needless
context switches from having an "active" thread count far in
excess of CPU availability.

Do you have actual measurements from a real application to
support this claim? I doubt that it's true for most
applications.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 14 '08 #6

一首诗

Hi all,

Thanks for all your help! After reading all your posts and reconsidering
my coworker's arguments, I think some explanation is needed.

1. Chris described exactly the alternative I had in mind.

2. Also, as James pointed out, the most valuable point of a 'dedicated
model' is that no lock is needed, as only one thread touches the
data.

3. As Erik wrote, "there is usually a very layered architecture";
whether the application should have a layered architecture is a key
consideration in deciding whether to use a 'dedicated model'.

4. About shared data: yes, of course there is data shared between
clients. Actually we are building a SIP server for VoIP and
instant messaging. But in the case of a web server, isn't there
also data shared between clients? ...

(Sorry I have to attend a meeting, I will further explain my
consideration later.)

On Aug 14, 10:24 pm, "Chris Becke" <chris.be...@gmail.com> wrote:

"James Kanze" <james.ka...@gmail.com> wrote:
A "thread/process pool model" doesn't mean anything. I'm not
sure what real alternative you're suggesting.

The real alternative is to create a similar message queue design, but completely break the relationship of client connections to threads. Client connections exist on as many or as few threads as needed by the scalability of the comms library. Requests coming in are packaged and posted to a message queue.
A pool of worker threads, proportional to the number of virtual CPUs in the server (rather than the number of client connections) pull requests from the queue, process the request, and then go back to see if there's anything in the queue to process.

This sort of design can ultimately be far better tuned to keep the CPU cores as busy as possible, while minimising needless context switches from having an "active" thread count far in excess of CPU availability. Given a database server that can, likewise, process multiple requests at once using asynchronous file I/O, this design will keep the database busy, rather than continually bottlenecking in the single DB "object" thread.

Aug 15 '08 #7

Chris Becke

>> This sort of design can ultimately be far better tuned to keep
>> the CPU cores as busy as possible, while minimising needless
>> context switches from having an "active" thread count far in
>> excess of CPU availability.

> Do you have actual measurements from a real application to
> support this claim? I doubt that it's true for most
> applications.

Microsoft Windows needs to allocate stack space for each thread created. On the 32-bit version of the OS, then, this means an immediate scalability problem: with only 2 GB of address space per process, this implies a hard limit of 2048 connections (threads) per server. Even on a 64-bit OS, the working set added to the process for each thread means that physical hardware limits will be reached that much faster than in a system that uses asynchronous I/O to keep lots of connections on one thread.
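
The 2048 figure can be checked with a little arithmetic. The sketch below assumes a stack reservation of 1 MB per thread, which is the usual Windows default but is not stated in the post; the constant and function names are made up for the example.

```cpp
#include <cstdint>

// 2 GB of user address space per 32-bit process, as stated in the post.
constexpr std::uint64_t kUserAddressSpace = 2ull * 1024 * 1024 * 1024;
// 1 MB reserved per thread stack: an assumption (the usual Windows default).
constexpr std::uint64_t kDefaultStackReserve = 1ull * 1024 * 1024;

// Upper bound on the thread count if stacks were the only consumer of
// address space; the real limit is lower once code and heap are counted.
constexpr std::uint64_t max_threads(std::uint64_t address_space,
                                    std::uint64_t stack_reserve) {
    return address_space / stack_reserve;
}

static_assert(max_threads(kUserAddressSpace, kDefaultStackReserve) == 2048,
              "2 GB / 1 MB gives the 2048 figure");
static_assert(max_threads(kUserAddressSpace, 64 * 1024) == 32768,
              "a 64 KB reservation would raise the bound to 32768");
```

So the limit comes from the reserved stack size rather than from threads as such, and a smaller reservation moves it, although the per-thread working set still adds up.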

Aug 15 '08 #8

James Kanze

On Aug 15, 10:03 am, "Chris Becke" <chris.be...@gmail.com> wrote:

This sort of design can ultimately be far better tuned to keep
the CPU cores as busy as possible, while minimising needless
context switches from having an "active" thread count far in
excess of CPU availability.
Do you have actual measurements from a real application to
support this claim? I doubt that it's true for most
applications.

Microsoft Windows needs to allocate stack space for each
thread created. On the 32-bit version of the OS, then, this
means an immediate scalability problem: with only 2 GB of
address space per process, this implies a hard limit of 2048
connections (threads) per server. Even on a 64-bit OS, the
working set added to the process for each thread means that
physical hardware limits will be reached that much faster than
in a system that uses asynchronous I/O to keep lots of
connections on one thread.

That's a different problem, but yes, it does have to be taken
into account. The cost of creating a thread can also be an
issue, if connections are short lived (e.g. as in an HTTP
server).

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 15 '08 #9

James Kanze

On Aug 15, 4:07 am, 一首诗 <newpt...@gmail.com> wrote:

Thanks for all your help! After I read all your posts and
reconsider my coworker's arguments, I think some explanation
may be needed.

1. Chris explained exactly my real alternative solutions in
this post.

2. Also as James pointed out, the most valuable point of a
'dedicated model' is that no lock is needed as only one thread
would touch the data.

3. As Erik wrote "there is usually a very layered
architecture", whether there should have a layered
architecture, is a key consideration whether to use a
'dedicated model'.

4. About shared data. Yes, of course there are shared data
between each client. Actually we are building an SIP server
for VOIP and Instant Message. But in the case of a web
server, isn't there are also shared data between each client?
...

Sometimes. Sometimes not. Of course, only mutable shared data
is a problem. And it depends on who's using it, when. As I
said, there's no silver bullet. It all depends on the
application.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 15 '08 #10

Chris M. Thomasson

"Chris Becke" <ch*********@gmail.com> wrote in message
news:12***************@vasbyt.isdsl.net...

>>> This sort of design can ultimately be far better tuned to keep
>>> the CPU cores as busy as possible, while minimising needless
>>> context switches from having an "active" thread count far in
>>> excess of CPU availability.

>> Do you have actual measurements from a real application to
>> support this claim? I doubt that it's true for most
>> applications.

> Microsoft Windows needs to allocate stack space for each thread created. On
> the 32-bit version of the OS, then, this means an immediate scalability
> problem: with only 2 GB of address space per process, this implies a hard
> limit of 2048 connections (threads) per server. Even on a 64-bit OS, the
> working set added to the process for each thread means that physical
> hardware limits will be reached that much faster than in a system that uses
> asynchronous I/O to keep lots of connections on one thread.

I have personally created IOCP servers on Windows which can handle __well__
over 40,000 connections; want some tips?

Aug 15 '08 #11

PeterAPIIT

Can anyone explain what a thread pool and a dedicated pool are?

I have read a huge amount of material and still don't really understand.

Aug 16 '08 #12

Ian Collins

Chris M. Thomasson wrote:

"Chris Becke" <ch*********@gmail.com> wrote:

Microsoft Windows needs to allocate stack space for each thread
created. On the 32-bit version of the OS, then, this means an
immediate scalability problem: with only 2 GB of address space per
process, this implies a hard limit of 2048 connections (threads) per
server. Even on a 64-bit OS, the working set added to the process for
each thread means that physical hardware limits will be reached that
much faster than in a system that uses asynchronous I/O to keep lots of
connections on one thread.

I have personally created IOCP servers on Windows which can handle
__well__ over 40,000 connections; want some tips?

But I'd bet several gallons of my favourite beer that you didn't create
40,000 threads!

The one thread per connection model simply isn't scalable beyond a
handful of threads per core.

--
Ian Collins.

Aug 16 '08 #13

Chris M. Thomasson

"Ian Collins" <ia******@hotmail.com> wrote in message
news:6g*************@mid.individual.net...

Chris M. Thomasson wrote:

"Chris Becke" <ch*********@gmail.com> wrote:

Microsoft Windows needs to allocate stack space for each thread
created. On the 32-bit version of the OS, then, this means an
immediate scalability problem: with only 2 GB of address space per
process, this implies a hard limit of 2048 connections (threads) per
server. Even on a 64-bit OS, the working set added to the process for
each thread means that physical hardware limits will be reached that
much faster than in a system that uses asynchronous I/O to keep lots of
connections on one thread.

I have personally created IOCP servers on Windows which can handle
__well__ over 40,000 connections; want some tips?

But I'd bet several gallons for my favourite beer that you didn't create
40,000 threads!

I only created around 2 * N threads for the IOCP threading pool, where N is
the number of processors in the system. I did create a couple more
threads whose only job was to perform some resource maintenance tasks...

The one thread per connection model simply isn't scalable beyond a
handful of threads per core.

Right. Well, I guess you could use one user-thread (e.g. fiber)
per-connection and implement your own scheduler. The question is why in the
world would you do that on Windows when there is the wonderful and scalable
IOCP mechanism to work with...

Aug 16 '08 #14

James Kanze

On Aug 16, 6:08 am, Ian Collins <ian-n...@hotmail.com> wrote:

Chris M. Thomasson wrote:

"Chris Becke" <chris.be...@gmail.com> wrote:

Microsoft Windows needs to allocate stack space for each thread
created. On the 32-bit version of the OS, then, this means an
immediate scalability problem: with only 2 GB of address space per
process, this implies a hard limit of 2048 connections (threads) per
server. Even on a 64-bit OS, the working set added to the process for
each thread means that physical hardware limits will be reached that
much faster than in a system that uses asynchronous I/O to keep lots of
connections on one thread.

I have personally created IOCP servers on Windows which can handle
__well__ over 40,000 connections; want some tips?

But I'd bet several gallons for my favourite beer that you
didn't create 40,000 threads!

The one thread per connection model simply isn't scalable beyond a
handful of threads per core.

It depends. We're not at 40,000 connections yet, but we've
certainly more than a handful per core. And there's no problem
with the one thread per connection model for our application; in
fact, it would work better with two threads per connection
(one for push, and the other for pull). I've done a few tests
on Solaris, and there's no problem with thousands of threads.

It depends on what each connection is doing, and how long they
stay connected. (In our case, connections tend to last anywhere
between four and twelve hours. And of course, most of that
time, they are quiescent.)

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 16 '08 #15

Ian Collins

James Kanze wrote:

On Aug 16, 6:08 am, Ian Collins <ian-n...@hotmail.com> wrote:

The one thread per connection model simply isn't scalable beyond a
handful of threads per core.

It depends. We're not at 40,000 connections yet, but we've
certainly more than a handful per core. And there's no problem
with the one thread per connection model for our application; in
fact, it would work better with two threads per connection
(one for push, and the other for pull). I've done a few tests
on Solaris, and there's no problem with thousands of threads.

That depends on what they are doing; I've been hit by a thundering herd
with just 100 or so.

It depends on what each connection is doing, and how long they
stay connected. (In our case, connections tend to last anywhere
between four and twelve hours. And of course, most of that
time, they are quiescent.)

Ah, that explains it. I guess very few are blocking on the same
resource and you will have a very low rate of context switches. The
problems begin when the thread lifetime is short, the classic example
being a web server.

--
Ian Collins.

Aug 16 '08 #16

gpderetta

On Aug 16, 7:47 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

"Ian Collins" <ian-n...@hotmail.com> wrote in message
news:6g*************@mid.individual.net...

Chris M. Thomasson wrote:

"Chris Becke" <chris.be...@gmail.com> wrote:

Microsoft Windows needs to allocate stack space for each thread
created. On the 32-bit version of the OS, then, this means an
immediate scalability problem: with only 2 GB of address space per
process, this implies a hard limit of 2048 connections (threads) per
server. Even on a 64-bit OS, the working set added to the process for
each thread means that physical hardware limits will be reached that
much faster than in a system that uses asynchronous I/O to keep lots of
connections on one thread.

I have personally created IOCP servers on Windows which can handle
__well__ over 40,000 connections; want some tips?

But I'd bet several gallons for my favourite beer that you didn't create
40,000 threads!

I only created around 2 * N threads for the IOCP treading pool, where N is
the number of processors in the system. I did create a couple of more
threads whose only job was to perform some resource maintenance tasks...

The one thread per connection model simply isn't scalable beyond a
handful of threads per core.

Right. Well, I guess you could use one user-thread (e.g. fiber)
per-connection and implement your own scheduler. The question is why in the
world would you do that on Windows when there is the wonderful and scalable
IOCP mechanism to work with...

You can of course use user-threads on top of IOCP and get the best of
both worlds.

BTW, a good reference on the topic of (web) server scalability:

http://www.kegel.com/c10k.html

(I guess many here know this page).

HTH,

--
gpd

Aug 16 '08 #17

James Kanze

On Aug 16, 12:47 pm, Ian Collins <ian-n...@hotmail.com> wrote:

James Kanze wrote:
On Aug 16, 6:08 am, Ian Collins <ian-n...@hotmail.com> wrote:

The one thread per connection model simply isn't scalable beyond a
handful of threads per core.

It depends. We're not at 40,000 connections yet, but we've
certainly more than a handful per core. And there's no problem
with the one thread per connection model for our application; in
fact, it would work better with two threads per connection
(one for push, and the other for pull). I've done a few tests
on Solaris, and there's no problem with thousands of threads.

That depends what they are doing, I've been hit by a
thundering heard with just 100 or so.

Exactly. If you're using threads to parallelize operations,
then too many will be counter-productive. If you're using them
to separate various concerns, it depends.

It depends on what each connection is doing, and how long they
stay connected. (In our case, connections tend to last anywhere
between four and twelve hours. And of course, most of that
time, they are quiescent.)

Ah, that explains it. I guess very few are blocking on the
same resource and you will have a very low rate of context
switches.

Probably. In our case, clients have to remain connected,
because we use both push and pull. And there is client
(connection) specific state, related to things like privileges.
There are many ways of handling this: I'm pretty sure that the
entire application could have been written in a single thread
without too many problems; alternatively, we could have used two
threads per connection (one for the push, and one for the pull).
Or any number of mixtures of this (a thread per connection for
the pull, but a single thread for the push).

The problems begin when the thread lifetime is short, the
classic example being a web server.

HTTP tends to be an example where a new thread per connection is
NOT a good idea. But there are a lot of other protocols out
there, and a lot of other client/server architectures. And I
suspect that writing C++ code to handle HTTP connections is
fairly rare: I would expect that most people would use existing
server (Apache, WebSphere, etc.) software, with JSP or something
similar for the dynamically generated contents. The C++ parts
would be the back-end engines, where the connections wouldn't
necessarily (but could) reflect the incoming connections.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 16 '08 #18

Chris M. Thomasson

"gpderetta" <gp*******@gmail.com> wrote in message
news:5b**********************************@59g2000hsb.googlegroups.com...

On Aug 16, 7:47 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
"Ian Collins" <ian-n...@hotmail.com> wrote in message
news:6g*************@mid.individual.net...

Chris M. Thomasson wrote:

"Chris Becke" <chris.be...@gmail.com> wrote:

Microsoft Windows needs to allocate stack space for each thread
created. On the 32-bit version of the OS, then, this means an
immediate scalability problem: with only 2 GB of address space per
process, this implies a hard limit of 2048 connections (threads) per
server. Even on a 64-bit OS, the working set added to the process for
each thread means that physical hardware limits will be reached that
much faster than in a system that uses asynchronous I/O to keep lots of
connections on one thread.

I have personally created IOCP servers on Windows which can handle
__well__ over 40,000 connections; want some tips?

But I'd bet several gallons for my favourite beer that you didn't
create
40,000 threads!

I only created around 2 * N threads for the IOCP treading pool, where N
is
the number of processors in the system. I did create a couple of more
threads whose only job was to perform some resource maintenance tasks...

The one thread per connection model simply isn't scalable beyond a
handful of threads per core.

Right. Well, I guess you could use one user-thread (e.g. fiber)
per-connection and implement your own scheduler. The question is why in
the
world would you do that on Windows when there is the wonderful and
scalable
IOCP mechanism to work with...

You can of course use user-threads on top of IOCP and get the best of
both worlds.

Sure. I guess you would use an IOCP thread as the actual scheduler for the
fibers within it. When an I/O completion is encountered, you extract the
fiber context from the completion key and simply switch to that fiber. When
the fiber does its thing, it switches back to the IOCP thread. Something
like:

// pseudo-code
struct per_io {
    OVERLAPPED ol;
    char buf[1024];
    DWORD bytes;
    int action;
    BOOL status;
};

struct per_socket {
    SOCKET sck;
    void* fiber_socket_context;
    void* fiber_iocp_context;
    struct per_io* active_io;
};

DWORD WINAPI iocp_entry(LPVOID state) {
    for (;;) {
        struct per_io* pio = NULL;
        struct per_socket* psck = NULL;
        DWORD bytes = 0;
        BOOL status = GQCS(...,
                           &bytes,
                           ...,
                           (LPOVERLAPPED)&pio,
                           (PULONG_PTR)&psck,
                           INFINITE);
        pio->status = status;
        psck->active_io = pio;
        SwitchToFiber(psck->fiber_socket_context);
    }
    return 0;
}

VOID WINAPI per_socket_entry(LPVOID state) {
    struct per_socket* const _this = state;
    for (;;) {
        struct per_io* const pio = _this->active_io;
        switch (pio->action) {
            case ACTION_RECV:
                [...];
                break;
            case ACTION_SEND:
                [...];

            [whatever...];
        }
    }
}

BTW, a good reference on the topic of (web) server scalability:

http://www.kegel.com/c10k.html

(I guess many here know this page).

Indeed.

Aug 17 '08 #19

Chris M. Thomasson

"Chris M. Thomasson" <no@spam.invalid> wrote in message
news:lA*****************@newsfe01.iad...

"gpderetta" <gp*******@gmail.com> wrote in message
news:5b**********************************@59g2000hsb.googlegroups.com...
On Aug 16, 7:47 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
"Ian Collins" <ian-n...@hotmail.com> wrote in message
news:6g*************@mid.individual.net...

Chris M. Thomasson wrote:

"Chris Becke" <chris.be...@gmail.com> wrote:

Microsoft Windows needs to allocate stack space for each thread
created. On the 32-bit version of the OS, then, this means an
immediate scalability problem: with only 2 GB of address space per
process, this implies a hard limit of 2048 connections (threads) per
server. Even on a 64-bit OS, the working set added to the process for
each thread means that physical hardware limits will be reached that
much faster than in a system that uses asynchronous I/O to keep lots of
connections on one thread.

I have personally created IOCP servers on Windows which can handle
__well__ over 40,000 connections; want some tips?

But I'd bet several gallons for my favourite beer that you didn't
create
40,000 threads!

I only created around 2 * N threads for the IOCP treading pool, where N
is
the number of processors in the system. I did create a couple of more
threads whose only job was to perform some resource maintenance tasks...

The one thread per connection model simply isn't scalable beyond a
handful of threads per core.

Right. Well, I guess you could use one user-thread (e.g. fiber)
per-connection and implement your own scheduler. The question is why in
the
world would you do that on Windows when there is the wonderful and
scalable
IOCP mechanism to work with...

You can of course use user-threads on top of IOCP and get the best of
both worlds.

Sure. I guess you would use an IOCP thread as the actual scheduler for the
fibers within it. When an I/O completion is encountered, you extract the
fiber context from the completion key and simply switch to that fiber.
When the fiber does its thing, it switches back to the IOCP thread.
Something like:

WHOOPS! I accidentally sent this too early! Retarded keypress... Anyway, I
needed to allow the per_socket fiber to switch back to the iocp fiber!!!

// pseudo-code; GQCS stands for GetQueuedCompletionStatus
struct per_io {
    OVERLAPPED ol;   // kept first so the completed OVERLAPPED* doubles as a per_io*
    char buf[1024];
    DWORD bytes;
    int action;
    BOOL status;
};

struct per_socket {
    SOCKET sck;
    void* fiber_socket_context;   // fiber running per_socket_entry
    void* fiber_iocp_context;     // fiber to switch back to
    struct per_io* active_io;
};

DWORD WINAPI iocp_entry(LPVOID state) {
    for (;;) {
        struct per_io* pio = NULL;
        struct per_socket* psck = NULL;
        DWORD bytes = 0;
        // The real API takes the completion key before the OVERLAPPED pointer.
        BOOL status = GQCS(...,
                           &bytes,
                           (PULONG_PTR)&psck,
                           (LPOVERLAPPED*)&pio,
                           INFINITE);
        pio->status = status;
        psck->active_io = pio;

        // Tell the socket fiber where to return to.
        psck->fiber_iocp_context = state;

        SwitchToFiber(psck->fiber_socket_context);
    }
    return 0;
}

VOID WINAPI per_socket_entry(LPVOID state) {
    struct per_socket* const _this = (struct per_socket*)state;
    for (;;) {
        struct per_io* const pio = _this->active_io;
        switch (pio->action) {
            case ACTION_RECV:
                [...];
                break;
            case ACTION_SEND:
                [...];

            [whatever...];
        }

        // Done with this completion: hand control back to the IOCP fiber.
        SwitchToFiber(_this->fiber_iocp_context);

    }
}

BTW, a good reference on the topic of (web) server scalability:

http://www.kegel.com/c10k.html

(I guess many here know this page).

Indeed.

Aug 17 '08 #20

Chris M. Thomasson

"gpderetta" <gp*******@gmail.com> wrote in message
news:5b**********************************@59g2000hsb.googlegroups.com...

On Aug 16, 7:47 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
"Ian Collins" <ian-n...@hotmail.com> wrote in message

news:6g*************@mid.individual.net...

Chris M. Thomasson wrote:

"Chris Becke" <chris.be...@gmail.com> wrote:

Microsoft Windows needs to allocate stack space for each thread
created. On the 32-bit version of the OS this means an immediate
scalability problem: with only 2 GB of address space per process, this
implies a hard limit of 2048 connections (threads) per server. Even on
a 64-bit OS, the working set added to the process for each thread means
that physical hardware limits will be reached that much faster than in
a system that uses asynchronous I/O to keep lots of connections on one
thread.
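
The 2048 figure quoted above appears to come from dividing the 2 GB user address space by a 1 MB stack reservation per thread, which is the default linker setting on Windows. The following sketch just spells out that arithmetic; the function and constant names are invented for the example, and the result is only an upper bound because code, heap, and DLLs also consume address space.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on the number of threads when every thread reserves
// `stack_bytes` out of a user address space of `address_space_bytes`.
// Code, heap, and DLLs are ignored, so the practical limit is lower.
constexpr std::uint64_t max_threads(std::uint64_t address_space_bytes,
                                    std::uint64_t stack_bytes) {
    return address_space_bytes / stack_bytes;
}

constexpr std::uint64_t kMiB = 1024ull * 1024ull;
constexpr std::uint64_t kGiB = 1024ull * kMiB;

// Assumption: 1 MiB reserved per thread stack.
static_assert(max_threads(2 * kGiB, 1 * kMiB) == 2048,
              "matches the limit quoted in the post");

// A smaller reservation (here 64 KiB) raises the ceiling considerably,
// but does not remove the per-thread cost.
static_assert(max_threads(2 * kGiB, 64 * 1024) == 32768,
              "64 KiB stacks");
```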

I have personally created IOCP servers on Windows which can handle
__well__ over 40,000 connections; want some tips?

But I'd bet several gallons of my favourite beer that you didn't create
40,000 threads!

I only created around 2 * N threads for the IOCP threading pool, where N is
the number of processors in the system. I did create a couple more
threads whose only job was to perform some resource maintenance tasks...

The one thread per connection model simply isn't scalable beyond a
handful of threads per core.

Right. Well, I guess you could use one user-thread (e.g. fiber)
per-connection and implement your own scheduler. The question is why in the
world would you do that on Windows when there is the wonderful and scalable
IOCP mechanism to work with...

You can of course use user-threads on top of IOCP and get the best of
both worlds.

I jumped the gun here before actually working out a solution... Now that I
think about it some more, well, this scheme may not work after all. The
problem is that fibers are bound to specific threads for their lifetime.
However, IOCP completions can allow a socket to receive completions on
different threads. Think about it. If a socket issues two overlapped io
operations, well, those completions may come in on two different threads.
How would you use fibers in this scenario? The only way I can see it working
is if you created an IOCP handle for each io processing thread, which defeats
the purpose of IOCP in the first place. Therefore, I conclude that fibers
and IOCP will _not_ work well together as-is...

What am I missing?

[...]

Aug 17 '08 #21

gpderetta

On Aug 17, 5:18 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

"gpderetta" <gpdere...@gmail.com> wrote in message

You can of course use user-threads on top of IOCP and get the best of
both worlds.

I jumped the gun here before actually working out a solution... Now that I
think about it some more, well, this scheme may not work after all. The
problem is that fibers are bound to specific threads for their lifetime.

Hum, as far as I understand from the Win32 documentation, fibers are
allowed to migrate from one thread to another:

"You can call SwitchToFiber with the address of a fiber created by a
different thread. To do this, you must have the address returned to
the other thread when it called CreateFiber and you must use proper
synchronization. "

(in fact I think you also need some appropriate compiler flags, such as
MSVC's /GT, to disable some TLS-related optimizations)

Of course the problem is proper synchronization:

However, IOCP completions can allow a socket to receive completions on
different threads. Think about it. If a socket issues two overlapped io
operations, well, those completions may come in on two different threads.
How would you use fibers in this scenario?

Hum don't do two overlapped operations linked to the same fiber
then :).

More seriously, the problem is making sure never to wake up a fiber if
it is already running and never to go to sleep if there is any ready
operation not yet acknowledged.

For example, for every asynchronous operation posted, you could
increment a counter; when an operation completes, you decrement the
counter and check whether the fiber is awake; if it is not, mark the
fiber as awake and run it (or put it in a ready queue). When you stop a
fiber, you first suspend it, then atomically check for pending ready
operations and mark it as sleeping. If there are ready operations, you
abort the suspend and restart the fiber. Getting things right without
missing wakeups is not trivial.

You need to synchronize access to the counter and fiber state, of
course, but you would need to synchronize access to any state attached
to the socket anyway, so IMHO it doesn't make much of a difference.
In fact I'm sure you can come up with some scheme that actually
doesn't require locks.
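
One possible lock-free shape for that bookkeeping is sketched below in portable C++. This is an assumption-laden variant, not the exact scheme described above: it counts completed-but-unconsumed operations instead of outstanding ones, and the class and method names (FiberGate, on_completion, take_ready, try_sleep) are invented for the example. It only tracks who is allowed to run the fiber; the actual SwitchToFiber calls and the ready queue are left out.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical per-fiber bookkeeping; none of these names come from Win32.
// All atomics use the default sequentially consistent ordering, which the
// "store awake, then load ready" handshake in try_sleep() relies on.
class FiberGate {
public:
    // Called by an I/O thread when an operation for this fiber completes.
    // Returns true if the caller has just claimed the fiber and must run it
    // (or queue it); false if some other thread already owns it.
    bool on_completion() {
        ready_.fetch_add(1);
        bool expected = false;
        return awake_.compare_exchange_strong(expected, true);
    }

    // Called by the running fiber: consume one ready completion, if any.
    bool take_ready() {
        int n = ready_.load();
        while (n > 0) {
            if (ready_.compare_exchange_weak(n, n - 1)) return true;
        }
        return false;
    }

    // Called by the scheduler after the fiber has switched out.
    // Returns true if the fiber is now asleep or has been claimed by
    // another thread, so the caller must leave it alone. Returns false if
    // the suspend was aborted and the caller must resume the fiber.
    bool try_sleep() {
        awake_.store(false);
        if (ready_.load() == 0) return true;
        bool expected = false;
        return !awake_.compare_exchange_strong(expected, true);
    }

private:
    std::atomic<int> ready_{0};      // completions not yet consumed
    std::atomic<bool> awake_{false}; // true while some thread owns the fiber
};
```

The point of the ordering is that a completion arriving while the fiber is going to sleep is seen by at least one side: either the I/O thread finds the fiber asleep and claims it, or the scheduler finds a non-zero ready count and aborts the suspend. The compare-and-swap ensures only one of them wins.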

The only way I can see it working
is if you created an IOCP handle for each io processing thread, which defeats
the purpose of IOCP in the first place. Therefore, I conclude that fibers
and IOCP will _not_ work well together as-is...

Even with one IOCP per thread, you are no worse off than with unix select
and friends: IOCP is still a pretty good reactor and will still scale way
better than the one thread per connection model; you lose the
benefit of having the OS control the optimal number of running
threads, but you gain by not needing synchronization and by getting
better cache locality.

--
gpd

Aug 22 '08 #22

Chris M. Thomasson

"gpderetta" <gp*******@gmail.com> wrote in message
news:83**********************************@25g2000hsx.googlegroups.com...

On Aug 17, 5:18 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
"gpderetta" <gpdere...@gmail.com> wrote in message

You can of course use user-threads on top of IOCP and get the best of
both worlds.

I jumped the gun here before actually working out a solution... Now that I
think about it some more, well, this scheme may not work after all. The
problem is that fibers are bound to specific threads for their lifetime.

Hum as far as I understand from the win32 documentation, fibers are
allowed to migrate from one thread to another:

"You can call SwitchToFiber with the address of a fiber created by a
different thread. To do this, you must have the address returned to
the other thread when it called CreateFiber and you must use proper
synchronization. "

[...]

Okay. I just thought that any thread which created fibers would end up
destroying said fibers when it gets terminated (e.g., returning from initial
thread function). This is where I got the term "fibers are bound to specific
threads". Please note that I never used fibers in any of my code in any way
shape or form! I love to learn new things!

You can read the following:

http://developer.amd.com/documentati...031200677.aspx

And tell me where I am going wrong.

:^)

Aug 24 '08 #23

gpderetta

On Aug 24, 12:17 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

"gpderetta" <gpdere...@gmail.com> wrote in message

news:83**********************************@25g2000hsx.googlegroups.com...

On Aug 17, 5:18 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
"gpderetta" <gpdere...@gmail.com> wrote in message

You can of course use user-threads on top of IOCP and get the best of
both worlds.

I jumped the gun here before actually working out a solution... Now that I
think about it some more, well, this scheme may not work after all. The
problem is that fibers are bound to specific threads for their lifetime.

Hum as far as I understand from the win32 documentation, fibers are
allowed to migrate from one thread to another:

"You can call SwitchToFiber with the address of a fiber created by a
different thread. To do this, you must have the address returned to
the other thread when it called CreateFiber and you must use proper
synchronization. "

[...]

Okay. I just thought that any thread which created fibers would end up
destroying said fibers when it gets terminated (e.g., returning from initial
thread function). This is where I got the term "fibers are bound to specific
threads". Please note that I never used fibers in any of my code in any way
shape or form! I love to learn new things!

You can read the following:

http://developer.amd.com/documentati...031200677.aspx

As far as I can tell, the article only says that the fiber currently
running on a specific thread is destroyed when the thread terminates
(which is the behavior I would expect).
Thinking of possible implementations of fibers, I see no reason to
kill all fibers generated by a specific thread when that thread exits
(in fact it would be quite expensive and require a lot of book-keeping
to do).

In the event you are right about fibers getting destroyed at an
inconvenient time, it is always possible to make your own 'fiber-like'
abstraction without such a limitation (and you know ASM well enough to
do it :) )

To keep this more C++ related: the (tentative) boost.coroutine library
uses fibers internally on win32, but for portability reasons, it
doesn't allow thread migration. You can look there for a possible
custom implementation of fibers (or threadlets or whatever you want to
call them).

--
gpd

Aug 24 '08 #24
