Bytes | Developer Community
Re: atomically thread-safe Meyers singleton impl (fixed)...

You need compiler barriers (_ReadWriteBarrier() in MSVC) to ensure
things don't get rearranged across your atomic access
functions. There's no need to drop to assembler either: you're not
doing anything more complicated than a simple MOV.

Anyway, if I was writing this (and I wouldn't be, because I really
dislike singletons), I'd just use boost::call_once. It doesn't use a
lock unless it has to and is portable across pthreads and win32
threads.

Oh, and one other thing: you don't need inline assembler for atomic
ops with gcc from version 4.2 onwards, as the compiler has built-in
functions for atomic operations.
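For illustration, a minimal sketch of the builtins in question (available from gcc 4.2; the variable and function names here are invented for the example):

```cpp
// Minimal sketch of gcc's atomic builtins -- no inline assembler needed.
static long g_counter = 0;

long demo_builtins()
{
    __sync_fetch_and_add(&g_counter, 1);             // atomic increment; returns the old value
    __sync_bool_compare_and_swap(&g_counter, 1, 5);  // CAS: swap 1 -> 5 if still 1
    __sync_synchronize();                            // full compiler + hardware barrier
    return g_counter;                                // 5 if no other thread interfered
}
```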

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
Jul 30 '08 #1
On 30 Jul, 11:20, Anthony Williams <anthony....@gmail.com> wrote:
You need compiler barriers (_ReadWriteBarrier() in MSVC) to ensure
things don't get rearranged across your atomic access
functions. There's no need to drop to assembler either: you're not
doing anything more complicated than a simple MOV.

In MSVC one can just use volatile variables; accesses to volatile
variables are guaranteed to be 'load-acquire' and 'store-release'. I.e. on
Itanium/PPC MSVC will emit hardware memory fences (along with compiler
fences).

http://msdn.microsoft.com/en-us/library/12a04hfd.aspx
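A sketch of the publication pattern this guarantee enables. Note this is an MSVC-specific extension: under other compilers the code below still compiles but carries no such ordering guarantee, and all names here are illustrative:

```cpp
// Under MSVC (per the linked page), a volatile store is a store-release
// and a volatile load is a load-acquire. Illustrative names throughout.
struct config {
    int ready_value;
};

static config* volatile g_config = 0;  // volatile pointer, not pointer-to-volatile

void publish_config(config* c)
{
    // On MSVC, the writes that initialized *c cannot sink below this store.
    g_config = c;
}

int read_config_value()
{
    // On MSVC, the field read below cannot rise above this load.
    config* c = g_config;
    return c ? c->ready_value : -1;
}
```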

Dmitriy V'jukov
Jul 30 '08 #2
"Dmitriy V'jukov" <dv*****@gmail.com> writes:
On 30 Jul, 11:20, Anthony Williams <anthony....@gmail.com> wrote:
>You need compiler barriers (_ReadWriteBarrier() in MSVC) to ensure
things don't get rearranged across your atomic access
functions. There's no need to drop to assembler either: you're not
doing anything more complicated than a simple MOV.


In MSVC one can just use volatile variables; accesses to volatile
variables are guaranteed to be 'load-acquire' and 'store-release'. I.e. on
Itanium/PPC MSVC will emit hardware memory fences (along with compiler
fences).

http://msdn.microsoft.com/en-us/library/12a04hfd.aspx
I am aware of this. However, the description only says it orders
accesses to "global and static data", and that it refers to objects
*declared* as volatile. I haven't tested it enough to be confident
that it is entirely equivalent to _ReadWriteBarrier(), and that it
works on variables *cast* to volatile.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
Jul 30 '08 #3

"Anthony Williams" <an*********@gmail.com> wrote in message
news:u6***********@gmail.com...
"Chris M. Thomasson" <no@spam.invalid> writes:
[...]
The algorithm used by boost::call_once on pthreads platforms is
described here:

http://www.open-std.org/jtc1/sc22/wg...007/n2444.html
>>It doesn't use a
lock unless it has to and is portable across pthreads and win32
threads.

The code I posted does not use a lock unless it absolutely has to
because it attempts to efficiently take advantage of the double
checked locking pattern.

Oh yes, I realise that: the code for call_once is similar. However, it
attempts to avoid contention on the mutex by using thread-local
storage. If you have atomic ops, you can go even further in
eliminating the mutex, e.g. using compare_exchange and fetch_add.
[...]
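The "eliminate the mutex with atomic ops" idea quoted above could look roughly like this, written here with gcc's `__sync` builtins. This is a hedged sketch, not boost::call_once: a production version would block instead of spinning, and all names are invented:

```cpp
// Once-flag driven entirely by compare-and-swap.
// g_once_state: 0 = not started, 1 = running, 2 = done.
static long g_once_state = 0;
static int g_shared = 0;
static int g_init_calls = 0;

static void init_shared()
{
    ++g_init_calls;   // counts how many times the initializer actually ran
    g_shared = 99;
}

void call_once_cas(void (*init)())
{
    if (__sync_bool_compare_and_swap(&g_once_state, 0, 1)) {
        init();                                            // this thread won the race
        __sync_bool_compare_and_swap(&g_once_state, 1, 2); // publish "done"
    } else {
        while (__sync_fetch_and_add(&g_once_state, 0) != 2)
            ;                                              // spin until the winner finishes
    }
}
```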

Before I reply to your entire post I should point out that:

http://groups.google.com/group/comp....9c7aff738f9102

the Boost mechanism is not 100% portable, but is elegant in practice. It
uses a technique similar to one that a certain distributed reference
counting algorithm I created uses:

http://groups.google.com/group/comp....16861ac568b2be

http://groups.google.com/group/comp....hreads&q=vzoom

Not 100% portable, but _highly_ portable indeed!

Jul 30 '08 #4
"Chris M. Thomasson" <no@spam.invalid> writes:
"Anthony Williams" <an*********@gmail.com> wrote in message
news:u6***********@gmail.com...
>"Chris M. Thomasson" <no@spam.invalid> writes:
[...]
>The algorithm used by boost::call_once on pthreads platforms is
described here:

http://www.open-std.org/jtc1/sc22/wg...007/n2444.html
>>>It doesn't use a
lock unless it has to and is portable across pthreads and win32
threads.

The code I posted does not use a lock unless it absolutely has to
because it attempts to efficiently take advantage of the double
checked locking pattern.

Oh yes, I realise that: the code for call_once is similar. However, it
attempts to avoid contention on the mutex by using thread-local
storage. If you have atomic ops, you can go even further in
eliminating the mutex, e.g. using compare_exchange and fetch_add.
[...]

Before I reply to your entire post I should point out that:

http://groups.google.com/group/comp....9c7aff738f9102

the Boost mechanism is not 100% portable, but is elegant in
practice.
Yes. If you look at the whole thread, you'll see a comment by me there
where I admit as much.
It uses a technique similar to one that a certain distributed
reference counting algorithm I created uses:
I wasn't aware that you were using something similar in vZOOM.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
Jul 30 '08 #5
"Anthony Williams" <an*********@gmail.com> wrote in message
news:uh***********@gmail.com...
"Chris M. Thomasson" <no@spam.invalid> writes:
>"Anthony Williams" <an*********@gmail.com> wrote in message
news:u6***********@gmail.com...
>>"Chris M. Thomasson" <no@spam.invalid> writes:
[...]
>>The algorithm used by boost::call_once on pthreads platforms is
described here:

http://www.open-std.org/jtc1/sc22/wg...007/n2444.html

It doesn't use a
lock unless it has to and is portable across pthreads and win32
threads.

The code I posted does not use a lock unless it absolutely has to
because it attempts to efficiently take advantage of the double
checked locking pattern.

Oh yes, I realise that: the code for call_once is similar. However, it
attempts to avoid contention on the mutex by using thread-local
storage. If you have atomic ops, you can go even further in
eliminating the mutex, e.g. using compare_exchange and fetch_add.
[...]

Before I reply to your entire post I should point out that:

http://groups.google.com/group/comp....9c7aff738f9102

the Boost mechanism is not 100% portable, but is elegant in
practice.

Yes. If you look at the whole thread, you'll see a comment by me there
where I admit as much.
Does the following line:

__thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;

explicitly set `_fast_pthread_once_per_thread_epoch' to zero? If so, is it
guaranteed?

>It uses a technique similar to one that a certain distributed
reference counting algorithm I created uses:

I wasn't aware that you were using something similar in vZOOM.
Humm, now that I think about it, it seems like I am totally mistaken. The
"most portable" version of vZOOM relies on an assumption that pointer
load/stores are atomic and the unlocking of a mutex executes at least a
release-barrier, and the loading of a shared variable executes at least a
data-dependent load-barrier; very similar to RCU without the explicit
#LoadStore | #StoreStore before storing into a shared pointer location...
Something like:


____________________________________________________________________
struct foo {
    int a;
};

static foo* shared_f = NULL;

// single producer thread {
    foo* local_f = new foo;
    pthread_mutex_t* lock = get_per_thread_mutex();
    pthread_mutex_lock(lock);
    local_f->a = 666;
    pthread_mutex_unlock(lock);
    shared_f = local_f;
// }

// single consumer thread {
    foo* local_f;
    while (! (local_f = shared_f)) {
        sched_yield();
    }
    assert(local_f->a == 666);
    delete local_f;
// }
____________________________________________________________________

If the `pthread_mutex_unlock()' function does not execute at least a
release-barrier in the producer, and if the load of the shared variable does
not execute at least a data-dependent load-barrier in the consumer, the
"most portable" version of vZOOM will NOT work on that platform in any way,
shape or form; it will need a platform-dependent version. However, the only
platform I can think of where the intra-node memory visibility requirements
do not hold is the Alpha... For multi-node super-computers, inter-node
communication is handled with MPI.

Jul 30 '08 #6
"Chris M. Thomasson" <no@spam.invalid> writes:
"Anthony Williams" <an*********@gmail.com> wrote in message
news:uh***********@gmail.com...
>"Chris M. Thomasson" <no@spam.invalid> writes:
>>"Anthony Williams" <an*********@gmail.com> wrote in message
news:u6***********@gmail.com...
"Chris M. Thomasson" <no@spam.invalid> writes:
[...]
The algorithm used by boost::call_once on pthreads platforms is
described here:

http://www.open-std.org/jtc1/sc22/wg...007/n2444.html

>It doesn't use a
>lock unless it has to and is portable across pthreads and win32
>threads.
>
The code I posted does not use a lock unless it absolutely has to
because it attempts to efficiently take advantage of the double
checked locking pattern.

Oh yes, I realise that: the code for call_once is similar. However, it
attempts to avoid contention on the mutex by using thread-local
storage. If you have atomic ops, you can go even further in
eliminating the mutex, e.g. using compare_exchange and fetch_add.
[...]

Before I reply to your entire post I should point out that:

http://groups.google.com/group/comp....9c7aff738f9102

the Boost mechanism is not 100% portable, but is elegant in
practice.

Yes. If you look at the whole thread, you'll see a comment by me there
where I admit as much.

Does the following line:

__thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;

explicitly set `_fast_pthread_once_per_thread_epoch' to zero? If so,
is it guaranteed?
The algorithm assumes it does, but it depends which compiler you
use. In the Boost implementation, the value is explicitly
initialized (to ~0 --- I found it worked better with exception
handling to count backwards).
>
>>It uses a technique similar to one that a certain distributed
reference counting algorithm I created uses:

I wasn't aware that you were using something similar in vZOOM.

Humm, now that I think about it, it seems like I am totally
mistaken. The "most portable" version of vZOOM relies on an assumption
that pointer load/stores are atomic and the unlocking of a mutex
executes at least a release-barrier, and the loading of a shared
variable executes at least a data-dependent load-barrier; very similar
to RCU without the explicit #LoadStore | #StoreStore before storing
into a shared pointer location... Something like:

// single producer thread {
foo* local_f = new foo;
pthread_mutex_t* lock = get_per_thread_mutex();
pthread_mutex_lock(lock);
local_f->a = 666;
pthread_mutex_unlock(lock);
shared_f = local_f;
So you're using the lock just for the barrier properties. Interesting
idea.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
Jul 30 '08 #7

"Anthony Williams" <an*********@gmail.com> wrote in message
news:ud***********@gmail.com...
"Chris M. Thomasson" <no@spam.invalid> writes:
[...]
>>>the Boost mechanism is not 100% portable, but is elegant in
practice.

Yes. If you look at the whole thread, you'll see a comment by me there
where I admit as much.

Does the following line:

__thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;

explicitly set `_fast_pthread_once_per_thread_epoch' to zero? If so,
is it guaranteed?

The algorithm assumes it does, but it depends which compiler you
use. In the Boost implementation, the value is explicitly
initialized (to ~0 --- I found it worked better with exception
handling to count backwards).
>>
>>>It uses a technique similar to one that a certain distributed
reference counting algorithm I created uses:

I wasn't aware that you were using something similar in vZOOM.

Humm, now that I think about it, it seems like I am totally
mistaken. The "most portable" version of vZOOM relies on an assumption
that pointer load/stores are atomic and the unlocking of a mutex
executes at least a release-barrier, and the loading of a shared
variable executes at least a data-dependent load-barrier; very similar
to RCU without the explicit #LoadStore | #StoreStore before storing
into a shared pointer location... Something like:

// single producer thread {
foo* local_f = new foo;
pthread_mutex_t* lock = get_per_thread_mutex();
pthread_mutex_lock(lock);
local_f->a = 666;
pthread_mutex_unlock(lock);
shared_f = local_f;

So you're using the lock just for the barrier properties. Interesting
idea.
Yes. Actually, I did not show the whole algorithm; STUPID ME!!! As posted,
the code above is busted because the store to shared_f can legally be
hoisted up above the unlock. Here is the whole picture... Each thread has a
special dedicated mutex which is locked from its birth... Here is exactly
how production of an object can occur:
static foo* volatile shared_f = NULL;

// single producer thread {
00: foo* local_f;
01: pthread_mutex_t* const mem_mutex = get_per_thread_mem_mutex();
02: local_f = new foo;
03: local_f->a = 666;
04: pthread_mutex_unlock(mem_mutex);
05: pthread_mutex_lock(mem_mutex);
06: shared_f = local_f;
}
Here are the production rules wrt POSIX:

1. Steps 02-03 CANNOT sink below step 04
2. Step 06 CANNOT rise above step 05
3. vZOOM assumes that step 04 has a release barrier

Those __two guarantees and single assumption__ ensure that the ordering and
visibility of the operations are correct. After that, the consumer can do:
// single consumer thread {
00: foo* local_f;
01: while (! (local_f = shared_f)) {
02: sched_yield();
}
03: assert(local_f->a == 666);
04: delete local_f;
}
Consumption rules:

01: vZOOM assumes that the load from `shared_f' will have an implied
data-dependent load-barrier.
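Putting the production and consumption rules together, a runnable sketch (assuming POSIX threads and atomic pointer load/stores, per the stated vZOOM assumptions; helper names like `consume_once` are invented for this example):

```cpp
#include <pthread.h>
#include <sched.h>

// One producer publishes a foo using only the barrier properties of its
// per-thread "memory mutex", held from the thread's birth; one consumer
// spins until it observes the published pointer.
struct foo2 { int a; };
static foo2* volatile g_shared_f = 0;

static void* producer_thread(void* arg)
{
    pthread_mutex_t* mem_mutex = static_cast<pthread_mutex_t*>(arg);
    pthread_mutex_lock(mem_mutex);    // locked from the thread's birth
    foo2* local_f = new foo2;         // 02: these steps CANNOT sink
    local_f->a = 666;                 // 03: ...below the unlock
    pthread_mutex_unlock(mem_mutex);  // 04: assumed release barrier
    pthread_mutex_lock(mem_mutex);    // 05: the store CANNOT rise above this
    g_shared_f = local_f;             // 06: publish
    pthread_mutex_unlock(mem_mutex);
    return 0;
}

int consume_once()
{
    pthread_mutex_t mem_mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_t tid;
    pthread_create(&tid, 0, producer_thread, &mem_mutex);
    foo2* local_f;
    while (!(local_f = g_shared_f))   // assumed data-dependent load ordering
        sched_yield();
    int observed = local_f->a;
    delete local_f;
    pthread_join(tid, 0);
    return observed;
}
```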

BTW, here is a brief outline of how the "most portable" version of vZOOM
distributed reference counting works with the above idea:

http://groups.google.ru/group/comp.p...e9b6e427b4a144

http://groups.google.com/group/comp....24fe99f742ce6e
(an __excellent__ question from Dmitriy...)


What do you think, Anthony?

Jul 30 '08 #8
[...]
>
>BTW, here is a brief outline of how the "most portable" version of vZOOM
>distributed reference counting works with the above idea:
>
>http://groups.google.ru/group/comp.p...e9b6e427b4a144

Take note of the per-thread memory lock. It's vital to vZOOM.

>http://groups.google.com/group/comp....24fe99f742ce6e
>(an __excellent__ question from Dmitriy...)
>
>What do you think, Anthony?
Jul 30 '08 #9