Subtle difference between C++0x MM and other MMs

Dmitriy V'jukov

Consider following Peterson's algorithm implementation from Wikipedia:
http://en.wikipedia.org/wiki/Peterson%27s_algorithm

flag[0] = 0
flag[1] = 0
turn = 0

P0: flag[0] = 1
turn = 1
memory_barrier()
while( flag[1] && turn == 1 );
// do nothing
// critical section
...
// end of critical section
flag[0] = 0

P1: flag[1] = 1
turn = 0
memory_barrier()
while( flag[0] && turn == 0 );
// do nothing
// critical section
...
// end of critical section
flag[1] = 0

We can implement this in Java using volatile variables, and needed
memory_barrier() will be emitted automatically by compiler.
We can implement this in C# using volatile variables, and
Thread.MemoryBarrier() as memory_barrier().
We can implement this in x86 MM using plain loads and stores, and
mfence instruction as memory_barrier().
We can implement this in C++0x using std::atomic<and issuing loads
with memory_order_acquire, stores with memory_order_release, and
atomic_thread_fence(memory_order_seq_cst) as memory_barrier(). This is
the most straightforward translation of Java/C#/x86 implementations.

The only problem is that C++0x implementation will not work.
Personally for me, it's quite counter-intuitive. And following
question arise. What is the most simple way to translate some existing
Java/C#/x86 algorithm implementation to C++0x? It seems that it's not
so easy...

Dmitriy V'jukov

Aug 24 '08 #1

Subscribe Post Reply

2029

Peter Dimov

On Aug 24, 7:46*pm, "Dmitriy V'jukov" <dvyu...@gmail.comwrote:

Consider following Peterson's algorithm implementation from Wikipedia:http://en.wikipedia.org/wiki/Peterson%27s_algorithm

*flag[0] * = 0
*flag[1] * = 0
*turn * * *= 0

*P0: flag[0] = 1
* * turn = 1
* * memory_barrier()
* * while( flag[1] && turn == 1 );
* * * * * * // do nothing
* * // critical section
* * ...
* * // end of critical section
* * flag[0] = 0

P1: flag[1] = 1
* * turn = 0
* * memory_barrier()
* * while( flag[0] && turn == 0 );
* * * * * * // do nothing
* * // critical section
* * ...
* * // end of critical section
* * flag[1] = 0

We can implement this in Java using volatile variables, and needed
memory_barrier() will be emitted automatically by compiler.

And the C++ equivalent is to use seq_cst load and stores, which are
equivalent to Java volatiles.

We can implement this in C# using volatile variables, and
Thread.MemoryBarrier() as memory_barrier().
We can implement this in x86 MM using plain loads and stores, and
mfence instruction as memory_barrier().
We can implement this in C++0x using std::atomic<and issuing loads
with memory_order_acquire, stores with memory_order_release, and
atomic_thread_fence(memory_order_seq_cst) as memory_barrier(). This is
the most straightforward translation of Java/C#/x86 implementations.

The only problem is that C++0x implementation will not work.

Why will it not work?

Aug 24 '08 #2

Dmitriy V'jukov

On 24 Á×Ç, 21:52, Peter Dimov <pdi...@gmail.comwrote:

On Aug 24, 7:46 pm, "Dmitriy V'jukov" <dvyu...@gmail.comwrote:

Consider following Peterson's algorithm implementation from Wikipedia:http://en.wikipedia.org/wiki/Peterson%27s_algorithm

flag[0] = 0
flag[1] = 0
turn = 0

P0: flag[0] = 1
turn = 1
memory_barrier()
while( flag[1] && turn == 1 );
// do nothing
// critical section
...
// end of critical section
flag[0] = 0

P1: flag[1] = 1
turn = 0
memory_barrier()
while( flag[0] && turn == 0 );
// do nothing
// critical section
...
// end of critical section
flag[1] = 0

We can implement this in Java using volatile variables, and needed
memory_barrier() will be emitted automatically by compiler.

And the C++ equivalent is to use seq_cst load and stores, which are
equivalent to Java volatiles.

Yes, it's possible to implement any algorithm that relying on
sequentially consistent memory model in C++0x using seq_cst atomic
operations. But! Seq_cst atomic operations, especially stores, can be
quite expensive. So, one has general desire to use weaker operations,
like store-release and load-acquire. And in Java/C#/x86 it's possible
to implement Peterson's algorithm using weak operations + 1 strong
fence. In C++0x - NOT.

We can implement this in C# using volatile variables, and
Thread.MemoryBarrier() as memory_barrier().
We can implement this in x86 MM using plain loads and stores, and
mfence instruction as memory_barrier().
We can implement this in C++0x using std::atomic<and issuing loads
with memory_order_acquire, stores with memory_order_release, and
atomic_thread_fence(memory_order_seq_cst) as memory_barrier(). This is
the most straightforward translation of Java/C#/x86 implementations.

The only problem is that C++0x implementation will not work.

Why will it not work?

I mean not every C++0x implementation of Peterson's algorithm, but
particular implementation which uses store-release, load-acquire + 1
seq_cst fence.
Dmitriy V'jukov

Aug 24 '08 #3

Peter Dimov

On Aug 24, 9:44*pm, "Dmitriy V'jukov" <dvyu...@gmail.comwrote:

And in Java/C#/x86 it's possible
to implement Peterson's algorithm using weak operations + 1 strong
fence. In C++0x - NOT.

How would you implement Peterson's algorithm in Java using weak
operations and a fence? Java doesn't have weak operations or fences.
Its volatile loads and stores are equivalent to C++MM's seq_cst loads
and stores. Both promise sequential consistency (no more and no less).

I mean not every C++0x implementation of Peterson's algorithm, but
particular implementation which uses store-release, load-acquire + 1
seq_cst fence.

Why do you think that this implementation doesn't work?

Aug 24 '08 #4

Peter Dimov

On Aug 24, 11:53*pm, Peter Dimov <pdi...@gmail.comwrote:

On Aug 24, 9:44*pm, "Dmitriy V'jukov" <dvyu...@gmail.comwrote:

I mean not every C++0x implementation of Peterson's algorithm, but
particular implementation which uses store-release, load-acquire + 1
seq_cst fence.

Why do you think that this implementation doesn't work?

I think I see your point. Getting back to

P0: flag[0] = 1
turn = 1
memory_barrier()
while( flag[1] && turn == 1 );
// do nothing
// critical section
...
// end of critical section
flag[0] = 0

P1: flag[1] = 1
turn = 0
memory_barrier()
while( flag[0] && turn == 0 );
// do nothing
// critical section
...
// end of critical section
flag[1] = 0

It's easy to show that P0 and P1 can't block each other forever;
eventually they will agree on the value of 'turn' and one of them will
proceed.

The case where P0 sees flag[1] == 0 and P1 sees flag[0] == 0 is a
classic SC violation example and every reasonable definition of
memory_barrier rules it out.

The interesting case you must have had in mind is the sequence

P1:flag[1] = 1
P1:turn = 0
P0:flag[0] = 1
P0:turn = 1
P0:memory_barrier

Can P0 now see flag[1] == 0? (P1 will later see turn == 1 and enter
the critical section.)

I wonder whether the formal CLR memory model (or even the current
formal x86 memory model) disallows this. (XCHG for turn instead of a
fence should work.)

I think that the C++MM does, if the condition is while( turn == 1 &&
flag[1] ). P0 seeing its own turn=1 doesn't break the release sequence
started by P1:turn=0 because turn=1 is executed by the same thread
(first bullet in 1.10/6). So P1:turn=0 synchronizes-with the read from
'turn' in P0 and ensures that P1:flag[1]=1 is seen.

Aug 24 '08 #5

Dmitriy V'jukov

On 25 Á×Ç, 02:47, Peter Dimov <pdi...@gmail.comwrote:

On Aug 24, 11:53 pm, Peter Dimov <pdi...@gmail.comwrote:

On Aug 24, 9:44 pm, "Dmitriy V'jukov" <dvyu...@gmail.comwrote:
I mean not every C++0x implementation of Peterson's algorithm, but
particular implementation which uses store-release, load-acquire + 1
seq_cst fence.

Why do you think that this implementation doesn't work?

I think I see your point. Getting back to

P0: flag[0] = 1
turn = 1
memory_barrier()
while( flag[1] && turn == 1 );
// do nothing
// critical section
...
// end of critical section
flag[0] = 0

P1: flag[1] = 1
turn = 0
memory_barrier()
while( flag[0] && turn == 0 );
// do nothing
// critical section
...
// end of critical section
flag[1] = 0

It's easy to show that P0 and P1 can't block each other forever;
eventually they will agree on the value of 'turn' and one of them will
proceed.

The case where P0 sees flag[1] == 0 and P1 sees flag[0] == 0 is a
classic SC violation example and every reasonable definition of
memory_barrier rules it out.

The interesting case you must have had in mind is the sequence

P1:flag[1] = 1
P1:turn = 0
P0:flag[0] = 1
P0:turn = 1
P0:memory_barrier

Can P0 now see flag[1] == 0? (P1 will later see turn == 1 and enter
the critical section.)

Exactly! And this behavior is very counter-intuitive for me!

I wonder whether the formal CLR memory model (or even the current
formal x86 memory model) disallows this. (XCHG for turn instead of a
fence should work.)

As for x86 MM, I think that Yes, such behavior is disallowed. x86 MM
defines possible reorderings in one thread. And then intended reading
is that you must try to construct interleaving of threads (taking into
account reorderings in threads). If it's possible to construct
interleaving, then behavior is allowed. If it's impossible - then
disallowed.

For Peterson's algorithm it's impossible to construct interleaving,
which will break algorithm.

For for CLR, it's very informal. But I think intended reading is the
same as for x86... just because Microsoft targets mainly to x86 :)

But for C++0x the fact that it's impossible to construct interleaving
doesn't matter...

I think that the C++MM does, if the condition is while( turn == 1 &&
flag[1] ). P0 seeing its own turn=1 doesn't break the release sequence
started by P1:turn=0 because turn=1 is executed by the same thread
(first bullet in 1.10/6). So P1:turn=0 synchronizes-with the read from
'turn' in P0 and ensures that P1:flag[1]=1 is seen.

"by the same thread" which execute release, not "by the same thread"
which execute acquire.
So this won't work too.
Dmitriy V'jukov

Aug 25 '08 #6

Dmitriy V'jukov

On 25 Á×Ç, 00:53, Peter Dimov <pdi...@gmail.comwrote:

How would you implement Peterson's algorithm in Java using weak
operations and a fence? Java doesn't have weak operations or fences.
Its volatile loads and stores are equivalent to C++MM's seq_cst loads
and stores. Both promise sequential consistency (no more and no less).

Java volatiles promise more. volatile store is release, and volatile
load is acquire. for x86 this means that plain stores and loads will
be used. Well, yes, sometimes heavy membar will be emitted.
C++0x's seq_cst atomic store on x86 will be locked instruction.
So, in my opinion, translating Java volatiles to C++0x's seq_cst
atomics is not fair.
IMVHO more fair way is to translate volatile store to store-release,
volatile load to load-acquire, and manually emit seq_cst fence. At
least this is what initially comes to mind.

Dmitriy V'jukov

Aug 25 '08 #7

Peter Dimov

On Aug 25, 9:27*am, "Dmitriy V'jukov" <dvyu...@gmail.comwrote:

"by the same thread" which execute release, not "by the same thread"
which execute acquire.
So this won't work too.

Yes, you are right.

Aug 25 '08 #8

Peter Dimov

On Aug 25, 9:37*am, "Dmitriy V'jukov" <dvyu...@gmail.comwrote:

On 25 Á×Ç, 00:53, Peter Dimov <pdi...@gmail.comwrote:

How would you implement Peterson's algorithm in Java using weak
operations and a fence? Java doesn't have weak operations or fences.
Its volatile loads and stores are equivalent to C++MM's seq_cst loads
and stores. Both promise sequential consistency (no more and no less).

Java volatiles promise more. volatile store is release, and volatile
load is acquire.

Exactly the same as seq_cst.

for x86 this means that plain stores and loads will
be used. Well, yes, sometimes heavy membar will be emitted.
C++0x's seq_cst atomic store on x86 will be locked instruction.

Java VMs will also (be changed to) emit XCHG for volatile stores,
because plain stores do not guarantee SC, no matter how many MFENCEs
one uses. x86 is now officially not TSO.

T0: x = 1
T1: y = 1
T2: r1 = x; r2 = y; // 1 0 allowed
T3: r3 = y; r4 = x; // 1 0 allowed

Aug 25 '08 #9

Peter Dimov

On Aug 25, 1:47*am, Peter Dimov <pdi...@gmail.comwrote:

The interesting case you must have had in mind is the sequence

P1:flag[1] = 1
P1:turn = 0
P0:flag[0] = 1
P0:turn = 1
P0:memory_barrier

Can P0 now see flag[1] == 0? (P1 will later see turn == 1 and enter
the critical section.)

You may be interested to know that this is (likely to be) disallowed
on PowerPC as well. It'd definitely be nice to somehow disallow it in
the C++MM as well, if we figure out how to do that.

Otherwise, we'll have to live with something like

P0: flag[0].store( 1, relaxed );
turn.exchange( 1, acq_rel );
while( flag[1].load( acquire ) && turn.load( relaxed ) == 1 );
// do nothing
// critical section
...
// end of critical section
flag[0].store( 0, release );

(I hope I got this right). Of course using a RMW somewhat defeats the
point of the algorithm, but we're using it as an illustration.

Aug 25 '08 #10

Dmitriy V'jukov

On Aug 25, 3:37*pm, Peter Dimov <pdi...@gmail.comwrote:

I agree with you that the translation from x86 to the C++MM doesn't
work, and that this is a potential problem. I don't see at the moment
how the definition of a seq_cst fence can be modified to match the x86
behavior though.

This is how I currently made this in Relacy Race Detector in Java/CLR
mode:
http://groups.google.com/group/comp....c810b38be80bbb
Dmitriy V'jukov

Aug 26 '08 #11

Dmitriy V'jukov

On Aug 26, 12:25*am, Peter Dimov <pdi...@gmail.comwrote:

On Aug 25, 1:47*am, Peter Dimov <pdi...@gmail.comwrote:

The interesting case you must have had in mind is the sequence

P1:flag[1] = 1
P1:turn = 0
P0:flag[0] = 1
P0:turn = 1
P0:memory_barrier

Can P0 now see flag[1] == 0? (P1 will later see turn == 1 and enter
the critical section.)

You may be interested to know that this is (likely to be) disallowed
on PowerPC as well. It'd definitely be nice to somehow disallow it in
the C++MM as well, if we figure out how to do that.

I think that this is disallowed on Sparc too.
Dmitriy V'jukov

Aug 26 '08 #12

Peter Dimov

On Aug 26, 2:20*pm, "Dmitriy V'jukov" <dvyu...@gmail.comwrote:

On Aug 26, 12:25*am, Peter Dimov <pdi...@gmail.comwrote:

You may be interested to know that this is (likely to be) disallowed
on PowerPC as well. It'd definitely be nice to somehow disallow it in
the C++MM as well, if we figure out how to do that.

I think that this is disallowed on Sparc too.

Yes, I believe it is. FWIW, the following:

P0: flag[0].store( 1, relaxed );
fence( seq_cst ); // #1
turn.store( 1, relaxed );
fence( seq_cst );
while( flag[1].load( acquire ) && turn.load( relaxed ) == 1 );
// do nothing
// critical section
...
// end of critical section
flag[0].store( 0, release );

seems to work under the C++MM. The model in its current form doesn't
let us get away with just a fence(release) at #1.

Aug 27 '08 #13

Dmitriy V'jukov

On Aug 27, 4:37*am, Peter Dimov <pdi...@gmail.comwrote:

You may be interested to know that this is (likely to be) disallowed
on PowerPC as well. It'd definitely be nice to somehow disallow it in
the C++MM as well, if we figure out how to do that.

I think that this is disallowed on Sparc too.

Yes, I believe it is. FWIW, the following:

P0: flag[0].store( 1, relaxed );
* * fence( seq_cst ); // #1
* * turn.store( 1, relaxed );
* * fence( seq_cst );
* * while( flag[1].load( acquire ) && turn.load( relaxed ) == 1 );
* * * * * * // do nothing
* * // critical section
* * ...
* * // end of critical section
* * flag[0].store( 0, release );

seems to work under the C++MM. The model in its current form doesn't
let us get away with just a fence(release) at #1.

It looks like it works.
But it's easy to construct heavyweight correct algorithm.
The question that bothers me - how much lightweight (and at the same
time formally correct) algorithm I can construct with C++0x?
Dmitriy V'jukov

Aug 27 '08 #14

by: Skip Montanaro | last post by:

I got to wondering if there is a difference between __contains__() and has_key() for dictionaries (I don't think there is), so I peeked at UserDict.UserDict and saw they were implemented as...

Python

C++ danger to break due to its weight, fragmentation danger - C++0x

by: Ioannis Vranos | last post by:

I would like to see your views on these. C++98 is already a large language since it supports 4 paradigms and each one is supported well, with optimal space and time efficiency. And this is...

C / C++

Difference between 1998 and 2003 C++ standard

by: foo | last post by:

I've been looking on the web to see if I could find a link that would describe the difference between the 14882:1998 and the 14882:2003 C++ standard. I found the following link:...

C / C++

subtle side effect of generator/generator expression

by: bonono | last post by:

Hi, I initially thought that generator/generator expression is cool(sort of like the lazy evaluation in Haskell) until I notice this side effect. >>>a=(x for x in range(2)) >>>list(a) ...

Python

difference between scanf("%i") and scanf("%d") ??? perhaps bug inVS2005?

by: Martin Jørgensen | last post by:

Hi, Today I got a really strange problem... I've made myself a data-file and I read in data from that file.... When I read something like this line: 03 04 05, 00 04 01, 05 03 07, 08 03...

C / C++

C++0x and GCC4.3

by: LewGun | last post by:

at the end of last year Herb Sutter told us that "C++ 0x has been fixed", now GCC4.3 had been released, the compiler were known as "the C ++ new features' experimental unit",but it support to the...

C / C++

Re: Synchronization Algorithm Verificator for C++0x

by: Dmitriy V'jukov | last post by:

On 2 Á×Ç, 20:47, "Dmitriy V'jukov" <dvyu...@gmail.comwrote: Q: Can I use Relacy Race Detector to check my algo againts other that C ++0x memory models (x86, PPC, Java, CLI)? A Yes, you...

C / C++

How to best explain a "subtle" difference between Python and Perl ?

by: Palindrom | last post by:

Hi everyone ! I'd like to apologize in advance for my bad english, it's not my mother tongue... My girlfriend (who is a newbie in Python, but knows Perl quite well) asked me this morning why...

Python

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Subtle difference between C++0x MM and other MMs

Similar topics