Re: possible race condition in post-wait-start-complete

Rolf Rabenseifner (Rabenseifner@RUS.Uni-Stuttgart.DE)
Mon, 2 Sep 1996 16:41:13 +0200 (DST)

Marc,

I stated
>> I also want to remember that this way of synchronization is not good
>> on virtual shared memory machines like CRAY T3E, because it cannot
>> be implemented efficiently there.

and you answered:
> It would be useful to explain this in more detail.
> I don't see, off hand, the relevance.

The following mail shows, that your post/start has really less
efficiency than proposal G.
My mail "1sided and better post/start" from Friday showed that
your proposal does not has more functionality than proposal G.

You stated in a mail to me:
> I don't have strong feelings about my proposal vs yours -- it's a
> tradeoff of covering more ground at the expense of a more general
> (complex?) set of calls.

Because your proposal has less efficiency and same functionality
I think it is not a tradeoff, it is only worse (only in the topic
"post/start")

Efficiency:

The question is, how one can implement your post/start(weak).

- distributed memory:
post is a local operation,
start(weak) is piggybacked on the next RMA message.

- shared memory:

There are two global variables for each window:
is_posted = true/false
post_count = number of posts issued
The origin process has for each target window the local
variable
last_post_count = post_count when start completed (last time)

post: post_count++; is_posted = true; /* it must be guaranteed
that 'true' is stored to memory after post_count was
updated, otherwise some locking is necessary. */

start: repeat /*do nothing*/
until is_posted and (post_count > last_post_count);
/* it must be guaranteed that is_posted is read before
post_count and that they are treated as volatile */
last_post_count = post_count;

wait: waiting for complete...;
is_posted = false;

Memory need: is_posted : n * sizeof(boolean)
post_count: n * sizeof(count)
last_post_count: n * n * sizeof(count)

sizeof(boolean) = 1 bit is enough.
sizeof(count) = 8 byte, because 4 is not enough,
because after 2*31 til 2*32-1 iterations
a newly used origin (with last_post_count=0)
will evaluate (post_count > last_post_count)
as false.
Atomicity is not need, because is_posted and
post_count are modified only by the window owner
and post_count is only modified while
is_posted==false and it is read and evaluated only
while is_posted==true.

==> n * (8 * n + 8.125) bytes

- virtual shared memory:

If you implement in the same manner as for shared memory,
then the start code loops over 2 remote reads: is_posted
and post_count. And this reads are synchronous RPCs.



The alternative -- the proposal G -- does not has this
performance bug, because post knows the nodes to which it should
post and therefore it can send the counter update to the origin:

- distributed memory:
piggyback

- shared memory and virtual shared memory:
MPI_RMA_POST(comm,count,rank,info)
MPI_RMA_START(comm,rank,info)

Implementation on shared / virtual shared memory:

The origin process has for each target window two variables:
post_count = number of posts issued by target to the origin
last_post_count = post_count when start completed (last time)
And post_count can be written by the target.

post: post_count[origin node]++; /* must be done atomically */

start: repeat /*do nothing*/
until (post_count > last_post_count);
/* post_count must be treated as volatile and read
atomically */
last_post_count = post_count;

Memory need: 2 * n * n * sizeof(count)

The counters can be small. They must typically held only the
number of posts that the origin waits for until it really
starts. 2 bytes would be enough. But it must be possible to
increment it by one from one remote node. Efficient
hardware support typically is given only for 4 or 8 bytes.

Comparance for virtual shared memory:

Marc's definition:

post: local store of flag
local increment of counter
start: loop over [remote get of flag and counter] until
criterion is fulfilled
Memory: ~ 8 * n * n

Proposal G:

post: remote increment of counter
start: loop over [local load of counter] until
criterion is fulfilled
Memory: 4 * n * n or 8 * n * n

Therefore proposal G is more efficient.

Best regards
Rolf


Rolf Rabenseifner (Computer Center )
Rechenzentrum Universitaet Stuttgart (University of Stuttgart)
Allmandring 30 Phone: ++49 711 6855530
D-70550 Stuttgart 80 FAX: ++49 711 6787626
Germany rabenseifner@rus.uni-stuttgart.de