I stated
>> I also want to remember that this way of synchronization is not good
>> on virtual shared memory machines like CRAY T3E, because it cannot
>> be implemented efficiently there.
and you answered:
> It would be useful to explain this in more detail.
> I don't see, off hand, the relevance.
The following mail shows, that your post/start has really less
efficiency than proposal G.
My mail "1sided and better post/start" from Friday showed that
your proposal does not has more functionality than proposal G.
You stated in a mail to me:
> I don't have strong feelings about my proposal vs yours -- it's a
> tradeoff of covering more ground at the expense of a more general
> (complex?) set of calls.
Because your proposal has less efficiency and same functionality
I think it is not a tradeoff, it is only worse (only in the topic
"post/start")
Efficiency:
The question is, how one can implement your post/start(weak).
- distributed memory:
post is a local operation,
start(weak) is piggybacked on the next RMA message.
- shared memory:
There are two global variables for each window:
is_posted = true/false
post_count = number of posts issued
The origin process has for each target window the local
variable
last_post_count = post_count when start completed (last time)
post: post_count++; is_posted = true; /* it must be guaranteed
that 'true' is stored to memory after post_count was
updated, otherwise some locking is necessary. */
start: repeat /*do nothing*/
until is_posted and (post_count > last_post_count);
/* it must be guaranteed that is_posted is read before
post_count and that they are treated as volatile */
last_post_count = post_count;
wait: waiting for complete...;
is_posted = false;
Memory need: is_posted : n * sizeof(boolean)
post_count: n * sizeof(count)
last_post_count: n * n * sizeof(count)
sizeof(boolean) = 1 bit is enough.
sizeof(count) = 8 byte, because 4 is not enough,
because after 2*31 til 2*32-1 iterations
a newly used origin (with last_post_count=0)
will evaluate (post_count > last_post_count)
as false.
Atomicity is not need, because is_posted and
post_count are modified only by the window owner
and post_count is only modified while
is_posted==false and it is read and evaluated only
while is_posted==true.
==> n * (8 * n + 8.125) bytes
- virtual shared memory:
If you implement in the same manner as for shared memory,
then the start code loops over 2 remote reads: is_posted
and post_count. And this reads are synchronous RPCs.
The alternative -- the proposal G -- does not has this
performance bug, because post knows the nodes to which it should
post and therefore it can send the counter update to the origin:
- distributed memory:
piggyback
- shared memory and virtual shared memory:
MPI_RMA_POST(comm,count,rank,info)
MPI_RMA_START(comm,rank,info)
Implementation on shared / virtual shared memory:
The origin process has for each target window two variables:
post_count = number of posts issued by target to the origin
last_post_count = post_count when start completed (last time)
And post_count can be written by the target.
post: post_count[origin node]++; /* must be done atomically */
start: repeat /*do nothing*/
until (post_count > last_post_count);
/* post_count must be treated as volatile and read
atomically */
last_post_count = post_count;
Memory need: 2 * n * n * sizeof(count)
The counters can be small. They must typically held only the
number of posts that the origin waits for until it really
starts. 2 bytes would be enough. But it must be possible to
increment it by one from one remote node. Efficient
hardware support typically is given only for 4 or 8 bytes.
Comparance for virtual shared memory:
Marc's definition:
post: local store of flag
local increment of counter
start: loop over [remote get of flag and counter] until
criterion is fulfilled
Memory: ~ 8 * n * n
Proposal G:
post: remote increment of counter
start: loop over [local load of counter] until
criterion is fulfilled
Memory: 4 * n * n or 8 * n * n
Therefore proposal G is more efficient.
Best regards
Rolf
Rolf Rabenseifner (Computer Center )
Rechenzentrum Universitaet Stuttgart (University of Stuttgart)
Allmandring 30 Phone: ++49 711 6855530
D-70550 Stuttgart 80 FAX: ++49 711 6787626
Germany rabenseifner@rus.uni-stuttgart.de