---------- Forwarding Original Note --------
cc: mpi-1sided @ mcs.anl.gov
From: Rabenseifner @ RUS.Uni-Stuttgart.DE
Date: 08/06/96 05:34:21 PM Z-2
Subject: Re: new version of chapter 4
In your latest version of the "1-sided" chapter from 4th Aug.,
I believe that there are probably some severe problems:
1) On page 15, lines 39-40 you allow "An implementation may have
the call to MPI_RMA_START block until the window is clear; or
it may allow the call to MPI_RMA_START to proceed, and..."
And on page 18, lines 24-25 you use the following rule for your
"B completed a call to MPI_RMA_START after A cleared the window".
This problem probably has the same cause as the following one
-- a missing matching rule, e.g. that MPI_CLEAR needs to
take a rank and should match with the corresponding START.
2) There is no matching rule for CLEAR/START, only the timing rule
above. Therefore the following code will break:
Proc. 0 Proc. 1 Proc. 2
RMA_Init RMA_Init RMA_Init
- - - - -> start
<- - - - - complete
start <- - - - -
complete - - ->
because there is no way to express that the start in 0 should
be after the second clear.
There is no way to express what you are trying to express using only
the code you wrote -- some additional synchronization is needed. This
is by choice: I want a clear call to be implementable by merely
clearing a flag, so that no hand-shake between start and clear is
necessary. The clear call is really a call to window_out, with the
added functionality that a remote process can postpone its RMA
transfers until such a call has occurred. So, a possible
implementation model is that
a window is associated with a in_use flag, and a counter. Each RMA
access increments the counter. The wait call sets the in_use flag.
The clear call resets the in_use flag and sets the counter to zero.
The start call may test the in_use flag.
Note that, if one picks the additional check/nocheck option that I
suggested in my comments, then nothing prevents users from doing their
own synchronization using message passing or whatever else: a call to
wait, with count=0, is equivalent to window_in; a call to clear is
equivalent to window_out.
3) On page 18, line 29 I do not understand
"that waited on the complete call of B".
You should add the "matching rule":
"and the call to MPI_RMA_COMPLETE is one of the "count" calls
that matches the call to MPI_RMA_WAIT."
(Perhaps you mean the same, but the "that" refers to "its own access",
and the RMA_COMPLETE was done in A.)
You are right.
4) I did not find the progress rules.
Page 15, lines 36-42 say only that MPI_START/PUT/GET/COMPLETE
blocks until the target window is "clear".
But it says nothing about whether there are other cases where
they may also block, nor does it say that they must not block in
such cases.
Will add such. Generically, the rules should say that synchronization
calls may block, but that communication should progress to completion
once the needed synchronization occurred.
Your implementation model for distributed memory on page 19
implies that MPI_RMA_START, MPI_PUT, MPI_GET and MPI_RMA_COMPLETE
can be based on MPI_SEND, which can block; therefore these
routines may also block at the origin until MPI_RMA_WAIT
is called at the target.
Then there is no chance to build symmetric programs:
    Proc. 0               Proc. 1
    put(rank=1)       __  __  put(rank=0)
    complete(rank=1)    \/    complete(rank=0)
    wait(count=1)       /\    wait(count=1)
    load            <--/  \-->  load
There is also no chance with a nonblocking version of MPI_RMA_START
because page 18, line 25 says that MPI_RMA_START must be completed
before the MPI_PUT.
I need to modify the description of the implementation on top of
message passing. It should use nonblocking sends, with the MPI wait
that completes each send executed when the RMA_WAIT call occurs. The
symmetric example you wrote should then work as expected.
But it is this application model that should be improved,
compared to the use of MPI_BARRIER.
The following complete double-buffering example shows that
combining complete & wait makes no sense in real applications
that want to overlap communication and computation:
Each process splits its window and its local buffers
into two halves and takes turns computing on the data in one half
and pushing results into the other half.
The local buffers are named la and lb, the target windows wa and wb.
The example is given for two processes.
Process i's code, for i=0,1, is:
la=compute(wb); clear; start(1-i,nocheck);
put(la,1-i,wa); complete(1-i); local_computation; wait(1);
lb=compute(wa); clear; start(1-i,nocheck);
put(lb,1-i,wb); complete(1-i); local_computation; wait(1);
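A sketch of the steady state of this loop, with both processes
interleaved (illustrative only; each wait(1) is completed by the
partner's matching complete):

```
Process 0                          Process 1
la = compute(wb)                   la = compute(wb)
clear; start(1,nocheck)            clear; start(0,nocheck)
put(la,1,wa)  -- crosses to P1 --  put(la,0,wa)  -- crosses to P0
complete(1)                        complete(0)
local_computation                  local_computation
  (overlaps the puts in flight)      (overlaps the puts in flight)
wait(1)  completed by P1's         wait(1)  completed by P0's
         complete(0)                        complete(1)
lb = compute(wa)                   lb = compute(wa)
...                                ...
```

The local_computation between complete and wait is exactly the overlap
that a combined complete&wait call would forbid.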
Rolf Rabenseifner (Computer Center )
Rechenzentrum Universitaet Stuttgart (University of Stuttgart)
Allmandring 30 Phone: ++49 711 6855530
D-70550 Stuttgart 80 FAX: ++49 711 6787626