I will not be in Chicago, so here is a summary by mail of all my problems
with the current proposal.
It includes a modification (called G) to Marc's proposal (called B)
that hopefully has the fewest disadvantages.
I hope you find the time to discuss it carefully.
This comparison is a summary of the recent discussion.
But it compares only the 2-party synchronization of the proposals.
Except for Dave's proposal and the original proposal, all others have
the same 3-party sync. (LOCK/UNLOCK) and allow BARRIER.
At the moment we have the following proposals:
A) The main proposal voted in June.
B) Marc's post(any)/start(rank) & complete(rank)/wait(count)
from 16th Aug. with the additional mail from 17th Aug.
C) same as B), but post(any)/start(rank) is substituted by
post(rank)/start(count)
D) same as C), but both pairs post(rank)/start(count) and
complete(rank)/wait(count) are substituted by one pair
post(rank)/wait(count), that is used for both functionalities.
This is the proposal from Raja, but with Marc's progress rules.
E) Dave's put/get/offer proposal; it can be derived from B by:
start of IOFFER with non-zero count --> B's POST
first start of PUT or GET after last OFFER --> B's START(WEAK)
completion of IOFFER after prev. local PUT/GET --> B's COMPLETE
completion of IOFFER with non-zero count --> B's WAIT
F) Same as B) but combining post(any)/start(rank) with
post(rank)/start(count)
And I want to add
G) Same as B) but post must know to which origin nodes it should post,
i.e. post(count,rank) / start(rank)
(The exact definition of G) can be found at the end of this mail.)
A major difference is the way of synchronizing before RMA:
- B) & E) post(any) / start(rank)
Disadvantages: 1) Cannot synchronize before RMA, after changing
the set of origin nodes. (see (X) in the
example below.)
The draft 08/26/96 also does not solve (Y)
because the matching rule matches to the
first post and therefore load and put can be
executed at the same time.
And the draft 08/16/96 does not solve (Z)
because the start can be issued directly after
the complete, i.e. while the previous 'post'
is still valid.
2) Cannot be implemented efficiently on
virtual shared memory machines because
start(rank) is a remote procedure call to
the target.
The reasons I have discussed in my mail
Subject: 1-sided proposals comparison
Date: Tue, 20 Aug 1996 18:10:50 +0200 (DST)
3) Looks flexible due to post(_any_), but the
matching wait requires that the application knows
how many origin nodes there are.
In other words -- applications where the target
does not know the origins cannot use this way
of synchronization, they must use LOCK/UNLOCK.
4) Due to the deficiencies in 1) it is impossible
to let one origin node move its work
to another, i.e. to have a correct post/start
one must send the synchronization _messages_
from the target.
Therefore the application will always know
at the target which origin processes are using
the target window. ((U) does not work, because
a start directly after the recv matches in any
case with the previous post.)
Remembering that post must match with wait
and that wait has a count argument, one can
state that the application can give count
and ranks at MPI_RMA_POST, but this proposal
does not want to see this information.
5) The semantics rule 3. (page 103, lines 27-29)
is incorrect with weak start.
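The stale match described for (Z) can be illustrated with a small toy model.
This is not MPI code: the class TargetB and its method names are invented for
this sketch, and it assumes (as the text above says of the draft 08/16/96)
that a post stays valid until the matching wait returns.

```python
class TargetB:
    """Toy target under post(any)/start(rank); not MPI code."""

    def __init__(self):
        self.epoch = 0         # number of posts issued so far
        self.posted = False    # assumption: a post stays valid until wait returns
        self.completes = 0     # completes received and not yet consumed by wait

    def post(self):
        self.epoch += 1
        self.posted = True

    def start_from_origin(self):
        """start(rank) as seen at the target: returns the matched epoch,
        or None where a real start would have to block."""
        return self.epoch if self.posted else None

    def complete_from_origin(self):
        self.completes += 1

    def wait(self, count):
        """Returns True if wait(count) can return; this closes the epoch."""
        if self.completes < count:
            return False
        self.completes -= count
        self.posted = False
        return True


t = TargetB()
t.post()                            # target: first post (epoch 1)
first = t.start_from_origin()       # origin: start, put, complete
t.complete_from_origin()
second = t.start_from_origin()      # origin: start again, directly after complete
assert first == 1 and second == 1   # the second start matched the old post
assert t.wait(1)                    # only now does the target reach wait and load
```

In this model the second start does not wait for the target's next post, so
the following put can overlap with the target's load.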
- C) & D) post(rank) / start(count)
Advantages: 1) It can be used for synchronizing before RMA,
after changing the set of origin nodes.
2) Can be implemented efficiently on all
platforms
Disadvantages: 3) Not only the number but also the ranks of
the origin nodes must be known at the target.
From the application's viewpoint this is not
really worse than item 3) above.
4) start(count) must wait until all targets
have posted. It cannot be substituted by
count times start(1), because after start(1)
it is unknown which target is ready.
- F) post(count,rank) / start(count,flag,rank)
This proposal tries to combine the two above.
Disadvantages: 1) It is not defined clearly.
2) If it allows not only the methods of B) and C),
but also post(rank)/start(rank), then the
implementation has little chance to
optimize.
3) If it only allows the methods of B) and C),
then it would be better to have both
methods with different function names due
to the different argument lists.
- G) post(count,rank) / start(rank)
Advantages: 1) It can be used for synchronizing before RMA,
after changing the set of origin nodes.
2) Can be implemented efficiently on all
platforms
3) It can be used to handle all RMA for one
target.
4) The interface can be extended to allow
polling on the next target that posts
(this can be implemented on all platforms by
local code without additional communication)
Disadvantages: 5) Not only the number but also the ranks of
the origin nodes must be known at the target.
From the application's viewpoint this is not
really worse than item 3) of B) and C) above.
Here is an example that shows some problems:
Origin 1 Target Origin 2
complete
wait
send to 1
recv
post
start
put
complete
wait ) (X) Why must I use an
send to 2 --> ) additional sync, when
receive --> ) the following post/start
load ) can do it?
post --> ) (Y) does the 'start'
start --> ) wait after this 'post'?
put
complete
wait
load
post --> ) (Z) does the 'start'
start --> ) wait after this 'post'?
put
complete
recv <------------------------ send --> ) (U) one origin sends
start --> ) its task to another and
: wait ) in the next iteration
: load ) post & start should
: post --> ) work correctly
return --> )
Other problems:
- Do we have a "no_store" hint?
If we do not have a "no_store" hint then MPI_RMA_POST cannot
be implemented efficiently on shared memory systems with a
cache that needs 'flush_cache' after local 'stores' to guarantee
that the data is written to memory, because each MPI_RMA_POST
must issue a 'flush_cache' even if there was no local 'store'.
Therefore it is fair -- even though no vendor in the Forum
sells such machines at the moment -- to define such a hint.
Two possibilities:
-- as an argument of MPI_RMA_INIT
Disadvant.: Software maintenance -- this hint is far away
from the place where local stores and
MPI_RMA_POST are done.
-- as an argument of MPI_RMA_POST
Advantage: more flexible, because the window can be used
for different RMA styles -- sometimes
store/get, and sometimes put/load.
Disadvant.: less efficient on systems without the need of
'flush_cache', due to the additional argument
in each call
Where the hint goes -- MPI_RMA_INIT or MPI_RMA_POST -- is a minor point.
But having this flag at all is important for those systems.
- Do we have a "no_locks" hint?
Yes, we should have it to speed up the startup where
handlers are needed.
- Because we need two hints, "no_store" and "no_locks",
it seems good to add them via an info argument in RMA_INIT,
because the advantages of "no_store" in MPI_RMA_POST
are not so important.
- post/start with STRONG/WEAK/NO_CHECK synchronization:
-- STRONG is not necessary for RMA as long as we do not see
an example.
-- WEAK and NO_CHECK are both necessary !!!
WEAK solves all cases where the DMA must be blocked until
the last local operation is done on the window.
NO_CHECK is necessary for double buffering applications
where the synchronization is already done by
complete/wait on the other window.
The proposal G) post(count,rank) / start(rank)
==============================================
MPI_RMA_POST(comm,count,rank,info)
IN comm communicator associated with window (handle)
IN count number of processes that can start with RMA (integer)
IN rank ranks of processes that can start with RMA (array of integer)
IN info information flag (handle)
Tells the processes in rank that they can start RMA to the window
associated with comm.
The next call to MPI_RMA_START in the processes indicated by
(comm,rank) that does not already have a matching call to MPI_RMA_POST
matches this call if its (comm,rank) indicates the process that
issues MPI_RMA_POST.
Each call to MPI_RMA_POST has to be matched by a unique following
call to MPI_RMA_WAIT with the same comm.
The info argument must have the same value in MPI_RMA_POST and the
matching calls to MPI_RMA_START.
The following predefined values are defined for the info argument:
MPI_WEAK The RMA issued after the MPI_RMA_START matching to the
MPI_RMA_POST is executed on the window after the call to
MPI_RMA_POST.
MPI_NOCHECK The application guarantees that the matching MPI_RMA_START
is called after the call to MPI_RMA_POST.
MPI_RMA_START(comm,rank,info)
IN comm communicator associated with window (handle)
IN rank rank of the process (integer)
IN info information flag (handle)
Starts a sequence of RMA operations targeted to the window associated
with (comm,rank). The next call to MPI_RMA_POST in the process
indicated by (comm,rank) that does not already have a matching call to
MPI_RMA_START in this origin process matches this call if its
(count,rank) includes this origin process.
The info argument must have the same value in MPI_RMA_START and the
matching call to MPI_RMA_POST.
The following predefined values are defined for the info argument:
MPI_WEAK The RMA issued after the MPI_RMA_START matching to the
MPI_RMA_POST is executed at the target after the call to
MPI_RMA_POST.
MPI_NOCHECK The application guarantees that the matching MPI_RMA_START
is called after the call to MPI_RMA_POST.
Each call to MPI_RMA_START must be matched by a unique following call
to MPI_RMA_COMPLETE with the same (comm,rank).
MPI_RMA_COMPLETE(comm,rank)
IN comm communicator associated with window (handle)
IN rank rank of the process (integer)
This call to MPI_RMA_COMPLETE blocks until all previous RMA to
the window associated with (comm,rank) has completed at the origin.
Each call to MPI_RMA_COMPLETE must be matched by a unique previous call
to MPI_RMA_START with the same (comm,rank).
MPI_RMA_WAIT(comm,count)
IN comm communicator associated with window (handle)
IN count number of processes to wait for (integer)
Blocks until count distinct processes have called MPI_RMA_COMPLETE(comm,rank),
where rank is the rank of the caller, and the RMA issued before
the calls to MPI_RMA_COMPLETE has completed at the target, i.e.
at the window of the caller.
Each call to MPI_RMA_WAIT must be matched by a unique previous call
to MPI_RMA_POST with the same comm. Both calls must have the same
count value.
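The matching rule for POST/START given above can be sketched as a toy model.
This is not MPI code: the classes Window and Origin, the method wait_ready and
the epoch numbers are invented for this illustration, and a START that would
block is represented by returning None.

```python
class Window:
    """Toy model of one target window under proposal G; not MPI code."""

    def __init__(self):
        self.posts = []       # one entry per POST: set of origin ranks named in it
        self.completed = []   # per POST: origins that have called COMPLETE

    def post(self, ranks):
        """POST(count,rank): expose the window to exactly the origins in ranks."""
        self.posts.append(set(ranks))
        self.completed.append(set())
        return len(self.posts) - 1    # epoch number, for illustration only

    def wait_ready(self, epoch):
        """True if WAIT(count) for this epoch could return."""
        return len(self.completed[epoch]) == len(self.posts[epoch])


class Origin:
    def __init__(self, rank):
        self.rank = rank
        self.next_post = {}   # per window: index of the first POST not yet matched

    def start(self, win):
        """START(rank): match the next POST of win that names this origin.
        Returns the matched epoch, or None where a real START would block."""
        i = self.next_post.get(win, 0)
        while i < len(win.posts):
            if self.rank in win.posts[i]:
                self.next_post[win] = i + 1
                return i
            i += 1
        return None

    def complete(self, win, epoch):
        win.completed[epoch].add(self.rank)


win = Window()
o1, o2 = Origin(1), Origin(2)

e0 = win.post([1])              # first iteration: only origin 1 is named
assert o1.start(win) == e0
assert o2.start(win) is None    # origin 2 is not named, so it has to wait
o1.complete(win, e0)
assert o1.start(win) is None    # (Z): no match with the already used POST
assert win.wait_ready(e0)

e1 = win.post([2])              # (X), (U): the work has moved to origin 2
assert o2.start(win) == e1
assert o1.start(win) is None
```

Because each START consumes one POST per origin, a START issued directly after
COMPLETE has nothing left to match and waits for the next POST that names it.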
MPI_RMA_INIT(base,size,disp_unit,info,comm,newcomm)
....
IN info a set of key-value pairs giving optimization hints
(info handle)
reserved key values:
synchronization The value "no_locks" means that the application will
not use MPI_RMA_LOCK or MPI_RMA_UNLOCK for this window.
operation The value "no_stores" means that the application will
not make local stores to this window.
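As a rough illustration of how an implementation might use these two hints,
here is a minimal sketch. It is not part of any proposal: the info object is
modelled as a plain dictionary, and the function names and the
cache_needs_flush parameter are hypothetical.

```python
def post_must_flush_cache(info, cache_needs_flush):
    """Decide whether MPI_RMA_POST has to issue a 'flush_cache'.

    cache_needs_flush describes the platform (hypothetical parameter):
    True on systems where local stores reach memory only after a flush.
    """
    return cache_needs_flush and info.get("operation") != "no_stores"


def window_needs_lock_handler(info):
    """Decide whether the window must set up a handler for LOCK/UNLOCK."""
    return info.get("synchronization") != "no_locks"


hints = {"synchronization": "no_locks", "operation": "no_stores"}

assert not post_must_flush_cache(hints, cache_needs_flush=True)
assert post_must_flush_cache({}, cache_needs_flush=True)    # no hint: must flush
assert not post_must_flush_cache({}, cache_needs_flush=False)
assert not window_needs_lock_handler(hints)
assert window_needs_lock_handler({})
```

Without the hints the implementation has to assume the worst case, which is
exactly the cost on cache-flush systems described under "Other problems".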
All other text is the same as in Marc's proposal, except:
Semantics: (page 103, lines 27-29)
3. The access by A is a load or store, the access by B is an RMA
operation, A executed a call to MPI_RMA_POST after its access,
and B completed a matching MPI_RMA_START before its own access.
--------------------------------------------------
Because I believe that this proposal G is the best, I will discuss
the attempt to combine the functions into a single interface
only on the basis of G:
Method as in D)
Because POST(comm,count,rank,info) / START(comm,rank,info)
and COMPLETE(comm,rank) / WAIT(comm,count)
have now different interfaces, it seems to be bad to
substitute both pairs by one pair.
Method as in E)
There are also more problems with the argument list than in B,
but the main objection is the lack of a start call, because
the start call is where one must distinguish between WEAK and NOCHECK.
Therefore it seems that G) is the smallest possible interface.
Best regards
Rolf
Rolf Rabenseifner (Computer Center )
Rechenzentrum Universitaet Stuttgart (University of Stuttgart)
Allmandring 30 Phone: ++49 711 6855530
D-70550 Stuttgart 80 FAX: ++49 711 6787626
Germany rabenseifner@rus.uni-stuttgart.de