-Dave
===========================================================================
Specifically, my vote was yes on the amended section 1.3 (GET, PUT, and
ACCUMULATE), to demonstrate that I was in favor of having 1-sided in MPI and
these functions seemed quite appropriate, and no on everything else.
References herein are to the version of the proposal handed out on the
last day of the September meeting. Marc has evidently tried to publish
another version today (Friday, Sept 13), but I have not been able to find
a readable copy.
In order to be as constructive as possible, I have made specific suggestions
to address most of my objections. In addition, I have already presented an
alternate proposal, as appendix A in the document disseminated at the last
meeting, and at
http://www.nas.nasa.gov/NAS/Tools/MPI2/
which addresses all of the points I have made here. (That proposal was voted
down in subcommittee within an hour or so of the beginning of the September
meeting, and was not discussed further.)
Specific objections to the proposal voted upon:
(a) Severely restricted:
The point-to-point proposal allows one-sided communications only between a
fixed set of processes that know of each other's existence. The target
must name the origins which may access it, and the origins must name the
target(s) which they will access. In other words, even the point-to-point
operations are essentially collective, but instead of being collective over
all of the processes in a communicator (like the rest of MPI), they are
collective over the groups defined by the transitive closure of the ranks
mentioned in the MPI_RMA_POST (and MPI_RMA_START or PUT, GET and ACCUMULATE)
operations.
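As an illustration of how these implicit groups arise, here is a minimal
sketch in Python (not part of the proposal). It assumes the rank lists
given to POST and START are available as two dictionaries, which are
hypothetical inputs invented for this example, and it treats "A names B"
as tying A and B together regardless of direction.

```python
from collections import defaultdict

def sync_groups(post_lists, start_lists):
    """Return the sets of ranks tied together by POST/START rank lists.

    post_lists[t]  = origins that target t names in its POST   (hypothetical)
    start_lists[o] = targets that origin o names in its START  (hypothetical)
    """
    parent = {}

    def find(r):
        # Union-find lookup with path halving.
        parent.setdefault(r, r)
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for lists in (post_lists, start_lists):
        for rank, named in lists.items():
            find(rank)                 # register ranks that name nobody
            for other in named:
                union(rank, other)

    groups = defaultdict(set)
    for r in list(parent):
        groups[find(r)].add(r)
    return sorted(groups.values(), key=min)


# Ranks 0, 1, 2 end up in one group and ranks 4, 5 in another;
# rank 3 names nobody and is named by nobody.
groups = sync_groups({0: [1, 2], 5: [4]}, {1: [0], 2: [0], 4: [5], 3: []})
```

Every rank in a group must make matching calls, which is the sense in
which the operations are collective over the group.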
In general, it is much more difficult to use these "semi-collective"
point-to-point operations than it is to just define a smaller communicator
and use MPI_RMA_BARRIER. The semi-collective calls require that each
process call START and COMPLETE or POST and WAIT, depending upon whether the
process will be an origin or a target, or all four if it will be used as
both origin and target, and these must be called in the proper order with
the proper arguments, or the result is undefined. Also, the MPI_RMA_BARRIER
approach can execute with order log(n) synchronization messages, whereas the
POST/WAIT/START/COMPLETE approach can take order n*n in some cases.
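A rough cost model makes the gap concrete. The sketch below is only one
reading of the claim, under assumptions that are not in the proposal: the
barrier is taken to be a dissemination-style barrier with ceil(log2(n))
rounds, and the worst case for the pairwise calls is taken to be every
process acting as origin and target for every other process, with one
"posted" notice and one "completed" notice per ordered pair.

```python
def barrier_rounds(n):
    """Rounds on the critical path of a dissemination-style barrier."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floats.
    return (n - 1).bit_length()

def pairwise_sync_messages(n):
    """Notices exchanged if all n processes target each other (assumed case).

    One 'posted' notice per (target, origin) pair plus one 'completed'
    notice per (origin, target) pair.
    """
    return 2 * n * (n - 1)


for n in (4, 64, 1024):
    print(n, barrier_rounds(n), pairwise_sync_messages(n))
```

For n = 64 this model gives 6 rounds against 8064 notices.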
There is one respect, however, in which the RMA_BARRIER approach is worse:
it does not allow the target to hold off PUTs and GETs until the target is
ready. Other synchronization must be used. This severely restricts options
for efficient implementation (e.g. buffering PUTs and GETs) that are present
in some cases with POST/WAIT/START/COMPLETE.
So, when users *do* want to perform collective communication, they are
left to choose the lesser of two evils.
The only way to perform a truly non-collective one-sided communication
in the proposal is to use the LOCK/UNLOCK model, which is expected to have
very high overhead on some systems. The high overhead is not related to
the fact that it is non-collective, it is related to the fact that the
same constructs also support third-party communication, which often requires
an intermediary daemon or signal handler to convey messages properly.
(b) Unnecessarily complex:
Only three operations in the proposal perform communication: PUT, GET,
and ACCUMULATE. The remaining 9 operations are infrastructure to support
them: i.e. allocate and deallocate memory (MEM_ALLOC and MEM_FREE)
required for some operations, synchronization (RMA_POST, RMA_WAIT,
RMA_START, RMA_COMPLETE, RMA_LOCK, RMA_UNLOCK, and RMA_BARRIER), and
creation and freeing (RMA_INIT and RMA_FREE) of a brand new object type
(MPI_Wins). Even more complexity is introduced by adding a "flag" argument
with three possible values (MPI_STRONG, MPI_WEAK, and MPI_NOCHECK) to
many of these operations. As a result of this design, the ratio of legal
combinations of these operations and flags to the total number of
combinations is relatively small, providing ample opportunity for users
to create erroneous programs with undefined consequences.
*I have seen no rationale for an MPI_STRONG flag on RMA_POST or
RMA_START operations. This flag does nothing but delay the operation
containing the flag until some other operation is performed in another
process. In other words, it does nothing but slow down the program, and
possibly keep other statements within the program from executing for some
period of time, but there is never any rationale given for these delays.
If the MPI Forum wants a general synchronization operation that is faster
than sending a 0-byte message, then it should introduce one, and that
operation should be used here by users that need it.
Not only is STRONG useless, it makes programs non-portable.
On message-passing systems, it is usually desirable to allow
PUTs, for example, to forward their data to the target as early
as possible, even if the target has not yet executed a POST, as
long as there is buffer space on the target. On shared-memory
systems, it is probably desirable to have the PUT or GET wait
until the POST has occurred. STRONG takes this decision out of
the hands of the implementation. (Of course, a similar argument could
be made against SSEND.)
*If the MPI_STRONG flag is not used on RMA_START, the RMA_START operation
itself is not needed. PUTs and GETs can be restricted from executing on
the target until the target executes an RMA_POST, as is done when the
MPI_WEAK flag is specified. The only remaining use I can imagine for
RMA_START is that it could specify the MPI_NOCHECK flag, but this
more logically belongs on the 1-sided operations themselves (e.g. RPUT,
RGET, RACCUMULATE) just as for RSEND. The list of ranks specified on
RMA_POST serves no purpose, especially with no MPI_STRONG flag. (To its
credit, the proposal voted upon includes the possibility of omitting this
under "Rationale" at the top of page 18, though its proposed alternatives
are too extreme.)
*I can also see no justification for having a separate RMA_WAIT
operation, since it plays exactly the role of the end of the RMA_POST
operation. That is, by changing RMA_POST to RMA_IPOST (IRMA_POST?),
RMA_WAIT can be omitted and replaced by an MPI_WAIT on the request
returned from RMA_IPOST, making it consistent with the rest of MPI. This
automatically suggests that a blocking RMA_POST should be introduced as
well, which would combine the RMA_IPOST and MPI_WAIT. I believe that this
would be a useful addition, and not at all confusing.
*Assuming that the above modification is made, I can see very little
justification for introducing an entirely new object -- i.e. MPI_Wins --
since it is a small jump from the above to create a persistent request for
RMA_POST (with, say, MPI_RMA_POST_INIT). If the arguments from RMA_INIT are
added to RMA_POST and RMA_POST_INIT, the persistent request would provide
the same opportunities as RMA_INIT for distributing the "window" information
to the different processes, which was the initial rationale for RMA_INIT.
The only drawback is that multiple persistent requests created for the same
communicator could cause confusion. This can be easily addressed by
adding tags to the calls. These suggestions are exactly consistent with
the rationale for adding persistent requests and tags to MPI in the
first place.
(c) Poorly defined:
*The use of MPI_WEAK with the RMA_BARRIER operation is not well defined.
Specifically, the proposal states that
"All operations on (the window) that were started before the barrier call
will be completed at their origin before the barrier call returns at the
origin. They will be completed at their target before the barrier call
returns at the target."
and
"...it need not act as a true barrier with respect to other operations".
Without loss of generality, suppose that processes A and B are executing,
and A comes to an RMA_BARRIER(MPI_WEAK) before B does. Suppose that A
completes the barrier before B enters it. Then B can issue a PUT to A
before it enters the barrier, meaning that A has completed the barrier
before the PUT is complete on the target (A). Contradiction. Therefore,
no process can exit the barrier before all processes have entered it, which
means that it must act as a true barrier.
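The argument can be checked mechanically. The following sketch encodes
the quoted rule as a test on a trace of events listed in global time
order. The event names and the trace format are invented for this
example and do not come from the proposal.

```python
def first_index(trace, proc, event, detail=None):
    """Index of the first matching event, or None if there is none."""
    for i, (p, e, d) in enumerate(trace):
        if p == proc and e == event and (detail is None or d == detail):
            return i
    return None

def violates_quoted_rule(trace):
    """True if a PUT started before its origin's barrier call is not
    complete at the target when the target's barrier call returns."""
    for i, (origin, event, target) in enumerate(trace):
        if event != "put_start":
            continue
        origin_enter = first_index(trace, origin, "barrier_enter")
        if origin_enter is None or i > origin_enter:
            continue  # not "started before the barrier call" at the origin
        target_exit = first_index(trace, target, "barrier_exit")
        done = first_index(trace, target, "put_done", origin)
        if target_exit is not None and (done is None or done > target_exit):
            return True
    return False


# A leaves the weak barrier before B enters it; B's PUT to A comes after.
early_exit = [
    ("A", "barrier_enter", None),
    ("A", "barrier_exit", None),
    ("B", "put_start", "A"),
    ("B", "barrier_enter", None),
    ("A", "put_done", "B"),
    ("B", "barrier_exit", None),
]

# Nobody leaves until everybody has entered.
true_barrier = [
    ("A", "barrier_enter", None),
    ("B", "put_start", "A"),
    ("B", "barrier_enter", None),
    ("A", "put_done", "B"),
    ("A", "barrier_exit", None),
    ("B", "barrier_exit", None),
]
```

The first trace breaks the rule and the second does not, which matches
the conclusion that no process can safely leave early.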
Perhaps I misunderstood, and each RMA_BARRIER(MPI_WEAK) call is to be
treated as a completely independent entity. If so, the name and association
with RMA_BARRIER(MPI_STRONG) is very confusing, and the statement that
"operations will be completed at their target before the (?) barrier call
returns at the target" is poorly defined. (Which barrier call?)
*Progress (liveness) rules are another problem point. The question
is whether or not a PUT or GET must eventually complete, whether or not
the target executes an RMA_WAIT (or, identically, completes an RMA_IPOST).
While this may seem like a small technicality, it is in fact a very
important point.
If it is mandated that PUT and GET must complete in this case, then the
implementation becomes nearly as difficult and inefficient in some cases
as for the LOCK/UNLOCK case. In fact, the only thing keeping users from
using RMA_POST to allow third-party communication to the calling process's
address space is a rule that PUTs and GETs cannot overlap in the target
window. Since this rule is basically unenforceable, it will be overly
tempting to use an RMA_POST with no completion to provide slightly more
efficient third-party support than LOCK/UNLOCK. In fact, since many of
the proponents of third-party communication that I can remember also
proposed the absence of locks, it is possible to remove LOCK and UNLOCK
completely, by removing the unenforceable non-overlap rule and making each
PUT and GET logically atomic on the target. This also solves the very ugly
characteristic of locks that they lock the entire target window, thus
serializing *all* accesses to the target window, even if no accesses
overlap.
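To put a number on the serialization cost, the sketch below counts the
pairs of accesses that cannot proceed concurrently under each scheme. It
is a simplified model: accesses are (offset, length) byte ranges within
one target window, and with per-operation atomicity only overlapping
ranges are assumed to conflict.

```python
from itertools import combinations

def serialized_pairs(accesses, whole_window_lock):
    """Count pairs of accesses that must be ordered one after the other.

    accesses: list of (offset, length) byte ranges in one target window.
    """
    def overlap(a, b):
        return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

    return sum(1 for a, b in combinations(accesses, 2)
               if whole_window_lock or overlap(a, b))


# Four accesses; only the last two touch common bytes (20..23).
accesses = [(0, 8), (8, 8), (16, 8), (20, 8)]
with_lock = serialized_pairs(accesses, whole_window_lock=True)
atomic_ops = serialized_pairs(accesses, whole_window_lock=False)
```

Here the window lock orders all 6 pairs, while only 1 pair actually
overlaps.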
If it is not mandated that PUT and GET must complete until or unless
RMA_WAIT executes, then this should be clearly stated. MPI-1 made the
mistake of phrasing the progress guarantees so loosely that users could
interpret them one way and implementors could interpret them another way.
This mistake should not be repeated in MPI-2. However, if progress is
not mandated in this case, then the symmetric communication shown in
Figure 1.3 of the proposal is not guaranteed to work unless (a) PUTs are
required to complete on the origin in a bounded time, even if not on their
targets (requiring needless buffering in some cases), or (b) other
operations (like RMA_COMPLETE) are also required to satisfy incoming PUTs.
If (b) is accepted, then there is very little (if any) reason not to delete
RMA_COMPLETE from the proposal completely, and simply require RMA_WAIT
(i.e. the completion of RMA_IPOST) to automatically perform an RMA_COMPLETE.
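The deadlock risk can be seen in a toy model. Figure 1.3 is not
reproduced here, so the sketch assumes a symmetric exchange in which two
processes each run PUT, then COMPLETE, then WAIT. It further assumes that
COMPLETE returns only once the local PUT is buffered or delivered, that
WAIT returns only once the peer's PUT has been delivered, and that a PUT
is delivered only while its target is inside a call that services
incoming PUTs. All names are invented for the example.

```python
PROGRAM = ["PUT", "COMPLETE", "WAIT"]  # assumed shape of the exchange

def exchange_finishes(buffered=False, servicing=("WAIT",)):
    """Return True if both processes run PROGRAM to the end.

    buffered:  option (a), PUTs complete at the origin without delivery.
    servicing: calls during which a target accepts incoming PUTs;
               option (b) adds "COMPLETE" to this set.
    """
    pc = [0, 0]
    issued = [False, False]
    delivered = [False, False]  # delivered[p]: p's PUT has landed at its peer
    progress = True
    while progress:
        progress = False
        for p in (0, 1):
            q = 1 - p
            if pc[p] == len(PROGRAM):
                continue
            op = PROGRAM[pc[p]]
            if op == "PUT":
                issued[p] = True
                ready = True
            elif op == "COMPLETE":
                ready = buffered or delivered[p]
            else:  # WAIT
                ready = delivered[q]
            if ready:
                pc[p] += 1
                progress = True
        for p in (0, 1):
            q = 1 - p
            peer_call = PROGRAM[pc[q]] if pc[q] < len(PROGRAM) else None
            if issued[p] and not delivered[p] and peer_call in servicing:
                delivered[p] = True
                progress = True
    return all(c == len(PROGRAM) for c in pc)
```

With the defaults both processes stall in COMPLETE. Passing buffered=True
(option a) or servicing=("WAIT", "COMPLETE") (option b) lets the exchange
finish.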
One possible solution to the progress guarantee problem would be to provide
both forms of RMA_IPOST -- one which guarantees completion of PUTs and GETs,
whether or not the IPOST completes, for those who want third-party-like
communication, and another which does not, for those who want
efficiency. However, it would be necessary to also distinguish which sort
of IPOST each PUT or GET was targeting, so that it could also be optimized.
(d) Lacks justification and forethought:
RMA_MEM_ALLOC has been proposed (and widely accepted) to be used on an
optional basis to possibly speed up programs in those cases where
dynamically-allocated memory is natural and communication can be implemented
more efficiently through shared memory. In one case, however -- i.e.
LOCK/UNLOCK -- an amendment was passed that RMA_MEM_ALLOC'd memory be
required, even if it is unnatural for the caller to use dynamically-
allocated memory. Why? Certainly it is more efficient for some vendors
to make this restriction, but if that rationale is sufficient, then
the restriction should be applied everywhere -- even to message-passing.
The fact is that some vendors can perform third-party communication
efficiently without RMA_MEM_ALLOC'd memory, and forcing users to use this
special memory will make some programs run slower on those vendors' machines,
because the user will be required to allocate memory and explicitly copy
between that memory and their static (or automatic) data structures.
I felt that these problems were substantial enough that I could not vote for
the proposal in good conscience.
Implementing all of the suggestions here reduces the number of support
calls (excluding PUT, GET, and ACCUMULATE) from 9 to about 4, while also
making them more consistent with the rest of MPI and more understandable by
users. The P/G/O alternate proposal, mentioned at the beginning of this
message, provides a completely consistent and simple solution to all of
these problems, by enacting all of these suggestions and then cleaning up
the loose ends.
-Dave
========================================================================
David C. DiNucci | MRJ, Inc., Rsrch Scntst | NASA Ames Rsrch Ctr
dinucci@nas.nasa.gov| NAS (Num. Aerospace Sim.)| M/S T27A-2
(415)604-4430 | Parallel Tools Team | Moffett Field, CA 94035