Re: Reasons for negative votes on 1-sided at Sept. MPI mtg

David C. DiNucci (dinucci@nas.nasa.gov)
Tue, 17 Sep 1996 11:12:28 -0700 (PDT)

Marc Snir writes:
> David DiNucci writes:
> > There is one reason, however, that the RMA_BARRIER approach is worse.
> > It does not allow the target to hold-off PUTs and GETs until the target
> > is ready. Other synchronization must be used. This severely restricts
> > options for efficient implementation (e.g. to buffer PUTs and GETs),
> > which are present in some cases with POST/WAIT/START/COMPLETE.

> This, I don't understand. A valid implementation of RMA_BARRIER (weak
> style) is to allow the call to barrier to proceed, and allow the ensuing
> calls to put/get to proceed, but postpone actual RMA data transfers until
> the ensuing call to RMA_BARRIER occurs.

I believe that, later in this message (see ++), you concede that RMA_BARRIER
(weak style) must act as a barrier, so this is not a valid implementation.

> I doubt, however, that
> implementors will want to buffer put/get communication. More buffering and
> data copying decouples sender code from receiver code, at the expense of
> additional communication overhead. My take, for RMA, is that
> implementations will move in the direction of less buffering, in
> similarity with a shared memory programming paradigm.

If an implementation wants to delay moving the data to the target on
distributed memory machines until the target is ready, the implementation is
(and should be) free to do so. But if you are advocating that it is OK for
MPI to require the user to hold off the execution of one-sided primitives at
the origin until the target is ready for the communication, that is a
problem. (Perhaps you felt that the weak RMA_BARRIER was getting around
this, but it is not.)

> One main purpose of
> RMA communication is to allow a simplified communication protocol, with
> less hand-shaking, by providing full information on source and target
> buffers at the origin. It is up to the user to use the weakest possible
> synchronization (post windows as early as possible, wait as late as
> possible), so as to decouple as much as possible progress on one process
> from progress on another process. But, the user cannot assume that data
> will be buffered, so that a "put" will not complete unless the target
> window is available.

All of this can be said of MPI_SEND as well. I am perfectly willing to
allow PUTs and GETs to block. I am against forcing them to block.
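
To make the distinction between "allowed to block" and "forced to block"
concrete, here is a minimal sketch in Python. It is not taken from either
proposal: the Window class and its put/expose/hide methods are hypothetical
names, and the window is just a local byte buffer. It shows one strategy an
implementation could choose, namely buffering a PUT that arrives before the
target is ready and applying it once the target exposes the window, so the
origin is not required to hold off.

```python
from collections import deque


class Window:
    """Toy target window: a byte buffer plus a queue of deferred PUTs."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.exposed = False      # True while the target accepts access
        self.pending = deque()    # PUTs issued before the target was ready

    def put(self, offset, data):
        """Origin side: never blocks. Returns True if applied immediately."""
        if self.exposed:
            self._apply(offset, data)
            return True
        # Target not ready: keep a private copy and let the origin continue.
        self.pending.append((offset, bytes(data)))
        return False

    def expose(self):
        """Target side: open the window and drain deferred PUTs in order."""
        self.exposed = True
        while self.pending:
            self._apply(*self.pending.popleft())

    def hide(self):
        """Target side: close the window again."""
        self.exposed = False

    def _apply(self, offset, data):
        self.memory[offset:offset + len(data)] = data


win = Window(8)
early = win.put(0, b"ab")        # target not ready: buffered, origin continues
before = bytes(win.memory[:2])   # nothing has landed yet
win.expose()                     # target ready: the deferred PUT lands now
after = bytes(win.memory[:2])
late = win.put(2, b"cd")         # window already exposed: applied directly
```

An implementation that prefers not to buffer could instead block inside put
until expose is called; the point is only that the choice stays with the
implementation rather than being pushed onto the user.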

> I see no harm in having complexity, as long as users are exposed to this
> complexity only when they need it. I expect many (most?) users will use
> only init, put, get, barrier -- four calls. The post/wait/start/complete
> and malloc calls are for users that want to optimize their code. They
> want more control on communication and will use functions that provide
> more control.

My claim is that users never need to be exposed to this complexity, and I
explained why in the remainder of my message, and I provided an example
of how it can be avoided in the P/G/O alternate proposal. The complexity
serves no purpose. I cannot state it more clearly. The user has no more
control in the voted-upon proposal than in the alternate proposal. The
complexity is added by separating single operations into multiple operations
for no reason, by adding the ability to control actions which can do nothing
but slow down the program, and by providing new mechanisms even when
sufficient mechanisms already exist in MPI.

The purpose of one-sided, just as message-passing, is to facilitate data
movement. In fact, one-sided is just like message-passing (and is just as
simple) except that one side specifies both the source and destination of
the message. Every difference between one-sided and message-passing
should be justified based upon the differences in their semantics, or by
admitting that some mistakes were made with message-passing.

I hope that anyone who believes that this complexity is necessary takes a
good look at the P/G/O proposal. It is an existence proof that this
complexity is not necessary.

> I have no strong argument for MPI_STRONG.

Glad to hear it.

> To the same extent it is a good idea to start with blocking send/receive
> calls, and introduce later nonblocking calls, it will be a good idea to
> start with strong synchronization, and introduce later weak
> synchronization.

The analogy is wrong. A closer one would be recommending that users start with
SSEND and introduce SEND later. I know that there are some people who would
recommend this, but the fact is that the errors are not more or less complex
in one case than the other; they are just different.

> >*If the MPI_STRONG flag is not used on RMA_START, the RMA_START
> > operation itself is not needed.

> I suppose your argument is that the list of flags has no purpose, because
> the information is repeated in the put/get calls themselves. This is
> correct -- although providing twice the same information is not always
> such a bad idea -- in particular, it provides more choices to the
> implementer. Our decision was to separate synchronization calls from the
> communication calls. The alternative, which you proposed, was to merge
> them. To get the same functionality that we have now with
> strong/weak/nocheck we would need an additional argument in each put/get
> call, or 3 variants of put/get calls (or more, to handle also barrier and
> lock/unlock). In any case, we should not mix the two issues: one is the
> type of synchronization patterns that we want to support and the other is
> whether synchronization calls are separate from communication calls.

No argument made in my message was based upon whether I wanted to combine
or separate communication and synchronization. I addressed only problems
that I perceive in the proposal voted upon, and suggested alternatives.

My argument is just as I stated it. If there is no strong argument for
MPI_STRONG (as you stated above), then the only flags left are MPI_WEAK and
MPI_NOCHECK. MPI_NOCHECK makes no guarantee that the semantics of the
operation are actually being implemented, and as I discussed in the P/G/O
proposal implementation advice, use of the MPI_WEAK flag is almost as
efficient as MPI_NOCHECK anyway (i.e. MPI_WEAK *might* require one extra round
trip the first time a process issues a PUT or GET to the target, but it need
not add measurable overhead after that). Based on this, it makes sense to me
that MPI_WEAK would be the default behavior. If MPI wants to provide for
people who want to purposely subvert the internal checking performed on their
behalf for the *possible* savings of one round trip the very first time they
use an operation, I have no strong feelings on how it should or shouldn't be
done, but the method used previously in MPI was to provide "Ready" operations.
It is conceivable that it could also be performed through a special operation
(similar to what you call RMA_START, but probably with another name) which
would say "From here on, don't do any checking for one-sided operations".

> >*I can also see no justification for having a separate RMA_WAIT
> > operation, since it plays exactly the role of the end of the RMA_POST
> > operation. That is, by changing RMA_POST to RMA_IPOST (IRMA_POST?),
> > RMA_WAIT can be omitted and replaced by an MPI_WAIT on the request
> > returned from RMA_IPOST, making it consistent with the rest of MPI.
> > This automatically suggests that a blocking RMA_POST should be
> > introduced as well, which would combine the RMA_IPOST and MPI_WAIT. I
> > believe that this would be a useful addition, and not at all confusing.

> post and wait do not play exactly the same role. The first exposes a
> window for external communication and the second hides it. Having separate
> functions is clearer, and allows more efficient implementations.

A window is hidden if it is not exposed. Therefore, a single operation which
exposes a window (see "OFFER" in the P/G/O proposal) is sufficient, and
having separate functions does not allow more efficient implementations.
(A debate over which is clearer can only become religious.)
A SEND or RECV exposes a buffer for external communication until that
communication takes place, at which point the buffer is hidden. There is
no difference in the functionality for one-sided, except for which part of
the buffer is used for communication, and who supplies the info. Please
note that combining synchronization and communication is not an issue.
OFFER combines them no more and no less than START and COMPLETE.
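
The idea that the end of the exposure can be represented by a request, rather
than by a separate WAIT operation, can be sketched as follows. This is an
illustrative Python model only: Request, Window, ipost and put_and_complete
are hypothetical names, a thread stands in for the origin process, and the
actual signatures in the P/G/O proposal or in chapter 5 are not reproduced
here.

```python
import threading


class Request:
    """Stand-in for an MPI request: completes once and can be waited on."""

    def __init__(self):
        self._done = threading.Event()

    def complete(self):
        self._done.set()

    def wait(self, timeout=None):
        """Return True once the request has completed (False on timeout)."""
        return self._done.wait(timeout)


class Window:
    def __init__(self, size):
        self.memory = bytearray(size)
        self._request = None      # non-None exactly while the window is exposed

    @property
    def exposed(self):
        return self._request is not None

    def ipost(self):
        """Expose the window; the returned request completes when it is hidden."""
        self._request = Request()
        return self._request

    def put_and_complete(self, offset, data):
        """Origin side: write into an exposed window, then release it."""
        if self._request is None:
            raise RuntimeError("window is not exposed")
        self.memory[offset:offset + len(data)] = data
        request, self._request = self._request, None   # window is hidden again
        request.complete()


win = Window(4)
req = win.ipost()                                   # nonblocking exposure
origin = threading.Thread(target=win.put_and_complete, args=(0, b"hi"))
origin.start()
finished = req.wait(timeout=5)                      # plays the role of RMA_WAIT
origin.join()
```

A blocking post in this model would simply be ipost followed by wait on the
returned request, which is the combination suggested above.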

> > *Assuming that the above modification is made, I can see very little
> > justification for introducing an entirely new object -- i.e. MPI_Wins
> > -- since it is a small jump from the above to create a persistent
> > request for RMA_POST (with, say, MPI_RMA_POST_INIT).

> I am totally lost here. RMA_INIT is likely to be a heavy operation, which
> requires each process to register information about windows at all other
> processes, information on displacement units, information, perhaps, on the
> architecture of the processes; it will be used to select which mechanism
> is used for RMA (are the windows in memory that can be directly accessed
> by all processes?). It will be used, perhaps, to start daemons or post
> handlers on distributed memory systems. I certainly would not want init
> to disappear.

Persistent requests were proposed for MPI for just these kinds of reasons.
Creation of a persistent request can be a heavyweight operation, which is
why one creates it ahead of time if it is likely to be used a lot. If
it is only to be used once, the complexity of a separate creation operation
can be avoided. Just like everywhere else in MPI.
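
As a rough illustration of that trade-off, the following Python sketch counts
how often an (assumed expensive) registration step runs when a request is
created once and restarted, versus created afresh for every use. Registry,
PersistentOffer and offer_once are hypothetical names invented for this
example; they model the pattern only, not any real MPI call.

```python
class Registry:
    """Counts how often the (assumed expensive) registration step runs."""

    def __init__(self):
        self.registrations = 0

    def register(self, buffer, tag):
        self.registrations += 1
        return (id(buffer), tag)


class PersistentOffer:
    """A request that is set up once and can be started many times."""

    def __init__(self, registry, buffer, tag):
        self.handle = registry.register(buffer, tag)   # heavy work, done once
        self.active = False
        self.completed = 0

    def start(self):
        if self.active:
            raise RuntimeError("request is already active")
        self.active = True

    def wait(self):
        if not self.active:
            raise RuntimeError("request was not started")
        self.active = False
        self.completed += 1


def offer_once(registry, buffer, tag):
    """Single-use convenience form: create, start and wait in one call."""
    request = PersistentOffer(registry, buffer, tag)
    request.start()
    request.wait()
    return request


buf = bytearray(16)

persistent_registry = Registry()
req = PersistentOffer(persistent_registry, buf, tag=7)
for _ in range(3):          # reuse: setup cost paid once
    req.start()
    req.wait()

oneshot_registry = Registry()
for _ in range(3):          # one-shot: setup cost paid on every use
    offer_once(oneshot_registry, buf, tag=7)
```

The user who needs the window once pays for one combined call; the user who
needs it repeatedly can hoist the setup out of the loop.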

(I certainly would want init to disappear.)

> By the way, the new object MPI_Win was not introduced because it is
> necessary to support RMA: the previous draft was using communicators for
> that purpose. It was introduced, because most people believed it's a
> cleaner design.

I am one person who believes that MPI_Win is a cleaner design than using a
communicator for this purpose. It is not as clean, or as consistent with
the rest of MPI, as creating a persistent OFFER request with a communicator,
a tag, and a buffer, as I suggested in the P/G/O proposal. Communicators
were underkill, MPI_Wins are overkill, and persistent requests are just right.

> You have a valid point here. It is true that no process can leave the
> barrier before all processes reached it, otherwise one does not know for
> sure that all processes have completed their put/get calls.

++ That is the statement that I reference above as ++

> On the other hand, I would not be strongly
> opposed to the deletion of the strong/weak/nocheck flag from
> MPI_RMA_BARRIER

Glad to hear it.

> >*Progress (liveness) rules are also another problem point.

> The requirement in MPI2, as in MPI1, is that progress is guaranteed:
> nonblocking calls complete within finite time; blocking calls complete
> within finite time, once the event they blocked on has occurred. The only
> debate is what "finite time" means. My interpretation is that finite time
> means that there is a fixed upper bound on completion time, that does not
> depend on code executed by other processes. Other people interpret it to
> mean that it is finite, but no fixed upper bound can be provided that is
> independent of the code executed by other processes. In this view, if
> process B is caught in an infinite loop, so that the parallel program may
> never terminate, then a nonblocking MPI call at process A may never
> complete. My view is that, even in this case, we need to guarantee that
> the call at process A completes within finite time, as I want decent
> semantics for nonterminating programs, as well.

I understand the debate and the interpretations. However, rather than being
stuck with whichever interpretation sounds good to the implementor, I think
that both interpretations have their good points and bad points, and that
the users should have the power to choose the interpretation which best fits
their needs. That's why I think that we should just bite the bullet, separate
these two interpretations into two separate sets of semantics, and require
that they both be implemented, so that the user can choose the one they need
at the time. It is not too difficult to implement both, and we have already
seen the necessity to implement the strong semantics -- because users
demanded it when they demanded third party communication. This is the reason
that I proposed both OFFER (the weak semantics) and OFFERP (the strong
semantics, also used for 3rd party) in the P/G/O proposal.

My "buffer or deadlock" scenario was based on the premise that there would be
no guarantee that data could be moved into a target window until the target
executed an RMA_WAIT. You responded by stating that it will not deadlock,
even without buffering, if there is a guarantee that data will be moved into
a target window if the target blocks on any MPI operation. I don't think
we are disagreeing. I am only stating that
(1) it may be possible to satisfy users, and may even ease implementation,
if we stick with my simpler premise and combine the RMA_WAIT
and RMA_COMPLETE, as I did in the OFFER of the P/G/O proposal, and
(2) the guarantees to the user should be written out in black and white,
rather than left to the interpretation of the MPI implementor. (For
example, if you insert an infinite loop between the COMPLETE and WAIT
in diagram 5.5, does your guarantee that...

> Once this happened, the
> put, complete sequence is enabled, and must complete within finite time;

...become false?)
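
The two readings of the progress guarantee can be contrasted with a very small
deterministic model. This is a sketch under stated assumptions, not either
proposal's semantics: put_landing_step is a hypothetical function, the target's
user code is modeled as busy_steps iterations that never enter the library,
and an infinite loop is approximated by a finite observation horizon after
which the target never reaches WAIT.

```python
def put_landing_step(guaranteed_progress, busy_steps, reaches_wait):
    """Return the step at which a pending PUT lands at the target.

    guaranteed_progress: True models an asynchronous agent that services the
        PUT while the target runs user code; False models progress that
        happens only inside library calls made by the target.
    busy_steps: number of user-code steps observed between COMPLETE and WAIT.
    reaches_wait: whether the target ever gets to its WAIT call.
    Returns None if the PUT never lands within the observed horizon.
    """
    for step in range(busy_steps):
        # Target user code runs here without entering the library.
        if guaranteed_progress:
            return step           # the agent services the PUT right away
    if reaches_wait:
        return busy_steps         # serviced inside the target's WAIT call
    return None


weak_stuck = put_landing_step(False, 1000, reaches_wait=False)
strong_stuck = put_landing_step(True, 1000, reaches_wait=False)
weak_normal = put_landing_step(False, 1000, reaches_wait=True)
```

Under the first reading the PUT never lands while the target spins; under the
second it lands regardless of what the target's code does. Which of the two
a user gets is exactly what should be written down.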

One additional note. By combining the WAIT, START, and COMPLETE into one
operation (OFFER), the P/G/O proposal also allows the user to perform an
MPI_WAIT or MPI_WAITALL on a request for these operations.

> To sum up, the current definition of progress is consistent with the
> definition of progress in MPI1. The fuzziness about "finite time" is the
> same fuzziness as in MPI1.

I understand this. I understand that this fuzziness has resulted in different
MPI implementors providing different semantics to MPI users and claiming that
they are all the same. They say, rightly, "I am the implementor, and finite
time will mean what I say it means". The question is whether to carry this
fuzziness into one-sided. (Implementors, make your jobs easier by noting
that for any t, where t is equal to the longest running time of any
terminating program on your system, there exist infinitely many epsilons > 0
such that t + epsilon < infinity. No matter how long you have waited to
complete that communication, you are always allowed to wait longer! Oops,
sorry, the system needed to be rebooted. Too bad I didn't have the MPI
Forum as parents -- I probably could have convinced them to give me a weekly
allowance as long as I cleaned my room in finite time.)

> On the other hand, a pure polling
> implementation of 3-rd party communication would be so obviously of poor
> quality, that it makes no sense to even consider it: there is no reason to
> believe that a process that is not involved in the communication will
> execute an MPI call any time soon. There, we need an asynchronous agent.

You seem to be basing your implementation on how you think users will be
using the constructs, rather than on the guarantees that the constructs
are making to the user -- i.e. the semantics.

Also, as I stated earlier, on some systems (e.g. coherent shared memory),
where no agent is required, users will certainly notice that they can
implement more efficient 3rd party communication by issuing a POST (with
no WAIT) on the third party, and then just issue PUTs and GETs from the other
two parties. The only thing keeping them from doing so according to the
written semantics is the unenforceable rule against overlapping PUTs and
GETs that I mentioned earlier. Should we tell users that we are providing
a fast way and a slow way to use their hardware, but please use the slow
way? (In this case, the slow way not only requires extra calls, it also
serializes all accesses to the target window.)

The P/G/O proposal handles this. There is only one construct which provides
guaranteed progress, and it is the same construct used for third party
communication -- i.e. OFFERP. It does not serialize accesses to the target
window, but does guarantee atomicity for overlapping accesses.

> > In one case, however -- i.e. LOCK/UNLOCK -- an amendment was passed that
> > RMA_MEM_ALLOC'd memory be required, even if it is unnatural for the
> > caller to use dynamically-allocated memory. Why?

> I voted against the amendment, so that I need not justify it. I agree
> with you.

Then I must find something related to discuss.

How is it that a subcommittee can look deeply into an issue like this for
several months, then when everybody gets together for a formal vote, an
amendment can be made and people who have never heard the debate must vote
off the top of their heads? I have heard others make recommendations for
changes in procedure, and I have a few of my own to prevent such things,
but a discussion should certainly not be limited to this subcommittee.

> The number of functions can be reduced by reducing the number of
> synchronization paradigms provided. With the same number of options, we
> have roughly the same number of calls (or same number of call options)
> with the P/G/O proposal or the current proposal.

The number of calls is not a good metric. For example, adding a non-blocking
"I" call which follows all the rules of non-blocking routines does not add
the same complexity as a new routine which uses all new arguments, etc.
(I am in the process of codifying a better metric than "number of calls", and
I'm not the first to do so.) The P/G/O proposal contains only 6 calls
which must actually be learned: GET, PUT, and ACCUMULATE, which are
almost identical to those in chapter 5, and OFFER, OFFERP, and OFFERC, which
are very similar to each other: OFFERP is exactly the same as OFFER except
that it guarantees progress and doesn't outlaw overlapping access, and OFFERC
is a collective OFFER. There is virtually nothing new to learn for the
remaining operations -- GETP, GETC, PUTP, PUTC, ACCUMP, IOFFER, IOFFERP,
IOFFERC, and maybe OFFER_INIT and OFFERP_INIT. About the only new way to
build an illegal (nonconforming) program is to overlap accesses where they
are not legal.

Also, the P/G/O proposal offers more synchronization paradigms than chapter 5.
(1) It allows the user to choose between "efficient but no guaranteed
progress" semantics (OFFER) and "maybe not as efficient but with
guaranteed progress" semantics (OFFERP).
(2) It allows the user to perform 3rd party communication without serializing
accesses to the target window.
(3) It allows a process to serve as target to one or more anonymous origins.
(4) It allows both the origin and target to have a say in when the one-sided
operation will complete (using "combine").
(5) It allows the target in a collective call to hold off PUTs and GETs
to its address space until it is ready for them (while not necessarily
blocking those calls on the origin).
(6) It allows a process to wait on a request until PUT/GET/ACCUMULATE
operations have finished.

> Again, it is important
> not to mix issues: (1) Do we separate synchronization calls from
> communication calls; (2) what synchronization and progress semantics we
> want to support for RMA; (3) whether we want an initialization call
> (init), or want the "initialization" to be dynamically done, as part of
> the communication; (4) whether we want to mandate the use of Malloced
> memory. The current design reflects decisions made on each of these
> issues. A simpler syntax will need to revisit some, or all of these
> decisions.

If certain base assumptions or decisions lead to an undesirable outcome,
one can either re-examine the base decisions or convince oneself that the
outcome is desirable after all.

My goal in publishing my original note was to inform people of my voting
rationale. In this message, I am clearly, once again, advocating the P/G/O
proposal. I don't really know what outcome I am aiming for -- a third
first vote for one-sided? To somehow morph the current chapter to look much
like the P/G/O proposal so that it doesn't require another first vote?
Que sera sera.

-Dave
========================================================================
David C. DiNucci | MRJ, Inc., Rsrch Scntst | NASA Ames Rsrch Ctr
dinucci@nas.nasa.gov| NAS (Num. Aerospace Sim.)| M/S T27A-2
(415)604-4430 | Parallel Tools Team | Moffett Field, CA 94035