Re: 1-sided proposals comparison

David C. DiNucci (dinucci@nas.nasa.gov)
Thu, 22 Aug 1996 00:10:45 -0700

(My articles contain lots of tabs and white space, so if you get
any of them from the Web archive at parallel.nas.nasa.gov, it is
probably much easier to view them with "view source".)

Rolf Rabenseifner writes:
> Dave's answer has corrected my reviewing of his proposal in major
> topics.

Glad to hear it.

> Now for me, his proposal is clear and I think one can make the
> following abstraction, which simplifies the comparison:

> E (Dave's proposal) can be derived from B (Marc's proposal) by the
> following mapping:

> E's start of IOFFER with non-zero count            --> B's POST
> E's first start of PUT or GET after last OFFER     --> B's START(WEAK)
> E's completion of IOFFER after prev. local PUT/GET --> B's COMPLETE
> E's completion of IOFFER with non-zero count       --> B's WAIT

> And E combines B's POST with a non collective RMA_INIT
> and B's WAIT with the destruction of the window.

> And E has "tag"s (A,B,C and D does not have the "tag" argument).

That's probably pretty close, as far as it goes. But there are still
many differences. For example, what, in B, corresponds to the start of
IOFFERC in P/G/O? I claim that there is nothing. (BARRIER has very
different semantics -- e.g. it blocks.) Also, what about LOCK and UNLOCK?
Third party communication is almost entirely different, as near as I can
figure out. There are other differences like this.

There are also many qualitative differences which simplify optimization
and piggybacking in P/G/O and (I claim) make it easier to learn.

> Therefore we can state, that E allows all necessary cache operations.
> It never induces unnecessary cache operations because they can be
> eliminated by utilizing OFFER's "event" argument.

> Disadvantage compared to B:
> - it makes an additional synchronization in example 2, because
> E does not distinguish between START(WEAK) and START(NO_CHECK).
> This should not be accepted, because example 2 should be
> programmable with most efficiency,

At the end of the introduction of the P/G/O proposal, the following
paragraph exists:

Discussion: This proposal can easily be extended in a natural way to
include RPUT and RGET functions, which assume that the OFFER has been
performed on the target in the same way that IRSEND assumes that a
receive has been posted on its target.

I thought that this might address your criticism, but now that I think
about it, more words are needed, because using RGET and RPUT would probably
not speed up code at all in the P/G/O proposal: PUT or GET cannot even map
addresses to the target's address space until the OFFER has been performed
to tell it where the base of the "window" is. I'll address this further
below.

So, yes, P/G/O always requires a synchronization, even if the user
wants to explicitly bypass checking that an OFFER is currently executing
on the target (i.e. the "window" is ready). Here are several reasons
that I don't think that this is a problem:

1. Synchronization is always necessary, even in the other proposals,
whenever a count on the target is incremented or atomicity of a PUT or
GET is ensured. These syncs are combined with the PUT or GET in P/G/O,
so the sync to see if the OFFER is ready takes no extra time. (e.g.
the OFFER flag could be fetched at exactly the same time that the old
version of the count is fetched. In fact, the count could serve as the
OFFER flag -- a negative value could mean "no OFFER active".)

2. Experience with RSEND suggests that users do not frequently override
internal checking, especially for negligible gain.

3. Even if both of the above are absolutely false (which they are not),
according to the "Implementation Possibilities" list which you provided,
the only type of architecture where this *might* possibly make any
perceptible difference is in virtual shared memory. I would like to
see a low-level example to illustrate that START(NO_CHECK) saves
cycles on such an architecture before accepting your reasoning.
Otherwise, this could turn into another case where we complicate MPI
by trying to be "nice", but the implementors just ignore us and say
it doesn't help them.
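
To make point 1 concrete, here is a minimal sketch of how one counter could double as the OFFER flag, so that the "is the window ready?" check and the count update are a single atomic step. It is only an illustration under assumptions: it uses C11 atomics on shared memory to stand in for whatever fetch-and-update the real transport provides, and the names offer_state, offer_begin, offer_end, and rma_try_register are hypothetical, not part of any proposal.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NO_OFFER (-1)   /* any negative value means "no OFFER active" */

/* Hypothetical per-window state kept on the target. A non-negative
 * value is the number of PUT/GET operations registered so far. */
typedef struct {
    atomic_int count;
} offer_state;

/* Initial state: no OFFER is executing. */
static void offer_state_init(offer_state *s)
{
    atomic_init(&s->count, NO_OFFER);
}

/* Target side: the OFFER starts, opening the window with a zero count. */
static void offer_begin(offer_state *s)
{
    atomic_store(&s->count, 0);
}

/* Target side: the OFFER ends. Closes the window and returns how many
 * operations were registered while it was open. */
static int offer_end(offer_state *s)
{
    return atomic_exchange(&s->count, NO_OFFER);
}

/* Origin side: register one PUT or GET. The readiness check and the
 * increment happen in the same compare-and-swap, so checking for the
 * OFFER costs no separate round trip. Returns false if no OFFER is
 * active. */
static bool rma_try_register(offer_state *s)
{
    int old = atomic_load(&s->count);
    while (old >= 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&s->count, &old, old + 1))
            return true;
    }
    return false;
}
```

An origin would call rma_try_register before moving data and retry or queue the request if it returns false; a real implementation would fetch the old count remotely rather than through a local pointer.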

P/G/O requires this synchronization because OFFER provides the same
flexibility as other operations in MPI. In OFFER, the user specifies a
communicator, which specifies a set of processes and qualifies a set of
tags, and separately specifies a tag and a buffer. This allows the use
of multiple buffers ("window"s) with the same communicator, even all at the
same time (by using different tags). Just like everywhere else in MPI.
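
As a rough sketch of why a PUT or GET cannot map addresses before the OFFER has run, consider a per-communicator table of active OFFERs keyed by tag. Everything here is an assumption made for illustration: the table layout, the fixed MAX_OFFERS limit, and the names offer_table, offer_post, and offer_resolve are hypothetical; only "base", "size", and "disp_unit" come from the OFFER argument list.

```c
#include <stddef.h>

#define MAX_OFFERS 8   /* arbitrary limit for this sketch */

/* One active OFFER: a buffer ("window") published under a tag. */
typedef struct {
    int    active;
    int    tag;
    char  *base;       /* start of the offered buffer */
    size_t size;       /* length of the buffer in bytes */
    size_t disp_unit;  /* bytes per displacement unit */
} offer_entry;

/* Hypothetical per-communicator table of active OFFERs. */
typedef struct {
    offer_entry slot[MAX_OFFERS];
} offer_table;

/* Target side: publish a buffer under a tag. Returns 0 on success,
 * -1 if the table is full. */
static int offer_post(offer_table *t, int tag, void *base,
                      size_t size, size_t disp_unit)
{
    for (int i = 0; i < MAX_OFFERS; i++) {
        if (!t->slot[i].active) {
            t->slot[i] = (offer_entry){1, tag, (char *)base, size, disp_unit};
            return 0;
        }
    }
    return -1;
}

/* Map (tag, displacement, length) to a target address. Returns NULL
 * if no OFFER with that tag is active or the access is out of range. */
static void *offer_resolve(const offer_table *t, int tag,
                           size_t disp, size_t len)
{
    for (int i = 0; i < MAX_OFFERS; i++) {
        const offer_entry *e = &t->slot[i];
        if (e->active && e->tag == tag) {
            size_t off;
            /* Reject displacements past the end (also avoids overflow). */
            if (e->disp_unit != 0 && disp > e->size / e->disp_unit)
                return NULL;
            off = disp * e->disp_unit;
            if (len > e->size - off)
                return NULL;
            return e->base + off;
        }
    }
    return NULL;   /* the base is unknown until the OFFER is posted */
}
```

Two buffers offered under tags 1 and 2 on the same communicator resolve independently, and a lookup for a tag that has not been offered yet fails, which is the synchronization point discussed above.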

Other proposals do not require a sync because a user is required to execute
a collective operation to create a new communicator which has a window
"attached" to it. That is the sync.

If you or others truly feel that the reasons above are not adequate, the
P/G/O proposal could be easily modified to accommodate you by taking most of
the arguments out of OFFER and re-adding the RMA_INIT call. Once this is
done, RPUT and RGET make sense again. I have added this as an option below
(under Summary), even though I personally am absolutely against it.

> Advantage compared to B:
> - like D it has only a 2-routines interface instead of B's and C's
> 4-routines interface.

More exactly:

B) Marc's proposal
     RMA_INIT
     ACCUMULATE
     BARRIER
     PUT            RMA_START       RMA_POST        RMA_LOCK
     GET            RMA_COMPLETE    RMA_WAIT        RMA_UNLOCK

E) P/G/O
     PUT            PUTP            PUTC
     GET            GETP            GETC
     OFFER          OFFERP          OFFERC
     IOFFER         IOFFERP         IOFFERC
     (ACCUMULATE)   (ACCUMULATEP)   (ACCUMULATEC)

P/G/O actually has more routines in the interface, but since the routines
in all three columns behave almost identically, I claim that there is less
for the user to learn. (I have added ACCUMULATE in parens, because it seems
useful, and I can't think of any good reason not to add it to P/G/O.)
BARRIER is logically in the interface of B.

> All proposals B-E have non blocking PUT and GET, there is no difference.

> The big difference between B & E and C & D is the way they handle
> the synchronization before the RMA, signaling that the window can
> be used for RMA:

> B & E: target process: POST to all and any
> origin process: START when 'rank'ed target has posted

> functional disadvantage: does not handle example III

> C & D: target process: POST to 'rank'ed origin processes
> origin process: START when got the post event from 'count'
> targets

> functional disadvantage: cannot handle applications in which
> the target does not know which origin
> processes will PUT/GET to its window.

I agree that the differences are important, but not for the reasons you
cite here as "functional disadvantage"s. Those could be addressed, for
example, simply by adding the "rank" array you mentioned to the OFFER and
IOFFER operations in P/G/O proposal. I have listed this for consideration
below (under Summary), but it will only help in those cases where only some
of the processes in the communicator will be PUTting or GETting, and the
OFFER process knows exactly which processes those are. I think this case
will probably be rare.

I think that the important "big differences" between the synchronization
in these proposals and the synchronization in E, as I mentioned before, are
that:

1. WAIT blocks in D (& C?), and LOCK blocks in B, and BARRIER blocks in
almost everything, but nothing blocks in the P/G/O proposal except an
OFFER (i.e. the end of an IOFFER). These other proposals block in places
where E just keeps going.

(I don't know exactly what proposal C is, because you explain it as "the
same as A), but post(any)/start(rank) is replaced with
post(rank)/start(count)". But I don't think A had a start.)

2. In E, synchronization and communication are combined to allow for more
opportunities for piggybacking without waiting for multiple user calls.

3. In spite of my plea a few weeks ago (subject "Comparisons and Additions"),
I *still* have not seen any justification for separating communication and
synchronization in one-sided. Separating these just complicates things,
and makes the user call more routines to get anything done.

> Due to the functional lacks and the efficiency problems it might
> be good to realize a combination of B&E and C&D - the following
> new proposal F:

I believe that F has the same problems mentioned in 1, 2, and 3, above.

S U M M A R Y
-------------
So, to summarize, in light of your comments about the problems you perceive
in E (P/G/O), I suggest that the following modifications be considered by
the group:

(a) The addition of two more arguments, "rcount" and "rank", to OFFER,
IOFFER, OFFERP, and IOFFERP. "rank" is an integer array containing
"rcount" ranks. If "rcount" is positive, only those processes specified
in "rank" are allowed to perform PUTs or GETs (PUTPs or GETPs). If
"rcount" is 0, any process in the communicator can perform them.

    Reasons for:     * Allows OFFER to inform only necessary processes that
                       PUTs or GETs are OK or not OK, to reduce overhead
                     * Allows users to more clearly specify their
                       intentions, and thereby allows dynamic checking for
                       logical errors in their program (similar to subscript
                       bounds checking).

    Reasons against: * OFFER already has 7 arguments, this brings it to 9.
                     * Doesn't help in cases where all processes can PUT
                       or GET, or if target doesn't know which ones will
                       PUT or GET, and these cases are probably extremely
                       common.
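
The access rule in (a) is small enough to sketch directly. This is only an illustration of the semantics stated above ("rcount" positive restricts access to the listed ranks, "rcount" of 0 admits any process in the communicator); the function name offer_permits is hypothetical.

```c
#include <stdbool.h>

/* Returns true if the process with rank `origin` may PUT or GET under
 * an OFFER that was given `rcount` ranks in the array `rank`.
 * rcount == 0 means every process in the communicator is allowed. */
static bool offer_permits(int origin, int rcount, const int *rank)
{
    if (rcount == 0)
        return true;
    for (int i = 0; i < rcount; i++) {
        if (rank[i] == origin)
            return true;
    }
    return false;
}
```

A check like this is what would let an implementation flag a PUT from an unlisted process as a logical error, in the same spirit as subscript bounds checking.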

(b) (1) The addition of RMA_INIT to the P/G/O proposal, (2) the removal of
the "base", "size", and "disp_unit" arguments from OFFER, IOFFER,
OFFERP, IOFFERP, OFFERC, and IOFFERC, and (3) the addition of a "ready"
flag to all the PUT and GET functions or the addition of RPUT, RGET,
RPUTP, RGETP, RPUTC, and RGETC functions.

    Reasons for:     * RPUT and RGET can avoid a synchronization in some
                       cases

    Reasons against: * Never really avoids synchronization, since both the
                       update of the count and (often) atomicity guarantees
                       must be performed even for RPUT and RGET (and similar
                       functions in other proposals).
                     * Obscure way of using attributes to pass extra
                       arguments to MPI functions.
                     * Disallows the use of different "windows" by the same
                       communicator, so requires the creation of multiple
                       communicators if different regions are addressed,
                       unlike the rest of MPI.
                     * Will probably be as rarely used as RSEND in MPI-1.
                     * Even if all of the above is incorrect, this is only
                       conjectured to possibly help on a few architectures.

These are actually related, because if (a) is approved, there is less reason
for (b) -- i.e. if the "rcount" and "rank" arguments are used, a PUT or GET
should be able to check locally whether or not the OFFER has been performed,
so a hypothetical extra sync wouldn't take any extra time anyway.

> And I do not see how to achieve F's functionality and efficiency
> with unions of operations analog to D or E.

Have I understood and adequately addressed your concerns above?

> Marc, Raja and Dave can you agree? (Eric wrote to me, that he has no
> time to look at this question in the moment).

I will need to get Appendix A in order before Monday, so I may not have
much more time to spend, either.

> Thanks. I still hope that I must not put the review in LaTeX for the
> next Chicago meeting.

If we are going to continue discussing proposal "F", I hope that you can
at least list all the functions with a description so that we can keep
misunderstandings to a minimum. (I think F is currently a modification of
a modification of a published proposal, so I don't know just what it looks
like.)

I don't really know the procedures around MPI well enough yet. Even if
everybody thinks that one of these proposals is great, does it still need
a reading and two votes to be accepted? Does the first vote, already taken,
still count, even if the proposal we like doesn't resemble the first one
very much at all? Is there time to get this in by Supercomputing? (Am I
asking too many questions?)

> > I'll fill in the rest of your table in a following message.
> I did not found it in my folder. If you change your mind about
> A-F it is possibly not needed.

I haven't done it yet.

Later,
-Dave