Comparisons and additions

David C. DiNucci (dinucci@nas.nasa.gov)
Thu, 8 Aug 1996 14:13:42 -0700

This message has three parts. In the first, I compare the P/G/O one-sided
proposal with the other one-sided proposals (especially that currently in
chapter 4). In the second, I look at possible extensions to the P/G/O
proposal. In the third, I consider whether it makes sense to wait on both
remote operations and local operations using the same directive.

I welcome any comments, public or private.

What is the difference between the P/G/O proposals and others?
==============================================================

Both Marc's proposal and my own (http://www.nas.nasa.gov/NAS/Tools/MPI2) start
with a definition of one-sided communication, describing it as communication
where one side specifies both the source and destination of the data transfer.
Aside from this trait, however, one-sided is very similar to message passing.
The goal is to efficiently transfer data from the address space of one process
to the address space of another process.

Some of the proposals are evidently trying to make one-sided look like shared
memory. In shared memory, data doesn't move (i.e. get copied), so processes
must negotiate access to it in situ. In one-sided, data does move. The goal
in one-sided is to make data move when both processes want it to move, with
maximum opportunity for overlap with computation, with minimal latency, with
maximum bandwidth, and with a minimum of user effort. In other words, it is
just like message-passing, except that one side specifies both the source and
destination of the data transfer.

The primary difference between the P/G/O approach (i.e. my approach) and others
is that P/G/O unifies communication and synchronization, while others
separate them. I have already demonstrated in postings how unification
allows processes coded using the P/G/O proposal to continue executing program
code while similar examples coded using some other proposals block at BARRIER
or WAIT calls. This unification also helps the P/G/O proposal to accomplish
the other message-passing-like goals described above.

Example:
Approach    Process A        Process B
========    =============    =========
Post/Wait   access buffer
            Post             Wait
                             PUT
                             compute
                             PUT
            Wait             Post
            access buffer

Note that the "compute" in process B cannot execute until the Post in
process A. Now consider the P/G/O proposal:

P/G/O       access buffer
                             PUT
                             compute
                             PUT
            OFFER            OFFER(0)
            access buffer

Note that the "compute" in process B can execute right away, because the
PUTs can be buffered. The OFFER(0) is required only to wait for the PUTs to
complete, and can be eliminated if the IPUT suggested below is adopted.
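
The difference between the two schedules can be put in numbers with a toy
timing model. This is only a sketch under stated assumptions, not part of
either proposal: the function names are hypothetical, times are in arbitrary
units, both processes start at time 0, the PUTs themselves are treated as
free, and the completion of B's final Post or OFFER(0) is not modelled.

```python
# Toy timing model (hypothetical): arbitrary time units, PUTs cost nothing.

def post_wait_times(a_access, b_compute):
    """Post/Wait: B's Wait cannot return before A's Post."""
    a_post = a_access                      # A posts after its first buffer access
    b_compute_start = a_post               # B: Wait, PUT, then compute
    b_post = b_compute_start + b_compute   # B: second PUT, then Post
    a_resume = b_post                      # A's Wait returns on B's Post
    return b_compute_start, a_resume

def pgo_times(a_access, b_compute):
    """P/G/O: B's PUTs are buffered, so B never waits for A."""
    b_compute_start = 0                    # B: PUT (buffered), then compute at once
    b_offer = b_compute_start + b_compute  # B: second PUT, then OFFER(0)
    a_resume = max(a_access, b_offer)      # A's OFFER needs both PUTs issued
    return b_compute_start, a_resume

print(post_wait_times(a_access=5, b_compute=3))  # (5, 8)
print(pgo_times(a_access=5, b_compute=3))        # (0, 5)
```

Each pair is (time at which B's compute starts, time at which A regains its
buffer). In this model the Post/Wait schedule serializes A's access and B's
compute, while the P/G/O schedule lets them overlap.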

A secondary difference between the P/G/O approach and others is that P/G/O
uses tags and communicators exactly the same way that they are used in the
rest of MPI. Other proposals endow communicators with new attributes (e.g.
a "window", and now hints are proposed) which are not used anywhere else
in MPI, effectively passing additional arguments to the routines by attaching
those arguments to the communicator (in the MPI_RMA_INIT call). As a result,
all operations which use that communicator use the same arguments, and because
tags are not available for disambiguation, all matching operations on the
communicator are effectively combined (or serialized).

I have not seen any justification for separating communication and
synchronization in the other proposals, but have seen some mention of the
costs -- e.g. from page 1 of Marc's proposal

"the implementation can delay communication operations until the
synchronization occurs for efficiency."

(I found the related discussion in Chicago very disturbing. It was something
like "It isn't important that this proposal minimizes the number of
communications and/or latency, since it will probably be implemented on top
of stream sockets anyway". If this were the case, couldn't the MPI-1 forum
have simply renamed Unix sockets "MPI-1" and have been done with it? I think
the goal of MPI is to provide an interface which *can* be implemented
efficiently, and with low latency. Whether or not current implementations
choose to take advantage of the interface is their business.)

The separation of communication and synchronization in these proposals is
essentially the same as splitting message-passing into:
  (a) put/get: message-passing with no sync (and ability to specify both
      source and destination)
  (b) post/wait: message-passing with no message

Why is this split justified for one-sided, but not for normal message-passing?
The only reason I can envisage is that, since one side specifies both the
source and destination, the target doesn't need to provide separate information
for each transfer, so multiple transfers can theoretically occur without
a separate matching operation for each on the target end. However, this
justification is not sufficient, as demonstrated by P/G/O. P/G/O allows a
target operation to match multiple GETs and/or PUTs, and allows each GET or
PUT to combine separate operations into composite operations which are only
counted once by the target. This allows control over the completion of the
target operation from both the target's end and the originator's end, without
requiring a separate sync for each operation.

In summary: What is the justification for splitting comm & sync in one-sided?

Does the P/G/O proposal go far enough?
======================================
The P/G/O proposal is closer to message-passing than the other proposals, but
there are still significant differences.

In the P/G/O proposal, there is only one PUT and one GET. These are both
non-blocking operations. In MPI, a message send has several forms:
(blocking,non-blocking)*(standard,buffered,ready,synchronous). Don't these
apply to P/G/O?

I think that the answer is: Yes. Every one of these options makes sense for a
PUT, and for exactly the same reasons that they make sense for message passing.
Many of these don't seem to make sense for GET. The reason is
that a PUT can be considered as a SEND followed by a SEND, and a GET can be
considered as a SEND followed by a RECV. The first SEND in each passes the
target address info and operation type. The second transfer (SEND or RECV)
passes the data. The target, then, executes the equivalent of a RECV to get
the operation type and target address info, then either a SEND or RECV to
transfer the data (depending on the operation type).
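
The decomposition above can be sketched with two threads and two queues.
This is an illustration only, not an implementation of P/G/O: the names
serve_target, put and get are hypothetical, a dict stands in for the
target's address space, and queue.Queue stands in for the message-passing
layer.

```python
import queue
import threading

def serve_target(memory, requests, replies, n_ops):
    """Target side: serve n_ops one-sided operations against `memory`."""
    for _ in range(n_ops):
        op, addr = requests.get()          # RECV: operation type and target address
        if op == "PUT":
            memory[addr] = requests.get()  # RECV: the data of a PUT
        elif op == "GET":
            replies.put(memory[addr])      # SEND: the data of a GET

def put(requests, addr, value):
    """Originator side: PUT = SEND (address info) followed by SEND (data)."""
    requests.put(("PUT", addr))
    requests.put(value)

def get(requests, replies, addr):
    """Originator side: GET = SEND (address info) followed by RECV (data)."""
    requests.put(("GET", addr))
    return replies.get()

memory = {"x": 0}
requests, replies = queue.Queue(), queue.Queue()
worker = threading.Thread(target=serve_target,
                          args=(memory, requests, replies, 2))
worker.start()
put(requests, "x", 42)
result = get(requests, replies, "x")
worker.join()
print(result)  # 42
```

Only the originator names addresses here; the target's loop never specifies
where the data goes, which is the one-sided trait described earlier.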

The two SENDs in a PUT can be combined into a single SEND operation, and all
of the MPI send modes therefore can apply to that operation. On a GET, the
modes would either apply only to the initial SEND of the GET (the request),
or they would apply to the SEND on the target end. Neither of these makes
much sense, primarily because the coordination between the SEND on the
target and the RECV on the originator is taken care of automatically, and
those options that are possible should probably be specified by the target
directly.

In summary: Should P/G/O contain PUT, IPUT, SPUT, ISPUT, RPUT, IRPUT, BPUT,
and IBPUT operations?
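
For reference, the eight names follow mechanically from the MPI send naming
scheme (an "I" prefix for non-blocking, then a mode letter). The short
sketch below only enumerates the names; the variant_names helper is
hypothetical.

```python
from itertools import product

BLOCKING = ("", "I")        # blocking, non-blocking
MODES = ("", "B", "R", "S") # standard, buffered, ready, synchronous

def variant_names(base):
    """Return the 2 x 4 operation names built on `base`, e.g. SEND or PUT."""
    return [i + m + base for i, m in product(BLOCKING, MODES)]

print(variant_names("SEND"))
print(variant_names("PUT"))
```

Applied to SEND this gives the eight MPI-1 send calls (SEND, BSEND, RSEND,
SSEND and their I-prefixed forms); applied to PUT it gives the eight
operations listed in the question above.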

Combination of waiting for local ops and remote ops
===================================================
In the P/G/O proposal (and at first glance, it appears that this might be
true for Marc's proposal now, too), the OFFER operation will not complete
until both (a) all of the local GET or PUT operations have completed, and
(b) the proper number of remote matching operations have completed. I have
used the following case as justification for this rule:

process A       process B

PUT to B        PUT to A
PUT to B        PUT to A
OFFER           OFFER

That is, deadlock could occur if two different operations were required, one
to wait for the local operations, one to wait on the remote operations.

However, upon thinking further, this is exactly the same justification that
was used for MPI_SENDRECV, and I think I heard one night (late, over ice cream
in Chicago) that everybody just implements MPI_SENDRECV as
    IRECV(...,req)
    SEND
    WAIT(req)
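
Why the IRECV has to come first can be seen in a small deterministic model
of two processes. This is a sketch under a deliberately pessimistic
assumption: a SEND completes only once the peer has posted a matching
receive (no buffering), which MPI standard-mode sends are allowed but not
required to do. The runs_to_completion function and the one-word operation
names are hypothetical, and at most one IRECV per process is modelled.

```python
def runs_to_completion(programs):
    """Step two toy processes; return False if they deadlock."""
    pc = [0, 0]                # program counter of each process
    posted = [False, False]    # process has posted an IRECV
    received = [False, False]  # that IRECV has been matched by a SEND
    progress = True
    while progress:
        progress = False
        for me in (0, 1):
            peer = 1 - me
            if pc[me] == len(programs[me]):
                continue                      # this process has finished
            op = programs[me][pc[me]]
            peer_done = pc[peer] == len(programs[peer])
            peer_op = None if peer_done else programs[peer][pc[peer]]
            if op == "IRECV":
                posted[me] = True             # non-blocking: returns at once
                done = True
            elif op == "SEND":
                if posted[peer] and not received[peer]:
                    received[peer] = True     # matches the peer's IRECV
                    done = True
                elif peer_op == "RECV":
                    pc[peer] += 1             # rendezvous with a blocking RECV
                    done = True
                else:
                    done = False              # nobody is receiving yet
            elif op == "WAIT":
                done = received[me]
            else:                             # blocking RECV: the peer's SEND
                done = False                  # is what advances it
            if done:
                pc[me] += 1
                progress = True
    return all(pc[p] == len(programs[p]) for p in (0, 1))

print(runs_to_completion([["SEND", "RECV"]] * 2))           # False
print(runs_to_completion([["IRECV", "SEND", "WAIT"]] * 2))  # True
```

With SEND followed by RECV on both sides, each SEND waits for a receive
that is never posted; posting the receive first removes the cycle.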

Likewise, even if OFFER just waited on remote operations, if PUT, IPUT, GET,
and IGET existed, the exchange above could be implemented as

process A             process B
IPUT to B(req1)       IPUT to A(req1)
IPUT to B(req2)       IPUT to A(req2)
OFFER                 OFFER
WAITALL(req1,req2)    WAITALL(req1,req2)

This is cleaner, and much more similar to the message-passing case.

In summary: If P/G/O does contain IPUT-WAIT and IGET-WAIT pairs, shouldn't
the OFFER operation only wait for the appropriate number of remote
operations, and require a separate wait for each local operation? (If
the answer is "Yes", this will also remove the requirement that a process
issue an "OFFER(0)" after GETs or PUTs to wait for their completion,
even if there are no incoming GETs or PUTs.)

Summary
=======
I believe that the P/G/O proposal is currently viable and offers the
easiest-to-use and most efficient one-sided interface, but there is always
room for improvement.

I am currently leaning toward making PUT and GET blocking, adding IPUT and
IGET with the usual request arguments (which require a WAIT or TEST for
completion), and removing the condition that OFFER will only complete when
the local operations are finished. I may also want to add something akin
to Marc's IACCUMULATE.

-Dave
===============================================================================
David C. DiNucci | MRJ, Inc., Rsrch Scntst |USMail: NASA Ames Rsrch Ctr
dinucci@nas.nasa.gov| NAS (Num. Aerospace Sim.)| M/S T27A-2
(415)604-4430 | Parallel Tools Group | Moffett Field, CA 94035