1. How should functions be named? "Shared" or "RMC"?
I am not too particular about names, but a window accessed via put/get is not
a shared memory region. It is still only in the address space of one process,
and one or more additional processes can access it through put/get commands (not
load/store). I can live with "one-side_init", or something to that effect.
2. Use communicators or a new object?
RMC_init creates an object that consists of
a. a group of processes.
b. a separate communication domain
c. a list of windows (location, size, type, associated array of requests)
(b) may not be necessary, since use of different communicators in put/get
offers no protection: I see no effect on the outcome of RMC operations
that access the same target memory due to their choice of a communicator (with
the exception of barrier and fence operations).
We thus have three choices:
communicators (overloaded with new attributes)
groups (overloaded with new attributes)
a new object.
Overloading has been our preferred way in the past (Occam's razor), and
communicator is the object we have overloaded (with topology and error
handling, among others). This would justify the current choice. I don't
think that the differences are very significant, in any case.
3. general datatype in the definition of windows.
The window type is specified by the target process. The
target_datatype argument is defined by the origin process, and
provided as argument in the put/get call.
We need to specify which buffer in the target memory is specified by
the target datatype argument, and what it means for that argument to
be compatible with the window type. There is no obvious way of doing so
for datatypes that carry absolute displacements, since there is no
obvious way of "converting" general datatypes in a heterogeneous
environment. We clearly do not want to preserve the absolute values
of the displacements, but we do not have enough information to translate
them. Therefore, we need to restrict ourselves
and use only "portable" datatypes as target datatype arguments in put/get
calls. A datatype is portable if it was defined using only
MPI_TYPE_CONTIGUOUS, MPI_TYPE_VECTOR and MPI_TYPE_INDEXED, i.e., if all fields
are of the same basic datatype, and all displacements are multiples of the
extent of that basic datatype.
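The portability condition can be written down as a small check over a type map. The following is only a sketch: the names basic_type, typemap_entry, basic_extent and is_portable are hypothetical and not part of MPI, and the type map is assumed to be already flattened into (basic type, byte displacement) pairs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for MPI basic datatypes (not MPI names). */
typedef enum { BT_CHAR, BT_INT, BT_DOUBLE } basic_type;

/* One entry of a flattened type map: basic type plus byte displacement. */
typedef struct {
    basic_type type;
    long disp;
} typemap_entry;

/* Extent of a basic type on the local machine. */
static size_t basic_extent(basic_type t)
{
    switch (t) {
    case BT_CHAR:   return sizeof(char);
    case BT_INT:    return sizeof(int);
    case BT_DOUBLE: return sizeof(double);
    }
    return 0;
}

/* True if all entries share one basic type and every displacement is a
   multiple of that type's extent. An empty map is treated as portable. */
static bool is_portable(const typemap_entry *map, size_t n)
{
    assert(map != NULL || n == 0);
    if (n == 0)
        return true;
    long extent = (long)basic_extent(map[0].type);
    for (size_t i = 0; i < n; i++) {
        if (map[i].type != map[0].type)
            return false;           /* mixed basic types */
        if (map[i].disp % extent != 0)
            return false;           /* displacement not element-aligned */
    }
    return true;
}
```

A strided vector of ints passes this check. A struct-like map that mixes int and double fails it, as does an int placed at byte displacement 1.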
If the window type is also a "portable" datatype, then we have gained
no generality as compared to defining the window to be of the base
datatype from which this general datatype is derived.
Suppose now that the window type is "non-portable", e.g., defined by
MPI_TYPE_STRUCT. Then, each put/get operation will still only be able to
access a "homogeneous" part of the target window -- an area where all
elements accessed have the same basic type and all displacements
between these elements are multiples of the extent of that basic type.
Furthermore, we shall now need a byte displacement to the start of
the target buffer, rather than an element count. Thus, we have gained
little additional power, at the expense of more complexity in
definition and implementation.
4. Can an RMC initialization call make some part of memory globally
addressable?
Certainly, MPI does not preclude an environment where memory is
shared. But it is somewhat unpleasant if, as a side effect of an MPI
call that enables put/get communication, we also lower the
shields between processes for load/store operations: this is an
unintended effect. We may accept this effect as a price for efficient
put/get implementation on shared memory machines but, at the least,
users should be warned. The issue merits some discussion.
5. Atomicity.
My preferred interpretation is, too, that atomicity is
guaranteed only with respect to other operations of the same type, not
with respect to direct accesses by the local process.
If the user wants to make sure it updates a semaphore atomically, then
it will use an MPI_RMW call, even if the semaphore is in local
memory. MPI calls can, in general, be "reflexive", and that applies to
RMC calls: they can be applied to local memory.
6. RMW calls should be able to use non-associative functions, if we
wish to have a compare&swap function: this type of function is not
associative.
7. Nonblocking RMW and nonblocking accumulate.
I see little use for nonblocking RMW. The question is whether it is
worthwhile to break the symmetry in order to save one or two
functions.
8. Counter request objects vs new objects.
If we provide a general mechanism to overload request objects, with handler
functions and caching, then counter request objects will not require additional
baggage.
9. RMW calls have implicit fence semantics.
The problem is fencing what? We have currently defined a fence to be
specific to a communicator, rather than global over all pending RMC
operations. But, if windows are of homogeneous type, it is unlikely
that a window contains both data and semaphores. Therefore, it is
unlikely that we use the same communicator (or same shared something
object) for both the data window and the synchronization semaphores.
On the other hand, a global fence is not very consistent with a
multithreaded environment.
10. Ordering semantics and false sharing.
We had a similar discussion in the context of send/receive and decided
there that one should get the "correct" behavior, even if two
concurrent receives update disjoint parts of the same word, on a word
oriented machine. I believe the same should apply to puts.
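The hazard that this requirement rules out can be shown with a toy model of a word-oriented machine, where a byte store is carried out as load word, modify, store word. This is only an illustration: the function names are made up, and the interleaving is chosen by hand rather than produced by real concurrency.

```c
#include <assert.h>
#include <stdint.h>

/* Replace byte idx (0..3) of word with value and return the new word. */
static uint32_t with_byte(uint32_t word, unsigned idx, uint8_t value)
{
    assert(idx < 4);
    unsigned shift = 8u * idx;
    return (word & ~((uint32_t)0xFFu << shift)) | ((uint32_t)value << shift);
}

/* Two byte-sized updates to disjoint bytes of one word, each done as
   load word / modify / store word, interleaved so that both loads
   happen before either store. */
static uint32_t interleaved_updates(uint32_t mem, uint8_t v0, uint8_t v1)
{
    uint32_t copy_a = mem;            /* update A loads the word */
    uint32_t copy_b = mem;            /* update B loads the same word */
    mem = with_byte(copy_a, 0, v0);   /* A stores its modified word */
    mem = with_byte(copy_b, 1, v1);   /* B stores, wiping out A's byte */
    return mem;
}

/* The same two updates applied one after the other. */
static uint32_t serialized_updates(uint32_t mem, uint8_t v0, uint8_t v1)
{
    mem = with_byte(mem, 0, v0);
    mem = with_byte(mem, 1, v1);
    return mem;
}
```

Starting from a zero word, the serialized version keeps both bytes, while the interleaved version loses the update to byte 0. An implementation that gives the "correct" behavior must make concurrent puts to disjoint parts of a word behave like the serialized case.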
As for ordering, the order rules for message passing specify which
sends are matched to which receives. We
specified that a program is erroneous if a process accesses an
active communication buffer (even if it is a read access to a send
buffer). We did not say so explicitly, but I
interpret this to mean that a program is erroneous if two
communications concurrently access the same communication buffer.
Now, what should be the extrapolation to put/get? A blocking
(resp. nonblocking) put is equivalent to a blocking (resp. nonblocking)
send and a matching nonblocking receive, issued at the same time.
A blocking (resp. nonblocking) get is equivalent to a blocking
(resp. nonblocking) receive and a matching nonblocking send, issued at
the same time. Thus, a put/get completes at the origin when the
operation returns, if blocking, or the request completes, if
nonblocking. The put/get completes at the target when the target
request is updated or a fence completes. Thus:
(*) Order is not an issue, since matching is obvious here.
(*) The origin communication buffer of a put/get should not be
accessed by the origin process or another put/get until the put/get
completes at the origin.
(*) The target communication buffer of a put/get should not be
accessed by the target process or another put/get until the put/get
completes at the target.
Suppose, for example, that a process executes a blocking put,
immediately followed by a blocking get from the same target buffer.
The operations occur in order at the origin, but are concurrent at the
target (this is equivalent to a nonblocking receive at the target,
immediately followed by a nonblocking send). Thus, the code is
erroneous, and it is not guaranteed that the get will return the
values stored by the put.
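The gap between completion at the origin and completion at the target can be made concrete with a toy model of a single target location. All names here (toy_target, toy_put, toy_get, toy_fence) are hypothetical and are not MPI calls. The model shows one behavior the rules above would permit, not what any particular implementation does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one target location; NOT an MPI API. */
typedef struct {
    int  memory;         /* value currently visible in target memory */
    int  in_flight;      /* value carried by a put that has not landed */
    bool has_in_flight;  /* true while a put is complete at the origin only */
} toy_target;

/* Blocking put: returns once the origin buffer has been copied out
   (complete at the origin). Target memory is not updated yet. */
static void toy_put(toy_target *t, const int *origin_buf)
{
    assert(!t->has_in_flight);   /* the model holds at most one pending put */
    t->in_flight = *origin_buf;
    t->has_in_flight = true;
}

/* Blocking get: reads whatever is in target memory right now. */
static int toy_get(const toy_target *t)
{
    return t->memory;
}

/* Fence: any pending put lands (complete at the target). */
static void toy_fence(toy_target *t)
{
    if (t->has_in_flight) {
        t->memory = t->in_flight;
        t->has_in_flight = false;
    }
}
```

In this model, a put of 42 followed at once by a get returns the old target value. Only a get issued after the fence is certain to see 42. The origin buffer, on the other hand, may be reused as soon as toy_put returns.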
11. What should be the arguments passed to a request_handler?
At the least, the extra state and the communication status. The
communicator argument is not strictly needed, since a handle to it can
be passed via the extra state argument. Still, if we expect that the
communicator will often be needed, it may be worthwhile having it as
an extra argument.
12. Is a request accessible after a handler has been posted?
The answer is clearly yes, for persistent requests.
Note that a call to MPI_TEST can be used to find whether the request
is still active (MPI_TEST returns flag = true on an inactive request,
even if invoked multiple times). Thus, to prevent races, we could
use the sequence
    MPI_HANDLER_LOCK(comm);
    MPI_TEST(request, flag);
    if (flag==false) MPI_CANCEL(request);   /* cancel only if still active */
    MPI_HANDLER_UNLOCK(comm);
For nonpersistent ones we have several choices.
(a) request is not accessible (in particular, request cannot be
cancelled, once a handler is posted).
(b) request is accessible, and is freed when the handler is invoked.
Then, the calling process (thread) must have a
means of detecting that the request completed and was freed. This
could be done by having the handler set the request handle it was
passed as argument to NULL when it is invoked. The disadvantage of
such a choice is that the request handle is updated asynchronously
(this is true, so far, only of communication buffers). Note that locks
can be used to prevent the update from occurring at critical times.
(c) request is accessible, and is not automatically
freed; an explicit MPI_REQUEST_FREE call is needed.
This can be made easier by passing the request as an argument
to the handler, so that the handler code can free the request, if desired.
(c) sounds better than the current choice.
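Choice (c) might look roughly as follows. This is a sketch only: every name in it (toy_status, toy_request, toy_handler, toy_post_handler, toy_complete, toy_request_free) is hypothetical, and the handler signature with extra state, status and request is just one of the possibilities discussed in item 11.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* All names below are hypothetical; none belong to MPI. */
typedef struct { int source; int tag; int error; } toy_status;

typedef struct toy_request toy_request;
typedef void (*toy_handler)(void *extra_state, const toy_status *status,
                            toy_request **request);

struct toy_request {
    toy_handler handler;
    void *extra_state;
    bool complete;
};

/* Choice (c): nothing is freed automatically; someone must call this. */
static void toy_request_free(toy_request **request)
{
    free(*request);
    *request = NULL;
}

/* Create a request with a handler attached. */
static toy_request *toy_post_handler(toy_handler h, void *extra_state)
{
    toy_request *r = malloc(sizeof *r);
    assert(r != NULL);
    r->handler = h;
    r->extra_state = extra_state;
    r->complete = false;
    return r;
}

/* Called by the imaginary library when the communication completes. */
static void toy_complete(toy_request **request, const toy_status *status)
{
    (*request)->complete = true;
    (*request)->handler((*request)->extra_state, status, request);
}

/* Handler that records the tag and leaves the request alive. */
static void keep_request(void *extra_state, const toy_status *status,
                         toy_request **request)
{
    (void)request;
    *(int *)extra_state = status->tag;
}

/* Handler that records the tag and frees the request itself. */
static void free_in_handler(void *extra_state, const toy_status *status,
                            toy_request **request)
{
    *(int *)extra_state = status->tag;
    toy_request_free(request);
}
```

With keep_request, the caller can still inspect the request after completion and must free it explicitly. With free_in_handler, the handle is set to NULL from inside the handler, so the asynchronous-update concern noted under (b) applies to that variant as well.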
13. Progress rule.
I would follow, again, the analogy between put/get and send/receive.
Once a put/get call is made, then it is as if a matching pair of send
and receive are posted, so that the communication must complete,
irrespective of any other event in the system (and even if no MPI call
is made at the target process). As with send/receive, implementations
that check for enabled communications only when MPI calls are made
will not be conforming. The difference is that for send/receive one
has to construct fairly contrived examples to detect the problem;
e.g., something artificial like
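    if (myrank == 0) {
        a = 0;                   /* a should be volatile so the loop rereads it */
        MPI_Irecv(&a, 1, MPI_INT, 1, 0, comm, &req);
        MPI_Send(&b, 0, MPI_INT, 1, 1, comm);    /* zero-length go-ahead to rank 1 */
        while (a == 0);          /* spin without making any MPI call */
        MPI_Wait(&req, &status);
    }
    else if (myrank == 1) {
        a = 1;
        MPI_Recv(&b, 0, MPI_INT, 0, 1, comm, &status);
        MPI_Send(&a, 1, MPI_INT, 0, 0, comm);
    }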
which will deadlock in an implementation that executes
communication code only when MPI calls are made, but will
complete on more robust implementations.
On the other hand, there are fairly natural examples where such an
implementation will deadlock for put/get. E.g., a system where a
process is used as a memory server, and does not communicate at all.
-------------------
Marc Snir
IBM T.J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598
email: snir@watson.ibm.com
phone: 914-945-3204
fax: 914-945-4425