put/get implementations

Marc Snir (snir@watson.ibm.com)
Thu, 02 Nov 1995 10:34:13 -0500

:-) :-) :-) *** (-: (-: (-:

I would like to try to clear some confusion in the current discussion. I see
two generic implementations for put/get.

A. In a distributed memory environment (NOW, MPP,...), a put/get request has
to be serviced by an agent at the target node that has access to the target
process memory. This agent cannot be the originating process. It can be
callback code executed by the target process (e.g., when put/get is
implemented on top of hrecv); code executed by a server thread of the target
process, which is servicing put/get requests; code executed by a separate
communication (kernel) process that has cross mapped (some of the) address
space of the target process; microcode executed by a communication engine,
which is, essentially, a hardwired version of a communication kernel process;
etc.

The points of interest here are that
1. A dedicated put/get agent can be implemented to be faster than a generic
hrecv agent: this seems to be the case both for Meiko and for Paragon. This
is, basically, because a put/get agent has restricted functionality -- it will
not make system calls, for example.
2. This agent can access any part of the target process memory as easily as
any other. This is clearly true if the agent runs in the same address space
as the target process; this is true if the agent is the equivalent of a kernel
communication process that can cross map the target process memory. I see no
reason that access to "heap" or to a shareable memory segment be any easier
than access to "stack" or "text". (The latter would be true only if the agent
were a totally separate user process -- but this is an unlikely implementation.)

Thus, in such an environment, restricting put/get operations to a special,
dynamically allocated memory segment restricts the functionality of
put/get without improving performance or simplifying implementation.
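
To make the agent of case A concrete, here is a minimal sketch in C. It is not
taken from any proposal: the names rma_request, rma_window and agent_service
are hypothetical, and the "message" is just a struct passed in the same
address space rather than something received over a network. It only shows
that an agent running with access to the target address space services a
request by a bounds check and a copy, whatever kind of memory the window
covers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical request format, for illustration only. */
enum rma_op { RMA_PUT, RMA_GET };

struct rma_request {
    enum rma_op op;
    size_t offset;  /* byte offset into the target window */
    size_t len;     /* number of bytes to transfer */
    void *buf;      /* PUT: data to store; GET: where to return data */
};

/* Hypothetical window descriptor: any region of the target's memory. */
struct rma_window {
    unsigned char *base;
    size_t size;
};

/* Code run by the agent at the target node for one request.
 * Returns 0 on success, -1 if the request falls outside the window. */
int agent_service(const struct rma_window *win, const struct rma_request *req)
{
    if (req->offset > win->size || req->len > win->size - req->offset)
        return -1;

    if (req->op == RMA_PUT)
        memcpy(win->base + req->offset, req->buf, req->len);
    else
        memcpy(req->buf, win->base + req->offset, req->len);
    return 0;
}
```

In this sketch the window may just as well describe a local (stack) array as
a heap block; agent_service does not distinguish between them, which is the
point made in item 2 above.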

B. In a shared memory environment, a put/get request could be entirely
serviced by code executed by the calling process, if that process has access
to the target process memory. Such an implementation will result in higher
performance than an implementation that uses a remote agent, on systems that
have no h/w support for block transfers, and can only use processor
memory-to-memory copy for put/get. Since it is likely that memory sharing
between processes is restricted to dynamically allocated memory regions (e.g.,
Unix System V shared memory segments), it is advantageous in such an
environment to restrict put/get calls to "special memory".

Given this situation, I see two possible directions.

1. The proposal of Eric: put/get can only use memory in the target process
that has been dynamically allocated by a special call. This would be a
variant of the MPI_RMC_MALLOC call in the current proposal.

2. My preference, which is an evolution of the current proposal (hopefully
simplified): put/get can use any memory in the target process. A call is
provided to allocate dynamically "good" memory on shared memory machines. If
a put/get window is restricted to such memory, performance will be higher on
shared memory machines; but put/get windows can also include other parts of
the target process memory, with a possible performance penalty on shared
memory systems. Thus, two initialization calls are provided: MPI_RMC_INIT,
to open a put/get window on preallocated memory, and MPI_RMC_MALLOC, to
allocate "good" memory for a put/get window.

The advantage of 2 is that it does not penalize distributed memory systems,
and that it does not impose a somewhat unnatural restriction on the use of
put/get. The difference between different parts of the memory is performance
(on SMP's), not function. This is not to service library or compiler writers
-- this is to service application developers that may want to access via
put/get any array they declare in their code.

The disadvantage of 2, for shared memory implementers, is that they will
either have to use a portable, lower performance put/get implementation that
uses a remote agent, or they will have to have two execution paths for
put/get: put/get executed by caller, if the target window is accessible, and
put/get executed by a target agent, otherwise. I am not sure the pain is so
bad: a portable implementation will have to have the "remote agent" solution,
for NOWs and MPPs. It behooves the SMP implementers to also have the "direct
access" solution. Since each window will be either of one type or another,
the choice of the put/get method will depend on the communicator argument in
the put/get call.
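
The two execution paths can be sketched as follows, in C. This is an
illustration under assumptions, not the proposed interface: put_window,
window_put and window_agent_poll are hypothetical names, the window struct
stands in for whatever the communicator argument identifies, and a one-slot
"inbox" stands in for a message sent to the target agent.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: each window is of exactly one type, fixed when it is set up. */
enum win_kind { WIN_DIRECT, WIN_AGENT };

/* One-slot stand-in for a put request in flight to the target agent. */
struct pending_put {
    size_t offset;
    size_t len;
    unsigned char data[64];
    int valid;
};

struct put_window {
    enum win_kind kind;
    unsigned char *base;   /* target memory covered by the window */
    size_t size;
    struct pending_put inbox;
};

/* Caller side: pick the path from the window type.
 * Returns 0 on success, -1 on a bad range or a full inbox. */
int window_put(struct put_window *w, size_t offset, const void *src, size_t len)
{
    if (offset > w->size || len > w->size - offset)
        return -1;

    if (w->kind == WIN_DIRECT) {
        /* Target memory is accessible to the caller: copy it ourselves. */
        memcpy(w->base + offset, src, len);
        return 0;
    }

    /* Otherwise hand the request to the target agent. */
    if (w->inbox.valid || len > sizeof w->inbox.data)
        return -1;
    w->inbox.offset = offset;
    w->inbox.len = len;
    memcpy(w->inbox.data, src, len);
    w->inbox.valid = 1;
    return 0;
}

/* Target side: the agent applies a pending request, if there is one.
 * Returns the number of requests serviced (0 or 1). */
int window_agent_poll(struct put_window *w)
{
    if (!w->inbox.valid)
        return 0;
    memcpy(w->base + w->inbox.offset, w->inbox.data, w->inbox.len);
    w->inbox.valid = 0;
    return 1;
}
```

The branch is taken once per call on a field that never changes for a given
window, so the cost of supporting both paths is one test in the put/get entry
point.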

One more thought: on SMP's one should be able not only to speed up put/gets
when the target memory is accessible by the caller, but also to speed up
send/receives, by avoiding additional buffering. The receiver can directly
copy a message sent from the sender memory to its own memory. We may want
MPI_RMC_MALLOC to be a more general function that is used to allocate "good"
memory for communication, for put/get as well as for send/receive.

-------------------

Marc Snir
IBM T.J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598
email: snir@watson.ibm.com
phone: 914-945-3204
fax: 914-945-4425
