alternate proposal

Eric Salo (salo@mrjones.engr.sgi.com)
Sun, 3 Mar 1996 01:00:56 -0800

Here is a somewhat expanded (and cleaned up) version of my proposal for Chapter
4. Apologies for doing it in ASCII; I'll convert it to TeX or PostScript
eventually.

This is what I intend to present in Chicago, so fire away. If I don't hear at
least five different reasons why it couldn't possibly work, I'll know that
nobody read it...

------------------------------------------------------------------------------
FUNCTIONS
------------------------------------------------------------------------------

MPI_PUT(buf, count, datatype, dest, offset, comm)

IN   buf       initial address of local put buffer (choice)
IN   count     number of elements in put buffer (integer)
IN   datatype  datatype of each put buffer element (handle)
IN   dest      rank of destination (integer)
IN   offset    initial byte offset within window of remote
               put buffer (integer)
IN   comm      communicator (handle)

MPI_GET(buf, count, datatype, source, offset, comm)

IN   buf       initial address of local get buffer (choice)
IN   count     number of elements in get buffer (integer)
IN   datatype  datatype of each get buffer element (handle)
IN   source    rank of source (integer)
IN   offset    initial byte offset within window of remote
               get buffer (integer)
IN   comm      communicator (handle)

These are the basic calls for moving data. Aside from the offset argument
(which replaces the tag), each argument to MPI_PUT is identical to the
corresponding argument to MPI_SEND. The same relationship holds for
MPI_GET and MPI_RECV, except there is no status argument to MPI_GET.

MPI_PUT takes local data from buf and copies it into the get/put window
that is associated with comm on the dest process. Like MPI_SEND, it returns
when it is safe for the application to modify the contents of buf.

MPI_GET takes remote data from the get/put window that is associated with
comm on the source process and copies it into buf. Like MPI_RECV, it returns
when buf contains the new data.
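
To make the data movement concrete, here is a minimal single-process sketch in
C of what a put and a get do to a window. Everything in it is hypothetical:
sketch_window, sketch_put and sketch_get are illustration names only, the
count/datatype pair is collapsed into a byte count, and the bounds check is an
assumption rather than part of the proposal.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the get/put window attached to a communicator. */
typedef struct {
    unsigned char *base;  /* start of the window on the remote process */
    size_t         len;   /* window length in bytes */
} sketch_window;

/* Copy nbytes from the local buffer into the window at a byte offset.
 * Returns 0 on success, -1 if the transfer would fall outside the window
 * (the bounds check is an assumption, not something the proposal requires). */
int sketch_put(const void *buf, size_t nbytes, sketch_window *dest, size_t offset)
{
    if (offset > dest->len || nbytes > dest->len - offset)
        return -1;
    memcpy(dest->base + offset, buf, nbytes);
    return 0;
}

/* Copy nbytes from the window at a byte offset into the local buffer. */
int sketch_get(void *buf, size_t nbytes, const sketch_window *source, size_t offset)
{
    if (offset > source->len || nbytes > source->len - offset)
        return -1;
    memcpy(buf, source->base + offset, nbytes);
    return 0;
}
```

Because the copy has completed by the time each function returns, the sketch
matches the blocking behavior described above: buf may be reused after a put,
and buf holds the new data after a get.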

MPI_SHMALLOC(buf, len, atomicity, comm, newcomm)

OUT  buf        location of pointer which will hold the
                local address of the window (choice)
IN   len        length of window in bytes (integer)
OUT  atomicity  smallest allowable unit/alignment of data
                for gets and puts, in bytes (integer)
IN   comm       communicator (handle)
OUT  newcomm    new communicator (handle)

This collective call creates a new communicator that is identical in
membership to the original. In addition, the new communicator has an
associated get/put window that can be used in conjunction with the pointer
returned in buf to perform remote memory operations. A maximum of one such
window may be associated with any communicator.

Before any gets or puts may be made, a get/put window (and associated buffer)
must be obtained by calling MPI_SHMALLOC. This function is essentially just
a malloc() with some extra semantics. Attempting a get/put operation on a
communicator that does not have an associated get/put window is an error.

The atomicity argument attempts to address limitations that are commonly
found in shared memory systems. For example, some MPPs or SMPs might only
be able to guarantee atomicity for operations performed on longwords, while
others might only be equipped to handle entire cachelines. This argument
provides the smallest size of data, in bytes, that can be atomically operated
upon by the communicator.
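
One possible reading of the atomicity value is that both the byte offset and
the total size of a transfer must be multiples of it. The check below is only
a sketch of that reading; respects_atomicity is a hypothetical helper, not a
proposed MPI call.

```c
#include <stddef.h>

/* Returns 1 if a transfer of nbytes starting at byte offset is made up of
 * whole, aligned atomicity units, and 0 otherwise. This assumes that unit
 * size and alignment are the same value, as in the proposal. */
int respects_atomicity(size_t offset, size_t nbytes, size_t atomicity)
{
    if (atomicity == 0)
        return 0;  /* a zero unit is meaningless; treat it as a failure */
    return offset % atomicity == 0 && nbytes % atomicity == 0;
}
```

With an atomicity of 8, a 400-byte transfer at offset 800 would pass, while a
4-byte transfer at offset 0 would not.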

------------------------------------------------------------------------------
IMPLEMENTATION AND USE
------------------------------------------------------------------------------

a) SMPs

Consider a host which supports some form of shared memory, with multiple MPI
ranks running on that host. One possible implementation of this model might
be as follows:

Let's say that MPI_SHMALLOC is called by 4 processes, each of which asks for
a 16KB buffer. Internally, MPI would allocate 64KB of shared memory and map
it into the virtual address space of each process at some address A. (In
this example, A is the same on each process, but it need not be.) The process
with rank 0 would get the value of A returned in its buf argument, rank 1
would get (A + 16KB), rank 2 would get (A + 32KB), and rank 3 would get
(A + 48KB).
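
The address arithmetic in this example can be sketched as follows, assuming
every rank asks for the same 16KB and that A is the same in each process.
rank_window is a hypothetical helper name.

```c
#include <stddef.h>

#define WINDOW_BYTES (16 * 1024)  /* per-rank length from the example */

/* Address that a given rank would see returned in its buf argument, given
 * the base address A of the shared segment. Assumes equal-sized windows. */
unsigned char *rank_window(unsigned char *A, int rank)
{
    return A + (size_t)rank * WINDOW_BYTES;
}
```

If ranks were allowed to pass different len values, the multiplication would
have to become a running sum of the lengths of the lower-numbered ranks.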

Each process then goes about using its own local 16KB of data however it sees
fit, until someone wants to perform a get/put. Let's say that rank 2 wants to
put the first 100 ints in its buffer into ints 200-299 in rank 3's buffer.
The application would then call:

MPI_PUT(buf, 100, MPI_INT, 3, 200*sizeof(int), comm)

The data is copied by MPI into the shared page that corresponds to rank 3.
While the other ranks also receive a copy of this data, none of them can
access it directly.
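
Here is a single-process sketch of what that call might boil down to on such
a host. The 64KB segment is modeled as one static array, and window_of,
put_bytes and rank2_puts_to_rank3 are hypothetical names; put_bytes takes a
byte count in place of the count/datatype pair.

```c
#include <stddef.h>
#include <string.h>

#define NRANKS       4
#define WINDOW_BYTES (16 * 1024)
#define WINDOW_INTS  (WINDOW_BYTES / sizeof(int))

/* One contiguous array standing in for the 64KB of shared memory. */
static int segment[NRANKS * WINDOW_INTS];

/* Start of the 16KB window that belongs to a given rank. */
int *window_of(int rank)
{
    return segment + (size_t)rank * WINDOW_INTS;
}

/* Hypothetical stand-in for MPI_PUT: copy nbytes into dest's window,
 * starting at a byte offset from the beginning of that window. */
void put_bytes(const void *buf, size_t nbytes, int dest, size_t offset)
{
    unsigned char *target = (unsigned char *)window_of(dest);
    memcpy(target + offset, buf, nbytes);
}

/* Rank 2 puts the first 100 ints of its buffer into ints 200-299 of
 * rank 3's buffer, mirroring the MPI_PUT call in the text. */
void rank2_puts_to_rank3(void)
{
    put_bytes(window_of(2), 100 * sizeof(int), 3, 200 * sizeof(int));
}
```

Note that the offset is in bytes, which is why the call multiplies the element
index 200 by sizeof(int).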

b) NOWs

Consider a NOW, with a single MPI rank running on each host. Each MPI process
might have an associated daemon process/thread that shares its address space.
When MPI_SHMALLOC is called, the MPI library need only call malloc(). To
send data, the MPI process would provide the location and extent of the put
buffer to the daemon, which would then stuff it thru a socket to the
corresponding daemon on the receiving host, which would then copy the data
directly into the receiver buffer of the remote MPI process. Gets would
work similarly, but in reverse.
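
As a rough illustration of what one daemon might send to the other ahead of
the payload, here is a sketch of a fixed-size request header. The layout
(three 32-bit big-endian fields) and all of the names are assumptions made up
for this example; the proposal does not specify any wire format.

```c
#include <stdint.h>

/* Hypothetical header describing one put request. */
typedef struct {
    uint32_t dest;    /* rank of destination */
    uint32_t offset;  /* byte offset within the remote window */
    uint32_t nbytes;  /* length of the payload that follows, in bytes */
} put_header;

#define PUT_HEADER_WIRE_BYTES 12

/* Store a 32-bit value in big-endian order, independent of host order. */
static void store_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Load a 32-bit big-endian value. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Sending side: flatten the header into bytes for the socket. */
void put_header_encode(const put_header *h, unsigned char *out)
{
    store_be32(out,     h->dest);
    store_be32(out + 4, h->offset);
    store_be32(out + 8, h->nbytes);
}

/* Receiving side: recover the header before copying the payload. */
void put_header_decode(put_header *h, const unsigned char *in)
{
    h->dest   = load_be32(in);
    h->offset = load_be32(in + 4);
    h->nbytes = load_be32(in + 8);
}
```

The receiving daemon would decode the header, read nbytes of payload from the
socket, and copy it to (window base + offset) in the target process.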

A slightly safer implementation might be to give the daemon processes
read-only access to the address space and to go thru some additional
handshaking so that all data is transferred only via reads.

Yet another option would be to have a single daemon per host that manages
multiple local processes. In that case, a complete overlap of address space
with the MPI processes is not possible and some sort of shared memory
communication might be used instead.

c) MPPs

It is probably impossible to come up with a single implementation that will
work across all MPPs because individual MPPs tend to be very different.
However, some combination of the above will probably do the trick in most(?)
cases.

------------------------------------------------------------------------------
DISCUSSION
------------------------------------------------------------------------------

The FORTRAN bindings for these functions will require some sort of pointer
support, which is unfortunate but (I claim) necessary. Preliminary queries have
failed to come up with anyone for whom this will actually present a problem.

There are obvious non-blocking, synchronous, and/or persistent extensions to
consider for MPI_GET and MPI_PUT. Let's ignore them all for now; they will be
trivial to add later.

At the moment, I'm assuming strict ordering between subsequent gets and puts.
That is, a put followed by a get (within the same communicator) will do the
right thing, as will a put followed by a put. This seems to be the most useful
programming model and I see no obvious harm to implementors in requiring it.
Does this get in anyone's way?

It seems best to leave the interactions between get/put and send/recv as
undefined. Are there any good reasons not to do this, and if so what should
the interactions be?

For atomicity, there are really two different knobs to turn: the size of the
smallest atomicity unit and the required alignment that such a unit must have
in memory. In practice, these values are usually the same and so I've combined
them. Are there any machines for which these values are different?

What should be the proper behavior if a datatype does not have the required
atomicity? Should it really be an error? Is there a graceful way that we can
say that this is allowed but not guaranteed to succeed? Perhaps by adding an
"atomicity override" bit to the get/put calls? (Yuck.)

It might be a good idea to require processes to pass the same len argument to
MPI_SHMALLOC. It might also be good to allow NULL for the buf argument,
indicating that a process does not wish to open up a local window but still
wishes access to the windows on other processes.

What should be the behavior of MPI_COMM_DUP with respect to get/put windows?

-- 
Eric Salo         Silicon Graphics Inc.             "Do you know what the
(415)933-2998     2011 N. Shoreline Blvd, 7L-802     last Xon said, just
salo@sgi.com      Mountain View, CA   94043-1389     before he died?"