Plea for simplicity

Eric Salo (salo@mrjones.engr.sgi.com)
Tue, 31 Oct 1995 23:06:51 -0800

(Stupid mailer...apologies to those that received an empty message earlier.)

At the wrap-up of the latest MPI-2 meeting, I voiced my concerns regarding
one-sided communication in general and the current state of Chapter 4 in
particular. I was asked to make a concrete proposal, and that is what I am
doing now. Actually, I have two proposals, the first of which is simple:

Delete all of Chapter 4 except section 4.11 (MPI_HRECV).

I'm just going to ignore MPI_HRECV for now. I would argue that it belongs more
properly in a point-to-point chapter since it's really just a new type of
receive, but that's a separate debate. Anyway, it's still there; I'm just not
gonna talk about it here.

Caveat: This is all just IMHO, of course...

What's Wrong With Chapter 4?
----------------------------
Primarily, it is too general. It allows any process to perform gets/puts into
arbitrary remote addresses on other processes, but most of today's machines
don't work that way. In general, UNIX certainly doesn't allow a process to go
around mapping in the address spaces of other processes! Instead, the common
practice is to dynamically allocate shared buffers. System V shared memory
does this, for example.
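
To make the System V comparison concrete, here is a minimal sketch of
allocating and releasing a shared buffer with shmget()/shmat(). The helper
names shared_buffer_create and shared_buffer_destroy are made up for this
illustration, and the 0600 permissions are an assumption; a real program
would hand the segment id to its peers so they can attach the same segment.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Allocate a System V shared segment of len bytes and attach it.
   Returns the attached address, or NULL on failure. On success,
   *id_out receives the segment id so it can be removed later. */
void *shared_buffer_create(size_t len, int *id_out)
{
    assert(id_out != NULL);

    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id == -1)
        return NULL;

    void *addr = shmat(id, NULL, 0);
    if (addr == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);   /* do not leak the segment */
        return NULL;
    }

    *id_out = id;
    return addr;
}

/* Detach the segment and mark it for removal. Returns 0 on success. */
int shared_buffer_destroy(void *addr, int id)
{
    if (shmdt(addr) == -1)
        return -1;
    return shmctl(id, IPC_RMID, NULL);
}
```

Note that the buffer is allocated by the system rather than being an
arbitrary address chosen by the application, which is the point of the
comparison.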

Second, it is cumbersome. There's an awful lot of mechanism that one must
go through to get things set up between communicating processes, and each
process must keep track of a significant amount of non-local information.
This is confusing for application writers, for implementors, and for those of
us who are trying to define the MPI-2 functionality.

Third, it is bloated. Way too many bells and whistles to worry about when we
haven't even convinced ourselves yet that the base functionality is there, is
portable, and can be implemented efficiently.

I think that the problem might simply be that we are trying to treat gets/puts
as a message passing abstraction when we should really be treating them as a
shared memory abstraction.

FORTRAN Pointers
----------------
As I see it, the largest obstacle to portability is FORTRAN. (Surprise!) How
do we go about telling MPI which memory regions are to be used for remote
get/puts? In C, we have pointers so this isn't much of a problem; one can
easily imagine some sort of MPI_SHMALLOC function to dynamically allocate and
initialize shared regions. But FORTRAN is a problem because the standard
doesn't support pointers. Or is it? The conventional wisdom says that FORTRAN
pointers are non-portable, but I've yet to find a single major vendor that does
not support FORTRAN 77 pointer extensions. The list of those that do includes
Convex, Cray, DEC, HP, IBM, Intel, SGI, and Sun. This is a substantial list!
I therefore argue that we can probably get away with requiring pointer support
in our FORTRAN 77 binding.

Okay, so what are the alternatives? As I see it, we have only a few options:

1) Require support for pointers in FORTRAN 77
2) Only provide C bindings for gets/puts
3) Drop the idea of standardizing gets/puts entirely
4) (Fill in your own idea here)

New Get/Put Proposal
--------------------
Here is my proposal for a new and greatly simplified get/put interface that
assumes pointer support for all languages. It is very rough, but hopefully it
hangs together. The basic model is that every process dynamically allocates a
buffer which can be acted upon by get/put calls from remote processes. Each
such buffer is associated with its own communicator. On uniprocessors and SMPs,
these windows can be implemented very efficiently as shared pages of memory.
On NOWs, the MPI library might use an asynchronous thread to listen on a
dedicated socket and copy data to/from the local window as remote requests are
received. NORMA MPPs would fall somewhere in between.

There are still some serious portability questions to be answered, but
hopefully the simplicity of this interface will give us a better chance of
finding good solutions.

-------------------------------------------------------------------------

MPI_SHMALLOC(buf, len, oldcomm, newcomm)

  OUT   buf       location of pointer which will hold the
                  local address of the window (choice)
  IN    len       length of window in bytes (integer)
  IN    oldcomm   communicator (handle)
  OUT   newcomm   get/put communicator created (handle)

This collective call allocates a "window" of memory to be used for remote
operations and associates it with a new get/put communicator. The new
communicator has identical membership to the old one. The window is
deallocated by a call to MPI_COMM_FREE().

MPI_GET(buf, count, type, offset, rank, comm)

  IN    buf       address of local destination buffer (choice)
  IN    count     number of elements sent (integer)
  IN    type      datatype of elements sent (handle)
  IN    offset    byte offset within remote window of data to
                  get (integer)
  IN    rank      rank of remote process (integer)
  IN    comm      get/put communicator (handle)

This call copies data from the get/put window of a remote process into a
local buffer. It returns when buf contains the requested data.

MPI_PUT(buf, count, type, offset, rank, comm)

  IN    buf       address of local source buffer (choice)
  IN    count     number of elements sent (integer)
  IN    type      datatype of elements sent (handle)
  IN    offset    byte offset within remote window of data to
                  put (integer)
  IN    rank      rank of remote process (integer)
  IN    comm      get/put communicator (handle)

This call copies data from a local buffer into the get/put window of a remote
process. It returns when the data in buf has been safely sent on its way.

Also
----
When called with a get/put communicator, MPI_BARRIER() will block until all
pending remote operations on that communicator have completed.

Discussion
----------
There are obvious non-blocking and/or synchronous extensions to consider for
MPI_GET and MPI_PUT. Let's ignore them all for the moment; they will be
trivial to add later.

The atomicity of remote gets/puts still needs to be addressed and could get
very nasty very quickly. For example, assume that some MPP hardware can only
send entire words remotely. What happens if two different processes try to
write to bytes located within the same remote word? This problem becomes
particularly obnoxious with non-contiguous data types. Perhaps implementations
will need to associate some sort of atomic chunk size to get/put communicators?
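
The word-granularity hazard can be shown with a small single-process model.
This is only a sketch under stated assumptions: the "hardware" moves whole
32-bit words, so a one-byte put becomes read word, patch byte, write word.
The names pending_put, put_byte_begin, put_byte_finish, and word_byte are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* State of a one-byte put that has read the remote word but not yet
   written it back. */
typedef struct {
    uint32_t *word;      /* remote word being updated */
    uint32_t snapshot;   /* copy of the word taken at read time */
} pending_put;

/* Step 1: read the whole remote word. */
pending_put put_byte_begin(uint32_t *remote_word)
{
    pending_put p = { remote_word, *remote_word };
    return p;
}

/* Step 2: patch one byte in the snapshot, then write the whole word back. */
void put_byte_finish(pending_put p, size_t byte_index, uint8_t value)
{
    assert(byte_index < sizeof(uint32_t));

    unsigned char bytes[sizeof(uint32_t)];
    memcpy(bytes, &p.snapshot, sizeof bytes);
    bytes[byte_index] = value;
    memcpy(p.word, bytes, sizeof bytes);
}

/* Read back one byte of a word (in memory order). */
uint8_t word_byte(const uint32_t *word, size_t byte_index)
{
    assert(byte_index < sizeof(uint32_t));

    unsigned char bytes[sizeof(uint32_t)];
    memcpy(bytes, word, sizeof bytes);
    return bytes[byte_index];
}
```

If two writers both call put_byte_begin() on the same word before either
calls put_byte_finish(), the second write-back restores the stale copy of
the first writer's byte, and that update is silently lost.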

At the moment, I'm assuming strict ordering between successive gets and puts.
That is, a put followed by a get of the same location (within the same
communicator) will return the data that was just put.

I'm inclined to think that the interactions between get/put and send/recv
should be left undefined.

When dealing with the window buffers, all addresses are calculated as bytes.
This is as it should be! It is direct and easy to understand.
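
As a rough illustration of byte addressing, here is a single-process
stand-in for a window, with put and get reduced to bounds-checked copies at
a byte offset. The window type and the window_put/window_get helpers are
hypothetical; they are not part of the proposal and involve no
communication at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical local model of a get/put window. */
typedef struct {
    unsigned char *base;   /* start of the window */
    size_t len;            /* window length in bytes */
} window;

/* Copy nbytes from src into the window at the given byte offset.
   Returns 0 on success, -1 if the range falls outside the window. */
int window_put(window *w, size_t offset, const void *src, size_t nbytes)
{
    assert(w != NULL && src != NULL);

    if (offset > w->len || nbytes > w->len - offset)
        return -1;
    memcpy(w->base + offset, src, nbytes);
    return 0;
}

/* Copy nbytes out of the window at the given byte offset into dst.
   Returns 0 on success, -1 if the range falls outside the window. */
int window_get(const window *w, size_t offset, void *dst, size_t nbytes)
{
    assert(w != NULL && dst != NULL);

    if (offset > w->len || nbytes > w->len - offset)
        return -1;
    memcpy(dst, w->base + offset, nbytes);
    return 0;
}
```

With this model, the fifth float in a window lives at byte offset
4 * sizeof(float); the caller does the multiplication and the window itself
knows nothing about datatypes.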

I see no reason at all to have to pass around remote data types. I claim
that this get/put model will support heterogeneous systems at least as well
as any other model.

Do we really want to create a new communicator when we allocate the shared
buffers? How many such buffers do we think each process is likely to allocate?
If the number is small, perhaps we could just "attach" the buffers to
communicators that already exist?

Cache invalidation on the T3D may be something of a problem. I remember
reading once that shmem_put() could be set to automatically invalidate the
remote cache. Can anyone confirm that? Is it even possible to write a good,
portable interface that will support goofy machines like this? :-)

It might be a good idea to require processes to pass the same len argument to
MPI_SHMALLOC(). It might also be good to allow NULL for the buf argument,
indicating that a process does not wish to open up a local window but still
wishes access to the windows on other processes.

Example
-------
Here, a master process puts an array of 10 floats into the window on the
slave process.

#include <stdio.h>
#include <mpi.h>

/* MPI_Shmalloc and this form of MPI_Put are the calls proposed above. */

int main(int argc, char **argv)
{
    MPI_Comm put_comm;
    int my_rank, master_rank, slave_rank, i;
    char *buf;

    /* Initialize */

    MPI_Init(&argc, &argv);

    master_rank = 0;
    slave_rank = 1;

    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Create a window of 10000 bytes on each process */

    MPI_Shmalloc(&buf, 10000, MPI_COMM_WORLD, &put_comm);

    if (my_rank == master_rank) {

        /* Declare the source buffer */

        float f[10];

        /* Initialize the source buffer with some data */

        for (i = 0; i < 10; i++) {
            f[i] = i * 1.1f;
        }

        /* Put the source buffer into the remote window of the slave */

        MPI_Put(f, 10, MPI_FLOAT, 0, slave_rank, put_comm);

        /* Synchronize with the slave */

        MPI_Barrier(put_comm);

    } else { /* slave */
        float *f;

        /* Point f at the appropriate place in my window */

        f = (float *)buf;

        /* Synchronize with the master (wait for the data to arrive) */

        MPI_Barrier(put_comm);

        /* Print the data */

        for (i = 0; i < 10; i++) {
            printf("%f\n", f[i]);
        }
    }

    /* Freeing the get/put communicator deallocates the window */

    MPI_Comm_free(&put_comm);

    MPI_Finalize();
    return 0;
}

Thanks to all of you who read this far. Please post your thoughts on this to
the mailing list so we can come to some sort of group consensus on this whole
issue ASAP. I'm particularly interested in hearing about specific machines
for which the lack of FORTRAN 77 pointer support will be an issue.

- Eric

-- 
Eric Salo         Silicon Graphics Inc.             "Do you know what the
(415)390-2998     2011 N. Shoreline Blvd, 7L-802     last Xon said, just
salo@sgi.com      Mountain View, CA   94043-1389     before he died?"