> Date: Tue, 04 Feb 1997 09:41:50 -0600
> From: William Gropp <gropp@mcs.anl.gov>
>
> | A possible criticism of the above proposal is that MPI has tried
> | to avoid mandating implementation details (such as requiring sockets).
> | We should keep in mind, though, that:
> | - no implementors have proposed using something other than sockets,
> | and indeed the current spec may be so restrictive as to require them.
>
> Actually, we had a long discussion earlier (last May/June) about
> non-socket based implementations (e.g., fork and shared memory).
> The collective nature of both the spawn and client/server calls was
> originally designed to allow efficient networks to update routes
> and tables, and to avoid the usual race conditions. Various
> proposals explaining how spawn and client/server might be done were
> discussed then. As one concrete approach, an application on a
> shared memory system could use a SYSV named segment as the named
> "port"; all of the "socket" functions (listen, connect, etc) can be
> built on top of this.

I'm very much in favor of this proposal. In fact (not to be immodest)
it looks a little like a proposal I made about 18 months ago. The
difference is that, to be a bit more generic, I proposed a pair of
callback function pointers to carry the bootstrap info between the
client and server. These would default (when passed as 0) to the
simple TCP ones proposed above (i.e. just send and recv on a simple
fd).

With the callbacks, either the implementation (by supplying multiple
sets of available callbacks) *or* the user is free to provide whatever
bootstrap mechanism they like, up to and including printing a
printable representation of the bootstrap info in a window and
reading it from stdin on the other side, so that the user has to cut
and paste it.
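
To make the "default when passed as 0" idea concrete, here is a
minimal sketch in C. Every name in it (rv_send_fn, rv_recv_fn,
rv_fd_send, rv_fd_recv, rv_pick_send, rv_pick_recv) is made up for
illustration and is not part of MPI or of the proposal text; the
callback signature with a void *ctx argument is likewise an assumption.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback types; ctx is whatever the transport needs. */
typedef int (*rv_send_fn)(const void *buf, size_t len, void *ctx);
typedef int (*rv_recv_fn)(void *buf, size_t len, void *ctx);

/* Default sender: ctx points to an int fd; write until len bytes are out. */
static int rv_fd_send(const void *buf, size_t len, void *ctx)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(*(int *)ctx, p, len);
        if (n <= 0)
            return -1;            /* sketch only: no EINTR retry */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Default receiver: ctx points to an int fd; read until len bytes arrive. */
static int rv_fd_recv(void *buf, size_t len, void *ctx)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(*(int *)ctx, p, len);
        if (n <= 0)
            return -1;            /* error or peer closed early */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* A callback passed as 0 selects the fd-based default. */
rv_send_fn rv_pick_send(rv_send_fn user) { return user ? user : rv_fd_send; }
rv_recv_fn rv_pick_recv(rv_recv_fn user) { return user ? user : rv_fd_recv; }
```

A user with some other transport would pass their own function pair
instead of 0; the library side only ever sees the two pointers.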

My original proposal used MPI_Attach on the server and MPI_Parent on
the child, but of course MPI_Join on both may work (I wanted to retain
the flexibility for the parent and child to do different things).

Here's an edited version of my original proposal (dated Aug 2, 1995):

Process Rendezvous
----------------------------------------

...Basically I suggest that the client and the server each optionally
register two functions to send and receive buffers of data that are
opaque to them. The users know how the sending is done but not the
content, and MPI knows the content but not how the sending is done.
This separation lets the user conveniently use any existing
transport layer s/he has.

As soon as the MPI implementations on each side receive these buffers,
which presumably contain rendezvous info like a hostname/port number
for TCP (it could just be the "well-known name" from the document),
they can start up MPI communications. Here is a possible scenario:

Parent Side                           Child Side
--------------------                  --------------------
User calls MPI_Attach(&psend, &precv, ...)
> MPI_Attach fills in int buf[]
> MPI_Attach calls *psend(buf)
> > *psend sends buf to child
                                      User calls MPI_Parent(&crecv, &csend)
                                      > MPI_Parent calls *crecv(buf)
                                      > > *crecv fills buf from parent, returns
> > *psend returns
                                      > MPI_Parent parses buf, builds rbuf
                                      >   to send back to parent
                                      > MPI_Parent calls *csend(rbuf)
                                      > > *csend sends rbuf to parent
> MPI_Attach calls *precv(rbuf)
> > *precv receives rbuf from child
                                      > > *csend returns
                                      > MPI_Parent returns
> > *precv returns
> MPI_Attach parses rbuf from child
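
The ordering in the scenario (parent sends first, child receives
first) can be sketched in C roughly as follows. This is only an
illustration under assumptions: rv_attach and rv_parent are
hypothetical stand-ins for MPI_Attach and MPI_Parent, the buffer is a
fixed four ints with placeholder contents, and the transport is a
socketpair between a forked parent and child rather than anything an
MPI implementation would really use.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RV_WORDS 4   /* size of the opaque bootstrap buffer (assumed) */

/* Hypothetical callback type: moves RV_WORDS ints, returns 0 on success. */
typedef int (*rv_xfer_fn)(int *buf, void *ctx);

/* Transport callbacks for this demo: blocking I/O on a connected fd. */
static int fd_send(int *buf, void *ctx)
{
    size_t len = RV_WORDS * sizeof(int);
    return write(*(int *)ctx, buf, len) == (ssize_t)len ? 0 : -1;
}

static int fd_recv(int *buf, void *ctx)
{
    char *p = (char *)buf;
    size_t left = RV_WORDS * sizeof(int);
    while (left > 0) {
        ssize_t n = read(*(int *)ctx, p, left);
        if (n <= 0)
            return -1;
        p += n;
        left -= (size_t)n;
    }
    return 0;
}

/* Parent side (stands in for MPI_Attach): send first, then receive. */
static int rv_attach(rv_xfer_fn psend, rv_xfer_fn precv, void *ctx, int *peer)
{
    int buf[RV_WORDS] = { 1, 0, 0, 0 };   /* placeholder info, 1 = parent */
    if (psend(buf, ctx) != 0)
        return -1;
    return precv(peer, ctx);              /* peer now holds the child's rbuf */
}

/* Child side (stands in for MPI_Parent): receive first, then reply. */
static int rv_parent(rv_xfer_fn crecv, rv_xfer_fn csend, void *ctx, int *peer)
{
    int rbuf[RV_WORDS] = { 2, 0, 0, 0 };  /* placeholder info, 2 = child */
    if (crecv(peer, ctx) != 0)
        return -1;
    rbuf[1] = peer[0];                    /* "parse" buf: echo the parent's tag */
    return csend(rbuf, ctx);
}

/* Run both sides over a socketpair; 0 means each side saw the other's buffer. */
int rv_demo(void)
{
    int sv[2], peer[RV_WORDS], status;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                       /* child process */
        close(sv[0]);
        int ok = rv_parent(fd_recv, fd_send, &sv[1], peer) == 0 && peer[0] == 1;
        _exit(ok ? 0 : 1);
    }
    close(sv[1]);                         /* parent process */
    int rc = rv_attach(fd_send, fd_recv, &sv[0], peer);
    close(sv[0]);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (rc != 0 || peer[0] != 2 || peer[1] != 1)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling rv_demo() from a main() exercises the whole exchange. If both
sides tried to receive first, each would block waiting for the other,
which is why the two entry points differ.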

Now the parent and child (or server and client) have exchanged
information, and they can start communicating via MPI. Note that you
really do need two functions; MPI_Process_Attach can't be used for
both, because one side has to send first and the other has to receive
first (though this could be selected by an argument, e.g. "Parent").
These callback functions essentially replace the MPI_Process args by
identifying the processes procedurally.

Any implementation would provide at least one predefined set of these
functions, which would work over the default communication medium
of the supported architectures. So by default you could pass NULL
for all of those functions and it would simply work. For instance,
a very PVM-like host table could be built, and process-starting
versions of the send/recv functions could use it, creating a
PVM_Spawn-like function.

Note one further important point: *psend could quite easily start
the child itself if it wants to. This is the simplest
implementation of MPI_Spawn. All the system-specific details,
including executable names, arguments, host naming, resource
allocation, and so on, could live in the implementation of psend,
and thus *can be specified by the user*.
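
As a rough sketch of a psend that starts the child itself, assuming a
POSIX system: the names spawn_ctx, spawn_psend and demo_child are
hypothetical, and the child here just runs a C function, where a real
psend would presumably exec a user-chosen executable with whatever
arguments and host selection the user wants.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RV_WORDS 4   /* size of the opaque bootstrap buffer (assumed) */

/* Hypothetical per-spawn state the user hands to psend through ctx. */
struct spawn_ctx {
    void (*child_main)(int fd);  /* what the child runs after the fork */
    pid_t pid;                   /* filled in by spawn_psend */
};

/* Demo child: read the bootstrap buffer and exit with its first word. */
void demo_child(int fd)
{
    int buf[RV_WORDS] = { 0 };
    size_t got = 0;
    while (got < sizeof buf) {
        ssize_t n = read(fd, (char *)buf + got, sizeof buf - got);
        if (n <= 0)
            _exit(255);          /* pipe closed before a full buffer arrived */
        got += (size_t)n;
    }
    _exit(buf[0]);
}

/* A psend that first starts the child, then sends buf to it over a pipe. */
int spawn_psend(int *buf, void *ctx)
{
    struct spawn_ctx *sc = ctx;
    int p[2];
    if (pipe(p) != 0)
        return -1;
    sc->pid = fork();
    if (sc->pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (sc->pid == 0) {          /* child: keep only the read end */
        close(p[1]);
        sc->child_main(p[0]);
        _exit(0);
    }
    close(p[0]);                 /* parent: keep only the write end */
    size_t len = RV_WORDS * sizeof(int);
    int rc = write(p[1], buf, len) == (ssize_t)len ? 0 : -1;
    close(p[1]);
    return rc;
}
```

Everything system-specific sits inside spawn_psend and its context
struct, so swapping in a different way of starting the child means
swapping one function.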