# New version of Dynamic Process Chapter, latex

Rusty Lusk (lusk@mcs.anl.gov)
Tue, 21 Nov 1995 12:55:24 -0600

Here is the new version of the Dynamic Process Chapter. It has the changes
agreed upon at the last meeting, and a wholly new writeup of the client/server
section. Comments welcome. Postscript to follow.

Rusty, Bill S., and Al

% Nov 22, 1995 - RL new client/server section
% Nov 13, 1995 - RL small changes and discussions
% Nov 8, 1995 - AG
% Oct 30, 1995 - wcs changes from 10/23 mpi meeting
% Oct 16, 1995 - RL changes from wcs (small)
% Oct 9, 1995 - wcs (changes from last MPI meeting)
% Aug 31, 1995 - wcs (2nd resource proposal merged in)
% Aug 29, 1995 - AG
% July 23, 1995 - wcs
% July 12, 1995 - RL
% July 3, 1995 - wcs
% Version as of May 30, 1995 - wcs
% Version as of May 29, 1995 - RL
% Version as of April 27, 1995

\setcounter{chapter}{2}
\chapter{Dynamic Processes}
\label{sec:dynamic-2}
\label{chap:dynamic-2}

\section{Introduction}
\label{sec:dynamic:introduction}

The MPI-1 message passing library allows processes in a parallel
program to communicate with one another. MPI-1 specifies neither how
the processes are created, nor how they establish communication.
Moreover, an MPI-1 application is static, that is, no
processes can be added to or deleted from an application while it is
running.

MPI users have asked that the MPI-1 model be relaxed to allow dynamic
process management.
The main impetus comes from the PVM~\cite{pvmbook} research effort,
which has provided a wealth of experience with dynamic process
management and resource control that illustrates their benefits
and potential pitfalls.

It was decided that resource control is outside the scope of MPI-2,
because a portable interface to all existing and potential future
resource controllers was deemed infeasible.
Resource control can encompass a wide range of abilities
including adding and deleting hosts from an environment, reserving
and scheduling resources, managing the compute partitions of an MPP,
and returning information about available resources.

MPI-2 assumes that resource control is provided by the vendor in the
case of an MPP, or by some other software package when the
environment is a cluster of hosts.

MPI-2 does provide an integer variable, \mpifunc{MPI\_UNIVERSE\_SIZE},
that recommends how many processes
may be spawned in the environment altogether.
The addition of this single environment-query capability
allows many existing programs to be ported to MPI straightforwardly.

The reasons for adding dynamic process management to MPI
are both technical and practical.

\begin{itemize}

\item Since a cluster of workstations offers no generic
method to start up a MIMD application, the inclusion
of dynamic process management in MPI provides a portable
method for applications to run in this environment.

\item Workstation network users migrating from PVM to MPI
are accustomed to using PVM's capabilities for process and resource
management. The lack of these features is a practical
stumbling block to migration.

\item Important classes of message passing applications, such as
client-server systems and task farming jobs, require dynamic
process control.

\item The ability to write fault tolerant applications is
important in unstable environments and for commercial applications.
MPI-1 does not provide mechanisms for building fault-tolerant
applications. The mechanisms required to support fault tolerance
largely overlap with those needed to support dynamic process
management.

\end{itemize}

While dynamic process management is essential, it is important
not to compromise the portability or performance of MPI.
In particular:
\begin{itemize}

\item The MPI-2 dynamic process management model must apply
to the vast majority of current parallel environments. These
include everything from tightly integrated MPPs such as the
Intel Paragon and the Meiko CS-2 to heterogeneous networks of
workstations.

\item MPI must not take over operating system responsibilities.
It should instead provide a clean interface between an application
and system software.

\item MPI must continue to guarantee communication determinism, i.e.,
dynamic process management must not introduce unavoidable race
conditions.

\item MPI must not contain features that compromise performance.

\item MPI-1 programs must work under MPI-2, i.e., the MPI-1
static process model must be a special case of the MPI-2 dynamic
model.

\end{itemize}

The MPI dynamic process management model addresses these issues in two
ways. First, MPI remains primarily a communication library. It
does not manage the parallel environment in which
a parallel program executes, though it provides a minimal
interface between an application and external resource and
process managers.

Second, MPI-2 does not change the concept of communicator. Once a
communicator is built, it behaves as specified in MPI-1. A
communicator is never changed once created, and it is always
created using deterministic collective semantics.

\section{The MPI-2 Dynamic Process Model}
\label{sec:model}

The MPI-2 dynamic process model allows for the creation and
destruction of processes after an MPI application has started.
It provides a mechanism to establish communication between
the newly created processes and the existing MPI application.
It also provides a mechanism to establish communication between
two existing MPI applications, even when one did not ``start'' the
other.

\subsection{Starting and Managing Processes}

MPI applications may start new processes (including non-MPI
processes), send them signals, and find out when they
die or become unreachable. They do this through
an interface to an external process manager, which can range
from a parallel operating system (CMOST) to layered
software (POE) to an {\tt rsh} command (p4).

There are two ways to start new processes.

\mpifunc{MPI\_SPAWN}
starts MPI processes and establishes communication with them,
returning an intercommunicator.

\mpifunc{MPI\_SPAWN\_INDEPENDENT}
starts processes (which may or may not be MPI processes)
but does not establish communication with them. It returns
a group.

Corresponding to these routines are two more routines that
start several different binaries (or the same binary with
different arguments) at the same time:
\mpifunc{MPI\_SPAWN\_MULTIPLE} and
\mpifunc{MPI\_SPAWN\_MULTIPLE\_INDEPENDENT}.

MPI uses the existing group abstraction to represent processes.
A process is identified by a (group, rank) pair.

\subsection{The Runtime Environment}

The
\mpifunc{MPI\_SPAWN} and
\mpifunc{MPI\_SPAWN\_INDEPENDENT}
routines provide an interface between MPI and
the {\em runtime environment} of an MPI application.

The difficulty is that there is an enormous range of runtime
environments and application requirements, and MPI must not be
tailored to any particular one. Examples of such environments are:

\begin{itemize}
\item {\bf MPP managed by a batch queueing system}. Batch queueing
systems generally allocate resources before an application begins,
enforce limits on resource use (CPU time, memory use, etc), and do
not allow a change in resource allocation after a job begins.
Moreover, many MPPs have special limitations or extensions, such as a
limit on the number of processes that may run on one processor, or
the ability to gang-schedule processes of a parallel application.

\item {\bf Network of workstations with PVM}. PVM (Parallel Virtual
Machine) allows a user to create a ``virtual machine'' out of
a network of workstations. An application may extend the virtual
machine or manage processes (create, kill, redirect output, etc.)
through the PVM library. Requests to manage the machine or processes
may be intercepted and handled by an external resource manager.

\item {\bf Network of workstations managed by a load balancing system}.
A load balancing system may choose the location of spawned processes
based on dynamic quantities,
such as load average. It may transparently migrate processes
from one machine to another when a resource becomes unavailable.

\item {\bf Large SMP with Unix}. Applications are run directly
by the user. They are scheduled
at a low level by the operating system. Processes may have
special scheduling characteristics (gang-scheduling, processor
affinity, deadline scheduling, processor locking, etc.) and
be subject to OS resource limits (number of processes, amount
of memory, etc.).

\end{itemize}

MPI assumes, implicitly, the existence of an environment in which an
application runs. It does not provide ``operating system'' services,
such as a general ability to query what processes are running, to kill
arbitrary processes, to find out properties of the runtime environment
(how many processors, how much memory, etc). Complex interaction of an
MPI application with its runtime environment should
be done through an environment-specific API.

\discuss{
An example of such an API would be the PVM task and machine management
functions, possibly modified to return an MPI (group, rank) when possible.
A Condor or PBS API would be another possibility.
}

At some low level, obviously, MPI must be able to interact with
the runtime system, but the interaction is not visible at the
application level and the details are not specified by the MPI
standard.

\discuss{
It has been requested that the MPI forum specify this interface
explicitly. MPI has so far shied away from specifying implementation
details, for good reasons. It would be nice, however, if implementors
agreed on a minimal standard interface to allow interoperability.

}

MPI does not require the existence of an underlying ``virtual machine''
model, in which there is a consistent global view of an MPI
application, and an implicit ``operating system'' managing resources
and processes. For instance, processes spawned by one process may not
be visible in another process; tasks spawned by
different processes may not be automatically distributed over available
resources. MPI does require, however, that a process be aware
of its own changes to the runtime environment. For instance, it can
be notified when its children die; tasks which it spawns will be
distributed in some sane manner; etc.
\begin{implementors}
A virtual machine abstraction is appropriate in most environments
and should be part of a ``high-quality'' implementation. In other
cases, providing the virtual machine abstraction may add too much
baggage (e.g., daemon processes) or inhibit portability.
\end{implementors}

Interaction between MPI and the
runtime environment is limited to the following areas:

\begin{itemize}
\item A process may start new processes with \mpifunc{MPI\_SPAWN},
\mpifunc{MPI\_SPAWN\_INDEPENDENT} and their multiple-executable
variants.
It may kill processes it has spawned (and possibly others)
and may be notified when processes die or become unreachable.

\item When a process spawns a child process, it optionally passes a string to
the runtime environment to indicate where the process should be
started. The string is opaque to MPI.

\item A function \mpifunc{MPI\_UNIVERSE\_SIZE}
tells a program how ``large'' the runtime environment
is, namely how many processes can
usefully be started altogether. One can subtract the size of
\mpifunc{MPI\_COMM\_WORLD} from this value to find out how many
additional processes can usefully be spawned.

\end{itemize}

\discuss{
We don't have \mpifunc{MPI\_ALLOCATE} any more, since
the runtime environment has dissolved into the background (i.e.,
we have no handle to it). Resources can be allocated through
the runtime-environment API, or through {\tt mpirun}. }

\subsection{Applications Requiring Direct Communication with the Runtime
System}
\label{sec:flexapplications}

The existing MPI specification is adequate for most parallel
applications. In these applications, the runtime environment,
whether simple or elaborate, allocates resources and manages
user processes without interacting with the application program. In
other applications, however, it is necessary that the {\em user
level\/} of the application communicate with the runtime environment.
Here we describe three broad classes of such
applications. In Section~\ref{sec:examples} we will give concrete
examples of each of these classes.

\paragraph{Task farming}
\label{sec:farming}
By a ``task farm'' application we mean a program that manages the execution of
a set of other, possibly sequential, programs. This situation often arises
when one wants to run the same sequential program many times with varying
input data. We call each invocation of the sequential program a {\em task}.
It is often simplest to ``parallelize'' the existing sequential program by
writing a parallel ``harness'' program that in turn devotes a separate,
transient process to each task. When one task finishes, a new process is
started to execute the next one. Even if the resources allocated to the job
are fixed, the ``harness'' process must interact frequently with the process
manager (even if this is just {\tt rsh}, to start the new processes with the
new input data). In many cases this harness can be written in a simple
scripting language like {\tt csh} or {\tt perl}, but some users prefer to use
Fortran or C. Note that it is an explicit goal of the MPI dynamic
process architecture to allow the management of non-MPI processes.

\paragraph{Dynamic number of processes in parallel job}
\label{sec:dynamic}
The application may wish to decide, {\em inside\/} the program, to adjust the
number of processes to fit the size of the problem. Furthermore, it may continue to
add and subtract processes during the computation to fit separate phases of
the computation, some of which may be more parallel than others. In order to
do this, the application program will have to interact with the resource manager
(however it is implemented) to request and acquire or return computational
resources. It will also have to interact with the process manager to request
that processes be started and in order to make the new processes known to the
message-passing library so that the larger (or smaller) group of processes can
communicate.

An important type of dynamic application is a scavenger application. A
scavenger application is ``embarrassingly parallel'' in the sense that
it performs a large number of completely independent tasks. If the
number of tasks is large enough, such an application can make use of
any resources that become available. Conversely, it can
easily give up resources to another application. Scavenger
applications are excellent for ``filling in the holes'' on
a space-shared parallel machine, allowing it to achieve very
high utilization.

\paragraph{Client/Server}
\label{sec:server}
This situation is the opposite of the situations above, where processes come
and go upon request. In the client/server model, one set of processes is
relatively permanent
(the server, which we assume here may be a parallel program).
At unpredictable times, another (possibly parallel) program (the client)
begins execution and must establish communication with the server. In this
case the process manager must provide a way for the client to locate the
server and communicate to the message-passing library that it must now support
communications with a new collection of processes.

It is currently possible to write the parallel clients and servers in MPI, but
because MPI does not provide the necessary interfaces between the application
program and the resource manager or process manager, other nonportable,
machine-specific libraries must be called in order for the client and server to
communicate with one another. On the other hand, MPI does contain several
features that make it relatively easy to add such interfaces, and we propose
both a simple interface and a more complex but flexible one.

\section{Process Manager Interface}

\subsection{Processes in MPI}

A process is represented in MPI by a (group, rank) pair.
A process may or may not be an ``MPI process'' in that it may or may not call
\mpifunc{MPI\_INIT} and thus be assigned an \mpifunc{MPI\_COMM\_WORLD}.
MPI can't communicate directly with a process unless it has a communicator,
but a user can send signals to any process.
Note that a (group, rank) identification
is not unique because a process may belong to several groups.

\discuss{
The alternative is to have an \mpifunc{MPI\_Process} object. The Forum's
current preference is for the (group, rank) alternative.
}

A bad programming practice that has been used by naive users of systems that
provide both messages and signals is to send a process a message and then
signal it to wake up and receive the message.
This practice is erroneous in MPI because MPI provides no guarantees
on the order of operations between messages and signals.
If a message and a signal are sent to a process,
the signal may arrive or be processed before the message,
depending on hardware and MPI implementation details.

\subsection{Starting Processes - Simple Interface}

The following routine is the simplest way to start MPI processes.

\begin{funcdef}{MPI\_SPAWN(command-line, n, where, flags, root, comm, intercomm)}
\funcarg{\IN}{command-line}{executable and arguments, in a single string}
\funcarg{\IN}{n}{number of processes to start}
\funcarg{\IN}{where}{a string telling the runtime system where to
start the processes (significant only on root)}
\funcarg{\IN}{flags}{set of (boolean) flags describing spawn parameters}
\funcarg{\IN}{root}{rank of process in which previous arguments are valid}
\funcarg{\IN}{comm}{communicator of group of spawning processes}
\funcarg{\OUT}{intercomm}{intercommunicator between original group and
the newly spawned group}
\end{funcdef}

\mpifunc{MPI\_SPAWN} is collective over \mpiarg{comm} and the newly
created processes. It starts \mpiarg{n} identical copies
of the program specified by \mpiarg{command-line},
in a way which is determined by the
runtime environment.

\begin{missing}
We need to describe how the command line is parsed and how
leading or trailing spaces can be included in arguments
using proper quoting. We presumably want some subset of {\tt sh}
or {\tt csh} parsing.

\end{missing}

The \mpiarg{where} argument is
opaque to MPI and is passed directly to the runtime
environment, which uses it to determine where to
spawn processes.
If the \mpiarg{where} argument is omitted (NULL in C or an empty string in Fortran),
the runtime environment decides where to spawn processes.
MPI does not specify the format of the \mpiarg{where} argument.

The \mpiarg{flags} argument allows some additional flexibility
in the spawn. A zero value specifies default behavior. A nonzero
value must be the logical OR of one or more of the following values:
\begin{itemize}
\item \mpifunc{MPI\_SPAWN\_SOFT} specifies that \mpiarg{n}, the
number of requested processes, is a soft request. If MPI is
unable to spawn all of the requested processes, it will spawn
only a subset, and will return a communicator containing only
the processes it was able to spawn successfully. Note that
an implementation is not required to implement such partial
spawning, and may always return zero processes when there is
a failure. The practical difference is that
with the default behavior, failing to spawn a requested process
is an error (fatal by default), whereas with a soft request,
there is no error but an empty intercommunicator is returned.
(A usage sketch follows this list.)

%\discuss{Seems that we missed a few things at the meeting.
%Are empty groups, empty communicators (or empty intercommunicators,
%where the remote group is empty) allowed? Also, see discussion of
%error reporting below. -wcs}

%\item {\bf others?}
%
%\discuss{It would be possible to add things like
%\mpiarg{MPI\_SPAWN\_HOST} and
%\mpiarg{MPI\_SPAWN\_ARCH} that
%describe what is in the \mpiarg{where} argument. PVM
%does this. Does MPI want this? -wcs
%
%It would certainly make it easier on folks migrating from PVM,
%but I don't know how it would affect other runtimes.
%PVM users have found these necessary because often
%people name hosts after an architecture type f.e. SUN4.
%It is not possible to distiguish which the user wants with just where'. -Al
%
%I believe that this would compromise the opacity of 'where', so I
%would prefer to leave it out of the standard. I presume that the
%format of 'where' could distinguish the various cases. I.e., the
%contents of 'where could be either arch=sun4'' to ask for some
%machine with the architecture sun4 or host=sun4'' or to ask for the
%specific machine named sun4. - RL
%
%}

\end{itemize}
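
The following sketch illustrates a soft request, assuming the C binding
\code{MPI\_Spawn} with the argument order given above; the executable
name and process count are illustrative.

\begin{verbatim}
MPI_Comm workers;
int      nworkers;

/* ask for up to 16 workers, but accept fewer */
MPI_Spawn("worker", 16, NULL, MPI_SPAWN_SOFT, 0, MPI_COMM_SELF, &workers);

/* the remote group contains only the processes actually started */
MPI_Comm_remote_size(workers, &nworkers);
if (nworkers == 0) {
    /* soft request: no error was raised, but nothing was spawned */
}
\end{verbatim}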

All arguments before the \mpiarg{root} argument are specified
on the process whose rank in \mpiarg{comm} is equal to \mpiarg{root}.
The value of these arguments on other processes is ignored.

MPI does not specify how to find the executable
or what directory the executable runs in. These are also
properties of the runtime environment. For instance,
MPI in a PVM environment will use PVM rules to find executables.
An MPI implementation running under IBM's POE will use
POE's method of finding executables.

\discuss{How do we want to handle errors? Spawns may fail for various
reasons (resources not available, permission denied, etc.). Is there any
way to indicate this to the user? Note that different
processes may have failed for different reasons. What about
the case of soft requests? Returning fewer than
the number requested will always indicate an error unless
the user made a mistake. Any way to tell him/her why
there was an error? This is potentially very useful information
(for instance, on one host, the user forgot to set up .rhosts
correctly, or there are only local filesystems and the user
forgot to make a local copy on one machine). \par
}

The spawned processes
are {\em required} to call \mpifunc{MPI\_INIT}, which is
collective with \mpifunc{MPI\_SPAWN} in the parent.
The intercommunicator can be obtained in the children through the
routine \mpifunc{MPI\_COMM\_PARENT}.
The children have their own \mpifunc{MPI\_COMM\_WORLD} which is
separate from that of the parent.

\begin{funcdef}{MPI\_COMM\_PARENT(intercomm)}
\funcarg{\OUT}{intercomm}{parent intercommunicator}
\end{funcdef}

\mpifunc{MPI\_COMM\_PARENT} can be called in any process.
If there is a parent, (i.e., the process was started with
\mpifunc{MPI\_SPAWN}) it returns an intercommunicator. The
local group of the intercommunicator consists of the processes created
with the same call to \mpifunc{MPI\_SPAWN}. The remote group consists
of the processes that cooperated on the call to \mpifunc{MPI\_SPAWN}.
If there is no parent, \mpifunc{MPI\_COMM\_PARENT} returns
\mpifunc{MPI\_COMM\_NULL}.
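
A minimal sketch of a spawned child obtaining its parent
intercommunicator, assuming the C binding \code{MPI\_Comm\_parent}:

\begin{verbatim}
MPI_Comm parent;

MPI_Init(&argc, &argv);
MPI_Comm_parent(&parent);
if (parent == MPI_COMM_NULL) {
    /* this process was not started by MPI_SPAWN */
} else {
    /* the remote group of "parent" contains the spawning processes */
}
\end{verbatim}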

\discuss{
\mpifunc{MPI\_COMM\_PARENT} has
to be a function, not a global variable, so we can set
it to NULL for \mpifunc{MPI\_SPAWN\_INDEPENDENT} below.
(Thanks to Greg Burns for pointing this out)
}

%\discuss{
%The pre-allocated resources'' to be used to run spawned processes could
%be specified in a portable way if we standardize {\tt mpirun}. A set of
%arguments or a file argument could describe a virtual machine. -wcs
%
%But this would require we standardize every possible flag to {\tt mpirun}.
%At the last meeting, we felt we could agree on one and maybe two
%of the flags, but unlikely all of them.
%And then there would be the need to agree on the file format that
%describes the virtual machine. Could be a tough sell. -Al
%
%I think this will lead us back into the resource quagmire, but I will
%be happy to propose a file format to be rejected. It should obviously
%look a lot like the PVM format, with all keyword=value pairs.
%- RL
%
%}

\subsection{Starting Multiple Binaries or Independent Processes}

While \mpifunc{MPI\_SPAWN} is sufficient for most cases, it does
not allow the following:
\begin{itemize}
\item Spawning of non-MPI processes or MPI processes which do not
wish to (or need to) communicate with their parent.
\item Spawning of multiple binaries which all have the same
\mpifunc{MPI\_COMM\_WORLD}.
\item Spawning of the same binary with different arguments.
\end{itemize}

The following routine spawns multiple binaries or the same binary
with multiple arguments, establishing communication with them.
All spawned programs have the same \mpifunc{MPI\_COMM\_WORLD}.

\begin{funcdef}{MPI\_SPAWN\_MULTIPLE(n-command-lines,
array-of-command-lines, array-of-n, array-of-where, array-of-flags,
root, comm, intercomm)}
\funcarg{\IN}{n-command-lines}{Number of command lines (size of each of the following arrays)}
\funcarg{\IN}{array-of-command-lines}{significant only at root}
\funcarg{\IN}{array-of-n}{}
\funcarg{\IN}{array-of-where}{}
\funcarg{\IN}{array-of-flags}{}
\funcarg{\IN}{root}{rank of process in which arguments are valid}
\funcarg{\IN}{comm}{communicator of group of spawning processes}
\funcarg{\OUT}{intercomm}{intercommunicator between original group and
spawned group}
\end{funcdef}

\mpifunc{MPI\_SPAWN\_MULTIPLE} is identical to \mpifunc{MPI\_SPAWN} except
that there are multiple executable specifications. The first argument,
\mpiarg{n-command-lines}, gives the number of specifications. Each of the
next four arguments is an array of the corresponding argument
in \mpifunc{MPI\_SPAWN}. All of the spawned processes have the same
\mpifunc{MPI\_COMM\_WORLD}.

It is not allowed to spawn some MPI processes and some non-MPI processes
in the same call. Either all of the spawned processes must call
\mpifunc{MPI\_INIT} or none must call it.

\begin{users}
If you need to spawn multiple executables, it is recommended to use
\mpifunc{MPI\_SPAWN\_MULTIPLE} rather than calling \mpifunc{MPI\_SPAWN}
several times. There are two reasons for this. First, spawning several
things at once may be faster than spawning them sequentially. Second,
in some implementations,
communication between processes spawned at the same time may be
faster than communication between processes spawned separately.
\end{users}
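
As an illustration, a two-part application might be started as follows.
This is a sketch assuming the C binding \code{MPI\_Spawn\_multiple};
the command lines and process counts are invented.

\begin{verbatim}
char *cmds[2]  = { "ocean -grid 256", "atmos -grid 128" };
int   np[2]    = { 8, 4 };
char *where[2] = { NULL, NULL };   /* let the runtime environment decide */
int   flags[2] = { 0, 0 };
MPI_Comm both;

/* all 12 spawned processes share one MPI_COMM_WORLD; "both" is an
   intercommunicator between the spawning group and all of them */
MPI_Spawn_multiple(2, cmds, np, where, flags, 0, MPI_COMM_SELF, &both);
\end{verbatim}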

%\discuss{
%We may want to tell the runtime system something that applies
%to the whole set of command lines. The \mpiarg{where} argument
%is our mechanism for telling the runtime system, but there is
%no global \mpiarg{where} argument. For instance, we may want fast
%communication (e.g. shared memory) between the processes
%corresponding to a single command line, but slower communication
%between different executables. This might be the usual case
%for multidisciplinary applications, for instance. Same could
%be said for scheduling (ganged within a discipline, not
%ganged between disciplines).
%Recommended solution: add a global \mpiarg{where} argument. -wcs
%
%This sounds like a job for the \mpiarg{array-of-flags} rather
%than a global where'' argument. Using the array would also make options
%like connect these tasks through the switch but not those'
%a quality of implementation and vendor decision rather than a
%part of the standard. -Al
%
%}

The following routine spawns processes that the parent does
not wish to communicate with.
It therefore returns a group rather than an intercommunicator.

\begin{funcdef}{MPI\_SPAWN\_INDEPENDENT(command-line, n, where, flags,
root, comm, group)}
\funcarg{\IN}{command-line}{executable and arguments, in a single string
(significant only on root)}
\funcarg{\IN}{n}{number of processes to start}
\funcarg{\IN}{where}{a string telling the runtime system where to
start the processes}
\funcarg{\IN}{flags}{set of (boolean) flags describing spawn parameters}
\funcarg{\IN}{root}{rank of process in which arguments are valid}
\funcarg{\IN}{comm}{communicator of group of spawning processes}
\funcarg{\OUT}{group}{group of spawned processes}
\end{funcdef}

\mpifunc{MPI\_SPAWN\_INDEPENDENT} is similar to \mpifunc{MPI\_SPAWN}
except that it does not establish communication with the children
and thus does not have to wait until the children call \mpifunc{MPI\_INIT}
before returning.
It is collective only in the parent communicator and not with the children.
It may be used to start independent MPI processes or non-MPI processes.
If the spawned processes are MPI processes,
they will all have the same \mpifunc{MPI\_COMM\_WORLD} but
\mpifunc{MPI\_COMM\_PARENT} will return \mpifunc{MPI\_COMM\_NULL}.
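
For example, a manager might launch helper processes that never
communicate with it through MPI. This is a sketch; the command line is
illustrative.

\begin{verbatim}
MPI_Group helpers;

/* start 4 copies of a (possibly non-MPI) program; no communication
   is set up, but the returned group can be used with MPI_SIGNAL
   and MPI_NOTIFY */
MPI_Spawn_independent("filter -v", 4, NULL, 0, 0, MPI_COMM_SELF, &helpers);
\end{verbatim}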

\begin{implementors}
Since \mpifunc{MPI\_SPAWN\_INDEPENDENT} does not
know whether the children are MPI processes or not, it must
provide information to the children in such a way that they
have enough information to establish communication among themselves
(with \mpifunc{MPI\_INIT}) but also in such a way that it can
be ignored by non-MPI processes. Passing the information through
the environment would be acceptable; passing information on the command
line would not be. This requirement is one of the reasons why
the MPI-2 version of \mpifunc{MPI\_INIT} no longer looks at its arguments.
\end{implementors}

Finally, MPI provides the following routine to spawn multiple
executables without establishing communication with them.
If the executables are MPI processes, they all have the same
\mpifunc{MPI\_COMM\_WORLD}.

\begin{funcdef}{MPI\_SPAWN\_MULTIPLE\_INDEPENDENT(n-command-lines,
array-of-command-lines, array-of-n, array-of-where, array-of-flags,
root, comm, group)}
\funcarg{\IN}{n-command-lines}{Number of command lines (size of each of the following arrays)}
\funcarg{\IN}{array-of-command-lines}{(significant only at root)}
\funcarg{\IN}{array-of-n}{}
\funcarg{\IN}{array-of-where}{}
\funcarg{\IN}{array-of-flags}{}
\funcarg{\IN}{root}{rank of process in which arguments are valid}
\funcarg{\IN}{comm}{communicator of group of spawning processes}
\funcarg{\OUT}{group}{group of spawned processes}
\end{funcdef}

\mpifunc{MPI\_SPAWN\_MULTIPLE\_INDEPENDENT} is similar to
\mpifunc{MPI\_SPAWN\_MULTIPLE} except that it returns a group
instead of an intercommunicator. It does not establish communication
with the children, and is collective only in the parent.

\begin{rationale}
With \mpifunc{MPI\_SPAWN\_MULTIPLE\_INDEPENDENT} it is now possible to
spawn any type of MPI-1 application.
\end{rationale}

\subsection{Environmental inquiry}

MPI provides a single mechanism for querying the runtime
environment. Complex interaction with the runtime environment
must be handled through an environment-specific API, but the
large majority of dynamic parallel applications need to know
only one fact -- how many processes they should spawn.

\subsubsection{Proposal 1}

MPI provides an integer variable, \mpifunc{MPI\_UNIVERSE\_SIZE},
that tells an application how many processes can usefully be
started altogether.
The variable is initialized in \mpifunc{MPI\_INIT} and is not
changed by MPI. \mpifunc{MPI\_UNIVERSE\_SIZE} is assumed to
have been specified when an application was started. It is
effectively a message from the user to him/herself.

If the runtime environment changes size while an application
is running, \mpifunc{MPI\_UNIVERSE\_SIZE} is not updated, and
the application must find out through direct communication
with the runtime system.

\subsubsection{Proposal 2}

MPI provides the following function:
\begin{funcdef}{MPI\_UNIVERSE\_SIZE(size)}
\funcarg{\OUT}{size}{Total number of processes that can be usefully started}.
\end{funcdef}

This function returns the total number of processes
that can be usefully started (including processes already running).
In MPI implementations that are tightly integrated with the
runtime environment, the universe size may change when
the runtime environment changes. This allows an application
to detect, in a portable way, when more resources are available
for spawning.
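
Under this proposal, a program could compute how many additional
processes to spawn as follows (a sketch assuming the C binding
\code{MPI\_Universe\_size}):

\begin{verbatim}
int usize, wsize, nspawn;

MPI_Universe_size(&usize);
MPI_Comm_size(MPI_COMM_WORLD, &wsize);
nspawn = usize - wsize;   /* processes that can still usefully be started */
\end{verbatim}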

\subsection{Nonblocking requests}

Spawning new processes may be an expensive operation.
In order to allow processes to do useful work while
processes are being spawned, MPI provides non-blocking
versions of each of the above routines.

\begin{funcdef}{MPI\_ISPAWN(..., request)}
\funcarg{\OUT}{request}{Request object that can be used to check for
completion.}
\end{funcdef}

\begin{funcdef}{MPI\_ISPAWN\_INDEPENDENT(..., request)}
\funcarg{\OUT}{request}{Request object that can be used to check for
completion.}
\end{funcdef}

\begin{funcdef}{MPI\_ISPAWN\_MULTIPLE(..., request)}
\funcarg{\OUT}{request}{Request object that can be used to check for
completion.}
\end{funcdef}

\begin{funcdef}{MPI\_ISPAWN\_MULTIPLE\_INDEPENDENT(..., request)}
\funcarg{\OUT}{request}{Request object that can be used to check for
completion.}
\end{funcdef}

The arguments of the non-blocking versions are the same
as those of the blocking versions, except that there is
an additional \mpiarg{request} argument. The request
can be used as input for \mpifunc{MPI\_WAIT},
\mpifunc{MPI\_TEST}, etc.

Note that all of these routines are collective, and obey the
non-blocking collective semantics described in chapter [xxx].
Note also that they leave the output intercommunicator in an undefined state
until the operation is complete. This may be considered no worse than the
undefined state of the buffer during non-blocking send/receive operations.
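
A sketch of overlapping useful work with process startup, assuming the
C binding \code{MPI\_Ispawn}; \code{do\_useful\_work} is a placeholder:

\begin{verbatim}
MPI_Comm    workers;
MPI_Request request;
MPI_Status  status;

MPI_Ispawn("worker", 8, NULL, 0, 0, MPI_COMM_SELF, &workers, &request);
do_useful_work();        /* "workers" is undefined until completion */
MPI_Wait(&request, &status);
/* "workers" may now be used for communication */
\end{verbatim}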

\missing{We should specify what, if anything, the fields
in the status argument to \mpifunc{MPI\_WAIT} mean. \par}

\subsection{Process Utilities}

A process represented by a (group,rank) pair cannot be communicated with
directly. MPI communication requires a communicator.
However, MPI allows an MPI process to send signals to an
arbitrary process group.

To ask the process manager to deliver signals to
processes, we use

\begin{funcdef}{MPI\_SIGNAL(group, rank, signal)}
\funcarg{\IN}{group}{group containing process to signal}
\funcarg{\IN}{rank}{rank of process to signal}
\funcarg{\IN}{signal}{signal type (int)}
\end{funcdef}

\begin{funcdef}{MPI\_SIGNAL\_GROUP(group, signal)}
\funcarg{\IN}{group}{group of processes to signal}
\funcarg{\IN}{signal}{signal type (int)}
\end{funcdef}

Where POSIX signals are supported, \mpiarg{signal} is a signal defined by
POSIX. It is the responsibility of an implementation to translate between
signals; in other words, a \code{SIGINT} that has value \code{3} on system A
must be delivered as a \code{SIGINT} on system B, even if system B uses the
value \code{5} for \code{SIGINT}. If the signal cannot be delivered because
there is no corresponding signal, the error code is
\mpifunc{MPI\_ERR\_INVALID\_SIGNAL}. In a high-quality implementation, the
full range of POSIX signals will be deliverable, but it is mandatory that at
least the KILL signal (\mpifunc{MPI\_SIGNAL\_KILL}) be supported.

The function \mpifunc{MPI\_SIGNAL} is redundant, since one can always use
MPI group manipulations to construct a singleton group. However, the extra
convenience of being able to signal a single process directly, without
building and then freeing the singleton group, seems worthwhile.

Signals do not provide any of the reliability guarantees
of regular MPI communication. There is no guarantee of
delivery of signals, of ordering of signals, or of ordering between
signals and MPI communication. A high-quality implementation
will deliver signals quickly and reliably, but applications
should never depend on ordering.
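
For example, a manager might shut down its workers as follows
(a sketch; \code{worker\_group} and the rank are illustrative):

\begin{verbatim}
/* kill the worker with rank 3 in worker_group; delivery is not
   ordered with respect to messages already sent to that process */
MPI_Signal(worker_group, 3, MPI_SIGNAL_KILL);

/* or kill the entire group */
MPI_Signal_group(worker_group, MPI_SIGNAL_KILL);
\end{verbatim}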

In order to manage processes or to achieve a minimum of fault tolerance,
an MPI application must be able to be notified when a process exits.
We can think of this as the MPI interface to the process manager's
handling of the signal {\tt SIGCHLD}. The exit code from the process
(from a \code{return n} or \code{exit(n)} in C or \code{STOP n} in Fortran)
is returned in the status argument filled in by \mpifunc{MPI\_WAIT}.

Note that this level of process management allows us to manage non-MPI
processes as well, since communicators are not involved.

\begin{funcdef}{MPI\_NOTIFY(event, group, array\_of\_requests)}
\funcarg{\IN}{event}{a flag specifying what event to be notified about}
\funcarg{\IN}{group}{a group of processes}
\funcarg{\OUT}{array\_of\_requests}{\mpifunc{MPI\_Request}s to be tested and/or waited on}
\end{funcdef}

This function provides a general method for an MPI process
to be notified of a change in state of another process.
The most obvious state-change
and the only one MPI presently defines for \mpiarg{event}
is \mpifunc{MPI\_NOTIFY\_EXIT}.
\mpifunc{MPI\_NOTIFY\_EXIT} requests
notification when a process exits or becomes unreachable.
Other values for the \mpiarg{event} argument may be defined
by an implementation; examples include process suspended,
process migrating, and process resumed.

\discuss{Should we reserve a set of values for future MPI use?}

\mpiarg{array\_of\_requests} must have at least as many
elements as the size of the group. A request completes when
the corresponding member of the group changes to the state
specified by \mpiarg{event}, e.g., when it exits or becomes unreachable.
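
A sketch of waiting for the first exit in a group; \code{NWORKERS} is
an assumed constant equal to the size of \code{worker\_group}:

\begin{verbatim}
MPI_Request obituaries[NWORKERS];
MPI_Status  status;
int         who;

MPI_Notify(MPI_NOTIFY_EXIT, worker_group, obituaries);
MPI_Waitany(NWORKERS, obituaries, &who, &status);
/* the member of worker_group with rank "who" has exited
   or become unreachable */
\end{verbatim}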

%\discuss{
%At the last meeting there was discussion about having a notify
%that returned a single request rather than an array of requests.
%Since the behavior would be identical to calling \mpifunc{MPI\_WAITANY}
%on the above function, the alternative is superfluous. -Al
%
%The point was to avoid having an array of 1000 requests that would not
%really be used. I continue to prefer to have such an array, because
%it gives all the flexibility of the MPI WAIT/TEST routines in all
%their generality. But I thought the discussion at the meeting favored a
%single request. -- RL
%
%}

\section{Attaching to Independent Processes}

So far we have covered the case of creating new processes. For
client-server applications, the situation is different, because the processes
in question already exist, and what we need is a communicator to be used by
them to communicate with one another.
Similar situations arise in other applications as well.
For example, a visualization tool may want to start up and
attach to a running simulation, or two parts of a large application
may be started separately at two different sites and then want to
communicate with each other.

This section attempts to provide the functions needed to solve
the general case of creating an intercommunicator between
two sets of MPI processes that do not share an existing communicator.

\subsection{Registration and Connection}

%This is all new, although the actual routines follow pretty closely what Al
%put in as a summary of Marc's proposal at the meeting. If you understand
%why I changed the parameter name'' to given\_name'' in MPI\_CONNECT\_NAME,
%then my explanation is successful. Please test.

Establishing contact between two groups of processes that do not share an
existing communicator is a multi-step, asymmetric process. In order to
motivate the MPI functions we might need, let us consider a hypothetical weather
prediction application program that is made up of two parts, a long-running
ocean model, which will play the role of server'', and a more transient
atmospheric model (the client''), which is given current weather data when
it is started, contacts the ocean model and interacts with it, then exits.
Our client-server MPI interface is motivated by wanting to handle the
following situations.
\begin{itemize}
\item Both client and server may be parallel programs.
\item Multiple instances of the client may want to interact with the same
instance of the server at the same time, as well as serially.
\item Since this is an application rather than a system service, there may be
multiple instances of the server running on the same host.
\end{itemize}
Establishing contact, interacting, and then breaking the connection will be
a multi-step process.
\begin{enumerate}
\item The server supplies to the system a ``name'' by which it would like to
be known, and is given back a related \mpiarg{given\_name} that can be used
to contact it. Note that while the name is likely to be coded into the
call (``Ocean''), the given name must be system-supplied so that it can be
unique. Therefore different instances of the server, which share the name
``Ocean'', can be given different \mpiarg{given\_name}s, so that they can
be contacted separately. (Also, nothing prevents the server from asking for
multiple \mpiarg{given\_name}s, if it wants.)
\item The \mpiarg{given\_name} is ``published'' in such a way that the
client can learn it. For ease of publication, we make it a string. (One
possibility would be {\tt host:port}, but we do not specify it.) One
common way for such names to be published is for a name server to exist
that a client can contact to turn a name, plus some way of identifying the
instance of the server that it wishes to contact, into a
\mpiarg{given\_name}. For the time being, we do not specify a name server
interface, and leave the publication mechanism unspecified (outside the
scope of MPI). This means that the client must identify the
server instance that it wants to connect to by using the low-level
\mpiarg{given\_name} (e.g., ``{\tt foo.bar.gov:port1234}'') instead of
some higher-level name (e.g., ``{\tt ocean:user205}'').
\item The client and server engage in a collective operation (e.g.,
\mpifunc{MPI\_IACCEPT} by the server; \mpifunc{MPI\_ICONNECT} by the client)
using \mpiarg{given\_name}.
\item If the server is to be capable of interacting with multiple clients
simultaneously, it must clearly use \mpifunc{MPI\_IACCEPT}. If it uses
\mpifunc{MPI\_ACCEPT} then it cannot listen for new connections while
processing existing ones, and becomes a serial server. MPI makes concurrent
servers easier to write since a different communicator will be used for
interaction with each client or client instance.
\item The communicator is destroyed when both client and server processes call
\mpifunc{MPI\_COMM\_FREE}.
\item The server frees \mpiarg{given\_name} when it no longer wishes to be
contacted. Note that \mpiarg{given\_name} is the system resource to be
freed, not the original \mpiarg{name}.
\end{enumerate}
This discussion motivates the following interface:

\begin{funcdef}{MPI\_REGISTER\_NAME( name, given\_name)}
\funcarg{\IN}{name}{name that the server would like to be known by}
\funcarg{\OUT}{given\_name}{character string supplied by the system}
\end{funcdef}

\begin{funcdef}{MPI\_ACCEPT(mycomm, given\_name, root, newcomm)}
\funcarg{\IN}{mycomm}{communicator over which this call is collective}
\funcarg{\IN}{given\_name}{name associated with registered name}
\funcarg{\IN}{root}{the process that registered the name}
\funcarg{\OUT}{newcomm}{new inter-communicator, which includes the
processes of the remote group}
\end{funcdef}

\begin{funcdef}{MPI\_IACCEPT(mycomm, given\_name, root, newcomm, request)}
\funcarg{\IN}{mycomm}{communicator over which this call is collective}
\funcarg{\IN}{given\_name}{name associated with registered name}
\funcarg{\IN}{root}{the process that registered the name}
\funcarg{\OUT}{newcomm}{new inter-communicator, which includes the
processes of the remote group}
\funcarg{\OUT}{request}{an \mpiarg{MPI\_Request} that can be waited on}
\end{funcdef}

\begin{funcdef}{MPI\_CONNECT(mycomm, given\_name, newcomm)}
\funcarg{\IN}{mycomm}{communicator over which this call is collective}
\funcarg{\IN}{given\_name}{name by which remote process can be contacted (string)}
\funcarg{\OUT}{newcomm}{new inter-communicator, which includes the
processes of the remote group}
\end{funcdef}

\begin{funcdef}{MPI\_ICONNECT(mycomm, given\_name, newcomm, request)}
\funcarg{\IN}{mycomm}{communicator over which this call is collective}
\funcarg{\IN}{given\_name}{name by which remote process can be contacted (string)}
\funcarg{\OUT}{newcomm}{new inter-communicator, which includes the
processes of the remote group}
\funcarg{\OUT}{request}{an \mpiarg{MPI\_Request} that can be waited on}
\end{funcdef}

\begin{funcdef}{MPI\_FREE\_NAME( given\_name)}
\funcarg{\IN}{given\_name}{associated with registered name}
\end{funcdef}
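
A complete client-server example appears in Section~\ref{sec:examples};
in outline, the server-side lifecycle is as follows (a sketch, where
\code{publish} stands for the unspecified publication mechanism and
\code{MAX\_NAME\_SIZE} is an assumed constant):

\begin{verbatim}
char given_name[MAX_NAME_SIZE];
MPI_Comm client;

MPI_Register_name("Ocean", given_name);  /* system supplies given_name */
publish(given_name);                     /* outside the scope of MPI */
MPI_Accept(MPI_COMM_WORLD, given_name, 0, &client);
/* ... interact with the client over the intercommunicator ... */
MPI_Comm_free(&client);
MPI_Free_name(given_name);               /* stop accepting connections */
\end{verbatim}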

\discuss{
This section needs discussion of error return codes.

Do we want to specify a name-server interface for the client?

}

\section{Examples}
\label{sec:examples}

\paragraph{Manager-worker example, using SPAWN.}

\begin{verbatim}
/* manager */
#include <mpi.h>
main(int argc, char *argv[])
{
    int count, world_size, flag;
    MPI_Comm everyone;          /* intercommunicator */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    if (world_size != 1)
        error("Top heavy with management");
    MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &count, &flag);

    if (count == 1)
        error("No room to start workers");

    /* args: command line, n, where, flags, root, comm, intercomm */
    MPI_Spawn("worker", count-1, NULL, 0, 0, MPI_COMM_SELF, &everyone);

    /*
     * Parallel code here.  The communicator "everyone" can be used
     * to communicate with the spawned processes, which have ranks 0,..count-2
     * in the remote group of the intercommunicator "everyone".
     */

    MPI_Finalize();
}

/* worker */

#include <mpi.h>
main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /*
     * Parallel code here.
     * The manager is represented as the process with rank 0 in (the remote
     * group of) MPI_COMM_PARENT. If the workers need to communicate among
     * themselves, they can either extract the local group of the
     * intercommunicator MPI_COMM_PARENT, or use MPI_COMM_WORLD. They
     * should be the same.
     */

    MPI_Finalize();
}

\end{verbatim}

\paragraph{Manager-worker example, using ATTACH.}

\begin{verbatim}
/* manager */
#include <mpi.h>
main(int argc, char *argv[])
{
    int count, world_size, flag;
    MPI_Comm everyone;          /* intercommunicator */
    MPI_Group worker_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    if (world_size != 1)
        error("Top heavy with management");
    MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &count, &flag);

    if (count == 1)
        error("No room to start workers");

    MPI_Group_init("worker", NULL, count-1, NULL, &worker_group);
    MPI_Start_attach(MPI_COMM_SELF, 0, worker_group, &everyone);

    /*
     * Parallel code here.  The communicator "everyone" can be used
     * to communicate with the spawned processes, which have ranks 0,..count-2
     * in the remote group of the intercommunicator "everyone".
     */

    MPI_Finalize();
}

/* worker */

#include <mpi.h>
main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /*
     * Parallel code here.
     * Manager is represented as the process with rank 0 in (the remote group
     * of) MPI_COMM_PARENT. The workers can communicate with one another
     * using MPI_COMM_WORLD.
     */

    MPI_Finalize();
}

\end{verbatim}

This example shows how to manage a set of non-MPI processes.

\begin{verbatim}
#include <mpi.h>
#define MAXPROC 1000
main(int argc, char *argv[])
{
    MPI_Group workers[MAXPROC];
    MPI_Group worker_group;
    MPI_Request obituaries[MAXPROC];
    MPI_Status statuses[MAXPROC];   /* exit codes are returned here */
    char **args;
    int count, flag, nslots, i, who;

    ThingToDo *thingstodo;
    int nthingstodo, nthingsdone;

    MPI_Init(&argc, &argv);
    MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &count, &flag);
    nslots = count - 1;
    GetThingsToDo(&thingstodo, &nthingstodo);

    /* need to initialize these for Waitany/all() below to work */
    for (i = 0; i < MAXPROC; i++) obituaries[i] = MPI_REQUEST_NULL;

    /* start up initial set of tasks */
    for (nthingsdone = 0; nthingsdone < nslots && nthingsdone < nthingstodo;
         nthingsdone++)
    {
        SetArgs(thingstodo[nthingsdone], &args);
        MPI_Group_init("cow", args, 1, NULL, &workers[nthingsdone]);
    }
    merge_groups(workers, &worker_group);   /* merge singleton groups
                                               into one big group */
    MPI_Group_start_independent(worker_group);
    MPI_Notify(MPI_NOTIFY_EXIT, worker_group, obituaries);

    /* spawn new ones as old ones finish */
    while (nthingsdone < nthingstodo)
    {
        MPI_Waitany(nslots, obituaries, &who, statuses);
        SetArgs(thingstodo[nthingsdone], &args);
        MPI_Group_init("cow", args, 1, NULL, &workers[nthingsdone]);
        MPI_Group_start_independent(workers[nthingsdone]);
        MPI_Notify(MPI_NOTIFY_EXIT, workers[nthingsdone], &obituaries[who]);
        nthingsdone++;
    }
    MPI_Waitall(nslots, obituaries, statuses);

    MPI_Finalize();
}

\end{verbatim}

\commentOut{
\paragraph{Another task farm example, in old syntax.}
It always keeps a request for ten resources outstanding, but starts jobs as soon
as possible. To avoid spin-waits on the allocation and running of jobs, it
uses \mpifunc{MPI\_WAITSOME} on an array of requests that includes both
allocation requests and started jobs. The index \code{alloc\_top} gives the
number of allocation requests currently active; \code{r\_top} gives the total
number of active requests (both allocations and started processes).

The programs in this example are {\em not} MPI jobs; MPI is simply being used
to start and manage the programs.
For simplicity, we have not included any code to decide when the program is
done or to describe the program to be run and its arguments. Note that
\mpifunc{MPI\_CANCEL} can be used to cancel any unneeded allocation requests.

\begin{verbatim}
#include "mpi.h"
main( int argc, char **argv )
{
MPI_Request r[20];
MPI_Status s[20];
int idx[20], nout;
int alloc_top, r_top;
int rc;

MPI_Iallocate( 10, (char *)0, (char **)0, "*", (char **)0, MPI_HARD, r );
alloc_top = 10;
r_top = 10;
while (!done) {
MPI_Waitsome( r_top, r, &nout, idx, s );
for (i=0; i<nout; i++) {
if (idx[i] < alloc_top) {
/* Processor is ready. Start program */
j = idx[i];
MPI_Set_exec( r[j], program_name );
MPI_Set_args( r[j], program_args );
MPI_Start( r[j] );
r[r_top] = r[j];
r[j] = r[alloc_top];
r[alloc_top] = MPI_REQUEST_NULL;
alloc_top--;
}
else {
/* Program has finished */
j = idx[i];
MPI_Get_return_code( &s[i], &rc );
/* Make use of return code ... */
/* Note that r[j] is MPI_REQUEST_NULL already
(the wait does it) */
}
}
/* Repack request array and issue additional allocations */
j = alloc_top;
for (i=alloc_top; i<r_top; i++) {
if (r[i] != MPI_REQUEST_NULL)
r[j++] = r[i];
}
r_top = j;
MPI_Iallocate( 20 - r_top, (char *)0, (char **)0, "*", (char **)0,
MPI_HARD, r + r_top);
}
MPI_Finalize();
return 0;
}
\end{verbatim}

}

\paragraph{PVM-style SPMD example} This is how many PVM
programs are typically written. There is no reason they
can't be done with MPI-1, but in case users want the
appearance of minimal change, here it is. (Very similar to
the manager-worker example above.)

\begin{verbatim}
/* myprog */

#include <mpi.h>
main(int argc, char *argv[])
{
    int count, flag;
    MPI_Comm parent;            /* intercommunicator to parent, if any */
    MPI_Comm everyone;          /* intercommunicator to spawned group */
    MPI_Comm new_world;         /* corresponding intracommunicator */

    MPI_Init(&argc, &argv);

    MPI_Comm_parent(&parent);
    if (parent == MPI_COMM_NULL)    /* I'm the parent */
    {
        MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &count, &flag);
        if (count == 1)
            error("No room to start workers");
        MPI_Spawn("myprog", count-1, NULL, 0, 0, MPI_COMM_SELF, &everyone);
        MPI_Intercomm_merge(everyone, 0, &new_world);
    }
    else
        MPI_Intercomm_merge(parent, 1, &new_world);

    /* SPMD parallel code here, using new_world instead of MPI_COMM_WORLD */

    MPI_Finalize();
}

\end{verbatim}

\paragraph{Client-server example.} This is a simple example; the server
accepts only a single connection at a time and serves that connection until
the client requests to be disconnected.

Here is the server. It accepts a single connection and then processes data
until it receives a message with tag {\tt 1}. A message with tag {\tt 0}
tells the server to exit.
\begin{verbatim}
#include "mpi.h"
main( int argc, char **argv )
{
MPI_Comm client;
MPI_Status status;
double buf[MAX_DATA];
char given_name[MAX_NAME_SIZE];
int again;

MPI_Init( &argc, &argv );
MPI_Register_name("cavewand",given_name);
printf("cavewand server available at %s\n",given_name); /* publish name */
while (1) {
MPI_Accept( MPI_COMM_WORLD, 0, given_name, &client );
again = 1;
while (again) {
MPI_Recv( buf, MAX_DATA, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
client, &status );
switch (status.tag) {
case 0: MPI_Comm_free( &client );
MPI_Finalize();
return 0;
case 1: MPI_Comm_free( &client );
again = 0;
break;
case 2: /* do something */
...
default:
MPI_Abort( MPI_COMM_WORLD, "Unexpected message type" );
}
}
}
}
\end{verbatim}

Here is the client.

\begin{verbatim}
#include "mpi.h"
main( int argc, char **argv )
{
MPI_Comm server;
double buf[MAX_DATA];
char given_name[MAX_NAME_SIZE];

MPI_Init( &argc, &argv );
strcpy(given_name, argv[1] ); /* assume server's name is cmd-line arg */

MPI_Connect( MPI_COMM_WORLD, given_name, &server );
while (!done) {
tag = 2; /* Action to perform */
MPI_Send( buf, n, MPI_DOUBLE, 0, tag, server );
/* etc */
}
MPI_Send( buf, 0, MPI_DOUBLE, 0, 1, server );
MPI_Comm_free( &server );
MPI_Finalize();
}
\end{verbatim}

If the server needs to manage multiple connections at once, it can use
\mpifunc{MPI\_IACCEPT} instead. The client need not be changed.
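
A sketch of such a concurrent server loop, assuming the C binding
\code{MPI\_Iaccept}; \code{add\_client} and \code{serve\_existing\_clients}
are placeholders:

\begin{verbatim}
MPI_Request request;
MPI_Status  status;
MPI_Comm    client;
int         flag;

MPI_Iaccept( MPI_COMM_WORLD, given_name, 0, &client, &request );
while (1) {
    MPI_Test( &request, &flag, &status );
    if (flag) {
        add_client( client );   /* each client gets its own communicator */
        MPI_Iaccept( MPI_COMM_WORLD, given_name, 0, &client, &request );
    }
    serve_existing_clients();
}
\end{verbatim}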