draft revision of Dynamic Chapter

Al Geist (geist@msr.EPM.ORNL.GOV)
Thu, 31 Aug 1995 07:57:26 -0400

Hi Gang,
I tried to incorporate all the functions and ideas presented at
the last meeting. This draft still needs to be merged with some
material written by Bill S.
It is sent to this group for those people who like to
read early and get their ammo ready for the meeting.

(-:
Al
---------- revision of Dynamic Chapter (yes it is just LaTeX - sue me) ---
% Aug 29, 1995 - AG
% July 23, 1995 - wcs
% July 12, 1995 - RL
% July 3, 1995 - wcs
% Version as of May 30, 1995 - wcs
% Version as of May 29, 1995 - RL
% Version as of April 27, 1995

\chapter{Dynamic Processes}
\label{sec:dynamic-2}
\label{chap:dynamic-2}

\section{Introduction}
\label{sec:dynamic:introduction}

The MPI-1 message passing library allows processes in a parallel
program to communicate with one another. MPI-1 specifies neither how
the processes are created, nor how they establish
communication. Moreover, an MPI-1 application is static, that is, no
processes can be added to or deleted from an application while it is
running.

MPI users have asked that the MPI-1 model be relaxed to allow dynamic
process management.
% I believe this is true, and illustrates
% the relationship of PVM and MPI to confused users - wcs
The main impetus comes from the PVM~\cite{pvmbook} research effort,
which has provided a wealth of experience with dynamic process
management that illustrates its benefits and potential pitfalls. The
reasons for adding dynamic process management are both technical and
practical.

\begin{itemize}

\item Workstation network users migrating from PVM to MPI
are accustomed to using PVM's capabilities for process and resource
management. While relatively few applications are truly dynamic or
require features not in MPI, the lack of these features is a practical
stumbling block to migration.

\item Important classes of message passing applications, such as
client-server systems and task farming jobs, require dynamic
process control.

\item With dynamic resource and process management extensions,
it would be possible to write major parts of the parallel
programming environment in MPI itself.

\item The ability to write fault tolerant applications is
important in unstable environments and for commercial applications.
MPI-1 does not provide mechanisms for building fault-tolerant
applications. The mechanisms required to support fault tolerance
largely overlap with those needed to support dynamic process
management.

\end{itemize}

While dynamic process management is essential, it is important
not to compromise the portability or performance of MPI.
In particular:
\begin{itemize}

\item The MPI-2 dynamic process management model must apply
to the vast majority of current parallel environments. These
include everything from tightly integrated MPPs such as the
Intel Paragon and the Meiko CS-2 to heterogeneous networks of
workstations.

\item MPI must not take over operating system responsibilities.
It should instead provide a clean interface between an application
and system software.
\item MPI must continue to guarantee communication determinism, i.e.,
dynamic process management must not introduce unavoidable race
conditions.
\item MPI must not contain features that compromise performance.
\item MPI-1 programs must work under MPI-2, i.e., the MPI-1
static process model must be a special case of the MPI-2 dynamic
model.

\end{itemize}

The MPI dynamic process management model addresses these issues in two
ways. First, it separates the runtime environment of a parallel
program into three logical parts: resource management, process
management, and communication. MPI-1 addresses the communication
component. MPI-2 provides an interface between an MPI application and
logically external resource and process managers. The resource and
process managers may be separate programs, the same program, part of
the operating system, or even contained within the MPI implementation
itself, but their functionality is well-defined and exists in most
current message passing environments, including PVM.

Second, MPI-2 does not change the concept of communicator. Once a
communicator is built, it behaves as specified in MPI-1. A
communicator is never changed once created, and it is always
created using deterministic collective semantics.

\section{The MPI Dynamic Process Model}
% model->architecture? - wcs
\label{sec:model}

\subsection{Components}
\label{sec:components}

The MPI dynamic process model separates the functions of
{\em resource manager}, {\em process manager} and {\em message-passing
library}.

\paragraph{Resource Manager}
\label{sec:resman}

The {\em resource manager\/} is the part of the system that controls
resources and allocates them to an application. It decides when a job
will run and which processors will be allocated to it when it does
run. In some environments the resource manager is a sophisticated
batch queueing system; in others it is the user him/herself, who can
start jobs on a network whenever and wherever he/she likes. Logically,
the resource manager is external to an application even if it is
implemented internally.

\paragraph{Process Manager}
\label{sec:procman}

Once processors have been allocated to a program, user processes must
be started on those processors, and managed after startup. By
``managed'' we mean that signals must be deliverable, that {\tt
stdin}, {\tt stdout}, and {\tt stderr} must be handled in some
reasonable way, and that orderly termination can be guaranteed. A
minimal example is {\tt rsh}, which starts processes and reroutes {\tt
stdin}, {\tt stdout}, and {\tt stderr} back to the originating
process. A more complex example is given by {\tt poe} on the IBM SP2
or {\tt prun} on the Meiko CS-2, which start processes on processors
given to them by the job scheduler and manage them until they are
finished. In a tightly integrated parallel computer, process
management may be done entirely by the operating system.

In some cases the situation is muddied because a single piece of
software combines the functions of resource manager and process
manager.
% is this really true? For instance, doesn't loadleveler use
% poe to launch jobs? - wcs
Examples of this approach are the batch queueing systems such as
Condor, DQS, and LoadLeveler. Nonetheless, it is convenient to
consider resource and process management separately, since although
they interact, they are separate functions that can be independently
modified.

\paragraph{Message-Passing Library}
\label{sec:library}
By the {\em message-passing library\/} we mean the library used by the
application program for its interprocess communication. Programs containing
only calls to a message-passing library can be extremely portable, since they
fit cleanly into a variety of job scheduler--process manager environments.
MPI-1 defines a standard message passing library.

\subsection{Interaction of User, Application and Runtime Environment}

\begin{figure}[htbp]
\centerline{\epsfxsize=4.0in\epsfbox{figures/dynamic-fig1.eps}}
\caption{Structure of the Runtime Environment}
\label{fig:3pieces}
\end{figure}

The starting point for dynamic resource and process management is the
resource manager. It is the resource manager that allocates,
implicitly or explicitly, resources necessary for running processes.
The ``resource manager'' may be anything from a sophisticated
batch scheduling system, to a file containing a list of machines
on which an application can run, to the user him/herself.

\paragraph{Acquiring Resources}

In a generic application, there are two distinct phases of interaction
with the resource manager. In the first phase, the user requests the
initial resources on which an application will run. This phase is
implicit but not specified in the static MPI model. In the second
phase, the application itself may request additional resources
from the resource manager.

While it seems at first that MPI need only be concerned with the
second phase, it must be aware of the first as well. The reason
is that resources may be ``preallocated'' to an application.

We expect the most common type of ``dynamic'' application to be one in
which all resources are actually allocated before the application
starts. The application will be started on a subset of the resources,
will {\em discover} the additional resources using
\mpifunc{MPI\_Resource\_discover}, and will start the rest of the
application on those additional resources. We term these applications
{\em quasi-static} because they use static resources but may have
dynamic processes. Applications which discover preallocated resources
but do not allocate new ones will generally be more portable than
those which allocate new resources. Resource discovery may or may not
require communication with an external resource manager, depending on
the MPI implementation.
Resource discovery can be a simple operation, so that it
can be used by a naive user to mimic PVM-style behavior.

Truly dynamic applications will allocate new resources while
running, interacting directly with the resource manager
through \mpifunc{MPI\_RESOURCE\_ALLOCATE}.
A resource request may be arbitrarily complicated, but the
details of the request are interpreted by the resource manager, not MPI.
The request itself is therefore a string which has no meaning to MPI.

Regardless of how an application acquires resources,
it may release them to the system through
\mpifunc{MPI\_Resource\_free}.

\discuss{
The description above is slightly different from earlier
descriptions in that it contains the concept of resource
discovery. There are two reasons for this. First, almost all
applications (including most new ones we target in MPI-2)
need only the first phase - making use of pre-allocated
resources. An application usually does not actually request
resources nor does it know what it wants.
It only wants to find out what the user has already allocated.
Second, it was difficult to come up with
a unified semantics that would work well for both resource
discovery and resource allocation. They are therefore
separate, although they both produce the same result - an
\mpifunc{MPI\_Resource} object (or objects).
The separation of discovery and resource allocation also exists in PVM.
It is the difference between {\tt pvm\_config} and {\tt pvm\_addhosts}.

It would be possible to treat both resource allocation and resource
discovery through an allocation-based interface. An application would
``allocate'' the resources it knew to be available and the allocation
routine would return quickly. This approach is undesirable for several
reasons. First, in the allocation-based approach, an application must
find out through some external means (such as a configuration file)
what resources it should ask for. This is cumbersome and redundant
with the initial resource request. Second, it requires even simple
applications to formulate a potentially complicated resource
request. Resource specification will vary widely with resource
managers and should remain outside an application as much as possible,
for portability. Third, it would be a lie, as resources have already
been allocated. This is always a dangerous thing and can mislead
programmers.

I expect that true dynamic applications are rare. A lot of
the true dynamism may come in parallel environments
which are written in MPI. For instance, with the dynamic
interface, it is possible to imagine writing an MPI console
similar to the PVM console that puts together a virtual
machine and spawns MPI applications on that virtual machine.

I also expect that resource discovery will be quite
portable, while resource allocation may be tailored to
specific environments.

- wcs

In the future, true dynamic applications will be the norm.
The dynamic qualities will be required for fault tolerance,
task migration and load balancing. While MPI-2 may not
solve these requirements, we need to be sure our design
does not preclude their solution in the future. - AG

}

MPI does not address the issue of finding out what additional (not yet
allocated) resources may be available. For instance, an application
might want to know what is possible to request before it requests
resources. This issue is quite complex and not directly necessary for
MPI, in contrast to resource allocation itself, in which an
application obtains simple MPI-defined objects necessary for starting
processes. It is expected that an application will use a resource
manager-specific API to inquire about resource availability.

\paragraph{Starting and Managing Processes}

As with resource management, there are two types of
process creation. The first is the creation of the
original MPI application. The second is creation
of processes from within an MPI application. Fortunately
MPI need only worry about the second type.

MPI applications may start new processes (including non-MPI
processes), send them signals, and find out when they
die or become unreachable. They do this through
an interface to the process manager, which may range
from a parallel operating system (CMOST) to layered
software (POE) to an {\tt rsh} command (P4).

There are two ways to start new processes.
Both take as input an \mpifunc{MPI\_Resource} object,
representing computational resources on which a job can
run. For simple applications, there is a simple way
to obtain this \mpifunc{MPI\_Resource} object.

\mpifunc{MPI\_START\_ATTACH} starts MPI processes and establishes
communication with them, returning an intercommunicator.

\mpifunc{MPI\_START\_ABANDON} starts new processes,
which may or may not communicate among themselves,
but don't communicate with their parent.
This function is required for starting non-MPI tasks
and is useful for starting an SPMD application.

\subsection{Examples of Runtime Environments}
\label{sec:examples-runtime}

To illustrate how the above framework allows us to describe a wide variety of
actual systems, we give here some examples.

\paragraph{Environments with explicit Resource Managers}

The SP2 computers at Argonne National Laboratory and the NAS facility
at NASA Ames Research Center use (different) locally written job
schedulers to manage
the SP2. The schedulers ensure that only one user has access to any SP
node at a time and manage resources in a ``fair'' way to ensure that
all users can get access to the machines. They require users to
provide time limits for their jobs so that the machine can be tightly
scheduled. Users submit scripts to the scheduler, which allocates
resources and runs jobs using IBM's Parallel Environment
software. The systems interact with a variety of message-passing
libraries, including two based on MPI.

The correspondence with the model above is straightforward.
The resource manager is the locally written job scheduler.
The process manager is contained in IBM's parallel environment
software.

There are numerous examples of job management systems that allocate
resources. These include PBS (from NAS), EASY (from ANL), LSF (from
Platform Computing), LoadLeveler (from IBM), DQS~\cite{green:dqs}
(from the Supercomputer Research Institute at Florida State
University), Condor (from the University of Wisconsin), and NQS.

Each one of these resource managers can be used in conjunction with
one or more process managers. Process managers may be programs
external to an application ({\tt poe} on the IBM SP2), part of the
operating system (the usual case on tightly integrated MPPs such as the Intel
Paragon, Meiko CS-2 and TMC CM-5), part of the message passing library
({\tt p4} or {\tt pvm}) or integrated with the resource manager (Condor
with {\tt pvm}).

\paragraph{Network of Workstations with PVM}

One reason for PVM's popularity is that it can be viewed as a completely
self-contained system that supplies its own process management and can be used
to implement a resource manager as well. On systems that have neither of these
functions pre-installed, PVM can provide a complete solution. A user creates
a ``virtual machine'' by starting ``daemons'' on an assortment of machines and
then schedules jobs to run on it and manages his processes with the help of
the daemons. The virtual machine itself can be reconfigured from inside the
user program.

Equivalent MPI functionality requires an MPI implementation
in which both resource management and process management are
provided by the MPI implementation itself. While implemented
within MPI, these functions would be logically external
to an MPI application, allowing the application to run virtually
unchanged in the presence of real resource and process managers.

One can conceive
of a system in which resource allocation and process management functions were
provided by the existing PVM daemon structure, while MPI was
used by the application for message passing. PVM would be used as the
implementation layer for the functions described in this chapter, but would
not be visible to the application program.

\subsection{Applications Requiring Direct Communication with the Runtime
System}
\label{sec:flexapplications}

The existing MPI specification is adequate for most parallel
applications. In these applications, the resource manager and process
manager, whether simple or elaborate, allocate resources and manage
user processes without interacting with the application program. In
other applications, however, it is necessary that the {\em user
level\/} of the application communicate with the job scheduler and/or
the process manager. Here we describe three broad classes of such
applications. In Section~\ref{sec:examples} we will give concrete
examples of each of these classes.

\paragraph{Task Farming}
\label{sec:farming}
By a ``task farm'' application we mean a program that manages the execution of
a set of other, possibly sequential, programs. This situation often arises
when one wants to run the same sequential program many times with varying
input data. We call each invocation of the sequential program a {\em task}.
It is often simplest to ``parallelize'' the existing sequential program by
writing a parallel ``harness'' program that in turn devotes a separate,
transient process to each task. When one task finishes, a new process is
started to execute the next one. Even if the resources allocated to the job
are fixed, the ``harness'' process must interact frequently with the process
manager (even if this is just {\tt rsh}, to start the new processes with the
new input data). In many cases this harness can be written in a simple
scripting language like {\tt csh} or {\tt perl}, but some users prefer to use
Fortran or C. Note that it is an explicit goal of the MPI dynamic
process architecture to allow the management of non-MPI processes.
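
The harness pattern can be sketched as follows. This is an
illustration in Python, not MPI code: the function name
{\tt run\_task\_farm}, the {\tt max\_workers} limit, and the one-line
``sequential program'' that squares its argument are all assumptions
made for the example, and the local operating system stands in for
the process manager.

```python
import subprocess
import sys


def run_task_farm(inputs, max_workers=2):
    """Run one transient process per task, at most max_workers at a time."""
    # Hypothetical stand-in for the existing sequential program.
    program = "import sys; print(int(sys.argv[1]) ** 2)"
    pending = list(enumerate(inputs))
    running = {}                      # task index -> process handle
    results = [None] * len(inputs)
    while pending or running:
        # Start new processes while slots are free.
        while pending and len(running) < max_workers:
            idx, value = pending.pop(0)
            running[idx] = subprocess.Popen(
                [sys.executable, "-c", program, str(value)],
                stdout=subprocess.PIPE, text=True)
        # Wait for the oldest running task, then reuse its slot.
        idx, proc = next(iter(running.items()))
        out, _ = proc.communicate()
        results[idx] = int(out)
        del running[idx]
    return results
```

Calling {\tt run\_task\_farm([1, 2, 3])} starts three short-lived
processes, at most two at a time, and collects their results in task
order. The resources stay fixed throughout; only the processes come
and go.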

\paragraph{Dynamic number of processes in parallel job}
\label{sec:dynamic}
The program wishes to decide {\em inside\/} the program to adjust the number
of processes to fit the size of the problem. Furthermore, it may continue to
add and subtract processes during the computation to fit separate phases of
the computation, some of which may be more parallel than others. In order to
do this, the application program will have to interact with the resource manager
(however it is implemented) to request and acquire or return computational
resources. It will also have to interact with the process manager to request
that processes be started and in order to make the new processes known to the
message-passing library so that the larger (or smaller) group of processes can
communicate.

% isn't there a more commonly used word than scavenger? - wcs
An important type of dynamic application is a scavenger application. A
scavenger application is ``embarrassingly parallel'' in the sense that
it performs a large number of completely independent tasks. If the
number of tasks is large enough, such an application can make use of
any resources that become available. Conversely, it can
easily give up resources to another application. Scavenger
applications are excellent for filling in the ``holes'' on
a space-shared parallel machine, allowing it to achieve very
high utilization.

\paragraph{Client/Server}
\label{sec:server}
This situation is the opposite of the situations above, where processes come
and go upon request. In the client/server model, one set of processes is
relatively permanent
(the server, which we assume here may be a parallel program).
At unpredictable times, another (possibly parallel) program (the client)
begins execution and must establish communication with the server. In this
case the process manager must provide a way for the client to locate the
server and communicate to the message-passing library that it must now support
communications with a new collection of processes.

It is currently possible to write the parallel clients and servers in MPI, but
because MPI does not provide the necessary interfaces between the application
program and the resource manager or process manager, other nonportable, machine
specific libraries must be called in order for the client and server to
communicate with one another. On the other hand, MPI does contain several
features that make it relatively easy to add such interfaces, and we propose
both a simple interface and a more complex but flexible one.

\paragraph{Summary}

Using dynamic processes in MPI is a three-step process as shown
in Figure~\ref{fig:dynamic}.
\begin{enumerate}

\item {\bf Allocate or discover resources}, obtaining one or more
\mpifunc{MPI\_Resource} objects, which represent permission to
start additional processes.

\item {\bf Start tasks on allocated/discovered resources}, obtaining
one or more \mpifunc{MPI\_Process} objects, which represent
running processes but don't allow communication.

\item {\bf Establish MPI communication with or between newly created
processes} by obtaining a communicator whose group contains the new processes.
\end{enumerate}
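
The three steps above can be modelled in a few lines. The sketch
below is Python, not an MPI binding, and every name in it
({\tt Resource}, {\tt Process}, {\tt Communicator}, {\tt discover},
{\tt start}, {\tt connect}) is hypothetical. It only illustrates how
each step consumes the object produced by the previous one, and that
a communicator is created once and never modified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Permission to start up to `slots` processes (step 1)."""
    slots: int


@dataclass(frozen=True)
class Process:
    """A running process; by itself it offers no communication (step 2)."""
    pid: int


@dataclass(frozen=True)
class Communicator:
    """An immutable group of processes that can exchange messages (step 3)."""
    group: tuple


def discover(preallocated_slots):
    # Step 1: obtain a resource object for resources the user already holds.
    return Resource(slots=preallocated_slots)


def start(resource, count, first_pid=100):
    # Step 2: start processes, never more than the resource permits.
    if count > resource.slots:
        raise ValueError("resource does not permit that many processes")
    return [Process(pid=first_pid + i) for i in range(count)]


def connect(parent, children):
    # Step 3: build a new communicator; existing ones are never changed.
    return Communicator(group=(parent, *children))
```

A parent would call {\tt discover}, pass the result to {\tt start},
and then pass the returned processes to {\tt connect} to obtain a
communicator whose group contains itself and its children.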

\begin{figure}[htbp]
\centerline{\epsfxsize=4.0in\epsfbox{figures/dynamic-fig2.eps}}
\caption{Dynamic Process Management}
\label{fig:dynamic}
\end{figure}

\section{Resource Management Interface}

\subsection{The \mpifunc{MPI\_Resource} Object}
\label{sec:resource}

An \mpifunc{MPI\_Resource} object identifies a set of computational
resources on which the object's owner may start processes.
Depending on the request that generated it, an \mpifunc{MPI\_Resource}
may represent a physical CPU, a collection of CPUs, or even a promise
from a resource manager that an application is allowed to start one or
more processes (whose physical location will be determined dynamically).

Each resource object is owned by a single process.
Initially, the process identified by (MPI\_COMM\_WORLD, 0)
owns all pre-existing resources in a single object.
Resource objects can be split and merged into other
resource objects and ownership of unused resources
can be passed to a process's children when they are started.
In this case, the child with rank 0 obtains ownership.

Any process may own a resource given that it has successfully
called \mpifunc{MPI\_RESOURCE\_ALLOCATE}. Ownership is not limited to
processes with rank 0.

\discuss{
The definitions above assume that processes in MPI-2
continue to be named based on (COMM, RANK).
Later in this chapter we describe an alternate naming scheme
based on a process object.
-AG
}

The simplest understanding of a resource object is that it
is a bag of process slots. Even so,
it is necessary to distinguish between the total number of
process slots and the number of processes recommended
for a given resource. For instance, if a resource corresponds
to a Unix workstation, it may theoretically run
as many processes as Unix can support. On the other hand,
for best results it should run only one or two processes per processor.
We could give it one process slot, but some applications
will need to be able to do more. We don't want to tell
the user, however, that there are 100 slots available because
there will be no way to distinguish this resource from one
with 100 CPUs. We may have an SMP, but our MPI
processes may be multithreaded processes, so that we may still
want only one process per node. Thus ``recommended processes''
doesn't always mean ``number of CPUs.'' Most applications can
look at recommended processes and ignore max processes.
Finally, for debugging, a user might want to
start multiple processes on a single workstation.
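
The distinction between recommended and maximum process slots can be
made concrete with a small model. This is a Python sketch under
assumed names ({\tt Slots}, {\tt processes\_to\_start}); the draft
does not define such a helper. It shows the intended default: an
application that does not care uses the recommended count, while one
that does (for debugging, say) may ask for more, up to the maximum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slots:
    recommended: int   # what most applications should use
    maximum: int       # hard upper bound for this resource


def processes_to_start(slots, requested=None):
    """Default to the recommended count; allow other counts up to the maximum."""
    if requested is None:
        return slots.recommended
    if not 1 <= requested <= slots.maximum:
        raise ValueError("request is outside what this resource can hold")
    return requested
```

For a workstation described as {\tt Slots(recommended=1, maximum=100)},
a plain call yields one process, and a debugging run can still request
four on the same machine.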

The notion of {\em resource} is deliberately flexible. Most
applications do not care where processes are started, as long
as there is one process per CPU. For these applications,
a single resource object containing several CPUs may be
appropriate. Other applications will want explicit control
over which processes go on which machine. In this case,
a single resource object may correspond to a single machine.
Similarly, it might be convenient to think of an SMP as
a single ``node'' or as a collection of (virtual) processors.

In order to give applications explicit control over resources,
and still hope to maintain portability in MPI, we need to
define a resource description. This description would be used in
the \mpifunc{ALLOCATE} function to specify a specific set of
resources, such as ``4 SUNs with at least 32 MB each''.
The following general description string is proposed:

{\tt attr == value \&\& attr == value \&\& ... , attr == value \&\& ...}

\noindent
The resource description string is composed of attributes ``ANDed''
together and logical blocks of attributes separated by commas.
The attribute can be any string that is meaningful to the
application or underlying resource manager.
The following default set of attribute strings would be defined in MPI.
\begin{itemize}
\item NSLOTS - recommended number of processes to run.
\item MAX\_SLOTS
\item HOSTNAME - string as returned by {\it uname}.
\item ARCH - string as returned by {\it uname}.
\item MEMORY
\end{itemize}

Using the above definition to ask for a network of 4 SUN and 5 IBM
workstations, we get the simple resource description string:

{\tt NSLOTS == 4 \&\& ARCH == "SUN" , NSLOTS == 5 \&\& ARCH == "IBM"}
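
To make the structure of the description string concrete, here is a
minimal parser sketch in Python. It is not part of the proposal: the
function name {\tt parse\_description} is hypothetical, and it assumes
that values are either bare integers or double-quoted strings and that
quoted strings contain neither commas nor the AND operator. Each
comma-separated block becomes one dictionary of attributes.

```python
def parse_description(text):
    """Parse 'attr == value && ... , attr == value && ...' into a list of dicts."""
    blocks = []
    for block in text.split(","):
        attrs = {}
        for term in block.split("&&"):
            name, sep, value = term.partition("==")
            name, value = name.strip(), value.strip()
            if not sep or not name or not value:
                raise ValueError(f"malformed term: {term!r}")
            if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
                attrs[name] = value[1:-1]    # quoted string value
            else:
                attrs[name] = int(value)     # bare integer value
        blocks.append(attrs)
    return blocks
```

Applied to the example string above, the parser yields two blocks, one
for the SUN workstations and one for the IBM workstations. A real
resource manager would be free to interpret attributes it knows and
pass the rest through unchanged.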

Given a resource object it would be useful to extract information
about this object. Conversely, it would be useful to be able to
set any undefined or user-defined attribute of a resource object.
The ability to set values is very important when an application
is designed to exploit a particular attribute, for example,
``ATM connected''. The following functions get and set resource values.

\begin{funcdef}{MPI\_SET\_RESOURCE(resource, attr, value)}
\funcarg{\IN}{resource}{an \mpifunc{MPI\_Resource} object}
\funcarg{\IN}{attr}{ attribute string}
\funcarg{\IN}{value}{ value of attribute}
\end{funcdef}

Other information that could be set in the resource description
includes authentication information, if the resource manager requires it.
If the resource manager requires some special string,
the user can define an attribute with this string value.
When a host has multiple communication interfaces (Ethernet, HiPPI, ATM),
an attribute could name the interface to use, so that the desired
communication infrastructure is selected.

\begin{funcdef}{MPI\_GET\_RESOURCE(resource, attr, value, multiplicity)}
\funcarg{\IN}{resource}{an \mpifunc{MPI\_Resource} object}
\funcarg{\IN}{attr}{ attribute string}
\funcarg{\OUT}{value}{ value of attribute}
\funcarg{\OUT}{multiplicity}{ returned when value can't be defined }
\end{funcdef}

Multiplicity is necessary to help the user understand complex
resource objects of which he has no knowledge. For example,
consider our example above with the network of workstations
as a resource object. Calling
\mpifunc{MPI\_GET\_RESOURCE( resource, ARCH, value, multiplicity)}
would return undefined in value and 2 in multiplicity
letting the user know that there are two different architectures
in this resource object.
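
The multiplicity rule can be sketched as follows. This is a Python
model, not the proposed binding: a resource is represented as a list
of attribute dictionaries (one per block of the description string),
and the function name {\tt get\_resource} is hypothetical. The draft
only says what is returned when the value is undefined; returning a
multiplicity of 1 when all blocks agree, and 0 when no block defines
the attribute, are assumptions made here.

```python
def get_resource(blocks, attr):
    """Return (value, multiplicity) for attr across all blocks of a resource."""
    distinct = []
    for block in blocks:
        if attr in block and block[attr] not in distinct:
            distinct.append(block[attr])
    if len(distinct) == 1:
        return distinct[0], 1         # a single, well-defined value
    return None, len(distinct)        # undefined value; count of distinct values
```

For the network of SUN and IBM workstations, querying {\tt ARCH}
gives an undefined value with multiplicity 2, which tells the caller
that splitting the resource by architecture would produce two pieces.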

There is also a function that returns the entire resource description
string given a resource object.

\begin{funcdef}{MPI\_RESOURCE\_DESCRIPTION(resource, description)}
\funcarg{\IN}{resource}{an \mpifunc{MPI\_Resource} object}
\funcarg{\OUT}{description}{a string describing the resource}
\end{funcdef}

\begin{funcdef}{MPI\_RESOURCE\_FREE(resources)}
\funcarg{\IN}{resources}{resource to be returned to resource manager}
\end{funcdef}

\noindent
Frees resource object and
releases resources to resource manager.
Processes running on those resources are lost
(dealing with this case is difficult and should be addressed).
The resource must be owned by the caller of any of these functions.

\discuss{
An alternative to the above resource description has been proposed
by WCS it is ...
}
\discuss{
There is the issue of what happens to resources that expire (fail).
}

\subsection{Discovering, Allocating, and Manipulating Resources}

This section describes an interface for obtaining
resources from a resource manager.

There are two ways that applications typically use new resources. In
the first, they make use of existing, pre-allocated resources. For
instance, in PVM, a user may create a virtual machine, and then start
a master process on that machine. The master process determines how
many hosts are available and spawns slave processes on the unused
hosts. Resources are discovered, not allocated, by the master
process. The second way is for an application to obtain new resources
explicitly. In PVM, this is done through the {\tt pvm\_addhosts}
routine.

Resource discovery is expected to be a fast operation that may be
possible to implement through internal lookup rather than external
communication (depending on the MPI implementation). Resource
allocation, on the other hand, can be time-consuming.
We present the blocking version \mpifunc{ALLOCATE} here for simplicity
with the understanding that a non-blocking version,
which returns a request to be waited on, may be added later.

Neither function starts any application-visible processes; rather,
they obtain an MPI\_Resource object for use by other functions.
However, we need to take into account those resource
managers that cannot allocate resources without starting processes,
such as LoadLeveler or DQS. In those cases, the executables may not
be the application executables, but rather interface processes that
will create the application processes in response to one of the
process-creation functions.

The interface to an external resource manager could be arbitrarily
complex. The MPI interface specifies the absolute minimum amount
of information that must be known to MPI and tries to make common
cases very simple to specify.

\subsubsection{Resource Discovery}

\begin{funcdef}{MPI\_RESOURCE\_DISCOVER( resource)}
\funcarg{\OUT}{resource}{ \mpifunc{MPI\_Resource} object}
\end{funcdef}

MPI\_RESOURCE\_DISCOVER returns the resource object containing
any pre-allocated resources for this process. In most cases
only the process with rank 0 will get a defined value
due to the default rules of ownership.
If a process has no pre-allocated resources, then \mpiarg{resource} is NULL.

\subsubsection{Resource Allocation}

\begin{funcdef}{MPI\_RESOURCE\_ALLOCATE( resource\_description, resource)}
\funcarg{\IN}{resource\_description}{String containing description of resources}
\funcarg{\OUT}{resource}{\mpifunc{MPI\_Resource} allocated}
\end{funcdef}

This is a blocking resource allocation routine based on the
resource description string, which was presented earlier.
\discuss{
The previous draft of this chapter contained an alternate
method of specifying requests in both DISCOVER and ALLOCATE.
The forum should decide which method they prefer.

The alternate proposal is based on an additional argument called
MPI\_RESOURCE\_TYPE.
\mpiarg{resource\_type} is a general category of resource
request, used to make it easy to specify simple requests.
The categories are the same as for the \mpiarg{resource\_type}
argument for \mpifunc{MPI\_RESOURCE\_DISCOVER}. In this
context, they mean:

\begin{itemize}

\item \mpiarg{MPI\_RESOURCE\_TYPE\_DEFAULT}:
Matching is done only on the \mpiarg{NPROC} value.
The string in \mpiarg{resource\_description} is ignored.

\item \mpiarg{MPI\_RESOURCE\_TYPE\_HOST}: The string in
\mpiarg{resource\_description} is a hostname.

\item \mpiarg{MPI\_RESOURCE\_TYPE\_ARCH}: The string in
\mpiarg{resource\_description} is an architecture.
Note that architecture names are not specified by MPI
itself.

\item \mpiarg{MPI\_RESOURCE\_TYPE\_PROCESSOR}: The string in
\mpiarg{resource\_description} describes a single processor.
The recommended approach is that the name should be
the name of the host in which the processor resides.

\item \mpiarg{MPI\_RESOURCE\_TYPE\_SPECIAL}: The string in
\mpiarg{resource\_description} describes an arbitrary set
of resources.

\end{itemize}
End of discussion.
}

\subsubsection{Resource Merge and Split}

\begin{funcdef}{MPI\_RESOURCE\_MERGE( n, array\_of\_resources, resource)}
\funcarg{\IN}{n}{The number of objects in array}
\funcarg{\IN}{array\_of\_resources}{array containing \mpifunc{MPI\_Resource} objects}
\funcarg{\OUT}{resource}{\mpifunc{MPI\_Resource} }
\end{funcdef}

This function merges several resource objects into a single object.
When resources are merged, the \mpifunc{MPI\_Resource} objects describing
the individual members are automatically freed. This ensures that
any resource has only one owner.

There is also a function to split a resource object into multiple objects
with similar attributes.

\begin{funcdef}{MPI\_RESOURCE\_SPLIT(attr, resource, n, array\_of\_resources)}
\funcarg{\IN}{attr}{Specifies attribute to split across}
\funcarg{\IN}{resource}{\mpifunc{MPI\_Resource} }
\funcarg{\INOUT}{n}{The number of objects in array}
\funcarg{\OUT}{array\_of\_resources}{array containing \mpifunc{MPI\_Resource} objects}
\end{funcdef}

With the exception of MEMORY, the predefined default attributes
used for resource description have logical meaning in the context of split.
These meanings are:

\begin{itemize}
\item NSLOT:
A different resource object is returned for each NSLOT.
\item HOST:
A different resource object is returned for each HOST.
\item ARCH:
A different resource object is returned for each group of
similar architectures.
\item user-defined attr:
A different resource object is returned for each group with
the given attribute. For example, all workstations connected by an FCS network.
\end{itemize}
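
As a rough illustration of the HOST and ARCH cases, the following
self-contained C sketch models a resource as an array of slots and counts
how many resource objects a split would return. The names \code{Slot},
\code{SplitAttr}, and \code{count\_split} are hypothetical and exist only
for this sketch; they are not part of the proposed interface. The sketch
assumes that two slots end up in the same object exactly when their
attribute strings compare equal.

\begin{verbatim}
#include <string.h>

/* Hypothetical model of one slot inside a resource object. */
typedef struct {
    const char *host;   /* name of the host the slot lives on */
    const char *arch;   /* architecture name (not specified by MPI) */
} Slot;

/* Hypothetical selector for the attribute to split across. */
typedef enum { ATTR_HOST, ATTR_ARCH } SplitAttr;

/* Return the string value of the chosen attribute for one slot. */
static const char *slot_key(const Slot *s, SplitAttr attr)
{
    return attr == ATTR_HOST ? s->host : s->arch;
}

/* Count the distinct values of attr among n slots, that is, the
 * number of resource objects a split would return in this model. */
int count_split(const Slot *slots, int n, SplitAttr attr)
{
    int i, j, groups = 0;

    for (i = 0; i < n; i++) {
        int seen = 0;
        /* has an earlier slot already opened this group? */
        for (j = 0; j < i; j++) {
            if (strcmp(slot_key(&slots[i], attr),
                       slot_key(&slots[j], attr)) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen) groups++;
    }
    return groups;
}
\end{verbatim}

In this model, four slots on hosts a, b, a, and c would split into three
objects by HOST; the number of objects by ARCH depends only on how many
distinct architecture names appear.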

\section{Process Manager Interface}

Once an application has obtained resources through allocation or
discovery, it may start processes on those resources using the
routines presented in this section. More abstractly, it can request
that the process manager start processes on those resources.

\discuss{
We need to make a decision about how processes will be identified
in MPI-2. The two proposals on the table are to use only the (comm,rank)
method that is used in MPI-1, or augment this method by introducing
an \mpifunc{MPI\_Process} Object. One problem with the former
is there is no communicator when non-MPI processes are started,
so how are they identified?

Historical reference: PVM had to make this same decision 5 years ago.
PVM-2 used only (group,rank) ids. PVM-3 introduced a task id as the
lowest level id with (group,rank) as the id for tasks that wanted it.

At the last meeting, many were in favor of just (comm,rank),
but there were a few who felt strongly about process objects.
The present section includes the \mpifunc{MPI\_Process} Object
material.
-AG
}
\subsection{The \mpifunc{MPI\_Process} Object}

\discuss{
In MPI-1, processes can be identified by their rank in a group or
communicator. It might be possible to continue to use a (group,rank)
pair to identify a process as required for MPI-2. While convenient for
some operations (e.g., spawning a group of processes), the
(group,rank) identification would be inconvenient or unnatural for
others (e.g., sending a signal). Moreover, an
\mpifunc{MPI\_Group} object provides an abstraction
for a collection of processes which is an unnatural
starting point to define a single process.
Therefore, in the following we introduce a new object,
\mpifunc{MPI\_Process}.
}

An \mpifunc{MPI\_Process} represents a process -- a program executing
on a processor. An \mpifunc{MPI\_Process} is a ``process'', not a
``thread'' in that processes have their own address spaces and
don't share data unless they take explicit action to do so (outside
of MPI). A process may or may not be ``registered'' with
MPI; it may or may not call \mpifunc{MPI\_INIT}.

\discuss{
People have been confused by the process vs. thread distinction
and the fact that an \mpifunc{MPI\_Process} may not be
an MPI Process.

I would prefer that MPI-2 not say that an \mpifunc{MPI\_Process}
cannot be a thread. -AG
}

You can't communicate
directly with an \mpifunc{MPI\_Process} (that requires a communicator), but
you can send signals to one. If it is an MPI process in the sense that
it has a rank in a communicator, then the \mpifunc{MPI\_Process} object
representing it can be retrieved from its rank in that communicator (See
\mpifunc{MPI\_COMM\_PROCESS} below). MPI provides no guarantees
on the order of operations between messages and signals.
If a message and a signal are sent to a process,
the signal may arrive or be processed before the message,
depending on hardware and MPI implementation details.

\discuss{
A bad programming practice that has been seen by naive users of systems that
provide both messages and signals is to send a process a message and then
signal it to wake up and receive the message. This needs to be plainly
forbidden. - AG.
}

\subsection{Starting Processes}

Starting a process using MPI requires two steps.
First, one or more \mpifunc{Process\_init} calls are made.
Each of these calls specifies an executable and a single resource object
and returns a group handle.
MPI-1 group functions can be used to join groups together.
The second step actually starts the processes specified by a group handle
and places them all in a single MPI\_COMM\_WORLD.

The two step process not only allows for all the children
to have the same MPI\_COMM\_WORLD,
but it also gives the user the flexibility needed to run
applications across a network. A simple example is where
executables must be matched with particular workstations (resource objects),
but the user also wants all the processes on all the workstations
working together.

There are two options at the second step.
If the parent process is starting MPI processes that will
communicate back with the parent process at some point in the future,
then the application should call \mpifunc{Start\_Attach}.
This function blocks until the children establish an
inter-communicator with the parent group.
The synchronization occurs when the children call \mpifunc{MPI\_INIT}.
When MPI\_INIT is called, the variables MPI\_COMM\_WORLD
and MPI\_COMM\_PARENT (inter-communicator) are set in the children.

The advantage of using MPI\_INIT in this way
is that MPI-1 programs can be started with the new MPI-2 functions
without modification to these programs.

The other option in the second step is to call \mpifunc{Start\_Abandon}.
This function does not wait for the children and is expected to be
used when the children are non-MPI processes or when the
children only communicate among themselves and not back to the parents.
In the latter case, the MPI\_COMM\_PARENT intercommunicator is not defined
on return from MPI\_INIT.

\begin{funcdef}{MPI\_PROCESS\_INIT(executable, arguments, n,
resource, inherit\_resource, group)}
\funcarg{\IN}{executable}{executable file containing program to be run}
\funcarg{\IN}{arguments}{arguments for the program}
\funcarg{\IN}{n}{number of processes to start}
\funcarg{\IN}{resource}{\mpifunc{MPI\_Resource} to run processes on}
\funcarg{\IN}{inherit\_resource}{\mpifunc{MPI\_Resource} which children inherit}
\funcarg{\OUT}{group}{identifies the group of processes}
\end{funcdef}

\begin{funcdef}{MPI\_PROCESS\_START\_ATTACH(mycomm, group, newcomm)}
\funcarg{\IN}{mycomm}{communicator of the parent group}
\funcarg{\IN}{group}{group returned by Process\_Init}
\funcarg{\OUT}{newcomm}{inter-communicator between parent and child groups}
\end{funcdef}

\begin{funcdef}{MPI\_PROCESS\_START\_ABANDON(group)}
\funcarg{\IN}{group}{group returned by Process\_Init}
\end{funcdef}

\subsection{Process Utilities}

\begin{funcdef}{MPI\_PROCESS\_RUNNING(resource, nprocs)}
\funcarg{\IN}{resource}{an \mpifunc{MPI\_Resource} object}
\funcarg{\OUT}{nprocs}{number of processes currently running}
\end{funcdef}

\begin{funcdef}{MPI\_RESOURCE\_PROCESSES(resource, n, array\_of\_processes)}
\funcarg{\IN}{resource}{an \mpifunc{MPI\_Resource}}
\funcarg{\OUT}{n}{number of processes running on resource}
\funcarg{\OUT}{array\_of\_processes}{array of \mpifunc{MPI\_Process}es}
\end{funcdef}

\noindent
Gets the processes running on a given resource. Returns $n = 0$ if
there are none (the resource is then free for a new task).

\discuss{
Do we need to specify the size of the array or do we let
the user hang him/herself?
}

\begin{funcdef}{MPI\_PROCESS\_RESOURCE(process, resource)}
\funcarg{\IN}{process}{an \mpifunc{MPI\_Process}}
\funcarg{\OUT}{resource}{the \mpifunc{MPI\_Resource} on which the process is running}
\end{funcdef}

\noindent
Gets the resource on which a process is running. This routine
can only be called on the process which owns the resource.
On all other processes it returns \mpifunc{MPI\_RESOURCE\_NULL}.

\discuss{
Allowing other processes to retrieve the corresponding
resource causes all sorts of problems. The current definition
is probably fine until we start talking about fault tolerance,
in which case we have another problem.
}

\begin{funcdef}{MPI\_PROCESS\_IN\_GROUP(group, rank, process)}
\funcarg{\IN}{group}{group}
\funcarg{\IN}{rank}{rank in group}
\funcarg{\OUT}{process}{\mpifunc{MPI\_Process} corresponding to rank in group}
\end{funcdef}

\noindent
Gets the \mpifunc{MPI\_Process} corresponding to a given group and
rank.

A process represented by an \mpifunc{MPI\_Process} cannot be communicated with
directly, until a communicator is constructed containing it (see next
section). It need not be an ``MPI process'' in the sense that it might not
call \mpifunc{MPI\_INIT}. On the other hand, it allows out-of-band
communication, such as signals, and might be a useful concept for dealing with
failures.

To ask the process manager to deliver signals to
processes, we use

\begin{funcdef}{MPI\_PROCESS\_SIGNAL(signal, num\_processes, array\_of\_processes)}
\funcarg{\IN}{signal}{signal type (int)}
\funcarg{\IN}{num\_processes}{number of processes in array}
\funcarg{\IN}{array\_of\_processes}{\mpifunc{MPI\_Process}es to be signalled}
\end{funcdef}

It is the responsibility of an implementation to translate between signals; in
other words, a \code{SIGINT} that has value \code{3} on system A must be
delivered as a \code{SIGINT} on system B, even if system B
uses the value \code{5} for \code{SIGINT}. If the signal can not be delivered
because there is no corresponding signal, the error code is
\mpifunc{MPI\_ERR\_INVALID\_SIGNAL}.
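
One way an implementation might meet this requirement is to carry a
system-independent code between hosts and to map it to the local signal
number at each end. The C sketch below shows that idea under stated
assumptions: the \code{WIRE\_*} codes, the functions
\code{local\_to\_wire} and \code{wire\_to\_local}, and the stand-in
error value \code{SKETCH\_ERR\_INVALID\_SIGNAL} are all hypothetical, and
only the three signals guaranteed by ISO C that appear in the table are
handled.

\begin{verbatim}
#include <signal.h>
#include <stddef.h>

/* Hypothetical system-independent signal codes carried between hosts. */
enum { WIRE_SIGINT = 1, WIRE_SIGTERM = 2, WIRE_SIGABRT = 3 };

/* Stand-in for MPI_ERR_INVALID_SIGNAL; this draft does not define
 * its value. */
#define SKETCH_ERR_INVALID_SIGNAL (-1)

/* Per-host table pairing each wire code with the local signal number.
 * Each host compiles this with its own <signal.h> values. */
static const struct { int wire; int local; } sig_table[] = {
    { WIRE_SIGINT,  SIGINT  },
    { WIRE_SIGTERM, SIGTERM },
    { WIRE_SIGABRT, SIGABRT }
};
static const size_t sig_table_len =
    sizeof sig_table / sizeof sig_table[0];

/* Sender side: translate a local signal number to its wire code. */
int local_to_wire(int local_sig)
{
    size_t i;
    for (i = 0; i < sig_table_len; i++)
        if (sig_table[i].local == local_sig)
            return sig_table[i].wire;
    return SKETCH_ERR_INVALID_SIGNAL;
}

/* Receiver side: translate a wire code to the local signal number. */
int wire_to_local(int wire)
{
    size_t i;
    for (i = 0; i < sig_table_len; i++)
        if (sig_table[i].wire == wire)
            return sig_table[i].local;
    return SKETCH_ERR_INVALID_SIGNAL;
}
\end{verbatim}

Because the numeric value of a signal never crosses the network in this
scheme, systems A and B are free to disagree about it. A wire code with no
entry in the receiver's table corresponds to the error case described
above.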

\subsection{Notification}

There needs to be some way of finding out when a process finishes. In MPI we
have no mechanism for asynchronous notification. Therefore the best we can do
is to construct a request that can be tested and waited on. If processes were
represented by requests, then we could wait on them directly. Since we have
introduced the notion of \mpifunc{MPI\_Process}, we can explicitly attach a
request to it in order to wait on it.

PVM users have found that a flexible `notify' function is important
to building fault tolerant applications. We propose to supply
similar functionality in MPI.

\begin{funcdef}{MPI\_NOTIFY(what, n, array, request)}
\funcarg{\IN}{what}{a flag specifying what to be notified about (see below)}
\funcarg{\IN}{n}{number of objects in following array}
\funcarg{\IN}{array}{array of MPI objects (depends on value of `what')}
\funcarg{\OUT}{request}{\mpifunc{MPI\_Request} to be tested and/or waited on}
\end{funcdef}

Initially, we propose that the \mpiarg{what}
can take on two values: MPI\_Process\_Exit and MPI\_Resource\_Failure.
In these cases \mpiarg{array} contains MPI Process objects and
MPI resource objects respectively.

\discuss{
These are the two most obvious values for `what'.
Are there other values we want to consider?
PVM contains the equivalent of MPI\_Resource\_Add case,
but we have restricted MPI resource ownership such that
the MPI\_Resource\_Add case is not important in MPI.
-AG
}

We can think of this as the MPI interface to the Process Manager's handling of the
signal {\tt SIGCHLD}. The exit code from the process (from
a \code{return n} or \code{exit(n)} in C or \code{STOP n} in Fortran) can be
retrieved from the \mpifunc{MPI\_status} filled in the \mpifunc{MPI\_WAIT}.
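
For comparison, the following C sketch shows roughly what a process
manager on a POSIX system does to obtain such an exit code. It uses only
the POSIX calls \code{fork}, \code{\_exit}, and \code{waitpid}; the helper
name \code{run\_child} is hypothetical, and nothing in the sketch is part
of MPI. It assumes a POSIX host and ignores interrupted system calls.

\begin{verbatim}
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code` and return the exit code the
 * parent observes, or -1 on failure or abnormal termination. */
int run_child(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;                 /* fork failed */
    if (pid == 0)
        _exit(code);               /* child: terminate immediately */

    /* parent: block until this particular child has finished */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    /* WEXITSTATUS yields only the low 8 bits of the exit code */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
\end{verbatim}

The blocking \code{waitpid} plays the role that waiting on the request
plays in the proposal, and the decoded status plays the role of the exit
code stored in the status object.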

Note that this level of process management allows us to manage non-MPI
processes, since communicators are not involved.

\section{Attaching Independent Processes}

So far we have covered the case of creating new processes. For
client-server applications, the situation is different, because the processes
in question already exist, and what we need is a communicator to be used by
them to communicate with one another.
Similar situations arise in other applications as well.
For example, a visualization tool may want to start up and
attach to a running simulation, or two parts of a large application
may be started separately at two different sites and then want to
communicate with each other.

This section attempts to provide the functions needed to solve
the general case of creating an intercommunicator between
two MPI processes with no knowledge of each other.

\subsection{Registration and Connection}

The following four functions define the interface.

\begin{funcdef}{MPI\_PROCESS\_REGISTER( name, handle)}
\funcarg{\IN}{name}{string used for contacting}
\funcarg{\OUT}{handle}{associated with the name}
\end{funcdef}

The form of the \mpiarg{name} argument has several possibilities. The most
obvious is to use the {\tt net-address:port-number} format that current
systems will find most straightforward. However, in the long run, name
servers of various kinds may require more flexibility.
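
As a small illustration of the {\tt net-address:port-number} form, the C
sketch below splits such a name into its two parts. The function
\code{parse\_contact\_name} is a hypothetical helper, not a proposed MPI
routine. The sketch assumes that the last colon separates the address from
the port and that the port is a decimal number between 1 and 65535.

\begin{verbatim}
#include <stdlib.h>
#include <string.h>

/* Split "host:port" into a host string (copied into hostbuf, which
 * holds hostlen bytes) and a port number.
 * Returns 0 on success and -1 if the name is malformed. */
int parse_contact_name(const char *name, char *hostbuf,
                       size_t hostlen, int *port)
{
    const char *colon = strrchr(name, ':');
    char *end;
    long p;
    size_t n;

    if (colon == NULL || colon == name)
        return -1;                 /* no colon, or empty host part */
    n = (size_t)(colon - name);
    if (n >= hostlen)
        return -1;                 /* host does not fit in hostbuf */
    if (colon[1] == '\0')
        return -1;                 /* empty port part */

    p = strtol(colon + 1, &end, 10);
    if (*end != '\0' || p < 1 || p > 65535)
        return -1;                 /* trailing junk or out of range */

    memcpy(hostbuf, name, n);
    hostbuf[n] = '\0';
    *port = (int)p;
    return 0;
}
\end{verbatim}

A name server based scheme would replace this parsing step with a lookup,
which is one reason the form of \mpiarg{name} is left open here.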

\begin{funcdef}{MPI\_PROCESS\_ACCEPT(mycomm, handle, root, newcomm)}
\funcarg{\IN}{mycomm}{communicator over which this call is collective}
\funcarg{\IN}{handle}{associated with registered name}
\funcarg{\IN}{root}{the process that registered the name}
\funcarg{\OUT}{newcomm}{new inter-communicator, which includes the
processes of the remote group}
\end{funcdef}

\begin{funcdef}{MPI\_PROCESS\_CONTACT(mycomm, name, newcomm)}
\funcarg{\IN}{mycomm}{communicator over which this call is collective}
\funcarg{\IN}{name}{name by which remote process can be contacted (string)}
\funcarg{\OUT}{newcomm}{new inter-communicator, which includes the
processes of the remote group}
\end{funcdef}

\begin{funcdef}{MPI\_FREE\_NAME( handle)}
\funcarg{\IN}{handle}{associated with registered name}
\end{funcdef}

We will illustrate the use of these functions with a client/server example.
The server would register a ``name'' by which it wants to be known.
The Register function would return an error if there is a name conflict.
The server group calls MPI\_PROCESS\_ACCEPT and the client group
calls MPI\_PROCESS\_CONTACT. The output of these two collective calls
is an inter-communicator between the two groups.
Now any process in the client group can communicate with any process in the
server group and vice versa, using the inter-communicator.

Disconnection occurs when processes call \mpifunc{MPI\_COMM\_FREE} on the
inter-communicator.

\discuss{
I'm not clear how the server can continue to service clients
it is already connected to, and still ACCEPT new clients.
Do we need to make Accept and Contact non-blocking calls?
-AG
}

\discuss{
AG- This is an old discussion but it was left in because
WCS makes several important observations in it.

After going through the exercise of replacing connections with
processes, I think I understand why connections were proposed in the
first place. The problem is that you may want to prevent the server
from blocking no matter what the client does. As proposed here, the
server will block while ATTACHing if the client doesn't do a matching
ATTACH. A ``connection'' was more than a simple process in that
establishing an intercommunicator from an \mpifunc{MPI\_Connection} (using
REMOTE\_ATTACH) was a {\em local} operation, not requiring synchronization
with the remote processes.

I maintain that this only swept the problem under the rug. In effect,
establishing a connection with ICONNECT/IACCEPT was a nonblocking
collective operation that established communication. REMOTE\_ATTACH
was really only a local lookup of the already-established
communicator. A more direct solution is (IMHO) to provide a non-blocking
version of \mpifunc{MPI\_PROCESS\_ATTACH}. This has all the same
associated issues of nonblocking collective operations,
but seems cleaner.

More generally, a server must always avoid collective communication
with a client (unless nonblocking collective is available) and must
always use non-blocking point-to-point communication if it wishes to
avoid deadlock due to an uncooperative/incorrect client.
Because there are (currently) no non-blocking collective operations,
and because these would probably not be too useful to
a server in any case (?) We may want to avoid the issue
entirely and provide only a local (non-collective)
non-blocking version of \mpifunc{MPI\_PROCESS\_IATTACH}
(i.e., without the \mpiarg{comm} and \mpiarg{root} arguments).

If we do have a nonblocking collective operation, it will
need to function correctly when several are overlapped.
Is this possible?

\noindent wcs
}

\section{Examples}
\label{sec:examples}

% The examples have not been updated to reflect the new functions 8/29 -AG

\paragraph{Manager-worker example.}

\begin{verbatim}
/* manager */
#include <mpi.h>
#define MAXPROC 128
main(int argc, char *argv[])
{
MPI_Resource resource;
MPI_Status status;
int count, world_size;
MPI_Comm everyone;
#ifndef SPAWNEM
MPI_Process processes[MAXPROC];
MPI_Comm my_children;
#endif

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
if (world_size != 1) error("Top heavy with management");
MPI_Resource_discover(MPI_RESOURCE_TYPE_DEFAULT, NULL, &resource);

MPI_Resource_nslots(resource, &count);
if (count < 1) error("No resources");
if (count > MAXPROC) { warning("Too many resources"); count = MAXPROC; }

#ifdef SPAWNEM
MPI_Spawn(MPI_COMM_SELF, 0, count, "worker", NULL, resource, &everyone);
#else
MPI_Process_create("worker", NULL, count, resource, processes);
MPI_Process_attach(MPI_COMM_SELF, 0, count, processes, &my_children);
MPI_Intercomm_merge(my_children, FALSE, &everyone);
#endif

/*
* Parallel code here. The communicator "everyone" can be used
* to communicate with the spawned processes
*/

MPI_Finalize();
}

/* worker */

#include <mpi.h>
main(int argc, char *argv[])
{
MPI_Comm parent; /* intercommunicator */
MPI_Comm everyone; /* intracommunicator */
MPI_Init(&argc, &argv);
MPI_Parent(&parent);
MPI_Intercomm_merge(parent, TRUE, &everyone);

/*
* Parallel code here.
*/
MPI_Finalize();
}

\end{verbatim}

\paragraph{Task farm example.}

The nice thing here is that you need allocate resources
only once. Then you can reuse them. It also shows
that you can interact with processes without using
MPI to communicate.

\begin{verbatim}
#include <mpi.h>
#define MAXPROC 100
main(int argc, char *argv[])
{
MPI_Resource resource;
MPI_Process processes[MAXPROC];
MPI_Request obituaries[MAXPROC];
MPI_Status statuses[MAXPROC]; /* see comment below */
int nslots, i, deadone;
char **args;

ThingToDo *thingstodo;
int nthingstodo, nthingsdone;

MPI_Init(&argc, &argv);
MPI_Resource_discover(MPI_RESOURCE_TYPE_DEFAULT, NULL, &resource);
MPI_Resource_nslots(resource, &nslots);

GetThingsToDo(&thingstodo, &nthingstodo);

/* need to initialize these for Waitany/all() below to work */
for (i = 0; i < MAXPROC; i++) obituaries[i] = MPI_REQUEST_NULL;

/* start up tasks on initial resources */
for (nthingsdone = 0; nthingsdone < nslots && nthingsdone < nthingstodo; nthingsdone++) {
SetArgs(thingstodo[nthingsdone], &args);
MPI_Process_create("cow", args, 1, MPI_PROCESS_TYPE_NONMPI,
resource, &processes[nthingsdone]);
MPI_Process_notify_when_done(nthingsdone,
&processes[nthingsdone],
&obituaries[nthingsdone]);
}

/* spawn new ones as old ones finish */
while(nthingsdone < nthingstodo) {
/* perhaps could have some element of status = return
* code so we could check for error ? */
MPI_Waitany(nslots, obituaries, &deadone, &statuses[0]);
SetArgs(thingstodo[nthingsdone], &args);
/* it would probably be more efficient to use
* MPI_Process_icreate here */
/* reuse the slot whose process just finished */
MPI_Process_create("cow", args, 1, MPI_PROCESS_TYPE_NONMPI,
resource, &processes[deadone]);
MPI_Process_notify_when_done(nthingsdone,
&processes[deadone],
&obituaries[deadone]);
nthingsdone++; /* one more task has been handed out */

}
/* this is a good reason why MPI should allow you to pass
* statusptr = NULL
*/
MPI_Waitall(nslots, obituaries, statuses);

MPI_Finalize();
}

\end{verbatim}

\commentOut{
\paragraph{Another task farm example, in old syntax.}
It always keeps a request for ten resources outstanding, but starts jobs as soon
as possible. To avoid spin-waits on the allocation and running of jobs, it
uses \mpifunc{MPI\_WAITSOME} on an array of requests that includes both
allocation requests and started jobs. The index \code{alloc\_top} gives the
number of allocation requests currently active; \code{r\_top} gives the total
number of active requests (both allocations and started processes).

The programs in this example are {\em not} MPI jobs; MPI is simply being used
to start and manage the programs.
For simplicity, we have not included any code to decide when the program is
done or to describe the program to be run and its arguments. Note that
\mpifunc{MPI\_CANCEL} can be used to cancel any unneeded allocation requests.

\begin{verbatim}
#include "mpi.h"
main( int argc, char **argv )
{
MPI_Request r[20];
MPI_Status s[20];
int idx[20], nout;
int alloc_top, r_top;
int rc;

MPI_Iallocate( 10, (char *)0, (char **)0, "*", (char **)0, MPI_HARD, r );
alloc_top = 10;
r_top = 10;
while (!done) {
MPI_Waitsome( r_top, r, &nout, idx, s );
for (i=0; i<nout; i++) {
if (idx[i] < alloc_top) {
/* Processor is ready. Start program */
j = idx[i];
MPI_Set_exec( r[j], program_name );
MPI_Set_args( r[j], program_args );
MPI_Start( r[j] );
r[r_top] = r[j];
r[j] = r[alloc_top];
r[alloc_top] = MPI_REQUEST_NULL;
alloc_top--;
}
else {
/* Program has finished */
j = idx[i];
MPI_Get_return_code( &s[i], &rc );
/* Make use of return code ... */
/* Note that r[j] is MPI_REQUEST_NULL already
(the wait does it) */
}
}
/* Repack request array and issue additional allocations */
j = alloc_top;
for (i=alloc_top; i<r_top; i++) {
if (r[i] != MPI_REQUEST_NULL)
r[j++] = r[i];
}
r_top = j;
MPI_Iallocate( 20 - r_top, (char *)0, (char **)0, "*", (char **)0,
MPI_HARD, r + r_top);
}
MPI_Finalize();
return 0;
}
\end{verbatim}

}

\paragraph{PVM-style SPMD example} This is how many PVM
programs are typically written. There is no reason they
can't be done with MPI-1, but in case users want the
appearance of minimal change, here it is. (Very similar to
the manager-worker example above.)

\begin{verbatim}
#include <mpi.h>
#define MAXPROC 128
main(int argc, char *argv[])
{
MPI_Resource resource;
MPI_Status status;
int count, world_size;
MPI_Comm everyone;
MPI_Process processes[MAXPROC];
MPI_Comm my_children, my_parent;

MPI_Init(&argc, &argv);
MPI_Comm_parent(&my_parent);
if (my_parent == MPI_COMM_NULL) { /* I'm the parent */
MPI_Resource_discover(MPI_RESOURCE_TYPE_DEFAULT, NULL, &resource);
MPI_Resource_nslots(resource, &count);
if (count < 1) error("No resources");
if (count > MAXPROC) { warning("Too many resources"); count = MAXPROC; }
MPI_Process_create(argv[0], NULL, count, resource, processes);
MPI_Process_attach(MPI_COMM_SELF, 0, count, processes, &my_children);
MPI_Intercomm_merge(my_children, FALSE, &everyone);
} else {
MPI_Intercomm_merge(my_parent, TRUE, &everyone);
}

/* SPMD parallel code here, using everyone instead of MPI_COMM_WORLD */

MPI_Finalize();
}

\end{verbatim}

\paragraph{Client-server example.} This is a simple example; the server
accepts only a single connection at a time and serves that connection until
the client requests to be disconnected.

Here is the server. It accepts a single connection and then processes data
until it receives a message with tag {\tt 1}. A message with tag {\tt 0}
tells the server to exit.
\begin{verbatim}
#include "mpi.h"
main( int argc, char **argv )
{
MPI_Comm client;
MPI_Status status;
double buf[MAX_DATA];
int again;

MPI_Init( &argc, &argv );
while (1) {
MPI_Server_connect( MPI_COMM_WORLD, "cave:1234", &client );
again = 1;
while (again) {
MPI_Recv( buf, MAX_DATA, MPI_DOUBLE, 0, MPI_ANY_TAG,
client, &status );
switch (status.MPI_TAG) {
case 0: MPI_Comm_free( &client );
MPI_Finalize();
return 0;
case 1: MPI_Comm_free( &client );
again = 0;
break;
case 2: /* do something */
...
default:
MPI_Abort( MPI_COMM_WORLD, 1 ); /* unexpected message type */
}
}
}
}
\end{verbatim}

Here is the client.

\begin{verbatim}
#include "mpi.h"
main( int argc, char **argv )
{
MPI_Comm server;
double buf[MAX_DATA];

MPI_Init( &argc, &argv );
MPI_Client_connect( MPI_COMM_WORLD, "cave:1234", &server );
while (!done) {
tag = 2; /* Action to perform */
MPI_Send( buf, n, MPI_DOUBLE, 0, tag, server );
/* etc */
}
MPI_Send( buf, 0, MPI_DOUBLE, 0, 1, server );
MPI_Comm_free( &server );
MPI_Finalize();
}
\end{verbatim}

If the server needs to manage multiple connections at once, it can use
\mpifunc{MPI\_IACCEPT} instead. The client need not be changed.