> I think I understand the MPI specs, but I'm more interested now in
> the MPI philosophy and future directions. Thanks. - Gene Cooperman
You should look closely at the "Dynamic Processes" chapter in the
current MPI-2 proposal. This aims to provide mechanisms for
dynamically creating new groups of MPI processes. The whole document
will be available at SC'96, but you should be able to get working
drafts from
> http://www.cs.wisc.edu/~lederman/mpi2/mpi2-report.ps.Z and
> ftp://ftp.cs.wisc.edu/pub/lederman/mpi2/mpi2-report.ps.Z. (Note: This
> is not where the chapter releases have occurred.) These directories
> also have all the sources and figures.
However, even with dynamic processes we have not found a way to make
MPI resilient to process failure (which seems to be what you are
really after).
The problems here are:
1) MPI addresses processes by rank in a communicator. If one of the
processes in the communicator vanishes, there is no method for
informing the other processes, so the communicator becomes broken.
(Imagine the effect if a process crashes in the middle of a
collective operation...)
2) MPI communicators never change their membership. (This remains true
even with dynamic processes.) This is highly beneficial both for
specifying the semantics and for allowing efficient
implementations. It does not, however, make fault tolerance when
processes vanish easy (or maybe even possible).
3) MPI does not mandate any specific implementation technique
(whereas PVM *is* an implementation, not a specification).
The benefit of this is that MPI can run efficiently in many
environments, for instance workstation clusters and also MPP
machines with their own resource management and process startup
mechanisms. The disadvantage is that MPI cannot (and should not
IMHO) require local daemon processes (which would be useful for
detecting erroneous user processes).
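
To make problems 1) and 2) concrete, here is a small toy model in
Python. It is not MPI and calls no MPI library: the class
ToyCommunicator and its methods kill and allreduce_sum are hypothetical
names invented for this sketch. It only illustrates that a group with
frozen membership, addressed by rank, has no way to complete a
collective once a member has silently gone away.

```python
# Toy model only: this is NOT MPI and uses no MPI library.
# All names here (ToyCommunicator, kill, allreduce_sum) are hypothetical.

class ToyCommunicator:
    def __init__(self, process_ids):
        # Membership is frozen at creation, mirroring the rule that a
        # communicator never changes its membership.
        self._members = tuple(process_ids)
        # Liveness is tracked by the simulation, not visible to the ranks.
        self._alive = set(self._members)

    @property
    def size(self):
        # The size is derived from the frozen membership, so it does not
        # shrink when a process dies.
        return len(self._members)

    def kill(self, rank):
        # Simulate a crash. Nothing informs the remaining ranks.
        self._alive.discard(self._members[rank])

    def allreduce_sum(self, contributions):
        # A collective needs one contribution from every rank in the group.
        # Returning None is a stand-in for "cannot complete"; what a real
        # MPI implementation would do at this point is not modelled.
        total = 0
        for rank in range(self.size):
            if self._members[rank] not in self._alive:
                return None
            total += contributions[rank]
        return total


comm = ToyCommunicator(["p0", "p1", "p2", "p3"])
healthy_result = comm.allreduce_sum([1, 2, 3, 4])  # all four ranks take part

comm.kill(2)  # rank 2 vanishes; the group still claims four members
broken_result = comm.allreduce_sum([1, 2, 3, 4])   # can no longer complete

print(healthy_result, comm.size, broken_result)
```

The point of the sketch is that the only repairs available inside the
model are to change the membership or to tell the survivors about the
failure, and those are exactly the two things described above as
missing.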
The reasons for some of these decisions are perhaps based on a view
that the environment in which MPI programs will run is likely to be
closely coupled: an SMP machine, an MPP, or some other hardware
with a high-bandwidth, low-latency interconnect. By definition, a
machine with a very low latency interconnect will be in one room (light
is too slow for this not to be true). Such a machine is also likely to be
administered in a coherent manner, and to be logically a single
entity. Therefore tolerance of single process failures is seen to be an
issue of much less importance than being able to achieve communication
with low latency.
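
To put a rough number on the "light is too slow" remark, assume a
one-way latency budget of 1 microsecond. That figure is an
illustrative assumption, not one taken from any particular machine.
The farthest a signal could travel in that time, even at the speed of
light in vacuum, is:

```latex
d_{\max} = c\,t \approx \left(3 \times 10^{8}\ \mathrm{m/s}\right)
           \times \left(1 \times 10^{-6}\ \mathrm{s}\right) = 300\ \mathrm{m}
```

Signals in real cables travel slower than c, and switching and software
overheads use up part of the same budget, so the practical distance is
considerably shorter than this upper bound.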
MPI is not trying to replace sockets for network computing, but rather to
provide a portable way of achieving low latency communication in
tightly coupled applications.
-- Jim
James Cownie
Dolphin Interconnect Solutions
Phone : +44 117 9071438
E-Mail: jcownie@dolphinics.com