Re: dead sockets in MPICH

Gene Cooperman (gene@ccs.neu.edu)
Wed, 16 Oct 1996 12:30:12 -0400

Thanks very much for the insights into MPI. I will certainly look up the
chapter on dynamic processes for MPI-2. I'd like to quote part of the kind
response from James Cownie, and point out the way it fits into my own
applications. I hope that it helps you to hear about other applications
and the ways that they might fit into MPI-2.

> The reasons for some of these decisions are perhaps based on a view
> that the environment in which MPI programs will run is likely to be
> closely coupled, so an SMP machine, an MPP, or some other hardware
> with high bandwidth low latency interconnect.
> ...
> MPI is not trying to replace sockets for network computing, rather to
> provide a portable way of achieving low latency communication in
> tightly coupled applications.

I understand that MPI is still intended to include NOW's, but that the
greater opportunity may lie in SMP's, MPP's and other low latency hardware.
These are the places where a good MPI implementation can improve the
performance of existing message-passing applications, whereas MPI will not
contribute a performance improvement over sockets for network programming.

Of course, another target of MPI is portability across multiple architectures.
Hence, one can develop on a NOW, and run production jobs on an MPP.
For myself, it is valuable to program on NOW's, because we
have student labs with tens of computers whose CPU cycles are essentially free
and unused at night and even much of the day. I could have programmed
directly with sockets. However, MPI provides greater portability (perhaps
to an MPP, etc.), and it also provides higher level abstractions that ease
the burden of programming.

With this in mind, I now have an undergraduate student writing a subset
implementation of MPI intended for NOW's with high latency. It will include
the point-to-point layer, MPI_Init(), and not much more. In addition to
UNIX workstations, we would like to look at possible ports to Windows.
Naturally, we intend to contribute the implementation as free code for any
who would find it useful.

The perceived advantages are (1) a minimal implementation that can be easily
distributed as part of a larger application, and (2) a small implementation
that can easily be examined and modified by students and less sophisticated
users. Because it remains a strict MPI subset, any site could replace it with
a better, low latency MPI implementation on higher performance hardware.

The point of an implementation whose internals can easily be studied has
another side benefit. As part of my own work, I have developed a higher level
task-oriented parallel abstraction on top of GCL (LISP) integrated with MPICH.
Both GCL and MPICH are large software packages that each have their own
ideas of how to handle malloc/sbrk, signals, external interrupts, etc.
When something doesn't work, I don't know whether it's a bug in GCL, a bug
in MPICH, a bug in my own code layer, or an incompatibility of assumptions
among them. A small MPI subset implementation helps in deciding such questions.
(For a further description of my work,
see http://www.ccs.neu.edu/home/gene/papers.html )

For the future, once one considers integration with an interactive language,
it is clear that still other issues can arise. For example, suppose one
remote process corrupts its memory (and possibly loses its sockets). One
would still want to diagnose the application state using the remaining
processes.

Anyway, I am a strong believer in the benefits of MPI, and I appreciate
hearing about the philosophical directions. I hope my own experience helps
you by providing one more data point. Thank you. - Gene