>
> >If we want MPI_ERRORS_ARE_FATAL to abort the entire
> >computation, then we should say that it behaves as if
> >MPI_ABORT(MPI_COMM_WORLD, somecodenumber) was called. I would suggest the
> >second interpretation.
>
> Been there with PVM, and we found aborting the "world" to be a bad thing.
> I suggest "at most" just processes in COMM are aborted.
> For MPI-1 picking either interpretation is OK
> but looking ahead to MPI-2 with client servers
> I wouldn't want a failed client to cause the server to be killed.
> (this is what we ran into and had to fix in PVM)
>
> Looking even further ahead (MPI-3?) if we incorporate some fault tolerance
> then we don't want choices made in MPI-1 to hinder this work.
>
> Al Geist
>
This would require a (modest) change in the current MPI-1 text. The current
text allows implementations of MPI_ABORT(comm,...) that kill either only the
processes in comm or all processes in MPI_COMM_WORLD. (The current text seems
to imply that it is even OK to have an implementation where only
comm=MPI_COMM_WORLD is accepted, which seems strange.) The behavior which I
think you want is that MPI_ABORT(comm,...) always kills only the processes in
comm, i.e., a restriction of the set of allowable implementations.

A separate question is how the default error handler, MPI_ERRORS_ARE_FATAL,
should behave. I would argue that the default error handler should keep
killing the entire computation; otherwise naive users would be faced with the
strange behavior where a program error causes some, but not all, of the
processes to disappear. Users can always define new error handlers that kill
only the processes in the communicator group. Or, if we wish, we could add
such new error handlers (MPI_ERRORS_ARE_GROUP_FATAL,
MPI_ERRORS_ARE_PROCESS_FATAL?).
-------------------
Marc Snir
IBM T.J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598
email: snir@watson.ibm.com
phone: 914-945-3204
fax: 914-945-4425