The first aspect is that the continuability after errors is generally
NOT associated with an error code, but with the communicator state.
For example, consider a rejection "insufficient store" or "too many
outstanding requests".
It will very often be possible to retry after this, provided that it
was due to a temporary logjam. But it may not be, especially if the
failure is due to a key server not responding. And identical errors
that occur under different circumstances may need different recovery
strategies.
The right strategy will very often differ according to the communicator;
for example, one process may be on a system that uses a fixed request
pool (and so such a failure is fatal) but another may be on one which
has a "busy - try later" return.
Furthermore, some such errors will affect only the individual operation,
others will affect only the communicator and others will indicate that
the local MPI environment is knotted.
That is why I suggested a function that inquires about the state of a
communicator and returns not a simple yes or no, but an indication of
what has to be done to continue.
Incidentally, speaking as a run-time system implementor, the function
doesn't ask for any information that won't be held internally. All I am
suggesting is that the programmer be told how the MPI implementation
regards the current state.
The second aspect is that of context-dependent messages. The standard
joke about Unix is that it has only three error messages: "can't",
"shan't" and "didn't".
I currently have a problem where there is a failure SOMEWHERE in the IP
or UDP stack, but have no way of finding out anything more.
As dynamic processes provide PVM-like facilities, there will be an
increasing need to provide an indication of WHERE the failure occurred.
If you have an intercommunicator covering 6 vendors' systems, the
failure "insufficient store" isn't exactly helpful.
MPI needs to provide some way that the implementation can pass arbitrary
text back to the programmer, so that it can be written out and taken to
the support staff or MPI implementors. There is clearly no way that MPI
can specify what the information will say, but any decent implementation
will at least indicate which processes were involved!
And please note that I am thinking as an implementor, because one of the
main purposes of this information is to enable problems to be reported
in a useful way. An error report that says only "I got a request
rejected message" gives the implementor nothing to work with.
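As a rough illustration of how such text could reach a problem report, here is a small C sketch. It assumes a caller-supplied buffer and mimics the argument order of the MPI_ERROR_COMM_MESSAGE call proposed later in this note. Everything in it is hypothetical: mock_error_comm_message, format_report, comm_handle, MSG_MAX and the message text itself are invented for the example, and no MPI library is involved.

```c
#include <stdio.h>
#include <string.h>

#define MSG_MAX 256          /* hypothetical limit on the message length */

typedef int comm_handle;     /* stand-in for a communicator handle */

/* Mock with the argument shape of the proposed MPI_ERROR_COMM_MESSAGE.
 * A real implementation would supply its own text; this text is invented. */
void mock_error_comm_message(comm_handle comm, char *message, int *length)
{
    static const char text[] =
        "insufficient store: process 3 could not queue a send to process 5";
    (void)comm;                              /* unused in the mock */
    snprintf(message, MSG_MAX, "%s", text);  /* always NUL-terminated */
    *length = (int)strlen(message);
}

/* Builds one line suitable for pasting into a problem report.
 * Returns the value snprintf returns: the length the line needs,
 * which exceeds out_size - 1 if the line was truncated. */
int format_report(char *out, size_t out_size, comm_handle comm)
{
    char message[MSG_MAX];
    int length = 0;

    mock_error_comm_message(comm, message, &length);
    return snprintf(out, out_size,
                    "MPI error on communicator %d: %.*s",
                    comm, length, message);
}
```

The caller only copies the text through; it never tries to interpret it, which is the point of leaving the content to the implementation.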
I have also been thinking about the interface, and believe that the call
to clear the errors is unnecessary and can be dropped (it was there
because I was thinking in C terms) and that the error indication and
messages should be requested separately. So here is a minimal syntax:
MPI_ERROR_COMM_STATE (comm, code, severity, scope)

    IN   comm       Communicator
    OUT  code       Error code associated with the communicator
    OUT  severity   Severity of the error state
    OUT  scope      Scope of the error state

    Values of severity:

    MPI_ERR_IGNORABLE     No special action is needed
    MPI_ERR_RECOVERABLE   Specific action is needed
    MPI_ERR_RESTARTABLE   All outstanding operations must be abandoned
    MPI_ERR_CORRUPTED     This is beyond hope

    Values of scope:

    MPI_ERR_ACTION        The failure affects only the operation
    MPI_ERR_LOCAL         The failure affects only the local processor
    MPI_ERR_GLOBAL        The failure affects the whole communicator
    MPI_ERR_UNIVERSAL     The failure is not localised

MPI_ERROR_COMM_MESSAGE (comm, message, length)

    IN   comm       Communicator
    OUT  message    Context-dependent messages
    OUT  length     Length of messages returned
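To make the intent of the severity and scope values concrete, here is a minimal C sketch of how a caller might turn them into a decision. It is only one possible policy, not part of the proposal. The enumerator names are taken from the list above, but the C types (err_severity, err_scope, next_step, comm_handle), the STEP_* values and the functions mock_error_comm_state and choose_next_step are all invented for illustration, and the mock always reports the same state.

```c
/* Hypothetical C rendering of the proposed values; none of this is
 * part of any MPI standard. */
typedef enum {
    MPI_ERR_IGNORABLE,    /* no special action is needed */
    MPI_ERR_RECOVERABLE,  /* specific action is needed */
    MPI_ERR_RESTARTABLE,  /* all outstanding operations must be abandoned */
    MPI_ERR_CORRUPTED     /* beyond hope */
} err_severity;

typedef enum {
    MPI_ERR_ACTION,       /* only the operation is affected */
    MPI_ERR_LOCAL,        /* only the local processor is affected */
    MPI_ERR_GLOBAL,       /* the whole communicator is affected */
    MPI_ERR_UNIVERSAL     /* the failure is not localised */
} err_scope;

/* What the caller decides to do next (names invented for this sketch). */
typedef enum {
    STEP_CONTINUE,
    STEP_RETRY_OPERATION,
    STEP_REPAIR_THEN_RETRY,
    STEP_ABANDON_OUTSTANDING,
    STEP_ABORT
} next_step;

typedef int comm_handle;  /* stand-in for a communicator handle */

/* Mock with the argument shape of the proposed MPI_ERROR_COMM_STATE.
 * It always reports a temporary problem confined to one operation. */
void mock_error_comm_state(comm_handle comm, int *code,
                           err_severity *severity, err_scope *scope)
{
    (void)comm;           /* unused in the mock */
    *code = 1;            /* arbitrary placeholder error code */
    *severity = MPI_ERR_RECOVERABLE;
    *scope = MPI_ERR_ACTION;
}

/* One possible mapping from the reported state to a recovery step. */
next_step choose_next_step(err_severity severity, err_scope scope)
{
    switch (severity) {
    case MPI_ERR_IGNORABLE:
        return STEP_CONTINUE;
    case MPI_ERR_RECOVERABLE:
        /* If only the operation failed, retrying it may be enough;
         * a wider scope needs some repair before any retry. */
        return scope == MPI_ERR_ACTION ? STEP_RETRY_OPERATION
                                       : STEP_REPAIR_THEN_RETRY;
    case MPI_ERR_RESTARTABLE:
        return STEP_ABANDON_OUTSTANDING;
    case MPI_ERR_CORRUPTED:
        return STEP_ABORT;
    }
    return STEP_ABORT;    /* unknown severity: take the safest option */
}
```

A caller would invoke the state inquiry after a failed operation and switch on the result of choose_next_step, rather than trying to guess continuability from the error code alone.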
Nick Maclaren,
University of Cambridge Computer Laboratory,
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England.
Email: nmm1@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679