There are three choices for error handling in MPI_WAITALL:
1. After an error has been detected on some communication, MPI_WAITALL should
just wait for all other communications to complete, for as long as normal
completion takes. Problem: an error may have occurred that will prevent that
from ever happening. E.g., the process may have posted a send and a receive
to a process that has died. The send fails, but MPI_WAITALL never returns,
because the receive is still posted. (The receive may have source=dontcare,
so that MPI cannot fail the receive.)
2. After an error has occurred, MPI_WAITALL has the option of cancelling the
remaining pending communications, so that it can return with no communication
left pending. The error code returned will indicate one of three things:
   a. Successful completion
   b. Real error
   c. Successful cancellation
We need a new error code to indicate outcome (c).
3. After an error has occurred, MPI_WAITALL has the option of returning
immediately, without cancelling any of the still active communications. The
error code returned will indicate one of three things:
   a. Successful completion
   b. Real error
   c. Communication pending (MPI_ERR_PENDING)
The difference, from the user viewpoint, between options 2 and 3 is that with
2 the user has to repost the Send or Recv operation, whereas with 3 the user
just has to execute the Wait or Test again.
The difference, from the implementation viewpoint, is that error handling in
option 2 requires more cleanup (cancelling operations).
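
To make the user-side consequence of option 3 concrete, here is a minimal
sketch of how a caller might sort the per-request codes after WAITALL returns
with an error. The enum values and the functions classify() and
pending_indices() are hypothetical stand-ins invented for illustration, not
MPI constants or calls; REQ_PENDING plays the role of MPI_ERR_PENDING.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the three per-request outcomes under option 3.
   These are NOT MPI's constants; REQ_PENDING models MPI_ERR_PENDING. */
enum req_code { REQ_SUCCESS, REQ_ERROR, REQ_PENDING };

/* What the caller has to do next for a single request. */
enum next_step { STEP_DONE, STEP_RECOVER, STEP_WAIT_AGAIN };

/* Map one per-request code to the caller's next step. */
enum next_step classify(enum req_code code)
{
    switch (code) {
    case REQ_SUCCESS: return STEP_DONE;        /* nothing left to do       */
    case REQ_PENDING: return STEP_WAIT_AGAIN;  /* same request, Wait again */
    default:          return STEP_RECOVER;     /* real error               */
    }
}

/* Copy the indices of still-pending requests into out[] (capacity n) and
   return how many there are. Under option 3 these are exactly the requests
   to hand to the next Wait or Test; nothing has to be reposted. */
size_t pending_indices(const enum req_code *codes, size_t n, size_t *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (classify(codes[i]) == STEP_WAIT_AGAIN) {
            out[k++] = i;
        }
    }
    return k;
}
```

Under option 2 the third outcome would instead mean "cancelled", and the
corresponding step would be to repost the Send or Recv rather than to wait
again.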
I believe that option 1 is not acceptable, since it creates situations where,
by definition, there can be no error recovery. Option 2 is questionable:
cancelling communications is expensive, and the cancel itself may fail (e.g.,
in the scenario I outlined before), which brings us back to where we started.
3 has the advantage that, once an error has been detected, then the WAITALL
operation can be completed with only very simple local processing (stuffing
information in statuses), and the error handler can be invoked asap. I think
this is an important goal: namely that, once an error is detected, error
recovery can be initiated asap, without requiring any further communication.
The current version of the MPI doc is written assuming option 3. But this is
an issue that was not discussed at the last meeting. Therefore, I want to
flag it and make sure it is discussed.
Btw, the text should make clear that "Pending" does not imply that the
operation is still incomplete (it may have completed by the time we check the
status). It just means it was pending when the status was updated. Note that
there is no race condition: the MPI_ERR_PENDING flag is in the status; when
the operation completes, MPI updates the request. The next time the user
posts a Wait, it will get the right information in the (new) status.
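
The snapshot behaviour can be illustrated with a toy model. Everything in it
(the toy_ types and toy_waitall()) is hypothetical and only mimics the
semantics proposed here; it is not how any MPI library is implemented. It
assumes an error has already been detected, so the blocking part of WAITALL
is not modelled.

```c
#include <stddef.h>

/* Hypothetical toy model; none of these names belong to MPI. */
enum toy_code  { TOY_SUCCESS, TOY_ERROR, TOY_PENDING };
enum toy_state { TOY_ACTIVE, TOY_COMPLETE, TOY_FAILED };

struct toy_request { enum toy_state state; };  /* owned by the "library" */
struct toy_status  { enum toy_code  code;  };  /* owned by the user      */

/* Option 3 in miniature: fill each status from the request's state at this
   instant, using local processing only. Returns 1 if any request failed,
   0 otherwise. A real WAITALL would block while requests are active and
   nothing has failed; that part is deliberately left out. */
int toy_waitall(const struct toy_request *reqs, struct toy_status *st,
                size_t n)
{
    int any_error = 0;
    for (size_t i = 0; i < n; i++) {
        switch (reqs[i].state) {
        case TOY_COMPLETE:
            st[i].code = TOY_SUCCESS;
            break;
        case TOY_FAILED:
            st[i].code = TOY_ERROR;
            any_error = 1;
            break;
        case TOY_ACTIVE:
            st[i].code = TOY_PENDING;  /* snapshot, may go stale later */
            break;
        }
    }
    return any_error;
}
```

In this model, if a request that was reported as pending later moves to
TOY_COMPLETE, the old status still reads TOY_PENDING; a second call on that
request with a fresh status reports TOY_SUCCESS.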