Proposal for Fault Tolerance

Anna Rounbehler (anna@sky.com)
Wed, 28 Aug 1996 15:41:49 -0400

Attached is a proposal for fault tolerance for MPI-RT.

Section 8.6 Fault Tolerance
-----------------------------

A fault tolerance policy may have three components: fault assessment, fault
detection and fault recovery. Non-recoverable error handlers contribute to
fault detection. Performance monitoring and operability assessment policies
identify erroneous data and indicate hardware failures in a system. Error
handlers are late-response indicators, whereas performance monitoring and
operability assessment are preventative.

To support a fault tolerance policy, consideration must be given to message
recovery when a message has failed or has been lost. In a system with a fault
tolerance policy, message failure or loss may be handled for all traffic by a
general recovery policy. For systems that do not have such a policy, the
capability may be provided for MPI communication. A communication fault occurs
when data is erroneous, data is lost, or latency exceeds the limit that
effective processing can tolerate. The following is proposed to handle MPI
communication faults.

1.0 Detecting MPI communication faults

1.1 This binding may be placed at strategic points in the application code to
check for dysfunctional MPI communication. The error codes will indicate which
process(es) in the comm group are suspect.

MPI_CHECK_COMM(comm, array_of_errcodes)
IN comm communicator handle
OUT array_of_errcodes one error/warning code per process

Advice to implementors:
-------------------------
The following is one of many methods that may be used to implement a check.
If any member of the comm group has not been accessed within TBD time, then a
warning may be issued. An access timer on each node can clock the time between
accesses. For each new access, the timer is reset.
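
The access-timer method above might be sketched as follows. This is only an
illustration under stated assumptions, not part of the proposed binding: the
names note_access and check_idle, the two code values, and the use of a plain
array of timestamps are all hypothetical, and the timeout stands in for the
TBD time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical codes; the proposal does not define concrete values. */
enum { CHECK_OK = 0, CHECK_WARN_IDLE = 1 };

/* Record an access to process `rank` at time `now`, resetting its timer. */
static void note_access(double *last_access, size_t rank, double now)
{
    last_access[rank] = now;
}

/*
 * Write one code per process into array_of_errcodes and return the number
 * of processes that have not been accessed within `timeout`.
 */
static size_t check_idle(const double *last_access, size_t nprocs,
                         double now, double timeout, int *array_of_errcodes)
{
    size_t suspects = 0;

    assert(timeout >= 0.0);
    for (size_t i = 0; i < nprocs; i++) {
        if (now - last_access[i] > timeout) {
            array_of_errcodes[i] = CHECK_WARN_IDLE;
            suspects++;
        } else {
            array_of_errcodes[i] = CHECK_OK;
        }
    }
    return suspects;
}
```

An implementation of MPI_CHECK_COMM along these lines would call something
like note_access on every send or receive involving a process, and something
like check_idle when the application invokes the check.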

1.2 This binding is optional and will invoke a policy to verify that
the whole communication group agrees that communication is down.

MPI_FAULT_POLICY(comm, keyval)
IN comm communicator handle
OUT flag indicates consensus (0,1) as to which
process(es) have faults
OUT array_of_errcodes one error code per process

Advice to implementors
-----------------------
May implement a Byzantine Generals algorithm for consensus.
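
A full Byzantine agreement protocol needs several rounds of message exchange
and is too long to show here. As a much simpler stand-in, the sketch below
shows only a final voting step: a plain majority vote over the suspicion
reports of all group members. It is not a Byzantine Generals algorithm and
does not tolerate lying reporters in the same way. The function name
majority_verdict, the report matrix layout, and the 0/1 code values are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * reports[r * nprocs + p] is nonzero if reporter r suspects process p.
 * Writes 1 into array_of_errcodes[p] when a strict majority of reporters
 * suspects p, otherwise 0. Returns the consensus flag: 1 if every reporter
 * gave the same answer about every process, 0 otherwise.
 */
static int majority_verdict(const int *reports, size_t nprocs,
                            int *array_of_errcodes)
{
    int unanimous = 1;

    assert(nprocs > 0);
    for (size_t p = 0; p < nprocs; p++) {
        size_t votes = 0;

        for (size_t r = 0; r < nprocs; r++) {
            if (reports[r * nprocs + p] != 0)
                votes++;
        }
        /* A strict majority of reporters must suspect p. */
        array_of_errcodes[p] = (2 * votes > nprocs) ? 1 : 0;
        /* Consensus holds only if the vote on p is all-or-nothing. */
        if (votes != 0 && votes != nprocs)
            unanimous = 0;
    }
    return unanimous;
}
```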

If the error codes agree between MPI_CHECK_COMM and MPI_FAULT_POLICY, the user may take action to create a new comm group or declare a fatal error.

If the user prefers to invoke a single fault check, then either MPI_CHECK_COMM
or MPI_FAULT_POLICY may be used alone.

The user may call MPI_COMM_RECREATE if a new comm group is desired.
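
The comparison of the two sets of error codes described above could look
roughly like the following. This is a sketch under assumptions: the enum, the
function name compare_checks, and the convention that a code of 0 means "no
fault" are hypothetical, and the proposal does not say what to do when the
two checks disagree.

```c
#include <assert.h>
#include <stddef.h>

enum fault_action {
    ACTION_NONE,        /* both checks agree: no process is suspect */
    ACTION_RECOVER,     /* both checks agree on at least one fault */
    ACTION_UNCONFIRMED  /* the checks disagree; handling is left open */
};

/*
 * Compare the per-process codes from the two checks.
 * Assumes a code of 0 means "no fault" (an assumption, not in the proposal).
 */
static enum fault_action compare_checks(const int *check_codes,
                                        const int *policy_codes,
                                        size_t nprocs)
{
    int any_fault = 0;

    assert(check_codes != NULL && policy_codes != NULL);
    for (size_t i = 0; i < nprocs; i++) {
        if (check_codes[i] != policy_codes[i])
            return ACTION_UNCONFIRMED;
        if (check_codes[i] != 0)
            any_fault = 1;
    }
    return any_fault ? ACTION_RECOVER : ACTION_NONE;
}
```

On ACTION_RECOVER the user would then choose between calling
MPI_COMM_RECREATE and declaring a fatal error.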

Seamless replacement without degradation is the ideal goal of fault tolerance
policies. Although this proposal detects faults, it does not address the data
lost between the fault and the corrective action (creating a new comm group).
