13.7.3. Progress


One-sided communication has the same progress requirements as point-to-point communication: once a communication is enabled, it is guaranteed to complete. RMA calls must have local semantics, except when required for synchronization with other RMA calls.

There is some fuzziness in the definition of the time when an RMA communication becomes enabled. This fuzziness provides the implementor with more flexibility than with point-to-point communication. Access to a target window becomes enabled once the corresponding synchronization (such as MPI_WIN_FENCE or MPI_WIN_POST) has executed. On the origin process, an RMA communication operation may become enabled as soon as the corresponding put, get, or accumulate call has occurred, or as late as when the ensuing synchronization call is issued. Once the operation is enabled both at the origin and at the target, the operation must complete.

Consider the code fragment in Example General Active Target Synchronization. Some of the calls may have to delay their return until the target window has been posted. However, if the target window is posted, then the code fragment must complete. The data transfer may start as soon as the put call occurs, but may be delayed until the ensuing complete call occurs.
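In outline, that pattern looks like the following sketch (not the original example; the ranks, counts, and group names are illustrative):

    /* Origin (e.g., rank 0): start may delay its return until the
       target has posted; the data transfer may begin at the put or
       be delayed until the complete. */
    MPI_Win_start(target_group, 0, win);
    MPI_Put(&a, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_complete(win);

    /* Target (e.g., rank 1): post enables access to the window;
       wait returns once the origin has called complete. */
    MPI_Win_post(origin_group, 0, win);
    MPI_Win_wait(win);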

Consider the code fragment in Example Lock. Some of the calls may delay their return until the lock is acquired if another MPI process holds a conflicting lock. However, if no conflicting lock is held, then the code fragment must complete.
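In outline, the passive target pattern looks like the following sketch (not the original example; the lock type, ranks, and counts are illustrative):

    /* Origin: the lock call may delay its return while another
       MPI process holds a conflicting lock on the same window. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
    MPI_Put(&a, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_unlock(1, win);   /* completes the put at origin and target */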

Consider the code illustrated in Figure 32.

[Figure 32: Symmetric communication]

Each MPI process updates the window of the other MPI process using a put operation, then accesses its own window. The post calls are local. Once the post calls occur, RMA access to the windows is enabled, so that each MPI process should complete the sequence of start-put-complete. Once these are done, the wait calls should complete at both MPI processes. Thus, this communication should not deadlock, irrespective of the amount of data transferred.
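A sketch of this symmetric pattern, executed identically by both MPI processes (names and counts are illustrative; peer denotes the rank of the other MPI process, and peer_group contains only that process):

    MPI_Win_post(peer_group, 0, win);   /* local: enables the peer's access */
    MPI_Win_start(peer_group, 0, win);
    MPI_Put(&a, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
    MPI_Win_complete(win);
    MPI_Win_wait(win);                  /* returns once the peer completes  */
    /* ... access the local window ... */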

Assume, in the last example, that the order of the post and start calls is reversed at each MPI process. Then the code may deadlock, as each MPI process may block in the start call while waiting for the matching post to occur. Similarly, the program will deadlock if the order of the complete and wait calls is reversed at each MPI process.

The following two examples illustrate the fact that the synchronization between complete and wait is not symmetric: the wait call returns only once the complete occurs, but not vice versa. Consider the code illustrated in Figure 33.

[Figure 33: Deadlock situation]

This code will deadlock: the wait of process 1 completes only once process 0 calls complete, and the receive of process 0 completes only once process 1 calls send.
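In outline, the schedule of Figure 33 is the following sketch (names, counts, and tags are illustrative):

    if (rank == 0) {                   /* process 0 */
      MPI_Win_start(group1, 0, win);
      MPI_Put(&a, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
      MPI_Recv(&b, 1, MPI_INT, 1, 0, comm,
               MPI_STATUS_IGNORE);     /* blocks: the send comes after wait */
      MPI_Win_complete(win);
    } else {                           /* process 1 */
      MPI_Win_post(group0, 0, win);
      MPI_Win_wait(win);               /* blocks: complete comes after recv */
      MPI_Send(&b, 1, MPI_INT, 0, 0, comm);
    }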

Consider, on the other hand, the code illustrated in Figure 34.

[Figure 34: No deadlock]

This code will not deadlock. Once process 1 calls post, then the sequence start-put-complete on process 0 can proceed. Process 0 will reach the send call, allowing the receive call of process 1 to return.
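In outline, the schedule of Figure 34 is the following sketch (names, counts, and tags are illustrative):

    if (rank == 0) {                   /* process 0 */
      MPI_Win_start(group1, 0, win);
      MPI_Put(&a, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
      MPI_Win_complete(win);           /* can return once rank 1 has posted */
      MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
    } else {                           /* process 1 */
      MPI_Win_post(group0, 0, win);
      MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
      MPI_Win_wait(win);               /* rank 0 has already completed */
    }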


Rationale.

MPI implementations must guarantee that an MPI process makes progress on all enabled communications it participates in, while blocked on an MPI call. This is true for send-receive communication and applies to RMA communication as well. Thus, in the example in Figure 34, the put and complete calls of process 0 should complete while process 1 is waiting for the receive operation to complete. This may require the involvement of process 1, e.g., to transfer the data.

A similar issue is whether such progress must occur while an MPI process is busy computing, or blocked in a non-MPI call. Suppose that in the last example the send-receive pair is replaced by a write-to-socket/read-from-socket pair. Then MPI does not specify whether deadlock is avoided. Suppose that the blocking receive of process 1 is replaced by a very long compute loop. Then, according to one interpretation of the MPI standard, process 0 must return from the complete call after a bounded delay, even if process 1 does not reach any MPI call in this period of time. According to another interpretation, the complete call may block until process 1 reaches the wait call, or reaches another MPI call. The qualitative behavior is the same under both interpretations, unless an MPI process is caught in an infinite compute loop, in which case the difference may not matter. However, the quantitative expectations are different. Different MPI implementations reflect these different interpretations. While this ambiguity is unfortunate, the MPI Forum decided not to define which interpretation of the standard is the correct one, since the issue is contentious. See also Section Progress on progress. (End of rationale.)
The use of shared memory loads and/or stores for synchronization purposes between MPI processes does not guarantee progress; a deadlock may therefore occur if an MPI implementation does not provide strong progress, as shown in Example Progress.


Example Possible deadlock due to the use of a shared memory variable for synchronization.

comm_sm shall be a shared memory communicator (e.g., returned from a call to MPI_COMM_SPLIT_TYPE with split_type = MPI_COMM_TYPE_SHARED) with at least two MPI processes. win_sm is a shared memory window whose window portion in the MPI process with rank 0 contains AckInRank0. The ranks in comm_sm and win_sm should be the same. According to the rules in Section Semantics and Correctness, a volatile store to AckInRank0 will be visible in the other MPI process without further RMA calls.

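The following C sketch reconstructs the example from the surrounding description (the buffer size, the tag, and the use of MPI_Win_shared_query to obtain a pointer to the window portion of rank 0 are illustrative assumptions):

    MPI_Aint sz;
    int du, data = 0, rank;
    int *ack_ptr;
    char bsend_buf[1024 + MPI_BSEND_OVERHEAD];
    char *detach_addr;
    int detach_size;

    MPI_Comm_rank(comm_sm, &rank);
    MPI_Win_shared_query(win_sm, 0, &sz, &du, &ack_ptr);
    volatile int *AckInRank0 = ack_ptr;  /* flag in rank 0's window portion */

    if (rank == 0) {
      *AckInRank0 = 0;
      MPI_Buffer_attach(bsend_buf, (int)sizeof(bsend_buf));
      MPI_Bsend(&data, 1, MPI_INT, 1, 0, comm_sm); /* returns after buffering */
      while (*AckInRank0 == 0)
        ;                                /* spins without calling MPI        */
      MPI_Buffer_detach(&detach_addr, &detach_size); /* never reached        */
    } else if (rank == 1) {
      MPI_Recv(&data, 1, MPI_INT, 0, 0, comm_sm,
               MPI_STATUS_IGNORE);       /* blocks waiting for the Bsend data */
      *AckInRank0 = 1;                   /* volatile store; never executed   */
    }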

While the call to MPI_Recv in the MPI process with rank 1 delays its return (until an unspecified MPI procedure call in the MPI process with rank 0 happens to send the buffered data), the subsequent statement cannot change the value of the shared window buffer AckInRank0. As long as this value is not changed, the while loop in the MPI process with rank 0 will continue, and therefore the next MPI procedure call (MPI_Buffer_detach) cannot happen, which results in a deadlock.

Note that the two communication patterns, (A) BSEND-RECV-DETACH and (B) the shared memory store/load used for synchronization, can reside in different software layers, and each layer would work properly on its own, but the combination of (A) and (B) can cause the deadlock.

