7.12. Nonblocking Collective Operations


As described in Section Nonblocking Communication, performance of many applications can be improved by overlapping communication and computation, and many systems enable this. Nonblocking collective operations combine the potential benefits of nonblocking point-to-point operations, to exploit overlap and to avoid synchronization, with the optimized implementation and message scheduling provided by collective operations [35,39]. One way of doing this would be to perform a blocking collective operation in a separate thread. An alternative mechanism that often leads to better performance (e.g., avoids context switching, scheduler overheads, and thread management) is to use nonblocking collective communication [37].

The nonblocking collective communication model is similar to the model used for nonblocking point-to-point communication. A nonblocking call initiates a collective operation, which must be completed in a separate completion call. Once initiated, the operation may progress independently of any computation or other communication at participating MPI processes. In this manner, nonblocking collective operations can mitigate possible synchronizing effects of collective operations by running them in the ``background.'' In addition to enabling communication-computation overlap, nonblocking collective operations can perform collective operations on overlapping communicators, which would lead to deadlocks with blocking operations. Their semantic advantages can also be useful in combination with point-to-point communication.

As in the nonblocking point-to-point case, all calls are local and return immediately, irrespective of the status of other MPI processes. The call initiates the operation, which indicates that the system may start to copy data out of the send buffer and into the receive buffer. Once initiated, all associated send buffers and buffers associated with input arguments (such as arrays of counts, displacements, or datatypes in the vector versions of the collectives) should not be modified, and all associated receive buffers should not be accessed, until the collective operation completes. The call returns a request handle, which must be passed to a completion call.

All completion calls (e.g., MPI_WAIT) described in Section Communication Completion are supported for nonblocking collective operations. Similarly to the blocking case, nonblocking collective operations are considered to be complete when the local part of the operation is finished, i.e., for the caller, the semantics of the operation are guaranteed and all buffers can be safely accessed and modified. Completion does not indicate that other MPI processes have completed or even started the operation (unless otherwise implied by the description of the operation). Completion of a particular nonblocking collective operation also does not indicate completion of any other posted nonblocking collective (or send-receive) operations, whether they are posted before or after the completed operation.

Advice to users.

Users should be aware that implementations are allowed, but not required (with exception of MPI_IBARRIER), to synchronize MPI processes during the completion of a nonblocking collective operation. ( End of advice to users.)
Upon returning from a completion call in which a nonblocking collective operation completes, the values of the MPI_SOURCE and MPI_TAG fields in the associated status object, if any, are undefined. The value of MPI_ERROR may be defined, if appropriate, according to the specification in Section Return Status. It is valid to mix different request types (i.e., any combination of collective requests, I/O requests, generalized requests, or point-to-point requests) in functions that enable multiple completions (e.g., MPI_WAITALL). It is erroneous to call MPI_REQUEST_FREE or MPI_CANCEL for a request associated with a nonblocking collective operation. Nonblocking collective requests created using the APIs described in this section are not persistent. However, persistent collective requests can be created using persistent collective operations described in Sections Persistent Collective Operations and Persistent Neighborhood Communication on Process Topologies.


Rationale.

Freeing an active nonblocking collective request could cause similar problems as discussed for point-to-point requests (see Section Communication Completion). Cancelling a request is not supported because the semantics of this operation are not well-defined. ( End of rationale.)
Multiple nonblocking collective operations can be outstanding on a single communicator. If the nonblocking call causes some system resource to be exhausted, then it will fail and raise an error. Quality implementations of MPI should ensure that this happens only in pathological cases. That is, an MPI implementation should be able to support a large number of pending nonblocking operations.

Unlike point-to-point operations, nonblocking collective operations do not match with blocking collective operations, and collective operations do not have a tag argument. All MPI processes must call collective operations (blocking and nonblocking) in the same order per communicator. In particular, once an MPI process calls a collective operation, all other MPI processes in the communicator must eventually call the same collective operation, with no other collective operation on the same communicator in between. This is consistent with the ordering rules for blocking collective operations in threaded environments.


Rationale.

Matching blocking and nonblocking collective operations is not allowed because the implementation might use different communication algorithms for the two cases. Blocking collective operations may be optimized for minimal time to completion, while nonblocking collective operations may balance time to completion with CPU overhead and asynchronous progress.

The use of tags for collective operations can prevent certain hardware optimizations. ( End of rationale.)

Advice to users.

If program semantics require matching blocking and nonblocking collective operations, then a nonblocking collective operation can be initiated and immediately completed with a blocking wait to emulate blocking behavior. ( End of advice to users.)
In terms of data movement, each nonblocking collective operation has the same effect as its blocking counterpart for intra-communicators and inter-communicators after completion. Likewise, upon completion, nonblocking collective reduction operations have the same effect as their blocking counterparts, and the same restrictions and recommendations on reduction orders apply.

The use of the ``in place'' option is allowed exactly as described for the corresponding blocking collective operations. When using the ``in place'' option, message buffers function as both send and receive buffers. Such buffers should not be modified or accessed until the operation completes.

The progress rules for nonblocking collective operations are similar to the progress rules for nonblocking point-to-point operations; refer to Sections Progress and Semantics of Nonblocking Communication Operations.

Advice to implementors.

Nonblocking collective operations can be implemented with local execution schedules [38] using nonblocking point-to-point communication and a reserved tag-space. ( End of advice to implementors.)



(Unofficial) MPI-4.1 of November 2, 2023