20.1.17. Problems with Code Movement and Register Optimization

20.1.17.1. Nonblocking Operations

If a variable is local to a Fortran subroutine (i.e., not in a module or a COMMON block), the compiler will assume that it cannot be modified by a called subroutine unless it is an actual argument of the call. In the most common linkage convention, the subroutine is expected to save and restore certain registers. Thus, the optimizer will assume that a register that held a valid copy of such a variable before the call will still hold a valid copy on return.

Example: Fortran 90 register optimization (extreme case).

[Multi-column code listing not reproduced in this text version.]

The example above shows extreme, but allowed, possibilities. MPI_WAIT on a concurrent thread modifies buf between the invocation of MPI_IRECV and the completion of MPI_WAIT. But the compiler cannot see any possibility that buf can be changed after MPI_IRECV has returned, and may schedule the load of buf earlier than it appears in the source. The compiler has no reason to avoid using a register to hold buf across the call to MPI_WAIT. It also may reorder the instructions as illustrated in the rightmost column.
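As a hedged illustration of the hazard just described, the following sketch shows the source pattern, with comments marking where an optimizer would be permitted to place the load of buf. It is not the listing from the standard. The program name, the ranks, the tag, and the value 42.0 are illustrative assumptions, and the sketch assumes an MPI library that provides the mpi_f08 module and a run with at least two processes.

```fortran
program irecv_register_hazard
  use mpi_f08
  implicit none
  real :: buf, b1                 ! buf is local and NOT protected
  type(MPI_Request) :: req
  integer :: rank, nprocs

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs)
  buf = 0.0

  if (nprocs >= 2 .and. rank == 0) then
    call MPI_Irecv(buf, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, req)
    ! buf is not an argument of MPI_Wait, so a compiler may legally load
    ! buf into a register here, or even execute "b1 = buf" here.
    call MPI_Wait(req, MPI_STATUS_IGNORE)
    b1 = buf                      ! may therefore observe the stale value
    print *, 'rank 0 read b1 =', b1
  else if (nprocs >= 2 .and. rank == 1) then
    buf = 42.0                    ! illustrative payload
    call MPI_Send(buf, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD)
  end if

  call MPI_Finalize()
end program irecv_register_hazard
```

Whether a given compiler actually performs this reordering depends on the compiler and its optimization level; the sketch only marks where the reordering is permitted.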

Example: Similar example with MPI_ISEND.

[Code listing not reproduced in this text version.]

Due to valid compiler code movement optimizations in the MPI_ISEND example above, the content of buf may already have been overwritten by the time it is sent. The code movement is permitted because the compiler cannot detect a possible access to buf in MPI_WAIT (or in a second thread between the start of MPI_ISEND and the end of MPI_WAIT).

Such register optimization is based on moving code; here, the access to buf was moved from after MPI_WAIT to before MPI_WAIT. Note that code movement may also occur across subroutine boundaries when subroutines or functions are inlined.

This register optimization/code movement problem for nonblocking operations does not occur with MPI parallel file I/O split collective operations, because in the MPI_XXX_BEGIN and MPI_XXX_END calls, the same buffer has to be provided as an actual argument. The register optimization/code movement problem for MPI_BOTTOM and derived MPI datatypes may occur in each blocking and nonblocking communication call, as well as in each parallel file I/O operation.

20.1.17.2. Persistent Operations

With persistent requests, the buffer argument is hidden from the MPI_START and MPI_STARTALL calls; that is, the Fortran compiler may move buffer accesses across the MPI_START or MPI_STARTALL call, just as it may across the MPI_WAIT call described in the Nonblocking Operations subsection of Section Problems with Code Movement and Register Optimization.
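A minimal sketch of this situation, assuming the mpi_f08 module and at least two processes: the buffer appears only in MPI_Recv_init, so neither MPI_Start nor MPI_Wait gives the compiler any hint that buf is involved. The sketch protects the accesses with MPI_F_sync_reg on both sides of the operation, which is one conservative placement rather than the only possible one. The program name, the iteration count, and the message values are illustrative assumptions.

```fortran
program persistent_recv_sync
  use mpi_f08
  implicit none
  real :: buf, b1
  type(MPI_Request) :: req
  integer :: rank, nprocs, iter

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs)

  if (nprocs >= 2 .and. rank == 0) then
    ! The only call that names buf explicitly.
    call MPI_Recv_init(buf, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, req)
    do iter = 1, 3
      call MPI_F_sync_reg(buf)    ! earlier accesses to buf stay above MPI_Start
      call MPI_Start(req)         ! buf is not an argument here
      call MPI_Wait(req, MPI_STATUS_IGNORE)
      call MPI_F_sync_reg(buf)    ! the load below stays after MPI_Wait
      b1 = buf
      print *, 'iteration', iter, 'received', b1
    end do
    call MPI_Request_free(req)
  else if (nprocs >= 2 .and. rank == 1) then
    do iter = 1, 3
      buf = real(iter)            ! illustrative payload
      call MPI_Send(buf, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD)
    end do
  end if

  call MPI_Finalize()
end program persistent_recv_sync
```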

20.1.17.3. One-sided Communication

An example with instruction reordering due to register optimization can be found in Section Registers and Compiler Optimizations.

20.1.17.4. MPI_BOTTOM and Combining Independent Variables in Datatypes

This section is only relevant if the MPI program uses a buffer argument to MPI_SEND, MPI_RECV, etc., that hides the actual variables involved in the communication. MPI_BOTTOM with an MPI_Datatype containing absolute addresses is one example. Creating a datatype that uses one variable as an anchor and brings along others by using MPI_GET_ADDRESS to determine their offsets from the anchor is another; the anchor variable would be the only one referenced in the call. Attention must also be paid if MPI operations are used that run in parallel with the user's application.

The following example shows what Fortran compilers are allowed to do.

Example: Fortran 90 register optimization.

[Code listing not reproduced in this text version.]

In the MPI_RECV example above, the compiler does not invalidate the register because it cannot see that MPI_RECV changes the value of buf. The access to buf is hidden by the use of MPI_GET_ADDRESS and MPI_BOTTOM.

Example: Similar example with MPI_SEND.

[Code listing not reproduced in this text version.]

In the MPI_SEND example above, several successive assignments to the same variable buf can be combined in such a way that only the last assignment is executed. "Successive" means that no interfering load access to this variable occurs between the assignments. The compiler cannot detect that the call to MPI_SEND is interfering because the load access to buf is hidden by the use of MPI_BOTTOM.

20.1.17.5. Solutions

The following sections show in detail how the problems with code movement and register optimization can be portably solved. Application writers can partially or fully avoid these compiler optimization problems by using one or more of the following special Fortran declarations or techniques with the send and receive buffers used in nonblocking operations, in operations in which MPI_BOTTOM is used, or with datatype handles that combine several variables:

Use of the Fortran ASYNCHRONOUS attribute.
Use of the helper routine MPI_F_SYNC_REG, or an equivalent user-written dummy routine.
Declaration of the buffer as a Fortran module variable or within a Fortran common block.
Use of the Fortran VOLATILE attribute.

Example: Protecting nonblocking communication with the ASYNCHRONOUS attribute.

[Code listing not reproduced in this text version.]

Each of these methods solves the problems of code movement and register optimization, but may incur various degrees of performance impact, and may not be usable in every application context. These methods may not be guaranteed by the Fortran standard, but they must be guaranteed by an MPI-3.0 (and later) compliant MPI library and associated compiler suite according to the requirements listed in Section Requirements on Fortran Compilers. The performance impact of using MPI_F_SYNC_REG is expected to be low, that of using module variables or the ASYNCHRONOUS attribute is expected to be low to medium, and that of using the VOLATILE attribute is expected to be high or very high. Note that there is one attribute that cannot be used for this purpose: the Fortran TARGET attribute does not solve code movement problems in MPI applications.
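The first of these methods can be sketched as follows for a one-dimensional halo exchange without overlap of communication and computation. This is a hedged reconstruction, not the listing from the standard: the periodic neighbor computation, the tags, the three-point average standing in for the stencil function, and the program name are assumptions made for illustration. It assumes the mpi_f08 module and works with any number of processes.

```fortran
program halo_asynchronous
  use mpi_f08
  implicit none
  real, asynchronous :: b(0:101)  ! elements 0 and 101 are halo cells
  real :: bnew(1:100)
  type(MPI_Request) :: req(4)
  integer :: rank, nprocs, left, right, i

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs)
  left  = mod(rank - 1 + nprocs, nprocs)   ! assumed periodic ring
  right = mod(rank + 1, nprocs)
  b = real(rank)

  ! Tag 0 travels rightwards, tag 1 travels leftwards.
  call MPI_Irecv(b(0),   1, MPI_REAL, left,  0, MPI_COMM_WORLD, req(1))
  call MPI_Irecv(b(101), 1, MPI_REAL, right, 1, MPI_COMM_WORLD, req(2))
  call MPI_Isend(b(1),   1, MPI_REAL, left,  1, MPI_COMM_WORLD, req(3))
  call MPI_Isend(b(100), 1, MPI_REAL, right, 0, MPI_COMM_WORLD, req(4))

  ! No access to b while the communication is pending.
  call MPI_Waitall(4, req, MPI_STATUSES_IGNORE)

  ! Because b is ASYNCHRONOUS, these loads may not be moved above MPI_Waitall.
  do i = 1, 100
    bnew(i) = (b(i-1) + b(i) + b(i+1)) / 3.0
  end do
  print *, 'rank', rank, 'bnew(1) =', bnew(1), 'bnew(100) =', bnew(100)

  call MPI_Finalize()
end program halo_asynchronous
```

The protection relies on the compiler honoring ASYNCHRONOUS for communication, which an application can check through MPI_ASYNC_PROTECTS_NONBLOCKING.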

20.1.17.6. The Fortran ASYNCHRONOUS Attribute

Declaring an actual buffer argument with the ASYNCHRONOUS Fortran attribute in a scoping unit (or BLOCK) informs the compiler that any statement in the scoping unit may be executed while the buffer is affected by a pending asynchronous Fortran input/output operation (since Fortran 2003) or by an asynchronous communication (TS 29113 extension). Without the extensions specified in TS 29113, a Fortran compiler may totally ignore this attribute if the Fortran compiler implements asynchronous Fortran input/output operations with blocking I/O.

The ASYNCHRONOUS attribute protects the buffer accesses from optimizations through code movements across routine calls, and the buffer itself from temporary and permanent data movements. If the choice buffer dummy argument of a nonblocking MPI routine is declared with ASYNCHRONOUS (which is mandatory for the mpi_f08 module, with allowable exceptions listed in Section MPI for Different Fortran Standard Versions), then the compiler has to guarantee call by reference and should report a compile-time error if call by reference is impossible, e.g., if vector subscripts are used.

The constant MPI_ASYNC_PROTECTS_NONBLOCKING is set to .TRUE. if both the protection of the actual buffer argument through ASYNCHRONOUS according to the TS 29113 extension and the declaration of the dummy argument with ASYNCHRONOUS in the Fortran support method are guaranteed for all nonblocking routines; otherwise it is set to .FALSE.

The ASYNCHRONOUS attribute has some restrictions. Section 5.4.2 of the TS 29113 specifies:

"Asynchronous communication for a Fortran variable occurs through the action of procedures defined by means other than Fortran. It is initiated by execution of an asynchronous communication initiation procedure and completed by execution of an asynchronous communication completion procedure. Between the execution of the initiation and completion procedures, any variable of which any part is associated with any part of the asynchronous communication variable is a pending communication affector. Whether a procedure is an asynchronous communication initiation or completion procedure is processor dependent.
Asynchronous communication is either input communication or output communication. For input communication, a pending communication affector shall not be referenced, become defined, become undefined, become associated with a dummy argument that has the VALUE attribute, or have its pointer association status changed. For output communication, a pending communication affector shall not be redefined, become undefined, or have its pointer association status changed."

In Case (a) of the example in the Solutions subsection (protecting nonblocking communication with the ASYNCHRONOUS attribute), the read accesses to b within function(b(i-1), b(i), b(i+1)) cannot be moved by compiler optimizations to before the wait call because b was declared as ASYNCHRONOUS. Note that only the elements 0, 1, 100, and 101 of b are involved in asynchronous communication, but by definition the whole variable b is the pending communication affector and is usable for input and output asynchronous communication between the MPI_IXXX routines and MPI_Waitall. Case (a) works because the read accesses to b occur after the communication has completed.

In Case (b), the read accesses to b(1:100) in the loop i=2,99 are read accesses to a pending communication affector while input communication (i.e., the two MPI_Irecv calls) is pending. This contradicts the rule that for input communication, a pending communication affector shall not be referenced. The problem can be solved by using separate variables for the halos and the inner array, or by splitting a common array into disjoint subarrays that are passed through different dummy arguments into a subroutine, as shown in the example in Section Permanent Data Movement.

If one does not overlap communication and computation on the same variable, then all optimization problems can be solved through the ASYNCHRONOUS attribute.

The problems with MPI_BOTTOM, as shown in the two examples in the subsection MPI_BOTTOM and Combining Independent Variables in Datatypes, can also be solved by declaring the buffer buf with the ASYNCHRONOUS attribute.

In some MPI routines, a buffer dummy argument is defined as ASYNCHRONOUS to guarantee passing by reference, provided that the actual argument is also defined as ASYNCHRONOUS.

20.1.17.7. Calling MPI_F_SYNC_REG

The compiler may be prevented from moving a reference to a buffer across a call to an MPI subroutine by surrounding the call by calls to an external subroutine with the buffer as an actual argument. The MPI library provides the MPI_F_SYNC_REG routine for this purpose; see Section Additional Support for Fortran Register-Memory-Synchronization.

The problems illustrated by the two examples in the Nonblocking Operations subsection can be solved by calling MPI_F_SYNC_REG(buf) once immediately after MPI_WAIT.

[Two code listings, one solving the MPI_IRECV example and one solving the MPI_ISEND example, are not reproduced in this text version.]
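A hedged sketch of both fixes in one program, assuming the mpi_f08 module and at least two processes: rank 0 shows the receive side and rank 1 the send side, each with a single MPI_F_sync_reg call immediately after MPI_Wait. The variable names copy and val_overwrite follow the surrounding text; the program name, ranks, tag, and numeric values are illustrative assumptions.

```fortran
program sync_reg_nonblocking
  use mpi_f08
  implicit none
  real :: buf, b1, copy
  real, parameter :: val = 1.0, val_overwrite = 2.0  ! illustrative values
  type(MPI_Request) :: req
  integer :: rank, nprocs

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs)

  if (nprocs >= 2 .and. rank == 0) then
    ! Receive side (MPI_IRECV example).
    call MPI_Irecv(buf, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, req)
    call MPI_Wait(req, MPI_STATUS_IGNORE)
    call MPI_F_sync_reg(buf)      ! the load below cannot move above this call
    b1 = buf
    print *, 'rank 0 received', b1
  else if (nprocs >= 2 .and. rank == 1) then
    ! Send side (MPI_ISEND example).
    buf = val
    call MPI_Isend(buf, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, req)
    copy = buf                    ! a read of the pending send buffer is harmless
    call MPI_Wait(req, MPI_STATUS_IGNORE)
    call MPI_F_sync_reg(buf)      ! the store below cannot move above this call
    buf = val_overwrite
    print *, 'rank 1 kept copy', copy, 'and now holds', buf
  end if

  call MPI_Finalize()
end program sync_reg_nonblocking
```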

The call to MPI_F_SYNC_REG(buf) prevents the last line from being moved before the MPI_WAIT call. Further calls to MPI_F_SYNC_REG(buf) are not needed because the code remains correct if the additional read access copy=buf is moved below MPI_WAIT and before buf=val_overwrite.

The problems illustrated by the two examples in the subsection MPI_BOTTOM and Combining Independent Variables in Datatypes can be solved with two additional MPI_F_SYNC_REG(buf) statements: one directly before MPI_RECV/MPI_SEND, and one directly after this communication operation.

[Two code listings, one solving the MPI_RECV example and one solving the MPI_SEND example, are not reproduced in this text version.]
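The placement of the two calls can be sketched as follows, assuming the mpi_f08 module and at least two processes. Both ranks build a datatype that holds the absolute address of their own buf and communicate through MPI_BOTTOM; rank 0 shows the MPI_RECV case and rank 1 the MPI_SEND case. The names abs_type, displs, blocklens, and types, the program name, and the numeric values are illustrative assumptions.

```fortran
program sync_reg_bottom
  use mpi_f08
  implicit none
  real :: buf, val_old, val_new
  integer(kind=MPI_ADDRESS_KIND) :: displs(1)
  integer :: blocklens(1), rank, nprocs
  type(MPI_Datatype) :: types(1), abs_type

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs)

  ! Datatype carrying the absolute address of buf, for use with MPI_BOTTOM.
  call MPI_Get_address(buf, displs(1))
  blocklens(1) = 1
  types(1) = MPI_REAL
  call MPI_Type_create_struct(1, blocklens, displs, types, abs_type)
  call MPI_Type_commit(abs_type)

  if (nprocs >= 2 .and. rank == 0) then
    buf = 0.0
    val_old = buf
    call MPI_F_sync_reg(buf)      ! finish earlier loads/stores of buf
    call MPI_Recv(MPI_BOTTOM, 1, abs_type, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE)
    call MPI_F_sync_reg(buf)      ! force a fresh load of buf below
    val_new = buf
    print *, 'rank 0: old', val_old, 'new', val_new
  else if (nprocs >= 2 .and. rank == 1) then
    buf = 5.0                     ! must not be eliminated as a dead store
    call MPI_F_sync_reg(buf)
    call MPI_Send(MPI_BOTTOM, 1, abs_type, 0, 0, MPI_COMM_WORLD)
    call MPI_F_sync_reg(buf)
    buf = 6.0                     ! must not be moved above MPI_Send
  end if

  call MPI_Type_free(abs_type)
  call MPI_Finalize()
end program sync_reg_bottom
```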

The first call to MPI_F_SYNC_REG(buf) is needed to finish all load and store references to buf prior to MPI_RECV/MPI_SEND; the second call is needed to ensure that any subsequent access to buf is not moved before MPI_RECV/MPI_SEND.

In the example in Section Registers and Compiler Optimizations, two asynchronous accesses must be protected. In Process 1, the access to bbbb must be protected as in the Nonblocking Operations subsection, i.e., a call to MPI_F_SYNC_REG(bbbb) is needed after the second MPI_WIN_FENCE to guarantee that further accesses to bbbb are not moved ahead of the call to MPI_WIN_FENCE. In Process 2, both calls to MPI_WIN_FENCE together act as a communication call with MPI_BOTTOM as the buffer. That is, before the first fence and after the second fence, a call to MPI_F_SYNC_REG(buff) is needed to guarantee that accesses to buff are not moved after or ahead of the calls to MPI_WIN_FENCE. If MPI_GET is used instead of MPI_PUT, the same calls to MPI_F_SYNC_REG are necessary.

Example: Solution for the Fortran register optimization problems with one-sided communication in the example in Section Registers and Compiler Optimizations.

[Code listing not reproduced in this text version.]

The temporary memory modification problem, i.e., the example in Section Temporary Data Movement and Temporary Memory Modification, cannot be solved with this method.

20.1.17.8. A User Defined Routine Instead of MPI_F_SYNC_REG

Instead of MPI_F_SYNC_REG, one can also use a user defined external subroutine, which is separately compiled:

[Code listing not reproduced in this text version.]

Note that if the INTENT is declared in an explicit interface for the external subroutine, it must be OUT or INOUT. The subroutine itself may have an empty body, but the compiler does not know this and has to assume that the buffer may be altered. For example, a call to MPI_RECV with MPI_BOTTOM as buffer might be replaced by

[Code listing not reproduced in this text version.]
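A hedged sketch of such a routine and its use around MPI_RECV with MPI_BOTTOM, assuming the mpi_f08 module and at least two processes. The routine name DD, its REAL dummy argument, the program name, and the message value are illustrative assumptions. For brevity the routine is shown in the same listing; in practice it would be placed in a separately compiled file, since a compiler that can see the empty body is free to optimize the calls away.

```fortran
! Assumed to live in its own, separately compiled source file.
subroutine DD(buf)
  implicit none
  real :: buf                     ! no INTENT(IN): the caller must assume a write
end subroutine DD

program dd_bottom_recv
  use mpi_f08
  implicit none
  external :: DD
  real :: buf, val_new
  integer(kind=MPI_ADDRESS_KIND) :: displs(1)
  integer :: blocklens(1), rank, nprocs
  type(MPI_Datatype) :: types(1), abs_type

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs)

  if (nprocs >= 2 .and. rank == 0) then
    call MPI_Get_address(buf, displs(1))
    blocklens(1) = 1
    types(1) = MPI_REAL
    call MPI_Type_create_struct(1, blocklens, displs, types, abs_type)
    call MPI_Type_commit(abs_type)

    buf = 0.0
    call DD(buf)                  ! earlier accesses to buf are completed
    call MPI_Recv(MPI_BOTTOM, 1, abs_type, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE)
    call DD(buf)                  ! later accesses to buf stay below
    val_new = buf
    print *, 'rank 0 received', val_new
    call MPI_Type_free(abs_type)
  else if (nprocs >= 2 .and. rank == 1) then
    buf = 3.0                     ! illustrative payload
    call MPI_Send(buf, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD)
  end if

  call MPI_Finalize()
end program dd_bottom_recv
```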

Such a user-defined routine was introduced in MPI-2.0 and is still included here to document such usage in existing application programs, although new applications should prefer MPI_F_SYNC_REG or one of the other possibilities. In an existing application, calls to such a user-written routine should be replaced by calls to MPI_F_SYNC_REG because the user-written routine may not be implemented in accordance with the rules specified in Section Requirements on Fortran Compilers.

20.1.17.9. Module Variables and COMMON Blocks

An alternative to the previously mentioned methods is to put the buffer or variable into a module or a common block and access it through a USE or COMMON statement in each scope where it is referenced, defined or appears as an actual argument in a call to an MPI routine. The compiler will then have to assume that the MPI procedure may alter the buffer or variable, provided that the compiler cannot infer that the MPI procedure does not reference the module or common block.
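A minimal sketch of the module-variable approach, assuming the mpi_f08 module and at least two processes. The module name comm_buffers, the program name, and the message value are illustrative assumptions. Because buf is accessible outside the calling scope, the compiler generally has to assume that MPI_Wait may modify it, subject to the proviso stated above.

```fortran
module comm_buffers
  implicit none
  real :: buf                     ! module variable instead of a local one
end module comm_buffers

program module_buffer_recv
  use mpi_f08
  use comm_buffers, only: buf
  implicit none
  real :: b1
  type(MPI_Request) :: req
  integer :: rank, nprocs

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs)

  if (nprocs >= 2 .and. rank == 0) then
    call MPI_Irecv(buf, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, req)
    ! buf is not touched here while the receive is pending.
    call MPI_Wait(req, MPI_STATUS_IGNORE)
    b1 = buf                      ! reloaded after the call in the usual case
    print *, 'rank 0 received', b1
  else if (nprocs >= 2 .and. rank == 1) then
    buf = 7.0                     ! illustrative payload
    call MPI_Send(buf, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD)
  end if

  call MPI_Finalize()
end program module_buffer_recv
```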

This method solves problems of instruction reordering, code movement, and register optimization related to nonblocking and one-sided communication, or related to the usage of MPI_BOTTOM and derived datatype handles.
Unfortunately, this method does not solve problems caused by asynchronous accesses between the start and end of a nonblocking or one-sided communication. Specifically, problems caused by temporary memory modifications are not solved.

20.1.17.10. The (Poorly Performing) Fortran VOLATILE Attribute

The VOLATILE attribute gives the buffer or variable the properties needed to avoid register optimization or code movement problems, but it may inhibit optimization of any code containing references or definitions of the buffer or variable. On many modern systems, the performance impact will be large because not only register, but also cache optimizations will not be applied. Therefore, use of the VOLATILE attribute to enforce correct execution of MPI programs is discouraged.

20.1.17.11. The Fortran TARGET Attribute

The TARGET attribute does not solve the code movement problem because it is not specified for the choice buffer dummy arguments of nonblocking routines. If the compiler detects that the application program specifies the TARGET attribute for an actual buffer argument used in the call to a nonblocking routine, the compiler may ignore this attribute if no pointer reference to this buffer exists.

Rationale.

The Fortran standardization body decided to extend the ASYNCHRONOUS attribute within the TS 29113 to protect buffers in nonblocking calls from all kinds of optimization, instead of extending the TARGET attribute. (End of rationale.)

(Unofficial) MPI-4.1 of November 2, 2023
HTML Generated on November 19, 2023