The compiler is allowed to temporarily modify data in memory. Normally, this problem may occur only when overlapping communication and computation, as in Example 0, Case (b) on page 0. Example 0 also shows a possibility that could be problematic.
Example Overlapping Communication and Computation.
USE mpi_f08
REAL :: buf(100,100)
CALL MPI_Irecv(buf(1,1:100),...req,...)
DO j=1,100
  DO i=2,100
    buf(i,j)=....
  END DO
END DO
CALL MPI_Wait(req,...)
REAL :: buf(100,100), buf_1dim(10000)
EQUIVALENCE (buf(1,1), buf_1dim(1))
CALL MPI_Irecv(buf(1,1:100),...req,...)
tmp(1:100) = buf(1,1:100)
DO j=1,10000
  buf_1dim(j)=...
END DO
buf(1,1:100) = tmp(1:100)
CALL MPI_Wait(req,...)
REAL :: buf(100,100), local_buf(100,100)
CALL MPI_Irecv(buf(1,1:100),...req,...)
local_buf = buf
DO j=1,100
  DO i=2,100
    local_buf(i,j)=....
  END DO
END DO
buf = local_buf ! may overwrite asynchronously received
                ! data in buf(1,1:100)
CALL MPI_Wait(req,...)

In the compiler-generated, possible optimization in Example 0, buf(100,100) from Example 0 is equivalenced with the 1-dimensional array buf_1dim(10000). The nonblocking receive may asynchronously receive the data into the boundary buf(1,1:100) while the fused loop is temporarily using this part of the buffer. When the tmp data is written back to buf, the previous data of buf(1,1:100) is restored and the received data is lost. The principle behind this optimization is that the receive buffer data buf(1,1:100) was temporarily moved to tmp.
Example 0 shows a second possible optimization. The whole array is temporarily moved to local_buf.
Storing local_buf back to the original location buf overwrites the section of buf that serves as the receive buffer in the nonblocking MPI call; this write-back is therefore likely to interfere with the asynchronously received data in buf(1,1:100).
Note that this problem may also occur with the buffers of other asynchronous MPI operations, for example with the window buffer of one-sided communication between two RMA synchronization calls, or with the buffer of a nonblocking or split collective I/O operation between the start of the access and the completing call.
Note also that compiler optimization with temporary data movement should not be prevented by declaring buf as VOLATILE, because the VOLATILE attribute implies that all accesses to any storage unit (word) of buf must be done directly in main memory, exactly in the sequence defined by the application program. The VOLATILE attribute prevents all register and cache optimizations and may therefore cause a huge performance degradation.
Instead of solving the problem, it is better to prevent it: when overlapping communication and computation, the nonblocking communication (or nonblocking or split collective I/O) and the computation should be executed on different variables, and the communication should be protected with the ASYNCHRONOUS attribute. In this case, the temporary memory modifications are done only on the variables used in the computation and cannot have any side effect on the data used in the nonblocking MPI operations.

Rationale. This is a strong restriction for application programs. To weaken this restriction, a new or modified asynchronous feature in the Fortran language would be necessary: an asynchronous attribute that can be used on parts of an array and together with asynchronous operations outside the scope of Fortran. If such a feature becomes available in a future edition of the Fortran standard, then this restriction may also be weakened in a later version of the MPI standard. (End of rationale.)
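The recommended prevention pattern described above can be sketched as follows. This is a non-normative sketch, not one of the numbered examples; the variable names recv_buf and work and the communication arguments are illustrative only:

USE mpi_f08
REAL, ASYNCHRONOUS :: recv_buf(100)  ! used only by the nonblocking receive
REAL :: work(100,100)                ! used only by the overlapped computation
TYPE(MPI_Request) :: req
CALL MPI_Irecv(recv_buf,...req,...)
! The computation touches only work; any temporary data movement the
! compiler applies to work cannot affect recv_buf.
DO j=1,100
  DO i=2,100
    work(i,j)=....
  END DO
END DO
CALL MPI_Wait(req,...)

Because recv_buf and work are disjoint variables and recv_buf carries the ASYNCHRONOUS attribute, the compiler must not move recv_buf temporarily while the receive may still be pending.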
In Example 0 (which is a solution for the problem shown in Example 0) and in Example 0 (which is a solution for the problem shown in Example 0), the array is split into an inner part and a halo part, and both disjoint parts are passed to a subroutine separated_sections. This routine overlaps the receiving of the halo data with the calculations on the inner part of the array. In a second step, the whole array is used to do the calculation on the elements where inner+halo is needed. Note that the halo and the inner area are strided arrays. These can be used in nonblocking communication only with a TS 29113 based MPI library.
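The shape of such a separated_sections routine can be sketched as follows. This is only an illustrative outline of the scheme described above, not the referenced example itself; the dummy-argument names and omitted arguments are assumptions:

SUBROUTINE separated_sections(halo, inner)
  USE mpi_f08
  REAL, ASYNCHRONOUS :: halo(:)  ! strided halo section: receive buffer
  REAL :: inner(:,:)             ! disjoint inner section: computation only
  TYPE(MPI_Request) :: req
  CALL MPI_Irecv(halo,...req,...)
  ! ... computation on inner only; no side effect on halo possible ...
  CALL MPI_Wait(req,...)
END SUBROUTINE separated_sections

Passing the strided halo section as an assumed-shape ASYNCHRONOUS dummy argument is exactly the case that requires a TS 29113 based MPI library, since the nonblocking call must not operate on a compiler-generated temporary copy of the strided section.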