Re: Volatile & Caching section

Karl Feind (kaf@cray.com)
Thu, 18 Apr 1996 20:51:38 -0500

I have some comments and some further discussion about the three cache
coherency issues raised by Rolf. And I've expanded on some of Lloyd's
comments to produce a fourth issue which affects MPI implementations on
networked as well as shared memory multiprocessor systems.

-----------------------------------------------------------------------------

Problem 1: MPI_GET obtains stale data when intermixed with local stores.

Description:

Data stored by the application on the target node is *not* written to memory
as soon as possible, so subsequent accesses to that memory by MPI_GET issued
from the origin node return older values.
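
The failure can be reproduced in a toy model. The sketch below is not MPI
code: toy_node, node_store, node_write_back, and node_remote_get are
hypothetical names that model a single window word on the target together
with a one-entry write-back cache, assuming a remote get reads memory
directly and never sees the target's cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one window word on the target node (all names hypothetical). */
typedef struct {
    int  memory;  /* value visible to remote RMA operations */
    int  cached;  /* value held in the target's write-back cache */
    bool dirty;   /* true if the cache holds a store not yet in memory */
} toy_node;

/* Ordinary store by the target: lands in the cache only. */
void node_store(toy_node *n, int value)
{
    assert(n != NULL);
    n->cached = value;
    n->dirty  = true;
}

/* Write the dirty cache entry back to memory. */
void node_write_back(toy_node *n)
{
    assert(n != NULL);
    if (n->dirty) {
        n->memory = n->cached;
        n->dirty  = false;
    }
}

/* Remote get from the origin: bypasses the target's cache and reads memory. */
int node_remote_get(const toy_node *n)
{
    assert(n != NULL);
    return n->memory;
}
```

In this model a node_store() followed immediately by node_remote_get()
returns the old value; the new value becomes visible only after
node_write_back().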

Systems affected:

Shared memory multi-processors without cache coherency hardware and with
write-back (not write-through) caches. Examples of such systems include
the RACE i860 from Mercury Computer Systems, according to Lloyd.

Proposal 1.1 (From Rolf's paper):

MPI_RMA_WINDOW_READY must be called on the target (memory) node after it
has finished updating memory in the window and before the origin issues
an MPI_GET.

One difficulty with this approach is that the problem of synchronizing the
call to MPI_RMA_WINDOW_READY on one node with the MPI_GET call on another
node is left to the user unless a matching MPI_RMA_WINDOW_BUSY call
locks out (delays) any MPI_GET requests that come in. A call to
MPI_RMA_WINDOW_READY would unlock the window, allowing MPI_GETs to
continue. (Rolf--is this your intent?)
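
One possible reading of the locking variant can be sketched as a toy state
machine. This is an assumption about the intended semantics, not a
definition from the proposal: toy_window and the window_* functions are
hypothetical stand-ins, and a rejected get stands for a request that the
implementation would delay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy window with a busy flag (hypothetical model, not an MPI interface). */
typedef struct {
    int  memory;  /* value remote gets would read */
    int  cached;  /* target's not-yet-written-back value */
    bool busy;    /* true between the BUSY and READY calls */
} toy_window;

/* Stand-in for MPI_RMA_WINDOW_BUSY: lock out incoming gets. */
void window_busy(toy_window *w)
{
    assert(w != NULL);
    w->busy = true;
}

/* Ordinary store by the target while the window is locked. */
void window_store(toy_window *w, int value)
{
    assert(w != NULL);
    w->cached = value;
}

/* Stand-in for MPI_RMA_WINDOW_READY: write back, then unlock. */
void window_ready(toy_window *w)
{
    assert(w != NULL);
    w->memory = w->cached;
    w->busy   = false;
}

/* Stand-in for an incoming MPI_GET: returns false if it must be delayed. */
bool window_try_get(const toy_window *w, int *out)
{
    assert(w != NULL && out != NULL);
    if (w->busy)
        return false;
    *out = w->memory;
    return true;
}
```

Under this reading a get can never observe a window between the target's
store and its write-back, at the cost of delaying the origin.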

This idea might be more useful if MPI_RMA_WINDOW_READY is defined to be
called on the node that originates the subsequent MPI_GET call. I doubt
that a very efficient implementation would be realized, however, because
a signal handler invocation on the target node would probably be required.

Proposal 1.2:
Do not address this issue in the MPI standard; leave it to the implementation
to prevent the problem from occurring.

The affected systems would suffer a very high performance penalty in the
implementation of GET, RMW, and ACCUMULATE operations. System calls and
signal handlers likely would have to be invoked instead of simply
transferring data and perhaps locking/unlocking memory areas.

New proposal 1.3:
Disallow local stores to a window between synchronization points if
any RMA operation from a different node is reading data from the window.
Disallowed RMA operations include GET, RMW, and ACCUMULATE.

A "synchronization point" is defined as any MPI 1-sided or 2-sided
communication call on any communicator (for any window). This definition
is made with a broad brush stroke to allow maximum user flexibility. For
example, a window might be set up for any node to access, but during a
particular point in the computation just one producer and one consumer must
synchronize local (via ordinary stores) and remote (via MPI_GET) updates to
the window.

As in Proposals 1.1 and 1.2, local loads may occur simultaneously with remote
MPI_GET operations on a window.

-----------------------------------------------------------------------------

Problem 2: MPI_PUT intermixed with local stores can result in lost data.

Description:

Data stored by the application on the target node is *not* written to memory
as soon as possible, and subsequent stores to that memory by MPI_PUT issued
from the origin node complete before the local store data is written from
the cache to memory. The data written by MPI_PUT is overwritten when
the local store data is later flushed to memory. (Note that data previously
written to the target by MPI_PUT can also be incorrectly overwritten by
flushes to memory of cache lines containing locally stored data.)
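
The cache-line case in the note can be replayed in a toy model. Everything
here is hypothetical: toy_line models one cache line of LINE_WORDS words
(the size is an assumption chosen for illustration), and the line_*
functions stand in for the target's cache fill, its ordinary stores, a
remote put that writes memory directly, and the eventual write-back.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LINE_WORDS 2  /* assumed cache-line size in words, for illustration */

typedef struct {
    int  memory[LINE_WORDS];  /* window memory, written directly by remote puts */
    int  line[LINE_WORDS];    /* target's cached copy of the same line */
    bool dirty;
} toy_line;

/* Target loads the line from memory into its cache. */
void line_fill(toy_line *t)
{
    memcpy(t->line, t->memory, sizeof t->line);
    t->dirty = false;
}

/* Ordinary store by the target: modifies the cached line only. */
void line_store(toy_line *t, int idx, int value)
{
    assert(idx >= 0 && idx < LINE_WORDS);
    t->line[idx] = value;
    t->dirty = true;
}

/* Remote put: writes memory and leaves the target's cache untouched. */
void line_put(toy_line *t, int idx, int value)
{
    assert(idx >= 0 && idx < LINE_WORDS);
    t->memory[idx] = value;
}

/* Write-back of the whole line, as a write-back cache would do. */
void line_flush(toy_line *t)
{
    if (t->dirty) {
        memcpy(t->memory, t->line, sizeof t->line);
        t->dirty = false;
    }
}
```

If the target stores to word 0, a remote put then writes word 1, and the
line is flushed afterwards, the flush restores the old contents of word 1
and the put is lost even though the two writes touched different words.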

Systems affected:

Same as for Problem 1.

Proposal 2.1 (From Rolf's paper):

MPI_RMA_WINDOW_READY must be called on the target (memory) node after it
has finished updating memory in the window and before the origin issues
an MPI_PUT.

The same difficulties noted under proposal 1.1 apply here to MPI_PUT.

Proposal 2.2:
Do not address this issue in the MPI standard; leave it to the implementation
to prevent the problem from occurring.

The same severe performance penalty noted under proposal 1.2 applies here
to MPI_PUT as well for the systems affected.

New proposal 2.3:

Disallow local stores to a window between synchronization points if
any RMA operation from a different node is updating data in the same
window. Disallowed RMA operations include PUT, RMW, and ACCUMULATE.

See discussion under proposal 1.3.

-----------------------------------------------------------------------------

Problem 3: Local loads return stale data after an MPI_PUT operation.

Description:

Data from the application on the target node is stored into memory *as
soon as possible*, and subsequent stores to that memory by MPI_PUT
issued from the origin node are also done *as soon as possible*, but
subsequent loads from that memory by the application on the target node
are satisfied from the older data in the cache, which was written when the
data was stored by the application on the target node.
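
A toy model of this sequence, assuming a write-through cache that remote
puts do not update or invalidate. toy_word and the word_* functions are
hypothetical names, and word_invalidate stands for whatever cache flush
the platform provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One window word with a write-through cache entry (hypothetical model). */
typedef struct {
    int  memory;  /* window memory */
    int  cached;  /* target's cached copy */
    bool valid;   /* true if the cached copy may be used for loads */
} toy_word;

/* Ordinary store by the target: written through to memory at once. */
void word_store(toy_word *w, int value)
{
    assert(w != NULL);
    w->cached = value;
    w->memory = value;
    w->valid  = true;
}

/* Remote put: updates memory but not the target's cache. */
void word_put(toy_word *w, int value)
{
    assert(w != NULL);
    w->memory = value;
}

/* Ordinary load by the target: hits the cache while the entry is valid. */
int word_load(toy_word *w)
{
    assert(w != NULL);
    if (!w->valid) {
        w->cached = w->memory;
        w->valid  = true;
    }
    return w->cached;
}

/* Discard the cached copy so the next load goes to memory. */
void word_invalidate(toy_word *w)
{
    assert(w != NULL);
    w->valid = false;
}
```

After word_store() and a later word_put(), word_load() keeps returning the
locally stored value until word_invalidate() is called.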

Systems affected:

Shared memory multi-processors without cache coherency hardware. Examples
include CRAY T90, CRAY T3D, and RACE i860 from Mercury Computer Systems.

Proposal 3.0 (MPI_DELIVER as proposed by Karl)

A call to MPI_DELIVER flushes the cache and guarantees that subsequent
loads can access all data delivered by RMA requests to that target prior
to the call to MPI_DELIVER.

MPI_DELIVER must be called on the target node
to ensure complete delivery of any data written to the target by prior
RMA requests originated by other nodes. Without a call to MPI_DELIVER,
it is undefined whether any local loads of data written by prior RMA
operations access the new or the old memory contents.

An MPI_DELIVER operation is implied by any MPI synchronization routine:
MPI_BARRIER, MPI_RECV, RMW, and ACCUMULATE requests.

With this proposal, no synchronization is really needed between origin
and target. The most typical use of this would be when the target node
is spinning on memory:

    volatile int windowarray[SIZE];

    /* Spin until a remote PUT makes windowarray[0] nonzero. */
    while (windowarray[0] == 0) {
        mpi_deliver();
    }

On systems with coherent caches, MPI_DELIVER could be defined as a
null macro in mpi.h. Therefore there is a negligible performance
penalty for unaffected systems.
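
A header could select the definition at compile time along these lines.
This is only a sketch: TOY_COHERENT_CACHES, TOY_DELIVER, and
toy_invalidate_cache are hypothetical names, not part of any MPI header,
and the non-coherent branch is left as a declaration because the flush
itself is platform-specific.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical configuration switch; a real header would set it per platform. */
#define TOY_COHERENT_CACHES 1

#if TOY_COHERENT_CACHES
/* Coherent hardware: delivery needs no work, so the call compiles away. */
#define TOY_DELIVER() ((void)0)
#else
void toy_invalidate_cache(void);  /* platform-specific, not shown */
#define TOY_DELIVER() toy_invalidate_cache()
#endif

/* Spin until another party sets *flag; returns the number of polls made. */
unsigned spin_until_set(volatile int *flag)
{
    unsigned polls = 0;

    assert(flag != NULL);
    while (*flag == 0) {
        TOY_DELIVER();
        polls++;
    }
    return polls;
}
```

With the switch set to 1 the loop body contains no call at all, which is
the sense in which unaffected systems pay nothing.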

This proposal weakens the progress rule for 1-sided communication. But
if that's a problem, would someone explain to me why?
I understand that the strength of the progress rule has been established
by MPI-1 for 2-sided communication. But, nothing has been established or
standardized for 1-sided communication as of yet.

Proposal 3.1 (MPI_DELIVER as proposed by Rolf)

A call to MPI_DELIVER flushes the cache and guarantees that subsequent
loads are fetched from the actual memory and not from stale cache contents.
MPI_DELIVER must be called on the target node after the memory is updated
by (some) MPI_PUT issued from one or more origin nodes. The application
must guarantee, e.g. by checking the window counter, that there is no
ongoing MPI_PUT to that window. Before calling MPI_DELIVER the
application on the target node must not access the window memory. After
calling MPI_DELIVER the application on the target node may access the
memory again but the application on the origin node must not issue further
MPI_PUTs.

This proposal must be fleshed out further to define whether the user
must synchronize origin and target with separate MPI synchronization
calls or whether MPI_DELIVER would set a lock which would block further
MPI_PUTs until the next synchronization point.

This proposal also weakens the progress rule for 1-sided communication.

Proposal 3.2

Do not address this issue in the MPI standard; leave it to the implementation
to prevent the problem from occurring.

The same severe performance penalty noted under proposals 1.2 and 2.2
applies here to MPI_PUT for the systems affected.

This proposal leaves the progress rule strongly intact.

Proposal 3.3

Disallow local loads from a window between synchronization points if
any RMA operation from a different node is updating data in the same
window. Disallowed RMA operations include PUT, RMW, and ACCUMULATE.

See discussion under proposal 1.3. By disallowing concurrent access to
a window by local loads and remote RMA requests, a spin-wait loop would
be required to use MPI_GET instead of a local load. There is a performance
penalty here, but this may be an acceptable restriction.

This proposal doesn't weaken the progress rule for 1-sided communication
because it implies that you must test for progress via an RMA request,
not a local load. (if a tree falls in the forest and nobody hears it...)

-----------------------------------------------------------------------------

Problem 4: Memory corruption at window boundaries.

Description:

There are two similar issues which can result in memory coherency problems
at the boundaries of 1-sided communication windows:
1) word size
2) cache line size
The problem is most easily illustrated with a window containing one
byte which is in the middle of a word. The bytes on both sides
are not in the window and so may be updated by the local node while remote
MPI_PUT operations are updating the window byte. On systems that have
no byte store instruction, the word must be loaded, and then the byte
inserted into the word before storing the entire word back to memory.
Any update to a neighboring byte that lands between that load and that
store is lost.

A similar issue exists for non-cache-coherent systems when a window
boundary falls in the middle of a cache line.
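
The word-size case can be replayed deterministically. In this sketch the
32-bit word size, the byte positions, and the function names are all
assumptions made for illustration; replay_race() simply performs the three
steps in the problematic order on one shared word.

```c
#include <assert.h>
#include <stdint.h>

/* Insert `byte` at byte position `pos` (0 = least significant) of `word`. */
uint32_t insert_byte(uint32_t word, unsigned pos, uint8_t byte)
{
    assert(pos < 4);
    unsigned shift = 8u * pos;
    return (word & ~((uint32_t)0xFF << shift)) | ((uint32_t)byte << shift);
}

/* Read the byte at position `pos` of `word`. */
uint8_t extract_byte(uint32_t word, unsigned pos)
{
    assert(pos < 4);
    return (uint8_t)(word >> (8u * pos));
}

/*
 * Replay the bad interleaving on one word. Byte 1 is the one-byte window;
 * byte 0 is a neighbor outside the window.
 *   1. The PUT side loads the whole word.
 *   2. The local node updates the neighboring byte.
 *   3. The PUT side inserts its byte into the stale copy and stores it back.
 * Returns the final contents of the word.
 */
uint32_t replay_race(uint32_t initial, uint8_t put_byte, uint8_t neighbor_byte)
{
    uint32_t memory = initial;
    uint32_t loaded = memory;                        /* step 1 */
    memory = insert_byte(memory, 0, neighbor_byte);  /* step 2 */
    memory = insert_byte(loaded, 1, put_byte);       /* step 3 */
    return memory;
}
```

The window byte ends up with the PUT value, but the neighboring byte
reverts to its initial value because step 3 stores a copy loaded before
step 2.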

Systems affected:

Any system that does not have a byte-store instruction. Networked
systems and shared memory systems alike are affected. Also affected
are multiprocessor shared memory systems without full cache coherency
implemented in hardware.

Proposal 4.1

Require the implementation to handle the issue by recognizing when a
PUT is being done at the boundary of a window.

Although the boundary-within-a-cache-line problem is solvable this way,
the boundary-inside-a-word case is not solvable.

Proposal 4.2

Remove MPI_RMA_INIT from the standard. Require that all RMA windows
be allocated by MPI_RMA_MALLOC. The latter routine could adequately
ensure any needed boundary alignment for a given system.

Proposal 4.3

Keep MPI_RMA_INIT in the standard, but allow it to return an error
code if a user-requested window does not meet an implementation's
window alignment requirements. This would ensure that any window
successfully established would avoid the problem. But the collective
MPI_RMA_INIT call would have to fail on *all* nodes if any of the nodes
passed in a buffer with improper alignment or length.

We could also provide some constant e.g. MPI_WINDOW_ALIGN_SIZE which would
define the alignment requirements (in bytes) for the starting location
and the length of a window in a given implementation.
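
The check an implementation might make can be sketched as follows.
TOY_WINDOW_ALIGN_SIZE and its value of 64 are placeholders (the proposal
leaves the constant implementation-defined), and both functions are
hypothetical; the second models the rule that the collective call fails
everywhere if any one node's window is unacceptable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder value; a real implementation would define its own constant. */
#define TOY_WINDOW_ALIGN_SIZE 64

/* True if both the start address and the length are multiples of the
   alignment size. */
bool window_is_aligned(const void *base, size_t length)
{
    return ((uintptr_t)base % TOY_WINDOW_ALIGN_SIZE == 0) &&
           (length % TOY_WINDOW_ALIGN_SIZE == 0);
}

/* Collective outcome: succeed only if every node's window passed the check.
   node_ok[i] holds the local result reported by node i. */
bool all_windows_aligned(const bool *node_ok, size_t nnodes)
{
    assert(node_ok != NULL || nnodes == 0);
    for (size_t i = 0; i < nnodes; i++) {
        if (!node_ok[i])
            return false;
    }
    return true;
}
```

A user could run the same test on a buffer before creating the window to
avoid the collective failure.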

Karl Feind

+-----------------------------------+----------------------------------+
| Karl Feind | E-Mail: kaf@cray.com |
| Cray Research, Inc. | Phone: 612/683-5673 |
| 655F Lone Oak Drive | Fax: 612/683-5276 |
| Eagan, MN 55121 | |
+-----------------------------------+----------------------------------+