> If RMA is implemented via a DMA or other remote access mechanism that is not
> coherent with the local processor(s) cache, then we really need a cache
> flush/invalidate before the get/put starts. Otherwise, the get may get
> invalid data and the put may be overwritten by lines that are written back. I
> don't understand how a deliver call after the RMA transfer suffices. The
> sense of the forum was that MPI should not go out of its way to accommodate
> such hardware. If the current design stays, it means that windows must be
> uncached, or, at the least, must be flushed back to memory at any potential
> synchronization point -- quite unpleasant, I know.
I don't really have a problem with forcing cache flushes at synchronization
points. This is not sufficient, however. My reading of the current MPI_PUT
semantics and atomicity is that it is OK for one task to spin wait on
memory being updated by another task through use of MPI_PUT. This construct
would not work on a noncoherently cached system, even if cache flushes
happened at sync points. The spin wait has no sync points in it and that is
why the MPI_DELIVER call serves a purpose. It inserts a pseudo-sync point
in the loop which gives MPI a chance to flush the cache.
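A toy model may make the spin-wait argument concrete. The sketch below is plain C, not MPI: toy_window, toy_remote_put, toy_local_load, toy_deliver and toy_spin_wait are all hypothetical names, and the "cache" is just a second copy of one word that the remote put does not update. It only assumes that the proposed MPI_DELIVER can be modelled as an invalidate of the cached window line.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one window word on a noncoherently cached target.
   Every name here is hypothetical; this is not an MPI interface. */
typedef struct {
    int  memory;       /* the word in main memory, updated by remote puts */
    int  cached;       /* the local processor's cached copy               */
    bool cache_valid;  /* false forces the next local load to hit memory  */
} toy_window;

static void toy_init(toy_window *w, int value)
{
    w->memory = value;
    w->cached = 0;
    w->cache_valid = false;
}

/* Remote put: the DMA engine writes memory and leaves the cache alone. */
static void toy_remote_put(toy_window *w, int value)
{
    w->memory = value;
}

/* Local load: served from the cache when valid, else filled from memory. */
static int toy_local_load(toy_window *w)
{
    if (!w->cache_valid) {
        w->cached = w->memory;
        w->cache_valid = true;
    }
    return w->cached;
}

/* Stand-in for the proposed MPI_DELIVER: invalidate the cached line. */
static void toy_deliver(toy_window *w)
{
    w->cache_valid = false;
}

/* Spin for at most max_iters iterations waiting for the word to become
   nonzero. The remote put lands at iteration put_at. Returns true if the
   loop ever observes the put. */
static bool toy_spin_wait(toy_window *w, int max_iters, int put_at,
                          bool use_deliver)
{
    assert(max_iters > 0 && put_at >= 0);
    for (int i = 0; i < max_iters; i++) {
        if (i == put_at)
            toy_remote_put(w, 1);
        if (use_deliver)
            toy_deliver(w);      /* pseudo-sync point inside the loop */
        if (toy_local_load(w) != 0)
            return true;
    }
    return false;
}
```

Calling toy_spin_wait with use_deliver false, and a put that lands after the first load, never sees the update because every later load is served from the stale cached copy. With use_deliver true the same loop sees the put on the iteration it arrives.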
> If we want to accommodate
> both noncoherent remote DMA, and local cached access to a window, then we
> need to alternate the state of the window between noncached and cached. We
> need two calls: MPI_WINDOW_EXPOSE, and MPI_WINDOW_HIDE. RMA access should
> occur only when the window is exposed, and local accesses should occur only
> when the window is hidden. (or, maybe, remote accesses are delayed until
> the window is exposed).
I think I would rather roll this window exposure/hiding into existing MPI
sync points. Notice that it would be difficult for a user to correctly
expose a window:
mpi_barrier();
mpi_window_hide();
/* RMA traffic is permitted to the window on this rank here */
mpi_window_expose();
/* RMA traffic will occur to the window here */
mpi_barrier();
This first attempt at using the routines is wrong. We need additional
synchronization around each call to hide/expose to ensure that the remote tasks
are not in a code region which might send RMA requests.
Or, as you add parenthetically, we could cause the expose/hide operations
to throttle RMA traffic. But this would require a lock on
every RMA operation to check that the window on the target task is not
exposed. This would be too large a penalty. A much smaller penalty
would result if we chose one of these options:
a) Disallow concurrent local and remote RMA access to a window,
thus requiring use of MPI_GET when spin-waiting.
or b) Disallow concurrent local and remote RMA access to a window, with
one exception. Local loads (but not stores) of the window are OK at
any time, but MPI_DELIVER must be called to ensure that any
MPI_PUTs delivered since the last sync point are visible to local
loads on the target.
or c) Change the hide/expose routines' semantics. The hide/expose
operation takes effect at the next synchronization point.
This option is really the same as "a".
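Option (a) can be illustrated with a small sketch as well. It is plain C rather than MPI, and opt_a_window, opt_a_get, opt_a_plain_load and opt_a_spin_with_get are hypothetical names. The assumption, which is not something the current draft states, is that an MPI_GET aimed at the caller's own window goes through the RMA path and so reads memory rather than the local cache.

```c
#include <assert.h>

/* Hypothetical toy window: `memory` is what remote puts update, `cached`
   is the possibly stale copy that an ordinary local load would read. */
typedef struct {
    int memory;
    int cached;
} opt_a_window;

/* Stand-in for an MPI_GET on the caller's own window. It is assumed to
   use the RMA path, so it reads memory and never the local cache. */
static int opt_a_get(const opt_a_window *w)
{
    return w->memory;
}

/* Ordinary load, which option (a) disallows while remote access to the
   window is possible. Shown only for comparison. */
static int opt_a_plain_load(const opt_a_window *w)
{
    return w->cached;
}

/* Spin using the get only. The remote put lands at iteration put_at.
   Returns the iteration at which the update was seen, or -1 if never. */
static int opt_a_spin_with_get(opt_a_window *w, int max_iters, int put_at)
{
    assert(max_iters > 0);
    for (int i = 0; i < max_iters; i++) {
        if (i == put_at)
            w->memory = 1;       /* remote put updates memory only */
        if (opt_a_get(w) != 0)
            return i;
    }
    return -1;
}
```

In this model the get-based loop sees the put on the iteration it arrives, while the cached copy read by opt_a_plain_load is left stale. The cost is an RMA operation per loop iteration instead of a local load.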
Synchronization issues aside, the hide/expose approach seems to assume
that a capability exists to activate and suppress caching (or cache coherency)
on the fly. The number of systems which can do this is probably not sufficient
to justify this approach (I think).
The MPI_DELIVER approach to the problem assumes only that there is a way
to invalidate or make the cache coherent at the time of the call. This
seems likely to be a more widely available hardware feature.
Karl
+-----------------------------------+----------------------------------+
| Karl Feind | E-Mail: kaf@cray.com |
| Cray Research, Inc. | Phone: 612/683-5673 |
| 655F Lone Oak Drive | Fax: 612/683-5276 |
| Eagan, MN 55121 | |
+-----------------------------------+----------------------------------+