polling is less efficient with alternate proposal

Karl Feind (kaf@cray.com)
Thu, 11 Jul 1996 12:50:18 -0500

Overview

After my first read of the alternate 1-sided chapter, which folds WINDOW_IN/OUT
into MPI_BARRIER and MPI_WINDOW_LOCK/UNLOCK, I see only one item of concern,
and it affects users who need to do efficient polling synchronization using
RMA PUTs and GETs. In this note I'll describe that concern.

Because of this concern, I prefer the voted-on version of the chapter
over the alternate proposal.

The Concern

As background, many users use the {barrier, communicate, barrier, compute}
cycle in their parallel algorithms. These users seem about equally well off
under the voted-on and the alternate proposals.

However, some codes can be optimized further by replacing the barrier
with point-to-point synchronization with each of a small number of neighbors.
The fastest way to do this is probably a call to MPI_FENCE, then an MPI_PUT
to deliver the flag word after all prior PUTs on the communicator are
"complete" (globally visible). Ideally the neighbor can then poll on the
completion flag, which gives a very low-latency method of synchronization
on DMA systems. Of course, the voted-on proposal requires the receiver to
poll using MPI_GET (not local loads), but the atomicity and progress
requirements seem sufficient to make this work.

With the alternate proposal, the producer code looks like this:

    MPI_PUT              put data to the window
    MPI_WINDOW_LOCK
    MPI_PUT              write the completion flag to the window
    MPI_WINDOW_UNLOCK

The polling code that follows, in which every process waits for its
neighbors, must look like this:

    while (flag not set) {
        MPI_WINDOW_LOCK
        MPI_GET
        MPI_WINDOW_UNLOCK
    }

The LOCK and UNLOCK calls around every MPI_GET will add significant latency
to this type of exchange.
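
To make the shape of the two polling loops concrete, here is a minimal
shared-memory sketch in C. It is an analogy only and does not use MPI: a
thread stands in for the neighbor process, an atomic flag for the completion
word, a release store for "all prior PUTs are globally visible", and a
pthread mutex for MPI_WINDOW_LOCK/UNLOCK. All names in it (window_data,
window_flag, window_lock, put_args, producer, poll_flag, exchange) are
hypothetical and appear in neither proposal.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for an RMA window shared by two "processes". */
static int window_data;                 /* payload delivered by the first PUT */
static atomic_int window_flag;          /* completion flag word               */
static pthread_mutex_t window_lock = PTHREAD_MUTEX_INITIALIZER;

struct put_args { int value; int use_lock; };

/* Producer: write the data, then raise the flag. The release store models
   the requirement that the data is visible before the flag is seen. */
static void *producer(void *p) {
    struct put_args *a = p;
    window_data = a->value;
    if (a->use_lock) pthread_mutex_lock(&window_lock);
    atomic_store_explicit(&window_flag, 1, memory_order_release);
    if (a->use_lock) pthread_mutex_unlock(&window_lock);
    return NULL;
}

/* Consumer: poll until the flag is set and return the number of polls.
   With use_lock set, every iteration pays for a lock/unlock pair, which
   is the extra cost the alternate proposal's loop would carry. */
static long poll_flag(int use_lock) {
    long polls = 0;
    int seen = 0;
    while (!seen) {
        if (use_lock) pthread_mutex_lock(&window_lock);
        seen = atomic_load_explicit(&window_flag, memory_order_acquire);
        if (use_lock) pthread_mutex_unlock(&window_lock);
        polls++;
    }
    return polls;
}

/* Run one producer/consumer exchange and return the data the consumer
   observes after the flag is set. polls_out may be NULL. */
int exchange(int value, int use_lock, long *polls_out) {
    struct put_args args = { value, use_lock };
    pthread_t t;
    window_data = 0;
    atomic_store(&window_flag, 0);
    int rc = pthread_create(&t, NULL, producer, &args);
    assert(rc == 0);
    (void)rc;
    long polls = poll_flag(use_lock);
    int got = window_data;              /* safe: ordered after the acquire load */
    pthread_join(t, NULL);
    if (polls_out != NULL) *polls_out = polls;
    return got;
}
```

Calling exchange(42, 0, NULL) and exchange(42, 1, NULL) should both return
42; the difference is that the second form takes and releases the lock on
every pass through the polling loop. How much that costs on a real RMA
implementation is not something this sketch can measure. Older toolchains
may need -pthread when building it.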

Karl Feind E-Mail: kaf@cray.com
Cray Research, an SGI Company Phone: 612/683-5673
655F Lone Oak Drive Fax: 612/683-5276
Eagan, MN 55121