Re: polling is less efficient with alternate proposal

Raja Daoud (raja@tbag.rsn.hp.com)
Thu, 11 Jul 1996 13:24:51 CDT

>
>
> Overview
>
> After my first read of the alternate 1-sided chapter which folds WINDOW_IN/OUT
> into MPI_BARRIER and MPI_WINDOW_LOCK/UNLOCK, I see only one item of concern
> for users who need to do efficient polling synchronization using RMA PUTs and
> GETs. In this note I'll describe that concern.
>
> Because of this concern, I prefer the voted-on version of the chapter
> over the alternate proposal.
>
>
> The Concern
>
> As background, many users use the {barrier, communicate, barrier, compute}
> cycle in their parallel algorithms. These users seem about equally well off
> under the voted-on version and the alternate proposal.
>
> However, some codes can be optimized further by replacing the barrier
> with point-to-point synchronization with each of a small number of neighbors.
> The fastest way to do this is probably a call to MPI_FENCE, then an MPI_PUT
> that delivers the flag word after all prior PUTs on the communicator are
> "complete" (globally visible). It is desirable for the neighbor to be able
> to poll on the completion flag, which gives a very low-latency method of
> synchronization on DMA systems. Of course, the voted-on proposal requires
> the receiver to poll using MPI_GET (not local loads), but the atomicity and
> progress requirements seem sufficient to make this work.
>
> With the alternate proposal, the producer code looks like this:
>
>     MPI_PUT              put data to the window
>     MPI_WINDOW_LOCK
>     MPI_PUT              write the completion flag to the window
>     MPI_WINDOW_UNLOCK
>
> The polling code that follows, in which every process waits for its
> neighbors, must look like this:
>
>     while (flag not set) {
>         MPI_WINDOW_LOCK
>         MPI_GET
>         MPI_WINDOW_UNLOCK
>     }
>
> The LOCK calls will add significant latency to this type of exchange.
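
The per-iteration difference between the two polling loops can be made concrete with a small cost model. This is a rough sketch, assuming each call has a fixed latency and nothing overlaps. The type rma_cost_t, its fields, and the function names are hypothetical, and any latency values supplied are placeholders rather than measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-call latencies in nanoseconds; placeholders only. */
typedef struct {
    unsigned long get_ns;
    unsigned long lock_ns;
    unsigned long unlock_ns;
} rma_cost_t;

/* Voted-on version: each poll iteration issues one MPI_GET. */
unsigned long poll_calls_voted_on(unsigned long polls)
{
    return polls;
}

/* Alternate proposal: each poll iteration issues LOCK, GET, UNLOCK. */
unsigned long poll_calls_alternate(unsigned long polls)
{
    return 3UL * polls;
}

/* Modelled time spent polling under the voted-on version. */
unsigned long poll_cost_voted_on(const rma_cost_t *c, unsigned long polls)
{
    assert(c != NULL);
    return polls * c->get_ns;
}

/* Modelled time spent polling under the alternate proposal. */
unsigned long poll_cost_alternate(const rma_cost_t *c, unsigned long polls)
{
    assert(c != NULL);
    return polls * (c->lock_ns + c->get_ns + c->unlock_ns);
}
```

Under this model the alternate loop issues three calls for every one in the voted-on loop, and the extra time per iteration is exactly the LOCK plus UNLOCK latency. The model ignores contention on the lock itself, which could make the real gap larger.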
>
>
>
> Karl Feind                      E-Mail: kaf@cray.com
> Cray Research, an SGI Company   Phone:  612/683-5673
> 655F Lone Oak Drive             Fax:    612/683-5276
> Eagan, MN 55121
>