2-phase collective

Marc Snir (snir@watson.ibm.com)
Fri, 31 Jan 1997 17:46:45 -0400

---------------------- Forwarded by Marc Snir/Watson/IBM Research on
01/31/97 05:45 PM ---------------------------

To: mpi-external @ mcs.anl.gov
cc: Thomas.Boenisch @ RUS.Uni-Stuttgart.DE (bcc: Marc Snir/Watson/IBM
Research)
Subject: two-face collective scenario

An application scenario from a real programmer's life, as background
information for the two-phase collective proposal.

Forwarded message:
> From: Thomas Boenisch <Thomas.Boenisch@rus.uni-stuttgart.de>
>
> Hallo Rolf,
>
> I need an incomplete reduce and allreduce to hide communication time
> and to avoid synchronisation in my application:
>
> One iteration (programmed with your proposed two-phase...):
>
> call MPI_REDUCE_START(teilresl2, resl2, 1, REALx, MPI_SUM, 0,
> . MPI_COMM_WORLD, info)
> call MPI_ALLREDUCE_START(teilresl1, resl1, 1, REALx, MPI_SUM,
> . mpi_comm_world_a, info)
> call MPI_ALLREDUCE_START(teilresl0, resl0, 1, REALx, MPI_MAX,
> . mpi_comm_world_b, info)
>
> c...solve the equation system
> call parallelsolver( rm000, rmf00, rmb00, rm0f0, rm0b0, rm00f,
> . rm00b, dq, rhs, nsubit, nhofhausit )
>
> c...computation of the CFL-Number with resl1
> call MPI_ALLREDUCE_END(mpi_comm_world_a, resl1)
> ...
>
> c...output of all residuals
> call MPI_REDUCE_END(MPI_COMM_WORLD, resl2)
> call MPI_ALLREDUCE_END(mpi_comm_world_b, resl0)
> ...
>
> Ciao
> Thomas

What is the "overlap" argument pushed here? If the argument is that magic
hardware does the reduce in the background while the main computation
proceeds in the foreground, then I would say that there is little urgency
to add this feature to MPI: I am not aware of any existing system that
would do that, nor do I expect such systems to appear soon. Another
possible argument is software overlap, e.g., with a dedicated thread that
makes progress on the reduction while another thread runs the parallel
solver. But I doubt that this is a good enough reason. The only other
argument that I can think of is one of uneven progress: asynchronous
operations allow a looser dependency between the time an operation is
started at one process and the time it is started at another process,
allowing for more asynchrony without wasted idle cycles. For this argument
to work, we need an implementation of 2-phase collectives with the
following properties:

a. Each process can enter the 1st phase without blocking and waiting for
other processes to enter the 1st phase.
b. Once all processes have entered the first phase, each process can exit
the second phase without blocking or waiting for other processes to enter
the second phase.
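
As a rough illustration of the uneven-progress argument, here is a toy
timing model in Python. It is only a sketch under stated assumptions:
properties a and b hold, the reduction itself costs no time, and the
function names, arrival times and work times are hypothetical, not taken
from the proposal.

```python
def idle_blocking(arrival):
    """Idle time per process when a blocking reduce precedes the solver.

    Assumption: every process waits inside the reduce until the last
    process has arrived, and the reduction itself takes zero time.
    """
    last = max(arrival)
    return [last - t for t in arrival]


def idle_two_phase(arrival, work):
    """Idle time per process with a 2-phase reduce around the solver.

    Process i enters phase 1 at arrival[i] without waiting (property a),
    runs the solver for work[i], then exits phase 2. Under property b it
    waits there only if some process has not yet entered phase 1.
    """
    last = max(arrival)
    return [max(0.0, last - (t + w)) for t, w in zip(arrival, work)]


# Hypothetical example: three processes arriving at different times,
# each with two time units of solver work.
arrival = [0.0, 1.0, 3.0]
work = [2.0, 2.0, 2.0]
blocking = idle_blocking(arrival)          # [3.0, 2.0, 0.0]
two_phase = idle_two_phase(arrival, work)  # [1.0, 0.0, 0.0]
```

In this model the skew between processes is absorbed by the solver work
instead of being spent waiting. If either property fails, one of the two
calls blocks and the idle time reverts to the blocking case.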

Note that this is the implicit behavior we expect from nonblocking
point-to-point: the Isend and Irecv are nonblocking and proceed
irrespective of the matching call. And, once both an Isend and an Irecv
have been posted, a wait should be able to complete the communication on
one side, irrespective of what's happening on the other side (progress...).

I don't suggest that we mandate such behavior, only that such behavior
should be possible in high quality implementations. Otherwise, we have
replaced a blocking collective by two collective calls, one of which is
blocking, but we don't know which: hardly progress.

So, does anyone have a suggestion on how to implement (efficiently) a
nonblocking reduce, so that this holds? The only way I know how to do this
is for systems with shared memory and with few processes: a globally shared
variable is used for the reduction; in the 1st phase every process locks
it, adds its own value and unlocks it; in the second phase every process
reads the result (we also need a counter to make sure that everybody has
added its contribution). This requires shared memory and is not scalable.
If nobody can think of a scalable, truly asynchronous implementation
(warning: this is not easy!) I would suggest dropping these 2-phase
collective constructs.
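
The shared-memory scheme described above can be sketched as follows, using
Python threads to stand in for processes. This is a minimal illustration,
not an MPI implementation: the class TwoPhaseSum, its start/end methods and
the run_demo helper are hypothetical names, and the object is good for a
single reduction only.

```python
import threading


class TwoPhaseSum:
    """Single-use two-phase sum over one shared variable (hypothetical)."""

    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.total = 0.0
        self.count = 0  # number of contributions added so far
        # The condition variable owns the lock guarding total and count.
        self.cond = threading.Condition()

    def start(self, value):
        # Phase 1: lock, add own value, unlock. The lock is held only
        # briefly; the caller never waits for other participants to arrive.
        with self.cond:
            self.total += value
            self.count += 1
            if self.count == self.nprocs:
                self.cond.notify_all()

    def end(self):
        # Phase 2: read the result. This waits only while some
        # contribution is still missing, as tracked by the counter.
        with self.cond:
            self.cond.wait_for(lambda: self.count == self.nprocs)
            return self.total


def run_demo(values):
    """Run one reduction with one thread per value; return each result."""
    red = TwoPhaseSum(len(values))
    results = [None] * len(values)

    def worker(rank):
        red.start(values[rank])
        # ... local computation (e.g. the parallel solver) would go here ...
        results[rank] = red.end()

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(len(values))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every participant serializes on the same lock and the same variable, which
is the scalability limit noted above: the cost of phase 1 grows with the
number of participants, and the scheme has no counterpart without shared
memory.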