Comments on collective ops, persistence and threads in MPI-2

Tony Skjellum (tony@Aurora.CS.MsState.Edu)
Fri, 19 Apr 1996 05:00:51 -0500

Dear Collective subcommittee, this write-up is long overdue, but I think
we have a clearer picture of the whole collective extensions now. Please
consider this as the prequel to the chapter write-up to be given Monday.
I am cleaning up the whole thing.

Persistent collective operations
--------------------------------

Rationale:
In MPI-1, it is stated roughly that the effect of an MPI_SEND_INIT()
+ MPI_START() is insignificantly different from an MPI_ISEND(), and hence
there was no need for an MPI_ISEND_INIT() at all.

As currently worded in our chapter, MPI-2 will have non-blocking
collective operations without tags. Currently, we as a forum remain
ambivalent as to whether part of the comm's group may use non-blocking, and
the rest blocking. I will try to address this as part of the
discussion that follows. [Our only straw poll was nearly tied. We have
tried not to compromise MPI-1 functionality by adding MPI-2 features,
and reducing performance of MPI-1 features is one such unattractive
side effect.]

Background
----------
A short definition of non-blocking collective proposal (as I see it):
For every collective operation in MPI-1, a non-blocking [I]
variant is proposed that adds a non-persistent request argument, which,
when used with the MPI_WAIT/TEST suite, provides a local completion mechanism.

For every collective operation posed elsewhere in MPI-2, a
non-blocking [I] variant is created by implication, as for the MPI-1
operations.

Example: MPI_IBARRIER(comm, request).

I want to retain this formula for defining the expanse of collective
non-blocking operations, as there are many in other chapters now,
as well as those introduced in MPI-1 and added in the Collective Extensions.

Now, it is proposed that there be persistent request versions, for each such
operation, in addition:

Example: MPI_IBARRIER_INIT(comm, request);
...
MPI_START(request);
...
MPI_WAIT(request, status);

This operation, and those with actual data motion, should be more
optimizable than the straight non-blocking variety, because the number
of outstanding non-blocking persistent handles on a communicator is a
hint to the implementation about the potential concurrency and amount
of safe collective comm. space needed (I will say it: extra contexts).
Hence, for situations where non-blocking collectives are used inside
loops, the persistent varieties are preferable (in this proposal's view)
to the nonce forms. [Of course, the persistent forms, which have so-called thin
interfaces, do not let you change anything except the contents of buffers].
{Implementor's question: for language bindings that use by-reference
parameter passing, have we explicitly restricted changes to just buffer
contents, or can tags change etc? Is there canon on this issue somewhere?}

Persistent collective operations, where data sizes and
other arguments are frozen, should be much more optimizable, and
amenable to poly-algorithmic selection. Currently, because arguments
can change, and in some operations, need not be the same across all
processes, it would be exorbitantly expensive to try to choose the
best algorithm on the fly. The persistent approach rewards reuse, and
opens the door to implementations that study the collective problem on
the _INIT call, and then do it more efficiently at each START.
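
As a rough illustration of "study the problem on _INIT, reuse at each
START", here is a minimal C sketch that does not use MPI at all. The
names (coll_plan, plan_init, plan_start), the three algorithm labels and
the thresholds are all hypothetical placeholders, not part of any proposal
text; the point is only that the selection cost is paid once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical algorithm choices; names are illustrative only. */
typedef enum { ALG_LINEAR, ALG_TREE, ALG_PIPELINE } alg_t;

/* Mock "persistent request": everything but buffer contents is frozen. */
typedef struct {
    alg_t  alg;     /* chosen once, at init time */
    size_t nbytes;  /* frozen message size */
    int    nprocs;  /* frozen group size */
    int    starts;  /* how many times the plan has been started */
} coll_plan;

/* Study the problem once. The thresholds are made-up placeholders. */
static void plan_init(coll_plan *p, size_t nbytes, int nprocs)
{
    assert(nprocs > 0);
    p->nbytes = nbytes;
    p->nprocs = nprocs;
    p->starts = 0;
    if (nprocs <= 4)
        p->alg = ALG_LINEAR;
    else if (nbytes < 65536)
        p->alg = ALG_TREE;
    else
        p->alg = ALG_PIPELINE;
}

/* Each start reuses the cached decision; no arguments are re-analyzed. */
static alg_t plan_start(coll_plan *p)
{
    p->starts++;
    return p->alg;
}
```

A real implementation would presumably also have to agree on the choice
across the group during the _INIT call; that step is omitted here.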

(Proposal 1)
In summary, there should be persistent variants for each MPI-2
non-blocking collective call. These are fundamentally the most
important for high performance. The non-persistent versions should be
provided as calorically tolerable syntactic sugar, as they do not significantly
increase implementation effort, yet aid the user considerably.

- - - - - - - - -

Now, in returning to the issue of the impact of mixing blocking and
non-blocking across the group of a communicator: the argument against
is that certain highly synchronous collective operations are possible
within special parts of memory hierarchy/network fabric, and that this
is lost if it is necessary to provide for asynchronous completion in an
arbitrary subset of the communicator's processes. The argument for has
been stated (eg, by Paul Pierce) as nice functionality that users will
expect, and Marc Snir has argued elegantly that it offers nice symmetry
with point-to-point, where SEND's or ISEND's may be matched with RECVS
or IRECVS and so on.

let comm have six ranks: 0-5

do
MPI_IBARRIER(comm,request)
MPI_WAIT(request,status)
in 0-3

and

MPI_BARRIER(comm)
in 4-5

Surely, if MPI_BARRIER is implemented as the non-blocking form, followed
by WAIT, then there is no big deal about providing the functionality!
Can an implementation do better? {Would an implementation not rather
layer non-blocking on top of blocking?}

Consider: The asymmetry could possibly be captured with a persistent operation.
For instance, consider executing

MPI_BARRIER_INIT(comm, request) (in all ranks of comm)

Then, if ranks 0-3 do MPI_START(request), they have launched a
non-blocking operation, and can defer the WAIT until they wish to do
so. Likewise, the rest could do an MPI_START()+MPI_WAIT() and
approximate immediate blocking.

In this case, only the persistent operation has to allow the non-blocking
and blocking mix, and we could insist that the non-persistent forms
not be mixed. This gives the full functionality, addresses the need
for mixture (but not with the complete freedom wanted by some).
(Call this option 2a).

The problem is that the persistent operation does not capture the
method of planned invocation (to block or not to block), nor can it,
unless this were posed as part of the persistent initialization!
Hence, the persistent operation cannot use the best blocking barrier,
per se, even if all the processes know they will block on every use of
the persistent operation.

So, we offer an alternative in which each persistent collective op could
have an argument to be blocking or nonblocking. This would create a
further asymmetry in the argument lists compared to the non-persistent
operations, but appears to capture the semantics fully, and allow a
fully blocking barrier to work faster. (Call this option 2b).

Example: MPI_BARRIER_INIT(comm, blockflag, request)
If all blockflags are TRUE, across comm, then

MPI_START(request)

uses the best blocking algorithm for the barrier

whereas, if blockflags are TRUE in some processes, and FALSE in the
rest, then

MPI_START(request)

uses a potentially less optimized algorithm, which still might exploit
blocking in subsets.

One tiresome aspect of a proposal that allows collectives to block in START
is possible deadlocks. So, I see this as defective, and one should not use
MPI_START if one expects partial blocking of the operation. MPI_STARTALL
with this side effect seems evil too.

(Therefore, Option 2c.)
If a separate operation were required for persistent collective
operations that had blocking components, one might code as follows,
using MPI_DO(), which indicates the requirement for synchronization.

blockflag = (some function of rank in comm);
MPI_BARRIER_INIT(comm, blockflag, request);
...
MPI_DO(request, status);
if(blockflag != TRUE) MPI_WAIT(request, status);

MPI_DO() is a new, proposed function, like MPI_START(), but with
the caveat of blocking in some processes.

This approach would arguably restrict the complexity added to be only
in the START-like functionality and not to the WAIT/TEST family. Fully
blocking calls achieve maximum performance using the anonymous, thin
interface of MPI_DO(), but do block there, making MPI_DO have
collective semantics, unlike MPI_START(), which behaves like a local
operation.

Should there be an inquiry mechanism to set/change the blocking/non-blocking
status of a persistent request?

Should there be an MPI_ALLDO()? MPI_ALLDO(count, requestarray) would
do an unsequenced set of partially blocking operations. Seems like a
hard one to get right. Let's steer away for now. {It should be noted
that an array of persistent requests could itself be made a
meta-persistent operation, which would be more optimizable. John
Salmon (Caltech) and several other folks at TJ Watson wanted some type of
multi-persistent request technology, and a good reason for that is more
potential runtime optimizability.}

(2)
In summary: An add-on proposal is that persistent collective calls
elect to block/non-block on a per-process basis when instantiated; they
do so by adding an argument that flags this (blockflag).

Blocking and non-blocking non-persistent calls may not be matched (ie,
IBARRIER with BARRIER over the same comm), and an MPI_DO() mechanism is
needed, to service mixed or all-blocking persistent operations. Both
pt2pt and collective persistent handles can appear in MPI_START() or
MPI_DO(), but any appearing in MPI_START() will behave completely
non-blocking, while those in MPI_DO() will be selectively blocking/non-blocking,
according to their initial definition.

If we agree that the argument below, for layering non-blocking on top
of blocking, is valid, then we drop the mixing requirement here, but we keep
the features of the persistent versions to maximize performance. (See below).

- - - - - - - - -

Threads and collective operations. {Processes remain the named
entities in this discussion.} As we see it,

Model #1 - Threads are created only when doing special form of
MPI_COMM_DUP(), call it MPI_ICALL (See def below).

Assumes one thread per communicator per process.

This model is conceptually easy to support for MPI. By extension,
additional threads that only do send and receive, and match
up calls using tags, can be supported. The ICALL'd procedure
gets a new communicator and a new thread in each process,
and the parallel thread ends when the call ends.

The safe communication space needed for collective is restricted,
and one primary thread per process would be restricted to using
collective. Inside the ICALL, a new communicator would be
present to offer one more concurrent collective operation.
{Subsequent layers of calls would not have further concurrency,
this would have to be recoded to replace synchronous calls
by asynchronous calls.}

Restriction against multiple collectives...
* This could be achieved by binding a thread to
a communicator's collective service, to detect
erroneous uses
* Don't detect, just declare erroneous
* It could be policed by the user program strictly

{Implementors would cache sleeping threads and reactivate them
so that further ICALLs in a process, instantiating
more MPI-created threads might prove cheaper than the first
instantiation... It is a reality that thread creation is
expensive, and a simple wrapper around a thread call could
be used with common systems to allow the thread to be reused
without the full initialization cost.}

More formally:
MPI_ICALL(comm,f,arg,request) is added, that provides an asynchronous,
well-formed parallel call, and dup's the communicator. Existing
parallel code that itself dups the communicator finds this operation
fast in good implementations {there is argumentation about
MPI_COMM_DUP() that shows that it is easily
optimizable for repetitive use on the same underlying group},
and a new thread executes the call asynchronously from the
parallel parent thread. {This preserves existing MPI libraries
as is.}

Any additional, non-MPI threads may only do send/recv.

Example:
MPI_ICALL(comm, f, arg, request)
[asynchronously calls f(comm',arg) in parallel]
...
MPI_WAIT(request, status)
{comm' is a simple dup of comm}

This is clearly useful, a single call, and adds to MPI's
functionality without bringing in lots of details about threads.

Any needed mutual exclusion technology remains outside of MPI,
or as analogues to that which is needed anyway to offer one-sided
operations.

{An intercomm form of ICALL would be extremely useful, and will be
posed if ICALL is accepted, and will be dropped if ICALL fails.
Such a call would build a bi-partite, completely overlapped pair
of communicators between parent and child parallel threads (analogous
to SPAWN), and allow for light-weight messaging between threads.}

Model #2 - Many collective operations may be posed per communicator
in the presence of multiple threads. No special MPI calls
for threads.

When we had asynchronous collective operations with tags, or
synchronous collective with tags, this was a snap for the user
to orchestrate. We dropped collective tags in MPI-1 because we
saw no use (threaded situations are in fact a use for them), and
in MPI-2 on non-blocking ops because it was an asymmetry. {BTW,
layered implementations tend to use the tag internally to
guarantee segregation of sequential compositions of collective
ops on the same communicator, but they only need a few bits
of the tag to do this correctly.}
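
To make the "few bits of the tag" remark concrete, here is a small C
sketch of how a layered implementation might pack a per-communicator
collective sequence number into an internal tag. The bit layout (3 low
bits of sequence, operation code above) and the helper names are
assumptions made for illustration; no particular implementation is being
described.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low SEQ_BITS bits hold the sequence, op code sits above. */
enum { SEQ_BITS = 3, SEQ_MASK = (1 << SEQ_BITS) - 1 };

/* Build an internal tag for the seq-th collective posted on a communicator.
 * The sequence wraps modulo 2^SEQ_BITS, so only a bounded number of
 * collectives may be in flight at once for tags to stay distinct. */
static int32_t coll_tag(uint32_t op_code, uint32_t seq)
{
    assert(op_code < (1u << 16));   /* keep the tag small and positive */
    return (int32_t)((op_code << SEQ_BITS) | (seq & (uint32_t)SEQ_MASK));
}

/* Recover the sequence bits from an internal tag. */
static uint32_t coll_tag_seq(int32_t tag)
{
    return (uint32_t)tag & (uint32_t)SEQ_MASK;
}

/* Recover the operation code from an internal tag. */
static uint32_t coll_tag_op(int32_t tag)
{
    return (uint32_t)tag >> SEQ_BITS;
}
```

Two consecutive collectives of the same kind get different tags, so their
point-to-point traffic cannot be confused, while the wraparound shows why
only a few bits are needed if the number of overlapping operations is small.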

We propose that there be a set of persistent handles defined for the
collective operations by a master thread per process (in the comm), and
that these persistent collective ops be used to achieve the desired
asynchrony. All policing is by user-code that decides which threads
own which persistent handles in each user process.

If a persistent handle is used in a parallel thread, it is the
persistent handle (request) that keeps the threads from getting
confused. This is also true in the base implementation of non-blocking
collectives where (in the absence of tags), non-blocking operations
must be posed in order across the group of the comm so that their
requests contain the ordering information (which is no different than a
context integer). So, one solution is to predefine lots of collective
ops on a single communicator, then create lots of threads using your
favorite thread package, and rely on persistent requests to keep things
clean. This implies a dual layer of asynchrony, so the MPI_DO()
option mentioned above seems indicated together with blocking flags,
so that the only asynchrony is the user's threads.

- - - - - - -

About layering on MPI, without anything special at all said about
threads from the user's perspective, but support for non-blocking
collectives included.

MPI_COMM_DUP() and MPI_COMM_DUP_LOCAL() were originally posed, and
MPI_COMM_DUP() survived the MPI-1 standard. The latter, _LOCAL() form,
was meant to allow quick, completely local communicator formation, if
communication space had been reserved (eg, recently created then
discarded). It would fail if communication were required to generate
the new communicator, indicating the need for a full MPI_COMM_DUP()
call, with non-local semantics. Good implementations are meant as of
now to provide the effect of COMM_DUP_LOCAL() internally, by caching
the (context words) internal information needed for the safe
communication space, per group. However, for layering, explicit access
to this call is helpful.

_LOCAL failed in MPI-1 because no compelling reason was posed for it.
One such belated rationale is that layering of non-blocking collective
operations in a portable fashion on top of blocking collectives using
threads and MPI_COMM_DUP_LOCAL() seems relatively easy, whether or not
the user himself/herself would use the call. It essentially lets you
know that safe communication space is available immediately, so that a
measure of decision making about the cost of committing to a COMM_DUP()
can be made. If available, one can safely spawn additional
asynchronous collective communications, for instance, by using the locally/
cheaply DUP'd communicator, and reserve the parent communicator for
further DUP's as needed.

It is suggested that a principle about layering of non-blocking on top
of blocking collective communication be established {analogous to the
rule about collective on top of point to point being possible}, and the
reintroduction of MPI_COMM_DUP_LOCAL() and possibly related calls such
as MPI_COMM_SPLIT_LOCAL() be considered. {One can safely ignore
MPI_COMM_CREATE().}

In such a situation, fully portable MPI code could access the MPI
layers to get communication dups locally, and using one's favorite
thread package, cache sleeping threads (per process) to activate when
needed for concurrency. IMHO, this would allow maximizing the amount
of reference or model implementation work that would be fully
portable. {Portable mechanisms for modifying requests would also be
needed, btw, but this is only a small stretch beyond layerability.}

How does this bear on the issue of using MPI_BARRIER() in some
processes, and MPI_IBARRIER() in others?

If IBARRIER is layered on barrier_implementation(comm)
IBARRIER := MPI_COMM_DUP_LOCAL(comm, local_comm)
            if local_comm is not OK
            {
                lock comm
            }
            Stuff a sleeping pre-allocated thread with
                if local_comm is not OK
                {
                    MPI_COMM_DUP(comm, local_comm);
                    unlock comm
                }
                barrier_implementation(local_comm)
                Use a portable mechanism for request marking (!)
                MPI_COMM_FREE(local_comm)
                Sleep the parallel thread

This could match up with the blocking logic:

MPI_BARRIER(MPI_Comm comm)
{
    MPI_COMM_DUP_LOCAL(comm, local_comm) // briefest locking of comm only
    if(local_comm is not OK) // eg, just a fetch and add or so
    {
        lock comm
        MPI_COMM_DUP(comm, local_comm);
        unlock comm
    }
    barrier_implementation(local_comm)
    MPI_COMM_FREE(local_comm)
}

Note: If MPI_COMM_DUP_LOCAL() works, it works everywhere. When it
fails, the portable code must prevent its parent thread from proceeding,
so it can use the safe comm. space existing already to get more.

The lock of the communicator, FBO the non-blocking layer, could be in
an attribute, whereas DUP_LOCAL might use a small atomic access that
is much cheaper than a full lock (eg, a mutex).

This bounds the extra work a blocking operation has to do to
an MPI_COMM_DUP_LOCAL() and an MPI_COMM_FREE() on such an object.
If both are inlined, and carefully done, this is not a lot of instructions,
especially if caching of the freed objects is emphasized, rather than
actually doing memory management that puts data back on the heap. To
get success from later MPI_COMM_DUP_LOCALs, one wants to do this, mostly,
anyway.
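
The "small atomic access" plus "cache the freed objects" idea can be
sketched in C11 without any MPI. Here the reserved safe communication
space for a group is modeled as nothing more than an atomic counter; the
names ctx_cache, ctx_try_dup_local and ctx_free are hypothetical, and a
real implementation would hand out actual context values rather than
just counting them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-group cache of reserved, currently unused contexts. */
typedef struct {
    atomic_int reserved;
} ctx_cache;

static void ctx_cache_init(ctx_cache *c, int n)
{
    assert(n >= 0);
    atomic_init(&c->reserved, n);
}

/* Analogue of a DUP_LOCAL attempt: purely local, never communicates.
 * Returns false when nothing is reserved, which is the signal that a
 * full, non-local dup would be needed. */
static bool ctx_try_dup_local(ctx_cache *c)
{
    int cur = atomic_load(&c->reserved);
    while (cur > 0) {
        /* On failure, cur is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&c->reserved, &cur, cur - 1))
            return true;
    }
    return false;
}

/* Analogue of a caching free: the context goes back to the cache,
 * not to the heap, so a later local dup can succeed. */
static void ctx_free(ctx_cache *c)
{
    atomic_fetch_add(&c->reserved, 1);
}
```

On the fast path this costs one load and one compare-and-swap per dup and
one fetch-and-add per free, which is the kind of bound argued for above.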

One fly in the ointment is that, to be truly layered on top of an MPI
implementation, the user program would have to have access to
barrier_implementation(comm). This could be done by providing
a name for the versions of collective operations that actually
use the given communicators instead of DUPing. These would
not interoperate between blocking and nonblocking. Like profiling,
these names could be standardized as shifts of the public names.

This argumentation, if not found to have a flaw, shows an upper bound
on the impact of mixing blocking and nonblocking, which is probably
acceptable.

Finally, how does this impact the persistent operation, when all
mark the blocking flag as true?

Here is an example (poly-algorithmic non-blocking, persistent collective op):
IBARRIER_INIT(comm, flag, request) :=
do a collective comm to see if all agree on true
if all true, ptr := barrier_implementation
else if(flag is false) ptr := IBARRIER else ptr := BARRIER

Save ptr as a cached datum inside request
Save a reference to comm inside request
(Could Save dup'd communicator inside request instead,
skip this optimization for now)
Save local blocking flag

MPI_DO(request) :=
Extract ptr, comm from request
(*ptr)(comm);
// do some clean up

{Other collective operations would store arguments lists, but the idea
is captured fully here}.
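
For what it is worth, the selection step can be mocked up in plain C.
This sketch assumes the reading from option 2b, namely that an all-TRUE
blockflag across comm selects the raw blocking algorithm, and that in a
mixed group each process picks the blocking or non-blocking layered form
from its own flag. The flags[] array stands in for the result of the
collective agreement, and all names (barrier_req, barrier_init_sketch,
needs_wait) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Which implementation the persistent request will dispatch to. */
typedef enum { IMPL_RAW_BLOCKING, IMPL_BARRIER, IMPL_IBARRIER } impl_t;

/* Mock persistent barrier request. */
typedef struct {
    impl_t impl;         /* cached choice, the "ptr" of the pseudocode */
    bool   local_block;  /* this process's blocking flag */
} barrier_req;

/* flags[i] is the blockflag given by rank i; in real life this
 * information would be gathered by a collective during the init call. */
static void barrier_init_sketch(barrier_req *r, const bool *flags,
                                int nprocs, int rank)
{
    assert(rank >= 0 && rank < nprocs);
    bool all_block = true;
    for (int i = 0; i < nprocs; i++)
        all_block = all_block && flags[i];

    r->local_block = flags[rank];
    if (all_block)
        r->impl = IMPL_RAW_BLOCKING;   /* best fully blocking barrier */
    else
        r->impl = flags[rank] ? IMPL_BARRIER : IMPL_IBARRIER;
}

/* After the DO-like call, only non-blocking participants still wait. */
static bool needs_wait(const barrier_req *r)
{
    return !r->local_block;
}
```

The needs_wait() helper mirrors the "if(blockflag != TRUE) MPI_WAIT(...)"
line of the earlier usage example.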

Recommendation: Subcommittee recommends to the full committee that
non-blocking be layerable on blocking in MPI-2 implementations, analogous
to the MPI-1 requirement of blocking collective layering on pt2pt
{where the magic of separate collective contexts helps}.

- - - - - - -

Overall, further discussion of threads vs. persistent vs. asynchronous
collective ops is suggested.

{It should be noted that what is holding back threaded MPIs is
the need for thread-safe implementations, and those which can also
support concurrency from device level up in addition to being safe (but
potentially using giant locks). This has lagged first because of
threads being behind the eight ball on many systems, and now because
the large, existing code base of MPI implementations must be reworked
to make them suitable. Several threaded MPI efforts are evidently
forthcoming. My view is that MPI should fix a set of asynchronous call
options, in addition to asynchronous collectives, analogous to ICALL.
We could also require MPI implementations to be thread safe to be
called MPI-2 compliant. MPI should provide the whole environment needed,
not rely on a secondary package like DCE threads, or such to get the
whole program to work.

For instance, many optimizations inside an implementation, such as
separate queues per communicator, could help allow for needed
concurrency in multithreaded messaging, but devices will still
need to be arbitrated, and all device code, even mfgr drivers, will
have to be safe and sane.

We should also consult the external interface chapter to relate
this to the current view of asynchronous support there.}

Note in closing: here we do not consider promoting threads to be named
entities for MPI. That is also possible, but beyond current scope.