If I remember correctly, Lloyd objected to the restricted two-phase
collectives (1 per comm) on the grounds that an optimal implementation
would require O(#comm) shared memory, which was too much on his system.
I agree with that, though on my system it is acceptable.
With tags added, the memory is also proportional to the # of concurrent
tags per communicator, i.e. O(#comm x #tags). That starts being too much
memory even "on systems with lots of shmem." Detecting the out-of-shmem
case at runtime and collectively degrading back to point-to-point
requires a collective step that gathers each process's no-shmem flag.
The same problem exists even if we use a fixed # of shmem areas per
communicator (for the first N tags), since users can start these
tagged-two-phase collectives in different orders on different processes.
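
To make the ordering problem concrete, here is a toy sketch in Python. It
is not MPI code: ranks are simulated as plain values, no shared memory is
allocated, and the function names (collective_fallback,
slots_for_first_n_tags) and the "first N tags get an area" policy are
assumptions made up for illustration only.

```python
# Sketch only: ranks are simulated; no real MPI or shmem is involved.

def collective_fallback(no_shmem_flags):
    """Simulate the extra collective step: every rank contributes its local
    no-shmem flag and all ranks receive the logical OR of the flags."""
    return any(no_shmem_flags)

def slots_for_first_n_tags(start_order, n_areas):
    """Hypothetical fixed-pool policy: a rank gives a shmem area to each of
    the first n_areas tags it starts; later tags get no area."""
    return {tag: slot for slot, tag in enumerate(start_order[:n_areas])}

# Two ranks start the same three tagged collectives in different orders.
rank0 = slots_for_first_n_tags([7, 3, 9], n_areas=2)
rank1 = slots_for_first_n_tags([9, 7, 3], n_areas=2)

# Rank-local view: does this rank lack a shmem area for tag 9?
flags_tag9 = [9 not in rank0, 9 not in rank1]

# Only after combining the flags do all ranks agree to use point-to-point.
use_p2p_for_tag9 = collective_fallback(flags_tag9)
```

In this toy run rank 0 has no area for tag 9 while rank 1 does, and the
two ranks also disagree on which area belongs to tag 7, so the fixed pool
does not remove the need for the agreement step.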
> Advice to implementors.
> For systems that implement collective on top of point-to-point,
> this poses no special penalty. For systems that propose to
> use hardware for collective, I suspect there can be objections,
> but these same objections apply to overlapping groups posing
> non-blocking collectives as well, or any case where the special
> hardware must be multiplexed between activities.
This design penalizes optimized implementations but adds no burden on
point-to-point implementations, which anyway are no better than what the
user can hand-code. In the case of shmem collectives, "hardware
multiplexing" is a collective pre-step unless users are forced to start
the tagged-two-phase collectives in the same tag order within each
communicator. In that forced-order case the tag would be nothing more
than a glorified index into the shmem areas, and could not be used for
matching. Feasible, but no better than telling users they can start N
two-phase collectives instead of 1, where N is system-dependent (set in
some attribute).
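
A toy sketch of the forced-order case, again in Python with simulated
ranks rather than real MPI. N_AREAS and area_for_start are hypothetical
names; N_AREAS stands in for the system-dependent limit mentioned above.

```python
# Sketch only: N_AREAS is a made-up stand-in for a system-dependent limit.
N_AREAS = 4

def area_for_start(start_index, n_areas=N_AREAS):
    """With the same start order on every rank, the k-th outstanding
    collective can simply use area k; the tag value plays no role."""
    if start_index >= n_areas:
        raise RuntimeError("more outstanding collectives than shmem areas")
    return start_index

# Every rank starts the tagged collectives in the same order.
same_order = [7, 3, 9]
rank0 = {tag: area_for_start(i) for i, tag in enumerate(same_order)}
rank1 = {tag: area_for_start(i) for i, tag in enumerate(same_order)}
```

Both ranks arrive at the same tag-to-area mapping without any
communication, but only because the position in the start order, not the
tag, selects the area.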
All in all, I still prefer the non-tagged two-phase collectives.
--Raja
-=-
Raja Daoud Hewlett-Packard Co.
raja@rsn.hp.com Convex Division