Re: [mpi-21] Proposal: Topological Collectives for MPI-3
Hi George,
> If I correctly understand your proposal these two functions are just a
> less powerful version of the MPI_Alltoallv and MPI_Alltoallw. There is
> nothing in the proposed neighbors collectives that cannot be achieved
> by using only ONE carefully crafted alltoall[v|w]. Additionally, I
> hardly see any optimization in terms of communications that can be
> applied to these neighbors collective and that cannot be applied to
> the corresponding alltoall collective.
Yes, you are right, Alltoall[v,w] can be used in a versatile way. They
can express all communication patterns by definition and can thus easily
replace MPI_Gather, MPI_Scatter and of course MPI_Alltoall. The other
non-reduction functions can also be emulated with local buffer copies.
Nevertheless, these functions are standardized. Possible reasons include:
1) the communication pattern is fixed and the network could be
initialized to support it (e.g., static routing could be adjusted, cf.
IB) or offloading NICs can be programmed to support this pattern
2) messages can be scheduled in a more intelligent manner than
MPI_Alltoall[v,w] allows, because every node can (at least for the
non-vector variants) compute the global view and thus allow
receiver-based scheduling (think of non-FBB networks where the packets
have to be routed). MPI_Alltoall[v,w] only has a very limited view of
the global communication, and it is unclear whether it is reasonable to
establish a global view.
3) MPI_Alltoall[v,w] is very hard to optimize; many implementations do
not optimize it at all. Actually, I could not name a single implementation
that does.
4) Parallel systems are steadily growing in size and communications are
usually extremely sparse. Nearest neighbor communications and shifts in
particular are very sparse (usually 2-6 neighbors). MPI_Alltoall[v,w]
is rather inefficient in this case because the interface is suboptimal
for sparse communication, i.e., on a 128k processor system with a common
4-point stencil nearest neighbor communication the user has to
initialize 4 (!) arrays of size 128k*sizeof(int) where only 4*4 values
are non-zero. And the MPI implementation has to inspect all those
zero values to find that there is not much to do. The storage space
required on such a big system (assuming 64 bit integers) would be 4MB
for every communication.
5) MPI_Alltoall[v,w] seems more complicated and inconvenient for the
application programmer.
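
To make the numbers in 4) concrete, here is a minimal sketch in plain C
(no MPI calls) that computes the argument-array footprint and fills one
counts array for a 4-point stencil. The function names
alltoallv_arg_bytes and fill_stencil_counts are hypothetical, and the
row-major rank layout on a periodic px x py grid is an assumption made
only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes needed for the four per-process argument arrays of an
 * Alltoallv-style call (send counts, send displacements, receive
 * counts, receive displacements), one entry per process each. */
size_t alltoallv_arg_bytes(size_t nprocs, size_t int_bytes)
{
    return 4 * nprocs * int_bytes;
}

/* Fill counts[] (length px*py) for one rank on a periodic px x py grid
 * (assumed row-major rank layout): `count` for each of the 4 stencil
 * neighbors, 0 for every other process.
 * Returns the number of non-zero entries. */
size_t fill_stencil_counts(int64_t *counts, int px, int py,
                           int rank, int64_t count)
{
    /* px, py > 2 guarantees that the four neighbors are distinct. */
    assert(px > 2 && py > 2 && rank >= 0 && rank < px * py);

    int x = rank % px;
    int y = rank / px;
    int nbr[4] = {
        y * px + (x + 1) % px,           /* east  */
        y * px + (x + px - 1) % px,      /* west  */
        ((y + 1) % py) * px + x,         /* south */
        ((y + py - 1) % py) * px + x     /* north */
    };

    for (int i = 0; i < px * py; i++)
        counts[i] = 0;
    for (int i = 0; i < 4; i++)
        counts[nbr[i]] = count;

    /* This scan over all entries is the work an implementation has to
     * do just to discover that almost nothing needs to be sent. */
    size_t nonzero = 0;
    for (int i = 0; i < px * py; i++)
        nonzero += (counts[i] != 0);
    return nonzero;
}
```

With nprocs = 128*1024 and 8-byte integers, alltoallv_arg_bytes gives
4*1024*1024 bytes, i.e. the 4MB mentioned above, while
fill_stencil_counts on a 512 x 256 grid reports only 4 non-zero entries
out of 131072 in a single array.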
> The topology associated with the communicator can be used to determine
> where data will be sent and from where data will be received inside a
> small wrapper around the alltoall functions. The two proposed
> neighbors collective are a perfect example on what kind of
> functionality a library developed on top of MPI can offer in a very
> portable manner across all MPI implementations (with a minimal
> overhead).
Yes, this addresses #5. Such a wrapper would be easy to write and would
serve as a simple, unoptimized implementation. I would probably use
MPI_Send/Recv rather than MPI_Alltoall[v,w] for it. However, a wrapper
cannot take advantage of network features or intelligent scheduling.
Best,
Torsten
--
bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ -----
"When there were no computers programming was no problem. When we had a
few weak computers, it became a mild problem. Now that we have gigantic
computers, programming is a gigantic problem." Edsger Wybe Dijkstra