Re: Plans for next meeting

(no name) (snir@watson.ibm.com)
Wed, 27 Sep 95 18:32:28 EDT

: -) :-) :-) *** (-: (-: (-:
Date: Wed, 27 Sep 1995 18:32:07 -0400
From: Marc Snir <snir@snir.watson.ibm.com>

Collective communication proposal

Consider the following problem: A 2D matrix is stored at one node, and one
has to scatter it onto a 2D grid, with a block,block distribution. If all
blocks have the same size in each dimension, then this can be acheived with
one call to MPI_SCATTERV. One the other hand, if not all blocks are of equal
size (e.g., if the last block in each dimension is smaller, because grid size
does not divide evenly matrix size), then one cannot achieve this
communication with one MPI call. The same holds true for many other similar
tasks. E.g., pretty few of the communications that are required to reshape or
redistribute HPF arrays map into one MPI communication, unless arrays are
nicely padded.

The reason this problem occurs is that collective communication calls have
been designed to be less powerful than an equivalent set of point-to-point
communications: In particular, a process that sends or receives multiple
messages (such as the root of a scatter) has to use the same datatype for all
these messages.

The solution is to have collective communication calls that are as powerful as
an equivalent set of point-to-point communications. Rather then a <sendbuff,
sendcount, sendtype> set of arguments, a scatter call will have a <sendbuf,
sendcounts, senddisps, sendtypes> set of arguments, where plural indicates an
array the size of the communication group. Similar changes will occur for
gather and alltoall functions.

Lets illustrate by code to solve the scatter problem. The code is incorrect
with high probability, but, hopefully, sufficiently close to the right code to
illustrate the new function.

PROCEDURE SCATTER_2D(A, B, N, COMM)
! collective call executed by all processes of the group of COMM to
! scatter array A, that is initially stored at the origin, onto
! the 2D grid
INTEGER N(2)
REAL A(N(1),N(2))
REAL, ALLOCATABLE B(*,*)
INTEGER COMM
! COMM is a 2D processor grid
! N(1) and N(2) are the dimensions of array A at the origin;
! they are equal to 1 at the other processes.

INTEGER MM(2,2), NN(2), NFIRST(I), NLAST(2), DIMS(2), PERIODS(2), COORDS(2)
INTEGER, ALLOCATABLE DISPS(*,*), TYPES(*,*), COUNTS(*,*)
INTEGER SUBMAT_TYPE(2,2)
CALL MPI_BCAST(N, 2, MPI_REAL, 0, COMM, IERR)
CALL MPI_CART_GET(COMM, 2, DIMS, PERIODS, COORDS)
DO I=1,2
MM(1,I) = (N(I)+DIMS(I)-1)/DIMS(I) ! size of full block
MM(2,I) = N(I) - NFIRST(I)*(DIMS(I)-1) ! size of last block
IF (COORDS(I) .EQ. DIMS(I) -1) THEN ! last process in array
NN(I) = NLAST(I)
ELSE ! other processes
NN(I) = NFIRST(I)
END IF
END DO
ALLOCATE(B(NN(1),NN(2))
IF (COORDS(1).EQ.0 .AND. COORDS(2).EQ.0) THEN ! root process
! build datatypes for 4 kinds of submatrices
DO I=1, 2
DO J=1, 2
CALL MPI_TYPE_VECTOR(MM(J,2), MM(I,1), N(1), MPI_REAL,
SUBMAT_TYPE(I,J), IERR)
CALL MPI_TYPE_COMMIT(SUBMAT_TYPE(I,J))
END DO
END DO
! Build array of types and array of displacements for all submatrices
ALLOCATE (DISPS(COORDS(1), COORDS(2)), TYPES(COORDS(1), COORDS(2))
DO I =1, COORDS(1)
DO J=1, COORDS(2)
DISPS(I,J) = 4*((I-1)*MM(1,1) + (J-1)*(1,2)*N)
TYPES(I,J) = SUBMAT_TYPE(1,1)
COUNTS(I,J) = 1
END DO
END DO
DO I =1, COORDS(1)
TYPES(I, COORDS(2)) = SUBMAT_TYPE(1,2)
END DO
DO J = 1, COORDS(2)
TYPES(COORDS(1), J) = SUBMAT_TYPE(2,1)
TYPES(COORDS(1), COORDS(2)) = SUBMAT_TYPE(2,2)
CALL MPI_NEW_SCATTER(A, COORDS(1)*COORDS(2), COUNTS, DISPS, TYPES,
B, NN(1)*NN(2), MPI_REAL, 0, COMM, IERR)
ELSE ! all other processes
CALL MPI-NEW-SCATTER(0,0,0,0,0,B, NN(1)*NN(2), MPI_REAL, 0, COMM, IERR)
END IF
RETURN

The proposal is to add five new collective communication functions (10 if also
nonblocking are supported) to cover these cases.

A less painful proposal is to add only two new functions, namely a generalized
all-to-all function and a generalized allgather function. Scatter and gather
are particular cases of all-to-all where only one process sends (receives)
messages with positive counts, and all processes send (receive) zero length
messages. Bcast is a particular case of allgather, where only on process
sends a nonzero message. (One could even do with only one new function).

-------------------

Marc Snir
IBM T.J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598
email: snir@watson.ibm.com
phone: 914-945-3204
fax: 914-945-4425