Re: Is non-blocking collective I/O desirable?

Terry R. Jones (trj@nimble.llnl.gov)
Wed, 19 Feb 97 17:06:21 PST

Rajesh writes:
>Regarding blocking and non-blocking collective I/O, I think the most
>important question to be answered is: when does a collective I/O routine
>return?
>

Do you mean when do the MPI_Ixxx_all routines return or when do the
MPI_Wait routines return? If you are talking about the former, please
refer to Q1 below. If the latter, please refer to Q2. If I have
misunderstood your concern, could you please describe it in more
detail?

>This question is extremely important from an implementor's point of view.
>Let's consider a two-phase implementation of a collective read. A subset
>of processors (master processors) can read the
>collective data set and then redistribute the data to clients (other
>processors) using message passing or shared memory operations. In this
>case, the masters will require the maximum time to read data (but will
>differ among masters). Should the clients wait for the masters to read
>the data (loosely synchronous collective read) or proceed ahead in the
>computation (non-synchronous collective read)?
>
>In cases where collective and non-collective I/O routines are
>interleaved, non-synchronous collective routines may cause data
>consistency problems (esp. interleaving writes with reads). The situation
>becomes even more serious when collective
>I/O routines are to be implemented in OS. The cost of inter-process (or
>NORMA) communication can significantly affect the per-processor time for
>collective read or write.
>

Are you referring to the meaning of atomic and non-atomic mode? If so,
please see Q3 below.

>I think if collective I/O is implemented as loosely synchronous
>collective I/O, then it might be easier to implement non-blocking collective
>I/O. (Note the difference between blocking/non-blocking collective I/O and
>synchronous/non-synchronous collective I/O.) Obviously, a loosely
>synchronous collective I/O routine will not cause data consistency
>problems. Therefore, collective and non-collective I/O routines can be
>interleaved. However, from the performance perspective, non-synchronous
>collective I/O is more appealing than the synchronous version.
>

There seem to be several issues here; I'll address them separately
according to my current understanding (please follow up if I
missed the question or if I misspeak):

Q1) Is it valid to make any assumptions on the status of a non-blocking
collective operation after the operation is initiated and before
it is completed?
A1) As the current interface is specified (the MPI_Ixxx routines and
the MPI_{TEST|WAIT}{ANY|SOME|ALL} routines), the application does
not have sufficient information to safely make assumptions on pending
non-blocking operations. I don't think this is a problem; it just
means that proper care must be taken when using non-blocking collective
operations. I feel that it would be erroneous to assume anything about
a non-blocking collective data access between the time that it is
initiated and the time it completes (as determined by MPI_Wait,
MPI_Test, or their variants).
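
As an analogy for A1, here is a minimal sketch in plain Python threads.
It is not MPI code, and start_read, wait, and demo are hypothetical names
chosen for this illustration. The point it tries to show is that the
request handle is the only thing the caller may rely on until the wait
returns:

```python
from concurrent.futures import ThreadPoolExecutor


def start_read(pool, source, buf):
    """Initiate a background copy of source into buf; return a request handle."""
    def work():
        # This copy may run at any moment before wait() returns.
        buf[:len(source)] = source
        return len(source)
    return pool.submit(work)


def wait(request):
    """Block until the operation behind the handle has completed."""
    return request.result()


def demo():
    buf = bytearray(4)
    with ThreadPoolExecutor(max_workers=1) as pool:
        req = start_read(pool, b"data", buf)
        # Between start_read() and wait(), buf may hold zeros, a partial
        # copy, or the full data. The caller must not assume any of these.
        count = wait(req)
    # Only after wait() has returned is it safe to inspect buf.
    return count, bytes(buf)
```

Calling demo() inspects the buffer only after wait() has returned, which
is the discipline the non-blocking collective routines would require.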

Q2) Once one participant process has completed its portion of a
non-blocking collective operation, can assumptions be made on the
remaining pending processes of that collective operation?
A2) One may say with certainty that all nodes with a lower nodeid have
their results committed to storage for the MPI_Iwrite_shared_ordered()
call. Likewise, all nodes with a lower nodeid have accessed their
results from storage on an MPI_Iread_shared_ordered() call. In the
case of a read, it would be erroneous to assume that the data is
actually in the recipients' buffers until the operation is said to
have completed via MPI_{TEST|WAIT}{ANY|SOME|ALL}.
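
The guarantee in A2 can be pictured with a toy model in plain Python (not
MPI). It simply serializes the writers in rank order, which is the
simplest schedule that satisfies the guarantee; a real implementation
could overlap much more. The function name ordered_shared_write is
hypothetical, and placing each rank's data consecutively in rank order is
my assumption for the sketch, not something stated above:

```python
def ordered_shared_write(chunks):
    """Serialize writes in rank order; chunks[r] is the data from rank r.

    Returns the final storage contents and, for each rank, the list of ranks
    whose data was already committed at the moment that rank completed.
    """
    storage = bytearray()
    committed = []
    committed_at_completion = {}
    for rank, chunk in enumerate(chunks):
        storage += chunk  # this rank's data is now committed to storage
        committed.append(rank)
        committed_at_completion[rank] = list(committed)
    return bytes(storage), committed_at_completion
```

In this model, when rank r completes, every rank below r already appears
in committed_at_completion[r]. Nothing is implied about ranks above r.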

Q3) What exactly does atomic mode do?
A3) Let's assume that processes A, B, and C, are writing to the same
place in the same file. The following description holds whether
processes A,B,C are all part of a collective operation; or if all
three processes are issuing independent operations; or if some of
the processes are members of a collective operation and some
of the processes are issuing independent operations. Atomic mode
specifies that all the bytes of the overlapped region will be from
either A, B, or C (i.e. there will NOT be fragments from multiple
processes in the overlapped region.) Atomic mode applies only to
the buffer of a single process; it says nothing about ordering for
collective or independent operations. If A, B, and C perform a
collective write to the same place in the same file with atomic mode
set to TRUE, the results in the overlapped region will either be:
entirely from A; entirely from B; or entirely from C. If atomic
mode is set to FALSE, implementations are permitted to have portions
of write buffers from multiple processes in the overlapped region.
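
To make A3 concrete, here is a small simulation in plain Python (not MPI)
of the contents of the overlapped region after A, B, and C have all
written to it. The function overlapping_write and its parameters are
hypothetical names for this sketch, and the per-byte mixing used for
non-atomic mode is only one outcome an implementation would be permitted
to produce:

```python
import random


def overlapping_write(buffers, atomic, rng):
    """Model the final contents of a region written by several processes.

    buffers maps a process name to the bytes it writes; all buffers must be
    the same length because they cover the same region of the file.
    """
    names = sorted(buffers)
    length = len(buffers[names[0]])
    if any(len(buffers[name]) != length for name in names):
        raise ValueError("all buffers must cover the same region")
    if atomic:
        # Atomic mode: the whole region comes from exactly one process.
        return buffers[rng.choice(names)]
    # Non-atomic mode: each byte may come from a different process.
    return bytes(buffers[rng.choice(names)][i] for i in range(length))
```

For example, with buffers {"A": b"AAAA", "B": b"BBBB", "C": b"CCCC"}, the
atomic case always returns one of the three buffers unchanged, while the
non-atomic case may return a mixture of bytes from A, B, and C. Neither
case says which process wins.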

-terry
-----------------------------------------------------------------------------
terry jones |
lawrence livermore lab | The proton absorbs a photon and emits
POB 808 MS: L-61 | two morons, a lepton, a boson, and
Livermore, CA 94550 | a boson's mate.
email: trj@llnl.gov |
voice: (510) 423-9834 | - exotic reaction discovered by Sean Malloy
fax: (510) 423-6961 |
-----------------------------------------------------------------------------