1. Last comment (I promise!) on the issue of "why offline polling is needed to
implement MPI".
Consider the following code:
if (myrank == 0)
    MPI_Send(&buff, count, MPI_BYTE, 1, 0, comm);
else if (myrank == 1)
    MPI_Irecv(&buff, count, MPI_BYTE, 0, 0, comm, &request);
This is a correct (albeit somewhat useless) MPI program: there is no
requirement to test or wait for each nonblocking call. The MPI standard says
quite explicitly that the blocking send call must complete, even if the
receiver never executes a wait or test call. Consider the case where the
send started after the matching receive, and where the message sent is too
large to be buffered. The only way the send can complete is for the receiving
node to start receiving the sent message, even if no MPI call executes on the
receiving node. Therefore, a correct implementation of MPI cannot postpone
reception of pending messages until an MPI call occurs at the receiving
processor. The mechanism for initiating the reception can vary, of course:
polling by an independent coprocessor, periodic polling by the main processor,
interrupt at message arrival, etc.
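
To make the argument concrete, here is a toy single-process model of the
scenario, offered only as a sketch. It is not MPI code: toy_channel,
toy_receiver_progress and toy_blocking_send are hypothetical names invented
for this illustration, and the flag receiver_makes_progress stands in for
whatever mechanism (coprocessor, periodic polling, interrupt) drives reception
on the receiving node. The receive is assumed to be posted before the send
starts, and the message is assumed too large to be buffered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: hypothetical names, not part of any MPI API. */
typedef struct {
    const char *src;       /* sender's buffer (message cannot be buffered) */
    size_t      len;       /* message length in bytes */
    char       *dst;       /* buffer given to the nonblocking receive */
    bool        posted;    /* the receive has been posted */
    bool        delivered; /* the data has reached dst */
} toy_channel;

/* One progress step on the receiving node: take in a pending message. */
static void toy_receiver_progress(toy_channel *ch)
{
    if (ch->posted && ch->src != NULL && !ch->delivered) {
        memcpy(ch->dst, ch->src, ch->len);
        ch->delivered = true;
    }
}

/* Blocking send with no intermediate buffering. It can complete only when
 * the receiving node takes the data. receiver_makes_progress models whether
 * that node drives reception without any further MPI call by the user.
 * Returns true if the send completed within max_polls attempts. */
static bool toy_blocking_send(toy_channel *ch, const char *buf, size_t len,
                              bool receiver_makes_progress, int max_polls)
{
    assert(ch->posted); /* scenario in the text: receive posted first */
    ch->src = buf;
    ch->len = len;
    for (int i = 0; i < max_polls && !ch->delivered; i++) {
        if (receiver_makes_progress)
            toy_receiver_progress(ch);
    }
    return ch->delivered;
}
```

With receiver_makes_progress set to true the send completes on the first
poll. With it set to false the send never completes, however long the sender
waits, which is the behaviour a correct implementation must rule out.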
2. ALLPUT/ALLGET.
The current proposal for put/get does not require a synchronization to occur
for each individual put/get. The put/get can be issued without specifying a
target request, in which case, there is no synchronization for that individual
operation. A global fence operation can be used to wait for the completion of
all pending put/get operations. Thus, for algorithms that work in phases, the
code will be:
repeat
    generate data to be communicated
    initiate all put/get needed for communication
    global fence
    use data from communication.
Some local computation is interleaved in this loop, so as to hide latency.
One can, of course, use double buffering for better latency tolerance. If one
thinks in terms of Valiant's BSP model, the global fence is the global clock
tick for moving from one phase to the next. I.e., the computation (within a
group) proceeds in phases: at each phase one can use data communicated via
put/get commands issued at the previous phase.
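
The phased structure can be sketched in plain C as follows. This is a
single-process simulation under stated assumptions, not the proposed put/get
interface: toy_window, toy_put, toy_fence and toy_run_phases are hypothetical
names, NRANKS simulated processes share one structure, and each process puts
one integer to its right neighbour per phase. The two arrays play the role of
the double buffer: puts land in pending, and the fence makes them visible for
the next phase.

```c
#include <assert.h>
#include <string.h>

#define NRANKS 4 /* number of simulated processes (assumption) */

/* Toy model only: hypothetical names, not the proposed put/get API. */
typedef struct {
    int visible[NRANKS]; /* data usable during the current phase */
    int pending[NRANKS]; /* targets of puts issued during the current phase */
} toy_window;

/* Put with no target request: no synchronization for this operation. */
static void toy_put(toy_window *w, int target, int value)
{
    assert(target >= 0 && target < NRANKS);
    w->pending[target] = value;
}

/* Global fence: every pending put completes and becomes visible. */
static void toy_fence(toy_window *w)
{
    memcpy(w->visible, w->pending, sizeof w->visible);
}

/* Each phase: generate data from what arrived in the previous phase,
 * issue the puts, then fence before the data is used. */
static void toy_run_phases(toy_window *w, int phases)
{
    for (int p = 0; p < phases; p++) {
        for (int rank = 0; rank < NRANKS; rank++) {
            int data = w->visible[rank] + 1;       /* generate data */
            toy_put(w, (rank + 1) % NRANKS, data); /* initiate put */
        }
        toy_fence(w); /* global fence: the "clock tick" */
    }
}
```

Starting from visible = {0, 10, 20, 30}, one phase yields {31, 1, 11, 21}:
each value has moved one step around the ring and been incremented once.
Nothing written by toy_put is observable until toy_fence has run.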
The question is:
Do we envisage algorithms that will use put/get, where a global, loosely
synchronous model of the type described above is not suitable, but where one
still wants to avoid the need to synchronize for each individual put/get
communication?
-------------------
Marc Snir
IBM T.J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598
email: snir@watson.ibm.com
phone: 914-945-3204
fax: 914-945-4425
------- End of Forwarded Message