Re: Progress rule vs. MPI_DELIVER

Rolf Rabenseifner (Rabenseifner@RUS.Uni-Stuttgart.DE)
Wed, 17 Apr 1996 13:29:32 +0100 (DST)

Karl Feind:
>> 2) Another alternative is to add an MPI_DELIVER function:
... on the target node (=memory node)

Eric Salo:
> MPI_Flush_remote_put_data()
... on the origin node (=execution node)
= MPI_FENCE ??????????????

After some discussion with Joel Clark (Intel) and Cray and
Dick Treumann (IBM) I think we have three problems.

----------------------------------------------------------------------
The current semantics of MPI_PUT are:

execution node(s)                       memory node

                                        application makes loads
                                        and stores with MEMORY

                                        -- CACHE PROBLEM 1 --

                                   /--- send a "synchronization"
receive the "synchronization" <---/

MPI_PUT (DATA) --------\
MPI_PUT (DATA)          \
MPI_PUT (DATA)           \
                          \------>      stored at any time
                                        into MEMORY

Synchronization:
case a) MPI_FENCE - - - - - - - - - -   now really stored
        plus
        send a "synchronization" --\
                                    \-> receive the "sync."
        or plus b), c) or d)

case b)                                 MPI_GET_COUNTER
                                        and poll if necessary

case c)                                 MPI_SET_COUNTER_THRESHOLD
                                        and MPI_TEST or
                                        MPI_WAIT

case d) MPI_BARRIER(newcomm)            MPI_BARRIER(newcomm)
End of synchronization

-- RELIABILITY PROBLEM --               -- RELIABILITY PROBLEM --

                                        -- CACHE PROBLEM 2 --

                                        load from the MEMORY
                                        to get the new DATA
----------------------------------------------------------------------

The problems are:

-- CACHE PROBLEM 1 --

If the application's stores are not written through the cache
directly into the MEMORY, then old data may "hang" in a cache.
If the new DATA from MPI_PUT is then written to that MEMORY,
the cache line holding the old data can be written back at a
later moment and overwrite the new DATA.

On systems with a write-through cache we can only hope that the
old data reaches the MEMORY before the new DATA arrives from
MPI_PUT.

Possible solution:
The application on the memory node must issue a call like
CALL MPI_RMA_WINDOW_READY(newcomm)
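
The overwrite described above can be sketched with a toy model in C.
This is not MPI code: the one-word "window", the one-line write-back
cache, and every name in it (node_t, app_store, window_ready,
remote_put, memory_after_put) are hypothetical, and window_ready only
stands in for what a call like MPI_RMA_WINDOW_READY might do under the
assumption that it forces dirty data back to MEMORY.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the memory node: one word of window MEMORY and a
 * one-line write-back cache. All names are hypothetical. */
typedef struct {
    int  memory;   /* the word inside the RMA window */
    int  cached;   /* cached copy of that word */
    bool dirty;    /* cached copy not yet written back */
} node_t;

/* Local store by the application: it only reaches the cache. */
static void app_store(node_t *n, int v) { n->cached = v; n->dirty = true; }

/* Write a dirty line back to MEMORY, as the cache does eventually. */
static void write_back(node_t *n) {
    if (n->dirty) { n->memory = n->cached; n->dirty = false; }
}

/* Stand-in (assumption) for MPI_RMA_WINDOW_READY: force the
 * write-back before any remote data may arrive. */
static void window_ready(node_t *n) { write_back(n); }

/* Remote put: bypasses the cache and writes MEMORY directly. */
static void remote_put(node_t *n, int v) { n->memory = v; }

/* Returns what is left in MEMORY after a local store of old_v,
 * a remote put of new_v, and a late write-back of the cache line. */
int memory_after_put(bool call_window_ready, int old_v, int new_v) {
    assert(old_v != new_v);            /* outcomes must be distinguishable */
    node_t n = { 0, 0, false };
    app_store(&n, old_v);
    if (call_window_ready) window_ready(&n);
    remote_put(&n, new_v);
    write_back(&n);                    /* late eviction of the old line */
    return n.memory;
}
```

In this model memory_after_put(false, 1, 2) leaves the old value 1 in
MEMORY, while memory_after_put(true, 1, 2) leaves the new value 2.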

-- CACHE PROBLEM 2 --

On systems without full cache coherency, a subsequent load by
the application on the memory node can return old data from
the cache instead of the new DATA from the MEMORY.

Because we have also allowed case a) for synchronizing (where,
as far as I can see, the memory node need not call any MPI RMA
routine), there is no real chance for the MPI implementation
to issue a cache-clear as part of the MPI library routines.

Possible solutions:

1. If we omit MPI_FENCE (i.e. case a)) then MPI can issue
the cache-clear as part of MPI_GET_COUNTER,
MPI_TEST(threshold_request), MPI_WAIT(threshold_request)
and MPI_BARRIER(newcomm).

2. The application on the memory node must issue a call like
MPI_DELIVER(newcomm)
(This is the proposal from Karl)
I think it has nothing to do with SEND/RECEIVE, because
MPI_DELIVER must be called after the data "is there".

This proposal changes the meaning of the counters:
the new counter value says only that the data is
"announced". Only after MPI_DELIVER is the data
really there and usable.

3. Solutions with signal handlers are, I think, something we
should not discuss in "high performance computing standardization".
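
The stale load and the effect of solution 2 can be sketched with a toy
model in C. It is not MPI code, and all names (node_t, app_load,
remote_put, deliver, load_after_put) are hypothetical; deliver stands
in for MPI_DELIVER under the assumption that it invalidates the cached
copy of the window.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the memory node without cache coherency.
 * All names are hypothetical. */
typedef struct {
    int  memory;   /* the word inside the RMA window */
    int  cached;   /* cached copy of that word */
    bool valid;    /* cache line currently holds a copy */
} node_t;

/* Application load: served from the cache whenever the line is valid. */
static int app_load(node_t *n) {
    if (!n->valid) { n->cached = n->memory; n->valid = true; }
    return n->cached;
}

/* Remote put: writes MEMORY, but nothing invalidates the cache. */
static void remote_put(node_t *n, int v) { n->memory = v; }

/* Stand-in (assumption) for MPI_DELIVER: drop the cached copy so the
 * next load goes to MEMORY. */
static void deliver(node_t *n) { n->valid = false; }

/* Returns what the application loads after the put has completed. */
int load_after_put(bool call_deliver, int old_v, int new_v) {
    assert(old_v != new_v);            /* outcomes must be distinguishable */
    node_t n = { old_v, 0, false };
    (void)app_load(&n);                /* old_v is now in the cache */
    remote_put(&n, new_v);
    /* Synchronization (cases a..d) would happen here; at this point
     * the data is only "announced". */
    if (call_deliver) deliver(&n);
    return app_load(&n);
}
```

In this model load_after_put(false, 1, 2) returns the old value 1, and
load_after_put(true, 1, 2) returns the new value 2.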

-- RELIABILITY PROBLEM --

In the paper
Todd Mummert, Corey Kosak, Peter Steenkiste and Allan Fisher.
Fine Grain Parallel Communication on General Purpose LANs.
School of Computer Science, Carnegie Mellon University,
Pittsburgh, PA 15213.
http://www.cs.cmu.edu/afs/cs/project/iwarp/archive/nectar-papers/96ics.p
"a simple adapter for ATM networks that supports efficient
remote memory writes, sometimes referred to as PUT operations"
is described.

The main difference from distributed and shared memory multiprocessors
is the lack of a reliable interconnect.
But such systems may give us better price/performance in the future.

Now the problem:
At the point marked in the picture above they need a
collective call to retransmit lost or corrupted data.
For best performance, the user data on the execution node
must not be overwritten until that moment (otherwise the
implementation would presumably have to keep its own copy
for retransmission).

A possible solution consists of several changes:
A) At MPI_RMA_INIT the application must decide which
   synchronization model it wants to use;
B) we additionally allow the following synchronization models:
   B1) The application synchronizes all processes with one
       of the methods of cases a)..d) and additionally
       calls MPI_DELIVER(newcomm) in all processes
       in newcomm, and MPI is allowed to implement
       MPI_DELIVER as a local or a collective routine.
   B2) Same as B1), but a sending process does not modify
       the data used in MPI_PUT until it has called
       MPI_DELIVER.
C) We recommend using
   case c) of the figure above (i.e. the
   synchronization at the memory node can be done locally
   without additional messages)
   in conjunction with B2) (i.e. it can be implemented
   efficiently on all possible systems!!!)
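
Why B2) matters on an unreliable interconnect can be sketched with a
toy model in C. It is not MPI code, and all names (link_t, lossy_put,
deliver, correct_after_deliver) are hypothetical; deliver stands in
for a collective MPI_DELIVER under the assumption that it retransmits
missing elements directly from the user buffer on the execution node.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define N 4

/* Toy model of one origin buffer, one target window and an unreliable
 * link between them. All names are hypothetical. */
typedef struct {
    int  origin[N];    /* user buffer on the execution node */
    int  target[N];    /* window MEMORY on the memory node */
    bool arrived[N];   /* which elements actually got through */
} link_t;

/* Put element i; `lost` models a dropped or corrupted packet. */
static void lossy_put(link_t *l, int i, bool lost) {
    assert(i >= 0 && i < N);
    if (!lost) { l->target[i] = l->origin[i]; l->arrived[i] = true; }
}

/* Stand-in (assumption) for a collective MPI_DELIVER: resend every
 * missing element from the CURRENT contents of the origin buffer. */
static void deliver(link_t *l) {
    for (int i = 0; i < N; i++)
        if (!l->arrived[i]) { l->target[i] = l->origin[i]; l->arrived[i] = true; }
}

/* Returns how many target elements hold the originally put data. */
int correct_after_deliver(bool modify_before_deliver) {
    const int sent[N] = { 10, 11, 12, 13 };
    link_t l;
    memcpy(l.origin, sent, sizeof sent);
    memset(l.target, 0, sizeof l.target);
    memset(l.arrived, 0, sizeof l.arrived);

    for (int i = 0; i < N; i++)
        lossy_put(&l, i, i == 2);                 /* element 2 is lost */
    if (modify_before_deliver) l.origin[2] = -1;  /* violates B2) */
    deliver(&l);

    int ok = 0;
    for (int i = 0; i < N; i++) ok += (l.target[i] == sent[i]);
    return ok;
}
```

In this model correct_after_deliver(false) returns 4, while
correct_after_deliver(true) returns 3, because the retransmission
picks up the modified user data.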

Does anyone see better solutions for these three problems?

And a last sentence about MPI_GET:
I think MPI_GET has different semantics.
It is like a synchronous remote procedure call with
input = 'address of the data' and output = 'the data itself'.
But, as with MPI_PUT, we have CACHE PROBLEM 1 here too,
and therefore we also need MPI_RMA_WINDOW_READY with MPI_GET.

Rolf


Rolf Rabenseifner (Computer Center )
Rechenzentrum Universitaet Stuttgart (University of Stuttgart)
Allmandring 30 Phone: ++49 711 6855530
D-70550 Stuttgart 80 FAX: ++49 711 6787626
Germany rabenseifner@rus.uni-stuttgart.de