(SORRY about sending it a few minutes ago to the wrong list (mpi-misc))
(therefore if you answer, then please to this mail.)
(The table at the end should be printed with a 124 columns printer)
I will show in this comparison, that the best will be a proposal
with Raja's post-wait interface,
and Marc's progress rules,
and a no_store flag in the argument list of post.
I tried to make it carefuly, and therefore it is not short, sorry.
The proposals are:
A) The main proposal voted in June.
B) Marc's post(any)/start(rank) & complete(rank)/wait(count)
from 16th Aug. with the additional mail from 17th Aug.
C) same as A), but post(any) /start(rank) is substituted by
post(rank)/start(count)
D) same as B), but both pairs post(rank)/start(count) and
complete(rank)/wait(count) are substituted by one pair
post(rank)/wait(count), that is used for both functionalities.
This is the proposal from Raja, but with Marc's progress rules.
E) Dave's put/get/offer proposal.
For comparing I use the examples I, II, II and 1a, 1b, 2, 3 and 4
as described below.
To decide, which proposal is the best, we can look at several topics,
negativ ones are marked on the left side with "-", those that can be
an exclusion argument are marked with "--".
Possibly we have a chance to get a consensus before the meeting,
which one is really the best, then we have a real chance to get
it voted in the next two meetings.
1. Fulfilling applications' synchronization need
- A) Lack of syncronization functionality in exa.1a and 4 must
be solved by usage of send/recv and bsend/recv.
Exa.2 must be programmed with nonblocking IPUT
- B) does not solve exa. III above, but this is possibly
minor aspect
C) does solve all I-III because of its matching rule
D) does solve all I-III because of its matching rule
E) seems to solve all cases (Dave, can you decide please)
2. Doing synchronization with minimal overhead
a) inside the application
B) inside one iteration step normally
- only one post
- but several start.
Because technically it is necessary
to have "public" and "privat" epochs, this proposal
does not add additional idle time to the application.
(but this needs nonblocking version of start).
C,D) inside one iteration step normally
- several post
- but only one start.
- This adds unnecessary idle time to the application because
the RMA can only start when all windows are posted.
Without ISTART in B) this is comparable to B).
- B,C,D) additional complete/wait or post/wait in exa. 4.
b) inside the shared memory implementation
- A) multiple PUTs must all increment the counter
B) post/start must be implemented by an bidirectional RPC,
i.e. a fetch of the 1-bit "posted" flag (atomicity is
no problem.) In the case of shared memory, this is
efficient.
-- In the case of virtual shared memory this is not
efficient.
(Preliminary answer -- Marc's next version will show
how to implement his model)
C,D) each process has a counter which must be atomically
increased from remote by 1 and on which is locally
waited until it has reached count and then it is
atomically reduced by count
(P&V operations -- post = V(1), start = P(count))
c) inside the distributed memory implementation
A) can be piggybacked
B) can be piggybacked, except start(mpi_strong),
but there a piggybacked solution is not expected.
C,D) can be piggybacked on the communication needed
for PUT and GET
3. The necessary cache coherence operations can be included:
A-D) Yes.
-- E) there are major problems because of the lack of an
epoch based model
4. Only necessary cache coherence operations should be done:
A) is the optimum in this topic.
B, C) are here identical but to achive the optimum
MPI must define hints and the user must use them.
It is necessary to add a no_stores_were_done hint
to the post operation. Then it can be identical to A).
D) in more cases than B) and C) hints are necessary.
It is necessary to add a no_stores_were_done hint
to the post operation. Then it can be identical to A).
E) fixing the problem mentioned in 3., it seems that
also unnecessary IN/OUT operations will be done.
5. Simplicity for the user:
A) not simple because he/she must learn about caching,
Although it results in short user codes.
B,C) Not simple due 4 functions for simple post/wait
synchronization.
D) Small interface. Example 2 (probably the most important
for FE-codes) shows, that A, D and E results in the
shortest application codes.
E) Smallest interface.
Evaluation:
3.E eliminates proposal E.
2.a C,D is less worse than 2.b B.
In combination with 1. B and 5. B one can say C and D are better than B.
2.a BCD can be accepted because the GET before is a bidirectional
operation.
Due to 5.A, C, D, one can say D is better than C and A.
D seems to be the best, presupposing that the no_stores_were_done
hint is added to the post operation.
Additional important arguments can leed to another evaluation.
----------------------------------------------------------------
The examples I, II and III are used to show the limitations of
proposal B).
I and II can be programmed with B), but III not.
But III can be programmed with the other proposals A, C, D.
With E too -- I cannot decide?
- I: 2-party synchronization
- II: switching the put-party from r.0 to 2 by a mes. from 1 to 2
- III: switching the put-party from r.0 to 2 by a mes. from 0 to 2
^^^
Here the programming sceme with proposal B:
---- I ------ -------- II -------- -------- III -------
rank_0 rank_1 rank_0 rank_1 rank_2 rank_0 rank_1 rank_2
post post post
start start start
put put put
complete complete complete
start wait send- wait
: wait send--> \------->
: load load recv load recv
: post start start
return post : post :
put return return
complete put put
wait complete complete
load wait wait
load load
This third case need not be synchronized in that way, i.e. the
"start" in rank_2 can also match with the first "post" in rank_1:
rank_0 rank_1 rank_2
post
start
put
complete
send--------->recv
start
wait put
load complete
post
wait
load
--------------------------------------------------------------------
The folowing examples are those used in the July meeting.
They should be simply programmable and efficiently executable
with each proposal.
[Exa. 1: put/load] A producer loops doing put operations in
the target window. The consumer loops reading the data locally.
The producer code is: put/put; post; wait;
The consumer code is: wait; load/load; post;
Unnecessary MPI_WINDOW_OUT operations may occur inside the
MPI_WINDOW_POST calls, and an extra MPI_WINDOW_IN
operation inside the producer's MPI_WINDOW_WAIT call.
By setting the only_put_load or the no_stores hint,
both processes can get rid of the MPI_WINDOW_OUT operations.
A high quality implementation can optimize this by detecting the lack of
local stores.
An optimized implementation would be able to detect that no
RMA put or accumulate operations were done on a window and
bypass the unnecesary MPI_WINDOW_IN.
Alternatively, the last post/wait pair may be replaced by a
send/recv pair since no RMA operations take place and any
synchronization would do.
[Exa. 2: put double-buffering] Each process splits its window
and its local buffers
in two halves and take turn computing on the data in one half
and pushing it into the other half.
The local buffers are named la and lb, the target windows wa and wb.
The example is given for two processes.
Process i's code is: la=compute(wb); put(la,1-i,wa); post(1-i); wait(1);
lb=compute(wa); put(lb,1-i,wb); post(1-i); wait(1);
Unnecessary MPI_WINDOW_OUT operations may occur inside the
MPI_WINDOW_POST calls.
By setting the only_put_load or the no_stores hint,
both processes can get rid of the MPI_WINDOW_OUT operations.
A high quality implementation can optimize this by detecting the lack of
local stores.
[Exa. 3: store/get] A producer loops storing data locally.
The consumer loops doing get operations from the target window.
The producer code is: store/store; post; wait;
The consumer code is: wait; get/get; post;
Unnecessary MPI_WINDOW_IN operations may occur inside the
MPI_WINDOW_WAIT calls, and an extra MPI_WINDOW_OUT
operation inside the consumer's MPI_WINDOW_POST call.
An optimized implementation would be able to detect that no RMA
put or accumulate operations were done and bypass the extra
MPI_WINDOW_IN operations.
By setting the only_store_get hint, the extra
MPI_WINDOW_IN operations can be bypassed without any
additional internal synchronization effort.
By setting the no_stores hint, the consumer can get
rid of the MPI_WINDOW_OUT operation.
[Exa. 4: get double-buffering] Each process splits its window
and its local buffer
in two halves and take turn computing on the data in one half
and pulling it from the other half.
The local buffers are named la and lb, the target windows wa and wb.
The example is given for two processes.
Process i's code is: wa=compute(lb); post(1-i); wait(1); get(la,1-i,wa); post(1-i); wait(1);
wb=compute(la); post(1-i); wait(1); get(lb,1-i,wb); post(1-i); wait(1);
Unnecessary MPI_WINDOW_IN operations may occur inside the
MPI_WINDOW_WAIT calls.
An optimized implementation would be able to detect that no RMA
put or accumulate operations were done and bypass the extra
MPI_WINDOW_IN operations.
By setting the only_store_get hint, the extra
MPI_WINDOW_IN operations can be bypassed without any
additional internal synchronization effort.
Unnecessary MPI_WINDOW_OUT operations may occur inside the
second and forth MPI_WINDOW_POST.
A high quality implementation
would optimize this by detecting the lack of remote stores.
___________________|____________________|____________________|____________________|____________________|____________________
| A | B | C | D | E
| | | | |
Generic description| expl.in/out&counter|post strt compl.wait|with post(rank) | post & wait (Raja) | put / get / offer
| |Marc, 16-17 Aug.96 | start(count) | Raja, mod. by Rolf,|David DiNucci
| | | | and Marc's progress|
___________________|____________________|____________________|____________________|____________________|____________________
| | | | |
Exa. 1a | | | | |
| | | | |
loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{
put | iput(INCR) | put | put | put | put(.t.
put | iput(incr) | put | put | put | put(.f.
post --> wait | \--> consu(2)| compl wait | compl wait | post wait | \--> offer(1)
| | | | | | | | |OUT *) | | \ |
| | |post --> |wait | |post --> |wait | |post --> |wait | \-> |wait
in | in | |in | |in | |in | |in
load | load | load | load | load | load
load | load | load | load | load | load
wait <-- post | recv <-- send | strt_weak post | start(1) post | wait post | recv <-- send
| | | |OUT *)| | |OUT *)| | |OUT *)|
| | |read --> |return| |wait <-- |post | |wait <-- |post |
| | |_flag <-- |_flag | | |IN #)+) |
} } |} } |} } |} } |} } |} }
___________________|____________________|____________________|____________________|____________________|____________________
| | | | |
Exa. 1b | | | | |
| | | | |
loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{
put | iput(INCR) | put | put | put | put(.t.
put | iput(incr) | put | put | put | put(.f.
post --> wait | \--> consu(2)| compl wait | compl wait | post wait | \--> offer(1)
| | | | | | | | |OUT *) | | \ |
| | |post --> |wait | |post --> |wait | |post --> |wait | \-> |wait
in | in | |in | |in | |in | |in %)
load | load | load | load | load | load
load | load | load | load | load | load
any <-- any | any <-- any | any s. <-- any s. | any s. <-- any s. | any <-- any | recv <-- send
sync. sync. | sync. sync. | strt_ post | start(0) post(non| sync. sync. |
| | _nocheck |OUT *]| | |OUT *]| |
} } |} } |} } |} } |} } |} }
___________________|____________________|____________________|____________________|____________________|____________________
| | | | |
Exa. 2 | | | | |
| | | | |
loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{
load(wb) load wb| load(wb) load(wb)| load(wb) load(wb)| load(wb) load(wb)| load(wb) load(wb)| load(wb) load(wb)
la=comp() la=comp| la=comp() la=comp | la=comp() la=comp | la=comp() la=comp | la=comp() la=comp | la=comp() la=comp
| | post(wb) post(wb)| post(none) post( )| | ioffer ioffer
| | |OUT *] |OUT *]| |OUT *] |OUT *]| | (wb,req_a) (wb,r_b)
| | strt_ strt_ | start(0) start(0)| |
| | _nochck(wa) _nochck| | |
put(l/wa) put(wa)| iput(INCR) iput(IC)| put(l/wa) put(wa) | put(l/wa) put(wa) | put(l/wa) put(wa) | put(l/wa) put(wa)
put(l/wa) put(wa)| iput(incr) iput(ic)| put(l/wa) put(wa) | put(l/wa) put(wa) | put(l/wa) put(wa) | put(l/wa) put(wa)
post -X- post | X | compl(wa) compl(wa) compl(wa) compl(wa) post post | wait wait
| | | | | | | | | | |OUT *) |OUT *)| (req_a) (req_a)
| | | | |post -X- |post | |post -X- |post | |post -X- |post | | |
wait < > wait | consu(2)< >consu(2)| wait | wait | wait | wait | wait | wait | | |
| | |wait < > |wait | |wait < > |wait | |wait < > |wait | |wait |wait
in in | in in | |in |in | |in |in | |in |in | |in |in %)
| waitall waitall | | | |
| (rq_lb) (rq_lb)| | | |
| | | | |
and the same with| and the same with | and the same with | and the same with | and the same with | and the same with
lb/wb/wa instead | lb/wb/wa instead | lb/wb/wa instead | lb/wb/wa instead | lb/wb/wa instead | lb/wb/wa instead
of la/wa/wb | of la/wa/wb | of la/wa/wb | of la/wa/wb | of la/wa/wb | of la/wa/wb
} } |} } |} } |} } |} } |} }
___________________|____________________|____________________|____________________|____________________|____________________
| | | | |
Exa. 3 | | | | |
| | | | |
loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{
store | store | store | store | store | store
store | store | store | store | store | store
out | out | | | |
wait <-- post | recv <-- send | strt_weak post | start(1) post | wait post | offer(1)
| | | |out | | |out | | |out | get(.t. --> |addr
| | |read --> |return| |wait <-- |post | |wait <-- |post | |out ^)
| | |_flag <-- |_flag | | | |IN #)+) | <-- |value
get | get(INCR=1) | get | get | get | get(.f. --> |addr
get | get(incr=1) | get | get | get | <-- |value
post --> wait | \-> consu(2)| compl wait | compl wait | post wait |
| | | | | | | | |OUT $) | |
| | |post --> |wait | |post --> |wait | |post --> |wait |
| | |IN#)+)| |IN#)+)| | |IN#)+)|
} } |} } |} } |} } |} } |} }
___________________|____________________|____________________|____________________|____________________|____________________
| | | | |
Exa. 4 | | | | |
| | | | |
loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{ |loop{ loop{
comp(lb) comp lb| comp(lb) comp(lb)| comp(lb) comp(lb)| comp(lb) comp(lb)| comp(lb) comp(lb)| comp(lb) comp(lb)
store(wa) store | store(wa) store(wa) store(wa) store(wa) store(wa) store(wa) store(wa) store(wa) store(wa) store(wa
out out | out out | | | |
post -X- post | bsend -X- bsend | post(wa) post(wa)| post(wa) post(wa)| post(wa) post(wa)| ioffer( ioffer(
| | | | |out |out | |out |out | |out |out | 1,req_a) 1,req_a)
| | | | start_ start_ | |post -X- |post | |post -X- |post |
wait < > wait | recv < > recv | _weak(wa) _weak(wa) start(1) | start(1)| wait(wa) | wait(wa)|
| | |read_ -X- |read_ | |wait < > |wait | |wait < > |wait |
| | |_flag < > |_flag | | |IN #)+) |IN#)+)|
| | |return-X- |return| | |
| | |_flag < > |_flag | | |
get(l/wa) get(wa)| get(la/wa, get(wa) | get(la/wa) get(wa) | get(la/wa) get(wa) | get(la/wa) get(wa) | get(la/wa) get(wa)
| incr=0) | compl(wa) compl(wa) compl(wa) compl(wa) post(wa) post(wa)| wait( wait(
| | | | | | | | |OUT @) |OUT @)| req_a) req_a)
| | |POST -X- |POST | |POST -X- |POST | |POST -X- |POST | |addr |addr
| | wait(wb) wait(wb)| wait(wb) wait(wb)| wait(wb) wait(wb)| |out |out ^)
| | |WAIT <X> |WAIT | |WAIT <X> |WAIT | |WAIT <X> |WAIT | |value |value
| | |IN #)+) |IN#)+)| |IN #)+) |IN#)+)| |IN #)+) |IN#)+)|
| | | | |
and the same with| and the same with | and the same with | and the same with | and the same with | and the same with
la/wb/lb instead | la/wb/lb instead | la/wb/lb instead | la/wb/lb instead | la/wb/lb instead | la/wb/lb instead
of lb/wa/la | of lb/wa/la | of lb/wa/la | of lb/wa/la | of lb/wa/la | of lb/wa/la
} } |} } |} } |} } |} } |} }
___________________|____________________|____________________|____________________|____________________|____________________
| | | | |
Summary of | A | B | C | D | E
unnecessary IN/OUT | | | | |
| | | | |
Exa. 1a | | |OUT *)| |OUT *)| |OUT *) |OUT *)|
| | | | |IN #)+) |
| | | | |
Exa. 1b | | |OUT *]| |OUT *]| |OUT *) |
| | | | |
Exa. 2 | | |OUT *] |OUT *]| |OUT *] |OUT *]| |OUT *) |OUT *)|
| | | | |
Exa. 3 | | |IN#)+)| |IN#)+)| |IN #)+) |IN#)+)|
| | | | |OUT $) |
Exa. 4 | | | | |
| | |IN #)+) |IN#)+)| |IN #)+) |IN#)+)| |IN #)+) |IN#)+)|
| | | | |OUT @) |OUT @)|
| | | | |IN #)+) |IN#)+)|
___________________|____________________|____________________|____________________|____________________|____________________
| | | | |
Footnotes | |*) can be eliminated|*) can be eliminated|*) can be eliminated|%) "offer" performs
| | by only_put_load | by only_put_load | by only_put_load | "in" because it
| | or no_store hint | or no_store hint | or no_store hint | has received a
| |#) can be eliminated|#) can be eliminated|#) can be eliminated| flag that it
| | by only_store_get| by only_store_get| by only_store_get| matches a "put"
| | hint | hint | hint |^) "offer" waits
| | | |$) can be eliminated| and sees the
| | | | by no_stores hint| "get" request
| |+) can be eliminated|+) can be eliminated|+) can be eliminated| and therefore
| | by optimization | by optimization | by optimization | performs an "out"
| | detecting lack of| detecting lack of| detecting lack of|
| | put on remote | put on remote | put on remote |
| | process | process | process |
| |*] same as *) but |*] same as *) but |@) can be eliminated|
| | also one can | also one can | by optimization |
| | allow to omit | allow to omit | detecting lack of|
| | the post/start | the post/start | local stores |
___________________|____________________|____________________|____________________|____________________|____________________
| | | | |
Lacks |- additional incre- | - additional COMPL/| - additional COMPL/| - additional POST/ | - needs on a shared
| menting in put | WAIT in exa. 4 | WAIT in exa. 4 | WAIT in exa. 4 | memory system a
| sequences | | handler to per-
|- iput and waitall | This needs additional remote access, but the application | form "window_out"
| instead of put | is not blocked if different communicators are used for |
| necessary (exa. 2)| wa and wb. |
|- mix of one- and | | | |
| two-sided in exa.4| | | |
Legend:
upper case letters : unnecessary operations
oper_a : oper_a is implemented
|sub_1 by doing sub_1
|sub_2 and sub_2
Rolf Rabenseifner (Computer Center )
Rechenzentrum Universitaet Stuttgart (University of Stuttgart)
Allmandring 30 Phone: ++49 711 6855530
D-70550 Stuttgart 80 FAX: ++49 711 6787626
Germany rabenseifner@rus.uni-stuttgart.de