# Re: New version of the chapter

Rolf Rabenseifner (Rabenseifner@RUS.Uni-Stuttgart.DE)
Fri, 18 Oct 1996 12:12:59 +0200 (DST)

Marc,

sorry that I did not find time earlier to read your last draft.

Here are my error corrections and proposals for minor clarifications.
You can probably incorporate them by noon today for Steve's deadline.

Error and typo corrections:
---------------------------

Indices are wrong in Example 5.2 on pages 9-11

10:21f suggest that the indices of map are i = k + j*m with
i = 0..p*m-1, k = 0..m-1 and j = 0..p-1,
but 10:7 requires j = 1..p

Corrections if you want to use Fortran-style indices
(this list is not complete):

10:6 j = (map(i)-1)/m + 1
10:21 j = (map(i)-1)/m + 1
10:22 k = MOD(map(i)-1,m) + 1

And a clarification after 10:3 or 9:25

! The indices of map are i = k + (j-1)*m with
! j = 1..p, the number of the MPI process that holds this part of B,
! k = 1..m, index in B inside of that MPI process.
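
The Fortran-style index arithmetic above can be checked with a small round trip. The following is a minimal sketch in C, not code from the draft: the function names `global_index`, `owner_of` and `local_index` are hypothetical, and the argument `i` stands for a value of map(i), i.e. a 1-based global index.

```c
#include <assert.h>

/* 1-based global index i = k + (j-1)*m, with
   j = 1..p (process holding this part of B) and
   k = 1..m (index in B inside that process). */
int global_index(int j, int k, int m) {
    assert(m >= 1 && j >= 1 && k >= 1 && k <= m);
    return k + (j - 1) * m;
}

/* Inverse for the process number: j = (i-1)/m + 1 (integer division). */
int owner_of(int i, int m) {
    assert(m >= 1 && i >= 1);
    return (i - 1) / m + 1;
}

/* Inverse for the local index: k = MOD(i-1, m) + 1. */
int local_index(int i, int m) {
    assert(m >= 1 && i >= 1);
    return (i - 1) % m + 1;
}
```

With m = 4, the last element of process 3 has global index 12, and both inverse formulas map 12 back to j = 3 and k = 4. Without the "-1" and "+1" terms, index 12 would be assigned to a fourth process.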

21:18-27 the "mark"-algorithm with "only the group size" is wrong.
It must be supplemented or replaced by an algorithm that takes into
account which posts match which starts.
Because this matching cannot be done by numbering the posts
and starts, I think it is best to delete this algorithm:

21:19-20 delete ", or just the group size (the later....checking)"
21:26-27 delete "; decrease the counter, if a counter isused"

Proof -- the error in your simple algorithms can be shown by the
following example:

rank 0            rank 1            rank 2
                  post(0+2)
start(1)          wait              start(1)
put(1)            :(blocked)        |
complete          :                 |some
start(1)   >|     :                 |computation
put(1)     >|     :                 |
complete   >|     :                 put(1)
            |     :                 complete
            v     (return)
this must be      post(0)
delayed after
the second post!!!
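
The failure in the example above can be replayed in a few lines of C. This is only a sketch under assumptions: the function names are hypothetical, the complete notifications are modelled as an array of origin ranks in the order they arrive at rank 1, and the per-origin variant merely illustrates tracking which origin has completed. It is not the algorithm from the draft.

```c
#include <assert.h>
#include <stdbool.h>

#define NPROCS 3  /* ranks 0..2, as in the example above */

/* Counter-only variant: wait returns once `group_size` complete
   notifications have arrived, no matter who sent them.
   Returns how many notifications were consumed when wait returns,
   or -1 if wait would still be blocked. */
int wait_with_counter(const int *arrivals, int n, int group_size) {
    (void)arrivals;               /* the sender is never inspected */
    int counter = group_size;
    for (int e = 0; e < n; e++) {
        counter--;
        if (counter == 0) return e + 1;
    }
    return -1;
}

/* Per-origin variant (assumption for illustration): one pending mark
   per origin in the post group. A complete from an origin whose mark is
   already cleared belongs to a later epoch and is not counted. */
int wait_with_marks(const int *arrivals, int n, const bool *in_group) {
    bool pending[NPROCS];
    int remaining = 0;
    for (int r = 0; r < NPROCS; r++) {
        pending[r] = in_group[r];
        if (in_group[r]) remaining++;
    }
    if (remaining == 0) return 0;
    for (int e = 0; e < n; e++) {
        int origin = arrivals[e];
        assert(origin >= 0 && origin < NPROCS);
        if (pending[origin]) {
            pending[origin] = false;
            if (--remaining == 0) return e + 1;
        }
        /* otherwise: notification is left for the next epoch */
    }
    return -1;
}
```

With arrivals {0, 0, 2} and the post group {0, 2}, the counter variant lets wait return after the second notification, before rank 2 has completed. The per-origin variant returns only after the third.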

23:44 "C users" instead of "Users"

24:5 delete the linefeed or ...
24:29 "<new definition> MPI\_NOPUT" instead of "noput"

25:7 "public window copy" instead of "window copy"
25:8 the same correction
25:11 "private window copy" instead of "process memory"

25:25 delete "in process memory"
because in the implementation advice you show that the
public copy is in the memory while the private copy is in the
cache.

26:7-9 The text can be misread to mean that, once
the outcome of a put/accumulate is in the public copy,
an additional wait/barrier/lock is needed to get
it into the private copy.

What is correct is that wait and barrier complete a put or accumulate
in both the public and the private copy.
A lock does this only if it is called by the window owner.

Therefore I suggest dropping item 6, moving its content
into items 2 and 3, and splitting item 4.

Then the corrections are in detail:

25:40 "(in the public and private copy of the window)"
instead of "(in the public window copy)"

25:43 "(in the public and private copy of the window)"
instead of "(in the public window copy)"

25:47 do not use parentheses for "in the public window copy"

After 26:2 add a new item:

"x. If an operation is completed at the origin by a call
to \mpifunc{MPI\_WIN\_UNLOCK} then the operation is
completed at the target in the private window copy by a
subsequent call to \mpifunc{MPI\_WIN\_LOCK} on the
window by the target process (window owner), or by
a call to \mpifunc{MPI\_WIN\_BARRIER} on the window
by the target process."

26:17 last word: "public window" instead of only "window"

27:10 Oh no, it is not good to allow MPI_WIN_BARRIER to break
into lock-unlock or post-start-complete-wait cycles
(if we really want this, then we must write a lot of exceptions
into the definitions of post, start, complete, wait, lock, and
unlock).
Therefore I believe we should define:

Changing window or synchronization mode:
Before changing, the mode currently in use on a window must be
finished.
With post-start-complete-wait, it is finished at the origin by
a call to \mpifunc{MPI\_WIN\_COMPLETE} and at the target process
by a call to \mpifunc{MPI\_WIN\_WAIT}.
With lock-unlock or barrier synchronization, it is finished by
a call to \mpifunc{MPI\_WIN\_BARRIER}.
After a synchronization mode is finished, it is possible to use
another mode or to use another window that overlaps with the
previous window.
There are no restrictions on using one-sided communication on
different windows that do not overlap.
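
The proposed rule can be read as a small decision table. Here is a minimal sketch in C, assuming hypothetical enum and function names that only mirror the wording above. It does not call any MPI function and is not part of the draft.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; they only mirror the rule proposed above. */
typedef enum { MODE_NONE, MODE_PSCW, MODE_LOCK, MODE_BARRIER } sync_mode;
typedef enum { EV_COMPLETE, EV_WAIT, EV_UNLOCK, EV_BARRIER } sync_event;

/* True if `ev` finishes synchronization mode `mode`:
   post-start-complete-wait is finished by complete (origin) or
   wait (target); lock-unlock and barrier synchronization are
   finished by a barrier call. */
bool finishes_mode(sync_mode mode, sync_event ev) {
    switch (mode) {
    case MODE_PSCW:
        return ev == EV_COMPLETE || ev == EV_WAIT;
    case MODE_LOCK:
    case MODE_BARRIER:
        return ev == EV_BARRIER;
    default:
        return false;
    }
}

/* A window may change to another mode only if no mode is active
   or the last event finished the active one. */
bool may_change_mode(sync_mode active, sync_event last_event) {
    return active == MODE_NONE || finishes_mode(active, last_event);
}
```

Under this reading, an unlock alone does not allow a mode change, and a barrier call does not end a post-start-complete-wait cycle.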

27:40 "18" instead of "19" (have you put the \label on the right line?)

28:20-21 "the code may deadlock, as each process may block on start..."
instead of "the code will deadlock, as each process blocks on start..."

(Reason: the implementation decides whether strong or weak
post-start is used; with weak post-start there is no deadlock)

31:39-42 I believe that this part is wrong, because it is necessary
that after the WINDOW_IN the data is really read from memory
instead of from the cache. I believe that this is the
functionality "cache invalidate".
And "cache flush" only means that the content of the cache
is written back to the memory (if some locations are changed).
This makes sense only on systems with write-back noncoherent
cache.
I do not know whether there are systems with write-back
noncoherent cache and a function "cache flush & invalidate",
or whether there exists a system where "cache flush&invalidate"
is faster than "cache invalidate".

Because we really need the "cache invalidate" functionality
and because faster-"cache flush&invalidate"-systems are
probably very rare, I propose to delete the two sentences
31:39-42 "On some systems...(... lines.)"

If you do not want to delete them, then please correct:
31:40 the same correction
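
The difference between the two cache operations can be shown with a toy model of a single location behind a write-back, noncoherent cache. This is a sketch under assumptions: the struct and function names are hypothetical, and real cache hardware is far more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one location behind a write-back, noncoherent cache. */
typedef struct {
    int  memory;  /* value in main memory, where RMA deposits data */
    int  cached;  /* value held in the local cache line */
    bool valid;   /* the cache line holds a copy */
    bool dirty;   /* the cached copy was modified locally */
} location;

/* Local load: served from the cache whenever the line is valid. */
int local_load(location *loc) {
    if (!loc->valid) {
        loc->cached = loc->memory;
        loc->valid = true;
        loc->dirty = false;
    }
    return loc->cached;
}

/* "cache flush": write back a dirty line; the line stays valid. */
void cache_flush(location *loc) {
    if (loc->valid && loc->dirty) {
        loc->memory = loc->cached;
        loc->dirty = false;
    }
}

/* "cache invalidate": drop the line so the next load reads memory. */
void cache_invalidate(location *loc) {
    loc->valid = false;
    loc->dirty = false;
}

/* Remote put: updates memory only; the cache is not informed. */
void remote_put(location *loc, int value) {
    loc->memory = value;
}
```

In this model, a load after a remote put still returns the old cached value even if a flush is issued first. Only an invalidate makes the next load fetch the new value from memory.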

32:9-33 This must be typed as an "Advice to users".

32:8-33 You agreed that this Advice to users should come before
section 5.5.3 "Implementation model", which is an advice to
implementors.

Then you should also correct:

34:20/21 "win0" instead of "win1" (and no unnecessary indent)
34:38 "if(!converged(A0,A1))" before the Win_post
and indent of MPI_Win_post

Clarifications:
---------------

It is also allowed that put and complete do not block, provided the
implementation guarantees that the put completes at the origin
after the complete call and that its outcome at the target
is delayed until it is posted.

16:19 The "load" on the left side should be replaced by
"some computation"

26:46 I propose to add the following clarification to the
post-start-complete-wait paragraph:

"With the post-start synchronization the target process tells
the origin process that its window is now ready for RMA,
and with the complete-wait synchronization the origin process
tells the target process that it has finished its RMA."

28:32 "put request (using a call to \mpifunc{MPI\_ISEND})"
instead of "put request"

31:15 I propose that you add some text again for the window-boundary &
cache-line-boundary problem, although it exists only with
write-back noncoherent caches.

And a similar problem arises on systems with write-back
noncoherent caches (see below) if the window boundary is
not cacheline aligned.
Then local updates outside the window can initiate a cache
flush overwriting the outcome of a concurrent RMA to the
partial cacheline inside of the window.
This conflict is avoided if two copies of the locations
of that partial cacheline are maintained -- the original
memory used for the private window copy and an additional
memory space with a separate cacheline for the public copy.
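
The partial-cacheline conflict described above can be reproduced with a toy model. This is a sketch under assumptions: the line holds four words, words 0..1 lie outside the window and words 2..3 inside it, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LINE_WORDS 4

/* Toy model of one cache line whose second half belongs to the window
   (the window start is not cacheline aligned). */
typedef struct {
    int  memory[LINE_WORDS];
    int  cached[LINE_WORDS];
    bool dirty;
} cache_line;

/* Load the whole line into the cache. */
void line_fill(cache_line *cl) {
    memcpy(cl->cached, cl->memory, sizeof cl->memory);
    cl->dirty = false;
}

/* Local store: only the cached copy changes (write-back cache). */
void local_store(cache_line *cl, int word, int value) {
    assert(word >= 0 && word < LINE_WORDS);
    cl->cached[word] = value;
    cl->dirty = true;
}

/* Concurrent RMA: deposits directly into memory. */
void rma_put(cache_line *cl, int word, int value) {
    assert(word >= 0 && word < LINE_WORDS);
    cl->memory[word] = value;
}

/* Write-back flushes the entire line, including untouched words. */
void line_flush(cache_line *cl) {
    if (cl->dirty) {
        memcpy(cl->memory, cl->cached, sizeof cl->memory);
        cl->dirty = false;
    }
}
```

A local store to word 0 (outside the window) followed by a flush writes the stale cached word 2 back to memory and so erases a concurrent put to word 2 (inside the window). Keeping the public copy in a separate cacheline, as proposed above, avoids sharing the line.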

34:16/17 After "MPI_Barrier(comm0);" I propose to add

/* to allow MPI_NOCHECK also in the first iteration */

34:46 I believe it makes sense to add the following sentence:

If one clearly separates the areas of A0 and A1 that are used
locally from those used for RMA, then one can use put instead
of get, which can be faster on some systems because it needs
information exchange in only one direction.

That is a lot, and after a first quick pass over the text on Wednesday
I thought that all would be okay -- oh, terrible details.

Rolf

Rolf Rabenseifner (Computer Center )
Rechenzentrum Universitaet Stuttgart (University of Stuttgart)
Allmandring 30 Phone: ++49 711 6855530
D-70550 Stuttgart 80 FAX: ++49 711 6787626
Germany rabenseifner@rus.uni-stuttgart.de