Onesided -- comments & corrections

Rolf Rabenseifner (Rabenseifner@RUS.Uni-Stuttgart.DE)
Fri, 25 Oct 1996 12:20:24 +0200 (DST)

Marc,

Because you did not give reasons why you left out some of my
corrections, here again is a list for the onesided draft of Oct 24.

Error and typo corrections:
---------------------------

15:8 The "load" on the left side should be replaced by
"some computation"

Figure 5.2 on page 15 and the text 15:3-34 imply
that the put is delayed in all cases until after the post has occurred,
i.e. that it is guaranteed that the post has occurred before
the call to put has finished.

This is wrong. We do not want to say this. We want to allow
the usage of eager protocols to implement PUT.

Therefore after 14:30 "operations may occur earlier." please add:

"It is also allowed that put and complete do not block, provided the
implementation guarantees that the put completes at the origin
with the complete call and that its outcome at the target
is delayed until the post has occurred."
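
A minimal sketch of how such an eager put could behave at the target, only to illustrate the rule above. It is a single-process simulation, not MPI code and not the draft's algorithm; the type target_t and the functions eager_put and target_post, as well as the fixed sizes, are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define WIN_SIZE    8   /* hypothetical window size, in ints */
#define MAX_PENDING 16  /* hypothetical capacity of the eager buffer */

/* Hypothetical target-side state: window memory plus buffered eager puts. */
typedef struct {
    int    window[WIN_SIZE];
    int    posted;                 /* nonzero once the target has posted */
    size_t npending;               /* number of buffered puts */
    size_t pend_off[MAX_PENDING];  /* window offsets of buffered puts */
    int    pend_val[MAX_PENDING];  /* values of buffered puts */
} target_t;

/* Eager put: never blocks the origin. If the window is not yet posted,
   the data is buffered at the target instead of being written. */
void eager_put(target_t *t, size_t off, int val)
{
    assert(off < WIN_SIZE);
    if (t->posted) {
        t->window[off] = val;
    } else {
        assert(t->npending < MAX_PENDING);
        t->pend_off[t->npending] = off;
        t->pend_val[t->npending] = val;
        t->npending++;
    }
}

/* Post: exposes the window and applies buffered puts in arrival order. */
void target_post(target_t *t)
{
    t->posted = 1;
    for (size_t i = 0; i < t->npending; i++)
        t->window[t->pend_off[i]] = t->pend_val[i];
    t->npending = 0;
}
```

In this sketch a put issued before the post leaves the window unchanged, and its outcome only becomes visible in the window when target_post is called.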

20:18-27 Now your algorithm does not work at all!
Please look at my example.
There must be one flag at the target for each origin!
It is set by "post" and waited for by "start" and cleared
by "start" or "complete"!

In your old text, the "mark"-algorithm with the "group"
was correct. Only the "mark"-algorithm with "only the group size"
was wrong.
Therefore please restore your old text with the corrections
21:19-20 delete ", or just the group size (the later....checking)"
21:26-27 delete "; decrease the counter, if a counter isused"
(these are page:line numbers of the Oct 13 draft)

Proof -- the error in your simple algorithm can be shown by the
following example:

   rank 0            rank 1            rank 2
                     post(0+2)
   start(1)          wait              start(1)
   put(1)            :(blocked)        |
   complete          :                 |some
   start(1) >|       :                 |computation
   put(1)   >|       :                 |
   complete >|       :                 put(1)
             |       :                 complete
             v       (return)
   this must be      post(0)
   delayed after
   the second post

and your new algorithm does not make this delay!
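
The difference between the two bookkeeping schemes in the example above can be sketched as a small single-process simulation. This is not MPI code and not the draft's text; the type target_sync_t and all function names are hypothetical, and the "would block" case is modelled as a return value of 0 instead of real blocking.

```c
#include <assert.h>

#define NPROCS 3  /* ranks 0, 1, 2 as in the example */

/* Hypothetical target-side state for the two bookkeeping schemes. */
typedef struct {
    int flag[NPROCS];  /* scheme A: one flag per origin */
    int counter;       /* scheme B: only the group size */
} target_sync_t;

/* post: mark every origin in the group (A) and record the group size (B). */
void post_group(target_sync_t *t, const int *group, int n)
{
    for (int i = 0; i < n; i++) {
        assert(group[i] >= 0 && group[i] < NPROCS);
        t->flag[group[i]] = 1;
    }
    t->counter = n;
}

/* Scheme A: start may proceed only if this origin's own flag is set;
   the flag is cleared so that a second start waits for a second post. */
int try_start_flags(target_sync_t *t, int origin)
{
    assert(origin >= 0 && origin < NPROCS);
    if (!t->flag[origin])
        return 0;          /* would block */
    t->flag[origin] = 0;
    return 1;
}

/* Scheme B: start may proceed whenever the counter is still positive. */
int try_start_counter(const target_sync_t *t)
{
    return t->counter > 0;
}

/* Scheme B: complete decreases the counter. */
void complete_counter(target_sync_t *t)
{
    assert(t->counter > 0);
    t->counter--;
}

/* Scheme B: wait at the target may return once the counter is zero. */
int wait_done_counter(const target_sync_t *t)
{
    return t->counter == 0;
}
```

After post_group with the group {0, 2}, a second try_start_flags by rank 0 reports "would block", as required. With the counter alone, rank 0 can start and complete twice, after which wait_done_counter already reports completion although rank 2 has not yet done its put.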

In Section 5.6 you use the following terms and equivalences:

public copy  = window copy    = memory
private copy = process memory = local processor cache
               ^                ^
               ^-26:29-30       ^- 33:21-23

This is not good.
Best solution: delete the pair window copy / process memory
and use public copy and private copy throughout the chapter
instead of window copy and process memory!

27:9-11 The text can be misread as saying that, once
the outcome of a put/accumulate is in the public copy,
an additional wait/barrier/lock is necessary to get
it into the private copy.

You write "an ensuing call to ... MPI_WIN_BARRIER...", but
ensuing after what?
Your text can be read as "after an update by a put
or accumulate has become visible in the public copy",
and e.g. item 2 says that this is after an MPI_WIN_BARRIER
that matches the MPI_WIN_BARRIER after the put/accumulate.

This clearly implies that the values are in the private copy
after the MPI_WIN_BARRIER after the MPI_WIN_BARRIER that
matches the MPI_WIN_BARRIER after the put/accumulate.

And this we do not want to say!

We want to say that the values are in the private copy
after the MPI_WIN_BARRIER that
matches the MPI_WIN_BARRIER after the put/accumulate.

Instead of item 6 it is correct to say
that wait and barrier complete a put and accumulate
in the public and in the private copy.
A lock does this only if it is called by the window owner.

Therefore I suggest deleting item 6, moving its content
into items 2 and 3, and splitting item 4.

The corrections in detail are then:

26:43 "in the public and private copy of the window"
instead of "in the window copy"

26:47 "in the public and private copy of the window"
instead of "in the window copy"

After 27:4 add a new item:

"x. If an operation is completed at the origin by a call
to \mpifunc{MPI\_WIN\_UNLOCK} then the operation is
completed in the private window copy at the target by a
subsequent call to \mpifunc{MPI\_WIN\_LOCK} on the
window by the target process (window owner), or by
a call to \mpifunc{MPI\_WIN\_BARRIER} on the window
by the target process."

27:9-11 delete this item

28:26-27 Better, but not yet correct, because the cache coherence call
inside of MPI_LOCK is not done; therefore your rule 28:20-23
is only true if you change it as follows:

"after a local call to \mpifunc{MPI\_WIN\_UNLOCK} or
\mpifunc{MPI\_WIN\_BARRIER} if the accesses are synchronized
with locks and the window was updated only by local stores;
and after a local call to
\mpifunc{MPI\_WIN\_BARRIER} if the accesses are synchronized
with locks and the window was updated by put or accumulate."

instead of

"after a local call to \mpifunc{MPI\_WIN\_UNLOCK} or
\mpifunc{MPI\_WIN\_BARRIER} if the accesses are synchronized
with locks."

29:33 You have corrected only the first missing "may", therefore:
"as each process may block on start..."
instead of
"as each process blocks on start..."

Page 25, Example 5.11 is incorrect.
You have two possibilities:

possibility A: (the better one)

42:18 "if(!converged(A0,A1))" before the Win_post
and indent the MPI_Win_post
42:41 "if(!converged(A0,A1))" before the Win_post
and indent the MPI_Win_post

possibility B: (the worse one)

25:45 add before Win_complete:
MPI_Barrier(comm0);
MPI_Win_start(neighbors,MPI_NOCHECK,win0);
25:46 add after Win_complete:
MPI_Win_wait(win0);

And I believe the following clarifications should be included:

9:47 Add a clarification, because the indices of map begin with
zero!

! The indices of map are i = k + (j-1)*m - 1 with
! j = 1..p, the number of the MPI process that holds this part of B,
! k = 1..m, the index in B inside of that MPI process.
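
The index formula above can be written out as a small helper. This is only a sketch of the arithmetic; the function name map_index is hypothetical and does not appear in the draft.

```c
#include <assert.h>

/* Zero-based index into map for element k (1..m) of the part of B
   held by process number j (1..p): i = k + (j-1)*m - 1. */
int map_index(int j, int k, int m)
{
    assert(m >= 1 && j >= 1 && k >= 1 && k <= m);
    return k + (j - 1) * m - 1;
}
```

With m = 4, the pair (j, k) = (1, 1) gives index 0 and (2, 1) gives index 4, so the indices run from 0 to p*m - 1.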

28:7 "post-start-complete-wait" instead of "post-wait-start-complete"

28:11 I propose to add the following clarification to the
post-start-complete-wait paragraph:

"With the post-start synchronization the target process can tell
the origin process that its window is now ready for RMA,
and with the complete-wait synchronization the origin process
tells the target process that it has finished its RMA."

30:27 "put request (using a call to \mpifunc{MPI\_ISEND})"
instead of only
"put request"

33:9 I propose you add again some text for the window-boundary &
cache-line-boundary problem - although it exists only with
write-back noncoherent caches.

Therefore please add after 31:15:

And a similar problem arises on systems with write-back
noncoherent caches (see below) if the window boundary is
not cacheline aligned.
Then local updates outside the window can initiate a cache
flush overwriting the outcome of a concurrent RMA to the
partial cacheline inside of the window.
This conflict is avoided if two copies of the locations
of that partial cacheline are maintained -- the original
memory used for the private window copy and an additional
memory space with a separate cacheline for the public copy.
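
Whether a window is affected by the problem described above can be checked with simple address arithmetic. The following is a sketch under assumptions: the cacheline size is passed in as a parameter (it is not specified by the draft), and the function names window_is_line_aligned and partial_line_bytes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the window [base, base+size) starts and ends on a
   cacheline boundary, so that no cacheline is shared with data outside
   the window. line_size is an assumed cacheline size in bytes and
   must be a power of two. */
int window_is_line_aligned(uintptr_t base, size_t size, size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    return (base % line_size == 0) && ((base + size) % line_size == 0);
}

/* Number of window bytes that lie in partial cachelines, i.e. the part
   that would need a separate copy in the scheme described above. */
size_t partial_line_bytes(uintptr_t base, size_t size, size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    uintptr_t end  = base + size;
    size_t    head = (line_size - base % line_size) % line_size;
    size_t    tail = end % line_size;
    /* The window contains no cacheline boundary: every byte is partial. */
    if (base + head > end - tail)
        return size;
    return head + tail;
}
```

For example, with an assumed cacheline size of 64 bytes, a window of 100 bytes starting at address 32 has 32 bytes in a partial cacheline at its start and 4 bytes at its end.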

25:19/20 After "MPI_Barrier(comm0);" I propose to add

/* to allow MPI_NOCHECK also in the first iteration */

26:21 I believe it makes sense to add the following sentence:

If one clearly separates the areas of A0 and A1 that are used
locally from those used for RMA, then one can use put instead
of get, which can be faster on some systems because it needs
information exchange in only one direction.

Rolf


Rolf Rabenseifner (Computer Center )
Rechenzentrum Universitaet Stuttgart (University of Stuttgart)
Allmandring 30 Phone: ++49 711 6855530
D-70550 Stuttgart 80 FAX: ++49 711 6787626
Germany rabenseifner@rus.uni-stuttgart.de