If the only problem is to prevent two successive starts from occurring at
the same source, for the same destination window, before that window has
been posted anew, then I think a generation counter works fine as an
implementation technique. If the set of window clients changes, the user
code must synchronize to prevent races.
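
To make the generation-counter idea concrete, here is a minimal single-process sketch in C. It is only an illustration under assumptions: the type and function names (rma_window, rma_origin_view, rma_post, rma_try_start) are hypothetical and not part of any proposed interface, and a real implementation would have to read the counter across processes rather than from local memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical target-side window state: the counter is bumped once per post. */
typedef struct {
    uint64_t post_gen;          /* number of posts issued so far */
} rma_window;

/* Hypothetical origin-side record kept per destination window. */
typedef struct {
    uint64_t last_started_gen;  /* generation used by the last start; 0 = none yet */
} rma_origin_view;

/* Target: posting the window anew opens a new generation. */
void rma_post(rma_window *w)
{
    w->post_gen++;
}

/* Origin: a start may proceed only if the window has been posted since the
 * previous start from this origin. Returns true when the start is allowed. */
bool rma_try_start(rma_origin_view *o, const rma_window *w)
{
    if (w->post_gen == o->last_started_gen)
        return false;           /* no new post since our last start */
    o->last_started_gen = w->post_gen;
    return true;
}
```

In this sketch a second start against the same window is refused until the target posts again, which is exactly the condition described above. It says nothing about a changing set of clients; that case still needs synchronization in the user code.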
I may be missing something, but it seems to me that as written, the
Rma_post() <--> Rma_start() synchronization has to always be a full
bidirectional exchange (RPC-like), with start asking the target if
the post flag is currently set (locally), and blocking if it's not
yet set (and the target buffering all these requests until the next
post, and then replying to them all).
Q1: is this correct?
Since post, as I defined it, does not have the "address" of the remote
process, you will indeed need a handshake (request to RMA; request
granted; RMA), with the usual optimization that short messages can be
sent without first asking permission and are buffered on the receive
side. In a pure distributed-memory implementation, this is the same as a
send occurring ahead of the matching receive. Of course, once shared
memory in some form is available, we can do better.
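
The handshake can be sketched as a small state machine on the target, again as a single-process illustration in C. Everything named here is an assumption made for the example: the EAGER_LIMIT threshold, the MAX_PENDING bound, and the names rma_target, rma_needs_handshake, rma_on_start_request and rma_post_and_release are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define EAGER_LIMIT 1024   /* assumed size threshold in bytes */
#define MAX_PENDING 16     /* assumed bound on buffered start requests */

/* Hypothetical target-side state for one window. */
typedef struct {
    bool   posted;                 /* has the window been posted? */
    int    pending[MAX_PENDING];   /* origins waiting for a grant */
    size_t n_pending;
} rma_target;

/* Origin side: short transfers are sent without asking permission and are
 * buffered at the receiver; only long ones need the request/grant exchange. */
bool rma_needs_handshake(size_t nbytes)
{
    return nbytes > EAGER_LIMIT;
}

/* Target side: handle a start request from `origin`.
 * Returns 1 if granted at once, 0 if buffered until the next post,
 * -1 if the request buffer is full. */
int rma_on_start_request(rma_target *t, int origin)
{
    if (t->posted)
        return 1;
    if (t->n_pending == MAX_PENDING)
        return -1;
    t->pending[t->n_pending++] = origin;
    return 0;
}

/* Target side: post the window and release every buffered request.
 * `granted` must have room for MAX_PENDING entries; returns the count. */
size_t rma_post_and_release(rma_target *t, int *granted)
{
    size_t n = t->n_pending;
    t->posted = true;
    for (size_t i = 0; i < n; i++)
        granted[i] = t->pending[i];
    t->n_pending = 0;
    return n;
}
```

A start request that arrives before the post is buffered and answered when the post happens; one that arrives afterwards is granted immediately. This mirrors a send arriving ahead of its matching receive.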
Q2: if yes, is this what we want (as opposed to Rma_post(comm, rank)
and making the user do multiple of them, thus keeping track of the
synchronization arrows and keeping post/start single-directional)?
We need to hash out which is preferable, both from a user viewpoint
and performance-wise. I shall try to put both alternatives in the
text, to be discussed when we meet.