> I will show in this comparison, that the best will be a proposal
> with Raja's post-wait interface,
> and Marc's progress rules,
> and a no_store flag in the argument list of post.
I hope to change your mind.
> The proposals are:
> A) The main proposal voted in June.
> B) Marc's post(any)/start(rank) & complete(rank)/wait(count)
> from 16th Aug. with the additional mail from 17th Aug.
> C) same as A), but post(any) /start(rank) is substituted by
> post(rank)/start(count)
> D) same as B), but both pairs post(rank)/start(count) and
> complete(rank)/wait(count) are substituted by one pair
> post(rank)/wait(count), that is used for both functionalities.
> This is the proposal from Raja, but with Marc's progress rules.
> E) Dave's put/get/offer proposal.
> 1. Fulfilling applications' synchronization need
> E) seems to solve all cases (Dave, can you decide please)
I'll fill in the rest of your table in a following message.
> 2. Doing synchronization with minimal overhead
> a) inside the application
You omitted E:
E) No posts, starts, or locks. Every process involved must
perform only PUTs or GETs, and OFFER or IOFFER to process
incoming PUTs or GETs and/or to wait for outgoing PUTs or GETs
to complete.
> b) inside the shared memory implementation
> B) post/start must be implemented by a bidirectional RPC,
> i.e. a fetch of the 1-bit "posted" flag (atomicity is
> no problem.) In the case of shared memory, this is
> efficient.
> -- In the case of virtual shared memory this is not
> efficient.
> (Preliminary answer -- Marc's next version will show
> how to implement his model)
Rather than wait for Marc's next version, I would like to address this now.
Before deciding on which approach is the best, we must agree on what these
approaches should accomplish. I hope that you are not complaining that B)
allows a target process to keep PUTs and GETs from other processes from
taking effect until the target is ready. This functionality is not a fatal
flaw: it is a requirement.
There are at least three ways to implement this functionality.
(1) As you describe above -- i.e. the operation on the target just sets
or clears a local flag, and other processes performing PUT or GET
check that flag.
(2) The target sends (or broadcasts) to all of the processes that might try to
PUT or GET the information that it is (or is not) OK to do so, and they
check that flag locally before performing PUT or GET.
(3) Each PUT or GET is actually processed by the target (as a message or
page fault), which the target holds or acts upon based upon the value
of the local flag.
In the case of virtual shared memory, #3 might be most efficient.
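To make the difference between the three approaches concrete, here is a minimal single-process Python sketch of where the readiness check lives in each one. The names (`Target`, `Origin`, `put_v1`, `put_v2`, `put_v3`) are hypothetical and come from none of the proposals; plain method calls stand in for the messages or page faults a real implementation would use.

```python
class Target:
    """Toy target process; every name here is an illustrative assumption."""

    def __init__(self):
        self.ready = False   # the local "posted" flag
        self.memory = {}     # the region that PUT writes into
        self.held = []       # approach 3: PUTs held until the target is ready
        self.origins = []    # approach 2: processes that must be notified

    def set_ready(self, ready):
        self.ready = ready
        # Approach 2: push the new flag value to every known origin.
        for origin in self.origins:
            origin.cached_ready = ready
        # Approach 3: apply any PUTs that arrived while not ready.
        if ready:
            for addr, value in self.held:
                self.memory[addr] = value
            self.held.clear()

    def deliver(self, addr, value):
        # Approach 3: the target itself decides whether to hold or apply.
        if self.ready:
            self.memory[addr] = value
        else:
            self.held.append((addr, value))


class Origin:
    def __init__(self, target):
        self.target = target
        self.cached_ready = False   # approach 2: local copy of the flag
        target.origins.append(self)

    def put_v1(self, addr, value):
        # Approach 1: read the target's flag remotely, then write.
        if not self.target.ready:
            return False            # caller must retry later
        self.target.memory[addr] = value
        return True

    def put_v2(self, addr, value):
        # Approach 2: consult only the locally cached flag.
        if not self.cached_ready:
            return False
        self.target.memory[addr] = value
        return True

    def put_v3(self, addr, value):
        # Approach 3: always hand the PUT to the target.
        self.target.deliver(addr, value)
        return True
```

In this sketch only `put_v3` always returns immediately with the operation accepted; with the other two the origin has to retry or wait while the flag is clear.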
There must be some technique for implementing this functionality in all of the
proposals. I think that most of them require the user to explicitly implement
approach #2 using semaphores (post/wait). Marc's proposal and P/G/O allow
any one of the three implementations.
The only criticism I might understand is that B) and P/G/O do not allow the
target to specify who the possible origins are. That is, in #2 above, the
target may not know which processes need this information. We can discuss the
relative merits of allowing the target to specify the processes which might
PUT and/or GET, but I personally feel that in many cases either the target
doesn't know, or all of the processes will PUT and/or GET. In these cases,
adding such information won't help. Someone should consider what must happen
in the other proposals when the target does not know which processes will PUT
and/or GET.
You omitted E:
E) Like A, except that multiple PUTs or GETs can be combined at
origin, so the counter is only incremented once for each combined
event. Like B, a target can postpone PUTs or GETs until the
target is ready.
> c) inside the distributed memory implementation
> A) can be piggybacked
> B) can be piggybacked, except start(mpi_strong),
> but there a piggybacked solution is not expected.
> C,D) can be piggybacked on the communication needed
> for PUT and GET
I think you are overly generous to many of these. When synchronization and
communication are specified by separate operations, they can only be
piggybacked if the implementation waits for multiple user operations to be
performed before sending a single communication. So, when a synchronization
(or communication) operation is performed by a user, the MPI implementation
must choose between waiting for the next user operation with which the sync
or comm can be combined (which might not happen for a while) and not
piggybacking at all.
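The choice described above can be sketched as two toy implementations of the same user-level calls. Everything here (`Wire`, `EagerImpl`, `DeferringImpl`, and the `sync`/`put` method names) is a hypothetical stand-in, not an interface from any of the proposals.

```python
class Wire:
    """Records messages sent; a stand-in for the network (hypothetical)."""

    def __init__(self):
        self.sent = []

    def send(self, *parts):
        self.sent.append(parts)


class EagerImpl:
    """Sends each user operation as soon as it is issued: no piggybacking."""

    def __init__(self, wire):
        self.wire = wire

    def sync(self):
        self.wire.send("sync")

    def put(self, data):
        self.wire.send("put", data)


class DeferringImpl:
    """Holds a sync until the next put so both can share one message."""

    def __init__(self, wire):
        self.wire = wire
        self.pending_sync = False

    def sync(self):
        # Nothing goes out yet; the sync is delayed indefinitely
        # if the user never issues another operation.
        self.pending_sync = True

    def put(self, data):
        if self.pending_sync:
            self.wire.send("sync", "put", data)   # piggybacked
            self.pending_sync = False
        else:
            self.wire.send("put", data)
```

With `EagerImpl`, a `sync()` followed by a `put()` costs two messages. With `DeferringImpl` it costs one, but the sync does not leave the origin until the `put()` is issued.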
You omitted E:
E) User specifies both synchronization and communication together,
so can always be piggybacked.
> 3. The necessary cache coherence operations can be included:
> A-D) Yes.
> -- E) there are major problems because of the lack of an
> epoch based model
The answer to E) is "Yes" as well. In the P/G/O proposal, an OFFER
operation must always begin between the last local load or store and the
servicing of the first remote operation. The OFFER operation can therefore
perform a cache flush (i.e. "out") when it begins. An OFFER operation must
always end between the last remote operation and the first local load
or store, so a cache invalidate (i.e. "in") can be performed at this time.
However, there is no real need to perform a cache invalidate, since legal
programs will not have brought any of the "public" addresses into cache.
Most (or all) of this is noted in the "Advice to Implementors" under the
description of the "MPI_OFFER" operation in the P/G/O proposal.
I do not understand why you believe that an epoch-based model is important,
but if it helps, you can just view the OFFER as delimiting the public epoch,
and everything outside of the OFFER as a private epoch.
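The ordering argument can be sketched with a Python context manager that records when the cache operations would happen. The function `offer` below is a hypothetical stand-in and is not the signature of MPI_OFFER; the `invalidate_on_exit` parameter is likewise an assumption, added only to reflect that the invalidate may be unnecessary for legal programs.

```python
from contextlib import contextmanager


@contextmanager
def offer(trace, invalidate_on_exit=True):
    """Toy model of an OFFER: flush on entry, optional invalidate on exit."""
    trace.append("cache flush (out)")          # after the last local store
    try:
        yield                                   # remote operations are serviced here
    finally:
        if invalidate_on_exit:
            trace.append("cache invalidate (in)")   # before the first local load


def run_example():
    trace = []
    trace.append("local store")                 # private epoch
    with offer(trace):
        trace.append("remote PUT serviced")     # public epoch
    trace.append("local load")                  # private epoch again
    return trace
```

Read this way, the body of the `with` block plays the role of the public epoch, and everything outside it is the private epoch.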
> 4. Only necessary cache coherence operations should be done:
> A) is the optimum in this topic.
> B, C) are here identical but to achieve the optimum
> MPI must define hints and the user must use them.
> It is necessary to add a no_stores_were_done hint
> to the post operation. Then it can be identical to A).
> D) in more cases than B) and C) hints are necessary.
> It is necessary to add a no_stores_were_done hint
> to the post operation. Then it can be identical to A).
> E) fixing the problem mentioned in 3., it seems that
> also unnecessary IN/OUT operations will be done.
As for E), see my response to #3. In addition, the cache flush (i.e. "out")
can be omitted if the "count" argument is zero. The P/G/O proposal does
not use hints. In other words, I believe that your characterization
of A) as the optimum might be incorrect.
> 5. Simplicity for the user:
> E) Smallest interface.
:-)
You have omitted a very important metric:
6. Not blocking origin when target is not ready
You know much more about the individual proposals than I do, so maybe you
can answer this for each. I believe that, in most of them other than Marc's
new proposal and mine, a process cannot perform a PUT or a GET until it knows
that the target is ready to accept it. This means that, if the target is not
ready to accept it, no statements after the PUT or GET can execute, either.
In P/G/O, and I think in Marc's new proposal, the PUT or GET can always
progress, so statements after PUT or GET can execute before the target is
ready. (If the target is not ready, the PUT or GET event and/or the data can
be buffered, just like an ISSEND.) In the P/G/O proposal, this non-blocking
nature also applies to collective communication, but I don't think that is
the case in Marc's proposal, which uses BARRIER.
The performance advantage resulting from the ability to continue processing
in this way could easily be greater than the performance advantages resulting
from all of the other factors which you have mentioned.
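As a rough illustration of metric 6, the following Python sketch uses two threads and a queue to model an origin whose PUT is buffered while the target is not ready. It models none of the proposals' actual interfaces; the names (`run`, `origin_proc`, `target_proc`, `inbox`) are hypothetical, and the event is used only to force the scenario in which the target becomes ready late.

```python
import queue
import threading


def run():
    inbox = queue.Queue()               # buffers PUTs until the target is ready
    origin_moved_on = threading.Event()
    log = []
    memory = {}

    def origin_proc():
        inbox.put(("x", 42))            # the PUT is buffered and returns at once
        log.append("origin: work after PUT")
        origin_moved_on.set()

    def target_proc():
        # Assumption for the demo: the target is busy until the origin
        # has already continued past its PUT.
        origin_moved_on.wait()
        log.append("target: ready")
        addr, value = inbox.get()       # now service the buffered PUT
        memory[addr] = value

    threads = [
        threading.Thread(target=target_proc),
        threading.Thread(target=origin_proc),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log, memory
```

If the PUT instead had to wait for the target to be ready, the origin's later work could not appear in the log before the target's entry.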
> Evaluation:
> 3.E eliminates proposal E.
See above.
> D seems to be the best, presupposing that the no_stores_were_done
> hint is added to the post operation.
If the hint is wrong, then the program may give the wrong answer. I still
do not understand why this is called a hint.
> Additional important arguments can lead to another evaluation.
:-)
I do not know if we will reach agreement before the next meeting, but
maybe reviews like yours will help.
-Dave