> page 3 line 12
>
> I read this as saying that the invisible headers are sufficient to allow
> MPI_RMA_INIT to determine what type of segment has been passed to it. However,
> MPI_RMA_INIT may have been passed a pointer to an area not allocated by
> MPI_RMA_MALLOC, in which case it will not have the invisible header. Thus,
> this is not a robust solution to the problem of determining the memory type
> (page 4 line 38).
>
> The separation of MPI_RMA_MALLOC into two parts, MALLOC followed by INIT,
> complicates the implementation and requires MPI to maintain an internal list
> of the segments (for FREE and INIT). In contrast, a single MPI_RMA_MALLOC
> that both allocates the memory and performs the INIT function can cache the
> memory address in the created communicator. The allocated memory can easily be
> freed when the communicator is destroyed. This also eliminates the need for
> MPI_RMA_FREE.
>
Well, we can go around this design issue one or two more times, but the
options are clear enough. My assumption is that memory not allocated by
RMA_INIT (or an equivalent "internal" allocator) is considered "bad". Of
course, the implementation may keep track of "good" memory allocated by other
means using the same data structure created by RMA_INIT.
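
One way to picture that bookkeeping is the minimal sketch below. It assumes a
small fixed-size table of registered segments; the names seg_table,
seg_register, and seg_lookup are hypothetical and come from no MPI proposal.
The point is only that an address can be classified by a table lookup rather
than by an invisible header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registry of "good" memory segments (illustration only). */
#define SEG_MAX 16

typedef enum { SEG_UNKNOWN = 0, SEG_RMA_MALLOC, SEG_USER } seg_kind;

typedef struct {
    uintptr_t base;   /* first byte of the segment */
    size_t    len;    /* length in bytes */
    seg_kind  kind;   /* how the segment was obtained */
} seg_entry;

typedef struct {
    seg_entry entries[SEG_MAX];
    int       count;
} seg_table;

/* Record a segment. Returns 0 on success, -1 if the table is full or the
   arguments are invalid. */
int seg_register(seg_table *t, const void *base, size_t len, seg_kind kind)
{
    assert(t != NULL);
    if (t->count >= SEG_MAX || base == NULL || len == 0)
        return -1;
    t->entries[t->count].base = (uintptr_t)base;
    t->entries[t->count].len  = len;
    t->entries[t->count].kind = kind;
    t->count++;
    return 0;
}

/* Classify an address: the kind of the registered segment that contains it,
   or SEG_UNKNOWN if no registered segment does. */
seg_kind seg_lookup(const seg_table *t, const void *addr)
{
    uintptr_t a = (uintptr_t)addr;
    for (int i = 0; i < t->count; i++) {
        const seg_entry *e = &t->entries[i];
        if (a >= e->base && a - e->base < e->len)
            return e->kind;
    }
    return SEG_UNKNOWN;
}
```

With such a table, a pointer that was never registered simply comes back as
SEG_UNKNOWN, which an implementation could treat as "bad" memory.
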
> Page 4 line 12
>
> We need to make it clear that the displacement is scaled by the target
> displacement unit. All of the processes may have opened windows, each with its
> own displacement unit. The displacement for a get or put to a target window
> must be scaled by the displacement unit specified by the target.
OK
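
To make the scaling rule concrete, here is a minimal sketch of the address
computation at the target. The struct window_desc and the function
window_target_addr are hypothetical names used only for illustration; the
sketch assumes the window is described by a base address, a byte length, and
the target's own displacement unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a target window (illustration only). */
typedef struct {
    char  *base;       /* start of the window in target memory */
    size_t size;       /* window length in bytes */
    size_t disp_unit;  /* the target's displacement unit, in bytes */
} window_desc;

/* Translate a displacement supplied by the origin into a target address.
   The displacement is scaled by the TARGET's displacement unit.
   Returns NULL if an access of nbytes would fall outside the window. */
char *window_target_addr(const window_desc *w, size_t disp, size_t nbytes)
{
    assert(w != NULL && w->disp_unit > 0);
    if (disp > w->size / w->disp_unit)   /* also guards the multiply below */
        return NULL;
    size_t offset = disp * w->disp_unit;
    if (offset > w->size || nbytes > w->size - offset)
        return NULL;
    return w->base + offset;
}
```

The same displacement 3 lands on the fourth double when the target opened its
window with a displacement unit of sizeof(double), and on the fourth byte when
the unit is 1.
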
>
> Page 6 line 9
>
> The words "interfere in any way" seem too all-encompassing. Some (in the
> real-time world) might read this to mean that gets and puts use a different
> hardware mechanism to ensure that they don't interfere with the performance of
> the other MPI operations.
OK - do not interfere semantically.
>
> Page 6 line 29
>
> Actually these datatypes are portable because they contain no explicit
> displacements. All displacements are derived by the MPI implementation, which
> can thus use its knowledge of the remote machine architecture to derive the
> correct displacements. In a NOW implementation, only the type signature needs
> to be transmitted with the data, not the type map.
The text is clear: a datatype is interpreted at the target process as if it
were created at the target process by the same sequence of calls. Thus, if the
type constructing call used an explicit byte displacement, then the same
displacement is used at the target; if it used a displacement that is in terms
of multiples of a basic datatype, then this displacement is scaled. In
general, one needs to transfer sufficient information to capture the syntactic
"definition tree" of the datatype.
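
A minimal sketch of the distinction, assuming a flattened one-level stand-in
for the definition tree. All names here (type_block, machine,
block_byte_offset) are hypothetical, and the two "machines" with different
sizes for a long are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* How the displacement was expressed in the constructing call. */
typedef enum { DISP_BYTES, DISP_ELEMENTS } disp_kind;

/* Indices into a per-machine table of basic datatype sizes. */
enum { BASIC_INT = 0, BASIC_LONG = 1, BASIC_COUNT = 2 };

/* One block of a datatype as written in the constructor call. */
typedef struct {
    disp_kind kind;   /* explicit bytes, or multiples of a basic type */
    long      disp;   /* the displacement argument as the user gave it */
    int       basic;  /* basic type the block is made of */
} type_block;

/* Sizes of the basic types on one (hypothetical) machine. */
typedef struct {
    size_t basic_size[BASIC_COUNT];
} machine;

/* Byte offset of the block when the same call is replayed on machine m. */
long block_byte_offset(const type_block *b, const machine *m)
{
    assert(b != NULL && m != NULL);
    if (b->kind == DISP_BYTES)
        return b->disp;                               /* used unchanged */
    return b->disp * (long)m->basic_size[b->basic];   /* scaled at target */
}
```

Replaying an explicit byte displacement of 16 gives 16 on both machines,
whereas a displacement of 4 elements gives 16 bytes where a long is 4 bytes
and 32 bytes where it is 8.
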
>
> Page 18 line 9
>
> Why do we wish to discuss implementations which support atomic access to
> larger elements? If one is writing a portable program (the intent of MPI), one
> must take the worst case assumption, which is that the hardware only supports
> byte atomicity. We shouldn't encourage the user (or vendors) to use/support
> anything more.
Minor issue -- not worth arguing
>
> Page 18 line 18
>
> I believe we are intentionally disallowing an implementation which uses only
> callbacks to implement the RMA agent. We need some more definitive examples to
> hammer this home, e.g.:
>
> MPI_Put ()
> MPI_Fence ()
>
> must not deadlock irrespective of the actions of the target process, in
> particular, the target process need not make any MPI calls.
>
> Process 1                        Process 2
> ----------------------------------------------------
>                                  a = 0;
> MPI_Barrier();                   MPI_Barrier();
> MPI_Put ();                      while (a == 0);
>   /* put a nonzero value to a */
>
> Assuming "a" is a byte within the target window, the above is legal, and
> should not deadlock. No MPI calls are made on the receiver from which the RMA
> agent can be called, therefore the agent must execute from an interrupt, or
> signal, or in a separate processor.
We need a long discussion of which shared memory model we are trying to pattern
put/get after. Is it strong consistency, release consistency, delayed
consistency, dag consistency, ...? We keep having a lot of discussions because
different people have different implicit models of shared memory coherence.
With some of the weaker memory models, the code above (with put replaced with
store) may, indeed, loop forever.
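
As a shared-memory analogy only (not an MPI implementation), the sketch below
uses C11 atomics to show what the waiting side has to ask for before the loop
is guaranteed to observe the store. The names flag_a, store_nonzero, and
wait_nonzero are hypothetical. With a plain, non-atomic variable a C compiler
is allowed to read "a" once and spin forever, which is the weak-model behavior
described above.

```c
#include <assert.h>
#include <stdatomic.h>

/* Byte standing in for "a" in the target window; zero-initialized. */
atomic_uchar flag_a;

/* Stand-in for the remote put: a release store of a nonzero value. */
void store_nonzero(unsigned char v)
{
    assert(v != 0);
    atomic_store_explicit(&flag_a, v, memory_order_release);
}

/* The target's wait loop. Each iteration performs a fresh acquire load,
   so the loop cannot be collapsed into a single read of the flag. */
unsigned char wait_nonzero(void)
{
    unsigned char v;
    while ((v = atomic_load_explicit(&flag_a, memory_order_acquire)) == 0)
        ;  /* spin */
    return v;
}
```

Which of these guarantees put/get should provide is exactly the open question;
the sketch fixes one answer (release/acquire) purely for illustration.
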
>
> We also need to address resource limitations. Similar to the standard mode
> send, MPI_Put calls should be automatically throttled to ensure that resources
> are not exhausted. Thus, while puts may run ahead for a while, they must
> eventually block and synchronize with the remote agent. Non-blocking puts may
> fail due to resource limits.
>
> We may wish to make a statement about the (lack of) fairness for put calls.
>
OK
> Page 22 line 24
>
> The example should be MPI_Type_recv, not MPI_Recv
>
> Page 24
>
> This example assumes that "comm" refers to a window communicator, for which
> the window is presumably each process's entire local address space. Presumably
> in such a case the displacement unit would be one. This should be made
> clearer.
>
> To ease communication of pointers, I propose the following extension. This
> allows local pointers into a window to be automatically converted into
> displacements into the window and vice versa.
>
> Define a new MPI datatype: MPI_POINTER. The type matching rules for
> MPI_POINTER are:
>
> 1) MPI_POINTER matches MPI_INT
> 2) MPI_POINTER matches MPI_POINTER only if the origin and target are the same
> process
> 3) On the target process: void* matches MPI_POINTER
> 4) If the origin and target are the same process, then void* matches
> MPI_POINTER on the origin process
>
> The semantics of MPI_POINTER when used for get are as follows:
>
> 1) A void* is read from the target buffer
> 2) The base address of the window is subtracted from 1) above
> 3) 2) above is divided by the window displacement unit
> 4) The displacement from 3) above is sent to the origin process, and received
> as an int.
>
> The pointer must point to a location within the local window.
>
> The semantics for put are the reverse of the above.
>
> MPI_POINTER can also be used in a send/recv operation with a window
> communicator. The semantics are the same as above.
>
> We also define a new MPI constant "MPI_NULL_PTR" which is an integer
> displacement corresponding to a NULL pointer. Clearly this cannot be zero,
> since a zero displacement into a window is legitimate. It would probably have
> to be something like -1.
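
The conversion steps proposed above can be sketched in plain C as follows.
This is a minimal illustration under stated assumptions: ptr_window,
pointer_to_disp, and disp_to_pointer are hypothetical names, and
NULL_PTR_DISP stands in for the proposed MPI_NULL_PTR with the value -1
suggested in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULL_PTR_DISP (-1)  /* stand-in for the proposed MPI_NULL_PTR */

/* Hypothetical description of the local window (illustration only). */
typedef struct {
    char  *base;       /* base address of the window */
    size_t size;       /* window length in bytes */
    size_t disp_unit;  /* window displacement unit, in bytes */
} ptr_window;

/* Target side of a get: subtract the window base from the pointer, divide
   by the displacement unit, and hand back an int displacement. */
int pointer_to_disp(const ptr_window *w, const void *p)
{
    if (p == NULL)
        return NULL_PTR_DISP;
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)w->base;
    assert(a >= b && a - b < w->size);  /* must point into the local window */
    return (int)((a - b) / w->disp_unit);
}

/* Target side of a put: the reverse conversion. */
void *disp_to_pointer(const ptr_window *w, int disp)
{
    if (disp == NULL_PTR_DISP)
        return NULL;
    assert(disp >= 0 && (size_t)disp * w->disp_unit < w->size);
    return w->base + (size_t)disp * w->disp_unit;
}
```

Because displacement 0 is a valid location, the NULL case has to be tested
before any arithmetic is done in either direction.
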
>
> An example which allows remote access to local structures is then:
>
> /* Definition of a public node in the memory of the target process */
> typedef struct {
> void *next;
> double content;
> } PNODE;
>
> MPI_Datatype PNODE_TYPE;
> int blens[2] = {1,1};
> MPI_Datatype types[2] = {MPI_POINTER, MPI_DOUBLE};
> MPI_Type_contiguous_struct (2, blens, types, &PNODE_TYPE);
> MPI_Type_commit (&PNODE_TYPE);
>
> /* Equivalent node after it has been read into origin memory */
> typedef struct {
> int next;
> double content;
> } RNODE;
>
> MPI_Datatype RNODE_TYPE;
> int rblens[2] = {1,1};
> MPI_Datatype rtypes[2] = {MPI_INT, MPI_DOUBLE};
> MPI_Type_contiguous_struct (2, rblens, rtypes, &RNODE_TYPE);
> MPI_Type_commit (&RNODE_TYPE);
>
> void insert (double content, void *head, MPI_Comm comm)
> {
>   PNODE *pnode;
>   int rank;
>   MPI_Aint displ;
>
>   pnode = (PNODE*)malloc(sizeof(PNODE));
>   pnode->content = content;
>
>   MPI_Comm_rank (comm, &rank);
>   MPI_Address (head, &displ);
>
>   /* The assumption here is that the comm window is the entire address space of
>      the local process, and the displacement unit is 1. Thus, the displacement
>      generated by MPI_Address is the correct displacement to use within the
>      window. */
>   MPI_Rmw ((void **)&pnode, &(pnode->next), 1, MPI_POINTER, rank, displ,
>            MPI_SWAP, 0, comm);
>   return;
> }
>
> /* Local node for list on origin process */
> typedef struct lnode {
>   double content;
>   struct lnode *next;
> } LNODE;
>
> void remote2local_list_copy (int rank, MPI_Comm comm, LNODE **head)
> {
>   RNODE rnode;
>   LNODE *tail;
>   int rhead;
>
>   /* Assume displacement 0 contains the head of the list */
>   MPI_Get (&rhead, 1, MPI_INT, rank, 0, 1, MPI_POINTER, comm);
>
>   if (rhead == MPI_NULL_PTR) {
>     *head = NULL;
>     return;
>   }
>   else {
>     *head = (LNODE*)malloc(sizeof(LNODE));
>     MPI_Get (&rnode, 1, RNODE_TYPE, rank, rhead, 1, PNODE_TYPE, comm);
>     /* If we overlay the local pointer and remote displacement field, we
>        could probably avoid copying content. Instead we could overwrite
>        the remote displacement after we are finished with it. */
>     (*head)->content = rnode.content;
>     tail = *head;
>     while (rnode.next != MPI_NULL_PTR) {
>       tail->next = (LNODE*)malloc(sizeof(LNODE));
>       tail = tail->next;
>       MPI_Get (&rnode, 1, RNODE_TYPE, rank, rnode.next, 1, PNODE_TYPE, comm);
>       tail->content = rnode.content;
>     }
>     tail->next = NULL;
>   }
>   return;
> }
>
> The problem with 64 bit to 32 bit addresses is somewhat avoided since the
> target processor is (conceptually) responsible for converting a pointer into a
> displacement. Of course this displacement may be greater than 32 bits, but
> this problem exists already since window displacements use ints. In essence,
> our semantics for windows already limits 32 bit processors so that they can
> only access windows less than 2^31 bits in length. If a 64 bit processor
> wishes to open its entire address space to a 32 bit remote process, there is a
> problem. Changing window sizes and displacements to MPI_Aint doesn't solve the
> problem, since an Aint is driven by the size of a local pointer, and not the
> largest pointer in the system.
Too long for online reply. Let's discuss this Wed...
>
> Lloyd Lewins
> Hughes Aircraft Co.,
> llewins@msmail4.hac.com
>
>
>
>
Marc Snir
IBM T.J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598
email: snir@watson.ibm.com
phone: 914-945-3204
fax: 914-945-4425
URL: http://www.research.ibm.com/people/s/snir