I agree with all of the above.
> 2. This agent can access any part of the target process memory as easily as
> any other. This is clearly true if the agent runs in the same address space
> as the target process; this is true if the agent is the equivalent of a kernel
> communication process that can cross map the target process memory. I see no
> reason that access to "heap" or to a shareable memory segment be any easier
> than access to "stack" or "text". (The latter would be true only if the agent
> was a totally separate user process -- but this is an unlikely implementation.)
I disagree that such an implementation is unlikely. If the agent
is a lightweight thread, then yes, memory is memory and there's nothing magic
about the heap. If the agent is a kernel process, then of course it can do
whatever it wants. However, MPI is a user-level library and implementing a
kernel-level agent for it is a lot more painful than implementing a user-level
agent. For example, a much simpler implementation on a NOW would be to create a
single, separate agent process for each host, running as non-root, that uses
dynamically allocated shared memory to map in the get/put windows for each MPI
process.
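As a rough sketch of that idea (an illustration only, not taken from any existing MPI implementation), the following C fragment uses System V shared memory so that a separate, unprivileged process can store into a window owned by another process. The function name agent_put_demo and its parameters are hypothetical, and the agent is forked here only to keep the example in one file; a real agent would be a long-lived process that learns the segment id through some other channel.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: create a window of `nslots` ints, let a separate
 * "agent" process store `value` into slot `index`, and return what the
 * window owner then reads back. Returns -1 if any system call fails
 * (so -1 is ambiguous as a payload in this toy example). */
int agent_put_demo(size_t nslots, size_t index, int value)
{
    assert(index < nslots);

    /* Dynamically allocated shared segment: this is the put/get window. */
    int id = shmget(IPC_PRIVATE, nslots * sizeof(int), IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    pid_t agent = fork();
    if (agent == 0) {
        /* Agent process: map the window by id and deposit the data. */
        int *win = shmat(id, NULL, 0);
        if (win == (void *)-1)
            _exit(1);
        win[index] = value;
        shmdt(win);
        _exit(0);
    }

    int result = -1;
    if (agent > 0) {
        /* Window owner: map the same segment and wait for the agent. */
        int status = 0;
        int *win = shmat(id, NULL, 0);
        if (waitpid(agent, &status, 0) == agent &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
            win != (void *)-1)
            result = win[index];
        if (win != (void *)-1)
            shmdt(win);
    }

    shmctl(id, IPC_RMID, NULL);   /* release the segment */
    return result;
}
```

Nothing here needs root or kernel changes, which is the point of the user-level agent; the price is that only memory obtained through the shared-memory allocation is reachable by the agent.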
> Thus, in such environment, restricting put/get operations to a special,
> dynamically allocated memory segment, is restricting the functionality of
> put/get without improving performance or simplifying implementation.
I agree that this is probably true for MPPs, but not for NOWs (see below).
> B. In a shared memory environment, a put/get request could be entirely
> serviced by code executed by the calling process, if that process has access
> to the target process memory. Such implementation will result in higher
> performance than an implementation that uses a remote agent, on systems that
> have no h/w support for block transfers, and can only use processor
> memory-to-memory copy for put/get. Since it is likely that memory sharing
> between processes is restricted to dynamically allocated memory regions (e.g.,
> Unix 5 shared segments), it is advantageous in such environment to restrict
> put/get calls to "special memory".
Depending on how your block transfer engine works, it might also require
special memory. But essentially I agree.
> Given this situation, I see two possible directions.
>
> 1. The proposal of Eric: put/get can only use memory in the target process
> that has been dynamically allocated by a special call. This would be a
> variant of the MPI_RMC_MALLOC call in the current proposal.
>
> 2. My preference, which is an evolution of the current proposal (hopefully
> simplified): put/get can use any memory in the target process. A call is
> provided to allocate dynamically "good" memory on shared memory machines. If
> a put/get window is restricted to such memory, performance will be higher on
> shared memory machines; but put/get windows can also include other parts of
> the target process memory, with a possible performance penalty on shared
> memory systems. Thus, two initialization calls are provided: one
> MPI_RMC_INIT, to open a put/get window on preallocated memory, and
> MPI_RMC_MALLOC to allocate "good" memory for a put/get window.
>
> The advantage of 2 is that it does not penalize distributed memory systems,
> and that it does not impose a somewhat unnatural restriction on the use of
> put/get. The difference between different parts of the memory is performance
> (on SMP's), not function. This is not to service library or compiler writers
> -- this is to service application developers that may want to access via
> put/get any array they declare in their code.
>
> The disadvantage of 2, for shared memory implementers, is that they will
> either have to use a portable, lower performance put/get implementation that
> uses a remote agent, or they will have to have two execution paths for
> put/get: put/get executed by caller, if the target window is accessible, and
> put/get executed by a target agent, otherwise. I am not sure the pain is so
> bad: a portable implementation will have to have the "remote agent" solution,
> for NOWs and MPPs. It behooves the SMP implementers to also have the "direct
> access" solution. Since each window will be either of one type or another,
> the choice of the put/get method will depend on the communicator argument in
> the put/get call.
I completely disagree with these conclusions. An interface that all but
requires an extremely complicated agent for correctness is a Bad Thing. We
certainly don't want to have to go thru an agent to communicate within a node,
ever. And I interpret having more than one initialization call as a sign that
we don't really know what we're doing.
I claim that for something as fundamentally machine-dependent as puts and gets,
a very few well-placed restrictions will make our lives a whole lot easier.
Restricting remote addresses to the heap is perhaps unsatisfying from an
academic point of view, but the practical benefits in terms of portability and
performance are too substantial to be ignored.
Here is one possible implementation for MPI_PUT() on a NOW (SMPs and/or
uniprocessors) that works quite nicely if we limit ourselves to the heap: At
initialization time, one agent is forked off on each host for the application.
When MPI_SHMALLOC() is called, all processes in the communicator allocate some
shared buffers that are also mapped in by the agent. Whenever an MPI process
within the host wants to perform a put within that same host, it simply copies
the data into the appropriate shared addresses. When it wants to perform a put
to a process on a different host, it sends the data off to the remote agent on
that host. And when a put request arrives at the local agent from a remote MPI
process, the data is simply copied into the shared buffer by the agent exactly
as if it had come from a local MPI process. MPI_GET() would work similarly.
This is a very clean, simple, and efficient implementation that requires
nothing special from the kernel and is extremely portable.
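To make the two paths concrete, here is a minimal single-process simulation in C of the put logic just described. Everything in it is an assumption made for illustration: the names sketch_put and agent_deliver, the fixed table of four processes on two hosts, and the fact that the "network send" to the remote agent is modeled as a direct function call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { NPROCS = 4, WIN_BYTES = 64 };

/* One shared put/get window per MPI process (hypothetical layout). */
struct window {
    int host;                       /* host the owning process runs on */
    unsigned char buf[WIN_BYTES];   /* stands in for the shared buffer */
};

struct window windows[NPROCS] = {
    { .host = 0 }, { .host = 0 }, { .host = 1 }, { .host = 1 }
};

int agent_deliveries;               /* puts that went through an agent */

/* Agent side: a put request arrived from another host, so copy it into
 * the shared buffer exactly as a local process would have done. */
void agent_deliver(int target, size_t off, const void *src, size_t len)
{
    assert(target >= 0 && target < NPROCS);
    memcpy(windows[target].buf + off, src, len);
    agent_deliveries++;
}

/* Caller side: returns 0 on success, -1 on a bad rank or range. */
int sketch_put(int origin, int target, size_t off,
               const void *src, size_t len)
{
    if (origin < 0 || origin >= NPROCS || target < 0 || target >= NPROCS)
        return -1;
    if (off > WIN_BYTES || len > WIN_BYTES - off)
        return -1;

    if (windows[origin].host == windows[target].host) {
        /* Same host: plain copy into the shared window, no agent. */
        memcpy(windows[target].buf + off, src, len);
    } else {
        /* Different host: hand the data to that host's agent. A real
         * implementation would send a message here instead. */
        agent_deliver(target, off, src, len);
    }
    return 0;
}
```

For example, sketch_put(0, 1, ...) stays on host 0 and never touches an agent, while sketch_put(0, 2, ...) crosses to host 1 and increments agent_deliveries. A get would presumably mirror this with the copy direction reversed.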
I challenge anyone to come up with a similarly strong implementation that can
handle arbitrary addresses.
> One more thought: on SMP's one should not only be able to speed up put/gets,
> when the target memory is accessible by the caller, but also to speed up
> send/receives, by avoiding additional buffering. The receiver can directly
> copy a message sent from the sender memory to its own memory. We may want
> MPI_RMC_MALLOC to be a more general function that is used to allocate "good"
> memory for communication, for put/get as well as for send/receive.
This is an interesting idea. P4 does something much like this, I think, where
the application may preallocate a message buffer for higher performance. One
additional benefit is that you get to allocate space for the message header at
the same time. One semi-weird twist is that such memory would probably only be
needed for the send buffer in implementations where the receiver does the
copying.
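A small C sketch of the header-plus-payload allocation, under my own assumptions: the names comm_alloc, comm_header and comm_free and the header fields are invented for illustration and are not P4's or MPI's interface. The allocator reserves room for the message header immediately in front of the buffer it hands back to the application.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical message header kept in front of every user buffer. */
struct msg_header {
    int    source;
    int    tag;
    size_t len;      /* payload size in bytes */
};

/* Pad the header to the strictest alignment so that the payload that
 * follows it is suitably aligned for any data type. */
union msg_slot {
    struct msg_header hdr;
    max_align_t       pad;
};

/* Allocate `len` payload bytes with header space in the same block. */
void *comm_alloc(size_t len)
{
    union msg_slot *slot = malloc(sizeof *slot + len);
    if (slot == NULL)
        return NULL;
    slot->hdr.source = -1;
    slot->hdr.tag = -1;
    slot->hdr.len = len;
    return slot + 1;             /* payload starts right after the header */
}

/* Recover the header that sits in front of a comm_alloc'ed payload. */
struct msg_header *comm_header(void *payload)
{
    assert(payload != NULL);
    return &((union msg_slot *)payload - 1)->hdr;
}

void comm_free(void *payload)
{
    if (payload != NULL)
        free((union msg_slot *)payload - 1);
}
```

With this layout the library can fill in the header in place and ship header and data as one contiguous block, without copying the user data into a separate message buffer first.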
I'm really glad that we're starting to focus on implementation issues now, so
let's all try to keep this thread going...
--
Eric Salo         Silicon Graphics Inc.             "Do you know what the
(415)390-2998     2011 N. Shoreline Blvd, 7L-802     last Xon said, just
salo@sgi.com      Mountain View, CA 94043-1389       before he died?"