So far there have been three messages in support, and one opposed, but
the one opposed has all the technical details to back it up. So while
I know it's against MPI-2 tradition to have much detailed discussion
over email before the meeting :-), I strongly encourage anyone with an
opinion, pro or con, to speak up, especially vendors or other
implementors for whom the new proposal may be either easy or hard to
implement. This includes people who have discussed it over private
email but have not posted to the list. Also, of course, questions
about the intent or meaning of any part of the proposal are
encouraged.
> That doesn't mean, of course, that radical new counter-proposals cannot be
> considered. So there is nothing wrong with bringing such a proposal to the
> next meeting, but it has to be considered in the context of the existing
> draft as it has been evolving, and has been voted on, so far.
So the procedure I would like to follow (and of course this is just my
opinion) is the following.
1. If there is not clear evidence of substantial support for moving in
this new direction, drop it before the next meeting.
2. If there is a lot of support, prepare for the next meeting a
chapter that has both the new proposal and the old proposal. As the
first order of business, discuss the two proposals and take a straw
vote to determine which one to pursue. The worst possible outcome of
this would be the infamous 10-10-10 vote, which effectively prevents
us from making any substantial progress at the next meeting. If that
is going to be the vote (which would mean that there has not been
enough discussion for thoughts to crystallize) I would prefer not to
bring it up at all - hence item 1, above. It could set us back at
least a full meeting, and time is getting short. Of course if the vote
is 2-20 against the new proposal, there was also no point in bringing
it up. So what I'm really looking for is enough discussion to generate
a rough consensus before the next meeting.
> All that said, I think the new proposal is a step backward. The key issue
> is the separation of the spawn from the creation of the communicator.
So I will repeat some of the arguments that I believe came up in the
preliminary discussion that led to this proposal. I think others made
these arguments, but I'm not sure who. (For those of you in favor
of the new proposal, this does not absolve you of the responsibility
to send out your opinion!).
> 1. Combining spawning with communicator formation in a collective operation
> was done at an early stage in order to obtain scalability and
> implementability, especially on MPP's, where there might need to be switch
> allocation, etc.
I would particularly like to hear from the MPP vendors on this. How
much of a problem is it? Can it be handled using the strategies below?
> 2. Scalability concerns say that one should not unnecessarily revisit spawned
> processes a second time, just to set up communication, when that could have
> been done at spawn time. Think thousands of processes.
I will claim that revisiting spawned processes can be very minimal,
and it can be done in a scalable manner. Here's the idea.
First, I think we are talking about the INDEPENDENT_MPI (spawn
MPI+detach) functionality. In the previous proposal, SPAWN and
SPAWN_MULTIPLE are explicitly collective with the children so
presumably there is no additional issue with the new proposal.
Second, I claim that MULTIPLE_INDEPENDENT_MPI (spawn MPI +
group_union + detach), while essential for completeness, is in fact a
very rare case. So if we can do a very good job on INDEPENDENT_MPI
and have a small amount of extra overhead for
MULTIPLE_INDEPENDENT_MPI, that is not a problem.
So let's think about the case of a tightly integrated MPP where
communication is set up "automatically" and doesn't need the
assistance of a parent. The basic idea is that setting up this
communication (possibly among thousands of processes) can still be
handled automatically, without intervention from the parent.
When you spawn an MPI application on an MPP, it goes ahead and sets up
communication without the assistance of the parent. It creates,
inside of MPI_INIT, an MPI_COMM_SIBLINGS that consists of the
processes spawned at the same time. It then blocks awaiting info from
the parent. There are 3 cases.
1. parent calls detach
In this case, the parent sends only a single short piece
of data to the root process saying
"detach yourself". The root process broadcasts this to
its siblings (a scalable operation). Everyone dups
MPI_COMM_SIBLINGS to form MPI_COMM_WORLD or just makes
MPI_COMM_WORLD equal to MPI_COMM_SIBLINGS.
The extra overhead is:
a. a single broadcast
b. sending one small piece of data from the parent to one
of the children.
c. requiring parent and child to synchronize
Of these, I claim that only c. may be a problem.
2. parent calls group_union + detach
In this case, the parent needs to contact each of the group leaders,
act as a rendezvous point for them to exchange information, and then
can leave. The communicator merge that goes on in the children can
probably be done in a scalable way.
I think that the only serious problem here would be if
you called spawn() 1000 times to create 1 process each
time, and then merged the groups and detached them as a
single MPI_COMM_WORLD. In this case the parent is a
bottleneck. Is it important enough to worry about?
3. parent calls other group manipulation functions, then detach.
This might be more complicated, involving a lot of coordination.
First, I think this is a very rare case (which we might want
to explicitly disallow, although having it seems cleaner
from the point of view of design, as groups are groups are groups).
Second, this functionality isn't possible in the proposal
of the last meeting.
Thinking about it a bit more, though, I can see serious implementation
issues if the parent might conceivably be required to talk to a process
other than the root process of a spawned set. So perhaps
this should be taken out. Only group_union is allowed - this would
be the same functionality as the original proposal.
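As a rough illustration of the overhead in cases 1 and 2, the following
Python sketch counts the messages the parent must send and the number of
broadcast rounds among the siblings. It is only a model, not part of
either proposal: it assumes the root reaches its siblings with a
binomial-tree broadcast, it counts just one message per group leader
(the rendezvous in case 2 would need more), and the function names are
invented for this example.

```python
def binomial_broadcast_rounds(n_siblings):
    """Rounds for a binomial-tree broadcast from the root to n_siblings processes."""
    if n_siblings < 1:
        raise ValueError("need at least one process")
    informed = 1  # at first only the root has the "detach yourself" message
    rounds = 0
    while informed < n_siblings:
        # in each round every informed process forwards to one more process
        informed = min(n_siblings, informed * 2)
        rounds += 1
    return rounds


def detach_cost(group_sizes):
    """Return (messages sent by the parent, broadcast rounds in the largest group).

    group_sizes has one entry per spawn() call. A single entry models case 1
    (detach); several entries model case 2 (group_union + detach).
    """
    parent_messages = len(group_sizes)  # assumed: one short message per group leader
    rounds = max(binomial_broadcast_rounds(n) for n in group_sizes)
    return parent_messages, rounds


if __name__ == "__main__":
    print(detach_cost([1000]))      # case 1: one spawn() of 1000 processes
    print(detach_cost([1] * 1000))  # case 2 worst case: 1000 spawns of 1 process
```

Under these assumptions a single spawn of 1000 processes costs the parent
one message and the children ten broadcast rounds, while 1000 spawns of
one process each cost the parent 1000 messages, which is the bottleneck
described in case 2.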
> 3. It seems especially strange to have to have independent processes wait,
> just to be detached, instead of proceeding, getting scheduled by the
> scheduler, and running, while they could be doing useful work, or getting
> themselves submitted to the job scheduler, or whatever. This is a bad,
> non-scalable, and very unnecessary synchronization.
Presumably the spawn did not complete until the processes
were already running (since you get a group back,
indicating running processes). It has been my understanding
that SPAWN_MULTIPLE_INDEPENDENT, if for instance it submitted
a job to the job scheduler, would wait for that job
to start running before completing. That is one important
reason to have the nonblocking ispawn(). In the new
proposal there is still a nonblocking ispawn().
In the usual case the application will call
    spawn()
    detach()
or possibly
    ispawn()
    ... work ...
    wait()
    detach()
I can see no useful reason to do
    spawn()
    ... work ...
    detach()
So the amount of time spent waiting can be quite small, unless the
children take a long time to call MPI_Init(). This is a possibility,
but one I don't think we need to take too seriously. My understanding
is that the number of processes may in fact be undefined before the
process calls MPI_Init() (i.e., MPI_Init() may spawn the extra
processes, as in P4) so applications pretty much have to call
MPI_Init() right at the beginning.
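The waiting argument can be made concrete with a small timing model in
Python. This is a sketch under stated assumptions, not a description of
any implementation: it treats detach as a simple rendezvous in which
whichever side arrives first waits for the other, all times are in
arbitrary units, and the function name is invented for this example.

```python
def detach_wait_times(detach_called, child_ready):
    """Return (parent_wait, child_wait) for one spawn/detach handshake.

    detach_called: time at which the parent calls detach()
    child_ready:   time at which the children block inside MPI_Init
    Assumed model: whichever side arrives first waits for the other.
    """
    parent_wait = max(0.0, child_ready - detach_called)
    child_wait = max(0.0, detach_called - child_ready)
    return parent_wait, child_wait


if __name__ == "__main__":
    # spawn(); detach(): both sides arrive at about the same time
    print(detach_wait_times(detach_called=5.0, child_ready=5.0))
    # spawn(); ... work ...; detach(): children idle while the parent works
    print(detach_wait_times(detach_called=35.0, child_ready=5.0))
    # children slow to reach MPI_Init: the parent is the one that waits
    print(detach_wait_times(detach_called=5.0, child_ready=12.0))
```

In this model the only patterns with a noticeable wait are the one with
work between spawn() and detach(), and the one where the children are
slow to call MPI_Init().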
Since we have acknowledged that spawn() operations may be quite
expensive, I'm not worried about the very small amount of extra
overhead of the parent sending a message to the child telling it to
detach. As for non-scalable, I don't think there's a problem, but
perhaps I have missed something.
> 4. We have tried in the past not to constrain implementations. This proposal
> is very constraining to implementation, and prevents them from being scalable
> and robust.
I may have missed some scalability issues. Can you give some
examples?
I don't see robustness issues, but again I may be missing
something. The new proposal seems, in a way, more robust,
because it clearly separates starting processes from establishing
communication. There is no ambiguity about what happens
when a process starts but quickly dies, or fails to call MPI_Init()
for some other reason.
> 5. Starting one process and having it start others is not the "MPI way" to
> start a set of processes. Surely what is expected is that one will use spawn
> to expand an already existing communicator. Then it is critical that this
> operation be collective. In the new proposal it is not.
I think there may be some misunderstanding of the proposal here.
Every time a communicator is involved there is a collective
operation. It has always seemed wrong to me, for instance,
that MPI_SPAWN_INDEPENDENT is a collective operation potentially
called on many nodes. It is collective because its cousin,
MPI_SPAWN (old proposal), is collective. MPI_SPAWN has to be collective
because it returns an intercommunicator, and needs a
communicator to be the parent half of that intercommunicator.
Perhaps there is an extra issue I'm missing. For the proposal of the
last meeting, I had expected that the actual spawn would always occur
on one process and that only the establishment of communication would
be collective. Do you have in mind that the spawn might be done in
parallel, by different processes in the parent communicator? If so,
and if this is a serious implementation possibility (I have never
heard of it) then the new proposal definitely has less functionality.
Have implementors thought about how they would do a collective
spawn?
Bill