| Date: Wed, 31 Jan 1996 21:53:07 -0800
| From: wcs@nas.nasa.gov (William C. Saphir)
| Subject: notes from meeting - dynamic processes
|
| - "where" argument in MPI_Spawn
|
| [text deleted]
| It was also noted that this argument may
| specify "how" as well as "where," so perhaps
| its name should be changed. "runtimeinfo" was
| suggested.
| [text deleted]
Because the functionality is
"where and/or how to start the processes"
I think "startinfo" would be a more appropriate name
than "where" or "runtimeinfo".
| - "command-line" argument parsing.
| [text deleted]
| It was suggested that the command line could
| be "mpirun -np 4 app" to spawn an MPI process.
| What happens in this case (N copies of "mpirun" are
| spawned, each of which may start a 4-process app, which
| the spawning program would not be aware of)
| should be explicitly explained.
I thought that this was suggested for the "where" argument.
And then if the user says "-np 4" and chooses "N=3",
I think it is overdetermined and the normal error
handling (which still has to be defined) should be used.
But this discussion leads to the idea of proposing the constant:
MPI_SPAWN_MAX
N=MPI_SPAWN_MAX means that as many processes as possible
will be started for the given "where" and "command-line"
arguments.
Example:
N = MPI_SPAWN_MAX
MPI_SPAWN('',N,'mpirun -np 4 app',0,root,comm,intercomm)
will start 4 copies of the application 'app'.
This is also another example in favor of allowing an empty
command-line argument.
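
The interplay between N and a count fixed by the other arguments can
be sketched as follows. This is only an illustration in Python, not
MPI code. MPI_SPAWN_MAX is modelled as a hypothetical sentinel value,
and resolve_count, SpawnError, "implied" and "capacity" are names
invented here. That the runtime's capacity is used when nothing else
fixes the count is an assumption.

```python
# Hypothetical sentinel standing in for the proposed MPI_SPAWN_MAX constant.
MPI_SPAWN_MAX = -1


class SpawnError(Exception):
    """Stands in for the normal error handling, which is still undefined."""


def resolve_count(n, implied, capacity):
    """Return the number of processes to start.

    n        -- the N argument of the spawn call, or MPI_SPAWN_MAX
    implied  -- count fixed by the "where"/"command-line" arguments
                (e.g. 4 for "-np 4"), or None if they fix nothing
    capacity -- assumed upper limit the runtime system could start
    """
    if n == MPI_SPAWN_MAX:
        # As many as possible for the given arguments.
        return implied if implied is not None else capacity
    if implied is not None and implied != n:
        # Overdetermined: e.g. "-np 4" together with N=3.
        raise SpawnError(f"N={n} conflicts with implied count {implied}")
    return n
```

With these assumptions, resolve_count(MPI_SPAWN_MAX, 4, 10) yields 4,
while resolve_count(3, 4, 10) is the overdetermined case and raises.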
| Date: Wed, 31 Jan 1996 23:05:25 -0800
| From: wcs@nas.nasa.gov (William C. Saphir)
| Subject: Re: dynamic process chapter comments
|
| > From: Rabenseifner@RUS.Uni-Stuttgart.DE (Rolf Rabenseifner)
| > Date: Tue, 30 Jan 96 11:18:41 MEZ
| Rolf, thanks very much for your comments.
|
| > - In 3.3.2, MPI_SPAWN, description of "command-line" argument
| > I propose to append (page 8, after line 41):
| >
| > If the "command-line" argument is omitted (NULL in C or an empty
| > string in Fortran) further copies of the calling program are
| > started.
| >
| > Reason:
| > An application that uses MPI_UNIVERSE_SIZE can be written
| > very portable. It must not analyse the calling sequence
| > to find its own name.
|
| So this sounds at first like a reasonable thing, but it looks
| as if the intent is to facilitate a bad thing, which is
| a host-node bootstrap approach to SPMD programs. i.e.,
| you want 8 copies of "foo" in your application so you
| spawn a copy of foo which spawns the other 7. It is
| really much more appropriate in MPI to spawn all 8 at once,
| and the 1+7 approach seems like an PVM (PARMACS?) relic.
| This might be reasonable for
| true host-node programs, because the master process may
| not know what to spawn until it has started, and because
| degraded master-slave communication is not likely to be
| too much of a problem, but I don't see a justification
| for the SPMD case. Is this what you're targeting?
With the example above I would propose the following extension:
If the "command-line" argument is omitted (NULL in C or an empty
string in Fortran) further copies of the calling program are
started or the "where" argument defines the executable that
will be started.
I'm targeting two cases:
1) The user starts the application on his/her workstation and
the application uses a PVP host as compute server for some
numerical parts and an MPP system for some other parts.
On each system the application is installed under the same
name, and probably only those modules are included that run
efficiently on that system.
2) Details are specified in the "where" argument.
Using where='file=my_config' the application can be totally
separated from all spawning information, which makes
application programming very simple.
The following example should run on each MPI system:
N = MPI_SPAWN_MAX
MPI_SPAWN('',N,'file=my_config',0,root,comm,intercomm)
I did not want to target your "host-node bootstrap SPMD" example.
But even in that example I agree with you only from the application
programmer's viewpoint: it is more natural to start 8 processes
than 1+7. From the application user's viewpoint it can be more
comfortable to start one process (he/she only has to be sure
that this executable can be called by his/her shell); the
application then starts the other 7 processes and tells him/her
what to do in case of errors.
| > - In 3.3.2, MPI_SPAWN, paragraph about MPI_SPAWN_SOFT,
| > I propose to append (page 9, after line 16):
| > Advice to users. The number of spawned processes can be
| > inquired with MPI_COMM_REMOTE_SIZE(intercomm, size).
| > (End of advice to users.)
| > Reason:
| > It clarifies how to test "an empty intercommunicator is returned"
| > (page 9, line 16).
| This looks like a good clarification. We still have a problem
| for MPI_Spawn_multiple() - suggestions welcomed.
For that I have a very simple solution (we have used it in our DFN-RPC):
use N as both input and output argument.
IN/OUT n number of processes to start / that were started
And in MPI_Spawn_multiple()
IN/OUT array-of-n
Advice to users. The output value of "n" is stored into the
argument list only at the root process and only in the case
of a successful spawn. (End of advice to users.)
Advice to implementors. The value of "n" may be changed only if
it is necessary. The application may pass a read-only variable
if the value is not MPI_SPAWN_MAX and the "flags" argument does
not contain MPI_SPAWN_SOFT. (End of advice to implementors.)
Comment: Our experience was that users pass constants
whenever they think it is possible. We then had the bug
that we wrote the original value back into write-protected
memory. Bad luck. Hence both pieces of advice.
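
The "change n only if necessary" rule can be sketched as follows.
This is an illustration in Python, not the DFN-RPC or MPI code. The
Cell class and store_started are hypothetical names, and the
read-only flag merely models write-protected memory.

```python
class Cell:
    """A slot for the IN/OUT argument n; may model a read-only constant."""

    def __init__(self, value, writable=True):
        self._value = value
        self._writable = writable

    def get(self):
        return self._value

    def set(self, value):
        if not self._writable:
            # Models the fault caused by writing into write-protected memory.
            raise PermissionError("n was passed as a read-only constant")
        self._value = value


def store_started(n, started):
    """Store the number of started processes into n only if it differs."""
    if started != n.get():
        n.set(started)
```

A caller that passes a constant and gets exactly what it asked for is
never written to. Only a differing result, as may happen with
MPI_SPAWN_SOFT or MPI_SPAWN_MAX, requires a writable variable.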
| >- In 3.3.4, proposal 2, MPI_UNIVERSE_SIZE(size)
| > I propose to append (page 13, after line 3):
| >
| > Advice to users. Because it is not guaranteed that the returned
| > number of processes can be started in a subsequent MPI_SPAWN,
| > it is recommended to use MPI_SPAWN_SOFT in the "flag" argument
| > there. (End of advice to users.)
| If MPI_Universe_size is to be at all useful, I think it has
| to be very reliable in the usual case, which is probably something
| like:
| mpirun -ntotal 10 -np 1 master
| So I would hesitate to add this.
| However, this brings up an interesting point, which is
| that MPI_SPAWN_SOFT may return fewer processes than
| requested either because of resource limitations,
| which are in some sense expected, or because of "hard"
| errors, such as a missing .rhosts file or executable.
| It might be good to be able to distinguish
| in whatever we come up with for MPI_Spawn error reporting.
This is hard to do - an example:
Three systems are started with rsh in the background:
- on the first one the application is started correctly
- on the second one the application is not started because
there is no executable
- on the third one the operating system limit for running processes
is reached and therefore the application is not started
And in all three cases the "rsh...&" returns "success".
I think the implementation should probably use an error report
file where it can report all details. But this is the general
error discussion, where I do not have really good ideas.
I did not find "-ntotal" in any description of MPICH.
What is its meaning?
| > - In 3.3.4, further proposal for MPI_UNIVERSE_SIZE(size):
| >
| > MPI provides the following function:
| >
| > MPI_SPAWNABLE (where, n)
| >
| > IN where A string telling the runtime system where and/or
| > how to start the processes as described in
| > MPI_SPAWN
| > OUT n Number of processes that can be usefully spawned.
| >
| > This function returns the number of processes that can be
| > usefully started with a subsequent MPI_SPAWN or MPI_SPAWN_... .
| > In MPI implementations that are tightly integrated
| > . ... (same text as in proposal 2, page 9, lines 1-3)
|
| It seems to me that this involves too much interaction with
| the runtime system. In the current proposal, this would
| be accomplished by direct interaction with the runtime system,
| for instance, with pvm_config() (with a PVM runtime system).
| There's a portability argument for putting this into
| MPI. Any others?
|
| A brief history of the MPI_UNIVERSE_SIZE thing is the following.
| We originally had lots of interaction with the runtime system.
| Reacting to the horror of this pandora's box, we eliminated
| all interaction with the runtime system, saying that an application
| should query the runtime system directly. However, what about
| as above:
| mpirun -ntotal 10 -np 1 master
| Followed perhaps by
| mpirun -ntotal 8 -np 1 master
| mpirun -ntotal 12 -np 1 master
|
| The assumption is that this is probably the most common
| type of "dynamic" application.
| The problem is how do you communicate the "10", "8" and "12" to
| the program. MPI_UNIVERSE_SIZE was the solution. The reasoning
| was that there isn't really any interaction with the
| runtime system - it is just a message from the user to him/herself.
| It was recognized that this was a slippery slope and we
| have seen that happening. First with the proposal
| for MPI_Universe_size() which seductively promised
| to make things much easier for a large class of scavenger
| applications, but despite the apparently simple change
| added a key element - interaction with the runtime
| system after MPI_Init(). The proposals above
| are just a bit further down the slope. So I'm
| advocating that we hold the line at a static universe size,
| and worry, based on the straw poll at the last meeting,
| that the whole thing will be voted out if it goes any further.
To say it very clearly:
The proposal MPI_UNIVERSE_SIZE() is not consistent with
the proposal of MPI_SPAWN() !
Think about an MPI implementation that allows rsh calls in
the "where" argument: it can start an application on any
system in the world; the size of the universe is nearly
infinite. But in each MPI_SPAWN with a concrete "where"
the number is strictly finite!
Bill, it seems that you think only about encapsulated systems
like an Intel Paragon or a workstation cluster with a
fixed MPI configuration file.
I think we should not vote for an inconsistent interface.
And I think each MPI implementation is allowed to restrict
the values of "where" so that it must always be an empty string
(or 'file=....').
Summary: With my proposal I'm targeting four things:
- consistency between MPI_SPAWN and MPI_SPAWNABLE;
- applications that want to use e.g. half of the available
processors for executable A and the other ones for B
(then MPI_SPAWN_MAX cannot be used)
i.e. one call to MPI_SPAWNABLE and many calls to MPI_SPAWN;
- applications that want to choose, among a list of available
systems, the one that has the most processors available
at the moment,
i.e. many calls to MPI_SPAWNABLE and one call to MPI_SPAWN;
- because "where" has the same meaning in MPI_SPAWN and
MPI_SPAWNABLE and because the MPI implementation defines
which values are allowed, I think this definition is
simpler than the proposal with MPI_UNIVERSE_SIZE().
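
The two usage patterns in the summary can be sketched as follows.
This is an illustration in Python only. The "spawnable" callable
stands in for the proposed MPI_SPAWNABLE(where, n), and split_in_half
and best_system are hypothetical helper names.

```python
def split_in_half(total):
    """One query, many spawns: share `total` processes between A and B."""
    for_a = total // 2
    return for_a, total - for_a


def best_system(spawnable, candidates):
    """Many queries, one spawn: pick the "where" with the most processes.

    spawnable  -- callable mapping a "where" string to a process count
    candidates -- the "where" strings to compare
    """
    return max(candidates, key=spawnable)
```

In the first pattern the application calls the query once and then
spawns executable A and executable B with the two returned counts. In
the second it queries every candidate and spawns once on the winner.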
| > - In 3.4.1 Registration and Connection (page 15-17)
| > The proposal should separate clearly the functionality of
| > [text deleted]
| I agree this needs to be clarified in the way you suggest.
|
| > My criticism:
| > The word REGISTER in the function name "MPI_REGISTER_NAME" and
| > its argument "name" are normally used in name service APIs.
| >
| > Comparison with DCE RPC string binding:
| > The necessary functionality "inquire the port address" is
| > realized with two calls:
| > [code deleted]
| > MPI-2's "given-name" is the same as the vector of
| > string_bindings of the DCE RPC with the restriction of
| > vector_length == 1.
| >
| > DCE uses a vector of bindings because a computer can have
| > more than one network interface and each binding represents
| > a port of the application on each of these network interfaces.
| >
| > In high performance computing one case of using MPI_CONNECT
| > can be that the user wants to use a high speed network
| > instead of the "default Ethernet".
| > [more text deleted]
| This is an interesting point.
| I think there is a way around it, since the port used by
| accept() is not the same port as will eventually be used
| for communication. It is only used to bootstrap the
| communication. Thus it's ok if accept() uses a slow
| interface. Or are there cases where there is no single
| (slow) interface that can be reached from anywhere, so
| that multiple "ports" are required?
| What do you think?
Yes, the big difference between DCE and MPI is that in DCE
such a binding is used to establish communication links from
a group of processes (or one process) TO ONE process;
in MPI the communication links are established TO A GROUP of
processes.
Therefore I think we overload this interface if we want
the application to be able to choose the communication
network explicitly.
So my proposal is only:
Omit the argument "name" in MPI_REGISTER_NAME and the text,
and change the name of MPI_REGISTER_NAME to MPI_GIVE_NAME.
Rolf