[mpi-21] [mpi-3] Some thoughts on MPI 2.1, 2.2, 3.0
Dear all,
since I will not be present at the meeting next week, here are some thoughts
on the process from MPI 2.1 onwards.
MPI 2.1
=======
Some clarifications on
* semantics of one-sided synchronization, the memory model, public and
private windows (see email discussion 2003 - I hope to find time soon to
read Chapter 6 carefully, and make a concrete suggestion)
* what thread-safety really means (see 2007 paper in Parallel Computing by
Bill and Rajeev)
would in my opinion be in order and helpful to the user.
Progress rule? In my opinion, enough should be said to make clear that (and
how) it is possible to implement MPI without a dedicated progress thread. Many
will disagree.
MPI 2.2
=======
Focus on MPI as a tool for library building, and repair what is missing
in this respect. This may prevent bloating of the standard by things which
can (and should) properly be implemented as separate libraries on top
of MPI (one example: non-blocking collectives). And even if new, ambitious
things are taken into MPI 3.0, this would have the advantage of making
good prototype implementations in terms of existing MPI 2.2 functionality
possible, and quickly.
In this respect, one thing that is missing is pack/unpack functionality
for accessing only parts of MPI-typed buffers efficiently. This prevents the
user from efficiently implementing pipelined algorithms on structured
MPI data. It is known that such functionality can be provided. I would be
happy to make a proposal for one of the next meetings.
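
To illustrate the kind of functionality meant here, the following is a
minimal sketch in plain Python, not MPI code. The name pack_partial, the
Layout type, and the (skip, count) interface are assumptions made for this
illustration only; they are not part of MPI or of any proposal. A derived
datatype is modelled as a list of (byte offset, byte length) blocks, and the
function returns only a window of the packed stream.

```python
from typing import List, Tuple

# Hypothetical stand-in for an MPI derived datatype: a list of
# (byte offset, byte length) blocks within the user buffer.
Layout = List[Tuple[int, int]]


def pack_partial(buf: bytes, layout: Layout, skip: int, count: int) -> bytes:
    """Return up to `count` bytes of the packed stream, starting `skip` bytes in.

    Blocks that lie entirely before `skip` are stepped over without copying.
    """
    out = bytearray()
    pos = 0  # position in the (virtual) packed stream
    for offset, length in layout:
        if len(out) == count:
            break
        block_end = pos + length
        if block_end > skip:
            start = max(skip - pos, 0)  # first wanted byte inside this block
            take = min(length - start, count - len(out))
            out += buf[offset + start : offset + start + take]
        pos = block_end
    return bytes(out)
```

A pipelined algorithm would call this repeatedly with increasing skip values
and send each piece as soon as it is packed. The sketch still walks the block
list from the start on every call; an efficient version inside an MPI library
would presumably jump to the right block directly.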
Another lack in this respect could be the error behavior of MPI. Being
more strict here could be an aid towards fault tolerance, without actually
having to put new functionality into MPI. Of course, this has to be done with
much care so as not to compromise efficiency.
New collectives:
----------------
[contradicting slightly the above]
MPI_Reduce_scatter_block - has been discussed in emails, a useful and minor
extension (has already been implemented in some libraries): easier to
use when it is known that the blocks to be scattered all have the same size,
and overhead can be saved in the implementation inside the MPI library.
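
The intended semantics can be sketched as a small simulation in plain Python
(not MPI code). The function name and the list-of-lists representation of the
per-process inputs are assumptions for illustration; the point is that a
single block size replaces the per-process count array that
MPI_Reduce_scatter requires.

```python
from typing import List


def reduce_scatter_block(contribs: List[List[int]], block: int) -> List[List[int]]:
    """Simulate the result on each of p processes.

    contribs[r] is the input vector of rank r and has p * block elements.
    Rank r receives elements [r*block, (r+1)*block) of the elementwise sum.
    """
    p = len(contribs)
    assert all(len(v) == p * block for v in contribs)
    total = [sum(col) for col in zip(*contribs)]  # elementwise reduction
    return [total[r * block : (r + 1) * block] for r in range(p)]
```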
MPI_Exscan_all - found in BSP libraries. Like MPI_Exscan, but the global
sum over all processes (i=0,...,p-1) is also computed and distributed to all
processes. Should we ask BSP experts/users whether there is reason enough?
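
As a sketch of the result each process would see, here is a plain-Python
simulation (not MPI code). The function name and the return shape are
assumptions for illustration. MPI leaves the MPI_Exscan result on rank 0
undefined; the sketch simply uses 0 there.

```python
from itertools import accumulate
from typing import List, Tuple


def exscan_all(values: List[int]) -> List[Tuple[int, int]]:
    """Simulate the result on each rank: (exclusive prefix sum, global sum).

    values[r] is the contribution of rank r.
    """
    inclusive = list(accumulate(values))
    total = inclusive[-1] if inclusive else 0
    # exclusive prefix = inclusive prefix minus the rank's own contribution
    return [(inc - v, total) for inc, v in zip(inclusive, values)]
```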
New built-in binary reduction operators:
----------------------------------------
[This is not a strong proposal, but I'm mentioning it as something that
could be discussed]
It would be tempting to use the MPI pair-datatypes for segmented
scans/reductions (start of each segment is marked), selective
reductions (only marked elements are reduced), and for the following
problem: find out whether all processes have contributed the same value
(this seems to require - correct me if I'm wrong - two MPI_Allreduce's,
first to find a global min or max, second to determine if all processes
did contribute this value). Operators like
MPI_SEGMENTED_SUM, ...
MPI_SELECTIVE_SUM, ...
MPI_ALL_MAX, MPI_ALL_MIN - computes in the value part of the pair min or
max, respectively, sets flag part true if both value arguments are equal
could be added for these problems (but need not be - they can all be
implemented as user-defined ops; the argument in favor is convenience for the
user, and better performance).
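
To make the "same value on all processes" case concrete, here is a minimal
sketch of such an operator as a user-defined combine function, written in
plain Python rather than against the MPI API. The names all_max and all_same
are hypothetical. One assumption goes beyond the description above: the
incoming flags are also AND-ed together, so that a disagreement found early
in the reduction is not lost later; each process starts with its flag set to
true.

```python
from functools import reduce
from typing import Iterable, Tuple

Pair = Tuple[int, bool]  # (value, flag), mirroring an MPI pair datatype


def all_max(a: Pair, b: Pair) -> Pair:
    """Combine two pairs: keep the max, keep the flag only if all agree."""
    (va, fa), (vb, fb) = a, b
    # Assumption: flags are AND-ed so an earlier mismatch is preserved.
    return (max(va, vb), fa and fb and va == vb)


def all_same(values: Iterable[int]) -> Pair:
    """Simulate a single reduction over one contribution per process."""
    # Each process starts with flag=True for its own value.
    return reduce(all_max, [(v, True) for v in values])
```

With this combine function a single reduction yields both the maximum and the
answer to whether all contributions were equal, in place of the two
MPI_Allreduce calls mentioned above. The function is associative and
commutative, so the order in which the library combines pairs does not
matter.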
MPI 3.0
=======
I agree with Steven that the standard does not always distinguish well
between "advice to users" and "advice to implementers". It would be nice to
rewrite, but this is probably too dangerous. Precision could be
improved at some essential places (communication, progress, ...), but I
don't think a formal specification is the way to go, for the following
reasons: Which formalism is sufficient, complete, and feasible (I don't
think there is any)? Will it be accessible and useful in any way to
the user/reader of the standard (the answer easily becomes no)? Will it
help the implementer (probably yes, if he knows the formalism well, otherwise
not)? How do we ensure that the formal specification is correct (unsolved)?
An issue that has so far not been raised is
* Improved topology functionality
As is, the functionality is not really scalable (graph topologies),
and has shortcomings in many other respects, e.g. it does not really
convey enough information to the MPI implementation to allow an
efficient process reordering. I have discussed some of these
shortcomings in papers, and think the issue deserves airing in the
meetings, at least. Of course, the decision may be to leave the
chapter as is; or more radically: to deprecate most of this
functionality (with the intention of having special libraries for this
kind of stuff). My suggestion would be to take the library building
approach: what is needed from MPI to make a broad range of process
remapping libraries possible on top of MPI? I would be happy to make
a proposal for what could be done for one of the upcoming meetings.