Eric, the goal behind pushing users to favour one way of coding MPI over
another, "portable max performance", is currently not attainable: code
will have to be tweaked as it moves from one architecture to another
(portable good-enough performance, on the other hand, is achievable).
On the T3D, Cray will probably push for 1-sided pt2pt, while
MPI-1 pt2pt would be preferred on clusters. Currently for SGI, off-host
BW is higher than on-host, but not for HP (curious: do you ask users to
de-localize their application to gain performance?). With DMA engines
(e.g. SP2) I*send() is better whereas blocking pt2pt is faster on
systems that simulate concurrency using signals or daemons.

All this to say that, while pushing users toward Mem_alloc() helps you
(and HP incidentally :-) ), I'm not at ease adding it to MPI-1.2,
especially since we will keep tweaking and #ifdefing code to milk
performance out of it.
--Raja
-=-
Raja Daoud              Hewlett-Packard Co.
raja@rsn.hp.com         Convex Division