> You should look at the numbers and discussion for the IBM
> implementation. They have taken the trouble to collect data "on the
> fly", so that the time spent collecting it into a contiguous buffer
> can be masked by the time spent sending the previous block. This also
> means that they use only communication-chunk-sized buffers, no matter
> how large the non-contiguous message is. (Something the user could
> never achieve if she had to flatten the message herself).
References, *please*!! This is very interesting! Is it really that much faster?
If the individual chunks are small, it seems likely that the overhead of
breaking up the transfer might slow things down, but that's just off the top of
my head. Does anyone have a good set of measurements for lots of different
chunk sizes?
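
Just so we're comparing the same thing, here is roughly how I picture the
"pack on the fly" scheme, as a plain C sketch of my own (the chunk size, the
strided layout, and the function names are all made-up illustration, not
IBM's code). Two chunk-sized buffers alternate, so packing of the next chunk
overlaps the send of the previous one:

    #include <mpi.h>
    #include <string.h>

    #define CHUNK 4096          /* bytes per communication chunk (a guess) */

    /* Copy 'len' bytes starting at logical offset 'off' of a strided
       region (blocks of 'blocklen' bytes, 'stride' bytes apart) into 'dst'. */
    static void pack_range(char *dst, const char *base, size_t off,
                           size_t len, size_t blocklen, size_t stride)
    {
        while (len > 0) {
            size_t b    = off / blocklen;      /* which block             */
            size_t into = off % blocklen;      /* offset inside the block */
            size_t n    = blocklen - into;
            if (n > len) n = len;
            memcpy(dst, base + b * stride + into, n);
            dst += n; off += n; len -= n;
        }
    }

    /* Send the strided region one chunk at a time, packing the next chunk
       while the previous one is still on the wire.  Only two chunk-sized
       buffers are ever needed, however large the whole message is. */
    void send_strided(const char *base, size_t nblocks, size_t blocklen,
                      size_t stride, int dest, int tag, MPI_Comm comm)
    {
        char buf[2][CHUNK];
        MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };
        size_t total = nblocks * blocklen;
        int cur = 0;

        for (size_t off = 0; off < total; off += CHUNK) {
            size_t n = total - off;
            if (n > CHUNK) n = CHUNK;
            MPI_Wait(&req[cur], MPI_STATUS_IGNORE);  /* buffer free again? */
            pack_range(buf[cur], base, off, n, blocklen, stride);
            MPI_Isend(buf[cur], (int)n, MPI_BYTE, dest, tag, comm, &req[cur]);
            cur = 1 - cur;   /* pack the next chunk while this one is in flight */
        }
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

The open question is exactly the one above: whether the memcpy traffic and
the extra per-chunk send startups cost more than the overlap buys back when
the chunks are small.
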
> The problem is that we're trying to write a standard which encourages
> portability. This means that where there are restrictions in the
> standard the user code should conform to them *even if it need not* on
> the particular machine it is currently running on. However, on such
> machines it may be impossible for the user to conform to the
This would be bad, of course, and in general I agree. But in the specific case
of data alignment I still don't see the problem. Let's say just for the sake of
argument that Hell freezes over and we actually do vote to require all data
chunks to be 64-bit aligned. This is trivial for both users and implementations
to check. Even in Fortran, since we've yet to locate any vendors with Fortran
compilers that don't have some sort of pointer extensions.
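
For the C case the test I have in mind is literally a couple of lines (the
function name is mine, and uintptr_t is just a convenient way to treat the
address as an integer):

    #include <stdint.h>

    /* Nonzero iff 'p' is 64-bit (8-byte) aligned, i.e. the low three
       address bits are all zero. */
    static int is_64bit_aligned(const void *p)
    {
        return ((uintptr_t)p & 7u) == 0;
    }

A Fortran user would get the address through whatever pointer extension the
vendor provides (LOC, Cray pointers, etc.) and do the same test.
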
> That's where we disagree, and it's exactly the point I'm making. I
> don't believe that we can or should promise anything about the
> performance of remote store access.
We can't make any ironclad guarantees, but we can and should make it possible
to create simple implementations which will be very fast. Remote stores are
fundamentally more optimizable than vanilla send/recv and IMHO MPI should do
everything that it can to allow implementations to take advantage of that.
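
To make the contrast concrete, here is the kind of thing I mean, written
against a window/put style of interface. The window creation, the fence
synchronization, and the call names below are just one possible shape for
the one-sided operations, not something anyone has agreed on; the point is
only that on the put path the target never has to say anything about this
particular transfer, which is what leaves room for mapping it onto a remote
store:

    #include <mpi.h>

    #define N 1024

    /* Run with at least two ranks.  Rank 0 moves N doubles into rank 1's
       'dst' array twice: once with a send/recv pair, once with a put. */
    int main(int argc, char **argv)
    {
        int rank;
        double src[N], dst[N];
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < N; i++) { src[i] = rank; dst[i] = -1.0; }

        /* Two-sided: both processes take part, the receiver must post a
           matching receive, and the library has to match envelopes (and
           possibly buffer the data) before anything lands in 'dst'. */
        if (rank == 0)
            MPI_Send(src, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(dst, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        /* One-sided: only the origin names the data and where it goes.
           On hardware with remote stores the put can become little more
           than a memory write into the exposed window. */
        MPI_Win_create(dst, N * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Put(src, N, MPI_DOUBLE, 1 /* target rank */, 0 /* disp */,
                    N, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);
        MPI_Win_free(&win);

        MPI_Finalize();
        return 0;
    }
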
> Just consider two SGI machines. I'd write a matrix multiply very
> differently for a T90 than I would for a node of a T3D.
--
Eric Salo
Silicon Graphics Inc.
2011 N. Shoreline Blvd, 7L-802
Mountain View, CA 94043-1389
(415)933-2998
email@example.com

"Do you know what the last Xon said, just before he died?"