Re: we should wait for 1sided implementations

Eric Salo (salo@mrjones.engr.sgi.com)
Fri, 31 May 1996 14:30:56 -0700

Jim, I've been thinking about your comments on the Meiko DMA model and I'm now
very curious. Is it 100% general? Meaning, can the transfer lengths be an
arbitrary number of bytes, and does this apply to both reads and writes? It
seems like it would be relatively easy, as you say, to implement fully general
reads from the host memory with a few shifts and masks. But writes are sort of
nasty, because unless the DMA engine is only 8 bits wide it will have to load
in the first/last word from the host, mask in the new partial values, and then
write out the complete words.
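
To make the write case concrete, here is a minimal sketch in C of the
read-modify-write sequence described above. It is only a software model under
stated assumptions: the engine is taken to be 64 bits wide, and the names
dma_write, byte_mask and WORD_BYTES are hypothetical, not anything from the
Meiko hardware or from MPI. Masks are built in memory byte order so the sketch
does not depend on endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the engine can only load and store whole 64-bit words. */
enum { WORD_BYTES = 8 };

/* Build a mask covering bytes [lo, hi) of a word, in memory byte order. */
static uint64_t byte_mask(size_t lo, size_t hi)
{
    unsigned char m[WORD_BYTES] = {0};
    uint64_t mask;
    assert(lo < hi && hi <= WORD_BYTES);
    memset(m + lo, 0xFF, hi - lo);
    memcpy(&mask, m, sizeof mask);
    return mask;
}

/* Write len bytes from src into host memory at byte offset `offset`,
 * touching host memory only one full word at a time. */
void dma_write(uint64_t *host, size_t offset, const unsigned char *src, size_t len)
{
    size_t done = 0;
    while (done < len) {
        size_t pos = offset + done;
        size_t w   = pos / WORD_BYTES;      /* index of the word being touched */
        size_t lo  = pos % WORD_BYTES;      /* first byte to change in that word */
        size_t n   = WORD_BYTES - lo;       /* bytes to change in that word */
        unsigned char staged[WORD_BYTES] = {0};
        uint64_t incoming;

        if (n > len - done)
            n = len - done;
        memcpy(staged + lo, src + done, n); /* place new bytes at their lanes */
        memcpy(&incoming, staged, sizeof incoming);

        if (n == WORD_BYTES) {
            host[w] = incoming;             /* interior word: plain store */
        } else {
            /* first/last word: load, mask in the new bytes, store back */
            uint64_t mask = byte_mask(lo, lo + n);
            host[w] = (host[w] & ~mask) | (incoming & mask);
        }
        done += n;
    }
}
```

Only the first and last words can take the slow path, so the extra load is a
fixed cost per transfer rather than a cost per word.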

> You should look at the numbers and discussion for the IBM
> implementation. They have taken the trouble to collect data "on the
> fly", so that the time spent collecting it into a contiguous buffer
> can be masked by the time spent sending the previous block. This also
> means that they use only communication chunk sized buffers, no matter
> how large the non-contiguous message is. (Something the user could
> never achieve if she had to flatten the message herself).

References, *please*!! This is very interesting! Is it really that much faster?
If the individual chunks are small, it seems likely that the overhead of
breaking up the transfer might slow things down, but that's just off the top of
my head. Does anyone have a good set of measurements for lots of different
layouts?
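
For reference, the buffering side of the quoted scheme can be sketched as
follows. This is not the IBM implementation, which I have not seen; it is a
hypothetical C model in which send_strided_chunked gathers a strided layout
through a single staging buffer of chunk_size bytes and hands each full chunk
to a callback. The names send_fn, struct sink and collect are made up for the
example, and the overlap of packing with sending is not shown: that would need
two staging buffers and a non-blocking send.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for "put this chunk on the wire". */
typedef void (*send_fn)(const unsigned char *chunk, size_t len, void *ctx);

/* Gather `count` blocks of `blocklen` bytes, `stride` bytes apart, and send
 * them in pieces of at most `chunk_size` bytes (chunk_size must be > 0).
 * The staging buffer never needs to be larger than one chunk. */
void send_strided_chunked(const unsigned char *base, size_t count,
                          size_t blocklen, size_t stride,
                          unsigned char *staging, size_t chunk_size,
                          send_fn send, void *ctx)
{
    size_t fill = 0;                        /* bytes currently staged */
    for (size_t b = 0; b < count; b++) {
        const unsigned char *src = base + b * stride;
        size_t left = blocklen;
        while (left > 0) {
            size_t n = chunk_size - fill;   /* room left in the chunk */
            if (n > left)
                n = left;
            memcpy(staging + fill, src, n);
            fill += n;
            src  += n;
            left -= n;
            if (fill == chunk_size) {       /* chunk full: ship it */
                send(staging, fill, ctx);
                fill = 0;
            }
        }
    }
    if (fill > 0)                           /* ship the final partial chunk */
        send(staging, fill, ctx);
}

/* Test sink that records what was "sent" and how many sends it took. */
struct sink {
    unsigned char data[64];
    size_t len;
    size_t calls;
};

static void collect(const unsigned char *chunk, size_t len, void *ctx)
{
    struct sink *s = ctx;
    if (s->len + len > sizeof s->data)
        return;                             /* drop rather than overflow */
    memcpy(s->data + s->len, chunk, len);
    s->len += len;
    s->calls++;
}
```

With small blocks the inner loop does one short memcpy per block, which is
where the per-chunk overhead I'm worried about would show up.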

> The problem is that we're trying to write a standard which encourages
> portability. This means that where there are restrictions in the
> standard the user code should conform to them *even if it need not* on
> the particular machine it is currently running on. However on such
> machines it may be impossible for the user to conform to the
> restrictions.

This would be bad, of course, and in general I agree. But in the specific case
of data alignment I still don't see the problem. Let's say just for the sake of
argument that Hell freezes over and we actually do vote to require all data
chunks to be 64-bit aligned. This is trivial for both users and implementations
to check, even in Fortran, since we've yet to locate any vendors with Fortran
compilers that don't have some sort of pointer extensions.
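
As an illustration of how small the check is, here is a sketch in C. The
function names are hypothetical, and the length check is my own assumption
about what such a rule might also cover; the conversion from pointer to
uintptr_t is implementation-defined, though it yields the address on the
machines we're discussing.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if addr falls on a 64-bit (8-byte) boundary. */
static int is_aligned_64(const void *addr)
{
    return ((uintptr_t)addr % 8) == 0;
}

/* Assumption: a rule like this might also require whole 64-bit words. */
static int is_whole_words_64(size_t nbytes)
{
    return (nbytes % 8) == 0;
}
```

An implementation could run the same test on entry and return an error
instead of silently taking a slow path.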

> That's where we disagree, and it's exactly the point I'm making. I
> don't believe that we can or should promise anything about the
> performance of remote store access.

We can't make any ironclad guarantees, but we can and should make it possible
to create simple implementations which will be very fast. Remote stores are
fundamentally more optimizable than vanilla send/recv and IMHO MPI should do
everything that it can to allow implementations to take advantage of that.

> Just consider two SGI machines. I'd write a matrix multiply very
                    ~~~~~~~~~~~~
> differently for a T90 than I would for a node of a T3D.

:-)

-- 
Eric Salo         Silicon Graphics Inc.             "Do you know what the
(415)933-2998     2011 N. Shoreline Blvd, 7L-802     last Xon said, just
salo@sgi.com      Mountain View, CA   94043-1389     before he died?"