An arbitrary number of bytes from an arbitrary address, for both read
and write, i.e. yes to both questions.
> It seems like it would be relatively easy, as you say, to implement
> fully general reads from the host memory with a few shifts and
> masks. But writes are sort of nasty, because unless the DMA engine
> is only 8-bits wide it will have to load in the first/last word from
> the host, mask in the new partial values, and then write out the
> complete words.
The precise implementation will depend on your memory bus protocol. If
it supports byte writes then it's obviously easy. If not, then the
engine has to do the read-modify-write of the partial first and last
words that you describe.
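
To make the no-byte-writes case concrete, here is a minimal sketch in
C of the load/mask/write-back sequence from the quoted text. It is
not the CS-2 implementation: the 8-byte word size and the
bus_read_word/bus_write_word helpers are assumptions which simply
model a memory that only accepts whole, aligned words.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD_BYTES 8u /* assumed bus word size, for illustration only */

/* Hypothetical bus model: memory is only accessible in whole aligned words. */
static void bus_read_word(const uint8_t *mem, size_t word, uint8_t out[WORD_BYTES])
{
    memcpy(out, mem + word * WORD_BYTES, WORD_BYTES);
}

static void bus_write_word(uint8_t *mem, size_t word, const uint8_t in[WORD_BYTES])
{
    memcpy(mem + word * WORD_BYTES, in, WORD_BYTES);
}

/* Write len bytes from src at byte offset off, using whole-word accesses only. */
void dma_write_bytes(uint8_t *mem, size_t off, const uint8_t *src, size_t len)
{
    while (len > 0) {
        size_t word  = off / WORD_BYTES;   /* word containing the next byte */
        size_t start = off % WORD_BYTES;   /* first byte to change in that word */
        size_t n     = WORD_BYTES - start; /* bytes available up to the word end */
        uint8_t tmp[WORD_BYTES];

        if (n > len)
            n = len;
        if (n < WORD_BYTES)
            bus_read_word(mem, word, tmp); /* partial word: fetch the old contents */
        memcpy(tmp + start, src, n);       /* merge in the new bytes */
        bus_write_word(mem, word, tmp);    /* write the complete word back */

        off += n;
        src += n;
        len -= n;
    }
}
```

Only the first and last words of a transfer can be partial, so at
most two extra reads are needed however long the message is.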
In the CS-2 implementation the DMA engine is on the MBus. The DMA
processor appears as a cache-coherent MBus master, and implements all
of the necessary protocols. In other words, it looks to the other
processors on the bus as if it too were a cache-coherent CPU.
> References, *please*!! This is very interesting! Is it really that much faster?
> Does anyone have a good set of measurements for lots of different
> layouts?
You should talk to Hubertus Franke at IBM (frankeh@watson.ibm.com).
They had a paper at the MPI implementors workshop (possibly also
published elsewhere) which gave some measurements showing the
performance difference for scattered vs contiguous data.
> If the individual chunks are small, it seems likely that the overhead of
> breaking up the transfer might slow things down, but that's just off the top of
> my head.
Ah, but you don't break up the transfers over the network. The idea is
that you choose a network chunk size long enough to get good
bandwidth (maybe 4K or 8K?), then pipeline the packing with the
transfer: while the previous chunk is being sent you pack the next
one. This overlaps the packing cost with the transfer, and minimises
the data buffering, since you only ever need two chunks' worth of
buffer, instead of a buffer the size of the user's message.
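
A rough sketch in C of that pack/send pipeline, for regularly strided
data. Everything named here is an assumption made for illustration:
the 4K CHUNK, the net_t type, and start_send/wait_send, which stand in
for an asynchronous network interface. The stub "sends" by copying
into a sink so that the example runs on its own; with a real engine
start_send would return immediately and the packing of the next chunk
would overlap the transfer.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4096 /* assumed network chunk size in bytes */

/* Hypothetical network stand-in: appends each chunk to a sink buffer. */
typedef struct { unsigned char *sink; size_t sent; } net_t;

static void start_send(net_t *net, const unsigned char *buf, size_t n)
{
    memcpy(net->sink + net->sent, buf, n);
    net->sent += n;
}

static void wait_send(net_t *net) { (void)net; /* stub completes at once */ }

/* Pack up to CHUNK bytes of the logical message, starting at byte pos,
   from blocks of blocklen bytes placed stride bytes apart in src. */
static size_t pack_chunk(const unsigned char *src, size_t blocklen, size_t stride,
                         size_t total, size_t pos, unsigned char *buf)
{
    size_t n = 0;
    while (n < CHUNK && pos < total) {
        size_t block  = pos / blocklen;
        size_t within = pos % blocklen;
        size_t take   = blocklen - within;
        if (take > CHUNK - n)
            take = CHUNK - n; /* a chunk may end in the middle of a block */
        memcpy(buf + n, src + block * stride + within, take);
        n += take;
        pos += take;
    }
    return n;
}

/* Send count strided blocks using only two chunk-sized buffers. */
void send_strided(net_t *net, const unsigned char *src,
                  size_t blocklen, size_t stride, size_t count)
{
    unsigned char bufs[2][CHUNK];
    size_t total = count * blocklen, pos = 0;
    int cur = 0;
    size_t n = pack_chunk(src, blocklen, stride, total, pos, bufs[cur]);

    while (n > 0) {
        size_t next;
        start_send(net, bufs[cur], n); /* chunk k goes out */
        pos += n;
        cur ^= 1;                      /* switch to the other buffer */
        next = pack_chunk(src, blocklen, stride, total, pos, bufs[cur]);
        wait_send(net);                /* chunk k done, its buffer is free */
        n = next;
    }
}
```

The buffer being packed is always the one whose send was waited for
on the previous iteration, which is why two buffers are enough.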
> > The problem is that we're trying to write a standard which encourages
> > portability. This means that where there are restrictions in the
> > standard the user code should conform to them *even if it need not* on
> > the particular machine it is currently running on. However on such
> > machines it may be impossible for the user to conform to the
> > restrictions.
>
> This would be bad, of course, and in general I agree. But in the
> specific case of data alignment I still don't see the problem. Let's
> say just for the sake of argument that Hell freezes over and we
> actually do vote to require all data chunks to be 64-bit
> aligned. This is trivial for both users and implementations to
> check. Even in Fortran, since we've yet to locate any vendors with
> Fortran compilers that don't have some sort of pointer extensions.
Try Linux with g77. As far as I can tell that doesn't support pointers.
> > Just consider two SGI machines. I'd write a matrix multiply very
> ~~~~~~~~~~~~
> > differently for a T90 than I would for a node of a T3D.
>
> :-)
I thought you'd like that!
-- Jim
James Cownie
BBN UK Ltd
Phone : +44 117 9071438
E-Mail: jcownie@bbn.com