I don't think the question is whether a machine can or should be
designed to allow byte-aligned DMA. The fact of the matter is that
there *are* machines in existence that do not support it (the Cray T3D
quickly comes to mind) and that people like to use.
-] You are arguing for similar restrictions (contiguous, aligned data),
-] 1) the restrictions are hard for users to comply with. (How can I
-] specify the alignment of a variable in standard Fortran?)
-] 2) many of the extended features only cost if you use them, so
-] it is indeed more work for you as an implementer, but it isn't a
-] performance issue for your users. (You are entirely free to publish
-] guidelines explaining how to achieve the fast path in your
-] implementation, though, of course, you may not want to do this since
-] by implication it also points out the slow path...)
-] 3) remote store access isn't there only for performance, it's also
-] there because it's a useful programming model which is
-] fundamentally different from message passing in its semantics.
-] It isn't in general trivial to change a remote store access code
-] into a message passing code.
I think that even on a machine that doesn't support byte-aligned DMA,
byte-aligned puts/gets can be done. Actually, I *know* it can be done (BTDT).
Of course, it may be very expensive and require participation on both sides of
the communication (so it is no longer one-sided), but it *can* be done. I think this can be
an '*' in some documentation saying that
"puts/gets that aren't 16-bit (32-bit, 64-bit, or whatever) aligned are horribly slow
and you're better off just using MPI_Isend or something for them.
However, puts/gets that are aligned on the above boundary are nice and fast."
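
To make the fast/slow split concrete, here is a rough sketch in C of how an
implementation might pick a path. Everything in it is made up for
illustration: put_path, choose_put_path(), and checked_put() are not MPI
names, the "remote" side is just a local buffer, and memcpy() stands in for
both the DMA path and the expensive two-sided fallback. The only point is
that a misaligned request gets a warning and still completes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical transfer paths; these names are not part of MPI. */
typedef enum { PATH_FAST, PATH_SLOW } put_path;

/* Nonzero when value is a multiple of boundary (a power of two). */
static int is_multiple(uintptr_t value, size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (value & (uintptr_t)(boundary - 1)) == 0;
}

/* The fast path needs source address, destination address, and length
   all to be multiples of the hardware boundary. */
static put_path choose_put_path(const void *dst, const void *src,
                                size_t nbytes, size_t boundary)
{
    if (is_multiple((uintptr_t)dst, boundary) &&
        is_multiple((uintptr_t)src, boundary) &&
        is_multiple((uintptr_t)nbytes, boundary))
        return PATH_FAST;
    return PATH_SLOW;
}

/* Stand-in for a put. Both paths copy with memcpy() here; a real
   implementation would use DMA on the fast path and some slower
   mechanism on the other. The slow path complains but still works. */
static put_path checked_put(void *dst, const void *src,
                            size_t nbytes, size_t boundary)
{
    put_path path = choose_put_path(dst, src, nbytes, boundary);
    if (path == PATH_SLOW)
        fprintf(stderr,
                "warning: misaligned put of %zu bytes, taking the slow path\n",
                nbytes);
    memcpy(dst, src, nbytes);
    return path;
}
```

With a 4-byte boundary, a put between two word-aligned buffers of 8 bytes
would take the fast path, while the same put to an address one byte further
along would take the slow path, print the warning, and still deliver the data.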
I like to think of MPI as a portable API. I would like to think that I could
use a network of Linux boxes to write, test, and verify my code (because
it is cheap) and when I get the bugs ironed out, recompile the same code
on an SGI Challenge 10K, a Cray T3D, or whatever to run on my real
datasets. Of course, if I want optimum performance I may have to tweak
sections of the code, but the code I wrote on the Linux boxes *should* *run* and
*work* on the fast machine with no modification. Allowing byte-aligned
transfers on the Linux box and then getting a core dump on the fast machine
because of a misaligned transfer is not what I call portable. Code that ran
extremely slowly but *did* run would be OK in my book, even if it printf()-ed
nasty messages to the console saying that my misaligned puts/gets were making
the code slow, that the implementer was being emailed to call me and tell me
how to write code, and that my name was even now being plastered across all
the Usenet newsgroups telling of my silliness, as long as it still gave me
Having two libraries, one that is portable and one that is super-optimized,
may be an option so that everyone is satisfied. (yeah, yeah... I know
- like any of us has time to write two libraries...)