Thanks !
> > The same argument could have been applied to the non-contiguous data
> > types, however as we expected the implementations have been able to
> > exploit the additional information that they provide and achieve much
> > higher performance than had they been omitted.
>
> Some examples/numbers would be very interesting, my impression was that most
> implementations did a poor/nonexistent job of optimizing the non-contiguous
> cases.
You should look at the numbers and discussion for the IBM
implementation. They have taken the trouble to collect data "on the
fly", so that the time spent collecting it into a contiguous buffer
can be masked by the time spent sending the previous block. This also
means that they use only communication chunk sized buffers, no matter
how large the non-contiguous message is. (Something the user could
never achieve if she had to flatten the message herself).
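To make the buffering point concrete, here is a minimal sketch in C of
packing a strided layout through a single fixed-size chunk buffer. It is
not IBM's code and it does not use MPI: the names strided_layout,
chunk_sink, collector, collect_chunk and send_strided_chunked are all
hypothetical, and the "send" is just a callback. It is also sequential;
to actually hide the packing time behind the previous send, an
implementation would presumably need a second buffer and a non-blocking
send.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a strided (non-contiguous) layout:
 * `count` blocks of `blocklen` bytes, block starts `stride` bytes apart. */
typedef struct {
    size_t count;
    size_t blocklen;
    size_t stride;
} strided_layout;

/* Hypothetical callback standing in for "send one chunk". */
typedef void (*chunk_sink)(const char *chunk, size_t len, void *ctx);

/* Hypothetical sink that appends every chunk to an output buffer,
 * playing the role of the receiving side. */
typedef struct {
    char  *out;
    size_t len;
} collector;

void collect_chunk(const char *chunk, size_t len, void *ctx)
{
    collector *c = (collector *)ctx;
    memcpy(c->out + c->len, chunk, len);
    c->len += len;
}

/* Gather the strided data into `chunk` (capacity `chunk_size`) and hand
 * each full chunk, plus any final partial chunk, to `sink`.
 * Only `chunk_size` bytes of staging space are used, however large the
 * whole message is. Returns the number of chunks handed to `sink`. */
size_t send_strided_chunked(const char *base, strided_layout lay,
                            char *chunk, size_t chunk_size,
                            chunk_sink sink, void *ctx)
{
    size_t fill = 0;    /* bytes currently staged in `chunk` */
    size_t chunks = 0;  /* chunks emitted so far */

    assert(chunk_size > 0 && lay.blocklen <= lay.stride);

    for (size_t b = 0; b < lay.count; b++) {
        const char *src = base + b * lay.stride;
        size_t left = lay.blocklen;

        while (left > 0) {
            size_t n = chunk_size - fill;   /* room left in the chunk */
            if (n > left)
                n = left;
            memcpy(chunk + fill, src, n);
            fill += n;
            src  += n;
            left -= n;

            if (fill == chunk_size) {       /* chunk full: "send" it */
                sink(chunk, fill, ctx);
                chunks++;
                fill = 0;
            }
        }
    }
    if (fill > 0) {                         /* final partial chunk */
        sink(chunk, fill, ctx);
        chunks++;
    }
    return chunks;
}
```

A user who flattens the message herself needs a contiguous copy of the
whole message before the first byte can be sent; the loop above never
holds more than one chunk.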
You're probably right about "most" at the moment; however, the point is
that the specification permits and encourages these optimisations,
whereas the restricted specification would not have allowed them.
> > 1) the restrictions are hard for users to comply with. (How can I
> > specify the alignment of a variable in standard Fortran ?)
>
> Users of the Cray SHMEM library have to deal with this today. Having
> used that library extensively myself, it's not a Big Deal. (This
> ties into my other argument, that MPI_RMA_MALLOC should be
> mandatory, which will require some sort of FORTRAN pointers anyway.)
I've no doubt that restrictions such as these are "not a Big Deal" on
machines that require them, because on such machines language
extensions will have been implemented to allow you to conform to the
restrictions.
The problem is that we're trying to write a standard which encourages
portability. This means that where there are restrictions in the
standard the user code should conform to them *even if it need not* on
the particular machine it is currently running on. However on such
machines it may be impossible for the user to conform to the
restrictions.
In effect this forces the user to write code which does not conform to
the standard, and guarantees that they'll get bitten when they move to
a machine which enforces the restrictions.
As Shane said
> I like to think of MPI as a portable API. I would like to think
> that I could use a network of Linux boxes to write, test, and verify
> my code (because it is cheap) and when I get the bugs ironed out,
> recompile the same code on an SGI Challenge 10K, a Cray T3D, or
> whatever to run on my real datasets. Of course, if I want optimum
> performance I may have to tweak sections of the code but the code I
> wrote on the Linux boxes *should* *run* and *work* on the fast
> machine with no modification. To allow byte aligned transfers on
> the Linux box and receive a core dump on the fast machine because of
> a misaligned transfer is not what I call portable.
Unfortunately, if we introduce restrictions which the user has no way of
conforming to, we're guaranteeing core dumps or, worse, indeterminate
behaviour.
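For illustration only, this is the sort of check a C program could run
on a permissive machine to notice a misaligned transfer before it
becomes a core dump elsewhere. The function name is_aligned is
hypothetical, and the pointer-to-integer conversion is
implementation-defined in C (it behaves as expected on common flat
address-space machines). Note that it only detects misalignment; it
gives the user no way to request alignment, and standard Fortran has no
equivalent at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `addr` is a multiple of `align`, 0 otherwise.
 * `align` must be a non-zero power of two.
 * Assumes converting a pointer to uintptr_t yields its byte address,
 * which the C standard leaves implementation-defined. */
int is_aligned(const void *addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)addr & (align - 1)) == 0;
}
```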
> > 3) remote store access isn't there only for performance, it's also
> > there because it's a useful programming model which is
> > fundamentally different from message passing in its semantics.
> > It isn't in general trivial to change a remote store access code
> > into a message passing code.
>
> I agree, but I don't quite see your point. I'm certainly not arguing that
> remote store access isn't useful. I'm quite a fan of it, actually, which is why
> I want to see a good standard.
>
> But performance is implicitly promised in any get/put model.
That's where we disagree, and it's exactly the point I'm making. I
don't believe that we can or should promise anything about the
performance of remote store access.
Remote store access is useful because of the semantics that it
provides. This is entirely a separate issue from its performance. (Up
to a point, of course: if it's implemented using the US mail, it may
well be too slow...)
There are applications which are *much* easier to code using remote
store access, and which will still run faster than a message passing
code even if the remote store access operations are *slower* by some
factor (2, 4) than a message passing operation.
MPI doesn't (and can't) say anything about performance. Any
presumption that remote store access will be faster than message
passing (or the converse) will likely be wrong on some machine on which
MPI is implemented.
This is no different from any other standard. Fortran doesn't tell you
how to write fast programs, just that a program which conforms to the
standard will produce the same results on any machine. (Give or take
numeric accuracy issues, which are unspecified by the Fortran
standard.)
Just consider two Cray machines. I'd write a matrix multiply very
differently for a T90 than I would for a node of a T3D.
-- Jim
James Cownie
BBN UK Ltd
Phone : +44 117 9071438
E-Mail: jcownie@bbn.com