> 1) shared file pointers as currently specified (i.e. every node
> must have the same filetype).
>
> ... There was concern that applications which
> exhibited a "blocked-sequential" access style (the sequential accesses
> occur in groups of operations which form the file as a collection of
> indistinguishable sequential portions) may find item 1) overly restrictive.
Yes, they may. Tough. I tried designing a model that used different
filetypes for each node combined with a shared pointer model and got
totally knotted. My proposal simplified beyond all recognition when I
dropped that idea. I can't see how to define it.
However, related to the "destructive write" problem (see later), I think
that it is necessary to forbid holes in the filetype for sequential
output. It would be possible to fill them with junk (i.e. leave their
contents undefined), but I don't think that is a good idea.
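
As a rough illustration of the "no holes" rule, here is a minimal sketch of the check an implementation might make before allowing sequential output. It assumes the filetype has already been flattened into a list of (offset, length) byte ranges plus an extent; that representation and the name `has_holes` are hypothetical and are not part of any MPI-IO interface.

```python
def has_holes(blocks, extent):
    """Return True if the byte ranges do not cover [0, extent) completely.

    blocks: list of (offset, length) pairs (a hypothetical flattened filetype).
    extent: total span of the filetype in bytes.
    """
    covered_to = 0  # everything in [0, covered_to) is known to be covered
    for offset, length in sorted(blocks):
        if offset > covered_to:
            return True  # gap before this block
        covered_to = max(covered_to, offset + length)
    return covered_to < extent  # trailing gap, if any


print(has_holes([(0, 4), (4, 4)], 8))   # contiguous filetype
print(has_holes([(0, 4), (8, 4)], 12))  # gap between bytes 4 and 8
```

A sequential-output open could reject any filetype for which this returns True, rather than filling the gaps with undefined data.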
> 2) adding a new info (say "MPI_IO_MATCHED") to the open info list.
>
> For example, on collective writes you can avoid synchronization
> by doing memcopies of each node's buffers into a single large
> contiguous buffer as the nodes check in -- but the contents
> for any given node can only be flushed from the single large
> contiguous buffer to a pipe or tape once all the nodes below
> it have checked in. A similar strategy applies to collective reads.
I agree that this is the most important optimisation, but it is actually
semi-orthogonal to sequential I/O. The problem with having it on its
own is that it swaps a transfer bottleneck for a store one. A typical
implementation will need TWO of the large buffers on the I/O node (one
being filled and the other emptied), and each large buffer is the
transfer size times the number of nodes in the communicator!
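
The ordering constraint in the quoted text (a node's data can only be flushed once all lower nodes have checked in) can be sketched as a small simulation. This is only an illustration under simple assumptions: the function name `ordered_flush` is hypothetical, each node contributes one item, and "flushing" just appends to a list.

```python
def ordered_flush(arrivals, nnodes):
    """Simulate in-order flushing of per-node buffers.

    arrivals: iterable of (rank, data) in the order the nodes check in.
    Returns (output, peak), where output is the data in rank order and
    peak is the largest number of node buffers held at any one time.
    """
    staged = {}     # rank -> data copied in but not yet flushed
    next_rank = 0   # lowest rank that has not been flushed yet
    output = []     # what reaches the pipe or tape, in rank order
    peak = 0
    for rank, data in arrivals:
        staged[rank] = data
        peak = max(peak, len(staged))
        # Flush the contiguous run of ranks starting at next_rank.
        while next_rank in staged:
            output.append(staged.pop(next_rank))
            next_rank += 1
    if next_rank != nnodes:
        raise ValueError("some node never checked in")
    return output, peak


out, peak = ordered_flush([(2, "c"), (0, "a"), (3, "d"), (1, "b")], 4)
print(out, peak)
```

If rank 0 happens to check in last, every other node's data is held until then, which is why the buffer has to be sized for the whole communicator.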
This wouldn't be too serious, if it weren't for the fact that collective
transfers are now blocking. Non-blocking transfers start to be
reasonably efficient at (say) 64 KB, but blocking ones are bad news
below (say) 1 MB. The reason is that the overheads of the former are
simply those of scheduling the I/O, but those of the latter include waking
up the target, starting the transfer and waiting for it to complete.
As a relatively minor point, the implementation could get reasonable
efficiency with ONE large buffer if the transfers were unordered (i.e.
handled as they came in, and not in node order). So, for good
efficiency on a 64-node transfer, I estimate the store requirements, per
filehandle, on the I/O node/server/whatever as:
    Blocking and ordered          128 MB
    Blocking and unordered         64 MB
    Non-blocking and ordered        8 MB
    Non-blocking and unordered      4 MB
These are wild guesses, and people may disagree with them strongly. But
it is definitely the case that blocking transfers need a lot more store
than non-blocking ones to deliver comparable performance.
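
For anyone who wants to redo the arithmetic with their own numbers, the figures above are consistent with the following sketch. It assumes the efficient transfer sizes mentioned earlier (1 MB blocking, 64 KB non-blocking), two large buffers for ordered transfers and one for unordered; the function name `store_required` is made up for illustration.

```python
KB = 1024
MB = 1024 * KB


def store_required(nodes, transfer_size, ordered):
    """Bytes of store per filehandle on the I/O node.

    Each large buffer holds one transfer per node; ordered transfers are
    assumed to need two such buffers (one filling, one emptying) and
    unordered transfers one.
    """
    buffers = 2 if ordered else 1
    return buffers * nodes * transfer_size


for label, size in (("Blocking", 1 * MB), ("Non-blocking", 64 * KB)):
    for ordered in (True, False):
        kind = "ordered" if ordered else "unordered"
        total = store_required(64, size, ordered)
        print(f"{label} and {kind}: {total // MB} MB")
```

Changing the node count or the assumed transfer sizes scales the totals linearly, so the ratio between the blocking and non-blocking cases stays at 16 under these assumptions.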
There are a couple of other aspects to sequential I/O; the first is not
critical (but is important):
Most seeks and changes of direction must be forbidden. This is easy to
do without AMODE changes, but it isn't nice at all. It means that an
application may write data to a socket, try a seek, and then bomb out.
Well, that is no worse than C and not THAT much worse than Fortran
(where REWIND may fail). But it does contradict 10.6.1.
The last point is extremely nasty, because it is a serious semantic
change:
Writes are destructive. On most sequential devices, writing to an
existing file causes all data beyond that point to become inaccessible.
In particular, this is true for almost all tape-like devices. I don't
see how this can be added to the current model without AMODE changes
because of 10.6.1.
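
To make the destructive-write semantics concrete, here is a toy model of a tape-like device. It is only a sketch: the class `TapeLikeFile` and its methods are hypothetical, and real devices differ in detail, but it shows how a write discards everything beyond the write position.

```python
class TapeLikeFile:
    """Toy model of a sequential device on which writes are destructive."""

    def __init__(self, data=b""):
        self.data = bytearray(data)
        self.pos = 0

    def rewind(self):
        self.pos = 0

    def read(self, n):
        chunk = bytes(self.data[self.pos:self.pos + n])
        self.pos += len(chunk)
        return chunk

    def write(self, chunk):
        # Everything at or beyond the current position is lost; the newly
        # written data becomes the end of the file.
        del self.data[self.pos:]
        self.data += chunk
        self.pos = len(self.data)


tape = TapeLikeFile(b"ABCDEFGH")
tape.read(2)        # position is now 2
tape.write(b"xy")   # "CDEFGH" becomes inaccessible
tape.rewind()
print(tape.read(16))
```

On a random-access file the same sequence of calls would leave the bytes after the written region intact, which is the semantic difference at issue.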
I still believe that the simplest and cleanest solution is to specify
sequential versus random access in the AMODE. I will strip my proposal
down to its bare essentials and send it in another message.
Nick Maclaren,
University of Cambridge Computer Laboratory,
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England.
Email: nmm1@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679