John
** In contrast to the existing proposal (10.7), the communicator is NOT cached in the file handle. Instead, it is required in collective I/O calls as a demarcation.
This seems awkward. I would expect the common cases to be: 1) accessing a file from all the nodes that opened it; and 2) accessing a file from only one node. The second case is easily done in both the current version of MPI-IO and in this plan. However, the first case in this plan requires the program to pass around both the file handle and the communicator. In general, it seems like allowing communicators to differ between the open call and subsequent operations on the same file creates a bunch of little implementation headaches, so I'd like to have a better idea of the real benefit.
MPI_FLAYOUT(fh, etype, filetype, count, info)
  IN fh        File handle
  IN etype     Elementary datatype
  IN filetype  Filetype to use from current fh position
  IN count     number of "Filetype" units from which new layout is valid
  IN info      MPI_INFO structure for access pattern, etc.
** In contrast to the existing proposal (10.7), MPI_FLAYOUT is lightweight and incremental. It is a heads-up to the implementation that "any read/write/seek operations I perform from the beginning of this layout through the given count will be of type {etype, filetype}".
Assuming you've written a file using a series of layouts, is it possible to seek back to byte 0 of the file (e.g., to look at some header information) without remembering and individually unsetting (i.e., setting with negative counts) each of the layouts that got you where you are?
** a count < 0 means backwards from the current layout. count > 0 means from the current layout. count == 0 means infinite; i.e., this file is homogeneous
Does count == 0 imply that the whole file is homogeneous, or only the rest of the file from the point where the current layout is declared?
** initially (on open), a file has no layout but etype and filetype == MPI_BYTE. The user's view is by default
|BBBBBBBBBBBBBBBBBBBBBBBBBB ... |     (B == MPI_BYTE)
 ^                              ^
 fh                             EOF
Suppose that MPI_FOPEN is immediately followed by
MPI_FLAYOUT(fh, MPI_INTEGER, MPI_INTEGER, 2, whatever)
on some node. The view on that node is now:
|IIBBBBBBBBBBBBBBBB ... |     (I == MPI_INTEGER)
 ^                      ^
 fh                     EOF
If two different nodes declare different layouts for the same file segment (e.g., the first 8 bytes) then future results are undefined, with no error if the user is in reckless mode.
A minor quibble: MPI-IO has some convenience functions that define interleaving-but-not-overlapping file types for a group of nodes that want to take turns accessing a file. Each node gets a filetype that contains holes corresponding to data that other nodes access. The filetypes are different, but they could be considered to cover the same segment of the file. It's even possible that nodes would want to define overlapping types, which they could use to read regions with ghost cells. Layouts should be defined to allow this behavior, even if atomicity isn't guaranteed for overlapping writes in reckless mode.
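To make the interleaving case concrete, here is a rough sketch of the arithmetic behind a take-turns filetype. It is plain C, not MPI-IO: the function names, the fixed block size, and the round-robin ordering by rank are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of MPI-IO: byte offset of the k-th block
 * owned by `rank` when `nprocs` nodes take turns in fixed-size blocks. */
size_t interleaved_offset(size_t k, int rank, int nprocs, size_t block_bytes)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    return (k * (size_t)nprocs + (size_t)rank) * block_bytes;
}

/* Node that owns the block containing byte `offset`; every other node
 * sees a hole in its filetype at that position. */
int interleaved_owner(size_t offset, int nprocs, size_t block_bytes)
{
    assert(nprocs > 0 && block_bytes > 0);
    return (int)((offset / block_bytes) % (size_t)nprocs);
}
```

With 4 nodes and 8-byte blocks, every node's filetype spans the same file segment, yet each byte has exactly one owner. A layout rule that forbids different declarations over the same segment would rule this pattern out.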
Now suppose the MPI_FLAYOUT above is immediately followed by
MPI_FLAYOUT(fh, MPI_REAL, MPI_REAL, 3, whatever)
on the same node. The view on that node is now:
|IIRRRBBBB ... |     (R == MPI_REAL)
   ^           ^
   fh          EOF
The file handle now "points" at the beginning of the new layout. Semantics: no memory of the past layout is expected.
Now suppose the user reads 2 reals with no offset:
MPI_FREAD(fh, mybuf, MPI_REAL, 2, 0, status)
The resulting "position" of the file handle is
|IIRRRBBBB ... |     (R == MPI_REAL)
      ^        ^
      fh       EOF
I'm confused. Why did the file pointer advance by 3 reals when the read only asked for 2? Is this just a typo?
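For reference, this is the behavior I would have expected, written as a small model in plain C. Nothing here is MPI-IO: the struct and function names are made up, MPI_INTEGER and MPI_REAL are assumed to be 4 bytes each, and I am assuming (as the diagrams suggest) that a new layout starts where the previous one ends.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SEGMENTS 8

/* Hypothetical per-node record of one declared layout. */
typedef struct {
    char   tag;        /* 'I' or 'R', as in the diagrams */
    size_t elem_bytes; /* assumed size of the etype in bytes */
    size_t count;      /* number of units the layout covers */
} segment;

/* Hypothetical per-node view: declared layouts plus the fh position. */
typedef struct {
    segment seg[MAX_SEGMENTS];
    size_t  nseg;
    size_t  fh_offset; /* byte position of the file handle */
} file_view;

/* Declare a new layout after the existing ones; fh moves to its start. */
void view_layout(file_view *v, char tag, size_t elem_bytes, size_t count)
{
    assert(v->nseg < MAX_SEGMENTS);
    size_t start = 0;
    for (size_t i = 0; i < v->nseg; i++)
        start += v->seg[i].elem_bytes * v->seg[i].count;
    v->seg[v->nseg++] = (segment){ tag, elem_bytes, count };
    v->fh_offset = start;
}

/* Read n elements of the most recent layout; fh advances only by the
 * amount actually read. */
void view_read(file_view *v, size_t n)
{
    assert(v->nseg > 0);
    v->fh_offset += n * v->seg[v->nseg - 1].elem_bytes;
}
```

Under these assumptions, the two MPI_FLAYOUT calls followed by the read of 2 reals leave fh at byte 16 (two integers plus two reals), one real short of the end of the R layout at byte 20.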
* Support of independent parallel read/write access is provided when comm == MPI_COMM_NULL.
Why not MPI_COMM_SELF?
MPI_FCLOSE(fh, comm, info)
MPI_IFCLOSE(fh, comm, info, request)
  IN  fh       File handle
  IN  comm     MPI communicator
  IN  info     MPI_INFO structure of {key, value} pairs
  OUT request  request handle (for asynchronous versions)
* These calls should have Unix-like semantics for close, and MPI semantics for collective, (a)synchronous operations.
** It can be performed independently (with MPI_COMM_NULL), by a split communicator, or the original communicator set. Tolerance semantics: if the file handle is first closed by a split communicator, then again by the original (opening) set, then the user is granted forgiveness and no error is generated. This is permissive towards master-slave programming models.
This creates a minor implementation headache: Suppose a file opened by several processes is assigned the file handle 3 on node 0. Then suppose node 0 closes file handle 3 independently of the other nodes. If the implementation reuses file handle 3 when another file is opened, its meaning is ambiguous, because node 0 is still permitted to close (the old) 3 again as part of a collective close operation. Therefore, an implementation can't reuse closed file handles until the nodes have closed the file with the full communicator, and this may never happen if the program closes the file only once with MPI_COMM_SELF on each node. This isn't an impossible problem, just a pain in the neck. Is it worth it?
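To show what the bookkeeping might look like, here is a minimal sketch of a per-node handle table that defers reuse. It is plain C with made-up names and a made-up three-state scheme; it is one way an implementation could cope, not a description of any existing one.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HANDLES 4

/* Hypothetical slot states for one node's file-handle table. */
typedef enum { SLOT_FREE, SLOT_OPEN, SLOT_CLOSED_LOCALLY } slot_state;

typedef struct {
    slot_state state[MAX_HANDLES];
} handle_table;

/* Return the lowest reusable handle, or -1 if none is available. */
int handle_open(handle_table *t)
{
    for (int h = 0; h < MAX_HANDLES; h++) {
        if (t->state[h] == SLOT_FREE) {
            t->state[h] = SLOT_OPEN;
            return h;
        }
    }
    return -1;
}

/* Close a handle. An independent close cannot free the slot, because a
 * later collective close of the same handle must still be tolerated. */
void handle_close(handle_table *t, int h, bool full_communicator)
{
    assert(h >= 0 && h < MAX_HANDLES);
    assert(t->state[h] != SLOT_FREE);
    t->state[h] = full_communicator ? SLOT_FREE : SLOT_CLOSED_LOCALLY;
}
```

A slot closed independently stays in SLOT_CLOSED_LOCALLY until the full communicator closes it. A program that only ever closes independently would therefore eventually run the table out of handles.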