Contents:
--------
1. Functionalities
2. API
1. Proposed MPI I/O Functionalities
-----------------------------------
User's View:
* Nodes participating in a parallel I/O transaction are
specified by an MPI communicator.
* The basic units in a file or I/O stream are MPI_Datatypes
* The default unit for a (file) stream is MPI_Byte
* Support for "a priori" or externally defined file formats
- streams homogeneous in datatype; e.g., a file of #N vectors
where N is known prior to opening the stream
- streams composite (heterogeneous) in datatypes; e.g., a
file of #K integer vectors followed by #M real matrices
where K, M, and the data ordering are known prior to open
* Support for "a posteriori" or Run Length Encoded file formats
- simple case of header followed by either homogeneous
or composite data segments; e.g., a file with 1k header
specifying the type and count of subsequent segments
(i.e., after the header is read the format is "a priori")
- coarse-grained parallel Run Length Encodings where the
header+segment lengths are long with respect to I/O
latencies and the number of nodes in the I/O communicator
- fine-grained RLE's such as character encodings are
definitely a poor model for parallel I/O; however, support
for RLE transcriptions by one node for subsequent broadcast
to others is a needed functionality. Both the coarse and
fine-grained RLE models could be supported by the same
mechanism. Advice to users: don't do fine-grained RLE
in parallel. Instead, read on one node and then farm it.
This is already a standard model in comp-chemistry.
* Support for coordinated parallel read/write stream access
- A parallel read of a single data structure from a (possibly
parallel) storage system to the processor space and memory
of a parallel application is typically a
"scatter < scatter < gather" operation. Given the data
structure layout on disk, in memory, and in the computation,
and the number of nodes, the I/O library should support
efficient, automatic transfer of data.
- Likewise for write; consider: "gather > gather > scatter".
- Other gather/scatter combinations occur and should be
accounted for.
* Support of independent parallel read/write access. There are
two cases:
- Support for efficient single-node striding through a data
segment; for example, a node reading every kth value of
N vectors. This is simply "single node performance" for
the coordinated read/write above. It is also desirable
when independent nodes are reading different file segments
from the same stream.
-
* Standard READ/WRITE stream access model
* Standard SEEK positioning model, in units of MPI_Datatype(s)
* Atomic SEEK_AND_READ/WRITE for high performance
* Standard OPEN semantics for local file services
* Optional CONNECT semantics for remote or special I/O services
* Standard CLOSE semantics
* Synchronous and Asynchronous collective operations
* Control directives for I/O stream, including
- rewind
- flush
- get/set attribute (fstat model)
- store in architecture-neutral format
* Control hints for I/O service, including
- use local file cache
- access pattern hints; e.g., "periodic", "aperiodic", and
"I'll be computation bound for a long time"
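The "a posteriori" header-plus-segments case above can be illustrated
with a small sketch. This is not part of the proposal: the header
layout (a segment count followed by {type code, count} records, padded
to 1k), the type codes, and the helper names write_stream and
read_stream are all assumptions made for illustration. The point is
only that once the header has been read, the rest of the stream is
"a priori".

```python
import io
import struct

HEADER_SIZE = 1024  # the "1k header" from the text
# Hypothetical one-byte type codes mapped to struct format characters.
TYPE_FORMATS = {b"i": "i", b"f": "f"}


def write_stream(segments):
    """Build an in-memory stream: a padded header, then the data segments."""
    header = struct.pack("<I", len(segments))
    body = b""
    for code, values in segments:
        # One header record per segment: type code and element count.
        header += struct.pack("<cI", code, len(values))
        fmt = "<%d%s" % (len(values), TYPE_FORMATS[code])
        body += struct.pack(fmt, *values)
    return io.BytesIO(header.ljust(HEADER_SIZE, b"\0") + body)


def read_stream(stream):
    """Read the header first; the segment layout is then fully known."""
    header = stream.read(HEADER_SIZE)
    (nseg,) = struct.unpack_from("<I", header, 0)
    layout = [struct.unpack_from("<cI", header, 4 + 5 * k) for k in range(nseg)]
    data = []
    for code, count in layout:
        fmt = "<%d%s" % (count, TYPE_FORMATS[code])
        raw = stream.read(struct.calcsize(fmt))
        data.append(list(struct.unpack(fmt, raw)))
    return layout, data
```

In the "read on one node and then farm it" model, one node would run
the equivalent of read_stream and then distribute the segments.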
Implementor's View
* Semantics which do not limit development of a library or
server implementation.
- Clear understanding of what functionalities must be
collective in each (library, server) case.
* Only functionalities which can be achieved by existing services
- External connections to I/O and DBMS servers should be
quality-of-implementation dependent; i.e., if the
"service information" specified in a hint or in the proposed
MPI_INFO structure is unknown to the implementation (or
local site), then "connect" should be expected to fail.
2. Proposed API
---------------
MPI_FOPEN(comm, filename, amode, info, fh)
IN comm MPI communicator
IN filename Name of file to be opened (string)
IN amode File access mode (integer)
IN info MPI_INFO structure of {key, value} pairs
OUT fh File handle
* Nodes participating in a parallel I/O transaction are
specified by an MPI communicator.
** In contrast to the existing proposal (10.2), append mode is
permitted in amode.
* The basic units in a file or I/O stream are MPI_Datatypes
* The default unit for a (file) stream is MPI_Byte
** In contrast to the existing proposal (10.7), implementation
directives may NOT be given in the filename, but rather
are specified in the "info" argument.
** When requests in "info" are not satisfied, MPI_FOPEN
returns a (possibly or'd) set of warning code(s) in ierror.
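One way to picture the or'd warning codes is the sketch below. The
warning names, the supported keys and values, and the function
check_info are hypothetical; the proposal does not define them. The
sketch only shows how several unsatisfied requests could be folded
into a single ierror value.

```python
from enum import IntFlag


class Warn(IntFlag):
    """Hypothetical warning bits; not defined by the proposal."""
    NONE = 0
    UNKNOWN_KEY = 1
    UNSUPPORTED_VALUE = 2


# Hypothetical set of {key, value} requests this implementation honors.
SUPPORTED = {
    "cache": {"local", "none"},
    "access": {"periodic", "aperiodic"},
}


def check_info(info):
    """Return the or'd set of warnings for the requests in info."""
    ierror = Warn.NONE
    for key, value in info.items():
        if key not in SUPPORTED:
            ierror |= Warn.UNKNOWN_KEY
        elif value not in SUPPORTED[key]:
            ierror |= Warn.UNSUPPORTED_VALUE
    return ierror
```

A caller can then test individual bits, e.g.
Warn.UNKNOWN_KEY in check_info(info), and decide whether to proceed.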
MPI_IFOPEN(comm, filename, amode, info, fh, request)
IN comm MPI communicator
IN filename Name of file to be opened (string)
IN amode File access mode (integer)
IN info MPI_INFO structure of {key, value} pairs
OUT fh File handle
OUT request asynchronous request handle
** Asynchronous version of above. This is particularly useful
when the dataset is stored in an archival storage system.
MPI_FLAYOUT(fh, etype, filetype, count, info)
IN fh File handle
IN etype Elementary datatype
IN filetype Filetype to use from current fh position
IN count number of "Filetype" units from which new
layout is valid
IN info MPI_INFO structure for access pattern, etc.
** In contrast to the existing proposal (10.7), MPI_FLAYOUT
is lightweight and incremental. It is a heads-up to the
implementation that "any read/write/seek operations I
perform from the beginning of this layout through the given
count will be of type {etype, filetype}".
** a count < 0 means backwards from the current layout.
count > 0 means from the current layout.
count == 0 means infinite; i.e., this file is homogeneous
** initially (on open), a file has no layout but etype and
filetype == MPI_BYTE. The user's view is by default
|BBBBBBBBBBBBBBBBBBBBBBBBBB ... | (B == MPI_BYTE)
^ ^
fh EOF
Suppose that MPI_FOPEN is immediately followed by
MPI_FLAYOUT(fh, MPI_INTEGER, MPI_INTEGER, 2, whatever)
on some node. The view on that node is now:
|IIBBBBBBBBBBBBBBBB ... | (I == MPI_INTEGER)
^ ^
fh EOF
If two different nodes declare different layouts for the
same file segment (e.g., the first 8 bytes) then future
results are undefined, with no error if the user is in
reckless mode.
Now suppose the MPI_FLAYOUT above is immediately followed by
MPI_FLAYOUT(fh, MPI_REAL, MPI_REAL, 3, whatever)
on the same node. The view on that node is now:
|IIRRRBBBB ... | (R == MPI_REAL)
^ ^
fh EOF
The file handle now "points" at the beginning of the new
layout. Semantics: no memory of the past layout is expected.
Now suppose the user reads 2 reals with no offset:
MPI_FREAD(fh, mybuf, MPI_REAL, 2, 0, status)
The resulting "position" of the file handle is
|IIRRRBBBB ... | (R == MPI_REAL)
^ ^
fh EOF
Note that the file handle is pointing at the next
(yet to be used) value in the current layout.
[In an earlier proposal, it was thought that only
one MPI_FREAD should be permitted per MPI_FLAYOUT.
This is an interesting paradigm, but does not support
the homogeneous file model. For example, when count
is 0, we should not restrict the user to only one
MPI_FREAD over their entire file!]
MPI_FLAYOUT provides support for both initial
and incremental declarations of file datatype layouts.
As such it supports both the "a priori" and "a posteriori"
(Run Length Encoding) file formatting models.
Advice to users: don't do fine-grained RLE
in parallel. Instead, read on one node and then farm it.
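The walk-through above can be traced with a toy model of one node's
view. This is a sketch under stated assumptions, not the proposed
implementation: the class LayoutModel and its methods are hypothetical,
MPI_INTEGER and MPI_REAL are both taken to be 4 bytes, a new layout is
assumed to begin where the previous one ended, and only count > 0 is
modeled.

```python
class LayoutModel:
    """Toy model of one node's file view under incremental layouts."""

    # Assumed sizes in bytes: I = MPI_INTEGER, R = MPI_REAL, B = MPI_BYTE.
    SIZES = {"I": 4, "R": 4, "B": 1}

    def __init__(self):
        self.start = 0   # byte offset where the current layout begins
        self.unit = "B"  # element type of the current layout
        self.count = 0   # number of units in the current layout
        self.pos = 0     # units already consumed in the current layout

    def flayout(self, unit, count):
        """Declare a new layout starting at the end of the current one."""
        if count <= 0:
            raise ValueError("only count > 0 is modeled in this sketch")
        self.start += self.count * self.SIZES[self.unit]
        self.unit, self.count, self.pos = unit, count, 0

    def fread(self, bufcount, offset=0):
        """Return the byte offset of the first unit read; advance pos."""
        first = self.pos + offset
        last = first + bufcount
        if first < 0 or last > self.count:
            raise ValueError("read past either end of the declared layout")
        self.pos = last
        return self.start + first * self.SIZES[self.unit]
```

With this model, flayout("I", 2) followed by flayout("R", 3) places the
handle at byte 8, and fread(2) leaves it pointing at the third real.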
MPI_FREAD(fh, buff, buftype, bufcount, offset, status)
MPI_FWRITE(...)
MPI_IFREAD(..., request)
MPI_IFWRITE(..., request)
IN fh File handle
OUT buff buffer (IN on write)
IN buftype datatype of the buffer elements
IN bufcount number of buftypes to read/write
IN offset offset in filetypes
OUT status success/warning/failure
OUT request request handle (for asynchronous versions)
** The dependency on a communicator has been removed.
Users can achieve task parallel I/O on the same file with
multiple MPI_FOPEN statements.
** Note that the implementation must keep a "position" pointer
per node for both the user and the library. Consider a
sequence of IFREAD, FLAYOUT, IFREAD, FLAYOUT, etc., followed
by a wait on all outstanding requests. The user's view of
the file handle position is that it moves with each read.
However, the file handle in the backend of the implementation
will not "catch up" in position until the wait is satisfied.
The implementation need not store all of the interim
positions. Each set of new displacements can be computed
from the previous; i.e., FLAYOUT has incremental semantics.
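The two "position" pointers can be sketched as follows. The class
AsyncHandleModel and its methods are hypothetical, positions are
counted in bytes for simplicity, and each new layout is assumed to
start where the previous one ended. The sketch only illustrates that
the user's position moves when a request is issued, while the backend
position moves when the wait is satisfied.

```python
class AsyncHandleModel:
    """Toy model of user-visible vs. backend position for one handle."""

    def __init__(self):
        self.layout_start = 0  # start of the current layout (bytes)
        self.layout_count = 0  # length of the current layout (bytes)
        self.user_pos = 0      # moves as soon as a read is issued
        self.backend_pos = 0   # moves only when requests complete
        self.pending = []      # (start, length) of outstanding reads

    def flayout(self, count):
        # Incremental semantics: the new start is computed from the
        # previous layout alone; no history of older layouts is kept.
        self.layout_start += self.layout_count
        self.layout_count = count
        self.user_pos = self.layout_start

    def ifread(self, nbytes):
        # Record the request and advance the user's view immediately.
        self.pending.append((self.user_pos, nbytes))
        self.user_pos += nbytes

    def wait_all(self):
        # Complete outstanding requests in order; the backend catches up.
        done, self.pending = self.pending, []
        for start, nbytes in done:
            self.backend_pos = start + nbytes
        return done
```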
* A standard SEEK positioning model is not supported by this
API; instead, positioning is achieved through MPI_FLAYOUT and
the offset argument in MPI_FREAD, et al.
* Seek-and-read within a particular layout can easily be
accomplished with offset. In particular, offset can be negative.
It is an error (non-fatal?) to attempt an MPI_FREAD past either
end of a declared layout.
* Seeking to the beginning of the file is equivalent
to "rewind", a file control operation.
* An Atomic SEEK_AND_READ/WRITE for high-performance could be
constructed as a union of the MPI_FLAYOUT and MPI_FREAD specs.
MPI_FSEEKREAD(fh, etype, filetype, count,
buff, buftype, bufcount, offset, comm, status)
MPI_FSEEKWRITE(...)
MPI_IFSEEKREAD(..., request)
MPI_IFSEEKWRITE(..., request)
MPI_FCLOSE(fh, info)
MPI_IFCLOSE(fh, info, request)
IN fh File handle
IN info MPI_INFO structure of {key, value} pairs
OUT request request handle (for asynchronous versions)
* These calls should have Unix-like semantics for close, and
MPI semantics for collective, (a)synchronous operations.
* The info argument can be used to supply hints.
MPI_FCONNECT(comm, amode, info)
IN comm MPI communicator
IN amode File access mode (integer)
IN/OUT info MPI_INFO structure of {key, value} pairs
* Optional functionality for remote or special I/O services.
All resource specifications are given in the info argument.
This call is useful when multiple I/O services
are available from a single service provider, or when
user authentication data (password keys, etc.) need to be
established prior to "open".
* Service-side information is returned in info. This particular
info object can then be used in subsequent MPI_FOPEN statements.
Note that if the user already "knows" the correct info contents
to open a file, they can skip the MPI_FCONNECT command and
simply construct the info argument using the MPI2 info utilities.
** Whether anything useful is ever returned is a quality of
implementation issue. If a connection is not accomplished,
a warning is returned in ierror.
* Suggested functionalities:
- "standard" archive services
- ftp services
- named blobs in ODBC compliant databases
MPI_FCONTROL(fh, comm, flag, choice, info)
* Stream and service control hints and directives can be
implemented by either one function with N flags, or
N functions with no flags.
Please send any requests for control functionalities not
listed in MPI2 chapter 10 to mpi-io@mcs.anl.gov with a
suitable subject line.