Simplified I/O Interface Proposals Summary

Bill Nitzberg (nitzberg@nas.nasa.gov)
Tue, 25 Jun 1996 19:41:14 -0700

This message is an effort to simplify the numerous "Simplified
I/O Interface" proposals, as well as to break up the proposals
into vote-able pieces.

These proposals represent a distillation of the recent email as
well as conversations which took place at the June MPI Forum
meeting. In particular, they include work from Leslie Hart,
Rusty Lusk, John May, Bill Nitzberg, Yoonho Park, Eric Salo,
Bill Saphir, Rajeev Thakur, Parkson Wong, and others.

The proposals all attempt to provide the look and feel of the
basic UNIX interface by default, and provide extra features via
additional routines. All of the proposals separate the existing
MPI_OPEN routine into two routines, MPI_OPEN and MPI_LAYOUT,
and all rename the individual file pointer routines to MPI_READ
and MPI_WRITE. In this way, MPI_OPEN, MPI_READ, MPI_WRITE, and
MPI_CLOSE have identical semantics to their familiar UNIX
equivalents.

Note that some proposals replace the existing interface, some
can either replace or be added to the existing interface, and
not all of them can be combined sensibly.

----------

First, I believe there is a set of proposals which are (somewhat)
orthogonal to the "simplified I/O interface" proposals. These
should probably be decided separately:

PROPOSAL #1: Add non-blocking MPI_IOPEN

PROPOSAL #2: Add non-blocking MPI_ICLOSE

PROPOSAL #3: Add non-blocking MPI_IFILE_SYNC

PROPOSAL #4: Add an info/hints argument to MPI_CLOSE

PROPOSAL #5: Add APPEND mode to MPI_OPEN

Semantically, the scope of "APPEND" could be any of:
- the current MPI_COMM_WORLD only,
- all applications running under this MPI
  implementation, or
- all applications with access to the file (i.e. UNIX
  semantics).

PROPOSAL #6: Allow finite count of filetypes to tile a file

The accessible area of a file can be finite, and is
specified exactly as a count of filetypes (e.g. by adding
"count" to MPI_OPEN). Attempts to access areas of the file
outside the currently defined extent of the file would be an
error. (Note that we could also require file block
preallocation here.)
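
As a rough illustration of the bounds check this proposal implies,
the C sketch below treats the accessible area as "count" filetypes of
a fixed extent in bytes. The name access_in_bounds and its parameters
are hypothetical and not part of any proposal; a real check would
live inside the implementation.

```c
#include <assert.h>

/* Hypothetical helper, not part of any proposal: returns 1 if the byte
 * range [offset, offset + nbytes) lies inside an accessible area made of
 * `count` filetypes, each `filetype_extent` bytes long, and 0 otherwise. */
int access_in_bounds(long long count, long long filetype_extent,
                     long long offset, long long nbytes)
{
    long long limit;

    assert(count >= 0 && filetype_extent > 0);
    if (offset < 0 || nbytes < 0)
        return 0;
    limit = count * filetype_extent;   /* size of the accessible area */
    if (offset > limit)
        return 0;
    return nbytes <= limit - offset;   /* avoids computing offset + nbytes */
}
```

With count = 4 and a 256-byte filetype, the accessible area is 1024
bytes, so a 24-byte access at offset 1000 is accepted and a 25-byte
access at the same offset would be an error.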

PROPOSAL #7: Make all READ/WRITE routines collective

Eliminate the _ALL routines and make all data access
routines collective. Individual accesses would be supported
by specifying MPI_COMM_SELF when opening the file.

PROPOSAL #8: Add additional support for "non-local" I/O services

In order to portably support "standard" archive servers
(e.g. Unitree), ftp services, named blobs in ODBC compliant
databases, as well as authentication data (e.g. password
keys), an additional routine is called prior to OPEN:

MPI_CONNECT(comm, amode, info)

MPI_CONNECT returns the necessary keys/cookies which are
then passed to the OPEN call. All resource specifications
are given in the info argument. Service-side information is
returned in info. This particular info object can then be
used in subsequent MPI_OPEN statements.

----------

Finally, without getting into too much detail, here is the meat
of the alternative simplified I/O interface proposals. In these
proposals, "hints" has been replaced by "info", and READ is
used as the canonical example representing all MPI data access
routines.

PROPOSAL A: Split MPI_OPEN into MPI_OPEN and MPI_LAYOUT

MPI_OPEN(comm, filename, amode, info, fh)

Mimics a UNIX open call, opening "filename" with
disp = 0, etype = MPI_BYTE, filetype = MPI_BYTE.

MPI_LAYOUT(fh, disp, etype, filetype, count, info)

Changes the process's view of the file. It could be called
at any time on an open file handle. The "count" argument
would be added only if PROPOSAL #6 were adopted.
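
To make the notion of a "view" concrete, here is a minimal sketch
of how a (disp, etype, filetype) triple could map the k-th visible
element to a byte offset in the file. It assumes a deliberately
simple filetype (a strided pattern with one visible block per tile);
the file_view struct and both function names are hypothetical and
only model the idea.

```c
#include <assert.h>

/* Hypothetical model of a file view; none of these names come from the
 * proposals.  The filetype is assumed to be a simple strided pattern:
 * each tile spans `extent` etypes, of which only the first `blocklen`
 * are visible to this process. */
typedef struct {
    long long disp;        /* byte displacement of the view       */
    long long etype_size;  /* size of one etype in bytes          */
    long long blocklen;    /* visible etypes per tile             */
    long long extent;      /* total etypes per tile (>= blocklen) */
} file_view;

/* View that MPI_OPEN would establish under PROPOSAL A:
 * disp = 0, etype = filetype = one byte. */
file_view default_view(void)
{
    file_view v = { 0, 1, 1, 1 };
    return v;
}

/* Byte offset in the file of the k-th visible etype (k counts from 0). */
long long view_to_file_offset(const file_view *v, long long k)
{
    assert(k >= 0 && v->blocklen > 0 && v->extent >= v->blocklen);
    return v->disp
         + ((k / v->blocklen) * v->extent + (k % v->blocklen))
           * v->etype_size;
}
```

Under the default view the mapping is the identity, which is what
gives MPI_OPEN its UNIX-like behavior; a later MPI_LAYOUT call would
correspond to swapping in a different file_view.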

PROPOSAL B: Rename READ/WRITE routines to UNIX look and feel

Rename the individual file pointer routines to MPI_READ and
MPI_WRITE to mimic their UNIX counterparts.

Specifically, the existing MPI_READ would be renamed
MPI_READ_EXPLICIT or MPI_READ_SEEK, and the existing
MPI_READ_NEXT would be renamed MPI_READ. We could even go
one step further and add a simpler (more UNIX-like) READ
which assumes a buftype of MPI_BYTE.

Note that this is almost entirely a name binding issue, and
we should probably avoid discussing name binding issues at
this early juncture.

PROPOSAL C: Add filetype to the READ/WRITE routines

Add a routine taking both filetype and buftype such as:
MPI_READ_LAYOUT(fh, disp, filetype, buf, buftype, bufcount, status).
Note that an absolute file address must be specified. This
address is denoted here by including the byte displacement
"disp"; PROPOSAL E has an alternative method of specifying
an absolute file address.

PROPOSAL D: Add persistent READ/WRITE routines

Persistent READ/WRITE routines would support applications (e.g.
time-stepping computational flow solvers) which perform the same
I/O operation many times. This would permit significant
optimizations to be performed once and re-used by the
implementation.

Typical code might look like:

MPI_WRITE_NEXT_INIT(fh, buf, buftype, bufcount, request)
...
Loop
    ... Compute Timestep ...
    MPI_START(request)
    ...
    MPI_WAIT(request, status)
End Loop
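
The set-up-once, start-many-times pattern can be modeled in plain C
with a toy, single-process stand-in for the file and the request.
Everything named toy_* below is hypothetical; the sketch only shows
where the one-time work would go and does not perform real or
overlapped I/O.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy in-memory stand-ins; all names here are hypothetical. */
typedef struct {
    char   data[256];   /* file contents                     */
    size_t pos;         /* individual file pointer, in bytes */
} toy_file;

typedef struct {
    toy_file   *fh;
    const void *buf;
    size_t      nbytes;  /* bufcount * element size, computed once */
    int         active;  /* nonzero between start and wait         */
} toy_request;

/* Analogue of MPI_WRITE_NEXT_INIT: do the set-up work exactly once. */
void toy_write_next_init(toy_file *fh, const void *buf, size_t elem_size,
                         size_t bufcount, toy_request *req)
{
    req->fh = fh;
    req->buf = buf;
    req->nbytes = elem_size * bufcount;
    req->active = 0;
}

/* Analogue of MPI_START: begin one write using the cached set-up. */
void toy_start(toy_request *req)
{
    assert(!req->active);
    req->active = 1;
}

/* Analogue of MPI_WAIT: this toy defers the copy to completion time,
 * then advances the file pointer past the data just written. */
void toy_wait(toy_request *req)
{
    toy_file *f = req->fh;

    assert(req->active);
    assert(f->pos + req->nbytes <= sizeof f->data);
    memcpy(f->data + f->pos, req->buf, req->nbytes);
    f->pos += req->nbytes;
    req->active = 0;
}
```

A time-stepping loop would call toy_write_next_init once before the
loop, then toy_start and toy_wait once per step, refilling the same
buffer each time.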

PROPOSAL E: Make displacement implicitly based on MPI_LAYOUT

Eliminate the explicit byte displacement from the proposal and
replace it with a system-maintained displacement. The
system-maintained displacement, initially zero, is modified by the
MPI_LAYOUT routine as follows:

MPI_LAYOUT(fh, etype, filetype, count, info) would be
equivalent to the following calls in the existing proposal:

static MPI_Offset disp = 0; /* Line 1 */
static int previous_count = 0;
static MPI_Datatype previous_filetype = MPI_BYTE;

if (count < 0)
    disp -= count * sizeof(filetype);
else if (count > 0)
    disp += previous_count * sizeof(previous_filetype);
else /* (count == 0) */
    /* disp doesn't change */;
previous_count = count;
previous_filetype = filetype;

MPI_CLOSE(fh);
MPI_OPEN(comm, filename, amode, disp, etype, filetype, info, fh);

Possible additions/modifications to the above:

At Line 1 above, set disp to the current value of the
individual file pointer. This better supports layouts
where the filetype tiles the file infinitely.

Add MPI_REWIND(...) or MPI_SEEK(...), which reset the
system-maintained displacement to zero or to any specified
value, respectively.
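
For anyone who wants to experiment with the displacement
bookkeeping in PROPOSAL E, it can be modeled in plain C as follows.
This is a literal transcription of the pseudocode, with each
filetype represented only by an assumed size in bytes (standing in
for sizeof(filetype)); the layout_state struct and function names
are hypothetical, and none of the possible additions are included.

```c
#include <assert.h>

/* Hypothetical model of the system-maintained displacement.  Filetypes
 * are represented only by their size in bytes, standing in for the
 * pseudocode's sizeof(filetype). */
typedef struct {
    long long disp;            /* system-maintained displacement, bytes */
    long long previous_count;
    long long previous_size;   /* size of the previous filetype, bytes  */
} layout_state;

void layout_state_init(layout_state *s)
{
    s->disp = 0;
    s->previous_count = 0;
    s->previous_size = 1;      /* MPI_BYTE */
}

/* Apply one MPI_LAYOUT(fh, etype, filetype, count, info) call and return
 * the displacement that the equivalent MPI_OPEN would receive. */
long long layout_advance(layout_state *s, long long count,
                         long long filetype_size)
{
    assert(filetype_size > 0);
    if (count < 0)
        s->disp -= count * filetype_size;
    else if (count > 0)
        s->disp += s->previous_count * s->previous_size;
    /* count == 0: disp does not change */
    s->previous_count = count;
    s->previous_size = filetype_size;
    return s->disp;
}
```

Under this model, a layout of 10 filetypes of 8 bytes followed by a
layout of 5 filetypes of 4 bytes places the second layout at
displacement 80, i.e. immediately after the first.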

----------

I apologize for any errors or inconsistencies; they are most
likely editing errors on my part.

Thank you for reading this far,

- bill

Bill Nitzberg nitzberg@nas.nasa.gov
NAS Parallel Systems, MRJ, Inc.
NASA Ames Research Center, M/S 258-6 Tel: (415) 604-4513
Moffett Field, CA 94035-1000 FAX: (415) 966-8669