File Pointer Updating - reviewing evolved text

Bill Nitzberg (nitzberg@nas.nasa.gov)
Tue, 11 Mar 1997 19:22:13 -0800

Ian Stockdale wrote:
> In view of the changes to the I/O chapter over the last
> several meetings, it seems like it might be useful to look at the
> evolved text to see how the newer changes interact with the details
> which were already present. There may be some constraints which were
> determined based on requirements which no longer exist. Similarly,
> some choices may have been made to preserve functionality which has
> since vanished. Small modifications to improve the standard without
> making any fundamental changes might thus be possible.

In this vein, I would like to review the "funny" file pointer update
semantics. The current draft states (from 10.4.1 Positioning):

For both performance and thread safety reasons, \MPI/ always
updates the file pointer at the outset of an operation by the
amount of data requested. Waiting for an access to complete
before updating the file pointer would restrict the ability to
initiate additional accesses by the same process (for both
types of file pointers) and by other processes (for the shared
file pointer).

We have already voted to eliminate this "funny" update rule for
sequential files (serial files such as sockets and tapes). From
the new "10.6.x Random Access vs. Sequential Files" section (if you
missed the Forum meeting, a new draft will be out soon):

[... for sequential files ...] the pointer update rules
specified for the data access routines do not apply.

I believe that recent changes in the draft have eliminated the need
for this "funny" rule entirely (for random access files as well).
Specifically, the following rules combine to make it unnecessary:

- all non-blocking accesses must be completed before FILE_SYNC
or FILE_SET_SIZE is called,
- FILE_SYNC is required for updates to be visible between file
handles from different collective OPEN calls,
- End of file is not an error, and
- I/O errors may leave MPI in an undetermined state (such as updating
the file pointer by the amount requested rather than the amount
accessed).

Given these new rules, one can implement "normal" update semantics
as follows:

When a file is OPENed, cache the file size with the file handle

For a WRITE:
    lock(file pointer)
    update file pointer by amount *requested* (== amount *accessed*)
    unlock(file pointer)
    execute WRITE (if WRITE fails, raise an error)

For a READ:
    lock(file pointer)
    calculate new position (p_new) based on amount *requested*
    if (p_new > cached file size) then
        update pointer to cached file size
    otherwise,
        update pointer by amount *requested* (== amount *accessed*)
    unlock(file pointer)
    execute READ (if READ fails, raise an error)
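
As a rough illustration only, the WRITE and READ steps above might be
sketched in Python as follows. The class name FilePointer and the
methods reserve_write and reserve_read are hypothetical, not part of
any MPI interface, and positions are counted in abstract units. The
max() guard in reserve_read is an added assumption: it keeps the
pointer from moving backwards if it already sits beyond the cached
size.

```python
import threading


class FilePointer:
    """Hypothetical per-handle state: a pointer plus the size cached at OPEN."""

    def __init__(self, cached_size, position=0):
        self._lock = threading.Lock()
        self.position = position
        self.cached_size = cached_size

    def reserve_write(self, nbytes):
        """Advance the pointer by the amount requested; return (offset, count)."""
        with self._lock:
            start = self.position
            self.position = start + nbytes
        # The actual WRITE would be issued here, outside the lock.
        return start, nbytes

    def reserve_read(self, nbytes):
        """Advance the pointer, clamped to the cached size; return (offset, count)."""
        with self._lock:
            start = self.position
            p_new = start + nbytes
            if p_new > self.cached_size:
                # Assumption: never move the pointer backwards.
                p_new = max(start, self.cached_size)
            self.position = p_new
        # The actual READ would be issued here, outside the lock.
        return start, p_new - start
```

The lock is held only for the pointer arithmetic, so a second thread
can reserve its own region while the first access is still in flight.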

File size can only be updated by:
a. FILE_SET_SIZE
b. FILE_SYNC
c. WRITE(fh, ...) where fh is in the set returned by the OPEN

Both a. and b. are collective, so the implementation can simply
update its cached value of the file size.
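
A minimal sketch of how cases a. and b. might refresh the cached
value, assuming every process passes the same size to FILE_SET_SIZE
and that the implementation can learn the agreed size when FILE_SYNC
returns. The class CachedSize and its method names are hypothetical.

```python
class CachedSize:
    """Hypothetical record of the file size as last agreed by all processes."""

    def __init__(self, size_at_open):
        # Cached once when the file is OPENed.
        self.value = size_at_open

    def on_set_size(self, new_size):
        # Case a: collective, so each process stores the size it was given.
        self.value = new_size

    def on_sync(self, size_after_sync):
        # Case b: collective; size_after_sync is assumed to be supplied
        # by the implementation once the sync has completed.
        self.value = size_after_sync
```

Because both calls are collective, no extra communication is needed
to keep the cached values consistent across processes.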

For c., there are two arguments which favor "normal" update semantics:
1. For update semantics to be important at all, a subsequent READ
must overlap this WRITE. This requires FILE_SET_ATOMICITY, which
will basically serialize all accesses anyway (making them all expensive).
2. Only READs which are about to read beyond the cached EOF are
affected. The common case is still fast.

For both READing and WRITEing, there is no need to wait for an access
to complete before updating the file pointer. Therefore, this
implementation does not restrict the ability to initiate additional
accesses (by other processes or other threads in the same process).

Comments?

- bill