It appears you are asking for:
- a mechanism to store the data representation of a file
"transparently" with the file
- fixed size file headers
O.K., if by "data representation" you mean the architectural information
(e.g., *-endian and byte length of integers), then of course the header
could be of fixed size.
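
To make the fixed-size case concrete, here is a minimal sketch in Python. Nothing in it comes from the MPI-IO draft: the magic string, the 16-byte layout, and the function names are all hypothetical, chosen only to show that byte order and basic type sizes fit in a constant-length record.

```python
import struct
import sys

# Hypothetical layout (an assumption, not part of any MPI-IO proposal):
# 4-byte magic, 1 byte for byte order, 1 byte each for the sizes of
# int, long, float and double, then 7 pad bytes, for 16 bytes in total.
ARCH_HEADER = struct.Struct("=4sBBBBB7x")
MAGIC = b"ARCH"  # made-up magic string


def encode_arch_header():
    """Pack the writer's byte order and native type sizes into 16 bytes."""
    order = 0 if sys.byteorder == "little" else 1
    return ARCH_HEADER.pack(
        MAGIC,
        order,
        struct.calcsize("i"),
        struct.calcsize("l"),
        struct.calcsize("f"),
        struct.calcsize("d"),
    )


def decode_arch_header(raw):
    """Unpack a 16-byte architectural header into a dictionary."""
    magic, order, int_size, long_size, float_size, double_size = (
        ARCH_HEADER.unpack(raw)
    )
    if magic != MAGIC:
        raise ValueError("not an architectural header")
    return {
        "byteorder": "little" if order == 0 else "big",
        "int_size": int_size,
        "long_size": long_size,
        "float_size": float_size,
        "double_size": double_size,
    }
```

The header length here never depends on what the file contains, which is exactly why this part of the problem is the easy one.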
In the canonical form, we need to store the datatypes. These are
by nature variable in length (per file).
For files written in MPI_DATA_NATIVE format, no header is needed.
In my view, they should not even be permitted; NATIVE = reckless.
Hopefully, some desirable performance efficiencies can be achieved in
this mode.
You seem to agree with my viewpoint that "self-describing" files
can have many interpretations.
This month's ARPA PI meeting had a session on metadata for self-describing
objects (datasets). The session was chaired by Rick Hayes-Roth.
Items in their draft schema include:
- storage format (architectural information)
- data format (datatype information)
- content (application metadata)
- lineage (upstream datasets and methods that created the object)
- metadata bindings (semantics)
...
I reject the idea of MPI storing this kind of information in a file.
Instantiation of such a schema is the task of a data management system (DMS).
Notice however, that a DMS needs to obtain the first two items from
the storage utility; i.e., from MPI. Hence, my quest for
MPI_DATA_TYPE_ENCODE/DECODE, MPI_COMM_TYPE_ENCODE, and MPI_FH_TYPE_ENCODE.
At this time, I do not believe there is any restriction in
MPI_DATA_INTERNAL mode on headers. So, an implementation could put the
architectural information produced by MPI_FH_TYPE_ENCODE at the head of
each file. However, I don't believe this will entirely solve the
problem on virtual file systems. When an MPI_OPEN for read is
executed, how will the implementation discover where and how many local
segments were written? Clearly more than architectural information
will have to be stored. In studying this problem, I believe you will
be led towards the need for a variable length header, whose length (or
table of contents) is specified by a pre-header of fixed length.
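
The pre-header idea can be sketched as follows, in Python for brevity. This is only an illustration under assumed conventions: the magic string, the 12-byte pre-header layout, and the function names are hypothetical, and the variable-length header is treated as opaque bytes (it might hold encoded datatypes or a segment table).

```python
import io
import struct

# Hypothetical fixed-length pre-header (an assumption for illustration):
# 4-byte magic plus an 8-byte big-endian length of the variable-length
# header that immediately follows it.
PRE_HEADER = struct.Struct(">4sQ")
MAGIC = b"MPIO"  # made-up magic string


def write_with_header(stream, header, data):
    """Write pre-header, then the variable-length header, then the data."""
    stream.write(PRE_HEADER.pack(MAGIC, len(header)))
    stream.write(header)
    stream.write(data)


def read_header_and_data(stream):
    """Read the fixed pre-header first to learn how long the header is."""
    raw = stream.read(PRE_HEADER.size)
    if len(raw) != PRE_HEADER.size:
        raise ValueError("truncated pre-header")
    magic, header_len = PRE_HEADER.unpack(raw)
    if magic != MAGIC:
        raise ValueError("unrecognized pre-header")
    header = stream.read(header_len)
    if len(header) != header_len:
        raise ValueError("truncated header")
    return header, stream.read()
```

A reader needs to know only the pre-header format in advance; everything about the size of the real header is discovered from the file itself.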
Here's an alternate view of file metadata: suppose that when you store
(open for write) "foo.dat", you embed "header" information (metadata)
at the beginning, but store NO data!! Instead, "foo.dat" tells you where
to find the files (segments) that compose the dataset (virtual file).
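
A minimal sketch of this "metadata only" view, again in Python. The JSON manifest format, the ".segN" naming scheme, and the function names are assumptions made purely for illustration; an actual implementation would choose its own encoding and segment placement.

```python
import json
import os


def write_dataset(manifest_path, chunks):
    """Write each chunk to its own segment file; the manifest holds no data."""
    segments = []
    for i, chunk in enumerate(chunks):
        seg_path = f"{manifest_path}.seg{i}"  # hypothetical naming scheme
        with open(seg_path, "wb") as f:
            f.write(chunk)
        segments.append(
            {"path": os.path.basename(seg_path), "length": len(chunk)}
        )
    with open(manifest_path, "w") as f:
        json.dump({"segments": segments}, f)


def read_dataset(manifest_path):
    """Follow the manifest to the segments and reassemble the virtual file."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    base = os.path.dirname(os.path.abspath(manifest_path))
    parts = []
    for seg in manifest["segments"]:
        with open(os.path.join(base, seg["path"]), "rb") as f:
            part = f.read()
        if len(part) != seg["length"]:
            raise ValueError("segment length mismatch: " + seg["path"])
        parts.append(part)
    return b"".join(parts)
```

Here "foo.dat" answers the "where and how many segments" question directly, at the cost of no longer being a plain data file that a legacy application could read.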
On Thu, 24 Oct 1996, Jean-Pierre Prost wrote:
>
>
> Do not take my comments as destructive; I wish I had a constructive
> suggestion to make. And I do not suggest deleting the section. On the
> other hand, I am glad you are trying to address this difficult issue.
>
> With regard to votes, I agree we ruled out self describing files.
> However, introducing an embedded file header for encoding the data
> representation and architecture information of the file data is very
> different from, and not as ambitious as, having self describing files.
>
> Embedded file headers do not resolve everything, but at least they
> allow file interoperability without requiring the user to remember
> where (s)he has stored the data representation of the files (s)he
> has created.
> Embedded file headers (of fixed size) have the advantage of being highly
> portable across file systems and are self contained. I agree they
> may present some problems for legacy applications that expect
> standard UNIX files, which is the reason for making them optional on a
> per file creation basis. Nothing is perfect in our world.
>
> Jean-Pierre
>
> ---------- Forwarding Original Note --------
> To: JPPROST
> cc: mpi-io @ mcs.anl.gov
> From: frost @ sdsc.edu
> Date: 10/24/96 01:43:10 PM MST
> Subject: Re: interoperability and metadata [was: Proposal for better ...]
> Security: