Window Creation

13.2.1. Window Creation

MPI_WIN_CREATE(base, size, disp_unit, info, comm, win)
IN base	initial address of window (choice)
IN size	size of window in bytes (non-negative integer)
IN disp_unit	local unit size for displacements, in bytes (positive integer)
IN info	info argument (handle)
IN comm	intra-communicator (handle)
OUT win	window object (handle)

C binding
int MPI_Win_create(void *base, MPI_Aint size, int disp_unit, MPI_Info info, MPI_Comm comm, MPI_Win *win)
int MPI_Win_create_c(void *base, MPI_Aint size, MPI_Aint disp_unit, MPI_Info info, MPI_Comm comm, MPI_Win *win)

Fortran 2008 binding
MPI_Win_create(base, size, disp_unit, info, comm, win, ierror)
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: base
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: size
    INTEGER, INTENT(IN) :: disp_unit
    TYPE(MPI_Info), INTENT(IN) :: info
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Win), INTENT(OUT) :: win
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror
MPI_Win_create(base, size, disp_unit, info, comm, win, ierror) !(_c)
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: base
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: size, disp_unit
    TYPE(MPI_Info), INTENT(IN) :: info
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Win), INTENT(OUT) :: win
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

Fortran binding
MPI_WIN_CREATE(BASE, SIZE, DISP_UNIT, INFO, COMM, WIN, IERROR)
    <type> BASE(*)
    INTEGER(KIND=MPI_ADDRESS_KIND) SIZE
    INTEGER DISP_UNIT, INFO, COMM, WIN, IERROR

This procedure is collective over the group of comm. It returns a handle to a window that can be used by the MPI processes in this group to perform RMA operations. Each MPI process specifies a window of existing memory that it exposes to RMA accesses by any MPI processes in the group of comm. The window consists of size bytes, starting at address base. In C, base is the starting address of a memory region. In Fortran, one can pass the first element of a memory region or a whole array, which must be "simply contiguous" (for "simply contiguous," see also Section Problems Due to Data Copying and Sequence Association with Subscript Triplets). An MPI process may elect to expose no memory by specifying size = 0.
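
As an illustration of the call described above, the following is a minimal sketch rather than a definitive usage pattern. It assumes a C program in which every process exposes a small existing array of int; the array length N, the choice of MPI_COMM_WORLD, and the use of fence synchronization with a put to the right neighbor are assumptions made for this example only.

```c
#include <mpi.h>
#include <stdio.h>

#define N 4  /* elements exposed per process; arbitrary choice for this sketch */

int main(int argc, char **argv)
{
    int rank, nprocs;
    int local[N];   /* existing memory that the window will expose */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < N; i++)
        local[i] = -1;

    /* Collective over the group of MPI_COMM_WORLD: every process calls it.
       size is given in bytes; disp_unit = sizeof(int) lets target
       displacements be expressed as array indices. */
    MPI_Win_create(local, (MPI_Aint)(N * sizeof(int)), (int)sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);  /* start an RMA epoch (active target) */

    /* Write this process's rank into element 1 of the right neighbor. */
    int right = (rank + 1) % nprocs;
    MPI_Put(&rank, 1, MPI_INT, right, 1 /* index, not bytes */, 1, MPI_INT, win);

    MPI_Win_fence(0, win);  /* complete the put at origin and target */

    printf("rank %d: local[1] = %d\n", rank, local[1]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

A process that wishes to expose no memory would still take part in the collective call, passing size = 0.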

The displacement unit argument is provided to facilitate address arithmetic in RMA operations: the target displacement argument of an RMA operation is scaled by the factor disp_unit specified by the target process, at window creation.
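
The scaling rule above can be sketched in plain C. The helper names target_byte_offset and target_address are hypothetical and are not MPI procedures; the sketch assumes a non-negative target displacement and simply evaluates window_base + disp_unit * target_disp, using the disp_unit that the target process supplied at window creation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset inside the target window that a
   target displacement refers to, given the target's disp_unit. */
size_t target_byte_offset(size_t disp_unit, size_t target_disp)
{
    assert(disp_unit > 0);  /* disp_unit is a positive integer */
    return disp_unit * target_disp;
}

/* Hypothetical helper: address inside the target window, computed as
   window_base + disp_unit * target_disp. */
void *target_address(void *window_base, size_t disp_unit, size_t target_disp)
{
    return (char *)window_base + target_byte_offset(disp_unit, target_disp);
}
```

With disp_unit = sizeof(double), a displacement of 3 designates the fourth element of an array of double; with disp_unit = 1, the displacement is a plain byte offset.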

Rationale.

The window size is specified using an address-sized integer, rather than a basic integer type, to allow windows that span more memory than can be described with a basic integer type. ( End of rationale.)

Advice to users.

Common choices for disp_unit are 1 (no scaling), and (in C syntax) sizeof(type), for a window that consists of an array of elements of type type. The latter choice will allow one to use array indices in RMA calls, and have those scaled correctly to byte displacements, even in a heterogeneous environment. ( End of advice to users.)

The info argument provides optimization hints to the runtime about the expected usage pattern of the window. The following info keys are predefined:

"no_locks" (boolean, default: false):

if set to true, then the implementation may assume that passive target synchronization (i.e., MPI_WIN_LOCK, MPI_WIN_LOCK_ALL) will not be used on the given window. This implies that this window is not used for 3-party communication, and RMA can be implemented with no (less) asynchronous agent activity at this process.

"accumulate_ordering" (string, default: rar,raw,war,waw):

controls the ordering of accumulate operations at the target. See Section Ordering for details.

"accumulate_ops" (string, default: same_op_no_op):

if set to "same_op", the implementation will assume that all concurrent accumulate calls to the same target address will use the same operator. If set to "same_op_no_op", then the implementation will assume that all concurrent accumulate calls to the same target address will use the same operator or MPI_NO_OP. This can eliminate the need to protect access for certain operators where the hardware can guarantee atomicity.

"mpi_accumulate_granularity" (integer, default: 0):

provides a hint to implementations about the desired synchronization granularity for accumulate operations, i.e., the size of memory ranges in bytes for which the implementation should acquire a synchronization primitive to ensure atomicity of updates. If the specified granularity is not divisible by the size of the type used in an accumulate operation, it should be treated as if it was the next multiple of the element size. For example, a granularity of 1 byte should be treated as 8 in an accumulate operation using MPI_UINT64_T. By default, this info key is set to 0, which leaves the choice of synchronization granularity to the implementation. If specified, all MPI processes in the group of a window must supply the same value.

Advice to users.

Small synchronization granularities may provide improved latencies for accumulate operations with few elements and potentially increase concurrency of updates, at the cost of lower throughput. For example, a value matching the size of a type involved in an accumulate operation may enable implementations to use atomic memory operations instead of mutual exclusion devices. Larger synchronization granularities may yield higher throughput of accumulate operations with large numbers of elements due to lower synchronization costs, potentially at the expense of higher latency for accumulate operations with few elements, e.g., if atomic memory operations are not employed. By dividing larger accumulate operations into smaller segments, concurrent accumulate operations to the same window memory may update different segments in parallel. ( End of advice to users.)

Advice to implementors.

Implementations are encouraged to avoid mutual exclusion devices in cases where the granularity is small enough to warrant the use of atomic memory operations. For larger granularities, implementations should use this info value as a hint to partition the window memory into zones of mutual exclusion to enable segmentation of large accumulate operations. ( End of advice to implementors.)
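
The rounding rule stated for "mpi_accumulate_granularity" can be sketched as follows. The function name effective_granularity is hypothetical; the sketch only illustrates the arithmetic (round up to the next multiple of the element size, pass 0 through as "implementation's choice") and does not describe how any particular implementation behaves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: granularity to apply for an accumulate operation
   whose elements are elem_size bytes. A requested value of 0 means the
   choice is left to the implementation and is returned unchanged. */
size_t effective_granularity(size_t requested, size_t elem_size)
{
    assert(elem_size > 0);
    if (requested == 0)
        return 0;
    size_t rem = requested % elem_size;
    /* Round up to the next multiple of the element size when needed. */
    return rem == 0 ? requested : requested + (elem_size - rem);
}
```

For the case mentioned in the text, a requested granularity of 1 byte with 8-byte MPI_UINT64_T elements is treated as 8.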

"same_size" (boolean, default: false):

if set to true, then the implementation may assume that the argument size is identical on all MPI processes, and that all MPI processes have provided this info key with the same value.

"same_disp_unit" (boolean, default: false):

if set to true, then the implementation may assume that the argument disp_unit is identical on all MPI processes, and that all MPI processes have provided this info key with the same value.

"mpi_assert_memory_alloc_kinds" (string, not set by default):

If set, the implementation may assume that the memory for all communication buffers passed to MPI operations performed by the calling MPI process on the given window will use only the memory allocation kinds listed in the value string. See Section Memory Allocation Info. This info hint also applies to the window buffer provided in a call to MPI_WIN_CREATE or MPI_WIN_ATTACH. It does not apply to the memory allocated in a call to MPI_WIN_ALLOCATE or MPI_WIN_ALLOCATE_SHARED.
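
The predefined keys listed above are passed through an info object. The following is a minimal sketch, assuming a window over an array of double that is only ever synchronized with fences and only ever updated with MPI_SUM; the helper name create_hinted_window and the particular combination of hints are assumptions made for illustration.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: creates a window over buf with usage hints.
   The hints are promises by the application; they must hold for the
   lifetime of the window. */
static int create_hinted_window(double *buf, MPI_Aint count,
                                MPI_Comm comm, MPI_Win *win)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "no_locks", "true");          /* no passive target sync */
    MPI_Info_set(info, "accumulate_ops", "same_op"); /* one operator per address */
    MPI_Info_set(info, "same_disp_unit", "true");    /* identical on all processes */

    int err = MPI_Win_create(buf, count * (MPI_Aint)sizeof(double),
                             (int)sizeof(double), info, comm, win);
    MPI_Info_free(&info);  /* the info object is no longer needed after the call */
    return err;
}

int main(int argc, char **argv)
{
    int rank;
    double data[8] = {0.0};
    double one = 1.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    create_hinted_window(data, 8, MPI_COMM_WORLD, &win);

    /* Fence synchronization only, consistent with "no_locks". */
    MPI_Win_fence(0, win);
    /* Every process adds 1.0 to element 0 on rank 0, always with MPI_SUM,
       consistent with "same_op". */
    MPI_Accumulate(&one, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("data[0] = %g\n", data[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Because the hints are assumptions the implementation is allowed to rely on, code that later calls MPI_WIN_LOCK on such a window, or mixes operators on the same target address, would violate them.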

Advice to users.

The info query mechanism described in Section Window Info can be used to query the specified info arguments for windows that have been passed to a library. It is recommended that libraries check attached info keys for each passed window. ( End of advice to users.)

The various MPI processes in the group of comm may specify completely different target windows, in location, size, displacement units, and info arguments. As long as all the get, put, and accumulate accesses to a particular MPI process fit their specific target window, this should pose no problem. The same area in memory may appear in multiple windows, each associated with a different window object. However, concurrent communications to distinct, overlapping windows may lead to undefined results.
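
The requirement that an access "fit" its target window can be sketched as a bounds check. The function name access_fits_window is hypothetical, and the sketch assumes a contiguous access of nbytes bytes at a non-negative displacement; the size and disp_unit arguments are those the target process supplied at window creation, which may differ from process to process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if a contiguous access of nbytes bytes at
   target displacement target_disp lies entirely inside a target window
   of win_size bytes created with the given disp_unit. */
bool access_fits_window(size_t win_size, size_t disp_unit,
                        size_t target_disp, size_t nbytes)
{
    assert(disp_unit > 0);
    /* Reject displacements beyond the window before multiplying,
       so the product below cannot overflow. */
    if (target_disp > win_size / disp_unit)
        return false;
    size_t offset = disp_unit * target_disp;
    return nbytes <= win_size - offset;
}
```

For a window of ten 4-byte elements (win_size = 40, disp_unit = 4), a 4-byte access at displacement 9 fits, while an 8-byte access at the same displacement does not.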

Implementations may make the memory provided by the user available for load/store accesses by MPI processes in the same shared memory domain. A communicator of such processes can be constructed as described in Section Communicator Constructors using MPI_COMM_SPLIT_TYPE. Pointers to access a shared memory segment can be queried using MPI_WIN_SHARED_QUERY.

Rationale.

The reason for specifying the memory that may be accessed from another MPI process in an RMA operation is to permit the programmer to specify what memory can be a target of RMA operations and for the implementation to enforce that specification. For example, with this definition, a server MPI process can safely allow a client MPI process to use RMA operations, knowing that (under the assumption that the MPI implementation does enforce the specified limits on the exposed memory) an error in the client cannot affect any memory other than what was explicitly exposed. ( End of rationale.)

Advice to users.

A window can be created in any part of the MPI process memory. However, on some systems, the performance of windows in memory allocated by MPI_ALLOC_MEM (Section Memory Allocation) will be better. Also, on some systems, performance is improved when window boundaries are aligned at "natural" boundaries (word, double-word, cache line, page frame, etc.). ( End of advice to users.)

Advice to implementors.

In cases where RMA operations use different mechanisms in different memory areas (e.g., load/store accesses in a shared memory segment, and an asynchronous handler in private memory), the MPI_WIN_CREATE call needs to figure out which type of memory is used for the window. To do so, MPI maintains, internally, the list of memory segments allocated by MPI_ALLOC_MEM, or by other, implementation-specific, mechanisms, together with information on the type of memory segment allocated. When a call to MPI_WIN_CREATE occurs, then MPI checks which segment contains each window, and decides, accordingly, which mechanism to use for RMA operations.

Vendors may provide additional, implementation-specific mechanisms to allocate or to specify memory regions that are preferable for use in one-sided communication. In particular, such mechanisms can be used to place static variables into such preferred regions.

Implementors should document any performance impact of window alignment. ( End of advice to implementors.)
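
The segment lookup described in the advice above can be sketched as follows. This is not how any particular MPI library is implemented; the types mem_kind and mem_segment and the function classify_window are hypothetical, and the sketch assumes a simple array of registered segments and that a window not wholly contained in any registered segment is treated as private memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of memory segments. */
typedef enum { MEM_PRIVATE, MEM_SHARED } mem_kind;

/* Hypothetical registry entry for a segment obtained through
   MPI_ALLOC_MEM or an implementation-specific allocator. */
typedef struct {
    uintptr_t start;   /* first byte of the segment */
    size_t    length;  /* segment length in bytes */
    mem_kind  kind;    /* type of memory in this segment */
} mem_segment;

/* Returns the kind of the first registered segment that wholly contains
   [base, base + size); falls back to MEM_PRIVATE when none does. */
mem_kind classify_window(const mem_segment *segs, size_t nsegs,
                         const void *base, size_t size)
{
    assert(segs != NULL || nsegs == 0);
    uintptr_t lo = (uintptr_t)base;
    for (size_t i = 0; i < nsegs; i++) {
        const mem_segment *s = &segs[i];
        if (lo >= s->start &&
            lo - s->start <= s->length &&
            size <= s->length - (lo - s->start))
            return s->kind;
    }
    return MEM_PRIVATE;
}
```

A window-creation routine could call such a lookup once per window and select the RMA mechanism from the result; a linear scan is used here only for brevity.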

(Unofficial) MPI-4.1 of November 2, 2023
HTML Generated on November 19, 2023