8.4.2. Communicator Constructors


The following are collective functions that are invoked by all MPI processes in the group or groups associated with comm, with the exception of MPI_COMM_CREATE_GROUP, MPI_COMM_CREATE_FROM_GROUP, and MPI_INTERCOMM_CREATE_FROM_GROUPS. MPI_COMM_CREATE_GROUP and MPI_COMM_CREATE_FROM_GROUP are invoked only by the MPI processes in the group of the new communicator being constructed. MPI_INTERCOMM_CREATE_FROM_GROUPS is invoked by all the MPI processes in the local and remote groups of the new communicator being constructed. See the discussion below for the definition of local and remote groups.


Rationale.

Note that, when using the World Model, there is a chicken-and-egg aspect to MPI in that a communicator is needed to create a new communicator. In the World Model, the base communicator for all MPI communicators is predefined outside of MPI, and is MPI_COMM_WORLD. The World Model was arrived at after considerable debate, and was chosen to increase "safety" of programs written in MPI. ( End of rationale.)

This chapter presents the following communicator construction routines. MPI_COMM_CREATE, MPI_COMM_DUP, MPI_COMM_IDUP, MPI_COMM_DUP_WITH_INFO, MPI_COMM_IDUP_WITH_INFO, MPI_COMM_SPLIT, and MPI_COMM_SPLIT_TYPE can be used to create both intra-communicators and inter-communicators; MPI_COMM_CREATE_GROUP, MPI_COMM_CREATE_FROM_GROUP, and MPI_INTERCOMM_MERGE (see Section Inter-Communicator Operations) can be used to create intra-communicators; MPI_INTERCOMM_CREATE and MPI_INTERCOMM_CREATE_FROM_GROUPS (see Section Inter-Communicator Operations) can be used to create inter-communicators.

An intra-communicator involves a single group while an inter-communicator involves two groups. Where the following discussions address inter-communicator semantics, the two groups in an inter-communicator are called the left and right groups. An MPI process in an inter-communicator is a member of either the left or the right group. From the point of view of that MPI process, the group that the MPI process is a member of is called the local group; the other group (relative to that MPI process) is the remote group. The left and right group labels give us a way to describe the two groups in an inter-communicator that is not relative to any particular MPI process (as the local and remote groups are).

MPI_COMM_DUP(comm, newcomm)
IN   comm      communicator (handle)
OUT  newcomm   copy of comm (handle)
C binding
int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
Fortran 2008 binding
MPI_Comm_dup(comm, newcomm, ierror)

TYPE(MPI_Comm), INTENT(IN) :: comm
TYPE(MPI_Comm), INTENT(OUT) :: newcomm
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
Fortran binding
MPI_COMM_DUP(COMM, NEWCOMM, IERROR)

INTEGER COMM, NEWCOMM, IERROR

MPI_COMM_DUP duplicates the existing communicator comm with associated key values, topology information and error handlers. For each key value, the respective copy callback function determines the attribute value associated with this key in the new communicator; one particular action that a copy callback may take is to delete the attribute from the new communicator. MPI_COMM_DUP returns in newcomm a new communicator with the same group or groups, same topology, same error handlers and any copied cached information, but a new context (see Section Functionality). The newly created communicator will have no buffer attached (see Section Buffer Allocation and Usage).


Advice to users.

This operation is used to provide a parallel library with a duplicate communication space that has the same properties as the original communicator. This includes any attributes (see below) and topologies (see Chapter Virtual Topologies for MPI Processes). This call is valid even if there are pending point-to-point communication operations or decoupled MPI activities involving the communicator comm. A typical call might involve an MPI_COMM_DUP at the beginning of the parallel call, and an MPI_COMM_FREE of that duplicated communicator at the end of the call. Other models of communicator management are also possible.

This call applies to both intra- and inter-communicators. ( End of advice to users.)

Advice to implementors.

One need not actually copy the group information, but only add a new reference and increment the reference count. Copy on write can be used for the cached information. ( End of advice to implementors.)

MPI_COMM_DUP_WITH_INFO(comm, info, newcomm)
IN   comm      communicator (handle)
IN   info      info object (handle)
OUT  newcomm   copy of comm (handle)
C binding
int MPI_Comm_dup_with_info(MPI_Comm comm, MPI_Info info, MPI_Comm *newcomm)
Fortran 2008 binding
MPI_Comm_dup_with_info(comm, info, newcomm, ierror)

TYPE(MPI_Comm), INTENT(IN) :: comm
TYPE(MPI_Info), INTENT(IN) :: info
TYPE(MPI_Comm), INTENT(OUT) :: newcomm
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
Fortran binding
MPI_COMM_DUP_WITH_INFO(COMM, INFO, NEWCOMM, IERROR)

INTEGER COMM, INFO, NEWCOMM, IERROR

MPI_COMM_DUP_WITH_INFO behaves exactly as MPI_COMM_DUP except that the hints provided by the argument info are associated with the output communicator newcomm.


Rationale.

It is expected that some hints will only be valid at communicator creation time. However, for legacy reasons, most communicator creation calls do not provide an info argument. One may associate info hints with a duplicate of any communicator at creation time through a call to MPI_COMM_DUP_WITH_INFO. ( End of rationale.)

MPI_COMM_IDUP(comm, newcomm, request)
IN   comm      communicator (handle)
OUT  newcomm   copy of comm (handle)
OUT  request   communication request (handle)
C binding
int MPI_Comm_idup(MPI_Comm comm, MPI_Comm *newcomm, MPI_Request *request)
Fortran 2008 binding
MPI_Comm_idup(comm, newcomm, request, ierror)

TYPE(MPI_Comm), INTENT(IN) :: comm
TYPE(MPI_Comm), INTENT(OUT), ASYNCHRONOUS :: newcomm
TYPE(MPI_Request), INTENT(OUT) :: request
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
Fortran binding
MPI_COMM_IDUP(COMM, NEWCOMM, REQUEST, IERROR)

INTEGER COMM, NEWCOMM, REQUEST, IERROR

MPI_COMM_IDUP is a nonblocking variant of MPI_COMM_DUP. With the exception of its nonblocking behavior, the semantics of MPI_COMM_IDUP are as if MPI_COMM_DUP were executed at the time that MPI_COMM_IDUP is called. For example, attributes changed after MPI_COMM_IDUP will not be copied to the new communicator. All restrictions and assumptions for nonblocking collective operations (see Section Nonblocking Collective Operations) apply to MPI_COMM_IDUP and the returned request.

It is erroneous to use the communicator newcomm as an input argument to other MPI functions before the MPI_COMM_IDUP operation completes.

MPI_COMM_IDUP_WITH_INFO(comm, info, newcomm, request)
IN   comm      communicator (handle)
IN   info      info object (handle)
OUT  newcomm   copy of comm (handle)
OUT  request   communication request (handle)
C binding
int MPI_Comm_idup_with_info(MPI_Comm comm, MPI_Info info, MPI_Comm *newcomm, MPI_Request *request)
Fortran 2008 binding
MPI_Comm_idup_with_info(comm, info, newcomm, request, ierror)

TYPE(MPI_Comm), INTENT(IN) :: comm
TYPE(MPI_Info), INTENT(IN) :: info
TYPE(MPI_Comm), INTENT(OUT), ASYNCHRONOUS :: newcomm
TYPE(MPI_Request), INTENT(OUT) :: request
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
Fortran binding
MPI_COMM_IDUP_WITH_INFO(COMM, INFO, NEWCOMM, REQUEST, IERROR)

INTEGER COMM, INFO, NEWCOMM, REQUEST, IERROR

MPI_COMM_IDUP_WITH_INFO is a nonblocking variant of MPI_COMM_DUP_WITH_INFO. With the exception of its nonblocking behavior, the semantics of MPI_COMM_IDUP_WITH_INFO are as if MPI_COMM_DUP_WITH_INFO were executed at the time that MPI_COMM_IDUP_WITH_INFO is called. For example, attributes or info hints changed after MPI_COMM_IDUP_WITH_INFO will not be copied to the new communicator. All restrictions and assumptions for nonblocking collective operations (see Section Nonblocking Collective Operations) apply to MPI_COMM_IDUP_WITH_INFO and the returned request.

It is erroneous to use the communicator newcomm as an input argument to other MPI functions before the MPI_COMM_IDUP_WITH_INFO operation completes.


Rationale.

The MPI_COMM_IDUP and MPI_COMM_IDUP_WITH_INFO functions are crucial for the development of purely nonblocking libraries (see [41]). ( End of rationale.)

MPI_COMM_CREATE(comm, group, newcomm)
IN   comm      communicator (handle)
IN   group     group, which is a subset of the group of comm (handle)
OUT  newcomm   new communicator (handle)
C binding
int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
Fortran 2008 binding
MPI_Comm_create(comm, group, newcomm, ierror)

TYPE(MPI_Comm), INTENT(IN) :: comm
TYPE(MPI_Group), INTENT(IN) :: group
TYPE(MPI_Comm), INTENT(OUT) :: newcomm
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
Fortran binding
MPI_COMM_CREATE(COMM, GROUP, NEWCOMM, IERROR)

INTEGER COMM, GROUP, NEWCOMM, IERROR

If comm is an intra-communicator, this function returns a new communicator newcomm with communication group defined by the group argument. No cached information propagates from comm to newcomm and no virtual topology information is added to the created communicator. Each MPI process must call MPI_COMM_CREATE with a group argument that is a subgroup of the group associated with comm; this could be MPI_GROUP_EMPTY. The MPI processes may specify different values for the group argument. If an MPI process calls with a nonempty group then all MPI processes in that group must call the function with the same group as argument, that is the same MPI processes in the same order. Otherwise, the call is erroneous. This implies that the set of groups specified across the MPI processes must be disjoint. If the calling MPI process is a member of the group given as group argument, then newcomm is a communicator with group as its associated group. In the case that an MPI process calls with a group to which it does not belong, e.g., MPI_GROUP_EMPTY, then MPI_COMM_NULL is returned as newcomm. The function is collective and must be called by all MPI processes in the group of comm.


Figure 17: Inter-communicator creation using MPI_COMM_CREATE extended to inter-communicators. The input groups are those in the grey circle.


Rationale.

The interface supports the original mechanism from MPI-1.1, which required the same group in all MPI processes of comm. It was extended in MPI-2.2 to allow the use of disjoint subgroups in order to allow implementations to eliminate unnecessary communication that MPI_COMM_SPLIT would incur when the user already knows the membership of the disjoint subgroups. ( End of rationale.)

Rationale.

The requirement that the entire group of comm participate in the call stems from the following considerations:


( End of rationale.)

Advice to users.

MPI_COMM_CREATE provides a means to subset a group of MPI processes for the purpose of separate MIMD computation, with separate communication space. newcomm, which emerges from MPI_COMM_CREATE, can be used in subsequent calls to MPI_COMM_CREATE (or other communicator constructors) to further subdivide a computation into parallel sub-computations. A more general service is provided by MPI_COMM_SPLIT, below. ( End of advice to users.)

Advice to implementors.

When calling MPI_COMM_DUP, all MPI processes call with the same group (the group associated with the communicator). When calling MPI_COMM_CREATE, the MPI processes provide the same group or disjoint subgroups. For both calls, it is theoretically possible to agree on a group-wide unique context with no communication. However, local execution of these functions requires use of a larger context name space and reduces error checking. Implementations may strike various compromises between these conflicting goals, such as bulk allocation of multiple contexts in one collective operation.

Important: If new communicators are created without synchronizing the MPI processes involved then the communication system must be able to cope with messages arriving in a context that has not yet been allocated at the receiving MPI process. ( End of advice to implementors.)

If comm is an inter-communicator, then the output communicator is also an inter-communicator where the local group consists only of those MPI processes contained in group (see Figure 17). The group argument should only contain those MPI processes in the local group of the input inter-communicator that are to be a part of newcomm. All MPI processes in the same local group of comm must specify the same value for group, i.e., the same members in the same order. If either group does not specify at least one MPI process in the local group of the inter-communicator, or if the calling MPI process is not included in the group, MPI_COMM_NULL is returned.


Rationale.

In the case where either the left or right group is empty, a null communicator is returned instead of an inter-communicator with MPI_GROUP_EMPTY because the side with the empty group must return MPI_COMM_NULL. ( End of rationale.)

Example Inter-communicator creation.
The following example illustrates how the first node in the left side of an inter-communicator could be joined with all members on the right side of an inter-communicator to form a new inter-communicator.

[Code listing not reproduced here.]

MPI_COMM_CREATE_GROUP(comm, group, tag, newcomm)
IN   comm      intra-communicator (handle)
IN   group     group, which is a subset of the group of comm (handle)
IN   tag       tag (integer)
OUT  newcomm   new communicator (handle)
C binding
int MPI_Comm_create_group(MPI_Comm comm, MPI_Group group, int tag, MPI_Comm *newcomm)
Fortran 2008 binding
MPI_Comm_create_group(comm, group, tag, newcomm, ierror)

TYPE(MPI_Comm), INTENT(IN) :: comm
TYPE(MPI_Group), INTENT(IN) :: group
INTEGER, INTENT(IN) :: tag
TYPE(MPI_Comm), INTENT(OUT) :: newcomm
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
Fortran binding
MPI_COMM_CREATE_GROUP(COMM, GROUP, TAG, NEWCOMM, IERROR)

INTEGER COMM, GROUP, TAG, NEWCOMM, IERROR

MPI_COMM_CREATE_GROUP is similar to MPI_COMM_CREATE; however, MPI_COMM_CREATE must be called by all MPI processes in the group of comm, whereas MPI_COMM_CREATE_GROUP must be called by all MPI processes in group, which is a subgroup of the group of comm. In addition, MPI_COMM_CREATE_GROUP requires that comm is an intra-communicator. MPI_COMM_CREATE_GROUP returns a new intra-communicator, newcomm, for which the group argument defines the communication group. No cached information propagates from comm to newcomm and no virtual topology information is added to the created communicator. Each MPI process must provide a group argument that is a subgroup of the group associated with comm; this could be MPI_GROUP_EMPTY. If a nonempty group is specified, then all MPI processes in that group must call the function, and each of these MPI processes must provide the same arguments, including a group that contains the same members with the same ordering. Otherwise the call is erroneous. If the calling MPI process is a member of the group given as the group argument, then newcomm is a communicator with group as its associated group. If the calling MPI process is not a member of group, e.g., group is MPI_GROUP_EMPTY, then the call is a local operation and MPI_COMM_NULL is returned as newcomm.


Rationale.

Functionality similar to MPI_COMM_CREATE_GROUP can be implemented through repeated MPI_INTERCOMM_CREATE and MPI_INTERCOMM_MERGE calls that start with the MPI_COMM_SELF communicators at each MPI process in group and build up an intra-communicator with group group [18]. Such an algorithm requires the creation of many intermediate communicators; MPI_COMM_CREATE_GROUP can provide a more efficient implementation that avoids this overhead. ( End of rationale.)

Advice to users.

An inter-communicator can be created collectively over MPI processes in the union of the local and remote groups by creating the local communicator using MPI_COMM_CREATE_GROUP and using that communicator as the local communicator argument to MPI_INTERCOMM_CREATE. ( End of advice to users.)

The tag argument does not conflict with tags used in point-to-point communication and is not permitted to be a wildcard. If multiple threads at a given MPI process perform concurrent MPI_COMM_CREATE_GROUP operations, the user must distinguish these operations by providing different tag or comm arguments.


Advice to users.

MPI_COMM_CREATE may provide lower overhead than MPI_COMM_CREATE_GROUP because it can take advantage of collective communication on comm when constructing newcomm. ( End of advice to users.)

MPI_COMM_SPLIT(comm, color, key, newcomm)
IN   comm      communicator (handle)
IN   color     control of subset assignment (integer)
IN   key       control of rank assignment (integer)
OUT  newcomm   new communicator (handle)
C binding
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
Fortran 2008 binding
MPI_Comm_split(comm, color, key, newcomm, ierror)

TYPE(MPI_Comm), INTENT(IN) :: comm
INTEGER, INTENT(IN) :: color, key
TYPE(MPI_Comm), INTENT(OUT) :: newcomm
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
Fortran binding
MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERROR)

INTEGER COMM, COLOR, KEY, NEWCOMM, IERROR

This function partitions the group associated with comm into disjoint subgroups, one for each value of color. Each subgroup contains all MPI processes of the same color. Within each subgroup, the MPI processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old group. A new communicator is created for each subgroup and returned in newcomm. An MPI process may supply the color value MPI_UNDEFINED, in which case newcomm returns MPI_COMM_NULL. This is a collective call, but each MPI process is permitted to provide different values for color and key. No cached information propagates from comm to newcomm and no virtual topology information is added to the created communicators.

With an intra-communicator comm, a call to MPI_COMM_CREATE(comm, group, newcomm) is equivalent to a call to MPI_COMM_SPLIT(comm, color, key, newcomm), where MPI processes that are members of their group argument provide a color argument equal to the number of the group (based on a unique numbering of all disjoint groups) and a key argument equal to their rank in group, and all MPI processes that are not members of their group argument provide a color argument equal to MPI_UNDEFINED. The value of color must be nonnegative or MPI_UNDEFINED.


Advice to users.

This is an extremely powerful mechanism for dividing a single communicating group of MPI processes into k subgroups, with k chosen implicitly by the user (by the number of colors asserted over all the MPI processes). The resulting communicators do not overlap. Such a division could be useful for defining a hierarchy of computations, such as for multigrid or linear algebra.

For intra-communicators, MPI_COMM_SPLIT provides a capability similar to that of MPI_COMM_CREATE to split a communicating group into disjoint subgroups. MPI_COMM_SPLIT is useful when some MPI processes do not have complete information about the other members in their group, but all MPI processes know (the color of) the group to which they belong. In this case, the MPI implementation discovers the other group members via communication. MPI_COMM_CREATE is useful when all MPI processes have complete information about the members of their group. In this case, MPI can avoid the extra communication required to discover group membership. MPI_COMM_CREATE_GROUP is useful when all MPI processes in a given group have complete information about the members of their group and synchronization with MPI processes outside the group can be avoided.

Multiple calls to MPI_COMM_SPLIT can be used to overcome the requirement that any call have no overlap of the resulting communicators (each MPI process is of only one color per call). In this way, multiple overlapping communication structures can be created. Creative use of the color and key in such splitting operations is encouraged.

Note that, for a fixed color, the keys need not be unique. It is MPI_COMM_SPLIT's responsibility to sort MPI processes in ascending order according to this key, and to break ties in a consistent way. If all the keys are specified in the same way, then all the MPI processes in a given color will have the same relative rank order as they did in their parent group.

( End of advice to users.)
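
The ranking rule (ascending key, ties broken by rank in the parent group) can be modelled in plain C without any MPI calls. This is only a model of the resulting order, not a description of how an implementation computes it; the type member_t and the function order_members are hypothetical names.

```c
#include <stdlib.h>

/* One entry per process that supplied a given color. */
typedef struct {
    int old_rank;  /* rank in the parent communicator */
    int key;       /* key passed to the split call */
} member_t;

/* Ascending key; equal keys fall back to the rank in the parent group. */
static int compare_members(const void *a, const void *b)
{
    const member_t *x = a;
    const member_t *y = b;
    if (x->key != y->key)
        return (x->key > y->key) - (x->key < y->key);
    return (x->old_rank > y->old_rank) - (x->old_rank < y->old_rank);
}

/* After the call, members[i] describes the process with new rank i. */
void order_members(member_t *members, size_t count)
{
    qsort(members, count, sizeof *members, compare_members);
}
```

For example, four processes with parent ranks 0, 1, 2, 3 and keys 5, 2, 5, 2 receive new ranks in the order 1, 3, 0, 2.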

Rationale.

color is restricted to be nonnegative, so as not to conflict with the value assigned to MPI_UNDEFINED. ( End of rationale.)

The result of MPI_COMM_SPLIT on an inter-communicator is that those MPI processes on the left with the same color as those MPI processes on the right combine to create a new inter-communicator. The key argument describes the relative rank of MPI processes on each side of the inter-communicator (see Figure 18). For those colors that are specified only on one side of the inter-communicator, MPI_COMM_NULL is returned. MPI_COMM_NULL is also returned to those MPI processes that specify MPI_UNDEFINED as the color.

Advice to users.

For inter-communicators, MPI_COMM_SPLIT is more general than MPI_COMM_CREATE. A single call to MPI_COMM_SPLIT can create a set of disjoint inter-communicators, while a call to MPI_COMM_CREATE creates only one. ( End of advice to users.)


Figure 18: Inter-communicator construction achieved by splitting an existing inter-communicator with MPI_COMM_SPLIT extended to inter-communicators.


Example Parallel client-server model.
The following client code illustrates how clients on the left side of an inter-communicator could be assigned to a single server from a pool of servers on the right side of an inter-communicator.

[Client code listing not reproduced here.]

The following is the corresponding server code:

[Server code listing not reproduced here.]

MPI_COMM_SPLIT_TYPE(comm, split_type, key, info, newcomm)
IN     comm        communicator (handle)
IN     split_type  type of processes to be grouped together (integer)
IN     key         control of rank assignment (integer)
INOUT  info        info argument (handle)
OUT    newcomm     new communicator (handle)
C binding
int MPI_Comm_split_type(MPI_Comm comm, int split_type, int key, MPI_Info info, MPI_Comm *newcomm)
Fortran 2008 binding
MPI_Comm_split_type(comm, split_type, key, info, newcomm, ierror)

TYPE(MPI_Comm), INTENT(IN) :: comm
INTEGER, INTENT(IN) :: split_type, key
TYPE(MPI_Info), INTENT(IN) :: info
TYPE(MPI_Comm), INTENT(OUT) :: newcomm
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
Fortran binding
MPI_COMM_SPLIT_TYPE(COMM, SPLIT_TYPE, KEY, INFO, NEWCOMM, IERROR)

INTEGER COMM, SPLIT_TYPE, KEY, INFO, NEWCOMM, IERROR

This function partitions the group associated with comm into disjoint subgroups such that each subgroup contains all MPI processes in the same grouping referred to by split_type. Within each subgroup, the MPI processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old group. A new communicator is created for each subgroup and returned in newcomm. This is a collective call. All MPI processes in the group associated with comm must provide the same split_type, but each MPI process is permitted to provide different values for key. An exception to this rule is that an MPI process may supply the type value MPI_UNDEFINED, in which case MPI_COMM_NULL is returned in newcomm for such MPI process. No cached information propagates from comm to newcomm and no virtual topology information is added to the created communicators.

For split_type, the following values are defined by MPI:

MPI_COMM_TYPE_SHARED:
all MPI processes in the group of newcomm are part of the same shared memory domain and can create a shared memory segment (e.g., with a successful call to MPI_WIN_ALLOCATE_SHARED). This segment can subsequently be used for load/store accesses by all MPI processes in newcomm.


Advice to users.

Since the location of some of the MPI processes may change during the application execution, the communicators created with the value MPI_COMM_TYPE_SHARED before this change may not reflect an actual ability to share memory between MPI processes after this change. ( End of advice to users.)

MPI_COMM_TYPE_HW_GUIDED:
this value specifies that the communicator comm is split according to a hardware resource type (for example a computing core or an L3 cache) specified by the mpi_hw_resource_type info key. Each output communicator newcomm corresponds to a single instance of the specified hardware resource type. The MPI processes in the group associated with the output communicator newcomm utilize that specific hardware resource type instance, and no other instance of the same hardware resource type.

If an MPI process does not meet the above criteria, then MPI_COMM_NULL is returned in newcomm for such MPI process.

MPI_COMM_NULL is also returned in newcomm in the following cases:


The MPI implementation will return in the group of the output communicator newcomm the largest subset of MPI processes that match the splitting criterion.

The MPI processes in the group associated with newcomm are ranked in the order defined by the value of the argument key with ties broken according to their rank in the group associated with comm.


Advice to users.

The set of hardware resources that an MPI process is able to utilize may change during the application execution (e.g., because of the relocation of an MPI process), in which case the communicators created with the value MPI_COMM_TYPE_HW_GUIDED before this change may not reflect the utilization of hardware resources of such MPI process at any time after the communicator creation. ( End of advice to users.)

The user explicitly constrains the splitting of the input communicator comm with the info argument. To this end, the info key mpi_hw_resource_type is reserved and its associated value is an implementation-defined string designating the type of the requested hardware resource (e.g., "NUMANode", "Package" or "L3Cache").

The value mpi_shared_memory is reserved and its use is equivalent to using MPI_COMM_TYPE_SHARED for the split_type parameter.

Rationale.

The value mpi_shared_memory is defined in order to ensure consistency between the use of MPI_COMM_TYPE_SHARED and the use of MPI_COMM_TYPE_HW_GUIDED. ( End of rationale.)

All MPI processes must provide the same value for the info key mpi_hw_resource_type.


Example Splitting MPI_COMM_WORLD into NUMANode subcommunicators.

[Code listing not reproduced here.]

MPI_COMM_TYPE_RESOURCE_GUIDED:
this value specifies that the communicator comm is split according to a hardware resource type (for example a computing core or an L3 cache) specified by the mpi_hw_resource_type info key or to a logical resource type (for example a process set name, see Section Processes Sets) specified by the mpi_pset_name info key.

Each output communicator newcomm corresponds to a single instance of the specified resource type. The MPI processes in the group associated with the output communicator newcomm utilize that specific resource type instance, and no other instance of the same resource type.

If an MPI process does not meet the above criteria, then MPI_COMM_NULL is returned in newcomm for such process.

MPI_COMM_NULL is also returned in newcomm in the following cases:


The MPI implementation will return in the group of the output communicator newcomm the largest subset of MPI processes that match the splitting criterion.


Advice to users.

The set of resources that an MPI process is able to utilize may change during the application execution (e.g., because of the relocation of an MPI process), in which case the communicators created with the value MPI_COMM_TYPE_RESOURCE_GUIDED before this change may not reflect the utilization of resources of such process at any time after the communicator creation. ( End of advice to users.)

The user explicitly constrains the splitting of the input communicator comm with the info argument. To this end, the following info keys are reserved and their associated values are implementation-defined strings designating the type of the requested resource. Only one of these info keys can be used in info at a time in a call to MPI_COMM_SPLIT_TYPE; use of more than one info key is erroneous.

mpi_hw_resource_type is used to specify the type of a requested hardware resource (e.g., "NUMANode", "Package" or "L3Cache"). The value mpi_shared_memory is reserved and its use is equivalent to using MPI_COMM_TYPE_SHARED for the split_type parameter.

Rationale.

The value mpi_shared_memory is defined in order to ensure consistency between the use of MPI_COMM_TYPE_SHARED and the use of MPI_COMM_TYPE_RESOURCE_GUIDED. ( End of rationale.)

All MPI processes in the group of the input communicator comm must provide the same info key to perform the splitting action. All MPI processes in the group of the input communicator comm must provide the same value for the info key mpi_hw_resource_type.

mpi_pset_name is used to specify the type of a requested logical resource through the utilization of a process set name (e.g., "app://ocean" or "app://atmos"). This process set name must be valid in the session from which the input communicator comm is derived. If this input communicator is not derived from a session, then MPI_COMM_NULL is returned in newcomm.

All MPI processes that are both in the group of the input communicator comm and in the process set identified by the given process set name must provide the same info key to perform the splitting action. All MPI processes that are both in the group of the input communicator comm and in the process set identified by the given process set name must provide the same value for the info key mpi_pset_name.



Example Splitting MPI_COMM_WORLD into NUMANode subcommunicators.

[Code listing not reproduced here.]

MPI_COMM_TYPE_HW_UNGUIDED:
the group of MPI processes associated with newcomm must be a strict subset of the group associated with comm and each newcomm corresponds to a single instance of a hardware resource type (for example a computing core or an L3 cache).

All MPI processes in the group associated with comm that utilize that specific hardware resource type instance, and no other instance of the same hardware resource type, are included in the group of newcomm.

If a given MPI process cannot be a member of a communicator that forms such a strict subset, or does not meet the above criteria, then MPI_COMM_NULL is returned in newcomm for this process.


Advice to implementors.

In a high-quality MPI implementation, the number of different new valid communicators newcomm produced by this splitting operation should be minimal unless the user provides a key/value pair that modifies this behavior. The sets of hardware resource types used for the splitting operation are implementation-dependent, but should reflect the hardware of the actual system on which the application is currently executing. ( End of advice to implementors.)

Rationale.

If the hardware resources are hierarchically organized, calling this routine several times using as its input communicator comm the output communicator newcomm of the previous call creates a sequence of newcomm communicators in each MPI process, which exposes a hierarchical view of the hardware platform, as shown in the example "Recursive splitting of MPI_COMM_WORLD" below. This sequence of returned newcomm communicators may differ from the sets of hardware resource types, as shown in the second splitting operation in Figure 19. ( End of rationale.)

Advice to users.

Each output communicator newcomm can represent a different hardware resource type (see Figure 19 for an example). The set of hardware resources an MPI process utilizes may change during the application execution (e.g., because of MPI process relocation), in which case the communicators created with the value MPI_COMM_TYPE_HW_UNGUIDED before this change may not reflect the utilization of hardware resources for such MPI process at any time after the communicator creation. ( End of advice to users.)


Figure 19: Recursive splitting of MPI_COMM_WORLD with MPI_COMM_SPLIT_TYPE and MPI_COMM_TYPE_HW_UNGUIDED. Dashed lines represent communicators whilst solid lines represent hardware resources. MPI processes (P0 to P11) utilize exclusively their respective core, except for P6 and P7, which utilize CPU #3 of Rack #0 and can therefore use Cores #6 and #7 indifferently. The second splitting operation yields two subcommunicators corresponding to NUMANodes in Rack #0 and to CPUs in Rack #1 because Rack #1 features only one NUMANode, which corresponds to the whole portion of the Rack that is included in MPI_COMM_WORLD and hwcomm[1]. For the first splitting operation, the hardware resource type returned in the info argument is "Rack" on the MPI processes on Rack #0, whereas on Rack #1, it can be either "Rack" or "NUMANode".

If a valid info handle is provided as an argument, the MPI implementation sets the info key mpi_hw_resource_type for each MPI process in the group associated with a returned newcomm communicator and the info key value is an implementation-defined string that indicates the hardware resource type represented by newcomm. The same hardware resource type must be set in all MPI processes in the group associated with newcomm.


Example Recursive splitting of MPI_COMM_WORLD.


Advice to implementors.

Implementations can define their own split_type values, or use the info argument, to assist in creating communicators that help expose platform-specific information to the application. The concept of hardware-based communicators was first described by Träff [68] for SMP systems. The guided and unguided modes, together with an implementation path, are described by Goglin et al. [28]. ( End of advice to implementors.)

MPI_COMM_CREATE_FROM_GROUP(group, stringtag, info, errhandler, newcomm)
IN   group        group (handle)
IN   stringtag    unique identifier for this operation (string)
IN   info         info object (handle)
IN   errhandler   error handler to be attached to new intra-communicator (handle)
OUT  newcomm      new communicator (handle)
C binding
int MPI_Comm_create_from_group(MPI_Group group, const char *stringtag, MPI_Info info, MPI_Errhandler errhandler, MPI_Comm *newcomm)
Fortran 2008 binding
MPI_Comm_create_from_group(group, stringtag, info, errhandler, newcomm, ierror)

TYPE(MPI_Group), INTENT(IN) :: group
CHARACTER(LEN=*), INTENT(IN) :: stringtag
TYPE(MPI_Info), INTENT(IN) :: info
TYPE(MPI_Errhandler), INTENT(IN) :: errhandler
TYPE(MPI_Comm), INTENT(OUT) :: newcomm
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
Fortran binding
MPI_COMM_CREATE_FROM_GROUP(GROUP, STRINGTAG, INFO, ERRHANDLER, NEWCOMM, IERROR)

INTEGER GROUP, INFO, ERRHANDLER, NEWCOMM, IERROR
CHARACTER*(*) STRINGTAG

MPI_COMM_CREATE_FROM_GROUP is similar to MPI_COMM_CREATE_GROUP, except that the set of MPI processes involved in the creation of the new intra-communicator is specified by a group argument, rather than by the group associated with a pre-existing communicator. If a nonempty group is specified, then all MPI processes in that group must call the function, and each of these MPI processes must provide the same arguments, including a group that contains the same members with the same ordering and an identical stringtag value. If MPI_GROUP_EMPTY is supplied as the group argument, the call is a local operation and MPI_COMM_NULL is returned as newcomm. The stringtag argument is analogous to the tag used for MPI_COMM_CREATE_GROUP. If multiple threads at a given MPI process perform concurrent MPI_COMM_CREATE_FROM_GROUP operations, the user must distinguish these operations by providing different stringtag arguments. The stringtag shall not exceed MPI_MAX_STRINGTAG_LEN characters in length. For C, this includes space for a null terminating character. MPI_MAX_STRINGTAG_LEN shall have a value of at least 63.

The errhandler argument specifies an error handler to be attached to the new intra-communicator. Section Error Handling specifies the error handler to be invoked if an error is encountered during the invocation of MPI_COMM_CREATE_FROM_GROUP.

The info argument provides hints and assertions, possibly MPI implementation dependent, which indicate desired characteristics and guide communicator creation.


Advice to users.

The stringtag argument is used to distinguish concurrent communicator construction operations issued by different entities. As such, it is important to ensure that this argument is unique for each concurrent call to MPI_COMM_CREATE_FROM_GROUP. Reverse domain name notation convention [2] is one approach to constructing unique stringtag arguments. See also example Sessions Model Examples. ( End of advice to users.)
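
As an illustration of the reverse domain name approach, the sketch below composes a stringtag from a reversed domain, a component name, and an instance number, and rejects results that would not fit. The helper make_stringtag and the constant STRINGTAG_CAPACITY are hypothetical; the capacity of 63 is an assumption based on the minimum value guaranteed for MPI_MAX_STRINGTAG_LEN, and code that includes mpi.h would normally use that constant directly.

```c
#include <stdio.h>

/* Assumption: 63 is the smallest value MPI_MAX_STRINGTAG_LEN may have, so a
 * tag that fits here (including the terminating null) fits everywhere. */
#define STRINGTAG_CAPACITY 63

/* Hypothetical helper: writes "<reverse_domain>.<component>.<instance>" into
 * tag. Returns 0 on success, or -1 if the result would be truncated. */
int make_stringtag(char tag[STRINGTAG_CAPACITY], const char *reverse_domain,
                   const char *component, unsigned instance)
{
    int n = snprintf(tag, STRINGTAG_CAPACITY, "%s.%s.%u",
                     reverse_domain, component, instance);
    return (n < 0 || n >= STRINGTAG_CAPACITY) ? -1 : 0;
}
```

For example, make_stringtag(tag, "org.example", "solver", 2) yields the string "org.example.solver.2". Varying the instance number is one way for concurrent calls issued by the same component to remain distinguishable.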



(Unofficial) MPI-4.1 of November 2, 2023
HTML Generated on November 19, 2023