generalized requests

Steve Huss-Lederman (lederman@super.org)
Wed, 24 May 1995 15:19:51 -0400

This is a rather long note. I hope it will begin a more serious
discussion of the generalized requests. I welcome comments now. I
suspect a modified version of this will be the basis of the discussion
at the next MPI meeting. I think it might be best to take a step back
from the previous document and start the next discussion there.

Here is a summary of the questions raised:

- When the system calls the complete_func, how can you tell which
request just finished? You can decode the status returned but we may
want other mechanisms. One problem I see is that to look at the
status of a recv you use MPI_GET_COUNT, etc but to look at a send you
use MPI_TEST_CANCELLED. It seems you need to know which it is before
you test it. (I didn't remember it worked this way but p. 39 after
MPI_WAIT says this.) In Marc's proposal you have an array_of_requests
and array_of_statuses. The application could keep track of which
index in the request/status array corresponds to which type of
request. This might require ugly bookkeeping and may also limit the
ability to reuse array locations (which might require larger arrays).

- When is the request associated with operations started internally to
the request freed? This is normally done by WAIT/TEST but now you
don't do this. When the operation completes, it calls your complete
function. My guess would be that the system would deallocate it
before the call to the complete routine (assuming it is not
persistent).

- The current Wait/Test returns a status. It is not clear what is
returned in status when you wait on a generalized request. With a
non-blocking recv you get count, source, tag. Since the generalized
request may well have done multiple operations this type of information
may be irrelevant. Should we return dummy values? Something else?

- The thread safety of the funcs seems important. In my example, you
need to have both the send and recv finish before you do certain
operations. This is accomplished by updating variables in the
extra_state structure. I believe this can fail if the first call is
in the process of updating when the second call finishes. The
finishing of the second call could cause another system interrupt to
process the second call. Without a lock to protect these variables,
bad things could happen. This is really part of the can of worms
raised in allowing system interrupts that we have started with MPI-2.
How do user functions protect critical sections and set locks?

- Since these operations are non-blocking, there may be a number of
issues that come up if several requests of the same type are going on
simultaneously. It seems that the extra_state must be unique with
each start call. Now we attach the extra_state at the time of the
init call. I don't see how one could do multiple starts after a
single init because there is only one extra state and that could get
jumbled between the different simultaneous operations. Being able to
do this seems desirable because in the example you could reuse the
persistent requests if you were sure that two calls did not
simultaneously use them. On the other hand, you need to be able to
return information from the init call and this is why the extra_state
is passed there. We could avoid some of this if we only allow one
outstanding operation per request, but this would limit our ability
to avoid certain startup costs.

There is undoubtedly more, but this is enough for now.
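
To make the thread-safety question above concrete, here is a minimal
sketch of one way the "both send and recv finished" test could be made
safe without a lock: keep a count of outstanding operations in the
extra_state and decrement it atomically. The names stage_sync,
stage_sync_arm and stage_sync_complete_one are hypothetical, nothing
here is part of the proposal, and it assumes a C11 compiler with
<stdatomic.h>. It only protects the counter; any other shared fields
would still need their own protection.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical piece of extra_state: number of operations still in
   flight for the current stage. */
struct stage_sync {
    atomic_int outstanding;
};

/* Call before starting the operations of a stage (n_ops = 2 for one
   send plus one recv). */
void stage_sync_arm(struct stage_sync *s, int n_ops)
{
    assert(n_ops > 0);
    atomic_store(&s->outstanding, n_ops);
}

/* Call once from each completion. Returns 1 for exactly one caller,
   the one whose completion was the last of the stage, and 0 for all
   others, even if the calls overlap in time. */
int stage_sync_complete_one(struct stage_sync *s)
{
    /* atomic_fetch_sub returns the value held before the decrement */
    return atomic_fetch_sub(&s->outstanding, 1) == 1;
}
```

With this, only the completion that sees a return value of 1 would
copy the recv buffer into the send buffer and start the next stage.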

----------------------------------------------------------------------

Below are the details of my thinking through an example. It is
generally based on the discussion of generalized requests handed out
and discussed at the last meeting.

Objective: Perform a non-blocking MPI_ALLGATHER. Do this with a
generalized request by performing p - 1 wrapped shifts of data (p = #
processes in the communicator).

More specifically, assume each process i has an array of size a[p] and
its value is in a[i]. The algorithm works by accepting data from
process (i - 1) % p and sending data to (i + 1) % p. At each step,
process i forwards the data it received in the last step. Once you
shift p - 1 times, each process has a copy of what was stored on all
other processes. Use persistent communication requests to send and
receive the data. To accomplish this, have a separate send and recv
buffer. Initialize the send buffer on process i to be a[i]. After
the send and recv have completed, copy the recv buffer into array a and
also into the send buffer and repeat the process until done. Blocking
C code is below. (This may not be the normal way to do this but I want
to utilize different features for discussion's sake. Most people would
probably do a sendrecv directly into the correct locations in array a.)

/* assumes: MPI_Comm comm; int p, i, from, to, c1; double as, ar;
   double a[] with p entries, zeroed; MPI_Request req_recv, req_send;
   MPI_Status status_recv, status_send */
MPI_Comm_size(comm, &p);
MPI_Comm_rank(comm, &i);
/* init your location in array */
a[i] = i;
/* set initial send value */
as = a[i];
/* process to recv from and persistent request */
from = (i - 1 + p) % p;   /* add p so the C remainder is not negative */
MPI_Recv_init(&ar, 1, MPI_DOUBLE, from, 13, comm, &req_recv);
/* process to send to and persistent request */
to = (i + 1) % p;
MPI_Send_init(&as, 1, MPI_DOUBLE, to, 13, comm, &req_send);
/* do send and recv p - 1 times */
for (c1 = 0; c1 < p - 1; c1++) {
    /* begin persistent send and recv */
    MPI_Start(&req_recv);
    MPI_Start(&req_send);
    /* when recv completes, copy result into final array a */
    MPI_Wait(&req_recv, &status_recv);
    a[(i - 1 - c1 + p) % p] = ar;   /* add p to make sure mod is positive */
    /* when send completes, copy recv buf into send buf for next round */
    MPI_Wait(&req_send, &status_send);
    as = ar;
}
/* free up persistent requests */
MPI_Request_free(&req_recv);
MPI_Request_free(&req_send);

At the start, a[i] = i and a[j] = 0 (j != i) on process i.
At the end, a[j] = j on all processes.

As an example, with p = 4, process 1 does:

c1 send recv
-- ---- ----
0 a[1] a[0]
1 a[0] a[3]
2 a[3] a[2]
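
To check the index arithmetic without running MPI, here is a minimal
sketch that simulates the p - 1 wrapped shifts for all processes inside
one ordinary C program. The function simulate_ring_allgather and the
bound MAX_P are hypothetical names introduced only for this
illustration; the "communication" is just a read of the left
neighbour's send buffer.

```c
#include <assert.h>

#define MAX_P 16

/* Simulate the p - 1 wrapped shifts for all p processes at once.
   result[i][j] is element a[j] as seen by simulated process i.
   Returns 0 on success, -1 if p is out of range. */
int simulate_ring_allgather(int p, double result[MAX_P][MAX_P])
{
    double as[MAX_P], ar[MAX_P];
    int i, j, c1;

    if (p < 1 || p > MAX_P)
        return -1;

    for (i = 0; i < p; i++) {
        for (j = 0; j < p; j++)
            result[i][j] = 0.0;
        result[i][i] = (double)i;   /* a[i] = i */
        as[i] = result[i][i];       /* initial send value */
    }

    for (c1 = 0; c1 < p - 1; c1++) {
        /* every process receives what its left neighbour sent */
        for (i = 0; i < p; i++)
            ar[i] = as[(i - 1 + p) % p];
        /* store the received value, then forward it in the next round */
        for (i = 0; i < p; i++) {
            result[i][(i - 1 - c1 + p) % p] = ar[i];
            as[i] = ar[i];
        }
    }
    return 0;
}
```

For p = 4 the row result[1] is filled in the order a[0], a[3], a[2],
which matches the recv column of the table above.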

Now to do this with a non-blocking generalized request. Create a
persistent communication request to send to (i + 1) % p and a request
to recv from (i - 1) % p. The algorithm is demand driven. Begin by
starting the persistent recv and send. Each time one of these
finishes, you want MPI to call your complete routine. When the recv
finishes you want to copy the data into the correct location in the
array a. If the send is done, then you copy the recv data into the
send buffer and start a new send and recv if you have not finished.
If not, you return. When the send completes: if the recv is done
then you copy the recv'd data into the send buffer. You then start a
new send and recv if you have not finished. If not, you return. This
seems overly complicated but you cannot copy from the recv buf to the
send buf until both are done and then you can start a new operation.
Since the system calls the complete_func for each request, this makes
it more complicated to relate them.
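
The "wait until both are done, then start the next stage" logic can be
separated from the MPI calls. Below is a minimal sketch of just that
bookkeeping in plain C. The names stage_state, stage_begin and
stage_on_complete are hypothetical, and the data copies and MPI_Start
calls are left to the caller. It assumes completions are delivered one
at a time.

```c
#include <assert.h>

enum op_kind { OP_RECV, OP_SEND };
enum stage_result { STAGE_PENDING, STAGE_NEXT, ALL_DONE };

struct stage_state {
    int recv_active;   /* 1 while the current recv is outstanding */
    int send_active;   /* 1 while the current send is outstanding */
    int num_done;      /* number of completed stages */
    int p;             /* number of processes */
};

/* Mark both operations of a new stage as outstanding. */
void stage_begin(struct stage_state *s)
{
    s->recv_active = 1;
    s->send_active = 1;
}

/* Record that one operation finished.
   STAGE_PENDING: the other operation of this stage is still active.
   STAGE_NEXT:    both finished, a new stage has been marked active;
                  the caller copies ar into as and restarts both.
   ALL_DONE:      p - 1 stages have completed. */
enum stage_result stage_on_complete(struct stage_state *s, enum op_kind kind)
{
    /* the operation reported as complete must have been outstanding */
    assert(kind == OP_RECV ? s->recv_active : s->send_active);

    if (kind == OP_RECV)
        s->recv_active = 0;
    else
        s->send_active = 0;

    if (s->recv_active || s->send_active)
        return STAGE_PENDING;

    s->num_done++;
    if (s->num_done == s->p - 1)
        return ALL_DONE;

    stage_begin(s);
    return STAGE_NEXT;
}
```

Both completion orders go through the same code path, which avoids
writing the stage-advance logic twice.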

Below is very tentative pseudo-code. It gives an idea of what I think
would be going on. We have yet to really decide how this will work.
There are likely to be other arguments to the functions.
The extra_state would hold: a, ar, as, req_recv, req_send, p, i,
recv_active, send_active, num_done.

The main routine which calls the non-blocking operation would do:

/* this supplies the init, start, complete, and finalize functions
for the request. MPI returns in type_req the new request */
MPI_Request_type_create(init_func, start_func, complete_func,
finalize_func, type_req)

/* set up the information needed by the calls, i.e. fill in the
extra_state. All of it needs to go into some sort of structure
that extra_state points to. This is not shown */

/* Initialize the request. This MPI routine will call the init_func.
I assume that MPI sets the value of request. */
MPI_Request_init(type_req, extra_state, comm, request)

/* start the non-blocking operation. This calls start_func. */
MPI_Start(request);

/* do other stuff that can go on while non-blocking operation is
going. This is not shown. */

/* now wait until non-blocking operation is done */
/* it is not clear what is put into status */
MPI_Wait(request, status);

/* done with request, free up. This calls finalize_func. */
MPI_Request_free(request);

/* done with type_req, free up */
MPI_Type_request_free(type_req);


The functions for the generalized request are:

init_func(comm, *extra_state)
{

MPI_Comm_size(comm, &p);
MPI_Comm_rank(comm, &i);
/* process to recv from and persistent request */
from = (i - 1 + p) % p;   /* add p so the C remainder is not negative */
MPI_Recv_init(&ar, 1, MPI_DOUBLE, from, 13, comm, &req_recv);
/* process to send to and persistent request */
to = (i + 1) % p;
MPI_Send_init(&as, 1, MPI_DOUBLE, to, 13, comm, &req_send);
}

start_func(comm, *array_of_reqs, count, *extra_state)
{
/* init your location in array */
a[i] = i;
/* set initial send value */
as = a[i];
/* haven't done any yet */
num_done = 0;

/* begin persistent send and recv */
MPI_Start(&req_recv);
recv_active = 1;
MPI_Start(&req_send);
send_active = 1;
/* return in array_of_reqs the requests you started and how many */
array_of_reqs[0] = req_recv;
array_of_reqs[1] = req_send;
count = 2;
}

Each time a request finishes (req), your complete routine gets called.
It returns true when the non-blocking operation is done:

complete_func(comm, req, *array_of_reqs, count, *extra_state)
{

/* what you want to do here is see if the req that finished was a send
or recv but I don't know how to do this. */
if (req is the persistent recv) {
    /* copy the recv'd data into the array a */
    a[(i - 1 - num_done + p) % p] = ar; /* add p to make sure mod is positive */
    /* see if send has finished */
    if (send_active) {
        /* send is not done. note that the recv is done and return not done */
        recv_active = 0;
        return 0;
    }
    else {
        /* send is done. note one more stage complete. See if done */
        num_done++;
        if (num_done == p - 1) {
            /* done, return done code */
            return 1;
        }
        else {
            /* not done, copy recv'd data to send buf and start new stage */
            as = ar;
            /* begin persistent send and recv */
            MPI_Start(&req_recv);
            recv_active = 1;
            MPI_Start(&req_send);
            send_active = 1;
            /* return in array_of_reqs the requests you started and how many */
            array_of_reqs[0] = req_recv;
            array_of_reqs[1] = req_send;
            count = 2;
            /* a new stage is running, so the operation is not done */
            return 0;
        }
    }
}
else {
    /* the req that finished was a send */
    /* see if recv is done */
    if (recv_active) {
        /* recv not done, note send done and return not done */
        send_active = 0;
        return 0;
    }
    else {
        /* recv is done. note one more stage complete. See if done */
        num_done++;
        if (num_done == p - 1) {
            /* done, return done code */
            return 1;
        }
        else {
            /* not done, copy recv'd data to send buf and start new stage */
            as = ar;
            /* begin persistent send and recv */
            MPI_Start(&req_recv);
            recv_active = 1;
            MPI_Start(&req_send);
            send_active = 1;
            /* return in array_of_reqs the requests you started and how many */
            array_of_reqs[0] = req_recv;
            array_of_reqs[1] = req_send;
            count = 2;
            /* a new stage is running, so the operation is not done */
            return 0;
        }
    }
}
}
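
One possible answer to the "req is the persistent recv" test above is
the bookkeeping mentioned in the questions: remember, for every request
you start, what kind of operation it was, and look the finished request
up by its handle. A minimal sketch follows. The type request_t is a
hypothetical stand-in for MPI_Request so the code compiles without MPI,
the names req_table, req_table_add and req_table_lookup are made up for
this illustration, and it assumes handles can be compared with ==.

```c
#include <assert.h>
#include <stddef.h>

typedef int request_t;   /* hypothetical stand-in for MPI_Request */

enum req_kind { KIND_UNKNOWN, KIND_RECV, KIND_SEND };

#define MAX_TRACKED 8

/* Parallel arrays: handle[n] was started as an operation of kind[n]. */
struct req_table {
    request_t handle[MAX_TRACKED];
    enum req_kind kind[MAX_TRACKED];
    size_t count;
};

/* Record a started request. Returns 0, or -1 if the table is full. */
int req_table_add(struct req_table *t, request_t h, enum req_kind k)
{
    if (t->count == MAX_TRACKED)
        return -1;
    t->handle[t->count] = h;
    t->kind[t->count] = k;
    t->count++;
    return 0;
}

/* Find out what kind of operation a finished request was. */
enum req_kind req_table_lookup(const struct req_table *t, request_t h)
{
    for (size_t n = 0; n < t->count; n++)
        if (t->handle[n] == h)
            return t->kind[n];
    return KIND_UNKNOWN;
}
```

In the example only two handles are ever live, so the table could live
in extra_state next to req_recv and req_send. This is exactly the sort
of bookkeeping that another mechanism might make unnecessary.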

finalize_func(comm, req, *extra_state)
{
/* free up the persistent communications requests */
MPI_Request_free(&req_recv);
MPI_Request_free(&req_send);
}

Note some changes from the last draft of generalized handles:

- I do not pass extra_state to the MPI_Request_type_create. My
rationale is that extra_state may hold information that is specific to
each starting of the non-blocking operation. You may create one type
and do many inits with different extra_state info. You could change
the info in the extra_state structure but it is not clear to me what
the MPI_Request_type_create would do with the extra_state anyway.
Does the extra_state binding really have to happen here?
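
To illustrate the distinction between information that is fixed once
and information that is specific to each start, here is a minimal
sketch in plain C that splits the extra_state of the example into two
pieces. The names op_config, op_instance, op_instance_start and
op_instance_finish are hypothetical and not part of any proposal. The
sketch shows only the separation of state; it does not address whether
the persistent requests themselves could be shared between
simultaneous operations.

```c
#include <assert.h>
#include <stdlib.h>

/* Fixed once, e.g. at init time: safe to share between operations. */
struct op_config {
    int p;      /* number of processes in the communicator */
    int rank;   /* this process, called i in the example */
};

/* Specific to one started operation: must not be shared. */
struct op_instance {
    const struct op_config *cfg;
    int num_done;
    int recv_active;
    int send_active;
    double as;   /* send buffer */
    double ar;   /* recv buffer */
};

/* Allocate fresh per-start state. Returns NULL if allocation fails. */
struct op_instance *op_instance_start(const struct op_config *cfg)
{
    struct op_instance *inst = malloc(sizeof *inst);
    if (inst == NULL)
        return NULL;
    inst->cfg = cfg;
    inst->num_done = 0;
    inst->recv_active = 1;
    inst->send_active = 1;
    inst->as = (double)cfg->rank;   /* initial send value a[i] = i */
    inst->ar = 0.0;
    return inst;
}

/* Release the per-start state once the operation has completed. */
void op_instance_finish(struct op_instance *inst)
{
    free(inst);
}
```

Two operations started from the same op_config then update separate
counters and flags, so their progress cannot get jumbled.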