COLL framework needs work to support MPI Bigcount #12336

hppritcha · 2024-02-14T17:18:58Z

In the course of work done in #12226 it was discovered that, unlike the PML API, the COLL API is not ready for big count.
One option would be to extend the existing table of coll methods with separate entry points for the large count functions. The advantage is that big count support would initially have to be implemented only in the basic and perhaps the tuned components. The downside is that it would roughly double the size of the mca_coll_base_comm_coll_t struct.

The alternative is to generalize the definitions of all the existing methods to support big count. The downside there is that we would need to go into every existing component and make sure its collective methods can handle MPI_Count and MPI_Aint, and, where a method cannot be modified to support big count, disqualify that component's implementation of that particular collective operation.
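To make the first option concrete, here is a minimal sketch of a method table with a second, big count slot per collective. Every name in it (`demo_coll_table_t`, `demo_bcast_dispatch`, the `demo_count_t` stand-in for MPI_Count, the return tags) is hypothetical and is not the actual Open MPI layout; the fallback to the int entry point when the count fits is also an assumption, not something decided in this issue.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Stand-in for MPI_Count so the sketch builds without mpi.h (assumption). */
typedef long long demo_count_t;

typedef int (*demo_bcast_fn)(void *buf, int count);
typedef int (*demo_bcast_c_fn)(void *buf, demo_count_t count);

/* Hypothetical method table: each collective gets a second slot,
 * which is why the struct roughly doubles in size. */
typedef struct {
    demo_bcast_fn   bcast;    /* existing int-count entry point */
    demo_bcast_c_fn bcast_c;  /* big count entry point, NULL if unsupported */
} demo_coll_table_t;

/* Toy implementations; the return value only tags which path ran. */
int demo_basic_bcast(void *buf, int count)
{
    (void)buf; (void)count;
    return 1;
}

int demo_basic_bcast_c(void *buf, demo_count_t count)
{
    (void)buf; (void)count;
    return 2;
}

/* Prefer the big count slot; otherwise use the int slot only when the
 * count fits in an int; otherwise report that nothing can handle it. */
int demo_bcast_dispatch(const demo_coll_table_t *t, void *buf,
                        demo_count_t count)
{
    assert(t != NULL);
    if (t->bcast_c != NULL) {
        return t->bcast_c(buf, count);
    }
    if (t->bcast != NULL && count >= 0 && count <= INT_MAX) {
        return t->bcast(buf, (int)count);
    }
    return -1; /* hypothetical "not supported" code */
}
```

A component that only fills in `bcast` keeps working for counts that fit in an int, which is what would let big count support start in just one or two components.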

Related to issue #9194 and PR #12226.

We do not plan to include this work in PR #12226, as it is already complex enough and is really targeted at the infrastructure for generating the _c MPI C entry points for big count and the long-overdue, correct TS 29113 entry points for Fortran F08 (along with big count support on the Fortran side).


bosilca · 2024-02-14T18:16:10Z

Not all APIs will need to be doubled. Every API that handles a single count and displacement (per buffer) can simply be extended to take the larger count type. However, the APIs that take arrays of counts and displacements, where the access is more complicated (allgatherv, alltoallv, alltoallw, gatherv, reduce_scatter, scatterv, plus the non-blocking and persistent versions), will need to be doubled.
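The distinction can be illustrated with a small sketch. A single int count converts implicitly to a wider type at the call site, so one wide implementation serves both callers. An int array cannot be passed where an array of wider elements is expected, because the element sizes differ. The names below (`demo_count_t` as a stand-in for MPI_Count, `demo_total_bytes`, `demo_sum_counts`, `demo_sum_counts_int`) are hypothetical, and the widening copy is shown only to make the size mismatch concrete; it is not what the comment proposes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for MPI_Count so the sketch builds without mpi.h (assumption). */
typedef long long demo_count_t;

/* Single count: an int argument is converted to demo_count_t on the call,
 * so this one function serves small and big count callers alike. */
demo_count_t demo_total_bytes(demo_count_t count, size_t elem_size)
{
    return count * (demo_count_t)elem_size;
}

/* Array of counts: written against the wide element type. */
demo_count_t demo_sum_counts(const demo_count_t *counts, int nprocs)
{
    demo_count_t total = 0;
    for (int i = 0; i < nprocs; ++i) {
        total += counts[i];
    }
    return total;
}

/* An int array has a different element size, so it cannot simply be cast
 * to const demo_count_t *. Here it is widened element by element. */
demo_count_t demo_sum_counts_int(const int *counts, int nprocs)
{
    assert(counts != NULL && nprocs > 0);
    demo_count_t *wide = malloc((size_t)nprocs * sizeof *wide);
    if (wide == NULL) {
        return -1; /* hypothetical error value */
    }
    for (int i = 0; i < nprocs; ++i) {
        wide[i] = counts[i];
    }
    demo_count_t total = demo_sum_counts(wide, nprocs);
    free(wide);
    return total;
}
```

The extra allocation and copy on every call is one reason the array-based collectives are harder to extend than the single-count ones.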

This path is a nightmare: it will basically force us to maintain two copies of the same, already complex code just to cope with the difference in count and displacement types. I think I would prefer to change the MCA coll type for counts/disps to void* and use macros to compute the right value, and then compile the code twice (once with int/int and once with MPI_Count/MPI_Aint).
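A rough sketch of that idea, under stated assumptions: the counts and displacements arrive as void*, and a macro stamps out the same body once per type pair. In a real build the same effect could come from compiling one source file twice with different preprocessor definitions; a generator macro is used here only so the example fits in one file. All names (`DEMO_DEFINE_MAX_END`, `demo_max_end_int`, `demo_max_end_big`, and the `demo_count_t` / `demo_disp_t` stand-ins for MPI_Count / MPI_Aint) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch builds without mpi.h (assumptions). */
typedef long long demo_count_t;  /* plays the role of MPI_Count */
typedef long long demo_disp_t;   /* plays the role of MPI_Aint  */

/* One source body, instantiated per count/displacement type. The function
 * returns the largest disps[i] + counts[i], i.e. the end of the furthest
 * block, in elements. */
#define DEMO_DEFINE_MAX_END(suffix, COUNT_T, DISP_T)                        \
    demo_count_t demo_max_end_##suffix(const void *counts,                  \
                                       const void *disps, int nprocs)       \
    {                                                                       \
        const COUNT_T *c = (const COUNT_T *)counts;                         \
        const DISP_T *d = (const DISP_T *)disps;                            \
        demo_count_t max_end = 0;                                           \
        for (int i = 0; i < nprocs; ++i) {                                  \
            demo_count_t end = (demo_count_t)d[i] + (demo_count_t)c[i];     \
            if (end > max_end) {                                            \
                max_end = end;                                              \
            }                                                               \
        }                                                                   \
        return max_end;                                                     \
    }

/* "Compiled twice": once with int/int, once with the wide types. */
DEMO_DEFINE_MAX_END(int, int, int)
DEMO_DEFINE_MAX_END(big, demo_count_t, demo_disp_t)
```

Both instantiations share one signature, so a method table could hold either behind the same function pointer type while the algorithm itself exists only once in source form.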

hppritcha · 2024-02-14T18:24:45Z

> This path is a nightmare: it will basically force us to maintain two copies of the same, already complex code just to cope with the difference in count and displacement types. I think I would prefer to change the MCA coll type for counts/disps to void* and use macros to compute the right value, and then compile the code twice (once with int/int and once with MPI_Count/MPI_Aint).

Okay. Would you also propose adding an extra argument to the methods to indicate whether the app invoked a big count or a small count collective op?

bosilca · 2024-02-14T19:08:33Z

This is also a possible approach, but not the one I was going for. My idea would have doubled the size of the coll structure and added a little code to the build infrastructure, but would not have doubled the size of the collective code.

hppritcha · 2024-02-14T19:16:50Z

Oh, I see. That's kind of what @jtronge and I were thinking of doing, then.

This commit adds only those functions which make use of C integer promotion. So none of the 'v,w' and reduce_scatter related methods are added in this PR. Related to open-mpi#12336 Signed-off-by: Howard Pritchard <howardp@lanl.gov>

This commit adds only those functions which make use of C integer promotion. So none of the 'v,w' and reduce_scatter related methods are added in this PR. Related to open-mpi#12336 Pieces of open-mpi#12478 were taken out to make this PR. Signed-off-by: Howard Pritchard <howardp@lanl.gov>

hppritcha · 2024-07-30T15:03:00Z

closed via #12621 and #12539


hppritcha added MPI-4.0 Target: main labels Feb 14, 2024

hppritcha added this to the 4.0 milestone Feb 14, 2024

hppritcha mentioned this issue Apr 22, 2024

Update collective framework for bigcount #12478

Closed

hppritcha mentioned this issue May 9, 2024

Big Count: first pass for coll framework bc #12539

Merged

hppritcha closed this as completed Jul 30, 2024
