IBM Spectrum MPI gatherv fails with large total data counts #1343

Closed
ye-luo opened this issue Jan 29, 2019 · 1 comment · Fixed by #1479

ye-luo commented Jan 29, 2019

On Summit, a gatherv over the spline table of the NiO 128-atom problem makes the MPI library stop with the error below. The table contains 115 x 69 x 69 x 816 doubles.
[h23n07:134979] coll:ibm:module: datatype_prepare_recvv overflowed integer range

I suspect that the total size, 115 x 69 x 69 x 816 x 8 bytes = 3,574,177,920 bytes, exceeds the 2^31 - 1 limit of a signed 32-bit integer.
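The arithmetic behind this suspicion can be checked at compile time. The sketch below is illustrative only: `table_bytes` and `exceeds_int32` are hypothetical helper names, not part of the code base, and it assumes 8-byte doubles. It shows that the element count still fits in a signed 32-bit integer while the byte count does not.

```cpp
#include <cstdint>
#include <limits>

static_assert(sizeof(double) == 8, "this sketch assumes 8-byte doubles");

// Total bytes of a table of doubles, computed in 64-bit arithmetic so the
// product itself cannot overflow for the sizes discussed here.
// (Hypothetical helper, for illustration only.)
constexpr std::int64_t table_bytes(std::int64_t nx, std::int64_t ny,
                                   std::int64_t nz, std::int64_t ncols)
{
  return nx * ny * nz * ncols * static_cast<std::int64_t>(sizeof(double));
}

// True when a value does not fit in a signed 32-bit integer.
constexpr bool exceeds_int32(std::int64_t v)
{
  return v > std::numeric_limits<std::int32_t>::max();
}

// Sizes quoted in this issue for the NiO 128-atom spline table.
constexpr std::int64_t nio128_elems = std::int64_t{115} * 69 * 69 * 816;
constexpr std::int64_t nio128_bytes = table_bytes(115, 69, 69, 816);

static_assert(!exceeds_int32(nio128_elems), "element count fits in int32");
static_assert(exceeds_int32(nio128_bytes), "byte count does not fit in int32");
```

Since the number of doubles (446,772,240) is below 2^31 and only the byte count is above it, the failure is consistent with an overflow in a byte-based size computation rather than in an element count.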

Currently gatherv is used as follows.
The whole table is treated as a matrix with 115 x 69 x 69 rows and 816 columns.
The 816 columns are distributed across MPI ranks, and a derived column datatype is constructed.
An in-place gatherv then collects the columns.

comm->gatherv_in_place(buffer->coefs, columntype, counts, offset);
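To make the shape of the `counts` and `offset` arguments concrete, here is a minimal sketch of one way 816 columns could be split across ranks. How the columns are actually distributed is not stated in this issue; `ColumnLayout` and `split_columns` are hypothetical names, and the even split is an assumption.

```cpp
#include <vector>

// Hypothetical layout: how many columns each rank owns and where they start.
// Both are expressed in units of one column (one derived-datatype element).
struct ColumnLayout
{
  std::vector<int> counts;  // columns owned by each rank
  std::vector<int> offsets; // index of the first column owned by each rank
};

// Split ncols columns as evenly as possible over nranks ranks; the first
// (ncols % nranks) ranks receive one extra column. Assumed scheme only.
inline ColumnLayout split_columns(int ncols, int nranks)
{
  ColumnLayout layout;
  layout.counts.resize(nranks);
  layout.offsets.resize(nranks);
  const int base  = ncols / nranks;
  const int extra = ncols % nranks;
  int next = 0;
  for (int r = 0; r < nranks; ++r)
  {
    layout.counts[r]  = base + (r < extra ? 1 : 0);
    layout.offsets[r] = next;
    next += layout.counts[r];
  }
  return layout;
}
```

Under any such scheme the counts and offsets are at most 816, far below the integer limit, so the arguments passed to gatherv are not themselves the problem. The overflow would have to come from the byte extent the library derives from them.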

The Cray and Intel MPI implementations handle this call without problems; only Spectrum MPI stops.

Workaround:
The code already has a separate code path for very large tables, introduced to improve memory locality. It treats the whole table as 115 matrices, each with 69 x 69 rows and 816 columns, and calls gatherv 115 times.
This code path is taken only when
nx * ny * nz > 1<<20
Changing 20 to 19 switches the NiO 128 problem to this code path, and the code then runs.
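The effect of the threshold change can be verified with a short sketch. It assumes nx = 115 and ny = nz = 69, that the table is sliced along the first dimension, and 8-byte doubles; `use_slab_path` and `slab_bytes` are hypothetical names that only mirror the condition quoted above.

```cpp
#include <cstdint>

// Hypothetical predicate mirroring the quoted condition: the slab-by-slab
// path is taken when the grid has more than 2^shift points.
constexpr bool use_slab_path(std::int64_t nx, std::int64_t ny,
                             std::int64_t nz, int shift)
{
  return nx * ny * nz > (std::int64_t{1} << shift);
}

// Bytes moved by one gatherv call when the table is gathered one slab
// (ny x nz rows, ncols columns) at a time, assuming 8-byte doubles.
constexpr std::int64_t slab_bytes(std::int64_t ny, std::int64_t nz,
                                  std::int64_t ncols)
{
  return ny * nz * ncols * 8;
}

// 115 * 69 * 69 = 547,515 lies between 2^19 = 524,288 and 2^20 = 1,048,576.
static_assert(!use_slab_path(115, 69, 69, 20), "original threshold: single gatherv");
static_assert(use_slab_path(115, 69, 69, 19), "lowered threshold: 115 gatherv calls");
static_assert(slab_bytes(69, 69, 816) == 31079808, "about 31 MB per call");
```

Each of the 115 calls then moves about 31 MB, well within the signed 32-bit range, which is consistent with the workaround avoiding the error.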

ye-luo added the bug label Jan 29, 2019

prckent commented Jan 30, 2019

Restating this concisely: the working theory is that gatherv fails when the total byte size exceeds the range of a signed 32-bit integer (2^31 - 1).

prckent changed the title from "IBM Spectrum MPI chokes at gatherv" to "IBM Spectrum MPI gatherv fails with large total data counts" Feb 6, 2019
ghost assigned ye-luo Mar 25, 2019
ghost added the in progress label Mar 25, 2019
ghost removed the in progress label Mar 25, 2019