When gathering the spline table of the NiO 128-atom problem with gatherv, the MPI library stops with an error on Summit. The table contains 115 x 69 x 69 x 816 doubles.
[h23n07:134979] coll:ibm:module: datatype_prepare_recvv overflowed integer range
I suspect that the byte count, 115 x 69 x 69 x 816 x 8 = 3,574,177,920 bytes, goes beyond the 2^31 range of a 32-bit signed integer.
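The arithmetic can be checked at compile time. This is a minimal C++ sketch; the constant and function names are illustrative and not taken from QMCPACK, and it assumes a 32-bit `int`, as on common 64-bit Linux (LP64) platforms. It shows that the number of doubles (446,772,240) still fits in an `int`; only the byte count does not.

```cpp
#include <climits>
#include <cstdint>

// Dimensions of the NiO 128-atom spline table reported in this issue.
constexpr std::int64_t kNx = 115, kNy = 69, kNz = 69, kNSplines = 816;

// Total number of doubles in the table, computed in 64-bit arithmetic.
constexpr std::int64_t table_elements() { return kNx * kNy * kNz * kNSplines; }

// Total size of the table in bytes.
constexpr std::int64_t table_bytes() {
  return table_elements() * static_cast<std::int64_t>(sizeof(double));
}

// True if the value is representable as an int (assumed 32-bit here).
constexpr bool fits_in_int(std::int64_t v) { return v >= INT_MIN && v <= INT_MAX; }

// The element count fits, the byte count does not.
static_assert(fits_in_int(table_elements()), "element count fits in int");
static_assert(!fits_in_int(table_bytes()), "byte count exceeds int range");
```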
Currently, gatherv is used as follows:
The whole table is treated as a matrix with 115 x 69 x 69 rows and 816 columns.
The 816 columns are distributed across MPI ranks, and a derived MPI datatype describing a column is constructed.
Then an in-place gatherv collects the columns.
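To make the bookkeeping concrete, here is a hypothetical C++ sketch, not QMCPACK's actual code: `partition_columns` and `gathered_bytes` are illustrative names, and the even block distribution of columns is an assumption. Because counts and displacements are expressed in whole columns, each individual value stays small; it is the total amount of data the receive side describes, nrow x 816 doubles, that exceeds the int range.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Per-rank receive counts and displacements, in units of columns
// (as when the receive datatype describes a single column).
struct ColumnPartition {
  std::vector<int> counts;
  std::vector<int> displs;
};

// Hypothetical helper: split ncol columns across nranks in contiguous blocks;
// the first (ncol % nranks) ranks receive one extra column.
ColumnPartition partition_columns(int ncol, int nranks) {
  assert(ncol >= 0 && nranks > 0);
  ColumnPartition p;
  p.counts.resize(nranks);
  p.displs.resize(nranks);
  int offset = 0;
  for (int r = 0; r < nranks; ++r) {
    p.counts[r] = ncol / nranks + (r < ncol % nranks ? 1 : 0);
    p.displs[r] = offset;
    offset += p.counts[r];
  }
  return p;
}

// Total bytes gathered: every column holds nrow doubles, so the result is
// nrow * (sum of column counts) * sizeof(double), computed in 64-bit.
std::int64_t gathered_bytes(std::int64_t nrow, const ColumnPartition& p) {
  std::int64_t cols = 0;
  for (int c : p.counts) cols += c;
  return nrow * cols * static_cast<std::int64_t>(sizeof(double));
}
```

For example, with 7 ranks the 816 columns split as 117, 117, 117, 117, 116, 116, 116.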
The Cray and Intel MPI implementations handle this without problems, but Spectrum MPI stops.
Workaround:
The code has another code path for very large tables, originally introduced to improve memory locality. It treats the whole table as 115 matrices, each with 69 x 69 rows and 816 columns, and calls gatherv 115 times, once per matrix.
This code path is taken only when nx * ny * nz > 1<<20.
Changing 20 to 19 makes NiO 128 switch to this code path, and the code runs: here nx * ny * nz = 115 x 69 x 69 = 547,515, which is below 1<<20 = 1,048,576 but above 1<<19 = 524,288.
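The workaround can be sketched as follows. `use_chunked_gather` and `bytes_per_gather` are hypothetical names, and treating the 115-point dimension as the sliced one follows from the description above rather than from the source. Each of the 115 calls moves only one 69 x 69 x 816 slice, about 31 MB, well inside the int range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the branch condition described above: take the
// chunked path (one gatherv per slice) when nx * ny * nz > 1 << shift.
// The original code uses shift = 20; the workaround lowers it to 19.
bool use_chunked_gather(std::int64_t nx, std::int64_t ny, std::int64_t nz, int shift) {
  assert(shift >= 0 && shift < 62);
  return nx * ny * nz > (std::int64_t{1} << shift);
}

// Bytes moved by a single gatherv call: the whole table when not chunked,
// or one slice of ny * nz rows by nsplines columns when chunked.
std::int64_t bytes_per_gather(std::int64_t nx, std::int64_t ny, std::int64_t nz,
                              std::int64_t nsplines, bool chunked) {
  const std::int64_t rows = chunked ? ny * nz : nx * ny * nz;
  return rows * nsplines * static_cast<std::int64_t>(sizeof(double));
}
```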
For reference, the in-place gatherv is called in qmcpack/src/spline/einspline_util.hpp, line 78 (commit 7767799).