Romio fixes from mpich v41x #8371

mpich has two commits about freeing NULLs in GPFS, titled: "romio gpfs: avoid freeing NULL" and "ROMIO: gpfs: avoid freeing NULL in more places" The code isn't a perfect match vs the romio we have here in ompi version 4.1.x, but it's essentially turning a handful of "ADIOI_Free(x)" into "if (x) { ADIOI_Free(x); }". So that's what this commit does too. Signed-off-by: Mark Allen <markalle@us.ibm.com>

Merging a romio commit that was accepted by mpich: > romio GPFS: missing initialization > > Romio GPFS had a runtime problem due to a .mem_ptrs field that > was uninitialized that it was potentially trying to free. Signed-off-by: Mark Allen <markalle@us.ibm.com>

…btype creation Merging a romio commit that was accepted by mpich: > darray (romio) needs a resize in the block()/cyclic() subtype creation > > When Type_create_darray() uses the block()/cyclic() function to > construct a subtype for whatever dimension it's processing, it > needs to resize the resulting type before it goes into the type > construction as the input for processing the next dimension. > > The same problem is in the main path's type creation, and in romio's > mirroring of it. > > Gist for a testcase: > https://gist.github.com/markalle/940de93d64fd779e304ee124855b8e6a > > The darray_bug_romio.c testcase creates a darray using > * 4 ranks in a 2x2 grid > * looking at the type made for rank 0 > * inner dimension: 4 ints distributed block over 2 ranks with 2 items each > * outer dimension: 6 of the above distributed cyclic over 2 ranks with 2 items each > > The type created by processing the first dimension should look like this > [ x x . . ] > > And then distributing those for the second dimension becomes > > [ x x x x ] > [ . . . . ] > [ . . . . ] > [ . . . . ] > [ x x x x ] > [ . . . . ] > > Going to the MPI standard to justify why the first layout is right, > it's where the definiton of the cyclic() function has a ub_marker > of gsize*ex, eg 4 ints for that first type. Signed-off-by: Mark Allen <markalle@us.ibm.com>

Merging a romio commit that was accepted by mpich: > romio flatten has flat->count consistency problems > > ADIOI_Count_contiguous_blocks and ADIOI_Flatten are very similar functions, > but they diverge a lot around the edges and in degenerate cases. In Spectrum > MPI I spent some time making them consistent with each other but > found that to be a losing battle. > > So the approach I used here is to not have Count() be as definitive, > and rather let Flatten() have the last word on what the final flat->count > really is. Eg Flatten's *curr_index is the real count. > > The changes I made are > > 1. Fix a couple places in Flatten where *curr_index was updated out > of sync with what was actually being added to flat->indices[] and > flat->blocklens[]. That's one of the important book-keeping rules > Flatten should follow. There were a couple places (when counts[] arrays > have 0s) where that wasn't the case (see the "nonzeroth" var in this > commit). > > 2. The main change I made was to reset flat->count based on > Flatten's curr_index. This is because the divergence between the > two functions is too great to reliably fix. > > 3. A third change is just a safety valve, using a > flatlist_node_grow() function just in case the > Count function returns a smaller value than Flatten ends up > trying to write into the array, and related to this I > got rid of the various --(flat->count) lines, since that var now > represents the allocated size of the array, until right after > the Flatten function when it's reset to represent the data written > to the array like it used to be. > > I don't think we even need ADIOI_Count_contiguous_blocks() anymore, > but I didn't remove it. Signed-off-by: Mark Allen <markalle@us.ibm.com>

Merging a romio commit that was accepted by mpich: > make ADIOI_GEN_WriteStrided not step on itself > > The ADIOI_GEN_WriteStrided funcion uses data sieving on non-contiguous > types. That is, if it wants to write data at locations > [W...X...Y...Z...] > it reads the whole buffer > [dddddddddddddddd] > changes the locations it wants to write to > [WdddXdddYdddZddd] > then writes the whole thing back. It uses locks to make this safe, but > the problem is this only protects against other parts of the product that > are using locks. And without this PR a peer who is simultaneously making > a simple non-contiguous write wouldn't have locked. > > A testcase to demonstrate the original problem is here: > https://gist.github.com/markalle/d7da240c19e57f095c5d1b13240dae24 > > % mpicc -o x romio_write_timing.c > % mpirun -np 4 ./x > > Note: you need to use a filesystem that uses ADIOI_GEN_WriteStrided to > hit the bug. I was using GPFS. > > This commit is pulled from wkliao after discussing where to put the > new lock. It adds locks to contiguous writes in independent write > functions when data sieving write is not disabled Signed-off-by: Mark Allen <markalle@us.ibm.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Romio fixes from mpich v41x #8371

Romio fixes from mpich v41x #8371

Commits on Aug 11, 2021