Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Romio fixes from mpich v41x #8371

Merged
merged 5 commits into from
Oct 11, 2021

Commits on Aug 11, 2021

  1. [from mpich] romio: avoid freeing NULLs in gpfs

    mpich has two commits about freeing NULLs in GPFS, titled:
    
    "romio gpfs: avoid freeing NULL"
    and
    "ROMIO: gpfs: avoid freeing NULL in more places"
    
    The code isn't a perfect match vs the romio we have here
    in ompi version 4.1.x, but it's essentially turning a
    handful of "ADIOI_Free(x)" into "if (x) { ADIOI_Free(x); }".
    So that's what this commit does too.
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Aug 11, 2021
    Configuration menu
    Copy the full SHA
    c9c1ec6 View commit details
    Browse the repository at this point in the history
  2. [from mpich] romio GPFS: missing initialization

    Merging a romio commit that was accepted by mpich:
    
    > romio GPFS: missing initialization
    >
    > Romio GPFS had a runtime problem due to a .mem_ptrs field that
    > was uninitialized that it was potentially trying to free.
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Aug 11, 2021
    Configuration menu
    Copy the full SHA
    4de4fd3 View commit details
    Browse the repository at this point in the history
  3. [from mpich] darray (romio) needs a resize in the block()/cyclic() su…

    …btype creation
    
    Merging a romio commit that was accepted by mpich:
    
    > darray (romio) needs a resize in the block()/cyclic() subtype creation
    >
    > When Type_create_darray() uses the block()/cyclic() function to
    > construct a subtype for whatever dimension it's processing, it
    > needs to resize the resulting type before it goes into the type
    > construction as the input for processing the next dimension.
    >
    > The same problem is in the main path's type creation, and in romio's
    > mirroring of it.
    >
    > Gist for a testcase:
    > https://gist.github.com/markalle/940de93d64fd779e304ee124855b8e6a
    >
    > The darray_bug_romio.c testcase creates a darray using
    > * 4 ranks in a 2x2 grid
    > * looking at the type made for rank 0
    > * inner dimension: 4 ints distributed block over 2 ranks with 2 items each
    > * outer dimension: 6 of the above distributed cyclic over 2 ranks with 2 items each
    >
    > The type created by processing the first dimension should look like this
    > [ x x . . ]
    >
    > And then distributing those for the second dimension becomes
    >
    > [ x x x x ]
    > [ . . . . ]
    > [ . . . . ]
    > [ . . . . ]
    > [ x x x x ]
    > [ . . . . ]
    >
    > Going to the MPI standard to justify why the first layout is right,
    > it's where the definiton of the cyclic() function has a ub_marker
    > of gsize*ex, eg 4 ints for that first type.
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Aug 11, 2021
    Configuration menu
    Copy the full SHA
    3e317d7 View commit details
    Browse the repository at this point in the history
  4. [from mpich] romio flatten has flat->count consistency problems

    Merging a romio commit that was accepted by mpich:
    
    > romio flatten has flat->count consistency problems
    >
    > ADIOI_Count_contiguous_blocks and ADIOI_Flatten are very similar functions,
    > but they diverge a lot around the edges and in degenerate cases.  In Spectrum
    > MPI I spent some time making them consistent with each other but
    > found that to be a losing battle.
    >
    > So the approach I used here is to not have Count() be as definitive,
    > and rather let Flatten() have the last word on what the final flat->count
    > really is.  Eg Flatten's *curr_index is the real count.
    >
    > The changes I made are
    >
    > 1. Fix a couple places in Flatten where *curr_index was updated out
    > of sync with what was actually being added to flat->indices[] and
    > flat->blocklens[].  That's one of the important book-keeping rules
    > Flatten should follow.  There were a couple places (when counts[] arrays
    > have 0s) where that wasn't the case (see the "nonzeroth" var in this
    > commit).
    >
    > 2. The main change I made was to reset flat->count based on
    > Flatten's curr_index.  This is because the divergence between the
    > two functions is too great to reliably fix.
    >
    > 3. A third change is just a safety valve, using a
    > flatlist_node_grow() function just in case the
    > Count function returns a smaller value than Flatten ends up
    > trying to write into the array, and related to this I
    > got rid of the various --(flat->count) lines, since that var now
    > represents the allocated size of the array, until right after
    > the Flatten function when it's reset to represent the data written
    > to the array like it used to be.
    >
    > I don't think we even need ADIOI_Count_contiguous_blocks() anymore,
    > but I didn't remove it.
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Aug 11, 2021
    Configuration menu
    Copy the full SHA
    adda190 View commit details
    Browse the repository at this point in the history
  5. [from mpich] make ADIOI_GEN_WriteStrided not step on itself

    Merging a romio commit that was accepted by mpich:
    
    > make ADIOI_GEN_WriteStrided not step on itself
    >
    > The ADIOI_GEN_WriteStrided funcion uses data sieving on non-contiguous
    > types. That is, if it wants to write data at locations
    >     [W...X...Y...Z...]
    > it reads the whole buffer
    >     [dddddddddddddddd]
    > changes the locations it wants to write to
    >     [WdddXdddYdddZddd]
    > then writes the whole thing back. It uses locks to make this safe, but
    > the problem is this only protects against other parts of the product that
    > are using locks. And without this PR a peer who is simultaneously making
    > a simple non-contiguous write wouldn't have locked.
    >
    > A testcase to demonstrate the original problem is here:
    > https://gist.github.com/markalle/d7da240c19e57f095c5d1b13240dae24
    >
    > % mpicc -o x romio_write_timing.c
    > % mpirun -np 4 ./x
    >
    > Note: you need to use a filesystem that uses ADIOI_GEN_WriteStrided to
    > hit the bug. I was using GPFS.
    >
    > This commit is pulled from wkliao after discussing where to put the
    > new lock.  It adds locks to contiguous writes in independent write
    > functions when data sieving write is not disabled
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Aug 11, 2021
    Configuration menu
    Copy the full SHA
    ecd3e05 View commit details
    Browse the repository at this point in the history