planning ABI compatible builds #12

Open
6 tasks
minrk opened this issue Nov 29, 2024 · 8 comments

@minrk
Member

minrk commented Nov 29, 2024

@j34ni is leading some work on building multiple ABI versions of libfabric so that host builds can be dropped in (like we have for external builds of mpich/openmpi, but at the libfabric level). This matters because HPC systems often ship a hardware-optimized, but often outdated, libfabric.

To do this, I think we only need to build at most one version for each ABI version.

ABI version table:

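| ABI version | libfabric version | date |
|-------------|-------------------|---------|
| 1.8 | 2.0 | soon |
| 1.7 | 1.20-1.22 | 2023-11 |
| 1.6 | 1.14-1.19 | 2021-11 |
| 1.5 | 1.13 | 2021-07 |
| 1.4 | 1.12 | 2021-03 |
| 1.3 | 1.11 | 2020-08 |
| 1.2 | 1.7-1.10 | 2019-01 |
| 1.1 | 1.5-1.6 | 2017-08 |
| 1.0 | 1.0-1.4 | older |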
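
For reference, the table can be written as a small lookup. This is only a sketch: `ABI_TABLE` and `abi_for` are hypothetical names, the data is copied from the table above, and the 2.0 row assumes the release keeps the ABI version it currently reports.

```python
# ABI version -> (oldest, newest) libfabric release series, copied from the table.
# The 2.0 entry is an assumption until libfabric 2.0 is actually released.
ABI_TABLE = {
    "1.8": ((2, 0), (2, 0)),
    "1.7": ((1, 20), (1, 22)),
    "1.6": ((1, 14), (1, 19)),
    "1.5": ((1, 13), (1, 13)),
    "1.4": ((1, 12), (1, 12)),
    "1.3": ((1, 11), (1, 11)),
    "1.2": ((1, 7), (1, 10)),
    "1.1": ((1, 5), (1, 6)),
    "1.0": ((1, 0), (1, 4)),
}


def abi_for(libfabric_version: str) -> str:
    """Return the ABI version for a libfabric version such as '1.19.1'."""
    # Only major.minor matters; patch releases share the ABI of their series.
    major, minor = (int(part) for part in libfabric_version.split(".")[:2])
    for abi, (oldest, newest) in ABI_TABLE.items():
        if oldest <= (major, minor) <= newest:
            return abi
    raise ValueError(f"libfabric {libfabric_version} is not in the table")


print(abi_for("1.19.1"))  # 1.6
```

So 1.14.0 and 1.19.1 land on the same ABI version, while 1.20 starts the next one.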

Note that libfabric 2 has breaking API changes, but is reported to be only a minor ABI revision. That means that e.g. mpich built against any past version of libfabric should still be runtime-compatible with libfabric 2.0. This should be easy enough to test by building libfabric 2 and running it with existing builds of mpich.
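
The compatibility rule being assumed here can be sketched as a predicate on ABI versions. The names `parse_abi` and `runtime_compatible` are hypothetical, and the rule itself (same ABI major, runtime ABI minor at least as new as the build ABI minor) is what the ABI version claims, not something verified yet.

```python
def parse_abi(abi: str) -> tuple[int, int]:
    """Split an ABI version such as '1.6' into (major, minor) integers."""
    major, minor = abi.split(".")
    return int(major), int(minor)


def runtime_compatible(build_abi: str, runtime_abi: str) -> bool:
    """True if a binary built against build_abi should run with runtime_abi.

    Assumed rule: the ABI major must match, and the runtime ABI minor
    must not be older than the one used at build time.
    """
    build = parse_abi(build_abi)
    runtime = parse_abi(runtime_abi)
    return build[0] == runtime[0] and runtime[1] >= build[1]


# mpich built against ABI 1.6 (libfabric 1.14-1.19), run with libfabric 2.0 (ABI 1.8)
print(runtime_compatible("1.6", "1.8"))
```

Under this rule a build against ABI 1.7 would not be expected to run on a host that only provides ABI 1.6.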

Available strategies:

  • pick one 'old' libfabric and only maintain that, alongside standard newest builds, since any old build should be compatible with anything later, up to the next major ABI release (there has yet to be a single major ABI bump). This is the simplest and least work, since it means we only ever need to track two builds.
  • build one for each supported ABI revision

We need to:

  • pick an oldest ABI version to support (is there evidence of older-than 1.14 actively used in the wild? If not, we don't need to support ABI older than 1.6). We can wait until there's a demonstrated need before going further back in time
  • pick a strategy (I think fewer builds is better, so I'd go with maintaining only one older-than-current build unless there's a clear reason to build more)
  • pick how we want to build for a given ABI version - i.e. build oldest (1.14 for 1.6) or newest (1.19 for 1.6). Newest will have the best results, assuming it works, but it places some trust in the libfabric ABI doing what it claims. Building oldest should be the safest.
  • add a libfabric-abi metapackage carrying the ABI version, to make sure we get the right compatibility between libfabric versions. Semver pinning of package versions doesn't accurately represent compatibility (it never does for libraries, but it is close enough often enough that it's what we usually use).
  • add external variants to allow empty installs to reference libfabric from the host, like we do for mpich/openmpi already
  • once this is all set, add a matrix of libfabric versions to mpich builds
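
To make the "oldest ABI" and "oldest vs newest" choices above concrete, here is a sketch of the build matrix they would produce for the one-build-per-ABI strategy. `ABI_RANGES`, `abi_key` and `build_matrix` are hypothetical names, and only three rows of the table are included to keep it short.

```python
# ABI version -> (oldest, newest) libfabric series, copied from the table in this issue.
# Only ABI 1.5 through 1.7 are listed here for brevity.
ABI_RANGES = {
    "1.7": ("1.20", "1.22"),
    "1.6": ("1.14", "1.19"),
    "1.5": ("1.13", "1.13"),
}


def abi_key(abi: str) -> tuple[int, ...]:
    """Numeric sort key so that '1.10' would sort after '1.9'."""
    return tuple(int(part) for part in abi.split("."))


def build_matrix(oldest_abi: str, policy: str = "newest") -> dict[str, str]:
    """Map each supported ABI version to the libfabric series to build."""
    if policy not in ("oldest", "newest"):
        raise ValueError("policy must be 'oldest' or 'newest'")
    index = 0 if policy == "oldest" else 1
    return {
        abi: ABI_RANGES[abi][index]
        for abi in sorted(ABI_RANGES, key=abi_key)
        if abi_key(abi) >= abi_key(oldest_abi)
    }


print(build_matrix("1.6"))            # {'1.6': '1.19', '1.7': '1.22'}
print(build_matrix("1.6", "oldest"))  # {'1.6': '1.14', '1.7': '1.20'}
```

The simpler strategy (one old build plus latest) would keep only the first and last entries of this matrix.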

I suggest we start with building only the latest version and use 1.19 for ABI 1.6 (should support running back to libfabric 1.14)

If we skip the libfabric-abi package, we can use API versioning, but we would need to use stricter pinnings than are technically needed, since we can't assume 2.0 or 3.0 won't break the ABI until they are released.

@minrk
Member Author

minrk commented Nov 29, 2024

@j34ni can you test if an mpich built against 1.19 can run against 1.14 (or 1.15.2, if that's the oldest you find)? If that appears to work, then I think we can take the simplest strategy and only need to maintain 1.19 and 'latest'. But we'll need to get run_exports correct, since the usual lower bound will be too restrictive.
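
To illustrate the run_exports point: the usual lower bound would pin to the version built against, while the ABI table allows going back to the oldest release with the same ABI. A rough sketch of that calculation follows; `SAME_ABI_SERIES` and `run_pin` are hypothetical names, this is not conda-build's API, and no upper bound is attempted.

```python
# (oldest, newest) libfabric series that share one ABI version, from the table in this issue.
SAME_ABI_SERIES = [
    ((1, 20), (1, 22)),  # ABI 1.7
    ((1, 14), (1, 19)),  # ABI 1.6
]


def run_pin(build_version: str) -> str:
    """Return a relaxed runtime lower bound for a libfabric build version."""
    major, minor = (int(part) for part in build_version.split(".")[:2])
    for oldest, newest in SAME_ABI_SERIES:
        if oldest <= (major, minor) <= newest:
            # Any release with the same ABI should be acceptable at runtime.
            return f"libfabric >={oldest[0]}.{oldest[1]}"
    # Unknown series: fall back to the usual, stricter lower bound.
    return f"libfabric >={major}.{minor}"


print(run_pin("1.19.1"))  # libfabric >=1.14
```

This only holds if building against the newest release of an ABI series really does run against the oldest one, which is what the test above is meant to confirm.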

@j34ni
Contributor

j34ni commented Nov 29, 2024

It does not seem to work with version 1.19.1
I will try with an older one

@j34ni
Contributor

j34ni commented Nov 29, 2024

Should we push 1.18.0 to test it?

@minrk
Member Author

minrk commented Nov 29, 2024

Sure. What is the error with 1.19?

@j34ni
Contributor

j34ni commented Nov 29, 2024

It is as if it was already 1.7

@minrk
Member Author

minrk commented Nov 29, 2024

Strange. I ran a local build of mpich linked against the recently published 1.19.1 and installed libfabric 1.14.0 and it still seems to work fine. Can you share more about how you are doing your tests?

@minrk
Member Author

minrk commented Nov 30, 2024

Can you test with this build:

mamba install -c minrk/label/libfabric-test -n mpich 'mpich=4.2.3=libfab119*'

It is built against the conda-forge build of libfabric 1.19.1 from #13, and seems to run fine when I replace libfabric 1.19 with libfabric 1.14 from here.

Here's a sample Dockerfile that runs mpich built with libfabric 1.19.1 with libfabric 1.14.0 at runtime: https://gist.github.com/minrk/22707269668cdb7cc8933fc6d57e1d44

@minrk
Member Author

minrk commented Dec 17, 2024

Almost ready for this. If mpich doesn't benefit from the newer ABI features, the simplest version of this is to keep building mpich with the oldest supported version (e.g. 1.19), which means builds will be compatible with >=1.14.

We can start adding a matrix if mpich will benefit from new features if built against newer libfabric.
