
Linux CI jobs hang forever after completing all Python tests successfully #4948

Closed
StrikerRUS opened this issue Jan 14, 2022 · 16 comments

@StrikerRUS
Collaborator

This problem started timing out our CI jobs about 5 days ago. The CI jobs that most frequently run out of the allowed 60min limit are Linux_latest regular and Linux_latest sdist at Azure Pipelines. Also, I just saw CUDA Version / cuda 10.0 pip (linux, clang, Python 3.8) hit the same problem.

From the test logs, I guess that the root cause is connected to the following warning message from the joblib/threadpoolctl package:

2022-01-14T22:12:20.3133305Z tests/python_package_test/test_sklearn.py::test_sklearn_integration[LGBMRegressor()-check_regressors_train(readonly_memmap=True,X_dtype=float32)]
2022-01-14T22:12:20.3133965Z   /root/miniconda/envs/test-env/lib/python3.8/site-packages/threadpoolctl.py:546: RuntimeWarning: 
2022-01-14T22:12:20.3134409Z   Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
2022-01-14T22:12:20.3134727Z   the same time. Both libraries are known to be incompatible and this
2022-01-14T22:12:20.3135026Z   can cause random crashes or deadlocks on Linux when loaded in the
2022-01-14T22:12:20.3135283Z   same Python program.
2022-01-14T22:12:20.3135545Z   Using threadpoolctl may cause crashes or deadlocks. For more
2022-01-14T22:12:20.3135827Z   information and possible workarounds, please see
2022-01-14T22:12:20.3136165Z       https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
2022-01-14T22:12:20.3136411Z   
2022-01-14T22:12:20.3136614Z     warnings.warn(msg, RuntimeWarning)
@StrikerRUS
Collaborator Author

For macOS, we have the following workaround to avoid conflicts between multiple instances of the libomp library:

LightGBM/.ci/test.sh

Lines 120 to 123 in 4aaeb22

if [[ $OS_NAME == "macos" ]] && [[ $COMPILER == "clang" ]]; then
    # fix "OMP: Error #15: Initializing libiomp5.dylib, but found libomp.dylib already initialized." (OpenMP library conflict due to conda's MKL)
    for LIBOMP_ALIAS in libgomp.dylib libiomp5.dylib libomp.dylib; do sudo ln -sf "$(brew --cellar libomp)"/*/lib/libomp.dylib $CONDA_PREFIX/lib/$LIBOMP_ALIAS || exit -1; done
fi

@StrikerRUS
Collaborator Author

StrikerRUS commented Jan 14, 2022

As far as I remember, only the Azure Pipelines Linux_latest and CUDA cuda 10.0 pip (linux, clang, Python 3.8) CI jobs have faced this problem (each multiple times), so I believe only the CI jobs where we use clang to compile LightGBM suffer from it.

LightGBM/.vsts-ci.yml

Lines 76 to 79 in 4aaeb22

- job: Linux_latest
  ###########################################
  variables:
    COMPILER: clang

- method: pip
  compiler: clang
  python_version: 3.8
  cuda_version: "10.0"

@StrikerRUS
Collaborator Author

StrikerRUS commented Jan 14, 2022

Official suggested workarounds:
https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md#workarounds-for-intel-openmp-and-llvm-openmp-case

  1. Tell MKL (used by NumPy) to use the GNU OpenMP runtime instead of the Intel OpenMP runtime by setting the following environment variable:

    export MKL_THREADING_LAYER=GNU
    
  2. Install a build of NumPy and SciPy linked against OpenBLAS instead of MKL. This can be done for instance by installing NumPy and SciPy from PyPI:

    pip install numpy scipy
    

    from the conda-forge conda channel:

    conda install -c conda-forge numpy scipy
    

    or from the default conda channel:

    conda install numpy scipy blas[build=openblas]
    
  3. Re-build your OpenMP-enabled extensions from source with GCC (or ICC) instead of Clang if you want to keep on using NumPy/SciPy linked against MKL with the default libiomp-based threading layer.
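For what it's worth, workaround 1 can also be applied from inside Python, with the caveat (my understanding, worth double-checking) that the variable must be set before NumPy, and therefore MKL, is first imported in the process:

```python
import os

# Workaround 1: point MKL's threading layer at the GNU OpenMP runtime
# (libgomp) so that libiomp is never loaded alongside clang's libomp.
# This has to happen before the first `import numpy` in the process;
# once MKL has picked a threading layer, the variable is ignored.
os.environ["MKL_THREADING_LAYER"] = "GNU"

# import numpy as np  # MKL-backed packages are safe to import from here on
```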

@StrikerRUS
Collaborator Author

StrikerRUS commented Jan 15, 2022

I guess we are facing a conflict between the default Ubuntu system-wide libomp and the libiomp installed from conda as a dependency of the numpy/scipy/scikit-learn packages. In addition, we have libgomp installed from conda as a dependency of Python itself.

Just as an example, latest master logs:

LightGBM/.ci/setup.sh

Lines 52 to 56 in 4aaeb22

if [[ $COMPILER == "clang" ]]; then
    sudo apt-get install --no-install-recommends -y \
        clang \
        libomp-dev
fi

2022-01-09T14:12:47.9352974Z Unpacking clang (1:10.0-50~exp1) ...
2022-01-09T14:12:47.9989218Z Selecting previously unselected package libomp5-10:amd64.
2022-01-09T14:12:48.0013884Z Preparing to unpack .../13-libomp5-10_1%3a10.0.0-4ubuntu1_amd64.deb ...
2022-01-09T14:12:48.0280193Z Unpacking libomp5-10:amd64 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.1021372Z Selecting previously unselected package libomp-10-dev.
2022-01-09T14:12:48.1041014Z Preparing to unpack .../14-libomp-10-dev_1%3a10.0.0-4ubuntu1_amd64.deb ...
2022-01-09T14:12:48.1110633Z Unpacking libomp-10-dev (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.1655290Z Selecting previously unselected package libomp-dev.
2022-01-09T14:12:48.1678522Z Preparing to unpack .../15-libomp-dev_1%3a10.0-50~exp1_amd64.deb ...
2022-01-09T14:12:48.1740800Z Unpacking libomp-dev (1:10.0-50~exp1) ...
2022-01-09T14:12:48.3024348Z Setting up libgc1c2:amd64 (1:7.6.4-0.4ubuntu1) ...
2022-01-09T14:12:48.3309310Z Setting up libedit2:amd64 (3.1-20191231-1) ...
2022-01-09T14:12:48.3575485Z Setting up libobjc4:amd64 (10.3.0-1ubuntu1~20.04) ...
2022-01-09T14:12:48.3746044Z Setting up libllvm10:amd64 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.4107754Z Setting up libclang1-10 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.4314994Z Setting up libobjc-9-dev:amd64 (9.3.0-17ubuntu1~20.04) ...
2022-01-09T14:12:48.4774687Z Setting up libomp5-10:amd64 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.5023165Z Setting up libc6-i386 (2.31-0ubuntu9.2) ...
2022-01-09T14:12:48.5264503Z Setting up libomp-10-dev (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.5746908Z Setting up libclang-cpp10 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.5924828Z Setting up lib32gcc-s1 (10.3.0-1ubuntu1~20.04) ...
2022-01-09T14:12:48.6177911Z Setting up lib32stdc++6 (10.3.0-1ubuntu1~20.04) ...
2022-01-09T14:12:48.6431429Z Setting up libomp-dev (1:10.0-50~exp1) ...
2022-01-09T14:12:48.6841900Z Setting up libclang-common-10-dev (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.7140409Z Setting up clang-10 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.7313904Z Setting up clang (1:10.0-50~exp1) ...
...

  libgomp            pkgs/main/linux-64::libgomp-9.3.0-h5101ec6_17

...

  intel-openmp       pkgs/main/linux-64::intel-openmp-2021.4.0-h06a4308_3561

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=11954&view=logs&j=275189f9-c769-596a-7ef9-49fb48a9ab70&t=3a9e7a4a-04e6-52a0-67ea-6e8f6cfda74f&l=44

According to the threadpoolctl link above, deadlocks are observed only when the Clang and ICC OpenMP implementations are loaded at the same time. This explains why we don't see any timeouts in our gcc-configured CI jobs.

The only unrecoverable incompatibility we encountered happens when loading a mix of compiled extensions linked with libomp (LLVM/Clang) and libiomp (ICC), on Linux, manifested by crashes or deadlocks.
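To check which runtimes a given process has actually pulled in, a small diagnostic helper (my own sketch, Linux-only, assuming the usual runtime file names; `loaded_openmp_runtimes` is a hypothetical name, not a LightGBM or threadpoolctl API) can scan the process's memory map:

```python
import re
from pathlib import Path

def loaded_openmp_runtimes():
    """List OpenMP runtime libraries mapped into this process (Linux only).

    Looks for the three common runtimes: libgomp (GCC), libomp (LLVM/Clang)
    and libiomp (ICC/MKL). Seeing both libomp and libiomp at once is the
    unrecoverable mix described above.
    """
    pattern = re.compile(r"\b(?:libgomp|libomp|libiomp)[^/\s]*\.so[^/\s]*")
    maps = Path("/proc/self/maps").read_text()
    return sorted({m.group(0) for m in pattern.finditer(maps)})

print(loaded_openmp_runtimes())
```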

@StrikerRUS
Collaborator Author

If none of the maintainers knows a better workaround than those documented in threadpoolctl's help doc, I think we should choose one of those three (#4948 (comment)).

#3 is not an option for us, I think, because we'd like to test LightGBM against the Clang compiler. So we actually have only two workarounds: #1 (force MKL to use libgomp by setting an environment variable) and #2 (use OpenBLAS instead of MKL).

This CI problem is quite annoying because it makes us re-run CI jobs multiple times after they time out, and it slows down the whole development process in the repository.

@guolinke, @chivee, @shiyu1994, @tongwu-msft, @hzy46, @Laurae2, @jameslamb, @jmoralez

@jameslamb
Collaborator

Excellent investigation, thank you!

I vote for option 1, setting MKL_THREADING_LAYER=GNU in the environment for clang Linux jobs. I like that because it's what I'd most like to recommend to a LightGBM user facing this issue...I think "set this environment variable" is less invasive than "install this other conda package".


I'd also like to ask: @xhochy, if you have time, could you advise us on this issue? I'm wondering if you've experienced a similar issue with the lightgbm conda-forge feedstock or other projects that both depend on numpy and link to OpenMP themselves.

@xhochy

xhochy commented Jan 15, 2022

We saw these issues a long time ago in other packages, but they shouldn't occur these days in a pure conda-forge setting, as we have safeguards in place ensuring that only a single OpenMP implementation is installed at a time.

This is especially important for intel-openmp, which has been superseded by llvm-openmp (the latter now contains all patches from the former).

As this seems to be directly about failing CI in LightGBM: is there any usage of conda in any of the failing jobs?

@jameslamb
Collaborator

Thanks very much @xhochy !

is there any usage of conda in one of the failing jobs?

Yes. For more context, we do not use conda-forge in this project's CI. This project's CI installs all dependencies from the Anaconda default channels.

conda install -q -y -n $CONDA_ENV cloudpickle "dask=2021.9.1" "distributed=2021.9.1" joblib matplotlib numpy pandas psutil pytest scikit-learn scipy

And for jobs on Linux which use clang to compile LightGBM (the ones that are sometimes timing out because of these issues), we also separately install libomp-dev (see @StrikerRUS 's summary in #4948 (comment)).

@StrikerRUS
Collaborator Author

StrikerRUS commented Jan 15, 2022

I wonder, maybe it's a good time to migrate from the default conda channel to conda-forge? Besides this particular issue with different OpenMP implementations, the default conda channel is extremely slow in terms of updates and lacks some packages required for our CI.

Just some small examples.

The dask-core package, which is included in the Anaconda distribution (this clarification emphasizes its importance to conda maintainers), is currently at version 2021.10.0, while conda-forge already has version 2022.1.0.

(screenshot: Anaconda distribution package list showing the dask-core version)
https://docs.anaconda.com/anaconda/packages/pkg-docs/

https://anaconda.org/conda-forge/dask-core

The LightGBM version on the default conda channel is 3.2.1: https://anaconda.org/anaconda/lightgbm. Related issue: #3544 (comment).

Requests for adding new R packages and upgrading existing ones tend to be ignored: ContinuumIO/anaconda-issues#11604, ContinuumIO/anaconda-issues#11571. For this reason, we have already migrated to conda-forge for building our docs: #4767.

In addition, the conda-forge channel often supports more architectures (Miniforge): #4843 (comment).

Download stats for LightGBM (especially for the recent versions) show that users already prefer conda-forge to the default channel: https://anaconda.org/conda-forge/lightgbm/files vs https://anaconda.org/anaconda/lightgbm/files.

Just a reminder: it's better not to mix different channels in one environment, not only because of possible package conflicts, but also because resolving the environment specification during the installation phase takes a long time and consumes a lot of memory (which matters for CI): #4054 (review), ContinuumIO/anaconda-issues#11604 (comment).

@jameslamb
Collaborator

@StrikerRUS thanks for all the research! I strongly support moving LightGBM's CI to using only conda-forge, and I'd be happy to do that work.

@StrikerRUS
Collaborator Author

@jameslamb Thank you very much! Before we start, let's wait for some other opinions...

@guolinke @shiyu1994 @tongwu-msft @hzy46 @Laurae2 @jmoralez

@jmoralez
Collaborator

jmoralez commented Jan 15, 2022

I also support using conda-forge, and maybe we could consider using mamba in CI. The time to solve environments and install packages is significantly reduced.

@StrikerRUS
Collaborator Author

StrikerRUS commented Jan 15, 2022

@jmoralez
Let's go deeper! Mambaforge! 😄
https://github.com/conda-forge/miniforge#mambaforge

@jameslamb
Collaborator

@StrikerRUS @guolinke could I have "Write" access on https://github.com/guolinke/lightgbm-ci-docker?

I realized today that to make this change to mamba I'll need to update that image, and it would be easier if I could directly push to the dev branch there and have new images pushed, so I could test them in LightGBM's CI.

Otherwise, I'll have to make a PR from my fork of lightgbm-ci-docker into the dev branch of that repo, and then if anything breaks repeat the process of PR-ing from my fork to dev and waiting for approval.

@StrikerRUS
Collaborator Author

@jameslamb Good idea but unfortunately I have no rights to grant you "write" access.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot removed the blocking label Aug 16, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023