
Linux CI jobs hang forever after completing all Python tests successfully #4948

Closed
StrikerRUS opened this issue Jan 14, 2022 · 16 comments

@StrikerRUS
Collaborator

This problem started timing out our CI jobs about 5 days ago. The CI jobs that most frequently run out of the allowed 60min limit are Linux_latest regular and Linux_latest sdist at Azure Pipelines. Also, I just saw CUDA Version / cuda 10.0 pip (linux, clang, Python 3.8) hit the same problem.

From the test logs, I guess that the root cause is connected to the following warning message from the joblib/threadpoolctl package:

2022-01-14T22:12:20.3133305Z tests/python_package_test/test_sklearn.py::test_sklearn_integration[LGBMRegressor()-check_regressors_train(readonly_memmap=True,X_dtype=float32)]
2022-01-14T22:12:20.3133965Z   /root/miniconda/envs/test-env/lib/python3.8/site-packages/threadpoolctl.py:546: RuntimeWarning: 
2022-01-14T22:12:20.3134409Z   Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
2022-01-14T22:12:20.3134727Z   the same time. Both libraries are known to be incompatible and this
2022-01-14T22:12:20.3135026Z   can cause random crashes or deadlocks on Linux when loaded in the
2022-01-14T22:12:20.3135283Z   same Python program.
2022-01-14T22:12:20.3135545Z   Using threadpoolctl may cause crashes or deadlocks. For more
2022-01-14T22:12:20.3135827Z   information and possible workarounds, please see
2022-01-14T22:12:20.3136165Z       https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
2022-01-14T22:12:20.3136411Z   
2022-01-14T22:12:20.3136614Z     warnings.warn(msg, RuntimeWarning)
@StrikerRUS
Collaborator Author

For macOS, we have the following workaround to avoid conflicts between multiple instances of the libomp library:

LightGBM/.ci/test.sh

Lines 120 to 123 in 4aaeb22

if [[ $OS_NAME == "macos" ]] && [[ $COMPILER == "clang" ]]; then
    # fix "OMP: Error #15: Initializing libiomp5.dylib, but found libomp.dylib already initialized." (OpenMP library conflict due to conda's MKL)
    for LIBOMP_ALIAS in libgomp.dylib libiomp5.dylib libomp.dylib; do sudo ln -sf "$(brew --cellar libomp)"/*/lib/libomp.dylib $CONDA_PREFIX/lib/$LIBOMP_ALIAS || exit -1; done
fi

@StrikerRUS
Collaborator Author

StrikerRUS commented Jan 14, 2022

As far as I remember, only the Azure Pipelines Linux_latest and CUDA cuda 10.0 pip (linux, clang, Python 3.8) CI jobs have faced this problem (each multiple times), so I believe only the CI jobs where we use clang to compile LightGBM suffer from it.

LightGBM/.vsts-ci.yml

Lines 76 to 79 in 4aaeb22

- job: Linux_latest
  ###########################################
  variables:
    COMPILER: clang

- method: pip
  compiler: clang
  python_version: 3.8
  cuda_version: "10.0"

@StrikerRUS
Collaborator Author

StrikerRUS commented Jan 14, 2022

Official suggested workarounds:
https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md#workarounds-for-intel-openmp-and-llvm-openmp-case

  1. Tell MKL (used by NumPy) to use the GNU OpenMP runtime instead of the Intel OpenMP runtime by setting the following environment variable:

    export MKL_THREADING_LAYER=GNU
    
  2. Install a build of NumPy and SciPy linked against OpenBLAS instead of MKL. This can be done for instance by installing NumPy and SciPy from PyPI:

    pip install numpy scipy
    

    from the conda-forge conda channel:

    conda install -c conda-forge numpy scipy
    

    or from the default conda channel:

    conda install numpy scipy blas[build=openblas]
    
  3. Re-build your OpenMP-enabled extensions from source with GCC (or ICC) instead of Clang if you want to keep on using NumPy/SciPy linked against MKL with the default libiomp-based threading layer.
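For what it's worth, workaround 1 can also be applied from inside Python, with the caveat (my understanding, worth double-checking) that the variable must be set before NumPy, and therefore MKL, is first imported in the process:

```python
import os

# Workaround 1: point MKL's threading layer at the GNU OpenMP runtime
# (libgomp) so that libiomp is never loaded alongside clang's libomp.
# This has to happen before the first `import numpy` in the process;
# once MKL has picked a threading layer, the variable is ignored.
os.environ["MKL_THREADING_LAYER"] = "GNU"

# import numpy as np  # MKL-backed packages are safe to import from here on
```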

@StrikerRUS
Collaborator Author

StrikerRUS commented Jan 15, 2022

I guess we are facing a conflict between the default Ubuntu system-wide libomp and the libiomp installed from conda as a dependency of the numpy/scipy/scikit-learn packages. In addition, we have libgomp installed from conda as a dependency of Python itself.

Just as an example, latest master logs:

LightGBM/.ci/setup.sh

Lines 52 to 56 in 4aaeb22

if [[ $COMPILER == "clang" ]]; then
    sudo apt-get install --no-install-recommends -y \
        clang \
        libomp-dev
fi

2022-01-09T14:12:47.9352974Z Unpacking clang (1:10.0-50~exp1) ...
2022-01-09T14:12:47.9989218Z Selecting previously unselected package libomp5-10:amd64.
2022-01-09T14:12:48.0013884Z Preparing to unpack .../13-libomp5-10_1%3a10.0.0-4ubuntu1_amd64.deb ...
2022-01-09T14:12:48.0280193Z Unpacking libomp5-10:amd64 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.1021372Z Selecting previously unselected package libomp-10-dev.
2022-01-09T14:12:48.1041014Z Preparing to unpack .../14-libomp-10-dev_1%3a10.0.0-4ubuntu1_amd64.deb ...
2022-01-09T14:12:48.1110633Z Unpacking libomp-10-dev (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.1655290Z Selecting previously unselected package libomp-dev.
2022-01-09T14:12:48.1678522Z Preparing to unpack .../15-libomp-dev_1%3a10.0-50~exp1_amd64.deb ...
2022-01-09T14:12:48.1740800Z Unpacking libomp-dev (1:10.0-50~exp1) ...
2022-01-09T14:12:48.3024348Z Setting up libgc1c2:amd64 (1:7.6.4-0.4ubuntu1) ...
2022-01-09T14:12:48.3309310Z Setting up libedit2:amd64 (3.1-20191231-1) ...
2022-01-09T14:12:48.3575485Z Setting up libobjc4:amd64 (10.3.0-1ubuntu1~20.04) ...
2022-01-09T14:12:48.3746044Z Setting up libllvm10:amd64 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.4107754Z Setting up libclang1-10 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.4314994Z Setting up libobjc-9-dev:amd64 (9.3.0-17ubuntu1~20.04) ...
2022-01-09T14:12:48.4774687Z Setting up libomp5-10:amd64 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.5023165Z Setting up libc6-i386 (2.31-0ubuntu9.2) ...
2022-01-09T14:12:48.5264503Z Setting up libomp-10-dev (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.5746908Z Setting up libclang-cpp10 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.5924828Z Setting up lib32gcc-s1 (10.3.0-1ubuntu1~20.04) ...
2022-01-09T14:12:48.6177911Z Setting up lib32stdc++6 (10.3.0-1ubuntu1~20.04) ...
2022-01-09T14:12:48.6431429Z Setting up libomp-dev (1:10.0-50~exp1) ...
2022-01-09T14:12:48.6841900Z Setting up libclang-common-10-dev (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.7140409Z Setting up clang-10 (1:10.0.0-4ubuntu1) ...
2022-01-09T14:12:48.7313904Z Setting up clang (1:10.0-50~exp1) ...
...

  libgomp            pkgs/main/linux-64::libgomp-9.3.0-h5101ec6_17

...

  intel-openmp       pkgs/main/linux-64::intel-openmp-2021.4.0-h06a4308_3561

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=11954&view=logs&j=275189f9-c769-596a-7ef9-49fb48a9ab70&t=3a9e7a4a-04e6-52a0-67ea-6e8f6cfda74f&l=44

According to the threadpoolctl link above, deadlocks are observed only when the Clang and ICC OpenMP implementations are loaded at the same time. This explains why we don't see any timeouts in our gcc-configured CI jobs.

The only unrecoverable incompatibility we encountered happens when loading a mix of compiled extensions linked with libomp (LLVM/Clang) and libiomp (ICC), on Linux, manifested by crashes or deadlocks.
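To check which runtimes a given process has actually pulled in, a small diagnostic helper (my own sketch, Linux-only, assuming the usual runtime file names; `loaded_openmp_runtimes` is a hypothetical name, not a LightGBM or threadpoolctl API) can scan the process's memory map:

```python
import re
from pathlib import Path

def loaded_openmp_runtimes():
    """List OpenMP runtime libraries mapped into this process (Linux only).

    Looks for the three common runtimes: libgomp (GCC), libomp (LLVM/Clang)
    and libiomp (ICC/MKL). Seeing both libomp and libiomp at once is the
    unrecoverable mix described above.
    """
    pattern = re.compile(r"\b(?:libgomp|libomp|libiomp)[^/\s]*\.so[^/\s]*")
    maps = Path("/proc/self/maps").read_text()
    return sorted({m.group(0) for m in pattern.finditer(maps)})

print(loaded_openmp_runtimes())
```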

@StrikerRUS
Collaborator Author

If none of the maintainers knows a better workaround than those documented in threadpoolctl's help doc, I think we should choose one of those three (#4948 (comment)).

#3 is not an option for us, I think, because we'd like to test LightGBM against the Clang compiler. So we actually have only two workarounds: #1 (force MKL to use libgomp by setting an environment variable) and #2 (use OpenBLAS instead of MKL).

This CI problem is quite annoying because it makes us re-run CI jobs multiple times after they time out, and it slows down the whole development process in the repository.

@guolinke, @chivee, @shiyu1994, @tongwu-msft, @hzy46, @Laurae2, @jameslamb, @jmoralez

@jameslamb
Collaborator

Excellent investigation, thank you!

I vote for option 1, setting MKL_THREADING_LAYER=GNU in the environment for clang Linux jobs. I like that because it's what I'd most like to recommend to a LightGBM user facing this issue...I think "set this environment variable" is less invasive than "install this other conda package".


I'd also like to ask: @xhochy, if you have time, could you advise us on this issue? I'm wondering if you've experienced a similar issue with the lightgbm conda-forge feedstock or other projects that both depend on numpy and link to OpenMP themselves.

@xhochy

xhochy commented Jan 15, 2022

We saw these issues a long time ago in other packages, but they shouldn't occur these days in a pure conda-forge setting, as we have safeguards in place ensuring that only a single OpenMP implementation is installed at a time.

This is especially important for intel-openmp, which has been superseded by llvm-openmp (the latter now contains all patches from the former).

As this seems to be directly about failing CI in LightGBM: is there any usage of conda in any of the failing jobs?

@jameslamb
Collaborator

Thanks very much @xhochy !

is there any usage of conda in one of the failing jobs?

Yes. For more context, we do not use conda-forge in this project's CI. This project's CI installs all dependencies from the Anaconda default channels.

conda install -q -y -n $CONDA_ENV cloudpickle "dask=2021.9.1" "distributed=2021.9.1" joblib matplotlib numpy pandas psutil pytest scikit-learn scipy

And for jobs on Linux which use clang to compile LightGBM (the ones that are sometimes timing out because of these issues), we also separately install libomp-dev (see @StrikerRUS 's summary in #4948 (comment)).

@StrikerRUS
Collaborator Author

StrikerRUS commented Jan 15, 2022

I wonder, maybe it's a good time to migrate from the default conda channel to conda-forge? Besides this particular issue with different OpenMP implementations, the default conda channel is extremely slow in terms of updates and lacks some packages required for our CI.

Just some small examples.

The dask-core package, which is included in the Anaconda distribution (this clarification emphasizes its importance to conda maintainers), is currently at version 2021.10.0, while conda-forge already has version 2022.1.0.

(screenshot: Anaconda distribution package list showing the dask-core version)
https://docs.anaconda.com/anaconda/packages/pkg-docs/

https://anaconda.org/conda-forge/dask-core

The LightGBM version on the default conda channel is 3.2.1: https://anaconda.org/anaconda/lightgbm. Related issue: #3544 (comment).

Requests for adding new R packages and upgrading existing ones tend to be ignored: ContinuumIO/anaconda-issues#11604, ContinuumIO/anaconda-issues#11571. For this reason, we have already migrated to conda-forge for building our docs: #4767.

In addition, the conda-forge channel often supports more architectures (Miniforge): #4843 (comment).

Download stats for LightGBM (especially for the recent versions) show that users already prefer conda-forge to the default channel: https://anaconda.org/conda-forge/lightgbm/files vs https://anaconda.org/anaconda/lightgbm/files.

Just a reminder: it's better not to mix different channels in one environment, not only because of possible package conflicts, but also because resolving the environment specification during the installation phase takes a long time and consumes a lot of memory (which matters for CI): #4054 (review), ContinuumIO/anaconda-issues#11604 (comment).

@jameslamb
Collaborator

@StrikerRUS thanks for all the research! I strongly support moving LightGBM's CI to using only conda-forge, and I'd be happy to do that work.

@StrikerRUS
Collaborator Author

@jameslamb Thank you very much! Before we start, let's wait for some other opinions...

@guolinke @shiyu1994 @tongwu-msft @hzy46 @Laurae2 @jmoralez

@jmoralez
Collaborator

jmoralez commented Jan 15, 2022

I also support using conda-forge, and maybe we could consider using mamba in CI. The time to solve environments and install packages is significantly reduced.

@StrikerRUS
Collaborator Author

StrikerRUS commented Jan 15, 2022

@jmoralez
Let's go deeper! Mambaforge! 😄
https://github.com/conda-forge/miniforge#mambaforge

@jameslamb
Collaborator

@StrikerRUS @guolinke could I have "Write" access on https://github.com/guolinke/lightgbm-ci-docker?

I realized today that to make this change to mamba I'll need to update that image, and it would be easier if I could directly push to the dev branch there and have new images pushed, so I could test them in LightGBM's CI.

Otherwise, I'll have to make a PR from my fork of lightgbm-ci-docker into the dev branch of that repo, and then if anything breaks repeat the process of PR-ing from my fork to dev and waiting for approval.

@StrikerRUS
Collaborator Author

@jameslamb Good idea but unfortunately I have no rights to grant you "write" access.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot removed the blocking label Aug 16, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023