
Calling multithreaded functions sets global number of OMP threads #4705

Closed
david-cortes opened this issue Oct 22, 2021 · 9 comments · Fixed by #6226

@david-cortes (Contributor) commented Oct 22, 2021

Calling LightGBM functions that use multithreading forcibly changes the configured number of OpenMP threads for the whole process from which LightGBM was called, which is bad practice (for example, depending on how numpy was built, it might change the number of threads that numpy matrix multiplications use afterwards). This can be seen in lines such as this one:

omp_set_num_threads(config.num_threads);

The fix is easy but requires some design changes in the overall code structure: don't set the global number of OpenMP threads; instead, pass the desired count as a parameter to the OpenMP pragmas.
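A minimal sketch of that approach (illustrative only, not LightGBM's actual code; the function, loop body, and variable names are placeholders):

#include <omp.h>
#include <vector>

// Hypothetical example: the num_threads clause scopes the thread count to
// this one parallel region, so the process-wide OpenMP setting reported by
// omp_get_max_threads() is left untouched.
void ParallelDouble(const std::vector<double>& data, int num_threads,
                    std::vector<double>* out) {
  out->resize(data.size());
  #pragma omp parallel for num_threads(num_threads)
  for (int i = 0; i < static_cast<int>(data.size()); ++i) {
    (*out)[i] = 2.0 * data[i];  // placeholder for the real per-element work
  }
}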

Example demonstrating the current behavior:

import sklearn
import threadpoolctl
threadpoolctl.threadpool_info()
[{'filepath': '/home/david/anaconda3/envs/py3/lib/libopenblasp-r0.3.13.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.13',
  'num_threads': 16,
  'threading_layer': 'pthreads',
  'architecture': 'Zen'},
 {'filepath': '/home/david/anaconda3/envs/py3/lib/libgomp.so.1.0.0',
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'internal_api': 'openmp',
  'version': None,
  'num_threads': 16}]
from lightgbm import LGBMRegressor
from sklearn.datasets import fetch_california_housing
X, y = fetch_california_housing(return_X_y=True)

model = LGBMRegressor(n_jobs=3).fit(X, y)
p = model.predict(X)
threadpoolctl.threadpool_info()
[{'filepath': '/home/david/anaconda3/envs/py3/lib/libopenblasp-r0.3.13.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.13',
  'num_threads': 16,
  'threading_layer': 'pthreads',
  'architecture': 'Zen'},
 {'filepath': '/home/david/anaconda3/envs/py3/lib/libgomp.so.1.0.0',
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'internal_api': 'openmp',
  'version': None,
  'num_threads': 3}]
@StrikerRUS (Collaborator) commented:

Ah, I just wrote a comment in the related issue! 🙂 Could you please share your expert opinion on the chosen design in #4607?

@StrikerRUS (Collaborator) commented:

Related: dmlc/xgboost#7537.

@guolinke (Collaborator) commented Mar 2, 2022

@david-cortes if you use the default value (-1), omp_set_num_threads is not called, and the global OpenMP number of threads is used by default. Does that meet your needs?

@david-cortes (Contributor, Author) commented:

@guolinke: But the issue still persists: if one wants to pass a number of threads that's different from the OMP default, doing so will alter the current OMP default.

Also, scikit-learn's glossary mentions that a negative n_jobs is interpreted according to joblib's formula:
https://scikit-learn.org/stable/glossary.html#term-n_jobs
(so, e.g., -1 means using all threads instead of following the OMP default)

I think users would typically expect scikit-learn-compatible packages to follow that kind of convention.
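For reference, a hedged sketch of joblib's documented convention for negative n_jobs values (illustrative only; the function name is hypothetical, and this is not code from LightGBM or joblib):

#include <algorithm>
#include <omp.h>

// joblib's documented rule: for negative n_jobs, (n_cpus + 1 + n_jobs)
// workers are used, so -1 means all CPUs and -2 means all CPUs but one.
int ResolveNJobs(int n_jobs) {
  const int n_cpus = omp_get_num_procs();
  if (n_jobs > 0) return n_jobs;
  return std::max(1, n_cpus + 1 + n_jobs);
}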

@guolinke (Collaborator) commented Mar 3, 2022

I see. A quick solution is to have a context manager: for example, when LightGBM starts running, it records the current OMP number of threads, and when exiting LightGBM, it resets the OMP number of threads to the previously recorded value.
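A minimal sketch of that idea as an RAII guard on the C++ side (one possible reading of the suggestion, not an agreed-upon design; the class and function names are hypothetical):

#include <omp.h>

// Hypothetical guard: saves the process-wide OpenMP thread count on entry
// and restores it on exit, so an omp_set_num_threads() call made while the
// guard is alive does not leak out of the LightGBM call.
class OmpThreadCountGuard {
 public:
  OmpThreadCountGuard() : saved_(omp_get_max_threads()) {}
  ~OmpThreadCountGuard() { omp_set_num_threads(saved_); }
 private:
  int saved_;
};

void RunWithTemporaryThreadCount(int requested_threads) {
  OmpThreadCountGuard guard;               // record the current setting
  omp_set_num_threads(requested_threads);  // what LightGBM currently does
  // ... training / prediction work would run here ...
}  // guard restores the previous thread count when it goes out of scope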

@Innixma commented Jan 11, 2023

Just wanted to mention that this problem is quite nasty when working on AutoML systems (I develop AutoGluon), as LightGBM can alter the OpenMP thread count at inference time, dramatically slowing down torch neural network models running on CPU when ensembling them together. It would be great if this were fixed. We currently use a context manager to avoid the issue, but it slows down inference speed in small-batch-size scenarios.

@jameslamb (Collaborator) commented:

Thanks for the comment, @Innixma! That's important context.

We've really been struggling with a lack of maintainer attention and availability here, which is why this has been open so long. Are you interested in working on this problem and submitting a proposal in a pull request? That is the best way to ensure that it's fixed in the next release of LightGBM.

@Innixma commented Jan 28, 2023

Hi @jameslamb, thanks for your attention!

I'm sorry to hear that LightGBM is struggling to get prioritization. It is an amazing package and foundational to many AutoML systems. There is a reason LightGBM is the only model we include more than once in our ensemble: we use it 3 times!

Unfortunately, to say that my C++ skills are rusty would be an understatement, and I'd expect the rest of the AutoGluon developers would say the same.

We recently implemented a workaround in AutoGluon that fixes the majority of the slowdown (2.5x faster ensemble inference). The workaround mostly boils down to avoiding the use of a context manager entirely and always calling lightgbm_model.predict(X, num_threads=0), thus avoiding any alteration of the OpenMP thread limit and simply using whatever value happens to be set globally.

This solution is a bit shaky, since it depends on the assumption that LightGBM will default to the current OpenMP thread limit and that the current OpenMP thread limit is set to a value that all model types are efficient with (not really possible in reality, as some models prefer the logical thread count and others the physical core count, but hopefully it is close enough to not be too slow). Also, the optimal OpenMP thread limit differs based on inference batch size, but that is a separate dimension of optimization.

Thankfully, it is pretty easy to benchmark, and in most cases I think the workaround solves the problem on our end.

@jameslamb (Collaborator) commented:

Assigning this to myself, as I'm actively working on it right now.
