light gbm hangs when loading a model file in subprocess #6137

assassin5615 · 2023-10-10T03:10:44Z

Description

I train two models in the main process and save them to two model files.
When I then use multiprocessing.Pool to load these two model files in subprocesses, the subprocesses hang.
Part of the stack trace, captured with pyrasite-shell, is below:

File "simple_lgbm.py", line 77, in predict
x = lgb.Booster(model_file=file_name)
File ".../lightgbm/basic.py", line 2087, in __init__
_safe_call(_LIB.LGBM_BoosterCreateFromModelfile(

gdb shows more detail: the CreateBoosting function calls something like __kmp_api_GOMP_parallel_40_alias(), and the process hangs in __kmp_suspend_64().

The LightGBM FAQ mentions that, due to an OpenMP bug, LightGBM can hang when multithreading is combined with fork on Linux, and it suggests setting nthreads=1 to disable multithreading. However, setting nthreads=1 has no effect on lgb.Booster when loading a model file.

Is there a workaround or fix for this?

Reproducible example

The code is based on simple_example.py from the LightGBM repo.

# coding: utf-8
from pathlib import Path
from multiprocessing import get_context

import pandas as pd
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

print('Loading data...')
# load or create your dataset
regression_example_dir = Path(__file__).absolute().parents[1] / 'regression'
df_train = pd.read_csv(str(regression_example_dir / 'regression.train'), header=None, sep='\t')
df_test = pd.read_csv(str(regression_example_dir / 'regression.test'), header=None, sep='\t')

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = dict(
    task='train',
    objective='regression',
    num_leaves=50,
    max_depth=6,
    n_jobs=10,
    min_data_in_leaf=100,
    feature_fraction=0.8,
    num_iterations=20,
    learning_rate=0.1,
    deterministic=True,
    metric=['rmse'],
    force_col_wise=True,
    verbose=-1
    )

print('Starting training...')
def train(file_name: str):
    # train
    gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                callbacks=[lgb.early_stopping(stopping_rounds=5)])

    print('Saving model...')
    # save model to file
    gbm.save_model(file_name)

train('model1.txt')
train('model2.txt')

print('Starting predicting...')

def predict(file_name: str):
    # it hangs here
    x = lgb.Booster(model_file=file_name)
    y_pred = x.predict(X_test, num_iteration=x.best_iteration)
    rmse_test = mean_squared_error(y_test, y_pred) ** 0.5
    print(f'The RMSE of prediction is: {rmse_test}')
    # return the score so the 'got result' line below prints a value instead of None
    return rmse_test

with get_context("fork").Pool(processes=2) as pool:
    for r in pool.imap_unordered(predict, ['model1.txt', 'model2.txt']):
        print(f'got result {r}')
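For comparison, here is a rough sketch of the same pool driven by the "spawn" start method instead of "fork". It assumes that starting fresh interpreters avoids inheriting the parent's OpenMP state; nobody in this thread has confirmed it as a fix, and it costs more because every worker re-imports its libraries. The names make_pool_context and count_trees are made up for illustration, and model1.txt and model2.txt are assumed to exist from an earlier training run.

```python
from multiprocessing import get_context


def make_pool_context(method="spawn"):
    """Return a multiprocessing context; 'spawn' starts fresh interpreters instead of forking."""
    return get_context(method)


def count_trees(file_name):
    """Hypothetical worker: load one saved model and report how many trees it holds."""
    # Imported inside the worker: a spawned child starts without the parent's loaded modules.
    import lightgbm as lgb

    booster = lgb.Booster(model_file=file_name)
    return file_name, booster.num_trees()


def main():
    # Assumes model1.txt and model2.txt were saved by an earlier training run.
    with make_pool_context().Pool(processes=2) as pool:
        for r in pool.imap_unordered(count_trees, ["model1.txt", "model2.txt"]):
            print(f"got result {r}")
```

With spawn, main() has to be called under an `if __name__ == "__main__":` guard, and the training code must not sit at module level, because each worker re-imports the script.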

Environment info

LightGBM version or commit hash: 4.0.0

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

shiyu1994 · 2023-10-12T16:56:14Z

@assassin5615 Thanks for using LightGBM. Did you try setting the environment variable OMP_NUM_THREADS to 1?
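A minimal sketch of that check, assuming the variable needs to be in the environment before the OpenMP runtime initialises (so it is set before any `import lightgbm`). The helper name report_omp_threads is made up for illustration. Forked workers inherit the parent's environment, so each one should report the same value as the parent.

```python
import os
from multiprocessing import get_context

# Set this before importing lightgbm; OpenMP runtimes typically read it once at start-up.
os.environ["OMP_NUM_THREADS"] = "1"


def report_omp_threads(tag):
    """Return the tag and the value of OMP_NUM_THREADS seen by the calling process."""
    return tag, os.environ.get("OMP_NUM_THREADS")


if __name__ == "__main__":
    print(report_omp_threads("parent"))
    # Forked workers get a copy of the parent's environment.
    with get_context("fork").Pool(processes=2) as pool:
        for row in pool.imap_unordered(report_omp_threads, ["worker-1", "worker-2"]):
            print(row)
```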

assassin5615 · 2023-10-13T09:56:58Z

@shiyu1994 Yes. In my environment OMP_NUM_THREADS is always 1, because I ran into other issues that required setting OMP_NUM_THREADS to 1.

assassin5615 · 2023-10-16T03:11:40Z

I also printed the value of OMP_NUM_THREADS in the script: it is 1 before both training and prediction.

ChiHangChen · 2025-02-23T14:22:59Z

I encountered the very same problem. Any solutions so far?

ChiHangChen · 2025-02-23T14:34:11Z

OK, I just found a HACK after three hours of struggling.

I moved the training stage into a subprocess instead of running it in the main process. After that, loading the model with lgb.Booster in a subprocess no longer hangs.
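A minimal sketch of that hack, under the assumption (not confirmed in this thread) that it works because the main process never runs LightGBM's multithreaded training code before the prediction workers are forked. The helper names run_in_child, train_and_save and load_and_predict are hypothetical, and the arguments are whatever the original script passes to lgb.train and predict.

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import get_context


def run_in_child(func, *args):
    """Run func(*args) in a short-lived forked child process and return its result."""
    with ProcessPoolExecutor(max_workers=1, mp_context=get_context("fork")) as pool:
        return pool.submit(func, *args).result()


def train_and_save(params, X, y, file_name):
    """Train and save a model; meant to be called through run_in_child."""
    # Lazy import, so merely defining these helpers does not load LightGBM in the parent.
    import lightgbm as lgb

    booster = lgb.train(params, lgb.Dataset(X, y))
    booster.save_model(file_name)
    return file_name


def load_and_predict(file_name, X):
    """Load a saved model in a worker process and return its predictions."""
    import lightgbm as lgb

    return lgb.Booster(model_file=file_name).predict(X)
```

Usage would look like `run_in_child(train_and_save, params, X_train, y_train, 'model1.txt')` for each model, followed by the original Pool call with a worker like load_and_predict. The functions must be defined at module level so they can be pickled.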

jameslamb added the question label Oct 10, 2023

jameslamb added the awaiting response label Oct 12, 2023

github-actions bot removed the awaiting response label Oct 13, 2023

jameslamb mentioned this issue Dec 5, 2023

[R-package] [c++] add tighter multithreading control, avoid global OpenMP side effects (fixes #4705, fixes #5102) #6226

Merged
