LightGBM hangs when loading a model file in a subprocess #6137

Open
assassin5615 opened this issue Oct 10, 2023 · 5 comments

assassin5615 commented Oct 10, 2023

Description

I train two models in the main process and save them to two model files.
When I then use multiprocessing.Pool to load these two model files in subprocesses, the subprocesses hang.
Part of the stack trace, captured with pyrasite-shell, is below:

File "simple_lgbm.py", line 77, in predict
    x = lgb.Booster(model_file=file_name)
File ".../lightgbm/basic.py", line 2087, in __init__
    _safe_call(_LIB.LGBM_BoosterCreateFromModelfile(

gdb shows more detail: the CreateBoosting function calls something like __kmp_api_GOMP_parallel_40_alias() and hangs in __kmp_suspend_64().
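
When pyrasite or gdb is not available, the standard-library faulthandler module can also give a Python-level stack for a stuck worker. The sketch below is only an illustration and was not used for this report. The helper name enable_stack_dump is hypothetical, and it assumes a POSIX system where SIGUSR1 is otherwise unused.

```python
import faulthandler
import signal
import sys


def enable_stack_dump(file=sys.stderr, signum=signal.SIGUSR1):
    """Dump the Python stack of every thread to `file` when `signum` arrives."""
    # faulthandler.register is only available on POSIX systems. The dump is
    # written from the signal handler itself, so it should not depend on the
    # interpreter being responsive at that moment.
    faulthandler.register(signum, file=file, all_threads=True)
```

Calling enable_stack_dump() at the top of predict() and then running kill -USR1 <pid> against the stuck worker should print where the Python code is waiting. It shows Python frames only, so native frames such as __kmp_suspend_64() still need gdb.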

The LightGBM FAQ mentions that, due to an OpenMP bug, LightGBM can hang when multithreading is combined with fork on Linux, and it suggests setting nthreads=1 to disable multithreading. However, setting nthreads=1 has no effect for lgb.Booster when loading a model file.

Is there a workaround or a fix for this?

Reproducible example

The code is based on simple_example.py from the LightGBM repo.

# coding: utf-8
from pathlib import Path
from multiprocessing import get_context

import pandas as pd
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

print('Loading data...')
# load or create your dataset
regression_example_dir = Path(__file__).absolute().parents[1] / 'regression'
df_train = pd.read_csv(str(regression_example_dir / 'regression.train'), header=None, sep='\t')
df_test = pd.read_csv(str(regression_example_dir / 'regression.test'), header=None, sep='\t')

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = dict(
    task='train',
    objective='regression',
    num_leaves=50,
    max_depth=6,
    n_jobs=10,
    min_data_in_leaf=100,
    feature_fraction=0.8,
    num_iterations=20,
    learning_rate=0.1,
    deterministic=True,
    metric=['rmse'],
    force_col_wise=True,
    verbose=-1
    )

print('Starting training...')
def train(file_name: str):
    # train
    gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                callbacks=[lgb.early_stopping(stopping_rounds=5)])

    print('Saving model...')
    # save model to file
    gbm.save_model(file_name)

train('model1.txt')
train('model2.txt')

print('Starting predicting...')

def predict(file_name: str):
    # the worker process hangs on the next line
    x = lgb.Booster(model_file=file_name)
    y_pred = x.predict(X_test, num_iteration=x.best_iteration)
    rmse_test = mean_squared_error(y_test, y_pred) ** 0.5
    print(f'The RMSE of prediction is: {rmse_test}')
    # return the score so the loop in the parent has a result to print
    return rmse_test

with get_context("fork").Pool(processes=2) as pool:
    for r in pool.imap_unordered(predict, ['model1.txt', 'model2.txt']):
        print(f'got result {r}')
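
The pool above is created with the fork start method, which is the combination the FAQ warns about. One variant that may be worth trying is the spawn start method, where each worker starts a fresh interpreter instead of inheriting the parent's already-initialised OpenMP state. This has not been confirmed to avoid the hang, so treat it as an untested sketch. The names load_and_predict and predict_all are hypothetical, and X is assumed to be a picklable 2-D array or DataFrame.

```python
from multiprocessing import get_context


def load_and_predict(model_file, X):
    # Imported lazily: under spawn the worker is a fresh interpreter, so
    # LightGBM and its OpenMP runtime are initialised here, in the worker.
    import lightgbm as lgb

    booster = lgb.Booster(model_file=model_file)
    return booster.predict(X)


def predict_all(model_files, X, start_method="spawn"):
    """Run load_and_predict for each model file in its own worker process."""
    ctx = get_context(start_method)
    with ctx.Pool(processes=len(model_files)) as pool:
        return pool.starmap(load_and_predict, [(f, X) for f in model_files])
```

With spawn, the call to predict_all has to sit under an if __name__ == "__main__": guard, and the training code must not run at import time, because every worker re-imports the main module.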

Environment info

LightGBM version or commit hash: 4.0.0

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

@shiyu1994
Collaborator

@assassin5615 Thanks for using LightGBM. Did you try setting the environment variable OMP_NUM_THREADS to 1?

@assassin5615
Author

assassin5615 commented Oct 13, 2023

@shiyu1994 Yes. In my environment OMP_NUM_THREADS is always set to 1, because I ran into other issues that required it.

@assassin5615
Author

I also printed the value of OMP_NUM_THREADS in the script: it is 1 before both the training and the prediction calls.
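
For anyone repeating this check, a minimal sketch of pinning and verifying the variable from inside the script follows. OpenMP runtimes read OMP_NUM_THREADS when they initialise, so the assignment belongs before lightgbm is imported. The helper name pin_openmp_threads is hypothetical. As reported above, having the value at 1 did not prevent the hang in this case.

```python
import os


def pin_openmp_threads(n=1):
    """Set OMP_NUM_THREADS for this process and any children it starts."""
    os.environ["OMP_NUM_THREADS"] = str(n)
    return os.environ["OMP_NUM_THREADS"]


# Call this before `import lightgbm`, so the OpenMP runtime sees the value
# when it starts up.
print("OMP_NUM_THREADS =", pin_openmp_threads(1))
```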

@ChiHangChen

I encountered the very same problem. Any solutions so far?

@ChiHangChen


OK, I just found a hack after struggling for 3 hours.

I moved the training stage into a subprocess instead of running it in the main process. After that, loading the model with lgb.Booster in a subprocess no longer hangs.
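
A minimal sketch of that hack, assuming a POSIX system and the train() function from the reproducible example. The helper name run_in_child is hypothetical. It runs a function in a short-lived child process, so the parent never executes the multithreaded training code itself before it forks the prediction pool. This is one reading of the comment above, not a verified fix.

```python
import multiprocessing as mp


def run_in_child(func, *args):
    """Run func(*args) in a short-lived forked process and wait for it."""
    ctx = mp.get_context("fork")  # fork is only available on POSIX systems
    proc = ctx.Process(target=func, args=args)
    proc.start()
    proc.join()
    # A non-zero exit code means the child raised or was killed.
    if proc.exitcode != 0:
        raise RuntimeError(f"child exited with code {proc.exitcode}")
```

In the reproducible example this would replace the two direct calls with run_in_child(train, 'model1.txt') and run_in_child(train, 'model2.txt'), leaving the prediction pool unchanged.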
