-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove unnecessary omp single that cause deadlock (fixes #6273) #6394
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the EXCELLENT investigation and write-up! This fix makes sense to me.
To ensure we don't introduce a deadlock like this on this codepath again, could you please add the example Python code as a test here?
Right after this test:
def test_dataset_construction_overwrites_user_provided_metadata_fields(): |
Something like this:
def test_dataset_construction_with_high_cardinality_categorical_succeeds():
pd = pytest.importorskip("pandas")
X = pd.DataFrame({"x1": np.random.randint(0, 5_000, 10_000)})
y = np.random.rand(10_000)
dtrain = lgb.Dataset(X, y, categorical_feature=["x1"])
dtrain.construct()
assert ds.num_data() == 10_000
assert ds.num_feature() == 1
@microsoft-github-policy-service agree |
75ba6eb
to
112f757
Compare
112f757
to
0749516
Compare
Thanks for the update! We'll review them soon. In the future when you contribute to LightGBM (and we hope you will!), don't force-push here. It's not necessary, as we squash all commits into 1 when merging to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, excellent fix!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much! I'd still like the opportunity to test this on a large multi-core machine to be sure it's working as expected (similar to the "How I tested this" steps in #6226). Will try to do that soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this tonight using an approach similar to #6226, with both the R and Python packages. Confirmed that the processing time scales as expected with higher values of environment variable OMP_NUM_THREADS
. Also confirmed that the unit test you've added here gets deadlocked on master
but runs quickly and successfully with the changes here 🎉
Thank you SO MUCH for taking the time to investigate and fix this, and for the great explanation in the PR description.
fixes #6273
omp single clause has implicit barrier, which waits for all threads, including threads that did not execute omp single block.
Therefore, if omp single clause is used conditionally inside omp parallel, it will cause a deadlock (explanation below).
I think it is safe to remove omp single clause from
omp_get_max_threads()
(introuduced in #6226).omp_get_max_threads()
is thread-safe https://www.openmp.org/spec-html/5.1/openmpse6.html#x27-260001.6omp_get_max_threads()
gets its value.background
I digged into #6273.
I suspect a deadlock near
OMP_NUM_THREADS()
from the backtrace.In this case,
LightGBM::DatasetLoader::ConstructFromSampleData
has omp parallel and OMP_NUM_THREADS() (FindBin => ArgMax => ArgMaxMT => OMP_NUM_THREADS()) has omp single.deadlock explanation code