
LightGBM does not like NUMA (large performance impact on servers) #1441

Closed
Laurae2 opened this issue Jun 10, 2018 · 21 comments

Comments

@Laurae2
Contributor

Laurae2 commented Jun 10, 2018

Environment info

  • Operating System: Windows 8.1 / 10, Windows Server 2012 / 2016, WSL Ubuntu 16.04, Clear Linux (latest)
  • CPU: Quad Intel Xeon 6154, 768GB RAM (72 physical cores, all 3.7 GHz, all RAM banks populated evenly)
  • C++/Python/R version: R 3.5 / Python 3.6, LightGBM commit 96e7016

By default, servers ship with NUMA enabled. However, with UMA (Node Interleaving), LightGBM performance increases significantly (sometimes by more than 80%). On dual-processor systems, the impact can exceed 30%.

This issue is not limited to R; it also affects Python and the CLI.

I also tested the impact on an 8x Xeon Platinum 8180 (768GB RAM, all RAM banks populated evenly): it was 200%+ slower with Node Interleaving off (NUMA on).

xgboost is also affected by this issue, but less severely than LightGBM.

Related issues:

Untested:

  • Sub-NUMA Clustering (NUMA inside a processor) was not tested and would probably result in even worse performance.
  • numactl was not used because, first, it is not available on Windows (Windows pins data to the correct RAM banks only if affinity is specified), and second, it does not make sense when using all available threads.

Reproducible examples

I used a private dataset (takes 250GB+ RAM), but the issue is easily reproducible using HIGGS dataset: https://archive.ics.uci.edu/ml/datasets/HIGGS

See the results on a dual Xeon Gold 6130 and 384GB RAM here: https://public.tableau.com/views/NodeInterleavingv1/Dashboard?:bootstrapWhenNotified=true&:display_count=y&:display_static_image=y&:embed=y&:showVizHome=no&publish=yes

Steps to reproduce

  1. Get on a multiprocessor server with full access to the BIOS (at least a dual Xeon, ideally with a lot of threads)
  2. In BIOS, turn off Node Interleaving (if enabled)
  3. Get the training time on a large dataset using all threads
  4. In BIOS, turn on Node Interleaving
  5. Get the training time on the same dataset using all threads
  6. Compare the timings from steps 3 and 5 (NUMA vs. UMA)

Code from #542 could be used through the CLI.

@guolinke
Collaborator

guolinke commented Jun 11, 2018

@Laurae2
maybe using parallel learning is better on a NUMA cluster?

BTW, the multi-threading is based on OpenMP. Thus, maybe we need additional settings to improve OpenMP's performance on NUMA:

https://stackoverflow.com/questions/11959906/openmp-and-numa-relation

http://prace.it4i.cz/sites/prace.it4i.cz/files/files/advancedopenmptutorial_2.pdf
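
For reference, here is a minimal sketch (not LightGBM code; it assumes an OpenMP 4.5 runtime) of how to check what the runtime actually does with thread placement. `OMP_PROC_BIND` and `OMP_PLACES` are the standard OpenMP environment variables for controlling binding:

```cpp
// Sketch: report the OpenMP proc-bind policy and the place each thread is
// bound to. Build with `g++ -fopenmp check_binding.cpp` and run with e.g.
// OMP_PROC_BIND=close OMP_PLACES=cores to compare placements.
#include <cstdio>
#include <omp.h>

int main() {
  std::printf("proc_bind policy: %d, number of places: %d\n",
              static_cast<int>(omp_get_proc_bind()), omp_get_num_places());
  #pragma omp parallel
  {
    // Each thread reports which place (core/socket, per OMP_PLACES) it runs on.
    std::printf("thread %d -> place %d\n",
                omp_get_thread_num(), omp_get_place_num());
  }
  return 0;
}
```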

@huanzhang12
Contributor

huanzhang12 commented Jun 11, 2018

I also observed a similar phenomenon before.
When there are multiple NUMA nodes, the memory holding the feature data might be allocated to RAM on a single node, and CPUs on the other nodes may have very slow access to the data:
https://github.com/Microsoft/LightGBM/blob/master/src/io/dense_bin.hpp#L72

Ideally, data_indices, ordered_gradients, ordered_hessians and out should be allocated in RAM that is local to the CPU core (within the same node). Array data_ needs random access, so we cannot split it easily, but we can duplicate it on all NUMA nodes since it is read-only. However, these optimizations do not seem straightforward to implement.
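
To make the duplication idea concrete, here is a rough sketch (an illustration only, not LightGBM code; it assumes Linux with libnuma, linked with `-lnuma`) of replicating a read-only buffer onto every NUMA node so that each socket reads a local copy:

```cpp
// Sketch: replicate a read-only buffer (a stand-in for data_) onto every NUMA
// node with libnuma; each thread would then read the copy local to its node.
#include <cstdio>
#include <cstring>
#include <vector>
#include <numa.h>

int main() {
  if (numa_available() < 0) {
    std::printf("libnuma is not available on this system\n");
    return 1;
  }
  const size_t n = size_t(1) << 24;       // 16 MB stand-in for the feature data
  std::vector<unsigned char> data(n, 1);  // the original, read-only copy

  const int nodes = numa_num_configured_nodes();
  std::vector<unsigned char*> replicas(nodes, nullptr);
  for (int node = 0; node < nodes; ++node) {
    // Allocate the replica from memory that is local to `node`, then copy.
    replicas[node] = static_cast<unsigned char*>(numa_alloc_onnode(n, node));
    if (replicas[node] == nullptr) {
      std::printf("allocation on node %d failed\n", node);
      return 1;
    }
    std::memcpy(replicas[node], data.data(), n);
  }
  std::printf("replicated %zu bytes onto %d node(s)\n", n, nodes);

  for (int node = 0; node < nodes; ++node) numa_free(replicas[node], n);
  return 0;
}
```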

BTW, a few years ago I wrote a paper on optimizing SGD on NUMA machines which uses similar techniques: https://ieeexplore.ieee.org/document/7837887/ (HogWild++: A New Mechanism for Decentralized Asynchronous Stochastic Gradient Descent)

Because we use static work scheduling, the slowest CPU will determine the running time. This will make our situation even worse.
https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L429
(edit: should be https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset.cpp#L681)

We can first try dynamic OpenMP scheduling (change schedule(static, X) to schedule(dynamic)) and see if the issue is alleviated, but it might slow down training on single-socket systems, as dynamic scheduling has more overhead. Making LightGBM fully NUMA-aware will need more effort.
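
As an illustration only (this is not LightGBM's actual histogram code), the suggested experiment boils down to swapping the schedule clause on a per-feature loop whose iterations have uneven cost:

```cpp
// Sketch: schedule(static) gives each thread a fixed block of features up
// front; schedule(dynamic) lets idle threads grab the next feature, which can
// hide a slow (remote-memory) thread but adds scheduling overhead.
#include <cstdio>
#include <vector>

int main() {
  const int num_features = 408;  // e.g. Higgs with 2-way interactions
  const int num_bins = 255;
  std::vector<double> hist(static_cast<size_t>(num_features) * num_bins, 0.0);

  // Baseline: #pragma omp parallel for schedule(static)
  #pragma omp parallel for schedule(dynamic)
  for (int f = 0; f < num_features; ++f) {
    for (int b = 0; b < num_bins; ++b) {
      hist[static_cast<size_t>(f) * num_bins + b] += f + b;  // dummy work
    }
  }
  std::printf("done, hist[0] = %f\n", hist[0]);
  return 0;
}
```

Whether the reduced load imbalance outweighs the extra scheduling overhead is exactly what the benchmarks below measure.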

@guolinke
Collaborator

Thanks @huanzhang12 so much. That is very helpful.

@guolinke
Collaborator

@Laurae2 any chance to try @huanzhang12's suggestions?

@Laurae2
Contributor Author

Laurae2 commented Jun 16, 2018

@guolinke Is this the only line to change?

https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L429

to:

#pragma omp parallel for schedule(dynamic,1024) if (num_features_ >= 2048)

Or should I change all static to dynamic?

@huanzhang12
Contributor

@Laurae2 I guess there are many others, but this one should be the most important one if the data is large and dense (like Higgs). You can try it first.

@huanzhang12
Contributor

huanzhang12 commented Jun 16, 2018

@Laurae2 Oh sorry I actually misread the code. The major computation loop is actually here:
https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset.cpp#L681
You can try to change this one to dynamic; it should have some effect.

(there are actually some different cases here, so you probably need to change all #pragma omp in this function)

@Laurae2
Contributor Author

Laurae2 commented Jun 17, 2018

@guolinke @huanzhang12 Here are some results with different schedulers on the function you quoted (I changed all static schedulers in ConstructHistograms), using a dual Xeon Gold 6130 (3.7 GHz single thread, 2.8 GHz all cores).

Timings, average of 5 runs:

| Scheduling | NUMA | UMA |
| --- | --- | --- |
| static | 100% avg time (baseline) | 100% avg time (baseline) |
| dynamic | 113% avg time (13% slower) | 111% avg time (11% slower) |

Efficiencies vs 1 thread (1 thread is 132% efficient on my CPU because of turbo boost), average of 5 runs:

| Scheduler | NUMA 1T | NUMA 2T | NUMA 4T | NUMA 8T | NUMA 16T | NUMA 32T | NUMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 132% | 121% | 116% | 109% | 82% | 55% | 23% |
| dynamic | 132% | 121% | 121% | 119% | 93% | 58% | 30% |

| Scheduler | UMA 1T | UMA 2T | UMA 4T | UMA 8T | UMA 16T | UMA 32T | UMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 132% | 130% | 127% | 120% | 107% | 76% | 46% |
| dynamic | 132% | 125% | 121% | 112% | 100% | 77% | 44% |

This is the CPU behavior observed during training:

| Scheduling | NUMA | UMA |
| --- | --- | --- |
| static | 1 CPU = 1-32 threads (autopinned). 2 CPU = 33-64 threads (autopinned). NUMA optimized on CPU by the kernel scheduler. | 2 CPU = 1-64 threads (bouncing but pinned on the same bouncing threads). Always uses both CPUs at any number of threads. |
| dynamic | 1 CPU = 1-32 threads (autopinned). 2 CPU = 33-64 threads (autopinned). NUMA optimized on CPU by the kernel scheduler. | 2 CPU = 1-64 threads (bouncing but pinned on the same bouncing threads). Always uses both CPUs at any number of threads. |

@huanzhang12
Contributor

@Laurae2 So it seems dynamic scheduling does not help. However, the parallel section in ConstructHistogram does have quite a noticeable performance impact. What dataset did you use?

@Laurae2
Contributor Author

Laurae2 commented Jun 17, 2018

@huanzhang12 Higgs (11M observations, https://archive.ics.uci.edu/ml/datasets/HIGGS), with 2-way interactions (multiplication) of all features.

It makes a total of 11M observations x 408 features.

@huanzhang12
Contributor

@Laurae2 Thanks for the clarification!
In the NUMA setting, it seems the dynamic scheduler always has better efficiency than static. Why does it have a longer runtime in the first table?
In the UMA setting, the efficiency numbers look reasonable; dynamic scheduling is only slightly slower (because of the scheduling overhead).

@Laurae2
Contributor Author

Laurae2 commented Jun 18, 2018

@huanzhang12 It seems my dynamic scheduler results are incorrect because I changed the wrong pragmas (not in the right function; I'm using commit 3f54429, so the line numbers do not match the master branch). I'll come back later today/tomorrow with updated results.

@huanzhang12
Contributor

@Laurae2 Thanks for the clarification! Look forward to the updated results 🙂

@Laurae2
Contributor Author

Laurae2 commented Jun 19, 2018

@huanzhang12 New & correct results below. Higgs with two-way interactions using multiplication (11M x 408 features).

Speed, average of 5 runs:

| Scheduler | NUMA 1T | NUMA 2T | NUMA 4T | NUMA 8T | NUMA 16T | NUMA 32T | NUMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 636.8s | 339.9s | 178.0s | 90.0s | 60.0s | 42.8s | 50.6s |
| dynamic | 665.3s | 378.9s | 205.0s | 110.0s | 76.0s | 57.0s | 63.9s |

| Scheduler | UMA 1T | UMA 2T | UMA 4T | UMA 8T | UMA 16T | UMA 32T | UMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 716.3s | 359.7s | 185.0s | 96.4s | 55.4s | 37.2s | 30.6s |
| dynamic | 738.3s | 431.6s | 233.3s | 134.4s | 83.4s | 55.9s | 55.7s |

Efficiencies vs 1 thread (1 thread is 132% efficient on my CPU because of turbo boost), average of 5 runs:

| Scheduler | NUMA 1T | NUMA 2T | NUMA 4T | NUMA 8T | NUMA 16T | NUMA 32T | NUMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 132% | 124% | 118% | 117% | 88% | 61% | 26% |
| dynamic | 132% | 116% | 107% | 100% | 72% | 48% | 21% |

| Scheduler | UMA 1T | UMA 2T | UMA 4T | UMA 8T | UMA 16T | UMA 32T | UMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 132% | 132% | 128% | 123% | 107% | 79% | 48% |
| dynamic | 132% | 113% | 105% | 91% | 73% | 54% | 27% |

@Laurae2
Contributor Author

Laurae2 commented Jun 24, 2018

It seems using the dynamic scheduler always made LightGBM slower. On smaller datasets it was up to 8x slower when using all 64 threads (probably the worst-case scenario, because the optimal number of threads was 2).

@huanzhang12 Do you have any other tentative solutions in mind?

@huanzhang12
Contributor

@Laurae2 Thanks for the detailed benchmarking! It seems the dynamic scheduler has significant overhead and cannot help in our case, so this problem cannot be easily fixed. You could do more detailed profiling using Intel VTune Amplifier to see where the slowdown comes from; that would be very helpful.

Ideally, we need to redesign the ConstructHistogram function and manually make it NUMA aware. @guolinke Maybe we can put it in our roadmap?
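
One direction for such a redesign, sketched here only as an illustration (it assumes Linux's default first-touch page placement and pinned OpenMP threads, e.g. OMP_PROC_BIND=true; it is not a proposed patch): allocate the working buffers uninitialized, let each thread fault in the slice it will later work on, and reuse the same static partitioning in every pass so accesses stay node-local.

```cpp
// Sketch of first-touch placement for a gradient-like buffer.
#include <cstdio>
#include <memory>

int main() {
  const size_t n = size_t(1) << 26;  // large enough to span many pages
  // Allocate without initializing: a std::vector would be zero-filled by the
  // constructing thread, faulting every page onto that thread's node.
  std::unique_ptr<float[]> grad(new float[n]);  // stand-in for ordered_gradients

  // Parallel first touch: each page lands on the NUMA node of the thread that
  // writes it first, so each thread's slice becomes local to its node.
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < static_cast<long>(n); ++i) {
    grad[i] = 0.0f;
  }

  // Later passes must reuse the same static partitioning so each thread keeps
  // touching the slice that is local to its node.
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < static_cast<long>(n); ++i) {
    grad[i] += 1.0f;  // dummy work standing in for histogram updates
  }
  std::printf("grad[0] = %f\n", grad[0]);
  return 0;
}
```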

@Tagar

Tagar commented May 14, 2019

Does changing sysctl kernel.numa_balancing to 1/0 on Linux make any difference to performance?
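
(For context: a minimal sketch, assuming a Linux kernel built with automatic NUMA balancing, of reading that setting, i.e. the value behind `sysctl kernel.numa_balancing`.)

```cpp
// Sketch: print the current automatic NUMA balancing setting on Linux.
#include <cstdio>
#include <fstream>

int main() {
  std::ifstream f("/proc/sys/kernel/numa_balancing");
  int value = -1;
  if (f >> value) {
    std::printf("kernel.numa_balancing = %d (1 = on, 0 = off)\n", value);
  } else {
    std::printf("automatic NUMA balancing is not available on this kernel\n");
  }
  return 0;
}
```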

@guolinke
Collaborator

@huanzhang12 sorry for the late response. Yeah, the ideal solution is to make it NUMA aware.
As a temporary solution, we can add some documentation about how to avoid it.

@guolinke
Collaborator

@Laurae2 could you help with documentation on how to pin cores when calling LightGBM?

@StrikerRUS
Collaborator

Closed in favor of being tracked in #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.

@StrikerRUS
Collaborator

New results: szilard/GBM-perf#29 (comment).
