
LightGBM does not like NUMA (large performance impact on servers) #1441

Closed
Laurae2 opened this issue Jun 10, 2018 · 21 comments

Comments

@Laurae2
Contributor

Laurae2 commented Jun 10, 2018

Environment info

  • Operating System: Windows 8.1 / 10, Windows Server 2012 / 2016, WSL Ubuntu 16.04, Clear Linux (latest)
  • CPU: Quad Intel Xeon 6154, 768GB RAM (72 physical cores, all 3.7 GHz, all RAM banks populated evenly)
  • C++/Python/R version: R 3.5 / Python 3.6, LightGBM commit 96e7016

By default, servers ship with NUMA enabled. However, with UMA (Node Interleaving), LightGBM performance increases significantly (sometimes by more than 80%). On dual-processor systems, the impact can exceed 30%.

This issue is not limited to R; it also affects Python and the CLI.

I also tested the impact on an 8x Xeon Platinum 8180 (768GB RAM, all RAM banks populated evenly): it was 200%+ slower with Node Interleaving off (NUMA on).

xgboost is also affected by this issue, but less severely than LightGBM.

Related issues:

Untested:

  • Sub-NUMA Clustering (NUMA inside a processor) was not tested and would probably result in even worse performance.
  • numactl was not used because, first, it is not available on Windows (Windows pins data to the correct RAM banks only if affinity is specified), and second, it does not make sense when using all available threads.

Reproducible examples

I used a private dataset (takes 250GB+ RAM), but the issue is easily reproducible using HIGGS dataset: https://archive.ics.uci.edu/ml/datasets/HIGGS

See the results on a dual Xeon Gold 6130 and 384GB RAM here: https://public.tableau.com/views/NodeInterleavingv1/Dashboard?:bootstrapWhenNotified=true&:display_count=y&:display_static_image=y&:embed=y&:showVizHome=no&publish=yes

Steps to reproduce

  1. Get on a multiprocessor server with full access to the BIOS (at least a dual Xeon, ideally with a lot of threads)
  2. In BIOS, turn off Node Interleaving (if enabled)
  3. Get the training time on a large dataset using all threads
  4. In BIOS, turn on Node Interleaving
  5. Get the training time on the same dataset using all threads
  6. Compare the timings from steps 3 and 5 (NUMA vs. UMA)

Code from #542 could be used through the CLI.

@guolinke
Collaborator

guolinke commented Jun 11, 2018

@Laurae2
maybe using parallel learning is better on a NUMA cluster?

BTW, the multi-threading is based on OpenMP. Thus, maybe we need additional settings to improve OpenMP's performance on NUMA:

https://stackoverflow.com/questions/11959906/openmp-and-numa-relation

http://prace.it4i.cz/sites/prace.it4i.cz/files/files/advancedopenmptutorial_2.pdf
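
For reference, here is a minimal sketch (not LightGBM code; it assumes an OpenMP 4.5 runtime) of how to check what the runtime actually does with thread placement. `OMP_PROC_BIND` and `OMP_PLACES` are the standard OpenMP environment variables for controlling binding:

```cpp
// Sketch: report the OpenMP proc-bind policy and the place each thread is
// bound to. Build with `g++ -fopenmp check_binding.cpp` and run with e.g.
// OMP_PROC_BIND=close OMP_PLACES=cores to compare placements.
#include <cstdio>
#include <omp.h>

int main() {
  std::printf("proc_bind policy: %d, number of places: %d\n",
              static_cast<int>(omp_get_proc_bind()), omp_get_num_places());
  #pragma omp parallel
  {
    // Each thread reports which place (core/socket, per OMP_PLACES) it runs on.
    std::printf("thread %d -> place %d\n",
                omp_get_thread_num(), omp_get_place_num());
  }
  return 0;
}
```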

@huanzhang12
Contributor

huanzhang12 commented Jun 11, 2018

I also observed a similar phenomenon before.
When there are multiple NUMA nodes, the memory holding the feature data might be allocated to RAM on a single node, and CPUs on the other nodes may have very slow access to the data:
https://github.com/Microsoft/LightGBM/blob/master/src/io/dense_bin.hpp#L72

Ideally, data_indices, ordered_gradients, ordered_hessians and out should be allocated in RAM that is local to the CPU core (within the same node). Array data_ needs random access, so we cannot split it easily, but we can duplicate it on all NUMA nodes since it is read-only. However, these optimizations do not seem straightforward to implement.
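
To make the duplication idea concrete, here is a rough sketch (an illustration only, not LightGBM code; it assumes Linux with libnuma, linked with `-lnuma`) of replicating a read-only buffer onto every NUMA node so that each socket reads a local copy:

```cpp
// Sketch: replicate a read-only buffer (a stand-in for data_) onto every NUMA
// node with libnuma; each thread would then read the copy local to its node.
#include <cstdio>
#include <cstring>
#include <vector>
#include <numa.h>

int main() {
  if (numa_available() < 0) {
    std::printf("libnuma is not available on this system\n");
    return 1;
  }
  const size_t n = size_t(1) << 24;       // 16 MB stand-in for the feature data
  std::vector<unsigned char> data(n, 1);  // the original, read-only copy

  const int nodes = numa_num_configured_nodes();
  std::vector<unsigned char*> replicas(nodes, nullptr);
  for (int node = 0; node < nodes; ++node) {
    // Allocate the replica from memory that is local to `node`, then copy.
    replicas[node] = static_cast<unsigned char*>(numa_alloc_onnode(n, node));
    if (replicas[node] == nullptr) {
      std::printf("allocation on node %d failed\n", node);
      return 1;
    }
    std::memcpy(replicas[node], data.data(), n);
  }
  std::printf("replicated %zu bytes onto %d node(s)\n", n, nodes);

  for (int node = 0; node < nodes; ++node) numa_free(replicas[node], n);
  return 0;
}
```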

BTW, a few years ago I wrote a paper on optimizing SGD on NUMA machines which uses similar techniques: https://ieeexplore.ieee.org/document/7837887/ (HogWild++: A New Mechanism for Decentralized Asynchronous Stochastic Gradient Descent)

Because we use static work scheduling, the slowest CPU will determine the running time. This will make our situation even worse.
https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L429
(edit: should be https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset.cpp#L681)

We can first try dynamic OpenMP scheduling (change schedule(static, X) to schedule(dynamic)) and see if the issue is alleviated, but it might slow down training on single-socket systems, as dynamic scheduling has more overhead. Making LightGBM fully NUMA-aware will need more effort.
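
As an illustration only (this is not LightGBM's actual histogram code), the suggested experiment boils down to swapping the schedule clause on a per-feature loop whose iterations have uneven cost:

```cpp
// Sketch: schedule(static) gives each thread a fixed block of features up
// front; schedule(dynamic) lets idle threads grab the next feature, which can
// hide a slow (remote-memory) thread but adds scheduling overhead.
#include <cstdio>
#include <vector>

int main() {
  const int num_features = 408;  // e.g. Higgs with 2-way interactions
  const int num_bins = 255;
  std::vector<double> hist(static_cast<size_t>(num_features) * num_bins, 0.0);

  // Baseline: #pragma omp parallel for schedule(static)
  #pragma omp parallel for schedule(dynamic)
  for (int f = 0; f < num_features; ++f) {
    for (int b = 0; b < num_bins; ++b) {
      hist[static_cast<size_t>(f) * num_bins + b] += f + b;  // dummy work
    }
  }
  std::printf("done, hist[0] = %f\n", hist[0]);
  return 0;
}
```

Whether the reduced load imbalance outweighs the extra scheduling overhead is exactly what the benchmarks below measure.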

@guolinke
Collaborator

Thanks @huanzhang12 so much. That is very helpful.

@guolinke
Collaborator

@Laurae2 any chance to try @huanzhang12's suggestions?

@Laurae2
Contributor Author

Laurae2 commented Jun 16, 2018

@guolinke Is this the only line to change?

https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L429

to:

#pragma omp parallel for schedule(dynamic,1024) if (num_features_ >= 2048)

Or should I change all static to dynamic?

@huanzhang12
Contributor

@Laurae2 I guess there are many others, but this one should be the most important one if the data is large and dense (like Higgs). You can try it first.

@huanzhang12
Contributor

huanzhang12 commented Jun 16, 2018

@Laurae2 Oh sorry I actually misread the code. The major computation loop is actually here:
https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset.cpp#L681
You can try to change this one to dynamic; it should have some effect.

(there are actually some different cases here, so you probably need to change all #pragma omp in this function)

@Laurae2
Contributor Author

Laurae2 commented Jun 17, 2018

@guolinke @huanzhang12 Here are some results with different schedulers on the function you quoted (I changed all static schedulers in ConstructHistograms), using a dual Xeon Gold 6130 (3.7 GHz single thread, 2.8 GHz all cores).

Timings, average of 5 runs:

| Scheduling | NUMA | UMA |
| --- | --- | --- |
| static | 100% avg time (baseline) | 100% avg time (baseline) |
| dynamic | 113% avg time (13% slower) | 111% avg time (11% slower) |

Efficiencies vs 1 thread (1 thread is 132% efficient on my CPU because of turbo boost), average of 5 runs:

| Scheduler | NUMA 1T | NUMA 2T | NUMA 4T | NUMA 8T | NUMA 16T | NUMA 32T | NUMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 132% | 121% | 116% | 109% | 82% | 55% | 23% |
| dynamic | 132% | 121% | 121% | 119% | 93% | 58% | 30% |

| Scheduler | UMA 1T | UMA 2T | UMA 4T | UMA 8T | UMA 16T | UMA 32T | UMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 132% | 130% | 127% | 120% | 107% | 76% | 46% |
| dynamic | 132% | 125% | 121% | 112% | 100% | 77% | 44% |

This is the CPU behavior observed during training:

| Scheduling | NUMA | UMA |
| --- | --- | --- |
| static | 1 CPU = 1-32 threads (autopinned). 2 CPU = 33-64 threads (autopinned). NUMA optimized on CPU by the kernel scheduler. | 2 CPU = 1-64 threads (bouncing but pinned on the same bouncing threads). Always uses both CPUs at any number of threads. |
| dynamic | 1 CPU = 1-32 threads (autopinned). 2 CPU = 33-64 threads (autopinned). NUMA optimized on CPU by the kernel scheduler. | 2 CPU = 1-64 threads (bouncing but pinned on the same bouncing threads). Always uses both CPUs at any number of threads. |

@huanzhang12
Contributor

@Laurae2 So it seems dynamic scheduling does not help. However, the parallel section in ConstructHistogram does have quite a noticeable performance impact. What dataset did you use?

@Laurae2
Contributor Author

Laurae2 commented Jun 17, 2018

@huanzhang12 Higgs (11M observations, https://archive.ics.uci.edu/ml/datasets/HIGGS), with 2-way interactions (multiplication) of all features.

It makes a total of 11M observations x 408 features.

@huanzhang12
Contributor

@Laurae2 Thanks for the clarification!
In the NUMA setting, it seems the dynamic scheduler always has better efficiency than static. Why does it have a longer runtime in the first table?
In the UMA setting, the efficiency numbers look reasonable; dynamic scheduling is only slightly slower (because of the scheduling overhead).

@Laurae2
Contributor Author

Laurae2 commented Jun 18, 2018

@huanzhang12 It seems my dynamic scheduler results are incorrect because I changed the wrong pragmas (not in the right function; I'm using commit 3f54429, so the line numbers do not match the master branch). I'll come back later today/tomorrow with updated results.

@huanzhang12
Contributor

@Laurae2 Thanks for the clarification! Look forward to the updated results 🙂

@Laurae2
Contributor Author

Laurae2 commented Jun 19, 2018

@huanzhang12 New & correct results below. Higgs with two-way interactions using multiplication (11M x 408 features).

Speed, average of 5 runs:

| Scheduler | NUMA 1T | NUMA 2T | NUMA 4T | NUMA 8T | NUMA 16T | NUMA 32T | NUMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 636.8s | 339.9s | 178.0s | 90.0s | 60.0s | 42.8s | 50.6s |
| dynamic | 665.3s | 378.9s | 205.0s | 110.0s | 76.0s | 57.0s | 63.9s |

| Scheduler | UMA 1T | UMA 2T | UMA 4T | UMA 8T | UMA 16T | UMA 32T | UMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 716.3s | 359.7s | 185.0s | 96.4s | 55.4s | 37.2s | 30.6s |
| dynamic | 738.3s | 431.6s | 233.3s | 134.4s | 83.4s | 55.9s | 55.7s |

Efficiencies vs 1 thread (1 thread is 132% efficient on my CPU because of turbo boost), average of 5 runs:

| Scheduler | NUMA 1T | NUMA 2T | NUMA 4T | NUMA 8T | NUMA 16T | NUMA 32T | NUMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 132% | 124% | 118% | 117% | 88% | 61% | 26% |
| dynamic | 132% | 116% | 107% | 100% | 72% | 48% | 21% |

| Scheduler | UMA 1T | UMA 2T | UMA 4T | UMA 8T | UMA 16T | UMA 32T | UMA 64T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| static | 132% | 132% | 128% | 123% | 107% | 79% | 48% |
| dynamic | 132% | 113% | 105% | 91% | 73% | 54% | 27% |

@Laurae2
Contributor Author

Laurae2 commented Jun 24, 2018

It seems using the dynamic scheduler always made LightGBM slower. On smaller datasets it was up to 8x slower when using all 64 threads (probably the worst-case scenario, because the optimal number of threads was 2).

@huanzhang12 Do you have any other tentative solutions in mind?

@huanzhang12
Contributor

@Laurae2 Thanks for the detailed benchmarking! It seems the dynamic scheduler has significant overhead and cannot help in our case, so this problem cannot be easily fixed. You could do more detailed profiling using Intel VTune Amplifier to see where the slowdown comes from; that would be very helpful.

Ideally, we need to redesign the ConstructHistogram function and manually make it NUMA aware. @guolinke Maybe we can put it in our roadmap?
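
One direction for such a redesign, sketched here only as an illustration (it assumes Linux's default first-touch page placement and pinned OpenMP threads, e.g. OMP_PROC_BIND=true; it is not a proposed patch): allocate the working buffers uninitialized, let each thread fault in the slice it will later work on, and reuse the same static partitioning in every pass so accesses stay node-local.

```cpp
// Sketch of first-touch placement for a gradient-like buffer.
#include <cstdio>
#include <memory>

int main() {
  const size_t n = size_t(1) << 26;  // large enough to span many pages
  // Allocate without initializing: a std::vector would be zero-filled by the
  // constructing thread, faulting every page onto that thread's node.
  std::unique_ptr<float[]> grad(new float[n]);  // stand-in for ordered_gradients

  // Parallel first touch: each page lands on the NUMA node of the thread that
  // writes it first, so each thread's slice becomes local to its node.
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < static_cast<long>(n); ++i) {
    grad[i] = 0.0f;
  }

  // Later passes must reuse the same static partitioning so each thread keeps
  // touching the slice that is local to its node.
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < static_cast<long>(n); ++i) {
    grad[i] += 1.0f;  // dummy work standing in for histogram updates
  }
  std::printf("grad[0] = %f\n", grad[0]);
  return 0;
}
```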

@Tagar

Tagar commented May 14, 2019

Does changing sysctl kernel.numa_balancing to 1/0 on Linux make any difference to performance?
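
(For context: a minimal sketch, assuming a Linux kernel built with automatic NUMA balancing, of reading that setting, i.e. the value behind `sysctl kernel.numa_balancing`.)

```cpp
// Sketch: print the current automatic NUMA balancing setting on Linux.
#include <cstdio>
#include <fstream>

int main() {
  std::ifstream f("/proc/sys/kernel/numa_balancing");
  int value = -1;
  if (f >> value) {
    std::printf("kernel.numa_balancing = %d (1 = on, 0 = off)\n", value);
  } else {
    std::printf("automatic NUMA balancing is not available on this kernel\n");
  }
  return 0;
}
```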

@guolinke
Collaborator

@huanzhang12 sorry for the late response. Yeah, the ideal solution is to make it NUMA aware.
As a temporary solution, we can add some documentation about how to avoid it.

@guolinke
Collaborator

@Laurae2 could you help with documentation on how to pin cores when calling LightGBM?

@StrikerRUS
Collaborator

Closed in favor of being tracked in #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.

@StrikerRUS
Collaborator

New results: szilard/GBM-perf#29 (comment).
