[CUDA] New CUDA version Part 1 (#4630)
* new cuda framework

* add histogram construction kernel

* before removing multi-gpu

* new cuda framework

* tree learner cuda kernels

* single tree framework ready

* single tree training framework

* remove comments

* boosting with cuda

* optimize for best split find

* data split

* move boosting into cuda

* parallel synchronize best split point

* merge split data kernels

* before code refactor

* use tasks instead of features as units for split finding

* refactor cuda best split finder

* fix configuration error with small leaves in data split

* skip histogram construction of too small leaf

* skip split finding of invalid leaves

stop when no leaf to split

* support row wise with CUDA

* copy data for split by column

* copy data from host to GPU by column for data partition

* add synchronize best splits for one leaf from multiple blocks

* partition dense row data

* fix sync best split from task blocks

* add support for sparse row wise for CUDA

* remove useless code

* add l2 regression objective

* sparse multi value bin enabled for CUDA

* fix cuda ranking objective

* support for number of items <= 2048 per query

* speedup histogram construction by interleaving global memory access

* split optimization

* add cuda tree predictor

* remove comma

* refactor objective and score updater

* before use struct

* use structure for split information

* use structure for leaf splits

* return CUDASplitInfo directly after finding best split

* split with CUDATree directly

* use cuda row data in cuda histogram constructor

* clean src/treelearner/cuda

* gather shared cuda device functions

* put shared CUDA functions into header file

* change smaller leaf from <= back to < for consistent result with CPU

* add tree predictor

* remove useless cuda_tree_predictor

* predict on CUDA with pipeline

* add global sort algorithms

* add global argsort for queries with many items in ranking tasks

* remove limitation of maximum number of items per query in ranking

* add cuda metrics

* fix CUDA AUC

* remove debug code

* add regression metrics

* remove useless file

* don't use mask in shuffle reduce

* add more regression objectives

* fix cuda mape loss

add cuda xentropy loss

* use template for different versions of BitonicArgSortDevice

* add multiclass metrics

* add ndcg metric

* fix cross entropy objectives and metrics

* fix cross entropy and ndcg metrics

* add support for customized objective in CUDA

* complete multiclass ova for CUDA

* separate cuda tree learner

* use shuffle based prefix sum

* clean up cuda_algorithms.hpp

* add copy subset on CUDA

* add bagging for CUDA

* clean up code

* copy gradients from host to device

* support bagging without using subset

* add support of bagging with subset for CUDAColumnData

* add support of bagging with subset for dense CUDARowData

* refactor copy sparse subrow

* use copy subset for column subset

* add reset train data and reset config for CUDA tree learner

add deconstructors for cuda tree learner

* add USE_CUDA ifdef to cuda tree learner files

* check that dataset doesn't contain CUDA tree learner

* remove printf debug information

* use full new cuda tree learner only when using single GPU

* disable all CUDA code when using CPU version

* recover main.cpp

* add cpp files for multi value bins

* update LightGBM.vcxproj

* update LightGBM.vcxproj

fix lint errors

* fix lint errors

* fix lint errors

* update Makevars

fix lint errors

* fix the case with 0 feature and 0 bin

fix split finding for invalid leaves

create cuda column data when loaded from bin file

* fix lint errors

hide GetRowWiseData when cuda is not used

* recover default device type to cpu

* fix na_as_missing case

fix cuda feature meta information

* fix UpdateDataIndexToLeafIndexKernel

* create CUDA trees when needed in CUDADataPartition::UpdateTrainScore

* add refit by tree for cuda tree learner

* fix test_refit in test_engine.py

* create set of large bin partitions in CUDARowData

* add histogram construction for columns with a large number of bins

* add find best split for categorical features on CUDA

* add bitvectors for categorical split

* cuda data partition split for categorical features

* fix split tree with categorical feature

* fix categorical feature splits

* refactor cuda_data_partition.cu with multi-level templates

* refactor CUDABestSplitFinder by grouping task information into struct

* pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder

* fix misuse of reference

* remove useless changes

* add support for path smoothing

* virtual destructor for LightGBM::Tree

* fix overlapped cat threshold in best split infos

* reset histogram pointers in data partition and split finder in ResetConfig

* comment useless parameter

* fix reverse case when na is missing and default bin is zero

* fix mfb_is_na and mfb_is_zero and is_single_feature_column

* remove debug log

* fix cat_l2 when one-hot

fix gradient copy when data subset is used

* switch shared histogram size according to CUDA version

* gpu_use_dp=true when cuda test

* revert modification in config.h

* fix setting of gpu_use_dp=true in .ci/test.sh

* fix linter errors

* fix linter error

remove useless change

* recover main.cpp

* separate cuda_exp and cuda

* fix ci bash scripts

add description for cuda_exp

* add USE_CUDA_EXP flag

* switch off USE_CUDA_EXP

* revert changes in python-packages

* more careful separation for USE_CUDA_EXP

* fix CUDARowData::DivideCUDAFeatureGroups

fix set fields for cuda metadata

* revert config.h

* fix test settings for cuda experimental version

* skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version

* fix lint issue by adding a blank line

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* merge cuda.yml and cuda_exp.yml

* update python version in cuda.yml

* remove cuda_exp.yml

* remove unrelated changes

* fix compilation warnings

fix cuda exp ci task name

* recover task

* use multi-level template in histogram construction

check split only in debug mode

* ignore NVCC related lines in parameter_generator.py

* update job name for CUDA tests

* apply review suggestions

* Update .github/workflows/cuda.yml

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update .github/workflows/cuda.yml

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update header

* remove useless TODOs

* remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062

* #include <LightGBM/utils/log.h> for USE_CUDA_EXP only

* fix include order

* fix include order

* remove extra space

* address review comments

* add warning when cuda_exp is used together with deterministic

* add comment about gpu_use_dp in .ci/test.sh

* revert changing order of included headers

Co-authored-by: Yu Shi <shiyu1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
3 people committed Mar 23, 2022
1 parent b857ee1 commit 6b56a90
Showing 75 changed files with 10,276 additions and 59 deletions.
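
Much of the commit log above concerns CUDA histogram construction ("add histogram construction kernel", "speedup histogram construction by interleaving global memory access"). Below is a minimal sketch of the general technique: each block accumulates gradients into a private shared-memory histogram and then folds it into the global result with atomics. It assumes an 8-bit bin layout and illustrative names; it is not the PR's actual kernel.

#include <cstdint>
#include <cstdio>

// Sketch: per-block shared-memory histogram, merged into global memory.
// The bin width (8-bit), kernel name, and launch shape are illustrative
// assumptions, not taken from the PR.
__global__ void HistogramKernel(const uint8_t* bins, const float* gradients,
                                int num_data, int num_bins, float* out_hist) {
  extern __shared__ float shared_hist[];
  // zero this block's private histogram
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) {
    shared_hist[i] = 0.0f;
  }
  __syncthreads();
  // grid-stride loop: consecutive threads read consecutive rows, so the
  // global memory accesses are coalesced (interleaved across the warp)
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_data;
       i += gridDim.x * blockDim.x) {
    atomicAdd(&shared_hist[bins[i]], gradients[i]);
  }
  __syncthreads();
  // fold the block-private histogram into the global result
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) {
    atomicAdd(&out_hist[i], shared_hist[i]);
  }
}

int main() {
  const int kNumData = 1 << 16;
  const int kNumBins = 256;
  uint8_t* bins;
  float* gradients;
  float* hist;
  cudaMallocManaged(&bins, kNumData);
  cudaMallocManaged(&gradients, kNumData * sizeof(float));
  cudaMallocManaged(&hist, kNumBins * sizeof(float));
  for (int i = 0; i < kNumData; ++i) {
    bins[i] = static_cast<uint8_t>(i & 255);
    gradients[i] = 1.0f;
  }
  for (int i = 0; i < kNumBins; ++i) hist[i] = 0.0f;
  HistogramKernel<<<64, 256, kNumBins * sizeof(float)>>>(
      bins, gradients, kNumData, kNumBins, hist);
  cudaDeviceSynchronize();
  printf("hist[0] = %.0f (expect %d)\n", hist[0], kNumData / kNumBins);
  return 0;
}

The PR's kernels also accumulate hessians and support multiple bin widths (see the bin.h changes below); the sketch omits those details.
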
2 changes: 1 addition & 1 deletion .ci/setup.sh
@@ -80,7 +80,7 @@ else  # Linux
         mv $AMDAPPSDK_PATH/lib/x86_64/sdk/* $AMDAPPSDK_PATH/lib/x86_64/
         echo libamdocl64.so > $OPENCL_VENDOR_PATH/amdocl64.icd
     fi
-    if [[ $TASK == "cuda" ]]; then
+    if [[ $TASK == "cuda" || $TASK == "cuda_exp" ]]; then
         echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
         apt-get update
         apt-get install --no-install-recommends -y \
32 changes: 26 additions & 6 deletions .ci/test.sh
@@ -190,21 +190,41 @@ if [[ $TASK == "gpu" ]]; then
     elif [[ $METHOD == "source" ]]; then
         cmake -DUSE_GPU=ON -DOpenCL_INCLUDE_DIR=$AMDAPPSDK_PATH/include/ ..
     fi
-elif [[ $TASK == "cuda" ]]; then
-    sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' $BUILD_DIRECTORY/include/LightGBM/config.h
-    grep -q 'std::string device_type = "cuda"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1  # make sure that changes were really done
+elif [[ $TASK == "cuda" || $TASK == "cuda_exp" ]]; then
+    if [[ $TASK == "cuda" ]]; then
+        sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' $BUILD_DIRECTORY/include/LightGBM/config.h
+        grep -q 'std::string device_type = "cuda"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1  # make sure that changes were really done
+    else
+        sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda_exp";/' $BUILD_DIRECTORY/include/LightGBM/config.h
+        grep -q 'std::string device_type = "cuda_exp"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1  # make sure that changes were really done
+        # by default ``gpu_use_dp=false`` for efficiency. change to ``true`` here for exact results in ci tests
+        sed -i'.bak' 's/gpu_use_dp = false;/gpu_use_dp = true;/' $BUILD_DIRECTORY/include/LightGBM/config.h
+        grep -q 'gpu_use_dp = true' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1  # make sure that changes were really done
+    fi
     if [[ $METHOD == "pip" ]]; then
         cd $BUILD_DIRECTORY/python-package && python setup.py sdist || exit -1
-        pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda || exit -1
+        if [[ $TASK == "cuda" ]]; then
+            pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda || exit -1
+        else
+            pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda-exp || exit -1
+        fi
         pytest $BUILD_DIRECTORY/tests/python_package_test || exit -1
         exit 0
     elif [[ $METHOD == "wheel" ]]; then
-        cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda || exit -1
+        if [[ $TASK == "cuda" ]]; then
+            cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda || exit -1
+        else
+            cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda-exp || exit -1
+        fi
         pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER*.whl -v || exit -1
         pytest $BUILD_DIRECTORY/tests || exit -1
         exit 0
     elif [[ $METHOD == "source" ]]; then
-        cmake -DUSE_CUDA=ON ..
+        if [[ $TASK == "cuda" ]]; then
+            cmake -DUSE_CUDA=ON ..
+        else
+            cmake -DUSE_CUDA_EXP=ON ..
+        fi
     fi
 elif [[ $TASK == "mpi" ]]; then
     if [[ $METHOD == "pip" ]]; then
15 changes: 14 additions & 1 deletion .github/workflows/cuda.yml
@@ -16,7 +16,7 @@ env:
 
 jobs:
   test:
-    name: cuda ${{ matrix.cuda_version }} ${{ matrix.method }} (linux, ${{ matrix.compiler }}, Python ${{ matrix.python_version }})
+    name: ${{ matrix.tree_learner }} ${{ matrix.cuda_version }} ${{ matrix.method }} (linux, ${{ matrix.compiler }}, Python ${{ matrix.python_version }})
     runs-on: [self-hosted, linux]
     timeout-minutes: 60
     strategy:
@@ -27,14 +27,27 @@ jobs:
           compiler: gcc
           python_version: "3.8"
           cuda_version: "11.5.1"
+          tree_learner: cuda
         - method: pip
           compiler: clang
           python_version: "3.9"
           cuda_version: "10.0"
+          tree_learner: cuda
         - method: wheel
           compiler: gcc
           python_version: "3.10"
           cuda_version: "9.0"
+          tree_learner: cuda
+        - method: source
+          compiler: gcc
+          python_version: "3.8"
+          cuda_version: "11.5.1"
+          tree_learner: cuda_exp
+        - method: pip
+          compiler: clang
+          python_version: "3.9"
+          cuda_version: "10.0"
+          tree_learner: cuda_exp
     steps:
       - name: Setup or update software on host machine
         run: |
34 changes: 27 additions & 7 deletions CMakeLists.txt
@@ -5,6 +5,7 @@ option(USE_SWIG "Enable SWIG to generate Java API" OFF)
 option(USE_HDFS "Enable HDFS support (EXPERIMENTAL)" OFF)
 option(USE_TIMETAG "Set to ON to output time costs" OFF)
 option(USE_CUDA "Enable CUDA-accelerated training (EXPERIMENTAL)" OFF)
+option(USE_CUDA_EXP "Enable CUDA-accelerated training with more acceleration (EXPERIMENTAL)" OFF)
 option(USE_DEBUG "Set to ON for Debug mode" OFF)
 option(USE_SANITIZER "Use santizer flags" OFF)
 set(
@@ -28,7 +29,7 @@ if(__INTEGRATE_OPENCL)
   cmake_minimum_required(VERSION 3.11)
 elseif(USE_GPU OR APPLE)
   cmake_minimum_required(VERSION 3.2)
-elseif(USE_CUDA)
+elseif(USE_CUDA OR USE_CUDA_EXP)
   cmake_minimum_required(VERSION 3.16)
 else()
   cmake_minimum_required(VERSION 3.0)
@@ -133,7 +134,7 @@ else()
   add_definitions(-DUSE_SOCKET)
 endif()
 
-if(USE_CUDA)
+if(USE_CUDA OR USE_CUDA_EXP)
   set(CMAKE_CUDA_HOST_COMPILER "${CMAKE_CXX_COMPILER}")
   enable_language(CUDA)
   set(USE_OPENMP ON CACHE BOOL "CUDA requires OpenMP" FORCE)
@@ -171,8 +172,12 @@ if(__INTEGRATE_OPENCL)
   endif()
 endif()
 
-if(USE_CUDA)
-  find_package(CUDA 9.0 REQUIRED)
+if(USE_CUDA OR USE_CUDA_EXP)
+  if(USE_CUDA)
+    find_package(CUDA 9.0 REQUIRED)
+  else()
+    find_package(CUDA 10.0 REQUIRED)
+  endif()
   include_directories(${CUDA_INCLUDE_DIRS})
   set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler=${OpenMP_CXX_FLAGS} -Xcompiler=-fPIC -Xcompiler=-Wall")
 
@@ -199,7 +204,12 @@ if(USE_CUDA)
   endif()
   message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
 
-  add_definitions(-DUSE_CUDA)
+  if(USE_CUDA)
+    add_definitions(-DUSE_CUDA)
+  elseif(USE_CUDA_EXP)
+    add_definitions(-DUSE_CUDA_EXP)
+  endif()
+
   if(NOT DEFINED CMAKE_CUDA_STANDARD)
     set(CMAKE_CUDA_STANDARD 11)
     set(CMAKE_CUDA_STANDARD_REQUIRED ON)
@@ -369,9 +379,17 @@ file(
     src/objective/*.cpp
     src/network/*.cpp
     src/treelearner/*.cpp
-  if(USE_CUDA)
+  if(USE_CUDA OR USE_CUDA_EXP)
     src/treelearner/*.cu
   endif()
+  if(USE_CUDA_EXP)
+    src/treelearner/cuda/*.cpp
+    src/treelearner/cuda/*.cu
+    src/io/cuda/*.cu
+    src/io/cuda/*.cpp
+    src/cuda/*.cpp
+    src/cuda/*.cu
+  endif()
 )
 
 add_library(lightgbm_objs OBJECT ${SOURCES})
@@ -493,14 +511,16 @@ if(__INTEGRATE_OPENCL)
   target_link_libraries(lightgbm_objs PUBLIC ${INTEGRATED_OPENCL_LIBRARIES})
 endif()
 
-if(USE_CUDA)
+if(USE_CUDA OR USE_CUDA_EXP)
   # Disable cmake warning about policy CMP0104. Refer to issue #3754 and PR #4268.
   # Custom target properties does not propagate, thus we need to specify for
   # each target that contains or depends on cuda source.
   set_target_properties(lightgbm_objs PROPERTIES CUDA_ARCHITECTURES OFF)
   set_target_properties(_lightgbm PROPERTIES CUDA_ARCHITECTURES OFF)
   set_target_properties(lightgbm PROPERTIES CUDA_ARCHITECTURES OFF)
 
+  set_target_properties(lightgbm_objs PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
+
   # Device linking is not supported for object libraries.
   # Thus we have to specify them on final targets.
   set_target_properties(lightgbm PROPERTIES CUDA_RESOLVE_DEVICE_SYMBOLS ON)
4 changes: 4 additions & 0 deletions R-package/src/Makevars.in
@@ -37,6 +37,10 @@ OBJECTS = \
     io/parser.o \
     io/train_share_states.o \
     io/tree.o \
+    io/dense_bin.o \
+    io/sparse_bin.o \
+    io/multi_val_dense_bin.o \
+    io/multi_val_sparse_bin.o \
     metric/dcg_calculator.o \
     metric/metric.o \
     objective/objective_function.o \
4 changes: 4 additions & 0 deletions R-package/src/Makevars.win.in
@@ -38,6 +38,10 @@ OBJECTS = \
     io/parser.o \
     io/train_share_states.o \
     io/tree.o \
+    io/dense_bin.o \
+    io/sparse_bin.o \
+    io/multi_val_dense_bin.o \
+    io/multi_val_sparse_bin.o \
     metric/dcg_calculator.o \
     metric/metric.o \
     objective/objective_function.o \
2 changes: 2 additions & 0 deletions docs/Installation-Guide.rst
@@ -636,6 +636,8 @@ To build LightGBM CUDA version, run the following commands:
   cmake -DUSE_CUDA=1 ..
   make -j4
 
+Recently, a new CUDA version with better efficiency is implemented as an experimental feature. To build the new CUDA version, replace ``-DUSE_CUDA`` with ``-DUSE_CUDA_EXP`` in the above commands. Please note that new version requires **CUDA** 10.0 or later libraries.
+
 **Note**: glibc >= 2.14 is required.
 
 **Note**: In some rare cases you may need to install OpenMP runtime library separately (use your package manager and search for ``lib[g|i]omp`` for doing this).
6 changes: 5 additions & 1 deletion docs/Parameters.rst
@@ -199,7 +199,7 @@ Core Parameters
 
    -  **Note**: please **don't** change this during training, especially when running multiple jobs simultaneously by external packages, otherwise it may cause undesirable errors
 
--  ``device_type`` :raw-html:`<a id="device_type" title="Permalink to this parameter" href="#device_type">&#x1F517;&#xFE0E;</a>`, default = ``cpu``, type = enum, options: ``cpu``, ``gpu``, ``cuda``, aliases: ``device``
+-  ``device_type`` :raw-html:`<a id="device_type" title="Permalink to this parameter" href="#device_type">&#x1F517;&#xFE0E;</a>`, default = ``cpu``, type = enum, options: ``cpu``, ``gpu``, ``cuda``, ``cuda_exp``, aliases: ``device``
 
    -  device for the tree learning, you can use GPU to achieve the faster learning
 
@@ -209,6 +209,10 @@ Core Parameters
 
    -  **Note**: refer to `Installation Guide <./Installation-Guide.rst#build-gpu-version>`__ to build LightGBM with GPU support
 
+   -  **Note**: ``cuda_exp`` is an experimental CUDA version, the installation guide for ``cuda_exp`` is identical with ``cuda``
+
+   -  **Note**: ``cuda_exp`` is faster than ``cuda`` and will replace ``cuda`` in the future
+
 -  ``seed`` :raw-html:`<a id="seed" title="Permalink to this parameter" href="#seed">&#x1F517;&#xFE0E;</a>`, default = ``None``, type = int, aliases: ``random_seed``, ``random_state``
 
    -  this seed is used to generate other seeds, e.g. ``data_random_seed``, ``feature_fraction_seed``, etc.
2 changes: 2 additions & 0 deletions helpers/parameter_generator.py
@@ -34,6 +34,8 @@ def get_parameter_infos(
     member_infos: List[List[Dict[str, List]]] = []
     with open(config_hpp) as config_hpp_file:
         for line in config_hpp_file:
+            if line.strip() in {"#ifndef __NVCC__", "#endif // __NVCC__"}:
+                continue
             if "#pragma region Parameters" in line:
                 is_inparameter = True
             elif "#pragma region" in line and "Parameters" in line:
29 changes: 29 additions & 0 deletions include/LightGBM/bin.h
@@ -119,6 +119,23 @@ class BinMapper {
     }
   }
 
+  /*!
+   * \brief Maximum categorical value
+   * \return Maximum categorical value for categorical features, 0 for numerical features
+   */
+  inline int MaxCatValue() const {
+    if (bin_2_categorical_.size() == 0) {
+      return 0;
+    }
+    int max_cat_value = bin_2_categorical_[0];
+    for (size_t i = 1; i < bin_2_categorical_.size(); ++i) {
+      if (bin_2_categorical_[i] > max_cat_value) {
+        max_cat_value = bin_2_categorical_[i];
+      }
+    }
+    return max_cat_value;
+  }
+
   /*!
    * \brief Get sizes in byte of this object
    */
@@ -379,6 +396,10 @@ class Bin {
    * \brief Deep copy the bin
    */
   virtual Bin* Clone() = 0;
+
+  virtual const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, std::vector<BinIterator*>* bin_iterator, const int num_threads) const = 0;
+
+  virtual const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, BinIterator** bin_iterator) const = 0;
 };
 
 
@@ -452,6 +473,14 @@ class MultiValBin {
   static constexpr double multi_val_bin_sparse_threshold = 0.25f;
 
   virtual MultiValBin* Clone() = 0;
+
+#ifdef USE_CUDA_EXP
+  virtual const void* GetRowWiseData(uint8_t* bit_type,
+                                     size_t* total_size,
+                                     bool* is_sparse,
+                                     const void** out_data_ptr,
+                                     uint8_t* data_ptr_bit_type) const = 0;
+#endif  // USE_CUDA_EXP
 };
 
 inline uint32_t BinMapper::ValueToBin(double value) const {
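
The new GetColWiseData and GetRowWiseData hooks return a type-erased pointer plus a bit_type tag giving the element width, which the CUDA side can dispatch on before copying data to the device. Below is a minimal sketch of such a dispatch, assuming bit_type takes the values 8, 16, and 32; the CopyColumnToDevice helper is hypothetical and not part of this PR.

#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper (not from the PR): copy one host column returned as
// (void*, bit_type) to device memory, dispatching on the element width.
cudaError_t CopyColumnToDevice(const void* host_data, uint8_t bit_type,
                               size_t num_rows, void* device_ptr) {
  size_t elem_size;
  switch (bit_type) {
    case 8:  elem_size = sizeof(uint8_t);  break;
    case 16: elem_size = sizeof(uint16_t); break;
    case 32: elem_size = sizeof(uint32_t); break;
    default: return cudaErrorInvalidValue;  // width not handled in this sketch
  }
  return cudaMemcpy(device_ptr, host_data, num_rows * elem_size,
                    cudaMemcpyHostToDevice);
}

int main() {
  const size_t kRows = 1024;
  // pretend GetColWiseData handed us a 16-bit dense column
  std::vector<uint16_t> host_col(kRows, 1);
  void* device_ptr = nullptr;
  cudaMalloc(&device_ptr, kRows * sizeof(uint16_t));
  cudaError_t status =
      CopyColumnToDevice(host_col.data(), 16, kRows, device_ptr);
  printf("copy status: %d\n", static_cast<int>(status));
  cudaFree(device_ptr);
  return 0;
}

Dispatching on a width tag rather than templating the host interface keeps the virtual Bin API type-erased while still letting the device code issue correctly sized copies.
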
Expand Down
