[R] LightGBM uses a lot of RAM when #feature and #leaves are both large #562
Comments
@Laurae2 try the histogram_pool_size parameter.
@Laurae2 I think the memory consumption here comes from the histogram cache. It needs about num_leaves * 20 bytes * num_features * num_bins. You can also try reducing num_leaves and see how much it costs.
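As a rough sanity check of that formula (a sketch; the 20 bytes-per-bin constant is taken from the comment above, and the sizes are the ones used later in this thread):

# histogram cache estimate: num_leaves * 20 bytes * num_features * num_bins
num_leaves   <- 1023
num_features <- 23636
num_bins     <- 255
est_bytes <- num_leaves * 20 * num_features * num_bins
est_bytes / 1024^3  # ~115 GiB, the same order as the 164-173GB observed below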
binary dataset: 1,499,944 KB
@Laurae2 Did you run it via the R package? Can you try it with the CLI version (you can use the binary dataset)?
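For reference, a minimal CLI run could look like the following (a sketch: the config keys are standard LightGBM CLI parameters, but the file names assume the binary datasets saved later in this thread):

# train.conf
task = train
objective = binary
data = reput_train_lgb.data
valid = reput_test_lgb.data
num_leaves = 1023
max_bin = 255
metric = auc

Then run it with: lightgbm config=train.conf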
@Laurae2 Can you give me a dataset to reproduce this?
@guolinke Do you have an email address where I can send you a link to the file? |
@guolinke I think 10GB is caused by the data in memory, but I am not sure what the rest is. Do you know if it is possible to do this in R?
xgboost also has this issue of exponential RAM increase when the maximum number of leaves increases, but it grows more slowly.
@Laurae2 I ran it on Visual Studio and MinGW with the same parameters. As a result, I think there is no reason for it to cost 54GB of memory... Another thing: the binary dataset you built seems to be from the v2.0 version, not the master version.
I'll recompile from the master branch and come back in 2 days when the server is free (no idea why it keeps using the old LightGBM; I think the DLL didn't get freed, as I encounter this issue a lot, so I'm going to wipe the folder instead of trusting the reinstall from now on).
Approximately 8GB if I remember correctly (+2GB for Windows).
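(For reference, a quick way to check an object's in-memory size from R; object.size is base R, and data is assumed to be the sparse matrix from the scripts below.)

format(object.size(data), units = "Gb")  # in-memory size of the sparse matrix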
@guolinke Check here: https://github.com/Laurae2/gbt_benchmarks/tree/master/zzz_new. It is the "reput" dataset. You can run the Creator scripts (there are 3 of them bundled together) manually if you want to recreate the dataset, then manually run the "code" script to run the benchmark. The number inside the sh scripts defines the number of threads to use. I can provide you with an easier script for training the model when I finish work, if needed. Run the "debug" script for the reput R script only if your objective is to test RAM usage. (Will be back in 11h)
@guolinke Download data from here: http://www.sysnet.ucsd.edu/projects/url/ (200MB)
R script for creating the dataset:
# SET YOUR WORKING DIRECTORY
# setwd("C:/")
# Libraries
library(sparsity) # devtools::install_github("Laurae2/sparsity")
library(Matrix)
library(tcltk)
library(R.utils)
data <- read.svmlight(paste0("reput/Day0.svm"))
data$matrix@Dim[2] <- 3231962L
data$matrix@p[length(data$matrix@p):3231963] <- data$matrix@p[length(data$matrix@p)]
data$matrix <- data$matrix[1:(data$matrix@Dim[1] - 1), ]
label <- (data$labels[1:(data$matrix@Dim[1])] + 1) / 2
data <- data$matrix
new_data <- list()
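# Read Day1..Day120 in batches of 10 files, rbind-ing every 10th day to limit peak memory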
for (i in 1:120) {
indexed <- (i %% 10) + (10 * ((i %% 10) == 0))
new_data[[indexed]] <- read.svmlight(paste0("reput/Day", i, ".svm"))
new_data[[indexed]]$matrix@Dim[2] <- 3231962L
new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p):3231963] <- new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p)]
new_data[[indexed]]$matrix <- new_data[[indexed]]$matrix[1:(new_data[[indexed]]$matrix@Dim[1] - 1), ]
label <- c(label, (new_data[[indexed]]$labels[1:(new_data[[indexed]]$matrix@Dim[1])] + 1) / 2)
if ((i %% 10) == 0) {
data <- rbind(data, new_data[[1]]$matrix, new_data[[2]]$matrix, new_data[[3]]$matrix, new_data[[4]]$matrix, new_data[[5]]$matrix, new_data[[6]]$matrix, new_data[[7]]$matrix, new_data[[8]]$matrix, new_data[[9]]$matrix, new_data[[10]]$matrix)
gc(verbose = FALSE)
cat("Parsed element 'Day", i, ".svm'. Sparsity: ", sprintf("%05.0f", as.numeric(data@Dim[1]) * as.numeric(data@Dim[2]) / length(data@i)), ":1. Balance: ", sprintf("%04.02f", length(label) / sum(label)), ":1.\n", sep = "")
}
}
# Save to RDS
gc()
saveRDS(data, file = "reput_sparse.rds", compress = TRUE)
# Save labels
saveRDS(label, file = "reput_label.rds", compress = TRUE)
# Clean up stuff
rm(list = ls())
gc()
data <- readRDS(file = "reput_sparse.rds")
label <- readRDS(file = "reput_label.rds")
# Select 29006 columns most dense
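# (for a dgCMatrix, data@p[j + 1] - data@p[j] is the number of non-zero entries in column j)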
new_data <- data[1:2396130, (data@p[2:3231963] - data@p[1:3231962]) >= 100]
# Select 23636 columns least dense
new_data <- new_data[1:2396130, (new_data@p[2:29007] - new_data@p[1:29006]) <= 1000]
# Denegative
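# Rebuild the kept columns as (i, j, x) triplets, replacing each non-zero entry with a random positive value in [1.5, 2.5]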
long_bar <- tkProgressBar(title = "Denegative Parsing", label = "Denegative Parsing: 00000 / 23636\nETA: unknown\n0,000,000 / 6,843,679 elements", min = 0, max = 23636, initial = 0, width = 500)
old_many <- 1
sparse_i <- numeric(6843679)
sparse_j <- numeric(6843679)
sparse_x <- numeric(6843679)
current_time <- System$currentTimeMillis()
for (i in 1:23636) {
which_zeroes <- (which(new_data[, i] != 0))
new_many <- old_many + length(which_zeroes)
which_data <- new_data[which_zeroes, i]
sparse_i[old_many:(new_many - 1)] <- which_zeroes
sparse_j[old_many:(new_many - 1)] <- i
set.seed(i)
sparse_x[old_many:(new_many - 1)] <- runif(length(which_zeroes), min = 1.5, max = 2.5)
old_many <- new_many
new_time <- System$currentTimeMillis()
setTkProgressBar(pb = long_bar, value = i, label = paste0("Denegative Parsing: ", sprintf("%05d", i), " / 23636\nETA: ", sprintf("%05.03f", (new_time - current_time) / 1000), "s / ", sprintf("%05.03f", ((new_time - current_time) / i) * 23636 / 1000), "s\n", formatC(old_many - 1, big.mark = ",", digits = 6, flag = 0, format = "d"), " / 6,843,679 elements"))
}
close(long_bar)
gc()
# Save data
saveRDS(new_data, file = "reput_sparse_23636.rds", compress = TRUE)
saveRDS(sparse_i, file = "reput_sparse_i.rds", compress = TRUE)
saveRDS(sparse_j, file = "reput_sparse_j.rds", compress = TRUE)
saveRDS(sparse_x, file = "reput_sparse_x.rds", compress = TRUE)
# Clean up stuff
rm(list = ls())
gc()
data <- readRDS(file = "reput_sparse_23636.rds")
sparse_i <- readRDS(file = "reput_sparse_i.rds")
sparse_j <- readRDS(file = "reput_sparse_j.rds")
sparse_x <- readRDS(file = "reput_sparse_x.rds")
label <- readRDS(file = "reput_label.rds")
# Signal Destruction
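# For each column, add 31,948 new non-zero entries at randomly chosen zero rows (values in [1, 3]) to drown the signal in noise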
long_bar <- tkProgressBar(title = "Signal Destruction", label = "Signal Destruction: 00000 / 23636\nETA: unknown\n000,000,000 / 761,966,607 elements", min = 0, max = 23636, initial = 0, width = 500)
old_many <- 6843680
new_sparse_i <- numeric(6843679 + 31948 * 23636)
new_sparse_j <- numeric(6843679 + 31948 * 23636)
new_sparse_x <- numeric(6843679 + 31948 * 23636)
new_sparse_i[1:6843679] <- sparse_i
new_sparse_j[1:6843679] <- sparse_j
new_sparse_x[1:6843679] <- sparse_x
instanced <- 1:2396130
current_time <- System$currentTimeMillis()
for (i in 1:23636) {
set.seed(i)
which_zeroes <- sample((1:2396130)[-sparse_i[(data@p[i] + 1):(data@p[i + 1])]], 31948, replace = FALSE)
new_many <- old_many + 31948
which_data <- runif(31948, min = 1, max = 3)
new_sparse_i[old_many:(new_many - 1)] <- which_zeroes
new_sparse_j[old_many:(new_many - 1)] <- i
set.seed(i)
new_sparse_x[old_many:(new_many - 1)] <- which_data
old_many <- new_many
new_time <- System$currentTimeMillis()
setTkProgressBar(pb = long_bar, value = i, label = paste0("Signal Destruction: ", sprintf("%05d", i), " / 23636\nETA: ", sprintf("%05.03f", (new_time - current_time) / 1000), "s / ", sprintf("%05.03f", ((new_time - current_time) / i) * 23636 / 1000), "s\n", formatC(old_many - 1, big.mark = ",", digits = 8, flag = 0, format = "d"), " / 761,966,607 elements"))
}
close(long_bar)
gc()
# Clean up stuff
rm(data, sparse_i, sparse_j, sparse_x, label, instanced, long_bar, old_many, new_many, current_time, new_time, which_data, which_zeroes, i)
gc()
# Lower memory usage
new_sparse_i <- as.integer(new_sparse_i)
new_sparse_j <- as.integer(new_sparse_j)
gc()
# Generate real matrix
real_data <- sparseMatrix(i = new_sparse_i, j = new_sparse_j, x = new_sparse_x, dims = c(2396130L, 23636L))
# Save data
saveRDS(real_data, file = "reput_sparse_final.rds", compress = TRUE)
Model training:
# SET YOUR WORKING DIRECTORY
# setwd("C:/")
library(lightgbm)
library(Matrix)
my_data <- readRDS("reput_sparse_final.rds")
label <- readRDS("reput_label.rds")
my_train <- my_data[1:2250000, ]
my_train_label <- label[1:2250000]
my_test <- my_data[2250001:2396130, ]
my_test_label <- label[2250001:2396130]
rm(my_data)
train <- lgb.Dataset(data = my_train, label = my_train_label)
test <- lgb.Dataset(data = my_test, label = my_test_label)
gc()
set.seed(11111)
model <- lgb.train(params = list(num_threads = 8,
learning_rate = 0.25,
max_depth = -1,
num_leaves = 1023,
max_bin = 255,
min_gain_to_split = 1,
min_sum_hessian_in_leaf = 1,
bagging_fraction = 1,
bagging_freq = 1,
bagging_seed = 1,
feature_fraction = 1,
min_data_in_leaf = 1,
bin_construct_sample_cnt = 2250000L,
early_stopping_round = 25,
metric = "auc",
histogram_pool_size = 16384),
train,
5,
list(test = test),
verbose = 2,
objective = "binary")
@guolinke You can use this if you want to work with binary datasets (should take 16+ GB less):
# SET YOUR WORKING DIRECTORY
# setwd("C:/")
# Libraries
library(data.table)
library(Matrix)
library(lightgbm)
# Do xgboost / LightGBM
train_sparse <- readRDS(file = "reput_sparse_final.rds")
label <- readRDS(file = "reput_label.rds")
# Split
train_1 <- train_sparse[1:2250000, ]
train_2 <- label[1:2250000]
test_1 <- train_sparse[2250001:2396130, ]
test_2 <- label[2250001:2396130]
# For LightGBM
lgb_train <- lgb.Dataset(data = train_1, label = train_2)
lgb_test <- lgb.Dataset(data = test_1, label = test_2)
# Train fake LightGBM model to use 2.25 million observations' columns
lgb.train(params = list(objective = "regression",
metric = "l2",
bin_construct_sample_cnt = 2250000L),
lgb_train,
1,
list(train = lgb_train,
test = lgb_test),
verbose = 1)
lgb.Dataset.save(lgb_train, fname = "reput_train_lgb.data")
lgb.Dataset.save(lgb_test, fname = "reput_test_lgb.data")
And this for training with the binary dataset:
# SET YOUR WORKING DIRECTORY
# setwd("C:/")
library(lightgbm)
library(Matrix)
train <- lgb.Dataset("reput_train_lgb.data")
test <- lgb.Dataset("reput_test_lgb.data")
gc()
set.seed(11111)
model <- lgb.train(params = list(num_threads = 8,
learning_rate = 0.25,
max_depth = -1,
num_leaves = 1023,
max_bin = 255,
min_gain_to_split = 1,
min_sum_hessian_in_leaf = 1,
bagging_fraction = 1,
bagging_freq = 1,
bagging_seed = 1,
feature_fraction = 1,
min_data_in_leaf = 1,
bin_construct_sample_cnt = 2250000L,
early_stopping_round = 25,
metric = "auc",
histogram_pool_size = 16384),
train,
5,
list(test = test),
verbose = 2,
objective = "binary")
Yes, it increased to 173GB instead of 164GB (when switching from 2.25M to 100K).
@Laurae2 Is 173GB the result with a 16384 histogram pool size?
That assumes loading from RDS, so you may see a difference of minus 8 to 20 GB.
@Laurae2 Can you send me these two files?
Update: never mind, it seems normal after a re-run.
@Laurae2 Here is my result when setting histogram_pool_size to 16384, and without setting histogram_pool_size:
This is very similar to what I have. Did you try with the CLI to check the RAM difference? About the missing file: sorry, I manually added a cleanup without reloading the file >_<
@Laurae2 It seems the R version has about 15GB of overhead.
With R, with my default script (the one with RDS), you hold the sparse dataset twice while also having the binary dataset. The original dataset is about 8GB, so that matches the 15GB difference you see.
@Laurae2 But it still costs 38GB even after I free all the data:
I'm not sure how R manages its memory, but it seems to have some problems.
BTW, the AUC result is very low on this dataset. Did you check the features?
@guolinke Does free_raw_data actually take the dataset and build the binary dataset? I think it goes like this:
But I'm not sure about that, as I don't know how memory evolves at each step. As you said, it may be due to not freeing the original data, but it might also be due to training.
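A minimal sketch of the lifecycle in question (assuming the R package's free_raw_data argument to lgb.Dataset and the exported lgb.Dataset.construct helper; when construction actually happens is exactly the open question here):

library(lightgbm)
library(Matrix)
mat <- readRDS("reput_sparse_final.rds")
lbl <- readRDS("reput_label.rds")
# free_raw_data = TRUE lets LightGBM drop its reference to the raw R matrix
# once the binary Dataset has been constructed
dtrain <- lgb.Dataset(data = mat[1:2250000, ], label = lbl[1:2250000],
                      free_raw_data = TRUE)
lgb.Dataset.construct(dtrain)  # force binary construction now
rm(mat, lbl)                   # drop the R-side copies as well
gc()                           # return freed pages to the OS where possible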
The AUC is intentionally very low (learning from 100-1000 elements per feature with roughly a 1:100 signal (1) to noise (100) ratio), with nearly all of the best features removed (otherwise a 0.9999+ AUC is meaningless to compare). The objective is to check whether it learns from 95% sparse, noisy signals. The max AUC should be around 0.65.
@Laurae2 For the memory usage, I think it isn't an issue now. You can reduce it by setting histogram_pool_size. However, I think the potential problem in this dataset may also affect the memory cost.
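For example, a hedged sketch of settings that trade speed for memory (the parameter names are standard LightGBM parameters; the values are illustrative, not from this thread):

params <- list(objective = "binary",
               metric = "auc",
               num_leaves = 255,            # the histogram cache grows with num_leaves
               histogram_pool_size = 8192,  # cap the cache at ~8GB (value is in MB)
               max_bin = 63)                # fewer bins means smaller histograms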
@Laurae2 BTW, I think we can add a FAQ entry for users who run into high memory consumption.
For the FAQ, yes. About speed, in my quick tests it was 2x faster than xgboost, which was surprising. But I haven't tested AUC yet, and I will only test up to depth 7 / 127 leaves due to the RAM explosion. Also, xgboost crashes with the binary dataset, while LightGBM doesn't. This dataset is a real toll on xgboost and LightGBM, probably one of the worst scenarios for both of them.
I ran LightGBM on a very big dataset. It is twice as fast as xgboost. But to compare both fairly, I am building bins from the whole dataset and using all the features, contrary to LightGBM's default behavior.
Dataset properties:
(Committed) RAM usage (with bin_construct_sample_cnt only):
Time per iteration (it seems a big dataset fixes issue #542, but this one is really big...):
I think that with MinGW, the dataset must be really large for LightGBM to scale well. Otherwise, we see the multithreading overhead issue (and LightGBM ends up at the same speed as xgboost, or worse).
Now the issue is LightGBM using so much RAM... it uses 54GB of RAM, swaps, and commits a total of approximately 170,000,000 KB (~170GB). In the case of xgboost, since I capped it at 54GB of RAM, it swaps, but this is not noticeable at all thanks to RAID 0 2Gbps NVMe drives (adding more RAM just made the histogram building faster; the training was not faster).
I can provide you with the building scripts for the dataset if you need to reproduce it. They take about 30 minutes to prepare.
I am using: