[R] LightGBM uses a lot of RAM when #feature and #leaves are both large #562

Closed
Laurae2 opened this issue May 28, 2017 · 32 comments · Fixed by #572

Laurae2 (Contributor) commented May 28, 2017

I ran LightGBM on a very big dataset. It is twice as fast as xgboost. But to compare the two fairly, I am constructing bins from the whole dataset and using all the features, contrary to LightGBM's default behavior.

Dataset properties:

  • 2,250,000 observations
  • 23,636 features

(Committed) RAM usage:

  • LightGBM: 164GB (173GB when building bins from only 100,000 observations via bin_construct_sample_cnt)
  • xgboost fast histogram: 63GB
  • xgboost exact: 25GB (not sure, but it didn't use a lot)

[screenshot: committed RAM usage]

Time per iteration (it seems a big dataset avoids issue #542, but this one is really big...):

  • LightGBM: 8-12 seconds
  • xgboost fast histogram: 16-20 seconds
  • xgboost exact: 90 seconds (gave up)

I think with MinGW the dataset must be really large for LightGBM to scale well. Otherwise, we see the multithreading overhead issue (and LightGBM ends up at the same speed as xgboost, or worse).

Now the issue is LightGBM using so much RAM: it uses 54GB of physical RAM, swaps, and commits approximately 170GB in total. In the case of xgboost, since I capped physical RAM at 54GB it swaps too, but it's barely noticeable thanks to RAID 0 2Gbps NVMe drives (adding more RAM only made the histogram construction faster, not the training).

I can provide you with the scripts for building the dataset if you need to reproduce it; preparing it takes about 30 minutes.

I am using:

  • 255 bins (intentional)
  • depth of 10
  • number of leaves of 1023
  • 8 threads
Laurae2 changed the title from "LightGBM uses a lot of RAM" to "LightGBM uses a lot of RAM on big datasets" on May 28, 2017
guolinke (Collaborator):

@Laurae2 try the parameter histogram_pool_size
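
For reference, histogram_pool_size is given in MB, so in the R API it would look roughly like the sketch below (the 8192 cap is only illustrative, and "train" is assumed to be an existing lgb.Dataset as in the scripts later in this thread):

# Sketch: cap LightGBM's histogram cache at ~8 GB (histogram_pool_size is in MB;
# the default of -1 means no limit). Other parameters are placeholders.
library(lightgbm)
params <- list(objective = "binary",
               num_leaves = 1023,
               max_bin = 255,
               histogram_pool_size = 8192)
model <- lgb.train(params = params, data = train, nrounds = 5)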

guolinke (Collaborator):

@Laurae2
Can you save the dataset to binary format, and give me its size?

I think the memory consumption here comes from the histogram cache; it needs about num_leaves * 20 bytes * num_features * num_bins.

You can also try to reduce num_leaves and see how much it costs.
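
To put that formula into numbers for this dataset (a back-of-the-envelope estimate only; the 20-byte constant is taken from the comment above):

# Rough histogram-cache estimate: num_leaves * 20 bytes * num_features * num_bins
num_leaves   <- 1023
num_features <- 23636
num_bins     <- 255
num_leaves * 20 * num_features * num_bins / 1024^3  # ~115 GiB with 1023 leaves
255 * 20 * num_features * num_bins / 1024^3          # ~29 GiB with 255 leaves

That is roughly consistent with the drop to about 65GB of total RAM reported below for num_leaves = 255, once the dataset itself and R overhead are added on top.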

Laurae2 (Contributor, Author) commented May 28, 2017

@guolinke

binary dataset: 1,499,944 KB (about 1.5GB)

with histogram_pool_size = 16384 it manages to keep under 54GB RAM.

When num_leaves decreases, the required RAM drops sharply. I think I'll drop some of my tests on this big dataset.

With num_leaves = 255, it drops to 65GB of RAM required.

guolinke (Collaborator) commented May 28, 2017

@Laurae2
So the dataset only costs about 1GB of memory.
And the histogram cache is 16GB when histogram_pool_size=16384 is set.
I am not sure what is using the other 40GB+ of memory...

Did you run it via the R package? Can you try it with the CLI version (you can use the binary dataset)?
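
If it helps, the CLI could even be driven from R; a rough sketch (the executable name, its presence on the PATH, and the binary file name are assumptions here):

# Hypothetical sketch: run the LightGBM CLI on a saved binary dataset so its
# memory usage can be compared against the R package run.
system(paste("lightgbm",
             "task=train",
             "data=train.bin",              # binary dataset saved beforehand
             "objective=binary",
             "num_leaves=1023",
             "max_bin=255",
             "num_threads=8",
             "histogram_pool_size=16384"))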

guolinke changed the title from "LightGBM uses a lot of RAM on big datasets" to "LightGBM uses a lot of RAM when #feature and #leaves are both large" on May 28, 2017
guolinke (Collaborator):

@Laurae2 can you give me the dataset to reproduce this?

Laurae2 (Contributor, Author) commented May 28, 2017

@guolinke Do you have an email address where I can send you a link to the file?

Laurae2 (Contributor, Author) commented May 28, 2017

@guolinke I think 10GB is the data in memory, but I am not sure what the rest is.

Do you know if it is possible to do the following in R?

  • Remove the original dataset while being able to train, like training from a binary file but directly in memory (a sketch of what I mean follows below)
  • Predict from lgb.Dataset instead of using the original dataset

xgboost also has the issue of RAM increasing rapidly when the maximum number of leaves increases, but its memory grows more slowly.
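
For the first point, something like this is what I have in mind (a sketch only; it assumes lgb.Dataset.save and construct() behave as in the current R package, and the file name is a placeholder):

# Sketch: construct the Dataset, save it to a binary file, drop the raw sparse
# matrix, and reload the Dataset from disk so only the binned data stays in RAM.
dtrain <- lgb.Dataset(data = my_train, label = my_train_label, free_raw_data = TRUE)
dtrain$construct()                               # build the internal binned representation
lgb.Dataset.save(dtrain, fname = "train_lgb.data")
rm(my_train, my_train_label); gc()               # release the original sparse matrix
dtrain <- lgb.Dataset("train_lgb.data")          # reload without any raw data attached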

guolinke (Collaborator):

@Laurae2
can you give the memory usage before calling lgb.train?
BTW, we cannot predict from lgb.Dataset.

guolinke (Collaborator) commented May 29, 2017

@Laurae2
I debugged with your dataset; some information:

  1. Cost of Dataset: about 6GB.
  2. Cost of other data structure: about 1GB
  3. Cost of histogram cache: set by histogram_pool_size.

I ran it with both the Visual Studio and MinGW builds, with parameters data=train.bin app=binary histogram_pool_size=1024.
Both cost about 8GB of memory.

As a result, I see no reason for it to cost 54GB of memory...

Another thing: the binary dataset you built seems to come from the v2.0 version, not the master version.
Can you delete all local code and rebuild from the master branch?

Laurae2 (Contributor, Author) commented May 29, 2017

I'll recompile from the master branch and come back in 2 days when the server is free (no idea why it keeps using the old LightGBM; I think the dll didn't get freed, as I encounter this issue a lot, so I am going to wipe the folder instead of trusting the reinstall from now on).

can you give the memory usage before calling lgb.train?

Approximately 8GB if I remember correctly (+2GB for Windows).

guolinke (Collaborator):

@Laurae2
can you share your R script for reproducing this?

Laurae2 (Contributor, Author) commented May 29, 2017

@guolinke Check here: https://github.com/Laurae2/gbt_benchmarks/tree/master/zzz_new

It is the "reput" dataset. You can run the Creator scripts (there are 3 of them bundled together) manually if you want to recreate the dataset, then you can manually run the "code" script to run the benchmark. The number inside the sh scripts defines the number of threads to use.

I can try to provide you with a simpler model-training script when I finish work, if needed.

Run the "debug" script for the reput R script only if your objective is to test RAM usage.

(Will be back in 11h)

guolinke changed the title from "LightGBM uses a lot of RAM when #feature and #leaves are both large" to "[R] LightGBM uses a lot of RAM when #feature and #leaves are both large" on May 29, 2017
Laurae2 (Contributor, Author) commented May 29, 2017

@guolinke Download data from here: http://www.sysnet.ucsd.edu/projects/url/ (200MB)

R script for creating dataset:

# SET YOUR WORKING DIRECTORY
# setwd("C:/")

# Libraries
library(sparsity) # devtools::install_github("Laurae2/sparsity")
library(Matrix)
library(tcltk)
library(R.utils)

data <- read.svmlight(paste0("reput/Day0.svm"))
data$matrix@Dim[2] <- 3231962L
data$matrix@p[length(data$matrix@p):3231963] <- data$matrix@p[length(data$matrix@p)]
data$matrix <- data$matrix[1:(data$matrix@Dim[1] - 1), ]
label <- (data$labels[1:(data$matrix@Dim[1])] + 1) / 2
data <- data$matrix

new_data <- list()

for (i in 1:120) {
  indexed <- (i %% 10) + (10 * ((i %% 10) == 0))
  new_data[[indexed]] <- read.svmlight(paste0("reput/Day", i, ".svm"))
  new_data[[indexed]]$matrix@Dim[2] <- 3231962L
  new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p):3231963] <- new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p)]
  new_data[[indexed]]$matrix <- new_data[[indexed]]$matrix[1:(new_data[[indexed]]$matrix@Dim[1] - 1), ]
  label <- c(label, (new_data[[indexed]]$labels[1:(new_data[[indexed]]$matrix@Dim[1])] + 1) / 2)
  
  if ((i %% 10) == 0) {
    
    data <- rbind(data, new_data[[1]]$matrix, new_data[[2]]$matrix, new_data[[3]]$matrix, new_data[[4]]$matrix, new_data[[5]]$matrix, new_data[[6]]$matrix, new_data[[7]]$matrix, new_data[[8]]$matrix, new_data[[9]]$matrix, new_data[[10]]$matrix)
    gc(verbose = FALSE)
    
    cat("Parsed element 'Day", i, ".svm'. Sparsity: ", sprintf("%05.0f", as.numeric(data@Dim[1]) * as.numeric(data@Dim[2]) / length(data@i)), ":1. Balance: ", sprintf("%04.02f", length(label) / sum(label)), ":1.\n", sep = "")
    
  }
  
}

# Save to RDS
gc()
saveRDS(data, file = "reput_sparse.rds", compress = TRUE)

# Save labels
saveRDS(label, file = "reput_label.rds", compress = TRUE)

# Clean up stuff
rm(list = ls())
gc()

data <- readRDS(file = "reput_sparse.rds")
label <- readRDS(file = "reput_label.rds")

# Select 29006 columns most dense
new_data <- data[1:2396130, (data@p[2:3231963] - data@p[1:3231962]) >= 100]

# Select 23636 columns least dense
new_data <- new_data[1:2396130, (new_data@p[2:29007] - new_data@p[1:29006]) <= 1000]

# Denegative
long_bar <- tkProgressBar(title = "Denegative Parsing", label = "Denegative Parsing: 00000 / 23636\nETA: unknown\n0,000,000 / 6,843,679 elements", min = 0, max = 23636, initial = 0, width = 500)
old_many <- 1
sparse_i <- numeric(6843679)
sparse_j <- numeric(6843679)
sparse_x <- numeric(6843679)
current_time <- System$currentTimeMillis()
for (i in 1:23636) {
  which_zeroes <- (which(new_data[, i] != 0))
  new_many <- old_many + length(which_zeroes)
  which_data <- new_data[which_zeroes, i]
  sparse_i[old_many:(new_many - 1)] <- which_zeroes
  sparse_j[old_many:(new_many - 1)] <- i
  set.seed(i)
  sparse_x[old_many:(new_many - 1)] <- runif(length(which_zeroes), min = 1.5, max = 2.5)
  old_many <- new_many
  new_time <- System$currentTimeMillis()
  setTkProgressBar(pb = long_bar, value = i, label = paste0("Denegative Parsing: ", sprintf("%05d", i), " / 23636\nETA: ", sprintf("%05.03f", (new_time - current_time) / 1000), "s / ", sprintf("%05.03f", ((new_time - current_time) / i) * 23636 / 1000), "s\n", formatC(old_many - 1, big.mark = ",", digits = 6, flag = 0, format = "d"), " / 6,843,679 elements"))
}
close(long_bar)
gc()

# Save data
saveRDS(new_data, file = "reput_sparse_23636.rds", compress = TRUE)
saveRDS(sparse_i, file = "reput_sparse_i.rds", compress = TRUE)
saveRDS(sparse_j, file = "reput_sparse_j.rds", compress = TRUE)
saveRDS(sparse_x, file = "reput_sparse_x.rds", compress = TRUE)

# Clean up stuff
rm(list = ls())
gc()

data <- readRDS(file = "reput_sparse_23636.rds")
sparse_i <- readRDS(file = "reput_sparse_i.rds")
sparse_j <- readRDS(file = "reput_sparse_j.rds")
sparse_x <- readRDS(file = "reput_sparse_x.rds")
label <- readRDS(file = "reput_label.rds")

# Signal Destruction
long_bar <- tkProgressBar(title = "Signal Destruction", label = "Signal Destruction: 00000 / 23636\nETA: unknown\n000,000,000 / 761,966,607 elements", min = 0, max = 23636, initial = 0, width = 500)
old_many <- 6843680
new_sparse_i <- numeric(6843679 + 31948 * 23636)
new_sparse_j <- numeric(6843679 + 31948 * 23636)
new_sparse_x <- numeric(6843679 + 31948 * 23636)
new_sparse_i[1:6843679] <- sparse_i
new_sparse_j[1:6843679] <- sparse_j
new_sparse_x[1:6843679] <- sparse_x
instanced <- 1:2396130
current_time <- System$currentTimeMillis()
for (i in 1:23636) {
  set.seed(i)
  which_zeroes <- sample((1:2396130)[-sparse_i[(data@p[i] + 1):(data@p[i + 1])]], 31948, replace = FALSE)
  new_many <- old_many + 31948
  which_data <- runif(31948, min = 1, max = 3)
  new_sparse_i[old_many:(new_many - 1)] <- which_zeroes
  new_sparse_j[old_many:(new_many - 1)] <- i
  set.seed(i)
  new_sparse_x[old_many:(new_many - 1)] <- which_data
  old_many <- new_many
  new_time <- System$currentTimeMillis()
  setTkProgressBar(pb = long_bar, value = i, label = paste0("Signal Destruction: ", sprintf("%05d", i), " / 23636\nETA: ", sprintf("%05.03f", (new_time - current_time) / 1000), "s / ", sprintf("%05.03f", ((new_time - current_time) / i) * 23636 / 1000), "s\n", formatC(old_many - 1, big.mark = ",", digits = 8, flag = 0, format = "d"), " / 761,966,607 elements"))
}
close(long_bar)
gc()

# Clean up stuff
rm(data, sparse_i, sparse_j, sparse_x, label, instanced, long_bar, old_many, new_many, current_time, new_time, which_data, which_zeroes, i)
gc()

# Lower memory usage
new_sparse_i <- as.integer(new_sparse_i)
new_sparse_j <- as.integer(new_sparse_j)
gc()

# Generate real matrix
real_data <- sparseMatrix(i = new_sparse_i, j = new_sparse_j, x = new_sparse_x, dims = c(2396130L, 23636L))

# Save data
saveRDS(real_data, file = "reput_sparse_final.rds", compress = TRUE)

Model training:

# SET YOUR WORKING DIRECTORY
# setwd("C:/")

library(lightgbm)
library(Matrix)

my_data <- readRDS("reput_sparse_final.rds")
label <- readRDS("reput_label.rds")

my_train <- my_data[1:2250000, ]
my_train_label <- label[1:2250000]
my_test <- my_data[2250001:2396130, ]
my_test_label <- label[2250001:2396130]
rm(my_data)

train <- lgb.Dataset(data = my_train, label = my_train_label)
test <- lgb.Dataset(data = my_test, label = my_test_label)

gc()
set.seed(11111)
model <- lgb.train(params = list(num_threads = 8,
                                 learning_rate = 0.25,
                                 max_depth = -1,
                                 num_leaves = 1023,
                                 max_bin = 255,
                                 min_gain_to_split = 1,
                                 min_sum_hessian_in_leaf = 1,
                                 bagging_fraction = 1,
                                 bagging_freq = 1,
                                 bagging_seed = 1,
                                 feature_fraction = 1,
                                 min_data_in_leaf = 1,
                                 bin_construct_sample_cnt = 2250000L,
                                 early_stopping_round = 25,
                                 metric = "auc",
                                 histogram_pool_size = 16384),
                   train,
                   5,
                   list(test = test),
                   verbose = 2,
                   objective = "binary")

guolinke (Collaborator):

@Laurae2
Thanks, I will test it when my server is free.
BTW, do you have the memory cost numbers when bin_construct_sample_cnt is set to a smaller value?

Laurae2 (Contributor, Author) commented May 29, 2017

@guolinke You can use this if you want to work with binary datasets (should take 16+ GB less):

# SET YOUR WORKING DIRECTORY
# setwd("C:/")

# Libraries
library(data.table)
library(Matrix)
library(lightgbm)

# Do xgboost / LightGBM
train_sparse <- readRDS(file = "reput_sparse_final.rds")
label <- readRDS(file = "reput_label.rds")

# Split
train_1 <- train_sparse[1:2250000, ]
train_2 <- label[1:2250000]
test_1 <- train_sparse[2250001:2396130, ]
test_2 <- label[2250001:2396130]

# For LightGBM
lgb_train <- lgb.Dataset(data = train_1, label = train_2)
lgb_test <- lgb.Dataset(data = test_1, label = test_2)

# Train a throwaway 1-round LightGBM model so the Dataset is constructed using all 2.25 million observations for the bin boundaries (bin_construct_sample_cnt)
lgb.train(params = list(objective = "regression",
                        metric = "l2",
                        bin_construct_sample_cnt = 2250000L),
          lgb_train,
          1,
          list(train = lgb_train,
               test = lgb_test),
          verbose = 1)
lgb.Dataset.save(lgb_train, fname = "reput_train_lgb.data")
lgb.Dataset.save(lgb_test, fname = "reput_test_lgb.data")

And this for training from the binary dataset:

# SET YOUR WORKING DIRECTORY
# setwd("C:/")

library(lightgbm)
library(Matrix)

train <- lgb.Dataset("reput_train_lgb.data")
test <- lgb.Dataset("reput_test_lgb.data")

gc()
set.seed(11111)
model <- lgb.train(params = list(num_threads = 8,
                                 learning_rate = 0.25,
                                 max_depth = -1,
                                 num_leaves = 1023,
                                 max_bin = 255,
                                 min_gain_to_split = 1,
                                 min_sum_hessian_in_leaf = 1,
                                 bagging_fraction = 1,
                                 bagging_freq = 1,
                                 bagging_seed = 1,
                                 feature_fraction = 1,
                                 min_data_in_leaf = 1,
                                 bin_construct_sample_cnt = 2250000L,
                                 early_stopping_round = 25,
                                 metric = "auc",
                                 histogram_pool_size = 16384),
                   train,
                   5,
                   list(test = test),
                   verbose = 2,
                   objective = "binary")

Laurae2 (Contributor, Author) commented May 29, 2017

BTW, Did you have the number of memory cost when set bin_construct_sample_cnt to a smaller number?

Yes, it increased to 173GB instead of 164GB (when switching from 2.25M to 100K).

guolinke (Collaborator):

@Laurae2 is 173GB the result with a histogram pool size of 16384?

Laurae2 (Contributor, Author) commented May 29, 2017

@guolinke

  • 164GB: 2.25M bin_construct_sample_cnt, no histogram_pool_size
  • 173GB: 100K bin_construct_sample_cnt, no histogram_pool_size
  • approx 50GB: 2.25M bin_construct_sample_cnt, 16384 histogram_pool_size

This assumes loading from RDS, so you may see 8 to 20 GB less.

guolinke (Collaborator) commented May 30, 2017

@Laurae2
I hit an error when generating the dataset:

> new_data <- data[1:2396130, (data@p[2:3231963] - data@p[1:3231962]) >= 100]
Error: trying to get slot "p" from an object of a basic class ("function") with
no slots

Can you send me these two files?

train_sparse <- readRDS(file = "reput_sparse_final.rds")
label <- readRDS(file = "reput_label.rds")

Update: never mind, it works normally after a re-run.

guolinke (Collaborator) commented May 30, 2017

@Laurae2 my result when histogram_pool_size is set to 16384:

[screenshot: memory usage with histogram_pool_size = 16384]

without setting histogram_pool_size:

[screenshot: memory usage without histogram_pool_size]

Laurae2 (Contributor, Author) commented May 30, 2017

This is very similar to what I have. Did you try the CLI to check the RAM difference?

About the missing file: sorry, I manually added a cleanup step without reloading the file >_<

guolinke (Collaborator):

@Laurae2
CLI version:
[screenshot: memory usage with the CLI version]

It seems the R version has about 15GB of overhead.

Laurae2 (Contributor, Author) commented May 30, 2017

With R, with my default script (the one with RDS), you have the sparse dataset in memory twice in addition to the binary dataset. The original dataset is about 8GB, so that matches the 15GB difference you see.

guolinke (Collaborator):

@Laurae2
using the following also costs about 24GB:

library(lightgbm)
library(Matrix)

train <- lgb.Dataset("reput_train_lgb.data")
test <- lgb.Dataset("reput_test_lgb.data")

gc()
set.seed(11111)
model <- lgb.train(params = list(num_threads = 8,
                                 learning_rate = 0.25,
                                 max_depth = -1,
                                 num_leaves = 1023,
                                 max_bin = 255,
                                 min_gain_to_split = 1,
                                 min_sum_hessian_in_leaf = 1,
                                 bagging_fraction = 1,
                                 bagging_freq = 1,
                                 bagging_seed = 1,
                                 feature_fraction = 1,
                                 min_data_in_leaf = 1,
                                 bin_construct_sample_cnt = 2250000L,
                                 early_stopping_round = 25,
                                 metric = "auc",
                                 histogram_pool_size = 16384),
                   train,
                   5,
                   list(test = test),
                   verbose = 2,
                   objective = "binary")

But it still costs 38GB even after I free all the data:

library(lightgbm)
library(Matrix)

my_data <- readRDS("reput_sparse_final.rds")
label <- readRDS("reput_label.rds")

my_train <- my_data[1:2250000, ]
my_train_label <- label[1:2250000]
my_test <- my_data[2250001:2396130, ]
my_test_label <- label[2250001:2396130]
rm(my_data)
rm(label)

train <- lgb.Dataset(data = my_train, label = my_train_label, free_raw_data=TRUE)
test <- lgb.Dataset(data = my_test, label = my_test_label, reference=train, free_raw_data=TRUE)
train$construct()
test$construct()
rm(my_train)
rm(my_train_label)
rm(my_test)
rm(my_test_label)

gc()
set.seed(11111)
model <- lgb.train(params = list(num_threads = 16,
                                 learning_rate = 0.25,
                                 max_depth = -1,
                                 num_leaves = 1023,
                                 max_bin = 255,
                                 min_gain_to_split = 1,
                                 min_sum_hessian_in_leaf = 1,
                                 bagging_fraction = 1,
                                 bagging_freq = 1,
                                 bagging_seed = 1,
                                 feature_fraction = 1,
                                 min_data_in_leaf = 1,
                                 early_stopping_round = 25,
                                 metric = "auc",
                                 histogram_pool_size = 16384),
                   train,
                   0,
                   list(test = test),
                   verbose = 2,
                   objective = "binary")

Not sure how R manages its memory, but it seems to have some problems.

guolinke (Collaborator) commented May 30, 2017

BTW, the AUC result is very low on this dataset. Did you check the features?
Though this data is very sparse, I find some rows are quite dense...

Laurae2 (Contributor, Author) commented May 30, 2017

@guolinke does free_raw_data actually take the dataset and build the binary dataset?

I think it goes like this:

  1. Original data created
  2. lgb.Dataset created
  3. lgb.Dataset constructed, but the original data is not removed (RAM usage starts to differ from here)
  4. rm of the original data
  5. lgb.train builds something from the lgb.Dataset (probably the binary dataset), which makes it sit twice in memory

But I am not sure about that, as I don't know how memory evolves at each step. As you said, it may be due to not freeing the original data, but it might also be due to training.
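
To see how memory evolves at each of those steps, a rough probe like the one below could be dropped into the training script. Note that gc() only reports R-level allocations, so LightGBM's native memory still has to be watched from the OS (e.g. Task Manager); the helper name and the my_train objects are taken from the earlier script and are only illustrative.

# Illustrative probe: report how much memory R itself holds after each step.
report_mem <- function(step) {
  cat(step, "-", round(sum(gc()[, 2])), "MB held by R objects\n")
}

report_mem("1. raw data loaded")
dtrain <- lgb.Dataset(data = my_train, label = my_train_label)
report_mem("2. lgb.Dataset created")
dtrain$construct()
report_mem("3. lgb.Dataset constructed")
rm(my_train); gc()
report_mem("4. raw data removed")
# step 5 (lgb.train) allocates mostly on the native side, outside R's heap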

guolinke (Collaborator):

@Laurae2
After the dataset is built, the original data will be set to NULL: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Dataset.R#L255-L257.
And there are no other references to this raw data.
And the original data is not used any more after construct().

Laurae2 (Contributor, Author) commented May 30, 2017

AUC is intentionally very low (learning from 100-1000 elements per feature with roughly a 1:100 signal-to-noise ratio), with nearly all of the best features removed (otherwise a 0.9999+ AUC is meaningless to compare).

The objective is to check whether it learns from 95%-sparse noisy signals. Max AUC should be around 0.65.

guolinke (Collaborator) commented May 30, 2017

@Laurae2
I think there are some other problems with this dataset. I am trying to fix them.

For the memory usage, I think it isn't an issue now. You can reduce it by setting histogram_pool_size, num_leaves and max_bin.
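
For example, a hedged combination of those three knobs (the values below are only illustrative, not tuned recommendations):

# Illustrative memory-conscious settings: fewer leaves and fewer bins shrink
# each per-feature histogram, and histogram_pool_size hard-caps the cache (in MB).
params <- list(objective = "binary",
               num_leaves = 255,             # vs. 1023: ~4x smaller histogram cache
               max_bin = 63,                 # vs. 255: ~4x smaller per-feature histograms
               histogram_pool_size = 4096)   # cap the histogram cache at ~4 GB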

However, I think the potential problem in this dataset may also affect the memory cost.
You can wait until I fix it.

guolinke (Collaborator):

@Laurae2
After some investigation, I find this dataset hits a weakness of LightGBM (in its sparse optimization).
So there are no other problems here.
And I think the speed results on this dataset will not be so good.

guolinke (Collaborator):

BTW, I think we can add a FAQ entry for users who run into large memory consumption.

Laurae2 (Contributor, Author) commented May 31, 2017

For the FAQ, yes.

About speed, in my quick tests it was 2x faster than xgboost, which was surprising. But I haven't tested AUC yet, and I will test only up to depth 7 / 127 leaves due to the RAM explosion.

Also, xgboost crashes with its binary dataset, while LightGBM doesn't. This dataset takes a real toll on xgboost and LightGBM, probably one of the worst-case scenarios for both of them.

guolinke pushed a commit that referenced this issue May 31, 2017
guolinke pushed a commit that referenced this issue Oct 9, 2017