When header=0, LGBM_BoosterPredictForFile() called on a CSV with column names raises a process-crashing error #5093

david-cortes · 2022-03-23T20:32:17Z

This will crash the R process from which lightgbm is called:

library(lightgbm)
data(mtcars)
X <- as.matrix(mtcars[, -1L])
y <- as.numeric(mtcars[, 1L])
dtrain <- lgb.Dataset(X, label = y, params = list(max_bins = 5L))
bst <- lgb.train(
  data = dtrain
  , obj = "regression"
  , nrounds = 5L
  , verbose = -1L
)
fname <- tempfile(fileext=".csv")
write.csv(X, fname, row.names=FALSE)
pred <- predict(bst, fname)

[LightGBM] [Info] Data file /tmp/Rtmp7S7OuR/file6f834262e32d.csv doesn't contain a label column.
[LightGBM] [Fatal] Unknown token "cyl" in data file
[LightGBM] [Warning] Unknown token "cyl" in data file
terminate called without an active exception
Aborted

That C++ exception should be caught and thrown as an R error instead.

ref #4977

jameslamb · 2022-03-23T20:49:51Z

Great write-up, thanks!

jameslamb · 2022-03-26T03:51:31Z

Just adding that when I ran this tonight, using R 4.1.2 and latest master of {lightgbm}, the provided example actually killed my R session.

That also happened even when providing params = list(header = TRUE) to lgb.train() or to predict().

jameslamb · 2022-04-13T04:16:42Z

I found tonight that this issue is not limited to the R package.

As of latest master (0a4851f), the following Python code also generates a segfault.

import os
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000, random_state=708)

dtrain = lgb.Dataset(data=X, label=y)
bst = lgb.train(
    train_set=dtrain,
    params={
        "objective": "regression"
    },
    num_boost_round=5
)

# write to CSV
X_df = pd.DataFrame(
    X,
    columns=[f"col_{i}" for i in range(X.shape[1])]
)
csv_file = os.path.join(os.getcwd(), "test.csv")
X_df.to_csv(csv_file, header=True, index=False)

# predict
bst.predict(data=csv_file)

I ran this example in Jupyter Lab, and saw the following in its logs.

[LightGBM] [Fatal] Unknown token col_0 in data file
terminate called after throwing an instance of 'std::runtime_error'
  what():  Unknown token col_0 in data file

If I change the predict() call as follows:

bst.predict(data=csv_file, data_has_header=True)

predicting succeeds.
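
For callers who do not know in advance whether a file starts with column names, a pre-check along these lines might help choose the value of `data_has_header`. This is only a minimal sketch and not part of LightGBM: `csv_has_header` is a hypothetical helper, and it assumes a comma-separated file in which every data value in the first row parses as a float (so a first data row containing `NA` or an empty cell would be misread as a header).

```python
import csv


def csv_has_header(path):
    """Guess whether the first row of a CSV holds column names.

    Hypothetical helper (not a LightGBM API). Assumes a comma-separated
    file whose data cells all parse as floats, so any non-numeric cell
    in the first row is treated as a column name.
    """
    with open(path, newline="") as f:
        first_row = next(csv.reader(f), None)
    if not first_row:
        # Empty file or blank first line: nothing that looks like a header.
        return False
    for cell in first_row:
        try:
            float(cell)
        except ValueError:
            # At least one cell is not numeric, so treat the row as a header.
            return True
    return False
```

With the example above, the call would then become `bst.predict(data=csv_file, data_has_header=csv_has_header(csv_file))`.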

With the R package, I also found that predicting on a CSV with column names in the headers succeeds if I pass header = TRUE.

library(lightgbm)
data(mtcars)
X <- as.matrix(mtcars[, -1L])
y <- as.numeric(mtcars[, 1L])
dtrain <- lgb.Dataset(
    X
    , label = y
    , params = list(min_data_in_bin = 1L, min_data_in_leaf = 1L)
)
bst <- lgb.train(
    data = dtrain
    , obj = "regression"
    , nrounds = 5L
)
fname <- tempfile(fileext=".csv")
write.csv(X, fname, row.names=FALSE)

# using header = TRUE, predicting succeeds
pred <- predict(bst, fname, header = TRUE)

So I think there are two separate issues:

1. (not specific to the R package) predicting on a CSV with column names produces a process-crashing error if LGBM_BoosterPredictForFile() is called with data_has_header=0
2. (specific to the R package) predict.lgb.Booster() doesn't respect header and its aliases passed through keyword arguments (similar issue to [R-package] predict ignores predict_raw_score #4670)

jameslamb · 2022-04-13T04:19:19Z

For anyone looking to contribute a fix, the error being thrown

[LightGBM] [Fatal] Unknown token col_0 in data file
terminate called after throwing an instance of 'std::runtime_error'
  what():  Unknown token col_0 in data file

is because of this call to Log::Fatal() inside a while loop in the text-file-parsing code

LightGBM/include/LightGBM/utils/common.h

Line 346 in b0137de

Log::Fatal("Unknown token %s in data file", tmp_str.c_str());

which throws an error here:

LightGBM/include/LightGBM/utils/log.h

Line 130 in 0a4851f

throw std::runtime_error(std::string(str_buf));
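
Since the log shows `terminate called`, the process is aborted from native code, and a `try`/`except` in the calling Python session cannot intercept it. Until that is fixed, one possible workaround is to run file-based prediction in a child process and inspect its exit status. The sketch below rests on assumptions: `run_isolated` is a hypothetical helper, and the crash is simulated with `os.abort()` rather than an actual LightGBM call.

```python
import subprocess
import sys


def run_isolated(code, timeout=60):
    """Run a Python snippet in a child interpreter.

    Hypothetical helper. Returns (ok, returncode, stderr) so the parent
    can report a failure instead of dying together with the child.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0, result.returncode, result.stderr


# Stand-in for a prediction call that aborts in native code.
crashing_snippet = "import os; os.abort()"
ok, returncode, stderr = run_isolated(crashing_snippet)
if not ok:
    # The parent process survives and can raise an ordinary exception.
    print(f"child process failed with return code {returncode}")
```

In real use, the snippet would presumably load a saved model, call `predict()` on the file, and write the predictions somewhere the parent process can read them.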

jameslamb added the bug label Mar 23, 2022

jameslamb added the r-package label Mar 23, 2022

jameslamb changed the title ~~[R-package] Predicting on CSV with column names crashes process~~ When header=0, LGBM_BoosterPredictForFile() called on a CSV with column names raises a process-crashing error Apr 13, 2022

jameslamb removed the r-package label Apr 13, 2022

jameslamb mentioned this issue Apr 14, 2022

[RFC] 4.0.0 Release #5153

Closed

60 tasks
