When header=0, LGBM_BoosterPredictForFile() called on a CSV with column names raises a process-crashing error #5093

Open
Tracked by #5153
david-cortes opened this issue Mar 23, 2022 · 4 comments

@david-cortes
Contributor

This will crash the R process from which lightgbm is called:

library(lightgbm)
data(mtcars)
X <- as.matrix(mtcars[, -1L])
y <- as.numeric(mtcars[, 1L])
dtrain <- lgb.Dataset(X, label = y, params = list(max_bins = 5L))
bst <- lgb.train(
  data = dtrain
  , obj = "regression"
  , nrounds = 5L
  , verbose = -1L
)
fname <- tempfile(fileext=".csv")
write.csv(X, fname, row.names=FALSE)
pred <- predict(bst, fname)

[LightGBM] [Info] Data file /tmp/Rtmp7S7OuR/file6f834262e32d.csv doesn't contain a label column.
[LightGBM] [Fatal] Unknown token "cyl" in data file
[LightGBM] [Warning] Unknown token "cyl" in data file
terminate called without an active exception
Aborted

That C++ exception should be caught and thrown as an R error instead.

ref #4977

@jameslamb jameslamb added the bug label Mar 23, 2022
@jameslamb
Collaborator

Great write-up, thanks!

@jameslamb
Collaborator

Just adding that when I ran this tonight, using R 4.1.2 and latest master of {lightgbm}, the provided example actually killed my R session.

That also happened even when providing params = list(header = TRUE) to lgb.train() or to predict().

@jameslamb
Collaborator

I found tonight that this issue is not limited to the R package.

As of latest master (0a4851f), the following Python code also crashes the process.

import os
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000, random_state=708)

dtrain = lgb.Dataset(data=X, label=y)
bst = lgb.train(
    train_set=dtrain,
    params={
        "objective": "regression"
    },
    num_boost_round=5
)

# write to CSV
X_df = pd.DataFrame(
    X,
    columns=[f"col_{i}" for i in range(X.shape[1])]
)
csv_file = os.path.join(os.getcwd(), "test.csv")
X_df.to_csv(csv_file, header=True, index=False)

# predict
bst.predict(data=csv_file)

I ran this example in Jupyter Lab, and saw the following in its logs.

[LightGBM] [Fatal] Unknown token col_0 in data file
terminate called after throwing an instance of 'std::runtime_error'
  what():  Unknown token col_0 in data file

If I change the predict() call as follows:

bst.predict(data=csv_file, data_has_header=True)

predicting succeeds.
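
Until the crash itself is fixed, one way to avoid it from Python is to check for a header row before calling predict(). The sketch below is not part of lightgbm: csv_has_header is a hypothetical helper that assumes a comma-separated file whose data cells are all numeric, empty, or NA-like, and treats any other text in the first row as column names.

```python
import csv


def csv_has_header(path):
    """Guess whether the first row of a CSV file holds column names.

    Hypothetical helper, not part of lightgbm. Assumes every data cell is
    numeric, empty, or an NA-like marker, so any other text in the first
    row is taken to be a column name.
    """
    na_markers = {"", "na", "nan", "null"}
    with open(path, newline="") as f:
        # csv.reader also strips the quotes that R's write.csv() adds.
        first_row = next(csv.reader(f), [])
    for cell in first_row:
        text = cell.strip().lower()
        if text in na_markers:
            continue
        try:
            float(text)
        except ValueError:
            # A non-numeric, non-NA cell: most likely a column name.
            return True
    return False
```

With the example above, the call could then be written as bst.predict(data=csv_file, data_has_header=csv_has_header(csv_file)). A file whose first data row happens to contain only numbers and whose header is also numeric would be misclassified, so this is a guard, not a guarantee.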


With the R package, I also found that predicting on a CSV with column names in the header row succeeds if I pass header = TRUE.

library(lightgbm)
data(mtcars)
X <- as.matrix(mtcars[, -1L])
y <- as.numeric(mtcars[, 1L])
dtrain <- lgb.Dataset(
    X
    , label = y
    , params = list(min_data_in_bin = 1L, min_data_in_leaf = 1L)
)
bst <- lgb.train(
    data = dtrain
    , obj = "regression"
    , nrounds = 5L
)
fname <- tempfile(fileext=".csv")
write.csv(X, fname, row.names=FALSE)

# using header = TRUE, predicting succeeds
pred <- predict(bst, fname, header = TRUE)

So I think there are two separate issues:

  • (not specific to the R package) predicting on a CSV with column names produces a process-crashing error if LGBM_BoosterPredictForFile() is called with data_has_header=0
  • (specific to the R package) predict.lgb.Booster() doesn't respect header and its aliases passed through keyword arguments (similar issue to [R-package] predict ignores predict_raw_score #4670)
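
As a rough illustration of the second point, respecting header and its aliases amounts to looking the flag up under every accepted name before falling back to a default. The snippet below is written in Python for illustration only and is not the R package's code; HEADER_ALIASES and resolve_header are hypothetical names, and the alias list assumes has_header is the only alias of header.

```python
# Assumed names for the `header` parameter; verify against the parameter docs.
HEADER_ALIASES = ("header", "has_header")


def resolve_header(kwargs, default=False):
    """Return the header flag found in kwargs, honoring aliases.

    Hypothetical helper for illustration. Names are checked in the order
    listed in HEADER_ALIASES, so the main name wins over an alias.
    """
    for name in HEADER_ALIASES:
        if name in kwargs:
            return bool(kwargs[name])
    return default
```

For example, resolve_header({"has_header": True}) yields True, while an empty dictionary yields the default of False.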

@jameslamb
Collaborator

jameslamb commented Apr 13, 2022

For anyone looking to contribute a fix, the error being thrown

[LightGBM] [Fatal] Unknown token col_0 in data file
terminate called after throwing an instance of 'std::runtime_error'
  what():  Unknown token col_0 in data file

comes from this call to Log::Fatal() inside a while loop in the text-file-parsing code:

Log::Fatal("Unknown token %s in data file", tmp_str.c_str());

which throws an error here:

throw std::runtime_error(std::string(str_buf));
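
The failure mode, and one possible shape of a fix, can be illustrated outside of C++. The Python sketch below is only an analogy and not LightGBM code: parse_row mimics a parser that raises on a non-numeric token, and parse_lines_safe is a hypothetical boundary function that converts the exception into a status code plus a stored message, in the style of a C API that returns 0 on success and -1 on failure.

```python
_last_error = ""


def parse_row(line):
    """Parse one comma-separated line of numbers, raising on unknown tokens."""
    values = []
    for token in line.strip().split(","):
        try:
            values.append(float(token))
        except ValueError:
            # Analogous to Log::Fatal() throwing std::runtime_error.
            raise RuntimeError(f"Unknown token {token} in data file")
    return values


def parse_lines_safe(lines, out_rows):
    """Hypothetical API boundary that never lets an exception escape.

    Returns 0 on success and -1 on failure. On failure the message is
    stored for get_last_error() and out_rows is left untouched.
    """
    global _last_error
    try:
        parsed = [parse_row(line) for line in lines]
    except RuntimeError as err:
        _last_error = str(err)
        return -1
    out_rows.extend(parsed)
    return 0


def get_last_error():
    """Return the message recorded by the most recent failed call."""
    return _last_error
```

The point of the pattern is that the caller (R or Python) sees a return code it can turn into an ordinary error, instead of the process being terminated by an exception nobody caught.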

@jameslamb jameslamb changed the title [R-package] Predicting on CSV with column names crashes process When header=0, LGBM_BoosterPredictForFile() called on a CSV with column names raises a process-crashing error Apr 13, 2022
@jameslamb jameslamb mentioned this issue Apr 14, 2022