
Efficient native support of pandas.DataFrame with mixed dense and sparse columns #4153

Closed
staftermath opened this issue Apr 1, 2021 · 6 comments

Comments

@staftermath

It looks like lightgbm attempts to convert sparse arrays into dense numpy arrays internally. When the converted dense data frame is huge, this may cause memory issues.

A small sample pandas DataFrame containing a sparse array:

df = pd.DataFrame({
        "col1": [1, 2, 3],
        "sparse1": pd.SparseArray([1, 0, 1], fill_value=0)
    })

In my real case, the pandas DataFrame consists of about 15k sparse array columns and 1 million rows. The total memory for this DataFrame is < 1 GB. However, when fed to LightGBM training, it raises a memory error:

booster = Booster(params=params, train_set=train_set)
  File "Path/python3.6/site-packages/lightgbm/basic.py", line 2053, in __init__
    train_set.construct()
  File "Path/python3.6/site-packages/lightgbm/basic.py", line 1325, in construct
    categorical_feature=self.categorical_feature, params=self.params)
  File "Path/python3.6/site-packages/lightgbm/basic.py", line 1123, in _lazy_init
    self.__init_from_np2d(data, params_str, ref_dataset)
  File "Path/python3.6/site-packages/lightgbm/basic.py", line 1160, in __init_from_np2d
    data = np.array(mat.reshape(mat.size), dtype=mat.dtype, copy=False)
MemoryError: Unable to allocate 135. GiB for an array with shape (1182094, 15279) and data type float64

My question is: is it, in theory, possible to use pandas sparse arrays in training without internally converting them to dense arrays?
A naive thought: if LightGBM is using bins, say, max_bin=16, can it use sparse arrays efficiently?
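To make the memory gap concrete, here is a small illustrative sketch (not from the issue itself) comparing the footprint of the same mostly-zero column stored densely versus as a pandas sparse array (spelled pd.arrays.SparseArray in current pandas; older versions exposed it as pd.SparseArray):

```python
import numpy as np
import pandas as pd

# Illustrative sketch: one mostly-zero column, stored densely vs. sparsely.
n = 100_000
values = np.zeros(n)
values[::1000] = 1.0  # ~0.1% of entries are non-zero

dense = pd.Series(values)
sparse_col = pd.Series(pd.arrays.SparseArray(values, fill_value=0.0))

# The dense column needs n * 8 bytes of float64; the sparse one stores only
# the non-zero values plus their integer indices.
print(dense.memory_usage(deep=True))
print(sparse_col.memory_usage(deep=True))
```

Scaled to ~15k such columns and ~1M rows, the dense representation is the 135 GiB allocation in the traceback above, while the sparse one stays under a gigabyte.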

@jameslamb
Collaborator

@staftermath good to see you here! Thanks for your question.

LightGBM does work with sparse arrays in its core library, so I can at least tell you for sure that lightgbm's Python package isn't doing something like calling .todense() before training.

To be honest, though, this is the first I've heard of a pandas SparseArray. I'd personally have to research https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html more. Maybe @StrikerRUS knows more.

If you convert your data to a scipy sparse matrix, do you still experience memory issues?
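One way to try that conversion (a sketch, not an official LightGBM recipe): cast every column of a mixed dense/sparse frame to a sparse dtype and export it through pandas' sparse accessor to a SciPy matrix. This assumes all columns are numeric and that the fill value is 0.

```python
import numpy as np
import pandas as pd

# A hypothetical mixed frame like the one in the report.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0],
    "sparse1": pd.arrays.SparseArray([1.0, 0.0, 1.0], fill_value=0.0),
})

# Cast every column to a sparse dtype, then export to a SciPy COO matrix
# via the DataFrame.sparse accessor; convert to CSC for column-wise access.
sparse_df = df.astype(pd.SparseDtype("float64", 0.0))
X = sparse_df.sparse.to_coo().tocsc()
print(X.shape)  # (3, 2)
```

Note that DataFrame.sparse.to_coo() requires all columns to have a sparse dtype, hence the astype step for the dense columns.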

@staftermath
Author

Thanks James! Let me do some testing and report back. I suspect that when a pandas DataFrame is fed to training, LightGBM only tries to get the np.arrays from the underlying data. Perhaps a csc_matrix would resolve the issue. Stay tuned :)

@StrikerRUS
Collaborator

Here is the corresponding line of the source code:

return np.array(data, dtype=dtype, copy=False) # SparseArray should be supported as well

So, LightGBM supports pandas.SparseArray as input data, but it is converted to a numpy.array. I guess it shouldn't increase memory usage, though, because the underlying data structures are re-used: #2383 (comment).

However, everything above is true only for float data. I see you're using an int type for your SparseArray, so there will be an int -> float copy anyway.

I believe this issue should be treated as a sub-issue of the following feature request (#2302):

[screenshot of the quoted feature request in #2302]

@staftermath Please comment if you think this is another issue.

@StrikerRUS
Collaborator

Oh, wait! You are using pandas.SparseArray as a column of a 2-d pandas.DataFrame! Unfortunately, we don't support this scenario for now.

@StrikerRUS StrikerRUS changed the title Question on ingestion of pandas sparse array Efficient native support of pandas.DataFrame with mixed dense and sparse columns Apr 2, 2021
@StrikerRUS
Collaborator

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@staftermath
Author

Thanks for the comments @StrikerRUS. The int type was just for the example; in my real case it is float64. But yes, the SparseArray is used as a feature column in a pandas DataFrame.

Also, an update for @jameslamb: you are right, directly feeding a csc_matrix to Dataset and then to lgb.train doesn't incur huge memory spikes. This effectively solves my problem. Thanks a lot!
