
Efficient native support of pandas.DataFrame with mixed dense and sparse columns #4153

Closed
staftermath opened this issue Apr 1, 2021 · 6 comments

Comments

@staftermath

It looks like lightgbm attempts to convert sparse arrays into dense numpy arrays internally. When the converted dense data frame is huge, this may cause memory issues.

A small sample pandas DataFrame containing a sparse array:

df = pd.DataFrame({
        "col1": [1, 2, 3],
        "sparse1": pd.SparseArray([1, 0, 1], fill_value=0)
    })

In my real case, the pandas DataFrame consists of about 15k sparse array columns and 1 million rows. The total memory for this DataFrame is < 1 GB. However, when fed to LightGBM training, it raises a memory error:

booster = Booster(params=params, train_set=train_set)
  File "Path/python3.6/site-packages/lightgbm/basic.py", line 2053, in __init__
    train_set.construct()
  File "Path/python3.6/site-packages/lightgbm/basic.py", line 1325, in construct
    categorical_feature=self.categorical_feature, params=self.params)
  File "Path/python3.6/site-packages/lightgbm/basic.py", line 1123, in _lazy_init
    self.__init_from_np2d(data, params_str, ref_dataset)
  File "Path/python3.6/site-packages/lightgbm/basic.py", line 1160, in __init_from_np2d
    data = np.array(mat.reshape(mat.size), dtype=mat.dtype, copy=False)
MemoryError: Unable to allocate 135. GiB for an array with shape (1182094, 15279) and data type float64

My question is: is it, in theory, possible to use pandas sparse arrays in training without internally converting them to dense arrays?
A naive thought: if LightGBM is using bins, say, max_bin=16, can it use sparse arrays efficiently?
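To make the memory gap concrete, here is a small illustrative sketch (not from the issue itself) comparing the footprint of the same mostly-zero column stored densely versus as a pandas sparse array (spelled pd.arrays.SparseArray in current pandas; older versions exposed it as pd.SparseArray):

```python
import numpy as np
import pandas as pd

# Illustrative sketch: one mostly-zero column, stored densely vs. sparsely.
n = 100_000
values = np.zeros(n)
values[::1000] = 1.0  # ~0.1% of entries are non-zero

dense = pd.Series(values)
sparse_col = pd.Series(pd.arrays.SparseArray(values, fill_value=0.0))

# The dense column needs n * 8 bytes of float64; the sparse one stores only
# the non-zero values plus their integer indices.
print(dense.memory_usage(deep=True))
print(sparse_col.memory_usage(deep=True))
```

Scaled to ~15k such columns and ~1M rows, the dense representation is the 135 GiB allocation in the traceback above, while the sparse one stays under a gigabyte.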

@jameslamb
Collaborator

@staftermath good to see you here! Thanks for your question.

LightGBM does work with sparse arrays in its core library, so I can at least tell you for sure that lightgbm's Python package isn't doing something like calling .todense() before training.

To be honest, though, this is the first I've heard of a pandas SparseArray. I'd personally have to research https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html more. Maybe @StrikerRUS knows more.

If you convert your data to a scipy sparse matrix, do you still experience memory issues?
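One way to try that conversion (a sketch, not an official LightGBM recipe): cast every column of a mixed dense/sparse frame to a sparse dtype and export it through pandas' sparse accessor to a SciPy matrix. This assumes all columns are numeric and that the fill value is 0.

```python
import numpy as np
import pandas as pd

# A hypothetical mixed frame like the one in the report.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0],
    "sparse1": pd.arrays.SparseArray([1.0, 0.0, 1.0], fill_value=0.0),
})

# Cast every column to a sparse dtype, then export to a SciPy COO matrix
# via the DataFrame.sparse accessor; convert to CSC for column-wise access.
sparse_df = df.astype(pd.SparseDtype("float64", 0.0))
X = sparse_df.sparse.to_coo().tocsc()
print(X.shape)  # (3, 2)
```

Note that DataFrame.sparse.to_coo() requires all columns to have a sparse dtype, hence the astype step for the dense columns.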

@staftermath
Author

Thanks James! Let me do some testing and report back. I suspect that when a pandas DataFrame is fed to training, LightGBM only tries to get the np.arrays from the underlying data. Perhaps a csc_matrix would resolve the issue. Stay tuned :)

@StrikerRUS
Collaborator

Here is the corresponding line of the source code:

return np.array(data, dtype=dtype, copy=False) # SparseArray should be supported as well

So, LightGBM supports pandas.SparseArray as input data, but it is converted to a numpy.array. I guess it shouldn't increase memory usage, though, because the underlying data structures are re-used: #2383 (comment).

However, everything above is true only for float data. I see you're using an int type for your SparseArray, so there will be an int -> float copy anyway.

I believe this issue should be treated as a sub-issue of the following feature request (#2302):

[screenshot of the quoted feature request in #2302]

@staftermath Please comment if you think this is another issue.

@StrikerRUS
Collaborator

Oh, wait! You are using pandas.SparseArray as a column of a 2-d pandas.DataFrame! Unfortunately, we don't support this scenario for now.

@StrikerRUS StrikerRUS changed the title Question on ingestion of pandas sparse array Efficient native support of pandas.DataFrame with mixed dense and sparse columns Apr 2, 2021
@StrikerRUS
Collaborator

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@staftermath
Author

Thanks for the comments @StrikerRUS. The int type was just for the example; in my real case it is float64. But yes, the SparseArray is used as a feature column in a pandas DataFrame.

Also, an update for @jameslamb: you are right, directly feeding a csc_matrix to Dataset and then to lgb.train doesn't incur huge memory spikes. This effectively solves my problem. Thanks a lot!
