Missing values #8

AlenkaF · 2021-08-12T10:05:17Z

No description provided.

AlenkaF · 2021-08-13T10:09:32Z

As of 13/08/2021, the protocol works for missing values in int and float dtype columns. It does not work for missing values in boolean dtype columns (the same is true of the pandas implementation), and it works only partly for categorical dtypes.

For categorical dtypes, the transition from a pandas dataframe to a Vaex dataframe works. The problem arises when constructing a Vaex dataframe with a categorical column that has missing values. I tested two options (categorize() doesn't work if there are missing values):

1. using the map() function to change a value of the categorical column to np.nan
2. using the func.where() function to change a value of the categorical column to None

Both options change the dtype:

1. the dtype becomes float
2. the dtype becomes object

so both options error with either the pandas or the Vaex implementation. The reasons:

1. calling _VaexBuffer errors with 'pyarrow.lib.DoubleArray' object has no attribute '__array_interface__'
2. the object dtype is not yet supported

Need to research the topic more.
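
The dtype change described above can be reproduced with plain NumPy, independently of Vaex. This is a minimal sketch, assuming the categorical codes are stored as an integer array; the array `codes` and its values are made up for illustration, and `np.where` stands in for the Vaex `map()` and `func.where()` calls.

```python
import numpy as np

# Hypothetical integer codes of a categorical column (not from the issue).
codes = np.array([0, 1, 2, 1], dtype="int64")

# Option 1: replace a value with np.nan. There is no integer NaN,
# so NumPy promotes the result to float64.
with_nan = np.where(codes == 2, np.nan, codes)

# Option 2: replace a value with None. None only fits in an object array.
with_none = np.array(
    [None if c == 2 else c for c in codes.tolist()], dtype=object
)

print(with_nan.dtype)   # float64
print(with_none.dtype)  # object
```

Neither result keeps the original integer dtype, which is consistent with the two failures listed above.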

AlenkaF · 2021-08-17T12:21:57Z

In a PR comment for the Vaex library, Maarten pointed me toward the correct approach, explaining: "In vaex, we don't treat nan as missing, it's 'just another float', all columns are nullable. Arrow columns via bitmask, NumPy columns via bool/byte arrays."

I played around a little bit more and saved the work in a Notebook. The protocol seems to work even on nullable NumPy columns. The next step I should probably take is to implement mask handling.
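
The distinction between NaN and missing, and between the two mask layouts, can be illustrated with NumPy alone. This is a rough sketch rather than Vaex code: the sample values are invented, a NumPy masked array stands in for a nullable NumPy column, and the packed bits follow the Arrow validity-bitmap convention (1 means valid, least significant bit first).

```python
import numpy as np

# Invented sample data: one NaN (an ordinary float) and one missing value.
values = np.array([1.0, np.nan, 3.0, 4.0])

# Byte mask: one boolean per value, True marks a missing entry.
byte_mask = np.array([False, False, True, False])

column = np.ma.masked_array(values, mask=byte_mask)

# Missing entries come from the mask only; the NaN is still a valid value.
n_missing = int(column.mask.sum())
n_nan = int(np.isnan(values[~byte_mask]).sum())

# Arrow-style validity bitmap: one bit per value, 1 = valid,
# packed least significant bit first.
validity_bits = np.packbits(~byte_mask, bitorder="little")

print(n_missing, n_nan)  # 1 1
print(validity_bits)     # [11]
```

Here the byte mask needs four bytes while the bitmask fits in one, but both carry the same information about which entries are missing.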

AlenkaF · 2021-08-24T08:31:32Z

The protocol now passes for all NumPy and Arrow dtypes with missing data. The methods that needed changes were:

convert_column_to_ndarray
convert_categorical_column
describe_null
null_count
get_data_buffer
get_mask
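
To show how three of the methods listed above relate to each other, here is a simplified sketch for a column backed by a NumPy masked array. The class `MaskedColumnSketch` is hypothetical and is not the code from the Vaex PR; in particular, the real `get_mask` returns a buffer object together with its dtype, whereas this version returns a plain `uint8` array. The null-kind code 4 (byte mask) is the one defined by the dataframe interchange protocol.

```python
import numpy as np

# Null-kind code from the dataframe interchange protocol: 4 = byte mask.
USE_BYTEMASK = 4


class MaskedColumnSketch:
    """Hypothetical stand-in for a protocol column over a masked array."""

    def __init__(self, data):
        self._data = np.ma.asarray(data)

    def describe_null(self):
        # (kind, value): for a byte mask, value is the mask entry
        # (0 or 1) that marks a missing element. Here 1 means missing.
        return (USE_BYTEMASK, 1)

    @property
    def null_count(self):
        # Count entries flagged as missing in the mask.
        return int(np.ma.getmaskarray(self._data).sum())

    def get_mask(self):
        # One byte per element; 1 marks a missing value.
        return np.ma.getmaskarray(self._data).astype(np.uint8)


col = MaskedColumnSketch(np.ma.masked_array([1, 2, 3], mask=[False, True, False]))
print(col.describe_null())  # (4, 1)
print(col.null_count)       # 1
print(col.get_mask())       # [0 1 0]
```

An Arrow-backed column would instead report the bitmask kind (code 3) and expose the packed validity bitmap.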

The Notebook test is available here. The code in the draft PR for Vaex will be updated to handle missing values this week.

AlenkaF closed this as completed Aug 24, 2021
