Missing values #8
As of 13/08/2021 the protocol works for missing values in int and float dtype columns. It doesn't work for missing values in boolean dtype columns (the same is true of the pandas implementation), and it works only partly for categorical dtypes. Here the transition from a pandas dataframe to a Vaex dataframe works. The problem arises when constructing a Vaex dataframe with a categorical column that has missing values. I tested two options (
both cases change the dtype:
and so it errors with either the pandas or the Vaex implementation in both cases. The reason:
I need to research the topic more.
In a PR comment on the Vaex library, Maarten pointed me toward the correct approach, explaining: "In vaex, we don't treat nan as missing, it's 'just another float', all columns are nullable. Arrow columns via bitmask, NumPy columns via bool/byte arrays." I played around a little more and saved the work in a Notebook. The protocol seems to work even on nullable NumPy columns. The next step I should probably take is to implement mask handling.
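To make the distinction in Maarten's comment concrete, here is a minimal sketch in plain Python (no Vaex, NumPy or Arrow calls) of the two mask layouts he mentions: a byte mask with one flag per element, and a packed bitmask with one bit per element. The helper names `byte_mask_to_bitmask` and `is_valid` are hypothetical, and the convention that a set flag means "value present" is an assumption borrowed from Arrow's validity bitmaps, which are packed least-significant bit first. NumPy masked arrays use the opposite convention, where `True` marks a masked element.

```python
import math


def byte_mask_to_bitmask(byte_mask):
    """Pack one-flag-per-element validity flags into bytes, least-significant bit first."""
    packed = bytearray((len(byte_mask) + 7) // 8)
    for i, flag in enumerate(byte_mask):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def is_valid(bitmask, i):
    """Return True if element i is marked as present in the packed bitmask."""
    return bool((bitmask[i // 8] >> (i % 8)) & 1)


values = [1.5, float("nan"), 0.0, 4.0]
# Assumed convention: True = present. Element 2 is missing;
# element 1 holds NaN but is still a present value ("just another float").
valid = [True, True, False, True]

bitmask = byte_mask_to_bitmask(valid)
missing = [i for i in range(len(values)) if not is_valid(bitmask, i)]
nan_positions = [i for i, v in enumerate(values) if math.isnan(v)]

print(missing)        # indices flagged as missing by the mask
print(nan_positions)  # indices holding NaN, independent of the mask
```

The point of the sketch is that missingness lives in the mask, not in the data buffer: the NaN at index 1 is not reported as missing, while the ordinary-looking `0.0` at index 2 is.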
The protocol now passes for all NumPy and Arrow dtypes with missing data. The methods that needed changes were:
The Notebook test is available here. The code in the draft PR for Vaex will be updated to handle missing values this week.