Missing values #8

Closed
AlenkaF opened this issue Aug 12, 2021 · 3 comments

Comments

@AlenkaF
Owner

AlenkaF commented Aug 12, 2021

No description provided.

@AlenkaF
Owner Author

AlenkaF commented Aug 13, 2021

As of 13/08/2021 the protocol works for missing values in int and float dtype columns. It doesn't work when there are missing values in boolean dtype columns (the same is true of the pandas implementation), and it works only partly for categorical dtypes.

Here the transition from a pandas dataframe to a Vaex dataframe works. The problem arises when constructing a Vaex dataframe with a categorical column that has missing values. I tested two options (categorize() doesn't work if there are missing values):

  • with map() function changing value of categorical column to np.nan
  • with func.where() function changing value of categorical column to None

Both cases change the dtype:

  • with map() and np.nan, the dtype becomes float
  • with func.where() and None, the dtype becomes object

and so it errors with the pandas or Vaex implementation in both cases. The reasons:

  • float case: calling _VaexBuffer fails with 'pyarrow.lib.DoubleArray' object has no attribute '__array_interface__'
  • object case: object dtype is not yet supported

Need to research the topic more.
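
The dtype change described above can be reproduced with plain NumPy, independent of Vaex. This is only a minimal sketch: the integer codes and the -1 sentinel are made up for illustration and are not taken from the actual test data.

```python
import numpy as np

# Hypothetical integer codes of a categorical column; -1 stands in for "missing".
codes = np.array([0, 1, -1, 2], dtype=np.int64)
missing = codes == -1

# Option 1: replace missing entries with NaN. NaN only exists for floats,
# so the whole array is upcast to float64.
as_nan = np.where(missing, np.nan, codes)
print(as_nan.dtype)

# Option 2: replace missing entries with None. None is a Python object,
# so NumPy falls back to the object dtype.
as_none = np.array([None if m else int(c) for c, m in zip(codes, missing)])
print(as_none.dtype)
```

Either way the integer codes are lost, so the column no longer looks like a categorical to code that expects integer codes.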

@AlenkaF
Owner Author

AlenkaF commented Aug 17, 2021

In a comment on the Vaex PR, Maarten pointed me toward the correct approach, explaining: "In vaex, we don't treat nan as missing, it's 'just another float', all columns are nullable. Arrow columns via bitmask, NumPy columns via bool/byte arrays."

I played around a little bit more and saved the work in a Notebook. The protocol seems to work even on nullable NumPy columns. The next step I should probably take is to implement mask handling.
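
To make the distinction between NaN and missing concrete, here is a small sketch using a NumPy masked array. It is an illustration only, not Vaex code; the variable names are made up, and the bitmap at the end assumes Arrow's convention of one validity bit per element, least-significant bit first, with a set bit meaning "valid".

```python
import numpy as np

# Data buffer: NaN is stored like any other float value.
data = np.array([1.0, np.nan, 3.0, 4.0])

# Byte mask (one bool per element). True marks a missing value here,
# following the NumPy masked-array convention.
byte_mask = np.array([False, False, True, False])

column = np.ma.masked_array(data, mask=byte_mask)

nan_count = int(np.isnan(data).sum())    # NaNs present in the raw buffer
missing_count = int(column.mask.sum())   # elements flagged as missing

# Arrow-style validity bitmap built from the same information:
# invert the mask (set bit = valid) and pack eight elements per byte.
validity_bitmap = np.packbits(~byte_mask, bitorder="little")

print(nan_count, missing_count, validity_bitmap)
```

The NaN at index 1 and the masked element at index 2 are counted separately, which is the behaviour the quote describes.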

@AlenkaF
Owner Author

AlenkaF commented Aug 24, 2021

The protocol now passes for all NumPy and Arrow dtypes with missing data. The methods that needed changes were:

  • convert_column_to_ndarray
  • convert_categorical_column
  • describe_null
  • null_count
  • get_data_buffer
  • get_mask

The Notebook test is available here. The code in the draft PR for Vaex will be updated to handle missing values this week.
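
For readers unfamiliar with the methods listed above, the following is a rough sketch of what describe_null, null_count and get_mask could look like for a NumPy-backed column. SketchColumn is a hypothetical class written for this example and is not the code from the Vaex PR. The null-kind codes follow my understanding of the dataframe interchange protocol (0 non-nullable, 1 NaN, 2 sentinel, 3 bit mask, 4 byte mask).

```python
import numpy as np

# Null-kind codes, assumed to match the dataframe interchange protocol.
NON_NULLABLE, USE_NAN, USE_SENTINEL, USE_BITMASK, USE_BYTEMASK = range(5)


class SketchColumn:
    """Hypothetical column wrapper; not the Vaex implementation."""

    def __init__(self, values):
        self._values = values

    def _is_masked(self):
        return isinstance(self._values, np.ma.MaskedArray)

    @property
    def describe_null(self):
        # Masked arrays carry a separate bool array, i.e. a byte mask
        # in which 1 (True) marks a missing value.
        if self._is_masked():
            return USE_BYTEMASK, 1
        return NON_NULLABLE, None

    @property
    def null_count(self):
        if self._is_masked():
            return int(np.ma.getmaskarray(self._values).sum())
        return 0

    def get_mask(self):
        # Return the mask as a uint8 buffer, one byte per element.
        if not self._is_masked():
            raise RuntimeError("column has no mask")
        return np.ma.getmaskarray(self._values).astype(np.uint8)


col = SketchColumn(np.ma.masked_array([1, 2, 3], mask=[False, True, False]))
print(col.describe_null, col.null_count, col.get_mask())
```

An Arrow-backed column would instead report the bit-mask kind and expose its validity bitmap; that branch is left out to keep the sketch short.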

@AlenkaF AlenkaF closed this as completed Aug 24, 2021