Since you mentioned you're considering enhancing load_data(), it might also be worth exposing different methods for imputing missing numeric data. Currently data_utils.load_num_feats() defaults to median imputation, which can be a poor choice when values are missing because of real differences in the data-generating process (i.e., the NULL rows followed a different process than the non-NULL rows and are meaningfully distinct from them). In that case, one might instead want to encode the missing data with a distinct value that lies outside the non-NULL distribution before modeling.
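To make the concern concrete, here is a minimal pandas sketch contrasting the two approaches. The column name, the toy values, and the sentinel of -1 are all assumptions chosen for illustration; none of this is taken from the library's code.

```python
import numpy as np
import pandas as pd

# Toy column (illustrative values): income is missing for two rows.
df = pd.DataFrame({"income": [40.0, 55.0, np.nan, 70.0, np.nan]})

# Median imputation: the missing rows become indistinguishable from
# a typical non-missing row.
median_filled = df["income"].fillna(df["income"].median())

# Sentinel encoding: fill with a value outside the observed range and
# keep an explicit indicator, so a model can treat missingness as signal.
SENTINEL = -1.0  # assumption: -1 never occurs as a real income
df["income_missing"] = df["income"].isna().astype(int)
df["income_encoded"] = df["income"].fillna(SENTINEL)
```

With median imputation the two missing rows end up at 55.0, the same as a real observation. With the sentinel they stay separable, which matters if missingness itself is informative.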
Hi @kopant, happy to note that you can now do this by setting numerical_handle_na to True and setting numerical_how_handle_na to one of "mean", "median" or "value". If you choose "value", set numerical_na_value to the specific value you want to use.
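As a rough illustration of what the three strategies named above do to a column, the sketch below uses a hypothetical helper, fill_numeric_na, written only for this example. It is not the library's implementation, and how these options are actually passed to the library is not shown in this thread.

```python
import numpy as np
import pandas as pd


def fill_numeric_na(col: pd.Series, how: str = "median",
                    na_value: float = 0.0) -> pd.Series:
    """Hypothetical helper mirroring the "mean" / "median" / "value" options."""
    if how == "mean":
        return col.fillna(col.mean())
    if how == "median":
        return col.fillna(col.median())
    if how == "value":
        # Fill with a caller-chosen constant, e.g. an out-of-range sentinel.
        return col.fillna(na_value)
    raise ValueError(f"unknown strategy: {how!r}")


ages = pd.Series([20.0, np.nan, 30.0, 70.0])  # illustrative data
by_mean = fill_numeric_na(ages, how="mean")
by_median = fill_numeric_na(ages, how="median")
by_value = fill_numeric_na(ages, how="value", na_value=-1.0)
```

The "value" strategy is the one that addresses the original request, since it lets the missing rows carry a value distinct from the observed data.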