Imputation of numerical data #69

kopant · 2024-03-05T21:46:13Z

Since you mentioned you're considering enhancing load_data(), it may also be worth exposing different methods for imputing missing numeric data to the user. Currently data_utils.load_num_feats() defaults to median imputation, but this can be a poor choice when values are missing because of real differences in the data-generating process (i.e., NULL data followed a different process than non-NULL data and is meaningfully distinct from it). In that case, one might instead want to encode the missing data with a value distinct from the non-NULL distribution prior to modeling.
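To make the distinction concrete, here is a minimal, dependency-free sketch contrasting median imputation with sentinel encoding. The helper names (`is_missing`, `median_impute`, `sentinel_impute`), the sample `income` column, and the `-1.0` sentinel are all made up for illustration and are not part of the library; the sentinel only works as intended if it lies outside the range of the observed values.

```python
import math
from statistics import median


def is_missing(x):
    """Treat None and NaN as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def median_impute(values):
    """Replace missing entries with the median of the observed entries."""
    observed = [v for v in values if not is_missing(v)]
    fill = median(observed)
    return [fill if is_missing(v) else v for v in values]


def sentinel_impute(values, sentinel=-1.0):
    """Replace missing entries with a sentinel value (assumed to lie outside
    the observed range) and also return a 0/1 missing-indicator column."""
    filled = [sentinel if is_missing(v) else v for v in values]
    was_missing = [1 if is_missing(v) else 0 for v in values]
    return filled, was_missing


# Hypothetical numeric column with two missing entries.
income = [42.0, None, 58.0, float("nan"), 50.0]

# Median imputation makes missing rows look like typical rows.
print(median_impute(income))

# Sentinel encoding keeps missing rows distinguishable for the model.
print(sentinel_impute(income))
```

With median imputation the two missing rows become indistinguishable from a row that really had the median value, whereas the sentinel (optionally paired with the indicator column) preserves the fact that they were missing.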

akashsaravanan-georgian · 2024-03-06T17:46:00Z

That's a good idea, thanks! We'll incorporate that when doing the enhancement.

akashsaravanan-georgian · 2024-09-17T15:18:56Z

Hi @kopant, happy to note that you can now do this by setting numerical_handle_na to True and setting numerical_how_handle_na to one of "mean", "median", or "value". If you want to use a specific value, you can set numerical_na_value.
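As a rough illustration of what the three strategies amount to, here is a small stand-alone sketch. It is not the library's implementation: `fill_numerical_na` is a hypothetical helper, and only the option names "mean", "median", and "value" (plus the idea of a user-supplied fill value) are taken from the comment above.

```python
import math
from statistics import mean, median


def _is_na(x):
    """Treat None and NaN as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def fill_numerical_na(values, how="median", na_value=0.0):
    """Hypothetical stand-in showing how each strategy would fill gaps.

    how: "mean" or "median" fill with a statistic of the observed entries;
         "value" fills with the caller-supplied na_value.
    """
    observed = [v for v in values if not _is_na(v)]
    if how == "mean":
        fill = mean(observed)
    elif how == "median":
        fill = median(observed)
    elif how == "value":
        fill = na_value
    else:
        raise ValueError(f"unknown strategy: {how!r}")
    return [fill if _is_na(v) else v for v in values]


# Hypothetical column with one missing entry.
ages = [20.0, None, 30.0, 70.0]

print(fill_numerical_na(ages, how="mean"))
print(fill_numerical_na(ages, how="median"))
print(fill_numerical_na(ages, how="value", na_value=-1.0))
```

The "value" strategy is the one that covers the original request: choosing a fill value outside the observed range keeps missing rows distinct from the non-NULL distribution.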

kopant · 2024-09-18T02:40:00Z

Thanks for making the change, @akashsaravanan-georgian!

akashsaravanan-georgian added the "enhancement" (New feature or request) label Mar 6, 2024

akashsaravanan-georgian closed this as completed in #79 Sep 17, 2024

akashsaravanan-georgian mentioned this issue Sep 16, 2024

Feat: Better Preprocessing #79

Merged
