Pass through argument 'handle_unknown' of sklearn OneHotEncoder #66
Hi @kopant, thanks for raising this issue! We're looking into modifying our load_data() function to make it easier to work with, so you can look forward to that in the future. I'd also like to understand the error you're facing now. Currently the code uses the entire dataset (train+val+test) to build the categorical encoders, so in theory you shouldn't be running into any errors in this scenario. Could you go into a bit more detail, or alternatively share the dataset you're trying this with?
Hi @akashsaravanan-georgian, I believe that in the example Jupyter notebook, the data is first split into train-val-test before the model is trained on the train set, and, as is usually the case, the model is then evaluated on the test set. It's true that if we ran the categorical encoders on the entire (train+val+test) dataset, there wouldn't be an issue with the encoders. However, in that case, how would we split the encoded dataset back into train/val/test? This may be something I'm just not aware of, but it seems difficult to split torch Datasets so that they align with pre-defined indices for train/val/test. One can fairly easily create new train/val/test splits using built-in torch functions, but I often want to use a predefined set of splits (say, indexed by row index or the like). Because the output of the categorical encoding is a torch Dataset, this becomes difficult.
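For what it's worth, one way to apply pre-defined row indices to a torch Dataset is torch.utils.data.Subset. A minimal sketch (the toy dataset and index lists below are illustrative, not the toolkit's actual output):

```python
from torch.utils.data import Dataset, Subset


class ToyDataset(Dataset):
    """Stand-in for the Dataset produced after categorical encoding."""

    def __init__(self, n_rows):
        self.rows = list(range(n_rows))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


full = ToyDataset(10)

# Pre-existing splits, e.g. loaded from disk for reproducibility.
train_idx, val_idx, test_idx = [0, 1, 2, 3, 4, 5], [6, 7], [8, 9]

# Subset keeps a view onto the full dataset at the given indices,
# so the encoders can be fit once on `full` before splitting.
train_set = Subset(full, train_idx)
val_set = Subset(full, val_idx)
test_set = Subset(full, test_idx)
```

This fits the encoders on all rows first and only then carves out the splits, so every split sees the same set of encoded levels.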
Hi @kopant, I hope that helps! Happy to answer any other questions or clarifications you may have.
Hi @kopant, happy to note that you can now do this by passing in "ohe_handle_unknown" as part of your training arguments. The supported values are "error" (default), "ignore", and "infrequent_if_exist".
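For anyone landing here, the underlying sklearn behavior is easy to check directly. A quick sketch (the column values are made up; "infrequent_if_exist" additionally requires scikit-learn >= 1.1):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["green"], ["red"]])
test = np.array([["blue"]])  # level never seen during fit

# With handle_unknown="ignore", an unseen level encodes as an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
row = lenient.transform(test).toarray()  # shape (1, 2), all zeros

# With the default handle_unknown="error", the same transform raises.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True
```

So "ignore" trades an exception for a row with no active category, which is usually the right behavior for rare levels that only show up in validation or test data.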
In data_utils.CategoricalFeatures._one_hot(), could you expose the handle_unknown argument of sklearn's OneHotEncoder to the user, so that they have the option to specify handle_unknown='ignore'? As-is, the code and example notebook become problematic in the common case where the train set and the validation or test set contain different distinct levels of a categorical variable. In that case, we cannot score a model trained on the train set against the test or validation set, because the mismatched number of levels in the categorical variable causes an error. This happens whenever you have rare categorical levels.
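The pass-through itself would be small. A hypothetical sketch of the shape it might take; this only mirrors the name data_utils.CategoricalFeatures._one_hot and is not the toolkit's actual code:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


class CategoricalFeatures:
    """Hypothetical sketch, not the toolkit's implementation."""

    def __init__(self, df, cat_cols, handle_unknown="error"):
        self.df = df
        self.cat_cols = cat_cols
        # Forwarded to sklearn instead of hard-coding the default.
        self.handle_unknown = handle_unknown

    def _one_hot(self):
        self.ohe = OneHotEncoder(handle_unknown=self.handle_unknown)
        return self.ohe.fit_transform(self.df[self.cat_cols])
```

With handle_unknown="ignore", fitting on the train frame and later calling self.ohe.transform(...) on a test frame containing a rare, unseen level would yield an all-zero encoding instead of a ValueError.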
Another option is to use load_data() on the entire modeling dataset and only split it into train/val/test afterwards, but this is not straightforward, at least if you have pre-existing indices for the train/val/test sets (given that you are trying to split a PyTorch Dataset).