Pass through argument 'handle_unknown' of sklearn OneHotEncoder #66
Hi @kopant, thanks for raising this issue! We're looking into modifying our load_data() function to make it easier to work with, so you can look forward to that in the future. I'd also like to understand the error you're facing now. Currently the code uses the entire dataset (train+val+test) to build the categorical encoders, so in theory you shouldn't be running into any errors in this scenario. Could you go into a bit more detail, or alternatively share the dataset you're trying this with?
Hi @akashsaravanan-georgian, I believe that in the example Jupyter notebook, the data is first split into train-val-test before the model is trained on the train set, and, as is usually the case, the model is then evaluated on the test set. It's true that if we ran the categorical encoders on the entire (train+val+test) dataset, there wouldn't be an issue with the encoders. However, in that case, how would we split the encoded dataset back into train/val/test? This may be something I'm just not aware of, but it seems difficult to split torch Datasets so that they align with pre-defined indices for train/val/test. One can fairly easily create new train/val/test splits using built-in torch functions, but I often want to use a predefined set of splits (say, indexed by row index or the like). Because the output of the categorical encoding is a torch Dataset, this becomes difficult.
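For what it's worth, one way to apply pre-defined row indices to a torch Dataset is torch.utils.data.Subset. A minimal sketch (the toy dataset and index lists below are illustrative, not the toolkit's actual output):

```python
from torch.utils.data import Dataset, Subset


class ToyDataset(Dataset):
    """Stand-in for the Dataset produced after categorical encoding."""

    def __init__(self, n_rows):
        self.rows = list(range(n_rows))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


full = ToyDataset(10)

# Pre-existing splits, e.g. loaded from disk for reproducibility.
train_idx, val_idx, test_idx = [0, 1, 2, 3, 4, 5], [6, 7], [8, 9]

# Subset keeps a view onto the full dataset at the given indices,
# so the encoders can be fit once on `full` before splitting.
train_set = Subset(full, train_idx)
val_set = Subset(full, val_idx)
test_set = Subset(full, test_idx)
```

This fits the encoders on all rows first and only then carves out the splits, so every split sees the same set of encoded levels.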
Hi @kopant, I hope that helps! Happy to answer any other questions or clarifications you may have.
Hi @kopant, happy to note that you can now do this by passing in "ohe_handle_unknown" as part of your training arguments. The supported values are "error" (default), "ignore", and "infrequent_if_exist".
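For anyone landing here, the underlying sklearn behavior is easy to check directly. A quick sketch (the column values are made up; "infrequent_if_exist" additionally requires scikit-learn >= 1.1):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["green"], ["red"]])
test = np.array([["blue"]])  # level never seen during fit

# With handle_unknown="ignore", an unseen level encodes as an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
row = lenient.transform(test).toarray()  # shape (1, 2), all zeros

# With the default handle_unknown="error", the same transform raises.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True
```

So "ignore" trades an exception for a row with no active category, which is usually the right behavior for rare levels that only show up in validation or test data.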
In data_utils.CategoricalFeatures._one_hot(), could you expose the handle_unknown argument of sklearn's OneHotEncoder to the user, so that they have the option to specify handle_unknown='ignore'? As-is, the code and example notebook become problematic in the common case where the train set and the validation or test set contain different distinct levels of a categorical variable. In that case, we cannot score a model trained on the train set against the test or validation set, because the mismatched number of levels in the categorical variable causes an error. This happens whenever you have rare categorical levels.
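The pass-through itself would be small. A hypothetical sketch of the shape it might take; this only mirrors the name data_utils.CategoricalFeatures._one_hot and is not the toolkit's actual code:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


class CategoricalFeatures:
    """Hypothetical sketch, not the toolkit's implementation."""

    def __init__(self, df, cat_cols, handle_unknown="error"):
        self.df = df
        self.cat_cols = cat_cols
        # Forwarded to sklearn instead of hard-coding the default.
        self.handle_unknown = handle_unknown

    def _one_hot(self):
        self.ohe = OneHotEncoder(handle_unknown=self.handle_unknown)
        return self.ohe.fit_transform(self.df[self.cat_cols])
```

With handle_unknown="ignore", fitting on the train frame and later calling self.ohe.transform(...) on a test frame containing a rare, unseen level would yield an all-zero encoding instead of a ValueError.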
Another option is to use load_data() on the entire modeling dataset and only split it into train/val/test afterwards, but this is not straightforward, at least if you have pre-existing indices for the train/val/test sets (given that you are trying to split a PyTorch Dataset).