Track which DataFrame Column corresponds to which Array Column(s) after Transform #13
Last time I looked at sklearn (a few minor versions ago) there was no common way for transformations to indicate which columns corresponded to which names. See #7 for some discussion. A separate implementation for each sklearn transformer would work, but I'm not keen on having a bunch of special cases for each sklearn transformer. That said, I'd accept a patch for that as long as it didn't break anything else.
I think it is possible to track at least which columns of the final matrix correspond to each variable. Since the results of the transformation of each set of columns are hstacked in the end, we could keep track of the columns that resulted from the transformation of each variable in a `feature_indices_` attribute on the mapper after transformation. The meaning could be exactly the same as for `OneHotEncoder`'s `feature_indices_`.
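A minimal sketch of the idea above (the function name and return shape are made up for illustration, not part of any existing API): each transformer's output block contributes a contiguous span of columns to the hstacked result, so recording the cumulative column offsets gives a per-variable index range, with the same meaning as `OneHotEncoder`'s `feature_indices_`.

```python
import numpy as np

def fit_transform_with_indices(columns, transformed_blocks):
    """Hypothetical helper: hstack the per-variable transformed blocks and
    record which output columns came from which input variable.

    columns: list of input variable names
    transformed_blocks: list of 2-D arrays, one per variable, in the same order
    """
    feature_indices = [0]
    for block in transformed_blocks:
        feature_indices.append(feature_indices[-1] + block.shape[1])
    X = np.hstack(transformed_blocks)
    # Output columns feature_indices[i]:feature_indices[i+1] belong to columns[i].
    spans = dict(zip(columns, zip(feature_indices[:-1], feature_indices[1:])))
    return X, spans
```

For example, a passthrough numeric column (one output column) followed by a binarized categorical with three levels (three output columns) would yield `{'num': (0, 1), 'cat': (1, 4)}`.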
@sveitser could you implement that?
I know there is work being done on this issue, so I just wanted to vote for it. Here is an example of boilerplate that could be used for many classification jobs driven from a database table. It would be very handy to be able to track back which features (columns) made it into the final set.
Via `clf_pipeline.named_steps['feature_selection'].get_support()` we can see what was selected via `SelectKBest`. But as I understand it, there is no way to track that back to the original X using the DataFrameMapper data.
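To illustrate the gap described above: `get_support()` returns a boolean mask over the transformed columns, so if you did have the transformed feature names in hstack order, mapping the mask back would be trivial. The feature-name list below is invented for the sketch; the hard part this issue asks for is producing that list from the mapper automatically.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical transformed-column names in hstack order (made up for the example).
feature_names = ['age', 'salary', 'pet=cat', 'pet=dog', 'pet=fish']

# Synthetic data just to run the selector.
X = np.random.RandomState(0).rand(20, len(feature_names))
y = np.random.RandomState(1).randint(0, 2, size=20)

selector = SelectKBest(f_classif, k=2).fit(X, y)
# get_support() gives one boolean per transformed column; zip with the names.
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print(selected)  # the two names that survived feature selection
```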
Good news! This functionality was addressed in 2fc6286 |
I think it would be useful for feature selection if it was possible to keep track of which DataFrame columns were mapped to which array columns during the transformation, so that one could use, for instance, the `feature_importances_` of ensemble methods in sklearn. Is there a straightforward way to do this right now? I looked into it a bit but didn't find a common way to get the necessary information during fitting of the sklearn transforms. Therefore the best way I can currently think of is to do the inspection separately for each sklearn transform, i.e. use `self.feature_names_` for `DictVectorizer`, `self.classes_` for `LabelBinarizer`, etc. I'm thinking there must be a better way to do this.
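A short sketch of the per-transformer inspection mentioned above: each transformer exposes its output-column metadata under a different attribute name, which is exactly why a single common mechanism would be nicer.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelBinarizer

# DictVectorizer records its output columns in feature_names_.
dv = DictVectorizer(sparse=False).fit([{'city': 'NY'}, {'city': 'SF'}])
print(dv.feature_names_)  # ['city=NY', 'city=SF']

# LabelBinarizer records its output columns in classes_.
lb = LabelBinarizer().fit(['cat', 'dog', 'fish'])
print(list(lb.classes_))  # ['cat', 'dog', 'fish']
```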