GridSearchCV extremely slow with DataFrameMapper? #11
Could you please try to provide a code snippet that generates random data and exhibits the same behavior? It would also be interesting to report the output of a profiler, for instance using the |
|
from %prun
Is this helpful? It seems that almost all time is spent in _get_col_subset |
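For anyone who wants to collect a comparable profile outside IPython, here is a minimal sketch using only the standard library. The functions `slow_column_lookup` and `run_search` are hypothetical stand-ins for the column selection and the grid search discussed in this thread, not code from sklearn-pandas; `profile_report` is likewise just an illustrative helper.

```python
import cProfile
import io
import pstats


def slow_column_lookup(rows, col):
    # Hypothetical stand-in for an expensive per-call column selection.
    return [row[col] for row in rows]


def run_search(rows, n_iter=50):
    # Hypothetical stand-in for a search loop that repeats the lookup many times.
    total = 0
    for _ in range(n_iter):
        total += sum(slow_column_lookup(rows, "x"))
    return total


def profile_report(func, *args, top=5):
    """Run func under cProfile; return its result and a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


rows = [{"x": i} for i in range(1000)]
result, report = profile_report(run_search, rows)
print(report)
```

Sorting by cumulative time is what makes a helper that is called repeatedly from inside a larger loop stand out near the top of the report.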
I'm seeing very similar behavior with |
I've been investigating this and the culprits seem to be these lines:
Apparently the I'm not sure what is the best way to deal with this. Replacing the previous two lines with:
and leaving the cols slicing to the later code in the same function seems to provide a good speedup (around 3x) but I still have to write tests to ensure it doesn't break anything. Perhaps we can get better speedups without the lists trick, but I don't know how to do that and at the same time avoid sklearn turning the dataframe into a numpy array. Ideas welcome! :) |
Hm, I was testing it with See #26 (comment) and https://github.com/scikit-learn/scikit-learn/blob/0.16.0/sklearn/cross_validation.py#L1350. |
Perhaps we should just write in the documentation that the custom cv-wrappers are only needed for |
I think documenting that is a good idea, but also maybe pass-through:
    from distutils.version import StrictVersion
    if StrictVersion(sklearn.__version__) > StrictVersion('0.16'):
        sklearn_pandas.GridSearchCV = sklearn.grid_search.GridSearchCV |
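The pass-through idea can also be sketched without `distutils`, which is no longer part of the standard library in recent Python versions. This is only an illustration under stated assumptions: `version_tuple` and `choose_grid_search` are hypothetical helpers, the classes are passed in as plain arguments, and treating 0.16 itself as new enough (`>=`) is an assumption about the intended cutoff.

```python
import re


def version_tuple(version, width=2):
    """Parse the leading numeric components, e.g. '0.16.1' -> (0, 16)."""
    numbers = []
    for piece in version.split(".")[:width]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    # Pad short versions so that '1' compares like '1.0'.
    numbers.extend([0] * (width - len(numbers)))
    return tuple(numbers)


def choose_grid_search(installed_version, upstream_cls, shim_cls, cutoff=(0, 16)):
    """Return the upstream class when the installed version reaches the cutoff.

    Assumption: versions at or above the cutoff handle DataFrames natively,
    so the custom shim is only needed below it.
    """
    if version_tuple(installed_version) >= cutoff:
        return upstream_cls
    return shim_cls


# Placeholder strings stand in for the real classes in this sketch.
print(choose_grid_search("0.15.2", "upstream GridSearchCV", "shim GridSearchCV"))
print(choose_grid_search("0.16.0", "upstream GridSearchCV", "shim GridSearchCV"))
```

Comparing integer tuples avoids string comparison pitfalls such as '0.9' sorting after '0.16'.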
I don't think it's worth uglying up the code that way. We can say that these wrappers are deprecated and will be eventually dropped in |
@zacstewart can you review #48 please? It's a really minor addition but I always like the four-eyes approach to changes. :-) |
Along those lines: unfortunately, CalibratedClassifierCV, introduced in sklearn 0.16, does not seem to work with DataFrameMappers in a pipeline (this is still the case in sklearn 0.17). |
@Balandat Could you provide an example with a traceback (or wrong result)? Thanks. |
Deprecate custom CV shims in documentation and code. Refs #11.
I have a dataframe, not particularly large (~3000 rows, 250 cols) on which I do the following:
From a quick glance, it seems to spend all its time indexing dataframe objects. The following 2 pieces of code are very fast:
So it must be something to do with using GridSearchCV with the DataFrameMapper. Any ideas?
More generally, is there a better way to handle categorical variables?