GridSearch - how to access internal parameters #61

hshteingart · 2016-07-31T20:51:28Z

How can I access the Mapper's internal parameters e.g. number of bits for a FeatureHasher?

dukebody · 2016-08-03T13:25:50Z

First of all, sorry for not getting back to you sooner.

What you ask for is not possible right now, but would be a very interesting addition. I will review your related PR to see if we can introduce this implementation change without breaking existing pipelines.

dukebody · 2016-08-04T16:31:29Z

While changing the whole DataFrameMapper to use FeatureUnion allows to set transformers internal parameters and has the benefit of being able to run in parallel, I wonder if we cannot do the same:

Implementing set_params.
Using joblib's Parallel in our own loops.

This is because I feel we might be losing a great deal of control if we delegate all the work to FeatureUnion; for example, implementing #13 would probably be much harder. But OTOH I don't know how important is this with respect to code simplicity and speed.

Thoughts? cc @calpaterson

dukebody · 2016-08-04T16:31:42Z

See https://github.com/paulgb/sklearn-pandas/tree/feature_union_pipe

chanansh · 2016-08-05T08:51:20Z

Hi,

This is my first contribution to a GitHub public project, so I am excited :-)

I find that FeatureUnion is superior because everything is in terms of Scikit-Learn native object which reduces friction. Adding the "output feature naming" is possible with an additional PostNameTransformer which will just name the columns according to the original column name. If you feel this breaks the DataeFrameMapper entirely, maybe my solution fits better the Scikit-Learn package.

What do you think?

DataTerminatorX · 2016-08-24T03:59:51Z

I agree with what @chanansh said. I think the function of output feature naming will be very useful for feature selection and feature union.

As far as I know, pandas.get_dummies() can automatically generate feature names by setting parameter prefix, also sklearn.feature_extraction.DictVectorizer() has the method of get_feature_names(). Maybe we can borrow idea from them?

However, feature naming may be very complex as there are too many ways to do feature engineering. It's hard to find a general rule for feature naming.

dukebody · 2016-08-25T14:50:35Z

@chanansh you're right that it is better to reuse existing proven and maintained code instead of replicating the same feature whenever possible. It's just that I'm afraid of breaking something that is working for current users. However if we make a 2.0 release this should not be an issue; people can always continue using the 1.x branch if needed.

Probably the output feature tracking can be implemented by subclassing FeatureUnion and storing the length of each output feature in an attribute. @DataTerminatorX I believe the issue here is to track which input features generated which output ones, since some transformers like OneHotEncoder or LabelBinarizer can expand features. The output names can simply be <input_feature_name>_0, <input_feature_name>_1, etc.

dukebody · 2016-08-27T15:39:48Z

Could you please review my PR and tell me what you think?

hshteingart · 2016-09-02T12:44:44Z

Sure, I will be happy to help doing the 2.0 version if you want to collaborate together.

From: Israel Saeta Pérez [mailto:notifications@github.com]
Sent: יום ה 25 אוגוסט 2016 17:51
To: paulgb/sklearn-pandas sklearn-pandas@noreply.github.com
Cc: Hanan Shteingart hshteingart@sageaxcess.com; Author author@noreply.github.com
Subject: Re: [paulgb/sklearn-pandas] GridSearch - how to access internal parameters (#61)

@chananshhttps://github.com/chanansh you're right that it is better to reuse existing proven and maintained code instead of replicating the same feature whenever possible. It's just that I'm afraid of breaking something that is working for current users. However if we make a 2.0 release this should not be an issue; people can always continue using the 1.x branch if needed.

Probably the output feature tracking can be implemented by subclassing FeatureUnion and storing the length of each output feature in an attribute. @DataTerminatorXhttps://github.com/DataTerminatorX I believe the issue here is to track which input features generated which output ones, since some transformers like OneHotEncoder or LabelBinarizer can expand features. The output names can simply be <input_feature_name>_0, <input_feature_name>_1, etc.

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHubhttps://github.com//issues/61#issuecomment-242416305, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AQvKxbx_oSnWQDFaouTYb-OFg3SL75guks5qjau-gaJpZM4JZHkS.

dukebody · 2016-09-03T06:41:01Z

Awesome! So can you review #64 please, @chanansh ?

dukebody · 2016-09-03T06:41:32Z

cc @hshteingart

chanansh · 2016-09-03T06:50:27Z

I feel that if we move to feature union we should actually PR scikit learn
so much larger community can enjoy it. What do you say?

On Sep 3, 2016 9:41 AM, "Israel Saeta Pérez" notifications@github.com
wrote:

Awesome! So can you review #64
#64 please, @chanansh
https://github.com/chanansh ?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#61 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AFtFzGwU085tbO-lK7dmrGkGHzhBYDdeks5qmRaAgaJpZM4JZHkS
.

dukebody · 2016-09-03T07:06:28Z

AFAIK sklearn itself has no pandas dependencies and I believe they want to keep it that way for now. But feel free to ask in their issue tracker.

chanansh · 2016-09-03T09:03:57Z

Asked here: scikit-learn/scikit-learn#7334

dukebody mentioned this issue Aug 27, 2016

DataFrameMapper using FeatureUnion #64

Open

chanansh mentioned this issue Sep 3, 2016

Adding Pandas Data Frame Feature Union scikit-learn/scikit-learn#7334

Closed

cstjean mentioned this issue Feb 8, 2017

GridSearch with pipelines of dataframes cstjean/ScikitLearn.jl#24

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GridSearch - how to access internal parameters #61

GridSearch - how to access internal parameters #61

hshteingart commented Jul 31, 2016

dukebody commented Aug 3, 2016

dukebody commented Aug 4, 2016

dukebody commented Aug 4, 2016

chanansh commented Aug 5, 2016

DataTerminatorX commented Aug 24, 2016

dukebody commented Aug 25, 2016

dukebody commented Aug 27, 2016

hshteingart commented Sep 2, 2016

dukebody commented Sep 3, 2016

dukebody commented Sep 3, 2016

chanansh commented Sep 3, 2016

dukebody commented Sep 3, 2016

chanansh commented Sep 3, 2016

GridSearch - how to access internal parameters #61

GridSearch - how to access internal parameters #61

Comments

hshteingart commented Jul 31, 2016

dukebody commented Aug 3, 2016

dukebody commented Aug 4, 2016

dukebody commented Aug 4, 2016

chanansh commented Aug 5, 2016

DataTerminatorX commented Aug 24, 2016

dukebody commented Aug 25, 2016

dukebody commented Aug 27, 2016

hshteingart commented Sep 2, 2016

dukebody commented Sep 3, 2016

dukebody commented Sep 3, 2016

chanansh commented Sep 3, 2016

dukebody commented Sep 3, 2016

chanansh commented Sep 3, 2016