Impute missing values #149
Conversation
…the mean of the column. Fixes a bug where they were all set to 0.
Only minor comments in the code review. I haven't been able to test it out on any data yet, but I should have a chance to before we merge it.
Regarding the test data, we can hard-code some of the column-name exceptions to floating-point equality, as I did for symmetry; I can take a look at this later in the week. A rough sketch of what that could look like is below.
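A possible shape for those exceptions, as a sketch only (the column names and helper are placeholders, not the existing test code):

```python
import pytest

# Columns whose values are compared with a floating-point tolerance instead of
# exact equality; the name below is a placeholder.
TOLERANT_COLUMNS = ["SomeFeaturizer|some_column"]

def compare_featurized(expected, actual):
    for col in expected.columns:
        if col in TOLERANT_COLUMNS:
            # Tolerant comparison; nan_ok=True keeps NaN == NaN for these columns.
            assert actual[col].tolist() == pytest.approx(
                expected[col].tolist(), rel=1e-6, nan_ok=True
            )
        else:
            # Exact comparison (pandas .equals treats aligned NaNs as equal).
            assert expected[col].equals(actual[col])
```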
modnet/hyper_opt/fit_genetic.py
Outdated
        self.act = "elu"
        self.loss = "mae"
@@ -33,25 +34,35 @@ def __init__(
            weights (Dict[str, float]): Optional (for joint learning only). The relative loss weights to apply for each target.
        """

        self.act = "elu"
        self.loss = "mae"
        self.n_neurons_first_layer = 32 * random.randint(1, 10)
I wonder if we should use a fixed/customisable seed here so that fit_genetic can be made somewhat deterministic?
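A minimal sketch of what that could look like, assuming the hyperparameter draws go through Python's random module as in the excerpt above (the seed argument and helper are hypothetical):

```python
import random

# Hypothetical: an optional seed so that the architecture draws
# (e.g. 32 * random.randint(1, 10)) are reproducible across runs.
def draw_architecture(seed=None):
    rng = random.Random(seed)  # isolated RNG, does not touch the global state
    return {
        "n_neurons_first_layer": 32 * rng.randint(1, 10),
        "act": "elu",
        "loss": "mae",
    }

# Same seed -> same draw
assert draw_architecture(seed=0) == draw_architecture(seed=0)
```

This would only pin the hyperparameter sampling; as the reply below notes, the model fits themselves remain stochastic unless the underlying framework's seeds are fixed as well.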
Would this make it deterministic, though? Since the models will in any case give slightly different results each time they are fitted, they might still differ from iteration to iteration.
LGTM!
Looks good to me.
@@ -646,7 +660,7 @@ def run(
         else:
             ensemble = []
-            for m in models[ranking[:10]]:
+            for m in models[ranking[:refit]]:
Why? Sure, 10 is arbitrary... but now refit=0 throws an error, unfortunately...
Ah, oops. Shouldn't it just be the first model, though?
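For context, a hedged sketch of why refit=0 breaks and one possible guard (the names follow the diff above; this is not the code as merged):

```python
import numpy as np

# Dummy stand-ins so the slicing behaviour can be shown in isolation.
models = np.array(["model_a", "model_b", "model_c"])
ranking = np.array([2, 0, 1])

refit = 0
print(models[ranking[:refit]])   # -> [] : empty selection, hence the error downstream

# One possible guard: treat refit=0 as "keep only the best-ranked model".
n_keep = refit if refit > 0 else 1
ensemble = list(models[ranking[:n_keep]])
print(ensemble)                  # -> ['model_c']
```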
Some features can be NaN after featurization by matminer or by the user (through df_featurized). This should be handled by MODNet (and ideally by matminer as well, but that's another story). It is already handled in several places, but in the case of featurization through MODNet, infinite and NaN features are replaced by zeros. As a result, the various np.nan_to_num(x) calls later in the code are not useful in most cases, and the adopted strategy of setting NaNs to -1 after scaling x between -0.5 and 0.5 is never used.
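An illustration of the two behaviours being contrasted here (not the exact MODNet code):

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, np.inf, np.nan])

# Current behaviour, roughly: NaN becomes 0 and inf becomes a huge finite
# number, so downstream code can no longer tell which values were missing.
print(np.nan_to_num(x.values))                        # [1.0e+000 1.8e+308 0.0e+000]

# Behaviour proposed in this PR: keep the holes explicit as NaN so an imputer
# can handle them later.
print(x.replace([np.inf, -np.inf], np.nan).values)    # [ 1. nan nan]
```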
In this PR, I try to address these issues. After featurization, NaNs are no longer replaced by 0 (this breaks the tests, as some features are now slightly different), and infinite values are replaced by NaNs. The NaNs are then handled when fitting the model with a SimpleImputer whose strategy can be chosen; the imputer is stored as an attribute of the model and re-used when predicting new values. The scaler can also be chosen (StandardScaler or MinMaxScaler), and the user can choose to impute first and then scale, or to scale first and then impute. Both orders can be argued for: do we want to keep the same distribution as the initial feature, or change it by moving the NaNs outside the distribution?
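A sketch of the fit/predict flow described above, written directly against scikit-learn; the function, argument, and attribute names are illustrative, not the exact MODNet API:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def fit_preprocessing(x_train, impute_strategy="mean", scaler="standard", impute_first=True):
    """Fit the imputer/scaler on training data and return them for reuse."""
    imputer = SimpleImputer(missing_values=np.nan, strategy=impute_strategy)
    scaler_obj = StandardScaler() if scaler == "standard" else MinMaxScaler()
    steps = [imputer, scaler_obj] if impute_first else [scaler_obj, imputer]
    for step in steps:
        x_train = step.fit_transform(x_train)
    # The fitted objects would be stored on the model (e.g. as attributes)
    # so the exact same transform can be replayed at predict time.
    return x_train, steps

def apply_preprocessing(x_new, steps):
    """Replay the fitted steps, in the same order, on new data."""
    for step in steps:
        x_new = step.transform(x_new)
    return x_new

# Example: scale first (the scalers ignore NaNs when fitting), then impute.
x = np.array([[1.0, np.nan], [2.0, 4.0], [3.0, 6.0]])
x_t, steps = fit_preprocessing(x, scaler="minmax", impute_first=False)
x_new = apply_preprocessing(np.array([[np.nan, 5.0]]), steps)
```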
One thing still to test is whether saving and loading models works with the imputer and scaler (I just discussed this with PP; it needs to be checked).
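Since scikit-learn imputers and scalers are picklable, a round-trip check along these lines would cover it (a sketch, not an existing MODNet test; how the model actually serialises its attributes still needs to be verified):

```python
import pickle

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean").fit([[1.0, np.nan], [3.0, 4.0]])
restored = pickle.loads(pickle.dumps(imputer))

# The restored imputer must fill missing values exactly like the original.
assert np.allclose(
    imputer.transform([[np.nan, np.nan]]),
    restored.transform([[np.nan, np.nan]]),
)
```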