
Impute missing values #149

Merged: 17 commits merged into ppdebreuck:master on May 31, 2023
Conversation

@gbrunin (Collaborator) commented on May 9, 2023:

Some features can be NaN after featurization by matminer or by the user (through df_featurized). This should be handled by MODNet (and by matminer as well, but that's another story). It is already handled in several places, but when featurization is done through MODNet, the infinite and NaN features are replaced by zeros. As a result, the various np.nan_to_num(x) calls later in the code are not useful in most cases, and the adopted strategy of setting NaNs to -1 after scaling x between -0.5 and 0.5 is never used.
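To illustrate the point (a toy numpy sketch, not MODNet code): once NaN and inf have been zeroed at featurization time, the later safety nets have nothing left to act on.

```python
import numpy as np

raw = np.array([0.7, np.nan, np.inf])

# Previous behaviour: NaN and inf are zeroed right after featurization.
featurized = np.where(np.isfinite(raw), raw, 0.0)   # -> [0.7, 0.0, 0.0]

# The later np.nan_to_num(x) calls are then no-ops (no NaN/inf remains), and the
# "scale to [-0.5, 0.5], then set NaN to -1" strategy is never reached, because
# missing entries are indistinguishable from genuine zeros.
print(np.nan_to_num(featurized))                    # [0.7, 0.0, 0.0], unchanged
```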

In this PR, I try to address these issues. After featurization, NaNs are no longer replaced by 0 (this breaks the tests, since some features come out slightly different), and infinite values are replaced by NaNs. The NaNs are then handled when fitting the model using a SimpleImputer, whose strategy can be chosen; the fitted imputer is stored as an attribute of the model and re-used when predicting new values. The scaler can also be chosen (StandardScaler or MinMaxScaler), and the user can choose to impute first and then scale, or to scale first and then impute. There are arguments for both orders: do we want to keep the same distribution as the initial feature, or change it by moving the NaNs outside the distribution?
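A rough sketch of that fit-time flow, assuming scikit-learn's SimpleImputer and scalers (the function name and signature below are illustrative, not the actual MODNet API):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def fit_preprocessing(x_train, impute_strategy="mean", scaler_type="standard", impute_first=True):
    """Fit an imputer and a scaler on the training features and return them,
    so they can be stored on the model and re-applied at predict time."""
    x = np.where(np.isfinite(x_train), x_train, np.nan)   # infinite values -> NaN
    imputer = SimpleImputer(strategy=impute_strategy)
    scaler = StandardScaler() if scaler_type == "standard" else MinMaxScaler()
    if impute_first:
        x = scaler.fit_transform(imputer.fit_transform(x))
    else:
        x = scaler.fit_transform(x)   # sklearn scalers ignore NaN when fitting
        x = imputer.fit_transform(x)
    return x, imputer, scaler
```

At predict time, the stored imputer and scaler would then be applied with transform() only, in the same order as during fitting.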

One thing to test is whether saving and loading models works with the imputer and scaler (I just discussed this with PP; it needs to be checked).

@ppdebreuck requested review from ppdebreuck and ml-evs on May 9, 2023, 15:47
@ml-evs (Collaborator) left a comment:

Only minor comments in the code review; I haven't been able to test it out on any data yet, but I should have a chance to before we merge it.

Regarding the test data, we can hard-code some of the column-name exceptions to floating-point equality, as I did for symmetry; I can take a look at this later in the week.
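Something along these lines could work for that comparison (the column name is a placeholder for the features known to drift, not the actual exception list):

```python
import numpy as np
import pandas as pd

# Hypothetical list of feature columns allowed to differ slightly between
# featurizer versions; everything else is compared exactly.
LOOSE_COLUMNS = ["GlobalSymmetryFeatures|spacegroup_num"]

def assert_featurization_matches(new: pd.DataFrame, reference: pd.DataFrame) -> None:
    strict = [c for c in reference.columns if c not in LOOSE_COLUMNS]
    pd.testing.assert_frame_equal(new[strict], reference[strict])
    for col in LOOSE_COLUMNS:
        assert np.allclose(new[col], reference[col], rtol=1e-3, equal_nan=True)
```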

Comment on lines 37 to 38
self.act = "elu"
self.loss = "mae"
@ml-evs (Collaborator):

I guess this is now conflicting with #148 (and maybe your own #151), though I see further down that we are still doing self.__dict__.update(model_params), so these can be overwritten.
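For reference, the override path mentioned here behaves roughly like this (a simplified stand-in class, not the actual one):

```python
class ModelDefaults:
    def __init__(self, **model_params):
        # Hard-coded defaults...
        self.act = "elu"
        self.loss = "mae"
        # ...which get overwritten by anything passed in explicitly,
        # so parameters set elsewhere still take precedence.
        self.__dict__.update(model_params)

m = ModelDefaults(loss="mse")
assert (m.act, m.loss) == ("elu", "mse")
```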

@@ -33,25 +34,35 @@ def __init__(
weights (Dict[str, float]): Optional (for joint learning only). The relative loss weights to apply for each target.
"""

self.act = "elu"
self.loss = "mae"
self.n_neurons_first_layer = 32 * random.randint(1, 10)
@ml-evs (Collaborator):

I wonder if we should use a fixed/customisable seed here, so that fit_genetic can be made somewhat deterministic?

@gbrunin (Collaborator, author):

Would this make it deterministic, though? Since the models lead to slightly different results each time they are fitted anyway, they might still differ from iteration to iteration.
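For context, a seed would at least pin down the sampled hyperparameters, even if the fitted weights still vary; a minimal sketch with a hypothetical seed argument:

```python
import random

def sample_architecture(seed=None):
    """Sample hyperparameters reproducibly when a seed is given
    (hypothetical helper; a local RNG avoids touching global state)."""
    rng = random.Random(seed)
    return {"n_neurons_first_layer": 32 * rng.randint(1, 10)}

assert sample_architecture(seed=42) == sample_architecture(seed=42)
```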

Resolved (outdated) review threads: modnet/hyper_opt/fit_genetic.py, modnet/models/vanilla.py
@ml-evs (Collaborator) left a comment:
LGTM!

looks good to me

@ml-evs merged commit 0afb9ef into ppdebreuck:master on May 31, 2023
@@ -646,7 +660,7 @@ def run(

     else:
         ensemble = []
-        for m in models[ranking[:10]]:
+        for m in models[ranking[:refit]]:
@ppdebreuck (Owner):

Why? Sure, 10 is arbitrary... But now refit=0 is throwing an error, unfortunately...

@gbrunin (Collaborator, author):

Ah, oops. Shouldn't it be just the first model though?
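To illustrate the edge case (toy values, not the actual fix): the slice is empty when refit is 0, so this branch would build an ensemble with no models; one option is to fall back to just the best-ranked model.

```python
import numpy as np

ranking = np.array([3, 0, 2, 1])   # toy ranking of model indices, best first
refit = 0

print(ranking[:refit])             # -> [] : nothing would be added to the ensemble

# Hypothetical guard: keep at least the single best model when refit == 0.
n_keep = max(refit, 1)
print(ranking[:n_keep])            # -> [3]
```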
