Impute missing values #149
Conversation
…the mean of the column. Fixes a bug where they were all set to 0.
Only minor comments in the code review. I haven't been able to test it out on any data yet, but I should have a chance to before we merge it.
Regarding the test data, we can hard-code some of the column-name exceptions to floating-point equality, as I did for symmetry; I can take a look at this later in the week. A rough sketch of what that could look like is below.
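A possible shape for those exceptions, as a sketch only (the column names and helper are placeholders, not the existing test code):

```python
import pytest

# Columns whose values are compared with a floating-point tolerance instead of
# exact equality; the name below is a placeholder.
TOLERANT_COLUMNS = ["SomeFeaturizer|some_column"]

def compare_featurized(expected, actual):
    for col in expected.columns:
        if col in TOLERANT_COLUMNS:
            # Tolerant comparison; nan_ok=True keeps NaN == NaN for these columns.
            assert actual[col].tolist() == pytest.approx(
                expected[col].tolist(), rel=1e-6, nan_ok=True
            )
        else:
            # Exact comparison (pandas .equals treats aligned NaNs as equal).
            assert expected[col].equals(actual[col])
```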
modnet/hyper_opt/fit_genetic.py
Outdated
        self.act = "elu"
        self.loss = "mae"
@@ -33,25 +34,35 @@ def __init__(
            weights (Dict[str, float]): Optional (for joint learning only). The relative loss weights to apply for each target.
        """

        self.act = "elu"
        self.loss = "mae"
        self.n_neurons_first_layer = 32 * random.randint(1, 10)
I wonder if we should use a fixed/customisable seed here so that fit_genetic can be made somewhat deterministic?
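A minimal sketch of what that could look like, assuming the hyperparameter draws go through Python's random module as in the excerpt above (the seed argument and helper are hypothetical):

```python
import random

# Hypothetical: an optional seed so that the architecture draws
# (e.g. 32 * random.randint(1, 10)) are reproducible across runs.
def draw_architecture(seed=None):
    rng = random.Random(seed)  # isolated RNG, does not touch the global state
    return {
        "n_neurons_first_layer": 32 * rng.randint(1, 10),
        "act": "elu",
        "loss": "mae",
    }

# Same seed -> same draw
assert draw_architecture(seed=0) == draw_architecture(seed=0)
```

This would only pin the hyperparameter sampling; as the reply below notes, the model fits themselves remain stochastic unless the underlying framework's seeds are fixed as well.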
Would this make it deterministic, though? Since the models will in any case give slightly different results each time they are fitted, they might still differ from iteration to iteration.
LGTM!
Looks good to me.
@@ -646,7 +660,7 @@ def run(
         else:
             ensemble = []
-            for m in models[ranking[:10]]:
+            for m in models[ranking[:refit]]:
Why? Sure, 10 is arbitrary... but now refit=0 throws an error, unfortunately...
Ah, oops. Shouldn't it just be the first model, though?
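For context, a hedged sketch of why refit=0 breaks and one possible guard (the names follow the diff above; this is not the code as merged):

```python
import numpy as np

# Dummy stand-ins so the slicing behaviour can be shown in isolation.
models = np.array(["model_a", "model_b", "model_c"])
ranking = np.array([2, 0, 1])

refit = 0
print(models[ranking[:refit]])   # -> [] : empty selection, hence the error downstream

# One possible guard: treat refit=0 as "keep only the best-ranked model".
n_keep = refit if refit > 0 else 1
ensemble = list(models[ranking[:n_keep]])
print(ensemble)                  # -> ['model_c']
```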
Some features can be NaN after featurization by matminer or by the user (through df_featurized). This should be handled by MODNet (and ideally by matminer as well, but that's another story). It is already handled in several places, but in the case of featurization through MODNet, infinite and NaN features are replaced by zeros. As a result, the various np.nan_to_num(x) calls later in the code are not useful in most cases, and the adopted strategy of setting NaNs to -1 after scaling x between -0.5 and 0.5 is never used.
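An illustration of the two behaviours being contrasted here (not the exact MODNet code):

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, np.inf, np.nan])

# Current behaviour, roughly: NaN becomes 0 and inf becomes a huge finite
# number, so downstream code can no longer tell which values were missing.
print(np.nan_to_num(x.values))                        # [1.0e+000 1.8e+308 0.0e+000]

# Behaviour proposed in this PR: keep the holes explicit as NaN so an imputer
# can handle them later.
print(x.replace([np.inf, -np.inf], np.nan).values)    # [ 1. nan nan]
```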
In this PR, I try to address these issues. After featurization, NaNs are no longer replaced by 0 (this breaks the tests, as some features are now slightly different), and infinite values are replaced by NaNs. The NaNs are then handled when fitting the model with a SimpleImputer whose strategy can be chosen; the imputer is stored as an attribute of the model and re-used when predicting new values. The scaler can also be chosen (StandardScaler or MinMaxScaler), and the user can choose to impute first and then scale, or to scale first and then impute. Both orders can be argued for: do we want to keep the same distribution as the initial feature, or change it by moving the NaNs outside the distribution?
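A sketch of the fit/predict flow described above, written directly against scikit-learn; the function, argument, and attribute names are illustrative, not the exact MODNet API:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def fit_preprocessing(x_train, impute_strategy="mean", scaler="standard", impute_first=True):
    """Fit the imputer/scaler on training data and return them for reuse."""
    imputer = SimpleImputer(missing_values=np.nan, strategy=impute_strategy)
    scaler_obj = StandardScaler() if scaler == "standard" else MinMaxScaler()
    steps = [imputer, scaler_obj] if impute_first else [scaler_obj, imputer]
    for step in steps:
        x_train = step.fit_transform(x_train)
    # The fitted objects would be stored on the model (e.g. as attributes)
    # so the exact same transform can be replayed at predict time.
    return x_train, steps

def apply_preprocessing(x_new, steps):
    """Replay the fitted steps, in the same order, on new data."""
    for step in steps:
        x_new = step.transform(x_new)
    return x_new

# Example: scale first (the scalers ignore NaNs when fitting), then impute.
x = np.array([[1.0, np.nan], [2.0, 4.0], [3.0, 6.0]])
x_t, steps = fit_preprocessing(x, scaler="minmax", impute_first=False)
x_new = apply_preprocessing(np.array([[np.nan, 5.0]]), steps)
```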
One thing still to test is whether saving and loading models works with the imputer and scaler (I just discussed this with PP; it needs to be checked).
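Since scikit-learn imputers and scalers are picklable, a round-trip check along these lines would cover it (a sketch, not an existing MODNet test; how the model actually serialises its attributes still needs to be verified):

```python
import pickle

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean").fit([[1.0, np.nan], [3.0, 4.0]])
restored = pickle.loads(pickle.dumps(imputer))

# The restored imputer must fill missing values exactly like the original.
assert np.allclose(
    imputer.transform([[np.nan, np.nan]]),
    restored.transform([[np.nan, np.nan]]),
)
```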