Tools to train and test a voting classifier that predicts oxidation states (of MOFs), for example to replicate our work [1]. If you are only interested in using a pre-trained model, see the oximachinerunner package.
⚠️ Warning: You need to export COMET_API_KEY, as the code will look for it if you want to track your experiments (when you retrain the model). If you do not want to track experiments, remove those lines from the code. You might also want to consider other tracking options such as Weights & Biases.
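To make the warning above concrete, here is a minimal sketch of a pre-flight check for the environment variable. The helper name `comet_api_key_available` is hypothetical and not part of this package; only the variable name COMET_API_KEY comes from the text above.

```python
import os


def comet_api_key_available(environ=os.environ):
    """Return True if a non-empty COMET_API_KEY is present.

    `environ` defaults to the process environment; passing a plain dict
    makes the check easy to test. This helper is illustrative only.
    """
    return bool(environ.get("COMET_API_KEY", "").strip())
```

A training wrapper could call this before starting a run and fall back to train_calibrate_voting_classifier_no_track.py when it returns False.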
To install the software with all dependencies, you can use
pip install git+https://github.com/kjappelbaum/learn_mof_ox_state.git
The installation should take a few seconds.
Note that the models have been fitted using scikit-learn==0.21.3, and one should therefore ideally use this version. For better compatibility with the other dependencies (matminer, apricot) that depend on newer versions of scikit-learn, we patched the model by adding the _strategy attribute to the initialization DummyClassifier of the GradientBoostingClassifier and adding the n_samples_fit_ attribute to the KNeighborsClassifier. If you plan to do further development, it might be advisable to bump all dependencies before training a new model.
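The patch described above could look roughly like the following sketch. It is not the code we used. It assumes that the fitted GradientBoostingClassifier exposes its initial estimator as `init_`, that the fitted KNeighborsClassifier keeps its training data in `_fit_X`, and that the missing attributes should take the values newer scikit-learn versions would set during `fit`. The function name is hypothetical, and the demo uses stand-in objects so that it runs without scikit-learn installed.

```python
from types import SimpleNamespace


def patch_for_newer_sklearn(gradient_boosting=None, knn=None):
    """Add attributes that newer scikit-learn versions expect on old fitted models.

    Assumption: `_strategy` mirrors `strategy`, and `n_samples_fit_` is the
    number of training samples. Attributes that already exist are left alone.
    """
    if gradient_boosting is not None:
        dummy = gradient_boosting.init_  # assumed: the fitted initial DummyClassifier
        if not hasattr(dummy, "_strategy"):
            dummy._strategy = dummy.strategy
    if knn is not None and not hasattr(knn, "n_samples_fit_"):
        knn.n_samples_fit_ = knn._fit_X.shape[0]  # assumed: stored training matrix
    return gradient_boosting, knn


# Stand-ins that only mimic the attributes used above (not real estimators).
fake_gb = SimpleNamespace(init_=SimpleNamespace(strategy="prior"))
fake_knn = SimpleNamespace(_fit_X=SimpleNamespace(shape=(5, 3)))
patch_for_newer_sklearn(fake_gb, fake_knn)
```

With real estimators you would pass the unpickled sub-models instead of the stand-ins and then re-save the patched ensemble.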
- The functions in this package require inputs (features and labels) that can be generated with our oximachine_featurizer Python package. The full datasets that can be used to train a model, as well as a pre-trained model, are deposited on the MaterialsCloud Archive (doi: 10.24435/materialscloud:2019.0085/v1). The analysis command-line interfaces can be used to reproduce our findings, based on the data deposited in the MaterialsCloud Archive. The training CLI can, for example, be used as
python machine_learn_oxstates/learnmofox/train_ensemble_classifier.py {featurespath} {labelspath} {modelpath} {metricsoutpath} standard soft isotonic 40000 20 none --train_one_fold
- Some experiments we ran, together with code and datahash, can also be found at comet.ml
- For testing a pre-trained model, we recommend using our webapp, for which the code can be found, along with the Docker images, in another GitHub repository. There is also a small Python package, oximachinerunner, that allows you to run inference on crystal structures.
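If you want to script the training CLI call shown in the list above, for example to loop over several feature sets, a thin wrapper like the following may help. This is a sketch: the function names are hypothetical, and the positional arguments after the four paths are copied verbatim from the example command without interpreting them.

```python
import subprocess
import sys

# Script path as given in the example command above.
SCRIPT = "machine_learn_oxstates/learnmofox/train_ensemble_classifier.py"


def build_train_command(features_path, labels_path, model_path, metrics_out_path):
    """Assemble the argument list for the training CLI (no shell involved)."""
    return [
        sys.executable,
        SCRIPT,
        str(features_path),
        str(labels_path),
        str(model_path),
        str(metrics_out_path),
        # Copied unchanged from the example invocation.
        "standard", "soft", "isotonic", "40000", "20", "none",
        "--train_one_fold",
    ]


def run_training(features_path, labels_path, model_path, metrics_out_path):
    """Run the training and raise CalledProcessError if the script fails."""
    command = build_train_command(features_path, labels_path, model_path, metrics_out_path)
    return subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.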
The training can, depending on the training set size, take hours.
- train_calibrate_voting_classifier_no_track.py: to run the training without comet.ml
- train_calibrate_voting_classifier.py: train a voting classifier (with optimized hyperparameters and track the experiments with comet.ml)
- train_ensemble_classifier.py: run the hyperparameter optimization for the ensemble of models
- utils.py: contains the custom voting classifier class and some utils
The runtime for the tests depends on whether they require retraining the model (permutation significance), which can take several hours, or whether they only involve evaluating the model for some data points, which will take minutes.
- feature_importance_cli.py: command-line tools to calculate feature importance with permutation or SHAP
- farm_learning_curves.py: command-line tool to run learning curves
- bias_variance_cli.py: run a bias-variance decomposition analysis with mlxtend
- permutation_significance.py: tool to run a permutation significance test (permute labels and measure metrics to see if the model learned something meaningful)
- run_combinatorial_study.py: train models on different feature subsets
- metrics.py: contains helper functions to calculate metrics
- bootstrapped_metrics.py: functions to calculate a bootstrapped learning curve point
- test_model.py: command-line tool to run some basic tests
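The idea behind the permutation significance test mentioned above can be illustrated with a small, dependency-free sketch. It is not the implementation in permutation_significance.py. The callable `fit_and_score` is a hypothetical stand-in for "retrain the model and return a metric", and the toy nearest-mean classifier only exists to make the example runnable.

```python
import random


def permutation_p_value(fit_and_score, features, labels, n_permutations=100, seed=0):
    """Compare the score on the true labels with scores on shuffled labels.

    Returns the true score and the fraction of runs (including the
    unpermuted one) that scored at least as well as the true labels.
    """
    rng = random.Random(seed)
    true_score = fit_and_score(features, labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # breaks the link between features and labels
        if fit_and_score(features, shuffled) >= true_score:
            at_least_as_good += 1
    # The +1 terms count the unpermuted run, so the p-value is never exactly 0.
    return true_score, (at_least_as_good + 1) / (n_permutations + 1)


def toy_fit_and_score(features, labels):
    """Toy model: nearest class mean, fitted on even indices, scored on odd ones."""
    pairs = list(zip(features, labels))
    train, test = pairs[0::2], pairs[1::2]
    means = {}
    for cls in sorted({y for _, y in train}):
        values = [x for x, y in train if y == cls]
        means[cls] = sum(values) / len(values)

    def predict(x):
        return min(means, key=lambda cls: abs(x - means[cls]))

    return sum(1 for x, y in test if predict(x) == y) / len(test)


# Two well-separated one-dimensional classes.
toy_features = list(range(10)) + list(range(20, 30))
toy_labels = [0] * 10 + [1] * 10
toy_score, toy_p = permutation_p_value(toy_fit_and_score, toy_features, toy_labels)
```

A small p-value indicates that the score on the true labels is unlikely to arise by chance. Because every permutation requires a full retraining, this kind of test is the expensive case mentioned above.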
The use of the main functions of this package is shown in the Jupyter Notebook in the example directory. It contains some example structures and the output, which should be produced in seconds.
[1] Jablonka, Kevin Maik; Ongari, Daniele; Moosavi, Seyed Mohamad; Smit, Berend (2020): Using Collective Knowledge to Assign Oxidation States. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.11604129.v1