From 63cbe7d28986efb0b0a000e29bd280301b234737 Mon Sep 17 00:00:00 2001 From: AntObi Date: Fri, 12 Jul 2024 14:38:56 +0100 Subject: [PATCH] Update README.md --- README.md | 90 +++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 71 insertions(+), 19 deletions(-) diff --git a/README.md b/README.md index 13683ae..afabfe6 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,18 @@ SkipSpecies is a modification to SkipAtom which is an approach for creating distributed representations of atoms, for use in Machine Learning contexts. It is based on the Skip-gram model used widely in Natural Language Processing. This repository contains the code for training and using SkipAtom/SkipSpecies embeddings, as well as the code for training and evaluating neural network models for materials properties prediction. +This repository supports the findings in our publication "Ionic species representations for materials informatics". + +Users can: +* Train their own SkipSpecies vectors using structures from the Materials Project +* Evaluate their atom and/or species-based embeddings on a property dataset obtained from the Materials Project. + ## This fork :warning: For the original SkipAtom repository, see https://github.com/lantunes/skipatom :warning: Main of the original functions have been kept, but the main changes are: -- The ability to use species instead of atoms +- The ability to use ionic species in addition to atoms - The ability to use oxidation states in the representations - The ability to use the oxidation states in the training of the embeddings - The ability to use the oxidation states in the induction of the embeddings @@ -17,14 +23,16 @@ Data from the parent repository has been removed on this fork, and the data used ## Installation -SkipSpecies can be installed with: -``` -pip install skipspecies +SkipSpecies can be installed by cloning this repository with: +```bash +git clone https://github.com/AntObi/skipspecies.git +cd skipspecies +pip install . ``` However, this will install a minimal implementation that can be used to work with existing SkipAtom/SkipSpecies embeddings only. To train new embeddings, SkipSpecies should be installed with: ``` -pip install skipspecies[training] +pip install .[training] ``` Pre-trained 30-,86-,100-, 200-, 300-, and 400-dimensional SkipAtom and SkipSpecies vectors for 86 atom types/336 species types are available in the `data` directory. @@ -32,27 +40,71 @@ Pre-trained 30-,86-,100-, 200-, 300-, and 400-dimensional SkipAtom and SkipSpeci To use the pre-trained vectors, follow the example in step 4, below. To create SkipAtom/SkipSpecies vectors, follow steps 1 to 3 below. A dataset of inorganic crystal structures is required. A dataset -of 110,160 oxidation-state decorated structures obtained from the [Materials Project](https://materialsproject.org/) is available in -`data/mp_2020_10_09.pkl.gz`. From this dataset, pairs of co-occurring atoms will be derived, as depicted in the -schematic below: +of structures can be obtained from the [Materials Project](https://materialsproject.org/. From this dataset, pairs of co-occurring atoms will be derived. - - -These pairs will be used in the training of the SkipAtom/SkipSpecies vectors. Pairs that were previously derived from the -Materials Project dataset are available in the files: -* Atoms: `data/skipatom_mp2022/mp_2022_10_28.pairs.csv.gz` -* Species: `data/skipatom_mp2022/mp_2022_10_28.pairs_oxi.csv.gz` +These pairs will be used in the training of the SkipAtom/SkipSpecies vectors. _(NOTE: For the following steps 1 to 3, the programs `create_cooccurrence_pairs`, `create_skipatom_training_data` and -`create_skipatom_embeddings` are installed as console scripts when using `pip install skipspecies`, and will be usable if -SkipSpecies was installed with `pip install skipspecies[training]`.)_ +`create_skipatom_embeddings` are usable if +SkipSpecies was installed with `pip install .[training]`.)_ + +1. Download the structures from the Materials Project: + +```bash +python bin/download_mp_structs.py \ +--apikey APIKEY \ +--datapath data \ +--oxi_mp +``` + +_(NOTE: The API KEY must be obtained from the Materials Project site. As well, the `oxi_mp` flag use used to get an oxidation state decorated set of structures)_ + + +2. Create the co-occurrence_pairs + +```bash +$ create_cooccurrence_pairs \ +--data data/mp_2022_10_28_oxi.pkl.gz \ +--out data/mp_2020_10_28.pairs_oxi.csv.gz \ +--processes 70 --workers 200 -z +``` +_(NOTE: The MP Dataset created in step 1 is automatically versioned. The data path in step 2 should be updated in accordance with the version of the MP dataset that has been download)_ +3. Prepare the data for training: + +```bash +create_skipatom_training_data \ +--data data/mp_2022_10_28.pairs_oxi.csv.gz \ +--out data/mp_2022_10_28.training_oxi.data +``` + +4. Create the SkipSpecies embeddings: + +```bash +create_skipatom_embeddings \ +--data data/mp_2022_10_28.training_oxi.data \ +--out data/skipspecies_2022_10_28/mp_2022_10_28.dim200.species.model \ +--dim 200 --step 0.01 --epochs 10 --batch 1024 +``` + +5. Load and use the model + +```python +from skipatom import SkipSpeciesInducedModel + +model = SkipSpeciesInducedModel.load( + "data/kipspecies_2022_10_28/mp_2022_10_28.dim200.model", + "data/mp_2022_10_28.training.data", + min_count=2e7, top_n=5) + +# species vector for O2- +print(model.vectors[model.dictionary["O2-"]]) +``` ### Neural Network Models The `skipatom` module contains Keras-based implementations of an ElemNet-type neural network (for both regression and classification). To use these, it is -necessary to have `tensorflow` in the environment. (Have a look at either the `requirements.txt` file or the -`environment.yml` file for a full list of dependencies.) The neural networks are implemented in the `ElemNet`, +necessary to have `tensorflow` in the environment. (Have a look at either the `requirements.txt` file or for a full list of dependencies.) The neural networks are implemented in the `ElemNet`, `ElemNetClassifier` class. For more information regarding on the `ElemNet` model, see: @@ -67,7 +119,7 @@ To run the property prediction tasks described in the paper, follow the steps ou A Materials Project property dataset can be made for each of the tasks described in the paper using the files in the `data` directory: `data/oxi-mp_property_dataset_unique_formulas.json.gz`. An example of how to create a dataset for the band gap task using the induced, 200-dimensional, sum-pooled SkipSpecies vectors is shown below: -```python +```bash python bin/create_species_property_dataset.py \ --data data/oxi-mp_property_dataset_unique_formulas.json.gz \ --task band_gap \