Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
AntObi committed Jul 12, 2024
1 parent cfc1f32 commit 63cbe7d
Showing 1 changed file with 71 additions and 19 deletions.
90 changes: 71 additions & 19 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,18 @@
SkipSpecies is a modification to SkipAtom which is an approach for creating distributed representations of atoms, for use in Machine Learning contexts. It is
based on the Skip-gram model used widely in Natural Language Processing. This repository contains the code for training and using SkipAtom/SkipSpecies embeddings, as well as the code for training and evaluating neural network models for materials properties prediction.

This repository supports the findings in our publication "Ionic species representations for materials informatics".

Users can:
* Train their own SkipSpecies vectors using structures from the Materials Project
* Evaluate their atom and/or species-based embeddings on a property dataset obtained from the Materials Project.

## This fork

:warning: For the original SkipAtom repository, see https://github.com/lantunes/skipatom :warning:

Main of the original functions have been kept, but the main changes are:
- The ability to use species instead of atoms
- The ability to use ionic species in addition to atoms
- The ability to use oxidation states in the representations
- The ability to use the oxidation states in the training of the embeddings
- The ability to use the oxidation states in the induction of the embeddings
Expand All @@ -17,42 +23,88 @@ Data from the parent repository has been removed on this fork, and the data used

## Installation

SkipSpecies can be installed with:
```
pip install skipspecies
SkipSpecies can be installed by cloning this repository with:
```bash
git clone https://github.com/AntObi/skipspecies.git
cd skipspecies
pip install .
```
However, this will install a minimal implementation that can be used to work with existing SkipAtom/SkipSpecies embeddings only. To
train new embeddings, SkipSpecies should be installed with:
```
pip install skipspecies[training]
pip install .[training]
```

Pre-trained 30-,86-,100-, 200-, 300-, and 400-dimensional SkipAtom and SkipSpecies vectors for 86 atom types/336 species types are available in the `data` directory.

To use the pre-trained vectors, follow the example in step 4, below.

To create SkipAtom/SkipSpecies vectors, follow steps 1 to 3 below. A dataset of inorganic crystal structures is required. A dataset
of 110,160 oxidation-state decorated structures obtained from the [Materials Project](https://materialsproject.org/) is available in
`data/mp_2020_10_09.pkl.gz`. From this dataset, pairs of co-occurring atoms will be derived, as depicted in the
schematic below:
of structures can be obtained from the [Materials Project](https://materialsproject.org/. From this dataset, pairs of co-occurring atoms will be derived.

<img src="resources/schematic.png" width="85%"/>

These pairs will be used in the training of the SkipAtom/SkipSpecies vectors. Pairs that were previously derived from the
Materials Project dataset are available in the files:
* Atoms: `data/skipatom_mp2022/mp_2022_10_28.pairs.csv.gz`
* Species: `data/skipatom_mp2022/mp_2022_10_28.pairs_oxi.csv.gz`
These pairs will be used in the training of the SkipAtom/SkipSpecies vectors.

_(NOTE: For the following steps 1 to 3, the programs `create_cooccurrence_pairs`, `create_skipatom_training_data` and
`create_skipatom_embeddings` are installed as console scripts when using `pip install skipspecies`, and will be usable if
SkipSpecies was installed with `pip install skipspecies[training]`.)_
`create_skipatom_embeddings` are usable if
SkipSpecies was installed with `pip install .[training]`.)_

1. Download the structures from the Materials Project:

```bash
python bin/download_mp_structs.py \
--apikey APIKEY \
--datapath data \
--oxi_mp
```

_(NOTE: The API KEY must be obtained from the Materials Project site. As well, the `oxi_mp` flag use used to get an oxidation state decorated set of structures)_


2. Create the co-occurrence_pairs

```bash
$ create_cooccurrence_pairs \
--data data/mp_2022_10_28_oxi.pkl.gz \
--out data/mp_2020_10_28.pairs_oxi.csv.gz \
--processes 70 --workers 200 -z
```
_(NOTE: The MP Dataset created in step 1 is automatically versioned. The data path in step 2 should be updated in accordance with the version of the MP dataset that has been download)_

3. Prepare the data for training:

```bash
create_skipatom_training_data \
--data data/mp_2022_10_28.pairs_oxi.csv.gz \
--out data/mp_2022_10_28.training_oxi.data
```

4. Create the SkipSpecies embeddings:

```bash
create_skipatom_embeddings \
--data data/mp_2022_10_28.training_oxi.data \
--out data/skipspecies_2022_10_28/mp_2022_10_28.dim200.species.model \
--dim 200 --step 0.01 --epochs 10 --batch 1024
```

5. Load and use the model

```python
from skipatom import SkipSpeciesInducedModel

model = SkipSpeciesInducedModel.load(
"data/kipspecies_2022_10_28/mp_2022_10_28.dim200.model",
"data/mp_2022_10_28.training.data",
min_count=2e7, top_n=5)

# species vector for O2-
print(model.vectors[model.dictionary["O2-"]])
```
### Neural Network Models

The `skipatom` module contains Keras-based implementations of an ElemNet-type neural network (for both
regression and classification). To use these, it is
necessary to have `tensorflow` in the environment. (Have a look at either the `requirements.txt` file or the
`environment.yml` file for a full list of dependencies.) The neural networks are implemented in the `ElemNet`,
necessary to have `tensorflow` in the environment. (Have a look at either the `requirements.txt` file or for a full list of dependencies.) The neural networks are implemented in the `ElemNet`,
`ElemNetClassifier` class.

For more information regarding on the `ElemNet` model, see:
Expand All @@ -67,7 +119,7 @@ To run the property prediction tasks described in the paper, follow the steps ou

A Materials Project property dataset can be made for each of the tasks described in the paper using the files in the `data` directory: `data/oxi-mp_property_dataset_unique_formulas.json.gz`. An example of how to create a dataset for the band gap task using the induced, 200-dimensional, sum-pooled SkipSpecies vectors is shown below:

```python
```bash
python bin/create_species_property_dataset.py \
--data data/oxi-mp_property_dataset_unique_formulas.json.gz \
--task band_gap \
Expand Down

0 comments on commit 63cbe7d

Please sign in to comment.