initial commit for paper
janosh committed Jun 20, 2023
1 parent 7a642d7 commit cd4b90f
Showing 6 changed files with 857 additions and 2 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -24,7 +24,7 @@ job-logs/
*slurm-*.log

# temporary ignore rules
paper
notes
models/voronoi/*.zip

# generated docs
3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
@@ -56,6 +56,7 @@ repos:
- id: codespell
stages: [commit, commit-msg]
exclude_types: [csv, json, svg]
exclude: ^(.+references.yaml)$

- repo: https://github.com/PyCQA/autoflake
rev: v2.0.0
@@ -71,7 +72,7 @@ repos:
- prettier
- prettier-plugin-svelte
- svelte
exclude: ^(site/static/.+\.svelte|data/wbm/20.+\..+)$
exclude: ^(site/static/.+\.svelte|data/wbm/20.+\..+|.+references.yaml)$

- repo: https://github.com/pre-commit/mirrors-eslint
rev: v8.31.0
221 changes: 221 additions & 0 deletions site/src/routes/paper/+page.svx
@@ -0,0 +1,221 @@
---
title: Matbench Discovery - Can ML energy models help find stable crystals?
tags:
- Python
- machine learning
- materials science
- materials discovery
- benchmark
- ensembles
- uncertainty estimation
authors:
- name: Janosh Riebesell
orcid: 0000-0001-5233-3462
corresponding: true
affiliation: 1, 2
- name: Rhys Goodall
orcid: 0000-0002-6589-1700
affiliation: 1
- name: Anubhav Jain
orcid: 0000-0001-5893-9967
affiliation: 2
- name: Kristin Persson
orcid: 0000-0003-2495-5509
affiliation: 2
- name: Emma King-Smith
orcid: 0000-0002-2999-0955
affiliation: 1
- name: Alpha Lee
orcid: 0000-0002-9616-3108
affiliation: 1
affiliations:
- Cavendish Laboratory, University of Cambridge, UK
- Lawrence Berkeley National Laboratory, Berkeley, USA
date: Jan 31, 2023
bibliography: [references.yaml]
# taken from https://zotero.org/styles?fields=physics&format=numeric
citation_style: https://zotero.org/styles/american-physics-society?source=1
# citation-style: https://zotero.org/styles/american-physics-society?source=1
geometry: margin=3cm # https://stackoverflow.com/a/13516042
# To create a PDF from this markdown file, run:
# ```sh
# cd paper
# pandoc path/to/this/file --output paper.pdf --citeproc
# ```
# Requires `brew install pandoc` on macOS. Not identical but similar output as the
# artifact generated by the JOSS GitHub action
# https://github.com/marketplace/actions/open-journals-pdf-generator
---

<script>
import MetricsTable from '$figs/2022-11-28-metrics-table.svelte'
import { references } from './references.yaml'
import './heading-number.css' // uncomment to remove heading numbers
</script>

# {title}

<address>
<span rel="author">
{@html authors
.map((author) => `${author.name}<sup>${author.affiliation}</sup>`)
.join(`, `)}
</span>
<span data-rel="affiliation">
{@html affiliations.map((affil, idx) => `${idx + 1}. ${affil}`).join(`<br/>`)}
</span>
<span style="font-weight: lighter;">{date}</span>
</address>

## Abstract

<!-- - we propose a new machine learning benchmark for materials stability predictions
- primary goal of Matbench Discovery is to answer the question of how useful ML energy models are at helping to accelerate inorganic crystal searching and what the optimal methodology is, specifically whether DFT emulators like M3GNet or one-shot predictors do better. -->

In this work, we present a new machine learning benchmark for materials stability predictions called **Matbench Discovery**. The primary goal of this benchmark is to evaluate the effectiveness of machine learning energy models in accelerating the search for inorganic crystals and to determine the optimal methodology for this task. Specifically, we aim to answer the question of whether density functional theory emulators like M3GNet or one-shot predictors like Wrenformer perform better in this setting. We hope our results provide valuable insights that motivate researchers in the field of materials discovery and builders of high throughput databases to start using these models as triaging steps to more effectively allocate compute for DFT relaxations.

## Introduction

### Why Materials Discovery?

Many technologies rely on specific material properties such as finely tuned band gaps, hardness, melting points, polarizability, and thermal and electrical conductivity. The materials we master define the technological abilities of our time, so much so that we name eras of human development after them (stone, bronze, iron, coal/oil, silicon age). Besides entirely new technologies like superconductors, spintronics and quantum computing, discovering new materials and new ways of tuning their properties enables civilizational advancements such as waste reduction, water purification and of course energy generation and storage. Global warming puts particular urgency on such advances for solar cell, battery, turbine and other energy materials.

### Why machine learning?

Decades of research have resulted in powerful tools across materials theory, simulation and experiment. The latest addition to the computational toolbox are statistical models that learn to infer material properties from existing knowledge of materials accumulated over decades. These models are powerful tools for high-throughput novel materials discovery due to their ability to handle high dimensionality, large datasets, multiple objectives, uncertainty and noise, and sparse data.

1. **Vast search space**: With ~100 elements and myriad compositional, structural, doping and defect degrees of freedom, the search space for novel materials is very high-dimensional. ML algorithms, being orders of magnitude faster than simulation, can explore larger chunks of this space and identify candidates that might have taken much longer to find using traditional methods.
1. **Untapped value in existing data**: ML can handle very large datasets. With recent advances in automated experimentation and simulation, the amount of data available for materials discovery has grown far beyond what any one expert can hope to absorb. It is, however, not too much for ML models, which have shown the ability to condense a large chunk of the internet's text data into billions of parameters [cite GPT2/3 and other LLMs] that, when probed correctly, can reveal patterns and relationships in the data that humans have missed.
1. **Multiple objectives**: Materials applications often require finding specific combinations of properties that can be anti-correlated, such as the high electrical conductivity and low thermal conductivity needed for thermoelectric devices. Through careful construction of their loss functions, ML algorithms can handle multiple objectives and constraints.
1. **Uncertainty and noise**: Experimental measurements carry some level of noise and uncertainty. They can be highly sensitive to synthesis and processing conditions and fraught with outliers, making some data irreproducible and potentially confusing to a statistical model tasked with learning a map from input to non-unique output. However, ML methods like learned-uncertainty-aware loss functions [cite heteroskedastic loss from what uncertainties do we need] (a minimal sketch follows this list), Gaussian processes and Bayesian neural networks exist and continue to be developed to incorporate uncertainty directly into the model. This leads to more robust and more risk-aware interpretation of ML predictions.
1. **Sparse data**: Experimental data is scarce or even non-existent in many regions of materials space. Tools like active learning and generative modeling can chart a path into such regions or generate synthetic data for them, even from only small amounts of experimental data.
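
To make the uncertainty point concrete, here is a minimal PyTorch sketch of a learned-uncertainty-aware (heteroscedastic) regression loss in the spirit of the loss cited above; the two-headed model and variable names are purely illustrative:

```python
import torch

def heteroscedastic_loss(pred_mean, pred_log_var, target):
    """Gaussian negative log-likelihood with a learned per-sample (aleatoric) variance.

    Samples the model deems noisy are down-weighted via exp(-log_var),
    at the cost of a log-variance penalty term.
    """
    precision = torch.exp(-pred_log_var)
    return (0.5 * precision * (target - pred_mean) ** 2 + 0.5 * pred_log_var).mean()

# toy usage: imagine a model with two output heads, one for the mean, one for the log-variance
pred_mean = torch.randn(8, requires_grad=True)
pred_log_var = torch.zeros(8, requires_grad=True)
target = torch.randn(8)

loss = heteroscedastic_loss(pred_mean, pred_log_var, target)
loss.backward()  # gradients flow into both heads
```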

If used right, we believe ML can improve the speed, hit rate and computational efficiency of materials discovery.

Over recent years, machine learning has matured into an essential and proven tool in the field of materials science. Being cheap and scalable but less accurate than ab-initio methods, ML models are best suited to high-throughput searches as a pre-filter for more expensive, higher-fidelity simulations. This makes them an obvious candidate for accelerating the search for novel materials across many unexplored regions of materials space. Yet despite this untapped potential, and despite the proven benefit of benchmarks in focusing and measuring progress in other areas of ML [cite ImageNet, QM9, OCP2020], to date we know of no standardized task that simulates prospective materials discovery. Hence we set out to build one. Our objectives for this task were as follows:

1. Measure model **robustness** and **reliability**: A model may perform worse on some material classes than others. Another model may be poorly calibrated in its stability predictions, i.e. overly optimistic for a constrained-resource search or overly pessimistic for an exhaustive search. A large benchmark spanning many different chemical spaces will reveal weaknesses in particular chemistries or biases towards (in-)stability in a model.
1. **Methodology contest**: By using the same benchmark across many models, it becomes possible to gain some degree of implementation-independent insight into the effectiveness of different approaches for predicting materials stability. One of the primary objectives for this benchmark is to help answer the question of whether DFT emulators like M3GNet or one-shot predictors like Wrenformer are fundamentally better suited for this task.
1. **Measure progress**: Without a common task that endures over time, we have no quantitative measure of how much improvement we've made over the years.
1. **Drive progress**: By making a curated easily deployable train and test set publicly available, researchers can spend less time on data retrieval, curation and provisioning and more on feature and architecture design.

Another reason we believe a benchmark like this is overdue: the last year has seen the release of several new energy models specifically designed to handle unrelaxed crystal structures. Examples include

- [BOWSR](https://sciencedirect.com/science/article/pii/S1369702121002984) @zuo_accelerating_2021
- [M3GNet](https://arxiv.org/abs/2202.02450) @chen_universal_2022
- [Wren](https://arxiv.org/abs/2106.11132) @goodall_rapid_2022.

Such models are well-suited for use in materials discovery workflows, where they pre-filter and/or pre-relax structures before feeding them into high-throughput DFT. These models have the potential to greatly accelerate the search for new materials by increasing the hit rate of stable materials. Reducing the total number of structures that need to be relaxed to discover a desired number of new materials ultimately reduces the cost of time and compute to realize a desired target property.
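
As an illustration of the pre-relaxation use case, the following sketch uses the published `m3gnet` package to relax a candidate structure before any DFT is run. The API is shown as of the version current at the time of writing; the result dict keys reflect our understanding of that release and may differ in others:

```python
from m3gnet.models import Relaxer
from pymatgen.core import Lattice, Structure

# an unrelaxed toy candidate: CsCl-type NaCl with an arbitrary lattice constant
structure = Structure(Lattice.cubic(4.5), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])

relaxer = Relaxer()  # loads the pre-trained universal M3GNet potential
result = relaxer.relax(structure)

relaxed_structure = result["final_structure"]  # ML-relaxed structure to hand to DFT
final_energy = float(result["trajectory"].energies[-1])  # predicted total energy (eV)
```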

We believe this to partly be the reason for a surge in interest in materials discovery from industry, including big-name companies without prior background in materials science. Many organizations appear to realize that recent advances in machine learning for materials science have matured enough to deliver real utility. And that this value may be more effectively extracted by organizations with prior expertise in big data and access to vast computing resources, than by academia itself.

Yet despite (or perhaps because of) the many recent advances in machine learning for materials discovery, it is still unclear which ML methodology is most effective for predicting material stability. Options include one-shot predictors like Wren, force predictors like M3GNet that can emulate density functional theory relaxation, Bayesian optimizers like BOWSR, and future, as yet undiscovered algorithms. Additionally, even if a clear winning methodology could be identified, practitioners seeking to use machine learning for triaging in a discovery campaign need to know which actual implementation is most effective, is ready for production use and works with their data. Ideally, the question of which materials stability prediction algorithm offers the most bang for the buck should be answered before industrial players start committing large amounts of resources to expanding materials databases. In this work, we aim to address these questions in a future-proof benchmark that closely simulates applying an ML algorithm in a real-world materials discovery effort.

### Limitations of current benchmarking in ML

1. **Lack of realism**: Benchmark tasks can be idealized and simulate overly simplified conditions that do not reflect the real-world challenges a model is expected to overcome when used in an actual discovery campaign. This can lead to pretty leaderboards listing seemingly SOTA models that underwhelm when used in production. Examples of how this comes about are choosing the wrong target or picking an unrepresentative train/test split.
1. **Limited diversity**: Benchmark datasets may be too small and contain only a limited number of materials, unrepresentative of the huge diversity of materials space. This can make models look good even if they fail to generalize.
1. **Opportunity cost**: Bad benchmarks give insufficient consideration to the cost of a failed experiment. Looking purely at global metrics like $\text{MAE}$, $\text{RMSE}$ and $R^2$ can give practitioners a false sense of security; classification metrics like precision, recall and $F_1$ better reflect the cost of acting on a wrong prediction (see the sketch after this list). It has been shown that even accurate models are susceptible to unexpectedly high false-positive rates that can cause experimentalists to waste their time and resources. Many benchmark tasks also do not consider the cost or practicality of synthesizing the materials, which is an important aspect in the discovery of new materials.
1. **Scalability**: Many benchmark tasks have too little data to adequately simulate the high-throughput, large-data regimes that future discovery efforts are likely to encounter. Confining model testing to the small-data regime can obscure poor scaling behavior, e.g. Gaussian processes whose training cost grows cubically with the number of training samples, or random forests that achieve outstanding performance on few data points but fail to extract the full information content of larger datasets, leading to flatter learning curves and worse performance than neural networks when large amounts of training data are available.
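
The opportunity-cost point is easy to demonstrate. Below is a small illustrative example with made-up hull distances (all values hypothetical) where a seemingly good global regression metric coexists with mediocre precision and recall for the stability classification that actually matters in a discovery campaign:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, r2_score, recall_score

# hypothetical DFT and model-predicted distances to the convex hull (eV/atom)
e_above_hull_true = np.array([-0.05, 0.02, 0.30, -0.01, 0.10, -0.20])
e_above_hull_pred = np.array([-0.02, -0.03, 0.25, 0.04, 0.05, -0.15])

# a global regression metric can look reassuring ...
print(f"R^2 = {r2_score(e_above_hull_true, e_above_hull_pred):.2f}")  # ~0.90

# ... while the classification view (stable = on or below the hull) exposes false positives
true_stable = e_above_hull_true <= 0
pred_stable = e_above_hull_pred <= 0
print(f"precision = {precision_score(true_stable, pred_stable):.2f}")  # ~0.67
print(f"recall    = {recall_score(true_stable, pred_stable):.2f}")  # ~0.67
print(f"F1        = {f1_score(true_stable, pred_stable):.2f}")  # ~0.67
```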

<!-- also cover how drug discovery suffers from similar issues maybe? -->

### Why stability over formation energies?

## Related Work

Our work is inspired and builds upon earlier research.

### Bartel's critical examination of ML stability predictions

In 2020, [Chris Bartel et al.](https://nature.com/articles/s41524-020-00362-y) @bartel_critical_2020 benchmarked 7 ML models, showing all of them able to predict DFT formation energies with useful accuracy. However, when asked to predict stability (distance to the convex hull), the performance of all models, especially composition-based ones, deteriorated significantly, making them less useful than DFT for discovering new solids. This partly stems from a loss of systematic error cancellation compared to DFT (a first-principles theory of physics will make similar errors for similar systems, and these errors cancel when looking at the relative energy differences that determine stability) and partly from stability being a property not only of the material itself but also of the chemical space of competing phases around it. A physical theory can generate knowledge of the whole space since it can (in principle) simulate every material in it. An ML algorithm, by contrast, can only go by the feature vector of the single material it is asked to predict. The authors stress the importance of evaluating model performance on actual stability predictions rather than formation energies, which guided the design of this benchmark.
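
To illustrate why stability depends on the surrounding chemical space, the following pymatgen sketch computes the distance to the convex hull for a few toy entries (energies are illustrative only); changing any competing phase changes every other phase's hull distance:

```python
from pymatgen.analysis.phase_diagram import PDEntry, PhaseDiagram
from pymatgen.core import Composition

# toy entries spanning the Li-O chemical space (total energies in eV, illustrative values only)
entries = [
    PDEntry(Composition("Li"), 0.0),
    PDEntry(Composition("O2"), 0.0),
    PDEntry(Composition("Li2O"), -6.0),
    PDEntry(Composition("LiO2"), -2.0),
]
phase_diagram = PhaseDiagram(entries)

# a candidate's stability is measured against all competing phases, not in isolation
for entry in entries:
    e_hull = phase_diagram.get_e_above_hull(entry)
    print(f"{entry.composition.reduced_formula}: {e_hull:.3f} eV/atom above hull")
```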

### Matbench

As the name suggests, this work also expands on the initial release of Matbench @dunn_benchmarking_2020. Matbench aims to serve as a similar catalyst for machine learning in materials science as ImageNet @deng_imagenet_2009 was for computer vision. Matbench released a test suite of 13 supervised tasks for different material properties like thermal (e.g. formation energy, phonon frequency peak), electronic (band gap), optical (refractive index), tensile and elastic (bulk and shear moduli). They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources. 4 tasks are composition-only, 9 provide the relaxed crystal structure as input. Importantly, all tasks were exclusively concerned with the properties of known materials. We believe a task that looks at materials stability and tries to simulate a materials discovery process is a missing piece here.
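
For reference, the existing Matbench tasks are consumed via a train/record loop roughly like the following sketch (the chosen task name and the placeholder predictions are for illustration only):

```python
from matbench.bench import MatbenchBenchmark

# load only the formation-energy task from the 13-task suite
mb = MatbenchBenchmark(autoload=False, subset=["matbench_mp_e_form"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        # ... fit a model on structures/compositions and their target values ...
        test_inputs = task.get_test_data(fold, include_target=False)
        predictions = [0.0] * len(test_inputs)  # placeholder predictions
        task.record(fold, predictions)

print(mb.scores)
```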

### WBM test set

The choice of data for the train and test sets of this benchmark fell on the latest Materials Project @jain_commentary_2013 database release (2021.05.13 at the time of writing) and the **[WBM dataset](https://nature.com/articles/s41524-020-00481-6)** @wang_predicting_2021, published in npj Computational Materials in 2021. Named after the authors' last-name initials, WBM consists of ~250k structures generated via chemical-similarity-based elemental substitution of Materials Project source structures, followed by DFT relaxation and convex hull distance calculation. ~20k or 10% of them were found to lie on the Materials Project convex hull. The authors performed 5 iterations of this substitution process. This is a unique and compelling feature of the dataset as it allows out-of-distribution testing by checking how much model performance degrades with substitution count. A higher number of elemental substitutions will, on average, carry a structure further away from the region of chemical space covered by the Materials Project training set.
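
A sketch of the kind of out-of-distribution analysis this enables: group a model's hull-distance errors by substitution step. The ID format and column names below are assumptions for illustration, not the actual WBM schema:

```python
import pandas as pd

# hypothetical columns: IDs like "wbm-3-12345" encode the substitution step (here 3),
# alongside DFT and model-predicted hull distances (column names are assumptions)
df = pd.DataFrame({
    "material_id": ["wbm-1-1", "wbm-1-2", "wbm-3-1", "wbm-5-1"],
    "e_above_hull_dft": [0.01, 0.20, -0.05, 0.15],
    "e_above_hull_pred": [0.03, 0.18, 0.10, -0.02],
})

# out-of-distribution check: does the error grow with the number of elemental substitutions?
df["substitution_step"] = df.material_id.str.split("-").str[1].astype(int)
df["abs_error"] = (df.e_above_hull_dft - df.e_above_hull_pred).abs()
print(df.groupby("substitution_step").abs_error.mean())
```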

<!--
rough paper outline
1. explain why a benchmark for materials stability prediction is important
2. conclude from the benchmark that among existing models, some are good and some are bad
2.1 among the best models is Wren yet it fails on 20% of the data due to being constrained to small crystal structures
3. to overcome this shortcoming of Wren, we re-implement it from scratch as a transformer-encoder using the same physically meaningful descriptor set consisting of the Wyckoff position, a coarse-grained relaxation-invariant feature set, and show that its memory requirements now only scale quadratically with the number Wyckoff positions, making it universally applicable across materials space
-->

<!-- Questions to address from Emma:

What is the main hypothesis of the paper?
What problem will Wrenformer specifically address?
Why is Wrenformer the best solution to that problem?
What journal will you be submitting to?
Are there any tangents that this project went down that should be excluded from the manuscript for clarity’s sake?
When can we expect a draft from you?
Who are the reviewers that we want to specifically include/exclude?
Were any of these questions discussed/addressed?
-->

## Results

Our benchmark is designed to make [adding future models easy](/how-to-contribute). The initial release compares the following models:

1. [CGCNN](https://arxiv.org/abs/1710.10324) @xie_crystal_2018
1. [BOWSR](https://sciencedirect.com/science/article/pii/S1369702121002984) @zuo_accelerating_2021
1. [Wren](https://arxiv.org/abs/2106.11132) @goodall_rapid_2022
1. [M3GNet](https://arxiv.org/abs/2202.02450) @chen_universal_2022

<MetricsTable />

## Analysis

## Conclusion

## Acknowledgements

JR acknowledges support from the German Academic Scholarship Foundation (Studienstiftung) and gracious hosting as a visiting affiliate in the group of KP.

## References

<ol>
{#each references as { title, id, author, DOI, URL, issued }, idx}
<li>
<strong {id}>{title}</strong>
<p>
{@html author.map((a) => `${a.given} ${a.family}`).join(`, &thinsp; `)}
</p>
<p>
{#if DOI}
DOI: <a href="https://doi.org/{DOI}">{DOI}</a>
{:else if URL}
preprint: <a href={URL}>{URL}</a>
{/if}
{#if issued}
- {issued[0].year}
{/if}
</p>
</li>
{/each}
</ol>

<style>
#abstract,
#abstract + p {
font-weight: 300;
}
address {
text-align: center;
}
address span {
margin: 1em;
display: block;
}
ol > li {
margin: 1ex 0;
}
ol > li > p {
margin: 0;
}
</style>