Pretraining Foundation Models:
Unleashing the Power of Forgotten Spectra for Advanced Geological Applications
X-ray fluorescence (XRF) core scanning is renowned for its high-resolution, non-destructive, and user-friendly operation. Despite the extensive applications of XRF data, universally quantifying these data into specific geological proxies remains challenging due to their inherent non-linearity and project-scale limitations.
Our study addresses these challenges by harnessing two interdisciplinary advancements:
- A vast amount of XRF spectra acquired from a series of scientific drilling programs
- A more powerful training scheme and model architecture inspired by the success of large language models (LLMs)
We propose a pretraining-finetuning framework that leverages this vast amount of XRF spectra to pretrain a foundation model. Masked Spectrum Modeling (MSM), adapted from BERT, ViT, and MAE, drives our pretraining process: it lets the foundation model learn the underlying patterns and relationships in the XRF spectra, which can then be transferred to downstream tasks. Pretraining is followed by fine-tuning the model on specific geological proxies to adapt it to the target tasks. Hence, downstream fine-tuning does not necessarily require a large amount of labeled data, in contrast to the conventional approach of training a model from scratch for each project.
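To make the MSM objective concrete, here is a minimal pure-Python sketch of its core idea: split a 1-D spectrum into patches, hide a random subset, and score a reconstruction only on the hidden patches (as in MAE). The patch size, mask ratio, and toy "model" are illustrative assumptions, not the actual hyperparameters or architecture used in this project.

```python
import random

def mask_spectrum(spectrum, patch_size=16, mask_ratio=0.75, seed=0):
    """Split a 1-D spectrum into patches and mask a random subset.

    Returns the visible patches, all patches, and the masked indices.
    """
    rng = random.Random(seed)
    n_patches = len(spectrum) // patch_size
    patches = [spectrum[i * patch_size:(i + 1) * patch_size]
               for i in range(n_patches)]
    masked_idx = set(rng.sample(range(n_patches), int(n_patches * mask_ratio)))
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    return visible, patches, sorted(masked_idx)

def msm_loss(pred_patches, target_patches, masked_idx):
    """Mean squared error computed only on the masked patches."""
    errs = []
    for i in masked_idx:
        errs.extend((p - t) ** 2
                    for p, t in zip(pred_patches[i], target_patches[i]))
    return sum(errs) / len(errs)

# Toy example: a flat "spectrum" of 2048 channels with intensity 1.0.
spectrum = [1.0] * 2048
visible, patches, masked_idx = mask_spectrum(spectrum)
# A trivial stand-in "model" that predicts zeros for every patch:
pred = [[0.0] * 16 for _ in patches]
print(round(msm_loss(pred, patches, masked_idx), 2))  # → 1.0
```

During real pretraining, an encoder would see only the visible patches and a decoder would reconstruct the masked ones; the loss above is what the model is trained to minimize.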
Lee, A.-S., Lin, H.-T., and Liou, S. Y. H.: Pretraining Foundation Models: Unleashing the Power of Forgotten Spectra for Advanced Geological Applications, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-4956, https://doi.org/10.5194/egusphere-egu24-4956, 2024.
We adopt the container template `cuda118` from https://github.com/dispink/docker-example.
- Python 3.11
- CUDA 11.8
- cuDNN 8.6.0
The published model weights are available on the HuggingFace repo.
- `.devcontainer`: Contains the configuration files for the Docker container, compatible with the VS Code Dev Containers extension.
- `data`: Hidden here; please check it on the HuggingFace repo. The build script is `src/datas/build_data.py`.
- `notebooks`: Collects Jupyter notebooks for experimentation, analysis, and model development.
- `configs`: Stores configuration files or parameters used in the project, such as hyperparameters, model configurations, or experiment settings.
- `files`: Stores selected output files, reports, or visualizations.
- `src`: Contains all the scripts used in the project, further divided into subfolders:
    - `datas`: Scripts for data preprocessing and loading.
    - `models`: Scripts for model architectures, loss functions, and evaluation metrics.
    - `train`: Scripts for training and related functions.
    - `eval`: Scripts for evaluation and related functions.
    - `utils`: Utility scripts for logging and other helper functions.
- `archives`: Stores old or deprecated scripts.
- `pilot`: Stores pilot experiments before integration into the main project.
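As a sketch of how the `configs` folder ties into the training scripts, a training entry point might read experiment settings from a JSON file like the one below. The file name, keys, and values here are hypothetical illustrations, not the repo's actual configuration schema.

```python
import json
import pathlib
import tempfile

# Hypothetical experiment settings of the kind `configs` might hold.
# Keys and values are illustrative assumptions only.
cfg = {"patch_size": 16, "mask_ratio": 0.75, "lr": 1.5e-4, "epochs": 100}

# Write the config to a temporary directory and read it back, the way a
# script in `src/train` could load its hyperparameters at startup.
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "msm_pretrain.json"
    path.write_text(json.dumps(cfg, indent=2))
    loaded = json.loads(path.read_text())

print(loaded["mask_ratio"])  # → 0.75
```

Keeping settings in versioned config files rather than hard-coded constants makes pretraining and fine-tuning runs reproducible and easy to compare.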