Fix missing parameter docstrings #609

Merged · 5 commits · Nov 26, 2024
61 changes: 52 additions & 9 deletions docs/source/advanced_usage/trainingmodel.rst
@@ -194,22 +194,64 @@ keyword, you can fine-tune the number of new snapshots being created.
By default, the same number of snapshots as had been provided will be created
(if possible).

Using tensorboard
******************
Logging metrics during training
*******************************

Training progress in MALA can be visualized via tensorboard or wandb, as also shown
in the file ``advanced/ex03_tensor_board``. Simply select a logger prior to training via

.. code-block:: python

parameters.running.logger = "tensorboard"
parameters.running.logging_dir = "mala_vis"

or

.. code-block:: python

import wandb
wandb.init(
project="mala_training",
entity="your_wandb_entity"
)
parameters.running.logger = "wandb"
parameters.running.logging_dir = "mala_vis"

where ``logging_dir`` specifies some directory in which to save the
MALA logging data. You can also select which metrics to record via

.. code-block:: python

parameters.running.validation_metrics = ["ldos", "dos", "density", "total_energy"]

Full list of available metrics:

- "ldos": MSE of the LDOS.
- "band_energy": Band energy.
- "band_energy_actual_fe": Band energy computed with ground truth Fermi energy.
- "total_energy": Total energy.
- "total_energy_actual_fe": Total energy computed with ground truth Fermi energy.
- "fermi_energy": Fermi energy.
- "density": Electron density.
- "density_relative": Electron density (Mean Absolute Percentage Error).
- "dos": Density of states.
- "dos_relative": Density of states (Mean Absolute Percentage Error).

To save time and resources, you can specify the validation interval via

.. code-block:: python

parameters.running.validate_every_n_epochs = 10

If you want to monitor the degree to which the model overfits to the training data,
you can use the option

.. code-block:: python

parameters.running.validate_on_training_data = True

MALA will evaluate the validation metrics on the training set as well as the validation set.
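
Putting these options together, a minimal logging configuration might look
as follows (a sketch only; the surrounding training setup follows the basic
examples):

.. code-block:: python

    import mala

    parameters = mala.Parameters()
    parameters.running.logger = "tensorboard"
    parameters.running.logging_dir = "mala_vis"
    parameters.running.validation_metrics = ["ldos", "total_energy"]
    parameters.running.validate_every_n_epochs = 10
    parameters.running.validate_on_training_data = True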

Afterwards, you can run the training without any
other modifications. Once training is finished (or during training, in case
you want to use tensorboard to monitor progress), you can launch tensorboard
via
@@ -221,6 +263,7 @@ via
The full path for ``path_to_log_directory`` can be accessed via
``trainer.full_logging_path``.
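
For reference, TensorBoard is typically launched from the command line,
assuming it is installed in your environment:

.. code-block:: bash

    tensorboard --logdir path_to_log_directory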

If you're using wandb, you can monitor the training progress on the wandb website.
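
Note that wandb requires a one-time authentication before ``wandb.init()``
can sync runs; a minimal sketch, assuming you have a wandb account:

.. code-block:: python

    import wandb

    # Prompts for an API key on first use; alternatively, set the
    # WANDB_API_KEY environment variable.
    wandb.login()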

Training in parallel
********************
53 changes: 38 additions & 15 deletions mala/common/parameters.py
@@ -225,7 +225,6 @@ class ParametersNetwork(ParametersBase):
----------
nn_type : string
Type of the neural network that will be used. Currently supported are

- "feed_forward" (default)
- "transformer"
- "lstm"
@@ -279,12 +278,12 @@ def __init__(self):
self.layer_activations = ["Sigmoid"]
self.loss_function_type = "mse"

# for LSTM/Gru + Transformer
self.num_hidden_layers = 1

# for LSTM/Gru
self.no_hidden_state = False
self.bidirection = False

# for transformer net
self.dropout = 0.1
@@ -556,11 +555,6 @@ class ParametersData(ParametersBase):

Attributes
----------
snapshot_directories_list : list
A list of all added snapshots.

@@ -693,12 +687,15 @@ class ParametersRunning(ParametersBase):
a "by snapshot" basis.

checkpoints_each_epoch : int
If not 0, checkpoint files will be saved after each
checkpoints_each_epoch epoch.

checkpoint_name : string
Name used for the checkpoints. Using this, multiple runs
can be performed in the same directory.

run_name : string
Name of the run used for logging.

logging_dir : string
Name of the folder that logging files will be saved to.
@@ -707,6 +704,34 @@
If True, then upon creating logging files, these will be saved
in a subfolder of logging_dir labelled with the starting date
of the logging, to avoid having to change input scripts often.

logger : string
Name of the logger to be used.
Currently supported are:

- "tensorboard": Tensorboard logger.
- "wandb": Weights and Biases logger.

validation_metrics : list
List of metrics to be used for validation. Default is ["ldos"].
Possible options are:

- "ldos": MSE of the LDOS.
- "band_energy": Band energy.
- "band_energy_actual_fe": Band energy computed with ground truth Fermi energy.
- "total_energy": Total energy.
- "total_energy_actual_fe": Total energy computed with ground truth Fermi energy.
- "fermi_energy": Fermi energy.
- "density": Electron density.
- "density_relative": Rlectron density (MAPE).
- "dos": Density of states.
- "dos_relative": Density of states (MAPE).

validate_on_training_data : bool
Whether to validate on the training data as well. Default is False.

validate_every_n_epochs : int
Determines how often validation is performed. Default is 1.

inference_data_grid : list
List holding the grid to be used for inference in the form of
@@ -721,19 +746,18 @@

profiler_range : list
List with two entries determining with which batch/iteration number
the CUDA profiler will start and stop profiling. Please note that
this option only holds significance if the nsys profiler is used.
"""

def __init__(self):
super(ParametersRunning, self).__init__()
self.optimizer = "Adam"
self.learning_rate = 0.5
self.learning_rate_embedding = 10 ** (-4)
self.max_number_epochs = 100
self.verbosity = True
self.mini_batch_size = 10
self.snapshots_per_epoch = -1

self.l1_regularization = 0.0
self.l2_regularization = 0.0
Expand All @@ -752,7 +776,6 @@ def __init__(self):
self.num_workers = 0
self.use_shuffling_for_samplers = True
self.checkpoints_each_epoch = 0
self.checkpoint_name = "checkpoint_mala"
self.run_name = ""
self.logging_dir = "./mala_logging"