[AIR][Doc] New Example: LightningTrainer with experiment tracking tools (#34812)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
woshiyyya authored May 19, 2023
1 parent 35305e4 commit 43a20c1
Showing 9 changed files with 316 additions and 4 deletions.
2 changes: 2 additions & 0 deletions doc/source/_toc.yml
@@ -134,6 +134,8 @@ parts:
title: "PyTorch Lightning Basic Example"
- file: train/examples/lightning/lightning_cola_advanced
title: "PyTorch Lightning Advanced Example"
- file: train/examples/lightning/lightning_exp_tracking
title: "PyTorch Lightning with Experiment Tracking Tools"
- file: train/examples/transformers/transformers_example
title: "HF Transformers Example"
- file: train/examples/tf/tensorflow_mnist_example
3 changes: 2 additions & 1 deletion doc/source/conf.py
@@ -364,7 +364,8 @@ def filter_out_undoc_class_members(member_name, class_name, module_name):
"trainTuneTensorflow": "TensorFlow,Training,Tuning",
"trainTunePyTorch": "PyTorch,Training,Tuning",
"trainBenchmark": "PyTorch,Training",
"trainLightning": "PyTorch,Lightning,Training"
"trainLightning": "PyTorch,Lightning,Training",
"trackLightning": "PyTorch,Lightning,Training,MLFlow"
# TODO add and integrate tags for other libraries.
# Tune has a proper example library
# Serve, RLlib and AIR could use one.
9 changes: 9 additions & 0 deletions doc/source/train/examples.rst
@@ -19,6 +19,7 @@ and use cases. You can filter these examples by the following categories:
<div type="button" class="tag btn btn-outline-primary">PyTorch</div>
<div type="button" class="tag btn btn-outline-primary">TensorFlow</div>
<div type="button" class="tag btn btn-outline-primary">HuggingFace</div>
<div type="button" class="tag btn btn-outline-primary">Lightning</div>
<div type="button" class="tag btn btn-outline-primary">Horovod</div>
<div type="button" class="tag btn btn-outline-primary">MLflow</div>

@@ -108,6 +109,14 @@ Ray Train Examples Using Loggers & Callbacks

Logging Training Runs with MLflow

.. grid-item-card::
:img-top: /images/pytorch_lightning_small.png
:class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img

.. button-ref:: lightning_experiment_tracking

Using Experiment Tracking Tools in LightningTrainer


Ray Train & Tune Integration Examples
-------------------------------------
12 changes: 11 additions & 1 deletion doc/source/train/examples/lightning/BUILD
@@ -6,10 +6,20 @@ filegroup(
visibility=["//doc:__subpackages__"],
)

# GPU tests
py_test_run_all_notebooks(
size="large",
include=["*.ipynb"],
exclude=[],
exclude=["lightning_exp_tracking.ipynb"],
data=["//doc/source/train/examples/lightning:lightning_examples"],
tags=["exclusive", "team:ml", "gpu", "ray_air"],
)

# CPU tests
py_test_run_all_notebooks(
size="large",
include=["lightning_exp_tracking.ipynb"],
exclude=[],
data=["//doc/source/train/examples/lightning:lightning_examples"],
tags=["exclusive", "team:ml", "ray_air"],
)
@@ -1492,7 +1492,8 @@
"## What's next?\n",
"\n",
"- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
"- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`"
"- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`\n",
"- {ref}`Experiment Tracking with Wandb, CometML, MLFlow, and Tensorboard in LightningTrainer <lightning_experiment_tracking>`"
]
}
],
285 changes: 285 additions & 0 deletions doc/source/train/examples/lightning/lightning_exp_tracking.ipynb
@@ -0,0 +1,285 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(lightning_experiment_tracking)=\n",
"\n",
"# Using Experiment Tracking Tools in LightningTrainer\n",
"\n",
"W&B, CometML, MLFlow, and Tensorboard are all popular tools in the field of machine learning for managing, visualizing, and tracking experiments. The {class}`~ray.train.lightning.LightningTrainer` integration in Ray AIR allows you to continue using these built-in experiment tracking integrations.\n",
"\n",
"\n",
":::{note}\n",
"This guide shows how to use the native [Logger](https://lightning.ai/docs/pytorch/stable/extensions/logging.html) integrations in PyTorch Lightning. Ray AIR also provides {ref}`experiment tracking integrations <tune-exp-tracking-ref>` for all the tools mentioned in this example. We recommend sticking with the PyTorch Lightning loggers.\n",
":::\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define your model and dataloader\n",
"\n",
"In this example, we simply create a dummy model with dummy datasets for demonstration. There is no need for any code change here. We report 3 metrics(\"train_loss\", \"metric_1\", \"metric_2\") in the training loop. Lightning's `Logger`s will capture and report them to the corresponding experiment tracking tools."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import torch\n",
"import torch.nn.functional as F\n",
"import pytorch_lightning as pl\n",
"from torch.utils.data import TensorDataset, DataLoader\n",
"\n",
"# create dummy data\n",
"X = torch.randn(128, 3) # 128 samples, 3 features\n",
"y = torch.randint(0, 2, (128,)) # 128 binary labels\n",
"\n",
"# create a TensorDataset to wrap the data\n",
"dataset = TensorDataset(X, y)\n",
"\n",
"# create a DataLoader to iterate over the dataset\n",
"batch_size = 8\n",
"dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Define a dummy model\n",
"class DummyModel(pl.LightningModule):\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.layer = torch.nn.Linear(3, 1)\n",
"\n",
" def forward(self, x):\n",
" return self.layer(x)\n",
"\n",
" def training_step(self, batch, batch_idx):\n",
" x, y = batch\n",
" y_hat = self(x)\n",
" loss = F.binary_cross_entropy_with_logits(y_hat.flatten(), y.float())\n",
"\n",
" # The metrics below will be reported to Loggers\n",
" self.log(\"train_loss\", loss)\n",
" self.log_dict({\"metric_1\": 1 / (batch_idx + 1), \"metric_2\": batch_idx * 100})\n",
" return loss\n",
"\n",
" def configure_optimizers(self):\n",
" return torch.optim.Adam(self.parameters(), lr=1e-3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define your loggers\n",
"\n",
"For offline loggers, no changes are required in the Logger initialization.\n",
"\n",
"For online loggers (W&B and CometML), you need to do two things:\n",
"- Set up your API keys as environment variables.\n",
"- Set `rank_zero_only.rank = None` to avoid Lightning creating a new experiment run on the driver node. "
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"CometLogger will be initialized in online mode\n"
]
}
],
"source": [
"from pytorch_lightning.loggers.wandb import WandbLogger\n",
"from pytorch_lightning.loggers.comet import CometLogger\n",
"from pytorch_lightning.loggers.mlflow import MLFlowLogger\n",
"from pytorch_lightning.loggers.tensorboard import TensorBoardLogger\n",
"from pytorch_lightning.utilities.rank_zero import rank_zero_only\n",
"import wandb\n",
"\n",
"\n",
"# A callback to login wandb in each worker\n",
"class WandbLoginCallback(pl.Callback):\n",
" def __init__(self, key):\n",
" self.key = key\n",
"\n",
" def setup(self, trainer, pl_module, stage) -> None:\n",
" wandb.login(key=self.key)\n",
"\n",
"\n",
"def create_loggers(name, project_name, save_dir=\"./logs\", offline=False):\n",
" # Avoid creating a new experiment run on the driver node.\n",
" rank_zero_only.rank = None\n",
"\n",
" # Wandb\n",
" wandb_api_key = os.environ.get(\"WANDB_API_KEY\", None)\n",
" wandb_logger = WandbLogger(\n",
" name=name, \n",
" project=project_name, \n",
" save_dir=f\"{save_dir}/wandb\", \n",
" offline=offline\n",
" )\n",
" callbacks = [] if offline else [WandbLoginCallback(key=wandb_api_key)]\n",
"\n",
" # CometML\n",
" comet_api_key = os.environ.get(\"COMET_API_KEY\", None)\n",
" comet_logger = CometLogger(\n",
" api_key=comet_api_key,\n",
" experiment_name=name,\n",
" project_name=project_name,\n",
" save_dir=f\"{save_dir}/comet\",\n",
" offline=offline,\n",
" )\n",
"\n",
" # MLFlow\n",
" mlflow_logger = MLFlowLogger(\n",
" run_name=name,\n",
" experiment_name=project_name,\n",
" tracking_uri=f\"file:{save_dir}/mlflow\",\n",
" )\n",
"\n",
" # Tensorboard\n",
" tensorboard_logger = TensorBoardLogger(\n",
" name=name, save_dir=f\"{save_dir}/tensorboard\"\n",
" )\n",
"\n",
" return [wandb_logger, comet_logger, mlflow_logger, tensorboard_logger], callbacks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"YOUR_SAVE_DIR = \"./logs\"\n",
"loggers, callbacks = create_loggers(\n",
" name=\"demo-run\", project_name=\"demo-project\", save_dir=YOUR_SAVE_DIR, offline=False\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"# FOR SMOKE TESTS\n",
"loggers, callbacks = create_loggers(\n",
" name=\"demo-run\", project_name=\"demo-project\", offline=True\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train the model and view logged results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ray.air.config import RunConfig, ScalingConfig\n",
"from ray.train.lightning import LightningConfigBuilder, LightningTrainer\n",
"\n",
"builder = LightningConfigBuilder()\n",
"builder.module(cls=DummyModel)\n",
"builder.trainer(\n",
" max_epochs=5,\n",
" accelerator=\"cpu\",\n",
" logger=loggers,\n",
" callbacks=callbacks,\n",
" log_every_n_steps=1,\n",
")\n",
"builder.fit_params(train_dataloaders=dataloader)\n",
"\n",
"lightning_config = builder.build()\n",
"\n",
"scaling_config = ScalingConfig(num_workers=4, use_gpu=False)\n",
"\n",
"run_config = RunConfig(\n",
" name=\"ptl-exp-tracking\",\n",
" storage_path=\"/tmp/ray_results\",\n",
")\n",
"\n",
"trainer = LightningTrainer(\n",
" lightning_config=lightning_config,\n",
" scaling_config=scaling_config,\n",
" run_config=run_config,\n",
")\n",
"\n",
"trainer.fit()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's take a look at our experiment results!\n",
"\n",
"**Wandb**\n",
"![alt](https://user-images.githubusercontent.com/26745457/235216924-ed27f820-3f2e-4812-bc62-982c3a1748c7.png)\n",
"\n",
"\n",
"**CometML**\n",
"![alt](https://user-images.githubusercontent.com/26745457/235216949-72d80d7d-4460-480a-b20d-f154594507fc.png)\n",
"\n",
"\n",
"**Tensorboard**\n",
"![](https://user-images.githubusercontent.com/26745457/235227957-7c2ee93b-91ab-494c-a241-7b106cf9a5e6.png)\n",
"\n",
"**MLFlow**\n",
"![](https://user-images.githubusercontent.com/26745457/235241099-6850bcae-8843-4bbb-8268-c04b04a09e68.png)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
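
Note on API keys: the `create_loggers` helper in the new notebook reads `WANDB_API_KEY` and `COMET_API_KEY` from the environment when running in online mode. A minimal sketch of one way to supply them before creating the loggers is shown here; the placeholder values are hypothetical and not part of this commit.

    import os

    # Hypothetical placeholders -- substitute real credentials, or export these
    # variables in your shell before launching the Ray job.
    os.environ.setdefault("WANDB_API_KEY", "<your-wandb-api-key>")
    os.environ.setdefault("COMET_API_KEY", "<your-comet-api-key>")

    # The loggers can then be created in online mode, as in the notebook:
    # loggers, callbacks = create_loggers(
    #     name="demo-run", project_name="demo-project", offline=False
    # )

With `offline=True` (as in the smoke-test cell), no keys are required and the W&B and Comet logs are written locally under `save_dir`.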
@@ -742,7 +742,8 @@
"\n",
"- {ref}`Use LightningTrainer with Ray Data and Batch Predictor <lightning_advanced_example>`\n",
"- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
"- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`"
"- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`\n",
"- {ref}`Experiment Tracking with Wandb, CometML, MLFlow, and Tensorboard in LightningTrainer <lightning_experiment_tracking>`"
]
}
],
2 changes: 2 additions & 0 deletions doc/source/tune/examples/experiment-tracking.rst
@@ -1,3 +1,5 @@
.. _tune-exp-tracking-ref:

Tune Experiment Tracking Examples
---------------------------------

1 change: 1 addition & 0 deletions doc/source/tune/examples/tune-pytorch-lightning.ipynb
@@ -582,6 +582,7 @@
"\n",
"- {ref}`Use LightningTrainer for Image Classification <lightning_mnist_example>`.\n",
"- {ref}`Use LightningTrainer with Ray Data and Batch Predictor <lightning_advanced_example>`\n",
"- {ref}`Experiment Tracking with Wandb, CometML, MLFlow, and Tensorboard in LightningTrainer <lightning_experiment_tracking>`\n",
"- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
"- {doc}`/tune/examples/includes/mlflow_ptl_example`: Example for using [MLflow](https://github.com/mlflow/mlflow/)\n",
" and [Pytorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning) with Ray Tune.\n",
