[AIR][Doc] New Example: LightningTrainer with experiment tracking tools (#34812)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
woshiyyya authored May 19, 2023
1 parent 35305e4 commit 43a20c1
Showing 9 changed files with 316 additions and 4 deletions.
2 changes: 2 additions & 0 deletions doc/source/_toc.yml
@@ -134,6 +134,8 @@ parts:
title: "PyTorch Lightning Basic Example"
- file: train/examples/lightning/lightning_cola_advanced
title: "PyTorch Lightning Advanced Example"
- file: train/examples/lightning/lightning_exp_tracking
title: "PyTorch Lightning with Experiment Tracking Tools"
- file: train/examples/transformers/transformers_example
title: "HF Transformers Example"
- file: train/examples/tf/tensorflow_mnist_example
3 changes: 2 additions & 1 deletion doc/source/conf.py
@@ -364,7 +364,8 @@ def filter_out_undoc_class_members(member_name, class_name, module_name):
"trainTuneTensorflow": "TensorFlow,Training,Tuning",
"trainTunePyTorch": "PyTorch,Training,Tuning",
"trainBenchmark": "PyTorch,Training",
"trainLightning": "PyTorch,Lightning,Training"
"trainLightning": "PyTorch,Lightning,Training",
"trackLightning": "PyTorch,Lightning,Training,MLFlow"
# TODO add and integrate tags for other libraries.
# Tune has a proper example library
# Serve, RLlib and AIR could use one.
9 changes: 9 additions & 0 deletions doc/source/train/examples.rst
@@ -19,6 +19,7 @@ and use cases. You can filter these examples by the following categories:
<div type="button" class="tag btn btn-outline-primary">PyTorch</div>
<div type="button" class="tag btn btn-outline-primary">TensorFlow</div>
<div type="button" class="tag btn btn-outline-primary">HuggingFace</div>
<div type="button" class="tag btn btn-outline-primary">Lightning</div>
<div type="button" class="tag btn btn-outline-primary">Horovod</div>
<div type="button" class="tag btn btn-outline-primary">MLflow</div>

@@ -108,6 +109,14 @@ Ray Train Examples Using Loggers & Callbacks

Logging Training Runs with MLflow

.. grid-item-card::
:img-top: /images/pytorch_lightning_small.png
:class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img

.. button-ref:: lightning_experiment_tracking

Using Experiment Tracking Tools in LightningTrainer


Ray Train & Tune Integration Examples
-------------------------------------
12 changes: 11 additions & 1 deletion doc/source/train/examples/lightning/BUILD
@@ -6,10 +6,20 @@ filegroup(
visibility=["//doc:__subpackages__"],
)

# GPU tests
py_test_run_all_notebooks(
size="large",
include=["*.ipynb"],
exclude=[],
exclude=["lightning_exp_tracking.ipynb"],
data=["//doc/source/train/examples/lightning:lightning_examples"],
tags=["exclusive", "team:ml", "gpu", "ray_air"],
)

# CPU tests
py_test_run_all_notebooks(
size="large",
include=["lightning_exp_tracking.ipynb"],
exclude=[],
data=["//doc/source/train/examples/lightning:lightning_examples"],
tags=["exclusive", "team:ml", "ray_air"],
)
@@ -1492,7 +1492,8 @@
"## What's next?\n",
"\n",
"- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
"- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`"
"- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`\n",
"- {ref}`Experiment Tracking with Wandb, CometML, MLFlow, and Tensorboard in LightningTrainer <lightning_experiment_tracking>`"
]
}
],
285 changes: 285 additions & 0 deletions doc/source/train/examples/lightning/lightning_exp_tracking.ipynb
@@ -0,0 +1,285 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(lightning_experiment_tracking)=\n",
"\n",
"# Using Experiment Tracking Tools in LightningTrainer\n",
"\n",
"W&B, CometML, MLFlow, and Tensorboard are all popular tools in the field of machine learning for managing, visualizing, and tracking experiments. The {class}`~ray.train.lightning.LightningTrainer` integration in Ray AIR allows you to continue using these built-in experiment tracking integrations.\n",
"\n",
"\n",
":::{note}\n",
"This guide shows how to use the native [Logger](https://lightning.ai/docs/pytorch/stable/extensions/logging.html) integrations in PyTorch Lightning. Ray AIR also provides {ref}`experiment tracking integrations <tune-exp-tracking-ref>` for all the tools mentioned in this example. We recommend sticking with the PyTorch Lightning loggers.\n",
":::\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define your model and dataloader\n",
"\n",
"In this example, we simply create a dummy model with dummy datasets for demonstration. There is no need for any code change here. We report 3 metrics(\"train_loss\", \"metric_1\", \"metric_2\") in the training loop. Lightning's `Logger`s will capture and report them to the corresponding experiment tracking tools."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import torch\n",
"import torch.nn.functional as F\n",
"import pytorch_lightning as pl\n",
"from torch.utils.data import TensorDataset, DataLoader\n",
"\n",
"# create dummy data\n",
"X = torch.randn(128, 3) # 128 samples, 3 features\n",
"y = torch.randint(0, 2, (128,)) # 128 binary labels\n",
"\n",
"# create a TensorDataset to wrap the data\n",
"dataset = TensorDataset(X, y)\n",
"\n",
"# create a DataLoader to iterate over the dataset\n",
"batch_size = 8\n",
"dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Define a dummy model\n",
"class DummyModel(pl.LightningModule):\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.layer = torch.nn.Linear(3, 1)\n",
"\n",
" def forward(self, x):\n",
" return self.layer(x)\n",
"\n",
" def training_step(self, batch, batch_idx):\n",
" x, y = batch\n",
" y_hat = self(x)\n",
" loss = F.binary_cross_entropy_with_logits(y_hat.flatten(), y.float())\n",
"\n",
" # The metrics below will be reported to Loggers\n",
" self.log(\"train_loss\", loss)\n",
" self.log_dict({\"metric_1\": 1 / (batch_idx + 1), \"metric_2\": batch_idx * 100})\n",
" return loss\n",
"\n",
" def configure_optimizers(self):\n",
" return torch.optim.Adam(self.parameters(), lr=1e-3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define your loggers\n",
"\n",
"For offline loggers, no changes are required in the Logger initialization.\n",
"\n",
"For online loggers (W&B and CometML), you need to do two things:\n",
"- Set up your API keys as environment variables.\n",
"- Set `rank_zero_only.rank = None` to avoid Lightning creating a new experiment run on the driver node. "
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"CometLogger will be initialized in online mode\n"
]
}
],
"source": [
"from pytorch_lightning.loggers.wandb import WandbLogger\n",
"from pytorch_lightning.loggers.comet import CometLogger\n",
"from pytorch_lightning.loggers.mlflow import MLFlowLogger\n",
"from pytorch_lightning.loggers.tensorboard import TensorBoardLogger\n",
"from pytorch_lightning.utilities.rank_zero import rank_zero_only\n",
"import wandb\n",
"\n",
"\n",
"# A callback to login wandb in each worker\n",
"class WandbLoginCallback(pl.Callback):\n",
" def __init__(self, key):\n",
" self.key = key\n",
"\n",
" def setup(self, trainer, pl_module, stage) -> None:\n",
" wandb.login(key=self.key)\n",
"\n",
"\n",
"def create_loggers(name, project_name, save_dir=\"./logs\", offline=False):\n",
" # Avoid creating a new experiment run on the driver node.\n",
" rank_zero_only.rank = None\n",
"\n",
" # Wandb\n",
" wandb_api_key = os.environ.get(\"WANDB_API_KEY\", None)\n",
" wandb_logger = WandbLogger(\n",
" name=name, \n",
" project=project_name, \n",
" save_dir=f\"{save_dir}/wandb\", \n",
" offline=offline\n",
" )\n",
" callbacks = [] if offline else [WandbLoginCallback(key=wandb_api_key)]\n",
"\n",
" # CometML\n",
" comet_api_key = os.environ.get(\"COMET_API_KEY\", None)\n",
" comet_logger = CometLogger(\n",
" api_key=comet_api_key,\n",
" experiment_name=name,\n",
" project_name=project_name,\n",
" save_dir=f\"{save_dir}/comet\",\n",
" offline=offline,\n",
" )\n",
"\n",
" # MLFlow\n",
" mlflow_logger = MLFlowLogger(\n",
" run_name=name,\n",
" experiment_name=project_name,\n",
" tracking_uri=f\"file:{save_dir}/mlflow\",\n",
" )\n",
"\n",
" # Tensorboard\n",
" tensorboard_logger = TensorBoardLogger(\n",
" name=name, save_dir=f\"{save_dir}/tensorboard\"\n",
" )\n",
"\n",
" return [wandb_logger, comet_logger, mlflow_logger, tensorboard_logger], callbacks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"YOUR_SAVE_DIR = \"./logs\"\n",
"loggers, callbacks = create_loggers(\n",
" name=\"demo-run\", project_name=\"demo-project\", save_dir=YOUR_SAVE_DIR, offline=False\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"# FOR SMOKE TESTS\n",
"loggers, callbacks = create_loggers(\n",
" name=\"demo-run\", project_name=\"demo-project\", offline=True\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train the model and view logged results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ray.air.config import RunConfig, ScalingConfig\n",
"from ray.train.lightning import LightningConfigBuilder, LightningTrainer\n",
"\n",
"builder = LightningConfigBuilder()\n",
"builder.module(cls=DummyModel)\n",
"builder.trainer(\n",
" max_epochs=5,\n",
" accelerator=\"cpu\",\n",
" logger=loggers,\n",
" callbacks=callbacks,\n",
" log_every_n_steps=1,\n",
")\n",
"builder.fit_params(train_dataloaders=dataloader)\n",
"\n",
"lightning_config = builder.build()\n",
"\n",
"scaling_config = ScalingConfig(num_workers=4, use_gpu=False)\n",
"\n",
"run_config = RunConfig(\n",
" name=\"ptl-exp-tracking\",\n",
" storage_path=\"/tmp/ray_results\",\n",
")\n",
"\n",
"trainer = LightningTrainer(\n",
" lightning_config=lightning_config,\n",
" scaling_config=scaling_config,\n",
" run_config=run_config,\n",
")\n",
"\n",
"trainer.fit()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's take a look at our experiment results!\n",
"\n",
"**Wandb**\n",
"![alt](https://user-images.githubusercontent.com/26745457/235216924-ed27f820-3f2e-4812-bc62-982c3a1748c7.png)\n",
"\n",
"\n",
"**CometML**\n",
"![alt](https://user-images.githubusercontent.com/26745457/235216949-72d80d7d-4460-480a-b20d-f154594507fc.png)\n",
"\n",
"\n",
"**Tensorboard**\n",
"![](https://user-images.githubusercontent.com/26745457/235227957-7c2ee93b-91ab-494c-a241-7b106cf9a5e6.png)\n",
"\n",
"**MLFlow**\n",
"![](https://user-images.githubusercontent.com/26745457/235241099-6850bcae-8843-4bbb-8268-c04b04a09e68.png)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
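
Note on API keys: the `create_loggers` helper in the new notebook reads `WANDB_API_KEY` and `COMET_API_KEY` from the environment when running in online mode. A minimal sketch of one way to supply them before creating the loggers is shown here; the placeholder values are hypothetical and not part of this commit.

    import os

    # Hypothetical placeholders -- substitute real credentials, or export these
    # variables in your shell before launching the Ray job.
    os.environ.setdefault("WANDB_API_KEY", "<your-wandb-api-key>")
    os.environ.setdefault("COMET_API_KEY", "<your-comet-api-key>")

    # The loggers can then be created in online mode, as in the notebook:
    # loggers, callbacks = create_loggers(
    #     name="demo-run", project_name="demo-project", offline=False
    # )

With `offline=True` (as in the smoke-test cell), no keys are required and the W&B and Comet logs are written locally under `save_dir`.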
@@ -742,7 +742,8 @@
"\n",
"- {ref}`Use LightningTrainer with Ray Data and Batch Predictor <lightning_advanced_example>`\n",
"- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
"- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`"
"- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`\n",
"- {ref}`Experiment Tracking with Wandb, CometML, MLFlow, and Tensorboard in LightningTrainer <lightning_experiment_tracking>`"
]
}
],
2 changes: 2 additions & 0 deletions doc/source/tune/examples/experiment-tracking.rst
@@ -1,3 +1,5 @@
.. _tune-exp-tracking-ref:

Tune Experiment Tracking Examples
---------------------------------

1 change: 1 addition & 0 deletions doc/source/tune/examples/tune-pytorch-lightning.ipynb
@@ -582,6 +582,7 @@
"\n",
"- {ref}`Use LightningTrainer for Image Classification <lightning_mnist_example>`.\n",
"- {ref}`Use LightningTrainer with Ray Data and Batch Predictor <lightning_advanced_example>`\n",
"- {ref}`Experiment Tracking with Wandb, CometML, MLFlow, and Tensorboard in LightningTrainer <lightning_experiment_tracking>`\n",
"- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
"- {doc}`/tune/examples/includes/mlflow_ptl_example`: Example for using [MLflow](https://github.com/mlflow/mlflow/)\n",
" and [Pytorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning) with Ray Tune.\n",
