Docs: Update MPI tutorial with DelftBlue / SLURM instructions
Added a detailed guide on how to run the model on the DelftBlue supercomputer, which utilizes the SLURM scheduler:
  - Steps for creating a SLURM script tailored for DelftBlue.
  - Overview of the DelftBlue supercomputer's capabilities.
  - Instructions on environment setup, logging in, file transfer, job scheduling, and result retrieval for DelftBlue.
EwoutH committed Nov 15, 2023
1 parent b206fe6 commit 1831bd8
Showing 1 changed file with 121 additions and 4 deletions.
125 changes: 121 additions & 4 deletions docs/source/indepth_tutorial/mpi-evaluator.ipynb
@@ -207,9 +207,11 @@
")\n",
"import pickle\n",
"\n",
"\n",
"def some_model(x1=None, x2=None, x3=None):\n",
" return {\"y\": x1 * x2 + x3}\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
" ema_logging.log_to_stderr(level=ema_logging.INFO, pass_root_logger_level=True)\n",
"\n",
@@ -252,18 +254,133 @@
},
{
"cell_type": "markdown",
"source": [],
"source": [
"## Example: Running on the DelftBlue supercomputer (with SLURM)\n",
"\n",
"As an example, we'll show how to run the model on the [DelftBlue supercomputer](https://doc.dhpc.tudelft.nl/delftblue/), which uses the SLURM scheduler. The DelftBlue supercomputer is a cluster of 218 nodes, each with 2 Intel Xeon Gold E5-6248R CPUs (48 cores total), 192 GB of RAM, and 480 GB of local SSD storage. The nodes are connected with a 100 Gbit/s Infiniband network.\n",
"\n",
"_These steps roughly follow the [DelftBlue Crash-course for absolute beginners](https://doc.dhpc.tudelft.nl/delftblue/crash-course/). If you get stuck, you can refer to that guide for more information._\n",
"\n",
"### 1. Creating a SLURM script\n",
"\n",
"First, you need to create a SLURM script. This is a bash script that will be executed on the cluster, and it will contain all the necessary commands to run your model. You can create a new file, for example `slurm_script.sh`, and add the following lines:\n",
"\n",
" ```bash\n",
" #!/bin/bash\n",
" \n",
" #SBATCH --job-name=\"Python_test\"\n",
" #SBATCH --time=00:02:00\n",
" #SBATCH --ntasks=25\n",
" #SBATCH --cpus-per-task=1\n",
" #SBATCH --partition=compute\n",
" #SBATCH --mem-per-cpu=1GB\n",
" #SBATCH --account=research-tpm-mas\n",
" \n",
" module load 2023r1\n",
" module load openmpi\n",
" module load python\n",
" module load py-numpy\n",
" module load py-mpi4py\n",
" module load py-pip\n",
"\n",
" pip install -U --user ema_workbench\n",
"\n",
" mpiexec python3 -m mpi4py.futures ema_example_model.py\n",
" ```\n",
"Modify the script to suit your needs:\n",
"- Set the `--job-name` to something descriptive.\n",
"- Update the maximum `--time` to the expected runtime of your model. The job will be terminated if it exceeds this time limit.\n",
"- Set the `--ntasks` to the number of cores you want to use. Each node has 48 cores, so for example `--ntasks=96` might use two nodes, but can also be distributed over more nodes.\n",
"- Update the memory `--mem-per-cpu` to the amount of memory you need per core. Each node has 192 GB of memory, so you can use up to 4 GB per core.\n",
"- Add `--exclusive` if you want to claim a full node for your job. In that case, specify `--nodes` instead of `--ntasks`. This will reduce overhead, but it will also delay you scheduling time, because you need to wait for a full node to become available.\n",
"- Set the `--account` to your project account. You can find this on the [Accounting and Shares](https://doc.dhpc.tudelft.nl/delftblue/Accounting-and-shares/) page of the DelftBlue docs.\n",
"\n",
"See [Submit Jobs](https://doc.dhpc.tudelft.nl/delftblue/Slurm-scheduler/) at the DelftBlue docs for more information on the SLURM script configuration.\n",
"\n",
"Then, you need to load the necessary modules. You can find the available modules on the [DHPC modules](https://doc.dhpc.tudelft.nl/delftblue/DHPC-modules/) page of the DelftBlue docs. In this example, we're loading the 2023r1 toolchain, which includes Python 3.9, and then we're loading the necessary Python packages.\n",
"\n",
"You might want to install additional Python packages. You can do this with `pip install -U --user <package>`. Note that you need to use the `--user` flag, because you don't have root access on the cluster.\n",
"To install the EMA Workbench, you can use `pip install -U --user ema_workbench`. If you want to install a development branch, you can use `pip install -e -U --user git+https://github.com/quaquel/EMAworkbench@<BRANCH>#egg=ema-workbench`, where `<BRANCH>` is the name of the branch you want to install.\n",
"\n",
"Finally, the script uses `mpiexec` to run Python script in a way that allows the MPIEvaluator to distribute the experiments over the cluster.\n",
"\n",
"Note that the bash scripts (sh), including the `slurm_script.sh` we just created, need LF line endings. If you are using Windows, line endings are CRLF by default, and you need to convert them to LF. You can do this with most text editors, like Notepad++ or Atom for example."
],
"metadata": {
"collapsed": false
},
"id": "a0116cace0bd0a87"
},
{
"cell_type": "markdown",
"source": [
"### 1. Setting up the environment\n",
"\n",
"First, you need to log in on DelftBlue. As an employee, you can login from the command line with:\n",
" ```bash\n",
" ssh <netid>@login.delftblue.tudelft.nl\n",
" ```\n",
"where `<netid>` is your TU Delft netid. You can also use an SSH client such as [PuTTY](https://www.putty.org/).\n",
"\n",
"As a student, you need to jump though an extra hoop:\n",
"\n",
" ```bash\n",
" ssh -J <netid>@student-linux.tudelft.nl <netid>@login.delftblue.tudelft.nl\n",
" ```\n",
"\n",
"Note: Below are the commands for students. If you are an employee, you need to remove the `-J <netid>@student-linux.tudelft.nl` from all commands below.\n",
"\n",
"Once you're logged in, you want to jump to your scratch directory (note it's not but is not backed up).\n",
" ```bash\n",
" cd ../../scratch/<netid>\n",
" ```\n",
"Create a new directory for this tutorial, for example `mkdir ema_mpi_test` and then `cd ema_mpi_test`\n",
"\n",
"Then, you want to send your Python file and SLURM script to DelftBlue. Open a **new** command line terminal, and then you can do this with `scp`:\n",
" ```bash\n",
" scp -J <netid>@student-linux.tudelft.nl ema_example_model.py slurm_script.sh <netid>@login.delftblue.tudelft.nl:/scratch/<netid>/ema_mpi_test\n",
" ```\n",
"Before scheduling the SLURM script, we first have to make it executable:\n",
" ```bash\n",
" chmod +x slurm_script.sh\n",
" ```\n",
"Then we can schedule it:\n",
" ```bash\n",
" sbatch slurm_script.sh\n",
" ```\n",
"Now it's scheduled!\n",
"\n",
"You can check the status of your job with `squeue`:\n",
" ```bash\n",
" squeue -u <netid>\n",
" ```\n",
"You might want to inspect the log file, which is created by the SLURM script. You can do this with `cat`:\n",
" ```bash\n",
" cat slurm-<jobid>.out\n",
" ```\n",
"where `<jobid>` is the job ID of your job, which you can find with `squeue`.\n",
"\n",
"When the job is finished, we can download the pickle file created. Open the command line again (can be the same one as before), and you can copy the results back to your local machine with `scp`:\n",
" ```bash\n",
" scp -J <netid>@student-linux.tudelft.nl <netid>@login.delftblue.tudelft.nl:/scratch/<netid>/ema_mpi_test/ema_mpi_test.pickle .\n",
" ```\n",
"Finally, we can clean up the files on DelftBlue, to avoid cluttering the scratch directory:\n",
" ```bash\n",
" cd ..\n",
" rm -rf \"ema_mpi_test\"\n",
" ```"
],
"metadata": {
"collapsed": false
},
"id": "2cfb1215beb5fe5a"
"id": "49b3ae210d69c2cb"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"name": "python3",
"language": "python",
"name": "python3"
"display_name": "Python 3 (ipykernel)"
},
"language_info": {
"codemirror_mode": {
