CHPC 2024 Student Cluster Competition

Welcome to the Center for High Performance Computing (CHPC)'s Student Cluster Competition (SCC) - Team Selection Round. This round requires each team to build a prototype multi-node compute cluster within the National Integrated Cyber Infrastructure Systems (NICIS) virtual compute cloud (described below).

The goal of this document is to introduce you to the competition platform and familiarise you with some Linux and systems administration concepts. This competition provides you with a fixed set of virtual resources that you will use to initialize a set of virtual machine instances based on your choice of Linux flavor.

Table of Contents

  1. Structure of the Competition
    1. Getting Help
    2. Timetable
    3. Scoring
    4. Instructions for Mentors
      1. Hands-Off Rule (You may not touch the keyboard)
    5. Cheat Sheet
  2. Deliverables
    1. Cluster Design Assignment
    2. Technical Knowledge Assessment
    3. Tutorials
  3. Lecture Recordings
  4. Contributing to the Project
    1. Steps to follow when editing existing content
    2. Syntax and Style

Structure of the Competition

The CHPC invites applications from suitably qualified candidates to enter the CHPC Student Cluster Competition. The CHPC Student Cluster Competition gives undergraduate students at South African universities exposure to the High Performance Computing (HPC) Industry. The winning team will be entered into the ISC Student Cluster Competition hosted at the 2025 International Supercomputing Conference held in Hamburg, Germany.

You will be accessing all of the course work and material through this GitHub repository, which you and your team must check regularly to receive updates.

Getting Help

You are strongly encouraged to get help and even assist others by Opening and Participating in Discussions.

Tip

Active participation in the student discussions is an easy way to separate yourselves from the rest of the competition and make it easy for the instructors to notice you!

Timetable

Each day will comprise four lectures in the morning, with tutorials taking place in the afternoon. A PDF Version of the Timetable is available for you to download.

Scoring

Teams will be evaluated according to the following breakdown, with your progress in the tutorials and your final presentations carrying the most weight.

Component Weight
Technical Knowledge Assessment 0.1
Tutorials 0.4
Cluster Design Assignment (Part 1) 0.1
Cluster Design Presentation 0.4
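
As a rough illustration of how the weights above combine, the sketch below computes a weighted total from four component scores. The scores are hypothetical values out of 100 chosen purely for illustration; only the weights come from the table. Integer arithmetic is used (weights expressed in percent) so that the result does not depend on floating-point formatting.

```shell
#!/bin/sh
# Hypothetical component scores out of 100 (illustrative values only).
assessment=70     # Technical Knowledge Assessment, weight 0.1
tutorials=80      # Tutorials, weight 0.4
design_part1=60   # Cluster Design Assignment (Part 1), weight 0.1
presentation=90   # Cluster Design Presentation, weight 0.4

# Weighted sum with the weights written as percentages (10, 40, 10, 40).
# The result is the final score multiplied by 100.
scaled=$((10 * assessment + 40 * tutorials + 10 * design_part1 + 40 * presentation))

# Split into whole and fractional parts to print two decimal places.
total=$(printf '%d.%02d' $((scaled / 100)) $((scaled % 100)))
echo "Weighted total: $total / 100"
```

Because the tutorials and the presentation each carry a weight of 0.4, a ten point change in either of them moves the total four times as far as the same change in the assessment or in Part 1 of the design assignment.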

Instructions for Mentors

The role of mentors, instructors and volunteers is to provide leadership and guidance for the student competitors participating in this year's Center for High Performance Computing 2024 Student Cluster Competition.

In preparing your teams for the competition, your main goal is to teach and impart knowledge to the student participants in such a way that they are empowered and enabled to tackle the problems and benchmarking tasks themselves.

Hands-Off Rule (You may not touch the keyboard)

Under no circumstances whatsoever may mentors touch any competition hardware, whether it belongs to their own team or to another team. Mentors are encouraged to provide guidance and leadership to their (as well as other) teams.

If a mentor is found to be in contravention of this rule, their team may incur a penalty. Repeated infringements may result in the disqualification of their team.

Cheat Sheet

Below is a table with a number of Linux system commands and utilities that you may find useful in assisting you to debug problems that you may encounter with your clusters. Note that some of these utilities do not ship with the base deployment of a number of Linux flavors, and you may be required to install the associated packages, prior to making use of them.

Command Description
ssh Used for logging into a remote machine and for executing commands on the remote machine.
scp SCP copies files between hosts on a network. It uses ssh for data transfer, and uses the same authentication and provides the same security as ssh.
wget / curl Utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols.
top / htop / btop Provides a dynamic real-time view of a running system. It can display system summary information as well as a list of processes or threads.
screen / tmux Full-screen window manager that multiplexes a physical terminal between several processes (typically interactive shells).
ip a Display IP addresses and property information.
dmesg Prints the message buffer of the kernel. The output of this command typically contains the messages produced by the device drivers.
watch Execute a program periodically, showing output fullscreen.
df -h Report file system disk space usage.
ping PING command is used to verify that a device can communicate with another on a network.
lynx Command-line based web browser (more useful than you think)
ctrl+alt+[F1...F6] Open another shell session (multiple ‘desktops’)
ctrl+z Suspend the foreground command (resume it in the background with ‘bg’)
du -h Summarize disk usage of each FILE, recursively for directories.
lscpu Command line utility that provides system CPU related information.
lstopo View the topology of a Linux system.
inxi Lists information related to your system's sensors, partitions, drives, networking, audio, graphics, CPU, system, etc.
hwinfo Hardware probing utility that provides detailed info about various components.
lshw Hardware probing utility that provides detailed info about various components.
proc Information and control center of the kernel, providing a communications channel between kernel space and user space. Many of the preceding commands query information provided by proc, e.g. cat /proc/cpuinfo.
uname Useful for determining information about your current flavor and distribution of your operating system and its version.
lsblk Provides information about block devices (disks, hard drives, flash drives, etc) connected to your system and their partitioning schemes.
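
To show how a few of the utilities in the table combine in practice, here is a minimal sketch that gathers a short system summary. It assumes a Linux host with the standard coreutils installed; the variable names are illustrative only.

```shell
#!/bin/sh
# Kernel name and release, as reported by uname.
kernel="$(uname -sr)"

# Count logical CPUs by reading /proc directly; fall back to nproc if
# /proc/cpuinfo is not readable.
if [ -r /proc/cpuinfo ]; then
  cpus="$(grep -c '^processor' /proc/cpuinfo)"
else
  cpus="$(nproc)"
fi

# Free space on the root file system. -P keeps each file system on a
# single line so that the fourth column is always "available".
root_free="$(df -Ph / | awk 'NR==2 {print $4}')"

echo "Kernel:       $kernel"
echo "Logical CPUs: $cpus"
echo "Free on /:    $root_free"
```

Running the same script on your head node and on a compute node is a quick way to confirm that both instances were deployed with the resources you expected.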

Deliverables

You will need to submit the following for scoring and evaluation by the judges:

  • Cluster Design Assignment (Part 1) [10 %]
  • Cluster Design Assignment (Part 2) [40 %]
    • One PDF Presentation Slide with Team Profiles This slide must clearly indicate your Team Name and Institution. Below each team member's photograph, indicate their
      • Name and surname,
      • Degree and Year of study,
    • Presentation Slides
    • Short Technical Brief with Cluster Design Specifications
  • Technical Knowledge Assessment [10 %]
  • Tutorials [40 %]

Cluster Design Assignment

You are tasked with designing a small cluster of at least three nodes, to the value of R 400 000.00 (ZAR), and presenting your design to the judging panel. In your design you must specify hardware and software for an operational cluster and describe how it functions. The design must be based on servers and interconnects from either HPE or Dell, and accessories from NVIDIA, AMD or Intel. You must use the prices you find in the Parts List Spreadsheet.

The primary purpose of your HPC cluster is to run one of the following codes as efficiently as possible:

You are not given a choice regarding the application selection. Your team will be told which application to optimize for on Wednesday. For now, you should investigate the codes above to understand their unique hardware and software requirements. You are required to submit a brief (half page) report on your findings to the competition organizers by 23:00 on Tuesday.

In addition, your choice of design must take into consideration:

  • Base Platform (Server),
  • Target Processing Unit (CPU / GPU),
  • Memory, Networking and Storage Requirements,
  • System and Application Dependency Software Requirements,
  • Ease of Use (Build, Assembly, Deployment),
  • Efficiency, Performance, Power Consumption and Reliability and
  • Team Management, Coordination and Planning.

Important

You may submit an additional design that extends your small R 400 000.00 cluster, up to the value of R 1 000 000.00. You may use any of the above links for this exercise, using a Dollar to Rand conversion rate of 1:20. You may use GPUs from either AMD or NVIDIA. You may utilize CPUs from either AMD or Intel. You may use either Dell or HPE as a vendor.
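
The budget check for the extended design can be sketched as follows. The 1:20 conversion rate and the R 1 000 000.00 ceiling come from the text above; the line items, prices and quantities are entirely hypothetical placeholders and must be replaced with the figures from your own parts list.

```shell
#!/bin/sh
rate=20               # Rand per US Dollar (1:20, as stated above)
budget_zar=1000000    # ceiling for the extended design

# HYPOTHETICAL line items in the form "description:unit price in USD:quantity".
items="
server:9000:3
gpu:6500:2
switch:4000:1
"

# Sum price * quantity over all line items.
total_usd=0
for item in $items; do
  price=$(echo "$item" | cut -d: -f2)
  qty=$(echo "$item" | cut -d: -f3)
  total_usd=$((total_usd + price * qty))
done

# Convert to Rand and compare against the budget.
total_zar=$((total_usd * rate))
remaining=$((budget_zar - total_zar))

echo "Total: USD $total_usd = R $total_zar"
if [ "$remaining" -ge 0 ]; then
  echo "Within budget, R $remaining to spare"
else
  echo "Over budget by R $((0 - remaining))"
fi
```

Keeping the calculation in a script or spreadsheet makes it easy to swap components and see immediately whether the design still fits.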

The 10-minute slide presentation by the whole team must include your design decisions and the features of your cluster, including: cost, hardware, software, configuration and operation. Each member of the team is required to present even though you will be assessed as a team.

After the presentation the judging panel will have an opportunity to ask questions to each member of your team. All members of your team can be questioned about any part of the cluster, so make sure you are fully familiar with the design.

Technical Knowledge Assessment

Each team must work together to complete the Technical Knowledge Assessment to the best of its ability. Team Captains must email their team's answers to the organizers no later than 23:00 on 13th July. You are required to demonstrate your understanding of the concepts in YOUR OWN WORDS. Keep your answers succinct and to the point: the answer to each question should not exceed 2-3 lines.

Tutorials

You will be evaluated on your overall progress in the tutorials. Below you will find an overview, glossary and high level breakdown of the tutorials. You must progress through four tutorials, which will be released daily. Your overall progress through the tutorials forms a large component of your score. By the end of the week you will have covered a considerable amount of content. Use the links provided if you need to refer back to a specific section and have trouble remembering where it is.

Tutorial 1 deals with introducing concepts to users and getting them started with using the virtual lab, standing up the first virtual machine instance and connecting to it remotely. The content is as follows:

  1. Checklist
  2. Network Primer
    1. Basic Networking Example (WhatIsMyIp.com)
    2. Terminal, Windows MobaXTerm and PowerShell Commands
  3. Launching your First OpenStack Virtual Machine Instance
    1. Accessing the NICIS Cloud
    2. Verify your Team's Project Workspace and Available Resources
    3. Generating SSH Keys
    4. Launch a New Instance
    5. Linux Flavors and Distributions
      1. Summary of Linux Distributions
    6. OpenStack Instance Flavors
    7. Networks, Ports, Services and Security Groups
    8. Key Pair
    9. Verify that your Instance was Successfully Deployed and Launched
    10. Associating an Externally Accessible IP Address
    11. Success State, Resource Management and Troubleshooting
  4. Introduction to Basic Linux Administration
    1. Accessing your VM Using SSH vs the OpenStack Web Console (VNC)
    2. Running Basic Linux Commands and Services
  5. Linux Binaries, Libraries and Package Management
    1. User Environment and the PATH Variable
  6. Install, Compile and Run High Performance LinPACK (HPL) Benchmark

Tutorial 2 will demonstrate how to configure and stand up a compute node, and access it using a transparently created, port-forwarding SSH tunnel between your workstation and your head node. You will then install a number of critical services across your cluster.

  1. Checklist
  2. Spinning Up a Compute Node on Sebowa (OpenStack)
    1. Compute Node Considerations
  3. Accessing Your Compute Node Using ProxyJump Directive
    1. Setting a Temporary Password on your Compute Node
  4. Understanding the Roles of the Head Node and Compute Node
    1. Terminal Multiplexers and Basic System Monitoring
  5. Manipulating Files and Directories
  6. Verifying Networking Setup
  7. Configuring a Simple Stateful Firewall Using nftables
  8. Network Time Protocol
  9. Network File System
  10. Generating an SSH Key for your NFS /home
  11. User Account Management
    1. Out-Of-Sync Users and Groups
  12. Ansible User Declaration
    1. Create User Accounts
  13. WireGuard VPN Cluster Access
  14. ZeroTier
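
The ProxyJump access pattern named in the Tutorial 2 outline can be sketched as below. This is only an illustration under stated assumptions: the host aliases, IP addresses, user name and key path are all hypothetical, and the configuration is written to a temporary file rather than to your real ~/.ssh/config.

```shell
#!/bin/sh
# Write an example client configuration to a temporary file.
# All names and addresses below are HYPOTHETICAL; substitute your own.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
# Head node: the only instance with an externally accessible address.
Host headnode
    HostName 192.0.2.10
    User rocky
    IdentityFile ~/.ssh/id_ed25519

# Compute node: reachable only on the private network, so ssh hops
# through the head node transparently.
Host compute01
    HostName 10.0.0.11
    User rocky
    ProxyJump headnode
EOF

# ssh -G prints the options ssh would use for a host without connecting,
# which is a convenient way to check that ProxyJump was picked up.
if command -v ssh >/dev/null 2>&1; then
  ssh -F "$cfg" -G compute01 | grep -i '^proxyjump'
fi
```

With equivalent entries in ~/.ssh/config, a plain ssh compute01 from your workstation would tunnel through the head node without any extra flags.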

Tutorial 3 will demonstrate how to configure, build, compile and install a range of system software and applications. You will also build these applications with different tools. Finally, you will learn how to run applications across your cluster.

  1. Checklist
  2. Managing Your Environment
    1. NFS Mounted Shared home folder and the PATH Variable
  3. Install Lmod
    1. Lmod Usage
  4. Running the High Performance LINPACK (HPL) Benchmark on Your Compute Node
    1. System Libraries
    2. Configure and Run HPL on Compute Node
  5. Building and Compiling OpenBLAS and OpenMPI Libraries from Source
  6. Intel oneAPI Toolkits and Compiler Suite
    1. Configure and Install Intel oneAPI Base and HPC Toolkits
    2. Configuring and Running HPL with Intel oneAPI Toolkit and MKL
  7. LinPACK Theoretical Peak Performance
    1. Top500 List
  8. Spinning Up a Second Compute Node Using a Snapshot
    1. Running HPL Across Multiple Nodes
  9. HPC Challenge
  10. Application Benchmarks and System Evaluation
    1. GROMACS (ADH Cubic)
    2. LAMMPS (Lennard-Jones)
    3. Qiskit (Quantum Volume)

Tutorial 4 demonstrates how to configure Docker containers to deploy a monitoring stack comprising a metrics database service, an exporting / scraping service and a metric visualization service. You will then learn the very basics of how to visualize and interpret data, and how to automate the deployment of your Sebowa OpenStack infrastructure. Lastly, you'll deploy a scheduler and submit a job to it.

  1. Checklist
  2. Cluster Monitoring
    1. Install Docker Engine, Containerd and Docker Compose
    2. Installing your Monitoring Stack
    3. Startup and Test the Monitoring Services
    4. SSH Port Local Forwarding Tunnel
    5. Create a Dashboard in Grafana
    6. Success State, Next Steps and Troubleshooting
  3. Configuring and Connecting to your Remote JupyterLab Server
    1. Visualize Your HPL Benchmark Results
    2. Visualize Your Qiskit Results
  4. Automating the Deployment of your OpenStack Instances Using Terraform
    1. Install and Initialize Terraform
    2. Generate clouds.yml and main.tf Files
    3. Generate, Deploy and Apply Terraform Plan
  5. Continuous Integration Using CircleCI
    1. Prepare GitHub Repository
    2. Reuse providers.tf and main.tf Terraform Configurations
    3. Create .circleci/config.yml File and push Project to GitHub
    4. Create CircleCI Account and Add Project
  6. Slurm Scheduler and Workload Manager
    1. Prerequisites
    2. Head Node Configuration (Server)
    3. Compute Node Configuration (Clients)
  7. GROMACS Application Benchmark
    1. Protein Visualization
    2. Benchmark 2 (1.5M Water)

Lecture Recordings

In this section you will find links to all of the livestreams of the lectures (Teams Meetings) and subsequent recordings for you to refer back to.

  1. Welcome, Introduction and Getting Started

  2. HPC Hardware, HPC Networking and Systems Administration

  3. Benchmarking, Compilation and Parallel Computing

  4. Administration and Application Visualization

    • Cluster Admin, Ansible & Containers
    • Monitoring
    • Schedulers
    • Data Visualization & Jupyter Lab
  5. Career Guidance

Contributing to the Project

Important

While we value your feedback, the following sections are primarily targeted at Contributors to the Project. As a student participating in the competition, do NOT spend your time working through any of the material below. However, we would love to have your contributions to the project after the competition.

You are strongly encouraged to contribute and improve the project by Opening and Participating in Discussions, Raising, Addressing and Resolving Issues. The following guide describes How to clone, push, and pull with git (beginners GitHub tutorial).

Steps to follow when editing existing content

In order to effectively manage the various workflows and stages of development, testing and deployment, the project is comprised of three primary branches:

  • main: Stable and production-ready deployment branch of the project.
  • stag: Staging branch which mirrors production and is used for integration testing of new features.
  • dev: Development branch for incorporating new features and bug fixes.

Editing the content directly requires the use of Git, from a terminal application, Git for Windows PowerShell or Git for MobaXTerm.

  1. Generate an SSH Key (or use an existing one).

  2. Add your SSH key to your Git profile.

    • Navigate to your 'Profile' and go to 'Settings'.
    • Under 'Access', navigate to 'SSH and GPG Keys'

      Adding SSH Keys to GitHub.

  3. git clone a local copy of the repository, to your personal work space.

    You can copy the command from GitHub itself.

    git clone git@github.com:chpc-tech-eval/chpc24-scc-nmu.git
  4. When starting work on a new feature or bug fix, create a feature branch off of the development branch and regularly get updates from dev to ensure that you remain consistent with any changes to dev:

    git checkout dev
    git pull origin dev
  5. Create a new branch to work on, e.g. git branch tutX/bugfix-or-new-feature followed by git checkout tutX/bugfix-or-new-feature, or simply use the single command git checkout -b tutX/bugfix-or-new-feature.

    • Give the branch a sensible name.
    • You are encouraged to push the branch back to the remote so that collaborators can see what you are working on as you make the changes.
  6. Make the appropriate changes and commit them locally:

    git add <relative_path_to_changed_file(s)>
    git commit -m "some_message_pertaining_to_changes_made"
  7. When you have completed editing your feature, merge any remote changes from dev and then push your local changes, back upstream to the remote repository:

    git pull origin dev # (optional) it is generally a good practice to incorporate any changes in dev into your code early and often
    git pull origin tutX/bugfix-or-new-feature # (optional) if you are collaborating on a specific feature with someone, it is important to incorporate their changes early and often
    git push origin tutX/bugfix-or-new-feature
  8. Once you are satisfied with the changes you have made, eliminate all merge conflicts by pulling all remote changes and deviations into your local working copy with git pull.

    • If you are confident that your feature does not or has not deviated from the remote dev branch, use git pull to automatically fetch and merge remote changes from dev into your feature branch.
    • Alternatively, if your branch is old, or depends on changes from the remote, use git fetch to fetch the remote changes and preview them before merging.
    • Resolve your local conflicts and merge all remote changes with git merge.
    • Once all the conflicts have been resolved, and you've successfully merged all remote changes, push your branch upstream.
  9. Create a pull request to the remote dev branch on GitHub, to incorporate your feature.

    • Or another branch, if your feature branch was adding functionality to an existing feature branch.
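
The branching steps above can be rehearsed end to end without touching GitHub. The sketch below is an assumption-laden dry run: a local bare repository stands in for the GitHub remote, and the branch name tutX/fix-typo, the file contents and the user identities are all hypothetical. Only the git commands themselves mirror the steps listed above.

```shell
#!/bin/sh
set -e
work="$(mktemp -d)"
cd "$work"

# Stand-in for the GitHub remote: a local bare repository seeded with a
# single commit on a dev branch.
git init -q --bare remote.git
git clone -q remote.git seed 2>/dev/null
cd seed
git config user.name "Seed User"
git config user.email "seed@example.com"
git checkout -q -b dev
echo "# Tutorial" > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q origin dev
cd "$work"

# Step 3: clone a local copy (here from the stand-in remote).
git clone -q --branch dev remote.git contributor
cd contributor
git config user.name "Example Contributor"
git config user.email "contributor@example.com"

# Step 4: start from an up-to-date dev branch.
git checkout -q dev
git pull -q origin dev

# Step 5: create a feature branch with a sensible name.
git checkout -q -b tutX/fix-typo

# Step 6: make a change and commit it locally.
echo "A corrected sentence." >> README.md
git add README.md
git commit -q -m "Fix typo in README"

# Step 7: merge any new changes from dev, then push the feature branch.
git pull -q origin dev
git push -q origin tutX/fix-typo

# The feature branch now exists on the remote, ready for a pull request.
git --git-dir="$work/remote.git" branch --list
```

On the real project the only differences are the clone URL and the final step, where the pull request to dev is opened through the GitHub web interface.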

Syntax and Style

Use the following guide on GitHub Markdown Syntax Editing.