740 add generic support for different gpu hardware #3371

@jakki-amd jakki-amd commented Dec 2, 2024

Description

This PR decouples the hardware layer from the front- and backend of TorchServe.
Relates to #740

'Add AMD backend support'
Rony Leppänen rleppane@amd.com

'Add AMD frontend support'
Anders Smedegaard Pedersen asmedega@amd.com

'Add Dockerfile.rocm'
Samu Tamminen stammine@amd.com
Jarkko Lehtiranta jlehtira@amd.com

'Add AMD documentation'
Anders Smedegaard Pedersen asmedega@amd.com
Rony Leppänen rleppane@amd.com
Jarkko Lehtiranta jlehtira@amd.com

Other contributions:

Bipradip Chowdhury bichowdh@amd.com
Jarkko Vainio javainio@amd.com
Tero Kemppi tekemppi@amd.com

Requirement Files

Added requirements/torch_rocm62.txt, requirements/torch_rocm61.txt and requirements/torch_rocm60.txt for easy install of dependencies needed for AMD support.

Backend

The Python backend currently supports only NVIDIA GPUs, through hardware-specific libraries. A number of its functions could also be refactored to use more general interfaces.

Changes Made to Backend

  • Use torch.cuda for detecting GPU availability and torch.version for differentiating between GPU vendors (NVIDIA, AMD)
  • Use torch.cuda for collecting GPU metrics
    • Drop the nvgpu library, a quick-and-dirty solution that calls nvidia-smi and parses its output
    • For AMD GPUs, temporarily rely on the amdsmi library directly
    • Once the bug in torch.cuda is fixed, the same functions can be used to collect metrics from different GPUs (NVIDIA, AMD)
  • Extend print_env_info for AMD GPUs and reimplement a number of functions
    • Detect versions of HIP runtime, ROCm and MIOpen
    • Collect model names of available GPUs with torch.cuda (NVIDIA, AMD)
    • Use pynvml for detecting NVIDIA driver and CUDA versions
    • Use torch for detecting compiled CUDA and cuDNN versions
  • Refactor NVIDIA-specific code in several places
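
The vendor check described in the first bullet can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name gpu_vendor and its return values are hypothetical. It relies on ROCm builds of PyTorch setting torch.version.hip and CUDA builds setting torch.version.cuda. The function takes the version object as a parameter so it can be exercised without a GPU or even without torch installed.

```python
from types import SimpleNamespace


def gpu_vendor(cuda_available, version):
    """Classify the GPU vendor from torch-style build metadata.

    `version` is expected to look like `torch.version`: ROCm builds set
    `version.hip` to a version string, CUDA builds set `version.cuda`.
    The function name and return values are illustrative assumptions.
    """
    if not cuda_available:
        return "none"
    if getattr(version, "hip", None):
        return "AMD"
    if getattr(version, "cuda", None):
        return "NVIDIA"
    return "unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None:
        # On ROCm builds, torch.cuda.is_available() also reports AMD GPUs.
        print(gpu_vendor(torch.cuda.is_available(), torch.version))
```

For a quick check without hardware, a stand-in object works: gpu_vendor(True, SimpleNamespace(hip="6.2", cuda=None)) classifies the build as AMD.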

Frontend

The Java frontend that acts as the workload manager had calls to SMIs hard-coded in a few places. This made it difficult for TorchServe to support multiple hardware vendors in a graceful manner.

Changes Made to Frontend

We've introduced a new package org.pytorch.serve.device with the classes SystemInfo and Accelerator. SystemInfo holds an array list of Accelerator objects, each of which holds static information about a specific accelerator on the machine together with its relevant metrics.

Instead of calling the SMIs directly in multiple places in the frontend code, we have abstracted the hardware away by adding an instance of SystemInfo to the pre-existing ConfigManager. The frontend can now get data from the hardware via the methods on SystemInfo without knowing the specifics of the hardware and SMIs.

To implement the specifics for each of the vendors that were already partially supported, we have created a number of utility classes that communicate with the hardware via the relevant SMI.
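
To illustrate what such a utility class does with SMI output, here is a small Python sketch of the CSV parsing idea (the frontend itself is Java, and this is not its implementation). It assumes headerless CSV rows in the order index, name, GPU utilization, memory used, memory total, which is what nvidia-smi prints for --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits. The dataclass fields, the function name and the sample rows are made up for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Accelerator:
    # Illustrative fields, loosely modeled on the Accelerator class in the PR.
    id: int
    model: str
    usage_percentage: float
    memory_used_mib: int
    memory_total_mib: int


def csv_smi_output_to_accelerators(output):
    """Parse headerless CSV rows: index, name, util %, mem used MiB, mem total MiB."""
    accelerators = []
    for row in csv.reader(io.StringIO(output), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, name, util, mem_used, mem_total = (field.strip() for field in row)
        accelerators.append(
            Accelerator(int(index), name, float(util), int(mem_used), int(mem_total))
        )
    return accelerators


# Made-up sample in the assumed column order.
SAMPLE = (
    "0, NVIDIA A100-SXM4-40GB, 17, 1024, 40960\n"
    "1, NVIDIA A100-SXM4-40GB, 0, 3, 40960\n"
)

if __name__ == "__main__":
    for accelerator in csv_smi_output_to_accelerators(SAMPLE):
        print(accelerator)
```

Keeping the parser separate from the code that runs the SMI command makes it testable against captured output, which is presumably also why the parsers are separate interfaces in the design.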

The following steps are taken in the SystemInfo constructor.

  1. Detect the relevant vendor by calling which {relevant smi} for each of the supported vendors.
    This is how vendor detection was done previously; there might be more robust ways. On Windows systems, where is used instead.
  2. Once the accelerator vendor is detected, create an instance of the relevant utility class, for example ROCmUtil for AMD.
  3. Detect the accelerators, respecting the relevant environment variable for selecting devices: HIP_VISIBLE_DEVICES for AMD, CUDA_VISIBLE_DEVICES for NVIDIA and XPU_VISIBLE_DEVICES for Intel. All devices are detected if the relevant environment variable is not set.
  4. Finally, update the metrics for the detected devices.
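
Steps 1 and 3 above can be sketched in Python as follows. This is only an illustration of the logic, not the Java code in SystemInfo. The SMI executable names (nvidia-smi, rocm-smi, xpu-smi) and both function names are assumptions; the environment variable names come from the list above. The sketch also simplifies the visible-devices variable to a comma-separated list of integer indices.

```python
import os
import shutil

# Assumed SMI executable names, used here for illustration only.
SMI_COMMANDS = {"NVIDIA": "nvidia-smi", "AMD": "rocm-smi", "INTEL": "xpu-smi"}

# Environment variables named in the PR description.
VISIBLE_DEVICES_ENV = {
    "NVIDIA": "CUDA_VISIBLE_DEVICES",
    "AMD": "HIP_VISIBLE_DEVICES",
    "INTEL": "XPU_VISIBLE_DEVICES",
}


def detect_vendor(which=shutil.which):
    """Return the first vendor whose SMI tool is found on PATH."""
    for vendor, command in SMI_COMMANDS.items():
        if which(command) is not None:
            return vendor
    return "UNKNOWN"


def visible_device_ids(vendor, detected_ids, environ=os.environ):
    """Filter detected device ids by the vendor's visible-devices variable.

    If the variable is unset, all detected devices are returned.
    Only plain integer indices are handled in this sketch.
    """
    raw = environ.get(VISIBLE_DEVICES_ENV.get(vendor, ""))
    if raw is None:
        return list(detected_ids)
    wanted = [int(token) for token in raw.split(",") if token.strip().isdigit()]
    return [device_id for device_id in wanted if device_id in detected_ids]


if __name__ == "__main__":
    vendor = detect_vendor()
    print(vendor, visible_device_ids(vendor, [0, 1, 2, 3]))
```

shutil.which covers both which and where behavior in one call; passing which and environ as parameters keeps the functions testable on machines without any SMI installed.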

The following is a class diagram showing how the new classes relate to the existing code

classDiagram
    class Accelerator {
        +Integer id
        +AcceleratorVendor vendor
        +String model
        +IAcceleratorUtility acceleratorUtility
        +Float usagePercentage
        +Float memoryUtilizationPercentage
        +Integer memoryAvailableMegabytes
        +Integer memoryUtilizationMegabytes
        +getVendor()
        +getAcceleratorModel()
        +getAcceleratorId()
        +getMemoryAvailableMegaBytes()
        +getUsagePercentage()
        +getMemoryUtilizationPercentage()
        +getMemoryUtilizationMegabytes()
        +setMemoryAvailableMegaBytes()
        +setUsagePercentage()
        +setMemoryUtilizationPercentage()
        +setMemoryUtilizationMegabytes()
        +utilizationToString()
        +updateDynamicAttributes()
    }

    class SystemInfo {
        -AcceleratorVendor acceleratorVendor
        -ArrayList<Accelerator> accelerators
        -IAcceleratorUtility acceleratorUtil
        +hasAccelerators()
        +getNumberOfAccelerators()
        +getAccelerators()
        +updateAcceleratorMetrics()
    }

    class AcceleratorVendor {
        <<enumeration>>
        AMD
        NVIDIA
        INTEL
        APPLE
        UNKNOWN
    }

    class IAcceleratorUtility {
        <<interface>>
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +getUpdatedAcceleratorsUtilization()
    }

    class ICsvSmiParser {
        <<interface>>
        +csvSmiOutputToAccelerators()
    }

    class IJsonSmiParser {
        <<interface>>
        +jsonOutputToAccelerators()
        +extractAcceleratorId()
        +jsonObjectToAccelerator()
        +extractAccelerators()
    }

    class CudaUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +parseAccelerator()
        +parseUpdatedAccelerator()
    }

    class ROCmUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +extractAccelerators()
        +extractAcceleratorId()
        +jsonObjectToAccelerator()
    }

    class XpuUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +parseDiscoveryOutput()
        +parseUtilizationOutput()
    }

    class AppleUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +jsonObjectToAccelerator()
        +extractAcceleratorId()
        +extractAccelerators()
    }

    class ConfigManager {
        -SystemInfo systemInfo
        +init(Arguments args)
    }

    class WorkerThread {
        #ConfigManager configManager
        #WorkerLifeCycle lifeCycle
    }

    class AsyncWorkerThread {
        #boolean loadingFinished
        #CountDownLatch latch
        +run()
        #connect()
    }

    class SystemInfo {
        -Logger logger
        -AcceleratorVendor acceleratorVendor
        -ArrayList<Accelerator> accelerators
        -IAcceleratorUtility acceleratorUtil
        +SystemInfo()
        -createAcceleratorUtility() IAcceleratorUtility
        -populateAccelerators()
        +hasAccelerators() boolean
        +getNumberOfAccelerators() Integer
        +static detectVendorType() AcceleratorVendor
        -static isCommandAvailable(String) boolean
        +getAccelerators() ArrayList<Accelerator>
        -updateAccelerators(List<Accelerator>)
        +updateAcceleratorMetrics()
        +getAcceleratorVendor() AcceleratorVendor
        +getVisibleDevicesEnvName() String
    }

    class Accelerator {
        +Integer id
        +AcceleratorVendor vendor
        +String model
        +Float usagePercentage
        +Float memoryUtilizationPercentage
        +Integer memoryAvailableMegabytes
        +Integer memoryUtilizationMegabytes
        +getVendor() AcceleratorVendor
        +getAcceleratorModel() String
        +getAcceleratorId() Integer
        +getUsagePercentage() Float
        +setUsagePercentage(Float)
        +setMemoryUtilizationPercentage(Float)
        +setMemoryUtilizationMegabytes(Integer)
    }

    class WorkerLifeCycle {
        -ConfigManager configManager
        -ModelManager modelManager
        -Model model
    }

    class WorkerThread {
        #ConfigManager configManager
        #int port
        #Model model
        #WorkerState state
        #WorkerLifeCycle lifeCycle

    }

    WorkerLifeCycle --> "1" ConfigManager
    WorkerLifeCycle --> "1" Model
    WorkerLifeCycle --> "1" Connector
    WorkerThread --> "1" WorkerLifeCycle

    ConfigManager "1" --> "1" SystemInfo
    ConfigManager "1" --> "*" Accelerator
    WorkerThread --> "1" ConfigManager

    WorkerThread --> "1" WorkerLifeCycle
    AsyncWorkerThread --|> WorkerThread

    SystemInfo --> "0..*" Accelerator
    SystemInfo --> "1" IAcceleratorUtility
    SystemInfo --> "1" AcceleratorVendor
    Accelerator --> "1" AcceleratorVendor
    CudaUtil ..|> IAcceleratorUtility
    CudaUtil ..|> ICsvSmiParser
    ROCmUtil ..|> IAcceleratorUtility
    ROCmUtil ..|> IJsonSmiParser
    XpuUtil ..|> IAcceleratorUtility
    XpuUtil ..|> ICsvSmiParser
    AppleUtil ..|> IAcceleratorUtility
    AppleUtil ..|> IJsonSmiParser

Documentation

  • Added the section "Hardware Support" in the table of contents
  • Moved the pages about hardware support to serve/docs/hardware_support/ and added them under "Hardware Support" in the TOC
  • Added the page "AMD Support"

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

We built a new Docker container for ROCm using Dockerfile.rocm and the build argument USE_ROCM_VERSION. For other platforms we used the build_image.sh script.

# AMD instance
docker build -f docker/Dockerfile.rocm -t torch-serve-dev-image-rocm --build-arg USE_ROCM_VERSION=rocm62 --build-arg BUILD_FROM_SRC=true .

Run containers

# AMD instance
docker run --rm -it -w /serve --device=/dev/kfd --device=/dev/dri torch-serve-dev-image-rocm bash

Tests

  • Frontend tests, CPU

Logs:

> ./frontend/gradlew -p frontend clean build
...
BUILD SUCCESSFUL in 6m 35s
  • Frontend tests, CUDA

Logs:

> ./frontend/gradlew -p frontend clean build
...
BUILD SUCCESSFUL in 6m 5s
  • Frontend tests, ROCm

Logs:

> ./frontend/gradlew -p frontend clean build
...
BUILD SUCCESSFUL in 6m 43s
  • Backend tests, CPU

Logs:

> python3 -m pytest ts/tests/unit_tests ts/torch_handler/unit_tests
============================================================================ 113 passed, 30 warnings in 38.09s ============================================================================
> cd workflow-archiver && python3 -m pytest workflow_archiver/tests/unit_tests workflow_archiver/tests/integ_tests
=================================================================================== 20 passed in 0.36s ====================================================================================
> cd model-archiver && python3 -m pytest model_archiver/tests/unit_tests model_archiver/tests/integ_tests
=================================================================================== 33 passed in 0.20s ====================================================================================
  • Backend tests, CUDA

Logs:

> python3 -m pytest ts/tests/unit_tests ts/torch_handler/unit_tests
======================================================================= 113 passed, 21 warnings in 83.76s (0:01:23) =======================================================================
> cd workflow-archiver && python3 -m pytest workflow_archiver/tests/unit_tests workflow_archiver/tests/integ_tests
=================================================================================== 20 passed in 0.31s ====================================================================================
> cd model-archiver && python3 -m pytest model_archiver/tests/unit_tests model_archiver/tests/integ_tests
=================================================================================== 33 passed in 0.20s ====================================================================================
  • Backend tests, ROCm

Logs:

> python3 -m pytest ts/tests/unit_tests ts/torch_handler/unit_tests
============================ 113 passed, 21 warnings in 48.06s ============================
> cd workflow-archiver && python3 -m pytest workflow_archiver/tests/unit_tests workflow_archiver/tests/integ_tests
=================================== 20 passed in 0.32s ====================================
> cd model-archiver && python3 -m pytest model_archiver/tests/unit_tests model_archiver/tests/integ_tests
=================================== 33 passed in 0.16s ====================================
  • Regression tests, CPU

Logs:

> git submodule update --init --recursive
> python3 test/regression_tests.py
================================================================ 163 passed, 40 skipped, 15 warnings in 2014.67s (0:33:34) ================================================================
  • Regression tests, CUDA

Logs:

> git submodule update --init --recursive
> python3 test/regression_tests.py
====================================================== 156 passed, 47 skipped, 10 warnings in 8067.30s (2:14:27) =======================================================
  • Regression tests, ROCm

Logs:

> git submodule update --init --recursive
> python3 test/regression_tests.py
FAILED test_handler.py::test_huggingface_bert_model_parallel_inference - assert 'Bloomberg has decided to publish a new report on the global economy' in '{\n  ...
=========== 1 failed, 162 passed, 40 skipped, 11 warnings in 2085.45s (0:34:45) ===========

Note: the test test_handler.py::test_huggingface_bert_model_parallel_inference fails due to:

ValueError: Input length of input_ids is 150, but max_length is set to 50. This can lead to unexpected behavior. You should consider increasing max_length or, better yet, setting max_new_tokens.

This indicates that preprocessing uses a different max_length than inference. This can be verified by looking at the handler as it was when the test was originally implemented: model.generate() uses max_length=50 by default, while the tokenizer uses max_length from setup_config (max_length=150). It seems that the BERT-based Textgeneration.mar needs an update.
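
The mismatch can be reproduced with plain arithmetic. The sketch below is a hypothetical stand-in that mirrors the length check behind the error message; it is not the transformers implementation. It treats max_length as a budget for prompt plus generated tokens, which is how model.generate() interprets it, whereas max_new_tokens counts only the generated tokens.

```python
def check_generation_length(input_length, max_length):
    """Return the tokens left for generation under a total-length budget.

    Hypothetical helper: raises if the prompt alone already fills max_length,
    mirroring the error reported by the failing test.
    """
    if input_length >= max_length:
        raise ValueError(
            f"Input length of input_ids is {input_length}, "
            f"but max_length is set to {max_length}."
        )
    return max_length - input_length


if __name__ == "__main__":
    # A 20-token prompt leaves room for 30 generated tokens.
    print(check_generation_length(20, 50))
    # The tokenizer produces 150 tokens, but generation allows 50 in total.
    try:
        check_generation_length(150, 50)
    except ValueError as error:
        print(error)
```

With the values from the test, the 150-token input already exceeds the total budget of 50, so no tokens can be generated unless the generation limit is raised or expressed as max_new_tokens.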

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@jakki-amd jakki-amd force-pushed the 740-add-generic-support-for-different-GPU-hardware branch from 70c500c to 54ef2bd Compare December 2, 2024 09:38
@jakki-amd jakki-amd marked this pull request as ready for review December 2, 2024 10:49
@jakki-amd jakki-amd force-pushed the 740-add-generic-support-for-different-GPU-hardware branch from 54ef2bd to bc96fa7 Compare December 2, 2024 10:59

@agunapal agunapal left a comment

Thank you for the PR. We are still reviewing it

# For reference:
# https://docs.docker.com/develop/develop-images/build_enhancements/

ARG BASE_IMAGE=ubuntu:24.04

The rest of the TorchServe images are still on Ubuntu 20.04, as we had issues with GitHub runners on later versions. We haven't tried this in a while.


Thanks for the feedback @agunapal

--index-url https://download.pytorch.org/whl/rocm6.2
torch==2.5.1+rocm6.2; sys_platform == 'linux'
torchvision==0.20.1+rocm6.2; sys_platform == 'linux'
torchaudio==2.5.1+rocm6.2; sys_platform == 'linux'

We haven't yet updated the PyTorch version for the rest of the project, but this should be OK. I will update it for the other platforms too.

agunapal commented Dec 9, 2024

@smedegaard Looks like the PR is breaking TorchServe on linux-aarch64. Can you please check.

It seems like this test is failing in CI https://github.com/nod-ai/serve/blob/31824434aa2acd3ff8261bd18cf6f1d925b8e22a/frontend/server/src/test/java/org/pytorch/serve/util/ConfigManagerTest.java#L110

@smedegaard smedegaard left a comment

🚀

@jakki-amd jakki-amd force-pushed the 740-add-generic-support-for-different-GPU-hardware branch from 844806c to cc0809d Compare December 18, 2024 09:56

@agunapal agunapal left a comment

LGTM. Tested with the changes manually on Graviton 3 and there are no issues.
The failures with the runners can be debugged at a later point

@agunapal agunapal added this pull request to the merge queue Dec 19, 2024
Merged via the queue into pytorch:master with commit 9bcbd22 Dec 19, 2024
9 of 14 checks passed
6 participants