From cb1c0b2018d0c6b3ca8f2bab5fd8923c25f5f5e6 Mon Sep 17 00:00:00 2001 From: tanayvarshney Date: Wed, 1 Jun 2022 15:14:16 -0700 Subject: [PATCH 1/6] fixed typos --- notebooks/dynamic-shapes.ipynb | 36 +++++++++++++++++----------------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/notebooks/dynamic-shapes.ipynb b/notebooks/dynamic-shapes.ipynb index a0ceaab576..b939e12db1 100644 --- a/notebooks/dynamic-shapes.ipynb +++ b/notebooks/dynamic-shapes.ipynb @@ -7,7 +7,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Copyright 2020 NVIDIA Corporation. All Rights Reserved.\n", + "# Copyright 2022 NVIDIA Corporation. All Rights Reserved.\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", @@ -36,14 +36,14 @@ "id": "73703695", "metadata": {}, "source": [ - "Torch-TensorRT is a compiler for PyTorch/TorchScript, targeting NVIDIA GPUs via NVIDIA's TensorRT Deep Learning Optimizer and Runtime. Unlike PyTorch's Just-In-Time (JIT) compiler, Torch-TensorRT is an Ahead-of-Time (AOT) compiler, meaning that before you deploy your TorchScript code, you go through an explicit compile step to convert a standard TorchScript program into an module targeting a TensorRT engine. Torch-TensorRT operates as a PyTorch extention and compiles modules that integrate into the JIT runtime seamlessly. After compilation using the optimized graph should feel no different than running a TorchScript module. You also have access to TensorRT's suite of configurations at compile time, so you are able to specify operating precision (FP32/FP16/INT8) and other settings for your module.\n", + "Torch-TensorRT is a compiler for PyTorch/TorchScript, targeting NVIDIA GPUs via NVIDIA's TensorRT Deep Learning Optimizer and Runtime. Unlike PyTorch's Just-In-Time (JIT) compiler, Torch-TensorRT is an Ahead-of-Time (AOT) compiler, meaning that before you deploy your TorchScript code, you go through an explicit compile step to convert a standard TorchScript program into a module targeting a TensorRT engine. Torch-TensorRT operates as a PyTorch extension and compiles modules that integrate into the JIT runtime seamlessly. After compilation, using the optimized graph should feel no different than running a TorchScript module. You also have access to TensorRT's suite of configurations at compile-time, so you are able to specify operating precision (FP32/FP16/INT8) and other settings for your module.\n", "\n", - "We highly encorage users to use our NVIDIA's [PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) to run this notebook. It comes packaged with a host of NVIDIA libraries and optimizations to widely used third party libraries. This container is tested and updated on a monthly cadence!\n", + "We highly encourage users to run this notebook using our NVIDIA's [PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). It comes packaged with a host of NVIDIA libraries and optimizations to widely used third-party libraries. In addition, this container is tested and updated on a monthly cadence!\n", "\n", "This notebook has the following sections:\n", - "1. [TL;DR Explanation](#1)\n", - "1. [Setting up the model](#2)\n", - "1. [Working with Dynamic shapes in Torch TRT](#3)" + "1. TL;DR Explanation\n", + "1. Setting up the model\n", + "1. 
Working with Dynamic shapes in Torch TRT" ] }, { @@ -633,7 +633,7 @@ "id": "21402d53", "metadata": {}, "source": [ - "Let's test our util functions on the model we have set up, starting with simple predictions" + "Let's test our util functions on the model we have set up, starting with simple predictions." ] }, { @@ -820,19 +820,19 @@ "source": [ "---\n", "## Working with Dynamic shapes in Torch TRT\n", - "\n", - "Enabling \"Dynamic Shaped\" tensors to be used is essentially enabling the ability to defer defining the shape of tensors until runetime. Torch TensorRT simply leverages TensorRT's Dynamic shape support. You can read more about TensorRT's implementation in the [TensorRT Documentation](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes).\n", - "\n", + " \n", + "Enabling \"Dynamic Shaped\" tensors to be used is essentially enabling the ability to defer defining the shape of tensors until run-time. Torch TensorRT simply leverages TensorRT's Dynamic shape support. You can read more about TensorRT's implementation in the [TensorRT Documentation](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes).\n", + " \n", "#### How can you use this feature?\n", - "\n", + " \n", "To make use of dynamic shapes, you need to provide three shapes:\n", "* `min_shape`: The minimum size of the tensor considered for optimizations.\n", - "* `opt_shape`: The optimizations will be done with an effort to maximize performance for this shape.\n", - "* `min_shape`: The maximum size of the tensor considered for optimizations.\n", - "\n", - "Generally, users can expect best performance within the specified ranges. Performance for other shapes may be be lower for other shapes (depending on the model ops and GPU used)\n", - "\n", - "In the following example, we will showcase varing batch size, which is the zeroth dimension of our input tensors. As Convolution operations require that the channel dimension be a build-time constant, we won't be changing sizes of other channels in this example, but for models which contain ops conducive to changes in other channels, this functionality can be freely used." + "* `opt_shape`: The optimizations will be done in an effort to maximize performance for this shape.\n", + "* `max_shape`: The maximum size of the tensor considered for optimizations.\n", + " \n", + "Generally, users can expect the best performance within the specified ranges. Performance may be lower for other shapes (depending on the model ops and GPU used).\n", + " \n", + "In the following example, we will showcase varying the batch size, which is the zeroth dimension of our input tensors. As Convolution operations require that the channel dimension be a build-time constant, we won't be changing the sizes of other channels in this example, but for models which contain ops conducive to changes in other channels, this functionality can be freely used."
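As a rough sketch of how these three shapes might be passed to the compiler (the ``min_shape``/``opt_shape``/``max_shape`` arguments of ``torch_tensorrt.Input``; the 1/8/16 batch range and the ``model`` variable are illustrative, not values taken from this notebook)::

    import torch
    import torch_tensorrt

    # Sketch only: compile `model` (e.g. a ResNet50) with a dynamic batch dimension.
    # The 1/8/16 batch range below is an example, not a recommendation.
    trt_model = torch_tensorrt.compile(model,
        inputs=[torch_tensorrt.Input(
            min_shape=(1, 3, 224, 224),
            opt_shape=(8, 3, 224, 224),
            max_shape=(16, 3, 224, 224),
            dtype=torch.float32)],
        enabled_precisions={torch.half}
    )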
] }, { @@ -1015,7 +1015,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.13" + "version": "3.9.6" } }, "nbformat": 4, From c2cb9699bbe56e560e8d93d8c2af3bc7c3da4eeb Mon Sep 17 00:00:00 2001 From: tanayvarshney Date: Tue, 14 Jun 2022 10:14:09 -0700 Subject: [PATCH 2/6] added triton deplolyment --- docsrc/index.rst | 2 + .../deploy_torch_tensorrt_to_triton.rst | 212 ++++++++++++++++++ 2 files changed, 214 insertions(+) create mode 100644 docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst diff --git a/docsrc/index.rst b/docsrc/index.rst index 30b0beddc6..1ae50f7695 100644 --- a/docsrc/index.rst +++ b/docsrc/index.rst @@ -28,6 +28,7 @@ Getting Started * :ref:`use_from_pytorch` * :ref:`runtime` * :ref:`using_dla` +* :ref:`deploy_torch_tensorrt_to_triton` .. toctree:: :caption: Getting Started @@ -43,6 +44,7 @@ Getting Started tutorials/use_from_pytorch tutorials/runtime tutorials/using_dla + tutorials/deploy_torch_tensorrt_to_triton .. toctree:: :caption: Notebooks diff --git a/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst b/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst new file mode 100644 index 0000000000..454d37a33b --- /dev/null +++ b/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst @@ -0,0 +1,212 @@ +Deploying a Torch-TensorRT model (to Triton) +============================================ + +Optimization and deployment go hand in hand in a discussion about Machine +Learning infrastructure. For a Torch-TensorRT user, network level optimzation +to get the maximum performance would already be an area of expertize. + +However, serving this optimized model comes with it's own set of considerations +and challenges like: building an infrastructure to support concorrent model +executions, supporting clients over HTTP or gRPC and more. + +The `Triton Inference Server `__ +solves the aforementioned and more. Let's discuss step-by-step, the process of +optimizing a model with Torch-TensorRT, deploying it on Triton Inference +Server, and building a client to query the model. + +Step 1: Optimize your model with Torch-TensorRT +----------------------------------------------- + +Most Torch-TensorRT users will be familiar with this step. For the purpose of +this demoonstration, we will be using a ResNet50 model from Torchhub. + +Let’s first pull the NGC PyTorch Docker container. You may need to create +an account and get the API key from `here `__. +Sign up and login with your key (follow the instructions +`here `__ after signing up). + +:: + + # is the yy:mm for the publishing tag for NVIDIA's Pytorch + # container; eg. 22.04 + + docker run -it --gpus all -v /path/to/folder:/resnet50_eg nvcr.io/nvidia/pytorch:-py3 + +Once inside the container, we can proceed to download a ResNet model from +Torchhub and optimize it with Torch-TensorRT. + +:: + + import torch + import torch_tensorrt + torch.hub._validate_not_a_forked_repo=lambda a,b,c: True + + # load model + model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda") + + # Compile with Torch TensorRT; + trt_model = torch_tensorrt.compile(model, + inputs= [torch_tensorrt.Input((1, 3, 224, 224))], + enabled_precisions= { torch.half} # Run with FP32 + ) + + # Save the model + torch.jit.save(trt_model, "model.pt") + +The next step in the process is to set up a Triton Inference Server. 
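Before moving on, a quick sanity check that the serialized module reloads and runs can save debugging time later. A minimal sketch, assuming the ``model.pt`` produced by the save call above and a GPU-enabled environment::

    import torch
    import torch_tensorrt  # importing this makes the TensorRT runtime ops available for deserialization

    # Reload the module saved above and run one dummy inference.
    trt_model = torch.jit.load("model.pt")
    dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
    with torch.no_grad():
        output = trt_model(dummy_input)
    print(output.shape)  # a ResNet50 head should report torch.Size([1, 1000])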
+ +Step 2: Set Up Triton Inference Server +-------------------------------------- + +If you are new to the Triton Inference Server and want to learn more, we +highly recommend to checking our `Github +Repository `__. + +To use Triton, we need to make a model repository. A model repository, as the +name suggested, is a repository of the models the Inference server hosts. While +Triton can serve models from multiple repositories, in this example, we will +discuss the simplest possible form of the model repository. + +The structure of this repository should look something like this: + +:: + + model_repository + | + +-- resnet50 + | + +-- config.pbtxt + +-- 1 + | + +-- model.pt + +There are two files that Triton requires to serve the model: the model itself +and a model configuration file which is typically provided in ``config.pbtxt``. +For the model we prepared in step 1, the following configuration can be used: + +:: + + name: "resnet50" + platform: "pytorch_libtorch" + max_batch_size : 0 + input [ + { + name: "input__0" + data_type: TYPE_FP32 + dims: [ 3, 224, 224 ] + reshape { shape: [ 1, 3, 224, 224 ] } + } + ] + output [ + { + name: "output__0" + data_type: TYPE_FP32 + dims: [ 1, 1000 ,1, 1] + reshape { shape: [ 1, 1000 ] } + } + ] + +The ``config.pbtxt`` file is used to describe the exact model configuration +with details like the names and shapes of the input and output layer(s), +datatypes, scheduling and batching details and more. If you are new to Triton, +we highly encourage you to check out this `section of our +documentation `__ +for more details. + +With the model repository setup, we can proceed to launch the Triton server +with the docker command below. + +:: + + # Make sure that the TensorRT version in the Triton container + # and TensorRT version in the environment used to optimize the model + # are the same. + + docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:-py3 tritonserver --model-repository=/models + +This should spin up a Triton Inference server. Next step, building a simple +http client to query the server. + +Step 3: Building a Triton Client to Query the Server +---------------------------------------------------- + +Before proceeding, make sure to have a sample image on hand. If you don't +have one, download an example image to test inference. In this section, we +will be going over a very basic client. For a variety of more fleshed out +examples, refer to the `Triton Client Repository `__ + +:: + + wget -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg" + +We then need to install dependencies for building a python client. These will +change from client to client. For a full list of all languages supported by Triton, +please refer to `Triton's client repository `__. + +:: + + pip install torchvision + pip install attrdict + pip install nvidia-pyindex + pip install tritonclient[all] + +Let's jump into the client. Firstly, we write a small preprocessing function to +resize and normalize the query image. 
+ +:: + + import numpy as np + from torchvision import transforms + from PIL import Image + import tritonclient.http as httpclient + from tritonclient.utils import triton_to_np_dtype + + # preprocessing function + def rn50_preprocess(img_path="img1.jpg"): + img = Image.open(img_path) + preprocess = transforms.Compose([ + transforms.Resize(256), + transforms.CenterCrop(224), + transforms.ToTensor(), + transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + ]) + return preprocess(img).numpy() + + transformed_img = rn50_preprocess() + +Building a client requires three basic points. Firstly, we setup a connection +with the Triton Inference Server. + +:: + + # Setting up client + triton_client = httpclient.InferenceServerClient(url="localhost:8000") + +Secondly, we specify the names of the input and output layer(s) of our model. + +:: + + test_input = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32") + test_input.set_data_from_numpy(transformed_img, binary_data=True) + + test_output = httpclient.InferRequestedOutput("output__0", binary_data=True, class_count=1000) + +Lastly, we send an inference request to the Triton Inference Server. + +:: + + # Querying the server + results = triton_client.infer(model_name="resnet50", inputs=[test_input], outputs=[test_output]) + test_output_fin = results.as_numpy('output__0') + print(test_output_fin[:5]) + +The output of the same should look like below: + +:: + + [b'12.468750:90' b'11.523438:92' b'9.664062:14' b'8.429688:136' + b'8.234375:11'] + +The output format here is ``:``. +To learn how to map these to the label names and more, refer to our +`documentation `__. From 326eaeb075aa575d2324eae5bb56b4bd4514fcf5 Mon Sep 17 00:00:00 2001 From: tanayvarshney Date: Wed, 15 Jun 2022 15:34:53 -0700 Subject: [PATCH 3/6] ammending deployment on triton docs --- .../deploy_torch_tensorrt_to_triton.rst | 22 +++++++++---------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst b/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst index 454d37a33b..3dee0f0184 100644 --- a/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst +++ b/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst @@ -2,8 +2,8 @@ Deploying a Torch-TensorRT model (to Triton) ============================================ Optimization and deployment go hand in hand in a discussion about Machine -Learning infrastructure. For a Torch-TensorRT user, network level optimzation -to get the maximum performance would already be an area of expertize. +Learning infrastructure. Once network level optimzation are done +to get the maximum performance, the next step would be to deploy it. However, serving this optimized model comes with it's own set of considerations and challenges like: building an infrastructure to support concorrent model @@ -18,7 +18,7 @@ Step 1: Optimize your model with Torch-TensorRT ----------------------------------------------- Most Torch-TensorRT users will be familiar with this step. For the purpose of -this demoonstration, we will be using a ResNet50 model from Torchhub. +this demonstration, we will be using a ResNet50 model from Torchhub. Let’s first pull the NGC PyTorch Docker container. You may need to create an account and get the API key from `here `__. @@ -30,7 +30,7 @@ Sign up and login with your key (follow the instructions # is the yy:mm for the publishing tag for NVIDIA's Pytorch # container; eg. 
22.04 - docker run -it --gpus all -v /path/to/folder:/resnet50_eg nvcr.io/nvidia/pytorch:-py3 + docker run -it --gpus all -v /path/to/local/folder/to/copy/model:/resnet50_eg nvcr.io/nvidia/pytorch:-py3 Once inside the container, we can proceed to download a ResNet model from Torchhub and optimize it with Torch-TensorRT. @@ -180,25 +180,25 @@ with the Triton Inference Server. :: # Setting up client - triton_client = httpclient.InferenceServerClient(url="localhost:8000") + client = httpclient.InferenceServerClient(url="localhost:8000") Secondly, we specify the names of the input and output layer(s) of our model. :: - test_input = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32") - test_input.set_data_from_numpy(transformed_img, binary_data=True) + inputs = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32") + inputs.set_data_from_numpy(transformed_img, binary_data=True) - test_output = httpclient.InferRequestedOutput("output__0", binary_data=True, class_count=1000) + outputs = httpclient.InferRequestedOutput("output__0", binary_data=True, class_count=1000) Lastly, we send an inference request to the Triton Inference Server. :: # Querying the server - results = triton_client.infer(model_name="resnet50", inputs=[test_input], outputs=[test_output]) - test_output_fin = results.as_numpy('output__0') - print(test_output_fin[:5]) + results = client.infer(model_name="resnet50", inputs=[inputs], outputs=[outputs]) + inference_output = results.as_numpy('output__0') + print(inference_output[:5]) The output of the same should look like below: From d13127fcfb274e557bac31c8f0553a10f294d738 Mon Sep 17 00:00:00 2001 From: tanayvarshney Date: Thu, 16 Jun 2022 15:29:25 -0700 Subject: [PATCH 4/6] ammending triton deployment documentation --- docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst b/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst index 3dee0f0184..144f390b60 100644 --- a/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst +++ b/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst @@ -30,7 +30,7 @@ Sign up and login with your key (follow the instructions # is the yy:mm for the publishing tag for NVIDIA's Pytorch # container; eg. 22.04 - docker run -it --gpus all -v /path/to/local/folder/to/copy/model:/resnet50_eg nvcr.io/nvidia/pytorch:-py3 + docker run -it --gpus all -v $(pwd):/workspace nvcr.io/nvidia/pytorch:-py3 Once inside the container, we can proceed to download a ResNet model from Torchhub and optimize it with Torch-TensorRT. From a256e6ae9859ff01ca051b1c565bc7bf40553764 Mon Sep 17 00:00:00 2001 From: tanayvarshney Date: Thu, 16 Jun 2022 15:38:23 -0700 Subject: [PATCH 5/6] ammending triton deployment documentation --- docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst b/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst index 144f390b60..e134746818 100644 --- a/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst +++ b/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst @@ -30,7 +30,7 @@ Sign up and login with your key (follow the instructions # is the yy:mm for the publishing tag for NVIDIA's Pytorch # container; eg. 
22.04 - docker run -it --gpus all -v $(pwd):/workspace nvcr.io/nvidia/pytorch:-py3 + docker run -it --gpus all -v ${PWD}:/workspace nvcr.io/nvidia/pytorch:-py3 Once inside the container, we can proceed to download a ResNet model from Torchhub and optimize it with Torch-TensorRT. From 2c01adc1af253fc3df543c4ada3fdafdce0321c2 Mon Sep 17 00:00:00 2001 From: tanayvarshney Date: Thu, 16 Jun 2022 16:44:26 -0700 Subject: [PATCH 6/6] ammending triton deployment documentation --- docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst b/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst index e134746818..540353d235 100644 --- a/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst +++ b/docsrc/tutorials/deploy_torch_tensorrt_to_triton.rst @@ -20,7 +20,7 @@ Step 1: Optimize your model with Torch-TensorRT Most Torch-TensorRT users will be familiar with this step. For the purpose of this demonstration, we will be using a ResNet50 model from Torchhub. -Let’s first pull the NGC PyTorch Docker container. You may need to create +Let’s first pull the `NGC PyTorch Docker container `__. You may need to create an account and get the API key from `here `__. Sign up and login with your key (follow the instructions `here `__ after signing up). @@ -30,7 +30,8 @@ Sign up and login with your key (follow the instructions # is the yy:mm for the publishing tag for NVIDIA's Pytorch # container; eg. 22.04 - docker run -it --gpus all -v ${PWD}:/workspace nvcr.io/nvidia/pytorch:-py3 + docker run -it --gpus all -v ${PWD}:/scratch_space nvcr.io/nvidia/pytorch:-py3 + cd /scratch_space Once inside the container, we can proceed to download a ResNet model from Torchhub and optimize it with Torch-TensorRT. @@ -53,7 +54,8 @@ Torchhub and optimize it with Torch-TensorRT. # Save the model torch.jit.save(trt_model, "model.pt") -The next step in the process is to set up a Triton Inference Server. +After copying the model, exit the container. The next step in the process +is to set up a Triton Inference Server. Step 2: Set Up Triton Inference Server -------------------------------------- @@ -114,7 +116,7 @@ documentation `__ for the pull tag for the container. :: @@ -122,7 +124,7 @@ with the docker command below. # and TensorRT version in the environment used to optimize the model # are the same. - docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:-py3 tritonserver --model-repository=/models + docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /full/path/to/the_model_repository/model_repository:/models nvcr.io/nvidia/tritonserver:-py3 tritonserver --model-repository=/models This should spin up a Triton Inference server. Next step, building a simple http client to query the server.
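As a quick first check on the client side, a readiness probe can confirm that the server is live and the model loaded before any inference request is sent. A small sketch using the same ``tritonclient`` package installed in Step 3, assuming the default HTTP port 8000 used throughout this tutorial::

    import tritonclient.http as httpclient

    # Probe the server and the resnet50 model before sending real requests.
    client = httpclient.InferenceServerClient(url="localhost:8000")
    assert client.is_server_live(), "Triton server is not live"
    assert client.is_model_ready("resnet50"), "resnet50 model is not ready"
    print("Triton is up and resnet50 is ready to serve requests")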