How to Allocate Host Mem for dynamic model with batch_size > 1? #1564

Closed
ThuyHoang9001 opened this issue Oct 20, 2021 · 9 comments
Labels: question (Further information is requested), triaged (Issue has been triaged by maintainers)

Comments

ThuyHoang9001 commented Oct 20, 2021

Description

CUDA host memory allocation fails.

Environment

TensorRT Version: 8.2.0.6
GPU Type: TU102 [GeForce RTX 2080 Ti]
Nvidia Driver Version:
CUDA Version: 11.4.2
CUDNN Version:
Operating System + Version: Linux 20.0.4
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.9.1+cu102
Baremetal or Container (if container which image + tag):

Steps To Reproduce

I tried to allocate host memory for a dynamic model with batch_size > 1:

    context.set_binding_shape(0, (mBatchSize, 3, 112, 112))   
    for binding in engine:
        print('bingding:', binding, engine.get_binding_shape(binding))
        size = trt.volume(engine.get_binding_shape(binding))* mBatchSize
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # dims = context.get_binding_shape(binding)
        # if dims[0] < 0:
        #       size *= -1
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        cuda_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(cuda_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            self.input_w = engine.get_binding_shape(binding)[-1]
            self.input_h = engine.get_binding_shape(binding)[-2]
            host_inputs.append(host_mem)
            cuda_inputs.append(cuda_mem)
        else:
            host_outputs.append(host_mem)
            cuda_outputs.append(cuda_mem)

But it fails as shown below:
bingding: input (-1, 3, 112, 112)
Traceback (most recent call last):
File "infer_insight_face.py", line 434, in
trt_wrapper = TRTClass(engine_file_path)
File "infer_insight_face.py", line 103, in init
host_mem = cuda.pagelocked_empty(size, dtype)
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory
[10/20/2021-01:41:09] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::35] Error Code 1: Cuda Runtime (invalid argument)
Segmentation fault (core dumped)

ttyio (Collaborator) commented Dec 10, 2021

@ThuyHoang9001, could you use context.get_binding_shape for engines with dynamic shapes? Thanks.
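For reference, a minimal sketch of that suggestion against the snippet in the original report (deprecated bindings API, TensorRT 8.x; context, engine, mBatchSize, the buffer lists, and the pycuda/tensorrt imports are assumed to already be in scope):

context.set_binding_shape(0, (mBatchSize, 3, 112, 112))  # resolves the -1 batch dim on the context
for idx, binding in enumerate(engine):
    # Query the context, not the engine: after set_binding_shape the shape is
    # fully specified, so the volume is positive and no extra batch multiply is needed.
    shape = context.get_binding_shape(idx)
    size = trt.volume(shape)
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host_mem = cuda.pagelocked_empty(size, dtype)
    cuda_mem = cuda.mem_alloc(host_mem.nbytes)
    bindings.append(int(cuda_mem))
    if engine.binding_is_input(binding):
        host_inputs.append(host_mem)
        cuda_inputs.append(cuda_mem)
    else:
        host_outputs.append(host_mem)
        cuda_outputs.append(cuda_mem)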

ttyio added the question, Topic: Dynamic Shape, and triaged labels on Dec 10, 2021
ttyio (Collaborator) commented Jan 25, 2022

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!

ttyio closed this as completed on Jan 25, 2022
mfoglio commented Jun 28, 2022

Hi @ttyio, I have the same issue here. I am trying to run a model with a dynamic batch size: engine.get_binding_shape(binding) returns (-1, 3, 224, 224), and as a consequence, when computing the size with size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size, the size is negative and the memory allocation fails with pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory.
As a workaround I am computing the size using size = abs(trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size).
However, when I later run inference, I get the error:

[TensorRT] ERROR: 3: [executionContext.cpp::resolveSlots::1495] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::1495, condition: allInputDimensionsSpecified(routine)
)
[TensorRT] ERROR: 2: [executionContext.cpp::enqueueInternal::360] Error Code 2: Internal Error (Could not resolve slots: )

Here's my code:

import numpy as np
import requests
from PIL import Image
import tensorrt as trt
import torch
from torchvision import transforms
import torchvision.transforms.functional as F

import pycuda.driver as cuda
import pycuda.autoinit


class HostDeviceMem(object):
    """ Copied from https://github.com/NVIDIA/TensorRT/blob/main/samples/python/common.py """
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


class MyModel:

    def __init__(self, engine_path):
        self.engine_path = engine_path
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)
        self.engine = self.load_engine(self.runtime, self.engine_path)
        self.inputs, self.outputs, self.bindings, self.stream = self.allocate_buffers(self.engine)
        self.context = self.engine.create_execution_context()

        # PyTorch preprocessing
        IMAGE_SIZE = 224
        NORMALIZE_MEAN = torch.tensor([0.485, 0.456, 0.406])
        NORMALIZE_STD = torch.tensor([0.226, 0.226, 0.266])
        self.preprocessing_transforms = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=NORMALIZE_MEAN, std=NORMALIZE_STD),   # todo: is it between -1 and 1?
            SquarePad(),   # SquarePad is a custom transform defined elsewhere (not shown here)
            transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
        ])
        self.input_dtype = np.float32

    @staticmethod
    def download_image(image_url: str) -> Image.Image:
        return Image.open(requests.get(image_url, stream=True).raw)

    @staticmethod
    def load_engine(trt_runtime, engine_path):
        """ Copied from https://github.com/NVIDIA/TensorRT/blob/main/samples/python/common.py """
        trt.init_libnvinfer_plugins(None, "")
        with open(engine_path, 'rb') as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
        return engine

    @staticmethod
    def allocate_buffers(engine):
        """ Copied from https://github.com/NVIDIA/TensorRT/blob/main/samples/python/common.py """
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()
        for binding in engine:
            size = abs(trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size)
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(int(device_mem))
            # Append to the appropriate list.
            if engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))
        return inputs, outputs, bindings, stream

    @staticmethod
    def do_inference_v2(context, bindings, inputs, outputs, stream):
        """ Copied from https://github.com/NVIDIA/TensorRT/blob/main/samples/python/common.py """
        # Transfer input data to the GPU.
        [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
        # Run inference.
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        # Transfer predictions back from the GPU.
        [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
        # Synchronize the stream
        stream.synchronize()
        # Return only the host outputs.
        return [out.host for out in outputs]

    def infer(self, image: Image.Image):
        image = self._preprocessing(image)
        batch = np.expand_dims(image, 0)
        output = self._trt_infer(x=batch, batch_size=1)
        return output

    def _preprocessing(self, image: Image.Image):
            image = self.preprocessing_transforms(image)
            image = np.array(image)
            return image

    def _trt_infer(self, x: np.array, batch_size: int) -> np.array:
        x = x.astype(self.input_dtype)
        np.copyto(self.inputs[0].host, x.ravel())
        return self.do_inference_v2(self.context, self.bindings, self.inputs, self.outputs, self.stream)


if __name__ == "__main__":

    model = MyModel(engine_path="model.engine")
    image_urls = [
        "https://www.kbb.com/wp-content/uploads/2020/10/2020-ford-expedition-rear.jpg?w=300&crop=1&strip=all"
    ]
    for image_url in image_urls:
        image = model.download_image(image_url)
        output = model.infer(image)
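For context, the resolveSlots / allInputDimensionsSpecified error above is what TensorRT reports when a dynamic input shape has not been set on the execution context before enqueueing. A minimal sketch of that missing step against the _trt_infer method above, assuming input binding index 0 and a single optimization profile (the buffers should also be sized from the profile's maximum shape rather than abs() of the -1 shape):

def _trt_infer(self, x: np.ndarray, batch_size: int) -> np.ndarray:
    x = x.astype(self.input_dtype)
    # Resolve the dynamic batch dimension before enqueueing; without this,
    # execute_async_v2 fails with allInputDimensionsSpecified.
    self.context.set_binding_shape(0, (batch_size, 3, 224, 224))
    np.copyto(self.inputs[0].host[: x.size], x.ravel())
    return self.do_inference_v2(self.context, self.bindings, self.inputs, self.outputs, self.stream)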

@abdulazizab2

(quoting @mfoglio's comment above in full)

Hey, I am also trying to find a workaround for this.
I think taking the absolute value of the size is fine as long as you are allocating buffers for the correct number of bytes. However, you are only allocating buffers when initializing the object.
Say your first input has shape (1, 3, 224, 224); then you are allocating 1 * 3 * 224 * 224 elements. But when the next input varies in size, say (3, 3, 224, 224), you have to reallocate the buffers because the size has changed.

The thing here is that you might not need a dynamic engine at all: you can set the axis you want to vary to 1 and later multiply by the batch size (or whatever axis of interest, as long as it is supported). So a static input shape might work if you change the buffer allocation a little bit.

However, I am still experimenting with this; the code runs, but I am not sure about the logic since I haven't visualized the results yet.

Did you find a workaround that we could discuss?
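For what it is worth, one pattern that avoids reallocating per batch is to size every buffer once for the optimization profile's maximum shape and then only change the shape on the context per request. A rough sketch against the allocate_buffers above, assuming a single profile whose binding 0 is the input (engine.get_profile_shape returns (min, opt, max) dims; actual_batch_size is a placeholder):

context = engine.create_execution_context()
max_input_shape = engine.get_profile_shape(0, 0)[2]   # max dims of binding 0 under profile 0
context.set_binding_shape(0, max_input_shape)         # worst case, so output shapes resolve too

inputs, outputs, bindings = [], [], []
for idx, binding in enumerate(engine):
    shape = context.get_binding_shape(idx)            # fully specified, no -1 left
    size = trt.volume(shape)
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)
    bindings.append(int(device_mem))
    (inputs if engine.binding_is_input(binding) else outputs).append(HostDeviceMem(host_mem, device_mem))

# Per request, only the shape changes; the max-sized buffers are reused.
context.set_binding_shape(0, (actual_batch_size, 3, 224, 224))  # actual_batch_size: this request's batch

This keeps the pinned host memory bounded by the profile maximum instead of going negative (or growing) with the -1 shape.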

asrays commented Nov 29, 2022

@abdulazizab2 I have exactly the same issue: my shape is (-1, 32, 32, 1) and I get the same error if I take the abs of its volume.
Did you find any solution?

@lix19937

[TensorRT] ERROR: 3: [executionContext.cpp::resolveSlots::1495] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::1495, condition: allInputDimensionsSpecified(routine)
)
[TensorRT] ERROR: 2: [executionContext.cpp::enqueueInternal::360] Error Code 2: Internal Error (Could not resolve slots: )

Does this just mean that the dynamic shape was not set correctly? @ttyio

ttyio (Collaborator) commented May 30, 2023

@lix19937 yes. We also have the all_binding_shapes_specified API to check whether all dynamic binding shapes have been specified; see https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/ExecutionContext.html
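For example, a minimal sketch of that check with the bindings API (the binding index, shape, bindings list, and stream are placeholders from the earlier snippets):

context.set_binding_shape(0, (batch_size, 3, 224, 224))
if not context.all_binding_shapes_specified:
    raise RuntimeError("some input binding shapes are still dynamic (-1)")
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)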

@haiderasad

@ttyio Hey, I am facing a problem with dynamic batching, basically how to set the host memory size according to a varying input batch size at runtime.

Can you provide an example script in Python?
Below is my code:

import ctypes
import numpy as np
import tensorrt as trt
from cuda import cuda, cudart
import cv2 as cv
try:
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError

EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def check_cuda_err(err):
    if isinstance(err, cuda.CUresult):
        if err != cuda.CUresult.CUDA_SUCCESS:
            raise RuntimeError("Cuda Error: {}".format(err))
    elif isinstance(err, cudart.cudaError_t):
        if err != cudart.cudaError_t.cudaSuccess:
            raise RuntimeError("Cuda Runtime Error: {}".format(err))
    else:
        raise RuntimeError("Unknown error type: {}".format(err))

def cuda_call(call):
    err, res = call[0], call[1:]
    check_cuda_err(err)
    if len(res) == 1:
        res = res[0]
    return res

def GiB(val):
    return val * 1 << 30

class HostDeviceMem:
    def __init__(self, size: int, dtype: np.dtype, name= None, shape = None, format= None):
        nbytes = size * dtype.itemsize
        host_mem = cuda_call(cudart.cudaMallocHost(nbytes))
        pointer_type = ctypes.POINTER(np.ctypeslib.as_ctypes_type(dtype))

        self._host = np.ctypeslib.as_array(ctypes.cast(host_mem, pointer_type), (size,))
        self._device = cuda_call(cudart.cudaMalloc(nbytes))
        self._nbytes = nbytes
        self._name = name
        self._shape = shape
        self._format = format
        self._dtype = dtype

    @property
    def host(self) -> np.ndarray:
        return self._host

    @host.setter
    def host(self, arr: np.ndarray):
        if arr.size > self.host.size:
            raise ValueError(f"Tried to fit an array of size {arr.size} into host memory of size {self.host.size}")
        np.copyto(self.host[:arr.size], arr.flat, casting='safe')

    @property
    def device(self) -> int:
        return self._device

    @property
    def nbytes(self) -> int:
        return self._nbytes

    @property
    def name(self):
        return self._name

    @property
    def shape(self):
        return self._shape

    @property
    def format(self):
        return self._format

    @property
    def dtype(self) -> np.dtype:
        return self._dtype

    def __str__(self):
        return f"Host:\n{self.host}\nDevice:\n{self.device}\nSize:\n{self.nbytes}\n"

    def __repr__(self):
        return self.__str__()

    def free(self):
        cuda_call(cudart.cudaFree(self.device))
        cuda_call(cudart.cudaFreeHost(self.host.ctypes.data))

def allocate_buffers(engine: trt.ICudaEngine, profile_idx= None):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda_call(cudart.cudaStreamCreate())
    tensor_names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]
    for binding in tensor_names:
        format = engine.get_tensor_format(binding)
        
        
        # Pick the profile's max dims ([-1] of (min, opt, max)) so the buffer covers the largest shape.
        shape = engine.get_tensor_shape(binding) if profile_idx is None else engine.get_tensor_profile_shape(binding, profile_idx)[-1]
        shape_valid = np.all([s >= 0 for s in shape])
        if not shape_valid and profile_idx is None:
            raise ValueError(f"Binding {binding} has dynamic shape, but no profile was specified.")
        size = trt.volume(shape)
        dtype = np.dtype(trt.nptype(engine.get_tensor_dtype(binding)))

        print("engine.get_tensor_shape(binding)",engine.get_tensor_shape(binding))
        print(shape)
        binding_memory = HostDeviceMem(size, dtype, name=binding, shape=shape, format=format)

        bindings.append(int(binding_memory.device))

        if engine.get_tensor_mode(binding) == trt.TensorIOMode.INPUT:
            inputs.append(binding_memory)
        else:
            outputs.append(binding_memory)
    return inputs, outputs, bindings, stream


def memcpy_host_to_device(device_ptr: int, host_arr: np.ndarray):
    nbytes = host_arr.size * host_arr.itemsize
    cuda_call(cudart.cudaMemcpy(device_ptr, host_arr, nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice))

def memcpy_device_to_host(host_arr: np.ndarray, device_ptr: int):
    nbytes = host_arr.size * host_arr.itemsize
    cuda_call(cudart.cudaMemcpy(host_arr, device_ptr, nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost))

def _do_inference_base(inputs, outputs, stream, execute_async_func):
    kind = cudart.cudaMemcpyKind.cudaMemcpyHostToDevice
    [cuda_call(cudart.cudaMemcpyAsync(inp.device, inp.host, inp.nbytes, kind, stream)) for inp in inputs]
    execute_async_func()
    kind = cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost
    [cuda_call(cudart.cudaMemcpyAsync(out.host, out.device, out.nbytes, kind, stream)) for out in outputs]
    cuda_call(cudart.cudaStreamSynchronize(stream))
    return [out.host for out in outputs]

def do_inference(context, engine, bindings, inputs, outputs, stream):
    def execute_async_func():
        context.execute_async_v3(stream_handle=stream)

    num_io = engine.num_io_tensors
    context.set_input_shape('input', (6, 360, 640))
    for i in range(num_io):
        context.set_tensor_address(engine.get_tensor_name(i), bindings[i])
        # if engine.get_tensor_name(i)=='input':
        #     context.set_input_shape('input', (6, 360, 640))
        
    #print(context.all_binding_shapes_specified)
    return _do_inference_base(inputs, outputs, stream, execute_async_func)

def load_engine(engine_file_path):
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def preprocess_images(images, width=1280 // 2, height=720 // 2):
    shapes = [img.shape for img in images]
    images = [cv.resize(img, (width, height)) for img in images]
    images = np.stack(images)
    images = images / 128.0 - 1
    return images

engine_file_path = 'det_model.trt'
engine = load_engine(engine_file_path)
inputs, outputs, bindings, stream = allocate_buffers(engine=engine,profile_idx=0)
images = [np.random.rand(360, 640, 6).astype(np.float32) for _ in range(1)]  # Adjust the batch size as needed
preprocessed_images = preprocess_images(images)

#print(inputs[0].shape)
for host_device_buffer in inputs:
    np.copyto(host_device_buffer.host, preprocessed_images.flatten())
    
context = engine.create_execution_context()
masks = do_inference(context=context, engine=engine, inputs=inputs, outputs=outputs, bindings=bindings, stream=stream)
#print(len(masks))
for mask in masks:
    print(mask.shape)
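Not a definitive answer, but a rough per-request sketch on top of the helpers above, assuming the buffers were allocated for the profile maximum (profile_idx=0), the input tensor is named 'input', and the batch array matches one of the profile's valid shapes (infer_batch is a hypothetical helper):

def infer_batch(context, engine, inputs, outputs, stream, batch):
    batch = np.ascontiguousarray(batch, dtype=inputs[0].dtype)
    # Only the shape changes per call; the max-sized buffers are reused.
    context.set_input_shape('input', batch.shape)
    inputs[0].host = batch                      # HostDeviceMem setter copies batch.size elements
    for mem in inputs + outputs:
        context.set_tensor_address(mem.name, mem.device)
    raw = _do_inference_base(inputs, outputs, stream,
                             lambda: context.execute_async_v3(stream_handle=stream))
    # Trim each max-sized output to the shape actually produced for this batch.
    shapes = [tuple(context.get_tensor_shape(out.name)) for out in outputs]
    return [r[: int(np.prod(s))].reshape(s) for r, s in zip(raw, shapes)]

The device-to-host copy still transfers the full max-sized buffers; the trimming only affects the returned views.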

@sevenandseven

size = abs(trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size)
dtype = trt.nptype(engine.get_binding_dtype(binding))

Hello, I am using TensorRT 8.6.1 and ran into the same problem as you. How should it be handled?
