How to Allocate Host Mem for dynamic model with batch_size > 1? #1564

Closed
ThuyHoang9001 opened this issue Oct 20, 2021 · 9 comments
Labels: question (Further information is requested), triaged (Issue has been triaged by maintainers)

Comments

ThuyHoang9001 commented Oct 20, 2021

Description

CUDA host memory allocation fails.

Environment

TensorRT Version: 8.2.0.6
GPU Type: TU102 [GeForce RTX 2080 Ti]
Nvidia Driver Version:
CUDA Version: 11.4.2
CUDNN Version:
Operating System + Version: Linux 20.0.4
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.9.1+cu102
Baremetal or Container (if container which image + tag):

Steps To Reproduce

I tried to allocate host memory for a dynamic model with batch_size > 1:

    context.set_binding_shape(0, (mBatchSize, 3, 112, 112))   
    for binding in engine:
        print('bingding:', binding, engine.get_binding_shape(binding))
        size = trt.volume(engine.get_binding_shape(binding))* mBatchSize
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # dims = context.get_binding_shape(binding)
        # if dims[0] < 0:
        #       size *= -1
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        cuda_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(cuda_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            self.input_w = engine.get_binding_shape(binding)[-1]
            self.input_h = engine.get_binding_shape(binding)[-2]
            host_inputs.append(host_mem)
            cuda_inputs.append(cuda_mem)
        else:
            host_outputs.append(host_mem)
            cuda_outputs.append(cuda_mem)

But it fails as shown below:
bingding: input (-1, 3, 112, 112)
Traceback (most recent call last):
File "infer_insight_face.py", line 434, in
trt_wrapper = TRTClass(engine_file_path)
File "infer_insight_face.py", line 103, in init
host_mem = cuda.pagelocked_empty(size, dtype)
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory
[10/20/2021-01:41:09] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::35] Error Code 1: Cuda Runtime (invalid argument)
Segmentation fault (core dumped)

ttyio (Collaborator) commented Dec 10, 2021

@ThuyHoang9001, could you use context.get_binding_shape for engines with dynamic shapes? Thanks.
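For reference, a minimal sketch of that suggestion against the snippet in the original report (deprecated bindings API, TensorRT 8.x; context, engine, mBatchSize, the buffer lists, and the pycuda/tensorrt imports are assumed to already be in scope):

context.set_binding_shape(0, (mBatchSize, 3, 112, 112))  # resolves the -1 batch dim on the context
for idx, binding in enumerate(engine):
    # Query the context, not the engine: after set_binding_shape the shape is
    # fully specified, so the volume is positive and no extra batch multiply is needed.
    shape = context.get_binding_shape(idx)
    size = trt.volume(shape)
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host_mem = cuda.pagelocked_empty(size, dtype)
    cuda_mem = cuda.mem_alloc(host_mem.nbytes)
    bindings.append(int(cuda_mem))
    if engine.binding_is_input(binding):
        host_inputs.append(host_mem)
        cuda_inputs.append(cuda_mem)
    else:
        host_outputs.append(host_mem)
        cuda_outputs.append(cuda_mem)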

ttyio added the question, Topic: Dynamic Shape, and triaged labels on Dec 10, 2021
ttyio (Collaborator) commented Jan 25, 2022

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!

ttyio closed this as completed on Jan 25, 2022
mfoglio commented Jun 28, 2022

Hi @ttyio, I have the same issue here. I am trying to run a model with a dynamic batch size: engine.get_binding_shape(binding) returns (-1, 3, 224, 224), and as a consequence, when computing the size with size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size, the size is negative and the memory allocation fails with pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory.
As a workaround I am computing the size using size = abs(trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size).
However, when I later run inference, I get the error:

[TensorRT] ERROR: 3: [executionContext.cpp::resolveSlots::1495] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::1495, condition: allInputDimensionsSpecified(routine)
)
[TensorRT] ERROR: 2: [executionContext.cpp::enqueueInternal::360] Error Code 2: Internal Error (Could not resolve slots: )

Here's my code:

import numpy as np
import requests
from PIL import Image
import tensorrt as trt
import torch
from torchvision import transforms
import torchvision.transforms.functional as F

import pycuda.driver as cuda
import pycuda.autoinit


class HostDeviceMem(object):
    """ Copied from https://github.com/NVIDIA/TensorRT/blob/main/samples/python/common.py """
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


class MyModel:

    def __init__(self, engine_path):
        self.engine_path = engine_path
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)
        self.engine = self.load_engine(self.runtime, self.engine_path)
        self.inputs, self.outputs, self.bindings, self.stream = self.allocate_buffers(self.engine)
        self.context = self.engine.create_execution_context()

        # PyTorch preprocessing
        IMAGE_SIZE = 224
        NORMALIZE_MEAN = torch.tensor([0.485, 0.456, 0.406])
        NORMALIZE_STD = torch.tensor([0.226, 0.226, 0.266])
        self.preprocessing_transforms = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=NORMALIZE_MEAN, std=NORMALIZE_STD),   # todo: is it between -1 and 1?
            SquarePad(),   # SquarePad is a custom transform defined elsewhere (not shown here)
            transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
        ])
        self.input_dtype = np.float32

    @staticmethod
    def download_image(image_url: str) -> Image.Image:
        return Image.open(requests.get(image_url, stream=True).raw)

    @staticmethod
    def load_engine(trt_runtime, engine_path):
        """ Copied from https://github.com/NVIDIA/TensorRT/blob/main/samples/python/common.py """
        trt.init_libnvinfer_plugins(None, "")
        with open(engine_path, 'rb') as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
        return engine

    @staticmethod
    def allocate_buffers(engine):
        """ Copied from https://github.com/NVIDIA/TensorRT/blob/main/samples/python/common.py """
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()
        for binding in engine:
            size = abs(trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size)
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(int(device_mem))
            # Append to the appropriate list.
            if engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))
        return inputs, outputs, bindings, stream

    @staticmethod
    def do_inference_v2(context, bindings, inputs, outputs, stream):
        """ Copied from https://github.com/NVIDIA/TensorRT/blob/main/samples/python/common.py """
        # Transfer input data to the GPU.
        [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
        # Run inference.
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        # Transfer predictions back from the GPU.
        [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
        # Synchronize the stream
        stream.synchronize()
        # Return only the host outputs.
        return [out.host for out in outputs]

    def infer(self, image: Image.Image):
        image = self._preprocessing(image)
        batch = np.expand_dims(image, 0)
        output = self._trt_infer(x=batch, batch_size=1)
        return output

    def _preprocessing(self, image: Image.Image):
            image = self.preprocessing_transforms(image)
            image = np.array(image)
            return image

    def _trt_infer(self, x: np.array, batch_size: int) -> np.array:
        x = x.astype(self.input_dtype)
        np.copyto(self.inputs[0].host, x.ravel())
        return self.do_inference_v2(self.context, self.bindings, self.inputs, self.outputs, self.stream)


if __name__ == "__main__":

    model = MyModel(engine_path="model.engine")
    image_urls = [
        "https://www.kbb.com/wp-content/uploads/2020/10/2020-ford-expedition-rear.jpg?w=300&crop=1&strip=all"
    ]
    for image_url in image_urls:
        image = model.download_image(image_url)
        output = model.infer(image)
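For context, the resolveSlots / allInputDimensionsSpecified error above is what TensorRT reports when a dynamic input shape has not been set on the execution context before enqueueing. A minimal sketch of that missing step against the _trt_infer method above, assuming input binding index 0 and a single optimization profile (the buffers should also be sized from the profile's maximum shape rather than abs() of the -1 shape):

def _trt_infer(self, x: np.ndarray, batch_size: int) -> np.ndarray:
    x = x.astype(self.input_dtype)
    # Resolve the dynamic batch dimension before enqueueing; without this,
    # execute_async_v2 fails with allInputDimensionsSpecified.
    self.context.set_binding_shape(0, (batch_size, 3, 224, 224))
    np.copyto(self.inputs[0].host[: x.size], x.ravel())
    return self.do_inference_v2(self.context, self.bindings, self.inputs, self.outputs, self.stream)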

@abdulazizab2

(quoting @mfoglio's comment above in full)

Hey, I am also trying to find a workaround for this.
I think taking the absolute value of the size is fine as long as you are allocating buffers for the correct number of bytes. However, you are only allocating buffers when initializing the object.
Say your first input has shape (1, 3, 224, 224); then you are allocating 1 * 3 * 224 * 224 elements. But when the next input varies in size, say (3, 3, 224, 224), you have to reallocate the buffers because the size has changed.

The thing here is that you might not need a dynamic engine at all: you can set the axis you want to vary to 1 and later multiply by the batch size (or whatever axis of interest, as long as it is supported). So a static input shape might work if you change the buffer allocation a little bit.

However, I am still experimenting with this; the code runs, but I am not sure about the logic since I haven't visualized the results yet.

Did you find a workaround that we could discuss?
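For what it is worth, one pattern that avoids reallocating per batch is to size every buffer once for the optimization profile's maximum shape and then only change the shape on the context per request. A rough sketch against the allocate_buffers above, assuming a single profile whose binding 0 is the input (engine.get_profile_shape returns (min, opt, max) dims; actual_batch_size is a placeholder):

context = engine.create_execution_context()
max_input_shape = engine.get_profile_shape(0, 0)[2]   # max dims of binding 0 under profile 0
context.set_binding_shape(0, max_input_shape)         # worst case, so output shapes resolve too

inputs, outputs, bindings = [], [], []
for idx, binding in enumerate(engine):
    shape = context.get_binding_shape(idx)            # fully specified, no -1 left
    size = trt.volume(shape)
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)
    bindings.append(int(device_mem))
    (inputs if engine.binding_is_input(binding) else outputs).append(HostDeviceMem(host_mem, device_mem))

# Per request, only the shape changes; the max-sized buffers are reused.
context.set_binding_shape(0, (actual_batch_size, 3, 224, 224))  # actual_batch_size: this request's batch

This keeps the pinned host memory bounded by the profile maximum instead of going negative (or growing) with the -1 shape.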

asrays commented Nov 29, 2022

@abdulazizab2 I have exactly the same issue: my shape is (-1, 32, 32, 1) and I get the same error if I take the abs of its volume.
Did you find any solution?

@lix19937

[TensorRT] ERROR: 3: [executionContext.cpp::resolveSlots::1495] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::1495, condition: allInputDimensionsSpecified(routine)
)
[TensorRT] ERROR: 2: [executionContext.cpp::enqueueInternal::360] Error Code 2: Internal Error (Could not resolve slots: )

Does this just mean that the dynamic shape was not set correctly? @ttyio

ttyio (Collaborator) commented May 30, 2023

@lix19937 yes. We also have the all_binding_shapes_specified API to check whether all dynamic binding shapes have been specified; see https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/ExecutionContext.html
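For example, a minimal sketch of that check with the bindings API (the binding index, shape, bindings list, and stream are placeholders from the earlier snippets):

context.set_binding_shape(0, (batch_size, 3, 224, 224))
if not context.all_binding_shapes_specified:
    raise RuntimeError("some input binding shapes are still dynamic (-1)")
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)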

@haiderasad

@ttyio Hey, I am facing a problem with dynamic batching, basically how to set the host memory size according to a varying input batch size at runtime.

Can you provide an example script in Python?
Below is my code:

import ctypes
import numpy as np
import tensorrt as trt
from cuda import cuda, cudart
import cv2 as cv
try:
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError

EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def check_cuda_err(err):
    if isinstance(err, cuda.CUresult):
        if err != cuda.CUresult.CUDA_SUCCESS:
            raise RuntimeError("Cuda Error: {}".format(err))
    elif isinstance(err, cudart.cudaError_t):
        if err != cudart.cudaError_t.cudaSuccess:
            raise RuntimeError("Cuda Runtime Error: {}".format(err))
    else:
        raise RuntimeError("Unknown error type: {}".format(err))

def cuda_call(call):
    err, res = call[0], call[1:]
    check_cuda_err(err)
    if len(res) == 1:
        res = res[0]
    return res

def GiB(val):
    return val * 1 << 30

class HostDeviceMem:
    def __init__(self, size: int, dtype: np.dtype, name= None, shape = None, format= None):
        nbytes = size * dtype.itemsize
        host_mem = cuda_call(cudart.cudaMallocHost(nbytes))
        pointer_type = ctypes.POINTER(np.ctypeslib.as_ctypes_type(dtype))

        self._host = np.ctypeslib.as_array(ctypes.cast(host_mem, pointer_type), (size,))
        self._device = cuda_call(cudart.cudaMalloc(nbytes))
        self._nbytes = nbytes
        self._name = name
        self._shape = shape
        self._format = format
        self._dtype = dtype

    @property
    def host(self) -> np.ndarray:
        return self._host

    @host.setter
    def host(self, arr: np.ndarray):
        if arr.size > self.host.size:
            raise ValueError(f"Tried to fit an array of size {arr.size} into host memory of size {self.host.size}")
        np.copyto(self.host[:arr.size], arr.flat, casting='safe')

    @property
    def device(self) -> int:
        return self._device

    @property
    def nbytes(self) -> int:
        return self._nbytes

    @property
    def name(self):
        return self._name

    @property
    def shape(self):
        return self._shape

    @property
    def format(self):
        return self._format

    @property
    def dtype(self) -> np.dtype:
        return self._dtype

    def __str__(self):
        return f"Host:\n{self.host}\nDevice:\n{self.device}\nSize:\n{self.nbytes}\n"

    def __repr__(self):
        return self.__str__()

    def free(self):
        cuda_call(cudart.cudaFree(self.device))
        cuda_call(cudart.cudaFreeHost(self.host.ctypes.data))

def allocate_buffers(engine: trt.ICudaEngine, profile_idx= None):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda_call(cudart.cudaStreamCreate())
    tensor_names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]
    for binding in tensor_names:
        format = engine.get_tensor_format(binding)
        
        
        # Pick the profile's max dims ([-1] of (min, opt, max)) so the buffer covers the largest shape.
        shape = engine.get_tensor_shape(binding) if profile_idx is None else engine.get_tensor_profile_shape(binding, profile_idx)[-1]
        shape_valid = np.all([s >= 0 for s in shape])
        if not shape_valid and profile_idx is None:
            raise ValueError(f"Binding {binding} has dynamic shape, but no profile was specified.")
        size = trt.volume(shape)
        dtype = np.dtype(trt.nptype(engine.get_tensor_dtype(binding)))

        print("engine.get_tensor_shape(binding)",engine.get_tensor_shape(binding))
        print(shape)
        binding_memory = HostDeviceMem(size, dtype, name=binding, shape=shape, format=format)

        bindings.append(int(binding_memory.device))

        if engine.get_tensor_mode(binding) == trt.TensorIOMode.INPUT:
            inputs.append(binding_memory)
        else:
            outputs.append(binding_memory)
    return inputs, outputs, bindings, stream


def memcpy_host_to_device(device_ptr: int, host_arr: np.ndarray):
    nbytes = host_arr.size * host_arr.itemsize
    cuda_call(cudart.cudaMemcpy(device_ptr, host_arr, nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice))

def memcpy_device_to_host(host_arr: np.ndarray, device_ptr: int):
    nbytes = host_arr.size * host_arr.itemsize
    cuda_call(cudart.cudaMemcpy(host_arr, device_ptr, nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost))

def _do_inference_base(inputs, outputs, stream, execute_async_func):
    kind = cudart.cudaMemcpyKind.cudaMemcpyHostToDevice
    [cuda_call(cudart.cudaMemcpyAsync(inp.device, inp.host, inp.nbytes, kind, stream)) for inp in inputs]
    execute_async_func()
    kind = cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost
    [cuda_call(cudart.cudaMemcpyAsync(out.host, out.device, out.nbytes, kind, stream)) for out in outputs]
    cuda_call(cudart.cudaStreamSynchronize(stream))
    return [out.host for out in outputs]

def do_inference(context, engine, bindings, inputs, outputs, stream):
    def execute_async_func():
        context.execute_async_v3(stream_handle=stream)

    num_io = engine.num_io_tensors
    context.set_input_shape('input', (6, 360, 640))
    for i in range(num_io):
        context.set_tensor_address(engine.get_tensor_name(i), bindings[i])
        # if engine.get_tensor_name(i)=='input':
        #     context.set_input_shape('input', (6, 360, 640))
        
    #print(context.all_binding_shapes_specified)
    return _do_inference_base(inputs, outputs, stream, execute_async_func)

def load_engine(engine_file_path):
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def preprocess_images(images, width=1280 // 2, height=720 // 2):
    shapes = [img.shape for img in images]
    images = [cv.resize(img, (width, height)) for img in images]
    images = np.stack(images)
    images = images / 128.0 - 1
    return images

engine_file_path = 'det_model.trt'
engine = load_engine(engine_file_path)
inputs, outputs, bindings, stream = allocate_buffers(engine=engine,profile_idx=0)
images = [np.random.rand(360, 640, 6).astype(np.float32) for _ in range(1)]  # Adjust the batch size as needed
preprocessed_images = preprocess_images(images)

#print(inputs[0].shape)
for host_device_buffer in inputs:
    np.copyto(host_device_buffer.host, preprocessed_images.flatten())
    
context = engine.create_execution_context()
masks = do_inference(context=context, engine=engine, inputs=inputs, outputs=outputs, bindings=bindings, stream=stream)
#print(len(masks))
for mask in masks:
    print(mask.shape)
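Not a definitive answer, but a rough per-request sketch on top of the helpers above, assuming the buffers were allocated for the profile maximum (profile_idx=0), the input tensor is named 'input', and the batch array matches one of the profile's valid shapes (infer_batch is a hypothetical helper):

def infer_batch(context, engine, inputs, outputs, stream, batch):
    batch = np.ascontiguousarray(batch, dtype=inputs[0].dtype)
    # Only the shape changes per call; the max-sized buffers are reused.
    context.set_input_shape('input', batch.shape)
    inputs[0].host = batch                      # HostDeviceMem setter copies batch.size elements
    for mem in inputs + outputs:
        context.set_tensor_address(mem.name, mem.device)
    raw = _do_inference_base(inputs, outputs, stream,
                             lambda: context.execute_async_v3(stream_handle=stream))
    # Trim each max-sized output to the shape actually produced for this batch.
    shapes = [tuple(context.get_tensor_shape(out.name)) for out in outputs]
    return [r[: int(np.prod(s))].reshape(s) for r, s in zip(raw, shapes)]

The device-to-host copy still transfers the full max-sized buffers; the trimming only affects the returned views.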

@sevenandseven

size = abs(trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size)
dtype = trt.nptype(engine.get_binding_dtype(binding))

Hello, I am using TensorRT 8.6.1 and ran into the same problem as you. How should it be handled?
