Thread Safety #645
ruoqianguo started this conversation in RFCs
cc: @narendasan @peri044
Goal
As mentioned in #618, the TRTorch runtime is not thread-safe, which may cause segmentation faults or hangs in multi-threaded online-serving environments. We need to make sure the TRTorch runtime is thread-safe.
Roadmap
From the commit in TF2TRT, we found that nvinfer1::IExecutionContext::enqueue is not thread-safe, so we need a mutex around it.
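The mutex approach can be sketched as follows. This is a minimal illustration, not TRTorch's actual implementation: the IExecutionContext here is a stub standing in for TensorRT's real class from NvInfer.h, and the TRTEngine/Enqueue names are hypothetical.

```cpp
#include <mutex>

// Stub standing in for nvinfer1::IExecutionContext (the real class comes
// from TensorRT's NvInfer.h); shown only to make the sketch self-contained.
struct IExecutionContext {
  bool enqueueV2(void** bindings, void* stream, void* inputConsumed) {
    return true;  // the real implementation launches the engine on `stream`
  }
};

// Sketch: serialize all enqueue calls on a single context with a mutex so
// that concurrent callers from different threads cannot overlap an enqueue.
class TRTEngine {
 public:
  bool Enqueue(void** bindings, void* stream) {
    std::lock_guard<std::mutex> lock(enqueue_mu_);  // one enqueue at a time
    return exec_ctx_.enqueueV2(bindings, stream, nullptr);
  }

 private:
  std::mutex enqueue_mu_;
  IExecutionContext exec_ctx_;
};
```

The lock only makes the enqueue call itself safe; it does not address the stream-ordering questions discussed below.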
Besides, the TensorRT best practices guide says to "Create a CUDA stream using cudaStreamCreate for each independent batch and an IExecutionContext for each independent batch." One should interpret this as follows: if we want to enqueue work on multiple GPU streams, then each independent batch of work needs a separate IExecutionContext object. Alternatively, we can use a single stream with a single execution context. Calling enqueueV2() on the same IExecutionContext object from different CUDA streams concurrently results in undefined behavior.
From this issue, we found that TF uses a single compute stream per physical GPU. That ensures that the TRTEngineOps running on that GPU have correct stream ordering. But in PyTorch we can have different streams on a physical GPU through c10::cuda::CUDAStreamGuard.
In general, we may need to discuss the following two approaches:
1. Single stream: we just add a mutex around nvinfer1::IExecutionContext::enqueue. Users must make sure they only use a single stream; otherwise the behavior is undefined.
2. Multiple streams: we maintain a map from each CUDA stream to an IExecutionContext. When the TRTEngine is initialized, we can construct several IExecutionContexts from the TRT engine. During execution of the engine, we look up the current stream in the map; when we see a new stream, we assign a new IExecutionContext to it. In addition, we also need to set a maximum number of streams.
Pseudocode for the multiple-streams approach:
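The original pseudocode is not preserved in this copy of the thread; the following is a hedged reconstruction of the stream-to-context map described above. The ICudaEngine/IExecutionContext types are stubs standing in for TensorRT's real classes, and StreamContextMap/GetContext are illustrative names, not TRTorch's actual API.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>

// Stubs standing in for TensorRT types (the real ones live in NvInfer.h).
struct IExecutionContext {};
struct ICudaEngine {
  std::unique_ptr<IExecutionContext> createExecutionContext() {
    return std::make_unique<IExecutionContext>();
  }
};
using cudaStream_t = void*;  // opaque CUDA stream handle

// Sketch of the multiple-streams approach: one IExecutionContext per CUDA
// stream, created lazily and capped at a maximum number of streams.
class StreamContextMap {
 public:
  StreamContextMap(ICudaEngine* engine, std::size_t max_streams)
      : engine_(engine), max_streams_(max_streams) {}

  // Returns the context bound to `stream`, creating one on first use.
  // Returns nullptr once max_streams_ distinct streams have been seen.
  IExecutionContext* GetContext(cudaStream_t stream) {
    std::lock_guard<std::mutex> lock(mu_);  // guard the map itself
    auto it = contexts_.find(stream);
    if (it != contexts_.end()) return it->second.get();
    if (contexts_.size() >= max_streams_) return nullptr;
    auto ctx = engine_->createExecutionContext();
    IExecutionContext* raw = ctx.get();
    contexts_.emplace(stream, std::move(ctx));
    return raw;
  }

 private:
  std::mutex mu_;
  ICudaEngine* engine_;
  std::size_t max_streams_;
  std::unordered_map<cudaStream_t, std::unique_ptr<IExecutionContext>>
      contexts_;
};
```

Because each stream gets its own IExecutionContext, concurrent enqueueV2() calls on different streams no longer share a context, which avoids the undefined behavior noted above; the mutex here only protects the map lookup, not the enqueue itself.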