
🐛 [Bug] TRTorch runtime isn't thread-safe, which may segfault or hang in a multi-threaded online-serving environment #618

Closed
gssplayer opened this issue Sep 3, 2021 · 2 comments · Fixed by #658

gssplayer commented Sep 3, 2021

Bug Description

We successfully optimized a model with TRTorch v0.3.0, but failed to deploy it for online serving: the process would either core-dump or hang at the device-to-host memory copy (MemoryD2H). As a workaround we currently lock the whole torch.jit inference call, so we suspect the TRTorch runtime is not thread-safe.
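
For reference, a minimal sketch of that workaround, assuming a libtorch-based serving path (locked_forward and g_infer_mu are illustrative names, not our actual service code):

#include <torch/script.h>
#include <mutex>

// Process-wide lock that serializes every call into the TRTorch-compiled module.
static std::mutex g_infer_mu;

at::Tensor locked_forward(torch::jit::script::Module& module, const at::Tensor& input) {
  // Hold the lock for the whole forward call, so only one thread
  // touches the TRT engine at a time.
  std::lock_guard<std::mutex> guard(g_infer_mu);
  return module.forward({input}).toTensor();
}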

From other issues and commits related to this problem, I gather that nvinfer1::IExecutionContext is not thread-safe.

So how should this bug be fixed in TRTorch?

gssplayer added the bug label on Sep 3, 2021
gssplayer (Author) commented Sep 3, 2021

Now I lock the whole "execute_engine" op like this, which protects the TRT task submission but not the whole inference (from start until the final result is fetched). We've tested that this fixes the issue in the single-stream, static-shape case (maybe not exhaustively). I wonder whether it is also OK in the multi-stream or dynamic-shape case.

std::vector<at::Tensor> execute_engine(std::vector<at::Tensor> inputs, c10::intrusive_ptr<TRTEngine> compiled_engine) {
 LOG_DEBUG("Attempting to run engine (ID: " << compiled_engine->name << ")");
 std::vector<void*> gpu_handles;

 std::vector<at::Tensor> contig_inputs{};
 contig_inputs.reserve(inputs.size());

 // nvinfer1::IExecutionContext is not thread-safe.
 std::unique_lock<std::mutex> lock(compiled_engine->mu);
  // Prepare contiguous CUDA inputs, set binding dimensions, and collect device pointers.
  for (size_t i = 0; i < inputs.size(); i++) {
   uint64_t pyt_idx = compiled_engine->in_binding_map[i];
   TRTORCH_CHECK(
       inputs[pyt_idx].is_cuda(),
       "Expected input tensors to have device cuda, found device " << inputs[pyt_idx].device());
   auto expected_type = util::toATenDType(compiled_engine->exec_ctx->getEngine().getBindingDataType(i));
   TRTORCH_CHECK(
       inputs[pyt_idx].dtype() == expected_type,
       "Expected input tensors to have type " << expected_type << ", found type " << inputs[pyt_idx].dtype());
   auto dims = core::util::toDimsPad(inputs[pyt_idx].sizes(), 1);
   auto shape = core::util::toVec(dims);
   contig_inputs.push_back(inputs[pyt_idx].view(shape).contiguous());
   LOG_DEBUG("Input shape: " << dims);
   compiled_engine->exec_ctx->setBindingDimensions(i, dims);
   gpu_handles.push_back(contig_inputs.back().data_ptr());
 }

 TRTORCH_CHECK(
     compiled_engine->exec_ctx->allInputDimensionsSpecified(), "Not enough inputs provided (runtime.RunCudaEngine)");

  // Allocate contiguous CUDA output buffers using the binding shapes and dtypes reported by TensorRT.
  std::vector<at::Tensor> outputs(compiled_engine->num_io.second);
 for (size_t o = inputs.size(); o < (compiled_engine->num_io.first + compiled_engine->num_io.second); o++) {
   uint64_t pyt_idx = compiled_engine->out_binding_map[o];
   auto out_shape = compiled_engine->exec_ctx->getBindingDimensions(o);
   LOG_DEBUG("Output shape: " << out_shape);
   auto dims = core::util::toVec(out_shape);
   auto type = util::toATenDType(compiled_engine->exec_ctx->getEngine().getBindingDataType(o));
   outputs[pyt_idx] = std::move(at::empty(dims, {at::kCUDA}).to(type).contiguous());
   gpu_handles.push_back(outputs[pyt_idx].data_ptr());
 }

  // Submit the TRT execution on the caller's current CUDA stream. Note that the
  // mutex is released when this function returns, right after enqueueV2 submits
  // the work; the kernels may still be running asynchronously on the stream.
  c10::cuda::CUDAStream stream = c10::cuda::getCurrentCUDAStream(inputs[0].device().index());
  bool success = compiled_engine->exec_ctx->enqueueV2(gpu_handles.data(), stream, nullptr);
  TRTORCH_CHECK(success, "execute engine enqueue trt task failed");

 return outputs;
}

narendasan (Collaborator) commented
Can you open a PR or something so we can see all the changes? For instance, did you add the compiled_engine->mu field yourself? Yes, IExecutionContext is not thread-safe; you are supposed to have one per thread. We used to create one on the fly, but that incurs a runtime cost.
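
A minimal sketch of that one-context-per-thread idea, assuming a single shared engine and TensorRT 8+ (where plain delete is valid on TensorRT objects); get_thread_local_context is an illustrative helper, not TRTorch API:

#include <NvInfer.h>
#include <memory>

// Each thread lazily creates its own IExecutionContext on first use and then
// reuses it: no mutex, and the creation cost is paid once per thread rather
// than once per call. The ICudaEngine itself can be shared across threads.
nvinfer1::IExecutionContext* get_thread_local_context(nvinfer1::ICudaEngine& engine) {
  // Assumes one engine per process; the first engine a thread passes in wins.
  thread_local std::unique_ptr<nvinfer1::IExecutionContext> ctx{engine.createExecutionContext()};
  return ctx.get();
}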
