onnx model causes core dump in 22.08+, works with 22.06 #5084

Closed
dagardner-nv opened this issue Nov 17, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@dagardner-nv

Description
In Morpheus we have an ONNX model that was working with tritonserver 22.02 & 22.06 but causes a core dump in versions 22.08, 22.09 & 22.10.
nv-morpheus/Morpheus#475

Triton Information
22.08

Container: nvcr.io/nvidia/tritonserver:22.08-py3

To Reproduce

git clone https://github.com/nv-morpheus/Morpheus
cd Morpheus
./scripts/fetch_data.py fetch models
docker run --rm -ti --gpus=all -p8000:8000 -p8001:8001 -p8002:8002 -v $PWD/models:/models nvcr.io/nvidia/tritonserver:22.08-py3 bash
tritonserver --model-repository=/models/triton-model-repo --exit-on-error=false --log-info=true

Fails with:

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 22.08 (build 42766143)
Triton Server Version 2.25.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

root@4af633826189:/opt/tritonserver# tritonserver --model-repository=/models/triton-model-repo --exit-on-error=false --log-info=true
I1117 23:36:57.355277 91 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f23c6000000' with size 268435456
I1117 23:36:57.355606 91 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1117 23:36:57.361426 91 model_lifecycle.cc:459] loading: phishing-bert-trt:1
I1117 23:36:57.361455 91 model_lifecycle.cc:459] loading: root-cause-binary-onnx:1
I1117 23:36:57.361486 91 model_lifecycle.cc:459] loading: abp-nvsmi-xgb:1
I1117 23:36:57.361505 91 model_lifecycle.cc:459] loading: phishing-bert-onnx:1
I1117 23:36:57.361521 91 model_lifecycle.cc:459] loading: log-parsing-onnx:1
I1117 23:36:57.361538 91 model_lifecycle.cc:459] loading: sid-minibert-onnx:1
I1117 23:36:57.361579 91 model_lifecycle.cc:459] loading: sid-minibert-trt:1
I1117 23:36:57.377033 91 tensorrt.cc:5441] TRITONBACKEND_Initialize: tensorrt
I1117 23:36:57.377062 91 tensorrt.cc:5451] Triton TRITONBACKEND API version: 1.10
I1117 23:36:57.377067 91 tensorrt.cc:5457] 'tensorrt' TRITONBACKEND API version: 1.10
I1117 23:36:57.377153 91 tensorrt.cc:5500] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I1117 23:36:57.377182 91 tensorrt.cc:5552] TRITONBACKEND_ModelInitialize: phishing-bert-trt (version 1)
I1117 23:36:57.585333 91 logging.cc:49] [MemUsageChange] Init CUDA: CPU +320, GPU +0, now: CPU 337, GPU 1161 (MiB)
Segmentation fault (core dumped)

backtrace:

Core was generated by `tritonserver --model-repository=/models/triton-model-repo --exit-on-error=false'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f24003f342b in triton::backend::tensorrt::UseTensorRTv2API(std::shared_ptr<nvinfer1::ICudaEngine> const&) () from /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
[Current thread is 1 (Thread 0x7f2402cec000 (LWP 95))]
(gdb) bt
#0  0x00007f24003f342b in triton::backend::tensorrt::UseTensorRTv2API(std::shared_ptr<nvinfer1::ICudaEngine> const&) () from /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
#1  0x00007f24003d008e in triton::backend::tensorrt::ModelState::AutoCompleteConfigHelper(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
#2  0x00007f24003d2c60 in triton::backend::tensorrt::ModelState::AutoCompleteConfig() () from /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
#3  0x00007f24003d3588 in triton::backend::tensorrt::ModelState::Create(TRITONBACKEND_Model*, triton::backend::tensorrt::ModelState**) () from /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
#4  0x00007f24003d3a3a in TRITONBACKEND_ModelInitialize () from /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
#5  0x00007f2412016fde in triton::core::TritonModel::Create(triton::core::InferenceServer*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, inference::ModelConfig const&, std::unique_ptr<triton::core::TritonModel, std::default_delete<triton::core::TritonModel> >*) () from /opt/tritonserver/bin/../lib/libtritonserver.so
#6  0x00007f24120d13a4 in triton::core::ModelLifeCycle::CreateModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, triton::core::ModelLifeCycle::ModelInfo*) ()
   from /opt/tritonserver/bin/../lib/libtritonserver.so
#7  0x00007f24120d7e38 in std::_Function_handler<void (), triton::core::ModelLifeCycle::AsyncLoad(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /opt/tritonserver/bin/../lib/libtritonserver.so
#8  0x00007f241220ab00 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#9  0x00007f2411b64de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007f2412eb5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#11 0x00007f241184f133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

config.pbtxt looks like:

name: "phishing-bert-onnx"
platform: "onnxruntime_onnx"
backend: "onnxruntime"
max_batch_size: 32

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ 128 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ 128 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 1, 4, 8, 12, 16, 20, 24, 28, 32 ]
  max_queue_delay_microseconds: 50000
}

optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP16" }
    parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    }]
}}
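
For reference, once this model loads, a quick way to exercise the config with random data is perf_analyzer from the matching Triton SDK container (a sketch; this is not needed to reproduce the crash, and the SDK image tag should match the server version):

docker run --rm -ti --net=host nvcr.io/nvidia/tritonserver:22.08-py3-sdk \
  perf_analyzer -m phishing-bert-onnx -b 8 --concurrency-range 1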

Expected behavior
The server should load the model repository without core dumping.

@rmccorm4 (Collaborator) commented Nov 21, 2022

Hi @dagardner-nv ,

Thanks for filing this issue with detailed repro steps.

One observation: I believe the segfault is coming from one of your TRT models, not the ONNX model, since the backtrace shows calls from the TRT backend.

If I load only the ONNX model you mention, it loads successfully:

root@d77382c60e22:/models# tritonserver --model-repository=/models/triton-model-repo --exit-on-error=false --log-info=true --model-control-mode=explicit --load-model=phishing-bert-onnx
...
+--------------------+---------+--------+
| Model              | Version | Status |
+--------------------+---------+--------+
| phishing-bert-onnx | 1       | READY  |
+--------------------+---------+--------+
...
I1121 19:43:08.055022 500 grpc_server.cc:4610] Started GRPCInferenceService at 0.0.0.0:8001
I1121 19:43:08.055260 500 http_server.cc:3316] Started HTTPService at 0.0.0.0:8000
I1121 19:43:08.097237 500 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
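
(Once the GRPC/HTTP endpoints are up, a quick sanity check from another shell is the KServe v2 readiness endpoint, assuming the port mapping from the repro command above:)

curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/phishing-bert-onnx/ready
# expect HTTP 200 once the model is READY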

If I load the corresponding TRT model, it segfaults:

root@d77382c60e22:/models# tritonserver --model-repository=/models/triton-model-repo --exit-on-error=false --log-info=true --model-control-mode=explicit --load-model=phishing-bert-trt
...
I1121 19:43:17.159170 560 tensorrt.cc:5552] TRITONBACKEND_ModelInitialize: phishing-bert-trt (version 1)
I1121 19:43:17.547282 560 logging.cc:49] [MemUsageChange] Init CUDA: CPU +320, GPU +0, now: CPU 337, GPU 1649 (MiB)
Segmentation fault (core dumped)

However, there is no model in this directory, per the README description mentioning portability/generation of the TRT engine:

root@d77382c60e22:/models# ls /models/triton-model-repo/phishing-bert-trt/1/
README.md

Next steps:

  1. Can you minimize this repro to loading only the phishing-bert-trt model, and verify that the engine is actually present before loading (a quick check is sketched after this list)?

    • There is a separate bug in Triton's autocomplete for the TRT backend: we should have detected that no model file is present here and raised an error instead of segfaulting, but I will handle that separately (CC @nv-kmcgill53 DLIS-4354). You may be able to WAR this segfault by adding tritonserver ... --disable-auto-complete-config, if the expectation is truly that there should be no model file present at load time.
  2. If (1) above is reproducible even with an engine file present, can you share the corresponding trtexec/polygraphy command to generate a reproducible engine, rather than the morpheus tools ... command, which isn't set up in the Triton container?
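
A quick check for (1), before starting tritonserver (a sketch, assuming the generated engine is dropped into the version directory under Triton's default TensorRT filename model.plan):

ls -lh /models/triton-model-repo/phishing-bert-trt/1/
# expect an engine file (e.g. model.plan) next to README.md; if only README.md is present, the engine has not been generated yet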


ref: DLIS-4353

@dagardner-nv (Author)

@rmccorm4 good catch. I can confirm that:

  1. The --disable-auto-complete-config flag works around the issue
  2. Removing the empty model dirs (models/triton-model-repo/phishing-bert-trt & models/triton-model-repo/sid-minibert-trt) also avoids the issue

I had actually never generated those models before; I did so now, and I still get a core dump, but I also get some errors first. Launching with:

tritonserver --model-repository=/models/triton-model-repo --exit-on-error=false --log-info=true --model-control-mode=explicit --load-model=phishing-bert-trt

gets:

I1121 23:56:17.901619 917 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fbec6000000' with size 268435456
I1121 23:56:17.901931 917 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1121 23:56:17.906393 917 model_lifecycle.cc:459] loading: phishing-bert-trt:1
I1121 23:56:17.921656 917 tensorrt.cc:5441] TRITONBACKEND_Initialize: tensorrt
I1121 23:56:17.921683 917 tensorrt.cc:5451] Triton TRITONBACKEND API version: 1.10
I1121 23:56:17.921688 917 tensorrt.cc:5457] 'tensorrt' TRITONBACKEND API version: 1.10
I1121 23:56:17.921775 917 tensorrt.cc:5500] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I1121 23:56:17.921801 917 tensorrt.cc:5552] TRITONBACKEND_ModelInitialize: phishing-bert-trt (version 1)
I1121 23:56:18.148084 917 logging.cc:49] [MemUsageChange] Init CUDA: CPU +320, GPU +0, now: CPU 337, GPU 1815 (MiB)
I1121 23:56:18.524985 917 logging.cc:49] Loaded engine size: 943 MiB
E1121 23:56:18.535521 917 logging.cc:43] 1: [stdArchiveReader.cpp::StdArchiveReader::40] Error Code 1: Serialization (Serialization assertion stdVersionRead == serializationVersion failed.Version tag does not match. Note: Current Version: 213, Serialized Engine Version: 232)
E1121 23:56:18.540275 917 logging.cc:43] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Segmentation fault (core dumped)

@rmccorm4 (Collaborator) commented Nov 22, 2022

E1121 23:56:18.535521 917 logging.cc:43] 1: [stdArchiveReader.cpp::StdArchiveReader::40] Error Code 1: Serialization (Serialization assertion stdVersionRead == serializationVersion failed.Version tag does not match. Note: Current Version: 213, Serialized Engine Version: 232)
E1121 23:56:18.540275 917 logging.cc:43] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)

This is a standard TensorRT error. You need to make sure the TRT engines are generated in the same environment (OS, TensorRT version, Compute Capability, etc.) that you deploy Triton in.

For example, generate the TRT engines in the nvcr.io/nvidia/tensorrt:22.10-py3 container on the same machine where you want to deploy nvcr.io/nvidia/tritonserver:22.10-py3, to make sure the TensorRT versions match.
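
A minimal sketch of regenerating the engine inside the matching TensorRT container (the ONNX model path below is a placeholder, and the shape profile is inferred from the phishing-bert-onnx config above; the real Morpheus tooling may use different paths and profiles):

docker run --rm -ti --gpus=all -v $PWD/models:/models nvcr.io/nvidia/tensorrt:22.10-py3 \
  trtexec --onnx=/models/phishing-bert.onnx \
          --saveEngine=/models/triton-model-repo/phishing-bert-trt/1/model.plan \
          --fp16 \
          --minShapes=input_ids:1x128,attention_mask:1x128 \
          --optShapes=input_ids:16x128,attention_mask:16x128 \
          --maxShapes=input_ids:32x128,attention_mask:32x128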

Segmentation fault (core dumped)

If you enable --log-verbose=1 or get a backtrace from gdb again, this may still be auto-complete trying to use the invalid engine (this time it's a TRT version mismatch rather than a non-existent engine file, but it may be the same root cause). Similarly, you could check whether this still segfaults with --disable-auto-complete-config.
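
One way to grab that backtrace again (a sketch; gdb may need to be installed in the container first with apt-get update && apt-get install -y gdb):

gdb --args tritonserver --model-repository=/models/triton-model-repo --model-control-mode=explicit --load-model=phishing-bert-trt
(gdb) run
(gdb) bt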

I'll wait for the steps to reproduce the engine file in case you need help validating this part.

@rmccorm4 (Collaborator) commented Dec 7, 2022

Just an FYI, the autocomplete segfault above may already be fixed in the 22.12 release with this commit: triton-inference-server/tensorrt_backend#52.

dagardner-nv added a commit to dagardner-nv/Morpheus that referenced this issue Dec 14, 2022
… from-source users can use the provided scripts to perform the launching

Add --disable-auto-complete-config to launch to work-around triton issue triton-inference-server/server#5084
Add instructions for launching with only an explicit model
@tanmayv25 (Contributor) commented Jan 3, 2023

I can confirm that the auto-complete segfault issue has been fixed. In the absence of a TRT model, Triton now correctly fails with the following logs:

root@02320299735a:/opt/tritonserver# tritonserver --model-store=my_models
I0103 19:49:00.655949 142 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6476000000' with size 268435456
I0103 19:49:00.656271 142 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0103 19:49:00.658054 142 model_lifecycle.cc:459] loading: plan_float32_float32_float32:1
I0103 19:49:00.712452 142 tensorrt.cc:64] TRITONBACKEND_Initialize: tensorrt
I0103 19:49:00.712481 142 tensorrt.cc:74] Triton TRITONBACKEND API version: 1.10
I0103 19:49:00.712488 142 tensorrt.cc:80] 'tensorrt' TRITONBACKEND API version: 1.10
I0103 19:49:00.712492 142 tensorrt.cc:108] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0103 19:49:00.712810 142 tensorrt.cc:198] TRITONBACKEND_ModelInitialize: plan_float32_float32_float32 (version 1)
I0103 19:49:01.024951 142 tensorrt.cc:224] TRITONBACKEND_ModelFinalize: delete model state
E0103 19:49:01.024984 142 model_lifecycle.cc:597] failed to load 'plan_float32_float32_float32' version 1: Internal: unable to load plan file to auto complete config: my_models/plan_float32_float32_float32/1/model.plan
I0103 19:49:01.025110 142 server.cc:563] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0103 19:49:01.025201 142 server.cc:590] 
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------+
| Backend  | Path                                                      | Config                                                                                      |
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------+
| tensorrt | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-dire |
|          |                                                           | ctory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}                          |
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------+

I0103 19:49:01.025355 142 server.cc:633] 
+------------------------------+---------+---------------------------------------------------------------------------------------------------------------------------+
| Model                        | Version | Status                                                                                                                    |
+------------------------------+---------+---------------------------------------------------------------------------------------------------------------------------+
| plan_float32_float32_float32 | 1       | UNAVAILABLE: Internal: unable to load plan file to auto complete config: my_models/plan_float32_float32_float32/1/model.p |
|                              |         | lan                                                                                                                       |
+------------------------------+---------+---------------------------------------------------------------------------------------------------------------------------+

I0103 19:49:01.075352 142 metrics.cc:864] Collecting metrics for GPU 0: NVIDIA TITAN RTX
I0103 19:49:01.075619 142 metrics.cc:757] Collecting CPU metrics
I0103 19:49:01.075779 142 tritonserver.cc:2264] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                            |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                           |
| server_version                   | 2.28.0                                                                                                                           |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_m |
|                                  | emory cuda_shared_memory binary_tensor_data statistics trace logging                                                             |
| model_repository_path[0]         | my_models                                                                                                                        |
| model_control_mode               | MODE_NONE                                                                                                                        |
| strict_model_config              | 0                                                                                                                                |
| rate_limit                       | OFF                                                                                                                              |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                        |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                         |
| response_cache_byte_size         | 0                                                                                                                                |
| min_supported_compute_capability | 6.0                                                                                                                              |
| strict_readiness                 | 1                                                                                                                                |
| exit_timeout                     | 30                                                                                                                               |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------+

I0103 19:49:01.075805 142 server.cc:264] Waiting for in-flight requests to complete.
I0103 19:49:01.075811 142 server.cc:280] Timeout 30: Found 0 model versions that have in-flight inferences
I0103 19:49:01.075818 142 server.cc:295] All models are stopped, unloading models
I0103 19:49:01.075823 142 server.cc:302] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
