[BUG] triton 22.04 crashes under heavy load from Morpheus+Kafka #259

Closed
pdmack opened this issue Jul 4, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@pdmack
Contributor

pdmack commented Jul 4, 2022

Describe the bug
While testing #257, a large volume of jsonlines messages sent via Kafka can trigger an abort in Triton on a 4x T4 (16 GB) system. The cause is possibly contention for, or exhaustion of, GPU memory.

triton:

terminate called after throwing an instance of 'nvinfer1::InternalError'
  what():  Assertion mUsedAllocators.find(alloc) != mUsedAllocators.end() && "Myelin free callback called with invalid MyelinAllocator" failed.
Signal (6) received.
terminate called recursively
Signal (6) received.
 0# 0x000056451B4A87E9 in tritonserver
 1# 0x00007F4431D5F0C0 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F4432118911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F443212438C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F4432123369 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# __gxx_personality_v0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# 0x00007F4431F1EBEF in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
 9# _Unwind_RaiseException in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
10# __cxa_throw in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
11# nvinfer1::Lobber<nvinfer1::InternalError>::operator()(char const*, char const*, int, int, nvinfer1::ErrorCode, char const*) in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
12# 0x00007F43DEE0A0FC in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
13# 0x00007F43DF654DCF in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
14# 0x00007F43DF60B1ED in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
15# 0x00007F43DF662213 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
16# 0x00007F43DEE09B55 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
17# 0x00007F43DE9A6F90 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
18# 0x00007F43DEE0F634 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
19# 0x00007F43DF4F6B98 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
20# 0x00007F43DF4F734C in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
21# 0x00007F42FC1F928F in /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_tensorrt.so
22# 0x00007F42FC1FC13B in /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_tensorrt.so
23# 0x00007F438352A158 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
24# 0x00007F43835A04A5 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
25# 0x00007F438358A789 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
26# 0x00007F438358C86C in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
27# 0x00007F4382FE2D92 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
28# 0x00007F4382FE2FB8 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
29# 0x00007F4382F8A42D in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
30# 0x00007F4383B407BD in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
31# 0x00007F4383B56203 in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
32# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
33# 0x00007F443260BD9A in /opt/tritonserver/bin/../lib/libtritonserver.so
34# 0x00007F443260C757 in /opt/tritonserver/bin/../lib/libtritonserver.so
35# 0x00007F44326C7AB1 in /opt/tritonserver/bin/../lib/libtritonserver.so
36# 0x00007F4432605C27 in /opt/tritonserver/bin/../lib/libtritonserver.so
37# 0x00007F4432150DE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
38# 0x00007F4433357609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
39# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Signal (11) received.
 0# 0x000056451B4A87E9 in tritonserver
 1# 0x00007F4431D5F0C0 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F443212653A in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F443212438C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F4432123369 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# __gxx_personality_v0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# 0x00007F4431F1EBEF in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
 9# _Unwind_RaiseException in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
10# __cxa_throw in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
11# nvinfer1::Lobber<nvinfer1::InternalError>::operator()(char const*, char const*, int, int, nvinfer1::ErrorCode, char const*) in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
12# 0x00007F43DEE0A0FC in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
13# 0x00007F43DF654DCF in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
14# 0x00007F43DF60B1ED in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
15# 0x00007F43DF662213 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
16# 0x00007F43DEE09B55 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
17# 0x00007F43DE9A6F90 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
18# 0x00007F43DEE0F634 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
19# 0x00007F43DF4F6B98 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
20# 0x00007F43DF4F734C in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
21# 0x00007F42FC1F928F in /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_tensorrt.so
22# 0x00007F42FC1FC13B in /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_tensorrt.so
23# 0x00007F438352A158 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
24# 0x00007F43835A04A5 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
25# 0x00007F438358A789 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
26# 0x00007F438358C86C in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
27# 0x00007F4382FE2D92 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
28# 0x00007F4382FE2FB8 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
29# 0x00007F4382F8A42D in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
30# 0x00007F4383B407BD in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
31# 0x00007F4383B56203 in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
32# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
33# 0x00007F443260BD9A in /opt/tritonserver/bin/../lib/libtritonserver.so
34# 0x00007F443260C757 in /opt/tritonserver/bin/../lib/libtritonserver.so
35# 0x00007F44326C7AB1 in /opt/tritonserver/bin/../lib/libtritonserver.so
36# 0x00007F4432605C27 in /opt/tritonserver/bin/../lib/libtritonserver.so
37# 0x00007F4432150DE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
38# 0x00007F4433357609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
39# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

 0# 0x000056451B4A87E9 in tritonserver
 1# 0x00007F4431D5F0C0 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# 0x00007F4432118911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 4# 0x00007F443212438C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F4432123369 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# __gxx_personality_v0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F4431F1EBEF in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
 8# _Unwind_RaiseException in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
 9# __cxa_throw in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
10# nvinfer1::Lobber<nvinfer1::InternalError>::operator()(char const*, char const*, int, int, nvinfer1::ErrorCode, char const*) in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
11# 0x00007F43DEE0A0FC in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
12# 0x00007F43DF654DCF in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
13# 0x00007F43DF60B1ED in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
14# 0x00007F43DF662213 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
15# 0x00007F43DEE09B55 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
16# 0x00007F43DE9A6F90 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
17# 0x00007F43DEE0F634 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
18# 0x00007F43DF4F6B98 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
19# 0x00007F43DF4F734C in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
20# 0x00007F42FC1F928F in /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_tensorrt.so
21# 0x00007F42FC1FC13B in /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_tensorrt.so
22# 0x00007F438352A158 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
23# 0x00007F43835A04A5 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
24# 0x00007F438358A789 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
25# 0x00007F438358C86C in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
26# 0x00007F4382FE2D92 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
27# 0x00007F4382FE2FB8 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
28# 0x00007F4382F8A42D in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
29# 0x00007F4383B407BD in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
30# 0x00007F4383B56203 in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
31# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
32# 0x00007F443260BD9A in /opt/tritonserver/bin/../lib/libtritonserver.so
33# 0x00007F443260C757 in /opt/tritonserver/bin/../lib/libtritonserver.so
34# 0x00007F44326C7AB1 in /opt/tritonserver/bin/../lib/libtritonserver.so
35# 0x00007F4432605C27 in /opt/tritonserver/bin/../lib/libtritonserver.so
36# 0x00007F4432150DE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
37# 0x00007F4433357609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
38# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
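
The frames above are easier to read when grouped by shared object. The following is a minimal sketch, not part of Triton or Morpheus, that counts frames per library in a trace of this shape; the function name `summarize_frames` and the sample excerpt are illustrative assumptions.

```python
import re
from collections import OrderedDict

# Matches lines such as " 12# 0x00007F43DEE0A0FC in /usr/lib/.../libnvinfer.so.8"
FRAME_RE = re.compile(r"^\s*(\d+)#\s+(.*?)\s+in\s+(\S+)\s*$")


def summarize_frames(trace: str) -> "OrderedDict[str, int]":
    """Count stack frames per library, in order of first appearance."""
    counts: "OrderedDict[str, int]" = OrderedDict()
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # skip non-frame lines such as "Signal (6) received."
        library = match.group(3).rsplit("/", 1)[-1]
        counts[library] = counts.get(library, 0) + 1
    return counts


# A short excerpt copied from the trace above.
SAMPLE = """\
Signal (6) received.
 0# 0x000056451B4A87E9 in tritonserver
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
11# nvinfer1::Lobber<nvinfer1::InternalError>::operator()(char const*, char const*, int, int, nvinfer1::ErrorCode, char const*) in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
12# 0x00007F43DEE0A0FC in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
21# 0x00007F42FC1F928F in /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_tensorrt.so
"""

if __name__ == "__main__":
    for library, frames in summarize_frames(SAMPLE).items():
        print(f"{frames:3d}  {library}")
```

Run against the full trace, this grouping shows the throw originating in libnvinfer.so.8, reached through the ONNX Runtime TensorRT provider from TRITONBACKEND_ModelInstanceExecute.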

Morpheus CLI:

From Kafka rate: 1809731messages [01:48, 9481.49messages/s]
Failed to update context stat: Timer not set correctly.
E20220704 16:23:50.529767  3029 triton_inference.cpp:54] Triton Error while executing 'client->Infer(&results, m_options, inputs, outputs)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(181)
Failed to update context stat: Timer not set correctly.
Failed to update context stat: Timer not set correctly.
Failed to update context stat: Timer not set correctly.
E20220704 16:23:50.529978  3028 triton_inference.cpp:54] Triton Error while executing 'client->Infer(&results, m_options, inputs, outputs)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(181)
E20220704 16:23:50.530028  3031 triton_inference.cpp:54] Triton Error while executing 'client->Infer(&results, m_options, inputs, outputs)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(181)
E20220704 16:23:50.530154  3030 triton_inference.cpp:54] Triton Error while executing 'client->Infer(&results, m_options, inputs, outputs)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(181)
E20220704 16:23:50.531956  3029 context.cpp:125] main/inference-6; rank: 1; size: 4; tid: 140521369851648: set_exception issued; issuing kill to current runnable. Exception msg: Triton Error while executing 'client->Infer(&results, m_options, inputs, outputs)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(181)
E20220704 16:23:50.542585  3026 kafka_source.cpp:323] Exception in rebalance_loop. Msg: std::exception
E20220704 16:23:50.543849  3027 kafka_source.cpp:323] Exception in rebalance_loop. Msg: std::exception
%3|1656951830.543|ERROR|rdkafka#consumer-3| [thrd:GroupCoordinator]: 1/1 brokers are down
E20220704 16:23:50.544577  3024 kafka_source.cpp:323] Exception in rebalance_loop. Msg: std::exception
%3|1656951830.545|ERROR|rdkafka#consumer-2| [thrd:GroupCoordinator]: 1/1 brokers are down
%3|1656951830.545|ERROR|rdkafka#consumer-1| [thrd:GroupCoordinator]: 1/1 brokers are down
From Kafka rate: 1809731messages [05:49, 9481.49messages/s]
Deserialization rate: 1103656messages [05:49, 5319.46messages/s]
Preprocessing rate: 690431messages [05:49, 4463.54messages/s]
Inference rate[Complete]: 448626inf [01:48, 4147.72inf/s]
Serialization rate[Complete]: 448626messages [01:48, 3864.98messages/s]
To Kafka rate[Complete]: 442769messages [01:47, 3993.38messages/s]
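
The final monitor lines above show how far each stage had fallen behind the one before it when the crash occurred. The following is a minimal sketch for computing those gaps from the monitor output; the helper names and the regular expression are illustrative assumptions, not part of Morpheus.

```python
import re
from typing import List, Tuple

# Matches monitor lines such as "Inference rate[Complete]: 448626inf [01:48, ...]"
MONITOR_RE = re.compile(
    r"^(?P<stage>.+?) rate(?:\[Complete\])?: (?P<count>\d+)[a-z]+ \["
)

# The six final monitor lines, copied from the log above.
FINAL_SUMMARY = """\
From Kafka rate: 1809731messages [05:49, 9481.49messages/s]
Deserialization rate: 1103656messages [05:49, 5319.46messages/s]
Preprocessing rate: 690431messages [05:49, 4463.54messages/s]
Inference rate[Complete]: 448626inf [01:48, 4147.72inf/s]
Serialization rate[Complete]: 448626messages [01:48, 3864.98messages/s]
To Kafka rate[Complete]: 442769messages [01:47, 3993.38messages/s]
"""


def stage_counts(log: str) -> List[Tuple[str, int]]:
    """Extract (stage name, processed count) pairs from monitor lines."""
    pairs = []
    for line in log.splitlines():
        match = MONITOR_RE.match(line.strip())
        if match:
            pairs.append((match.group("stage"), int(match.group("count"))))
    return pairs


def backlogs(counts: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Items that left one stage but were not yet counted by the next."""
    return [
        (f"{upstream} -> {downstream}", done_up - done_down)
        for (upstream, done_up), (downstream, done_down) in zip(counts, counts[1:])
    ]


if __name__ == "__main__":
    for edge, pending in backlogs(stage_counts(FINAL_SUMMARY)):
        print(f"{edge}: {pending}")
```

The gaps only describe counts at the moment the monitors stopped updating; they do not by themselves show which stage caused the failure.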

Steps/Code to reproduce bug
Launch CLI:

morpheus --log_level=DEBUG run \
  --num_threads=4 --pipeline_batch_size=8192 --model_max_batch_size=32 --edge_buffer_size=32 \
  pipeline-nlp --labels_file=/common/data/model_data/labels_nlp.txt --model_seq_length=256 \
  from-kafka --input_topic morpheus-input --bootstrap_servers broker:9092 \
  monitor --description='From Kafka rate' \
  deserialize \
  monitor --description='Deserialization rate' \
  preprocess --vocab_hash_file=/common/data/model_data/bert-base-uncased-hash.txt --truncation=True --do_lower_case=True --add_special_tokens=False \
  monitor --description='Preprocessing rate' \
  inf-triton --force_convert_inputs=True --model_name=sid-minibert-onnx --server_url=ai-engine:8001 --use_shared_memory=True \
  monitor --description='Inference rate' --smoothing=0.001 --unit inf \
  serialize --exclude '^ts_' \
  monitor --description='Serialization rate' \
  to-kafka --output_topic morpheus-output --bootstrap_servers broker:9092 \
  monitor --description='To Kafka rate'

Then load the current pcap_dump.jsonlines, replicated 25 times, into a single Kafka input topic (e.g., morpheus-input) for consumption by Morpheus.
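
The issue does not say which tool was used to load the topic. The following is a minimal sketch of one way to do the replication, assuming the file fits in memory; `replicate_lines`, `publish`, and the `send` callback are hypothetical names, and the Kafka client is deliberately left out so that the block runs on its own.

```python
from typing import Callable, Iterable, Iterator, List


def replicate_lines(lines: Iterable[str], copies: int) -> Iterator[str]:
    """Yield every non-empty line `copies` times over, preserving order."""
    cached: List[str] = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for _ in range(copies):
        yield from cached


def publish(lines: Iterable[str], send: Callable[[bytes], None]) -> int:
    """Pass each line to `send` as UTF-8 bytes and return the number sent."""
    sent = 0
    for line in lines:
        send(line.encode("utf-8"))
        sent += 1
    return sent


if __name__ == "__main__":
    # Stand-in for the contents of pcap_dump.jsonlines (illustrative records).
    sample = ['{"id": 1}\n', '{"id": 2}\n', "\n"]
    outbox: List[bytes] = []  # stand-in for a Kafka producer
    total = publish(replicate_lines(sample, copies=25), outbox.append)
    print(f"queued {total} messages")
```

With a real Kafka client, `send` would wrap that client's produce call for the morpheus-input topic on broker:9092; which client to use is left open here.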

Expected behavior
The Morpheus CLI and Triton should be able to sustain the Kafka stream load for inference without crashing.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker]
  • Method of Morpheus install: [Docker/k8s]

Environment details
https://gist.github.com/pdmack/5ff438cc99105577b41f4c0c41f7131a

Additional context
nvcr.io/nvidia/tritonserver:22.04-py3

  • Need to retest with 22.06-py3 (latest)
  • Need to adjust batch size
  • Need to retest on Ampere (at least 40 GB)
@pdmack added the bug and Needs Triage labels on Jul 4, 2022
@mdemoret-nv
Contributor

@pdmack Can you test whether reducing the --num_threads or --pipeline_batch_size arguments alleviates this problem? I agree with you that this is most likely caused by exhausting GPU memory. Limiting these two properties should reduce the number of outstanding messages and therefore the GPU memory needed by Triton.
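
To compare configurations, the reasoning above can be turned into a rough estimate. The following is a minimal sketch under stated assumptions: each thread is assumed to hold at most one pipeline batch in flight, and each row is assumed to carry two input tensors of 4-byte elements. Neither assumption is confirmed in this issue, and the estimate covers input tensors only, not model weights, activations, or TensorRT workspace.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InFlightEstimate:
    rows: int         # rows that may be outstanding at once
    requests: int     # inference requests those rows are split into
    input_bytes: int  # bytes of input tensors for those rows


def estimate_in_flight(
    num_threads: int,
    pipeline_batch_size: int,
    model_max_batch_size: int,
    seq_length: int,
    tensors_per_row: int = 2,    # ASSUMPTION: not stated in the issue
    bytes_per_element: int = 4,  # ASSUMPTION: 4-byte integer inputs
) -> InFlightEstimate:
    """Rough upper bound, assuming one pipeline batch in flight per thread."""
    rows = num_threads * pipeline_batch_size
    requests = num_threads * math.ceil(pipeline_batch_size / model_max_batch_size)
    input_bytes = rows * seq_length * tensors_per_row * bytes_per_element
    return InFlightEstimate(rows, requests, input_bytes)


if __name__ == "__main__":
    # Settings from the reproduction command in this issue.
    original = estimate_in_flight(4, 8192, 32, 256)
    # One possible reduced configuration.
    reduced = estimate_in_flight(2, 2048, 32, 256)
    for label, est in (("original", original), ("reduced", reduced)):
        print(f"{label}: {est.rows} rows, {est.requests} requests, "
              f"{est.input_bytes / 2**20:.0f} MiB of inputs")
```

The absolute numbers are unlikely to match real GPU usage; the sketch is only useful for seeing how the two flags scale the number of outstanding rows relative to each other.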

@pdmack
Contributor Author

pdmack commented Jul 5, 2022

Also reproduced with Triton 22.06-py3.

@pdmack
Contributor Author

pdmack commented Jul 5, 2022

Reducing the number of threads from 4 to 2 seems to stabilize it. Dropping the batch size alone from 8192 to 2048 brought no improvement. However, with the bottleneck, the inference stage updates only infrequently.

@pdmack
Contributor Author

pdmack commented Jul 5, 2022

Update: Triton crashed with 2 threads, batch size = 2048, and use_cpp=False.

@jarmak-nv removed the Needs Triage label on Aug 22, 2022
@mdemoret-nv removed their assignment on Dec 13, 2023
@pdmack
Contributor Author

pdmack commented Apr 10, 2024

out of date

@pdmack closed this as completed on Apr 10, 2024