
Segmentation Fault in TF 2.11 #2171

Closed
ndeepesh opened this issue Aug 10, 2023 · 9 comments
Labels: stale, stat:awaiting response, type:bug

@ndeepesh

On TF Serving 2.11 we are seeing segmentation faults with no further debug logs. Below is what we see in the output of sudo dmesg:
[1644821.183503] tensorflow_mode[22702]: segfault at 2 ip 000000000a1b93f6 sp 00007fe77cf46fe0 error 4 in tensorflow_model_server[400000+11a0c000]

We also don't see any core dumps on the box. How do we debug this further? Is there a way to enable core dumps in TensorFlow Serving? This model runs fine on TF 2.4. We are running inference on a CPU.
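
For reference, core dumps are controlled by the operating system rather than by TensorFlow Serving itself; one way to capture one is to raise the core-size limit in the environment that launches the server (the shell equivalent is ulimit -c unlimited). Below is a minimal Python sketch of that idea, assuming tensorflow_model_server is on PATH; the flags are placeholders, and where the dump lands still depends on the kernel's core_pattern setting.

import resource
import subprocess

# Raise the soft core-file size limit to the hard limit. Child processes
# inherit this, so a crashing server can leave a core dump behind.
_, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

# Placeholder invocation of the model server under the raised limit.
subprocess.run([
    "tensorflow_model_server",
    "--model_name=my_model",
    "--model_base_path=/models/my_model",
])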

@singhniraj08 singhniraj08 self-assigned this Aug 14, 2023
@singhniraj08

@ndeepesh,

A similar issue, #2085, was reported previously; users there found that setting TF_ENABLE_ONEDNN_OPTS=0 or updating to the Serving 2.13 release resolves the error. Can you try these workarounds and see whether they fix the segmentation fault? Thank you!
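
As background, TF_ENABLE_ONEDNN_OPTS toggles TensorFlow's oneDNN (MKL-DNN) CPU kernels, which became the default on x86 Linux builds starting with TF 2.9; setting it to 0 falls back to the stock Eigen kernels. A minimal sketch of applying the workaround when launching the server from Python follows (the shell equivalent is exporting TF_ENABLE_ONEDNN_OPTS=0 before starting tensorflow_model_server; the binary name and flag are placeholders):

import os
import subprocess

# Copy the current environment and disable oneDNN optimizations for the child.
env = dict(os.environ, TF_ENABLE_ONEDNN_OPTS="0")

subprocess.run(
    ["tensorflow_model_server", "--model_base_path=/models/my_model"],
    env=env,
)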

@ndeepesh

Hi @singhniraj08,
Do we know what in TF_ENABLE_ONEDNN_OPTS is causing the issue? I could not find that information in the other issue you referenced.

@ndeepesh

Hi @singhniraj08, below is the core dump analysis:

(gdb) bt
#0 0x00007f7a8e051387 in raise () from /lib64/libc.so.6
#1 0x00007f7a8e052a78 in abort () from /lib64/libc.so.6
#2 0x00007f7a8e093f67 in __libc_message () from /lib64/libc.so.6
#3 0x00007f7a8e09c329 in _int_free () from /lib64/libc.so.6
#4 0x000000000e0633b2 in tensorflow::(anonymous namespace)::Buffer::~Buffer() [clone .part.0] ()
#5 0x000000000e0634e0 in tensorflow::(anonymous namespace)::Buffer::~Buffer() ()
#6 0x000000000e0640c6 in tensorflow::Tensor::~Tensor() ()
#7 0x00000000081d98a7 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) ()
#8 0x00000000081db74c in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) ()
#9 0x000000000e4ed1fd in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) ()
#10 0x000000000e4ea729 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#11 0x000000000e2f64f2 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) ()
#12 0x00007f7a8ed14ea5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f7a8e119b0d in clone () from /lib64/libc.so.6

@singhniraj08

@ndeepesh, if possible, could you please share the steps to reproduce the issue? I will try to reproduce it on my end; that will help us understand what is causing it and how we can avoid it. Thank you.

@ndeepesh

@singhniraj08 Unfortunately, I cannot share the complete model. But I believe the thread you shared above has the steps to reproduce; the core dumps in both cases look similar.

@singhniraj08

@ndeepesh, I am unable to find steps to reproduce the error in the above thread. Instead of your full model, if you could share minimal reproducible model code that lets us replicate the issue on our end, that would help us a lot. Thanks
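
As an illustration of the kind of minimal repro being requested, a sketch like the following would be enough to load into tensorflow_model_server: a throwaway Keras model exported in TF Serving's versioned SavedModel layout. The architecture, paths, and flags are placeholders, not the reporter's model, and the export style assumes TF 2.x Keras.

import numpy as np
import tensorflow as tf

# Tiny placeholder model standing in for the real one.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(np.random.rand(32, 4), np.random.rand(32, 1), epochs=1, verbose=0)

# TF Serving expects a numeric version subdirectory under the base path.
model.save("/tmp/repro_model/1")

# Serve it with, for example:
#   tensorflow_model_server --rest_api_port=8501 \
#       --model_name=repro --model_base_path=/tmp/repro_model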

@github-actions

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale label Aug 26, 2023
@github-actions

github-actions bot commented Sep 2, 2023

This issue was closed due to lack of activity after being marked stale for the past 7 days.

@github-actions github-actions bot closed this as completed Sep 2, 2023