Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Thread deadlock in ObjectOutputStream #41862

Closed
icexelloss opened this issue May 28, 2024 · 24 comments
Closed

[C++] Thread deadlock in ObjectOutputStream #41862

icexelloss opened this issue May 28, 2024 · 24 comments
Assignees
Labels
Component: C++ Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Milestone

Comments

@icexelloss
Copy link
Contributor

icexelloss commented May 28, 2024

Describe the bug, including details regarding any error messages, version, and platform.

I am seeing a deadlock when destructing an ObjectOutputStream. I have attached the stack trace.

I did some debugging and found that the issue seems to be that the mutex in question is already held by this thread (I checked the __owner field in the pthread_mutex_t which points to the hanging thread)

Unfortunately the stack trace doesn’t show exactly which mutex it is trying to lock. I wonder if someone more familiar with the IO code has some ideas what might be the issue and where to dig deeper?

arrow_object_output_stream_stacktrace.txt

Edit:
I am using Arrow 15.0.0

Component(s)

C++

@pitrou
Copy link
Member

pitrou commented May 29, 2024

Copying the temptative diagnosis I initially posted on the dev ML:

It seems the problem lies here:

state->pending_parts_completed.MarkFinished(state->status);

The Future is marked finished with the ObjectOutputStream's mutex taken,
and the Future's callback then triggers a chain of event which leads to
calling the ObjectOutputStream destructor, which in turn tries to take
the lock.

@pitrou
Copy link
Member

pitrou commented May 29, 2024

@icexelloss If I submit a PR trying to fix this, will it be possible for you to validate it with your use case?

@icexelloss
Copy link
Contributor Author

Yep - I can cherry pick the patch internally and test.

@icexelloss
Copy link
Contributor Author

@pitrou I am curious how did you find the HandleUploadOutcome method? I couldn't find it in the stacktrace.

@pitrou
Copy link
Member

pitrou commented May 29, 2024

It's frame 68 in the stacktrace :-)

@pitrou
Copy link
Member

pitrou commented May 29, 2024

Ok, I have not one but two potential fixes, it would be nice if you could try each of them independently: #41875 , #41876

Note that both seem desirable in any case.

@icexelloss
Copy link
Contributor Author

Thanks @pitrou! I will test both out in the next couple of days!

@icexelloss icexelloss changed the title Thread deadlock in ObjectOutputStream [C++] Thread deadlock in ObjectOutputStream May 30, 2024
@icexelloss
Copy link
Contributor Author

icexelloss commented May 30, 2024

FYI I am testing #41875 Today, I will let it run through out the day because the it doesn't always happen quickly.

@icexelloss
Copy link
Contributor Author

#41875 doesn't seem to work and with similar stacktrace :(

I am rerunning it to double check

@icexelloss
Copy link
Contributor Author

I am testing #41876 Today

@pitrou
Copy link
Member

pitrou commented Jun 3, 2024

#41875 doesn't seem to work and with similar stacktrace :(

Could you post the new stacktrace somewhere?

@icexelloss
Copy link
Contributor Author

#41875 doesn't seem to work and with similar stacktrace :(

Could you post the new stacktrace somewhere?

Yep, let me do that.

@icexelloss
Copy link
Contributor Author

gdb-dataset-path-41875.txt

Here is the stacktrace after I applied the #41875 patch

@pitrou
Copy link
Member

pitrou commented Jun 3, 2024

Here is the stacktrace after I applied the #41875 patch

Did you apply it to an old version of Arrow? I cannot find anything in recent versions that corresponds to this traceback line:

#21 ~<lambda> (this=0x5b98d96d4578, __in_chrg=<optimized out>) at /tmp/build/ts/arrow/dataset/c/src/arrow/dataset/dataset_writer.cc:363

It would be nice if you could test the PR itself.

@pitrou
Copy link
Member

pitrou commented Jun 3, 2024

That said, I agree the stack trace is very similar (except for the unique_ptr being replaced by a shared_ptr in the patch you applied).

@icexelloss
Copy link
Contributor Author

icexelloss commented Jun 3, 2024

Here is the stacktrace after I applied the #41875 patch

Did you apply it to an old version of Arrow? I cannot find anything in recent versions that corresponds to this traceback line:

#21 ~<lambda> (this=0x5b98d96d4578, __in_chrg=<optimized out>) at /tmp/build/ts/arrow/dataset/c/src/arrow/dataset/dataset_writer.cc:363

It would be nice if you could test the PR itself.

This is on Arrow 15. The line numbers are slightly off but this is the line that stack 21 is referring to:

https://github.com/apache/arrow/blob/apache-arrow-15.0.0/cpp/src/arrow/dataset/dataset_writer.cc#L346

(or on master)
https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/dataset_writer.cc#L346

@pitrou
Copy link
Member

pitrou commented Jun 3, 2024

Ok, thanks. That explains why the "fix" doesn't work then :-)

@icexelloss
Copy link
Contributor Author

Ok, thanks. That explains why the "fix" doesn't work then :-)

Sorry I am slightly confused. The "fix" doesn't work because I am using Arrow 15 and the fix only works on newer Arrow version?

@pitrou
Copy link
Member

pitrou commented Jun 3, 2024

No, I mean the traceback better explains why the fix doesn't work. It probably wouldn't work on latest Arrow either (though you can still try if that's easy for you).

@icexelloss
Copy link
Contributor Author

I see. Trying #41876 then.

@icexelloss
Copy link
Contributor Author

@pitrou I ran tests with #41876 for two days and didn't see a deadlock. So I think #41876 fixes the issue that I was seeing.

@pitrou
Copy link
Member

pitrou commented Jun 6, 2024

Cool, thanks!

pitrou added a commit to pitrou/arrow that referenced this issue Jun 6, 2024
pitrou added a commit to pitrou/arrow that referenced this issue Jun 10, 2024
pitrou added a commit that referenced this issue Jun 10, 2024
…#41876)

### Rationale for this change

When the Future returned by `OutputStream::CloseAsync` finishes, it can invoke a user-supplied callback. That callback may well destroy the stream as a side effect. If the stream is a S3 output stream, this might lead to a deadlock involving the mutex in the output stream's `UploadState` structure, since the callback is called with that mutex locked.

### What changes are included in this PR?

Unlock the `UploadState` mutex before marking the Future finished, to avoid deadlocking.

### Are these changes tested?

No. Unfortunately, I wasn't able to write a test that would trigger the original condition. Additional preconditions seem to be required to reproduce the deadlock. For example, it might require a mutex implementation that hangs if destroyed while locked.

### Are there any user-facing changes?

No.

* GitHub Issue: #41862

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 17.0.0 milestone Jun 10, 2024
@pitrou
Copy link
Member

pitrou commented Jun 10, 2024

Issue resolved by pull request 41876
#41876

@pitrou pitrou closed this as completed Jun 10, 2024
@icexelloss
Copy link
Contributor Author

Thank you! @pitrou

@amoeba amoeba added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Jul 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: C++ Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Projects
None yet
Development

No branches or pull requests

3 participants