GH-41862: [C++][S3] Fix potential deadlock when closing output stream #41876

Merged
merged 3 commits into from
Jun 10, 2024

pitrou
Member

@pitrou pitrou commented May 29, 2024

Rationale for this change

When the Future returned by OutputStream::CloseAsync finishes, it can invoke a user-supplied callback. That callback may well destroy the stream as a side effect. If the stream is an S3 output stream, this might lead to a deadlock involving the mutex in the output stream's UploadState structure, since the callback is called with that mutex locked.

What changes are included in this PR?

Unlock the UploadState mutex before marking the Future finished, to avoid deadlocking.
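
A minimal sketch of the pattern described above, under simplifying assumptions. `UploadStateSketch`, `on_finished`, and `HandlePartCompleted` are hypothetical names used only for illustration, and a plain `std::function` stands in for the Future's callbacks; this is not the actual code in s3fs.cc.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the stream's shared upload state (not Arrow's type).
struct UploadStateSketch {
  std::mutex mutex;
  int parts_in_progress = 0;
  std::function<void()> on_finished;  // stands in for the Future's callbacks
};

// Called when one part upload completes.
// Returns true if this call ran the completion callback.
bool HandlePartCompleted(const std::shared_ptr<UploadStateSketch>& state) {
  std::function<void()> callback;
  {
    std::unique_lock<std::mutex> lock(state->mutex);
    assert(state->parts_in_progress > 0);
    if (--state->parts_in_progress != 0) return false;
    // Take the callback out of the shared state while the lock is held.
    callback.swap(state->on_finished);
  }  // the lock is released here, before any user code runs
  if (!callback) return false;
  callback();  // may lock state->mutex again or drop its references
  return true;
}
```

Because the callback runs after the scope holding the lock has ended, it can safely take the same mutex again or release references to the state, which is the situation the PR description is concerned with.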

Are these changes tested?

No. Unfortunately, I wasn't able to write a test that would trigger the original condition. Additional preconditions seem to be required to reproduce the deadlock. For example, it might require a mutex implementation that hangs if destroyed while locked.

Are there any user-facing changes?

No.

@pitrou
Member Author

pitrou commented May 29, 2024

@github-actions crossbow submit -g cpp

@pitrou pitrou changed the title from "GH-41862: [C++][S3] Try out possible fix number 2" to "GH-41862: [C++][S3] Fix potential deadlock when closing output stream" Jun 6, 2024
@pitrou pitrou marked this pull request as ready for review June 6, 2024 15:41
@pitrou
Member Author

pitrou commented Jun 6, 2024

@github-actions crossbow submit -g cpp

github-actions bot commented Jun 6, 2024

Revision: 51d0738

Submitted crossbow builds: ursacomputing/crossbow @ actions-b40dadb2f5

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

@pitrou pitrou requested a review from felipecrv June 6, 2024 16:06
Comment on lines 1901 to 1902
lock.unlock();
state->pending_parts_completed.MarkFinished(state->status);
Contributor

Now it's possible for this object to go from "Finished" to "not-finished". There might be logic relying on the state machine converging to the finished state and staying there.

I might do a full review of this class at some point.

Member Author

Now it's possible for this object to go from "Finished" to "not-finished".

Where is it possible?

Member

auto fut = state->pending_parts_completed;
lock.unlock();
fut.MarkFinished(state->status);

Maybe something like this can be done if we are afraid that pending_parts_completed might change, but I've checked that this wouldn't happen.

Contributor

state->parts_in_progress reached 0, you unlock, another thread could increment parts_in_progress and now you call MarkFinished. Is parts_in_progress > 0 considered "finished"?

Member

Ahhh, yes, it can. This is possible because Write and HandleUploadOutcome can run concurrently. We'd better do something like https://github.com/apache/arrow/pull/41876/files#r1631434845 to avoid marking the future finished before a task has really finished.

Member Author

Um, you're right. @mapleFU's suggestion looks ok to me.

Note that the pending_parts_completed future can only be waited on in two situations:

  • the user called blocking Close or Flush and the future is waited upon before returning from the API call;
  • the user called non-blocking CloseAsync, which returns a cascaded future obtained by chaining pending_parts_completed.Then with a continuation.

For Close and CloseAsync, it's certainly not ok to call Write from another thread concurrently.
For Flush, it should be ok to call Write concurrently, but the Flush does not have to wait for the completion of the concurrent Write call.

More generally, it doesn't seem sound to write to an output stream (as opposed to a random-access file) from several threads concurrently.

Member

the user called blocking Close or Flush and the future is waited upon before returning from the API call;

Maybe the following sequence could happen:

  1. The last request finishes, acquires the lock, and decrements the count to 0
  2. A new request is sent, and pending_parts_completed is set to a new future
  3. The thread from (1) calls pending_parts_completed.MarkFinished, which may act on the new future

So a later blocking wait would be wrong?
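
The sequence above can be replayed deterministically on a single thread. The following is only a sketch under stated assumptions: `FutureSketch`, `StateSketch`, `StartPart`, and `FinishPart` are hypothetical names, the future is reduced to a shared flag rather than Arrow's `Future`, and the `in_window` callable simulates what another thread might do after the lock is released. It compares marking the future through the shared state with marking a copy taken while the lock was still held, as in the `auto fut = state->pending_parts_completed;` suggestion earlier in this thread.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical minimal future: copies share one "finished" flag.
struct FutureSketch {
  std::shared_ptr<bool> finished = std::make_shared<bool>(false);
  void MarkFinished() { *finished = true; }
  bool is_finished() const { return *finished; }
};

// Hypothetical stand-in for the upload state (not Arrow's type).
struct StateSketch {
  std::mutex mutex;
  int parts_in_progress = 0;
  FutureSketch pending_parts_completed;
};

// Step 2: a new upload starts; the first one in a batch installs a new future.
void StartPart(StateSketch& state) {
  std::lock_guard<std::mutex> lock(state.mutex);
  if (state.parts_in_progress++ == 0) {
    state.pending_parts_completed = FutureSketch{};
  }
}

// Steps 1 and 3: the last part finishes. `in_window` runs after the lock is
// released, simulating another thread (single-threaded here, so the
// unlocked read in the non-snapshot branch is only safe in this sketch).
template <typename Fn>
void FinishPart(StateSketch& state, bool use_snapshot, Fn&& in_window) {
  std::unique_lock<std::mutex> lock(state.mutex);
  assert(state.parts_in_progress > 0);
  if (--state.parts_in_progress != 0) return;
  FutureSketch snapshot = state.pending_parts_completed;  // copied under lock
  lock.unlock();
  in_window();
  if (use_snapshot) {
    snapshot.MarkFinished();  // marks the future that actually completed
  } else {
    state.pending_parts_completed.MarkFinished();  // may hit a newer future
  }
}
```

In this sketch, calling `FinishPart(state, false, ...)` with a callable that invokes `StartPart(state)` marks the newly installed future rather than the one whose parts completed, while `use_snapshot = true` marks the original one.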

cpp/src/arrow/filesystem/s3fs.cc (resolved)
@github-actions github-actions bot added the awaiting merge label and removed the awaiting review label Jun 7, 2024
@github-actions github-actions bot added the awaiting changes label and removed the awaiting merge label Jun 7, 2024
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Jun 10, 2024
@pitrou
Member Author

pitrou commented Jun 10, 2024

@github-actions crossbow submit -g cpp

Revision: c48d59a

Submitted crossbow builds: ursacomputing/crossbow @ actions-4b9dce20f9

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

@pitrou
Member Author

pitrou commented Jun 10, 2024

Ok, since there are two +1s, I will merge if CI is green (or the failures are unrelated).

@pitrou pitrou merged commit 036fca0 into apache:main Jun 10, 2024
37 of 38 checks passed
@pitrou pitrou removed the awaiting change review label Jun 10, 2024
@pitrou pitrou deleted the gh41862-fix2 branch June 10, 2024 14:50
@github-actions github-actions bot added the awaiting committer review label Jun 10, 2024

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 036fca0.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 20 possible false positives for unstable benchmarks that are known to sometimes produce them.
