Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A version of the sync macro that throws earlier #34198

Merged
merged 4 commits into from
Mar 13, 2020
Merged

A version of the sync macro that throws earlier #34198

merged 4 commits into from
Mar 13, 2020

Conversation

Keno
Copy link
Member

@Keno Keno commented Dec 25, 2019

I've been looking at what causes deadlocks in our test suite in an effort to cut down on the number of failed tests on CI that result in hangs (since those are hard to diagnose and resolve). I found that by playing with various resource limits, it is easy to create hangs in the test suite. The reason we get a hang rather than a more easily diagnosable error is two fold. We either:

  1. Aren't watching for the error (e.g. a socket remote end closing)
  2. We aren't propagating the error to the top level

A very common situation for case 2) is that the test is wrapped in @sync which doesn't return until all tasks have finished or error'ed. However, in many cases one of the tasks produces data for the others, so if that task errors, the remaining tasks will wait forever. This PR aims to address that situation by introducing a new @syncany macro that immediately rethrows any errors thrown by a contained task rather than waiting for all of them to finish. The implementation isn't super performant (it allocates a new task per object being waited on), but should be sufficient for use in the test suite. A better implementation would create a new scheduler object that can be inserted into multiple wait queues.

Example usage of the new macro:

@sync begin
    @async error("Hello")
    @async sleep(1000)
end # Waits 1000s

@syncany begin
    @async error("Hello")
    @async sleep(1000)
end # Throws immediately

The macro doesn't do any sort of cleanup for the tasks that do not finish, and just lets them run. In the future, we may want to automatically cancel those tasks, but that seemed like a bigger design problem than the simple thing that I wanted (something that propagates error messages more readily, so we see them in the logs).

@Keno Keno requested review from vtjnash and JeffBezanson December 25, 2019 19:16
@jonas-schulze
Copy link
Contributor

Couldn’t this be the new default behavior?

base/task.jl Outdated
during error handling.

!!! Note
This interface is experimental and subject to change or removal without notice.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for marking this experimental. Otherwise, implementing structured concurrency #33248 becomes much harder.

@vtjnash
Copy link
Member

vtjnash commented Jan 5, 2020

Maybe we should call this Base.Future.@async (with the idea that once optimized, it should be the default)? The any name seems odd (because it is waiting for all, it’s just early error).

@Keno
Copy link
Member Author

Keno commented Jan 6, 2020

You mean Base.Future.@sync? I'd be fine with that.

@tkf
Copy link
Member

tkf commented Jan 6, 2020

Isn't it strange to put an experimental API in Future? I think it makes harder to fix the bigger design problem.

@c42f
Copy link
Member

c42f commented Jan 8, 2020

Letting tasks drop out the bottom seems like a pretty big change to the semantics of @sync so I'm not sure it makes sense to declare it the Future without trying it out for a while. I'd also be happier not to export it for the moment if it's experimental.

As you say, it would ideally interact with some kind of cancellation. That's hard but it feels a lot more consistent with the way @sync works right now.

It seems handy though and could be a big practical improvement for the test suite. For the moment, could we just name it clearly and not export it? How about @sync_failfast or something equally ugly but explicit?

@Keno Keno added the triage This should be discussed on a triage call label Feb 25, 2020
@Keno
Copy link
Member Author

Keno commented Feb 27, 2020

From triage: let's call this Base.Experimental.@sync

@Keno Keno removed the triage This should be discussed on a triage call label Feb 27, 2020
@tkf
Copy link
Member

tkf commented Feb 27, 2020

Base.Experimental.@sync doesn't sound like a stop-gap measure. Shouldn't the macro name (@sync) clarify that:

  • this macro has undesirable property (tasks leak),
  • it is included temporary only because a real solution is impossible to implement with the current infrastructure, and
  • it will be removed or at least will be discouraged to use in the future?

@Keno
Copy link
Member Author

Keno commented Mar 13, 2020

Let's merge this and see if it helps CI. Being in experimental makes it explicitly unsupported, so we can change or remove it if we feel like it at any point.

@Keno Keno merged commit 558eec9 into master Mar 13, 2020
@Keno Keno deleted the kf/syncany branch March 13, 2020 02:19
ravibitsgoa pushed a commit to ravibitsgoa/julia that referenced this pull request Apr 9, 2020
I've been looking at what causes deadlocks in our test suite in an effort to cut down on the number of failed tests on CI that result in hangs (since those are hard to diagnose and resolve). I found that by playing with various resource limits, it is easy to create hangs in the test suite. The reason we get a hang rather than a more easily diagnosable error is two fold. We either:

1. Aren't watching for the error (e.g. a socket remote end closing)
2. We aren't propagating the error to the top level

A very common situation for case 2) is that the test is wrapped in @sync which doesn't return until all tasks have finished or error'ed. However, in many cases one of the tasks produces data for the others, so if that task errors, the remaining tasks will wait forever. This PR aims to address that situation by introducing a new `Experimental.@sync` macro that immediately rethrows any errors thrown by a contained task rather than waiting for all of them to finish. The implementation isn't super performant (it allocates a new task per object being waited on), but should be sufficient for use in the test suite. A better implementation would create a new scheduler object that can be inserted into multiple wait queues.

Example usage of the new macro:

```
@sync begin
    @async error("Hello")
    @async sleep(1000)
end # Waits 1000s

Experimental.@sync begin
    @async error("Hello")
    @async sleep(1000)
end # Throws immediately
```

The macro doesn't do any sort of cleanup for the tasks that do not finish, and just lets them run. In the future, we may want to automatically cancel those tasks, but that seemed like a bigger design problem than the simple thing that I wanted (something that propagates error messages more readily, so we see them in the logs).
KristofferC pushed a commit that referenced this pull request Apr 11, 2020
I've been looking at what causes deadlocks in our test suite in an effort to cut down on the number of failed tests on CI that result in hangs (since those are hard to diagnose and resolve). I found that by playing with various resource limits, it is easy to create hangs in the test suite. The reason we get a hang rather than a more easily diagnosable error is two fold. We either:

1. Aren't watching for the error (e.g. a socket remote end closing)
2. We aren't propagating the error to the top level

A very common situation for case 2) is that the test is wrapped in @sync which doesn't return until all tasks have finished or error'ed. However, in many cases one of the tasks produces data for the others, so if that task errors, the remaining tasks will wait forever. This PR aims to address that situation by introducing a new `Experimental.@sync` macro that immediately rethrows any errors thrown by a contained task rather than waiting for all of them to finish. The implementation isn't super performant (it allocates a new task per object being waited on), but should be sufficient for use in the test suite. A better implementation would create a new scheduler object that can be inserted into multiple wait queues.

Example usage of the new macro:

```
@sync begin
    @async error("Hello")
    @async sleep(1000)
end # Waits 1000s

Experimental.@sync begin
    @async error("Hello")
    @async sleep(1000)
end # Throws immediately
```

The macro doesn't do any sort of cleanup for the tasks that do not finish, and just lets them run. In the future, we may want to automatically cancel those tasks, but that seemed like a bigger design problem than the simple thing that I wanted (something that propagates error messages more readily, so we see them in the logs).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants