tests/python/pants_test/base:exception_sink_integration is flaky #8127

Closed
stuhood opened this issue Jul 29, 2019 · 23 comments · Fixed by #9722 or #9769
Comments

@stuhood (Member) commented Jul 29, 2019

When run locally, this completes relatively quickly, but in some fraction of runs it seems to hang forever, triggering the 360-second test timeout in Travis.

tests/python/pants_test/base:exception_sink_integration                         .....Command '['/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/bin/python3.6', '/Users/travis/build/pantsbuild/pants/.pants.d/test/pytest-prep/CPython-3.6.5/929c23cae3b600b495a5d319ae6c47e8b41a2667', '-c', '/dev/null', '-ocache_dir=/Users/travis/build/pantsbuild/pants/.pants.d/test/pytest/.pytest_cache', '--junitxml', '/Users/travis/build/pantsbuild/pants/.pants.d/test/pytest/tests.python.pants_test.base.exception_sink_integration/junitxml/TEST-tests.python.pants_test.base.exception_sink_integration.xml', '--confcutdir', '/Users/travis/build/pantsbuild/pants', '--continue-on-collection-errors', '--color', 'yes', '-q', '-rfa', '--rootdir', '/Users/travis/build/pantsbuild/pants', '-p', '__pants_backend_python_tasks_pytest_prep_pytest_plugin__', '--pants-sources-map-path', '/Users/travis/build/pantsbuild/pants/.pants.d/test/pytest/tmpg79dyv4r/sources_map.json', '/Users/travis/build/pantsbuild/pants/.pants.d/pyprep/sources/49f3f1d9d9dc377d027f9fb364db7fffbb6a5ab9/pants_test/base/test_exception_sink_integration.py']' timed out after 360 seconds
@stuhood (Member Author) commented Jul 29, 2019

Seen in #8123.

@stuhood (Member Author) commented Aug 1, 2019

Seen again in #8099.

@stuhood (Member Author) commented Aug 7, 2019

Seen again on master.

@stuhood (Member Author) commented Aug 7, 2019

Seen in #8143.

@stuhood (Member Author) commented Aug 7, 2019

Seen again in master.

This is probably our highest priority flaky test, as it seems to just hang fairly frequently.

@stuhood (Member Author) commented Aug 9, 2019

Seen again on the OSX shard in #8153. The timeout for this one is now 540 seconds, and it takes about 30 seconds to run locally on OSX, so something strange is happening. Maybe we're being forced to re-bootstrap or recompile? Or it is just hanging.

@stuhood (Member Author) commented Aug 14, 2019

Seen again in both #8165 and #8166 on the OSX shard.

@cattibrie (Contributor)

Seen in #8150.

@pierrechevalier83 (Contributor)

Seen in #8192. It's not the first time I've seen it, but it is the first time I've commented here. Overall, there is no doubt this one regularly exceeds its timeout.

@stuhood (Member Author) commented Aug 22, 2019

Seen again in master.

@stuhood (Member Author) commented Aug 24, 2019

Seen again in #8201.

@stuhood (Member Author) commented Aug 29, 2019

Seen again in #8221 on the OSX shard.

@cattibrie (Contributor)

Seen in #8223 in the OSX platform-specific tests shard.

@Eric-Arellano (Contributor)

Seen again in the OSX platform-specific tests shard, with the timeout now at 540 seconds.

@stuhood we should probably lower the timeout to less than 540 seconds, since this appears to be a case of hanging forever; that way it fails eagerly (see the sketch below for where the timeout lives).

I do not think this is an issue with trying to re-bootstrap ./pants. Now that #8183 has landed, we only ever use ./pants.pex for integration tests, so that should not even be possible.
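
For context on the timeout suggestion above: per-test timeouts live on the test's `python_tests` target in its BUILD file. A minimal sketch, assuming a layout like the one below (the exact name, sources, and other fields of the real target are assumptions):

```python
# Hypothetical BUILD entry for the flaky target; field values are illustrative.
python_tests(
    name='exception_sink_integration',
    sources=['test_exception_sink_integration.py'],
    # Lowering this per-target timeout (in seconds) makes a hang fail faster in CI.
    timeout=540,
)
```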

@cattibrie (Contributor)

Seen in #8233 in the OSX platform-specific tests shard.

@cattibrie (Contributor)

Seen in #8276 in the OSX platform-specific tests shard.

@stuhood (Member Author) commented Sep 16, 2019

Seen again in #8113.

@stuhood (Member Author) commented Sep 26, 2019

Seen again in #8088.

@wisechengyi (Contributor)

Seen again in #8406

@pierrechevalier83 (Contributor)

Seen again in #8452

@Eric-Arellano Eric-Arellano self-assigned this Nov 6, 2019
@Eric-Arellano (Contributor)

I'm looking into this today. I agree with Stu that this is likely our highest priority flake.

Locally, I ran a script that repeats the test until failure. On the first run it took 71 attempts to fail; on the second, 131 attempts. That translates to roughly 1.3% and 0.7% of runs failing, respectively. In CI, the failure rate seems closer to 20%, so I'm going to try debugging in CI instead.
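
A minimal sketch of such a repeat-until-failure loop, assuming the test is invoked via `./pants test` (the exact script that was used is not shown in this thread):

```python
# Hypothetical repeat-until-failure driver; the target and invocation are assumptions.
import subprocess
import sys

TARGET = 'tests/python/pants_test/base:exception_sink_integration'

attempt = 0
while True:
    attempt += 1
    # Run the test target once; a non-zero exit code means it failed (or hung past its timeout).
    result = subprocess.run(['./pants', 'test', TARGET])
    if result.returncode != 0:
        print(f'Failed on attempt {attempt}', file=sys.stderr)
        sys.exit(result.returncode)
```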

@Eric-Arellano (Contributor) commented Nov 7, 2019

On a successful OSX shard, the test takes 5 minutes to run. Locally on OSX, it takes 30-35 seconds. Something seems to be going on with Travis.

These were the individual tests that took longer than local execution:

  • prints_traceback_on_sigusr2
  • keyboardinterrupt
  • dumps_traceback_on_sigabrt
  • dumps_logs_on_signal

EDIT: the common denominator for all of these tests is _make_waiter_handle:

```python
@contextmanager
def _make_waiter_handle(self):
    with temporary_dir() as tmpdir:
        # The path is required to end in '.pants.d'. This is validated in
        # GoalRunner#is_valid_workdir().
        workdir = os.path.join(tmpdir, '.pants.d')
        safe_mkdir(workdir)
        arrive_file = os.path.join(tmpdir, 'arrived')
        await_file = os.path.join(tmpdir, 'await')
        waiter_handle = self.run_pants_with_workdir_without_waiting([
            '--no-enable-pantsd',
            'run', 'testprojects/src/python/coordinated_runs:phaser',
            '--', arrive_file, await_file
        ], workdir)
        # Wait for testprojects/src/python/coordinated_runs:phaser to be running.
        while not os.path.exists(arrive_file):
            time.sleep(0.1)

        def join():
            touch(await_file)
            return waiter_handle.join()

        yield (workdir, waiter_handle.process.pid, join)
```
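
Each of the hanging tests drives this helper roughly as follows; this is a hedged sketch (the exact signal sent and assertions made vary per test and are assumptions here):

```python
import os
import signal

# Hypothetical test body using the helper above; illustrative only.
def test_prints_traceback_on_sigusr2(self):
    with self._make_waiter_handle() as (workdir, pid, join):
        # Signal the still-running child, then unblock it via the await file and wait for exit.
        os.kill(pid, signal.SIGUSR2)
        waiter_run = join()
        self.assert_success(waiter_run)
```

If the child never creates the arrive file, or never exits after the await file is touched, the test blocks indefinitely, which would match the hang observed in CI.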

@Eric-Arellano Eric-Arellano removed their assignment Nov 22, 2019
Eric-Arellano added a commit that referenced this issue Nov 22, 2019
### Skip some exception sink integration tests on macOS
These tests have chronically flaked by hanging since at least July (#8127). They are our most egregious Python flake.

We will still run these four tests on Linux and only skip them on macOS, as this is a macOS-specific issue.
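
A minimal sketch of what a macOS-only skip can look like (the decorator placement and reason string here are assumptions, not the exact change):

```python
import sys
import unittest

class ExceptionSinkIntegrationTest(unittest.TestCase):
    # Hypothetical skip marker; skips only when running on macOS.
    @unittest.skipIf(sys.platform == 'darwin', 'Hangs on macOS; see #8127')
    def test_dumps_logs_on_signal(self):
        ...
```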

### Skip `remote::tests::dropped_request_cancels`
This seems to flake roughly 40% of the time (#8405). It is our most egregious Rust flake.

### Tweak some other tests
We mark some other tests as flaky and bump their timeouts where relevant, to hopefully stabilize CI further.
@jsirois jsirois self-assigned this Mar 24, 2020
@jsirois jsirois removed their assignment Mar 25, 2020
stuhood pushed a commit that referenced this issue May 13, 2020
### Problem

The setup and teardown of each request made to the nailgun server in `pantsd` had become quite complicated over time... and consequently, slower than it needed to be.

### Solution

Port `pantsd`'s nailgun server to rust using the `nails` crate. Additionally, remove the `Exiter` class, which had accumulated excess responsibilities that can instead be handled by returning `ExitCode` values. Finally, fix a few broken windows including: double logging to pantsd, double help output, closed file errors on pantsd shutdown, and redundant setup codepaths.

### Result

There is less code to maintain, and runs of `./pants --enable-pantsd help` take `~1.7s`, of which `~400ms` are spent in the server. Fixes #9448, fixes #8243, fixes #8206, fixes #8127, fixes #7653, fixes #7613, fixes #7597.
@stuhood stuhood reopened this May 14, 2020
@stuhood stuhood self-assigned this May 14, 2020
@stuhood (Member Author) commented May 14, 2020

I'm monitoring this one to decide what to do.

stuhood pushed a commit that referenced this issue May 14, 2020
### Problem

A while back we started capturing core dumps "globally" in Travis. But in practice we have never consumed them, and I'm fairly certain that they are causing the OSX shards that test sending `SIGABRT` (which, if core dumps are enabled, will trigger a core dump) to `pantsd` to:
1. be racy, because while the core is dumping the process is non-responsive and can't be killed, leading to errors like:
   ```
   FAILURE: failure while terminating pantsd: failed to kill pid 28775 with signals (<Signals.SIGTERM: 15>, <Signals.SIGKILL: 9>)
   ```
2. run out of disk space: we've seen mysterious "out of disk" errors on the OSX shards... and core dumps are large.

### Solution

Disable core dumps everywhere. If we end up needing them in the future, we can enable them on a case-by-case basis.
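
For reference, one generic way to turn core dumps off for a process and its children (a sketch only; the actual change was made in the CI configuration, not necessarily via Python):

```python
import resource

# A zero soft/hard limit on core file size means no core file is ever written,
# so a SIGABRT terminates the process without a long, unkillable dump.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
```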

### Result

Fixes #8127.

[ci skip-rust-tests]
[ci skip-jvm-tests]
stuhood pushed a commit that referenced this issue May 14, 2020