tests/python/pants_test/base:exception_sink_integration is flaky #8127

Closed
stuhood opened this issue Jul 29, 2019 · 23 comments · Fixed by #9722 or #9769
Comments

@stuhood (Member) commented Jul 29, 2019

When run locally, this completes relatively quickly, but in some fraction of runs it seems to hang forever, triggering the 360-second test timeout in Travis.

tests/python/pants_test/base:exception_sink_integration                         .....Command '['/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/bin/python3.6', '/Users/travis/build/pantsbuild/pants/.pants.d/test/pytest-prep/CPython-3.6.5/929c23cae3b600b495a5d319ae6c47e8b41a2667', '-c', '/dev/null', '-ocache_dir=/Users/travis/build/pantsbuild/pants/.pants.d/test/pytest/.pytest_cache', '--junitxml', '/Users/travis/build/pantsbuild/pants/.pants.d/test/pytest/tests.python.pants_test.base.exception_sink_integration/junitxml/TEST-tests.python.pants_test.base.exception_sink_integration.xml', '--confcutdir', '/Users/travis/build/pantsbuild/pants', '--continue-on-collection-errors', '--color', 'yes', '-q', '-rfa', '--rootdir', '/Users/travis/build/pantsbuild/pants', '-p', '__pants_backend_python_tasks_pytest_prep_pytest_plugin__', '--pants-sources-map-path', '/Users/travis/build/pantsbuild/pants/.pants.d/test/pytest/tmpg79dyv4r/sources_map.json', '/Users/travis/build/pantsbuild/pants/.pants.d/pyprep/sources/49f3f1d9d9dc377d027f9fb364db7fffbb6a5ab9/pants_test/base/test_exception_sink_integration.py']' timed out after 360 seconds
@stuhood (Member Author) commented Jul 29, 2019

Seen in #8123.

@stuhood (Member Author) commented Aug 1, 2019

Seen again in #8099.

@stuhood (Member Author) commented Aug 7, 2019

Seen again on master.

@stuhood (Member Author) commented Aug 7, 2019

Seen in #8143.

@stuhood (Member Author) commented Aug 7, 2019

Seen again in master.

This is probably our highest priority flaky test, as it seems to just hang fairly frequently.

@stuhood (Member Author) commented Aug 9, 2019

Seen again on the OSX shard in #8153. The timeout for this one is now 540 seconds, and it takes about 30 seconds to run locally on OSX, so something strange is happening. Maybe we're being forced to re-bootstrap or recompile? Or it is just hanging.

@stuhood (Member Author) commented Aug 14, 2019

Seen again in both #8165 and #8166 on the OSX shard.

@cattibrie (Contributor)

Seen in #8150.

@pierrechevalier83 (Contributor)

Seen in #8192. It's not the first time I've seen it, but it is the first time I've commented here. Overall, there is no doubt this one regularly exceeds its timeout.

@stuhood (Member Author) commented Aug 22, 2019

Seen again in master.

@stuhood (Member Author) commented Aug 24, 2019

Seen again in #8201.

@stuhood (Member Author) commented Aug 29, 2019

Seen again in #8221 on the OSX shard.

@cattibrie (Contributor)

Seen in #8223 in the OSX platform-specific tests shard.

@Eric-Arellano (Contributor)

Seen again in the OSX platform-specific tests shard, with the timeout now at 540 seconds.

@stuhood we should probably lower the timeout to less than 540 seconds, since this appears to be a case of hanging forever; that way it fails eagerly (see the sketch below for where the timeout lives).

I do not think this is an issue with trying to re-bootstrap ./pants. Now that #8183 has landed, we only ever use ./pants.pex for integration tests, so that should not even be possible.
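
For context on the timeout suggestion above: per-test timeouts live on the test's `python_tests` target in its BUILD file. A minimal sketch, assuming a layout like the one below (the exact name, sources, and other fields of the real target are assumptions):

```python
# Hypothetical BUILD entry for the flaky target; field values are illustrative.
python_tests(
    name='exception_sink_integration',
    sources=['test_exception_sink_integration.py'],
    # Lowering this per-target timeout (in seconds) makes a hang fail faster in CI.
    timeout=540,
)
```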

@cattibrie (Contributor)

Seen in #8233 in the OSX platform-specific tests shard.

@cattibrie (Contributor)

Seen in #8276 in the OSX platform-specific tests shard.

@stuhood (Member Author) commented Sep 16, 2019

Seen again in #8113.

@stuhood (Member Author) commented Sep 26, 2019

Seen again in #8088.

@wisechengyi (Contributor)

Seen again in #8406

@pierrechevalier83 (Contributor)

Seen again in #8452

@Eric-Arellano Eric-Arellano self-assigned this Nov 6, 2019
@Eric-Arellano (Contributor)

I'm looking into this today. I agree with Stu that this is likely our highest priority flake.

Locally, I ran a script that repeats the test until failure. On the first run it took 71 attempts to fail; on the second, 131 attempts. That translates to roughly 1.3% and 0.7% of runs failing, respectively. In CI, the failure rate seems closer to 20%, so I'm going to try debugging in CI instead.
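
A minimal sketch of such a repeat-until-failure loop, assuming the test is invoked via `./pants test` (the exact script that was used is not shown in this thread):

```python
# Hypothetical repeat-until-failure driver; the target and invocation are assumptions.
import subprocess
import sys

TARGET = 'tests/python/pants_test/base:exception_sink_integration'

attempt = 0
while True:
    attempt += 1
    # Run the test target once; a non-zero exit code means it failed (or hung past its timeout).
    result = subprocess.run(['./pants', 'test', TARGET])
    if result.returncode != 0:
        print(f'Failed on attempt {attempt}', file=sys.stderr)
        sys.exit(result.returncode)
```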

@Eric-Arellano (Contributor) commented Nov 7, 2019

On a successful OSX shard, the test takes 5 minutes to run. Locally on OSX, it takes 30-35 seconds. Something seems to be going on with Travis.

These were the individual tests that took longer than local execution:

  • prints_traceback_on_sigusr2
  • keyboardinterrupt
  • dumps_traceback_on_sigabrt
  • dumps_logs_on_signal

EDIT: the common denominator for all of these tests is _make_waiter_handle:

```python
@contextmanager
def _make_waiter_handle(self):
    with temporary_dir() as tmpdir:
        # The path is required to end in '.pants.d'. This is validated in
        # GoalRunner#is_valid_workdir().
        workdir = os.path.join(tmpdir, '.pants.d')
        safe_mkdir(workdir)
        arrive_file = os.path.join(tmpdir, 'arrived')
        await_file = os.path.join(tmpdir, 'await')
        waiter_handle = self.run_pants_with_workdir_without_waiting([
            '--no-enable-pantsd',
            'run', 'testprojects/src/python/coordinated_runs:phaser',
            '--', arrive_file, await_file
        ], workdir)
        # Wait for testprojects/src/python/coordinated_runs:phaser to be running.
        while not os.path.exists(arrive_file):
            time.sleep(0.1)

        def join():
            touch(await_file)
            return waiter_handle.join()

        yield (workdir, waiter_handle.process.pid, join)
```
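
Each of the hanging tests drives this helper roughly as follows; this is a hedged sketch (the exact signal sent and assertions made vary per test and are assumptions here):

```python
import os
import signal

# Hypothetical test body using the helper above; illustrative only.
def test_prints_traceback_on_sigusr2(self):
    with self._make_waiter_handle() as (workdir, pid, join):
        # Signal the still-running child, then unblock it via the await file and wait for exit.
        os.kill(pid, signal.SIGUSR2)
        waiter_run = join()
        self.assert_success(waiter_run)
```

If the child never creates the arrive file, or never exits after the await file is touched, the test blocks indefinitely, which would match the hang observed in CI.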

@Eric-Arellano Eric-Arellano removed their assignment Nov 22, 2019
Eric-Arellano added a commit that referenced this issue Nov 22, 2019
### Skip some exception sink integration tests on macOS
These tests have chronically flaked by hanging since at least July (#8127). They are our most egregious Python flake.

We will still run these four tests on Linux and only skip them on macOS, as this is a macOS-specific issue.
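
A minimal sketch of what a macOS-only skip can look like (the decorator placement and reason string here are assumptions, not the exact change):

```python
import sys
import unittest

class ExceptionSinkIntegrationTest(unittest.TestCase):
    # Hypothetical skip marker; skips only when running on macOS.
    @unittest.skipIf(sys.platform == 'darwin', 'Hangs on macOS; see #8127')
    def test_dumps_logs_on_signal(self):
        ...
```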

### Skip `remote::tests::dropped_request_cancels`
This seems to flake roughly 40% of the time (#8405). It is our most egregious Rust flake.

### Tweak some other tests
We mark some other tests as flaky and bump their timeouts where relevant, to hopefully stabilize CI further.
@jsirois jsirois self-assigned this Mar 24, 2020
@jsirois jsirois removed their assignment Mar 25, 2020
stuhood pushed a commit that referenced this issue May 13, 2020
### Problem

The setup and teardown of each request made to the nailgun server in `pantsd` had become quite complicated over time... and consequently, slower than it needed to be.

### Solution

Port `pantsd`'s nailgun server to rust using the `nails` crate. Additionally, remove the `Exiter` class, which had accumulated excess responsibilities that can instead be handled by returning `ExitCode` values. Finally, fix a few broken windows including: double logging to pantsd, double help output, closed file errors on pantsd shutdown, and redundant setup codepaths.

### Result

There is less code to maintain, and runs of `./pants --enable-pantsd help` take `~1.7s`, of which `~400ms` are spent in the server. Fixes #9448, fixes #8243, fixes #8206, fixes #8127, fixes #7653, fixes #7613, fixes #7597.
@stuhood stuhood reopened this May 14, 2020
@stuhood stuhood self-assigned this May 14, 2020
@stuhood (Member Author) commented May 14, 2020

I'm monitoring this one to decide what to do.

stuhood pushed a commit that referenced this issue May 14, 2020
### Problem

A while back we started capturing core dumps "globally" in Travis. But in practice we have never consumed them, and I'm fairly certain that they are causing the OSX shards that test sending `SIGABRT` (which, if core dumps are enabled, will trigger a core dump) to `pantsd` to:
1. be racy, because while the core is dumping the process is non-responsive and can't be killed, leading to errors like:
   ```
   FAILURE: failure while terminating pantsd: failed to kill pid 28775 with signals (<Signals.SIGTERM: 15>, <Signals.SIGKILL: 9>)
   ```
2. run out of disk space: we've seen mysterious "out of disk" errors on the OSX shards... and core dumps are large.

### Solution

Disable core dumps everywhere. If we end up needing them in the future, we can enable them on a case-by-case basis.
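
For reference, one generic way to turn core dumps off for a process and its children (a sketch only; the actual change was made in the CI configuration, not necessarily via Python):

```python
import resource

# A zero soft/hard limit on core file size means no core file is ever written,
# so a SIGABRT terminates the process without a long, unkillable dump.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
```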

### Result

Fixes #8127.

[ci skip-rust-tests]
[ci skip-jvm-tests]
stuhood pushed a commit that referenced this issue May 14, 2020