test_AllProgress_lost_key flaky #4915

Open

jrbourbeau opened this issue Jun 14, 2021 · 3 comments
Labels
flaky test Intermittent failures on CI.

Comments

@jrbourbeau
Member

This has been showing up in several CI builds lately (for example https://github.com/dask/distributed/runs/2814505063).

Traceback:
2021-06-13T17:50:12.3994425Z =================================== FAILURES ===================================
2021-06-13T17:50:12.3995912Z __________________________ test_AllProgress_lost_key ___________________________
2021-06-13T17:50:12.3996481Z 
2021-06-13T17:50:12.3997052Z     def test_func():
2021-06-13T17:50:12.3998014Z         result = None
2021-06-13T17:50:12.3998648Z         workers = []
2021-06-13T17:50:12.4000042Z         with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop:
2021-06-13T17:50:12.4001589Z     
2021-06-13T17:50:12.4002225Z             async def coro():
2021-06-13T17:50:12.4003000Z                 with dask.config.set(config):
2021-06-13T17:50:12.4003897Z                     s = False
2021-06-13T17:50:12.4004505Z                     for _ in range(60):
2021-06-13T17:50:12.4005265Z                         try:
2021-06-13T17:50:12.4005893Z                             s, ws = await start_cluster(
2021-06-13T17:50:12.4006594Z                                 nthreads,
2021-06-13T17:50:12.4008058Z                                 scheduler,
2021-06-13T17:50:12.4008686Z                                 loop,
2021-06-13T17:50:12.4009563Z                                 security=security,
2021-06-13T17:50:12.4010208Z                                 Worker=Worker,
2021-06-13T17:50:12.4010913Z                                 scheduler_kwargs=scheduler_kwargs,
2021-06-13T17:50:12.4011916Z                                 worker_kwargs=worker_kwargs,
2021-06-13T17:50:12.4012713Z                             )
2021-06-13T17:50:12.4013471Z                         except Exception as e:
2021-06-13T17:50:12.4014154Z                             logger.error(
2021-06-13T17:50:12.4014868Z                                 "Failed to start gen_cluster: "
2021-06-13T17:50:12.4015594Z                                 f"{e.__class__.__name__}: {e}; retrying",
2021-06-13T17:50:12.4016275Z                                 exc_info=True,
2021-06-13T17:50:12.4016809Z                             )
2021-06-13T17:50:12.4017416Z                             await asyncio.sleep(1)
2021-06-13T17:50:12.4018030Z                         else:
2021-06-13T17:50:12.4018584Z                             workers[:] = ws
2021-06-13T17:50:12.4019185Z                             args = [s] + workers
2021-06-13T17:50:12.4019739Z                             break
2021-06-13T17:50:12.4020435Z                     if s is False:
2021-06-13T17:50:12.4021139Z                         raise Exception("Could not start cluster")
2021-06-13T17:50:12.4021947Z                     if client:
2021-06-13T17:50:12.4022511Z                         c = await Client(
2021-06-13T17:50:12.4023072Z                             s.address,
2021-06-13T17:50:12.4023821Z                             loop=loop,
2021-06-13T17:50:12.4024447Z                             security=security,
2021-06-13T17:50:12.4025096Z                             asynchronous=True,
2021-06-13T17:50:12.4025737Z                             **client_kwargs,
2021-06-13T17:50:12.4026274Z                         )
2021-06-13T17:50:12.4026801Z                         args = [c] + args
2021-06-13T17:50:12.4027521Z                     try:
2021-06-13T17:50:12.4028102Z                         future = func(*args)
2021-06-13T17:50:12.4028698Z                         if timeout:
2021-06-13T17:50:12.4029407Z                             future = asyncio.wait_for(future, timeout)
2021-06-13T17:50:12.4030155Z                         result = await future
2021-06-13T17:50:12.4030758Z                         if s.validate:
2021-06-13T17:50:12.4031458Z                             s.validate_state()
2021-06-13T17:50:12.4036496Z                     finally:
2021-06-13T17:50:12.4037089Z                         if client and c.status not in ("closing", "closed"):
2021-06-13T17:50:12.4038195Z                             await c._close(fast=s.status == Status.closed)
2021-06-13T17:50:12.4038927Z                         await end_cluster(s, workers)
2021-06-13T17:50:12.4039597Z                         await asyncio.wait_for(cleanup_global_workers(), 1)
2021-06-13T17:50:12.4040148Z     
2021-06-13T17:50:12.4040798Z                     try:
2021-06-13T17:50:12.4041284Z                         c = await default_client()
2021-06-13T17:50:12.4041817Z                     except ValueError:
2021-06-13T17:50:12.4042447Z                         pass
2021-06-13T17:50:12.4042823Z                     else:
2021-06-13T17:50:12.4043280Z                         await c._close(fast=True)
2021-06-13T17:50:12.4043681Z     
2021-06-13T17:50:12.4044075Z                     def get_unclosed():
2021-06-13T17:50:12.4044666Z                         return [c for c in Comm._instances if not c.closed()] + [
2021-06-13T17:50:12.4045196Z                             c
2021-06-13T17:50:12.4045898Z                             for c in _global_clients.values()
2021-06-13T17:50:12.4046473Z                             if c.status != "closed"
2021-06-13T17:50:12.4046909Z                         ]
2021-06-13T17:50:12.4047238Z     
2021-06-13T17:50:12.4047868Z                     try:
2021-06-13T17:50:12.4048300Z                         start = time()
2021-06-13T17:50:12.4048922Z                         while time() < start + 60:
2021-06-13T17:50:12.4050039Z                             gc.collect()
2021-06-13T17:50:12.4050573Z                             if not get_unclosed():
2021-06-13T17:50:12.4051026Z                                 break
2021-06-13T17:50:12.4051538Z                             await asyncio.sleep(0.05)
2021-06-13T17:50:12.4052039Z                         else:
2021-06-13T17:50:12.4052485Z                             if allow_unclosed:
2021-06-13T17:50:12.4053105Z                                 print(f"Unclosed Comms: {get_unclosed()}")
2021-06-13T17:50:12.4053636Z                             else:
2021-06-13T17:50:12.4054261Z                                 raise RuntimeError("Unclosed Comms", get_unclosed())
2021-06-13T17:50:12.4054865Z                     finally:
2021-06-13T17:50:12.4055385Z                         Comm._instances.clear()
2021-06-13T17:50:12.4055967Z                         _global_clients.clear()
2021-06-13T17:50:12.4056427Z     
2021-06-13T17:50:12.4056805Z                     return result
2021-06-13T17:50:12.4057201Z     
2021-06-13T17:50:12.4057612Z >           result = loop.run_sync(
2021-06-13T17:50:12.4058223Z                 coro, timeout=timeout * 2 if timeout else timeout
2021-06-13T17:50:12.4058750Z             )
2021-06-13T17:50:12.4058975Z 
2021-06-13T17:50:12.4059423Z distributed/utils_test.py:966: 
2021-06-13T17:50:12.4059938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2021-06-13T17:50:12.4061606Z /usr/share/miniconda3/envs/dask-distributed/lib/python3.8/contextlib.py:120: in __exit__
2021-06-13T17:50:12.4062376Z     next(self.gen)
2021-06-13T17:50:12.4062882Z distributed/utils_test.py:1557: in clean
2021-06-13T17:50:12.4063500Z     del thread_state.on_event_loop_thread
2021-06-13T17:50:12.4064566Z /usr/share/miniconda3/envs/dask-distributed/lib/python3.8/contextlib.py:120: in __exit__
2021-06-13T17:50:12.4065324Z     next(self.gen)
2021-06-13T17:50:12.4066722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2021-06-13T17:50:12.4067066Z 
2021-06-13T17:50:12.4067470Z     @contextmanager
2021-06-13T17:50:12.4067950Z     def check_instances():
2021-06-13T17:50:12.4068503Z         Client._instances.clear()
2021-06-13T17:50:12.4069092Z         Worker._instances.clear()
2021-06-13T17:50:12.4069864Z         Scheduler._instances.clear()
2021-06-13T17:50:12.4070502Z         SpecCluster._instances.clear()
2021-06-13T17:50:12.4071174Z         Worker._initialized_clients.clear()
2021-06-13T17:50:12.4072034Z         # assert all(n.status == "closed" for n in Nanny._instances), {
2021-06-13T17:50:12.4072726Z         #     n: n.status for n in Nanny._instances
2021-06-13T17:50:12.4073179Z         # }
2021-06-13T17:50:12.4073610Z         Nanny._instances.clear()
2021-06-13T17:50:12.4074164Z         _global_clients.clear()
2021-06-13T17:50:12.4074690Z         Comm._instances.clear()
2021-06-13T17:50:12.4075120Z     
2021-06-13T17:50:12.4075435Z         yield
2021-06-13T17:50:12.4075772Z     
2021-06-13T17:50:12.4076108Z         start = time()
2021-06-13T17:50:12.4076723Z         while set(_global_clients):
2021-06-13T17:50:12.4077139Z             sleep(0.1)
2021-06-13T17:50:12.4077550Z             assert time() < start + 10
2021-06-13T17:50:12.4077911Z     
2021-06-13T17:50:12.4078313Z         _global_clients.clear()
2021-06-13T17:50:12.4078706Z     
2021-06-13T17:50:12.4079109Z         for w in Worker._instances:
2021-06-13T17:50:12.4079708Z             with suppress(RuntimeError):  # closed IOLoop
2021-06-13T17:50:12.4081371Z                 w.loop.add_callback(w.close, report=False, executor_wait=False)
2021-06-13T17:50:12.4090143Z                 if w.status == Status.running:
2021-06-13T17:50:12.4090730Z                     w.loop.add_callback(w.close)
2021-06-13T17:50:12.4091313Z         Worker._instances.clear()
2021-06-13T17:50:12.4093411Z     
2021-06-13T17:50:12.4093778Z         start = time()
2021-06-13T17:50:12.4094383Z         while any(c.status != "closed" for c in Worker._initialized_clients):
2021-06-13T17:50:12.4096722Z             sleep(0.1)
2021-06-13T17:50:12.4097128Z             assert time() < start + 10
2021-06-13T17:50:12.4097688Z         Worker._initialized_clients.clear()
2021-06-13T17:50:12.4099499Z     
2021-06-13T17:50:12.4099838Z         for i in range(5):
2021-06-13T17:50:12.4100344Z             if all(c.closed() for c in Comm._instances):
2021-06-13T17:50:12.4100811Z                 break
2021-06-13T17:50:12.4103249Z             else:
2021-06-13T17:50:12.4103799Z                 sleep(0.1)
2021-06-13T17:50:12.4104180Z         else:
2021-06-13T17:50:12.4104673Z             L = [c for c in Comm._instances if not c.closed()]
2021-06-13T17:50:12.4106740Z             Comm._instances.clear()
2021-06-13T17:50:12.4107252Z             print("Unclosed Comms", L)
2021-06-13T17:50:12.4107984Z             # raise ValueError("Unclosed Comms", L)
2021-06-13T17:50:12.4109982Z     
2021-06-13T17:50:12.4110349Z >       assert all(
2021-06-13T17:50:12.4111055Z             n.status == Status.closed or n.status == Status.init for n in Nanny._instances
2021-06-13T17:50:12.4113249Z         ), {n: n.status for n in Nanny._instances}
2021-06-13T17:50:12.4116294Z E       AssertionError: {<Nanny: None, threads: 1>: <Status.closed: 'closed'>, <Nanny: None, threads: 1>: <Status.running: 'running'>, <Nanny: None, threads: 2>: <Status.closed: 'closed'>}
2021-06-13T17:50:12.4118819Z E       assert False
2021-06-13T17:50:12.4119740Z E        +  where False = all(<generator object check_instances.<locals>.<genexpr> at 0x7f9f62c9a4a0>)
2021-06-13T17:50:12.4120826Z 
2021-06-13T17:50:12.4122764Z distributed/utils_test.py:1525: AssertionError
jrbourbeau added the flaky test (Intermittent failures on CI.) label on Jun 14, 2021
@gjoseph92
Collaborator

Got this too: https://github.com/dask/distributed/pull/4937/checks?check_run_id=2888140312#step:10:2213

Traceback + output
=================================== FAILURES ===================================
__________________________ test_AllProgress_lost_key ___________________________

    def test_func():
        result = None
        workers = []
        with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop:
    
            async def coro():
                with dask.config.set(config):
                    s = False
                    for _ in range(60):
                        try:
                            s, ws = await start_cluster(
                                nthreads,
                                scheduler,
                                loop,
                                security=security,
                                Worker=Worker,
                                scheduler_kwargs=scheduler_kwargs,
                                worker_kwargs=worker_kwargs,
                            )
                        except Exception as e:
                            logger.error(
                                "Failed to start gen_cluster: "
                                f"{e.__class__.__name__}: {e}; retrying",
                                exc_info=True,
                            )
                            await asyncio.sleep(1)
                        else:
                            workers[:] = ws
                            args = [s] + workers
                            break
                    if s is False:
                        raise Exception("Could not start cluster")
                    if client:
                        c = await Client(
                            s.address,
                            loop=loop,
                            security=security,
                            asynchronous=True,
                            **client_kwargs,
                        )
                        args = [c] + args
                    try:
                        future = func(*args)
                        if timeout:
                            future = asyncio.wait_for(future, timeout)
                        result = await future
                        if s.validate:
                            s.validate_state()
                    finally:
                        if client and c.status not in ("closing", "closed"):
                            await c._close(fast=s.status == Status.closed)
                        await end_cluster(s, workers)
                        await asyncio.wait_for(cleanup_global_workers(), 1)
    
                    try:
                        c = await default_client()
                    except ValueError:
                        pass
                    else:
                        await c._close(fast=True)
    
                    def get_unclosed():
                        return [c for c in Comm._instances if not c.closed()] + [
                            c
                            for c in _global_clients.values()
                            if c.status != "closed"
                        ]
    
                    try:
                        start = time()
                        while time() < start + 60:
                            gc.collect()
                            if not get_unclosed():
                                break
                            await asyncio.sleep(0.05)
                        else:
                            if allow_unclosed:
                                print(f"Unclosed Comms: {get_unclosed()}")
                            else:
                                raise RuntimeError("Unclosed Comms", get_unclosed())
                    finally:
                        Comm._instances.clear()
                        _global_clients.clear()
    
                    return result
    
>           result = loop.run_sync(
                coro, timeout=timeout * 2 if timeout else timeout
            )

distributed/utils_test.py:966: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tornado.platform.asyncio.AsyncIOLoop object at 0x7ffbac992580>
func = <function gen_cluster.<locals>._.<locals>.test_func.<locals>.coro at 0x7ffbc0091ee0>
timeout = 60

    def run_sync(self, func: Callable, timeout: Optional[float] = None) -> Any:
        """Starts the `IOLoop`, runs the given function, and stops the loop.
    
        The function must return either an awaitable object or
        ``None``. If the function returns an awaitable object, the
        `IOLoop` will run until the awaitable is resolved (and
        `run_sync()` will return the awaitable's result). If it raises
        an exception, the `IOLoop` will stop and the exception will be
        re-raised to the caller.
    
        The keyword-only argument ``timeout`` may be used to set
        a maximum duration for the function.  If the timeout expires,
        a `tornado.util.TimeoutError` is raised.
    
        This method is useful to allow asynchronous calls in a
        ``main()`` function::
    
            async def main():
                # do stuff...
    
            if __name__ == '__main__':
                IOLoop.current().run_sync(main)
    
        .. versionchanged:: 4.3
           Returning a non-``None``, non-awaitable value is now an error.
    
        .. versionchanged:: 5.0
           If a timeout occurs, the ``func`` coroutine will be cancelled.
    
        """
        future_cell = [None]  # type: List[Optional[Future]]
    
        def run() -> None:
            try:
                result = func()
                if result is not None:
                    from tornado.gen import convert_yielded
    
                    result = convert_yielded(result)
            except Exception:
                fut = Future()  # type: Future[Any]
                future_cell[0] = fut
                future_set_exc_info(fut, sys.exc_info())
            else:
                if is_future(result):
                    future_cell[0] = result
                else:
                    fut = Future()
                    future_cell[0] = fut
                    fut.set_result(result)
            assert future_cell[0] is not None
            self.add_future(future_cell[0], lambda future: self.stop())
    
        self.add_callback(run)
        if timeout is not None:
    
            def timeout_callback() -> None:
                # If we can cancel the future, do so and wait on it. If not,
                # Just stop the loop and return with the task still pending.
                # (If we neither cancel nor wait for the task, a warning
                # will be logged).
                assert future_cell[0] is not None
                if not future_cell[0].cancel():
                    self.stop()
    
            timeout_handle = self.add_timeout(self.time() + timeout, timeout_callback)
        self.start()
        if timeout is not None:
            self.remove_timeout(timeout_handle)
        assert future_cell[0] is not None
        if future_cell[0].cancelled() or not future_cell[0].done():
>           raise TimeoutError("Operation timed out after %s seconds" % timeout)
E           tornado.util.TimeoutError: Operation timed out after 60 seconds

/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/site-packages/tornado/ioloop.py:529: TimeoutError
----------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:42907
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:42907
distributed.worker - INFO -          dashboard at:            127.0.0.1:36249
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:33233
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                   6.79 GiB
distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-qzucux6g
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:33233
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:35647
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:35647
distributed.worker - INFO -          dashboard at:            127.0.0.1:45463
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:33233
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                   6.79 GiB
distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-24jmyssw
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:33233
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:35647
distributed.utils_test - ERROR - Failed to start gen_cluster: TimeoutError: Nanny failed to start in 10 seconds; retrying
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 272, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/asyncio/tasks.py", line 501, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/utils_test.py", line 890, in coro
    s, ws = await start_cluster(
  File "/home/runner/work/distributed/distributed/distributed/utils_test.py", line 814, in start_cluster
    await asyncio.gather(*workers)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 276, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 10 seconds
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:42805
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:42805
distributed.worker - INFO -          dashboard at:            127.0.0.1:41279
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:35561
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                   6.79 GiB
distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-au2uvenv
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:35561
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:37355
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:37355
distributed.worker - INFO -          dashboard at:            127.0.0.1:33053
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:35561
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                   6.79 GiB
distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-ficp2l0q
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:35561
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:42805
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:37355
distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 37% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 37% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 38% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 39% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 39% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 40% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 41% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 41% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 42% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 42% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 43% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 53% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 59% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 83% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 84% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 85% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 86% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 87% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 88% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 89% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 90% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 91% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 92% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 93% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 94% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 94% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 94% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 95% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.worker - INFO - Connection to scheduler broken.  Reconnecting...

The

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory

seemingly leading to a Nanny failure and timeout (and, subsequently, 99% of CPU time spent in GC?!) is interesting.
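
For context on what that child-process traceback means, here is a minimal, generic sketch of the spawn hand-off involved (an illustration, not distributed code): under the "spawn" start method the child unpickles its arguments, and a multiprocessing lock is rebuilt from a named OS semaphore in SemLock.__setstate__ (the _rebuild(*state) call in the traceback). If that semaphore is already gone by the time the child gets to unpickling, for instance because the parent side has already been torn down, the rebuild fails with a FileNotFoundError like the one above before the target function ever runs.

```python
# Minimal illustration of the spawn hand-off (for context only, not
# distributed code). Passing a Lock to a spawned child pickles it via
# SemLock.__getstate__, and the child rebuilds it in SemLock.__setstate__,
# which is the _rebuild(*state) frame shown in the captured stderr. A missing
# named semaphore at that point surfaces as FileNotFoundError in the child.
import multiprocessing as mp


def child(lock):
    with lock:
        print("child rebuilt and acquired the lock")


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    lock = ctx.Lock()  # backed by a named OS semaphore
    p = ctx.Process(target=child, args=(lock,))
    p.start()          # lock is pickled here and rebuilt in the child
    p.join()
```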

@fjetter
Member

fjetter commented Jun 23, 2021

I believe I've seen the GC message in some other "flaky" tests as well.
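
If it helps when digging into those, here is a quick, generic way to check whether long full collections line up with the stalls. It uses the plain stdlib gc.callbacks hook; to be clear, this is not how distributed.utils_perf computes its "% CPU time" figure.

```python
# Generic gen-2 collection timer using the stdlib gc.callbacks hook; a
# debugging aid only, not distributed.utils_perf's own measurement.
import gc
import time

_t0 = None


def _gc_timer(phase, info):
    global _t0
    if info["generation"] != 2:  # only time full collections
        return
    if phase == "start":
        _t0 = time.perf_counter()
    elif phase == "stop" and _t0 is not None:
        dur = time.perf_counter() - _t0
        print(f"full gc took {dur:.3f}s, collected={info['collected']}")


gc.callbacks.append(_gc_timer)
```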

@fjetter
Member

fjetter commented Jun 23, 2021

What's also interesting is that we only see one "Failed to start gen_cluster:" error log, although this section should be retried up to 60 times. Considering that the nanny times out after about 10s, I would've expected to see at least two cluster start failures (the gen_cluster timeout is 30s), so the GC might actually be blocking the event loop here? See the sketch below.

Both linked failures also happen only on py3.8, but that may be a coincidence.
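
Here is the sketch of that blocking hypothesis (an illustration, simulating the long GC pause with time.sleep since that is easier to reproduce than a real collection): a synchronous pause on the event-loop thread freezes the asyncio.sleep(1) back-off in the retry loop, so no further attempts run, and no further failures get logged, until the pause ends.

```python
# Minimal sketch of the hypothesis above (illustration, not distributed code):
# a long synchronous pause on the event-loop thread, here time.sleep standing
# in for a full gc.collect() pass, freezes the retry coroutine, so only the
# attempts made before the pause ever get logged.
import asyncio
import time


async def retry_loop(t0):
    for attempt in range(3):
        print(f"start-cluster attempt {attempt} at t={time.perf_counter() - t0:.1f}s")
        await asyncio.sleep(1)  # back-off between retries, as in gen_cluster


async def main():
    t0 = time.perf_counter()
    task = asyncio.create_task(retry_loop(t0))
    await asyncio.sleep(0.1)  # let the first attempt run
    time.sleep(5)             # event loop is blocked; no retries can run
    await task                # attempts 1 and 2 only happen after the pause


asyncio.run(main())
```

With collections eating most of the CPU, as the utils_perf warnings above suggest, the same effect could easily stretch a single retry past the 60s run_sync timeout.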

fjetter mentioned this issue on Feb 16, 2022