test_AllProgress_lost_key flaky #4915

Open

jrbourbeau opened this issue Jun 14, 2021 · 3 comments
Labels
flaky test Intermittent failures on CI.

Comments

@jrbourbeau
Member

This has been showing up in several CI builds lately (for example https://github.com/dask/distributed/runs/2814505063).

Traceback:
2021-06-13T17:50:12.3994425Z =================================== FAILURES ===================================
2021-06-13T17:50:12.3995912Z __________________________ test_AllProgress_lost_key ___________________________
2021-06-13T17:50:12.3996481Z 
2021-06-13T17:50:12.3997052Z     def test_func():
2021-06-13T17:50:12.3998014Z         result = None
2021-06-13T17:50:12.3998648Z         workers = []
2021-06-13T17:50:12.4000042Z         with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop:
2021-06-13T17:50:12.4001589Z     
2021-06-13T17:50:12.4002225Z             async def coro():
2021-06-13T17:50:12.4003000Z                 with dask.config.set(config):
2021-06-13T17:50:12.4003897Z                     s = False
2021-06-13T17:50:12.4004505Z                     for _ in range(60):
2021-06-13T17:50:12.4005265Z                         try:
2021-06-13T17:50:12.4005893Z                             s, ws = await start_cluster(
2021-06-13T17:50:12.4006594Z                                 nthreads,
2021-06-13T17:50:12.4008058Z                                 scheduler,
2021-06-13T17:50:12.4008686Z                                 loop,
2021-06-13T17:50:12.4009563Z                                 security=security,
2021-06-13T17:50:12.4010208Z                                 Worker=Worker,
2021-06-13T17:50:12.4010913Z                                 scheduler_kwargs=scheduler_kwargs,
2021-06-13T17:50:12.4011916Z                                 worker_kwargs=worker_kwargs,
2021-06-13T17:50:12.4012713Z                             )
2021-06-13T17:50:12.4013471Z                         except Exception as e:
2021-06-13T17:50:12.4014154Z                             logger.error(
2021-06-13T17:50:12.4014868Z                                 "Failed to start gen_cluster: "
2021-06-13T17:50:12.4015594Z                                 f"{e.__class__.__name__}: {e}; retrying",
2021-06-13T17:50:12.4016275Z                                 exc_info=True,
2021-06-13T17:50:12.4016809Z                             )
2021-06-13T17:50:12.4017416Z                             await asyncio.sleep(1)
2021-06-13T17:50:12.4018030Z                         else:
2021-06-13T17:50:12.4018584Z                             workers[:] = ws
2021-06-13T17:50:12.4019185Z                             args = [s] + workers
2021-06-13T17:50:12.4019739Z                             break
2021-06-13T17:50:12.4020435Z                     if s is False:
2021-06-13T17:50:12.4021139Z                         raise Exception("Could not start cluster")
2021-06-13T17:50:12.4021947Z                     if client:
2021-06-13T17:50:12.4022511Z                         c = await Client(
2021-06-13T17:50:12.4023072Z                             s.address,
2021-06-13T17:50:12.4023821Z                             loop=loop,
2021-06-13T17:50:12.4024447Z                             security=security,
2021-06-13T17:50:12.4025096Z                             asynchronous=True,
2021-06-13T17:50:12.4025737Z                             **client_kwargs,
2021-06-13T17:50:12.4026274Z                         )
2021-06-13T17:50:12.4026801Z                         args = [c] + args
2021-06-13T17:50:12.4027521Z                     try:
2021-06-13T17:50:12.4028102Z                         future = func(*args)
2021-06-13T17:50:12.4028698Z                         if timeout:
2021-06-13T17:50:12.4029407Z                             future = asyncio.wait_for(future, timeout)
2021-06-13T17:50:12.4030155Z                         result = await future
2021-06-13T17:50:12.4030758Z                         if s.validate:
2021-06-13T17:50:12.4031458Z                             s.validate_state()
2021-06-13T17:50:12.4036496Z                     finally:
2021-06-13T17:50:12.4037089Z                         if client and c.status not in ("closing", "closed"):
2021-06-13T17:50:12.4038195Z                             await c._close(fast=s.status == Status.closed)
2021-06-13T17:50:12.4038927Z                         await end_cluster(s, workers)
2021-06-13T17:50:12.4039597Z                         await asyncio.wait_for(cleanup_global_workers(), 1)
2021-06-13T17:50:12.4040148Z     
2021-06-13T17:50:12.4040798Z                     try:
2021-06-13T17:50:12.4041284Z                         c = await default_client()
2021-06-13T17:50:12.4041817Z                     except ValueError:
2021-06-13T17:50:12.4042447Z                         pass
2021-06-13T17:50:12.4042823Z                     else:
2021-06-13T17:50:12.4043280Z                         await c._close(fast=True)
2021-06-13T17:50:12.4043681Z     
2021-06-13T17:50:12.4044075Z                     def get_unclosed():
2021-06-13T17:50:12.4044666Z                         return [c for c in Comm._instances if not c.closed()] + [
2021-06-13T17:50:12.4045196Z                             c
2021-06-13T17:50:12.4045898Z                             for c in _global_clients.values()
2021-06-13T17:50:12.4046473Z                             if c.status != "closed"
2021-06-13T17:50:12.4046909Z                         ]
2021-06-13T17:50:12.4047238Z     
2021-06-13T17:50:12.4047868Z                     try:
2021-06-13T17:50:12.4048300Z                         start = time()
2021-06-13T17:50:12.4048922Z                         while time() < start + 60:
2021-06-13T17:50:12.4050039Z                             gc.collect()
2021-06-13T17:50:12.4050573Z                             if not get_unclosed():
2021-06-13T17:50:12.4051026Z                                 break
2021-06-13T17:50:12.4051538Z                             await asyncio.sleep(0.05)
2021-06-13T17:50:12.4052039Z                         else:
2021-06-13T17:50:12.4052485Z                             if allow_unclosed:
2021-06-13T17:50:12.4053105Z                                 print(f"Unclosed Comms: {get_unclosed()}")
2021-06-13T17:50:12.4053636Z                             else:
2021-06-13T17:50:12.4054261Z                                 raise RuntimeError("Unclosed Comms", get_unclosed())
2021-06-13T17:50:12.4054865Z                     finally:
2021-06-13T17:50:12.4055385Z                         Comm._instances.clear()
2021-06-13T17:50:12.4055967Z                         _global_clients.clear()
2021-06-13T17:50:12.4056427Z     
2021-06-13T17:50:12.4056805Z                     return result
2021-06-13T17:50:12.4057201Z     
2021-06-13T17:50:12.4057612Z >           result = loop.run_sync(
2021-06-13T17:50:12.4058223Z                 coro, timeout=timeout * 2 if timeout else timeout
2021-06-13T17:50:12.4058750Z             )
2021-06-13T17:50:12.4058975Z 
2021-06-13T17:50:12.4059423Z distributed/utils_test.py:966: 
2021-06-13T17:50:12.4059938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2021-06-13T17:50:12.4061606Z /usr/share/miniconda3/envs/dask-distributed/lib/python3.8/contextlib.py:120: in __exit__
2021-06-13T17:50:12.4062376Z     next(self.gen)
2021-06-13T17:50:12.4062882Z distributed/utils_test.py:1557: in clean
2021-06-13T17:50:12.4063500Z     del thread_state.on_event_loop_thread
2021-06-13T17:50:12.4064566Z /usr/share/miniconda3/envs/dask-distributed/lib/python3.8/contextlib.py:120: in __exit__
2021-06-13T17:50:12.4065324Z     next(self.gen)
2021-06-13T17:50:12.4066722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2021-06-13T17:50:12.4067066Z 
2021-06-13T17:50:12.4067470Z     @contextmanager
2021-06-13T17:50:12.4067950Z     def check_instances():
2021-06-13T17:50:12.4068503Z         Client._instances.clear()
2021-06-13T17:50:12.4069092Z         Worker._instances.clear()
2021-06-13T17:50:12.4069864Z         Scheduler._instances.clear()
2021-06-13T17:50:12.4070502Z         SpecCluster._instances.clear()
2021-06-13T17:50:12.4071174Z         Worker._initialized_clients.clear()
2021-06-13T17:50:12.4072034Z         # assert all(n.status == "closed" for n in Nanny._instances), {
2021-06-13T17:50:12.4072726Z         #     n: n.status for n in Nanny._instances
2021-06-13T17:50:12.4073179Z         # }
2021-06-13T17:50:12.4073610Z         Nanny._instances.clear()
2021-06-13T17:50:12.4074164Z         _global_clients.clear()
2021-06-13T17:50:12.4074690Z         Comm._instances.clear()
2021-06-13T17:50:12.4075120Z     
2021-06-13T17:50:12.4075435Z         yield
2021-06-13T17:50:12.4075772Z     
2021-06-13T17:50:12.4076108Z         start = time()
2021-06-13T17:50:12.4076723Z         while set(_global_clients):
2021-06-13T17:50:12.4077139Z             sleep(0.1)
2021-06-13T17:50:12.4077550Z             assert time() < start + 10
2021-06-13T17:50:12.4077911Z     
2021-06-13T17:50:12.4078313Z         _global_clients.clear()
2021-06-13T17:50:12.4078706Z     
2021-06-13T17:50:12.4079109Z         for w in Worker._instances:
2021-06-13T17:50:12.4079708Z             with suppress(RuntimeError):  # closed IOLoop
2021-06-13T17:50:12.4081371Z                 w.loop.add_callback(w.close, report=False, executor_wait=False)
2021-06-13T17:50:12.4090143Z                 if w.status == Status.running:
2021-06-13T17:50:12.4090730Z                     w.loop.add_callback(w.close)
2021-06-13T17:50:12.4091313Z         Worker._instances.clear()
2021-06-13T17:50:12.4093411Z     
2021-06-13T17:50:12.4093778Z         start = time()
2021-06-13T17:50:12.4094383Z         while any(c.status != "closed" for c in Worker._initialized_clients):
2021-06-13T17:50:12.4096722Z             sleep(0.1)
2021-06-13T17:50:12.4097128Z             assert time() < start + 10
2021-06-13T17:50:12.4097688Z         Worker._initialized_clients.clear()
2021-06-13T17:50:12.4099499Z     
2021-06-13T17:50:12.4099838Z         for i in range(5):
2021-06-13T17:50:12.4100344Z             if all(c.closed() for c in Comm._instances):
2021-06-13T17:50:12.4100811Z                 break
2021-06-13T17:50:12.4103249Z             else:
2021-06-13T17:50:12.4103799Z                 sleep(0.1)
2021-06-13T17:50:12.4104180Z         else:
2021-06-13T17:50:12.4104673Z             L = [c for c in Comm._instances if not c.closed()]
2021-06-13T17:50:12.4106740Z             Comm._instances.clear()
2021-06-13T17:50:12.4107252Z             print("Unclosed Comms", L)
2021-06-13T17:50:12.4107984Z             # raise ValueError("Unclosed Comms", L)
2021-06-13T17:50:12.4109982Z     
2021-06-13T17:50:12.4110349Z >       assert all(
2021-06-13T17:50:12.4111055Z             n.status == Status.closed or n.status == Status.init for n in Nanny._instances
2021-06-13T17:50:12.4113249Z         ), {n: n.status for n in Nanny._instances}
2021-06-13T17:50:12.4116294Z E       AssertionError: {<Nanny: None, threads: 1>: <Status.closed: 'closed'>, <Nanny: None, threads: 1>: <Status.running: 'running'>, <Nanny: None, threads: 2>: <Status.closed: 'closed'>}
2021-06-13T17:50:12.4118819Z E       assert False
2021-06-13T17:50:12.4119740Z E        +  where False = all(<generator object check_instances.<locals>.<genexpr> at 0x7f9f62c9a4a0>)
2021-06-13T17:50:12.4120826Z 
2021-06-13T17:50:12.4122764Z distributed/utils_test.py:1525: AssertionError
jrbourbeau added the flaky test (Intermittent failures on CI.) label on Jun 14, 2021
@gjoseph92
Collaborator

Got this too: https://github.com/dask/distributed/pull/4937/checks?check_run_id=2888140312#step:10:2213

Traceback + output
=================================== FAILURES ===================================
__________________________ test_AllProgress_lost_key ___________________________

    def test_func():
        result = None
        workers = []
        with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop:
    
            async def coro():
                with dask.config.set(config):
                    s = False
                    for _ in range(60):
                        try:
                            s, ws = await start_cluster(
                                nthreads,
                                scheduler,
                                loop,
                                security=security,
                                Worker=Worker,
                                scheduler_kwargs=scheduler_kwargs,
                                worker_kwargs=worker_kwargs,
                            )
                        except Exception as e:
                            logger.error(
                                "Failed to start gen_cluster: "
                                f"{e.__class__.__name__}: {e}; retrying",
                                exc_info=True,
                            )
                            await asyncio.sleep(1)
                        else:
                            workers[:] = ws
                            args = [s] + workers
                            break
                    if s is False:
                        raise Exception("Could not start cluster")
                    if client:
                        c = await Client(
                            s.address,
                            loop=loop,
                            security=security,
                            asynchronous=True,
                            **client_kwargs,
                        )
                        args = [c] + args
                    try:
                        future = func(*args)
                        if timeout:
                            future = asyncio.wait_for(future, timeout)
                        result = await future
                        if s.validate:
                            s.validate_state()
                    finally:
                        if client and c.status not in ("closing", "closed"):
                            await c._close(fast=s.status == Status.closed)
                        await end_cluster(s, workers)
                        await asyncio.wait_for(cleanup_global_workers(), 1)
    
                    try:
                        c = await default_client()
                    except ValueError:
                        pass
                    else:
                        await c._close(fast=True)
    
                    def get_unclosed():
                        return [c for c in Comm._instances if not c.closed()] + [
                            c
                            for c in _global_clients.values()
                            if c.status != "closed"
                        ]
    
                    try:
                        start = time()
                        while time() < start + 60:
                            gc.collect()
                            if not get_unclosed():
                                break
                            await asyncio.sleep(0.05)
                        else:
                            if allow_unclosed:
                                print(f"Unclosed Comms: {get_unclosed()}")
                            else:
                                raise RuntimeError("Unclosed Comms", get_unclosed())
                    finally:
                        Comm._instances.clear()
                        _global_clients.clear()
    
                    return result
    
>           result = loop.run_sync(
                coro, timeout=timeout * 2 if timeout else timeout
            )

distributed/utils_test.py:966: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tornado.platform.asyncio.AsyncIOLoop object at 0x7ffbac992580>
func = <function gen_cluster.<locals>._.<locals>.test_func.<locals>.coro at 0x7ffbc0091ee0>
timeout = 60

    def run_sync(self, func: Callable, timeout: Optional[float] = None) -> Any:
        """Starts the `IOLoop`, runs the given function, and stops the loop.
    
        The function must return either an awaitable object or
        ``None``. If the function returns an awaitable object, the
        `IOLoop` will run until the awaitable is resolved (and
        `run_sync()` will return the awaitable's result). If it raises
        an exception, the `IOLoop` will stop and the exception will be
        re-raised to the caller.
    
        The keyword-only argument ``timeout`` may be used to set
        a maximum duration for the function.  If the timeout expires,
        a `tornado.util.TimeoutError` is raised.
    
        This method is useful to allow asynchronous calls in a
        ``main()`` function::
    
            async def main():
                # do stuff...
    
            if __name__ == '__main__':
                IOLoop.current().run_sync(main)
    
        .. versionchanged:: 4.3
           Returning a non-``None``, non-awaitable value is now an error.
    
        .. versionchanged:: 5.0
           If a timeout occurs, the ``func`` coroutine will be cancelled.
    
        """
        future_cell = [None]  # type: List[Optional[Future]]
    
        def run() -> None:
            try:
                result = func()
                if result is not None:
                    from tornado.gen import convert_yielded
    
                    result = convert_yielded(result)
            except Exception:
                fut = Future()  # type: Future[Any]
                future_cell[0] = fut
                future_set_exc_info(fut, sys.exc_info())
            else:
                if is_future(result):
                    future_cell[0] = result
                else:
                    fut = Future()
                    future_cell[0] = fut
                    fut.set_result(result)
            assert future_cell[0] is not None
            self.add_future(future_cell[0], lambda future: self.stop())
    
        self.add_callback(run)
        if timeout is not None:
    
            def timeout_callback() -> None:
                # If we can cancel the future, do so and wait on it. If not,
                # Just stop the loop and return with the task still pending.
                # (If we neither cancel nor wait for the task, a warning
                # will be logged).
                assert future_cell[0] is not None
                if not future_cell[0].cancel():
                    self.stop()
    
            timeout_handle = self.add_timeout(self.time() + timeout, timeout_callback)
        self.start()
        if timeout is not None:
            self.remove_timeout(timeout_handle)
        assert future_cell[0] is not None
        if future_cell[0].cancelled() or not future_cell[0].done():
>           raise TimeoutError("Operation timed out after %s seconds" % timeout)
E           tornado.util.TimeoutError: Operation timed out after 60 seconds

/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/site-packages/tornado/ioloop.py:529: TimeoutError
----------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:42907
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:42907
distributed.worker - INFO -          dashboard at:            127.0.0.1:36249
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:33233
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                   6.79 GiB
distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-qzucux6g
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:33233
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:35647
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:35647
distributed.worker - INFO -          dashboard at:            127.0.0.1:45463
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:33233
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                   6.79 GiB
distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-24jmyssw
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:33233
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:35647
distributed.utils_test - ERROR - Failed to start gen_cluster: TimeoutError: Nanny failed to start in 10 seconds; retrying
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 272, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/asyncio/tasks.py", line 501, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/utils_test.py", line 890, in coro
    s, ws = await start_cluster(
  File "/home/runner/work/distributed/distributed/distributed/utils_test.py", line 814, in start_cluster
    await asyncio.gather(*workers)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 276, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 10 seconds
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:42805
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:42805
distributed.worker - INFO -          dashboard at:            127.0.0.1:41279
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:35561
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                   6.79 GiB
distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-au2uvenv
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:35561
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:37355
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:37355
distributed.worker - INFO -          dashboard at:            127.0.0.1:33053
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:35561
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                   6.79 GiB
distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-ficp2l0q
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:35561
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:42805
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:37355
distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 37% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 37% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 38% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 39% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 39% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 40% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 41% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 41% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 42% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 42% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 43% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 53% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 59% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 83% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 84% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 85% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 86% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 87% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 88% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 89% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 90% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 91% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 92% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 93% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 94% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 94% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 94% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 95% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%)
distributed.worker - INFO - Connection to scheduler broken.  Reconnecting...

The

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory

seemingly leading to a Nanny failure and timeout (and, subsequently, 99% of CPU time spent in GC?!) is interesting.
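
For context on what that child-process traceback means, here is a minimal, generic sketch of the spawn hand-off involved (an illustration, not distributed code): under the "spawn" start method the child unpickles its arguments, and a multiprocessing lock is rebuilt from a named OS semaphore in SemLock.__setstate__ (the _rebuild(*state) call in the traceback). If that semaphore is already gone by the time the child gets to unpickling, for instance because the parent side has already been torn down, the rebuild fails with a FileNotFoundError like the one above before the target function ever runs.

```python
# Minimal illustration of the spawn hand-off (for context only, not
# distributed code). Passing a Lock to a spawned child pickles it via
# SemLock.__getstate__, and the child rebuilds it in SemLock.__setstate__,
# which is the _rebuild(*state) frame shown in the captured stderr. A missing
# named semaphore at that point surfaces as FileNotFoundError in the child.
import multiprocessing as mp


def child(lock):
    with lock:
        print("child rebuilt and acquired the lock")


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    lock = ctx.Lock()  # backed by a named OS semaphore
    p = ctx.Process(target=child, args=(lock,))
    p.start()          # lock is pickled here and rebuilt in the child
    p.join()
```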

@fjetter
Member

fjetter commented Jun 23, 2021

I believe I've seen the GC message in some other "flaky" tests as well.
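
If it helps when digging into those, here is a quick, generic way to check whether long full collections line up with the stalls. It uses the plain stdlib gc.callbacks hook; to be clear, this is not how distributed.utils_perf computes its "% CPU time" figure.

```python
# Generic gen-2 collection timer using the stdlib gc.callbacks hook; a
# debugging aid only, not distributed.utils_perf's own measurement.
import gc
import time

_t0 = None


def _gc_timer(phase, info):
    global _t0
    if info["generation"] != 2:  # only time full collections
        return
    if phase == "start":
        _t0 = time.perf_counter()
    elif phase == "stop" and _t0 is not None:
        dur = time.perf_counter() - _t0
        print(f"full gc took {dur:.3f}s, collected={info['collected']}")


gc.callbacks.append(_gc_timer)
```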

@fjetter
Member

fjetter commented Jun 23, 2021

What's also interesting is that we only see one "Failed to start gen_cluster:" error log, although this section should be retried up to 60 times. Considering that the nanny times out after about 10s, I would've expected to see at least two cluster start failures (the gen_cluster timeout is 30s), so the GC might actually be blocking the event loop here? See the sketch below.

Both linked failures also happen only on py3.8, but that may be a coincidence.
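
Here is the sketch of that blocking hypothesis (an illustration, simulating the long GC pause with time.sleep since that is easier to reproduce than a real collection): a synchronous pause on the event-loop thread freezes the asyncio.sleep(1) back-off in the retry loop, so no further attempts run, and no further failures get logged, until the pause ends.

```python
# Minimal sketch of the hypothesis above (illustration, not distributed code):
# a long synchronous pause on the event-loop thread, here time.sleep standing
# in for a full gc.collect() pass, freezes the retry coroutine, so only the
# attempts made before the pause ever get logged.
import asyncio
import time


async def retry_loop(t0):
    for attempt in range(3):
        print(f"start-cluster attempt {attempt} at t={time.perf_counter() - t0:.1f}s")
        await asyncio.sleep(1)  # back-off between retries, as in gen_cluster


async def main():
    t0 = time.perf_counter()
    task = asyncio.create_task(retry_loop(t0))
    await asyncio.sleep(0.1)  # let the first attempt run
    time.sleep(5)             # event loop is blocked; no retries can run
    await task                # attempts 1 and 2 only happen after the pause


asyncio.run(main())
```

With collections eating most of the CPU, as the utils_perf warnings above suggest, the same effect could easily stretch a single retry past the 60s run_sync timeout.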

fjetter mentioned this issue on Feb 16, 2022