
KeyError: ('ready', 'memory') #4721

Closed
mdering opened this issue Apr 20, 2021 · 8 comments

Comments

@mdering

mdering commented Apr 20, 2021

What happened:

ERROR - 2021-04-20 08:43:37,220 - distributed.worker.execute.l2800 - ('ready', 'memory')
Traceback (most recent call last):
  File "\miniconda3\lib\site-packages\distributed\worker.py", line 2765, in execute
    self.transition(ts, "memory", value=value)
  File "\miniconda3\lib\site-packages\distributed\worker.py", line 1584, in transition
    func = self._transitions[start, finish]
KeyError: ('ready', 'memory')

We see this KeyError occasionally during execution of our application. I checked through the worker code and there's already a function defined for this transition; it's just not in the `_transitions` table. Is this an oversight?

def transition_ready_memory(self, ts, value=None):
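
To make this concrete, here is a toy sketch of the pattern the traceback points at. This is not the actual distributed source; the class and handler names below are made up for illustration. The point is just that transition() looks up (start, finish) in a dict, so a handler that exists as a method but has no entry in that table surfaces as exactly this KeyError.

# Toy sketch, not the distributed source: a worker-style state machine that
# dispatches transitions through a dict keyed by (start, finish).  Leaving the
# ("ready", "memory") pair out of the dict reproduces this exact KeyError even
# though a handler method exists.
class ToyWorkerState:
    def __init__(self):
        self.state = "ready"
        self.value = None
        self._transitions = {
            ("ready", "executing"): self.transition_ready_executing,
            ("executing", "memory"): self.transition_executing_memory,
            ("ready", "memory"): self.transition_ready_memory,  # the entry that appears to be missing
        }

    def transition(self, finish, **kwargs):
        func = self._transitions[self.state, finish]  # raises KeyError if the pair is not registered
        func(**kwargs)
        self.state = finish

    def transition_ready_executing(self, **kwargs):
        pass

    def transition_executing_memory(self, value=None, **kwargs):
        self.value = value

    def transition_ready_memory(self, value=None, **kwargs):
        # The handler exists; registering it in the table is all that's needed.
        self.value = value

ts = ToyWorkerState()
ts.transition("memory", value=42)  # KeyError(('ready', 'memory')) if the entry above is removed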

What you expected to happen:
I would expect the worker to handle this transition without raising a KeyError.

Minimal Complete Verifiable Example:

I don't have one; we just see it sometimes. I'm opening this issue to find out whether this is an oversight or not.

Anything else we need to know?:

Environment:

  • Dask version: 2021.04.0
  • Python version: 3.7
  • Operating System: windows
  • Install method (conda, pip, source): conda
@fjetter
Member

fjetter commented Apr 20, 2021

Thanks for the report. We've seen issues around this lately; there is a bigger issue open at #4413.

I believe for this particular instance it indeed looks like an oversight. You don't happen to have a minimal example which reproduces this error, do you?

@gforsyth
Contributor

Looks like I missed adding it to the `_transitions` table -- thanks for flagging it, @mdering

@mdering
Author

mdering commented Apr 20, 2021

I'm sorry, I don't. We have a largeish application and we just see this in the logs sometimes; the error message doesn't even tell me which future it is, so I can't deduce anything about it from that either. I appreciate the quick responses!

gforsyth added a commit to gforsyth/distributed that referenced this issue Apr 21, 2021
This should remove the error message in dask#4721 but is more of a band-aid
than a fix.  Ready->memory transitions shouldn't happen, but since they
do occasionally crop up, we might as well dispatch them appropriately
until the worker state machine is ready.
fjetter pushed a commit that referenced this issue Apr 22, 2021
* Add "ready->memory" to transitions in worker

This should remove the error message in #4721 but is more of a band-aid
than a fix.  Ready->memory transitions shouldn't happen, but since they
do occasionally crop up, we might as well dispatch them appropriately
until the worker state machine is ready.

* Default to `no_value`

* Update distributed/worker.py

Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>

@mdering
Author

mdering commented May 6, 2021

Hi everyone, I was finally able to reproduce this. Our experience with dask had led us to notice that some of the workers would jam up inexplicably, and this would prevent our application from proceeding. These workers would be given tasks, but the tasks would never start and never complete. This led me to a workaround based on calling reschedule on the scheduler if a key takes too long to progress. That has mostly worked, though the workers eventually end up in strange states; I may raise a separate issue about that depending on the feedback I get here.
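
Roughly, the watchdog amounts to something like the sketch below. This is simplified from what we actually run; the timeout value, the polling interval, and the function names are made up for illustration. The only scheduler call it relies on is reschedule via run_on_scheduler, the same as in the example further down.

import time

STUCK_TIMEOUT = 60  # made-up threshold: seconds a key may sit unfinished before we reschedule it

def reschedule_stuck(client, futures, timeout=STUCK_TIMEOUT, poll=5):
    # Poll the futures; whenever some are still pending after `timeout` seconds,
    # ask the scheduler to reschedule those keys and restart the clock.
    started = time.time()
    while True:
        pending = [f for f in futures if not f.done()]
        if not pending:
            return
        if time.time() - started > timeout:
            def _reschedule(dask_scheduler=None, keys=None):
                for k in keys:
                    dask_scheduler.reschedule(key=k)
            client.run_on_scheduler(_reschedule, keys=[f.key for f in pending])
            started = time.time()
        time.sleep(poll)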

Anyway, on to the example:

import distributed
import time
client = distributed.Client("127.0.0.1:8786") # i have a scheduler with 4 workers here locally
def square(x):
    return x ** 2
def neg(x):
    return -x
def test_wait(x):
    for i in range(20):
        time.sleep(1)
    return x
def reschedule_keys(dask_scheduler=None, keys=None):
    for k in keys:
        try:
            dask_scheduler.reschedule(key=k)
        except Exception:
            import traceback
            traceback.print_exc()
for i in range(20):
    A = client.map(test_wait, range(10))
    B = client.map(square, A)
    C = client.map(neg, B)
    time.sleep(2)
    client.run_on_scheduler(reschedule_keys, keys=[a.key for a in A])
    processing = client.processing()
    stacks = client.call_stack()
    for w in processing.keys():
        if not set(stacks[w]).issubset(set(processing[w])):
            print(stacks[w].keys(), processing[w])

This gives me two sets of errors in the worker logs, shown below.

Traceback (most recent call last):
  File "/anaconda3/envs/dra/lib/python3.7/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/anaconda3/envs/dra/lib/python3.7/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/anaconda3/envs/dra/lib/python3.7/site-packages/distributed/worker.py", line 957, in heartbeat
    for key in self.active_threads.values()
  File "/anaconda3/envs/dra/lib/python3.7/site-packages/distributed/worker.py", line 958, in <dictcomp>
    if key in self.tasks
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
ERROR - 2021-05-06 10:15:48 - distributed.worker.execute.l2800 - ('ready', 'memory')
Traceback (most recent call last):
  File "//anaconda3/envs/dra/lib/python3.7/site-packages/distributed/worker.py", line 2765, in execute
    self.transition(ts, "memory", value=value)
  File "/anaconda3/envs/dra/lib/python3.7/site-packages/distributed/worker.py", line 1584, in transition
    func = self._transitions[start, finish]
KeyError: ('ready', 'memory')

I know this use of reschedule is probably off-label, but I wanted to follow up with a working example so you can take a better look. This script will also show whenever a worker's processing queue and call stack get out of sync, which I believe is the root cause of the behavior where workers accept new tasks but no longer process them. Please let me know if you think that other issue (workers keep accepting tasks but no longer run them) is worth opening as well.

@mdering
Author

mdering commented May 7, 2021

Hi @fjetter, I just wanted to make sure you saw this.

@fjetter
Member

fjetter commented May 7, 2021

Thanks for the example. Unfortunately, it does not reproduce the issue on my machine. That's very likely a matter of speed, since these problems are often extremely timing-sensitive race conditions. That's also why we haven't caught this regression reliably in CI so far.

@mdering
Author

mdering commented May 7, 2021

Maybe try upping the number of cycles and the size of A, and decreasing the sleep time before the reschedule?

import distributed
import time
client = distributed.Client("127.0.0.1:8786") # i have a scheduler with 4 workers here locally
def square(x):
    return x ** 2
def neg(x):
    return -x
def test_wait(x):
    for i in range(20):
        time.sleep(1)
    return x
def reschedule_keys(dask_scheduler=None, keys=None):
    for k in keys:
        try:
            dask_scheduler.reschedule(key=k)
        except Exception:
            import traceback
            traceback.print_exc()
for i in range(50):
    A = client.map(test_wait, range(100))
    B = client.map(square, A)
    C = client.map(neg, B)
    time.sleep(1)
    client.run_on_scheduler(reschedule_keys, keys=[a.key for a in A])
    processing = client.processing()
    stacks = client.call_stack()
    for w in processing.keys():
        if not set(stacks[w]).issubset(set(processing[w])):
            print(stacks[w].keys(), processing[w])

@adonig

adonig commented Jul 24, 2021

I'm not sure whether this is related, but after upgrading to 2021.07.01 I observed this error:

distributed.utils - ERROR - ('fetch', 'memory')
Traceback (most recent call last):
  File "venv/lib/python3.9/site-packages/distributed/utils.py", line 638, in log_errors
    yield
  File "venv/lib/python3.9/site-packages/distributed/worker.py", line 2411, in gather_dep
    self.transition(ts, "memory", value=data[d])
  File "venv/lib/python3.9/site-packages/distributed/worker.py", line 1692, in transition
    func = self._transitions[start, finish]
KeyError: ('fetch', 'memory')
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f26cbb88910>>, <Task finished name='Task-642' coro=<Worker.gather_dep() done, defined at venv/lib/python3.9/site-packages/distributed/worker.py:2267> exception=KeyError(('fetch', 'memory'))>)
Traceback (most recent call last):
  File "venv/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "venv/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "venv/lib/python3.9/site-packages/distributed/worker.py", line 2411, in gather_dep
    self.transition(ts, "memory", value=data[d])
  File "venv/lib/python3.9/site-packages/distributed/worker.py", line 1692, in transition
    func = self._transitions[start, finish]
KeyError: ('fetch', 'memory')
