TaskGroup.nbytes_in_memory miscounted for replicated keys #4927

Closed
gjoseph92 opened this issue Jun 18, 2021 · 3 comments

Comments

@gjoseph92
Collaborator

I think there is a logic error in the bookkeeping for TaskGroup.nbytes_in_memory. There's a discrepancy between how we increment it and how we decrement it when multiple workers hold the same key.

In transition_memory_released, we decrement it by nbytes once for every worker that holds that task:

for ws in ts._who_has:
    del ws._has_what[ts]
    ws._nbytes -= ts_nbytes
    ts._group._nbytes_in_memory -= ts_nbytes

Whereas in _propagate_forgotten, we decrement it by nbytes once if any workers hold the task, regardless of how many. This doesn't match transition_memory_released:

ts_nbytes: Py_ssize_t = ts.get_nbytes()
if ts._who_has:
    ts._group._nbytes_in_memory -= ts_nbytes
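
To make the discrepancy concrete, here is a minimal standalone sketch (toy objects standing in for the scheduler state, not distributed's real classes). With two workers holding a 100-byte key that was counted once, the released path over-decrements while the forgotten path balances out:

# Toy illustration only; not distributed's TaskState/TaskGroup.
class ToyGroup:
    nbytes_in_memory = 0

ts_nbytes = 100
who_has = ["worker-a", "worker-b"]  # two workers hold a replica of the key

# Released path: decrement once per worker, as in transition_memory_released.
group = ToyGroup()
group.nbytes_in_memory = ts_nbytes  # the key was counted once when it landed in memory
for ws in who_has:
    group.nbytes_in_memory -= ts_nbytes
print(group.nbytes_in_memory)  # -100: one extra decrement per additional copy

# Forgotten path: decrement once if any worker holds it, as in _propagate_forgotten.
group = ToyGroup()
group.nbytes_in_memory = ts_nbytes
if who_has:
    group.nbytes_in_memory -= ts_nbytes
print(group.nbytes_in_memory)  # 0: matches the single increment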

On the creation side, in TaskState.set_nbytes, we only increment it by the diff between the last known value and the current value. If the key is being copied to multiple workers, this difference is usually 0:

def set_nbytes(self, nbytes: Py_ssize_t):
    diff: Py_ssize_t = nbytes
    old_nbytes: Py_ssize_t = self._nbytes
    if old_nbytes >= 0:
        diff -= old_nbytes
    self._group._nbytes_total += diff
    self._group._nbytes_in_memory += diff
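
As a rough, self-contained sketch of that diff logic (a toy class, assuming each worker reports the same size for the key), the second report contributes 0 to the counter:

# Toy stand-in for the diff-based increment in TaskState.set_nbytes.
class ToyTask:
    def __init__(self):
        self._nbytes = -1           # sentinel for "size not yet known"
        self.group_nbytes_in_memory = 0

    def set_nbytes(self, nbytes):
        diff = nbytes
        if self._nbytes >= 0:
            diff -= self._nbytes    # only the change since the last report is added
        self.group_nbytes_in_memory += diff
        self._nbytes = nbytes

ts = ToyTask()
ts.set_nbytes(100)  # first worker reports the key: counter += 100
ts.set_nbytes(100)  # a replica appears elsewhere, same size reported: diff == 0
print(ts.group_nbytes_in_memory)  # 100 -- incremented once per key, not per copy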

In short, I think TaskGroup.nbytes_in_memory is incremented once per key, but decremented once per copy of the key.

If nbytes can be different for different workers, then to do this bookkeeping correctly, I think we'd also need to track TaskState.total_nbytes (size of all copies of the key), then decrement by that once in transition_memory_released and _propagate_forgotten.
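
A very rough sketch of that idea; the attribute and method names below (total_nbytes, add_replica, release_all_replicas) are hypothetical, not existing distributed API. The point is only that increments and decrements balance when the total over all copies is tracked and released exactly once:

# Hypothetical bookkeeping sketch; names here are made up for illustration.
class ToyTask:
    def __init__(self):
        self.total_nbytes = 0             # size summed over all in-memory copies
        self.group_nbytes_in_memory = 0

    def add_replica(self, nbytes):
        # Called once per worker that gains the key in memory.
        self.total_nbytes += nbytes
        self.group_nbytes_in_memory += nbytes

    def release_all_replicas(self):
        # Called exactly once when the key leaves memory (released or forgotten).
        self.group_nbytes_in_memory -= self.total_nbytes
        self.total_nbytes = 0

ts = ToyTask()
ts.add_replica(100)
ts.add_replica(100)                        # replicas may even differ in size
ts.release_all_replicas()
print(ts.group_nbytes_in_memory)           # 0 -- increments and decrements balance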

Discovered in #4925 (comment). I think #4925 made this more apparent, since it encourages more data replication.

cc @crusaderky since you know more about replicated keys.

@fjetter
Member

fjetter commented Jun 18, 2021

I stumbled over this myself recently; see

# TODO: Are we supposed to track replicated memory here? See also Scheduler.add_keys
assert tg.nbytes_in_memory == y.nbytes

where this behaviour is intentionally pinned, but I believe that was a mistake. Intuitively, I would also expect this to be different.

From what I can see, the undercounting is introduced in Scheduler.add_keys, where workers let the scheduler know once they have a replica. There, the group nbytes is not increased:

if ts not in ws._has_what:
    ws._nbytes += ts.get_nbytes()
    ws._has_what[ts] = None
    ts._who_has.add(ws)
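
As a toy illustration of that asymmetry (stand-in objects only, not the real scheduler classes), a new replica announced through this path raises the per-worker total but leaves the group total untouched:

# Toy stand-ins for the add_keys bookkeeping; not distributed's real classes.
class ToyWorker:
    def __init__(self):
        self._nbytes = 0
        self._has_what = {}

class ToyTask:
    def __init__(self, nbytes):
        self._nbytes = nbytes
        self._who_has = set()
        self.group_nbytes_in_memory = nbytes  # counted once when first computed

    def get_nbytes(self):
        return self._nbytes

ts = ToyTask(100)
for ws in (ToyWorker(), ToyWorker()):   # two workers announce a replica
    if ts not in ws._has_what:
        ws._nbytes += ts.get_nbytes()   # per-worker memory is tracked...
        ws._has_what[ts] = None
        ts._who_has.add(ws)
        # ...but nothing increments ts.group_nbytes_in_memory here.

print(ts.group_nbytes_in_memory)        # still 100, regardless of replica count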

I noticed this in my deadlock PR, which had already grown without bounds, so I didn't fix it there.

@mrocklin
Member

Maybe this will solve the problem? #4930

@jrbourbeau
Member

Closed via #4930
