
Measure actual spilled bytes, not output of sizeof() #5805

Merged: 4 commits merged into dask:main from measure_pickled_bytes on Feb 16, 2022

Conversation

@crusaderky (Collaborator) commented Feb 14, 2022

Closes #5364

#5543 changed the measure of spilled memory from the output of sizeof() to the actual number of bytes written to disk. This introduced a substantial difference between the two measures in the case of:

  1. highly compressible data*
  2. data where sizeof() returns an inaccurate output
  3. numpy views, which become real arrays upon serialization.

*Compression upon spill requires lz4 or snappy to be installed. Notably, in our CI they're exclusively installed on Python 3.9. See #5807.
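
To see how large the gap can get, here is a small illustrative sketch, not distributed's actual spill code path: it compares what dask's sizeof() reports with the pickled size of a value, using pickle as a stand-in for the serialization that happens on spill. The Blob class and the broadcast-view example below are made up for illustration.

import os
import pickle

import numpy as np
from dask.sizeof import sizeof


class Blob:
    # A made-up object whose payload is invisible to the sys.getsizeof()
    # fallback used for unknown types (case 2 above).
    def __init__(self, nbytes):
        self.payload = os.urandom(nbytes)


blob = Blob(2**20)                                          # ~1 MiB of payload
view = np.broadcast_to(np.arange(100_000), (10, 100_000))   # numpy view (case 3)

for name, value in [("blob", blob), ("broadcast view", view)]:
    estimated = sizeof(value)              # in-memory estimate (the pre-#5543 measure)
    serialized = len(pickle.dumps(value))  # closer to what ends up on disk
    # Depending on the dask version, sizeof() may or may not discount the
    # broadcast view, so the two numbers may diverge by an order of magnitude.
    print(f"{name}: sizeof()={estimated:,} B, pickled={serialized:,} B")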

The same PR introduced a regression: this new difference would be added to (or subtracted from) the managed_in_memory measure, which in turn would subtract the same amount from (or add it to) the unmanaged memory measure.

Example

"x" * (50 * 2**20) occupies 50 MiB in RAM, but only 200 kiB on disk.

Before #5543, the dashboard would show 50 MiB whether the value was in RAM or spilled to disk.

After #5543, when the value was spilled to disk you would read on the dashboard:

  • 200 kiB spilled memory (grey)
  • 49.8 MiB more managed_in_memory (solid color) than you actually have
  • 49.8 MiB less unmanaged memory (semi transparent) than you actually have

This PR removes the above artifact.
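
To make those numbers concrete, here is a small illustrative calculation (plain Python, not distributed's internals) that reproduces the ~49.8 MiB discrepancy for this single spilled key, using the 200 kiB on-disk figure from the example above:

MiB = 2**20
kiB = 2**10

sizeof_estimate = 50 * MiB   # what sizeof() reports for "x" * (50 * 2**20)
bytes_on_disk = 200 * kiB    # roughly what gets written when it is spilled

difference = sizeof_estimate - bytes_on_disk
print(f"{difference / MiB:.1f} MiB")   # ~49.8 MiB

# With the #5543 regression, this difference leaked into the breakdown:
# managed_in_memory appeared ~49.8 MiB too high and unmanaged memory
# ~49.8 MiB too low. After this PR, spilled memory is reported as
# bytes_on_disk, while managed_in_memory and unmanaged are unaffected.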

Demo

import time
from distributed import Client

c = Client()

# Keep creating 50 MiB strings on the workers (pure=False forces distinct
# keys) until they exceed their memory target and start spilling to disk.
futures = []
while True:
    futures += [
        c.submit(lambda: "x" * (50 * 2**20), pure=False)
        for _ in range(64)
    ]
    time.sleep(0.5)
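
On machines with a lot of RAM, the default memory limits mean the loop above can run for a while before any spilling happens. A variant that starts a small LocalCluster with an explicit memory_limit (the worker count and limit below are arbitrary choices) makes the effect visible sooner:

import time
from distributed import Client, LocalCluster

# Cap worker memory so the spill threshold is reached quickly.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit="2 GiB")
c = Client(cluster)

futures = []
while True:
    futures += [
        c.submit(lambda: "x" * (50 * 2**20), pure=False)
        for _ in range(64)
    ]
    time.sleep(0.5)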

Before #5543: [dashboard screenshot "before"]

After both PRs: [dashboard screenshot "after"]

github-actions bot (Contributor) commented Feb 14, 2022

Unit Test Results

18 files, 18 suites, 10h 15m 58s ⏱️
2 603 tests: 2 523 passed ✔️, 78 skipped 💤, 2 failed
23 293 runs: 21 880 passed ✔️, 1 411 skipped 💤, 2 failed

For more details on these failures, see this check.

Results for commit adf1b7c.

♻️ This comment has been updated with latest results.

@crusaderky force-pushed the measure_pickled_bytes branch 5 times, most recently from 2f1611f to a897741, on February 14, 2022 at 16:11
@ncclementi (Member) commented:

@crusaderky it looks like we are having a related failure in test_scheduler.py::test_memory; see https://github.com/dask/distributed/runs/5186888061?check_suite_focus=true#step:12:1260

@ncclementi (Member) commented:

@crusaderky With the changes in this PR, does this documentation need extra information?
https://github.com/dask/distributed/blob/main/docs/source/worker.rst#using-the-dashboard-to-monitor-memory-usage

@crusaderky (Collaborator, Author) commented:

> @crusaderky With the changes in this PR, does this documentation need extra information? https://github.com/dask/distributed/blob/main/docs/source/worker.rst#using-the-dashboard-to-monitor-memory-usage

Updated.

@crusaderky (Collaborator, Author) commented Feb 14, 2022

> @crusaderky it looks like we are having a related failure in test_scheduler.py::test_memory; see https://github.com/dask/distributed/runs/5186888061?check_suite_focus=true#step:12:1260

This is... a major pain. The problem is that the Python 3.7/3.8 CI installs neither lz4 nor snappy, so distributed.comm.compression falls back to no compression. However, I can't force it to zlib either (which is always available) because of a design flaw in distributed.protocol.compression, which is not trivial to fix. I'll look into it more, but this looks like yet another blocker issue.

Edit: worked around for now. Tracked at #5807
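
As an aside, a quick way to check which compressors your environment actually registered is to inspect distributed.protocol.compression; treat the exact attribute names as an assumption based on the module layout at the time of this PR:

# Attribute names assumed from the 2022-era distributed.protocol.compression.
from distributed.protocol import compression

print(list(compression.compressions))   # registered compressors, e.g. zlib/lz4/snappy
print(compression.default_compression)  # falls back to no compression without lz4/snappy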

@crusaderky (Collaborator, Author) commented Feb 14, 2022

All green except for issues already spotted in the parent PR. Ready for review (but not merge!)

@ncclementi (Member) commented Feb 15, 2022

I think we still need to rename some things in the dashboard to be consistent with the changes in this PR, unless we think it is clear enough that "managed" means "managed_in_memory".

  • The WorkerTable uses the word "memory_managed" and there is an assignment "memory_managed": "managed"; if I'm not wrong, this should be "memory_managed": "managed in memory".

  • ClusterMemory and WorkersMemory also use "managed" as the name.

@crusaderky (Collaborator, Author) commented:

  • "memory_managed": "managed"

I think we are missing renaming some things in the dashboard to be consistent with the changes done in this PR, unless we think it is clear enough that "managed" is "managed_in_memory".

  • The WorkerTable it uses the word "memory_managed" and there is an assignment of "memory_managed": "managed" if I'm not wrong this should be "memory_managed": "managed in memory",
  • The ClusterMemory and WorkersMemory use also as name "managed",

I've renamed the variables for clarity. Note that the measure being shown was already managed_in_memory. The choice to just write "managed" in the UI is a conscious one to avoid confusing users.

@quasiben (Member) commented:

@shwina given your recent work with visualizing RMM data, this PR may be of interest, as it illuminates issues with how spilled data is visualized.

@ncclementi (Member) left a comment:

All the comments/requests were addressed in 905defa

@crusaderky merged commit 8d0df89 into dask:main on Feb 16, 2022
@crusaderky deleted the measure_pickled_bytes branch on February 16, 2022 at 11:11
@crusaderky (Collaborator, Author) commented:

IMPORTANT NOTE

🚨 🚨 🚨 🚨 🚨 🚨
All open PRs in dask/distributed will start failing on CI on Python 3.9 due to (deliberate) forward incompatibility with zict git tip. The solution is to merge from main.
🚨 🚨 🚨 🚨 🚨 🚨

Successfully merging this pull request may close these issues:

  • Spill to constrained disk space

3 participants