
Make cache management more dynamic #3578

Merged (5 commits) Nov 29, 2023

Conversation

@dbutenhof (Member):

PBENCH-1301

This could be refined, but I want to get it up for review. I'd love to get it on the staging server for testing with more limited disk space, although I can ask for 95% or 880GiB on my laptop and see it free cache. (I'm out all next week, but I may spend a bit of time trying to clean this up: e.g., trying to add some unit tests.)

First, this adds code to track the unpacked size of tarballs to help with managing cache goals. This is based on a `du -s -B1` of the unpacked directory tree, and stored as metadata.
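A minimal sketch of what that measurement might look like (the helper name is illustrative, not the PR's actual code, though the `du -s -B1` invocation matches the description):

```python
import subprocess
from pathlib import Path

def unpacked_size(tree: Path) -> int:
    """Return the size in bytes of an unpacked directory tree,
    using the `du -s -B1` measurement described above."""
    # du -s -B1 prints "<bytes>\t<path>"; keep only the first field.
    output = subprocess.check_output(["du", "-s", "-B1", str(tree)], text=True)
    return int(output.split()[0])
```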

When asked to unpack, it will check whether there's sufficient space on the cache device, and if not it will reclaim cache with a goal sufficient to accommodate the target dataset.

The `pbench-tree-manage` command can now target either % free or bytes free, and the background timer job will attempt to free 20% of the drive every 4 hours instead of targeting "old" tarballs. (The `last_ref` timestamp is now used only to sort the list of datasets with live cache on input to reclaim, so that we free the oldest first.)
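The reclaim policy described here (sort by `last_ref`, evict the oldest first until a free-space goal is met) could be sketched roughly like this; all names are illustrative, not pbench's actual classes:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CachedDataset:
    name: str
    last_ref: datetime   # last time this dataset's cache was referenced
    unpacked_size: int   # bytes, from the stored metadata

def plan_reclaim(cached: list[CachedDataset], goal_bytes: int,
                 free_bytes: int) -> list[CachedDataset]:
    """Select cached datasets to evict, oldest last_ref first,
    until the projected free space meets the goal."""
    victims = []
    for ds in sorted(cached, key=lambda d: d.last_ref):
        if free_bytes >= goal_bytes:
            break
        victims.append(ds)
        free_bytes += ds.unpacked_size
    return victims
```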

webbnh previously approved these changes Nov 28, 2023
Review comments (since resolved) on:
- lib/pbench/cli/server/tree_manage.py
- lib/pbench/server/cache_manager.py

webbnh previously approved these changes Nov 29, 2023

@webbnh (Member) left a comment:

👍

@dbutenhof (Member, Author):

```
podman image ls --filter 'reference=pbench-agent' --filter 'reference=pbench-server' --format '{{.Id}}' --filter containers=false
sort -u
xargs podman image rm -f
  Error: image name or ID must be specified
true
```

@webbnh (Member) commented Nov 29, 2023:

```
podman image ls --filter 'reference=pbench-agent' --filter 'reference=pbench-server' --format '{{.Id}}' --filter containers=false
sort -u
xargs podman image rm -f
  Error: image name or ID must be specified
true
```

That's old news, now. I resubmitted your build and it failed unit tests.

@dbutenhof (Member, Author):

Actually, it did both. I'm not sure why the cleanup is generating a blank argument to `podman image rm -f`, but it apparently is. But, yeah: the CI appears to be hitting some weird unit test problem I'm not seeing locally, and I really hate when that happens.

@dbutenhof (Member, Author):

Ah. It's more of that "asynchronous pytest run with implicit dependencies". Dang. So it is a real problem. 😦

@webbnh (Member) commented Nov 29, 2023:

> I'm not sure why the cleanup is generating a blank argument to `podman image rm -f`, but it apparently is.

Apparently the filtered output of `podman image ls` is empty, and so there is nothing for the `image rm` to remove. I'm not sure what might have caused that, but the build is supposed to ignore this result, so I think the fact that we're seeing failure notifications is unrelated to this (i.e., I don't think that this is a problem).

> the CI appears to be hitting some weird unit test problem I'm not seeing locally, and I really hate when that happens.

😞

@webbnh (Member) commented Nov 29, 2023:

While you're thinking about this stuff...did you see my comment from the other week?

@dbutenhof (Member, Author):

> While you're thinking about this stuff...did you see my comment from the other week?

Yeah, I saw that. I was tempted to just pull it into this PR although it deserves its own and I'm not particularly inclined to deal with that now. Then again, maybe I should just throw it into the pot ...

@webbnh (Member) commented Nov 29, 2023:

> I saw that.

OK, good.

> I was tempted to just pull it into this PR although it deserves its own and I'm not particularly inclined to deal with that now. Then again, maybe I should just throw it into the pot ...

No, no...you are right to resist that temptation. (I just didn't know how closely related or how big a change it would require.)

@dbutenhof (Member, Author):

> > I saw that.
>
> OK, good.
>
> > I was tempted to just pull it into this PR although it deserves its own and I'm not particularly inclined to deal with that now. Then again, maybe I should just throw it into the pot ...
>
> No, no...you are right to resist that temptation. (I just didn't know how closely related or how big a change it would require.)

It's a trivial one-character change, I think, since I believe that the plural form they want us to use takes `*args`. That is, just changing `add_column` to `add_columns` should do it.
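For context on that remark: in SQLAlchemy, `Query.add_column()` is the deprecated singular form, and its replacement `Query.add_columns()` takes `*args`, so call sites need only the plural name. A generic illustration (the model here is hypothetical, not pbench's schema):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Dataset(Base):
    __tablename__ = "datasets"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    size = Column(Integer)

engine = create_engine("sqlite://")  # in-memory DB for the example
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Dataset(name="fio_1", size=42))
    session.commit()
    # add_columns() accepts any number of entities in one call,
    # where the deprecated add_column() took exactly one:
    row = session.query(Dataset.name).add_columns(Dataset.size).one()
```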

`Tarball.__init__` now tries to look up a `Dataset`, and therefore (if the
constructor isn't mocked out) we need a DB. In parallel execution, we may
hit a test requiring DB without a fixture to initialize it. The failures are
therefore somewhat random. I've added fixtures to fix the ones I can reproduce
locally with `jenkins/run tox -- server python -n16`. TBD whether the CI finds
more!
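The hazard described in this commit message can be sketched abstractly: a constructor that performs a DB lookup succeeds or fails depending on whether an earlier test in the same parallel worker happened to initialize the DB. This toy model (not pbench code) uses a plain dict to stand in for the database:

```python
# Toy model of the ordering hazard: _DB stands in for the real database,
# which is None until a fixture initializes it.
_DB = None

def db_fixture():
    """What a pytest DB fixture would do: initialize the store."""
    global _DB
    _DB = {"ds1": "tarball-1"}

class Tarball:
    """Stand-in for a constructor that now performs a Dataset lookup."""
    def __init__(self, name):
        if _DB is None:
            # The "random" failure: whether _DB is initialized depends on
            # which tests already ran in the same parallel worker.
            raise RuntimeError("database not initialized")
        self.dataset = _DB.get(name)
```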
@webbnh (Member) left a comment:

🤞

@dbutenhof dbutenhof merged commit e90872a into distributed-system-analysis:main Nov 29, 2023
3 checks passed
@dbutenhof dbutenhof deleted the goal branch November 29, 2023 21:30