
Make cache management more dynamic #3578

Merged (5 commits) Nov 29, 2023

Conversation

@dbutenhof (Member):

PBENCH-1301

This could be refined, but I want to get it up for review. I'd love to get it on the staging server for testing with more limited disk space, although I can ask for 95% or 880GiB on my laptop and see it free cache. (I'm out all next week, but I may spend a bit of time trying to clean this up: e.g., trying to add some unit tests.)

First, this adds code to track the unpacked size of tarballs to help with managing cache goals. This is based on a `du -s -B1` of the unpacked directory tree, and stored as metadata.
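A minimal sketch of what that measurement might look like (the helper name is illustrative, not the PR's actual code, though the `du -s -B1` invocation matches the description):

```python
import subprocess
from pathlib import Path

def unpacked_size(tree: Path) -> int:
    """Return the size in bytes of an unpacked directory tree,
    using the `du -s -B1` measurement described above."""
    # du -s -B1 prints "<bytes>\t<path>"; keep only the first field.
    output = subprocess.check_output(["du", "-s", "-B1", str(tree)], text=True)
    return int(output.split()[0])
```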

When asked to unpack, it will check whether there's sufficient space on the cache device, and if not it will reclaim cache with a goal sufficient to accommodate the target dataset.

The `pbench-tree-manage` command can now target either % free or bytes free, and the background timer job will attempt to free 20% of the drive every 4 hours instead of targeting "old" tarballs. (The `last_ref` timestamp is now used only to sort the list of datasets with live cache on input to reclaim, so that we free the oldest first.)
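The reclaim policy described here (sort by `last_ref`, evict the oldest first until a free-space goal is met) could be sketched roughly like this; all names are illustrative, not pbench's actual classes:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CachedDataset:
    name: str
    last_ref: datetime   # last time this dataset's cache was referenced
    unpacked_size: int   # bytes, from the stored metadata

def plan_reclaim(cached: list[CachedDataset], goal_bytes: int,
                 free_bytes: int) -> list[CachedDataset]:
    """Select cached datasets to evict, oldest last_ref first,
    until the projected free space meets the goal."""
    victims = []
    for ds in sorted(cached, key=lambda d: d.last_ref):
        if free_bytes >= goal_bytes:
            break
        victims.append(ds)
        free_bytes += ds.unpacked_size
    return victims
```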

webbnh previously approved these changes Nov 28, 2023
Review comments (since resolved) on:
- lib/pbench/cli/server/tree_manage.py
- lib/pbench/server/cache_manager.py

webbnh previously approved these changes Nov 29, 2023

@webbnh (Member) left a comment:

👍

@dbutenhof (Member, Author):

```
podman image ls --filter 'reference=pbench-agent' --filter 'reference=pbench-server' --format '{{.Id}}' --filter containers=false
sort -u
xargs podman image rm -f
  Error: image name or ID must be specified
true
```

@webbnh (Member) commented Nov 29, 2023:

```
podman image ls --filter 'reference=pbench-agent' --filter 'reference=pbench-server' --format '{{.Id}}' --filter containers=false
sort -u
xargs podman image rm -f
  Error: image name or ID must be specified
true
```

That's old news, now. I resubmitted your build and it failed unit tests.

@dbutenhof (Member, Author):

Actually, it did both. I'm not sure why the cleanup is generating a blank argument to `podman image rm -f`, but it apparently is. But, yeah: the CI appears to be hitting some weird unit test problem I'm not seeing locally, and I really hate when that happens.

@dbutenhof (Member, Author):

Ah. It's more of that "asynchronous pytest run with implicit dependencies". Dang. So it is a real problem. 😦

@webbnh (Member) commented Nov 29, 2023:

> I'm not sure why the cleanup is generating a blank argument to `podman image rm -f`, but it apparently is.

Apparently the filtered output of `podman image ls` is empty, and so there is nothing for the `image rm` to remove. I'm not sure what might have caused that, but the build is supposed to ignore this result, so I think the fact that we're seeing failure notifications is unrelated to this (i.e., I don't think that this is a problem).

> the CI appears to be hitting some weird unit test problem I'm not seeing locally, and I really hate when that happens.

😞

@webbnh (Member) commented Nov 29, 2023:

While you're thinking about this stuff...did you see my comment from the other week?

@dbutenhof (Member, Author):

> While you're thinking about this stuff...did you see my comment from the other week?

Yeah, I saw that. I was tempted to just pull it into this PR although it deserves its own and I'm not particularly inclined to deal with that now. Then again, maybe I should just throw it into the pot ...

@webbnh (Member) commented Nov 29, 2023:

> I saw that.

OK, good.

> I was tempted to just pull it into this PR although it deserves its own and I'm not particularly inclined to deal with that now. Then again, maybe I should just throw it into the pot ...

No, no...you are right to resist that temptation. (I just didn't know how closely related or how big a change it would require.)

@dbutenhof (Member, Author):

> > I saw that.
>
> OK, good.
>
> > I was tempted to just pull it into this PR although it deserves its own and I'm not particularly inclined to deal with that now. Then again, maybe I should just throw it into the pot ...
>
> No, no...you are right to resist that temptation. (I just didn't know how closely related or how big a change it would require.)

It's a trivial one-character change, I think, since I believe that the plural form they want us to use takes `*args`. That is, just changing `add_column` to `add_columns` should do it.
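For context on that remark: in SQLAlchemy, `Query.add_column()` is the deprecated singular form, and its replacement `Query.add_columns()` takes `*args`, so call sites need only the plural name. A generic illustration (the model here is hypothetical, not pbench's schema):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Dataset(Base):
    __tablename__ = "datasets"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    size = Column(Integer)

engine = create_engine("sqlite://")  # in-memory DB for the example
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Dataset(name="fio_1", size=42))
    session.commit()
    # add_columns() accepts any number of entities in one call,
    # where the deprecated add_column() took exactly one:
    row = session.query(Dataset.name).add_columns(Dataset.size).one()
```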

`Tarball.__init__` now tries to look up a `Dataset`, and therefore (if the
constructor isn't mocked out) we need a DB. In parallel execution, we may
hit a test requiring DB without a fixture to initialize it. The failures are
therefore somewhat random. I've added fixtures to fix the ones I can reproduce
locally with `jenkins/run tox -- server python -n16`. TBD whether the CI finds
more!
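The hazard described in this commit message can be sketched abstractly: a constructor that performs a DB lookup succeeds or fails depending on whether an earlier test in the same parallel worker happened to initialize the DB. This toy model (not pbench code) uses a plain dict to stand in for the database:

```python
# Toy model of the ordering hazard: _DB stands in for the real database,
# which is None until a fixture initializes it.
_DB = None

def db_fixture():
    """What a pytest DB fixture would do: initialize the store."""
    global _DB
    _DB = {"ds1": "tarball-1"}

class Tarball:
    """Stand-in for a constructor that now performs a Dataset lookup."""
    def __init__(self, name):
        if _DB is None:
            # The "random" failure: whether _DB is initialized depends on
            # which tests already ran in the same parallel worker.
            raise RuntimeError("database not initialized")
        self.dataset = _DB.get(name)
```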
@webbnh (Member) left a comment:

🤞

@dbutenhof dbutenhof merged commit e90872a into distributed-system-analysis:main Nov 29, 2023
3 checks passed
@dbutenhof dbutenhof deleted the goal branch November 29, 2023 21:30