Automatically clean up old install bases #2109

Open
camillol opened this issue Nov 18, 2016 · 34 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Performance Issues for Performance teams type: feature request

Comments

@camillol

$ du -sh /private/var/tmp/_bazel_camillol/install
2.3G /private/var/tmp/_bazel_camillol/install

There are 50 directories in there. The oldest dates back to November 24, 2015; the newest is from September 1, 2016. May 12 has five different folders.

Even though I ran "blaze clean" in all of my blaze clients, lots of these folders are left behind. They should be cleaned up somehow.

@kchodorow
Contributor

bazel clean removes output files; bazel clean --expunge removes the whole output base (which contains the install directory). See https://bazel.build/versions/master/docs/bazel-user-manual.html#clean for more info.

@camillol
Author

The output base does not contain the install directory, but only a symlink to it. The actual install directory is not removed by bazel clean --expunge.

@kchodorow
Contributor

Oh, that's true. I guess we could add an option, but you can just delete the directories, too.

@kchodorow kchodorow reopened this Nov 18, 2016
@kchodorow kchodorow added category: misc > misc P3 We're not considering working on this, but happy to review a PR. (No assignee) type: feature request labels Nov 18, 2016
@camillol
Author

An option doesn't really help. If you know to use the option, you know enough to delete the directories manually. The problem is that we leave behind a separate installation directory for every single build of Blaze that the user has ever used, and they never get cleaned up, until the user starts running low on space and goes looking for things to delete. We should not burden the user with that; we should just clean up old installations periodically.

@kchodorow
Contributor

That doesn't seem like Bazel's responsibility: if you wanted you could put the bazel directories on a filesystem that will delete things that haven't been used in a while. "We now slow down your build to delete some files taking up a little disk space" doesn't seem like a tradeoff most developers would want to make.

@pcj
Member

pcj commented Nov 19, 2016

Is it safe to delete the entire user tree, i.e. $ rm -rf /private/var/tmp/_bazel_camillol? Mine is 14G and my disk is 99% full. This is kind of an issue for people who work with bazel across multiple repos rather than a monorepo.

@ulfjack
Contributor

ulfjack commented Dec 1, 2016

It's perfectly safe to delete the install directories (as long as you aren't running Bazel in parallel). Bazel just re-creates them on the next run if necessary. Deleting the entire bazel tree is also safe, but will cause Bazel to rebuild everything on the next run, and the bazel-* symlinks will all be dead links.

@camillol
Author

camillol commented Dec 1, 2016

@kchodorow: in general, ensuring that temporary files get deleted is the responsibility of whoever creates them. If this were a 10 MB cache, you could say "eh, whatever, it's not going to hurt to just leave it there", but dumping 14 GB of old temporary files on @pcj's disk is not a reasonable thing to do.

We don't need to slow down builds at all, either. Bazel runs as a daemon, it can easily do the cleanup as an idle task when it's not building anything.

BTW, if you are not seeing this problem on your own machine it's probably because your company has set up a cron job to clean up old bazel directories automatically. (Which could, in theory, slow down your build if it runs at the same time as bazel... have you ever noticed that issue?) But this is really a responsibility that Bazel should take on, so that it works by default for everyone.

@dgoldstein0

+1 to this. I ran into the same thing recently myself: bazel had used 18GB of my disk for its caching, on a VM with 60GB, which caused me to run out of space and go hunting for my gigabytes.

If bazel is going to cache GBs of files, it should be responsible for doing some basic tracking of their usage and deleting them when they haven't been accessed in a while. I don't mind giving a few GB to bazel to use as a cache, but it needs to be respectful of my disk and not cause me to run out of space.

@kchodorow
Contributor

@camillol I disagree. Google has a separate tool that basically takes care of this problem for you, which is why it isn't built into Bazel. Bazel is huge and complicated, we'd like to keep it focused on doing one thing (building) well.

@camillol
Author

camillol commented Dec 9, 2016 via email

@damienmg damienmg added this to the 0.7 milestone Dec 12, 2016
@damienmg
Contributor

I think this should be done, either by having the launcher clean the install dir or by having the installer install such a service. It is probably better to have this simple code as part of the launcher.

I agree with @camillol that the defaults of bazel should give an awesome user experience and this is not it.

@damienmg damienmg added P2 We'll consider working on this in future. (Assignee optional) type: feature request and removed P3 We're not considering working on this, but happy to review a PR. (No assignee) type: feature request labels Dec 12, 2016
@jmmv
Contributor

jmmv commented Apr 5, 2017

I just stumbled on this issue and would like to vote in favor of what @camillol is saying: Bazel created the garbage, so it's Bazel's responsibility to clean it up automatically. These whole "installation directories" are a very strange concept after all, so when Bazel abandons them, Bazel has to destroy them. And as @damienmg says, the current behavior is far from a great user experience.

@camillol
Author

Gentle ping. I keep pruning it, but it grows back quickly. This is mostly within the last two months:

$ du -sh /private/var/tmp/_bazel_camillol/install/
883M /private/var/tmp/_bazel_camillol/install/

@camillol
Author

camillol commented Dec 4, 2017

I would suggest reclassifying this from "feature request" to "bug".

@jgavris
Contributor

jgavris commented Sep 20, 2018

I'd also like to bump this ... our build agents have output bases of over 100G pretty quickly. Some notion of being more careful about leaving behind garbage would be great.

@jmmv
Contributor

jmmv commented Sep 20, 2018

@jgavris The output base is different than the install base. The output base is specific to your project and you are responsible for getting rid of it via bazel clean if you want to (it's your data, so Bazel shouldn't get rid of it automatically). The install base is what's an artifact of how Bazel works today... and I think it's pretty hard to pile up 100GB of such data...

@jgavris
Contributor

jgavris commented Sep 21, 2018

@jmmv my bad ... you're right. We actually quickly hit 100GB of artifacts in CI in one day doing about 10 variant builds of a medium / large codebase (debug and release configs for 5 different architectures).

@meisterT
Member

What's the actual proposal here? I suppose it should still be possible to use several bazel versions on one machine without having to extract them on each new invocation.

@meisterT
Member

one option is to handle this in https://github.com/philwo/bazelisk and look at the time stamp of the install base

@sgowroji
Member

Hi there! We're doing a clean up of old issues and will be closing this one. Please reopen if you’d like to discuss anything further. We’ll respond as soon as we have the bandwidth/resources to do so.

@sgowroji sgowroji closed this as not planned Jan 31, 2023
@jmmv
Contributor

jmmv commented Jan 31, 2023

@sgowroji Please note that we (people outside of the Bazel org) don't have permissions to reopen issues.

This is still a real usability problem that should be addressed. It may not be a visible problem within Google due to how this is handled there (cron job... as someone mentioned above years ago), but we need a real solution for the general public.

@sgowroji sgowroji reopened this Jan 31, 2023
@MilesCranmer

MilesCranmer commented Jan 31, 2023

[Edit - moved to caching issues. Thanks @jmmv]

@jmmv
Contributor

jmmv commented Jan 31, 2023

@MilesCranmer Let's not conflate issues though. From your claim of "millions of files", I'm pretty sure you are referring to output trees, not the install directory. The install directory, which is covered by this issue, is truly a cache that needs automatic cleanup. But this directory grows much more slowly than everything else and doesn't take "a lot" of space (but that's subjective).

For other types of outputs... see this other issue (and my comment), which tracks this problem more broadly: #1035 (comment)

@MilesCranmer

I think maybe we are speaking about the same thing? From what I recall, my cache (I think the install directory?) was only ~20-50 GB in terms of space, but the number of files was truly massive, in the few millions (of tiny files) – which slows down file indexing. This is why I have avoided using bazel on my institutional cluster, because I will hit the hard file limit very quickly. I could be misremembering though, as I haven't used bazel in a while (though I would like to, once this issue gets fixed!).

@jmmv
Contributor

jmmv commented Jan 31, 2023

We aren't. The install directory for a release takes ~170MB and contains ~700 files (today). If all you had in those 50GB were Bazel installs, you would fit 300 installs which amount to 210k files. And you wouldn't end up with 300 different installs unless you were developing Bazel itself, because there aren't that many releases out there.

What you are talking about is the space used by output directories, which are stored in ~/.cache/ per XDG guidelines but they are not a cache. See the comment I linked to.
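
The back-of-the-envelope estimate above (about 170 MB and 700 files per install base, compared against a 50 GB directory) can be checked with a few lines of Python. This is only an illustrative calculation: the function name `estimate_installs` is hypothetical, and decimal units (1 GB = 1000 MB) are assumed.

```python
def estimate_installs(total_gb, install_mb=170, files_per_install=700):
    """Estimate how many install bases fit in total_gb, and their file count.

    Assumes decimal units (1 GB = 1000 MB); the defaults are the
    approximate per-install figures quoted in the comment above.
    """
    installs = (total_gb * 1000) // install_mb
    files = installs * files_per_install
    return installs, files


installs, files = estimate_installs(50)
# Unrounded, this gives just under 300 installs and a bit over 200k files,
# in line with the rounded figures of 300 installs and 210k files.
print(installs, files)
```

Either way the file count lands in the low hundreds of thousands, an order of magnitude short of "a few millions", which supports the conclusion that the files in question belong to output trees rather than install bases.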

@MilesCranmer

Ah, I see, thanks! Indeed I guess I got confused because they are in the .cache directory, your comment was helpful. Will move myself to the other issue.

@meisterT meisterT added team-Performance Issues for Performance teams and removed team-Local-Exec Issues and PRs for the Execution (Local) team labels May 16, 2024
@tjgq tjgq changed the title Bazel leaves behind too many old files in "install" Automatically clean up old install bases May 23, 2024
@tjgq
Contributor

tjgq commented May 23, 2024

I'm going to repurpose this issue for the "automatically clean up old install bases" project, which I intend to work on at some point before Bazel 8.

The preliminary plan is to check for install bases that haven't been touched for a long time upon server startup and delete them. (We already update the mtime on the install base directory when the server starts up, so we can use that as the signal.)
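
To make the preliminary plan concrete, here is a minimal Python sketch of selecting install bases by directory mtime. It is not Bazel's implementation: the function name `find_stale_install_bases`, the `max_age_seconds` parameter, and the assumption that every subdirectory of the install root is one install base are all hypothetical. The sketch only lists candidates and deletes nothing.

```python
import time
from pathlib import Path


def find_stale_install_bases(install_root, max_age_seconds, now=None):
    """Return subdirectories of install_root not touched within max_age_seconds.

    Relies on the directory mtime as the "last used" signal, as described
    in the plan above. Hypothetical helper, for illustration only.
    """
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(Path(install_root).iterdir()):
        # Only directories are candidate install bases; skip stray files.
        if not entry.is_dir():
            continue
        age = now - entry.stat().st_mtime
        if age > max_age_seconds:
            stale.append(entry)
    return stale


if __name__ == "__main__":
    # Example path taken from the original report; adjust for your machine.
    root = Path("/private/var/tmp/_bazel_camillol/install")
    if root.is_dir():
        for path in find_stale_install_bases(root, 30 * 24 * 3600):
            print("stale:", path)
```

A real collector would also need to make sure that no running Bazel is using a candidate before removing it, which is what the locking work referenced later in this thread is for.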

@tjgq tjgq removed the help wanted Someone outside the Bazel team could own this label Jun 21, 2024
@tjgq
Contributor

tjgq commented Sep 30, 2024

Status update: this will not make it into 8.0, but the work is planned for 8.1.

copybara-service bot pushed a commit that referenced this issue Nov 4, 2024
…cks.

Currently, we rely on CreateFile to effectively obtain an exclusive (write) lock on the entire file, which makes the later call to LockFileEx redundant. This CL makes it so that we open the file in shared mode, and actually use LockFileEx to lock it.

This makes a client-side lock compatible with a server-side one obtained through the JVM (which defaults to opening files in shared mode and uses LockFileEx for locking). Even though this doesn't matter for the output base lock, which is only ever obtained from the client side (the server side doesn't use filesystem-based locks), it will be necessary to implement install base locking (as part of fixing #2109).

Note that this means an older Bazel might immediately exit instead of blocking for the lock, if the latter was previously acquired by a newer Bazel (since the older Bazel will always CreateFile successfully, but treat the subsequent LockFileEx failure as an unrecoverable error). However, this only matters during the very small window during which the client-side lock is held (it's taken over by the server-side lock in very short order), so I believe this is a very small price to pay to avoid adding more complexity.

RELNOTES[INC]: On Windows, a change to the output base locking protocol might cause an older Bazel invoked immediately after a newer Bazel (on the same output base) to error out instead of blocking for the lock, even if --block_for_lock is enabled.

PiperOrigin-RevId: 692973056
Change-Id: Iaf1ccecfb4c138333ec9d7a694b10caf96b2917b
tjgq added a commit to tjgq/bazel that referenced this issue Nov 5, 2024
…h JVM locks.

copybara-service bot pushed a commit that referenced this issue Nov 5, 2024
They are currently only used to acquire a lock on the output base, but a future change will use them to lock the install base as well.

Human-readable output is also amended to refer to the "output base lock" instead of the "client lock", as the latter term becomes ambiguous once multiple locks exist.

Progress on #2109.

PiperOrigin-RevId: 693354279
Change-Id: I2b39e6f5ddb83bbc2be15a31d7de9655358776c5
github-merge-queue bot pushed a commit that referenced this issue Nov 5, 2024
…h JVM locks. (#24210)

tjgq added a commit to tjgq/bazel that referenced this issue Nov 6, 2024
github-merge-queue bot pushed a commit that referenced this issue Nov 6, 2024
…4223)

copybara-service bot pushed a commit that referenced this issue Nov 25, 2024
Progress on #2109.

PiperOrigin-RevId: 700006410
Change-Id: Ifd0cfdca6d4124addfecb99b0dec5f488e3ffedd
copybara-service bot pushed a commit that referenced this issue Dec 3, 2024
…the client side.

These are the client-side changes required to implement garbage collection of stale install bases: since one Bazel server might attempt to collect the install base of another, we must ensure that a running Bazel server *or* client prevents collection of its own install base, which is achieved by acquiring an exclusive lock prior to collection (to be implemented in a followup).

Note that we keep the existing mechanism for ensuring that concurrent attempts to create the same install base don't clash (atomic rename) because it's simpler than using the lock (which would require upgrading it from shared to exclusive and back).

Progress on #2109.

PiperOrigin-RevId: 702433270
Change-Id: I474f3d56ec126de5f975c543f7bf9b64f4f08124
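
The locking idea in this commit message can be illustrated with a small Python sketch. It assumes POSIX `flock` semantics through the standard `fcntl` module and is not Bazel's code: the helper names and the use of a single lock file per install base are hypothetical. A running client or server holds a shared lock; a collector proceeds only if it can take the exclusive lock without blocking.

```python
import fcntl
import os


def acquire_shared(lock_path):
    """Hold a shared lock, as a running client or server would (assumption)."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_SH)
    return fd


def try_acquire_exclusive(lock_path):
    """Try to take the exclusive lock without blocking.

    Returns the file descriptor on success, or None if another holder
    (shared or exclusive) currently has the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Lock is held elsewhere: the install base is in use, so skip it.
        os.close(fd)
        return None
    return fd


def release(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

With this shape, several Bazel processes can share one install base concurrently, while a collector that fails to get the exclusive lock simply leaves that install base alone. On Windows the equivalent would go through `LockFileEx`, as discussed in the earlier commit referenced in this thread.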
copybara-service bot pushed a commit that referenced this issue Dec 9, 2024
Progress on #2109.

PiperOrigin-RevId: 704253963
Change-Id: Ib5ef94bbcbad769ca19ac89f97e1d7f82eba4dba
copybara-service bot pushed a commit that referenced this issue Dec 13, 2024
The `--experimental_install_base_gc_max_age` flag determines the criteria for considering an install base stale; zero disables garbage collection. The default will be replaced with a suitable nonzero value in a future CL.

Progress on #2109.

PiperOrigin-RevId: 705874241
Change-Id: I3c8526a4a0e20a12d52c08cb282f24e268cbc633
ramil-bitrise pushed a commit to bitrise-io/bazel that referenced this issue Dec 18, 2024
…cks.

ramil-bitrise pushed a commit to bitrise-io/bazel that referenced this issue Dec 18, 2024