Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add option to persist rule index #1880

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jbedard
Copy link
Contributor

@jbedard jbedard commented Aug 20, 2024

What type of PR is this?
Feature

What package or component does this PR mostly affect?
cmd/gazelle

What does this PR do? Why is it needed?

Add the ability to persist the RuleIndex across runs to allow updating only the specified directories when indexing is required.

Which issues(s) does this PR fix?

Ref: #1181

walk/walk.go Outdated Show resolved Hide resolved
@jbedard jbedard marked this pull request as ready for review August 20, 2024 21:39
fmeum pushed a commit that referenced this pull request Aug 21, 2024
**What type of PR is this?**
Other

**What package or component does this PR mostly affect?**
all

**What does this PR do? Why is it needed?**

Prefactor for #1880
@jbedard jbedard force-pushed the index-persist branch 3 times, most recently from 10667ba to 936cb51 Compare August 22, 2024 08:34
}
defer indexDbFile.Close()

indexDbContent, err := io.ReadAll(indexDbFile)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How large does this file get? I'm wondering whether json.NewDecoder could be more appropriate here. Doesn't matter for the first iteration though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this could get fairly large and I've been wondering if we'll want something more compact then json. But I figured if this is EXPERIMENTAL and we can break it whenever we want then plain json is a good place to start?

resolve/index.go Outdated Show resolved Hide resolved
@@ -96,6 +99,7 @@ func (ucr *updateConfigurer) RegisterFlags(fs *flag.FlagSet, cmd string, c *conf
fs.StringVar(&ucr.memProfile, "memprofile", "", "write memory profile to `file`")
fs.Var(&gzflag.MultiFlag{Values: &ucr.knownImports}, "known_import", "import path for which external resolution is skipped (can specify multiple times)")
fs.StringVar(&ucr.repoConfigPath, "repo_config", "", "file where Gazelle should load repository configuration. Defaults to WORKSPACE.")
fs.StringVar(&uc.ruleIndexFile, "indexdb", "", "EXPERIMENTAL: file to cache the rule index")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions for the naming of the cli arg? I used "indexdb" because this is caching the RuleIndex data 🤷
Maybe we should use a --experimental prefix?

@jbedard
Copy link
Contributor Author

jbedard commented Aug 22, 2024

Just to re-iterate a few points and limitations of this:

This is EXPERIMENTAL with no api guarantees. 99% chance there will be breaking changes. You should never commit this file to source-control.

When this EXPERIMENTAL feature is used:

  • the resolver.Imports/Embeds() calls for the entire workspace (the "rule index") is written to disk
  • when specifying dirs (default: .) on the command line only those directories are updated in the "rule index"
  • only the dirs specified on the command line will be traversed, therefore:
    • only those dirs will have Imports/Embeds() invoked
    • only those dirs will have Fix(), GenerateRules() invoked
    • only those dirs will have BUILDs updated, warnings generated etc.
    • only those dirs will have things like resolve warnings outputted

If resolver.Imports/Embeds() have side-effects then this won't work very well when they are cached and not invoked. Don't do that.

There is no cache invalidation here. It is entirely up to the user to pass the correct list of dirs that require reprocessing.

FUTURE: if those updated dirs changed what rules they generate and the Imports/Embeds() of those rules change it is possible that other rules outside dirs are now referencing something incorrect. I think if Imports/Embeds() of those dirs change we might want to re-run Fix + GenerateRules for all directories.

@jbedard
Copy link
Contributor Author

jbedard commented Dec 18, 2024

@fmeum did you have any opinions on this going forward? Is this something you think should live in gazelle?

For large repos even simply walking the fs can be slow (even after the changes making it concurrent) so any type of incremental BUILD-generation needs to run on a subset of the workspace which is the goal of this PR.

@fmeum
Copy link
Member

fmeum commented Dec 27, 2024

I could very well see this live in Gazelle. Do you think that any of the two recent proposals #1891 and #1907 could substantially improve the situation for your large repos? If so, I would prefer to explore this route first, mostly since serialization, even if experimental, often ends up constraining future efforts to change the persisted data structure. Not doing unnecessary work in the first place could save us the complexity of having to persist this work.

@jbedard
Copy link
Contributor Author

jbedard commented Jan 1, 2025

Avoiding any type of caching/persisting would always be ideal if we can do it 👍

For persisting/serialization... I would prefer just thinking of it as caching, it should never be required. If we just throw a hash at the start of the serialized data then we can ignore any cached data from an older version of gazelle or different list of languages etc (hash = gazelle version + every language compiled into the gazelle binary maybe), then we should never have to deal with older data formats at least.

@jbedard
Copy link
Contributor Author

jbedard commented Jan 1, 2025

I'll try to leave comments on #1891 and #1907. I worry any major improvements will require breaking changes to truly help though, maybe completely replacing traversal+indexing+resolving with something lazy by default?

I think a lot of scenarios might still need to know the fs tree in order to do things based on convention or lazy indexing though, but if we can then do the actual indexing+resolving in a lazy fashion that would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants