Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve performance of SyncReleasesWithTags #18351

Closed

Conversation

petergardfjall
Copy link
Contributor

@petergardfjall petergardfjall commented Jan 21, 2022

This PR addresses #18352

A mirror-sync operation not only runs git remote update, but also tries to
sync any repo releases with the available repo tags. For large repositories with
many tags, this can be a costly operation, both in time and computational resources.

This commit aims to improve correctness and optimize performance of the SyncReleasesWithTags operation, in particular for plain git mirrors (which never have any releases). There are two parts to the commit:

  • first, when looking up releases, we should ignore plain tags in
    FindReleaseOptions.

  • second, if we don't find any releases, we don't need to query all tags of the
    repo (as there is nothing to be synced).

For large mirror repos, with hundreds of tags, this can bring down the duration
of the sync operation from minutes to seconds.

A mirror-sync operation not only runs `git remote update`, but also tries to
sync any repo releases with the available repo tags. For large repositories with
many tags, this can be a costly operation.

This commit aims to improve correctness and optimize performance of this
operation, in particular for plain git mirrors (which never have any
releases). There are two parts to the commit:

- first, when looking up releases, we should ignore plain tags in
  `FindReleaseOptions`.

- second, if we don't find any releases, we don't need to query all tags of the
  repo (as there is nothing to be synced).

For large mirror repos, with hundreds of tags, this can bring down the duration
of the sync operation from minutes to seconds.

Signed-off-by: Peter Gardfjäll <peter.gardfjall.work@gmail.com>
@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Jan 21, 2022
@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Jan 21, 2022
Copy link
Member

@noerw noerw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this breaks the test TestMirrorPull

assert.EqualValues(t, initCount+1, count)

@petergardfjall
Copy link
Contributor Author

Looks like this breaks the test TestMirrorPull

assert.EqualValues(t, initCount+1, count)

Yes, I noticed that too. I'll see if I can fix it.

@@ -247,7 +247,7 @@ func SyncReleasesWithTags(repo *repo_model.Repository, gitRepo *git.Repository)
existingRelTags := make(map[string]struct{})
opts := models.FindReleasesOptions{
IncludeDrafts: true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also ignore drafts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
IncludeDrafts: true,
IncludeDrafts: false,

@6543 6543 added the performance/speed performance issues with slow downs label Jan 23, 2022
@6543 6543 added this to the 1.17.0 milestone Jan 23, 2022
@6543 6543 added the type/enhancement An improvement of existing functionality label Jan 23, 2022
@petergardfjall
Copy link
Contributor Author

petergardfjall commented Jan 24, 2022

I think that my understanding of SyncReleasesWithTags was flawed. I have now realized that it is used in more places than on mirror-syncs and it is also a bit more involved than I first thought, so I'm considering closing this PR (and the associated issue). Some thoughts below:

Initially I thought the releases table was only used to track "project releases" (as seen under the Releases tab in the UI for instance), but I soon realized that it tracks both "project releases" and regular tags. So I believed that since a pull mirror doesn't have "project releases", SyncReleasesWithTags could largely be skipped for pull mirrors.

I soon noticed that SyncReleasesWithTags is also used during one-off migrations (from GitHub and similar). In those cases run after the "project releases" have been downloaded to sync any remaining migrated tags to the release table.

I then thought that one way of optimizing for syncing pull mirrors (which are always identical to the upstream) could be to just rebuild the release table entries entirely on every sync (delete all release rows for the mirror repo and recreate from the just-synced tags, which on rare occassions may have changed). However, I'm beginning to think that recreating all those models.Releases entries is not possible without running a lot of git commands (unless there is a batch git command that fetches all metadata to build models.Releases of all tags in one go), which is what makes the sync expensive in the first place.

So, my initial thinking that git remote update should pretty much be enough for a pull mirror was wrong, since I overlooked the amount of data needed for every entry in the release table.

WDYT, just close or are there any other ideas on if/how performance can be improved for pull mirrors?

Edit: would a command such as git for-each-ref --format='%(refname:short);;%(objectname);;%(authordate);;%(authorname);;%(authoremail)' "refs/tags/*" get enough data to build a sufficient models.Release struct for every tag in the repo? If so, I think that the pull mirror sync could still be made considerably faster.

@petergardfjall
Copy link
Contributor Author

petergardfjall commented Jan 24, 2022

To add to the reasoning setting IncludeTags: false might not be safe even for pull mirrors since, on (very) rare occassions, it might happen that an upstream tag was moved to another commit, and we would then end up with tag inconsistencies in the release table.

@petergardfjall
Copy link
Contributor Author

Closing this in favor of #19125

@go-gitea go-gitea locked and limited conversation to collaborators Apr 28, 2022
@wxiaoguang wxiaoguang removed this from the 1.17.0 milestone Jun 12, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
lgtm/need 1 This PR needs approval from one additional maintainer to be merged. performance/speed performance issues with slow downs type/enhancement An improvement of existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants