Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use git partial clone and worktree to reduce network/file io #5412

Merged
merged 13 commits into from
Dec 13, 2024

Conversation

Warashi
Copy link
Contributor

@Warashi Warashi commented Dec 11, 2024

What this PR does:

  • Use not cp -rf but git worktree add to Copy the git.Repo
  • Use git partial clone to reduce network io

Why we need it:

I want to reduce network/file io

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

  • How are users affected by this change:
  • Is this breaking change:
  • How to migrate (if breaking change):

Copy link

codecov bot commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 17.27273% with 91 lines in your changes missing coverage. Please review.

Project coverage is 25.79%. Comparing base (8f23da4) to head (c5216c9).
Report is 11 commits behind head on master.

Files with missing lines Patch % Lines
pkg/git/repo.go 28.07% 35 Missing and 6 partials ⚠️
pkg/git/gittest/git.mock.go 0.00% 11 Missing ⚠️
pkg/app/piped/deploysource/deploysource.go 0.00% 10 Missing ⚠️
pkg/app/pipedv1/deploysource/deploysource.go 0.00% 10 Missing ⚠️
pkg/app/piped/driftdetector/cloudrun/detector.go 0.00% 3 Missing ⚠️
pkg/app/piped/driftdetector/ecs/detector.go 0.00% 3 Missing ⚠️
pkg/app/piped/driftdetector/kubernetes/detector.go 0.00% 3 Missing ⚠️
pkg/app/piped/driftdetector/lambda/detector.go 0.00% 3 Missing ⚠️
pkg/app/piped/driftdetector/terraform/detector.go 0.00% 3 Missing ⚠️
pkg/app/piped/eventwatcher/eventwatcher.go 0.00% 2 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5412      +/-   ##
==========================================
+ Coverage   25.74%   25.79%   +0.05%     
==========================================
  Files         449      450       +1     
  Lines       48452    48606     +154     
==========================================
+ Hits        12474    12539      +65     
- Misses      35005    35085      +80     
- Partials      973      982       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
@Warashi Warashi force-pushed the git-worktree-and-partial-clone branch from 8f6b6c1 to e6a8296 Compare December 12, 2024 01:02
…mplementations

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
@Warashi Warashi changed the title Use git worktree and partial clone to reduce network io Use git worktree to reduce file io Dec 12, 2024
Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
…git.Repo

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
@Warashi Warashi force-pushed the git-worktree-and-partial-clone branch from 5d79c69 to 2a0ee56 Compare December 12, 2024 02:23
Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
… cloning

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
@Warashi
Copy link
Contributor Author

Warashi commented Dec 12, 2024

/review

Copy link
Contributor

PR Analysis

Main theme

"Implementation of worktree for repository operations and addition of 'Clean' method for resource cleanup"

PR summary

"This PR introduces the use of git worktrees for operations on repositories in various parts of the application to improve isolation and management of changes. It also includes the implementation of a 'Clean' method to remove temporary resources, effectively cleaning up after worktree usage."

Type of PR

"Enhancement"

PR Feedback:

General suggestions

The use of worktrees is a solid approach to handle operations on repositories in an isolated environment. This prevents accidental impact on the main repository and provides a clean way to manage changes for detection, event watching, and other tasks requiring repo modifications. Including the 'Clean' method enhances resource management by ensuring temporary worktrees are correctly removed after use. This avoids potential issues with lingering resources and maintains a clean working environment.

Code feedback

- relevant file: "pkg/app/piped/driftdetector/cloudrun/detector.go"
  suggestion: |-
    Add error handling for the `repo.Clean()` call to ensure any issues during worktree cleanup are captured and handled appropriately. This should prevent unnoticed errors and allow for proper logging or retry mechanisms if needed. 
    ```go
    defer func() {
      if err := repo.Clean(); err != nil {
        // Log error or handle it accordingly
      }
    }()
    ```
    (important)
  relevant line: "+			defer repo.Clean()"

- relevant file: "pkg/git/client.go"
  suggestion: |-
    It is important to carefully manage the transition to using the `--filter=blob:none` option with `git clone`. This can change the behavior for users who may not expect a partial clone. Ensure that there is proper documentation and that it is applicable to all use cases where the client is used.
    Additionally, assess if this change might require a cleanup or migration step for existing clients or stored repositories where full clones were expected. Communicate this change with users to manage expectations.
    (medium)
  relevant line: "+				args := []string{"clone", "--mirror", "--filter=blob:none", remote, repoCachePath}"

Security concerns:

"no"
The code changes do not seem to introduce obvious security concerns. The enhancement focuses on repository management and cleanup, without altering the security posture of the application.

@Warashi Warashi changed the title Use git worktree to reduce file io Use git partial clone and worktree to reduce network/file io Dec 12, 2024
Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
// It is a separate checkout of the repository.
// It is used to make changes to the repository without affecting the main repository.
// Worktree always does checkout with the detached HEAD, so it doesn't affect the main repository.
type Worktree interface {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to restrict operation at worktree.

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
@Warashi Warashi marked this pull request as ready for review December 12, 2024 07:04
Comment on lines 194 to 199
// To avoid the git operation on copied source.
// TODO: do not copy the .git directory to improve the performance.
if err := os.RemoveAll(filepath.Join(dest, ".git")); err != nil {
fmt.Fprintf(lw, "Unable to remove the .git directory from the copied deploy source (%v)\n", err)
return nil, err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using a tar trick to avoid .git copy and remove. I think tar is available everywhere and fast enough compared to cp

tar -cpP --exclude=.git -C "$source_dir" . | tar -xpP -C "$target_dir"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'll try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khanhtc1202

We use the alpine as the base image for piped, so I checked the tar command in the piped image.
First, the tar command exists. But, the options differ from the GNU/BSD tar.
For example, short flags -p and -P do not exist.
IMO, it's not good to depend on this tar behavior because users may run the piped on a non-alpine-based OS.
WDYT?

Note: I checked tar --help in the piped container image. I paste the result below.

$ docker run --rm -it --entrypoint sh  ghcr.io/pipe-cd/piped:v0.50.0-23-g8f23da4@sha256:88ec60ebf2b7e3131ff38b1af4a043117ef688bc636ddd95afd6bd368293651b
/ $ tar --help
BusyBox v1.36.1 (2024-06-10 07:11:47 UTC) multi-call binary.

Usage: tar c|x|t [-ZzJjahmvokO] [-f TARFILE] [-C DIR] [-T FILE] [-X FILE] [LONGOPT]... [FILE]...

Create, extract, or list files from a tar file

        c       Create
        x       Extract
        t       List
        -f FILE Name of TARFILE ('-' for stdin/out)
        -C DIR  Change to DIR before operation
        -v      Verbose
        -O      Extract to stdout
        -m      Don't restore mtime
        -o      Don't restore user:group
        -k      Don't replace existing files
        -Z      (De)compress using compress
        -z      (De)compress using gzip
        -J      (De)compress using xz
        -j      (De)compress using bzip2
        --lzma  (De)compress using lzma
        -a      (De)compress based on extension
        -h      Follow symlinks
        -T FILE File with names to include
        -X FILE File with glob patterns to exclude
        --exclude PATTERN       Glob pattern to exclude
        --overwrite             Replace existing files
        --strip-components NUM  NUM of leading components to strip
        --no-recursion          Don't descend in directories
        --numeric-owner         Use numeric user:group
        --no-same-permissions   Don't restore access permissions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the version without -P, the more universal solution

tar --exclude='.git' -cf - source_dir | tar -xf - -C destination_dir

@Warashi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OKay, thank you for providing this. I'll use this script.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried some tar options and finally reached this commit.
I tested it with busybox's tar (the same as alpine's) and GNU tar.
c5216c9

@khanhtc1202
Copy link
Member

Overall LGTM. Just left a nits, want to give it a shot 👍

…rformance

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
@Warashi Warashi requested a review from khanhtc1202 December 12, 2024 09:08
Copy link
Member

@khanhtc1202 khanhtc1202 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGreatTM 💯
This could be the patch we're looking for 🥇

@@ -263,6 +263,8 @@ func (d *detector) loadHeadServiceManifest(app *model.Application, repo git.Repo
if err != nil {
return provider.ServiceManifest{}, fmt.Errorf("failed to copy the cloned git repository (%w)", err)
}
defer repo.Clean()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Warashi
it seems that the dir is removed after executing the loadHeadServiceManifest
So we might not call repo.Clean(). WDYT?

		// We have to copy repository into another directory because
		// decrypting the sealed secrets or attaching files might change the git repository.
		if attachmentUsed || encryptionUsed {
			dir, err := os.MkdirTemp("", "detector-git-processing")
			if err != nil {
				return nil, fmt.Errorf("failed to prepare a temporary directory for git repository (%w)", err)
			}
			defer os.RemoveAll(dir)

			repo, err = repo.Copy(filepath.Join(dir, "repo"))
			if err != nil {
				return nil, fmt.Errorf("failed to copy the cloned git repository (%w)", err)
			}
			defer repo.Clean()

			repoDir = repo.GetPath()
			appDir = filepath.Join(repoDir, app.GitPath.Path)
		}

https://github.com/pipe-cd/pipecd/blob/c5216c91bffc906f57b30133f3d0e557f1b1888f/pkg/app/piped/driftdetector/kubernetes/detector.go#L270C1-L281C22

Copy link
Contributor Author

@Warashi Warashi Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ffjlabo

I think we should call repo.Clean() here because the 'repo.Clean()' not only does the directory deletion.
IMO, we should call some methods like Clean() or Close() everywhere because the caller doesn't know what happens with such methods. The caller only knows the cleanup methods exist.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Warashi

I think we should call repo.Clean() here because the 'repo.Clean()' not only does the directory deletion.

I got it. The git worktree needs additional operation other than deleting the directory to keep the consistency of the worktree as you said.

IMO, we should call some methods like Clean() or Close() everywhere because the caller doesn't know what happens with such methods. The caller only knows the cleanup methods exist.

Also I agree with it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little off topic, but the detector reuses the one repo. loadHeadServiceManifest also uses it as an arg. So I feel a little bit confused about cleaning at this point.

But I noticed that this section is used temporarily and different from the actual repo passed as arg. 🙏

Comment on lines +38 to +39
Copy(dest string) (Worktree, error)
CopyToModify(dest string) (Repo, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMO] How about using the Worktree and Copy?

	Worktree(dest string) (Worktree, error)
	Copy(dest string) (Repo, error)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds clear, but I want to leave here as-is.
I want to clear when to use the CopyToModify method with its naming.
Almost all of implementation, we don't need modification of git commits/histories, so we should not use CopyToModify, and we can know from its naming.
If we rename them as Worktree and Copy, some implementation may call Copy method mistakenly even if it's enough to call the Worktree method.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost all of implementation, we don't need modification of git commits/histories, so we should not use CopyToModify, and we can know from its naming.

I got your point! I agree with it.

Copy link
Member

@ffjlabo ffjlabo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Warashi Warashi merged commit fd58db0 into master Dec 13, 2024
17 of 18 checks passed
@Warashi Warashi deleted the git-worktree-and-partial-clone branch December 13, 2024 03:27
github-actions bot pushed a commit that referenced this pull request Dec 13, 2024
* Use git worktree and partial clone to reduce network io

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* Add defer statement to clean up cloned git repositories in detector implementations

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* Change the repo.Copy to use worktree and implement CopyToModify

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* Add tests for Copy and CopyToModify methods in repo

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* repo.Copy and related methods updated to use git.Worktree instead of git.Repo

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* MockRepo.Copy method updated to return git.Worktree instead of git.Repo

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* Update CopyToModify method to clone repository using git clone command

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* Test: Update TestCopy to use repo.Copy instead of CopyToModify

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* Fix comment in CopyToModify to clarify remote URL setting after local cloning

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* Fetch the latest changes from remote after local cloning

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* Remove .git directory from copied deploy source to avoid the git ops

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* Update TestCopyToModify to use a mock remote directory for testing

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

* Copy deploy source using tar to exclude .git directory and improve performance

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>

---------

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
Signed-off-by: pipecd-bot <pipecd.dev@gmail.com>
@github-actions github-actions bot mentioned this pull request Dec 13, 2024
Warashi added a commit that referenced this pull request Dec 13, 2024
…5419)

* Use git worktree and partial clone to reduce network io



* Add defer statement to clean up cloned git repositories in detector implementations



* Change the repo.Copy to use worktree and implement CopyToModify



* Add tests for Copy and CopyToModify methods in repo



* repo.Copy and related methods updated to use git.Worktree instead of git.Repo



* MockRepo.Copy method updated to return git.Worktree instead of git.Repo



* Update CopyToModify method to clone repository using git clone command



* Test: Update TestCopy to use repo.Copy instead of CopyToModify



* Fix comment in CopyToModify to clarify remote URL setting after local cloning



* Fetch the latest changes from remote after local cloning



* Remove .git directory from copied deploy source to avoid the git ops



* Update TestCopyToModify to use a mock remote directory for testing



* Copy deploy source using tar to exclude .git directory and improve performance



---------

Signed-off-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
Signed-off-by: pipecd-bot <pipecd.dev@gmail.com>
Co-authored-by: Shinnosuke Sawada-Dazai <shin@warashi.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants