-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Storage][Spot] Resolving failure to exclude local files for managed spot jobs using git --ignored outputs #2018
[Storage][Spot] Resolving failure to exclude local files for managed spot jobs using git --ignored outputs #2018
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @landscapepainter! Left some comments. Still reading.
sky/data/storage.py
Outdated
# This command outputs a list to be excluded according to .gitignore | ||
# and .git/info/exclude | ||
filter_cmd = f'git -C {expand_src_dir_path} status --ignored -s' | ||
excluded_list: List[str] = ['.git/*', '.gitignore'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may not want to exclude .gitignore by default, since that is often a part of the repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Why do we not apply the same logic to .git/*
? .git/*
was excluded by default even before this update.
sky/data/storage.py
Outdated
filter_cmd = f'git -C {expand_src_dir_path} status --ignored -s' | ||
excluded_list: List[str] = ['.git/*', '.gitignore'] | ||
|
||
# pylint: disable=W1510 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a strong reason to disable W1510? Instead, should we add check=True
for L907, L911, L925 and L930? Good to explicitly handle any errors that may occur when running them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Origianally, I thought the commands were causing inevitable errors, but found that it actually had some issues. They are currently fixed and removed # pylint: disable=W1510
as well.
sky/data/storage.py
Outdated
# when the SRC_DIR_PATH is not a git repo | ||
if e.returncode == exceptions.GIT_UNINITIALIZED_CODE: | ||
init_cmd = f'git -C {expand_src_dir_path} init' | ||
subprocess.run(init_cmd, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may fail due to different reasons (e.g., user has read-only permissions in the dir). In that case, can we raise a warning and gracefully fall back to not excluding any files from .gitignore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be printing a logger.wanring()
message and return the default excluded_list
with .git/*
in it.
sky/data/storage.py
Outdated
stdout=subprocess.PIPE, | ||
stderr=subprocess.PIPE, | ||
text=True) | ||
if git_exclude_exists: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of checking if git exclude already existed, can we instead check if the .git directory already existed? If it did, we should leave it untouched (i.e., do not remove any files or directories). This is to avoid inadvertently deleting anything we shouldn't be deleting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If .git
alreday existed, this if git_exclude_exists:
statement will not be entered. The outer if e.returncode == exceptions.GIT_UNINITIALIZED_CODE:
would be entered when .git
didn't exist in this directory. I'll leave a more detailed comment to note that this exception is caught only when .git
didn't exist(not a git repo).
except subprocess.CalledProcessError as e:
# when the SRC_DIR_PATH is not a git repo and .git
# does not exist in it
if e.returncode == exceptions.GIT_UNINITIALIZED_CODE:
tests/test_smoke.py
Outdated
up_dir_output = subprocess.check_output(up_dir_cmd, shell=True) | ||
|
||
# GCS ls and aws s3 list-objects outputs in a different way | ||
# Only 'included.*' files should exist in the cloud object storage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand this comment and why separate handling for GCS is required.. can you include some example of why this is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated cli_count_name_in_bucket
for GCS by adding -r
option in the command so that GCS doesn't need separate handling. The issue was, original gsutil ls
command was not counting the objects in the subdirectory, include_dir/included.log
, so I was handling them separately, but this was able to be fixed with -r
option.
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
@romilbhardwaj Ready for another look! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @landscapepainter! Left two comments, should be good to go after that!
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@romilbhardwaj Resolved the comments! Ready for another look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @landscapepainter - this is ready to go!
This PR closes #1898 and opposed to implementing our own
.gitignore
filter in #1939 and usingrsync
from #1993 , this harnessgit -C /path/to/source/dir status --ignored -s
to get a list of files/directories to be excluded from syncingsky storage
.Added the testing structure and the files supposed to be ignored from syncing divided into
git_info_exclude_test
(corresponding to.git/info/exclude
) andgitignore_test
(.gitignore
).Tested (run the relevant ones):
sky spot launch
and checked theworkdir
and localfile_mounts
src from the spot instancepytest tests/test_smoke.py::TestStorageWithCredentials --cloudflare
pytest tests/test_smoke.py --managed-spot
test_smoke.py/test_excluded_file_cloud_storage_upload_copy