[Storage][Spot] Resolving failure to exclude local files for managed spot jobs using git --ignored outputs #2018

Merged

Changes from 23 of 34 commits.

Commits (all by landscapepainter):
b87ed7e  update excluding commands (May 7, 2023)
fd4ccbb  exclude from .git/info/exclude (May 7, 2023)
6492b98  added test in test_smoke.py (May 9, 2023)
8879355  test-fix indent (May 9, 2023)
40af8ed  format (May 10, 2023)
45dc8ad  remove unused (May 10, 2023)
2db6df3  fix (May 10, 2023)
9eee64f  remove unused (May 10, 2023)
8b925ae  robust formatting (May 17, 2023)
a0795bc  update for [!01] and ? for gsutil (May 17, 2023)
be85cd1  nit (May 17, 2023)
436bfd0  add robust test case for gitignore sync (May 22, 2023)
e478c1e  updating the test (May 24, 2023)
9aca4fa  test update (May 25, 2023)
d8283e5  nit (May 25, 2023)
c95a2ef  .gitignore exclusion filter using rsync (May 27, 2023)
36a845f  Merge branch 'master' into managed_spot_exclude_fix_rsync (May 27, 2023)
5315a72  update test_smoke (May 28, 2023)
5919521  format (May 28, 2023)
3de671c  nit (May 28, 2023)
4f18304  nit (May 28, 2023)
0f7ab57  nit (May 28, 2023)
ddc23b5  gitignore filter using git (Jun 3, 2023)
c79f7f2  Update sky/data/storage.py (Jun 9, 2023)
96ed4c7  nti updates (Jun 9, 2023)
26596ef  update comments (Jun 10, 2023)
da64fad  updates (Jun 11, 2023)
d73cf40  test_smoke update (Jun 11, 2023)
f624f91  format (Jun 11, 2023)
de85068  Merge branch 'master' into managed_spot_exclude_fix_git (Jul 12, 2023)
4a6e435  Update sky/data/storage_utils.py (Jul 26, 2023)
f520aa9  Merge branch 'master' into managed_spot_exlcude_fix_git (Jul 26, 2023)
b08cf44  Merge branch 'master' and update try except block (Jul 26, 2023)
5a6ca6c  format (Jul 26, 2023)
104 changes: 90 additions & 14 deletions sky/data/storage.py
@@ -873,6 +873,75 @@ def add_if_not_none(key: str, value: Optional[Any]):
    return config


def format_gitignore_to_exclude_list(src_dir_path: str) -> List[str]:
    """Returns a list of formatted excluded files read from .gitignore and
    .git/info/exclude using git. 'git init' is run when SRC_DIR_PATH
    is not a git repository, and its artifacts are removed after the
    excluded list is obtained.
    """
    expand_src_dir_path = os.path.expanduser(src_dir_path)

    git_exclude_path = os.path.join(expand_src_dir_path, '.git/info/exclude')
    gitignore_path = os.path.join(expand_src_dir_path, '.gitignore')

    git_exclude_exists = os.path.isfile(git_exclude_path)
    gitignore_exists = os.path.isfile(gitignore_path)

    # This command outputs a list of files to be excluded according to
    # .gitignore and .git/info/exclude.
    filter_cmd = f'git -C {expand_src_dir_path} status --ignored -s'
    excluded_list: List[str] = ['.git/*', '.gitignore']
Collaborator: We may not want to exclude .gitignore by default, since that is often a part of the repository.

Author (landscapepainter): Cool. Why do we not apply the same logic to .git/*? .git/* was excluded by default even before this update.

    # pylint: disable=W1510
Collaborator: Is there a strong reason to disable W1510? Instead, should we add check=True for L907, L911, L925 and L930? It would be good to explicitly handle any errors that may occur when running them.

Author (landscapepainter): Originally, I thought the commands were causing inevitable errors, but found that the code actually had some issues. They are now fixed, and # pylint: disable=W1510 is removed as well.

    if git_exclude_exists or gitignore_exists:
        try:
            output = subprocess.run(filter_cmd,
                                    shell=True,
                                    stdout=subprocess.PIPE,
                                    stderr=subprocess.PIPE,
                                    check=True,
                                    text=True)
        except subprocess.CalledProcessError as e:
            # Raised when SRC_DIR_PATH is not a git repo.
            if e.returncode == exceptions.GIT_UNINITIALIZED_CODE:
                init_cmd = f'git -C {expand_src_dir_path} init'
Collaborator: Note, no action needed:

    Running git init in an existing repository is safe. It will not
    overwrite things that are already there. The primary reason for
    rerunning git init is to pick up newly added templates (or to move the
    repository to another place if --separate-git-dir is given).

                subprocess.run(init_cmd,
Collaborator: This may fail for different reasons (e.g., the user has read-only permissions in the dir). In that case, can we raise a warning and gracefully fall back to not excluding any files from .gitignore?

Author (landscapepainter): Will print a logger.warning() message and return the default excluded_list with .git/* in it.

                               shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
                output = subprocess.run(filter_cmd,
                                        shell=True,
                                        stdout=subprocess.PIPE,
                                        stderr=subprocess.PIPE,
                                        text=True)
                if git_exclude_exists:
Collaborator: Instead of checking if the git exclude file already existed, can we instead check if the .git directory already existed? If it did, we should leave it untouched (i.e., do not remove any files or directories). This is to avoid inadvertently deleting anything we shouldn't be deleting.

Author (landscapepainter, Jun 9, 2023): If .git already existed, this if git_exclude_exists: statement will not be entered. The outer if e.returncode == exceptions.GIT_UNINITIALIZED_CODE: is entered only when .git didn't exist in this directory. I'll leave a more detailed comment to note that this exception is caught only when .git didn't exist (i.e., not a git repo):

        except subprocess.CalledProcessError as e:
            # when the SRC_DIR_PATH is not a git repo and .git
            # does not exist in it
            if e.returncode == exceptions.GIT_UNINITIALIZED_CODE:

                    # Removes all the files/dirs created with 'git init'
                    # under .git/, except .git/info/exclude.
                    remove_files_cmd = (f'find {expand_src_dir_path}/.git '
                                        f'-path {git_exclude_path} -prune -o '
                                        '-type f -exec rm -f {} +')
                    remove_dirs_cmd = (f'find {expand_src_dir_path}/.git '
                                       f'-path {git_exclude_path} -prune -o'
                                       ' -type d -empty -delete')
                    subprocess.run(remove_files_cmd,
                                   shell=True,
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE)
                    subprocess.run(remove_dirs_cmd,
                                   shell=True,
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE)

        output_list = output.stdout.split('\n')
        for line in output_list:
            if line.startswith('!!'):
                to_be_excluded = line[3:]
                if line.endswith('/'):
                    to_be_excluded += '*'
                excluded_list.append(to_be_excluded)
    # pylint: enable=W1510
    return excluded_list
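For illustration, a minimal sketch of how the loop above turns 'git status --ignored -s' output into exclude entries (the paths here are hypothetical, not from this PR):

    # Hypothetical output of `git -C <src_dir> status --ignored -s`:
    sample_output = '!! debug.log\n!! excluded_dir/\n M tracked.py\n'
    excluded = []
    for line in sample_output.split('\n'):
        if line.startswith('!!'):   # '!!' marks an ignored path
            entry = line[3:]        # strip the '!! ' prefix
            if line.endswith('/'):  # ignored dirs get a trailing '*'
                entry += '*'        # so the sync CLIs match their contents
            excluded.append(entry)
    # excluded == ['debug.log', 'excluded_dir/*']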


class S3Store(AbstractStore):
"""S3Store inherits from Storage Object and represents the backend
for S3 buckets.
@@ -1075,10 +1144,12 @@ def get_file_sync_command(base_dir_path, file_names):

        def get_dir_sync_command(src_dir_path, dest_dir_name):
            # we exclude .git directory from the sync
-           sync_command = (
-               'aws s3 sync --no-follow-symlinks --exclude ".git/*" '
-               f'{src_dir_path} '
-               f's3://{self.name}/{dest_dir_name}')
+           excluded_list = format_gitignore_to_exclude_list(src_dir_path)
+           excludes = ' '.join(
+               [f'--exclude "{file_name}"' for file_name in excluded_list])
+           sync_command = (f'aws s3 sync --no-follow-symlinks {excludes} '
+                           f'{src_dir_path} '
+                           f's3://{self.name}/{dest_dir_name}')
            return sync_command

# Generate message for upload
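For illustration only (hypothetical bucket name and exclude list, not from this PR), the generated command would look roughly like:

    aws s3 sync --no-follow-symlinks --exclude ".git/*" --exclude ".gitignore" --exclude "excluded_dir/*" ~/task_dir s3://sky-test-bucket/task_dir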
@@ -1498,8 +1569,10 @@ def get_file_sync_command(base_dir_path, file_names):

        def get_dir_sync_command(src_dir_path, dest_dir_name):
            # we exclude .git directory from the sync
-           sync_command = (f'gsutil -m rsync -r -x \'.git/*\' {src_dir_path} '
-                           f'gs://{self.name}/{dest_dir_name}')
+           excluded_list = format_gitignore_to_exclude_list(src_dir_path)
+           excludes = '\'(' + '|'.join(excluded_list) + ')\''
+           sync_command = (f'gsutil -m rsync -r -x {excludes} {src_dir_path}'
+                           f' gs://{self.name}/{dest_dir_name}')
            return sync_command

# Generate message for upload
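Again for illustration with the same hypothetical exclude list; note that gsutil -x takes a single '|'-joined pattern rather than repeated flags:

    gsutil -m rsync -r -x '(.git/*|.gitignore|excluded_dir/*)' ~/task_dir gs://sky-test-bucket/task_dir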
@@ -1821,15 +1894,18 @@ def get_file_sync_command(base_dir_path, file_names):

        def get_dir_sync_command(src_dir_path, dest_dir_name):
            # we exclude .git directory from the sync
+           excluded_list = format_gitignore_to_exclude_list(src_dir_path)
+           excludes = ' '.join(
+               [f'--exclude "{file_name}"' for file_name in excluded_list])
            endpoint_url = cloudflare.create_endpoint()
-           sync_command = (
-               'AWS_SHARED_CREDENTIALS_FILE='
-               f'{cloudflare.R2_CREDENTIALS_PATH} '
-               'aws s3 sync --no-follow-symlinks --exclude ".git/*" '
-               f'{src_dir_path} '
-               f's3://{self.name}/{dest_dir_name} '
-               f'--endpoint {endpoint_url} '
-               f'--profile={cloudflare.R2_PROFILE_NAME}')
+           sync_command = ('AWS_SHARED_CREDENTIALS_FILE='
+                           f'{cloudflare.R2_CREDENTIALS_PATH} '
+                           f'aws s3 sync --no-follow-symlinks {excludes} '
+                           f'{src_dir_path} '
+                           f's3://{self.name}/{dest_dir_name} '
+                           f'--endpoint {endpoint_url} '
+                           f'--profile={cloudflare.R2_PROFILE_NAME}')
            return sync_command

# Generate message for upload
2 changes: 2 additions & 0 deletions sky/exceptions.py
@@ -12,6 +12,8 @@
RSYNC_FILE_NOT_FOUND_CODE = 23
# Arbitrarily chosen value. Used in SkyPilot's storage mounting scripts
MOUNT_PATH_NON_EMPTY_CODE = 42
# Return code when a git command is run in a dir that is not a git repo
GIT_UNINITIALIZED_CODE = 128
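For reference, this matches git's own behavior, which is easy to verify locally (exact message wording may vary by git version):

    $ git -C /tmp/not-a-repo status --ignored -s
    fatal: not a git repository (or any of the parent directories): .git
    $ echo $?
    128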


class ResourcesUnavailableError(Exception):
8 changes: 8 additions & 0 deletions tests/git_info_exclude_test
@@ -0,0 +1,8 @@
question_mark/excluded?.txt
square_bracket/excluded[0-9].txt
square_bracket_single/excluded[01].txt
square_bracket_excla/excluded[!01].txt
square_bracket_alpha/excluded[a-z].txt
excluded_dir/
nested_double_asterisk/**/also_exclude.txt
nested_wildcard_dir/*day/also_exclude.txt
11 changes: 11 additions & 0 deletions tests/gitignore_test
@@ -0,0 +1,11 @@
# This test should upload only include.txt, included.log, and included/included.l$
**/double_asterisk_excluded
**/double_asterisk_excluded_dir
**/parent/*.txt
**/parent/child/*.txt
*.log
exp-*/
!included.log
excluded.*
/front_slash_excluded
no_slash_excluded
199 changes: 199 additions & 0 deletions tests/test_smoke.py
@@ -25,6 +25,7 @@
import hashlib
import inspect
import pathlib
import shutil
import subprocess
import sys
import tempfile
@@ -1911,6 +1912,96 @@ class TestStorageWithCredentials:
        'abc_',  # ends with an underscore
    ]

    GITIGNORE_SYNC_TEST_DIR_STRUCTURE = {
        'double_asterisk': {
            'double_asterisk_excluded': None,
            'double_asterisk_excluded_dir': {
                'dir_excluded': None,
            },
        },
        'double_asterisk_parent': {
            'parent': {
                'also_excluded.txt': None,
                'child': {
                    'double_asterisk_parent_child_excluded.txt': None,
                },
                'double_asterisk_parent_excluded.txt': None,
            },
        },
        'excluded.log': None,
        'excluded_dir': {
            'excluded.txt': None,
            'nested_excluded': {
                'excluded': None,
            },
        },
        'exp-1': {
            'be_excluded': None,
        },
        'exp-2': {
            'be_excluded': None,
        },
        'front_slash_excluded': None,
        'included.log': None,
        'included.txt': None,
        'include_dir': {
            'excluded.log': None,
            'included.log': None,
        },
        'nested_double_asterisk': {
            'one': {
                'also_exclude.txt': None,
            },
            'two': {
                'also_exclude.txt': None,
            },
        },
        'nested_wildcard_dir': {
            'monday': {
                'also_exclude.txt': None,
            },
            'tuesday': {
                'also_exclude.txt': None,
            },
        },
        'no_slash_excluded': None,
        'no_slash_tests': {
            'no_slash_excluded': {
                'also_excluded.txt': None,
            },
        },
        'question_mark': {
            'excluded1.txt': None,
            'excluded@.txt': None,
        },
        'square_bracket': {
            'excluded1.txt': None,
        },
        'square_bracket_alpha': {
            'excludedz.txt': None,
        },
        'square_bracket_excla': {
            'excluded2.txt': None,
            'excluded@.txt': None,
        },
        'square_bracket_single': {
            'excluded0.txt': None,
        },
    }

    @staticmethod
    def create_dir_structure(base_path, structure):
        for name, substructure in structure.items():
            path = os.path.join(base_path, name)
            if substructure is None:
                # Create a file
                open(path, 'a').close()
            else:
                # Create a subdirectory
                os.mkdir(path)
                TestStorageWithCredentials.create_dir_structure(
                    path, substructure)
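A minimal usage sketch of the helper above (hypothetical temporary directory, mirroring how the test below drives it):

    import tempfile
    with tempfile.TemporaryDirectory() as tmpdir:
        TestStorageWithCredentials.create_dir_structure(
            tmpdir, TestStorageWithCredentials.GITIGNORE_SYNC_TEST_DIR_STRUCTURE)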

    @staticmethod
    def cli_delete_cmd(store_type, bucket_name):
        if store_type == storage_lib.StoreType.S3:
@@ -1946,6 +2037,35 @@ def cli_ls_cmd(store_type, bucket_name, suffix=''):
            url = f's3://{bucket_name}'
            return f'AWS_SHARED_CREDENTIALS_FILE={cloudflare.R2_CREDENTIALS_PATH} aws s3 ls {url} --endpoint {endpoint_url} --profile=r2'

    @staticmethod
    def cli_count_name_in_bucket(store_type, bucket_name, file_name, suffix=''):
        if store_type == storage_lib.StoreType.S3:
            if suffix:
                return f'aws s3api list-objects --bucket "{bucket_name}" --prefix {suffix} --query "length(Contents[?contains(Key,\'{file_name}\')].Key)"'
            else:
                return f'aws s3api list-objects --bucket "{bucket_name}" --query "length(Contents[?contains(Key,\'{file_name}\')].Key)"'
        elif store_type == storage_lib.StoreType.GCS:
            if suffix:
                return f'gsutil ls gs://{bucket_name}/{suffix} | grep "{file_name}" | wc -l'
            else:
                return f'gsutil ls gs://{bucket_name} | grep "{file_name}" | wc -l'
        elif store_type == storage_lib.StoreType.R2:
            endpoint_url = cloudflare.create_endpoint()
            if suffix:
                return f'AWS_SHARED_CREDENTIALS_FILE={cloudflare.R2_CREDENTIALS_PATH} aws s3api list-objects --bucket "{bucket_name}" --prefix {suffix} --query "length(Contents[?contains(Key,\'{file_name}\')].Key)" --endpoint {endpoint_url} --profile=r2'
            else:
                return f'AWS_SHARED_CREDENTIALS_FILE={cloudflare.R2_CREDENTIALS_PATH} aws s3api list-objects --bucket "{bucket_name}" --query "length(Contents[?contains(Key,\'{file_name}\')].Key)" --endpoint {endpoint_url} --profile=r2'

    @staticmethod
    def cli_count_file_in_bucket(store_type, bucket_name):
        if store_type == storage_lib.StoreType.S3:
            return f'aws s3 ls s3://{bucket_name} --recursive | wc -l'
        elif store_type == storage_lib.StoreType.GCS:
            return f'gsutil ls -r gs://{bucket_name}/** | wc -l'
        elif store_type == storage_lib.StoreType.R2:
            endpoint_url = cloudflare.create_endpoint()
            return f'AWS_SHARED_CREDENTIALS_FILE={cloudflare.R2_CREDENTIALS_PATH} aws s3 ls s3://{bucket_name} --recursive --endpoint {endpoint_url} --profile=r2 | wc -l'

    @pytest.fixture
    def tmp_source(self, tmp_path):
        # Creates a temporary directory with a file in it
@@ -2291,6 +2411,85 @@ def test_invalid_names(self, invalid_name_list, store_type):
            storage_obj = storage_lib.Storage(name=name)
            storage_obj.add_store(store_type)

    @pytest.mark.parametrize(
        'gitignore_structure, store_type',
        [(GITIGNORE_SYNC_TEST_DIR_STRUCTURE, storage_lib.StoreType.S3),
         (GITIGNORE_SYNC_TEST_DIR_STRUCTURE, storage_lib.StoreType.GCS),
         pytest.param(GITIGNORE_SYNC_TEST_DIR_STRUCTURE,
                      storage_lib.StoreType.R2,
                      marks=pytest.mark.cloudflare)])
    def test_excluded_file_cloud_storage_upload_copy(self, gitignore_structure,
                                                     store_type):
        # Tests that files listed in .gitignore and .git/info/exclude are
        # excluded from being transferred to Storage.

        with tempfile.TemporaryDirectory() as tmpdir:
            # Create files to be uploaded in the Storage
            self.create_dir_structure(tmpdir, gitignore_structure)

            # Create files to be excluded and list them in .gitignore
            skypilot_path = os.path.abspath(sys.argv[0])
            skypilot_path = os.path.dirname(skypilot_path)
            temp_path = f'{tmpdir}/.gitignore'
            file_path = os.path.join(skypilot_path, 'tests/gitignore_test')
            shutil.copyfile(file_path, temp_path)

            # Create files to be excluded and list them in .git/info/exclude
            temp_path = f'{tmpdir}/.git/info/'
            os.makedirs(temp_path)
            temp_exclude_path = os.path.join(temp_path, 'exclude')
            file_path = os.path.join(skypilot_path,
                                     'tests/git_info_exclude_test')
            shutil.copyfile(file_path, temp_exclude_path)

            # Create sky Storage with the files created
            timestamp = str(time.time()).replace('.', '')
            bucket_name = f'sky-test-{timestamp}'
            store_obj = storage_lib.Storage(name=bucket_name,
                                            source=tmpdir,
                                            mode=storage_lib.StorageMode.COPY)
            store_obj.add_store(store_type)

            upload_file_name = 'included'
            # Count the number of files with the given file name
            up_cmd = self.cli_count_name_in_bucket(
                store_type, bucket_name, file_name=upload_file_name)
            gitignore_cmd = self.cli_count_name_in_bucket(
                store_type, bucket_name, file_name='.gitignore')
            git_exclude_cmd = self.cli_count_name_in_bucket(
                store_type, bucket_name, file_name='.git')
            cnt_num_file_cmd = self.cli_count_file_in_bucket(
                store_type, bucket_name)

            up_output = subprocess.check_output(up_cmd, shell=True)
            gitignore_output = subprocess.check_output(gitignore_cmd,
                                                       shell=True)
            git_exclude_output = subprocess.check_output(git_exclude_cmd,
                                                         shell=True)
            cnt_output = subprocess.check_output(cnt_num_file_cmd, shell=True)

            if store_type == storage_lib.StoreType.GCS:
                up_dir_cmd = self.cli_count_name_in_bucket(
                    store_type, bucket_name, file_name=upload_file_name,
                    suffix='include_dir')
                up_dir_output = subprocess.check_output(up_dir_cmd,
                                                        shell=True)

            # GCS ls and aws s3 list-objects format their output differently.
            # Only 'included.*' files should exist in the cloud object storage.
Collaborator: I don't fully understand this comment and why separate handling for GCS is required. Can you include an example of why this is needed?

Author (landscapepainter): Updated cli_count_name_in_bucket for GCS by adding the -r option to the command so that GCS doesn't need separate handling. The issue was that the original gsutil ls command was not counting objects in subdirectories (e.g., include_dir/included.log), so I was handling them separately, but this was fixed with the -r option.

            if store_type == storage_lib.StoreType.GCS:
                assert '2' in up_output.decode('utf-8'), \
                    f'Files to be uploaded are not uploaded: {bucket_name}'
                assert '1' in up_dir_output.decode('utf-8'), \
                    f'File to be uploaded in include_dir is not uploaded: {bucket_name}'
            else:
                assert '3' in up_output.decode('utf-8'), \
                    f'{upload_file_name} is not uploaded: {bucket_name}'
            assert '0' in gitignore_output.decode('utf-8'), \
                '.gitignore file should not be uploaded'
            assert '0' in git_exclude_output.decode('utf-8'), \
                '.git directory should not be uploaded'
            ls_cmd = self.cli_ls_cmd(store_type, bucket_name)
            assert '3' in cnt_output.decode('utf-8'), \
                f'Some items listed in .gitignore and .git/info/exclude are not ignored. Check with {ls_cmd}'


# ---------- Testing YAML Specs ----------
# Our sky storage requires credentials to check the bucket existence when