Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AWS SageMaker] Add working FSx setup and test #3831

Merged
merged 8 commits into from
May 29, 2020

Conversation

RedbackThomson
Copy link
Contributor

This pull request sets up a new FSx instance (preloaded with data from the specified S3 bucket). It attaches it to the same private subnet as the EKS cluster and creates a security group for ingress on the FSx TCP port. I have also added a SageMaker test that mounts and uses this as the primary train channel.

Because of the added cost and time of adding FSx to the integration tests, I have also tried to flag off this test with an environment variable SKIP_FSX_TESTS. I still need to configure PyTest to ignore the FSx test if this environment variable is present.

@kubeflow-bot
Copy link

This change is Reviewable


function delete_fsx_instance() {
if [ ! -z "${FSX_ID}" ]; then
aws fsx delete-file-system --file-system-id "${FSX_ID}" --region "${REGION}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might leak the resource here if this command fails for some reasons...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very real possibility at this point in the code. This is after our error trap, so if this command fails we would have to see it failing in the logs to catch it. We might need to set up some alarm to ensure we don't have too many FSx instances in the account.

@@ -150,11 +164,26 @@ function cleanup_kfp() {
}

if [[ -z "${EKS_EXISTING_CLUSTER}" ]]; then
launch_eks
# Launch all of these in parallel to reduce start-up time
EKS_CLUSTER_NAME="${DEPLOY_NAME}-eks-cluster"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this move?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to launch EKS at the same time as creating FSx because both operations take a long time. However, when we use the & operator, it starts the shell in a different process and therefore any variables set in that shell are not set in the caller shell. So I bypass this by setting the name of the cluster before calling.

if [[ "${SKIP_FSX_TESTS}" == "false" ]]; then
delete_fsx_instance
# Sleep in order for the security group to detach before attempting to delete it
sleep 15s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a better condition based solution for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not had much of a chance to play around with this. I think maybe once the FSx file system changes status to "Deleting" it would have detached from the VPC and security groups. Not entirely sure, this was a transient error.

@@ -163,6 +192,15 @@ install_kfp
install_generated_role

pytest_args=( --region "${REGION}" --role-arn "${SAGEMAKER_EXECUTION_ROLE_ARN}" --s3-data-bucket "${S3_DATA_BUCKET}" --minio-service-port "${MINIO_LOCAL_PORT}" --kfp-namespace "${KFP_NAMESPACE}" )

if [[ "${SKIP_FSX_TESTS}" == "true" ]]; then
pytest_args+=( -m "not fsx_test" )
Copy link
Contributor

@surajkota surajkota May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be fixed in future, the test should automatically be skipped if fsx-id is not defined rather than explicitly needing to set it

@surajkota
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: surajkota

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@surajkota
Copy link
Contributor

Has 2 approvals
/lgtm

@k8s-ci-robot k8s-ci-robot merged commit 37a6363 into kubeflow:master May 29, 2020
RedbackThomson added a commit to RedbackThomson/pipelines that referenced this pull request Jun 17, 2020
* Add working FSx setup and test

* Removed duplicate test function

* Replaced failure return with exit

* Update parallel methods to export

* Update EKS cluster name outside parallel task

* Add SKIP_FSX_TEST in buildspec

* Add revoke security group ingress

* Add default pytest FSx values
Jeffwan pushed a commit to Jeffwan/pipelines that referenced this pull request Dec 9, 2020
* Add working FSx setup and test

* Removed duplicate test function

* Replaced failure return with exit

* Update parallel methods to export

* Update EKS cluster name outside parallel task

* Add SKIP_FSX_TEST in buildspec

* Add revoke security group ingress

* Add default pytest FSx values
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants