-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AWS SageMaker] Add working FSx setup and test #3831
Conversation
components/aws/sagemaker/tests/integration_tests/component_tests/test_train_component.py
Outdated
Show resolved
Hide resolved
components/aws/sagemaker/tests/integration_tests/scripts/run_integration_tests
Outdated
Show resolved
Hide resolved
components/aws/sagemaker/tests/integration_tests/component_tests/test_train_component.py
Outdated
Show resolved
Hide resolved
5a960c9
to
2bffbfe
Compare
1ba716b
to
c7fdedd
Compare
|
||
function delete_fsx_instance() { | ||
if [ ! -z "${FSX_ID}" ]; then | ||
aws fsx delete-file-system --file-system-id "${FSX_ID}" --region "${REGION}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might leak the resource here if this command fails for some reasons...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very real possibility at this point in the code. This is after our error trap, so if this command fails we would have to see it failing in the logs to catch it. We might need to set up some alarm to ensure we don't have too many FSx instances in the account.
@@ -150,11 +164,26 @@ function cleanup_kfp() { | |||
} | |||
|
|||
if [[ -z "${EKS_EXISTING_CLUSTER}" ]]; then | |||
launch_eks | |||
# Launch all of these in parallel to reduce start-up time | |||
EKS_CLUSTER_NAME="${DEPLOY_NAME}-eks-cluster" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did this move?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to launch EKS at the same time as creating FSx because both operations take a long time. However, when we use the &
operator, it starts the shell in a different process and therefore any variables set in that shell are not set in the caller shell. So I bypass this by setting the name of the cluster before calling.
if [[ "${SKIP_FSX_TESTS}" == "false" ]]; then | ||
delete_fsx_instance | ||
# Sleep in order for the security group to detach before attempting to delete it | ||
sleep 15s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a better condition based solution for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not had much of a chance to play around with this. I think maybe once the FSx file system changes status to "Deleting" it would have detached from the VPC and security groups. Not entirely sure, this was a transient error.
@@ -163,6 +192,15 @@ install_kfp | |||
install_generated_role | |||
|
|||
pytest_args=( --region "${REGION}" --role-arn "${SAGEMAKER_EXECUTION_ROLE_ARN}" --s3-data-bucket "${S3_DATA_BUCKET}" --minio-service-port "${MINIO_LOCAL_PORT}" --kfp-namespace "${KFP_NAMESPACE}" ) | |||
|
|||
if [[ "${SKIP_FSX_TESTS}" == "true" ]]; then | |||
pytest_args+=( -m "not fsx_test" ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to be fixed in future, the test should automatically be skipped if fsx-id is not defined rather than explicitly needing to set it
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: surajkota The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Has 2 approvals |
* Add working FSx setup and test * Removed duplicate test function * Replaced failure return with exit * Update parallel methods to export * Update EKS cluster name outside parallel task * Add SKIP_FSX_TEST in buildspec * Add revoke security group ingress * Add default pytest FSx values
* Add working FSx setup and test * Removed duplicate test function * Replaced failure return with exit * Update parallel methods to export * Update EKS cluster name outside parallel task * Add SKIP_FSX_TEST in buildspec * Add revoke security group ingress * Add default pytest FSx values
This pull request sets up a new FSx instance (preloaded with data from the specified S3 bucket). It attaches it to the same private subnet as the EKS cluster and creates a security group for ingress on the FSx TCP port. I have also added a SageMaker test that mounts and uses this as the primary train channel.
Because of the added cost and time of adding FSx to the integration tests, I have also tried to flag off this test with an environment variable
SKIP_FSX_TESTS
. I still need to configure PyTest to ignore the FSx test if this environment variable is present.