Add base terraform config for batch #2
Conversation
aws/nextflow-compute.tf (outdated):
allocation_strategy = "BEST_FIT"
max_vcpus = 20
min_vcpus = 0
image_id = "ami-0a8857ac38c35157f"
I know you noted where this AMI came from and what it is in the PR description, but I think it'd be good to also note that in a comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I just saw aws/setup-log.md; maybe just mention that doc in a comment here?
Added.
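For reference, the comment could look something like this. This is only a sketch: the resource name and block structure are assumptions, and just the attribute values come from the snippet above.

```hcl
resource "aws_batch_compute_environment" "nf_compute" {
  compute_resources {
    allocation_strategy = "BEST_FIT"
    max_vcpus           = 20
    min_vcpus           = 0

    # Custom AMI built from the ECS-Optimized Amazon Linux AMI, with a
    # larger EBS data volume and aws-cli installed for Nextflow.
    # See aws/setup-log.md for how this image was created.
    image_id = "ami-0a8857ac38c35157f"
    # ... other settings omitted ...
  }
  # ... other settings omitted ...
}
```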
aws/setup-log.md (outdated):
Briefly:
1. We used the base AMI _Amazon ECS-Optimized Amazon Linux AMI_. (note, not Amazon Linux 2).
This was launched with an 80GB EBS volume for data (compared to teh default 22) in addition to the base 8GB boot volume.
Suggested change:
- This was launched with an 80GB EBS volume for data (compared to teh default 22) in addition to the base 8GB boot volume.
+ This was launched with an 80GB EBS volume for data (compared to the default 22) in addition to the base 8GB boot volume.
spot_iam_fleet_role = aws_iam_role.nf_ecs_role.arn
bid_percentage = 20
max_vcpus = 100
min_vcpus = 0
Ok, so if I understand this correctly, you've got a default queue that uses spot instances and a priority queue that uses on-demand instances. They're both going to scale up automatically, the default to a max_vcpus of 100 and the priority to 20. Is that all correct?
That is what is supposed to happen, yes. I may well change these in the future, but this seemed a good place to start.
Ok sweet. A summary like that might also be useful to have in a comment or doc somewhere.
Yeah I was planning to add a readme before making it into a real PR.
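For the record, the arrangement described above could be condensed into something like the following sketch. The resource names and the `type`/role wiring are illustrative assumptions; only the vCPU limits, bid percentage, and fleet role come from the snippets in this conversation.

```hcl
# Default environment: spot instances, scaling between 0 and 100 vCPUs.
resource "aws_batch_compute_environment" "default" {
  compute_resources {
    type                = "SPOT"
    bid_percentage      = 20
    max_vcpus           = 100
    min_vcpus           = 0
    spot_iam_fleet_role = aws_iam_role.nf_ecs_role.arn
    # ... other settings omitted ...
  }
  # ... other settings omitted ...
}

# Priority environment: on-demand instances, scaling between 0 and 20 vCPUs.
resource "aws_batch_compute_environment" "priority" {
  compute_resources {
    type      = "EC2"
    max_vcpus = 20
    min_vcpus = 0
    # ... other settings omitted ...
  }
  # ... other settings omitted ...
}
```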
"s3:GetBucketCORS",
"s3:GetBucketLocation",
"s3:ReplicateDelete",
"s3:GetObjectVersion"
Is this everything? I think `s3:*` might be equivalent.
I don't know? It's probably nearly equivalent, but this was one that I copied from the CloudFormation setting, so I didn't want to mess around too much. I doubt changing it to `s3:*` would be a problem.
Word, if you got this from there then it's probably best to leave it as is.
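If the enumerated actions were ever collapsed, the policy statement would just use the wildcard. A sketch is below; note that `s3:*` grants every S3 action, so it is a superset of the copied list rather than strictly equivalent, and the data source name here is invented (the bucket names come from the PR description):

```hcl
data "aws_iam_policy_document" "nf_s3_rw" {
  statement {
    # s3:* covers every S3 action, a superset of the enumerated list.
    actions = ["s3:*"]
    resources = [
      "arn:aws:s3:::nextflow-ccdl-data",
      "arn:aws:s3:::nextflow-ccdl-data/*",
    ]
  }
}
```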
resource "aws_subnet" "nf_subnet" {
  vpc_id = aws_vpc.nf_vpc.id
  cidr_block = "10.1.1.0/24"
Are you planning on adding additional subnets? If not, you may just want to have this match the VPC's CIDR block.
I don't expect so? This is where I was completely lost, and I was really just going off examples. I don't know what the function of multiple subnets would be, but it seemed I needed to make at least one. The CloudFormation-created VPC definitely made more than one, but I don't know why, or how that might be used.
Ok, then this should be fine. It shouldn't really matter much anyway.
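The single-subnet variant being suggested would just reuse the VPC's block. A sketch, where the VPC CIDR is an assumption (the snippet above only shows the subnet's /24):

```hcl
resource "aws_vpc" "nf_vpc" {
  cidr_block = "10.1.0.0/16" # assumed; only the subnet CIDR appears in the PR
}

# A single subnet can span the whole VPC range; multiple subnets are
# mainly useful for spreading resources across availability zones,
# which is likely why the CloudFormation stack created several.
resource "aws_subnet" "nf_subnet" {
  vpc_id     = aws_vpc.nf_vpc.id
  cidr_block = aws_vpc.nf_vpc.cidr_block
}
```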
Ok, this all looks pretty good to me. You'll definitely want to test it; it's hard to eyeball all the things that could be wrong, but there's nothing major as far as I can tell.
That is good to hear. What is your recommended test procedure? I have been making sure the syntax, etc. works with
and fix typo
Yep, it's pretty good about that.
There is no step other than maybe
Ok, I added some untested terraform code. It shouldn't work just yet; you'll need to add the public key of your ssh key into the keypair resource definition. Also, https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/batch_compute_environment#ec2_key_pair didn't specify what attribute of the key_pair it wanted, so maybe you'll need to switch that to id (normal instances use the name, though). Once this is set up, you'll be able to ssh in. If any of that gives you any trouble, lemme know!
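A sketch of the key pair wiring being described. The key name and placeholder public key are invented; per the comment above, the linked docs don't say which attribute `ec2_key_pair` expects, so starting with the name (as plain EC2 instances use) seems reasonable:

```hcl
resource "aws_key_pair" "nf_key" {
  key_name   = "nf-batch-key"
  public_key = "ssh-rsa AAAA... user@host" # paste your own public key here
}

resource "aws_batch_compute_environment" "nf_compute" {
  compute_resources {
    # The registry docs don't specify which attribute ec2_key_pair wants;
    # key_name works for normal instances, so try that first and fall
    # back to .id if it fails.
    ec2_key_pair = aws_key_pair.nf_key.key_name
    # ... other settings omitted ...
  }
  # ... other settings omitted ...
}
```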
After Kurt successfully got things working by adding in the missing gateway settings (who knew? Kurt), I have gone back and tightened things up. I left in the sections that would allow SSH access to nodes, but that access is currently disabled: there is no public key passed to the nodes, and port 22 is closed. I have also added more comments, along with an example.
👍 I think these docs are enough to get started, along with the custom AMI section of the Nextflow docs, but we should keep an eye on how to refine them over time!
The spot price threshold is set high (currently 100%!) so it should never not run.

To use these files, you will need to install Terraform on your local machine, either from https://www.terraform.io/downloads.html or via Conda, Homebrew, or the like.
(You will also probably want the `aws` cli tools, and to have run `aws configure` to set your access key ID and secret access key.)
Suggested change:
- (You will also probably want the `aws` cli tools, and to have run `aws configure` to set your access key ID and secret access key.)
+ (You will also probably want the `aws` cli tools, and to have run [`aws configure`](https://docs.aws.amazon.com/cli/latest/reference/configure/index.html) to set your access key ID and secret access key.)
Alternatively link to: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
As I'm reading this, I am wondering about a local development section in the documentation, which almost certainly belongs in a different markdown file. This is a future problem.
That will need to come. But the hope is that most development won't require messing with terraform. Other AWS stuff will need to be added for the rest of the repo though, to get access to required buckets, etc.
(You will also probably want the `aws` cli tools, and to have run `aws configure` to set your access key ID and secret access key.)

Once you have installed terraform, you can run `terraform init` from this directory to get started.
To implement any changes, you will want a copy of the current state file, which is not included in this repository for security reasons.
It's unclear to me how I would get a copy of the current state file, but it may be the case that it has to be this way.
Right now, you would message me, and I would send it to you. Kurt pointed me to terraform cloud, but I haven't looked into it yet.
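If remote state (Terraform Cloud or an S3 backend) pans out, that would replace the manual hand-off. A sketch of the S3 variant; the bucket and key names here are placeholders, not real resources:

```hcl
terraform {
  # Remote state removes the need to pass terraform.tfstate around by hand.
  # Bucket, key, and region below are invented placeholders.
  backend "s3" {
    bucket = "example-terraform-state"
    key    = "nextflow/terraform.tfstate"
    region = "us-east-1"
  }
}
```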
To implement any changes, you will want a copy of the current state file, which is not included in this repository for security reasons.
With that file in the directory, you can test any changes with `terraform plan` and then implement them on AWS with `terraform apply`.

Changes to the compute environments do not always seem to go smoothly, as Terraform does not always properly shut down the job queues.
🆒
LGTM! The readme seems nicely informative!
This draft PR adds terraform files to build the full stack of resources required for a nextflow queue.
This should include the necessary groups, roles, policies, compute environments and batch queues.
Of note, the policies include read/write access to only two buckets: nextflow-ccdl-data and nextflow-ccdl-results (read access is to all buckets).
The compute environments use an AMI based on Amazon ECS-Optimized Amazon Linux AMI, with changes to allow for greater data storage and installation of aws-cli as required for Nextflow.
Most of the settings I used were derived from instructions found here: https://t-neumann.github.io/pipelines/AWS-pipeline/, as well as by looking at the setup created by the cloudformation stack (https://aws.amazon.com/quickstart/biotech-blueprint/nextflow/) that we had previously done some testing with.
I have not yet attempted to deploy this stack, as I wanted to get some feedback/checks before I did so, especially as it is the first time I have used terraform. 😰
While I think most of the stack is probably mostly correct, I am really flying blind on the vpc settings in nextflow-security.tf. I would love to have those looked at especially closely!
I also have not tagged anything, so if there are particular tagging conventions that I should follow, that would be great to hear about. (If there is a way to apply a set of tags to all resources at once, that would probably make the most sense...)
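On that last point: recent versions of the Terraform AWS provider (v3.38+) do support applying one set of tags to every taggable resource at once, via `default_tags` on the provider block. A sketch, with invented tag values:

```hcl
provider "aws" {
  region = "us-east-1"

  # default_tags are merged into the tags of every taggable resource
  # managed by this provider, so per-resource tagging is unnecessary
  # unless a resource needs extra tags.
  default_tags {
    tags = {
      Project   = "nextflow-batch"
      ManagedBy = "terraform"
    }
  }
}
```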