
Add base terraform config for batch #2

Merged · 17 commits · Aug 19, 2020
Conversation

jashapiro (Member):

This draft PR adds terraform files to build the full stack of resources required for a Nextflow queue.

This should include the necessary groups, roles, policies, compute environments and batch queues.

Of note, the policies grant read/write access to only two buckets: nextflow-ccdl-data and nextflow-ccdl-results (read access is granted to all buckets).

The compute environments use an AMI based on the Amazon ECS-Optimized Amazon Linux AMI, with changes to allow for greater data storage and the installation of aws-cli, as required for Nextflow.

Most of the settings I used were derived from the instructions found here: https://t-neumann.github.io/pipelines/AWS-pipeline/, as well as from looking at the setup created by the CloudFormation stack (https://aws.amazon.com/quickstart/biotech-blueprint/nextflow/) that we had previously done some testing with.

I have not yet attempted to deploy this stack, as I wanted to get some feedback/checks before I did so, especially as it is the first time I have used terraform. 😰

While I think most of the stack is probably mostly correct, I am really flying blind on the VPC settings in nextflow-security.tf. I would love to have those looked at especially closely!

I also have not tagged anything, so if there are particular tagging conventions that I should follow, that would be great to hear about. (If there is a way to apply a set of tags to all resources at once, that would probably make the most sense...)
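For concreteness, the bucket scoping described above might look roughly like this in Terraform (a sketch only; the policy name and the exact action lists in the PR differ):

```hcl
# Sketch of the S3 access scoping; names are illustrative, not the PR's actual code.
data "aws_iam_policy_document" "nf_s3_access" {
  # Read access to all buckets
  statement {
    actions   = ["s3:Get*", "s3:List*"]
    resources = ["*"]
  }

  # Read/write access only to the two Nextflow buckets
  statement {
    actions = ["s3:*"]
    resources = [
      "arn:aws:s3:::nextflow-ccdl-data",
      "arn:aws:s3:::nextflow-ccdl-data/*",
      "arn:aws:s3:::nextflow-ccdl-results",
      "arn:aws:s3:::nextflow-ccdl-results/*",
    ]
  }
}
```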

jashapiro requested a review from kurtwheeler on August 3, 2020
allocation_strategy = "BEST_FIT"
max_vcpus = 20
min_vcpus = 0
image_id = "ami-0a8857ac38c35157f"
Contributor:

I know you noted where this AMI came from and what it is in the PR description, but I think it'd be good to also note that in a comment here.

Contributor:

Oh I just saw aws/setup-log.md, maybe just mention that doc in a comment here?

Member Author:

Added.
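(Roughly along these lines — a reconstruction, not the exact committed comment:)

```hcl
# Custom AMI built from the Amazon ECS-Optimized Amazon Linux AMI, with extra
# data storage and aws-cli installed for Nextflow; see aws/setup-log.md.
image_id = "ami-0a8857ac38c35157f"
```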

aws/setup-log.md (outdated):

Briefly:
1. We used the base AMI _Amazon ECS-Optimized Amazon Linux AMI_. (note, not Amazon Linux 2).
This was launched with an 80GB EBS volume for data (compared to teh default 22) in addition to the base 8GB boot volume.
Contributor:

Suggested change:
- This was launched with an 80GB EBS volume for data (compared to teh default 22) in addition to the base 8GB boot volume.
+ This was launched with an 80GB EBS volume for data (compared to the default 22) in addition to the base 8GB boot volume.
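For reference, reproducing that volume layout in Terraform rather than the console would look roughly like this (resource name, instance type, and base AMI are placeholders):

```hcl
resource "aws_instance" "ami_builder" {
  ami           = "ami-..."    # base ECS-Optimized Amazon Linux AMI (placeholder)
  instance_type = "t2.micro"   # assumed; any small type works for building an AMI

  # the base 8GB boot volume
  root_block_device {
    volume_size = 8
  }

  # 80GB data volume, replacing the AMI's default 22GB Docker storage volume
  ebs_block_device {
    device_name = "/dev/xvdcz"
    volume_size = 80
  }
}
```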

spot_iam_fleet_role = aws_iam_role.nf_ecs_role.arn
bid_percentage = 20
max_vcpus = 100
min_vcpus = 0
Contributor:

Ok so if I understand this correctly, you've got a default queue that uses spot instances and a priority queue that uses on demand instances. They're both going to scale up automatically, default to a max_vcpus of 100 and the priority to 20. Is that all correct?

Member Author:

That is what is supposed to happen, yes. I may well change these in the future, but this seemed a good place to start.
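(In resource terms, the pairing looks roughly like this; queue names and the referenced compute environments are illustrative, not the PR's actual identifiers:)

```hcl
resource "aws_batch_job_queue" "default" {
  name                 = "default"
  state                = "ENABLED"
  priority             = 1
  # spot compute environment, scaling from 0 to 100 vCPUs
  compute_environments = [aws_batch_compute_environment.nf_spot.arn]
}

resource "aws_batch_job_queue" "priority" {
  name                 = "priority"
  state                = "ENABLED"
  priority             = 10
  # on-demand compute environment, scaling from 0 to 20 vCPUs
  compute_environments = [aws_batch_compute_environment.nf_on_demand.arn]
}
```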

Contributor:

Ok sweet. A summary like that might also be useful to have in a comment or doc somewhere.

Member Author:

Yeah I was planning to add a readme before making it into a real PR.

"s3:GetBucketCORS",
"s3:GetBucketLocation",
"s3:ReplicateDelete",
"s3:GetObjectVersion"
Contributor:

Is this everything? I think `s3:*` might be equivalent.

Member Author:

I don't know? It is probably nearly equivalent, but this was one that I copied from the CloudFormation settings, so I didn't want to mess around too much. I doubt changing it to `s3:*` would be a problem.

Contributor:

Word, if you got this from there then it's probably best to leave it as is.


resource "aws_subnet" "nf_subnet" {
vpc_id = aws_vpc.nf_vpc.id
cidr_block = "10.1.1.0/24"
Contributor:

Are you planning on adding additional subnets? If not, you may just want to have this match the VPC's CIDR block.

Member Author:

I don't expect so? This is where I was completely lost, and was really just going off examples. I don't know what the function of multiple subnets would be, but it seemed I needed to make at least one. The CloudFormation-created VPC definitely made more than one, but I don't know why, or how that might be used.

Contributor:

Ok then, this should be fine. It shouldn't really matter much anyway.
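(For reference, making the two blocks match would look like this; the VPC resource shown assumes the subnet's existing CIDR:)

```hcl
resource "aws_vpc" "nf_vpc" {
  cidr_block = "10.1.1.0/24"   # assumed value, matched by the single subnet below
}

resource "aws_subnet" "nf_subnet" {
  vpc_id     = aws_vpc.nf_vpc.id
  cidr_block = aws_vpc.nf_vpc.cidr_block   # one subnet spanning the whole VPC
}
```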

kurtwheeler (Contributor) left a comment:

Ok this all looks pretty good to me. You'll definitely want to test it; it's hard to eyeball all the things that could be wrong, but there's nothing major as far as I can tell.

jashapiro (Member Author):

> Ok this all looks pretty good to me. You'll definitely want to test it; it's hard to eyeball all the things that could be wrong, but there's nothing major as far as I can tell.

That is good to hear.

What is your recommended test procedure? I have been making sure the syntax, etc. works with `terraform plan`, but I wasn't really sure of the best next steps. I assume `terraform apply`, but if things go wrong, what is the rollback? Will `terraform destroy` only affect the resources defined in these files? If I make changes, do I use `apply` again to update things as needed, or is there another step?

kurtwheeler (Contributor):

> Will `terraform destroy` only affect the resources defined in these files?

Yep, it's pretty good about that.

> If I make changes, do I use `apply` again to update things as needed, or is there another step?

There is no other step, other than maybe `terraform plan`. `terraform apply` will figure out what it needs to change and only modify/recreate those resources. Terraform won't really do anything dangerous. The worst it can really do is get a corrupted state, which will make it really hard to clean up, but that's not that scary and you probably won't run into it.
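For reference, the usual loop is just the standard Terraform commands (nothing project-specific here):

```sh
terraform init       # once per checkout, to install providers and set up state
terraform plan       # preview what would change
terraform apply      # make the changes (asks for confirmation first)
# ...edit the .tf files, then plan/apply again; terraform computes the diff...
terraform destroy    # tear down only the resources defined in these files
```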

kurtwheeler (Contributor):

Ok, I added some untested terraform code. It won't work just yet; you'll need to add the public key of your ssh key into the keypair resource definition. Also, https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/batch_compute_environment#ec2_key_pair didn't specify what attribute of the key_pair it wanted, so maybe you'll need to switch that to the id (normal instances use the name, though).

Once this is set up, you'll be able to ssh in with `ssh -i <private key file> <user name>@<instance ip>`. If you get the user name wrong, it won't be obvious that's the issue. It'll depend on your AMI; I've had to use `ec2-user`, `ubuntu`, and `root` in the past.

If any of that gives you any trouble lemme know!
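A sketch of the wiring being described (the resource names and placeholder key are illustrative):

```hcl
resource "aws_key_pair" "nf_key" {
  key_name   = "nextflow-key"
  public_key = "ssh-rsa AAAA..."   # placeholder; paste your real public key here
}

# Then, inside the compute environment's compute_resources block:
#   ec2_key_pair = aws_key_pair.nf_key.key_name
# (or .id, if the name attribute turns out not to be what Batch expects)
```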

jashapiro marked this pull request as ready for review on August 17, 2020
jashapiro (Member Author):

After Kurt successfully got things working by adding in missing gateway settings (who knew? Kurt), I have gone back and tightened things back up. I left in the sections that would allow SSH access to nodes, but that access is currently disabled: there is no public key passed to the nodes, and port 22 is closed.

I have also added more comments in the aws/setup-log.md (which maybe should be renamed README.md?) to describe the setup in a bit more detail.

I have also added an example nextflow.config file that includes minimal settings for running a job on the default batch queue.

jaclyn-taroni (Member) left a comment:

👍 I think these docs are enough to get started, along with the custom AMI section of the Nextflow docs, but we should keep an eye on how to refine them over time!

The spot price threshold is set high (currently 100%!) so it should never fail to run.

To use these files, you will need to install Terraform on your local machine, either from https://www.terraform.io/downloads.html or via Conda, Homebrew, or the like.
(You will also probably want the `aws` cli tools, and to have run `aws configure` to set your access key ID and secret access key.)
Member:

Suggested change:
- (You will also probably want the `aws` cli tools, and to have run `aws configure` to set your access key ID and secret access key.)
+ (You will also probably want the `aws` cli tools, and to have run [`aws configure`](https://docs.aws.amazon.com/cli/latest/reference/configure/index.html) to set your access key ID and secret access key.)

Alternatively link to: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html

Member:

As I'm reading this, I am wondering about a local development section in the documentation, which almost certainly belongs in a different markdown file. This is a future problem.

Member Author:

That will need to come. But the hope is that most development won't require messing with terraform. Other AWS stuff will need to be added for the rest of the repo, though, to get access to required buckets, etc.


Once you have installed terraform, you can run `terraform init` from this directory to get started.
To implement any changes, you will want a copy of the current state file, which is not included in this repository for security reasons.
Member:

Unclear to me how I would ensure that I would have a copy of the current state file, but it may be the case that it has to be this way.

jashapiro (Member Author), Aug 17, 2020:

Right now, you would message me and I would send it to you. Kurt pointed me to Terraform Cloud, but I haven't looked into it yet.

With that file in the directory, you can test any changes with `terraform plan` and then implement them on AWS with `terraform apply`.

Changes to the compute environments do not always seem to go smoothly, as Terraform does not always properly shut down the job queues.
Member:

🆒

kurtwheeler (Contributor) left a comment:

LGTM! The readme seems nicely informative!

jashapiro merged commit f4ccc68 into master on Aug 19, 2020
jashapiro deleted the jashapiro/terraform-batch branch on November 5, 2020