[batch] Support deployment mode as a Terra on Azure app #13944
Conversation
I don't love what I've had to do with the deploy config stuff. That's, in my opinion, the most finicky part of this (it has already broken multiple times), and it's mostly our fault, because we overload the … Alternative: actually have and use a …
Pretty straightforward review! Still letting it all sink in, so some of my comments might be naive; tell me if they are!
`batch/terra-chart/Chart.yaml` (Outdated)

```yaml
name: hail-batch-terra-azure
description: A chart for deploying Hail Batch into an Azure Terra workspace
type: application
version: "0.1.9"
```
Is this a version for the chart?
Yes, updating Hail Batch in Terra basically just requires a PR to Leonardo that updates the chart version they use. We could just use the pip version across the board, but I haven't given that much thought.
I like pip version. The fewer distinct versions we have to deal with the better.
```python
class TerraAzureWorkerAPI(CloudWorkerAPI[AzureManagedIdentityCredentials]):
    nameserver_ip = '168.63.129.16'
```
unused perhaps?
This is a class variable that is used by worker.py. It's probably inconsequential in Terra, but I copied this value from the Azure worker API.
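For context, here is a minimal sketch of the class-variable pattern being described. The stand-in classes and the `dns_config` consumer are hypothetical simplifications, not Hail's real API:

```python
# Simplified sketch: each cloud's worker API subclass supplies a
# `nameserver_ip` class variable, which worker code reads off whichever
# subclass is active. These classes are stand-ins, not Hail's actual ones.
class CloudWorkerAPI:
    nameserver_ip: str

class TerraAzureWorkerAPI(CloudWorkerAPI):
    # Azure's well-known virtual DNS address, same value as the Azure worker API
    nameserver_ip = '168.63.129.16'

def dns_config(api: CloudWorkerAPI) -> str:
    # hypothetical consumer, analogous to what worker.py does with the value
    return f'nameserver {api.nameserver_ip}'
```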
```python
raise

async def get_vm_state(self, instance: Instance) -> VMState:
    # TODO This should look at the response and use all applicable lifecycle types
```
say more?
Lost track of this in all the LoC in this change. In the WSM API there's a `state` field in the response with the enum `[ BROKEN, CREATING, DELETING, READY, UPDATING ]`. I'm hoping, but am not certain, that we can at least map `READY` to our notion of ready. Interestingly, this functionality doesn't seem entirely necessary, as a VM is marked ready regardless when it comes online and contacts the driver.
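A hedged sketch of the mapping under discussion. The WSM enum values come from the comment above; the target state names are illustrative placeholders, not Hail's actual `VMState` classes:

```python
# Illustrative mapping from WSM lifecycle states to a driver-side notion of
# VM state. Only the WSM enum values come from the WSM API; everything else
# here is a placeholder.
def vm_state_from_wsm(wsm_state: str) -> str:
    if wsm_state == 'READY':
        return 'running'
    if wsm_state in ('CREATING', 'UPDATING'):
        return 'creating'
    if wsm_state in ('DELETING', 'BROKEN'):
        return 'terminated'
    raise ValueError(f'unknown WSM state: {wsm_state}')
```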
Hmm. Is the TODO suggesting that we assert the state field is READY in the response?
I'll do a more thorough review tomorrow; today has too many meetings sadly.
```python
async def _to_azure_url(self, url: str) -> str:
    if url in self._sas_token_cache:
        sas_token, expiration = self._sas_token_cache[url]
        ten_minutes_from_now = time_msecs() + 10 * 60 * 1000  # time_msecs is in milliseconds
```
I'm thinking refresh sooner, but I admit my thinking is still hazy here.

It almost feels like our retry logic needs to be URL-aware. What if we're 11 minutes from expiration and we're streaming a big-ish file? We make some requests to get the first couple of chunks, then we do some processing, then the network hiccups and we retry from the start. It feels very reasonable to exceed the 11 minutes.

Of course, the same problem can happen in normally authenticated code, but, at least for aiogoogle, every independent call to `request` will check for authentication validity. Here, once we've resolved the URL, we've locked in our credentials until we successfully complete all interactions with this object. I wonder if the right place for this functionality is much deeper, at the level of AzureSession/AzureCredentials. We'd have to change the interface for credentials so that it has access to the URL, though.
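One way to make the cache itself enforce a refresh margin. This is a sketch under assumed names: `SasTokenCache` is hypothetical, while in the PR the cache is a plain dict checked inline in `_to_azure_url`:

```python
import time

REFRESH_MARGIN_MSECS = 10 * 60 * 1000  # treat tokens as stale 10 minutes early

def time_msecs() -> int:
    return int(time.time() * 1000)

class SasTokenCache:
    """Hypothetical cache that refuses to hand out a token once it is
    within REFRESH_MARGIN_MSECS of expiring, forcing the caller to refresh."""

    def __init__(self):
        self._cache = {}  # url -> (sas_token, expiration_msecs)

    def put(self, url: str, sas_token: str, expiration_msecs: int) -> None:
        self._cache[url] = (sas_token, expiration_msecs)

    def get(self, url: str):
        entry = self._cache.get(url)
        if entry is None:
            return None
        sas_token, expiration = entry
        if time_msecs() + REFRESH_MARGIN_MSECS >= expiration:
            del self._cache[url]  # too close to expiry: force a refresh
            return None
        return sas_token
```

This only shrinks the window, though; it doesn't solve the "long stream outlives its token" problem discussed above.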
Dismiss when the signed URL fiddliness is ready for another look.

@danking, I'm a bit stuck on how to proceed with the credential refreshing. Here's the layout of the problem: …

Do you see a solution other than raking through our …

Stepping back a little bit, there might be a reasonable (if unsatisfying) middle ground. Presumably the operations most at risk are long streams that we always do in chunks anyway, and in that case we can create new downloaders on …
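The middle ground described above might look roughly like this. All names here (`resolve_url`, `fetch_range`) are hypothetical stand-ins for the real downloader machinery:

```python
import asyncio

async def download_in_chunks(resolve_url, fetch_range, size: int,
                             chunk_size: int = 8 * 1024 * 1024) -> bytes:
    """Hypothetical chunked download that re-resolves the signed URL before
    every chunk, so no single request can outlive the token it was issued."""
    chunks = []
    offset = 0
    while offset < size:
        url = await resolve_url()  # returns a freshly signed URL when the old one is stale
        n = min(chunk_size, size - offset)
        chunks.append(await fetch_range(url, offset, n))
        offset += n
    return b''.join(chunks)
```

The trade-off is one extra URL resolution (usually a cache hit) per chunk, in exchange for retries that never reuse a stale token mid-stream.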
boop
Some thoughts; I haven't gone through everything yet.
```python
        'name': disk_name,
        'size': disk_size_gb,
    },
}
```
Do you know what happens to the disk when the VM dies? Who cleans up the disk?
We've decided to remove the data disk from this PR and only accept VMs with a local SSD.
```python
await self.terra_client.post(
    f'/vm/{terra_vm_resource_id}',
    json={
        'jobControl': {'id': str(uuid.uuid4())},
```
This is a bit of a bike shed but also a bit of a principled stance against UUID.

I'm generally a bit skeptical of UUIDs. They encode a 128-bit number using 32 + 4 = 36 ASCII characters, or 288 bits. They also waste 6 of the 128 bits on deterministic version information.

What are the arguments for/against uuid4 versus, say, a 64-bit random integer, or even a 128-bit random integer if we want a very low chance of collision?

For 64 bits, the chance of collision with 100,000 draws is about 10^-10 [1]. But even if we just `base64.b64encode(secrets.token_bytes(16))`, that's a full 128 bits of randomness encoded in 192 bits of base64 characters.

[1] https://www.wolframalpha.com/input?i=1+-+Pochhammer%5Bn-%28k-1%29%2Ck%5D%2Fn%5Ek+where+n+is+2%5E64+and+k+is+100000 (Pochhammer is necessary because of the huge numbers involved; `Pochhammer[x-(y-1), y]` is `x!/(x-y)!`.)
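To make the size comparison concrete, using only standard-library behavior:

```python
import base64
import secrets
import uuid

# uuid4: 122 random bits, rendered as 36 ASCII characters (288 bits on the wire)
u = str(uuid.uuid4())
assert len(u) == 36

# 128 fully random bits, rendered as 24 base64 characters (192 bits on the wire,
# including two '=' padding characters)
token = base64.b64encode(secrets.token_bytes(16)).decode('ascii')
assert len(token) == 24
```

Note that standard base64 output can contain `+` and `/`; if the ID ever lands in a URL or filename, `base64.urlsafe_b64encode` is the safer variant.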
This is fair; I was simply going by the recommendation in the WSM API docs. I don't have a big preference and am happy to change it to `base64.b64encode(secrets.token_bytes(16))`.
```python
    log.info(f'Terra response creating disk {disk_name}: {res}')
except Exception:
    log.exception(f'error while creating disk {disk_name}')
return total_resources_on_instance
```
Shouldn't disk creation failure be more traumatic?
Turns out this is how the other cloud drivers operate when they can't create an instance. We're going to keep this as-is, but it should be scrutinized more broadly to consider whether we should immediately remove an instance when it fails to be created.
I handled the other states and removed the TODO.
One nit, but this seems broadly right to me!
WIP'ed in case you want to add the last shq.
First draft of a helm chart for packaging up Hail Batch as a Terra on Azure App. This will likely need numerous bug fixes as we set up a proper testing strategy, but the rough shape of everything should be pretty stable.