
[batch] Support deployment mode as a Terra on Azure app #13944

Merged
merged 34 commits into hail-is:main
Mar 1, 2024

Conversation

daniel-goldstein
Contributor

First draft of a helm chart for packaging up Hail Batch as a Terra on Azure App. This will likely need numerous bug fixes as we set up a proper testing strategy, but the rough shape of everything should be pretty stable.

@daniel-goldstein
Contributor Author

daniel-goldstein commented Oct 30, 2023

I don't love what I've had to do with the deploy config. In my opinion that's the most finicky part of this (it has already broken multiple times), and it's mostly our fault: we overload the namespace parameter to both identify the Kubernetes namespace and signify whether the environment is prod. All I really want is to change the domain to a domain plus path prefix, and not have the namespace affect routing so heavily. What if the namespace didn't affect routing at all? If the deploy config gave a domain with no path, e.g. hail.is, we'd use subdomains, so batch.hail.is; but if it gave a domain with a path prefix like internal.hail.is/dgoldste, we'd make the batch root internal.hail.is/dgoldste/batch.

Alternative: Actually have and use a base_path in the deploy config. This would be used in dev and terra environments.
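The routing rule floated above could be sketched roughly like this (a hypothetical illustration; `service_url` and its behavior are assumptions, not the actual deploy config code):

```python
# Hypothetical sketch of the proposed rule: a bare domain means each
# service gets a subdomain; a domain carrying a path prefix means each
# service is mounted under that prefix. Not the real deploy config.
def service_url(domain: str, service: str) -> str:
    if '/' in domain:
        host, prefix = domain.split('/', 1)
        return f'https://{host}/{prefix}/{service}'
    # no path prefix: use a subdomain per service
    return f'https://{service}.{domain}'
```

Under this sketch, `service_url('hail.is', 'batch')` yields `https://batch.hail.is`, while `service_url('internal.hail.is/dgoldste', 'batch')` yields `https://internal.hail.is/dgoldste/batch`, matching the two examples above.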

@daniel-goldstein daniel-goldstein force-pushed the azure-terra branch 2 times, most recently from c7ad10f to 38882d5 Compare November 2, 2023 18:20
Contributor

@danking danking left a comment


Pretty straightforward review! Still letting it all sink in, so some of my comments might be naive; tell me if they are!

name: hail-batch-terra-azure
description: A chart for deploying Hail Batch into an Azure Terra workspace
type: application
version: "0.1.9"
Contributor

Is this a version for the chart?

Contributor Author

Yes, updating Hail Batch in Terra basically just requires a PR to Leonardo that updates the chart version they use. We could just use the pip version across the board, but we haven't given that much thought.

Contributor

I like pip version. The fewer distinct versions we have to deal with the better.



class TerraAzureWorkerAPI(CloudWorkerAPI[AzureManagedIdentityCredentials]):
nameserver_ip = '168.63.129.16'
Contributor

unused perhaps?

Contributor Author

This is a class variable used by worker.py. It's probably inconsequential in Terra, but I copied this value from the Azure worker API.

raise

async def get_vm_state(self, instance: Instance) -> VMState:
# TODO This should look at the response and use all applicable lifecycle types
Contributor

say more?

Contributor Author

Lost track of this in all the LoC in this change. In the WSM API, the response has a state field with the following enum: [ BROKEN, CREATING, DELETING, READY, UPDATING ]. I'm hoping, but am not certain, that we can at least map READY onto our notion of ready. Interestingly, this functionality doesn't seem strictly necessary: a VM is marked ready when it comes online and contacts the driver, regardless of this state.
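A minimal sketch of the mapping in question (the enum values come from the WSM API as described above; the coarse target labels are illustrative stand-ins, not the driver's real VMState classes):

```python
# Sketch: collapse the WSM lifecycle enum onto a coarse VM status.
# 'running', 'creating', and 'terminated' are illustrative labels,
# not the actual VMState hierarchy in the batch driver.
WSM_STATES = {'BROKEN', 'CREATING', 'DELETING', 'READY', 'UPDATING'}

def classify_wsm_state(state: str) -> str:
    assert state in WSM_STATES, f'unknown WSM state: {state}'
    if state == 'READY':
        return 'running'
    if state in ('CREATING', 'UPDATING'):
        return 'creating'
    # BROKEN and DELETING: the VM is unusable or on its way out
    return 'terminated'
```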

Contributor

Hmm. Is the TODO suggesting that we assert the state field is READY in the response?

Contributor

@danking danking left a comment

I'll do a more thorough review tomorrow; today has too many meetings sadly.

async def _to_azure_url(self, url: str) -> str:
if url in self._sas_token_cache:
sas_token, expiration = self._sas_token_cache[url]
ten_minutes_from_now = time_msecs() + 10 * 60
Contributor

I'm thinking refresh sooner, but I admit my thinking is still hazy here.

It almost feels like our retry logic needs to be URL-aware. What if we're 11 minutes from expiration and we're streaming a big-ish file? We make some requests to get the first couple of chunks, then do some processing, then the network hiccups and we retry from the start. It feels very reasonable to exceed the 11 minutes.

Of course, the same problem can happen in normally authenticated code, but, at least for aiogoogle, every independent call to request will check for authentication validity. Here, once we've resolved the URL, we've locked in our credentials until we successfully complete all interactions with this object. I wonder if the right place for this functionality is much deeper, at the level of AzureSession/AzureCredentials. We'd have to change the interface for credentials so that it has access to the URL though.
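As a concrete illustration of the margin check in the snippet above, here is a minimal, hypothetical token cache (the `fetch_token` callback, class name, and 10-minute margin are assumptions for the sketch, not the actual Terra client API):

```python
import time

REFRESH_MARGIN_MSECS = 10 * 60 * 1000  # assumed 10-minute refresh margin


class SasTokenCache:
    """Illustrative sketch: hand back a cached SAS token only if it is
    comfortably far from expiry; otherwise mint a fresh one."""

    def __init__(self, fetch_token):
        # fetch_token(url) -> (token, expiration_msecs); hypothetical callback
        self._fetch_token = fetch_token
        self._cache = {}

    def get(self, url: str) -> str:
        now = int(time.time() * 1000)
        entry = self._cache.get(url)
        if entry is not None:
            token, expiration = entry
            if expiration - now > REFRESH_MARGIN_MSECS:
                return token
        token, expiration = self._fetch_token(url)
        self._cache[url] = (token, expiration)
        return token
```

The open question in the thread is whether this check belongs here or much deeper, at the session/credentials layer, where every request would pass through it.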

danking
danking previously requested changes Dec 7, 2023
Contributor

@danking danking left a comment

dismiss when the signed URL fiddliness is ready for another look

@daniel-goldstein
Contributor Author

@danking , I'm a bit stuck on how to proceed with the credential refreshing. Here's the layout of the problem:

  1. In normal Azure, we accept user-provided SAS tokens. Since they are user-provided, we have no way of obtaining new ones, and the onus is on the user to obtain a SAS token valid for as long as they expect to need it.
  2. The current design in Terra avoids making the user do that, because it seems annoying, and for Terra-controlled ABS containers we have an endpoint we can hit to get a SAS token. OK, but now we need to update our Azure FS infrastructure to refresh a credential when it expires. The problem: we use the Azure client library and don't control all HTTP requests. For example, in AzureStorageFS.open we call downloader.readall() when we want to load the whole file into memory. I went spelunking through their source, and readall mostly wraps a sequence of range reads; but if we kept using that method, we would have to catch credential-expiration errors, reset the credentials on the blob client, and retry, hoping we didn't break any invariants. I don't want to do that, because I wouldn't trust a stream that encountered a non-transient error like that. It could be that getting rid of downloader.readall is the only thing we have to worry about, but it makes me uneasy not to control the HTTP requests we're making to ABS.

Do you see a solution other than raking through our aioazure.fs, making sure we only use "quick" methods, and possibly retrying 401s? It seems to me we're going against the grain: even though it feels user-hostile, the intent of SAS tokens is that users own credential expiration.

@daniel-goldstein
Contributor Author

Stepping back a bit, there might be a reasonable (if unsatisfying) middle ground. Presumably the operations most at risk are long streams, which we always do in chunks anyway; in that case we can create new downloaders in AzureReadableStream.read if the SAS token expires. That would probably solve most of these problems.
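The middle ground described here could look roughly like the following (every name here, including ExpiredCredentialError and the make_downloader callback, is a hypothetical stand-in for the aioazure internals, not the real classes):

```python
class ExpiredCredentialError(Exception):
    """Stand-in for whatever auth error the storage client raises."""


class ResumableReader:
    """Hypothetical sketch of the proposed middle ground: a chunked reader
    that, on credential expiry, rebuilds its downloader at the current
    offset with a fresh SAS token instead of failing the whole stream."""

    def __init__(self, make_downloader):
        # make_downloader(offset) -> object with an async read(n) method,
        # constructed with a freshly minted SAS token (assumed callback)
        self._make_downloader = make_downloader
        self._offset = 0
        self._downloader = make_downloader(0)

    async def read(self, n: int) -> bytes:
        try:
            chunk = await self._downloader.read(n)
        except ExpiredCredentialError:
            # token expired mid-stream: recreate the downloader (and thus
            # the credential) at the current offset, then retry once
            self._downloader = self._make_downloader(self._offset)
            chunk = await self._downloader.read(n)
        self._offset += len(chunk)
        return chunk
```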

Contributor

@danking danking left a comment

boop

Contributor

@danking danking left a comment

some thoughts, haven't gone through everything yet

'name': disk_name,
'size': disk_size_gb,
},
}
Contributor

Do you know what happens to the disk when the VM dies? Who cleans up the disk?

Contributor Author

We've decided to remove the data disk from this PR and only accept VMs with a local SSD.


await self.terra_client.post(
f'/vm/{terra_vm_resource_id}',
json={
'jobControl': {'id': str(uuid.uuid4())},
Contributor

This is a bit of a bike shed, but also a bit of a principled stance against UUIDs.

I'm generally a bit skeptical of UUIDs. They encode a 128-bit number using 32 + 4 = 36 ASCII characters, or 288 bits. They also waste 6 of the 128 bits on deterministic version information.

What are the arguments for/against uuid4 versus, say, a 64-bit random integer, or even a 128-bit random integer if we want a very low chance of collision?

For 64 bits, the chance of a collision among 100,000 draws is about 10^-10 [1]. But even if we just base64.b64encode(secrets.token_bytes(16)), that's a full 128 bits of randomness encoded in 192 bits of base64 characters.

[1] https://www.wolframalpha.com/input?i=1+-+Pochhammer%5Bn-%28k-1%29%2Ck%5D%2Fn%5Ek+where+n+is+2%5E64+and+k+is+100000 (Pochhammer is necessary because of the huge numbers involved; Pochhammer[x-(y-1), y] is x!/(x-y)!.)
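The size arithmetic above is easy to check directly:

```python
import base64
import secrets
import uuid

u = str(uuid.uuid4())
# 32 hex digits + 4 hyphens = 36 characters, i.e. 288 bits of ASCII
# spent on a 128-bit value
assert len(u) == 36

token = base64.b64encode(secrets.token_bytes(16)).decode()
# 16 random bytes (128 bits) encode to 24 base64 characters, i.e. 192 bits
assert len(token) == 24
```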

Contributor Author

This is fair; I was simply going by the recommendation in the WSM API docs. I don't have a strong preference, so I'm happy to change it to base64.b64encode(secrets.token_bytes(16)).

log.info(f'Terra response creating disk {disk_name}: {res}')
except Exception:
log.exception(f'error while creating disk {disk_name}')
return total_resources_on_instance
Contributor

Shouldn't disk creation failure be more traumatic?

Contributor Author

It turns out this is how the other cloud drivers operate when they can't create an instance. We're going to keep this as is, but it should be scrutinized more broadly whether we should immediately remove an instance that fails to be created.

@daniel-goldstein
Contributor Author

Hmm. Is the TODO suggesting that we assert the state field is READY in the response?

I handled the other states and removed the TODO

Contributor

@danking danking left a comment

One nit, but this seems broadly right to me!

@danking danking added the WIP label Feb 29, 2024
@danking
Contributor

danking commented Feb 29, 2024

WIP'ed in case you want to add the last shq

@hail-ci-robot hail-ci-robot merged commit f40d1c9 into hail-is:main Mar 1, 2024
2 checks passed
3 participants