
[batch] Support deployment mode as a Terra on Azure app #13944

Merged
merged 34 commits into hail-is:main
Mar 1, 2024

Conversation

daniel-goldstein
Contributor

First draft of a helm chart for packaging up Hail Batch as a Terra on Azure App. This will likely need numerous bug fixes as we set up a proper testing strategy, but the rough shape of everything should be pretty stable.

@daniel-goldstein
Contributor Author

daniel-goldstein commented Oct 30, 2023

I don't love what I've had to do with the deploy config. In my opinion that's the most finicky part of this (it has already broken multiple times), and it's mostly our fault: we overload the namespace parameter to both identify the Kubernetes namespace and signify whether the environment is prod. All I really want is to change the domain to a domain plus path prefix, and not have the namespace affect routing so heavily. What if the namespace didn't affect routing at all? If the deploy config gave a domain with no path, e.g. hail.is, we'd use subdomains, so batch.hail.is; but if it gave a domain with a path prefix like internal.hail.is/dgoldste, we'd make the batch root internal.hail.is/dgoldste/batch.

Alternative: Actually have and use a base_path in the deploy config. This would be used in dev and terra environments.
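The routing rule floated above could be sketched roughly like this (a hypothetical illustration; `service_url` and its behavior are assumptions, not the actual deploy config code):

```python
# Hypothetical sketch of the proposed rule: a bare domain means each
# service gets a subdomain; a domain carrying a path prefix means each
# service is mounted under that prefix. Not the real deploy config.
def service_url(domain: str, service: str) -> str:
    if '/' in domain:
        host, prefix = domain.split('/', 1)
        return f'https://{host}/{prefix}/{service}'
    # no path prefix: use a subdomain per service
    return f'https://{service}.{domain}'
```

Under this sketch, `service_url('hail.is', 'batch')` yields `https://batch.hail.is`, while `service_url('internal.hail.is/dgoldste', 'batch')` yields `https://internal.hail.is/dgoldste/batch`, matching the two examples above.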

@daniel-goldstein daniel-goldstein force-pushed the azure-terra branch 2 times, most recently from c7ad10f to 38882d5 Compare November 2, 2023 18:20
Contributor

@danking danking left a comment


Pretty straightforward review! Still letting it all sink in, so some of my comments might be naive; tell me if they are!

name: hail-batch-terra-azure
description: A chart for deploying Hail Batch into an Azure Terra workspace
type: application
version: "0.1.9"
Contributor

Is this a version for the chart?

Contributor Author

Yes, updating Hail Batch in Terra basically just requires a PR to Leonardo that updates the chart version they use. We could just use the pip version across the board, but we haven't given that much thought.

Contributor

I like pip version. The fewer distinct versions we have to deal with the better.



class TerraAzureWorkerAPI(CloudWorkerAPI[AzureManagedIdentityCredentials]):
nameserver_ip = '168.63.129.16'
Contributor

unused perhaps?

Contributor Author

This is a class variable used by worker.py. It's probably inconsequential in Terra, but I copied this value from the Azure worker API.

raise

async def get_vm_state(self, instance: Instance) -> VMState:
# TODO This should look at the response and use all applicable lifecycle types
Contributor

say more?

Contributor Author

Lost track of this in all the LoC in this change. In the WSM API, the response has a state field with the following enum: [ BROKEN, CREATING, DELETING, READY, UPDATING ]. I'm hoping, but am not certain, that we can at least map READY onto our notion of ready. Interestingly, this functionality doesn't seem strictly necessary: a VM is marked ready when it comes online and contacts the driver, regardless of this state.
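A minimal sketch of the mapping in question (the enum values come from the WSM API as described above; the coarse target labels are illustrative stand-ins, not the driver's real VMState classes):

```python
# Sketch: collapse the WSM lifecycle enum onto a coarse VM status.
# 'running', 'creating', and 'terminated' are illustrative labels,
# not the actual VMState hierarchy in the batch driver.
WSM_STATES = {'BROKEN', 'CREATING', 'DELETING', 'READY', 'UPDATING'}

def classify_wsm_state(state: str) -> str:
    assert state in WSM_STATES, f'unknown WSM state: {state}'
    if state == 'READY':
        return 'running'
    if state in ('CREATING', 'UPDATING'):
        return 'creating'
    # BROKEN and DELETING: the VM is unusable or on its way out
    return 'terminated'
```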

Contributor

Hmm. Is the TODO suggesting that we assert the state field is READY in the response?

Contributor

@danking danking left a comment

I'll do a more thorough review tomorrow; today has too many meetings sadly.

async def _to_azure_url(self, url: str) -> str:
if url in self._sas_token_cache:
sas_token, expiration = self._sas_token_cache[url]
ten_minutes_from_now = time_msecs() + 10 * 60
Contributor

I'm thinking refresh sooner, but I admit my thinking is still hazy here.

It almost feels like our retry logic needs to be URL-aware. What if we're 11 minutes from expiration and we're streaming a big-ish file? We make some requests to get the first couple of chunks, then do some processing, then the network hiccups and we retry from the start. It feels very reasonable to exceed the 11 minutes.

Of course, the same problem can happen in normally authenticated code, but, at least for aiogoogle, every independent call to request will check for authentication validity. Here, once we've resolved the URL, we've locked in our credentials until we successfully complete all interactions with this object. I wonder if the right place for this functionality is much deeper, at the level of AzureSession/AzureCredentials. We'd have to change the interface for credentials so that it has access to the URL though.
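As a concrete illustration of the margin check in the snippet above, here is a minimal, hypothetical token cache (the `fetch_token` callback, class name, and 10-minute margin are assumptions for the sketch, not the actual Terra client API):

```python
import time

REFRESH_MARGIN_MSECS = 10 * 60 * 1000  # assumed 10-minute refresh margin


class SasTokenCache:
    """Illustrative sketch: hand back a cached SAS token only if it is
    comfortably far from expiry; otherwise mint a fresh one."""

    def __init__(self, fetch_token):
        # fetch_token(url) -> (token, expiration_msecs); hypothetical callback
        self._fetch_token = fetch_token
        self._cache = {}

    def get(self, url: str) -> str:
        now = int(time.time() * 1000)
        entry = self._cache.get(url)
        if entry is not None:
            token, expiration = entry
            if expiration - now > REFRESH_MARGIN_MSECS:
                return token
        token, expiration = self._fetch_token(url)
        self._cache[url] = (token, expiration)
        return token
```

The open question in the thread is whether this check belongs here or much deeper, at the session/credentials layer, where every request would pass through it.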

danking
danking previously requested changes Dec 7, 2023
Contributor

@danking danking left a comment

dismiss when the signed URL fiddliness is ready for another look

@daniel-goldstein
Contributor Author

@danking , I'm a bit stuck on how to proceed with the credential refreshing. Here's the layout of the problem:

  1. In normal Azure, we accept user-provided SAS tokens. Since they are user-provided, we have no way of obtaining new ones, and the onus is on the user to obtain a SAS token valid for as long as they expect to need it.
  2. The current design in Terra avoids making the user do that, because it seems annoying, and for Terra-controlled ABS containers we have an endpoint we can hit to get a SAS token. OK, but now we need to update our Azure FS infrastructure to refresh a credential when it expires. The problem: we use the Azure client library and don't control all HTTP requests. For example, in AzureStorageFS.open we call downloader.readall() when we want to load the whole file into memory. I went spelunking through their source, and readall mostly wraps a sequence of range reads; but if we kept using that method, we would have to catch credential-expiration errors, reset the credentials on the blob client, and retry, hoping we didn't break any invariants. I don't want to do that, because I wouldn't trust a stream that encountered a non-transient error like that. It could be that getting rid of downloader.readall is the only thing we have to worry about, but it makes me uneasy not to control the HTTP requests we're making to ABS.

Do you see a solution other than raking through our aioazure.fs, making sure we only use "quick" methods, and possibly retrying 401s? It seems to me we're going against the grain: even though it feels user-hostile, the intent of SAS tokens is that users own credential expiration.

@daniel-goldstein
Contributor Author

Stepping back a bit, there might be a reasonable (if unsatisfying) middle ground. Presumably the operations most at risk are long streams, which we always do in chunks anyway; in that case we can create new downloaders in AzureReadableStream.read if the SAS token expires. That would probably solve most of these problems.
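The middle ground described here could look roughly like the following (every name here, including ExpiredCredentialError and the make_downloader callback, is a hypothetical stand-in for the aioazure internals, not the real classes):

```python
class ExpiredCredentialError(Exception):
    """Stand-in for whatever auth error the storage client raises."""


class ResumableReader:
    """Hypothetical sketch of the proposed middle ground: a chunked reader
    that, on credential expiry, rebuilds its downloader at the current
    offset with a fresh SAS token instead of failing the whole stream."""

    def __init__(self, make_downloader):
        # make_downloader(offset) -> object with an async read(n) method,
        # constructed with a freshly minted SAS token (assumed callback)
        self._make_downloader = make_downloader
        self._offset = 0
        self._downloader = make_downloader(0)

    async def read(self, n: int) -> bytes:
        try:
            chunk = await self._downloader.read(n)
        except ExpiredCredentialError:
            # token expired mid-stream: recreate the downloader (and thus
            # the credential) at the current offset, then retry once
            self._downloader = self._make_downloader(self._offset)
            chunk = await self._downloader.read(n)
        self._offset += len(chunk)
        return chunk
```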

Contributor

@danking danking left a comment

boop

Contributor

@danking danking left a comment

some thoughts, haven't gone through everything yet

'name': disk_name,
'size': disk_size_gb,
},
}
Contributor

Do you know what happens to the disk when the VM dies? Who cleans up the disk?

Contributor Author

We've decided to remove the data disk from this PR and only accept VMs with a local SSD.


await self.terra_client.post(
f'/vm/{terra_vm_resource_id}',
json={
'jobControl': {'id': str(uuid.uuid4())},
Contributor

This is a bit of a bike shed, but also a bit of a principled stance against UUIDs.

I'm generally a bit skeptical of UUIDs. They encode a 128-bit number using 32 + 4 = 36 ASCII characters, or 288 bits. They also waste 6 of the 128 bits on deterministic version information.

What are the arguments for/against uuid4 versus, say, a 64-bit random integer, or even a 128-bit random integer if we want a very low chance of collision?

For 64 bits, the chance of a collision among 100,000 draws is about 10^-10 [1]. But even if we just base64.b64encode(secrets.token_bytes(16)), that's a full 128 bits of randomness encoded in 192 bits of base64 characters.

[1] https://www.wolframalpha.com/input?i=1+-+Pochhammer%5Bn-%28k-1%29%2Ck%5D%2Fn%5Ek+where+n+is+2%5E64+and+k+is+100000 (Pochhammer is necessary because of the huge numbers involved; Pochhammer[x-(y-1), y] is x!/(x-y)!.)
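The size arithmetic above is easy to check directly:

```python
import base64
import secrets
import uuid

u = str(uuid.uuid4())
# 32 hex digits + 4 hyphens = 36 characters, i.e. 288 bits of ASCII
# spent on a 128-bit value
assert len(u) == 36

token = base64.b64encode(secrets.token_bytes(16)).decode()
# 16 random bytes (128 bits) encode to 24 base64 characters, i.e. 192 bits
assert len(token) == 24
```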

Contributor Author

This is fair; I was simply going by the recommendation in the WSM API docs. I don't have a strong preference, so I'm happy to change it to base64.b64encode(secrets.token_bytes(16)).

log.info(f'Terra response creating disk {disk_name}: {res}')
except Exception:
log.exception(f'error while creating disk {disk_name}')
return total_resources_on_instance
Contributor

Shouldn't disk creation failure be more traumatic?

Contributor Author

It turns out this is how the other cloud drivers operate when they can't create an instance. We're going to keep this as is, but it should be scrutinized more broadly whether we should immediately remove an instance that fails to be created.

@daniel-goldstein
Contributor Author

Hmm. Is the TODO suggesting that we assert the state field is READY in the response?

I handled the other states and removed the TODO

Contributor

@danking danking left a comment

One nit, but this seems broadly right to me!

@danking danking added the WIP label Feb 29, 2024
@danking
Contributor

danking commented Feb 29, 2024

WIP'ed in case you want to add the last shq

@hail-ci-robot hail-ci-robot merged commit f40d1c9 into hail-is:main Mar 1, 2024
2 checks passed
3 participants