Sept 23 upstream 4 (dc1f086) #305

Merged
merged 181 commits into main from sept-23-upstream-4-dc1f086 on Sep 19, 2023

Conversation

illusional

Depends on: #304

jigold and others added 30 commits June 28, 2023 17:20
I realize this looks like a lot of code changes, but it's mostly copying
and pasting two SQL procedures and changing one line in each.

This adds four pieces of metadata to requests that can then be queried as
extra log metadata:
- batch_id
- job_id
- batch_operation
- job_queue_time

These should be self-explanatory, except `job_queue_time`, which is the time
from when the job is first set to Ready to when it is scheduled on a worker
(the exact end point is when the job config is built to send to the worker).

Example logging query. Note that the search on `batch_id` is not optimized,
so you definitely want to restrict the search to a short time window. I can
add my Python script that scrapes these logs and makes a Plotly figure in a
separate PR once this goes in.

```
(
resource.labels.container_name="batch"
resource.labels.namespace_name="{namespace}"
) OR (
resource.labels.container_name="batch-driver"
resource.labels.namespace_name="{namespace}"
) OR (
resource.type="gce_instance"
logName:"worker.log"
labels."compute.googleapis.com/resource_name":"{namespace}"
)
jsonPayload.batch_id="{batch_id}"
timestamp >= "{start_timestamp}" {end_timestamp}
```
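
As a rough illustration of how these fields can be consumed once scraped (the scraping and plotting script itself will come in a separate PR; the helper below is purely hypothetical):

```python
from collections import defaultdict

# Hedged sketch: group scraped jsonPayload entries by job_id and collect the new
# job_queue_time field, i.e. how long each job waited between Ready and scheduling.
def queue_times_by_job(payloads):
    times = defaultdict(list)
    for payload in payloads:
        if 'job_queue_time' in payload:
            times[payload['job_id']].append(float(payload['job_queue_time']))
    return dict(times)
```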
CHANGELOG: Fixed bug causing poor performance and memory leaks for
Matrix.annotate_rows aggregations

---------

Co-authored-by: patrick-schultz <pschultz@broadinstitute.org>
For deleted users the request to look up the AAD application can fail
with a 404, and we don't need to gather this information for deleted
users anyway.
Identities in test namespaces cannot share the same underlying cloud
identity if we want to identify requests with cloud access tokens. This
also means the `test` account does not need to have the union of roles
of the other robot accounts, but pruning of the `test` account's roles
is left until after this PR merges so we can properly assess which roles
are still in use by the `test` account.
…hail-is#13206)

Key changes:

- Remove old VCF combiner
- Add `StreamZipJoinProducers`, an IR that takes an array and a function from
`array.elementType` to a stream, and zip-joins the results of calling that
function on each element of the array (a minimal sketch of the semantics
follows below).
- Combine `Table._generate` and this new stream zip operation to rewrite the
gvcf merge stage of the VDS combiner in O(1) IR.
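
A minimal Python sketch of the intended semantics (illustrative only, not the Scala IR; it assumes each producer yields key-sorted `(key, value)` pairs):

```python
import heapq
from typing import Callable, Iterable, Iterator, Tuple

def stream_zip_join_producers(
    contexts: list,
    make_stream: Callable[[object], Iterable[Tuple[int, str]]],
) -> Iterator[Tuple[int, list]]:
    """Call `make_stream` on each context and zip-join the resulting key-sorted streams."""
    iters = [iter(make_stream(c)) for c in contexts]
    heads = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heads, (first[0], i, first[1]))
    while heads:
        key = heads[0][0]
        row = [None] * len(iters)  # one slot per producer, like the joined struct
        while heads and heads[0][0] == key:
            _, i, value = heapq.heappop(heads)
            row[i] = value
            nxt = next(iters[i], None)
            if nxt is not None:
                heapq.heappush(heads, (nxt[0], i, nxt[1]))
        yield key, row

# Example: zip-join two gvcf-like streams on a shared key.
streams = {'s1': [(1, '0/0'), (2, '0/1')], 's2': [(2, '1/1'), (3, '0/0')]}
print(list(stream_zip_join_producers(['s1', 's2'], lambda name: streams[name])))
# [(1, ['0/0', None]), (2, ['0/1', '1/1']), (3, [None, '0/0'])]
```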

---------

Co-authored-by: Tim Poterba <tpoterba@gmail.com>
Azure default credentials will use the metadata server when available so
we can just use those instead of manually reaching out to the metadata
server.
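A hedged sketch of what this looks like with `azure-identity` (the token scope shown is illustrative):

```python
from azure.identity.aio import DefaultAzureCredential

# DefaultAzureCredential falls back to the instance metadata service (managed identity)
# when running on a VM, so there is no need to call the metadata endpoint by hand.
async def get_management_token() -> str:
    async with DefaultAzureCredential() as credential:
        token = await credential.get_token('https://management.azure.com/.default')
        return token.token
```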
…3226)

RR: hail-is#13045
RR: hail-is#13046 
Support symmetric comparison of structs and struct expressions.
Provide better error messages when attempting to construct literals from
expressions with free variables.
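A small hedged example of both behaviours (the exact error text is illustrative):

```python
import hail as hl

s = hl.Struct(a=1, b='x')         # a Python-side struct value
expr = hl.struct(a=1, b='x')      # a StructExpression

hl.eval(expr == s)   # True
hl.eval(s == expr)   # now also supported, rather than falling back to plain object equality

mt = hl.balding_nichols_model(1, 5, 5)
# hl.literal(mt.GT) now raises a clearer error explaining that the expression
# references fields of `mt` (free variables) and so cannot become a literal.
```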
…il-is#13231)

I need this extra debugging information to understand what is going on
in Azure with deleted VMs still showing up in the portal with
ResourceNotFound errors. Miah and Greg are running into this same
problem in their deployment. My guess is that the worker is active and working
fine, but then the deployment gets "Canceled" because the OMSAgent takes too
long to deploy; our loop then cancels the deployment, which messes up the state
in Azure of the already deployed and running VM.

I popped the parameters from the deployment result in case they contain
sensitive data (I'm mainly worried about private SSH keys).
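Roughly what the scrubbing looks like (a hypothetical helper, not the real driver code):

```python
import logging

log = logging.getLogger('batch-driver')

def log_deployment_result(deployment: dict) -> None:
    # Drop the parameters before logging: they may contain secrets such as private SSH keys.
    scrubbed = {k: v for k, v in deployment.items() if k != 'parameters'}
    log.info(f'deployment result: {scrubbed}')
```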
Currently `TableRead.execute` always produces a
`TableValueIntermediate`, even though almost all `TableReader`s are
lowerable, so could produce a `TableStageIntermediate`.

This PR refactors `TableReader` to allow producing a
`TableStageIntermediate` in most cases, and to make it clearer which
readers still need to be lowered (only
`TableFromBlockMatrixNativeReader`, `MatrixVCFReader`, and
`MatrixPLINKReader`). It also deletes some now dead code.
The test that lists batches timed out. The main problem is that the limit in
the aioclient used by the test_batch tests was being passed as a string rather
than an integer. I assumed the downstream function was receiving an integer.
Therefore, we were effectively doing

`batch_id < "137"` and not `batch_id < 137`.

So the query was running forever, scanning all batches from the test user.
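
A tiny illustration of why the type matters (hypothetical values, not the real query path):

```python
batch_id = 42
limit = "137"                   # what the test was actually passing: a string

print(batch_id < 137)           # True: numeric comparison, the filter behaves as intended
print(str(batch_id) < limit)    # False: "42" > "137" lexicographically, the filter silently fails
```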

I also was missing a tag annotation on the queries, but that was not
causing the timeout.
Currently, when the user views the jobs for a CI pipeline, either on the
`pr.html` page or the `batch.html` page, they are displayed all together
in a big table, like so:

<img width="618" alt="Screenshot 2023-06-06 at 15 30 21"
src="https://github.com/hail-is/hail/assets/84595986/8c3f45a8-756d-4e10-ac27-a57791f3965c">

This change splits the jobs into smaller tables, with any failed jobs
displayed first, followed by any currently running jobs, then any pending
jobs, then the rest.
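
Roughly, the grouping looks like this (a hedged sketch; the state names and job shape are illustrative, not the real CI schema):

```python
def group_jobs(jobs):
    """Split jobs into the four tables described above, failures first."""
    failed, running, pending, rest = [], [], [], []
    for job in jobs:
        state = job['state']
        if state in ('Failed', 'Error'):
            failed.append(job)
        elif state == 'Running':
            running.append(job)
        elif state in ('Ready', 'Creating', 'Pending'):
            pending.append(job)
        else:
            rest.append(job)
    return [('Failed', failed), ('Running', running), ('Pending', pending), ('Other', rest)]
```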

Example with failed job:

<img width="728" alt="Screenshot 2023-06-06 at 15 12 36"
src="https://github.com/hail-is/hail/assets/84595986/5b20f95e-e530-43c2-bc96-4fcd639083d3">

Example with running/pending jobs:

<img width="643" alt="Screenshot 2023-06-06 at 15 10 54"
src="https://github.com/hail-is/hail/assets/84595986/7b5790e6-a5e0-426f-b290-a28fee19e967">
Qin He reported that listing a folder containing around 50k files took about
1h15m. This new code takes ~16 seconds, which is about how long
`gcloud storage ls` takes.

There are two improvements:

1. Use `bounded_gather2`. The use of a semaphore in `bounded_gather2`,
which is missing from `bounded_gather`, allows it to be used
recursively (see the sketch after this list). In particular, suppose we had a
semaphore of 50. The outer `bounded_gather2` might need 20 slots to run its 20
paths in parallel, leaving 30 slots of parallelism for its children. By passing
the semaphore down, we let our children optimistically use some of that excess
parallelism.

2. If we happen to have the `StatResult` for a particular object, we
should never look it up again. In particular, getting the `StatResult`
for every file in a directory can be done in O(1) requests, while getting the
`StatResult` for each of those files individually (using their full
paths) is necessarily O(N). If there was at least one glob and there are no
`suffix_components`, then we can use the `StatResult`s that we learned when
checking the glob pattern.
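
A heavily simplified asyncio sketch of the semaphore-sharing idea in point 1 (illustrative names only; the real `bounded_gather2` in `hailtop` takes more care, for example to guarantee progress when every slot is held by a parent):

```python
import asyncio

async def bounded_gather_sketch(sema: asyncio.Semaphore, *thunks):
    # One shared semaphore bounds total concurrency across every level of recursion.
    async def run(thunk):
        async with sema:
            return await thunk()
    return await asyncio.gather(*(run(t) for t in thunks))

async def stat_file(path: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a real API request
    return path

async def list_tree(sema: asyncio.Semaphore, directories: list) -> list:
    async def list_one(d):
        files = [f'{d}/part-{i}' for i in range(10)]  # stand-in for a real listing call
        # The nested gather reuses `sema`, so directories and files together never
        # exceed the one limit (e.g. 20 directory slots leave 30 slots for files).
        return await bounded_gather_sketch(sema, *(lambda f=f: stat_file(f) for f in files))
    nested = await bounded_gather_sketch(sema, *(lambda d=d: list_one(d) for d in directories))
    return [p for sub in nested for p in sub]

async def main():
    paths = await list_tree(asyncio.Semaphore(50), [f'dir{i}' for i in range(20)])
    print(len(paths))  # 200

asyncio.run(main())
```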

The latter point is perhaps a bit more clear with examples:

1. `gs://foo/bar/baz`. Since there are no globs, we can make exactly one
API request to list `gs://foo/bar/baz`.

2. `gs://foo/b*r/baz`. In this case, we must make one API request to
list `gs://foo/`. This gives us a list of paths under that prefix. We
check each path for conformance to the glob pattern `gs://foo/b*r`. For
any path that matches, we must then list `<the matching path>/baz` which
may itself be a directory containing files. Overall we make O(1) API
requests to do the glob and then O(K) API requests to get the final
`StatResult`s, where K is the number of paths matching the glob pattern.

3. `gs://foo/bar/b*z`. In this case, we must make one API request to
list `gs://foo/bar/`. In `main`, we then throw away the `StatResult`s we
got from that API request! Now we have to make O(K) requests to recover
those `StatResult`s for all K paths that match the glob pattern. This PR
just caches the `StatResult`s of the most recent globbing. If there is
no suffix to later append, then we can just re-use the `StatResult`s we
already have!
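
A hedged sketch of the caching in example 3 (illustrative names, not the real filesystem code):

```python
# After listing gs://foo/bar/ to expand the glob b*z, keep the StatResults we already
# received; if there is no suffix left to append, answer from the cache instead of
# issuing one stat request per matched path.
async def stat_glob_matches(fs, prefix: str, pattern, suffix_components: list):
    listed = {entry.path: entry for entry in await fs.list(prefix)}  # one API request
    matches = [p for p in listed if pattern.match(p)]
    if not suffix_components:
        return [listed[p] for p in matches]  # O(1) extra requests: reuse the cache
    return [await fs.stat('/'.join([p, *suffix_components])) for p in matches]  # O(K) requests
```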


cc: @daniel-goldstein since you've reviewed this before. Might be of
interest.
…il-is#13239)

`hailtop.batch.ServiceBackend` uses `get_user_config().get` to read the
`HAIL_BATCH_REGIONS` environment variable, when it should use
`configuration_of`. This change fixes that.
Let me know if you think this is good and whether I need to test the UI
with dev deploy.

---------

Co-authored-by: Dan King <daniel.zidan.king@gmail.com>
…hail-is#13278)

Replaces hail-is#13260.

- `test_spectral_moments` times out in a PR: (QoB)
https://hail.zulipchat.com/#narrow/stream/127527-team/topic/timeouts/near/376698259,
(spark) https://ci.hail.is/batches/7653376/jobs/74, (spark)
https://ci.hail.is/batches/7653376/jobs/72, (spark)
https://ci.hail.is/batches/7653376/jobs/62

I also backed local off to 4m even though it has shown no evidence of
timeouts. It seems simpler for Spark and local to be the same.
Some context: To make Batch feature additions safer, we now have
infrastructure to turn off new components with checkboxes on the driver
page. I forgot to add the text before the checkbox when I put it in last
week.
danking and others added 15 commits August 31, 2023 21:30
CHANGELOG: `hl.Struct` is now pickle-able.
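For example:

```python
import pickle
import hail as hl

s = hl.Struct(a=1, b='x')
assert pickle.loads(pickle.dumps(s)) == s   # round-trips cleanly
```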
CHANGELOG: Fix bug introduced in 0.2.117 by commit `c9de81108` which
prevented the passing of keyword arguments to Python jobs. This
manifested as "ValueError: too many values to unpack".

We also weren't preserving tuples; they became lists. I fixed that too.
I also added some types and avoided an `isinstance` check by encoding the
necessary knowledge.
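A hedged usage sketch of the fixed behaviour (the backend choice and setup are illustrative):

```python
import hailtop.batch as hb

def add(x, *, y=(1, 2)):
    return x + sum(y)

b = hb.Batch(backend=hb.LocalBackend(), name='kwargs-example')
j = b.new_python_job()
# Keyword arguments reach the function again (previously this path raised
# "ValueError: too many values to unpack"), and the tuple stays a tuple.
result = j.call(add, 3, y=(4, 5))
b.run()
```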
We can use this for pyright or mypy but I found both took ~1.5s even on
a tiny file like `ci.py`.

The check is slow the first time because it has to install a new venv;
however, subsequent executions can use the extant venv.
Upgrade to Ubuntu 22.04 everywhere including for the Batch Worker VMs in
Azure and Google.

---------

Co-authored-by: Sophie Parsa <parsa@wm9c6-e4e.broadinstitute.org>
Co-authored-by: Daniel Goldstein <danielgold95@gmail.com>
This inspect command prevents us from updating a tag, for example, if we
need to replace an image with a security problem or if there is a bug
like the one fixed by hail-is#13536.
I am having trouble determining the effect of this mistake, but it seems
like we would be substantially undercharging for the service fee if it
was really being charged by worker_fraction_in_1024ths instead of
core-hours.
CHANGELOG: In QoB, Hail's file systems now correctly list all files in a
directory, not just the first 1000. This could manifest in an
import_table or import_vcf which used a glob expression. In such a case,
only the first 1000 files would have been included in the resulting
Table or MatrixTable.
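
A hedged sketch of the shape of the fix (the client call below is hypothetical, not the real `GoogleStorageAsyncFS` code):

```python
async def list_all_objects(client, bucket: str, prefix: str):
    # Keep requesting pages until the listing response stops returning a page token,
    # rather than stopping after the first page of (at most) 1000 results.
    page_token = None
    while True:
        params = {'prefix': prefix, 'maxResults': 1000}
        if page_token is not None:
            params['pageToken'] = page_token
        page = await client.get(f'/b/{bucket}/o', params=params)
        for item in page.get('items', []):
            yield item
        page_token = page.get('nextPageToken')
        if page_token is None:
            return
```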

I also moved two GSFS-only tests into the FSSuite. There should be very
few tests that are cloud-specific.
I added this feature because I am tired of triggering a new CI build every
time I want to dev deploy and try out new changes. I'd prefer to put a new
label on the PR rather than close it each time or make a copy of the branch
and test the copy.
…3550)

Not a correctness bug because we raise an assertion error in the
partition function.
All these classes inherit from `Resource` which is an `abc.ABC`.
…ail-is#13131)

Deprecate hail-minted API keys in favor of using access tokens from the
identity providers already associated with user identities. For more
context and a high-level overview of the implementation, see [this
RFC](hail-is/hail-rfcs#2)
@illusional illusional changed the base branch from main to sept-23-upstream-3-1c28203 September 8, 2023 00:42
Co-authored-by: John Marshall <jmarshall@hey.com>

@jmarshall jmarshall left a comment

I've commented on all the potential problems I saw in this one.

There are also the secrets in build.yaml, where we seem to have removed some previously and some refactoring/renaming has gone on upstream. That seems like a suck-it-and-see scenario, and you know better than I do what secrets we've got set up.

@illusional illusional marked this pull request as ready for review September 19, 2023 20:57
Base automatically changed from sept-23-upstream-3-1c28203 to main September 19, 2023 21:01
@illusional illusional merged commit daecd7d into main Sep 19, 2023
@illusional illusional deleted the sept-23-upstream-4-dc1f086 branch September 19, 2023 21:21