
Upstream v0.2.113 #282

Merged 80 commits into main from upstream-113 on Apr 18, 2023

Conversation

illusional

    # Conflicts:
    #       .gitignore
    #       auth/Makefile
    #       batch/batch/driver/instance_collection/pool.py
    #       batch/batch/driver/main.py
    #       build.yaml
    #       ci/create_initial_account.py
    #       docker/Dockerfile.service-base
    #       hail/src/main/scala/is/hail/backend/service/ServiceBackend.scala
    #       hail/src/main/scala/is/hail/backend/service/Worker.scala

There were some more challenging conflicts in this release - @lgruen, could you look over the files listed above and let me know if you see anything weird?

  1. It looks like they standardised some of the params in one of the Jinja renders.


  2. There were some strange conflicts in `build.yaml` that I didn't understand. Our logs don't show that we made any changes to the section around:
```yaml
  - kind: buildImage2
    name: echo_image
    dockerFile: /io/echo/Dockerfile
    contextPath: /io/echo
    publishAs: echo
    resources:
      storage: 10Gi
      cpu: "2"
      memory: standard
    inputs:
      - from: /repo/echo
        to: /io/echo
    dependsOn:
      - hail_ubuntu_image
      - merge_code
  - kind: deploy
    name: deploy_echo
    namespace:
      valueFrom: default_ns.name
    config: echo/deployment.yaml
    dependsOn:
      - default_ns
      - echo_image
```
  3. I also noticed they made some fairly chunky changes to docker; we had changes in files that have since been removed (docker/Dockerfile.service-base).

danking and others added 30 commits March 8, 2023 17:08
CHANGELOG: Fix `hailctl hdinsight submit` to pass args to the files.

See `/batches` at
https://livy.incubator.apache.org/docs/latest/rest-api.html
I added the configuration option for the minimum number of workers that
should be present at any time. I tested this in my namespace. I'd like
you to double check the logic is correct for the number of workers
needed as I derived it by working through examples:

```python3
        n_live_instances = self.n_instances_by_state['pending'] + self.n_instances_by_state['active']
        n_standing_instances_needed = max(0, self.min_instances - self.n_instances)
        n_standing_instances_needed = min(
            n_standing_instances_needed,
            self.max_live_instances - n_live_instances,
            self.max_instances - self.n_instances,
            remaining_instances_per_autoscaler_loop,
            # 20 queries/s; our GCE long-run quota
            300,
        )
```
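For review, here is the same arithmetic with made-up numbers (hypothetical, not taken from a real pool):

```python3
# Hypothetical numbers to sanity-check the logic above (not from a real pool).
n_instances = 2                  # currently provisioned instances
n_pending, n_active = 1, 1       # instances by state
min_instances, max_instances = 8, 16
max_live_instances = 10
remaining_instances_per_autoscaler_loop = 4

n_live_instances = n_pending + n_active                            # 2
n_standing_instances_needed = max(0, min_instances - n_instances)  # 6
n_standing_instances_needed = min(
    n_standing_instances_needed,               # 6
    max_live_instances - n_live_instances,     # 8
    max_instances - n_instances,               # 14
    remaining_instances_per_autoscaler_loop,   # 4
    300,
)
print(n_standing_instances_needed)  # 4: capped by the per-loop limit
```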
CHANGELOG: Added the regions parameter to `hl.init()` for specifying
which regions Query on Batch jobs should run in.
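A minimal usage sketch of this (the region name and the explicit `backend='batch'` are illustrative):

```python3
import hail as hl

# Query-on-Batch jobs for this session will only run in the given region.
hl.init(backend='batch', regions=['us-central1'])
```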
CHANGELOG: Added the ability to update an existing batch with additional
jobs by calling `Batch.run()` more than once.

I wrote this quite a while ago, so I don't remember if there were any issues as to why I didn't make a PR sooner or if I'm missing some complexity here.
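A minimal usage sketch of this (the billing project and bucket are hypothetical):

```python3
import hailtop.batch as hb

b = hb.Batch(backend=hb.ServiceBackend(billing_project='my-billing-project',
                                       remote_tmpdir='gs://my-bucket/tmp'))
j1 = b.new_job('first')
j1.command('echo one')
b.run()  # submits the batch

j2 = b.new_job('second')
j2.command('echo two')
b.run()  # adds the new job to the existing batch instead of starting from scratch
```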
Add Python classes analogous to those in Scala.
Add printers/parsers to Scala and Python.
The only remaining references are in the datasets scripts (which are meant as references for how we created the files) and in the `hailctl dataproc` command.

I chose not to change the latter because I fear some users might still
have ancient versions of `gcloud`.

The reviewer should verify that I got the right version of the `gcloud`
command in each case.

Hopefully this resolves the bizarre error we see in test-dataproc.
Different splits (`test_hailtop_batch_0`, `test_hailtop_batch_1`, …) can
clobber each other's tmpdir.
…2775)

I'm not sure this is the "query-way" of doing these checks, but having
this feature would have saved Cal from running an expensive pipeline and
the output file "disappearing" at the end.
…ion planning (hail-is#12587)

Evaluation of relational lets is an explicit pass.

Executing and rewriting shuffles is an explicit pass.
 * lowerDistributedSort executes the shuffle and produces a TableReader

Higher-level passes that recursively lower and execute are parameterized
 by the contained pipeline.
Co-authored-by: Dan King <dking@broadinstitute.org>
hail-is#12799)

Looks like the `get_remote_tmpdir` function was made when the Query
ServiceBackend needed this config and didn't want to copy code directly
from the batch ServiceBackend. However, the batch ServiceBackend was
never changed to use the new function, so it got left behind. Aside from
deleting a lot of duplicate code, the only change is that the batch
ServiceBackend will now pick up the environment variables
`HAIL_BATCH_REMOTE_TMPDIR` and `HAIL_BATCH_BILLING_PROJECT`.
cc @tpoterba 
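A minimal sketch of the effect on the batch ServiceBackend (bucket and billing project names are hypothetical):

```python3
import os
import hailtop.batch as hb

# With these set, no explicit arguments are needed.
os.environ['HAIL_BATCH_REMOTE_TMPDIR'] = 'gs://my-bucket/tmp'    # hypothetical bucket
os.environ['HAIL_BATCH_BILLING_PROJECT'] = 'my-billing-project'  # hypothetical project

backend = hb.ServiceBackend()  # picks both values up from the environment
```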

My apologies. I made several changes to lowered logistic regression as
well.

All the generalized linear model methods share the same fit result. I
abstracted this into one datatype at the top of `statgen.py`:
`numerical_regression_fit_dtype`.
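The exact definition isn't shown here; a hypothetical shape based only on the fields mentioned in this description (coefficients, iteration count, log likelihood, convergence flag) might look like:

```python3
import hail as hl

# Hypothetical sketch, not the actual definition in statgen.py.
numerical_regression_fit_dtype = hl.tstruct(
    b=hl.tarray(hl.tfloat64),  # fitted coefficients
    n_iterations=hl.tint32,    # iterations actually taken
    log_lkhd=hl.tfloat64,      # log likelihood at the final step
    converged=hl.tbool,
)
```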

---

You'll notice I moved the cases such that we check for convergence
*before* checking if we are at the maximum iteration. It seemed to me
that:
- `max_iter == 0` means do not even attempt to fit.
- `max_iter == 1` means take one gradient step; if you've converged, return successfully, otherwise fail.
- etc.

The `main` branch currently always fails if you set `max_iter == 1`, even if the first step lands on the true maximum likelihood fit (see the sketch below).
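A minimal sketch of that ordering (not the actual `statgen.py` code):

```python3
def fit(theta, step, converged, max_iter):
    # max_iter == 0: no step is ever taken, so the fit fails immediately.
    iteration = 0
    while iteration < max_iter:
        theta = step(theta)
        iteration += 1
        # Convergence is checked *before* the iteration cap, so max_iter == 1
        # succeeds if the single step already lands on the optimum.
        if converged(theta):
            return theta, iteration, True
    return theta, iteration, False
```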

I substantially refactored logistic regression. There were dead code paths (e.g. the covariates array is known to be non-empty). I also found all the function currying and commingling of fitting and testing really confusing. To be fair, the Scala code does this (and it's really confusing). I think the current structure is easier to follow:

1. Fit the null model.
2. If wald, assume the beta for the genotypes is zero and use the rest
of the parameters from the null model fit to compute the score (i.e. the
gradient of the likelihood). Recall calculus: gradient near zero =>
value near the maximum. Return: this is the test.
3. Otherwise, fit the full model starting at the null fit parameters.
4. Test the "goodness" of this new & full fit.

---

Poisson regression is similar but with a different likelihood function
and gradient thereof. Notice that I `key_cols_by()` to indicate to Hail
that the order of the cols is irrelevant (the result is a locus-keyed
table after all). This is necessary at least until hail-is#12753 merges, but I think it's generally a good idea anyway: telling Hail that the column ordering is irrelevant is potentially useful information for the optimizer!

---

Both logistic and Poisson regression can benefit from BLAS3 by running
at least the score test for multiple variants at once.

---

I'll attach an image in the comments, but I spent ~6 seconds compiling this trivial model and ~140ms testing it.

```python3
import hail as hl
mt = hl.utils.range_matrix_table(1, 3)
mt = mt.annotate_entries(x=hl.literal([1, 3, 10, 5]))
ht = hl.poisson_regression_rows(
    'wald', y=hl.literal([0, 1, 1, 0])[mt.col_idx], x=mt.x[mt.col_idx], covariates=[1], max_iterations=2)
ht.collect()
```

I grabbed some [sample code from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PoissonRegressor.html) for Poisson regression (doing a score test rather than a wald test) and timed it. It takes ~8ms. So we're ~3 orders of magnitude off including the compiler, and ~1.2 orders of magnitude off without the compiler. Digging in a bit:
- ~65ms for class loading.
- ~15ms for region allocation.
- ~20ms in various little spots.

That leaves about 40ms strictly executing generated code, which is about 5x and starting to feel reasonable.
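For reference, a sketch of the kind of scikit-learn timing comparison described above (not the author's exact benchmark; the toy data is made up):

```python3
import time
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Toy data standing in for the single-covariate model above.
X = np.array([[1.0], [3.0], [10.0], [5.0]])
y = np.array([0, 1, 1, 0])

start = time.time()
PoissonRegressor().fit(X, y)
print(f'fit took ~{(time.time() - start) * 1000:.1f} ms')
```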
…are empty (hail-is#12797)

Using changePartitionerNoRepartition with a partitioner with a different
number of partitions cannot be correct and will result in dropped data.
Just construct the new partitioner (based on row index, so it will
comply with invariants) directly.
A bunch of stuff:
1. grab disabled status, ignore them in most places, show disabledness
when printing
2. Handle the case where a service account has no active key
3. Remove loop confirmation (I can just ctrl-c)

Future work:
We probably do need a bulk update thing, but this removes enough annoyance for me for now.
This caused assertion errors in Scala in `MakeArray.unify`.
I sometimes iterate quickly with `pytest -k testname` from the
`hail/python` directory. I cannot do that if the test directory is hard
coded.
The most trivial loop is the loop which does not recur. I added a simple
test and fixed Requiredness to handle this case.
…ch (hail-is#12801)

CHANGELOG: Hitting CTRL-C while interactively using Batch or
Query-on-Batch cancels the underlying batch.
The compiler still catches this at runtime, but it is hard to understand
because it lacks a meaningful stacktrace.
danking and others added 24 commits April 4, 2023 07:05
…d local backend jobs (hail-is#12845)

CHANGELOG: The Batch LocalBackend now sets the working directory for
dockerized jobs to the root directory instead of the temp directory

Tested with the following:

```
In [1]: import hailtop.batch as hb
   ...: b = hb.Batch(backend=hb.LocalBackend())
   ...: j = b.new_job()
   ...: j.image('ubuntu:20.04')
   ...: j.command('pwd')
   ...: b.run()
   ...:
   ...:
/
Batch completed successfully!
```
…is#12841)

Got annoyed with the constant re-tagging of images that don't need to be
rebuilt, and decided to play a little make golf along the way.

cc @jigold This should dramatically reduce the number of tags for
hail-ubuntu from make-deployed images, though the number of layers in
the container registry should not change.
CHANGELOG: `hl.variant_qc` no longer requires a `locus` field.

Fixes hail-is#12844.
)

I develop across a couple of OSes and my username isn't always the same. I'd like to expose this make variable so that I can push images to the same location.
CHANGELOG: In Query-on-Batch, `hl.logistic_regression('firth', ...)` is
now supported.
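A minimal usage sketch (toy data; the full function name is `hl.logistic_regression_rows`):

```python3
import hail as hl

mt = hl.balding_nichols_model(n_populations=1, n_samples=100, n_variants=20)
mt = mt.annotate_cols(pheno=hl.float(hl.rand_bool(0.5)))  # random binary phenotype
ht = hl.logistic_regression_rows(
    'firth', y=mt.pheno, x=mt.GT.n_alt_alleles(), covariates=[1.0])
ht.show()
```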

Forgive me: I cleaned up and unified the look of the (now three) `fit`
methods.

A few of the sweeping cleanups:
1. num_iter, max_iter, and cur_iter are now n_iterations,
max_iterations, and iteration.
2. Pervasive use of broadcasting functions rather than map.
3. `log_lkhd` is only evaluated on the last iteration (in particular, it's not bound before the `case`).
4. `select` as the last step rather than `drop` (we didn't drop all the
unnecessary fields previously).
5. `select_globals` to make sure we only keep the `null_fit`.

Major changes in this PR:
1. Add no_crash to triangular_solve
2. Split the epacts tests by type (now easy to run all firth tests with
`-k firth`).
3. Firth fitting and test. A straight copy from Scala. I honestly don't
understand what this is doing.
…ctions (hail-is#12854)

CHANGELOG: In Query-on-Batch, simple pipelines with large numbers of
partitions should be substantially faster.

:facepalm:
CHANGELOG: In Query-on-Batch, allow writing to requester pays buckets,
which was broken before this release.
Is there a better way to communicate the Query logging bug and its fix?
Merge of hail-is#12868 and
hail-is#12867 because I expect each is
likely to fail due to the other's error.

---

    [qob] retry `storage.writer` and `storage.reader`

I do not think we frequently get errors in `storage.reader`, but I think `storage.writer` was always flaky and we were protected by the `retryTransientErrors` on `createNoCompression`. My change to fix requester pays delayed the error until either the first `write` or the `close`, which do not have a `retryTransientErrors` (and it is not obvious to me that it is safe to retry a `flush`).

---

    [qob] retry transient errors reading the results file

---
I left the changes to Query and Batch in separate commits for ease of review. I put these in the same PR because we don't really have standalone testing for JVM jobs outside of Query-on-Batch, so the FASTA use case serves as a test here that cloudfuse is working properly for JVM jobs. It would be great if Jackie could review the batch commit and Tim the query commit.

## Hail Query
- Added support for the `FROM_FASTA_FILE` rpc and the service backend
now passes sequence file information from RGs in every rpc
- Refactored the liftover handling in service_backend to not redundantly
store liftover maps and just take them from the ReferenceGenome objects
like I did for sequence files. This means that add/remove
liftover/sequence functions on the Backend are just intended to sync up
the backend with python, which is a no-op for the service backend.
- Don't localize the index file on fromFASTAFile/addSequence before
creating the index object. `FastaSequenceIndex` just loads the whole
file on construction so might as well stream it in from whatever storage
it's in.
- FASTA caching is left alone because those files will be mounted and
unmounted from the jvm container over the life of the job. JVM doesn't
have to worry about disk usage because that's handled by Batch XFS
quotas, so long as the service backend requests enough storage to fit
the FASTA file. Batch will make sure that a given bucket (and therefore
a given FASTA file) is mounted once per-user on a batch worker.
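For context, a usage sketch of the Python-level sequence API this feeds (paths are hypothetical):

```python3
import hail as hl

rg = hl.get_reference('GRCh38')
rg.add_sequence(
    'gs://my-bucket/GRCh38.fasta.gz',      # hypothetical FASTA path
    'gs://my-bucket/GRCh38.fasta.gz.fai',  # hypothetical index path
)
```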


## Hail Batch
- Added support for read-only cloudfuse mounts for JVM jobs
- These mounts are shared between jobs on the same machine from the same
user
- I did not change DockerJobs, but they could be very easily adapted to
use this new mount-sharing code.
…l-is#12818)

- On *deploys*, makes sure that whatever is in our third-party images is in our private registry before starting builds like hail-ubuntu that might depend on those images. This means that we can update our ubuntu base image without the Australians needing to deploy any images by hand. However, this does not run in PRs because 1) I didn't want to add that kind of latency for PRs, and 2) we don't do any kind of namespacing for our images, so if we did include this for a PR that ultimately wasn't merged we would have to manually remove the image anyway, so why not manually add it if you're going to PR it… I think point 2 is a little weak, but I recall this being what we agreed on a couple of months back when we discussed this. I'm wondering if we should just eat the minute or so of latency at the beginning of PRs to be safe, but it also feels like a shame for something that changes so infrequently.

- Again on deploys, upload the hailgenetics/* images to the private
registry if they don't already exist there. This way any deployments
that aren't hail team's GCP deployment can get these images
automatically when they deploy a new SHA instead of uploading them
manually. It won't backfill skipped versions, but we decided that was
ok. This seems less relevant for testing on PRs as it will get triggered
on releases and we can easily dev deploy to rectify the image if this
breaks.
…batch tests (hail-is#12862)

Trying to make it more ergonomic to simply do `python3 -m pytest
batch/test/test_batch.py::test_job` (now works without any extra
environment variables or configuration).

This involved the following changes:
- Deleted some env vars that are no longer used or can be easily consolidated into existing ones
- Gave defaults to those testing env variables for which there are
reasonable defaults. E.g. `DOCKER_ROOT_IMAGE` and
`HAIL_GENETICS_HAIL_IMAGE`.
- Pushed other environment variables for which there are not reasonable
defaults into the tests that need them. If you run a test that requires
`HAIL_CLOUD`, you'll still get an error that that env variable is unset
and you should set it. But if you just want to run a single test that doesn't need `HAIL_CLOUD`, it won't get in the way.
…ing (hail-is#12863)

This fixes a bug Mike W was hitting with a gnomAD pipeline. I wasn't
able to replicate with a trivial example, and would prefer not to spend
another half-day poring through that huge IR to try to minimize.
Verified that this fixes the problem, though.


https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/.E2.9C.94.20Summing.20number.20of.20entries.20meeting.20criteria.20.20by.20group
I also cleaned up a duplicate errno entry.
…is#12866)

I noticed that this step blew up in the benchmarking PR and thought I'd
provide a better error message.
The `test_dataproc` steps need what is in ci-utils, but also the tools in base_image to build hail. Tested with a dev deploy of `test_dataproc-37`.
…-113

# Conflicts:
#	.gitignore
#	auth/Makefile
#	batch/batch/driver/instance_collection/pool.py
#	batch/batch/driver/main.py
#	build.yaml
#	ci/create_initial_account.py
#	docker/Dockerfile.service-base
#	hail/src/main/scala/is/hail/backend/service/ServiceBackend.scala
#	hail/src/main/scala/is/hail/backend/service/Worker.scala
@illusional illusional requested a review from lgruen April 12, 2023 00:58

@lgruen lgruen left a comment


That all looks good to me, but the build.yaml changes in particular are really hard to completely grok. Could we try a dev deploy first?

Comment on lines +204 to +212

```
f'''Found a user ({self.username()}) without a unique active key in Kubernetes.
The known IAM keys are:
{known_iam_keys_str}

These keys are present in Kubernetes:
{keys_to_k8s_secret_str}

We will assume {kube_key.id} is the active key.
'''
```

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information

This expression logs [sensitive data (secret)](1) as clear text. This expression logs [sensitive data (secret)](2) as clear text.
@illusional illusional merged commit f5875e2 into main Apr 18, 2023
@illusional illusional deleted the upstream-113 branch April 18, 2023 05:11