
Upstream v0.2.113 #282

Merged 80 commits into main from upstream-113 on Apr 18, 2023

Conversation

illusional

    # Conflicts:
    #       .gitignore
    #       auth/Makefile
    #       batch/batch/driver/instance_collection/pool.py
    #       batch/batch/driver/main.py
    #       build.yaml
    #       ci/create_initial_account.py
    #       docker/Dockerfile.service-base
    #       hail/src/main/scala/is/hail/backend/service/ServiceBackend.scala
    #       hail/src/main/scala/is/hail/backend/service/Worker.scala

There were some more challenging conflicts in this release - @lgruen, could you look over the files listed above and let me know if you see anything weird?

  1. It looks like they standardised some of the params in one of the Jinja renders.


  2. There were some strange conflicts in `build.yaml` that I didn't understand. Our logs don't show that we made any changes to the section around:
```yaml
  - kind: buildImage2
    name: echo_image
    dockerFile: /io/echo/Dockerfile
    contextPath: /io/echo
    publishAs: echo
    resources:
      storage: 10Gi
      cpu: "2"
      memory: standard
    inputs:
      - from: /repo/echo
        to: /io/echo
    dependsOn:
      - hail_ubuntu_image
      - merge_code
  - kind: deploy
    name: deploy_echo
    namespace:
      valueFrom: default_ns.name
    config: echo/deployment.yaml
    dependsOn:
      - default_ns
      - echo_image
```
  3. I also noticed they made some fairly chunky changes to docker; we had changes in files that have since been removed (docker/Dockerfile.service-base).

danking and others added 30 commits March 8, 2023 17:08
CHANGELOG: Fix `hailctl hdinsight submit` to pass args to the files.

See `/batches` at
https://livy.incubator.apache.org/docs/latest/rest-api.html
I added the configuration option for the minimum number of workers that
should be present at any time. I tested this in my namespace. I'd like
you to double check the logic is correct for the number of workers
needed as I derived it by working through examples:

```python3
        n_live_instances = self.n_instances_by_state['pending'] + self.n_instances_by_state['active']
        n_standing_instances_needed = max(0, self.min_instances - self.n_instances)
        n_standing_instances_needed = min(
            n_standing_instances_needed,
            self.max_live_instances - n_live_instances,
            self.max_instances - self.n_instances,
            remaining_instances_per_autoscaler_loop,
            # 20 queries/s; our GCE long-run quota
            300,
        )
```
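For review, here is the same arithmetic with made-up numbers (hypothetical, not taken from a real pool):

```python3
# Hypothetical numbers to sanity-check the logic above (not from a real pool).
n_instances = 2                  # currently provisioned instances
n_pending, n_active = 1, 1       # instances by state
min_instances, max_instances = 8, 16
max_live_instances = 10
remaining_instances_per_autoscaler_loop = 4

n_live_instances = n_pending + n_active                            # 2
n_standing_instances_needed = max(0, min_instances - n_instances)  # 6
n_standing_instances_needed = min(
    n_standing_instances_needed,               # 6
    max_live_instances - n_live_instances,     # 8
    max_instances - n_instances,               # 14
    remaining_instances_per_autoscaler_loop,   # 4
    300,
)
print(n_standing_instances_needed)  # 4: capped by the per-loop limit
```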
CHANGELOG: Added the regions parameter to `hl.init()` for specifying
which regions Query on Batch jobs should run in.
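A minimal usage sketch of this (the region name and the explicit `backend='batch'` are illustrative):

```python3
import hail as hl

# Query-on-Batch jobs for this session will only run in the given region.
hl.init(backend='batch', regions=['us-central1'])
```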
CHANGELOG: Added the ability to update an existing batch with additional
jobs by calling `Batch.run()` more than once.

I wrote this quite a while ago, so I don't remember if there were any issues as to why I didn't make a PR sooner or if I'm missing some complexity here.
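A minimal usage sketch of this (the billing project and bucket are hypothetical):

```python3
import hailtop.batch as hb

b = hb.Batch(backend=hb.ServiceBackend(billing_project='my-billing-project',
                                       remote_tmpdir='gs://my-bucket/tmp'))
j1 = b.new_job('first')
j1.command('echo one')
b.run()  # submits the batch

j2 = b.new_job('second')
j2.command('echo two')
b.run()  # adds the new job to the existing batch instead of starting from scratch
```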
Add Python classes analogous to those in Scala.
Add printers/parsers to Scala and Python.
The only remaining references are in the datasets scripts (which are meant as references for how we created the files) and in the `hailctl dataproc` command.

I chose not to change the latter because I fear some users might still
have ancient versions of `gcloud`.

The reviewer should verify that I got the right version of the `gcloud`
command in each case.

Hopefully this resolves the bizarre error we see in test-dataproc.
Different splits (`test_hailtop_batch_0`, `test_hailtop_batch_1`, …) can
clobber each other's tmpdir.
…2775)

I'm not sure this is the "query-way" of doing these checks, but having
this feature would have saved Cal from running an expensive pipeline and
the output file "disappearing" at the end.
…ion planning (hail-is#12587)

Evaluation of relational lets is an explicit pass.

Executing and rewriting shuffles is an explicit pass.
 * lowerDistributedSort executes the shuffle and produces a TableReader

Higher-level passes that recursively lower and execute are parameterized
 by the contained pipeline.
Co-authored-by: Dan King <dking@broadinstitute.org>
hail-is#12799)

Looks like the `get_remote_tmpdir` function was made when the Query
ServiceBackend needed this config and didn't want to copy code directly
from the batch ServiceBackend. However, the batch ServiceBackend was
never changed to use the new function, so it got left behind. Aside from
deleting a lot of duplicate code, the only change is that the batch
ServiceBackend will now pick up the environment variables
`HAIL_BATCH_REMOTE_TMPDIR` and `HAIL_BATCH_BILLING_PROJECT`.
cc @tpoterba 
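A minimal sketch of the effect on the batch ServiceBackend (bucket and billing project names are hypothetical):

```python3
import os
import hailtop.batch as hb

# With these set, no explicit arguments are needed.
os.environ['HAIL_BATCH_REMOTE_TMPDIR'] = 'gs://my-bucket/tmp'    # hypothetical bucket
os.environ['HAIL_BATCH_BILLING_PROJECT'] = 'my-billing-project'  # hypothetical project

backend = hb.ServiceBackend()  # picks both values up from the environment
```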

My apologies. I made several changes to lowered logistic regression as
well.

All the generalized linear model methods share the same fit result. I
abstracted this into one datatype at the top of `statgen.py`:
`numerical_regression_fit_dtype`.
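The exact definition isn't shown here; a hypothetical shape based only on the fields mentioned in this description (coefficients, iteration count, log likelihood, convergence flag) might look like:

```python3
import hail as hl

# Hypothetical sketch, not the actual definition in statgen.py.
numerical_regression_fit_dtype = hl.tstruct(
    b=hl.tarray(hl.tfloat64),  # fitted coefficients
    n_iterations=hl.tint32,    # iterations actually taken
    log_lkhd=hl.tfloat64,      # log likelihood at the final step
    converged=hl.tbool,
)
```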

---

You'll notice I moved the cases such that we check for convergence
*before* checking if we are at the maximum iteration. It seemed to me
that:
- `max_iter == 0` means do not even attempt to fit.
- `max_iter == 1` means take one gradient step; if you've converged, return successfully, otherwise fail.
- etc.

The `main` branch currently always fails if you set `max_iter == 1`, even if the first step lands on the true maximum likelihood fit (see the sketch below).
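A minimal sketch of that ordering (not the actual `statgen.py` code):

```python3
def fit(theta, step, converged, max_iter):
    # max_iter == 0: no step is ever taken, so the fit fails immediately.
    iteration = 0
    while iteration < max_iter:
        theta = step(theta)
        iteration += 1
        # Convergence is checked *before* the iteration cap, so max_iter == 1
        # succeeds if the single step already lands on the optimum.
        if converged(theta):
            return theta, iteration, True
    return theta, iteration, False
```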

I substantially refactored logistic regression. There were dead code paths (e.g. the covariates array is known to be non-empty). I also found all the function currying and commingling of fitting and testing really confusing. To be fair, the Scala code does this (and it's really confusing). I think the current structure is easier to follow:

1. Fit the null model.
2. If wald, assume the beta for the genotypes is zero and use the rest
of the parameters from the null model fit to compute the score (i.e. the
gradient of the likelihood). Recall calculus: gradient near zero =>
value near the maximum. Return: this is the test.
3. Otherwise, fit the full model starting at the null fit parameters.
4. Test the "goodness" of this new & full fit.

---

Poisson regression is similar but with a different likelihood function
and gradient thereof. Notice that I `key_cols_by()` to indicate to Hail
that the order of the cols is irrelevant (the result is a locus-keyed
table after all). This is necessary at least until hail-is#12753 merges, but I think it's generally a good idea anyway: telling Hail that the column ordering is irrelevant is potentially useful information for the optimizer!

---

Both logistic and Poisson regression can benefit from BLAS3 by running
at least the score test for multiple variants at once.

---

I'll attach an image in the comments, but I spent ~6 seconds compiling this trivial model and ~140ms testing it.

```python3
import hail as hl
mt = hl.utils.range_matrix_table(1, 3)
mt = mt.annotate_entries(x=hl.literal([1, 3, 10, 5]))
ht = hl.poisson_regression_rows(
    'wald', y=hl.literal([0, 1, 1, 0])[mt.col_idx], x=mt.x[mt.col_idx], covariates=[1], max_iterations=2)
ht.collect()
```

I grabbed some [sample code from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PoissonRegressor.html) for Poisson regression (doing a score test rather than a wald test) and timed it. It takes ~8ms. So we're ~3 orders of magnitude off including the compiler, and ~1.2 orders of magnitude off without the compiler. Digging in a bit:
- ~65ms for class loading.
- ~15ms for region allocation.
- ~20ms in various little spots.

That leaves about 40ms strictly executing generated code, which is about 5x and starting to feel reasonable.
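For reference, a sketch of the kind of scikit-learn timing comparison described above (not the author's exact benchmark; the toy data is made up):

```python3
import time
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Toy data standing in for the single-covariate model above.
X = np.array([[1.0], [3.0], [10.0], [5.0]])
y = np.array([0, 1, 1, 0])

start = time.time()
PoissonRegressor().fit(X, y)
print(f'fit took ~{(time.time() - start) * 1000:.1f} ms')
```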
…are empty (hail-is#12797)

Using changePartitionerNoRepartition with a partitioner with a different
number of partitions cannot be correct and will result in dropped data.
Just construct the new partitioner (based on row index, so it will
comply with invariants) directly.
A bunch of stuff:
1. grab disabled status, ignore them in most places, show disabledness
when printing
2. Handle the case where a service account has no active key
3. Remove loop confirmation (I can just ctrl-c)

Future work:
We probably do need a bulk update thing, but this removes enough annoyance for me for now.
This caused assertion errors in Scala in `MakeArray.unify`.
I sometimes iterate quickly with `pytest -k testname` from the
`hail/python` directory. I cannot do that if the test directory is hard
coded.
The most trivial loop is the loop which does not recur. I added a simple
test and fixed Requiredness to handle this case.
…ch (hail-is#12801)

CHANGELOG: Hitting CTRL-C while interactively using Batch or
Query-on-Batch cancels the underlying batch.
The compiler still catches this at runtime, but it is hard to understand
because it lacks a meaningful stacktrace.
danking and others added 24 commits April 4, 2023 07:05
…d local backend jobs (hail-is#12845)

CHANGELOG: The Batch LocalBackend now sets the working directory for
dockerized jobs to the root directory instead of the temp directory

Tested with the following:

```
In [1]: import hailtop.batch as hb
   ...: b = hb.Batch(backend=hb.LocalBackend())
   ...: j = b.new_job()
   ...: j.image('ubuntu:20.04')
   ...: j.command('pwd')
   ...: b.run()
   ...:
   ...:
/
Batch completed successfully!
```
…is#12841)

Got annoyed with the constant re-tagging of images that don't need to be
rebuilt, and decided to play a little make golf along the way.

cc @jigold This should dramatically reduce the number of tags for
hail-ubuntu from make-deployed images, though the number of layers in
the container registry should not change.
CHANGELOG: `hl.variant_qc` no longer requires a `locus` field.

Fixes hail-is#12844.
)

I develop across a couple of OSes and my username isn't always the same. I'd like to expose this make variable so that I can push images to the same location.
CHANGELOG: In Query-on-Batch, `hl.logistic_regression('firth', ...)` is
now supported.
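A minimal usage sketch (toy data; the full function name is `hl.logistic_regression_rows`):

```python3
import hail as hl

mt = hl.balding_nichols_model(n_populations=1, n_samples=100, n_variants=20)
mt = mt.annotate_cols(pheno=hl.float(hl.rand_bool(0.5)))  # random binary phenotype
ht = hl.logistic_regression_rows(
    'firth', y=mt.pheno, x=mt.GT.n_alt_alleles(), covariates=[1.0])
ht.show()
```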

Forgive me: I cleaned up and unified the look of the (now three) `fit`
methods.

A few of the sweeping cleanups:
1. num_iter, max_iter, and cur_iter are now n_iterations,
max_iterations, and iteration.
2. Pervasive use of broadcasting functions rather than map.
3. `log_lkhd` is only evaluated on the last iteration (in particular, it's not bound before the `case`).
4. `select` as the last step rather than `drop` (we didn't drop all the
unnecessary fields previously).
5. `select_globals` to make sure we only keep the `null_fit`.

Major changes in this PR:
1. Add no_crash to triangular_solve
2. Split the epacts tests by type (now easy to run all firth tests with
`-k firth`).
3. Firth fitting and test. A straight copy from Scala. I honestly don't
understand what this is doing.
…ctions (hail-is#12854)

CHANGELOG: In Query-on-Batch, simple pipelines with large numbers of
partitions should be substantially faster.

:facepalm:
CHANGELOG: In Query-on-Batch, allow writing to requester pays buckets,
which was broken before this release.
Is there a better way to communicate the Query logging bug and its fix?
Merge of hail-is#12868 and
hail-is#12867 because I expect each is
likely to fail due to the other's error.

---

    [qob] retry `storage.writer` and `storage.reader`

I do not think we frequently get errors in `storage.reader`, but I think `storage.writer` was always flaky and we were protected by the `retryTransientErrors` on `createNoCompression`. My change to fix requester pays delayed the error until either the first `write` or the `close`, which do not have a `retryTransientErrors` (and it is not obvious to me that it is safe to retry a `flush`).

---

    [qob] retry transient errors reading the results file

---
I left the changes to Query and Batch in separate commits for ease of review. I put these in the same PR because we don't really have standalone testing for JVM jobs outside of Query-on-Batch, so the FASTA use case serves as a test here that cloudfuse is working properly for JVM jobs. It would be great if Jackie could review the batch commit and Tim the query commit.

## Hail Query
- Added support for the `FROM_FASTA_FILE` rpc and the service backend
now passes sequence file information from RGs in every rpc
- Refactored the liftover handling in service_backend to not redundantly
store liftover maps and just take them from the ReferenceGenome objects
like I did for sequence files. This means that add/remove
liftover/sequence functions on the Backend are just intended to sync up
the backend with python, which is a no-op for the service backend.
- Don't localize the index file on fromFASTAFile/addSequence before
creating the index object. `FastaSequenceIndex` just loads the whole
file on construction so might as well stream it in from whatever storage
it's in.
- FASTA caching is left alone because those files will be mounted and
unmounted from the jvm container over the life of the job. JVM doesn't
have to worry about disk usage because that's handled by Batch XFS
quotas, so long as the service backend requests enough storage to fit
the FASTA file. Batch will make sure that a given bucket (and therefore
a given FASTA file) is mounted once per-user on a batch worker.
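For context, a usage sketch of the Python-level sequence API this feeds (paths are hypothetical):

```python3
import hail as hl

rg = hl.get_reference('GRCh38')
rg.add_sequence(
    'gs://my-bucket/GRCh38.fasta.gz',      # hypothetical FASTA path
    'gs://my-bucket/GRCh38.fasta.gz.fai',  # hypothetical index path
)
```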


## Hail Batch
- Added support for read-only cloudfuse mounts for JVM jobs
- These mounts are shared between jobs on the same machine from the same
user
- I did not change DockerJobs, but they could be very easily adapted to
use this new mount-sharing code.
…l-is#12818)

- On *deploys*, makes sure that whatever is in our third-party images is in our private registry before starting builds like hail-ubuntu that might depend on those images. This means that we can update our ubuntu base image without the Australians needing to deploy any images by hand. However, this does not run in PRs because 1) I didn't want to add that kind of latency for PRs, and 2) we don't do any kind of namespacing for our images, so if we did include this for a PR that ultimately wasn't merged we would have to manually remove the image anyway, so why not manually add it if you're going to PR it… I think point 2 is a little weak, but I recall this being what we agreed on a couple of months back when we discussed this. I'm wondering if we should just eat the minute or so of latency at the beginning of PRs to be safe, but it also feels like a shame for something that changes so infrequently.

- Again on deploys, upload the hailgenetics/* images to the private
registry if they don't already exist there. This way any deployments
that aren't hail team's GCP deployment can get these images
automatically when they deploy a new SHA instead of uploading them
manually. It won't backfill skipped versions, but we decided that was
ok. This seems less relevant for testing on PRs as it will get triggered
on releases and we can easily dev deploy to rectify the image if this
breaks.
…batch tests (hail-is#12862)

Trying to make it more ergonomic to simply do `python3 -m pytest
batch/test/test_batch.py::test_job` (now works without any extra
environment variables or configuration).

This involved the following changes:
- Deleted some env vars that are no longer used or can be easily consolidated into existing ones
- Gave defaults to those testing env variables for which there are
reasonable defaults. E.g. `DOCKER_ROOT_IMAGE` and
`HAIL_GENETICS_HAIL_IMAGE`.
- Pushed other environment variables for which there are not reasonable
defaults into the tests that need them. If you run a test that requires
`HAIL_CLOUD`, you'll still get an error that that env variable is unset
and you should set it. But if you just want to run a single test that doesn't need `HAIL_CLOUD`, it won't get in the way.
…ing (hail-is#12863)

This fixes a bug Mike W was hitting with a gnomAD pipeline. I wasn't
able to replicate with a trivial example, and would prefer not to spend
another half-day poring through that huge IR to try to minimize.
Verified that this fixes the problem, though.


https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/.E2.9C.94.20Summing.20number.20of.20entries.20meeting.20criteria.20.20by.20group
I also cleaned up a duplicate errno entry.
…is#12866)

I noticed that this step blew up in the benchmarking PR and thought I'd
provide a better error message.
The `test_dataproc` steps need what is in ci-utils, but also the tools in base_image to build hail. Tested with a dev deploy of `test_dataproc-37`.
…-113

# Conflicts:
#	.gitignore
#	auth/Makefile
#	batch/batch/driver/instance_collection/pool.py
#	batch/batch/driver/main.py
#	build.yaml
#	ci/create_initial_account.py
#	docker/Dockerfile.service-base
#	hail/src/main/scala/is/hail/backend/service/ServiceBackend.scala
#	hail/src/main/scala/is/hail/backend/service/Worker.scala
@illusional illusional requested a review from lgruen April 12, 2023 00:58

@lgruen lgruen left a comment


That all looks good to me, but the build.yaml changes in particular are really hard to completely grok. Could we try a dev deploy first?

Comment on lines +204 to +212

```
f'''Found a user ({self.username()}) without a unique active key in Kubernetes.
The known IAM keys are:
{known_iam_keys_str}

These keys are present in Kubernetes:
{keys_to_k8s_secret_str}

We will assume {kube_key.id} is the active key.
'''
```

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information

This expression logs [sensitive data (secret)](1) as clear text. This expression logs [sensitive data (secret)](2) as clear text.
@illusional illusional merged commit f5875e2 into main Apr 18, 2023
@illusional illusional deleted the upstream-113 branch April 18, 2023 05:11