Upstream v0.2.113 #282
CHANGELOG: Fix `hailctl hdinsight submit` to pass args to the files. See `/batches` at https://livy.incubator.apache.org/docs/latest/rest-api.html
I added the configuration option for the minimum number of workers that should be present at any time. I tested this in my namespace. I'd like you to double-check that the logic for the number of workers needed is correct, as I derived it by working through examples:

```python3
n_live_instances = self.n_instances_by_state['pending'] + self.n_instances_by_state['active']
n_standing_instances_needed = max(0, self.min_instances - self.n_instances)
n_standing_instances_needed = min(
    n_standing_instances_needed,
    self.max_live_instances - n_live_instances,
    self.max_instances - self.n_instances,
    remaining_instances_per_autoscaler_loop,
    # 20 queries/s; our GCE long-run quota
    300,
)
```
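As a sanity check on the clamping logic, here is a standalone pure-Python re-derivation with hypothetical numbers; the function and its arguments mirror the snippet above, but this is not the actual pool code.

```python3
# Hypothetical re-derivation of the standing-worker clamp; all values made up.
def standing_instances_needed(n_pending, n_active, n_instances,
                              min_instances, max_instances,
                              max_live_instances,
                              remaining_instances_per_autoscaler_loop):
    n_live = n_pending + n_active
    needed = max(0, min_instances - n_instances)
    return min(
        needed,
        max_live_instances - n_live,
        max_instances - n_instances,
        remaining_instances_per_autoscaler_loop,
        300,  # 20 queries/s; the GCE long-run quota mentioned above
    )

# Below the floor of 5: spin up exactly enough to reach min_instances.
assert standing_instances_needed(0, 2, 2, 5, 100, 50, 20) == 3
# Already at or above the floor: nothing to add.
assert standing_instances_needed(0, 10, 10, 5, 100, 50, 20) == 0
```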
CHANGELOG: Added the regions parameter to `hl.init()` for specifying which regions Query on Batch jobs should run in.
CHANGELOG: Added the ability to update an existing batch with additional jobs by calling `Batch.run()` more than once. I wrote this quite a while ago, so I don't remember whether there were issues that kept me from making a PR sooner or whether I'm missing some complexity here.
Add Python classes analogous to those in Scala. Add printers/parsers to Scala and Python.
The only remaining references are in the datasets scripts, but those are meant as references of how we created the files, and in the `hailctl dataproc` command. I chose not to change the latter because I fear some users might still have ancient versions of `gcloud`. The reviewer should verify that I got the right version of the `gcloud` command in each case. Hopefully this resolves the bizarre error we see in test-dataproc.
Different splits (`test_hailtop_batch_0`, `test_hailtop_batch_1`, …) can clobber each other's tmpdir.
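A common fix for this kind of clobbering is to derive a per-split temp directory. This is only a hypothetical sketch of the pattern (the `TEST_SPLIT` env var name is made up), not the actual change:

```python3
import os
import tempfile

# Derive a tmpdir unique to this test split so parallel splits
# (test_hailtop_batch_0, test_hailtop_batch_1, ...) cannot collide.
split = os.environ.get('TEST_SPLIT', 'test_hailtop_batch_0')
tmpdir = os.path.join(tempfile.gettempdir(), split)
os.makedirs(tmpdir, exist_ok=True)
```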
…2775) I'm not sure this is the "query-way" of doing these checks, but having this feature would have saved Cal from running an expensive pipeline only to have the output file "disappear" at the end.
…ion planning (hail-is#12587)
- Evaluation of relational lets is an explicit pass.
- Executing and rewriting shuffles is an explicit pass.
- lowerDistributedSort executes the shuffle and produces a TableReader.
- Higher-level passes that recursively lower and execute are parameterized by the contained pipeline.
Co-authored-by: Dan King <dking@broadinstitute.org>
hail-is#12799) Looks like the `get_remote_tmpdir` function was made when the Query ServiceBackend needed this config and didn't want to copy code directly from the batch ServiceBackend. However, the batch ServiceBackend was never changed to use the new functions, so it got left behind. Aside from deleting a lot of duplicate code, the only change is that the batch ServiceBackend will now pick up the following environment variables: `HAIL_BATCH_REMOTE_TMPDIR` and `HAIL_BATCH_BILLING_PROJECT`.
cc @tpoterba My apologies: I made several changes to lowered logistic regression as well. All the generalized linear model methods share the same fit result; I abstracted this into one datatype at the top of `statgen.py`: `numerical_regression_fit_dtype`.

---

You'll notice I moved the cases such that we check for convergence *before* checking if we are at the maximum iteration. It seemed to me that:

- `max_iter == 0` means do not even attempt to fit.
- `max_iter == 1` means take one gradient step; if you've converged, return successfully, otherwise fail.
- etc.

The `main` branch currently always fails if you set `max_iter == 1`, even if the first step lands on the true maximum likelihood fit.

I substantially refactored logistic regression. There were dead code paths (e.g. the covariates array is known to be non-empty). I also found all the function currying and commingling of fitting and testing really confusing. To be fair, the Scala code does this (and it's really confusing). I think the current structure is easier to follow:

1. Fit the null model.
2. If wald, assume the beta for the genotypes is zero and use the rest of the parameters from the null model fit to compute the score (i.e. the gradient of the likelihood). Recall calculus: gradient near zero => value near the maximum. Return: this is the test.
3. Otherwise, fit the full model starting at the null fit parameters.
4. Test the "goodness" of this new, full fit.

---

Poisson regression is similar, but with a different likelihood function and gradient thereof. Notice that I `key_cols_by()` to indicate to Hail that the order of the cols is irrelevant (the result is a locus-keyed table, after all). This is necessary at least until hail-is#12753 merges. I think it's generally a good idea though: it indicates to Hail that the ordering of the columns is irrelevant, which is potentially useful information for the optimizer!
---

Both logistic and Poisson regression can benefit from BLAS3 by running at least the score test for multiple variants at once.

---

I'll attach an image in the comments, but I spent ~6 seconds compiling this trivial model and ~140ms testing it.

```python3
import hail as hl
mt = hl.utils.range_matrix_table(1, 3)
mt = mt.annotate_entries(x=hl.literal([1, 3, 10, 5]))
ht = hl.poisson_regression_rows(
    'wald',
    y=hl.literal([0, 1, 1, 0])[mt.col_idx],
    x=mt.x[mt.col_idx],
    covariates=[1],
    max_iterations=2)
ht.collect()
```

I grabbed some [sample code from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PoissonRegressor.html) for Poisson regression (doing a score test rather than a wald test) and timed it. It takes ~8ms. So we're ~3 orders of magnitude off including the compiler, and ~1.2 orders of magnitude off without the compiler. Digging in a bit:

- ~65ms for class loading.
- ~15ms for region allocation.
- ~20ms in various little spots.

That leaves about 40ms strictly executing generated code, which is about 5x and starting to feel reasonable.
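The `max_iter` semantics argued for above can be illustrated with a tiny pure-Python Newton iteration standing in for the real likelihood fit; this is purely illustrative, not Hail's implementation.

```python3
# Convergence is checked *before* the iteration cap, so max_iter == 1
# succeeds whenever the first step already lands within tolerance.
# Newton's method for sqrt(a) stands in for the real likelihood fit.
def newton_sqrt(a, x0, max_iter, tol=1e-10):
    x = x0
    for iteration in range(max_iter):
        x_new = 0.5 * (x + a / x)
        if abs(x_new - x) < tol:  # converged: return before hitting the cap
            return x_new, True
        x = x_new
    return x, False  # ran out of iterations without converging

# Starting exactly at the answer: one step converges, even with max_iter == 1.
x, converged = newton_sqrt(4.0, 2.0, max_iter=1)
assert converged and abs(x - 2.0) < 1e-9

# Starting far away with max_iter == 1: the fit correctly reports failure.
_, converged = newton_sqrt(4.0, 100.0, max_iter=1)
assert not converged
```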
…are empty (hail-is#12797) Using changePartitionerNoRepartition with a partitioner with a different number of partitions cannot be correct and will result in dropped data. Just construct the new partitioner (based on row index, so it will comply with invariants) directly.
A bunch of stuff:

1. Grab disabled status, ignore disabled accounts in most places, and show disabledness when printing.
2. Handle the case where a service account has no active key.
3. Remove the loop confirmation (I can just ctrl-c).

Future work: we probably do need a bulk-update mechanism, but this removes enough annoyance for me for now.
This caused assertion errors in Scala in `MakeArray.unify`.
I sometimes iterate quickly with `pytest -k testname` from the `hail/python` directory. I cannot do that if the test directory is hard coded.
The most trivial loop is the loop which does not recur. I added a simple test and fixed Requiredness to handle this case.
…ch (hail-is#12801) CHANGELOG: Hitting CTRL-C while interactively using Batch or Query-on-Batch cancels the underlying batch.
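The CTRL-C behavior can be sketched in plain Python. This is only an illustration of the pattern, not Hail's actual handler; the function and callback names are made up.

```python3
import signal

# Install a SIGINT handler that cancels in-flight work, restores the
# previous handler, and then surfaces the interrupt to the caller.
def install_cancel_on_interrupt(cancel):
    previous = signal.getsignal(signal.SIGINT)
    def handler(signum, frame):
        cancel()  # e.g. cancel the underlying batch
        signal.signal(signal.SIGINT, previous)
        raise KeyboardInterrupt
    signal.signal(signal.SIGINT, handler)

cancelled = []
install_cancel_on_interrupt(lambda: cancelled.append(True))
try:
    signal.raise_signal(signal.SIGINT)  # simulate the user hitting CTRL-C
except KeyboardInterrupt:
    pass
assert cancelled == [True]
```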
The compiler still catches this at runtime, but it is hard to understand because it lacks a meaningful stacktrace.
…d local backend jobs (hail-is#12845) CHANGELOG: The Batch LocalBackend now sets the working directory for dockerized jobs to the root directory instead of the temp directory. Tested with the following:

```
In [1]: import hailtop.batch as hb
   ...: b = hb.Batch(backend=hb.LocalBackend())
   ...: j = b.new_job()
   ...: j.image('ubuntu:20.04')
   ...: j.command('pwd')
   ...: b.run()

/
Batch completed successfully!
```
…is#12841) Got annoyed with the constant re-tagging of images that don't need to be rebuilt, and decided to play a little make golf along the way. cc @jigold This should dramatically reduce the number of tags for hail-ubuntu from make-deployed images, though the number of layers in the container registry should not change.
CHANGELOG: `hl.variant_qc` no longer requires a `locus` field. Fixes hail-is#12844.
CHANGELOG: In Query-on-Batch, `hl.logistic_regression('firth', ...)` is now supported. Forgive me: I cleaned up and unified the look of the (now three) `fit` methods. A few of the sweeping cleanups:

1. num_iter, max_iter, and cur_iter are now n_iterations, max_iterations, and iteration.
2. Pervasive use of broadcasting functions rather than map.
3. `log_lkhd` is only evaluated on the last iteration (in particular, it's not bound before the `case`).
4. `select` as the last step rather than `drop` (we didn't drop all the unnecessary fields previously).
5. `select_globals` to make sure we only keep the `null_fit`.

Major changes in this PR:

1. Add no_crash to triangular_solve.
2. Split the epacts tests by type (now easy to run all firth tests with `-k firth`).
3. Firth fitting and test. A straight copy from Scala; I honestly don't understand what it is doing.
…ctions (hail-is#12854) CHANGELOG: In Query-on-Batch, simple pipelines with large numbers of partitions should be substantially faster. :facepalm:
CHANGELOG: In Query-on-Batch, allow writing to requester pays buckets, which was broken before this release.
Is there a better way to communicate the Query logging bug and its fix?
Merge of hail-is#12868 and hail-is#12867, because I expect each is likely to fail due to the other's error.

---

[qob] retry `storage.writer` and `storage.reader`

I do not think we frequently get errors in `storage.reader`, but I think `storage.writer` was always flaky and we were protected by the `retryTransientErrors` on `createNoCompression`. My change to fix requester pays delayed the error until either the first `write` or the `close`, which do not have a `retryTransientErrors` (and it is not obvious to me that it is safe to retry a `flush`).

---

[qob] retry transient errors reading the results file
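The retry-on-transient-errors pattern referenced above can be sketched in Python. `retryTransientErrors` itself is Scala, so everything here (the exception type treated as transient, the backoff schedule) is an illustrative assumption, not Hail's implementation.

```python3
import time

# Retry a callable a bounded number of times on "transient" errors,
# with exponential backoff; re-raise once attempts are exhausted.
def retry_transient_errors(f, attempts=3, base_delay=0.001):
    for attempt in range(attempts):
        try:
            return f()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

calls = []
def flaky_read():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError('transient')
    return 'ok'

assert retry_transient_errors(flaky_read) == 'ok'
assert len(calls) == 3  # two transient failures, then success
```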
I left the changes to Query and Batch in separate commits for ease of review. I put these in the same PR because we don't really have standalone testing for JVM Jobs outside of Query-on-Batch so the FASTA use-case serves as a test here that cloudfuse is working properly for JVM Jobs. Would be great if Jackie you could review the batch commit and Tim could review the query commit. ## Hail Query - Added support for the `FROM_FASTA_FILE` rpc and the service backend now passes sequence file information from RGs in every rpc - Refactored the liftover handling in service_backend to not redundantly store liftover maps and just take them from the ReferenceGenome objects like I did for sequence files. This means that add/remove liftover/sequence functions on the Backend are just intended to sync up the backend with python, which is a no-op for the service backend. - Don't localize the index file on fromFASTAFile/addSequence before creating the index object. `FastaSequenceIndex` just loads the whole file on construction so might as well stream it in from whatever storage it's in. - FASTA caching is left alone because those files will be mounted and unmounted from the jvm container over the life of the job. JVM doesn't have to worry about disk usage because that's handled by Batch XFS quotas, so long as the service backend requests enough storage to fit the FASTA file. Batch will make sure that a given bucket (and therefore a given FASTA file) is mounted once per-user on a batch worker. ## Hail Batch - Added support for read-only cloudfuse mounts for JVM jobs - These mounts are shared between jobs on the same machine from the same user - I did not change DockerJobs, but they could be very easily adapted to use this new mount-sharing code.
…l-is#12818)

- On *deploys*, make sure that whatever is in our third-party images is in our private registry before starting builds like hail-ubuntu that might depend on those images. This means we can update our ubuntu base image without the Australians needing to deploy any images by hand. However, this does not run in PRs because 1) I didn't want to add that kind of latency to PRs and 2) we don't do any kind of namespacing for our images, so if we did include this for a PR that ultimately wasn't merged we would have to manually remove the image anyway, so why not manually add it if you're going to PR it… I think point 2 is a little weak, but I recall this being what we agreed on a couple months back when we discussed this. I'm wondering if we should just eat the minute or so of latency at the beginning of PRs to be safe, but it also feels like a shame for something that changes so infrequently.
- Again on deploys, upload the hailgenetics/* images to the private registry if they don't already exist there. This way, any deployments that aren't hail team's GCP deployment can get these images automatically when they deploy a new SHA instead of uploading them manually. It won't backfill skipped versions, but we decided that was ok. This seems less relevant for testing on PRs, as it will get triggered on releases and we can easily dev deploy to rectify the image if this breaks.
…batch tests (hail-is#12862) Trying to make it more ergonomic to simply run `python3 -m pytest batch/test/test_batch.py::test_job` (it now works without any extra environment variables or configuration). This involved the following changes:

- Deleted some env vars that are no longer used or can easily be consolidated into existing ones.
- Gave defaults to those testing env vars for which there are reasonable defaults, e.g. `DOCKER_ROOT_IMAGE` and `HAIL_GENETICS_HAIL_IMAGE`.
- Pushed other environment variables for which there are no reasonable defaults into the tests that need them. If you run a test that requires `HAIL_CLOUD`, you'll still get an error that that env variable is unset and you should set it. But if you just want to run a single test that doesn't need `HAIL_CLOUD`, it won't get in the way.
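The defaulting scheme above can be sketched as follows. The default image value and the helper names are illustrative assumptions, not the test suite's actual code.

```python3
import os

# Env vars with a reasonable default fall back to it silently.
def docker_root_image():
    return os.environ.get('DOCKER_ROOT_IMAGE', 'ubuntu:20.04')

# Env vars without a reasonable default fail loudly, but only inside
# the tests that actually need them.
def require_env(name):
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f'{name} must be set to run this test')
    return value

# Running a test that only needs the defaulted var requires no setup.
assert docker_root_image()
```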
…ing (hail-is#12863) This fixes a bug Mike W was hitting with a gnomAD pipeline. I wasn't able to replicate with a trivial example, and would prefer not to spend another half-day poring through that huge IR to try to minimize. Verified that this fixes the problem, though. https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/.E2.9C.94.20Summing.20number.20of.20entries.20meeting.20criteria.20.20by.20group
I also cleaned up a duplicate errno entry.
…is#12866) I noticed that this step blew up in the benchmarking PR and thought I'd provide a better error message.
The `test_dataproc` steps need what is in ci-utils, but also the tools in base_image to build hail. Tested with a dev deploy of `test_dataproc-37`.
…-113

Conflicts:
- .gitignore
- auth/Makefile
- batch/batch/driver/instance_collection/pool.py
- batch/batch/driver/main.py
- build.yaml
- ci/create_initial_account.py
- docker/Dockerfile.service-base
- hail/src/main/scala/is/hail/backend/service/ServiceBackend.scala
- hail/src/main/scala/is/hail/backend/service/Worker.scala
That all looks good to me, but the build.yaml changes in particular are really hard to completely grok. Could we try a dev deploy first?
```python3
f'''Found a user ({self.username()}) without a unique active key in Kubernetes.

The known IAM keys are:
{known_iam_keys_str}

These keys are present in Kubernetes:
{keys_to_k8s_secret_str}

We will assume {kube_key.id} is the active key.
'''
```
Check failure — Code scanning / CodeQL: Clear-text logging of sensitive information.
Some more challenging conflicts in this release - @lgruen, could you look over the previous files and let me know if you see anything weird? There was a section of `build.yaml` that I didn't understand; our logs don't show we really made changes to the section around `docker/Dockerfile.service-base`.