Sept 23 upstream 4 (dc1f086) #305

Merged
merged 181 commits into main from sept-23-upstream-4-dc1f086 on Sep 19, 2023

Conversation

illusional

Depends on: #304

jigold and others added 30 commits June 28, 2023 17:20
I realize this looks like a lot of code changes, but it's mostly copying
and pasting two SQL procedures and changing one line in each.

This adds four pieces of metadata to requests that can then be queried as
extra log metadata:
- batch_id
- job_id
- batch_operation
- job_queue_time

These should be self-explanatory, except `job_queue_time`, which is the time
from when the job is first set to Ready to when it is scheduled on a worker
(the exact end point is when the job config is built to send to the worker).

Example logging query. Note that the search on `batch_id` is not optimized,
so you definitely want to restrict the search to a short time window. I can
add my Python script that scrapes these logs and makes a Plotly figure in a
separate PR once this goes in.

```
(
resource.labels.container_name="batch"
resource.labels.namespace_name="{namespace}"
) OR (
resource.labels.container_name="batch-driver"
resource.labels.namespace_name="{namespace}"
) OR (
resource.type="gce_instance"
logName:"worker.log"
labels."compute.googleapis.com/resource_name":"{namespace}"
)
jsonPayload.batch_id="{batch_id}"
timestamp >= "{start_timestamp}" {end_timestamp}
```
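
As a rough illustration of how these fields can be consumed once scraped (the scraping and plotting script itself will come in a separate PR; the helper below is purely hypothetical):

```python
from collections import defaultdict

# Hedged sketch: group scraped jsonPayload entries by job_id and collect the new
# job_queue_time field, i.e. how long each job waited between Ready and scheduling.
def queue_times_by_job(payloads):
    times = defaultdict(list)
    for payload in payloads:
        if 'job_queue_time' in payload:
            times[payload['job_id']].append(float(payload['job_queue_time']))
    return dict(times)
```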
CHANGELOG: Fixed bug causing poor performance and memory leaks for
Matrix.annotate_rows aggregations

---------

Co-authored-by: patrick-schultz <pschultz@broadinstitute.org>
For deleted users the request to look up the AAD application can fail
with a 404, and we don't need to gather this information for deleted
users anyway.
Identities in test namespaces cannot share the same underlying cloud
identity if we want to identify requests with cloud access tokens. This
also means the `test` account does not need to have the union of roles
of the other robot accounts, but pruning of the `test` account's roles
is left until after this PR merges so we can properly assess which roles
are still in use by the `test` account.
…hail-is#13206)

Key changes:

- Remove old VCF combiner
- Add `StreamZipJoinProducers`, an IR that takes an array and a function from
`array.elementType` to a stream, and zip-joins the results of calling that
function on each element of the array (a minimal sketch of the semantics
follows below).
- Combine `Table._generate` and this new stream zip operation to rewrite the
gvcf merge stage of the VDS combiner in O(1) IR.
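
A minimal Python sketch of the intended semantics (illustrative only, not the Scala IR; it assumes each producer yields key-sorted `(key, value)` pairs):

```python
import heapq
from typing import Callable, Iterable, Iterator, Tuple

def stream_zip_join_producers(
    contexts: list,
    make_stream: Callable[[object], Iterable[Tuple[int, str]]],
) -> Iterator[Tuple[int, list]]:
    """Call `make_stream` on each context and zip-join the resulting key-sorted streams."""
    iters = [iter(make_stream(c)) for c in contexts]
    heads = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heads, (first[0], i, first[1]))
    while heads:
        key = heads[0][0]
        row = [None] * len(iters)  # one slot per producer, like the joined struct
        while heads and heads[0][0] == key:
            _, i, value = heapq.heappop(heads)
            row[i] = value
            nxt = next(iters[i], None)
            if nxt is not None:
                heapq.heappush(heads, (nxt[0], i, nxt[1]))
        yield key, row

# Example: zip-join two gvcf-like streams on a shared key.
streams = {'s1': [(1, '0/0'), (2, '0/1')], 's2': [(2, '1/1'), (3, '0/0')]}
print(list(stream_zip_join_producers(['s1', 's2'], lambda name: streams[name])))
# [(1, ['0/0', None]), (2, ['0/1', '1/1']), (3, [None, '0/0'])]
```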

---------

Co-authored-by: Tim Poterba <tpoterba@gmail.com>
Azure default credentials will use the metadata server when available so
we can just use those instead of manually reaching out to the metadata
server.
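A hedged sketch of what this looks like with `azure-identity` (the token scope shown is illustrative):

```python
from azure.identity.aio import DefaultAzureCredential

# DefaultAzureCredential falls back to the instance metadata service (managed identity)
# when running on a VM, so there is no need to call the metadata endpoint by hand.
async def get_management_token() -> str:
    async with DefaultAzureCredential() as credential:
        token = await credential.get_token('https://management.azure.com/.default')
        return token.token
```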
…3226)

RR: hail-is#13045
RR: hail-is#13046 
Support symmetric comparison of structs and struct expressions.
Provide better error messages when attempting to construct literals from
expressions with free variables.
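A small hedged example of both behaviours (the exact error text is illustrative):

```python
import hail as hl

s = hl.Struct(a=1, b='x')         # a Python-side struct value
expr = hl.struct(a=1, b='x')      # a StructExpression

hl.eval(expr == s)   # True
hl.eval(s == expr)   # now also supported, rather than falling back to plain object equality

mt = hl.balding_nichols_model(1, 5, 5)
# hl.literal(mt.GT) now raises a clearer error explaining that the expression
# references fields of `mt` (free variables) and so cannot become a literal.
```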
…il-is#13231)

I need this extra debugging information to understand what is going on
in Azure with deleted VMs still showing up in the portal with
ResourceNotFound errors. Miah and Greg are running into this same
problem in their deployment. My guess is that the worker is active and working
fine, but then the deployment gets "Canceled" because the OMSAgent takes too
long to deploy; our loop then cancels the deployment, which messes up the state
in Azure of the already deployed and running VM.

I popped the parameters from the deployment result in case they contain
sensitive data (I'm mainly worried about private SSH keys).
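Roughly what the scrubbing looks like (a hypothetical helper, not the real driver code):

```python
import logging

log = logging.getLogger('batch-driver')

def log_deployment_result(deployment: dict) -> None:
    # Drop the parameters before logging: they may contain secrets such as private SSH keys.
    scrubbed = {k: v for k, v in deployment.items() if k != 'parameters'}
    log.info(f'deployment result: {scrubbed}')
```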
Currently `TableRead.execute` always produces a
`TableValueIntermediate`, even though almost all `TableReader`s are
lowerable, so could produce a `TableStageIntermediate`.

This PR refactors `TableReader` to allow producing a
`TableStageIntermediate` in most cases, and to make it clearer which
readers still need to be lowered (only
`TableFromBlockMatrixNativeReader`, `MatrixVCFReader`, and
`MatrixPLINKReader`). It also deletes some now dead code.
The test that lists batches timed out. The main problem is that the limit in
the aioclient used by the test_batch tests was being passed as a string rather
than an integer. I assumed the downstream function was receiving an integer.
Therefore, we were effectively doing

`batch_id < "137"` and not `batch_id < 137`.

So the query was running forever, scanning all batches from the test user.
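
A tiny illustration of why the type matters (hypothetical values, not the real query path):

```python
batch_id = 42
limit = "137"                   # what the test was actually passing: a string

print(batch_id < 137)           # True: numeric comparison, the filter behaves as intended
print(str(batch_id) < limit)    # False: "42" > "137" lexicographically, the filter silently fails
```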

I also was missing a tag annotation on the queries, but that was not
causing the timeout.
Currently, when the user views the jobs for a CI pipeline, either on the
`pr.html` page or the `batch.html` page, they are displayed all together
in a big table, like so:

<img width="618" alt="Screenshot 2023-06-06 at 15 30 21"
src="https://github.com/hail-is/hail/assets/84595986/8c3f45a8-756d-4e10-ac27-a57791f3965c">

This change splits the jobs into smaller tables, with any failed jobs
displayed first, followed by any currently running jobs, then any pending
jobs, then the rest.
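
Roughly, the grouping looks like this (a hedged sketch; the state names and job shape are illustrative, not the real CI schema):

```python
def group_jobs(jobs):
    """Split jobs into the four tables described above, failures first."""
    failed, running, pending, rest = [], [], [], []
    for job in jobs:
        state = job['state']
        if state in ('Failed', 'Error'):
            failed.append(job)
        elif state == 'Running':
            running.append(job)
        elif state in ('Ready', 'Creating', 'Pending'):
            pending.append(job)
        else:
            rest.append(job)
    return [('Failed', failed), ('Running', running), ('Pending', pending), ('Other', rest)]
```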

Example with failed job:

<img width="728" alt="Screenshot 2023-06-06 at 15 12 36"
src="https://github.com/hail-is/hail/assets/84595986/5b20f95e-e530-43c2-bc96-4fcd639083d3">

Example with running/pending jobs:

<img width="643" alt="Screenshot 2023-06-06 at 15 10 54"
src="https://github.com/hail-is/hail/assets/84595986/7b5790e6-a5e0-426f-b290-a28fee19e967">
Qin He reported that listing a folder containing around 50k files took about
1h15m. This new code takes ~16 seconds, which is about how long
`gcloud storage ls` takes.

There are two improvements:

1. Use `bounded_gather2`. The use of a semaphore in `bounded_gather2`,
which is missing from `bounded_gather`, allows it to be used
recursively (see the sketch after this list). In particular, suppose we had a
semaphore of 50. The outer `bounded_gather2` might need 20 slots to run its 20
paths in parallel, leaving 30 slots of parallelism for its children. By passing
the semaphore down, we let our children optimistically use some of that excess
parallelism.

2. If we happen to have the `StatResult` for a particular object, we
should never look it up again. In particular, getting the `StatResult`
for every file in a directory can be done in O(1) requests, while getting the
`StatResult` for each of those files individually (using their full
paths) is necessarily O(N). If there was at least one glob and there are no
`suffix_components`, then we can use the `StatResult`s that we learned when
checking the glob pattern.
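
A heavily simplified asyncio sketch of the semaphore-sharing idea in point 1 (illustrative names only; the real `bounded_gather2` in `hailtop` takes more care, for example to guarantee progress when every slot is held by a parent):

```python
import asyncio

async def bounded_gather_sketch(sema: asyncio.Semaphore, *thunks):
    # One shared semaphore bounds total concurrency across every level of recursion.
    async def run(thunk):
        async with sema:
            return await thunk()
    return await asyncio.gather(*(run(t) for t in thunks))

async def stat_file(path: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a real API request
    return path

async def list_tree(sema: asyncio.Semaphore, directories: list) -> list:
    async def list_one(d):
        files = [f'{d}/part-{i}' for i in range(10)]  # stand-in for a real listing call
        # The nested gather reuses `sema`, so directories and files together never
        # exceed the one limit (e.g. 20 directory slots leave 30 slots for files).
        return await bounded_gather_sketch(sema, *(lambda f=f: stat_file(f) for f in files))
    nested = await bounded_gather_sketch(sema, *(lambda d=d: list_one(d) for d in directories))
    return [p for sub in nested for p in sub]

async def main():
    paths = await list_tree(asyncio.Semaphore(50), [f'dir{i}' for i in range(20)])
    print(len(paths))  # 200

asyncio.run(main())
```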

The latter point is perhaps a bit more clear with examples:

1. `gs://foo/bar/baz`. Since there are no globs, we can make exactly one
API request to list `gs://foo/bar/baz`.

2. `gs://foo/b*r/baz`. In this case, we must make one API request to
list `gs://foo/`. This gives us a list of paths under that prefix. We
check each path for conformance to the glob pattern `gs://foo/b*r`. For
any path that matches, we must then list `<the matching path>/baz` which
may itself be a directory containing files. Overall we make O(1) API
requests to do the glob and then O(K) API requests to get the final
`StatResult`s, where K is the number of paths matching the glob pattern.

3. `gs://foo/bar/b*z`. In this case, we must make one API request to
list `gs://foo/bar/`. In `main`, we then throw away the `StatResult`s we
got from that API request! Now we have to make O(K) requests to recover
those `StatResult`s for all K paths that match the glob pattern. This PR
just caches the `StatResult`s of the most recent globbing. If there is
no suffix to later append, then we can just re-use the `StatResult`s we
already have!
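
A hedged sketch of the caching in example 3 (illustrative names, not the real filesystem code):

```python
# After listing gs://foo/bar/ to expand the glob b*z, keep the StatResults we already
# received; if there is no suffix left to append, answer from the cache instead of
# issuing one stat request per matched path.
async def stat_glob_matches(fs, prefix: str, pattern, suffix_components: list):
    listed = {entry.path: entry for entry in await fs.list(prefix)}  # one API request
    matches = [p for p in listed if pattern.match(p)]
    if not suffix_components:
        return [listed[p] for p in matches]  # O(1) extra requests: reuse the cache
    return [await fs.stat('/'.join([p, *suffix_components])) for p in matches]  # O(K) requests
```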


cc: @daniel-goldstein since you've reviewed this before. Might be of
interest.
…il-is#13239)

`hailtop.batch.ServiceBackend` uses `get_user_config().get` to read the
`HAIL_BATCH_REGIONS` environment variable, when it should use
`configuration_of`. This change fixes that.
Let me know if you think this is good and whether I need to test the UI
with dev deploy.

---------

Co-authored-by: Dan King <daniel.zidan.king@gmail.com>
…hail-is#13278)

Replaces hail-is#13260.

- `test_spectral_moments` times out in a PR: (QoB)
https://hail.zulipchat.com/#narrow/stream/127527-team/topic/timeouts/near/376698259,
(spark) https://ci.hail.is/batches/7653376/jobs/74, (spark)
https://ci.hail.is/batches/7653376/jobs/72, (spark)
https://ci.hail.is/batches/7653376/jobs/62

I also backed local off to 4m even though it has shown no evidence of
timeouts. It seems simpler for Spark and local to be the same.
Some context: To make Batch feature additions safer, we now have
infrastructure to turn off new components with checkboxes on the driver
page. I forgot to add the text before the checkbox when I put it in last
week.
danking and others added 15 commits August 31, 2023 21:30
CHANGELOG: `hl.Struct` is now pickle-able.
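For example:

```python
import pickle
import hail as hl

s = hl.Struct(a=1, b='x')
assert pickle.loads(pickle.dumps(s)) == s   # round-trips cleanly
```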
CHANGELOG: Fix bug introduced in 0.2.117 by commit `c9de81108` which
prevented the passing of keyword arguments to Python jobs. This
manifested as "ValueError: too many values to unpack".

We also weren't preserving tuples; they became lists. I fixed that too.
I also added some types and avoided an `isinstance` check by encoding the
necessary knowledge.
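A hedged usage sketch of the fixed behaviour (the backend choice and setup are illustrative):

```python
import hailtop.batch as hb

def add(x, *, y=(1, 2)):
    return x + sum(y)

b = hb.Batch(backend=hb.LocalBackend(), name='kwargs-example')
j = b.new_python_job()
# Keyword arguments reach the function again (previously this path raised
# "ValueError: too many values to unpack"), and the tuple stays a tuple.
result = j.call(add, 3, y=(4, 5))
b.run()
```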
We can use this for pyright or mypy but I found both took ~1.5s even on
a tiny file like `ci.py`.

The check is slow the first time because it has to install a new venv;
however, subsequent executions can use the extant venv.
Upgrade to Ubuntu 22.04 everywhere including for the Batch Worker VMs in
Azure and Google.

---------

Co-authored-by: Sophie Parsa <parsa@wm9c6-e4e.broadinstitute.org>
Co-authored-by: Daniel Goldstein <danielgold95@gmail.com>
This inspect command prevents us from updating a tag, for example, if we
need to replace an image with a security problem or if there is a bug
like the one fixed by hail-is#13536.
I am having trouble determining the effect of this mistake, but it seems
like we would be substantially undercharging for the service fee if it
was really being charged by worker_fraction_in_1024ths instead of
core-hours.
CHANGELOG: In QoB, Hail's file systems now correctly list all files in a
directory, not just the first 1000. This could manifest in an
import_table or import_vcf which used a glob expression. In such a case,
only the first 1000 files would have been included in the resulting
Table or MatrixTable.
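
A hedged sketch of the shape of the fix (the client call below is hypothetical, not the real `GoogleStorageAsyncFS` code):

```python
async def list_all_objects(client, bucket: str, prefix: str):
    # Keep requesting pages until the listing response stops returning a page token,
    # rather than stopping after the first page of (at most) 1000 results.
    page_token = None
    while True:
        params = {'prefix': prefix, 'maxResults': 1000}
        if page_token is not None:
            params['pageToken'] = page_token
        page = await client.get(f'/b/{bucket}/o', params=params)
        for item in page.get('items', []):
            yield item
        page_token = page.get('nextPageToken')
        if page_token is None:
            return
```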

I also moved two GSFS-only tests into the FSSuite. There should be very
few tests that are cloud-specific.
I added this feature because I am tired of triggering a new CI build every
time I want to dev deploy and try out new changes. I'd prefer to put a new
label on the PR rather than close it each time or make a copy of the branch
and test the copy.
…3550)

Not a correctness bug because we raise an assertion error in the
partition function.
All these classes inherit from `Resource` which is an `abc.ABC`.
…ail-is#13131)

Deprecate hail-minted API keys in favor of using access tokens from the
identity providers already associated with user identities. For more
context and a high-level overview of the implementation, see [this
RFC](hail-is/hail-rfcs#2)
@illusional illusional changed the base branch from main to sept-23-upstream-3-1c28203 September 8, 2023 00:42
Co-authored-by: John Marshall <jmarshall@hey.com>

@jmarshall jmarshall left a comment

I've commented on all the potential problems I saw in this one.

There are also the secrets in build.yaml, where we seem to have removed some previously and some refactoring/renaming has gone on upstream. That seems like a suck-it-and-see scenario, and you know better than I do what secrets we've got set up.

@illusional illusional marked this pull request as ready for review September 19, 2023 20:57
Base automatically changed from sept-23-upstream-3-1c28203 to main September 19, 2023 21:01
@illusional illusional merged commit daecd7d into main Sep 19, 2023
@illusional illusional deleted the sept-23-upstream-4-dc1f086 branch September 19, 2023 21:21