Upstream 0.2.126 and proposed OOM-killer fix (3b38d0b) #317

Merged
merged 96 commits into from
Nov 10, 2023

Conversation


@jmarshall jmarshall commented Nov 9, 2023

In particular, we need to incorporate and exercise hail-is#13977 as the proposed fix for jobs becoming unresponsive due to being targeted by the kernel's OOM-killer.

daniel-goldstein and others added 30 commits October 10, 2023 21:47
I really borked our SQL linting. This PR is short but it catches a few
critical problems.


1. The point of `check-sql.sh` is to detect modifications or deletions
of SQL files in PRs and fail if such a change occurs. Currently on
`main` it does not detect modifications. In hail-is#13456, I removed the
`delete-<service>-tables.sql` files (intentionally), so I added the `^D`
to the `grep` regex to indicate that it is OK to have a deletion. What I
inadvertently did, though, was change the rule from "It's OK to have
Additions of any file OR Modifications of estimated-current.sql /
delete-<service>-tables.sql" to "It's OK to have Additions OR
Modifications OR Deletions of estimated-current.sql /
delete-<service>-tables.sql". Really this should have been "It's OK to
have Additions OR Modifications of estimated-current.sql OR Deletions of
delete-<service>-tables.sql". I've changed it to reflect that rule.

2. Rules currently silently *pass* in CI with an error message that git
is not installed. In hail-is#13437 I changed the image used to run the linters
and inadvertently didn't include `git` which `check-sql.sh` needs to
run. Here's how it failed in a sneaky way:
- Since `git` is not installed, all calls to `git` fail, but the script
is not run with `set -e` so every line of the script is executed
- Since `git` lines fail, `modified_sql_file_list` remains empty
- Since `modified_sql_file_list` remains empty, it appears to the check
at the end that everything checked out
- The if statement runs successfully and the script returns with error
code 0

To fix this I did a few things:
- installed `git` in the linting image
- `set -e` by default and only enable `set +e` later on when necessary
(because we don't want a failed `git diff` to immediately exit)
- Do away with the file checking and instead check the exit code of the
grep. If nothing survives the grep filter, meaning no illegal changes
were made, grep returns with exit code 1, so we treat that exit code as
a success (illustrated below).
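For reference, a minimal sketch of that exit-code convention (the real check is `check-sql.sh`; this Python version and its `ALLOWED` pattern are illustrative stand-ins for the actual rule):

```python3
import subprocess

# Hypothetical stand-in for the real rule: additions of any file, modifications
# of estimated-current.sql, and deletions of delete-<service>-tables.sql are OK.
ALLOWED = r'^A|^M[[:space:]]+.*estimated-current\.sql|^D[[:space:]]+.*delete-.*-tables\.sql'

def sql_changes_ok(git_diff_name_status: str) -> bool:
    # grep -Ev prints the lines that do NOT match ALLOWED, i.e. illegal changes.
    result = subprocess.run(
        ['grep', '-Ev', ALLOWED],
        input=git_diff_name_status,
        capture_output=True,
        text=True,
    )
    if result.returncode == 1:     # nothing survived the filter: success
        return True
    if result.returncode == 0:     # some disallowed change survived
        print('illegal SQL changes:\n' + result.stdout)
        return False
    raise RuntimeError(result.stderr)  # grep itself failed (exit code >= 2)
```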
Fixes hail-is#13556. I haven't tested these changes -- would like to get
initial feedback first.
I neglected to include the extra classpath necessary when using a skinny
JAR.
`f` is a thunk, so it is currently evaluated three times before being
inserted into the code cache. The `compiledFunction` variable was
unused, so I think this is what was originally intended.
ndarray concat was broken when the first input has size 0 along the
concat axis. For example
```
In [3]: hl.eval(hl.nd.hstack([hl.nd.zeros((2, 0)), hl.nd.array([[1.0, 2.0], [3.0, 4.0]])]))
Out[3]:
array([[0., 2.],
       [0., 4.]])
```
The zeros matrix is 2 by 0, so horizontal concatenation should just
return the other matrix.
(I once saw the first column filled with random numbers, presumably from
a buffer overflow.)

I did some cleaning up in the concat implementation, but the functional
change is to record the index of the first input which is non-empty
along the concat axis, and when resetting to the start of the axis,
reset to that non-empty index. Other size-0 inputs are correctly handled
when incrementing the index; the problem was that the first read happens
before an increment.
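For comparison, with the fix the same call should simply return the non-empty input:

```python3
import hail as hl

# The (2, 0) zeros matrix contributes no columns, so hstack should return
# the second matrix unchanged.
hl.eval(hl.nd.hstack([hl.nd.zeros((2, 0)), hl.nd.array([[1.0, 2.0], [3.0, 4.0]])]))
# expected: array([[1., 2.],
#                  [3., 4.]])
```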
…13355)

CHANGELOG: make hail's optimization rewriting filters to
interval-filters smarter and more robust

Completely rewrites ExtractIntervalFilters. Instead of matching against
very specific patterns, and failing completely for things that don't
quite match (e.g. an input is let bound, or the fold implementing "locus
is contained in a set of intervals" is written slightly differently),
this uses a standard abstract interpretation framework, which is almost
completely insensitive to the form of the IR, only depending on the
semantics. It also correctly handles missing key fields, where the
previous implementation often produced an unsound transformation of the
IR.

Also adds a much more thorough test suite than we had before.

At the top level, the analysis takes a boolean typed IR `cond` in an
environment where there is a reference to some `key`, and produces a set
`intervals`, such that `cond` is equivalent to `cond &
intervals.contains(key)` (in other words `cond` implies
`intervals.contains(key)`, or `intervals` contains all rows where `cond`
is true). This means for instance it is safe to replace `TableFilter(t,
cond)` with `TableFilter(TableFilterIntervals(t, intervals), cond)`.

Then in a second pass it rewrites `cond` to `cond2`, such that `cond &
(intervals.contains(key))` is equivalent to `cond2 &
intervals.contains(key)` (in other words `cond` implies `cond2`, and
`cond2 & intervals.contains(key)` implies `cond`). This means it is safe
to replace the `TableFilter(t, cond)` with
`TableFilter(TableFilterIntervals(t, intervals), cond2)`. A common
example is when `cond` can be completely captured by the interval
filter, i.e. `cond` is equivalent to `intervals.contains(key)`, in which
case we can take `cond2 = True`, and the `TableFilter` can be optimized
away.
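As a concrete, hypothetical example of the intended behaviour, consider a filter whose predicate mixes a key comparison with a non-key condition:

```python3
import hail as hl

ht = hl.utils.range_table(100)          # keyed by `idx`
ht = ht.annotate(x=hl.rand_unif(0, 1))

# cond = (idx >= 5) & (x > 0.5). The analysis should extract
# intervals = [[5, +inf)] over the key and rewrite the residual predicate
# to cond2 = (x > 0.5), so only partitions overlapping [5, +inf) are
# scanned and the remaining filter is applied to those rows.
ht = ht.filter((ht.idx >= 5) & (ht.x > 0.5))
```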

This all happens in the function
```scala
  def extractPartitionFilters(ctx: ExecuteContext, cond: IR, ref: Ref, key: IndexedSeq[String]): Option[(IR, IndexedSeq[Interval])] = {
    if (key.isEmpty) None
    else {
      val extract = new ExtractIntervalFilters(ctx, ref.typ.asInstanceOf[TStruct].typeAfterSelectNames(key))
      val trueSet = extract.analyze(cond, ref.name)
      if (trueSet == extract.KeySetLattice.top)
        None
      else {
        val rw = extract.Rewrites(mutable.Set.empty, mutable.Set.empty)
        extract.analyze(cond, ref.name, Some(rw), trueSet)
        Some((extract.rewrite(cond, rw), trueSet))
      }
    }
  }
```
`trueSet` is the set of intervals which contains all rows where `cond`
is true. This set is passed back into `analyze` in a second pass, which
asks it to rewrite `cond` to something equivalent, under the assumption
that all keys are contained in `trueSet`.

The abstraction of runtime values tracks two types of information:
* Is this value a reference to / copy of one of the key fields of this
row? We need to know this to be able to recognize comparisons with key
values, which we want to extract to interval filters.
* For boolean values (including, ultimately, the filter predicate
itself), we track three sets of intervals of the key type:
overapproximations of when the bool is true, false, and missing.
Overapproximation here means, for example, if the boolean evaluates to
true in some row with key `k`, then `k` must be contained in the "true"
set of intervals. But it's completely fine if the set of intervals
contains keys of rows where the bool is not true. In particular, a
boolean about which we know nothing (e.g. it's just some non-key boolean
field in the dataset) is represented by an abstract boolean value where
all three sets are the set of all keys.
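A minimal sketch of that abstract boolean domain, using plain Python sets of keys in place of interval lists (the real implementation works over interval lattices, so the names and details here are illustrative only):

```python3
from dataclasses import dataclass

@dataclass(frozen=True)
class AbstractBool:
    # Overapproximations of the keys for which the boolean may be true,
    # false, or missing.
    true_keys: frozenset
    false_keys: frozenset
    missing_keys: frozenset

    def and_(self, other: "AbstractBool") -> "AbstractBool":
        # a & b is true only where both may be true, so intersecting the
        # "true" sets remains an overapproximation; it may be false or
        # missing wherever either operand may be, so those sets are unioned.
        return AbstractBool(
            self.true_keys & other.true_keys,
            self.false_keys | other.false_keys,
            self.missing_keys | other.missing_keys,
        )

    def or_(self, other: "AbstractBool") -> "AbstractBool":
        # a | b is false only where both may be false; it may be true or
        # missing wherever either operand may be.
        return AbstractBool(
            self.true_keys | other.true_keys,
            self.false_keys & other.false_keys,
            self.missing_keys | other.missing_keys,
        )

def unknown(all_keys: frozenset) -> AbstractBool:
    # A boolean about which we know nothing: all three sets are "all keys".
    return AbstractBool(all_keys, all_keys, all_keys)
```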
CHANGELOG: Mitigate new transient error from Google Cloud Storage which
manifests as `aiohttp.client_exceptions.ClientOSError: [Errno 1] [SSL:
SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2548)`.

As of around 1500 ET 2023-10-16, this exception happens whenever we
issue a lot of requests to GCS.

See [Zulip
thread](https://hail.zulipchat.com/#narrow/stream/300487-Hail-Batch-Dev/topic/cluster.20size/near/396777320).
The `logging_queries` variable is always *defined* but sometimes `None`.
…ail-is#13715)

CHANGELOG: Fixes hail-is#13697, a long-standing issue with QoB, in which a
failing partition job or driver job is not failed in the Batch UI.

I am not sure why we did not do it this way in the first place. If a
JVMJob raises an exception, Batch will mark the job as failed. Ergo, we
should raise an exception when a driver or a worker fails!

Here's an example: I used a simple pipeline that writes to a bucket to
which I have read-only access. You can see an example Batch (where every
partition fails): https://batch.hail.is/batches/8046901. [1]

```python3
import hail as hl
hl.utils.range_table(3, n_partitions=3).write('gs://neale-bge/foo.ht')
```

NB: I removed the `log.error` in `handleForPython` because that log is
never necessary. That function converts a stack of exceptions into a
triplet of the short message, the full exception with stack trace, and a
Hail error id (if present). That triplet is always passed along to
someone else who logs the exception.

(FWIW, the error id indicates a Python source location that is
associated with the error. On the Python-side, we can look up that error
id and provide a better stack trace.)

[1] You'll notice the logs are missing. I noticed this as well, it's a
new bug. I fixed it in hail-is#13729.
Picking up where hail-is#13776 left off.

CHANGELOG: improved speed of reading hail format datasets from disk

This PR speeds up decoding arrays in two main ways:
* instead of calling `arrayType.isElementDefined(array, i)` on every
single array element, which expands to
  ```scala
  val b = aoff + lengthHeaderBytes + (i >> 3)
  !((Memory.loadByte(b) & (1 << (i & 7).toInt)) != 0)
  ```
process elements in groups of 64, and load the corresponding long of
missing bits once
* once we have a whole long of missing bits, we can be smarter than
branching on each bit. After flipping to get `presentBits`, we use the
following pseudocode to extract the positions of the set bits, with time
proportional to the number of set bits:
  ```
  while (presentBits != 0) {
    val idx = java.lang.Long.numberOfTrailingZeros(presentBits)
    // do something with idx
    presentBits = presentBits & (presentBits - 1) // unsets the rightmost set bit
  }
  ```
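The same trick, written as runnable Python for illustration (Python ints are unbounded, so we mask to a 64-bit word):

```python3
def present_indices(present_bits: int) -> list:
    # Extract the indices of set bits in time proportional to the number
    # of set bits.
    present_bits &= (1 << 64) - 1              # treat as a 64-bit word
    out = []
    while present_bits:
        # index of the lowest set bit (the numberOfTrailingZeros step)
        idx = (present_bits & -present_bits).bit_length() - 1
        out.append(idx)
        present_bits &= present_bits - 1       # clear the lowest set bit
    return out

assert present_indices(0b10110) == [1, 2, 4]
```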

To avoid needing to handle the last block of 64 elements differently,
this PR changes the layout of `PCanonicalArray` to ensure the missing
bits are always padded out to a multiple of 64 bits. They were already
padded to a multiple of 32, and I don't expect this change to have much
of an effect. But if needed, blocking by 32 elements instead had very
similar performance in my benchmarks.

I also experimented with unrolling loops. In the non-missing case, this
is easy. In the missing case, I tried using `if (presentBits.bitCount >=
8)` to guard an unrolled inner loop. In both cases, unrolling was, if
anything, slower.

Dan observed a benefit from unrolling, but that was combined with the
first optimization above (not loading a bit from memory for every
element), which I believe was the real source of the improvement.
1. File rate is more interesting for small files.
2. The source_report controls the progress bar. By updating it eagerly
while we are listing a directory, the progress bar is more accurate
sooner. We currently wait until we get a semaphore for a particular file
to update the progress bar.
Some quality-of-life stuff for `hailtop.aiotools.delete`.

Without deleting in batches, I found it impossible to delete very large
lists of files because we create too many asyncio tasks.
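A sketch of the batching idea (`delete_one` and the batch size are hypothetical placeholders; the point is to bound how many asyncio tasks exist at once):

```python3
import asyncio
import itertools

async def delete_in_batches(paths, delete_one, batch_size=1000):
    # Rather than creating one asyncio task per file up front (which can be
    # millions of tasks for very large listings), only materialize
    # `batch_size` tasks at a time.
    it = iter(paths)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        await asyncio.gather(*(delete_one(p) for p in batch))
```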
…ail-is#13794)

Consider this:

```scala
class Foo {
   def bar(): (Long, Long) = (3, 4)

   def destructure(): Unit = {
     val (x, y) = bar()
   }

   def accessors(): Unit = {
     val zz = bar()
     val x = zz._1
     val y = zz._2
   }
}
```


![image](https://github.com/hail-is/hail/assets/106194/532dc7ea-8027-461d-8e12-3217f5451713)

These should be exactly equivalent, right? There's no way Scala would
compile the match into something horrible. Right? Right?

```
public void destructure();
  Code:
     0: aload_0
     1: invokevirtual #27                 // Method bar:()Lscala/Tuple2;
     4: astore_3
     5: aload_3
     6: ifnull        35
     9: aload_3
    10: invokevirtual #33                 // Method scala/Tuple2._1$mcJ$sp:()J
    13: lstore        4
    15: aload_3
    16: invokevirtual #36                 // Method scala/Tuple2._2$mcJ$sp:()J
    19: lstore        6
    21: new           #13                 // class scala/Tuple2$mcJJ$sp
    24: dup
    25: lload         4
    27: lload         6
    29: invokespecial #21                 // Method scala/Tuple2$mcJJ$sp."<init>":(JJ)V
    32: goto          47
    35: goto          38
    38: new           #38                 // class scala/MatchError
    41: dup
    42: aload_3
    43: invokespecial #41                 // Method scala/MatchError."<init>":(Ljava/lang/Object;)V
    46: athrow
    47: astore_2
    48: aload_2
    49: invokevirtual #33                 // Method scala/Tuple2._1$mcJ$sp:()J
    52: lstore        8
    54: aload_2
    55: invokevirtual #36                 // Method scala/Tuple2._2$mcJ$sp:()J
    58: lstore        10
    60: return

public void accessors();
  Code:
     0: aload_0
     1: invokevirtual #27                 // Method bar:()Lscala/Tuple2;
     4: astore_1
     5: aload_1
     6: invokevirtual #33                 // Method scala/Tuple2._1$mcJ$sp:()J
     9: lstore_2
    10: aload_1
    11: invokevirtual #36                 // Method scala/Tuple2._2$mcJ$sp:()J
    14: lstore        4
    16: return
```

Yeah, so, it extracts the first and second elements of the
primitive-specialized tuple, ~~constructs a `(java.lang.Long,
java.lang.Long)` Tuple~~ constructs another primitive-specialized tuple
(for no reason???), then does the match on that.

sigh.
The conceptual change here is that we want to parameterize all batch-related
tables with a new job group ID, which I've set to **0** for the root
job group. We need to make sure all future inserts / updates into the
batches table are propagated to the new job groups table. When we create
a batch now, we also create the corresponding entries in the job
groups and job group parents tables.

I chose the root job group to be 0 because, conceptually, the client
should start numbering job groups at 1 and not know there is a hidden
root job group being created under the hood. I'm not wedded to this.

I tried to check for all the indices that would be needed in my
prototype. It's possible I missed one or two, but it's not a big deal to
add them later.

I don't think we need to test this on a populated database (dev deploy
main, submit jobs, then run the migration), but let me know if you think
that would be helpful.
This change grew out of hail-is#13674.
The idea is simple: we shouldn't be appending code after control
statements, as any statements appended there are dead code. That idea
opened Pandora's box, but now we're not generating and dropping dead
code anymore.

Main changes that arose from fixing the fallout of adding an assert in
`Block.append`:
- Implement basic control-flow structures (if, while, for, switch) in
`CodeBuilderLike` and remove the older implementations from `Code`.
- The main difference is that these are built by sequencing `Code`
operations rather than being defined from LIR.
- This allows for a higher-level implementation that I think is simpler
to read.
- Use the type system to prevent foot-guns like `cb.ifx(cond,
label.goto)`.

Other changes:
- Rename `ifx`, `forLoop` and `whileLoop` to just `if_`, `for_` and
`while_`, respectively.
- Implement loops in terms of one another to remove code duplication.
- Fix logic for when to write IRs, as some default-value behaviour was
broken when `HAIL_WRITE_IR_FILES` was set in tests.
…s#13849)

Fixes hail-is#13788:
- Add a `raise_unless_column_indexed` guard and apply it to all
column-indexed parameters in `statgen.py`.
- Rename `check_row_indexed` and `check_entry_indexed`, as I'm allergic
to functions called "check"; now it's clearer what they do.
This is the result of some experimentation. With ten-way parallelism,
the copier very rarely gets rate-limited. With 75-way parallelism (the
default), we almost always experience tens of transient errors. If we
start at ten and back off as in this PR, I can get to 75 with just a
handful of transient errors.

cc: @jigold
Similar to hail-is#13818. We *must*
retrieve exceptions from any task that is `done`, otherwise we'll get a
warning when the task is freed.
Containers get deleted when a job is cancelled. This is not exceptional
behavior.

Example: https://cloudlogging.app.goo.gl/punCSPauoM1ZEqZ27
🤦

I grepped for other `hasattr` calls that do not check `self`.
We should never have been using `await`. (aiomysql should probably not
implement `__await__`.) `create_pool` returns
`aiomysql.utils._PoolContextManager`, which inherits from
`aiomysql.utils._ContextManager`, which implements `__await__`,
`__aenter__`, and `__aexit__` thusly:

```python3
    def __await__(self):
        return self._coro.__await__()

    async def __aenter__(self):
        self._obj = await self._coro
        return self._obj

    async def __aexit__(self, exc_type, exc, tb):
        await self._obj.close()
        self._obj = None
```

`__await__` is a footgun! You should never do that! You should close the
return value of the coroutine!
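In other words, prefer the async-context-manager form, which closes the pool for you; a minimal sketch (connection parameters are placeholders):

```python3
import asyncio
import aiomysql

async def main():
    # Preferred: __aenter__/__aexit__ manage the pool's lifetime.
    async with aiomysql.create_pool(host='localhost', user='root', db='test') as pool:
        async with pool.acquire() as conn:
            async with conn.cursor() as cur:
                await cur.execute('SELECT 1')

    # The footgun: `await create_pool(...)` hands you a bare pool, and
    # closing it correctly becomes your problem:
    #
    #   pool = await aiomysql.create_pool(...)
    #   try:
    #       ...
    #   finally:
    #       pool.close()
    #       await pool.wait_closed()

asyncio.run(main())
```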
…-is#13818)

Besides the types and some transient exceptions, I think I fixed our
task exception handling in several spots. Two things:

1. We do not need to wait on a cancelled task. If it was not done, then
it could not possibly have an exception to retrieve. Moreover, now that
it is cancelled, there is nothing else to do. Cancellation is immediate.

2. If a task is done, we *must* always retrieve its exception; otherwise
we might never see the exception.
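A sketch of both rules in plain asyncio terms (the helper names here are illustrative):

```python3
import asyncio
import logging

log = logging.getLogger(__name__)

def reap_task(task: asyncio.Task) -> None:
    # Rule 2: if the task finished, always consume its exception; otherwise
    # asyncio logs "Task exception was never retrieved" when the task is
    # garbage collected.
    if task.done() and not task.cancelled():
        exc = task.exception()
        if exc is not None:
            log.error('background task failed', exc_info=exc)

def cancel_task(task: asyncio.Task) -> None:
    # Rule 1: a task that is not yet done has no exception to retrieve, so
    # just cancel it and move on; a task that is already done should be
    # reaped instead.
    if task.done():
        reap_task(task)
    else:
        task.cancel()
```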
ehigham and others added 23 commits October 31, 2023 11:46
…13922)

Namely, TableKeyByAndAggregate and TableAggregateByKey
Fixes hail-is#13860

---------

Co-authored-by: iris <84595986+iris-garden@users.noreply.github.com>
The `DeployConfig.service_ns` doesn't really do anything; we always use
the `_default_namespace`. This is maybe from an earlier age when some
services might have lived in different namespaces.
These are not used as far as I can tell.
The combiner benchmarks broke following the deletion of the
`experimental.vcf_combiner` python package. Re-implement them in terms
of the `vds` package.
We have no high-level IR analogue to `CodeBuilderLike.switch`. Such a
node is useful for flattening deeply-nested `If` nodes in the IR that
are predicated on integer equality.
This partially addresses the stack-overflow error on the
`matrix_multi_write_nothing` benchmark, which currently overflows the
stack when computing the type of the CDA.
- `CreateNamespaceStep.public` was entirely unused.
- `adminServiceAccount` is not used in `build.yaml`, so
`CreateNamespaceStep.admin_service_account` is always `None`, meaning it
has no effect.
- The three environment variables that I deleted from the
`deployment.yaml` are, as far as I can tell, entirely unused (they are
now grabbed from the global config).
Very small change, something I noticed while working on something else
entirely. Given how this is currently used I don't think it needs to be
boxed anymore.
This is a fix for an error Ben found.

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/batch/worker/worker.py", line 1907, in run
    await self.setup_io()
  File "/usr/local/lib/python3.9/dist-packages/batch/worker/worker.py", line 1848, in setup_io
    await self.disk.create(labels=labels)
  File "/usr/local/lib/python3.9/dist-packages/batch/cloud/gcp/worker/disk.py", line 47, in create
    await self._attach()
  File "/usr/local/lib/python3.9/dist-packages/batch/cloud/gcp/worker/disk.py", line 112, in _attach
    self.last_response = await self.compute_client.attach_disk(
  File "/usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/aiogoogle/client/compute_client.py", line 83, in attach_disk
    return await self._request_with_zonal_operations_response(self.post, path, params, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/aiogoogle/client/compute_client.py", line 126, in _request_with_zonal_operations_response
    return await retry_transient_errors(request_and_wait)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py", line 763, in retry_transient_errors
    return await retry_transient_errors_with_debug_string('', 0, f, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py", line 775, in retry_transient_errors_with_debug_string
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/aiogoogle/client/compute_client.py", line 116, in request_and_wait
    raise GCPOperationError(result['httpErrorStatusCode'],
hailtop.aiocloud.aiogoogle.client.compute_client.GCPOperationError: GCPOperationError: 400:BAD REQUEST ['RESOURCE_IN_USE_BY_ANOTHER_RESOURCE'] ["The disk resource 'projects/hail-vdc/zones/us-central1-b/disks/batch-disk-82XXXXX' is already being used by 'projects/hail-vdc/zones/us-central1-b/instances/batch-worker-default-standard-yjXXXX'"]; {'kind': 'compute#operation', 'id': 'XXXXX', 'name': 'operation-XXXXX', 'zone': 'https://www.googleapis.com/compute/v1/projects/hail-vdc/zones/us-central1-b', 'clientOperationId': 'XXXX', 'operationType': 'attachDisk', 'targetLink': 'https://www.googleapis.com/compute/v1/projects/hail-vdc/zones/us-central1-b/instances/batch-worker-default-standard-yjupd', 'targetId': 'XXXX', 'status': 'DONE', 'user': 'batch2-agent@hail-vdc.iam.gserviceaccount.com', 'progress': 100, 'insertTime': '2023-10-30T20:38:40.145-07:00', 'startTime': '2023-10-30T20:38:41.871-07:00', 'endTime': '2023-10-30T20:38:42.367-07:00', 'error': {'errors': [{'code': 'RESOURCE_IN_USE_BY_ANOTHER_RESOURCE', 'message': "The disk resource 'projects/hail-vdc/zones/us-central1-b/disks/batch-disk-82XXXXX' is already being used by 'projects/hail-vdc/zones/us-central1-b/instances/batch-worker-default-standard-yjXXXX'"}]}, 'httpErrorStatusCode': 400, 'httpErrorMessage': 'BAD REQUEST', 'selfLink': 'https://www.googleapis.com/compute/v1/projects/hail-vdc/zones/us-central1-b/operations/operation-XXX'}
```
I couldn't find the best issue for this. Should fix hail-is#13908, but I
thought there was another issue about reducing noisy Grafana alerts,
which this PR also addresses.
This PR just populates the records for older batches into the `job_groups`
and `job_group_self_and_ancestors` tables.

Stacked on hail-is#13475
I forgot that "open" was a valid batch state when I created the job
groups table's state column as an enum. This should fix the failed
migration from hail-is#13487.
…l-is#13986)

The CSS for the website is a real mess. I initially tried to clean it
up, but that became a time sink. We should eventually do that, but for
now I made the minimal edits to get a reasonable-looking layout.

# Main Page
## Big

<img width="2032" alt="Screenshot 2023-11-07 at 12 19 20"
src="https://github.com/hail-is/hail/assets/106194/94c5c2d8-6a4d-44a9-888c-61b28d590857">
<img width="2032" alt="Screenshot 2023-11-07 at 12 19 27"
src="https://github.com/hail-is/hail/assets/106194/8c35f736-cd56-4d8b-b5d6-3284592ff65a">
<img width="2032" alt="Screenshot 2023-11-07 at 12 19 29"
src="https://github.com/hail-is/hail/assets/106194/9b396b45-bae5-469b-9825-b73a5cd8f917">
<img width="2032" alt="Screenshot 2023-11-07 at 12 19 31"
src="https://github.com/hail-is/hail/assets/106194/8d27d238-5b3a-4c40-9c32-a7eb691c622b">

## Phone
<img width="2032" alt="Screenshot 2023-11-07 at 12 22 42"
src="https://github.com/hail-is/hail/assets/106194/be32332a-cdba-4f6d-b117-d7e8c163d8c8">
<img width="2032" alt="Screenshot 2023-11-07 at 12 22 44"
src="https://github.com/hail-is/hail/assets/106194/ebc4f1d5-c728-4b0d-90f9-adbb9de4fd88">
<img width="2032" alt="Screenshot 2023-11-07 at 12 22 47"
src="https://github.com/hail-is/hail/assets/106194/9cffe08a-fdfa-4af4-b060-cfd242c0642c">
<img width="2032" alt="Screenshot 2023-11-07 at 12 22 48"
src="https://github.com/hail-is/hail/assets/106194/f5e5b09a-8692-4411-ba13-e7055c17be70">


# Docs
## Big
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 09"
src="https://github.com/hail-is/hail/assets/106194/564a47e5-8036-4e60-a7fc-16e5aeeabd94">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 19"
src="https://github.com/hail-is/hail/assets/106194/0d954da0-8bdb-49e0-aa66-4ac5e0acb1f4">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 25"
src="https://github.com/hail-is/hail/assets/106194/e0466542-90d3-440c-a7a5-b797b88af63c">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 40"
src="https://github.com/hail-is/hail/assets/106194/4d5e5946-b014-484c-b404-3e9bd4389378">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 49"
src="https://github.com/hail-is/hail/assets/106194/5e2e4666-3bac-4560-a831-4e2ea05de0ae">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 55"
src="https://github.com/hail-is/hail/assets/106194/5f103ee1-a168-47ca-a5b2-f1385d4deac9">

## Phone
<img width="2032" alt="Screenshot 2023-11-07 at 12 25 21"
src="https://github.com/hail-is/hail/assets/106194/087b638c-e6f8-4633-9639-9f188b6b2e57">
<img width="2032" alt="Screenshot 2023-11-07 at 12 25 23"
src="https://github.com/hail-is/hail/assets/106194/cba530ea-d75c-4609-8307-16b3096a0e8c">

With the navbar open on mobile, it looks the same as the non-docs
pages.
`org.apache.commons.lang` is from the `commons-lang` library, but in
`build.gradle` we explicitly depend on `commons-lang3`, which puts
everything under the `lang3` package. We must be picking up
`commons-lang` as a transitive dependency, but we no longer get it with
Spark 3.4. Regardless, it's better to use what we explicitly depend on.
Removes any occurrences of async / sync / async nesting in the code, i.e.
a coroutine should not, somewhere deep down, make a synchronous call
that blocks on the completion of an async task.

---------

Co-authored-by: Dan King <dking@broadinstitute.org>
…ail-is#13977)

This PR accounts for crun specifying memory requirements differently
under cgroups v2 than under cgroups v1. Should fix
hail-is#13902.
In particular, we need to incorporate and test hail-is#13977
as the proposed fix for jobs becoming unresponsive due to being
targeted by the kernel's OOM-killer.

(Our local gcsfuse repo workaround is replaced by upstream's.)

@illusional illusional left a comment


Yep, I thought that gcsfuse might come up as a conflict. Surprised how many commits there were to get up to date; Hail is a productive team!

@jmarshall (Author)

Successful dev deploy: https://ci.hail.populationgenomics.org.au/batches/429671

@jmarshall (Author)

Due to 2e536ff we also need to generate a new batch-worker-15 boot disk image, which has been done.

@jmarshall jmarshall merged commit f36c781 into main Nov 10, 2023
5 checks passed
@jmarshall jmarshall deleted the upstream-126+oom branch November 10, 2023 00:56