Upstream 0.2.126 and proposed OOM-killer fix (3b38d0b) #317
I really borked our SQL linting. This PR is short but it catches a few critical problems.

1. The point of `check-sql.sh` is to detect modifications or deletions of SQL files in PRs and fail if such a change occurs. Currently on `main` it does not detect modifications. In hail-is#13456, I removed the `delete-<service>-tables.sql` files (intentionally), so I added the `^D` to the `grep` regex to indicate that it is OK to have a deletion. What I inadvertently did, though, was change the rule from "it's OK to have Additions of any file OR Modifications of estimated-current.sql / delete-<service>-tables.sql" to "it's OK to have Additions OR Modifications OR Deletions of estimated-current.sql / delete-<service>-tables.sql". Really this should have been "it's OK to have Additions OR Modifications of estimated-current.sql OR Deletions of delete-<service>-tables.sql". I've changed it to reflect that rule.
2. Rules currently silently *pass* in CI with an error message that git is not installed. In hail-is#13437 I changed the image used to run the linters and inadvertently didn't include `git`, which `check-sql.sh` needs to run. Here's how it failed in a sneaky way:
   - Since `git` is not installed, all calls to `git` fail, but the script is not run with `set -e`, so every line of the script is executed.
   - Since the `git` lines fail, `modified_sql_file_list` remains empty.
   - Since `modified_sql_file_list` remains empty, it appears to the check at the end that everything checked out.
   - The if statement runs successfully and the script returns with exit code 0.

To fix this I do a few things:
- Installed `git` in the linting image.
- `set -e` by default and only enable `set +e` later on when necessary (because we don't want a failed `git diff` to immediately exit).
- Do away with the file checking and instead check the exit code of the grep. If nothing survives the grep filter, which means there were no illegal changes made, grep will return with exit code 1. So we treat that exit code as a success.
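The intended allow-rule can be modeled compactly. This is a hypothetical Python sketch over `git diff --name-status` lines, not the shell script itself: additions of any file pass, modifications only of `estimated-current.sql` pass, and deletions only of `delete-<service>-tables.sql` pass.

```python
import re

# Hypothetical model of the intended rule; the real check lives in check-sql.sh.
ALLOWED = re.compile(
    r"^A\t"                                # any addition
    r"|^M\t.*estimated-current\.sql$"      # modification of estimated-current.sql
    r"|^D\t.*delete-[a-z-]+-tables\.sql$"  # deletion of delete-<service>-tables.sql
)

def illegal_changes(name_status_lines):
    """Lines that survive the allow-filter are illegal; an empty result means success."""
    return [line for line in name_status_lines if not ALLOWED.match(line)]

bad = illegal_changes([
    "A\tbatch/sql/new-migration.sql",        # OK: addition of any file
    "M\tbatch/sql/estimated-current.sql",    # OK: allowed modification
    "D\tbatch/sql/delete-batch-tables.sql",  # OK: allowed deletion
    "M\tbatch/sql/other.sql",                # illegal modification
])
```

Mirroring the shell fix: an empty `bad` list corresponds to `grep` exiting with code 1, which the script now treats as success.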
Fixes hail-is#13556. I haven't tested these changes -- would like to get initial feedback first.
I neglected to include the extra classpath necessary when using a skinny JAR.
`f` is a thunk, so it is currently being evaluated thrice before being inserted into the code cache. The `compiledFunction` variable was unused, so I think this is what was originally intended.
ndarray concat was broken when the first input has size 0 along the concat axis. For example:

```
In [3]: hl.eval(hl.nd.hstack([hl.nd.zeros((2, 0)), hl.nd.array([[1.0, 2.0], [3.0, 4.0]])]))
Out[3]: array([[0., 2.], [0., 4.]])
```

The zeros matrix is 2 by 0, so horizontal concatenation should just return the other matrix. (I once saw the first column filled with random numbers, presumably from a buffer overflow.) I did some cleaning up in the concat implementation, but the functional change is to record the index of the first input which is non-empty along the concat axis, and when resetting to the start of the axis, reset to that non-empty index. Other size 0 inputs are correctly handled when incrementing the index; the problem was that the first read happens before an increment.
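The fixed behavior matches NumPy's `hstack`, which we can check directly: concatenating a `(2, 0)` matrix with a `(2, 2)` matrix along axis 1 simply reproduces the `(2, 2)` data.

```python
import numpy as np

# hstack with a zero-width first input should just return the other matrix's data
empty = np.zeros((2, 0))
mat = np.array([[1.0, 2.0], [3.0, 4.0]])
result = np.hstack([empty, mat])
# result == [[1.0, 2.0], [3.0, 4.0]]
```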
…13355) CHANGELOG: make hail's optimization rewriting filters to interval-filters smarter and more robust

Completely rewrites ExtractIntervalFilters. Instead of matching against very specific patterns, and failing completely for things that don't quite match (e.g. an input is let bound, or the fold implementing "locus is contained in a set of intervals" is written slightly differently), this uses a standard abstract interpretation framework, which is almost completely insensitive to the form of the IR, depending only on the semantics. It also correctly handles missing key fields, where the previous implementation often produced an unsound transformation of the IR. Also adds a much more thorough test suite than we had before.

At the top level, the analysis takes a boolean-typed IR `cond` in an environment where there is a reference to some `key`, and produces a set `intervals`, such that `cond` is equivalent to `cond & intervals.contains(key)` (in other words, `cond` implies `intervals.contains(key)`, or `intervals` contains all rows where `cond` is true). This means, for instance, it is safe to replace `TableFilter(t, cond)` with `TableFilter(TableFilterIntervals(t, intervals), cond)`.

Then in a second pass it rewrites `cond` to `cond2`, such that `cond & intervals.contains(key)` is equivalent to `cond2 & intervals.contains(key)` (in other words, `cond` implies `cond2`, and `cond2 & intervals.contains(key)` implies `cond`). This means it is safe to replace `TableFilter(t, cond)` with `TableFilter(TableFilterIntervals(t, intervals), cond2)`. A common example is when `cond` can be completely captured by the interval filter, i.e. `cond` is equivalent to `intervals.contains(key)`, in which case we can take `cond2 = True`, and the `TableFilter` can be optimized away.
This all happens in the function

```scala
def extractPartitionFilters(ctx: ExecuteContext, cond: IR, ref: Ref, key: IndexedSeq[String]): Option[(IR, IndexedSeq[Interval])] = {
  if (key.isEmpty) None
  else {
    val extract = new ExtractIntervalFilters(ctx, ref.typ.asInstanceOf[TStruct].typeAfterSelectNames(key))
    val trueSet = extract.analyze(cond, ref.name)
    if (trueSet == extract.KeySetLattice.top) None
    else {
      val rw = extract.Rewrites(mutable.Set.empty, mutable.Set.empty)
      extract.analyze(cond, ref.name, Some(rw), trueSet)
      Some((extract.rewrite(cond, rw), trueSet))
    }
  }
}
```

`trueSet` is the set of intervals which contains all rows where `cond` is true. This set is passed back into `analyze` in a second pass, which asks it to rewrite `cond` to something equivalent, under the assumption that all keys are contained in `trueSet`.

The abstraction of runtime values tracks two types of information:
* Is this value a reference to / copy of one of the key fields of this row? We need to know this to be able to recognize comparisons with key values, which we want to extract to interval filters.
* For boolean values (including, ultimately, the filter predicate itself), we track three sets of intervals of the key type: overapproximations of when the bool is true, false, and missing. Overapproximation here means, for example, that if the boolean evaluates to true in some row with key `k`, then `k` must be contained in the "true" set of intervals. But it's completely fine if the set of intervals contains keys of rows where the bool is not true. In particular, a boolean about which we know nothing (e.g. it's just some non-key boolean field in the dataset) is represented by an abstract boolean value where all three sets are the set of all keys.
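The contract between `cond`, `cond2`, and `intervals` can be stated as two checkable properties. Here is a toy Python model, where integer keys stand in for row keys and half-open `(lo, hi)` pairs stand in for intervals; the specific predicates are invented for illustration, and only the two properties being checked come from the PR description.

```python
# Toy model of the rewrite contract (hypothetical predicates).
intervals = [(0, 10), (20, 30)]

def contains(k):
    """intervals.contains(key) for our toy integer keys."""
    return any(lo <= k < hi for lo, hi in intervals)

def cond(k):   # the original filter predicate
    return contains(k) and k % 2 == 0

def cond2(k):  # the rewritten predicate, valid only for keys inside `intervals`
    return k % 2 == 0

# Soundness: cond implies intervals.contains(key) — intervals overapproximate cond.
sound = all(contains(k) for k in range(-5, 40) if cond(k))

# The rewrite contract: cond is equivalent to cond2 & intervals.contains(key),
# so filtering to intervals first and then applying the simpler cond2 is safe.
equivalent = all(cond(k) == (cond2(k) and contains(k)) for k in range(-5, 40))
```

In this toy example `cond2` is strictly simpler than `cond` because the interval membership has been factored out into the (pushed-down) interval filter, which is exactly the payoff of the second pass.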
CHANGELOG: Mitigate new transient error from Google Cloud Storage which manifests as `aiohttp.client_exceptions.ClientOSError: [Errno 1] [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2548)`. As of around 1500 ET 2023-10-16, this exception happens whenever we issue a lot of requests to GCS. See [Zulip thread](https://hail.zulipchat.com/#narrow/stream/300487-Hail-Batch-Dev/topic/cluster.20size/near/396777320).
The `logging_queries` variable is always *defined* but sometimes `None`.
…ail-is#13715) CHANGELOG: Fixes hail-is#13697, a long standing issue with QoB, in which a failing partition job or driver job is not failed in the Batch UI.

I am not sure why we did not do it this way in the first place. If a JVMJob raises an exception, Batch will mark the job as failed. Ergo, we should raise an exception when a driver or a worker fails! Here's an example: I used a simple pipeline that writes to a bucket to which I have read-only access. You can see an example Batch (where every partition fails): https://batch.hail.is/batches/8046901. [1]

```python3
import hail as hl
hl.utils.range_table(3, n_partitions=3).write('gs://neale-bge/foo.ht')
```

NB: I removed the `log.error` in `handleForPython` because that log is never necessary. That function converts a stack of exceptions into a triplet of the short message, the full exception with stack trace, and a Hail error id (if present). That triplet is always passed along to someone else who logs the exception. (FWIW, the error id indicates a Python source location that is associated with the error. On the Python side, we can look up that error id and provide a better stack trace.)

[1] You'll notice the logs are missing. I noticed this as well; it's a new bug. I fixed it in hail-is#13729.
Picking up where hail-is#13776 left off. CHANGELOG: improved speed of reading hail format datasets from disk

This PR speeds up decoding arrays in two main ways:
* Instead of calling `arrayType.isElementDefined(array, i)` on every single array element, which expands to

  ```scala
  val b = aoff + lengthHeaderBytes + (i >> 3)
  !((Memory.loadByte(b) & (1 << (i & 7).toInt)) != 0)
  ```

  process elements in groups of 64, and load the corresponding long of missing bits once.
* Once we have a whole long of missing bits, we can be smarter than branching on each bit. After flipping to get `presentBits`, we use the following pseudocode to extract the positions of the set bits, with time proportional to the number of set bits:

  ```
  while (presentBits != 0) {
    val idx = java.lang.Long.numberOfTrailingZeros(presentBits)
    // do something with idx
    presentBits = presentBits & (presentBits - 1) // unsets the rightmost set bit
  }
  ```

To avoid needing to handle the last block of 64 elements differently, this PR changes the layout of `PCanonicalArray` to ensure the missing bits are always padded out to a multiple of 64 bits. They were already padded to a multiple of 32, and I don't expect this change to have much of an effect. But if needed, blocking by 32 elements instead had very similar performance in my benchmarks.

I also experimented with unrolling loops. In the non-missing case, this is easy. In the missing case, I tried using `if (presentBits.bitCount >= 8)` to guard an unrolled inner loop. In both cases, unrolling was if anything slower. Dan observed benefit from unrolling, but that was combined with the first optimization above (not loading a bit from memory every element), which I believe was the real source of improvement.
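The set-bit extraction trick translates directly to Python, which makes it easy to check that it visits exactly the present elements in time proportional to the popcount.

```python
def set_bit_positions(present_bits):
    """Indices of set bits, lowest first, in time proportional to the popcount."""
    positions = []
    while present_bits != 0:
        # index of the lowest set bit; Python analogue of Long.numberOfTrailingZeros
        positions.append((present_bits & -present_bits).bit_length() - 1)
        present_bits &= present_bits - 1  # clear the lowest set bit
    return positions
```

For example, `0b101101` has bits 0, 2, 3, and 5 set, and the loop runs exactly four times.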
1. File rate is more interesting for small files. 2. The source_report controls the progress bar. By updating it eagerly while we are listing a directory, the progress bar is more accurate sooner. We currently wait until we get a semaphore for a particular file to update the progress bar.
Some quality-of-life stuff for `hailtop.aiotools.delete`. Without deleting in batches, I found it impossible to delete very large lists of files because we create too many asyncio tasks.
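The batching idea can be sketched in a few lines. This is a hypothetical simplification, not the code in `hailtop.aiotools.delete`: by awaiting one bounded batch at a time, the number of live asyncio tasks stays capped at `batch_size` instead of growing with the file list.

```python
import asyncio

# Sketch of batched deletion; `delete_one` is a caller-supplied coroutine.
async def delete_in_batches(paths, delete_one, batch_size=1000):
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        # only `batch_size` tasks exist at once, however long `paths` is
        await asyncio.gather(*(delete_one(p) for p in batch))

# usage with a stand-in deleter that just records what it was asked to delete
deleted = []

async def record_delete(path):
    deleted.append(path)

asyncio.run(delete_in_batches([f"gs://bucket/f{i}" for i in range(7)],
                              record_delete, batch_size=3))
```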
…ail-is#13794) Consider this:

```scala
class Foo {
  def bar(): (Long, Long) = (3, 4)

  def destructure(): Unit = {
    val (x, y) = bar()
  }

  def accessors(): Unit = {
    val zz = bar()
    val x = zz._1
    val y = zz._2
  }
}
```

![image](https://github.com/hail-is/hail/assets/106194/532dc7ea-8027-461d-8e12-3217f5451713)

These should be exactly equivalent, right? There's no way Scala would compile the match into something horrible. Right? Right?

```
public void destructure();
  Code:
     0: aload_0
     1: invokevirtual #27  // Method bar:()Lscala/Tuple2;
     4: astore_3
     5: aload_3
     6: ifnull        35
     9: aload_3
    10: invokevirtual #33  // Method scala/Tuple2._1$mcJ$sp:()J
    13: lstore        4
    15: aload_3
    16: invokevirtual #36  // Method scala/Tuple2._2$mcJ$sp:()J
    19: lstore        6
    21: new           #13  // class scala/Tuple2$mcJJ$sp
    24: dup
    25: lload         4
    27: lload         6
    29: invokespecial #21  // Method scala/Tuple2$mcJJ$sp."<init>":(JJ)V
    32: goto          47
    35: goto          38
    38: new           #38  // class scala/MatchError
    41: dup
    42: aload_3
    43: invokespecial #41  // Method scala/MatchError."<init>":(Ljava/lang/Object;)V
    46: athrow
    47: astore_2
    48: aload_2
    49: invokevirtual #33  // Method scala/Tuple2._1$mcJ$sp:()J
    52: lstore        8
    54: aload_2
    55: invokevirtual #36  // Method scala/Tuple2._2$mcJ$sp:()J
    58: lstore        10
    60: return

public void accessors();
  Code:
     0: aload_0
     1: invokevirtual #27  // Method bar:()Lscala/Tuple2;
     4: astore_1
     5: aload_1
     6: invokevirtual #33  // Method scala/Tuple2._1$mcJ$sp:()J
     9: lstore_2
    10: aload_1
    11: invokevirtual #36  // Method scala/Tuple2._2$mcJ$sp:()J
    14: lstore        4
    16: return
```

Yeah, so, it extracts the first and second elements of the primitive-specialized tuple, ~~constructs a `(java.lang.Long, java.lang.Long)` Tuple~~ constructs another primitive-specialized tuple (for no reason???), then does the match on that. Sigh.
The conceptual change here is we want to parameterize all batch-related tables with a new job group ID, which I've set to **0** for the root job group. We need to make sure all future inserts / updates into the batches table are propagated to the new job groups table. When we create a batch now, we also create the corresponding entries in the job groups and job group parents tables. I chose the root job group to be 0 because, conceptually, the client should start numbering job groups at 1 and not know there is a hidden root job group being created under the hood. I'm not wedded to this. I tried to check for all the indices that would be needed in my prototype. It's possible I missed one or two, but it's not a big deal to add them later. I don't think we need to test this on a populated database (dev deploy main, submit jobs, then run the migration), but let me know if you think that would be helpful.
This change grew out of hail-is#13674. The idea is simple: we shouldn't be appending code after control statements, since any code appended there is dead. That idea opened Pandora's box, but now we're not generating and dropping dead code anymore.

Main changes that rose from fixing fallout from adding an assert in `Block.append`:
- Implement basic control-flow structures (if, while, for, switch) in `CodeBuilderLike` and remove the older implementations from `Code`.
  - The main difference is that these are built from sequencing `Code` operations rather than being defined from LIR, allowing a higher-level implementation that I think is simpler to read.
- Use the type system to prevent foot-guns like `cb.ifx(cond, label.goto)`.

Other changes:
- Rename `ifx`, `forLoop` and `whileLoop` to just `if_`, `for_` and `while_`, respectively.
- Implement loops in terms of one another to remove code duplication.
- Fix logic for when to write IRs, as some default-value behaviour was broken when `HAIL_WRITE_IR_FILES` was set in tests.
…s#13849) Fixes hail-is#13788: - Add `raise_unless_column_indexed` guard and apply to all column-indexed parameters in `statgen.py`. - Rename `check_row_indexed` and `check_entry_indexed` as I'm allergic to functions called "check" - now it's clearer what they do.
This is the result of some experimentation. With ten-way parallelism, the copier very rarely gets rate-limited. With 75-way parallelism (the default), we almost always experience tens of transient errors. If we start at ten and back off as in this PR, I can get to 75 with just a handful of transient errors. cc: @jigold
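One way to picture the policy is a counter that ramps from ten-way toward 75-way parallelism and backs off on rate limiting. This is a hypothetical sketch, not the PR's code; the class name and the window size are invented here for illustration.

```python
# Sketch: grow the parallelism limit only after a window of requests that saw
# no rate limiting; halve it (floored at the safe starting point) when we do.
class RampingParallelism:
    def __init__(self, start=10, maximum=75, window=50):
        self.limit = start
        self.maximum = maximum
        self.window = window
        self._successes = 0

    def on_success(self):
        self._successes += 1
        if self._successes >= self.window and self.limit < self.maximum:
            self.limit += 1
            self._successes = 0

    def on_rate_limited(self):
        self.limit = max(10, self.limit // 2)  # back off toward the safe floor
        self._successes = 0

ramp = RampingParallelism()
for _ in range(50):
    ramp.on_success()
limit_after_window = ramp.limit   # grew from 10 to 11
ramp.on_rate_limited()
limit_after_backoff = ramp.limit  # halved, floored at 10
```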
Similar to hail-is#13818. We *must* retrieve exceptions from any task that is `done`, otherwise we'll get a warning when the task is freed.
Containers get deleted when a job is cancelled. This is not exceptional behavior. Example: https://cloudlogging.app.goo.gl/punCSPauoM1ZEqZ27
🤦 I grepped for other `hasattr` calls that do not check `self`.
We should never have been using `await`. (aiomysql should probably not implement `__await__`.) `create_pool` returns `aiomysql.utils._PoolContextManager`, which inherits from `aiomysql.utils._ContextManager`, which implements `__await__`, `__aenter__`, and `__aexit__` thusly:

```python3
def __await__(self):
    return self._coro.__await__()

async def __aenter__(self):
    self._obj = await self._coro
    return self._obj

async def __aexit__(self, exc_type, exc, tb):
    await self._obj.close()
    self._obj = None
```

`__await__` is a footgun! If you `await` the context manager instead of using `async with`, you get the pool back with no cleanup attached, and you are then responsible for closing it yourself.
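The footgun can be demonstrated with a self-contained model: the `_ContextManager` shape is copied from above, while `FakePool` and `make_pool` are stand-ins invented here so the example runs without aiomysql.

```python
import asyncio

# Model of aiomysql's _ContextManager with a fake pool, showing that awaiting
# it hands back the raw object and skips the cleanup `async with` would do.
class _ContextManager:
    def __init__(self, coro):
        self._coro = coro
        self._obj = None

    def __await__(self):
        return self._coro.__await__()

    async def __aenter__(self):
        self._obj = await self._coro
        return self._obj

    async def __aexit__(self, exc_type, exc, tb):
        await self._obj.close()
        self._obj = None

class FakePool:
    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True

async def make_pool():
    return FakePool()

async def main():
    awaited_pool = await _ContextManager(make_pool())  # footgun: never closed
    async with _ContextManager(make_pool()) as ctx_pool:
        pass                                           # closed on exit
    return awaited_pool.closed, ctx_pool.closed

footgun_closed, ctx_closed = asyncio.run(main())
```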
…-is#13818) Besides the types and some transient exceptions, I think I fixed our task exception handling in several spots. Two things:

1. We do not need to wait on a cancelled task. If it was not done, then it could not possibly have an exception to retrieve. Moreover, now that it is cancelled, there is nothing else to do. Cancellation is immediate.
2. If a task is done, we *must* always retrieve the exception, otherwise we might silently drop an exception.
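The second point can be reproduced in a few lines: once a task is done, calling `.exception()` (or awaiting the task) marks the exception as retrieved; otherwise asyncio logs "Task exception was never retrieved" when the task is garbage collected. A minimal sketch:

```python
import asyncio

async def failing_job():
    raise ValueError("job failed")

async def main():
    task = asyncio.create_task(failing_job())
    await asyncio.sleep(0)  # yield so the task runs to completion
    assert task.done()
    exc = task.exception()  # retrieve the exception; suppresses the GC-time warning
    return type(exc).__name__

exc_name = asyncio.run(main())
```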
…13922) Namely, TableKeyByAndAggregate and TableAggregateByKey
Fixes hail-is#13895. This is unused now.
Fixes hail-is#13860 --------- Co-authored-by: iris <84595986+iris-garden@users.noreply.github.com>
…ail-is#13945) Fixes hail-is#13858. <img width="589" alt="Screenshot 2023-10-30 at 12 23 06 PM" src="https://github.com/hail-is/hail/assets/1693348/5ad26813-5534-488c-8029-f2607ba72033">
The `DeployConfig.service_ns` doesn't really do anything; we always use the `_default_namespace`. This is maybe from an earlier age when some services might have lived in different namespaces.
These are not used as far as I can tell.
The combiner benchmarks broke following the deletion of the `experimental.vcf_combiner` python package. Re-implement them in terms of the `vds` package.
We have no high-level IR analogue to `CodeBuilderLike.switch`. Such a node is useful for flattening deeply nested `If` nodes in the IR that are predicated on integer equality. This partially addresses the stack-overflow error on the `matrix_multi_write_nothing` benchmark, which currently overflows the stack when computing the type of the CDA.
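A miniature of the flattening idea, with hypothetical node shapes invented here (not Hail's actual IR): a chain of `If` nodes that each test the scrutinee against a distinct integer collapses into one `Switch`, so downstream passes recurse once instead of once per case, which is what keeps the stack shallow.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class If:
    index: int        # this branch fires when the scrutinee equals `index`
    then: Any
    otherwise: Any

@dataclass
class Switch:
    cases: dict       # index -> arm
    default: Any

def flatten(node):
    """Collapse a right-leaning chain of integer-equality Ifs into one Switch."""
    cases = {}
    while isinstance(node, If):
        cases[node.index] = node.then
        node = node.otherwise
    return Switch(cases, default=node)

deep = If(0, "a", If(1, "b", If(2, "c", "d")))
flat = flatten(deep)
```

Note the loop is iterative, so flattening itself uses constant stack however deep the `If` chain is.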
- `CreateNamespaceStep.public` was entirely unused - `adminServiceAccount` is not used in `build.yaml` so `CreateNamespaceStep.admin_service_account` is always `None` meaning it has no effect. - The three environment variables that I deleted from the `deployment.yaml` are as far as I can tell entirely unused (they are now grabbed from the global config)
Very small change, something I noticed while working on something else entirely. Given how this is currently used I don't think it needs to be boxed anymore.
This is a fix for an error Ben found.

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/batch/worker/worker.py", line 1907, in run
    await self.setup_io()
  File "/usr/local/lib/python3.9/dist-packages/batch/worker/worker.py", line 1848, in setup_io
    await self.disk.create(labels=labels)
  File "/usr/local/lib/python3.9/dist-packages/batch/cloud/gcp/worker/disk.py", line 47, in create
    await self._attach()
  File "/usr/local/lib/python3.9/dist-packages/batch/cloud/gcp/worker/disk.py", line 112, in _attach
    self.last_response = await self.compute_client.attach_disk(
  File "/usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/aiogoogle/client/compute_client.py", line 83, in attach_disk
    return await self._request_with_zonal_operations_response(self.post, path, params, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/aiogoogle/client/compute_client.py", line 126, in _request_with_zonal_operations_response
    return await retry_transient_errors(request_and_wait)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py", line 763, in retry_transient_errors
    return await retry_transient_errors_with_debug_string('', 0, f, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py", line 775, in retry_transient_errors_with_debug_string
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/aiogoogle/client/compute_client.py", line 116, in request_and_wait
    raise GCPOperationError(result['httpErrorStatusCode'],
hailtop.aiocloud.aiogoogle.client.compute_client.GCPOperationError: GCPOperationError: 400:BAD REQUEST ['RESOURCE_IN_USE_BY_ANOTHER_RESOURCE'] ["The disk resource 'projects/hail-vdc/zones/us-central1-b/disks/batch-disk-82XXXXX' is already being used by 'projects/hail-vdc/zones/us-central1-b/instances/batch-worker-default-standard-yjXXXX'"]; {'kind': 'compute#operation', 'id': 'XXXXX', 'name': 'operation-XXXXX', 'zone': 'https://www.googleapis.com/compute/v1/projects/hail-vdc/zones/us-central1-b', 'clientOperationId': 'XXXX', 'operationType': 'attachDisk', 'targetLink': 'https://www.googleapis.com/compute/v1/projects/hail-vdc/zones/us-central1-b/instances/batch-worker-default-standard-yjupd', 'targetId': 'XXXX', 'status': 'DONE', 'user': 'batch2-agent@hail-vdc.iam.gserviceaccount.com', 'progress': 100, 'insertTime': '2023-10-30T20:38:40.145-07:00', 'startTime': '2023-10-30T20:38:41.871-07:00', 'endTime': '2023-10-30T20:38:42.367-07:00', 'error': {'errors': [{'code': 'RESOURCE_IN_USE_BY_ANOTHER_RESOURCE', 'message': "The disk resource 'projects/hail-vdc/zones/us-central1-b/disks/batch-disk-82XXXXX' is already being used by 'projects/hail-vdc/zones/us-central1-b/instances/batch-worker-default-standard-yjXXXX'"}]}, 'httpErrorStatusCode': 400, 'httpErrorMessage': 'BAD REQUEST', 'selfLink': 'https://www.googleapis.com/compute/v1/projects/hail-vdc/zones/us-central1-b/operations/operation-XXX'}
```
I couldn't find the best issue for this. It should fix hail-is#13908, but I thought there was another issue about reducing noisy Grafana alerts, which this PR also addresses.
Fixes hail-is#13903 <img width="944" alt="Screenshot 2023-10-30 at 3 47 30 PM" src="https://github.com/hail-is/hail/assets/1693348/16e49387-9ded-44a1-8cc3-501aae889cf5">
Another small step in going key-less.
This PR just populates the records for older batches into the `job_groups` and `job_group_self_and_ancestors` tables. Stacked on hail-is#13475.
I forgot that "open" was a valid batch state when I created the job groups table's state column as an enum. This should fix the failed migration from hail-is#13487.
…l-is#13986) The CSS for the website is a real mess. I initially tried to clean it up, but that became a time sink. We should eventually do that, but for now I made the minimal edits to get a reasonable looking layout.

# Main Page

## Big

<img width="2032" alt="Screenshot 2023-11-07 at 12 19 20" src="https://github.com/hail-is/hail/assets/106194/94c5c2d8-6a4d-44a9-888c-61b28d590857">
<img width="2032" alt="Screenshot 2023-11-07 at 12 19 27" src="https://github.com/hail-is/hail/assets/106194/8c35f736-cd56-4d8b-b5d6-3284592ff65a">
<img width="2032" alt="Screenshot 2023-11-07 at 12 19 29" src="https://github.com/hail-is/hail/assets/106194/9b396b45-bae5-469b-9825-b73a5cd8f917">
<img width="2032" alt="Screenshot 2023-11-07 at 12 19 31" src="https://github.com/hail-is/hail/assets/106194/8d27d238-5b3a-4c40-9c32-a7eb691c622b">

## Phone

<img width="2032" alt="Screenshot 2023-11-07 at 12 22 42" src="https://github.com/hail-is/hail/assets/106194/be32332a-cdba-4f6d-b117-d7e8c163d8c8">
<img width="2032" alt="Screenshot 2023-11-07 at 12 22 44" src="https://github.com/hail-is/hail/assets/106194/ebc4f1d5-c728-4b0d-90f9-adbb9de4fd88">
<img width="2032" alt="Screenshot 2023-11-07 at 12 22 47" src="https://github.com/hail-is/hail/assets/106194/9cffe08a-fdfa-4af4-b060-cfd242c0642c">
<img width="2032" alt="Screenshot 2023-11-07 at 12 22 48" src="https://github.com/hail-is/hail/assets/106194/f5e5b09a-8692-4411-ba13-e7055c17be70">

# Docs

## Big

<img width="2032" alt="Screenshot 2023-11-07 at 12 24 09" src="https://github.com/hail-is/hail/assets/106194/564a47e5-8036-4e60-a7fc-16e5aeeabd94">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 19" src="https://github.com/hail-is/hail/assets/106194/0d954da0-8bdb-49e0-aa66-4ac5e0acb1f4">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 25" src="https://github.com/hail-is/hail/assets/106194/e0466542-90d3-440c-a7a5-b797b88af63c">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 40" src="https://github.com/hail-is/hail/assets/106194/4d5e5946-b014-484c-b404-3e9bd4389378">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 49" src="https://github.com/hail-is/hail/assets/106194/5e2e4666-3bac-4560-a831-4e2ea05de0ae">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 55" src="https://github.com/hail-is/hail/assets/106194/5f103ee1-a168-47ca-a5b2-f1385d4deac9">

## Phone

<img width="2032" alt="Screenshot 2023-11-07 at 12 25 21" src="https://github.com/hail-is/hail/assets/106194/087b638c-e6f8-4633-9639-9f188b6b2e57">
<img width="2032" alt="Screenshot 2023-11-07 at 12 25 23" src="https://github.com/hail-is/hail/assets/106194/cba530ea-d75c-4609-8307-16b3096a0e8c">

With the navbar open, in mobile, it looks the same as the non-docs pages.
`org.apache.commons.lang` is from the `commons-lang` library, but in `build.gradle` we explicitly depend on `commons-lang3` which puts everything under the `lang3` package. We must be picking up `commons-lang` as some transitive dependency but we no longer get it in Spark 3.4. Regardless, better to use what we explicitly depend on.
Removes any occurrences of async / sync / async nesting in the code, i.e. a coroutine should not, somewhere deep down, make a synchronous call that blocks on the completion of an async task. --------- Co-authored-by: Dan King <dking@broadinstitute.org>
…ail-is#13977) This PR accounts for crun specifying memory requirements differently under cgroups v2 than under cgroups v1. Should fix hail-is#13902.
In particular, we need to incorporate and test hail-is#13977 as the proposed fix for jobs becoming unresponsive due to being targeted by the kernel's OOM-killer. (Our local gcsfuse repo workaround is replaced by upstream's.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I thought that gcsfuse might come up as a conflict. Surprised how many commits there were to get up to date; Hail is a productive team!
Successful dev deploy: https://ci.hail.populationgenomics.org.au/batches/429671
Due to 2e536ff we also need to generate a new