Upstream 0.2.126 and proposed OOM-killer fix (3b38d0b) #317

Merged
merged 96 commits into from
Nov 10, 2023

Conversation


@jmarshall jmarshall commented Nov 9, 2023

In particular, we need to incorporate and exercise hail-is#13977 as the proposed fix for jobs becoming unresponsive due to being targeted by the kernel's OOM-killer.

daniel-goldstein and others added 30 commits October 10, 2023 21:47
I really borked our SQL linting. This PR is short but it catches a few
critical problems.


1. The point of `check-sql.sh` is to detect modifications or deletions
of SQL files in PRs and fail if such a change occurs. Currently on
`main` it does not detect modifications. In hail-is#13456, I removed the
`delete-<service>-tables.sql` files (intentionally), so I added the `^D`
to the `grep` regex to indicate that it is OK to have a deletion. What I
inadvertently did, though, was change the rule from "It's OK to have
Additions of any file OR Modifications of estimated-current.sql /
delete-<service>-tables.sql" to "It's OK to have Additions OR
Modifications OR Deletions of estimated-current.sql /
delete-<service>-tables.sql". Really this should have been "It's OK to
have Additions OR Modifications of estimated-current.sql OR Deletions of
delete-<service>-tables.sql". I've changed it to reflect that rule.

2. Rules currently silently *pass* in CI with an error message that git
is not installed. In hail-is#13437 I changed the image used to run the linters
and inadvertently didn't include `git` which `check-sql.sh` needs to
run. Here's how it failed in a sneaky way:
- Since `git` is not installed, all calls to `git` fail, but the script
is not run with `set -e` so every line of the script is executed
- Since `git` lines fail, `modified_sql_file_list` remains empty
- Since `modified_sql_file_list` remains empty, it appears to the check
at the end that everything checked out
- The if statement runs successfully and the script returns with error
code 0

To fix this I did a few things:
- installed `git` in the linting image
- `set -e` by default and only enable `set +e` later on when necessary
(because we don't want a failed `git diff` to immediately exit)
- Do away with the file checking and instead check the exit code of the
grep. If nothing survives the grep filter, meaning no illegal changes
were made, grep returns with exit code 1, so we treat that exit code as
a success (illustrated below).
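For reference, a minimal sketch of that exit-code convention (the real check is `check-sql.sh`; this Python version and its `ALLOWED` pattern are illustrative stand-ins for the actual rule):

```python3
import subprocess

# Hypothetical stand-in for the real rule: additions of any file, modifications
# of estimated-current.sql, and deletions of delete-<service>-tables.sql are OK.
ALLOWED = r'^A|^M[[:space:]]+.*estimated-current\.sql|^D[[:space:]]+.*delete-.*-tables\.sql'

def sql_changes_ok(git_diff_name_status: str) -> bool:
    # grep -Ev prints the lines that do NOT match ALLOWED, i.e. illegal changes.
    result = subprocess.run(
        ['grep', '-Ev', ALLOWED],
        input=git_diff_name_status,
        capture_output=True,
        text=True,
    )
    if result.returncode == 1:     # nothing survived the filter: success
        return True
    if result.returncode == 0:     # some disallowed change survived
        print('illegal SQL changes:\n' + result.stdout)
        return False
    raise RuntimeError(result.stderr)  # grep itself failed (exit code >= 2)
```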
Fixes hail-is#13556. I haven't tested these changes -- would like to get
initial feedback first.
I neglected to include the extra classpath necessary when using a skinny
JAR.
`f` is a thunk, so it is currently evaluated three times before being
inserted into the code cache. The `compiledFunction` variable was
unused, so I think this is what was originally intended.
ndarray concat was broken when the first input has size 0 along the
concat axis. For example
```
In [3]: hl.eval(hl.nd.hstack([hl.nd.zeros((2, 0)), hl.nd.array([[1.0, 2.0], [3.0, 4.0]])]))
Out[3]:
array([[0., 2.],
       [0., 4.]])
```
The zeros matrix is 2 by 0, so horizontal concatenation should just
return the other matrix.
(I once saw the first column filled with random numbers, presumably from
a buffer overflow.)

I did some cleaning up in the concat implementation, but the functional
change is to record the index of the first input which is non-empty
along the concat axis, and when resetting to the start of the axis,
reset to that non-empty index. Other size-0 inputs are correctly handled
when incrementing the index; the problem was that the first read happens
before an increment.
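For comparison, with the fix the same call should simply return the non-empty input:

```python3
import hail as hl

# The (2, 0) zeros matrix contributes no columns, so hstack should return
# the second matrix unchanged.
hl.eval(hl.nd.hstack([hl.nd.zeros((2, 0)), hl.nd.array([[1.0, 2.0], [3.0, 4.0]])]))
# expected: array([[1., 2.],
#                  [3., 4.]])
```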
…13355)

CHANGELOG: make hail's optimization rewriting filters to
interval-filters smarter and more robust

Completely rewrites ExtractIntervalFilters. Instead of matching against
very specific patterns, and failing completely for things that don't
quite match (e.g. an input is let bound, or the fold implementing "locus
is contained in a set of intervals" is written slightly differently),
this uses a standard abstract interpretation framework, which is almost
completely insensitive to the form of the IR, only depending on the
semantics. It also correctly handles missing key fields, where the
previous implementation often produced an unsound transformation of the
IR.

Also adds a much more thorough test suite than we had before.

At the top level, the analysis takes a boolean typed IR `cond` in an
environment where there is a reference to some `key`, and produces a set
`intervals`, such that `cond` is equivalent to `cond &
intervals.contains(key)` (in other words `cond` implies
`intervals.contains(key)`, or `intervals` contains all rows where `cond`
is true). This means for instance it is safe to replace `TableFilter(t,
cond)` with `TableFilter(TableFilterIntervals(t, intervals), cond)`.

Then in a second pass it rewrites `cond` to `cond2`, such that `cond &
(intervals.contains(key))` is equivalent to `cond2 &
intervals.contains(key)` (in other words `cond` implies `cond2`, and
`cond2 & intervals.contains(key)` implies `cond`). This means it is safe
to replace the `TableFilter(t, cond)` with
`TableFilter(TableFilterIntervals(t, intervals), cond2)`. A common
example is when `cond` can be completely captured by the interval
filter, i.e. `cond` is equivalent to `intervals.contains(key)`, in which
case we can take `cond2 = True`, and the `TableFilter` can be optimized
away.
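As a concrete, hypothetical example of the intended behaviour, consider a filter whose predicate mixes a key comparison with a non-key condition:

```python3
import hail as hl

ht = hl.utils.range_table(100)          # keyed by `idx`
ht = ht.annotate(x=hl.rand_unif(0, 1))

# cond = (idx >= 5) & (x > 0.5). The analysis should extract
# intervals = [[5, +inf)] over the key and rewrite the residual predicate
# to cond2 = (x > 0.5), so only partitions overlapping [5, +inf) are
# scanned and the remaining filter is applied to those rows.
ht = ht.filter((ht.idx >= 5) & (ht.x > 0.5))
```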

This all happens in the function
```scala
  def extractPartitionFilters(ctx: ExecuteContext, cond: IR, ref: Ref, key: IndexedSeq[String]): Option[(IR, IndexedSeq[Interval])] = {
    if (key.isEmpty) None
    else {
      val extract = new ExtractIntervalFilters(ctx, ref.typ.asInstanceOf[TStruct].typeAfterSelectNames(key))
      val trueSet = extract.analyze(cond, ref.name)
      if (trueSet == extract.KeySetLattice.top)
        None
      else {
        val rw = extract.Rewrites(mutable.Set.empty, mutable.Set.empty)
        extract.analyze(cond, ref.name, Some(rw), trueSet)
        Some((extract.rewrite(cond, rw), trueSet))
      }
    }
  }
```
`trueSet` is the set of intervals which contains all rows where `cond`
is true. This set is passed back into `analyze` in a second pass, which
asks it to rewrite `cond` to something equivalent, under the assumption
that all keys are contained in `trueSet`.

The abstraction of runtime values tracks two types of information:
* Is this value a reference to / copy of one of the key fields of this
row? We need to know this to be able to recognize comparisons with key
values, which we want to extract to interval filters.
* For boolean values (including, ultimately, the filter predicate
itself), we track three sets of intervals of the key type:
overapproximations of when the bool is true, false, and missing.
Overapproximation here means, for example, if the boolean evaluates to
true in some row with key `k`, then `k` must be contained in the "true"
set of intervals. But it's completely fine if the set of intervals
contains keys of rows where the bool is not true. In particular, a
boolean about which we know nothing (e.g. it's just some non-key boolean
field in the dataset) is represented by an abstract boolean value where
all three sets are the set of all keys.
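A minimal sketch of that abstract boolean domain, using plain Python sets of keys in place of interval lists (the real implementation works over interval lattices, so the names and details here are illustrative only):

```python3
from dataclasses import dataclass

@dataclass(frozen=True)
class AbstractBool:
    # Overapproximations of the keys for which the boolean may be true,
    # false, or missing.
    true_keys: frozenset
    false_keys: frozenset
    missing_keys: frozenset

    def and_(self, other: "AbstractBool") -> "AbstractBool":
        # a & b is true only where both may be true, so intersecting the
        # "true" sets remains an overapproximation; it may be false or
        # missing wherever either operand may be, so those sets are unioned.
        return AbstractBool(
            self.true_keys & other.true_keys,
            self.false_keys | other.false_keys,
            self.missing_keys | other.missing_keys,
        )

    def or_(self, other: "AbstractBool") -> "AbstractBool":
        # a | b is false only where both may be false; it may be true or
        # missing wherever either operand may be.
        return AbstractBool(
            self.true_keys | other.true_keys,
            self.false_keys & other.false_keys,
            self.missing_keys | other.missing_keys,
        )

def unknown(all_keys: frozenset) -> AbstractBool:
    # A boolean about which we know nothing: all three sets are "all keys".
    return AbstractBool(all_keys, all_keys, all_keys)
```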
CHANGELOG: Mitigate new transient error from Google Cloud Storage which
manifests as `aiohttp.client_exceptions.ClientOSError: [Errno 1] [SSL:
SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2548)`.

As of around 1500 ET 2023-10-16, this exception happens whenever we
issue a lot of requests to GCS.

See [Zulip
thread](https://hail.zulipchat.com/#narrow/stream/300487-Hail-Batch-Dev/topic/cluster.20size/near/396777320).
The `logging_queries` variable is always *defined* but sometimes `None`.
…ail-is#13715)

CHANGELOG: Fixes hail-is#13697, a long-standing issue with QoB, in which a
failing partition job or driver job is not failed in the Batch UI.

I am not sure why we did not do it this way in the first place. If a
JVMJob raises an exception, Batch will mark the job as failed. Ergo, we
should raise an exception when a driver or a worker fails!

Here's an example: I used a simple pipeline that writes to a bucket to
which I have read-only access. You can see an example Batch (where every
partition fails): https://batch.hail.is/batches/8046901. [1]

```python3
import hail as hl
hl.utils.range_table(3, n_partitions=3).write('gs://neale-bge/foo.ht')
```

NB: I removed the `log.error` in `handleForPython` because that log is
never necessary. That function converts a stack of exceptions into a
triplet of the short message, the full exception with stack trace, and a
Hail error id (if present). That triplet is always passed along to
someone else who logs the exception.

(FWIW, the error id indicates a Python source location that is
associated with the error. On the Python-side, we can look up that error
id and provide a better stack trace.)

[1] You'll notice the logs are missing. I noticed this as well, it's a
new bug. I fixed it in hail-is#13729.
Picking up where hail-is#13776 left off.

CHANGELOG: improved speed of reading hail format datasets from disk

This PR speeds up decoding arrays in two main ways:
* instead of calling `arrayType.isElementDefined(array, i)` on every
single array element, which expands to
  ```scala
  val b = aoff + lengthHeaderBytes + (i >> 3)
  !((Memory.loadByte(b) & (1 << (i & 7).toInt)) != 0)
  ```
process elements in groups of 64, and load the corresponding long of
missing bits once
* once we have a whole long of missing bits, we can be smarter than
branching on each bit. After flipping to get `presentBits`, we use the
following pseudocode to extract the positions of the set bits, with time
proportional to the number of set bits:
  ```
  while (presentBits != 0) {
    val idx = java.lang.Long.numberOfTrailingZeros(presentBits)
    // do something with idx
    presentBits = presentBits & (presentBits - 1) // unsets the rightmost set bit
  }
  ```
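The same trick, written as runnable Python for illustration (Python ints are unbounded, so we mask to a 64-bit word):

```python3
def present_indices(present_bits: int) -> list:
    # Extract the indices of set bits in time proportional to the number
    # of set bits.
    present_bits &= (1 << 64) - 1              # treat as a 64-bit word
    out = []
    while present_bits:
        # index of the lowest set bit (the numberOfTrailingZeros step)
        idx = (present_bits & -present_bits).bit_length() - 1
        out.append(idx)
        present_bits &= present_bits - 1       # clear the lowest set bit
    return out

assert present_indices(0b10110) == [1, 2, 4]
```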

To avoid needing to handle the last block of 64 elements differently,
this PR changes the layout of `PCanonicalArray` to ensure the missing
bits are always padded out to a multiple of 64 bits. They were already
padded to a multiple of 32, and I don't expect this change to have much
of an effect. But if needed, blocking by 32 elements instead had very
similar performance in my benchmarks.

I also experimented with unrolling loops. In the non-missing case, this
is easy. In the missing case, I tried using `if (presentBits.bitCount >=
8)` to guard an unrolled inner loop. In both cases, unrolling was, if
anything, slower.

Dan observed a benefit from unrolling, but that was combined with the
first optimization above (not loading a bit from memory for every
element), which I believe was the real source of the improvement.
1. File rate is more interesting for small files.
2. The source_report controls the progress bar. By updating it eagerly
while we are listing a directory, the progress bar is more accurate
sooner. We currently wait until we get a semaphore for a particular file
to update the progress bar.
Some quality-of-life stuff for `hailtop.aiotools.delete`.

Without deleting in batches, I found it impossible to delete very large
lists of files because we create too many asyncio tasks.
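A sketch of the batching idea (`delete_one` and the batch size are hypothetical placeholders; the point is to bound how many asyncio tasks exist at once):

```python3
import asyncio
import itertools

async def delete_in_batches(paths, delete_one, batch_size=1000):
    # Rather than creating one asyncio task per file up front (which can be
    # millions of tasks for very large listings), only materialize
    # `batch_size` tasks at a time.
    it = iter(paths)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        await asyncio.gather(*(delete_one(p) for p in batch))
```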
…ail-is#13794)

Consider this:

```scala
class Foo {
   def bar(): (Long, Long) = (3, 4)

   def destructure(): Unit = {
     val (x, y) = bar()
   }

   def accessors(): Unit = {
     val zz = bar()
     val x = zz._1
     val y = zz._2
   }
}
```


![image](https://github.com/hail-is/hail/assets/106194/532dc7ea-8027-461d-8e12-3217f5451713)

These should be exactly equivalent, right? There's no way Scala would
compile the match into something horrible. Right? Right?

```
public void destructure();
  Code:
     0: aload_0
     1: invokevirtual #27                 // Method bar:()Lscala/Tuple2;
     4: astore_3
     5: aload_3
     6: ifnull        35
     9: aload_3
    10: invokevirtual #33                 // Method scala/Tuple2._1$mcJ$sp:()J
    13: lstore        4
    15: aload_3
    16: invokevirtual #36                 // Method scala/Tuple2._2$mcJ$sp:()J
    19: lstore        6
    21: new           #13                 // class scala/Tuple2$mcJJ$sp
    24: dup
    25: lload         4
    27: lload         6
    29: invokespecial #21                 // Method scala/Tuple2$mcJJ$sp."<init>":(JJ)V
    32: goto          47
    35: goto          38
    38: new           #38                 // class scala/MatchError
    41: dup
    42: aload_3
    43: invokespecial #41                 // Method scala/MatchError."<init>":(Ljava/lang/Object;)V
    46: athrow
    47: astore_2
    48: aload_2
    49: invokevirtual #33                 // Method scala/Tuple2._1$mcJ$sp:()J
    52: lstore        8
    54: aload_2
    55: invokevirtual #36                 // Method scala/Tuple2._2$mcJ$sp:()J
    58: lstore        10
    60: return

public void accessors();
  Code:
     0: aload_0
     1: invokevirtual #27                 // Method bar:()Lscala/Tuple2;
     4: astore_1
     5: aload_1
     6: invokevirtual #33                 // Method scala/Tuple2._1$mcJ$sp:()J
     9: lstore_2
    10: aload_1
    11: invokevirtual #36                 // Method scala/Tuple2._2$mcJ$sp:()J
    14: lstore        4
    16: return
```

Yeah, so, it extracts the first and second elements of the
primitive-specialized tuple, ~~constructs a `(java.lang.Long,
java.lang.Long)` Tuple~~ constructs another primitive-specialized tuple
(for no reason???), then does the match on that.

sigh.
The conceptual change here is that we want to parameterize all batch-related
tables with a new job group ID, which I've set to **0** for the root
job group. We need to make sure all future inserts / updates into the
batches table are propagated to the new job groups table. When we create
a batch now, we also create the corresponding entries in the job
groups and job group parents tables.

I chose the root job group to be 0 because, conceptually, the client
should start numbering job groups at 1 and not know there is a hidden
root job group being created under the hood. I'm not wedded to this.

I tried to check for all the indices that would be needed in my
prototype. It's possible I missed one or two, but it's not a big deal to
add them later.

I don't think we need to test this on a populated database (dev deploy
main, submit jobs, then run the migration), but let me know if you think
that would be helpful.
This change grew out of hail-is#13674.
The idea is simple: we shouldn't be appending code after control
statements, as any statements appended there are dead code. That idea
opened Pandora's box, but now we're not generating and dropping dead
code anymore.

Main changes that arose from fixing the fallout of adding an assert in
`Block.append`:
- Implement basic control-flow structures (if, while, for, switch) in
`CodeBuilderLike` and remove the older implementations from `Code`.
- The main difference is that these are built by sequencing `Code`
operations rather than being defined from LIR.
- This allows for a higher-level implementation that I think is simpler
to read.
- Use the type system to prevent foot-guns like `cb.ifx(cond,
label.goto)`.

Other changes:
- Rename `ifx`, `forLoop` and `whileLoop` to just `if_`, `for_` and
`while_`, respectively.
- Implement loops in terms of one another to remove code duplication.
- Fix logic for when to write IRs, as some default-value behaviour was
broken when `HAIL_WRITE_IR_FILES` was set in tests.
…s#13849)

Fixes hail-is#13788:
- Add a `raise_unless_column_indexed` guard and apply it to all
column-indexed parameters in `statgen.py`.
- Rename `check_row_indexed` and `check_entry_indexed`, as I'm allergic
to functions called "check"; now it's clearer what they do.
This is the result of some experimentation. With ten-way parallelism,
the copier very rarely gets rate-limited. With 75-way parallelism (the
default), we almost always experience tens of transient errors. If we
start at ten and back off as in this PR, I can get to 75 with just a
handful of transient errors.

cc: @jigold
Similar to hail-is#13818. We *must*
retrieve exceptions from any task that is `done`, otherwise we'll get a
warning when the task is freed.
Containers get deleted when a job is cancelled. This is not exceptional
behavior.

Example: https://cloudlogging.app.goo.gl/punCSPauoM1ZEqZ27
🤦

I grepped for other `hasattr` calls that do not check `self`.
We should never have been using `await`. (aiomysql should probably not
implement `__await__`.) `create_pool` returns
`aiomysql.utils._PoolContextManager`, which inherits from
`aiomysql.utils._ContextManager`, which implements `__await__`,
`__aenter__`, and `__aexit__` thusly:

```python3
    def __await__(self):
        return self._coro.__await__()

    async def __aenter__(self):
        self._obj = await self._coro
        return self._obj

    async def __aexit__(self, exc_type, exc, tb):
        await self._obj.close()
        self._obj = None
```

`__await__` is a footgun! You should never do that! You should close the
return value of the coroutine!
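In other words, prefer the async-context-manager form, which closes the pool for you; a minimal sketch (connection parameters are placeholders):

```python3
import asyncio
import aiomysql

async def main():
    # Preferred: __aenter__/__aexit__ manage the pool's lifetime.
    async with aiomysql.create_pool(host='localhost', user='root', db='test') as pool:
        async with pool.acquire() as conn:
            async with conn.cursor() as cur:
                await cur.execute('SELECT 1')

    # The footgun: `await create_pool(...)` hands you a bare pool, and
    # closing it correctly becomes your problem:
    #
    #   pool = await aiomysql.create_pool(...)
    #   try:
    #       ...
    #   finally:
    #       pool.close()
    #       await pool.wait_closed()

asyncio.run(main())
```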
…-is#13818)

Besides the types and some transient exceptions, I think I fixed our
task exception handling in several spots. Two things:

1. We do not need to wait on a cancelled task. If it was not done, then
it could not possibly have an exception to retrieve. Moreover, now that
it is cancelled, there is nothing else to do. Cancellation is immediate.

2. If a task is done, we *must* always retrieve its exception; otherwise
we might never see the exception.
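A sketch of both rules in plain asyncio terms (the helper names here are illustrative):

```python3
import asyncio
import logging

log = logging.getLogger(__name__)

def reap_task(task: asyncio.Task) -> None:
    # Rule 2: if the task finished, always consume its exception; otherwise
    # asyncio logs "Task exception was never retrieved" when the task is
    # garbage collected.
    if task.done() and not task.cancelled():
        exc = task.exception()
        if exc is not None:
            log.error('background task failed', exc_info=exc)

def cancel_task(task: asyncio.Task) -> None:
    # Rule 1: a task that is not yet done has no exception to retrieve, so
    # just cancel it and move on; a task that is already done should be
    # reaped instead.
    if task.done():
        reap_task(task)
    else:
        task.cancel()
```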
ehigham and others added 23 commits October 31, 2023 11:46
…13922)

Namely, TableKeyByAndAggregate and TableAggregateByKey
Fixes hail-is#13860

---------

Co-authored-by: iris <84595986+iris-garden@users.noreply.github.com>
The `DeployConfig.service_ns` doesn't really do anything; we always use
the `_default_namespace`. This is maybe from an earlier age when some
services might have lived in different namespaces.
These are not used as far as I can tell.
The combiner benchmarks broke following the deletion of the
`experimental.vcf_combiner` python package. Re-implement them in terms
of the `vds` package.
We have no high-level IR analogue to `CodeBuilderLike.switch`. Such a
node is useful for flattening deeply-nested `If` nodes in the IR that
are predicated on integer equality.
This partially addresses the stack-overflow error on the
`matrix_multi_write_nothing` benchmark, which currently overflows the
stack when computing the type of the CDA.
- `CreateNamespaceStep.public` was entirely unused.
- `adminServiceAccount` is not used in `build.yaml`, so
`CreateNamespaceStep.admin_service_account` is always `None`, meaning it
has no effect.
- The three environment variables that I deleted from the
`deployment.yaml` are, as far as I can tell, entirely unused (they are
now grabbed from the global config).
Very small change, something I noticed while working on something else
entirely. Given how this is currently used I don't think it needs to be
boxed anymore.
This is a fix for an error Ben found.

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/batch/worker/worker.py", line 1907, in run
    await self.setup_io()
  File "/usr/local/lib/python3.9/dist-packages/batch/worker/worker.py", line 1848, in setup_io
    await self.disk.create(labels=labels)
  File "/usr/local/lib/python3.9/dist-packages/batch/cloud/gcp/worker/disk.py", line 47, in create
    await self._attach()
  File "/usr/local/lib/python3.9/dist-packages/batch/cloud/gcp/worker/disk.py", line 112, in _attach
    self.last_response = await self.compute_client.attach_disk(
  File "/usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/aiogoogle/client/compute_client.py", line 83, in attach_disk
    return await self._request_with_zonal_operations_response(self.post, path, params, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/aiogoogle/client/compute_client.py", line 126, in _request_with_zonal_operations_response
    return await retry_transient_errors(request_and_wait)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py", line 763, in retry_transient_errors
    return await retry_transient_errors_with_debug_string('', 0, f, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py", line 775, in retry_transient_errors_with_debug_string
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/aiogoogle/client/compute_client.py", line 116, in request_and_wait
    raise GCPOperationError(result['httpErrorStatusCode'],
hailtop.aiocloud.aiogoogle.client.compute_client.GCPOperationError: GCPOperationError: 400:BAD REQUEST ['RESOURCE_IN_USE_BY_ANOTHER_RESOURCE'] ["The disk resource 'projects/hail-vdc/zones/us-central1-b/disks/batch-disk-82XXXXX' is already being used by 'projects/hail-vdc/zones/us-central1-b/instances/batch-worker-default-standard-yjXXXX'"]; {'kind': 'compute#operation', 'id': 'XXXXX', 'name': 'operation-XXXXX', 'zone': 'https://www.googleapis.com/compute/v1/projects/hail-vdc/zones/us-central1-b', 'clientOperationId': 'XXXX', 'operationType': 'attachDisk', 'targetLink': 'https://www.googleapis.com/compute/v1/projects/hail-vdc/zones/us-central1-b/instances/batch-worker-default-standard-yjupd', 'targetId': 'XXXX', 'status': 'DONE', 'user': 'batch2-agent@hail-vdc.iam.gserviceaccount.com', 'progress': 100, 'insertTime': '2023-10-30T20:38:40.145-07:00', 'startTime': '2023-10-30T20:38:41.871-07:00', 'endTime': '2023-10-30T20:38:42.367-07:00', 'error': {'errors': [{'code': 'RESOURCE_IN_USE_BY_ANOTHER_RESOURCE', 'message': "The disk resource 'projects/hail-vdc/zones/us-central1-b/disks/batch-disk-82XXXXX' is already being used by 'projects/hail-vdc/zones/us-central1-b/instances/batch-worker-default-standard-yjXXXX'"}]}, 'httpErrorStatusCode': 400, 'httpErrorMessage': 'BAD REQUEST', 'selfLink': 'https://www.googleapis.com/compute/v1/projects/hail-vdc/zones/us-central1-b/operations/operation-XXX'}
```
I couldn't find the best issue for this. Should fix hail-is#13908, but I
thought there was another issue about reducing noisy Grafana alerts,
which this PR also addresses.
This PR just populates the records for older batches into the `job_groups`
and `job_group_self_and_ancestors` tables.

Stacked on hail-is#13475
I forgot that "open" was a valid batch state when I created the job
groups table's state column as an enum. This should fix the failed
migration from hail-is#13487.
…l-is#13986)

The CSS for the website is a real mess. I initially tried to clean it
up, but that became a time sink. We should eventually do that, but for
now I made the minimal edits to get a reasonable-looking layout.

# Main Page
## Big

<img width="2032" alt="Screenshot 2023-11-07 at 12 19 20"
src="https://github.com/hail-is/hail/assets/106194/94c5c2d8-6a4d-44a9-888c-61b28d590857">
<img width="2032" alt="Screenshot 2023-11-07 at 12 19 27"
src="https://github.com/hail-is/hail/assets/106194/8c35f736-cd56-4d8b-b5d6-3284592ff65a">
<img width="2032" alt="Screenshot 2023-11-07 at 12 19 29"
src="https://github.com/hail-is/hail/assets/106194/9b396b45-bae5-469b-9825-b73a5cd8f917">
<img width="2032" alt="Screenshot 2023-11-07 at 12 19 31"
src="https://github.com/hail-is/hail/assets/106194/8d27d238-5b3a-4c40-9c32-a7eb691c622b">

## Phone
<img width="2032" alt="Screenshot 2023-11-07 at 12 22 42"
src="https://github.com/hail-is/hail/assets/106194/be32332a-cdba-4f6d-b117-d7e8c163d8c8">
<img width="2032" alt="Screenshot 2023-11-07 at 12 22 44"
src="https://github.com/hail-is/hail/assets/106194/ebc4f1d5-c728-4b0d-90f9-adbb9de4fd88">
<img width="2032" alt="Screenshot 2023-11-07 at 12 22 47"
src="https://github.com/hail-is/hail/assets/106194/9cffe08a-fdfa-4af4-b060-cfd242c0642c">
<img width="2032" alt="Screenshot 2023-11-07 at 12 22 48"
src="https://github.com/hail-is/hail/assets/106194/f5e5b09a-8692-4411-ba13-e7055c17be70">


# Docs
## Big
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 09"
src="https://github.com/hail-is/hail/assets/106194/564a47e5-8036-4e60-a7fc-16e5aeeabd94">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 19"
src="https://github.com/hail-is/hail/assets/106194/0d954da0-8bdb-49e0-aa66-4ac5e0acb1f4">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 25"
src="https://github.com/hail-is/hail/assets/106194/e0466542-90d3-440c-a7a5-b797b88af63c">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 40"
src="https://github.com/hail-is/hail/assets/106194/4d5e5946-b014-484c-b404-3e9bd4389378">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 49"
src="https://github.com/hail-is/hail/assets/106194/5e2e4666-3bac-4560-a831-4e2ea05de0ae">
<img width="2032" alt="Screenshot 2023-11-07 at 12 24 55"
src="https://github.com/hail-is/hail/assets/106194/5f103ee1-a168-47ca-a5b2-f1385d4deac9">

## Phone
<img width="2032" alt="Screenshot 2023-11-07 at 12 25 21"
src="https://github.com/hail-is/hail/assets/106194/087b638c-e6f8-4633-9639-9f188b6b2e57">
<img width="2032" alt="Screenshot 2023-11-07 at 12 25 23"
src="https://github.com/hail-is/hail/assets/106194/cba530ea-d75c-4609-8307-16b3096a0e8c">

With the navbar open on mobile, it looks the same as the non-docs
pages.
`org.apache.commons.lang` is from the `commons-lang` library, but in
`build.gradle` we explicitly depend on `commons-lang3`, which puts
everything under the `lang3` package. We must be picking up
`commons-lang` as a transitive dependency, but we no longer get it with
Spark 3.4. Regardless, it's better to use what we explicitly depend on.
Removes any occurrences of async / sync / async nesting in the code, i.e.
a coroutine should not, somewhere deep down, make a synchronous call
that blocks on the completion of an async task.

---------

Co-authored-by: Dan King <dking@broadinstitute.org>
…ail-is#13977)

This PR accounts for crun specifying memory requirements differently
under cgroups v2 than under cgroups v1. Should fix
hail-is#13902.
In particular, we need to incorporate and test hail-is#13977
as the proposed fix for jobs becoming unresponsive due to being
targeted by the kernel's OOM-killer.

(Our local gcsfuse repo workaround is replaced by upstream's.)

@illusional illusional left a comment


Yep, I thought that gcsfuse might come up as a conflict. Surprised how many commits there were to get up to date; Hail is a productive team!

@jmarshall (Author)

Successful dev deploy: https://ci.hail.populationgenomics.org.au/batches/429671

@jmarshall (Author)

Due to 2e536ff we also need to generate a new batch-worker-15 boot disk image, which has been done.

@jmarshall jmarshall merged commit f36c781 into main Nov 10, 2023
5 checks passed
@jmarshall jmarshall deleted the upstream-126+oom branch November 10, 2023 00:56