Merge upstream HEAD(bea04d9, 2024-05-21) [release] 0.2.130 (#14454) #335

Merged: 199 commits into main on May 24, 2024

Conversation

milo-hyben

In the end there was just one conflict.

iris-garden and others added 30 commits January 3, 2024 14:02
Oh Chrome,

How do I call thee? Let me count the ways.
I've already deployed to the cluster to test. Unless master deploys, you
can see it here:
https://grafana.azure.hail.is/datasources/edit/kgzY5Io7k

I also fixed the grafana Makefile.
…4122)

Co-authored-by: Edmund Higham <ehigham@users.noreply.github.com>
See hail-is#14120, which will add the
resultant commit from this PR to `.git-blame-ignore-revs` once this PR
is squashed and merged.
…l-is#13617)

Add staged code generation capabilities by lowering to and emitting
`StreamLeftIntervalJoin`.
The `EmitStream` rule for this node uses a code-generated min heap* of
intervals that contain the current key in the left stream, ordered by
right endpoint. 
The row type of the join result is the left row with an array of
values**
from the right table, ordered by their associated interval's right
endpoint.

Note that this implementation is by no means optimised. There are a
number of
opportunities that I'd like to consider in subsequent changes,
including:
- Laziness in restoring the heap property on push/pop from heap.
- Not deep-copying elements of the right stream when no explicit memory
  management per element is required.


\* This min heap is code generated instead of using an off-the-shelf
implementation because:
- we don't yet have a mapping from `SType` to a Java class or interface to
  parameterise the heap with
- it's not obvious how to handle region memory management in an existing
  solution

\** The value is a row with the key fields omitted.
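
As a rough illustration only (the actual implementation is code-generated Scala inside `EmitStream`), the heap behaviour might be sketched in Python like this; the function name and the closed-on-both-ends interval semantics are assumptions for the sake of the example:

```python
import heapq

def left_interval_join(left_keys, intervals):
    # Illustrative sketch: for each left key (assumed sorted), yield the
    # values of all intervals containing it, ordered by right endpoint.
    # Intervals are (start, end, value) and treated as closed on both ends.
    pending = iter(sorted(intervals))
    nxt = next(pending, None)
    heap = []  # min heap keyed on right endpoint
    for key in left_keys:
        # push every interval whose start has been reached
        while nxt is not None and nxt[0] <= key:
            start, end, value = nxt
            heapq.heappush(heap, (end, value))
            nxt = next(pending, None)
        # pop intervals that no longer contain the current key
        while heap and heap[0][0] < key:
            heapq.heappop(heap)
        yield key, [v for _, v in sorted(heap)]

print(list(left_interval_join([1, 3, 5], [(0, 4, 'a'), (2, 6, 'b')])))
# [(1, ['a']), (3, ['a', 'b']), (5, ['b'])]
```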

Additional Changes:
- Make `this` an explicit argument to `CodeBuilder.invoke()`
  - Gives us control over which object we're dispatching on.
  - Useful for generating more self-contained classes; previously it assumed
    all methods were defined in the same class.
- Removed `referenceGenomes` from `ExecuteContext`; delegate to `Backend`
  - In practice these were being populated from the backend object anyway.
  - The backend is mutable, meaning we can add/remove fake genomes for testing.
Before this change, testing just the n_partitions method of Table takes
nearly a minute, 52s of which is spent in "setup". Admittedly, this
setup is shared across multiple tests, but this is an unacceptable
burden for iterating on one method.

```
52.30s setup    hail/table.py::hail.table.Table.n_partitions
3.07s call     hail/table.py::hail.table.Table.n_partitions
```

After this change, the setup time significantly reduces. The call gets
slower, presumably because the JVM is not warm. I think the setup time
is now dominated by Hail JVM initialization.

```
11.77s call     hail/table.py::hail.table.Table.n_partitions
9.68s setup    hail/table.py::hail.table.Table.n_partitions
```
This reduces the practical runtime of this test by 50%.

This commit adds 72kB to the repository:

```
$ git diff-tree -r -c -M -C --no-commit-id HEAD | awk '{print $4}' | git cat-file --batch-check | awk 'BEGIN {s=0} {s+= $3} END {print s}'
72998
```
We do not yet have pyright running on the Hail Python package or the
tests. Without this change, I get failures on push.
Currently, the k8s namespace field is used both for routing internal
requests inside Kubernetes and for routing external requests over the internet.
It also has special logic based on whether the namespace indicates a
production or dev environment. For example, if `namespace == 'default'`,
then we route external `batch` requests to `batch.<domain>/path`, but if
`namespace == foo_dev`, we route external `batch` requests to
`internal.<domain>/foo_dev/path`.

This PR decouples the namespace field from routing. Aside from being
more straightforward overall in my opinion, this is necessary for Batch
on Azure Terra, where Batch is served out of a subpath it does not
control, one that is unrelated to whatever namespace it might reside in. The
guiding principle for routing is then as follows: if the config has no
subpath, use a subdomain; otherwise, put everything under domain +
subpath.

For example:
- `{'domain': 'hail.is', 'subpath': null}` => `batch.hail.is`
- `{'domain': 'internal.hail.is', 'subpath': '/foo_dev'}` =>
`internal.hail.is/foo_dev/batch`

Since the CI pipeline runs on current production instances, there is a
minor need to stay compatible with old deploy configs (or else hack up
the CI build.yaml). It's quite a simple translation though, because if
there is no subpath provided we can infer one based on the
`default_namespace`.
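
A minimal sketch of that routing rule (the `service_url` helper name and the https scheme are illustrative assumptions; the config shape follows the examples above):

```python
def service_url(config, service):
    # Illustrative routing rule: no subpath => subdomain;
    # otherwise everything lives under domain + subpath.
    domain, subpath = config['domain'], config.get('subpath')
    if subpath is None:
        return f'https://{service}.{domain}'
    return f'https://{domain}{subpath}/{service}'

assert service_url({'domain': 'hail.is', 'subpath': None}, 'batch') == 'https://batch.hail.is'
assert (service_url({'domain': 'internal.hail.is', 'subpath': '/foo_dev'}, 'batch')
        == 'https://internal.hail.is/foo_dev/batch')
```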
Split from hail-is#14103
Co-authored by @patrick-schultz.

This change includes some edits to prevent build failures caused by
using the Scala formatter.

---------

Co-authored-by: patrick-schultz <pschultz@broadinstitute.org>
Run scalafmt on compiler Scala sources.
Split from hail-is#14103.
Co-authored by @patrick-schultz.
…op (hail-is#14097)

Fixes hail-is#14130.

We pervasively assume:
1. That our entire system is used within a single Python thread.
2. That once an event loop is created, it is the only event loop that will
ever exist.

Pytest (and newer versions of IPython, afaict) violate this pretty
liberally.

~~pytest_asyncio has [explicit instructions on how to run every test in
the same event
loop](https://pytest-asyncio.readthedocs.io/en/latest/how-to-guides/run_session_tests_in_same_loop.html).
I've implemented those here.~~ [These instructions don't
work](pytest-dev/pytest-asyncio#744). It seems
that the reliable way to ensure we're using one event loop everywhere is
to use pytest-asyncio < 0.23 and to define an event_loop fixture with
scope `'session'`.

I also switched test_batch.py into pytest-only style. This allows me to
use session-scoped fixtures so that they exist exactly once for the
entire test suite execution.
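
For reference, the pattern being relied on (pytest-asyncio < 0.23 plus an overridden `event_loop` fixture in `conftest.py`) looks roughly like this; this is a sketch of the well-known fixture shape, not a copy of the code in this change:

```python
# conftest.py
import asyncio
import pytest

@pytest.fixture(scope='session')
def event_loop():
    # pytest-asyncio < 0.23 runs every async test on the loop provided by
    # this fixture, so a session-scoped override gives one loop everywhere.
    loop = asyncio.new_event_loop()
    yield loop
    loop.close()
```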

Also:
- `RouterAsyncFS` methods must be either static methods or async
methods. We must not create an FS in a sync method. Neither `parse_url` nor
`copy_part_size` allocates an FS anymore.
- `httpx.py` now eagerly errors if the running event loop in `request`
differs from that at allocation time. Annoying but much better error
message than this nonsense about timeout context managers.
- `hail_event_loop` either gets the current thread's event loop (running
or not, doesn't matter to us) or creates a fresh event loop and sets it
as the current thread's event loop (a sketch follows after this list). The
previous code didn't guarantee we'd get an event loop because
`get_event_loop` fails if `set_event_loop` was previously called.
- `conftest.py` is inherited downward, so I lifted fixtures out of
test_copy.py and friends and into a common `hailtop/conftest.py`
- I added `make -C hail pytest-inter-cloud` for testing the inter cloud
directory. You still need appropriate permissions and authn.
- I removed extraneous pytest.mark.asyncio since we use auto mode
everywhere.
- `FailureInjectingClientSession` creates an `aiohttp.ClientSession` and
therefore must be used while an event loop is running. Easiest fix was
to make the test async.
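
The `hail_event_loop` behaviour described above can be sketched with nothing but the standard library; this is an approximation, not the hailtop implementation:

```python
import asyncio

def hail_event_loop() -> asyncio.AbstractEventLoop:
    # Return the current thread's loop if it has a usable one, otherwise
    # create a fresh loop and install it as the thread's event loop.
    try:
        return asyncio.get_event_loop()
    except RuntimeError:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        return loop
```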
CHANGELOG: Fixed bugs in the identity by descent implementation for
Query on Batch

This PR fixes hail-is#14052. There were two bugs in how we compute IBD. In
addition, the tests weren't running in QoB, and the test dataset we were
using doesn't have enough variability to catch errors. I used
Balding-Nichols-generated data instead. Do we need to set the seed in the
tests here?
Plotly relies on behavior that Pandas has deprecated. See
plotly/plotly.py#4363 and
plotly/plotly.py#4379. The fix
(plotly/plotly.py#4379) was merged into [main and
released in
5.18.0](plotly/plotly.py@57e4d1d).
Here's a diff of `hailctl dataproc start foo --dry-run` on main and on
this branch. Notice that the properties and metadata arguments gain a
leading and trailing single quote. This ensures that things like
`sys_platform!="win32"` are properly transmitted.

In `start.py` we just use exec-style invocation, so there's no
equivalent issue.

```
7c7
<     --properties=^|||^spark:spark.task.maxFailures=20|||spark:spark.driver.extraJavaOptions=-Xss4M|||spark:spark.executor.extraJavaOptions=-Xss4M|||spark:spark.speculation=true|||hdfs:dfs.replication=1|||dataproc:dataproc.logging.stackdriver.enable=false|||dataproc:dataproc.monitoring.stackdriver.enable=false|||spark:spark.driver.memory=36g|||yarn:yarn.nodemanager.resource.memory-mb=29184|||yarn:yarn.scheduler.maximum-allocation-mb=14592|||spark:spark.executor.cores=4|||spark:spark.executor.memory=5837m|||spark:spark.executor.memoryOverhead=8755m|||spark:spark.memory.storageFraction=0.2|||spark:spark.executorEnv.HAIL_WORKER_OFF_HEAP_MEMORY_PER_CORE_MB=3648 \
---
>     '--properties=^|||^spark:spark.task.maxFailures=20|||spark:spark.driver.extraJavaOptions=-Xss4M|||spark:spark.executor.extraJavaOptions=-Xss4M|||spark:spark.speculation=true|||hdfs:dfs.replication=1|||dataproc:dataproc.logging.stackdriver.enable=false|||dataproc:dataproc.monitoring.stackdriver.enable=false|||spark:spark.driver.memory=36g|||yarn:yarn.nodemanager.resource.memory-mb=29184|||yarn:yarn.scheduler.maximum-allocation-mb=14592|||spark:spark.executor.cores=4|||spark:spark.executor.memory=5837m|||spark:spark.executor.memoryOverhead=8755m|||spark:spark.memory.storageFraction=0.2|||spark:spark.executorEnv.HAIL_WORKER_OFF_HEAP_MEMORY_PER_CORE_MB=3648' \
9c9
<     --metadata=^|||^WHEEL=gs://hail-30-day/hailctl/dataproc/dking-dev/0.2.126-a51eabd65859/hail-0.2.126-py3-none-any.whl|||PKGS=aiodns==2.0.0|aiohttp==3.9.1|aiosignal==1.3.1|async-timeout==4.0.3|attrs==23.1.0|avro==1.11.3|azure-common==1.1.28|azure-core==1.29.5|azure-identity==1.15.0|azure-mgmt-core==1.4.0|azure-mgmt-storage==20.1.0|azure-storage-blob==12.19.0|bokeh==3.3.1|boto3==1.33.1|botocore==1.33.1|cachetools==5.3.2|certifi==2023.11.17|cffi==1.16.0|charset-normalizer==3.3.2|click==8.1.7|commonmark==0.9.1|contourpy==1.2.0|cryptography==41.0.7|decorator==4.4.2|deprecated==1.2.14|dill==0.3.7|frozenlist==1.4.0|google-auth==2.23.4|google-auth-oauthlib==0.8.0|humanize==1.1.0|idna==3.6|isodate==0.6.1|janus==1.0.0|jinja2==3.1.2|jmespath==1.0.1|jproperties==2.1.1|markupsafe==2.1.3|msal==1.25.0|msal-extensions==1.0.0|msrest==0.7.1|multidict==6.0.4|nest-asyncio==1.5.8|numpy==1.26.2|oauthlib==3.2.2|orjson==3.9.10|packaging==23.2|pandas==2.1.3|parsimonious==0.10.0|pillow==10.1.0|plotly==5.18.0|portalocker==2.8.2|protobuf==3.20.2|py4j==0.10.9.5|pyasn1==0.5.1|pyasn1-modules==0.3.0|pycares==4.4.0|pycparser==2.21|pygments==2.17.2|pyjwt[crypto]==2.8.0|python-dateutil==2.8.2|python-json-logger==2.0.7|pytz==2023.3.post1|pyyaml==6.0.1|regex==2023.10.3|requests==2.31.0|requests-oauthlib==1.3.1|rich==12.6.0|rsa==4.9|s3transfer==0.8.0|scipy==1.11.4|six==1.16.0|sortedcontainers==2.4.0|tabulate==0.9.0|tenacity==8.2.3|tornado==6.3.3|typer==0.9.0|typing-extensions==4.8.0|tzdata==2023.3|urllib3==1.26.18|uvloop==0.19.0;sys_platform!="win32"|wrapt==1.16.0|xyzservices==2023.10.1|yarl==1.9.3 \
---
>     '--metadata=^|||^WHEEL=gs://hail-30-day/hailctl/dataproc/dking-dev/0.2.126-a51eabd65859/hail-0.2.126-py3-none-any.whl|||PKGS=aiodns==2.0.0|aiohttp==3.9.1|aiosignal==1.3.1|async-timeout==4.0.3|attrs==23.1.0|avro==1.11.3|azure-common==1.1.28|azure-core==1.29.5|azure-identity==1.15.0|azure-mgmt-core==1.4.0|azure-mgmt-storage==20.1.0|azure-storage-blob==12.19.0|bokeh==3.3.1|boto3==1.33.1|botocore==1.33.1|cachetools==5.3.2|certifi==2023.11.17|cffi==1.16.0|charset-normalizer==3.3.2|click==8.1.7|commonmark==0.9.1|contourpy==1.2.0|cryptography==41.0.7|decorator==4.4.2|deprecated==1.2.14|dill==0.3.7|frozenlist==1.4.0|google-auth==2.23.4|google-auth-oauthlib==0.8.0|humanize==1.1.0|idna==3.6|isodate==0.6.1|janus==1.0.0|jinja2==3.1.2|jmespath==1.0.1|jproperties==2.1.1|markupsafe==2.1.3|msal==1.25.0|msal-extensions==1.0.0|msrest==0.7.1|multidict==6.0.4|nest-asyncio==1.5.8|numpy==1.26.2|oauthlib==3.2.2|orjson==3.9.10|packaging==23.2|pandas==2.1.3|parsimonious==0.10.0|pillow==10.1.0|plotly==5.18.0|portalocker==2.8.2|protobuf==3.20.2|py4j==0.10.9.5|pyasn1==0.5.1|pyasn1-modules==0.3.0|pycares==4.4.0|pycparser==2.21|pygments==2.17.2|pyjwt[crypto]==2.8.0|python-dateutil==2.8.2|python-json-logger==2.0.7|pytz==2023.3.post1|pyyaml==6.0.1|regex==2023.10.3|requests==2.31.0|requests-oauthlib==1.3.1|rich==12.6.0|rsa==4.9|s3transfer==0.8.0|scipy==1.11.4|six==1.16.0|sortedcontainers==2.4.0|tabulate==0.9.0|tenacity==8.2.3|tornado==6.3.3|typer==0.9.0|typing-extensions==4.8.0|tzdata==2023.3|urllib3==1.26.18|uvloop==0.19.0;sys_platform!="win32"|wrapt==1.16.0|xyzservices==2023.10.1|yarl==1.9.3' \
```
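
One way to produce that quoting (a sketch, not necessarily what the PR does internally) is to pass each printed argument through `shlex.quote`; the example arguments below are abbreviated:

```python
import shlex

args = [
    '--properties=^|||^spark:spark.task.maxFailures=20',
    'PKGS=uvloop==0.19.0;sys_platform!="win32"',
]
# shlex.quote leaves plain tokens alone and single-quotes anything the
# shell would otherwise mangle (|, ;, ", ...).
print(' \\\n    '.join(shlex.quote(a) for a in args))
```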
As written:

- `make -C hail doctest` runs the normal tests.
- `test_python_docs` only tests the RST files (e.g. [see this
deploy](https://batch.hail.is/batches/8097884/jobs/127))
CHANGELOG: `hailctl dataproc` now creates clusters using Dataproc
version 2.1.33. It previously used version 2.1.2.

Dataproc verifications:
1. Memory available after startup: `mem avail: 42959 of 52223 MiB`. OK.
2. SparkMonitor still shows up.
3. Native code (`hl.identity_by_descent`) still works.


The Dataproc version list only keeps the most recent five, sadly. Luckily
for us, archive.org happens to have a capture with 2.1.2. I also
requested a capture of the current state of the page.

- https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.1

- 2023-03-07:
https://web.archive.org/web/20230307225815/https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.1

- 2023-12-12:
https://web.archive.org/web/20231212001716/https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.1

| Component | 2.1.33-debian11 | 2.1.2-debian11 |
| --- | --- | --- |
| Apache Atlas | 2.2.0 | 2.2.0 |
| Apache Flink | 1.15.4 | 1.15.0 |
| Apache Hadoop | 3.3.6 | 3.3.3 |
| Apache Hive | 3.1.3 | 3.1.3 |
| Apache Hive WebHCat | 3.1.3 | 3.1.3 |
| Apache Hudi | 0.12.3 | 0.12.0 |
| Apache Kafka | 3.1.0 | 3.1.0 |
| Apache Pig | 0.18.0-SNAPSHOT | 0.18.0-SNAPSHOT |
| Apache Spark | 3.3.2 | 3.3.0 |
| Apache Sqoop v1 | 1.5.0-SNAPSHOT | 1.5.0-SNAPSHOT |
| Apache Sqoop v2 | N/A | 1.99.6 |
| Apache Tez | N/A | 0.10.1 |
| Cloud Storage Connector | hadoop3-2.2.18 | hadoop3-2.2.9 |
| Conscrypt | 2.5.2 | 2.5.2 |
| Docker | 20.10 | 20.10 |
| Hue | 4.10.0 | 4.10.0 |
| Java | 11 | 11 |
| JupyterLab Notebook | 3.4 | 3.4 |
| Oozie | 5.2.1 | 5.2.1 |
| Trino | 376 | 376 |
| Python | Python 3.10 | Python 3.10 |
| R | R 4.1 | R 4.1 |
| Ranger | 2.2.0 | 2.2.0 |
| Scala | 2.12.18 | 2.12.14 |
| Solr | 9.0.0 | 9.0.0 |
| Zeppelin Notebook | 0.10.1 | 0.10.1 |
| Zookeeper | 3.8.3 | 3.8.0 |

Here's a diff. I think the things that may affect us are:

1. Scala patch version
2. Cloud Storage Connector patch version (this is the motivation for
this change)
3. Spark patch version
4. Total daemon memory use at start-up.

```diff
diff -u /tmp/b /tmp/a
--- /tmp/b	2023-12-11 19:20:34
+++ /tmp/a	2023-12-11 19:22:42
@@ -1,17 +1,17 @@
-Component	2.1.2-debian11
+Component	2.1.33-debian11
 Apache Atlas	2.2.0
-Apache Flink	1.15.0
-Apache Hadoop	3.3.3
+Apache Flink	1.15.4
+Apache Hadoop	3.3.6
 Apache Hive	3.1.3
 Apache Hive WebHCat	3.1.3
-Apache Hudi	0.12.0
+Apache Hudi	0.12.3
 Apache Kafka	3.1.0
 Apache Pig	0.18.0-SNAPSHOT
-Apache Spark	3.3.0
+Apache Spark	3.3.2
 Apache Sqoop v1	1.5.0-SNAPSHOT
-Apache Sqoop v2	1.99.6
-Apache Tez	0.10.1
-Cloud Storage Connector	hadoop3-2.2.9
+Apache Sqoop v2	N/A
+Apache Tez	N/A
+Cloud Storage Connector	hadoop3-2.2.18
 Conscrypt	2.5.2
 Docker	20.10
 Hue	4.10.0
@@ -22,7 +22,7 @@
 Python	Python 3.10
 R	R 4.1
 Ranger	2.2.0
-Scala	2.12.14
+Scala	2.12.18
 Solr	9.0.0
 Zeppelin Notebook	0.10.1
-Zookeeper	3.8.0
\ No newline at end of file
+Zookeeper	3.8.3
```
CHANGELOG: Hail Query-on-Batch previously used Class A Operations for
all interaction with blobs. This change ensures that QoB only uses Class
A Operations when necessary.

Inspired by @jigold's file system improvement campaign, I pursued the
avoidance of "list" operations. I anticipate this reduces flakiness in
Azure (tracked in hail-is#13351) and cost in Azure.

I enforced aiotools.fs terminology on hail.fs and Scala:

1. `FileStatus`. Metadata about a blob or file. It does not know if a
directory exists at this path.

2. `FileListEntry`. Metadata from a list operation. It knows if a
directory exists at this path.

Variable names were updated to reflect this distinction:

1. `fileStatus` / `fileStatuses`

2. `fle`/ `fles` / `fileListEntry` / `fileListEntries`, respectively.

`listStatus` renamed to `listDirectory` for clarity.

In both Azure and Google, `fileStatus` does not use a list operation.
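
The two notions can be sketched as Python protocols; this is illustrative only, and the method names are assumptions rather than the exact `hailtop.aiotools.fs` API:

```python
from typing import Protocol

class FileStatus(Protocol):
    # Metadata from a get/head on a single blob; cannot say whether a
    # directory also exists at this path.
    def basename(self) -> str: ...
    async def size(self) -> int: ...

class FileListEntry(Protocol):
    # Metadata from a list operation; can distinguish a file from a
    # directory (or a path that is both, as blob stores allow).
    def basename(self) -> str: ...
    async def is_file(self) -> bool: ...
    async def is_dir(self) -> bool: ...
```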

`fileListEntry` can be used when we must know if a directory exists. I
just rewrote this from first principles because:
1. In neither Google nor Azure did it check if the path was both a directory
and a file.
2. In Google, if the directory entry wasn't in the first page, it would
fail (NB: there are fifteen non-control characters in ASCII before `/`;
if the page size is 15 or fewer, we'd miss the first entry ending with
`/`).
3. In Azure, we issued both a get and a list.

There are now unit tests for this method.

---

1. `copyMerge` and `concatenateFiles` previously used `O(N_FILES)` list
operations; they now use `O(N_FILES)` get operations.
2. Writers that used `exists` to check for a _SUCCESS file now use a get
operation.
3. Index readers, import BGEN, and import plink all now check file size
with a get operation.

That said, overall, the bulk of our Class A Operations are probably
writes.
…#13812)

The bug in hail-is#13785 was that we used a `relpath`, but the files to be
included were in the parent directory of the current working directory
of the script. To solve this, I added a new option `--mounts` that uses
the Docker `-v src:dest` syntax to specify where the contents of the
source should be mounted into the container.


Fixes hail-is#13785
This change removes shim credential classes from the batch worker code
and attempts to consolidate credential handling in the worker API
classes.
…ts (hail-is#14105)

Fixes hail-is#13346.

Another user was confused by this:
hail-is#14102. Unfortunately, the world
appears to have embraced missing values in VCF array fields even though
the single element case is ambiguous. In hail-is#13346, I proposed a scheme by
which we can disambiguate many of the cases, but implementing it ran
into challenges because LoadVCF.scala does not expose whether or not an
INFO field was a literal "." or elided entirely from that line.

Anyway, this error message actually points users to the fix. I also
changed some method names so that every method name uses ArrayType and
never TypeArray.
`ruff` is faster than `black` when formatting Python files, so this
change replaces the formatting checks for our files with calls to `ruff`
instead of `black`, and runs `ruff format` on all files that would cause
those checks to fail.
CHANGELOG: Use indexed VEP cache files for GRCh38 on both dataproc and
QoB.

Fixes hail-is#13989

In this PR, I did the following:
1. Installed samtools into the Docker image to get rid of errors in the
log output
2. Added the `--merged` flag so that VEP will use the directory
`homo_sapiens_merged` for the cache

Outstanding Issues:
1. The FASTA files that are in `homo_sapiens/` were not present in the
merged dataset. Do we keep both the `homo_sapiens` and
`homo_sapiens_merged/` directories in our bucket or do we transfer the
FASTA files to the merged directory?
2. Once we decide the answer to (1), then I can fix this in dataproc.
The easiest thing to do is to add the tar file with the `_merged` data
to the dataproc vep folders and use the `--merged` flag. However, that
will double the startup time for VEP on a worker node in dataproc.

Before:
<img width="617" alt="Screenshot 2023-12-05 at 12 42 16 PM"
src="https://github.com/hail-is/hail/assets/1693348/bee7fff5-782c-4f19-aa88-26383ed386b7">


After:
<img width="619" alt="Screenshot 2023-12-05 at 12 46 30 PM"
src="https://github.com/hail-is/hail/assets/1693348/3d731759-6c69-4f1c-9c73-92bfb05c239a">
…ail-is#14139)

The term `name` has long been ambiguous and, indeed, our Azure
implementation differed from S3 and Google. Now both FileStatus and
FileListEntry use the term `basename`. Azure has been changed to have
the same semantics as S3 and Google.

---------

Co-authored-by: jigold <jigold@users.noreply.github.com>
…l-is#14151)

This provides useful feedback we can use to understand why
`test_file_in_current_dir` keeps failing.
ehigham and others added 23 commits April 3, 2024 18:56
This test times out occasionally, as far as we can tell due to
unavailability of g2 machines. We've determined that for the time being
we will allow the test to time out so it does not interfere with the CI
pipeline.
Decided to tackle this as it was causing noise in the logs. There is no
reason this read-only select query should lock any rows.
Co-authored-by: Daniel Goldstein <danielgold95@gmail.com>
I sometimes find the term "Cost" on this page confusing, and I think its
inferred meaning differs depending on who's reading the page. This page
shows what users are spending on the Batch service, not what it is
*costing* operators to run the service (since we charge a service fee).
Closes hail-is#14210.

Shows whether a Job is "Always Run" on the Batch page:

<img width="1044" alt="Screenshot 2024-03-28 at 14 10 04"
src="https://github.com/hail-is/hail/assets/84595986/ec0dd8aa-6f58-4b84-a4f5-1ead8a98f13a">

and on the Job page:

<img width="281" alt="Screenshot 2024-03-28 at 14 10 11"
src="https://github.com/hail-is/hail/assets/84595986/b2fbe767-35a9-4ff3-8b9d-c50be7cf4408">
This PR refactors the `Binds.scala` file, which encodes the binding
structure of the IR.

The binding structure controls what variable bindings are in scope in
each location of the IR. The encoding specifies, for each node `x` and
each child `c` of `x`, how does the environment (the set of bindings
which are in scope, and their types) of 'x' differ from the environment
of `c`. For instance, `c`'s environment might add a new bindings (e.g.
in the body of a `StreamMap`), or it might drop all bindings, isolating
the child, so that it cannot refer to anything bound outside of `x`
(e.g. all relational nodes).

The binding structure is complicated by the separation of the
environment into "eval", "agg", "scan", and "relational" scopes. The
child `c`'s environment might add new bindings in any of these scopes,
and it might modify the parent's environment in other ways, like
replacing the "eval" scope with the "agg" scope.

In `Binds.scala`, the encoding of this relationship between parent's and
child's environments was split across many methods, separately encoding
bindings added to each of the scopes, and a prior modification to the
parent scope. This made it difficult to understand the binding behavior
of any particular node. And for nodes with more complicated binding
behavior, these separate encodings became confusingly entangled.

In this PR, `Binds.scala` is rewritten so that the binding behavior of a
node is specified in a single place. Instead of computing a list of new
bindings, it simply takes the parent's environment and returns the
child's.

Some use cases do require knowing what new bindings are added by a node
in a child's environment. To support this, I made the
`BindingEnvironment` interface a trait, and added another implementor
`SegregatedBindingEnv`, which wraps two separate `BindingEnvironment`s,
and puts all new bindings into the second. This satisfies the invariant
that for any function `f(GenericBindingEnv): GenericBindingEnv` which
modifies an environment using only the `GenericBindingEnv` interface,
and any `env: BindingEnv[T]`,
```
f(env) == f(SegregatedBindingEnv(env)).unified
```
i.e. the segregation doesn't have any effect, other than keeping track
of which bindings are new.
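
A toy Python stand-in for that invariant (the real types are Scala traits; the class and method names below are simplified illustrations, not the actual IR types):

```python
class Env:
    def __init__(self, bindings=None):
        self.bindings = dict(bindings or {})

    def bind(self, name, typ):
        return Env({**self.bindings, name: typ})


class SegregatedEnv:
    # Wraps two environments; every new binding goes into `new`, so callers
    # can recover exactly the bindings a node introduced.
    def __init__(self, old, new=None):
        self.old, self.new = old, new if new is not None else Env()

    def bind(self, name, typ):
        return SegregatedEnv(self.old, self.new.bind(name, typ))

    @property
    def unified(self):
        return Env({**self.old.bindings, **self.new.bindings})


env = Env({'x': 'int32'})
f = lambda e: e.bind('y', 'float64')
assert f(env).bindings == f(SegregatedEnv(env)).unified.bindings
```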

Now that all binding structure for a node is encoded in one place, a
future change could move this encoding to live in the node class
directly.
Adds `copy_spark_log_on_error` init configuration option. When true,
driver logs are copied to the remote tmpdir if an error occurs. This is
useful in support cases where users cannot copy logs off the dataproc
server themselves as it has already shut down.
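
Presumably used along these lines; the keyword name is taken from the commit description, and the tmpdir path and exact `init` signature are placeholders rather than verified API:

```python
import hail as hl

# If the driver hits an error, copy the Spark driver log to the remote tmpdir.
hl.init(
    tmp_dir='gs://my-bucket/tmp',
    copy_spark_log_on_error=True,
)
```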
~Stacked on hail-is#14404~

This set of changes started out as refactoring and trying to better
understand `extract.scala`. But the bigger change ended up being getting
rid of AggLet in favor of allowing bindings in `Let` to be agg/scan
bindings.

So each binding in a `Let` now has a binding scope, encoded using the
pre-existing `Scope` enum. I also renamed `Let` to `Block`. While I
think that's a better name for what `Let` has become, and what I hope it
will further become in the future, I initially tried to keep the name
`Let`. I changed it so that I could keep `Let.apply` the way it is,
specialized to bindings all in the eval scope. We can change all the
uses of `Let.apply` separately; I didn't want to pollute this already
large PR with that.
In a89d64a, I modified `build.yaml` to release the wheel we had already
built and tested. Unbeknownst to me, we rebuild the wheel with a
different version of `hail/python/hailtop/hailctl/deploy.yaml`, and
releasing the version used for testing borked `hailctl dataproc`
commands.

To fix this, we'll rebuild the wheel but use the `jar` we've already
built and tested. This is safe to do as far as I know because we don't
bundle any information into the jar that depends on the make flag
`DEPLOY_REMOTE`.

Fixes: hail-is#14452
'duration': duration,
'cost': coalesce(record.get('cost'), 0),
'msec_mcpu': record['msec_mcpu'],
'cost_breakdown': cost_breakdown,
@milo-hyben (Author) commented on May 21, 2024
Conflict message: I have merged the incoming change with our version.

++<<<<<<< HEAD
 +            'cost_breakdown': cost_breakdown,
++||||||| 6a6c38d5a
++            'cost_breakdown': record['cost_breakdown'],
++=======
+             'cost_breakdown': record['cost_breakdown'],
+             'always_run': bool(record['always_run']),
+             'display_state': None,
++>>>>>>> bea04d9


Good one — a nice straight-forward one to resolve

@milo-hyben milo-hyben requested a review from jmarshall May 21, 2024 04:42
@milo-hyben milo-hyben marked this pull request as ready for review May 21, 2024 04:42
@milo-hyben milo-hyben merged commit 6f76e36 into main May 24, 2024
@milo-hyben milo-hyben deleted the upstream-14454 branch May 24, 2024 05:02