
Merge upstream HEAD (da6668b, 2024-01-02) for auth fix #324

Merged

merged 49 commits into main from upstream-newyear2024 on Jan 5, 2024

Conversation


@jmarshall jmarshall commented Jan 3, 2024

The recently cherry-picked CVE-2023-51663 fix (PR #323) included refactored code dependent on recent upstream code that was not cherry-picked. This caused:

{"severity":"ERROR","levelname":"ERROR","asctime":"2024-01-03 01:59:09,079","filename":"auth.py",
"funcNameAndLine":"callback:341","message":"oauth2 callback: could not fetch and verify token",
"exc_info":"Traceback (most recent call last):\n
File \"/usr/local/lib/python3.9/dist-packages/auth/auth.py\", line 335, in callback\n
    flow_client = request.app[AppKeys.FLOW_CLIENT]\n
NameError: name 'AppKeys' is not defined",
"hail_log":1}

Context: https://centrepopgen.slack.com/archives/C030X7WGFCL/p1704243108737559

This merges the current upstream HEAD to include the necessary AppKeys class (see auth/auth/auth.py), along with any unrelated fixes and improvements.

Resolved merge conflicts included letsencrypt/subdomains.txt, in which we omitted upstream's addition of guide-analysis, which is specific to their lab.

jigold and others added 30 commits November 9, 2023 15:28
…#13943)

Fixes hail-is#13815. I tried to simplify the concepts here so they are unified across
all instance collection types. I renamed "provisioned" to "live".
"Schedulable" means only workers that are active and from the latest
instance version.

I think the example figures are self-explanatory.

 
<img width="1594" alt="Screenshot 2023-10-30 at 11 33 39 AM"
src="https://github.com/hail-is/hail/assets/1693348/0c8f4d2e-019e-419e-86b6-12de510ac5a4">

<img width="1569" alt="Screenshot 2023-10-30 at 11 35 29 AM"
src="https://github.com/hail-is/hail/assets/1693348/131f4978-76c4-4a6e-9e72-ecf9f99c8d5e">
`asyncinit` is unused AFAICT, and the `frozenlist` requirement is already
inherited from hailtop (though it is not used in `hailtop` itself, only in query
code, so I am also happy to move it out of hailtop fully and into query).
Currently the binding structure is redundantly specified in two places:
Binds.scala, and the parser. We need the binding structure in the parser
to propagate the environment, so we can annotate `Ref` nodes (and a few
other things) with their types. But we can't use Binds.scala because we
don't yet have an IR.

This PR removes environment maintenance from the parser by deferring
type annotation to a separate pass (which is simple, because it can use
the Binds.scala infrastructure). One consequence is that we can't assign
types to nodes like `Ref` during parsing, which means we can't ask for
the type of any node during parsing, and by extension we can't ask for
types of children in IR node constructors. Instead, all typechecking
logic is moved to the `TypeCheck` pass.

Some benefits of this change:
* The parser is simpler, as it doesn't have to maintain a typing
environment.
* Binds.scala is now the single source of truth on the binding structure
of the IR.
* Instead of typechecking being split in an ad-hoc way between IR
constructors and the `TypeCheck` pass, all typechecking and type error
reporting logic is in one place.
* The parser parses a context-free grammar, no more and no less. If the
input is grammatically correct, the parser succeeds.
* We can round trip IR with type errors through the text representation.
For instance, if we log an IR that fails TypeCheck, we can copy the IR
from the log, parse it, then debug.

This change was motivated by my work in progress to convert the parser
to use the SSA grammar, which this should greatly simplify.

I chose to make the type annotation pass after parsing mutate the IR in
place (with the unfortunate exception of `Apply`, which can change into
an `ApplyIR` or `ApplySpecial`. Do these really need to be separate
nodes?). The type of a `Ref` node was already mutable to allow this sort
of deferred annotation, and I've had to make a few other things mutable
as well. Alternatively we could rebuild the entire IR to include type
annotations, but I think the mutation is sufficiently well localized,
and this is a common compiler pattern.

This touches a lot of lines, but the high level changes are:
* Remove the `refMap` from the `IRParserEnvironment`, and remove all
code that modifies the typing environment from the parser. Nodes like
`Ref` that need a type from the environment get types set to `null`, to
be filled in after parsing.
* Add `annotateTypes` pass to fill in type annotations from the
environment. This is currently written in the parser, and always called
after parsing. This means for the moment we can only parse type correct
IR. But in the future we could move this to a separate stage of the
compilation pipeline.
* Move typechecking logic on relational nodes from the constructors to a
`typecheck` method, which is called from the `TypeCheck` pass.
* Make the `typ` field on IR nodes consistently lazy (or a def when the
type is a child's type without modification). Before we sometimes did
this for performance, but it's now required to avoid querying children's
types during construction.
* Make types mutable on `AggSignature` and `ComparisonOp`, so they can
be filled in after parsing.
* Ensure that the structure in `Binds.scala` satisfies the following
invariant: to construct the typing environment of child `i`, we only
need the types of children with index less than `i`. This was almost
always satisfied already, and allows us to use the generic binds
infrastructure in the pass to annotate types (where, when visiting child
`i`, we can only query types of already visited children).
* Change the text representation of `TailLoop`/`Recur` to move the
explicit type annotation from `Recur` to `TailLoop`. This is necessary
to comply with the previous point. It's also consistent with `Ref`,
where types of references are inferred from the environment.
* Add explicit types in the text representation of
`TableFilterIntervals` and `MatrixFilterIntervals`, where the types were
needed during parsing and we can no longer get them from child types.
* Fix IR examples used in parser tests to be type correct.
* Add an explicit return type to the `Apply` node. Before the parser
parsed an `Apply` node to one of `Apply/ApplySpecial/ApplyIR`; now it
always parses to `Apply`, and the type annotation pass converts to the
appropriate specialization, which needs the parsed return type.
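
A toy sketch (in Python, not Hail's Scala IR; all names here are illustrative) of the deferred-annotation idea described above: the parser leaves `Ref` types unset, and a separate post-parse pass fills them in from the binding structure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class I32:
    x: int
    typ: str = "int32"

@dataclass
class Ref:
    name: str
    typ: Optional[str] = None   # unknown at parse time, filled in later

@dataclass
class Let:
    name: str
    value: object
    body: object

def annotate_types(node, env=None):
    # Post-parse pass: walk the IR, threading a typing environment derived
    # from the binding structure, and mutate Ref.typ in place.
    env = env or {}
    if isinstance(node, I32):
        return node.typ
    if isinstance(node, Ref):
        node.typ = env[node.name]
        return node.typ
    if isinstance(node, Let):
        value_typ = annotate_types(node.value, env)
        return annotate_types(node.body, {**env, node.name: value_typ})
    raise ValueError(f"unknown node {node!r}")

ir = Let("x", I32(5), Ref("x"))   # as if parsed from text, with Ref.typ unset
annotate_types(ir)
assert ir.body.typ == "int32"
```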
Fix bug where relational IR inside the condition of a TableFilter or
MatrixFilter causes a ClassCastException. This can happen if, for
example, there's a TableAggregate inside the condition.
Main change: add `var mark: Int` to `BaseIR`.
On profiling the benchmark `matrix_multi_write_nothing`, I noticed a
significant amount of time was spent
- iterating through zipped arrays in requiredness
- adding and removing elements from `HashSet`s

In fact, half the time spent in requiredness was removing IR nodes from
the `HashSet` used as the queue! With this change, requiredness runs
like a stabbed rat!

Explanation of `mark`:
This field acts as a flag that analyses can set. For example:
- `HasSharing` can use the field to see if it has visited a node before.
- `Requiredness` uses this field to tell if a node is currently
enqueued.

The `nextFlag` method in `IrMetadata` allows analyses to get a fresh
value to which they can set the `mark` field.
This removes the need to traverse the IR after analyses to re-zero every
`mark` field.
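
A minimal Python sketch of the marking pattern described above (Hail's actual implementation is the `var mark: Int` field on the Scala `BaseIR`; these names are illustrative):

```python
class Node:
    def __init__(self, *children):
        self.children = list(children)
        self.mark = 0          # analogous to `var mark: Int` on BaseIR

class IrMetadata:
    def __init__(self):
        self._flag = 0

    def next_flag(self) -> int:
        # Each analysis takes a fresh value, so stale marks left by earlier
        # analyses never collide and nothing has to be re-zeroed afterwards.
        self._flag += 1
        return self._flag

def count_distinct_nodes(root: Node, meta: IrMetadata) -> int:
    # e.g. a HasSharing-style analysis: `mark == flag` means "already visited".
    flag = meta.next_flag()
    stack, count = [root], 0
    while stack:
        node = stack.pop()
        if node.mark == flag:
            continue
        node.mark = flag
        count += 1
        stack.extend(node.children)
    return count

shared = Node()
root = Node(Node(shared), shared)   # `shared` appears twice but is one node
assert count_distinct_nodes(root, IrMetadata()) == 3
```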
On the Azure doc page
…ail-is#14022)

CHANGELOG: Fix hail-is#13937 caused by faulty library code in the Google Cloud
Storage API Java client library.
…#13981)

A very small PR, but here's the background and context behind this
change. When talking to either GCP or Azure, Hail chooses credentials in
the following order, from highest to lowest priority:

1. An explicit `credential_file` argument passed to the relevant
credentials class
2. An environment variable containing the path to the credentials
(`GOOGLE_APPLICATION_CREDENTIALS` or `AZURE_APPLICATION_CREDENTIALS`)
(from this you can see why the code that was here is totally redundant)
3. The latent credentials present on the machine. This might be `gcloud`
or `az` credentials, or the metadata server if you're on a cloud VM.
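
A minimal Python sketch of that resolution order (the function names are illustrative, not Hail's actual hailtop API):

```python
import os
from typing import Optional

def _from_file(path: str) -> dict:
    # Stand-in for parsing a service-account key file.
    return {'source': 'file', 'path': path}

def _latent_credentials() -> dict:
    # Stand-in for gcloud/az login state or the VM metadata server.
    return {'source': 'latent'}

def resolve_credentials(credentials_file: Optional[str] = None,
                        env_var: str = 'GOOGLE_APPLICATION_CREDENTIALS') -> dict:
    # 1. An explicit credentials_file argument has the highest priority.
    if credentials_file is not None:
        return _from_file(credentials_file)
    # 2. Next, an environment variable naming a key file.
    env_path = os.environ.get(env_var)
    if env_path is not None:
        return _from_file(env_path)
    # 3. Finally, fall back to the latent credentials on the machine.
    return _latent_credentials()
```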

I'm trying to rid the codebase of most explicit provision of credentials
file paths, for two reasons:
- Quality of life. I'm already signed into the cloud with `gcloud` and
`az`. I shouldn't need to download some file and provide
`AZURE_APPLICATION_CREDENTIALS` to run this test. It should just use the
latent credentials.
- We are trying to phase out credentials files altogether for security
reasons. These files are long-lived secrets that you really don't want
to leak and are currently exposed to users in Batch jobs, so they can be
easily exfiltrated. Using the latent credentials on a cloud VM (the
metadata server) has the benefit of only issuing short-lived access
tokens, which last for hours rather than months, so it's basically always better
to use the latent credentials when possible.
CHANGELOG: Deprecate the `default_reference` parameter to `hl.init`; users
should instead call `default_reference` with an argument to set a new
default reference, usually shortly after init.
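
A short sketch of the migration described in this changelog entry (assuming `default_reference` accepts the new default as an argument, as stated above):

```python
import hail as hl

# Deprecated: hl.init(default_reference='GRCh38')
hl.init()
hl.default_reference('GRCh38')  # set the new default shortly after init
```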

Resolves hail-is#13856

---------

Co-authored-by: Dan King <daniel.zidan.king@gmail.com>
This change allows `Let` nodes to bind multiple values. Serialisation is
backwards compatible, meaning no changes to existing Python code are
necessary.
This form of `Let` is preferable because it flattens deeply nested IRs,
which can help reduce the time and stack space needed to type-check.
An extreme example of this is the benchmark
[matrix_multi_write_nothing](https://github.com/hail-is/hail/blob/67801dfc66b504a7d49daa53f7ec6d22c1194585/benchmark/python/benchmark_hail/run/matrix_table_benchmarks.py#L369C10-L373),
which overflows the stack on type-checking without this change.
CHANGELOG: Hail supports identity_by_descent on Apple M1 and M2 chips;
however, your Java installation must be an arm64 installation. Using
x86_64 Java with Hail on Apple M1 or M2 will cause SIGILL errors. If you
have an Apple M1 or Apple M2 and `/usr/libexec/java_home -V` does not
include `(arm64)`, you must switch to an arm64 version of the JVM. Fixes
(hail#14000).

Fixes hail-is#14000

Hail has never supported its native functionality on Mac OS X Apple M1
chips. In particular, we only built x86_64 compatible dylibs. M1 chips
will try to simulate a very basic x86_64 ISA using Rosetta 2 but our
x86_64 dylibs expect the ISA of at least sandybridge, which includes
some SIMD instructions not supported by Rosetta 2.

This PR bifurcates our native build into x86_64 and arm64 targets which
live in build/x86_64 and build/arm64, respectively.

In Linux, this moves where the object files live, but should otherwise
have no effect. The test and benchmark targets use the "native" build
which always points at the x86_64 object files. The shared object
targets, LIBBOOT & LIBHAIL, explicitly depend on x86_64 because that is
the only linux architecture we support.

In OS X, we only test and benchmark the "native" build, which is
detected using `uname -m`. For the shared objects (the dylibs) we have
four new files: libboot and libhail for x86_64 and for arm64. Each pair
of files is placed in `darwin/x86_64/` and `darwin/arm64/`, respectively.
Those dylibs are never meant to escape the src/main/c world. The LIBBOOT
and LIBHAIL targets (which are invoked by hail/Makefile) combine the two
architecture-specific dylibs into a "universal" dylib. You can verify
this by running `file` on the dylibs. Here I run them on the new
"prebuilt" files which are in this PR:

```
(base) dking@wm28c-761 hail % file hail/prebuilt/lib/darwin/libboot.dylib
hail/prebuilt/lib/darwin/libboot.dylib: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit dynamically linked shared library x86_64] [arm64:Mach-O 64-bit dynamically linked shared library arm64]
hail/prebuilt/lib/darwin/libboot.dylib (for architecture x86_64):	Mach-O 64-bit dynamically linked shared library x86_64
hail/prebuilt/lib/darwin/libboot.dylib (for architecture arm64):	Mach-O 64-bit dynamically linked shared library arm64
(base) dking@wm28c-761 hail % file hail/prebuilt/lib/darwin/libhail.dylib
hail/prebuilt/lib/darwin/libhail.dylib: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit dynamically linked shared library x86_64] [arm64:Mach-O 64-bit dynamically linked shared library arm64]
hail/prebuilt/lib/darwin/libhail.dylib (for architecture x86_64):	Mach-O 64-bit dynamically linked shared library x86_64
hail/prebuilt/lib/darwin/libhail.dylib (for architecture arm64):	Mach-O 64-bit dynamically linked shared library arm64
```

@chrisvittal, I need you to test this new Makefile in Linux:

```
make -C hail pytest HAIL_COMPILE_NATIVES=1 PYTEST_ARGS='-k test_ibd_does_not_error_with_dummy_maf_float64'
```

I am fairly certain this Makefile will not work on old macbooks because
they cannot compile for Apple M1. This means that we always need someone
with an Apple M1 macbook to build the prebuilts.

*A MAJOR CAVEAT OF THIS CHANGE*: Users of Apple M1 macbooks must upgrade
to an arm64 JVM. They *must not* use Rosetta 2 to simulate an x86_64
JVM. In particular, this means that Apple M1 users must upgrade to JVM
11 because there are [no Adopt OpenJDK JVMs for
arm64](https://adoptium.net/temurin/releases/?version=8&os=mac).
…is#14025)

Resolves hail-is#14020.

I'll manually delete the scale up cron jobs from the dev namespaces once
this goes in.
It’s a random test, and it seems the current tolerance still allows rare
sporadic failures.
Stacked on hail-is#13475. This PR renames the following tables to have job
groups in the name instead of batches. Note that this PR needs to
shut down the batch deployment (offline migration). I'm not 100% sure
this is necessary, but I want to avoid a case where MJC of the database
migration job cannot happen, thus deadlocking the system.

```sql
RENAME TABLE batch_attributes TO job_group_attributes,
             batches_cancelled TO job_groups_cancelled,
             batches_inst_coll_staging TO job_groups_inst_coll_staging, 
             batch_inst_coll_cancellable_resources TO job_group_inst_coll_cancellable_resources, 
             aggregated_batch_resources_v2 TO aggregated_job_group_resources_v2,
             aggregated_batch_resources_v3 TO aggregated_job_group_resources_v3, 
             batches_n_jobs_in_complete_states TO job_groups_n_jobs_in_complete_states;
```
CHANGELOG: Fix hail-is#13998 which appeared in 0.2.58 and prevented reading
from a networked filesystem mounted within the filesystem of the worker
node for certain pipelines (those that did not trigger "lowering").

We use the IndexReader in `PartitionNativeIntervalReader`,
`PartitionNativeReaderIndexed`, and
`PartitionZippedIndexedNativeReader`.

1. `PartitionNativeIntervalReader` is only used by `query_table`.

2. `PartitionNativeReaderIndexed` is only used by
`IndexedRVDSpec2.readTableStage` which is used by `TableNativeReader`
when there is a new partitioner.

3. `PartitionZippedIndexedNativeReader` is only used by
`AbstractRVDSpec.readZippedLowered` when there is a new partitioner.

Case two is for tables; case three is for matrix tables. In `readZippedLowered` we
explicitly [drop the file
protocol](https://github.com/hail-is/hail/blob/1dedf3c63f9aabf1b6ce538165360056f82f76e4/hail/src/main/scala/is/hail/rvd/AbstractRVDSpec.scala#L154-L155):

```
        val absPathLeft = removeFileProtocol(pathLeft)
        val absPathRight = removeFileProtocol(pathRight)
```

We have done this, by various names, since this lowered code path was
added. I added `removeFileProtocol` because stripping the protocol in
Query-on-Batch prevented the reading and writing of gs:// URIs, the only
URIs I could read in QoB. `uriPath` (the function whose use I replaced
with `removeFileProtocol`) was added by Cotton [a very long time
ago](hail-is@92a9936).
It seems he added it so that he could use HDFS to generate a temporary
file path on the local filesystem but pass the file path to binary tools
that know nothing of HDFS and file:// URIs.

hail-is#9522 added the lowered code path and thus introduced this bug. It
attempted to mirror the extant code in
[`readIndexedPartitions`](https://github.com/hail-is/hail/blob/2b0aded9206849252b453dd80710cea8d2156793/hail/src/main/scala/is/hail/HailContext.scala#L421-L440)
which *does not* strip any protocols from the path.

This has gone undetected because we never try to read data through the
OS's filesystem. We always use gs://, Azure, or s3:// because we do not
test in environments that have a networked file system mounted in the
OS's filesystem. To replicate this bug (and add a test for it), we would
need a cluster with a lustre file system (or another networked
filesystem). This would be a fairly large lift. The fix is trivial: just
never intentionally strip the protocol!
We observed some sporadic failures, but the error was always less than
0.011. This raises the tolerance to 0.015, which should be a comfortable
margin.
Spark depends on a very old version of SLF4J. We cannot upgrade. We
added this dependency ages ago to fix some undocumented issue with
logging and SLF4J. It seems reasonable to me that we should just accept
whatever version of SLF4J Spark provides.

This removes this message:
```
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/lib/spark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
```

This message, IMO, really should be a stop-the-world error.
We only have `make` commands for running `pylint` on subdirectories that
have been kept up to date with its rules, but the `pylintrc` doesn't
actually indicate which subdirectories should be ignored when running
`pylint`. This makes using language servers that run `pylint` on the
currently open file frustrating, as files in the ignored subdirectories
will often be full of `pylint` suggestions. This change adds the
relevant subdirectories to the `pylintrc` file. Note that this does not
necessarily let us run `pylint` directly on those subdirectories with
`make` commands equivalent to the ones that already exist, because I
have found no way to make `pylint` ignore the `__init__.py` file of
whatever module it's being run on, so running it on `hail/python/hail`,
for example, produces many errors.
…ild.sh (hail-is#14064)

This wasn't updated when I separated out building and pushing, so the
`build` target wasn't pushing the image. When I fixed that, the cert
generation failed because `memory` is no longer available. These changes
got everything working again.
No need to accept anything addressed to a public IP
CHANGELOG: Since 0.2.110, `hailctl dataproc` set the heap size of the
driver JVM dangerously high. It is now set to an appropriate level. This
issue manifests in a variety of inscrutable ways, including
RemoteDisconnectedError and "socket closed" errors. See issue hail-is#13960 for details.

In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced
["memory
protection"](https://cloud.google.com/dataproc/docs/support/troubleshoot-oom-errors#memory_protection)
which is a euphemism for a newly aggressive OOMKiller. When the
OOMKiller kills the JVM driver process, there is no hs_err_pid...log
file, no exceptional log statements, and no clean shutdown of any
sockets. The process is simply SIGTERM'ed and then SIGKILL'ed.

From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was
pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, `hailctl dataproc`,
by default, reserves 80% of the advertised memory of the driver node for
the use of the Hail Query Driver JVM process. For example, Google
advertises that an n1-highmem-8 has 52 GiB of RAM, so Hail sets the
`spark:spark.driver.memory` property to 41g (we always round down).
Before aggressive memory protection, this setting was sufficient to
protect the driver from starving itself of memory.

Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2 which enabled
"memory protection". Moreover, in the years since Hail 0.2.15, the
memory in use by system processes on Dataproc driver nodes appears to
have increased. Due to these two circumstances, the driver VM's memory
usage can grow high enough to trigger the OOMKiller before the JVM
triggers a GC. Consider, for example, these slices of the syslog of the
n1-highmem-8 driver VM of a Dataproc cluster:

```
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: earlyoom v1.6.2
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem total: 52223 MiB, swap total:    0 MiB
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM when mem <=  0.12% and swap <=  1.00%,
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]:         SIGKILL when mem <=  0.06% and swap <=  0.50%
...
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: + echo 'All done'
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: All done
Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free:    0 of    0 MiB ( 0.00%)
```

Notice:

1. The total memory available on the machine is less than 52 GiB (=
53,248 MiB); indeed, it is a full 1025 MiB below the advertised amount.

2. Once all the components of the Dataproc cluster have started (but
before any Hail Query jobs are submitted), the total memory available is
already depleted to 42,760 MiB. Recall that Hail allocates 41 GiB (=
41,984 MiB) to its JVM. This leaves the Python process and all other
daemons on the system only 776 MiB of excess RAM. For reference,
`python3 -c 'import hail'` needs 206 MiB.

This PR modifies `hailctl dataproc start` and the meaning of
`--master-memory-fraction`. Now, `--master-memory-fraction` is the
percentage of the memory available to the master node after accounting
for the missing 1 GiB and the system daemons. We also increase the
default memory fraction to 90%.

For an n1-highmem-8, the driver has 36 GiB instead of 41 GiB. An
n1-highmem-16 is unchanged at 83 GiB.
`aiohttp` 3.9 introduced `AppKey`, which allows `Application.__getitem__`
to convey a return type based on its input. So we can do things like `db
= app[AppKeys.DB]` and `db` is known to be of type `Database`. This
should help a great deal with static type checking and in-editor type
hints.
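
A minimal sketch of the pattern with aiohttp 3.9's `web.AppKey` (the `Database` class and the `AppKeys` container here are stand-ins, not Hail's actual definitions):

```python
from aiohttp import web

class Database:
    """Stand-in for Hail's database wrapper."""

class AppKeys:
    DB = web.AppKey('db', Database)

async def on_startup(app: web.Application) -> None:
    app[AppKeys.DB] = Database()

async def handler(request: web.Request) -> web.Response:
    db = request.app[AppKeys.DB]   # type checkers infer `db: Database`
    return web.Response(text=type(db).__name__)

app = web.Application()
app.on_startup.append(on_startup)
app.add_routes([web.get('/', handler)])
```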
…#14070)

This PR was automatically created by Snyk using the credentials of a
real user.

Snyk has created this PR to fix one or more vulnerable packages in the
`pip` dependencies of this project.



#### Changes included in this PR

- Changes to the following files to upgrade the vulnerable dependencies
to a fixed version:
    - hail/python/dev/pinned-requirements.txt


⚠️ **Warning**

```
jupyter 1.0.0 requires notebook, which is not installed.
jupyter 1.0.0 requires qtconsole, which is not installed.
beautifulsoup4 4.12.2 requires soupsieve, which is not installed.
argon2-cffi-bindings 21.2.0 requires cffi, which is not installed.
aiosignal 1.3.1 requires frozenlist, which is not installed.
```


#### Vulnerabilities that will be fixed





##### By pinning:

| Severity | Priority Score (*) | Issue | Upgrade | Breaking Change | Exploit Maturity |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Low | **461/1000** (**Why?** Recently disclosed, has a fix available, CVSS 3.5) | Generation of Error Message Containing Sensitive Information [SNYK-PYTHON-JUPYTERSERVER-6099119](https://snyk.io/vuln/SNYK-PYTHON-JUPYTERSERVER-6099119) | `jupyter-server`: `1.24.0 -> 2.11.2` | No | No Known Exploit |

(*) Note that the real score may have changed since the PR was raised.




Some vulnerabilities couldn't be fully fixed and so Snyk will still find
them when the project is tested again. This may be because the
vulnerability existed within more than one direct dependency, but not
all of the affected dependencies could be upgraded.


Check the changes in this PR to ensure they won't cause issues with your
project.



------------



**Note:** *You are seeing this because you or someone else with access
to this repository has authorized Snyk to open fix PRs.*

For more information:
🧐 [View latest project report](https://app.snyk.io/org/danking/project/20159ae6-a5aa-42fa-845a-c89f5bcbf999?utm_source=github&utm_medium=referral&page=fix-pr)

🛠 [Adjust project settings](https://app.snyk.io/org/danking/project/20159ae6-a5aa-42fa-845a-c89f5bcbf999?utm_source=github&utm_medium=referral&page=fix-pr/settings)

📚 [Read more about Snyk's upgrade and patch
logic](https://support.snyk.io/hc/en-us/articles/360003891078-Snyk-patches-to-fix-vulnerabilities)


---

**Learn how to fix vulnerabilities with free interactive lessons:**

🦉 [Generation of Error Message Containing Sensitive Information](https://learn.snyk.io/lesson/error-message-with-sensitive-information/?loc=fix-pr)

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
daniel-goldstein and others added 19 commits December 5, 2023 19:04
Jobs that are configured with `mount_tokens=True` will have their Hail
tokens mounted into the main container. However, now that we are using
access tokens from cloud identities, the tokens are no longer used. This
removes the default behavior of mounting the `tokens.json` files since
they aren't used by our codebase anyway.
…is#14080)

CHANGELOG: Fix hail-is#13979, affecting Query-on-Batch and manifesting most
frequently as "com.github.luben.zstd.ZstdException: Corrupted block
detected".

This PR upgrades google-cloud-storage from 2.29.1 to 2.30.1. The
google-cloud-storage Java library has a bug, present at least since
2.29.0, in which simply incorrect data was returned
(googleapis/java-storage#2301). The issue seems
related to their use of multiple intermediate ByteBuffers. As far as I
can tell, this is what could happen:

1. If there's no channel, open a new channel with the current position.
2. Read *some* data from the input ByteChannel into an intermediate
ByteBuffer.
3. While attempting to read more data into a subsequent intermediate
ByteBuffer, a retryable exception occurs.
4. The exception bubbles to google-cloud-storage's error handling, which
frees the channel and loops back to (1)

The key bug is that the intermediate buffers have data but the
`position` hasn't been updated. When we recreate the channel we will
jump to the wrong position and re-read some data. Lucky for us, between
Zstd and our assertions, this usually crashes the program instead of
silently returning bad data.

This is the third bug we have found in Google's cloud storage java
library. The previous two:

1. hail-is#13721
2. hail-is#13937

Be forewarned: the next time we see bizarre networking or data
corruption issues, check if updating google-cloud-storage fixes the
problem.
The Zstandard version is not changing. The zstd-jni library, which wraps
Zstandard and provides some interoperation with java.nio, has made nine
releases since 1.5.5-2. They do not publish a changelog, but I scanned
through their commits. There were some potentially relevant bug fixes:

1. When using array-backed ByteBuffers, zstd-jni reads the wrong data if
the arrayOffset is not zero.
luben/zstd-jni@355b851

2. Perhaps a slightly faster path for array-backed ByteBuffers.
luben/zstd-jni@100c434

3. Possibly faster buffer pool.
luben/zstd-jni@2b6c3b7

4. Removed a double free during compression.
luben/zstd-jni@b2ad383
This adds the k8s config necessary to host the [guide
browser](https://hub.docker.com/r/gneak123/guide_browser/tags) in our
k8s cluster. You can see it running in dev
[here](https://internal.hail.is/dgoldste/guide-analysis/). There's not
much special here: a deployment with the browser app and an Envoy
sidecar to handle TLS. Once this merges and the
`ssl-config-guide-analysis` is created in `default`, I can `make -C guide
deploy NAMESPACE=default` and then recreate the certs to pick up the new
subdomain, after which it should be live.

Resolves hail-is#14067
We need to retry 401s until we get a 401 from credentials we know to be
invalid.
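
A minimal sketch of that retry policy (hypothetical names, not Hail's actual request helper): a 401 is treated as retryable until the credentials themselves are known to be invalid.

```python
import asyncio

async def request_until_definitive(make_request, credentials_are_invalid,
                                   max_attempts: int = 5):
    # Retry 401s that may come from stale or still-propagating tokens; stop
    # only when a 401 arrives from credentials we already know to be invalid.
    resp = None
    for attempt in range(max_attempts):
        resp = await make_request()
        if resp.status != 401 or await credentials_are_invalid():
            return resp
        await asyncio.sleep(2 ** attempt)
    return resp
```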
CHANGELOG: Fix hail-is#14089, which makes `hailctl dataproc connect` work in
Windows Subsystem for Linux.

1. Non-64-bit Windows uses "Program Files", not "Program Files (x86)".

2. Windows Subsystem for Linux looks like GNU/Linux but will not have
Chromium on its PATH.

3. The removed arguments are no longer supported. They produce a warning
message in my version of Chrome and appear to not work in the version of
Chrome that this user was using. Instead, I bind to 0.0.0.0 and access
the Notebook using the machine's DNS name. This is how Google recommends
accessing the Spark UI anyway.
…ail-is#14087)

* Add assertion to `load_combiner` and `new_combiner` to fail if the
output vds exists
* Remove assertion that disallows empty `gvcfs` and `vdses` in
`VariantDatasetCombiner.__init__`

Resolves hail-is#14079
Wenhan observed this error after I gave her a branch using
google-cloud-storage 2.30.1. I've reported this new transient error to the client API
repo, but I doubt it will be fixed.
googleapis/java-storage#2337

SSL errors seem like the kind of thing we should not retry forever since
they could indicate a bad actor.
Just explicitly start a loop and use our cloud-friendly FSes. Also, we
need to recursively copy the resources.
…14069)

I'm trying to move our codebase away from relying on the notion of a K8s
namespace for routing because this model does not hold in alternative
deployment environments such as Terra. Further, having a `-n` option in
`hailctl auth` commands is awkward because it can only be used for dev
environments yet is visible to users in the help menu. As such, this is
a breaking change only for developers. The functionality is not really
gone though because you can replace any `hailctl auth … -n dgoldste`
with `HAIL_DEFAULT_NAMESPACE=dgoldste hailctl auth …`.
Oh Chrome,

How do I call thee? Let me count the ways.
…returning errors (hail-is#14085)

The original goal of this PR was avoiding `Try` when we are not using
the restartability provided by semantic hashing, because I strongly
suspect it is related to the loss of stack traces in exceptions.

Unrelatedly, we realized the semantic hash PR changed the semantics of
Query-on-Spark even when semantic hash is disabled: previously we would
abort RDD writing on the first exception. In Hail 0.2.123 through
0.2.126, the semantics were changed to only crash *after* we already ran
every other partition. Two bad scenarios of which I can think:

1. Suppose the first partition fails due to OOM. We now waste time/money
on the rest of the partitions even though we cannot possibly get a valid
output.

2. Suppose every partition hits a permission error. Users should get
that feedback after paying for O(1) partitions run, not O(N).

I created two Backend paths: the normal `parallelizeAndComputeWithIndex`
with its pre-0.2.123 semantics as well as
`parallelizeAndComputeWithIndexReturnAllErrors` which, as the name says,
returns errors instead of raising them.
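
A minimal Python sketch of the two semantics (the names echo the Scala `Backend` methods, but the code is purely illustrative):

```python
def compute_fail_fast(partitions, f):
    # Pre-0.2.123 semantics: the first exception aborts the remaining partitions.
    return [f(i, p) for i, p in enumerate(partitions)]

def compute_return_all_errors(partitions, f):
    # Run every partition, collecting failures instead of raising them.
    successes, errors = [], []
    for i, p in enumerate(partitions):
        try:
            successes.append((i, f(i, p)))
        except Exception as e:
            errors.append((i, e))
    return successes, errors
```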

While making this change, I think I found two other bugs in the "return
all errors" path, only one of which I addressed in this PR:

1. I'm pretty sure semantic-hash-enabled QoB batch submission is broken
because it uses the logical partition ids as job indices. Suppose there
are 10,000 partitions, but we only need to compute 1, 100, and 1543.
0.2.126 would try to submit a batch of size 3 but whose job indices are
1, 100, and 1543.

2. Likewise, the Query-on-Spark path returns an invalid
`SparkTaskContext.partitionId` which, at best, produces confusing
partition filenames.

I only fixed the former because it was simple to fix. I wasn't exactly
sure what to do about the latter. We should fix that separately because
the changes in this PR need to urgently land in the next release to
avoid unexpected cost when one partition fails.
CHANGELOG: Fix `hailctl hdinsight start`, which has been broken since
0.2.118.

The shift to typer/click accidentally converted this parameter from a
string to an int.
This would seem to violate this [statement about
`hd`](https://developers.google.com/identity/openid-connect/openid-connect#id_token-hd):
> The absence of this claim indicates that the account does not belong
to a Google hosted domain.

but alas, that's the truth. They elide `hd` for iam.gserviceaccount.com
accounts.
The recently cherry-picked CVE-2023-51663 fix included refactored
code dependent on recent upstream code that was not cherry-picked.
Merge current upstream HEAD to include the necessary AppKeys class
and any unrelated fixes and improvements.
@jmarshall jmarshall marked this pull request as draft January 3, 2024 03:25
@jmarshall jmarshall marked this pull request as ready for review January 4, 2024 22:56

@illusional illusional left a comment


Looks good to me!

Noting that there is an offline database merge, and there are terraform changes to apply - fun!

@jmarshall (Author)

Yep, I was just looking at the batch/sql/rename-job-groups-tables.sql migration script, which will be interesting to observe running…

@jmarshall jmarshall merged commit c27a6e7 into main Jan 5, 2024
5 checks passed
@jmarshall jmarshall deleted the upstream-newyear2024 branch January 5, 2024 01:30
jmarshall added a commit that referenced this pull request Jan 8, 2024
PR #325 removed AppKeys.FLOW_CLIENT here as we did not cherry-pick
the rest of the AppKeys change. We later applied PR #324 which fully
merged upstream's AppKeys changes. Unfortunately when we merged that
PR via GitHub it chose #325's version of this line. :doh: