Merge upstream HEAD (da6668b, 2024-01-02) for auth fix #324
Merged
Conversation
…#13943) Fixes hail-is#13815. I tried to simplify the concepts here to be unified across all instance collection types. I renamed "provisioned" to "live". "Schedulable" means only workers that are active and from the latest instance version. I think the example figures are self-explanatory.

![Screenshot 2023-10-30 at 11 33 39 AM](https://github.com/hail-is/hail/assets/1693348/0c8f4d2e-019e-419e-86b6-12de510ac5a4)
![Screenshot 2023-10-30 at 11 35 29 AM](https://github.com/hail-is/hail/assets/1693348/131f4978-76c4-4a6e-9e72-ecf9f99c8d5e)
`asyncinit` is unused AFAICT, and the `frozenlist` requirement is already inherited from hailtop (though it is not used in `hailtop` itself, only in query code, so I am also happy to move it out of hailtop fully and into query).
Currently the binding structure is redundantly specified in two places: Binds.scala and the parser. We need the binding structure in the parser to propagate the environment, so we can annotate `Ref` nodes (and a few other things) with their types. But we can't use Binds.scala because we don't yet have an IR. This PR removes environment maintenance from the parser by deferring type annotation to a separate pass (which is simple, because it can use the Binds.scala infrastructure). One consequence is that we can't assign types to nodes like `Ref` during parsing, which means we can't ask for the type of any node during parsing, and by extension we can't ask for types of children in IR node constructors. Instead, all typechecking logic is moved to the `TypeCheck` pass.

Some benefits of this change:
* The parser is simpler, as it doesn't have to maintain a typing environment.
* Binds.scala is now the single source of truth on the binding structure of the IR.
* Instead of typechecking being split in an ad-hoc way between IR constructors and the `TypeCheck` pass, all typechecking and type error reporting logic is in one place.
* The parser parses a context-free grammar, no more and no less. If the input is grammatically correct, the parser succeeds.
* We can round-trip IR with type errors through the text representation. For instance, if we log an IR that fails TypeCheck, we can copy the IR from the log, parse it, then debug.

This change was motivated by my work in progress to convert the parser to use the SSA grammar, which this should greatly simplify.

I chose to have the type annotation pass that runs after parsing mutate the IR in place (with the unfortunate exception of `Apply`, which can change into an `ApplyIR` or `ApplySpecial`; do these really need to be separate nodes?). The type of a `Ref` node was already mutable to allow this sort of deferred annotation, and I've had to make a few other things mutable as well. Alternatively, we could rebuild the entire IR to include type annotations, but I think the mutation is sufficiently well localized, and this is a common compiler pattern.

This touches a lot of lines, but the high-level changes are:
* Remove the `refMap` from the `IRParserEnvironment`, and remove all code that modifies the typing environment from the parser. Nodes like `Ref` that need a type from the environment get types set to `null`, to be filled in after parsing.
* Add an `annotateTypes` pass to fill in type annotations from the environment. This currently lives in the parser and is always called after parsing, which means that for the moment we can only parse type-correct IR, but in the future we could move it to a separate stage of the compilation pipeline.
* Move typechecking logic on relational nodes from the constructors to a `typecheck` method, which is called from the `TypeCheck` pass.
* Make the `typ` field on IR nodes consistently lazy (or a def when the type is a child's type without modification). Before, we sometimes did this for performance, but it's now required to avoid querying children's types during construction.
* Make types mutable on `AggSignature` and `ComparisonOp`, so they can be filled in after parsing.
* Ensure that the structure in `Binds.scala` satisfies the following invariant: to construct the typing environment of child `i`, we only need the types of children with index less than `i`. This was almost always satisfied already, and allows us to use the generic binds infrastructure in the pass that annotates types (where, when visiting child `i`, we can only query types of already-visited children).
* Change the text representation of `TailLoop`/`Recur` to move the explicit type annotation from `Recur` to `TailLoop`. This is necessary to comply with the previous point. It's also consistent with `Ref`, where types of references are inferred from the environment.
* Add explicit types in the text representation of `TableFilterIntervals` and `MatrixFilterIntervals`, where the types were needed during parsing and we can no longer get them from child types.
* Fix IR examples used in parser tests to be type correct.
* Add an explicit return type to the `Apply` node. Before, the parser parsed an `Apply` node to one of `Apply`/`ApplySpecial`/`ApplyIR`; now it always parses to `Apply`, and the type annotation pass converts to the appropriate specialization, which needs the parsed return type.
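For readers less familiar with the compiler structure, here is a minimal, hypothetical Python sketch of the deferred-annotation idea described above. The actual change is in Scala against Hail's IR classes; the node classes, `annotate_types`, and `type_of` below are illustrative, not Hail APIs.

```python
# Toy two-pass flow: "parse" an IR with untyped Refs, then fill in Ref types
# in a separate pass that walks the tree with a typing environment.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Ref:
    name: str
    typ: Optional[str] = None   # left unset by the parser, filled in later

@dataclass
class Let:
    name: str
    value: object
    body: object

def type_of(node) -> str:
    if isinstance(node, Ref):
        return node.typ
    if isinstance(node, Let):
        return type_of(node.body)
    return "int32"              # literals, for the sake of the toy example

def annotate_types(node, env=None):
    """Mutate the tree in place, assigning each Ref the type bound in `env`."""
    env = env or {}
    if isinstance(node, Ref):
        node.typ = env[node.name]
    elif isinstance(node, Let):
        annotate_types(node.value, env)
        # The binding structure: the body sees `name` bound to the value's type.
        annotate_types(node.body, {**env, node.name: type_of(node.value)})
    return node

ir = Let("x", 5, Ref("x"))      # as if freshly parsed: Ref("x").typ is None
annotate_types(ir)
assert ir.body.typ == "int32"
```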
Fix bug where relational IR inside the condition of a TableFilter or MatrixFilter causes a ClassCastException. This can happen if, for example, there's a TableAggregate inside the condition.
Main change: add `var mark: Int` to `BaseIR`.

On profiling the benchmark `matrix_multi_write_nothing`, I noticed a significant amount of time was spent:
- iterating through zipped arrays in requiredness
- adding and removing elements from `HashSet`s

In fact, half the time spent in requiredness was removing IR nodes from the `HashSet` used as the queue! With this change, requiredness runs like a stabbed rat!

Explanation of `mark`: this field acts as a flag that analyses can set. For example:
- `HasSharing` can use the field to see if it has visited a node before.
- `Requiredness` uses this field to tell if a node is currently enqueued.

The `nextFlag` method in `IrMetadata` allows analyses to get a fresh value to which they can set the `mark` field. This removes the need to traverse the IR after analyses to re-zero every `mark` field.
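The actual change is to the Scala `BaseIR` class; as a rough, illustrative Python sketch of the technique (all names here are hypothetical), replacing a `HashSet`-based "is this node enqueued?" check with a per-node integer mark and a per-analysis fresh flag looks roughly like this:

```python
# Illustrative sketch: each node carries a mutable `mark` field, and each
# analysis grabs a fresh flag value, so membership checks become an integer
# comparison and no post-analysis cleanup pass is needed.
from collections import deque

class Node:
    def __init__(self, *children):
        self.children = children
        self.mark = 0            # analogous to `var mark: Int` on BaseIR

class IrMetadata:
    def __init__(self):
        self._flag = 0
    def next_flag(self):
        self._flag += 1          # fresh value per analysis; old marks go stale
        return self._flag

def worklist_pass(root, metadata):
    flag = metadata.next_flag()
    queue = deque([root])
    root.mark = flag
    while queue:
        node = queue.popleft()
        node.mark = 0            # dequeued; cheap compared to HashSet.remove
        for child in node.children:
            if child.mark != flag:   # "not currently enqueued"
                child.mark = flag
                queue.append(child)

meta = IrMetadata()
worklist_pass(Node(Node(), Node(Node())), meta)
```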
On the Azure doc page
…ail-is#14022) CHANGELOG: Fix hail-is#13937 caused by faulty library code in the Google Cloud Storage API Java client library.
…#13981) A very small PR, but here's the background and context behind this change. When talking to either GCP or Azure, Hail chooses credentials in the following order, from highest priority to lowest:
1. An explicit `credential_file` argument passed to the relevant credentials class.
2. An environment variable containing the path to the credentials (`GOOGLE_APPLICATION_CREDENTIALS` or `AZURE_APPLICATION_CREDENTIALS`). (From this you can see why the code that was here is totally redundant.)
3. The latent credentials present on the machine. These might be `gcloud` or `az` credentials, or the metadata server if you're on a cloud VM.

I'm trying to rid the codebase of most explicit providing of credentials file paths, for two reasons:
- Quality of life. I'm already signed into the cloud with `gcloud` and `az`. I shouldn't need to download some file and provide `AZURE_APPLICATION_CREDENTIALS` to run this test. It should just use the latent credentials.
- We are trying to phase out credentials files altogether for security reasons. These files are long-lived secrets that you really don't want to leak, and they are currently exposed to users in Batch jobs, so they can be easily exfiltrated. Using the latent credentials on a cloud VM (the metadata server) has the benefit of only issuing short-lived access tokens which last for hours, not months, so it's basically always better to use the latent credentials when possible.
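As a hedged sketch of the priority order described above, not the actual hailtop implementation (the helper name and signature are hypothetical):

```python
# Hypothetical resolution order: explicit argument > environment variable >
# latent machine credentials (gcloud/az config or the metadata server).
import os
from typing import Optional

def resolve_credentials_path(credential_file: Optional[str],
                             env_var: str) -> Optional[str]:
    # 1. An explicit credential_file argument wins.
    if credential_file is not None:
        return credential_file
    # 2. Otherwise fall back to the environment variable, if set.
    # 3. If this returns None, the caller uses the latent credentials on the
    #    machine instead of any credentials file.
    return os.environ.get(env_var)

path = resolve_credentials_path(None, "GOOGLE_APPLICATION_CREDENTIALS")
```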
CHANGELOG: Deprecate the `default_reference` parameter to `hl.init`; users should instead call `default_reference` with an argument to set a new default reference, usually shortly after init. Resolves hail-is#13856

---------

Co-authored-by: Dan King <daniel.zidan.king@gmail.com>
This change allows `Let` nodes to bind multiple values. Serialisation is backwards compatible, meaning no changes to existing Python code are necessary. This form of `Let` is preferable because it flattens deeply nested IRs, which can help reduce the time and stack space needed to type-check. An extreme example of this is the benchmark [matrix_multi_write_nothing](https://github.com/hail-is/hail/blob/67801dfc66b504a7d49daa53f7ec6d22c1194585/benchmark/python/benchmark_hail/run/matrix_table_benchmarks.py#L369C10-L373), which overflows the stack on type-checking without this change.
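The change itself is to Hail's Scala IR; as a rough illustration of why multi-binding `Let`s help, here is a toy Python sketch (the node classes are hypothetical) showing that n single-binding `Let`s nest n levels deep, while one multi-binding `Let` keeps the tree shallow:

```python
# Toy illustration: nested single-binding lets vs one flat multi-binding let.
from dataclasses import dataclass

@dataclass
class Let:                      # binds one value (old form)
    name: str
    value: object
    body: object

@dataclass
class LetMany:                  # binds many values (new form)
    bindings: list
    body: object

def depth(node):
    if isinstance(node, (Let, LetMany)):
        return 1 + depth(node.body)
    return 0

n = 10_000
nested = "result"
for i in reversed(range(n)):
    nested = Let(f"x{i}", i, nested)        # n levels of nesting
flat = LetMany([(f"x{i}", i) for i in range(n)], "result")

# A recursive type-checker over `nested` needs ~n stack frames (calling
# depth(nested) here would hit Python's recursion limit); over `flat` it
# needs O(1) frames for the same bindings.
print(depth(flat))   # 1
```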
CHANGELOG: Hail supports identity_by_descent on Apple M1 and M2 chips; however, your Java installation must be an arm64 installation. Using x86_64 Java with Hail on Apple M1 or M2 will cause SIGILL errors. If you have an Apple M1 or Apple M2 and `/usr/libexec/java_home -V` does not include `(arm64)`, you must switch to an arm64 version of the JVM.

Fixes hail-is#14000.

Hail has never supported its native functionality on Mac OS X Apple M1 chips. In particular, we only built x86_64-compatible dylibs. M1 chips will try to simulate a very basic x86_64 ISA using Rosetta 2, but our x86_64 dylibs expect an ISA of at least sandybridge, which includes some SIMD instructions not supported by Rosetta 2.

This PR bifurcates our native build into x86_64 and arm64 targets, which live in build/x86_64 and build/arm64, respectively. On Linux, this moves where the object files live, but should otherwise have no effect. The test and benchmark targets use the "native" build, which always points at the x86_64 object files. The shared object targets, LIBBOOT and LIBHAIL, explicitly depend on x86_64 because that is the only Linux architecture we support.

On OS X, we only test and benchmark the "native" build, which is detected using `uname -m`. For the shared objects (the dylibs) we have four new files: libboot and libhail for x86_64 and for arm64. Each pair of files is placed in `darwin/x86_64/` and `darwin/arm64/`, respectively. Those dylibs are never meant to escape the src/main/c world. The LIBBOOT and LIBHAIL targets (which are invoked by hail/Makefile) combine the two architecture-specific dylibs into a "universal" dylib. You can verify this by running `file` on the dylibs. Here I run it on the new "prebuilt" files which are in this PR:

```
(base) dking@wm28c-761 hail % file hail/prebuilt/lib/darwin/libboot.dylib
hail/prebuilt/lib/darwin/libboot.dylib: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit dynamically linked shared library x86_64] [arm64:Mach-O 64-bit dynamically linked shared library arm64]
hail/prebuilt/lib/darwin/libboot.dylib (for architecture x86_64): Mach-O 64-bit dynamically linked shared library x86_64
hail/prebuilt/lib/darwin/libboot.dylib (for architecture arm64): Mach-O 64-bit dynamically linked shared library arm64
(base) dking@wm28c-761 hail % file hail/prebuilt/lib/darwin/libhail.dylib
hail/prebuilt/lib/darwin/libhail.dylib: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit dynamically linked shared library x86_64] [arm64:Mach-O 64-bit dynamically linked shared library arm64]
hail/prebuilt/lib/darwin/libhail.dylib (for architecture x86_64): Mach-O 64-bit dynamically linked shared library x86_64
hail/prebuilt/lib/darwin/libhail.dylib (for architecture arm64): Mach-O 64-bit dynamically linked shared library arm64
```

@chrisvittal, I need you to test this new Makefile on Linux:

```
make -C hail pytest HAIL_COMPILE_NATIVES=1 PYTEST_ARGS='-k test_ibd_does_not_error_with_dummy_maf_float64'
```

I am fairly certain this Makefile will not work on old MacBooks because they cannot compile for Apple M1. This means that we always need someone with an Apple M1 MacBook to build the prebuilts.

*A MAJOR CAVEAT OF THIS CHANGE*: Users of Apple M1 MacBooks must upgrade to an arm64 JVM. They *must not* use Rosetta 2 to simulate an x86_64 JVM. In particular, this means that Apple M1 users must upgrade to JVM 11 because there are [no Adopt OpenJDK JVMs for arm64](https://adoptium.net/temurin/releases/?version=8&os=mac).
…is#14025) Resolves hail-is#14020. I'll manually delete the scale up cron jobs from the dev namespaces once this goes in.
It’s a random test, and it seems the current tolerance still allows rare sporadic failures.
Stacked on hail-is#13475. This PR renames the following tables to have job groups in the name instead of batches. Note that this PR needs to shut down the batch deployment (offline migration). I'm not 100% sure this is necessary, but I want to avoid a case where MJC of the database migration job cannot happen, thus deadlocking the system.

```sql
RENAME TABLE
  batch_attributes TO job_group_attributes,
  batches_cancelled TO job_groups_cancelled,
  batches_inst_coll_staging TO job_groups_inst_coll_staging,
  batch_inst_coll_cancellable_resources TO job_group_inst_coll_cancellable_resources,
  aggregated_batch_resources_v2 TO aggregated_job_group_resources_v2,
  aggregated_batch_resources_v3 TO aggregated_job_group_resources_v3,
  batches_n_jobs_in_complete_states TO job_groups_n_jobs_in_complete_states;
```
CHANGELOG: Fix hail-is#13998, which appeared in 0.2.58 and prevented reading from a networked filesystem mounted within the filesystem of the worker node for certain pipelines (those that did not trigger "lowering").

We use the IndexReader in `PartitionNativeIntervalReader`, `PartitionNativeReaderIndexed`, and `PartitionZippedIndexedNativeReader`.
1. `PartitionNativeIntervalReader` is only used by `query_table`.
2. `PartitionNativeReaderIndexed` is only used by `IndexedRVDSpec2.readTableStage`, which is used by `TableNativeReader` when there is a new partitioner.
3. `PartitionZippedIndexedNativeReader` is only used by `AbstractRVDSpec.readZippedLowered` when there is a new partitioner.

Case 2 is for tables; case 3 is for matrix tables. In `readZippedLowered` we explicitly [drop the file protocol](https://github.com/hail-is/hail/blob/1dedf3c63f9aabf1b6ce538165360056f82f76e4/hail/src/main/scala/is/hail/rvd/AbstractRVDSpec.scala#L154-L155):

```
val absPathLeft = removeFileProtocol(pathLeft)
val absPathRight = removeFileProtocol(pathRight)
```

We have done this, by various names, since this lowered code path was added. I added `removeFileProtocol` because stripping the protocol in Query-on-Batch prevented the reading and writing of gs:// URIs, the only URIs I could read in QoB. `uriPath` (the function whose use I replaced with `removeFileProtocol`) was added by Cotton [a very long time ago](hail-is@92a9936). It seems he added it so that he could use HDFS to generate a temporary file path on the local filesystem but pass the file path to binary tools that know nothing of HDFS and file:// URIs.

hail-is#9522 added the lowered code path and thus introduced this bug. It attempted to mirror the extant code in [`readIndexedPartitions`](https://github.com/hail-is/hail/blob/2b0aded9206849252b453dd80710cea8d2156793/hail/src/main/scala/is/hail/HailContext.scala#L421-L440), which *does not* strip any protocols from the path.

This has gone undetected because we never try to read data through the OS's filesystem. We always use gs://, Azure, or s3:// because we do not test in environments that have a networked filesystem mounted in the OS's filesystem. To replicate this bug (and add a test for it), we would need a cluster with a Lustre filesystem (or another networked filesystem). This would be a fairly large lift.

The fix is trivial: just never intentionally strip the protocol!
We observed some sporadic failures, but the error was always less than .011. This raises the tolerance to .015, which should be a comfortable margin.
Spark depends on a very old version of SLF4J. We cannot upgrade. We added this dependency ages ago to fix some undocumented issue with logging and SLF4J. It seems reasonable to me that we should just accept whatever version of SLF4J Spark provides. This removes this message:

```
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/lib/spark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
```

Which, IMO, really should be a stop-the-world error.
We only have `make` commands for running `pylint` on subdirectories that have been kept up to date with its rules, but the `pylintrc` doesn't actually contain any indication of which subdirectories should be ignored when running `pylint`. This makes the use of language servers that run `pylint` on the currently open file frustrating, as files in the ignored subdirectories will often be full of `pylint` suggestions. This change adds the relevant subdirectories to the `pylintrc` file.

Note that this does not necessarily let us run `pylint` directly on those subdirectories with `make` commands equivalent to the ones that already exist: I haven't found a way to make `pylint` ignore the `__init__.py` file of whatever module it's being run on, so running it on `hail/python/hail`, for example, still produces many errors.
…ild.sh (hail-is#14064) This wasn't updated when I separated out building and pushing, so the `build` target wasn't pushing the image. When I fixed that, the cert generation failed because `memory` is no longer available. These changes got everything working again.
No need to accept anything addressed to a public IP
CHANGELOG: Since 0.2.110, `hailctl dataproc` set the heap size of the driver JVM dangerously high. It is now set to an appropriate level. This issue manifests in a variety of inscrutable ways, including RemoteDisconnectedError and "socket closed". See issue hail-is#13960 for details.

In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced ["memory protection"](https://cloud.google.com/dataproc/docs/support/troubleshoot-oom-errors#memory_protection), which is a euphemism for a newly aggressive OOMKiller. When the OOMKiller kills the JVM driver process, there is no hs_err_pid...log file, no exceptional log statements, and no clean shutdown of any sockets. The process is simply SIGTERM'ed and then SIGKILL'ed.

From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, `hailctl dataproc`, by default, reserves 80% of the advertised memory of the driver node for the use of the Hail Query Driver JVM process. For example, Google advertises that an n1-highmem-8 has 52 GiB of RAM, so Hail sets the `spark:spark.driver.memory` property to 41g (we always round down). Before aggressive memory protection, this setting was sufficient to protect the driver from starving itself of memory.

Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2, which enabled "memory protection". Moreover, in the years since Hail 0.2.15, the memory in use by system processes on Dataproc driver nodes appears to have increased. Due to these two circumstances, the driver VM's memory usage can grow high enough to trigger the OOMKiller before the JVM triggers a GC. Consider, for example, these slices of the syslog of the n1-highmem-8 driver VM of a Dataproc cluster:

```
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: earlyoom v1.6.2
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem total: 52223 MiB, swap total: 0 MiB
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM when mem <= 0.12% and swap <= 1.00%,
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: SIGKILL when mem <= 0.06% and swap <= 0.50%
...
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: + echo 'All done'
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: All done
Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free: 0 of 0 MiB ( 0.00%)
```

Notice:
1. The total memory available on the machine is less than 52 GiB (= 53,248 MiB); indeed, it is a full 1025 MiB below the advertised amount.
2. Once all the components of the Dataproc cluster have started (but before any Hail Query jobs are submitted), the total memory available is already depleted to 42760 MiB. Recall that Hail allocates 41 GiB (= 41,984 MiB) to its JVM. This leaves the Python process and all other daemons on the system only 776 MiB of excess RAM. For reference, `python3 -c 'import hail'` needs 206 MiB.

This PR modifies `hailctl dataproc start` and the meaning of `--master-memory-fraction`. Now, `--master-memory-fraction` is the percentage of the memory available to the master node after accounting for the missing 1 GiB and the system daemons. We also increase the default memory fraction to 90%. For an n1-highmem-8, the driver now has 36 GiB instead of 41 GiB. An n1-highmem-16 is unchanged at 83 GiB.
`aiohttp` 3.9 introduced `AppKey` which allows `Application.__getitem__` to convey a return type based on its input. So we can do things like `db = app[AppKeys.DB]` and `db` is known to be of type `Database`. This should help a great deal with static type checking and in-editor type hints.
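A minimal sketch of the `AppKey` pattern, assuming aiohttp 3.9+; the `Database` class and handler here are illustrative, not Hail's actual code:

```python
# web.AppKey pairs a name with a type, so type checkers know that
# app[DB] is a Database rather than an untyped value.
from aiohttp import web

class Database:
    async def fetch_user(self, user_id: int) -> dict:
        return {"id": user_id}

DB = web.AppKey("db", Database)

async def handler(request: web.Request) -> web.Response:
    db = request.app[DB]          # inferred type: Database
    user = await db.fetch_user(1)
    return web.json_response(user)

app = web.Application()
app[DB] = Database()
app.add_routes([web.get("/user", handler)])
```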
…#14070) This PR was automatically created by Snyk using the credentials of a real user.

Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

#### Changes included in this PR
- Changes to the following files to upgrade the vulnerable dependencies to a fixed version:
  - hail/python/dev/pinned-requirements.txt

⚠️ **Warning**

```
jupyter 1.0.0 requires notebook, which is not installed.
jupyter 1.0.0 requires qtconsole, which is not installed.
beautifulsoup4 4.12.2 requires soupsieve, which is not installed.
argon2-cffi-bindings 21.2.0 requires cffi, which is not installed.
aiosignal 1.3.1 requires frozenlist, which is not installed.
```

#### Vulnerabilities that will be fixed

By pinning:

| Severity | Priority Score (*) | Issue | Upgrade | Breaking Change | Exploit Maturity |
|---|---|---|---|---|---|
| low | **461/1000** (Recently disclosed, has a fix available, CVSS 3.5) | Generation of Error Message Containing Sensitive Information [SNYK-PYTHON-JUPYTERSERVER-6099119](https://snyk.io/vuln/SNYK-PYTHON-JUPYTERSERVER-6099119) | `jupyter-server:` `1.24.0 -> 2.11.2` | No | No Known Exploit |

(*) Note that the real score may have changed since the PR was raised. Some vulnerabilities couldn't be fully fixed, so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded. Check the changes in this PR to ensure they won't cause issues with your project.
Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Jobs that are configured with `mount_tokens=True` will have their Hail tokens mounted into the main container. However, now that we are using access tokens from cloud identities, the tokens are no longer used. This removes the default behavior of mounting the `tokens.json` files since they aren't used by our codebase anyway.
…is#14080) CHANGELOG: Fix hail-is#13979, affecting Query-on-Batch and manifesting most frequently as "com.github.luben.zstd.ZstdException: Corrupted block detected". This PR upgrades google-cloud-storage from 2.29.1 to 2.30.1.

The google-cloud-storage Java library has a bug, present at least since 2.29.0, in which simply incorrect data was returned: googleapis/java-storage#2301. The issue seems related to their use of multiple intermediate ByteBuffers. As far as I can tell, this is what could happen:
1. If there's no channel, open a new channel with the current position.
2. Read *some* data from the input ByteChannel into an intermediate ByteBuffer.
3. While attempting to read more data into a subsequent intermediate ByteBuffer, a retryable exception occurs.
4. The exception bubbles to google-cloud-storage's error handling, which frees the channel and loops back to (1).

The key bug is that the intermediate buffers have data but the `position` hasn't been updated. When we recreate the channel, we jump to the wrong position and re-read some data. Lucky for us, between Zstd and our assertions, this usually crashes the program instead of silently returning bad data.

This is the third bug we have found in Google's cloud storage Java library. The previous two:
1. hail-is#13721
2. hail-is#13937

Be forewarned: the next time we see bizarre networking or data corruption issues, check whether updating google-cloud-storage fixes the problem.
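The Java client internals aside, the failure mode described above can be sketched like this, as a toy Python illustration (not the actual google-cloud-storage code): data lands in an intermediate buffer, but the position is not advanced before a retry, so the reopened channel re-reads bytes that were already buffered.

```python
# Toy illustration of the stale-position retry bug described above.
class FlakyChannel:
    def __init__(self, data: bytes, fail_after: int):
        self.data, self.fail_after = data, fail_after
    def read(self, start: int, n: int) -> bytes:
        if start + n > self.fail_after:
            raise IOError("transient network error")
        return self.data[start:start + n]

def buggy_read(channel, position: int, chunk: int, total: int) -> bytes:
    buffered = b""
    while len(buffered) < total:
        try:
            # BUG: bytes land in `buffered` but `position` is never advanced.
            buffered += channel.read(position + len(buffered), chunk)
        except IOError:
            # Error handling "frees the channel and loops back", reopening at
            # the stale `position`: already-buffered bytes get read again.
            buffered += channel.read(position, chunk)
    return buffered[:total]

data = bytes(range(64))
out = buggy_read(FlakyChannel(data, fail_after=48), position=0, chunk=32, total=64)
assert out != data          # corrupted: the first chunk appears twice
```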
The Zstandard version is not changing. The zstd-jni library, which wraps Zstandard and provides some interoperation with java.nio, has released 9 times since 1.5.5-2. They do not publish a changelog, but I scanned through their commits. There were some potentially relevant bug fixes:
1. When using array-backed ByteBuffers, zstd-jni reads the wrong data if the arrayOffset is not zero. luben/zstd-jni@355b851
2. Perhaps a slightly faster path for array-backed ByteBuffers. luben/zstd-jni@100c434
3. Possibly faster buffer pool. luben/zstd-jni@2b6c3b7
4. Removed a double free during compression. luben/zstd-jni@b2ad383
This adds the k8s config necessary to host the [guide browser](https://hub.docker.com/r/gneak123/guide_browser/tags) in our k8s cluster. You can see it running in dev [here](https://internal.hail.is/dgoldste/guide-analysis/). There's not much special here: a deployment with the browser app and an Envoy sidecar to handle TLS. Once this merges and the `ssl-config-guide-analysis` is created in `default`, I can `make -C guide deploy NAMESPACE=default` and then recreate the certs to pick up the new subdomain, after which it should be live. Resolves hail-is#14067
We need to retry 401s until we get a 401 from credentials we know to be invalid.
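One way to implement that policy, as a hedged illustrative sketch (the function and exception names are hypothetical, not Hail's actual client code): treat a 401 as retryable while it might come from a stale cached token, and only give up once a 401 comes back for credentials we just obtained, since those are then known to be invalid.

```python
import asyncio

class Unauthorized(Exception):
    """Stand-in for an HTTP 401 response."""

async def request_with_retries(make_request, get_credentials):
    used_fresh_credentials = False
    creds = await get_credentials()                 # possibly a cached token
    while True:
        try:
            return await make_request(creds)
        except Unauthorized:
            if used_fresh_credentials:
                # A 401 from credentials we just obtained: they are genuinely
                # invalid, so stop retrying and surface the error.
                raise
            # The cached token may simply have expired mid-flight; refresh
            # and try again.
            await asyncio.sleep(1)
            creds = await get_credentials(force_refresh=True)
            used_fresh_credentials = True
```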
With the types, pyright can see this.
CHANGELOG: Fix hail-is#14089, which makes `hailctl dataproc connect` work in Windows Subsystem for Linux.
1. Non-64-bit Windows uses "Program Files", not "Program Files (x86)".
2. Windows Subsystem for Linux looks like GNU/Linux but will not have chromium on its path.
3. The removed arguments are no longer supported. They produce a warning message in my version of Chrome and appear not to work in the version of Chrome that this user was using. Instead, I bind to 0.0.0.0 and access the notebook using the machine's DNS name. This is how Google recommends accessing the Spark UI anyway.
…ail-is#14087)
* Add an assertion to `load_combiner` and `new_combiner` to fail if the output vds exists.
* Remove the assertion that disallows empty `gvcfs` and `vdses` in `VariantDatasetCombiner.__init__`.

Resolves hail-is#14079
Wenhan observed this error after I gave her a branch using google-cloud-storage 2.30.1. I've reported this new transient error to the client API repo, but I doubt it will be fixed: googleapis/java-storage#2337. SSL errors seem like the kind of thing we should not retry forever, since they could indicate a bad actor.
Just explicitly start a loop and use our cloud-friendly FSes. Also, we need to recursively copy the resources.
…14069) I'm trying to move our codebase away from relying on the notion of a K8s namespace for routing because this model does not hold in alternative deployment environments such as Terra. Further, having a `-n` option in `hailctl auth` commands is awkward because it can only be used for dev environments yet is visible to users in the help menu. As such, this is a breaking change only for developers. The functionality is not really gone though because you can replace any `hailctl auth … -n dgoldste` with `HAIL_DEFAULT_NAMESPACE=dgoldste hailctl auth …`.
Oh Chrome, How do I call thee? Let me count the ways.
…returning errors (hail-is#14085) The original goal of this PR was avoiding `Try` when we are not using the restartability provided by semantic hashing, because I strongly suspect it is related to the loss of stacktraces in exceptions.

Unrelatedly, we realized the semantic hash PR changed the semantics of Query-on-Spark even when semantic hash is disabled: previously we would abort RDD writing on the first exception. In Hail 0.2.123 through 0.2.126, the semantics were changed to only crash *after* we have already run every other partition. Two bad scenarios of which I can think:
1. Suppose the first partition fails due to OOM. We now waste time/money on the rest of the partitions even though we cannot possibly get a valid output.
2. Suppose every partition hits a permission error. Users should get that feedback after paying for O(1) partitions run, not O(N).

I created two Backend paths: the normal `parallelizeAndComputeWithIndex`, with its pre-0.2.123 semantics, and `parallelizeAndComputeWithIndexReturnAllErrors` which, as the name says, returns errors instead of raising them. A sketch of the two semantics follows below.

While making this change, I think I found two other bugs in the "return all errors" path, only one of which I addressed in this PR:
1. I'm pretty sure semantic-hash-enabled QoB batch submission is broken because it uses the logical partition ids as job indices. Suppose there are 10,000 partitions, but we only need to compute 1, 100, and 1543. 0.2.126 would try to submit a batch of size 3 but whose job indices are 1, 100, and 1543.
2. Likewise, the Query-on-Spark path returns an invalid `SparkTaskContext.partitionId` which, at best, produces confusing partition filenames.

I only fixed the former because it was simple to fix. I wasn't exactly sure what to do about the latter. We should fix that separately, because the changes in this PR need to urgently land in the next release to avoid unexpected cost when one partition fails.
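As a rough Python sketch of the two semantics, referenced above (the Backend methods themselves are Scala; these names and this execution model are illustrative only):

```python
# Fail-fast: stop at the first partition error, never paying for the rest.
# Return-all-errors: run every partition and hand back (failures, results).
from concurrent.futures import ThreadPoolExecutor

def compute_fail_fast(partitions, f):
    results = []
    for i, part in enumerate(partitions):
        results.append(f(i, part))   # first exception propagates immediately;
    return results                   # later partitions are never started

def compute_return_all_errors(partitions, f):
    failures, results = [], []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(f, i, part) for i, part in enumerate(partitions)]
        for i, fut in enumerate(futures):
            try:
                results.append((i, fut.result()))
            except Exception as e:   # every partition runs to completion
                failures.append((i, e))
    return failures, results
```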
CHANGELOG: Fix `hailctl hdinsight start`, which has been broken since 0.2.118. The shift to typer/click accidentally converted this parameter from a string to an int.
This would seem to violate this [statement about `hd`](https://developers.google.com/identity/openid-connect/openid-connect#id_token-hd):

> The absence of this claim indicates that the account does not belong to a Google hosted domain.

but alas, that's the truth. They elide `hd` for iam.gserviceaccount.com accounts.
The recently cherry-picked CVE-2023-51663 fix included refactored code dependent on recent upstream code that was not cherry-picked. Merge current upstream HEAD to include the necessary AppKeys class and any unrelated fixes and improvements.
illusional approved these changes on Jan 4, 2024
Looks good to me!
Noting that there is an offline database merge, and there are terraform changes to apply - fun!
Yep, I was just looking at the batch/sql/rename-job-groups-tables.sql migration script, which will be interesting to observe running…
The recently cherry-picked CVE-2023-51663 fix (PR #323) included refactored code that depended on recent upstream code which was not cherry-picked. This caused breakage in the auth service.

Context: https://centrepopgen.slack.com/archives/C030X7WGFCL/p1704243108737559

This merges current upstream HEAD to include the necessary AppKeys class (see auth/auth/auth.py) and any unrelated fixes and improvements.
Resolved merge conflicts included letsencrypt/subdomains.txt, in which we omitted upstream's addition of `guide-analysis`, which is specific to their lab.