Hail 118 #299
Conversation
This allows us to fully control subnet parameters. In particular, we want VPC flow logs enabled on all subnets. I already made this change in the console and deleted the asia, europe, me, northamerica, and southamerica subnets, which were not in use and lacked VPC flow logs.
Fixes hail-is#12983 --- After an `FSSeekableInputStream` method (successfully) returns, `getPosition` always represents the intended location within the object. We can entirely eliminate `lazyPosition` because it tracks the same value as `getPosition` when `reader == null`. For retryable reads, we simply seek back to the known correct location of the stream before attempting to read again.
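For readers unfamiliar with the pattern, here is a minimal Python sketch of the seek-back retry; the real implementation is in Hail's Scala FS layer, and every name here (`read_with_retries`, the transient-error set) is illustrative rather than the PR's code.
```python
import io

TRANSIENT_ERRORS = (ConnectionError, TimeoutError)  # illustrative set of retryable errors

def read_with_retries(stream: io.BufferedReader, n: int, max_attempts: int = 3) -> bytes:
    """Read n bytes, seeking back to the known-correct position before each retry."""
    start = stream.tell()  # like getPosition: the intended location within the object
    for attempt in range(max_attempts):
        try:
            return stream.read(n)
        except TRANSIENT_ERRORS:
            if attempt == max_attempts - 1:
                raise
            stream.seek(start)  # restore the known correct location, then try again
```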
I noticed this while xfail-ing a different test. It is possible this test now passes. We will see!
…s#13039) Related: hail-is#12963 Remember to add your service account to the new repository policy bindings:
```
$ gcloud artifacts repositories add-iam-policy-binding 'hail-benchmarks' \
    --member='serviceAccount:YOUR_SERVICE_ACCOUNT' \
    --role='roles/artifactregistry.repoAdmin' \
    --location='us' \
    --project='broad-ctsa'
```
`request_retry_transient_errors` is a charlatan. It does not retry errors that occur in reading the response body. I eliminated it in favor of the tried and true `retry_transient_errors` and some new helper methods that initiate the request *and* read the response.
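A hedged sketch of the replacement's shape: the retried unit must cover both initiating the request and reading the body, so a failure while streaming the response is retried too. `retry_transient_errors` is hailtop's existing helper; `request_and_read` and `get_bytes` are hypothetical names for the combined helpers.
```python
import aiohttp
from hailtop.utils import retry_transient_errors

async def request_and_read(session: aiohttp.ClientSession, method: str, url: str, **kwargs) -> bytes:
    # initiating the request AND reading the body happen inside one retried
    # unit, so a failure while streaming the response body is retried too
    async with session.request(method, url, **kwargs) as resp:
        resp.raise_for_status()
        return await resp.read()

async def get_bytes(session: aiohttp.ClientSession, url: str) -> bytes:
    return await retry_transient_errors(request_and_read, session, 'GET', url)
```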
What do you think of it? --- We should at least keep the fix to list_batches.py. I am unsure why I discovered blatant type errors in CI.
The service backend is shared among all the threads, which means we were changing the resource requests of random other tests. This substantially delays certain tests when, say, 16 quick partitions require 8 highmem cores each.
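To make the hazard concrete, a hypothetical Python illustration (not Hail's actual API): a backend shared by every test thread means mutating its defaults races with unrelated tests, while a per-call override does not.
```python
class SharedBackend:
    worker_cores: int = 1  # default read by every in-flight pipeline

def execute(backend: SharedBackend, *, worker_cores: int | None = None) -> int:
    # a per-call override falls back to the shared default without mutating it
    return worker_cores if worker_cores is not None else backend.worker_cores

backend = SharedBackend()

def run_pipeline_bad(cores: int) -> int:
    backend.worker_cores = cores  # races with other threads' pipelines
    return execute(backend)

def run_pipeline_good(cores: int) -> int:
    return execute(backend, worker_cores=cores)  # scoped to this call only
```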
Without this I get a diff because it wants to set workload_pool to null.
CHANGELOG: In Query-on-Batch, `naive_coalesce` no longer performs a full write/read of the dataset. It now operates identically to the Query-on-Spark implementation.
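For reference, a minimal usage example; `naive_coalesce` lowers the partition count by merging adjacent partitions, which is what now happens without the intermediate write/read:
```python
import hail as hl

mt = hl.utils.range_matrix_table(n_rows=1000, n_cols=10, n_partitions=100)
mt = mt.naive_coalesce(10)  # 100 partitions -> 10, no full write/read of the dataset
```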
`private[this]` is always faster than `private`. I saw `...off`, for example, appear in the profiler, though admittedly as a tiny tiny fraction. We spend a lot of time in `readBlock`, so it seems worthwhile to remove the unnecessary assert and, in some cases, an unnecessary field set.
…3030) This change creates mysql pods for test and dev namespaces, instead of sharing the CloudSQL database server. The areas of change are as follows:

### Generation of the namespace's database-server-config

The current approach in main does a little trick. Since the current `createDatabase` step uses the `database-server-config` from default to generate admin/user sql configs, the CI pipeline creates a dummy database `test-database-instance` to create a `sql-test-instance-admin-config` that inherits the credentials from the production `database-server-config`, and then copies that within the test namespace to `database-server-config`. In this change, since we are creating the server ourselves, we can just replace these with a step that creates a `database-server-config` from scratch, and then uses that for the DB pod. Overall, making these changes really gave me the heebie jeebies that the test and dev namespaces have all these credentials to the CloudSQL server. I'm glad this gets rid of that.

### Accessing the database server

We use the DB pod's service DNS name as the `host`, so inside Kubernetes this Just Works. The one caveat is the CI pipeline, in which we run migrations in batch jobs. Those jobs need a way to reach the DB pod. I achieve this with a NodePort and then use the job's K8s credentials to resolve the node and port that the DB is on (see the sketch after this section). The code I've added to do this resolution feels a bit janky; I wouldn't mind some feedback on that. In terms of security, if a user job were somehow able to resolve the address of a test db, it would still not have the credentials to access it, and this is currently also the case with the production database. Nevertheless, this does raise an action item: we should only allow traffic to the k8s and DB subnets for `network=private` jobs, but I think we should make that a separate PR.

### Database creation

In order to test this properly in a dev deploy, I needed to make some changes to `create_database.py`. In main, dev deploys can't create databases. I think they should be able to, and those operations should just be idempotent. When allowing dev deploys to create databases, I hit the `ALTER USER` lines in `create_database.py`, which lock out any already-deployed application, which feels silly. Instead, I create the mysql user and the corresponding secret iff the mysql username does not already exist.

### create_initial_account.py

When starting off with a fresh database, we encounter a bootstrapping problem where we need to add the developer's user. The current way to do that is through `create_initial_account.py`, which assumes that the developer's gsa key secret is not already in the namespace; but now it could be, so I delete the key in the namespace if it exists before creating it. This could all change with my other PR to add devs to test namespaces. But this change to allow ephemeral dev databases will make testing that other PR *way* easier.
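A sketch of the NodePort resolution idea using the official `kubernetes` Python client; the service name, function name, and node-selection policy here are assumptions for illustration, not the PR's actual code.
```python
from kubernetes import client, config

def resolve_db_address(namespace: str, service: str = 'db') -> tuple[str, int]:
    # use the job's K8s credentials to find where the DB's NodePort lives
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    svc = v1.read_namespaced_service(service, namespace)
    node_port = svc.spec.ports[0].node_port
    # a NodePort service listens on every node, so any node's internal IP works
    node = v1.list_node().items[0]
    ip = next(a.address for a in node.status.addresses if a.type == 'InternalIP')
    return ip, node_port
```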
…tainer registry credentials (hail-is#13035) Just some more shoving of implementation details into `CloudWorkerAPI` so that credential objects are more of a black box to `worker.py`. This lets us, e.g., change certain implementations to use OAuth2 tokens for users.
…is#13034) Bumps [types-chardet](https://github.com/python/typeshed) from 5.0.4.5 to 5.0.4.6. See the [full diff](https://github.com/python/typeshed/commits) in the compare view. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This appears unnecessary. I was misled by a website about the correct path at which to store Service Provider Interface implementations.
For posterity, update the date in the change log to reflect when PR hail-is#12987 was merged and the release made. Very minor, and probably already less important than it was last week, but for future readers it's useful for these to be aligned…
This test takes up to fifteen minutes on QoB and can be a long tail for CI pipelines so I broke it up into smaller tests.
hail-is#13051) Validation code was not updated when adding Azure HTTPS support. I made this check a bit more robust and tested it locally. Resolves hail-is#13049
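To illustrate the kind of check involved (the names and regexes here are mine, not the PR's): an Azure URL validator has to accept both the legacy `hail-az://` scheme and the `https://<account>.blob.core.windows.net/<container>` form.
```python
import re

_HTTPS = re.compile(r'https://(?P<account>[a-z0-9]+)\.blob\.core\.windows\.net/(?P<container>[^/]+)(?P<path>/.*)?')
_HAIL_AZ = re.compile(r'hail-az://(?P<account>[a-z0-9]+)/(?P<container>[^/]+)(?P<path>/.*)?')

def parse_azure_url(url: str) -> tuple[str, str, str]:
    """Return (account, container, path) for either supported URL form."""
    for pattern in (_HTTPS, _HAIL_AZ):
        m = pattern.fullmatch(url)
        if m:
            return m.group('account'), m.group('container'), m.group('path') or ''
    raise ValueError(f'invalid Azure Blob Storage URL: {url}')
```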
…2948) This represents a small redesign of ValueWriter as well. Add deserializers (subclasses of ValueReader) and have ReadValue use them. A ValueReader deserializes a single hail value from an input stream. This change also alters the semantics of ValueWriters. Both readers and writers no longer manage their own I/O resources; it is the responsibility of callers to do so instead. However, programmers must be careful, as we need to create input/output buffers to perform native serialization/deserialization. For InputBuffers in particular, the underlying stream may be left in an unusable state.
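A Python sketch of the caller-managed-I/O contract (the real ValueReader/ValueWriter are Scala classes; this just restates the ownership rule under made-up names):
```python
import abc
import io

class ValueReader(abc.ABC):
    @abc.abstractmethod
    def read(self, in_buf: io.BufferedReader):
        """Deserialize one hail value; must not close or otherwise own in_buf."""

def read_value(reader: ValueReader, path: str):
    # the caller creates the input buffer and is responsible for closing it,
    # keeping in mind it may be left in an unusable state after a read
    with open(path, 'rb') as in_buf:
        return reader.read(in_buf)
```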
We've had to do a redeploy of our hail batch instance on Azure. This PR resolves/clarifies two issues we encountered.

1) Storage Account Name Uniqueness

Due to Azure's restrictions on storage account naming (mainly that names must be globally unique), the redeploy did not succeed. The resource group name (we chose to reuse `hail`) can be reused under a new subscription, but the generated storage account names were therefore identical to those of our previous stack. I've added an argument called `storage_account_suffix` to account for this issue. It can be set to any arbitrary string that complies with Azure's storage account naming scheme in order to avoid naming conflicts in the future. While the option remains to simply choose a novel resource group name, this is not enforced by Azure, and anyone deploying a stack named similarly to someone else's would not find out until the `terraform apply` stage that the name would not work.

2) Mysql Flexible Server Zones

The only other issue is that the zone argument for the mysql flexible server is no longer always valid, depending on your compute region. We needed to comment it out for a successful deploy in Australia East. We hope the comment that has been added will be helpful to others in the future.
…s#13070) Co-authored-by: Daniel Goldstein <danielgold95@gmail.com>
) Now that test databases are hosted on their own servers instead of the single cloud-hosted MySQL, we can ramp up the parallelism both in our tests and in the number of PRs that we run at once. I recall that even before we had this DB bottleneck we still restricted the number of PRs running at once for cost reasons, but if that's not the case we could remove that restriction entirely.
…il-is#13124) I'm surprised that no one has complained about this. Is this setting not used much? Should we alert people that they might have garbage in places they wouldn't expect?
Maybe some Google API cannot handle 3 batch-drivers under full load? Also, this docker inspect thing is just pervasive and extraordinarily annoying.
We get a lot of errors about files that already exist. Hail commands are usually not retryable because there are file paths that might have been partly written to. This tightly scopes the retries to just the I/O. It also avoids keeping the output stream open for a long period of time.
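A hedged sketch of the narrowed retry scope; `retry_transient_errors` is hailtop's helper, while the `fs` interface and function names here are assumptions modeled on hailtop's async FS.
```python
from hailtop.utils import retry_transient_errors

async def write_object(fs, path: str, data: bytes) -> None:
    async def attempt() -> None:
        # open, write, and close in one retried unit: only the I/O is retried,
        # never the surrounding (non-retryable) hail command, and the output
        # stream is not held open across long-running work
        async with await fs.create(path) as out:
            await out.write(data)
    await retry_transient_errors(attempt)
```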
In particular, start with a few copies, no autoscaling, and no disruption budget
…-is#13160) Alright, I snagged the PR namespace from the CI:
```
pr-13135-default-u5tt5011yt5w
```
Then I went to the Azure [Log Analytics workspace haildev-logs](https://portal.azure.com/#@haildev.onmicrosoft.com/resource/subscriptions/22cd45fe-f996-4c51-af67-ef329d977519/resourceGroups/haildev/providers/Microsoft.OperationalInsights/workspaces/haildev-logs/logs). I went to "Queries", selected "DK's AKS Pod Logs", modified the namespace to the aforementioned one, and added a filter for "hail-az://".
```
let startTimestamp = ago(2h);
KubePodInventory
| where TimeGenerated > startTimestamp
| extend PodName=Name
| where Namespace == "pr-13135-default-u5tt5011yt5w" and PodName startswith "batch-driver"
| distinct ContainerID, PodName, Namespace
| join (
    ContainerLog
    | where TimeGenerated > startTimestamp
) on ContainerID
| project TimeGenerated, message=parse_json(LogEntry).message, LogEntry=parse_json(LogEntry)
| where message contains "hail-az://"
| order by TimeGenerated desc
```
That revealed the batch logs path:
```
EXAMPLE BATCH_JOB_LOGS_PATH
hail-az://haildevtest/test/batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1/1/abc123/main/log
```
In the [failing PR test job logs](https://ci.azure.hail.is/batches/3956877/jobs/152), I found the batch id:
```
[2023-06-09 12:43:34] test/hail/methods/test_impex.py::BGENTests::test_import_bgen_row_fields
-------------------------------- live log call ---------------------------------
INFO     batch_client.aioclient:aioclient.py:753 created batch 1148
INFO     batch_client.aioclient:aioclient.py:770 updated batch 1148
FAILED
```
I listed the job logs:
```
(base) dking@wm28c-761 hail % az storage blob list --account-name haildevtest --container test --prefix batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/ -o table
Name                                                                           Blob Type  Blob Tier  Length  Content Type              Last Modified              Snapshot
-----------------------------------------------------------------------------  ---------  ---------  ------  ------------------------  -------------------------  --------
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/main/log                 BlockBlob  Hot        11724   application/octet-stream  2023-06-09T12:43:36+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/main/resource_usage      BlockBlob  Hot        64      application/octet-stream  2023-06-09T12:43:36+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/status.json              BlockBlob  Hot        1240    application/octet-stream  2023-06-09T12:43:36+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/main/log                 BlockBlob  Hot        16626   application/octet-stream  2023-06-09T12:44:22+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/main/resource_usage      BlockBlob  Hot        680     application/octet-stream  2023-06-09T12:44:22+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/status.json              BlockBlob  Hot        4453    application/octet-stream  2023-06-09T12:44:22+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/dK3o5ZfXmYSkP5TA/specs      BlockBlob  Hot        1264    application/octet-stream  2023-06-09T12:43:37+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/dK3o5ZfXmYSkP5TA/specs.idx  BlockBlob  Hot        16      application/octet-stream  2023-06-09T12:43:37+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/eOrFpVrN98GBIizi/specs      BlockBlob  Hot        1264    application/octet-stream  2023-06-09T12:43:34+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/eOrFpVrN98GBIizi/specs.idx  BlockBlob  Hot        16      application/octet-stream  2023-06-09T12:43:34+00:00
```
I looked at the status:
```
az storage blob download --account-name haildevtest --container test --name batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/status.json | jq '.' | less
```
which contained an error (I un-escaped the string here):
```
JVMUserError: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at is.hail.JVMEntryway.retrieveException(JVMEntryway.java:253)
	at is.hail.JVMEntryway.finishFutures(JVMEntryway.java:215)
	at is.hail.JVMEntryway.main(JVMEntryway.java:185)
Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:122)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:119)
	... 7 more
Caused by: is.hail.backend.service.EndOfInputException
	at is.hail.backend.service.ServiceBackendSocketAPI2.read(ServiceBackend.scala:497)
	at is.hail.backend.service.ServiceBackendSocketAPI2.readInt(ServiceBackend.scala:510)
	at is.hail.backend.service.ServiceBackendSocketAPI2.executeOneCommand(ServiceBackend.scala:561)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$6(ServiceBackend.scala:462)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$6$adapted(ServiceBackend.scala:461)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$5(ServiceBackend.scala:461)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.services.package$.retryTransientErrors(package.scala:141)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$4(ServiceBackend.scala:460)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$4$adapted(ServiceBackend.scala:459)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$3(ServiceBackend.scala:459)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.services.package$.retryTransientErrors(package.scala:141)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.main(ServiceBackend.scala:458)
	at is.hail.backend.service.Main$.main(Main.scala:15)
	at is.hail.backend.service.Main.main(Main.scala)
	... 11 more
```
Which suggests that the service backend experienced an EOF somewhere in the first four bytes of the input file. Unfortunately, we automatically clean up the input and output files, so I can't investigate further. This PR reads the input and output files and stores them in the error message so that next time this happens we get more information.
…il-is#13125) It's unclear to me exactly why (perhaps a terraform update on my machine), but running terraform against Azure complained that it didn't know the resource group in which the terraform state storage account lives. It's also unclear to me why this is necessary, because the storage account is globally unique, but it was easy enough to add and it fixed my problem *shrug*. Let me know if you'd like me to look into this further.
CHANGELOG: Requester pays buckets now work in `hailtop.fs` and `hl.hadoop_*`. This has been broken since at least 0.2.115. I first check for an explicit argument or the standard hailctl configuration. If neither of those exists, I try to parse spark-defaults.conf with plenty of error handling and warnings.
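A sketch of that resolution order (explicit argument, then hailctl config, then spark-defaults.conf); the helper name and exact config keys here are assumptions for illustration.
```python
import os
import warnings

def requester_pays_project(explicit: str | None = None) -> str | None:
    if explicit is not None:
        return explicit
    # hailctl configuration (e.g. `hailctl config set gcs_requester_pays/project ...`)
    from hailtop.config import get_user_config
    project = get_user_config().get('gcs_requester_pays', 'project', fallback=None)
    if project is not None:
        return project
    # last resort: parse spark-defaults.conf, warning instead of crashing
    conf = os.path.join(os.environ.get('SPARK_CONF_DIR', '/etc/spark/conf'), 'spark-defaults.conf')
    try:
        with open(conf) as f:
            for line in f:
                key, _, value = line.strip().partition(' ')
                if key == 'spark.hadoop.fs.gs.requester.pays.project.id':
                    return value.strip()
    except OSError as err:
        warnings.warn(f'could not read {conf}: {err}')
    return None
```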
Soyeon hit an ugly message from Scala.
Co-authored-by: Dan King <daniel.zidan.king@gmail.com> Co-authored-by: Daniel King <dking@broadinstitute.org>
…ts exc (hail-is#13163) This will at least give us more information about which directories are taking forever to delete.
…ail-is#13168) Replaces six Snyk PRs.
…ces (hail-is#13135) I was getting tired of having to manually delete my root cert every so often and redeploy. This will recreate the root cert in dev and test namespaces if it has already expired. Deleting an expired root cert won't break communication that isn't already broken.
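For context, a minimal sketch of the expiry check using the `cryptography` package; the real change lives in Hail's TLS setup code, and `root_cert_expired` is a name I made up.
```python
import datetime
from cryptography import x509

def root_cert_expired(pem_bytes: bytes) -> bool:
    cert = x509.load_pem_x509_certificate(pem_bytes)
    # not_valid_after is a naive UTC datetime, so compare against utcnow()
    return cert.not_valid_after < datetime.datetime.utcnow()

# dev/test deploy idea: if the stored root cert is expired, delete the secret
# and recreate it; any communication it secured is already broken anyway.
```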
Co-authored-by: Dan King <daniel.zidan.king@gmail.com>
![Screen Shot 2023-05-24 at 3 59 35 PM](https://github.com/hail-is/hail/assets/1693348/94f560cb-3f88-4e9c-9622-4735b06a51f2)
Support Zstandard compression for hail input and output block buffers. Zstd is notable for having both very fast compression speed and adequate decompression speed, such that we expect to be network-limited during decompression. Further tests may show that Zstd is more performant than LZ4, which would motivate a proper switch from one format to the other. --------- Co-authored-by: Dan King <daniel.zidan.king@gmail.com>
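A quick, illustrative way to feel out the trade-off from Python, using the `zstandard` and `lz4` PyPI packages (Hail's actual block buffers are JVM-side, so this is only a rough analogue):
```python
import time
import zstandard
import lz4.frame

data = bytes(range(256)) * 40_000  # ~10 MiB of highly compressible data

for name, compress in [('zstd', zstandard.ZstdCompressor().compress),
                       ('lz4', lz4.frame.compress)]:
    t0 = time.perf_counter()
    out = compress(data)
    print(f'{name}: {len(data)} -> {len(out)} bytes in {time.perf_counter() - t0:.3f}s')
```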
A test timed out with this in the logs. This is some driver job. It just hangs for 2 minutes, presumably trying to talk to Azure Blob Storage? Let us get some more information:
```
2023-06-08 20:22:28.209 JVMEntryway: INFO: Yielding control to the QoB Job.
2023-06-08 20:22:28.210 Tokens: INFO: tokens found for namespaces {pr-12991-default-ei61x1qrplk9}
2023-06-08 20:22:28.210 tls: INFO: ssl config file found at /batch/240df6ec091b49d8a6062b781e6700d3/secrets/ssl-config/ssl-config.json
2023-06-08 20:24:30.873 : INFO: RegionPool: initialized for thread 10: pool-2-thread-2
2023-06-08 20:24:31.016 ServiceBackend$: INFO: executing: iaU8w3QX6Y6hRrB9jczJds ArrayBuffer((), is.hail.utils.SerializableHadoopConfiguration@5ad5cde6)
2023-06-08 20:24:31.153 : INFO: JSON: JObject(List((name,JString(TableFilterPartitions)), (parts,JArray(List(JInt(0)))), (keep,JBool(true))))
```
I need this functionality for compacting the billing tables. I'm not sure how we get this privilege to `batch`, `batch-admin` etc., but this should be fine for me to at least measure any performance gains.
Reverts hail-is#12981 until a time when we can confidently commit these changes to main.
…#13151) This avoids both an extra pass to find the failing rows and an extra pass when the globals depend on non-global data. In particular, this pipeline would run the column aggregations four times (IMO, at most twice is OK):
```
mt = hl.utils.range_matrix_table(2, 2)
mt = mt.annotate_entries(x = mt.row_idx * mt.col_idx)
mt = mt.annotate_cols(mean_x = hl.agg.mean(mt.x))
mt = mt.annotate_entries(x = mt.x - mt.mean_x)
mt._same(mt)
```
…s#13176) WHEEL needs to be visible in the subprocess environment for `assert_pypi_has_room.py`
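The shape of the fix, sketched (the wheel path and exact invocation are illustrative): the variable has to be placed in the child's environment explicitly when the subprocess is not inheriting the parent's.
```python
import os
import subprocess

wheel_path = 'dist/hail-0.2.118-py3-none-any.whl'  # illustrative path

subprocess.run(
    ['python3', 'assert_pypi_has_room.py'],
    env={**os.environ, 'WHEEL': wheel_path},  # WHEEL visible in the subprocess
    check=True,
)
```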
LGTM, as a tidy merge of upstream 0.2.118 on top of the previous merges.
There may be alternative approaches in Git that carry our local changes on top of upstream's vendor branch more clearly and with perhaps less work (e.g. I tend to prefer a local branch of additions that gets rebased periodically), but that is a conversation for another day.
Stacks on: #298