-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[qob] more debugging information on UnexpectedEOFError in Azure #13160
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Alright, I snagged the PR namespace from the CI: ``` pr-13135-default-u5tt5011yt5w ``` Then I went to the Azure [Log Analytics workspace haildev-logs](https://portal.azure.com/#@haildev.onmicrosoft.com/resource/subscriptions/22cd45fe-f996-4c51-af67-ef329d977519/resourceGroups/haildev/providers/Microsoft.OperationalInsights/workspaces/haildev-logs/logs). I went to "Queries", selected "DK's AKS Pod Logs", modified the namespace to the aforementioned one, and added a filter for "hail-az://". ``` let startTimestamp = ago(2h); KubePodInventory | where TimeGenerated > startTimestamp | extend PodName=Name | where Namespace == "pr-13135-default-u5tt5011yt5w" and PodName startswith "batch-driver" | distinct ContainerID, PodName, Namespace | join ( ContainerLog | where TimeGenerated > startTimestamp ) on ContainerID | project TimeGenerated, message=parse_json(LogEntry).message, LogEntry=parse_json(LogEntry) | where message contains "hail-az://" | order by TimeGenerated desc ``` That revealed the batch logs path: ``` EXAMPLE BATCH_JOB_LOGS_PATH hail-az://haildevtest/test/batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1/1/abc123/main/log ``` In the [failing PR test job logs](https://ci.azure.hail.is/batches/3956877/jobs/152), I found the batch id: ``` [2023-06-09 12:43:34] test/hail/methods/test_impex.py::BGENTests::test_import_bgen_row_fields -------------------------------- live log call --------------------------------- INFO batch_client.aioclient:aioclient.py:753 created batch 1148 INFO batch_client.aioclient:aioclient.py:770 updated batch 1148 FAILED ``` I listed the job logs: ``` (base) dking@wm28c-761 hail % az storage blob list --account-name haildevtest --container test --prefix batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/ -o table Name Blob Type Blob Tier Length Content Type Last Modified Snapshot ----------------------------------------------------------------------------- ----------- ----------- -------- ------------------------ ------------------------- ---------- batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/main/log BlockBlob Hot 11724 application/octet-stream 2023-06-09T12:43:36+00:00 batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/main/resource_usage BlockBlob Hot 64 application/octet-stream 2023-06-09T12:43:36+00:00 batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/status.json BlockBlob Hot 1240 application/octet-stream 2023-06-09T12:43:36+00:00 batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/main/log BlockBlob Hot 16626 application/octet-stream 2023-06-09T12:44:22+00:00 batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/main/resource_usage BlockBlob Hot 680 application/octet-stream 2023-06-09T12:44:22+00:00 batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/status.json BlockBlob Hot 4453 application/octet-stream 2023-06-09T12:44:22+00:00 batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/dK3o5ZfXmYSkP5TA/specs BlockBlob Hot 1264 application/octet-stream 2023-06-09T12:43:37+00:00 batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/dK3o5ZfXmYSkP5TA/specs.idx BlockBlob Hot 16 application/octet-stream 2023-06-09T12:43:37+00:00 batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/eOrFpVrN98GBIizi/specs BlockBlob Hot 1264 application/octet-stream 2023-06-09T12:43:34+00:00 batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/eOrFpVrN98GBIizi/specs.idx BlockBlob Hot 16 application/octet-stream 2023-06-09T12:43:34+00:00 ``` I looked at the status: ``` az storage blob download --account-name haildevtest --container test --name batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/status.json | jq '.' | less ``` which contained an error (I un-escaped the string here): ``` JVMUserError: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at is.hail.JVMEntryway.retrieveException(JVMEntryway.java:253) at is.hail.JVMEntryway.finishFutures(JVMEntryway.java:215) at is.hail.JVMEntryway.main(JVMEntryway.java:185) Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at is.hail.JVMEntryway$1.run(JVMEntryway.java:122) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at is.hail.JVMEntryway$1.run(JVMEntryway.java:119) ... 7 more Caused by: is.hail.backend.service.EndOfInputException at is.hail.backend.service.ServiceBackendSocketAPI2.read(ServiceBackend.scala:497) at is.hail.backend.service.ServiceBackendSocketAPI2.readInt(ServiceBackend.scala:510) at is.hail.backend.service.ServiceBackendSocketAPI2.executeOneCommand(ServiceBackend.scala:561) at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$6(ServiceBackend.scala:462) at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$6$adapted(ServiceBackend.scala:461) at is.hail.utils.package$.using(package.scala:635) at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$5(ServiceBackend.scala:461) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at is.hail.services.package$.retryTransientErrors(package.scala:141) at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$4(ServiceBackend.scala:460) at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$4$adapted(ServiceBackend.scala:459) at is.hail.utils.package$.using(package.scala:635) at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$3(ServiceBackend.scala:459) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at is.hail.services.package$.retryTransientErrors(package.scala:141) at is.hail.backend.service.ServiceBackendSocketAPI2$.main(ServiceBackend.scala:458) at is.hail.backend.service.Main$.main(Main.scala:15) at is.hail.backend.service.Main.main(Main.scala) ... 11 more ``` Which suggests that the service backend experienced an EOF somewhere in the first four bytes of the input file. Unfortunately, we automatically cleanup the input and output files, so I can't investigate further. This PR reads the input and output files and stores them in the error message so that next time this happens we get more information.
daniel-goldstein
approved these changes
Jun 9, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Alright, I snagged the PR namespace from the CI:
Then I went to the Azure Log Analytics workspace haildev-logs.
I went to "Queries", selected "DK's AKS Pod Logs", modified the namespace to the aforementioned one, and added a filter for "hail-az://".
That revealed the batch logs path:
In the failing PR test job logs, I found the batch id:
I listed the job logs:
I looked at the status:
which contained an error (I un-escaped the string here):
Which suggests that the service backend experienced an EOF somewhere in the first four bytes of the input file. Unfortunately, we automatically cleanup the input and output files, so I can't investigate further. This PR reads the input and output files and stores them in the error message so that next time this happens we get more information.