Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[qob] more debugging information on UnexpectedEOFError in Azure #13160

Merged
merged 2 commits into from
Jun 9, 2023

Conversation

danking
Copy link
Contributor

@danking danking commented Jun 9, 2023

Alright, I snagged the PR namespace from the CI:

pr-13135-default-u5tt5011yt5w

Then I went to the Azure Log Analytics workspace haildev-logs.

I went to "Queries", selected "DK's AKS Pod Logs", modified the namespace to the aforementioned one, and added a filter for "hail-az://".

let startTimestamp = ago(2h);
KubePodInventory
| where TimeGenerated > startTimestamp
| extend PodName=Name
| where Namespace == "pr-13135-default-u5tt5011yt5w" and PodName startswith "batch-driver"
| distinct ContainerID, PodName, Namespace
| join (
   ContainerLog
    | where TimeGenerated > startTimestamp
) on ContainerID
| project TimeGenerated, message=parse_json(LogEntry).message, LogEntry=parse_json(LogEntry)
| where message contains "hail-az://"
| order by TimeGenerated desc

That revealed the batch logs path:

EXAMPLE BATCH_JOB_LOGS_PATH hail-az://haildevtest/test/batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1/1/abc123/main/log

In the failing PR test job logs, I found the batch id:

[2023-06-09 12:43:34] test/hail/methods/test_impex.py::BGENTests::test_import_bgen_row_fields
-------------------------------- live log call ---------------------------------
INFO     batch_client.aioclient:aioclient.py:753 created batch 1148

INFO     batch_client.aioclient:aioclient.py:770 updated batch 1148

FAILED

I listed the job logs:

(base) dking@wm28c-761 hail % az storage blob list --account-name haildevtest --container test --prefix batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/ -o table
Name                                                                           Blob Type    Blob Tier    Length    Content Type              Last Modified              Snapshot
-----------------------------------------------------------------------------  -----------  -----------  --------  ------------------------  -------------------------  ----------
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/main/log                 BlockBlob    Hot          11724     application/octet-stream  2023-06-09T12:43:36+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/main/resource_usage      BlockBlob    Hot          64        application/octet-stream  2023-06-09T12:43:36+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/status.json              BlockBlob    Hot          1240      application/octet-stream  2023-06-09T12:43:36+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/main/log                 BlockBlob    Hot          16626     application/octet-stream  2023-06-09T12:44:22+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/main/resource_usage      BlockBlob    Hot          680       application/octet-stream  2023-06-09T12:44:22+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/status.json              BlockBlob    Hot          4453      application/octet-stream  2023-06-09T12:44:22+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/dK3o5ZfXmYSkP5TA/specs      BlockBlob    Hot          1264      application/octet-stream  2023-06-09T12:43:37+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/dK3o5ZfXmYSkP5TA/specs.idx  BlockBlob    Hot          16        application/octet-stream  2023-06-09T12:43:37+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/eOrFpVrN98GBIizi/specs      BlockBlob    Hot          1264      application/octet-stream  2023-06-09T12:43:34+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/eOrFpVrN98GBIizi/specs.idx  BlockBlob    Hot          16        application/octet-stream  2023-06-09T12:43:34+00:00

I looked at the status:

az storage blob download --account-name haildevtest --container test --name batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/status.json | jq '.' | less

which contained an error (I un-escaped the string here):

JVMUserError: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at is.hail.JVMEntryway.retrieveException(JVMEntryway.java:253)
	at is.hail.JVMEntryway.finishFutures(JVMEntryway.java:215)
	at is.hail.JVMEntryway.main(JVMEntryway.java:185)
Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:122)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:119)
	... 7 more
Caused by: is.hail.backend.service.EndOfInputException
	at is.hail.backend.service.ServiceBackendSocketAPI2.read(ServiceBackend.scala:497)
	at is.hail.backend.service.ServiceBackendSocketAPI2.readInt(ServiceBackend.scala:510)
	at is.hail.backend.service.ServiceBackendSocketAPI2.executeOneCommand(ServiceBackend.scala:561)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$6(ServiceBackend.scala:462)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$6$adapted(ServiceBackend.scala:461)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$5(ServiceBackend.scala:461)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.services.package$.retryTransientErrors(package.scala:141)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$4(ServiceBackend.scala:460)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$4$adapted(ServiceBackend.scala:459)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$3(ServiceBackend.scala:459)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.services.package$.retryTransientErrors(package.scala:141)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.main(ServiceBackend.scala:458)
	at is.hail.backend.service.Main$.main(Main.scala:15)
	at is.hail.backend.service.Main.main(Main.scala)
	... 11 more

Which suggests that the service backend experienced an EOF somewhere in the first four bytes of the input file. Unfortunately, we automatically cleanup the input and output files, so I can't investigate further. This PR reads the input and output files and stores them in the error message so that next time this happens we get more information.

Alright, I snagged the PR namespace from the CI:

```
pr-13135-default-u5tt5011yt5w
```

Then I went to the Azure [Log Analytics workspace haildev-logs](https://portal.azure.com/#@haildev.onmicrosoft.com/resource/subscriptions/22cd45fe-f996-4c51-af67-ef329d977519/resourceGroups/haildev/providers/Microsoft.OperationalInsights/workspaces/haildev-logs/logs).

I went to "Queries", selected "DK's AKS Pod Logs", modified the namespace to the aforementioned one, and added a filter for "hail-az://".

```
let startTimestamp = ago(2h);
KubePodInventory
| where TimeGenerated > startTimestamp
| extend PodName=Name
| where Namespace == "pr-13135-default-u5tt5011yt5w" and PodName startswith "batch-driver"
| distinct ContainerID, PodName, Namespace
| join (
   ContainerLog
    | where TimeGenerated > startTimestamp
) on ContainerID
| project TimeGenerated, message=parse_json(LogEntry).message, LogEntry=parse_json(LogEntry)
| where message contains "hail-az://"
| order by TimeGenerated desc
```

That revealed the batch logs path:

```
EXAMPLE BATCH_JOB_LOGS_PATH hail-az://haildevtest/test/batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1/1/abc123/main/log
```

In the [failing PR test job logs](https://ci.azure.hail.is/batches/3956877/jobs/152), I found the batch id:

```
[2023-06-09 12:43:34] test/hail/methods/test_impex.py::BGENTests::test_import_bgen_row_fields
-------------------------------- live log call ---------------------------------
INFO     batch_client.aioclient:aioclient.py:753 created batch 1148

INFO     batch_client.aioclient:aioclient.py:770 updated batch 1148

FAILED
```

I listed the job logs:

```
(base) dking@wm28c-761 hail % az storage blob list --account-name haildevtest --container test --prefix batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/ -o table
Name                                                                           Blob Type    Blob Tier    Length    Content Type              Last Modified              Snapshot
-----------------------------------------------------------------------------  -----------  -----------  --------  ------------------------  -------------------------  ----------
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/main/log                 BlockBlob    Hot          11724     application/octet-stream  2023-06-09T12:43:36+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/main/resource_usage      BlockBlob    Hot          64        application/octet-stream  2023-06-09T12:43:36+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/1/i4CoSh/status.json              BlockBlob    Hot          1240      application/octet-stream  2023-06-09T12:43:36+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/main/log                 BlockBlob    Hot          16626     application/octet-stream  2023-06-09T12:44:22+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/main/resource_usage      BlockBlob    Hot          680       application/octet-stream  2023-06-09T12:44:22+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/status.json              BlockBlob    Hot          4453      application/octet-stream  2023-06-09T12:44:22+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/dK3o5ZfXmYSkP5TA/specs      BlockBlob    Hot          1264      application/octet-stream  2023-06-09T12:43:37+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/dK3o5ZfXmYSkP5TA/specs.idx  BlockBlob    Hot          16        application/octet-stream  2023-06-09T12:43:37+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/eOrFpVrN98GBIizi/specs      BlockBlob    Hot          1264      application/octet-stream  2023-06-09T12:43:34+00:00
batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/bunch/eOrFpVrN98GBIizi/specs.idx  BlockBlob    Hot          16        application/octet-stream  2023-06-09T12:43:34+00:00
```

I looked at the status:

```
az storage blob download --account-name haildevtest --container test --name batch/logs/we5a79QlczzdluUx8kT2Vh/batch/1148/2/31Owgv/status.json | jq '.' | less
```

which contained an error (I un-escaped the string here):

```
JVMUserError: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at is.hail.JVMEntryway.retrieveException(JVMEntryway.java:253)
	at is.hail.JVMEntryway.finishFutures(JVMEntryway.java:215)
	at is.hail.JVMEntryway.main(JVMEntryway.java:185)
Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:122)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:119)
	... 7 more
Caused by: is.hail.backend.service.EndOfInputException
	at is.hail.backend.service.ServiceBackendSocketAPI2.read(ServiceBackend.scala:497)
	at is.hail.backend.service.ServiceBackendSocketAPI2.readInt(ServiceBackend.scala:510)
	at is.hail.backend.service.ServiceBackendSocketAPI2.executeOneCommand(ServiceBackend.scala:561)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$6(ServiceBackend.scala:462)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$6$adapted(ServiceBackend.scala:461)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$5(ServiceBackend.scala:461)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.services.package$.retryTransientErrors(package.scala:141)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$4(ServiceBackend.scala:460)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$4$adapted(ServiceBackend.scala:459)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.$anonfun$main$3(ServiceBackend.scala:459)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.services.package$.retryTransientErrors(package.scala:141)
	at is.hail.backend.service.ServiceBackendSocketAPI2$.main(ServiceBackend.scala:458)
	at is.hail.backend.service.Main$.main(Main.scala:15)
	at is.hail.backend.service.Main.main(Main.scala)
	... 11 more
```

Which suggests that the service backend experienced an EOF somewhere in the first four bytes of the
input file. Unfortunately, we automatically cleanup the input and output files, so I can't
investigate further. This PR reads the input and output files and stores them in the error message
so that next time this happens we get more information.
@danking danking merged commit 39dc3c0 into hail-is:main Jun 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants