The allocation stopped as expected #423

Closed
sentry-io bot opened this issue Aug 21, 2023 · 7 comments · Fixed by openHPI/codeocean#1982
Labels
bug Something isn't working

Comments

@sentry-io
sentry-io bot commented Aug 21, 2023

Sentry Issue: CODEOCEAN-HG

An error occurred during code execution while being connected to wss://poseidon-terraform.compute.internal:7200/api/v1/runners/29-0a8486bf-3fbe-11ee-a793-fa163e079f19/websocket?executionID=cf7426a2-980c-4710-93df-36b205d5c137.

the allocation stopped as expected

MrSerth added the bug label Aug 21, 2023
@mpass99
Contributor

mpass99 commented Oct 12, 2023

This error appears in two cases:

  1. A running execution was canceled because CodeOcean deleted the runner.
  2. A running execution was canceled because Nomad completed (or deleted) the allocation, likely because Poseidon requested it to do so.

I do not necessarily see a reason to look into why one of these cases was given. Rather we should agree on how to handle such cases. I would argue that it is correct that Poseidon does not exit the execution silently but reports that it was aborted. Maybe CodeOcean should handle this case?

@MrSerth
Member

MrSerth commented Oct 12, 2023

I am not sure whether I understand the second reason correctly. When does Nomad complete the allocation if not instructed to do so by CodeOcean? The only reason could be to sync an environment (and hence restart all allocations), but this should not happen that often.

Consequently, the first reason seems more plausible. Are there other occurrences, such as when the event stream broke or when the allocation got rescheduled?

> I do not necessarily see a reason to look into why one of these cases was given.

Mh, maybe. Nevertheless, I think we should check the overall system for this case. Sure, CodeOcean could request to delete a runner and one could consider this as an error there. But still, why is CodeOcean deleting the runner if it is still used? I think we should investigate overall to understand those occasions. One reason (I haven't verified) would be that a learner uses two tabs to request two executions with one failing (causing the runner to be deleted). Not sure about that, though.

> Maybe CodeOcean should handle this case?

I currently lack the information about when this error occurs that I would need to give a specific answer to this question, see above.

@MrSerth
Member

MrSerth commented Oct 23, 2023

Let's quickly check why and when the issue is occurring. Then, we can decide which solution to follow in order to resolve it.

@mpass99
Contributor

mpass99 commented Oct 29, 2023

  • Event Oct 29 12:21:16 PM Log
    • The user is spamming executions (33 in total, 14 in the last minute, each rather short at ~400 ms)
    • Two or three executions overlap:
      • CodeOcean requests the deletion of the runner after the third-to-last execution returned.
      • CodeOcean closes the WebSocket connection to the second-to-last execution. (That execution reports the error "the execution did not stop after SIGQUIT" to Sentry.)
      • Both the second-to-last and the last execution get canceled by the deletion of the runner.
      • CodeOcean receives the error "The allocation stopped as expected" from the last execution.
  • Event Oct 29 10:23:39 AM UTC Log
    • CodeOcean starts an execution
    • 5s later CodeOcean updates the file system and starts another execution
    • 300ms later CodeOcean disconnects from the first execution and requests the runner deletion
    • The second execution is canceled by the runner deletion. It reports the error "The allocation stopped as expected" to CodeOcean.
      • Note that the log statement "Execution canceled by context" is missing here.
  • Event Oct 28 06:34:42 PM UTC Log
    • CodeOcean starts an execution
    • 6s later CodeOcean starts another execution
    • 10s later CodeOcean disconnects from the second execution and requests the deletion of the runner.
    • The first execution is canceled by the runner deletion. It reports the error "The allocation stopped as expected" to CodeOcean.

It seems like the first case is the relevant one for this error: CodeOcean requests the runner deletion while an execution is still running.

Side note: In the second and third log, the statement "Execution canceled by context" is missing. Instead, the log contains only the statement "Execution terminated after SIGQUIT". Which statement is logged depends on a (noncritical) race condition between the Nomad Client Library and Poseidon both handling the canceled context.

MrSerth added a commit to openHPI/codeocean that referenced this issue Oct 29, 2023
Previously, the same runner could be used multiple times with different submissions simultaneously. This, however, yielded errors, for example when one submission timed out (causing the runner to be deleted) while another submission was still being executed.

Admin actions, such as the shell, can still be executed regardless of any other code execution.

Fixes CODEOCEAN-HG
Fixes openHPI/poseidon#423
@MrSerth
Member

MrSerth commented Oct 29, 2023

Thanks for digging deeper into the issue. I created a PR to tackle the issue on CodeOcean: openHPI/codeocean#1982. Feel free to have a look there.

> Which statement is logged depends on a (noncritical) race condition between the Nomad Client Library and Poseidon both handling the canceled context.

Do you think we should create a ticket for that, in order to prevent this non-critical race condition?

MrSerth added a commit to openHPI/codeocean that referenced this issue Oct 31, 2023
Previously, the same runner could be used multiple times with different submissions simultaneously. This, however, yielded errors, for example when one submission timed out (causing the runner to be deleted) while another submission was still being executed.

Admin actions, such as the shell, can still be executed regardless of any other code execution.

Fixes CODEOCEAN-HG
Fixes openHPI/poseidon#423
@MrSerth
Member

MrSerth commented Oct 31, 2023

While I've merged (and deployed) the change in CodeOcean, the question about the non-critical race condition is still open. Therefore, I am temporarily re-opening this issue.

@mpass99
Contributor

mpass99 commented Oct 31, 2023

> Do you think we should create a ticket for that, in order to prevent this non-critical race condition?

Yeah 👍 I've done so with #487
