No allocation found while updateFileSystem #649
Yesterday, we experienced one case of this error. Case: Poseidon Logs
What is interesting about the Poseidon logs is that the runner got rescheduled two times. The event is preceded by two deployments, one at … Unfortunately, the deployment caused an outage of our InfluxDB. From … this is exactly the timeframe we would need to check for Nomad events that Poseidon might have missed. Now, InfluxDB contained only the information that Nomad did a …
The second case is more plausible because it explains Poseidon's behavior: Poseidon does not destroy the Runner but thinks it still exists. Checking the Nomad agent logs, it seems that a deployment was running at … (Nomad Agent 1 Logs).
To me, it is not evident why Nomad killed this Allocation and Job. This case might support the considerations of #597 to use the …
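As a minimal sketch of what "not assuming the runner still exists" could look like: the official Nomad Go API (github.com/hashicorp/nomad/api) can be asked whether the allocation backing a runner is still alive. The helper name and allocation ID below are hypothetical, not Poseidon's actual code.

```go
package main

import (
	"fmt"
	"log"

	nomadApi "github.com/hashicorp/nomad/api"
)

// allocationStillRunning asks Nomad whether the allocation backing a runner
// still exists and is running. Hypothetical helper, not Poseidon's actual code.
func allocationStillRunning(client *nomadApi.Client, allocID string) (bool, error) {
	alloc, _, err := client.Allocations().Info(allocID, nil)
	if err != nil {
		// Nomad responds with an error (404) for unknown allocations.
		return false, err
	}
	return alloc.ClientStatus == nomadApi.AllocClientStatusRunning, nil
}

func main() {
	client, err := nomadApi.NewClient(nomadApi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// "example-alloc-id" is a placeholder for a runner's allocation ID.
	running, err := allocationStillRunning(client, "example-alloc-id")
	fmt.Println(running, err)
}
```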
Thanks for digging into this issue and sorry for the inconvenience with the monitoring data. Let me explain what happened regarding the monitoring:
Yes, this was the local deployment I triggered (and forgot 🙈). While I might have been able to provide some further information on the timeline, I don't have any clue about the …
Oh wow, thanks for handling all the operations work here!
Yes; however, our aspiration is to have error-free deployments 🤷 Maybe we just skip this occurrence and handle the next one, where we might have more monitoring data?
Okay, let's skip this occurrence for now. Let's make sure to redeploy more often (during the daytime) when merging the following PRs, so that we increase the likelihood of seeing this issue again.
We didn't notice any new occurrences; closing.
Just a few seconds ago, the issue reoccurred. Most likely, it was triggered by me, since I synchronized all environments in CodeOcean after rebuilding the environments for openHPI/dockerfiles#37. Hence, I am wondering: is this behavior "expected", or can we improve the situation a little?
Great that we've got another occurrence to observe. We have 3 users/runners causing the 19 errors when trying to update the FS.
First, we see that the users tried for multiple minutes to run their executions, always failing. It would be better if CodeOcean requested a fresh runner once it has repeatedly received an Internal Server Error while copying files (see the sketch below). Regarding the Nomad events, the three runners behave the same. Nomad Events
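To illustrate the suggested client behavior, here is a sketch in Go (for consistency with the other examples; CodeOcean's real strategy is Ruby and lives in lib/runner/strategy/poseidon.rb). The `Client` interface and all names are hypothetical.

```go
package runner

import (
	"fmt"
	"net/http"
)

// Client is a hypothetical interface over Poseidon's HTTP API.
type Client interface {
	UpdateFileSystem(runnerID string, files map[string]string) (statusCode int, err error)
	RequestNewRunner() (runnerID string, err error)
}

// copyWithFreshRunner retries the file copy a few times and, if every attempt
// fails server-side, discards the runner and requests a fresh one instead of
// retrying the same (possibly vanished) allocation for minutes.
func copyWithFreshRunner(c Client, runnerID string, files map[string]string) error {
	const maxAttempts = 3
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status, err := c.UpdateFileSystem(runnerID, files)
		switch {
		case err == nil && status >= 200 && status < 300:
			return nil // copy succeeded
		case err != nil || status >= http.StatusInternalServerError:
			continue // server-side failure: retry, then fall back to a fresh runner
		default:
			return fmt.Errorf("file copy rejected with status %d", status) // e.g. 410 Gone
		}
	}
	fresh, err := c.RequestNewRunner()
	if err != nil {
		return fmt.Errorf("requesting fresh runner: %w", err)
	}
	if status, err := c.UpdateFileSystem(fresh, files); err != nil || status >= http.StatusInternalServerError {
		return fmt.Errorf("file copy failed even on a fresh runner (status %d, err %v)", status, err)
	}
	return nil
}
```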
Only a … We might start to listen to the …
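The reference above is cut off; assuming it points at Nomad's event stream, a minimal subscription sketch with the official Go API would look like the following. Topic selection and handling are illustrative only.

```go
package main

import (
	"context"
	"log"

	nomadApi "github.com/hashicorp/nomad/api"
)

func main() {
	client, err := nomadApi.NewClient(nomadApi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// Subscribe to all Allocation events; index and query options are left at defaults.
	topics := map[nomadApi.Topic][]string{nomadApi.TopicAllocation: {"*"}}
	eventCh, err := client.EventStream().Stream(context.Background(), topics, 0, nil)
	if err != nil {
		log.Fatal(err)
	}
	for events := range eventCh {
		if events.Err != nil {
			log.Printf("event stream error: %v", events.Err)
			continue
		}
		for _, event := range events.Events {
			// Each event carries a Type (e.g. "AllocationUpdated") that a consumer
			// could match on to notice killed or rescheduled allocations.
			log.Printf("topic=%s type=%s key=%s", event.Topic, event.Type, event.Key)
		}
	}
}
```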
Thanks for looking into this already; it really helps to track down potential issues. To add more context, according to the CodeOcean logs: …
I've also verified that CodeOcean and Poseidon use the same time base (at least up to and including the seconds). Therefore, just to clarify: the timings you provided for the three affected runners are timestamps of when the issue occurred (i.e., learners posting files to a non-existent runner), right?
Ah, yes (see my previous comment).
We handle the case where the runner is non-existent and properly reported as such by Poseidon through a 410 error: https://github.com/openHPI/codeocean/blob/6a0c4976baf24b02e659145e912c494fe05b6557/lib/runner/strategy/poseidon.rb#L291-L292 Other status codes (or an internal server error) currently do not trigger a request for a new runner. Do you feel we should catch more errors in CodeOcean and/or handle the error better in Poseidon so that it returns a proper 410 response? Both might make sense, I'd say 🙂. My proposal is presented in openHPI/codeocean#2511, but I would still suggest fixing the Poseidon error, too.
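A sketch of the Poseidon-side half of this fix: return 410 Gone from the file-copy route when the runner is unknown, so clients see a proper signal instead of a 500. Standard-library routing is shown for brevity; Poseidon's real router, route shape, port, and `runnerExists` lookup are assumptions here.

```go
package main

import (
	"log"
	"net/http"
)

// runnerExists is a hypothetical lookup into Poseidon's runner manager.
func runnerExists(runnerID string) bool {
	return false // stub
}

// updateFileSystemHandler answers 410 Gone for unknown runners so that
// clients like CodeOcean can request a fresh runner instead of seeing a 500.
func updateFileSystemHandler(w http.ResponseWriter, r *http.Request) {
	runnerID := r.PathValue("runnerId") // Go 1.22+ pattern routing
	if !runnerExists(runnerID) {
		http.Error(w, "runner not found", http.StatusGone)
		return
	}
	// ... copy the submitted files into the allocation here ...
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	mux := http.NewServeMux()
	// Route shape is illustrative; Poseidon's actual API may differ.
	mux.HandleFunc("PATCH /runners/{runnerId}/files", updateFileSystemHandler)
	log.Fatal(http.ListenAndServe(":7200", mux))
}
```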
Yes
I agree. Thanks for your proposal. In #682 you can find an approach for …
Sentry Issue: POSEIDON-G