Missing template task for some environments #522

Closed
MrSerth opened this issue Dec 8, 2023 · 8 comments
Labels
bug (Something isn't working), deployment (Everything related to our production environment)

Comments

@MrSerth
Member

MrSerth commented Dec 8, 2023

I've just noticed an issue with Poseidon that needs further investigation:

According to CodeOcean, some environments were not found when executing code and had to be synced (Sentry issue CODEOCEAN-11J). Indeed, CodeOcean still shows empty pools for some environments, including the IDs 11, 18, 22, and 33:

[Screenshot, 2023-12-08 22:25:36: CodeOcean dashboard showing empty prewarming pools]

Nomad, however, has enough jobs scheduled:

[Screenshot, 2023-12-08 22:28:20: Nomad overview with enough scheduled jobs]

During my investigation, I already identified that Poseidon is not aware of the environments:

{
    "executionEnvironments": [
        {
            "prewarmingPoolSize": 5,
            "cpuLimit": 20,
            "memoryLimit": 512,
            "image": "openhpi/co_execenv_java:17",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 31
        },
        {
            "prewarmingPoolSize": 5,
            "cpuLimit": 20,
            "memoryLimit": 512,
            "image": "openhpi/co_execenv_java:8-antlr",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 10
        },
        {
            "prewarmingPoolSize": 5,
            "cpuLimit": 20,
            "memoryLimit": 256,
            "image": "openhpi/co_execenv_r:4",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 28
        },
        {
            "prewarmingPoolSize": 5,
            "cpuLimit": 20,
            "memoryLimit": 512,
            "image": "openhpi/co_execenv_julia:1.8",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 34
        },
        {
            "prewarmingPoolSize": 2,
            "cpuLimit": 20,
            "memoryLimit": 256,
            "image": "openhpi/co_execenv_python:3.4",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 14
        },
        {
            "prewarmingPoolSize": 2,
            "cpuLimit": 20,
            "memoryLimit": 256,
            "image": "openhpi/co_execenv_ruby:2.5",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 25
        },
        {
            "prewarmingPoolSize": 15,
            "cpuLimit": 20,
            "memoryLimit": 256,
            "image": "openhpi/co_execenv_python:3.8",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 29
        },
        {
            "prewarmingPoolSize": 2,
            "cpuLimit": 20,
            "memoryLimit": 256,
            "image": "openhpi/co_execenv_python:3.7-ml",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 30
        }
    ]
}

This matches Nomad as well:

[Screenshot, 2023-12-08 22:32:57: Nomad showing the same set of environment jobs]

Hence, Poseidon thinks that everything is okay and is not issuing a Prewarming Pool Alert:

[Screenshot, 2023-12-08 22:24:59: monitoring dashboard without a Prewarming Pool Alert]

Poseidon, however, was not restarted for quite a while:

root@poseidon-terraform:/home/ubuntu# systemctl status poseidon
● poseidon.service - Poseidon
     Loaded: loaded (/etc/systemd/system/poseidon.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2023-12-08 02:50:23 UTC; 18h ago
   Main PID: 623 (poseidon)
      Tasks: 9 (limit: 4647)
     Memory: 291.7M
        CPU: 2h 48min 12ms
     CGroup: /system.slice/poseidon.service
             └─623 /usr/local/bin/poseidon

Probably, the affected environments were lost during the night:

[Screenshot, 2023-12-08 22:35:50: Grafana graph suggesting the environments were lost during the night]
MrSerth added the bug and deployment labels on Dec 8, 2023
@mpass99
Contributor

mpass99 commented Aug 20, 2024

The Sentry issue CODEOCEAN-11J no longer exists. @MrSerth, have you experienced this issue lately, or is there a follow-up Sentry issue?

From your images, we can derive that Poseidon restarted at 2023-12-08 02:50:23 UTC. According to the Grafana image, environment 33 was still active even after the restart. It became inactive at 03:50, at which point the Nomad job was no longer running.

Indeed, when Poseidon is notified about a stopped environment job, it does absolutely nothing.

// In the Nomad event handling, stop events for environment template jobs are ignored:
if nomad.IsEnvironmentTemplateID(runnerID) {
    return false
}

However, the environment is only removed from Poseidon's memory when a deletion is requested via the API. If that did not happen here, I have to assume that the Grafana image uses a different time zone and the environment stopped existing once Poseidon was restarted.

How should we deal with this issue?

  • Rely on Nomad
    • Nomad should be responsible for correctly restarting and rescheduling the environment
    • We have introduced changes in the meantime that might reduce the likelihood of failures (Nomad Restart and Reschedule Policy #611)
    • The effects of a lost environment are not too bad
      • CodeOcean recreates the environment once a user wants to access it
  • Poseidon handling
    • Just Logging
    • Recreation (a sketch of what this could look like follows after this list)
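
A minimal sketch of the Poseidon-handling variant, under stated assumptions: EnvironmentIDFromTemplateJobID, environmentStorage, and Register are hypothetical placeholders, not Poseidon's actual API; only nomad.IsEnvironmentTemplateID appears in the snippet quoted above.

// Hedged sketch: react to a stopped template job instead of ignoring it.
if nomad.IsEnvironmentTemplateID(runnerID) {
    environmentID, err := nomad.EnvironmentIDFromTemplateJobID(runnerID) // hypothetical helper
    if err != nil {
        log.WithError(err).Error("Stopped template job with unparsable ID")
        return false
    }
    // Option 1: just logging, for visibility.
    log.WithField("environmentID", environmentID).Warn("Environment template job was stopped externally")
    // Option 2: recreation from the configuration Poseidon still holds in memory.
    if environment, ok := environmentStorage.Get(environmentID); ok { // hypothetical lookup
        if err := environment.Register(); err != nil { // hypothetical re-registration
            log.WithError(err).Error("Failed to recreate lost environment")
        }
    }
    return false
}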


sentry-io bot commented Aug 20, 2024

Sentry Issue: CODEOCEAN-12Y

@MrSerth
Member Author

MrSerth commented Aug 20, 2024

@MrSerth have you experienced this issue lately or is there a follow-up Sentry issue?

I linked the follow-up Sentry issue CODEOCEAN-12Y, which I triggered manually on our staging system (so it is no real issue). Still, it would be the one where further events are grouped.

Besides that, I don't have another recent occurrence, but I saw a potentially lost environment on CodeOcean's dashboard during the past student courses (i.e., in spring 2024).

If that is not the case, I have to assume that the Grafana image uses a different timezone and the environment stopped existing once Poseidon got restarted.

I am not sure about the time zone (sorry!), but I would assume that the Grafana dashboard is shown in my browser's time zone (which should be CET, +01:00) -- that is at least the case when I visit Grafana today (and I cannot remember it being UTC).

Otherwise, I agree with your assumption: Probably, Poseidon was restarted after the environment job was lost, hence the recovery did not restore the environment at all.

  • The effects of a lost environment are not too bad
    • CodeOcean recreates the environment once a user wants to access it

That's true, of course. Still, if we can prevent this error, it would be even better.

How should we deal with this issue?

Good question. I like that we already improved the situation through the better restart and rescheduling mechanism. Still, jobs might get lost. Hence, if an environment job fails finally (and is no longer retried by Nomad), we could (potentially) restart it, couldn't we? Do we have all the required information to do so? And would you assume restarting is fine, or could this result in duplicated jobs, ...?

As an additional precaution, I tend to add support to Poseidon for this case, too (depending on your answers to the previous questions).

@mpass99
Contributor

mpass99 commented Sep 2, 2024

Hence, if an environment job fails finally (and is no longer retried by Nomad), we could (potentially) restart it, couldn't we? Do we have all the required information to do so?

We could, and we have all the required information to do so. However, it would add new complexity to the Nomad event stream handling and new workflows within Poseidon.

Would you assume restarting is fine

There are three reasons why environment jobs fail finally:

  • Because Poseidon stopped it
    • All good (as long as we are not aware of Poseidon deleting environments without a respective API request).
  • Because Nomad stopped it as defined in the restart and rescheduling policy
    • Likely this is a misconfigured environment (e.g. wrong image)
      • All good
      • In this case, Poseidon should not restart the environment job, because we intentionally do not restart (invalid) environments/jobs infinitely
    • We should validate each event of this issue to catch other causes of stopped environments.
  • Because Nomad stopped it unexpectedly
    • We should monitor if this case happens and hunt down each cause.

All in all, it feels like we would re-implement too much of Nomad's functionality and responsibility. If we want to change the behavior, we might rather adjust the job policy. I would go with adding visibility to this issue, as proposed with #668.

@MrSerth
Member Author

MrSerth commented Sep 2, 2024

Hence, if an environment job fails finally (and is no longer retried by Nomad), we could (potentially) restart it, couldn't we? Do we have all the required information to do so?

We could and we have all the required information to do so. However, it would add some new complexity in the Nomad Event Stream Handling and new workflows within Poseidon.

Okay, I see. Especially given that we could try to advise Nomad to restart the job, let's try this Nomad-based approach first.

There are three cases why environment jobs fail finally

  • Because Poseidon stopped it

That's fine, and I am not aware of any false deletion requests by Poseidon right now.

  • Because Nomad stopped it as defined in the restart and rescheduling policy
  • Likely this is a misconfigured environment (e.g. wrong image)

For the occurrence I created this issue for, a misconfigured environment is highly unlikely. The environment (command, image, ...) doesn't change very frequently, and I cannot remember any real issues with it recently. For sure, the restart and rescheduling limits could have been reached due to network issues, agent shutdowns, etc.

I get that a Poseidon-based solution might not make sense here, but can we get creative and nevertheless adjust the task policy for environments (not for regular runner jobs)? For example, I would assume that a wrong image leads to immediate and countless restarts. Other availability issues, however, would only fail for a limited time (like the duration of one or two restarts) and work before or after that. With a comparatively small interval setting (1h / 30m?), one could try to narrow down those cases. Another idea would be to give environments unlimited restarts and simply check in Poseidon whether a first error occurs shortly after synchronizing the environment (in which case it might be erroneous); a sketch of this idea follows below. Or ... some other idea you might have?
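
A minimal sketch of that early-failure check, under stated assumptions: environmentSyncTimes, recordEnvironmentSync, isLikelyMisconfigured, and the five-minute threshold are all hypothetical, not existing Poseidon code.

package environment

import (
    "sync"
    "time"
)

// Hedged sketch: flag environments whose template job fails shortly after
// synchronization, since that points to an invalid configuration (e.g. a
// wrong image) rather than a transient infrastructure issue.
const earlyFailureThreshold = 5 * time.Minute // assumption: tune as needed

var environmentSyncTimes sync.Map // environment ID -> time of last synchronization

// recordEnvironmentSync would be called whenever an environment is synchronized.
func recordEnvironmentSync(environmentID int) {
    environmentSyncTimes.Store(environmentID, time.Now())
}

// isLikelyMisconfigured would be called from the event handler when a template job fails.
func isLikelyMisconfigured(environmentID int) bool {
    value, ok := environmentSyncTimes.Load(environmentID)
    if !ok {
        return false
    }
    syncedAt := value.(time.Time)
    return time.Since(syncedAt) < earlyFailureThreshold
}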

  • Because Nomad stopped it unexpectedly

Yes, we should monitor these and I like the enhanced logging with #668.

@MrSerth
Member Author

MrSerth commented Sep 3, 2024

Since the changes introduced in #668 were deployed, we saw the first event in #672. Let's double-check this issue, too.

@mpass99
Contributor

mpass99 commented Sep 4, 2024

Because Nomad stopped it as defined in the restart and rescheduling policy

#668 also monitors these cases. We might want to wait for an occurrence to verify that this is still a real issue.

A misconfigured environment is highly unlikely. The environment (command, image, ...) doesn't change very frequently, and I cannot remember any real issues with it recently. For sure, the restart and rescheduling limits could have been reached due to network issues, agent shutdowns, etc.

That's right, I agree.

Can we adjust the task policy for environments (not for regular runner jobs) here?

Yeah, we might override some policies within the Poseidon code (e.g. just for template jobs), roughly as sketched below.
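
A minimal sketch of such an override, assuming Poseidon builds its jobs via the official Nomad Go API (github.com/hashicorp/nomad/api); the function name and the concrete values are assumptions, not Poseidon's actual code.

package environment

import (
    "time"

    nomadApi "github.com/hashicorp/nomad/api"
)

// Hedged sketch: give the template job's task group a stricter restart policy
// than runner jobs, so a broken image fails fast while transient failures are
// still retried within the interval.
func overrideTemplateJobPolicies(taskGroup *nomadApi.TaskGroup) {
    attempts := 3
    interval := 30 * time.Minute // the comparatively small interval discussed above
    delay := 15 * time.Second
    mode := "fail"
    taskGroup.RestartPolicy = &nomadApi.RestartPolicy{
        Attempts: &attempts,
        Interval: &interval,
        Delay:    &delay,
        Mode:     &mode,
    }
}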

All in all, it feels like we would spend too much time and code complexity on something without a big impact and without known recent cases. Let's continue investigating this issue once a real case occurs again.

@MrSerth
Member Author

MrSerth commented Sep 4, 2024

I would still like to see some improvement here and think adjusting some values for environments could be useful. However, given the other pending issues (and especially #612), we decided to close and postpone this issue for now.

MrSerth closed this as completed on Sep 4, 2024