Missing template task for some environments #522

Closed
MrSerth opened this issue Dec 8, 2023 · 8 comments
Labels
bug (Something isn't working), deployment (Everything related to our production environment)

Comments

@MrSerth
Member

MrSerth commented Dec 8, 2023

I've just noticed an issue with Poseidon that needs further investigation:

According to CodeOcean, some environments were not found when executing code and had to be synced (Sentry issue CODEOCEAN-11J). Indeed, CodeOcean still shows empty pools for some environments, including the IDs 11, 18, 22, and 33:

[Screenshot, 2023-12-08 22:25:36: CodeOcean dashboard showing empty prewarming pools]

Nomad, however, has enough jobs scheduled:

[Screenshot, 2023-12-08 22:28:20: Nomad overview with enough scheduled jobs]

During my investigation, I already identified that Poseidon is not aware of the environments:

{
    "executionEnvironments": [
        {
            "prewarmingPoolSize": 5,
            "cpuLimit": 20,
            "memoryLimit": 512,
            "image": "openhpi/co_execenv_java:17",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 31
        },
        {
            "prewarmingPoolSize": 5,
            "cpuLimit": 20,
            "memoryLimit": 512,
            "image": "openhpi/co_execenv_java:8-antlr",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 10
        },
        {
            "prewarmingPoolSize": 5,
            "cpuLimit": 20,
            "memoryLimit": 256,
            "image": "openhpi/co_execenv_r:4",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 28
        },
        {
            "prewarmingPoolSize": 5,
            "cpuLimit": 20,
            "memoryLimit": 512,
            "image": "openhpi/co_execenv_julia:1.8",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 34
        },
        {
            "prewarmingPoolSize": 2,
            "cpuLimit": 20,
            "memoryLimit": 256,
            "image": "openhpi/co_execenv_python:3.4",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 14
        },
        {
            "prewarmingPoolSize": 2,
            "cpuLimit": 20,
            "memoryLimit": 256,
            "image": "openhpi/co_execenv_ruby:2.5",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 25
        },
        {
            "prewarmingPoolSize": 15,
            "cpuLimit": 20,
            "memoryLimit": 256,
            "image": "openhpi/co_execenv_python:3.8",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 29
        },
        {
            "prewarmingPoolSize": 2,
            "cpuLimit": 20,
            "memoryLimit": 256,
            "image": "openhpi/co_execenv_python:3.7-ml",
            "networkAccess": false,
            "exposedPorts": null,
            "id": 30
        }
    ]
}

This matches Nomad as well:

[Screenshot, 2023-12-08 22:32:57: Nomad showing the same set of environment jobs]

Hence, Poseidon thinks that everything is okay and is not issuing a Prewarming Pool Alert:

[Screenshot, 2023-12-08 22:24:59: monitoring dashboard without a Prewarming Pool Alert]

Poseidon, however, was not restarted for quite a while:

root@poseidon-terraform:/home/ubuntu# systemctl status poseidon
● poseidon.service - Poseidon
     Loaded: loaded (/etc/systemd/system/poseidon.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2023-12-08 02:50:23 UTC; 18h ago
   Main PID: 623 (poseidon)
      Tasks: 9 (limit: 4647)
     Memory: 291.7M
        CPU: 2h 48min 12ms
     CGroup: /system.slice/poseidon.service
             └─623 /usr/local/bin/poseidon

Probably, the affected environments were lost during the night:

[Screenshot, 2023-12-08 22:35:50: Grafana graph suggesting the environments were lost during the night]
MrSerth added the bug and deployment labels on Dec 8, 2023
@mpass99
Contributor

mpass99 commented Aug 20, 2024

The Sentry issue CODEOCEAN-11J no longer exists. @MrSerth, have you experienced this issue lately, or is there a follow-up Sentry issue?

From your images, we can derive that Poseidon restarted at 2023-12-08 02:50:23 UTC. According to the Grafana image, environment 33 was still active even after the restart. It became inactive at 03:50, at which point the Nomad job was no longer running.

Indeed, when Poseidon is notified about a stopped environment job, it does absolutely nothing.

// In the Nomad event handling, stop events for environment template jobs are ignored:
if nomad.IsEnvironmentTemplateID(runnerID) {
    return false
}

However, the environment is only removed from Poseidon's memory when a deletion is requested via the API. If that did not happen here, I have to assume that the Grafana image uses a different time zone and the environment stopped existing once Poseidon was restarted.

How should we deal with this issue?

  • Rely on Nomad
    • Nomad should be responsible for correctly restarting and rescheduling the environment
    • We have introduced changes in the meantime that might reduce the likelihood of failures (Nomad Restart and Reschedule Policy #611)
    • The effects of a lost environment are not too bad
      • CodeOcean recreates the environment once a user wants to access it
  • Poseidon handling
    • Just Logging
    • Recreation (a sketch of what this could look like follows after this list)
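
A minimal sketch of the Poseidon-handling variant, under stated assumptions: EnvironmentIDFromTemplateJobID, environmentStorage, and Register are hypothetical placeholders, not Poseidon's actual API; only nomad.IsEnvironmentTemplateID appears in the snippet quoted above.

// Hedged sketch: react to a stopped template job instead of ignoring it.
if nomad.IsEnvironmentTemplateID(runnerID) {
    environmentID, err := nomad.EnvironmentIDFromTemplateJobID(runnerID) // hypothetical helper
    if err != nil {
        log.WithError(err).Error("Stopped template job with unparsable ID")
        return false
    }
    // Option 1: just logging, for visibility.
    log.WithField("environmentID", environmentID).Warn("Environment template job was stopped externally")
    // Option 2: recreation from the configuration Poseidon still holds in memory.
    if environment, ok := environmentStorage.Get(environmentID); ok { // hypothetical lookup
        if err := environment.Register(); err != nil { // hypothetical re-registration
            log.WithError(err).Error("Failed to recreate lost environment")
        }
    }
    return false
}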


sentry-io bot commented Aug 20, 2024

Sentry Issue: CODEOCEAN-12Y

@MrSerth
Member Author

MrSerth commented Aug 20, 2024

@MrSerth have you experienced this issue lately or is there a follow-up Sentry issue?

I linked the follow-up Sentry issue CODEOCEAN-12Y, which I triggered manually on our staging system (so it is no real issue). Still, it would be the one where further events are grouped.

Besides that, I don't have another recent occurrence, but I saw a potentially lost environment on CodeOcean's dashboard during the past student courses (i.e., in spring 2024).

If that is not the case, I have to assume that the Grafana image uses a different timezone and the environment stopped existing once Poseidon got restarted.

I am not sure about the time zone (sorry!), but I would assume that the Grafana dashboard is shown in my browser's time zone (which should be CET, +01:00) -- that is at least the case when I visit Grafana today (and I cannot remember it being UTC).

Otherwise, I agree with your assumption: Probably, Poseidon was restarted after the environment job was lost, hence the recovery did not restore the environment at all.

  • The effects of a lost environment are not too bad
    • CodeOcean recreates the environment once a user wants to access it

That's true, of course. Still, if we can prevent this error, it would be even better.

How should we deal with this issue?

Good question. I like that we already improved the situation through the better restart and rescheduling mechanism. Still, jobs might get lost. Hence, if an environment job fails finally (and is no longer retried by Nomad), we could (potentially) restart it, couldn't we? Do we have all the required information to do so? And would you assume restarting is fine, or could this result in duplicated jobs, ...?

As an additional precaution, I tend to add support to Poseidon for this case, too (depending on your answers to the previous questions).

@mpass99
Contributor

mpass99 commented Sep 2, 2024

Hence, if an environment job fails finally (and is no longer retried by Nomad), we could (potentially) restart it, couldn't we? Do we have all the required information to do so?

We could, and we have all the required information to do so. However, it would add new complexity to the Nomad event stream handling and new workflows within Poseidon.

Would you assume restarting is fine

There are three reasons why environment jobs fail finally:

  • Because Poseidon stopped it
    • All good (as long as we are not aware of Poseidon deleting environments without a respective API request).
  • Because Nomad stopped it as defined in the restart and rescheduling policy
    • Likely this is a misconfigured environment (e.g. wrong image)
      • All good
      • In this case, Poseidon should not restart the environment job, because we intentionally do not restart (invalid) environments/jobs infinitely
    • We should validate each event of this issue to catch other causes of stopped environments.
  • Because Nomad stopped it unexpectedly
    • We should monitor if this case happens and hunt down each cause.

All in all, it feels like we would re-implement too much of Nomad's functionality and responsibility. If we want to change the behavior, we might rather adjust the job policy. I would go with adding visibility to this issue, as proposed with #668.

@MrSerth
Member Author

MrSerth commented Sep 2, 2024

Hence, if an environment job fails finally (and is no longer retried by Nomad), we could (potentially) restart it, couldn't we? Do we have all the required information to do so?

We could and we have all the required information to do so. However, it would add some new complexity in the Nomad Event Stream Handling and new workflows within Poseidon.

Okay, I see. Especially given that we could try to advise Nomad to restart the job, let's try this Nomad-based approach first.

There are three cases why environment jobs fail finally

  • Because Poseidon stopped it

That's fine, and I am not aware of any false deletion requests by Poseidon right now.

  • Because Nomad stopped it as defined in the restart and rescheduling policy
  • Likely this is a misconfigured environment (e.g. wrong image)

For the occurrence I created this issue for, a misconfigured environment is highly unlikely. The environment (command, image, ...) doesn't change very frequently, and I cannot remember any real issues with it recently. For sure, the restart and rescheduling limits could have been reached due to network issues, agent shutdowns, etc.

I get that a Poseidon-based solution might not make sense here, but can we get creative and nevertheless adjust the task policy for environments (not for regular runner jobs)? For example, I would assume that a wrong image leads to immediate and countless restarts. Other availability issues, however, would only fail for a limited time (like the duration of one or two restarts) and work before or after that. With a comparatively small interval setting (1h / 30m?), one could try to narrow down those cases. Another idea would be to give environments unlimited restarts and simply check in Poseidon whether a first error occurs shortly after synchronizing the environment (in which case it might be erroneous); a sketch of this idea follows below. Or ... some other idea you might have?
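
A minimal sketch of that early-failure check, under stated assumptions: environmentSyncTimes, recordEnvironmentSync, isLikelyMisconfigured, and the five-minute threshold are all hypothetical, not existing Poseidon code.

package environment

import (
    "sync"
    "time"
)

// Hedged sketch: flag environments whose template job fails shortly after
// synchronization, since that points to an invalid configuration (e.g. a
// wrong image) rather than a transient infrastructure issue.
const earlyFailureThreshold = 5 * time.Minute // assumption: tune as needed

var environmentSyncTimes sync.Map // environment ID -> time of last synchronization

// recordEnvironmentSync would be called whenever an environment is synchronized.
func recordEnvironmentSync(environmentID int) {
    environmentSyncTimes.Store(environmentID, time.Now())
}

// isLikelyMisconfigured would be called from the event handler when a template job fails.
func isLikelyMisconfigured(environmentID int) bool {
    value, ok := environmentSyncTimes.Load(environmentID)
    if !ok {
        return false
    }
    syncedAt := value.(time.Time)
    return time.Since(syncedAt) < earlyFailureThreshold
}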

  • Because Nomad stopped it unexpectedly

Yes, we should monitor these and I like the enhanced logging with #668.

@MrSerth
Member Author

MrSerth commented Sep 3, 2024

Since the changes introduced in #668 were deployed, we saw the first event in #672. Let's double-check this issue, too.

@mpass99
Contributor

mpass99 commented Sep 4, 2024

Because Nomad stopped it as defined in the restart and rescheduling policy

#668 also monitors these cases. We might want to wait for an occurrence to verify that this is still a real issue.

A misconfigured environment is highly unlikely. The environment (command, image, ...) doesn't change very frequently, and I cannot remember any real issues with it recently. For sure, the restart and rescheduling limits could have been reached due to network issues, agent shutdowns, etc.

That's right, I agree.

Can we adjust the task policy for environments (not for regular runner jobs) here?

Yeah, we might override some policies within the Poseidon code (e.g. just for template jobs), roughly as sketched below.
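
A minimal sketch of such an override, assuming Poseidon builds its jobs via the official Nomad Go API (github.com/hashicorp/nomad/api); the function name and the concrete values are assumptions, not Poseidon's actual code.

package environment

import (
    "time"

    nomadApi "github.com/hashicorp/nomad/api"
)

// Hedged sketch: give the template job's task group a stricter restart policy
// than runner jobs, so a broken image fails fast while transient failures are
// still retried within the interval.
func overrideTemplateJobPolicies(taskGroup *nomadApi.TaskGroup) {
    attempts := 3
    interval := 30 * time.Minute // the comparatively small interval discussed above
    delay := 15 * time.Second
    mode := "fail"
    taskGroup.RestartPolicy = &nomadApi.RestartPolicy{
        Attempts: &attempts,
        Interval: &interval,
        Delay:    &delay,
        Mode:     &mode,
    }
}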

All in all, it feels like we would spend too much time and code complexity on something without a big impact and without known recent cases. Let's continue investigating this issue once a real case occurs again.

@MrSerth
Member Author

MrSerth commented Sep 4, 2024

I would still like to see some improvement here and think adjusting some values for environments could be useful. However, given the other pending issues (and especially #612), we decided to close and postpone this issue for now.

MrSerth closed this as completed on Sep 4, 2024