Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TT-1928] improve startup retry logic #1540

Merged
merged 2 commits into from
Jan 8, 2025

Conversation

Tofel
Copy link
Contributor

@Tofel Tofel commented Jan 8, 2025

now we will not change container name on startup retry, but delete old container instead, as new name creates connection problems for containers that depend on the restarted one (they do not know, they need to connect to it using the new name).

This happened here, where startup retry worked, but CL node didn't find the new container.


Below is a summarization created by an LLM (gpt-4-0125-preview). Be mindful of hallucinations and verify accuracy.

Why

The changes improve the container retry mechanisms in our Docker library by introducing a more robust error handling and cleanup process before retrying container starts. Specifically, it addresses scenarios where a container fails to start due to image or platform compatibility issues, ensuring a fresh attempt is made without reusing the problematic container.

What

  • lib/docker/docker.go
    • Added context, container from github/docker/docker/api/types, and errors from github/pkg/errors imports to support new error handling and container removal logic.
    • Modified NaiveRetrier and LinuxPlatformImageRetrier functions to remove the existing container before retrying to start a new one. This involves setting req.Reuse to false and calling a new removeContainer function if an error occurs during container start.
      • This change ensures that any retry attempt to start a container is done with a fresh environment, preventing the reuse of a potentially corrupted or misconfigured container.
    • Added a new removeContainer function that attempts to remove a container by its name using the Docker provider. It handles errors gracefully, ignoring the error if the container does not exist, ensuring the function only intervenes when necessary to clean up resources.
      • This function supports the retry mechanisms by ensuring that any resources tied to a failed container start attempt are cleaned up before a new attempt is made, thereby increasing the likelihood of a successful container start on subsequent retries.

@cl-sonarqube-production
Copy link

@Tofel Tofel marked this pull request as ready for review January 8, 2025 13:48
@Tofel Tofel requested review from sebawo and a team as code owners January 8, 2025 13:48
@Tofel Tofel merged commit a3021b1 into main Jan 8, 2025
54 of 55 checks passed
@Tofel Tofel deleted the tt-1928-do-not-change-container-name-on-startup-retry branch January 8, 2025 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants