Improve service checker for gnmi/telemetry container #19153

Merged
merged 3 commits into from
Jun 18, 2024

ganglyu
Contributor

@ganglyu ganglyu commented May 31, 2024

Why I did it

Fix #19081
The gnmi container has replaced the telemetry container, but the telemetry feature is still enabled after an upgrade.
The service_checker script reads the feature table and checks whether each enabled container is running. In this case telemetry is enabled, but there is no telemetry container.
Disabling telemetry in the feature table is difficult for both warm reboot and cold reboot, because we would need to check the docker image in db migrator and minigraph.py.
When we use warm reboot to upgrade from 202305 to 202311, config_db still contains the telemetry configuration, and we can't simply remove it.

Work item tracking
  • Microsoft ADO (number only): 28495305

How I did it

I modified the service_checker script:
If the docker-sonic-telemetry image exists, check the telemetry container.
If there is no docker-sonic-telemetry image, check the gnmi container instead.
If neither the docker-sonic-telemetry image nor the docker-sonic-gnmi image exists, do not check telemetry.
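
The three rules above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name `resolve_telemetry_container` and the `image_exists` callback are hypothetical, and only the image names and container names come from the description.

```python
from typing import Callable, Optional


def resolve_telemetry_container(image_exists: Callable[[str], bool]) -> Optional[str]:
    """Return the container to check for the 'telemetry' feature, or None.

    image_exists is a hypothetical callback that reports whether a docker
    image with the given name is present on the device.
    """
    # Rule 1: the telemetry image is present, so check the telemetry container.
    if image_exists("docker-sonic-telemetry"):
        return "telemetry"
    # Rule 2: no telemetry image, so check the gnmi container instead.
    if image_exists("docker-sonic-gnmi"):
        return "gnmi"
    # Rule 3: neither image is present, so do not check telemetry at all.
    return None


# Usage example with a plain set standing in for the local image store.
installed = {"docker-sonic-gnmi", "docker-database"}
print(resolve_telemetry_container(lambda name: name in installed))
```

Passing the image lookup in as a callback keeps the decision logic testable without a docker daemon.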

How to verify it

Run the unit tests and the end-to-end tests.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@ganglyu ganglyu requested a review from lguohan as a code owner May 31, 2024 06:09
@ganglyu ganglyu changed the title Fix health checker Improve service checker for gnmi/telemetry container May 31, 2024
@liushilongbuaa
Contributor

/azpw ms_conflict

@dprital
Collaborator

dprital commented Jun 4, 2024

@qiluo-msft , can you please review this PR? And if it is approved, can you please merge?

@liat-grozovik
Collaborator

@qiluo-msft kind reminder to review. It should go to 202311 and 202405.

"""
try:
    DOCKER_CLIENT = docker.DockerClient(base_url='unix://var/run/docker.sock')
    DOCKER_CLIENT.images.get(image_name)
Contributor

@StormLiangMS StormLiangMS Jun 12, 2024

Is there an assumption that the images.get function will raise an exception when the image is not there? Should we also check what it returns, rather than depend on the logic of the images.get function? For example, if images.get were changed to return null on a search miss, check_docker_image would still return True. @ganglyu

Contributor Author

@ganglyu ganglyu Jun 12, 2024

This function raises an exception if the image does not exist:

https://docker-py.readthedocs.io/en/stable/images.html#docker.models.images.ImageCollection.get
Raises:
docker.errors.ImageNotFound – If the image does not exist.
docker.errors.APIError – If the server returns an error.
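
For illustration, a defensive variant that covers both the documented exceptions and the reviewer's hypothetical "returns nothing on a miss" case might look like the sketch below. It is not the merged implementation: the `client` parameter is an assumption made so the sketch runs without a docker daemon, and the broad `except Exception` stands in for the narrower `docker.errors.ImageNotFound` and `docker.errors.APIError` listed above.

```python
def check_docker_image(image_name, client):
    """Return True only if the client reports a usable image object.

    client is assumed to expose client.images.get(name), as docker-py's
    DockerClient does; in the service checker it would be something like
    docker.DockerClient(base_url='unix://var/run/docker.sock').
    """
    try:
        image = client.images.get(image_name)
    except Exception:
        # docker-py documents ImageNotFound and APIError here; any failure
        # to look the image up is treated as "image not available".
        return False
    # Guard against a lookup that returns None instead of raising.
    return image is not None


# Usage example with a stand-in client (hypothetical, for demonstration only).
class FakeImages:
    def get(self, name):
        if name != "docker-sonic-gnmi":
            raise LookupError(name)
        return object()


class FakeClient:
    images = FakeImages()


print(check_docker_image("docker-sonic-gnmi", FakeClient()))
print(check_docker_image("docker-sonic-telemetry", FakeClient()))
```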

Contributor

@StormLiangMS StormLiangMS left a comment

This change looks more like a workaround, but I'm OK with it as a way to deal with the difficulty of the warm reboot case.

@ganglyu could you update the description with details about the difficulty of the warm reboot case?

@StormLiangMS
Contributor

@qiluo-msft to take a look before merge.

@dprital
Collaborator

dprital commented Jun 13, 2024

@qiluo-msft , can you please approve and merge?

@StormLiangMS StormLiangMS merged commit 1fba66c into sonic-net:master Jun 18, 2024
20 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jun 18, 2024
@mssonicbld
Collaborator

Cherry-pick PR to 202405: #19332

mssonicbld pushed a commit that referenced this pull request Jun 18, 2024
@dprital
Collaborator

dprital commented Jun 19, 2024

@yxieca , can you please cherry-pick to 202311?

@yxieca
Contributor

yxieca commented Jun 20, 2024

@ganglyu can you provide the ADO number?

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jun 20, 2024
@mssonicbld
Collaborator

Cherry-pick PR to 202311: #19360

@ganglyu
Contributor Author

ganglyu commented Jun 20, 2024

@ganglyu can you provide the ADO number?

@yxieca ADO is 28495305

yxieca pushed a commit that referenced this pull request Jun 21, 2024

Co-authored-by: ganglv <88995770+ganglyu@users.noreply.github.com>
arun1355492 pushed a commit to arun1355492/sonic-buildimage that referenced this pull request Jul 26, 2024
Development

Successfully merging this pull request may close these issues.

[telemetry] | After upgrade from 202305 to 202311 telemetry still in config_db and makes system not ready
8 participants