
Pod information can't be found after the pod is restarted #239

Closed
dxsup opened this issue Jun 6, 2022 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@dxsup
Member

dxsup commented Jun 6, 2022

Describe the bug

After the pod is restarted, the pod information in the metric labels becomes empty, while the service and workload information is still present.
Before restarting:
before
After restarting:
after

How to reproduce?

This issue can be reproduced by running the following unit test.

func TestUpdateDelayDelete(t *testing.T) {
	addObjJson := "{\"metadata\": {\"name\": \"testdemo2-5c86748464-26crb\",\"namespace\": \"test-ns\",\"resourceVersion\": \"44895976\"},\"spec\": {\"containers\": [{\"name\": \"testdemo2\",\"ports\": [{\"containerPort\": 9001,\"protocol\": \"TCP\"}]}]},\"status\": {\"phase\": \"Running\",\"podIP\": \"192.168.136.210\",\"containerStatuses\": [{\"name\": \"testdemo2\",\"state\": {\"running\": {\"startedAt\": \"2022-05-25T08:55:36Z\"}},\"lastState\": {},\"ready\": true,\"restartCount\": 5,\"image\": \"\",\"imageID\": \"docker-pullable://10.10.102.213:8443/cloudnevro-test/test-netserver@sha256:6720f648b74ed590f36094a1c7a58b01b6881396409784c17f471ecfe445e3fd\",\"containerID\": \"docker://d505f50edb4e204cf31840e3cb8d26d33f212d4ebef994d0c3fc151d57e17413\",\"started\": true}]}}"
	updateObjJson := "{\"metadata\": {\"name\": \"testdemo2-5c86748464-26crb\",\"namespace\": \"test-ns\",\"resourceVersion\": \"47374698\"},\"spec\": {\"containers\": [{\"name\": \"testdemo2\",\"ports\": [{\"containerPort\": 9001,\"protocol\": \"TCP\"}]}]},\"status\": {\"phase\": \"Running\",\"podIP\": \"192.168.136.210\",\"containerStatuses\": [{\"name\": \"testdemo2\",\"state\": {\"terminated\": {\"exitCode\": 143,\"reason\": \"Error\",\"startedAt\": \"2022-05-25T08:55:36Z\",\"finishedAt\": \"2022-06-06T09:04:12Z\",\"containerID\": \"docker://d505f50edb4e204cf31840e3cb8d26d33f212d4ebef994d0c3fc151d57e17413\"}},\"lastState\": {},\"ready\": false,\"restartCount\": 5,\"image\": \"\",\"imageID\": \"docker-pullable://10.10.102.213:8443/cloudnevro-test/test-netserver@sha256:6720f648b74ed590f36094a1c7a58b01b6881396409784c17f471ecfe445e3fd\",\"containerID\": \"docker://d505f50edb4e204cf31840e3cb8d26d33f212d4ebef994d0c3fc151d57e17413\",\"started\": false}]}}"
	addObj := new(corev1.Pod)
	err := json.Unmarshal([]byte(addObjJson), addObj)
	if err != nil {
		t.Fatalf("error unmarshalling %v", err)
	}
	updateObj := new(corev1.Pod)
	err = json.Unmarshal([]byte(updateObjJson), updateObj)
	if err != nil {
		t.Fatalf("error unmarshalling %v", err)
	}
	podIp := addObj.Status.PodIP
	port := addObj.Spec.Containers[0].Ports[0].ContainerPort
	onAdd(addObj)
	_, ok := MetaDataCache.GetContainerByIpPort(podIp, uint32(port))
	if !ok {
		t.Fatalf("Not found container [%s:%d]", podIp, port)
	} else {
		t.Logf("Found container [%s:%d]", podIp, port)
	}
	stopCh := make(chan struct{})
	go podDeleteLoop(100*time.Millisecond, 500*time.Millisecond, stopCh)
	OnUpdate(addObj, updateObj)
	time.Sleep(600 * time.Millisecond)
	_, ok = MetaDataCache.GetContainerByIpPort(podIp, uint32(port))
	if !ok {
		t.Errorf("Not found container [%s:%d]", podIp, port)
	} else {
		t.Logf("Found container [%s:%d]", podIp, port)
	}
	stopCh <- struct{}{}
}

Cause

When receiving an Update event, we first Delete the corresponding pod and then Add it back. The deletion logic is implemented with a delay, which means the pod is removed from the cache after it has already been re-added. As a result, the pod can no longer be found.

How to fix

I will rewrite the logic in the OnUpdate method of pod_watch.go. Since we only care about the pod's IP address, port, container ID, and labels, I will compare these fields between the old pod and the new one. If they are all identical, I will skip the update event; otherwise, I will update only the corresponding fields.

@dxsup dxsup added the bug Something isn't working label Jun 6, 2022
@dxsup dxsup self-assigned this Jun 6, 2022
@NeJan2020
Collaborator

If they are both identical, I will skip this update event

The ContainerID always changes; only MetaDataCache.GetContainerByIpPort(IP, port) may fail because of this bug.

@dxsup
Member Author

dxsup commented Jun 6, 2022

Right. Then the problem becomes more complicated because we still need to delete the old container ID from the cache. I will handle both cases in the OnUpdate method.

@dxsup
Member Author

dxsup commented Jun 13, 2022

Fixed by #245

@dxsup dxsup closed this as completed Jun 13, 2022