
Pod information can't be found after the pod is restarted #239

Closed
dxsup opened this issue Jun 6, 2022 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@dxsup
Member

dxsup commented Jun 6, 2022

Describe the bug

After the pod is restarted, the pod information in the metric labels becomes empty, while the service and workload information is still present.
Before restarting:
before
After restarting:
after

How to reproduce?

This issue can be reproduced by running the following unit test.

func TestUpdateDelayDelete(t *testing.T) {
	addObjJson := "{\"metadata\": {\"name\": \"testdemo2-5c86748464-26crb\",\"namespace\": \"test-ns\",\"resourceVersion\": \"44895976\"},\"spec\": {\"containers\": [{\"name\": \"testdemo2\",\"ports\": [{\"containerPort\": 9001,\"protocol\": \"TCP\"}]}]},\"status\": {\"phase\": \"Running\",\"podIP\": \"192.168.136.210\",\"containerStatuses\": [{\"name\": \"testdemo2\",\"state\": {\"running\": {\"startedAt\": \"2022-05-25T08:55:36Z\"}},\"lastState\": {},\"ready\": true,\"restartCount\": 5,\"image\": \"\",\"imageID\": \"docker-pullable://10.10.102.213:8443/cloudnevro-test/test-netserver@sha256:6720f648b74ed590f36094a1c7a58b01b6881396409784c17f471ecfe445e3fd\",\"containerID\": \"docker://d505f50edb4e204cf31840e3cb8d26d33f212d4ebef994d0c3fc151d57e17413\",\"started\": true}]}}"
	updateObjJson := "{\"metadata\": {\"name\": \"testdemo2-5c86748464-26crb\",\"namespace\": \"test-ns\",\"resourceVersion\": \"47374698\"},\"spec\": {\"containers\": [{\"name\": \"testdemo2\",\"ports\": [{\"containerPort\": 9001,\"protocol\": \"TCP\"}]}]},\"status\": {\"phase\": \"Running\",\"podIP\": \"192.168.136.210\",\"containerStatuses\": [{\"name\": \"testdemo2\",\"state\": {\"terminated\": {\"exitCode\": 143,\"reason\": \"Error\",\"startedAt\": \"2022-05-25T08:55:36Z\",\"finishedAt\": \"2022-06-06T09:04:12Z\",\"containerID\": \"docker://d505f50edb4e204cf31840e3cb8d26d33f212d4ebef994d0c3fc151d57e17413\"}},\"lastState\": {},\"ready\": false,\"restartCount\": 5,\"image\": \"\",\"imageID\": \"docker-pullable://10.10.102.213:8443/cloudnevro-test/test-netserver@sha256:6720f648b74ed590f36094a1c7a58b01b6881396409784c17f471ecfe445e3fd\",\"containerID\": \"docker://d505f50edb4e204cf31840e3cb8d26d33f212d4ebef994d0c3fc151d57e17413\",\"started\": false}]}}"
	addObj := new(corev1.Pod)
	err := json.Unmarshal([]byte(addObjJson), addObj)
	if err != nil {
		t.Fatalf("error unmarshalling %v", err)
	}
	updateObj := new(corev1.Pod)
	err = json.Unmarshal([]byte(updateObjJson), updateObj)
	if err != nil {
		t.Fatalf("error unmarshalling %v", err)
	}
	podIp := addObj.Status.PodIP
	port := addObj.Spec.Containers[0].Ports[0].ContainerPort
	onAdd(addObj)
	_, ok := MetaDataCache.GetContainerByIpPort(podIp, uint32(port))
	if !ok {
		t.Fatalf("Not found container [%s:%d]", podIp, port)
	} else {
		t.Logf("Found container [%s:%d]", podIp, port)
	}
	stopCh := make(chan struct{})
	go podDeleteLoop(100*time.Millisecond, 500*time.Millisecond, stopCh)
	OnUpdate(addObj, updateObj)
	time.Sleep(600 * time.Millisecond)
	_, ok = MetaDataCache.GetContainerByIpPort(podIp, uint32(port))
	if !ok {
		t.Errorf("Not found container [%s:%d]", podIp, port)
	} else {
		t.Logf("Found container [%s:%d]", podIp, port)
	}
	stopCh <- struct{}{}
}

Cause

When receiving an Update event, we first Delete the corresponding pod and then Add it back. The deletion logic is implemented with a delay, which means the pod is removed from the cache after it has already been re-added. As a result, the pod can no longer be found.

How to fix

I will rewrite the logic in the OnUpdate method of pod_watch.go. Since we only care about the pod's IP address, port, container ID, and labels, I will compare these fields between the old pod and the new one. If they are all identical, I will skip the update event; otherwise, I will update only the corresponding fields.

@dxsup dxsup added the bug Something isn't working label Jun 6, 2022
@dxsup dxsup self-assigned this Jun 6, 2022
@NeJan2020
Collaborator

If they are both identical, I will skip this update event

The ContainerID always changes; only MetaDataCache.GetContainerByIpPort(IP, port) may fail because of this bug.

@dxsup
Member Author

dxsup commented Jun 6, 2022

Right. Then the problem becomes more complicated because we still need to delete the old container ID from the cache. I will handle both cases in the OnUpdate method.

@dxsup
Member Author

dxsup commented Jun 13, 2022

Fixed by #245

@dxsup dxsup closed this as completed Jun 13, 2022