Skip to content
This repository has been archived by the owner on Dec 1, 2018. It is now read-only.

Ignore stats from terminated container if new was created. #1708

Merged

Conversation

loburm
Copy link
Contributor

@loburm loburm commented Jun 30, 2017

It may happen that cadvisor continues to report metrics for container
that was terminated even if a new one had been created. In such cases
heapster randomly selects metric from the terminated or running
container. This commit adds simple heuristic to ignore container that
is older and don't have any cpu usage.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 30, 2017
@k8s-reviewable
Copy link

This change is Reviewable

@loburm
Copy link
Contributor Author

loburm commented Jun 30, 2017

/assign @piosz

@piosz
Copy link
Contributor

piosz commented Jul 4, 2017

cc @yujuhong @dashpole

@@ -194,10 +194,22 @@ func (this *summaryMetricsSource) decodePodStats(metrics map[string]*MetricSet,

for _, container := range pod.Containers {
key := PodContainerKey(ref.Namespace, ref.Name, container.Name)
if _, exist := metrics[key]; exist && containerIsTerminated(&container, metrics[key].CreateTime) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the logic is as follow:

  • if we first encounter an old container, then the new one will just replace it
  • if we first encounter the new, this check will ensure that the old will not replace it

It's not that obvious. Please add a comment explaining this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment was added.

metrics[key] = this.decodeContainerStats(podMetrics.Labels, &container, false)
}
}

func containerIsTerminated(container *stats.ContainerStats, otherStartTime time.Time) bool {
if container.StartTime.Time.Before(otherStartTime) && *container.CPU.UsageNanoCores == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about avoiding double checking of the same condition container.StartTime.Time.Before(otherStartTime) by implementing this in the following way:

if container.StartTime.Time.Before(otherStartTime) {
    if *container.CPU.UsageNanoCores == 0 {
        return true
    }
    glog.Warningf(...)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, like that it's much better.

@loburm loburm force-pushed the avoid_double_container_stats branch from 2fc5700 to 5a2462e Compare July 4, 2017 09:12
@loburm
Copy link
Contributor Author

loburm commented Jul 4, 2017

All comments have been resolved.

Copy link
Contributor

@DirectXMan12 DirectXMan12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one place were we should have logging, but otherwise LGTM 👍

// This check ensures that we are not replacing metrics of running container with metrics of terminated one if
// there are two exactly same containers reported by kubelet.
if _, exist := metrics[key]; exist && containerIsTerminated(&container, metrics[key].CreateTime) {
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please please please always log conditions like this (at high log level, like 8). Otherwise, it's impossible to debug user deployments without sending a patched Heapster container to the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment. I have tried to include all information in the log that could help to debug issue.

Could you check it once again?

@loburm loburm force-pushed the avoid_double_container_stats branch 2 times, most recently from 4f1db8a to 4dd9527 Compare July 6, 2017 08:44
@DirectXMan12
Copy link
Contributor

looks good to me, will merge soon.

@loburm
Copy link
Contributor Author

loburm commented Jul 7, 2017

Thanks a lot!

metrics[key] = this.decodeContainerStats(podMetrics.Labels, &container, false)
}
}

func containerIsTerminated(container *stats.ContainerStats, otherStartTime time.Time) bool {
if container.StartTime.Time.Before(otherStartTime) {
if *container.CPU.UsageNanoCores == 0 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CPU usage may not be accurate. It's possible that the process didn't run for the past time period. Could you use memory instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment, it really makes sense. I think checking of CPU and Memory usage should be good enough heuristic.

I have updated a code and test.

It may happen that cadvisor continues to report metrics for container
that was terminated even if a new one had been created. In such cases
heapster randomly selects metric from the terminated or running
container. This commit adds simple heuristic to ignore container that
is older and don't have any cpu usage.
@loburm loburm force-pushed the avoid_double_container_stats branch from 4dd9527 to b4cb344 Compare July 12, 2017 13:39
@loburm
Copy link
Contributor Author

loburm commented Jul 17, 2017

@piosz do you have any other comments?

@piosz
Copy link
Contributor

piosz commented Jul 20, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 20, 2017
@piosz piosz merged commit 142e549 into kubernetes-retired:master Jul 20, 2017
@loburm
Copy link
Contributor Author

loburm commented Jul 20, 2017

Fix #1697.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants