Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add endpoint slice watcher #1528

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

add endpoint slice watcher #1528

wants to merge 2 commits into from

Conversation

pxaws
Copy link
Contributor

@pxaws pxaws commented Feb 5, 2025

Description of the issue

Describe the problem or feature in addition to a link to the issues.

Description of changes

To reduce the overhead of ListPod calls on Kubernetes API server at cloudwatch agent startup, we want to switch to use endpoint slice API. The endpoint slice API has much smaller response size. We will use it to figure out the ip to workload mapping for Kubernetes services.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Describe what tests you have done.

Requirements

Before commit the code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@pxaws pxaws requested a review from a team as a code owner February 5, 2025 00:46
@pxaws pxaws force-pushed the endpointslice-dev branch 2 times, most recently from 2d11975 to 0dada8d Compare February 5, 2025 22:59
Comment on lines +104 to +107
if derivedWorkload == "" {
w.logger.Warn("failed to infer workload name from Pod name", zap.String("podName", podName))
continue
}
Copy link
Contributor

@mxiamxia mxiamxia Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure the chance of failure on this? if it happens, can we directly use service name to resolve the ip? My concern, will we skip the IP resolving if something goes wrong here from deployment name?

Copy link
Contributor Author

@pxaws pxaws Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inferWorkloadName return the podName in the worst case. I think it's very unlikely that the podname will be empty. I think if a pod shows up in endpoint slice, it will at least have some name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we saying that we will resolve RemoteService to the podname in the worst case? Wondering if this is the user experience we want in the trace map since it has high cardinality?

wondering if we can do something like remove the uid...not sure what the edge cases are.

}

// 3) If none of the patterns matched, return the entire podName.
return podName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope this wont happen. LOL. high cardinality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think the chance is low. Even when it happens, we have the metric limiter to mitigate the high cardinality issue.

mxiamxia
mxiamxia previously approved these changes Feb 6, 2025
Comment on lines 98 to 100
if endpoint.TargetRef.Kind == "Pod" {
podName = endpoint.TargetRef.Name
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If inferWorkloadName relies on the podName being set and this is the only place that sets the podName, couldn't we skip processing the endpoint if the TargetRef.Kind isn't Pod?

Suggested change
if endpoint.TargetRef.Kind == "Pod" {
podName = endpoint.TargetRef.Name
}
if endpoint.TargetRef.Kind != "Pod" {
continue
}
podName := endpoint.TargetRef.Name

Copy link
Contributor Author

@pxaws pxaws Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, we should skip if it's not pod. Will update in next commit.

Comment on lines 95 to 96
podName := ""
ns := ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why are these declared outside of the if? They are only used if the endpoint.TargetRef != nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will move it inside the if block.

Comment on lines 135 to 156
// Service reference: "kubernetes.io/service-name"
svcName := slice.Labels["kubernetes.io/service-name"]
if svcName != "" {
svcFull := svcName + "@" + slice.Namespace

// If there's at least one endpoint, derive the workload from it
if len(slice.Endpoints) > 0 {
if endpoint := slice.Endpoints[0]; endpoint.TargetRef != nil {
derived := inferWorkloadName(endpoint.TargetRef.Name)
if derived != "" {
firstWl := derived + "@" + endpoint.TargetRef.Namespace
pairs = append(pairs, kvPair{
key: svcFull,
value: firstWl,
isService: true,
})
}
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Feels like this could be done within the loop so we don't have to derive the workload name again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. I will merge it to the loop.

pairs := w.extractEndpointSliceKeyValuePairs(newSlice)

// Insert them into our ipToWorkload / serviceToWorkload, and track the keys.
var keys []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can preallocate space for this slice since we know the size.

Suggested change
var keys []string
keys := make([]string, 0, len(pairs))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update in next commit.

Comment on lines 98 to 107
podInformer := sharedInformerFactory.Core().V1().Pods().Informer()
err = podInformer.SetTransform(minimizePod)
if err != nil {
logger.Error("failed to minimize Pod objects", zap.Error(err))
}
serviceInformer := sharedInformerFactory.Core().V1().Services().Informer()
err = serviceInformer.SetTransform(minimizeService)
if err != nil {
logger.Error("failed to minimize Service objects", zap.Error(err))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is there a reason this isn't part of newPodWatcher and newServiceWatcher similar to newEndpointSliceWatcher? Would be nice to have them be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to have the pod/service watcher to hide the details of transform inside. For consistency, I will change them accordingly.

Comment on lines +104 to +107
if derivedWorkload == "" {
w.logger.Warn("failed to infer workload name from Pod name", zap.String("podName", podName))
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we saying that we will resolve RemoteService to the podname in the worst case? Wondering if this is the user experience we want in the trace map since it has high cardinality?

wondering if we can do something like remove the uid...not sure what the edge cases are.

if err != nil {
logger.Error("failed to minimize Service objects", zap.Error(err))
}
useListPod := (os.Getenv("USE_LIST_POD") == "true")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use an env variable instead of a config option in the processor?

Copy link
Contributor Author

@pxaws pxaws Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to remove the usage of list pods api eventually. That's why I don't want to make the config change which is more permanent. In the worst case, the old logic is kept here so that customers can quickly switch to it. But I don't want too much publicity for the environmental variable for now :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we saying that we will resolve RemoteService to the podname in the worst case? Wondering if this is the user experience we want in the trace map since it has high cardinality?

We have the metric limiter that will mitigate the high cardinality for metrics. For traces, the high cardinality issue is handled at backend. I remember that the backend will check how many nodes in the trace and cut off if there are too many. From this perspective, I am not very worried about the worst case. Anyway, the chance for the worst case to happen is pretty low in my opinion.

wondering if we can do something like remove the uid...not sure what the edge cases are.

I am not sure what you mean by "remove the uid".

@pxaws pxaws force-pushed the endpointslice-dev branch from 25b3510 to 42f2f53 Compare February 7, 2025 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants