revise Datadog trace sampling configuration #10151

Merged

Conversation

dgoffredo
Contributor

@dgoffredo dgoffredo commented Jun 29, 2023

Datadog customers have begun to report that trace sampling is not behaving as expected when using ingress-nginx.

With other Datadog integrations, the default sampling behavior is to consult the Datadog Agent for a sample rate, which changes dynamically. This way, trace volume can be centrally controlled. A customer may optionally specify a fixed sampling rate; but if they don't, the default behavior is to let the Datadog Agent figure it out.

I made a change in Datadog's library last March that changed the meaning of sample_rate in the library's configuration. sample_rate corresponds to DatadogSampleRate in ingress-nginx. Previously, sample_rate was ignored by Datadog's library. This was a bug, but not a severe one, because the concept of "sampling rules" had since superseded what sample_rate used to configure. My change in March repurposed sample_rate to mean "append a sampling rule that matches all traces."

What I overlooked was the fact that ingress-nginx still uses sample_rate, and that it always specifies a value for it in /etc/nginx/opentracing.json, defaulting to 1.0.

This means that Datadog customers, since my change, have no way to say "use the rates calculated by the Datadog Agent." They can set DatadogSampleRate, and if they don't, they still get 1.0 instead of the desired default behavior.
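
The interaction described above can be modeled in a few lines of Go. This is a simplified sketch, not code from dd-opentracing-cpp or ingress-nginx: the function name effectiveSampleRate and its parameters are hypothetical, and real sampling rules carry more state than a single rate. It only illustrates why an always-present sample_rate prevents the Agent's rates from ever being used.

```go
package main

import "fmt"

// effectiveSampleRate is a hypothetical model of the behavior described above.
// configured is the "sample_rate" value from the tracer configuration, or nil
// if the key is absent. agentRate is the rate most recently provided by the
// Datadog Agent.
func effectiveSampleRate(configured *float64, agentRate float64) (rate float64, source string) {
	if configured != nil {
		// A configured sample_rate acts as a rule that matches all traces,
		// so the Agent's rate is never consulted.
		return *configured, "rule"
	}
	// No catch-all rule: fall back to the Agent-provided rate.
	return agentRate, "agent"
}

func main() {
	// ingress-nginx previously always wrote sample_rate, defaulting to 1.0.
	one := 1.0
	rate, source := effectiveSampleRate(&one, 0.1)
	fmt.Println(rate, source)

	// With sample_rate omitted, the Agent's rate applies.
	rate, source = effectiveSampleRate(nil, 0.1)
	fmt.Println(rate, source)
}
```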

The changes that I propose in this PR should have been proposed last March, but I didn't then notice this interaction.

These changes remove the DatadogPrioritySampling flag (which has not done anything for quite a long time), and change the type of DatadogSampleRate from float32 to *float32. This way, the default value is nil rather than 1.0, and we can detect this when constructing /etc/nginx/opentracing.json.
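
As an illustration of why the pointer type helps, here is a minimal sketch of reading an optional rate. The struct tracingConfig and the function parseSampleRate are hypothetical names invented for this example, not the controller's actual code; only the datadog-sample-rate key and the *float32 type come from this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// tracingConfig is a hypothetical stand-in for the controller's configuration.
type tracingConfig struct {
	// DatadogSampleRate is nil when the user did not specify a rate.
	DatadogSampleRate *float32
}

// parseSampleRate converts the raw ConfigMap data into an optional rate.
// An absent key yields nil rather than a default such as 1.0.
func parseSampleRate(data map[string]string) (*float32, error) {
	raw, ok := data["datadog-sample-rate"]
	if !ok {
		return nil, nil
	}
	parsed, err := strconv.ParseFloat(raw, 32)
	if err != nil {
		return nil, fmt.Errorf("invalid datadog-sample-rate %q: %w", raw, err)
	}
	rate := float32(parsed)
	return &rate, nil
}

func main() {
	var cfg tracingConfig

	// No key set: the rate stays nil, meaning "let the Agent decide".
	cfg.DatadogSampleRate, _ = parseSampleRate(map[string]string{})
	fmt.Println("rate set:", cfg.DatadogSampleRate != nil)

	// Explicit rate: the pointer is non-nil and carries the value.
	cfg.DatadogSampleRate, _ = parseSampleRate(map[string]string{"datadog-sample-rate": "0.42"})
	fmt.Println("rate set:", cfg.DatadogSampleRate != nil)
}
```

With a plain float32, the zero value 0.0 would be indistinguishable from an explicit request to sample nothing, which is why a nil pointer is the clearer signal for "unset".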

Conditionally including "sample_rate" in the generated JSON required me to rearrange the code that produces /etc/nginx/opentracing.json. Previously, the file content was chosen from one of multiple text/template templates. Such templates cannot, as far as I know, express conditionally included text based on the value of a pointer. Instead, I use encoding/json in a dedicated function to generate the Datadog JSON.
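
A rough sketch of the encoding/json approach follows. The type datadogConfig and the function datadogOpentracingJSON are hypothetical names for this example and may not match the code in the PR; the JSON keys are the ones that appear in the generated /etc/nginx/opentracing.json. The omitempty option drops the key when the pointer is nil.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// datadogConfig is a hypothetical struct mirroring the keys of the generated file.
type datadogConfig struct {
	AgentHost             string `json:"agent_host"`
	AgentPort             int    `json:"agent_port"`
	Environment           string `json:"environment"`
	OperationNameOverride string `json:"operation_name_override"`
	Service               string `json:"service"`
	// SampleRate is omitted from the output when nil.
	SampleRate *float32 `json:"sample_rate,omitempty"`
}

// datadogOpentracingJSON renders the configuration as a JSON document.
func datadogOpentracingJSON(cfg datadogConfig) (string, error) {
	out, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	cfg := datadogConfig{
		AgentHost:             "172.18.0.2",
		AgentPort:             8126,
		Environment:           "prod",
		OperationNameOverride: "nginx.handle",
		Service:               "nginx",
	}

	// Default: no "sample_rate" key in the output.
	withoutRate, _ := datadogOpentracingJSON(cfg)
	fmt.Println(withoutRate)

	// Explicit rate: "sample_rate" is included.
	rate := float32(0.42)
	cfg.SampleRate = &rate
	withRate, _ := datadogOpentracingJSON(cfg)
	fmt.Println(withRate)
}
```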

This will change the default sampling behavior of the Datadog integration, which is something that I'd like to mention in ingress-nginx's release notes should these changes be merged in their current form.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • CVE Report (Scanner found CVE and adding report)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation only

Which issue/s this PR fixes

The issue was not with ingress-nginx itself, but with a behavior change in ingress-nginx brought on by a change in dd-opentracing-cpp.

The concern was raised in Datadog's support channels.

How Has This Been Tested?

Manual integration testing involved a few Datadog-specific files:

  • agent.dockerfile: Dockerfile for the mock Datadog Agent.
  • agent.js: Node.js HTTP server used as a mock Datadog Agent.
  • agent.yaml: Kubernetes DaemonSet resource that uses the image built by agent.dockerfile.
  • httpbin.yaml: Example HTTP Service and a corresponding Ingress.

Run make dev-env.

Apply the files above (this involves loading the built Docker image into the cluster, similarly to what is done in make dev-env).

Now requests made to the host's port 80 will flow through the NGINX ingress to httpbin.

Edit the ingress controller's Deployment to expose the node's IP address. That's where the mock Datadog Agent will be listening (because it's a DaemonSet, there's an instance on each node):

# ...
        env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
# ...

Edit the ingress controller's ConfigMap to enable Datadog tracing:

apiVersion: v1
data:
  datadog-collector-host: $HOST_IP
  enable-opentracing: "true"
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx

Verify that httpbin's /headers endpoint shows Datadog tracing propagation headers:

ubuntu@dgoffredo-devbox:~$ curl 'http://localhost/headers'
{
  "headers": {
    "Accept": "*/*", 
    "Host": "localhost", 
    "User-Agent": "curl/7.81.0", 
    "X-Datadog-Parent-Id": "4176797220720185106", 
    "X-Datadog-Sampling-Priority": "1", 
    "X-Datadog-Tags": "_dd.p.dm=-0", 
    "X-Datadog-Trace-Id": "356006255528867060", 
    "X-Forwarded-Host": "localhost", 
    "X-Forwarded-Scheme": "http", 
    "X-Scheme": "http"
  }
}
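
The same check could be scripted rather than read by eye. The following is only a sketch: missingDatadogHeaders is a hypothetical helper, and the choice of which three headers to require is an assumption based on the output above. It takes the response body from httpbin's /headers endpoint and reports which propagation headers are absent.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wantHeaders lists the propagation headers this sketch assumes are required.
var wantHeaders = []string{
	"X-Datadog-Trace-Id",
	"X-Datadog-Parent-Id",
	"X-Datadog-Sampling-Priority",
}

// missingDatadogHeaders parses an httpbin /headers response body and returns
// the names of any expected propagation headers that are not present.
func missingDatadogHeaders(body []byte) ([]string, error) {
	var response struct {
		Headers map[string]string `json:"headers"`
	}
	if err := json.Unmarshal(body, &response); err != nil {
		return nil, err
	}
	var missing []string
	for _, name := range wantHeaders {
		if _, ok := response.Headers[name]; !ok {
			missing = append(missing, name)
		}
	}
	return missing, nil
}

func main() {
	// Abbreviated copy of the response shown above.
	body := []byte(`{"headers": {"Host": "localhost",
		"X-Datadog-Parent-Id": "4176797220720185106",
		"X-Datadog-Sampling-Priority": "1",
		"X-Datadog-Trace-Id": "356006255528867060"}}`)

	missing, err := missingDatadogHeaders(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("missing headers:", missing)
}
```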

Verify that the default /etc/nginx/opentracing.json omits "sample_rate":

ubuntu@dgoffredo-devbox:~$ kubectl -n ingress-nginx get pods
NAME                                       READY   STATUS      RESTARTS   AGE
ingress-nginx-admission-create-nk8q9       0/1     Completed   0          2d17h
ingress-nginx-admission-patch-p8qk7        0/1     Completed   0          2d17h
ingress-nginx-controller-56fc94fb8-zbkdg   1/1     Running     0          22h
ubuntu@dgoffredo-devbox:~$ kubectl -n ingress-nginx exec -it ingress-nginx-controller-56fc94fb8-zbkdg -- cat /etc/nginx/opentracing.json | jq
{
  "agent_host": "172.18.0.2",
  "agent_port": 8126,
  "environment": "prod",
  "operation_name_override": "nginx.handle",
  "service": "nginx"
}

Edit the ingress controller's ConfigMap to specify an explicit sampling rate:

apiVersion: v1
kind: ConfigMap
data:
  datadog-collector-host: $HOST_IP
  datadog-sample-rate: "0.42"
  enable-opentracing: "true"

Send another request to httpbin, and check the log output of the mock Datadog Agent. Verify that the tagged sampling rate (_dd.rule_psr) is as configured.

ubuntu@dgoffredo-devbox:~$ kubectl -n datadog logs --follow dd-trace-agent-tnsgj
[
  [
    {
      "name": "nginx.handle",
      "service": "nginx",
      "resource": "/",
      "type": "web",
      "start": 1688055351332243000,
      "duration": 2471860,
      "meta": {
        "http.url": "http://localhost/headers",
        "upstream.name": "upstream_balancer",
        "http.method": "GET",
        "http.status_code": "200",
        "http.host": "localhost",
        "peer.address": "172.18.0.1:50500",
        "nginx.worker_pid": "96",
        "http.status_line": "200 OK",
        "component": "nginx",
        "upstream.address": "10.244.0.8:80",
        "env": "prod",
        "operation": "/"
      },
      "metrics": {},
      "span_id": 1493341588522438100,
      "trace_id": 2107002113598878000,
      "parent_id": 2107002113598878000,
      "error": 0
    },
    {
      "name": "nginx.handle",
      "service": "nginx",
      "resource": "/",
      "type": "web",
      "start": 1688055351332218400,
      "duration": 2515420,
      "meta": {
        "http.url": "http://localhost/headers",
        "upstream.name": "upstream_balancer",
        "http.method": "GET",
        "nginx.worker_pid": "96",
        "_dd.p.dm": "-3",
        "component": "nginx",
        "http.status_line": "200 OK",
        "http.host": "localhost",
        "peer.address": "172.18.0.1:50500",
        "http.status_code": "200",
        "upstream.address": "10.244.0.8:80",
        "env": "prod",
        "operation": "/"
      },
      "metrics": {
        "_dd.rule_psr": 0.42,
        "_sampling_priority_v1": -1
      },
      "span_id": 2107002113598878000,
      "trace_id": 2107002113598878000,
      "parent_id": 0,
      "error": 0
    }
  ]
]

Note that it's 0.42, as expected.
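
For repeated runs, the inspection of the mock Agent's output could be automated. This is a sketch under assumptions: it presumes the logged payload is a JSON array of traces, each an array of spans shaped like the output above, and rulePSR is a hypothetical helper name. It returns the _dd.rule_psr metric of the first root span (parent_id of 0) that has one.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// span holds only the fields of a logged span that this check needs.
type span struct {
	Name     string             `json:"name"`
	ParentID uint64             `json:"parent_id"`
	Metrics  map[string]float64 `json:"metrics"`
}

// rulePSR scans a payload of traces and returns the "_dd.rule_psr" metric
// from the first root span that carries it. The boolean is false if no root
// span has the metric.
func rulePSR(payload []byte) (float64, bool, error) {
	var traces [][]span
	if err := json.Unmarshal(payload, &traces); err != nil {
		return 0, false, err
	}
	for _, trace := range traces {
		for _, s := range trace {
			if s.ParentID != 0 {
				continue // only root spans carry the sampling metrics here
			}
			if rate, ok := s.Metrics["_dd.rule_psr"]; ok {
				return rate, true, nil
			}
		}
	}
	return 0, false, nil
}

func main() {
	// Abbreviated version of the payload shown above.
	payload := []byte(`[[
		{"name": "nginx.handle", "parent_id": 2107002113598878000, "metrics": {}},
		{"name": "nginx.handle", "parent_id": 0,
		 "metrics": {"_dd.rule_psr": 0.42, "_sampling_priority_v1": -1}}
	]]`)

	rate, found, err := rulePSR(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rate, found)
}
```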

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I've read the CONTRIBUTION guide
  • I have added unit and/or e2e tests to cover my changes.
  • All new and existing tests passed.

@netlify

netlify bot commented Jun 29, 2023

Deploy Preview for kubernetes-ingress-nginx canceled.

Name Link
🔨 Latest commit 7712ef2
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-ingress-nginx/deploys/649de8e56a2b53000758f287

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 29, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 29, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jun 29, 2023
@k8s-ci-robot
Contributor

Hi @dgoffredo. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added needs-priority size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 29, 2023
@dgoffredo dgoffredo changed the title David.goffredo/datadog revise sampling revise Datadog trace sampling configuration Jun 29, 2023
Member

@tao12345666333 tao12345666333 left a comment

/ok-to-test

Thanks for your contributions.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 30, 2023
@rikatz
Contributor

rikatz commented Jul 6, 2023

The error was apparently due to a GitHub Actions problem; triggering here.

@rikatz
Contributor

rikatz commented Jul 6, 2023

/lgtm
/approve
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 6, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoffredo, rikatz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 6, 2023
@k8s-ci-robot k8s-ci-robot merged commit 6d55e1f into kubernetes:main Jul 6, 2023
@dgoffredo
Contributor Author

Thanks, @rikatz!

@strongjz
Member

/cherry-pick release-1.8

@k8s-infra-cherrypick-robot
Contributor

@strongjz: new pull request created: #10224

In response to this:

/cherry-pick release-1.8
