Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenCensusMetricsExporter: GKE resource labels not sent #951

Closed
ocervell opened this issue Jul 28, 2020 · 34 comments · Fixed by #937
Closed

OpenCensusMetricsExporter: GKE resource labels not sent #951

ocervell opened this issue Jul 28, 2020 · 34 comments · Fixed by #937
Labels
bug Something isn't working

Comments

@ocervell
Copy link

ocervell commented Jul 28, 2020

When running on GKE, the metrics exported with the OpenCensusMetricsExporter do not have any GKE resource labels (pod_name, container_name, node_name, etc...). Should those be added to the resource labels before shipping the metric with the exporters ?

@ocervell ocervell added the bug Something isn't working label Jul 28, 2020
@AndrewAXue
Copy link
Contributor

That PR does not support GKE, but this PR GoogleCloudPlatform/opentelemetry-operations-python#24 in the new Google repo does.

@ocervell
Copy link
Author

ocervell commented Jul 28, 2020

Thanks looks good to me. Since this is not part of this repo, will that work without using the CloudMonitoringMetricsExplorer ?

I'm using the OpenCensusExporter as part of an architecture OT SDK --> OT agent --> OT collector pipeline --> StackdriverExporter (Go) and would like this feature in the first part (SDK).

I think those resource attributes could be added by the SDK (explicit setup), and then propagated along the pipeline.

@AndrewAXue
Copy link
Contributor

Resource Detection will work without using CloudMonitoringMetricsExplorer, you'll basically just want lines 24-32 from here https://github.com/GoogleCloudPlatform/opentelemetry-operations-python/blob/master/docs/examples/tools/cloud_resource_detector/resource_detector_metrics.py#L24. And it will pass in the resource info to OpenCensusExporter, but I don't believe it will be in the format that OpenCensusExporter requires them to be since that was built before the resource detector.

@aabmass
Copy link
Member

aabmass commented Jul 28, 2020

The OC exporter will need to be updated to send resource information, which should be simple enough looking at the code.

AFAIK, there is no general way to signal OT exporters which resource keys you care about, but I will double check the spec.

@ocervell
Copy link
Author

ocervell commented Jul 28, 2020

@AndrewAXue thanks, this should be doable. What package does the opencensus.tools. live in ? I'm getting an error saying it can't find it:

Traceback (most recent call last):
  File "/Users/ocervello/.virtualenvs/flask/lib/python3.7/site-packages/flask/cli.py", line 235, in locate_app
    __import__(module_name)
  File "/Users/ocervello/Workspace/dev/gunicorn-opentelemetry-poc/flask-app/app.py", line 28, in <module>
    from opentelemetry.tools.resource_detector import GoogleCloudResourceDetector

ModuleNotFoundError: No module named 'opentelemetry.tools'

@AndrewAXue
Copy link
Contributor

So the resource detectors are actually unreleased, you'll have to manually clone the repo https://github.com/GoogleCloudPlatform/opentelemetry-operations-python and pip install -e the package "opentelemetry-exporter-cloud-trace" that that folder is in.

@ocervell
Copy link
Author

ocervell commented Jul 28, 2020

Okay, I've tried it but the resource still shows global and I'm still not seeing any resource labels.

Here is the code I'm using:

import config
import os
import logging
import time
from flask import Flask
from opentelemetry import trace, metrics
#from opentelemetry.ext.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchExportSpanProcessor
from opentelemetry.sdk.metrics import Counter, MeterProvider
from opentelemetry.sdk.resources import get_aggregated_resources
from opentelemetry.ext.opencensusexporter.metrics_exporter import OpenCensusMetricsExporter
from opentelemetry.ext.opencensusexporter.trace_exporter import OpenCensusSpanExporter
from gke_detector import GoogleCloudResourceDetector
OTEL_AGENT_ENDPOINT = os.environ['OTEL_AGENT_ENDPOINT']
span_exporter = OpenCensusSpanExporter(service_name="flask-app-tutorial",
                                       endpoint=OTEL_AGENT_ENDPOINT)
exporter = OpenCensusMetricsExporter(service_name="flask-app-tutorial",
                                     endpoint=OTEL_AGENT_ENDPOINT)
resources = get_aggregated_resources([GoogleCloudResourceDetector()])

# Metrics
metrics.set_meter_provider(MeterProvider(resource=resources))
meter = metrics.get_meter(__name__, True)
metrics.get_meter_provider().start_pipeline(meter, exporter, 5)

# Traces
trace.set_tracer_provider(TracerProvider(resource=resources))
trace.get_tracer_provider().add_span_processor(
    BatchExportSpanProcessor(span_exporter))

# Custom metrics
pid = os.getpid()
staging_labels = {"environment": "staging", "pid": str(pid)}
requests_counter = meter.create_metric(
    name="flask_app_hello_requests",
    description="Hello requests count",
    unit="1",
    value_type=int,
    metric_type=Counter,
    label_keys=(
        "environment",
        "pid",
    ),
)

# Flask application
app = Flask(__name__)
#FlaskInstrumentor().instrument_app(app)

# Logging setup
gunicorn_logger = logging.getLogger('gunicorn.error')
app.logger.handlers = gunicorn_logger.handlers
app.logger.setLevel(gunicorn_logger.level)
app.logger.info(f'Otel agent endpoint: {OTEL_AGENT_ENDPOINT}')


@app.route("/")
def hello():
    app.logger.info("Received hello request !")
    requests_counter.add(1, staging_labels)
    app.logger.debug("Counter was incremented.")
    return "Hello World!"


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=config.PORT, debug=config.DEBUG_MODE)

@AndrewAXue
Copy link
Contributor

Okay, I've tried it but the resource still shows global and I'm still not seeing any resource labels.

Here is the code I'm using:

import config
import os
import logging
import time
from flask import Flask
from opentelemetry import trace, metrics
#from opentelemetry.ext.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchExportSpanProcessor
from opentelemetry.sdk.metrics import Counter, MeterProvider
from opentelemetry.sdk.resources import get_aggregated_resources
from opentelemetry.ext.opencensusexporter.metrics_exporter import OpenCensusMetricsExporter
from opentelemetry.ext.opencensusexporter.trace_exporter import OpenCensusSpanExporter
from gke_detector import GoogleCloudResourceDetector
OTEL_AGENT_ENDPOINT = os.environ['OTEL_AGENT_ENDPOINT']
span_exporter = OpenCensusSpanExporter(service_name="flask-app-tutorial",
                                       endpoint=OTEL_AGENT_ENDPOINT)
exporter = OpenCensusMetricsExporter(service_name="flask-app-tutorial",
                                     endpoint=OTEL_AGENT_ENDPOINT)
resources = get_aggregated_resources([GoogleCloudResourceDetector()])

# Metrics
metrics.set_meter_provider(MeterProvider(resource=resources))
meter = metrics.get_meter(__name__, True)
metrics.get_meter_provider().start_pipeline(meter, exporter, 5)

# Traces
trace.set_tracer_provider(TracerProvider(resource=resources))
trace.get_tracer_provider().add_span_processor(
    BatchExportSpanProcessor(span_exporter))

# Custom metrics
pid = os.getpid()
staging_labels = {"environment": "staging", "pid": str(pid)}
requests_counter = meter.create_metric(
    name="flask_app_hello_requests",
    description="Hello requests count",
    unit="1",
    value_type=int,
    metric_type=Counter,
    label_keys=(
        "environment",
        "pid",
    ),
)

# Flask application
app = Flask(__name__)
#FlaskInstrumentor().instrument_app(app)

# Logging setup
gunicorn_logger = logging.getLogger('gunicorn.error')
app.logger.handlers = gunicorn_logger.handlers
app.logger.setLevel(gunicorn_logger.level)
app.logger.info(f'Otel agent endpoint: {OTEL_AGENT_ENDPOINT}')


@app.route("/")
def hello():
    app.logger.info("Received hello request !")
    requests_counter.add(1, staging_labels)
    app.logger.debug("Counter was incremented.")
    return "Hello World!"


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=config.PORT, debug=config.DEBUG_MODE)

Looking at the OpenCensusSpanExporter code https://github.com/open-telemetry/opentelemetry-python/blob/master/ext/opentelemetry-ext-opencensusexporter/src/opentelemetry/ext/opencensusexporter/trace_exporter/__init__.py it doesn't look like it supports exporting info about Resources at all. So while you're passing that resources info to the exporter, its just not doing anything with it.

@aabmass
Copy link
Member

aabmass commented Jul 28, 2020

@ocervell It has to be added to the exporter as well. I'm adding into #679

Olivier, is it possible for the ocagent to detect the GKE resource for you?

@ocervell
Copy link
Author

ocervell commented Jul 28, 2020

@aabmass I've tried the k8sprocessor (https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/master/processor/k8sprocessor) without much luck yet.

There is another way to get GKE labels using Prometheus scrape config. But imo the best would be for autodetection to be baked in the SDK so that there is not much relabeling to do using OpenTelemetry processors (hard to configure / test for now), and consistency in resource labels could be achieved this way.

@aabmass
Copy link
Member

aabmass commented Jul 28, 2020

@ocervell, updated #937 to send the resources. It should work along with Andrew's resource detector. However, I'm not sure if the Collector will convert these into the correct GKE monitored resource types when it writes to Cloud Monitoring.

@nilebox, is there a way to get the OT Collector to send the correct GKE resource when running as an agent?

@nilebox
Copy link
Member

nilebox commented Jul 28, 2020

@nilebox, is there a way to get the OT Collector to send the correct GKE resource when running as an agent?

There is a Resource Detection Processor which currently supports detection of GCE, and can be extended to detect GKE as well.

@james-bebbington wrote this processor, and he is also the author of the Automatic Resource Detection proposal, so he would be the best person to suggest the recommended approach.

@ocervell
Copy link
Author

I've opened an issue for adding GKE detection to the Resource Detection Processor in the collector.

@aabmass I will try #937 shortly and give you some feedback.

@ocervell ocervell changed the title OpenCensusMetricsExporter: GKE labels not sent with OpenCensusMetricsExporter OpenCensusMetricsExporter: GKE resource labels not sent with OpenCensusMetricsExporter Jul 29, 2020
@ocervell
Copy link
Author

Still seeing the resource type set as global in Cloud Monitoring, did not seem to change anything. Code I'm using is available in gunicorn-opentelemetry-poc/ot-agent-collector-dev.

@aabmass
Copy link
Member

aabmass commented Jul 29, 2020

I dug into the config a little bit. Looks like the stackdriver exporter internally uses the OpenCensus Go exporter. The resource mapping seems to happen here: https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/blob/v0.13.2/resource.go#L185

It looks at the generic labels for k8s (which aren't currently sent at all), but Andrew pointed out it is not checking for gcp.resource_type": "gke_container".

I ran your example to see if the python exporter is at least sending the resources correctly, and it looks right to me (on my cloudtop which is just GCE):

resource {
  labels {
    key: "cloud.account.id"
    value: "some_account"
  }
  labels {
    key: "cloud.provider"
    value: "gcp"
  }
  labels {
    key: "cloud.zone"
    value: "us-east1-b"
  }
  labels {
    key: "gcp.resource_type"
    value: "gce_instance"
  }
  labels {
    key: "host.id"
    value: "1319926662905081598"
  }
}

Rather than updating the OpenCensus exporter, I think the Collector's stackdriver exporter should be switched over to use the Go SDK GCP exporters in GoogleCloudPlatform/opentelemetry-operations-go, which are actively developed. However, even that SDK doesn't currently send the right GKE Monitored Resource from what I can tell.

@ocervell ocervell changed the title OpenCensusMetricsExporter: GKE resource labels not sent with OpenCensusMetricsExporter OpenCensusMetricsExporter: GKE resource labels not sent Jul 29, 2020
@ocervell
Copy link
Author

Yep, I also have gotten it to work. However, on GKE the resource type is set by default to gce_instance.

>>> from gke_detector import GoogleCloudResourceDetector
>>> c = GoogleCloudResourceDetector(
... )
>>> c
<gke_detector.GoogleCloudResourceDetector object at 0x7fc4282747b8>
>>> c.detect()
<opentelemetry.sdk.resources.Resource object at 0x7fc428ea2358>
>>> c.gcp_resources
{'cloud.account.id': 'rnm-datadog-sd', 'cloud.provider': 'gcp', 'cloud.zone': 'europe-west1-c', 'host.id': 107782034669724043, 'gcp.resource_type': 'gce_instance'}

I found out I had to set an env variable CONTAINER_NAME in my k8s YAML in order for it to set the right type gke_container:

            - name: CONTAINER_NAME  # required by OpenTelemetry GKE detector
              value: flask-app-tutorial

Additionally, it would be great to have the GCE/GKE resource detectors available as part of the SDK, rather than in the Cloud Monitoring exporter. I've copied the detector code to my app because the extra dependency is not worth it.

@aabmass
Copy link
Member

aabmass commented Jul 29, 2020

I found out I had to set an env variable CONTAINER_NAME in my k8s YAML in order for it to set the right type gke_container:

@AndrewAXue is that intended? I thought this was set automatically by k8s

Additionally, it would be great to have the GCE/GKE resource detectors available as part of the SDK, rather than in the Cloud Monitoring exporter.

We are waiting for the spec; if OT community decides this is ok, we will include it!

I've copied the detector code to my app because the extra dependency is not worth it.

We are planning a single package for "tools" that will have the GCP propagator and resource detector. Is is actually that much of a burden to have one extra package?

@AndrewAXue
Copy link
Contributor

AndrewAXue commented Jul 29, 2020

Yes that's intended. We rely on the user manually passing in some info via the Downward API including populating CONTAINER_NAME. It'll look something like this

apiVersion: "apps/v1"
kind: "Deployment"
metadata:
  name: "food-fish"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: "food-find"
  template:
    metadata:
      labels:
        app: "food-find"
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: "food-finder"
        image: "gcr.io/aaxue-gke/food-finder:v1"
        imagePullPolicy: "Always"
        env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CONTAINER_NAME
              value: "food-finder"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
        ports:
        - containerPort: 8080
      hostNetwork: true
      dnsPolicy: Default

I realized that this was poorly documenting it and will be making a PR adding this info to our documents.

@ocervell
Copy link
Author

ocervell commented Jul 29, 2020

We are planning a single package for "tools" that will have the GCP propagator and resource detector. Is is actually that much of a burden to have one extra package?

No, this is fine as long as the package just does this, and doesn't have too many other dependencies.

I realized that this was poorly documenting it and will be making a PR adding this info to our documents.

Thanks, is there no way to get container name from the container directly ?
I've added the CONTAINER_NAME variable but my metrics are still tagged with the resource gce_instance. Still investigating why it doesn't set it to gke_container, will check back later !

@AndrewAXue
Copy link
Contributor

AndrewAXue commented Jul 29, 2020

You'll also need to populate NAMESPACE and POD_NAME like in the example above. No there isn't a way to get container name from the container directly :/ that's why we rely on the Downward API. If it doesn't work, can you try printing the output of
from gke_detector import get_gke_resources
print(get_gke_resources())

@ocervell
Copy link
Author

ocervell commented Jul 29, 2020

@AndrewAXue okay it's working from the container after setting the CONTAINER_NAME variable:

>>> print(get_gke_resources())
{'cloud.account.id': 'rnm-datadog-sd', 'cloud.provider': 'gcp', 'cloud.zone': 'europe-west1-c', 'k8s.cluster.name': 'demo-flask', 'k8s.nam
espace.name': 'default', 'k8s.pod.name': 'flask-app-tutorial-7ccc4d757b-tqqdg', 'host.id': 107782034669724043, 'container.name': 'flask-ap
p-tutorial', 'gcp.resource_type': 'gke_container'}

Let's add the needed env variables to our documentation.

Weirdly, the end metrics resource type in Stackdriver is still gce_instance and I end up losing the GKE labels anyway:

 'resource': {'labels': {'instance_id': '107782034669724043',
                         'project_id': 'rnm-datadog-sd',
                         'zone': 'europe-west1-c'},
              'type': 'gce_instance'},

so I guess there is another problem down-the-line:gke_container resource is replaced by gce_instance in the collector, maybe because of what @aabmass pointed out with the OpenCensus Go package ?

OT Collector logs:

InstrumentationLibraryMetrics #0
ResourceMetrics #1
Resource labels:
     -> service.name: STRING(flask-app-tutorial)
     -> opencensus.starttime: STRING(2020-07-29T21:35:20.026248448Z)
     -> host.hostname: STRING(flask-app-tutorial-7ccc4d757b-k7dsn)
     -> opencensus.pid: INT(8)
     -> telemetry.sdk.version: STRING(0.12.dev0)
     -> opencensus.exporterversion: STRING(0.11.dev0)
     -> telemetry.sdk.language: STRING(PYTHON)

     -> host.id: STRING(107782034669724043)
     -> cloud.zone: STRING(europe-west1-c)
     -> cloud.account.id: STRING(rnm-datadog-sd)
     -> k8s.pod.name: STRING(flask-app-tutorial-7ccc4d757b-k7dsn)
     -> gcp.resource_type: STRING(k8s_container)
     -> container.name: STRING(flask-app-tutorial)
     -> k8s.cluster.name: STRING(demo-flask)
     -> cloud.provider: STRING(gcp)
     -> k8s.namespace.name: STRING(default)
InstrumentationLibraryMetrics #0
Metric #0
Descriptor:
     -> Name: flask_app_hello_requests
     -> Description: Hello requests count
     -> Unit: 1
     -> Type: MONOTONIC_INT64
Int64DataPoints #0
Data point labels:
     -> environment: staging
     -> pid: 8
StartTime: 1596058520026277000
Timestamp: 1596058910715795712
Value: 3075

@nilebox
Copy link
Member

nilebox commented Jul 29, 2020

Weirdly, the end metrics resource type in Stackdriver is still gce_instance and I end up losing the GKE labels anyway

@ocervell Where do you process gcp.resource_type: k8s_container? Note that the resource type should be set to container (not k8s_container) initially, and Stackdriver exporter (in OpenTelemetry Collector) will recognize and convert it to k8s_container.

See https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/blob/366afe7b1d81a1faaa8c58e725aaeb77344b343e/resource.go#L195-L196

@aabmass
Copy link
Member

aabmass commented Jul 29, 2020

@nilebox, thanks I need to update my PR to send container in that case.

In that code though, I think that is still the incorrect monitored resource type. Going off what Andrew has used in the python exporter, should be:

- result.Type = "k8s_container"
+ result.Type = "gke_container" 

Does it make sense to update the opencensus-go-exporter-stackdriver package with this? @james-bebbington suggested writing a MapResource function in the collector instead. Thoughts?

@nilebox
Copy link
Member

nilebox commented Jul 30, 2020

@aabmass Pretty sure that the container resource type in Go is being used in quite a few projects using OpenCensus and OpenTelemetry Collector, so that would be a breaking change. Changing MapResource implementation in the OT collector would also be a breaking change for GKE Metrics Agent, for example.

FWIW, the Python implementation for OpenCensus seems to be using k8s_container: https://github.com/census-instrumentation/opencensus-python/blob/f4d8ec36cada8b385821a16795f27fe322694a1a/opencensus/common/monitored_resource/monitored_resource.py#L24

I'm surprised that this inconsistency wasn't a problem in OpenCensus before.

For backward compatibility, we may just recognize all variations (container, k8s_container and gke_container) and translate them to k8s_container in opencensus-go-exporter-stackdriver exporter. @james-bebbington WDYT?

@aabmass
Copy link
Member

aabmass commented Jul 30, 2020

For backward compatibility, we may just recognize all variations (container, k8s_container and gke_container) and translate them to k8s_container in opencensus-go-exporter-stackdriver exporter. @\james-bebbington WDYT?

@nilebox Checking the OpenCensus resource semantic conventions for possible values of type, it looks like we should be setting type: k8s? It appears container is for plain docker images and it says "this resource can be merged with a deployment service resource, a compute instance resource, and an environment resource." Then for merging, the spec says "already set labels or type fields MUST NOT be overwritten unless they are empty string."

Presumably if the docker container gets detected first, you'd get type: container, then merging with k8s resource would NOT overwrite type, even though type: k8s would be more descriptive? I will change my PR to send "container" if k8s for now. But it seems wrong that if you were running just a docker image, the Collector's exporter would send a k8s_container monitored resource.

@nilebox
Copy link
Member

nilebox commented Jul 30, 2020

it looks like we should be setting type: k8s?

k8s is for Pod, see https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/blob/366afe7b1d81a1faaa8c58e725aaeb77344b343e/resource.go#L198-L199

It appears container is for plain docker images

A single Pod may run multiple Docker containers (see PodSpec), so if a metric is emitted for specific container, it may make sense to use container there even if it's running as part of a Pod.

@nilebox
Copy link
Member

nilebox commented Jul 30, 2020

As an example, you may be interested in emitting CPU/RAM utilization metrics for each container separately rather than for Pod as a whole.

In Kubernetes, resource limits are specified per container, but Pods are allocated based on the sum of all requested resources, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container

So when one of the containers has a memory leak, for example, you would be interested to track it separately from other containers in the Pod.

@james-bebbington
Copy link
Member

james-bebbington commented Jul 30, 2020

I'll try to summarise the above to hopefully clear things up.

Due to the nature of the progressive changes to the OC/OT data models, we have gotten to the point where there are several ways to represent a Resource:

No Model Example of how to define "Container" resource type
1 OpenCensus resource model (includes OC Type) Type: container
2 OpenTelemetry resource model (no Type; Type needs to be inferred from which attribute exist1) Existence of any attribute with prefix container.
32 OpenTelemetry resource model with attribute gcp.resource_type
(hack currently added in Python only to represent resource type explicitly in OpenTelemetry)
Attribute: gcp.resource_type = k8s_container
4 OpenTelemetry resource model with attribute opencensus.resourcetype
(hack to convert between OC & OT models used internally by the Collector)
Attribute: opencensus.resourcetype = container
5 Cloud Monitoring resource model (includes CM Type) Type: k8s_container3

1 This is so OpenTelemetry can have a more expressive hierarchical set of resource info instead of being limited to a specific "type" (e.g. process info + container info + pod info + k8s info + cloud info).
2 I would suggest we consider removing (3) to simplify this situation slightly if possible (I recall @AndrewAXue having some reason why this might not be possible).
3 I'm not really sure what the distinction is between gke_container & k8s_container - looks like we have agreed to ignore gke_container for now since it was not supported by OC (this seems reasonable but I also recommend confirming with Cloud Monitoring team)

We then need to support at least the following paths of exporting data to Cloud Monitoring4:

  1. OC SDK -> (possibly via Collector) -> Cloud Monitoring, i.e. 1 -> 5 (in the OC exporter)
  2. OT SDK -> Cloud Monitoring, i.e 2 -> 5 (in the OT exporter)
  3. OT SDK -> Collector -> OC (using current OC exporter) -> Cloud Monitoring ==> 2 -> 1 (where type is not set correctly) -> 5 (after we complete migration of Collector exporter to OT, this will become the same as the point above)

Points 1 & 2 are already implemented (although note this is inconsistent due to the gcp.resource_type hack). We need to add code in the Collector exporter wrapper that converts from OT type (based on the existence of attributes or on gcp.resource_type depending on which approach we prefer) to OC type. The OC exporter will then not need to be changed. A similar alternative would be to create a MapResource function in the collector to map directly from OT resource -> CM resource and bypass the OC exporter resource mapping logic as Aaron mentioned above. In either case this would need to be backwards compatible (support the OC type being set directly in OC as the first option, or derived from OT attributes as a fallback).

The other option is to just fast forward to using the new OT exporter in the Collector (and verify this works in a way that is backwards compatible with OC) but given this code path is used heavily in production we need to be careful with this rollout and imo should wait until the OTLP proto is finalized.

4 There are other possible pipelines such as when the Collector translates between OC & OT models during its internal pipeline, but I'm pretty sure this will be handled without issue


TLDR Recommendations:

  • Remove gcp.resource_type from Python detector
  • Change Python exporter to infer Cloud Monitoring resource type from existence of attributes
  • Update the Collector Stackdriver exporter/wrapper to have a backwards compatible way of mapping from OT resource attributes to Cloud Monitoring resource type (in addition to from OC type)

@nilebox
Copy link
Member

nilebox commented Jul 30, 2020

Fix on the OpenTelemetry Collector side: open-telemetry/opentelemetry-collector#1462

@AndrewAXue
Copy link
Contributor

Thanks for the detailed writeup, I think it made it clear to everyone, even those without context on the Collectors (me) what the issue is. I'm a proponent of Model 3 for 2 reasons:

  1. In the OT SDK, at the moment we detect resource info, we know what resource type it is. Following Model 2 would essentially mean ignoring that resource type info, and then manually extracting it later on.
  2. Ambiguity. Dataflow Job and k8s_cluster are both supported by Cloud Monitoring and have very similar labels. Since our resource detectors follow the OT semantic conversions we can't name attributes whatever we want, making it likely we'll use the same set of attribute names for multiple resource types, if not now somewhere down the road as resource support increases.

So it seems that Models 2, 3 are are only used in the OT SDK -> Cloud Monitoring and OT SDK -> Collector routes. The former supports model 3, so it seems the only problem is building a good mapping from 3 -> 1 for the collector. We can keep Model 2 as a backup in case Model 3 isn't there for whatever reason, but I think the risk of ambiguity should make us want to avoid it as much as possible.

@aabmass
Copy link
Member

aabmass commented Jul 30, 2020

A single Pod may run multiple Docker containers (see PodSpec), so if a metric is emitted for specific container, it may make sense to use container there even if it's running as part of a Pod.

@nilebox that makes sense, but I don't see how type container necessarily means k8s. The stackdriver exporter would set result.Type = "k8s_container" in several cases that aren't GKE, but still send type container? E.g. Containers on GCE, maybe other GCP environments using Docker

Also, looking at open-telemetry/opentelemetry-collector#1462, should I just continue to send a blank type field in this PR and let the Collector infer it instead?

@nilebox
Copy link
Member

nilebox commented Jul 30, 2020

The stackdriver exporter would set result.Type = "k8s_container" in several cases that aren't GKE, but still send type container?

It will try and fail to do so, because it also requires all k8s_container labels to be present, including e.g. k8s.pod.name, which will surely be missing for GCE container.

So once it fails to treat resource as k8s_container, Stackdriver exporter will just fallback to the default "global" type.

should I just continue to send a blank type field

Left a comment, you should still set resource type for OpenCensus exporter, but it's fine to omit it in OTLP exporter thanks to the change in collector.

@james-bebbington
Copy link
Member

So it seems that Models 2, 3 are are only used in the OT SDK -> Cloud Monitoring and OT SDK -> Collector routes. The former supports model 3, so it seems the only problem is building a good mapping from 3 -> 1 for the collector. We can keep Model 2 as a backup in case Model 3 isn't there for whatever reason, but I think the risk of ambiguity should make us want to avoid it as much as possible.

The problem, from my point of view, is that Model 3 is not defined in the OT spec. We could consider proposing a change to the spec so there is a standard resource attribute key called type to support vendors with a resource model similar to Cloud Monitoring, and then use that. But otherwise we are adding potential confusion - consider for example users who wants to set the resource type manually using the OT conventions and have this work across Vendors (that is kind of the goal of OT).

I also don't think there is as much ambiguity as you think. I don't believe there are any cases where we cannot infer the correct Cloud Monitoring resource type from OT semantic conventions, and if there were it would just imply we need to add additional semantic conventions (doesn't have to be a 'type' property).

@ocervell
Copy link
Author

ocervell commented Aug 4, 2020

I'm not sure if this will help, but here is what the stackdriver-prometheus-sidecar uses: https://github.com/Stackdriver/stackdriver-prometheus-sidecar/blob/master/retrieval/resource_map.go.

I haven't had issues ingesting metrics using Prometheus + the sidecar (resource type is set correctly all the time). Maybe we could inspire ourselves from this detection logic ?

srikanthccv pushed a commit to srikanthccv/opentelemetry-python that referenced this issue Nov 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
5 participants