OpenCensusMetricsExporter: GKE resource labels not sent #951
Comments
That PR does not support GKE, but this PR (GoogleCloudPlatform/opentelemetry-operations-python#24) in the new Google repo does.
Thanks, looks good to me. Since this is not part of this repo, will that work without using the `CloudMonitoringMetricsExporter`? I'm using the `OpenCensusMetricsExporter`. I think those resource attributes could be added by the SDK (explicit setup) and then propagated along the pipeline.
Resource Detection will work without using the `CloudMonitoringMetricsExporter`; you'll basically just want lines 24-32 from here: https://github.com/GoogleCloudPlatform/opentelemetry-operations-python/blob/master/docs/examples/tools/cloud_resource_detector/resource_detector_metrics.py#L24. It will pass the resource info to the OpenCensusExporter, but I don't believe it will be in the format the OpenCensusExporter requires, since that exporter was built before the resource detector existed.
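For reference, a minimal sketch of that setup without the Cloud Monitoring exporter (the module and class names are taken from the code posted later in this thread; the detector still has to be installed from the cloned repo):

```python
from opentelemetry.sdk.resources import get_aggregated_resources
from gke_detector import GoogleCloudResourceDetector  # local copy of the detector

# Runs the detector(s) and merges the results into a single Resource that can
# then be passed to the TracerProvider / MeterProvider.
resources = get_aggregated_resources([GoogleCloudResourceDetector()])
```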
The OC exporter will need to be updated to send resource information, which should be simple enough looking at the code. AFAIK, there is no general way to signal OT exporters which resource keys you care about, but I will double check the spec.
@AndrewAXue thanks, this should be doable. What package does the resource detector come from?
So the resource detectors are actually unreleased; you'll have to manually clone the repo https://github.com/GoogleCloudPlatform/opentelemetry-operations-python and install the package from source.
Okay, I've tried it but the resource still shows the default type. Here is the code I'm using:

```python
import config
import os
import logging
import time

from flask import Flask

from opentelemetry import trace, metrics
#from opentelemetry.ext.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchExportSpanProcessor
from opentelemetry.sdk.metrics import Counter, MeterProvider
from opentelemetry.sdk.resources import get_aggregated_resources
from opentelemetry.ext.opencensusexporter.metrics_exporter import OpenCensusMetricsExporter
from opentelemetry.ext.opencensusexporter.trace_exporter import OpenCensusSpanExporter
from gke_detector import GoogleCloudResourceDetector

OTEL_AGENT_ENDPOINT = os.environ['OTEL_AGENT_ENDPOINT']

span_exporter = OpenCensusSpanExporter(service_name="flask-app-tutorial",
                                       endpoint=OTEL_AGENT_ENDPOINT)
exporter = OpenCensusMetricsExporter(service_name="flask-app-tutorial",
                                     endpoint=OTEL_AGENT_ENDPOINT)
resources = get_aggregated_resources([GoogleCloudResourceDetector()])

# Metrics
metrics.set_meter_provider(MeterProvider(resource=resources))
meter = metrics.get_meter(__name__, True)
metrics.get_meter_provider().start_pipeline(meter, exporter, 5)

# Traces
trace.set_tracer_provider(TracerProvider(resource=resources))
trace.get_tracer_provider().add_span_processor(
    BatchExportSpanProcessor(span_exporter))

# Custom metrics
pid = os.getpid()
staging_labels = {"environment": "staging", "pid": str(pid)}
requests_counter = meter.create_metric(
    name="flask_app_hello_requests",
    description="Hello requests count",
    unit="1",
    value_type=int,
    metric_type=Counter,
    label_keys=(
        "environment",
        "pid",
    ),
)

# Flask application
app = Flask(__name__)
#FlaskInstrumentor().instrument_app(app)

# Logging setup
gunicorn_logger = logging.getLogger('gunicorn.error')
app.logger.handlers = gunicorn_logger.handlers
app.logger.setLevel(gunicorn_logger.level)
app.logger.info(f'Otel agent endpoint: {OTEL_AGENT_ENDPOINT}')

@app.route("/")
def hello():
    app.logger.info("Received hello request !")
    requests_counter.add(1, staging_labels)
    app.logger.debug("Counter was incremented.")
    return "Hello World!"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=config.PORT, debug=config.DEBUG_MODE)
```
Looking at the OpenCensusSpanExporter code (https://github.com/open-telemetry/opentelemetry-python/blob/master/ext/opentelemetry-ext-opencensusexporter/src/opentelemetry/ext/opencensusexporter/trace_exporter/__init__.py), it doesn't look like it supports exporting info about Resources at all. So while you're passing that resource info to the exporter, it's just not doing anything with it.
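A rough, hedged sketch of what forwarding resource info could look like (this is not what the exporter currently does; it assumes the opencensus-proto package, whose Resource message carries a "type" string plus a string-to-string labels map):

```python
from opencensus.proto.resource.v1.resource_pb2 import Resource as OCResource

def translate_resource(resource_labels, resource_type=""):
    """Build an OC proto Resource from a plain dict of OT resource attributes."""
    return OCResource(
        type=resource_type,  # OT resources carry no explicit type; left blank here
        labels={key: str(value) for key, value in resource_labels.items()},
    )

# Hypothetical usage with the attributes a GKE detector might produce:
oc_resource = translate_resource(
    {"cloud.provider": "gcp", "k8s.pod.name": "my-pod", "container.name": "app"}
)
```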
@aabmass I've tried it. There is another way to get GKE labels using a Prometheus scrape config, but IMO the best option would be for autodetection to be baked into the SDK, so that there is not much relabeling to do with OpenTelemetry processors (hard to configure / test for now), and consistency in resource labels could be achieved this way.
@ocervell, updated #937 to send the resources. It should work along with Andrew's resource detector. However, I'm not sure if the Collector will convert these into the correct GKE monitored resource types when it writes to Cloud Monitoring. @nilebox, is there a way to get the OT Collector to send the correct GKE resource when running as an agent?
There is a Resource Detection Processor which currently supports detection of GCE, and can be extended to detect GKE as well. @james-bebbington wrote this processor, and he is also the author of the Automatic Resource Detection proposal, so he would be the best person to suggest the recommended approach.
Still seeing the wrong resource type being set.
I dug into the config a little bit. It looks like the stackdriver exporter internally uses the OpenCensus Go exporter, and the resource mapping seems to happen here: https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/blob/v0.13.2/resource.go#L185. It looks at the generic k8s labels (which aren't currently sent at all), but Andrew pointed out it is not checking for the `gke_container` type that the Python exporter uses.

I ran your example to see if the python exporter is at least sending the resources correctly, and it looks right to me (on my cloudtop, which is just GCE).
Rather than updating the OpenCensus exporter, I think the Collector's stackdriver exporter should be switched over to use the Go SDK GCP exporters in GoogleCloudPlatform/opentelemetry-operations-go, which are actively developed. However, even that SDK doesn't currently send the right GKE Monitored Resource from what I can tell.
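To illustrate the fall-through behaviour described above, here is a rough Python-flavoured sketch of that kind of mapping (the label keys are simplified assumptions, not the exporter's actual keys; the real logic lives in resource.go). The point is that a resource sent with type `gke_container`, or with the k8s labels missing, never matches the k8s branch and falls back to a generic resource:

```python
def to_monitored_resource(resource_type, labels):
    """Map an OpenCensus-style resource to a Cloud Monitoring monitored resource."""
    k8s_keys = ("cluster_name", "namespace_name", "pod_name", "container_name")
    if resource_type == "k8s_container" and all(k in labels for k in k8s_keys):
        # All required k8s labels present: map straight through.
        return {"type": "k8s_container",
                "labels": {k: labels[k] for k in k8s_keys}}
    if "instance_id" in labels and "zone" in labels:
        # Looks like a GCE VM.
        return {"type": "gce_instance",
                "labels": {k: labels[k] for k in ("instance_id", "zone")}}
    # Nothing recognised: fall back to the generic resource.
    return {"type": "global", "labels": {}}
```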
Yep, I also have gotten it to work. However, on GKE the resource type was still set to the default; I found out I had to set an env variable manually for the detector to pick up the container info.

Additionally, it would be great to have the GCE/GKE resource detectors available as part of the SDK, rather than in the Cloud Monitoring exporter. I've copied the detector code into my app because the extra dependency is not worth it.
@AndrewAXue is that intended? I thought this was set automatically by k8s.

We are waiting for the spec; if the OT community decides this is OK, we will include it!

We are planning a single package for "tools" that will have the GCP propagator and resource detector. Is it actually that much of a burden to have one extra package?
Yes, that's intended. We rely on the user manually passing in some info via the Downward API, including populating the container name.

I realized that this was poorly documented and will be making a PR adding this info to our docs.
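Not from the detector's docs, just a hedged sketch: since the detector relies on values injected into the pod spec, a small startup check like this (with hypothetical variable names) makes missing configuration obvious early:

```python
import os

# Hypothetical variable names; check the detector's documentation for the exact
# ones it reads. They are expected to be injected via the Kubernetes Downward
# API (fieldRef) or set explicitly in the container spec.
REQUIRED_VARS = ("NAMESPACE", "CONTAINER_NAME", "POD_NAME")

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    raise RuntimeError(
        "Resource detection needs these env variables set: " + ", ".join(missing)
    )
```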
No, this is fine as long as the package just does this, and doesn't have too many other dependencies.
Thanks, is there no way to get the container name from inside the container directly?
You'll also need to populate it via the Downward API.
@AndrewAXue okay, it's working from the container after setting the env variables.

Let's add the needed env variables to our documentation. Weirdly, the resulting metrics resource type in Stackdriver is still wrong, so I guess there is another problem down the line in the OT Collector.
@ocervell Where do you process the resource labels?
@nilebox, thanks, I need to update my PR to send the k8s labels. In that code though, I think that is still the incorrect monitored resource type. Going off what Andrew has used in the python exporter, it should be:

```diff
- result.Type = "k8s_container"
+ result.Type = "gke_container"
```

Does it make sense to update the opencensus-go-exporter-stackdriver package with this? @james-bebbington suggested writing a processor for this instead.
@aabmass Pretty sure that `k8s_container` is the type defined in the OpenCensus resource semantic conventions. FWIW, the Python implementation for OpenCensus seems to be using `gke_container`. I'm surprised that this inconsistency wasn't a problem in OpenCensus before. For backward compatibility, we may just recognize all variations (`k8s_container` and `gke_container`).
@nilebox Checking the OpenCensus resource semantic conventions for the possible values of the resource type: presumably if the docker container gets detected first, you'd get a `container` resource rather than a `k8s_container` one?
A single Pod may run multiple Docker containers (see PodSpec), so if a metric is emitted for a specific container, it may make sense to use the container-level resource type.
As an example, you may be interested in emitting CPU/RAM utilization metrics for each container separately rather than for the Pod as a whole. In Kubernetes, resource limits are specified per container, but Pods are allocated based on the sum of all requested resources; see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container. So when one of the containers has a memory leak, for example, you would want to track it separately from the other containers in the Pod.
I'll try to summarise the above to hopefully clear things up. Due to the nature of the progressive changes to the OC/OT data models, we have gotten to the point where there are several ways to represent a Resource:
[1] This is so OpenTelemetry can have a more expressive, hierarchical set of resource info instead of being limited to a specific "type" (e.g. process info + container info + pod info + k8s info + cloud info).

We then need to support at least the following paths of exporting data to Cloud Monitoring [4]:

Points 1 & 2 are already implemented (although note this is inconsistent due to the `gke_container` / `k8s_container` issue mentioned above). The other option is to just fast-forward to using the new OT exporter in the Collector (and verify this works in a way that is backwards compatible with OC), but given this code path is used heavily in production we need to be careful with this rollout and IMO should wait until the OTLP proto is finalized.

[4] There are other possible pipelines, such as when the Collector translates between OC & OT models during its internal pipeline, but I'm pretty sure this will be handled without issue.

TL;DR recommendations:
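To make the difference between the representations concrete, here is a rough illustration; the exact label/attribute keys are assumptions based on the OC and OT conventions discussed in this thread, not an authoritative list:

```python
# OpenCensus-style resource: an explicit monitored-resource type plus flat labels.
oc_style = {
    "type": "gke_container",  # or "k8s_container", per the inconsistency above
    "labels": {
        "project_id": "...", "zone": "...", "cluster_name": "...",
        "namespace_id": "...", "pod_id": "...", "container_name": "...",
    },
}

# OpenTelemetry-style resource: no explicit type, just hierarchical attributes
# following the semantic conventions; the exporter/collector must infer the type.
ot_style = {
    "cloud.provider": "gcp",
    "cloud.zone": "...",
    "k8s.cluster.name": "...",
    "k8s.namespace.name": "...",
    "k8s.pod.name": "...",
    "container.name": "...",
}
```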
Fix on the OpenTelemetry Collector side: open-telemetry/opentelemetry-collector#1462
Thanks for the detailed writeup; I think it made it clear to everyone, even those without context on the Collector (me), what the issue is. I'm a proponent of Model 3 for two reasons:
So it seems that Models 2 and 3 are only used in the OT SDK -> Cloud Monitoring and OT SDK -> Collector routes. The former supports Model 3, so it seems the only problem is building a good mapping from 3 -> 1 for the collector. We can keep Model 2 as a backup in case Model 3 isn't there for whatever reason, but I think the risk of ambiguity should make us want to avoid it as much as possible.
@nilebox that makes sense, but I don't see how the resource type will be converted correctly. Also, looking at open-telemetry/opentelemetry-collector#1462, should I just continue to send a blank resource type?
It will try and fail to do so, because it also requires all of the corresponding labels to be present. So once it fails to treat the resource as `k8s_container`, it falls through to the next candidate type.
Left a comment: you should still set the resource type for the OpenCensus exporter, but it's fine to omit it in the OTLP exporter thanks to the change in the collector.
The problem, from my point of view, is that Model 3 is not defined in the OT spec. We could consider proposing a change to the spec so there is a standard resource attribute key for the type.

I also don't think there is as much ambiguity as you think. I don't believe there are any cases where we cannot infer the correct Cloud Monitoring resource type from OT semantic conventions, and if there were, it would just imply we need to add additional semantic conventions (it doesn't have to be a 'type' property).
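As a hedged sketch of that inference idea (attribute names follow my reading of the OT resource semantic conventions; the mapping itself is an assumption, not the collector's actual code):

```python
def infer_monitored_resource_type(attrs):
    """Guess a Cloud Monitoring monitored-resource type from OT resource attributes."""
    if attrs.get("cloud.provider") == "gcp":
        if "k8s.pod.name" in attrs and "container.name" in attrs:
            return "k8s_container"
        if "host.id" in attrs:
            return "gce_instance"
    return "global"
```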
I'm not sure if this will help, but here is what the Stackdriver Prometheus sidecar does. I haven't had issues ingesting metrics using Prometheus + the sidecar (the resource type is set correctly all the time). Maybe we could take inspiration from this detection logic?
When running on GKE, the metrics exported with the `OpenCensusMetricsExporter` do not have any GKE resource labels (`pod_name`, `container_name`, `node_name`, etc.). Should those be added to the resource labels before shipping the metric with the exporters?