Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Count metric is overridden instead of added/summed #970

Closed
leocavalcante opened this issue Apr 12, 2023 · 33 comments
Closed

Count metric is overridden instead of added/summed #970

leocavalcante opened this issue Apr 12, 2023 · 33 comments
Labels
bug Something isn't working

Comments

@leocavalcante
Copy link

leocavalcante commented Apr 12, 2023

Describe your environment

"php": "^8,1",
"open-telemetry/sdk": "^0.0.17",
"symfony/http-client": "^6.2",
"guzzlehttp/promises": "^1.5",
"php-http/message-factory": "^1.0",
"nyholm/psr7": "^1.6",
"open-telemetry/sdk-contrib": "^0.0.17"

Steps to reproduce

$transport = (new GrpcTransportFactory())->create('http://localhost:4317' . OtlpUtil::method(Signals::METRICS));
$exporter = new MetricExporter($transport);
$reader = new ExportingReader($exporter, ClockFactory::getDefault());

$meter = MeterProvider::builder()
       ->addReader($reader)
       ->build()
       ->getMeter('example');

$counter = $meter->createCounter('http_requests');

$counter->add(1);

$reader->collect();

https://github.com/leocavalcante/otel-php-metrics/blob/main/index.php

What is the expected behavior?
The http_requests{} metric being added while I hit curl localhost:8080

What is the actual behavior?
It remains at the value of 1

Additional context
https://github.com/leocavalcante/otel-php-metrics

@leocavalcante leocavalcante added the bug Something isn't working label Apr 12, 2023
@brettmc
Copy link
Collaborator

brettmc commented Apr 18, 2023

@leocavalcante can you please try adding cumulative temporality to your exporter?

$exporter = new MetricExporter($transport, Temporality::CUMULATIVE);

I was just scanning these docs: https://opentelemetry.io/docs/reference/specification/metrics/data-model/#example-use-cases

@leocavalcante
Copy link
Author

@brettmc, thank you for the suggestion leocavalcante/otel-php-metrics@53b9422, but it didn't work.

@leocavalcante
Copy link
Author

leocavalcante commented Apr 18, 2023

Please, note that using Swoole https://github.com/leocavalcante/otel-php-metrics/blob/main/swoole.php it works just fine.

I really suspect that it's something about the stateless/shared-nothing FPM model, each request on its own process, has it's own Counter and despite the same name, each call to collect() overrides the previous value instead of adding. Maybe something with the Collector instead of the PHP-SDK?

@eliseuborges
Copy link

@brettmc I've also use Temporality::CUMULATIVE and didn't work.

@Nevay
Copy link
Contributor

Nevay commented Apr 18, 2023

Every MeterProvider has its own internal state and creates its own time series. This leads to alignment issues when exporting metrics from multiple meter providers with identical resource attributes (setting service.instance.id should prevent this). The prometheus exporter follows the spec and resets the aggregated sum if the timestamps do not match.

open-telemetry/opentelemetry-collector-contrib#4968 seems related.

@bobstrecansky
Copy link
Collaborator

@leocavalcante - has this helped resolve your issue?

@leocavalcante
Copy link
Author

leocavalcante commented Jun 7, 2023

Actually, it didn't, @bobstrecansky.

PHP-FPM uses a shared-nothing model for each of the incoming requests (each requests knows/counts only itself). I would have to set a unique service.instance.id for each request, which is impractical for the metric cardinality.

To use the SDK properly with PHP-FPM, its needs to somehow share the counter between processes/requests, like the Prometheus client does using Redis or APC.

@achetronic
Copy link

Hey, this is happening here too. I assume @leocavalcante is right on that. IMO, the proper way to do it is supporting APC/u to use as native approach as possible. Is there any possibility to implement something similar?

@leocavalcante
Copy link
Author

I guess that a ApcMetricRegistry or RedisMetricRegistry can do the trick: https://github.com/open-telemetry/opentelemetry-php/tree/main/src/SDK/Metrics/MetricRegistry

@achetronic
Copy link

I guess that a ApcMetricRegistry or RedisMetricRegistry can do the trick: https://github.com/open-telemetry/opentelemetry-php/tree/main/src/SDK/Metrics/MetricRegistry

Hey, are you planning to implement it? :)

@leocavalcante
Copy link
Author

Hey, are you planning to implement it? :)

Actually, no. I'm not sure if it is the way to go. Just guessing.

@achetronic
Copy link

achetronic commented Jul 13, 2023

Hey, are you planning to implement it? :)

Actually, no. I'm not sure if it is the way to go. Just guessing.

Well, having an external system like Redis/memcached can add lag but enforce the values for the counters, and related stuff. Do you if APCu with FPM in front could work?

Saying this because FPM generates several serving pools, so I have the impression that APCu will have a shared cache inside each pool, but not between the pools

@leocavalcante
Copy link
Author

I meant I'm not sure if it is about creating a new MetricRegistry implementation, not about using APC or Redis to share memory.

@brettmc
Copy link
Collaborator

brettmc commented Jul 13, 2023

What would be responsible for exporting metrics from storage (apcu/redis)? I can understand using shared storage, but not how the exporting side might work. Is there any prior art in other language implementations?

@leocavalcante
Copy link
Author

Maybe the Prometheus client can serve as an implementation reference?
https://github.com/PromPHP/prometheus_client_php

@achetronic
Copy link

What would be responsible for exporting metrics from storage (apcu/redis)? I can understand using shared storage, but not how the exporting side might work. Is there any prior art in other language implementations?

Well, some time ago I had a similar problem and solved in a tricky way, I would say. Basically I transformed the in-memory object and stored into FS. Of course, this is something that increased a bit the fs read/write iops but the latency was not that bad

Of course, this did solve some issues but not them all because of the same: FPM in front

I would say when an instrumentation is created, it's needed to create and populate the connection to a, for example, Redis, and check if this connection is available on each request?

HDYT this should be implemented? some ideas?

@brettmc
Copy link
Collaborator

brettmc commented Aug 23, 2023

Since we're getting close to release candidate, and no solution is in sight for this one, I'm thinking to document it as a known issue.
Can we generalise the problem to something like "Aggregated metrics do not work well with shared-nothing PHP runtimes"? I assume that it's an issue with any metric?

@leocavalcante
Copy link
Author

leocavalcante commented Aug 24, 2023

@brettmc, an interesting information: I had the same problem using OTel's Collector Statsd receiver. It could be a problem at the Prometheus exporter not properly handling Delta metrics because I have two OTLP exporters (New Relic and Dynatrace) for the same metrics and it works for the Statsd's Delta metrics there.
Haven't tested with PHP's SDK yet, but it should work as well for Delta.

@brettmc
Copy link
Collaborator

brettmc commented Aug 29, 2023

I sought feedback from Reiley Yang from the Technical Committee (who did our metrics TC review). He also thinks that this is not a bug, but expected behaviour when using shared-nothing PHP runtime.
I'd therefore like to close this off as a bug. I'll document this in our docs over on opentelemetry.io as a gotcha, or "feature".
We can also create a feature request to investigate/implement some sort of local aggregator, if folks are interested in tackling that?

@leocavalcante
Copy link
Author

Sounds good to me. Not a bug, but a limitation when using Cumulative Temporality with PHP-FPM.
PHP-FPM users should stick to Delta metrics.

@achetronic
Copy link

@brettmc @leocavalcante Hello there, and sorry in advance for the delay. I agree with this is the expected behavior for shared-nothing environments. The problem is, most PHP production environments our there are using PHP-FPM in front of it. This is not a bug in the OTEL side, of course, but the reason why some other libraries like Prometheus SDK for PHP implemented the possibility to use external systems to store temporary the metrics before being collected by the collector.

Honestly, I think solving this is something really really needed due to the way the industry uses PHP. May be not in the future if some alternatives like ReactPHP or Swoole get more popular, but ATM it is :)

@leocavalcante
Copy link
Author

Just to clarify: PHP-FPM users are still able to use the SDK as-is, just keep de default Delta temporality instead of using Cumulative.
I've been working with @eliseuborges at our company and he successfully sent metrics to Dynatrace, from a Laravel/Lumen + PHP-FPM environment, using de SDK.

@achetronic
Copy link

Just to clarify: PHP-FPM users are still able to use the SDK as-is, just keep de default Delta temporality instead of using Cumulative.
I've been working with @eliseuborges at our company and he successfully sent metrics to Dynatrace, from a Laravel/Lumen + PHP-FPM environment, using de SDK.

I know, I know, but the problem are not metrics like gauge ones, where you set directly the value, but for example the counters, that are always set to 0 for each request

In my opinion, for business metrics like whatever_successful_sum they are completely needed, as you need to add a value based on the application's previous one, and not based on 0

@leocavalcante
Copy link
Author

Actually, counters works just fine as well. We have tested exactly with a counter.

@achetronic
Copy link

achetronic commented Aug 30, 2023

Actually, counters works just fine as well. We have tested exactly with a counter.

Not working for me. So some questions:

  • Are you sure you are keeping the number between requests with more than 1 replica in k8s environments?
  • How are you using it then?
  • Do you have high cardinality labels on that metric?
  • Have you tested with a metric with no labels?

I am asking because if you have shared-nothing envs, how does OTEL sdk store the information between the requests? AFAIK is not using ACPu neither

In fact, I actually had to use Prometheus SDK for PHP (+ PHP-FPM) apps and OTEL SDK for the rest of them

@leocavalcante
Copy link
Author

It doesn't need to store the information on the SDK side, it just sends a new value to be incremented in the counter. That's the key difference between Delta and Cumulative.

Delta

image

Cumulative

Contrast with cumulative aggregation temporality where we expect to report the full sum since ‘start’ (where usually start means a process/application start):

image

Which means, when using Delta you don't need to store the value of the counter since the beginning.


Not working for me

How are you trying to see the Metrics? Using OTel's Collector Prometheus exporter?
Try sending it directly to a APM that supports Delta metrics like New Relic or Dynatrace.

@achetronic
Copy link

It doesn't need to store the information on the SDK side, it just sends a new value to be incremented in the counter. That's the key difference between Delta and Cumulative.

Delta

image

Cumulative

Contrast with cumulative aggregation temporality where we expect to report the full sum since ‘start’ (where usually start means a process/application start):

image

Which means, when using Delta you don't need to store the value of the counter since the beginning.


Not working for me

How are you trying to see the Metrics? Using OTel's Collector Prometheus exporter?
Try sending it directly to a APM that supports Delta metrics like New Relic or Dynatrace.

Thank you for detailed explanation. Now I see I was right on the requirements :)

I am using cumulative metrics for some business metrics that need to have the result in that way. The SDK, sending the metrics against Otel collector and from there to Grafana Mimir (for most applications, but for some others, the collector is pulling the /metrics endpoint)

The problem I see with delta metrics is the fact that some business metrics like failed / successful_payments_sum need an absolute reference to compare success between both them. That's the reason to use cumulative there

@brettmc
Copy link
Collaborator

brettmc commented Aug 30, 2023

Honestly, I think solving this is something really really needed due to the way the industry uses PHP.

I agree. The reason for treating it as an enhancement rather than a bug is that the SDK is doing everything correctly, we think. A fix may end up being some new feature(s) added, and maybe not as part of the SDK itself.

There is certainly a deficiency somewhere, but a fix will take some thought and planning and somebody to drive it. Having an external data store sounds complex.
Is it something that the collector could do (via a plugin/extension), so that it's solved once for all languages (if there are any other languages where this is a problem?)

@eliseuborges
Copy link

Hey guys,

As @leocavalcante said, we are having success to send the metrics to Dynatrace, including counters.

@brettmc I'm not an expert on collectors, but I think your idea is a good way to solve this problem.
Maybe the collector could acting as application side car to receiving metrics from php application, aggregating and send forward time by time like prometheus protocol wants to.

By the way, the stack prometheus + OpenTelemetry + php-fpm is, for me, the major case to implement that solution. It was also our old stack and we had the same problem as @achetronic.

What do you think about this approach @brettmc? Do you think there is any other solution to work with this stack php-fpm + OpenTelemetry + prometheus?

@leocavalcante
Copy link
Author

leocavalcante commented Aug 31, 2023

The problem I see with Delta metrics is the fact that some business metrics like failed / successful_payments_sum need an absolute reference to compare success between both them. That's the reason to use cumulative there

@achetronic, the problem isn't with Delta metrics. You can normally use Delta metrics for business metrics, for monotonic counters like successful payments, it will be added at your APM. The problem here is the Collector's exporter.

Try adding an otlp exporter to you collector that sends to an APM like New Relic or Dynatrace, or send your metrics directly to these platforms. You will see they will work just fine, receiving your Delta metrics and summing up as a regular counter.
For each new successful payment a new Delta metric is sent and added at the end on the platform layer.

Tomorrow I will update my example repo, linked in the issue, to send the metrics to a free New Relic account to show you it working.

@achetronic
Copy link

@brettmc Understood and agree that should be driven separatedly :) but honestly, I don't think this can be implemented in the collector side, acting as a common collector, without causing a huge waste of resources on it. In fact, we had to disable the queues in some parts of the collector due to the huge amount of metrics we handle per cluster (after it, it works smoothly). May be you are right and we need some fresh ideas on that 😃

@eliseuborges Yes, I think the common stack is the only reason to implement it. I don't like that stack since PHP can embrace ReactPHP/Swoole, but you know, the industry is moving slow on that regard yet

@leocavalcante I understand your point perfectly but honestly, our goal after years of being vendor locked is to be vendor agnostic as much as possible, so NewRelic/Dynatrace are not options. I understand you are suggesting this to proof your point about Delta metrics not being a problem, but IMHO we should think about fixing this in all the stacks used there, and (Prometheus/Thanos/Mimir) are the main one for this. Without covering the main ones, the project is basically useless :)

I think you are right suggesting we can sum the Delta results and aggregate them using a time window to have more or less an absolute result over the year, but I think we should make it easier for the common developers to work with this kind of stuff to increase adoption with common stack, as most PHP developers out there understand first a cumulative counter on their common old stack more than the rest of the things, right? I will scrape your link 👁️

@leocavalcante
Copy link
Author

@achetronic there is it: leocavalcante/otel-php-metrics@c109686

Results:
Screenshot 2023-08-31 at 10 02 57

Prom still with the problem:
Screenshot 2023-08-31 at 10 03 30

I'm making this point to clarify that the problem isn't the SDK.
Would be really really nice if the SDK could properly handle Cumulative temporality, but the real problem here is:

Prometheus exporter can't handle Delta metrics.

That is the real problem, not the SDK. It fails with Statsd too, for example.

Instead of thinking there is a bug on the PHP SDK, the real bug is at the Prometheus exporter, that affects not only the PHP SDK, but whatever sends Delta metrics, like Statsd.

We should be addressing to PRs like this ones:

@achetronic
Copy link

achetronic commented Aug 31, 2023

@leocavalcante hey, i'm not using that exporter :) In my case I'm using PrometheusRemoteWrite exporter and I understand your point because exact same behavior on my side (this was the reason to switch to Prometheus SDK).

Taking apart the issue with the Cumulative metrics, that is a real issue IMO, and talking only about Delta ones, I think for them, the behavior is the expected as one request in this stack (PHP + PHP-FPM) will compute one increment for a counter, so delta is exactly 1 in that case, unless you set explicitly another value in, for example, a gauge metric, you know

Another option is to force Prometheus to compute a result over the time, but it's not easy in the Prometheus side, as you need a quite complex Prometheus rule to compute a moving window for a metric, and this is not doable in a lot of cases, and moreover not desirable as you can have a lot of them.

May be, as an idea, it could be solved as delta_to_cumulative or just a _cumulative exporter/transformer in the collector side? I think this can really solve the issue but can consume a huge amount of memory for that as it has to keep the metrics in memory?

IMHO, this could solve the issue until the new feature for external-stored-cumulative is implemented in the SDK, and only until then, because once this is implemented, we could say that the SDK can behave the same in this stack as it does for a common scenario with standalone PHP or whatever another language. WDYT?

Just for those recently coming into the discussion, or whatever, we have exactly the opposite in the collector

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants