
Conversation

@braydonk (Contributor) commented Aug 25, 2023

This PR heavily refactors the Prometheus Exporter to use flb_downstream to handle connections instead of Monkey. It takes a lot of inspiration from in_http in its implementation.

#7056
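For context, a rough sketch of what a downstream-based listener setup can look like, modeled on how in_http wires it up. The flb_downstream_create() parameter order and the context/field names here are assumptions based on the 2.x headers, not this PR's literal diff:

```c
#include <fluent-bit/flb_output_plugin.h>
#include <fluent-bit/flb_downstream.h>

/* Hypothetical plugin context; names are illustrative. */
struct prom_exporter_ctx {
    char                  *listen;     /* e.g. "0.0.0.0" */
    int                    tcp_port;   /* e.g. 2021      */
    struct flb_downstream *downstream;
};

/* Sketch only: consult flb_downstream.h for the real prototype. */
static int prom_listener_create(struct flb_output_instance *ins,
                                struct flb_config *config,
                                struct prom_exporter_ctx *ctx)
{
    ctx->downstream = flb_downstream_create(FLB_TRANSPORT_TCP,
                                            ins->flags,
                                            ctx->listen,
                                            ctx->tcp_port,
                                            ins->tls,
                                            config,
                                            &ins->net_setup);
    if (!ctx->downstream) {
        flb_plg_error(ins, "could not create downstream listener");
        return -1;
    }
    return 0;
}
```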


Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • [N/A] Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries, then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • [N/A] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0; by submitting this pull request I understand that this code will be released under the terms of that license.

Heavily refactor the Prometheus Exporter to use flb_downstream
to handle connections instead of Monkey.

Signed-off-by: braydonk <braydonk@google.com>
@braydonk (Contributor Author) commented Aug 25, 2023

Converting to draft until I get another sec to grab the Valgrind and debug log output. In the meantime, please try the PR and exercise other use cases I may not have thought of; I have currently only tested with Fluent Bit self metrics.

Will also address the TODO I left in and the files that don't end with newlines.

@leonardo-albertovich self-assigned this Aug 25, 2023
Previously this was creating a full mk_lib_ctx, but that is very complicated to free properly. Switched to just allocating the server struct with keepalive set.

For some reason, this exposed the fact that I wasn't previously creating
the pthread key for the metrics list. I don't know why it worked before,
but now I properly create the key and use its exit callback to free the
metrics.

Signed-off-by: braydonk <braydonk@google.com>
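As an aside, the pthread-key pattern described in that commit generally looks like the following; this is a minimal sketch with illustrative names, not the plugin's actual symbols:

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t metrics_key;

/* Destructor runs automatically when a thread exits, freeing that
 * thread's metrics list. */
static void metrics_list_destroy(void *metrics_list)
{
    free(metrics_list);
}

static int metrics_key_create(void)
{
    /* Without this call, pthread_setspecific() has no valid key to
     * attach the per-thread metrics list to. */
    return pthread_key_create(&metrics_key, metrics_list_destroy);
}
```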
I missed freeing the content type header line; I had also missed actually using the Content-Type header before, so I added that too.

Signed-off-by: braydonk <braydonk@google.com>
Changed the way I construct this response so it is (imo) more readable.

Signed-off-by: braydonk <braydonk@google.com>
[This spec](https://prometheus.io/docs/instrumenting/exposition_formats/#text-based-format) suggests that this should be the Content-Type for text-based Prometheus metrics.

Signed-off-by: braydonk <braydonk@google.com>
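For reference, the media type that spec gives for the text format is `text/plain; version=0.0.4`. A minimal sketch of carrying it on a response, with hypothetical helper names rather than the plugin's actual code:

```c
#include <stdio.h>

/* Media type mandated by the Prometheus text-based exposition format. */
#define PROM_TEXT_CONTENT_TYPE "text/plain; version=0.0.4"

/* Hypothetical helper: render the header line into a response buffer. */
static int write_content_type(char *buf, size_t size)
{
    return snprintf(buf, size, "Content-Type: %s\r\n",
                    PROM_TEXT_CONTENT_TYPE);
}
```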
I was testing the empty content response and left this bit in. The last
2 commits will not function properly due to this.

Signed-off-by: braydonk <braydonk@google.com>
Since there is one canonical source of metrics that needs to be
maintained and requested, there is no benefit to allowing this plugin to
be asynchronous.

Signed-off-by: braydonk <braydonk@google.com>
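In plugin registration terms, making an output synchronous comes down to its flags; a sketch of the registration struct follows, with callback names as illustrative placeholders rather than this PR's exact diff:

```c
/* Sketch of an output plugin declaring synchronous flushing via its
 * registration flags; callback names are illustrative placeholders. */
struct flb_output_plugin out_prometheus_exporter_plugin = {
    .name        = "prometheus_exporter",
    .description = "Prometheus Exporter",
    .cb_init     = cb_prom_init,
    .cb_flush    = cb_prom_flush,
    .cb_exit     = cb_prom_exit,
    .flags       = FLB_OUTPUT_SYNCHRONOUS, /* one active task at a time */
};
```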
Forgot to destroy a string I allocate for the content type when sending
metrics.

Signed-off-by: braydonk <braydonk@google.com>
I already pass in the right Content-Type when it's actually Prometheus metric content, so I might as well leave this default the way it was before.

Signed-off-by: braydonk <braydonk@google.com>
@braydonk (Contributor Author) commented Aug 27, 2023

Testing this change:

  • Built Fluent Bit in RelWithDebInfo mode with `-DFLB_VALGRIND`
  • `curl localhost:2021/metrics` to get metrics from the plugin

Config to test:

[SERVICE]
    Log_Level     debug

[INPUT]
    name            fluentbit_metrics
    tag             internal_metrics
    scrape_interval 2

[INPUT]
    name tail
    tag  x
    path x.txt

[OUTPUT]
    name            prometheus_exporter
    match           internal_metrics
    host            0.0.0.0
    port            2021

[OUTPUT]
    name stdout
    match x

Debug logs: https://gist.github.com/braydonk/4151f2d9068b6d125138820589af688f

Valgrind output: https://gist.github.com/braydonk/97c4e05af213f8cc26e5299f64d3318a

I think the invalid reads in the Valgrind log might be false positives resulting from optimization, but if you can find anything wrong that I'm not seeing, please let me know.

@braydonk (Contributor Author) commented:

I don't have access to a good Windows environment to fix the Windows build issues; it might be a couple of days until I can get those resolved. Will leave in draft for now until I can fix those up.

@braydonk (Contributor Author) commented:

Does the FLB_OUTPUT_SYNCHRONOUS option also force the plugin to run in the main thread? Because while it would not make sense for this plugin to have multiple workers, it actually would make sense for it to run in its own thread. Does this concept exist in Fluent Bit?

@leonardo-albertovich (Contributor) commented:

I'm not really familiar with the singleplex implementation; I think @matthewfala might be able to shed some light on it, but according to the flush source code it seems that FLB_OUTPUT_SYNCHRONOUS can be active while an output plugin runs in threaded mode.

@matthewfala (Contributor) commented Aug 28, 2023

Hi @braydonk,

Yes, as Leonardo pointed out, the FLB_OUTPUT_SYNCHRONOUS singleplex scheduler mode can be used with both 0 and multiple workers, so both the engine thread and worker threads work. It ensures that at most one task is active for a given plugin instance at a time. If you have multiple plugins, there will be one active task per plugin at a time.

We had to write this for the CloudWatch Logs plugin. The API at the time had a sequence token which ensured that multiple requests did not occur at the same time. One option was to use sync networking and 1 worker to ensure that multiple network requests wouldn't happen at the same time and fail, but we found that sync networking mode had issues causing a hang after about a week of runtime in standard logging workloads, which was not ideal.

The singleplex scheduler allows for async networking to be used, while tasks are kept synchronous, making everyone happy :).

@leonardo-albertovich (Contributor) commented:

I can't remember if I checked this before, but does singleplex mode guarantee sequential order when a flush operation fails with a code that results in a retry?

@braydonk (Contributor Author) commented:

Thank you for the explanation @matthewfala!

The unique challenge for this plugin is that it's constantly maintaining state: on every flush it receives, it updates its current state, and it is always waiting for requests to read that state. The ideal setup for me would be for the plugin to run in a thread so it's not bogging down the main thread, but with only one worker. It sounds like FLB_OUTPUT_SYNCHRONOUS does achieve the linear schedule I was hoping for, but doesn't restrict the number of workers.

I suppose an alternative would be, instead of maintaining state in pthread keys, to maintain the list of metrics in the output context and mutex-lock it when updating and reading, roughly as sketched below. The reason I wasn't a huge fan of this is that it seems like it would sort of defeat the point of having multiple workers, since each one would have to wait on the mutex to read the metrics. Any opinions on this @leonardo-albertovich?
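A minimal sketch of that mutex-guarded alternative, with hypothetical struct and function names:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state kept in the output context instead of
 * pthread keys; names are illustrative. */
struct prom_shared_metrics {
    pthread_mutex_t lock;
    char           *payload;       /* latest rendered metrics text */
    size_t          payload_size;
};

/* Flush path: swap in the newly rendered snapshot under the lock. */
static void metrics_update(struct prom_shared_metrics *m,
                           char *payload, size_t size)
{
    pthread_mutex_lock(&m->lock);
    m->payload = payload;
    m->payload_size = size;
    pthread_mutex_unlock(&m->lock);
}
```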

@leonardo-albertovich (Contributor) commented:

I think I'm missing the point, because my conclusion was that combining FLB_OUTPUT_SYNCHRONOUS with a bit of code to ensure that the output plugin defaults to 1 worker, disallows a worker count higher than 1, and maybe issues a warning if it's set to 0 would be the simplest and optimal approach.

Did I miss something there?

@braydonk (Contributor Author) commented:

Perhaps I missed something; I was under the impression there wasn't a way to actually limit the number of workers. Is it actually possible to force this plugin to only have 1 worker? Because that would resolve my concern.

@leonardo-albertovich (Contributor) commented:

You can set the number of worker threads through the Workers output plugin key in the configuration file, and you can set the default value for the plugin through the flb_output_plugin instance that's used to expose each plugin (e.g. stdout).

What I suggested was setting the default to 1 and then verifying that value in the init callback.

Because of how flush scheduling works, even if you didn't use the singleplex functionality and multiple flush coroutines were created, they would be executed in a linear fashion as long as they don't yield, which in your case wouldn't happen because all your flush callback does is take the metrics contexts contained in the chunk and load them into its working data set.
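A sketch of that suggested guard in the init callback; the field used to read the worker count is an assumption, not a verified part of flb_output_instance:

```c
#include <fluent-bit/flb_output_plugin.h>

/* Sketch of the suggested worker guard. The `tp_workers` field name is
 * an assumption about flb_output_instance; check flb_output.h for the
 * real accessor. */
static int cb_prom_init(struct flb_output_instance *ins,
                        struct flb_config *config, void *data)
{
    if (ins->tp_workers > 1) {
        flb_plg_error(ins, "prometheus_exporter supports at most 1 worker");
        return -1;
    }
    if (ins->tp_workers == 0) {
        flb_plg_warn(ins, "running on the engine thread; consider workers=1");
    }
    /* ... normal initialization ... */
    return 0;
}
```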

@braydonk (Contributor Author) commented:

> What I suggested was setting the default to 1 and then verifying that value in the init callback.

Is this okay, and is there precedent for this?

@leonardo-albertovich (Contributor) commented:

Sorry, I missed this reply. I'm not aware of any precedent, but I can't see a reason why it wouldn't be a viable approach.

Do you think we could resume this and get it merged @braydonk?
