-
Notifications
You must be signed in to change notification settings - Fork 1.8k
out_prometheus_exporter: use flb_downstream #7867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Heavily refactor the Prometheus Exporter to use flb_downstream to handle connections instead of Monkey. Signed-off-by: braydonk <braydonk@google.com>
Converting to draft until I get another sec to get the valgrind and debug log. In the meantime, please try the PR and try other use cases I may not have thought of, I have currently only tested with fluent bit self metrics. Will also address the TODO I left in and the files that don't end with newlines. |
Signed-off-by: braydonk <braydonk@google.com>
Previously was creating a full mk_lib_ctx, but that is very complicated to free properly. Switching to just allocating the server struct with keepalive set. For some reason, this exposed the fact that I wasn't previously creating the pthread key for the metrics list. I don't know why it worked before, but now I properly create the key and use its exit callback to free the metrics. Signed-off-by: braydonk <braydonk@google.com>
I missed freeing the content type header line, but also I missed actually using the Content-Type header before so added that too. Signed-off-by: braydonk <braydonk@google.com>
Changed the way I construct this response so it is (imo) more readable. Signed-off-by: braydonk <braydonk@google.com>
[This spec](https://prometheus.io/docs/instrumenting/exposition_formats/#text-based-format) suggests that this should be the Content-Type for text-based Prometheus metrics. Signed-off-by: braydonk <braydonk@google.com>
I was testing the empty content response and left this bit in. The last 2 commits will not function properly due to this. Signed-off-by: braydonk <braydonk@google.com>
Since there is one canonical source of metrics that needs to be maintained and requested, there is no benefit to allowing this plugin to be asynchronous. Signed-off-by: braydonk <braydonk@google.com>
Forgot to destroy a string I allocate for the content type when sending metrics. Signed-off-by: braydonk <braydonk@google.com>
I actually already pass in the right Content-Type when it's actually Prometheus Metric content, so might as well leave this default the way it was before. Signed-off-by: braydonk <braydonk@google.com>
Testing this change:
Config to test:
Debug logs: https://gist.github.com/braydonk/4151f2d9068b6d125138820589af688f Valgrind output: https://gist.github.com/braydonk/97c4e05af213f8cc26e5299f64d3318a I think the invalid reads in the valgrind log might be false positives resulting from optimization, but if you can find anything wrong that I'm not seeing please let me know. |
I don't have access to a good Windows environment to fix the Windows build issues, that might be a couple days til I can get those resolved. Will leave in draft for now until I can fix those up. |
Does the |
I'm not really familiar with the singleplex implementation, I think @matthewfala might be able to shed some light on it but according to the flush source code it seems that |
Hi braydonk@ Yes, as Leonardo pointed out, the We had to write this for cloudwatch logs plugin. The API at the time had a sequence token which ensured that multiple requests did not occur at the same time. One option was to use sync networking and 1 worker to assure that multiple network requests wouldn't happen at the same time and fail. But we found the sync networking mode to have issues causing a hang after about 1 week of runtime in standard logging workloads which was not ideal. The singleplex scheduler allows for async networking to be used, while tasks are kept synchronous, making everyone happy :). |
I can't remember if I checked this before but does singleplex mode guarantee sequential order when a flush operation fails with a code that results in a retry? |
Thank you for the explanation @matthewfala! The unique challenge for this plugin is that it's actually constantly maintaining state; on every flush it receives it updates its current state, and is just always waiting for requests to read that state. The ideal setup for me would be for the plugin to run in a thread so it's not bogging down the main thread, but for there to only be one worker. It sounds like I suppose an alternative would be instead of maintaining state in pthread keys, we could maintain the list of metrics in the output context and mutex lock it when updating and reading. The reason I wasn't a huge fan of this is that it seems like it would sorta ruin the point of having multiple workers, since each one would have to wait on the mutex to read the metrics. Any opinions on this @leonardo-albertovich? |
I think I'm missing the point because my conclusion was that combining Did I miss something there? |
Perhaps I missed something; I was under the impression there wasn't a way to actually limit the number of workers. Is that actually possible to force this plugin to only have 1 worker? Because that would resolve my concern |
You can set the number of worker threads through the What I suggested was setting the default to 1 and then verify that value in the Because of how flush scheduling works even if you didn't use the singleplex functionality and multiple flush coroutines are created they will be executed in a linear fashion as long as they don't yield which in you case wouldn't happen because all your |
Is this okay/is there precedence for this? |
Sorry, I missed this reply. I'm not aware of any precedence but I can't see a reason why it wouldn't be a viable approach. Do you think we could resume this and get it merged @braydonk? |
This PR heavily refactors the Prometheus Exporter to use
flb_downstream
to handle connections instead of Monkey. It takes a lot of inspiration fromin_http
in its implementation.#7056
Enter
[N/A]
in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-test
label to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.