The Prometheus metrics API is too slow #7353
Comments
Thanks for the reply. I have read these two issues, and they helped somewhat, but they do not fix the problem at its source in APISIX. I plan to modify the exporter.lua script to reduce the metrics dataset and see whether that solves it. In other words, users currently have to change the code themselves to work around this, which is not a good solution. I ran into this problem because I was migrating our gateway from Envoy to APISIX. With Envoy, the response does not become slow even when the metrics data reaches several hundred MB. As mentioned in other issues, I think APISIX's Prometheus plugin could provide some customizable options to help people with similar problems. Thank you very much for your reply; if there is progress, I will report back in this issue.
Here's what I've seen work so far: reduce the number of metrics that aren't needed. Ref: #4273
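To decide which metrics are worth dropping, it helps to know which metric names account for most of the series. The following is a minimal sketch, not part of APISIX: it counts sample lines per metric name in a response in the Prometheus text format. The function name `series_per_metric` and the sample data (including its label sets) are hypothetical and only illustrate the shape of the input.

```python
from collections import Counter


def series_per_metric(exposition_text: str) -> Counter:
    """Count sample lines (time series) per metric name in Prometheus text format."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name_end = len(line)
        for sep in ("{", " "):
            idx = line.find(sep)
            if idx != -1:
                name_end = min(name_end, idx)
        counts[line[:name_end]] += 1
    return counts


# Hypothetical sample, shaped like a scrape response.
SAMPLE = """\
# HELP apisix_http_status HTTP status codes
# TYPE apisix_http_status counter
apisix_http_status{code="200",route="r1",node="10.0.0.1"} 12
apisix_http_status{code="200",route="r2",node="10.0.0.2"} 7
apisix_http_status{code="500",route="r2",node="10.0.0.2"} 1
apisix_etcd_reachable 1
"""

# Print metric names from most to fewest series.
for name, n in series_per_metric(SAMPLE).most_common():
    print(f"{name}: {n}")
```

Running this against a saved copy of the real metrics response shows which metric names dominate and are therefore the first candidates for trimming.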
When the Prometheus plugin is used in the global configuration with the switch enabled, it can collect route-level metrics.
I followed up further on this issue. The root cause lies in etcd. The problem can be reproduced when APISIX nodes open many connections (for example, more than 200) to the same etcd node and APISIX communicates with etcd over TLS. This problem has two consequences:
I have created a related issue in the etcd project and reproduced the problem there. Related issue: #7078. etcd version 3.5.5 will include the fix. I also rebuilt and deployed etcd from the PR that fixes it, and testing confirmed that the problem is resolved.
@xuminwlt In my opinion, the Prometheus plugin holds historical data, so disabling the plugin does not solve the problem. Even if the upstream nodes change, the Prometheus plugin always keeps the data for the old nodes.
Thank you for your research!
This was due to too many metrics (tens of thousands). I made some optimizations to the upstream nginx-lua-prometheus library, but they did not solve the problem completely. Ref: knyar/nginx-lua-prometheus#139. One idea in the community now is to provide options that control which types of metrics are exported, reducing the total number of metrics; see #7211 (comment).
This problem does exist when the volume of metrics data is large, and we found that it also leads to abnormally high CPU usage. So we modified the Prometheus plugin to record only the necessary information, and the streamlined plugin works well.
@zuiyangqingzhou Have you tried the nginx-lua-prometheus optimization introduced by @tzssangglass?
This optimization is also limited, because some steps cannot be removed, such as sorting tens of thousands of keys, regular-expression matching, and string concatenation. 😅
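To illustrate why that per-scrape work grows with the number of series, here is a rough sketch in Python. It is not the nginx-lua-prometheus implementation; `render_metrics` and `make_store` are hypothetical names, and the store is just a dictionary standing in for the shared key-value data. Every call sorts all keys and builds one string per series, so the cost is paid again on each scrape.

```python
import time


def render_metrics(store: dict) -> str:
    """Render a key -> value store as text, sorting the keys on every call."""
    lines = []
    for key in sorted(store):                # O(n log n) on every scrape
        lines.append(f"{key} {store[key]}")  # one string built per series
    return "\n".join(lines) + "\n"


def make_store(n_series: int) -> dict:
    """Build a hypothetical store with n_series distinct label combinations."""
    return {
        f'http_status{{route="r{i}",code="200"}}': i
        for i in range(n_series)
    }


# Compare a small and a large store; timings depend on the machine.
for n in (1_000, 50_000):
    store = make_store(n)
    start = time.perf_counter()
    body = render_metrics(store)
    elapsed = time.perf_counter() - start
    print(f"{n} series: {len(body)} bytes rendered in {elapsed:.4f}s")
```

The absolute numbers are not comparable to OpenResty, but the shape is the same: fewer series means less sorting and less string building per scrape.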
@tzssangglass @tokers
@hansedong Would you like to submit a PR to add this important fact to the FAQ?
@tokers I'd love to do this. How do I add this to the FAQ?
The FAQ page is https://apisix.apache.org/docs/apisix/FAQ/, and you can submit a PR to apisix-website: https://github.com/apache/apisix-website
@tokers Thanks a lot, I'll give it a try.
@tokers I'm a little confused. The content of the FAQ page doesn't seem to be in the apache/apisix-website project, but in apache/apisix, specifically https://github.com/apache/apisix/blob/master/docs/en/latest/FAQ.md?
Oops, you're right, that's the correct place.
Description
I use APISIX in our microservice platform. There are thousands of microservices, so there are thousands of Route and Upstream resources in APISIX.
When I switched the online traffic to APISIX and our Prometheus monitoring platform scraped time series data from APISIX's metrics API, APISIX took a long time to respond, which in turn caused the Prometheus scrapes to time out.
To rule out network causes, I fetched the metrics with curl on the APISIX node itself. It was still very slow, so the root of the problem lies in APISIX itself.
As shown above:
How should I troubleshoot this issue?
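One starting point is to measure, on the APISIX node itself, how long a single scrape takes and how large the response is. The sketch below uses only the Python standard library. The URL is an assumption (it is the address I believe APISIX uses by default for its Prometheus export and does not come from this report), and the names `timed_scrape` and `summarize_scrape` are hypothetical.

```python
import time
import urllib.request

# Assumption: default APISIX Prometheus export address; adjust to your deployment.
METRICS_URL = "http://127.0.0.1:9091/apisix/prometheus/metrics"


def summarize_scrape(body: bytes, elapsed_seconds: float) -> dict:
    """Summarize one scrape: duration, payload size, and number of sample lines."""
    text = body.decode("utf-8", errors="replace")
    samples = sum(
        1
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")  # skip blanks and comments
    )
    return {
        "seconds": round(elapsed_seconds, 3),
        "bytes": len(body),
        "samples": samples,
    }


def timed_scrape(url: str = METRICS_URL, timeout: float = 60.0) -> dict:
    """Fetch the metrics endpoint once and report how long the full read took."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read()
    return summarize_scrape(body, time.perf_counter() - start)
```

Calling `print(timed_scrape())` on the node gives a baseline; repeating it after trimming metrics shows whether the response time follows the number of samples.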
Environment
apisix version: 2.13.2
uname -a: Linux knode10-132-14-174 4.19.206 #1 SMP Wed Sep 15 16:18:07 CST 2021 x86_64 x86_64 x86_64 GNU/Linux
openresty -V or nginx -V:
curl http://127.0.0.1:9090/v1/server_info: 3.5.4
luarocks --version: no