
[chore] configure controller prometheus endpoint for collecting metrics #1520

Closed
wants to merge 10 commits into from

Conversation

ccfishk
Contributor

@ccfishk ccfishk commented Jul 8, 2021

What this PR does / why we need it:
Controller metrics (memory, CPU, threads, API response times) should be used to evaluate the controller at runtime.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #
Starts an HTTP server that integrates with the Prometheus Go client library to collect controller metrics at runtime.
#705
#1278

Special notes for your reviewer:

PR Readiness Checklist:

Complete these before marking the PR as ready to review:

  • the CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR
  • [ ] able to provide runtime metrics via the /metrics, /build, and /debug/profile/ endpoints

@ccfishk ccfishk requested a review from a team as a code owner July 8, 2021 21:32
@ccfishk ccfishk temporarily deployed to Configure ci July 8, 2021 21:32 Inactive
@ccfishk ccfishk changed the title [kic-2.0] prometheus setup for controller [kic-2.0] configure prometheus for controller collecting cpu/memory/network Jul 8, 2021
@ccfishk ccfishk self-assigned this Jul 8, 2021
@ccfishk ccfishk added the area/maintenance Cleanup, refactoring, and other maintenance improvements that don't change functionality. label Jul 8, 2021
@ccfishk ccfishk changed the title [kic-2.0] configure prometheus for controller collecting cpu/memory/network [kic-2.0] configure prometheus for controller collecting cpu/memory/network metrics Jul 8, 2021
@codecov

codecov bot commented Jul 8, 2021

Codecov Report

Merging #1520 (c8fdc40) into next (0a503e5) will increase coverage by 0.31%.
The diff coverage is 57.14%.

Impacted file tree graph

@@            Coverage Diff             @@
##             next    #1520      +/-   ##
==========================================
+ Coverage   46.13%   46.45%   +0.31%     
==========================================
  Files          71       72       +1     
  Lines        6745     6822      +77     
==========================================
+ Hits         3112     3169      +57     
- Misses       3262     3282      +20     
  Partials      371      371              
Flag Coverage Δ
integration-test 46.45% <57.14%> (+0.31%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
internal/ctrlutils/utils.go 93.75% <ø> (-2.09%) ⬇️
internal/mgrutils/reports.go 47.61% <0.00%> (-7.94%) ⬇️
internal/manager/prometheus.go 63.63% <63.63%> (ø)
internal/manager/run.go 49.29% <100.00%> (+1.46%) ⬆️
internal/proxy/clientgo_cached_proxy_resolver.go 61.58% <0.00%> (+1.32%) ⬆️
...trollers/configuration/zz_generated_controllers.go 34.63% <0.00%> (+2.03%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a503e5...c8fdc40. Read the comment docs.

Contributor

@mflendrich mflendrich left a comment


Railgun should not be concerned with installation of Prometheus.

If you need Prometheus for local development, have you considered using some Prometheus helm chart or the Prometheus Operator?

@ccfishk
Contributor Author

ccfishk commented Jul 8, 2021

Railgun should not be concerned with installation of Prometheus.

If you need Prometheus for local development, have you considered using some Prometheus helm chart or the Prometheus Operator?

Right; fundamentally they are the same tooling, Prometheus. The advantage of this approach is that it is easy to configure and access.

@ccfishk
Contributor Author

ccfishk commented Jul 9, 2021

This will be used in the next release; closing for now.

@shaneutt
Contributor

shaneutt commented Jul 11, 2021

I wanted to point out that KTF now has the concept of addons for clusters, which can be used for testing e.g.:

	metallb := metallbaddon.New()
	kong := kongaddon.New()
	builder := environment.NewBuilder().WithAddons(kong, metallb)
	env, err := builder.Build(ctx)

If you're looking to add prometheus to your testing environment, we could add an issue to KTF to add that as an available cluster addon and then maybe look into that after KIC 2.0 beta releases.

@ccfishk
Contributor Author

ccfishk commented Jul 12, 2021

I wanted to point out that KTF now has the concept of addons for clusters, which can be used for testing e.g.:

	metallb := metallbaddon.New()
	kong := kongaddon.New()
	builder := environment.NewBuilder().WithAddons(kong, metallb)
	env, err := builder.Build(ctx)

If you're looking to add prometheus to your testing environment, we could add an issue to KTF to add that as an available cluster addon and then maybe look into that after KIC 2.0 beta releases.

As part of KIC 2.0, we want to use Prometheus tooling to measure controller performance, as compared to the way the proxy collects data.

@shaneutt
Contributor

I wanted to point out that KTF now has the concept of addons for clusters, which can be used for testing e.g.:

	metallb := metallbaddon.New()
	kong := kongaddon.New()
	builder := environment.NewBuilder().WithAddons(kong, metallb)
	env, err := builder.Build(ctx)

If you're looking to add prometheus to your testing environment, we could add an issue to KTF to add that as an available cluster addon and then maybe look into that after KIC 2.0 beta releases.

As part of KIC 2.0, we want to use Prometheus tooling to measure controller performance, as compared to the way the proxy collects data.

Yes, that sounds good. Do you have any thoughts about adding a prometheus addon to the KTF?

@ccfishk ccfishk temporarily deployed to Configure ci July 14, 2021 18:52 Inactive
@ccfishk ccfishk added area/feature New feature or request priority/medium size/small and removed area/maintenance Cleanup, refactoring, and other maintenance improvements that don't change functionality. labels Jul 14, 2021
@ccfishk ccfishk changed the title [kic-2.0] configure prometheus for controller collecting cpu/memory/network metrics [kic-2.0] configure controller prometheus endpoint for collecting metrics Jul 14, 2021
@ccfishk ccfishk changed the title [kic-2.0] configure controller prometheus endpoint for collecting metrics [chore] configure controller prometheus endpoint for collecting metrics Jul 14, 2021
@ccfishk
Contributor Author

ccfishk commented Jul 14, 2021

I wanted to point out that KTF now has the concept of addons for clusters, which can be used for testing e.g.:

	metallb := metallbaddon.New()
	kong := kongaddon.New()
	builder := environment.NewBuilder().WithAddons(kong, metallb)
	env, err := builder.Build(ctx)

If you're looking to add prometheus to your testing environment, we could add an issue to KTF to add that as an available cluster addon and then maybe look into that after KIC 2.0 beta releases.

As part of KIC 2.0, we want to use Prometheus tooling to measure controller performance, as compared to the way the proxy collects data.

Yes, that sounds good. Do you have any thoughts about adding a prometheus addon to the KTF?

I will create an issue there for the Prometheus addon.

@mflendrich
Contributor

mflendrich commented Jul 21, 2021

We're putting this on hold for now until all other items blocking the KIC 2.0 beta release are resolved - because #705 has lower priority than finishing KIC 2.0.

@ccfishk
Contributor Author

ccfishk commented Jul 26, 2021

Closing temporarily.

@ccfishk ccfishk closed this Jul 26, 2021
@ccfishk ccfishk changed the title [chore] configure controller prometheus endpoint for collecting metrics [wip][chore] configure controller prometheus endpoint for collecting metrics Jul 26, 2021
@ccfishk ccfishk reopened this Jul 26, 2021
@ccfishk ccfishk changed the title [wip][chore] configure controller prometheus endpoint for collecting metrics [chore] configure controller prometheus endpoint for collecting metrics Jul 26, 2021
@ccfishk
Contributor Author

ccfishk commented Jul 26, 2021

Michal mentioned he wants to take a look at this, so I'm removing WIP for now.

@ccfishk ccfishk linked an issue Jul 27, 2021 that may be closed by this pull request
4 tasks
@shaneutt
Contributor

I was looking at this branch in terms of #1591 to see if I could proactively rebase it for you @ccfishk, but the changes here in the prometheus/ directory don't seem to fit with the root. Let me know how you want to proceed, and whether you're OK with me rebasing them into a different directory.

@rainest
Contributor

rainest commented Jul 28, 2021

I think we should be fine to just close this and open up new PRs for a revised solution to #705 and #1497. This one has a rather confusing history since it was originally installing Prometheus and then moved to creating a new listener for it, whereas what we want now for #705 is to add new metrics into the existing kubebuilder metrics server. That server is already working and contains built-in metrics for the controller, so for example, at 10c0b42:

$ kubectl port-forward trk-kong-c7d47997f-lx76q 10255:10255
15:53:55-0700 yagody $ http localhost:10255/metrics | grep 'controller="ingress"' | grep reconcile | grep result
controller_runtime_reconcile_total{controller="ingress",result="error"} 0
controller_runtime_reconcile_total{controller="ingress",result="requeue"} 0
controller_runtime_reconcile_total{controller="ingress",result="requeue_after"} 0
controller_runtime_reconcile_total{controller="ingress",result="success"} 2
15:54:16-0700 yagody $ kubectl get ing --all-namespaces
NAMESPACE   NAME             CLASS    HOSTS                ADDRESS   PORTS   AGE
bar         demo             <none>   *                              80      4m46s
default     trk-kong-proxy   <none>   proxy.kong.example             80      4m31s
foo         demo             <none>   *                              80      5m13s

Class display is broken because we're still on v1beta1, but the two demo Ingresses have the kong class.
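
The metric lines shown above follow the Prometheus text exposition format. As an illustrative, stdlib-only sketch (not code from this PR or from controller-runtime), rendering one such sample looks like:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// expositionLine renders one sample in the Prometheus text exposition
// format, e.g.:
//   controller_runtime_reconcile_total{controller="ingress",result="success"} 2
// The real lines above come from the kubebuilder/controller-runtime
// metrics server; this helper only illustrates the format.
func expositionLine(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println(expositionLine(
		"controller_runtime_reconcile_total",
		map[string]string{"controller": "ingress", "result": "success"},
		2,
	))
	// prints: controller_runtime_reconcile_total{controller="ingress",result="success"} 2
}
```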

@ccfishk
Contributor Author

ccfishk commented Jul 28, 2021

I was looking at this branch in terms of #1591 to see if I could proactively rebase it for you @ccfishk, but the changes here in the prometheus/ directory don't seem to fit with the root. Let me know how you want to proceed, and whether you're OK with me rebasing them into a different directory.

@shaneutt I am OK with rebasing them into the internal directory.

Contributor

@mflendrich mflendrich left a comment


On the high level:

  • This PR adds a Prometheus /metrics endpoint but Railgun already has a /metrics endpoint
  • This PR adds /healthz and /readyz endpoints but Railgun already has both
  • This PR adds pprof endpoints but Railgun already has these
  • This PR adds a /build endpoint and I don't understand its purpose
  • This PR adds a /stop endpoint and I don't understand its purpose, plus I see it as a DoS security vulnerability

I think that the best way to add metrics (as described in the PR title and description) would be by publishing additional metrics on the /metrics endpoint already exposed by controller-runtime.
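
A minimal sketch of that suggestion, assuming controller-runtime's global metrics registry (`sigs.k8s.io/controller-runtime/pkg/metrics`) and the Prometheus Go client; the metric name and label here are hypothetical, not from this PR:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// configPushCount is a hypothetical custom metric for illustration only.
var configPushCount = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "ingress_controller_config_pushes_total",
		Help: "Configuration pushes to the Kong proxy, by result.",
	},
	[]string{"result"},
)

func init() {
	// Registering on controller-runtime's global registry publishes the
	// metric on the /metrics endpoint the manager already serves, so no
	// additional HTTP server is needed.
	ctrlmetrics.Registry.MustRegister(configPushCount)
}

func main() {
	// A sync loop would then record outcomes like:
	configPushCount.WithLabelValues("success").Inc()
}
```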

logger logr.Logger,
wg *sync.WaitGroup) {
defer wg.Done()
mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {

Railgun has a healthz check already - here is a test that verifies that:

func TestHealthEndpoint(t *testing.T) {

There's no need to add another one.

w.WriteHeader(http.StatusOK)
})

mux.Handle("/metrics", promhttp.Handler())

likewise, Railgun has a Prometheus metrics listener already, too:

func TestMetricsEndpoint(t *testing.T) {


mux.Handle("/metrics", promhttp.Handler())

mux.HandleFunc("/build", func(w http.ResponseWriter, r *http.Request) {

what is the use of this endpoint? why do we need it?

}
})

mux.HandleFunc("/stop", func(w http.ResponseWriter, r *http.Request) {

why do we need this? If there is no demonstrated need, I suggest dropping this:

  • because YAGNI
  • for security reasons: this allows everybody with raw HTTP access to KIC to perform a DoS attack by killing the controller.

}
})

if enableProfiling {

We already have a diagnostics server exposing pprof:

func installProfilingHandlers(mux *http.ServeMux) {

@@ -112,6 +112,7 @@ func Run(ctx context.Context, c *config.Config) error {
}

go FlipKnativeController(mgr, proxy, &c.KnativeIngressEnabled, c, setupLog)
go RunHTTP(setupLog)

Other additional HTTP servers we have (that are not part of controller-runtime) are started under rootcmd.Run and not under manager.Run:

func Run(ctx context.Context, c *manager.Config) error {

This means that, if we want to add more HTTP servers to Railgun, we should be doing that in rootcmd.Run. But given the other comments (all the endpoints offered by the newly added HTTP server are either duplicates of services Railgun already offers, or raised my thoughts about their usefulness) we may well end up without that additional HTTP server.

const classSpec = "IngressClassName"
const (
// classSpec indicates the fieldName for objects which support indicating their Ingress Class by spec
classSpec = "IngressClassName"

unused constant

@shaneutt
Contributor

I was looking at this branch in terms of #1591 to see if I could proactively rebase it for you @ccfishk, but the changes here in the prometheus/ directory don't seem to fit with the root. Let me know how you want to proceed, and whether you're OK with me rebasing them into a different directory.

@shaneutt I am OK with rebasing them into the internal directory.

OK! I've made the rebase for you, which includes the adjustments from next. Look it over and make sure everything seems 👍, or let me know if you need any help 🖖

@ccfishk
Contributor Author

ccfishk commented Jul 29, 2021


On the high level:

  • This PR adds a Prometheus /metrics endpoint but Railgun already has a /metrics endpoint
  • This PR adds /healthz and /readyz endpoints but Railgun already has both
  • This PR adds pprof endpoints but Railgun already has these
  • This PR adds a /build endpoint and I don't understand its purpose
  • This PR adds a /stop endpoint and I don't understand its purpose, plus I see it as a DoS security vulnerability

I think that the best way to add metrics (as described in the PR title and description) would be by publishing additional metrics on the /metrics endpoint already exposed by controller-runtime.

Agreed @mflendrich; publishing the needed metrics via controller-runtime is exactly the change I am going to work on.

@ccfishk
Contributor Author

ccfishk commented Jul 29, 2021

I think we should be fine to just close this and open up new PRs for a revised solution to #705 and #1497. This one has a rather confusing history since it was originally installing Prometheus and then moved to creating a new listener for it, whereas what we want now for #705 is to add new metrics into the existing kubebuilder metrics server. That server is already working and contains built-in metrics for the controller, so for example, at 10c0b42:

$ kubectl port-forward trk-kong-c7d47997f-lx76q 10255:10255
15:53:55-0700 yagody $ http localhost:10255/metrics | grep 'controller="ingress"' | grep reconcile | grep result
controller_runtime_reconcile_total{controller="ingress",result="error"} 0
controller_runtime_reconcile_total{controller="ingress",result="requeue"} 0
controller_runtime_reconcile_total{controller="ingress",result="requeue_after"} 0
controller_runtime_reconcile_total{controller="ingress",result="success"} 2
15:54:16-0700 yagody $ kubectl get ing --all-namespaces
NAMESPACE   NAME             CLASS    HOSTS                ADDRESS   PORTS   AGE
bar         demo             <none>   *                              80      4m46s
default     trk-kong-proxy   <none>   proxy.kong.example             80      4m31s
foo         demo             <none>   *                              80      5m13s

Class display is broken because we're still on v1beta1, but the two demo Ingresses have the kong class.

@rainest Got your point: re-use the existing metrics server and return the data to the user from there.

Sorry about the churn in this PR, but the good thing is that each change based on review comments has been addressed accordingly.

@ccfishk
Contributor Author

ccfishk commented Jul 30, 2021

Per offline Slack discussion and the comments here, closing this PR; a new one will be opened.

@ccfishk ccfishk closed this Jul 30, 2021
@shaneutt shaneutt deleted the controller-prometheus branch October 27, 2021 21:05

Successfully merging this pull request may close these issues.

Prometheus metrics for the controller
5 participants