[vpa] Documentation for configuration options #3784

Closed
chris-vest opened this issue Dec 24, 2020 · 13 comments
Labels
area/vertical-pod-autoscaler kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@chris-vest

Which component are you using?:

VPA recommender.

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Better user experience and easier debugging.

Describe the solution you'd like.:

Documentation for these VPA recommender configuration options:

--add-dir-header="false"
--address=":8942"
--alsologtostderr="false"
--checkpoints-gc-interval="10m0s"
--checkpoints-timeout="1m0s"
--container-name-label="name"
--container-namespace-label="namespace"
--container-pod-name-label="pod_name"
--cpu-histogram-decay-half-life="24h0m0s"
--history-length="8d"
--history-resolution="1h"
--kube-api-burst="10"
--kube-api-qps="5"
--log-backtrace-at=":0"
--log-dir=""
--log-file=""
--log-file-max-size="1800"
--logtostderr="true"
--memory-aggregation-interval="24h0m0s"
--memory-aggregation-interval-count="8"
--memory-histogram-decay-half-life="24h0m0s"
--memory-saver="false"
--metric-for-pod-labels="up{job=\"kubernetes-pods\"}"
--min-checkpoints="10"
--pod-label-prefix="pod_label_"
--pod-name-label="kubernetes_pod_name"
--pod-namespace-label="kubernetes_namespace"
--pod-recommendation-min-cpu-millicores="15"
--pod-recommendation-min-memory-mb="100"
--prometheus-address="http://thanos-querier.monitoring.svc.cluster.local:9090"
--prometheus-cadvisor-job-name="kubernetes-nodes-cadvisor"
--prometheus-query-timeout="5m"
--recommendation-margin-fraction="0.15"
--recommender-interval="1m0s"
--skip-headers="false"
--skip-log-headers="false"
--stderrthreshold="2"
--storage="prometheus"
--v="6"
--vmodule=""
--vpa-object-namespace=""

Granted, some of these are pretty self-explanatory, but some are not obvious. For example, the pod-label-prefix configuration option: how is that used, and do I need to configure it? I know other people might wonder that, because I certainly did. Users shouldn't have to dig through the code in order to understand what these options do.

Describe any alternative solutions you've considered.:

Little to no documentation, as it stands now - I feel like that's not an ideal scenario.

@chris-vest chris-vest added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 24, 2020
@bskiba
Member

bskiba commented Jan 12, 2021

We do not have up-to-date documentation of the parameters (I suppose it would get out of date very quickly), but you can run the binary with the --help option to get the flag descriptions:
docker run -it k8s.gcr.io/autoscaling/vpa-recommender:0.9.0 ./vpa-recommender --help

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 12, 2021
@jppitout

jppitout commented Jun 8, 2021

Please could we get more complete documentation on setting up Prometheus as a history provider for the VPA recommender component?

For example how to customize it and verify that it is indeed working? Also including whether CPU and memory queries are customizable or not?

It would be nice if the documentation could include which jobs get queried e.g. is it the "kubernetes-nodes-cadvisor" and "kubernetes-pods", just the one, or are there more?

Running the previously recommended command:

docker run -it k8s.gcr.io/autoscaling/vpa-recommender:0.9.2 ./vpa-recommender --help

The descriptions of these options are too similar i.e. "Label name to look for container names"... are they all looking for container names (or is one looking for pod names)?... are they used in conjunction or either/or? :

      --container-name-label string                   Label name to look for container names (default "name")
      --container-namespace-label string              Label name to look for container names (default "namespace")
      --container-pod-name-label string               Label name to look for container names (default "pod_name")
      --pod-name-label string                         Label name to look for container names (default "kubernetes_pod_name")
      --pod-namespace-label string                    Label name to look for container names (default "kubernetes_namespace")

I'm having trouble wrapping my head around when above and below options should be used:

      --metric-for-pod-labels string                  Which metric to look for pod labels in metrics (default "up{job=\"kubernetes-pods\"}")
      --pod-label-prefix string                       Which prefix to look for pod labels in metrics (default "pod_label_")

Would it be possible to provide examples and/or elaborate on all of the above?

@jppitout

jppitout commented Jun 8, 2021

Here are some instances where such docs might have helped:

@jppitout

jppitout commented Jun 8, 2021

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 8, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 6, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 6, 2021
@davidquarles

Please could we get more complete documentation on setting up Prometheus as a history provider for the VPA recommender component?

For example how to customize it and verify that it is indeed working? Also including whether CPU and memory queries are customizable or not?

It would be nice if the documentation could include which jobs get queried e.g. is it the "kubernetes-nodes-cadvisor" and "kubernetes-pods", just the one, or are there more?

Running the previously recommended command:

docker run -it k8s.gcr.io/autoscaling/vpa-recommender:0.9.2 ./vpa-recommender --help

The descriptions of these options are too similar i.e. "Label name to look for container names"... are they all looking for container names (or is one looking for pod names)?... are they used in conjunction or either/or? :

      --container-name-label string                   Label name to look for container names (default "name")
      --container-namespace-label string              Label name to look for container names (default "namespace")
      --container-pod-name-label string               Label name to look for container names (default "pod_name")
      --pod-name-label string                         Label name to look for container names (default "kubernetes_pod_name")
      --pod-namespace-label string                    Label name to look for container names (default "kubernetes_namespace")

I'm having trouble wrapping my head around when above and below options should be used:

      --metric-for-pod-labels string                  Which metric to look for pod labels in metrics (default "up{job=\"kubernetes-pods\"}")
      --pod-label-prefix string                       Which prefix to look for pod labels in metrics (default "pod_label_")

Would it be possible to provide examples and/or elaborate on all of the above?

I am / we are absurdly grateful for this project and the value it provides, having used it extensively over the last few years, but after fighting the Prometheus integration setup for the first time for a while last night, I agree with this. It is rather obtuse trying to figure out what is going on with these options, and it requires a detailed analysis of the underlying codebase. Even after doing so, I wasn't successful.

It also isn't super clear what happens to the existing checkpoints when migrating storage backends and what behavior one can expect to occur in this process, which is a bit scary given that we've already littered our production environment with VPA.

My fragile understanding thus far:

  • The cadvisor metrics are range-queried in bulk and the --container-*-label flags map to the labels in those metrics
  • Those metrics are matched against pod series contained in the --metric-for-pod-labels metric / query, i.e. container-namespace-label == pod-namespace-label && container-pod-name-label == pod-name-label
  • Additional labels found on the metric-for-pod-labels series are parsed into memory (anything prefixed by pod-label-prefix, i.e. with --pod-label-prefix=label_, label_foo="bar" => foo: bar)
  • I'm guessing the labels are then used to match against the VPA target's selector? I did not get that far up the callstack when stepping through the code.
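
The matching described in the bullets above can be sketched in Python as follows. This only illustrates the commenter's stated understanding; it is not the recommender's actual code (which is written in Go). The names `LabelConfig`, `extract_pod_labels` and `join_containers_to_pod_labels`, and the representation of a Prometheus series as a plain dictionary of labels, are assumptions made for this example. The defaults mirror the flag defaults shown in the --help output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelConfig:
    """Hypothetical holder for the label-related flags (defaults from --help)."""
    container_name_label: str = "name"
    container_namespace_label: str = "namespace"
    container_pod_name_label: str = "pod_name"
    pod_name_label: str = "kubernetes_pod_name"
    pod_namespace_label: str = "kubernetes_namespace"
    pod_label_prefix: str = "pod_label_"


def extract_pod_labels(series, cfg):
    """Return ((namespace, pod), labels) for one --metric-for-pod-labels series.

    Only labels starting with cfg.pod_label_prefix are kept, with the
    prefix stripped, e.g. label_foo="bar" becomes foo: bar.
    """
    key = (series.get(cfg.pod_namespace_label), series.get(cfg.pod_name_label))
    prefix = cfg.pod_label_prefix
    labels = {k[len(prefix):]: v for k, v in series.items() if k.startswith(prefix)}
    return key, labels


def join_containers_to_pod_labels(container_series, pod_series, cfg):
    """Attach pod labels to each cAdvisor container series.

    A container series matches a pod series when
    container-namespace-label == pod-namespace-label and
    container-pod-name-label == pod-name-label.
    """
    pods = {}
    for s in pod_series:
        key, labels = extract_pod_labels(s, cfg)
        if None not in key:  # skip series missing the namespace or pod name
            pods[key] = labels

    joined = {}
    for s in container_series:
        key = (s.get(cfg.container_namespace_label),
               s.get(cfg.container_pod_name_label))
        name = s.get(cfg.container_name_label)
        if key in pods and name is not None:
            joined[(key[0], key[1], name)] = pods[key]
    return joined


# Illustrative data only; the label values are made up for this example.
cfg = LabelConfig(pod_label_prefix="label_")
pod_series = [{
    "kubernetes_namespace": "velero",
    "kubernetes_pod_name": "velero-6778d944c5-t5xqj",
    "label_foo": "bar",
}]
container_series = [{
    "namespace": "velero",
    "pod_name": "velero-6778d944c5-t5xqj",
    "name": "velero",
}]
print(join_containers_to_pod_labels(container_series, pod_series, cfg))
```

If the sketch is right, a mismatch between the label names on the two sets of series (for example, `pod` on one side and `pod_name` on the other) would leave containers without any pod labels, which is one thing worth checking when the integration appears to do nothing.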

Is that all accurate? I tried using the kube-state-metrics kube_pod_labels for --metric-for-pod-labels, since our Prometheus config only scrapes pods with the scrape annotation and the default up{job="kubernetes-pods"} is thus filtered, but something is still amiss and I was seeing lots of these before I gave up for the evening:

Error adding metric sample for container {{velero velero-6778d944c5-t5xqj} velero}: sample discarded (invalid or out of order)

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.


@michaelswierszcz

Usage of /recommender:
      --add-dir-header                                If true, adds the file directory to the header
      --address string                                The address to expose Prometheus metrics. (default ":8942")
      --alsologtostderr                               log to standard error as well as files
      --checkpoints-gc-interval duration              How often orphaned checkpoints should be garbage collected (default 10m0s)
      --checkpoints-timeout duration                  Timeout for writing checkpoints since the start of the recommender's main loop (default 1m0s)
      --container-name-label string                   Label name to look for container names (default "name")
      --container-namespace-label string              Label name to look for container names (default "namespace")
      --container-pod-name-label string               Label name to look for container names (default "pod_name")
      --cpu-histogram-decay-half-life duration        The amount of time it takes a historical CPU usage sample to lose half of its weight. (default 24h0m0s)
      --history-length string                         How much time back prometheus have to be queried to get historical metrics (default "8d")
      --history-resolution string                     Resolution at which Prometheus is queried for historical metrics (default "1h")
      --kube-api-burst float                          QPS burst limit when making requests to Kubernetes apiserver (default 10)
      --kube-api-qps float                            QPS limit when making requests to Kubernetes apiserver (default 5)
      --log-backtrace-at traceLocation                when logging hits line file:N, emit a stack trace (default :0)
      --log-dir string                                If non-empty, write log files in this directory
      --log-file string                               If non-empty, use this log file
      --log-file-max-size uint                        Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --logtostderr                                   log to standard error instead of files (default true)
      --memory-aggregation-interval duration          The length of a single interval, for which the peak memory usage is computed. Memory usage peaks are aggregated in multiples of this interval. In other words there is one memory usage sample per interval (the maximum usage over that interval) (default 24h0m0s)
      --memory-aggregation-interval-count int         The number of consecutive memory-aggregation-intervals which make up the MemoryAggregationWindowLength which in turn is the period for memory usage aggregation by VPA. In other words, MemoryAggregationWindowLength = memory-aggregation-interval * memory-aggregation-interval-count. (default 8)
      --memory-histogram-decay-half-life duration     The amount of time it takes a historical memory usage sample to lose half of its weight. In other words, a fresh usage sample is twice as 'important' as one with age equal to the half life period. (default 24h0m0s)
      --memory-saver                                  If true, only track pods which have an associated VPA
      --metric-for-pod-labels string                  Which metric to look for pod labels in metrics (default "up{job=\"kubernetes-pods\"}")
      --min-checkpoints int                           Minimum number of checkpoints to write per recommender's main loop (default 10)
      --pod-label-prefix string                       Which prefix to look for pod labels in metrics (default "pod_label_")
      --pod-name-label string                         Label name to look for container names (default "kubernetes_pod_name")
      --pod-namespace-label string                    Label name to look for container names (default "kubernetes_namespace")
      --pod-recommendation-min-cpu-millicores float   Minimum CPU recommendation for a pod (default 25)
      --pod-recommendation-min-memory-mb float        Minimum memory recommendation for a pod (default 250)
      --prometheus-address string                     Where to reach for Prometheus metrics
      --prometheus-cadvisor-job-name string           Name of the prometheus job name which scrapes the cAdvisor metrics (default "kubernetes-cadvisor")
      --prometheus-query-timeout string               How long to wait before killing long queries (default "5m")
      --recommendation-margin-fraction float          Fraction of usage added as the safety margin to the recommended request (default 0.15)
      --recommender-interval duration                 How often metrics should be fetched (default 1m0s)
      --skip-headers                                  If true, avoid header prefixes in the log messages
      --skip-log-headers                              If true, avoid headers when opening log files
      --stderrthreshold severity                      logs at or above this threshold go to stderr (default 2)
      --storage string                                Specifies storage mode. Supported values: prometheus, checkpoint (default)
  -v, --v Level                                       number for the log level verbosity
      --vmodule moduleSpec                            comma-separated list of pattern=N settings for file-filtered logging
      --vpa-object-namespace string                   Namespace to search for VPA objects and pod stats. Empty means all namespaces will be used.
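
A few of the descriptions above contain arithmetic that a small worked example may make easier to follow. The Python sketch below is only a reading of the help text, not the recommender's implementation: it assumes that "lose half of its weight" means exponential decay, and that the safety margin is applied as usage × (1 + fraction). The function names are hypothetical.

```python
from datetime import timedelta


def sample_weight(age, half_life):
    """Relative weight of a usage sample of the given age.

    Assumes exponential decay: a sample whose age equals the half-life
    (--cpu-histogram-decay-half-life / --memory-histogram-decay-half-life)
    counts half as much as a fresh one.
    """
    return 0.5 ** (age / half_life)  # timedelta / timedelta yields a float


def memory_aggregation_window(interval, count):
    """MemoryAggregationWindowLength =
    memory-aggregation-interval * memory-aggregation-interval-count."""
    return interval * count


def with_safety_margin(usage, margin_fraction):
    """Usage plus --recommendation-margin-fraction of itself (assumed formula)."""
    return usage * (1.0 + margin_fraction)


half_life = timedelta(hours=24)
print(sample_weight(timedelta(hours=48), half_life))      # two half-lives old
print(memory_aggregation_window(timedelta(hours=24), 8))  # default window
print(with_safety_margin(200.0, 0.15))                    # default margin
```

With the defaults (24h interval, count 8), the memory aggregation window comes out at 8 days, which lines up with the default --history-length of "8d".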

@jianlong0808

I think this source code can explain it:

[image]
