
Issues with TiDB Monitor #1949

Closed
DanielZhangQD opened this issue Mar 17, 2020 · 3 comments · Fixed by #1962
Labels: area/monitor, type/bug
Milestone: v1.1.0

DanielZhangQD commented Mar 17, 2020

Bug Report

What version of Kubernetes are you using?

1.12.8
What version of TiDB Operator are you using?

v1.1.0-beta.2
What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

local-storage
What's the status of the TiDB cluster pods?

Running
What did you do?

Upgraded TiDB Operator from v1.0 to v1.1 and created a TidbMonitor CR.
What did you expect to see?

The monitor works as expected.
What did you see instead?

  • The existing monitor Deployment is rolling-updated after the CR is created.
  • The Service NodePorts change after the CR is created.
    Before CR creation:
    dan10-grafana            NodePort    10.233.22.65    <none>        3000:31558/TCP                   23h
    dan10-monitor-reloader   NodePort    10.233.6.191    <none>        9089:30400/TCP                   23h
    dan10-prometheus         NodePort    10.233.16.45    <none>        9090:32184/TCP                   23h
    
    After CR creation:
    dan10-grafana            NodePort    10.233.22.65    <none>        3000:32287/TCP                   24h
    dan10-monitor-reloader   NodePort    10.233.6.191    <none>        9089:30400/TCP                   24h
    dan10-prometheus         NodePort    10.233.16.45    <none>        9090:30273/TCP                   24h
    dan10-reloader           NodePort    10.233.13.22    <none>        9089:32448/TCP                   40m
    
  • The reloader Service name is changed from <cluster>-monitor-reloader to <cluster>-reloader. (See the details in the output above.)
  • The monitor Deployment rolling-updates continuously.
    dan10-monitor-765b58fcb-bwn5l             0/3     Pending                      0          0s
    dan10-monitor-765b58fcb-bwn5l             0/3     Pending                      0          0s
    dan10-monitor-765b58fcb-bwn5l             0/3     Init:0/1                     0          1s
    dan10-monitor-765b58fcb-bwn5l             0/3     PodInitializing              0          7s
    dan10-monitor-765b58fcb-bwn5l             3/3     Running                      0          9s
    dan10-monitor-765b58fcb-bwn5l             3/3     Terminating                  0          56s
    dan10-monitor-765b58fcb-bwn5l             0/3     Terminating                  0          59s
    dan10-monitor-765b58fcb-bwn5l             0/3     Terminating                  0          60s
    dan10-monitor-765b58fcb-bwn5l             0/3     Terminating                  0          60s
    dan10-monitor-746b7bdb84-mlkcw            0/3     Pending                      0          0s
    dan10-monitor-746b7bdb84-mlkcw            0/3     Pending                      0          0s
    dan10-monitor-746b7bdb84-mlkcw            0/3     Init:0/1                     0          0s
    dan10-monitor-746b7bdb84-mlkcw            0/3     Init:0/1                     0          6s
    dan10-monitor-746b7bdb84-mlkcw            0/3     PodInitializing              0          7s
    dan10-monitor-746b7bdb84-mlkcw            3/3     Running                      0          12s
    dan10-monitor-746b7bdb84-mlkcw            3/3     Terminating                  0          27s
    dan10-monitor-746b7bdb84-mlkcw            0/3     Terminating                  0          29s
    dan10-monitor-746b7bdb84-mlkcw            0/3     Terminating                  0          30s
    dan10-monitor-746b7bdb84-mlkcw            0/3     Terminating                  0          30s
    dan10-monitor-765b58fcb-r4l9p             0/3     Pending                      0          0s
    dan10-monitor-765b58fcb-r4l9p             0/3     Pending                      0          0s
    dan10-monitor-765b58fcb-r4l9p             0/3     Init:0/1                     0          0s
    dan10-monitor-765b58fcb-r4l9p             0/3     PodInitializing              0          6s
    dan10-monitor-765b58fcb-r4l9p             3/3     Running                      0          10s
    
  • If the service.portName of a Service is changed, its NodePort changes as well.
    Before update:
    dan10-grafana            NodePort    10.233.22.65    <none>        3000:32287/TCP                   26h
    
    After update:
    dan10-grafana            NodePort    10.233.22.65    <none>        3000:30562/TCP                   26h
    
DanielZhangQD added this to the v1.1.0 milestone on Mar 17, 2020
DanielZhangQD commented:

CR:

apiVersion: pingcap.com/v1alpha1
kind: TidbMonitor
metadata:
  name: dan10
spec:
  clusters:
  - name: dan10
  prometheus:
    baseImage: prom/prometheus
    version: v2.11.1
    resources:
      limits: {}
      #   cpu: 8000m
      #   memory: 8Gi
      requests: {}
      #   cpu: 4000m
      #   memory: 4Gi
    imagePullPolicy: IfNotPresent
    logLevel: info
    reserveDays: 12
    service:
      type: NodePort
      portName: http-prometheus 
  grafana:
    baseImage: grafana/grafana
    version: 6.0.1
    imagePullPolicy: IfNotPresent
    logLevel: info
    resources:
      limits: {}
      #   cpu: 8000m
      #   memory: 8Gi
      requests: {}
      #   cpu: 4000m
      #   memory: 4Gi
    username: admin
    password: admin
    envs:
      # Configure Grafana using environment variables except GF_PATHS_DATA, GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD
      # Ref https://grafana.com/docs/installation/configuration/#using-environment-variables
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_NAME: "Main Org."
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Viewer"
      # if grafana is running behind a reverse proxy with subpath http://foo.bar/grafana
      # GF_SERVER_DOMAIN: foo.bar
      # GF_SERVER_ROOT_URL: "%(protocol)s://%(domain)s/grafana/"
    service:
      type: NodePort
      portName: http-grafana
  initializer:
    baseImage: pingcap/tidb-monitor-initializer
    version: v3.0.9
    imagePullPolicy: Always
    resources: {}
    # limits:
    #  cpu: 50m
    #  memory: 64Mi
    # requests:
    #  cpu: 50m
    #  memory: 64Mi
  reloader:
    baseImage: pingcap/tidb-monitor-reloader
    version: v1.0.1
    imagePullPolicy: IfNotPresent
    service:
      type: NodePort
      portName: tcp-reloader
    resources: {}
      # limits:
      #  cpu: 50m
      #  memory: 64Mi
      # requests:
      #  cpu: 50m
      #  memory: 64Mi
  imagePullPolicy: IfNotPresent
  persistent: true
  storageClassName: local-storage
  storage: 10Gi
  nodeSelector: {}
  annotations: {}
  tolerations: []
  kubePrometheusURL: http://prometheus-k8s.monitoring.svc:9090
  alertmanagerURL: ""

DanielZhangQD commented:

For the continuous rolling-update issue of the monitor Deployment, it is probably caused by the code at https://github.com/pingcap/tidb-operator/blob/master/pkg/monitor/monitor/util.go#L507, where, with multiple envs, the order of the envs may differ on each sync.
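
A minimal sketch of one possible fix, assuming the envs are currently built from a map (the `sortEnvVars` helper below is hypothetical, not the actual util.go code): sorting the env slice by name before building the pod template keeps the generated Deployment spec identical across syncs, so no spurious rolling update is triggered.

```go
package monitor

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// sortEnvVars is a hypothetical helper: it puts the container envs into a
// stable, name-sorted order so that two syncs over the same set of envs
// produce an identical container spec.
func sortEnvVars(envs []corev1.EnvVar) []corev1.EnvVar {
	sort.SliceStable(envs, func(i, j int) bool {
		return envs[i].Name < envs[j].Name
	})
	return envs
}
```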

DanielZhangQD commented:

For the NodePort change issue, it is probably caused by the code at https://github.com/pingcap/tidb-operator/blob/master/pkg/controller/generic_control.go#L250, where only the clusterIP is preserved during the Service update.
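
A minimal sketch of one possible fix, assuming the update path only copies spec.clusterIP today (the `retainServiceFields` helper below is hypothetical, not the actual generic_control.go code): copying the already-allocated NodePorts from the existing Service into the desired Service before the update keeps the API server from assigning new ones on every sync. Ports are matched by port number here, so renaming service.portName alone would not lose the allocation.

```go
package controller

import corev1 "k8s.io/api/core/v1"

// retainServiceFields is a hypothetical helper: it carries allocated fields
// from the existing Service over to the desired one before calling Update.
func retainServiceFields(existing, desired *corev1.Service) {
	// Keep the allocated cluster IP (this part is already done today).
	desired.Spec.ClusterIP = existing.Spec.ClusterIP

	// Also keep the allocated NodePorts, matched by port number so that
	// changing only the port name does not force a re-allocation.
	allocated := make(map[int32]int32, len(existing.Spec.Ports))
	for _, p := range existing.Spec.Ports {
		allocated[p.Port] = p.NodePort
	}
	for i, p := range desired.Spec.Ports {
		if np, ok := allocated[p.Port]; ok && p.NodePort == 0 {
			desired.Spec.Ports[i].NodePort = np
		}
	}
}
```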
