When new metrics/labels/exporters are added to be scraped by Prometheus, make sure the following list **is updated** as well, so that it stays clear which metrics/labels are needed and which are not.

The following is a list of the metrics that are currently in use.

#### Cortex metrics
1. cortex_in_flight_requests with the following labels:
   1. api_name
1. cortex_async_request_count with the following labels:
   1. api_name
   1. api_kind
   1. status_code
1. cortex_async_queue_length with the following labels:
   1. api_name
   1. api_kind
1. cortex_async_latency_bucket with the following labels:
   1. api_name
   1. api_kind
1. cortex_batch_succeeded with the following labels:
   1. api_name
1. cortex_batch_failed with the following labels:
   1. api_name
1. cortex_time_per_batch_sum with the following labels:
   1. api_name
1. cortex_time_per_batch_count with the following labels:
   1. api_name
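For example, the time-per-batch metrics are meant to be combined into an average; a query along these lines (the 5m window is an arbitrary choice) gives the average time per batch for each API:

```
sum(rate(cortex_time_per_batch_sum[5m])) by (api_name)
  / sum(rate(cortex_time_per_batch_count[5m])) by (api_name)
```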
#### Istio metrics

1. istio_requests_total with the following labels:
   1. destination_service
   1. response_code
1. istio_request_duration_milliseconds_bucket with the following labels:
   1. destination_service
   1. le
1. istio_request_duration_milliseconds_sum with the following labels:
   1. destination_service
1. istio_request_duration_milliseconds_count with the following labels:
   1. destination_service
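The `le` label on the duration histogram is kept so that latency quantiles can be computed, e.g. a p99 per destination service (the quantile and 5m window are illustrative):

```
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (destination_service, le)
)
```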
#### Kubelet metrics

1. container_cpu_usage_seconds_total with the following labels:
   1. pod
   1. container
   1. name
1. container_memory_working_set_bytes with the following labels:
   1. pod
   1. name
   1. container
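These labels are enough to break usage down per pod and container, for example (the `container!=""` filter drops the pod-level cgroup series that are also exported):

```
# CPU cores used per container
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container)

# Working-set memory per container
sum(container_memory_working_set_bytes{container!=""}) by (pod, container)
```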
#### Kube-state-metrics metrics

1. kube_pod_container_resource_requests with the following labels:
   1. exported_pod
   1. resource
   1. exported_container (required to avoid dropping the values for each container of each pod)
1. kube_pod_info with the following labels:
   1. exported_pod
1. kube_deployment_status_replicas_available with the following labels:
   1. deployment
1. kube_job_status_active with the following labels:
   1. job_name
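For example, CPU requests per pod can be summed across containers; the `exported_container` label is what keeps each container's request as a separate series before the sum:

```
sum(kube_pod_container_resource_requests{resource="cpu"}) by (exported_pod)
```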
#### DCGM metrics

1. DCGM_FI_DEV_GPU_UTIL with the following labels:
   1. exported_pod
1. DCGM_FI_DEV_FB_USED with the following labels:
   1. exported_pod
1. DCGM_FI_DEV_FB_FREE with the following labels:
   1. exported_pod
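For example, GPU utilisation and framebuffer memory in use per pod:

```
# Average GPU utilisation across the GPUs assigned to each pod
avg(DCGM_FI_DEV_GPU_UTIL) by (exported_pod)

# Framebuffer memory in use per pod
sum(DCGM_FI_DEV_FB_USED) by (exported_pod)
```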
#### Node metrics

1. node_cpu_seconds_total with the following labels:
   1. job
   1. mode
   1. instance
   1. cpu
1. node_load1 with the following labels:
   1. job
   1. instance
1. node_load5 with the following labels:
   1. job
   1. instance
1. node_load15 with the following labels:
   1. job
   1. instance
1. node_exporter_build_info with the following labels:
   1. job
   1. instance
1. node_memory_MemTotal_bytes with the following labels:
   1. job
   1. instance
1. node_memory_MemFree_bytes with the following labels:
   1. job
   1. instance
1. node_memory_Buffers_bytes with the following labels:
   1. job
   1. instance
1. node_memory_Cached_bytes with the following labels:
   1. job
   1. instance
1. node_memory_MemAvailable_bytes with the following labels:
   1. job
   1. instance
1. node_disk_read_bytes_total with the following labels:
   1. job
   1. instance
   1. device
1. node_disk_written_bytes_total with the following labels:
   1. job
   1. instance
   1. device
1. node_disk_io_time_seconds_total with the following labels:
   1. job
   1. instance
   1. device
1. node_filesystem_size_bytes with the following labels:
   1. job
   1. instance
   1. fstype
   1. mountpoint
   1. device
1. node_filesystem_avail_bytes with the following labels:
   1. job
   1. instance
   1. fstype
   1. device
1. node_network_receive_bytes_total with the following labels:
   1. job
   1. instance
   1. device
1. node_network_transmit_bytes_total with the following labels:
   1. job
   1. instance
   1. device
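For example, the network counters with their `device` label give per-interface throughput (the `lo` interface is usually excluded, as the recording rules below also do):

```
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
```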
##### Prometheus rules for the node exporter

1. instance:node_cpu_utilisation:rate1m from the following metrics:
   1. node_cpu_seconds_total with the following labels:
      1. job
      1. mode
1. instance:node_num_cpu:sum from the following metrics:
   1. node_cpu_seconds_total with the following labels:
      1. job
1. instance:node_load1_per_cpu:ratio from the following metrics:
   1. node_load1 with the following labels:
      1. job
1. instance:node_memory_utilisation:ratio from the following metrics:
   1. node_memory_MemTotal_bytes with the following labels:
      1. job
   1. node_memory_MemAvailable_bytes with the following labels:
      1. job
1. instance:node_vmstat_pgmajfault:rate1m from the following metrics:
   1. node_vmstat_pgmajfault with the following labels:
      1. job
1. instance_device:node_disk_io_time_seconds:rate1m from the following metrics:
   1. node_disk_io_time_seconds_total with the following labels:
      1. job
      1. device
1. instance_device:node_disk_io_time_weighted_seconds:rate1m from the following metrics:
   1. node_disk_io_time_weighted_seconds with the following labels:
      1. job
      1. device
1. instance:node_network_receive_bytes_excluding_lo:rate1m from the following metrics:
   1. node_network_receive_bytes_total with the following labels:
      1. job
      1. device
1. instance:node_network_transmit_bytes_excluding_lo:rate1m from the following metrics:
   1. node_network_transmit_bytes_total with the following labels:
      1. job
      1. device
1. instance:node_network_receive_drop_excluding_lo:rate1m from the following metrics:
   1. node_network_receive_drop_total with the following labels:
      1. job
      1. device
1. instance:node_network_transmit_drop_excluding_lo:rate1m from the following metrics:
   1. node_network_transmit_drop_total with the following labels:
      1. job
      1. device
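These rule names follow the node-exporter mixin convention; as an illustration, the first one is defined by a recording rule of roughly this shape (a sketch, not necessarily the exact rule deployed here):

```yaml
groups:
  - name: node-exporter.rules
    rules:
      # Per-instance average of non-idle CPU time over the last minute.
      - record: instance:node_cpu_utilisation:rate1m
        expr: |
          1 - avg without (cpu, mode) (
            rate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[1m])
          )
```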
## Re-introducing dropped metrics/labels

If you need to add some metrics/labels back for a particular use case, comment out every `metricRelabelings:` section (except the one in the `prometheus-operator.yaml` file), determine which metrics/labels you want to add back (e.g. by using the explorer in Grafana), and then re-edit the appropriate `metricRelabelings:` sections so that those metrics/labels are no longer dropped.
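As a rough sketch of what such an edit looks like (the metric and label names below are illustrative, not the actual expressions used by the service monitors), a `metricRelabelings:` section that keeps only a whitelist of metrics and drops an unused label has this shape:

```yaml
metricRelabelings:
  # Keep only the metrics that are actually used (illustrative regex).
  - sourceLabels: [__name__]
    regex: "cortex_in_flight_requests|cortex_async_request_count|cortex_async_queue_length"
    action: keep
  # Drop a label that is not needed (illustrative label name).
  - regex: "pod_template_hash"
    action: labeldrop
```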
## Prometheus Analysis

### Go Pprof
To analyse the memory allocations of Prometheus, run `kubectl port-forward prometheus-prometheus-0 9090:9090`, and then run `go tool pprof -symbolize=remote -inuse_space localhost:9090/debug/pprof/heap`. Once you get the interactive prompt, you can run `top` or `dot` for a more detailed view of the memory usage.
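The same steps as shell commands (assuming your current kubectl context points at the cluster and namespace where the `prometheus-prometheus-0` pod runs):

```sh
# Forward the Prometheus server port to localhost.
kubectl port-forward prometheus-prometheus-0 9090:9090

# In a second terminal: fetch the heap profile and open pprof's interactive prompt.
go tool pprof -symbolize=remote -inuse_space localhost:9090/debug/pprof/heap
# At the (pprof) prompt, `top` lists the biggest allocators and `dot` emits a call graph.
```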
### TSDB
To analyse the TSDB of Prometheus, exec into the `prometheus-prometheus-0` pod, `cd` into `/tmp`, and run an analysis on the TSDB.
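A minimal sketch of such an analysis, assuming the `promtool` binary is available in the Prometheus container image and that the TSDB data directory is mounted at `/prometheus` (both are assumptions about this particular deployment):

```sh
# Run from inside the prometheus-prometheus-0 pod.
# Prints cardinality statistics: the heaviest metric names, label pairs, etc.
promtool tsdb analyze /prometheus
```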