This topic gives an overview of the Prometheus and Grafana packages, and how you can install them in Tanzu Kubernetes Grid (TKG) workload clusters to provide monitoring services for the cluster.
Installation: Install Prometheus on a workload cluster in one of the following ways, based on its deployment option:
TKG on Supervisor:
Standalone management cluster: Install Prometheus in Workload Clusters Deployed by a Standalone Management Cluster
Install Grafana on a workload cluster in one of the following ways, based on its deployment option:
TKG on Supervisor:
Standalone management cluster: Install Grafana in Workload Clusters Deployed by a Standalone Management Cluster
NoteAs of v2.5, TKG does not support clusters on AWS or Azure. See the End of Support for TKG Management and Workload Clusters on AWS and Azure in the Tanzu Kubernetes Grid v2.5 Release Notes.
Prometheus collects metrics from configured targets at given intervals, evaluates rule expressions, and displays the results.
Alertmanager triggers alerts if some condition is observed to be true.
Grafana lets you query, visualize, alert on, and explore your metrics no matter where they are stored.
The following diagram shows how the monitoring components on a cluster interact.
The Prometheus package installs on a TKG cluster the containers listed in the table. The package pulls the containers from the VMware public registry specified in Package Repository.
Container | Resource Type | Replicas | Description |
---|---|---|---|
prometheus-alertmanager |
Deployment | 1 | Handles alerts sent by client applications such as the Prometheus server. |
prometheus-cadvisor |
DaemonSet | 5 | Analyzes and exposes resource usage and performance data from running containers |
prometheus-kube-state-metrics |
Deployment | 1 | Monitors node status and capacity, replica-set compliance, pod, job, and cronjob status, resource requests and limits. |
prometheus-node-exporter |
DaemonSet | 5 | Exporter for hardware and OS metrics exposed by kernels. |
prometheus-pushgateway |
Deployment | 1 | Service that allows you to push metrics from jobs which cannot be scraped. |
prometheus-server |
Deployment | 1 | Provides core functionality, including scraping, rule processing, and alerting. |
Below is an example prometheus-data-values.yaml
file.
Note the following:
prometheus.system.tanzu
(virtual_host_fqdn:).storageClassName
for the default storage policy.storageClassName
for the vSphere storage policy.namespace: prometheus-monitoring
alertmanager:
config:
alertmanager_yml: |
global: {}
receivers:
- name: default-receiver
templates:
- '/etc/alertmanager/templates/*.tmpl'
route:
group_interval: 5m
group_wait: 10s
receiver: default-receiver
repeat_interval: 3h
deployment:
replicas: 1
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
updateStrategy: Recreate
pvc:
accessMode: ReadWriteOnce
storage: 2Gi
storageClassName: default
service:
port: 80
targetPort: 9093
type: ClusterIP
ingress:
alertmanager_prefix: /alertmanager/
alertmanagerServicePort: 80
enabled: true
prometheus_prefix: /
prometheusServicePort: 80
tlsCertificate:
ca.crt: |
-----BEGIN CERTIFICATE-----
MIIFczCCA1ugAwIBAgIQTYJITQ3SZ4BBS9UzXfJIuTANBgkqhkiG9w0BAQsFADBM
...
w0oGuTTBfxSMKs767N3G1q5tz0mwFpIqIQtXUSmaJ+9p7IkpWcThLnyYYo1IpWm/
ZHtjzZMQVA==
-----END CERTIFICATE-----
tls.crt: |
-----BEGIN CERTIFICATE-----
MIIHxTCCBa2gAwIBAgITIgAAAAQnSpH7QfxTKAAAAAAABDANBgkqhkiG9w0BAQsF
...
YYsIjp7/f+Pk1DjzWx8JIAbzItKLucDreAmmDXqk+DrBP9LYqtmjB0n7nSErgK8G
sA3kGCJdOkI0kgF10gsinaouG2jVlwNOsw==
-----END CERTIFICATE-----
tls.key: |
-----BEGIN PRIVATE KEY-----
MIIJRAIBADANBgkqhkiG9w0BAQEFAASCCS4wggkqAgEAAoICAQDOGHT8I12KyQGS
...
l1NzswracGQIzo03zk/X3Z6P2YOea4BkZ0Iwh34wOHJnTkfEeSx6y+oSFMcFRthT
yfFCZUk/sVCc/C1a4VigczXftUGiRrTR
-----END PRIVATE KEY-----
virtual_host_fqdn: prometheus.system.tanzu
kube_state_metrics:
deployment:
replicas: 1
service:
port: 80
targetPort: 8080
telemetryPort: 81
telemetryTargetPort: 8081
type: ClusterIP
node_exporter:
daemonset:
hostNetwork: false
updatestrategy: RollingUpdate
service:
port: 9100
targetPort: 9100
type: ClusterIP
prometheus:
pspNames: "vmware-system-restricted"
config:
alerting_rules_yml: |
{}
alerts_yml: |
{}
prometheus_yml: |
global:
evaluation_interval: 1m
scrape_interval: 1m
scrape_timeout: 10s
rule_files:
- /etc/config/alerting_rules.yml
- /etc/config/recording_rules.yml
- /etc/config/alerts
- /etc/config/rules
scrape_configs:
- job_name: 'prometheus'
scrape_interval: 5s
static_configs:
- targets: ['localhost:9090']
- job_name: 'kube-state-metrics'
static_configs:
- targets: ['prometheus-kube-state-metrics.prometheus.svc.cluster.local:8080']
- job_name: 'node-exporter'
static_configs:
- targets: ['prometheus-node-exporter.prometheus.svc.cluster.local:9100']
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: kubernetes-nodes-cadvisor
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
- job_name: kubernetes-apiservers
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: default;kubernetes;https
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_service_name
- __meta_kubernetes_endpoint_port_name
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
alerting:
alertmanagers:
- scheme: http
static_configs:
- targets:
- alertmanager.prometheus.svc:80
- kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_namespace]
regex: default
action: keep
- source_labels: [__meta_kubernetes_pod_label_app]
regex: prometheus
action: keep
- source_labels: [__meta_kubernetes_pod_label_component]
regex: alertmanager
action: keep
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
regex: .*
action: keep
- source_labels: [__meta_kubernetes_pod_container_port_number]
regex:
action: drop
recording_rules_yml: |
groups:
- name: kube-apiserver.rules
interval: 3m
rules:
- expr: |2
(
(
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[1d]))
-
(
(
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[1d]))
or
vector(0)
)
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[1d]))
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[1d]))
)
)
+
# errors
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[1d]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[1d]))
labels:
verb: read
record: apiserver_request:burnrate1d
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[1h]))
-
(
(
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[1h]))
or
vector(0)
)
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[1h]))
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[1h]))
)
)
+
# errors
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[1h]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[1h]))
labels:
verb: read
record: apiserver_request:burnrate1h
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[2h]))
-
(
(
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[2h]))
or
vector(0)
)
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[2h]))
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[2h]))
)
)
+
# errors
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[2h]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[2h]))
labels:
verb: read
record: apiserver_request:burnrate2h
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[30m]))
-
(
(
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30m]))
or
vector(0)
)
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[30m]))
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[30m]))
)
)
+
# errors
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[30m]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[30m]))
labels:
verb: read
record: apiserver_request:burnrate30m
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[3d]))
-
(
(
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[3d]))
or
vector(0)
)
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[3d]))
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[3d]))
)
)
+
# errors
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[3d]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[3d]))
labels:
verb: read
record: apiserver_request:burnrate3d
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[5m]))
-
(
(
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[5m]))
or
vector(0)
)
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[5m]))
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[5m]))
)
)
+
# errors
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[5m]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[5m]))
labels:
verb: read
record: apiserver_request:burnrate5m
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[6h]))
-
(
(
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[6h]))
or
vector(0)
)
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[6h]))
+
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[6h]))
)
)
+
# errors
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[6h]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[6h]))
labels:
verb: read
record: apiserver_request:burnrate6h
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[1d]))
-
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[1d]))
)
+
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[1d]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[1d]))
labels:
verb: write
record: apiserver_request:burnrate1d
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[1h]))
-
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[1h]))
)
+
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[1h]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[1h]))
labels:
verb: write
record: apiserver_request:burnrate1h
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[2h]))
-
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[2h]))
)
+
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[2h]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[2h]))
labels:
verb: write
record: apiserver_request:burnrate2h
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[30m]))
-
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[30m]))
)
+
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[30m]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[30m]))
labels:
verb: write
record: apiserver_request:burnrate30m
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[3d]))
-
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[3d]))
)
+
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[3d]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[3d]))
labels:
verb: write
record: apiserver_request:burnrate3d
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
-
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[5m]))
)
+
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[5m]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
labels:
verb: write
record: apiserver_request:burnrate5m
- expr: |2
(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[6h]))
-
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[6h]))
)
+
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[6h]))
)
/
sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[6h]))
labels:
verb: write
record: apiserver_request:burnrate6h
- expr: |
sum by (code,resource) (rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[5m]))
labels:
verb: read
record: code_resource:apiserver_request_total:rate5m
- expr: |
sum by (code,resource) (rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
labels:
verb: write
record: code_resource:apiserver_request_total:rate5m
- expr: |
histogram_quantile(0.99, sum by (le, resource) (rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET"}[5m]))) > 0
labels:
quantile: "0.99"
verb: read
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.99, sum by (le, resource) (rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) > 0
labels:
quantile: "0.99"
verb: write
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
- expr: |2
sum(rate(apiserver_request_duration_seconds_sum{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod)
/
sum(rate(apiserver_request_duration_seconds_count{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod)
record: cluster:apiserver_request_duration_seconds:mean5m
- expr: |
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod))
labels:
quantile: "0.99"
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.9, sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod))
labels:
quantile: "0.9"
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.5, sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod))
labels:
quantile: "0.5"
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
- interval: 3m
name: kube-apiserver-availability.rules
rules:
- expr: |2
1 - (
(
# write too slow
sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
-
sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
) +
(
# read too slow
sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
-
(
(
sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d]))
or
vector(0)
)
+
sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d]))
+
sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
)
) +
# errors
sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
)
/
sum(code:apiserver_request_total:increase30d)
labels:
verb: all
record: apiserver_request:availability30d
- expr: |2
1 - (
sum(increase(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[30d]))
-
(
# too slow
(
sum(increase(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d]))
or
vector(0)
)
+
sum(increase(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[30d]))
+
sum(increase(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
)
+
# errors
sum(code:apiserver_request_total:increase30d{verb="read",code=~"5.."} or vector(0))
)
/
sum(code:apiserver_request_total:increase30d{verb="read"})
labels:
verb: read
record: apiserver_request:availability30d
- expr: |2
1 - (
(
# too slow
sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
-
sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
)
+
# errors
sum(code:apiserver_request_total:increase30d{verb="write",code=~"5.."} or vector(0))
)
/
sum(code:apiserver_request_total:increase30d{verb="write"})
labels:
verb: write
record: apiserver_request:availability30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="LIST",code=~"2.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="GET",code=~"2.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="POST",code=~"2.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PUT",code=~"2.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PATCH",code=~"2.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="DELETE",code=~"2.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="LIST",code=~"3.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="GET",code=~"3.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="POST",code=~"3.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PUT",code=~"3.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PATCH",code=~"3.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="DELETE",code=~"3.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="LIST",code=~"4.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="GET",code=~"4.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="POST",code=~"4.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PUT",code=~"4.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PATCH",code=~"4.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="DELETE",code=~"4.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="LIST",code=~"5.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="GET",code=~"5.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="POST",code=~"5.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PUT",code=~"5.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PATCH",code=~"5.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="DELETE",code=~"5.."}[30d]))
record: code_verb:apiserver_request_total:increase30d
- expr: |
sum by (code) (code_verb:apiserver_request_total:increase30d{verb=~"LIST|GET"})
labels:
verb: read
record: code:apiserver_request_total:increase30d
- expr: |
sum by (code) (code_verb:apiserver_request_total:increase30d{verb=~"POST|PUT|PATCH|DELETE"})
labels:
verb: write
record: code:apiserver_request_total:increase30d
rules_yml: |
{}
deployment:
configmapReload:
containers:
args:
- --volume-dir=/etc/config
- --webhook-url=http://127.0.0.1:9090/-/reload
containers:
args:
- --storage.tsdb.retention.time=42d
- --config.file=/etc/config/prometheus.yml
- --storage.tsdb.path=/data
- --web.console.libraries=/etc/prometheus/console_libraries
- --web.console.templates=/etc/prometheus/consoles
- --web.enable-lifecycle
replicas: 1
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
updateStrategy: Recreate
pvc:
accessMode: ReadWriteOnce
storage: 20Gi
storageClassName: default
service:
port: 80
targetPort: 9090
type: ClusterIP
pushgateway:
deployment:
replicas: 1
service:
port: 9091
targetPort: 9091
type: ClusterIP
The Prometheus configuration is set in the prometheus-data-values.yaml
file. The table lists and describes the available parameters.
Parameter | Description | Type | Default |
---|---|---|---|
monitoring.namespace | Namespace where Prometheus will be deployed | string | tanzu-system-monitoring |
monitoring.create_namespace | The flag indicates whether to create the namespace specified by monitoring.namespace | boolean | false |
monitoring.prometheus_server.config.prometheus_yaml | Kubernetes cluster monitor config details to be passed to Prometheus | yaml file | prometheus.yaml |
monitoring.prometheus_server.config.alerting_rules_yaml | Detailed alert rules defined in Prometheus | yaml file | alerting_rules.yaml |
monitoring.prometheus_server.config.recording_rules_yaml | Detailed record rules defined in Prometheus | yaml file | recording_rules.yaml |
monitoring.prometheus_server.service.type | Type of service to expose Prometheus. Supported Values: ClusterIP | string | ClusterIP |
monitoring.prometheus_server.enable_alerts.kubernetes_api | Enable SLO alerting for the Kubernetes API in Prometheus | boolean | true |
monitoring.prometheus_server.sc.aws_type | AWS type defined for storageclass on AWS | string | gp2 |
monitoring.prometheus_server.sc.aws_fsType | AWS file system type defined for storageclass on AWS | string | ext4 |
monitoring.prometheus_server.sc.allowVolumeExpansion | Define if volume expansion allowed for storageclass on AWS | boolean | true |
monitoring.prometheus_server.pvc.annotations | Storage class annotations | map | {} |
monitoring.prometheus_server.pvc.storage_class | Storage class to use for persistent volume claim. By default this is null and default provisioner is used | string | null |
monitoring.prometheus_server.pvc.accessMode | Define access mode for persistent volume claim. Supported values: ReadWriteOnce, ReadOnlyMany, ReadWriteMany | string | ReadWriteOnce |
monitoring.prometheus_server.pvc.storage | Define storage size for persistent volume claim | string | 8Gi |
monitoring.prometheus_server.deployment.replicas | Number of prometheus replicas | integer | 1 |
monitoring.prometheus_server.image.repository | Location of the repository with the Prometheus image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.prometheus_server.image.name | Name of Prometheus image | string | prometheus |
monitoring.prometheus_server.image.tag | Prometheus image tag. This value may need to be updated if you are upgrading the version. | string | v2.17.1_vmware.1 |
monitoring.prometheus_server.image.pullPolicy | Prometheus image pull policy | string | IfNotPresent |
monitoring.alertmanager.config.slack_demo | Slack notification configuration for Alertmanager | string | slack_demo: name: slack_demo slack_configs: - api_url: https://hooks.slack.com channel: '#alertmanager-test' |
monitoring.alertmanager.config.email_receiver | Email notification configuration for Alertmanager | string | email_receiver: name: email-receiver email_configs: - to: demo@tanzu.com send_resolved: false from: from-email@tanzu.com smarthost: smtp.eample.com:25 require_tls: false |
monitoring.alertmanager.service.type | Type of service to expose Alertmanager. Supported Values: ClusterIP | string | ClusterIP |
monitoring.alertmanager.image.repository | Location of the repository with the Alertmanager image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.alertmanager.image.name | Name of Alertmanager image | string | alertmanager |
monitoring.alertmanager.image.tag | Alertmanager image tag. This value may need to be updated if you are upgrading the version. | string | v0.20.0_vmware.1 |
monitoring.alertmanager.image.pullPolicy | Alertmanager image pull policy | string | IfNotPresent |
monitoring.alertmanager.pvc.annotations | Storage class annotations | map | {} |
monitoring.alertmanager.pvc.storage_class | Storage class to use for persistent volume claim. By default this is null and default provisioner is used. | string | null |
monitoring.alertmanager.pvc.accessMode | Define access mode for persistent volume claim. Supported values: ReadWriteOnce, ReadOnlyMany, ReadWriteMany | string | ReadWriteOnce |
monitoring.alertmanager.pvc.storage | Define storage size for persistent volume claim | string | 2Gi |
monitoring.alertmanager.deployment.replicas | Number of alertmanager replicas | integer | 1 |
monitoring.kube_state_metrics.image.repository | Repository containing kube-state-metircs image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.kube_state_metrics.image.name | Name of kube-state-metircs image | string | kube-state-metrics |
monitoring.kube_state_metrics.image.tag | kube-state-metircs image tag. This value may need to be updated if you are upgrading the version. | string | v1.9.5_vmware.1 |
monitoring.kube_state_metrics.image.pullPolicy | kube-state-metircs image pull policy | string | IfNotPresent |
monitoring.kube_state_metrics.deployment.replicas | Number of kube-state-metrics replicas | integer | 1 |
monitoring.node_exporter.image.repository | Repository containing node-exporter image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.node_exporter.image.name | Name of node-exporter image | string | node-exporter |
monitoring.node_exporter.image.tag | node-exporter image tag. This value may need to be updated if you are upgrading the version. | string | v0.18.1_vmware.1 |
monitoring.node_exporter.image.pullPolicy | node-exporter image pull policy | string | IfNotPresent |
monitoring.node_exporter.hostNetwork | If set to `hostNetwork: true`, the pod can use the network namespace and network resources of the node. | boolean | false |
monitoring.node_exporter.deployment.replicas | Number of node-exporter replicas | integer | 1 |
monitoring.pushgateway.image.repository | Repository containing pushgateway image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.pushgateway.image.name | Name of pushgateway image | string | pushgateway |
monitoring.pushgateway.image.tag | pushgateway image tag. This value may need to be updated if you are upgrading the version. | string | v1.2.0_vmware.1 |
monitoring.pushgateway.image.pullPolicy | pushgateway image pull policy | string | IfNotPresent |
monitoring.pushgateway.deployment.replicas | Number of pushgateway replicas | integer | 1 |
monitoring.cadvisor.image.repository | Repository containing cadvisor image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.cadvisor.image.name | Name of cadvisor image | string | cadvisor |
monitoring.cadvisor.image.tag | cadvisor image tag. This value may need to be updated if you are upgrading the version. | string | v0.36.0_vmware.1 |
monitoring.cadvisor.image.pullPolicy | cadvisor image pull policy | string | IfNotPresent |
monitoring.cadvisor.deployment.replicas | Number of cadvisor replicas | integer | 1 |
monitoring.ingress.enabled | Enable/disable ingress for prometheus and alertmanager | boolean | false To use ingress, set this field to true and deploy [Contour](#). To access Prometheus, update your local /etc/hosts with an entry that maps prometheus.system.tanzu to a worker node IP address. |
monitoring.ingress.virtual_host_fqdn | Hostname for accessing Prometheus and Alertmanager | string | prometheus.system.tanzu |
monitoring.ingress.prometheus_prefix | Path prefix for prometheus | string | / |
monitoring.ingress.alertmanager_prefix | Path prefix for alertmanager | string | /alertmanager/ |
monitoring.ingress.tlsCertificate.tls.crt | Optional cert for ingress if you want to use your own TLS cert. A self signed cert is generated by default | string | Generated cert |
monitoring.ingress.tlsCertificate.tls.key | Optional cert private key for ingress if you want to use your own TLS cert. | string | Generated cert key |
The following table lists configuration parameters of the Prometheus package and describes their default values.
You can set the following configuration values in your prometheus-data-values.yaml
file.
To review and edit the Prometheus package’s current configuration parameters and values, retrieve its values schema:
tanzu package available get prometheus.tanzu.vmware.com/2.45.0+vmware.1-tkg.1 -n AVAILABLE-PACKAGE-NAMESPACE --values-schema
Parameter | Description | Type | Default |
---|---|---|---|
namespace |
Namespace where Prometheus will be deployed. | String | tanzu-system-monitoring |
prometheus.deployment.replicas |
Number of Prometheus replicas. | String | 1 |
prometheus.deployment.containers.args |
Prometheus container arguments. You can configure this parameter to change retention time. For information about configuring Prometheus storage parameters, see the Prometheus documentation. Note Longer retention times require more storage capacity than shorter retention times. It might be necessary to increase the persistent volume claim size if you are significantly increasing the retention time. | List | n/a |
prometheus.deployment.containers.resources |
Prometheus container resource requests and limits. | Map | {} |
prometheus.deployment.podAnnotations |
The Prometheus deployments pod annotations. | Map | {} |
prometheus.deployment.podLabels |
The Prometheus deployments pod labels. | Map | {} |
prometheus.deployment.configMapReload.containers.args |
Configmap-reload container arguments. | List | n/a |
prometheus.deployment.configMapReload.containers.resources |
Configmap-reload container resource requests and limits. | Map | {} |
prometheus.service.type |
Type of service to expose Prometheus. Supported Values: ClusterIP . |
String | ClusterIP |
prometheus.service.port |
Prometheus service port. | Integer | 80 |
prometheus.service.targetPort |
Prometheus service target port. | Integer | 9090 |
prometheus.service.labels |
Prometheus service labels. | Map | {} |
prometheus.service.annotations |
Prometheus service annotations. | Map | {} |
prometheus.pvc.annotations |
Storage class annotations. | Map | {} |
prometheus.pvc.storageClassName |
Storage class to use for persistent volume claim. By default this is null and default provisioner is used. | String | null |
prometheus.pvc.accessMode |
Define access mode for persistent volume claim. Supported values: ReadWriteOnce , ReadOnlyMany , ReadWriteMany . |
String | ReadWriteOnce |
prometheus.pvc.storage |
Define storage size for persistent volume claim. | String | 150Gi |
prometheus.config.prometheus_yml |
For information about the global Prometheus configuration, see the Prometheus documentation. | YAML file | prometheus.yaml |
prometheus.config.alerting_rules_yml |
For information about the Prometheus alerting rules, see the Prometheus documentation. | YAML file | alerting_rules.yaml |
prometheus.config.recording_rules_yml |
For information about the Prometheus recording rules, see the Prometheus documentation. | YAML file | recording_rules.yaml |
prometheus.config.alerts_yml |
Additional prometheus alerting rules are configured here. | YAML file | alerts_yml.yaml |
prometheus.config.rules_yml |
Additional prometheus recording rules are configured here. | YAML file | rules_yml.yaml |
alertmanager.deployment.replicas |
Number of alertmanager replicas. | Integer | 1 |
alertmanager.deployment.containers.resources |
Alertmanager container resource requests and limits. | Map | {} |
alertmanager.deployment.podAnnotations |
The Alertmanager deployments pod annotations. | Map | {} |
alertmanager.deployment.podLabels |
The Alertmanager deployments pod labels. | Map | {} |
alertmanager.service.type |
Type of service to expose Alertmanager. Supported Values: ClusterIP . |
String | ClusterIP |
alertmanager.service.port |
Alertmanager service port. | Integer | 80 |
alertmanager.service.targetPort |
Alertmanager service target port. | Integer | 9093 |
alertmanager.service.labels |
Alertmanager service labels. | Map | {} |
alertmanager.service.annotations |
Alertmanager service annotations. | Map | {} |
alertmanager.pvc.annotations |
Storage class annotations. | Map | {} |
alertmanager.pvc.storageClassName |
Storage class to use for persistent volume claim. By default this is null and default provisioner is used. | String | null |
alertmanager.pvc.accessMode |
Define access mode for persistent volume claim. Supported values: ReadWriteOnce , ReadOnlyMany , ReadWriteMany . |
String | ReadWriteOnce |
alertmanager.pvc.storage |
Define storage size for persistent volume claim. | String | 2Gi |
alertmanager.config.alertmanager_yml |
For information about the global YAML configuration for Alert Manager, see the Prometheus documentation. | YAML file | alertmanager_yml |
kube_state_metrics.deployment.replicas |
Number of kube-state-metrics replicas. | Integer | 1 |
kube_state_metrics.deployment.containers.resources |
kube-state-metrics container resource requests and limits. | Map | {} |
kube_state_metrics.deployment.podAnnotations |
The kube-state-metrics deployments pod annotations. | Map | {} |
kube_state_metrics.deployment.podLabels |
The kube-state-metrics deployments pod labels. | Map | {} |
kube_state_metrics.service.type |
Type of service to expose kube-state-metrics. Supported Values: ClusterIP . |
String | ClusterIP |
kube_state_metrics.service.port |
kube-state-metrics service port. | Integer | 80 |
kube_state_metrics.service.targetPort |
kube-state-metrics service target port. | Integer | 8080 |
kube_state_metrics.service.telemetryPort |
kube-state-metrics service telemetry port. | Integer | 81 |
kube_state_metrics.service.telemetryTargetPort |
kube-state-metrics service target telemetry port. | Integer | 8081 |
kube_state_metrics.service.labels |
kube-state-metrics service labels. | Map | {} |
kube_state_metrics.service.annotations |
kube-state-metrics service annotations. | Map | {} |
node_exporter.daemonset.replicas |
Number of node-exporter replicas. | Integer | 1 |
node_exporter.daemonset.containers.resources |
node-exporter container resource requests and limits. | Map | {} |
node_exporter.daemonset.hostNetwork |
Host networking requested for this pod. | boolean | false |
node_exporter.daemonset.podAnnotations |
The node-exporter deployments pod annotations. | Map | {} |
node_exporter.daemonset.podLabels |
The node-exporter deployments pod labels. | Map | {} |
node_exporter.service.type |
Type of service to expose node-exporter. Supported Values: ClusterIP . |
String | ClusterIP |
node_exporter.service.port |
node-exporter service port. | Integer | 9100 |
node_exporter.service.targetPort |
node-exporter service target port. | Integer | 9100 |
node_exporter.service.labels |
node-exporter service labels. | Map | {} |
node_exporter.service.annotations |
node-exporter service annotations. | Map | {} |
pushgateway.deployment.replicas |
Number of pushgateway replicas. | Integer | 1 |
pushgateway.deployment.containers.resources |
pushgateway container resource requests and limits. | Map | {} |
pushgateway.deployment.podAnnotations |
The pushgateway deployments pod annotations. | Map | {} |
pushgateway.deployment.podLabels |
The pushgateway deployments pod labels. | Map | {} |
pushgateway.service.type |
Type of service to expose pushgateway. Supported Values: ClusterIP . |
String | ClusterIP |
pushgateway.service.port |
pushgateway service port. | Integer | 9091 |
pushgateway.service.targetPort |
pushgateway service target port. | Integer | 9091 |
pushgateway.service.labels |
pushgateway service labels. | Map | {} |
pushgateway.service.annotations |
pushgateway service annotations. | Map | {} |
cadvisor.daemonset.replicas |
Number of cadvisor replicas. | Integer | 1 |
cadvisor.daemonset.containers.resources |
cadvisor container resource requests and limits. | Map | {} |
cadvisor.daemonset.podAnnotations |
The cadvisor deployments pod annotations. | Map | {} |
cadvisor.daemonset.podLabels |
The cadvisor deployments pod labels. | Map | {} |
ingress.enabled |
Activate/Deactivate ingress for prometheus and alertmanager. | Boolean | false |
ingress.virtual_host_fqdn |
Hostname for accessing promethues and alertmanager. | String | prometheus.system.tanzu |
ingress.prometheus_prefix |
Path prefix for prometheus. | String | / |
ingress.alertmanager_prefix |
Path prefix for alertmanager. | String | /alertmanager/ |
ingress.prometheusServicePort |
Prometheus service port to proxy traffic to. | Integer | 80 |
ingress.alertmanagerServicePort |
Alertmanager service port to proxy traffic to. | Integer | 80 |
ingress.tlsCertificate.tls.crt |
Optional certificate for ingress if you want to use your own TLS certificate. A self signed certificate is generated by default. Note tls.crt is a key and not nested. |
String | Generated cert |
ingress.tlsCertificate.tls.key |
Optional certificate private key for ingress if you want to use your own TLS certificate. Note tls.key is a key and not nested. |
String | Generated cert key |
ingress.tlsCertificate.ca.crt |
Optional CA certificate. Note ca.crt is a key and not nested. |
String | CA certificate |
You can set the following fields in the Prometheus Server ConfigMap
.
Parameter | Description | Type | Default |
---|---|---|---|
evaluation_interval | frequency to evaluate rules | duration | 1m |
scrape_interval | frequency to scrape targets | duration | 1m |
scrape_timeout | How long until a scrape request times out | duration | 10s |
rule_files | Rule files specifies a list of globs. Rules and alerts are read from all matching files | yaml file | |
scrape_configs | A list of scrape configurations. | list | |
job_name | The job name assigned to scraped metrics by default | string | |
kubernetes_sd_configs | List of Kubernetes service discovery configurations. | list | |
relabel_configs | List of target relabel configurations. | list | |
action | Action to perform based on regex matching. | string | |
regex | Regular expression against which the extracted value is matched. | string | |
source_labels | The source labels select values from existing labels. | string | |
scheme | Configures the protocol scheme used for requests. | string | |
tls_config | Configures the scrape request’s TLS settings. | string | |
ca_file | CA certificate to validate API server certificate with. | filename | |
insecure_skip_verify | Disable validation of the server certificate. | boolean | |
bearer_token_file | Optional bearer token file authentication information. | filename | |
replacement | Replacement value against which a regex replace is performed if the regular expression matches. | string | |
target_label | Label to which the resulting value is written in a replace action. | string |
You can set the following fields in the Alert Manager ConfigMap
.
Parameter | Description | Type | Default |
---|---|---|---|
resolve_timeout | ResolveTimeout is the default value used by alertmanager if the alert does not include EndsAt | duration | 5m |
smtp_smarthost | The SMTP host through which emails are sent. | duration | 1m |
slack_api_url | The Slack webhook URL. | string | global.slack_api_url |
pagerduty_url | The pagerduty URL to send API requests to. | string | global.pagerduty_url |
templates | Files from which custom notification template definitions are read | file path | |
group_by | group the alerts by label | string | |
group_interval | set time to wait before sending a notification about new alerts that are added to a group | duration | 5m |
group_wait | How long to initially wait to send a notification for a group of alerts | duration | 30s |
repeat_interval | How long to wait before sending a notification again if it has already been sent successfully for an alert | duration | 4h |
receivers | A list of notification receivers. | list | |
severity | Severity of the incident. | string | |
channel | The channel or user to send notifications to. | string | |
html | The HTML body of the email notification. | string | |
text | The text body of the email notification. | string | |
send_resolved | Whether or not to notify about resolved alerts. | filename | |
email_configs | Configurations for email integration | boolean |
Prometheus Pod Annotations
Annotations on pods allow a fine control of the scraping process. These annotations must be part of the pod metadata. They will have no effect if set on other objects such as Services or DaemonSets.
Pod Annotation | Description |
---|---|
prometheus.io/scrape |
The default configuration will scrape all pods and, if set to false, this annotation will exclude the pod from the scraping process. |
prometheus.io/path |
If the metrics path is not /metrics, define it with this annotation. |
prometheus.io/port |
Scrape the pod on the indicated port instead of the pod’s declared ports (default is a port-free target if none are declared). |
The DaemonSet manifest below will instruct Prometheus to scrape all of its pods on port 9102.
apiVersion: apps/v1beta2 # for versions before 1.8.0 use extensions/v1beta1
kind: DaemonSet
metadata:
name: fluentd-elasticsearch
namespace: weave
labels:
app: fluentd-logging
spec:
selector:
matchLabels:
name: fluentd-elasticsearch
template:
metadata:
labels:
name: fluentd-elasticsearch
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '9102'
spec:
containers:
- name: fluentd-elasticsearch
image: gcr.io/google-containers/fluentd-elasticsearch:1.20
The Grafana package installs on the cluster the container listed in the table. For more information, see https://grafana.com/. The Grafana package pulls the container from the VMware public registry specified in Package Repository.
Container | Resource Type | Replicas | Description |
---|---|---|---|
Grafana | Deployment | 2 | Data visualization |
Below is an example grafana-data-values.yaml
file with the following customizations:
namespace: grafana-dashboard
grafana:
pspNames: "vmware-system-restricted"
deployment:
replicas: 1
updateStrategy: Recreate
pvc:
accessMode: ReadWriteOnce
storage: 2Gi
storageClassName: default
secret:
admin_user: YWRtaW4=
admin_password: YWRtaW4=
type: Opaque
service:
port: 80
targetPort: 3000
type: LoadBalancer
ingress:
enabled: true
prefix: /
servicePort: 80
virtual_host_fqdn: grafana.system.tanzu
The Grafana configuration is set in grafana-data-values.yaml
. The table lists and describes the available parameters.
Parameter | Description | Type | Default |
---|---|---|---|
monitoring.namespace | Namespace where Prometheus will be deployed | string | tanzu-system-monitoring |
monitoring.create_namespace | The flag indicates whether to create the namespace specified by monitoring.namespace | boolean | false |
monitoring.grafana.cluster_role.apiGroups | api group defined for grafana clusterrole | list | [""] |
monitoring.grafana.cluster_role.resources | resources defined for grafana clusterrole | list | [“configmaps”, “secrets”] |
monitoring.grafana.cluster_role.verbs | access permission defined for clusterrole | list | [“get”, “watch”, “list”] |
monitoring.grafana.config.grafana_ini | Grafana configuration file details | config file | grafana.ini In this file, grafana_net URL is used to integrate with Grafana, for example, to import the dashboard directly from Grafana.com. |
monitoring.grafana.config.datasource.type | Grafana datasource type | string | prometheus |
monitoring.grafana.config.datasource.access | access mode. proxy or direct (Server or Browser in the UI) | string | proxy |
monitoring.grafana.config.datasource.isDefault | mark as default Grafana datasource | boolean | true |
monitoring.grafana.config.provider_yaml | Config file to define grafana dashboard provider | yaml file | provider.yaml |
monitoring.grafana.service.type | Type of service to expose Grafana. Supported Values: ClusterIP, NodePort, LoadBalancer | string | vSphere: NodePort, aws/azure: LoadBalancer |
monitoring.grafana.pvc.storage_class | Define access mode for persistent volume claim. Supported values: ReadWriteOnce, ReadOnlyMany, ReadWriteMany | string | ReadWriteOnce |
monitoring.grafana.pvc.storage | Define storage size for persistent volume claim | string | 2Gi |
monitoring.grafana.deployment.replicas | Number of grafana replicas | integer | 1 |
monitoring.grafana.image.repository | Location of the repository with the Grafana image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/grafana |
monitoring.grafana.image.name | Name of Grafana image | string | grafana |
monitoring.grafana.image.tag | Grafana image tag. This value may need to be updated if you are upgrading the version. | string | v7.3.5_vmware.1 |
monitoring.grafana.image.pullPolicy | Grafana image pull policy | string | IfNotPresent |
monitoring.grafana.secret.type | Secret type defined for Grafana dashboard | string | Opaque |
monitoring.grafana.secret.admin_user | username to access Grafana dashboard | string | YWRtaW4=Value is base64 encoded; to decode: echo "xxxxxx" | base64 --decode |
monitoring.grafana.secret.admin_password | password to access Grafana dashboard | string | null |
monitoring.grafana.secret.ldap_toml | If using ldap auth, ldap configuration file path | string | "" |
monitoring.grafana_init_container.image.repository | Repository containing grafana init container image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/grafana |
monitoring.grafana_init_container.image.name | Name of grafana init container image | string | k8s-sidecar |
monitoring.grafana_init_container.image.tag | Grafana init container image tag. This value may need to be updated if you are upgrading the version. | string | 0.1.99 |
monitoring.grafana_init_container.image.pullPolicy | grafana init container image pull policy | string | IfNotPresent |
monitoring.grafana_sc_dashboard.image.repository | Repository containing the Grafana dashboard image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/grafana |
monitoring.grafana_sc_dashboard.image.name | Name of grafana dashboard image | string | k8s-sidecar |
monitoring.grafana_sc_dashboard.image.tag | Grafana dashboard image tag. This value may need to be updated if you are upgrading the version. | string | 0.1.99 |
monitoring.grafana_sc_dashboard.image.pullPolicy | grafana dashboard image pull policy | string | IfNotPresent |
monitoring.grafana.ingress.enabled | Enable/disable ingress for grafana | boolean | true |
monitoring.grafana.ingress.virtual_host_fqdn | Hostname for accessing grafana | string | grafana.system.tanzu |
monitoring.grafana.ingress.prefix | Path prefix for grafana | string | / |
monitoring.grafana.ingress.tlsCertificate.tls.crt | Optional cert for ingress if you want to use your own TLS cert. A self signed cert is generated by default | string | Generated cert |
monitoring.grafana.ingress.tlsCertificate.tls.key | Optional cert private key for ingress if you want to use your own TLS cert. | string | Generated cert key |
The following table lists configuration parameters of the Grafana package and describes their default values.
You can set the following configuration values in your grafana-data-values.yaml
file.
To review and edit the Grafana package’s current configuration parameters and values, retrieve its values schema:
tanzu package available get grafana.tanzu.vmware.com/10.0.1+vmware.1-tkg.1 -n AVAILABLE-PACKAGE-NAMESPACE --values-schema
Parameter | Description | Type | Default |
---|---|---|---|
namespace | Namespace where Grafana will be deployed. | String | tanzu-system-dashboards |
grafana.deployment.replicas | Number of Grafana replicas. | Integer | 1 |
grafana.deployment.containers.resources | Grafana container resource requests and limits. | Map | {} |
grafana.deployment.k8sSidecar.containers.resources | k8s-sidecar container resource requests and limits. | Map | {} |
grafana.deployment.podAnnotations | The Grafana deployments pod annotations. | Map | {} |
grafana.deployment.podLabels | The Grafana deployments pod labels. | Map | {} |
grafana.service.type | Type of service to expose Grafana. Supported Values: ClusterIP , NodePort , LoadBalancer . (For vSphere set this to NodePort ) |
String | LoadBalancer |
grafana.service.port | Grafana service port. | Integer | 80 |
grafana.service.targetPort | Grafana service target port. | Integer | 9093 |
grafana.service.labels | Grafana service labels. | Map | {} |
grafana.service.annotations | Grafana service annotations. | Map | {} |
grafana.config.grafana_ini | For information about Grafana configuration, see Grafana Configuration Defaults in GitHub. | Config file | grafana.ini |
grafana.config.datasource_yaml | For information about datasource config, see the Grafana documentation. | String | prometheus |
grafana.config.dashboardProvider_yaml | For information about dashboard provider config, see the Grafana documentation. | YAML file | provider.yaml |
grafana.pvc.annotations | Storage class to use for persistent volume claim. By default this is null and default provisioner is used. | String | null |
grafana.pvc.storageClassName | Storage class to use for persistent volume claim. By default this is null and default provisioner is used. | String | null |
grafana.pvc.accessMode | Define access mode for persistent volume claim. Supported values: ReadWriteOnce , ReadOnlyMany , ReadWriteMany . |
String | ReadWriteOnce |
grafana.pvc.storage | Define storage size for persistent volume claim. | String | 2Gi |
grafana.secret.type | Secret type defined for Grafana dashboard. | String | Opaque |
grafana.secret.admin_user | Base64-encoded username to access Grafana dashboard. Defaults to YWRtaW4= , which is equivalent to admin in plain text. |
String | YWRtaW4= |
grafana.secret.admin_password | Base64-encoded password to access Grafana dashboard. Defaults to YWRtaW4= , which is equivalent to admin in plain text. |
String | YWRtaW4= |
ingress.enabled | Activate/Deactivate ingress for grafana. | Boolean | true |
ingress.virtual_host_fqdn | Hostname for accessing grafana. | String | grafana.system.tanzu |
ingress.prefix | Path prefix for grafana. | String | / |
ingress.servicePort | Grafana service port to proxy traffic to. | Integer | 80 |
ingress.tlsCertificate.tls.crt | Optional certificate for ingress if you want to use your own TLS cert. A self signed certificate is generated by default. Note tls.crt is a key and not nested. |
String | Generated cert |
ingress.tlsCertificate.tls.key | Optional certificate private key for ingress if you want to use your own TLS certificate. Note tls.key is a key and not nested. |
String | Generated cert private key |
ingress.tlsCertificate.ca.crt | Optional CA certificate. Note ca.crt is a key and not nested. |
String | CA certificate |