mirror of https://github.com/prometheus-operator/kube-prometheus.git
synced 2025-10-30 23:51:01 +01:00

*: migrate kube-prometheus to Prometheus Operator repo

parent 81c0d2f4d3
commit 285272bf28

README.md | 133
@@ -4,135 +4,6 @@ This repository collects Kubernetes manifests, dashboards, and alerting rules
 combined with documentation and scripts to provide single-command deployments
 of end-to-end Kubernetes cluster monitoring.
 
-## Prerequisites
+# This repository has moved
 
-First, you need a running Kubernetes cluster. If you don't have one, follow the
-instructions of [bootkube](https://github.com/kubernetes-incubator/bootkube) or
-[minikube](https://github.com/kubernetes/minikube). Some sample contents of this
-repository are adapted to work with a [multi-node setup](https://github.com/kubernetes-incubator/bootkube/tree/master/hack/multi-node)
-using [bootkube](https://github.com/kubernetes-incubator/bootkube).
-
-## Monitoring Kubernetes
-
-The manifests used here use the [Prometheus Operator](https://github.com/coreos/prometheus-operator),
-which manages Prometheus servers and their configuration in a cluster. With a single command we can install
-
-* The Operator itself
-* The Prometheus [node_exporter](https://github.com/prometheus/node_exporter)
-* [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)
-* The [Prometheus specification](https://github.com/coreos/prometheus-operator/blob/master/Documentation/prometheus.md) based on which the Operator deploys a Prometheus setup
-* A Prometheus configuration covering monitoring of all Kubernetes core components and exporters
-* A default set of alerting rules on the cluster component's health
-* A Grafana instance serving dashboards on cluster metrics
-* A three node highly available Alertmanager cluster
-
-Simply run:
-
-```bash
-export KUBECONFIG=<path>          # defaults to "~/.kube/config"
-hack/cluster-monitoring/deploy
-```
-
-After all pods are ready, you can reach:
-
-* Prometheus UI on node port `30900`
-* Alertmanager UI on node port `30903`
-* Grafana on node port `30902`
-
-To tear it all down again, run:
-
-```bash
-hack/cluster-monitoring/teardown
-```
-
-## Monitoring custom services
-
-The example manifests in [/manifests/examples/example-app](/manifests/examples/example-app)
-deploy a fake service exposing Prometheus metrics. They additionally define a new Prometheus
-server and a [`ServiceMonitor`](https://github.com/coreos/prometheus-operator/blob/master/Documentation/service-monitor.md),
-which specifies how the example service should be monitored.
-The Prometheus Operator will deploy and configure the desired Prometheus instance and continiously
-manage its life cycle.
-
-```bash
-hack/example-service-monitoring/deploy
-```
-
-After all pods are ready you can reach the Prometheus server on node port `30100` and observe
-how it monitors the service as specified. Same as before, this Prometheus server automatically
-discovers the Alertmanager cluster deployed in the [Monitoring Kubernetes](#Monitoring-Kubernetes)
-section.
-
-Teardown:
-
-```bash
-hack/example-service-monitoring/teardown
-```
-
-## Dashboarding
-
-The provided manifests deploy a Grafana instance serving dashboards provided via a ConfigMap.
-To modify, delete, or add dashboards, the `grafana-dashboards` ConfigMap must be modified.
-
-Currently, Grafana does not support serving dashboards from static files. Instead, the `grafana-watcher`
-sidecar container aims to emulate the behavior, by keeping the Grafana database always in sync
-with the provided ConfigMap. Hence, the Grafana pod is effectively stateless.
-This allows managing dashboards via `git` etc. and easily deploying them via CD pipelines.
-
-In the future, a separate Grafana operator will support gathering dashboards from multiple
-ConfigMaps based on label selection.
-
-## Roadmap
-
-* Grafana Operator that dynamically discovers and deploys dashboards from ConfigMaps
-* KPM/Helm packages to easily provide production-ready cluster-monitoring setup (essentially contents of `hack/cluster-monitoring`)
-* Add meta-monitoring to default cluster monitoring setup
-* Build out the provided dashboards and alerts for cluster monitoring to have full coverage of all system aspects
-
-## Monitoring other Cluster Components
-
-Discovery of API servers and kubelets works the same across all clusters.
-Depending on a cluster's setup several other core components, such as etcd or the
-scheduler, may be deployed in different ways.
-The easiest integration point is for the cluster operator to provide headless services
-of all those components to provide a common interface of discovering them. With that
-setup they will automatically be discovered by the provided Prometheus configuration.
-
-For the `kube-scheduler` and `kube-controller-manager` there are headless
-services prepared, simply add them to your running cluster:
-
-```bash
-kubectl -n kube-system create manifests/k8s/
-```
-
-> Hint: if you use this for a cluster not created with bootkube, make sure you
-> populate an endpoints object with the address to your `kube-scheduler` and
-> `kube-controller-manager`, or adapt the label selectors to match your setup.
-
-Aside from Kubernetes specific components, etcd is an important part of a
-working cluster, but is typically deployed outside of it. This monitoring
-setup assumes that it is made visible from within the cluster through a headless
-service as well.
-
-> Note that minikube hides some components like etcd so to see the extend of
-> this setup we recommend setting up a [local cluster using bootkube](https://github.com/kubernetes-incubator/bootkube/tree/master/hack/multi-node).
-
-An example for bootkube's multi-node vagrant setup is [here](/manifests/etcd/etcd-bootkube-vagrant-multi.yaml).
-
-> Hint: this is merely an example for a local setup. The addresses will have to
-> be adapted for a setup, that is not a single etcd bootkube created cluster.
-
-With that setup the headless services provide endpoint lists consumed by
-Prometheus to discover the endpoints as targets:
-
-```bash
-$ kubectl get endpoints --all-namespaces
-NAMESPACE     NAME                                           ENDPOINTS          AGE
-default       kubernetes                                     172.17.4.101:443   2h
-kube-system   kube-controller-manager-prometheus-discovery   10.2.30.2:10252    1h
-kube-system   kube-scheduler-prometheus-discovery            10.2.30.4:10251    1h
-monitoring    etcd-k8s                                       172.17.4.51:2379   1h
-```
-
-## Other Documentation
-[Install Docs for a cluster created with KOPS on AWS](docs/KOPSonAWS.md)
+This repository has been merged with the [Prometheus Operator](https://github.com/coreos/prometheus-operator). It can now be found under [`contrib/kube-prometheus`](https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus).
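The removed README explains that headless services publish Endpoints objects, which Prometheus then discovers as scrape targets. To check from a shell what those objects actually contain, a minimal sketch follows. It only parses the plain-text table printed by `kubectl get endpoints --all-namespaces` (the format shown in the README sample); the function name `endpoint_targets` is hypothetical, and the code assumes the ENDPOINTS column is a comma-separated `host:port` list without spaces. It does not replicate Prometheus's own Kubernetes service discovery.

```shell
#!/bin/sh
# Hypothetical helper: print one "namespace/name host:port" line per address
# found in the table output of `kubectl get endpoints --all-namespaces` (stdin).
# Assumptions: the first line is the header row, and the third column
# (ENDPOINTS) is a comma-separated host:port list without spaces.
# kubectl may abbreviate long lists with a trailing "+ N more..."; those extra
# columns are ignored, so the hidden addresses are not printed.
endpoint_targets() {
  awk 'NR > 1 && NF >= 4 && $3 != "<none>" {
    n = split($3, addrs, ",")
    for (i = 1; i <= n; i++) {
      # Skip empty pieces, e.g. from a trailing comma.
      if (addrs[i] != "") print $1 "/" $2 " " addrs[i]
    }
  }'
}
```

Usage: `kubectl get endpoints --all-namespaces | endpoint_targets`. Fed the sample table from the README, it prints lines such as `kube-system/kube-scheduler-prometheus-discovery 10.2.30.4:10251`; an Endpoints object with no addresses (`<none>`) produces no output, which is a quick hint that a headless service's selector or manually populated endpoints need fixing.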
@@ -1,860 +0,0 @@
-{
-  "dashboard":
-{
-  "__inputs": [
-    {
-      "name": "DS_PROMETHEUS",
-      "label": "prometheus",
-      "description": "",
-      "type": "datasource",
-      "pluginId": "prometheus",
-      "pluginName": "Prometheus"
-    }
-  ],
-  "__requires": [
-    {
-      "type": "grafana",
-      "id": "grafana",
-      "name": "Grafana",
-      "version": "4.1.1"
-    },
-    {
-      "type": "panel",
-      "id": "graph",
-      "name": "Graph",
-      "version": ""
-    },
-    {
-      "type": "datasource",
-      "id": "prometheus",
-      "name": "Prometheus",
-      "version": "1.0.0"
-    },
-    {
-      "type": "panel",
-      "id": "singlestat",
-      "name": "Singlestat",
-      "version": ""
-    }
-  ],
-  "annotations": {
-    "list": []
-  },
-  "description": "Dashboard to get an overview of one server",
-  "editable": true,
-  "gnetId": 22,
-  "graphTooltip": 0,
-  "hideControls": false,
-  "id": null,
-  "links": [],
-  "refresh": false,
-  "rows": [
-    {
-      "collapse": false,
-      "height": "250px",
-      "panels": [
-        {
-          "alerting": {},
-          "aliasColors": {},
-          "bars": false,
-          "datasource": "${DS_PROMETHEUS}",
-          "editable": true,
-          "error": false,
-          "fill": 1,
-          "grid": {},
-          "id": 3,
-          "legend": {
-            "avg": false,
-            "current": false,
-            "max": false,
-            "min": false,
-            "show": true,
-            "total": false,
-            "values": false
-          },
-          "lines": true,
-          "linewidth": 2,
-          "links": [],
-          "nullPointMode": "connected",
-          "percentage": false,
-          "pointradius": 5,
-          "points": false,
-          "renderer": "flot",
-          "seriesOverrides": [],
-          "span": 6,
-          "stack": false,
-          "steppedLine": false,
-          "targets": [
-            {
-              "expr": "sum(rate(node_cpu{mode=\"idle\"}[2m])) * 100",
-              "hide": false,
-              "intervalFactor": 10,
-              "legendFormat": "",
-              "refId": "A",
-              "step": 50
-            }
-          ],
-          "thresholds": [],
-          "timeFrom": null,
-          "timeShift": null,
-          "title": "Idle cpu",
-          "tooltip": {
-            "msResolution": false,
-            "shared": true,
-            "sort": 0,
-            "value_type": "cumulative"
-          },
-          "type": "graph",
-          "xaxis": {
-            "mode": "time",
-            "name": null,
-            "show": true,
-            "values": []
-          },
-          "yaxes": [
-            {
-              "format": "percent",
-              "label": "cpu usage",
-              "logBase": 1,
-              "max": null,
-              "min": 0,
-              "show": true
-            },
-            {
-              "format": "short",
-              "label": null,
-              "logBase": 1,
-              "max": null,
-              "min": null,
-              "show": true
-            }
-          ]
-        },
-        {
-          "alerting": {},
-          "aliasColors": {},
-          "bars": false,
-          "datasource": "${DS_PROMETHEUS}",
-          "editable": true,
-          "error": false,
-          "fill": 1,
-          "grid": {},
-          "id": 9,
-          "legend": {
-            "avg": false,
-            "current": false,
-            "max": false,
-            "min": false,
-            "show": true,
-            "total": false,
-            "values": false
-          },
-          "lines": true,
-          "linewidth": 2,
-          "links": [],
-          "nullPointMode": "connected",
-          "percentage": false,
-          "pointradius": 5,
-          "points": false,
-          "renderer": "flot",
-          "seriesOverrides": [],
-          "span": 6,
-          "stack": false,
-          "steppedLine": false,
-          "targets": [
-            {
-              "expr": "sum(node_load1)",
-              "intervalFactor": 4,
-              "legendFormat": "load 1m",
-              "refId": "A",
-              "step": 20,
-              "target": ""
-            },
-            {
-              "expr": "sum(node_load5)",
-              "intervalFactor": 4,
-              "legendFormat": "load 5m",
-              "refId": "B",
-              "step": 20,
-              "target": ""
-            },
-            {
-              "expr": "sum(node_load15)",
-              "intervalFactor": 4,
-              "legendFormat": "load 15m",
-              "refId": "C",
-              "step": 20,
-              "target": ""
-            }
-          ],
-          "thresholds": [],
-          "timeFrom": null,
-          "timeShift": null,
-          "title": "System load",
-          "tooltip": {
-            "msResolution": false,
-            "shared": true,
-            "sort": 0,
-            "value_type": "cumulative"
-          },
-          "type": "graph",
-          "xaxis": {
-            "mode": "time",
-            "name": null,
-            "show": true,
-            "values": []
-          },
-          "yaxes": [
-            {
-              "format": "percentunit",
-              "label": null,
-              "logBase": 1,
-              "max": null,
-              "min": null,
-              "show": true
-            },
-            {
-              "format": "short",
-              "label": null,
-              "logBase": 1,
-              "max": null,
-              "min": null,
-              "show": true
-            }
-          ]
-        }
-      ],
-      "repeat": null,
-      "repeatIteration": null,
-      "repeatRowId": null,
-      "showTitle": false,
-      "title": "New row",
-      "titleSize": "h6"
-    },
-    {
-      "collapse": false,
-      "height": "250px",
-      "panels": [
-        {
-          "alerting": {},
-          "aliasColors": {},
-          "bars": false,
-          "datasource": "${DS_PROMETHEUS}",
-          "editable": true,
-          "error": false,
-          "fill": 1,
-          "grid": {},
-          "id": 4,
-          "legend": {
-            "avg": false,
-            "current": false,
-            "max": false,
-            "min": false,
-            "show": true,
-            "total": false,
-            "values": false
-          },
-          "lines": true,
-          "linewidth": 2,
-          "links": [],
-          "nullPointMode": "connected",
-          "percentage": false,
-          "pointradius": 5,
-          "points": false,
-          "renderer": "flot",
-          "seriesOverrides": [
-            {
-              "alias": "node_memory_SwapFree{instance=\"172.17.0.1:9100\",job=\"prometheus\"}",
-              "yaxis": 2
-            }
-          ],
-          "span": 9,
-          "stack": true,
-          "steppedLine": false,
-          "targets": [
-            {
-              "expr": "sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)",
-              "intervalFactor": 2,
-              "legendFormat": "memory usage",
-              "metric": "memo",
-              "refId": "A",
-              "step": 4,
-              "target": ""
-            },
-            {
-              "expr": "sum(node_memory_Buffers)",
-              "interval": "",
-              "intervalFactor": 2,
-              "legendFormat": "memory buffers",
-              "metric": "memo",
-              "refId": "B",
-              "step": 4,
-              "target": ""
-            },
-            {
-              "expr": "sum(node_memory_Cached)",
-              "interval": "",
-              "intervalFactor": 2,
-              "legendFormat": "memory cached",
-              "metric": "memo",
-              "refId": "C",
-              "step": 4,
-              "target": ""
-            },
-            {
-              "expr": "sum(node_memory_MemFree)",
-              "interval": "",
-              "intervalFactor": 2,
-              "legendFormat": "memory free",
-              "metric": "memo",
-              "refId": "D",
-              "step": 4,
-              "target": ""
-            }
-          ],
-          "thresholds": [],
-          "timeFrom": null,
-          "timeShift": null,
-          "title": "Memory usage",
-          "tooltip": {
-            "msResolution": false,
-            "shared": true,
-            "sort": 0,
-            "value_type": "individual"
-          },
-          "type": "graph",
-          "xaxis": {
-            "mode": "time",
-            "name": null,
-            "show": true,
-            "values": []
-          },
-          "yaxes": [
-            {
-              "format": "bytes",
-              "label": null,
-              "logBase": 1,
-              "max": null,
-              "min": "0",
-              "show": true
-            },
-            {
-              "format": "short",
-              "label": null,
-              "logBase": 1,
-              "max": null,
-              "min": null,
-              "show": true
-            }
-          ]
-        },
-        {
-          "cacheTimeout": null,
-          "colorBackground": false,
-          "colorValue": false,
-          "colors": [
-            "rgba(50, 172, 45, 0.97)",
-            "rgba(237, 129, 40, 0.89)",
-            "rgba(245, 54, 54, 0.9)"
-          ],
-          "datasource": "${DS_PROMETHEUS}",
-          "editable": true,
-          "error": false,
-          "format": "percent",
-          "gauge": {
-            "maxValue": 100,
-            "minValue": 0,
-            "show": true,
-            "thresholdLabels": false,
-            "thresholdMarkers": true
-          },
-          "id": 5,
-          "interval": null,
-          "links": [],
-          "mappingType": 1,
-          "mappingTypes": [
-            {
-              "name": "value to text",
-              "value": 1
-            },
-            {
-              "name": "range to text",
-              "value": 2
-            }
-          ],
-          "maxDataPoints": 100,
-          "nullPointMode": "connected",
-          "nullText": null,
-          "postfix": "",
-          "postfixFontSize": "50%",
-          "prefix": "",
-          "prefixFontSize": "50%",
-          "rangeMaps": [
-            {
-              "from": "null",
-              "text": "N/A",
-              "to": "null"
-            }
-          ],
-          "span": 3,
-          "sparkline": {
-            "fillColor": "rgba(31, 118, 189, 0.18)",
-            "full": false,
-            "lineColor": "rgb(31, 120, 193)",
-            "show": false
-          },
-          "targets": [
-            {
-              "expr": "((sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)) / sum(node_memory_MemTotal)) * 100",
-              "intervalFactor": 2,
-              "metric": "",
-              "refId": "A",
-              "step": 60,
-              "target": ""
-            }
-          ],
-          "thresholds": "80, 90",
-          "title": "Memory usage",
-          "type": "singlestat",
-          "valueFontSize": "80%",
-          "valueMaps": [
-            {
-              "op": "=",
-              "text": "N/A",
-              "value": "null"
-            }
-          ],
-          "valueName": "avg"
-        }
-      ],
-      "repeat": null,
-      "repeatIteration": null,
-      "repeatRowId": null,
-      "showTitle": false,
-      "title": "New row",
-      "titleSize": "h6"
-    },
-    {
-      "collapse": false,
-      "height": "250px",
-      "panels": [
-        {
-          "alerting": {},
-          "aliasColors": {},
-          "bars": false,
-          "datasource": "${DS_PROMETHEUS}",
-          "editable": true,
-          "error": false,
-          "fill": 1,
-          "grid": {},
-          "id": 6,
-          "legend": {
-            "avg": false,
-            "current": false,
-            "max": false,
-            "min": false,
-            "show": true,
-            "total": false,
-            "values": false
-          },
-          "lines": true,
-          "linewidth": 2,
-          "links": [],
-          "nullPointMode": "connected",
-          "percentage": false,
-          "pointradius": 5,
-          "points": false,
-          "renderer": "flot",
-          "seriesOverrides": [
-            {
-              "alias": "read",
-              "yaxis": 1
-            },
-            {
-              "alias": "{instance=\"172.17.0.1:9100\"}",
-              "yaxis": 2
-            },
-            {
-              "alias": "io time",
-              "yaxis": 2
-            }
-          ],
-          "span": 9,
-          "stack": false,
-          "steppedLine": false,
-          "targets": [
-            {
-              "expr": "sum(rate(node_disk_bytes_read[5m]))",
-              "hide": false,
-              "intervalFactor": 4,
-              "legendFormat": "read",
-              "refId": "A",
-              "step": 8,
-              "target": ""
-            },
-            {
-              "expr": "sum(rate(node_disk_bytes_written[5m]))",
-              "intervalFactor": 4,
-              "legendFormat": "written",
-              "refId": "B",
-              "step": 8
-            },
-            {
-              "expr": "sum(rate(node_disk_io_time_ms[5m]))",
-              "intervalFactor": 4,
-              "legendFormat": "io time",
-              "refId": "C",
-              "step": 8
-            }
-          ],
-          "thresholds": [],
-          "timeFrom": null,
-          "timeShift": null,
-          "title": "Disk I/O",
-          "tooltip": {
-            "msResolution": false,
-            "shared": true,
-            "sort": 0,
-            "value_type": "cumulative"
-          },
-          "type": "graph",
-          "xaxis": {
-            "mode": "time",
-            "name": null,
-            "show": true,
-            "values": []
-          },
-          "yaxes": [
-            {
-              "format": "bytes",
-              "label": null,
-              "logBase": 1,
-              "max": null,
-              "min": null,
-              "show": true
-            },
-            {
-              "format": "ms",
-              "label": null,
-              "logBase": 1,
-              "max": null,
-              "min": null,
-              "show": true
-            }
-          ]
-        },
-        {
-          "cacheTimeout": null,
-          "colorBackground": false,
-          "colorValue": false,
-          "colors": [
-            "rgba(50, 172, 45, 0.97)",
-            "rgba(237, 129, 40, 0.89)",
-            "rgba(245, 54, 54, 0.9)"
-          ],
-          "datasource": "${DS_PROMETHEUS}",
-          "editable": true,
-          "error": false,
-          "format": "percentunit",
-          "gauge": {
-            "maxValue": 1,
-            "minValue": 0,
-            "show": true,
-            "thresholdLabels": false,
-            "thresholdMarkers": true
-          },
-          "id": 7,
-          "interval": null,
-          "links": [],
-          "mappingType": 1,
-          "mappingTypes": [
-            {
-              "name": "value to text",
-              "value": 1
-            },
-            {
-              "name": "range to text",
-              "value": 2
-            }
-          ],
-          "maxDataPoints": 100,
-          "nullPointMode": "connected",
-          "nullText": null,
-          "postfix": "",
-          "postfixFontSize": "50%",
-          "prefix": "",
-          "prefixFontSize": "50%",
-          "rangeMaps": [
-            {
-              "from": "null",
-              "text": "N/A",
-              "to": "null"
-            }
-          ],
-          "span": 3,
-          "sparkline": {
-            "fillColor": "rgba(31, 118, 189, 0.18)",
-            "full": false,
-            "lineColor": "rgb(31, 120, 193)",
-            "show": false
-          },
-          "targets": [
-            {
-              "expr": "(sum(node_filesystem_size{device!=\"rootfs\"}) - sum(node_filesystem_free{device!=\"rootfs\"})) / sum(node_filesystem_size{device!=\"rootfs\"})",
-              "intervalFactor": 2,
-              "refId": "A",
-              "step": 60,
-              "target": ""
-            }
-          ],
-          "thresholds": "0.75, 0.9",
-          "title": "Disk space usage",
-          "type": "singlestat",
-          "valueFontSize": "80%",
-          "valueMaps": [
-            {
-              "op": "=",
-              "text": "N/A",
|               "value": "null" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "valueName": "current" |  | ||||||
|         } |  | ||||||
|       ], |  | ||||||
|       "repeat": null, |  | ||||||
|       "repeatIteration": null, |  | ||||||
|       "repeatRowId": null, |  | ||||||
|       "showTitle": false, |  | ||||||
|       "title": "New row", |  | ||||||
|       "titleSize": "h6" |  | ||||||
|     }, |  | ||||||
|     { |  | ||||||
|       "collapse": false, |  | ||||||
|       "height": "250px", |  | ||||||
|       "panels": [ |  | ||||||
|         { |  | ||||||
|           "alerting": {}, |  | ||||||
|           "aliasColors": {}, |  | ||||||
|           "bars": false, |  | ||||||
|           "datasource": "${DS_PROMETHEUS}", |  | ||||||
|           "editable": true, |  | ||||||
|           "error": false, |  | ||||||
|           "fill": 1, |  | ||||||
|           "grid": {}, |  | ||||||
|           "id": 8, |  | ||||||
|           "legend": { |  | ||||||
|             "avg": false, |  | ||||||
|             "current": false, |  | ||||||
|             "max": false, |  | ||||||
|             "min": false, |  | ||||||
|             "show": true, |  | ||||||
|             "total": false, |  | ||||||
|             "values": false |  | ||||||
|           }, |  | ||||||
|           "lines": true, |  | ||||||
|           "linewidth": 2, |  | ||||||
|           "links": [], |  | ||||||
|           "nullPointMode": "connected", |  | ||||||
|           "percentage": false, |  | ||||||
|           "pointradius": 5, |  | ||||||
|           "points": false, |  | ||||||
|           "renderer": "flot", |  | ||||||
|           "seriesOverrides": [ |  | ||||||
|             { |  | ||||||
|               "alias": "transmitted ", |  | ||||||
|               "yaxis": 2 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "span": 6, |  | ||||||
|           "stack": false, |  | ||||||
|           "steppedLine": false, |  | ||||||
|           "targets": [ |  | ||||||
|             { |  | ||||||
|               "expr": "sum(rate(node_network_receive_bytes{device!~\"lo\"}[5m]))", |  | ||||||
|               "hide": false, |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "legendFormat": "", |  | ||||||
|               "refId": "A", |  | ||||||
|               "step": 10, |  | ||||||
|               "target": "" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "thresholds": [], |  | ||||||
|           "timeFrom": null, |  | ||||||
|           "timeShift": null, |  | ||||||
|           "title": "Network received", |  | ||||||
|           "tooltip": { |  | ||||||
|             "msResolution": false, |  | ||||||
|             "shared": true, |  | ||||||
|             "sort": 0, |  | ||||||
|             "value_type": "cumulative" |  | ||||||
|           }, |  | ||||||
|           "type": "graph", |  | ||||||
|           "xaxis": { |  | ||||||
|             "mode": "time", |  | ||||||
|             "name": null, |  | ||||||
|             "show": true, |  | ||||||
|             "values": [] |  | ||||||
|           }, |  | ||||||
|           "yaxes": [ |  | ||||||
|             { |  | ||||||
|               "format": "bytes", |  | ||||||
|               "label": null, |  | ||||||
|               "logBase": 1, |  | ||||||
|               "max": null, |  | ||||||
|               "min": null, |  | ||||||
|               "show": true |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "format": "bytes", |  | ||||||
|               "label": null, |  | ||||||
|               "logBase": 1, |  | ||||||
|               "max": null, |  | ||||||
|               "min": null, |  | ||||||
|               "show": true |  | ||||||
|             } |  | ||||||
|           ] |  | ||||||
|         }, |  | ||||||
|         { |  | ||||||
|           "alerting": {}, |  | ||||||
|           "aliasColors": {}, |  | ||||||
|           "bars": false, |  | ||||||
|           "datasource": "${DS_PROMETHEUS}", |  | ||||||
|           "editable": true, |  | ||||||
|           "error": false, |  | ||||||
|           "fill": 1, |  | ||||||
|           "grid": {}, |  | ||||||
|           "id": 10, |  | ||||||
|           "legend": { |  | ||||||
|             "avg": false, |  | ||||||
|             "current": false, |  | ||||||
|             "max": false, |  | ||||||
|             "min": false, |  | ||||||
|             "show": true, |  | ||||||
|             "total": false, |  | ||||||
|             "values": false |  | ||||||
|           }, |  | ||||||
|           "lines": true, |  | ||||||
|           "linewidth": 2, |  | ||||||
|           "links": [], |  | ||||||
|           "nullPointMode": "connected", |  | ||||||
|           "percentage": false, |  | ||||||
|           "pointradius": 5, |  | ||||||
|           "points": false, |  | ||||||
|           "renderer": "flot", |  | ||||||
|           "seriesOverrides": [ |  | ||||||
|             { |  | ||||||
|               "alias": "transmitted ", |  | ||||||
|               "yaxis": 2 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "span": 6, |  | ||||||
|           "stack": false, |  | ||||||
|           "steppedLine": false, |  | ||||||
|           "targets": [ |  | ||||||
|             { |  | ||||||
|               "expr": "sum(rate(node_network_transmit_bytes{device!~\"lo\"}[5m]))", |  | ||||||
|               "hide": false, |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "legendFormat": "", |  | ||||||
|               "refId": "B", |  | ||||||
|               "step": 10, |  | ||||||
|               "target": "" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "thresholds": [], |  | ||||||
|           "timeFrom": null, |  | ||||||
|           "timeShift": null, |  | ||||||
|           "title": "Network transmitted", |  | ||||||
|           "tooltip": { |  | ||||||
|             "msResolution": false, |  | ||||||
|             "shared": true, |  | ||||||
|             "sort": 0, |  | ||||||
|             "value_type": "cumulative" |  | ||||||
|           }, |  | ||||||
|           "type": "graph", |  | ||||||
|           "xaxis": { |  | ||||||
|             "mode": "time", |  | ||||||
|             "name": null, |  | ||||||
|             "show": true, |  | ||||||
|             "values": [] |  | ||||||
|           }, |  | ||||||
|           "yaxes": [ |  | ||||||
|             { |  | ||||||
|               "format": "bytes", |  | ||||||
|               "label": null, |  | ||||||
|               "logBase": 1, |  | ||||||
|               "max": null, |  | ||||||
|               "min": null, |  | ||||||
|               "show": true |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "format": "bytes", |  | ||||||
|               "label": null, |  | ||||||
|               "logBase": 1, |  | ||||||
|               "max": null, |  | ||||||
|               "min": null, |  | ||||||
|               "show": true |  | ||||||
|             } |  | ||||||
|           ] |  | ||||||
|         } |  | ||||||
|       ], |  | ||||||
|       "repeat": null, |  | ||||||
|       "repeatIteration": null, |  | ||||||
|       "repeatRowId": null, |  | ||||||
|       "showTitle": false, |  | ||||||
|       "title": "New row", |  | ||||||
|       "titleSize": "h6" |  | ||||||
|     } |  | ||||||
|   ], |  | ||||||
|   "schemaVersion": 14, |  | ||||||
|   "style": "dark", |  | ||||||
|   "tags": [ |  | ||||||
|     "prometheus" |  | ||||||
|   ], |  | ||||||
|   "templating": { |  | ||||||
|     "list": [] |  | ||||||
|   }, |  | ||||||
|   "time": { |  | ||||||
|     "from": "now-1h", |  | ||||||
|     "to": "now" |  | ||||||
|   }, |  | ||||||
|   "timepicker": { |  | ||||||
|     "refresh_intervals": [ |  | ||||||
|       "5s", |  | ||||||
|       "10s", |  | ||||||
|       "30s", |  | ||||||
|       "1m", |  | ||||||
|       "5m", |  | ||||||
|       "15m", |  | ||||||
|       "30m", |  | ||||||
|       "1h", |  | ||||||
|       "2h", |  | ||||||
|       "1d" |  | ||||||
|     ], |  | ||||||
|     "time_options": [ |  | ||||||
|       "5m", |  | ||||||
|       "15m", |  | ||||||
|       "1h", |  | ||||||
|       "6h", |  | ||||||
|       "12h", |  | ||||||
|       "24h", |  | ||||||
|       "2d", |  | ||||||
|       "7d", |  | ||||||
|       "30d" |  | ||||||
|     ] |  | ||||||
|   }, |  | ||||||
|   "timezone": "browser", |  | ||||||
|   "title": "All Nodes", |  | ||||||
|   "version": 1 |  | ||||||
| }, |  | ||||||
|   "inputs": [ |  | ||||||
|     { |  | ||||||
|       "name": "DS_PROMETHEUS", |  | ||||||
|       "pluginId": "prometheus", |  | ||||||
|       "type": "datasource", |  | ||||||
|       "value": "prometheus" |  | ||||||
|     } |  | ||||||
|   ], |  | ||||||
|   "overwrite": true |  | ||||||
| } |  | ||||||
| @@ -1,817 +0,0 @@ | |||||||
| { |  | ||||||
|   "dashboard": { |  | ||||||
|   "__inputs": [ |  | ||||||
|     { |  | ||||||
|       "name": "DS_PROMETHEUS", |  | ||||||
|       "label": "prometheus", |  | ||||||
|       "description": "", |  | ||||||
|       "type": "datasource", |  | ||||||
|       "pluginId": "prometheus", |  | ||||||
|       "pluginName": "Prometheus" |  | ||||||
|     } |  | ||||||
|   ], |  | ||||||
|   "__requires": [ |  | ||||||
|     { |  | ||||||
|       "type": "panel", |  | ||||||
|       "id": "singlestat", |  | ||||||
|       "name": "Singlestat", |  | ||||||
|       "version": "" |  | ||||||
|     }, |  | ||||||
|     { |  | ||||||
|       "type": "panel", |  | ||||||
|       "id": "graph", |  | ||||||
|       "name": "Graph", |  | ||||||
|       "version": "" |  | ||||||
|     }, |  | ||||||
|     { |  | ||||||
|       "type": "grafana", |  | ||||||
|       "id": "grafana", |  | ||||||
|       "name": "Grafana", |  | ||||||
|       "version": "3.1.1" |  | ||||||
|     }, |  | ||||||
|     { |  | ||||||
|       "type": "datasource", |  | ||||||
|       "id": "prometheus", |  | ||||||
|       "name": "Prometheus", |  | ||||||
|       "version": "1.0.0" |  | ||||||
|     } |  | ||||||
|   ], |  | ||||||
|   "id": null, |  | ||||||
|   "title": "Deployment", |  | ||||||
|   "tags": [], |  | ||||||
|   "style": "dark", |  | ||||||
|   "timezone": "browser", |  | ||||||
|   "editable": true, |  | ||||||
|   "hideControls": false, |  | ||||||
|   "sharedCrosshair": true, |  | ||||||
|   "rows": [ |  | ||||||
|     { |  | ||||||
|       "collapse": false, |  | ||||||
|       "editable": true, |  | ||||||
|       "height": "200px", |  | ||||||
|       "panels": [ |  | ||||||
|         { |  | ||||||
|           "title": "CPU", |  | ||||||
|           "error": false, |  | ||||||
|           "span": 4, |  | ||||||
|           "editable": true, |  | ||||||
|           "type": "singlestat", |  | ||||||
|           "isNew": true, |  | ||||||
|           "id": 8, |  | ||||||
|           "targets": [ |  | ||||||
|             { |  | ||||||
|               "refId": "A", |  | ||||||
|               "expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m])) ", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "step": 600 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "links": [], |  | ||||||
|           "datasource": "${DS_PROMETHEUS}", |  | ||||||
|           "maxDataPoints": 100, |  | ||||||
|           "interval": null, |  | ||||||
|           "cacheTimeout": null, |  | ||||||
|           "format": "none", |  | ||||||
|           "prefix": "", |  | ||||||
|           "postfix": "cores", |  | ||||||
|           "nullText": null, |  | ||||||
|           "valueMaps": [ |  | ||||||
|             { |  | ||||||
|               "value": "null", |  | ||||||
|               "op": "=", |  | ||||||
|               "text": "N/A" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "mappingTypes": [ |  | ||||||
|             { |  | ||||||
|               "name": "value to text", |  | ||||||
|               "value": 1 |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "name": "range to text", |  | ||||||
|               "value": 2 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "rangeMaps": [ |  | ||||||
|             { |  | ||||||
|               "from": "null", |  | ||||||
|               "to": "null", |  | ||||||
|               "text": "N/A" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "mappingType": 1, |  | ||||||
|           "nullPointMode": "connected", |  | ||||||
|           "valueName": "avg", |  | ||||||
|           "prefixFontSize": "50%", |  | ||||||
|           "valueFontSize": "110%", |  | ||||||
|           "postfixFontSize": "50%", |  | ||||||
|           "thresholds": "", |  | ||||||
|           "colorBackground": false, |  | ||||||
|           "colorValue": false, |  | ||||||
|           "colors": [ |  | ||||||
|             "rgba(245, 54, 54, 0.9)", |  | ||||||
|             "rgba(237, 129, 40, 0.89)", |  | ||||||
|             "rgba(50, 172, 45, 0.97)" |  | ||||||
|           ], |  | ||||||
|           "sparkline": { |  | ||||||
|             "show": true, |  | ||||||
|             "full": false, |  | ||||||
|             "lineColor": "rgb(31, 120, 193)", |  | ||||||
|             "fillColor": "rgba(31, 118, 189, 0.18)" |  | ||||||
|           }, |  | ||||||
|           "gauge": { |  | ||||||
|             "show": false, |  | ||||||
|             "minValue": 0, |  | ||||||
|             "maxValue": 100, |  | ||||||
|             "thresholdMarkers": true, |  | ||||||
|             "thresholdLabels": false |  | ||||||
|           } |  | ||||||
|         }, |  | ||||||
|         { |  | ||||||
|           "title": "Memory", |  | ||||||
|           "error": false, |  | ||||||
|           "span": 4, |  | ||||||
|           "editable": true, |  | ||||||
|           "type": "singlestat", |  | ||||||
|           "isNew": true, |  | ||||||
|           "id": 9, |  | ||||||
|           "targets": [ |  | ||||||
|             { |  | ||||||
|               "refId": "A", |  | ||||||
|               "expr": "sum(container_memory_usage_bytes{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}) / 1024^3", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "step": 600 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "links": [], |  | ||||||
|           "datasource": "${DS_PROMETHEUS}", |  | ||||||
|           "maxDataPoints": 100, |  | ||||||
|           "interval": null, |  | ||||||
|           "cacheTimeout": null, |  | ||||||
|           "format": "none", |  | ||||||
|           "prefix": "", |  | ||||||
|           "postfix": "GB", |  | ||||||
|           "nullText": null, |  | ||||||
|           "valueMaps": [ |  | ||||||
|             { |  | ||||||
|               "value": "null", |  | ||||||
|               "op": "=", |  | ||||||
|               "text": "N/A" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "mappingTypes": [ |  | ||||||
|             { |  | ||||||
|               "name": "value to text", |  | ||||||
|               "value": 1 |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "name": "range to text", |  | ||||||
|               "value": 2 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "rangeMaps": [ |  | ||||||
|             { |  | ||||||
|               "from": "null", |  | ||||||
|               "to": "null", |  | ||||||
|               "text": "N/A" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "mappingType": 1, |  | ||||||
|           "nullPointMode": "connected", |  | ||||||
|           "valueName": "avg", |  | ||||||
|           "prefixFontSize": "80%", |  | ||||||
|           "valueFontSize": "110%", |  | ||||||
|           "postfixFontSize": "50%", |  | ||||||
|           "thresholds": "", |  | ||||||
|           "colorBackground": false, |  | ||||||
|           "colorValue": false, |  | ||||||
|           "colors": [ |  | ||||||
|             "rgba(245, 54, 54, 0.9)", |  | ||||||
|             "rgba(237, 129, 40, 0.89)", |  | ||||||
|             "rgba(50, 172, 45, 0.97)" |  | ||||||
|           ], |  | ||||||
|           "sparkline": { |  | ||||||
|             "show": true, |  | ||||||
|             "full": false, |  | ||||||
|             "lineColor": "rgb(31, 120, 193)", |  | ||||||
|             "fillColor": "rgba(31, 118, 189, 0.18)" |  | ||||||
|           }, |  | ||||||
|           "gauge": { |  | ||||||
|             "show": false, |  | ||||||
|             "minValue": 0, |  | ||||||
|             "maxValue": 100, |  | ||||||
|             "thresholdMarkers": true, |  | ||||||
|             "thresholdLabels": false |  | ||||||
|           } |  | ||||||
|         }, |  | ||||||
|         { |  | ||||||
|           "title": "Network", |  | ||||||
|           "error": false, |  | ||||||
|           "span": 4, |  | ||||||
|           "editable": true, |  | ||||||
|           "type": "singlestat", |  | ||||||
|           "isNew": true, |  | ||||||
|           "id": 7, |  | ||||||
|           "targets": [ |  | ||||||
|             { |  | ||||||
|               "refId": "A", |  | ||||||
|               "expr": "sum(rate(container_network_transmit_bytes_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m])) + sum(rate(container_network_receive_bytes_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m])) ", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "step": 600 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "links": [], |  | ||||||
|           "datasource": "${DS_PROMETHEUS}", |  | ||||||
|           "maxDataPoints": 100, |  | ||||||
|           "interval": null, |  | ||||||
|           "cacheTimeout": null, |  | ||||||
|           "format": "Bps", |  | ||||||
|           "prefix": "", |  | ||||||
|           "postfix": "", |  | ||||||
|           "nullText": null, |  | ||||||
|           "valueMaps": [ |  | ||||||
|             { |  | ||||||
|               "value": "null", |  | ||||||
|               "op": "=", |  | ||||||
|               "text": "N/A" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "mappingTypes": [ |  | ||||||
|             { |  | ||||||
|               "name": "value to text", |  | ||||||
|               "value": 1 |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "name": "range to text", |  | ||||||
|               "value": 2 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "rangeMaps": [ |  | ||||||
|             { |  | ||||||
|               "from": "null", |  | ||||||
|               "to": "null", |  | ||||||
|               "text": "N/A" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "mappingType": 1, |  | ||||||
|           "nullPointMode": "connected", |  | ||||||
|           "valueName": "avg", |  | ||||||
|           "prefixFontSize": "50%", |  | ||||||
|           "valueFontSize": "80%", |  | ||||||
|           "postfixFontSize": "50%", |  | ||||||
|           "thresholds": "", |  | ||||||
|           "colorBackground": false, |  | ||||||
|           "colorValue": false, |  | ||||||
|           "colors": [ |  | ||||||
|             "rgba(245, 54, 54, 0.9)", |  | ||||||
|             "rgba(237, 129, 40, 0.89)", |  | ||||||
|             "rgba(50, 172, 45, 0.97)" |  | ||||||
|           ], |  | ||||||
|           "sparkline": { |  | ||||||
|             "show": true, |  | ||||||
|             "full": false, |  | ||||||
|             "lineColor": "rgb(31, 120, 193)", |  | ||||||
|             "fillColor": "rgba(31, 118, 189, 0.18)" |  | ||||||
|           }, |  | ||||||
|           "gauge": { |  | ||||||
|             "show": false, |  | ||||||
|             "minValue": 0, |  | ||||||
|             "maxValue": 100, |  | ||||||
|             "thresholdMarkers": false, |  | ||||||
|             "thresholdLabels": false |  | ||||||
|           } |  | ||||||
|         } |  | ||||||
|       ], |  | ||||||
|       "title": "Row", |  | ||||||
|       "showTitle": false |  | ||||||
|     }, |  | ||||||
|     { |  | ||||||
|       "title": "New row", |  | ||||||
|       "height": "100px", |  | ||||||
|       "editable": true, |  | ||||||
|       "collapse": false, |  | ||||||
|       "panels": [ |  | ||||||
|         { |  | ||||||
|           "title": "Desired Replicas", |  | ||||||
|           "error": false, |  | ||||||
|           "span": 3, |  | ||||||
|           "editable": true, |  | ||||||
|           "type": "singlestat", |  | ||||||
|           "isNew": true, |  | ||||||
|           "id": 5, |  | ||||||
|           "targets": [ |  | ||||||
|             { |  | ||||||
|               "refId": "A", |  | ||||||
|               "expr": "kube_deployment_spec_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "step": 600, |  | ||||||
|               "metric": "kube_deployment_spec_replicas" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "links": [], |  | ||||||
|           "datasource": "${DS_PROMETHEUS}", |  | ||||||
|           "maxDataPoints": 100, |  | ||||||
|           "interval": null, |  | ||||||
|           "cacheTimeout": null, |  | ||||||
|           "format": "none", |  | ||||||
|           "prefix": "", |  | ||||||
|           "postfix": "", |  | ||||||
|           "nullText": null, |  | ||||||
|           "valueMaps": [ |  | ||||||
|             { |  | ||||||
|               "value": "null", |  | ||||||
|               "op": "=", |  | ||||||
|               "text": "N/A" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "mappingTypes": [ |  | ||||||
|             { |  | ||||||
|               "name": "value to text", |  | ||||||
|               "value": 1 |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "name": "range to text", |  | ||||||
|               "value": 2 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "rangeMaps": [ |  | ||||||
|             { |  | ||||||
|               "from": "null", |  | ||||||
|               "to": "null", |  | ||||||
|               "text": "N/A" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "mappingType": 1, |  | ||||||
|           "nullPointMode": "connected", |  | ||||||
|           "valueName": "avg", |  | ||||||
|           "prefixFontSize": "50%", |  | ||||||
|           "valueFontSize": "80%", |  | ||||||
|           "postfixFontSize": "50%", |  | ||||||
|           "thresholds": "", |  | ||||||
|           "colorBackground": false, |  | ||||||
|           "colorValue": false, |  | ||||||
|           "colors": [ |  | ||||||
|             "rgba(245, 54, 54, 0.9)", |  | ||||||
|             "rgba(237, 129, 40, 0.89)", |  | ||||||
|             "rgba(50, 172, 45, 0.97)" |  | ||||||
|           ], |  | ||||||
|           "sparkline": { |  | ||||||
|             "show": false, |  | ||||||
|             "full": false, |  | ||||||
|             "lineColor": "rgb(31, 120, 193)", |  | ||||||
|             "fillColor": "rgba(31, 118, 189, 0.18)" |  | ||||||
|           }, |  | ||||||
|           "gauge": { |  | ||||||
|             "show": false, |  | ||||||
|             "minValue": 0, |  | ||||||
|             "maxValue": 100, |  | ||||||
|             "thresholdMarkers": false, |  | ||||||
|             "thresholdLabels": false |  | ||||||
|           }, |  | ||||||
|           "decimals": null |  | ||||||
|         }, |  | ||||||
|         { |  | ||||||
|           "title": "Available Replicas", |  | ||||||
|           "error": false, |  | ||||||
|           "span": 3, |  | ||||||
|           "editable": true, |  | ||||||
|           "type": "singlestat", |  | ||||||
|           "isNew": true, |  | ||||||
|           "id": 6, |  | ||||||
|           "targets": [ |  | ||||||
|             { |  | ||||||
|               "refId": "A", |  | ||||||
|               "expr": "kube_deployment_status_replicas_available{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "step": 600 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "links": [], |  | ||||||
|           "datasource": "${DS_PROMETHEUS}", |  | ||||||
|           "maxDataPoints": 100, |  | ||||||
|           "interval": null, |  | ||||||
|           "cacheTimeout": null, |  | ||||||
|           "format": "none", |  | ||||||
|           "prefix": "", |  | ||||||
|           "postfix": "", |  | ||||||
|           "nullText": null, |  | ||||||
|           "valueMaps": [ |  | ||||||
|             { |  | ||||||
|               "value": "null", |  | ||||||
|               "op": "=", |  | ||||||
|               "text": "N/A" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "mappingTypes": [ |  | ||||||
|             { |  | ||||||
|               "name": "value to text", |  | ||||||
|               "value": 1 |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "name": "range to text", |  | ||||||
|               "value": 2 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "rangeMaps": [ |  | ||||||
|             { |  | ||||||
|               "from": "null", |  | ||||||
|               "to": "null", |  | ||||||
|               "text": "N/A" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "mappingType": 1, |  | ||||||
|           "nullPointMode": "connected", |  | ||||||
|           "valueName": "avg", |  | ||||||
|           "prefixFontSize": "50%", |  | ||||||
|           "valueFontSize": "80%", |  | ||||||
|           "postfixFontSize": "50%", |  | ||||||
|           "thresholds": "", |  | ||||||
|           "colorBackground": false, |  | ||||||
|           "colorValue": false, |  | ||||||
|           "colors": [ |  | ||||||
|             "rgba(245, 54, 54, 0.9)", |  | ||||||
|             "rgba(237, 129, 40, 0.89)", |  | ||||||
|             "rgba(50, 172, 45, 0.97)" |  | ||||||
|           ], |  | ||||||
|           "sparkline": { |  | ||||||
|             "show": false, |  | ||||||
|             "full": false, |  | ||||||
|             "lineColor": "rgb(31, 120, 193)", |  | ||||||
|             "fillColor": "rgba(31, 118, 189, 0.18)" |  | ||||||
|           }, |  | ||||||
|           "gauge": { |  | ||||||
|             "show": false, |  | ||||||
|             "minValue": 0, |  | ||||||
|             "maxValue": 100, |  | ||||||
|             "thresholdMarkers": true, |  | ||||||
|             "thresholdLabels": false |  | ||||||
|           } |  | ||||||
|         }, |  | ||||||
|         { |  | ||||||
|           "cacheTimeout": null, |  | ||||||
|           "colorBackground": false, |  | ||||||
|           "colorValue": false, |  | ||||||
|           "colors": [ |  | ||||||
|             "rgba(245, 54, 54, 0.9)", |  | ||||||
|             "rgba(237, 129, 40, 0.89)", |  | ||||||
|             "rgba(50, 172, 45, 0.97)" |  | ||||||
|           ], |  | ||||||
|           "datasource": "${DS_PROMETHEUS}", |  | ||||||
|           "editable": true, |  | ||||||
|           "error": false, |  | ||||||
|           "format": "none", |  | ||||||
|           "gauge": { |  | ||||||
|             "maxValue": 100, |  | ||||||
|             "minValue": 0, |  | ||||||
|             "show": false, |  | ||||||
|             "thresholdLabels": false, |  | ||||||
|             "thresholdMarkers": true |  | ||||||
|           }, |  | ||||||
|           "id": 3, |  | ||||||
|           "interval": null, |  | ||||||
|           "isNew": true, |  | ||||||
|           "links": [], |  | ||||||
|           "mappingType": 1, |  | ||||||
|           "mappingTypes": [ |  | ||||||
|             { |  | ||||||
|               "name": "value to text", |  | ||||||
|               "value": 1 |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "name": "range to text", |  | ||||||
|               "value": 2 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "maxDataPoints": 100, |  | ||||||
|           "nullPointMode": "connected", |  | ||||||
|           "nullText": null, |  | ||||||
|           "postfix": "", |  | ||||||
|           "postfixFontSize": "50%", |  | ||||||
|           "prefix": "", |  | ||||||
|           "prefixFontSize": "50%", |  | ||||||
|           "rangeMaps": [ |  | ||||||
|             { |  | ||||||
|               "from": "null", |  | ||||||
|               "text": "N/A", |  | ||||||
|               "to": "null" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "span": 3, |  | ||||||
|           "sparkline": { |  | ||||||
|             "fillColor": "rgba(31, 118, 189, 0.18)", |  | ||||||
|             "full": false, |  | ||||||
|             "lineColor": "rgb(31, 120, 193)", |  | ||||||
|             "show": false |  | ||||||
|           }, |  | ||||||
|           "targets": [ |  | ||||||
|             { |  | ||||||
|               "expr": "kube_deployment_status_observed_generation{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "legendFormat": "", |  | ||||||
|               "refId": "A", |  | ||||||
|               "step": 600 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "thresholds": "", |  | ||||||
|           "title": "Observed Generation", |  | ||||||
|           "type": "singlestat", |  | ||||||
|           "valueFontSize": "80%", |  | ||||||
|           "valueMaps": [ |  | ||||||
|             { |  | ||||||
|               "op": "=", |  | ||||||
|               "text": "N/A", |  | ||||||
|               "value": "null" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "valueName": "avg" |  | ||||||
|         }, |  | ||||||
|         { |  | ||||||
|           "cacheTimeout": null, |  | ||||||
|           "colorBackground": false, |  | ||||||
|           "colorValue": false, |  | ||||||
|           "colors": [ |  | ||||||
|             "rgba(245, 54, 54, 0.9)", |  | ||||||
|             "rgba(237, 129, 40, 0.89)", |  | ||||||
|             "rgba(50, 172, 45, 0.97)" |  | ||||||
|           ], |  | ||||||
|           "datasource": "${DS_PROMETHEUS}", |  | ||||||
|           "editable": true, |  | ||||||
|           "error": false, |  | ||||||
|           "format": "none", |  | ||||||
|           "gauge": { |  | ||||||
|             "maxValue": 100, |  | ||||||
|             "minValue": 0, |  | ||||||
|             "show": false, |  | ||||||
|             "thresholdLabels": false, |  | ||||||
|             "thresholdMarkers": true |  | ||||||
|           }, |  | ||||||
|           "id": 2, |  | ||||||
|           "interval": null, |  | ||||||
|           "isNew": true, |  | ||||||
|           "links": [], |  | ||||||
|           "mappingType": 1, |  | ||||||
|           "mappingTypes": [ |  | ||||||
|             { |  | ||||||
|               "name": "value to text", |  | ||||||
|               "value": 1 |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "name": "range to text", |  | ||||||
|               "value": 2 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "maxDataPoints": 100, |  | ||||||
|           "nullPointMode": "connected", |  | ||||||
|           "nullText": null, |  | ||||||
|           "postfix": "", |  | ||||||
|           "postfixFontSize": "50%", |  | ||||||
|           "prefix": "", |  | ||||||
|           "prefixFontSize": "50%", |  | ||||||
|           "rangeMaps": [ |  | ||||||
|             { |  | ||||||
|               "from": "null", |  | ||||||
|               "text": "N/A", |  | ||||||
|               "to": "null" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "span": 3, |  | ||||||
|           "sparkline": { |  | ||||||
|             "fillColor": "rgba(31, 118, 189, 0.18)", |  | ||||||
|             "full": false, |  | ||||||
|             "lineColor": "rgb(31, 120, 193)", |  | ||||||
|             "show": false |  | ||||||
|           }, |  | ||||||
|           "targets": [ |  | ||||||
|             { |  | ||||||
|               "expr": "kube_deployment_metadata_generation{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "legendFormat": "", |  | ||||||
|               "refId": "A", |  | ||||||
|               "step": 600 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "thresholds": "", |  | ||||||
|           "title": "Metadata Generation", |  | ||||||
|           "type": "singlestat", |  | ||||||
|           "valueFontSize": "80%", |  | ||||||
|           "valueMaps": [ |  | ||||||
|             { |  | ||||||
|               "op": "=", |  | ||||||
|               "text": "N/A", |  | ||||||
|               "value": "null" |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "valueName": "avg" |  | ||||||
|         } |  | ||||||
|       ] |  | ||||||
|     }, |  | ||||||
|     { |  | ||||||
|       "collapse": false, |  | ||||||
|       "editable": true, |  | ||||||
|       "height": "350px", |  | ||||||
|       "panels": [ |  | ||||||
|         { |  | ||||||
|           "aliasColors": {}, |  | ||||||
|           "bars": false, |  | ||||||
|           "datasource": "${DS_PROMETHEUS}", |  | ||||||
|           "editable": true, |  | ||||||
|           "error": false, |  | ||||||
|           "fill": 1, |  | ||||||
|           "grid": { |  | ||||||
|             "threshold1": null, |  | ||||||
|             "threshold1Color": "rgba(216, 200, 27, 0.27)", |  | ||||||
|             "threshold2": null, |  | ||||||
|             "threshold2Color": "rgba(234, 112, 112, 0.22)" |  | ||||||
|           }, |  | ||||||
|           "id": 1, |  | ||||||
|           "isNew": true, |  | ||||||
|           "legend": { |  | ||||||
|             "avg": false, |  | ||||||
|             "current": false, |  | ||||||
|             "max": false, |  | ||||||
|             "min": false, |  | ||||||
|             "show": true, |  | ||||||
|             "total": false, |  | ||||||
|             "values": false, |  | ||||||
|             "hideZero": false |  | ||||||
|           }, |  | ||||||
|           "lines": true, |  | ||||||
|           "linewidth": 2, |  | ||||||
|           "links": [], |  | ||||||
|           "nullPointMode": "connected", |  | ||||||
|           "percentage": false, |  | ||||||
|           "pointradius": 5, |  | ||||||
|           "points": false, |  | ||||||
|           "renderer": "flot", |  | ||||||
|           "seriesOverrides": [], |  | ||||||
|           "span": 12, |  | ||||||
|           "stack": false, |  | ||||||
|           "steppedLine": false, |  | ||||||
|           "targets": [ |  | ||||||
|             { |  | ||||||
|               "expr": "kube_deployment_status_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "legendFormat": "current replicas", |  | ||||||
|               "refId": "A", |  | ||||||
|               "step": 30 |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "expr": "kube_deployment_status_replicas_available{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "legendFormat": "available", |  | ||||||
|               "refId": "B", |  | ||||||
|               "step": 30 |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "expr": "kube_deployment_status_replicas_unavailable{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "legendFormat": "unavailable", |  | ||||||
|               "refId": "C", |  | ||||||
|               "step": 30 |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "expr": "kube_deployment_status_replicas_updated{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "legendFormat": "updated", |  | ||||||
|               "refId": "D", |  | ||||||
|               "step": 30 |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "expr": "kube_deployment_spec_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}", |  | ||||||
|               "intervalFactor": 2, |  | ||||||
|               "legendFormat": "desired", |  | ||||||
|               "refId": "E", |  | ||||||
|               "step": 30 |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "thresholds": [], |  | ||||||
|           "timeFrom": null, |  | ||||||
|           "timeShift": null, |  | ||||||
|           "title": "Replicas", |  | ||||||
|           "tooltip": { |  | ||||||
|             "msResolution": true, |  | ||||||
|             "shared": true, |  | ||||||
|             "sort": 0, |  | ||||||
|             "value_type": "cumulative" |  | ||||||
|           }, |  | ||||||
|           "type": "graph", |  | ||||||
|           "xaxis": { |  | ||||||
|             "mode": "time", |  | ||||||
|             "name": null, |  | ||||||
|             "show": true, |  | ||||||
|             "values": [] |  | ||||||
|           }, |  | ||||||
|           "yaxes": [ |  | ||||||
|             { |  | ||||||
|               "format": "none", |  | ||||||
|               "label": "", |  | ||||||
|               "logBase": 1, |  | ||||||
|               "max": null, |  | ||||||
|               "min": null, |  | ||||||
|               "show": true |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|               "format": "short", |  | ||||||
|               "label": null, |  | ||||||
|               "logBase": 1, |  | ||||||
|               "max": null, |  | ||||||
|               "min": null, |  | ||||||
|               "show": false |  | ||||||
|             } |  | ||||||
|           ], |  | ||||||
|           "transparent": false |  | ||||||
|         } |  | ||||||
|       ], |  | ||||||
|       "title": "New row", |  | ||||||
|       "showTitle": false |  | ||||||
|     } |  | ||||||
|   ], |  | ||||||
|   "time": { |  | ||||||
|     "from": "now-6h", |  | ||||||
|     "to": "now" |  | ||||||
|   }, |  | ||||||
|   "timepicker": { |  | ||||||
|     "refresh_intervals": [ |  | ||||||
|       "5s", |  | ||||||
|       "10s", |  | ||||||
|       "30s", |  | ||||||
|       "1m", |  | ||||||
|       "5m", |  | ||||||
|       "15m", |  | ||||||
|       "30m", |  | ||||||
|       "1h", |  | ||||||
|       "2h", |  | ||||||
|       "1d" |  | ||||||
|     ], |  | ||||||
|     "time_options": [ |  | ||||||
|       "5m", |  | ||||||
|       "15m", |  | ||||||
|       "1h", |  | ||||||
|       "6h", |  | ||||||
|       "12h", |  | ||||||
|       "24h", |  | ||||||
|       "2d", |  | ||||||
|       "7d", |  | ||||||
|       "30d" |  | ||||||
|     ] |  | ||||||
|   }, |  | ||||||
|   "templating": { |  | ||||||
|     "list": [ |  | ||||||
|       { |  | ||||||
|         "allValue": ".*", |  | ||||||
|         "current": {}, |  | ||||||
|         "datasource": "${DS_PROMETHEUS}", |  | ||||||
|         "hide": 0, |  | ||||||
|         "includeAll": false, |  | ||||||
|         "label": "Namespace", |  | ||||||
|         "multi": false, |  | ||||||
|         "name": "deployment_namespace", |  | ||||||
|         "options": [], |  | ||||||
|         "query": "label_values(kube_deployment_metadata_generation, namespace)", |  | ||||||
|         "refresh": 1, |  | ||||||
|         "regex": "", |  | ||||||
|         "sort": 0, |  | ||||||
|         "tagValuesQuery": null, |  | ||||||
|         "tagsQuery": "", |  | ||||||
|         "type": "query", |  | ||||||
|         "useTags": false |  | ||||||
|       }, |  | ||||||
|       { |  | ||||||
|         "allValue": null, |  | ||||||
|         "current": {}, |  | ||||||
|         "datasource": "${DS_PROMETHEUS}", |  | ||||||
|         "hide": 0, |  | ||||||
|         "includeAll": false, |  | ||||||
|         "label": "Deployment", |  | ||||||
|         "multi": false, |  | ||||||
|         "name": "deployment_name", |  | ||||||
|         "options": [], |  | ||||||
|         "query": "label_values(kube_deployment_metadata_generation{namespace=\"$deployment_namespace\"}, deployment)", |  | ||||||
|         "refresh": 1, |  | ||||||
|         "regex": "", |  | ||||||
|         "sort": 0, |  | ||||||
|         "tagValuesQuery": "", |  | ||||||
|         "tagsQuery": "deployment", |  | ||||||
|         "type": "query", |  | ||||||
|         "useTags": false |  | ||||||
|       } |  | ||||||
|     ] |  | ||||||
|   }, |  | ||||||
|   "annotations": { |  | ||||||
|     "list": [] |  | ||||||
|   }, |  | ||||||
|   "schemaVersion": 12, |  | ||||||
|   "version": 2, |  | ||||||
|   "links": [], |  | ||||||
|   "gnetId": null |  | ||||||
| }, |  | ||||||
|     "inputs": [ |  | ||||||
|         { |  | ||||||
|             "name": "DS_PROMETHEUS", |  | ||||||
|             "pluginId": "prometheus", |  | ||||||
|             "type": "datasource", |  | ||||||
|             "value": "prometheus" |  | ||||||
|         } |  | ||||||
|     ], |  | ||||||
|     "overwrite": true |  | ||||||
| } |  | ||||||
| @ -1,409 +0,0 @@ | |||||||
| { |  | ||||||
|     "dashboard": { |  | ||||||
|         "__inputs": [ |  | ||||||
|             { |  | ||||||
|                 "description": "", |  | ||||||
|                 "label": "prometheus", |  | ||||||
|                 "name": "DS_PROMETHEUS", |  | ||||||
|                 "pluginId": "prometheus", |  | ||||||
|                 "pluginName": "Prometheus", |  | ||||||
|                 "type": "datasource" |  | ||||||
|             } |  | ||||||
|         ], |  | ||||||
|         "__requires": [ |  | ||||||
|             { |  | ||||||
|                 "id": "graph", |  | ||||||
|                 "name": "Graph", |  | ||||||
|                 "type": "panel", |  | ||||||
|                 "version": "" |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|                 "id": "grafana", |  | ||||||
|                 "name": "Grafana", |  | ||||||
|                 "type": "grafana", |  | ||||||
|                 "version": "3.1.1" |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|                 "id": "prometheus", |  | ||||||
|                 "name": "Prometheus", |  | ||||||
|                 "type": "datasource", |  | ||||||
|                 "version": "1.0.0" |  | ||||||
|             } |  | ||||||
|         ], |  | ||||||
|         "annotations": { |  | ||||||
|             "list": [] |  | ||||||
|         }, |  | ||||||
|         "editable": true, |  | ||||||
|         "gnetId": null, |  | ||||||
|         "hideControls": false, |  | ||||||
|         "id": null, |  | ||||||
|         "links": [], |  | ||||||
|         "rows": [ |  | ||||||
|             { |  | ||||||
|                 "collapse": false, |  | ||||||
|                 "editable": true, |  | ||||||
|                 "height": "250px", |  | ||||||
|                 "panels": [ |  | ||||||
|                     { |  | ||||||
|                         "aliasColors": {}, |  | ||||||
|                         "bars": false, |  | ||||||
|                         "datasource": "${DS_PROMETHEUS}", |  | ||||||
|                         "editable": true, |  | ||||||
|                         "error": false, |  | ||||||
|                         "fill": 1, |  | ||||||
|                         "grid": { |  | ||||||
|                             "threshold1": null, |  | ||||||
|                             "threshold1Color": "rgba(216, 200, 27, 0.27)", |  | ||||||
|                             "threshold2": null, |  | ||||||
|                             "threshold2Color": "rgba(234, 112, 112, 0.22)" |  | ||||||
|                         }, |  | ||||||
|                         "id": 1, |  | ||||||
|                         "isNew": true, |  | ||||||
|                         "legend": { |  | ||||||
|                             "alignAsTable": true, |  | ||||||
|                             "avg": true, |  | ||||||
|                             "current": true, |  | ||||||
|                             "max": false, |  | ||||||
|                             "min": false, |  | ||||||
|                             "rightSide": true, |  | ||||||
|                             "show": true, |  | ||||||
|                             "total": false, |  | ||||||
|                             "values": true |  | ||||||
|                         }, |  | ||||||
|                         "lines": true, |  | ||||||
|                         "linewidth": 2, |  | ||||||
|                         "links": [], |  | ||||||
|                         "nullPointMode": "connected", |  | ||||||
|                         "percentage": false, |  | ||||||
|                         "pointradius": 5, |  | ||||||
|                         "points": false, |  | ||||||
|                         "renderer": "flot", |  | ||||||
|                         "seriesOverrides": [], |  | ||||||
|                         "span": 12, |  | ||||||
|                         "stack": false, |  | ||||||
|                         "steppedLine": false, |  | ||||||
|                         "targets": [ |  | ||||||
|                             { |  | ||||||
|                                 "expr": "sum by(container_name) (container_memory_usage_bytes{pod_name=\"$pod\", container_name=~\"$container\", container_name!=\"POD\"})", |  | ||||||
|                                 "interval": "10s", |  | ||||||
|                                 "intervalFactor": 1, |  | ||||||
|                                 "legendFormat": "Current: {{ container_name }}", |  | ||||||
|                                 "metric": "container_memory_usage_bytes", |  | ||||||
|                                 "refId": "A", |  | ||||||
|                                 "step": 10 |  | ||||||
|                             }, |  | ||||||
|                             { |  | ||||||
|                                 "expr": "kube_pod_container_requested_memory_bytes{pod=\"$pod\", container=~\"$container\"}", |  | ||||||
|                                 "interval": "10s", |  | ||||||
|                                 "intervalFactor": 2, |  | ||||||
|                                 "legendFormat": "Requested: {{ container }}", |  | ||||||
|                                 "metric": "kube_pod_container_requested_memory_bytes", |  | ||||||
|                                 "refId": "B", |  | ||||||
|                                 "step": 20 |  | ||||||
|                             } |  | ||||||
|                         ], |  | ||||||
|                         "timeFrom": null, |  | ||||||
|                         "timeShift": null, |  | ||||||
|                         "title": "Memory Usage", |  | ||||||
|                         "tooltip": { |  | ||||||
|                             "msResolution": true, |  | ||||||
|                             "shared": true, |  | ||||||
|                             "sort": 0, |  | ||||||
|                             "value_type": "cumulative" |  | ||||||
|                         }, |  | ||||||
|                         "type": "graph", |  | ||||||
|                         "xaxis": { |  | ||||||
|                             "show": true |  | ||||||
|                         }, |  | ||||||
|                         "yaxes": [ |  | ||||||
|                             { |  | ||||||
|                                 "format": "bytes", |  | ||||||
|                                 "label": null, |  | ||||||
|                                 "logBase": 1, |  | ||||||
|                                 "max": null, |  | ||||||
|                                 "min": null, |  | ||||||
|                                 "show": true |  | ||||||
|                             }, |  | ||||||
|                             { |  | ||||||
|                                 "format": "short", |  | ||||||
|                                 "label": null, |  | ||||||
|                                 "logBase": 1, |  | ||||||
|                                 "max": null, |  | ||||||
|                                 "min": null, |  | ||||||
|                                 "show": true |  | ||||||
|                             } |  | ||||||
|                         ] |  | ||||||
|                     } |  | ||||||
|                 ], |  | ||||||
|                 "title": "Row" |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|                 "collapse": false, |  | ||||||
|                 "editable": true, |  | ||||||
|                 "height": "250px", |  | ||||||
|                 "panels": [ |  | ||||||
|                     { |  | ||||||
|                         "aliasColors": {}, |  | ||||||
|                         "bars": false, |  | ||||||
|                         "datasource": "${DS_PROMETHEUS}", |  | ||||||
|                         "editable": true, |  | ||||||
|                         "error": false, |  | ||||||
|                         "fill": 1, |  | ||||||
|                         "grid": { |  | ||||||
|                             "threshold1": null, |  | ||||||
|                             "threshold1Color": "rgba(216, 200, 27, 0.27)", |  | ||||||
|                             "threshold2": null, |  | ||||||
|                             "threshold2Color": "rgba(234, 112, 112, 0.22)" |  | ||||||
|                         }, |  | ||||||
|                         "id": 2, |  | ||||||
|                         "isNew": true, |  | ||||||
|                         "legend": { |  | ||||||
|                             "alignAsTable": true, |  | ||||||
|                             "avg": true, |  | ||||||
|                             "current": true, |  | ||||||
|                             "max": false, |  | ||||||
|                             "min": false, |  | ||||||
|                             "rightSide": true, |  | ||||||
|                             "show": true, |  | ||||||
|                             "total": false, |  | ||||||
|                             "values": true |  | ||||||
|                         }, |  | ||||||
|                         "lines": true, |  | ||||||
|                         "linewidth": 2, |  | ||||||
|                         "links": [], |  | ||||||
|                         "nullPointMode": "connected", |  | ||||||
|                         "percentage": false, |  | ||||||
|                         "pointradius": 5, |  | ||||||
|                         "points": false, |  | ||||||
|                         "renderer": "flot", |  | ||||||
|                         "seriesOverrides": [], |  | ||||||
|                         "span": 12, |  | ||||||
|                         "stack": false, |  | ||||||
|                         "steppedLine": false, |  | ||||||
|                         "targets": [ |  | ||||||
|                             { |  | ||||||
|                                 "expr": "sum by (container_name)( rate(container_cpu_usage_seconds_total{image!=\"\",container_name!=\"POD\",pod_name=\"$pod\"}[1m] ) )", |  | ||||||
|                                 "intervalFactor": 2, |  | ||||||
|                                 "legendFormat": "{{ container_name }}", |  | ||||||
|                                 "refId": "A", |  | ||||||
|                                 "step": 30 |  | ||||||
|                             } |  | ||||||
|                         ], |  | ||||||
|                         "timeFrom": null, |  | ||||||
|                         "timeShift": null, |  | ||||||
|                         "title": "CPU Usage", |  | ||||||
|                         "tooltip": { |  | ||||||
|                             "msResolution": true, |  | ||||||
|                             "shared": true, |  | ||||||
|                             "sort": 0, |  | ||||||
|                             "value_type": "cumulative" |  | ||||||
|                         }, |  | ||||||
|                         "type": "graph", |  | ||||||
|                         "xaxis": { |  | ||||||
|                             "show": true |  | ||||||
|                         }, |  | ||||||
|                         "yaxes": [ |  | ||||||
|                             { |  | ||||||
|                                 "format": "short", |  | ||||||
|                                 "label": null, |  | ||||||
|                                 "logBase": 1, |  | ||||||
|                                 "max": null, |  | ||||||
|                                 "min": null, |  | ||||||
|                                 "show": true |  | ||||||
|                             }, |  | ||||||
|                             { |  | ||||||
|                                 "format": "short", |  | ||||||
|                                 "label": null, |  | ||||||
|                                 "logBase": 1, |  | ||||||
|                                 "max": null, |  | ||||||
|                                 "min": null, |  | ||||||
|                                 "show": true |  | ||||||
|                             } |  | ||||||
|                         ] |  | ||||||
|                     } |  | ||||||
|                 ], |  | ||||||
|                 "title": "New row" |  | ||||||
|             }, |  | ||||||
|             { |  | ||||||
|                 "collapse": false, |  | ||||||
|                 "editable": true, |  | ||||||
|                 "height": "250px", |  | ||||||
|                 "panels": [ |  | ||||||
|                     { |  | ||||||
|                         "aliasColors": {}, |  | ||||||
|                         "bars": false, |  | ||||||
|                         "datasource": "${DS_PROMETHEUS}", |  | ||||||
|                         "editable": true, |  | ||||||
|                         "error": false, |  | ||||||
|                         "fill": 1, |  | ||||||
|                         "grid": { |  | ||||||
|                             "threshold1": null, |  | ||||||
|                             "threshold1Color": "rgba(216, 200, 27, 0.27)", |  | ||||||
|                             "threshold2": null, |  | ||||||
|                             "threshold2Color": "rgba(234, 112, 112, 0.22)" |  | ||||||
|                         }, |  | ||||||
|                         "id": 3, |  | ||||||
|                         "isNew": true, |  | ||||||
|                         "legend": { |  | ||||||
|                             "alignAsTable": true, |  | ||||||
|                             "avg": true, |  | ||||||
|                             "current": true, |  | ||||||
|                             "max": false, |  | ||||||
|                             "min": false, |  | ||||||
|                             "rightSide": true, |  | ||||||
|                             "show": true, |  | ||||||
|                             "total": false, |  | ||||||
|                             "values": true |  | ||||||
|                         }, |  | ||||||
|                         "lines": true, |  | ||||||
|                         "linewidth": 2, |  | ||||||
|                         "links": [], |  | ||||||
|                         "nullPointMode": "connected", |  | ||||||
|                         "percentage": false, |  | ||||||
|                         "pointradius": 5, |  | ||||||
|                         "points": false, |  | ||||||
|                         "renderer": "flot", |  | ||||||
|                         "seriesOverrides": [], |  | ||||||
|                         "span": 12, |  | ||||||
|                         "stack": false, |  | ||||||
|                         "steppedLine": false, |  | ||||||
|                         "targets": [ |  | ||||||
|                             { |  | ||||||
|                                 "expr": "sort_desc(sum by (pod_name) (rate (container_network_receive_bytes_total{pod_name=\"$pod\"}[1m]) ))", |  | ||||||
|                                 "intervalFactor": 2, |  | ||||||
|                                 "legendFormat": "{{ pod_name }}", |  | ||||||
|                                 "refId": "A", |  | ||||||
|                                 "step": 30 |  | ||||||
|                             } |  | ||||||
|                         ], |  | ||||||
|                         "timeFrom": null, |  | ||||||
|                         "timeShift": null, |  | ||||||
|                         "title": "Network I/O", |  | ||||||
|                         "tooltip": { |  | ||||||
|                             "msResolution": true, |  | ||||||
|                             "shared": true, |  | ||||||
|                             "sort": 0, |  | ||||||
|                             "value_type": "cumulative" |  | ||||||
|                         }, |  | ||||||
|                         "type": "graph", |  | ||||||
|                         "xaxis": { |  | ||||||
|                             "show": true |  | ||||||
|                         }, |  | ||||||
|                         "yaxes": [ |  | ||||||
|                             { |  | ||||||
|                                 "format": "bytes", |  | ||||||
|                                 "label": null, |  | ||||||
|                                 "logBase": 1, |  | ||||||
|                                 "max": null, |  | ||||||
|                                 "min": null, |  | ||||||
|                                 "show": true |  | ||||||
|                             }, |  | ||||||
|                             { |  | ||||||
|                                 "format": "short", |  | ||||||
|                                 "label": null, |  | ||||||
|                                 "logBase": 1, |  | ||||||
|                                 "max": null, |  | ||||||
|                                 "min": null, |  | ||||||
|                                 "show": true |  | ||||||
|                             } |  | ||||||
|                         ] |  | ||||||
|                     } |  | ||||||
|                 ], |  | ||||||
|                 "title": "New row" |  | ||||||
|             } |  | ||||||
|         ], |  | ||||||
|         "schemaVersion": 12, |  | ||||||
|         "sharedCrosshair": true, |  | ||||||
|         "style": "dark", |  | ||||||
|         "tags": [], |  | ||||||
|         "templating": { |  | ||||||
|             "list": [ |  | ||||||
|                 { |  | ||||||
|                     "allValue": ".*", |  | ||||||
|                     "current": {}, |  | ||||||
|                     "datasource": "${DS_PROMETHEUS}", |  | ||||||
|                     "hide": 0, |  | ||||||
|                     "includeAll": true, |  | ||||||
|                     "label": "Namespace", |  | ||||||
|                     "multi": false, |  | ||||||
|                     "name": "namespace", |  | ||||||
|                     "options": [], |  | ||||||
|                     "query": "label_values(kube_pod_info, namespace)", |  | ||||||
                    "refresh": 1,
                    "regex": "",
                    "type": "query"
                },
                {
                    "current": {},
                    "datasource": "${DS_PROMETHEUS}",
                    "hide": 0,
                    "includeAll": false,
                    "label": "Pod",
                    "multi": false,
                    "name": "pod",
                    "options": [],
                    "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, pod)",
                    "refresh": 1,
                    "regex": "",
                    "type": "query"
                },
                {
                    "allValue": ".*",
                    "current": {},
                    "datasource": "${DS_PROMETHEUS}",
                    "hide": 0,
                    "includeAll": true,
                    "label": "Container",
                    "multi": false,
                    "name": "container",
                    "options": [],
                    "query": "label_values(kube_pod_container_info{namespace=\"$namespace\", pod=\"$pod\"}, container)",
                    "refresh": 1,
                    "regex": "",
                    "type": "query"
                }
            ]
        },
        "time": {
            "from": "now-6h",
            "to": "now"
        },
        "timepicker": {
            "refresh_intervals": [
                "5s",
                "10s",
                "30s",
                "1m",
                "5m",
                "15m",
                "30m",
                "1h",
                "2h",
                "1d"
            ],
            "time_options": [
                "5m",
                "15m",
                "1h",
                "6h",
                "12h",
                "24h",
                "2d",
                "7d",
                "30d"
            ]
        },
        "timezone": "browser",
        "title": "Pods",
        "version": 26
    },
    "inputs": [
        {
            "name": "DS_PROMETHEUS",
            "pluginId": "prometheus",
            "type": "datasource",
            "value": "prometheus"
        }
    ],
    "overwrite": true
}
@ -1,880 +0,0 @@
{
  "dashboard":
{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "4.1.1"
    },
    {
      "type": "panel",
      "id": "graph",
      "name": "Graph",
      "version": ""
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    },
    {
      "type": "panel",
      "id": "singlestat",
      "name": "Singlestat",
      "version": ""
    }
  ],
  "annotations": {
    "list": []
  },
  "description": "Dashboard to get an overview of one server",
  "editable": true,
  "gnetId": 22,
  "graphTooltip": 0,
  "hideControls": false,
  "id": null,
  "links": [],
  "refresh": false,
  "rows": [
    {
      "collapse": false,
      "height": "250px",
      "panels": [
        {
          "alerting": {},
          "aliasColors": {},
          "bars": false,
          "datasource": "${DS_PROMETHEUS}",
          "editable": true,
          "error": false,
          "fill": 1,
          "grid": {},
          "id": 3,
          "legend": {
            "avg": false,
            "current": false,
            "max": false,
            "min": false,
            "show": true,
            "total": false,
            "values": false
          },
          "lines": true,
          "linewidth": 2,
          "links": [],
          "nullPointMode": "connected",
          "percentage": false,
          "pointradius": 5,
          "points": false,
          "renderer": "flot",
          "seriesOverrides": [],
          "span": 6,
          "stack": false,
          "steppedLine": false,
          "targets": [
            {
              "expr": "100 - (avg by (cpu) (irate(node_cpu{mode=\"idle\", instance=\"$server\"}[5m])) * 100)",
              "hide": false,
              "intervalFactor": 10,
              "legendFormat": "{{cpu}}",
              "refId": "A",
              "step": 50
            }
          ],
          "thresholds": [],
          "timeFrom": null,
          "timeShift": null,
          "title": "Idle cpu",
          "tooltip": {
            "msResolution": false,
            "shared": true,
            "sort": 0,
            "value_type": "cumulative"
          },
          "type": "graph",
          "xaxis": {
            "mode": "time",
            "name": null,
            "show": true,
            "values": []
          },
          "yaxes": [
            {
              "format": "percent",
              "label": "cpu usage",
              "logBase": 1,
              "max": 100,
              "min": 0,
              "show": true
            },
            {
              "format": "short",
              "label": null,
              "logBase": 1,
              "max": null,
              "min": null,
              "show": true
            }
          ]
        },
        {
          "alerting": {},
          "aliasColors": {},
          "bars": false,
          "datasource": "${DS_PROMETHEUS}",
          "editable": true,
          "error": false,
          "fill": 1,
          "grid": {},
          "id": 9,
          "legend": {
            "avg": false,
            "current": false,
            "max": false,
            "min": false,
            "show": true,
            "total": false,
            "values": false
          },
          "lines": true,
          "linewidth": 2,
          "links": [],
          "nullPointMode": "connected",
          "percentage": false,
          "pointradius": 5,
          "points": false,
          "renderer": "flot",
          "seriesOverrides": [],
          "span": 6,
          "stack": false,
          "steppedLine": false,
          "targets": [
            {
              "expr": "node_load1{instance=\"$server\"}",
              "intervalFactor": 4,
              "legendFormat": "load 1m",
              "refId": "A",
              "step": 20,
              "target": ""
            },
            {
              "expr": "node_load5{instance=\"$server\"}",
              "intervalFactor": 4,
              "legendFormat": "load 5m",
              "refId": "B",
              "step": 20,
              "target": ""
            },
            {
              "expr": "node_load15{instance=\"$server\"}",
              "intervalFactor": 4,
              "legendFormat": "load 15m",
              "refId": "C",
              "step": 20,
              "target": ""
            }
          ],
          "thresholds": [],
          "timeFrom": null,
          "timeShift": null,
          "title": "System load",
          "tooltip": {
            "msResolution": false,
            "shared": true,
            "sort": 0,
            "value_type": "cumulative"
          },
          "type": "graph",
          "xaxis": {
            "mode": "time",
            "name": null,
            "show": true,
            "values": []
          },
          "yaxes": [
            {
              "format": "percentunit",
              "label": null,
              "logBase": 1,
              "max": null,
              "min": null,
              "show": true
            },
            {
              "format": "short",
              "label": null,
              "logBase": 1,
              "max": null,
              "min": null,
              "show": true
            }
          ]
        }
      ],
      "repeat": null,
      "repeatIteration": null,
      "repeatRowId": null,
      "showTitle": false,
      "title": "New row",
      "titleSize": "h6"
    },
    {
      "collapse": false,
      "height": "250px",
      "panels": [
        {
          "alerting": {},
          "aliasColors": {},
          "bars": false,
          "datasource": "${DS_PROMETHEUS}",
          "editable": true,
          "error": false,
          "fill": 1,
          "grid": {},
          "id": 4,
          "legend": {
            "alignAsTable": false,
            "avg": false,
            "current": false,
            "hideEmpty": false,
            "hideZero": false,
            "max": false,
            "min": false,
            "rightSide": false,
            "show": true,
            "total": false,
            "values": false
          },
          "lines": true,
          "linewidth": 2,
          "links": [],
          "nullPointMode": "connected",
          "percentage": false,
          "pointradius": 5,
          "points": false,
          "renderer": "flot",
          "seriesOverrides": [
            {
              "alias": "node_memory_SwapFree{instance=\"172.17.0.1:9100\",job=\"prometheus\"}",
              "yaxis": 2
            }
          ],
          "span": 9,
          "stack": true,
          "steppedLine": false,
          "targets": [
            {
              "expr": "node_memory_MemTotal{instance=\"$server\"} - node_memory_MemFree{instance=\"$server\"} - node_memory_Buffers{instance=\"$server\"} - node_memory_Cached{instance=\"$server\"}",
              "hide": false,
              "interval": "",
              "intervalFactor": 2,
              "legendFormat": "memory used",
              "metric": "",
              "refId": "C",
              "step": 4
            },
            {
              "expr": "node_memory_Buffers{instance=\"$server\"}",
              "interval": "",
              "intervalFactor": 2,
              "legendFormat": "memory buffers",
              "metric": "",
              "refId": "E",
              "step": 4
            },
            {
              "expr": "node_memory_Cached{instance=\"$server\"}",
              "intervalFactor": 2,
              "legendFormat": "memory cached",
              "metric": "",
              "refId": "F",
              "step": 4
            },
            {
              "expr": "node_memory_MemFree{instance=\"$server\"}",
              "intervalFactor": 2,
              "legendFormat": "memory free",
              "metric": "",
              "refId": "D",
              "step": 4
            }
          ],
          "thresholds": [],
          "timeFrom": null,
          "timeShift": null,
          "title": "Memory usage",
          "tooltip": {
            "msResolution": false,
            "shared": true,
            "sort": 0,
            "value_type": "individual"
          },
          "type": "graph",
          "xaxis": {
            "mode": "time",
            "name": null,
            "show": true,
            "values": []
          },
          "yaxes": [
            {
              "format": "bytes",
              "label": null,
              "logBase": 1,
              "max": null,
              "min": "0",
              "show": true
            },
            {
              "format": "short",
              "label": null,
              "logBase": 1,
              "max": null,
              "min": null,
              "show": true
            }
          ]
        },
        {
          "cacheTimeout": null,
          "colorBackground": false,
          "colorValue": false,
          "colors": [
            "rgba(50, 172, 45, 0.97)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(245, 54, 54, 0.9)"
          ],
          "datasource": "${DS_PROMETHEUS}",
          "editable": true,
          "error": false,
          "format": "percent",
          "gauge": {
            "maxValue": 100,
            "minValue": 0,
            "show": true,
            "thresholdLabels": false,
            "thresholdMarkers": true
          },
          "id": 5,
          "interval": null,
          "links": [],
          "mappingType": 1,
          "mappingTypes": [
            {
              "name": "value to text",
              "value": 1
            },
            {
              "name": "range to text",
              "value": 2
            }
          ],
          "maxDataPoints": 100,
          "nullPointMode": "connected",
          "nullText": null,
          "postfix": "",
          "postfixFontSize": "50%",
          "prefix": "",
          "prefixFontSize": "50%",
          "rangeMaps": [
            {
              "from": "null",
              "text": "N/A",
              "to": "null"
            }
          ],
          "span": 3,
          "sparkline": {
            "fillColor": "rgba(31, 118, 189, 0.18)",
            "full": false,
            "lineColor": "rgb(31, 120, 193)",
            "show": false
          },
          "targets": [
            {
              "expr": "((node_memory_MemTotal{instance=\"$server\"} - node_memory_MemFree{instance=\"$server\"}  - node_memory_Buffers{instance=\"$server\"} - node_memory_Cached{instance=\"$server\"}) / node_memory_MemTotal{instance=\"$server\"}) * 100",
              "intervalFactor": 2,
              "refId": "A",
              "step": 60,
              "target": ""
            }
          ],
          "thresholds": "80, 90",
          "title": "Memory usage",
          "type": "singlestat",
          "valueFontSize": "80%",
          "valueMaps": [
            {
              "op": "=",
              "text": "N/A",
              "value": "null"
            }
          ],
          "valueName": "avg"
        }
      ],
      "repeat": null,
      "repeatIteration": null,
      "repeatRowId": null,
      "showTitle": false,
      "title": "New row",
      "titleSize": "h6"
    },
    {
      "collapse": false,
      "height": "250px",
      "panels": [
        {
          "alerting": {},
          "aliasColors": {},
          "bars": false,
          "datasource": "${DS_PROMETHEUS}",
          "editable": true,
          "error": false,
          "fill": 1,
          "grid": {},
          "id": 6,
          "legend": {
            "avg": false,
            "current": false,
            "max": false,
            "min": false,
            "show": true,
            "total": false,
            "values": false
          },
          "lines": true,
          "linewidth": 2,
          "links": [],
          "nullPointMode": "connected",
          "percentage": false,
          "pointradius": 5,
          "points": false,
          "renderer": "flot",
          "seriesOverrides": [
            {
              "alias": "read",
              "yaxis": 1
            },
            {
              "alias": "{instance=\"172.17.0.1:9100\"}",
              "yaxis": 2
            },
            {
              "alias": "io time",
              "yaxis": 2
            }
          ],
          "span": 9,
          "stack": false,
          "steppedLine": false,
          "targets": [
            {
              "expr": "sum by (instance) (rate(node_disk_bytes_read{instance=\"$server\"}[2m]))",
              "hide": false,
              "intervalFactor": 4,
              "legendFormat": "read",
              "refId": "A",
              "step": 8,
              "target": ""
            },
            {
              "expr": "sum by (instance) (rate(node_disk_bytes_written{instance=\"$server\"}[2m]))",
              "intervalFactor": 4,
              "legendFormat": "written",
              "refId": "B",
              "step": 8
            },
            {
              "expr": "sum by (instance) (rate(node_disk_io_time_ms{instance=\"$server\"}[2m]))",
              "intervalFactor": 4,
              "legendFormat": "io time",
              "refId": "C",
              "step": 8
            }
          ],
          "thresholds": [],
          "timeFrom": null,
          "timeShift": null,
          "title": "Disk I/O",
          "tooltip": {
            "msResolution": false,
            "shared": true,
            "sort": 0,
            "value_type": "cumulative"
          },
          "type": "graph",
          "xaxis": {
            "mode": "time",
            "name": null,
            "show": true,
            "values": []
          },
          "yaxes": [
            {
              "format": "bytes",
              "label": null,
              "logBase": 1,
              "max": null,
              "min": null,
              "show": true
            },
            {
              "format": "ms",
              "label": null,
              "logBase": 1,
              "max": null,
              "min": null,
              "show": true
            }
          ]
        },
        {
          "cacheTimeout": null,
          "colorBackground": false,
          "colorValue": false,
          "colors": [
            "rgba(50, 172, 45, 0.97)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(245, 54, 54, 0.9)"
          ],
          "datasource": "${DS_PROMETHEUS}",
          "editable": true,
          "error": false,
          "format": "percentunit",
          "gauge": {
            "maxValue": 1,
            "minValue": 0,
            "show": true,
            "thresholdLabels": false,
            "thresholdMarkers": true
          },
          "id": 7,
          "interval": null,
          "links": [],
          "mappingType": 1,
          "mappingTypes": [
            {
              "name": "value to text",
              "value": 1
            },
            {
              "name": "range to text",
              "value": 2
            }
          ],
          "maxDataPoints": 100,
          "nullPointMode": "connected",
          "nullText": null,
          "postfix": "",
          "postfixFontSize": "50%",
          "prefix": "",
          "prefixFontSize": "50%",
          "rangeMaps": [
            {
              "from": "null",
              "text": "N/A",
              "to": "null"
            }
          ],
          "span": 3,
          "sparkline": {
            "fillColor": "rgba(31, 118, 189, 0.18)",
            "full": false,
            "lineColor": "rgb(31, 120, 193)",
            "show": false
          },
          "targets": [
            {
              "expr": "(sum(node_filesystem_size{device!=\"rootfs\",instance=\"$server\"}) - sum(node_filesystem_free{device!=\"rootfs\",instance=\"$server\"})) / sum(node_filesystem_size{device!=\"rootfs\",instance=\"$server\"})",
              "intervalFactor": 2,
              "refId": "A",
              "step": 60,
              "target": ""
            }
          ],
          "thresholds": "0.75, 0.9",
          "title": "Disk space usage",
          "type": "singlestat",
          "valueFontSize": "80%",
          "valueMaps": [
            {
              "op": "=",
              "text": "N/A",
              "value": "null"
            }
          ],
          "valueName": "current"
        }
      ],
      "repeat": null,
      "repeatIteration": null,
      "repeatRowId": null,
      "showTitle": false,
      "title": "New row",
      "titleSize": "h6"
    },
    {
      "collapse": false,
      "height": "250px",
      "panels": [
        {
          "alerting": {},
          "aliasColors": {},
          "bars": false,
          "datasource": "${DS_PROMETHEUS}",
          "editable": true,
          "error": false,
          "fill": 1,
          "grid": {},
          "id": 8,
          "legend": {
            "avg": false,
            "current": false,
            "max": false,
            "min": false,
            "show": true,
            "total": false,
            "values": false
          },
          "lines": true,
          "linewidth": 2,
          "links": [],
          "nullPointMode": "connected",
          "percentage": false,
          "pointradius": 5,
          "points": false,
          "renderer": "flot",
          "seriesOverrides": [
            {
              "alias": "transmitted ",
              "yaxis": 2
            }
          ],
          "span": 6,
          "stack": false,
          "steppedLine": false,
          "targets": [
            {
              "expr": "rate(node_network_receive_bytes{instance=\"$server\",device!~\"lo\"}[5m])",
              "hide": false,
              "intervalFactor": 2,
              "legendFormat": "{{device}}",
              "refId": "A",
              "step": 10,
              "target": ""
            }
          ],
          "thresholds": [],
          "timeFrom": null,
          "timeShift": null,
          "title": "Network received",
          "tooltip": {
            "msResolution": false,
            "shared": true,
            "sort": 0,
            "value_type": "cumulative"
          },
          "type": "graph",
          "xaxis": {
            "mode": "time",
            "name": null,
            "show": true,
            "values": []
          },
          "yaxes": [
            {
              "format": "bytes",
              "label": null,
              "logBase": 1,
              "max": null,
              "min": null,
              "show": true
            },
            {
              "format": "bytes",
              "label": null,
              "logBase": 1,
              "max": null,
              "min": null,
              "show": true
            }
          ]
        },
        {
          "alerting": {},
          "aliasColors": {},
          "bars": false,
          "datasource": "${DS_PROMETHEUS}",
          "editable": true,
          "error": false,
          "fill": 1,
          "grid": {},
          "id": 10,
          "legend": {
            "avg": false,
            "current": false,
            "max": false,
            "min": false,
            "show": true,
            "total": false,
            "values": false
          },
          "lines": true,
          "linewidth": 2,
          "links": [],
          "nullPointMode": "connected",
          "percentage": false,
          "pointradius": 5,
          "points": false,
          "renderer": "flot",
          "seriesOverrides": [
            {
              "alias": "transmitted ",
              "yaxis": 2
            }
          ],
          "span": 6,
          "stack": false,
          "steppedLine": false,
          "targets": [
            {
              "expr": "rate(node_network_transmit_bytes{instance=\"$server\",device!~\"lo\"}[5m])",
              "hide": false,
              "intervalFactor": 2,
              "legendFormat": "{{device}}",
              "refId": "B",
              "step": 10,
              "target": ""
            }
          ],
          "thresholds": [],
          "timeFrom": null,
          "timeShift": null,
          "title": "Network transmitted",
          "tooltip": {
            "msResolution": false,
            "shared": true,
            "sort": 0,
            "value_type": "cumulative"
          },
          "type": "graph",
          "xaxis": {
            "mode": "time",
            "name": null,
            "show": true,
            "values": []
          },
          "yaxes": [
            {
              "format": "bytes",
              "label": null,
              "logBase": 1,
              "max": null,
              "min": null,
              "show": true
            },
            {
              "format": "bytes",
              "label": null,
              "logBase": 1,
              "max": null,
              "min": null,
              "show": true
            }
          ]
        }
      ],
      "repeat": null,
      "repeatIteration": null,
      "repeatRowId": null,
      "showTitle": false,
      "title": "New row",
      "titleSize": "h6"
    }
|   ], |  | ||||||
|   "schemaVersion": 14, |  | ||||||
|   "style": "dark", |  | ||||||
|   "tags": [ |  | ||||||
|     "prometheus" |  | ||||||
|   ], |  | ||||||
|   "templating": { |  | ||||||
|     "list": [ |  | ||||||
|       { |  | ||||||
|         "allValue": null, |  | ||||||
|         "current": {}, |  | ||||||
|         "datasource": "${DS_PROMETHEUS}", |  | ||||||
|         "hide": 0, |  | ||||||
|         "includeAll": false, |  | ||||||
|         "label": null, |  | ||||||
|         "multi": false, |  | ||||||
|         "name": "server", |  | ||||||
|         "options": [], |  | ||||||
|         "query": "label_values(node_boot_time, instance)", |  | ||||||
|         "refresh": 1, |  | ||||||
|         "regex": "", |  | ||||||
|         "sort": 0, |  | ||||||
|         "tagValuesQuery": "", |  | ||||||
|         "tags": [], |  | ||||||
|         "tagsQuery": "", |  | ||||||
|         "type": "query", |  | ||||||
|         "useTags": false |  | ||||||
|       } |  | ||||||
|     ] |  | ||||||
|   }, |  | ||||||
|   "time": { |  | ||||||
|     "from": "now-1h", |  | ||||||
|     "to": "now" |  | ||||||
|   }, |  | ||||||
|   "timepicker": { |  | ||||||
|     "refresh_intervals": [ |  | ||||||
|       "5s", |  | ||||||
|       "10s", |  | ||||||
|       "30s", |  | ||||||
|       "1m", |  | ||||||
|       "5m", |  | ||||||
|       "15m", |  | ||||||
|       "30m", |  | ||||||
|       "1h", |  | ||||||
|       "2h", |  | ||||||
|       "1d" |  | ||||||
|     ], |  | ||||||
|     "time_options": [ |  | ||||||
|       "5m", |  | ||||||
|       "15m", |  | ||||||
|       "1h", |  | ||||||
|       "6h", |  | ||||||
|       "12h", |  | ||||||
|       "24h", |  | ||||||
|       "2d", |  | ||||||
|       "7d", |  | ||||||
|       "30d" |  | ||||||
|     ] |  | ||||||
|   }, |  | ||||||
|   "timezone": "browser", |  | ||||||
|   "title": "Nodes", |  | ||||||
|   "version": 1 |  | ||||||
| }, |  | ||||||
|   "inputs": [ |  | ||||||
|     { |  | ||||||
|       "name": "DS_PROMETHEUS", |  | ||||||
|       "pluginId": "prometheus", |  | ||||||
|       "type": "datasource", |  | ||||||
|       "value": "prometheus" |  | ||||||
|     } |  | ||||||
|   ], |  | ||||||
|   "overwrite": true |  | ||||||
| } |  | ||||||
| @ -1,7 +0,0 @@ | |||||||
| { |  | ||||||
|     "access": "proxy", |  | ||||||
|     "basicAuth": false, |  | ||||||
|     "name": "prometheus", |  | ||||||
|     "type": "prometheus", |  | ||||||
|     "url": "http://prometheus-k8s.monitoring.svc:9090" |  | ||||||
| } |  | ||||||
| @ -1,121 +0,0 @@ | |||||||
| ### General cluster availability ### |  | ||||||
| 
 |  | ||||||
| # alert if another failed peer will result in an unavailable cluster |  | ||||||
| ALERT InsufficientPeers |  | ||||||
|   IF count(up{job="etcd-k8s"} == 0) > (count(up{job="etcd-k8s"}) / 2 - 1) |  | ||||||
|   FOR 3m |  | ||||||
|   LABELS { |  | ||||||
|     severity = "critical" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Etcd cluster small", |  | ||||||
|     description = "If one more etcd peer goes down, the cluster will be unavailable", |  | ||||||
|   } |  | ||||||
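The `InsufficientPeers` threshold encodes etcd's majority-quorum rule: an `n`-member cluster stays available only while at least `floor(n/2) + 1` members are up. As a sanity check, the bash sketch below (function names are hypothetical, for illustration only) rewrites the alert condition `down > n/2 - 1` as the integer comparison `2*down > n - 2` and compares it with a direct "would one more failure lose quorum?" test for 3- and 5-member clusters.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the quorum arithmetic behind InsufficientPeers.

# Majority quorum for an n-member cluster: floor(n/2) + 1.
quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Same condition as the alert, down > members/2 - 1, multiplied by 2
# so that it can be evaluated with integer arithmetic only.
insufficient_peers() {
  local members=$1 down=$2
  if (( 2 * down > members - 2 )); then echo firing; else echo ok; fi
}

# Independent check: would one more failed member drop below quorum?
one_more_failure_breaks_quorum() {
  local members=$1 down=$2
  if (( members - down - 1 < $(quorum "$members") )); then echo yes; else echo no; fi
}

for n in 3 5; do
  for d in 0 1 2; do
    echo "members=$n down=$d alert=$(insufficient_peers "$n" "$d") next_failure_breaks_quorum=$(one_more_failure_breaks_quorum "$n" "$d")"
  done
done
```

For every row the two columns agree, which is the property the alert relies on: it fires exactly when the next failure would make the cluster unavailable.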
| 
 |  | ||||||
| ### HTTP requests alerts ### |  | ||||||
| 
 |  | ||||||
| # alert if more than 1% of requests to an HTTP endpoint have failed with a non-4xx response |  | ||||||
| ALERT HighNumberOfFailedHTTPRequests |  | ||||||
|   IF 100 * sum by(instance, method) (rate(etcd_http_failed_total{job="etcd-k8s", code!~"4[0-9]{2}"}[5m])) |  | ||||||
|     / sum by(instance, method) (rate(etcd_http_received_total{job="etcd-k8s"}[5m])) > 1 |  | ||||||
|   FOR 10m |  | ||||||
|   LABELS { |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "a high number of HTTP requests are failing", |  | ||||||
|     description = "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| # alert if more than 5% of requests to an HTTP endpoint have failed with a non-4xx response |  | ||||||
| ALERT HighNumberOfFailedHTTPRequests |  | ||||||
|   IF 100 * sum by(instance, method) (rate(etcd_http_failed_total{job="etcd-k8s", code!~"4[0-9]{2}"}[5m])) |  | ||||||
|     / sum by(instance, method) (rate(etcd_http_received_total{job="etcd-k8s"}[5m])) > 5 |  | ||||||
|   FOR 5m |  | ||||||
|   LABELS { |  | ||||||
|     severity = "critical" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "a high number of HTTP requests are failing", |  | ||||||
|     description = "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| # alert if more than 50% of requests get a 4xx response |  | ||||||
| ALERT HighNumberOfFailedHTTPRequests |  | ||||||
|   IF 100 * sum by(instance, method) (rate(etcd_http_failed_total{job="etcd-k8s", code=~"4[0-9]{2}"}[5m])) |  | ||||||
|     / sum by(instance, method) (rate(etcd_http_received_total{job="etcd-k8s"}[5m])) > 50 |  | ||||||
|   FOR 10m |  | ||||||
|   LABELS { |  | ||||||
|     severity = "critical" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "a high number of HTTP requests are failing", |  | ||||||
|     description = "{{ $value }}% of requests for {{ $labels.method }} failed with 4xx responses on etcd instance {{ $labels.instance }}", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| # alert if the 99th percentile of HTTP requests take more than 150ms |  | ||||||
| ALERT HTTPRequestsSlow |  | ||||||
|   IF histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15 |  | ||||||
|   FOR 10m |  | ||||||
|   LABELS { |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "slow HTTP requests", |  | ||||||
|     description = "{{ $labels.method }} HTTP requests on etcd instance {{ $labels.instance }} are slow", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ### File descriptor alerts ### |  | ||||||
| 
 |  | ||||||
| instance:fd_utilization = process_open_fds / process_max_fds |  | ||||||
| 
 |  | ||||||
| # alert if file descriptors are likely to exhaust within the next 4 hours |  | ||||||
| ALERT FdExhaustionClose |  | ||||||
|   IF predict_linear(instance:fd_utilization[1h], 3600 * 4) > 1 |  | ||||||
|   FOR 10m |  | ||||||
|   LABELS { |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "file descriptors soon exhausted", |  | ||||||
|     description = "{{ $labels.job }} instance {{ $labels.instance }} will exhaust its file descriptors soon", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| # alert if file descriptors are likely to exhaust within the next hour |  | ||||||
| ALERT FdExhaustionClose |  | ||||||
|   IF predict_linear(instance:fd_utilization[10m], 3600) > 1 |  | ||||||
|   FOR 10m |  | ||||||
|   LABELS { |  | ||||||
|     severity = "critical" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "file descriptors soon exhausted", |  | ||||||
|     description = "{{ $labels.job }} instance {{ $labels.instance }} will exhaust its file descriptors soon", |  | ||||||
|   } |  | ||||||
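`predict_linear()` extrapolates a simple linear regression over the range vector. To make the `FdExhaustionClose` conditions concrete, the awk-based sketch below is our own illustration of that idea (not Prometheus's code, and the function name is hypothetical): it fits a least-squares line through `timestamp value` samples, measuring time relative to the evaluation timestamp, and returns the fitted value a given number of seconds ahead.

```shell
#!/usr/bin/env bash
# Illustrative least-squares extrapolation in the spirit of predict_linear().
# stdin: one "unix_timestamp value" pair per line.
# $1: evaluation timestamp; $2: seconds to look ahead.
predict_linear_sketch() {
  awk -v now="$1" -v ahead="$2" '
    {
      x = $1 - now                      # seconds relative to evaluation time
      n++; sx += x; sy += $2; sxy += x * $2; sxx += x * x
    }
    END {
      if (n < 2) { print "need at least two samples" > "/dev/stderr"; exit 1 }
      slope = (sxy - sx * sy / n) / (sxx - sx * sx / n)
      intercept = sy / n - slope * sx / n   # fitted value at x = 0, i.e. "now"
      printf "%.4f\n", intercept + slope * ahead
    }'
}

# Sample data: fd utilization rising from 0.50 to 0.60 over the last hour.
printf '%s\n' "0 0.50" "1800 0.55" "3600 0.60" | predict_linear_sketch 3600 3600
# prints 0.7000
```

With the warning rule's 4-hour horizon (`predict_linear_sketch 3600 14400`) the same samples extrapolate to roughly 1.0, which is where the `> 1` condition starts to matter.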
| 
 |  | ||||||
| ### etcd proposal alerts ### |  | ||||||
| 
 |  | ||||||
| # alert if there are several failed proposals within an hour |  | ||||||
| ALERT HighNumberOfFailedProposals |  | ||||||
|   IF increase(etcd_server_proposal_failed_total{job="etcd-k8s"}[1h]) > 5 |  | ||||||
|   LABELS { |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "a high number of proposals within the etcd cluster are failing", |  | ||||||
|     description = "etcd instance {{ $labels.instance }} has seen {{ $value }} proposal failures within the last hour", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ### etcd disk io latency alerts ### |  | ||||||
| 
 |  | ||||||
| # alert if 99th percentile of fsync durations is higher than 500ms |  | ||||||
| ALERT HighFsyncDurations |  | ||||||
|   IF histogram_quantile(0.99, rate(etcd_wal_fsync_durations_seconds_bucket[5m])) > 0.5 |  | ||||||
|   FOR 10m |  | ||||||
|   LABELS { |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "high fsync durations", |  | ||||||
|     description = "etcd instance {{ $labels.instance }} fsync durations are high", |  | ||||||
|   } |  | ||||||
| @ -1,388 +0,0 @@ | |||||||
| # NOTE: These rules were kindly contributed by the SoundCloud engineering team. |  | ||||||
| 
 |  | ||||||
| ### Container resources ### |  | ||||||
| 
 |  | ||||||
| cluster_namespace_controller_pod_container:spec_memory_limit_bytes = |  | ||||||
|   sum by (cluster,namespace,controller,pod_name,container_name) ( |  | ||||||
|     label_replace( |  | ||||||
|       container_spec_memory_limit_bytes{container_name!=""}, |  | ||||||
|       "controller", "$1", |  | ||||||
|       "pod_name", "^(.*)-[a-z0-9]+" |  | ||||||
|     ) |  | ||||||
|   ) |  | ||||||
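The `controller` label in these recording rules comes from `label_replace` with the regex `^(.*)-[a-z0-9]+`. Because `.*` is greedy, only the last `-suffix` of the pod name is stripped, so pods created by a Deployment are typically grouped by their ReplicaSet name (Deployment name plus pod-template hash) rather than by the Deployment. The sed sketch below (function name hypothetical) reproduces that mapping for single pod names; PromQL anchors the regex at both ends, so sed needs an explicit trailing `$`.

```shell
#!/usr/bin/env bash
# Illustrative only: mimics
#   label_replace(..., "controller", "$1", "pod_name", "^(.*)-[a-z0-9]+")
# for one pod name. One difference: when the regex does not match,
# label_replace leaves the series without a controller label, whereas
# sed simply echoes the input back.
controller_from_pod() {
  printf '%s\n' "$1" | sed -E 's/^(.*)-[a-z0-9]+$/\1/'
}

controller_from_pod "kube-dns-1234567890-ab12c"  # kube-dns-1234567890 (the ReplicaSet)
controller_from_pod "node-exporter-x7k2p"        # node-exporter
controller_from_pod "prometheus-k8s-0"           # prometheus-k8s
```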
| 
 |  | ||||||
| cluster_namespace_controller_pod_container:spec_cpu_shares = |  | ||||||
|   sum by (cluster,namespace,controller,pod_name,container_name) ( |  | ||||||
|     label_replace( |  | ||||||
|       container_spec_cpu_shares{container_name!=""}, |  | ||||||
|       "controller", "$1", |  | ||||||
|       "pod_name", "^(.*)-[a-z0-9]+" |  | ||||||
|     ) |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| cluster_namespace_controller_pod_container:cpu_usage:rate = |  | ||||||
|   sum by (cluster,namespace,controller,pod_name,container_name) ( |  | ||||||
|     label_replace( |  | ||||||
|       irate( |  | ||||||
|         container_cpu_usage_seconds_total{container_name!=""}[5m] |  | ||||||
|       ), |  | ||||||
|       "controller", "$1", |  | ||||||
|       "pod_name", "^(.*)-[a-z0-9]+" |  | ||||||
|     ) |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| cluster_namespace_controller_pod_container:memory_usage:bytes = |  | ||||||
|   sum by (cluster,namespace,controller,pod_name,container_name) ( |  | ||||||
|     label_replace( |  | ||||||
|       container_memory_usage_bytes{container_name!=""}, |  | ||||||
|       "controller", "$1", |  | ||||||
|       "pod_name", "^(.*)-[a-z0-9]+" |  | ||||||
|     ) |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| cluster_namespace_controller_pod_container:memory_working_set:bytes = |  | ||||||
|   sum by (cluster,namespace,controller,pod_name,container_name) ( |  | ||||||
|     label_replace( |  | ||||||
|       container_memory_working_set_bytes{container_name!=""}, |  | ||||||
|       "controller", "$1", |  | ||||||
|       "pod_name", "^(.*)-[a-z0-9]+" |  | ||||||
|     ) |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| cluster_namespace_controller_pod_container:memory_rss:bytes = |  | ||||||
|   sum by (cluster,namespace,controller,pod_name,container_name) ( |  | ||||||
|     label_replace( |  | ||||||
|       container_memory_rss{container_name!=""}, |  | ||||||
|       "controller", "$1", |  | ||||||
|       "pod_name", "^(.*)-[a-z0-9]+" |  | ||||||
|     ) |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| cluster_namespace_controller_pod_container:memory_cache:bytes = |  | ||||||
|   sum by (cluster,namespace,controller,pod_name,container_name) ( |  | ||||||
|     label_replace( |  | ||||||
|       container_memory_cache{container_name!=""}, |  | ||||||
|       "controller", "$1", |  | ||||||
|       "pod_name", "^(.*)-[a-z0-9]+" |  | ||||||
|     ) |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| cluster_namespace_controller_pod_container:disk_usage:bytes = |  | ||||||
|   sum by (cluster,namespace,controller,pod_name,container_name) ( |  | ||||||
|     label_replace( |  | ||||||
|       container_disk_usage_bytes{container_name!=""}, |  | ||||||
|       "controller", "$1", |  | ||||||
|       "pod_name", "^(.*)-[a-z0-9]+" |  | ||||||
|     ) |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| cluster_namespace_controller_pod_container:memory_pagefaults:rate = |  | ||||||
|   sum by (cluster,namespace,controller,pod_name,container_name,scope,type) ( |  | ||||||
|     label_replace( |  | ||||||
|       irate( |  | ||||||
|         container_memory_failures_total{container_name!=""}[5m] |  | ||||||
|       ), |  | ||||||
|       "controller", "$1", |  | ||||||
|       "pod_name", "^(.*)-[a-z0-9]+" |  | ||||||
|     ) |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| cluster_namespace_controller_pod_container:memory_oom:rate = |  | ||||||
|   sum by (cluster,namespace,controller,pod_name,container_name,scope,type) ( |  | ||||||
|     label_replace( |  | ||||||
|       irate( |  | ||||||
|         container_memory_failcnt{container_name!=""}[5m] |  | ||||||
|       ), |  | ||||||
|       "controller", "$1", |  | ||||||
|       "pod_name", "^(.*)-[a-z0-9]+" |  | ||||||
|     ) |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| ### Cluster resources ### |  | ||||||
| 
 |  | ||||||
| cluster:memory_allocation:percent = |  | ||||||
|   100 * sum by (cluster) ( |  | ||||||
|     container_spec_memory_limit_bytes{pod_name!=""} |  | ||||||
|   ) / sum by (cluster) ( |  | ||||||
|     machine_memory_bytes |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| cluster:memory_used:percent = |  | ||||||
|   100 * sum by (cluster) ( |  | ||||||
|     container_memory_usage_bytes{pod_name!=""} |  | ||||||
|   ) / sum by (cluster) ( |  | ||||||
|     machine_memory_bytes |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| cluster:cpu_allocation:percent = |  | ||||||
|   100 * sum by (cluster) ( |  | ||||||
|     container_spec_cpu_shares{pod_name!=""} |  | ||||||
|   ) / sum by (cluster) ( |  | ||||||
|     container_spec_cpu_shares{id="/"} * on(cluster,instance) machine_cpu_cores |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| cluster:node_cpu_use:percent = |  | ||||||
|   100 * sum by (cluster) ( |  | ||||||
|     rate(node_cpu{mode!="idle"}[5m]) |  | ||||||
|   ) / sum by (cluster) ( |  | ||||||
|     machine_cpu_cores |  | ||||||
|   ) |  | ||||||
| 
 |  | ||||||
| ### API latency ### |  | ||||||
| 
 |  | ||||||
| # Raw metrics are in microseconds. Convert to seconds. |  | ||||||
| cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.99"} = |  | ||||||
|   histogram_quantile( |  | ||||||
|     0.99, |  | ||||||
|     sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket) |  | ||||||
|   ) / 1e6 |  | ||||||
| cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.9"} = |  | ||||||
|   histogram_quantile( |  | ||||||
|     0.9, |  | ||||||
|     sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket) |  | ||||||
|   ) / 1e6 |  | ||||||
| cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.5"} = |  | ||||||
|   histogram_quantile( |  | ||||||
|     0.5, |  | ||||||
|     sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket) |  | ||||||
|   ) / 1e6 |  | ||||||
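These rules feed cumulative `_bucket` series to `histogram_quantile()` and then divide by `1e6` because the raw buckets are in microseconds. As a rough aid, the awk-based sketch below is our own illustration (function name hypothetical) of the interpolation Prometheus documents for that function: find the first bucket whose cumulative count reaches `q * total`, interpolate linearly inside it, and fall back to the highest finite bound if the rank lands in the `+Inf` bucket. It assumes positive bucket bounds and `0 < q <= 1`.

```shell
#!/usr/bin/env bash
# Illustrative sketch of histogram_quantile()'s linear interpolation.
# stdin: "upper_bound cumulative_count" lines sorted by bound, "+Inf" last.
# $1: the quantile q.
histogram_quantile_sketch() {
  awk -v q="$1" '
    { le[NR] = $1; cnt[NR] = $2 }
    END {
      n = NR
      rank = q * cnt[n]                   # target position among all observations
      for (i = 1; i < n; i++) if (cnt[i] >= rank) break
      if (i == n) { printf "%.4f\n", le[n-1]; exit }  # rank in +Inf bucket
      lower = (i == 1) ? 0 : le[i-1]
      below = (i == 1) ? 0 : cnt[i-1]
      # assume observations are spread evenly inside the chosen bucket
      printf "%.4f\n", lower + (le[i] - lower) * (rank - below) / (cnt[i] - below)
    }'
}

# Sample microsecond buckets; the median lands in the (100000, 500000] bucket.
printf '%s\n' "100000 10" "500000 60" "1000000 90" "+Inf 100" | histogram_quantile_sketch 0.5
# prints 420000.0000, i.e. 0.42 s after the rules' "/ 1e6"
```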
| 
 |  | ||||||
| ### Scheduling latency ### |  | ||||||
| 
 |  | ||||||
| cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.99"} = |  | ||||||
|   histogram_quantile(0.99,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6 |  | ||||||
| cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.9"} = |  | ||||||
|   histogram_quantile(0.9,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6 |  | ||||||
| cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.5"} = |  | ||||||
|   histogram_quantile(0.5,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6 |  | ||||||
| 
 |  | ||||||
| cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.99"} = |  | ||||||
|   histogram_quantile(0.99,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6 |  | ||||||
| cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.9"} = |  | ||||||
|   histogram_quantile(0.9,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6 |  | ||||||
| cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.5"} = |  | ||||||
|   histogram_quantile(0.5,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6 |  | ||||||
| 
 |  | ||||||
| cluster:scheduler_binding_latency:quantile_seconds{quantile="0.99"} = |  | ||||||
|   histogram_quantile(0.99,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6 |  | ||||||
| cluster:scheduler_binding_latency:quantile_seconds{quantile="0.9"} = |  | ||||||
|   histogram_quantile(0.9,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6 |  | ||||||
| cluster:scheduler_binding_latency:quantile_seconds{quantile="0.5"} = |  | ||||||
|   histogram_quantile(0.5,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6 |  | ||||||
| 
 |  | ||||||
| ALERT K8SNodeDown |  | ||||||
|   IF up{job="kubelet"} == 0 |  | ||||||
|   FOR 1h |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Kubelet cannot be scraped", |  | ||||||
|     description = "Prometheus could not scrape a {{ $labels.job }} for more than one hour", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8SNodeNotReady |  | ||||||
|   IF kube_node_status_ready{condition="true"} == 0 |  | ||||||
|   FOR 1h |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "warning", |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Node status is NotReady", |  | ||||||
|     description = "The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than an hour", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8SManyNodesNotReady |  | ||||||
|   IF |  | ||||||
|     count by (cluster) (kube_node_status_ready{condition="true"} == 0) > 1 |  | ||||||
|     AND |  | ||||||
|       ( |  | ||||||
|         count by (cluster) (kube_node_status_ready{condition="true"} == 0) |  | ||||||
|       / |  | ||||||
|         count by (cluster) (kube_node_status_ready{condition="true"}) |  | ||||||
|       ) > 0.2 |  | ||||||
|   FOR 1m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "critical", |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Many K8s nodes are Not Ready", |  | ||||||
|     description = "{{ $value }} K8s nodes (more than 20% of cluster {{ $labels.cluster }}) are in the NotReady state.", |  | ||||||
|   } |  | ||||||
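`K8SManyNodesNotReady` combines two conditions with `AND`: more than one node must be NotReady, and those nodes must be more than 20% of the cluster. The hypothetical bash helper below evaluates the same predicate for given counts (rewriting `not_ready / total > 0.2` as `5 * not_ready > total` to stay in integer arithmetic); it ignores the rule's `FOR 1m`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the K8SManyNodesNotReady condition for one cluster.
many_nodes_not_ready() {
  local not_ready=$1 total=$2
  if (( not_ready > 1 && 5 * not_ready > total )); then
    echo firing
  else
    echo ok
  fi
}

many_nodes_not_ready 1 2    # ok: 50% NotReady, but only a single node
many_nodes_not_ready 2 10   # ok: exactly 20% is not above the threshold
many_nodes_not_ready 3 10   # firing
```

The first condition keeps one flapping node in a very small cluster from triggering the critical alert; that case is left to `K8SNodeNotReady`.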
| 
 |  | ||||||
| ALERT K8SKubeletNodeExporterDown |  | ||||||
|   IF up{job="node-exporter"} == 0 |  | ||||||
|   FOR 15m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Kubelet node_exporter cannot be scraped", |  | ||||||
|     description = "Prometheus could not scrape a {{ $labels.job }} for more than 15 minutes.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8SKubeletDown |  | ||||||
|   IF absent(up{job="kubelet"}) or count by (cluster) (up{job="kubelet"} == 0) / count by (cluster) (up{job="kubelet"}) > 0.1 |  | ||||||
|   FOR 1h |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "critical" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Many Kubelets cannot be scraped", |  | ||||||
|     description = "Prometheus failed to scrape more than 10% of kubelets, or all Kubelets have disappeared from service discovery.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8SApiserverDown |  | ||||||
|   IF up{job="kubernetes"} == 0 |  | ||||||
|   FOR 15m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "API server unreachable", |  | ||||||
|     description = "An API server could not be scraped.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| # Disable for non HA kubernetes setups. |  | ||||||
| ALERT K8SApiserverDown |  | ||||||
|   IF absent({job="kubernetes"}) or (count by(cluster) (up{job="kubernetes"} == 1) < count by(cluster) (up{job="kubernetes"})) |  | ||||||
|   FOR 5m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "critical" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "API server unreachable", |  | ||||||
|     description = "Prometheus failed to scrape one or more API servers, or all API servers have disappeared from service discovery.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8SSchedulerDown |  | ||||||
|   IF absent(up{job="kube-scheduler"}) or (count by(cluster) (up{job="kube-scheduler"} == 1) == 0) |  | ||||||
|   FOR 5m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "critical", |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Scheduler is down", |  | ||||||
|     description = "There is no running K8S scheduler. New pods are not being assigned to nodes.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8SControllerManagerDown |  | ||||||
|   IF absent(up{job="kube-controller-manager"}) or (count by(cluster) (up{job="kube-controller-manager"} == 1) == 0) |  | ||||||
|   FOR 5m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "critical", |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Controller manager is down", |  | ||||||
|     description = "There is no running K8S controller manager. Deployments and replication controllers are not making progress.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8SConntrackTableFull |  | ||||||
|   IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 50 |  | ||||||
|   FOR 10m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Number of tracked connections is near the limit", |  | ||||||
|     description = "The nf_conntrack table is {{ $value }}% full.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8SConntrackTableFull |  | ||||||
|   IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 90 |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "critical" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Number of tracked connections is near the limit", |  | ||||||
|     description = "The nf_conntrack table is {{ $value }}% full.", |  | ||||||
|   } |  | ||||||
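The two `K8SConntrackTableFull` rules form a warning/critical pair on the same ratio. The hypothetical bash helper below maps `node_nf_conntrack_entries` and `node_nf_conntrack_entries_limit` values to the more severe of the two; it compares `100 * entries > N * limit` instead of computing an integer percentage, so a value such as 50.5% is not rounded down to 50%. The warning rule's `FOR 10m` is not modelled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: most severe K8SConntrackTableFull level for one node.
conntrack_severity() {
  local entries=$1 limit=$2
  if (( 100 * entries > 90 * limit )); then
    echo critical
  elif (( 100 * entries > 50 * limit )); then
    echo warning
  else
    echo ok
  fi
}

conntrack_severity 30000 65536   # ok (about 46% full)
conntrack_severity 40000 65536   # warning (about 61% full)
conntrack_severity 60000 65536   # critical (about 92% full)
```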
| 
 |  | ||||||
| # To catch the conntrack sysctl de-tuning when it happens |  | ||||||
| ALERT K8SConntrackTuningMissing |  | ||||||
|   IF node_nf_conntrack_udp_timeout > 10 |  | ||||||
|   FOR 10m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "warning", |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Node does not have the correct conntrack tunings", |  | ||||||
|     description = "Nodes keep un-setting the correct tunings; investigate when it happens.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8STooManyOpenFiles |  | ||||||
|   IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 50 |  | ||||||
|   FOR 10m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "{{ $labels.job }} has too many open file descriptors", |  | ||||||
|     description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8STooManyOpenFiles |  | ||||||
|   IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 80 |  | ||||||
|   FOR 10m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "critical" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "{{ $labels.job }} has too many open file descriptors", |  | ||||||
|     description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| # Some verbs are excluded because they are expected to be long-lasting: |  | ||||||
| # WATCH and WATCHLIST are long-polls, CONNECT is `kubectl exec`. |  | ||||||
| ALERT K8SApiServerLatency |  | ||||||
|   IF histogram_quantile( |  | ||||||
|       0.99, |  | ||||||
|       sum without (instance,node,resource) (apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH"}) |  | ||||||
|     ) / 1e6 > 1.0 |  | ||||||
|   FOR 10m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Kubernetes apiserver latency is high", |  | ||||||
|     description = "99th percentile latency for {{ $labels.verb }} requests to the kube-apiserver is higher than 1s.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8SApiServerEtcdAccessLatency |  | ||||||
|   IF etcd_request_latencies_summary{quantile="0.99"} / 1e6 > 1.0 |  | ||||||
|   FOR 15m |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "warning" |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Access to etcd is slow", |  | ||||||
|     description = "99th percentile latency for apiserver to access etcd is higher than 1s.", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| ALERT K8SKubeletTooManyPods |  | ||||||
|   IF kubelet_running_pod_count > 100 |  | ||||||
|   LABELS { |  | ||||||
|     service = "k8s", |  | ||||||
|     severity = "warning", |  | ||||||
|   } |  | ||||||
|   ANNOTATIONS { |  | ||||||
|     summary = "Kubelet is close to pod limit", |  | ||||||
|     description = "Kubelet {{$labels.instance}} is running {{$value}} pods, close to the limit of 110", |  | ||||||
|   } |  | ||||||
| 
 |  | ||||||
| @ -1,35 +0,0 @@ | |||||||
| # Adding kube-prometheus to [KOPS](https://github.com/kubernetes/kops) 1.5.x on AWS |  | ||||||
| 
 |  | ||||||
| 
 |  | ||||||
| ## Prerequisites |  | ||||||
| 
 |  | ||||||
| A running Kubernetes cluster created with [KOPS](https://github.com/kubernetes/kops). |  | ||||||
|   |  | ||||||
| These instructions have been tested with **topology=public** on AWS with KOPS 1.5.1 and Kubernetes 1.5.x. |  | ||||||
| 
 |  | ||||||
| ## Open AWS Security Groups |  | ||||||
| 1. Open port 9100 on the masters security group to the nodes security group. |  | ||||||
| 1. Open ports 10250-10252 on the masters security group to the nodes security group. |  | ||||||
| 
 |  | ||||||
| The example script below requires `$AWS_DEFAULT_PROFILE` and [`$NAME`](https://github.com/kubernetes/kops/blob/master/docs/aws.md#prepare-local-environment) to be set: |  | ||||||
| 
 |  | ||||||
| ```bash |  | ||||||
| MASTER_SG=$(aws --profile ${AWS_DEFAULT_PROFILE} ec2 describe-security-groups --filters "Name=tag:Name,Values=masters.$NAME" --query "SecurityGroups[*].GroupId[]" --output=text) |  | ||||||
| NODES_SG=$(aws --profile ${AWS_DEFAULT_PROFILE} ec2 describe-security-groups --filters "Name=tag:Name,Values=nodes.$NAME" --query "SecurityGroups[*].GroupId[]" --output=text) |  | ||||||
| aws --profile ${AWS_DEFAULT_PROFILE} ec2 authorize-security-group-ingress --group-id $MASTER_SG --protocol tcp --port 9100 --source-group $NODES_SG |  | ||||||
| aws --profile ${AWS_DEFAULT_PROFILE} ec2 authorize-security-group-ingress --group-id $MASTER_SG --protocol tcp --port 10250-10252 --source-group $NODES_SG |  | ||||||
| ``` |  | ||||||
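If either variable is empty, the commands above fail in confusing ways (for example, the filter value becomes `masters.` and the security-group IDs come back empty). A small guard such as the hypothetical bash function below can stop the script before the first `aws` call; it relies on bash indirect expansion (`${!var}`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: return non-zero if any named variable is unset or empty.
require_env() {
  local var
  for var in "$@"; do
    if [ -z "${!var:-}" ]; then
      echo "error: \$$var is not set" >&2
      return 1
    fi
  done
}
```

For example, `require_env AWS_DEFAULT_PROFILE NAME || exit 1` placed before the `MASTER_SG=...` line aborts early with a clear message.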
| 
 |  | ||||||
| ## Adding kube-prometheus |  | ||||||
| Follow the instructions in the [README](https://github.com/coreos/kube-prometheus/blob/master/README.md). |  | ||||||
| 
 |  | ||||||
| Example: |  | ||||||
| 
 |  | ||||||
| ```bash |  | ||||||
| git clone -b master https://github.com/coreos/kube-prometheus.git kube-prometheus-temp |  | ||||||
| cd kube-prometheus-temp |  | ||||||
| ./hack/cluster-monitoring/deploy |  | ||||||
| kubectl -n kube-system create -f manifests/k8s/self-hosted/ |  | ||||||
| cd - |  | ||||||
| rm -rf kube-prometheus-temp |  | ||||||
| ``` |  | ||||||
@@ -1,41 +0,0 @@
#!/usr/bin/env bash

if [ -z "${KUBECONFIG}" ]; then
    export KUBECONFIG=~/.kube/config
fi

if [ -z "${NAMESPACE}" ]; then
    NAMESPACE=monitoring
fi

kubectl create namespace "$NAMESPACE"

kctl() {
    kubectl --namespace "$NAMESPACE" "$@"
}

kctl apply -f manifests/prometheus-operator.yaml

# Wait for TPRs to be ready.
printf "Waiting for Operator to register third party objects..."
until kctl get servicemonitor > /dev/null 2>&1; do sleep 1; printf "."; done
until kctl get prometheus > /dev/null 2>&1; do sleep 1; printf "."; done
until kctl get alertmanager > /dev/null 2>&1; do sleep 1; printf "."; done
echo "done!"

kctl apply -f manifests/exporters
kctl apply -f manifests/grafana

kctl apply -f manifests/prometheus/prometheus-k8s-rules.yaml
kctl apply -f manifests/prometheus/prometheus-k8s-service.yaml

kctl apply -f manifests/alertmanager/alertmanager-config.yaml
kctl apply -f manifests/alertmanager/alertmanager-service.yaml

# `kubectl apply` is currently not working for third party resources so we are
# using `kubectl create` here for the time being.
# (https://github.com/kubernetes/kubernetes/issues/29542)
kctl create -f manifests/prometheus/prometheus-k8s-servicemonitors.yaml
kctl create -f manifests/prometheus/prometheus-k8s.yaml
kctl create -f manifests/alertmanager/alertmanager.yaml

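The deploy script waits for the third-party resources with `until` loops that have no upper bound, so an operator that never registers its types leaves the script printing dots forever. A minimal sketch of a bounded alternative, assuming bash; `wait_for` is a hypothetical helper name, not something kube-prometheus ships:

```bash
#!/usr/bin/env bash

# wait_for TIMEOUT CMD [ARGS...]
# Hypothetical helper (not part of kube-prometheus): re-run CMD once per second
# until it exits 0, giving up after roughly TIMEOUT seconds. CMD may also be a
# shell function, such as the deploy script's kctl.
wait_for() {
    local timeout="$1"
    shift
    local elapsed=0
    until "$@" > /dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

In the deploy script each loop could then become, for example, `wait_for 120 kctl get servicemonitor || exit 1`; the 120-second limit is an arbitrary choice for illustration.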
@@ -1,6 +0,0 @@
#!/usr/bin/env bash

hack/cluster-monitoring/deploy

awk 'FNR==1{print "---"}1' manifests/k8s/minikube/*.yaml | sed s/MINIKUBE_IP/`minikube ip`/g | kubectl --namespace=kube-system apply -f -

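The minikube variant builds one multi-document YAML stream: `FNR` restarts at 1 for every input file, so `FNR==1{print "---"}` emits a document separator before each file, the trailing `1` prints every line unchanged, and `sed` then fills in the `MINIKUBE_IP` placeholder used by the controller-manager and scheduler Endpoints manifests. A sketch of the same pipeline as a function, assuming bash; `render_manifests` is a hypothetical name, and the IP is passed as an argument so the sketch runs without minikube:

```bash
#!/usr/bin/env bash

# render_manifests IP FILE...
# Hypothetical helper: concatenate manifests into one multi-document YAML
# stream and replace every MINIKUBE_IP placeholder with IP.
render_manifests() {
    local ip="$1"
    shift
    # FNR is the line number within the current file, so "---" is printed
    # right before the first line of each file; "1" prints every line.
    awk 'FNR==1{print "---"}1' "$@" | sed "s/MINIKUBE_IP/${ip}/g"
}
```

Usage would look like `render_manifests "$(minikube ip)" manifests/k8s/minikube/*.yaml | kubectl --namespace=kube-system apply -f -`.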
@@ -1,6 +0,0 @@
#!/usr/bin/env bash

hack/cluster-monitoring/teardown

kubectl --namespace=kube-system delete -f manifests/k8s/minikube

@@ -1,6 +0,0 @@
#!/usr/bin/env bash

hack/cluster-monitoring/deploy

kubectl --namespace=kube-system apply -f manifests/k8s/self-hosted

@@ -1,6 +0,0 @@
#!/usr/bin/env bash

hack/cluster-monitoring/teardown

kubectl --namespace=kube-system delete -f manifests/k8s/self-hosted

@@ -1,24 +0,0 @@
#!/usr/bin/env bash

if [ -z "${KUBECONFIG}" ]; then
    export KUBECONFIG=~/.kube/config
fi

if [ -z "${NAMESPACE}" ]; then
    NAMESPACE=monitoring
fi

kctl() {
    kubectl --namespace "$NAMESPACE" "$@"
}

kctl delete -f manifests/exporters
kctl delete -f manifests/grafana
kctl delete -f manifests/prometheus
kctl delete -f manifests/alertmanager

# Hack: wait a bit to let the controller delete the deployed Prometheus server.
sleep 5

kctl delete -f manifests/prometheus-operator.yaml

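The teardown script sleeps a fixed five seconds (its own comment calls this a hack) and hopes the operator has removed the Prometheus server by then. A sketch of polling instead, assuming bash; `wait_until_gone` is a hypothetical helper, not part of kube-prometheus:

```bash
#!/usr/bin/env bash

# wait_until_gone TIMEOUT CMD [ARGS...]
# Hypothetical helper: re-run CMD once per second until it prints nothing on
# stdout, giving up after roughly TIMEOUT seconds. stderr is discarded, so a
# CMD that fails without printing anything also counts as "gone".
wait_until_gone() {
    local timeout="$1"
    shift
    local elapsed=0
    while [ -n "$("$@" 2>/dev/null)" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s; still present: $*" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

The `sleep 5` could then become something like `wait_until_gone 60 kctl get pods -l "prometheus=$PROM_NAME" -o name`, where `PROM_NAME` is a hypothetical variable holding the Prometheus object's name. The label key is an assumption based on the `prometheus-frontend` Service selector elsewhere in this diff. Because an unreachable API server also yields empty stdout, a stricter version would check the command's exit status too.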
@@ -1,19 +0,0 @@
#!/usr/bin/env bash

if [ -z "${KUBECONFIG}" ]; then
    KUBECONFIG=~/.kube/config
fi

if [ -z "${NAMESPACE}" ]; then
    NAMESPACE=default
fi

kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" apply -f manifests/examples/example-app/prometheus-frontend-svc.yaml
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" apply -f manifests/examples/example-app/example-app.yaml

# `kubectl apply` is currently not working for third party resources so we are
# using `kubectl create` here for the time being.
# (https://github.com/kubernetes/kubernetes/issues/29542)
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" create -f manifests/examples/example-app/prometheus-frontend.yaml
kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" create -f manifests/examples/example-app/servicemonitor-frontend.yaml

@@ -1,12 +0,0 @@
#!/usr/bin/env bash

if [ -z "${KUBECONFIG}" ]; then
    KUBECONFIG=~/.kube/config
fi

if [ -z "${NAMESPACE}" ]; then
    NAMESPACE=default
fi

kubectl --namespace "$NAMESPACE" --kubeconfig="$KUBECONFIG" delete -f manifests/examples/example-app

@@ -1,8 +0,0 @@
#!/bin/bash

# Generate Alert Rules ConfigMap
kubectl create configmap --dry-run=true prometheus-k8s-rules --from-file=assets/prometheus/rules/ -oyaml > manifests/prometheus/prometheus-k8s-rules.yaml

# Generate Dashboard ConfigMap
kubectl create configmap --dry-run=true grafana-dashboards --from-file=assets/grafana/ -oyaml > manifests/grafana/grafana-dashboards.yaml

@@ -1,50 +0,0 @@
#!/bin/bash -eu

# Intended usage:
#  * Edit the dashboard in Grafana (you need to log in first with the admin/admin
#    login/password).
#  * Save the dashboard in Grafana to check whether the specification is correct.
#    This appears to be the only way to check whether the dashboard specification
#    has errors.
#  * Download the dashboard specification as a JSON file in Grafana:
#    Share -> Export -> Save to file.
#  * Wrap the dashboard specification to make it digestible by kube-prometheus:
#      ./hack/scripts/wrap-dashboard.sh Nodes-1488465802729.json
#  * Replace the dashboard specification:
#      mv Nodes-1488465802729.json assets/grafana/node-dashboard.json
#  * Regenerate the Grafana ConfigMap:
#      ./hack/scripts/generate-configmaps.sh
#  * Apply the new ConfigMap:
#      kubectl -n monitoring apply -f manifests/grafana/grafana-dashboards.yaml

if [ "$#" -ne 1 ]; then
    echo "Usage: $0 path-to-dashboard.json"
    exit 1
fi

json=$1
temp=$(mktemp); chmod 0644 "$temp"

cat >> "$temp" <<EOF
{
  "dashboard":
EOF

cat "$json" >> "$temp"

cat >> "$temp" <<EOF
,
  "inputs": [
    {
      "name": "DS_PROMETHEUS",
      "pluginId": "prometheus",
      "type": "datasource",
      "value": "prometheus"
    }
  ],
  "overwrite": true
}
EOF

mv "$temp" "$json"

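The script rewrites the downloaded export in place and relies on a separate `mv` step to move it under `assets/grafana/`. A sketch of a variant that prints the wrapped dashboard to stdout instead, so it can be redirected straight to its destination while the export stays untouched; it assumes bash, `wrap_dashboard` is a hypothetical name, and the `inputs`/`overwrite` fields are copied verbatim from the script:

```bash
#!/usr/bin/env bash

# wrap_dashboard DASHBOARD.json
# Hypothetical stdout-only variant of wrap-dashboard.sh: emits
# {"dashboard": <file contents>, "inputs": [...], "overwrite": true}.
wrap_dashboard() {
    local json="$1"
    printf '{\n  "dashboard":\n'
    cat "$json"
    cat <<'EOF'
,
  "inputs": [
    {
      "name": "DS_PROMETHEUS",
      "pluginId": "prometheus",
      "type": "datasource",
      "value": "prometheus"
    }
  ],
  "overwrite": true
}
EOF
}
```

For example: `wrap_dashboard Nodes-1488465802729.json > assets/grafana/node-dashboard.json`.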
@@ -1,18 +0,0 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-main
data:
  alertmanager.yaml: |-
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'webhook'
    receivers:
    - name: 'webhook'
      webhook_configs:
      - url: 'http://alertmanagerwh:30500/'
@@ -1,14 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-main
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30903
    port: 9093
    protocol: TCP
    targetPort: web
  selector:
    alertmanager: main
@@ -1,9 +0,0 @@
apiVersion: "monitoring.coreos.com/v1alpha1"
kind: "Alertmanager"
metadata:
  name: "main"
  labels:
    alertmanager: "main"
spec:
  replicas: 3
  version: v0.5.1
@@ -1,28 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 10.142.0.2
    nodeName: 10.142.0.2
  ports:
  - name: api
    port: 2379
    protocol: TCP
@@ -1,28 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 172.17.4.51
    nodeName: 172.17.4.51
  ports:
  - name: api
    port: 2379
    protocol: TCP
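The two etcd manifests differ only in the hard-coded member address (10.142.0.2 in one, 172.17.4.51 in the other): the `etcd-k8s` Service has no selector, so Kubernetes does not populate its Endpoints and the object has to be written by hand. A sketch of generating that Endpoints object for any list of member IPs, assuming bash; `etcd_endpoints_yaml` is a hypothetical helper, and it copies the original manifests' choice of using the IP as `nodeName`:

```bash
#!/usr/bin/env bash

# etcd_endpoints_yaml IP...
# Hypothetical helper: print an Endpoints object named etcd-k8s listing one
# address per etcd member IP, with the same port as the manifests above.
etcd_endpoints_yaml() {
    cat <<'EOF'
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
subsets:
- addresses:
EOF
    local ip
    for ip in "$@"; do
        printf '  - ip: %s\n    nodeName: %s\n' "$ip" "$ip"
    done
    cat <<'EOF'
  ports:
  - name: api
    port: 2379
    protocol: TCP
EOF
}
```

For example, `etcd_endpoints_yaml 10.142.0.2 | kubectl apply -f -`, run in the namespace that holds the matching `etcd-k8s` Service.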
@@ -1,34 +0,0 @@
kind: Service
apiVersion: v1
metadata:
  name: example-app
  labels:
    tier: frontend
spec:
  selector:
    app: example-app
  ports:
  - name: web
    protocol: TCP
    port: 8080
    targetPort: web
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 4
  template:
    metadata:
      labels:
        app: example-app
        version: 1.1.3
    spec:
      containers:
      - name: example-app
        image: quay.io/fabxc/prometheus_demo_service
        ports:
        - name: web
          containerPort: 8080
          protocol: TCP
@@ -1,14 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: prometheus-frontend
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30100
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: prometheus-frontend
@@ -1,24 +0,0 @@
apiVersion: monitoring.coreos.com/v1alpha1
kind: Prometheus
metadata:
  name: prometheus-frontend
  namespace: default
  labels:
    prometheus: frontend
spec:
  version: v1.5.2
  serviceMonitorSelector:
    matchLabels:
      tier: frontend
  resources:
    requests:
      # 2Gi is default, but won't schedule if you don't have a node with >2Gi
      # memory. Modify based on your target and time-series count for
      # production use. This value is mainly meant for demonstration/testing
      # purposes.
      memory: 400Mi
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: alertmanager-main
      port: web
@@ -1,13 +0,0 @@
apiVersion: monitoring.coreos.com/v1alpha1
kind: ServiceMonitor
metadata:
  name: frontend
  labels:
    tier: frontend
spec:
  selector:
    matchLabels:
      tier: frontend
  endpoints:
  - port: web
    interval: 10s
@@ -1,25 +0,0 @@
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-state-metrics
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: gcr.io/google_containers/kube-state-metrics:v0.4.1
        ports:
        - name: metrics
          containerPort: 8080
        resources:
          requests:
            memory: 30Mi
            cpu: 100m
          limits:
            memory: 50Mi
            cpu: 200m

@@ -1,18 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kube-state-metrics
    k8s-app: kube-state-metrics
  annotations:
    alpha.monitoring.coreos.com/non-namespaced: "true"
  name: kube-state-metrics
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: metrics
    protocol: TCP
  selector:
    app: kube-state-metrics

@@ -1,45 +0,0 @@
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  template:
    metadata:
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - image: quay.io/prometheus/node-exporter:v0.13.0
        args:
        - "-collector.procfs=/host/proc"
        - "-collector.sysfs=/host/sys"
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: scrape
        resources:
          requests:
            memory: 30Mi
            cpu: 100m
          limits:
            memory: 50Mi
            cpu: 200m
        volumeMounts:
        - name: proc
          readOnly: true
          mountPath: /host/proc
        - name: sys
          readOnly: true
          mountPath: /host/sys
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys

@@ -1,17 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  labels:
    app: node-exporter
    k8s-app: node-exporter
  name: node-exporter
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 9100
    protocol: TCP
  selector:
    app: node-exporter

File diff suppressed because it is too large
@@ -1,56 +0,0 @@
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:4.1.1
        env:
        - name: GF_AUTH_BASIC_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        volumeMounts:
        - name: grafana-storage
          mountPath: /var/grafana-storage
        ports:
        - name: web
          containerPort: 3000
        resources:
          requests:
            memory: 100Mi
            cpu: 100m
          limits:
            memory: 300Mi
            cpu: 300m
      - name: grafana-watcher
        image: quay.io/coreos/grafana-watcher:latest
        args:
          - '--watch-dir=/var/grafana-dashboards'
          - '--grafana-url=http://admin:admin@localhost:3000'
        volumeMounts:
        - name: grafana-dashboards
          mountPath: /var/grafana-dashboards
        resources:
          requests:
            memory: "16Mi"
            cpu: "50m"
          limits:
            memory: "32Mi"
            cpu: "100m"
        volumeMounts:
        - name: grafana-dashboards
          mountPath: /var/grafana-dashboards
      volumes:
      - name: grafana-storage
        emptyDir: {}
      - name: grafana-dashboards
        configMap:
          name: grafana-dashboards
@@ -1,15 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: grafana
  labels:
    app: grafana
spec:
  type: NodePort
  ports:
  - name: web
    port: 3000
    protocol: TCP
    nodePort: 30902
  selector:
    app: grafana
@@ -1,28 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: kube-controller-manager-prometheus-discovery
  labels:
    k8s-app: kube-controller-manager
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: kube-controller-manager-prometheus-discovery
  labels:
    k8s-app: kube-controller-manager
subsets:
- addresses:
  - ip: MINIKUBE_IP
  ports:
  - name: http-metrics
    port: 10252
    protocol: TCP
@@ -1,28 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: kube-scheduler-prometheus-discovery
  labels:
    k8s-app: kube-scheduler
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: kube-scheduler-prometheus-discovery
  labels:
    k8s-app: kube-scheduler
subsets:
- addresses:
  - ip: MINIKUBE_IP
  ports:
  - name: http-metrics
    port: 10251
    protocol: TCP
@@ -1,16 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: kube-controller-manager-prometheus-discovery
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    k8s-app: kube-controller-manager
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
@@ -1,20 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: kube-dns-prometheus-discovery
  labels:
    k8s-app: kube-dns
spec:
  selector:
    k8s-app: kube-dns
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics-skydns
    port: 10055
    targetPort: 10055
    protocol: TCP
  - name: http-metrics-dnsmasq
    port: 10054
    targetPort: 10054
    protocol: TCP
@@ -1,16 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: kube-scheduler-prometheus-discovery
  labels:
    k8s-app: kube-scheduler
spec:
  selector:
    k8s-app: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
@@ -1,26 +0,0 @@
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-operator
  labels:
    operator: prometheus
spec:
  replicas: 1
  template:
    metadata:
      labels:
        operator: prometheus
    spec:
      containers:
       - name: prometheus-operator
         image: quay.io/coreos/prometheus-operator:v0.6.0
         args:
           - "--kubelet-object=kube-system/kubelet"
           - "--config-reloader-image=quay.io/coreos/configmap-reload:v0.0.1"
         resources:
           requests:
             cpu: 100m
             memory: 50Mi
           limits:
             cpu: 200m
             memory: 300Mi
@@ -1,447 +0,0 @@
apiVersion: v1
data:
  etcd2.rules: "### General cluster availability ###\n\n# alert if another failed
    peer will result in an unavailable cluster\nALERT InsufficientPeers\n  IF count(up{job=\"etcd-k8s\"}
    == 0) > (count(up{job=\"etcd-k8s\"}) / 2 - 1)\n  FOR 3m\n  LABELS {\n    severity
    = \"critical\"\n  }\n  ANNOTATIONS {\n    summary = \"Etcd cluster small\",\n
    \   description = \"If one more etcd peer goes down the cluster will be unavailable\",\n
    \ }\n\n### HTTP requests alerts ###\n\n# alert if more than 1% of requests to
    an HTTP endpoint have failed with a non 4xx response\nALERT HighNumberOfFailedHTTPRequests\n
    \ IF sum by(method) (rate(etcd_http_failed_total{job=\"etcd-k8s\", code!~\"4[0-9]{2}\"}[5m]))\n
    \   / sum by(method) (rate(etcd_http_received_total{job=\"etcd-k8s\"}[5m])) >
    0.01\n  FOR 10m\n  LABELS {\n    severity = \"warning\"\n  }\n  ANNOTATIONS {\n
    \   summary = \"a high number of HTTP requests are failing\",\n    description
    = \"{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance
    {{ $labels.instance }}\",\n  }\n\n# alert if more than 5% of requests to an HTTP
    endpoint have failed with a non 4xx response\nALERT HighNumberOfFailedHTTPRequests\n
    \ IF sum by(method) (rate(etcd_http_failed_total{job=\"etcd-k8s\", code!~\"4[0-9]{2}\"}[5m]))
    \n    / sum by(method) (rate(etcd_http_received_total{job=\"etcd-k8s\"}[5m]))
    > 0.05\n  FOR 5m\n  LABELS {\n    severity = \"critical\"\n  }\n  ANNOTATIONS
    {\n    summary = \"a high number of HTTP requests are failing\",\n    description
    = \"{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance
    {{ $labels.instance }}\",\n  }\n\n# alert if 50% of requests get a 4xx response\nALERT
    HighNumberOfFailedHTTPRequests\n  IF sum by(method) (rate(etcd_http_failed_total{job=\"etcd-k8s\",
    code=~\"4[0-9]{2}\"}[5m]))\n    / sum by(method) (rate(etcd_http_received_total{job=\"etcd-k8s\"}[5m]))
    > 0.5\n  FOR 10m\n  LABELS {\n    severity = \"critical\"\n  }\n  ANNOTATIONS
    {\n    summary = \"a high number of HTTP requests are failing\",\n    description
    = \"{{ $value }}% of requests for {{ $labels.method }} failed with 4xx responses
    on etcd instance {{ $labels.instance }}\",\n  }\n\n# alert if the 99th percentile
    of HTTP requests take more than 150ms\nALERT HTTPRequestsSlow\n  IF histogram_quantile(0.99,
    rate(etcd_http_successful_duration_second_bucket[5m])) > 0.15\n  FOR 10m\n  LABELS
    {\n    severity = \"warning\"\n  }\n  ANNOTATIONS {\n    summary = \"slow HTTP
    requests\",\n    description = \"on etcd instance {{ $labels.instance }} HTTP
    requests to {{ $labels.method }} are slow\",\n  }\n\n### File descriptor alerts
    ###\n\ninstance:fd_utilization = process_open_fds / process_max_fds\n\n# alert
    if file descriptors are likely to exhaust within the next 4 hours\nALERT FdExhaustionClose\n
    \ IF predict_linear(instance:fd_utilization[1h], 3600 * 4) > 1\n  FOR 10m\n  LABELS
    {\n    severity = \"warning\"\n  }\n  ANNOTATIONS {\n    summary = \"file descriptors
    soon exhausted\",\n    description = \"{{ $labels.job }} instance {{ $labels.instance
    }} will exhaust its file descriptors soon\",\n  }\n\n# alert if file descriptors
    are likely to exhaust within the next hour\nALERT FdExhaustionClose\n  IF predict_linear(instance:fd_utilization[10m],
    3600) > 1\n  FOR 10m\n  LABELS {\n    severity = \"critical\"\n  }\n  ANNOTATIONS
    {\n    summary = \"file descriptors soon exhausted\",\n    description = \"{{
    $labels.job }} instance {{ $labels.instance }} will exhaust its file descriptors
    soon\",\n  }\n\n### etcd proposal alerts ###\n\n# alert if there are several failed
    proposals within an hour\nALERT HighNumberOfFailedProposals\n  IF increase(etcd_server_proposal_failed_total{job=\"etcd\"}[1h])
    > 5\n  LABELS {\n    severity = \"warning\"\n  }\n  ANNOTATIONS {\n    summary
    = \"a high number of failed proposals within the etcd cluster are happening\",\n
    \   description = \"etcd instance {{ $labels.instance }} has seen {{ $value }}
    proposal failures within the last hour\",\n  }\n\n### etcd disk io latency alerts
    ###\n\n# alert if 99th percentile of fsync durations is higher than 500ms\nALERT
    HighFsyncDurations\n  IF histogram_quantile(0.99, rate(etcd_wal_fsync_durations_seconds_bucket[5m]))
    > 0.5\n  FOR 10m\n  LABELS {\n    severity = \"warning\"\n  }\n  ANNOTATIONS {\n
    \   summary = \"high fsync durations\",\n    description = \"etcd instance {{
    $labels.instance }} fsync durations are high\",\n  }\n"
  kubernetes.rules: |+
    # NOTE: These rules were kindly contributed by the SoundCloud engineering team.

    ### Container resources ###

    cluster_namespace_controller_pod_container:spec_memory_limit_bytes =
      sum by (cluster,namespace,controller,pod_name,container_name) (
        label_replace(
          container_spec_memory_limit_bytes{container_name!=""},
          "controller", "$1",
          "pod_name", "^(.*)-[a-z0-9]+"
        )
      )

    cluster_namespace_controller_pod_container:spec_cpu_shares =
      sum by (cluster,namespace,controller,pod_name,container_name) (
        label_replace(
          container_spec_cpu_shares{container_name!=""},
          "controller", "$1",
          "pod_name", "^(.*)-[a-z0-9]+"
        )
      )

    cluster_namespace_controller_pod_container:cpu_usage:rate =
      sum by (cluster,namespace,controller,pod_name,container_name) (
        label_replace(
          irate(
            container_cpu_usage_seconds_total{container_name!=""}[5m]
          ),
          "controller", "$1",
          "pod_name", "^(.*)-[a-z0-9]+"
        )
      )

    cluster_namespace_controller_pod_container:memory_usage:bytes =
      sum by (cluster,namespace,controller,pod_name,container_name) (
        label_replace(
          container_memory_usage_bytes{container_name!=""},
          "controller", "$1",
          "pod_name", "^(.*)-[a-z0-9]+"
        )
      )

    cluster_namespace_controller_pod_container:memory_working_set:bytes =
      sum by (cluster,namespace,controller,pod_name,container_name) (
        label_replace(
          container_memory_working_set_bytes{container_name!=""},
          "controller", "$1",
          "pod_name", "^(.*)-[a-z0-9]+"
        )
      )

    cluster_namespace_controller_pod_container:memory_rss:bytes =
      sum by (cluster,namespace,controller,pod_name,container_name) (
        label_replace(
          container_memory_rss{container_name!=""},
          "controller", "$1",
          "pod_name", "^(.*)-[a-z0-9]+"
        )
      )

    cluster_namespace_controller_pod_container:memory_cache:bytes =
      sum by (cluster,namespace,controller,pod_name,container_name) (
        label_replace(
          container_memory_cache{container_name!=""},
          "controller", "$1",
          "pod_name", "^(.*)-[a-z0-9]+"
        )
      )

    cluster_namespace_controller_pod_container:disk_usage:bytes =
      sum by (cluster,namespace,controller,pod_name,container_name) (
        label_replace(
          container_disk_usage_bytes{container_name!=""},
          "controller", "$1",
          "pod_name", "^(.*)-[a-z0-9]+"
        )
      )

    cluster_namespace_controller_pod_container:memory_pagefaults:rate =
      sum by (cluster,namespace,controller,pod_name,container_name,scope,type) (
        label_replace(
          irate(
            container_memory_failures_total{container_name!=""}[5m]
          ),
          "controller", "$1",
          "pod_name", "^(.*)-[a-z0-9]+"
        )
      )

    cluster_namespace_controller_pod_container:memory_oom:rate =
      sum by (cluster,namespace,controller,pod_name,container_name,scope,type) (
        label_replace(
          irate(
            container_memory_failcnt{container_name!=""}[5m]
          ),
          "controller", "$1",
          "pod_name", "^(.*)-[a-z0-9]+"
        )
      )

    ### Cluster resources ###

    cluster:memory_allocation:percent =
      100 * sum by (cluster) (
        container_spec_memory_limit_bytes{pod_name!=""}
|       ) / sum by (cluster) ( |  | ||||||
|         machine_memory_bytes |  | ||||||
|       ) |  | ||||||
| 
 |  | ||||||
|     cluster:memory_used:percent = |  | ||||||
|       100 * sum by (cluster) ( |  | ||||||
|         container_memory_usage_bytes{pod_name!=""} |  | ||||||
|       ) / sum by (cluster) ( |  | ||||||
|         machine_memory_bytes |  | ||||||
|       ) |  | ||||||
| 
 |  | ||||||
|     cluster:cpu_allocation:percent = |  | ||||||
|       100 * sum by (cluster) ( |  | ||||||
|         container_spec_cpu_shares{pod_name!=""} |  | ||||||
|       ) / sum by (cluster) ( |  | ||||||
|         container_spec_cpu_shares{id="/"} * on(cluster,instance) machine_cpu_cores |  | ||||||
|       ) |  | ||||||
| 
 |  | ||||||
|     cluster:node_cpu_use:percent = |  | ||||||
|       100 * sum by (cluster) ( |  | ||||||
|         rate(node_cpu{mode!="idle"}[5m]) |  | ||||||
|       ) / sum by (cluster) ( |  | ||||||
|         machine_cpu_cores |  | ||||||
|       ) |  | ||||||
| 
 |  | ||||||
|     ### API latency ### |  | ||||||
| 
 |  | ||||||
|     # Raw metrics are in microseconds. Convert to seconds. |  | ||||||
|     cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.99"} = |  | ||||||
|       histogram_quantile( |  | ||||||
|         0.99, |  | ||||||
|         sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket) |  | ||||||
|       ) / 1e6 |  | ||||||
|     cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.9"} = |  | ||||||
|       histogram_quantile( |  | ||||||
|         0.9, |  | ||||||
|         sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket) |  | ||||||
|       ) / 1e6 |  | ||||||
|     cluster_resource_verb:apiserver_latency:quantile_seconds{quantile="0.5"} = |  | ||||||
|       histogram_quantile( |  | ||||||
|         0.5, |  | ||||||
|         sum by(le,cluster,job,resource,verb) (apiserver_request_latencies_bucket) |  | ||||||
|       ) / 1e6 |  | ||||||
| 
 |  | ||||||
|     ### Scheduling latency ### |  | ||||||
| 
 |  | ||||||
|     cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.99"} = |  | ||||||
|       histogram_quantile(0.99,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6 |  | ||||||
|     cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.9"} = |  | ||||||
|       histogram_quantile(0.9,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6 |  | ||||||
|     cluster:scheduler_e2e_scheduling_latency:quantile_seconds{quantile="0.5"} = |  | ||||||
|       histogram_quantile(0.5,sum by (le,cluster) (scheduler_e2e_scheduling_latency_microseconds_bucket)) / 1e6 |  | ||||||
| 
 |  | ||||||
|     cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.99"} = |  | ||||||
|       histogram_quantile(0.99,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6 |  | ||||||
|     cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.9"} = |  | ||||||
|       histogram_quantile(0.9,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6 |  | ||||||
|     cluster:scheduler_scheduling_algorithm_latency:quantile_seconds{quantile="0.5"} = |  | ||||||
|       histogram_quantile(0.5,sum by (le,cluster) (scheduler_scheduling_algorithm_latency_microseconds_bucket)) / 1e6 |  | ||||||
| 
 |  | ||||||
|     cluster:scheduler_binding_latency:quantile_seconds{quantile="0.99"} = |  | ||||||
|       histogram_quantile(0.99,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6 |  | ||||||
|     cluster:scheduler_binding_latency:quantile_seconds{quantile="0.9"} = |  | ||||||
|       histogram_quantile(0.9,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6 |  | ||||||
|     cluster:scheduler_binding_latency:quantile_seconds{quantile="0.5"} = |  | ||||||
|       histogram_quantile(0.5,sum by (le,cluster) (scheduler_binding_latency_microseconds_bucket)) / 1e6 |  | ||||||
| 
 |  | ||||||
|     ALERT K8SNodeDown |  | ||||||
|       IF up{job="kubelet"} == 0 |  | ||||||
|       FOR 1h |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "warning" |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Kubelet cannot be scraped", |  | ||||||
|         description = "Prometheus could not scrape a {{ $labels.job }} for more than one hour", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8SNodeNotReady |  | ||||||
|       IF kube_node_status_ready{condition="true"} == 0 |  | ||||||
|       FOR 1h |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "warning", |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Node status is NotReady", |  | ||||||
|         description = "The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than an hour", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8SManyNodesNotReady |  | ||||||
|       IF |  | ||||||
|         count by (cluster) (kube_node_status_ready{condition="true"} == 0) > 1 |  | ||||||
|         AND |  | ||||||
|           ( |  | ||||||
|             count by (cluster) (kube_node_status_ready{condition="true"} == 0) |  | ||||||
|           / |  | ||||||
|             count by (cluster) (kube_node_status_ready{condition="true"}) |  | ||||||
|           ) > 0.2 |  | ||||||
|       FOR 1m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "critical", |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Many K8s nodes are Not Ready", |  | ||||||
|         description = "{{ $value }} K8s nodes (more than 20% of cluster {{ $labels.cluster }}) are in the NotReady state.", |  |
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8SKubeletNodeExporterDown |  | ||||||
|       IF up{job="node-exporter"} == 0 |  | ||||||
|       FOR 15m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "warning" |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Kubelet node_exporter cannot be scraped", |  | ||||||
|         description = "Prometheus could not scrape a {{ $labels.job }} for more than 15 minutes.", |  |
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8SKubeletDown |  | ||||||
|       IF absent(up{job="kubelet"}) or count by (cluster) (up{job="kubelet"} == 0) / count by (cluster) (up{job="kubelet"}) > 0.1 |  | ||||||
|       FOR 1h |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "critical" |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Many Kubelets cannot be scraped", |  | ||||||
|         description = "Prometheus failed to scrape more than 10% of kubelets, or all Kubelets have disappeared from service discovery.", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8SApiserverDown |  | ||||||
|       IF up{job="kubernetes"} == 0 |  | ||||||
|       FOR 15m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "warning" |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "API server unreachable", |  | ||||||
|         description = "An API server could not be scraped.", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     # Disable for non-HA Kubernetes setups. |  |
|     ALERT K8SApiserverDown |  | ||||||
|       IF absent({job="kubernetes"}) or (count by(cluster) (up{job="kubernetes"} == 1) < count by(cluster) (up{job="kubernetes"})) |  | ||||||
|       FOR 5m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "critical" |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "API server unreachable", |  | ||||||
|         description = "Prometheus failed to scrape multiple API servers, or all API servers have disappeared from service discovery.", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8SSchedulerDown |  | ||||||
|       IF absent(up{job="kube-scheduler"}) or (count by(cluster) (up{job="kube-scheduler"} == 1) == 0) |  | ||||||
|       FOR 5m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "critical", |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Scheduler is down", |  | ||||||
|         description = "There is no running K8S scheduler. New pods are not being assigned to nodes.", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8SControllerManagerDown |  | ||||||
|       IF absent(up{job="kube-controller-manager"}) or (count by(cluster) (up{job="kube-controller-manager"} == 1) == 0) |  | ||||||
|       FOR 5m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "critical", |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Controller manager is down", |  | ||||||
|         description = "There is no running K8S controller manager. Deployments and replication controllers are not making progress.", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8SConntrackTableFull |  | ||||||
|       IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 50 |  | ||||||
|       FOR 10m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "warning" |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Number of tracked connections is near the limit", |  | ||||||
|         description = "The nf_conntrack table is {{ $value }}% full.", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8SConntrackTableFull |  | ||||||
|       IF 100*node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 90 |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "critical" |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Number of tracked connections is near the limit", |  | ||||||
|         description = "The nf_conntrack table is {{ $value }}% full.", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     # To catch the conntrack sysctl de-tuning when it happens |  | ||||||
|     ALERT K8SConntrackTuningMissing |  | ||||||
|       IF node_nf_conntrack_udp_timeout > 10 |  | ||||||
|       FOR 10m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "warning", |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Node does not have the correct conntrack tunings", |  | ||||||
|         description = "Nodes keep un-setting the correct tunings, investigate when it happens.", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8STooManyOpenFiles |  | ||||||
|       IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 50 |  | ||||||
|       FOR 10m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "warning" |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "{{ $labels.job }} has too many open file descriptors", |  | ||||||
|         description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8STooManyOpenFiles |  | ||||||
|       IF 100*process_open_fds{job=~"kubelet|kubernetes"} / process_max_fds > 80 |  | ||||||
|       FOR 10m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "critical" |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "{{ $labels.job }} has too many open file descriptors", |  | ||||||
|         description = "{{ $labels.node }} is using {{ $value }}% of the available file/socket descriptors.", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     # Some verbs excluded because they are expected to be long-lasting: |  | ||||||
|     # WATCHLIST is long-poll, CONNECT is `kubectl exec`. |  | ||||||
|     ALERT K8SApiServerLatency |  | ||||||
|       IF histogram_quantile( |  | ||||||
|           0.99, |  | ||||||
|           sum without (instance,node,resource) (apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH"}) |  | ||||||
|         ) / 1e6 > 1.0 |  | ||||||
|       FOR 10m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "warning" |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Kubernetes apiserver latency is high", |  | ||||||
|         description = "99th percentile latency for {{ $labels.verb }} requests to the kube-apiserver is higher than 1s.", |  |
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8SApiServerEtcdAccessLatency |  | ||||||
|       IF etcd_request_latencies_summary{quantile="0.99"} / 1e6 > 1.0 |  | ||||||
|       FOR 15m |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "warning" |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Access to etcd is slow", |  | ||||||
|         description = "99th percentile latency for apiserver to access etcd is higher than 1s.", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
|     ALERT K8SKubeletTooManyPods |  | ||||||
|       IF kubelet_running_pod_count > 100 |  | ||||||
|       LABELS { |  | ||||||
|         service = "k8s", |  | ||||||
|         severity = "warning", |  | ||||||
|       } |  | ||||||
|       ANNOTATIONS { |  | ||||||
|         summary = "Kubelet is close to pod limit", |  | ||||||
|         description = "Kubelet {{$labels.instance}} is running {{$value}} pods, close to the limit of 110", |  | ||||||
|       } |  | ||||||
| 
 |  | ||||||
| kind: ConfigMap |  | ||||||
| metadata: |  | ||||||
|   creationTimestamp: null |  | ||||||
|   name: prometheus-k8s-rules |  | ||||||
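Most of the container recording rules in the ConfigMap above derive a `controller` label by running `label_replace` over `pod_name` with the regex `^(.*)-[a-z0-9]+`. The Python sketch below mimics that single step so the result can be checked against concrete pod names. It assumes PromQL's behaviour of anchoring the regex at both ends (modelled with `re.fullmatch`) and of returning a series unchanged when the regex does not match; the function name `controller_label` and the sample pod names are illustrative only.

```python
import re
from typing import Optional

# Regex passed to label_replace(..., "controller", "$1", "pod_name", ...)
# by the container recording rules in the ConfigMap.
POD_TO_CONTROLLER = re.compile(r"^(.*)-[a-z0-9]+")


def controller_label(pod_name: str) -> Optional[str]:
    """Return the value label_replace would write into `controller`.

    PromQL anchors the regex at both ends, so fullmatch is used here.
    None means the regex did not match, so PromQL would return the
    series unchanged (without a `controller` label).
    """
    match = POD_TO_CONTROLLER.fullmatch(pod_name)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Prints:
    # nginx-3137573019-xk4v2 -> nginx-3137573019
    # prometheus-k8s-0 -> prometheus-k8s
    # etcd -> None
    for pod in ("nginx-3137573019-xk4v2", "prometheus-k8s-0", "etcd"):
        print(pod, "->", controller_label(pod))
```

Because `(.*)` is greedy, only the last hyphen-separated segment is stripped. For a typical Deployment pod such as `nginx-3137573019-xk4v2` the derived value is `nginx-3137573019`, which is the ReplicaSet name rather than the Deployment name, while StatefulSet and DaemonSet pods map directly to their owning object.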
| @ -1,14 +0,0 @@ | |||||||
| apiVersion: v1 |  | ||||||
| kind: Service |  | ||||||
| metadata: |  | ||||||
|   name: prometheus-k8s |  | ||||||
| spec: |  | ||||||
|   type: NodePort |  | ||||||
|   ports: |  | ||||||
|   - name: web |  | ||||||
|     nodePort: 30900 |  | ||||||
|     port: 9090 |  | ||||||
|     protocol: TCP |  | ||||||
|     targetPort: web |  | ||||||
|   selector: |  | ||||||
|     prometheus: k8s |  | ||||||
| @ -1,69 +0,0 @@ | |||||||
| apiVersion: monitoring.coreos.com/v1alpha1 |  | ||||||
| kind: ServiceMonitor |  | ||||||
| metadata: |  | ||||||
|   name: kube-apiserver |  | ||||||
|   labels: |  | ||||||
|     k8s-apps: https |  | ||||||
| spec: |  | ||||||
|   jobLabel: provider |  | ||||||
|   selector: |  | ||||||
|     matchLabels: |  | ||||||
|       component: apiserver |  | ||||||
|       provider: kubernetes |  | ||||||
|   namespaceSelector: |  | ||||||
|     matchNames: |  | ||||||
|     - default |  | ||||||
|   endpoints: |  | ||||||
|   - port: https |  | ||||||
|     interval: 15s |  | ||||||
|     scheme: https |  | ||||||
|     tlsConfig: |  | ||||||
|       caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt |  | ||||||
|       serverName: kubernetes |  | ||||||
|     bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token |  | ||||||
| --- |  | ||||||
| apiVersion: monitoring.coreos.com/v1alpha1 |  | ||||||
| kind: ServiceMonitor |  | ||||||
| metadata: |  | ||||||
|   name: k8s-apps-https |  | ||||||
|   labels: |  | ||||||
|     k8s-apps: https |  | ||||||
| spec: |  | ||||||
|   jobLabel: k8s-app |  | ||||||
|   selector: |  | ||||||
|     matchExpressions: |  | ||||||
|     - {key: k8s-app, operator: Exists} |  | ||||||
|   namespaceSelector: |  | ||||||
|     matchNames: |  | ||||||
|     - kube-system |  | ||||||
|   endpoints: |  | ||||||
|   - port: https-metrics |  | ||||||
|     interval: 15s |  | ||||||
|     scheme: https |  | ||||||
|     tlsConfig: |  | ||||||
|       caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt |  | ||||||
|       insecureSkipVerify: true |  | ||||||
|     bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token |  | ||||||
| --- |  | ||||||
| apiVersion: monitoring.coreos.com/v1alpha1 |  | ||||||
| kind: ServiceMonitor |  | ||||||
| metadata: |  | ||||||
|   name: k8s-apps-http |  | ||||||
|   labels: |  | ||||||
|     k8s-apps: http |  | ||||||
| spec: |  | ||||||
|   jobLabel: k8s-app |  | ||||||
|   selector: |  | ||||||
|     matchExpressions: |  | ||||||
|     - {key: k8s-app, operator: Exists} |  | ||||||
|   namespaceSelector: |  | ||||||
|     matchNames: |  | ||||||
|     - kube-system |  | ||||||
|     - monitoring |  | ||||||
|   endpoints: |  | ||||||
|   - port: http-metrics |  | ||||||
|     interval: 15s |  | ||||||
|   - port: http-metrics-dnsmasq |  | ||||||
|     interval: 15s |  | ||||||
|   - port: http-metrics-skydns |  | ||||||
|     interval: 15s |  | ||||||
| @ -1,24 +0,0 @@ | |||||||
| apiVersion: monitoring.coreos.com/v1alpha1 |  | ||||||
| kind: Prometheus |  | ||||||
| metadata: |  | ||||||
|   name: k8s |  | ||||||
|   labels: |  | ||||||
|     prometheus: k8s |  | ||||||
| spec: |  | ||||||
|   replicas: 2 |  | ||||||
|   version: v1.5.2 |  | ||||||
|   serviceMonitorSelector: |  | ||||||
|     matchExpressions: |  |
|     - {key: k8s-apps, operator: Exists} |  | ||||||
|   resources: |  | ||||||
|     requests: |  | ||||||
|       # 2Gi is default, but won't schedule if you don't have a node with >2Gi |  | ||||||
|       # memory. Modify based on your target and time-series count for |  | ||||||
|       # production use. This value is mainly meant for demonstration/testing |  | ||||||
|       # purposes. |  | ||||||
|       memory: 400Mi |  | ||||||
|   alerting: |  | ||||||
|     alertmanagers: |  | ||||||
|     - namespace: monitoring |  | ||||||
|       name: alertmanager-main |  | ||||||
|       port: web |  | ||||||
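The ServiceMonitors in this diff carry a `k8s-apps` label, and the `Prometheus` resource is meant to pick them up through a `serviceMonitorSelector` whose only expression is `{key: k8s-apps, operator: Exists}`. Kubernetes label selectors AND their expressions together, and `Exists` only checks that the key is present. The sketch below models just that subset in Python to show which of the three ServiceMonitors would be selected; the helper names are hypothetical, and other operators such as `In` or `NotIn` are deliberately left out.

```python
from typing import Dict, List, Mapping, Sequence

# Labels copied from the three ServiceMonitor manifests in this diff.
SERVICE_MONITORS: Dict[str, Dict[str, str]] = {
    "kube-apiserver": {"k8s-apps": "https"},
    "k8s-apps-https": {"k8s-apps": "https"},
    "k8s-apps-http": {"k8s-apps": "http"},
}


def matches_exists(labels: Mapping[str, str], keys: Sequence[str]) -> bool:
    """True if every key from an `operator: Exists` expression is present.

    Expressions are ANDed; an empty key list matches everything, as an
    empty Kubernetes label selector does.
    """
    return all(key in labels for key in keys)


def selected(keys: Sequence[str], monitors: Mapping[str, Mapping[str, str]]) -> List[str]:
    """Sorted names of the monitors a selector built from `keys` would pick."""
    return sorted(name for name, labels in monitors.items() if matches_exists(labels, keys))


if __name__ == "__main__":
    # Prints: ['k8s-apps-http', 'k8s-apps-https', 'kube-apiserver']
    print(selected(["k8s-apps"], SERVICE_MONITORS))
```

Selecting on key presence rather than a specific value lets one `Prometheus` resource pick up both the `https` and `http` ServiceMonitors without listing each value.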