---
title: Unit testing for rules
sort_rank: 6
---

You can use `promtool` to test your rules.

```shell
# For a single test file.
./promtool test rules test.yml

# If you have multiple test files, say test1.yml, test2.yml, test3.yml
./promtool test rules test1.yml test2.yml test3.yml
```

## Test file format

```yaml
# This is a list of rule files to consider for testing. Globs are supported.
rule_files:
  [ - <file_name> ]

[ evaluation_interval: <duration> | default = 1m ]

# Setting fuzzy_compare to true very slightly weakens floating point comparisons:
# differences in the last bit of the mantissa are (effectively) ignored.
[ fuzzy_compare: <boolean> | default = false ]

# The order in which group names are listed below will be the order of evaluation of
# rule groups (at a given evaluation time). The order is guaranteed only for the
# groups mentioned below; not all groups need to be listed.
group_eval_order:
  [ - <group_name> ]

# All the tests are listed here.
tests:
  [ - <test_group> ]
```

### `<test_group>`

```yaml
# Series data.
[ interval: <duration> | default = evaluation_interval ]
input_series:
  [ - <series> ]

# Name of the test group.
[ name: <string> ]

# Unit tests for the above data.

# Unit tests for alerting rules. The alerting rules are taken from the input rule files.
alert_rule_test:
  [ - <alert_test_case> ]

# Unit tests for PromQL expressions.
promql_expr_test:
  [ - <promql_test_case> ]

# External labels accessible to the alert template.
external_labels:
  [ <labelname>: <string> ... ]

# External URL accessible to the alert template.
# Usually set using --web.external-url.
[ external_url: <string> ]
```

### `<series>`

```yaml
# This follows the usual series notation '<metric name>{<label name>=<label value>, ...}'
# Examples:
#      series_name{label1="value1", label2="value2"}
#      go_goroutines{job="prometheus", instance="localhost:9090"}
series: <string>

# This uses expanding notation.
# Expanding notation:
#     'a+bxn' becomes 'a a+b a+(2*b) a+(3*b) … a+(n*b)'
#     Read this as: the series starts at a, then n further samples incrementing by b.
#     'a-bxn' becomes 'a a-b a-(2*b) a-(3*b) … a-(n*b)'
#     Read this as: the series starts at a, then n further samples decrementing by b (or incrementing by negative b).
#     'axn' becomes 'a a a … a' (a repeated n+1 times) - it's a shorthand for 'a+0xn'
# There are special values to indicate missing and stale samples:
#     '_' represents a missing sample from a scrape
#     'stale' indicates a stale sample
# Examples:
#     1. '-2+4x3' becomes '-2 2 6 10' - series starts at -2, then 3 further samples incrementing by 4.
#     2. ' 1-2x4' becomes '1 -1 -3 -5 -7' - series starts at 1, then 4 further samples decrementing by 2.
#     3. ' 1x4' becomes '1 1 1 1 1' - shorthand for '1+0x4', series starts at 1, then 4 further samples incrementing by 0.
#     4. ' 1 _x3 stale' becomes '1 _ _ _ stale' - the missing sample cannot increment, so 3 missing samples are produced by the '_x3' expression.
#
# Native histogram notation:
# Native histograms can be used instead of floating point numbers using the following notation:
#     {{schema:1 sum:-0.3 count:3.1 z_bucket:7.1 z_bucket_w:0.05 buckets:[5.1 10 7] offset:-3 n_buckets:[4.1 5] n_offset:-5 counter_reset_hint:gauge}}
# Native histograms support the same expanding notation as floating point numbers, i.e. 'axn', 'a+bxn' and 'a-bxn'.
# All properties are optional and default to 0. The order is not important. The following properties are supported:
# - schema (int):
#     Currently valid schema numbers are -53 and -4 <= n <= 8.
#     Schema -53 is the custom buckets schema: upper bucket boundaries are defined in custom_values,
#     as for classic histograms, and you shouldn't use z_bucket, z_bucket_w, n_buckets, n_offset.
#     The rest are base-2 standard schemas, where 1.0 is a bucket boundary in each case, and
#     each power of two is divided into 2^n logarithmic buckets. In other words,
#     each bucket boundary is the previous boundary times 2^(2^-n).
# - sum (float):
#     The sum of all observations, including the zero bucket.
# - count (non-negative float):
#     The number of observations, including those that are NaN and including the zero bucket.
# - z_bucket (non-negative float):
#     The sum of all observations in the zero bucket.
# - z_bucket_w (non-negative float):
#     The width of the zero bucket.
#     If z_bucket_w > 0, the zero bucket contains all observations -z_bucket_w <= x <= z_bucket_w.
#     Otherwise, the zero bucket only contains observations that are exactly 0.
# - buckets (list of non-negative floats):
#     Observation counts in positive buckets. Each represents an absolute count.
# - offset (int):
#     The starting index of the first entry in the positive buckets.
# - n_buckets (list of non-negative floats):
#     Observation counts in negative buckets. Each represents an absolute count.
# - n_offset (int):
#     The starting index of the first entry in the negative buckets.
# - counter_reset_hint (one of 'unknown', 'reset', 'not_reset' or 'gauge'):
#     The counter reset hint associated with this histogram. Defaults to 'unknown' if not set.
# - custom_values (list of floats in ascending order):
#     The upper limits for custom buckets when schema is -53.
#     These have the same role as the 'le' numbers in classic histograms.
#     Do not append '+Inf' at the end; it is implicit.
values: <string>
```
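
For instance, several of the notations above can be combined in a single `values` string. A short sketch (the metric names here are invented for illustration):

```yaml
input_series:
  # '0+10x4' expands to '0 10 20 30 40' - a counter increasing by 10 per interval.
  - series: 'http_requests_total{job="api", instance="localhost:8080"}'
    values: '0+10x4'
  # '1 _x3 stale' expands to '1 _ _ _ stale' - one sample, three missed
  # scrapes, then a staleness marker.
  - series: 'up{job="api", instance="localhost:8080"}'
    values: '1 _x3 stale'
```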

### `<alert_test_case>`

Prometheus allows you to use the same alert name in multiple alerting rules. Hence in this unit testing framework, you have to list the union of all the firing alerts for that alert name under a single `<alert_test_case>`.

```yaml
# The time elapsed from time=0s when the alerts have to be checked.
eval_time: <duration>

# Name of the alert to be tested.
alertname: <string>

# List of the expected alerts firing under the given alert name at the given
# evaluation time. To test that an alerting rule should not be firing,
# set the above fields and leave 'exp_alerts' empty.
exp_alerts:
  [ - <alert> ]
```
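
As noted above, leaving `exp_alerts` empty asserts that no alert with the given name is firing at the given time. A minimal sketch (the alert name and timings are hypothetical):

```yaml
alert_rule_test:
  # Assert that InstanceDown is not yet firing 3 minutes in
  # (e.g. because its 'for:' clause has not elapsed).
  - eval_time: 3m
    alertname: InstanceDown
    exp_alerts: []
```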

### `<alert>`

```yaml
# These are the expanded labels and annotations of the expected alert.
# Note: labels also include the labels of the sample associated with the
# alert (the same as what you see in `/alerts`, without the series' `__name__` and `alertname`)
exp_labels:
  [ <labelname>: <string> ]
exp_annotations:
  [ <labelname>: <string> ]
```

### `<promql_test_case>`

```yaml
# Expression to evaluate.
expr: <string>

# The time elapsed from time=0s when the expression has to be evaluated.
eval_time: <duration>

# Expected samples at the given evaluation time.
exp_samples:
  [ - <sample> ]
```

### `<sample>`

```yaml
# Labels of the sample in the usual series notation '<metric name>{<label name>=<label value>, ...}'
# Examples:
#      series_name{label1="value1", label2="value2"}
#      go_goroutines{job="prometheus", instance="localhost:9090"}
labels: <string>

# The expected value of the PromQL expression.
value: <number>
```

## Example

This is an example input file for unit testing which passes the test. `test.yml` is the test file which follows the syntax above and `alerts.yml` contains the alerting rules.

With `alerts.yml` in the same directory, run `./promtool test rules test.yml`.

### `test.yml`

```yaml
# This is the main input for unit testing.
# Only this file is passed as a command line argument.

rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  # Test 1.
  - interval: 1m
    # Series data.
    input_series:
      - series: 'up{job="prometheus", instance="localhost:9090"}'
        values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
      - series: 'up{job="node_exporter", instance="localhost:9100"}'
        values: '1+0x6 0 0 0 0 0 0 0 0' # 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
      - series: 'go_goroutines{job="prometheus", instance="localhost:9090"}'
        values: '10+10x2 30+20x5' # 10 20 30 30 50 70 90 110 130
      - series: 'go_goroutines{job="node_exporter", instance="localhost:9100"}'
        values: '10+10x7 10+30x4' # 10 20 30 40 50 60 70 80 10 40 70 100 130

    # Unit test for alerting rules.
    alert_rule_test:
      # Unit test 1.
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
          # Alert 1.
          - exp_labels:
              severity: page
              instance: localhost:9090
              job: prometheus
            exp_annotations:
              summary: "Instance localhost:9090 down"
              description: "localhost:9090 of job prometheus has been down for more than 5 minutes."

    # Unit tests for promql expressions.
    promql_expr_test:
      # Unit test 1.
      - expr: go_goroutines > 5
        eval_time: 4m
        exp_samples:
          # Sample 1.
          - labels: 'go_goroutines{job="prometheus",instance="localhost:9090"}'
            value: 50
          # Sample 2.
          - labels: 'go_goroutines{job="node_exporter",instance="localhost:9100"}'
            value: 50
```

### `alerts.yml`

```yaml
# This is the rules file.

groups:
  - name: example
    rules:

      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

      - alert: AnotherInstanceDown
        expr: up == 0
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
```

### Time within tests

Note that in all tests, whether in an `<alert_test_case>` or a
`<promql_test_case>`, the output of all functions related to the current time,
for example `time()` and the `day_of_*()` functions, is deterministic.

Test evaluation starts at timestamp 0, so during a test `time()` returns
`0 + eval_time`.

If you need to write tests for alerts that use functions relating to the current
time, make sure that the values given to your `input_series` are placed far
enough in the past, relative to the evaluation time described above. The values
can, for example, be negative timestamps, so that the alert can be expected to
trigger with a very small `eval_time`.

Another method that is known to work is to instead move `eval_time` far into the
future, so that the timestamp returned by `time()` is high enough and the values
in `input_series` are far enough from that point in time for the alerts to
trigger. This method has the downside that promtool generates a time series
database containing a sample for each `input_series` at every `interval` of the
test, which can easily become very slow and consume a lot of RAM. By instead
choosing values for `input_series` relative to timestamp 0, even if those values
are negative, you can keep `eval_time` fairly low and avoid making your tests
run very slowly.
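
To illustrate the first approach, here is a sketch of a test for a hypothetical rule `BackupTooOld` with expression `time() - backup_last_success_timestamp_seconds > 3600` (the metric and alert names are invented). The negative values place the samples two hours before timestamp 0, so the alert fires even at a small `eval_time`:

```yaml
tests:
  - interval: 1m
    input_series:
      # A backup timestamp two hours in the past relative to time()=0.
      - series: 'backup_last_success_timestamp_seconds{job="backup"}'
        values: '-7200x5'
    alert_rule_test:
      # At eval_time=5m, time() returns 300, and 300 - (-7200) > 3600.
      - eval_time: 5m
        alertname: BackupTooOld
        exp_alerts:
          - exp_labels:
              job: backup
```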
|