Commit Graph

645 Commits

Author SHA1 Message Date
Bryan Boreham
95d47c0512 [TESTS] Rules: remove brittle TestNewGroup
It broke on different implementation of 'NewNopLogger'.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-06-23 15:34:03 +01:00
Anand Rajagopal
41d08003c5 refactor: Move 'for' state restoration metrics to defer block
Signed-off-by: Anand Rajagopal <anrajag@amazon.com>
2025-05-28 16:04:49 +00:00
tongjicoder
4fe20fa340 chore: fix some comments
Signed-off-by: tongjicoder <tongjicoder@icloud.com>
2025-05-27 23:14:41 +08:00
Subhramit Basu
44e27a876e
Add parse alerting for rules files (#16601)
Builds over https://github.com/prometheus/prometheus/pull/16462
Addresses comments, adds invalid rules file

Signed-off-by: subhramit <subhramit.bb@live.in>
Co-authored-by: marcodebba <marcodebonis74@gmail.com>
2025-05-26 18:29:34 +02:00
hardlydearly
ba4b058b7a refactor: use slices.Contains to simplify code
Signed-off-by: hardlydearly <799511800@qq.com>
2025-05-09 08:27:10 +02:00
Arve Knudsen
e7e3ab2824
Fix linting issues found by golangci-lint v2.0.2 (#16368)
* Fix linting issues found by golangci-lint v2.0.2

---------

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-05-03 19:05:13 +02:00
George Krajcsovits
d085f7da9b
Revert "ruler: correct logging of alert name & template data" 2025-03-25 15:36:34 +01:00
George Krajcsovits
e19ff7a000
Merge pull request #15093 from dimitarvdimitrov/dimitar/ruler/unsupported-logger
ruler: correct logging of alert name & template data
2025-03-25 13:47:22 +01:00
Matthieu MOREL
5fa1146e21
chore: enable gci linter (#16245)
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-03-22 15:46:13 +00:00
Matthieu MOREL
c7d4b53ec1 chore: enable unused-parameter from revive
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-02-19 19:50:28 +01:00
Julien Duchesne
77a5698190
Ruler Concurrency: Consider __name__ matchers (#16039)
When calculating dependencies between rules, we sometimes run into `{__name__...}` matchers
These can be used the same way as the actual rule names
This will enable even more rules to run concurrently

The new logic is also not slower:
```
julienduchesne@triceratops prometheus % benchstat test-old.txt test.txt
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/rules
cpu: Apple M3 Pro
                 │ test-old.txt │              test.txt               │
                 │    sec/op    │   sec/op     vs base                │
DependencyMap-11    1.206µ ± 7%   1.024µ ± 7%  -15.10% (p=0.000 n=10)

                 │ test-old.txt │               test.txt               │
                 │     B/op     │     B/op      vs base                │
DependencyMap-11   1.720Ki ± 0%   1.438Ki ± 0%  -16.35% (p=0.000 n=10)

                 │ test-old.txt │              test.txt              │
                 │  allocs/op   │ allocs/op   vs base                │
DependencyMap-11     39.00 ± 0%   34.00 ± 0%  -12.82% (p=0.000 n=10)
```

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
2025-02-17 11:19:16 +00:00
frazou
9b4c8f6be2
rulefmt: support YAML aliases for Alert/Record/Expr (#14957)
* rulefmt: add tests with YAML aliases for Alert/Record/Expr

Altough somewhat discouraged in favour of using proper configuration
management tools to generate full YAML, it can still be useful in some
situations to use YAML anchors/aliases in rules.

The current implementation is however confusing: aliases will work
everywhere except on the alert/record name and expr

This first commit adds (failing) tests to illustrate the issue, the next
one fixes it. The YAML test file is intentionally filled with anchors
and aliases. Although this is probably not representative of a real-world
use case (which would have less of them), it errs on the safer side.

Signed-off-by: François HORTA <fhorta@scaleway.com>

* rulefmt: support YAML aliases for Alert/Record/Expr

This fixes the use of YAML aliases in alert/recording rule names and
expressions. A side effect of this change is that the RuleNode YAML type is
no longer propagated deeper in the codebase, instead the generic Rule type
can now be used.

Signed-off-by: François HORTA <fhorta@scaleway.com>

* rulefmt: Add test for YAML merge combined with aliases

Currently this does work, but adding a test for the related
functionally here makes sense.

Signed-off-by: David Leadbeater <dgl@dgl.cx>

* rulefmt: Rebase to latest changes

Signed-off-by: David Leadbeater <dgl@dgl.cx>

---------

Signed-off-by: François HORTA <fhorta@scaleway.com>
Signed-off-by: David Leadbeater <dgl@dgl.cx>
Co-authored-by: David Leadbeater <dgl@dgl.cx>
2025-02-13 20:48:33 +11:00
Dimitar Dimitrov
237139c992
Merge branch 'main' into dimitar/ruler/unsupported-logger
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
2025-02-12 10:55:53 +01:00
Dimitar Dimitrov
2a8ae586f4
ruler: stop all rule groups asynchronously on shutdown (#15804)
* ruler: stop all rule groups asynchronously on shutdown

During shutdown of the rules manager some rule groups have already stopped and are missing evaluations while we're waiting for other groups to finish their evaluation.

When there are many groups (in the thousands), the whole shutdown process can take up to 10 minutes, during which we get miss evaluations.

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Use wrappers in stop(); rename awaitStopped()

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Add comment

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

---------

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
2025-01-20 21:26:58 +01:00
Bartlomiej Plotka
39d7242cfe
Merge pull request #15669 from rajagopalanand/restore-for-state
rules: Add an option to restore new rule groups added to existing rule manager
2025-01-17 15:37:07 +01:00
Giedrius Statkevičius
92218ecb9b promtool: add --ignore-unknown-fields
Add --ignore-unknown-fields that ignores unknown fields in rule group
files. There are lots of tools in the ecosystem that "like" to extend
the rule group file structure but they are currently unreadable by
promtool if there's anything extra. The purpose of this flag is so that
we could use the "vanilla" promtool instead of rolling our own.

Some examples of tools/code:

https://github.com/grafana/mimir/blob/main/pkg/mimirtool/rules/rwrulefmt/rulefmt.go
8898eb3cc5/pkg/rules/rules.go (L18-L25)

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
2025-01-15 11:34:28 +02:00
Julien Duchesne
0a19f1268e
Rule Concurrency: Simpler loop for sequential (default) executions (#15801) 2025-01-09 23:14:21 +00:00
Julien Duchesne
a768a3b95e
Rule Concurrency: Test safe abort of rule evaluations (#15797)
This test was added in the Grafana fork a while ago: https://github.com/grafana/mimir-prometheus/pull/714 and has been helpful to make sure we can safely terminate rule evaluations early
The new rule evaluation logic (done here: https://github.com/prometheus/prometheus/pull/15681) does not have the bug, but the test was useful to verify that

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
2025-01-08 16:32:48 +00:00
Julien Duchesne
1a27ab29b8
Rules: Store dependencies instead of boolean (#15689)
* Rules: Store dependencies instead of boolean
To improve https://github.com/prometheus/prometheus/pull/15681 further, we'll need to store the dependencies and dependents of each

Right now, if a rule has both (at least 1) dependents and dependencies, it is not possible to determine the order to run the rules and they must all run sequentially

This PR only changes the dependents and dependencies attributes of rules, it does not implement a new topological sort algorithm

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Store a slice of Rule instead

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Add `BenchmarkRuleDependencyController_AnalyseRules` for future reference

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

---------

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
2025-01-06 20:48:38 +00:00
Julien Duchesne
8067f27971
RuleConcurrencyController: Add SplitGroupIntoBatches method (#15681)
* `RuleConcurrencyController`: Add `SplitGroupIntoBatches` method
The concurrency implementation can now return a slice of concurrent rule batches
This allows for additional concurrency as opposed to the current interface which is limited by the order in which the rules have been loaded

Also, I removed the `concurrencyController` attribute from the group. That information is duplicated in the opts.RuleConcurrencyController` attribute, leading to some confusing behavior, especially in tests.

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Address PR comments

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Apply suggestions from code review

Co-authored-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: Julien Duchesne <julienduchesne@live.com>

---------

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
Signed-off-by: Julien Duchesne <julienduchesne@live.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
2025-01-06 18:51:19 +00:00
Dimitar Dimitrov
8b45384a44
Make templateData a struct
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
2024-12-24 11:10:13 +02:00
Julien Duchesne
615195372d
Ruler: Move inner eval function (#15688)
Refactoring in prevision of https://github.com/prometheus/prometheus/pull/15681

By moving the `eval` function outside of the for loop, we can modify the rule execution order more easily (without huge git changes)

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
2024-12-17 21:11:55 +00:00
Julien Duchesne
e2f037e554
rules: Add new RuleEvaluationTimeSum field to groups (#15672)
* feat(ruler): Add new `RuleEvaluationTimeSum` field to groups
Coupled with a metric: `rule_group_last_rule_duration_sum_seconds`

This will give us more observability into how fast a group runs with or without concurrency

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Update rules/group.go

Co-authored-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: Julien Duchesne <julienduchesne@live.com>
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Apply suggestions from code review

Co-authored-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: Julien Duchesne <julienduchesne@live.com>
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Remove `in seconds`. A duration is a duration

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

---------

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
Signed-off-by: Julien Duchesne <julienduchesne@live.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
2024-12-13 21:42:46 +00:00
Julien Duchesne
7802ca263d
RuleDependencyController: Fix for indeterminate conditions (#15560)
The dependency map being empty meant that all the rules were being set as having no dependencies or dependents. Which is the opposite of what we want
Added two new tests that verify the behavior, they failed before the fix, running all the rules concurrently

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
2024-12-13 16:48:29 +00:00
Anand Rajagopal
ceb2f653ba Add an option to restore new rule groups added to existing rule manager
Signed-off-by: Anand Rajagopal <anrajag@amazon.com>
2024-12-13 03:15:53 +00:00
Arve Knudsen
c2e28f21ba
rules.NewGroup: Fix when no logger is passed (#15356)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2024-11-21 16:53:06 +01:00
Matthieu MOREL
af1a19fc78 enable errorf rule from perfsprint linter
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2024-11-06 16:50:36 +01:00
Vanshika
cccbe72514
TSDB: Fix some edge cases when OOO is enabled (#14710)
Fix some edge cases when OOO is enabled

Signed-off-by: Vanshikav123 <vanshikav928@gmail.com>
Signed-off-by: Vanshika <102902652+Vanshikav123@users.noreply.github.com>
Signed-off-by: Jesus Vazquez <jesusvzpg@gmail.com>
Co-authored-by: Jesus Vazquez <jesusvzpg@gmail.com>
2024-10-23 17:34:28 +02:00
Bryan Boreham
70e2d23027
Merge pull request #11474 from clwluvw/group-label
[FEATURE] rules: add labels at group level
2024-10-21 14:47:12 +01:00
Dimitar Dimitrov
5b2cfc5e45
Merge branch 'main' into dimitar/ruler/unsupported-logger
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
2024-10-09 16:53:21 +02:00
TJ Hoplock
6ebfbd2d54 chore!: adopt log/slog, remove go-kit/log
For: #14355

This commit updates Prometheus to adopt stdlib's log/slog package in
favor of go-kit/log. As part of converting to use slog, several other
related changes are required to get prometheus working, including:
- removed unused logging util func `RateLimit()`
- forward ported the util/logging/Deduper logging by implementing a small custom slog.Handler that does the deduping before chaining log calls to the underlying real slog.Logger
- move some of the json file logging functionality to use prom/common package functionality
- refactored some of the new json file logging for scraping
- changes to promql.QueryLogger interface to swap out logging methods for relevant slog sugar wrappers
- updated lots of tests that used/replicated custom logging functionality, attempting to keep the logical goal of the tests consistent after the transition
- added a healthy amount of `if logger == nil { $makeLogger }` type conditional checks amongst various functions where none were provided -- old code that used the go-kit/log.Logger interface had several places where there were nil references when trying to use functions like `With()` to add keyvals on the new *slog.Logger type

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
2024-10-07 15:58:50 -04:00
Dimitar Dimitrov
b544127d9a
ruler: correct logging of alert name
The error messages we see in the embedded ruler in Mimir right now are contain `alert="unsupported value type"` and `data="unsupported value type"`

```
ts=2024-10-04T09:59:56.900015691Z caller=alerting.go:384 level=warn component=ruler insight=true user=XXX alert="unsupported value type" msg="Expanding alert template failed" err="error executing template __alert_service_mesh_down_alert: template: __alert_service_mesh_down_alert:1:139: executing \"__alert_service_mesh_down_alert\" at <.registry_name>: can't evaluate field registry_name in type struct { Labels map[string]string; ExternalLabels map[string]string; ExternalURL string; Value interface {} }" data="unsupported value type"
```

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
2024-10-04 12:19:12 +02:00
Nathan Baulch
50cd453c8f
chore: Fix typos (#14868)
* Fix typos

---------

Signed-off-by: Nathan Baulch <nathan.baulch@gmail.com>
2024-09-10 22:32:03 +02:00
Arve Knudsen
99204f23ee Merge remote-tracking branch 'prometheus/main' into arve/close-engine
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2024-08-29 09:52:54 +02:00
riskrole
406bf775aa chore: fix some comments
Signed-off-by: riskrole <yuhang@before.tech>
2024-08-28 11:26:57 +08:00
Arve Knudsen
c9a460d570 Merge remote-tracking branch 'prometheus/main' into arve/close-engine
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2024-08-26 12:17:10 +02:00
Max Amin
84b819a69f
feat: add Google cloud roundtripper for remote write (#14346)
* feat: Google Auth for remote write

Signed-off-by: Max Amin <maxamin@google.com>

---------

Signed-off-by: Max Amin <maxamin@google.com>
2024-07-30 16:25:19 +01:00
Seena Fallah
f253d36361 rule: allow merging labels from group level
Support merging labels from groups to rule labels

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
2024-07-26 20:18:05 +02:00
Arve Knudsen
7c873004c7 Merge remote-tracking branch 'prometheus/main' into arve/close-engine 2024-07-26 11:48:33 +02:00
gotjosh
465891cc56
Rules: Refactor concurrency controller interface (#14491)
* Rules: Refactor concurrency controller interface

Even though the main purpose of this refactor is to modify the interface of the concurrency controller to accept a Context. I did two drive-by modifications that I think are sensible:

1. I have moved the check for dependencies on rules to the controller itself - this aligns with how the controller should behave as it is a deciding factor on wether we should run concurrently or not.
2. I cleaned up some unused methods from the days of the old interface before #13527 changed it.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-07-22 14:11:18 +01:00
Arve Knudsen
fbc9eddfaf Refactor engine creation in tests
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2024-07-14 13:58:51 +02:00
Arve Knudsen
fec6adadcd Merge remote-tracking branch 'prometheus/main' into arve/close-engine 2024-07-14 13:19:11 +02:00
Saswata Mukherjee
398f42de5f
Add label-matcher support to Rules API (#10194)
* Add label-matcher support to Rules API

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* Implement suggestions

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* Match any matcherSet instead of all

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* Don't treat labels.Labels as slice

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* Remove non-templated check and fix tests

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* Update docs

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix comments

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix comment

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* Add comment for matching logic, fix tests after rebase

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

---------

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Co-authored-by: Yijie Qin <qinyijie@amazon.com>
2024-07-10 13:18:29 +01:00
Arve Knudsen
e8ae8cf012 Merge remote-tracking branch 'prometheus/main' into arve/close-engine
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2024-07-01 10:47:21 +02:00
Raphael Silva
e0c9b2ee19 Fix linting errors
Signed-off-by: Raphael Silva <rapphil@gmail.com>
2024-06-28 23:44:08 +00:00
Raphael Silva
cd5a7b5020 Make rules Manager Update method no-op after Close
This has to be done because Close and Update methods are accessed concurrently.

Signed-off-by: Raphael Silva <rapphil@gmail.com>
2024-06-28 23:39:46 +00:00
Jeanette Tan
dda5f48c9e Merge branch 'main' into nhcb-review-2 2024-06-20 22:50:00 +08:00
Oleg Zaytsev
4c1e71fa0b
Reduce the flakiness of TestAsyncRuleEvaluation (#14300)
* Reduce the flakiness of TestAsyncRuleEvaluation

This tests sleeps for 15 millisecond per rule group, and then comprares
the entire execution time to be smaller than a multiple of that delay.

The ruleCount is 6, so it assumes that the test will come to the
assertions in less than 90ms.

Meanwhile, the Github's Windows runner:
- ...Huh, oh? What? How much time? milliwhat? Sorry I don't speak that.

TL;DR, this increases the delay to 250 millisecond. This won't prevent
the test from being flaky, but will reduce the flakiness by several
orders of magnitude and hopefully won't be an issue anymore.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Make tests parallel

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

---------

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
2024-06-14 15:02:46 +02:00
Arve Knudsen
b7320ef636 Merge remote-tracking branch 'prometheus/main' into arve/close-engine 2024-06-14 10:51:35 +02:00
Jeanette Tan
14f8dded39 Merge branch 'main' into nhcb
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
2024-06-07 19:17:14 +08:00