Commit Graph

149 Commits

Author SHA1 Message Date
Yandi Lee
8eb445b8a4
Discovery.Manager: close sync ch after sender() is stopped (#14465)
* close sync ch after sender() is stopped
* break if chan is closed

Signed-off-by: liyandi <littlepangdi@163.com>
Co-authored-by: liyandi <liyandi@xiaomi.com>
2025-07-11 17:15:01 +01:00
Siavash Safi
ef48e4cb9f
chore: refactor notifier package
Split the notifier package into smaller source files.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
2025-04-03 17:48:04 +11:00
Dimitar Dimitrov
bd5b2ea95c
ruler notifier: make batch size configurable (#16254)
* ruler notifier: make batch size configurable

In Mimir we experimented with setting a higher value for the batch size.
A 4x increase in batch size decreased the time to process a single notification by about 2x.
This reduces the processing time of the notifications queue and increases the throughput of the queue.

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Update cmd/prometheus/main.go

Co-authored-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Update docs

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Use a string constant

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Add godoc comment on exported constant

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

---------

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
2025-03-24 14:22:19 +00:00
Matthieu MOREL
5fa1146e21
chore: enable gci linter (#16245)
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-03-22 15:46:13 +00:00
Ayoub Mrini
3f0de72da7
Merge pull request #15979 from jrcichra/sendall-remap
notifier: Consider alert relabeling when deciding if alerts are dropped
2025-02-28 09:54:54 +01:00
Justin Cichra
78599d0ec6 notifier: Consider alert relabeling when deciding if alerts are dropped
When intentionally dropping all alerts in a batch, `SendAll` returns
false, increasing the dropped_total metric.

This makes it difficult to tell if there is a connection issue between
Prometheus and AlertManager or a result of intentionally dropping alerts.

Have `SendAll` return `true` when no batches were sent by keeping track
of the number of AlertManager request attempts.

If no attempts were made, then the send is successful.

Fixes: #15978

Signed-off-by: Justin Cichra <jcichra@cloudflare.com>
2025-02-26 09:05:40 -05:00
Matthieu MOREL
c7d4b53ec1 chore: enable unused-parameter from revive
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-02-19 19:50:28 +01:00
Matthieu MOREL
dd5ab743ea chore(deps): use version.PrometheusUserAgent
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-01-22 07:31:02 +01:00
beorn7
e01c5cefac notifier: fix increment of metric prometheus_notifications_errors_total
Previously, prometheus_notifications_errors_total was incremented by
one whenever a batch of alerts was affected by an error during sending
to a specific alertmanager. However, the corresponding metric
prometheus_notifications_sent_total, counting all alerts that were
sent (including those where the sent ended in error), is incremented
by the batch size, i.e. the number of alerts.

Therefore, the ratio used in the mixin for the
PrometheusErrorSendingAlertsToSomeAlertmanagers alert is inconsistent.

This commit changes the increment of
prometheus_notifications_errors_total to the number of alerts that
were sent in the attempt that ended in an error. It also adjusts the
metrics help string accordingly and makes the wording in the alert in
the mixin more precise.

Signed-off-by: beorn7 <beorn@grafana.com>
2024-11-26 15:50:02 +01:00
SuperQ
bc1bc5c118
Update sigv4 library
Migrate use of prometheus/common/sigv4 to prometheus/sigv4.

Related to: https://github.com/prometheus/common/issues/709

Signed-off-by: SuperQ <superq@gmail.com>
2024-11-08 22:35:44 +01:00
Alan Protasio
c78d5b94af
Disallowing configure AM with the v1 api (#13883)
* Stop supporting Alertmanager v1

* Disallowing configure AM with the v1 api

Signed-off-by: alanprot <alanprot@gmail.com>

* Update config/config_test.go

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>

* Update config/config.go

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>

* Addressing coments

Signed-off-by: alanprot <alanprot@gmail.com>

* Update notifier/notifier.go

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>

* Update config/config_test.go

Co-authored-by: Jan Fajerski <jan--f@users.noreply.github.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>

---------

Signed-off-by: alanprot <alanprot@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Co-authored-by: Jan Fajerski <jan--f@users.noreply.github.com>
2024-10-18 15:23:14 +02:00
Ayoub Mrini
d9a54284f5
Merge pull request #14987 from machine424/yafix
fix(notifier): avoid dropping known alertmanagers after each ApplyConfig
2024-10-10 21:14:35 +02:00
TJ Hoplock
6ebfbd2d54 chore!: adopt log/slog, remove go-kit/log
For: #14355

This commit updates Prometheus to adopt stdlib's log/slog package in
favor of go-kit/log. As part of converting to use slog, several other
related changes are required to get prometheus working, including:
- removed unused logging util func `RateLimit()`
- forward ported the util/logging/Deduper logging by implementing a small custom slog.Handler that does the deduping before chaining log calls to the underlying real slog.Logger
- move some of the json file logging functionality to use prom/common package functionality
- refactored some of the new json file logging for scraping
- changes to promql.QueryLogger interface to swap out logging methods for relevant slog sugar wrappers
- updated lots of tests that used/replicated custom logging functionality, attempting to keep the logical goal of the tests consistent after the transition
- added a healthy amount of `if logger == nil { $makeLogger }` type conditional checks amongst various functions where none were provided -- old code that used the go-kit/log.Logger interface had several places where there were nil references when trying to use functions like `With()` to add keyvals on the new *slog.Logger type

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
2024-10-07 15:58:50 -04:00
machine424
83ee57343a
fix(notifier): stop dropping known alertmanagers on each ApplyConfig and waiting on SD to update them.
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2024-10-03 11:24:01 +02:00
machine424
3dc623d30b
test(notifier): add reproducer
Signed-off-by: machine424 <ayoubmrini424@gmail.com>

Co-authored-by: tommy0 <tommy0@mail.ru>
2024-10-01 10:16:43 +02:00
Bryan Boreham
5710ddf24f
[ENHANCEMENT] Alerts: remove metrics for removed Alertmanagers (#13909)
* [ENHANCEMENT] Alerts: remove metrics for removed Alertmanagers

So they don't continue to report stale values.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-09-26 15:32:18 +01:00
Nathan Baulch
50cd453c8f
chore: Fix typos (#14868)
* Fix typos

---------

Signed-off-by: Nathan Baulch <nathan.baulch@gmail.com>
2024-09-10 22:32:03 +02:00
Julien
481d718539
Merge pull request #14661 from prymitive/TestHangingNotifier
tests: increase TestHangingNotifier timeout
2024-08-20 14:37:16 +02:00
Arve Knudsen
250aa5031d Remove empty line
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2024-08-19 10:50:27 +02:00
Arve Knudsen
3a78e76282 Upgrade golangci-lint to v1.60.1
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2024-08-18 12:13:25 +02:00
Lukasz Mierzwa
7694c89497 Increase TestHangingNotifier timeout
This test keeps timing out on our arm64 CI server, it does use a very slow timeout and that 5ms doesn't seem to be enough.
But it 10x.

Signed-off-by: Lukasz Mierzwa <lukasz@cloudflare.com>
2024-08-12 14:53:08 +01:00
Charles Korn
2dd07fbb1b
notifier: optionally drain queued notifications before shutting down (#14290)
* Add draining of queued notifications to `notifier.Manager`

Signed-off-by: Charles Korn <charles.korn@grafana.com>

* Update docs

Signed-off-by: Charles Korn <charles.korn@grafana.com>

* Address PR feedback

Signed-off-by: Charles Korn <charles.korn@grafana.com>

* Add more logging

Signed-off-by: Charles Korn <charles.korn@grafana.com>

* Address offline feedback: remove timeout

Signed-off-by: Charles Korn <charles.korn@grafana.com>

* Ensure stopping takes priority over further processing, make tests more robust

Signed-off-by: Charles Korn <charles.korn@grafana.com>

* Make channel unbuffered

Signed-off-by: Charles Korn <charles.korn@grafana.com>

* Update docs

Signed-off-by: Charles Korn <charles.korn@grafana.com>

* Fix race in test

Signed-off-by: Charles Korn <charles.korn@grafana.com>

* Remove unnecessary context

Signed-off-by: Charles Korn <charles.korn@grafana.com>

* Make Stop safe to call multiple times

Signed-off-by: Charles Korn <charles.korn@grafana.com>

---------

Signed-off-by: Charles Korn <charles.korn@grafana.com>
2024-06-26 11:32:04 +01:00
Arve Knudsen
d902116b41 Fix various linting errors
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2024-06-24 16:11:53 -07:00
machine424
70beda092a fix(notifier): take alertmanagerSet.mtx before checking alertmanagerSet.ams in sendAll
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2024-06-19 09:43:52 +02:00
machine424
690de487e2 chore(notifier): Split 'Run()' into two goroutines: one to receive target updates and trigger reloads and the other one to send notifications.
This is done to prevent the latter operation from blocking/starving the former, as previously, the `tsets` channel was consumed by the same goroutine that consumes and feeds the buffered `n.more` channel, the `tsets` channel was less likely to be ready as it's unbuffered and only fed every `SDManager.updatert` seconds.

See https://github.com/prometheus/prometheus/issues/13676 and https://github.com/prometheus/prometheus/issues/8768

The synchronization with the sendLoop goroutine is managed through the n.mtx mutex.

This uses a similar approach than scrape manager's efbd6e41c5/scrape/manager.go (L115-L117)

The old TestHangingNotifier was replaced by the new one to more closely reflect reality.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2024-06-19 09:43:52 +02:00
machine424
94d28cd6cf chore(notifier): add a reproducer for https://github.com/prometheus/prometheus/issues/13676
to show "targets groups update" starvation when the notifications queue is full and an Alertmanager
is down.

The existing `TestHangingNotifier` that was added in https://github.com/prometheus/prometheus/pull/10948 doesn't really reflect the reality as the SD changes are manually fed into `syncCh` in a continuous way, whereas in reality, updates are only resent every `updatert`.

The test added here sets up an SD manager and links it to the notifier. The SD changes will be triggered by that manager as it's done in reality.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

Co-authored-by: Ethan Hunter <ehunter@hudson-trading.com>
2024-06-19 09:43:52 +02:00
Oleksandr Redko
f10c3454e9 Enable perfsprint linter and fix up code
Signed-off-by: Oleksandr Redko <oleksandr.red+github@gmail.com>
2024-05-15 17:51:05 +03:00
Bryan Boreham
e1dd8e72df
Merge branch 'main' into merge-2.51.2-into-main
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-04-10 15:05:52 +01:00
Matthieu MOREL
d496687c8e golangci-lint: enable usestdlibvars linter
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2024-04-08 19:26:23 +00:00
Simon Pasquier
8bd6ae1b20 Notifier: fix deadlock when zero alerts
When all alerts were dropped after alert relabeling, the `sendAll()`
function didn't release the lock properly which created a deadlock with
the Alertmanager target discovery.

In addition, the commit detects early when there are no Alertmanager
endpoint to notify to avoid unnecessary work.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2024-03-29 15:44:05 +01:00
Bryan Boreham
8c4e4b72a8 Notifier: pass parameters to goroutine explicitly
Avoids possible false sharing between loops.

Plausibly there is no problem in the current code, but it's easy enough to write it more safely.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-03-08 09:20:36 +00:00
Bryan Boreham
57c799132b Notifier: don't reuse payload after relabeling
Also clarify why these variables are being cleared.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-03-08 09:16:43 +00:00
darshanime
61ce18950f Fix pipeline golangci-lint error
Signed-off-by: darshanime <deathbullet@gmail.com>
2024-03-05 00:45:07 +05:30
Julien
88622cfa2c
Merge pull request #12551 from nabokihms/alertmanager-relabeling-config
Route different alerts to different alertmanagers
2024-03-04 16:45:00 +01:00
Matthieu MOREL
9c4782f1cc
golangci-lint: enable testifylint linter (#13254)
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2023-12-07 11:35:01 +00:00
TJ Hoplock
26b78da281 ci: use go1.21.0 fmt to make ci happy
https://github.com/prometheus/prometheus/actions/runs/6044443719/job/16403043771?pr=12774

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
2023-08-31 22:01:53 -04:00
TJ Hoplock
51d1d2cd96 feat: add AWS sigv4 support to alertmanager endpoints
Addresses: #12536

This commit adds support for configuring sigv4 to an
`alertmanager_config`. Based heavily on the sigv4 work in the remote
write client.

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
2023-08-31 21:47:25 -04:00
m.nabokikh
9d8463339d Fixes according to the code review
Signed-off-by: m.nabokikh <maksim.nabokikh@flant.com>
2023-07-23 00:37:30 +02:00
m.nabokikh
39d008f94f Route different alerts to different alertmanagers
Signed-off-by: m.nabokikh <maksim.nabokikh@flant.com>
2023-07-12 16:11:25 +02:00
Bryan Boreham
3711339a7d Alerts: more efficient relabel on Send
Re-use `labels.Builder` and use `relabel.ProcessBuilder` to skip a
conversion step.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-05-15 08:44:06 -04:00
Bryan Boreham
c3f267d862 Alerts: more efficient generation of target labels
Use a label builder instead of a slice when creating labels for the
target alertmanagers. This can be passed directly to
`relabel.ProcessBuilder`, skipping a copy.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-05-15 08:43:59 -04:00
Bryan Boreham
b987afa7ef labels: simplify call to get Labels from Builder
It took a `Labels` where the memory could be re-used, but in practice
this hardly ever benefitted. Especially after converting `relabel.Process`
to `relabel.ProcessBuilder`.

Comparing the parameter to `nil` was a bug; `EmptyLabels` is not `nil`
so the slice was reallocated multiple times by `append`.

Lastly `Builder.Labels()` now estimates that the final size will depend
on labels added and deleted.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-03-22 17:05:20 +00:00
Bryan Boreham
b3ca791bfd Update package notifier for new labels.Labels type
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-12-19 15:22:09 +00:00
Bryan Boreham
9619d3fd3b notifier: remove unused code
None of the actions on `lb` have any effect because its result is not
read.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-12-04 17:35:36 +00:00
Bryan Boreham
ac02cfcb79 notifier: in tests use labels.FromStrings
Replacing code which assumes the internal structure of `Labels`.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-09-09 13:34:49 +02:00
Cosrider
bef6556ca5
delete redundant alias (#11180)
Signed-off-by: Cosrider <cosrider7@gmail.com>

Signed-off-by: Cosrider <cosrider7@gmail.com>
2022-08-31 15:50:38 +02:00
Bryan Boreham
8b863c42dd
Optimise relabeling by re-using memory (#11147)
* model/relabel: Add benchmark

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* model/relabel: re-use Builder across relabels

Saves memory allocations.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* labels.Builder: allow re-use of result slice

This reduces memory allocations where the caller has a suitable slice available.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* model/relabel: re-use source values slice

To reduce memory allocations.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Unwind one change causing test failures

Restore original behaviour in PopulateLabels, where we must not overwrite the input set.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* relabel: simplify values optimisation

Use a stack-based array for up to 16 source labels, which will be the
vast majority of cases.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* lint

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-08-19 15:27:52 +05:30
Julien Pivotto
2479fb42f0
Improve notifier queue test to reduce flakiness (#10984)
Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
2022-07-05 15:27:26 +02:00
Julien Pivotto
02f3297719
Split notifier select in 2 to ensure newer targets are used. (#10948)
* Split notifier select in 2 to ensure newer targets are used.

Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
2022-07-01 14:23:23 +02:00
Matthieu MOREL
80fbe1de96
refactor (package model): move from github.com/pkg/errors to 'errors' and 'fmt' packages (#10748)
Signed-off-by: Matthieu MOREL <mmorel-35@users.noreply.github.com>
2022-06-16 10:38:27 +02:00