Commit Graph

77 Commits

Yandi Lee
8eb445b8a4
Discovery.Manager: close sync ch after sender() is stopped (#14465)
* close sync ch after sender() is stopped
* break if chan is closed
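
As a minimal, runnable sketch of the ordering those two bullets describe (the channel and function names here are assumptions, not the actual Manager fields): the sync channel is closed only after the sender goroutine has returned, and the receiver treats a closed channel as its stop signal.

```
package main

import "fmt"

func main() {
	syncCh := make(chan int)      // stand-in for the manager's sync channel
	stopCh := make(chan struct{}) // stand-in for the manager's context
	senderDone := make(chan struct{})
	receiverDone := make(chan struct{})

	// sender(): the only goroutine that writes to syncCh.
	go func() {
		defer close(senderDone)
		for i := 0; ; i++ {
			select {
			case <-stopCh:
				return
			case syncCh <- i:
			}
		}
	}()

	// receiver: break if the channel is closed.
	go func() {
		defer close(receiverDone)
		for v := range syncCh { // range exits once syncCh is closed
			fmt.Println("update", v)
		}
	}()

	close(stopCh)
	<-senderDone  // wait until sender() has stopped...
	close(syncCh) // ...so closing syncCh can't race with a send
	<-receiverDone
}
```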

Signed-off-by: liyandi <littlepangdi@163.com>
Co-authored-by: liyandi <liyandi@xiaomi.com>
2025-07-11 17:15:01 +01:00
machine424
020e803ee0 chore(discovery): remove unused StaticProvider struct; library users can easily define it on their side
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-09 17:10:13 +01:00
Lukasz Mierzwa
b49d143595 Fix a race in discovery manager ApplyConfig & shutdown
If we call ApplyConfig() at the same time the manager is being stopped we might end up hanging forever.
This is because ApplyConfig() will try to cancel obsolete providers and wait until they are cancelled.
It's done by setting a done() function that calls Done() on a sync.WaitGroup:

```
if len(prov.newSubs) == 0 {
	wg.Add(1)
	prov.done = func() {
		wg.Done()
	}
}
```

then calling prov.cancel() and finally waiting until every provider has run its done() function, by blocking on a wg.Wait() call.

For each provider there is a goroutine created by calling Manager.startProvider(*Provider):

```
func (m *Manager) startProvider(ctx context.Context, p *Provider) {
	m.logger.Debug("Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
	ctx, cancel := context.WithCancel(ctx)
	updates := make(chan []*targetgroup.Group)

	p.mu.Lock()
	p.cancel = cancel
	p.mu.Unlock()

	go p.d.Run(ctx, updates)
	go m.updater(ctx, p, updates)
}
```

It creates a context that can be cancelled and that cancel function becomes prov.cancel. This is what ApplyConfig will call.
If we look at the body of updater() method:

```
func (m *Manager) updater(ctx context.Context, p *Provider, updates chan []*targetgroup.Group) {
	// Ensure targets from this provider are cleaned up.
	defer m.cleaner(p)
	for {
		select {
		case <-ctx.Done():
			return
[...]
```

we can see that it will exit if that context is cancelled and that will trigger a call to Manager.cleaner().
That cleaner() is where done() is called.
So ApplyConfig() -> calls cancel() -> causes cleaner() to be executed -> calls done().

cancel() is also called from cancelDiscoverers() method that will be called by Manager.Run() when Manager is stopping:

```
func (m *Manager) Run() error {
	go m.sender()
	<-m.ctx.Done()
	m.cancelDiscoverers()
	return m.ctx.Err()
}
```

The problem is that if we call both ApplyConfig and stop the manager at the same time we might end up with:

- We call Manager.ApplyConfig()
- We stop the Manager
- Manager.cancelDiscoverers() is called
- Provider.cancel() is called for every Provider
- cancel() causes provider context to be cancelled which terminates updater() for given Provider
- cancelling context causes cleaner() method to be called for given Provider
- cleaner() calls done() and exits
- Provider is considered stopped at this point, there is no goroutine running that will call done() anymore
- ApplyConfig iterates providers and decides that one is obsolete and must be stopped
- It sets a custom done() function body with a WaitGroup.Done() call in it
- Then ApplyConfig waits until all Providers run done()
- But they are all stopped and no done() will be run
- We wait forever

This only happens if cancelDiscoverers() runs before ApplyConfig(): if ApplyConfig() runs first, done() will be called;
if cancelDiscoverers() runs first, it will stop the updater() instances, and so done() will never be called.

Part of the problem is that there is no distinction between running and stopped providers. There is a Provider.IsStarted() method
that returns a bool based on the value of the cancel function, but ApplyConfig doesn't check it.
The second problem is that although there is a mutex on a Provider, it's not used much in the code, so two goroutines can try to read and/or write
provider.cancel and/or provider.done at the same time, making it all more likely to race.

The easiest way to fix it is to check if the provider is started inside ApplyConfig so we don't try to stop a provider that's already stopped.
For that we need to mark it as stopped after cancel() is called, by setting cancel to nil.
This also needs better lock usage to avoid different parts of the code trying to set cancel and done at the same time.
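
A hedged sketch of that fix (a simplified Provider and a hypothetical stop() helper, not the actual patch):

```
package main

import (
	"context"
	"fmt"
	"sync"
)

// Provider, reduced to the fields relevant for this sketch.
type Provider struct {
	mu     sync.Mutex
	cancel context.CancelFunc
	done   func()
}

// IsStarted reports whether the provider still has a running updater.
func (p *Provider) IsStarted() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.cancel != nil
}

// stop registers done, cancels the provider once, and marks it stopped
// by setting cancel to nil, all under the same lock.
func (p *Provider) stop(done func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.cancel == nil {
		return // already stopped: never register a done() nobody will run
	}
	p.done = done
	p.cancel()
	p.cancel = nil
}

func main() {
	_, cancel := context.WithCancel(context.Background())
	p := &Provider{cancel: cancel}
	var wg sync.WaitGroup
	if p.IsStarted() { // ApplyConfig only waits on providers still running
		wg.Add(1)
		p.stop(func() { wg.Done() })
	}
	p.done() // normally called by cleaner() once the updater exits
	wg.Wait()
	fmt.Println("stopped once, waited once")
}
```

ApplyConfig would then only wg.Add(1) for providers where IsStarted() reports true, so wg.Wait() can no longer block on a done() that will never run.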

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:03:10 +01:00
Lukasz Mierzwa
59761f631b Move m.targetsMtx.Lock down into the loop
Make sure the order of locks is always the same in all functions. In ApplyConfig() we take m.targetsMtx.Lock() after the provider is locked, so replicate the same ordering in allGroups().
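
A runnable toy sketch of that invariant (types simplified, names assumed): the provider lock always comes first, and the targets lock is taken inside the loop.

```
package main

import "sync"

type Provider struct {
	mu sync.RWMutex
}

type Manager struct {
	targetsMtx sync.Mutex
	providers  []*Provider
	targets    map[*Provider][]string
}

// allGroups mirrors the ordering ApplyConfig uses: provider lock first,
// targets lock second, scoped to one iteration of the loop.
func (m *Manager) allGroups() []string {
	var all []string
	for _, p := range m.providers {
		p.mu.RLock()        // 1: provider lock
		m.targetsMtx.Lock() // 2: targets lock, down inside the loop
		all = append(all, m.targets[p]...)
		m.targetsMtx.Unlock()
		p.mu.RUnlock()
	}
	return all
}

func main() {
	p := &Provider{}
	m := &Manager{providers: []*Provider{p}, targets: map[*Provider][]string{p: {"t1"}}}
	_ = m.allGroups()
}
```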

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-05-15 12:30:48 +01:00
Lukasz Mierzwa
7d55ee8cc8 Try fixing potential deadlocks in discovery
Manager.ApplyConfig() uses multiple locks:
- Provider.mu
- Manager.targetsMtx

Manager.cleaner() uses the same locks but in the opposite order:
- First it locks Manager.targetsMtx
- Then it locks Provider.mu

I've seen a few strange cases of Prometheus hanging on shutdown and never completing that shutdown.
From a few traces I was given it appears that while Prometheus is stuck only discovery.Manager and notifier.Manager are still running.
From those traces it also seems like they are stuck on a lock in two functions:
- cleaner waits on a RLock()
- ApplyConfig waits on a Lock()

I cannot reproduce it but I suspect this is a race between locks. Imagine this scenario:
- Manager.ApplyConfig() is called
- Manager.ApplyConfig() locks Provider.mu
- at the same time cleaner() is called on the same Provider instance and it calls Manager.targetsMtx.Lock()
- Manager.ApplyConfig() now calls Manager.targetsMtx.Lock(), but that lock is already held by the cleaner() function, so ApplyConfig() hangs there
- at the same time cleaner() now wants to lock Provider.mu.RLock(), but that lock is already held by Manager.ApplyConfig()
- we end up with both functions locking each other out without any way to break that deadlock

Re-order lock calls to try to avoid this scenario.
I tried writing a test case for it but couldn't hit this issue.
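
For illustration, a toy program (assumed names, artificial sleeps to widen the race window) that reproduces the interleaving above; when run, the Go runtime aborts with `fatal error: all goroutines are asleep - deadlock!`:

```
package main

import (
	"sync"
	"time"
)

var (
	providerMu sync.RWMutex // stands in for Provider.mu
	targetsMtx sync.Mutex   // stands in for Manager.targetsMtx
)

// applyConfig takes the provider lock first, then the targets lock.
func applyConfig() {
	providerMu.Lock()
	defer providerMu.Unlock()
	time.Sleep(100 * time.Millisecond)
	targetsMtx.Lock() // blocks: held by cleaner()
	defer targetsMtx.Unlock()
}

// cleaner takes the locks in the opposite order: targets lock first.
func cleaner() {
	targetsMtx.Lock()
	defer targetsMtx.Unlock()
	time.Sleep(100 * time.Millisecond)
	providerMu.RLock() // blocks: held by applyConfig()
	defer providerMu.RUnlock()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); applyConfig() }()
	go func() { defer wg.Done(); cleaner() }()
	wg.Wait()
}
```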

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-05-12 09:13:46 +01:00
Matthieu MOREL
b472ce7010 chore: enable early-return from revive
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-02-10 22:08:43 +01:00
TJ Hoplock
6ebfbd2d54 chore!: adopt log/slog, remove go-kit/log
For: #14355

This commit updates Prometheus to adopt stdlib's log/slog package in
favor of go-kit/log. As part of converting to use slog, several other
related changes are required to get prometheus working, including:
- removed unused logging util func `RateLimit()`
- forward ported the util/logging/Deduper logging by implementing a small custom slog.Handler that does the deduping before chaining log calls to the underlying real slog.Logger
- move some of the json file logging functionality to use prom/common package functionality
- refactored some of the new json file logging for scraping
- changes to promql.QueryLogger interface to swap out logging methods for relevant slog sugar wrappers
- updated lots of tests that used/replicated custom logging functionality, attempting to keep the logical goal of the tests consistent after the transition
- added a healthy amount of `if logger == nil { $makeLogger }` type conditional checks amongst various functions where none were provided -- old code that used the go-kit/log.Logger interface had several places where there were nil references when trying to use functions like `With()` to add keyvals on the new *slog.Logger type
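
A hedged, minimal sketch of such a deduping slog.Handler (the `dedupHandler` name, keying on the message text, and the time window are assumptions, not the actual util/logging/Deduper):

```
package main

import (
	"context"
	"log/slog"
	"os"
	"sync"
	"time"
)

// dedupState is shared so handlers derived via WithAttrs/WithGroup
// still deduplicate against the same history.
type dedupState struct {
	mu   sync.Mutex
	seen map[string]time.Time
}

type dedupHandler struct {
	next   slog.Handler
	window time.Duration
	state  *dedupState
}

func newDedupHandler(next slog.Handler, window time.Duration) *dedupHandler {
	return &dedupHandler{next: next, window: window, state: &dedupState{seen: map[string]time.Time{}}}
}

func (h *dedupHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.next.Enabled(ctx, l)
}

func (h *dedupHandler) Handle(ctx context.Context, r slog.Record) error {
	h.state.mu.Lock()
	last, ok := h.state.seen[r.Message]
	if ok && r.Time.Sub(last) < h.window {
		h.state.mu.Unlock()
		return nil // drop: same message seen within the window
	}
	h.state.seen[r.Message] = r.Time
	h.state.mu.Unlock()
	return h.next.Handle(ctx, r) // chain to the real handler
}

func (h *dedupHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &dedupHandler{next: h.next.WithAttrs(attrs), window: h.window, state: h.state}
}

func (h *dedupHandler) WithGroup(name string) slog.Handler {
	return &dedupHandler{next: h.next.WithGroup(name), window: h.window, state: h.state}
}

func main() {
	logger := slog.New(newDedupHandler(slog.NewTextHandler(os.Stderr, nil), 10*time.Second))
	logger.Info("repeated message") // logged
	logger.Info("repeated message") // deduped
}
```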

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
2024-10-07 15:58:50 -04:00
machine424
d23d196db5 fix(discovery): prevent the manager from storing stale targetGroups
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2024-08-30 14:39:31 +02:00
machine424
c586c15ae6 fix(discovery): make discovery manager notify consumers of dropped targets for still defined jobs
scrape/manager_test.go: add a test to check that the manager gets notified
for targets that got dropped by discovery. To reproduce: https://github.com/prometheus/prometheus/issues/12858#issuecomment-1732318102

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2024-08-28 17:39:02 +02:00
beorn7
0f760f63dd lint: Revamp our linting rules, mostly around doc comments
Several things done here:

- Set `max-issues-per-linter` to 0 so that we actually see all linter
  warnings and not just 50 per linter. (As we also set
  `max-same-issues` to 0, I assume this was the intention from the
  beginning.)

- Stop using the golangci-lint default excludes (by setting
  `exclude-use-default: false`). Those are too generous and don't match
  our style conventions. (I have re-added some of the excludes
  explicitly in this commit. See below.)

- Re-add the `errcheck` exclusion we have used so far via the
  defaults.

- Exclude the signature requirement `govet` has for `Seek` methods
  because we use non-standard `Seek` methods a lot. (But we keep other
  requirements, while the default excludes completely disabled the
  check for common method signatures.)

- Exclude warnings about missing doc comments on exported symbols. (We
  used to be pretty adamant about doc comments, but stopped that at
  some point in the past. By now, we have about 500 missing doc
  comments. We may consider reintroducing this check, but that's
  outside of the scope of this commit. The default excludes of
  golangci-lint essentially ignore doc comments completely.)

- By no longer using the default excludes, we now get warnings back on
  malformed doc comments. That's the most impactful change in this
  commit. It does not enforce doc comments (again), but _if_ there is
  a doc comment, it has to have the recommended form. (Most of the
  changes in this commit are fixing this form.)

- Improve wording/spelling of some comments in .golangci.yml, and
  remove an outdated comment.

- Leave `package-comments` inactive, but add a TODO asking if we
  should change that.

- Add a new sub-linter `comment-spacings` (and fix corresponding
  comments), which avoids missing spaces after the leading `//`.

Signed-off-by: beorn7 <beorn@grafana.com>
2024-08-22 17:36:11 +02:00
machine424
94d28cd6cf chore(notifier): add a reproducer for https://github.com/prometheus/prometheus/issues/13676
to show "targets groups update" starvation when the notifications queue is full and an Alertmanager
is down.

The existing `TestHangingNotifier` that was added in https://github.com/prometheus/prometheus/pull/10948 doesn't really reflect reality, as the SD changes are manually fed into `syncCh` in a continuous way, whereas in reality updates are only resent every `updatert`.

The test added here sets up an SD manager and links it to the notifier. The SD changes will be triggered by that manager as it's done in reality.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

Co-authored-by: Ethan Hunter <ehunter@hudson-trading.com>
2024-06-19 09:43:52 +02:00
David Ashpole
bbfc72b4e2
support unregistering discovery manager metrics (#13896)
Signed-off-by: David Ashpole <dashpole@google.com>
2024-04-05 16:19:07 +02:00
Paulin Todev
78411d5e8b
SD Managers taking over responsibility for registration of debug metrics (#13375)
SD Managers take over responsibility for SD metrics registration

---------

Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
Signed-off-by: Björn Rabenstein <github@rabenste.in>
Co-authored-by: Björn Rabenstein <github@rabenste.in>
2024-01-23 16:53:55 +01:00
Paulin Todev
6a5306a53c
Use const labels for Discovery Manager metrics.
Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
2023-12-11 11:14:27 +00:00
Paulin Todev
6de80d7fb0
Allow non-default registry to be used for metrics of SD components
Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
2023-12-11 11:14:26 +00:00
Oleksandr Redko
fa90ca46e5 ci(lint): enable godot; append dot at the end of comments
Signed-off-by: Oleksandr Redko <Oleksandr_Redko@epam.com>
2023-10-31 19:53:38 +02:00
Julien Pivotto
4b735f02a6
Merge pull request #10569 from zzJinux/discovery-manager-run
Fix discovery managers to be properly cancelled
2023-09-29 12:07:55 +02:00
Julien Pivotto
009017a3fb Revert "Remove deleted target from discovery manager"
Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
2023-08-14 23:29:39 +02:00
haleyao
c5a37ddad5 Remove deleted target from discovery manager
Signed-off-by: haleyao <haleyao@tencent.com>
2023-07-10 00:09:25 +08:00
Bryan Boreham
2f58be840d service discovery: add config name to log messages
This makes it easier to connect a log message with the config it relates
to.

Each SD config has a name, either the scrape job name or something like
"config-0" for Alertmanager config.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-01-12 11:30:00 +00:00
Sebastian Poxhofer
3f9a9d1e62
chore(discoveryManager): expose Discoverer refresh function (#10531)
Signed-off-by: secustor <sebastian@poxhofer.at>
2022-06-13 21:06:15 +02:00
Jinwook Jeong
c7c7847b6f Fix discovery managers to be properly cancelled
Signed-off-by: Jinwook Jeong <vustthat@gmail.com>
2022-04-09 01:12:46 +09:00
Robert Fratto
44a5e705be
discovery: Expose custom HTTP client options to discoverers (#10462)
* discovery: expose HTTP client options to discoverers

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* discovery/http: use HTTP client options for created client

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* scrape: use a list of HTTP client options instead of just dial context

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* discovery: rephrase comment

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
2022-03-24 18:16:59 -04:00
Julien Pivotto
9621c2c0cc
Fix race with targets update during ApplyConfig (#9656)
I ended up extending the lock so refTargets remains valid for the
duration of the update.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-11-05 01:13:04 +01:00
Vladimir Kononov
1043d2b594
Discovery: abstain from restarting providers if possible (#9321) (#9349)
* Abstain from restarting discovery providers if possible (#9321)

Signed-off-by: Vladimir Kononov <krya-kryak@users.noreply.github.com>
2021-10-20 10:16:20 +02:00
Julien Pivotto
432005826d
Add a feature flag to enable the new discovery manager (#9537)
* Add a feature flag to enable the new manager

This PR creates a copy of the legacy manager and uses it by default.

It is a companion PR to #9349. With this PR, users can enable the new
discovery manager and provide us with any feedback / side effects that
the new behaviour might have.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-10-20 10:15:54 +02:00
Levi Harrison
b5f6f8fb36 Switched to go-kit/log
Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-06-11 12:28:36 -04:00
Julien Pivotto
e1774b6f83 Fix the computation of prometheus_sd_discovered_targets
prometheus_sd_discovered_targets is wrongly calculated when there are
multiple SD configurations in place. One discovery manager can have
multiple groups coming from multiple service discoveries.

When multiple service discovery configs are used, we do not compute the
metric correctly, and instead just set the metric to one of the service
discoveries.
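
A hedged sketch of the corrected computation (the sample data and helper shapes are assumptions; only the GaugeVec usage reflects the real client_golang API): the gauge for each config sums targets across all of its groups instead of being overwritten by whichever service discovery was processed last.

```
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

var discoveredTargets = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "prometheus_sd_discovered_targets",
		Help: "Current number of discovered targets.",
	},
	[]string{"name"},
)

func main() {
	// config name -> sizes of the target groups discovered for it,
	// possibly coming from several service discoveries (assumed data).
	groups := map[string][]int{"job-a": {3, 2}, "job-b": {5}}
	for cfg, sizes := range groups {
		total := 0
		for _, n := range sizes {
			total += n // sum across all groups; don't just keep the last one
		}
		discoveredTargets.WithLabelValues(cfg).Set(float64(total))
	}
	fmt.Println("gauge updated per config")
}
```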

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-05-14 22:38:37 +02:00
Andy Bursavich
4e6a94a27d
Invert service discovery dependencies (#7701)
This also fixes a bug in query_log_file, which now is relative to the config file like all other paths.

Signed-off-by: Andy Bursavich <abursavich@gmail.com>
2020-08-20 13:48:26 +01:00
Julien Pivotto
59de58d380
Docker Swarm service discovery (#7420)
* Docker Swarm service discovery

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-06-26 12:25:58 +02:00
Julien Pivotto
c61141ce51
Add DigitalOcean service discovery (#7407)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-06-18 17:04:41 +02:00
Marek Slabicki
8224ddec23
Capitalizing first letter of all log lines (#7043)
Signed-off-by: Marek Slabicki <thaniri@gmail.com>
2020-04-11 09:22:18 +01:00
Julien Pivotto
c67f81937c
discovery: updateGroup should not create targets[poolKey] in the loop (#6903)
We can assume that not all target groups are nil in normal scenarios,
so we can create targets[poolKey] outside the loop.
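
A small runnable sketch of that hoisting (simplified types; `poolKey` and the map shape are modeled on the commit message, not copied from the source):

```
package main

import "fmt"

type group struct{ source string }

// updateGroup creates targets[poolKey] once, before the loop, instead of
// re-checking for it on every iteration.
func updateGroup(targets map[string]map[string]*group, poolKey string, tgs []*group) {
	if _, ok := targets[poolKey]; !ok {
		targets[poolKey] = map[string]*group{}
	}
	for _, tg := range tgs {
		if tg == nil { // nil groups are the exception, not the rule
			continue
		}
		targets[poolKey][tg.source] = tg
	}
}

func main() {
	targets := map[string]map[string]*group{}
	updateGroup(targets, "pool-1", []*group{{source: "a"}, nil, {source: "b"}})
	fmt.Println(len(targets["pool-1"])) // 2
}
```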

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-03-02 07:35:02 +00:00
johncming
17683d074c discovery: fix bug: use RLock for read. (#5928)
Signed-off-by: johncming <johncming@yahoo.com>
2020-01-22 09:57:37 +00:00
Nevill
55661ab004 Set failedConfigs only once right after registerProviders finished
Signed-off-by: Nevill <nevill.dutt@gmail.com>
2019-09-24 09:15:40 +08:00
Nevill
048f81218d Change prometheus_sd_configs_failed_total to Gauge
Signed-off-by: Nevill <nevill.dutt@gmail.com>
2019-09-16 10:38:43 +08:00
Harkishen Singh
d98d4a9bf0 remove resetting of manager properties and init manager props under locking (#5979)
Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>
2019-09-06 12:46:24 +02:00
Matt Layher
302148fd69 *: apply gofmt -s
Signed-off-by: Matt Layher <mdlayher@gmail.com>
2019-01-16 17:28:14 -05:00
Ilya Gladyshev
922c17e119 added name label to all discovery metrics (#5002)
Signed-off-by: Ilya Gladyshev <ilya.v.gladyshev@gmail.com>
2018-12-20 14:47:29 +00:00
Simon Pasquier
8b91d39c43
discovery: send empty group on empty SD config (#4819)
* discovery: send empty group on blank SD config

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Update comments

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Add another comment

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-11-30 17:59:57 +01:00
Simon Pasquier
a30348f1a4 discovery: add config label to discovered targets metric (#4753)
* discovery: add labels to discovered targets metric

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-10-18 16:46:59 +01:00
Goutham Veeramachaneni
ffb7f829ec
Merge pull request #4730 from prometheus/release-2.4
Release 2.4
2018-10-12 14:15:42 -07:00
Simon Pasquier
657199af22 Address Krasi comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-28 12:29:24 +02:00
Simon Pasquier
5df757fdd4 zookeeper: fix panic
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-28 11:39:40 +02:00
Simon Pasquier
365931ea83 discovery: add metrics + send updates from one goroutine only
The added metrics are:

* prometheus_sd_discovered_targets
* prometheus_sd_received_updates_total
* prometheus_sd_updates_delayed_total
* prometheus_sd_updates_total

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-27 15:59:42 +02:00
Simon Pasquier
48989d8996 discovery: add more tests
Co-authored-by: Camille Janicki <camille.janicki@gmail.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-12 16:13:15 +02:00
Krasi Georgiev
ba7eb733e8 tidy up the discovery logs, updating loops and selects (#4556)
* tidy up the discovery logs, updating loops and selects

a few object renamings

removed a very noisy debug log on the k8s discovery. It would be useful
to show some summary rather than every update, as that is impossible to
follow.

added most comments as debug logs so each block becomes self-explanatory.

when the discovery receiving channel is full, it will retry again on the
next cycle.

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

* add noop logger for the SD manager tests.

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

* spelling nits

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-09-05 17:02:47 +05:30
Simon Pasquier
674c76adb8 discovery: coalesce identical SD configurations (#3912)
* discovery: coalesce identical SD configurations

Instead of creating as many SD providers as declared in the
configuration, the discovery manager merges identical configurations
into the same provider and keeps track of the subscribers. When
the manager receives target updates from a SD provider, it will
broadcast the updates to all interested subscribers.
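
A hedged, runnable sketch of that coalescing (string-keyed configs and job-name subscribers are simplifications, not the manager's real types):

```
package main

import "fmt"

type provider struct {
	config string   // canonical form of the SD configuration
	subs   []string // subscribing scrape jobs; updates are broadcast to all of them
}

// registerProviders reuses an existing provider whenever a job declares a
// configuration identical to one already seen.
func registerProviders(jobs map[string]string) []*provider {
	byConfig := map[string]*provider{}
	var providers []*provider
	for job, cfg := range jobs {
		if p, ok := byConfig[cfg]; ok {
			p.subs = append(p.subs, job) // coalesce: subscribe to the existing provider
			continue
		}
		p := &provider{config: cfg, subs: []string{job}}
		byConfig[cfg] = p
		providers = append(providers, p)
	}
	return providers
}

func main() {
	providers := registerProviders(map[string]string{
		"job-a": `dns: ["example.com"]`,
		"job-b": `dns: ["example.com"]`, // identical to job-a: coalesced
		"job-c": `file: ["targets.json"]`,
	})
	fmt.Println(len(providers), "providers for 3 jobs")
}
```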

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-01 08:51:31 +01:00
Krasi Georgiev
53691ae261 Simplify SD update throttling (#4523)
* simplified SD update throttling

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

* add a default case to catch ticks when we don't have new updates (sketched below).
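
A hedged sketch of that loop (names like `triggerSend` and the tick period are assumptions): updates are forwarded at most once per tick, and the inner default makes a tick with no new updates a no-op.

```
package main

import (
	"fmt"
	"time"
)

func main() {
	triggerSend := make(chan struct{}, 1) // set when discovery produced updates
	syncCh := make(chan string)
	stop := make(chan struct{})

	go func() { // throttled sender loop
		ticker := time.NewTicker(50 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				select {
				case <-triggerSend:
					syncCh <- "updated target groups"
				default:
					// no new updates since the last tick: nothing to do
				}
			}
		}
	}()

	triggerSend <- struct{}{} // simulate a discovery update arriving
	fmt.Println(<-syncCh)
	close(stop)
}
```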

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-08-27 17:12:11 +02:00
Paul Gier
d24d2acd11 config: set target group source index during unmarshalling (#4245)
* config: set target group source index during unmarshalling

Fixes issue #4214 where the scrape pool is unnecessarily reloaded for a
config reload where the config hasn't changed. Previously, the discovery
manager changed the static config after loading, which caused the in-memory
config to differ from a freshly reloaded config.
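
A hedged sketch of the idea with a simplified static config type (the real change lives in Prometheus's config types; gopkg.in/yaml.v2 matches the era of the commit): the Source index is assigned inside UnmarshalYAML, so a freshly reloaded config already equals the in-memory one.

```
package main

import (
	"fmt"

	yaml "gopkg.in/yaml.v2"
)

type Group struct {
	Targets []string `yaml:"targets"`
	Source  string   `yaml:"-"`
}

type StaticConfig []*Group

// UnmarshalYAML sets each group's Source index during unmarshalling
// instead of mutating the config after it has been loaded.
func (c *StaticConfig) UnmarshalYAML(unmarshal func(interface{}) error) error {
	type plain StaticConfig
	if err := unmarshal((*plain)(c)); err != nil {
		return err
	}
	for i, g := range *c {
		g.Source = fmt.Sprintf("%d", i)
	}
	return nil
}

func main() {
	var c StaticConfig
	if err := yaml.Unmarshal([]byte("- targets: ['localhost:9090']"), &c); err != nil {
		panic(err)
	}
	fmt.Println(c[0].Source, c[0].Targets) // 0 [localhost:9090]
}
```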

Signed-off-by: Paul Gier <pgier@redhat.com>

* [issue #4214] Test that static targets are not modified by discovery manager

Signed-off-by: Paul Gier <pgier@redhat.com>
2018-06-13 16:34:59 +01:00