Commit Graph

77 Commits

Yandi Lee
8eb445b8a4
Discovery.Manager: close sync ch after sender() is stopped (#14465)
* close sync ch after sender() is stopped
* break if chan is closed
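
As a minimal, runnable sketch of the ordering those two bullets describe (the channel and function names here are assumptions, not the actual Manager fields): the sync channel is closed only after the sender goroutine has returned, and the receiver treats a closed channel as its stop signal.

```
package main

import "fmt"

func main() {
	syncCh := make(chan int)      // stand-in for the manager's sync channel
	stopCh := make(chan struct{}) // stand-in for the manager's context
	senderDone := make(chan struct{})
	receiverDone := make(chan struct{})

	// sender(): the only goroutine that writes to syncCh.
	go func() {
		defer close(senderDone)
		for i := 0; ; i++ {
			select {
			case <-stopCh:
				return
			case syncCh <- i:
			}
		}
	}()

	// receiver: break if the channel is closed.
	go func() {
		defer close(receiverDone)
		for v := range syncCh { // range exits once syncCh is closed
			fmt.Println("update", v)
		}
	}()

	close(stopCh)
	<-senderDone  // wait until sender() has stopped...
	close(syncCh) // ...so closing syncCh can't race with a send
	<-receiverDone
}
```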

Signed-off-by: liyandi <littlepangdi@163.com>
Co-authored-by: liyandi <liyandi@xiaomi.com>
2025-07-11 17:15:01 +01:00
machine424
020e803ee0 chore(discovery): remove unused StaticProvider struct; library users can easily define it on their side
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-09 17:10:13 +01:00
Lukasz Mierzwa
b49d143595 Fix a race in discovery manager ApplyConfig & shutdown
If we call ApplyConfig() at the same time the manager is being stopped we might end up hanging forever.
This is because ApplyConfig() will try to cancel obsolete providers and wait until they are cancelled.
It's done by setting a done() function that calls Done() on a sync.WaitGroup:

```
if len(prov.newSubs) == 0 {
	wg.Add(1)
	prov.done = func() {
		wg.Done()
	}
}
```

then calling prov.cancel() and finally waiting until every provider has run its done() function, by blocking on a wg.Wait() call.

For each provider there is a goroutine created by calling Manager.startProvider(*Provider):

```
func (m *Manager) startProvider(ctx context.Context, p *Provider) {
	m.logger.Debug("Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
	ctx, cancel := context.WithCancel(ctx)
	updates := make(chan []*targetgroup.Group)

	p.mu.Lock()
	p.cancel = cancel
	p.mu.Unlock()

	go p.d.Run(ctx, updates)
	go m.updater(ctx, p, updates)
}
```

It creates a context that can be cancelled and that cancel function becomes prov.cancel. This is what ApplyConfig will call.
If we look at the body of updater() method:

```
func (m *Manager) updater(ctx context.Context, p *Provider, updates chan []*targetgroup.Group) {
	// Ensure targets from this provider are cleaned up.
	defer m.cleaner(p)
	for {
		select {
		case <-ctx.Done():
			return
[...]
```

we can see that it will exit if that context is cancelled and that will trigger a call to Manager.cleaner().
That cleaner() is where done() is called.
So ApplyConfig() -> calls cancel() -> causes cleaner() to be executed -> calls done().

cancel() is also called from cancelDiscoverers() method that will be called by Manager.Run() when Manager is stopping:

```
func (m *Manager) Run() error {
	go m.sender()
	<-m.ctx.Done()
	m.cancelDiscoverers()
	return m.ctx.Err()
}
```

The problem is that if we call both ApplyConfig and stop the manager at the same time we might end up with:

- We call Manager.ApplyConfig()
- We stop the Manager
- Manager.cancelDiscoverers() is called
- Provider.cancel() is called for every Provider
- cancel() causes provider context to be cancelled which terminates updater() for given Provider
- cancelling context causes cleaner() method to be called for given Provider
- cleaner() calls done() and exits
- Provider is considered stopped at this point, there is no goroutine running that will call done() anymore
- ApplyConfig iterates providers and decides that one is obsolete and must be stopped
- It sets a custom done() function body with a WaitGroup.Done() call in it
- Then ApplyConfig waits until all Providers run done()
- But they are all stopped and no done() will be run
- We wait forever

This only happens if cancelDiscoverers() runs before ApplyConfig(): if ApplyConfig() runs first, done() will be called;
if cancelDiscoverers() runs first, it will stop the updater() instances, and so done() will never be called.

Part of the problem is that there is no distinction between running and stopped providers. There is a Provider.IsStarted() method
that returns a bool based on the value of the cancel function, but ApplyConfig doesn't check it.
The second problem is that although there is a mutex on a Provider, it's not used much in the code, so two goroutines can try to read and/or write
provider.cancel and/or provider.done at the same time, making it all more likely to race.

The easiest way to fix it is to check if the provider is started inside ApplyConfig so we don't try to stop a provider that's already stopped.
For that we need to mark it as stopped after cancel() is called, by setting cancel to nil.
This also needs better lock usage to avoid different parts of the code trying to set cancel and done at the same time.
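
A hedged sketch of that fix (a simplified Provider and a hypothetical stop() helper, not the actual patch):

```
package main

import (
	"context"
	"fmt"
	"sync"
)

// Provider, reduced to the fields relevant for this sketch.
type Provider struct {
	mu     sync.Mutex
	cancel context.CancelFunc
	done   func()
}

// IsStarted reports whether the provider still has a running updater.
func (p *Provider) IsStarted() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.cancel != nil
}

// stop registers done, cancels the provider once, and marks it stopped
// by setting cancel to nil, all under the same lock.
func (p *Provider) stop(done func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.cancel == nil {
		return // already stopped: never register a done() nobody will run
	}
	p.done = done
	p.cancel()
	p.cancel = nil
}

func main() {
	_, cancel := context.WithCancel(context.Background())
	p := &Provider{cancel: cancel}
	var wg sync.WaitGroup
	if p.IsStarted() { // ApplyConfig only waits on providers still running
		wg.Add(1)
		p.stop(func() { wg.Done() })
	}
	p.done() // normally called by cleaner() once the updater exits
	wg.Wait()
	fmt.Println("stopped once, waited once")
}
```

ApplyConfig would then only wg.Add(1) for providers where IsStarted() reports true, so wg.Wait() can no longer block on a done() that will never run.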

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:03:10 +01:00
Lukasz Mierzwa
59761f631b Move m.targetsMtx.Lock down into the loop
Make sure the order of locks is always the same in all functions. In ApplyConfig() we take m.targetsMtx.Lock() after the provider is locked, so replicate the same ordering in allGroups().
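
A runnable toy sketch of that invariant (types simplified, names assumed): the provider lock always comes first, and the targets lock is taken inside the loop.

```
package main

import "sync"

type Provider struct {
	mu sync.RWMutex
}

type Manager struct {
	targetsMtx sync.Mutex
	providers  []*Provider
	targets    map[*Provider][]string
}

// allGroups mirrors the ordering ApplyConfig uses: provider lock first,
// targets lock second, scoped to one iteration of the loop.
func (m *Manager) allGroups() []string {
	var all []string
	for _, p := range m.providers {
		p.mu.RLock()        // 1: provider lock
		m.targetsMtx.Lock() // 2: targets lock, down inside the loop
		all = append(all, m.targets[p]...)
		m.targetsMtx.Unlock()
		p.mu.RUnlock()
	}
	return all
}

func main() {
	p := &Provider{}
	m := &Manager{providers: []*Provider{p}, targets: map[*Provider][]string{p: {"t1"}}}
	_ = m.allGroups()
}
```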

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-05-15 12:30:48 +01:00
Lukasz Mierzwa
7d55ee8cc8 Try fixing potential deadlocks in discovery
Manager.ApplyConfig() uses multiple locks:
- Provider.mu
- Manager.targetsMtx

Manager.cleaner() uses the same locks but in the opposite order:
- First it locks Manager.targetsMtx
- Then it locks Provider.mu

I've seen a few strange cases of Prometheus hanging on shutdown and never completing that shutdown.
From a few traces I was given it appears that while Prometheus is stuck only discovery.Manager and notifier.Manager are still running.
From those traces it also seems like they are stuck on a lock in two functions:
- cleaner waits on a RLock()
- ApplyConfig waits on a Lock()

I cannot reproduce it but I suspect this is a race between locks. Imagine this scenario:
- Manager.ApplyConfig() is called
- Manager.ApplyConfig() locks Provider.mu
- at the same time cleaner() is called on the same Provider instance and it calls Manager.targetsMtx.Lock()
- Manager.ApplyConfig() now calls Manager.targetsMtx.Lock(), but that lock is already held by the cleaner() function, so ApplyConfig() hangs there
- at the same time cleaner() now wants to lock Provider.mu.RLock(), but that lock is already held by Manager.ApplyConfig()
- we end up with both functions locking each other out without any way to break that deadlock

Re-order lock calls to try to avoid this scenario.
I tried writing a test case for it but couldn't hit this issue.
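
For illustration, a toy program (assumed names, artificial sleeps to widen the race window) that reproduces the interleaving above; when run, the Go runtime aborts with `fatal error: all goroutines are asleep - deadlock!`:

```
package main

import (
	"sync"
	"time"
)

var (
	providerMu sync.RWMutex // stands in for Provider.mu
	targetsMtx sync.Mutex   // stands in for Manager.targetsMtx
)

// applyConfig takes the provider lock first, then the targets lock.
func applyConfig() {
	providerMu.Lock()
	defer providerMu.Unlock()
	time.Sleep(100 * time.Millisecond)
	targetsMtx.Lock() // blocks: held by cleaner()
	defer targetsMtx.Unlock()
}

// cleaner takes the locks in the opposite order: targets lock first.
func cleaner() {
	targetsMtx.Lock()
	defer targetsMtx.Unlock()
	time.Sleep(100 * time.Millisecond)
	providerMu.RLock() // blocks: held by applyConfig()
	defer providerMu.RUnlock()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); applyConfig() }()
	go func() { defer wg.Done(); cleaner() }()
	wg.Wait()
}
```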

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-05-12 09:13:46 +01:00
Matthieu MOREL
b472ce7010 chore: enable early-return from revive
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-02-10 22:08:43 +01:00
TJ Hoplock
6ebfbd2d54 chore!: adopt log/slog, remove go-kit/log
For: #14355

This commit updates Prometheus to adopt stdlib's log/slog package in
favor of go-kit/log. As part of converting to use slog, several other
related changes are required to get prometheus working, including:
- removed unused logging util func `RateLimit()`
- forward ported the util/logging/Deduper logging by implementing a small custom slog.Handler that does the deduping before chaining log calls to the underlying real slog.Logger
- move some of the json file logging functionality to use prom/common package functionality
- refactored some of the new json file logging for scraping
- changes to promql.QueryLogger interface to swap out logging methods for relevant slog sugar wrappers
- updated lots of tests that used/replicated custom logging functionality, attempting to keep the logical goal of the tests consistent after the transition
- added a healthy amount of `if logger == nil { $makeLogger }` type conditional checks amongst various functions where none were provided -- old code that used the go-kit/log.Logger interface had several places where there were nil references when trying to use functions like `With()` to add keyvals on the new *slog.Logger type
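
A hedged, minimal sketch of such a deduping slog.Handler (the `dedupHandler` name, keying on the message text, and the time window are assumptions, not the actual util/logging/Deduper):

```
package main

import (
	"context"
	"log/slog"
	"os"
	"sync"
	"time"
)

// dedupState is shared so handlers derived via WithAttrs/WithGroup
// still deduplicate against the same history.
type dedupState struct {
	mu   sync.Mutex
	seen map[string]time.Time
}

type dedupHandler struct {
	next   slog.Handler
	window time.Duration
	state  *dedupState
}

func newDedupHandler(next slog.Handler, window time.Duration) *dedupHandler {
	return &dedupHandler{next: next, window: window, state: &dedupState{seen: map[string]time.Time{}}}
}

func (h *dedupHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.next.Enabled(ctx, l)
}

func (h *dedupHandler) Handle(ctx context.Context, r slog.Record) error {
	h.state.mu.Lock()
	last, ok := h.state.seen[r.Message]
	if ok && r.Time.Sub(last) < h.window {
		h.state.mu.Unlock()
		return nil // drop: same message seen within the window
	}
	h.state.seen[r.Message] = r.Time
	h.state.mu.Unlock()
	return h.next.Handle(ctx, r) // chain to the real handler
}

func (h *dedupHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &dedupHandler{next: h.next.WithAttrs(attrs), window: h.window, state: h.state}
}

func (h *dedupHandler) WithGroup(name string) slog.Handler {
	return &dedupHandler{next: h.next.WithGroup(name), window: h.window, state: h.state}
}

func main() {
	logger := slog.New(newDedupHandler(slog.NewTextHandler(os.Stderr, nil), 10*time.Second))
	logger.Info("repeated message") // logged
	logger.Info("repeated message") // deduped
}
```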

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
2024-10-07 15:58:50 -04:00
machine424
d23d196db5 fix(discovery): prevent the manager from storing stale targetGroups
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2024-08-30 14:39:31 +02:00
machine424
c586c15ae6 fix(discovery): make discovery manager notify consumers of dropped targets for still defined jobs
scrape/manager_test.go: add a test to check that the manager gets notified
for targets that got dropped by discovery. To reproduce: https://github.com/prometheus/prometheus/issues/12858#issuecomment-1732318102

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2024-08-28 17:39:02 +02:00
beorn7
0f760f63dd lint: Revamp our linting rules, mostly around doc comments
Several things done here:

- Set `max-issues-per-linter` to 0 so that we actually see all linter
  warnings and not just 50 per linter. (As we also set
  `max-same-issues` to 0, I assume this was the intention from the
  beginning.)

- Stop using the golangci-lint default excludes (by setting
  `exclude-use-default: false`). Those are too generous and don't match
  our style conventions. (I have re-added some of the excludes
  explicitly in this commit. See below.)

- Re-add the `errcheck` exclusion we have used so far via the
  defaults.

- Exclude the signature requirement `govet` has for `Seek` methods
  because we use non-standard `Seek` methods a lot. (But we keep other
  requirements, while the default excludes completely disabled the
  check for common method signatures.)

- Exclude warnings about missing doc comments on exported symbols. (We
  used to be pretty adamant about doc comments, but stopped that at
  some point in the past. By now, we have about 500 missing doc
  comments. We may consider reintroducing this check, but that's
  outside of the scope of this commit. The default excludes of
  golangci-lint essentially ignore doc comments completely.)

- By no longer using the default excludes, we now get warnings back on
  malformed doc comments. That's the most impactful change in this
  commit. It does not enforce doc comments (again), but _if_ there is
  a doc comment, it has to have the recommended form. (Most of the
  changes in this commit are fixing this form.)

- Improve wording/spelling of some comments in .golangci.yml, and
  remove an outdated comment.

- Leave `package-comments` inactive, but add a TODO asking if we
  should change that.

- Add a new sub-linter `comment-spacings` (and fix corresponding
  comments), which avoids missing spaces after the leading `//`.

Signed-off-by: beorn7 <beorn@grafana.com>
2024-08-22 17:36:11 +02:00
machine424
94d28cd6cf chore(notifier): add a reproducer for https://github.com/prometheus/prometheus/issues/13676
to show "targets groups update" starvation when the notifications queue is full and an Alertmanager
is down.

The existing `TestHangingNotifier` that was added in https://github.com/prometheus/prometheus/pull/10948 doesn't really reflect reality, as the SD changes are manually fed into `syncCh` in a continuous way, whereas in reality updates are only resent every `updatert`.

The test added here sets up an SD manager and links it to the notifier. The SD changes will be triggered by that manager as it's done in reality.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

Co-authored-by: Ethan Hunter <ehunter@hudson-trading.com>
2024-06-19 09:43:52 +02:00
David Ashpole
bbfc72b4e2
support unregistering discovery manager metrics (#13896)
Signed-off-by: David Ashpole <dashpole@google.com>
2024-04-05 16:19:07 +02:00
Paulin Todev
78411d5e8b
SD Managers taking over responsibility for registration of debug metrics (#13375)
SD Managers take over responsibility for SD metrics registration

---------

Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
Signed-off-by: Björn Rabenstein <github@rabenste.in>
Co-authored-by: Björn Rabenstein <github@rabenste.in>
2024-01-23 16:53:55 +01:00
Paulin Todev
6a5306a53c
Use const labels for Discovery Manager metrics.
Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
2023-12-11 11:14:27 +00:00
Paulin Todev
6de80d7fb0
Allow non-default registry to be used for metrics of SD components
Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
2023-12-11 11:14:26 +00:00
Oleksandr Redko
fa90ca46e5 ci(lint): enable godot; append dot at the end of comments
Signed-off-by: Oleksandr Redko <Oleksandr_Redko@epam.com>
2023-10-31 19:53:38 +02:00
Julien Pivotto
4b735f02a6
Merge pull request #10569 from zzJinux/discovery-manager-run
Fix discovery managers to be properly cancelled
2023-09-29 12:07:55 +02:00
Julien Pivotto
009017a3fb Revert "Remove deleted target from discovery manager"
Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
2023-08-14 23:29:39 +02:00
haleyao
c5a37ddad5 Remove deleted target from discovery manager
Signed-off-by: haleyao <haleyao@tencent.com>
2023-07-10 00:09:25 +08:00
Bryan Boreham
2f58be840d service discovery: add config name to log messages
This makes it easier to connect a log message with the config it relates
to.

Each SD config has a name, either the scrape job name or something like
"config-0" for Alertmanager config.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-01-12 11:30:00 +00:00
Sebastian Poxhofer
3f9a9d1e62
chore(discoveryManager): expose Discoverer refresh function (#10531)
Signed-off-by: secustor <sebastian@poxhofer.at>
2022-06-13 21:06:15 +02:00
Jinwook Jeong
c7c7847b6f Fix discovery managers to be properly cancelled
Signed-off-by: Jinwook Jeong <vustthat@gmail.com>
2022-04-09 01:12:46 +09:00
Robert Fratto
44a5e705be
discovery: Expose custom HTTP client options to discoverers (#10462)
* discovery: expose HTTP client options to discoverers

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* discovery/http: use HTTP client options for created client

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* scrape: use a list of HTTP client options instead of just dial context

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* discovery: rephrase comment

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
2022-03-24 18:16:59 -04:00
Julien Pivotto
9621c2c0cc
Fix race with targets update during ApplyConfig (#9656)
I ended up extending the lock so refTargets remains valid for the
duration of the update.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-11-05 01:13:04 +01:00
Vladimir Kononov
1043d2b594
Discovery: abstain from restarting providers if possible (#9321) (#9349)
* Abstain from restarting discovery providers if possible (#9321)

Signed-off-by: Vladimir Kononov <krya-kryak@users.noreply.github.com>
2021-10-20 10:16:20 +02:00
Julien Pivotto
432005826d
Add a feature flag to enable the new discovery manager (#9537)
* Add a feature flag to enable the new manager

This PR creates a copy of the legacy manager and uses it by default.

It is a companion PR to #9349. With this PR, users can enable the new
discovery manager and provide us with any feedback / side effects that
the new behaviour might have.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-10-20 10:15:54 +02:00
Levi Harrison
b5f6f8fb36 Switched to go-kit/log
Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-06-11 12:28:36 -04:00
Julien Pivotto
e1774b6f83 Fix the computation of prometheus_sd_discovered_targets
prometheus_sd_discovered_targets is wrongly calculated when there are
multiple SD configurations in place. One discovery manager can have
multiple groups coming from multiple service discoveries.

When multiple service discovery configs are used, we do not compute the
metric correctly, and instead just set the metric to one of the service
discoveries.
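
A hedged sketch of the corrected computation (the sample data and helper shapes are assumptions; only the GaugeVec usage reflects the real client_golang API): the gauge for each config sums targets across all of its groups instead of being overwritten by whichever service discovery was processed last.

```
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

var discoveredTargets = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "prometheus_sd_discovered_targets",
		Help: "Current number of discovered targets.",
	},
	[]string{"name"},
)

func main() {
	// config name -> sizes of the target groups discovered for it,
	// possibly coming from several service discoveries (assumed data).
	groups := map[string][]int{"job-a": {3, 2}, "job-b": {5}}
	for cfg, sizes := range groups {
		total := 0
		for _, n := range sizes {
			total += n // sum across all groups; don't just keep the last one
		}
		discoveredTargets.WithLabelValues(cfg).Set(float64(total))
	}
	fmt.Println("gauge updated per config")
}
```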

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-05-14 22:38:37 +02:00
Andy Bursavich
4e6a94a27d
Invert service discovery dependencies (#7701)
This also fixes a bug in query_log_file, which now is relative to the config file like all other paths.

Signed-off-by: Andy Bursavich <abursavich@gmail.com>
2020-08-20 13:48:26 +01:00
Julien Pivotto
59de58d380
Docker Swarm service discovery (#7420)
* Docker Swarm service discovery

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-06-26 12:25:58 +02:00
Julien Pivotto
c61141ce51
Add DigitalOcean service discovery (#7407)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-06-18 17:04:41 +02:00
Marek Slabicki
8224ddec23
Capitalizing first letter of all log lines (#7043)
Signed-off-by: Marek Slabicki <thaniri@gmail.com>
2020-04-11 09:22:18 +01:00
Julien Pivotto
c67f81937c
discovery: updateGroup should not create targets[poolKey] in the loop (#6903)
We can assume that not all target groups are nil in normal scenarios,
so we can create targets[poolKey] outside the loop.
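
A small runnable sketch of that hoisting (simplified types; `poolKey` and the map shape are modeled on the commit message, not copied from the source):

```
package main

import "fmt"

type group struct{ source string }

// updateGroup creates targets[poolKey] once, before the loop, instead of
// re-checking for it on every iteration.
func updateGroup(targets map[string]map[string]*group, poolKey string, tgs []*group) {
	if _, ok := targets[poolKey]; !ok {
		targets[poolKey] = map[string]*group{}
	}
	for _, tg := range tgs {
		if tg == nil { // nil groups are the exception, not the rule
			continue
		}
		targets[poolKey][tg.source] = tg
	}
}

func main() {
	targets := map[string]map[string]*group{}
	updateGroup(targets, "pool-1", []*group{{source: "a"}, nil, {source: "b"}})
	fmt.Println(len(targets["pool-1"])) // 2
}
```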

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-03-02 07:35:02 +00:00
johncming
17683d074c discovery: fix bug: use RLock for read. (#5928)
Signed-off-by: johncming <johncming@yahoo.com>
2020-01-22 09:57:37 +00:00
Nevill
55661ab004 Set failedConfigs only once right after registerProviders finished
Signed-off-by: Nevill <nevill.dutt@gmail.com>
2019-09-24 09:15:40 +08:00
Nevill
048f81218d Change prometheus_sd_configs_failed_total to Gauge
Signed-off-by: Nevill <nevill.dutt@gmail.com>
2019-09-16 10:38:43 +08:00
Harkishen Singh
d98d4a9bf0 remove resetting of manager properties and init manager props under locking (#5979)
Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>
2019-09-06 12:46:24 +02:00
Matt Layher
302148fd69 *: apply gofmt -s
Signed-off-by: Matt Layher <mdlayher@gmail.com>
2019-01-16 17:28:14 -05:00
Ilya Gladyshev
922c17e119 added name label to all discovery metrics (#5002)
Signed-off-by: Ilya Gladyshev <ilya.v.gladyshev@gmail.com>
2018-12-20 14:47:29 +00:00
Simon Pasquier
8b91d39c43
discovery: send empty group on empty SD config (#4819)
* discovery: send empty group on blank SD config

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Update comments

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Add another comment

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-11-30 17:59:57 +01:00
Simon Pasquier
a30348f1a4 discovery: add config label to discovered targets metric (#4753)
* discovery: add labels to discovered targets metric

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-10-18 16:46:59 +01:00
Goutham Veeramachaneni
ffb7f829ec
Merge pull request #4730 from prometheus/release-2.4
Release 2.4
2018-10-12 14:15:42 -07:00
Simon Pasquier
657199af22 Address Krasi comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-28 12:29:24 +02:00
Simon Pasquier
5df757fdd4 zookeeper: fix panic
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-28 11:39:40 +02:00
Simon Pasquier
365931ea83 discovery: add metrics + send updates from one goroutine only
The added metrics are:

* prometheus_sd_discovered_targets
* prometheus_sd_received_updates_total
* prometheus_sd_updates_delayed_total
* prometheus_sd_updates_total

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-27 15:59:42 +02:00
Simon Pasquier
48989d8996 discovery: add more tests
Co-authored-by: Camille Janicki <camille.janicki@gmail.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-12 16:13:15 +02:00
Krasi Georgiev
ba7eb733e8 tidy up the discovery logs, updating loops and selects (#4556)
* tidy up the discovery logs, updating loops and selects

a few object renamings

removed a very noisy debug log on the k8s discovery. It would be useful
to show some summary rather than every update, as that is impossible to
follow.

added most comments as debug logs so each block becomes self-explanatory.

when the discovery receiving channel is full, it will retry again on the
next cycle.

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

* add noop logger for the SD manager tests.

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

* spelling nits

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-09-05 17:02:47 +05:30
Simon Pasquier
674c76adb8 discovery: coalesce identical SD configurations (#3912)
* discovery: coalesce identical SD configurations

Instead of creating as many SD providers as declared in the
configuration, the discovery manager merges identical configurations
into the same provider and keeps track of the subscribers. When
the manager receives target updates from a SD provider, it will
broadcast the updates to all interested subscribers.
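
A hedged, runnable sketch of that coalescing (string-keyed configs and job-name subscribers are simplifications, not the manager's real types):

```
package main

import "fmt"

type provider struct {
	config string   // canonical form of the SD configuration
	subs   []string // subscribing scrape jobs; updates are broadcast to all of them
}

// registerProviders reuses an existing provider whenever a job declares a
// configuration identical to one already seen.
func registerProviders(jobs map[string]string) []*provider {
	byConfig := map[string]*provider{}
	var providers []*provider
	for job, cfg := range jobs {
		if p, ok := byConfig[cfg]; ok {
			p.subs = append(p.subs, job) // coalesce: subscribe to the existing provider
			continue
		}
		p := &provider{config: cfg, subs: []string{job}}
		byConfig[cfg] = p
		providers = append(providers, p)
	}
	return providers
}

func main() {
	providers := registerProviders(map[string]string{
		"job-a": `dns: ["example.com"]`,
		"job-b": `dns: ["example.com"]`, // identical to job-a: coalesced
		"job-c": `file: ["targets.json"]`,
	})
	fmt.Println(len(providers), "providers for 3 jobs")
}
```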

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-01 08:51:31 +01:00
Krasi Georgiev
53691ae261 Simplify SD update throttling (#4523)
* simplified SD update throttling

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

* add a default case to catch ticks when we don't have new updates (sketched below).
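
A hedged sketch of that loop (names like `triggerSend` and the tick period are assumptions): updates are forwarded at most once per tick, and the inner default makes a tick with no new updates a no-op.

```
package main

import (
	"fmt"
	"time"
)

func main() {
	triggerSend := make(chan struct{}, 1) // set when discovery produced updates
	syncCh := make(chan string)
	stop := make(chan struct{})

	go func() { // throttled sender loop
		ticker := time.NewTicker(50 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				select {
				case <-triggerSend:
					syncCh <- "updated target groups"
				default:
					// no new updates since the last tick: nothing to do
				}
			}
		}
	}()

	triggerSend <- struct{}{} // simulate a discovery update arriving
	fmt.Println(<-syncCh)
	close(stop)
}
```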

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-08-27 17:12:11 +02:00
Paul Gier
d24d2acd11 config: set target group source index during unmarshalling (#4245)
* config: set target group source index during unmarshalling

Fixes issue #4214 where the scrape pool is unnecessarily reloaded for a
config reload where the config hasn't changed. Previously, the discovery
manager changed the static config after loading, which caused the in-memory
config to differ from a freshly reloaded config.
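
A hedged sketch of the idea with a simplified static config type (the real change lives in Prometheus's config types; gopkg.in/yaml.v2 matches the era of the commit): the Source index is assigned inside UnmarshalYAML, so a freshly reloaded config already equals the in-memory one.

```
package main

import (
	"fmt"

	yaml "gopkg.in/yaml.v2"
)

type Group struct {
	Targets []string `yaml:"targets"`
	Source  string   `yaml:"-"`
}

type StaticConfig []*Group

// UnmarshalYAML sets each group's Source index during unmarshalling
// instead of mutating the config after it has been loaded.
func (c *StaticConfig) UnmarshalYAML(unmarshal func(interface{}) error) error {
	type plain StaticConfig
	if err := unmarshal((*plain)(c)); err != nil {
		return err
	}
	for i, g := range *c {
		g.Source = fmt.Sprintf("%d", i)
	}
	return nil
}

func main() {
	var c StaticConfig
	if err := yaml.Unmarshal([]byte("- targets: ['localhost:9090']"), &c); err != nil {
		panic(err)
	}
	fmt.Println(c[0].Source, c[0].Targets) // 0 [localhost:9090]
}
```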

Signed-off-by: Paul Gier <pgier@redhat.com>

* [issue #4214] Test that static targets are not modified by discovery manager

Signed-off-by: Paul Gier <pgier@redhat.com>
2018-06-13 16:34:59 +01:00