Commit Graph

816 Commits

Author SHA1 Message Date
Ayoub Mrini
9dc274687b
Merge pull request #16831 from machine424/nsmeta
feat(discovery/kubernetes): allow attaching namespace metadata
2025-07-17 10:30:27 +01:00
machine424
a9f6fdd910
feat(discovery/kubernetes): allow attaching namespace metadata
to ingress and service roles.

with the help of claude-4-sonnet

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-17 09:53:16 +02:00
Yandi Lee
8eb445b8a4
Discovery.Manager: close sync ch after sender() is stopped (#14465)
* close sync ch after sender() is stopped
* break if chan is closed
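Roughly the ordering this fix enforces, as a self-contained sketch (syncCh, senderDone and the update payloads are illustrative names, not the manager's real fields): the channel is closed only after the sending goroutine has returned, and the receiver breaks when the channel is closed.

```
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	syncCh := make(chan string) // stands in for the manager's sync channel
	senderDone := make(chan struct{})

	go func() { // sender(): stops when the manager context is cancelled
		defer close(senderDone)
		for i := 0; ; i++ {
			select {
			case <-ctx.Done():
				return
			case syncCh <- fmt.Sprintf("update-%d", i):
			}
		}
	}()

	go func() { // receiver: break if chan is closed
		for {
			if _, ok := <-syncCh; !ok {
				return
			}
		}
	}()

	cancel()      // the manager is being stopped
	<-senderDone  // wait until sender() has returned...
	close(syncCh) // ...so the close cannot race with a send
}
```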

Signed-off-by: liyandi <littlepangdi@163.com>
Co-authored-by: liyandi <liyandi@xiaomi.com>
2025-07-11 17:15:01 +01:00
machine424
020e803ee0 chore(discovery): remove unused StaticProvider struct; library users can easily define it on their side
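Defining it on the library user's side is only a few lines; a minimal sketch implementing the discovery.Discoverer interface (the removed struct may have looked slightly different):

```
package mysd // sketch; lives in the library user's code

import (
	"context"

	"github.com/prometheus/prometheus/discovery/targetgroup"
)

// StaticProvider sends a fixed set of target groups once, then waits
// for cancellation.
type StaticProvider struct {
	TargetGroups []*targetgroup.Group
}

// Run implements the discovery.Discoverer interface.
func (sd StaticProvider) Run(ctx context.Context, ch chan<- []*targetgroup.Group) {
	// The consumer may already be gone, so don't block on the send alone.
	select {
	case ch <- sd.TargetGroups:
	case <-ctx.Done():
	}
	<-ctx.Done()
}
```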
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-09 17:10:13 +01:00
chenlujjj
a2735494e1
chore: complete error message in RegisterSDMetrics function (#14635)
Signed-off-by: chenlujjj <953546398@qq.com>
2025-07-08 12:05:24 +00:00
machine424
c2d6e528e4
feat(discovery/kubernetes): allow attaching namespace metadata
to endpointslice, endpoints and pod roles

after injecting the labels for endpointslice, claude-4-sonnet
helped transpose the code and tests to endpoints and pod roles

fixes https://github.com/prometheus/prometheus/issues/9510
supersedes https://github.com/prometheus/prometheus/pull/13798

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Co-authored-by: Paul BARRIE <paul.barrie.calmels@gmail.com>
2025-07-03 19:41:08 +02:00
Lukasz Mierzwa
b49d143595 Fix a race in discovery manager ApplyConfig & shutdown
If we call ApplyConfig() at the same time the manager is being stopped, we might end up hanging forever.
This is because ApplyConfig() will try to cancel obsolete providers and wait until they are cancelled.
It's done by setting a done() function that calls Done() on a sync.WaitGroup:

```
if len(prov.newSubs) == 0 {
	wg.Add(1)
	prov.done = func() {
		wg.Done()
	}
}
```

then calling prov.cancel() and finally waiting until all providers have run their done() function,
by blocking on a wg.Wait() call.

For each provider there is a goroutine created by calling Manager.startProvider(*Provider):

```
func (m *Manager) startProvider(ctx context.Context, p *Provider) {
	m.logger.Debug("Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
	ctx, cancel := context.WithCancel(ctx)
	updates := make(chan []*targetgroup.Group)

	p.mu.Lock()
	p.cancel = cancel
	p.mu.Unlock()

	go p.d.Run(ctx, updates)
	go m.updater(ctx, p, updates)
}
```

It creates a context that can be cancelled and that cancel function becomes prov.cancel. This is what ApplyConfig will call.
If we look at the body of updater() method:

```
func (m *Manager) updater(ctx context.Context, p *Provider, updates chan []*targetgroup.Group) {
	// Ensure targets from this provider are cleaned up.
	defer m.cleaner(p)
	for {
		select {
		case <-ctx.Done():
			return
[...]
```

we can see that it will exit if that context is cancelled and that will trigger a call to Manager.cleaner().
That cleaner() is where done() is called.
So ApplyConfig() -> calls cancel() -> causes cleaner() to be executed -> calls done().

cancel() is also called from the cancelDiscoverers() method, which Manager.Run() calls when the Manager is stopping:

```
func (m *Manager) Run() error {
	go m.sender()
	<-m.ctx.Done()
	m.cancelDiscoverers()
	return m.ctx.Err()
}
```

The problem is that if we call ApplyConfig() and stop the manager at the same time we might end up with:

- We call Manager.ApplyConfig()
- We stop the Manager
- Manager.cancelDiscoverers() is called
- Provider.cancel() is called for every Provider
- cancel() causes the provider context to be cancelled, which terminates updater() for the given Provider
- cancelling the context causes the cleaner() method to be called for the given Provider
- cleaner() calls done() and exits
- the Provider is considered stopped at this point; there is no goroutine running that will call done() anymore
- ApplyConfig iterates providers and decides that one is obsolete and must be stopped
- It sets a custom done() function body with a WaitGroup.Done() call in it
- Then ApplyConfig waits until all Providers run done()
- But they are all stopped and no done() will be run
- We wait forever

This only happens if cancelDiscoverers() runs before ApplyConfig(). If ApplyConfig() runs first, done() will be called;
if cancelDiscoverers() runs first, it stops the updater() instances, so done() won't be called anymore.

Part of the problem is that there is no distinction between running and stopped providers. There is a Provider.IsStarted() method
that returns a bool based on the value of the cancel function, but ApplyConfig() doesn't check it.
The second problem is that although there is a mutex on a Provider, it's barely used in the code, so two goroutines can try to read and/or write
provider.cancel and/or provider.done at the same time, making it all more likely to race.

The easiest way to fix it is to check, inside ApplyConfig(), whether the provider is started, so we don't try to stop a provider that's already stopped.
For that we need to mark it as stopped after cancel() is called, by setting cancel to nil.
This also needs better lock usage to avoid different parts of the code trying to set cancel and done at the same time.
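A rough sketch of that idea (field and method shapes simplified, not the actual patch): cancel() is guarded by the mutex, cancel is nilled out to mark the provider as stopped, and ApplyConfig() checks IsStarted() before registering a done() it would wait on.

```
package discovery // rough sketch, not the actual patch

import (
	"context"
	"sync"
)

type Provider struct {
	mu     sync.Mutex
	cancel context.CancelFunc
	done   func()
}

// IsStarted reports whether the provider still has running goroutines.
func (p *Provider) IsStarted() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.cancel != nil
}

// stop cancels the provider at most once and marks it as stopped.
func (p *Provider) stop() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.cancel == nil {
		return // already stopped; no goroutine left to run done()
	}
	p.cancel()
	p.cancel = nil
}

// In ApplyConfig(), only wait on providers that are actually running:
//
//	if prov.IsStarted() && len(prov.newSubs) == 0 {
//		wg.Add(1)
//		prov.done = func() { wg.Done() }
//	}
```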

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:03:10 +01:00
Lukasz Mierzwa
357e652044 Add a test for a rare shutdown hang
Doing a config reload that needs to stop some providers, while also sending SIGTERM to Prometheus at the same time, can sometimes hang:

1: sync.WaitGroup.Wait [83 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
    sync         sema.go:110              runtime_SemacquireWaitGroup(*uint32(#166))
    sync         waitgroup.go:118         (*WaitGroup).Wait(*WaitGroup(#23))
    discovery    manager.go:276           (*Manager).ApplyConfig(#23, #167)
    main         main.go:964              main.func5(#120)
    main         main.go:1505             reloadConfig({#183, 0x1b}, 1, #40, #43, #50, {#31, 0xa, 0})
    main         main.go:1182             main.func22()
    run          group.go:38              (*Group).Run.func1(*Group(#26), #51)

Add a test for it.
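The shape of such a test, sketched with a hypothetical newTestManager() setup helper (imports elided): run ApplyConfig() and the shutdown concurrently, and fail if ApplyConfig() never returns.

```
func TestApplyConfigDuringShutdown(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	m := newTestManager(ctx, t) // hypothetical setup helper
	go func() { _ = m.Run() }()

	applied := make(chan struct{})
	go func() {
		defer close(applied)
		// An empty config marks every provider as obsolete.
		_ = m.ApplyConfig(map[string]discovery.Configs{})
	}()
	cancel() // stop the manager at the same time

	select {
	case <-applied:
	case <-time.After(5 * time.Second):
		t.Fatal("ApplyConfig() hung during shutdown")
	}
}
```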

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:01:42 +01:00
Bryan Boreham
d6f9ba6310 [BUILD] Docker SD: Fix up deprecated types
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-06-23 16:15:58 +01:00
Jan-Otto Kröpke
ceaa3bd6f9
discovery: add STACKIT SD (#16401) 2025-06-17 15:41:14 +02:00
Ayoub Mrini
50ba25f273
chore(docs/kubernetes SD): add a note about Endpoints API being deprecated in kubernetes 1.33+ (#16684)
* chore(docs/kubernetes SD): add a note about Endpoints API being deprecated in kubernetes 1.33+

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

* chore(discovery/kubernetes): add Endpoints API deprecation comment

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

---------

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-06-06 11:56:27 +02:00
Zhengke Zhou
45211dc72f
chore: Adjust test and add comment about DNS resolution issue for failing tests (#16200)
* chore: Add comment about DNS resolution issue for failing tests

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>

* remove unexported-return

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>

---------

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
2025-05-27 14:40:09 +02:00
Ryan Wu
091e662f4d
refactor(endpointslice): use service cache.Indexer to achieve better iteration performance (#16365)
* refactor(endpointslice): use cache.Indexer to index endpointslices by LabelServiceName so we don't have to iterate over all endpointslice objects (see the sketch after this entry).

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* check the type and error early and add the 'TestEndpointSliceDiscoveryWithUnrelatedServiceUpdate' unit test as a regression test

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* make service indexer namespaced

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* remove unneeded test func

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* Apply suggestions from code review

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

---------

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
2025-05-20 20:33:25 +02:00
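The gist of the indexer change above, as a hedged client-go sketch (serviceIndexName and the lookup key are illustrative, not Prometheus's exact code): endpointslices are indexed by a namespaced kubernetes.io/service-name key, so a service update can fetch only its own slices via ByIndex() instead of scanning every object.

```
package main

import (
	"fmt"

	discov1 "k8s.io/api/discovery/v1"
	"k8s.io/client-go/tools/cache"
)

const serviceIndexName = "service" // illustrative name

func main() {
	indexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{
		serviceIndexName: func(obj interface{}) ([]string, error) {
			es, ok := obj.(*discov1.EndpointSlice)
			if !ok {
				return nil, fmt.Errorf("object is not an EndpointSlice")
			}
			svc, found := es.Labels[discov1.LabelServiceName]
			if !found {
				return nil, nil
			}
			// Namespaced key, so same-named services in other
			// namespaces don't collide.
			return []string{es.Namespace + "/" + svc}, nil
		},
	})

	// On a service update, fetch only the slices backing that service.
	matches, _ := indexer.ByIndex(serviceIndexName, "default/my-service")
	fmt.Println(len(matches))
}
```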
Ayoub Mrini
eb8d34c2ad
Merge pull request #16587 from prymitive/discoveryLocks
discovery: Try fixing potential deadlocks in discovery
2025-05-19 11:09:49 +02:00
Ben Kochie
1eaf12e99b
Add golangci-lint fmt (#16602)
With golangci-lint v2, it now has "formatters" that can be configured.
Add `golangci-lint fmt` to the `make format` in Makefile.common.
* Enable goimports formatter.

Signed-off-by: SuperQ <superq@gmail.com>
2025-05-16 11:05:35 +02:00
Lukasz Mierzwa
59761f631b Move m.targetsMtx.Lock down into the loop
Make sure the order of locks is always the same in all functions. In ApplyConfig() we take m.targetsMtx.Lock() after the provider is locked, so replicate the same order in allGroups().
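Sketched with stand-in types, the ordering this commit settles on: the provider lock is taken first, then targetsMtx, and only for the duration of one loop iteration.

```
package main

import "sync"

type provider struct{ mu sync.RWMutex }

type manager struct {
	targetsMtx sync.Mutex
	providers  []*provider
}

// allGroups sketch: same lock order as ApplyConfig() (provider first,
// then targetsMtx), with targetsMtx scoped to the loop body.
func (m *manager) allGroups() {
	for _, p := range m.providers {
		p.mu.RLock()
		m.targetsMtx.Lock()
		// ... read this provider's target groups ...
		m.targetsMtx.Unlock()
		p.mu.RUnlock()
	}
}

func main() {
	m := &manager{providers: []*provider{{}, {}}}
	m.allGroups()
}
```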

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-05-15 12:30:48 +01:00
Lukasz Mierzwa
7d55ee8cc8 Try fixing potential deadlocks in discovery
Manager.ApplyConfig() uses multiple locks:
- Provider.mu
- Manager.targetsMtx

Manager.cleaner() uses the same locks but in the opposite order:
- First it locks Manager.targetsMtx
- Then it locks Provider.mu

I've seen a few strange cases of Prometheus hanging on shutdown and never completing it.
From a few traces I was given, it appears that while Prometheus is still running, only discovery.Manager and notifier.Manager are running.
From that trace it also seems like they are stuck on a lock from two functions:
- cleaner waits on an RLock()
- ApplyConfig waits on a Lock()

I cannot reproduce it but I suspect this is a race between locks. Imagine this scenario:
- Manager.ApplyConfig() is called
- Manager.ApplyConfig() acquires Provider.mu.Lock()
- at the same time cleaner() is called on the same Provider instance and it calls Manager.targetsMtx.Lock()
- Manager.ApplyConfig() now calls Manager.targetsMtx.Lock() but that lock is already held by the cleaner() function, so ApplyConfig() hangs there
- at the same time cleaner() now wants to acquire Provider.mu.RLock() but that lock is already held by Manager.ApplyConfig()
- we end up with both functions locking each other out with no way to break the deadlock

Re-order lock calls to try to avoid this scenario.
I tried writing a test case for it but couldn't hit this issue.
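The suspected interleaving, reduced to a self-contained illustration (stand-in mutexes, not the real code): two goroutines taking the same two locks in opposite orders. A real process would hang here forever; the demo just exits after a second.

```
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var providerMu sync.RWMutex // stands in for Provider.mu
	var targetsMtx sync.Mutex   // stands in for Manager.targetsMtx

	go func() { // like ApplyConfig(): Provider.mu first, then targetsMtx
		providerMu.Lock()
		time.Sleep(10 * time.Millisecond) // widen the race window
		targetsMtx.Lock()                 // blocks: held by "cleaner"
		targetsMtx.Unlock()
		providerMu.Unlock()
	}()

	go func() { // like cleaner(): targetsMtx first, then Provider.mu
		targetsMtx.Lock()
		time.Sleep(10 * time.Millisecond)
		providerMu.RLock() // blocks: held by "ApplyConfig"
		providerMu.RUnlock()
		targetsMtx.Unlock()
	}()

	time.Sleep(time.Second)
	fmt.Println("both goroutines are still stuck on each other's lock")
}
```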

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-05-12 09:13:46 +01:00
hardlydearly
ba4b058b7a refactor: use slices.Contains to simplify code
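The kind of simplification this refers to, using the Go 1.21+ standard library:

```
package main

import (
	"fmt"
	"slices"
)

func main() {
	roles := []string{"pod", "service", "ingress"}

	// Before: a manual search loop.
	found := false
	for _, r := range roles {
		if r == "service" {
			found = true
			break
		}
	}

	// After: the stdlib helper states the intent directly.
	found = slices.Contains(roles, "service")
	fmt.Println(found)
}
```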
Signed-off-by: hardlydearly <799511800@qq.com>
2025-05-09 08:27:10 +02:00
Arve Knudsen
e7e3ab2824
Fix linting issues found by golangci-lint v2.0.2 (#16368)
* Fix linting issues found by golangci-lint v2.0.2

---------

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-05-03 19:05:13 +02:00
Jonas Lammler
08982b177f
Add label_selector to hetzner service discovery
Allows filtering the servers when sending the listing request to the API. This feature is only available when using `role=hcloud`.

See https://docs.hetzner.cloud/#label-selector for details on how to use the label selector.
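Roughly how such a filtered listing looks with hcloud-go (a sketch assuming the hcloud-go v2 client; token and selector are placeholders, not Prometheus's exact code):

```
package main

import (
	"context"
	"fmt"

	"github.com/hetznercloud/hcloud-go/v2/hcloud"
)

func main() {
	client := hcloud.NewClient(hcloud.WithToken("my-token"))

	// Only list servers matching the configured label_selector.
	servers, err := client.Server.AllWithOpts(context.Background(), hcloud.ServerListOpts{
		ListOpts: hcloud.ListOpts{LabelSelector: "environment=prod"},
	})
	if err != nil {
		panic(err)
	}
	for _, s := range servers {
		fmt.Println(s.Name)
	}
}
```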

Signed-off-by: Jonas Lammler <jonas.lammler@hetzner-cloud.de>
2025-04-30 09:24:14 +02:00
Ryan Wu
b4d3c06acb
discovery: make endpointSlice discovery more efficient (#16433)
* discovery: a change to a service with the same name but from another namespace won't enqueue the endpointSlice anymore (see the sketch after this entry)

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* Update discovery/kubernetes/endpointslice.go

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* Update endpointslice.go

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

---------

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
2025-04-16 16:43:30 +02:00
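The essence of the change above, as an illustrative sketch of a service-update handler (onServiceUpdate, endpointSliceStore and enqueue are assumed names, not the exact code; corev1 and discov1 alias the core/v1 and discovery/v1 APIs): compare namespaces before matching on the service name.

```
// Illustrative sketch; handler and field names are assumptions.
func (e *EndpointSlice) onServiceUpdate(svc *corev1.Service) {
	for _, obj := range e.endpointSliceStore.List() {
		es, ok := obj.(*discov1.EndpointSlice)
		if !ok {
			continue
		}
		// A same-named Service in another namespace must not enqueue
		// this EndpointSlice.
		if es.Namespace != svc.Namespace {
			continue
		}
		if es.Labels[discov1.LabelServiceName] == svc.Name {
			e.enqueue(es)
		}
	}
}
```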
Zhengke Zhou
c884dd16ac
discovery: Remove ingress & endpoint slice adaptors (#16413)
* Remove ingress & endpoint slice adaptors
* fix ci

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
2025-04-09 10:25:53 +01:00
Ryan Wu
7d73c1d3f8
refactor[discovery, tsdb]: simplify error handling and remove redundant checks (#16328)
* refactor: simplify error handling and remove redundant checks

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* Add the comment for return of reloading blocks failure

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* Add the comment for return of reloading blocks failure

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

---------

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
2025-03-27 12:20:59 +01:00
Matthieu MOREL
5fa1146e21
chore: enable gci linter (#16245)
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-03-22 15:46:13 +00:00
Matthieu MOREL
6719867196
test(kubernetes): replace equality check with JSON equality assertion (#16246)
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-03-22 13:55:30 +01:00
Patryk Prus
452fd42aeb
Disable additional test as flaky on windows
Signed-off-by: Patryk Prus <p@trykpr.us>
2025-03-18 14:06:33 -04:00
machine424
b0227d1f16 chore(discovery): disable some file update tests as flaky
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-03-14 18:33:13 +01:00
Paulo Dias
9630dc656c
discovery(openstack): remove duplicated error handling for floatingips.List (#16205)
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
2025-03-12 15:25:50 +01:00
dependabot[bot]
6f9f29542e
chore(deps): bump github.com/docker/docker (#16118)
Bumps [github.com/docker/docker](https://github.com/docker/docker) from 27.5.1+incompatible to 28.0.1+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](https://github.com/docker/docker/compare/v27.5.1...v28.0.1)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-07 15:40:29 +01:00
co63oc
0e4e5a71bd
Fix typos (#16076)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-02-28 11:24:25 +11:00
Matthieu MOREL
c7d4b53ec1 chore: enable unused-parameter from revive
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-02-19 19:50:28 +01:00
Björn Rabenstein
13c05a385c
Merge pull request #16007 from mmorel-35/revive/early-return
chore: enable early-return from revive
2025-02-11 20:31:34 +01:00
Ayoub Mrini
de6add2c7d
Merge pull request #14228 from Codelax/sd-scaleway-routed-ips
feat(scaleway-sd): add labels for multiple public IPs
2025-02-11 17:21:29 +01:00
Matthieu MOREL
b472ce7010 chore: enable early-return from revive
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-02-10 22:08:43 +01:00
Pierre Prinetti
bb30a871ac
deps: Use Gophercloud v2
Signed-off-by: Pierre Prinetti <pierreprinetti@redhat.com>
2025-01-28 15:08:34 +01:00
Jan Fajerski
ffea9f005b
Merge pull request #15539 from paulojmdias/openstack-loadbalancer-discovery
discovery(openstack): add load balancer discovery
2025-01-28 14:10:06 +01:00
Jan Fajerski
7f37a008c4
Merge pull request #15540 from mmorel-35/prometheus/common@v0.61.0
chore(deps): use `version.PrometheusUserAgent`
2025-01-28 13:10:48 +01:00
Matthieu MOREL
dd5ab743ea chore(deps): use version.PrometheusUserAgent
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-01-22 07:31:02 +01:00
Paulo Dias
803b1565a5
fix: fix network endpoint id
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
2025-01-21 11:20:15 +00:00
Paulo Dias
1d49d11786
fix: fix testing
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
2025-01-21 11:18:34 +00:00
Paulo Dias
cddf729ca3
Merge branch 'main' of github.com:prometheus/prometheus into openstack-loadbalancer-discovery
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
2025-01-21 11:16:52 +00:00
Paulo Dias
e8fab32ca2
discovery: move openstack floating ips function from deprecated Compute API /os-floating-ips to Network API /floatingips (#14367) 2025-01-21 11:40:15 +01:00
crystalstall
616914abe2 refactor: using slices.Contains to simplify the code

Signed-off-by: crystalstall <crystalruby@qq.com>
2025-01-11 00:41:51 +08:00
Paulo Dias
36ccf62692
Merge branch 'prometheus:main' into openstack-loadbalancer-discovery 2025-01-02 14:44:19 +00:00
Paulo Dias
d40e99c2ec
Merge branch 'openstack-loadbalancer-discovery' of github.com:paulojmdias/prometheus into openstack-loadbalancer-discovery
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
2025-01-02 14:43:46 +00:00
Paulo Dias
cb7254158b
feat: rename status to provisioning_status and add operating_status
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
2025-01-02 14:43:31 +00:00
pinglanlu
6a61efcfc3
discovery: use a more direct and less error-prone return value (#15347)
Signed-off-by: pinglanlu <pinglanlu@outlook.com>
2024-12-29 18:03:06 +01:00
Paulo Dias
a5c20713dc
Merge branch 'prometheus:main' into openstack-loadbalancer-discovery 2024-12-08 22:54:18 +00:00
Paulo Dias
713903fe48
fix: fix configuration and remove unneeded libs
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
2024-12-06 17:58:21 +00:00
Ayoub Mrini
af2a1cb10c
Merge pull request #15227 from aniketnk/i15185_1
Run discovery/kubernetes tests in parallel
2024-12-05 10:48:26 +01:00