842 Commits

Author SHA1 Message Date
Will Bollock
4aa8941eb1
fix(discovery): aws discovery test fix (#17527)
* fix: aws discovery test fix

Fixes a problem introduced after the merge of this https://github.com/prometheus/prometheus/pull/17138

PR didn't take into account another merged PR!

```
discovery/aws/aws.go:218:54: too many arguments in call to NewEC2Discovery
    have (*EC2SDConfig, *slog.Logger, *ec2Metrics)
    want (*EC2SDConfig, discovery.DiscovererOptions)
discovery/aws/aws.go:222:66: too many arguments in call to NewLightsailDiscovery
    have (*LightsailSDConfig, *slog.Logger, *lightsailMetrics)
    want (*LightsailSDConfig, discovery.DiscovererOptions)
```

Signed-off-by: Will Bollock <wbollock@linode.com>

* fix: align ecs style

ECS was a new service discovery tool added after this PR was merged: https://github.com/prometheus/prometheus/pull/17138

Aligns the style of passing a single "opts" to it like almost all the other
service discovery engines now use

Signed-off-by: Will Bollock <wbollock@linode.com>

---------

Signed-off-by: Will Bollock <wbollock@linode.com>
2025-11-16 10:28:50 +00:00
Julius Hinze
987b28e26c
discovery: fix constructor arguments in aws discovery (#17526)
Signed-off-by: Julius Hinze <julius.hinze@grafana.com>
2025-11-13 15:59:14 +00:00
Bryan Boreham
e02a65b6bd
Merge pull request #17138 from wbollock/feat/prometheus_refresh_config_label
feat(metrics): add config label to refresh metrics
2025-11-13 14:51:39 +01:00
matt-gp
1b52ab9e3b
feat: AWS ECS Service Discovery
Signed-off-by: matt-gp <small_minority@hotmail.com>
2025-11-06 22:48:07 +00:00
Ben Kochie
204249fcb5
Update golangci-lint (#17478)
* Update golangci-lint to v2.6.0
* Fixup various linting issues.
* Fixup deprecations.
* Add exception for `labels.MetricName` deprecation.

Signed-off-by: SuperQ <superq@gmail.com>
2025-11-05 13:47:34 +01:00
Ben Kochie
48956f60d7
Update modernize (#17471)
Apply additional Go modernize tool improvements.

Signed-off-by: SuperQ <superq@gmail.com>
2025-11-04 05:13:49 +00:00
György Krajcsovits
fbd5353a19
Merge remote-tracking branch 'origin/release-3.7' into krajo/merge-release-372-to-main 2025-10-22 18:02:22 +02:00
George Krajcsovits
d06c96136d
Merge pull request #17376 from sysadmind/fix-17375
discovery/aws: Fix region load from IMDS
2025-10-22 11:02:56 +02:00
Joe Adams
6f9af6651e
Update lightsail to use IMDS for region
Signed-off-by: Joe Adams <github@joeadams.io>
2025-10-21 20:31:55 -04:00
Joe Adams
7a29bd2cb4
discovery/aws: Fix region load from IMDS
Loading the local region from the Instance MetaData Service broke in v3.7. This adds the IMDS call back in order to load the local region when no other method has set the region.

fixes #17375

Signed-off-by: Joe Adams <github@joeadams.io>
2025-10-20 22:47:07 -04:00
Julien Pivotto
c40a574197 discovery/ec2: Fix AWS SDK v2 credentials handling for EC2 and Lightsail discovery
After the upgrade to AWS SDK v2, the EC2 and Lightsail service discovery
stopped working when using the default AWS credential chain (environment
variables, IAM roles, EC2 instance metadata, etc.).

The issue was that the code unconditionally created a StaticCredentialsProvider
with empty credentials when access_key and secret_key were not configured. In
AWS SDK v2, this causes a "static credentials are empty" error and prevents
the SDK from falling back to its default credential chain.

Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
2025-10-17 15:52:40 +02:00
Will Bollock
e894a22b88 feat: add config label to refresh metrics
Adds a `config` label (similar to `prometheus_sd_discovered_targets`) to
refresh metrics to help identify the source of refresh issues or
performance stats. In particular for HTTP SD, it can be common to have
multiple disparate HTTP SD sources that should be identified and not
lumped together. For example if one HTTP SD service has failures, that
should be evident in its own time series seperate from other HTTP SD
sources.

`config` seemed more appropriate than `endpoint` as a general standard
for `prometheus_sd` metrics.

Docs were also updated for HTTP SD to point at the new refresh metrics
rather than the older metrics.

Signed-off-by: Will Bollock <wbollock@linode.com>
2025-10-14 11:36:14 -04:00
Harsh
24282f7b44 test(discovery/xds): speed up tests with t.Parallel()
Signed-off-by: Harsh <harshmastic@gmail.com>
2025-10-12 15:27:33 +05:30
Michael Shen
1eaddc64d0
Migrate K8s discovery service queues to use strongly typed queues
Signed-off-by: Michael Shen <mishen@umich.edu>
2025-09-23 20:32:11 -07:00
Michael Shen
9c525b84c4
Add deprecation notice to associated K8s endpoints API objects
Signed-off-by: Michael Shen <mishen@umich.edu>
2025-09-23 20:30:37 -07:00
Michael Shen
1703e54dfd
Update to k8s.io v0.33.5
Signed-off-by: Michael Shen <mishen@umich.edu>
2025-09-23 20:30:36 -07:00
Bryan Boreham
26279e5b6d
Merge pull request #17066 from cuiweixie/reflect.TypeFor-discovery
discovery: refactor to use reflect.TypeFor

Use a neater form, introduced in Go 1.22.
2025-09-16 12:22:14 +01:00
Arve Knudsen
913cc8f72b
Replace gopkg.in/yaml.v2 with go.yaml.in/yaml/v2 (#17151)
* Replace gopkg.in/yaml.v2 with go.yaml.in/yaml/v2
* Upgrade to client_golang@v1.23.2

---------

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-09-06 13:04:24 +02:00
beorn7
747c5ee2b1 Apply analyzer "modernize" to the whole codebase
See
https://pkg.go.dev/golang.org/x/tools/gopls/internal/analysis/modernize
for details.

This ran into a few issues (arguably bugs in the modernize tool),
which I will fix in the next commit, so that we have transparency what
was done automatically.

Beyond those hiccups, I believe all the changes applied are
legitimate. Even where there might be no tangible direct gain, I would
argue it's still better to use the "modern" way to avoid micro
discussions in tiny style PRs later.

Signed-off-by: beorn7 <beorn@grafana.com>
2025-08-27 14:48:41 +02:00
cuiweixie
8e48f43e06 discovery: refactor to use reflect.TypeFor
Signed-off-by: cuiweixie <cuiweixie@gmail.com>
2025-08-21 13:52:30 +08:00
Joe Adams
60cf922f89
Fix merge formatting
Signed-off-by: Joe Adams <github@joeadams.io>
2025-08-07 21:17:21 -04:00
Joe Adams
8350c23e76
Merge branch 'main' into aws-sdk-go-v2
Signed-off-by: Joe Adams <github@joeadams.io>
2025-08-06 10:53:49 -04:00
Matthieu MOREL
cef219c31c chore: enable unused-receiver rule from revive
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-08-04 09:43:33 +00:00
Joe Adams
cdfb67467f
Review feedback
Signed-off-by: Joe Adams <github@joeadams.io>
2025-07-31 21:22:43 -04:00
Joe Adams
56a3bbf5c5
Fix import formatting
Signed-off-by: Joe Adams <github@joeadams.io>
2025-07-29 23:17:23 -04:00
Joe Adams
eab9b696f2
Upgrade AWS SDK to v2
AWS SDK v1 is end of life soon, so migrate to the V2 SDK. The credential loading should work more consistently with other projects that use the SDK and load credentials from the appropriate locations including from environment variables. This affects the EC2 and Lightsail service discovery features.

Signed-off-by: Joe Adams <github@joeadams.io>
2025-07-29 23:06:05 -04:00
Ayoub Mrini
9dc274687b
Merge pull request #16831 from machine424/nsmeta
feat(discovery/kubernetes): allow attaching namespace metadata
2025-07-17 10:30:27 +01:00
machine424
a9f6fdd910
feat(discovery/kubernetes): allow attaching namespace metadata
to ingress and service roles.

with the help of claude-4-sonnet

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-17 09:53:16 +02:00
Yandi Lee
8eb445b8a4
Discovery.Manager: close sync ch after sender() is stopped (#14465)
* close sync ch after sender() is stopped
* break if chan is closed

Signed-off-by: liyandi <littlepangdi@163.com>
Co-authored-by: liyandi <liyandi@xiaomi.com>
2025-07-11 17:15:01 +01:00
machine424
020e803ee0 chore(discovery): remove unused StaticProvider struct, library users can easily define it on their side
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-09 17:10:13 +01:00
chenlujjj
a2735494e1
chore: complete error message in RegisterSDMetrics function (#14635)
Signed-off-by: chenlujjj <953546398@qq.com>
2025-07-08 12:05:24 +00:00
machine424
c2d6e528e4
feat(discovery/kubernetes): allow attaching namespace metadata
to endpointslice, endpoints and pod roles

after injecting the labels for endpointslice, claude-4-sonnet
helped transpose the code and tests to endpoints and pod roles

fixes https://github.com/prometheus/prometheus/issues/9510
supersedes https://github.com/prometheus/prometheus/pull/13798

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Co-authored-by: Paul BARRIE <paul.barrie.calmels@gmail.com>
2025-07-03 19:41:08 +02:00
Lukasz Mierzwa
b49d143595 Fix a race in discovery manager ApplyConfig & shutdown
If we call ApplyConfig() at the same time the manager is being stopped we might end up hanging forever.
This is because ApplyConfig() will try to cancel obsolete providers and wait until they are cancelled.
It's done by setting a done() function that call Done() on a sync.WaitGroup:

```
if len(prov.newSubs) == 0 {
	wg.Add(1)
	prov.done = func() {
		wg.Done()
	}
}
```

then calling prov.cancel() and finally waiting until all providers run done() function
that by blocking it all on a wg.Wait() call.

For each provider there is a goroutine created by calling Manager.startProvider(*Provider):

```
func (m *Manager) startProvider(ctx context.Context, p *Provider) {
	m.logger.Debug("Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
	ctx, cancel := context.WithCancel(ctx)
	updates := make(chan []*targetgroup.Group)

	p.mu.Lock()
	p.cancel = cancel
	p.mu.Unlock()

	go p.d.Run(ctx, updates)
	go m.updater(ctx, p, updates)
}
```

It creates a context that can be cancelled and that cancel function becomes prov.cancel. This is what ApplyConfig will call.
If we look at the body of updater() method:

```
func (m *Manager) updater(ctx context.Context, p *Provider, updates chan []*targetgroup.Group) {
	// Ensure targets from this provider are cleaned up.
	defer m.cleaner(p)
	for {
		select {
		case <-ctx.Done():
			return
[...]
```

we can see that it will exit if that context is cancelled and that will trigger a call to Manager.cleaner().
That cleaner() is where done() is called.
So ApplyConfig() -> calls cancel() -> causes cleaner() to be executed -> calls done().

cancel() is also called from cancelDiscoverers() method that will be called by Manager.Run() when Manager is stopping:

```
func (m *Manager) Run() error {
	go m.sender()
	<-m.ctx.Done()
	m.cancelDiscoverers()
	return m.ctx.Err()
}
```

The problem is that if we call both ApplyConfig and stop the manager at the same time we might end up with:

- We call Manager.ApplyConfig()
- We stop the Manager
- Manager.cancelDiscoverers() is called
- Provider.cancel() is called for every Provider
- cancel() causes provider context to be cancelled which terminates updater() for given Provider
- cancelling context causes cleaner() method to be called for given Provider
- cleaner() calls done() and exits
- Provider is considered stopped at this point, there is no goroutine running that will call done() anymore
- ApplyConfig iterates providers and decides that one is obsolete is must be stopped
- It sets a custom done() function body with a WaitGroup.Done() call in it
- Then ApplyConfig waits until all Providers run done()
- But they are all stopped and no done() will be run
- We wait forever

This only happens if cancelDiscoverers() is run before ApplyConfig, if ApplyConfig runs first done() will be called,
if cancelDiscoverers() is called first it will stop updater() instances and so done() won't be called anymore.

Part of the problem is that there is no distinction between running and stopped providers. There is Provider.IsStarted() method
that returns a bool based on the value of cancel function but ApplyConfig doesn't check it.
Second problem is that although there is a mutex on a Provider it's used much in the code, so two goroutines can try to read and/or write
provider.cancel and/or provider.done at the same time, making it all more likely to race.

The easiest way to fix it is to check if the provider is started inside ApplyConfig so we don't try to stop a provider that's already stopped.
For that we need to mark it as stopped after cancel() is called, by setting cancel to nil.
This also needs better lock usage to avoid different parts of the code trying to set cancel and done at the same time.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:03:10 +01:00
Lukasz Mierzwa
357e652044 Add a test for a rare shutdown hang
When doing a config reload that need to stop some providers while also sending SIGTERM to Prometheus at the same time can sometimes hang

1: sync.WaitGroup.Wait [83 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
    sync         sema.go:110              runtime_SemacquireWaitGroup(*uint32(#166))
    sync         waitgroup.go:118         (*WaitGroup).Wait(*WaitGroup(#23))
    discovery    manager.go:276           (*Manager).ApplyConfig(#23, #167)
    main         main.go:964              main.func5(#120)
    main         main.go:1505             reloadConfig({#183, 0x1b}, 1, #40, #43, #50, {#31, 0xa, 0})
    main         main.go:1182             main.func22()
    run          group.go:38              (*Group).Run.func1(*Group(#26), #51)

Add a test for it.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:01:42 +01:00
Bryan Boreham
d6f9ba6310 [BUILD] Docker SD: Fix up deprecated types
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-06-23 16:15:58 +01:00
Jan-Otto Kröpke
ceaa3bd6f9
discovery: add STACKIT SD (#16401) 2025-06-17 15:41:14 +02:00
Ayoub Mrini
50ba25f273
chore(docs/kubernetes SD): add a note about Endpoints API being deprecated in kubernetes 1.33+ (#16684)
* chore(docs/kubernetes SD): add a note about Endpoints API being deprecated in kubernetes 1.33+

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

* chore(discovery/kubernetes): add Endpoints API deprecation comment

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

---------

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-06-06 11:56:27 +02:00
Zhengke Zhou
45211dc72f
chore: Adjust test and add comment about DNS resolution issue for failing tests (#16200)
* chore: Add comment about DNS resolution issue for failing tests

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>

* remove unexported-return

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>

---------

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
2025-05-27 14:40:09 +02:00
Ryan Wu
091e662f4d
refactor(endpointslice): use service cache.Indexer to achieve better iteration performance (#16365)
* refactor(endpointslice): use cache.Indexer to index endpointslices by LabelServiceName so not have to iterate over all endpoint objects.

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* check the type and error early and add 'TestEndpointSliceDiscoveryWithUnrelatedServiceUpdate' unit test to give a regression test

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* make service indexer namespaced

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* remove unneeded test func

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* Apply suggestions from code review

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

---------

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
2025-05-20 20:33:25 +02:00
Ayoub Mrini
eb8d34c2ad
Merge pull request #16587 from prymitive/discoveryLocks
discovery: Try fixing potential deadlocks in discovery
2025-05-19 11:09:49 +02:00
Ben Kochie
1eaf12e99b
Add golangci-lint fmt (#16602)
With golangci-lint v2, it now has "formatters" that can be configured.
Add `golangci-lint fmt` to the `make format` in Makefile.common.
* Enable goimports formatter.

Signed-off-by: SuperQ <superq@gmail.com>
2025-05-16 11:05:35 +02:00
Lukasz Mierzwa
59761f631b Move m.targetsMtx.Lock down into the loop
Make sure the order of locks is always the same in all functions. In ApplyConfig() we have m.targetsMtx.Lock() after provider is locked, so replicate the same in allGroups().

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-05-15 12:30:48 +01:00
Lukasz Mierzwa
7d55ee8cc8 Try fixing potential deadlocks in discovery
Manager.ApplyConfig() uses multiple locks:
- Provider.mu
- Manager.targetsMtx

Manager.cleaner() uses the same locks but in the opposite order:
- First it locks Manager.targetsMtx
- The it locks Provider.mu

I've seen a few strange cases of Prometheus hanging up on shutdown and never compliting that shutdown.
From a few traces I was given it appears that while Prometheus is still running only discovery.Manager and notifier.Manager are running running.
From that trace it also seems like they are stuck on a lock from two functions:
- cleaner waits on a RLock()
- ApplyConfig waits on a Lock()

I cannot reproduce it but I suspect this is a race between locks. Imagine this scenario:
- Manager.ApplyConfig() is called
- Manager.ApplyConfig locks Provider.mu.Lock()
- at the same time cleaner() is called on the same Provider instance and it calls Manager.targetsMtx.Lock()
- Manager.ApplyConfig() now calls Manager.targetsMtx.Lock() but that lock is already held by cleaner() function so ApplyConfig() hangs there
- at the same time cleaner() now wants to lock Provider.mu.Rlock() but that lock is already held by Manager.ApplyConfig()
- we end up with both functions locking each other out without any way to break that lock

Re-order lock calls to try to avoid this scenario.
I tried writing a test case for it but couldn't hit this issue.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-05-12 09:13:46 +01:00
hardlydearly
ba4b058b7a refactor: use slices.Contains to simplify code
Signed-off-by: hardlydearly <799511800@qq.com>
2025-05-09 08:27:10 +02:00
Arve Knudsen
e7e3ab2824
Fix linting issues found by golangci-lint v2.0.2 (#16368)
* Fix linting issues found by golangci-lint v2.0.2

---------

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-05-03 19:05:13 +02:00
Jonas Lammler
08982b177f
Add label_selector to hetzner service discovery
Allows to filter the servers when sending the listing request to the API. This feature is only available when using the `role=hcloud`.

See https://docs.hetzner.cloud/#label-selector for details on how to use the label selector.

Signed-off-by: Jonas Lammler <jonas.lammler@hetzner-cloud.de>
2025-04-30 09:24:14 +02:00
Ryan Wu
b4d3c06acb
discovery: make endpointSlice discovery more efficient (#16433)
* discovery: a change to a service with the same name but from another namespace won't enqueue the endpointSlice

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* Update discovery/kubernetes/endpointslice.go

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* Update endpointslice.go

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

---------

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
2025-04-16 16:43:30 +02:00
Zhengke Zhou
c884dd16ac
discovery: Remove ingress & endpoint slice adaptors (#16413)
* Remove ingress & endpoint slice adaptors
* fix ci

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
2025-04-09 10:25:53 +01:00
Ryan Wu
7d73c1d3f8
refactor[discovery, tsdb]: simplify error handling and remove redundant checks (#16328)
* refactor: simplify error handling and remove redundant checks

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* Add the comment for return of reloading blocks failure

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

* Add the comment for return of reloading blocks failure

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>

---------

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
2025-03-27 12:20:59 +01:00
Matthieu MOREL
5fa1146e21
chore: enable gci linter (#16245)
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-03-22 15:46:13 +00:00