Previously the AWS ECS SD only discovered instances that used the
`awsvpc` network mode, which attaches a dedicated Elastic Network
Interface (ENI). This change adds logic to also discover instances
that use the `host` and `bridge` networking modes, where the IP
address is that of the EC2 instance hosting the container. It also
exposes a number of additional labels relating to the EC2 instance
when the launch type is `EC2`.
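A minimal sketch of the address selection this implies (a hypothetical helper; the real discovery code reads these values from the ECS and EC2 APIs):
```go
package sketch

import "fmt"

// targetAddress illustrates how the scrape address could be chosen
// depending on the task's network mode; names and layout are hypothetical.
func targetAddress(networkMode, eniPrivateIP, instancePrivateIP string, port int) string {
	switch networkMode {
	case "awsvpc":
		// Dedicated ENI: the task has its own private IP address.
		return fmt.Sprintf("%s:%d", eniPrivateIP, port)
	case "host", "bridge":
		// The container is reachable via the hosting EC2 instance's IP.
		return fmt.Sprintf("%s:%d", instancePrivateIP, port)
	default:
		return ""
	}
}
```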
Signed-off-by: matt-gp <small_minority@hotmail.com>
This adds the following native histograms (with a few classic buckets for backwards compatibility), while keeping the corresponding summaries (same name, just without `_histogram`); a registration sketch follows the list:
- `prometheus_sd_refresh_duration_histogram_seconds`
- `prometheus_rule_evaluation_duration_histogram_seconds`
- `prometheus_rule_group_duration_histogram_seconds`
- `prometheus_target_sync_length_histogram_seconds`
- `prometheus_target_interval_length_histogram_seconds`
- `prometheus_engine_query_duration_histogram_seconds`
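For illustration, one of these could be registered with client_golang roughly like this (bucket values and help text are placeholders, not taken from the actual change):
```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// refreshDuration sketches one of the metrics above: a native histogram
// plus a few classic buckets kept for backwards compatibility.
var refreshDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "prometheus_sd_refresh_duration_histogram_seconds",
	Help: "Duration of a service discovery refresh.",
	// A few classic buckets for backwards compatibility (illustrative values).
	Buckets: []float64{0.005, 0.05, 0.5, 5},
	// Enable the native histogram representation.
	NativeHistogramBucketFactor:    1.1,
	NativeHistogramMaxBucketNumber: 100,
})

func init() {
	prometheus.MustRegister(refreshDuration)
}
```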
Signed-off-by: Harsh <harshmastic@gmail.com>
Signed-off-by: harsh kumar <135993950+hxrshxz@users.noreply.github.com>
Co-authored-by: Björn Rabenstein <github@rabenste.in>
* fix: aws discovery test fix
Fixes a build failure introduced by the merge of https://github.com/prometheus/prometheus/pull/17138;
that PR didn't take into account another PR that had been merged in the meantime:
```
discovery/aws/aws.go:218:54: too many arguments in call to NewEC2Discovery
have (*EC2SDConfig, *slog.Logger, *ec2Metrics)
want (*EC2SDConfig, discovery.DiscovererOptions)
discovery/aws/aws.go:222:66: too many arguments in call to NewLightsailDiscovery
have (*LightsailSDConfig, *slog.Logger, *lightsailMetrics)
want (*LightsailSDConfig, discovery.DiscovererOptions)
```
Signed-off-by: Will Bollock <wbollock@linode.com>
* fix: align ecs style
ECS is a new service discovery mechanism that was added after this PR was merged: https://github.com/prometheus/prometheus/pull/17138
This aligns it with the style of passing a single "opts" argument, as almost all the other
service discovery engines now do.
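A sketch of the aligned constructor shape implied by the build error above (type and field names here are assumptions, not the actual tree):
```go
package sketch

import "log/slog"

// DiscovererOptions stands in for the upstream discovery.DiscovererOptions;
// the exact fields are assumptions for illustration.
type DiscovererOptions struct {
	Logger  *slog.Logger
	Metrics any // discoverer-specific metrics, registered elsewhere
}

type ECSSDConfig struct{}

type ECSDiscovery struct {
	cfg    *ECSSDConfig
	logger *slog.Logger
}

// Before: NewECSDiscovery(conf *ECSSDConfig, logger *slog.Logger, m *ecsMetrics)
// After: a single opts argument, matching the other SD engines.
func NewECSDiscovery(conf *ECSSDConfig, opts DiscovererOptions) (*ECSDiscovery, error) {
	return &ECSDiscovery{cfg: conf, logger: opts.Logger}, nil
}
```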
Signed-off-by: Will Bollock <wbollock@linode.com>
---------
Signed-off-by: Will Bollock <wbollock@linode.com>
Loading the local region from the Instance Metadata Service (IMDS) broke in v3.7. This adds the IMDS call back in order to load the local region when no other method has set it.
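A sketch of the shape of that fallback using the SDK v2 IMDS client (illustrative, not the exact upstream code):
```go
package sketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
)

// resolveRegion loads the default config and, if nothing else set a
// region, falls back to asking the Instance Metadata Service.
func resolveRegion(ctx context.Context) (string, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return "", err
	}
	if cfg.Region != "" {
		return cfg.Region, nil
	}
	out, err := imds.NewFromConfig(cfg).GetRegion(ctx, &imds.GetRegionInput{})
	if err != nil {
		return "", err
	}
	return out.Region, nil
}
```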
Fixes #17375
Signed-off-by: Joe Adams <github@joeadams.io>
After the upgrade to AWS SDK v2, the EC2 and Lightsail service discovery
stopped working when using the default AWS credential chain (environment
variables, IAM roles, EC2 instance metadata, etc.).
The issue was that the code unconditionally created a StaticCredentialsProvider
with empty credentials when access_key and secret_key were not configured. In
AWS SDK v2, this causes a "static credentials are empty" error and prevents
the SDK from falling back to its default credential chain.
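A sketch of the intended behaviour: only install a static credentials provider when keys are actually configured, otherwise let the SDK's default chain resolve them (option names are from aws-sdk-go-v2; the surrounding function is illustrative):
```go
package sketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials"
)

// loadAWSConfig only wires up a StaticCredentialsProvider when keys are set;
// with empty keys the SDK would fail with "static credentials are empty"
// instead of falling back to the default chain (env vars, IAM role, IMDS).
func loadAWSConfig(ctx context.Context, accessKey, secretKey string) (aws.Config, error) {
	var opts []func(*config.LoadOptions) error
	if accessKey != "" && secretKey != "" {
		opts = append(opts, config.WithCredentialsProvider(
			credentials.NewStaticCredentialsProvider(accessKey, secretKey, ""),
		))
	}
	return config.LoadDefaultConfig(ctx, opts...)
}
```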
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
Adds a `config` label (similar to `prometheus_sd_discovered_targets`) to
the refresh metrics to help identify the source of refresh issues or
performance statistics. In particular for HTTP SD, it is common to have
multiple disparate HTTP SD sources that should be identified rather than
lumped together. For example, if one HTTP SD service has failures, that
should be evident in its own time series, separate from other HTTP SD
sources.
`config` seemed more appropriate than `endpoint` as a general standard
for `prometheus_sd` metrics.
Docs were also updated for HTTP SD to point at the new refresh metrics
rather than the older metrics.
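For illustration, a refresh metric carrying the new label could be declared along these lines (a client_golang sketch; metric name and help text are illustrative):
```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// refreshFailures carries a "config" label so failures from one HTTP SD
// source show up as their own series instead of being lumped together.
var refreshFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "prometheus_sd_refresh_failures_total",
		Help: "Number of failed service discovery refreshes.",
	},
	[]string{"config"},
)

// recordFailure increments the counter for the given SD config name.
func recordFailure(cfgName string) {
	refreshFailures.WithLabelValues(cfgName).Inc()
}
```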
Signed-off-by: Will Bollock <wbollock@linode.com>
See
https://pkg.go.dev/golang.org/x/tools/gopls/internal/analysis/modernize
for details.
This ran into a few issues (arguably bugs in the modernize tool),
which I will fix in the next commit, so that we have transparency
about what was done automatically.
Beyond those hiccups, I believe all the changes applied are
legitimate. Even where there might be no tangible direct gain, I would
argue it's still better to use the "modern" way to avoid micro
discussions in tiny style PRs later.
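Typical rewrites from that analyzer look roughly like this (an illustrative before/after, not taken from the actual diff):
```go
package sketch

// Before modernize: a classic counting loop.
func sumOld(xs []int) int {
	total := 0
	for i := 0; i < len(xs); i++ {
		total += xs[i]
	}
	return total
}

// After modernize: range-over-int (Go 1.22+), one of the rewrites the
// analyzer suggests.
func sumNew(xs []int) int {
	total := 0
	for i := range len(xs) {
		total += xs[i]
	}
	return total
}
```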
Signed-off-by: beorn7 <beorn@grafana.com>
AWS SDK v1 reaches end of life soon, so migrate to the v2 SDK. Credential loading should now work more consistently with other projects that use the SDK, loading credentials from the appropriate locations, including environment variables. This affects the EC2 and Lightsail service discovery features.
Signed-off-by: Joe Adams <github@joeadams.io>
* close sync ch after sender() is stopped
* break if chan is closed
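A minimal sketch of the pattern these two bullets describe (names are illustrative, not the actual manager code):
```go
package sketch

// run closes syncCh only after the sender goroutine has stopped, and the
// receiver breaks out once the channel is closed instead of blocking forever.
func run(items []string) {
	syncCh := make(chan string)
	senderDone := make(chan struct{})

	go func() { // sender()
		defer close(senderDone)
		for _, it := range items {
			syncCh <- it
		}
	}()

	go func() {
		<-senderDone // close the sync channel only after sender() is stopped
		close(syncCh)
	}()

	for {
		v, ok := <-syncCh
		if !ok { // break if the channel is closed
			break
		}
		_ = v
	}
}
```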
Signed-off-by: liyandi <littlepangdi@163.com>
Co-authored-by: liyandi <liyandi@xiaomi.com>
If we call ApplyConfig() at the same time the manager is being stopped, we might end up hanging forever.
This is because ApplyConfig() will try to cancel obsolete providers and wait until they are cancelled.
It does so by setting a done() function that calls Done() on a sync.WaitGroup:
```
if len(prov.newSubs) == 0 {
	wg.Add(1)
	prov.done = func() {
		wg.Done()
	}
}
```
then calling prov.cancel() and finally waiting, via a blocking wg.Wait() call, until all providers
have run their done() function.
For each provider there is a goroutine created by calling Manager.startProvider(*Provider):
```
func (m *Manager) startProvider(ctx context.Context, p *Provider) {
	m.logger.Debug("Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
	ctx, cancel := context.WithCancel(ctx)
	updates := make(chan []*targetgroup.Group)
	p.mu.Lock()
	p.cancel = cancel
	p.mu.Unlock()
	go p.d.Run(ctx, updates)
	go m.updater(ctx, p, updates)
}
```
It creates a context that can be cancelled, and that cancel function becomes prov.cancel; this is what ApplyConfig() will call.
If we look at the body of the updater() method:
```
func (m *Manager) updater(ctx context.Context, p *Provider, updates chan []*targetgroup.Group) {
	// Ensure targets from this provider are cleaned up.
	defer m.cleaner(p)
	for {
		select {
		case <-ctx.Done():
			return
		[...]
```
we can see that it will exit if that context is cancelled, which will trigger a call to Manager.cleaner().
That cleaner() is where done() is called.
So: ApplyConfig() -> calls cancel() -> causes cleaner() to be executed -> calls done().
cancel() is also called from the cancelDiscoverers() method, which is called by Manager.Run() when the Manager is stopping:
```
func (m *Manager) Run() error {
	go m.sender()
	<-m.ctx.Done()
	m.cancelDiscoverers()
	return m.ctx.Err()
}
```
The problem is that if we call ApplyConfig() and stop the manager at the same time, we might end up with the following sequence:
- We call Manager.ApplyConfig()
- We stop the Manager
- Manager.cancelDiscoverers() is called
- Provider.cancel() is called for every Provider
- cancel() causes provider context to be cancelled which terminates updater() for given Provider
- cancelling context causes cleaner() method to be called for given Provider
- cleaner() calls done() and exits
- The Provider is considered stopped at this point; there is no goroutine running that will call done() anymore
- ApplyConfig() iterates over providers and decides that this one is obsolete and must be stopped
- It sets a custom done() function body with a WaitGroup.Done() call in it
- Then ApplyConfig waits until all Providers run done()
- But they are all stopped and no done() will be run
- We wait forever
This only happens if cancelDiscoverers() runs before ApplyConfig(): if ApplyConfig() runs first, done() will be called;
if cancelDiscoverers() runs first, it stops the updater() instances and so done() won't be called anymore.
Part of the problem is that there is no distinction between running and stopped providers. There is a Provider.IsStarted() method
that returns a bool based on the value of the cancel function, but ApplyConfig() doesn't check it.
The second problem is that although there is a mutex on a Provider, it isn't used in much of the code, so two goroutines can try to read and/or write
provider.cancel and/or provider.done at the same time, making it all more likely to race.
The easiest way to fix it is to check, inside ApplyConfig(), whether the provider is started, so we don't try to stop a provider that's already stopped.
For that we need to mark it as stopped after cancel() is called, by setting cancel to nil.
This also needs better lock usage to avoid different parts of the code trying to set cancel and done at the same time.
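A sketch of what that check could look like inside ApplyConfig(), building on the excerpt above (an illustration of the approach, not the exact upstream diff):
```
// Inside ApplyConfig(), while handling an obsolete provider:
prov.mu.Lock()
if prov.cancel == nil {
	// Already stopped by cancelDiscoverers(); nothing will ever call
	// done(), so don't add it to the WaitGroup.
	prov.mu.Unlock()
	continue
}
if len(prov.newSubs) == 0 {
	wg.Add(1)
	prov.done = func() {
		wg.Done()
	}
}
prov.mu.Unlock()
```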
Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
Doing a config reload that needs to stop some providers while also sending SIGTERM to Prometheus at the same time can sometimes hang:
```
1: sync.WaitGroup.Wait [83 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
sync sema.go:110 runtime_SemacquireWaitGroup(*uint32(#166))
sync waitgroup.go:118 (*WaitGroup).Wait(*WaitGroup(#23))
discovery manager.go:276 (*Manager).ApplyConfig(#23, #167)
main main.go:964 main.func5(#120)
main main.go:1505 reloadConfig({#183, 0x1b}, 1, #40, #43, #50, {#31, 0xa, 0})
main main.go:1182 main.func22()
run group.go:38 (*Group).Run.func1(*Group(#26), #51)
```
Add a test for it.
Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>