15829 Commits

Author SHA1 Message Date
Carrie Edwards
b6aff14ced Fix linting and flaky test
Signed-off-by: Carrie Edwards <edwrdscarrie@gmail.com>
2025-07-10 12:59:06 -07:00
Carrie Edwards
1d56a5f6d2 Refactor to use separate type and unit annotations
Signed-off-by: Carrie Edwards <edwrdscarrie@gmail.com>
2025-07-10 07:40:26 -07:00
Carrie Edwards
c13c6bc1f5 Fix linting and failing tests
Signed-off-by: Carrie Edwards <edwrdscarrie@gmail.com>
2025-07-08 14:13:37 -07:00
Carrie Edwards
9f9e408405 Rename wrapper and use struct instead of map for type/unit tracking
Signed-off-by: Carrie Edwards <edwrdscarrie@gmail.com>
2025-07-08 13:24:12 -07:00
Carrie Edwards
67971622ba Add storage wrapper for mismatched type and unit annotations
Signed-off-by: Carrie Edwards <edwrdscarrie@gmail.com>
2025-07-07 11:22:13 -07:00
machine424
ffcba01c5a chore: do not hardcode required versions in README.md
add links to the sources of truth.

They are hard to keep up to date. For example, the "go" one
is "wrong" (not really, as an old 1.22 binary could still
download/use newer toolchains...).

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-07 08:42:31 +01:00
Charles Korn
1e58d792a5
storage/remote: fix "http: read on closed response body" errors if chunkedSeriesSet.Next is called again after the series set is exhausted (#16838)
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-07 09:23:34 +02:00
RaphSku
938e5cb62b
docs: Added documentation for promtool configuration with http.config.file (#16522)
Includes an example.

Signed-off-by: RaphSku <rapsku.dev@gmail.com>
2025-07-07 00:00:51 +02:00
Michael Hoffmann
21b1536b5a
storage: add projection fields to select hints (#16423)
This commit adds Projection metadata to SelectHints so that downstream
storage implementations can use it to save effort when answering
Select calls.

Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
2025-07-06 12:57:19 +02:00
Arve Knudsen
f561aa795d
OTLP receiver: Generate target_info samples between the earliest and latest samples per resource (#16737)
* OTLP receiver: Generate target_info samples between the earliest and latest samples per resource

Modify the OTLP receiver to generate target_info samples between the earliest
and latest samples per resource instead of only one for the latest timestamp.
The samples are spaced lookback delta/2 apart.

---------

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-07-04 14:38:16 +00:00
Jon Kartago Lamida
819500bdbc
Add ByteSize method for Labels (#16717)
Add a `ByteSize()` method to the different labels implementations.
One use case is tracking the memory used by Labels.
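
As a rough illustration of the idea only: the `Label` and `Labels` types below are simplified stand-ins, not the Prometheus types, and counting only the string bytes (ignoring slice and string headers) is an assumption of this sketch.

```go
package main

import "fmt"

// Label and Labels are simplified stand-ins for a slice-based label set.
type Label struct{ Name, Value string }
type Labels []Label

// ByteSize returns the number of bytes of string data held by the
// label names and values.
func (ls Labels) ByteSize() uint64 {
	var n uint64
	for _, l := range ls {
		n += uint64(len(l.Name) + len(l.Value))
	}
	return n
}

func main() {
	ls := Labels{{"__name__", "up"}, {"job", "node"}}
	fmt.Println(ls.ByteSize()) // 17
}
```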

Signed-off-by: Jon Kartago Lamida <me@lamida.net>
2025-07-04 15:09:01 +01:00
Arve Knudsen
5a5424cbc1
Consolidate around prometheus/common/model.ValidationScheme (#16806)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-07-03 15:37:46 +02:00
Bartlomiej Plotka
419d436a44
Merge pull request #16822 from prometheus/bump-otlptranslator
Bump otlptranslator to latest SHA
2025-07-03 12:40:31 +01:00
Matthias Loibl
61064cb774
Merge pull request #16819 from jscheffner/prometheus-dashboard-uid
mixin: add uid to prometheus overview dashboard
2025-07-03 11:16:05 +02:00
Julien
011c7fe87d
Merge pull request #16820 from prymitive/discoveryRace
discovery: fix a race in ApplyConfig while Prometheus is being stopped
2025-07-03 10:52:59 +02:00
github-actions[bot]
3c25eb2a0d
Merge pull request #16815 from prometheus/dependabot/go_modules/github.com/oklog/run-1.2.0
build(deps): bump github.com/oklog/run from 1.1.0 to 1.2.0
2025-07-03 10:09:10 +02:00
Arthur Silva Sens
0502f2d8fb
Bump otlptranslator to latest SHA
Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
2025-07-02 14:55:51 -03:00
Bryan Boreham
74aca682b7
Merge pull request #16807 from bboreham/test-sizeoflabels
[TESTS] Labels: Add a test for SizeOfLabels
2025-07-02 18:44:10 +01:00
Lukasz Mierzwa
b49d143595 Fix a race in discovery manager ApplyConfig & shutdown
If we call ApplyConfig() at the same time the manager is being stopped we might end up hanging forever.
This is because ApplyConfig() will try to cancel obsolete providers and wait until they are cancelled.
It's done by setting a done() function that calls Done() on a sync.WaitGroup:

```
if len(prov.newSubs) == 0 {
	wg.Add(1)
	prov.done = func() {
		wg.Done()
	}
}
```

then calling prov.cancel() and finally waiting until all providers have run their done() function,
by blocking on a wg.Wait() call.

For each provider there is a goroutine created by calling Manager.startProvider(*Provider):

```
func (m *Manager) startProvider(ctx context.Context, p *Provider) {
	m.logger.Debug("Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
	ctx, cancel := context.WithCancel(ctx)
	updates := make(chan []*targetgroup.Group)

	p.mu.Lock()
	p.cancel = cancel
	p.mu.Unlock()

	go p.d.Run(ctx, updates)
	go m.updater(ctx, p, updates)
}
```

It creates a context that can be cancelled and that cancel function becomes prov.cancel. This is what ApplyConfig will call.
If we look at the body of updater() method:

```
func (m *Manager) updater(ctx context.Context, p *Provider, updates chan []*targetgroup.Group) {
	// Ensure targets from this provider are cleaned up.
	defer m.cleaner(p)
	for {
		select {
		case <-ctx.Done():
			return
[...]
```

we can see that it will exit if that context is cancelled and that will trigger a call to Manager.cleaner().
That cleaner() is where done() is called.
So ApplyConfig() -> calls cancel() -> causes cleaner() to be executed -> calls done().

cancel() is also called from cancelDiscoverers() method that will be called by Manager.Run() when Manager is stopping:

```
func (m *Manager) Run() error {
	go m.sender()
	<-m.ctx.Done()
	m.cancelDiscoverers()
	return m.ctx.Err()
}
```

The problem is that if we call both ApplyConfig and stop the manager at the same time we might end up with:

- We call Manager.ApplyConfig()
- We stop the Manager
- Manager.cancelDiscoverers() is called
- Provider.cancel() is called for every Provider
- cancel() causes provider context to be cancelled which terminates updater() for given Provider
- cancelling context causes cleaner() method to be called for given Provider
- cleaner() calls done() and exits
- Provider is considered stopped at this point, there is no goroutine running that will call done() anymore
- ApplyConfig iterates providers and decides that one is obsolete and must be stopped
- It sets a custom done() function body with a WaitGroup.Done() call in it
- Then ApplyConfig waits until all Providers run done()
- But they are all stopped and no done() will be run
- We wait forever

This only happens if cancelDiscoverers() runs before ApplyConfig. If ApplyConfig runs first, done() will be called;
if cancelDiscoverers() is called first, it will stop the updater() instances and so done() won't be called anymore.

Part of the problem is that there is no distinction between running and stopped providers. There is a Provider.IsStarted() method
that returns a bool based on the value of the cancel function, but ApplyConfig doesn't check it.
The second problem is that although there is a mutex on a Provider, it's not used much in the code, so two goroutines can try to read and/or write
provider.cancel and/or provider.done at the same time, making it all more likely to race.

The easiest way to fix it is to check if the provider is started inside ApplyConfig so we don't try to stop a provider that's already stopped.
For that we need to mark it as stopped after cancel() is called, by setting cancel to nil.
This also needs better lock usage to avoid different parts of the code trying to set cancel and done at the same time.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:03:10 +01:00
Lukasz Mierzwa
357e652044 Add a test for a rare shutdown hang
A config reload that needs to stop some providers can sometimes hang if SIGTERM is sent to Prometheus at the same time:

1: sync.WaitGroup.Wait [83 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
    sync         sema.go:110              runtime_SemacquireWaitGroup(*uint32(#166))
    sync         waitgroup.go:118         (*WaitGroup).Wait(*WaitGroup(#23))
    discovery    manager.go:276           (*Manager).ApplyConfig(#23, #167)
    main         main.go:964              main.func5(#120)
    main         main.go:1505             reloadConfig({#183, 0x1b}, 1, #40, #43, #50, {#31, 0xa, 0})
    main         main.go:1182             main.func22()
    run          group.go:38              (*Group).Run.func1(*Group(#26), #51)

Add a test for it.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:01:42 +01:00
wmTJc9IK0Q
c481aaf762
codemirror-promql: Preserve source files in npm package (#16804)
* Preserve source files in codemirror-promql package

This allows for sourcemaps to work when the package is imported via ESM-native CDNs such as esm.sh

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>

* Preserve source files in lezer-promql package

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>

---------

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>
2025-07-02 15:31:02 +02:00
jscheffner
1be2deec88 mixin: add uid to prometheus overview dashboard
Signed-off-by: jscheffner <jscheffner@users.noreply.github.com>
2025-07-02 15:02:50 +02:00
Julien
f62d0e0385
Merge pull request #16777 from roidelapluie/add-step-promql
Add step(), min() and max() in promql duration expressions
2025-07-02 14:27:45 +02:00
Julien
432f130a32 PromQL: min/max/step: Address review comments
Signed-off-by: Julien <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:36 +02:00
Julien Pivotto
984c8de0da PromQL: Fix printing +min()
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:17 +02:00
Julien Pivotto
3af0bdee68 PromQL: min/max/step: add more tests
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:17 +02:00
Julien Pivotto
ee7d5158a7 Add step(), min(a,b) and max(a,b) in promql duration expressions
step() is a new keyword introduced to represent the query step width in duration expressions.

min(a,b) and max(a,b) return the min and max from two duration expressions.

Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:17 +02:00
Bryan Boreham
4eafbcae93 lint
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-07-02 09:56:28 +01:00
Bryan Boreham
e7ac3f440d [TESTS] Labels: Add a test for SizeOfLabels
This requires a bit of repetition to cover all the different builds, but
it seems worth checking that the function does what is expected.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-07-02 09:31:27 +01:00
Bryan Boreham
507227781b [REFACTOR] Labels: Extract test case data from TestLabels_String
So we can use them in other tests.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-07-02 09:31:25 +01:00
Julius Volz
bfbae39931
Merge pull request #16716 from charleskorn/charleskorn/binops-docs
docs: clarify and expand binary operations documentation
2025-07-02 10:02:17 +02:00
dependabot[bot]
6bb7e088c5
build(deps): bump github.com/oklog/run from 1.1.0 to 1.2.0
Bumps [github.com/oklog/run](https://github.com/oklog/run) from 1.1.0 to 1.2.0.
- [Release notes](https://github.com/oklog/run/releases)
- [Commits](https://github.com/oklog/run/compare/v1.1.0...v1.2.0)

---
updated-dependencies:
- dependency-name: github.com/oklog/run
  dependency-version: 1.2.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-07-01 23:42:33 +00:00
Charles Korn
d19a9ab673
Remove other instances of "obvious"
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:13:46 +10:00
Charles Korn
1977452331
Address PR feedback: adjust docs to match current behaviour
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:10:20 +10:00
Charles Korn
665eb3d6cb
Address PR feedback: remove use of "obvious"
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:08:18 +10:00
Charles Korn
70df21a680
Address PR feedback: format Inf and NaN as monospace
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:07:07 +10:00
Charles Korn
9c6916f4f9
Address PR feedback: add blank lines before lists
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:06:15 +10:00
Arve Knudsen
d902abc50d
config.ScrapeConfig.Validate: Fix MetricNameEscapingScheme error messages (#16801)
* config.ScrapeConfig.Validate: Fix MetricNameEscapingScheme error messages

---------

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-06-30 15:05:03 +00:00
Bartlomiej Plotka
2a88f562d1
Merge pull request #16800 from prometheus/merge-rel-2.53
Merge branch 'release-2.53' into main
2025-06-30 12:37:57 +01:00
bwplotka
f418ea651c Merge branch 'release-2.53' into merge-rel-2.53
Lots of conflicts so I only ported CHANGELOG.md
2025-06-30 12:12:57 +01:00
Bartlomiej Plotka
d344ea7bf4
Merge pull request #16790 from prometheus/v2.53.4-deps
[RELEASE 2.53] Prepare 2.53.5 + Bump deps
v2.53.5
2025-06-30 10:38:19 +01:00
Björn Rabenstein
c3276ea40c
Merge pull request #16789 from gopherorg/main
chore: fix some function names in comment
2025-06-27 23:17:45 +02:00
bwplotka
488a420b6e Upgrade golangci-lint due to timeouts for v1 version.
Signed-off-by: bwplotka <bwplotka@gmail.com>
2025-06-27 16:28:17 +01:00
bwplotka
ddb9f4c70a Update npm packages.
Signed-off-by: bwplotka <bwplotka@gmail.com>
2025-06-27 16:28:17 +01:00
bwplotka
fd4a786443 Prepare 2.53.5 release.
Signed-off-by: bwplotka <bwplotka@gmail.com>
2025-06-27 16:28:13 +01:00
Björn Rabenstein
9e73fb43b3
Merge pull request #16773 from prometheus/beorn7/promql
promql: Re-introduce direct mean calculation
2025-06-27 14:57:12 +02:00
beorn7
ce809e625f promql: Re-introduce direct mean calculation for better accuracy
This commit brings back direct mean calculation (for `avg` and
`avg_over_time`) but isn't an outright revert of #16569. It keeps the
improved incremental mean calculation and generally features slightly
cleaner code than before.

Also, this commit...

- ...updates the lengthy comment explaining the whole situation and
  trade-offs.

- ...divides the running sum and the Kahan compensation term
  separately (in direct mean calculation) to avoid the (unlikely)
  possibility that sum and Kahan compensation together overflow
  float64.

- ...uncomments the tests that should now work again on darwin/arm64.

- ...uncomments the test that should now reliably yield the
  (inaccurate) value 0 on all hardware platforms. Also, the test
  description has been updated accordingly.

- ...adds avg_over_time tests for zero and one sample in the range.

Signed-off-by: beorn7 <beorn@grafana.com>
2025-06-27 14:34:46 +02:00
beorn7
f71daa7977 promql: Remove falsified comment from test
The test in question actually worked fine even before #16569. The
finding reported in the comment has turned out to be caused by
something else.

Signed-off-by: beorn7 <beorn@grafana.com>
2025-06-27 14:34:46 +02:00
beorn7
2b3fc1f115 promql: Add test cases for direct mean calculation
These demonstrate that direct mean calculation has some merits after
all.

Signed-off-by: beorn7 <beorn@grafana.com>
2025-06-27 14:34:46 +02:00
Łukasz Mierzwa
748fe6d825
Limit concurrency of scrape pool reloads (#16783)
To avoid possible overload.

As per https://github.com/prometheus/prometheus/pull/16595#issuecomment-3005027067 this changes the scrape pool manager to limit the number of scrape pools that can reload at the same time.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-06-27 12:34:07 +01:00