16177 Commits

Author SHA1 Message Date
Nicolás Pazos
b43a07248f tsdb tests: fix mockIndex implementation
Signed-off-by: Nicolás Pazos <npazosmendez@gmail.com>
2025-07-10 15:59:38 -03:00
Owen Williams
d2f1f4fb27
config: Add UnderscoreEscapingWithoutSuffixes translation strategy (#16849)
The last permutation of the translation options does underscore translation but does not add suffixes.
This translation option already exists in Mimir as otel_metric_suffixes_enabled, indicating external demand for this strategy.
There is an accompanying update to prometheus-docs to explain the use of this mode: https://github.com/prometheus/docs/pull/2688

Signed-off-by: Owen Williams <owen.williams@grafana.com>
2025-07-10 11:27:23 -04:00
Björn Rabenstein
b7f984d6d2
Merge pull request #16585 from kapillamba4/fix/16393-strict
Convert PromQL tests to new syntax via basic migration mode
2025-07-10 15:45:38 +02:00
Björn Rabenstein
eb3ea163fa
promqltest: add tests for histogram_count(increase(...)) (#16854)
As `histogram_count` is playing tricks to improve performance, we
better make sure that the limitation of extrapolation below zero still
works as expected.

Signed-off-by: beorn7 <beorn@grafana.com>
2025-07-10 15:44:02 +02:00
Charles Korn
8397b738bf
docs: clarify docs for PromQL aggregation operators (#16837)
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-10 15:34:57 +02:00
Björn Rabenstein
362141370d
Merge pull request #16828 from prometheus/beorn7/histogram2
promql(histograms): scale a histogram the same as the count
2025-07-10 13:26:15 +02:00
Björn Rabenstein
0672a5b045
Merge pull request #16847 from prometheus/beorn7/promql
promqltest: Test NaN sample values for quantile aggregator
2025-07-10 11:16:12 +02:00
jingchanglu
9ddb21fccb chore: fix some function names in comment
Signed-off-by: jingchanglu <jingchanglu@outlook.com>
2025-07-10 14:43:25 +08:00
Dmitry Ponomaryov
b18272a572
Add template functions to support various use cases. (#16619)
Presumably, this will help with Loki alerts, but the added functionality is also generally useful.

For one, this enables `parseDuration` to also accept negative durations (as those are also used in PromQL by now).

This also adds a function `now` to return the evaluation time of the template (as seconds since epoch AKA Unix time) and a function `toDuration` (akin to `toTime`), which creates a Go `time.Duration` from a duration in seconds.
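
As a rough illustration of the conversions described above, here is a minimal Go sketch. The helpers `toDuration` and `nowSeconds` are hypothetical stand-ins written for this example, not the actual template function implementations; they only show the seconds-to-`time.Duration` mapping and the "seconds since epoch" representation the message refers to.

```go
package main

import (
	"fmt"
	"time"
)

// toDuration is a hypothetical stand-in for the template function of the
// same name: it turns a (possibly fractional or negative) number of seconds
// into a Go time.Duration.
func toDuration(seconds float64) time.Duration {
	return time.Duration(seconds * float64(time.Second))
}

// nowSeconds is a hypothetical stand-in for `now`: it expresses a given
// evaluation time as seconds since the Unix epoch.
func nowSeconds(evalTime time.Time) float64 {
	return float64(evalTime.UnixNano()) / 1e9
}

func main() {
	evalTime := time.Unix(1752100400, 0) // assumed evaluation time

	fmt.Println(toDuration(90))   // a positive duration
	fmt.Println(toDuration(-1.5)) // negative durations are representable too
	fmt.Println(nowSeconds(evalTime))
}
```

Because `time.Duration` is a signed integer count of nanoseconds, negative values need no special handling in the conversion itself.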

---------

Signed-off-by: Dmitry Ponomaryov <me@halje.ru>
Signed-off-by: Dmitry Ponomaryov <iamhalje@gmail.com>
2025-07-10 00:33:20 +02:00
machine424
846acc10bb chore(tsdb): remove NewLeveledCompactorWithChunkSize constructor as unused, library users can redefine it on their side
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-09 17:10:13 +01:00
machine424
020e803ee0 chore(discovery): remove unused StaticProvider struct, library users can easily define it on their side
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-09 17:10:13 +01:00
George Krajcsovits
1d79f0f47e
chore(tsdb): add a few more testcases for unlock of unlocked mtx 16332 (#16848)
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
2025-07-09 16:24:46 +02:00
Banana Duck
89f011ba13
fix: unlock of unlocked mutex (#16332)
* fix: unlock on unlocked mutex

Signed-off-by: Usama Alhanaqtah <a.usama@yandex.ru>

* test coverage

Signed-off-by: Usama Alhanaqtah <a.usama@yandex.ru>

---------

Signed-off-by: Usama Alhanaqtah <a.usama@yandex.ru>
Co-authored-by: alhanaqtah.usama <alhanaqtah.usama@DEV-254.local>
2025-07-09 15:37:55 +02:00
Björn Rabenstein
d86796863f
Merge pull request #16764 from bboreham/go-get-no-d
[BUILD] Don't specify -d for go get
2025-07-09 14:14:05 +02:00
beorn7
107e4a00c3 promqltest: Test NaN sample values for quantile aggregator
Signed-off-by: beorn7 <beorn@grafana.com>
2025-07-09 13:38:19 +02:00
Bryan Boreham
eea203702c
Prepare release 3.5.0-rc.1 (#16845)
This RC reverts the feature "OTLP: Support promoting OTel scope attributes".

Add the line back into the CHANGELOG for 3.5.0-rc.0, since we are not changing that version.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
v3.5.0-rc.1
2025-07-09 12:07:27 +01:00
Björn Rabenstein
181415c7b7
Merge pull request #16846 from liangmulu/main
docs: fix some minor issues in comments
2025-07-09 13:00:13 +02:00
liangmulu
b1a7df2c0c chore: fix some minor issues in comments
Signed-off-by: liangmulu <liangmulu@outlook.com>
2025-07-09 18:05:41 +08:00
Kapil Lamba
df0e034314 address code review comments
Signed-off-by: Kapil Lamba <kapillamba4@gmail.com>
2025-07-09 07:25:31 +05:30
Björn Rabenstein
d8c921804e
Merge pull request #16824 from afhassan/main
tsdb: add count of histogram samples to block stats
2025-07-08 20:16:13 +02:00
Björn Rabenstein
dbee82267a
Merge pull request #16725 from MichaHoffmann/mhoffmann/fix-topk-nan-arg-error-on-nonexisting-series
promql: fix topk error on NaN argument for non-existing series
2025-07-08 19:42:20 +02:00
beorn7
bcf7a822a0 promql: Prevent extrapolation below zero for histogram count
This deals with the count field of native histograms in the same way
as with simple float counters. It then scales the whole histogram with
the same factor as it has scaled the count. This will still allow
individual buckets to get extrapolated below zero, but maybe that is
fine.

This implements approach (2) as described in
https://github.com/prometheus/prometheus/issues/15976#issuecomment-3032095158
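
The idea described in this message can be sketched as follows. This is a simplified model under stated assumptions, not the code from the commit: `extrapolationFactor`, `histDelta` and `scaleHistogram` are hypothetical names, all durations are plain seconds, and other details of rate extrapolation (such as the threshold on how far to extrapolate) are left out.

```go
package main

import "fmt"

// histDelta is a hypothetical, stripped-down histogram increase: the change
// of the count and of each bucket between the first and last sample.
type histDelta struct {
	Count   float64
	Buckets []float64
}

// extrapolationFactor returns the factor by which the observed increase is
// scaled. If the counter would reach zero before the start of the range,
// extrapolation towards the start is limited to that point.
func extrapolationFactor(firstValue, delta, sampledInterval, durationToStart, durationToEnd float64) float64 {
	if delta > 0 && firstValue >= 0 {
		// Time it takes to go from the first sample back to zero,
		// assuming the same slope as observed between the samples.
		durationToZero := sampledInterval * (firstValue / delta)
		if durationToZero < durationToStart {
			durationToStart = durationToZero
		}
	}
	return (sampledInterval + durationToStart + durationToEnd) / sampledInterval
}

// scaleHistogram applies the factor derived from the count to the whole
// histogram, so individual buckets are scaled the same way as the count.
func scaleHistogram(d histDelta, factor float64) histDelta {
	out := histDelta{Count: d.Count * factor, Buckets: make([]float64, len(d.Buckets))}
	for i, b := range d.Buckets {
		out.Buckets[i] = b * factor
	}
	return out
}

func main() {
	// First sample has count 5, the count grew by 10 over 40s of samples,
	// and the range starts 30s before the first sample.
	factor := extrapolationFactor(5, 10, 40, 30, 0)
	fmt.Println(factor)
	fmt.Println(scaleHistogram(histDelta{Count: 10, Buckets: []float64{4, 6}}, factor))
}
```

Since the factor is derived from the count alone, a bucket whose first value is small relative to its own increase can still be extrapolated below zero, which is the caveat the message mentions.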

Signed-off-by: beorn7 <beorn@grafana.com>
2025-07-08 19:01:31 +02:00
Vlad Shulcz
19fa1ed008
test(rulefmt): fix description annotation index in TestParseFileSuccessWithAliases (#16839)
Signed-off-by: shulcz <vshulcz@gmail.com>
2025-07-08 18:38:34 +02:00
Björn Rabenstein
c565e95808
Merge pull request #16825 from prometheus/beorn7/histogram
promql: add tests to demonstrate extrapolation below zero
2025-07-08 16:42:56 +02:00
chenlujjj
a2735494e1
chore: complete error message in RegisterSDMetrics function (#14635)
Signed-off-by: chenlujjj <953546398@qq.com>
2025-07-08 12:05:24 +00:00
Arthur Silva Sens
4b9d0fb92f
Revert: OTLP Support including scope metadata as metric labels (#16842)
Reverts #16730 and #16760

This is being done because we've noticed a problem in the spec that could
lead to name collisions if attributes name, version or schema_url are added
to the scope. They would collide with the already reserved labels
otel_scope_name, otel_scope_version and otel_scope_schema_url.

Since this new configuration option never made it into a release, we can
safely remove it from the 3.5 release. We'll sort this out for the 3.6 release

Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
2025-07-08 10:37:19 +00:00
Ahmed Hassan
01be7bfb2e add NumFloatSamples to TSDB block stats
Signed-off-by: Ahmed Hassan <afayekhassan@gmail.com>
2025-07-07 13:48:18 -07:00
Lukasz Mierzwa
559fd44be6 Rename labels.go -> labels_slicelabels.go
labels.go is now holding slicelabels code, so let's rename it.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-07 12:37:42 +01:00
machine424
ffcba01c5a chore: do not hardcode required versions in README.md
add links to the sources of truth.

It's hard to keep up to date, the "go" one
is "wrong" (not really as an old 1.22 binary could still
download/use newer toolchains...) for example.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-07 08:42:31 +01:00
Charles Korn
1e58d792a5
storage/remote: fix "http: read on closed response body" errors if chunkedSeriesSet.Next is called again after the series set is exhausted (#16838)
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-07 09:23:34 +02:00
Michael Hoffmann
44ee5e2ad6 promql: fix topk error on NaN argument for non-existing series
Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
2025-07-07 06:19:39 +00:00
RaphSku
938e5cb62b
docs: Added documentation for promtool configuration with http.config.file (#16522)
Includes an example.

Signed-off-by: RaphSku <rapsku.dev@gmail.com>
2025-07-07 00:00:51 +02:00
beorn7
c0a13223e7 promql: add tests to demonstrate extrapolation below zero
This shows how float counters cannot go below zero when extrapolating
for rate/increase, and how histograms do not have that protection yet,
leading to an overestimation of the rate/increase.

This also demonstrates edge cases where the count extrapolation does
not need to be limited, but an individual bucket still goes below
zero.

Signed-off-by: beorn7 <beorn@grafana.com>
2025-07-06 23:42:55 +02:00
Michael Hoffmann
21b1536b5a
storage: add projection fields to select hints (#16423)
This commit adds Projection metadata to SelectHints so that downstream
storage implementations can use it to save effort when answering
Select calls.

Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
2025-07-06 12:57:19 +02:00
Arve Knudsen
f561aa795d
OTLP receiver: Generate target_info samples between the earliest and latest samples per resource (#16737)
* OTLP receiver: Generate target_info samples between the earliest and latest samples per resource

Modify the OTLP receiver to generate target_info samples between the earliest
and latest samples per resource instead of only one for the latest timestamp.
The samples are spaced lookback delta/2 apart.
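
A minimal sketch of the spacing described above, working on millisecond timestamps. The function name `targetInfoTimestamps` and the exact boundary handling (always emitting a final sample at the latest timestamp) are assumptions made for this example, not necessarily what the receiver does.

```go
package main

import "fmt"

// targetInfoTimestamps is a hypothetical helper: it returns the timestamps
// (in milliseconds) at which target_info samples would be generated for one
// resource, starting at the earliest sample and ending at the latest one,
// spaced half a lookback delta apart.
func targetInfoTimestamps(earliestMs, latestMs, lookbackDeltaMs int64) []int64 {
	interval := lookbackDeltaMs / 2
	if interval <= 0 || earliestMs >= latestMs {
		return []int64{latestMs}
	}
	var out []int64
	for ts := earliestMs; ts < latestMs; ts += interval {
		out = append(out, ts)
	}
	// Assumption: the latest timestamp always gets a sample of its own.
	return append(out, latestMs)
}

func main() {
	// Samples spanning 400s with a 5m lookback delta give a 150s spacing.
	fmt.Println(targetInfoTimestamps(0, 400_000, 300_000))
}
```

Spacing the samples by half the lookback delta means any instant between the earliest and latest sample has a target_info sample within the lookback window.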

---------

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-07-04 14:38:16 +00:00
Jon Kartago Lamida
819500bdbc
Add ByteSize method for Labels (#16717)
Add `ByteSize()` method to different labels implementations.
One use case is tracking the memory used by Labels.

Signed-off-by: Jon Kartago Lamida <me@lamida.net>
2025-07-04 15:09:01 +01:00
sujal shah
4408a6bcaf api: Create /status/tsdb/blocks endpoint.
This endpoint serves block data to the client.

Signed-off-by: sujal shah <sujalshah28092004@gmail.com>
2025-07-04 03:13:54 +05:30
machine424
c2d6e528e4
feat(discovery/kubernetes): allow attaching namespace metadata
to endpointslice, endpoints and pod roles

after injecting the labels for endpointslice, claude-4-sonnet
helped transpose the code and tests to endpoints and pod roles

fixes https://github.com/prometheus/prometheus/issues/9510
supersedes https://github.com/prometheus/prometheus/pull/13798

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Co-authored-by: Paul BARRIE <paul.barrie.calmels@gmail.com>
2025-07-03 19:41:08 +02:00
Arve Knudsen
5a5424cbc1
Consolidate around prometheus/common/model.ValidationScheme (#16806)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-07-03 15:37:46 +02:00
Bartlomiej Plotka
419d436a44
Merge pull request #16822 from prometheus/bump-otlptranslator
Bump otlptranslator to latest SHA
2025-07-03 12:40:31 +01:00
Matthias Loibl
61064cb774
Merge pull request #16819 from jscheffner/prometheus-dashboard-uid
mixin: add uid to prometheus overview dashboard
2025-07-03 11:16:05 +02:00
Julien
011c7fe87d
Merge pull request #16820 from prymitive/discoveryRace
discovery: fix a race in ApplyConfig while Prometheus is being stopped
2025-07-03 10:52:59 +02:00
dependabot[bot]
ce2e48f39e
build(deps): bump github.com/open-telemetry/opentelemetry-collector-contrib/processor/deltatocumulativeprocessor
Bumps [github.com/open-telemetry/opentelemetry-collector-contrib/processor/deltatocumulativeprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib) from 0.128.0 to 0.129.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-collector-contrib/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/CHANGELOG-API.md)
- [Commits](https://github.com/open-telemetry/opentelemetry-collector-contrib/compare/v0.128.0...v0.129.0)

---
updated-dependencies:
- dependency-name: github.com/open-telemetry/opentelemetry-collector-contrib/processor/deltatocumulativeprocessor
  dependency-version: 0.129.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-07-03 08:10:56 +00:00
github-actions[bot]
3c25eb2a0d
Merge pull request #16815 from prometheus/dependabot/go_modules/github.com/oklog/run-1.2.0
build(deps): bump github.com/oklog/run from 1.1.0 to 1.2.0
2025-07-03 10:09:10 +02:00
Ahmed Hassan
6d77b47d13 add numHistogramSamples to block stats
Signed-off-by: Ahmed Hassan <afayekhassan@gmail.com>
2025-07-02 19:52:04 -07:00
Arthur Silva Sens
0502f2d8fb
Bump otlptranslator to latest SHA
Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
2025-07-02 14:55:51 -03:00
Bryan Boreham
74aca682b7
Merge pull request #16807 from bboreham/test-sizeoflabels
[TESTS] Labels: Add a test for SizeOfLabels
2025-07-02 18:44:10 +01:00
Lukasz Mierzwa
b49d143595 Fix a race in discovery manager ApplyConfig & shutdown
If we call ApplyConfig() at the same time the manager is being stopped we might end up hanging forever.
This is because ApplyConfig() will try to cancel obsolete providers and wait until they are cancelled.
It's done by setting a done() function that calls Done() on a sync.WaitGroup:

```
if len(prov.newSubs) == 0 {
	wg.Add(1)
	prov.done = func() {
		wg.Done()
	}
}
```

then calling prov.cancel() and finally waiting until all providers have run their done() function,
by blocking on a wg.Wait() call.

For each provider there is a goroutine created by calling Manager.startProvider(*Provider):

```
func (m *Manager) startProvider(ctx context.Context, p *Provider) {
	m.logger.Debug("Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
	ctx, cancel := context.WithCancel(ctx)
	updates := make(chan []*targetgroup.Group)

	p.mu.Lock()
	p.cancel = cancel
	p.mu.Unlock()

	go p.d.Run(ctx, updates)
	go m.updater(ctx, p, updates)
}
```

It creates a context that can be cancelled and that cancel function becomes prov.cancel. This is what ApplyConfig will call.
If we look at the body of updater() method:

```
func (m *Manager) updater(ctx context.Context, p *Provider, updates chan []*targetgroup.Group) {
	// Ensure targets from this provider are cleaned up.
	defer m.cleaner(p)
	for {
		select {
		case <-ctx.Done():
			return
[...]
```

we can see that it will exit if that context is cancelled and that will trigger a call to Manager.cleaner().
That cleaner() is where done() is called.
So ApplyConfig() -> calls cancel() -> causes cleaner() to be executed -> calls done().

cancel() is also called from cancelDiscoverers() method that will be called by Manager.Run() when Manager is stopping:

```
func (m *Manager) Run() error {
	go m.sender()
	<-m.ctx.Done()
	m.cancelDiscoverers()
	return m.ctx.Err()
}
```

The problem is that if we call both ApplyConfig and stop the manager at the same time we might end up with:

- We call Manager.ApplyConfig()
- We stop the Manager
- Manager.cancelDiscoverers() is called
- Provider.cancel() is called for every Provider
- cancel() causes provider context to be cancelled which terminates updater() for given Provider
- cancelling context causes cleaner() method to be called for given Provider
- cleaner() calls done() and exits
- Provider is considered stopped at this point, there is no goroutine running that will call done() anymore
- ApplyConfig iterates providers and decides that one is obsolete and must be stopped
- It sets a custom done() function body with a WaitGroup.Done() call in it
- Then ApplyConfig waits until all Providers run done()
- But they are all stopped and no done() will be run
- We wait forever

This only happens if cancelDiscoverers() is run before ApplyConfig. If ApplyConfig runs first, done() will be called;
if cancelDiscoverers() is called first, it will stop the updater() instances and so done() won't be called anymore.

Part of the problem is that there is no distinction between running and stopped providers. There is Provider.IsStarted() method
that returns a bool based on the value of cancel function but ApplyConfig doesn't check it.
The second problem is that although there is a mutex on a Provider it's not used much in the code, so two goroutines can try to read and/or write
provider.cancel and/or provider.done at the same time, making it all more likely to race.

The easiest way to fix it is to check if the provider is started inside ApplyConfig so we don't try to stop a provider that's already stopped.
For that we need to mark it as stopped after cancel() is called, by setting cancel to nil.
This also needs better lock usage to avoid different parts of the code trying to set cancel and done at the same time.
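
The fix outlined above can be modelled in a few lines of Go. This is a simplified, self-contained sketch and not the actual patch: the `provider` type, `start`, `stop` and `stopObsolete` are hypothetical stand-ins for the discovery manager's Provider, startProvider(), cancelDiscoverers() and the relevant part of ApplyConfig(), and a channel replaces the context.

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a hypothetical, minimal model of a discovery provider.
type provider struct {
	mu     sync.Mutex
	cancel func() // nil means the provider is not running
	done   func() // called once by the cleanup step, if set
}

func (p *provider) isStarted() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.cancel != nil
}

// start launches the goroutine that plays the role of updater()/cleaner().
func (p *provider) start() {
	stopCh := make(chan struct{})
	p.mu.Lock()
	p.cancel = func() { close(stopCh) }
	p.mu.Unlock()
	go func() {
		<-stopCh
		p.mu.Lock()
		done := p.done
		p.done = nil
		p.mu.Unlock()
		if done != nil {
			done()
		}
	}()
}

// stop models the shutdown path: cancel and mark the provider as stopped.
func (p *provider) stop() {
	p.mu.Lock()
	cancel := p.cancel
	p.cancel = nil
	p.mu.Unlock()
	if cancel != nil {
		cancel()
	}
}

// stopObsolete models the config reload path: it only waits for providers
// that are still running, so an already stopped one cannot block it.
func stopObsolete(providers []*provider) {
	var wg sync.WaitGroup
	for _, p := range providers {
		p.mu.Lock()
		if p.cancel == nil { // already stopped, nobody would call done()
			p.mu.Unlock()
			continue
		}
		wg.Add(1)
		p.done = wg.Done
		cancel := p.cancel
		p.cancel = nil
		p.mu.Unlock()
		cancel()
	}
	wg.Wait()
}

func main() {
	ps := []*provider{{}, {}}
	for _, p := range ps {
		p.start()
	}
	ps[0].stop()     // shutdown got there first for this provider
	stopObsolete(ps) // must not wait for ps[0]
	fmt.Println("reload returned; still running:", ps[0].isStarted() || ps[1].isStarted())
}
```

Reading and clearing `cancel` and setting `done` inside the same critical section is what prevents the two paths from interleaving: whichever path takes the lock first owns the cancellation, and the other sees `cancel == nil` and skips the provider.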

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:03:10 +01:00
Lukasz Mierzwa
357e652044 Add a test for a rare shutdown hang
A config reload that needs to stop some providers can sometimes hang if SIGTERM is sent to Prometheus at the same time:

1: sync.WaitGroup.Wait [83 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
    sync         sema.go:110              runtime_SemacquireWaitGroup(*uint32(#166))
    sync         waitgroup.go:118         (*WaitGroup).Wait(*WaitGroup(#23))
    discovery    manager.go:276           (*Manager).ApplyConfig(#23, #167)
    main         main.go:964              main.func5(#120)
    main         main.go:1505             reloadConfig({#183, 0x1b}, 1, #40, #43, #50, {#31, 0xa, 0})
    main         main.go:1182             main.func22()
    run          group.go:38              (*Group).Run.func1(*Group(#26), #51)

Add a test for it.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:01:42 +01:00
wmTJc9IK0Q
c481aaf762
codemirror-promql: Preserve source files in npm package (#16804)
* Preserve source files in codemirror-promql package

This allows for sourcemaps to work when the package is imported via ESM-native CDNs such as esm.sh

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>

* Preserve source files in lezer-promql package

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>

---------

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>
2025-07-02 15:31:02 +02:00