280 Commits

Author SHA1 Message Date
beorn7
7e82bdb75b tsdb: Fix commit order for mixed-typed series
Fixes https://github.com/prometheus/prometheus/issues/15177

The basic idea here is to divide the samples to be commited into (sub)
batches whenever we detect that the same series receives a sample of a
type different from the previous one. We then commit those batches one
after another, and we log them to the WAL one after another, so that
we hit both birds with the same stone. The cost of the stone is that
we have to track the sample type of each series in a map. Given the
amount of things we already track in the appender, I hope that it
won't make a dent. Note that this even addresses the NHCB special case
in the WAL.

This does a few other things that I could not resist to pick up on the
go:

- It adds more zeropool.Pools and uses the existing ones more
  consistently. My understanding is that this was merely an oversight.
  Maybe the additional pool usage will compensate for the increased
  memory demand of the map.

- Create the synthetic zero sample for histograms a bit more
  carefully. So far, we created a sample that always went into its own
  chunk. Now we create a sample that is compatible enough with the
  following sample to go into the same chunk. This changed the test
  results quite a bit. But IMHO it makes much more sense now.

- Continuing past efforts, I changed more namings of `Samples` into
  `Floats` to keep things consistent and less confusing. (Histogram
  samples are also samples.) I still avoided changing names in other
  packages.

- I added a few shortcuts `h := a.head`, saving many characters.

TODOs:

- Address @krajorama's TODOs about commit order and staleness handling.

Signed-off-by: beorn7 <beorn@grafana.com>
2025-09-17 19:22:25 +02:00
Bryan Boreham
aa12c0d4c3
Merge pull request #17074 from prymitive/logs
TSDB: Log when GC / block write starts
2025-09-02 12:55:12 +01:00
Bryan Boreham
8e133e100f
Merge pull request #17081 from prometheus/superq/if_err_nil
tsdb: Fixup err nil checks
2025-09-02 12:37:51 +01:00
beorn7
747c5ee2b1 Apply analyzer "modernize" to the whole codebase
See
https://pkg.go.dev/golang.org/x/tools/gopls/internal/analysis/modernize
for details.

This ran into a few issues (arguably bugs in the modernize tool),
which I will fix in the next commit, so that we have transparency what
was done automatically.

Beyond those hiccups, I believe all the changes applied are
legitimate. Even where there might be no tangible direct gain, I would
argue it's still better to use the "modern" way to avoid micro
discussions in tiny style PRs later.

Signed-off-by: beorn7 <beorn@grafana.com>
2025-08-27 14:48:41 +02:00
Lukasz Mierzwa
31282d67b7 Log when GC / block write starts
Right now Prometheus only logs when these operations are completed.
It's a bit surprising to see suddenly a message saying "I was busy doing X for the past N minutes"
so let's add a message when the operation starts, so it's easier to understand what Prometheus was doing at any point in time
when reading logs.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-08-26 10:30:22 +01:00
SuperQ
b87cbf0294
Fixup err nil checks
Cleanup double `if` statements for errors being nil / not-nil.

Signed-off-by: SuperQ <superq@gmail.com>
2025-08-25 17:37:02 +02:00
Bryan Boreham
498f63e60b
Merge pull request #17029 from pr00se/wal-checkpoint-dropped-samples
TSDB: use timestamps rather than WAL segment numbers to track how long deleted series should be retained in checkpoints
2025-08-20 11:15:10 +01:00
Ganesh Vernekar
a86d9a3858
Merge pull request #16925 from prometheus/codesome/stale-series-tracking
tsdb: Track stale series in the Head block based on stale sample
2025-08-19 15:35:19 -07:00
Ganesh Vernekar
7a947d3629 Track stale series in the Head
Signed-off-by: Ganesh Vernekar <ganesh.vernekar@reddit.com>
2025-08-19 15:07:27 -07:00
Patryk Prus
5cb0192626
Address linter errors
Signed-off-by: Patryk Prus <p@trykpr.us>
2025-08-08 14:25:14 -04:00
Patryk Prus
0fea41ed53
Refactor keep function to work for both agent and non-agent implementations
Signed-off-by: Patryk Prus <p@trykpr.us>
2025-08-08 14:12:47 -04:00
Patryk Prus
6875022873
Update head.walExpiries with record timestamps during WAL replay
Signed-off-by: Patryk Prus <p@trykpr.us>
2025-08-08 14:12:47 -04:00
Patryk Prus
218558f543
Store mint rather than the last WAL segment in head.walExpiries during head GC
Signed-off-by: Patryk Prus <p@trykpr.us>
2025-08-08 14:12:41 -04:00
Matthieu MOREL
cef219c31c chore: enable unused-receiver rule from revive
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-08-04 09:43:33 +00:00
Bryan Boreham
a11772234d
Merge pull request #16333 from colega/fix-series-create-gc-race
fix: race condition between series creation and garbage collection
2025-04-17 12:15:11 +01:00
machine424
a825d448da feat(tsdb/(head|agent)): dereference the pools at the end of the WL replay to
not wait for an extra GC cycle until the built-in cleanup mechanism
kicks in

See https://github.com/prometheus/prometheus/pull/15778

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-04-17 13:06:08 +02:00
Ben Kochie
a721daf981
Log WAL segment loading time (#16336)
Improve readability of "WAL segment loaded" by logging the duration
of each load. This helps make it easier to spot slow WAL file load
times.

Signed-off-by: SuperQ <superq@gmail.com>
2025-03-31 06:05:14 +02:00
Oleg Zaytsev
e4fe8d8684
Create memSeries with pendingCommit=true
This fixes TestHead_RaceBetweenSeriesCreationAndGC.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
2025-03-27 11:11:57 +01:00
Matthieu MOREL
5fa1146e21
chore: enable gci linter (#16245)
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-03-22 15:46:13 +00:00
Fiona Liao
37c2ebb5fd
Make out-of-order native histograms flag a no-op and always enable (#16207)
* Remove experimental out-of-order native histogram flag

This feature has been available in Prometheus since September 2024,
and has no known issues. Therefore proposing to remove the flag
entirely and always have it on. Note that there are still two
settings that need to be configured (out-of-order time window > 0
and native histograms enabled) for this feature to work.

Signed-off-by: Fiona Liao <fiona.liao@grafana.com>

* Update CHANGELOG

Signed-off-by: Fiona Liao <fiona.liao@grafana.com>

* Keep feature flag with warning

Signed-off-by: Fiona Liao <fiona.liao@grafana.com>

* Update CHANGELOG

Signed-off-by: Fiona Liao <fiona.liao@grafana.com>

* Update tsdb/head_append.go

Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: Fiona Liao <fiona.y.liao@gmail.com>

* Update CHANGELOG.md

Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: Fiona Liao <fiona.y.liao@gmail.com>

* Update tsdb/head_append.go

Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: Fiona Liao <fiona.y.liao@gmail.com>

* Additional cleanup of comments and test names

Signed-off-by: Fiona Liao <fiona.liao@grafana.com>

---------

Signed-off-by: Fiona Liao <fiona.liao@grafana.com>
Signed-off-by: Fiona Liao <fiona.y.liao@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
2025-03-18 10:59:02 +00:00
Patryk Prus
401dbacf2e
Add counters for unknown series references during WAL/WBL replay
Signed-off-by: Patryk Prus <p@trykpr.us>
2025-03-17 15:17:53 -04:00
Patryk Prus
61aa82865d
TSDB: keep duplicate series records in checkpoints while their samples may still be present (#16060)
Renames the head's deleted map to walExpiries, and creates entries for any
duplicate series records encountered during WAL replay, with the expiry set
to the highest current WAL segment number. Any subsequent WAL
checkpoints will see the duplicate series entry in the walExpiries map, and
keep the series record until the last WAL segment that could contain its
samples is deleted.

Other considerations:

WBL: series records aren't written to the WBL, so there are no duplicates to deal with
agent mode: has its own WAL replay logic that handles duplicate series records differently, and is outside the scope of this PR
2025-03-05 13:45:08 -05:00
Arve Knudsen
7cbf749096
Upgrade to github.com/oklog/ulid/v2 (#16168)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-03-05 16:03:25 +01:00
Bryan Boreham
42d55505f9
Merge pull request #12659 from prymitive/memChunk
Short-cut common memChunk operations
2025-02-25 11:33:56 +00:00
machine424
d644324407
feat(tsdb/(head|agent)): reuse pools across segments to avoid generating garbage during WL replay
This is part of the "reduce WAL replay overhead/garbage" effort to help with https://github.com/prometheus/prometheus/issues/6934.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-02-10 22:40:24 +01:00
Alan Protasio
9d1abbb9ed
Call PostCreation callback only after the new series is added to the mempotings (#15579)
Signed-off-by: alanprot <alanprot@gmail.com>
2025-01-28 12:11:58 +01:00
Bryan Boreham
ac4f8a5e23
[ENHANCEMENT] TSDB: Improve calculation of space used by labels (#13880)
* [ENHANCEMENT] TSDB: Improve calculation of space used by labels

The labels for each series in the Head take up some some space in the
Postings index, but far more space in the `memSeries` structure.

Instead of having the Postings index calculate this overhead, which is
a layering violation, have the caller pass in a function to do it.

Provide three implementations of this function for the three Labels
versions.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-12-16 09:42:52 +00:00
Pedro Tanaka
bab587b9dc
Agent: allow for ingestion of CT samples (#15124)
* Remove unused option from HeadOptions

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* Improve docs for appendable() method in head appender

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* Ingest CT (float) samples in Agent DB

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* allow for ingestion of CT native histogram

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* adding some verification for ct ts

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* Validating CT histogram before append and add newly created series to pending series

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* checking the wal for written samples

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* Checking for samples in test

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* adding case for validations

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* fixing comparison when dedupelabels is enabled

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* unite tests, use table testing

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* Implement CT related methods in timestampTracker for write storage

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* adding error case to test

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* removing unused fields

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* Updating lastTs for series when adding CT to invalidate duplicates

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* making sure that updating the lastTS wont cause OOO later on in Commit();

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

---------

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
2024-10-27 01:06:34 +01:00
Łukasz Mierzwa
b6e22cd346 Short-cut common memChunk operations
memChunk is a linked list, speed up some common operations when there's no need to iterate all elements on the list.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2024-10-25 12:19:20 +01:00
TJ Hoplock
6ebfbd2d54 chore!: adopt log/slog, remove go-kit/log
For: #14355

This commit updates Prometheus to adopt stdlib's log/slog package in
favor of go-kit/log. As part of converting to use slog, several other
related changes are required to get prometheus working, including:
- removed unused logging util func `RateLimit()`
- forward ported the util/logging/Deduper logging by implementing a small custom slog.Handler that does the deduping before chaining log calls to the underlying real slog.Logger
- move some of the json file logging functionality to use prom/common package functionality
- refactored some of the new json file logging for scraping
- changes to promql.QueryLogger interface to swap out logging methods for relevant slog sugar wrappers
- updated lots of tests that used/replicated custom logging functionality, attempting to keep the logical goal of the tests consistent after the transition
- added a healthy amount of `if logger == nil { $makeLogger }` type conditional checks amongst various functions where none were provided -- old code that used the go-kit/log.Logger interface had several places where there were nil references when trying to use functions like `With()` to add keyvals on the new *slog.Logger type

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
2024-10-07 15:58:50 -04:00
György Krajcsovits
44ebbb8458 Fix missing histogram copy in sampleRing
The specialized version of sample add to the ring:
func addH(s hSample, buf []hSample, r *sampleRing) []hSample
func addFH(s fhSample, buf []fhSample, r *sampleRing) []fhSample
already correctly copy histogram samples from the reused hReader, fhReader
buffers, but the generic version does not. This means that the
data is overwritten on the next read if the sample ring has seen histogram
and float samples at the same time and switched to generic mode.

The `genericAdd` function (which was commented anyway) is by now quite
different from the specialized functions so that this commit deletes
it.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
2024-10-02 13:57:28 +02:00
Carrie Edwards
14e3c05ce8
tsdb: Add support for ingestion of out-of-order native histogram samples (#14546)
Add support for ingesting OOO native histograms

* Add flag for enabling and disabling OOO native histogram ingestion
* Update OOO querying tests to include native histogram samples
* Add OOO head tests
* Add test for OOO native histogram counter reset headers

Signed-off-by: Carrie Edwards <edwrdscarrie@gmail.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Co-authored by: Carrie Edwards <edwrdscarrie@gmail.com>
Co-authored by: Jeanette Tan <jeanette.tan@grafana.com>
Co-authored by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Co-authored by: Fiona Liao <fiona.liao@grafana.com>
2024-09-17 11:19:06 +02:00
Nathan Baulch
50cd453c8f
chore: Fix typos (#14868)
* Fix typos

---------

Signed-off-by: Nathan Baulch <nathan.baulch@gmail.com>
2024-09-10 22:32:03 +02:00
George Krajcsovits
536d9f9ce9
BUGFIX: TSDB: panic in query during truncation with OOO head (#14831)
Check if headQuerier is nil before trying to use it.

* TestQueryOOOHeadDuringTruncate: unit test to check query during truncate
Regression test for #14822

* Simulate race between query and Compact()

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
2024-09-05 17:17:42 +01:00
Marco Pracucci
ef649d5968
Revert " Store mmMaxTime in same field as seriesShard"
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2024-08-26 08:56:16 +02:00
Oleg Zaytsev
0300ad58a9
Revert the option regardless of error
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
2024-07-30 11:31:31 +02:00
Oleg Zaytsev
d8e1b6bdfd
Store mmMaxTime in same field as seriesShard
We don't use seriesShard during DB initialization, so we can use the
same 8 bytes to store mmMaxTime, and save those during the rest of the
lifetime of the database.

This doesn't affect CPU performance.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
2024-07-30 10:20:29 +02:00
Bryan Boreham
d878146c70 TSDB: shrink memSeries by moving bools together
In each case the following member requires 8-byte alignment, so moving
one beside the other shrinks memSeries from 176 to 168 bytes, when
compiled with `-tags stringlabels`.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-07-15 16:09:48 +01:00
Bryan Boreham
709c5d6fc3
TSDB: Lock around access to labels in head under -tags dedupelabels (#14322)
* TSDB: Document what needs locking in memSeries

* TSDB: Lock around access to series labels

So we can modify them to reset the symbol-table.

* TSDB: Make label locking conditional on build tag

---------

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-07-05 10:11:32 +01:00
Oleg Zaytsev
fd1a89b7c8
Pass affected labels to MemPostings.Delete() (#14307)
* Pass affected labels to MemPostings.Delete

As suggested by @bboreham, we can track the labels of the deleted series
and avoid iterating through all the label/value combinations.

This looks much faster on the MemPostings.Delete call. We don't have a
benchmark on stripeSeries.gc() where we'll pay the price of iterating
the labels of each one of the deleted series.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
2024-06-18 10:28:56 +00:00
Alan Protasio
8894d65cd6
Fix head stats and hooks when replaying a corrupted snapshot (#14079)
* Fixing head stats and hooks when replaying a corrupted snapshot

Signed-off-by: alanprot <alanprot@gmail.com>

* Fixing create/removed series metrics

Signed-off-by: alanprot <alanprot@gmail.com>

* Refactoring to have common code between gc and flush method

Signed-off-by: alanprot <alanprot@gmail.com>

* Update tsdb/head.go

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>

* refactor

Signed-off-by: alanprot <alanprot@gmail.com>

* Update tsdb/head_test.go

Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>

* Update tsdb/head_test.go

Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>

---------

Signed-off-by: alanprot <alanprot@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
2024-05-24 22:43:21 -04:00
Nick Pillitteri
481f14e1c0
TSDB: Don't rely on integer overflow in head compaction check (#13755)
* TSDB: Don't compact the head block when empty

Don't compact the Head block if there have not yet been any samples
appended.

Previously, the logic for determining if the head should be compacted
relied on the default values for min and max time and integer overflow
when they were checked in `Head.compactable()`. The check in
`Head.compactable()` effectively did `math.MinInt64 - math.MaxInt64`
which overflowed and wrapped to `1`. Since `1` is less than `1.5`
times the chunk range, compaction did not happen. This was the correct
behavior but relying on overflow wrapping is surprising.

This change add a method for checking if the min and max time for the
head is unset and uses it to short-circuit compaction in that case.
It also replaces several explicit checks for the default value to
determine if the head has not yet had any samples added.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
2024-03-26 12:17:38 +01:00
Ben Ye
ceca6c4716
[ENHANCEMENT] TSDB: Log more statistics during startup (#13838)
* log chunk snapshot and mmap chunks replay duration together with total replay duration

Signed-off-by: Ben Ye <benye@amazon.com>
2024-03-26 11:16:27 +00:00
György Krajcsovits
4d4d822c36 Add native histograms to latency/duration metrics
Dogfood native histograms.
Allow dependent projects to migrate to native histograms.

I took the defaults from client_golang.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
2024-03-01 14:44:38 +01:00
Bryan Boreham
93b72ec5dd tsdb: create SymbolTables for labels as required
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-02-26 11:45:25 +00:00
Fiona Liao
52389647b2 Add type label to outOfOrderSamplesAppended metric
Signed-off-by: Fiona Liao <fiona.liao@grafana.com>
2024-02-19 15:24:39 +00:00
Bryan Boreham
cd4562d3a6
Merge pull request #13473 from bboreham/pure-mutex
tsdb: use cheaper Mutex on series
2024-01-30 09:57:08 +00:00
Marco Pracucci
501bc6419e
Add ShardedPostings() support to TSDB (#10421)
This PR is a reference implementation of the proposal described in #10420.

In addition to what described in #10420, in this PR I've introduced labels.StableHash(). The idea is to offer an hashing function which doesn't change over time, and that's used by query sharding in order to get a stable behaviour over time. The implementation of labels.StableHash() is the hashing function used by Prometheus before stringlabels, and what's used by Grafana Mimir for query sharding (because built before stringlabels was a thing).

Follow up work
As mentioned in #10420, if this PR is accepted I'm also open to upload another foundamental piece used by Grafana Mimir query sharding to accelerate the query execution: an optional, configurable and fast in-memory cache for the series hashes.

Signed-off-by: Marco Pracucci <marco@pracucci.com>
2024-01-29 11:57:27 +00:00
Bryan Boreham
66237c1996 tsdb: use cheaper Mutex on series
Mutex is 8 bytes; RWMutex is 24 bytes and much more complicated. Since
`RLock` is only used in two places, `UpdateMetadata` and `Delete`,
neither of which are hotspots, we should use the cheaper one.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-01-26 11:01:39 +00:00
Bryan Boreham
b9eab6e4b8
tsdb: simplify internal series delete function (#13261)
Lifting an optimisation from Agent code, `seriesHashmap.del` can use
the unique series reference, doesn't need to check Labels.
Also streamline the logic for deleting from `unique` and `conflicts` maps,
and add some comments to help the next person.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-01-25 11:57:54 +01:00