Rationales:
* metadata-wal-records might be deprecated and replaced going forward: https://github.com/prometheus/prometheus/issues/15911
* PRW 2.0 works without metadata just fine (although it sends untyped metrics as expected).
Signed-off-by: bwplotka <bwplotka@gmail.com>
Around Mimir compactions we see the logging in ShardedPostings make massive allocations and drive GC up to 50% of CPU.
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
'defer' only runs at the end of the function, so explicitly close the
querier after we finish with it. Also check it didn't error.
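A rough sketch of the pattern (a minimal illustrative example, not the exact code in this change; it assumes the two-argument db.Querier(mint, maxt) signature):

    func queryBlock(db *tsdb.DB, mint, maxt int64) error {
        q, err := db.Querier(mint, maxt)
        if err != nil {
            return err
        }
        // ... use q ...
        // 'defer q.Close()' would only run when the enclosing function
        // returns; close explicitly once we're done and check the error.
        if err := q.Close(); err != nil {
            return err
        }
        return nil
    }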
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
'defer' only runs at the end of the function, so introduce some more
functions / move the start, so that 'defer' can run at the end of the
logical block.
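Roughly the shape of the change, with hypothetical names (processAll, processOne and blockRange are illustrative only):

    // Before: a defer inside a long loop only fires when the whole
    // function returns. After: each iteration is its own function,
    // so the defer runs at the end of the logical block.
    func processAll(db *tsdb.DB, ranges []blockRange) error {
        for _, r := range ranges {
            if err := processOne(db, r); err != nil {
                return err
            }
        }
        return nil
    }

    func processOne(db *tsdb.DB, r blockRange) error {
        q, err := db.Querier(r.mint, r.maxt)
        if err != nil {
            return err
        }
        defer q.Close() // now runs when this helper returns, i.e. per block
        // ... use q ...
        return nil
    }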
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Compact() is an exported function that handles locking on its own, so we shouldn't hold a lock around it.
Signed-off-by: Lukasz Mierzwa <lukasz@cloudflare.com>
We don't hold db.mtx lock when trying to read db.blocks here so we need a read lock around this loop.
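Sketch of the fix (the loop body is a placeholder, not the actual code):

    // db.blocks may be replaced concurrently by reloadBlocks(),
    // so reading it requires db.mtx held for reading.
    db.mtx.RLock()
    for _, b := range db.blocks {
        _ = b // ... read-only work on each block ...
    }
    db.mtx.RUnlock()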
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
This test ensures that running db.reloadBlocks() and db.CleanTombstones() at the same time doesn't race.
The problem is that CleanTombstones() is a public method while reloadBlocks() is internal.
CleanTombstones() takes the db.cmtx lock, while reloadBlocks() is not protected by any locks at all; it expects the public method through which it was called to do that.
So having a race between these two is not unexpected and we shouldn't really be testing this.
db.cmtx ensures that no other function can modify the list of open blocks, so the scenario tested here cannot happen.
If it did happen it would only be because some other method doesn't acquire the db.cmtx lock, something this test cannot detect.
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
This partially reverts ae3d392aa9c3a5c5f92f8116738c5b32c98b09a7.
ae3d392aa9c3a5c5f92f8116738c5b32c98b09a7 added a call to db.mtx.Lock() that lasts for the entire duration of db.reloadBlocks();
previously db.mtx was locked only during the critical part of db.reloadBlocks().
The motivation was to protect against races:
9e0351e161 (r555699794)
The 'reloads' being mentioned are (I think) reloadBlocks() calls, rather than db.reload() or other methods.
TestTombstoneCleanRetentionLimitsRace was added to catch this but I wasn't able to ever get any error out of it, even after disabling all calls to db.mtx in reloadBlocks() and CleanTombstones().
To make things more complicated CleanTombstones() itself calls reloadBlocks(), so it seems that the real issue is that we might have concurrent calls to reloadBlocks().
The problem with this change is that db.reloadBlocks() can take a very long time, because it might need to load very large blocks from disk, which is slow.
While db.mtx is locked a large part of the DB is blocked, including queries, since a db.mtx read lock is needed for the db.Querier() call.
One way this manifests itself is as a gap in all metrics and blocked queries just after a large block compaction happens.
When compaction merges multiple day-or-more blocks into a week-or-more block it creates a single very big block.
After that block is written it needs to be loaded, and that seems to take many seconds (30-45), during which mtx is held and everything is blocked.
Turns out that there is another lock that is more fine grained and aimed at this specific use case:
// cmtx ensures that compactions and deletions don't run simultaneously.
cmtx sync.Mutex
All calls to reloadBlocks() are wrapped inside cmtx lock. The only exception is db.reload() which this change fixes.
We can't add cmtx lock inside reloadBlocks() itself because it's called by a number of functions, some of which are already holding cmtx.
Looking at the code I think it is sufficient to hold cmtx and skip the reloadBlocks()-wide mtx lock.
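So the fix is, roughly (a simplified sketch of the idea, not the full diff): make db.reload() take cmtx like every other caller of reloadBlocks(), and let reloadBlocks() take db.mtx only around the short critical section where the block list is swapped:

    func (db *DB) reload() error {
        db.cmtx.Lock() // serialise with compactions and deletions
        defer db.cmtx.Unlock()
        if err := db.reloadBlocks(); err != nil {
            return fmt.Errorf("reloadBlocks: %w", err)
        }
        // ... rest of reload() ...
        return nil
    }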
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Fix issues raised by staticcheck
We are not enabling staticcheck explicitly, though, because it has too many false positives.
---------
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
When creating dummy data for benchmarks, call `Commit()` periodically to
avoid growing the appender to enormous size.
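For example, something along these lines (a sketch inside a benchmark, assuming head is a *tsdb.Head and b is the *testing.B; not the exact benchmark code):

    app := head.Appender(context.Background())
    for i := 0; i < numSeries; i++ {
        _, err := app.Append(0, labels.FromStrings("foo", strconv.Itoa(i)), ts, 0)
        if err != nil {
            b.Fatal(err)
        }
        // Commit periodically so the open appender stays small instead
        // of accumulating every sample before one huge final Commit().
        if (i+1)%1000 == 0 {
            if err := app.Commit(); err != nil {
                b.Fatal(err)
            }
            app = head.Appender(context.Background())
        }
    }
    if err := app.Commit(); err != nil {
        b.Fatal(err)
    }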
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Exported the CheckpointPrefix constant to be used in other packages.
Updated references to the constant in db.go and checkpoint.go files.
This change improves code readability and maintainability.
Signed-off-by: johncming <johncming@yahoo.com>
Co-authored-by: johncming <conjohn668@gmail.com>
This enables it to take advantage of a more compact data structure
since all postings are known to be `*ListPostings`.
Remove the `Get` member which was not used for anything else, and fix up
tests.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Now we can call it with more specific types which is more efficient than
making everything go through the `Postings` interface.
Benchmark the concrete type.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
We need to create more postings entries so the merger has some work to do.
Not material for the regexp ones as they match so few series.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* [ENHANCEMENT] TSDB: Improve calculation of space used by labels
The labels for each series in the Head take up some space in the
Postings index, but far more space in the `memSeries` structure.
Instead of having the Postings index calculate this overhead, which is
a layering violation, have the caller pass in a function to do it.
Provide three implementations of this function for the three Labels
versions.
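A hedged sketch of the kind of callback a caller might pass in (the real helpers and their exact signatures may differ per labels implementation):

    // labelsSize estimates the memory taken by one series' labels by
    // summing name and value lengths; a slice-based implementation
    // would add per-label overhead on top of this.
    labelsSize := func(ls labels.Labels) uint64 {
        var size uint64
        ls.Range(func(l labels.Label) {
            size += uint64(len(l.Name) + len(l.Value))
        })
        return size
    }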
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Remove the 2 minute timeout as the default is 2 hours and wouldn't
interfere with the test. Otherwise the extra samples combined with
race detection can push the test over 2 minutes and make it fail.
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
The segment size was too low for the additional NHCB data, thus it created
more segments than expected. This meant that fewer records ended up in the lower
numbered segments, which meant more were kept.
FAIL: TestCheckpoint (4.05s)
FAIL: TestCheckpoint/compress=none (0.22s)
checkpoint_test.go:361:
Error Trace: /home/krajo/go/github.com/prometheus/prometheus/tsdb/wlog/checkpoint_test.go:361
Error: "0.8586956521739131" is not less than "0.8"
Test: TestCheckpoint/compress=none
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>