Two cases in compactBuckets caused a panic when fed malformed histogram
data (e.g. via a crafted protobuf message):
1. All spans have zero length: after the zero-length span removal pass,
   spans becomes empty. The subsequent loop called emptyBucketsHere(),
   which accessed spans[0] and panicked with index out of range.
   Fixed by the early return added in the previous commit (already on
   this branch via the roidelapluie/histogram-compact-zero-spans fix).
2. More buckets than spans describe: iSpan can reach len(spans) before
   all buckets are consumed, causing emptyBucketsHere() to access
   spans[iSpan] out of bounds.
   Fixed by adding iSpan < len(spans) to the loop guard.
Both fixes in compactBuckets are defensive layers. The primary fix is
in the protobuf parser: checkNativeHistogramConsistency now validates
that span total length matches bucket count before calling Compact(),
returning a proper error for malformed input instead of panicking.
Found by FuzzParseProtobuf.
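For illustration, the consistency check described above could look roughly
like the following sketch. The `span` type and the helper
`checkSpansMatchBuckets` are simplified, hypothetical stand-ins and not the
actual Prometheus code:

```go
package main

import "fmt"

// span is a simplified, hypothetical stand-in for a histogram span.
type span struct {
	Offset int32
	Length uint32
}

// checkSpansMatchBuckets returns an error unless the span lengths add up
// to exactly numBuckets. Hypothetical helper, for illustration only.
func checkSpansMatchBuckets(spans []span, numBuckets int) error {
	var total uint64
	for _, s := range spans {
		total += uint64(s.Length)
	}
	if total != uint64(numBuckets) {
		return fmt.Errorf("spans describe %d buckets, but %d buckets are present", total, numBuckets)
	}
	return nil
}

func main() {
	// Consistent input: 2 + 1 = 3 buckets.
	fmt.Println(checkSpansMatchBuckets([]span{{0, 2}, {1, 1}}, 3))
	// Malformed input: only zero-length spans, yet one bucket is present.
	fmt.Println(checkSpansMatchBuckets([]span{{0, 0}}, 1))
}
```

Rejecting such input before compaction means the loop guard only has to act
as a second line of defense.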
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
getMagicLabel had no bounds check on the quantile slice for the Summary
case. fieldsDone for an empty-quantile summary is set inside Series(),
not getMagicLabel. A caller driving Next() without calling Series() at
the _sum step would allow fieldPos to advance to 0 and index into an
empty slice.
Add the same out-of-bounds guard that the histogram branch already has,
and a regression test that exercises Next()-only iteration over a
summary with no quantiles.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
When a label-name position is followed by comma or brace-close, only
treat it as a metric name shorthand if the token was a double-quoted
string (tQString). Bare identifiers must be followed by an equal sign.
Add tests for bare identifier inputs that previously could panic.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
As for float samples, Kahan summation is used for the `sum` and `avg` aggregation and for the respective `_over_time` functions.
Kahan summation is not perfect. This commit also adds tests that even Kahan summation cannot reliably pass.
These tests are commented out.
Note that the behavior might be different on other hardware platforms. We have to keep an eye on tests failing on other hardware platforms and adjust them accordingly.
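To illustrate what compensated summation buys us, here is a minimal sketch for plain floats. It uses the Neumaier variant of Kahan summation, which is an assumption of this sketch, and the names `compensatedAdd`, `compensatedSum`, and `naiveSum` are hypothetical:

```go
package main

import (
	"fmt"
	"math"
)

// compensatedAdd adds inc to sum and tracks the rounding error in c
// (Neumaier variant of Kahan summation). Hypothetical helper.
func compensatedAdd(inc, sum, c float64) (float64, float64) {
	t := sum + inc
	switch {
	case math.IsInf(t, 0):
		// The compensation term is meaningless once the sum overflows.
		c = 0
	case math.Abs(sum) >= math.Abs(inc):
		c += (sum - t) + inc
	default:
		c += (inc - t) + sum
	}
	return t, c
}

// compensatedSum sums vals with a running compensation term.
func compensatedSum(vals []float64) float64 {
	var sum, c float64
	for _, v := range vals {
		sum, c = compensatedAdd(v, sum, c)
	}
	return sum + c
}

// naiveSum sums vals with plain float64 addition.
func naiveSum(vals []float64) float64 {
	var sum float64
	for _, v := range vals {
		sum += v
	}
	return sum
}

func main() {
	vals := []float64{1e100, 1, -1e100}
	fmt.Println("naive:      ", naiveSum(vals))
	fmt.Println("compensated:", compensatedSum(vals))
}
```

With plain addition, the `1` is absorbed by `1e100` and lost. The compensation term recovers it.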
Signed-off-by: Aleksandr Smirnov <5targazer@mail.ru>
This change fixes an issue introduced in #17707. When a regex
consisting of a wildcard, a literal, and a final wildcard surrounded
by a capture group was parsed, the capture group was not removed
first, which prevented `optimizeConcatRegex` from running.
Found via fuzz testing.
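As a rough sketch of the idea (not the actual change), stripping the outer
capture group before inspecting the parsed regex could look like this with
Go's `regexp/syntax` package. The helper names are hypothetical:

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// stripOuterCaptures removes capture groups wrapping the whole expression.
// Hypothetical helper, for illustration only.
func stripOuterCaptures(re *syntax.Regexp) *syntax.Regexp {
	for re.Op == syntax.OpCapture {
		re = re.Sub[0]
	}
	return re
}

// isDotStar reports whether re is a `.*` wildcard.
func isDotStar(re *syntax.Regexp) bool {
	if re.Op != syntax.OpStar {
		return false
	}
	sub := re.Sub[0].Op
	return sub == syntax.OpAnyChar || sub == syntax.OpAnyCharNotNL
}

// isWildcardDelimitedConcat reports whether pattern, after removing outer
// capture groups, is a concatenation that starts and ends with `.*`.
func isWildcardDelimitedConcat(pattern string) (bool, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return false, err
	}
	re = stripOuterCaptures(re)
	if re.Op != syntax.OpConcat || len(re.Sub) < 3 {
		return false, nil
	}
	return isDotStar(re.Sub[0]) && isDotStar(re.Sub[len(re.Sub)-1]), nil
}

func main() {
	for _, p := range []string{"(.*foo.*)", ".*foo.*", "(foo)"} {
		ok, err := isWildcardDelimitedConcat(p)
		fmt.Println(p, ok, err)
	}
}
```

Without the stripping step, the top-level node is the capture group rather
than the concatenation, so a check like the one above never sees the
wildcard-literal-wildcard shape.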
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
#14173 introduced an optimisation to better handle regex patterns like .*-.*-.*. It identifies strings the pattern cannot possibly match (because they do not contain all of the literal values) and returns false from MatchString early.
However, if the string does contain all literal values, the Go regex engine is still used to confirm that the string matches the pattern. This is not necessary when the pattern starts and ends with .* and everything in between is either a literal or .*: if the string contains all of the literals in order, then it matches the pattern, and invoking Go's regex engine to confirm this is unnecessary and quite slow.
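A minimal sketch of the in-order check, assuming that .* matches any character (including newlines) and that the pattern is anchored. The function name `containsInOrder` is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// containsInOrder reports whether s contains all literals, in the given
// order and without overlap. For a pattern like .*a.*b.* this is enough
// to decide the match. Hypothetical helper, for illustration only.
func containsInOrder(s string, literals []string) bool {
	for _, lit := range literals {
		i := strings.Index(s, lit)
		if i < 0 {
			return false
		}
		// Continue searching after the end of this literal.
		s = s[i+len(lit):]
	}
	return true
}

func main() {
	// Literals of the pattern .*-.*-.*
	literals := []string{"-", "-"}
	fmt.Println(containsInOrder("a-b-c", literals))
	fmt.Println(containsInOrder("a-b", literals))
}
```

Taking the leftmost occurrence of each literal is sufficient here: if any assignment of literal positions exists, the leftmost one leaves the most room for the remaining literals.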
* Add some more test cases
* Add benchmark, since existing benchmark doesn't show much impact given most of the random test strings will not match the patterns.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
ReduceResolution is currently called before validation during
ingestion. This will cause a panic if there are not enough buckets in
the histogram. If there are too many buckets, the spurious buckets are
ignored, and therefore the error in the input histogram is masked.
Furthermore, invalid negative offsets might cause problems, too.
Therefore, we need to do some minimal validation in reduceResolution.
Fortunately, it is easy and shouldn't slow things down. Sadly, it
requires returning errors, which triggers a bunch of code changes.
There is a bright side even here: we can get rid of a few panics.
(Remember: Don't panic!)
In different news, we haven't done a full validation of histograms
read via remote-read. This is not so much a security concern (as you
can throw off Prometheus easily by feeding it bogus data via
remote-read) but more that remote-read sources might be makeshift and
could accidentally create invalid histograms. We really don't want to
panic in that case. So this commit does not only add a check of the
spans and buckets as needed for resolution reduction but also a full
validation during remote-read.
Signed-off-by: beorn7 <beorn@grafana.com>
Currently, iterating over histogram buckets can panic if the spans are
not consistent with the buckets. We aim to validate histograms upon
ingestion, but there might still be data corruptions on disk that
could trigger the panic. While data corruption on disk is really bad
and will lead to all kinds of weirdness, we should still avoid
panicking.
Note, though, that chunks are secured by checksums, so the corruptions
won't realistically happen because of disk faults, but more likely
because a chunk was generated in a faulty way in the first place, by
a software bug or even maliciously.
This commit prevents panics in the situation where there are fewer
buckets than described by the spans. Note that the missing buckets
will simply not be iterated over. There is no signalling of this
problem. We might still consider this separately, but for now, I would
say that this kind of corruption is exceedingly rare and doesn't
deserve special treatment (which will add a whole lot of complexity to
the code).
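To sketch the behavior described above: iteration simply stops once the
buckets run out, even if the spans describe more. The `span` type and the
helper names below are simplified, hypothetical stand-ins, not the actual
iterator code:

```go
package main

import "fmt"

// span is a simplified, hypothetical stand-in for a histogram span.
type span struct {
	Offset int32
	Length uint32
}

// visitBuckets calls fn for every bucket that is both described by the
// spans and actually present in buckets. Missing buckets are skipped
// silently instead of causing an out-of-range panic.
func visitBuckets(spans []span, buckets []int64, fn func(index int32, value int64)) {
	var index int32
	pos := 0
	for _, s := range spans {
		index += s.Offset
		for i := uint32(0); i < s.Length; i++ {
			if pos >= len(buckets) {
				return // Fewer buckets than the spans describe.
			}
			fn(index, buckets[pos])
			pos++
			index++
		}
	}
}

// collectIndexes returns the bucket indexes visited by visitBuckets.
func collectIndexes(spans []span, buckets []int64) []int32 {
	var out []int32
	visitBuckets(spans, buckets, func(index int32, _ int64) {
		out = append(out, index)
	})
	return out
}

func main() {
	// The spans describe four buckets, but only three are present.
	spans := []span{{0, 2}, {1, 2}}
	fmt.Println(collectIndexes(spans, []int64{5, 6, 7}))
}
```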
Signed-off-by: beorn7 <beorn@grafana.com>
Partially fixes https://github.com/prometheus/prometheus/issues/17416 by
renaming all CT* names to ST* in the whole codebase except RW2 (this is
done in a separate
[PR](https://github.com/prometheus/prometheus/pull/17411)) and the
PrometheusProto exposition proto.
```
CreatedTimestamp -> StartTimestamp
CreatedTimeStamp -> StartTimestamp
created_timestamp -> start_timestamp
CT -> ST
ct -> st
```
Signed-off-by: bwplotka <bwplotka@gmail.com>
`histogram.Error` becomes the generic wrapper type for all histogram errors.
When adding new errors, this makes it easier and less error-prone to check
whether an error is a histogram error, and less error-prone to convert
the errors.
This changes the type of those specific sentinel errors from error to
`histogram.Error`, but it should almost never matter.
E.g., `errors.Is(err, ErrHistogram...)` would still work out of the box.
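A minimal sketch of why this keeps working, assuming the wrapper is a struct
that embeds the underlying error. The type layout, the sentinel
`ErrExampleMismatch`, and the helper functions are hypothetical and only
serve the illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// Error is a sketch of a generic wrapper type for histogram errors.
// The real type may be defined differently.
type Error struct {
	error
}

// Unwrap exposes the wrapped error to errors.Is and errors.As.
func (e Error) Unwrap() error { return e.error }

// ErrExampleMismatch is a hypothetical sentinel of the wrapper type.
var ErrExampleMismatch = Error{errors.New("example: bucket count mismatch")}

// errUnrelated is a plain error that is not a histogram error.
var errUnrelated = errors.New("unrelated")

// wrapForIngest adds context to err the usual way.
func wrapForIngest(err error) error {
	return fmt.Errorf("ingest: %w", err)
}

// isHistogramError reports whether any error in the chain is an Error.
func isHistogramError(err error) bool {
	var target Error
	return errors.As(err, &target)
}

// isExampleMismatch checks for the specific sentinel.
func isExampleMismatch(err error) bool {
	return errors.Is(err, ErrExampleMismatch)
}

func main() {
	err := wrapForIngest(ErrExampleMismatch)
	fmt.Println(isExampleMismatch(err), isHistogramError(err))
	fmt.Println(isHistogramError(wrapForIngest(errUnrelated)))
}
```

Callers that compare against a specific sentinel keep using `errors.Is`,
while a single `errors.As` check is enough to recognize any histogram error.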
Signed-off-by: Laurent Dufresne <laurent.dufresne@grafana.com>
Fixes #17370
In Prometheus v3.7.0, using labelmap actions with replacement patterns
containing regex variables (e.g., `$1`, `${1}`) would fail validation
when `metric_name_validation_scheme` was set to `legacy`, causing
Prometheus to fail at startup with:
    "$1" is invalid 'replacement' for labelmap action
This was a regression as the same configuration worked in v3.6.0.
The issue was in the validation logic: while UTF-8 validation correctly
allowed `$` characters, legacy validation incorrectly used
`IsValidLabelName` which rejects `$` characters. The fix ensures legacy
validation uses `relabelTargetLegacy` regex which explicitly supports
regex template variables.
Added test cases to verify labelmap validation works with both `$1` and
`${1}` replacement patterns under legacy validation scheme.
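The difference between the two checks can be sketched as follows. Both
regular expressions are assumptions made for this illustration (the strict
one follows the well-known legacy label name rule, the permissive one
additionally accepts `$name` and `${name}` references) and are not copied
from the Prometheus source:

```go
package main

import (
	"fmt"
	"regexp"
)

// legacyLabelNameRE is the strict legacy label name rule. It rejects `$`.
var legacyLabelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// legacyTargetRE is a hypothetical pattern that additionally accepts
// regex template variables such as $1 and ${1}.
var legacyTargetRE = regexp.MustCompile(
	`^(?:[a-zA-Z_]|\$(?:\{\w+\}|\w+))(?:\w|\$(?:\{\w+\}|\w+))*$`)

// isLegacyLabelName applies the strict rule.
func isLegacyLabelName(s string) bool {
	return legacyLabelNameRE.MatchString(s)
}

// isValidLegacyReplacement applies the permissive rule.
func isValidLegacyReplacement(s string) bool {
	return legacyTargetRE.MatchString(s)
}

func main() {
	for _, r := range []string{"$1", "${1}", "prefix_$1", "foo-bar"} {
		fmt.Printf("%-10s name=%v replacement=%v\n",
			r, isLegacyLabelName(r), isValidLegacyReplacement(r))
	}
}
```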
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
Fixes #17255.
The implementation happens mostly in the Add and Sub methods, but the reconciliation works for all relevant operations. For example, you can now `rate` over a range wherein the custom bucket boundaries are changing.
Any custom bucket reconciliation is flagged with an info-level annotation.
---------
Signed-off-by: Linas Medziunas <linas.medziunas@gmail.com>
Signed-off-by: Linas Medžiūnas <linasm@users.noreply.github.com>
This adds:
* A `ScrapePoolConfig()` method to the scrape manager that allows getting
the scrape config for a given pool.
* An API endpoint at `/api/v1/targets/relabel_steps` that takes a pool name
and a label set of a target and returns a detailed list of applied
relabeling rules and their output for each step.
* A "show relabeling" link/button for each target on the discovery page
that shows the detailed flow of all relabeling rules (based on the API
response) for that target.
Note that this changes the JSON encoding of the relabeling rule config
struct to output the original snake_case (instead of camelCase) field names,
and before merging, we need to be sure that's ok :) See my comment about
that at https://github.com/prometheus/prometheus/pull/15383#issuecomment-3405591487
Fixes https://github.com/prometheus/prometheus/issues/17283
Signed-off-by: Julius Volz <julius.volz@gmail.com>
The detailed plan for this is laid out in
https://github.com/prometheus/prometheus/issues/16572 .
This commit adds a global and local scrape config option
`scrape_native_histograms`, which has to be set to true to ingest
native histograms.
To ease the transition, the feature flag is changed to simply set the
default of `scrape_native_histograms` to true.
Further implications:
- The default scrape protocols now depend on the
`scrape_native_histograms` setting.
- Everywhere else, histograms are now "on by default".
Documentation beyond that for the feature flag and the scrape
config is deliberately left out. See
https://github.com/prometheus/prometheus/pull/17232 for that.
Signed-off-by: beorn7 <beorn@grafana.com>