mirror of
https://github.com/hashicorp/vault.git
synced 2025-08-14 18:47:01 +02:00
* docs change * Update website/content/docs/concepts/adaptive-overload-protection/index.mdx Co-authored-by: Paul Banks <pbanks@hashicorp.com> * Update website/content/docs/concepts/adaptive-overload-protection/index.mdx Co-authored-by: Paul Banks <pbanks@hashicorp.com> * Update website/content/docs/concepts/adaptive-overload-protection/index.mdx Co-authored-by: Paul Banks <pbanks@hashicorp.com> * adjust some replication verbiage --------- Co-authored-by: Paul Banks <pbanks@hashicorp.com>
99 lines
4.9 KiB
Plaintext
99 lines
4.9 KiB
Plaintext
---
|
|
layout: docs
|
|
page_title: 'Adaptive overload protection'
|
|
description: >-
|
|
Vault Enterprise provides adaptive overload protection to automatically
|
|
prevent workloads from overloading different resources of the Vault servers.
|
|
---
|
|
|
|
# Adaptive overload protection
|
|
|
|
@include 'alerts/enterprise-only.mdx'
|
|
|
|
Adaptive overload protection refers to a set of features in Vault Enterprise
|
|
that prevent client requests from overwhelming different server resources
|
|
leading to poor availability.
|
|
|
|
## Preventing overload
|
|
|
|
Vault currently supports one type of adaptive overload protection that prevents
|
|
Vault servers from being overwhelmed by write requests.
|
|
|
|
These protection measures are "Adaptive" in the sense that they automatically
|
|
and continuously adjust to maintain optimal performance for the current workload
|
|
and hardware resources available without any user tuning.
|
|
|
|
Load testing and tuning of appropriate limits is time-consuming for users during
|
|
initial setup. Even when clusters are carefully tuned during installation,
|
|
real-world workloads and hardware performance both change over time. A static
|
|
tuning will soon be suboptimal or even completely ineffective at preventing
|
|
overloads.
|
|
|
|
For example, an increase in disk latency caused by failing hardware might reduce
|
|
the server's available throughput. A static limit configured while disks were
|
|
performing at their peak would not protect the degraded system from overload. By
|
|
adaptively responding to current load and performance characteristics, Vault
|
|
Enterprise is able to provide long-term protection against overloads.
|
|
|
|
## Types of overload
|
|
|
|
There are many potential resources that could become a performance bottleneck in
|
|
a Vault Enterprise cluster. Different forms of adaptive overload protection
|
|
target specific components and workloads. This allows each one to be carefully
|
|
specialized and tuned to the needs of that sub-system. The sections below
|
|
describe specific mechanisms that prevent overload of particular subsystems and
|
|
protect against particular types of overloads.
|
|
|
|
## Write overload protection
|
|
|
|
In Vault Enterprise, all writes go through the `WALBackend` to allow for
|
|
replication to other clusters. This is true even if replication is not being
|
|
used. Vault performs batching or "group commit" for these writes to increases
|
|
throughput. Optimal throughput for a given storage backend is obtained when
|
|
there are enough write requests in the queue to fill the next batch. However, if
|
|
there are more requests queued than will fit in a batch, latencies start to grow
|
|
quickly as all writes have to wait behind multiple other batches.
|
|
|
|
In some cases, a sudden influx of write requests that exceeds Vault's hardware
|
|
capacity can result in the writes queueing for so long that every request times
|
|
out before the write can make it through the queue. This makes Vault effectively
|
|
unavailable to clients even though it is still processing requests and storing
|
|
data as fast as it can. This is illustrated in the test results shown below for
|
|
a workload of 100% logins.
|
|
|
|

|
|
|
|
This type of overload scenario can also delay WALs related to replication from being written,
|
|
leading to replication lag. This lag can cause data discrepancies between replicated clusters.
|
|
|
|
Adaptive Write Overload Protection prevents this scenario. It constantly
|
|
monitors the current state of the write queue and uses a carefully tuned
|
|
algorithm to allow just enough queueing to maximize throughput on the available
|
|
hardware while keeping latencies under control and unnecessary rejections to a
|
|
minimum. Writes required to replicate data to a secondary cluster are always handled
|
|
to avoid replication lag.
|
|
|
|
Write overload protection was added in Vault Enterprise 1.17 as a beta feature
|
|
and was disabled by default. Since Vault 1.18.0 it is enabled by default only
|
|
when using Raft Integrated Storage. Adaptive overload protection is disabled by
|
|
default for Consul and other storage backends.
|
|
|
|
To disable the feature use the [`adaptive_overload_protection` configuration
|
|
stanza](/vault/docs/configuration/adaptive-overload-protection).
|
|
|
|
### Metrics
|
|
|
|
Operators may wish to monitor metrics related to the write overload protection
|
|
controller. The most useful of these is the `reject_fraction` which represents
|
|
the controller's current estimate for the fraction of write requests that need
|
|
to be rejected to maintain optimal throughput and stability.
|
|
|
|
See the [wal.write_controller.reject_fraction metrics reference](/vault/docs/internals/telemetry/metrics/availability#vault-wal-write_controller-reject_fraction).
|
|
|
|
## Client handling of overloads
|
|
|
|
When Vault has reached capacity, new requests will be immediately rejected with
|
|
a retryable `503 - Service Unavailable`. See [Vault Server Temporarily
|
|
Overloaded](/vault/docs/concepts/adaptive-overload-protection/vault-server-temporarily-overloaded)
|
|
for additional considerations around handling this error correctly.
|