vault/website/content/docs/concepts/adaptive-overload-protection/index.mdx
davidadeleon 259b3ac9ec
VAULT-29658: Docs update for AOP (#28238)
* docs change

* Update website/content/docs/concepts/adaptive-overload-protection/index.mdx

Co-authored-by: Paul Banks <pbanks@hashicorp.com>

* Update website/content/docs/concepts/adaptive-overload-protection/index.mdx

Co-authored-by: Paul Banks <pbanks@hashicorp.com>

* Update website/content/docs/concepts/adaptive-overload-protection/index.mdx

Co-authored-by: Paul Banks <pbanks@hashicorp.com>

* adjust some replication verbiage

---------

Co-authored-by: Paul Banks <pbanks@hashicorp.com>
2024-08-30 16:08:24 -04:00

99 lines
4.9 KiB
Plaintext

---
layout: docs
page_title: 'Adaptive overload protection'
description: >-
Vault Enterprise provides adaptive overload protection to automatically
prevent workloads from overloading different resources of the Vault servers.
---
# Adaptive overload protection
@include 'alerts/enterprise-only.mdx'
Adaptive overload protection refers to a set of features in Vault Enterprise
that prevent client requests from overwhelming different server resources
leading to poor availability.
## Preventing overload
Vault currently supports one type of adaptive overload protection that prevents
Vault servers from being overwhelmed by write requests.
These protection measures are "Adaptive" in the sense that they automatically
and continuously adjust to maintain optimal performance for the current workload
and hardware resources available without any user tuning.
Load testing and tuning of appropriate limits is time-consuming for users during
initial setup. Even when clusters are carefully tuned during installation,
real-world workloads and hardware performance both change over time. A static
tuning will soon be suboptimal or even completely ineffective at preventing
overloads.
For example, an increase in disk latency caused by failing hardware might reduce
the server's available throughput. A static limit configured while disks were
performing at their peak would not protect the degraded system from overload. By
adaptively responding to current load and performance characteristics, Vault
Enterprise is able to provide long-term protection against overloads.
## Types of overload
There are many potential resources that could become a performance bottleneck in
a Vault Enterprise cluster. Different forms of adaptive overload protection
target specific components and workloads. This allows each one to be carefully
specialized and tuned to the needs of that sub-system. The sections below
describe specific mechanisms that prevent overload of particular subsystems and
protect against particular types of overloads.
## Write overload protection
In Vault Enterprise, all writes go through the `WALBackend` to allow for
replication to other clusters. This is true even if replication is not being
used. Vault performs batching or "group commit" for these writes to increases
throughput. Optimal throughput for a given storage backend is obtained when
there are enough write requests in the queue to fill the next batch. However, if
there are more requests queued than will fit in a batch, latencies start to grow
quickly as all writes have to wait behind multiple other batches.
In some cases, a sudden influx of write requests that exceeds Vault's hardware
capacity can result in the writes queueing for so long that every request times
out before the write can make it through the queue. This makes Vault effectively
unavailable to clients even though it is still processing requests and storing
data as fast as it can. This is illustrated in the test results shown below for
a workload of 100% logins.
![Login workload telemetry graphs showing difference with and without adaptive overload protection for writes](/img/adaptive-overload-protection-writes.png)
This type of overload scenario can also delay WALs related to replication from being written,
leading to replication lag. This lag can cause data discrepancies between replicated clusters.
Adaptive Write Overload Protection prevents this scenario. It constantly
monitors the current state of the write queue and uses a carefully tuned
algorithm to allow just enough queueing to maximize throughput on the available
hardware while keeping latencies under control and unnecessary rejections to a
minimum. Writes required to replicate data to a secondary cluster are always handled
to avoid replication lag.
Write overload protection was added in Vault Enterprise 1.17 as a beta feature
and was disabled by default. Since Vault 1.18.0 it is enabled by default only
when using Raft Integrated Storage. Adaptive overload protection is disabled by
default for Consul and other storage backends.
To disable the feature use the [`adaptive_overload_protection` configuration
stanza](/vault/docs/configuration/adaptive-overload-protection).
### Metrics
Operators may wish to monitor metrics related to the write overload protection
controller. The most useful of these is the `reject_fraction` which represents
the controller's current estimate for the fraction of write requests that need
to be rejected to maintain optimal throughput and stability.
See the [wal.write_controller.reject_fraction metrics reference](/vault/docs/internals/telemetry/metrics/availability#vault-wal-write_controller-reject_fraction).
## Client handling of overloads
When Vault has reached capacity, new requests will be immediately rejected with
a retryable `503 - Service Unavailable`. See [Vault Server Temporarily
Overloaded](/vault/docs/concepts/adaptive-overload-protection/vault-server-temporarily-overloaded)
for additional considerations around handling this error correctly.