docs: add operational best practices guide (#6287)

* docs: add operational best practices guide Signed-off-by: ivan katliarchuk <ivan.katliarchuk@gmail.com> * docs: add operational best practices guide Signed-off-by: ivan katliarchuk <ivan.katliarchuk@gmail.com> * docs: add operational best practices guide Signed-off-by: ivan katliarchuk <ivan.katliarchuk@gmail.com> * docs: add operational best practices guide Signed-off-by: ivan katliarchuk <ivan.katliarchuk@gmail.com> * docs: add operational best practices guide Signed-off-by: ivan katliarchuk <ivan.katliarchuk@gmail.com> --------- Signed-off-by: ivan katliarchuk <ivan.katliarchuk@gmail.com>
2026-05-04 22:26:11 +02:00 · 2026-03-19 10:48:31 +00:00 · 2026-03-19 10:48:31 +00:00 · de04f349d7
commit de04f349d7
parent 541bdf36e2
3 changed files with 436 additions and 28 deletions
--- a/docs/advanced/domain-filter.md
+++ b/docs/advanced/domain-filter.md
@ -7,6 +7,12 @@ tags:

 # Domain Filter

+> **Important:** Domain filter flags express application-level intent — they are not an
+> enforcement boundary. Credentials (IAM policies, API token scopes, ACLs) are the real
+> enforcement boundary. A misconfigured or missing flag will expose all zones the credentials
+> can reach. Always scope API keys or IAM roles to only the specific zones external-dns manages;
+> use these flags to complement that boundary, not replace it.
+
 ExternalDNS has two modes for selecting which domains it manages: **plain domain filter** and **regex domain filter**.

 Using any of the regex filter flags enables the **regex domain filter** mode, which overrides and ignores the **plain domain filter** flags.
--- a/docs/advanced/events.md
+++ b/docs/advanced/events.md
@ -112,32 +112,8 @@ sequenceDiagram

 ## Supported Sources

-Events are emitted for all sources that External-DNS supports. The following table lists the sources and whether they currently emit events.
-If a source does not emit events, it may in the future.
+Events are emitted for all sources that External-DNS supports. Event support is being rolled out
+progressively — if a source does not yet emit events, it may in the future.

-| Source                 | Supported |
-|:-----------------------|:---------:|
-| `ambassador-host`      |           |
-| `connector`            |           |
-| `contour-httpproxy`    |           |
-| `crd`                  |           |
-| `empty`                |           |
-| `f5-transportserver`   |           |
-| `f5-virtualserver`     |           |
-| `fake`                 |     ✅    |
-| `gateway-grpcroute`    |           |
-| `gateway-httproute`    |           |
-| `gateway-tcproute`     |           |
-| `gateway-tlsroute`     |           |
-| `gateway-udproute`     |           |
-| `gloo-proxy`           |           |
-| `ingress`              |           |
-| `istio-gateway`        |           |
-| `istio-virtualservice` |           |
-| `kong-tcpingress`      |           |
-| `node`                 |           |
-| `openshift-route`      |           |
-| `pod`                  |           |
-| `service`              |           |
-| `skipper-routegroup`   |           |
-| `traefik-proxy`        |           |
+See the [sources reference](../sources/index.md#available-sources) for the full list and
+per-source event support status.
--- a/docs/advanced/operational-best-practices.md
+++ b/docs/advanced/operational-best-practices.md
@ -0,0 +1,426 @@
+---
+tags:
+  - advanced
+  - operations
+  - performance
+  - configuration
+---
+
+# Operational Best Practices
+
+This guide covers configuration recommendations for running external-dns reliably in production.
+It focuses on the interaction between flags and real-world deployment scenarios — scope, memory,
+scale, and observability — and complements the per-feature reference pages linked throughout.
+
+> If you have operational experience or best practices not covered here, please open a proposal
+> PR to share it with the community.
+
+## Production Readiness Checklist
+
+Use this as a quick review before deploying to production. Each item links to the relevant
+section below.
+
+**Resource scope**
+
+- [ ] Set [`--service-type-filter`](#reducing-the-informer-scope) to only the service types you
+  actually publish (e.g., `LoadBalancer`). The default watches Pods, EndpointSlices, and Nodes
+  unnecessarily for most deployments.
+- [ ] Add [`--label-filter` or `--annotation-filter`](#reducing-the-informer-scope) to further
+  limit which objects are cached.
+
+**Source configuration**
+
+- [ ] Only configure [`--source=`](#source-configuration-and-preflight-validation) types whose
+  CRDs are fully installed and established on the cluster. A missing CRD does not always produce
+  a clear error — it can manifest as a `context deadline exceeded` timeout or silent informer
+  staleness.
+- [ ] Grant RBAC `list` **and** `watch` for every resource type each configured source requires.
+  A missing `watch` permission lets external-dns start cleanly but freezes its view of the
+  cluster — DNS records drift silently with no crash and no log warning.
+- [ ] Scope RBAC to only the sources that are configured. Excess permissions hide misconfiguration
+  rather than surfacing it.
+- [ ] In multi-cluster deployments, use per-cluster source lists rather than a shared
+  configuration.
+- [ ] Validate against a staging environment that mirrors production CRD and RBAC profiles before
+  rolling out changes.
+
+**Scaling**
+
+- [ ] Scope resources at every level — service type, label, annotation, domain, zone ID. See
+  [Scope resources](#scope-resources).
+- [ ] Split into multiple instances for large zone sets or source mixes, each with a distinct --txt-owner-id` and non-overlapping domain scope. See [Split instances](#split-instances).
+- [ ] Tune reconcile frequency and raise `--request-timeout` on large clusters. See [Reduce reconcile pressure](#reduce-reconcile-pressure).
+
+**Observability**
+
+- [ ] Alert on [`external_dns_controller_consecutive_soft_errors`](#key-metrics) greater than 0
+  for more than one reconcile cycle.
+- [ ] Alert on a sustained increase in [`external_dns_source_errors_total`](#key-metrics) or
+  [`external_dns_registry_errors_total`](#key-metrics).
+- [ ] Enable [`--events-emit=RecordError`](#kubernetes-events-for-invalid-endpoints) to surface
+  misconfigured endpoints on the responsible Kubernetes resource.
+
+**Registry and ownership**
+
+- [ ] Set a unique `--txt-owner-id` per external-dns instance and avoid overlapping
+  `--domain-filter` scopes. Multiple instances writing to the same zone without distinct owner
+  IDs can produce conflict errors and, if a conflict causes a hard exit, a crashloop.
+  See [State Conflicts and Ownership](#state-conflicts-and-ownership).
+
+**Provider**
+
+- [ ] Configure batch change size and interval for your provider if you manage large or
+  frequently-changing zones. See [DNS provider API rate limits](rate-limits.md) for per-provider
+  flags.
+- [ ] Enable zone caching if your provider supports it. Zone enumeration is an API call on
+  every reconcile; caching it reduces provider API pressure significantly for stable zone sets.
+  See [Zone caching](#zone-list-caching) for supported providers and flags.
+- [ ] Scope provider credentials (API keys, IAM roles) to only the zones external-dns manages.
+  Zone filtering flags express intent but are not an enforcement boundary — the credentials are.
+  See [Scope provider credentials to specific zones](#scope-provider-credentials-to-specific-zones).
+
+---
+
+## Resource Scope and Memory
+
+### The service source watches more than just Services
+
+By default the `service` source registers Kubernetes informers for **Services, Pods, EndpointSlices,
+and Nodes**. Which informers are active depends on the service types in scope:
+
+| Active informers      | Triggered when                                  |
+|:----------------------|:------------------------------------------------|
+| Services              | Always                                          |
+| Pods + EndpointSlices | `NodePort` or `ClusterIP` services are in scope |
+| Nodes                 | `NodePort` services are in scope                |
+
+With no `--service-type-filter` set (the default), all service types are in scope and all four
+informers are started. On large clusters this has two consequences:
+
+1. **Steady-state memory**: external-dns holds an in-memory cache of every Pod, EndpointSlice,
+   and Node in the cluster — not only the ones relevant to DNS.
+2. **Startup memory burst**: the classic Kubernetes LIST code path fetches all objects of a type
+   into memory at once during initial informer sync. A Pod transformer is applied
+   ([`source/service.go:159`](../../source/service.go)) to reduce stored size, but as the comment
+   there notes:
+   > "If watchList is not used it will not prevent memory bursts on the initial informer sync."
+
+### Reducing the informer scope
+
+The most effective mitigation is to restrict the service types external-dns watches:
+
+```sh
+# Most clusters only need LoadBalancer — eliminates Pod, EndpointSlice, and Node informers entirely
+--service-type-filter=LoadBalancer
+```
+
+Combine with label or annotation filters to further limit the set of Service objects listed:
+
+```sh
+--label-filter=external-dns/enabled=true
+--annotation-filter=external-dns.alpha.kubernetes.io/hostname
+```
+
+The table below shows which informers are eliminated by `--service-type-filter`:
+
+| Filter value                | Informers removed                 |
+|:----------------------------|:----------------------------------|
+| `LoadBalancer`              | Pods, EndpointSlices, Nodes       |
+| `LoadBalancer,ExternalName` | Pods, EndpointSlices, Nodes       |
+| `ClusterIP`                 | Nodes                             |
+| `NodePort`                  | *(none — all informers required)* |
+
+> **Note:** The informer scope reduction is a side-effect of type filtering, not its primary
+> purpose. Always choose filters based on what DNS records you actually need to publish; the
+> memory reduction is a bonus.
+
+### Reducing startup memory bursts
+
+The memory burst during initial sync is a known limitation of the classic LIST code path in
+`client-go`. A streaming alternative called **WatchList** (`WatchListClient` feature gate) avoids
+the burst by receiving objects one-at-a-time via a Watch with `SendInitialEvents=true` rather than
+fetching all objects at once.
+
+The `WatchListClient` feature gate defaults to `true` in recent versions of client-go, so the
+burst is effectively eliminated when running the latest release of external-dns. On older
+releases, `--service-type-filter` is the primary mitigation. **Use the latest release.**
+
+> **Note:** Even with WatchList enabled, transformers and indexers on all informer types are
+> needed to reduce steady-state memory. Work is ongoing to add them consistently across sources.
+
+## Source Configuration and Preflight Validation
+
+External-dns fails fast when a configured source cannot initialize — **this is intentional.**
+A crashloop is a clear, explicit signal that the configuration is wrong. Silently skipping a
+broken source would mask the problem and make failures much harder to diagnose.
+
+Production safety should come from correct configuration and preflight validation, not from
+external-dns guessing what the user meant.
+
+### Failure modes are not always obvious
+
+The difficulty is that RBAC problems and missing CRDs do not always surface as a clean, explicit
+error. Depending on what is misconfigured, the failure mode can be a timeout, an empty result,
+or silent staleness rather than a crash:
+
+| Misconfiguration                 | Typical symptom                                            | Why it's subtle                                                                                       |
+|:---------------------------------|:-----------------------------------------------------------|:------------------------------------------------------------------------------------------------------|
+| CRD not installed                | `context deadline exceeded` after ~60s                     | Informer blocks waiting for cache sync; no "CRD not found" message                                    |
+| No LIST permission               | `403 Forbidden` → exit                                     | Usually clean, but error may reference an internal API path that is hard to map back to the source    |
+| LIST allowed, WATCH denied       | Informer starts, never receives updates                    | DNS records appear stale; no crash, no error logged after startup                                     |
+| Admission webhook misconfigured  | Source initializes successfully, changes silently rejected | External-dns sees no error; records are never created or updated                                      |
+
+The LIST-without-WATCH case is particularly dangerous: external-dns starts cleanly, reports
+healthy, but its view of the cluster is frozen at the point of last successful LIST. DNS records
+will drift from actual cluster state without any indication in logs or metrics.
+
+### Practices
+
+**Explicitly scope enabled sources.**
+Only configure `--source=` types that are fully supported on the target cluster — the right CRDs
+installed, RBAC granted, and any admission webhooks configured. Do not rely on any form of
+best-effort or graceful degradation.
+
+```sh
+# Configure only what is present and authorized on this cluster
+--source=service
+--source=ingress
+```
+
+**Install CRDs before enabling dependent sources.**
+For sources that depend on custom resources (Gateway API, Istio, CRD source), install the CRDs
+and verify they are established (`kubectl get crd <name>` shows `ESTABLISHED`) before adding the
+corresponding `--source=` flag. A missing CRD does not always produce a "not found" error — it
+can cause the informer cache sync to block and time out, surfacing as a generic `context deadline
+exceeded` at startup with no indication of which CRD is missing.
+
+**Use per-cluster source lists in multi-cluster deployments.**
+When managing clusters with different CRD profiles via Helm or ArgoCD, define source lists per
+cluster rather than sharing a single configuration. A value that works on a cluster with Gateway
+API installed will crash-loop on one without it.
+
+```yaml
+# values-cluster-a.yaml  (Gateway API installed)
+sources:
+  - service
+  - gateway-httproute
+
+# values-cluster-b.yaml  (no Gateway API)
+sources:
+  - service
+  - ingress
+```
+
+**Validate configuration early — fail in CI, not in production.**
+Add a startup check to your CI or pre-deployment pipeline using `--dry-run --once` against a
+staging cluster that mirrors the production CRD and RBAC profile. `--once` alone will apply real
+DNS changes; always pair it with `--dry-run` for validation. A crash in staging is cheap; a
+crashloop in production affects DNS for all managed records until the pod restarts.
+
+**Use minimal RBAC.**
+Grant external-dns only the API access it needs for the configured sources. Excess permissions
+are a security concern: if a source is accidentally added to the configuration, external-dns
+will silently start watching resources it was never intended to manage. Insufficient permissions
+for a configured source cause a crash on startup, which is the intended signal — but only if
+RBAC is scoped tightly enough to surface it.
+
+## Scaling on Large Clusters
+
+Scaling external-dns comes down to three principles applied in combination:
+
+### Scope resources
+
+The fewer Kubernetes objects external-dns watches and the fewer DNS zones it manages, the lower
+its steady-state memory, API call volume, and reconcile duration. Apply filters at every
+available level — service type, label, annotation, domain, and zone ID.
+
+See [Resource Scope and Memory](#resource-scope-and-memory) and
+[Domain Filter](domain-filter.md) for details.
+
+### Split instances
+
+A single external-dns instance managing a large number of zones or sources will have a large
+reconcile surface and long reconcile cycles. Splitting into multiple instances — each responsible
+for a distinct zone set, namespace, or source type — reduces per-instance load and makes
+failures smaller in blast radius.
+
+Each instance must have a distinct `--txt-owner-id` and non-overlapping `--domain-filter` or
+`--zone-id-filter` scopes to avoid ownership conflicts. See
+[State Conflicts and Ownership](#state-conflicts-and-ownership).
+
+### Reduce reconcile pressure
+
+Tune reconcile frequency to match your actual change rate rather than running at the default
+interval. Use event-driven reconciliation to react quickly to real changes while keeping
+background polling infrequent. Raise `--request-timeout` if informer cache sync exceeds the
+default on large clusters.
+
+For per-provider flags covering batch change sizing, record caching, and zone list caching, see
+[DNS provider API rate limits](rate-limits.md) and [Provider Notes](#provider-notes).
+
+## Observability
+
+### Key metrics
+
+The following metrics are the first place to look when diagnosing operational problems:
+
+| Metric                                             | When to alert                                                       |
+|:---------------------------------------------------|:--------------------------------------------------------------------|
+| `external_dns_controller_consecutive_soft_errors`  | > 0 for more than one reconcile cycle                               |
+| `external_dns_source_errors_total`                 | Sustained increase (Kubernetes API errors from informers)           |
+| `external_dns_registry_errors_total`               | Any increase (TXT / DynamoDB registry failures)                     |
+| `external_dns_controller_verified_records`         | Unexpected drop (records no longer owned by this instance)          |
+
+See [Available Metrics](../monitoring/metrics.md) for the full list.
+
+> **Future:** Work is in progress to add an `external_dns_source_invalid_endpoints` gauge
+> (partitioned by `record_type` and `source_type`) that resets and refills each reconcile cycle,
+> making it straightforward to alert on dropped endpoints without log grepping. Until that lands,
+> watch `external_dns_source_errors_total` and enable `--events-emit=RecordError` (see below).
+
+### Kubernetes Events for invalid endpoints
+
+Invalid endpoints — CNAME self-references, malformed MX/SRV records, unsupported alias types —
+are silently dropped by the dedup layer with only a log warning at default log levels. Without
+structured observability, the only way to discover them is to grep logs.
+
+Enable `RecordError` events to surface invalid endpoints directly on the responsible Kubernetes
+resource:
+
+```sh
+--events-emit=RecordError
+```
+
+```sh
+# Inspect invalid endpoints across the cluster
+kubectl get events --field-selector reason=RecordError
+
+# Or scoped to a specific resource
+kubectl describe ingress my-ingress
+```
+
+See [Kubernetes Events in External-DNS](events.md) for full documentation and the list of
+supported event types.
+
+## State Conflicts and Ownership
+
+External-dns detects desired vs. current state, computes a plan, and applies it — assuming the
+plan is internally consistent. When a provider returns a conflict error (HTTP 409 or equivalent),
+it means the current DNS state does not match what external-dns expects. **This is a state
+problem, not a software bug.** The correctness of annotations and desired state is the
+operator's responsibility. External-dns cannot auto-correct user-defined configuration: any
+automated correction risks removing or replacing DNS records that services depend on, making
+those services unreachable. Instead, external-dns does its best to make these problems visible
+so operators can fix them deliberately. It has no general conflict-resolution policy: it drops
+some well-known invalid records (such as CNAME self-references), but does not apply a subset,
+auto-correct arbitrary conflicts, or attempt partial best-effort behavior. Retrying the same
+request without changing the input or reconciling the external state will deterministically fail.
+
+```mermaid
+flowchart TD
+    A[Kubernetes resources] --> B[external-dns computes plan]
+    B --> C{Apply to DNS provider}
+    C -- Success --> D[DNS records updated]
+    D --> A
+    C -- Conflict error --> E[State mismatch detected]
+    E --> F["No auto-correction<br>auto-fix risks removing records<br>services depend on"]
+    F --> G[Problem surfaced<br>via logs / metrics / events]
+    G --> H[Operator fixes state]
+    H --> A
+```
+
+**Crashloop amplification.** A hard error that causes external-dns to exit leads to a crashloop:
+kubelet restarts the pod, informers resync with full LIST calls to the Kubernetes API for every
+watched resource type (Services, Pods, EndpointSlices, Nodes), and the same conflicting batch is
+attempted again. Each restart repeats the cycle, progressively increasing LIST traffic against
+the Kubernetes API server. On large clusters or at high restart frequency this can contribute to
+Kubernetes API throttling that affects other controllers and workloads — not just external-dns.
+
+```mermaid
+flowchart LR
+    A[Conflict error<br>from provider] --> B[Hard exit]
+    B --> C[kubelet restarts pod]
+    C --> D[Informers resync<br>full LIST calls to<br>Kubernetes API]
+    D --> E[Same conflicting<br>batch applied]
+    E --> A
+    D -. "pressure accumulates\non each restart" .-> F[Kubernetes API<br>throttling]
+```
+
+A hard error that kills the process does not increment `external_dns_controller_consecutive_soft_errors`
+— that metric tracks soft errors only. Monitor pod restarts via
+`kube_pod_container_status_restarts_total` and alert on crashloop backoff (`CrashLoopBackOff`
+status) to catch this early.
+
+**When you observe conflict errors, fix the state:**
+
+- Ensure a single external-dns instance owns each zone or record set. When using the TXT or
+  DynamoDB registry, use a distinct `--txt-owner-id` per instance and avoid overlapping
+  `--domain-filter` scopes.
+- Remove or update conflicting records in the DNS provider directly.
+- Review annotations and desired state for invalid record definitions — for example, mixing CNAME
+  with A/AAAA records for the same hostname, or a CNAME that points to itself.
+- Check for other controllers or automation writing to the same zone.
+- If the environment is in an inconsistent state during a migration or incident, scale
+  external-dns to zero until the state is reconciled, then scale it back up.
+
+> **Visibility:** Work is ongoing to make state problems as visible as possible before they
+> become incidents. Planned improvements include per-record-type metrics for rejected endpoints
+> and Kubernetes Events emitted directly on the responsible resource, so operators can alert on
+> conflicts without grepping logs. If you encounter a conflict or misconfiguration that is not
+> surfaced by existing metrics or events, please open an issue or submit a PR.
+
+## Provider Notes
+
+### Zone list caching
+
+On every reconcile, external-dns calls the provider API to enumerate the zones it is allowed to
+manage. For accounts with many zones, or providers with strict API rate limits, this enumeration
+can be a significant source of API traffic even when no DNS records are changing.
+
+Several providers support a zone list cache that stores the zone list in memory for a configurable
+TTL and re-fetches only after it expires. Set the TTL to reflect how often your zone list
+actually changes — for most deployments zones are added or removed rarely, so a value of `1h` or
+longer is appropriate.
+
+> **Note:** Zone list caching is distinct from record caching (`--provider-cache-time`), which
+> caches the DNS records within a zone. Both can be used together. See
+> [DNS provider API rate limits](rate-limits.md) for per-provider flags.
+
+### Scope provider credentials to specific zones
+
+Provider API keys or IAM roles should be scoped to only the zones external-dns is expected to
+manage. Granting access to all zones in an account has two consequences:
+
+- **Operational:** external-dns will enumerate and potentially modify every zone the credentials
+  can reach. A misconfigured filter or a missing `--txt-owner-id` can cause unintended changes
+  to zones outside the intended scope.
+- **Security:** a credential leak exposes every zone in the account, not just the ones
+  external-dns manages.
+
+Zone filtering flags express application-level intent and reduce API call volume, but they are
+not an enforcement boundary — the credentials are. See [Domain Filter](domain-filter.md) for
+details.
+
+### Batch API
+
+For zones with frequent or large change sets, individual per-record API calls can exhaust
+provider rate limits quickly. Where supported, a batch API significantly reduces call volume.
+The exact reduction varies by provider, but the general pattern is:
+
+| Approach   | API calls per sync                              |
+|:-----------|:------------------------------------------------|
+| Individual | Grows linearly with number of records changed   |
+| Batch      | Grows with number of batches, not record count  |
+
+When a batch submission fails (e.g., one record in the batch is misconfigured), providers
+typically fall back to individual per-record calls for that sync cycle, so a single bad record
+does not block DNS updates for the rest of the zone. See
+[DNS provider API rate limits](rate-limits.md) for per-provider batch flags.
+
+## See Also
+
+- [Flags reference](../flags.md) — complete flag listing with defaults
+- [DNS provider API rate limits](rate-limits.md) — batch sizing, provider cache, and rate-limit tuning
+- [Domain Filter](domain-filter.md) — domain and zone filtering, and the credential boundary distinction
+- [Kubernetes Events in External-DNS](events.md) — event types, sources, and consumption
+- [Available Metrics](../monitoring/metrics.md) — full metrics reference