docs: provide documentation for Talos 1.6

Update documentation with new and updated flows.

Provide What's New for Talos 1.6.0.

Update Troubleshooting guide to cover more steps.

Make Talos 1.6 docs the default.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Andrey Smirnov 2023-12-08 18:22:56 +04:00
parent 9a185a30f7
commit d803e40ef2
GPG Key ID: FE042E3D4085A811
35 changed files with 786 additions and 700 deletions

View File

@ -1468,7 +1468,7 @@ func stopAndRemoveAllPods(stopAction cri.StopAction) runtime.TaskExecutionFunc {
// We remove pods with POD network mode first so that the CNI can perform
// any cleanup tasks. If we don't do this, we run the risk of killing the
// CNI, preventing the CRI from cleaning up the pod's netwokring.
// CNI, preventing the CRI from cleaning up the pod's networking.
if err = client.StopAndRemovePodSandboxes(ctx, stopAction, runtimeapi.NamespaceMode_POD, runtimeapi.NamespaceMode_CONTAINER); err != nil {
return err

View File

@ -113,7 +113,7 @@ version_menu = "Releases"
# A link to latest version of the docs. Used in the "version-banner" partial to
# point people to the main doc site.
url_latest_version = "/v1.5"
url_latest_version = "/v1.6"
# Repository configuration (URLs for in-page links to opening issues and suggesting changes)
# github_repo = "https://github.com/googley-example"
@ -142,11 +142,11 @@ prism_syntax_highlighting = false
[[params.versions]]
url = "/v1.6/"
version = "v1.6 (pre-release)"
version = "v1.6 (latest)"
[[params.versions]]
url = "/v1.5/"
version = "v1.5 (latest)"
version = "v1.5"
[[params.versions]]
url = "/v1.4/"

View File

@ -10,7 +10,6 @@ prevKubernetesRelease: "1.27.4"
theilaRelease: "v0.2.1"
nvidiaContainerToolkitRelease: "v1.13.5"
nvidiaDriverRelease: "535.54.03"
menu: main
---
## Welcome

View File

@ -7,7 +7,7 @@ description: "Table of supported Talos Linux versions and respective platforms."
| Talos Version | 1.5 | 1.4 |
|----------------------------------------------------------------------------------------------------------------|------------------------------------|------------------------------------|
| Release Date | 2023-08-17 | 2023-04-18 (1.4.0) |
| End of Community Support | 1.6.0 release (2023-12-15, TBD) | 1.5.0 release (2023-08-17) |
| End of Community Support | 1.6.0 release (2023-12-15) | 1.5.0 release (2023-08-17) |
| Enterprise Support | [offered by Sidero Labs Inc.](https://www.siderolabs.com/support/) | [offered by Sidero Labs Inc.](https://www.siderolabs.com/support/) |
| Kubernetes | 1.28, 1.27, 1.26 | 1.27, 1.26, 1.25 |
| Architecture | amd64, arm64 | amd64, arm64 |

View File

@ -4,13 +4,12 @@ no_list: true
linkTitle: "Documentation"
cascade:
type: docs
lastRelease: v1.6.0-alpha.1
lastRelease: v1.6.0
kubernetesRelease: "1.29.0"
prevKubernetesRelease: "1.28.3"
theilaRelease: "v0.2.1"
nvidiaContainerToolkitRelease: "v1.13.5"
nvidiaDriverRelease: "535.54.03"
preRelease: true
nvidiaDriverRelease: "535.129.03"
menu: main
---
## Welcome
@ -27,11 +26,11 @@ If you are just getting familiar with Talos, we recommend starting here:
### Community
- GitHub: [repo](https://github.com/siderolabs/talos)
- Slack: Join our [slack channel](https://slack.dev.talos-systems.io)
- Support: Questions, bugs, feature requests [GitHub Discussions](https://github.com/siderolabs/talos/discussions)
- Community Slack: Join our [slack channel](https://slack.dev.talos-systems.io)
- Matrix: Join our Matrix channels:
- Community: [#talos:matrix.org](https://matrix.to/#/#talos:matrix.org)
- Support: [#talos-support:matrix.org](https://matrix.to/#/#talos-support:matrix.org)
- Support: Questions, bugs, feature requests [GitHub Discussions](https://github.com/siderolabs/talos/discussions)
- Community Support: [#talos-support:matrix.org](https://matrix.to/#/#talos-support:matrix.org)
- Forum: [community](https://groups.google.com/a/siderolabs.com/forum/#!forum/community)
- Twitter: [@SideroLabs](https://twitter.com/talossystems)
- Email: [info@SideroLabs.com](mailto:info@SideroLabs.com)

View File

@ -1,449 +0,0 @@
---
title: "Troubleshooting Control Plane"
description: "Troubleshoot control plane failures for running cluster and bootstrap process."
aliases:
- ../guides/troubleshooting-control-plane
---
<!-- markdownlint-disable MD026 -->
In this guide we assume that Talos client config is available and Talos API access is available.
Kubernetes client configuration can be pulled from control plane nodes with `talosctl -n <IP> kubeconfig`
(this command works before Kubernetes is fully booted).
### What is the control plane endpoint?
The Kubernetes control plane endpoint is the single canonical URL by which the
Kubernetes API is accessed.
Especially with high-availability (HA) control planes, this endpoint may point to a load balancer or a DNS name which may
have multiple `A` and `AAAA` records.
Like Talos' own API, the Kubernetes API uses mutual TLS, client
certs, and a common Certificate Authority (CA).
Unlike general-purpose websites, there is no need for an upstream CA, so tools
such as cert-manager, Let's Encrypt, or products such
as validated TLS certificates are not required.
Encryption, however, _is_, and hence the URL scheme will always be `https://`.
By default, the Kubernetes API server in Talos runs on port 6443.
As such, the control plane endpoint URLs for Talos will almost always be of the form
`https://endpoint:6443`.
(The port is required, since it is not the `https` default of `443`.)
The `endpoint` above may be a DNS name or IP address, but it should be
directed to the _set_ of all controlplane nodes, as opposed to a
single one.
As mentioned above, this can be achieved by a number of strategies, including:
- an external load balancer
- DNS records
- Talos-builtin shared IP ([VIP]({{< relref "../talos-guides/network/vip" >}}))
- BGP peering of a shared IP (such as with [kube-vip](https://kube-vip.io))
Using a DNS name here is a good idea, since it allows any other option, while offering
a layer of abstraction.
It allows the underlying IP addresses to change without impacting the
canonical URL.
Unlike most services in Kubernetes, the API server runs with host networking,
meaning that it shares the network namespace with the host.
This means you can use the IP address(es) of the host to refer to the Kubernetes
API server.
For availability of the API, it is important that any load balancer be aware of
the health of the backend API servers, to minimize disruptions during
common node operations like reboots and upgrades.
It is critical that the control plane endpoint works correctly during cluster bootstrap phase, as nodes discover
each other using control plane endpoint.
### kubelet is not running on control plane node
The `kubelet` service should be running on control plane nodes as soon as networking is configured:
```bash
$ talosctl -n <IP> service kubelet
NODE 172.20.0.2
ID kubelet
STATE Running
HEALTH OK
EVENTS [Running]: Health check successful (2m54s ago)
[Running]: Health check failed: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused (3m4s ago)
[Running]: Started task kubelet (PID 2334) for container kubelet (3m6s ago)
[Preparing]: Creating service runner (3m6s ago)
[Preparing]: Running pre state (3m15s ago)
[Waiting]: Waiting for service "timed" to be "up" (3m15s ago)
[Waiting]: Waiting for service "cri" to be "up", service "timed" to be "up" (3m16s ago)
[Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", service "timed" to be "up" (3m18s ago)
```
If the `kubelet` is not running, it may be due to invalid configuration.
Check `kubelet` logs with the `talosctl logs` command:
```bash
$ talosctl -n <IP> logs kubelet
172.20.0.2: I0305 20:45:07.756948 2334 controller.go:101] kubelet config controller: starting controller
172.20.0.2: I0305 20:45:07.756995 2334 controller.go:267] kubelet config controller: ensuring filesystem is set up correctly
172.20.0.2: I0305 20:45:07.757000 2334 fsstore.go:59] kubelet config controller: initializing config checkpoints directory "/etc/kubernetes/kubelet/store"
```
### etcd is not running
By far the most likely cause of `etcd` not running is because the cluster has
not yet been bootstrapped or because bootstrapping is currently in progress.
The `talosctl bootstrap` command must be run manually and only _once_ per
cluster, and this step is commonly missed.
Once a node is bootstrapped, it will start `etcd` and, over the course of a
minute or two (depending on the download speed of the control plane nodes), the
other control plane nodes should discover it and join themselves to the cluster.
Also, `etcd` will only run on control plane nodes.
If a node is designated as a worker node, you should not expect `etcd` to be
running on it.
When node boots for the first time, the `etcd` data directory (`/var/lib/etcd`) is empty, and it will only be populated when `etcd` is launched.
If `etcd` is not running, check service `etcd` state:
```bash
$ talosctl -n <IP> service etcd
NODE 172.20.0.2
ID etcd
STATE Running
HEALTH OK
EVENTS [Running]: Health check successful (3m21s ago)
[Running]: Started task etcd (PID 2343) for container etcd (3m26s ago)
[Preparing]: Creating service runner (3m26s ago)
[Preparing]: Running pre state (3m26s ago)
[Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", service "timed" to be "up" (3m26s ago)
```
If service is stuck in `Preparing` state for bootstrap node, it might be related to slow network - at this stage
Talos pulls the `etcd` image from the container registry.
If the `etcd` service is crashing and restarting, check its logs with `talosctl -n <IP> logs etcd`.
The most common reasons for crashes are:
- wrong arguments passed via `extraArgs` in the configuration;
- booting Talos on non-empty disk with previous Talos installation, `/var/lib/etcd` contains data from old cluster.
### etcd is not running on non-bootstrap control plane node
The `etcd` service on control plane nodes which were not the target of the cluster bootstrap will wait until the bootstrapped control plane node has completed.
The bootstrap and discovery processes may take a few minutes to complete.
As soon as the bootstrapped node starts its Kubernetes control plane components, `kubectl get endpoints` will return the IP of bootstrapped control plane node.
At this point, the other control plane nodes will start their `etcd` services, join the cluster, and then start their own Kubernetes control plane components.
### Kubernetes static pod definitions are not generated
Talos should generate the static pod definitions for the Kubernetes control plane
as resources:
```bash
$ talosctl -n <IP> get staticpods
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 k8s StaticPod kube-apiserver 1
172.20.0.2 k8s StaticPod kube-controller-manager 1
172.20.0.2 k8s StaticPod kube-scheduler 1
```
Talos should report that the static pod definitions are rendered for the `kubelet`:
```bash
$ talosctl -n <IP> dmesg | grep 'rendered new'
172.20.0.2: user: warning: [2023-04-26T19:17:52.550527204Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
172.20.0.2: user: warning: [2023-04-26T19:17:52.552186204Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
172.20.0.2: user: warning: [2023-04-26T19:17:52.554607204Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
```
If the static pod definitions are not rendered, check `etcd` and `kubelet` service health (see above)
and the controller runtime logs (`talosctl logs controller-runtime`).
### Talos prints error `an error on the server ("") has prevented the request from succeeding`
This is expected during initial cluster bootstrap and sometimes after a reboot:
```bash
[ 70.093289] [talos] task labelNodeAsControlPlane (1/1): starting
[ 80.094038] [talos] retrying error: an error on the server ("") has prevented the request from succeeding (get nodes talos-default-controlplane-1)
```
Initially `kube-apiserver` component is not running yet, and it takes some time before it becomes fully up
during bootstrap (image should be pulled from the Internet, etc.)
Once the control plane endpoint is up, Talos should continue with its boot
process.
If Talos doesn't proceed, it may be due to a configuration issue.
In any case, the status of the control plane components on each control plane nodes can be checked with `talosctl containers -k`:
```bash
$ talosctl -n <IP> containers --kubernetes
NODE NAMESPACE ID IMAGE PID STATUS
172.20.0.2 k8s.io kube-system/kube-apiserver-talos-default-controlplane-1 registry.k8s.io/pause:3.2 2539 SANDBOX_READY
172.20.0.2 k8s.io └─ kube-system/kube-apiserver-talos-default-controlplane-1:kube-apiserver:51c3aad7a271 registry.k8s.io/kube-apiserver:v{{< k8s_release >}} 2572 CONTAINER_RUNNING
```
If `kube-apiserver` shows as `CONTAINER_EXITED`, it might have exited due to configuration error.
Logs can be checked with `talosctl logs --kubernetes` (or with `-k` as a shorthand):
```bash
$ talosctl -n <IP> logs -k kube-system/kube-apiserver-talos-default-controlplane-1:kube-apiserver:51c3aad7a271
172.20.0.2: 2021-03-05T20:46:13.133902064Z stderr F 2021/03/05 20:46:13 Running command:
172.20.0.2: 2021-03-05T20:46:13.133933824Z stderr F Command env: (log-file=, also-stdout=false, redirect-stderr=true)
172.20.0.2: 2021-03-05T20:46:13.133938524Z stderr F Run from directory:
172.20.0.2: 2021-03-05T20:46:13.13394154Z stderr F Executable path: /usr/local/bin/kube-apiserver
...
```
### Talos prints error `nodes "talos-default-controlplane-1" not found`
This error means that `kube-apiserver` is up and the control plane endpoint is healthy, but the `kubelet` hasn't received
its client certificate yet, so it wasn't able to register itself with Kubernetes.
The Kubernetes controller manager (`kube-controller-manager`) is responsible for monitoring the certificate
signing requests (CSRs) and issuing certificates for each of them.
The kubelet is responsible for generating and submitting the CSRs for its
associated node.
For the `kubelet` to get its client certificate, then, the Kubernetes control plane
must be healthy:
- the API server is running and available at the Kubernetes control plane
endpoint URL
- the controller manager is running and a leader has been elected
The states of any CSRs can be checked with `kubectl get csr`:
```bash
$ kubectl get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-jcn9j 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-p6b9q 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-sw6rm 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-vlghg 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
```
### Talos prints error `node not ready`
A Node in Kubernetes is marked as `Ready` only once its CNI is up.
It takes a minute or two for the CNI images to be pulled and for the CNI to start.
If the node is stuck in this state for too long, check CNI pods and logs with `kubectl`.
Usually, CNI-related resources are created in `kube-system` namespace.
For example, for Talos default Flannel CNI:
```bash
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
...
kube-flannel-25drx 1/1 Running 0 23m
kube-flannel-8lmb6 1/1 Running 0 23m
kube-flannel-gl7nx 1/1 Running 0 23m
kube-flannel-jknt9 1/1 Running 0 23m
...
```
### Talos prints error `x509: certificate signed by unknown authority`
The full error might look like:
```bash
x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
```
Usually, this occurs because the control plane endpoint points to a different
cluster than the client certificate was generated for.
If a node was recycled between clusters, make sure it was properly wiped between
uses.
If a client has multiple client configurations, make sure you are matching the correct `talosconfig` with the
correct cluster.
### etcd is running on bootstrap node, but stuck in `pre` state on non-bootstrap nodes
Please see question `etcd is not running on non-bootstrap control plane node`.
### Checking `kube-controller-manager` and `kube-scheduler`
If the control plane endpoint is up, the status of the pods can be ascertained with `kubectl`:
```bash
$ kubectl get pods -n kube-system -l k8s-app=kube-controller-manager
NAME READY STATUS RESTARTS AGE
kube-controller-manager-talos-default-controlplane-1 1/1 Running 0 28m
kube-controller-manager-talos-default-controlplane-2 1/1 Running 0 28m
kube-controller-manager-talos-default-controlplane-3 1/1 Running 0 28m
```
If the control plane endpoint is not yet up, the container status of the control plane components can be queried with
`talosctl containers --kubernetes`:
```bash
$ talosctl -n <IP> c -k
NODE NAMESPACE ID IMAGE PID STATUS
...
172.20.0.2 k8s.io kube-system/kube-controller-manager-talos-default-controlplane-1 registry.k8s.io/pause:3.2 2547 SANDBOX_READY
172.20.0.2 k8s.io └─ kube-system/kube-controller-manager-talos-default-controlplane-1:kube-controller-manager:84fc77c59e17 registry.k8s.io/kube-controller-manager:v{{< k8s_release >}} 2580 CONTAINER_RUNNING
172.20.0.2 k8s.io kube-system/kube-scheduler-talos-default-controlplane-1 registry.k8s.io/pause:3.2 2638 SANDBOX_READY
172.20.0.2 k8s.io └─ kube-system/kube-scheduler-talos-default-controlplane-1:kube-scheduler:4182a7d7f779 registry.k8s.io/kube-scheduler:v{{< k8s_release >}} 2670 CONTAINER_RUNNING
...
```
If some of the containers are not running, it could be that the image is still being pulled.
Otherwise, the process might be crashing.
The logs can be checked with `talosctl logs --kubernetes <containerID>`:
```bash
$ talosctl -n <IP> logs -k kube-system/kube-controller-manager-talos-default-controlplane-1:kube-controller-manager:84fc77c59e17
172.20.0.3: 2021-03-09T13:59:34.291667526Z stderr F 2021/03/09 13:59:34 Running command:
172.20.0.3: 2021-03-09T13:59:34.291702262Z stderr F Command env: (log-file=, also-stdout=false, redirect-stderr=true)
172.20.0.3: 2021-03-09T13:59:34.291707121Z stderr F Run from directory:
172.20.0.3: 2021-03-09T13:59:34.291710908Z stderr F Executable path: /usr/local/bin/kube-controller-manager
172.20.0.3: 2021-03-09T13:59:34.291719163Z stderr F Args (comma-delimited): /usr/local/bin/kube-controller-manager,--allocate-node-cidrs=true,--cloud-provider=,--cluster-cidr=10.244.0.0/16,--service-cluster-ip-range=10.96.0.0/12,--cluster-signing-cert-file=/system/secrets/kubernetes/kube-controller-manager/ca.crt,--cluster-signing-key-file=/system/secrets/kubernetes/kube-controller-manager/ca.key,--configure-cloud-routes=false,--kubeconfig=/system/secrets/kubernetes/kube-controller-manager/kubeconfig,--leader-elect=true,--root-ca-file=/system/secrets/kubernetes/kube-controller-manager/ca.crt,--service-account-private-key-file=/system/secrets/kubernetes/kube-controller-manager/service-account.key,--profiling=false
172.20.0.3: 2021-03-09T13:59:34.293870359Z stderr F 2021/03/09 13:59:34 Now listening for interrupts
172.20.0.3: 2021-03-09T13:59:34.761113762Z stdout F I0309 13:59:34.760982 10 serving.go:331] Generated self-signed cert in-memory
...
```
### Checking controller runtime logs
Talos runs a set of controllers which operate on resources to build and support the Kubernetes control plane.
Some debugging information can be queried from the controller logs with `talosctl logs controller-runtime`:
```bash
$ talosctl -n <IP> logs controller-runtime
172.20.0.2: 2021/03/09 13:57:11 secrets.EtcdController: controller starting
172.20.0.2: 2021/03/09 13:57:11 config.MachineTypeController: controller starting
172.20.0.2: 2021/03/09 13:57:11 k8s.ManifestApplyController: controller starting
172.20.0.2: 2021/03/09 13:57:11 v1alpha1.BootstrapStatusController: controller starting
172.20.0.2: 2021/03/09 13:57:11 v1alpha1.TimeStatusController: controller starting
...
```
Controllers continuously run a reconcile loop, so at any time, they may be starting, failing, or restarting.
This is expected behavior.
Things to look for:
`k8s.KubeletStaticPodController: rendered new static pod`: static pod definitions were rendered successfully.
`k8s.ManifestApplyController: controller failed: error creating mapping for object /v1/Secret/bootstrap-token-q9pyzr: an error on the server ("") has prevented the request from succeeding`: control plane endpoint is not up yet, bootstrap manifests can't be injected, controller is going to retry.
`k8s.KubeletStaticPodController: controller failed: error refreshing pod status: error fetching pod status: an error on the server ("Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)") has prevented the request from succeeding`: kubelet hasn't been able to contact `kube-apiserver` yet to push pod status, controller
is going to retry.
`k8s.ManifestApplyController: created rbac.authorization.k8s.io/v1/ClusterRole/psp:privileged`: one of the bootstrap manifests got successfully applied.
`secrets.KubernetesController: controller failed: missing cluster.aggregatorCA secret`: Talos is running with 0.8 configuration, if the cluster was upgraded from 0.8, this is expected, and conversion process will fix machine config
automatically.
If this cluster was bootstrapped with version 0.9, machine configuration should be regenerated with 0.9 talosctl.
If there are no new messages in the `controller-runtime` log, it means that the controllers have successfully finished reconciling, and that the current system state is the desired system state.
### Checking static pod definitions
Talos generates static pod definitions for the `kube-apiserver`, `kube-controller-manager`, and `kube-scheduler`
components based on its machine configuration.
These definitions can be checked as resources with `talosctl get staticpods`:
```bash
$ talosctl -n <IP> get staticpods -o yaml
node: 172.20.0.2
metadata:
namespace: controlplane
type: StaticPods.kubernetes.talos.dev
id: kube-apiserver
version: 2
phase: running
finalizers:
- k8s.StaticPodStatus("kube-apiserver")
spec:
apiVersion: v1
kind: Pod
metadata:
annotations:
talos.dev/config-version: "1"
talos.dev/secrets-version: "1"
creationTimestamp: null
labels:
k8s-app: kube-apiserver
tier: control-plane
name: kube-apiserver
namespace: kube-system
...
```
The status of the static pods can be queried with `talosctl get staticpodstatus`:
```bash
$ talosctl -n <IP> get staticpodstatus
NODE NAMESPACE TYPE ID VERSION READY
172.20.0.2 controlplane StaticPodStatus kube-system/kube-apiserver-talos-default-controlplane-1 1 True
172.20.0.2 controlplane StaticPodStatus kube-system/kube-controller-manager-talos-default-controlplane-1 1 True
172.20.0.2 controlplane StaticPodStatus kube-system/kube-scheduler-talos-default-controlplane-1 1 True
```
The most important status field is `READY`, which is the last column printed.
The complete status can be fetched by adding `-o yaml` flag.
### Checking bootstrap manifests
As part of the bootstrap process, Talos injects bootstrap manifests into the Kubernetes API server.
There are two kinds of these manifests: system manifests built into Talos, and extra manifests downloaded (custom CNI, extra manifests in the machine config):
```bash
$ talosctl -n <IP> get manifests
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 controlplane Manifest 00-kubelet-bootstrapping-token 1
172.20.0.2 controlplane Manifest 01-csr-approver-role-binding 1
172.20.0.2 controlplane Manifest 01-csr-node-bootstrap 1
172.20.0.2 controlplane Manifest 01-csr-renewal-role-binding 1
172.20.0.2 controlplane Manifest 02-kube-system-sa-role-binding 1
172.20.0.2 controlplane Manifest 03-default-pod-security-policy 1
172.20.0.2 controlplane Manifest 05-https://docs.projectcalico.org/manifests/calico.yaml 1
172.20.0.2 controlplane Manifest 10-kube-proxy 1
172.20.0.2 controlplane Manifest 11-core-dns 1
172.20.0.2 controlplane Manifest 11-core-dns-svc 1
172.20.0.2 controlplane Manifest 11-kube-config-in-cluster 1
```
Details of each manifest can be queried by adding `-o yaml`:
```bash
$ talosctl -n <IP> get manifests 01-csr-approver-role-binding --namespace=controlplane -o yaml
node: 172.20.0.2
metadata:
namespace: controlplane
type: Manifests.kubernetes.talos.dev
id: 01-csr-approver-role-binding
version: 1
phase: running
spec:
- apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: system-bootstrap-approve-node-client-csr
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:certificates.k8s.io:certificatesigningrequests:nodeclient
subjects:
- apiGroup: rbac.authorization.k8s.io
kind: Group
name: system:bootstrappers
```
### Worker node is stuck with `apid` health check failures
Control plane nodes have enough secret material to generate `apid` server certificates, but worker nodes
depend on control plane `trustd` services to generate certificates.
Worker nodes wait for their `kubelet` to join the cluster.
Then the Talos `apid` queries the Kubernetes endpoints via control plane
endpoint to find `trustd` endpoints.
They then use `trustd` to request and receive their certificate.
So if `apid` health checks are failing on worker node:
- make sure control plane endpoint is healthy
- check that worker node `kubelet` joined the cluster

View File

@ -6,8 +6,8 @@ description: "Table of supported Talos Linux versions and respective platforms."
| Talos Version | 1.6 | 1.5 |
|----------------------------------------------------------------------------------------------------------------|------------------------------------|------------------------------------|
| Release Date | 2023-12-15 (TBD) | 2023-08-17 (1.5.0) |
| End of Community Support | 1.7.0 release (2024-03-15, TBD) | 1.6.0 release (2023-12-15) |
| Release Date | 2023-12-15 | 2023-08-17 (1.5.0) |
| End of Community Support | 1.7.0 release (2024-04-15, TBD) | 1.6.0 release (2023-12-15) |
| Enterprise Support | [offered by Sidero Labs Inc.](https://www.siderolabs.com/support/) | [offered by Sidero Labs Inc.](https://www.siderolabs.com/support/) |
| Kubernetes | 1.29, 1.28, 1.27, 1.26, 1.25, 1.24 | 1.28, 1.27, 1.26 |
| Architecture | amd64, arm64 | amd64, arm64 |
@ -18,9 +18,9 @@ description: "Table of supported Talos Linux versions and respective platforms."
| - SBCs | Banana Pi M64, Jetson Nano, Libre Computer Board ALL-H3-CC, Nano Pi R4S, Pine64, Pine64 Rock64, Radxa ROCK Pi 4c, Raspberry Pi 4B, Raspberry Pi Compute Module 4 | Banana Pi M64, Jetson Nano, Libre Computer Board ALL-H3-CC, Nano Pi R4S, Pine64, Pine64 Rock64, Radxa ROCK Pi 4c, Raspberry Pi 4B, Raspberry Pi Compute Module 4 |
| - local | Docker, QEMU | Docker, QEMU |
| **Cluster API** | | |
| [CAPI Bootstrap Provider Talos](https://github.com/siderolabs/cluster-api-bootstrap-provider-talos) | >= 0.6.2 | >= 0.6.1 |
| [CAPI Control Plane Provider Talos](https://github.com/siderolabs/cluster-api-control-plane-provider-talos) | >= 0.5.3 | >= 0.5.2 |
| [Sidero](https://www.sidero.dev/) | >= 0.6.0 | >= 0.6.0 |
| [CAPI Bootstrap Provider Talos](https://github.com/siderolabs/cluster-api-bootstrap-provider-talos) | >= 0.6.3 | >= 0.6.1 |
| [CAPI Control Plane Provider Talos](https://github.com/siderolabs/cluster-api-control-plane-provider-talos) | >= 0.5.4 | >= 0.5.2 |
| [Sidero](https://www.sidero.dev/) | >= 0.6.2 | >= 0.6.0 |
## Platform Tiers

View File

@ -0,0 +1,426 @@
---
title: "Troubleshooting"
description: "Troubleshoot control plane and other failures for Talos Linux clusters."
aliases:
- ../guides/troubleshooting-control-plane
- ../advanced/troubleshooting-control-plane
---
<!-- markdownlint-disable MD026 -->
In this guide we assume that Talos is configured with default features enabled, such as [Discovery Service]({{< relref "../talos-guides/discovery" >}}) and [KubePrism]({{< relref "../kubernetes-guides/configuration/kubeprism" >}}).
If these features are disabled, some of the troubleshooting steps may not apply or may need to be adjusted.
This guide is structured so that it can be followed step-by-step; skip the sections which are not relevant to your issue.
## Network Configuration
As Talos Linux is an API-based operating system, it is important to have networking configured so that the API can be accessed.
Some information can be gathered from the [Interactive Dashboard]({{< relref "../talos-guides/interactive-dashboard" >}}) which is available on the machine console.
When running in the cloud, networking should be configured automatically.
When running on bare-metal, more specific configuration may be needed; see the [networking `metal` configuration guide]({{< relref "../talos-guides/install/bare-metal-platforms/network-config" >}}).
## Talos API
The Talos API runs on [port 50000]({{< relref "../learn-more/talos-network-connectivity" >}}).
Control plane nodes should always serve the Talos API, while worker nodes require access to the control plane nodes so that TLS certificates can be issued for the workers.
### Firewall Issues
Make sure that the firewall is not blocking port 50000, and [communication]({{< relref "../learn-more/talos-network-connectivity" >}}) on ports 50000/50001 inside the cluster.
### Client Configuration Issues
Make sure to use the correct `talosconfig` client configuration file matching your cluster.
See [getting started]({{< relref "./getting-started" >}}) for more information.
The most common issue is that `talosctl gen config` writes `talosconfig` to a file in the current directory, while `talosctl` by default picks up the configuration from the default location (`~/.talos/config`).
The path to the configuration file can be specified with the `--talosconfig` flag to `talosctl`.
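For example (a minimal sketch; `./talosconfig` and `<IP>` are placeholders):
```bash
# use the talosconfig generated by `talosctl gen config` in the current directory
talosctl --talosconfig=./talosconfig -n <IP> version
```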
### Conflict on Kubernetes and Host Subnets
If `talosctl` returns an error saying that certificate IPs are empty, it might be due to a conflict between the Kubernetes and host subnets.
The Talos API runs on the host network, but it automatically excludes the Kubernetes pod & service subnets from the usable set of addresses.
The Talos default machine configuration specifies the following Kubernetes pod and service subnet IPv4 CIDRs: `10.244.0.0/16` and `10.96.0.0/12`.
If the host network is configured with one of these subnets, change the machine configuration to use a different subnet.
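If a different subnet is needed, the Kubernetes subnets can be adjusted in the machine configuration; a sketch, assuming the replacement ranges below do not overlap with the host network:
```yaml
cluster:
  network:
    podSubnets:
      - 192.168.0.0/17 # example pod CIDR, adjust to your environment
    serviceSubnets:
      - 192.168.128.0/17 # example service CIDR, adjust to your environment
```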
### Wrong Endpoints
The `talosctl` CLI connects to the Talos API via the specified endpoints, which should be a list of control plane machine addresses.
The client will automatically retry other endpoints if some of the endpoints are unavailable.
Worker nodes should not be used as endpoints, as they are not able to forward requests to other nodes.
The [VIP]({{< relref "../talos-guides/network/vip" >}}) should never be used as a Talos API endpoint.
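The endpoints stored in the client configuration can be updated with `talosctl config endpoint` (the IPs below are placeholders for your control plane machine addresses):
```bash
# point the client at the control plane machines
talosctl config endpoint 172.20.0.2 172.20.0.3 172.20.0.4
```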
### TCP Loadbalancer
When using a TCP loadbalancer, make sure the loadbalancer endpoint is included in the `.machine.certSANs` list in the machine configuration.
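A minimal machine configuration sketch, assuming a hypothetical load balancer reachable at `lb.example.com`:
```yaml
machine:
  certSANs:
    - lb.example.com # hypothetical DNS name of the TCP load balancer fronting the Talos API
```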
## System Requirements
If minimum [system requirements]({{< relref "./system-requirements" >}}) are not met, this might manifest itself in various ways, such as random failures when starting services, or failures to pull images from the container registry.
## Running Health Checks
Talos Linux provides a set of basic health checks with the `talosctl health` command, which can be used to check the health of the cluster.
In the default mode, `talosctl health` uses information from the [discovery service]({{< relref "../talos-guides/discovery" >}}) to get the list of cluster members.
This can be overridden with command line flags `--control-plane-nodes` and `--worker-nodes`.
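For example (the node lists below are placeholders, only needed when overriding discovery):
```bash
# run the built-in health checks using discovery information
talosctl -n <IP> health

# or list the cluster members explicitly
talosctl -n <IP> health \
  --control-plane-nodes 172.20.0.2,172.20.0.3,172.20.0.4 \
  --worker-nodes 172.20.0.5
```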
## Gathering Logs
While the logs and state of the system can be queried via the Talos API, it is often useful to gather the logs from all nodes in the cluster, and analyze them offline.
The `talosctl support` command can be used to gather logs and other information from the nodes specified with `--nodes` flag (multiple nodes are supported).
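For example (a sketch; the node IPs are placeholders, and the resulting support bundle archive is written locally):
```bash
# collect logs, resource state and other debug information from two nodes
talosctl support -n 172.20.0.2,172.20.0.3
```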
## Discovery and Cluster Membership
Talos Linux uses [Discovery Service]({{< relref "../talos-guides/discovery" >}}) to discover other nodes in the cluster.
The list of members on each machine should be consistent: `talosctl -n <IP> get members`.
### Some Members are Missing
Ensure connectivity to the discovery service (default is `discovery.talos.dev:443`), and that the discovery registry is not disabled.
### Duplicate Members
Don't use the same base secrets to generate machine configuration for multiple clusters, as some secrets are used to identify members of the same cluster.
If the same machine configuration (or the same secrets) is used to repeatedly create and destroy clusters, the discovery service will see the same nodes as members of different clusters.
### Removed Members are Still Present
Talos Linux removes itself from the discovery service when it is [reset]({{< relref "../talos-guides/resetting-a-machine" >}}).
If the machine was not reset, it might show up as a member of the cluster for the maximum TTL of the discovery service (30 minutes), and after that it will be automatically removed.
## `etcd` Issues
`etcd` is the distributed key-value store used by Kubernetes to store its state.
Talos Linux provides automation to manage `etcd` members running on control plane nodes.
If `etcd` is not healthy, the Kubernetes API server will not be able to function correctly.
It is always recommended to run an odd number of `etcd` members: a cluster of three or more members can tolerate the failure of a minority of its members while maintaining quorum.
Common troubleshooting steps:
- check `etcd` service state with `talosctl -n IP service etcd` for each control plane node
- check `etcd` membership on each control plane node with `talosctl -n IP etcd member list`
- check `etcd` logs with `talosctl -n IP logs etcd`
- check `etcd` alarms with `talosctl -n IP etcd alarm list`
### All `etcd` Services are Stuck in `Pre` State
Make sure that a single member was [bootstrapped]({{< relref "./getting-started#kubernetes-bootstrap" >}}).
Check that the machine is able to pull the `etcd` container image; check `talosctl dmesg` for messages starting with the `retrying:` prefix.
### Some `etcd` Services are Stuck in `Pre` State
Make sure traffic is not blocked on port 2380 between controlplane nodes.
Check that `etcd` quorum is not lost.
Check that all control plane nodes are reported in `talosctl get members` output.
### `etcd` Reports an Alarm
See [etcd maintenance]({{< relref "../advanced/etcd-maintenance" >}}) guide.
### `etcd` Quorum is Lost
See [disaster recovery]({{< relref "../advanced/disaster-recovery" >}}) guide.
### Other Issues
`etcd` will only run on control plane nodes.
If a node is designated as a worker node, you should not expect `etcd` to be running on it.
When a node boots for the first time, the `etcd` data directory (`/var/lib/etcd`) is empty, and it will only be populated when `etcd` is launched.
If the `etcd` service is crashing and restarting, check its logs with `talosctl -n <IP> logs etcd`.
The most common reasons for crashes are:
- wrong arguments passed via `extraArgs` in the configuration;
- booting Talos on a non-empty disk with an existing Talos installation, where `/var/lib/etcd` contains data from the old cluster.
## `kubelet` and Kubernetes Node Issues
The `kubelet` service should be running on all Talos nodes, and it is responsible for running Kubernetes pods,
static pods (including control plane components), and registering the node with the Kubernetes API server.
If the `kubelet` doesn't run on a control plane node, it will block the control plane components from starting.
The node will not be registered in Kubernetes until the Kubernetes API server is up and initial Kubernetes manifests are applied.
### `kubelet` is not running
Check that the `kubelet` image is available (`talosctl image ls --namespace system`).
Check `kubelet` logs with `talosctl -n IP logs kubelet` for startup errors:
- make sure the Kubernetes version is [supported]({{< relref "./support-matrix" >}}) by this Talos release
- make sure the `kubelet` extra arguments and extra configuration supplied with the Talos machine configuration are valid
### Talos Complains about Node Not Found
`kubelet` hasn't yet registered the node with the Kubernetes API server; this is expected during initial cluster bootstrap, and the error will go away.
If the message persists, check Kubernetes API health.
The Kubernetes controller manager (`kube-controller-manager`) is responsible for monitoring the certificate
signing requests (CSRs) and issuing certificates for each of them.
The `kubelet` is responsible for generating and submitting the CSRs for its
associated node.
The state of any CSRs can be checked with `kubectl get csr`:
```bash
$ kubectl get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-jcn9j 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-p6b9q 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-sw6rm 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-vlghg 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
```
### `kubectl get nodes` Reports Wrong Internal IP
Configure the correct internal IP address with [`.machine.kubelet.nodeIP`]({{< relref "../reference/configuration/v1alpha1/config#Config.machine.kubelet.nodeIP" >}})
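A minimal sketch (the subnet is a placeholder for the network the node's internal IP should be picked from):
```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.0.0.0/8 # the internal IP will be picked from this subnet
```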
### `kubectl get nodes` Reports Wrong External IP
Talos Linux doesn't manage the external IP; it is managed by the Kubernetes Cloud Controller Manager.
### `kubectl get nodes` Reports Wrong Node Name
By default, the Kubernetes node name is derived from the hostname.
Update the hostname using the machine configuration, cloud configuration, or via DHCP server.
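For example, via the machine configuration (the hostname is a placeholder):
```yaml
machine:
  network:
    hostname: worker-1 # hypothetical node name
```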
### Node Is Not Ready
A Node in Kubernetes is marked as `Ready` only once its CNI is up.
It takes a minute or two for the CNI images to be pulled and for the CNI to start.
If the node is stuck in this state for too long, check CNI pods and logs with `kubectl`.
Usually, CNI-related resources are created in `kube-system` namespace.
For example, for the default Talos Flannel CNI:
```bash
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
...
kube-flannel-25drx 1/1 Running 0 23m
kube-flannel-8lmb6 1/1 Running 0 23m
kube-flannel-gl7nx 1/1 Running 0 23m
kube-flannel-jknt9 1/1 Running 0 23m
...
```
### Duplicate/Stale Nodes
Talos Linux doesn't remove Kubernetes nodes automatically, so if a node is removed from the cluster, it will still be present in Kubernetes.
Remove the node from Kubernetes with `kubectl delete node <node-name>`.
### Talos Complains about Certificate Errors on `kubelet` API
This error might appear during initial cluster bootstrap, and it will go away once the Kubernetes API server is up and the node is registered.
With the default configuration, the `kubelet` issues a self-signed server certificate, but when the `rotate-server-certificates` feature is enabled,
the `kubelet` requests its server certificate from the Kubernetes API via a CSR.
Make sure the `kubelet` CSR is approved by the Kubernetes API server.
In either case, this error is not critical, as it only affects reporting of the pod status to Talos Linux.
## Kubernetes Control Plane
The Kubernetes control plane consists of the following components:
- `kube-apiserver` - the Kubernetes API server
- `kube-controller-manager` - the Kubernetes controller manager
- `kube-scheduler` - the Kubernetes scheduler
Optionally, `kube-proxy` runs as a DaemonSet to provide pod-to-service communication.
`coredns` provides name resolution for the cluster.
CNI is not part of the control plane, but it is required for Kubernetes pods using pod networking.
Troubleshooting should always start with `kube-apiserver`, and then proceed to other components.
Talos Linux configures `kube-apiserver` to talk to the `etcd` running on the same node, so `etcd` must be healthy before `kube-apiserver` can start.
The `kube-controller-manager` and `kube-scheduler` are configured to talk to the `kube-apiserver` on the same node, so they will not start until `kube-apiserver` is healthy.
### Control Plane Static Pods
Talos should generate the static pod definitions for the Kubernetes control plane
as resources:
```bash
$ talosctl -n <IP> get staticpods
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 k8s StaticPod kube-apiserver 1
172.20.0.2 k8s StaticPod kube-controller-manager 1
172.20.0.2 k8s StaticPod kube-scheduler 1
```
Talos should report that the static pod definitions are rendered for the `kubelet`:
```bash
$ talosctl -n <IP> dmesg | grep 'rendered new'
172.20.0.2: user: warning: [2023-04-26T19:17:52.550527204Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
172.20.0.2: user: warning: [2023-04-26T19:17:52.552186204Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
172.20.0.2: user: warning: [2023-04-26T19:17:52.554607204Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
```
If the static pod definitions are not rendered, check `etcd` and `kubelet` service health (see above)
and the controller runtime logs (`talosctl logs controller-runtime`).
### Control Plane Pod Status
Initially the `kube-apiserver` component will not be running, and it takes some time before it becomes fully up
during bootstrap (the image needs to be pulled from the Internet, etc.).
The status of the control plane components on each of the control plane nodes can be checked with `talosctl containers -k`:
```bash
$ talosctl -n <IP> containers --kubernetes
NODE NAMESPACE ID IMAGE PID STATUS
172.20.0.2 k8s.io kube-system/kube-apiserver-talos-default-controlplane-1 registry.k8s.io/pause:3.2 2539 SANDBOX_READY
172.20.0.2 k8s.io └─ kube-system/kube-apiserver-talos-default-controlplane-1:kube-apiserver:51c3aad7a271 registry.k8s.io/kube-apiserver:v{{< k8s_release >}} 2572 CONTAINER_RUNNING
```
The logs of the control plane components can be checked with `talosctl logs --kubernetes` (or with `-k` as a shorthand):
```bash
talosctl -n <IP> logs -k kube-system/kube-apiserver-talos-default-controlplane-1:kube-apiserver:51c3aad7a271
```
If a control plane component reports an error on startup, check that:
- the Kubernetes version is [supported]({{< relref "./support-matrix" >}}) by this Talos release
- the extra arguments and extra configuration supplied with the Talos machine configuration are valid
### Kubernetes Bootstrap Manifests
As part of the bootstrap process, Talos injects bootstrap manifests into the Kubernetes API server.
There are two kinds of these manifests: system manifests built into Talos, and extra manifests downloaded (custom CNI, extra manifests in the machine config):
```bash
$ talosctl -n <IP> get manifests
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 controlplane Manifest 00-kubelet-bootstrapping-token 1
172.20.0.2 controlplane Manifest 01-csr-approver-role-binding 1
172.20.0.2 controlplane Manifest 01-csr-node-bootstrap 1
172.20.0.2 controlplane Manifest 01-csr-renewal-role-binding 1
172.20.0.2 controlplane Manifest 02-kube-system-sa-role-binding 1
172.20.0.2 controlplane Manifest 03-default-pod-security-policy 1
172.20.0.2 controlplane Manifest 05-https://docs.projectcalico.org/manifests/calico.yaml 1
172.20.0.2 controlplane Manifest 10-kube-proxy 1
172.20.0.2 controlplane Manifest 11-core-dns 1
172.20.0.2 controlplane Manifest 11-core-dns-svc 1
172.20.0.2 controlplane Manifest 11-kube-config-in-cluster 1
```
Details of each manifest can be queried by adding `-o yaml`:
```bash
$ talosctl -n <IP> get manifests 01-csr-approver-role-binding --namespace=controlplane -o yaml
node: 172.20.0.2
metadata:
namespace: controlplane
type: Manifests.kubernetes.talos.dev
id: 01-csr-approver-role-binding
version: 1
phase: running
spec:
- apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: system-bootstrap-approve-node-client-csr
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:certificates.k8s.io:certificatesigningrequests:nodeclient
subjects:
- apiGroup: rbac.authorization.k8s.io
kind: Group
name: system:bootstrappers
```
### Other Control Plane Components
Once the Kubernetes API server is up, issues with other control plane components can be troubleshot with `kubectl`:
```shell
kubectl get nodes -o wide
kubectl get pods -o wide --all-namespaces
kubectl describe pod -n NAMESPACE POD
kubectl logs -n NAMESPACE POD
```
## Kubernetes API
The Kubernetes API client configuration (`kubeconfig`) can be retrieved using Talos API with `talosctl -n <IP> kubeconfig` command.
Talos Linux itself mostly doesn't depend on the Kubernetes API endpoint, but the Kubernetes API endpoint should be configured
correctly for external access to the cluster.
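For example (a sketch; the target path is optional):
```bash
# merge the cluster's kubeconfig into ~/.kube/config
talosctl -n <IP> kubeconfig

# or write it to a specific file instead
talosctl -n <IP> kubeconfig ./kubeconfig
```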
### Kubernetes Control Plane Endpoint
The Kubernetes control plane endpoint is the single canonical URL by which the
Kubernetes API is accessed.
Especially with high-availability (HA) control planes, this endpoint may point to a load balancer or a DNS name which may
have multiple `A` and `AAAA` records.
Like Talos' own API, the Kubernetes API uses mutual TLS, client
certs, and a common Certificate Authority (CA).
Unlike general-purpose websites, there is no need for an upstream CA, so tools
such as cert-manager, Let's Encrypt, or products such
as validated TLS certificates are not required.
Encryption, however, _is_, and hence the URL scheme will always be `https://`.
By default, the Kubernetes API server in Talos runs on port 6443.
As such, the control plane endpoint URLs for Talos will almost always be of the form
`https://endpoint:6443`.
(The port is required, since it is not the `https` default of `443`.)
The `endpoint` above may be a DNS name or IP address, but it should be
directed to the _set_ of all controlplane nodes, as opposed to a
single one.
As mentioned above, this can be achieved by a number of strategies, including:
- an external load balancer
- DNS records
- Talos-builtin shared IP ([VIP]({{< relref "../talos-guides/network/vip" >}}))
- BGP peering of a shared IP (such as with [kube-vip](https://kube-vip.io))
Using a DNS name here is a good idea, since it allows any other option, while offering
a layer of abstraction.
It allows the underlying IP addresses to change without impacting the
canonical URL.
Unlike most services in Kubernetes, the API server runs with host networking,
meaning that it shares the network namespace with the host.
This means you can use the IP address(es) of the host to refer to the Kubernetes
API server.
For availability of the API, it is important that any load balancer be aware of
the health of the backend API servers, to minimize disruptions during
common node operations like reboots and upgrades.
## Miscellaneous
### Checking Controller Runtime Logs
Talos runs a set of [controllers]({{< relref "../learn-more/controllers-resources" >}}) which operate on resources to build and support machine operations.
Some debugging information can be queried from the controller logs with `talosctl logs controller-runtime`:
```bash
talosctl -n <IP> logs controller-runtime
```
Controllers continuously run a reconcile loop, so at any time, they may be starting, failing, or restarting.
This is expected behavior.
If there are no new messages in the `controller-runtime` log, it means that the controllers have successfully finished reconciling, and that the current system state is the desired system state.

View File

@ -1,9 +1,138 @@
---
title: What's New in Talos 1.6
title: What's New in Talos 1.6.0
weight: 50
description: "List of new and shiny features in Talos Linux."
---
See also [upgrade notes]({{< relref "../../talos-guides/upgrading-talos/">}}) for important changes.
TBD
## Breaking Changes
### Linux Firmware
Starting with Talos 1.6, Linux firmware is not included in the default initramfs.
Users who need Linux firmware can pull it in as a system extension at install time using the [Image Factory]({{< relref "../../learn-more/image-factory" >}}) service.
If the initial boot requires firmware, a [custom ISO can be built]({{< relref "../../talos-guides/install/boot-assets" >}}) with the firmware included, using either the Image Factory service or the `imager`.
This also ensures that the Linux firmware is not tied to a specific Talos version.
The list of firmware packages which are now available as extensions:
* [bnx2 and bnx2x firmware (Broadcom NetXtreme II)](https://github.com/siderolabs/extensions/tree/main/firmware/bnx2-bnx2x)
* [Intel ICE firmware (Intel(R) Ethernet Controller 800 Series)](https://github.com/siderolabs/extensions/tree/main/firmware/intel-ice-firmware)
### Network Device Selectors
Previously, [network device selectors]({{< relref "../../talos-guides/network/device-selector" >}}) only matched the first link; now the configuration is applied to all matching links.
### `talosctl images` command
The `talosctl images` command, deprecated in Talos 1.5, has been removed; please use `talosctl images default` instead.
### `.persist` Machine Configuration Option
The `.persist` machine configuration option, deprecated in Talos 1.5, has been removed; the machine configuration is now always persisted.
## New Features
### Kubernetes `n-5` Version Support
Starting with version 1.6, Talos Linux supports Kubernetes versions back to `n-5`; for release 1.6.0 this means [support]({{< relref "../support-matrix" >}}) for Kubernetes versions 1.24-1.29.
This makes it easier to upgrade to new Talos Linux versions without having to upgrade Kubernetes at the same time.
> See [Kubernetes release support](https://kubernetes.io/releases/) for the list of supported versions by Kubernetes project.
### OAuth2 Machine Config Flow
Talos Linux when running on the `metal` platform can be configured to authenticate the machine configuration download using [OAuth2 device flow]({{< relref "../../advanced/machine-config-oauth" >}}).
### Ingress Firewall
Talos Linux now supports configuring the [ingress firewall rules]({{< relref "../../talos-guides/network/ingress-firewall" >}}).
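A minimal sketch of the firewall configuration documents (see the linked guide for the authoritative syntax; the subnet below is a placeholder):
```yaml
apiVersion: v1alpha1
kind: NetworkDefaultActionConfig
ingress: block # block all ingress traffic unless explicitly allowed
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-apid
portSelector:
  ports:
    - 50000
  protocol: tcp
ingress:
  - subnet: 192.168.0.0/16 # allow Talos API access from this subnet only
```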
## Improvements
### Component Updates
* Linux: 6.1.67
* Kubernetes: 1.29.0
* containerd: 1.7.10
* runc: 1.1.10
* etcd: 3.5.11
* CoreDNS: 1.11.1
* Flannel: 0.23.0
Talos is built with Go 1.21.5.
### Extension Services
Talos now starts Extension Services early in the boot process, which allows guest agents packaged as extension services to be started in maintenance mode.
### Flannel Configuration
Talos Linux now supports customizing the default Flannel manifest with extra arguments for `flanneld`:
```yaml
cluster:
network:
cni:
flannel:
extraArgs:
- --iface-can-reach=192.168.1.1
```
### Kernel Arguments
Talos and Imager now support dropping kernel arguments specified in `.machine.install.extraKernelArgs` or as `--extra-kernel-arg` to `imager`.
Any kernel argument that starts with a `-` is dropped.
Kernel arguments to be dropped can be specified either as `-<key>` which would remove all arguments that start with `<key>` or as `-<key>=<value>` which would remove the exact argument.
For example, `console=ttyS0` can be dropped by specifying `-console=ttyS0` as an extra argument.
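For example, in the machine configuration (a sketch):
```yaml
machine:
  install:
    extraKernelArgs:
      - -console=ttyS0 # drop the console=ttyS0 argument
      - console=tty0   # add console=tty0
```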
### `kube-scheduler` Configuration
Talos now supports specifying the `kube-scheduler` configuration in the Talos configuration file.
It can be set under `cluster.scheduler.config`, and `kube-scheduler` will be automatically configured with the correct flags.
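A minimal sketch (any valid `KubeSchedulerConfiguration` document can be supplied; the profile below is a placeholder):
```yaml
cluster:
  scheduler:
    config:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: KubeSchedulerConfiguration
      profiles:
        - schedulerName: default-scheduler # hypothetical profile customization
```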
### Kubernetes Node Taint Configuration
Similar to `machine.nodeLabels`, Talos Linux now provides the `machine.nodeTaints` machine configuration field to configure Kubernetes `Node` taints.
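For example (the taint key and value are placeholders):
```yaml
machine:
  nodeTaints:
    exampleTaint: exampleValue:NoSchedule # hypothetical taint applied to this node
```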
### Kubelet Credential Provider Configuration
Talos now supports specifying the kubelet credential provider configuration in the Talos configuration file.
It can be set under `machine.kubelet.credentialProviderConfig`, and the kubelet will be automatically configured with the correct flags.
The credential binaries are expected to be present under `/usr/local/lib/kubelet/credentialproviders`.
Talos System Extensions can be used to install the credential binaries.
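A minimal sketch, assuming a hypothetical `ecr-credential-provider` binary installed under `/usr/local/lib/kubelet/credentialproviders` via a system extension (the image pattern and cache duration are placeholders):
```yaml
machine:
  kubelet:
    credentialProviderConfig:
      apiVersion: kubelet.config.k8s.io/v1
      kind: CredentialProviderConfig
      providers:
        - name: ecr-credential-provider # hypothetical provider binary name
          apiVersion: credentialprovider.kubelet.k8s.io/v1
          defaultCacheDuration: 12h
          matchImages:
            - "*.dkr.ecr.*.amazonaws.com"
```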
### KubePrism
[KubePrism]({{< relref "../../kubernetes-guides/configuration/kubeprism" >}}) is enabled by default on port 7445.
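The feature can be tuned (or disabled) via the machine configuration, for example (a sketch):
```yaml
machine:
  features:
    kubePrism:
      enabled: true
      port: 7445 # the default port, change it if it conflicts with other services
```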
### Sysctl
Talos now handles sysctl/sysfs key names in line with sysctl.conf(5):
* if the first separator is '/', no conversion is done
* if the first separator is '.', dots and slashes are remapped
Example (both sysctls are equivalent):
```yaml
machine:
sysctls:
net/ipv6/conf/eth0.100/disable_ipv6: "1"
net.ipv6.conf.eth0/100.disable_ipv6: "1"
```
### User Disks
Talos Linux now supports specifying user disks in the `.machine.disks` machine configuration via `udev` symlinks, e.g. `/dev/disk/by-id/XXXX`.
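For example (the device ID and mountpoint are placeholders):
```yaml
machine:
  disks:
    - device: /dev/disk/by-id/ata-EXAMPLE-SERIAL # hypothetical stable udev symlink
      partitions:
        - mountpoint: /var/mnt/data
```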
### Packet Capture
Talos Linux now provides a more performant server-side implementation of the packet capture API (`talosctl pcap` CLI).
### Memory Usage and Performance
Talos Linux core components now use less memory and start faster.

View File

@ -1,47 +0,0 @@
---
title: "Cluster Endpoint"
description: "How to explicitly set up an endpoint for the cluster API"
aliases:
- ../../guides/configuring-the-cluster-endpoint
---
In this section, we will step through the configuration of a Talos based Kubernetes cluster.
There are three major components we will configure:
- `apid` and `talosctl`
- the controlplane nodes
- the worker nodes
Talos enforces a high level of security by using mutual TLS for authentication and authorization.
We recommend that the configuration of Talos be performed by a cluster owner.
A cluster owner should be a person of authority within an organization, perhaps a director, manager, or senior member of a team.
They are responsible for storing the root CA, and distributing the PKI for authorized cluster administrators.
### Recommended settings
Talos runs great out of the box, but if you tweak some minor settings it will make your life
a lot easier in the future.
This is not a requirement, but rather a document to explain some key settings.
#### Endpoint
To configure the `talosctl` endpoint, it is recommended you use a resolvable DNS name.
This way, if you decide to upgrade to a multi-controlplane cluster you only have to add the ip address to the hostname configuration.
The configuration can either be done on a load balancer, or simply through DNS.
For example:
> This is in the config file for the cluster e.g. controlplane.yaml and worker.yaml.
> for more details, please see: [v1alpha1 endpoint configuration]({{< relref "../../reference/configuration#controlplaneconfig" >}})
```yaml
.....
cluster:
controlPlane:
endpoint: https://endpoint.example.local:6443
.....
```
If you have a DNS name as the endpoint, you can upgrade your talos cluster with multiple controlplanes in the future (if you don't have a multi-controlplane setup from the start)
Using a DNS name generates the corresponding Certificates (Kubernetes and Talos) for the correct hostname.

View File

@ -0,0 +1,35 @@
---
title: "Local Storage"
description: "Using local storage for Kubernetes workloads."
---
Using local storage for Kubernetes workloads implies that the pod will be bound to the node where the local storage is available.
Local storage is not replicated, so in case of a machine failure contents of the local storage will be lost.
> Note: when using the `EPHEMERAL` Talos partition (`/var`), make sure to set the `--preserve` flag while performing upgrades; otherwise you risk losing data.
## `hostPath` mounts
The simplest way to use local storage is to use `hostPath` mounts.
When using `hostPath` mounts, make sure the root directory of the mount is mounted into the `kubelet` container:
```yaml
machine:
kubelet:
extraMounts:
- destination: /var/mnt
type: bind
source: /var/mnt
options:
- bind
- rshared
- rw
```
Both `EPHEMERAL` partition and user disks can be used for `hostPath` mounts.
## Local Path Provisioner
[Local Path Provisioner](https://github.com/rancher/local-path-provisioner) can be used to dynamically provision local storage.
Make sure to update its configuration to use a path under `/var`, e.g. `/var/local-path-provisioner` as the root path for the local storage.
(In Talos Linux, the default Local Path Provisioner path `/opt/local-path-provisioner` is read-only.)
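As with the `hostPath` mounts above, the chosen root path should be bind-mounted into the `kubelet` container; a sketch, assuming `/var/local-path-provisioner` as the root path:
```yaml
machine:
  kubelet:
    extraMounts:
      - destination: /var/local-path-provisioner
        type: bind
        source: /var/local-path-provisioner
        options:
          - bind
          - rshared
          - rw
```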

View File

@ -5,12 +5,20 @@ aliases:
- ../../guides/pod-security
---
Kubernetes deprecated [Pod Security Policy](https://kubernetes.io/docs/concepts/policy/pod-security-policy/) as of v1.21, and it is
going to be removed in v1.25.
Pod Security Policy was replaced with [Pod Security Admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/).
Pod Security Admission is alpha in v1.22 (requires a feature gate) and beta in v1.23 (enabled by default).
Kubernetes deprecated [Pod Security Policy](https://kubernetes.io/docs/concepts/policy/pod-security-policy/) as of v1.21, and it was removed in v1.25.
In this guide we are going to enable and configure Pod Security Admission in Talos.
Pod Security Policy was replaced with [Pod Security Admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/), which is enabled by default
starting with Kubernetes v1.23.
Talos Linux by default enables and configures the Pod Security Admission plugin to enforce [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) with the
`baseline` profile enforced by default, with the exception of the `kube-system` namespace, which enforces the `privileged` profile.
Some applications (e.g. Prometheus node exporter or storage solutions) require more relaxed Pod Security Standards, which can be configured either by updating the Pod Security Admission plugin configuration,
or by using the `pod-security.kubernetes.io/enforce` label at the namespace level:
```shell
kubectl label namespace NAMESPACE-NAME pod-security.kubernetes.io/enforce=privileged
```
## Configuration

View File

@ -1,12 +1,12 @@
---
title: "Local Storage"
title: "Replicated Local Storage"
description: "Using local storage with OpenEBS Jiva"
aliases:
- ../../guides/storage
---
If you want to use replicated storage leveraging disk space from a local disk with Talos Linux installed, OpenEBS Jiva is a great option.
This requires installing the [iscsi-tools](https://github.com/orgs/siderolabs/packages?tab=packages&q=iscsi-tools) [system extension]({{< relref "../../talos-guides/configuration/system-extensions" >}}).
This requires installing the `iscsi-tools` [system extension]({{< relref "../../talos-guides/configuration/system-extensions" >}}).
Since OpenEBS Jiva is replicated storage, it's recommended to have at least three nodes where sufficient local disk space is available.
The documentation will follow installing OpenEBS Jiva via the official Helm chart.
@ -18,29 +18,21 @@ Refer to the OpenEBS Jiva [documentation](https://github.com/openebs/jiva-operat
## Preparing the nodes
Find the matching `iscsi-tools` image reference for your Talos version by running the [following command](https://github.com/siderolabs/extensions):
```bash
crane export ghcr.io/siderolabs/extensions:{{< release >}} | tar x -O image-digests | grep iscsi-tools
```
Create the [boot assets]({{< relref "../../talos-guides/install/boot-assets" >}}) which includes the `iscsi-tools` system extensions (or create a custom installer and perform a machine upgrade if Talos is already installed).
Create a machine config patch with the contents below and save it as `patch.yaml`:
```yaml
- op: add
path: /machine/install/extensions
value:
- image: ghcr.io/siderolabs/iscsi-tools:<version>@sha256:<digest>
- op: add
path: /machine/kubelet/extraMounts
value:
- destination: /var/openebs/local
type: bind
source: /var/openebs/local
options:
- bind
- rshared
- rw
machine:
kubelet:
extraMounts:
- destination: /var/openebs/local
type: bind
source: /var/openebs/local
options:
- bind
- rshared
- rw
```
Apply the machine config to all the nodes using talosctl:
@ -49,16 +41,7 @@ Apply the machine config to all the nodes using talosctl:
talosctl -e <endpoint ip/hostname> -n <node ip/hostname> patch mc -p @patch.yaml
```
To install the system extension, the node needs to be upgraded.
If there is no new release of Talos, the node can be upgraded to the same version as the existing Talos version.
Run the following command on each nodes subsequently:
```bash
talosctl -e <endpoint ip/hostname> -n <node ip/hostname> upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
```
Once the node has upgraded and booted successfully the extension status can be verified by running the following command:
The extension status can be verified by running the following command:
```bash
talosctl -e <endpoint ip/hostname> -n <node ip/hostname> get extensions

View File

@ -10,10 +10,9 @@ aliases:
This documentation will outline installing Cilium CNI v1.14.0 on Talos in six different ways.
Adhering to Talos principles we'll deploy Cilium with IPAM mode set to Kubernetes, and using the `cgroupv2` and `bpffs` mounts that Talos already provides.
As Talos does not allow Kubernetes workloads to load kernel modules, the `SYS_MODULE` capability needs to be dropped from the Cilium default set of values; this override can be seen in the helm/cilium CLI install commands.
Each method can either install Cilium using kube proxy (default) or without: [Kubernetes Without kube-proxy](https://docs.cilium.io/en/v1.13/network/kubernetes/kubeproxy-free/)
Each method can either install Cilium using kube proxy (default) or without: [Kubernetes Without kube-proxy](https://docs.cilium.io/en/v1.14/network/kubernetes/kubeproxy-free/)
From Talos 1.5 we can use Cilium in combination with KubePrism.
You can read more about it in Talos 1.5 release notes.
In this guide we assume that [KubePrism]({{< relref "../configuration/kubeprism" >}}) is enabled and configured to use port 7445.
## Machine config preparation
@ -27,11 +26,6 @@ cluster:
network:
cni:
name: none
machine:
features:
kubePrism:
enabled: true
port: 7445
```
```bash
@ -51,11 +45,6 @@ cluster:
name: none
proxy:
disabled: true
machine:
features:
kubePrism:
enabled: true
port: 7445
```
```bash
@ -172,9 +161,6 @@ kubectl apply -f cilium.yaml
Without kube-proxy:
```bash
export KUBERNETES_API_SERVER_ADDRESS=<replace with api server endpoint here> # e.g. 10.96.0.1
export KUBERNETES_API_SERVER_PORT=6443
helm template \
cilium \
cilium/cilium \
@ -207,11 +193,6 @@ cluster:
name: custom
urls:
- https://server.yourdomain.tld/some/path/cilium.yaml
machine:
features:
kubePrism:
enabled: true
port: 7445
```
```bash
@ -235,11 +216,6 @@ cluster:
network:
cni:
name: none
machine:
features:
kubePrism:
enabled: true
port: 7445
```
```bash
@ -312,6 +288,6 @@ For more details: [GCP ILB support / support scope local routes to be configured
## Other things to know
- Talos has full kernel module support for eBPF, see:
- [Cilium System Requirements](https://docs.cilium.io/en/v1.13/operations/system_requirements/)
- [Cilium System Requirements](https://docs.cilium.io/en/v1.14/operations/system_requirements/)
- [Talos Kernel Config AMD64](https://github.com/siderolabs/pkgs/blob/main/kernel/build/config-amd64)
- [Talos Kernel Config ARM64](https://github.com/siderolabs/pkgs/blob/main/kernel/build/config-arm64)

View File

@ -23,7 +23,7 @@ The recommended method to upgrade Kubernetes is to use the `talosctl upgrade-k8s
This will automatically update the components needed to upgrade Kubernetes safely.
Upgrading Kubernetes is non-disruptive to the cluster workloads.
To trigger a Kubernetes upgrade, issue a command specifiying the version of Kubernetes to ugprade to, such as:
To trigger a Kubernetes upgrade, issue a command specifying the version of Kubernetes to upgrade to, such as:
`talosctl --nodes <controlplane node> upgrade-k8s --to {{< k8s_release >}}`
@ -91,16 +91,18 @@ updating "kube-apiserver" to version "{{< k8s_release >}}"
This command runs in several phases:
1. Every control plane node machine configuration is patched with the new image version for each control plane component.
1. Images for new Kubernetes components are pre-pulled to the nodes to minimize downtime and test for image availability.
2. Every control plane node machine configuration is patched with the new image version for each control plane component.
Talos renders new static pod definitions on the configuration update which is picked up by the kubelet.
The command waits for the change to propagate to the API server state.
2. The command updates the `kube-proxy` daemonset with the new image version.
3. On every node in the cluster, the `kubelet` version is updated.
3. The command updates the `kube-proxy` daemonset with the new image version.
4. On every node in the cluster, the `kubelet` version is updated.
The command then waits for the `kubelet` service to be restarted and become healthy.
The update is verified by checking the `Node` resource state.
4. Kubernetes bootstrap manifests are re-applied to the cluster.
5. Kubernetes bootstrap manifests are re-applied to the cluster.
Updated bootstrap manifests might come with a new Talos version (e.g. CoreDNS version update), or might be the result of machine configuration change.
Note: The `upgrade-k8s` command never deletes any resources from the cluster: they should be deleted manually.
> Note: The `upgrade-k8s` command never deletes any resources from the cluster: they should be deleted manually.
If the command fails for any reason, it can be safely restarted to continue the upgrade process from the moment of the failure.
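To preview the planned changes without touching the cluster, the upgrade can first be run in dry-run mode (assuming a `talosctl` version recent enough to support the `--dry-run` flag):

```bash
talosctl --nodes <controlplane node> upgrade-k8s --to {{< k8s_release >}} --dry-run
```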

View File

@ -53,3 +53,20 @@ To renew `kubeconfig`, use `talosctl kubeconfig` command, and the time-to-live (
Talos doesn't support timezones, and will always run in UTC.
This ensures consistency of log timestamps for all Talos Linux clusters, simplifying debugging.
Your containers can run with any timezone configuration you desire, but the timezone of Talos Linux is not configurable.
## How do I see Talos kernel configuration?
### Using Talos API
The current kernel config can be read with `talosctl -n <NODE> read /proc/config.gz`.
For example:
```shell
talosctl -n NODE read /proc/config.gz | zgrep E1000
```
### Using GitHub
For `amd64`, see https://github.com/siderolabs/pkgs/blob/main/kernel/build/config-amd64.
Use the appropriate branch to see the kernel config matching your Talos release.

View File

@ -27,9 +27,15 @@ If the disk encryption is enabled for the EPHEMERAL partition, the system will:
This occurs only if the EPHEMERAL partition is empty and has no filesystem.
> Note: Talos Linux disk encryption is designed to guard against data being leaked or recovered from a drive that has been removed from a Talos Linux node.
It uses the hardware characteristics of the machine in order to decrypt the data, so drives that have been removed, or recycled from a cloud environment or attached to a different virtual machine, will maintain their protection and encryption.
It is not designed to protect against attacks where physical access to the machine, including the drive, is available.
Talos Linux supports four encryption methods, which can be combined for a single partition:
- `static` - encrypt with the static passphrase (weakest protection, for `STATE` partition encryption it means that the passphrase will be stored in the `META` partition).
- `nodeID` - encrypt with the key derived from the node UUID (weak, it is designed to protect against data being leaked or recovered from a drive that has been removed from a Talos Linux node).
- `kms` - encrypt using a key sealed with the network KMS (strong, but requires network access to decrypt the data).
- `tpm` - encrypt with the key derived from the TPM (strong, when used with [SecureBoot]({{< relref "../install/bare-metal-platforms/secureboot" >}})).
> Note: `nodeID` encryption is not designed to protect against attacks where physical access to the machine, including the drive, is available.
> It uses the hardware characteristics of the machine in order to decrypt the data, so drives that have been removed, or recycled from a cloud environment or attached to a different virtual machine, will maintain their protection and encryption.
## Configuration
@ -87,6 +93,8 @@ Talos supports two kinds of keys:
- `nodeID` which is generated using the node UUID and the partition label (note that if the node UUID is not really random it will fail the entropy check).
- `static` which you define right in the configuration.
- `kms` which is sealed with the network KMS.
- `tpm` which is sealed using the TPM and protected with SecureBoot.
> Note: Use static keys only if your STATE partition is encrypted and only for the EPHEMERAL partition.
> For the STATE partition it will be stored in the META partition, which is not encrypted.
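As an illustration of combining keys, the sketch below (values are placeholders) encrypts the EPHEMERAL partition with a `nodeID` key in slot 0 and a network KMS key in slot 1:

```yaml
machine:
  systemDiskEncryption:
    ephemeral:
      provider: luks2
      keys:
        - slot: 0
          nodeID: {}
        - slot: 1
          kms:
            # placeholder endpoint of a KMS server
            endpoint: https://kms.example.com:4443
```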

View File

@ -8,9 +8,6 @@ aliases:
Talos node state is fully defined by [machine configuration]({{< relref "../../reference/configuration" >}}).
Initial configuration is delivered to the node at bootstrap time, but configuration can be updated while the node is running.
> Note: Be sure that config is persisted so that configuration updates are not overwritten on reboots.
> Configuration persistence was enabled by default since Talos 0.5 (`persist: true` in machine configuration).
There are three `talosctl` commands which facilitate machine configuration updates:
* `talosctl apply-config` to apply configuration from the file
@ -29,7 +26,7 @@ Each of these commands can operate in one of four modes:
> Note: applying a change on next reboot (`--mode=staged`) doesn't modify the current node configuration, so the next call to
> `talosctl edit machineconfig --mode=staged` will not see those changes
Additionally, there is also `talosctl get machineconfig`, which retrieves the current node configuration API resource and contains the machine configuration in the `.spec` field.
Additionally, there is also `talosctl get machineconfig -o yaml`, which retrieves the current node configuration API resource and contains the machine configuration in the `.spec` field.
It can be used to modify the configuration locally before being applied to the node.
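For example, a common flow is to dump the current configuration, edit it locally, and re-apply it (the `yq` invocation below is just one way to extract the `.spec` field):

```bash
# dump the current machine configuration
talosctl -n <node> get machineconfig -o yaml | yq eval '.spec' - > config.yaml
# edit config.yaml locally, then push it back to the node
talosctl -n <node> apply-config -f config.yaml
```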
The list of config changes allowed to be applied immediately in Talos {{< release >}}:

View File

@ -101,6 +101,15 @@ machine:
- talos.logging.kernel=tcp://host:5044/
```
Kernel log delivery can also be configured using the [KmsgLogConfig document]({{< relref "../../reference/configuration/runtime/kmsglogconfig.md" >}}) in the machine configuration:
```yaml
apiVersion: v1alpha1
kind: KmsgLogConfig
name: remote-log
url: tcp://host:5044/
```
The kernel log destination is specified in the same way as the service log endpoint.
The only supported format is `json_lines`.

View File

@ -12,27 +12,26 @@ The published versions of the NVIDIA fabricmanager system extensions is availabl
> The `nvidia-fabricmanager` extension version has to match the NVIDIA driver version in use.
## Upgrading Talos and enabling the NVIDIA fabricmanager system extension
## Enabling the NVIDIA fabricmanager system extension
In addition to the patch defined in the [NVIDIA drivers]({{< relref "nvidia-gpu" >}}#upgrading-talos-and-enabling-the-nvidia-modules-and-the-system-extension) guide, we need to add the `nvidia-fabricmanager` system extension to the patch yaml `gpu-worker-patch.yaml`:
Create the [boot assets]({{< relref "../install/boot-assets" >}}) (or create a custom installer and perform a machine upgrade if Talos is already installed) which include the following system extensions:
```text
ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:{{< nvidia_driver_release >}}-{{< release >}}
ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
ghcr.io/siderolabs/nvidia-fabricmanager:{{< nvidia_driver_release >}}
```
Patch the machine configuration to load the required modules:
```yaml
- op: add
path: /machine/install/extensions
value:
- image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:{{< nvidia_driver_release >}}-{{< release >}}
- image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
- image: ghcr.io/siderolabs/nvidia-fabricmanager:525.85.12
- op: add
path: /machine/kernel
value:
machine:
kernel:
modules:
- name: nvidia
- name: nvidia_uvm
- name: nvidia_drm
- name: nvidia_modeset
- op: add
path: /machine/sysctls
value:
sysctls:
net.core.bpf_jit_harden: 1
```

View File

@ -9,56 +9,41 @@ aliases:
> The Talos published NVIDIA drivers are bound to a specific Talos release.
> The extension versions also need to be updated when upgrading Talos.
The published versions of the NVIDIA system extensions can be found here:
We will be using the following NVIDIA system extensions:
- [nonfree-kmod-nvidia](https://github.com/siderolabs/extensions/pkgs/container/nonfree-kmod-nvidia)
- [nvidia-container-toolkit](https://github.com/siderolabs/extensions/pkgs/container/nvidia-container-toolkit)
- `nonfree-kmod-nvidia`
- `nvidia-container-toolkit`
> To build an NVIDIA driver version not published by Sidero Labs, follow the instructions [here]({{< relref "../../../v1.4/talos-guides/configuration/nvidia-gpu-proprietary" >}})
## Upgrading Talos and enabling the NVIDIA modules and the system extension
Create the [boot assets]({{< relref "../install/boot-assets" >}}) which include the system extensions mentioned above (or create a custom installer and perform a machine upgrade if Talos is already installed).
> Make sure to use `talosctl` version {{< release >}} or later
> Make sure the driver version matches for both the `nonfree-kmod-nvidia` and `nvidia-container-toolkit` extensions.
> The `nonfree-kmod-nvidia` extension is versioned as `<nvidia-driver-version>-<talos-release-version>` and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.
First create a patch yaml `gpu-worker-patch.yaml` to update the machine config similar to below:
## Enabling the NVIDIA modules and the system extension
Patch Talos machine configuration using the patch `gpu-worker-patch.yaml`:
```yaml
- op: add
path: /machine/install/extensions
value:
- image: ghcr.io/siderolabs/nonfree-kmod-nvidia:{{< nvidia_driver_release >}}-{{< release >}}
- image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
- op: add
path: /machine/kernel
value:
machine:
kernel:
modules:
- name: nvidia
- name: nvidia_uvm
- name: nvidia_drm
- name: nvidia_modeset
- op: add
path: /machine/sysctls
value:
sysctls:
net.core.bpf_jit_harden: 1
```
> Update the driver version and Talos release in the above patch yaml from the published versions if there is a newer one available.
> Make sure the driver version matches for both the `nonfree-kmod-nvidia` and `nvidia-container-toolkit` extensions.
> The `nonfree-kmod-nvidia` extension is versioned as `<nvidia-driver-version>-<talos-release-version>` and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
```bash
talosctl patch mc --patch @gpu-worker-patch.yaml
```
Now we can proceed to upgrading Talos to the same version to enable the system extension:
```bash
talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
```
Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
The NVIDIA modules should be loaded and the system extension should be installed.
This can be confirmed by running:

View File

@ -9,54 +9,39 @@ aliases:
> The Talos published NVIDIA OSS drivers are bound to a specific Talos release.
> The extension versions also need to be updated when upgrading Talos.
The published versions of the NVIDIA OSS system extensions can be found here:
We will be using the following NVIDIA OSS system extensions:
- [nvidia-open-gpu-kernel-modules](https://github.com/siderolabs/extensions/pkgs/container/nvidia-open-gpu-kernel-modules)
- [nvidia-container-toolkit](https://github.com/siderolabs/extensions/pkgs/container/nvidia-container-toolkit)
- `nvidia-open-gpu-kernel-modules`
- `nvidia-container-toolkit`
## Upgrading Talos and enabling the NVIDIA OSS modules and the system extension
Create the [boot assets]({{< relref "../install/boot-assets" >}}) which include the system extensions mentioned above (or create a custom installer and perform a machine upgrade if Talos is already installed).
> Make sure to use `talosctl` version {{< release >}} or later
> Make sure the driver version matches for both the `nvidia-open-gpu-kernel-modules` and `nvidia-container-toolkit` extensions.
> The `nvidia-open-gpu-kernel-modules` extension is versioned as `<nvidia-driver-version>-<talos-release-version>` and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.
First create a patch yaml `gpu-worker-patch.yaml` to update the machine config similar to below:
## Enabling the NVIDIA OSS modules
Patch Talos machine configuration using the patch `gpu-worker-patch.yaml`:
```yaml
- op: add
path: /machine/install/extensions
value:
- image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:{{< nvidia_driver_release >}}-{{< release >}}
- image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
- op: add
path: /machine/kernel
value:
machine:
kernel:
modules:
- name: nvidia
- name: nvidia_uvm
- name: nvidia_drm
- name: nvidia_modeset
- op: add
path: /machine/sysctls
value:
sysctls:
net.core.bpf_jit_harden: 1
```
> Update the driver version and Talos release in the above patch yaml from the published versions if there is a newer one available.
> Make sure the driver version matches for both the `nvidia-open-gpu-kernel-modules` and `nvidia-container-toolkit` extensions.
> The `nvidia-open-gpu-kernel-modules` extension is versioned as `<nvidia-driver-version>-<talos-release-version>` and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
```bash
talosctl patch mc --patch @gpu-worker-patch.yaml
```
Now we can proceed to upgrading Talos to the same version to enable the system extension:
```bash
talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
```
Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
The NVIDIA modules should be loaded and the system extension should be installed.
This can be confirmed by running:

View File

@ -16,7 +16,7 @@ Talos supports two configuration patch formats:
Strategic merge patches are the easiest to use, but JSON patches allow more precise configuration adjustments.
> Note: Talos 1.5 introduces experimental support for multi-document machine configuration.
> Note: Talos 1.5+ supports [multi-document machine configuration]({{< relref "../../reference/configuration" >}}).
> JSON patches don't support multi-document machine configuration, while strategic merge patches do.
### Strategic Merge patches
@ -51,7 +51,7 @@ There are some special rules:
- `network.interfaces.vlans` section is merged with the value in the machine config if there is a match on the `vlanId:` key
- `cluster.apiServer.auditPolicy` value is replaced on merge
When patching a multi-document machine configuration, following rules apply:
When patching a [multi-document machine configuration]({{< relref "../../reference/configuration" >}}), the following rules apply (see the example after this list):
- for each document in the patch, the document is merged with the respective document in the machine configuration (matching by `kind`, `apiVersion` and `name` for named documents)
- if the patch document doesn't exist in the machine configuration, it is appended to the machine configuration
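For example, a strategic merge patch carrying a named document might look like the sketch below (a `NetworkRuleConfig` is used purely for illustration; the values are placeholders):

```yaml
# merged with an existing NetworkRuleConfig named "ingress-apid" (matched by
# apiVersion, kind and name), or appended if no such document exists yet
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: ingress-apid
portSelector:
  ports:
    - 50000
  protocol: tcp
ingress:
  - subnet: 192.168.0.0/16
```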

View File

@ -58,6 +58,9 @@ Annotations: cluster.talos.dev/node-id: Utoh3O0ZneV0kT2IUBrh7TgdouRcUW2yz
The `Service` registry by default uses a public external Discovery Service to exchange encrypted information about cluster members.
> Note: Talos supports operation with the Discovery Service disabled, but some features will then rely on Kubernetes API availability to discover
> controlplane endpoints, so in case of a failure a disabled Discovery Service makes troubleshooting much harder.
## Discovery Service
Sidero Labs maintains a public discovery service at `https://discovery.talos.dev/` whereby cluster members use a shared key that is globally unique to coordinate basic connection information (i.e. the set of possible "endpoints", or IP:port pairs).
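The corresponding machine configuration section looks roughly like this (reflecting the defaults of a generated config; adjust the registries to your needs):

```yaml
cluster:
  discovery:
    enabled: true
    registries:
      # exchange members via Kubernetes Node annotations (disabled by default)
      kubernetes:
        disabled: true
      # exchange members via the external Discovery Service
      service:
        endpoint: https://discovery.talos.dev/
```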

View File

@ -16,15 +16,4 @@ cluster:
allowSchedulingOnControlPlanes: true
```
This may be done via editing the `controlplane.yaml` file before it is applied to the controlplane nodes, by `talosctl edit machineconfig`, or by [patching the machine config]({{< relref "../configuration/patching">}}).
> Note: if you edit or patch the machine config on a running control plane node to set `allowSchedulingOnControlPlanes: true`, it will be applied immediately, but will not have any effect until the next reboot.
You may reboot the nodes via `talosctl reboot`.
You may also immediately make the control plane nodes schedulable by running the below:
```bash
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```
Note that unless `allowSchedulingOnControlPlanes: true` is set in the machine config, the nodes will be tainted again on next reboot.
This may be done via editing the `controlplane.yaml` file before it is applied to the control plane nodes, by [editing the machine config]({{< relref "../configuration/editing-machine-configuration" >}}), or by [patching the machine config]({{< relref "../configuration/patching">}}).

View File

@ -18,4 +18,4 @@ See [kernel parameters reference]({{< relref "../../../reference/kernel" >}}) fo
There are two flavors of ISO images available:
* `metal-<arch>.iso` supports booting on BIOS and UEFI systems (for x86, UEFI only for arm64)
* `secureboot-metal-<arch>.iso` supports booting on only UEFI systems in SecureBoot mode
* `metal-<arch>-secureboot.iso` supports booting only on UEFI systems in SecureBoot mode (via [Image Factory]({{< relref "../../../learn-more/image-factory" >}}))

View File

@ -40,7 +40,8 @@ Some platforms (e.g. AWS, Google Cloud, etc.) have their own network configurati
There is no such mechanism for bare-metal platforms, so Talos provides a way to use platform network config on the `metal` platform to submit the initial network configuration.
The platform network configuration is a YAML document which contains resource specifications for various network resources.
For the `metal` platform, the [interactive dashboard]({{< relref "../../interactive-dashboard" >}}) can be used to edit the platform network configuration.
For the `metal` platform, the [interactive dashboard]({{< relref "../../interactive-dashboard" >}}) can be used to edit the platform network configuration; the configuration can also be
created [manually]({{< relref "../../../advanced/metal-network-configuration" >}}).
The current value of the platform network configuration can be retrieved using the `MetaKeys` resource (key `0xa`):

View File

@ -35,3 +35,5 @@ talos.config=https://metadata.service/talos/config?mac=${mac}
```
> Note: The `talos.config` kernel parameter supports other substitution variables, see [kernel parameters reference]({{< relref "../../../reference/kernel" >}}) for the full list.
PXE booting can also be performed via [Image Factory]({{< relref "../../../learn-more/image-factory" >}}).
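An iPXE script would then chain to the PXE URL generated by Image Factory, along the lines of the sketch below (the schematic ID is a placeholder; use the exact URL shown by Image Factory for your schematic):

```text
#!ipxe
chain https://pxe.factory.talos.dev/pxe/<schematic ID>/{{< release >}}/metal-amd64
```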

View File

@ -19,7 +19,7 @@ As Talos Linux is fully contained in the UKI image, the full operating system is
## Acquiring the SecureBoot Assets
Sidero Labs is going to provide Talos images signed with the Sidero Labs Secureboot key in August 2023, but until then, you can use the following instructions to build your own.
Sidero Labs is going to provide Talos images signed with the Sidero Labs SecureBoot key in early 2024 via [Image Factory]({{< relref "../../../learn-more/image-factory" >}}), but until then, you can use the following instructions to build your own.
## SecureBoot with Custom Keys

View File

@ -170,6 +170,22 @@ The base profile can be customized with the additional flags to the imager:
For example `-console` removes all `console=<value>` arguments, whereas `-console=tty0` removes the `console=tty0` default argument.
* `--system-extension-image` allows installing a system extension into the image
### Extension Image Reference
While Image Factory automatically resolves the extension name into a matching container image for a specific version of Talos, `imager` requires the full explicit container image reference.
The `imager` also allows installing custom extensions which are not part of the official Talos Linux system extensions.
To get the official Talos Linux system extension container image reference matching a Talos release, use the [following command](https://github.com/siderolabs/extensions?tab=readme-ov-file#installing-extensions):
```shell
crane export ghcr.io/siderolabs/extensions:{{< release >}} | tar x -O image-digests | grep EXTENSION-NAME
```
> Note: this command uses the [crane](https://github.com/google/go-containerregistry/blob/main/cmd/crane/README.md) tool, but any other tool which can
> export the image contents can be used.
For each Talos release, the `ghcr.io/siderolabs/extensions:VERSION` image contains a pinned reference to each system extension container image.
### Example: Bare-metal with Imager
Let's assume we want to boot Talos on a bare-metal machine with an Intel CPU and add a `gvisor` container runtime to the image.
@ -177,20 +193,21 @@ Also we want to disable predictable network interface names with `net.ifnames=0`
First, let's look up extension images for Intel CPU microcode updates and the `gvisor` container runtime in the [extensions repository](https://github.com/siderolabs/extensions):
```text
ghcr.io/siderolabs/gvisor:20231214.0-v1.5.0-beta.0
ghcr.io/siderolabs/intel-ucode:20230613
```shell
$ crane export ghcr.io/siderolabs/extensions:{{< release >}} | tar x -O image-digests | egrep 'gvisor|intel-ucode'
ghcr.io/siderolabs/gvisor:20231214.0-{{< release >}}@sha256:548b2b121611424f6b1b6cfb72a1669421ffaf2f1560911c324a546c7cee655e
ghcr.io/siderolabs/intel-ucode:20231114@sha256:ea564094402b12a51045173c7523f276180d16af9c38755a894cf355d72c249d
```
Now we can generate the ISO image with the following command:
```shell
$ docker run --rm -t -v $PWD/_out:/out ghcr.io/siderolabs/imager:{{< release >}} iso --system-extension-image ghcr.io/siderolabs/gvisor:20231214.0-v1.5.0-beta.0 --system-extension-image ghcr.io/siderolabs/intel-ucode:20230613 --extra-kernel-arg net.ifnames=0 --extra-kernel-arg=-console --extra-kernel-arg=console=ttyS1
$ docker run --rm -t -v $PWD/_out:/out ghcr.io/siderolabs/imager:{{< release >}} iso --system-extension-image ghcr.io/siderolabs/gvisor:20231214.0-{{< release >}}@sha256:548b2b121611424f6b1b6cfb72a1669421ffaf2f1560911c324a546c7cee655e --system-extension-image ghcr.io/siderolabs/intel-ucode:20231114@sha256:ea564094402b12a51045173c7523f276180d16af9c38755a894cf355d72c249d --extra-kernel-arg net.ifnames=0 --extra-kernel-arg=-console --extra-kernel-arg=console=ttyS1
profile ready:
arch: amd64
platform: metal
secureboot: false
version: {{ < release > }}
version: {{< release >}}
customization:
extraKernelArgs:
- net.ifnames=0
@ -202,8 +219,8 @@ input:
baseInstaller:
imageRef: ghcr.io/siderolabs/installer:{{< release >}}
systemExtensions:
- imageRef: ghcr.io/siderolabs/gvisor:20231214.0-v1.5.0-beta.0
- imageRef: ghcr.io/siderolabs/intel-ucode:20230613
- imageRef: ghcr.io/siderolabs/gvisor:20231214.0-{{< release >}}@sha256:548b2b121611424f6b1b6cfb72a1669421ffaf2f1560911c324a546c7cee655e
- imageRef: ghcr.io/siderolabs/intel-ucode:20231114@sha256:ea564094402b12a51045173c7523f276180d16af9c38755a894cf355d72c249d
output:
kind: iso
outFormat: raw
@ -219,7 +236,7 @@ If the machine is going to be booted using PXE, we can instead generate kernel a
```shell
docker run --rm -t -v $PWD/_out:/out ghcr.io/siderolabs/imager:{{< release >}} iso --output-kind kernel
docker run --rm -t -v $PWD/_out:/out ghcr.io/siderolabs/imager:{{< release >}} iso --output-kind initramfs --system-extension-image ghcr.io/siderolabs/gvisor:20231214.0-v1.5.0-beta.0 --system-extension-image ghcr.io/siderolabs/intel-ucode:20230613
docker run --rm -t -v $PWD/_out:/out ghcr.io/siderolabs/imager:{{< release >}} iso --output-kind initramfs --system-extension-image ghcr.io/siderolabs/gvisor:20231214.0-{{< release >}}@sha256:548b2b121611424f6b1b6cfb72a1669421ffaf2f1560911c324a546c7cee655e --system-extension-image ghcr.io/siderolabs/intel-ucode:20231114@sha256:ea564094402b12a51045173c7523f276180d16af9c38755a894cf355d72c249d
```
Now the `_out/kernel-amd64` and `_out/initramfs-amd64` contain the customized Talos kernel and initramfs images.
@ -229,7 +246,7 @@ Now the `_out/kernel-amd64` and `_out/initramfs-amd64` contain the customized Ta
As the next step, we should generate a custom `installer` image which contains all required system extensions (kernel args can't be specified with the installer image, but they are set in the machine configuration):
```shell
$ docker run --rm -t -v $PWD/_out:/out ghcr.io/siderolabs/imager:{{< release >}} installer --system-extension-image ghcr.io/siderolabs/gvisor:20231214.0-v1.5.0-beta.0 --system-extension-image ghcr.io/siderolabs/intel-ucode:20230613
$ docker run --rm -t -v $PWD/_out:/out ghcr.io/siderolabs/imager:{{< release >}} installer --system-extension-image ghcr.io/siderolabs/gvisor:20231214.0-{{< release >}}@sha256:548b2b121611424f6b1b6cfb72a1669421ffaf2f1560911c324a546c7cee655e --system-extension-image ghcr.io/siderolabs/intel-ucode:20231114@sha256:ea564094402b12a51045173c7523f276180d16af9c38755a894cf355d72c249d
...
output asset path: /out/metal-amd64-installer.tar
```
@ -253,14 +270,15 @@ Let's assume we want to boot Talos on AWS with `gvisor` container runtime system
First, let's look up extension images for the `gvisor` container runtime in the [extensions repository](https://github.com/siderolabs/extensions):
```text
ghcr.io/siderolabs/gvisor:20231214.0-v1.5.0-beta.0
```shell
$ crane export ghcr.io/siderolabs/extensions:{{< release >}} | tar x -O image-digests | grep gvisor
ghcr.io/siderolabs/gvisor:20231214.0-{{< release >}}@sha256:548b2b121611424f6b1b6cfb72a1669421ffaf2f1560911c324a546c7cee655e
```
Next, let's generate an AWS disk image with that system extension:
```shell
$ docker run --rm -t -v $PWD/_out:/out -v /dev:/dev --privileged ghcr.io/siderolabs/imager:{{< release >}} aws --system-extension-image ghcr.io/siderolabs/gvisor:20231214.0-v1.5.0-beta.0
$ docker run --rm -t -v $PWD/_out:/out -v /dev:/dev --privileged ghcr.io/siderolabs/imager:{{< release >}} aws --system-extension-image ghcr.io/siderolabs/gvisor:20231214.0-{{< release >}}@sha256:548b2b121611424f6b1b6cfb72a1669421ffaf2f1560911c324a546c7cee655e
...
output asset path: /out/aws-amd64.raw
compression done: /out/aws-amd64.raw.xz

View File

@ -9,7 +9,7 @@ Talos Linux Ingress Firewall doesn't affect the traffic between the Kubernetes p
## Configuration
Ingress rules are configured as extra documents [NetwokDefaultActionConfig]({{< relref "../../reference/configuration/network/networkdefaultactionconfig.md" >}}) and
Ingress rules are configured as extra documents [NetworkDefaultActionConfig]({{< relref "../../reference/configuration/network/networkdefaultactionconfig.md" >}}) and
[NetworkRuleConfig]({{< relref "../../reference/configuration/network/networkruleconfig.md" >}}) in the Talos machine configuration:
```yaml

View File

@ -47,7 +47,7 @@ IP addresses:
We then choose our shared IP to be:
> 192.168.0.15
- `192.168.0.15`
## Configure your Talos Machines
@ -105,7 +105,5 @@ machine:
Since VIP functionality relies on `etcd` for elections, the shared IP will not come
alive until after you have bootstrapped Kubernetes.
This does mean that you cannot use the
shared IP when issuing the `talosctl bootstrap` command (although, as noted above, it is not recommended to access the Talos API via the VIP).
Instead, the `bootstrap` command will need to target one of the controlplane nodes
directly.
Don't use the VIP as the `endpoint` in the `talosconfig`, as the VIP is bound to `etcd` and `kube-apiserver` health, and you will not be able to recover from a failure of either of those components using Talos API.
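In practice this means pointing `talosctl` at a specific control plane node when bootstrapping, for example (the IP is a placeholder):

```bash
talosctl --nodes <controlplane-1 IP> --endpoints <controlplane-1 IP> bootstrap
```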

View File

@ -101,4 +101,4 @@ network:
When `networkd` gets this configuration it will create the device, configure it, and bring it up (equivalent to `ip link set up dev wg0`).
All supported config parameters are described in the [Machine Config Reference]({{< relref "../../reference/configuration#devicewireguardconfig" >}}).
All supported config parameters are described in the [Machine Config Reference]({{< relref "../../reference/configuration/v1alpha1/config#Config.machine.network.interfaces..wireguard" >}}).

View File

@ -7,7 +7,7 @@ aliases:
From time to time, it may be beneficial to reset a Talos machine to its "original" state.
Bear in mind that this is a destructive action for the given machine.
Doing this means removing the machine from Kubernetes, Etcd (if applicable), and clears any data on the machine that would normally persist a reboot.
Doing this means removing the machine from Kubernetes, `etcd` (if applicable), and clears any data on the machine that would normally persist a reboot.
## CLI

View File

@ -38,6 +38,8 @@ For example, if upgrading from Talos 1.0 to Talos 1.2.4, the recommended upgrade
There are no specific actions to be taken before an upgrade.
Please review the [release notes]({{< relref "../introduction/what-is-new" >}}) for any changes that may affect your cluster.
## Video Walkthrough
To see a live demo of an upgrade of Talos Linux, see the video below:
@ -93,7 +95,19 @@ future.
## Machine Configuration Changes
TBD
New configuration documents:
* [Ingress Firewall]({{< relref "../talos-guides/network/ingress-firewall" >}}) configuration: [NetworkRuleConfig]({{< relref "../reference/configuration/network/networkruleconfig" >}}) and [NetworkDefaultActionConfig]({{< relref "../reference/configuration/network/networkdefaultactionconfig" >}}).
Updates in [v1alpha1 Config]({{< relref "../reference/configuration/v1alpha1/config" >}}):
* `.persist` option was removed
* `.machine.nodeTaints` configures Kubernetes node taints (see the example after this list)
* `.machine.kubelet.extraMounts` supports new fields `uidMappings` and `gidMappings`
* `.machine.kubelet.credentialProviderConfig` configures `kubelet` credential provider
* `.machine.network.kubespan.harvestExtraEndpoints` to disable harvesting extra endpoints
* `.cluster.cni.flannel` provides customization for the default Flannel CNI manifest
* `.cluster.scheduler.config` provides custom `kube-scheduler` configuration
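As a small illustration of one of these, `.machine.nodeTaints` takes a map of taint key to `value:effect` (the taint below is an arbitrary example):

```yaml
machine:
  nodeTaints:
    example.com/reserved: gpu:NoSchedule
```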
## Upgrade Sequence