docs: update upgrading Talos, Kubernetes, and Docker guides
Variety of clarifications. Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
commit f6fa12e536 (parent 5484579c1a)
@@ -2,19 +2,28 @@
title: Upgrading Kubernetes
---

This guide covers Kubernetes control plane upgrade for clusters running Talos-managed control plane.
If the cluster is still running self-hosted control plane (after upgrade from Talos 0.8), please
refer to 0.8 docs.
This guide covers upgrading Kubernetes on Talos Linux clusters.
For upgrading the Talos Linux operating system, see [Upgrading Talos](../upgrading-talos/).

## Video Walkthrough

To see a live demo of this writeup, see the video below:
To see a demo of this process, watch this video:

<iframe width="560" height="315" src="https://www.youtube.com/embed/uOKveKbD8MQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Automated Kubernetes Upgrade

To check what is going to be upgraded you can run `talosctl upgrade-k8s` with `--dry-run` flag:
The recommended method to upgrade Kubernetes is to use the `talosctl upgrade-k8s` command.
This will automatically update the components needed to upgrade Kubernetes safely.
Upgrading Kubernetes is non-disruptive to the cluster workloads.

To trigger a Kubernetes upgrade, issue a command specifying the version of Kubernetes to upgrade to, such as:

`talosctl --nodes <master node> upgrade-k8s --to 1.23.0`

Note that the `--nodes` parameter specifies the control plane node to send the API call to, but all members of the cluster will be upgraded.

To check what will be upgraded you can run `talosctl upgrade-k8s` with the `--dry-run` flag:

```bash
$ talosctl --nodes <master node> upgrade-k8s --to 1.23.0 --dry-run
@@ -44,84 +53,15 @@ updating "kube-controller-manager" to version "1.23.0"
> update kube-controller-manager: v1.22.4 -> 1.23.0
> skipped in dry-run
> "172.20.0.3": starting update
> update kube-controller-manager: v1.22.4 -> 1.23.0
> skipped in dry-run
> "172.20.0.4": starting update
> update kube-controller-manager: v1.22.4 -> 1.23.0
> skipped in dry-run
updating "kube-scheduler" to version "1.23.0"
> "172.20.0.2": starting update
> update kube-scheduler: v1.22.4 -> 1.23.0
> skipped in dry-run
> "172.20.0.3": starting update
> update kube-scheduler: v1.22.4 -> 1.23.0
> skipped in dry-run
> "172.20.0.4": starting update
> update kube-scheduler: v1.22.4 -> 1.23.0
> skipped in dry-run
updating daemonset "kube-proxy" to version "1.23.0"
skipped in dry-run
updating kubelet to version "1.23.0"
> "172.20.0.2": starting update
> update kubelet: v1.22.4 -> 1.23.0
> skipped in dry-run
> "172.20.0.3": starting update
> update kubelet: v1.22.4 -> 1.23.0
> skipped in dry-run
> "172.20.0.4": starting update
> update kubelet: v1.22.4 -> 1.23.0
> skipped in dry-run
> "172.20.0.5": starting update
> update kubelet: v1.22.4 -> 1.23.0
> skipped in dry-run
> "172.20.0.6": starting update
> update kubelet: v1.22.4 -> 1.23.0
> skipped in dry-run

<snip>

updating manifests
> apply manifest Secret bootstrap-token-3lb63t
> apply skipped in dry run
> apply manifest ClusterRoleBinding system-bootstrap-approve-node-client-csr
> apply skipped in dry run
> apply manifest ClusterRoleBinding system-bootstrap-node-bootstrapper
> apply skipped in dry run
> apply manifest ClusterRoleBinding system-bootstrap-node-renewal
> apply skipped in dry run
> apply manifest ClusterRoleBinding system:default-sa
> apply skipped in dry run
> apply manifest ClusterRole psp:privileged
> apply skipped in dry run
> apply manifest ClusterRoleBinding psp:privileged
> apply skipped in dry run
> apply manifest PodSecurityPolicy privileged
> apply skipped in dry run
> apply manifest ClusterRole flannel
> apply skipped in dry run
> apply manifest ClusterRoleBinding flannel
> apply skipped in dry run
> apply manifest ServiceAccount flannel
> apply skipped in dry run
> apply manifest ConfigMap kube-flannel-cfg
> apply skipped in dry run
> apply manifest DaemonSet kube-flannel
> apply skipped in dry run
> apply manifest ServiceAccount kube-proxy
> apply skipped in dry run
> apply manifest ClusterRoleBinding kube-proxy
> apply skipped in dry run
> apply manifest ServiceAccount coredns
> apply skipped in dry run
> apply manifest ClusterRoleBinding system:coredns
> apply skipped in dry run
> apply manifest ClusterRole system:coredns
> apply skipped in dry run
> apply manifest ConfigMap coredns
> apply skipped in dry run
> apply manifest Deployment coredns
> apply skipped in dry run
> apply manifest Service kube-dns
> apply skipped in dry run
> apply manifest ConfigMap kubeconfig-in-cluster
> apply skipped in dry run
<snip>
```

To upgrade Kubernetes from v1.22.4 to v1.23.0 run:
@@ -140,148 +80,32 @@ updating "kube-apiserver" to version "1.23.0"
< "172.20.0.2": successfully updated
> "172.20.0.3": starting update
> update kube-apiserver: v1.22.4 -> 1.23.0
> "172.20.0.3": machine configuration patched
> "172.20.0.3": waiting for API server state pod update
< "172.20.0.3": successfully updated
> "172.20.0.4": starting update
> update kube-apiserver: v1.22.4 -> 1.23.0
> "172.20.0.4": machine configuration patched
> "172.20.0.4": waiting for API server state pod update
< "172.20.0.4": successfully updated
updating "kube-controller-manager" to version "1.23.0"
> "172.20.0.2": starting update
> update kube-controller-manager: v1.22.4 -> 1.23.0
> "172.20.0.2": machine configuration patched
> "172.20.0.2": waiting for API server state pod update
< "172.20.0.2": successfully updated
> "172.20.0.3": starting update
> update kube-controller-manager: v1.22.4 -> 1.23.0
> "172.20.0.3": machine configuration patched
> "172.20.0.3": waiting for API server state pod update
< "172.20.0.3": successfully updated
> "172.20.0.4": starting update
> update kube-controller-manager: v1.22.4 -> 1.23.0
> "172.20.0.4": machine configuration patched
> "172.20.0.4": waiting for API server state pod update
< "172.20.0.4": successfully updated
updating "kube-scheduler" to version "1.23.0"
> "172.20.0.2": starting update
> update kube-scheduler: v1.22.4 -> 1.23.0
> "172.20.0.2": machine configuration patched
> "172.20.0.2": waiting for API server state pod update
< "172.20.0.2": successfully updated
> "172.20.0.3": starting update
> update kube-scheduler: v1.22.4 -> 1.23.0
> "172.20.0.3": machine configuration patched
> "172.20.0.3": waiting for API server state pod update
< "172.20.0.3": successfully updated
> "172.20.0.4": starting update
> update kube-scheduler: v1.22.4 -> 1.23.0
> "172.20.0.4": machine configuration patched
> "172.20.0.4": waiting for API server state pod update
< "172.20.0.4": successfully updated
updating daemonset "kube-proxy" to version "1.23.0"
updating kubelet to version "1.23.0"
> "172.20.0.2": starting update
> update kubelet: v1.22.4 -> 1.23.0
> "172.20.0.2": machine configuration patched
> "172.20.0.2": waiting for kubelet restart
> "172.20.0.2": waiting for node update
< "172.20.0.2": successfully updated
> "172.20.0.3": starting update
> update kubelet: v1.22.4 -> 1.23.0
> "172.20.0.3": machine configuration patched
> "172.20.0.3": waiting for kubelet restart
> "172.20.0.3": waiting for node update
< "172.20.0.3": successfully updated
> "172.20.0.4": starting update
> update kubelet: v1.22.4 -> 1.23.0
> "172.20.0.4": machine configuration patched
> "172.20.0.4": waiting for kubelet restart
> "172.20.0.4": waiting for node update
< "172.20.0.4": successfully updated
> "172.20.0.5": starting update
> update kubelet: v1.22.4 -> 1.23.0
> "172.20.0.5": machine configuration patched
> "172.20.0.5": waiting for kubelet restart
> "172.20.0.5": waiting for node update
< "172.20.0.5": successfully updated
> "172.20.0.6": starting update
> update kubelet: v1.22.4 -> 1.23.0
> "172.20.0.6": machine configuration patched
> "172.20.0.6": waiting for kubelet restart
> "172.20.0.6": waiting for node update
< "172.20.0.6": successfully updated
updating manifests
> apply manifest Secret bootstrap-token-3lb63t
> apply skipped: nothing to update
> apply manifest ClusterRoleBinding system-bootstrap-approve-node-client-csr
> apply skipped: nothing to update
> apply manifest ClusterRoleBinding system-bootstrap-node-bootstrapper
> apply skipped: nothing to update
> apply manifest ClusterRoleBinding system-bootstrap-node-renewal
> apply skipped: nothing to update
> apply manifest ClusterRoleBinding system:default-sa
> apply skipped: nothing to update
> apply manifest ClusterRole psp:privileged
> apply skipped: nothing to update
> apply manifest ClusterRoleBinding psp:privileged
> apply skipped: nothing to update
> apply manifest PodSecurityPolicy privileged
> apply skipped: nothing to update
> apply manifest ClusterRole flannel
> apply skipped: nothing to update
> apply manifest ClusterRoleBinding flannel
> apply skipped: nothing to update
> apply manifest ServiceAccount flannel
> apply skipped: nothing to update
> apply manifest ConfigMap kube-flannel-cfg
> apply skipped: nothing to update
> apply manifest DaemonSet kube-flannel
> apply skipped: nothing to update
> apply manifest ServiceAccount kube-proxy
> apply skipped: nothing to update
> apply manifest ClusterRoleBinding kube-proxy
> apply skipped: nothing to update
> apply manifest ServiceAccount coredns
> apply skipped: nothing to update
> apply manifest ClusterRoleBinding system:coredns
> apply skipped: nothing to update
> apply manifest ClusterRole system:coredns
> apply skipped: nothing to update
> apply manifest ConfigMap coredns
> apply skipped: nothing to update
> apply manifest Deployment coredns
> apply skipped: nothing to update
> apply manifest Service kube-dns
> apply skipped: nothing to update
> apply manifest ConfigMap kubeconfig-in-cluster
> apply skipped: nothing to update
<snip>
```

Script runs in several phases:
This command runs in several phases:

1. Every control plane node machine configuration is patched with new image version for each control plane component.
Talos renders new static pod definition on configuration update which is picked up by the kubelet.
Script waits for the change to propagate to the API server state.
2. The script updates `kube-proxy` daemonset with the new image version.
3. On every node in the cluster, `kubelet` version is updated.
The script waits for the `kubelet` service to be restarted, become healthy.
Update is verified with the `Node` resource state.
1. Every control plane node machine configuration is patched with the new image version for each control plane component.
Talos renders new static pod definitions on the configuration update, which are picked up by the kubelet.
The command waits for the change to propagate to the API server state.
2. The command updates the `kube-proxy` daemonset with the new image version.
3. On every node in the cluster, the `kubelet` version is updated.
The command then waits for the `kubelet` service to be restarted and become healthy.
The update is verified by checking the `Node` resource state.
4. Kubernetes bootstrap manifests are re-applied to the cluster.
The script never deletes any resources from the cluster, they should be deleted manually.
Updated bootstrap manifests might come with new Talos version (e.g. CoreDNS version update), or might be result of machine configuration change.
Updated bootstrap manifests might come with a new Talos version (e.g. CoreDNS version update), or might be the result of a machine configuration change.
Note: The `upgrade-k8s` command never deletes any resources from the cluster: they should be deleted manually.

If the script fails for any reason, it can be safely restarted to continue upgrade process from the moment of the failure.
If the command fails for any reason, it can be safely restarted to continue the upgrade process from the moment of the failure.
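
Once the upgrade finishes, a quick way to confirm the result (a sketch; node names and output are illustrative) is to check that every node reports the new kubelet version:

```bash
# Each node's VERSION column should show the target Kubernetes version (1.23.0 in this example).
kubectl get nodes -o wide
```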

## Manual Kubernetes Upgrade

Kubernetes can be upgraded manually as well by following the steps outlined below.
Kubernetes can be upgraded manually by following the steps outlined below.
They are equivalent to the steps performed by the `talosctl upgrade-k8s` command.

### Kubeconfig

In order to edit the control plane, we will need a working `kubectl` config.
In order to edit the control plane, you need a working `kubectl` config.
If you don't already have one, you can get one by running:

```bash
@@ -297,11 +121,11 @@ $ talosctl -n <CONTROL_PLANE_IP_1> patch mc --mode=no-reboot -p '[{"op": "replac
patched mc at the node 172.20.0.2
```

JSON patch might need to be adjusted if current machine configuration is missing `.cluster.apiServer.image` key.
The JSON patch might need to be adjusted if the current machine configuration is missing the `.cluster.apiServer.image` key.

Also machine configuration can be edited manually with `talosctl -n <IP> edit mc --mode=no-reboot`.
Also the machine configuration can be edited manually with `talosctl -n <IP> edit mc --mode=no-reboot`.

Capture new version of `kube-apiserver` config with:
Capture the new version of `kube-apiserver` config with:

```bash
$ talosctl -n <CONTROL_PLANE_IP_1> get kcpc kube-apiserver -o yaml
@@ -324,7 +148,7 @@ spec:
extraVolumes: []
```

In this example, new version is `5`.
In this example, the new version is `5`.
Wait for the new pod definition to propagate to the API server state (replace `talos-default-master-1` with the node name):

```bash
@@ -351,7 +175,7 @@ $ talosctl -n <CONTROL_PLANE_IP_1> patch mc --mode=no-reboot -p '[{"op": "replac
patched mc at the node 172.20.0.2
```

JSON patch might need be adjusted if current machine configuration is missing `.cluster.controllerManager.image` key.
The JSON patch might need to be adjusted if the current machine configuration is missing the `.cluster.controllerManager.image` key.

Capture new version of `kube-controller-manager` config with:

@@ -389,7 +213,7 @@ NAME READY STATUS RESTARTS AG
kube-controller-manager-talos-default-master-1 1/1 Running 0 35m
```

Repeat this process for every control plane node, verifying that state got propagated successfully between each node update.
Repeat this process for every control plane node, verifying that the state propagated successfully between each node update.
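
For example, one way to spot-check a node after its update is to look at the image of the static pod that the kubelet created from the patched configuration (a sketch; `talos-default-master-1` is a placeholder node name):

```bash
# Show the image used by the kube-controller-manager static pod on one node.
kubectl -n kube-system get pod kube-controller-manager-talos-default-master-1 \
  -o jsonpath='{.spec.containers[0].image}'
```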

### Scheduler

@@ -1,14 +1,28 @@
---
title: Upgrading Talos
title: Upgrading Talos Linux
---

Talos upgrades are effected by an API call.
The `talosctl` CLI utility will facilitate this.
<!-- , or you can use the automatic upgrade features provided by the [talos controller manager](https://github.com/talos-systems/talos-controller-manager) -->
OS upgrades, like other operations on Talos Linux, are effected by an API call, which can be sent via the `talosctl` CLI utility.
Because Talos Linux is image based, an upgrade is almost the same as installing Talos, with the difference that the system has already been initialized with a configuration.

The upgrade API call passes a node the installer image to use to perform the upgrade.
Each Talos version has a corresponding installer.

Upgrades use an A-B image scheme in order to facilitate rollbacks.
This scheme retains the previous Talos kernel and OS image following each upgrade.
If an upgrade fails to boot, Talos will roll back to the previous version.
Likewise, Talos may be manually rolled back via API (or `talosctl rollback`).
This will simply update the boot reference and reboot.
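
For example, a manual rollback of a single node might look like this (a sketch; the IP is a placeholder):

```bash
# Switch the boot reference back to the previous Talos kernel and OS image, then reboot.
talosctl rollback --nodes 10.20.30.40
```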

Unless explicitly told to `preserve` data, an upgrade will cause the node to wipe the ephemeral partition, remove itself from the etcd cluster (if it is a control node), and generally make itself as pristine as is possible.
(This is generally the desired behavior, except in specialised use cases such as single-node clusters.)

*Note* that unless the Kubernetes version has been specified in the machine config, an upgrade of the Talos Linux OS will also apply an upgrade of the Kubernetes version.
Each release of Talos Linux includes the latest stable Kubernetes version by default.
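
If you want the Kubernetes version to stay put across an OS upgrade, one approach is to pin the kubelet image in the machine configuration before upgrading; the config path and image tag below are assumptions shown only for illustration:

```bash
# Hypothetical sketch: pin the kubelet image so an OS upgrade does not change the Kubernetes version.
# Verify the exact field and image reference in your machine configuration before applying.
talosctl -n 10.20.30.40 patch mc --mode=no-reboot \
  -p '[{"op": "add", "path": "/machine/kubelet/image", "value": "ghcr.io/talos-systems/kubelet:v1.22.4"}]'
```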

## Video Walkthrough

To see a live demo of this writeup, see the video below:
To see a live demo of an upgrade of Talos Linux, see the video below:

<iframe width="560" height="315" src="https://www.youtube.com/embed/AAF6WhX0USo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

@@ -16,10 +30,10 @@ To see a live demo of this writeup, see the video below:

TBD

## `talosctl` Upgrade
## `talosctl upgrade`

To manually upgrade a Talos node, you will specify the node's IP address and the
installer container image for the version of Talos to which you wish to upgrade.
To upgrade a Talos node, specify the node's IP address and the
installer container image for the version of Talos to upgrade to.

For instance, if your Talos node has the IP address `10.20.30.40` and you want
to install the official version `v0.15.0`, you would enter a command such
@@ -30,12 +44,18 @@ as:
--image ghcr.io/talos-systems/installer:v0.15.0
```

There is an option to this command: `--preserve`, which can be used to explicitly tell Talos to either keep intact its ephemeral data or not.
In most cases, it is correct to just let Talos perform its default action.
There is an option to this command: `--preserve`, which will explicitly tell Talos to keep ephemeral data intact.
In most cases, it is correct to let Talos perform its default action of erasing the ephemeral data.
However, if you are running a single-node control-plane, you will want to make sure that `--preserve=true`.
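
For a single-node control plane, that would look something like the following (a sketch; the IP and version are placeholders):

```bash
# Keep the ephemeral data (including etcd state) intact during the upgrade.
talosctl upgrade --nodes 10.20.30.40 \
  --image ghcr.io/talos-systems/installer:v0.15.0 \
  --preserve=true
```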

If Talos fails to run the upgrade, the `--stage` flag may be used to perform the upgrade after a reboot
which is followed by another reboot to upgraded version.
Rarely, an upgrade command will fail to run due to a process holding a file open on disk, or you may wish to set a node to upgrade, but delay the actual reboot as long as possible.
In these cases, you can use the `--stage` flag.
This puts the upgrade artifacts on disk, and adds some metadata to a disk partition that gets checked very early in the boot process.
The node is *not* rebooted by the `upgrade --stage` process.
However, whenever the system does next reboot, Talos sees that it needs to apply an upgrade, and will do so immediately.
Because this occurs in a just-rebooted system, there will be no conflict with any files being held open.
After the upgrade is applied, the node will reboot again, in order to boot into the new version.
Note that because Talos Linux now reboots via the kexec syscall, the extra reboot adds very little time.
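
A staged upgrade therefore looks like a normal upgrade with one extra flag (a sketch; the IP and version are placeholders):

```bash
# Stage the upgrade now; it will be applied on the next reboot of the node.
talosctl upgrade --nodes 10.20.30.40 \
  --image ghcr.io/talos-systems/installer:v0.15.0 \
  --stage
```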

<!--
## Talos Controller Manager
@@ -57,3 +77,70 @@ future.
## Machine Configuration Changes

TBD

## Upgrade Sequence

When a Talos node receives the upgrade command, it cordons
itself in Kubernetes, to avoid receiving any new workload.
It then starts to drain its existing workload.

**NOTE**: If any of your workloads are sensitive to being shut down ungracefully, be sure to use the `lifecycle.preStop` Pod [spec](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks).

Once all of the workload Pods are drained, Talos will start shutting down its
internal processes.
If it is a control node, this will include etcd.
If `preserve` is not enabled, Talos will leave etcd membership.
(Talos ensures the etcd cluster is healthy and will remain healthy after the node leaves the etcd cluster, before allowing a control plane node to be upgraded.)

Once all the processes are stopped and the services are shut down, the filesystems will be unmounted.
This allows Talos to produce a very clean upgrade, as close as possible to a pristine system.
We verify the disk and then perform the actual image upgrade.
We set the bootloader to boot _once_ with the new kernel and OS image, then we reboot.

After the node comes back up and Talos verifies itself, it will make
the bootloader change permanent, rejoin the cluster, and finally uncordon itself to receive new workloads.
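
If you want to follow this sequence from the Kubernetes side, one simple (illustrative) way is to watch the node objects while the upgrade runs:

```bash
# The upgrading node shows SchedulingDisabled while cordoned, then becomes Ready again after uncordoning.
kubectl get nodes --watch
```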

### FAQs

**Q.** What happens if an upgrade fails?

**A.** Talos Linux attempts to safely handle upgrade failures.

The most common failure is an invalid installer image reference.
In this case, Talos will fail to download the upgraded image and will abort the upgrade.

Sometimes, Talos is unable to successfully kill off all of the disk access points, in which case it cannot safely unmount all filesystems to effect the upgrade.
In this case, it will abort the upgrade and reboot.
(`upgrade --stage` can ensure that upgrades can occur even when the filesystems cannot be unmounted.)

It is possible (especially with test builds) that the upgraded Talos system will fail to start.
In this case, the node will be rebooted, and the bootloader will automatically use the previous Talos kernel and image, thus effectively rolling back the upgrade.

Lastly, it is possible that Talos itself will upgrade successfully, start up, and rejoin the cluster, but your workload will fail to run on it, for whatever reason.
This is when you would use the `talosctl rollback` command to revert to the previous Talos version.

**Q.** Can upgrades be scheduled?

**A.** Because the upgrade sequence is API-driven, you can easily tie it in to your own business logic to schedule and coordinate your upgrades.

**Q.** Can the upgrade process be observed?

**A.** Yes, using the `talosctl dmesg -f` command.
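
For example (the IP is a placeholder):

```bash
# Stream the kernel and boot log of the node being upgraded.
talosctl -n 10.20.30.40 dmesg -f

# The machined service log can be followed as well.
talosctl -n 10.20.30.40 logs --follow machined
```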

**Q.** Are worker node upgrades handled differently from control plane node upgrades?

**A.** Short answer: no.

Long answer: Both node types follow the same set procedure.
From the user's standpoint, the processes are identical.
However, since control plane nodes run additional services, such as etcd, there are some extra steps and checks performed on them.
For instance, Talos will refuse to upgrade a control plane node if that upgrade would cause a loss of quorum for etcd.
If multiple control plane nodes are asked to upgrade at the same time, Talos will protect the Kubernetes cluster by ensuring only one control plane node actively upgrades at any time, via etcd quorum checks.
If you are running a single-node cluster and want to force an upgrade despite the loss of quorum, you can set `preserve` to `true`.
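
You can also check etcd membership yourself before kicking off a control plane upgrade; a sketch, assuming the `talosctl etcd members` subcommand available in recent releases (the IP is a placeholder):

```bash
# List the current etcd members as seen from one control plane node.
talosctl -n 172.20.0.2 etcd members
```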

**Q.** Can I break my cluster by upgrading everything at once?

**A.** Maybe - it's not recommended.

Nothing prevents the user from sending near-simultaneous upgrades to each node of the cluster - and while Talos Linux and Kubernetes can generally deal with this situation, other components of the cluster may not be able to recover from more than one node rebooting at a time.
(e.g. any software that maintains a quorum or state across nodes, such as Rook/Ceph)
@@ -1,111 +0,0 @@
---
title: Upgrades
weight: 5
---

## Talos

The upgrade process for Talos, like everything else, begins with an API call.
This call tells a node the installer image to use to perform the upgrade.
Each Talos version corresponds to an installer with the same version, such that the
version of the installer is the version of Talos which will be installed.

Because Talos is image based, even at run-time, upgrading Talos is almost
exactly the same set of operations as installing Talos, with the difference that
the system has already been initialized with a configuration.

An upgrade makes use of an A-B image scheme in order to facilitate rollbacks.
This scheme retains the one previous Talos kernel and OS image following each upgrade.
If an upgrade fails to boot, Talos will roll back to the previous version.
Likewise, Talos may be manually rolled back via API (or `talosctl rollback`).
This will simply update the boot reference and reboot.

An upgrade can `preserve` data or not.
If Talos is told to NOT preserve data, it will wipe its ephemeral partition, remove itself from the etcd cluster (if it is a control node), and generally make itself as pristine as is possible.
There are likely to be changes to the default option here over time, so if your setup has a preference to one way or the other, it is better to specify it explicitly, but we try to always be "safe" with this setting.

### Sequence

When a Talos node receives the upgrade command, the first thing it does is cordon
itself in kubernetes, to avoid receiving any new workload.
It then starts to drain away its existing workload.

**NOTE**: If any of your workloads is sensitive to being shut down ungracefully, be sure to use the `lifecycle.preStop` Pod [spec](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks).

Once all of the workload Pods are drained, Talos will start shutting down its
internal processes.
If it is a control node, this will include etcd.
If `preserve` is not enabled, Talos will even leave etcd membership.
(Don't worry about this; we make sure the etcd cluster is healthy and that it will remain healthy after our node departs, before we allow this to occur.)

Once all the processes are stopped and the services are shut down, all of the
filesystems will be unmounted.
This allows Talos to produce a very clean upgrade, as close as possible to a pristine system.
We verify the disk and then perform the actual image upgrade.

Finally, we tell the bootloader to boot _once_ with the new kernel and OS image.
Then we reboot.

After the node comes back up and Talos verifies itself, it will make permanent
the bootloader change, rejoin the cluster, and finally uncordon itself to receive new workloads.

### FAQs

**Q.** What happens if an upgrade fails?

**A.** There are many potential ways an upgrade can fail, but we always try to do
the safe thing.

The most common first failure is an invalid installer image reference.
In this case, Talos will fail to download the upgraded image and will abort the upgrade.

Sometimes, Talos is unable to successfully kill off all of the disk access points, in which case it cannot safely unmount all filesystems to effect the upgrade.
In this case, it will abort the upgrade and reboot.

It is possible (especially with test builds) that the upgraded Talos system will fail to start.
In this case, the node will be rebooted, and the bootloader will automatically use the previous Talos kernel and image, thus effectively aborting the upgrade.

Lastly, it is possible that Talos itself will upgrade successfully, start up, and rejoin the cluster but your workload will fail to run on it, for whatever reason.
This is when you would use the `talosctl rollback` command to revert back to the previous Talos version.

**Q.** Can upgrades be scheduled?

**A.** We provide the [Talos Controller Manager](https://github.com/talos-systems/talos-controller-manager) to coordinate upgrades of a cluster.
Additionally, because the upgrade sequence is API-driven, you can easily tie this in to your own business logic to schedule and coordinate your upgrades.

**Q.** Can the upgrade process be observed?

**A.** The Talos Controller Manager does this internally, watching the logs of
the node being upgraded, using the streaming log API of Talos.

You can do the same thing using the `talosctl logs --follow machined` command.

**Q.** Are worker node upgrades handled differently from control plane node upgrades?

**A.** Short answer: no.

Long answer: Both node types follow the same set procedure.
However, since control plane nodes run additional services, such as etcd, there are some extra steps and checks performed on them.
From the user's standpoint, however, the processes are identical.

There are also additional restrictions on upgrading control plane nodes.
For instance, Talos will refuse to upgrade a control plane node if that upgrade will cause a loss of quorum for etcd.
This can generally be worked around by setting `preserve` to `true`.

**Q.** Will an upgrade try to do the whole cluster at once?
Can I break my cluster by upgrading everything?

**A.** No.

Nothing prevents the user from sending any number of near-simultaneous upgrades to each node of the cluster.
While most people would not attempt to do this, it may be the desired behaviour in certain situations.

If, however, multiple control plane nodes are asked to upgrade at the same time, Talos will protect itself by making sure only one control plane node upgrades at any time, through its checking of etcd quorum.
A lease is taken out by the winning control plane node, and no other control plane node is allowed to execute the upgrade until the lease is released and the etcd cluster is healthy and _will_ be healthy when the next node performs its upgrade.

**Q.** Is there an operator or controller which will keep my nodes updated
automatically?

**A.** Yes.

We provide the [Talos Controller Manager](https://github.com/talos-systems/talos-controller-manager) to perform this maintenance in a simple, controllable fashion.
@@ -17,8 +17,9 @@ The follow are requirements for running Talos in Docker:

## Caveats

Due to the fact that Talos runs in a container, certain APIs are not available when running in Docker.
For example `upgrade`, `reset`, and APIs like these don't apply in container mode.
Because Talos will be running in a container, certain APIs are not available.
For example, `upgrade`, `reset`, and similar APIs don't apply in container mode.
Further, when running on a Mac in Docker, due to networking limitations, VIPs are not supported.

## Create the Cluster

@@ -30,20 +31,12 @@ talosctl cluster create --wait

Once the above finishes successfully, your talosconfig (`~/.talos/config`) will be configured to point to the new cluster.

If you are running on MacOS, an additional command is required:
> Note: Startup times can take up to a minute or more before the cluster is available.

```bash
talosctl config --endpoints 127.0.0.1
```
Finally, you need to specify which nodes you want to communicate with using talosctl.
Talosctl can operate on one or all of the nodes in the cluster – this makes cluster-wide commands much easier.

> Note: Startup times can take up to a minute before the cluster is available.

## Retrieve and Configure the `kubeconfig`

```bash
talosctl kubeconfig .
kubectl --kubeconfig kubeconfig config set-cluster talos-default --server https://127.0.0.1:6443
```
`talosctl config nodes 10.5.0.2 10.5.0.3`
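
With the endpoints and nodes configured, cluster-wide commands become one-liners; for example (a sketch using the node IPs from above):

```bash
# Query both nodes in a single call.
talosctl --nodes 10.5.0.2,10.5.0.3 version
```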

## Using the Cluster
