Proposal: Automated Upgrades
Author(s): Andrew Rynhard
Abstract
In the next version of Talos we aim to make it completely ephemeral by:
- having an extremely small footprint, allowing for the `rootfs` to run purely in memory
- formatting the owned block device on upgrades
- performing upgrades from a container
Background
Currently, Talos bootstraps a machine in the initramfs.
In this stage the required partitions are created, if necessary, and mounted.
This design requires four partitions:
- a boot partition
- two root partitions (`ROOT-A` and `ROOT-B`)
- and a data partition
Our immutable approach necessitates two root partitions in order to perform an upgrade.
These two partitions are toggled on each upgrade (e.g. when upgrading while running `ROOT-A`, the `ROOT-B` partition is used for the new install and then targeted on the next boot).
By requiring two root partitions, the artifacts produced are sizable (~5GB), making our footprint bigger than it needs to be.
In addition to the above, we currently run init in the initramfs, perform a switch_root to the mounted rootfs, and then execute /proc/self/exe to pass control to the same version of init, but with a flag that indicates we should run rootfs-specific code.
By executing /proc/self/exe we run into a number of limitations:
- tightly coupled `initramfs` and `rootfs` PID 1 processes
- requires a reboot upon upgrading
- `init` is bloated
Note: There are a number of other tasks performed in the `initramfs`, but I will spare the details as they don't have much influence on the design outlined in this proposal.
Proposal
This proposal introduces the idea of breaking apart init into two separate binaries with more focused scopes.
I propose we call these binaries init (runs in initramfs) and machined (runs in rootfs).
Additionally, optimizations in the rootfs size, and container based upgrades are proposed.
Note: I'm also playing with the name
nodedinstead ofmachined. Either way I'd like our nomenclature to align throughout Talos (e.g. nodeconfig, machineconfig).
Splitting init into init and machined
By splitting the current implementation of init into two distinct binaries we have the potential to avoid reboots on upgrades.
Since our current init code performs tasks in the early boot stage, and acts as PID 1 once running the rootfs, we must always reboot in order to run a new version of Talos.
Aside from these benefits, it's just better in general to have a more focused scope.
The Scope of init
Splitting our current init into two doesn't completely remove the need for a reboot, but it does reduce the chances that a reboot is required.
We can decrease these chances by having the new init perform only the following tasks:
- mount the special filesystems (e.g. `/dev`, `/proc`, `/sys`, etc.)
- mount the `rootfs` (a `squashfs` filesystem) using a loopback device (see the sketch below)
- start `machined`
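
A minimal sketch, in Go, of what the loopback mount could look like, assuming the `rootfs` image is a `squashfs` file at a known path; the function name, image path, and mount target are placeholders, not the actual init implementation.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// mountRootFS attaches a squashfs image to a free loop device and mounts it
// read-only at the given target directory.
func mountRootFS(image, target string) error {
	// Ask the loop control device for a free loop device number.
	ctl, err := os.OpenFile("/dev/loop-control", os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer ctl.Close()

	n, err := unix.IoctlRetInt(int(ctl.Fd()), unix.LOOP_CTL_GET_FREE)
	if err != nil {
		return err
	}

	loop := fmt.Sprintf("/dev/loop%d", n)

	img, err := os.Open(image)
	if err != nil {
		return err
	}
	defer img.Close()

	dev, err := os.OpenFile(loop, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer dev.Close()

	// Bind the image file to the loop device.
	if err := unix.IoctlSetInt(int(dev.Fd()), unix.LOOP_SET_FD, int(img.Fd())); err != nil {
		return err
	}

	// Mount the loop device as a read-only squashfs filesystem.
	return unix.Mount(loop, target, "squashfs", unix.MS_RDONLY, "")
}

func main() {
	// Illustrative paths only; init would then switch_root into the target and exec machined.
	if err := mountRootFS("/rootfs.sqsh", "/root"); err != nil {
		log.Fatal(err)
	}
}
```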
By stripping init down to the bare minimum to start the rootfs, we reduce the chances that we introduce a change that requires upgrading init.
Additionally, on upgrades, we can return control back to init to perform a teardown of the node and return it to a state that is the same as when the machine was freshly booted.
This provides a clean slate for upgrades.
The Scope of machined
Once we enter the rootfs we have three high level tasks:
- retrieve the machineconfig
- create, format, and mount partitions per the builtin specifications
- start system and k8s.io services
By having a versioned machined that is decoupled from the initramfs, we can encapsulate a version of Talos in one binary that knows how to run that version of Talos.
It will know how to handle versioned machineconfigs and migrate them to newer formats.
If we ever decide to change the partition layout, filesystem types, or anything at the block device level, the versioned machined will know how to handle that too.
The same can be said about the services that machined is responsible for.
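
As a rough illustration of what handling versioned machineconfigs might look like, here is a hedged Go sketch; the version strings and trimmed-down struct are assumptions for the example, not the actual machineconfig schema.

```go
package machineconfig

import "fmt"

// MachineConfig is a hypothetical, trimmed-down stand-in for the real config.
type MachineConfig struct {
	Version string
	// ... remaining fields elided
}

// Migrate walks a config forward one format version at a time until it reaches
// the newest format understood by this build of machined.
func Migrate(cfg *MachineConfig) error {
	for {
		switch cfg.Version {
		case "v1alpha1":
			// Translate any renamed or relocated fields here, then bump the version.
			cfg.Version = "v1alpha2"
		case "v1alpha2":
			return nil // already at the newest format
		default:
			return fmt.Errorf("unknown machineconfig version %q", cfg.Version)
		}
	}
}
```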
Running rootfs in Memory
To take nodes one step closer to becoming completely ephemeral, we will run the rootfs in memory.
This is part of the solution to remove the need for ROOT-A and ROOT-B partitions.
It also allows us to dedicate the owned block device to data that can be reproduced (e.g. containerd data, kubelet data, etcd data, CNI data, etc.).
Using squashfs for the rootfs
For performance reasons, and a smaller footprint, we can create our rootfs as a squashfs image.
The way in which we can run the rootfs is by mounting it using a loopback device.
Once mounted, we will perform a switch_root into the mounted rootfs.
Note: A benefit of `squashfs` is that it allows us to retain our read-only `rootfs` without having to mount it as such.
Slimming down rootfs
To slim down the size of our rootfs, I propose that we remove Kubernetes tarballs entirely and let the kubelet download them.
By doing this we can embrace configurability at the Kubernetes layer.
The disadvantage is that this introduces a number of variables and increases the number of things that can go wrong.
Having "approved" tarballs packaged with Talos allows us to run conformance tests against exactly what users will use.
That being said, I think this is overall a better direction for Talos.
A New Partition Scheme
With the move to run the rootfs in memory, we will remove the need for ROOT-A and ROOT-B partitions.
In the case of cloud installs, or where users may want to boot without PXE/iPXE, we will require a boot partition.
In all cases we require a data partition.
I also propose we rename the DATA partition to EPHEMERAL to make it clear that this partition should not be used to store application data.
Our suggestion to users who do want persistent storage on the host is to mount another disk and to use that.
Talos will make the guarantee to only manage the "owned" block device.
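
For illustration only, the resulting layout could be described roughly as below; the filesystem types and sizes are assumptions made for the sketch, not part of this proposal.

```go
package partition

// Spec is a hypothetical description of one partition on the owned block device.
type Spec struct {
	Label string
	FS    string
	Size  string // "rest" means the remainder of the device
}

// Layout reflects the proposed scheme: no ROOT-A/ROOT-B, a BOOT partition for
// installs that do not boot via PXE/iPXE, and the renamed EPHEMERAL partition
// for reproducible data.
var Layout = []Spec{
	{Label: "BOOT", FS: "vfat", Size: "512MiB"}, // assumed filesystem and size
	{Label: "EPHEMERAL", FS: "xfs", Size: "rest"}, // assumed filesystem
}
```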
Performing Upgrades in a Container
The move to running the rootfs in memory (thus making the block device ephemeral), combined with the split of init into init and machined, allows us to take an approach to upgrades that is clean and relatively low risk.
The workflow will look roughly like the following:
- an upgrade request is received by `osd` and proxied to `machined`
- `machined` stops `system` and `k8s.io` services
- `machined` unmounts all mounts in `/var`
- `machined` runs the specified installer container
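
A minimal sketch of the final step, assuming machined drives containerd through its Go client; the socket path, namespace, container ID, and image reference are placeholders, and the real flow would have already stopped services and unmounted /var.

```go
package upgrade

import (
	"context"
	"fmt"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

// runInstaller pulls the requested installer image and runs it to completion.
func runInstaller(ctx context.Context, ref string) error {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		return err
	}
	defer client.Close()

	// Placeholder namespace; the real one is an implementation detail.
	ctx = namespaces.WithNamespace(ctx, "system")

	image, err := client.Pull(ctx, ref, containerd.WithPullUnpack)
	if err != nil {
		return err
	}

	// The installer needs access to the owned block device, hence privileged.
	container, err := client.NewContainer(ctx, "upgrade",
		containerd.WithNewSnapshot("upgrade", image),
		containerd.WithNewSpec(oci.WithImageConfig(image), oci.WithPrivileged),
	)
	if err != nil {
		return err
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)

	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		return err
	}
	defer task.Delete(ctx)

	wait, err := task.Wait(ctx)
	if err != nil {
		return err
	}

	if err := task.Start(ctx); err != nil {
		return err
	}

	status := <-wait

	code, _, err := status.Result()
	if err != nil {
		return err
	}

	if code != 0 {
		return fmt.Errorf("installer exited with code %d", code)
	}

	return nil
}
```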
Automated Upgrades
All of the aforementioned improvements help in the higher goal of automated upgrades. Once the above is implemented, upgrades should be simple and self-contained to a machine. By running upgrades in containers, we can have a very simple operator implementation that can roll out an upgrade across a cluster by simply:
- checking for a new tag
- orchestrating cluster-wide rollout
The operator will initially coordinate upgrades only, but it will also provide a basis for a number of future automation opportunities.
Release Channels
We will maintain a service that allows for the operator to discover when new releases of Talos are made available. There will be three "channels" that a machine can subscribe to:
- alpha
- beta
- stable
The implementation details around this service and its interaction with the operator are not in the scope of this proposal.
Workflow
In order to upgrade nodes, the operator will subscribe to events for nodes. It will keep an in-memory list of all nodes.
The operator will periodically query the target registry for a list of containers within a time range. The time range should start where a previous query has ended. The initial query should use the currently installed image's timestamp to decide where to start.
In order to supply enough information to the operator, such that it can make an informed decision about using a particular container, we will make use of labels. With labels we can communicate to the operator that a container is:
- an official release
- within a given channel
- contains a version that is compatible with the current cluster
There may be more information we end up needing; the above is just an example.
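
As an example of how the operator might consume these labels, here is a small Go sketch; the label keys are purely illustrative and not defined anywhere in this proposal.

```go
package operator

// eligible reports whether an image's labels mark it as an upgrade candidate
// for this cluster; the label keys below are hypothetical.
func eligible(labels map[string]string, channel, clusterVersion string) bool {
	return labels["talos.dev/official"] == "true" &&
		labels["talos.dev/channel"] == channel &&
		labels["talos.dev/compatible-with"] == clusterVersion
}
```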
Once a container has been chosen, the operator will then create a schedule based on policies. The initial policies will perform an upgrade based on:
- x% of nodes at a given time
- availability of a new image in a given channel
- time of day
- machines per unit of time (e.g. one machine a day, two machines per hour, etc.)
- strategy:
  - In-place
  - Replace
A user should be able to apply more than one policy.
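
For concreteness, a policy could be modeled roughly as follows; the field names and types are assumptions for the sketch, not a committed API:

```go
package operator

import "time"

// Strategy selects how a node is upgraded.
type Strategy string

const (
	InPlace Strategy = "in-place"
	Replace Strategy = "replace"
)

// UpgradePolicy is a hypothetical encoding of the scheduling knobs listed above.
// When multiple policies are applied, the calculated schedule must satisfy all of them.
type UpgradePolicy struct {
	Channel        string        // e.g. "stable"
	MaxUnavailable float64       // fraction of nodes upgraded at a given time, e.g. 0.1
	WindowStart    time.Duration // start of the allowed time-of-day window, as an offset from midnight
	WindowEnd      time.Duration // end of the allowed time-of-day window
	Rate           time.Duration // minimum time between machine upgrades
	Strategy       Strategy
}
```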
Once the operator has decided on scheduling, it will use the upgrade RPC to perform upgrades according to the calculated upgrade schedule.
Rationale
One of the goals in Talos is to allow for machines to be thought of like containers in that they are ephemeral, reproducible, and predictable.
Since the kubelet can reproduce a machine's workload, it is not too crazy to think that we can facilitate upgrades by completely wiping a node, and then performing a fresh install as if the machine is new.
Compatibility
This change introduces backwards incompatible changes.
The way in which init will be run, and the way in which upgrades will be performed, are fundamentally different.
Implementation
- slim down the `rootfs` by removing Kubernetes tarballs (@andrewrynhard due by v0.2)
- move the `rootfs` from a tarball to `squashfs` (@andrewrynhard due by v0.2)
- split `init` into `init` and `machined` (@andrewrynhard due by v0.2)
- run the `rootfs` in memory (@andrewrynhard due by v0.2)
- use new design for installs and upgrades (@andrewrynhard due by v0.2)
- implement channel service (due by v0.3)
- implement the operator (@andrewrynhard due by v0.3)