These two required some additional attention and were split into a separate branch. Also fix a data race in the NodeAddressSpec.DeepCopy method.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Following Andrey's comments in #5472, this commit converts LinkRefresh and LinkStatus to typed.Resource by moving the Bump and Physical methods to the *Spec types.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
If `talosctl cluster create` is called with gen options together with the `--input-dir` flag, talosctl now prints an error message and aborts.
Fixes #2275
Signed-off-by: Philipp Sauter <philipp.sauter@siderolabs.com>
Adds an ADOPTERS markdown file to the repo so that users can show
they have adopted Talos Linux in their organization.
Signed-off-by: Tim Jones <tim.jones@siderolabs.com>
Refactor remaining resources into typed.Resource. Exceptions are:
- MachineConfig
- MachineType
- LinkRefresh
- LinkStatus
All of these contain additional methods and cannot simply be reworked into the new resource framework.
StaticPod and StaticPodStatus are also absent from this PR, because they result in e2e errors which are going to be resolved in the next PR.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
The links to the patch and script files were changed and not reflected
here. There was also a missing curl command in the first example of
downloading the patch.
Signed-off-by: Tames McTigue <tames@northwestern.edu>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
There were many discussions on creating native Talos providers for TF,
Pulumi, etc., but there's no documented idiomatic way to use our
machinery package to generate the config. This PR tries to fill this
gap.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
With the advent of generics, reimplement the pointer helpers and remove the github.com/AlekSi/pointer dependency.
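For illustration, a generics-based pointer helper could look roughly like the sketch below. The names `To` and `SafeDeref` are assumptions made for this example and are not necessarily the helpers this commit introduces.

```go
package main

import "fmt"

// To returns a pointer to a copy of v.
// Hypothetical name, chosen for illustration only.
func To[T any](v T) *T {
	return &v
}

// SafeDeref returns the value p points to, or the zero value of T when p is nil.
// Hypothetical name, chosen for illustration only.
func SafeDeref[T any](p *T) T {
	if p == nil {
		var zero T

		return zero
	}

	return *p
}

func main() {
	enabled := To(true)

	var missing *int

	fmt.Println(*enabled, SafeDeref(missing), SafeDeref(To("talos")))
}
```

A single generic function replaces the per-type helpers (one for bool, one for string, and so on) that a pre-generics library has to provide.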
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
As Talos v1.0.4 now supports kubelet with graceful shutdown disabled,
update the docs.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
The current code contains a data race: access to r.bytes in Bytes() is unguarded, and the method can be called from several goroutines. There is no need for it anyway, since WrapReadonly always gets a full slice. Refactor the code to reflect that.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
The problem is that these values need to be set to zero if the kubelet
feature gate is disabled, so we can't assume that we can override a zero
value with the proper config. We have to do an extra check on the
supplied configuration instead.
Also adds a KB article on disabling this feature gate.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Before this change, we didn't preserve the ordering of bonded interfaces, which caused problems in some scenarios. Fix this by remembering their position in the original config.
Fixes #5207.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
This code was written from a JSON point of view, but
when YAML is unmarshaled, we get more primitive Go types
as values, so accept all of them.
This was showing as an error when applying a machine config e.g. for
kubelet extraArgs like:
```
shutdownGracePeriod: 0
```
Changing this to string fixes the problem, but it's not the best UX.
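A minimal sketch of the idea, assuming the check is a type switch over decoded values. `isPrimitive` is a hypothetical helper written for this example, not the function touched by this commit.

```go
package main

import "fmt"

// isPrimitive reports whether v is a scalar value that a decoder may produce.
// encoding/json decodes every number into float64, while YAML decoders
// commonly produce int and other numeric types, so all of them are listed.
// Hypothetical helper, for illustration only.
func isPrimitive(v any) bool {
	switch v.(type) {
	case string, bool,
		int, int8, int16, int32, int64,
		uint, uint8, uint16, uint32, uint64,
		float32, float64:
		return true
	default:
		return false
	}
}

func main() {
	// 0 is an int here, which a float64-only check would reject.
	fmt.Println(isPrimitive(0), isPrimitive("30s"), isPrimitive([]any{1}))
}
```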
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
`/var/run` was mounted from `/run`, and the D-Bus socket was mounted to
the `/var/run/dbus/` path. When the container is stopped, the container
mounts are removed, but on the host side the mount propagates back, so
the D-Bus socket gets propagated back to the host `/run`, and on each
subsequent kubelet restart the process keeps adding more mount levels
exponentially. Eventually, on a kubelet restart, kernel resources are
exhausted and the node freezes.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
No functional changes.
Also bump cosi-runtime to pick up the fix for UnmarshalProto.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
When the registry CRI config gets updated, the contents of the file are
written to the `EtcFileSpec` resource, which gets rendered to disk, and
the `EtcFileStatus` resource is updated when the config is ready.
CRI config parts are merged from the contents of `*.part` files, which
come from system extensions, and from the dynamic registry config, which
is written via the `EtcFileSpec` resource. As the controller was
incorrectly triggered on the `EtcFileSpec` resource while reading files
from disk, it might have read stale contents of a CRI config part (one
which hasn't been fully rendered to disk yet) and so missed the latest
content of the CRI config.
With the fix, the controller is triggered on the `EtcFileStatus` update,
that is, once the file has been rendered to disk.
The symptom of the bug is the empty CRI registry config like:
```shell
talosctl read /etc/cri/conf.d/cri.toml
## /etc/cri/conf.d/00-base.part
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
discard_unpacked_layers = true
## /etc/cri/conf.d/01-registries.part
```
Notice that the `01-registries.part` is empty.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Init nodes were deprecated in v1.0 so it makes sense
to remove the documentation about them and consign
them to the past!
Signed-off-by: Tim Jones <tim.jones@siderolabs.com>
Increase values:
- fs.aio-max-nr to 1048576 (for Ceph|Veritas|other storages)
- fs.inotify.max_user_instances to 8192 (since the usual 512 is too small for today's needs)
There is no need to adjust fs.inotify.max_user_watches since it's set dynamically during startup by the kernel.
Closes #5175
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
This fixes an issue where `talosctl upgrade-k8s` fails with an unhelpful
message if the version is specified as `v1.23.5` instead of `1.23.5`.
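One way to accept both spellings is to normalize the input before it is used. The sketch below assumes a simple prefix strip. `normalizeVersion` is a hypothetical helper, and the actual fix may handle this differently (for example, through a semver parser).

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeVersion strips surrounding whitespace and an optional leading "v",
// so that "v1.23.5" and "1.23.5" are treated the same.
// Hypothetical helper, for illustration only.
func normalizeVersion(v string) string {
	return strings.TrimPrefix(strings.TrimSpace(v), "v")
}

func main() {
	fmt.Println(normalizeVersion("v1.23.5") == normalizeVersion("1.23.5"))
}
```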
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
As QEMU clusters are used for testing, use unsafe cache options to
reduce the number of fsyncs going to the host block device.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Talos historically relied on the Kubernetes `Endpoints` resource (which
specifies `kube-apiserver` endpoints) to find other control plane
members of the cluster in order to connect to their `etcd` nodes (when
the node-local etcd instance is not up, for example). This method works
well, but it relies on the Kubernetes endpoint being up. If the
Kubernetes API is down for whatever reason, or if the load balancer
malfunctions, endpoints are not available and join/leave operations
don't work.
This PR replaces the endpoints lookup to use the `Endpoints` COSI
resource which is filled in using two methods:
* from the discovery data (if discovery is enabled, which is the default)
* from the Kubernetes `Endpoints` resource
If discovery is disabled (or not available), this change does almost
nothing: Kubernetes is still used to discover control plane endpoints,
but as the data persists in memory, the cached copy is used to connect
to the endpoints even if the Kubernetes control plane endpoint goes down.
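The lookup described above, with two sources feeding one in-memory copy, might be sketched as follows. `endpointCache` and its methods are hypothetical names for this example. The real implementation is built on COSI resources and controllers, and this sketch is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sort"
)

// endpointCache keeps the last known endpoints per source.
// Hypothetical type, for illustration only.
type endpointCache struct {
	bySource map[string][]string
}

func newEndpointCache() *endpointCache {
	return &endpointCache{bySource: map[string][]string{}}
}

// update records the endpoints reported by a source. It is meant to be called
// only after a successful lookup, so a failed lookup leaves the cached copy
// for that source in place.
func (c *endpointCache) update(source string, addrs []string) {
	c.bySource[source] = append([]string(nil), addrs...)
}

// all returns the sorted, de-duplicated union of endpoints from every source.
func (c *endpointCache) all() []string {
	seen := map[string]struct{}{}

	for _, addrs := range c.bySource {
		for _, addr := range addrs {
			seen[addr] = struct{}{}
		}
	}

	out := make([]string, 0, len(seen))
	for addr := range seen {
		out = append(out, addr)
	}

	sort.Strings(out)

	return out
}

func main() {
	cache := newEndpointCache()
	cache.update("discovery", []string{"10.0.0.2", "10.0.0.3"})
	cache.update("kubernetes", []string{"10.0.0.3", "10.0.0.4"})

	fmt.Println(cache.all())
}
```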
If the discovery is enabled, Talos can join the etcd cluster immediately
on boot without waiting for Kubernetes to be up on the bootstrap node
which means that Talos cluster initial bootstrap runs in parallel on all
control plane nodes, while previously nodes were waiting for the first
node to finish bootstrap enough to fill in the endpoints data.
As the `etcd` communication is protected with mutual TLS anyway,
there's no risk even if the discovery data is stale or poisoned, as etcd
operations would fail on TLS mismatch.
Most of the changes in this PR actually enable populating Talos
`Endpoints` resource based on the `Kubernetes` `endpoints` resource
using the watch API.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
The new mode allows changing the config for a limited period of time,
which makes it possible to try a configuration and have it rolled back
automatically if, for example, it doesn't work.
The mode can only be used with changes that can be applied without a
reboot.
In this mode the configuration is not written to disk; it is only
changed in memory.
The `--timeout` parameter can be used to customize the rollback delay.
The default timeout is 1 minute.
Any subsequent configuration change will abort try mode, and the last
applied configuration will be used.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Since Talos moved to the new registry redirect CRI plugin format, star
redirects are no longer supported in the CRI plugin (see
https://github.com/containerd/containerd/blob/main/docs/hosts.md).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
For most Talos services the `post` stage does nothing, so this was never
properly noticed. For extension services, the pre/post stages perform
mounting and unmounting of the overlayfs, so if the post stage doesn't
run (when the runner can't be created), the service won't start the
next time it is started, as the post stage never ran.
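The intended behavior can be sketched as follows, assuming the post stage is deferred as soon as the pre stage succeeds. The `service` type, `run` function, and `errNoRunner` value are hypothetical stand-ins for this example, not the actual Talos service runner.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoRunner simulates a failure to create the runner.
var errNoRunner = errors.New("runner cannot be created")

// service is a hypothetical stand-in for a service with pre/post stages;
// mounted models the overlayfs mount performed by an extension service.
type service struct {
	mounted bool
}

func (s *service) pre() error {
	if s.mounted {
		return errors.New("overlay already mounted")
	}

	s.mounted = true

	return nil
}

func (s *service) post() {
	s.mounted = false
}

// run executes the pre stage, creates and runs the runner, and always runs
// the post stage once pre has succeeded, even if the runner can't be created.
func run(s *service, newRunner func() (func() error, error)) error {
	if err := s.pre(); err != nil {
		return fmt.Errorf("pre stage: %w", err)
	}

	defer s.post()

	runner, err := newRunner()
	if err != nil {
		return fmt.Errorf("creating runner: %w", err)
	}

	return runner()
}

func main() {
	svc := &service{}
	failing := func() (func() error, error) { return nil, errNoRunner }
	working := func() (func() error, error) { return func() error { return nil }, nil }

	fmt.Println(run(svc, failing))
	fmt.Println(run(svc, working))
}
```

Without the deferred `post` call, the first failed start would leave `mounted` set, and the second start would fail in the pre stage.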
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This bug showed up with extension services: say we have a service
`ext-foo` which depends on service `cri`.
Service `ext-foo` will be started correctly only once `cri` is up.
But we should also stop `ext-foo` before `cri` is stopped, as otherwise
the dependency chain is broken. This PR fixes exactly that: before `cri`
is stopped, anything which depends on it is stopped first. We should
also stop anything which depends on `ext-foo` (transitive
dependency).
In practical terms, we use the dependency on `cri` in extension services
to correctly stop/start them along with the `/var` filesystem
mount/unmount.
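As an illustration of the ordering requirement, the sketch below computes which services have to be stopped together with a given service, with dependents listed first. `stopOrder` and the service names other than `cri` and `ext-foo` are assumptions made for this example. The actual service runner may stop services concurrently rather than from a precomputed list.

```go
package main

import "fmt"

// stopOrder returns target and every service that depends on it, directly or
// transitively, ordered so that each service appears before the services it
// depends on. dependsOn maps a service ID to the IDs it depends on.
// Hypothetical helper, for illustration only.
func stopOrder(dependsOn map[string][]string, target string) []string {
	// Invert the graph: service ID -> services that depend on it.
	dependents := map[string][]string{}

	for svc, deps := range dependsOn {
		for _, dep := range deps {
			dependents[dep] = append(dependents[dep], svc)
		}
	}

	visited := map[string]bool{}

	var order []string

	var visit func(id string)

	visit = func(id string) {
		if visited[id] {
			return
		}

		visited[id] = true

		for _, d := range dependents[id] {
			visit(d)
		}

		// Post-order: id is appended only after everything depending on it.
		order = append(order, id)
	}

	visit(target)

	return order
}

func main() {
	dependsOn := map[string][]string{
		"ext-foo": {"cri"},
		"ext-bar": {"ext-foo"},
		"apid":    {},
	}

	fmt.Println(stopOrder(dependsOn, "cri"))
}
```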
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Make the latest-version banner sticky and
more noticeable, and ensure the link to the
latest version links to the current document
if possible.
Signed-off-by: Tim Jones <tim.jones@siderolabs.com>