Increase values:
- fs.aio-max-nr to 1048576 (for Ceph, Veritas, and other storage systems)
- fs.inotify.max_user_instances to 8192 (since the usual 512 is too small for today's needs)
There is no need to adjust fs.inotify.max_user_watches since the kernel sets it dynamically during startup.
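For anyone checking the new values on a node, the sketch below shows how
the dotted sysctl keys map to files under `/proc/sys`. The helper name
`sysctlPath` is made up for illustration and is not part of Talos.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// sysctlPath maps a dotted sysctl key to its file under /proc/sys.
// Hypothetical helper for illustration; it assumes no path component
// of the key contains a dot itself.
func sysctlPath(key string) string {
	return path.Join("/proc/sys", strings.ReplaceAll(key, ".", "/"))
}

func main() {
	// Keys and values are the ones named in this commit message.
	settings := []struct{ key, value string }{
		{"fs.aio-max-nr", "1048576"},
		{"fs.inotify.max_user_instances", "8192"},
	}

	for _, s := range settings {
		fmt.Printf("%s = %s (%s)\n", s.key, s.value, sysctlPath(s.key))
	}
}
```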
Closes #5175
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
This fixes an issue where `talosctl upgrade-k8s` fails with an unhelpful
message if the version is specified as `v1.23.5` instead of `1.23.5`.
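One way to accept both spellings is to strip the leading `v` before the
version is used. This is a minimal sketch of that idea, not the actual
`talosctl` change; `normalizeVersion` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeVersion drops a single leading "v" so that "v1.23.5" and
// "1.23.5" are treated the same. Hypothetical helper for illustration.
func normalizeVersion(version string) string {
	return strings.TrimPrefix(version, "v")
}

func main() {
	fmt.Println(normalizeVersion("v1.23.5")) // 1.23.5
	fmt.Println(normalizeVersion("1.23.5"))  // 1.23.5
}
```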
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
As QEMU clusters are used for testing, use unsafe cache options to
reduce the number of fsyncs going to the host block device.
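As an illustration, a QEMU `-drive` argument with `cache=unsafe` could be
built as below. The function name, the image path, and the other drive
options are assumptions for the sketch, not the exact arguments Talos
passes.

```go
package main

import "fmt"

// driveArgs builds a QEMU -drive argument with cache=unsafe, which tells
// QEMU to ignore guest flush requests. This is only suitable for
// throwaway test clusters. Hypothetical helper for illustration.
func driveArgs(imagePath string) []string {
	return []string{
		"-drive",
		fmt.Sprintf("format=raw,if=virtio,file=%s,cache=unsafe", imagePath),
	}
}

func main() {
	// The image path is a placeholder.
	fmt.Println(driveArgs("/tmp/test-node.raw"))
}
```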
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Talos historically relied on the Kubernetes `Endpoints` resource (which
specifies `kube-apiserver` endpoints) to find other control plane
members of the cluster and connect to their `etcd` nodes (when the
node-local etcd instance is not up, for example). This method works
well, but it relies on the Kubernetes endpoint being up. If the
Kubernetes API is down for whatever reason, or if the load balancer
malfunctions, endpoints are not available and join/leave operations
don't work.
This PR changes the endpoints lookup to use the `Endpoints` COSI
resource, which is filled in using two methods:
* from the discovery data (if discovery is enabled, which is the default)
* from the Kubernetes `Endpoints` resource
If discovery is disabled (or not available), this change does almost
nothing: Kubernetes is still used to discover control plane endpoints,
but as the data persists in memory, the cached copy is used to connect
to the endpoint even if the Kubernetes control plane endpoint goes down.
If discovery is enabled, Talos can join the etcd cluster immediately
on boot without waiting for Kubernetes to be up on the bootstrap node,
which means that the initial bootstrap of a Talos cluster runs in
parallel on all control plane nodes, whereas previously nodes waited for
the first node to bootstrap far enough to fill in the endpoints data.
As `etcd` communication is protected with mutual TLS anyway, there's no
risk even if the discovery data is stale or poisoned, as etcd operations
would fail on TLS mismatch.
Most of the changes in this PR actually enable populating the Talos
`Endpoints` resource based on the Kubernetes `Endpoints` resource
using the watch API.
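The two-source lookup with an in-memory fallback can be sketched roughly
as follows. All names here (`mergeEndpoints`, `endpointCache`) and the
addresses are hypothetical; the real implementation works with COSI
resources and controllers rather than plain slices.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeEndpoints returns the sorted, de-duplicated union of all sources.
func mergeEndpoints(sources ...[]string) []string {
	seen := map[string]struct{}{}
	merged := []string{}

	for _, source := range sources {
		for _, endpoint := range source {
			if _, ok := seen[endpoint]; ok {
				continue
			}

			seen[endpoint] = struct{}{}
			merged = append(merged, endpoint)
		}
	}

	sort.Strings(merged)

	return merged
}

// endpointCache remembers the last non-empty endpoint list, so lookups
// still work when both sources are temporarily unavailable.
type endpointCache struct {
	last []string
}

// Update merges the discovery and Kubernetes data, and falls back to the
// cached copy if both are empty.
func (c *endpointCache) Update(discovery, kubernetes []string) []string {
	if merged := mergeEndpoints(discovery, kubernetes); len(merged) > 0 {
		c.last = merged
	}

	return c.last
}

func main() {
	cache := &endpointCache{}

	fmt.Println(cache.Update([]string{"10.0.0.2"}, []string{"10.0.0.1", "10.0.0.2"}))
	// Both sources are down: the cached copy is returned.
	fmt.Println(cache.Update(nil, nil))
}
```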
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
The new mode allows changing the config for a limited period of time,
which makes it possible to try a configuration and have it rolled back
automatically if, for example, it doesn't work.
The mode can only be used with changes that can be applied without a
reboot.
In this mode the configuration is not written to disk; it is only
changed in memory.
`--timeout` parameter can be used to customize the rollback delay.
The default timeout is 1 minute.
Any subsequent configuration change will abort try mode and the last
applied configuration will be used.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Since Talos moved to the new registry redirect CRI plugin format, star
(`*`) redirects are no longer supported in the CRI plugin (see
https://github.com/containerd/containerd/blob/main/docs/hosts.md).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
For most Talos services the `post` stage does nothing, so this was never
properly noticed. For extension services, the pre/post stages perform
mounting and unmounting of the overlayfs, so if the post stage doesn't
run (if the runner can't be created), the service won't start the next
time, as the post stage never ran.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This bug showed up with extension services: say we have a service
`ext-foo` which depends on service `cri`.
Service `ext-foo` will be started correctly only once `cri` is up.
But we should also stop `ext-foo` before `cri` is stopped, as otherwise
the dependency chain is broken. This PR fixes exactly that: once `cri`
is stopped, anything which depends on it should be stopped. We should
also stop anything which depends on `ext-foo` (transitive
dependency).
In practical terms, we use the dependency on `cri` in extension services
to correctly stop/start them together with the `/var` filesystem
mount/unmount.
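The stop ordering described here amounts to a post-order walk over the
reversed dependency graph. The sketch below shows the idea with a plain
map; `stopOrder` and the `ext-bar` service are hypothetical, and the
actual service runner in Talos is structured differently.

```go
package main

import (
	"fmt"
	"sort"
)

// stopOrder returns the services to stop when target is stopped:
// every transitive dependent first, then target itself.
// deps maps a service to the services it depends on.
func stopOrder(deps map[string][]string, target string) []string {
	// Reverse the edges: service -> services which depend on it.
	dependents := map[string][]string{}

	for service, dependencies := range deps {
		for _, dependency := range dependencies {
			dependents[dependency] = append(dependents[dependency], service)
		}
	}

	// Sort for a deterministic result.
	for service := range dependents {
		sort.Strings(dependents[service])
	}

	var order []string

	visited := map[string]bool{}

	var visit func(service string)

	visit = func(service string) {
		if visited[service] {
			return
		}

		visited[service] = true

		// Stop everything depending on this service before the service.
		for _, dependent := range dependents[service] {
			visit(dependent)
		}

		order = append(order, service)
	}

	visit(target)

	return order
}

func main() {
	deps := map[string][]string{
		"ext-foo": {"cri"},
		"ext-bar": {"ext-foo"}, // hypothetical transitive dependent
	}

	fmt.Println(stopOrder(deps, "cri")) // [ext-bar ext-foo cri]
}
```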
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Make the latest-version banner sticky and
more noticeable, and ensure the link to the
latest version links to the current document
if possible.
Signed-off-by: Tim Jones <tim.jones@siderolabs.com>
Not sure how and when it got broken, but we were looking for mounts of
the block device (like `/dev/vda`), while the actual mount info contains
the partition device (like `/dev/vda6`).
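A check that matches a mount source against a block device, including
its partitions, could look like the sketch below. The naming rule used
(device name followed by digits, or by `p` and digits) is a
simplification, and `isOnDevice` is a hypothetical helper, not the code
from this fix.

```go
package main

import (
	"fmt"
	"strings"
)

// isOnDevice reports whether a mount source is the block device itself
// or one of its partitions, e.g. /dev/vda6 for /dev/vda or
// /dev/nvme0n1p2 for /dev/nvme0n1. Simplified naming rule for
// illustration only.
func isOnDevice(device, source string) bool {
	if !strings.HasPrefix(source, device) {
		return false
	}

	suffix := strings.TrimPrefix(source, device)
	if suffix == "" {
		return true // the whole device is mounted
	}

	// Partitions are named <device><N> or <device>p<N>.
	number := strings.TrimPrefix(suffix, "p")
	if number == "" {
		return false
	}

	for _, r := range number {
		if r < '0' || r > '9' {
			return false
		}
	}

	return true
}

func main() {
	fmt.Println(isOnDevice("/dev/vda", "/dev/vda6"))  // true
	fmt.Println(isOnDevice("/dev/vda", "/dev/vdb1"))  // false
	fmt.Println(isOnDevice("/dev/vda", "/dev/vdaa1")) // false
}
```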
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Having polymorphic (spec type depends on ID) resources is not a good
idea, and it's not compatible with protobuf encoding.
Introduce new resources for each polymorphic sub-spec, using the new
Go 1.18 generic `typed.Resource` to reduce the boilerplate code.
(This still needs proper deepcopy-gen, but I'm skipping it for now, as
K8sControlPlane also had a broken deep copy.)
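To show how generics remove the per-type boilerplate, here is a
standalone sketch of a resource wrapper parameterized by its spec type.
It does not reproduce the real `typed.Resource` API; `Resource`,
`NewResource`, the spec types, and the image names are all assumptions
made for the example.

```go
package main

import "fmt"

// Resource is a generic wrapper: one implementation serves every spec
// type, and the spec type is fixed at compile time instead of depending
// on the resource ID. Hypothetical type for illustration.
type Resource[T any] struct {
	id   string
	spec T
}

// NewResource creates a resource with the given ID and spec.
func NewResource[T any](id string, spec T) *Resource[T] {
	return &Resource[T]{id: id, spec: spec}
}

// ID returns the resource ID.
func (r *Resource[T]) ID() string { return r.id }

// TypedSpec returns a pointer to the spec, with no type assertion needed.
func (r *Resource[T]) TypedSpec() *T { return &r.spec }

// Two hypothetical sub-specs which previously would have shared one
// polymorphic resource type.
type APIServerSpec struct{ Image string }

type SchedulerSpec struct{ Image string }

func main() {
	apiServer := NewResource("kube-apiserver", APIServerSpec{Image: "example/apiserver:dev"})
	scheduler := NewResource("kube-scheduler", SchedulerSpec{Image: "example/scheduler:dev"})

	fmt.Println(apiServer.ID(), apiServer.TypedSpec().Image)
	fmt.Println(scheduler.ID(), scheduler.TypedSpec().Image)
}
```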
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
With the update of the client library to 3.5.3, the etcd library started
using the logger, so passing `nil` is no longer fine.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Add a note on how machine configuration can be retrieved
from the node, after e.g. interactive setup.
Signed-off-by: Tim Jones <tim.jones@siderolabs.com>
Many users have been using the VIP functionality to configure
endpoints in the Talos config. Update the documentation to clarify the
possible issues with that option and to state that it should be avoided.
Signed-off-by: Tim Jones <tim.jones@siderolabs.com>
This "fixes" messages like:
```
xfs filesystem being mounted at /var supports timestamps until 2038 (0x7fffffff)
```
We should support Talos beyond 2038, even if we switch to a different
filesystem type by 2038 :)
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes a typo in the Extension Services document alias
which serves as the redirect from the old location.
Signed-off-by: Tim Jones <tim.jones@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Enable the Rpi4 PoE hat fan control by pulling in the overlay
compatible with the upstream kernel driver.
Ref: https://github.com/siderolabs/pkgs/pull/450
Signed-off-by: Noel Georgi <git@frezbo.dev>
Dry run prints out the config diff and the selected application mode
without changing the configuration.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
The containerd CRI plugin was merged into the main repo, but we were
using the old import path, so the constants coming from the module were
outdated. This fixes the image version for the pause container.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
See https://github.com/etcd-io/etcd/releases/tag/v3.5.3
This release should contain a fix for the data consistency issue that
occurs when etcd is killed under high load.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Make improvements to help documentation discoverability and categorization.
Ensure all content pages have a description.
Ensure all links are replaced with Hugo shortcodes.
Ensure all moved pages have an alias so redirects work.
Signed-off-by: Tim Jones <tim.jones@siderolabs.com>
Bump tools and pkgs to get kernel 5.15.33.
5.15.33 has a bunch of fixes for several CVEs;
it was too hard to track and reference each of them.
Signed-off-by: Noel Georgi <git@frezbo.dev>
Increase the go.mod version from 1.17 to 1.18 in all projects. Update the
Makefile to use the latest tooling. Fix golangci by disabling nolintlint
for now.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>