This implements a simple way to upgrade Talos node running in
maintenance mode (only if Talos is installed, i.e. if `STATE` and
`EPHEMERAL` partitions are wiped).
Upgrade is only available over SideroLink for security reasons.
Upgrade in maintenance mode doesn't support any options, and it works
without machine configuration, so proxy environment variables are not
available, registry mirrors can't be used, and extensions are not
installed.
Fixes#6224
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Add resource `AuditPolicyConfigs.kubernetes.talos.dev`.
It can be changed through machine config `cluster.apiServer.auditPolicy`
Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This commit adds support for building Talos for the
Compute Module 4 and other generic Raspberry Pi
hardware.
Fixes: #6273
Signed-off-by: Kris Reeves <kris@pressbuttonllc.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
See #6333
Using permanent address fixes issues with mis-matching the links after
they got bonded.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Permanent address is only available for physical links, and it might be
different from the 'hardware address': when bonding, 'hardware address'
gets overridden from the bond master, while 'permanent address' still
shows MAC of the interface.
This part of the fix for incorrect bonding issue on Equinix Metal.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#6302
This allows Talos to proceed if some manifest is invalid (or malformed),
while aborts the loop on connection errors (when `kube-apiserver` is not
ready).
This fixes a problem when a single resource might stop all manifests
from being applied and preventing a cluster bootstrap.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This fixes an issue introduced in #5879: options should be set same way
for both `init` and `controlplane` cases.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Don't allow worker nodes to act as apid routers:
* don't try to issue client certificate for apid on worker nodes
* if worker nodes receives incoming connections with `--nodes` set to
one of the local addresses of the nodd, it routes the request to
itself without proxying
Second point allows using `talosctl -e worker -n worker` to connect
directly to the worker if the connection from the control plane is not
available for some reason.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Talos worker nodes use `trustd` API on control plane nodes to issue
certificates for `apid` service. Access to the API is protected with the
Talos join token specified in the machine configuration.
There was no validation on what kind of request is requested, so
`trustd` could issue a certificate which is valid for client
authentication with any set of Talos API RBAC roles, including
`os:admin` role allowing full access to the Talos API on control plane
nodes.
See: GHSA-7hgc-php5-77qq
CVE: CVE-2022-36103
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
There is no need to use `assert.Implements` since we can express this check during compile time. Go will eliminate `_` variables and any accompanying allocations during dead-code elimination phase.
This commit also removes:
tok := new(v1alpha1.ClusterConfig).Token()
assert.Implements(t, (*config.Token)(nil), tok)
Code since it doesn't check anything - v1alpha1.ClusterConfig.Token() already returns a config.Token interface.
Also - run `go work sync` and `go mod tidy`.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
The bug was triggered by `containerd` crash (restart), in this case
runner receives an error as if the process exited.
Runner tries to restart the container, but as the container is still
running, attempt to delete the task would fail.
With this change Talos always tries to kill the running container and
waits for the container to terminate.
The error message when the bug was triggered looks like:
```
service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to clean up task "kubelet": task must be stopped before deletion: running: failed precondition
```
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#6156
Now access from Talos itself goes with `talos:admin` username in the
Kubernetes API server audit log, while access with admin kubeconfig goes
with `admin` username as before.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This is enabled via a machine config feature/version contract, as
`talosconfig` certificate generated previously didn't have proper key
usage set, so we need to keep backwards compatibility on upgrades.
New v1.3+ clusters will include this check.
This check prevents even potential mis-use of server certificates as a
client certificate.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
That was a mistake to use only 'routed' addresses, as they e.g. do not
include SideroLink.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
As APIs were not listed explicitly, access with `os:reader` was denied
by default, while it should have been checked down in the access filter.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#6210
Refactored the code a bit to support excludes and default configuration.
Etcd should never advertise VIPs, as VIPs are managed by etcd elections.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This commit adds initial support for the Nano Pi
R4S from Friendlyelec. This device is a networking focused
rk3399 based SBC with two 1G ethernet interfaces,
making it perfect for edge or SOHO deployments.
Signed-off-by: Marvin Drees <marvin.drees@9elements.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
I hit this bug when one the API calls got hanging, and submitting the
machine config with `apply-config` never takes the node out of
maintenance mode, as `.GracefulStop()` may hang forever waiting for all
the calls to finish.
This way we always abort at some timeout and stop the server forcefully.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This fixes a case when a node is rebooted, and connection via another
endpoint apid "caches" a connection error even when the node is up.
E.g. this command:
```
talosctl -e IP1 -n IP2 version
```
If node `IP2` is rebooted, `apid` at `IP1` might enter long backoff loop
and return an error still when `IP2` is actually up.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#6110
I somehow missed the fact that etcd certs were not made fully reactive
to node address changes (I wrongly assume it was already the fact).
This PR refactors etcd certificate generation process to be
resource-based and introduces unit-tests for the controller.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#6119
With new stable default hostname feature, any default hostname is
disabled until the machine config is available.
Talos enters maintenance mode when the default config source is empty,
so it doesn't have any machine config available at the moment
maintenance service is started.
Hostname might be set via different sources, e.g. kernel args or via
DHCP before the machine config is available, but if all these sources
are not available, hostname won't be set at all.
This stops waiting for the hostname, and skips setting any DNS names in
the maintenance mode certificate SANs if the hostname is not available.
Also adds a regression test via new `--disable-dhcp-hostname` flag to
`talosctl cluster create`.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
When converting to base36 a 256-bit number there's a bias in the
first character of the base36 encoding, as 256-bit number never fits
perfectly base 36 number.
To give an example, when converting 4-digit binary number to decimal,
the first digit of the decimal number will be [0..3], while the
second digit won't be biased:
```
0000 -> 00
0001 -> 01
...
0111 -> 15
1000 -> 16
...
1111 -> 31
```
Same issue happens when going from e.g. base16 to base36.
Stable hostnames were biased towards having a digit as the first
character.
The fix is to skip the first character of the base36 representation, and
also we don't need to convert all 256 bits to base36, if we use only 6
characters, we can save some CPU resources by taking only 8 bytes
instead of full 32 bytes.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This allows to update the member information (for the current node) with
new advertised peer URLs as the config changes.
E.g. if the node IP changes, this will update the peer URLs for the
member accordingly.
At the same time any member update requires quorum, so changing IPs can
only be done on node-by-node basis.
If there are no changes to advertised peer URLs, controller does
nothing.
Talos node might still need a reboot to update the listen addresses, as
these are not handled automatically for now.
Fixes#6080
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Looks like it returns nil if it doesn't exist and the code doesn't
handle it properly.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Overview: deprecate existing Talos resource API, and introduce new COSI
API.
Consequences:
* COSI API can only go via one-2-one proxy (`client.WithNode`)
* client-side API access is way easier with `state.State` wrappers
* lots of small changes on the client side to use new APIs
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>