1266 Commits

Author SHA1 Message Date
Andrey Smirnov
139c62d762
feat: allow upgrades in maintenance mode (only over SideroLink)
This implements a simple way to upgrade a Talos node running in
maintenance mode (only if Talos is already installed, e.g. after the
`STATE` and `EPHEMERAL` partitions were wiped).

Upgrade is only available over SideroLink for security reasons.

Upgrade in maintenance mode doesn't support any options, and it works
without machine configuration, so proxy environment variables are not
available, registry mirrors can't be used, and extensions are not
installed.

Fixes #6224

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-30 21:16:15 +04:00
Noel Georgi
48dee48057
feat: support mtu for routes
Support setting MTU for routes.

Fixes: #6324

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-09-30 16:38:22 +05:30
Serge Logvinov
18c377a4d1
feat: customize audit policy
Add resource `AuditPolicyConfigs.kubernetes.talos.dev`.
It can be changed through machine config `cluster.apiServer.auditPolicy`

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-28 13:46:44 +04:00
Noel Georgi
23c9ea46bb
fix: raspberry pi install
Fix raspberry pi install.

Some fixes were missed from #6388

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-09-28 01:09:28 +05:30
Noel Georgi
6bd3cca1a8
chore: generic raspberry pi images
Use generic Raspberry Pi images. Deprecate the RPi4 specific image.

Ref: https://github.com/siderolabs/pkgs/pull/596

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-09-27 16:39:12 +05:30
Kris Reeves
a0151aa13e
feat: add generic rpi u-boot support
This commit adds support for building Talos for the
Compute Module 4 and other generic Raspberry Pi
hardware.

Fixes: #6273

Signed-off-by: Kris Reeves <kris@pressbuttonllc.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-09-26 21:04:07 +05:30
Andrey Smirnov
30f851d093
chore: bump dependencies
go-mod-outdated

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-26 18:37:38 +04:00
Andrey Smirnov
8b2235c3b6
fix: lookup Equinix Metal bond slaves using 'permanent addr'
See #6333

Using permanent address fixes issues with mis-matching the links after
they got bonded.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-26 18:10:39 +04:00
Andrey Smirnov
0b2767c164
feat: implement 'permanent addr' in link statuses
Permanent address is only available for physical links, and it might be
different from the 'hardware address': when bonding, 'hardware address'
gets overridden from the bond master, while 'permanent address' still
shows MAC of the interface.

This is part of the fix for the incorrect bonding issue on Equinix Metal.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-26 14:45:46 +04:00
Dmitriy Matrenichev
fc48849d00
chore: move maps/slices/ordered to gen module
Use github.com/siderolabs/gen

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-09-21 20:22:43 +03:00
Andrey Smirnov
8b09bd4b04
feat: update Kubernetes to v1.26.0-alpha.1
Talos 1.3.0 will ship with Kubernetes 1.26.0.

See https://github.com/kubernetes/kubernetes/releases/tag/v1.26.0-alpha.1

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-21 18:42:31 +04:00
Noel Georgi
357b770cb5
fix: cryptsetup delete slot
Fix cryptsetup delete slot.

Fixes: #6298

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-09-21 16:37:54 +05:30
Andrey Smirnov
7111288393
fix: continue applying bootstrap manifests on some errors
Fixes #6302

This allows Talos to proceed if some manifest is invalid (or malformed),
while still aborting the loop on connection errors (when `kube-apiserver`
is not ready).

This fixes a problem where a single resource could stop all manifests
from being applied and prevent a cluster bootstrap.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-20 22:27:17 +04:00
Andrey Smirnov
472590aa82
chore: return InvalidArgument on invalid config in maintenance mode
Follow-up fix for #6258

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-15 21:46:48 +04:00
Andrey Smirnov
e5cabd42cc
feat: enable etcd consistency hashcheck
This will be only enabled for Talos v1.3.x.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-15 21:03:40 +04:00
Andrey Smirnov
015535d905
fix: update discovery client with the redirect fix
See https://github.com/siderolabs/discovery-client/pull/4

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-15 20:32:33 +04:00
Andrey Smirnov
94b088f02f
fix: set etcd options consistently
This fixes an issue introduced in #5879: options should be set the same
way for both `init` and `controlplane` cases.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-14 22:56:26 +04:00
Andrey Smirnov
7b270ff33d
test: fix api controller test
Fixing the test to match the implementation.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-13 15:26:32 +04:00
Andrey Smirnov
2dadcd6695
fix: stop worker nodes from acting as apid routers
Don't allow worker nodes to act as apid routers:

* don't try to issue client certificate for apid on worker nodes
* if a worker node receives an incoming connection with `--nodes` set to
  one of the local addresses of the node, it routes the request to
  itself without proxying

The second point allows using `talosctl -e worker -n worker` to connect
directly to the worker if the connection from the control plane is not
available for some reason.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-13 15:07:31 +04:00
Andrey Smirnov
9eaf33f3f2
fix: never sign client certificate requests in trustd
Talos worker nodes use `trustd` API on control plane nodes to issue
certificates for `apid` service. Access to the API is protected with the
Talos join token specified in the machine configuration.

There was no validation of what kind of certificate was requested, so
`trustd` could issue a certificate which is valid for client
authentication with any set of Talos API RBAC roles, including
`os:admin` role allowing full access to the Talos API on control plane
nodes.

See: GHSA-7hgc-php5-77qq
CVE: CVE-2022-36103

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-13 15:06:09 +04:00
Noel Georgi
4367491247
feat: environment vars for extension service
This allows setting environment variables for the extension service.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-09-13 14:06:55 +05:30
Andrey Smirnov
0c0cb671ea
chore: mark machine configuration validation failure as InvalidArgument
This makes it easier to distinguish between retriable and fatal
failures.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-12 22:04:54 +04:00
Dmitriy Matrenichev
12827b861c
chore: move "implements" checks to compile time
There is no need to use `assert.Implements` since we can express this check at compile time. Go will eliminate `_` variables and any accompanying allocations during the dead-code elimination phase.

This commit also removes the following code:

    tok := new(v1alpha1.ClusterConfig).Token()
    assert.Implements(t, (*config.Token)(nil), tok)

It doesn't check anything, since `v1alpha1.ClusterConfig.Token()` already returns a `config.Token` interface.

Also - run `go work sync` and `go mod tidy`.
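
For reference, a compile-time "implements" check of the kind this commit describes usually looks like the sketch below. `Token` and `staticToken` are hypothetical stand-ins for the example, not the real Talos types.

```go
package main

// Token is a hypothetical interface used only for this illustration.
type Token interface {
	ID() string
}

// staticToken is a hypothetical implementation of Token.
type staticToken struct{ id string }

// ID returns the stored identifier.
func (t *staticToken) ID() string { return t.id }

// Compile-time assertion: the build fails if *staticToken ever stops
// implementing Token. The blank identifier means nothing is kept at runtime.
var _ Token = (*staticToken)(nil)
```

If a method is removed or its signature changes, the `var _` line produces a compiler error instead of a failing test.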

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-09-12 16:57:24 +03:00
Andrey Smirnov
3a67c42cbf
fix: kill the task processes when cleaning up stale task
The bug was triggered by a `containerd` crash (restart): in this case
the runner receives an error as if the process had exited.
The runner tries to restart the container, but as the container is still
running, the attempt to delete the task fails.

With this change Talos always tries to kill the running container and
waits for the container to terminate.

The error message when the bug was triggered looks like:

```
service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to clean up task "kubelet": task must be stopped before deletion: running: failed precondition
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-12 17:05:13 +04:00
Andrey Smirnov
6882725157
fix: use different username for Talos Kubernetes API access
Fixes #6156

Now access from Talos itself goes with `talos:admin` username in the
Kubernetes API server audit log, while access with admin kubeconfig goes
with `admin` username as before.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-09 19:30:36 +04:00
Andrey Smirnov
161a52a9ef
feat: check apid client certificate extended key usage
This is enabled via a machine config feature/version contract, as
`talosconfig` certificate generated previously didn't have proper key
usage set, so we need to keep backwards compatibility on upgrades.

New v1.3+ clusters will include this check.

This check prevents even potential mis-use of server certificates as a
client certificate.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-09 16:37:21 +04:00
Andrey Smirnov
9dadc4a599
fix: include all node addresses into etcd cert SANs
It was a mistake to use only 'routed' addresses, as they do not include
e.g. SideroLink addresses.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-09 15:24:58 +04:00
Andrey Smirnov
9df8f1ff1a
fix: list COSI APIs for the apid authenticator
As APIs were not listed explicitly, access with `os:reader` was denied
by default, while it should have been checked down in the access filter.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-08 21:05:36 +04:00
Andrey Smirnov
f62d17125b
chore: update crypto to use new import path siderolabs/crypto
No functional changes in this PR, just updating import paths.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-07 23:02:50 +04:00
Andrey Smirnov
6472ae00b2
fix: automatically discard VIPs for etcd advertised addresses
Fixes #6210

Refactored the code a bit to support excludes and default configuration.

Etcd should never advertise VIPs, as VIPs are managed by etcd elections.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-06 14:22:12 +04:00
Noel Georgi
5e21cca52d
feat: support setting kernel parameters
Support setting kernel parameters via machine config.

Fixes: #6206

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-09-05 23:45:51 +05:30
Marvin Drees
cdb6bb2cc7
feat: add Nano Pi R4S support
This commit adds initial support for the Nano Pi
R4S from Friendlyelec. This device is a networking focused
rk3399 based SBC with two 1G ethernet interfaces,
making it perfect for edge or SOHO deployments.

Signed-off-by: Marvin Drees <marvin.drees@9elements.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-09-02 23:37:07 +05:30
Andrey Smirnov
353154281a
fix: drop kube-system SA default binding
This is not needed anymore, it's a leftover from bootkube times.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-09-01 21:38:01 +04:00
Andrey Smirnov
0723498125
fix: update COSI to the version with gRPC Wait fix
See https://github.com/cosi-project/runtime/pull/140

Also update for changes in https://github.com/cosi-project/runtime/pull/134

Fixes #6169

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-29 23:09:35 +04:00
Andrey Smirnov
89d57aa816
fix: always abort the maintenance service
I hit this bug when one of the API calls hung: submitting the
machine config with `apply-config` never took the node out of
maintenance mode, as `.GracefulStop()` may hang forever waiting for all
the calls to finish.

This way we always abort at some timeout and stop the server forcefully.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-29 22:48:06 +04:00
Andrey Smirnov
f6fa746193
fix: limit apid backoff max delay
This fixes a case where a node is rebooted and `apid` on another
endpoint "caches" a connection error even after the node is back up.

E.g. this command:

```
talosctl -e IP1 -n IP2 version
```

If node `IP2` is rebooted, `apid` at `IP1` might enter a long backoff loop
and still return an error when `IP2` is actually up.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-29 21:59:46 +04:00
Andrey Smirnov
8c203ce9b1
feat: remove the machine from the discovery service on reset
Fixes #6137

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-25 22:05:52 +04:00
Dmitriy Matrenichev
b59ca5810e
chore: move from inet.af/netaddr to net/netip and go4.org/netipx
Closes #6007

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-08-25 17:51:32 +03:00
Andrey Smirnov
053af1d59e
fix: update etcd certificates when node addresses change
Fixes #6110

I somehow missed the fact that etcd certs were not made fully reactive
to node address changes (I wrongly assumed they already were).

This PR refactors etcd certificate generation process to be
resource-based and introduces unit-tests for the controller.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-25 00:27:52 +04:00
Dmitriy Matrenichev
29bd632401
chore: remove old build tags syntax
This commit removes lines containing the old build tag syntax.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-08-24 17:27:01 +03:00
Andrey Smirnov
8c3ac4c42b
chore: limit GOMAXPROCS for Talos services
Fixes #5971

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-24 15:42:49 +04:00
Andrey Smirnov
361e85b744
fix: properly read kexec disabled sysctl
Fixes #6046

Fix by @bzub

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-24 00:06:14 +04:00
Andrey Smirnov
2f2d97b6b5
fix: don't wait for the hostname in maintenance mode
Fixes #6119

With new stable default hostname feature, any default hostname is
disabled until the machine config is available.

Talos enters maintenance mode when the default config source is empty,
so it doesn't have any machine config available at the moment
maintenance service is started.

Hostname might be set via different sources, e.g. kernel args or via
DHCP before the machine config is available, but if all these sources
are not available, hostname won't be set at all.

This stops waiting for the hostname, and skips setting any DNS names in
the maintenance mode certificate SANs if the hostname is not available.

Also adds a regression test via new `--disable-dhcp-hostname` flag to
`talosctl cluster create`.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-23 17:52:20 +04:00
Andrey Smirnov
a0d94be30d
fix: stable default hostname bias
When converting a 256-bit number to base36, there's a bias in the
first character of the base36 encoding, as the range of a 256-bit number
is never an exact power of 36.

To give an example, when converting a 5-digit binary number to decimal,
the first digit of the decimal number will be [0..3], while the
second digit won't be biased:

```
00000 -> 00
00001 -> 01
...
01111 -> 15
10000 -> 16
...
11111 -> 31
```

Same issue happens when going from e.g. base16 to base36.

Stable hostnames were biased towards having a digit as the first
character.

The fix is to skip the first character of the base36 representation.
Also, we don't need to convert all 256 bits to base36: as only 6
characters are used, we can save some CPU resources by taking only
8 bytes instead of the full 32 bytes.
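
The approach can be sketched as below. This is an illustration only: `stableSuffix`, the SHA-256 input and the zero-padding are assumptions made for the example, not the actual Talos implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"strconv"
	"strings"
)

// maxDigits is the length of the largest uint64 written in base36.
const maxDigits = 13

// stableSuffix derives a 6-character base36 string from nodeID.
// Only the first 8 bytes of the hash are converted, and the leading
// (biased) base36 digit is skipped.
func stableSuffix(nodeID string) string {
	sum := sha256.Sum256([]byte(nodeID))
	n := binary.BigEndian.Uint64(sum[:8])

	s := strconv.FormatUint(n, 36)
	// Left-pad with zeros so the digit positions are stable.
	s = strings.Repeat("0", maxDigits-len(s)) + s

	return s[1:7]
}
```

The leading digit of a 64-bit value in base36 can only be 0 to 3, which is why it is dropped before taking the six characters.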

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-22 21:36:05 +04:00
Andrey Smirnov
da4cd34ef5
feat: update etcd advertised peer addresses on the fly
This allows updating the member information (for the current node) with
new advertised peer URLs as the config changes.

E.g. if the node IP changes, this will update the peer URLs for the
member accordingly.

At the same time any member update requires quorum, so changing IPs can
only be done on node-by-node basis.

If there are no changes to advertised peer URLs, controller does
nothing.

Talos node might still need a reboot to update the listen addresses, as
these are not handled automatically for now.

Fixes #6080

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-22 19:49:51 +04:00
Artem Chernyshev
fd467e02c1
fix: handle grub config being empty in the Revert function
Looks like it returns nil if it doesn't exist and the code doesn't
handle it properly.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-08-16 23:05:43 +03:00
Artem Chernyshev
9492aca652
fix: clean up cancelCtxMu leftovers in PriorityLock
Removed it from one place but forgot to clean up the other usages.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-08-16 13:19:56 +03:00
Artem Chernyshev
32db7a7f5d
fix: surround cancelCtx with the mutex
Looks like `cancelCtx` access from the different goroutines wasn't
protected.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-08-15 22:21:07 +03:00
Dmitriy Matrenichev
0fe4492e72
chore: bump golangci-lint from 1.47.2 to 1.48.0
Minor version linter upgrade.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-08-15 18:11:30 +03:00
Andrey Smirnov
9baca49662
refactor: implement COSI resource API for Talos
Overview: deprecate existing Talos resource API, and introduce new COSI
API.

Consequences:

* COSI API can only go via one-to-one proxy (`client.WithNode`)
* client-side API access is way easier with `state.State` wrappers
* lots of small changes on the client side to use new APIs

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-12 22:31:54 +04:00