Kill old-style "manual" tests, use `ctest` consistently now.
This should be a no-op refactoring.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit df0b9a8da1423842d830261e5ddc5dc8f5a234c1)
While the OOM pressure is high, we might observe "extra kills" as there
are no other victims to kill anymore (as `stress-ng` is already gone).
Tolerate those kills, but log them in case we see this getting out of
hand.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 71aeb347f90969cb6057651666bfda205269d917)
A sample failure:
```
manifests.go:133:
Error Trace: /src/internal/integration/k8s/manifests.go:133
Error: []string{"/usr/local/bin/kube-proxy", "--cluster-cidr=10.244.0.0/16", "--conntrack-max-per-core=0", "--hostname-override=$(NODE_NAME)", "--kubeconfig=/etc/kubernetes/kubeconfig", "--proxy-mode=nftables"} does not contain "--nodeport-addresses=0.0.0.0/0"
Test: TestIntegration/k8s.ManifestsSuite/TestSync
manifests.go:137: disabling kube-proxy
```
My running theory is that `List()` picks up a stale pod, so try to
filter it out and log it in full if we hit it.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 9b9542cc55ee6d08f3490d270c1b497c7b9d3049)
Fixes #13169
Also fixes a number of other issues with controller being stuck
"watching" over stale data.
The major part of the change is to watch contents of kubelet's
kubeconfig and restart the watch when it changes.
The internals of the watch process don't always bubble up errors
properly, or we don't watch for errors.
With this change, not only does the initial sync have a timeout and a way
to abort the sync process, but Talos can now also restart the sync on
kubeconfig change, which makes it more transparent.
This might become irrelevant if we start managing kubeconfig via Talos
controlplane for workers, but for now this seems to be the way to fix
issues.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 149592fa59d20c5aa29e4c0af9a3760585f378ce)
See #13159, newer GPU operator v26.3.1 has better detection.
Signed-off-by: Noel Georgi <git@frezbo.dev>
(cherry picked from commit bba0b4aeefd7ec0daf7cc048e48c66d8b614f576)
At the end of every sequence that intentionally terminates the machine (reboot, shutdown, upgrade, etc.), a fatal event is published to signal expected termination. The machine status controller was unconditionally flipping the stage to "rebooting" on this event, which was correct for sequences that end in a reboot but incorrect for the shutdown sequence whose expected termination is a power-off.
The stage tracker now skips this transition when the current sequence is shutdown, so the machine stays in "shutting down" until it actually powers off.
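A minimal sketch of the guard described above. The type, constant, and function names are hypothetical stand-ins and not the actual controller code:

```go
package main

import "fmt"

// Sequence and Stage are hypothetical stand-ins for the real runtime types.
type Sequence int

const (
	SequenceReboot Sequence = iota
	SequenceShutdown
	SequenceUpgrade
)

type Stage string

const (
	StageRebooting    Stage = "rebooting"
	StageShuttingDown Stage = "shutting down"
)

// stageOnExpectedTermination returns the stage to report when the
// "expected termination" event arrives at the end of a sequence.
func stageOnExpectedTermination(current Stage, seq Sequence) Stage {
	if seq == SequenceShutdown {
		// Shutdown ends in a power-off, so the stage is left untouched.
		return current
	}

	// Every other terminating sequence is assumed to end in a reboot.
	return StageRebooting
}

func main() {
	fmt.Println(stageOnExpectedTermination(StageShuttingDown, SequenceShutdown))
	fmt.Println(stageOnExpectedTermination("upgrading", SequenceUpgrade))
}
```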
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
(cherry picked from commit c028db0b8d25e85a4b580e10252d964785320291)
Make sure we run the check commands also on the same node where we created the pool.
Fixes: #13014
Signed-off-by: Noel Georgi <git@frezbo.dev>
(cherry picked from commit 7fa4d39197e1a9e54ba8a259c111f2cb8047ef9c)
This check was in maintenance Upgrade API for Talos <= 1.12,
so keep it in the "normal" API as well.
It always makes sense - the upgrade would fail if Talos is not
installed, but that failure in legacy Upgrade API is async and not
reported properly back.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 0d8362119e4415182caa9349e0ddfb27ea290d90)
During removal of an encryption key, we logged the slot of the current key instead of the slot of the removed key.
Signed-off-by: Mateusz Urbanek <mateusz.urbanek@siderolabs.com>
(cherry picked from commit be58eafaba98bb7b1bcd20ac1ed8f8b03734c7e0)
There are no security issues fixed.
Drop username/password creds - they were not used.
Improve security of token interceptor.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 9fbb7c95df2b1dcd68fafa23865412bbd8300f4b)
They re-enabled support for absolute symlinks, but symlinks which target
paths with `../` are still dropped.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 212182e6f655f61e8917059868fc381728e4a959)
Remove the skip statements/rework the code to allow
FIPS builds to do Wireguard by wrapping Wireguard operations
into `fips140.WithoutEnforcement` blocks.
Using Wireguard (or not using it) is still a user's choice, but this
allows tests to run in strict mode.
There might be more fixes required for FIPS strict mode; right now this is
blocked by a Go issue with X25519, the fix for which is going to be
backported to Go 1.26.3.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 1ef8e630ab77b3c849e7da6d1ff83e7c6795f070)
Reset ARPIPTargets and NSIP6Targets at the start of BondMasterSpec.Decode.
Without this, repeated decode calls on the same struct can retain old target
entries after config removes them, which makes link status drift from
current bond configuration.
Add a regression test that decodes a payload with targets, then decodes a
payload without target attributes into the same struct and asserts both
slices are empty.
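A minimal sketch of the stale-slice problem and the reset, assuming a simplified decoder. `bondSpec`, `attr`, and the attribute kind strings are hypothetical and do not mirror the real netlink decoding code:

```go
package main

import "fmt"

// bondSpec is a hypothetical, simplified stand-in for BondMasterSpec.
type bondSpec struct {
	ARPIPTargets []string
	NSIP6Targets []string
}

// attr is a hypothetical decoded attribute: a kind plus its values.
type attr struct {
	kind   string
	values []string
}

// decode fills the spec from the attributes. The target slices are reset
// first, so entries from a previous call cannot survive when the new
// payload carries no target attributes.
func (s *bondSpec) decode(attrs []attr) {
	s.ARPIPTargets = nil
	s.NSIP6Targets = nil

	for _, a := range attrs {
		switch a.kind {
		case "arp_ip_target":
			s.ARPIPTargets = append(s.ARPIPTargets, a.values...)
		case "ns_ip6_target":
			s.NSIP6Targets = append(s.NSIP6Targets, a.values...)
		}
	}
}

func main() {
	var spec bondSpec

	spec.decode([]attr{
		{kind: "arp_ip_target", values: []string{"10.0.0.1"}},
		{kind: "ns_ip6_target", values: []string{"fd00::1"}},
	})
	spec.decode(nil) // second payload has no target attributes

	fmt.Println(len(spec.ARPIPTargets), len(spec.NSIP6Targets))
}
```

Without the two reset lines, the second call would leave both slices populated, which is the drift the regression test guards against.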
Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 0a47f40b3cdf304a079c6b3fa964e9f82e91ec63)
Add an integration test and fix legacy upgrade API in maintenance mode.
There were several assumptions which do not hold true in maintenance as
we have no machine configuration.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit c464c7e88a3f058cb2bbc36af1910d69d903cd07)
Also fix one more place where version.Name wasn't used properly.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 4ba11156fd164a0d94538508f5c028f249deed50)
Add NVIDIA arm64 test matrix.
Also ensure we have a known baseline for nvidia cdi files,
so if upstream adds more files and we don't install them to the right
location, the test would fail.
Signed-off-by: Noel Georgi <git@frezbo.dev>
(cherry picked from commit 6a3ab87c54f83f70869a2e298e6ed7722cf4afad)
For IPv4, they should be attached to no interfaces.
Discovered while doing some manual testing for the documentation.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 0bfdf7f7035fefe804ec4b568709cd6a09195293)
Allow setting the build NAME at build time, and propagate it down to more consumers.
Expose name in `Version` resource, and use that in the dashboard
next to Talos version.
Fix some places where `Name` was hardcoded.
Propagate Name down to UKI build.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 968ec1e0ca26eb1f0de0836e0a55df09dea7dafe)
When decompressing extensions, we might not be able to set xattrs (e.g.
running rootless), so instead of setting xattrs, save them in memory and
push to mksquashfs as pseudo definitions.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit d697f5538a7a624a1ac7bafdfebc67dd9418c434)
Add --drain and --drain-timeout flags to `talosctl reboot` (default off)
and `talosctl upgrade` (default on) that cordon and drain the Kubernetes
node before rebooting, then wait for Ready and uncordon after it comes
back. When --drain is enabled, --wait is forced to true.
Signed-off-by: Mateusz Urbanek <mateusz.urbanek@siderolabs.com>
(cherry picked from commit 52b920032e97e1b241c1e0bd89c6e41cbc1c9a47)
When dashboard runs within Talos, it previously used `os:admin` role
which allows anything.
With changes in 1.13, I dropped the role to `os:reader`, which is a way
tighter scope from the security perspective, but it broke network config
tab - it tries to write to META, which is not allowed under `os:reader`
role, so this change fixes the dashboard, but still keeps the RBAC
tight.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 649ab7fe4234de1a947071926603377e00910cb9)
Fixes #12933
There are many use cases for this:
* exploring resources and state of the system, learning available
resources
* when a Talos machine is booted up in an environment without network
access, learning all available network interfaces, all disks
available, etc.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 5e24d5265bde9adee92c02e675140de87ee126bf)
Fixes #13056
The TPM unseal operation doesn't respect the context, and we had 10
second timeout for the whole key unlock operation.
So there might be a case when a "slow" TPM unseal runs for more than 10
seconds, and by the time TPM unseal is done, the context timeout has already
passed, so a somewhat wrong message pops up, as the rate limiter is
configured with any limit, but it fails due to the fact that the context
got canceled (but it would have failed later anyways doing the actual
resource operation).
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 087ced85f5130656cbc647c2e4d838cab3ff1737)
Previously, there was no way to grow virtual disks attached to VMs,
even though resizing them was possible (e.g. by changing the size of the
disk in the hypervisor). This forces the UserVolume of type=disk to always
grow to the full size of the disk.
Signed-off-by: Mateusz Urbanek <mateusz.urbanek@siderolabs.com>
(cherry picked from commit e2df0f6ce8c47b0dc3e93bf257afb8a1ae9243fb)
The runtime capabilities lookup did not include an entry for the metal-agent mode, causing an index out of range panic when any capability check was performed in that mode. This broke MetaWrite calls from Omni to machines running in metal-agent mode through the new unified apid, preventing them from appearing as pending machines.
Also fix the incorrect comments on the existing entries to match the actual iota order.
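A sketch of how a lookup table indexed by an iota-based mode can panic when an entry is missing. All names and capability values here are hypothetical and chosen for illustration only. Keying the array literal by constant is one way to keep entries and mode order in sync; it is not necessarily what this change does:

```go
package main

import "fmt"

// Mode is a hypothetical runtime mode enum.
type Mode int

const (
	ModeCloud Mode = iota
	ModeContainer
	ModeMetal
	ModeMetalAgent
)

// capabilities holds illustrative values only.
type capabilities struct {
	canWriteMeta bool
}

// table is keyed by constant, so each entry is tied to its mode and a
// reordering of the constants cannot silently shift the entries.
var table = [...]capabilities{
	ModeCloud:      {canWriteMeta: true},
	ModeContainer:  {canWriteMeta: false},
	ModeMetal:      {canWriteMeta: true},
	ModeMetalAgent: {canWriteMeta: true}, // a missing last entry makes table[ModeMetalAgent] panic
}

// lookup guards the index instead of panicking on an unknown mode.
func lookup(m Mode) (capabilities, bool) {
	if m < 0 || int(m) >= len(table) {
		return capabilities{}, false
	}

	return table[m], true
}

func main() {
	c, ok := lookup(ModeMetalAgent)
	fmt.Println(c.canWriteMeta, ok)

	_, ok = lookup(Mode(42))
	fmt.Println(ok)
}
```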
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
(cherry picked from commit 783a35851ed1bac4ddd0f1fed583fc1b6477614d)
When processing on-link routes, the source address was incorrectly set to the first address of the interface.
This caused issues when the interface had multiple addresses, as the source address may not have been valid for the route.
The source address is now set to an empty string, which allows the kernel to automatically select the appropriate source address for the route.
Signed-off-by: Orzelius <33936483+Orzelius@users.noreply.github.com>
(cherry picked from commit 3400059ccf4811140a4326397d972f68693c708c)
It seems that depending on timing, we might get one or another Talos in
gRPC client.
Fixes #13016
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 4227921b3979d3a8542946fed4ceb622747adb00)
This is a regression compared to Talos 1.12: allow blockdevice wipe in
maintenance mode (with `os:reader` role).
Also improve the test for maintenance via SideroLink - add a test on
install, META write and reboot preserving META value.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 1dd701efa8119b6515a62ff68c430c99a96f2b68)
As one of the integration tests was overriding TrustedRoots config, it
erased the required settings leading to a random failure (depending on
the nodes picked for subsequent tests).
Fixes #13013
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 70cefab6af3dacdc80921b55ca8dbf5644501c6c)
Add a test that covers all maintenance APIs in general.
Add a test for transition from SideroLink.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit ad72c73006abc3b51e5371496c61d8637b2222f0)
The gpu-operator device plugin generates CDI specs with hooks pointing
to /usr/bin/nvidia-ctk and /usr/bin/nvidia-cdi-hook (hardcoded defaults
in NVIDIA/k8s-device-plugin and NVIDIA/nvidia-container-toolkit). Talos
extensions install these binaries under /usr/local/bin/, so pods
requesting nvidia.com/gpu resource limits fail with "no such file".
Add /usr/bin/nvidia-ctk and /usr/bin/nvidia-cdi-hook to the rootfs as
symlinks.
Fixes: #13021
Fixes: https://github.com/siderolabs/extensions/issues/1017
Signed-off-by: David Orman <ormandj@corenode.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 9597714f625ac07bf74de32a24c3e6dad5abdc91)
See https://github.com/siderolabs/talos/discussions/13012
containerd's default OCI spec sets the NOFILE rlimit to 1024;
unset it to simply let machined defaults take over.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 8ac47d677703624ec6568294d94dcad7e533e6c4)
Whitelist services which can access the file socket, refuse other
connections.
Fixes #12701
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 038cb87354eea1c1ff4612bdd13d1e77e595955a)
We should use the endpoint(s) from the original talosconfig instead of
using node IPs, as they might be private/behind the LB.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 8e1c8a7a90fb039fd8a639a1218c169bc683d141)
Drop maintenance service and all the code supporting it directly.
Instead, move all network API termination into the `apid` service, which
can now work in more modes to support maintenance operations as
well.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Trade some imports, bump some modules, net result is killing lots of
transitive dependencies which were getting into the build.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Pseudo late mount points (`/system`, `/run` and `/system`) were consistently failing to unmount.
While reaching this unmount sequence, we should already have unmounted any children.
However, if those are not unmounted, we should log what we are unmounting and unmount them recursively.
Fixes #12974
Signed-off-by: Mateusz Urbanek <mateusz.urbanek@siderolabs.com>
The panic:
```
2026/03/16 13:39:56 172.20.0.3: {"component":"controller-runtime","controller":"hardware.SystemInfoController","error":"controller \"hardware.SystemInfoController\" panicked: output tracking already enabled\n\ngoroutine 613 [running]:\nruntime/debug.Stack()\n\t/go/src/runtime/debug/stack.go:26 +0x5e\ngithub.com/cosi-project/runtime/pkg/controller/runtime/internal/rruntime.(*Adapter).runOnce.func2()\n\t/.cache/mod/github.com/cosi-project/runtime@v1.14.0/pkg/controller/runtime/internal/rruntime/run.go:67 +0x4c\npanic({0x2a43dc0?, 0x350ff30?})\n\t/go/src/runtime/panic.go:860 +0x13a\ngithub.com/cosi-project/runtime/pkg/controller/runtime/internal/rruntime.(*Adapter).StartTrackingOutputs(0x38246abe1c98?)\n\t/.cache/mod/github.com/cosi-project/runtime@v1.14.0/pkg/controller/runtime/internal/rruntime/output_tracker.go:25 +0x94\ngithub.com/siderolabs/talos/internal/app/machined/pkg/controllers/hardware.(*SystemInfoController).Run(0x38246a3fe280, {0x3549b50, 0x38246a96dbd0}, {0x358b070, 0x38246adaf0e0}, 0x38246adba000)\n\t/src/internal/app/machined/pkg/controllers/hardware/system.go:93 +0x127\ngithub.com/cosi-project/runtime/pkg/controller/runtime/internal/rruntime.(*Adapter).runOnce(0x38246adaf0e0, {0x3549b50, 0x38246a96dbd0}, 0x38246adba000)\n\t/.cache/mod/github.com/cosi-project/runtime@v1.14.0/pkg/controller/runtime/internal/rruntime/run.go:73 +0xfa\ngithub.com/cosi-project/runtime/pkg/controller/runtime/internal/rruntime.(*Adapter).Run(0x38246adaf0e0, {0x3549b50, 0x38246a96dbd0})\n\t/.cache/mod/github.com/cosi-project/runtime@v1.14.0/pkg/controller/runtime/internal/rruntime/run.go:25 +0x16b\ngithub.com/cosi-project/runtime/pkg/controller/runtime.(*Runtime).Run.func1.2()\n\t/.cache/mod/github.com/cosi-project/runtime@v1.14.0/pkg/controller/runtime/runtime.go:201 +0x2e\ngithub.com/cosi-project/runtime/pkg/controller/runtime.(*Runtime).Run.func1.goFunc.3()\n\t/.cache/mod/github.com/cosi-project/runtime@v1.14.0/pkg/controller/runtime/runtime.go:473 
+0x13\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\t/.cache/mod/golang.org/x/sync@v0.20.0/errgroup/errgroup.go:93 +0x50\ncreated by golang.org/x/sync/errgroup.(*Group).Go in goroutine 146\n\t/.cache/mod/golang.org/x/sync@v0.20.0/errgroup/errgroup.go:78 +0x95\n","msg":"2026-03-16T09:39:56.457Z \u001b[31mERROR\u001b[0m controller failed","talos-level":"info","talos-service":"controller-runtime","talos-time":"2026-03-16T09:39:56.718594712Z"}
```
This is more of a cosmetic issue, but still - move tracking outputs below
the `continue` statement, otherwise it might be called twice in a single
run.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>