9 Commits

Author SHA1 Message Date
Utku Ozdemir
1e6be81f39
refactor: introduce uncached reader/writer package, fix flaky tests
Introduce a new `uncached` package that provides `Reader()` and `ReaderWriter()` wrappers to bypass the COSI controller runtime read cache. Replace all manual `controller.UncachedReader` type assertion casts across the codebase with the new package, making uncached reads more ergonomic and less error-prone.

Use the new package to fix the flaky `Test_KubernetesCARotation/rotation_ongoing` test. The rotation status controller performs a one-off operation: it fires, runs through stages, and marks rotation as done. This makes it inherently vulnerable to stale reads, because further wakeups caused by delayed update notifications cannot bring the state to the desired one — the resource snapshot at the time of rotation is crucial. Replace all its read operations with uncached reads via `uncached.ReaderWriter()`.

Fix the flaky `TestUserMetrics` test by moving mock resource creation to the setup phase so the controller sees the data from its first reconcile, eliminating a race where the controller's initial reconcile would run before the test data was created.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-03-05 13:05:59 +01:00
Utku Ozdemir
1f237905fb
fix: compare current and new kernel args more defensively
To avoid unwanted upgrades, compare the kernel args more conservatively, partially ignoring order:

- protected args need to be equal, ignoring order completely
- non-protected args need to be equal, respecting the order
- protected and non-protected args can appear in any order relative to each other; they can also be interleaved/scattered.
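
The three rules above can be sketched as follows. This is an illustration under assumptions, not the actual implementation: `isProtected` and its prefixes are hypothetical, and the real set of protected args may differ.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// isProtected is a hypothetical predicate; the prefixes are assumptions
// made for this example only.
func isProtected(arg string) bool {
	return strings.HasPrefix(arg, "siderolink.") || strings.HasPrefix(arg, "talos.")
}

// split separates args into protected and non-protected groups,
// preserving the relative order inside each group.
func split(args []string) (protected, other []string) {
	for _, arg := range args {
		if isProtected(arg) {
			protected = append(protected, arg)
		} else {
			other = append(other, arg)
		}
	}

	return protected, other
}

// kernelArgsEqual ignores order among protected args, respects order among
// non-protected args, and ignores how the two groups are interleaved.
func kernelArgsEqual(current, desired []string) bool {
	curProt, curOther := split(current)
	newProt, newOther := split(desired)

	// split returns fresh slices, so sorting does not mutate the inputs.
	slices.Sort(curProt)
	slices.Sort(newProt)

	return slices.Equal(curProt, newProt) && slices.Equal(curOther, newOther)
}

func main() {
	current := []string{"console=ttyS0", "siderolink.api=x", "talos.platform=metal", "panic=10"}
	interleaved := []string{"talos.platform=metal", "console=ttyS0", "panic=10", "siderolink.api=x"}
	reordered := []string{"panic=10", "console=ttyS0", "siderolink.api=x", "talos.platform=metal"}

	fmt.Println(kernelArgsEqual(current, interleaved)) // true
	fmt.Println(kernelArgsEqual(current, reordered))   // false
}
```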

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-02-18 13:17:20 +01:00
Utku Ozdemir
0906bcc23c
fix: prevent unwanted upgrades of non-image-factory machines
The schematic comparison logic had an edge case: if a machine predates the image factory, it is installed via a `ghcr.io` installer image (or a custom one). Those machines do not have the schematic meta extension on them, and Omni creates a synthetic schematic ID and properties for those. These properties do not have the "actual" kernel args of the machine, but rather, Omni sets them as what it thinks they should be (the "correct" siderolink args from the Omni perspective).

Later, if Omni's siderolink API advertised URL gets updated, Omni wrongly treats those synthetic kernel args as the "new ones (with the new URL)", so the desired vs actual schematic comparison returns a mismatch, and Omni does an unnecessary upgrade of that machine.

Fix this by using the "current (non-protected) args of the machine" as the synthetic args in such cases. Those "current" args will be synthetic themselves (since we cannot read them from the machine, as it does not have schematic info on it), but this will prevent changes when the advertised URL changes.

Additionally, we have two checks to detect a schematic mismatch in the `ClusterMachineConfigStatus` controller - make them check the mismatch in the same way, to be more consistent.

Unrelated to this bug, also fix the `SchematicReady` check (introduced in 1.5) to treat invalid schematics as valid, as otherwise we cannot create clusters from non-factory images.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-02-05 13:56:49 +01:00
Utku Ozdemir
c319d7bcf2
fix: fix schematic generation for machines in agent mode
We had an issue with the bare metal provider where two different schematic IDs would fight each other, causing the machine to get installed with a wrong schematic ID, only to be upgraded to the correct one immediately, and in some cases, go into an upgrade loop between a correct and an incorrect schematic.

The cause: Omni treated the schematic it observed when a machine in agent mode dialed in as valid, and stored the information it received (like kernel args and initial schematic info). This was wrong, as agent mode information is essentially meaningless.

Fix this by changing the simple check of "was the schematic info for machine X ever observed" to be "is the schematic info for machine X ready". The readiness check requires the schematic to be populated and the machine not to be in agent mode.
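
A minimal sketch of such a readiness check, assuming a hypothetical `machineSchematic` struct (the actual resource types and field names are not shown in this commit message):

```go
package main

import "fmt"

// machineSchematic is a hypothetical stand-in for the schematic info
// that is tracked per machine.
type machineSchematic struct {
	SchematicID string // empty until the schematic is populated
	AgentMode   bool   // true while the machine runs in agent mode
}

// schematicReady replaces the "was it ever observed" check: the schematic
// must be populated and the machine must not be in agent mode.
func schematicReady(s machineSchematic) bool {
	return s.SchematicID != "" && !s.AgentMode
}

func main() {
	fmt.Println(schematicReady(machineSchematic{SchematicID: "abc", AgentMode: true}))  // false
	fmt.Println(schematicReady(machineSchematic{SchematicID: "abc", AgentMode: false})) // true
}
```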

This change caused the `SchematicConfiguration` resource to not be generated before the machine leaves agent mode, which had a side effect: `InfraMachineController` would not receive the Talos version from it and would not populate it on the `InfraMachine` resource. As a result, the BM provider would never get notified that the machine is allocated to a cluster, and would not power it on (to PXE boot it to "regular" Talos, for it to receive the "install" call from Omni).

Change that controller to get the Talos version info directly from the Cluster resource.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-02-03 11:46:15 +01:00
Edward Sammut Alessi
d3ae77c0cc
chore: bump copyright to 2026
Bump copyright for conformance to 2026

Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
2026-01-21 15:30:49 +01:00
Oguz Kilcan
bc2a5a9986
chore: prepare omni with talos v1.12.0-beta.1
Prepare omni for upcoming talos version 1.12.0-beta.1.

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2025-12-06 16:55:35 +01:00
Utku Ozdemir
4d0658bb10
test: fix flaky MachineUpgradeStatusController test
Do the same thing we did in the schematic configuration controller and read the KernelArgs resource uncached.

This logic is in the process of being centralized in #1792, but still, it should help with the test stability for the time being.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2025-11-13 23:02:22 +01:00
Utku Ozdemir
3e90bc6c94
fix: prevent stale reads of kernel args in schematic id calculation
Fix the rare issue which was caught by our upgrade tests: `KernelArgs` resource status might not be up to date after we observed them to be initialized on the machine status resource. This caused an unwanted Talos upgrade in rare cases. We were also able to reproduce this with a unit test. Do an uncached read to circumvent that.

Additionally, make a small fix to the initialization of the `KernelArgs` resource: do the "emptiness check" on the extra kernel args after we filter the system ones out, so we won't create unnecessary `KernelArgs` resources (this did not cause any changes in the logic and can be considered an optimization fix).
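
The ordering of the two steps can be sketched like this. It is only an illustration: `isSystemArg` and its prefixes are hypothetical, and the real filter may use different criteria.

```go
package main

import (
	"fmt"
	"strings"
)

// isSystemArg is a hypothetical predicate for args that the system sets itself.
func isSystemArg(arg string) bool {
	return strings.HasPrefix(arg, "siderolink.") || strings.HasPrefix(arg, "talos.")
}

// extraKernelArgs drops the system args and keeps the rest in order.
func extraKernelArgs(cmdline []string) []string {
	var extra []string

	for _, arg := range cmdline {
		if !isSystemArg(arg) {
			extra = append(extra, arg)
		}
	}

	return extra
}

// needsKernelArgsResource checks emptiness after filtering, so a cmdline
// made up only of system args does not produce a resource.
func needsKernelArgsResource(cmdline []string) bool {
	return len(extraKernelArgs(cmdline)) > 0
}

func main() {
	fmt.Println(needsKernelArgsResource([]string{"talos.platform=metal", "siderolink.api=x"})) // false
	fmt.Println(needsKernelArgsResource([]string{"talos.platform=metal", "console=ttyS0"}))    // true
}
```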

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2025-11-05 15:09:21 +01:00
Utku Ozdemir
15deddde56
feat: implement extra kernel args support
(Re)implement the kernel args support functionality in the following way:
- Only support UKI or UKI-like (>=1.12 with GrubUseUKICmdline) systems.
- In `MachineStatusController`:
  - When we see a machine for the first time, do a one-time operation of extracting the extra kernel args from it and storing them in the newly introduced `KernelArgs` resource. This resource is user-owned from that point on.
  - Mark the `MachineStatus` with an annotation as "its kernel args are initialized".
  - Start storing the raw schematic.
  - Take a one-time snapshot of the extensions on the machine and set them as "initial extensions". They might not be the "actual initial" ones, i.e., the set of extensions when we actually saw the machine for the first time, but we do this on a best-effort basis. We need this, since now we cannot simply go back to the initial schematic ID when all extensions are removed - kernel args are also included in the schematic.
  - Start collecting the kernel cmdline from Talos machines as well.
- Adapt the `SchematicConfiguration` controller to never revert to the initial schematic ID: it now always computes the needed schematic, and when it wants to revert to the initial set of extensions, it uses the new field on the `MachineStatus`.
- Introduce the resource `MachineUpgradeStatus` and its controller `MachineUpgradeStatusController`, which handles the maintenance mode upgrades when kernel args are updated. The controller is named this way, since our long-term plan is to centralize all upgrade calls to be done from this controller. Currently, it does not change Talos version or the set of extensions. It works only in maintenance mode, only for kernel args changes (when supported).
- Introduce the resource `KernelArgsStatus` and its controller `KernelArgsStatusController`, which provides information about the kernel args updates. Its status is reliable in both maintenance and non-maintenance modes.
- Build a UI to update these args (with @Unix4ever's help).
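
The one-time initialization described for `MachineStatusController` above can be sketched as follows. This is a rough illustration: the annotation key, the `machineStatus` struct, and `initKernelArgs` are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"slices"
)

// kernelArgsInitialized is a hypothetical annotation key.
const kernelArgsInitialized = "kernel-args-initialized"

// machineStatus is a hypothetical stand-in for the MachineStatus resource.
type machineStatus struct {
	Annotations map[string]string
}

// initKernelArgs returns the args to store and true on the first call for a
// machine. On later calls it returns false, leaving the user-owned value alone.
func initKernelArgs(ms *machineStatus, extra []string) ([]string, bool) {
	if _, done := ms.Annotations[kernelArgsInitialized]; done {
		return nil, false
	}

	if ms.Annotations == nil {
		ms.Annotations = map[string]string{}
	}

	ms.Annotations[kernelArgsInitialized] = ""

	return slices.Clone(extra), true
}

func main() {
	ms := &machineStatus{}

	args, created := initKernelArgs(ms, []string{"console=ttyS0"})
	fmt.Println(args, created) // [console=ttyS0] true

	args, created = initKernelArgs(ms, []string{"panic=10"})
	fmt.Println(args, created) // [] false
}
```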

Co-authored-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2025-10-28 14:44:48 +01:00