To avoid unwanted upgrades, compare the kernel args more conservatively, with partially ignoring order:
- protected args need to be equal, ignoring order completely
- non-protected args need to be equal, respecting the order
- protected and non-protected args can appear in any order relative to each other, they can also be interleaved/scattered.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
(cherry picked from commit 1f237905fb24418652ad3a39b4c1a37bb3c86c26)
We had an issue with bare metal provider where two different schematic IDs would fight each other, causing machine to get installed with a wrong schematic ID, only to be upgraded to the correct one immediately, and in some cases, go into an upgrade loop between a correct and an incorrect schematic.
The cause: Omni treated schematics it observed when the machine in agent mode dialed in, and stored the information it received (like kernel args and initial schematic info). This was wrong, as agent mode information essentially meaningless.
Fix this by changing the simple check of "was the schematic info for machine X ever observed" to be "is the schematic info for machine X ready". The readiness check involves schematic being populated and machine not being in agent mode.
This change caused `SchematicConfiguration` resource to not be generated before the machine leaves the agent mode, and caused a side effect: `InfraMachineController` would not receive Talos version from it and would not populate it on the `InfraMachine` resource. And this would cause BM provider to never get notified about the fact that the machine is allocated to a cluster, and would not power it on (to PXE boot it to "regular" Talos, for it to receive the "install" call to Omni).
Change that controller to get the Talos version info directly from the Cluster resource.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
(cherry picked from commit c319d7bcf27966494dde99f62aa3c6eb636640cd)
The schematic comparison logic had an edge case: if a machine predates the image factory, it is installed via a `ghcr.io` installer image (or a custom one). Those machines do not have the schematic meta extension on them, and Omni creates a synthetic schematic ID and properties for those. These properties do not have the "actual" kernel args of the machine, but rather, Omni sets them as what it thinks they should be (the "correct" siderolink args from the Omni perspective).
Later, if Omni gets its siderolink API advertised URL get updated, it wrongly detects those synthetic kernel args to be the "new ones (with the new URL)", hence, the desired vs actual schematic comparison returns a mismatch. And Omni does an unnecessary upgrade to that machine.
Fix this by using the "current (non-protected) args of the machine" as the synthetic args in such cases. Those "current" args will be synthetic themselves (since we cannot read them from the machine, as it does not have schematic info on it), but, it will prevent changes when the advertised URL changes.
Additionally, we have two checks to detect a schematic mismatch in the `ClusterMachineConfigStatus` controller - make them check the mismatch in the same way, to be more consistent.
(cherry picked from commit 0906bcc23c5d2e56b52fe4c3c7826afbea73dada)
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Do the same thing we did in the schematic configuration controller and read the KernelArgs resource uncached.
This logic is in the process of being centralized in #1792, but still, it should help with the test stability at the time being.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Fix the rare issue which was caught by our upgrade tests: `KernelArgs` resource status might not be up to date after we observed them to be initialized on the machine status resource. This caused an unwanted Talos upgrade in rare cases. We were also able to reproduce this with a unit test. Do an uncached read to circumvent that.
Additionally, do a small fix the initialization of the `KernelArgs` resource: do the "emptiness check" on the extra kernel args after we filter the system ones out, so we won't create unnecessary `KernelArgs` resources (this did not cause any changes in the logic, can be considered an optimization fix).
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
(Re)implement the kernel args support functionality in the following way:
- Only support UKI or UKI-like (>=1.12 with GrubUseUKICmdline) systems.
- In `MachineStatusController`:
- When we see a machine for the first time, do a one-time operation of extracting of the extra kernel args from it and store them in the newly introduced `KernelArgs` resource. This resource is user-owned from that point on.
- Mark the `MachineStatus` with an annotation as "its kernel args are initialized".
- Start storing the the raw schematic.
- Take a one-time snapshot of the extensions on the machine and set them as "initial extensions". They might not be the "actual initial", i.e., the set of extensions when we actually seen the machine for the first time, but we do this in a best-effort basis. We need this, since now we cannot simply go back to the initial schematic ID when all extensions are removed - kernel args are also included in the schematic.
- Start collecting the kernel cmdline from Talos machines as well.
- Adapt the `SchematicConfiguration` controller to not revert to the initial schematic ID ever - it now always computes the needed schematic - when it wants to revert to the initial set of extensions, it uses the new field on the `MachineStatus`.
- Introduce the resource `MachineUpgradeStatus` and its controller `MachineUpgradeStatusController`, which handles the maintenance mode upgrades when kernel args are updated. The controller is named this way, since our long-term plan is to centralize all upgrade calls to be done from this controller. Currently, it does not change Talos version or the set of extensions. It works only in maintenance mode, only for kernel args changes (when supported).
- Introduce the resource `KernelArgsStatus` and its controller `KernelArgsStatusController`, which provides information about the kernel args updates. Its status is reliable in both maintenance and non-maintenance modes.
- Build a UI to update these args (with @Unix4ever's help).
Co-authored-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>