Now inline supports all three variants:
- a single inline map (backward compatibility for config patches).
- a list of inline maps
- raw bytes, that can also contain multiple documents.
`omnictl cluster template export` command was updated to export config
patches/manifests as raw bytes to ensure that multiple values are
properly supported.
Fixes: https://github.com/siderolabs/omni/issues/2683
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
By default only allow to include files from the same directory where the
template file lives.
This is to prevent malicious cluster templates that include something
like `/etc/passwd`.
Fixes: https://github.com/siderolabs/omni/issues/2590
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Rewrite `TalosUpgradeStatus` controller to use the completely different
flow:
- update all `ClusterMachineTalosVersion` resources immediately.
- to control quotas and rollout sequence use `UpgradeRollout` resource,
it has a single field which is a map of MachineSetName -> Current
Quota:
- if control plane is updating it sets quota 0 on all other machine
sets.
- the number of not running/unhealthy machines is subtracted from the
quota.
- quota is now copied from the new `UpgradeStrategy`, so it's possible
to have more than one machine updated in parallel.
- `ClusterMachineConfigStatus` controller now adds a new finalizer for
upgrades on all `ClusterMachines` which are currently being updated to
acquire/release locks and reads quotas from the `UpgradeRollout`.
Fixes: https://github.com/siderolabs/omni/issues/2393
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Update Go in go.mod to keep it consistent with the value in the Makefile (the actual Go version the project is built with).
It kicks in some new linters, causes linters to change behavior. Reformat and fix all those linting issues.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
The gotextdiff/myers library uses the naive Myers algorithm variant that stores the full edit trace, resulting in O((M+N)^2) space complexity.
For machine configs with large inline K8s manifests (thousands of lines), this causes massive memory spikes — e.g., 80K lines allocates ~98 GB and gets OOM-killed.
Replace it with neticdk/go-stdlib/diff/myers which implements the linear-space Myers variant (divide-and-conquer). Memory usage drops from ~25 GB to ~8 MB for 40K-line inputs.
The diff output format is unchanged (unified diff with @@ hunks).
Co-authored-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Co-authored-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
In the integration tests, we were accessing the API of the Talos machines which are in maintenance mode by directly hitting their SideroLink mgmt endpoint.
This worked only because the test was running on the same host as Omni itself (as we spawned Omni as process). This approach breaks when we install Omni via its helm chart on a Kubernetes cluster.
Fix this by going to them through Omni as well.
Additionally, centralize the talos client creation in the tests.
Additionally: bump Talos machinery, and pass the service account key explicitly to the Talos client when creating it, instead of relying on it to pick it from env vars.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
When the resource compression was disabled in the Omni config, we were not generating the ClusterMachineConfigPatches correctly.
The issue was: it was attempting to "force-compress" the ClusterMachineConfigPatches when any of the patches' size was above the threshold. But when it was trying to do that, it did not override the global setting of false.
The default setting for resource compression is `true`, but when a config file is used to configure Omni, and it was not specified in the config YAML, it was getting overwritten to be `false` due to the boolean merging behavior, which was fixed in https://github.com/siderolabs/omni/pull/2150.
Also: fix the compression kicking in even in cases when it is disabled in config but above the threshold.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Now graceful config rollout is handled by the
`ClusterMachineConfigStatusController`.
It calculates the available update quota by adding finalizers on the
`ClusterMachine` resources. By counting the resources with the
finalizers it tracks the remaining quota.
It now also calculates the pending changes which are not yet applied to
the machine in the `MachinePendingUpdates`.
Pending changes are not yet shown in the UI anywhere.
Fixes: https://github.com/siderolabs/omni/issues/1929
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Implement kernel args support in cluster templates.
Managing kernel args via templates is opt-in: only and only if the `kernelArgs` YAML key is defined on a `Cluster`, `ControlPlane`, `Worker` or `Machine`, the matching `KernelArgs` resource will be created/updated.
Lower levels override higher levels (Cluster -> MachineSet -> Machine).
Unlike other cluster template managed resources, they will never be destroyed, i.e, when they are removed from a template (removed completely, as in, `kernelArgs` key doesn't exist) or when `omnictl cluster template delete` is run. They instead will get updated to have the annotation `omni.sidero.dev/managed-by-cluster-templates` removed from them.
Add the new flag `--include-kernel-args` to the `omnictl cluster template export` command to optionally include them in the exported template. Note: when this flag is set, `kernelArgs` key is always included at per-machine level, not pulled up even if they are the same for all machines in a machine set or a cluster.
Update the frontend, specifically the kernel args update screen to warn the user if kernel args for that machine is managed by templates, similar to what we do for clusters.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Now as `MachineSetNodes` are no longer ever owned by the
`MachineSetNodeController` and marked with
`managed-by-machine-set-node-controller` label instead, CLI tools should
properly handle that and ignore such `MachineSetNodes` during export and
cluster sync.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Omni now supports ECDSA P-256 keys for signing the requests.
The plain key should be encoded as PEM when it is submitted to
`RegisterPublicKey` method.
Signature should be encoded using RFC4754 method (`r||s`).
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
(Re)implement the kernel args support functionality in the following way:
- Only support UKI or UKI-like (>=1.12 with GrubUseUKICmdline) systems.
- In `MachineStatusController`:
- When we see a machine for the first time, do a one-time operation of extracting of the extra kernel args from it and store them in the newly introduced `KernelArgs` resource. This resource is user-owned from that point on.
- Mark the `MachineStatus` with an annotation as "its kernel args are initialized".
- Start storing the the raw schematic.
- Take a one-time snapshot of the extensions on the machine and set them as "initial extensions". They might not be the "actual initial", i.e., the set of extensions when we actually seen the machine for the first time, but we do this in a best-effort basis. We need this, since now we cannot simply go back to the initial schematic ID when all extensions are removed - kernel args are also included in the schematic.
- Start collecting the kernel cmdline from Talos machines as well.
- Adapt the `SchematicConfiguration` controller to not revert to the initial schematic ID ever - it now always computes the needed schematic - when it wants to revert to the initial set of extensions, it uses the new field on the `MachineStatus`.
- Introduce the resource `MachineUpgradeStatus` and its controller `MachineUpgradeStatusController`, which handles the maintenance mode upgrades when kernel args are updated. The controller is named this way, since our long-term plan is to centralize all upgrade calls to be done from this controller. Currently, it does not change Talos version or the set of extensions. It works only in maintenance mode, only for kernel args changes (when supported).
- Introduce the resource `KernelArgsStatus` and its controller `KernelArgsStatusController`, which provides information about the kernel args updates. Its status is reliable in both maintenance and non-maintenance modes.
- Build a UI to update these args (with @Unix4ever's help).
Co-authored-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
default / e2e-forced-removal (push) Has been cancelled
default / e2e-omni-upgrade (push) Has been cancelled
default / e2e-scaling (push) Has been cancelled
default / e2e-short (push) Has been cancelled
default / e2e-short-secureboot (push) Has been cancelled
default / e2e-templates (push) Has been cancelled
default / e2e-upgrades (push) Has been cancelled
default / e2e-workload-proxy (push) Has been cancelled
- Bump some deps, namely cosi-runtime and Talos machinery.
- Update `auditState` to implement the new methods in COSI's `state.State`.
- Bump default Talos and Kubernetes versions to their latest.
- Rekres, which brings Go 1.24.5. Also update it in go.mod files.
- Fix linter errors coming from new linters.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
default / e2e-forced-removal (push) Has been cancelled
default / e2e-scaling (push) Has been cancelled
default / e2e-short (push) Has been cancelled
default / e2e-short-secureboot (push) Has been cancelled
default / e2e-templates (push) Has been cancelled
default / e2e-upgrades (push) Has been cancelled
default / e2e-workload-proxy (push) Has been cancelled
Move the cluster validations which do not require server-side information (access to resources) to a common place.
Use this new validator both from the server side validations and from the cluster templates validations.
This makes the validations consistent, resolving the inconsistency where cluster names (ID) were validated on templates but not by the server.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
default / e2e-backups (push) Blocked by required conditions
default / e2e-forced-removal (push) Blocked by required conditions
default / e2e-scaling (push) Blocked by required conditions
default / e2e-short (push) Blocked by required conditions
default / e2e-short-secureboot (push) Blocked by required conditions
default / e2e-templates (push) Blocked by required conditions
default / e2e-upgrades (push) Blocked by required conditions
default / e2e-workload-proxy (push) Blocked by required conditions
This option is redundant. It was inteded for MCP, but the MCP
implementation will not be using it, so we should stop dragging it along
anymore.
This change was extracted from https://github.com/siderolabs/omni/pull/723
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
default / e2e-forced-removal (push) Has been cancelled
default / e2e-scaling (push) Has been cancelled
default / e2e-short (push) Has been cancelled
default / e2e-short-secureboot (push) Has been cancelled
default / e2e-templates (push) Has been cancelled
default / e2e-upgrades (push) Has been cancelled
default / e2e-workload-proxy (push) Has been cancelled
- The license headers in the generated test sources via `mockgen` were getting commented-out after `make generate` was run.
Fix this by replacing repeated double-slashes `// //` via a single double-slash `//`.
- Rekres, `make generate` and `make generate-frontend`.
- Bump Go deps.
- Fix linting errors to satisfy new rules in golangci-lint `v2.1.1`.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Introduce a new component to watch infra.MachineStatus resources to produce complementary "synthetic" machine events for "powering on" and "powered off" stages.
Change the logic in `MachineStatusSnapshot` and `ClusterMachineStatus` controllers to take these changes into account.
This is required to display the correct status for the machines that are powered on/off by the infra providers.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
default / e2e-forced-removal (push) Has been cancelled
default / e2e-scaling (push) Has been cancelled
default / e2e-short (push) Has been cancelled
default / e2e-short-secureboot (push) Has been cancelled
default / e2e-templates (push) Has been cancelled
default / e2e-upgrades (push) Has been cancelled
default / e2e-workload-proxy (push) Has been cancelled
Introduce the new label `omni.sidero.dev/ready-to-use`, which is now
used in the UI when creating/scaling a cluster and in the machine set
scaling controller instead of `omni.sidero.dev/connected`.
The machine status controller was changed to add the ready to use label
when the machine is created by the bare metal infra provider.
Then the controller looks on the state of the `InfraMachineStatus`
resource and sets `omni.sidero.dev/ready-to-use` label when the
`ReadyToUse` flag is true in the `InfraMachineStatus`.
If the machine is managed by the bare metal infra provider and is not a
part of a cluster we no longer
set `omni.sidero.dev/disconnected` label. It is used only in the UI.
We still remove `connected` label when the machine gets disconnected
from Omni.
Update the `MachineStatus` resource to also include power state of the
machine. Then display it in the UI for the machines list.
Introduce `POWERING_ON` stage in the cluster machine status. Although it
doesn't cover the power on stage fully. It will show powering on only
for a short period until the bare metal provider issues the power on.
We can improve that in the follow-up PR when we start tracking powering
on stage more precisely.
Fixes: https://github.com/siderolabs/omni/issues/774
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
default / e2e-forced-removal (push) Has been cancelled
default / e2e-scaling (push) Has been cancelled
default / e2e-short (push) Has been cancelled
default / e2e-short-secureboot (push) Has been cancelled
default / e2e-templates (push) Has been cancelled
default / e2e-upgrades (push) Has been cancelled
default / e2e-workload-proxy (push) Has been cancelled
Bump Go, rekres (using a build with this fix: https://github.com/siderolabs/kres/pull/464), regenerate sources, comply with the new golangci-lint linters.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
With that it becomes possible to get the machines from the machine
request sets instead of the machine classes.
This opens the way for automated machine provisioning.
Fixes: https://github.com/siderolabs/omni/issues/595
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Display pending machine config updates count for the `MachineSet` and
mark each machine which is locked and has the config update which is not
applied yet.
Fixes: https://github.com/siderolabs/omni/issues/15
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Run a discovery service instance inside Omni (enabled by default).
It listens only on the SideroLink interface on port 8093.
Clusters can opt in to use this embedded discovery service instead of the `discovery.talos.dev`. It is added as a new cluster feature both on frontend and in cluster templates.
Closessiderolabs/omni#20.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
- run rekres and fix nolint directives
- bump deps (keep gen to 0.4.8 for now) for server, client and tests
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Fixes: https://github.com/siderolabs/omni/issues/45
Introduced new resource type `ExtensionsConfiguration` that allows
setting machine extensions list.
`SchematicConfiguration` is now readonly and is created by
`SchematicConfigurationController` from `ExtensionsConfiguration`
resource. It also ensures that schematic exists in the image factory by
calling the API.
This change is required to simplify the flow in the cluster templates
(no need to call `CreateSchematic` for each resource).
Export command support added as well.
Added cleanup hooks for the `ExtensionsConfiguration` for machine set, machine and cluster levels.
Changed the resource format to use `labels` instead of `target`. Now
it's the same as for config patches, except it doesn't merge several
resources, but gets the first one.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Omni is source-available under BUSL.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Co-Authored-By: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Co-Authored-By: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Co-Authored-By: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Co-Authored-By: Philipp Sauter <philipp.sauter@siderolabs.com>
Co-Authored-By: Noel Georgi <git@frezbo.dev>
Co-Authored-By: evgeniybryzh <evgeniybryzh@gmail.com>
Co-Authored-By: Tim Jones <tim.jones@siderolabs.com>
Co-Authored-By: Andrew Rynhard <andrew@rynhard.io>
Co-Authored-By: Spencer Smith <spencer.smith@talos-systems.com>
Co-Authored-By: Christian Rolland <christian.rolland@siderolabs.com>
Co-Authored-By: Gerard de Leeuw <gdeleeuw@leeuwit.nl>
Co-Authored-By: Steve Francis <67986293+steverfrancis@users.noreply.github.com>
Co-Authored-By: Volodymyr Mazurets <volodymyrmazureets@gmail.com>