Add `account.maxRegisteredMachines` config option to cap the number of registered machines. The provision handler atomically checks the limit under a mutex before creating new Link resources, returning ResourceExhausted when the cap is reached.
Introduce a Notification resource type (ephemeral namespace) so controllers can surface warnings to users. `omnictl` displays all active notifications on every command invocation. The frontend part of showing notifications will be implemented in a separate PR.
MachineStatusMetricsController creates a warning notification when the registration limit is reached and removes it once the machine count drops back below the limit.
Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
Backend now automatically switches between legacy and SSA modes for
different Talos versions.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Support isolated OIDC token cache directories in generated `kubeconfig`s to prevent token conflicts when switching between users/clusters. Configurable via server flags and the omnictl flags `--oidc-cache-base-dir` and `--oidc-cache-isolation`.
Also upgrade the exec credential API to v1 and add the `interactiveMode` field.
Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
Rewrite the `TalosUpgradeStatus` controller to use a completely different flow:
- Update all `ClusterMachineTalosVersion` resources immediately.
- Control quotas and the rollout sequence with the `UpgradeRollout` resource;
  it has a single field, a map of MachineSetName -> current quota:
  - if the control plane is updating, set quota 0 on all other machine sets;
  - the number of not-running/unhealthy machines is subtracted from the quota;
  - the quota is now copied from the new `UpgradeStrategy`, so it's possible
    to have more than one machine updated in parallel.
- The `ClusterMachineConfigStatus` controller now adds a new finalizer for
  upgrades on all `ClusterMachines` which are currently being updated to
  acquire/release locks, and reads quotas from the `UpgradeRollout`.
Fixes: https://github.com/siderolabs/omni/issues/2393
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
* Add `IdentityLastActive` resource to record the last time each identity (`User`/`ServiceAccount`) made a gRPC call.
* Add `IdentityStatusController` to aggregate identity, user role, and last-active data into an ephemeral `IdentityStatus` resource.
* Expose last_active in ListUsers/ListServiceAccounts gRPC responses, omnictl CLI output, and the frontend Users/ServiceAccounts views.
* Add `UserMetricsController` exposing `omni_users` (total) and `omni_active_users` (7d/30d windows) Prometheus gauges.
Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
Add `talos_version` and `kubernetes_version` to `ClusterStatusSpec`, so that consumers don't need to also query `ClusterSpec`.
Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
Migrate user create, list, update, and destroy operations from direct resource manipulation to dedicated ManagementService gRPC endpoints, matching the existing service account pattern.
Direct Identity/User resource mutations are now restricted, and the CLI, frontend, and client library are updated to use the new endpoints.
Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
This allows token rotation and disaster recovery if the token gets
rejected by Omni.
Introduced a new CLI command for that:
```
omnictl configure machine <id> --reset-node-unique-token
```
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Extract the fields required by the `MachineConfigStatusController` into a
separate resource.
Otherwise there's a circular dependency loop: `MachinePendingUpdates` ->
`MachineSetStatus` -> `MachineConfigStatus` -> `MachinePendingUpdates`...
Also change the way pending machine updates are calculated: do not delete
the pending machine updates resource if the Talos version/schematic is not
in sync.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
The schematic comparison logic had an edge case: if a machine predates the image factory, it is installed via a `ghcr.io` installer image (or a custom one). Those machines do not have the schematic meta extension on them, and Omni creates a synthetic schematic ID and properties for those. These properties do not have the "actual" kernel args of the machine, but rather, Omni sets them as what it thinks they should be (the "correct" siderolink args from the Omni perspective).
Later, if Omni's SideroLink API advertised URL gets updated, it wrongly detects those synthetic kernel args as the "new ones" (with the new URL), so the desired vs. actual schematic comparison returns a mismatch, and Omni performs an unnecessary upgrade on that machine.
Fix this by using the "current (non-protected) args of the machine" as the synthetic args in such cases. Those "current" args will be synthetic themselves (since we cannot read them from the machine, as it does not have schematic info on it), but this will prevent changes when the advertised URL changes.
Additionally, we have two checks to detect a schematic mismatch in the `ClusterMachineConfigStatus` controller - make them check the mismatch in the same way, to be more consistent.
Unrelated to this bug, also fix the `SchematicReady` check (introduced in 1.5) to treat invalid schematics as valid, as otherwise we cannot create clusters from non-factory images.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
We had an issue with the bare metal provider where two different schematic IDs would fight each other, causing a machine to get installed with the wrong schematic ID only to be upgraded to the correct one immediately, and in some cases to enter an upgrade loop between the correct and the incorrect schematic.
The cause: Omni recorded the schematic information it observed when a machine dialed in while in agent mode (e.g., kernel args and initial schematic info). This was wrong, as agent-mode information is essentially meaningless.
Fix this by changing the simple check of "was the schematic info for machine X ever observed" to "is the schematic info for machine X ready". The readiness check requires the schematic to be populated and the machine to not be in agent mode.
This change caused the `SchematicConfiguration` resource to not be generated before the machine leaves agent mode, with a side effect: `InfraMachineController` would not receive the Talos version from it and would not populate it on the `InfraMachine` resource. The BM provider would therefore never be notified that the machine was allocated to a cluster, and would not power it on (to PXE boot it into "regular" Talos, so that it can receive the "install" call from Omni).
Change that controller to get the Talos version info directly from the Cluster resource.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Fix a bug where the arch set to AMD64, which is enum value 0, was being omitted in responses to the frontend.
Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
When resource compression was disabled in the Omni config, we were not generating the `ClusterMachineConfigPatches` correctly.
The issue: the code attempted to "force-compress" the `ClusterMachineConfigPatches` when any patch's size was above the threshold, but in doing so it did not override the global setting of `false`.
The default setting for resource compression is `true`, but when a config file was used to configure Omni and the setting was not specified in the config YAML, it was overwritten to `false` due to the boolean merging behavior, which was fixed in https://github.com/siderolabs/omni/pull/2150.
Also fix compression kicking in when it is disabled in the config but the size is above the threshold.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Graceful config rollout is now handled by the
`ClusterMachineConfigStatusController`.
It calculates the available update quota by adding finalizers to the
`ClusterMachine` resources: by counting the resources with the
finalizers, it tracks the remaining quota.
It also calculates the pending changes which are not yet applied to
the machine in the `MachinePendingUpdates` resource.
Pending changes are not yet shown anywhere in the UI.
Fixes: https://github.com/siderolabs/omni/issues/1929
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Add the `PlatformMetalID` constant to the frontend and use it as an ID where relevant; update some places in the backend with the same idea. Also replace some lingering uses of List requests in places where Get requests were better suited.
Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
Extract schematic generation and download links from the confirmation step of the installation media wizard to allow for re-use inside the download modal of the list view.
Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
This resource is going to be used to store the saved installation media
presets generated by the UI wizard.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Now it's possible to pass the `overlay` ID directly in the request.
`MediaId` is still supported, but only for backward compatibility.
`InstallationMedia` resources will be used only in `omnictl download`.
Updated the wizard UI to no longer use `InstallationMedia` resources.
Dropped `pxe_url` from the `CreateSchematic` response, as all required
arguments are now on the client side (when not using `InstallationMedia`
resources).
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Add a new flag for the minimum number of committed machines. This data comes from Stripe; if the user's Omni environment has fewer machines than the committed amount, we report the specified minimum.
Signed-off-by: Spencer Smith <spencer.smith@talos-systems.com>
Return the final schematic YAML to display in the frontend when creating installation media.
Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
Add labels for the assigned cluster and connection status to the
`omni_machines_version` metric.
Closes #1967
Signed-off-by: Tim Jones <tim.jones@siderolabs.com>
Adjust the secure boot support check in the machine arch step to match how it works in the image factory.
Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
Make the `MachineSetNodeController` create `MachineSetNode` resources
without an owner.
Fixes: https://github.com/siderolabs/omni/issues/1450
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Present all of that as three kinds of virtual resources:
- `MetalPlatformConfig`
- `CloudPlatformConfig`
- `SBCConfig`
Virtual resources support `Get` and `List` operations.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
We compute the schematic ID for a machine in two different places: in the `SchematicConfigurationController` for allocated machines, and in the `MachineUpgradeStatusController` for maintenance-mode machines.
Centralize this computation to be done only in `SchematicConfigurationController`.
Change the lifecycle of the `SchematicConfiguration` resource to be bound to a machine, not to a cluster.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Added a new `update_on_each_login` field to the `SAMLLabelRule` spec.
Also renamed `assign_role_on_registration` to `assign_role`, as the old
name no longer reflected the actual meaning.
The old field is kept for backward compatibility.
Fixes: https://github.com/siderolabs/omni/issues/1201
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Omni now supports ECDSA P-256 keys for signing requests.
The plain key should be PEM-encoded when it is submitted to the
`RegisterPublicKey` method.
The signature should be encoded using the RFC 4754 method (`r||s`).
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
(Re)implement the kernel args support functionality in the following way:
- Only support UKI or UKI-like (>=1.12 with GrubUseUKICmdline) systems.
- In `MachineStatusController`:
  - When we see a machine for the first time, extract the extra kernel args from it as a one-time operation and store them in the newly introduced `KernelArgs` resource. This resource is user-owned from that point on.
- Mark the `MachineStatus` with an annotation as "its kernel args are initialized".
  - Start storing the raw schematic.
  - Take a one-time snapshot of the extensions on the machine and set them as "initial extensions". They might not be the "actual initial" set, i.e., the extensions present when we first saw the machine, but we do this on a best-effort basis. We need this because we can no longer simply go back to the initial schematic ID when all extensions are removed - kernel args are also included in the schematic.
- Start collecting the kernel cmdline from Talos machines as well.
- Adapt the `SchematicConfiguration` controller to never revert to the initial schematic ID: it now always computes the needed schematic, and when it wants to revert to the initial set of extensions, it uses the new field on the `MachineStatus`.
- Introduce the resource `MachineUpgradeStatus` and its controller `MachineUpgradeStatusController`, which handles the maintenance mode upgrades when kernel args are updated. The controller is named this way, since our long-term plan is to centralize all upgrade calls to be done from this controller. Currently, it does not change Talos version or the set of extensions. It works only in maintenance mode, only for kernel args changes (when supported).
- Introduce the resource `KernelArgsStatus` and its controller `KernelArgsStatusController`, which provides information about the kernel args updates. Its status is reliable in both maintenance and non-maintenance modes.
- Build a UI to update these args (with @Unix4ever's help).
Co-authored-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Add a link to Stripe in the Omni settings sidebar if Stripe is enabled in Omni.
Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
Add metrics for enabled cluster features and for various machine properties:
```text
# HELP omni_cluster_features Number of clusters with specific features enabled.
# TYPE omni_cluster_features gauge
omni_cluster_features{feature="disk_encryption"} 1
omni_cluster_features{feature="embedded_discovery_service"} 0
omni_cluster_features{feature="workload_proxy"} 1
# HELP omni_machine_platforms Number of machines in the instance by platform.
# TYPE omni_machine_platforms gauge
omni_machine_platforms{platform="akamai"} 0
omni_machine_platforms{platform="aws"} 0
omni_machine_platforms{platform="azure"} 0
omni_machine_platforms{platform="cloudstack"} 0
omni_machine_platforms{platform="digital-ocean"} 0
omni_machine_platforms{platform="equinixMetal"} 0
omni_machine_platforms{platform="exoscale"} 0
omni_machine_platforms{platform="gcp"} 0
omni_machine_platforms{platform="hcloud"} 0
omni_machine_platforms{platform="metal"} 10
omni_machine_platforms{platform="nocloud"} 0
omni_machine_platforms{platform="opennebula"} 0
omni_machine_platforms{platform="openstack"} 0
omni_machine_platforms{platform="oracle"} 0
omni_machine_platforms{platform="scaleway"} 0
omni_machine_platforms{platform="upcloud"} 0
omni_machine_platforms{platform="vmware"} 0
omni_machine_platforms{platform="vultr"} 0
# HELP omni_machine_secure_boot_status Number of machines in the instance by secure boot status.
# TYPE omni_machine_secure_boot_status gauge
omni_machine_secure_boot_status{enabled="false"} 10
omni_machine_secure_boot_status{enabled="true"} 0
omni_machine_secure_boot_status{enabled="unknown"} 0
# HELP omni_machine_uki_status Number of machines in the instance by UKI (Unified Kernel Image) status.
# TYPE omni_machine_uki_status gauge
omni_machine_uki_status{booted_with_uki="false"} 0
omni_machine_uki_status{booted_with_uki="true"} 10
omni_machine_uki_status{booted_with_uki="unknown"} 0
```
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>