97 Commits

Author SHA1 Message Date
Utku Ozdemir
2fe716d2c9
chore: enable go linting for build tags, fix linting errors
Add the build tags we were using, `integration` and `tools`, to be included in the linting/formatting of golangci-lint.

Rename the build tag `tools` to `sidero.tools` to avoid colliding with the same named build tag in `github.com/johannesboyne/gofakes3` package - otherwise the dependency was failing to compile due to having multiple package names in the same package.

Fix all the linting errors surfaced by this enablement.

Also, temporarily re-enabled `nolintlint` to find the nolint directives which were no longer necessary and removed them.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-04-29 21:18:45 +02:00
Edward Sammut Alessi
d3592671ec
feat: download talosctl directly from factory
Download talosctl binaries from factory instead of GitHub

Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
2026-04-29 17:06:25 +02:00
Edward Sammut Alessi
c5a4310570
feat(frontend): add support modal to omni
Add a support modal to Omni, providing links to GitHub issues, support, docs, community links, and office hours.

Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
2026-04-23 15:46:42 +02:00
Edward Sammut Alessi
be67f710f8
feat: allow reader access to join token
Explicitly allow readers to read join tokens

Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
2026-04-21 16:28:32 +02:00
Oguz Kilcan
0987fa9e8f
chore: prepare omni with talos v1.13.0-rc
Prepare omni for upcoming talos version 1.13

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2026-04-17 16:58:24 +02:00
Artem Chernyshev
78544a8557
feat: restrict directories for included files in the cluster templates
By default, only allow including files from the same directory where the
template file lives.
This is to prevent malicious cluster templates that include something
like `/etc/passwd`.
Fixes: https://github.com/siderolabs/omni/issues/2590
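
A minimal sketch of this kind of containment check, not the actual Omni implementation. The helper name `isWithinDir` and the example paths are hypothetical. The sketch assumes Unix-style paths, accepts subdirectories of the template directory, and does not resolve symlinks:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isWithinDir reports whether include, resolved relative to baseDir,
// stays inside baseDir. Hypothetical helper, not part of Omni's API.
func isWithinDir(baseDir, include string) (bool, error) {
	absBase, err := filepath.Abs(baseDir)
	if err != nil {
		return false, err
	}

	target := include
	if !filepath.IsAbs(target) {
		target = filepath.Join(absBase, target)
	}

	rel, err := filepath.Rel(absBase, filepath.Clean(target))
	if err != nil {
		return false, err
	}

	// A relative path that climbs out of baseDir starts with "..".
	escapes := rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator))

	return !escapes, nil
}

func main() {
	for _, include := range []string{"patches/cp.yaml", "../secrets.yaml", "/etc/passwd"} {
		ok, err := isWithinDir("/srv/templates", include)
		fmt.Println(include, ok, err)
	}
}
```

Comparing the cleaned relative path instead of a raw string prefix avoids treating `/srv/templates-old` as being inside `/srv/templates`.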

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2026-04-16 19:28:33 +03:00
Edward Sammut Alessi
488b020b2e
feat: add more filters to audit logs
Add multiple new filters to audit logs. Through the UI, there will be a generic search box and the ability to sort columns. Through the CLI, there will be support for the same, plus direct filters for event_type, resource_type, resource_id, cluster_id, and actor.

Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
2026-04-15 11:03:54 +02:00
Utku Ozdemir
590ea2e370
feat: add per-key creation and last-active tracking for service accounts
Add creation timestamps and per-key last-active tracking to service account key listings. The `omnictl serviceaccount list` command now shows KEY CREATED and KEY LAST ACTIVE columns for each public key, alongside the existing SA-level LAST ACTIVE.

A new PublicKeyLastActive resource tracks per-key usage. The activity interceptor now extracts the signing key fingerprint from the auth context and records last-used timestamps per key, with independent debouncing. The ServiceAccountStatusController aggregates this data into the service account status for display.

A cleanup controller removes PublicKeyLastActive resources when their corresponding public key is torn down.

Closes: siderolabs/omni#2661
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-04-14 21:12:30 +02:00
Edward Sammut Alessi
cad3713552
feat: implement eula guard for omni
Implement a guard for Omni to prevent usage until users accept an EULA through the UI or a startup flag.

Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
2026-04-13 16:49:51 +02:00
Oguz Kilcan
9201358b22
chore: bump dependencies and rekres
Bump dependencies, rekres and fix linter issues

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2026-04-07 17:59:48 +02:00
Artem Chernyshev
5db4dbfa08
test: lock prepared for Omni upgrade cluster, then check pending changes
This check will show actual unexpected introduced diffs.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2026-04-01 18:59:29 +03:00
Artem Chernyshev
6efb0f2f0a
feat: support Kubernetes manifests in the cluster templates
Fixes: https://github.com/siderolabs/omni/issues/2172

Leverage kubernetes manifest resources and expose them through cluster
templates.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2026-03-26 14:10:14 +03:00
Artem Chernyshev
ada0360837
feat: add a way to sync Kubernetes manifests in Omni
Manifests support two modes:
- `FULL` - Omni will keep the manifest in sync always.
- `ONE_TIME` - Omni will apply the manifest only if it doesn't exist. If the manifest is removed by hand and then changed in Omni it will be applied too.

Manifests are applied using server-side apply. Omni now has three inventories: `omni-internal-inventory`, `omni-user-inventory` and `omny-sync-one-time`:

- User inventory will be used for user managed manifests.
- Internal one will be used for the manifests which are created by Omni controllers (workloadproxy, advanced healthcheck service).
- One time inventory is used with NoPrune enabled. If the manifest is
  applied it's just removed from the list of applied manifests: that
  ensures that manifest changes are not going to happen.

Manifests also support setting namespace to all namespaced resources. It might be useful for the huge manifest files which are supplied without the namespace (similar to `kubectl apply -n namespace -f manifest.yaml`).
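
The difference between the two modes can be sketched as a small decision function. This is an illustration only: `SyncMode`, `Full`, `OneTime` and `shouldApply` are hypothetical names, and the real controller works through inventories rather than a single boolean:

```go
package main

import "fmt"

// SyncMode mirrors the two modes named in the commit message.
// The Go names are assumptions made for this sketch.
type SyncMode int

const (
	Full SyncMode = iota
	OneTime
)

// shouldApply decides whether a manifest is applied on a sync pass.
// existsInCluster says whether the object is currently present in the cluster.
func shouldApply(mode SyncMode, existsInCluster bool) bool {
	switch mode {
	case Full:
		return true // always keep the manifest in sync
	case OneTime:
		return !existsInCluster // apply only when the object is missing
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldApply(Full, true))     // true
	fmt.Println(shouldApply(OneTime, true))  // false
	fmt.Println(shouldApply(OneTime, false)) // true
}
```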

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2026-03-23 15:29:49 +03:00
Utku Ozdemir
2977f05381
feat: allow empty subdomain for workload proxy
Allow setting the workload proxy subdomain to an empty string when useOmniSubdomain is true. This exposes services directly as subdomains of Omni (e.g., grafana.omni.example.com), which is the simplest possible setup for on-prem deployments needing only a wildcard DNS and cert on the Omni domain.

Continuation of https://github.com/siderolabs/omni/pull/2538.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-03-19 12:07:38 +01:00
Oguz Kilcan
6370b41c9a
test: re-fetch machine IPs in AssertTalosVersion retry loop
During ScaleUpAndDown, machines being removed still have ClusterMachineIdentity resources when the version check starts. The test collected IPs once upfront, then spent 2 minutes trying to reach a machine whose TLS identity was already invalidated, causing x509 errors until the timeout.

Re-fetch ClusterMachineIdentity on each retry iteration so that destroyed machines drop out of the IP list naturally.

Also fix clearConnectionRefused: replace the manual ctx.Done() check with RetryWithContext. The old code returned a plain fmt.Errorf on timeout, which fell through as a non-retryable error due to a race between the context deadline and the retry loop's own timeout.

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2026-03-16 15:04:23 +01:00
Oguz Kilcan
cf7d752453
feat: enforce configurable machine registration limit
Add `account.maxRegisteredMachines` config option to cap the number of registered machines. The provision handler atomically checks the limit under a mutex before creating new Link resources, returning ResourceExhausted when the cap is reached.

Introduce a Notification resource type (ephemeral namespace) so controllers can surface warnings to users. `omnictl` displays all active notifications on every command invocation. Frontend part of showing notifications will be implemented in a different PR.

MachineStatusMetricsController creates a warning notification when the registration limit is reached and tears it down when it's not.

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2026-03-16 12:48:47 +01:00
Artem Chernyshev
385c512d4c
test: fix ConfigPatching test
Accidentally added the check which was intended for another test case.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2026-03-12 16:42:31 +03:00
Artem Chernyshev
31e13e9e39
fix: do not release lock on apply config fails
The code there was also incorrect: it was skipping setting the
`LastError` on the `ClusterMachineConfigStatus` resource.
Also add an integration test to verify that invalid config errors are properly
reported.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2026-03-10 19:49:33 +03:00
Artem Chernyshev
f8a42eeb04
chore: move graceful upgrades to the lowest level
Rewrite `TalosUpgradeStatus` controller to use a completely different
flow:
- update all `ClusterMachineTalosVersion` resources immediately.
- to control quotas and rollout sequence use `UpgradeRollout` resource,
  it has a single field which is a map of MachineSetName -> Current
  Quota:
  - if control plane is updating it sets quota 0 on all other machine
    sets.
  - the number of not running/unhealthy machines is subtracted from the
    quota.
  - quota is now copied from the new `UpgradeStrategy`, so it's possible
    to have more than one machine updated in parallel.
- `ClusterMachineConfigStatus` controller now adds a new finalizer for
  upgrades on all `ClusterMachines` which are currently being updated to
  acquire/release locks and reads quotas from the `UpgradeRollout`.
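
The quota rules above can be sketched as follows. This is an illustration under assumptions, not the controller's code: the `machineSet` struct, its fields and `computeQuotas` are hypothetical, and clamping negative quotas to zero is a choice made for the sketch:

```go
package main

import "fmt"

// machineSet holds the inputs the quota rules need. Hypothetical type.
type machineSet struct {
	Name         string
	ControlPlane bool
	Updating     bool
	MaxParallel  int // copied from the upgrade strategy
	Unavailable  int // machines that are not running or are unhealthy
}

// computeQuotas returns a map of machine set name to current quota.
func computeQuotas(sets []machineSet) map[string]int {
	controlPlaneUpdating := false

	for _, s := range sets {
		if s.ControlPlane && s.Updating {
			controlPlaneUpdating = true
		}
	}

	quotas := make(map[string]int, len(sets))

	for _, s := range sets {
		quota := s.MaxParallel - s.Unavailable
		if quota < 0 {
			quota = 0 // assumption: a quota never goes negative
		}

		if controlPlaneUpdating && !s.ControlPlane {
			quota = 0 // other machine sets wait for the control plane
		}

		quotas[s.Name] = quota
	}

	return quotas
}

func main() {
	quotas := computeQuotas([]machineSet{
		{Name: "control-planes", ControlPlane: true, Updating: true, MaxParallel: 1},
		{Name: "workers", MaxParallel: 3, Unavailable: 1},
	})

	fmt.Println(quotas["control-planes"], quotas["workers"]) // 1 0
}
```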

Fixes: https://github.com/siderolabs/omni/issues/2393

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2026-03-03 20:02:59 +03:00
Oguz Kilcan
6d03fc7cdb
feat: track user and service account last activity
* Add `IdentityLastActive` resource to record the last time each identity (`User`/`ServiceAccount`) made a gRPC call.
* Add `IdentityStatusController` to aggregate identity, user role, and last-active data into an ephemeral `IdentityStatus` resource.
* Expose last_active in ListUsers/ListServiceAccounts gRPC responses, omnictl CLI output, and the frontend Users/ServiceAccounts views.
* Add `UserMetricsController` exposing `omni_users` (total) and `omni_active_users` (7d/30d windows) Prometheus gauges.

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2026-03-03 13:53:29 +01:00
Oguz Kilcan
e3df911d48
feat: enforce configurable limits on user and service account creation
Add state validation that rejects identity creation when the configured maximum number of users or service accounts is reached. The gRPC resource and management servers now use the validated state so these limits are enforced for all creation paths (CLI, UI, API). Identity is created before the user resource so the validation fires before any side effects.

Also adds create validation for join token name, e2e Playwright tests covering UI and AccountLimits integration test covering API and CLI for limit enforcement.

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2026-02-26 13:47:52 +01:00
Oguz Kilcan
da60807d48
feat: add ManagementService gRPC endpoints for user operations
Migrate user create, list, update, and destroy operations from direct resource manipulation to dedicated ManagementService gRPC endpoints, matching the existing service account pattern.
Direct Identity/User resource mutations are now restricted, and the CLI, frontend, and client library are updated to use the new endpoints.

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2026-02-26 09:33:27 +01:00
Artem Chernyshev
69c2759b8b
fix: break the dep loop in the cluster machine config status controller
Extract the fields required by the `MachineConfigStatusController` to a
separate resource.
Otherwise there's a circular loop: `MachinePendingUpdates` ->
`MachineSetStatus` -> `MachineConfigStatus` -> `MachinePendingUpdates`...

Also change the way machine pending is calculated: do not delete the
pending machine updates resource if the Talos version/schematic is not
in sync.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2026-02-17 00:28:32 +03:00
Utku Ozdemir
fbf36740f2
test: add unit and e2e tests to the helm chart
Add helm unit tests (via helm-unittest) covering services, ingresses, HTTPRoutes, secrets, PrometheusRules and ServiceAccounts. Add a helm-based e2e test workflow that deploys Omni on a Talos cluster with Traefik and etcd, runs integration tests including workload proxy, and verifies the full stack end-to-end. Add a configurable TestOptions struct to the workload proxy test to allow running with smaller scale in helm e2e.

Signed-off-by: Kevin Tijssen <kevin.tijssen@siderolabs.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-02-16 13:58:56 +01:00
Oguz Kilcan
afdf123e29
feat: add support for Kubernetes CA rotation
Add support for Kubernetes CA rotation

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2026-02-14 11:32:00 +01:00
Utku Ozdemir
30d17dcf6d
chore: update Go to 1.26 in go.mod, rekres, fix linting issues
Update Go in go.mod to keep it consistent with the value in the Makefile (the actual Go version the project is built with).

This enables some new linters and causes existing linters to change behavior. Reformat and fix all those linting issues.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-02-13 10:58:59 +01:00
Utku Ozdemir
868f8ac1e7
test: reach maintenance mode machines' Talos API through Omni in tests
In the integration tests, we were accessing the API of the Talos machines which are in maintenance mode by directly hitting their SideroLink mgmt endpoint.

This worked only because the test was running on the same host as Omni itself (as we spawned Omni as a process). This approach breaks when we install Omni via its helm chart on a Kubernetes cluster.

Fix this by going to them through Omni as well.
Additionally, centralize the talos client creation in the tests.

Additionally: bump Talos machinery, and pass the service account key explicitly to the Talos client when creating it, instead of relying on it to pick it from env vars.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-02-12 10:20:59 +01:00
Utku Ozdemir
ef3e3bc1cc
test: use automation sa directly in integration tests
Instead of doing the fake user auth flow in the integration tests via the `clientconfig` package, use the automation service account directly. Remove all other usages of that package as well, and drop it completely.

The package predates the initial service account token feature of Omni, its purpose was to authenticate to the Omni API in the integration tests. We have the automation key now, so we don't need that anymore.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-02-11 19:26:46 +01:00
Utku Ozdemir
f3cdbda7e0
refactor: remove global config, inject it to services
Part of the effort to improve Omni codebase, reduce the usage of globals.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-02-09 14:16:02 +01:00
Utku Ozdemir
4cc3a3da8f
test: do not check for empty wipe id in static infra provider test
Wipe ID on `InfraMachine` resources is empty only when a machine was **never needed to be wiped**, i.e., was never allocated to and then de-allocated from a cluster.

This is not always the case in the bare metal infra provider tests, as it runs both `ConfigPatching` and `StaticInfraProvider` integration tests at the same time. Sometimes, the latter test picked machines which were released by the former test, and those machines were already wiped at least once.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-02-04 12:09:45 +01:00
Utku Ozdemir
c319d7bcf2
fix: fix schematic generation for machines in agent mode
We had an issue with the bare metal provider where two different schematic IDs would fight each other, causing the machine to get installed with a wrong schematic ID, only to be upgraded to the correct one immediately, and in some cases, go into an upgrade loop between a correct and an incorrect schematic.

The cause: Omni trusted the schematics it observed when a machine in agent mode dialed in, and stored the information it received (like kernel args and initial schematic info). This was wrong, as agent mode information is essentially meaningless.

Fix this by changing the simple check of "was the schematic info for machine X ever observed" to be "is the schematic info for machine X ready". The readiness check involves schematic being populated and machine not being in agent mode.

This change caused the `SchematicConfiguration` resource to not be generated before the machine leaves agent mode, which had a side effect: `InfraMachineController` would not receive the Talos version from it and would not populate it on the `InfraMachine` resource. As a result, the BM provider would never get notified that the machine is allocated to a cluster, and would not power it on (to PXE boot it to "regular" Talos, for it to receive the "install" call to Omni).

Change that controller to get the Talos version info directly from the Cluster resource.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-02-03 11:46:15 +01:00
Oguz Kilcan
c6cc25c73c
feat: add support for Talos CA rotation
Add support for Talos CA rotation

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2026-01-30 09:59:25 +01:00
Utku Ozdemir
a5795c2fa4
feat: add config descriptions in schema, use them in flags
Rework root command to get the flag descriptions from the JSON schema.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-01-27 14:50:25 +01:00
Utku Ozdemir
91c8bff46c
feat: generate omni config from schema
Make all leaf fields nillable, so that we can distinguish unset from explicit empty, and merging of CLI args and YAML configs works correctly.

Generate nil-safe accessors (getter/setters) for these nillable fields and use them in the code.

Wrap the cobra command line parser to support nillable flags.

Move all validations into the JSON schema and drop go-validator usage and its annotations.
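
A minimal sketch of why nillable leaf fields matter for merging. The `Config` struct, its fields, the `ptr` helper and `merge` are hypothetical and hand-written here; the actual config types and accessors are generated from the schema:

```go
package main

import "fmt"

// Config uses pointer fields so that "unset" (nil) differs from
// "explicitly empty" (pointer to the zero value). Hypothetical type.
type Config struct {
	Name *string
	Port *int
}

func ptr[T any](v T) *T { return &v }

// GetName is a nil-safe accessor: it returns the zero value when unset.
func (c Config) GetName() string {
	if c.Name == nil {
		return ""
	}

	return *c.Name
}

// GetPort is a nil-safe accessor: it returns the zero value when unset.
func (c Config) GetPort() int {
	if c.Port == nil {
		return 0
	}

	return *c.Port
}

// merge overlays only the fields that override sets explicitly.
func merge(base, override Config) Config {
	merged := base

	if override.Name != nil {
		merged.Name = override.Name
	}

	if override.Port != nil {
		merged.Port = override.Port
	}

	return merged
}

func main() {
	fromYAML := Config{Name: ptr("omni"), Port: ptr(8080)}
	fromCLI := Config{Name: ptr("")} // explicit empty name, port left unset

	merged := merge(fromYAML, fromCLI)
	fmt.Printf("name=%q port=%d\n", merged.GetName(), merged.GetPort()) // name="" port=8080
}
```

With plain value fields, the empty name from the CLI would be indistinguishable from "not provided" and the YAML value would win by mistake.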

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-01-22 13:23:11 +01:00
Edward Sammut Alessi
d3ae77c0cc
chore: bump copyright to 2026
Bump copyright for conformance to 2026

Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
2026-01-21 15:30:49 +01:00
Artem Chernyshev
41506f72f8
chore: move graceful config rollout logic to the lowest controller level
Now graceful config rollout is handled by the
`ClusterMachineConfigStatusController`.
It calculates the available update quota by adding finalizers on the
`ClusterMachine` resources. By counting the resources with the
finalizers it tracks the remaining quota.
It now also calculates the pending changes which are not yet applied to
the machine in the `MachinePendingUpdates`.

Pending changes are not yet shown in the UI anywhere.

Fixes: https://github.com/siderolabs/omni/issues/1929

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2026-01-19 16:30:28 +03:00
Oguz Kilcan
2d5e58cbac
chore: rekres and bump deps
* rekres
* bump deps
* bump go to 1.25.6
* fix linter errors

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2026-01-16 11:15:02 +01:00
Pranav Patil
c6aaff0f9e
refactor: make namespace implicit in auth package
Simplify the code and make it less error prone.

Signed-off-by: Pranav Patil <pranavppatil767@gmail.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-01-14 21:07:33 +01:00
Pranav Patil
dff8e1f64d
feat: make namespace implicit in k8s and oidc package NewResource functions
Refactored NewKubernetesResource and NewJWTPublicKey to use implicit namespace

Signed-off-by: Pranav Patil <pranavppatil767@gmail.com>
2026-01-14 11:29:03 +01:00
Utku Ozdemir
4db838196c
test: remove machine.install.extraKernelArgs from infra machines
With Talos 1.12, `.machine.install.extraKernelArgs` is not the right way of setting kernel args. Remove that from the infra machines (bare metal infra provider) tests.

Remove the `disableKexec` bool argument from the function, as it was always set to true.

Set the kernel arg to disable kexec in correct format, as `sysctl.kernel.kexec_load_disabled=1`, not `kexec_load_disabled=1` (was effectively no-op).

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-01-13 20:31:35 +01:00
Pranav Patil
de6e2c66f7
refactor: make namespace implicit in omni resources
Refactor for code simplicity.

Signed-off-by: Pranav Patil <pranavppatil767@gmail.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-01-12 12:54:11 +01:00
Pranav Patil
9503f850cc
refactor: make namespace implicit in siderolink resources
Refactor for code simplicity.

Signed-off-by: Pranav Patil <pranavppatil767@gmail.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-01-12 10:42:29 +01:00
Oguz Kilcan
ef2d931aac
chore: rekres and bump deps
* Rekres
* Bump deps
* Update default versions for talos and kubernetes

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2026-01-09 11:34:03 +01:00
Pranav Patil
55fd33db39
refactor: make namespace implicit in system & virtual resources
Refactor for code simplicity.

Signed-off-by: Pranav Patil <pranavppatil767@gmail.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-01-08 11:40:25 +01:00
Utku Ozdemir
0be460205b
test: improve test stability
Fix a few things in tests:
- Add the forgotten `claimMachines` calls to a few integration tests
- When picking unallocated machines in integration tests, ensure that they are unallocated by checking that there is no corresponding `MachineSetNode` resource. Previous check on the `Available` label on `MachineStatus` resource was inherently racy, as that label is set by a controller asynchronously after a machine was "picked".
- Fix the flake in TalosUpgradeStatus unit test: it was skipping reconciliation because the `SchematicConfiguration` resource was missing the cluster label, but at the same time it was not failing reliably, as it was not asserting the completion of one upgrade before starting the next one. Fix both issues.
- Fix a crash in TalosUpgradeStatusController - it was failing to read back the `ClusterMachineTalosVersion` resource it just created because it was not yet available in the controller runtime cache. Instead of reading it back after writing, simply return the created resource reference.

Co-authored-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-01-07 17:19:17 +01:00
Utku Ozdemir
535d733ea6
chore: drop migrations older than v1.1.0
Drop old migrations and deprecated types which were kept only for the migrations.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2026-01-06 14:50:11 +01:00
Edward Sammut Alessi
5c98d44bdf
chore: implement InstallationMediaConfig resource
This resource is going to be used to store the saved installation media
presets generated by the UI wizard.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-12-29 17:41:45 +01:00
Artem Chernyshev
36c20175e6
fix: ignore labeled MachineSetNodes in the export and sync CLI cmds
Now that `MachineSetNodes` are no longer ever owned by the
`MachineSetNodeController` and marked with
`managed-by-machine-set-node-controller` label instead, CLI tools should
properly handle that and ignore such `MachineSetNodes` during export and
cluster sync.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-12-23 20:39:19 +03:00
Artem Chernyshev
ee926cd9eb
feat: add a way to switch gRPC tunnel mode for the connected machines
Fixes: https://github.com/siderolabs/omni/issues/1816

Introduce a new command:

```
omnictl configure machine <id> --siderolink-connection=[udp|http-tunnel|auto]
```

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-12-12 22:59:33 +03:00
Utku Ozdemir
9bf690ef2e
refactor: do SQLite migrations unconditionally, rework the config flags
Remove the flags for turning on SQLite storage for:
- Discovery service state
- Audit logs
- Machine logs

Instead, migrate them unconditionally to SQLite on the next startup.

Remove many flags which are no longer meaningful. Only keep the ones which are required for the migrations.

Additionally: Make the `--sqlite-storage-path` (or its config counterpart `.storage.sqlite.path`) required with no default value, as a default value does not make sense for it in most of the cases.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2025-12-12 12:47:04 +01:00