Include the build tags we were using, `integration` and `tools`, in golangci-lint's linting/formatting.
Rename the build tag `tools` to `sidero.tools` to avoid colliding with the build tag of the same name in the `github.com/johannesboyne/gofakes3` package - otherwise the dependency failed to compile due to multiple package names ending up in the same package.
Fix all the linting errors surfaced by this enablement.
Also, temporarily re-enable `nolintlint` to find the `nolint` directives which were no longer necessary, and remove them.
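For context, a build-tagged tools file with the renamed tag would look roughly like the following (a minimal sketch; the actual file and tool list in Omni differ):

```go
//go:build sidero.tools

// Package tools pins tool dependencies so that `go mod tidy` keeps them in go.mod.
package tools

import (
	_ "golang.org/x/tools/cmd/goimports" // example tool dependency, not Omni's actual list
)
```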
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Add a support modal to Omni, providing links to GitHub issues, support, docs, community channels, and office hours.
Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
By default, only allow including files from the directory where the template file lives.
This is to prevent malicious cluster templates that include something
like `/etc/passwd`.
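A minimal sketch of such a check (names are illustrative, not Omni's actual implementation):

```go
package template

import "path/filepath"

// allowInclude reports whether include refers to a file directly inside
// templateDir, rejecting absolute paths and ".." traversal such as
// "/etc/passwd" or "../../etc/passwd".
func allowInclude(templateDir, include string) bool {
	if filepath.IsAbs(include) {
		return false
	}

	resolved := filepath.Join(templateDir, include) // Join also cleans the path

	return filepath.Dir(resolved) == filepath.Clean(templateDir)
}
```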
Fixes: https://github.com/siderolabs/omni/issues/2590
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Add multiple new filters to audit logs. Through the UI, there will be a generic search box and the ability to sort columns. Through the CLI, there will be support for the same, plus direct filters for event_type, resource_type, resource_id, cluster_id, and actor.
Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
Add creation timestamps and per-key last-active tracking to service account key listings. The `omnictl serviceaccount list` command now shows KEY CREATED and KEY LAST ACTIVE columns for each public key, alongside the existing SA-level LAST ACTIVE.
A new PublicKeyLastActive resource tracks per-key usage. The activity interceptor now extracts the signing key fingerprint from the auth context and records last-used timestamps per key, with independent debouncing. The ServiceAccountStatusController aggregates this data into the service account status for display.
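A rough sketch of the per-key debouncing (hypothetical types and names; the real interceptor pulls the fingerprint from the auth context and persists timestamps via the state):

```go
package auth

import (
	"sync"
	"time"
)

// lastActiveTracker debounces "last active" updates so that frequent API calls
// don't turn into a state write per request.
type lastActiveTracker struct {
	mu       sync.Mutex
	debounce time.Duration
	lastSeen map[string]time.Time // keyed by public key fingerprint
}

func newLastActiveTracker(debounce time.Duration) *lastActiveTracker {
	return &lastActiveTracker{
		debounce: debounce,
		lastSeen: map[string]time.Time{},
	}
}

// Touch reports whether the timestamp for the given key fingerprint should be
// persisted now, i.e. the previously recorded value is older than the debounce window.
func (t *lastActiveTracker) Touch(fingerprint string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()

	if last, ok := t.lastSeen[fingerprint]; ok && now.Sub(last) < t.debounce {
		return false
	}

	t.lastSeen[fingerprint] = now

	return true
}
```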
A cleanup controller removes PublicKeyLastActive resources when their corresponding public key is torn down.
Closes: siderolabs/omni#2661
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Implement a guard for Omni to prevent usage until users accept an EULA, either through the UI or via a startup flag.
Signed-off-by: Edward Sammut Alessi <edward.sammutalessi@siderolabs.com>
Manifests support two modes:
- `FULL` - Omni will always keep the manifest in sync.
- `ONE_TIME` - Omni will apply the manifest only if it doesn't exist. If the manifest is removed by hand and then changed in Omni, it will be applied again.
Manifests are applied using server-side apply. Omni now has three inventories: `omni-internal-inventory`, `omni-user-inventory` and `omni-sync-one-time`:
- The user inventory is used for user-managed manifests.
- The internal one is used for manifests created by Omni controllers (workloadproxy, advanced healthcheck service).
- The one-time inventory is used with NoPrune enabled. Once a manifest is applied, it is simply removed from the list of applied manifests: that ensures no further changes to it will happen.
Manifests also support setting a namespace on all namespaced resources. This can be useful for huge manifest files that are supplied without a namespace (similar to `kubectl apply -n namespace -f manifest.yaml`).
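A simplified sketch of the namespace defaulting, assuming the manifests are decoded into `unstructured.Unstructured` objects (the real implementation also has to decide whether a resource kind is namespaced, e.g. via a REST mapper, so that callback is left abstract here):

```go
package manifests

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// setDefaultNamespace fills in the namespace on namespaced objects that don't
// specify one, mirroring the behavior of `kubectl apply -n <namespace>`.
func setDefaultNamespace(
	objects []*unstructured.Unstructured,
	namespace string,
	isNamespaced func(*unstructured.Unstructured) bool, // hypothetical scope check
) {
	for _, obj := range objects {
		if obj.GetNamespace() == "" && isNamespaced(obj) {
			obj.SetNamespace(namespace)
		}
	}
}
```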
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Allow setting the workload proxy subdomain to an empty string when useOmniSubdomain is true. This exposes services directly as subdomains of Omni (e.g., grafana.omni.example.com), which is the simplest possible setup for on-prem deployments needing only a wildcard DNS and cert on the Omni domain.
Continuation of https://github.com/siderolabs/omni/pull/2538.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
During ScaleUpAndDown, machines being removed still have ClusterMachineIdentity resources when the version check starts. The test collected IPs once upfront, then spent 2 minutes trying to reach a machine whose TLS identity was already invalidated, causing x509 errors until the timeout.
Re-fetch ClusterMachineIdentity on each retry iteration so that destroyed machines drop out of the IP list naturally.
Also fix clearConnectionRefused: replace the manual ctx.Done() check with RetryWithContext. The old code returned a plain fmt.Errorf on timeout, which fell through as a non-retryable error due to a race between the context deadline and the retry loop's own timeout.
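A sketch of the new retry shape, using `github.com/siderolabs/go-retry` (which provides `RetryWithContext`); the fetch/check callbacks are stand-ins for the test's real logic:

```go
package tests

import (
	"context"
	"time"

	"github.com/siderolabs/go-retry/retry"
)

// waitForVersions retries until every currently known machine reports the expected
// version. The member list is re-read on every attempt, so machines destroyed
// mid-test drop out of the IP list naturally instead of causing x509 errors.
func waitForVersions(
	ctx context.Context,
	fetchMemberIPs func(context.Context) ([]string, error), // re-reads ClusterMachineIdentity resources
	checkVersion func(context.Context, string) error,
) error {
	return retry.Constant(2*time.Minute, retry.WithUnits(time.Second)).RetryWithContext(ctx,
		func(ctx context.Context) error {
			ips, err := fetchMemberIPs(ctx)
			if err != nil {
				return retry.ExpectedError(err)
			}

			for _, ip := range ips {
				if err = checkVersion(ctx, ip); err != nil {
					return retry.ExpectedError(err)
				}
			}

			return nil
		})
}
```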
Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
Add `account.maxRegisteredMachines` config option to cap the number of registered machines. The provision handler atomically checks the limit under a mutex before creating new Link resources, returning ResourceExhausted when the cap is reached.
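A simplified sketch of the atomic limit check (hypothetical names; the real handler counts `Link` resources in the state):

```go
package provision

import (
	"context"
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// guard serializes registrations so that the limit check and the Link creation
// are atomic with respect to each other.
type guard struct {
	mu  sync.Mutex
	max int // 0 means unlimited
}

// register counts existing links and creates a new one under the mutex,
// rejecting the request with ResourceExhausted once the cap is reached.
// countLinks and createLink are hypothetical callbacks.
func (g *guard) register(
	ctx context.Context,
	countLinks func(context.Context) (int, error),
	createLink func(context.Context) error,
) error {
	g.mu.Lock()
	defer g.mu.Unlock()

	count, err := countLinks(ctx)
	if err != nil {
		return err
	}

	if g.max > 0 && count >= g.max {
		return status.Errorf(codes.ResourceExhausted, "registered machine limit of %d reached", g.max)
	}

	return createLink(ctx)
}
```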
Introduce a Notification resource type (ephemeral namespace) so controllers can surface warnings to users. `omnictl` displays all active notifications on every command invocation. The frontend part of showing notifications will be implemented in a separate PR.
MachineStatusMetricsController creates a warning notification when the registration limit is reached and tears it down when it no longer is.
Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
The code there was also incorrect: it skipped setting `LastError` on the `ClusterMachineConfigStatus` resource.
Also add an integration test to verify that invalid config errors are properly
reported.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Rewrite the `TalosUpgradeStatus` controller to use a completely different flow:
- Update all `ClusterMachineTalosVersion` resources immediately.
- Control quotas and the rollout sequence using the new `UpgradeRollout` resource. It has a single field, a map of MachineSetName -> current quota (see the sketch below the list):
  - If the control plane is updating, the quota is set to 0 on all other machine sets.
  - The number of not running/unhealthy machines is subtracted from the quota.
  - The quota is now copied from the new `UpgradeStrategy`, so it's possible to have more than one machine updated in parallel.
- The `ClusterMachineConfigStatus` controller now adds a new finalizer for upgrades on all `ClusterMachines` which are currently being updated to acquire/release locks, and it reads quotas from the `UpgradeRollout`.
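Roughly, the per-machine-set quota derivation looks like this (an illustrative sketch, not the controller's actual code):

```go
package upgrade

// machineSetState is a hypothetical summary of a machine set, limited to what
// the quota calculation needs.
type machineSetState struct {
	Name           string
	IsControlPlane bool
	Updating       bool
	UnhealthyCount int // machines that are not running or unhealthy
}

// rolloutQuotas builds the MachineSetName -> current quota map stored in UpgradeRollout.
func rolloutQuotas(sets []machineSetState, strategyQuota int) map[string]int {
	quotas := map[string]int{}

	controlPlaneUpdating := false

	for _, set := range sets {
		if set.IsControlPlane && set.Updating {
			controlPlaneUpdating = true
		}
	}

	for _, set := range sets {
		if controlPlaneUpdating && !set.IsControlPlane {
			quotas[set.Name] = 0 // while the control plane updates, all other machine sets wait

			continue
		}

		quota := strategyQuota - set.UnhealthyCount // unhealthy machines eat into the quota
		if quota < 0 {
			quota = 0
		}

		quotas[set.Name] = quota
	}

	return quotas
}
```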
Fixes: https://github.com/siderolabs/omni/issues/2393
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
* Add `IdentityLastActive` resource to record the last time each identity (`User`/`ServiceAccount`) made a gRPC call.
* Add `IdentityStatusController` to aggregate identity, user role, and last-active data into an ephemeral `IdentityStatus` resource.
* Expose last_active in ListUsers/ListServiceAccounts gRPC responses, omnictl CLI output, and the frontend Users/ServiceAccounts views.
* Add `UserMetricsController` exposing `omni_users` (total) and `omni_active_users` (7d/30d windows) Prometheus gauges.
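The metrics themselves are plain Prometheus gauges, roughly along these lines (metric names from the description above; the registration and label name are simplified assumptions):

```go
package usermetrics

import "github.com/prometheus/client_golang/prometheus"

var (
	usersTotal = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "omni_users",
		Help: "Total number of users.",
	})

	activeUsers = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "omni_active_users",
		Help: "Number of users active within the given time window.",
	}, []string{"window"}) // e.g. "7d", "30d"
)

func init() {
	prometheus.MustRegister(usersTotal, activeUsers)
}
```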
Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
Add state validation that rejects identity creation when the configured maximum number of users or service accounts is reached. The gRPC resource and management servers now use the validated state so these limits are enforced for all creation paths (CLI, UI, API). Identity is created before the user resource so the validation fires before any side effects.
Also add create validation for the join token name, e2e Playwright tests covering the UI, and an AccountLimits integration test covering the API and CLI for limit enforcement.
Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
Migrate user create, list, update, and destroy operations from direct resource manipulation to dedicated ManagementService gRPC endpoints, matching the existing service account pattern.
Direct Identity/User resource mutations are now restricted, and the CLI, frontend, and client library are updated to use the new endpoints.
Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
Extract the fields required by the `MachineConfigStatusController` to a
separate resource.
Otherwise there's a circular dependency: `MachinePendingUpdates` ->
`MachineSetStatus` -> `MachineConfigStatus` -> `MachinePendingUpdates` -> ...
Also change the way pending machine updates are calculated: do not delete the
pending machine updates resource if the Talos version/schematic is not in sync.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Add helm unit tests (via helm-unittest) covering services, ingresses, HTTPRoutes, secrets, PrometheusRules and ServiceAccounts. Add a helm-based e2e test workflow that deploys Omni on a Talos cluster with Traefik and etcd, runs integration tests including workload proxy, and verifies the full stack end-to-end. Add a configurable TestOptions struct to the workload proxy test to allow running with smaller scale in helm e2e.
Signed-off-by: Kevin Tijssen <kevin.tijssen@siderolabs.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Update the Go version in go.mod to keep it consistent with the value in the Makefile (the actual Go version the project is built with).
This enables some new linters and changes the behavior of existing ones. Reformat and fix all the resulting linting issues.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
In the integration tests, we were accessing the API of Talos machines in maintenance mode by directly hitting their SideroLink management endpoint.
This worked only because the test was running on the same host as Omni itself (as we spawned Omni as a process). This approach breaks when we install Omni via its helm chart on a Kubernetes cluster.
Fix this by routing those connections through Omni as well.
Additionally, centralize Talos client creation in the tests, bump Talos machinery, and pass the service account key explicitly to the Talos client when creating it, instead of relying on it being picked up from env vars.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Instead of doing the fake user auth flow in the integration tests via the `clientconfig` package, use the automation service account directly. Remove all other usages of that package as well, and drop it completely.
The package predates the initial service account token feature of Omni, its purpose was to authenticate to the Omni API in the integration tests. We have the automation key now, so we don't need that anymore.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
The wipe ID on `InfraMachine` resources is empty only when a machine **never needed to be wiped**, i.e., was never allocated to and then de-allocated from a cluster.
This is not always the case in the bare metal infra provider tests, as they run both the `ConfigPatching` and `StaticInfraProvider` integration tests at the same time. Sometimes the latter test picked machines which were released by the former, and those machines had already been wiped at least once.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
We had an issue with the bare metal provider where two different schematic IDs would fight each other, causing a machine to get installed with the wrong schematic ID, only to be upgraded to the correct one immediately, and in some cases going into an upgrade loop between the correct and the incorrect schematic.
The cause: Omni trusted the schematic information it observed when a machine dialed in while in agent mode, and stored what it received (like kernel args and the initial schematic info). This was wrong, as the information reported in agent mode is essentially meaningless.
Fix this by changing the simple check of "was the schematic info for machine X ever observed" to "is the schematic info for machine X ready". The readiness check requires the schematic info to be populated and the machine to not be in agent mode.
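The readiness check boils down to something like this (hypothetical types; the actual resources in Omni differ):

```go
package omni

// schematicInfo is a hypothetical stand-in for the schematic data Omni keeps per machine.
type schematicInfo struct {
	ID          string
	InAgentMode bool
}

// schematicInfoReady reports whether the observed schematic info can be trusted:
// it must be populated, and it must not come from a machine running in agent mode.
func schematicInfoReady(info *schematicInfo) bool {
	return info != nil && info.ID != "" && !info.InAgentMode
}
```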
This change caused the `SchematicConfiguration` resource to not be generated before the machine leaves agent mode, which had a side effect: `InfraMachineController` would not receive the Talos version from it and would not populate it on the `InfraMachine` resource. That, in turn, meant the BM provider would never get notified that the machine is allocated to a cluster and would not power it on (to PXE boot it into "regular" Talos so that the "install" call to Omni can happen).
Change that controller to get the Talos version info directly from the Cluster resource.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Make all leaf fields nillable, so that we can distinguish unset from explicit empty, and merging of CLI args and YAML configs works correctly.
Generate nil-safe accessors (getters/setters) for these nillable fields and use them in the code.
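For illustration, a nillable leaf field with generated-style nil-safe accessors looks roughly like this (the field name is made up):

```go
package config

// Params shows the pattern: leaf fields are pointers, so an unset value is
// distinguishable from an explicitly empty one when merging CLI args and YAML.
type Params struct {
	BindAddress *string `yaml:"bindAddress,omitempty"`
}

// GetBindAddress is a nil-safe getter.
func (p *Params) GetBindAddress() string {
	if p == nil || p.BindAddress == nil {
		return ""
	}

	return *p.BindAddress
}

// SetBindAddress is the matching setter.
func (p *Params) SetBindAddress(value string) {
	p.BindAddress = &value
}
```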
Wrap the cobra command line parser to support nillable flags.
Move all validations into the JSON schema and drop go-validator usage and its annotations.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Graceful config rollout is now handled by the `ClusterMachineConfigStatusController`.
It tracks the available update quota by adding finalizers to the `ClusterMachine` resources: counting the resources that carry the finalizer gives the remaining quota.
It also calculates the pending changes which are not yet applied to the machine and stores them in `MachinePendingUpdates`.
Pending changes are not yet shown anywhere in the UI.
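Conceptually, the finalizer-based accounting works like this (a simplified sketch, not the controller's actual code):

```go
package rollout

// clusterMachine is a hypothetical view of a ClusterMachine resource, limited
// to what the quota accounting needs.
type clusterMachine struct {
	Finalizers []string
}

// rolloutFinalizer is an illustrative finalizer name.
const rolloutFinalizer = "config-rollout"

// remainingQuota counts machines that currently carry the rollout finalizer
// (i.e. have an update in flight) and subtracts them from the configured quota.
func remainingQuota(machines []clusterMachine, quota int) int {
	inFlight := 0

	for _, machine := range machines {
		for _, finalizer := range machine.Finalizers {
			if finalizer == rolloutFinalizer {
				inFlight++

				break
			}
		}
	}

	if remaining := quota - inFlight; remaining > 0 {
		return remaining
	}

	return 0
}
```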
Fixes: https://github.com/siderolabs/omni/issues/1929
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Simplify the code and make it less error-prone.
Signed-off-by: Pranav Patil <pranavppatil767@gmail.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
With Talos 1.12, `.machine.install.extraKernelArgs` is not the right way of setting kernel args. Remove that from the infra machines (bare metal infra provider) tests.
Remove the `disableKexec` bool argument from the function, as it was always set to true.
Set the kernel arg to disable kexec in the correct format, `sysctl.kernel.kexec_load_disabled=1`, not `kexec_load_disabled=1` (which was effectively a no-op).
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Fix a few things in tests:
- Add the forgotten `claimMachines` calls to a few integration tests
- When picking unallocated machines in integration tests, ensure that they are unallocated by checking that there is no corresponding `MachineSetNode` resource. The previous check on the `Available` label on the `MachineStatus` resource was inherently racy, as that label is set by a controller asynchronously after a machine is "picked".
- Fix the flake in the TalosUpgradeStatus unit test: it was skipping reconciliation because the `SchematicConfiguration` resource was missing the cluster label, but at the same time it was not failing reliably, as it was not asserting the completion of one upgrade before starting the next one. Fix both issues.
- Fix a crash in TalosUpgradeStatusController - it was failing to read back the `ClusterMachineTalosVersion` resource it just created because it was not yet available in the controller runtime cache. Instead of reading it back after writing, simply return the created resource reference.
Co-authored-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
This resource is going to be used to store the saved installation media
presets generated by the UI wizard.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Now that `MachineSetNodes` are no longer owned by the `MachineSetNodeController` and are instead marked with the `managed-by-machine-set-node-controller` label, CLI tools should properly handle that and ignore such `MachineSetNodes` during export and cluster sync.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Remove the flags for turning on SQLite storage for:
- Discovery service state
- Audit logs
- Machine logs
Instead, migrate them unconditionally to SQLite on the next startup.
Remove many flags which are no longer meaningful. Only keep the ones which are required for the migrations.
Additionally, make the `--sqlite-storage-path` flag (or its config counterpart `.storage.sqlite.path`) required with no default value, as a default does not make sense for it in most cases.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>