20 Commits

Author SHA1 Message Date
Utku Ozdemir
0e76483bab
chore: rekres, bump deps, Go, Talos and k8s versions, satisfy linters
- Bump some deps, namely cosi-runtime and Talos machinery.
- Update `auditState` to implement the new methods in COSI's `state.State`.
- Bump default Talos and Kubernetes versions to their latest.
- Rekres, which brings Go 1.24.5. Also update the Go version in go.mod files.
- Fix linter errors coming from new linters.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2025-07-11 18:23:48 +02:00
Utku Ozdemir
3b7014839a
test: reduce the log verbosity in unit tests
Our unit test logs are too verbose because:

- COSI runtime writes a lot of `reconcile succeeded` logs
- migration tests were too spammy

This caused the logs to get clipped by docker/buildx in the CI:
```
[output clipped, log limit 2MiB reached]
```
and resulted in mysterious test failures without the failure reason being printed.

With these changes we bring our unit test logs down to about 1.7MiB, a bit below the 2MiB limit.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2025-07-09 13:15:20 +02:00
Artem Chernyshev
122b79605f
test: run Omni as part of integration tests
This enables test coverage and builds Omni with the race detector.

Also rework the COSI state creation flow: no more callbacks.
The state is now an object with a `Stop` method, which should be
called when the app stops.
All defers were moved into the `Stop` method.
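
The shape of such an object can be sketched as follows. This is only an illustration: the `State` type, its `OnStop` method and the cleanup messages are hypothetical and are not taken from the Omni code.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical stand-in for the state object described above.
// It collects the cleanup functions that would otherwise be separate defers.
type State struct {
	closers []func() error
}

// OnStop registers a cleanup function to run when the state is stopped.
func (s *State) OnStop(f func() error) {
	s.closers = append(s.closers, f)
}

// Stop runs the registered cleanups in reverse order (like defers)
// and joins any errors they return.
func (s *State) Stop() error {
	var errs []error

	for i := len(s.closers) - 1; i >= 0; i-- {
		if err := s.closers[i](); err != nil {
			errs = append(errs, err)
		}
	}

	s.closers = nil

	return errors.Join(errs...)
}

func main() {
	st := &State{}
	st.OnStop(func() error { fmt.Println("close storage backend"); return nil })
	st.OnStop(func() error { fmt.Println("stop audit log"); return nil })

	// The application calls Stop once on shutdown.
	if err := st.Stop(); err != nil {
		fmt.Println("stop failed:", err)
	}
}
```

Running the cleanups in reverse registration order keeps the same teardown order the individual defers would have had.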

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-06-18 16:20:11 +03:00
Utku Ozdemir
fbb80f0b51
feat: implement async delete from discovery service(s)
Rework the discovery service affiliate deletion with the following changes:

1. Add support for arbitrary discovery services (e.g., self-hosted or third party):
   - Read the discovery service used by a machine from the machine itself
   - Implement a cache for discovery service clients
   - Use this discovery service client to remove the affiliate on node removal.

2. Make the discovery affiliate deletion asynchronous:
   - Introduce `DiscoveryAffiliateDeleteTask` resource
   - When a node is removed from a cluster, a resource for this node ID is created
   - A controller keeps trying to remove the affiliate until it succeeds or until the affiliate expires in the discovery service itself (after 30 minutes)
   - The controller then removes the `DiscoveryAffiliateDeleteTask` resource
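
The retry-until-success-or-expiry behaviour in item 2 can be sketched as below. This is a simplified illustration, not the actual controller: `DeleteTask`, `AffiliateDeleter`, `reconcile` and `flakyClient` are hypothetical names, and only the 30-minute expiry comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// affiliateTTL mirrors the 30-minute expiry mentioned above.
const affiliateTTL = 30 * time.Minute

// DeleteTask is a hypothetical, simplified stand-in for DiscoveryAffiliateDeleteTask.
type DeleteTask struct {
	NodeID    string
	CreatedAt time.Time
}

// AffiliateDeleter is an assumed interface for a discovery service client.
type AffiliateDeleter interface {
	DeleteAffiliate(nodeID string) error
}

// reconcile performs one attempt and reports whether the task can be removed.
func reconcile(task DeleteTask, client AffiliateDeleter, now time.Time) (bool, error) {
	if now.Sub(task.CreatedAt) >= affiliateTTL {
		// The discovery service has expired the affiliate on its own.
		return true, nil
	}

	if err := client.DeleteAffiliate(task.NodeID); err != nil {
		// Keep the task so that the next reconcile retries the deletion.
		return false, fmt.Errorf("delete affiliate %q: %w", task.NodeID, err)
	}

	return true, nil
}

// flakyClient fails a fixed number of times before succeeding (demo only).
type flakyClient struct{ failures int }

func (c *flakyClient) DeleteAffiliate(string) error {
	if c.failures > 0 {
		c.failures--

		return errors.New("discovery service unavailable")
	}

	return nil
}

func main() {
	start := time.Now()
	task := DeleteTask{NodeID: "node-1", CreatedAt: start}
	client := &flakyClient{failures: 2}

	for attempt := 1; ; attempt++ {
		now := start.Add(time.Duration(attempt) * time.Minute)

		done, err := reconcile(task, client, now)
		fmt.Printf("attempt %d: done=%v err=%v\n", attempt, done, err)

		if done {
			break // the controller would remove the task resource here
		}
	}
}
```

Because the task is a resource rather than an in-line call, node removal does not block on the discovery service being reachable.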

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2025-04-28 20:18:51 +02:00
Dmitriy Matrenichev
0cda77bbce
chore: bump Go and rekres
Run rekres, update Go version and update all files affected by linters.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2025-02-14 12:31:38 +03:00
Dmitriy Matrenichev
157ceac7f8
fix: close grpc connections after their usage is complete
- `NewClusterBootstrapStatusController.TransformExtraOutputFunc` gets `*client.Client` but forgets to close it. This leads to `*grpc.ClientConn` leakage.
- `MachineTeardownController.resetMachine` gets `*client.Client` but forgets to close it. This leads to `*grpc.ClientConn` leakage.
- `runWithState` gets `*discovery.Client` but forgets to close it. This leads to `*grpc.ClientConn` leakage.
- `WithClient` gets `*client.Client` but forgets to close it. This leads to `*grpc.ClientConn` leakage.

Shorten the code in some places, while we are at it.
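
The general pattern behind these fixes is to close the client on every return path. A minimal sketch follows, using a fake client instead of a real gRPC one; `fakeClient`, `useAndClose` and `errDemo` are hypothetical names for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errDemo is a placeholder error used only by this example.
var errDemo = errors.New("demo failure")

// fakeClient is a hypothetical stand-in for a client that owns a gRPC connection.
type fakeClient struct{ closed bool }

// Close marks the client as closed; a real client would close its *grpc.ClientConn here.
func (c *fakeClient) Close() error {
	c.closed = true

	return nil
}

// useAndClose runs fn and always closes the client afterwards,
// joining a Close error with any error returned by fn.
func useAndClose(c io.Closer, fn func() error) (err error) {
	defer func() {
		err = errors.Join(err, c.Close())
	}()

	return fn()
}

func main() {
	c := &fakeClient{}

	err := useAndClose(c, func() error { return errDemo })
	fmt.Println("closed:", c.closed, "err:", err)
}
```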

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2025-02-04 22:09:33 +03:00
Artem Chernyshev
ed946b30a6
feat: display OMNI_ENDPOINT in the service account creation UI
Fixes: https://github.com/siderolabs/omni/issues/858

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-01-29 15:27:36 +03:00
Utku Ozdemir
fd888ab190
refactor: track infra machine install status via a counter
Change the way we track whether Talos is installed on the disk in the bare-metal infra provider.

Previously, it worked as follows:
- Omni, when observing some specific type of events on SideroLink, set the `installed` flag on the dedicated `MachineState` resource to true.
- The provider, after wiping disks of a machine, set that flag to false.

This method went against the "single owner per resource" principle and was not leveraging COSI runtime and controller-based logic. Furthermore, it made the contract between Omni and the provider more complex since it was yet another resource.

Instead, now, we do the following:
- Every time we observe those specific types of events on SideroLink, we increment a counter field on the `infra.Machine` resource.
- When the provider wipes a machine, it persists this counter value at the time of wipe internally.
- To detect whether Talos is installed, the provider compares the internally stored counter value with the value on the `infra.Machine`. The machine counts as "installed" only if the counter value on the `infra.Machine` is greater than the internally stored one (meaning we observed an installation after the last wipe).
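
The comparison described above can be sketched as follows. The type and field names (`ProviderRecord`, `InstallCounterAtWipe`) are assumptions for illustration and do not come from the Omni or provider code.

```go
package main

import "fmt"

// ProviderRecord is a hypothetical view of what the provider persists internally.
type ProviderRecord struct {
	// InstallCounterAtWipe is the infra.Machine counter value seen at the last wipe.
	InstallCounterAtWipe uint64
}

// isInstalled reports whether an installation was observed after the last wipe.
func isInstalled(machineCounter uint64, rec ProviderRecord) bool {
	return machineCounter > rec.InstallCounterAtWipe
}

func main() {
	var counter uint64 // counter field on the infra.Machine resource (field name assumed)

	rec := ProviderRecord{}

	counter++                              // Omni observes an install event on SideroLink
	fmt.Println(isInstalled(counter, rec)) // true

	rec.InstallCounterAtWipe = counter     // the provider wipes the machine
	fmt.Println(isInstalled(counter, rec)) // false

	counter++                              // a new installation is observed
	fmt.Println(isInstalled(counter, rec)) // true
}
```

With this scheme each side writes only to state it owns: Omni increments the counter, and the provider records the value it saw at wipe time.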

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2025-01-27 05:11:30 +01:00
Utku Ozdemir
e3d46f949c
feat: implement compression of config fields on resources
Add compression support.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-09-11 14:48:57 +02:00
Artem Chernyshev
6080c251c6
test: fix several flaky tests
Should make CI runs more reliable.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-08-13 19:27:36 +03:00
Utku Ozdemir
e9bca13f8f
feat: use tcp loadbalancer for exposed services
Improve the exposed service reliability by using a TCP loadbalancer between the nodes exposing the service.

Rework the exposed service proxy registry into a COSI controller to simplify the logic and improve reliability and testability.

Closes siderolabs/omni#396.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-06-25 17:28:21 +02:00
Artem Chernyshev
63ad5bd1ef
feat: provide a way to get admin talosconfig and kubeconfig
Fixes: https://github.com/siderolabs/omni/issues/33

It is now possible to get full-access `kubeconfig` and `talosconfig`
(operator role) if the Omni instance has the `enable-break-glass-configs`
flag enabled.

They can be downloaded using CLI commands:

`omnictl kubeconfig --admin --cluster <name>`
`omnictl talosconfig --admin --cluster <name>`

After a config is downloaded, the cluster is marked with the
`omni.sidero.dev/tainted` annotation as a reminder that this cluster
has weaker security and might need its secrets rotated in the
future.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-06-12 15:49:48 +03:00
Utku Ozdemir
331fc31984
feat: run embedded discovery service in Omni
Run a discovery service instance inside Omni (enabled by default).

It listens only on the SideroLink interface on port 8093.

Clusters can opt in to use this embedded discovery service instead of `discovery.talos.dev`. It is added as a new cluster feature both in the frontend and in cluster templates.

Closes siderolabs/omni#20.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-06-06 01:11:17 +02:00
Dmitriy Matrenichev
d0cb1bc744
chore: replace grpc.Dial* with grpc.NewClient
That should silence the `staticcheck` linter.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-06-05 19:15:35 +03:00
Artem Chernyshev
ed26122ce0
fix: implement the controller for handling machine status snapshot
Make the controller run tasks that collect the machine status from each
machine.
Instead of changing the `MachineStatusSnapshot` directly in the
siderolink events handler, pass these events to the controller through
a channel, so that all events are handled in the same place.

Whenever an event arrives from siderolink or the task runner collects
the machine status, the controller updates the `MachineStatusSnapshot`
resource.
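
A minimal sketch of handling both sources in one place is shown below. The `Snapshot` type and the `run` function are hypothetical simplifications and do not reflect the actual controller code.

```go
package main

import "fmt"

// Snapshot is a hypothetical, trimmed-down stand-in for MachineStatusSnapshot.
type Snapshot struct {
	MachineID string
	Stage     string
	Source    string
}

// run drains both sources in one loop, so every update goes through the same code path.
// It returns once both channels are closed.
func run(events, polled <-chan Snapshot, update func(Snapshot)) {
	for events != nil || polled != nil {
		select {
		case s, ok := <-events:
			if !ok {
				events = nil // source closed; stop selecting on it

				continue
			}

			update(s)
		case s, ok := <-polled:
			if !ok {
				polled = nil

				continue
			}

			update(s)
		}
	}
}

func main() {
	events := make(chan Snapshot, 1) // fed by the siderolink events handler
	polled := make(chan Snapshot, 1) // fed by the per-machine status tasks

	events <- Snapshot{MachineID: "m1", Stage: "booting", Source: "siderolink"}
	polled <- Snapshot{MachineID: "m1", Stage: "running", Source: "task"}

	close(events)
	close(polled)

	run(events, polled, func(s Snapshot) {
		fmt.Printf("update %s: stage=%s (from %s)\n", s.MachineID, s.Stage, s.Source)
	})
}
```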

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-06-04 13:59:47 +03:00
Dmitriy Matrenichev
82abb2ba53
chore: bump deps
- run rekres and fix nolint directives
- bump deps (keep gen to 0.4.8 for now) for server, client and tests

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-06-03 22:43:37 +03:00
Utku Ozdemir
aa4d76489e
fix: always delete removed nodes from discovery service
If a `ClusterMachine` is removed, always attempt to remove it from the discovery service.

Use a single discovery service client instead of recreating it every time.

Use the gRPC dial options exposed from the siderolabs/discovery-client when connecting to the discovery service.

Closes siderolabs/omni#19.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-04-16 00:29:07 +02:00
Artem Chernyshev
7486bb8d20
feat: support generating installation media with overlays for Talos 1.7+
Fixes: https://github.com/siderolabs/omni/issues/143

This is crucial if we want to support SBCs in Omni.

Automatically detect on the backend which overlay we need to install
when any SBC type is selected.
Move some of the filename generation to the backend, as it is now
Talos version dependent.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-04-15 22:43:19 +03:00
Utku Ozdemir
176f9d9f57
feat: compute schematic id only from the extensions
When determining the schematic ID of a machine, instead of relying on the schematic ID meta-extension, compute the ID by gathering the extensions on the machine. This way, the schematic ID will not contain the META values, labels or the kernel args.

This ID is actually the ID we need, as when we compare the desired schematic with the actual one during a Talos upgrade, we are only interested in the changes in the list of extensions.

This does not cause the kernel args, labels, etc. to disappear, as they are used at installation time and preserved afterward (e.g., during upgrades).

Additionally:
- Remove the list of extensions from the `Schematic` resource, as it relied upon the schematics always being created through Omni. This is not always the case - i.e., when a partial join config is used. Therefore, instead of relying on it, we store the list of extensions by directly reading them from the machine and storing them on the `MachineStatus` resource.
- Skip setting the schematic META section at all if there are no labels set on the Download Installation Media screen.
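
The idea of deriving an ID from the extension list alone can be illustrated with the sketch below. Hashing a sorted, newline-joined list with SHA-256 is an assumption made for this example; the real schematic ID format is not described in this message, and `schematicIDFromExtensions` is a hypothetical name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// schematicIDFromExtensions derives a stable ID from the extension list only.
// The hashing scheme is an assumption for illustration, not the real ID format.
func schematicIDFromExtensions(extensions []string) string {
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]string(nil), extensions...)
	sort.Strings(sorted)

	sum := sha256.Sum256([]byte(strings.Join(sorted, "\n")))

	return hex.EncodeToString(sum[:])
}

func main() {
	actual := schematicIDFromExtensions([]string{"siderolabs/qemu-guest-agent", "siderolabs/iscsi-tools"})
	desired := schematicIDFromExtensions([]string{"siderolabs/iscsi-tools", "siderolabs/qemu-guest-agent"})

	// Kernel args, labels and META values are not inputs, so they cannot change the ID.
	fmt.Println("upgrade needed:", actual != desired)
}
```

Since only the extensions are hashed, two machines with the same extensions but different labels or kernel args compare as equal during an upgrade check.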

Closes siderolabs/omni#55.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-03-22 14:58:19 +03:00
Andrey Smirnov
dfcbaae7d0
chore: initial commit
Omni is source-available under BUSL.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Co-Authored-By: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Co-Authored-By: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Co-Authored-By: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Co-Authored-By: Philipp Sauter <philipp.sauter@siderolabs.com>
Co-Authored-By: Noel Georgi <git@frezbo.dev>
Co-Authored-By: evgeniybryzh <evgeniybryzh@gmail.com>
Co-Authored-By: Tim Jones <tim.jones@siderolabs.com>
Co-Authored-By: Andrew Rynhard <andrew@rynhard.io>
Co-Authored-By: Spencer Smith <spencer.smith@talos-systems.com>
Co-Authored-By: Christian Rolland <christian.rolland@siderolabs.com>
Co-Authored-By: Gerard de Leeuw <gdeleeuw@leeuwit.nl>
Co-Authored-By: Steve Francis <67986293+steverfrancis@users.noreply.github.com>
Co-Authored-By: Volodymyr Mazurets <volodymyrmazureets@gmail.com>
2024-02-29 17:19:57 +04:00