Commit Graph

30 Commits

Author SHA1 Message Date
Utku Ozdemir
0e76483bab
chore: rekres, bump deps, Go, Talos and k8s versions, satisfy linters
- Bump some deps, namely cosi-runtime and Talos machinery.
- Update `auditState` to implement the new methods in COSI's `state.State`.
- Bump default Talos and Kubernetes versions to their latest.
- Rekres, which brings Go 1.24.5. Also update it in go.mod files.
- Fix linter errors coming from new linters.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2025-07-11 18:23:48 +02:00
Artem Chernyshev
a7ac63725d
chore: rewrite join config generation
Now the machine join config is always generated when there's a `machine`
resource. It will automatically populate the correct parameters for the
machine API URL, logs and events.
If the machine is managed by an infra provider it will populate its
request ID too.

The default provider join config is also generated, but it is not used
in the common infra provider library, because it's easier to just
generate the config at the moment it's going to be used.

The code for the siderolink join config generation was unified in all
the places, and is now in `client/pkg/siderolink`.

A new management API, `GetMachineJoinConfig`, was introduced for
downloading the join config in the UI.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-07-10 13:41:38 +03:00
Oguz Kilcan
ff32ae4c7f
fix: gracefully handle logServer shutdown
Refactor the siderolink manager to improve context handling for Wireguard and its associated services.
This ensures proper shutdown and cleanup of resources.
Add a linger option to TCP connections in the log receiver to forcefully close the connection with RST.

Signed-off-by: Oguz Kilcan <oguz.kilcan@siderolabs.com>
2025-07-08 17:44:59 +02:00
Artem Chernyshev
ccd55cc8fb
feat: rewrite Omni config management
Omni can now be configured via a config file instead of the command line
flags.
The flag `--config-path` will now read the config provided in the YAML
format.
The config structure was completely changed. It was not public before,
so it's fine to ignore backward compatibility.
The command line flags were not changed.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-06-09 14:44:29 +03:00
Utku Ozdemir
3b0e831dff
fix: do not switch Siderolink GRPC tunnel mode after provisioning
Followup fix after https://github.com/siderolabs/omni/pull/976.

Change the way the global flag `--siderolink-use-grpc-tunnel` works: it is now used in the SideroLink provision API to decide whether to ignore what the machine requests in the provision request and "force" the tunnel mode.

Additionally, now it is ignored in the generated `SideroLinkConfig` document which is pushed to the machines when they are allocated - instead, we simply check if the machine was already connected (provisioned) using a GRPC tunnel, and preserve that option. By doing that, we avoid switching the mode after the provisioning, avoiding the bug which is fixed in https://github.com/siderolabs/talos/pull/10517.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2025-03-13 00:35:53 +01:00
Artem Chernyshev
a5efd816a2
feat: validate incoming packets addresses in siderolink manager
Updated SideroLink module to add the support for it and configure it
on the Omni side.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-03-07 20:39:06 +03:00
Orzelius
9e7d8fbe92
fix: increase log level of the SideroLink GRPC tunnel handler
follow up to https://github.com/siderolabs/omni/pull/959

reduce the amount of unimportant logs coming from
the SideroLink GRPC tunnel handler.

Signed-off-by: Orzelius <33936483+Orzelius@users.noreply.github.com>
2025-02-26 20:22:30 +03:00
Utku Ozdemir
ef32e434ac
fix: increase log level of the SideroLink GRPC tunnel handler
It seems SideroLink logs are too verbose in debug mode when the GRPC tunnel is used.
Reduce their level to info.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2025-02-25 12:19:46 +01:00
Artem Chernyshev
510512e7b2
fix: properly read the siderolink-disable-last-endpoint flag
The condition was inverted 🤦

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-02-24 17:07:05 +03:00
Artem Chernyshev
5fe3223999
feat: add a flag to enable secure join token flow
With the feature flag it is now possible to use the old flow.

The new secure join flow is not stable yet, so the default mode is
legacy, which disables node unique token generation for all machines,
not only for the ones running Talos versions that are too old.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-02-19 21:25:04 +03:00
Artem Chernyshev
9bb85f8034
feat: implement secure node join flow
Fixes: https://github.com/siderolabs/omni/issues/840

This PR changes the Talos machine join flow drastically:

- a newly joined machine is first put into a limbo state where Omni
  creates a temporary Wireguard connection to it.
- the controller picks it up and tries to write a unique machine token
  to the newly joined machine; in the meantime it also resolves UUID
  conflicts automatically and writes a UUID override to the META
  partition.
- the machine re-joins Omni, now with the unique token.
- the unique token is saved in the `siderolink.Link` resource and any
  subsequent join checks that `siderolink.Link` has matching unique
  token.
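
The last step of the flow above can be sketched roughly as follows. This is an illustration only: the `link` struct is a hypothetical stand-in for `siderolink.Link`, the `checkJoin` helper is not taken from the codebase, and treating an empty stored token as "still pending" is an assumption made for the example.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

// link is a hypothetical stand-in for the `siderolink.Link` resource,
// reduced to the single field this sketch needs.
type link struct {
	UniqueToken string
}

var errTokenMismatch = errors.New("unique node token mismatch")

// checkJoin decides whether a join attempt may proceed.
// Assumption: an empty stored token means the machine is still pending and
// has not been assigned a unique token yet, so the check is skipped.
func checkJoin(l link, presented string) error {
	if l.UniqueToken == "" {
		return nil
	}

	// Constant-time comparison avoids leaking the token through timing.
	if subtle.ConstantTimeCompare([]byte(l.UniqueToken), []byte(presented)) != 1 {
		return errTokenMismatch
	}

	return nil
}

func main() {
	stored := link{UniqueToken: "abc"}

	fmt.Println(checkJoin(stored, "abc")) // <nil>
	fmt.Println(checkJoin(stored, "xyz")) // unique node token mismatch
}
```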

The Siderolink manager was refactored, as it was a huge, monolithic,
poorly testable chunk. It was split into:
- LinkStatus controller, which creates/removes wireguard peers.
- PendingMachineStatus controller, which ensures all joined machines
  have unique node tokens.
- Provision handler, which implements the gRPC server and now has all
  logic related to the machine acceptance.
- PeersPool, which is used by LinkStatus controllers to deduplicate
  peer creation and reuse peers when possible.

Additionally, the siderolink loghandler was updated to not accept logger
connections from machines which do not have corresponding log buffers.

Nodes which do not support secure flow are still able to join by
default.
Secure join flow can be forced by setting `--disable-legacy-join-tokens`
flag.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-02-14 19:12:28 +03:00
Artem Chernyshev
ed946b30a6
feat: display OMNI_ENDPOINT in the service account creation UI
Fixes: https://github.com/siderolabs/omni/issues/858

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2025-01-29 15:27:36 +03:00
Utku Ozdemir
5a26d4c7ac
feat: add resources and controllers for bare metal infra provider
Introduce the concept of "static" infra providers, e.g., bare-metal infra provider, which manage a static set of machines contrary to the "regular" infra providers.

Add the following resources:

- `infra.Machine`: similar to `MachineRequest`, lives in the `infra-provider` namespace, serving as the input of the owning static provider. It is created in the `MachineController` if there is a SideroLink connection with the static provider ID. Regular flow of `Machine` creation is blocked, until this `infra.Machine` is accepted.

- `infra.MachineStatus`: similar to `MachineRequestStatus`, lives in the `infra-provider` namespace, serving as the output of the owning static provider. Its lifecycle must be bound to the corresponding `infra.Machine`.

- `infra.MachineState`: a resource that is supposed to be shared by Omni and bare-metal provider bi-directionally - they both can read from and write to it. It is currently used to mark the machine as installed when we observe an installation (through `SequenceEvent`s in the event sink), and to mark it as non-installed after we wipe it in the provider.

- `omni.InfraMachineConfig`: a user-managed resource to mark the `infra.Machine`s as accepted or set their desired power state. The acceptance information is then propagated to the `infra.Machine` resource. A machine which was already accepted cannot be unaccepted (checked by a validation), and this resource can only be removed when the `siderolink.Link` for the matching infra machine is removed.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-11-27 00:00:20 +01:00
Artem Chernyshev
cc5919273b
feat: reset machine when it's removed from Omni
The reset removes Talos from the machine completely.
Fixes: https://github.com/siderolabs/omni/issues/419

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-10-31 02:54:48 +03:00
Dmitriy Matrenichev
4084b6e9d7
fix: get proper IP from peer metadata
Currently, if the gateway encounters an existing X-Forwarded-For header, it appends the peer address that it sees to the existing value, separated by a comma.
Our IP extraction function didn't account for that, so it failed to parse the IP and fell back to the original `peer.address`, which
is set deep in the gRPC middleware.

This commit ensures that we try to split the string value using `,`.
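
A rough sketch of that splitting logic, not the actual extraction function: the name `forwardedIP` is hypothetical, and picking the first parsable entry is an assumption made for the example.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// forwardedIP (hypothetical helper) returns the first parsable address from
// an X-Forwarded-For value such as "203.0.113.7, 10.0.0.1".
// Assumption: the first valid entry is the one we want; the real code may
// choose differently.
func forwardedIP(header string) (netip.Addr, bool) {
	for _, part := range strings.Split(header, ",") {
		addr, err := netip.ParseAddr(strings.TrimSpace(part))
		if err == nil {
			return addr, true
		}
	}

	return netip.Addr{}, false
}

func main() {
	addr, ok := forwardedIP("203.0.113.7, 10.0.0.1")
	fmt.Println(addr, ok) // 203.0.113.7 true
}
```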

Closes #668

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-10-09 13:27:51 +03:00
Artem Chernyshev
d547889b7b
fix: filter requests in the infra provision controller
Make each controller process only resources labeled with its provider
ID.
Allow overriding gRPC tunnel options for the machine classes/request
sets.
Expose join configs to the infra providers.

Also publish Omni integration tests as the part of releases.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-10-08 18:41:02 +03:00
Utku Ozdemir
c4a4151d7a
feat: allow specifying grpc tunnel option explicitly for install media
By default, generated Talos installation media uses `grpc_tunnel` for SideroLink based on the Omni instance configuration, namely via `--siderolink-use-grpc-tunnel` flag.

Allow overriding this setting in `omnictl download` and in Download Installation Media screen on the web.

On the Download Installation Media screen, the default value of the checkbox is based on the instance default.

Closes siderolabs/omni#388.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-09-13 11:42:29 +02:00
Artem Chernyshev
03604222ea
feat: support passing extra data through the siderolink join token
This extra data is used in the infra provider to add the annotation to the
`siderolink.Link` as early as possible.
Then the `Machine` controller is changed to skip the `Links` that have
annotation `omni.sidero.dev/infra-provider` and do not have the label
`omni.sidero.dev/machine-request`.
This change makes the system ignore inconsistent `Links` until they are
fully populated.

Also changed the infra provider interface to take siderolink connection
params as string instead of the resource.

Fixes: https://github.com/siderolabs/omni/issues/603

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-09-05 14:57:51 +03:00
Dmitriy Matrenichev
e2f5795789
chore: allow multiple IP's for siderolink-wireguard-advertised-addr flag
The code is already there: Talos will simply fail to connect and will try again by rotating the IP.
We simply add support for specifying multiple IPs in the `siderolink-wireguard-advertised-addr` flag, separated by commas.

Fixes #495

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-08-28 20:41:29 +03:00
Dmitriy Matrenichev
5d48547c7f
chore: use range-over-func iterators for resource iteration
Bump to Go 1.23 and use new iterator mechanism. Also fix new linter issues.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-08-22 01:20:55 +03:00
Artem Chernyshev
5d953e407b
fix: do not re-create peer on the remote addr change
And exclude port from the saved address.

Additionally fix the Talos backends cache to not
react to the `MachineType` `Create` and `Update` events.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-07-04 12:53:52 +03:00
Artem Chernyshev
cd8bac4117
feat: read real IP from the provision API gRPC requests
And store them in the `link` resources.
This might help to determine the real IP of the node which is coming
to Omni in case `MachineStatus` is not populated.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-06-26 16:13:48 +03:00
Utku Ozdemir
6dcfd4c979
feat: handle all goroutine panics gracefully
Convert goroutine panics to errors or error logs.
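
One common way to convert a goroutine panic into an error is a `recover` in a deferred function. The sketch below illustrates the idea under that assumption; `goSafe` is a hypothetical helper, not the one used in the codebase.

```go
package main

import "fmt"

// goSafe (hypothetical helper) runs fn in a new goroutine and reports its
// result on the returned channel. A panic inside fn is recovered and
// delivered as an error instead of crashing the process.
func goSafe(fn func() error) <-chan error {
	errCh := make(chan error, 1) // buffered so the goroutine never blocks

	go func() {
		defer func() {
			if r := recover(); r != nil {
				errCh <- fmt.Errorf("goroutine panicked: %v", r)
			}
		}()

		errCh <- fn()
	}()

	return errCh
}

func main() {
	err := <-goSafe(func() error { panic("boom") })
	fmt.Println(err) // goroutine panicked: boom
}
```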

Disallow usage of `golang.org/x/sync/errgroup` package in the backend by `depguard` linter. This linter configuration depends on: https://github.com/siderolabs/kres/pull/417

Rekres the project to include the feature (also bump Go to 1.22.4), but revert `PROTOBUF_GO_VERSION` and `GRPC_GATEWAY_VERSION` manually to not break the frontend.

Disallowing the named `go` statement was not possible at the moment using existing linters, raised an issue in `forbidigo` for it: https://github.com/ashanbrown/forbidigo/issues/47

Closes siderolabs/omni#373.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-06-20 21:28:12 +02:00
Artem Chernyshev
a67d1fb30b
fix: always generate siderolink connection config for all machines
Even if they already have the kernel arguments.
It will generate the config only for Talos >= 1.5.0.

Added migration to avoid triggering config updates for all machines, as
they don't have this partial config right now.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-05-29 18:55:57 +03:00
Petr Krutov
f38b7e54a6
feat: enable ALPN for machine API
Enabled ALPN negotiation for machine API endpoint

Signed-off-by: Petr Krutov <kjubybot@proton.me>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-05-28 15:57:10 +04:00
Simon-Boyer
16108a9f22
feat: allow setting some url params for api endpoint
This lets the operator define url params for the api endpoint. For example https://<endpoint>/?grpc_tunnel=true. Instead of only appending the jointoken, we are parsing the url and adding it using Query.Set.
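
A short sketch of the difference between string appending and `Query.Set`, using only the standard `net/url` package. The helper name `withJoinToken` and the `jointoken` parameter name are assumptions made for the example.

```go
package main

import (
	"fmt"
	"net/url"
)

// withJoinToken (hypothetical helper) adds the join token to the endpoint
// while preserving any query parameters the operator already set.
// The "jointoken" parameter name is an assumption for this sketch.
func withJoinToken(endpoint, token string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", fmt.Errorf("parse endpoint: %w", err)
	}

	q := u.Query()
	q.Set("jointoken", token)
	u.RawQuery = q.Encode() // Encode sorts parameters by key

	return u.String(), nil
}

func main() {
	out, err := withJoinToken("https://omni.example.com/?grpc_tunnel=true", "abc")
	if err != nil {
		panic(err)
	}

	fmt.Println(out) // https://omni.example.com/?grpc_tunnel=true&jointoken=abc
}
```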

Signed-off-by: Simon-Boyer <si.boyer@hotmail.ca>
Co-authored-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-05-08 23:27:53 +03:00
Utku Ozdemir
95197e2b07
feat: improve reliability of machine status snapshots
Use the Talos resource API as well as the siderolink event sink to determine the status of a machine.

Follow the agreed decision tree of:
- if the update came over the same channel as before, use it
- if the update came over a different channel than before, and the timestamp is newer than the previous update, use it
- otherwise, drop it

Closes siderolabs/omni#41.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-04-30 17:32:20 +02:00
Artem Chernyshev
7f58ea4713
fix: allow adding machines to Omni at higher speed
Introduce a buffer for `PeerEvents` channel, to not block adding new
machines when SideroLink is processing new peers.
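
To illustrate why a buffer helps, here is a minimal sketch with a buffered Go channel. The `peerEvent` type, the `newPeerEvents` helper and the buffer size are hypothetical; the actual `PeerEvents` channel and its capacity may differ.

```go
package main

import "fmt"

// peerEvent is a hypothetical stand-in for the events sent on `PeerEvents`.
type peerEvent struct {
	MachineID string
}

// newPeerEvents creates a buffered channel: producers can enqueue up to
// size events without waiting for the consumer to catch up.
func newPeerEvents(size int) chan peerEvent {
	return make(chan peerEvent, size)
}

func main() {
	events := newPeerEvents(16) // the size is an arbitrary example value

	// With an unbuffered channel these sends would block until a receiver
	// is ready; with the buffer they return immediately.
	for i := 0; i < 3; i++ {
		events <- peerEvent{MachineID: fmt.Sprintf("machine-%d", i)}
	}

	fmt.Println("queued events:", len(events)) // queued events: 3
}
```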

Fixes: https://github.com/siderolabs/omni/issues/120

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-04-25 17:30:12 +03:00
Dmitriy Matrenichev
d3e3eef0fa
chore: support WG over GRPC in Omni
This PR adds the support for WG over GRPC. New field `VirtualAddrport`
in `SiderolinkSpec` should allow for both
setting the virtual addr and loading it after the Omni restart.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-04-10 18:50:49 +03:00
Andrey Smirnov
dfcbaae7d0
chore: initial commit
Omni is source-available under BUSL.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Co-Authored-By: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Co-Authored-By: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Co-Authored-By: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Co-Authored-By: Philipp Sauter <philipp.sauter@siderolabs.com>
Co-Authored-By: Noel Georgi <git@frezbo.dev>
Co-Authored-By: evgeniybryzh <evgeniybryzh@gmail.com>
Co-Authored-By: Tim Jones <tim.jones@siderolabs.com>
Co-Authored-By: Andrew Rynhard <andrew@rynhard.io>
Co-Authored-By: Spencer Smith <spencer.smith@talos-systems.com>
Co-Authored-By: Christian Rolland <christian.rolland@siderolabs.com>
Co-Authored-By: Gerard de Leeuw <gdeleeuw@leeuwit.nl>
Co-Authored-By: Steve Francis <67986293+steverfrancis@users.noreply.github.com>
Co-Authored-By: Volodymyr Mazurets <volodymyrmazureets@gmail.com>
2024-02-29 17:19:57 +04:00