The problem was that `GracefulStop()` waits for every running API call
to finish. So if there is a running streaming call, the maintenance
service might hang until that call ends.
The problem shows up with the `Upgrade` API in maintenance mode if there
is a concurrent streaming API call, e.g.:
1. The Watch API is running against a node in maintenance mode.
2. The Upgrade API is issued; it tries to run the MaintenanceUpgrade
sequence, which tries to take over the Initialize sequence. The
Initialize sequence is canceled and the maintenance API service context
is canceled, but the service doesn't terminate, as it's stuck in
`GracefulStop`. The sequence takeover times out because, even though
the Initialize sequence is canceled, it hasn't terminated yet.
Sample log:
```
[talos] upgrade request received: "ghcr.io/siderolabs/installer:v1.3.3"
[talos] upgrade failed: failed to acquire lock: timeout
[talos] task loadConfig (1/1): failed: failed to receive config via maintenance service: maintenance service failed: context canceled
[talos] phase config (6/7): failed
[talos] initialize sequence: failed
<stuck here>
```
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
There's a cyclic dependency on the siderolink library, which imports
Talos machinery back. We will fix that once Talos is pushed under the
new name.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This is the first step towards replacing all import paths to be based on
`siderolabs/` instead of `talos-systems/`.
All updates contain no functional changes, just refactorings to adapt to
the new path structure.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This implements a simple way to upgrade a Talos node running in
maintenance mode (only if Talos is installed, i.e. if `STATE` and
`EPHEMERAL` partitions are wiped).
Upgrade is only available over SideroLink for security reasons.
Upgrade in maintenance mode doesn't support any options, and it works
without machine configuration, so proxy environment variables are not
available, registry mirrors can't be used, and extensions are not
installed.
Fixes #6224
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
I hit this bug when one of the API calls hung, and submitting the
machine config with `apply-config` never took the node out of
maintenance mode, as `.GracefulStop()` may hang forever waiting for all
the calls to finish.
With this change, we always abort after a timeout and stop the server
forcefully.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #6119
With the new stable default hostname feature, any default hostname is
disabled until the machine config is available.
Talos enters maintenance mode when the default config source is empty,
so it doesn't have any machine config available at the moment the
maintenance service is started.
The hostname might be set via different sources, e.g. kernel args or
DHCP, before the machine config is available, but if none of these
sources is available, the hostname won't be set at all.
This stops waiting for the hostname, and skips setting any DNS names in
the maintenance mode certificate SANs if the hostname is not available.
Also adds a regression test via the new `--disable-dhcp-hostname` flag
to `talosctl cluster create`.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
If SideroLink is enabled, maintenance mode should only allow SideroLink connections.
Closes #5627
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
This basically provides `talosctl get --insecure` in maintenance mode.
Only non-sensitive resources are available (equivalent to having
`os:reader` role in the Talos client certificate).
Changes:
* refactored insecure/maintenance client setup in talosctl
* `LinkStatus` is no longer sensitive, as it shows only the WireGuard
public key; `LinkSpec` remains sensitive, as it still contains the
private key
* maintenance mode injects `os:reader` role implicitly
The motivation behind this PR is to deprecate the networkd-era
interfaces & routes APIs, which are used in the TUI installer and need
a replacement.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #4139
This builds the local (for the node) `Affiliate` structure which
describes the node for cluster discovery. Depending on the
configuration, KubeSpan information might be included as well.
`NodeAddresses` were updated to hold CIDRs instead of simple IPs.
The `Affiliate` will be pushed to the registries, while `Affiliate`s for
other nodes will be fetched back from the registries.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This got "broken" as part of the change to the new networking
implementation, so that maintenance service is launched before the
network is ready.
Fetch DNS names and IPs for the certificate from the resources.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Plus fix the logging on docker/Talos to avoid logs in docker mode going
to the host kernel message buffer.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The server in maintenance mode now prints the certificate fingerprint
and provides a sample talosctl command to upload the config to the node.
`talosctl` can optionally enforce server certificate fingerprint.
See also https://github.com/talos-systems/crypto/pull/4
Fixes #2753
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This fixes the reverse Go dependency from `pkg/machinery` to `talos`
package.
Add a check to `Dockerfile` to prevent `pkg/machinery/go.mod` from
getting out of sync; this should prevent problems in the future.
Fix a potential security issue in the `token` authorizer: requests
without gRPC metadata are now denied.
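
The deny-by-default idea can be sketched as follows. This is not the actual `token` authorizer: the `metadata` map type stands in for gRPC request metadata, and the `token` key and `authorize` function are hypothetical. The point is only that a request with no metadata at all is rejected rather than passed through.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

// metadata is a stand-in for gRPC request metadata (hypothetical type).
type metadata map[string][]string

var errDenied = errors.New("permission denied")

// authorize denies the request unless metadata is present and carries
// the expected token under the (assumed) "token" key.
func authorize(md metadata, present bool, expected string) error {
	if !present {
		// No metadata at all: deny instead of letting the request through.
		return errDenied
	}

	for _, got := range md["token"] {
		if subtle.ConstantTimeCompare([]byte(got), []byte(expected)) == 1 {
			return nil
		}
	}

	return errDenied
}

func main() {
	fmt.Println(authorize(nil, false, "secret"))                          // prints "permission denied"
	fmt.Println(authorize(metadata{"token": {"secret"}}, true, "secret")) // prints "<nil>"
}
```
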
In provisioner, add support for launching nodes without the config
(config is not delivered to the provisioned nodes).
Breaking change in `pkg/provision`: now `NodeRequest.Type` should be set
to the node type (as config can be missing now).
In `talosctl cluster create` add a flag to skip providing config to the
nodes so that they enter maintenance mode, while the generated configs
are written down to disk (so they can be tweaked and applied easily).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Instead of hosting a web service, we decided to implement a gRPC service
that exposes APIs that can be used in a client-side interactive installer.
Signed-off-by: Andrew Rynhard <andrew@rynhard.io>