There's a cyclic dependency on the siderolink library, which imports
Talos machinery back. We will fix that after we get Talos pushed under a
new name.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This is the first step towards replacing all import paths to be based on
`siderolabs/` instead of `talos-systems/`.
All updates contain no functional changes, just refactorings to adapt to
the new path structure.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Don't allow worker nodes to act as apid routers:
* don't try to issue client certificate for apid on worker nodes
* if a worker node receives an incoming connection with `--nodes` set to
one of the local addresses of the node, it routes the request to
itself without proxying
The second point allows using `talosctl -e worker -n worker` to connect
directly to the worker if the connection from the control plane is not
available for some reason.
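
The local-address check can be illustrated with a minimal sketch. It is
not the apid code: `targetsSelf` and the sample addresses are
hypothetical, and it only shows the comparison of the `--nodes` target
against the node's own addresses.

```go
package main

import (
	"fmt"
	"net/netip"
)

// targetsSelf reports whether target (the value passed via --nodes) is one
// of the node's own addresses. Hypothetical helper, not a Talos function.
func targetsSelf(target string, localAddrs []string) bool {
	t, err := netip.ParseAddr(target)
	if err != nil {
		// Not an IP address: this sketch does not resolve names.
		return false
	}

	for _, s := range localAddrs {
		if a, err := netip.ParseAddr(s); err == nil && a == t {
			return true
		}
	}

	return false
}

func main() {
	local := []string{"10.5.0.3", "fd00::3"} // assumed sample addresses

	// A worker would handle the first request itself; the second one
	// targets another node, which a worker is not allowed to proxy to.
	fmt.Println(targetsSelf("10.5.0.3", local))
	fmt.Println(targetsSelf("10.5.0.2", local))
}
```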
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #6119
With the new stable default hostname feature, no default hostname is
set until the machine config is available.
Talos enters maintenance mode when the default config source is empty,
so it doesn't have any machine config available at the moment the
maintenance service is started.
The hostname might be set via different sources, e.g. kernel args or
DHCP, before the machine config is available, but if none of these
sources is available, the hostname won't be set at all.
This stops waiting for the hostname, and skips setting any DNS names in
the maintenance mode certificate SANs if the hostname is not available.
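
A minimal sketch of the SAN handling described above. The `certSANs`
type and `maintenanceSANs` function are hypothetical names used only for
illustration; the point is that an empty hostname results in no DNS
names instead of a blocking wait.

```go
package main

import "fmt"

// certSANs is a hypothetical container for certificate subject
// alternative names.
type certSANs struct {
	IPs      []string
	DNSNames []string
}

// maintenanceSANs builds the SANs from the known IPs and adds the hostname
// as a DNS name only when one is available.
func maintenanceSANs(ips []string, hostname string) certSANs {
	sans := certSANs{IPs: ips}

	if hostname != "" {
		sans.DNSNames = append(sans.DNSNames, hostname)
	}

	return sans
}

func main() {
	fmt.Printf("%+v\n", maintenanceSANs([]string{"10.5.0.2"}, ""))
	fmt.Printf("%+v\n", maintenanceSANs([]string{"10.5.0.2"}, "node-1"))
}
```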
Also adds a regression test via new `--disable-dhcp-hostname` flag to
`talosctl cluster create`.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Overview: deprecate existing Talos resource API, and introduce new COSI
API.
Consequences:
* COSI API can only go via one-2-one proxy (`client.WithNode`)
* client-side API access is way easier with `state.State` wrappers
* lots of small changes on the client side to use new APIs
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
We add a new CRD, `serviceaccounts.talos.dev` (with `tsa` as short name), and its controller which allows users to get a `Secret` containing a short-lived Talosconfig in their namespaces with the roles they need. Additionally, we introduce the `talosctl inject serviceaccount` command to accept a YAML file with Kubernetes manifests and inject them with Talos service accounts so that they can be directly applied to Kubernetes afterwards. If Talos API access feature is enabled on Talos side, the injected workloads will be able to talk to Talos API.
Closes siderolabs/talos#4422.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Fix `Talos` sequencer to run only a single sequence at the same time.
Sequence priorities were updated to match the table:
| what is running (columns) what is requested (rows) | boot | reboot | reset | upgrade |
|----------------------------------------------------|------|--------|-------|---------|
| reboot | Y | Y | Y | N |
| reset | Y | N | N | N |
| upgrade | Y | N | N | N |
With a small addition: `WithTakeover` is still there.
If set, priority is ignored.
This is mainly used for invoking the `Shutdown` sequence and when
applying config with reboot enabled.
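
The table can be read as a lookup, sketched below. This is not the
sequencer implementation: the `sequence` type, the `allowed` map and
`canStart` are hypothetical, and the sketch assumes `Y` means the
requested sequence may start while the running one is in progress.

```go
package main

import "fmt"

type sequence string

const (
	seqBoot    sequence = "boot"
	seqReboot  sequence = "reboot"
	seqReset   sequence = "reset"
	seqUpgrade sequence = "upgrade"
)

// allowed[requested][running] mirrors the table: rows are the requested
// sequence, columns are the sequence that is currently running.
var allowed = map[sequence]map[sequence]bool{
	seqReboot:  {seqBoot: true, seqReboot: true, seqReset: true, seqUpgrade: false},
	seqReset:   {seqBoot: true, seqReboot: false, seqReset: false, seqUpgrade: false},
	seqUpgrade: {seqBoot: true, seqReboot: false, seqReset: false, seqUpgrade: false},
}

// canStart reports whether requested may start while running is active.
// With takeover set, the priority table is ignored.
func canStart(requested, running sequence, takeover bool) bool {
	if takeover {
		return true
	}

	// Combinations missing from the table are treated as not allowed.
	return allowed[requested][running]
}

func main() {
	fmt.Println(canStart(seqReboot, seqReset, false))
	fmt.Println(canStart(seqUpgrade, seqReboot, false))
	fmt.Println(canStart(seqUpgrade, seqReboot, true))
}
```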
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Clear the kubelet certificates and kubeconfig when the hostname changes. On the next start, kubelet goes through the bootstrap process, new certificates are generated, and the node joins the cluster with the new name.
Fixes siderolabs/talos#5834.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Overwrite the cluster's server URL in the kubeconfig file used by kubelet when the cluster control plane endpoint is changed in the machine config, so that kubelet doesn't lose connectivity to kube-apiserver.
Closes siderolabs/talos#4470.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
There should be no functional change with this PR.
The primary driver is supporting strategic merge configuration patches.
For such patches, the machine config has to be loaded from incomplete
fragments, so it becomes critically important to distinguish between a
field having the zero value vs. a field being set in YAML.
E.g. with the following struct:
```go
struct { AEnabled *bool `yaml:"a"` }
```
It's possible to distinguish between:
```yaml
a: false
```
and no mention of `a` in YAML.
The merging process treats zero values as "not set" (skips them when
merging), so it's important to allow overriding a value with an explicit
`false`.
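
A minimal sketch of the distinction. To stay within the standard
library it decodes JSON via `encoding/json` as a stand-in for YAML; the
`fragment` type, `describe` and `merge` are hypothetical and only
illustrate the pointer-based "set vs. not set" check.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fragment is a hypothetical config fragment with an optional boolean.
type fragment struct {
	AEnabled *bool `json:"a"`
}

// describe decodes doc and reports whether `a` was set explicitly.
func describe(doc string) string {
	var f fragment

	if err := json.Unmarshal([]byte(doc), &f); err != nil {
		return "invalid: " + err.Error()
	}

	if f.AEnabled == nil {
		return "not set"
	}

	return fmt.Sprintf("set to %t", *f.AEnabled)
}

// merge overrides base only with fields the patch sets explicitly, so an
// explicit false in the patch is not mistaken for "not set".
func merge(base, patch fragment) fragment {
	if patch.AEnabled != nil {
		base.AEnabled = patch.AEnabled
	}

	return base
}

func main() {
	fmt.Println(describe(`{"a": false}`)) // set to false
	fmt.Println(describe(`{}`))           // not set

	yes, no := true, false
	merged := merge(fragment{AEnabled: &yes}, fragment{AEnabled: &no})
	fmt.Println(*merged.AEnabled) // false
}
```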
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Introduce `cluster.NodeInfo` to represent the basic info about a node which can be used in the health checks. This information, where possible, will be populated by the discovery service in following PRs. Part of siderolabs#5554.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
The new mode allows changing the config for a limited period of time,
which makes it possible to try a configuration and automatically roll it
back, for example if it doesn't work.
The mode can only be used with changes that can be applied without a
reboot.
In this mode the configuration is not written to disk; it is only
changed in memory.
`--timeout` parameter can be used to customize the rollback delay.
The default timeout is 1 minute.
Any subsequent configuration change will abort try mode and the last
applied configuration will be used.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Dry run prints out the config diff and the selected application mode
without changing the configuration.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
This is a follow-up fix to PR #5129.
1. Correctly catch only expected errors in the tests.
2. Rewind the snapshot each time the upload is retried.
3. Correctly unwrap errors in the `EtcdRecovery` client.
4. Update the `grpc-proxy` library to pass through the EOF error.
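
Point 2 can be sketched as follows. This is a minimal illustration, not
the client code: `uploadWithRetry`, `demoUpload` and the simulated
failure are hypothetical. The point is that a failed attempt may have
consumed part of the stream, so the reader is rewound before every
attempt.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// uploadWithRetry calls upload up to attempts times and rewinds the
// snapshot to its start before each attempt.
func uploadWithRetry(snapshot io.ReadSeeker, attempts int, upload func(io.Reader) error) error {
	if attempts <= 0 {
		return errors.New("attempts must be positive")
	}

	var err error

	for i := 0; i < attempts; i++ {
		if _, serr := snapshot.Seek(0, io.SeekStart); serr != nil {
			return fmt.Errorf("rewinding snapshot: %w", serr)
		}

		if err = upload(snapshot); err == nil {
			return nil
		}
	}

	return fmt.Errorf("upload failed after %d attempts: %w", attempts, err)
}

// demoUpload simulates a first attempt which reads the whole stream and
// then fails; it returns the data seen by the successful attempt.
func demoUpload(data string) (string, error) {
	calls := 0
	got := ""

	err := uploadWithRetry(strings.NewReader(data), 3, func(r io.Reader) error {
		calls++

		b, rerr := io.ReadAll(r)
		if rerr != nil {
			return rerr
		}

		if calls == 1 {
			return errors.New("transient failure")
		}

		got = string(b)

		return nil
	})

	return got, err
}

func main() {
	fmt.Println(demoUpload("snapshot-bytes"))
}
```

Without the `Seek` call the second attempt in `demoUpload` would read an
empty stream.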
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Some failures can be fixed by updating the machine configuration.
Now failures in `userDisks` and `userFiles` no longer make Talos enter a
reboot loop; instead, Talos pauses for 35 minutes.
Additionally, `apid` and `machined` are now started right after
containerd is up and running.
That makes it possible for the operator to connect to the node using
talosctl and fix the config.
Fixes: https://github.com/talos-systems/talos/issues/4669
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Talos shouldn't try to re-encode the machine config it was provided
with.
So add a `ReadonlyWrapper` around `*v1alpha1.Config` which makes sure
that raw config object is not available anymore (it's a private field),
but config accessors are available for read-only access.
Another thing that `ReadonlyWrapper` does is preserve the original
`[]byte` encoding of the config, keeping it exactly the same as it was
when loaded from a file or read over the network.
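
The idea can be sketched as below. This is not the `ReadonlyWrapper`
code: the `config` and `readonlyConfig` types are hypothetical, and JSON
from the standard library stands in for the real config encoding.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// config is a hypothetical stand-in for the mutable config struct.
type config struct {
	Hostname string `json:"hostname"`
}

// readonlyConfig keeps the decoded config in an unexported field, so other
// packages can only use the accessors, and remembers the original bytes.
type readonlyConfig struct {
	cfg *config
	raw []byte
}

// load decodes raw and keeps a private copy of the original encoding.
func load(raw []byte) (*readonlyConfig, error) {
	var c config

	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}

	return &readonlyConfig{cfg: &c, raw: append([]byte(nil), raw...)}, nil
}

// Hostname is a read-only accessor.
func (r *readonlyConfig) Hostname() string { return r.cfg.Hostname }

// Bytes returns a copy of the config exactly as it was loaded, without
// re-encoding it.
func (r *readonlyConfig) Bytes() []byte { return append([]byte(nil), r.raw...) }

func main() {
	raw := []byte(`{ "hostname":   "node-1" }`)

	cfg, err := load(raw)
	if err != nil {
		panic(err)
	}

	fmt.Println(cfg.Hostname())
	fmt.Println(string(cfg.Bytes()) == string(raw))
}
```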
Improved `talosctl edit mc` to preserve the config as it was submitted,
and preserve the edits on error from Talos (previously edits were lost).
`ReadonlyWrapper` is not used on the config generation path though:
config there is represented by `*v1alpha1.Config` and can be freely
modified.
Why almost? Some parts of Talos (platform code) patch the machine
configuration with new data. We need to fix platforms to provide
networking configuration in a different way, but this will come with
other PRs later.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Sometimes pushing to or pulling from the Kubernetes registry is delayed
due to backoff on failed attempts to talk to the API server while the
cluster is still bootstrapping. Work around that by adding retries.
Also disable the kernel module controller in container mode, as it would
always keep failing.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This verifies that members match the cluster state and that both cluster
registries work in sync, producing the same discovery data.
Fixes #4191
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This test sometimes fails with a message like:
```
=== RUN TestIntegration/api.EventsSuite/TestEventsWatch
assertion_compare.go:323:
Error Trace: events.go:88
Error: "0" is not greater than or equal to "14"
Test: TestIntegration/api.EventsSuite/TestEventsWatch
Messages: []
```
I believe the root cause is that the initial (first event) delivery
might be more than 100ms, so instead of waiting for 100ms for each
event, block for 500ms for all events to arrive.
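
One way to express "one deadline for all events" is sketched below. It
is not the test code: `collect` and `demoCollect` are hypothetical, and
the delays are sample values chosen so that the first event arrives
later than 100ms.

```go
package main

import (
	"fmt"
	"time"
)

// collect reads from ch until want events have arrived or the overall
// timeout expires, whichever comes first.
func collect(ch <-chan string, want int, timeout time.Duration) []string {
	deadline := time.After(timeout)

	var got []string

	for len(got) < want {
		select {
		case ev, ok := <-ch:
			if !ok {
				return got
			}

			got = append(got, ev)
		case <-deadline:
			return got
		}
	}

	return got
}

// demoCollect delivers the first event after 150ms, which a 100ms
// per-event wait would miss, and returns the number of events collected.
func demoCollect() int {
	ch := make(chan string, 3)

	go func() {
		time.Sleep(150 * time.Millisecond)

		for _, ev := range []string{"a", "b", "c"} {
			ch <- ev
		}
	}()

	return len(collect(ch, 3, 500*time.Millisecond))
}

func main() {
	fmt.Println(demoCollect())
}
```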
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #4094
Deprecate old networkd APIs: `talosctl interfaces` and `talosctl routes`
now suggest different commands to achieve the same task.
TUI installer was updated to stop using Interfaces API.
Those APIs will be completely removed in 0.14.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This simply uses the new protobuf package instead of the old one.
The old protobuf package is still in use by Talos dependencies.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The problem was that the gRPC function `status.Code(err)` doesn't unwrap
errors, while the Talos client returns errors wrapped with
`multierror.Error` and `fmt.Errorf`, so `status.Code` doesn't return
the error code correctly.
Fix that by introducing our own client method which correctly goes over
the chain of wrapped errors.
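
The approach can be sketched with the standard library alone. This is
not the Talos client method: `code`, `statusError` and `codeOf` are
hypothetical stand-ins for the gRPC status types, and `multierror` is
not modeled; only `fmt.Errorf` wrapping is shown.

```go
package main

import (
	"errors"
	"fmt"
)

// code is a hypothetical stand-in for a gRPC status code.
type code int

const (
	codeOK code = iota
	codeUnknown
	codeUnavailable
)

// statusError is a hypothetical stand-in for an error carrying a status.
type statusError struct {
	code code
	msg  string
}

func (e *statusError) Error() string { return e.msg }

// codeOf walks the chain of wrapped errors and returns the code of the
// first statusError found, instead of inspecting only the outermost error.
func codeOf(err error) code {
	if err == nil {
		return codeOK
	}

	var se *statusError
	if errors.As(err, &se) {
		return se.code
	}

	return codeUnknown
}

// wrappedUnavailable returns a status error wrapped twice with fmt.Errorf.
func wrappedUnavailable() error {
	inner := &statusError{code: codeUnavailable, msg: "connection refused"}

	return fmt.Errorf("request failed: %w", fmt.Errorf("node 10.5.0.2: %w", inner))
}

// plainError returns an error with no status anywhere in its chain.
func plainError() error {
	return fmt.Errorf("outer: %w", errors.New("inner"))
}

func main() {
	fmt.Println(codeOf(wrappedUnavailable()) == codeUnavailable)
	fmt.Println(codeOf(plainError()) == codeUnknown)
}
```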
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This fixes a scenario in which a control plane node loses the contents
of `/var` without leaving etcd first: on reboot the etcd data directory
is empty, but the member is already present in the etcd member list, so
etcd won't be able to join because the raft log is empty.
The fix is to remove a member with a matching hostname, if found in the
etcd member list, and then add the new member.
The risk here is removing another member which has same hostname as the
joining node, but having duplicate hostnames for control plane node is a
problem anyways.
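
The remove-then-add step can be sketched on an in-memory member list.
This is an illustration only: the `member` type and `rejoin` are
hypothetical and no etcd client calls are made.

```go
package main

import "fmt"

// member is a hypothetical, simplified view of an etcd member.
type member struct {
	ID   uint64
	Name string // hostname of the node
}

// rejoin removes any existing member with the same hostname and then adds
// the joining node as a new member. It returns the resulting list and the
// IDs of the removed members.
func rejoin(members []member, hostname string, newID uint64) (result []member, removed []uint64) {
	for _, m := range members {
		if m.Name == hostname {
			// Stale entry left behind by the node which lost its data.
			removed = append(removed, m.ID)

			continue
		}

		result = append(result, m)
	}

	result = append(result, member{ID: newID, Name: hostname})

	return result, removed
}

func main() {
	members := []member{{ID: 1, Name: "cp-1"}, {ID: 2, Name: "cp-2"}}

	result, removed := rejoin(members, "cp-2", 3)
	fmt.Println(result, removed)
}
```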
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Route handling is very similar to addresses:
* `RouteStatus` describes kernel routing table state,
`RouteStatusController` reflects kernel state into resources
* `RouteSpec` defines routes to be configured
* `RouteConfigController` creates `RouteSpec`s based on cmdline and
machine configuration
* `RouteMergeController` merges different configuration layers into the
final representation
* `RouteSpecController` applies the specs to the kernel routing table
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
When a Talos `controlplane` node is waiting for bootstrap, `etcd`
contents can be recovered from a snapshot created with
`talosctl etcd snapshot` on a healthy cluster.
The bootstrap process goes the same way as before, but the etcd data
directory is recovered from the snapshot.
This flow enables disaster recovery for the control plane: given that
periodic backups are available, destroy control plane nodes, re-create
them with the same config, and bootstrap one node with the saved
snapshot to recover etcd state at the time of the snapshot.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>