109 Commits

Author SHA1 Message Date
Utku Ozdemir
84e712a9f1
feat: introduce Talos API access from Kubernetes
We add a new CRD, `serviceaccounts.talos.dev` (with `tsa` as short name), and its controller, which allows users to get a `Secret` containing a short-lived Talosconfig in their namespaces with the roles they need. Additionally, we introduce the `talosctl inject serviceaccount` command, which accepts a YAML file with Kubernetes manifests and injects Talos service accounts into them so that they can be applied directly to Kubernetes afterwards. If the Talos API access feature is enabled on the Talos side, the injected workloads will be able to talk to the Talos API.

Closes siderolabs/talos#4422.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-08-08 18:27:26 +02:00
Andrey Smirnov
a6b010a8b4
chore: update Go to 1.19, Linux to 5.15.58
See https://go.dev/doc/go1.19

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-03 17:03:58 +04:00
Andrey Smirnov
fe2ee3b100
feat: implement MachineStatus resource
Fixes #5789

Example:

```yaml
spec:
    stage: running
    status:
        ready: false
        unmetConditions:
            - name: staticPods
              reason: kube-system/kube-controller-manager-talos-default-master-1 not ready, kube-system/kube-scheduler-talos-default-master-1 not ready
```

As events (CLI doesn't show full contents):

```
172.20.0.2   cbhf2l6f9lrs738hehfg   talos/runtime/machine.MachineStatusEvent   BOOTING   ready: false, unmet conditions: [time network services]
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-01 18:36:10 +04:00
Artem Chernyshev
e5994ff7a7
fix: skip ResetDuringBoot test if the Cluster config is unknown
And improve retry logic in the test.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-07-28 15:57:58 +03:00
Artem Chernyshev
ae1bec59e9
feat: allow running only one sequence at a time
Fix the `Talos` sequencer to run only a single sequence at a time.
Sequence priorities were updated to match the table:

| requested (rows) / running (columns)               | boot | reboot | reset | upgrade |
|----------------------------------------------------|------|--------|-------|---------|
| reboot                                             | Y    | Y      | Y     | N       |
| reset                                              | Y    | N      | N     | N       |
| upgrade                                            | Y    | N      | N     | N       |

With a small addition: `WithTakeover` is still there.
If set, priority is ignored.

This is mainly used for `Shutdown` sequence invocation
and when applying config with reboot enabled.
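
As an illustration only (not the actual sequencer code), the table can be read as a lookup, taking `Y` to mean that the requested sequence may proceed. The `Sequence` type, the `allowed` map and `canStart` are hypothetical names introduced for this sketch:

```go
package main

import "fmt"

// Sequence names are taken from the table above.
type Sequence string

const (
	Boot    Sequence = "boot"
	Reboot  Sequence = "reboot"
	Reset   Sequence = "reset"
	Upgrade Sequence = "upgrade"
)

// allowed[requested][running] mirrors the table: rows are requested
// sequences, columns are running ones. Missing entries mean "N".
var allowed = map[Sequence]map[Sequence]bool{
	Reboot:  {Boot: true, Reboot: true, Reset: true},
	Reset:   {Boot: true},
	Upgrade: {Boot: true},
}

// canStart is a hypothetical helper, not the Talos sequencer API.
// takeover models WithTakeover: if set, priority is ignored.
func canStart(running, requested Sequence, takeover bool) bool {
	if takeover {
		return true
	}
	return allowed[requested][running]
}

func main() {
	fmt.Println(canStart(Boot, Upgrade, false))   // true
	fmt.Println(canStart(Upgrade, Reboot, false)) // false
	fmt.Println(canStart(Upgrade, Reset, true))   // true
}
```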

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-07-27 17:21:36 +03:00
Dmitriy Matrenichev
30f7851d2a
chore: bump golangci-lint from 1.45.2 to 1.47.2
Minor linter upgrade.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-07-22 17:49:44 +03:00
Utku Ozdemir
bb4abc0961
fix: regenerate kubelet certs when hostname changes
Clear the kubelet certificates and kubeconfig when hostname changes so that on next start, kubelet goes through the bootstrap process and new certificates are generated and the node is joined to the cluster with the new name.

Fixes siderolabs/talos#5834.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-07-21 01:54:15 +02:00
Utku Ozdemir
87ea1d9611
fix: update kubelet kubeconfig when cluster control plane endpoint changes
Overwrite cluster's server URL in the kubeconfig file used by kubelet when the cluster control plane endpoint is changed in machineconfig, so that kubelet doesn't lose connectivity to kube-apiserver.

Closes siderolabs/talos#4470.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-07-16 14:19:25 +02:00
Andrey Smirnov
86a0a7bdf7
refactor: use pointer types more in machine config structs
There should be no functional change with this PR.

The primary driver is supporting strategic merge configuration patches.
For such type of patches machine config should be loaded from incomplete
fragments, so it becomes critically important to distinguish between a
field having zero value vs. field being set in YAML.

E.g. with following struct:

```go
struct { AEnabled *bool `yaml:"a"` }
```

It's possible to distinguish between:

```yaml
a: false
```

and no mention of `a` in YAML.

The merging process treats zero values as "not set" (skips them when
merging), so it's important to allow overriding a value to an explicit
`false`.
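
A minimal sketch of the distinction, assuming JSON in place of YAML so that only the standard library is needed. `Fragment`, `merge`, `describe` and `parse` are hypothetical names, not the machine config API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Fragment mirrors the struct above; JSON tags stand in for YAML tags.
type Fragment struct {
	AEnabled *bool `json:"a"`
}

// merge is a hypothetical helper: it copies only fields that the patch set.
func merge(base, patch Fragment) Fragment {
	if patch.AEnabled != nil {
		base.AEnabled = patch.AEnabled
	}
	return base
}

// describe renders the three possible states of the field.
func describe(f Fragment) string {
	if f.AEnabled == nil {
		return "unset"
	}
	return fmt.Sprint(*f.AEnabled)
}

// parse decodes a fragment and panics on malformed input (sketch only).
func parse(doc string) Fragment {
	var f Fragment
	if err := json.Unmarshal([]byte(doc), &f); err != nil {
		panic(err)
	}
	return f
}

func main() {
	base := parse(`{"a": true}`)
	fmt.Println(describe(merge(base, parse(`{}`))))           // true
	fmt.Println(describe(merge(base, parse(`{"a": false}`)))) // false
}
```

With a plain `bool` field both patches would decode to `false`, and the merge could not tell an explicit override from an absent key.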

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-07-01 17:27:11 +04:00
Utku Ozdemir
8d2be5e315
feat: extend node definition used in health checks
Introduce `cluster.NodeInfo` to represent the basic info about a node which can be used in the health checks. This information, where possible, will be populated by the discovery service in following PRs. Part of siderolabs#5554.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-06-13 14:13:42 +02:00
Andrey Smirnov
f2997c0f22
chore: bump dependencies
dependabot + go-mod-outdated

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-06-06 23:27:17 +04:00
Andrey Smirnov
2ae0e3a569
test: add a test for version of Go Talos was built with
This is to ensure that Talos is in fact built with the Go version we expect.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-05-11 21:51:12 +03:00
Artem Chernyshev
2b03057b91
feat: implement a new mode try in the config manipulation commands
The new mode allows changing the config for a period of time, which
allows trying the configuration and automatically rolling it back if
it doesn't work, for example.

The mode can only be used with changes that can be applied without a
reboot.

In try mode the configuration is not written to disk; it is changed only
in memory.
`--timeout` parameter can be used to customize the rollback delay.
The default timeout is 1 minute.

Any subsequent configuration change will abort try mode and the last
applied configuration will be used.
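
The behavior can be modeled roughly as follows. This is a sketch under assumptions, not the Talos implementation: `tryState` and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// tryState is a hypothetical in-memory model, not Talos code.
type tryState struct {
	applied  string    // last applied (persisted) configuration
	trial    string    // configuration being tried, held only in memory
	deadline time.Time // zero when try mode is inactive
}

// Try activates a trial configuration that expires after timeout.
func (s *tryState) Try(cfg string, now time.Time, timeout time.Duration) {
	s.trial = cfg
	s.deadline = now.Add(timeout)
}

// Apply records a regular configuration change and aborts try mode.
func (s *tryState) Apply(cfg string) {
	s.applied = cfg
	s.trial = ""
	s.deadline = time.Time{}
}

// Effective returns the configuration in force at the given time.
func (s *tryState) Effective(now time.Time) string {
	if !s.deadline.IsZero() && now.Before(s.deadline) {
		return s.trial
	}
	return s.applied
}

func main() {
	start := time.Unix(0, 0)
	s := &tryState{applied: "v1"}
	s.Try("v2", start, time.Minute) // 1 minute is the default timeout

	fmt.Println(s.Effective(start.Add(30 * time.Second))) // v2
	fmt.Println(s.Effective(start.Add(2 * time.Minute)))  // v1
}
```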

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-04-21 20:31:45 +03:00
Artem Chernyshev
2b9722d1f5
feat: add dry-run flag in apply-config and edit commands
Dry run prints the config diff and the selected application mode without
changing the configuration.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-04-14 19:12:57 +03:00
Dmitriy Matrenichev
e06e1473b0
feat: update golangci-lint to 1.45.0 and gofumpt to 0.3.0
- Update golangci-lint to 1.45.0
- Update gofumpt to 0.3.0
- Fix gofumpt errors
- Add goimports and format imports since gofumports is removed
- Update Dockerfile
- Fix .golangci.yml configuration
- Fix linting errors

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-03-24 08:14:04 +04:00
Andrey Smirnov
f477507262
fix: the etcd recovery client and tests
This is the follow-up fix to the PR #5129.

1. Correctly catch only expected errors in the tests.
2. Rewind the snapshot each time the upload is retried.
3. Correctly unwrap errors in the `EtcdRecovery` client.
4. Update the `grpc-proxy` library to pass through the EOF error.
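
Item 2 can be sketched as below. `uploadWithRetry` and `runExample` are hypothetical helpers written for illustration; the real client code may differ:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// uploadWithRetry is a hypothetical helper: it rewinds the snapshot before
// every attempt so a retry never starts from a partially consumed reader.
func uploadWithRetry(snapshot io.ReadSeeker, attempts int, upload func(io.Reader) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if _, err = snapshot.Seek(0, io.SeekStart); err != nil {
			return fmt.Errorf("rewinding snapshot: %w", err)
		}
		if err = upload(snapshot); err == nil {
			return nil
		}
	}
	return fmt.Errorf("upload failed after %d attempts: %w", attempts, err)
}

// runExample simulates one failed attempt that consumes the reader,
// followed by a successful retry.
func runExample() (calls int, sent string, err error) {
	err = uploadWithRetry(bytes.NewReader([]byte("snapshot")), 3, func(r io.Reader) error {
		calls++
		data, readErr := io.ReadAll(r)
		if readErr != nil {
			return readErr
		}
		if calls == 1 {
			return errors.New("transient failure")
		}
		sent = string(data)
		return nil
	})
	return calls, sent, err
}

func main() {
	fmt.Println(runExample())
}
```

Without the `Seek` call the second attempt would read zero bytes, because the first attempt already drained the reader.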

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-22 16:51:36 +03:00
Artem Chernyshev
27af5d41c6
feat: pause the boot process on some failures instead of rebooting
Some failures can be fixed by updating the machine configuration.
Now `userDisks` and `userFiles` failures no longer make Talos enter a reboot
loop; the boot process pauses for 35 minutes instead.

Additionally, `apid` and `machined` are now started right after
containerd is up and running.

That makes it possible for the operator to connect to the node using
talosctl and fix the config.

Fixes: https://github.com/talos-systems/talos/issues/4669
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-03-21 17:39:45 +03:00
Andrey Smirnov
0da370dfef
test: unlock CABPT/CACPPT provider versions
We should always test latest versions of our providers.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-10 00:14:15 +03:00
Artem Chernyshev
2f2bdb26aa
feat: replace flags with --mode in apply, edit and patch commands
Fixes: https://github.com/talos-systems/talos/issues/4588

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-01-13 16:09:53 +03:00
Andrey Smirnov
2f4b9d8d6d
feat: make machine configuration read-only in Talos (almost)
Talos shouldn't try to re-encode the machine config it was provided
with.

So add a `ReadonlyWrapper` around `*v1alpha1.Config` which makes sure
that raw config object is not available anymore (it's a private field),
but config accessors are available for read-only access.

Another thing that `ReadonlyWrapper` does is that it preserves the
original `[]byte` encoding of the config keeping it exactly same way as
it was loaded from file or read over the network.

Improved `talosctl edit mc` to preserve the config as it was submitted,
and preserve the edits on error from Talos (previously edits were lost).

`ReadonlyWrapper` is not used on the config generation path though: config
there is represented by `*v1alpha1.Config` and can be freely modified.

Why almost? Some parts of Talos (platform code) patch the machine
configuration with new data. We need to fix platforms to provide
networking configuration in a different way, but this will come with
other PRs later.
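
The idea of the wrapper, sketched with hypothetical types (`config`, `readonly`, `load`) and JSON standing in for YAML to stay within the standard library; this is not the actual `ReadonlyWrapper` code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// config is a hypothetical stand-in for *v1alpha1.Config.
type config struct {
	Hostname string `json:"hostname"`
}

// readonly sketches the idea: the parsed config is a private field, and the
// original bytes are kept so the config is never re-encoded.
type readonly struct {
	cfg *config
	raw []byte
}

// load parses the input and keeps a private copy of the original bytes.
func load(raw []byte) (*readonly, error) {
	var c config
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	// Copy the input so later changes to the caller's slice cannot leak in.
	return &readonly{cfg: &c, raw: append([]byte(nil), raw...)}, nil
}

// Hostname is a read-only accessor.
func (r *readonly) Hostname() string { return r.cfg.Hostname }

// Bytes returns a copy of the encoding exactly as it was loaded.
func (r *readonly) Bytes() []byte { return append([]byte(nil), r.raw...) }

func main() {
	in := []byte("{ \"hostname\":   \"node-1\" }")
	r, err := load(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Hostname())
	fmt.Println(string(r.Bytes()) == string(in)) // formatting is preserved
}
```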

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-12-28 20:12:55 +03:00
Andrey Smirnov
d2a7e082c2
test: retry in discovery tests
Sometimes pushing/pulling to Kubernetes registry is delayed due to
backoff on failed attempts to talk to the API server when the cluster is
still bootstrapping. Work around that by adding retries.

Also disable the kernel module controller in container mode, as it would
always fail there.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-12-28 16:55:41 +03:00
Noel Georgi
4c96e936ed
docs: add cilium guide
- Add Cilium CNI install guide
- Use Canal CNI for default examples

Fixes #4477

Signed-off-by: Noel Georgi <git@frezbo.dev>
2021-12-16 20:37:03 +05:30
Rohit Dandamudi
7f9922296a
feat: add powercycle mode in reboot
- Fixes #4569
- Updated reboot process sequence
- Updated api.descriptors to avoid proto type change linting error https://github.com/talos-systems/talos/pull/4612#discussion_r758599242

Signed-off-by: Rohit Dandamudi <rohit.dandamudi@siderolabs.com>
2021-12-02 22:40:04 +05:30
Andrey Smirnov
753a82188f
refactor: move pkg/resources to machinery
Fixes #4420

No functional changes, just moving packages around.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-11-15 19:50:35 +03:00
Alexey Palazhchenko
7462733bcb
chore: update golangci-lint
Fix context propagation.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>
2021-11-15 14:55:25 +00:00
Andrey Smirnov
b6b78e7fef
test: add cluster discovery integration tests
This verifies that members match cluster state and that both cluster
registries work in sync producing same discovery data.

Fixes #4191

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-10-25 21:03:29 +03:00
Andrey Smirnov
3e100aa977
test: workaround EventsWatch test flakiness
This test sometimes fails with a message like:

```
=== RUN   TestIntegration/api.EventsSuite/TestEventsWatch
    assertion_compare.go:323:
        	Error Trace:	events.go:88
        	Error:      	"0" is not greater than or equal to "14"
        	Test:       	TestIntegration/api.EventsSuite/TestEventsWatch
        	Messages:   	[]
```

I believe the root cause is that the initial (first event) delivery
might be more than 100ms, so instead of waiting for 100ms for each
event, block for 500ms for all events to arrive.
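
A rough sketch of waiting with one overall deadline rather than a per-event timeout. `collect` is a hypothetical helper, not the code in the test suite:

```go
package main

import (
	"fmt"
	"time"
)

// collect is a hypothetical helper: it gathers events until one overall
// deadline expires, instead of applying a short timeout per event.
func collect(events <-chan string, total time.Duration) []string {
	var got []string
	deadline := time.After(total)
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return got
			}
			got = append(got, ev)
		case <-deadline:
			return got
		}
	}
}

func main() {
	events := make(chan string)
	go func() {
		time.Sleep(150 * time.Millisecond) // slow first delivery
		for i := 0; i < 3; i++ {
			events <- fmt.Sprintf("event-%d", i)
		}
	}()
	fmt.Println(len(collect(events, 500*time.Millisecond)))
}
```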

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-10-15 12:51:56 +03:00
Andrey Smirnov
b450b7cef0
chore: deprecate Interfaces and Routes APIs
Fixes #4094

Deprecate old networkd APIs, `talosctl interfaces` and `talosctl routes`
now suggest different commands to achieve the same task.

TUI installer was updated to stop using Interfaces API.

Those APIs will be completely removed in 0.14.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-09-27 15:21:02 +03:00
Andrey Smirnov
a059454045
chore: build using Go 1.17
`initramfs` size for amd64 shrinks by 1.3 MiB.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-09-13 22:33:47 +03:00
Alexey Palazhchenko
eea750de2c chore: rename "join" type to "worker"
Closes #3413.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-09 07:10:45 -07:00
Andrey Smirnov
b969e7720e chore: update references to old protobuf package
This simply uses new protobuf package instead of old one.

Old protobuf package is still in use by Talos dependencies.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-08 05:34:12 -07:00
Andrey Smirnov
10c28758a4 fix: ignore DeadlineExceeded error correctly on bootstrap
The problem was that gRPC method `status.Code(err)` doesn't unwrap
errors, while Talos client returns errors wrapped with
`multierror.Error` and `fmt.Errorf`, so `status.Code` doesn't return
error code correctly.

Fix that by introducing our own client method which correctly goes over
the chain of wrapped errors.
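
The difference can be illustrated without gRPC. In this sketch `Code` and `codedError` are hypothetical stand-ins for gRPC status codes and status errors, `multierror` is not modeled, and `deepCode` is not the client method added by this commit:

```go
package main

import (
	"errors"
	"fmt"
)

// Code and codedError are hypothetical stand-ins for gRPC status codes and
// status errors; they keep the sketch free of external dependencies.
type Code int

const (
	Unknown Code = iota
	DeadlineExceeded
)

type codedError struct {
	code Code
	msg  string
}

func (e *codedError) Error() string { return e.msg }

// shallowCode inspects only the outermost error, so wrapping hides the code.
func shallowCode(err error) Code {
	if ce, ok := err.(*codedError); ok {
		return ce.code
	}
	return Unknown
}

// deepCode walks the chain of wrapped errors via errors.As.
func deepCode(err error) Code {
	var ce *codedError
	if errors.As(err, &ce) {
		return ce.code
	}
	return Unknown
}

// wrappedDeadline builds an error wrapped twice with fmt.Errorf and %w.
func wrappedDeadline() error {
	inner := &codedError{code: DeadlineExceeded, msg: "deadline exceeded"}
	return fmt.Errorf("bootstrap: %w", fmt.Errorf("rpc failed: %w", inner))
}

func main() {
	err := wrappedDeadline()
	fmt.Println(shallowCode(err) == DeadlineExceeded) // false
	fmt.Println(deepCode(err) == DeadlineExceeded)    // true
}
```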

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-07 12:02:26 -07:00
Andrey Smirnov
62c702c4fd fix: remove conflicting etcd member on rejoin with empty data directory
This fixes a scenario when control plane node loses contents of `/var`
without leaving etcd first: on reboot etcd data directory is empty, but
member is already present in the etcd member list, so etcd won't be able
to join because of raft log being empty.

The fix is to remove a member with matching hostname if found in the
etcd member list followed by new member add.

The risk here is removing another member which has the same hostname as the
joining node, but having duplicate hostnames for control plane nodes is a
problem anyway.
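
The fix can be sketched on a plain slice. `member` and `rejoin` are hypothetical and do not use the etcd client API:

```go
package main

import "fmt"

// member is a hypothetical, simplified view of an etcd member.
type member struct {
	ID       uint64
	Hostname string
}

// rejoin removes any existing member with the joining node's hostname and
// then adds the node as a new member, as described above.
func rejoin(members []member, hostname string, newID uint64) []member {
	out := make([]member, 0, len(members)+1)
	for _, m := range members {
		if m.Hostname == hostname {
			continue // stale entry left behind by the node that lost /var
		}
		out = append(out, m)
	}
	return append(out, member{ID: newID, Hostname: hostname})
}

func main() {
	members := []member{{1, "cp-1"}, {2, "cp-2"}, {3, "cp-3"}}
	fmt.Println(rejoin(members, "cp-2", 4))
}
```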

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-03 15:11:44 -07:00
Andrey Smirnov
0acb04ad7a feat: implement route network controllers
Route handling is very similar to addresses:

* `RouteStatus` describes kernel routing table state,
`RouteStatusController` reflects kernel state into resources
* `RouteSpec` defines routes to be configured
* `RouteConfigController` creates `RouteSpec`s based on cmdline and
machine configuration
* `RouteMergeController` merges different configuration layers into the
final representation
* `RouteSpecController` applies the specs to the kernel routing table

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-05-25 11:09:21 -07:00
Alexey Palazhchenko
1fcf38f9d6 feat: add support for "none" CNI type
Closes #3411.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-04-09 12:53:00 -07:00
Andrey Smirnov
e0650218a6 feat: support etcd recovery from snapshot on bootstrap
When Talos `controlplane` node is waiting for a bootstrap, `etcd`
contents can be recovered from a snapshot created with
`talosctl etcd snapshot` on a healthy cluster.

Bootstrap process goes same way as before, but the etcd data directory
is recovered from the snapshot.

This flow enables disaster recovery for the control plane: given that
periodic backups are available, destroy control plane nodes, re-create
them with the same config, and bootstrap one node with the saved
snapshot to recover etcd state at the time of the snapshot.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-08 10:15:37 -07:00
Andrey Smirnov
7d91258475 test: fix data race in apply config tests
Variable `chanErr` was read before waiting for the goroutine to finish.
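
The ordering that avoids the race, as a minimal sketch. Only the variable name `chanErr` comes from the test; `run` and `waitThenRead` are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// run is a hypothetical stand-in for the work done in the goroutine.
func run() error { return errors.New("watch failed") }

// waitThenRead reads chanErr only after the goroutine has finished.
func waitThenRead() error {
	var (
		wg      sync.WaitGroup
		chanErr error
	)
	wg.Add(1)
	go func() {
		defer wg.Done()
		chanErr = run() // written inside the goroutine
	}()
	wg.Wait()      // wait first...
	return chanErr // ...then read; reading before Wait would be a data race
}

func main() {
	fmt.Println(waitThenRead())
}
```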

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-31 10:46:50 -07:00
Andrey Smirnov
204caf8eb9 test: fix apply-config integration test, bump clusterctl version
Tests for ApplyConfig API were relying on not really supported behavior
of modifying config via the `Provider` interface (and it was "fixed" in
another PR which cleans up such access to the configuration).

Cluster version bumped to try to work around strange CAPI bootstrap
failures in e2e-capi.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-31 09:55:53 -07:00
Andrey Smirnov
2ea20f598a feat: replace timed with time sync controller
This is a complete rewrite of time sync process.

Now the time sync process starts early at boot time, and it adapts to
configuration changes:

* before config is available, `pool.ntp.org` is used
* once config is available, configured time servers are used

The controller updates the same time sync resource that other controllers
already depended on, so they can still wait for the time sync event.

Talos services which depend on time now wait on same resource instead of
waiting on timed health.

New features:

* time sync now sticks to a particular time server unless that server
returns an error, in which case the server is changed; this improves
time sync accuracy

* time sync acts on config changes immediately, so it's possible to
reconfigure time sync at any time

* there's a new 'epoch' field in time sync resources which allows
time-dependent controllers to regenerate certs when there's a big enough
jump in time

Features to implement later:

* apid shouldn't depend on timed, it should be started early and it
should regenerate certs on time jump

* trustd should be updated in same way
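
Server selection as described above, sketched with hypothetical names (`serversFor`, `picker`). Only `pool.ntp.org` comes from this commit; rotating round-robin on failure is an assumption:

```go
package main

import "fmt"

// defaultServer is named in the commit message; everything else here is a
// hypothetical sketch, not the controller's real API.
const defaultServer = "pool.ntp.org"

// serversFor returns the configured servers, or the default before config
// is available.
func serversFor(configured []string) []string {
	if len(configured) == 0 {
		return []string{defaultServer}
	}
	return configured
}

// picker sticks to one server and moves on only after an error.
type picker struct {
	servers []string
	current int
}

// Server returns the server currently in use.
func (p *picker) Server() string { return p.servers[p.current] }

// Report records the result of a query; rotating round-robin on failure is
// an assumption made for this sketch.
func (p *picker) Report(failed bool) {
	if failed {
		p.current = (p.current + 1) % len(p.servers)
	}
}

func main() {
	p := &picker{servers: serversFor([]string{"a.example", "b.example"})}
	p.Report(false)
	fmt.Println(p.Server()) // a.example
	p.Report(true)
	fmt.Println(p.Server()) // b.example
}
```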

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-29 09:29:43 -07:00
Alexey Palazhchenko
d7e9f6d6a8 chore: build integration tests with -race
Refs https://github.com/talos-systems/talos/issues/3378.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-26 10:08:12 -07:00
Artem Chernyshev
6ffabe5169 feat: add ability to find disk by disk properties
Fixes: https://github.com/talos-systems/talos/issues/3323

Not exactly matching with udevd generated `by-<id>` symlinks, but should
provide sufficient amount of property selectors to be able to pick
specific disks for any kind of disk: sd card, hdd, ssd, nvme.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-23 14:23:02 -07:00
Artem Chernyshev
22f375300c chore: update golangci-lint to 1.38.0
Fix all discovered issues.
Detected a couple of bugs, fixed them as well.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-12 06:50:02 -08:00
Alexey Palazhchenko
df52c13581 chore: fix //nolint directives
That's the recommended syntax:
https://golangci-lint.run/usage/false-positives/

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-05 05:58:33 -08:00
Andrey Smirnov
31e56e63db fix: update in-cluster kubeconfig validity to match other certs
Talos generates in-cluster kubeconfig for the kube-scheduler and
kube-controller-manager to authenticate to kube-apiserver. Bug was that
validity of that kubeconfig was set to 24h by mistake. Fix that by
bumping validity to default for other Kubernetes certs (1 year).

Add a certificate refresh at 50% of the validity.

Fix bugs with copying secret resources which were leading to updates not
being propagated correctly.
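
The refresh point is the midpoint of the validity window. A small sketch, where `refreshAt` and `issued` are hypothetical names and the dates are example values:

```go
package main

import (
	"fmt"
	"time"
)

// issued is an arbitrary example issue time.
var issued = time.Date(2021, time.March, 1, 0, 0, 0, 0, time.UTC)

// refreshAt returns the midpoint of the validity window.
func refreshAt(notBefore, notAfter time.Time) time.Time {
	return notBefore.Add(notAfter.Sub(notBefore) / 2)
}

func main() {
	notAfter := issued.Add(365 * 24 * time.Hour) // 1 year validity
	fmt.Println(refreshAt(issued, notAfter).Format(time.RFC3339))
}
```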

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-01 11:16:04 -08:00
Artem Chernyshev
7108bb3f5b test: upgrade master to master tests
Verify upgrade flow using the same version of the installer.
Run that with disk encryption enabled.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-24 07:56:44 -08:00
Artem Chernyshev
06b8c09484 test: enable disk encryption key rotation test
Verify that disk encryption sync operations work properly.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-20 06:17:55 -08:00
Andrey Smirnov
32d2588528 test: update integration tests to use wrapped client for etcd APIs
This continues the fix from #3167.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-18 08:08:48 -08:00
Artem Chernyshev
58ff2c9808 feat: implement ephemeral partition encryption
This PR introduces the first part of disk encryption support.
New config section `systemDiskEncryption` was added into MachineConfig.
For now it contains only Ephemeral partition encryption.

Encryption itself supports two kinds of keys for now:
- node id deterministic key.
- static key which is hardcoded in the config and mainly used for test
purposes.

Talosctl cluster create can now be told to encrypt ephemeral partition
by using `--encrypt-ephemeral` flag.

Additionally:
- updated pkgs library version.
- changed Dockerfile to copy cryptsetup deps from pkgs.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-17 13:39:04 -08:00
Andrey Smirnov
cc83b83808 feat: rename apply-config --no-reboot to --on-reboot
This explains the intention better: config is applied on reboot, and
makes it easy to distinguish it from `apply-config --immediate` which
applies config immediately without a reboot (that is coming in a
different PR).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-17 12:49:47 -08:00
Andrey Smirnov
d99a016af2 fix: correct response structure for GenerateConfig API
Also fix recovery grpc handler to print panic stacktrace to the log.

Any API should follow the structure compatible with apid proxying
injection of errors/nodes.

Explicitly fail GenerateConfig API on worker nodes, as it panics on
worker nodes (missing certificates in node config).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-11 06:34:10 -08:00