92 Commits

Author SHA1 Message Date
Andrey Smirnov
0da370dfef
test: unlock CABPT/CACPPT provider versions
We should always test latest versions of our providers.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-10 00:14:15 +03:00
Artem Chernyshev
2f2bdb26aa
feat: replace flags with --mode in apply, edit and patch commands
Fixes: https://github.com/talos-systems/talos/issues/4588

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-01-13 16:09:53 +03:00
Andrey Smirnov
2f4b9d8d6d
feat: make machine configuration read-only in Talos (almost)
Talos shouldn't try to re-encode the machine config it was provided
with.

So add a `ReadonlyWrapper` around `*v1alpha1.Config` which makes sure
that raw config object is not available anymore (it's a private field),
but config accessors are available for read-only access.

Another thing that `ReadonlyWrapper` does is that it preserves the
original `[]byte` encoding of the config keeping it exactly same way as
it was loaded from file or read over the network.
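The wrapper idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual Talos code: the `Config` struct, its `Hostname` field, and the constructor and accessor names are hypothetical stand-ins.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for *v1alpha1.Config.
type Config struct {
	Hostname string
}

// ReadonlyWrapper keeps the decoded config private and remembers the
// exact bytes the config was loaded from.
type ReadonlyWrapper struct {
	cfg *Config // private: callers cannot reach the raw object
	raw []byte  // original encoding, kept byte-for-byte
}

// NewReadonlyWrapper copies the raw bytes so that later changes to the
// caller's slice do not leak into the wrapper.
func NewReadonlyWrapper(cfg *Config, raw []byte) *ReadonlyWrapper {
	return &ReadonlyWrapper{cfg: cfg, raw: append([]byte(nil), raw...)}
}

// Hostname is a read-only accessor.
func (w *ReadonlyWrapper) Hostname() string { return w.cfg.Hostname }

// Bytes returns a copy of the original encoding instead of re-encoding.
func (w *ReadonlyWrapper) Bytes() []byte { return append([]byte(nil), w.raw...) }

func main() {
	raw := []byte("hostname: node-1 # this comment survives\n")
	w := NewReadonlyWrapper(&Config{Hostname: "node-1"}, raw)
	fmt.Print(string(w.Bytes()))
}
```

Returning the stored bytes rather than re-marshalling is what keeps comments and formatting intact in this sketch.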

Improved `talosctl edit mc` to preserve the config as it was submitted,
and preserve the edits on error from Talos (previously edits were lost).

`ReadonlyWrapper` is not used on config generation path though - config
there is represented by `*v1alpha1.Config` and can be freely modified.

Why almost? Some parts of Talos (platform code) patch the machine
configuration with new data. We need to fix platforms to provide
networking configuration in a different way, but this will come with
other PRs later.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-12-28 20:12:55 +03:00
Andrey Smirnov
d2a7e082c2
test: retry in discovery tests
Sometimes pushing/pulling to Kubernetes registry is delayed due to
backoff on failed attempts to talk to the API server when the cluster is
still bootstrapping. Work around that by adding retries.

Also disable the kernel module controller in container mode, as it
would otherwise keep failing.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-12-28 16:55:41 +03:00
Noel Georgi
4c96e936ed
docs: add cilium guide
- Add Cilium CNI install guide
- Use Canal CNI for default examples

Fixes #4477

Signed-off-by: Noel Georgi <git@frezbo.dev>
2021-12-16 20:37:03 +05:30
Rohit Dandamudi
7f9922296a
feat: add powercycle mode in reboot
- Fixes #4569
- Updated reboot process sequence
- Updated api.descriptors to avoid proto type change linting error https://github.com/talos-systems/talos/pull/4612#discussion_r758599242

Signed-off-by: Rohit Dandamudi <rohit.dandamudi@siderolabs.com>
2021-12-02 22:40:04 +05:30
Andrey Smirnov
753a82188f
refactor: move pkg/resources to machinery
Fixes #4420

No functional changes, just moving packages around.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-11-15 19:50:35 +03:00
Alexey Palazhchenko
7462733bcb
chore: update golangci-lint
Fix context propagation.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>
2021-11-15 14:55:25 +00:00
Andrey Smirnov
b6b78e7fef
test: add cluster discovery integration tests
This verifies that members match cluster state and that both cluster
registries work in sync, producing the same discovery data.

Fixes #4191

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-10-25 21:03:29 +03:00
Andrey Smirnov
3e100aa977
test: work around EventsWatch test flakiness
This test sometimes fails with a message like:

```
=== RUN   TestIntegration/api.EventsSuite/TestEventsWatch
    assertion_compare.go:323:
        	Error Trace:	events.go:88
        	Error:      	"0" is not greater than or equal to "14"
        	Test:       	TestIntegration/api.EventsSuite/TestEventsWatch
        	Messages:   	[]
```

I believe the root cause is that the initial (first event) delivery
might take more than 100ms, so instead of waiting for 100ms for each
event, block for 500ms for all events to arrive.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-10-15 12:51:56 +03:00
Andrey Smirnov
b450b7cef0
chore: deprecate Interfaces and Routes APIs
Fixes #4094

Deprecate old networkd APIs. `talosctl interfaces` and `talosctl routes`
now suggest different commands to be used to achieve the same task.

TUI installer was updated to stop using Interfaces API.

Those APIs will be completely removed in 0.14.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-09-27 15:21:02 +03:00
Andrey Smirnov
a059454045
chore: build using Go 1.17
`initramfs` size for amd64 shrinks by 1.3 MiB.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-09-13 22:33:47 +03:00
Alexey Palazhchenko
eea750de2c chore: rename "join" type to "worker"
Closes #3413.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-09 07:10:45 -07:00
Andrey Smirnov
b969e7720e chore: update references to old protobuf package
This simply uses new protobuf package instead of old one.

Old protobuf package is still in use by Talos dependencies.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-08 05:34:12 -07:00
Andrey Smirnov
10c28758a4 fix: ignore DeadlineExceeded error correctly on bootstrap
The problem was that gRPC method `status.Code(err)` doesn't unwrap
errors, while Talos client returns errors wrapped with
`multierror.Error` and `fmt.Errorf`, so `status.Code` doesn't return
error code correctly.

Fix that by introducing our own client method which correctly goes over
the chain of wrapped errors.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-07-07 12:02:26 -07:00
Andrey Smirnov
62c702c4fd fix: remove conflicting etcd member on rejoin with empty data directory
This fixes a scenario when control plane node loses contents of `/var`
without leaving etcd first: on reboot etcd data directory is empty, but
member is already present in the etcd member list, so etcd won't be able
to join because of raft log being empty.

The fix is to remove a member with matching hostname if found in the
etcd member list followed by new member add.

The risk here is removing another member which has same hostname as the
joining node, but having duplicate hostnames for control plane node is a
problem anyways.
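The remove-then-add step can be sketched on a plain slice. The `member` type and `rejoin` function are hypothetical simplifications; the actual fix operates on the etcd member list through the etcd API.

```go
package main

import "fmt"

// member is a hypothetical, simplified view of an etcd member.
type member struct {
	ID   uint64
	Name string // hostname of the node
}

// rejoin drops any stale member with the same hostname, then adds a
// fresh entry with newID.
func rejoin(members []member, hostname string, newID uint64) []member {
	kept := make([]member, 0, len(members)+1)
	for _, m := range members {
		if m.Name != hostname {
			kept = append(kept, m)
		}
	}
	return append(kept, member{ID: newID, Name: hostname})
}

func main() {
	members := []member{{1, "cp-1"}, {2, "cp-2"}, {3, "cp-3"}}
	fmt.Println(rejoin(members, "cp-2", 42))
}
```

Note that matching by hostname alone is exactly what makes duplicate hostnames risky, as the message points out.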

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-03 15:11:44 -07:00
Andrey Smirnov
0acb04ad7a feat: implement route network controllers
Route handling is very similar to addresses:

* `RouteStatus` describes kernel routing table state,
`RouteStatusController` reflects kernel state into resources
* `RouteSpec` defines routes to be configured
* `RouteConfigController` creates `RouteSpec`s based on cmdline and
machine configuration
* `RouteMergeController` merges different configuration layers into the
final representation
* `RouteSpecController` applies the specs to the kernel routing table
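The merge step in the list above might be pictured as follows. This is only a sketch: the `routeSpec` fields, the two layer names, and the rule that the machine configuration layer overrides cmdline are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// layer orders configuration sources; higher values win (assumption).
type layer int

const (
	layerCmdline layer = iota
	layerMachineConfig
)

// routeSpec is a hypothetical, simplified route definition.
type routeSpec struct {
	Destination string
	Gateway     string
	Layer       layer
}

// mergeRoutes keeps one spec per destination, preferring the highest layer.
func mergeRoutes(specs []routeSpec) []routeSpec {
	best := map[string]routeSpec{}
	for _, s := range specs {
		if cur, ok := best[s.Destination]; !ok || s.Layer > cur.Layer {
			best[s.Destination] = s
		}
	}
	out := make([]routeSpec, 0, len(best))
	for _, s := range best {
		out = append(out, s)
	}
	// Sort for a deterministic result, since map iteration order is random.
	sort.Slice(out, func(i, j int) bool { return out[i].Destination < out[j].Destination })
	return out
}

func main() {
	fmt.Println(mergeRoutes([]routeSpec{
		{"0.0.0.0/0", "10.0.0.1", layerCmdline},
		{"0.0.0.0/0", "10.0.0.254", layerMachineConfig},
	}))
}
```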

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-05-25 11:09:21 -07:00
Alexey Palazhchenko
1fcf38f9d6 feat: add support for "none" CNI type
Closes #3411.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-04-09 12:53:00 -07:00
Andrey Smirnov
e0650218a6 feat: support etcd recovery from snapshot on bootstrap
When Talos `controlplane` node is waiting for a bootstrap, `etcd`
contents can be recovered from a snapshot created with
`talosctl etcd snapshot` on a healthy cluster.

Bootstrap process goes same way as before, but the etcd data directory
is recovered from the snapshot.

This flow enables disaster recovery for the control plane: given that
periodic backups are available, destroy control plane nodes, re-create
them with the same config, and bootstrap one node with the saved
snapshot to recover etcd state at the time of the snapshot.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-08 10:15:37 -07:00
Andrey Smirnov
7d91258475 test: fix data race in apply config tests
Variable `chanErr` was read before waiting for the goroutine to finish.
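The class of bug described here looks roughly like the sketch below, which shows the corrected ordering. Only the variable name `chanErr` comes from the message; the surrounding function is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// run starts a goroutine that reports its result through chanErr and
// returns that result only after the goroutine has finished.
func run() error {
	var (
		wg      sync.WaitGroup
		chanErr error
	)

	wg.Add(1)
	go func() {
		defer wg.Done()
		chanErr = errors.New("apply failed") // written by the goroutine
	}()

	wg.Wait() // wait first; reading chanErr before this line would race
	return chanErr
}

func main() {
	fmt.Println(run())
}
```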

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-31 10:46:50 -07:00
Andrey Smirnov
204caf8eb9 test: fix apply-config integration test, bump clusterctl version
Tests for ApplyConfig API were relying on not really supported behavior
of modifying config via the `Provider` interface (and it was "fixed" in
another PR which cleans up such access to the configuration).

clusterctl version bumped to try to work around strange CAPI bootstrap
failures in e2e-capi.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-31 09:55:53 -07:00
Andrey Smirnov
2ea20f598a feat: replace timed with time sync controller
This is a complete rewrite of time sync process.

Now the time sync process starts early at boot time, and it adapts to
configuration changes:

* before config is available, `pool.ntp.org` is used
* once config is available, configured time servers are used
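The server selection rule in the two bullets above can be sketched as a small function. The function name is illustrative, and treating an empty configured list the same as "config not available yet" is an assumption.

```go
package main

import "fmt"

// defaultServer is used until machine configuration is available.
const defaultServer = "pool.ntp.org"

// timeServers returns the servers to sync against. Assumption: an empty
// list means no configuration has been provided yet.
func timeServers(configured []string) []string {
	if len(configured) == 0 {
		return []string{defaultServer}
	}
	return configured
}

func main() {
	fmt.Println(timeServers(nil))
	fmt.Println(timeServers([]string{"time.example.org"}))
}
```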

The controller updates the same time sync resource that other
controllers already depend on, so they can wait for the time sync event.

Talos services which depend on time now wait on same resource instead of
waiting on timed health.

New features:

* time sync now sticks to a particular time server and switches to
another one only when that server returns an error; this improves time
sync accuracy

* time sync acts on config changes immediately, so it's possible to
reconfigure time sync at any time

* there's a new 'epoch' field in time sync resources which allows
time-dependent controllers to regenerate certs when there's a big enough
jump in time

Features to implement later:

* apid shouldn't depend on timed, it should be started early and it
should regenerate certs on time jump

* trustd should be updated in same way

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-29 09:29:43 -07:00
Alexey Palazhchenko
d7e9f6d6a8 chore: build integration tests with -race
Refs https://github.com/talos-systems/talos/issues/3378.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-26 10:08:12 -07:00
Artem Chernyshev
6ffabe5169 feat: add ability to find disk by disk properties
Fixes: https://github.com/talos-systems/talos/issues/3323

Not exactly matching with udevd generated `by-<id>` symlinks, but should
provide sufficient amount of property selectors to be able to pick
specific disks for any kind of disk: sd card, hdd, ssd, nvme.
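Property-based selection might be sketched like this. The `disk` and `selector` fields below are hypothetical and do not reflect the actual selector schema; empty selector fields are treated as wildcards.

```go
package main

import "fmt"

// disk is a hypothetical description of a block device.
type disk struct {
	Name  string
	Model string
	Type  string // e.g. "hdd", "ssd", "nvme"
	Size  uint64 // bytes
}

// selector lists properties a disk must have; zero values match anything.
type selector struct {
	Model   string
	Type    string
	MinSize uint64
}

// matches reports whether d satisfies every non-empty property in s.
func (s selector) matches(d disk) bool {
	if s.Model != "" && s.Model != d.Model {
		return false
	}
	if s.Type != "" && s.Type != d.Type {
		return false
	}
	return d.Size >= s.MinSize
}

// find returns the first disk that matches the selector.
func find(disks []disk, s selector) (disk, bool) {
	for _, d := range disks {
		if s.matches(d) {
			return d, true
		}
	}
	return disk{}, false
}

func main() {
	disks := []disk{
		{"/dev/sda", "ExampleHDD", "hdd", 500 << 30},
		{"/dev/nvme0n1", "ExampleNVMe", "nvme", 1 << 40},
	}
	fmt.Println(find(disks, selector{Type: "nvme"}))
}
```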

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-23 14:23:02 -07:00
Artem Chernyshev
22f375300c chore: update golangci-lint to 1.38.0
Fix all discovered issues.
Detected a couple of bugs and fixed them as well.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-12 06:50:02 -08:00
Alexey Palazhchenko
df52c13581 chore: fix //nolint directives
That's the recommended syntax:
https://golangci-lint.run/usage/false-positives/

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-05 05:58:33 -08:00
Andrey Smirnov
31e56e63db fix: update in-cluster kubeconfig validity to match other certs
Talos generates in-cluster kubeconfig for the kube-scheduler and
kube-controller-manager to authenticate to kube-apiserver. Bug was that
validity of that kubeconfig was set to 24h by mistake. Fix that by
bumping validity to default for other Kubernetes certs (1 year).

Add a certificate refresh at 50% of the validity.

Fix bugs with copying secret resources which was leading to updates not
being propagated correctly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-01 11:16:04 -08:00
Artem Chernyshev
7108bb3f5b test: upgrade master to master tests
Verify upgrade flow using the same version of the installer.
Run that with disk encryption enabled.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-24 07:56:44 -08:00
Artem Chernyshev
06b8c09484 test: enable disk encryption key rotation test
Verify that disk encryption sync operations work properly.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-20 06:17:55 -08:00
Andrey Smirnov
32d2588528 test: update integration tests to use wrapped client for etcd APIs
This continues the fix from #3167.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-18 08:08:48 -08:00
Artem Chernyshev
58ff2c9808 feat: implement ephemeral partition encryption
This PR introduces the first part of disk encryption support.
New config section `systemDiskEncryption` was added into MachineConfig.
For now it contains only Ephemeral partition encryption.

Encryption itself supports two kinds of keys for now:
- node id deterministic key.
- static key which is hardcoded in the config and mainly used for test
purposes.

Talosctl cluster create can now be told to encrypt ephemeral partition
by using `--encrypt-ephemeral` flag.

Additionally:
- updated pkgs library version.
- changed Dockerfile to copy cryptsetup deps from pkgs.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-17 13:39:04 -08:00
Andrey Smirnov
cc83b83808 feat: rename apply-config --no-reboot to --on-reboot
This explains the intention better: config is applied on reboot, which
makes it easy to distinguish from `apply-config --immediate`, which
applies config immediately without a reboot (that is coming in a
different PR).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-17 12:49:47 -08:00
Andrey Smirnov
d99a016af2 fix: correct response structure for GenerateConfig API
Also fix recovery grpc handler to print panic stacktrace to the log.

Any API should follow the structure compatible with apid proxying
injection of errors/nodes.

Explicitly fail GenerateConfig API on worker nodes, as it panics on
worker nodes (missing certificates in node config).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-11 06:34:10 -08:00
Andrey Smirnov
7f3dca8e4c test: add support for IPv6 in talosctl cluster create
Modify provision library to support multiple IPs, CIDRs, gateways, which
can be IPv4/IPv6. Based on IP types, enable services in the cluster to
run DHCPv4/DHCPv6 in the test environment.

There's an outstanding bug: routes are not properly set up in the
cluster, so IPs are not properly routable, but DHCPv6 works and IPs
are allocated (validates DHCPv6 client).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-09 13:28:53 -08:00
Andrey Smirnov
87ccf0eb21 test: clear connection refused errors after reset
After node reboot (and gRPC API unavailability), gRPC stack might cache
connection refused errors for up to backoff timeout. Explicitly clear
such errors in reset tests before trying to read data from the node to
verify reset success.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-01 08:11:27 -08:00
Andrey Smirnov
0aaf8fa968 feat: replace bootkube with Talos-managed control plane
Control plane components are running as static pods managed by the
kubelets.

Whole subsystem is managed via resources/controllers from os-runtime.

Many supporting changes/refactoring to enable new code paths.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-26 14:22:35 -08:00
Andrey Smirnov
47fb5720cf test: skip etcd tests on non-HA clusters
We can't test much of the flow on single-node clusters.

Fixes #3013

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-08 07:39:36 -08:00
Andrey Smirnov
a8dd2ff30d fix: checkpoint controller-manager and scheduler
Default manifests created by bootkube so far were only enabling
pod-checkpointer for kube-apiserver. This seems to cause issues in the
single-node control plane scenario: without scheduler and
controller-manager, the node might fall into the `NodeAffinity` state.

See https://github.com/talos-systems/bootkube-plugin/pull/23

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-28 11:53:17 -08:00
Andrey Smirnov
3dae6df27b test: stabilize upgrade test by running health check several times
For single node clusters, control plane is unstable after reboot, run
health check several times to let it settle down to avoid failures in
subsequent checks.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-11 08:31:01 -08:00
Andrey Smirnov
54ed80e244 feat: reset with system disk wipe spec
The idea is to add an option to perform a "selective" reset: the default
reset operation is to wipe all partitions (triggering reinstall), while
the spec allows wiping only some of the partitions.

Other operations are performed exactly in the same way for any reset
flow.

Possible use case: reset only `EPHEMERAL` partition.
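The selective behaviour can be sketched as a filter over partition labels. The function name and the label list passed in `main` are illustrative assumptions; only `EPHEMERAL` is taken from the message above.

```go
package main

import "fmt"

// partitionsToWipe returns every partition when spec is empty (the
// default, full reset) and only the requested ones otherwise.
func partitionsToWipe(all, spec []string) []string {
	if len(spec) == 0 {
		return all
	}
	want := make(map[string]bool, len(spec))
	for _, label := range spec {
		want[label] = true
	}
	var out []string
	for _, label := range all {
		if want[label] {
			out = append(out, label)
		}
	}
	return out
}

func main() {
	all := []string{"BOOT", "STATE", "EPHEMERAL"} // illustrative labels
	fmt.Println(partitionsToWipe(all, nil))
	fmt.Println(partitionsToWipe(all, []string{"EPHEMERAL"}))
}
```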

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-10 11:31:07 -08:00
Andrey Smirnov
350280eb59 feat: implement "staged" (failsafe/backup) upgrades
Regular upgrade path takes just one reboot, but it requires all the
processes to be stopped on the node before upgrade might proceed. Under
some circumstances and with potential Talos bugs it might not work
rendering Talos upgrades almost impossible.

Staged upgrades build upon regular install flow to run the upgrade on
the node reboot. Such upgrades require two reboots of the node, and it
requires two pulls of the installer image, but they should be much less
susceptible to failure. Once the upgrade is staged, node can be
rebooted in any possible way, including hard reset and upgrade is
performed on the next boot.

New ADV format was implemented as well to allow to store install image
ref/options across reboots. New format allows for bigger values and
takes 50% of the `META` partition. Old ADV is still kept for
compatibility reasons.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-08 08:34:26 -08:00
Artem Chernyshev
8aad711f18 feat: implement network interfaces list API
To be used in the interactive installer to configure networking.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-27 10:48:45 -08:00
Artem Chernyshev
f96cffd2b2 feat: add ability to choose CNI config
Initial version which only allows setting CNI using preset, no custom
CNI urls are supported at the moment. Still need to figure out what kind
of UI can be used for that.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-26 06:49:54 -08:00
Andrey Smirnov
9a32e34cb1 feat: implement apply configuration without reboot
This allows config to be written to disk without being applied
immediately.

Small refactoring to extract common code paths.

At first, I tried to implement this via the sequencer, but looks like
it's too hard to get it right, as sequencer lacks context and config to
be written is not applied to the runtime.

Fixes #2828

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-23 12:42:44 -08:00
Artem Chernyshev
2588e2960b feat: make GenerateConfiguration API reuse current node auth
Fixes: https://github.com/talos-systems/talos/issues/2819

Only if requested config type is not `TypeInit`.
This functionality will help implementing TUI installer cluster
extension workflow.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-23 12:12:15 -08:00
Artem Chernyshev
8513123d22 feat: return client config as the second value in GenerateConfiguration
To be used in interactive installer to output the node client
configuration to a file.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-17 07:20:05 -08:00
Artem Chernyshev
0f924b5122 feat: add generate config gRPC API
Fixes: https://github.com/talos-systems/talos/issues/2766

This API is implemented in Maintenance and Machine services.
Can be used to generate configuration on the node, instead of using
talosctl to generate it locally.

To be used in interactive installer and talosctl gen config.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-13 08:07:32 -08:00
Andrey Smirnov
8560fb9662 chore: enable nlreturn linter
Most of the fixes were automatically applied.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 06:48:07 -08:00
Andrey Smirnov
773912833e test: clean up integration test code, fix flakes
This enables golangci-lint via build tags for integration tests (this
should have been done long ago!), and fixes the linting errors.

Two tests were updated to reduce flakiness:

* apply config: wait for nodes to issue "boot done" sequence event
before proceeding
* recover: kill pods even if they appear after the initial set gets
killed (potential race condition with previous test).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-19 15:44:14 -07:00
Artem Chernyshev
e7e99cf1b3 feat: support disk usage command in talosctl
Usage example:

```bash
talosctl du --nodes 10.5.0.2 /var -H -d 2
NODE       NAME
10.5.0.2   8.4 kB   etc
10.5.0.2   1.3 GB   lib
10.5.0.2   16 MB    log
10.5.0.2   25 kB    run
10.5.0.2   4.1 kB   tmp
10.5.0.2   1.3 GB   .
```

Supported flags:
- `-a` writes counts for all files, not just directories.
- `-d` recursion depth
- `-H` humanize size outputs.
- `-t` size threshold (skip files if < size or > size).
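One way to read the threshold flag is sketched below. The sign convention is an assumption borrowed from GNU `du` (a positive threshold skips smaller entries, a negative one skips larger entries) and may not match the `talosctl du` implementation exactly; the function name is illustrative.

```go
package main

import "fmt"

// keep reports whether an entry of the given size passes the threshold.
// Assumption (borrowed from GNU du): a positive threshold skips smaller
// entries, a negative one skips larger entries, zero disables the filter.
func keep(size, threshold int64) bool {
	switch {
	case threshold > 0:
		return size >= threshold
	case threshold < 0:
		return size <= -threshold
	default:
		return true
	}
}

func main() {
	sizes := []int64{4100, 25000, 16000000}
	for _, s := range sizes {
		fmt.Println(s, keep(s, 10000), keep(s, -10000))
	}
}
```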

Fixes: https://github.com/talos-systems/talos/issues/2504

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-10-13 09:30:31 -07:00