339 Commits

Author SHA1 Message Date
Artem Chernyshev
f96548e165 refactor: extract go-cmd into a separate library
To be used in the `go-blockdevice` library.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-16 10:31:20 -08:00
Andrey Smirnov
d99a016af2 fix: correct response structure for GenerateConfig API
Also fix recovery grpc handler to print panic stacktrace to the log.

Any API should follow the structure compatible with apid proxying
injection of errors/nodes.

Explicitly fail GenerateConfig API on worker nodes, as it panics on
worker nodes (missing certificates in node config).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-11 06:34:10 -08:00
Andrey Smirnov
daea9d3811 feat: support version contract for Talos config generation
This allows to generating current version Talos configs (by default) or
backwards compatible configuration (e.g. for Talos 0.8).

`talosctl gen config` defaults to current version, but explicit version
can be passed to the command via flags.

`talosctl cluster create` defaults to install/container image version,
but that can be overridden. This makes `talosctl cluster create` now
compatible with 0.8.1 images out of the box.

Upgrade tests use contract based on source version in the test.

When used as a library, `VersionContract` can be omitted (defaults to
current version) or passed explicitly. `VersionContract` can be
convienietly parsed from Talos version string or specified as one of the
constants.

Fixes #3130

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-10 13:02:52 -08:00
Andrey Smirnov
7f3dca8e4c test: add support for IPv6 in talosctl cluster create
Modify provision library to support multiple IPs, CIDRs, gateways, which
can be IPv4/IPv6. Based on IP types, enable services in the cluster to
run DHCPv4/DHCPv6 in the test environment.

There's outstanding bug left with routes not being properly set up in
the cluster so, IPs are not properly routable, but DHCPv6 works and IPs
are allocated (validates DHCPv6 client).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-09 13:28:53 -08:00
Andrey Smirnov
edf5777222 feat: add an option to force upgrade without checks
Our upgrades are safe by default - we check etcd health, take locks,
etc. But sometimes upgrades might be a way to recover broken (or
semi-broken) cluster, in that case we need upgrade to run even if the
checks are not passing. This is not a safe way to do upgrades, but it
might be a way to recover a cluster.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-09 10:20:03 -08:00
Andrey Smirnov
2277ce8abe feat: move to ECDSA keys for all Kubernetes/etcd certs and keys
ECDSA keys are smaller which decreases Talos config size, they are more
efficient in terms of key generation, signing, etc., so it makes boot
performance better (and config generation as well).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-02 13:25:00 -08:00
Andrey Smirnov
87ccf0eb21 test: clear connection refused errors after reset
After node reboot (and gRPC API unavailability), gRPC stack might cache
connection refused errors for up to backoff timeout. Explicitly clear
such errors in reset tests before trying to read data from the node to
verify reset success.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-01 08:11:27 -08:00
Andrey Smirnov
e0a0f58801 feat: use multi-arch images for k8s and Flannel CNI
Flannel got updated to 0.13 version which has multi-arch image.

Kubernetes images are multi-arch.

Fixes #3049

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-28 08:26:02 -08:00
Andrey Smirnov
0aaf8fa968 feat: replace bootkube with Talos-managed control plane
Control plane components are running as static pods managed by the
kubelets.

Whole subsystem is managed via resources/controllers from os-runtime.

Many supporting changes/refactoring to enable new code paths.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-26 14:22:35 -08:00
Andrey Smirnov
d71ac4c4ff feat: update Kubernetes to 1.20.2
Minor point release, official changelog:

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.20.md

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-15 09:06:18 -08:00
Artem Chernyshev
d515613bb7 fix: list command unlimited recursion default behavior
Revert back to old behavior.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-01-15 05:06:41 -08:00
Andrey Smirnov
47fb5720cf test: skip etcd tests on non-HA clusters
We can't test much of the flow on single-node clusters.

Fixes #3013

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-08 07:39:36 -08:00
Andrey Smirnov
a8dd2ff30d fix: checkpoint controller-manager and scheduler
Default manifests created by bootkube so far were only enabling
pod-checkpointer for kube-apiserver. This seems to have issues with
single-node control plane scenario, when without scheduler and
controller-manager node might fall into `NodeAffinity` state.

See https://github.com/talos-systems/bootkube-plugin/pull/23

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-28 11:53:17 -08:00
Andrey Smirnov
f2c029a07d chore: update upgrade test version used
Now with official 0.8.0 release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-24 18:49:29 +03:00
Artem Chernyshev
a83e8758db feat: add commands to manage/query etcd cluster
Used already existing protobufs for that.

Commands:
`talosctl etcd members -n <node>`
`talosctl etcd leave -n <node>`
`talosctl etcd forfeit-leadership -n <node>`

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-12-22 11:49:10 -08:00
Andrey Smirnov
b1d4814308 feat: update Kubernetes to 1.20.1
See https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.20.md

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-21 23:52:29 +03:00
Andrey Smirnov
3dae6df27b test: stabilize upgrade test by running health check several times
For single node clusters, control plane is unstable after reboot, run
health check several times to let it settle down to avoid failures in
subsequent checks.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-11 08:31:01 -08:00
Andrey Smirnov
54ed80e244 feat: reset with system disk wipe spec
Idea is to add an option to perform "selective" reset: default reset
operation is to wipe all partitions (triggering reinstall), while spec
allows only to wipe some of the operations.

Other operations are performed exactly in the same way for any reset
flow.

Possible use case: reset only `EPHEMERAL` partition.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-10 11:31:07 -08:00
Artem Chernyshev
68dd5b9add feat: add talosctl merge config command
Allows merging two Talos configs into one. Merges the config in whatever
is set by TALOSCONFIG or ~/.talos/config.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-12-09 13:07:45 -08:00
Artem Chernyshev
d7ce831465 feat: add talosctl config contexts
Bonus to `talosctl config merge`.
Got that idea after using talosctl for a weekend.
I feel that can be a good addition to have a command that can list existing
contexts in a table view, which is similar to what `kubectl config get-contexts`
does. To avoid going through the file which has all the certs and such.

Called it just `contexts` to align with whatever we have now (to switch
    context you need to use `talosctl config context`).

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-12-09 12:19:10 -08:00
Andrey Smirnov
872e792dbc feat: update Kubernetes to 1.20.0
Official K8s release matching Talos 0.8.0.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-09 06:11:48 -08:00
Andrey Smirnov
350280eb59 feat: implement "staged" (failsafe/backup) upgrades
Regular upgrade path takes just one reboot, but it requires all the
processes to be stopped on the node before upgrade might proceed. Under
some circumstances and with potential Talos bugs it might not work
rendering Talos upgrades almost impossible.

Staged upgrades build upon regular install flow to run the upgrade on
the node reboot. Such upgrades require two reboots of the node, and it
requires two pulls of the installer image, but they should be much less
suspicious to the failure. Once the upgrade is staged, node can be
rebooted in any possible way, including hard reset and upgrade is
performed on the next boot.

New ADV format was implemented as well to allow to store install image
ref/options across reboots. New format allows for bigger values and
takes 50% of the `META` partition. Old ADV is still kept for
compatibility reasons.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-08 08:34:26 -08:00
Andrey Smirnov
1cf6b98fb8 test: bump Talos release version for upgrade test to 0.7.1
We should always use latest releases.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-08 18:41:28 +03:00
Andrey Smirnov
11c2b8f80c test: bump defaults for provision tests resources
Our defaults are too low now.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-07 07:01:41 -08:00
Andrey Smirnov
e4ebc4ab95 feat: suggest fixed control plane endpoints in talosctl gen config
Ex.:

```
$ talosctl gen config foo 192.168.0.1
no scheme and port specified for the cluster endpoint URL
try: "https://192.168.0.1:6443"
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-02 13:16:30 -08:00
Andrey Smirnov
621968977e feat: update kubernetes to 1.20.0-rc.0
Talos 0.8 is going to ship with K8s 1.20.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-02 10:50:58 -08:00
Artem Chernyshev
8aad711f18 feat: implement network interfaces list API
To be used in the interactive installer to configure networking.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-27 10:48:45 -08:00
Artem Chernyshev
f96cffd2b2 feat: add ability to choose CNI config
Initial version which only allows setting CNI using preset, no custom
CNI urls are supported at the moment. Still need to figure out what kind
of UI can be used for that.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-26 06:49:54 -08:00
Andrey Smirnov
28ba6e416e feat: update Kubernetes to v1.20.0-beta.2
Talos 0.8 is going to ship with K8s 1.20.x.

Changes to support new `control-plane` label,
upgrade-k8s supports automated fixups for 1.20.

See also: https://github.com/talos-systems/bootkube-plugin/pull/22

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-25 06:39:14 -08:00
Andrey Smirnov
9a32e34cb1 feat: implement apply configuration without reboot
This allows config to be written to disk without being applied
immediately.

Small refactoring to extract common code paths.

At first, I tried to implement this via the sequencer, but looks like
it's too hard to get it right, as sequencer lacks context and config to
be written is not applied to the runtime.

Fixes #2828

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-23 12:42:44 -08:00
Artem Chernyshev
2588e2960b feat: make GenerateConfiguration API reuse current node auth
Fixes: https://github.com/talos-systems/talos/issues/2819

Only if requested config type is not `TypeInit`.
This functionality will help implementing TUI installer cluster
extension workflow.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-23 12:12:15 -08:00
Artem Chernyshev
b6874ee82a feat: add TUI based talos interactive installer
This is initial commit of the installer.
What's done:
- verifying node availability before starting any operations.
- gathering information about disks on the machine.
- allows setting: install disk, hostname, machine type, installer image,
  kubernetes version, dns domain, cluster-name.
- dumps/merges talosconfig to a file after applying configuration.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-18 12:34:15 -08:00
Andrey Smirnov
07cbf4be3f test: update integration test versions, clean up names
Bump to 0.7.0 as we have a new release.

Clean up the tests we do: 0.6.3 is a previous release, 0.7.0 is a stable
release, current version (0.8.x) is the "next" release.

We test the following:

* 0.6.3 -> 0.7.0
* 0.7.0 -> 0.8-current
* 0.7.0 -> 0.8-current (single node)

This tests upgrades always between two releases.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-18 16:39:40 +03:00
Artem Chernyshev
8513123d22 feat: return client config as the second value in GenerateConfiguration
To be used in interactive installer to output the node client
configuration to a file.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-17 07:20:05 -08:00
Artem Chernyshev
0f924b5122 feat: add generate config gRPC API
Fixes: https://github.com/talos-systems/talos/issues/2766

This API is implemented in Maintenance and Machine services.
Can be used to generate configuration on the node, instead of using
talosctl to generate it locally.

To be used in interactive installer and talosctl gen config.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-13 08:07:32 -08:00
Andrey Smirnov
df6ad3fa80 feat: upgrade Kubernetes default version to 1.19.4
k8s.io modules don't have 1.19.4 tag yet :(

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-12 08:51:04 -08:00
Andrey Smirnov
b2b86a622e fix: remove 'token creds' from maintenance service
This fixes the reverse Go dependency from `pkg/machinery` to `talos`
package.

Add a check to `Dockerfile` to prevent `pkg/machinery/go.mod` getting
out of sync, this should prevent problems in the future.

Fix potential security issue in `token` authorizer to deny requests
without grpc metadata.

In provisioner, add support for launching nodes without the config
(config is not delivered to the provisioned nodes).

Breaking change in `pkg/provision`: now `NodeRequest.Type` should be set
to the node type (as config can be missing now).

In `talosctl cluster create` add a flag to skip providing config to the
nodes so that they enter maintenance mode, while the generated configs
are written down to disk (so they can be tweaked and applied easily).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 14:10:32 -08:00
Andrey Smirnov
8560fb9662 chore: enable nlreturn linter
Most of the fixes were automatically applied.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 06:48:07 -08:00
Artem Chernyshev
061b296530 feat: allow specifying user-disks in talosctl cluster create
User-disks are supported by QEMU and Firecracker providers.
Can be defined by using the following parameters:
```
--user-disk /mount/path:1GB
```

Can get more than 1 user disk.
Same set of user disks will be created for all master and worker nodes.

Additionally enable user-disks in qemu e2e test.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-10-30 08:44:08 -07:00
Andrey Smirnov
66829b14d5 test: bump Talos version for upgrade tests, bump Cilium version
Use 0.6.3 as upgrade source version, use latest Cilium release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-29 22:22:21 +03:00
Andrey Smirnov
bc9e0c0dba fix: re-implement upgrade (install) with preserve
For 0.6 -> 0.7 upgrade, in any case config.yaml is preserved and moved
from `/boot` to `/system/state`.

For single node upgrade, `EPHEMERAL` partition is not touched and other
partitions are re-created as needed.

Bump provision tests to 0.6/0.7 upgrades as we get closer to the new
release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-28 07:25:26 -07:00
Andrey Smirnov
ff4d702f77 fix: implement preserving contents of partition on install
This fixes A/B upgrades and rollback API.

Installer manifest supports now an option to preserve partition contents
while disk is being re-partitioned and partitions are re-formatted.

Mount `/boot` partition as needed (to find current label before starting
the installation and in the rollback API).

Fix upgrade API for non-master nodes.

Contents of `/boot`, `/system/state` and META partitions are preserved
in memory while the disk is re-partitioned.

Remove `--save` flag from the installer as it's not being used.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-22 23:56:39 +03:00
Andrey Smirnov
56f1ee37fd feat: upgrade Kubernetes to 1.19.3
Just minor release bump.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-20 05:12:32 -07:00
Andrey Smirnov
773912833e test: clean up integration test code, fix flakes
This enables golangci-lint via build tags for integration tests (this
should have been done long ago!), and fixes the linting errors.

Two tests were updated to reduce flakiness:

* apply config: wait for nodes to issue "boot done" sequence event
before proceeding
* recover: kill pods even if they appear after the initial set gets
killed (potential race condition with previous test).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-19 15:44:14 -07:00
Artem Chernyshev
e7e99cf1b3 feat: support disk usage command in talosctl
Usage example:

```bash
talosctl du --nodes 10.5.0.2 /var -H -d 2
NODE       NAME
10.5.0.2   8.4 kB   etc
10.5.0.2   1.3 GB   lib
10.5.0.2   16 MB    log
10.5.0.2   25 kB    run
10.5.0.2   4.1 kB   tmp
10.5.0.2   1.3 GB   .
```

Supported flags:
- `-a` writes counts for all files, not just directories.
- `-d` recursion depth
- '-H' humanize size outputs.
- '-t' size threshold (skip files if < size or > size).

Fixes: https://github.com/talos-systems/talos/issues/2504

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-10-13 09:30:31 -07:00
Andrey Smirnov
dc6ea74c35 fix: random failures in cluster health checks
The problem was that some of the health checks sort the list of the
nodes in place (via `sort.Strings()`). If cluster info provider returns
original slice, it might be mutated in such a way that it gets
corrupted.

We never noticed it before CAPI clusters, as in our tests IPs are
assigned sequentially, and sort operation is a no-op.

Specifically, the problem was with the `Nodes()` function, it returns
`append(controlPlaneNodes, workerNodes...)` slice, which by definition
might share memory with `controlPlaneNodes` slice. For example,
if control plane nodes were `4, 5, 6` and worker nodes were `3`, the
returned slice will be `4, 5, 6, 3`, and it shares memory with
`controlPlaneNodes` slice (firs three items). If we apply `sort` to the
returned slice, it re-orders it as `3, 4, 5, 6`, but as it is done
in-place, the `controlPlaneNodes` slice is now `3, 4, 5`, which is
obviously wrong.

Fix that by always returning a copy of the slice from the functions
implementing `ClusterInfo` interface.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-08 07:13:24 -07:00
Andrew Rynhard
4eeef28e90 feat: add etcd API
This adds RPCs for basic etcd management tasks.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-10-06 11:30:04 -07:00
Andrey Smirnov
d7f5de62c3 feat: colorize output of cluster health checks
It only gets enabled if output is a terminal. Failures which resolve
themselves are removed from the final output. Small spinner to indicate
progress.

While I was at it, I fixed client-side `talosctl health` when init node
is missing.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-06 07:59:30 -07:00
Andrey Smirnov
16eb47a1a3 feat: use kubeconfig merge in talosctl kubeconfig by default
Kubeconfig merge was completely rewritten to be "smarter":

* automatically apply renames done at previous stages to avoid asking
over and over again (in general should ask just once)

* skip checks if parts of the config match exactly

* allow overwrite as an option

* flexible way to control the output

* activating context in the end

* custom merged context name

Fixes #2578

Fixes #2587

Fixes #2577

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-03 05:36:15 -07:00
Andrey Smirnov
b9bfe00b88 feat: support custom filename for talosctl kubeconfig
This also refactors much of the CLI code for the `talosctl kubeconfig`:

1. Do all the checks before fetching kubeconfig from the server: as
kubeconfig generation takes a few seconds, it doesn't make sense to
generate it if it's not going to be used.

2. Unify most of merge & write directly features.

3. Don't use ExtractTarGz method to be more flexible.

4. Allow custom paths for kubeconfig, whether it is a directory or full
path to the file to be created.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-30 12:05:50 -07:00