61 Commits

Author SHA1 Message Date
Andrey Smirnov
e2f1fbcfdb feat: support control plane upgrades with Talos managed control plane
Upgrade is performed by updating node configuration (node by node, service
by service), watching internal resource state to get new configuration
version and verifying that pod with matching version successfully
propagated to the API server state and pod is ready.

Process is similar to the rolling update of the DaemonSet.
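
The node-by-node flow described above can be sketched roughly as follows.
This is only an illustration, not the code from this commit: the `node`
type, the `updateFunc` callback and `rollingUpgrade` are hypothetical
names, and the readiness check is reduced to comparing a version number.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a control plane node and the pod
// state the API server reports for it.
type node struct {
	name       string
	podVersion int
	podReady   bool
}

// updateFunc is a hypothetical hook that applies the new configuration to
// a node and records the resulting pod state.
type updateFunc func(n *node, version int)

// rollingUpgrade updates nodes one at a time and stops at the first node
// whose pod does not report the expected version as ready, so the
// remaining nodes are left untouched.
func rollingUpgrade(nodes []*node, version int, update updateFunc) error {
	for _, n := range nodes {
		update(n, version)

		if n.podVersion != version || !n.podReady {
			return fmt.Errorf("node %s: pod is not ready at version %d", n.name, version)
		}
	}

	return nil
}

func main() {
	nodes := []*node{{name: "cp-1"}, {name: "cp-2"}, {name: "cp-3"}}
	healthy := func(n *node, version int) { n.podVersion, n.podReady = version, true }

	if err := rollingUpgrade(nodes, 2, healthy); err != nil {
		fmt.Println("upgrade failed:", err)
		return
	}

	fmt.Println("all nodes upgraded")
}
```

Stopping at the first failing node is what keeps the rest of the control
plane on the old, working version.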

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-20 11:57:32 -08:00
Andrey Smirnov
e9fc54f6e3 feat: update Kubernetes to 1.20.3
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1202

Also update pkgs for:

* talos-systems/pkgs#238 (raspberrypi-firmware update)
* talos-systems/pkgs#242 (Linux 5.10.17 + init_on_free=0)

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-19 05:22:34 -08:00
Andrey Smirnov
7751920dba feat: add a tool and package to convert self-hosted CP to static pods
This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster
is fully upgraded, control plane is still self-hosted (as it was
bootstrapped with bootkube).

Tool `talosctl convert-k8s` (and the library behind it) performs the
conversion from the self-hosted control plane to static pods.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-17 23:26:57 -08:00
Artem Chernyshev
02b3719df9 feat: skip filesystem for state and ephemeral partitions in the installer
Filesystem creation step is moved to a later stage: when Talos mounts
the partition for the first time.
Now it checks whether the partition has a filesystem and, if not,
formats it right before mounting.
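
A minimal sketch of the "format on first mount" idea, assuming a
hypothetical `partition` type where an empty `filesystem` field means no
filesystem was detected. The `xfs` default is an assumption for
illustration, not something stated in this commit.

```go
package main

import "fmt"

// partition is a hypothetical model of a disk partition.
type partition struct {
	label      string
	filesystem string // empty means no filesystem was detected
	mounted    bool
}

// mount formats the partition only when it has no filesystem yet, then
// marks it as mounted. It reports whether formatting took place.
func mount(p *partition, defaultFS string) (formatted bool) {
	if p.filesystem == "" {
		p.filesystem = defaultFS
		formatted = true
	}

	p.mounted = true

	return formatted
}

func main() {
	state := &partition{label: "STATE"}

	fmt.Println(mount(state, "xfs")) // first mount formats the partition
	fmt.Println(mount(state, "xfs")) // later mounts leave it alone
}
```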

Additionally refactored mount options a bit:
- replaced separate options with a set of binary flags.
- implemented pre-mount and post-unmount hooks.

And fixed typos in a couple of places and increased timeout for `apid ready`.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-17 09:37:21 -08:00
Andrey Smirnov
daea9d3811 feat: support version contract for Talos config generation
This allows generating Talos configs for the current version (by
default) or backwards-compatible configs (e.g. for Talos 0.8).

`talosctl gen config` defaults to current version, but explicit version
can be passed to the command via flags.

`talosctl cluster create` defaults to install/container image version,
but that can be overridden. This makes `talosctl cluster create` now
compatible with 0.8.1 images out of the box.

Upgrade tests use contract based on source version in the test.

When used as a library, `VersionContract` can be omitted (defaults to
current version) or passed explicitly. `VersionContract` can be
conveniently parsed from a Talos version string or specified as one of
the constants.
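
To illustrate the idea of a version contract, here is a small
self-contained sketch. The `versionContract` type, `parseContract` and
`atLeast` are hypothetical names written for this example; they are not
the actual `VersionContract` API of the library.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// versionContract is a hypothetical contract holding the major.minor
// version that generated configuration must stay compatible with.
type versionContract struct {
	major, minor int
}

// parseContract extracts major.minor from a version string like "v0.8.1".
func parseContract(version string) (*versionContract, error) {
	parts := strings.Split(strings.TrimPrefix(version, "v"), ".")
	if len(parts) < 2 {
		return nil, fmt.Errorf("invalid version %q", version)
	}

	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return nil, fmt.Errorf("invalid major version in %q: %v", version, err)
	}

	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return nil, fmt.Errorf("invalid minor version in %q: %v", version, err)
	}

	return &versionContract{major: major, minor: minor}, nil
}

// atLeast reports whether features introduced in major.minor may be used.
// A nil contract stands for "current version", so everything is allowed.
func (c *versionContract) atLeast(major, minor int) bool {
	if c == nil {
		return true
	}

	return c.major > major || (c.major == major && c.minor >= minor)
}

func main() {
	contract, err := parseContract("v0.8.1")
	if err != nil {
		panic(err)
	}

	fmt.Println(contract.atLeast(0, 8)) // feature available in 0.8
	fmt.Println(contract.atLeast(0, 9)) // feature introduced in 0.9
}
```

Treating a nil contract as "current version" mirrors the description
above, where the contract can simply be omitted.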

Fixes #3130

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-10 13:02:52 -08:00
Andrey Smirnov
7f3dca8e4c test: add support for IPv6 in talosctl cluster create
Modify provision library to support multiple IPs, CIDRs, gateways, which
can be IPv4/IPv6. Based on IP types, enable services in the cluster to
run DHCPv4/DHCPv6 in the test environment.

There's an outstanding bug left with routes not being properly set up
in the cluster, so IPs are not properly routable, but DHCPv6 works and
IPs are allocated (validates DHCPv6 client).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-09 13:28:53 -08:00
Andrey Smirnov
edf5777222 feat: add an option to force upgrade without checks
Our upgrades are safe by default - we check etcd health, take locks,
etc. But sometimes an upgrade might be a way to recover a broken (or
semi-broken) cluster; in that case we need the upgrade to run even if
the checks are not passing. This is not a safe way to do upgrades, but
it might be a way to recover a cluster.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-09 10:20:03 -08:00
Andrey Smirnov
2277ce8abe feat: move to ECDSA keys for all Kubernetes/etcd certs and keys
ECDSA keys are smaller which decreases Talos config size, they are more
efficient in terms of key generation, signing, etc., so it makes boot
performance better (and config generation as well).
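
The size difference can be checked with the Go standard library alone.
The sketch below compares the DER encoding of a freshly generated ECDSA
key with that of an RSA key. The P-256 curve and the 2048-bit RSA size
are assumptions chosen for the example, not parameters taken from this
commit.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"fmt"
)

// derSizes generates one ECDSA and one RSA private key and returns the
// length in bytes of each key's DER encoding.
func derSizes() (ecSize, rsaSize int, err error) {
	ecKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return 0, 0, err
	}

	ecDER, err := x509.MarshalECPrivateKey(ecKey)
	if err != nil {
		return 0, 0, err
	}

	rsaKey, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return 0, 0, err
	}

	rsaDER := x509.MarshalPKCS1PrivateKey(rsaKey)

	return len(ecDER), len(rsaDER), nil
}

func main() {
	ecSize, rsaSize, err := derSizes()
	if err != nil {
		panic(err)
	}

	fmt.Printf("ECDSA P-256 key: %d bytes, RSA 2048 key: %d bytes\n", ecSize, rsaSize)
}
```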

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-02 13:25:00 -08:00
Andrey Smirnov
e0a0f58801 feat: use multi-arch images for k8s and Flannel CNI
Flannel was updated to version 0.13, which has a multi-arch image.

Kubernetes images are multi-arch.

Fixes #3049

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-28 08:26:02 -08:00
Andrey Smirnov
0aaf8fa968 feat: replace bootkube with Talos-managed control plane
Control plane components are running as static pods managed by the
kubelets.

Whole subsystem is managed via resources/controllers from os-runtime.

Many supporting changes/refactoring to enable new code paths.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-26 14:22:35 -08:00
Andrey Smirnov
d71ac4c4ff feat: update Kubernetes to 1.20.2
Minor point release, official changelog:

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.20.md

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-15 09:06:18 -08:00
Andrey Smirnov
f2c029a07d chore: update upgrade test version used
Now with official 0.8.0 release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-24 18:49:29 +03:00
Andrey Smirnov
b1d4814308 feat: update Kubernetes to 1.20.1
See https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.20.md

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-21 23:52:29 +03:00
Andrey Smirnov
3dae6df27b test: stabilize upgrade test by running health check several times
For single node clusters, control plane is unstable after reboot. Run
the health check several times to let it settle down and avoid failures
in subsequent checks.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-11 08:31:01 -08:00
Andrey Smirnov
872e792dbc feat: update Kubernetes to 1.20.0
Official K8s release matching Talos 0.8.0.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-09 06:11:48 -08:00
Andrey Smirnov
350280eb59 feat: implement "staged" (failsafe/backup) upgrades
Regular upgrade path takes just one reboot, but it requires all the
processes to be stopped on the node before the upgrade can proceed.
Under some circumstances, and with potential Talos bugs, this might not
work, rendering Talos upgrades almost impossible.

Staged upgrades build upon regular install flow to run the upgrade on
the node reboot. Such upgrades require two reboots of the node and two
pulls of the installer image, but they should be much less susceptible
to failure. Once the upgrade is staged, the node can be rebooted in any
possible way, including hard reset, and the upgrade is performed on the
next boot.

New ADV format was implemented as well to allow storing install image
ref/options across reboots. New format allows for bigger values and
takes 50% of the `META` partition. Old ADV is still kept for
compatibility reasons.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-08 08:34:26 -08:00
Andrey Smirnov
1cf6b98fb8 test: bump Talos release version for upgrade test to 0.7.1
We should always use latest releases.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-08 18:41:28 +03:00
Andrey Smirnov
621968977e feat: update kubernetes to 1.20.0-rc.0
Talos 0.8 is going to ship with K8s 1.20.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-02 10:50:58 -08:00
Andrey Smirnov
28ba6e416e feat: update Kubernetes to v1.20.0-beta.2
Talos 0.8 is going to ship with K8s 1.20.x.

Changes to support new `control-plane` label,
upgrade-k8s supports automated fixups for 1.20.

See also: https://github.com/talos-systems/bootkube-plugin/pull/22

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-25 06:39:14 -08:00
Artem Chernyshev
b6874ee82a feat: add TUI based talos interactive installer
This is initial commit of the installer.
What's done:
- verifying node availability before starting any operations.
- gathering information about disks on the machine.
- allows setting: install disk, hostname, machine type, installer image,
  kubernetes version, dns domain, cluster-name.
- dumps/merges talosconfig to a file after applying configuration.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-18 12:34:15 -08:00
Andrey Smirnov
07cbf4be3f test: update integration test versions, clean up names
Bump to 0.7.0 as we have a new release.

Clean up the tests we do: 0.6.3 is a previous release, 0.7.0 is a stable
release, current version (0.8.x) is the "next" release.

We test the following:

* 0.6.3 -> 0.7.0
* 0.7.0 -> 0.8-current
* 0.7.0 -> 0.8-current (single node)

This way, upgrades are always tested between two releases.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-18 16:39:40 +03:00
Andrey Smirnov
df6ad3fa80 feat: upgrade Kubernetes default version to 1.19.4
k8s.io modules don't have 1.19.4 tag yet :(

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-12 08:51:04 -08:00
Andrey Smirnov
b2b86a622e fix: remove 'token creds' from maintenance service
This fixes the reverse Go dependency from `pkg/machinery` to `talos`
package.

Add a check to `Dockerfile` to prevent `pkg/machinery/go.mod` getting
out of sync, this should prevent problems in the future.

Fix potential security issue in `token` authorizer to deny requests
without grpc metadata.

In provisioner, add support for launching nodes without the config
(config is not delivered to the provisioned nodes).

Breaking change in `pkg/provision`: now `NodeRequest.Type` should be set
to the node type (as config can be missing now).

In `talosctl cluster create` add a flag to skip providing config to the
nodes so that they enter maintenance mode, while the generated configs
are written down to disk (so they can be tweaked and applied easily).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 14:10:32 -08:00
Artem Chernyshev
061b296530 feat: allow specifying user-disks in talosctl cluster create
User-disks are supported by QEMU and Firecracker providers.
Can be defined by using the following parameters:
```
--user-disk /mount/path:1GB
```

More than one user disk can be specified.
Same set of user disks will be created for all master and worker nodes.
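
For illustration, a value in the `/mount/path:1GB` form shown above could
be split into a mount path and a size roughly like this. `userDisk` and
`parseUserDisk` are hypothetical names, and the decimal interpretation of
`MB`/`GB` is an assumption; this is not the flag parsing from this commit.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// userDisk is a hypothetical parsed form of a "<mount path>:<size>" value.
type userDisk struct {
	mountPath string
	sizeBytes uint64
}

// parseUserDisk splits the spec at the last colon and converts the size.
// Only whole numbers with an MB or GB suffix (decimal units) are accepted.
func parseUserDisk(spec string) (userDisk, error) {
	idx := strings.LastIndex(spec, ":")
	if idx <= 0 || idx == len(spec)-1 {
		return userDisk{}, fmt.Errorf("invalid user disk spec %q", spec)
	}

	path, size := spec[:idx], spec[idx+1:]

	units := []struct {
		suffix     string
		multiplier uint64
	}{
		{"GB", 1000 * 1000 * 1000},
		{"MB", 1000 * 1000},
	}

	for _, u := range units {
		if !strings.HasSuffix(size, u.suffix) {
			continue
		}

		n, err := strconv.ParseUint(strings.TrimSuffix(size, u.suffix), 10, 64)
		if err != nil {
			return userDisk{}, fmt.Errorf("invalid size in %q: %v", spec, err)
		}

		return userDisk{mountPath: path, sizeBytes: n * u.multiplier}, nil
	}

	return userDisk{}, fmt.Errorf("unsupported size unit in %q", spec)
}

func main() {
	disk, err := parseUserDisk("/mount/path:1GB")
	if err != nil {
		panic(err)
	}

	fmt.Println(disk.mountPath, disk.sizeBytes)
}
```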

Additionally enable user-disks in qemu e2e test.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-10-30 08:44:08 -07:00
Andrey Smirnov
66829b14d5 test: bump Talos version for upgrade tests, bump Cilium version
Use 0.6.3 as upgrade source version, use latest Cilium release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-29 22:22:21 +03:00
Andrey Smirnov
bc9e0c0dba fix: re-implement upgrade (install) with preserve
For 0.6 -> 0.7 upgrade, in any case config.yaml is preserved and moved
from `/boot` to `/system/state`.

For single node upgrade, `EPHEMERAL` partition is not touched and other
partitions are re-created as needed.

Bump provision tests to 0.6/0.7 upgrades as we get closer to the new
release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-28 07:25:26 -07:00
Andrey Smirnov
56f1ee37fd feat: upgrade Kubernetes to 1.19.3
Just minor release bump.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-20 05:12:32 -07:00
Andrey Smirnov
773912833e test: clean up integration test code, fix flakes
This enables golangci-lint via build tags for integration tests (this
should have been done long ago!), and fixes the linting errors.

Two tests were updated to reduce flakiness:

* apply config: wait for nodes to issue "boot done" sequence event
before proceeding
* recover: kill pods even if they appear after the initial set gets
killed (potential race condition with previous test).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-19 15:44:14 -07:00
Andrey Smirnov
ff0d4b305a feat: build Talos images/artifacts for amd64/arm64
By default, build outside of Drone works the same and builds only amd64
version, loads images back into dockerd, etc.

If multiple platforms are used, multi-arch images are built which can't
be exported to docker or to `.tar` image, they're always pushed to the
registry (even for PR builds to our internal CI registry).

Artifacts as files (initramfs, kernel) now have `-arch` suffix:
`vmlinuz-amd64`, `initramfs-amd64.xz`. "Magic" script normalizes output
paths depending on whether single platform or multiple platforms were
given.

VM provisioners accept magic `${ARCH}` in initramfs/kernel paths which
gets replaced by cluster architecture.
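
The `${ARCH}` substitution amounts to a plain string replacement. A
minimal sketch, where `expandArch` is a hypothetical helper rather than
the provisioner's actual function:

```go
package main

import (
	"fmt"
	"strings"
)

// expandArch replaces every "${ARCH}" placeholder in path with the
// cluster architecture.
func expandArch(path, arch string) string {
	return strings.ReplaceAll(path, "${ARCH}", arch)
}

func main() {
	fmt.Println(expandArch("vmlinuz-${ARCH}", "amd64"))      // vmlinuz-amd64
	fmt.Println(expandArch("initramfs-${ARCH}.xz", "arm64")) // initramfs-arm64.xz
}
```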

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-27 10:32:07 -07:00
Andrey Smirnov
0f54574d89 fix: update one more place which had stale reference for constants
s/constants/images/

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-25 10:51:35 -07:00
Andrew Rynhard
27c7bc0788 fix: use images package in integration tests
This fixes an incorrect import path.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-09-25 08:11:27 -07:00
Andrey Smirnov
15181aeade feat: use architecture-specific image for core k8s components
This is one step towards running Talos on non-amd64 architectures (e.g. arm64).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-16 01:11:40 -07:00
Andrey Smirnov
f6e075ea55 test: verify kubernetes control plane upgrade in provision tests
Add Kubernetes upgrade as part of the provisioning (upgrade tests):
first K8s control plane is upgraded, then Talos is upgraded (with
kubelet), and e2e test is run last.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-11 10:53:33 -07:00
Andrey Smirnov
788cd15c29 test: add e2e test to the provision (upgrade) tests
Add sonobuoy runner code with log fetching on failure. Use hand-picked
set of e2e tests to run: verify basic pod functionality, verify service
connectivity.

Add option `--run-e2e` to the `talosctl health` to run quick e2e test to
verify cluster health.

Add option to run provision tests with custom CNI, run one track of
provision tests with Cilium.

Bump Cilium to 1.8.2.

Talos 0.6 won't uncordon node automatically after upgrade from 0.5, as
0.5 doesn't put annotation. Work around that in upgrade tests.

Bump upgrade test version to 0.6.0 release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-08 13:26:31 -07:00
Andrey Smirnov
f6ecf000c9 refactor: extract packages loadbalancer and retry
This removes in-tree packages in favor of:

* github.com/talos-systems/go-retry
* github.com/talos-systems/go-loadbalancer

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-02 13:46:22 -07:00
Andrew Rynhard
1a4059a553 feat: add grub bootloader
This moves to using grub instead of syslinux.

BREAKING CHANGE: Single node upgrades will fail in this change. This
will also break the A/B fallback setup since this version introduces
an entirely new partition scheme, that any fallback will not know about.
We plan on addressing these issues in a follow up change.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-09-01 12:06:43 -07:00
Andrew Rynhard
83aa3bd3ab chore: bump next version to v0.6.0-beta.2
This updates the "next" version in our integration tests.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-08-21 01:44:26 -07:00
Andrey Smirnov
bddd4f1bf6 refactor: move external API packages into machinery/
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under `pkg/machinery` umbrella.

And `pkg/machinery` is published as Go module inside Talos repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-17 09:56:14 -07:00
Andrey Smirnov
2697b99b7d refactor: extract pkg/net as github.com/talos-systems/net
This extracts common package as new module/repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-14 11:04:50 -07:00
Andrey Smirnov
9379cf9ee1 refactor: expose provision as public package
This change is only moving packages and updating import paths.

Goal: expose `internal/pkg/provision` as `pkg/provision` to enable other
projects to import Talos provisioning library.

As cluster checks are almost always required as part of provisioning
process, package `internal/pkg/cluster` was also made public as
`pkg/cluster`.

Other changes were direct dependencies discovered by `importvet` which
were updated.

Public packages (useful, general purpose packages with stable API):

* `internal/pkg/conditions` -> `pkg/conditions`
* `internal/pkg/tail` -> `pkg/tail`

Private packages (used only internally by the provisioning library):

* `internal/pkg/inmemhttp` -> `pkg/provision/internal/inmemhttp`
* `internal/pkg/kernel/vmlinuz` -> `pkg/provision/internal/vmlinuz`
* `internal/pkg/cniutils` -> `pkg/provision/internal/cniutils`

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-12 05:12:05 -07:00
Andrey Smirnov
ede662bcb1 test: bump timeout for upgrade tests
'cordonAndDrainNode' task sometimes takes 5 minutes.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-31 00:28:29 +03:00
Andrey Smirnov
a48c1dbe89 chore: use qemu instead of firecracker in CI
qemu opens up a bunch of possibilities, including the bootloader
testing.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-30 22:43:16 +03:00
Andrey Smirnov
a5d64d97c1 test: update qemu/firecracker provisioners
Fixes #2363 #2364 #2370 #2371

Several changes packed together:

* use compressed `vmlinuz` everywhere, firecracker provisioner
uncompresses it before first use, drop `vmlinux`

* handle reboots in qemu launcher to support reset API case, update
empty disk check to handle reset behavior (erasing partition table)

* make bootloader support default in provisioners, and flag to disable
that

* early support for target architecture for qemu provisioner

This should allow us to use `qemu` in CI/CD (not included into this PR):
integration test passes with qemu.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-30 21:17:25 +03:00
Andrey Smirnov
47608fb874 refactor: make pkg/config not rely on machined/../internal/runtime
This makes `pkg/config` directly importable from other projects.

There should be no functional changes.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-29 12:40:12 -07:00
Andrey Smirnov
2770d6414c test: upgrade versions the upgrade tests are operating on
This bumps next version to the latest 0.6 alpha and latest 0.5.

This also enables single node preserve test.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-28 12:35:37 -07:00
Andrey Smirnov
5ecddf2866 feat: add round-robin LB policy to Talos client by default
Handling of multiple endpoints has already been implemented in #2094.

This PR enables round-robin policy so that grpc picks up new endpoint
for each call (and not send each request to the first control plane
node).

Endpoint list is randomized to handle cases when only one request is
going to be sent, so that it doesn't go always to the first node in the
list.

grpc handles dead/unresponsive nodes automatically for us.

`talosctl cluster create` and provision tests switched to use
client-side load balancer for Talos API.

Additional improvements we got:

* `talosctl` now reports correct node IP when using commands without
`-n`, not the loadbalancer IP (if using multiple endpoints of course)

* loadbalancer can't provide reliable handling of errors when upstream
server is unresponsive or there're no upstreams available, grpc returns
much more helpful errors

Fixes #1641

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 08:35:15 -07:00
Andrey Smirnov
81d1c2bfe7 chore: enable godot linter
Issues were fixed automatically.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-30 10:39:56 -07:00
Andrey Smirnov
51112a1d86 fix: use kubernetes version in config generator
Update all k8s image references to point to the version specified by the user.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-26 17:05:19 -07:00
Andrew Rynhard
77150f51cf chore: update provision test versions
This adds latest 0.6 alpha and 0.5 stable to the upgrade tests.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-05-29 14:58:54 -07:00
Andrey Smirnov
652531853f test: update Talos versions for upgrade tests
Our policy is to support the last two releases (0.4, 0.5 at the moment).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-05-20 07:43:10 -07:00