230 Commits

Author SHA1 Message Date
Andrey Smirnov
a8dd2ff30d fix: checkpoint controller-manager and scheduler
Until now, the default manifests created by bootkube enabled
pod-checkpointer only for kube-apiserver. This seems to cause issues in the
single-node control plane scenario: without the scheduler and
controller-manager, the node might fall into the `NodeAffinity` state.

See https://github.com/talos-systems/bootkube-plugin/pull/23

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-28 11:53:17 -08:00
Artem Chernyshev
73c81c501e fix: pass disk image flags to e2e-qemu cluster create command
Forgot to add it in the original PR.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-12-22 23:57:31 -08:00
Artem Chernyshev
6540e9bf70 feat: support disk image in talosctl cluster create
Fixes: https://github.com/talos-systems/talos/issues/2973

A disk image can now be supplied using the `--disk-image-path` flag.
It may be necessary to enable `--with-apply-config` to bootstrap
the nodes properly.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-12-22 17:06:00 +03:00
Andrey Smirnov
b1d4814308 feat: update Kubernetes to 1.20.1
See https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.20.md

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-21 23:52:29 +03:00
Andrey Smirnov
9d1ac81be5 chore: lower MTU to 1450 for the tests in the CI
This should help with the CNI encapsulation in the cluster.
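The commit doesn't say which encapsulation is involved. As an illustration only, assuming VXLAN over IPv4 on a standard 1500-byte link, the per-packet overhead and the resulting inner MTU work out as follows:

```latex
\begin{aligned}
\text{overhead} &= \underbrace{20}_{\text{outer IPv4}} + \underbrace{8}_{\text{UDP}} + \underbrace{8}_{\text{VXLAN}} + \underbrace{14}_{\text{inner Ethernet}} = 50\ \text{bytes} \\
\text{MTU}_{\text{inner}} &= 1500 - 50 = 1450\ \text{bytes}
\end{aligned}
```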

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-17 17:14:07 +03:00
Andrew Rynhard
6f979d463c test: add ISO test
Adds a simple test for the ISO. Boots the ISO, and then uses the `apply-config` command
in `talosctl` to create a cluster.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-12-10 13:04:29 -08:00
Andrey Smirnov
80184393bc feat: update kernel to 5.9.13, new KSPP requirements
Pulls in the following changes:

* https://github.com/talos-systems/toolchain/pull/20
* https://github.com/talos-systems/tools/pull/116
* https://github.com/talos-systems/pkgs/pull/214
* https://github.com/talos-systems/pkgs/pull/215
* https://github.com/talos-systems/pkgs/pull/216
* https://github.com/talos-systems/pkgs/pull/217
* https://github.com/talos-systems/go-procfs/pull/4

New empty amd64 images for u-boot & rpi-firmware reduce the size of
the amd64 installer image.

For backwards compatibility, the QEMU provisioner still injects "legacy" KSPP
kernel args into the initial boot environment.

The installer correctly upgrades KSPP options when moving from one version
of Talos to another.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-10 12:41:58 -08:00
Andrey Smirnov
872e792dbc feat: update Kubernetes to 1.20.0
Official K8s release matching Talos 0.8.0.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-09 06:11:48 -08:00
Andrey Smirnov
11c2b8f80c test: bump defaults for provision tests resources
Our defaults are too low now.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-07 07:01:41 -08:00
Andrey Smirnov
621968977e feat: update kubernetes to 1.20.0-rc.0
Talos 0.8 is going to ship with K8s 1.20.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-02 10:50:58 -08:00
Andrey Smirnov
28ba6e416e feat: update Kubernetes to v1.20.0-beta.2
Talos 0.8 is going to ship with K8s 1.20.x.

Adds support for the new `control-plane` label;
upgrade-k8s now supports automated fixups for 1.20.

See also: https://github.com/talos-systems/bootkube-plugin/pull/22

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-25 06:39:14 -08:00
Andrey Smirnov
1add26b42a chore: bump K8s to 1.19.4 in e2e scripts with CABPT version
This should fix the problem with the kubelet image.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-16 07:18:44 -08:00
Andrey Smirnov
61facf700a chore: build arm64 images in CI
This changes the installer image/ISO output to be (optionally) a tar
stream on stdout, so that we can copy artifacts back from a remote Docker
daemon.

Fixes #2776

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-13 12:34:48 -08:00
Andrey Smirnov
df6ad3fa80 feat: upgrade Kubernetes default version to 1.19.4
k8s.io modules don't have a 1.19.4 tag yet :(

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-12 08:51:04 -08:00
Andrey Smirnov
350d75eb46 feat: build talosctl-cni-bundle, use it in talosctl for QEMU
This builds a bundle with CNI plugins for talosctl which is
automatically downloaded by `talosctl` if CNI plugins are missing.

CNI directories are moved by default to the `~/.talos/cni` path.

Also add a bunch of pre-flight checks to the QEMU provisioner to make it
easier to bootstrap the Talos QEMU cluster.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-10-30 16:30:37 -07:00
Artem Chernyshev
061b296530 feat: allow specifying user-disks in talosctl cluster create
User disks are supported by the QEMU and Firecracker providers.
They can be defined using the following flag:
```
--user-disk /mount/path:1GB
```

More than one user disk can be specified.
The same set of user disks is created for all master and worker nodes.

Additionally, enable user disks in the QEMU e2e test.
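The `--user-disk /mount/path:1GB` value combines a mount path and a size. The commit doesn't show how `talosctl` parses it, so the following is a minimal sketch assuming the `<path>:<size>` form with binary `MB`/`GB` suffixes; `UserDisk`, `parseUserDisk`, and the unit table are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// UserDisk is a hypothetical representation of one --user-disk value.
type UserDisk struct {
	MountPath string
	SizeBytes uint64
}

// unitMultipliers is an assumption: binary multiples for the MB/GB suffixes.
var unitMultipliers = map[string]uint64{
	"MB": 1 << 20,
	"GB": 1 << 30,
}

// parseUserDisk splits "<path>:<size>" at the last colon and converts the size.
func parseUserDisk(spec string) (UserDisk, error) {
	idx := strings.LastIndex(spec, ":")
	if idx <= 0 || idx == len(spec)-1 {
		return UserDisk{}, fmt.Errorf("invalid user disk %q, expected <path>:<size>", spec)
	}
	path, sizeSpec := spec[:idx], spec[idx+1:]
	for suffix, mult := range unitMultipliers {
		if strings.HasSuffix(sizeSpec, suffix) {
			n, err := strconv.ParseUint(strings.TrimSuffix(sizeSpec, suffix), 10, 64)
			if err != nil {
				return UserDisk{}, fmt.Errorf("invalid size in %q: %w", spec, err)
			}
			return UserDisk{MountPath: path, SizeBytes: n * mult}, nil
		}
	}
	return UserDisk{}, fmt.Errorf("unsupported size unit in %q", spec)
}

func main() {
	// The same parsed set would be applied to every master and worker node.
	specs := []string{"/mount/path:1GB", "/var/lib/extra:512MB"}
	for _, s := range specs {
		d, err := parseUserDisk(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s %d\n", d.MountPath, d.SizeBytes)
	}
}
```

Splitting at the last colon keeps paths that themselves contain a colon intact.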

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-10-30 08:44:08 -07:00
Andrew Rynhard
1b0ed13231 docs: move to gridsome
Brings in a new theme, improved content, and restructured layout.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-10-26 21:14:14 -07:00
Spencer Smith
7b4633b35d chore: update CI scripts
This PR pulls in the latest version of our CAPI providers, as well as
makes some minor tweaks to our bash scripts to disable terminal output
of commands during certain actions.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-10-22 09:00:41 -07:00
Andrew Rynhard
7017327059 chore: update qemu hack script to use ISO
This can serve as an example of providing the config via an ISO, and
simplify local setups a bit.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-09-30 10:58:42 -07:00
Andrey Smirnov
ff0d4b305a feat: build Talos images/artifacts for amd64/arm64
By default, a build outside of Drone works the same as before: it builds only
the amd64 version, loads images back into dockerd, etc.

If multiple platforms are given, multi-arch images are built; these can't
be exported to docker or to a `.tar` image, so they're always pushed to the
registry (even for PR builds, to our internal CI registry).

File artifacts (initramfs, kernel) now have an `-arch` suffix:
`vmlinuz-amd64`, `initramfs-amd64.xz`. A "magic" script normalizes output
paths depending on whether a single platform or multiple platforms were
given.

VM provisioners accept a magic `${ARCH}` placeholder in initramfs/kernel paths,
which is replaced with the cluster architecture.
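The `${ARCH}` substitution amounts to a literal string replacement. Below is a minimal sketch, not the provisioner's actual code. `expandArch` and the `_out/` directory are hypothetical, while the file names follow the `-arch` suffix scheme described above.

```go
package main

import (
	"fmt"
	"strings"
)

// archPlaceholder is the magic token accepted in initramfs/kernel paths.
const archPlaceholder = "${ARCH}"

// expandArch substitutes the cluster architecture for every ${ARCH} in path.
// Unlike os.Expand, it leaves any other $VAR-looking text untouched.
func expandArch(path, arch string) string {
	return strings.ReplaceAll(path, archPlaceholder, arch)
}

func main() {
	// Hypothetical _out/ directory; the file names follow the -arch suffix scheme.
	for _, arch := range []string{"amd64", "arm64"} {
		fmt.Println(expandArch("_out/vmlinuz-${ARCH}", arch))
		fmt.Println(expandArch("_out/initramfs-${ARCH}.xz", arch))
	}
}
```

When passing such a path on the command line, quote it (for example with single quotes) so the shell doesn't expand `${ARCH}` itself before the provisioner sees it.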

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-27 10:32:07 -07:00
Andrew Rynhard
7d2741fc4b chore: migrate to ghcr.io
Move to GHCR.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-09-23 15:06:30 -07:00
Andrey Smirnov
788cd15c29 test: add e2e test to the provision (upgrade) tests
Add sonobuoy runner code with log fetching on failure. Use a hand-picked
set of e2e tests to run: verify basic pod functionality and service
connectivity.

Add the `--run-e2e` option to `talosctl health` to run a quick e2e test to
verify cluster health.

Add an option to run provision tests with a custom CNI; run one track of
provision tests with Cilium.

Bump Cilium to 1.8.2.

Talos 0.6 won't uncordon a node automatically after an upgrade from 0.5, as
0.5 doesn't set the annotation. Work around that in the upgrade tests.

Bump upgrade test version to 0.6.0 release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-08 13:26:31 -07:00
Andrew Rynhard
1a4059a553 feat: add grub bootloader
This moves to using grub instead of syslinux.

BREAKING CHANGE: Single-node upgrades will fail with this change. This
will also break the A/B fallback setup, since this version introduces
an entirely new partition scheme that no fallback will know about.
We plan on addressing these issues in a follow-up change.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-09-01 12:06:43 -07:00
Andrey Smirnov
59adf7315d feat: provide option to run Talos under UEFI in QEMU
This also adds integration pipeline tests for UEFI.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-28 12:51:10 -07:00
Spencer Smith
c07ce17b7a fix: update e2e scripts to work with python3
This PR replaces python with python3.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-08-11 12:38:52 -07:00
Spencer Smith
303c477051 chore: update capi CI manifests to use control planes
This PR will update the CI testing to make use of our control plane
provider, as well as the other CAPI components.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-08-11 11:14:44 -04:00
Andrey Smirnov
55f3249783 test: use registry mirrors in CI
This relies on registry caching mirrors running in the CI.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-31 16:30:41 +03:00
Andrey Smirnov
58aa2b75bb test: destroy clusters in e2e tests (qemu/firecracker)
As the build runs inside containers which are part of a single pod, we
need to clean up networking bits (bridge interface, etc.), so that it
doesn't cause problems for other steps.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-31 06:21:09 -07:00
Andrey Smirnov
a5d64d97c1 test: update qemu/firecracker provisioners
Fixes #2363 #2364 #2370 #2371

Several changes packed together:

* use compressed `vmlinuz` everywhere (the firecracker provisioner
uncompresses it before first use), drop `vmlinux`

* handle reboots in the qemu launcher to support the reset API case, update
the empty disk check to handle reset behavior (erasing the partition table)

* make bootloader support the default in provisioners, with a flag to
disable it

* early support for target architecture in the qemu provisioner

This should allow us to use `qemu` in CI/CD (not included in this PR):
the integration test passes with qemu.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-30 21:17:25 +03:00
Artem Chernyshev
c6eb18eed5 feat: qemu provisioner
Starts and stops qemu VMs, with an initial subset of configuration.
Sets up networking through CNI tools and runs a DHCP server that assigns IP
addresses to nodes.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-28 14:55:35 -07:00
Andrey Smirnov
6a81f30941 test: provide node discovery for cli tests via kubectl
Fixes #2330

CLI tests require node discovery, as the `--nodes` flag is enforced for most
`talosctl` commands.

For clusters created via `talosctl cluster create`, the cluster provisioner
state provides all the necessary information, but clusters created via
CAPI don't have that state attached.

API tests rely on the Talos and Kubernetes APIs to fetch kubeconfig and
access the Nodes K8s API.

CLI tests should rely only on CLI tools, so we use `kubectl get nodes` +
`talosctl kubeconfig` to fetch the list of master and worker nodes.

This discovery method relies on the "bootstrap" node being set in
`talosconfig` (to fetch `kubeconfig`).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-28 11:35:47 -07:00
Andrey Smirnov
76c44ac468 test: remove apid load balancer for firecracker
We're not using load balancer for `apid` (always using client-side load
balancing), so we can remove this safely.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-28 20:21:21 +03:00
Andrew Rynhard
1f31d24e55 chore: use Kubernetes pipelines
This moves to using Kubernetes pipelines.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-27 12:09:53 -07:00
Andrey Smirnov
3d8418a689 feat: force nodes to be set in talosctl commands using the API
With load balancing enabled by default, running `talosctl` without
`--nodes` is risky, as the request might hit any control plane node.

Only two commands do not enforce this check, as they manage their own node
contexts: `crashdump` and `health` (client-side).

Integration tests were updated to always supply the `--nodes` CLI argument;
while doing that, I refactored the storage for discovered nodes to use
the existing `cluster.Info` interface.

The downside is that with e2e CAPI tests, CLI tests will be mostly
skipped, as we don't support discovery in CLI tests at the moment. This
can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for
node discovery.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-21 12:17:43 -07:00
Spencer Smith
f290f88160 chore: update clusterctl for CI testing
This PR brings in the latest version of clusterctl that has built-in
support for the talos repos. I'll be chasing this with a move to using
the control-plane provider as well!

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-15 19:33:59 -04:00
Andrey Smirnov
9590030a84 feat: print crash dump in talosctl cluster create on failure
When a cluster fails to bootstrap or fails the health check, it's
hard to find the root cause without the logs.

This change adds an optional crash dump (it dumps firecracker logs or docker
logs) after a provisioning failure. It's not enabled by default.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 11:54:07 -07:00
Andrey Smirnov
4f5660b22b test: fix sonobuoy delete
It expects kubeconfig as a required argument.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 18:46:57 +03:00
Spencer Smith
67cddaff44 chore: wait for resource deletion in sonobuoy
This PR fixes the earlier sonobuoy cleanup fix. The cleanup itself
succeeded, but we still got errors because we were immediately trying to
create service accounts in a namespace that was still being deleted. This
should fix that. The default sonobuoy wait period is 1 hour, which should be
plenty.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-07 10:58:47 -07:00
Spencer Smith
13bd77355e chore: cleanup sonobuoy after failed attempts
This PR will make sure that, if we're going to retry sonobuoy, we run
the delete command first to clean up any dangling resources.

Closes #2266.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-06 11:46:49 -07:00
Andrey Smirnov
3ae5e0e749 test: add short integration test with custom CNI
This adds a new flag to `cluster create` to launch a cluster with a custom
CNI; the `integration` pipeline gets a new step to run a short test with
Cilium 1.8.0 CNI.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-01 11:19:19 -07:00
Patatman
90acb01a4e docs: digital rebar docs
Digital rebar docs in the guide section.

Signed-off-by: Patatman <git@jeursen.nl>
2020-06-30 18:52:39 -07:00
Andrey Smirnov
e46a09f56a chore: make default pipeline run shorter integration test
This moves the full integration test and provision tests to
the `integration` pipeline.

The Docker test isn't affected much, as Docker can't run long
integration tests anyway, so this mostly affects firecracker and provision tests.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-01 00:14:55 +03:00
Andrew Rynhard
d0d2ac3c74 test: default to using the bootstrap API
This moves our test scripts to using the bootstrap API. Some
automation around invoking the bootstrap API was also added
to give the same ease of use when creating clusters with the
CLI.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-06-24 08:46:10 -07:00
Spencer Smith
e03a68f8eb feat: update k8s and sonobuoy versions
This PR will update k8s to the latest 1.18 release and bump sonobuoy to
help resolve some e2e flakes. Also adds some retry logic around the
sonobuoy run.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-06-10 06:47:36 -07:00
Spencer Smith
c1b6f05b00 chore: use clusterctl and v1alpha3 providers for tests
This PR will update our testing code to make use of the clusterctl tool,
as well as use the newer versions of various providers and updated
manifests.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-05-01 07:42:19 -07:00
Spencer Smith
8d2f8d6127 chore: remove random.trust_cpu references
This PR removes the references to adding the random CPU trust option to the
kernel from all v0.4 docs, as well as from the iso command in the
installer. This is no longer needed with the newer Linux kernel.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-04-14 17:10:56 -07:00
Spencer Smith
3a4eaeeef0 feat: upgrade kubernetes to 1.18
This PR will pull in the latest release of k8s 1.18 so we can start
validating it through our test suite.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-26 14:59:43 -04:00
Andrey Smirnov
104af4380e feat: make --wait default option to talosctl cluster create
It seems useful enough to be the default, and it prevents
simple mistakes such as trying to access a cluster that is not ready
yet.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-25 06:36:43 -07:00
Andrey Smirnov
e38cde9b48 chore: update upgrade tests for new version, split into two tracks
This updates upgrade tests to run two flows with 3+1 clusters:

1. 0.3 -> current (testing upgrade with partition wiping)
2. 0.4-alpha.7 -> current (testing upgrade without partition wiping,
boot-a/boot-b)

Plus a small upgrade test with preserve enabled for a single-node cluster.

Provision tests are now split into two parallel tracks in Drone.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-24 15:30:00 -07:00
Spencer Smith
3485ea9f09 fix: update k8s to 1.17.3
This PR will update k8s to v1.17.3 to address CVEs mentioned in https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/kubernetes-security-announce/2UOlsba2g0s

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-23 17:08:52 -07:00