75 Commits

Author SHA1 Message Date
Spencer Smith
e03a68f8eb feat: update k8s and sonobuoy versions
This PR will update k8s to the latest 1.18 release and bump sonobuoy to
help resolve some e2e flakes. Also adds some retry logic around the
sonobuoy run.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-06-10 06:47:36 -07:00
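
Commit e03a68f8eb mentions retry logic around the sonobuoy run but does not show it. The following is only a generic sketch of a bounded retry in Go; the `retry` helper, the `errFlaky` error, and the delay are assumptions for illustration and are not taken from the Talos source.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errFlaky stands in for a transient e2e failure (hypothetical).
var errFlaky = errors.New("flaky e2e run")

// retry calls fn until it succeeds or attempts runs out, sleeping for
// delay between failed attempts. It returns the last error, wrapped.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		return errors.New("retry: attempts must be at least 1")
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errFlaky // the first two runs fail
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Bounding the number of attempts keeps a genuinely broken run from looping forever while still absorbing occasional flakes.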
Andrew Rynhard
00b7176a8a feat: upgrade Linux to v5.6.13
This brings in the latest version of Linux.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-05-18 14:41:59 -07:00
Andrew Rynhard
7cf28dc805 refactor: rename ntpd to timed
This renames the ntpd application to timed.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-13 15:02:26 -07:00
Andrew Rynhard
681b1a8cb2 feat: upgrade Linux to v5.5.15
This brings in the latest 5.5 version of Linux.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-07 09:06:18 -07:00
Andrew Rynhard
6fe5fed6f9 fix: make upgrades work with UEFI
Since the `--once` option of `extlinux` seems to only work with BIOS, we
needed to remove any reliance on this option. Instead of
booting the upgraded version once, and then making it the default after
a successful boot, we now make it the default, and then revert on any
boot error.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-26 13:34:00 -07:00
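
The flow described in commit 6fe5fed6f9 (make the upgraded version the default, then revert on a boot error) can be sketched as a small state change. The `bootConfig` type, its fields, and the version labels are hypothetical; this is not the Talos or `extlinux` data structure.

```go
package main

import "fmt"

// bootConfig is a hypothetical model of a bootloader configuration with a
// default entry and a fallback entry.
type bootConfig struct {
	Default  string // label booted by default
	Fallback string // label to restore if the default fails to boot
}

// upgrade makes next the default immediately and remembers the old default.
func (c *bootConfig) upgrade(next string) {
	c.Fallback = c.Default
	c.Default = next
}

// revert restores the previous default after a boot error. It reports
// whether there was anything to revert to.
func (c *bootConfig) revert() bool {
	if c.Fallback == "" {
		return false
	}
	c.Default, c.Fallback = c.Fallback, ""
	return true
}

func main() {
	cfg := &bootConfig{Default: "current"}
	cfg.upgrade("upgraded")
	fmt.Println("default after upgrade:", cfg.Default)

	bootFailed := true // pretend the upgraded version failed to boot
	if bootFailed && cfg.revert() {
		fmt.Println("default after revert:", cfg.Default)
	}
}
```

Unlike a boot-once flag, this approach needs nothing from the bootloader beyond a default entry, which is why it would work the same way under BIOS and UEFI.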
Spencer Smith
3a4eaeeef0 feat: upgrade kubernetes to 1.18
This PR will pull in the latest release of k8s 1.18 so we can start
validating it through our test suite.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-26 14:59:43 -04:00
Spencer Smith
3485ea9f09 fix: update k8s to 1.17.3
This PR will update k8s to v1.17.3 to address CVEs mentioned in https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/kubernetes-security-announce/2UOlsba2g0s

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-23 17:08:52 -07:00
Andrew Rynhard
69fa63a7b2 refactor: perform upgrade upon reboot
This PR introduces a new strategy for upgrades. Instead of attempting to
zap the partition table, create a new one, and then format the
partitions, this change will only update the `vmlinuz` and
`initramfs.xz` being used to boot. It introduces an A/B style upgrade
process, which will allow for easy rollbacks. One deviation from our
original intention with upgrades is that this change does not completely
reset a node. It falls just short of that and does not reset the
partition table. This forces us to keep the current partition scheme in
mind as we make changes in the future, because an upgrade assumes a
specific partition scheme. We can improve upgrades further in the
future, but this will at least make them more dependable. Finally, one
more feature in this PR is the ability to keep state. This enables
single node clusters to upgrade since we keep the etcd data around.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-20 17:32:18 -07:00
Spencer Smith
2f4ccfda9a fix: respect dns domain from machine config
BREAKING CHANGE: This PR fixes a bug where we were only passing `cluster.local` to the
kubelet configuration. It will also pull in a new version of the
bootkube fork to ensure that custom domains get propagated down to the
API Server certs, as well as the CoreDNS configuration for a cluster.

Existing users should be aware that, if they were previously trying to
use this option in machine configs, an upgrade may break
their cluster. It will update a kubelet flag with the new domain, but
CoreDNS and API Server certs will not change since bootkube has already
run. One option may be to change these values manually inside the
Kubernetes cluster. However, it may prove easier to rebuild the cluster
if necessary.

Additionally, this PR exposes a flag to `osctl config generate`
to allow tweaking this domain value.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-20 12:28:17 -04:00
Spencer Smith
1cbbf9cd5a feat: update talos base packages
This PR will update the base packages to the latest versions. Updated
packages are:

- ca-certificates
- cni
- iptables
- kernel
- kmod
- libseccomp
- musl
- runc
- socat
- util-linux
- xfsprogs

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-17 19:08:13 -04:00
Spencer Smith
853ce16df4 feat: respect panic kernel flag
This PR allows Talos to respect the panic=0 flag if users pass that in
their kernel args. Doing this makes it easier to catch kernel panics in
debug scenarios and allows the user to manually trigger a restart with
ctrl+alt+del when they're ready.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-10 13:21:34 -04:00
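
As an illustration of commit 853ce16df4, the check for `panic=0` can be sketched as a scan of the kernel command line. The helper names are hypothetical, and the rule that only an explicit `panic=0` disables the automatic restart is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// kernelParam returns the value of the last key=value occurrence of key in
// a kernel command line, such as the contents of /proc/cmdline.
func kernelParam(cmdline, key string) (string, bool) {
	value, found := "", false
	for _, field := range strings.Fields(cmdline) {
		k, v, ok := strings.Cut(field, "=")
		if ok && k == key {
			value, found = v, true
		}
	}
	return value, found
}

// rebootOnPanic reports whether the node should restart automatically.
// Assumption: only an explicit panic=0 disables the automatic restart.
func rebootOnPanic(cmdline string) bool {
	v, ok := kernelParam(cmdline, "panic")
	return !(ok && v == "0")
}

func main() {
	for _, cmdline := range []string{
		"console=ttyS0 panic=0",
		"console=ttyS0",
	} {
		fmt.Printf("%q reboot on panic: %v\n", cmdline, rebootOnPanic(cmdline))
	}
}
```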
Spencer Smith
b1e4b3891f chore: cleanup assets dir after bootkube is done
This PR will clean up bootkube assets regardless of whether bootkube
succeeds. This will allow for a failed bootkube deployment to retry on
reboot.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-06 14:25:44 -05:00
Spencer Smith
12bfd8dd94 feat: allow for persistence of config data
This PR will allow users to set the `persist: true` value in their
config data to tell talos not to re-pull the config data at each reboot.
The default will still remain as a "pull every time" methodology in order
to encourage immutability by default.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-06 11:42:00 -05:00
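
The decision described in commit 12bfd8dd94 can be sketched as a single function. The `configSource` type and `chooseSource` function are hypothetical names; only the `persist` setting and the pull-every-time default come from the commit message.

```go
package main

import "fmt"

// configSource names where the machine config is read from at boot.
type configSource string

const (
	sourceRemote configSource = "remote" // pull the config again
	sourceDisk   configSource = "disk"   // reuse the saved copy
)

// chooseSource reuses the saved config only when one exists and it was
// saved with persist set to true; otherwise the config is pulled again.
func chooseSource(savedExists, persist bool) configSource {
	if savedExists && persist {
		return sourceDisk
	}
	return sourceRemote
}

func main() {
	fmt.Println("first boot:           ", chooseSource(false, false))
	fmt.Println("reboot, persist=false:", chooseSource(true, false))
	fmt.Println("reboot, persist=true: ", chooseSource(true, true))
}
```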
Andrey Smirnov
a068acfbe4 feat: split routerd from apid
New service `routerd` performs exactly one task: based on the incoming
API call's service name, it routes requests to the appropriate Talos
service (`networkd`, `osd`, etc.). Service `routerd` listens on a file
socket and routes requests to file sockets.

Service `apid` now does a single task as well:

* it either fans out request to other `apid` services running on other
nodes and aggregates responses
* or it forwards requests to local `routerd` (when request destination
is local node)

Cons:

* one more proxying layer on request path

Pros:

* clearer service roles
* `routerd` is part of core Talos, services should register with it to
expose their API; no auth in the service (not exposed to the world)
* `apid` might be replaced with another implementation; it depends on TLS infra,
auth, etc.
* `apid` is better segregated from other Talos services (can only access
`routerd`, can't talk to other Talos services directly, so less exposure
in case of a bug)

This change is a no-op for end users.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-05 22:05:56 +03:00
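
The routing rule in commit a068acfbe4 (pick a backend from the service name of the incoming call) can be sketched as a lookup on the gRPC full method name, which has the form `/package.Service/Method`. The service names and socket paths below are hypothetical placeholders, not the ones Talos uses.

```go
package main

import (
	"fmt"
	"strings"
)

// backends maps a gRPC service name to the Unix socket of the service that
// implements it. Names and paths are hypothetical placeholders.
var backends = map[string]string{
	"example.OS":      "/run/example/osd.sock",
	"example.Network": "/run/example/networkd.sock",
}

// route returns the backend socket for a gRPC full method name, which has
// the form "/package.Service/Method".
func route(fullMethod string) (string, error) {
	if !strings.HasPrefix(fullMethod, "/") {
		return "", fmt.Errorf("malformed method name %q", fullMethod)
	}
	service, method, ok := strings.Cut(fullMethod[1:], "/")
	if !ok || service == "" || method == "" {
		return "", fmt.Errorf("malformed method name %q", fullMethod)
	}
	socket, found := backends[service]
	if !found {
		return "", fmt.Errorf("no backend registered for service %q", service)
	}
	return socket, nil
}

func main() {
	for _, m := range []string{"/example.Network/Interfaces", "/example.Unknown/Ping"} {
		socket, err := route(m)
		fmt.Println(m, "->", socket, err)
	}
}
```

Because the lookup needs only the service name, a router of this shape does not have to know individual methods or message types.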
Andrey Smirnov
bbe2c53d29 feat: generate kubeconfig on the fly on request
This extracts admin kubeconfig generation out of bootkube, now based on
Talos x509 library. On each API request for `kubeconfig`, config is
generated on the fly and sent back on the wire.

This fixes two issues:

* any master node can now generate `kubeconfig` (worker nodes can do
that too, but that should probably change in the future)
* after an upgrade that wipes the disk, `osctl kubeconfig` still
works

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-28 21:00:52 +03:00
Andrey Smirnov
e6dc87dfa4 chore: update pkgs & tools for Go 1.14
See also:

* https://github.com/talos-systems/tools/pull/89
* https://github.com/talos-systems/pkgs/pull/103

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-27 01:15:46 +03:00
Andrey Smirnov
923ef4537b test: implement new class of tests: provision tests (upgrades)
This class of tests is included/excluded by build tags, but as it is
pretty different from other integration tests, we build it as a separate
executable. Provision tests provision a cluster for the test run, perform
some actions and verify the results (could be upgrade, reset, scale up/down,
etc.)

There's now a framework to implement upgrade tests. The first of the tests
upgrades from the latest 0.3 (0.3.2 at the moment) to the current version
of Talos (being built in CI). The test starts by booting with the 0.3
kernel/initramfs, runs the 0.3 installer to install a 0.3.2 cluster, waits for
bootstrap, and then upgrades to 0.4 in rolling fashion. As Firecracker
supports a bootloader, this boots the 0.4 system from the boot disk (as installed
by the installer).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-21 07:04:03 -08:00
Andrey Smirnov
fae5e6915d chore: rework firecracker code around upstream Go SDK + PRs
This removes the use of a private fork with custom `ip=` kernel argument
handling and switches fully to the upstream version.

Firecracker Go SDK version is `master` plus the following PRs:

* https://github.com/firecracker-microvm/firecracker-go-sdk/pull/167
* https://github.com/firecracker-microvm/firecracker-go-sdk/pull/177
* https://github.com/firecracker-microvm/firecracker-go-sdk/pull/178

MTU handling support was implemented as well.

Changes:

* the hostname for each node is passed via the `talos.hostname=` kernel arg
* IP configuration is generated by the SDK from the CNI result
* fixed bugs with wrong netmask
* nameservers & MTU are passed via the Talos config

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-01-29 02:35:15 +03:00
Andrey Smirnov
9da687d2a3 test: firecracker provisioner fixes, implement cluster destroy
This implements `osctl cluster destroy` for Firecracker and adds a
new utility command, `osctl cluster show`.

Firecracker mode now has a control process for firecracker VMs, allowing
clean reboots and background operations.

Lots of small fixes to Firecracker mode, clean CNI shutdown, cleaning up
netns, etc.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-01-21 17:11:06 -08:00
Spencer Smith
67e50f6f50 feat: allow for bootkube images to be customized
This PR allows for pod checkpointer and coredns images to be customized
for bootkube. We can already customize the hyperkube image and all other
images used by bootkube are CNI-related and can be customized with the
"custom" CNI setup.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-01-21 11:17:28 -08:00
Spencer Smith
60260c85d1 feat: upgrade kubernetes version to 1.17.1
This PR will bring in the latest point release of k8s 1.17

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-01-17 09:39:26 -08:00
Andrey Smirnov
2bf8540855 test: provision Talos clusters via Firecracker VMs
This is the initial PR to push the initial code; it has several known
problems which are going to be addressed in follow-up PRs:

1. there's no "cluster destroy", so the only way to stop the VMs is to
`pkill firecracker`

2. the provisioner creates state in `/tmp` and never deletes it; that is
required to keep cluster running when `osctl cluster create` finishes

3. doesn't run any controller process around firecracker to support
reboots/CNI cleanup (vethxyz interfaces are lingering on the host as
they're never cleaned up)

The plan is to create some structure in `~/.talos` to manage cluster
state, e.g. `~/.talos/clusters/<name>` which will contain all the
required files (disk images, file sockets, VM logs, etc.). This
directory structure will also work as a way to detect running clusters
and clean them up.

For point number 3, `osctl cluster create` is going to exec a lightweight
process to control the firecracker VM process and to simulate VM reboots
if firecracker finishes cleanly (when VM reboots).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-01-16 00:27:08 +03:00
Andrew Rynhard
cb93646c07 fix: update kernel version constant
This needs to be updated for integration tests.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-01-12 09:21:19 -08:00
Andrew Rynhard
7edd96947a feat: upgrade Linux to v5.4.10
This brings in the latest stable Linux.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-01-10 20:51:07 -08:00
Andrew Rynhard
4242acd085 feat: upgrade linux to v5.4.8
This brings in the latest 5.4 kernel.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-01-08 11:59:05 -06:00
Andrew Rynhard
e4a1bc3cf9 chore: add help menu to the Makefile
This adds a help menu to the Makefile. It documents all build
dependencies, and how to get started.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-12-25 11:11:41 -08:00
Andrew Rynhard
907f87d8e0 feat: upgrade Linux to v5.4.5
This brings in the latest stable version of Linux.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-12-19 17:43:34 -08:00
Brad Beam
9584b47cd7 feat: Upgrade kubernetes to 1.17.0
Primarily doc/constant changes.

Added additional bits to `docs` target in makefile to generate osctl
docs as well as config files. Explicitly define a HOME variable so we
get consistent home directories for talosconfig variables in our docs.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-12-10 16:03:35 -08:00
Andrew Rynhard
fa515b8117 fix: kill POD network mode pods first on upgrades
When we upgrade a node, we kill off all pods before performing a fresh
install. The issue with this is that we run the risk of killing the CNI
pod before we finish killing all other pods, leaving the CRI unable to
tear down the pod's networking. This works around that by first killing
any pods running without host networking so that the CNI can do its
job, and then removing the remaining pods.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-12-09 13:45:31 -08:00
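
The ordering in commit fa515b8117 (pods without host networking first, the rest afterwards) can be sketched as a stable partition. The `pod` type and the pod names are hypothetical and stand in for whatever the CRI client returns.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, minimal view of a running pod.
type pod struct {
	Name        string
	HostNetwork bool
}

// removalOrder returns a copy of pods with pod-networked pods first and
// host-networked pods (such as the CNI pod) last, keeping the relative
// order within each group.
func removalOrder(pods []pod) []pod {
	ordered := append([]pod(nil), pods...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return !ordered[i].HostNetwork && ordered[j].HostNetwork
	})
	return ordered
}

func main() {
	pods := []pod{
		{Name: "cni", HostNetwork: true},
		{Name: "app", HostNetwork: false},
		{Name: "proxy", HostNetwork: true},
		{Name: "web", HostNetwork: false},
	}
	for _, p := range removalOrder(pods) {
		fmt.Println(p.Name)
	}
}
```

Removing the host-networked pods last keeps the CNI available while the other pods' network namespaces are being torn down.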
Spencer Smith
92b5bd9b2b feat: allow ability to specify custom CNIs
This PR will allow users to specify one or many URLs for CNI so that
they can bypass bootkube deploying flannel and bring their own. Will
close #1593

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-12-06 15:27:36 -05:00
Andrew Rynhard
7b6a1fdc94 fix: update kernel version constant
This is required to pass integration tests.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-12-04 20:27:53 -08:00
Andrew Rynhard
d4c202438c refactor: set CRI config to /etc/cri/containerd.toml
This changes the CRI specific containerd instance's config to a
different path.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-12-04 19:32:00 -08:00
Andrew Rynhard
43e6703b8b feat: upgrade containerd to v1.3.2
This brings in the latest version of Containerd.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-12-04 10:19:51 -08:00
Andrew Rynhard
9745c3a504 fix: update kernel version constant
This is needed in order for integration tests to pass.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-12-02 15:26:28 -08:00
Andrey Smirnov
5b7bea2471 feat: use grpc-proxy in apid
This replaces the codegen version of apid proxying with a
talos-systems/grpc-proxy based version. Proxying is transparent: it
doesn't require exact information about methods and response types, but
it does require a common response layout so that responses can be
enhanced with node metadata or errors.

There should be no significant changes to the API compared with the previous
version, but it's worth mentioning a few changes:

1. grpc.ClientConn is established just once per upstream (either local
service or remote apid instance).

2. When called without `-t` (`targets`), apid proxies immediately down
to local service skipping proxying to itself (as before), which results
in empty node metadata in response (before it had local node IP). Might
revert this later to proxy to itself (?).

3. Streaming APIs are now fully supported with multiple targets, but
message definition doesn't contain `ResponseMetadata`, so streaming APIs
are broken now with targets (needs a fix).

4. Errors are now returned as responses with `Error` field set in
`ResponseMetadata`, this requires client library update and `osctl` to
handle it properly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-29 22:57:25 +03:00
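
Point 4 of commit 5b7bea2471 says errors now arrive as responses with the `Error` field set in `ResponseMetadata`. A client-side sketch of handling that could look as follows; the struct definitions, the `Hostname` and `Payload` fields, and the `nodeErrors` helper are assumptions, not the generated Talos API types.

```go
package main

import "fmt"

// responseMetadata mirrors the idea in the commit message: each response
// carries node metadata and, on failure, an error string. Every field name
// other than Error is an assumption.
type responseMetadata struct {
	Hostname string
	Error    string
}

// response is a hypothetical per-node reply.
type response struct {
	Metadata responseMetadata
	Payload  string
}

// nodeErrors collects the non-empty Error fields, keyed by hostname.
func nodeErrors(responses []response) map[string]string {
	failed := make(map[string]string)
	for _, r := range responses {
		if r.Metadata.Error != "" {
			failed[r.Metadata.Hostname] = r.Metadata.Error
		}
	}
	return failed
}

func main() {
	responses := []response{
		{Metadata: responseMetadata{Hostname: "node-1"}, Payload: "ok"},
		{Metadata: responseMetadata{Hostname: "node-2", Error: "connection refused"}},
	}
	for host, msg := range nodeErrors(responses) {
		fmt.Printf("%s: %s\n", host, msg)
	}
}
```

Carrying the error inside the response lets one failing node be reported without discarding the results from the others.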
Andrew Rynhard
e78e1655f1 feat: upgrade packages
This brings in the following changes:

- Linux 5.3.13
- Containerd 1.3.1

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-25 10:41:47 -08:00
Andrey Smirnov
63212ab17e test: fix integration test for k8s version
Push versions to constants, introduce 'platform' to version API to
discover node mode. Check kernel version for non-containers.

A bit of refactoring on version package to expose something closer to a
single response.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-11 13:42:21 -08:00
Andrew Rynhard
17cce5468f feat: add metadata file to boot partition
This introduces the notion of metadata for a node. In this initial pass
there are only two fields: a timestamp to indicate when the install was
performed, and a field to indicate whether the install was performed as part
of an upgrade.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-05 17:59:45 -08:00
Andrew Rynhard
5abbb9b041 fix: Avoid running bootkube on reboots
Since bootkube should only be run once, we need a way to determine if it
has already been run. This makes use of etcd to store a key-value pair
indicating that the cluster has been initialized.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-01 15:20:43 -07:00
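
The run-once guard from commit 5abbb9b041 can be sketched against a tiny key-value interface. Here an in-memory map stands in for etcd, and the key name, the `kv` interface, and `bootstrapOnce` are all hypothetical.

```go
package main

import "fmt"

// kv is the small part of a key-value store this sketch needs. The commit
// uses etcd; the in-memory map below is only a stand-in.
type kv interface {
	Get(key string) (string, bool)
	Put(key, value string)
}

type memKV map[string]string

func (m memKV) Get(key string) (string, bool) { v, ok := m[key]; return v, ok }
func (m memKV) Put(key, value string)         { m[key] = value }

// initializedKey is a hypothetical marker key.
const initializedKey = "example/initialized"

// bootstrapOnce runs bootstrap only if the marker key is absent and records
// the marker after a successful run. It reports whether bootstrap was called.
func bootstrapOnce(store kv, bootstrap func() error) (bool, error) {
	if _, ok := store.Get(initializedKey); ok {
		return false, nil
	}
	if err := bootstrap(); err != nil {
		return true, err
	}
	store.Put(initializedKey, "true")
	return true, nil
}

func main() {
	store := memKV{}
	for boot := 1; boot <= 2; boot++ {
		ran, err := bootstrapOnce(store, func() error { return nil })
		fmt.Printf("boot %d: ran=%v err=%v\n", boot, ran, err)
	}
}
```

Writing the marker only after success means a failed bootstrap is attempted again on the next boot.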
Andrew Rynhard
3c6d0135d0 feat: upgrade Kubernetes to 1.16.2
This brings in 1.16.2 modules and bumps the default hyperkube image.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-10-30 06:35:12 -07:00
Brad Beam
457c6416a6 feat: Add network api to apid
This extends apid to include the network api.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-10-28 04:21:48 -07:00
Brad Beam
ee24e42319 feat: Add time api to apid
This extends apid to cover the time api.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-10-25 14:35:14 -07:00
Andrey Smirnov
d3d011c8d2 chore: replace /* */ comments with // comments in license header
This fixes issues with `// +build` directives not being recognized in
source files.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-10-25 14:15:17 -07:00
Brad Beam
573cce8d18 feat: Add APId
This PR introduces APId. This service replaces the frontend functionality
previously provided by OSD. The motivation for this is twofold:

1. Create a single purpose application to expose the talos api

2. Make use of code generation to DRY api changes

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-10-25 13:02:33 -05:00
Andrew Rynhard
10b6202c4f refactor: improve metal platform
This brings in a few minor improvements to the metal platform. The first
is to use talos.config=metal-iso to indicate that the machine's config
can be found in an ISO image. The second is a fix to ensure that /mnt
exists.

This adds support for creating more than one node using the qemu-boot.sh
script.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-10-14 22:05:56 -07:00
Andrew Rynhard
80e3876df5 feat: remove proxyd
We have decided that proxyd is not the best architecture for HA
Kubernetes. Our recommendation to users will be to create a load
balancer instead.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-10-14 08:11:00 -07:00
Brad Beam
d3f20db0aa fix: Use correct names for kubelet config
With the change to bootkube, kubelet.conf has changed names and is now kubelet-kubeconfig.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-10-11 07:42:32 -07:00
Andrey Smirnov
bb5f5cc754 chore: bump golangci-lint to 1.20
Memory usage reduced around 8-10x: now it stays stable at 1GB.

I disabled some of the new linters, and one rule which is violated a
lot.

It might make sense to go back and enable `wsl`, fixing all the issues
(leaving that for another PR).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-10-09 22:21:08 +03:00
Andrew Rynhard
04313bd48c feat: add CNI, and pod and service CIDR to configurator
This adds more methods to the Cluster interface that allow for more
granular control of the cluster network settings.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-10-08 07:53:27 -07:00
Andrew Rynhard
b29391f0be feat: use bootkube for cluster creation
This replaces kubeadm with bootkube.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-10-07 17:17:57 -07:00