Route handling is very similar to addresses:
* `RouteStatus` describes kernel routing table state,
`RouteStatusController` reflects kernel state into resources
* `RouteSpec` defines routes to be configured
* `RouteConfigController` creates `RouteSpec`s based on cmdline and
machine configuration
* `RouteMergeController` merges different configuration layers into the
final representation
* `RouteSpecController` applies the specs to the kernel routing table
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The problem is that some patches can't be applied to join config, as
some nodes don't even exist in the config, for example
`/cluster/apiServer` node, and applying such patches doesn't make any
sense.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
In preparation for going 0.10 beta, start testing upgrades to 0.10, drop
0.8 and self-hosted control plane handling in the tests.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
When Talos `controlplane` node is waiting for a bootstrap, `etcd`
contents can be recovered from a snapshot created with
`talosctl etcd snapshot` on a healthy cluster.
Bootstrap process goes same way as before, but the etcd data directory
is recovered from the snapshot.
This flow enables disaster recovery for the control plane: given that
periodic backups are available, destroy control plane nodes, re-create
them with the same config, and bootstrap one node with the saved
snapshot to recover etcd state at the time of the snapshot.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes: https://github.com/talos-systems/talos/issues/3410
Same as in `talosctl cluster create`. Will apply RFC6902 json patch
during the config generation if specified.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This adds a simple API and `talosctl etcd snapshot` command to stream
snapshot of etcd from one of the control plane nodes to the local file.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Version 0.9.1 contains a fix for concurrent map write on unmount which
was frequently breaking our upgrade tests.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Tests for ApplyConfig API were relying on not really supported behavior
of modifying config via the `Provider` interface (and it was "fixed" in
another PR which cleans up such access to the configuration).
Cluster version bumped to try to workaround strange CAPI bootstrap
failures in e2e-capi.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This is a complete rewrite of time sync process.
Now the time sync process starts early at boot time, and it adapts to
configuration changes:
* before config is available, `pool.ntp.org` is used
* once config is available, configured time servers are used
Controller updates same time sync resource as other controllers had
dependency on, so they have a chance to wait for the time sync event.
Talos services which depend on time now wait on same resource instead of
waiting on timed health.
New features:
* time sync now sticks to the particular time server unless there's an
error from that server, and server is changed in that case, this
improves time sync accuracy
* time sync acts on config changes immediately, so it's possible to
reconfigure time sync at any time
* there's a new 'epoch' field in time sync resources which allows
time-dependent controllers to regenerate certs when there's a big enough
jump in time
Features to implement later:
* apid shouldn't depend on timed, it should be started early and it
should regenerate certs on time jump
* trustd should be updated in same way
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This moves implementation of the user-facing APIs to the machined, and
as now all the APIs are implemented by machined, remove routerd and
adjust apid to proxy to machined.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes: https://github.com/talos-systems/talos/issues/3323
Not exactly matching with udevd generated `by-<id>` symlinks, but should
provide sufficient amount of property selectors to be able to pick
specific disks for any kind of disk: sd card, hdd, ssd, nvme.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This removes container images for the aforementioned services, they are
now built into `machined` executable which launches one or another
service based on `argv[0]`.
Containers are started with rootfs directory which contains only a
single executable file for the service.
This creates rootfs on squashfs for each container in
`/opt/<container>`.
Service `networkd` is not touched as it's handled in #3350.
This removes all the image imports, snapshots and other things which
were associated with the existing way to run containers.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
First, if the config for some component image (e.g. `apiServer`) is empty,
Talos pushes default image which is unknown to the script, so verify
that change is not no-op, as otherwise script will hang forvever waiting
for k8s control plane config update.
Second, with bootkube bootstrap it was fine to omit explicit kubernetes
version in upgrade test, but with Talos-managed that means that after
Talos upgrade Kubernetes gets upgraded as well (as Talos config doesn't
contain K8s version, and defaults are used). This is not what we want to
test actually.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
CNI was removed from build-container which works fine for
`talosctl cluster create` clusters as it installs its own CNI, but fails
for upgrade tests as they were never updated for the CNI bundle.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Resources/types were renamed after alpha.4, so we need Talos API to
match expectations of the upgrade test built against master.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
See https://github.com/talos-systems/os-runtime/pull/12 for new mnaming
conventions.
No functional changes.
Additionally implements printing extra columns in `talosctl get xyz`.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This drops support for 0.7.x in upgrade tests, and bumps tests to use
version 0.9.0-alpha.3 as the next stable (it will eventually graduate to
0.9.0).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Talos generates in-cluster kubeconfig for the kube-scheduler and
kube-controller-manager to authenticate to kube-apiserver. Bug was that
validity of that kubeconfig was set to 24h by mistake. Fix that by
bumping validity to default for other Kubernetes certs (1 year).
Add a certificate refresh at 50% of the validity.
Fix bugs with copying secret resources which was leading to updates not
being propagated correctly.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This fixes output of `talosctl containers` to show failed/exited
containers so that it's possible to see e.g. `kube-apiserver` container
when it fails to start. This also enables using ID from the container
list to see logs of failing containers, so it's easy to debug issues
when control plane pods don't start because of wrong configuration.
Also remove option to use either CRI or containerd inspector, default to
containerd for system namespace and to CRI for kubernetes namespace.
The only side effect is that we can't see `kubelet` container in the
output of `talosctl containers -k`, but `kubelet` itself is available in
`talosctl services` and `talosctl logs kubelet`.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes: https://github.com/talos-systems/talos/issues/3209
Using parts of `kubectl` package to run the editor.
Also using the same approach as in `kubectl edit` command:
- add commented section to the top of the file with the description.
- if the config has errors, display validation errors in the commented
section at the top of the file.
- retry apply config until it succeeds.
- abort if no changes were detected or if the edited file is empty.
Patch currently supports jsonpatch only and can read it either from the
file or from the inline argument.
https://asciinema.org/a/wPawpctjoCFbJZKo2z2ATDXeC
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
Verify upgrade flow using the same version of the installer.
Run that with disk encryption enabled.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
Upgrade is performed by updating node configuration (node by node, service
by service), watching internal resource state to get new configuration
version and verifying that pod with matching version successfully
propagated to the API server state and pod is ready.
Process is similar to the rolling update of the DaemonSet.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster
is fully upgraded, control plane is still self-hosted (as it was
bootstrapped with bootkube).
Tool `talosctl convert-k8s` (and library behind it) performs the upgrade
to self-hosted version.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This PR introduces the first part of disk encryption support.
New config section `systemDiskEncryption` was added into MachineConfig.
For now it contains only Ephemeral partition encryption.
Encryption itself supports two kinds of keys for now:
- node id deterministic key.
- static key which is hardcoded in the config and mainly used for test
purposes.
Talosctl cluster create can now be told to encrypt ephemeral partition
by using `--encrypt-ephemeral` flag.
Additionally:
- updated pkgs library version.
- changed Dockefile to copy cryptsetup deps from pkgs.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This explains the intetion better: config is applied on reboot, and
allows to easily distinguish it from `apply-config --immediate` which
applies config immediately without a reboot (that is coming in a
different PR).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Filesystem creation step is moved on the later stage: when Talos mounts
the partition for the first time.
Now it checks if the partition doesn't have any filesystem and formats
it right before mounting.
Additionally refactored mount options a bit:
- replaced separate options with a set of binary flags.
- implemented pre-mount and post-unmount hooks.
And fixed typos in couple of places and increased timeout for `apid ready`.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
Also fix recovery grpc handler to print panic stacktrace to the log.
Any API should follow the structure compatible with apid proxying
injection of errors/nodes.
Explicitly fail GenerateConfig API on worker nodes, as it panics on
worker nodes (missing certificates in node config).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This allows to generating current version Talos configs (by default) or
backwards compatible configuration (e.g. for Talos 0.8).
`talosctl gen config` defaults to current version, but explicit version
can be passed to the command via flags.
`talosctl cluster create` defaults to install/container image version,
but that can be overridden. This makes `talosctl cluster create` now
compatible with 0.8.1 images out of the box.
Upgrade tests use contract based on source version in the test.
When used as a library, `VersionContract` can be omitted (defaults to
current version) or passed explicitly. `VersionContract` can be
convienietly parsed from Talos version string or specified as one of the
constants.
Fixes#3130
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Modify provision library to support multiple IPs, CIDRs, gateways, which
can be IPv4/IPv6. Based on IP types, enable services in the cluster to
run DHCPv4/DHCPv6 in the test environment.
There's outstanding bug left with routes not being properly set up in
the cluster so, IPs are not properly routable, but DHCPv6 works and IPs
are allocated (validates DHCPv6 client).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Our upgrades are safe by default - we check etcd health, take locks,
etc. But sometimes upgrades might be a way to recover broken (or
semi-broken) cluster, in that case we need upgrade to run even if the
checks are not passing. This is not a safe way to do upgrades, but it
might be a way to recover a cluster.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
ECDSA keys are smaller which decreases Talos config size, they are more
efficient in terms of key generation, signing, etc., so it makes boot
performance better (and config generation as well).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
After node reboot (and gRPC API unavailability), gRPC stack might cache
connection refused errors for up to backoff timeout. Explicitly clear
such errors in reset tests before trying to read data from the node to
verify reset success.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Flannel got updated to 0.13 version which has multi-arch image.
Kubernetes images are multi-arch.
Fixes#3049
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>