This is a follow-up fix to PR #5129.
1. Correctly catch only expected errors in the tests.
2. Rewind the snapshot each time the upload is retried.
3. Correctly unwrap errors in the `EtcdRecovery` client.
4. Update the `grpc-proxy` library to pass through the EOF error.
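Item 2 above can be sketched as follows. This is a minimal illustration and not the actual Talos code: `uploadWithRetry` and `demoUpload` are hypothetical names, and the snapshot is assumed to be any `io.ReadSeeker`. The point is only that the reader is rewound before every attempt, so a retry does not send the unread tail of a partially consumed stream.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// uploadWithRetry is a hypothetical helper, not the Talos implementation.
// It rewinds the snapshot before every attempt so that a retry never sends
// only the unread tail of a partially consumed stream.
func uploadWithRetry(snapshot io.ReadSeeker, attempts int, upload func(io.Reader) error) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}

	var err error

	for i := 0; i < attempts; i++ {
		if _, seekErr := snapshot.Seek(0, io.SeekStart); seekErr != nil {
			return fmt.Errorf("rewinding snapshot: %w", seekErr)
		}

		if err = upload(snapshot); err == nil {
			return nil
		}
	}

	return fmt.Errorf("upload failed after %d attempts: %w", attempts, err)
}

// demoUpload fails the first attempt after consuming the stream and reports
// how many attempts ran and how many bytes the successful attempt read.
func demoUpload() (calls int, uploaded int, err error) {
	snapshot := bytes.NewReader([]byte("etcd snapshot data"))

	err = uploadWithRetry(snapshot, 3, func(r io.Reader) error {
		calls++

		data, readErr := io.ReadAll(r)
		if readErr != nil {
			return readErr
		}

		if calls == 1 {
			return errors.New("simulated transient failure")
		}

		uploaded = len(data)

		return nil
	})

	return calls, uploaded, err
}

func main() {
	calls, uploaded, err := demoUpload()
	fmt.Println(calls, uploaded, err)
}
```

Without the `Seek` call the second attempt in this sketch would read zero bytes, because the first attempt already drained the reader.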
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Some failures can be fixed by updating the machine configuration.
Now failures in `userDisks` and `userFiles` no longer make Talos enter a
reboot loop; Talos pauses for 35 minutes instead.
Additionally, `apid` and `machined` are now started right after
containerd is up and running.
That makes it possible for the operator to connect to the node using
talosctl and fix the config.
Fixes: https://github.com/talos-systems/talos/issues/4669
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Fixes #4947
It turns out that something in the BIOS-mode boot process leads to
initramfs corruption on a later `kexec`.
Booting via GRUB is always successful.
Problem with kexec was confirmed with:
* direct boot via QEMU
* QEMU boot via iPXE (bundled with QEMU)
The root cause is not known, but the only visible difference is the
placement of RAMDISK with UEFI and BIOS boots:
```
[ 0.005508] RAMDISK: [mem 0x312dd000-0x34965fff]
```
or:
```
[ 0.003821] RAMDISK: [mem 0x711aa000-0x747a7fff]
```
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Talos shouldn't try to re-encode the machine config it was provided
with.
So add a `ReadonlyWrapper` around `*v1alpha1.Config` which makes sure
that the raw config object is not available anymore (it's a private
field), but config accessors are available for read-only access.
Another thing `ReadonlyWrapper` does is preserve the original `[]byte`
encoding of the config, keeping it exactly as it was loaded from file
or read over the network.
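The wrapper idea can be sketched roughly as below. All names here (`rawConfig`, `readonlyConfig`, `Hostname`, `Bytes`) are hypothetical stand-ins and not the real `ReadonlyWrapper` API; in a real layout the wrapper would live in its own package so that the private fields are actually unreachable.

```go
package main

import "fmt"

// rawConfig stands in for the mutable config type; the single field is
// illustrative only.
type rawConfig struct {
	hostname string
}

// readonlyConfig is a hypothetical sketch of the wrapper idea: the config
// and its original encoding are private, only accessors are exposed.
type readonlyConfig struct {
	cfg   rawConfig
	bytes []byte
}

func newReadonlyConfig(cfg rawConfig, encoded []byte) *readonlyConfig {
	// Copy the encoding so later changes by the caller cannot leak in.
	return &readonlyConfig{cfg: cfg, bytes: append([]byte(nil), encoded...)}
}

// Hostname is a read-only accessor.
func (c *readonlyConfig) Hostname() string { return c.cfg.hostname }

// Bytes returns a copy of the encoding exactly as it was loaded.
func (c *readonlyConfig) Bytes() []byte { return append([]byte(nil), c.bytes...) }

// demoReadonly shows that neither comments nor formatting are lost and that
// mutating the returned slice does not affect the stored encoding.
func demoReadonly() (string, string) {
	encoded := []byte("machine:\n  network:\n    hostname: node-1 # keep me\n")
	cfg := newReadonlyConfig(rawConfig{hostname: "node-1"}, encoded)

	cfg.Bytes()[0] = 'X'

	return cfg.Hostname(), string(cfg.Bytes())
}

func main() {
	hostname, raw := demoReadonly()
	fmt.Println(hostname)
	fmt.Print(raw)
}
```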
Improved `talosctl edit mc` to preserve the config as it was submitted,
and preserve the edits on error from Talos (previously edits were lost).
`ReadonlyWrapper` is not used on the config generation path, though:
the config there is represented by `*v1alpha1.Config` and can be freely
modified.
Why almost? Some parts of Talos (platform code) patch the machine
configuration with new data. We need to fix platforms to provide
networking configuration in a different way, but this will come with
other PRs later.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Sometimes pushing to or pulling from the Kubernetes registry is delayed
due to backoff on failed attempts to talk to the API server while the
cluster is still bootstrapping. Work around that by adding retries.
Also disable the kernel module controller in container mode, as it
would always fail there.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
As `talosctl time` relies on the default time server set in the config,
and our nodes start with `pool.ntp.org`, requests to the time server
sometimes fail, which fails the tests.
Retry such errors in the tests to avoid spurious failures.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #4656
Now that changes to the kubelet configuration can be applied without a
reboot, `talosctl upgrade-k8s` can handle kubelet upgrades as well.
The gist is simply modifying the machine config and waiting for the
`Node` version to be updated; the rest of the code is required for the
reliability of the process.
Also fixed a bug in the API while watching deleted items with
tombstones.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes: https://github.com/talos-systems/talos/issues/4385
Now sysctls defined in the config can override kernel args defined by
the defaults controller.
In that case the controller shows a warning that says which param was
overridden, gives the new value, and notes that overriding is not
recommended.
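A minimal sketch of this override-with-warning behaviour, under assumptions: `mergeSysctls` is a hypothetical function, the warning text is made up, and the sysctl keys and values in the usage note are purely illustrative rather than the actual Talos defaults.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeSysctls is a hypothetical helper: user-provided values win over the
// defaults, and every real override produces a warning.
func mergeSysctls(defaults, user map[string]string) (map[string]string, []string) {
	merged := make(map[string]string, len(defaults)+len(user))

	for key, value := range defaults {
		merged[key] = value
	}

	// Sort the user keys so that the warnings come out in a stable order.
	keys := make([]string, 0, len(user))
	for key := range user {
		keys = append(keys, key)
	}

	sort.Strings(keys)

	var warnings []string

	for _, key := range keys {
		if def, ok := defaults[key]; ok && def != user[key] {
			warnings = append(warnings, fmt.Sprintf(
				"overriding default %s=%s with %s is not recommended", key, def, user[key]))
		}

		merged[key] = user[key]
	}

	return merged, warnings
}

func main() {
	merged, warnings := mergeSysctls(
		map[string]string{"net.ipv4.ip_forward": "1"}, // illustrative default
		map[string]string{"net.ipv4.ip_forward": "0", "vm.max_map_count": "262144"},
	)

	fmt.Println(merged["net.ipv4.ip_forward"], merged["vm.max_map_count"])

	for _, warning := range warnings {
		fmt.Println(warning)
	}
}
```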
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Fixes #4407, fixes #4489
This PR started by enabling simple restart of the `kubelet` service via
services API, but it turned out there's a problem:
When kubelet restarts, CNI is already up, so there's an interface on the
host with the CNI node IP. The code which picks the kubelet node IP
finds it and tries to add it to the list of kubelet node IPs, which
completely breaks kubelet.
The solution was easy: allow node IPs to be filtered out, e.g. we never
want the kubelet node IP to be from the pod CIDR.
But this filtering feature is also useful in other cases, so I added
that as well.
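The filtering step can be illustrated with the standard `net/netip` package (Go 1.18+). This is only a sketch: `filterNodeIPs` is a hypothetical helper, and the addresses and the `10.244.0.0/16` pod CIDR are example values, not something taken from the change itself.

```go
package main

import (
	"fmt"
	"net/netip"
)

// filterNodeIPs is a hypothetical helper that drops every address contained
// in one of the excluded prefixes (for example, the pod CIDR).
func filterNodeIPs(addrs []netip.Addr, exclude []netip.Prefix) []netip.Addr {
	var result []netip.Addr

	for _, addr := range addrs {
		excluded := false

		for _, prefix := range exclude {
			if prefix.Contains(addr) {
				excluded = true

				break
			}
		}

		if !excluded {
			result = append(result, addr)
		}
	}

	return result
}

// demoFilter runs the filter on example data and returns the kept addresses.
func demoFilter() []string {
	addrs := []netip.Addr{
		netip.MustParseAddr("192.168.1.10"), // host address
		netip.MustParseAddr("10.244.0.1"),   // CNI interface inside the pod CIDR
	}
	exclude := []netip.Prefix{netip.MustParsePrefix("10.244.0.0/16")}

	var kept []string

	for _, addr := range filterNodeIPs(addrs, exclude) {
		kept = append(kept, addr.String())
	}

	return kept
}

func main() {
	fmt.Println(demoFilter())
}
```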
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
With recent changes and kexec, Talos upgrades much faster in the tests
and the mutex is not released properly (#4525).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #4418
Only one resource (one of the very first ones) was polymorphic: its
actual spec type depends on its ID. This was a bad idea, and it doesn't
work with protobuf specs (as type <> protobuf relationship can't be
established).
Refactor this by splitting into three separate resource types:
`OSRoot` (OS-level root secrets), `EtcdRoot` (for etcd),
`KubernetesRoot` (for Kubernetes).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This verifies that members match cluster state and that both cluster
registries work in sync, producing the same discovery data.
Fixes #4191
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This test sometimes fails with a message like:
```
=== RUN TestIntegration/api.EventsSuite/TestEventsWatch
assertion_compare.go:323:
Error Trace: events.go:88
Error: "0" is not greater than or equal to "14"
Test: TestIntegration/api.EventsSuite/TestEventsWatch
Messages: []
```
I believe the root cause is that the initial (first event) delivery
might take more than 100ms, so instead of waiting for 100ms for each
event, block for 500ms for all events to arrive.
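The change in waiting strategy can be sketched like this. It is not the test code itself: `collectEvents` and `demoCollect` are hypothetical, and events are modelled as plain strings on a channel. A single deadline covers the whole batch, so a slow first event no longer ends the wait early.

```go
package main

import (
	"fmt"
	"time"
)

// collectEvents is a hypothetical helper: it gathers everything that arrives
// within one fixed window instead of applying a short timeout per event.
func collectEvents(events <-chan string, window time.Duration) []string {
	var received []string

	deadline := time.After(window)

	for {
		select {
		case event, ok := <-events:
			if !ok {
				return received
			}

			received = append(received, event)
		case <-deadline:
			return received
		}
	}
}

// demoCollect delivers the first event only after 150ms, which a 100ms
// per-event timeout would miss, and counts what a 500ms window sees.
func demoCollect() int {
	events := make(chan string, 3)

	go func() {
		time.Sleep(150 * time.Millisecond)

		for _, event := range []string{"first", "second", "third"} {
			events <- event
		}
	}()

	return len(collectEvents(events, 500*time.Millisecond))
}

func main() {
	fmt.Println("events received:", demoCollect())
}
```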
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #4094
Deprecate old networkd APIs; `talosctl interfaces` and `talosctl routes`
now suggest different commands to achieve the same task.
TUI installer was updated to stop using Interfaces API.
Those APIs will be completely removed in 0.14.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This adds information about file ownership in the long listing which is
crucial sometimes.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This simply uses new protobuf package instead of old one.
Old protobuf package is still in use by Talos dependencies.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The problem was that the gRPC function `status.Code(err)` doesn't unwrap
errors, while the Talos client returns errors wrapped with
`multierror.Error` and `fmt.Errorf`, so `status.Code` doesn't return
the error code correctly.
Fix that by introducing our own client method which correctly goes over
the chain of wrapped errors.
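The idea of walking the chain can be shown with the standard library alone. The sketch below deliberately avoids the real gRPC packages: `code`, `statusError` and `codeOf` are hypothetical stand-ins, and only `fmt.Errorf` wrapping is covered (the multi-error case is left out). It is an illustration of the technique, not the client method added by this change.

```go
package main

import (
	"errors"
	"fmt"
)

// code is a stand-in for a gRPC status code.
type code int

const (
	codeOK code = iota
	codeUnknown
	codeUnavailable
)

// statusError is a hypothetical error type that carries a status code.
type statusError struct {
	code code
	msg  string
}

func (e *statusError) Error() string { return e.msg }

// codeOf walks the chain of wrapped errors with errors.As and returns the
// code of the first statusError it finds.
func codeOf(err error) code {
	if err == nil {
		return codeOK
	}

	var statusErr *statusError

	if errors.As(err, &statusErr) {
		return statusErr.code
	}

	return codeUnknown
}

// demoCodes wraps a status error twice and also checks a plain error.
func demoCodes() (code, code) {
	base := &statusError{code: codeUnavailable, msg: "connection refused"}
	wrapped := fmt.Errorf("node 10.5.0.2: %w", fmt.Errorf("fetching version: %w", base))

	return codeOf(wrapped), codeOf(errors.New("plain error"))
}

func main() {
	wrapped, plain := demoCodes()
	fmt.Println(wrapped == codeUnavailable, plain == codeUnknown)
}
```

A lookup that only inspects the outermost error would report the unknown code for `wrapped`, which is the failure mode described above.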
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This commit also introduces a hidden `--json` flag for `talosctl version` command
that is not supported and should be re-worked at #907.
Refs #3852.
Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
* `talosctl config new` now sets endpoints in the generated config.
* Avoid duplication of roles in metadata.
* Remove method name prefix handling. All methods should be set explicitly.
* Add tests.
Closes #3421.
Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
This fixes a scenario when a control plane node loses the contents of
`/var` without leaving etcd first: on reboot the etcd data directory is
empty, but the member is already present in the etcd member list, so
etcd won't be able to join because the raft log is empty.
The fix is to remove a member with a matching hostname if found in the
etcd member list, followed by adding a new member.
The risk here is removing another member which has same hostname as the
joining node, but having duplicate hostnames for control plane node is a
problem anyways.
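The remove-then-add logic can be sketched on an in-memory member list. Nothing here uses the etcd client API: `member` and `rejoin` are hypothetical, and the IDs and hostnames are made up. The sketch removes every entry with a matching name, which also makes the risk mentioned above visible.

```go
package main

import "fmt"

// member is a simplified stand-in for an etcd cluster member.
type member struct {
	ID   uint64
	Name string
}

// rejoin is a hypothetical helper: it drops any stale member that has the
// same hostname as the joining node and then adds the node as a new member.
func rejoin(members []member, hostname string, newID uint64) []member {
	kept := make([]member, 0, len(members)+1)

	for _, m := range members {
		if m.Name != hostname {
			kept = append(kept, m)
		}
	}

	return append(kept, member{ID: newID, Name: hostname})
}

func main() {
	members := []member{
		{ID: 1, Name: "cp-1"},
		{ID: 2, Name: "cp-2"}, // stale entry: this node lost its data directory
		{ID: 3, Name: "cp-3"},
	}

	for _, m := range rejoin(members, "cp-2", 4) {
		fmt.Println(m.ID, m.Name)
	}
}
```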
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Changes `gen config` to output `controlplane` and `join` machine config
types only. Users can manually set the `type` to `init` if they need to.
Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
The structure of the controllers is really similar to addresses and
routes:
* `LinkSpec` resource describes desired link state
* `LinkConfig` controller generates `LinkSpecs` based on machine
configuration and kernel cmdline
* `LinkMerge` controller merges multiple configuration sources into a
single `LinkSpec` paying attention to the config layer priority
* `LinkSpec` controller applies the specs to the kernel state
Controller `LinkStatus` (which was implemented before) watches the
kernel state and publishes current link status.
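The merge step can be sketched as follows, with loud caveats: `linkSpec`, `configLayer`, the layer names and their ordering are all assumptions made for this illustration, and the rule used here (the highest layer replaces the whole spec) is a simplification rather than the actual `LinkMerge` logic.

```go
package main

import "fmt"

// configLayer is a hypothetical priority; a higher value wins.
type configLayer int

const (
	layerDefault configLayer = iota
	layerCmdline
	layerMachineConfig
)

// linkSpec is a simplified stand-in for the `LinkSpec` resource.
type linkSpec struct {
	Name  string
	Up    bool
	MTU   uint32
	Layer configLayer
}

// mergeLinkSpecs keeps, for each link name, the spec coming from the
// highest-priority layer; on equal priority the later spec wins.
func mergeLinkSpecs(specs []linkSpec) map[string]linkSpec {
	merged := make(map[string]linkSpec)

	for _, spec := range specs {
		existing, ok := merged[spec.Name]
		if !ok || spec.Layer >= existing.Layer {
			merged[spec.Name] = spec
		}
	}

	return merged
}

// demoMerge feeds three sources for the same link into the merge.
func demoMerge() linkSpec {
	merged := mergeLinkSpecs([]linkSpec{
		{Name: "eth0", Up: true, MTU: 1500, Layer: layerDefault},
		{Name: "eth0", Up: true, MTU: 9000, Layer: layerMachineConfig},
		{Name: "eth0", Up: false, MTU: 1400, Layer: layerCmdline},
	})

	return merged["eth0"]
}

func main() {
	spec := demoMerge()
	fmt.Println(spec.Name, spec.Up, spec.MTU)
}
```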
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Route handling is very similar to addresses:
* `RouteStatus` describes kernel routing table state,
`RouteStatusController` reflects kernel state into resources
* `RouteSpec` defines routes to be configured
* `RouteConfigController` creates `RouteSpec`s based on cmdline and
machine configuration
* `RouteMergeController` merges different configuration layers into the
final representation
* `RouteSpecController` applies the specs to the kernel routing table
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>