Describe common failures and debugging approach.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Co-authored-by: Spencer Smith <rsmitty@users.noreply.github.com>
This is more of an in-depth guide explaining the internals.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Co-authored-by: Spencer Smith <rsmitty@users.noreply.github.com>
There was an issue where the `talosctl apply config` variant of the command
printed the help text even when the arguments were correct.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This makes sure the source directory exists before performing the mount
operation.
Also adds the ability to patch the config bundle configs with a JSON patch,
which is exposed in `talosctl cluster create`; this allowed me to easily
test this fix:
```
talosctl cluster create ... --config-patch='[{"op": "add", "path": "/machine/kubelet/extraMounts", "value": [{"destination": "/var/log/containers", "type": "bind", "source": "/var/log/containers", "options": ["rshared", "rbind", "rw"]}]}]'
```
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Without a loadbalancer, when api-server goes down, there will be
`connection refused` errors, which should be retried.
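A minimal sketch of that retry behaviour, not the actual Talos code. The names `retryOnConnRefused` and `flaky` are hypothetical, and the attempt count and delay are arbitrary. Only errors wrapping `ECONNREFUSED` are retried; anything else is returned immediately.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
	"time"
)

// retryOnConnRefused calls fn up to attempts times (attempts should be >= 1).
// It retries only when the error chain contains ECONNREFUSED; any other error
// is returned immediately.
func retryOnConnRefused(attempts int, delay time.Duration, fn func() error) error {
	var err error

	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}

		if !errors.Is(err, syscall.ECONNREFUSED) {
			return err
		}

		// No point sleeping after the final attempt.
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}

	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// flaky is a hypothetical stand-in for an API call: it fails with
// "connection refused" the given number of times, then succeeds.
func flaky(failures int) func() error {
	return func() error {
		if failures > 0 {
			failures--
			return fmt.Errorf("dial tcp: %w", syscall.ECONNREFUSED)
		}

		return nil
	}
}

func main() {
	// The server comes back after two refused connections.
	fmt.Println("recovered:", retryOnConnRefused(5, 10*time.Millisecond, flaky(2)))
	// The server stays down longer than we are willing to wait.
	fmt.Println("exhausted:", retryOnConnRefused(2, 10*time.Millisecond, flaky(5)))
}
```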
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Resources/types were renamed after alpha.4, so we need the Talos API to
match the expectations of the upgrade test built against master.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
On upgrade with persistence, the etcd PKI path retains the old mode 0600, which
breaks the networkd bind mount for etcd certs.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes: https://github.com/talos-systems/talos/issues/2997
Listen for restart events in parallel with the boot sequence and cancel
the context if a `RestartEvent` is received.
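The pattern can be sketched roughly as follows. This is an illustration under assumptions, not the sequencer code: the `event` type, the `restartEvent` constant, and the task helpers are hypothetical stand-ins. A goroutine watches the event stream while the tasks run, and cancels the shared context when a restart arrives.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// event is a hypothetical stand-in for the runtime's event type.
type event string

const restartEvent event = "RestartEvent"

// bootWithRestartWatch runs the tasks in order while a goroutine watches the
// event stream; the shared context is canceled as soon as a restart arrives.
func bootWithRestartWatch(events <-chan event, tasks ...func(context.Context) error) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() {
		for {
			select {
			case <-ctx.Done():
				return // sequence finished, stop watching
			case ev, ok := <-events:
				if !ok {
					return
				}

				if ev == restartEvent {
					cancel()
					return
				}
			}
		}
	}()

	for _, task := range tasks {
		if err := ctx.Err(); err != nil {
			return err
		}

		if err := task(ctx); err != nil {
			return err
		}
	}

	return nil
}

// slowTask simulates a long boot step that honors cancellation.
func slowTask(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(5 * time.Second):
		return nil
	}
}

// quickTask simulates a boot step that completes immediately.
func quickTask(context.Context) error { return nil }

func main() {
	events := make(chan event, 1)
	events <- restartEvent

	// The pending restart event aborts the sequence with a cancellation error.
	fmt.Println("sequence result:", bootWithRestartWatch(events, quickTask, slowTask))
}
```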
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This allows applying config even if the sequencer is locked, to recover from
configuration mistakes.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This adds support for `-o json` (easier to use `jq` to query additional
data), and prints event name in `--watch` mode.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This is required to correctly handle ACPI reboots or forceful reboots
during a sequence that locks the controller.
Additionally, fix the `NoSchedule` untaint when the configuration is changed.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
See https://github.com/talos-systems/os-runtime/pull/12 for the new naming
conventions.
No functional changes.
Additionally implements printing extra columns in `talosctl get xyz`.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This fixes a race condition between `udevd` issuing the `BLKRRPART` ioctl
when a block device is closed after partitioning/formatting and Talos
trying to mount a partition. When `BLKRRPART` is issued, the kernel
temporarily wipes out all the in-memory partitions, removing the `/dev/sdX`
devices until the partition scan is done.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This fixes a problem where Talos pulls the `etcd` image on every reboot, as
`etcd` was running in the system containerd, which is completely
ephemeral (backed by `tmpfs`).
Also skip pulling if image is already present and unpacked (same fix for
the `kubelet` image).
Fixes #3229
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This drops support for 0.7.x in upgrade tests, and bumps tests to use
version 0.9.0-alpha.3 as the next stable (it will eventually graduate to
0.9.0).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Talos generates an in-cluster kubeconfig for the kube-scheduler and
kube-controller-manager to authenticate to kube-apiserver. The bug was that
the validity of that kubeconfig was set to 24h by mistake. Fix that by
bumping the validity to the default for other Kubernetes certs (1 year).
Add a certificate refresh at 50% of the validity.
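As a rough illustration of the 50% rule (a sketch, not the Talos implementation): the refresh point is the midpoint of the validity window. The `refreshAt` helper and the `issued` timestamp are hypothetical and chosen only for the example.

```go
package main

import (
	"fmt"
	"time"
)

// issued is an arbitrary example issue time, not taken from the source.
var issued = time.Date(2021, time.January, 1, 0, 0, 0, 0, time.UTC)

// refreshAt returns the midpoint of the validity window, i.e. the moment at
// which 50% of the certificate's lifetime has elapsed.
func refreshAt(notBefore, notAfter time.Time) time.Time {
	return notBefore.Add(notAfter.Sub(notBefore) / 2)
}

func main() {
	// With the mistaken 24h validity the refresh would be due after 12 hours.
	oneDay := refreshAt(issued, issued.Add(24*time.Hour))
	// With a one-year validity the refresh is due after roughly six months.
	oneYear := refreshAt(issued, issued.AddDate(1, 0, 0))

	fmt.Println("24h validity, refresh at:", oneDay.Format(time.RFC3339))
	fmt.Println("1y validity, refresh at: ", oneYear.Format(time.RFC3339))
}
```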
Fix bugs with copying secret resources which were preventing updates from
being propagated correctly.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This should align all `apply-config` modes to use the same flows.
Also added unit-tests for `ApplyDynamicConfig`.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
Fixes: https://github.com/talos-systems/talos/issues/3219
We already have `etcd leave`, which makes the node remove itself from the
etcd members.
But if the node can't remove itself because it has no connection to etcd,
we need this `etcd remove-member` CLI command, which removes a member
when run against a different node.
No unit tests for that, as it would destroy the test cluster.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This fixes output of `talosctl containers` to show failed/exited
containers so that it's possible to see e.g. `kube-apiserver` container
when it fails to start. This also enables using ID from the container
list to see logs of failing containers, so it's easy to debug issues
when control plane pods don't start because of wrong configuration.
Also remove option to use either CRI or containerd inspector, default to
containerd for system namespace and to CRI for kubernetes namespace.
The only side effect is that we can't see `kubelet` container in the
output of `talosctl containers -k`, but `kubelet` itself is available in
`talosctl services` and `talosctl logs kubelet`.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This adds a VIP (virtual IP) option to the network configuration of an
interface, which will allow a set of nodes to share a floating IP
address among them. For now, this is restricted to control plane use
and only a single shared IP is supported.
Fixes #3111
Signed-off-by: Seán C McCord <ulexus@gmail.com>
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes: https://github.com/talos-systems/talos/issues/3209
Uses parts of the `kubectl` package to run the editor.
Also uses the same approach as the `kubectl edit` command:
- add commented section to the top of the file with the description.
- if the config has errors, display validation errors in the commented
  section at the top of the file.
- retry apply config until it succeeds.
- abort if no changes were detected or if the edited file is empty.
Patch currently supports JSON patch only, and the patch can be read either
from a file or from an inline argument.
https://asciinema.org/a/wPawpctjoCFbJZKo2z2ATDXeC
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This version finally uses working `go.mod` files and tags, so no
more hacks with imports, and it allows us to bump the `grpc` library to the
latest version (which I also did in this PR).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This change introduces a top-level context in networkd that is cancelled on
signal, to abort operations when networkd is being stopped.
This allows for clean restarts of the networkd container, and it is required
to support a cancellable context for VIP etcd operations.
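A minimal sketch of a signal-cancelled top-level context, assuming Go 1.16+ for `signal.NotifyContext`; this is not the networkd code. `longOperation`, `simulateStop`, and `run` are hypothetical helpers, and the process sends `SIGTERM` to itself (Unix-only) purely to stand in for the service being stopped.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// longOperation is a hypothetical blocking operation that honors cancellation.
func longOperation(ctx context.Context, d time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(d):
		return nil
	}
}

// simulateStop sends SIGTERM to the current process, standing in for the
// service being stopped from outside (Unix-only illustration).
func simulateStop() error {
	p, err := os.FindProcess(os.Getpid())
	if err != nil {
		return err
	}

	return p.Signal(syscall.SIGTERM)
}

// run creates the top-level context and passes it down to the operation.
func run() error {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	if err := simulateStop(); err != nil {
		return err
	}

	// The operation would take 5s, but the signal cancels ctx well before that.
	return longOperation(ctx, 5*time.Second)
}

func main() {
	fmt.Println("operation ended:", run())
}
```

Every operation that receives `ctx` aborts once the signal arrives, which is what makes a clean restart possible.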
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Allow setting individual options for the network interface while
generating config instead of providing the whole config. This solves the
problem of merging options from different sources to build the config.
There should be no changes with this PR.
This is prep work for control plane VIP.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The critical bug (I believe) was that the drain code re-entered the eviction
loop after the wait for the pod to be deleted returned success, effectively
evicting the pod again once it had been rescheduled to a different node.
Add a global timeout to prevent the draining code from running forever.
Filter out more pod types which should never be drained.
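The intended control flow can be sketched as below. This is an illustration under assumptions, not the actual drain code: the `evictor` interface and `fakeCluster` are hypothetical stand-ins for the Kubernetes client, and the timeout values are arbitrary. Each pod is evicted exactly once, and a single context timeout bounds the whole drain.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// evictor is a hypothetical abstraction over the Kubernetes client calls.
type evictor interface {
	Evict(ctx context.Context, pod string) error
	Exists(ctx context.Context, pod string) (bool, error)
}

// drain evicts each pod exactly once, waits for it to disappear, and bounds
// the whole operation with a single global timeout.
func drain(parent context.Context, c evictor, pods []string, timeout, poll time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	for _, pod := range pods {
		if err := c.Evict(ctx, pod); err != nil {
			return fmt.Errorf("evicting %s: %w", pod, err)
		}

		// Once the wait succeeds we move on; the pod is never evicted again.
		if err := waitDeleted(ctx, c, pod, poll); err != nil {
			return fmt.Errorf("waiting for %s: %w", pod, err)
		}
	}

	return nil
}

// waitDeleted polls until the pod is gone or the context expires.
func waitDeleted(ctx context.Context, c evictor, pod string, poll time.Duration) error {
	for {
		exists, err := c.Exists(ctx, pod)
		if err != nil {
			return err
		}

		if !exists {
			return nil
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(poll):
		}
	}
}

// fakeCluster is a test double: pods vanish right after eviction unless stuck.
type fakeCluster struct {
	stuck     bool
	evictions int
}

func (f *fakeCluster) Evict(context.Context, string) error {
	f.evictions++
	return nil
}

func (f *fakeCluster) Exists(context.Context, string) (bool, error) {
	return f.stuck, nil
}

// simulate drains two pods and reports how many evictions were issued.
func simulate(stuck bool) (int, error) {
	c := &fakeCluster{stuck: stuck}
	err := drain(context.Background(), c, []string{"pod-a", "pod-b"},
		200*time.Millisecond, 10*time.Millisecond)

	return c.evictions, err
}

func main() {
	n, err := simulate(false)
	fmt.Println("healthy drain:", n, "evictions, err:", err)

	n, err = simulate(true)
	fmt.Println("stuck drain:  ", n, "evictions, err:", err)
}
```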
Fixes #3124
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
As 'healthy' was always set to true, some tasks started earlier than
expected; specifically, the etcd cert was generated while the time sync
was still happening, leading to a half-broken cert on RPi4.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>