1. Don't set max cgroups limit if race mode is enabled (only in test
mode). When e.g. apid/trustd are built with race detector on, they
consume 10x the memory.
2. Fix a data race in `talosctl support` when showing UI progress.
3. Fix an issue pulling `kubeconfig` in `talosctl support` - pull from
endpoints (controlplanes) without setting any nodes.
Fixes#10036
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Save `support.zip` always, also use a different folder for saving logs,
so we can save artifacts of multi cluster tests.
Signed-off-by: Noel Georgi <git@frezbo.dev>
Bring in new tools, pkgs, update Go dependencies and others.
In preparation for Talos 1.9.0-alpha.0.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
The fix in #9233 wasn't correct, as it was looking for number of
replicas in a "random" ReplicaSet. If the deployment has multiple
replica sets, it leads to unexpected results.
Instead, read the Deployment resource directly.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Explicitly enable access to host DNS from pod/service IPs.
Also fix the Kubernetes health checks to assert number of ready pods to
match expectation, otherwise the check might skip a pod (e.g.
`kube-proxy` one) which is not ready, allowing the test to proceed too
early.
Update DNS test to print more logs on error.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Re-structure k8s components health checks so that K8s health can be
independently checked without auxiliary components being up.
Signed-off-by: Noel Georgi <git@frezbo.dev>
- replace `interface{}` with `any` using `gofmt -r 'interface{} -> any -w'`
- replace `a = []T{}` with `var a []T` where possible.
- replace `a = []T{}` with `a = make([]T, 0, len(b))` where possible.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
(This is not user-facing, but rather internal use of the kubeconfig in
the tests/inside the machine).
This was added 4 years ago as a workaround, but instead of a global
timeout we should rather use contexts with timeouts/deadlines (and we
do!).
Setting a global timeout breaks streaming Kubernetes pod logs.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Talos diagnostics analyzes current system state and comes up with detailed
warnings on the system misconfiguration which might be tricky to figure
out other way.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Sort the pod names, so the check output doesn't re-print itself on no
change to the list of pods.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Dynamically map Kubernetes and Talos API ports to an available port on
the host, so every cluster gets its own unique set of parts.
As part of the changes, refactor the provision library and interfaces,
dropping old weird interfaces replacing with (hopefully) much more
descriprive names.
Signed-off-by: Dmitry Sharshakov <dmitry.sharshakov@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This PR ensures that we can test our siderolink communication using embedded siderolink-agent.
If `--with-siderolink` provided during `talos cluster create` talosctl will embed proper kernel string and setup `siderolink-agent` as a separate process. It should be used with combination of `--skip-injecting-config` and `--with-apply-config` (the latter will use newly generated IPv6 siderolink addresses which talosctl passes to the agent as a "pre-bind").
Fixes#8392
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
The current code was stipping non-`v1alpha1.Config` documents. Provide a
proper method in the config provider, and update places using it.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
To be used in the `go-talos-support` module without importing the whole
Talos repo.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
The previous implementation used old events API, which had several
issues:
* buffer overruns, and weird checks
* big timeout even if the all nodes are booted up
Replace that with direct reading of `MachineStatus` resource which is
available since Talos 1.2.0.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This commit deprecates those things:
- Removes the support of `.persist` flag. From now, it should always be enabled or not defined in the config.
- Removes the documentation for `.bootloader`. It never worked anyway.
- Adds a warning for `.machine.install.extensions`, suggests to use boot-assets.
Closes#7972Closes#7507
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Also add a unit-test to prevent issues like that (I upgraded to 1.29 but
forgot to update go-kubernetes).
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This feature allows us to remove any comments from the machineconfig after
upgrading Kubernetes.
Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Currently, we use `github.com/coreos/go-semver/semver` and `github.com/hashicorp/go-version`
for version parsing. As we use `github.com/blang/semver/v4` in our other projects, and it
has more features, it makes sense to use it across the projects. It also doesn't allocate
like crazy in `KubernetesVersion.SupportedWith`.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Fixes#6391
Implement a set of APIs and commands to manage images in the CRI, and
pre-pull images on Kubernetes upgrades.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Add flags for configuring the qemu bridge interface with chaos options:
- network-chaos-enabled
- network-jitter
- network-latency
- network-packet-loss
- network-packet-reorder
- network-packet-corrupt
- network-bandwidth
These flags are used in /pkg/provision/providers/vm/network.go at the end of the CreateNetwork function to first see if the network-chaos-enabled flag is set, and then check if bandwidth is set. This will allow developers to simulate clusters having a degraded WAN connection in the development environment and testing pipelines.
If bandwidth is not set, it will then enable the other options.
- Note that if bandwidth is set, the other options such as jitter, latency, packet loss, reordering and corruption will not be used. This is for two reasons:
- Restriction the bandwidth can often intoduce many of the other issues being set by the other options.
- Setting the bandwidth uses a separate queuing discipline (Token Bucket Filter) from the other options (Network Emulator) and requires a much more complex configuration using a Heirarchial Token Bucket Filter which cannot be configured at a granular enough level using the vishvananda/netlink library.
Adding both queuing disciplines to the same interface may be an option to look into in the future, but would take more extensive testing and control over many more variables which I believe is out of the scope of this PR. It is also possible to add custom profiles, but will also take more research to develop common scenarios which combine different options in a realistic manner.
Signed-off-by: Christian Rolland <christian.rolland@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This PR adds support for creating a list of API endpoints (each is pair of host and port).
It gets them from
- Machine config cluster endpoint.
- Localhost with LocalAPIServerPort if machine is control panel.
- netip.Addr[0] and port from affiliates if they are control panels.
For #7191
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Use `udevd` rules to create stable interface names.
Link controllers should wait for `udevd` to settle down, otherwise link
rename will fail (interface should not be UP).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
`config.Container` implements a multi-doc container which implements
both `Container` interface (encoding, validation, etc.), and `Conifg`
interface (accessing parts of the config).
Refactor `generate` and `bundle` packages to support multi-doc, and
provide backwards compatibility.
Implement a first (mostly example) machine config document for
SideroLink API URL.
Many places don't properly support multi-doc yet (e.g. config patches).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
See #7230
Refactor more config interfaces, move config accessor interfaces
to different package to break the dependency loop.
Make `.RawV1Alpha1()` method typed to avoid type assertions everywhere.
No functional changes.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#7253
Also fix the case that `kube-proxy` version was updated in the machine
config in `--dry-run` mode.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
See #7230
This is a step towards preparing for multi-doc config.
Split the `config.Provider` interface into parts which have different
implementation:
* `config.Config` accesses the config itself, it might be implemented by
`v1alpha1.Config` for example
* `config.Container` will be a set of config documents, which implement
validation, encoding, etc.
`Version()` method dropped, as it makes little sense and it was almost
not used.
`Raw()` method renamed to `RawV1Alpha1()` to support legacy direct
access to `v1alpha1.Config`, next PR will refactor more to make it
return proper type.
There will be many more changes coming up.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Use new version of go-kubernetes, and move the `kube-proxy` DaemonSet
update to follow common logic of bootstrap manifests update.
This fixes a confusing behavior when after `k8s-upgrade` the version of
`kube-proxy` is not updated in the machine config.
See https://github.com/siderolabs/go-kubernetes/pull/3
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Add cilium e2e tests. The existing cilium check was very old, update to
latest cilium version and also add a test for KPR strict mode.
Signed-off-by: Noel Georgi <git@frezbo.dev>
The shared code is going out to the
github.com/siderolabs/go-kubernetes library.
The code will be used in Talos and other projects using same features.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Not all Kubernetes deprecated resources are same - if the old API
version is deprecated, but new one is available, API server handles
trnasition for us. If some resource is removed completely, we need to
check for it. This reduces number of items to check, and simplifies the
check.
Move the check under the umbrella of the 'upgrade pre-checks', and make
it actually fatal.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>