This PR does the following:
- It allows API calls `MetaWrite` and `MetaRead` in maintenance mode.
- SystemInformation resource now waits for META to be available
- SystemInformation resource now overwrites UUID from META if there is an override
- META now supports "UUID override" and "unique token" keys
- ProvisionRequest now includes unique token and Talos version
For #7694
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Support full configuration for image generation, including image
outputs; support most features (where applicable) for all image output
types; and unify the image generation process.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Add security state resource that describes the state of Talos SecureBoot
and PCR signing key fingerprints.
The UKI fingerprint is currently not populated.
Fixes: #7514
Signed-off-by: Noel Georgi <git@frezbo.dev>
This refactors code to handle partial machine config - only multi-doc
without v1alpha1 config.
This uses improvements from
https://github.com/cosi-project/runtime/pull/300:
* where possible, use `TransformController`
* use integrated tracker to reduce boilerplate
Sometimes fix/rewrite tests where applicable.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #7453
The goal is to make it possible to load some multi-doc configuration
from the platform source (or persisted in STATE) before machine acquires
full configuration.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #7430
Introduce a set of resources which look similar to other API
implementations: CA, certs, cert SANs, etc.
Introduce a controller which manages the service based on resource
state.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This commit adds support for an API load balancer. A quick way to enable it is during cluster creation, using the new `api-server-balancer-port` flag (0 by default, meaning disabled). When enabled, all API requests are routed across the cluster control plane endpoints.
Closes #7191
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Fixes #7233
Waiting for node readiness now happens in the `MachineStatus` controller
which won't mark the node as ready until Kubernetes `Node` is ready.
Cordoning/uncordoning is handled with the help of an additional resource
in `NodeApplyController`.
A new controller provides a reactive `NodeStatus` resource which shows
the current status of the Kubernetes `Node`.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
See #7233
The controlplane label is simply injected into existing controller-based
node label flow.
For the controlplane taint (default NoScheduleTaint), an additional
controller and resource were implemented to handle node taints.
This also fixes a problem with `allowSchedulingOnControlPlanes` not
being reactive to config changes - now it is.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
I ended up completely rewriting the controller, simplifying the flow
(somewhat) so that there's just a single control flow in the controller,
while reading from v1alpha1 events is converted to reading from a
channel.
Fixes #7227
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #7226
This follows the same flow as other similar changes: split out logging
configuration as a separate resource, sourced for now from the cmdline.
Rewrite the controller to allow multiple log outputs and add send retries.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This PR adds support for creating a list of API endpoints (each is a host and port pair).
The endpoints come from:
- Machine config cluster endpoint.
- Localhost with LocalAPIServerPort if the machine is a control plane node.
- netip.Addr[0] and port from affiliates if they are control plane nodes.
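A minimal sketch of how the three sources above could be combined into one list. The `Endpoint` and `Affiliate` types, the `buildEndpoints` function and the sample addresses are hypothetical names used for illustration only, not the types used in this PR:

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
	"strconv"
)

// Endpoint is a host/port pair (hypothetical type for this sketch).
type Endpoint struct {
	Host string
	Port uint16
}

// String renders the endpoint as host:port (IPv6 hosts get brackets).
func (e Endpoint) String() string {
	return net.JoinHostPort(e.Host, strconv.Itoa(int(e.Port)))
}

// Affiliate is a simplified, hypothetical view of a discovered cluster member.
type Affiliate struct {
	ControlPlane bool
	Addresses    []netip.Addr
	APIPort      uint16
}

// buildEndpoints collects endpoints from the three sources in order:
// the cluster endpoint, localhost (control plane only), and the first
// address of every control plane affiliate.
func buildEndpoints(cluster Endpoint, isControlPlane bool, localPort uint16, affiliates []Affiliate) []Endpoint {
	endpoints := []Endpoint{cluster}

	if isControlPlane {
		endpoints = append(endpoints, Endpoint{Host: "localhost", Port: localPort})
	}

	for _, a := range affiliates {
		if !a.ControlPlane || len(a.Addresses) == 0 {
			continue // workers and address-less affiliates contribute nothing
		}

		endpoints = append(endpoints, Endpoint{Host: a.Addresses[0].String(), Port: a.APIPort})
	}

	return endpoints
}

// demoAffiliates returns made-up sample data: one control plane node, one worker.
func demoAffiliates() []Affiliate {
	return []Affiliate{
		{
			ControlPlane: true,
			Addresses:    []netip.Addr{netip.MustParseAddr("10.0.0.2"), netip.MustParseAddr("fd00::2")},
			APIPort:      6443,
		},
		{
			ControlPlane: false,
			Addresses:    []netip.Addr{netip.MustParseAddr("10.0.0.5")},
			APIPort:      6443,
		},
	}
}

func main() {
	cluster := Endpoint{Host: "cp.example.org", Port: 6443}

	for _, ep := range buildEndpoints(cluster, true, 6443, demoAffiliates()) {
		fmt.Println(ep)
	}
}
```

The sketch keeps the sources in a fixed order and does no deduplication; whether the real controller deduplicates or sorts is not stated here.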
For #7191
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Use `udevd` rules to create stable interface names.
Link controllers should wait for `udevd` to settle down, otherwise link
rename will fail (interface should not be UP).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
See #7230
This is a step towards preparing for multi-doc config.
Split the `config.Provider` interface into parts which have different
implementations:
* `config.Config` accesses the config itself, it might be implemented by
`v1alpha1.Config` for example
* `config.Container` will be a set of config documents, which implement
validation, encoding, etc.
`Version()` method dropped, as it makes little sense and it was almost
not used.
`Raw()` method renamed to `RawV1Alpha1()` to support legacy direct
access to `v1alpha1.Config`, next PR will refactor more to make it
return proper type.
There will be many more changes coming up.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Introduce a new resource, `SiderolinkConfig`, to store SideroLink connection configuration (API endpoint for now).
Introduce a controller for this resource which populates it from the Kernel cmdline.
Rework the SideroLink `ManagerController` to take this new resource as input and reconfigure the link on changes.
Additionally, if the siderolink connection is lost, reconnect to it and reconfigure the links/addresses.
Closes siderolabs/talos#7142, siderolabs/talos#7143.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Fixes #7159
The change looks big, but it's actually pretty simple inside: the static
pods had an annotation which tracks a version of the secrets which
forced control plane pods to reload on a change. At the same time
`kube-apiserver` can reload certificate inputs automatically from files
without restart.
So the inputs were split: the dynamic (for kube-apiserver) inputs don't
need a reload, so their version is not tracked in the static pod
annotation and changes to them don't cause a reload. The previous
non-dynamic resource still causes a reload, but it doesn't get updated
when e.g. node addresses change.
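The split can be illustrated with a small sketch. The `inputs` type, the annotation key and both functions are hypothetical and only model the idea: the annotation is derived from the non-dynamic inputs alone, so a change in the dynamic inputs leaves the pod definition untouched:

```go
package main

import (
	"fmt"
	"strconv"
)

// inputs is a hypothetical stand-in for a versioned secrets resource.
type inputs struct {
	Version int
}

// annotationKey is a made-up annotation name used only in this sketch.
const annotationKey = "example.talos.dev/secrets-version"

// podAnnotations tracks only the static inputs version. The dynamic inputs
// are deliberately ignored: kube-apiserver reloads them from files itself.
func podAnnotations(static, dynamic inputs) map[string]string {
	_ = dynamic

	return map[string]string{annotationKey: strconv.Itoa(static.Version)}
}

// needsReload reports whether the tracked annotation changed, which would
// change the static pod definition and force the pod to be recreated.
func needsReload(oldAnnotations, newAnnotations map[string]string) bool {
	return oldAnnotations[annotationKey] != newAnnotations[annotationKey]
}

func main() {
	before := podAnnotations(inputs{Version: 1}, inputs{Version: 1})

	// Dynamic inputs changed (e.g. node addresses): no reload.
	fmt.Println(needsReload(before, podAnnotations(inputs{Version: 1}, inputs{Version: 2})))

	// Static inputs changed: the annotation differs and the pod reloads.
	fmt.Println(needsReload(before, podAnnotations(inputs{Version: 2}, inputs{Version: 2})))
}
```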
Much more refactoring could be done, as the resource chain is a bit
of a mess there, but I wanted to keep the number of changes minimal to
keep this backportable.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #7121
Talos pulls some images on its own (without CRI/kubelet) to the `system`
namespace of the CRI containerd. These images are not visible to the
CRI/kubelet, so we need to clean them up manually.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Network probes are configured with the specs, and provide their output
as a status.
At the moment only platform code can configure network probes.
If any network probes are configured, they affect network.Status
'Connectivity' flag.
Example, create the probe:
```
talosctl -n 172.20.0.3 meta write 0xa '{"probes": [{"interval": "1s", "tcp": {"endpoint": "google.com:80", "timeout": "10s"}}]}'
```
Watch probe status:
```
$ talosctl -n 172.20.0.3 get probe
NODE NAMESPACE TYPE ID VERSION SUCCESS
172.20.0.3 network ProbeStatus tcp:google.com:80 5 true
```
With failing probes:
```
$ talosctl -n 172.20.0.3 get probe
NODE NAMESPACE TYPE ID VERSION SUCCESS
172.20.0.3 network ProbeStatus tcp:google.com:80 4 true
172.20.0.3 network ProbeStatus tcp:google.com:81 1 false
$ talosctl -n 172.20.0.3 get networkstatus
NODE NAMESPACE TYPE ID VERSION ADDRESS CONNECTIVITY HOSTNAME ETC
172.20.0.3 network NetworkStatus status 5 true true true true
```
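As a rough illustration of what a single TCP probe attempt amounts to, here is a sketch using only the Go standard library. `probeTCP` and `probeLocalListener` are hypothetical helpers, not the probe implementation from this change; the real probe also repeats the check on the configured interval:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probeTCP reports whether a TCP connection to endpoint can be established
// within the timeout. This mirrors the "endpoint" and "timeout" spec fields.
func probeTCP(endpoint string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", endpoint, timeout)
	if err != nil {
		return false
	}

	conn.Close()

	return true
}

// probeLocalListener starts a throwaway loopback listener, probes it while
// it is open, closes it, and probes the same address again.
func probeLocalListener() (whileOpen, afterClose bool, err error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}

	addr := l.Addr().String()

	whileOpen = probeTCP(addr, time.Second)

	l.Close()

	afterClose = probeTCP(addr, time.Second)

	return whileOpen, afterClose, nil
}

func main() {
	up, down, err := probeLocalListener()
	if err != nil {
		fmt.Println("listen failed:", err)

		return
	}

	fmt.Println("success while listening:", up)
	fmt.Println("success after close:", down)
}
```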
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Implement the new summary dashboard with node info and logs.
Replace the previous metrics dashboard with the new dashboard, which has multiple screens for node summary, metrics and editing network config.
Port the old metrics dashboard to the tview library and make it a screen in the new dashboard, accessible via the F2 key.
Add a new resource, `infos.cluster.talos.dev`, which contains the cluster name and ID of a node.
Disable the network config editor screen in the new dashboard until it is fully implemented with its backend.
Closes siderolabs/talos#4790.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
These endpoints are used for workers to find the addresses of the
controlplane nodes to connect to `trustd` to issue certificates of
`apid`.
These endpoints today come from two sources:
* discovery service data
* Kubernetes API server endpoints
This PR adds a static entry to the list, based on the Kubernetes control
plane endpoint in the machine config.
E.g. if the loadbalancer is used for the controlplane endpoint, and that
loadbalancer also proxies requests for port 50001 (trustd), this static
endpoint will provide workers with connectivity to trustd even if the
discovery service is disabled, and Kubernetes API is not up.
If this endpoint doesn't provide any trustd API, Talos will still try
other endpoints.
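The fallback behavior described above can be sketched roughly as follows. `mergeEndpoints`, `firstWorking`, `errNoTrustd` and the sample addresses are hypothetical names for illustration; the actual client code may order and retry endpoints differently:

```go
package main

import (
	"errors"
	"fmt"
)

// errNoTrustd is a made-up error for an endpoint that does not serve trustd.
var errNoTrustd = errors.New("no trustd API on this endpoint")

// mergeEndpoints concatenates endpoint lists from several sources, dropping
// duplicates while keeping first-seen order.
func mergeEndpoints(sources ...[]string) []string {
	seen := map[string]struct{}{}

	var out []string

	for _, src := range sources {
		for _, ep := range src {
			if _, ok := seen[ep]; ok {
				continue
			}

			seen[ep] = struct{}{}
			out = append(out, ep)
		}
	}

	return out
}

// firstWorking calls try on each endpoint in order and returns the first
// endpoint for which it succeeds. If all fail, the errors are joined.
func firstWorking(endpoints []string, try func(string) error) (string, error) {
	if len(endpoints) == 0 {
		return "", errors.New("no endpoints")
	}

	var errs []error

	for _, ep := range endpoints {
		if err := try(ep); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", ep, err))

			continue
		}

		return ep, nil
	}

	return "", errors.Join(errs...)
}

func main() {
	endpoints := mergeEndpoints(
		[]string{"lb.example.org:50001"},             // static entry from machine config
		[]string{"10.0.0.2:50001"},                   // discovery service data
		[]string{"10.0.0.2:50001", "10.0.0.3:50001"}, // Kubernetes API server endpoints
	)

	// Pretend the load balancer does not proxy port 50001.
	ep, err := firstWorking(endpoints, func(ep string) error {
		if ep == "lb.example.org:50001" {
			return errNoTrustd
		}

		return nil
	})

	fmt.Println(ep, err)
}
```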
Talos does server certificate validation when calling trustd,
so including malicious endpoints doesn't cause any harm, as a malicious
endpoint can't provide a proper server certificate.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This brings many fixes, including a new Watch with support for
Bootstrapped and Errored event types.
`talosctl` from before this change is still compatible, as there's gRPC
API level backwards compatibility versioning.
The new client doesn't yet depend on new event types, so it will work
against Talos 1.2.x.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
We add the `nodeLabels` key to the machine config to allow users to add
node labels to the kubernetes Node object. A controller
reads the nodeLabels from the machine config and applies them via the
kubernetes API.
Older versions of talosctl will throw an unknown keys error if `edit mc`
is called on a node with this change.
Fixes #6301
Signed-off-by: Philipp Sauter <philipp.sauter@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
We add a controller that provides the etcd member id as a resource
and change the etcd related commands to support member ids next to
hostnames.
Fixes: #6223
Signed-off-by: Philipp Sauter <philipp.sauter@siderolabs.com>
There's a cyclic dependency on siderolink library which imports talos
machinery back. We will fix that after we get talos pushed under a new
name.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This is the first step towards replacing all import paths to be based on
`siderolabs/` instead of `talos-systems/`.
All updates contain no functional changes, just refactorings to adapt to
the new path structure.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Previously static pod manifests were written to and read from a folder
on the disk. We add a controller that cleans up the default static pod
manifests on the disk and serves them as a PodList manifest via HTTP.
The URL to the manifest is injected into the kubelet. File-based static
pod manifests are still supported and may be enabled by setting the key
kubelet -> enableManifestsDirectory in the machine config.
Fixes #5494
Signed-off-by: Philipp Sauter <philipp.sauter@siderolabs.com>
See #6333
Using permanent address fixes issues with mis-matching the links after
they got bonded.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This allows updating the member information (for the current node) with
new advertised peer URLs as the config changes.
E.g. if the node IP changes, this will update the peer URLs for the
member accordingly.
At the same time, any member update requires quorum, so changing IPs can
only be done on a node-by-node basis.
If there are no changes to advertised peer URLs, the controller does
nothing.
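The "do nothing if unchanged" check can be sketched as an order-insensitive comparison of the two URL lists. `peerURLsChanged` and the sample URLs are hypothetical; only when it returns true would a member update (which needs quorum) be issued:

```go
package main

import (
	"fmt"
	"sort"
)

// peerURLsChanged reports whether the desired advertised peer URLs differ
// from the ones currently registered for the member, ignoring order.
func peerURLsChanged(current, desired []string) bool {
	if len(current) != len(desired) {
		return true
	}

	// Sort copies so the caller's slices are left untouched.
	a := append([]string(nil), current...)
	b := append([]string(nil), desired...)

	sort.Strings(a)
	sort.Strings(b)

	for i := range a {
		if a[i] != b[i] {
			return true
		}
	}

	return false
}

func main() {
	current := []string{"https://10.0.0.2:2380"}

	// Same URL: nothing to do.
	fmt.Println(peerURLsChanged(current, []string{"https://10.0.0.2:2380"}))

	// Node IP changed: the member needs an update.
	fmt.Println(peerURLsChanged(current, []string{"https://10.0.0.7:2380"}))
}
```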
A Talos node might still need a reboot to update the listen addresses, as
these are not handled automatically for now.
Fixes #6080
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
We add a new CRD, `serviceaccounts.talos.dev` (with `tsa` as short name), and its controller, which allows users to get a `Secret` containing a short-lived Talosconfig in their namespaces with the roles they need. Additionally, we introduce the `talosctl inject serviceaccount` command, which accepts a YAML file with Kubernetes manifests and injects Talos service accounts into them so that they can be applied directly to Kubernetes afterwards. If the Talos API access feature is enabled on the Talos side, the injected workloads will be able to talk to the Talos API.
Closes siderolabs/talos#4422.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
This is mostly the same as the way `apid` consumes certificates generated by
`machined` via COSI API connection.
Service `trustd` consumes two resources:
* `secrets.Trustd` which contains `trustd` server TLS certificates and
it gets refreshed as e.g. node IP changes
* `secrets.OSRoot` which contains Talos API CA and join token
This PR fixes an issue with `trustd` certs not always including all IPs
of the node, as previously `trustd` certs would only capture addresses of
the node at the moment of `trustd` startup.
The refactoring also makes it possible to change the API CA and join
token dynamically. This needs more work, but `trustd` should now pick up
changes without any additional changes.
Fixes #5863
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This extracts etcd configuration and finalized run arguments as
resources managed by controllers.
The biggest change in terms of UX is that Talos now waits for the etcd
configured subnet to be actually available before starting etcd.
Previously etcd quickly failed if the requested subnet was not available
on the host.
Coupled with other fixes (#5951, #5988), this should bring etcd
join/promote sequence back into proper shape.
I also reverted all temporary measures for discovering etcd endpoints,
so etcd join no longer depends on Kubernetes (once again).
Fixes #5889
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Instead of writing PKI "once" around the startup time, keep writing PKI
files as the certificates get updated. `etcd` is able to reload
certificates, so we should keep updating them e.g. if the hostname/IPs
change over time.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This is a cosmetic fix: when `KubeletServiceController` tries to write
files to `/etc/kubernetes` before `/var` is mounted, it fails.
The controller is restarted, but each restart involves a backoff which
gets longer with every restart.
On the first boot, or when EPHEMERAL is encrypted, mounting might take
considerable time (seconds), so during that time the controller might
enter such a long backoff timeout that it delays the whole boot sequence:
it won't finish before `kubelet` is started.
By waiting for `EPHEMERAL` to be mounted before starting the controller
we eliminate long backoff cycles.
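To show why repeated early failures get expensive, here is a sketch of a doubling backoff with a cap. The initial interval, the cap and the doubling factor are assumptions picked for illustration, not the values used by the controller runtime:

```go
package main

import (
	"fmt"
	"time"
)

// Assumed parameters for this sketch only.
const (
	initialInterval = 500 * time.Millisecond
	maxInterval     = time.Minute
)

// backoffDelay returns the wait before the given restart attempt
// (0-based), doubling each time and capped at maxInterval.
func backoffDelay(restart int) time.Duration {
	d := initialInterval

	for i := 0; i < restart; i++ {
		d *= 2

		if d >= maxInterval {
			return maxInterval
		}
	}

	return d
}

func main() {
	var total time.Duration

	// A few seconds of failing writes is enough to push the next retry far out.
	for restart := 0; restart < 10; restart++ {
		d := backoffDelay(restart)
		total += d

		fmt.Printf("restart %d: wait %v (cumulative %v)\n", restart, d, total)
	}
}
```

Under these assumed parameters the wait doubles per failure, so a controller that fails several times while the mount is in progress may sit idle long after the mount has completed.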
Also fix a bug where the `StartAllServices` task might start kubelet early
(before `KubeletServiceController` is actually going to start it).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
The problem is that the Virtual IP operator configuration might require
access to the platform metadata server (e.g. on Equinix Metal), while
the regular operator configuration sets up critical operators like DHCP.
The issue observed on Equinix Metal without the split:
* on initial boot, DHCP is set up on `eth2`
* platform network configuration is fetched and `bond0` configuration is
created
* node IP is assigned both to `eth2` and `bond0`, while `eth2` is a
slave to `bond0`
* networking is broken
* operator config controller is stuck trying to fetch EM VIP
configuration, as the network is broken, it fails to do so, but retries
for 3 minutes (in `download.Download`)
* network is broken for 3 minutes until `OperatorConfig` controller is
unblocked and cleans up DHCP operator for `eth2` as it should
The issue here is that DHCP operator setup is much more tricky on one
hand (depends on link status, other configuration items, etc.), while
VIP operator depends on DHCP operator setup, as it needs outbound
networking.
By splitting the controllers, we split the flows and remove
dependencies.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #5003
This implements a way to configure API server admission plugins via
Talos machine configuration.
If Pod Security admission is enabled, a default cluster-wide policy is
generated which enforces the baseline policy.
The policy can be overridden per namespace.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #4694
User services run alongside Talos system services.
Every user service container root filesystem should be already present
in the Talos root filesystem.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #4727
On worker nodes, static pods are injected, but status can't be monitored
by Talos. On control plane nodes full status is available via
`StaticPodStatus`.
Pod definition is left as `Unstructured` in the machine configuration,
and no specific validation is performed to avoid pulling in Kubernetes
libraries into Talos machinery package.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>