`config.Container` implements a multi-doc container which implements
both `Container` interface (encoding, validation, etc.), and `Conifg`
interface (accessing parts of the config).
Refactor `generate` and `bundle` packages to support multi-doc, and
provide backwards compatibility.
Implement a first (mostly example) machine config document for
SideroLink API URL.
Many places don't properly support multi-doc yet (e.g. config patches).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
There's a cyclic dependency on siderolink library which imports talos
machinery back. We will fix that after we get talos pushed under a new
name.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Introduce `cluster.NodeInfo` to represent the basic info about a node which can be used in the health checks. This information, where possible, will be populated by the discovery service in following PRs. Part of siderolabs#5554.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
This test sometimes fails with a message like:
```
=== RUN TestIntegration/api.EventsSuite/TestEventsWatch
assertion_compare.go:323:
Error Trace: events.go:88
Error: "0" is not greater than or equal to "14"
Test: TestIntegration/api.EventsSuite/TestEventsWatch
Messages: []
```
I believe the root cause is that the initial (first event) delivery
might be more than 100ms, so instead of waiting for 100ms for each
event, block for 500ms for all events to arrive.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This is a complete rewrite of time sync process.
Now the time sync process starts early at boot time, and it adapts to
configuration changes:
* before config is available, `pool.ntp.org` is used
* once config is available, configured time servers are used
Controller updates same time sync resource as other controllers had
dependency on, so they have a chance to wait for the time sync event.
Talos services which depend on time now wait on same resource instead of
waiting on timed health.
New features:
* time sync now sticks to the particular time server unless there's an
error from that server, and server is changed in that case, this
improves time sync accuracy
* time sync acts on config changes immediately, so it's possible to
reconfigure time sync at any time
* there's a new 'epoch' field in time sync resources which allows
time-dependent controllers to regenerate certs when there's a big enough
jump in time
Features to implement later:
* apid shouldn't depend on timed, it should be started early and it
should regenerate certs on time jump
* trustd should be updated in same way
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Talos generates in-cluster kubeconfig for the kube-scheduler and
kube-controller-manager to authenticate to kube-apiserver. Bug was that
validity of that kubeconfig was set to 24h by mistake. Fix that by
bumping validity to default for other Kubernetes certs (1 year).
Add a certificate refresh at 50% of the validity.
Fix bugs with copying secret resources which was leading to updates not
being propagated correctly.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
For single node clusters, control plane is unstable after reboot, run
health check several times to let it settle down to avoid failures in
subsequent checks.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under `pkg/machinery` umbrella.
And `pkg/machinery` is published as Go module inside Talos repository.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
With load-balancing enabled by default running `talosctl` without
`--nodes` is risky, as it might hit any control plane by default without
`--nodes`.
Only two commands do not enforce this check, as they do their own node
contexts: `crashdump` and `health` (client-side).
Integration tests were updated to always supply `--nodes` cli argument,
while doing that I refactored the storage for discovered nodes to use
existing `cluster.Info` interface.
The downside is that with e2e CAPI tests CLI tests will be mostly
skipped as we don't support discovery in CLI tests at the momemnt. This
can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for
node discovery.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes#2316
Simply update dependencies we don't track on version level to be
compatible with Talos components (like etcd or k8s).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Handling of multiple endpoints has already been implemented in #2094.
This PR enables round-robin policy so that grpc picks up new endpoint
for each call (and not send each request to the first control plane
node).
Endpoint list is randomized to handle cases when only one request is
going to be sent, so that it doesn't go always to the first node in the
list.
gprc handles dead/unresponsive nodes automatically for us.
`talosctl cluster create` and provision tests switched to use
client-side load balancer for Talos API.
On the additional improvements we got:
* `talosctl` now reports correct node IP when using commands without
`-n`, not the loadbalancer IP (if using multiple endpoints of course)
* loadbalancer can't provide reliable handling of errors when upstream
server is unresponsive or there're no upstreams available, grpc returns
much more helpful errors
Fixes#1641
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids
are sortable and unique enough. Xids also encode event publishing
time with a second precision.
2. Add three ways to look back into event history: based on number of
events, on time and ID. Lookup via ID might be used to restart event
polling in case of broken API connection from the same moment.
3. Reimplement core event buffer with positions which are always
incremented instead of generation+index, this implementation is much
more simple (idea from circular buffer).
4. By default, Events API works the same - it shows no history and
starts streaming new events only.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This implements service events, adds test for events API based on
service events as they're the easiest to generate on demand.
Disabled validate test for 'metal' as it validates disk device against
local system which doesn't make much sense.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>