When running on a worker node, launch a goroutine before starting the
apid container that copies the kubelet PKI folder contents into
`/system/secrets/kubelet` every minute.
Mounting kubelet secrets directly from `/var/lib/kubelet/pki` breaks the
upgrade flow: the ephemeral partition cannot be unmounted while it is in
use by apid, which is not stopped during the upgrade.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This moves endpoint refresh for workers from the context of the `apid`
service in `machined` into the `apid` service itself. `apid` does an
initial poll for the endpoints when it boots, and then periodically
polls for new endpoints to make sure it has an accurate list of `trustd`
endpoints to talk to. This handles cases when control plane endpoints
change (e.g. rolling replace of control plane nodes with new IPs).
Related to #3069
Fixes #3068
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
On worker join, apid waits for the kubelet client certificate to be pulled
by the kubelet. But if control plane and worker nodes are bootstrapped
around the same time, apid might give up waiting:
On master:
```
2021-01-29 02:37:33.174609 I | [talos] phase startEverything (8/9): done, 46.064546765s
2021-01-29 02:37:33.174622 I | [talos] phase labelMaster (9/9): 1 tasks(s)
2021-01-29 02:37:33.174656 I | [talos] task labelNodeAsMaster (1/1): starting
2021-01-29 02:37:35.739865 I | [talos] retrying error: Get "https://10.5.0.2:6443/api/v1/nodes/e2e-docker-master-1?timeout=30s": dial tcp 10.5.0.2:6443: connect: connection refused
2021-01-29 02:37:50.250264 I | [talos] retrying error: nodes "e2e-docker-master-1" not found
2021-01-29 02:38:35.296363 I | [talos] task labelNodeAsMaster (1/1): done, 1m2.121719377s
2021-01-29 02:38:35.296404 I | [talos] phase labelMaster (9/9): done, 1m2.121782296s
2021-01-29 02:38:35.296411 I | [talos] boot sequence: done: 1m49.200972734s
```
On worker:
```
2021-01-29 02:34:23.354741 I | [talos] service[kubelet](Running): Health check successful
2021-01-29 02:38:08.081764 I | [talos] service[apid](Failed): Failed to create runner: 2 error(s) occurred:
failed to create client: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]
timeout
```
It is clear from the timestamps that the worker gave up at almost the
same time the master finished bootstrapping.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes were applied automatically.
Import ordering might be questionable, but it's strict:
* stdlib
* other packages
* same package imports
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This PR changes the bool for disabling NTP to `disabled` instead of the
previous `enabled`. We need to do this because customers were seeing
failures when they defined time servers only, which resulted in
`enabled: false` when configs were unmarshalled. Users wishing
to disable NTP altogether should now use `disabled: true`.
Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
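The zero-value problem behind this change can be reproduced in a few lines. The struct names below are hypothetical, and JSON stands in for YAML only to keep the sketch stdlib-only; the point is that an omitted bool decodes to `false` in either format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oldTimeConfig mirrors the previous shape: NTP runs only if Enabled is true.
type oldTimeConfig struct {
	Enabled bool     `json:"enabled"`
	Servers []string `json:"servers"`
}

// newTimeConfig mirrors the new shape: NTP runs unless Disabled is true.
type newTimeConfig struct {
	Disabled bool     `json:"disabled"`
	Servers  []string `json:"servers"`
}

// ntpEnabledOld reports whether NTP would run under the old field.
func ntpEnabledOld(doc string) (bool, error) {
	var c oldTimeConfig
	if err := json.Unmarshal([]byte(doc), &c); err != nil {
		return false, err
	}

	return c.Enabled, nil
}

// ntpEnabledNew reports whether NTP would run under the new field.
func ntpEnabledNew(doc string) (bool, error) {
	var c newTimeConfig
	if err := json.Unmarshal([]byte(doc), &c); err != nil {
		return false, err
	}

	return !c.Disabled, nil
}

func main() {
	// The user only lists time servers and says nothing about enabling NTP.
	doc := `{"servers": ["time.example.org"]}`

	before, _ := ntpEnabledOld(doc)
	after, _ := ntpEnabledNew(doc)

	fmt.Println(before, after) // prints: false true
}
```

With the inverted field, the zero value matches the default users expect: NTP stays on unless it is explicitly disabled.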
Adds the capability to disable NTP when it cannot be provided in the deployed network.
Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
Add documentation update.
Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
This bug is sometimes reproducible with QEMU/arm64, as it runs really
slowly. It looks like multiple concurrent image unpacks sharing some
layers might fail unexpectedly.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This is what happens when massive find-replace goes wrong...
Change should be cosmetic though, it doesn't affect operations.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
In order to perform upgrades the way we would like, it is important that
we avoid any bind mounts into containers. This change ensures that all
system services get their config via stdin.
Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under the `pkg/machinery` umbrella.
`pkg/machinery` is published as a Go module inside the Talos repository.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This change is only moving packages and updating import paths.
Goal: expose `internal/pkg/provision` as `pkg/provision` to enable other
projects to import Talos provisioning library.
As cluster checks are almost always required as part of provisioning
process, package `internal/pkg/cluster` was also made public as
`pkg/cluster`.
Other changes are updates to direct dependencies discovered by
`importvet`.
Public packages (useful, general purpose packages with stable API):
* `internal/pkg/conditions` -> `pkg/conditions`
* `internal/pkg/tail` -> `pkg/tail`
Private packages (used only internally by the provisioning library):
* `internal/pkg/inmemhttp` -> `pkg/provision/internal/inmemhttp`
* `internal/pkg/kernel/vmlinuz` -> `pkg/provision/internal/vmlinuz`
* `internal/pkg/cniutils` -> `pkg/provision/internal/cniutils`
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This makes `pkg/config` directly importable from other projects.
There should be no functional changes.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes #2272
`gofumpt` is now included in `golangci-lint`, but `gofumports` is not,
so we keep using it as a separate binary and keep its version in sync
with `golangci-lint`.
This contains fixes from:
* `gofumpt` (automated, mostly around octal constants)
* `exhaustive` in `switch` statements
* `noctx` (adding context with default timeout to http requests)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This implements existing server-side health checks as defined in
`internal/pkg/cluster/checks` in Talos API.
Summary of changes:
* new `cluster` API
* `apid` now listens without auth on local file socket
* `cluster` API is for now implemented in `machined`, but we can move it
to the new service if we find it more appropriate
* `talosctl health` by default now does server-side health check
UX: `talosctl health` without arguments runs the health check for the
cluster, provided Kubernetes is healthy enough to return the list of
master/worker nodes. If needed, the node list can be overridden with flags.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
There's a global timeout for all services to be up: it's 5 minutes. We
need to make sure each service startup takes less than that, otherwise
boot sequence is aborted and there's no way to see the error message for
each particular service.
Also propagate contexts correctly and set some default timeouts to make
sure API operations are not hanging forever.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This skips running udevd, udevd-trigger, and timed when running in
container mode. Since the containers run as privileged containers,
these services would contend with the host's equivalent services.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
Using this `LoggingManager`, all the log flows (reading and writing) were
refactored. The interface of `LoggingManager` should now be generic enough
to replace log handling with almost any implementation: log rotation,
sending logs to a remote destination, keeping logs in memory, etc.
There should be no functional changes.
As part of the changes, `follow.Reader` was implemented, which makes a
file that is being appended to behave like a stream. `file.NewChunker`
was refactored to use `follow.Reader` and `stream.NewChunker` to do the
actual work. So now we have only a single chunker implementation, the
stream chunker, as everything is represented as a stream.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This is a rewrite of machined. It addresses some of the limitations and
complexity in the implementation. This introduces the idea of a
controller. A controller is responsible for managing the runtime, the
sequencer, and a new state type introduced in this PR.
A few highlights are:
- no more event bus
- functional approach to tasks (no more types defined for each task)
- the task function definition now offers a lot more context, like
access to raw API requests, the current sequence, a logger, the new
state interface, and the runtime interface.
- no more panics to handle reboots
- additional initialize and reboot sequences
- graceful gRPC server shutdown on critical errors
- config is now stored at install time to avoid having to download it at
install time and at boot time
- upgrades now use the local config instead of downloading it
- the upgrade API's preserve option takes precedence over the config's
install force option
Additionally, this pulls various packages in under machined to make the
code easier to navigate.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This changes the name of the `system-containerd` service to `containerd`
and the `containerd` service to `cri`. This is so that the CRI service
is a little more generic. In the future, when we add support for other
CRIs, it will be better to refer to this service generically as "cri."
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This PR will allow us to take conditional actions in the postfunc of our
services by passing the state of the service into the postfunc call. We
can use this to do conditional cleanups, and to run finalizers on success.
Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
New service `routerd` performs exactly one task: based on the service
name of the incoming API call, it routes the request to the appropriate
Talos service (`networkd`, `osd`, etc.). Service `routerd` listens on a
file socket and routes requests to file sockets.
Service `apid` now does a single task as well:
* it either fans out request to other `apid` services running on other
nodes and aggregates responses
* or it forwards requests to local `routerd` (when request destination
is local node)
Cons:
* one more proxying layer on request path
Pros:
* clearer service roles
* `routerd` is part of core Talos, services should register with it to
expose their API; no auth in the service (not exposed to the world)
* `apid` might be replaced with other implementation, it depends on TLS infra,
auth, etc.
* `apid` is better segregated from other Talos services (can only access
`routerd`, can't talk to other Talos services directly, so less exposure
in case of a bug)
This change is no-op to the end users.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
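Routing by service name can be sketched as a lookup keyed on the service part of the gRPC full method name, which has the form `/package.Service/Method`. The `Router` type, the service names and the socket paths below are hypothetical; the actual `routerd` registration mechanism is not shown in this message.

```go
package main

import (
	"fmt"
	"strings"
)

// Router maps gRPC service names to backend unix socket paths.
type Router struct {
	backends map[string]string
}

// NewRouter returns an empty router.
func NewRouter() *Router {
	return &Router{backends: map[string]string{}}
}

// Register associates a service name with the socket of its backend.
func (r *Router) Register(service, socket string) {
	r.backends[service] = socket
}

// serviceName extracts "package.Service" from "/package.Service/Method".
func serviceName(fullMethod string) (string, error) {
	if !strings.HasPrefix(fullMethod, "/") {
		return "", fmt.Errorf("malformed method name %q", fullMethod)
	}

	rest := fullMethod[1:]

	idx := strings.Index(rest, "/")
	if idx <= 0 {
		return "", fmt.Errorf("malformed method name %q", fullMethod)
	}

	return rest[:idx], nil
}

// Route returns the backend socket for the given full method name.
func (r *Router) Route(fullMethod string) (string, error) {
	service, err := serviceName(fullMethod)
	if err != nil {
		return "", err
	}

	socket, ok := r.backends[service]
	if !ok {
		return "", fmt.Errorf("no backend registered for service %q", service)
	}

	return socket, nil
}

func main() {
	r := NewRouter()

	// Hypothetical service names and socket paths.
	r.Register("example.NetworkService", "/run/example/networkd.sock")
	r.Register("example.OSService", "/run/example/osd.sock")

	socket, err := r.Route("/example.OSService/Version")
	fmt.Println(socket, err)
}
```

Because the lookup needs only the service name, the router does not have to know the methods or message types of the services behind it.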
`gomnd` disabled, as it complains about every number used in the code,
and `wsl` became much more thorough.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
We have been using two packages that define a config type and a machine
type, when really they are one and the same. This unifies the types down
to one set.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This extracts Docker Talos cluster provisioner as common code
which might be shared between `osctl cluster` and integration-test.
There should be almost no functional changes.
As proof of concept, abstract cluster readiness checks were implemented
based on provisioned cluster state. It implements same checks as
`basic-integration.sh` in pure Go via Talos/K8s clients.
`conditions` package was promoted from machined-internal to
`internal/pkg` as it is used to run the checks.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This replaces the codegen version of apid proxying with a
talos-systems/grpc-proxy based version. Proxying is transparent: it
doesn't require exact information about methods and response types. It
does require responses to share a common layout, so that they can be
enhanced with node metadata or errors.
There should be no significant changes to the API compared with the
previous version, but it's worth mentioning a few changes:
1. grpc.ClientConn is established just once per upstream (either local
service or remote apid instance).
2. When called without `-t` (`targets`), apid proxies immediately down
to local service skipping proxying to itself (as before), which results
in empty node metadata in response (before it had local node IP). Might
revert this later to proxy to itself (?).
3. Streaming APIs are now fully supported with multiple targets, but
message definition doesn't contain `ResponseMetadata`, so streaming APIs
are broken now with targets (needs a fix).
4. Errors are now returned as responses with `Error` field set in
`ResponseMetadata`, this requires client library update and `osctl` to
handle it properly.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Since APId/gRPC connections should never go through a proxy, we will explicitly exclude
the proxy environment variables from apid.
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
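Excluding proxy settings from a service environment can be sketched as a filter over `KEY=value` pairs. The message does not list the variables, so the set used here (`http_proxy`, `https_proxy`, `no_proxy`, matched case-insensitively) is an assumption, and `withoutProxyVars` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// withoutProxyVars returns env without the conventional proxy variables.
// The set of names is an assumption; matching is case-insensitive.
func withoutProxyVars(env []string) []string {
	filtered := make([]string, 0, len(env))

	for _, kv := range env {
		name := kv
		if i := strings.Index(kv, "="); i >= 0 {
			name = kv[:i]
		}

		switch strings.ToLower(name) {
		case "http_proxy", "https_proxy", "no_proxy":
			continue // skip this entry; continue applies to the for loop
		}

		filtered = append(filtered, kv)
	}

	return filtered
}

func main() {
	env := []string{
		"PATH=/bin",
		"HTTP_PROXY=http://proxy:3128",
		"https_proxy=http://proxy:3128",
		"NO_PROXY=localhost",
	}

	fmt.Println(withoutProxyVars(env)) // prints: [PATH=/bin]
}
```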
The name helper isn't very good. This renames it to Client. A new func
was also added, NewForConfig, that will allow for the creation of the helper
client from an arbitrary Kubernetes REST config.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
Since dmesg is not streamed, it becomes difficult to debug issues with
machined. This fixes that by setting up machined logging to go to
`/dev/kmsg` and to a log file.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This removes the default privileged mode that all containers were
started with and adds the required capabilities on a per-service basis.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
This PR introduces APId. This service replaces the frontend functionality
previously provided by OSD. The main driver for this is twofold:
1. Create a single-purpose application to expose the Talos API
2. Make use of code generation to keep API changes DRY
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>