35 Commits

Author SHA1 Message Date
Artem Chernyshev
a07cfbd5a4 fix: mount kubelet secrets from system instead of ephemeral
Launch goroutine that copies kubelet pki folder contents into
`/system/secrets/kubelet` every minute before starting apid container
when running on the worker node.

Mounting kubelet secrets directly from `/var/lib/kubelet/pki` breaks
upgrade flow, because we are not able to unmount ephemeral partition,
which is being used by apid, which is not stopped during the upgrade.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-02-09 08:00:04 -08:00
Andrey Smirnov
5855b8d532 fix: refresh control plane endpoints on worker apids on schedule
This moves endpoint refresh from the context of the service `apid` in
`machined` into `apid` service itself for the workers. `apid` does
initial poll for the endpoints when it boots, but also periodically
polls for new endpoints to make sure it has accurate list of `trustd`
endpoints to talk to, this handles cases when control plane endpoints
change (e.g. rolling replace of control plane nodes with new IPs).

Related to #3069

Fixes #3068

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-03 14:27:03 -08:00
Andrey Smirnov
7be3a86091 fix: bump timeout for worker apid waiting for kubelet client config
On worker join, apid waits for kubelet client certificate to be pulled
by the kubelet. But if control plane and worker nodes are bootstrapped
around the same time, apid might give up waiting:

On master:

```
2021-01-29 02:37:33.174609 I | [talos] phase startEverything (8/9): done, 46.064546765s
2021-01-29 02:37:33.174622 I | [talos] phase labelMaster (9/9): 1 tasks(s)
2021-01-29 02:37:33.174656 I | [talos] task labelNodeAsMaster (1/1): starting
2021-01-29 02:37:35.739865 I | [talos] retrying error: Get "https://10.5.0.2:6443/api/v1/nodes/e2e-docker-master-1?timeout=30s": dial tcp 10.5.0.2:6443: connect: connection refused
2021-01-29 02:37:50.250264 I | [talos] retrying error: nodes "e2e-docker-master-1" not found
2021-01-29 02:38:35.296363 I | [talos] task labelNodeAsMaster (1/1): done, 1m2.121719377s
2021-01-29 02:38:35.296404 I | [talos] phase labelMaster (9/9): done, 1m2.121782296s
2021-01-29 02:38:35.296411 I | [talos] boot sequence: done: 1m49.200972734s
```

On worker:

```
2021-01-29 02:34:23.354741 I | [talos] service[kubelet](Running): Health check successful
2021-01-29 02:38:08.081764 I | [talos] service[apid](Failed): Failed to create runner: 2 error(s) occurred:
failed to create client: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]
		timeout
```

It is clear from the timestamps that the worker gave up at almost the
same time the master was bootstrapped.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-29 06:36:52 -08:00
Andrey Smirnov
a2efa44663 chore: enable gci linter
Fixes were applied automatically.

Import ordering might be questionable, but it's strict:

* stdlib
* other packages
* same package imports

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 08:09:48 -08:00
Spencer Smith
cfb2c50dd7 fix: update handling of ntp disable
This PR changes the bool for disabling ntp to `disabled` instead of the
previous `enabled`. We need to do this because customers were seeing
failures in cases where they were defining time servers only, which
results in `enabled: false` when configs get unmarshalled. Users wishing
to disable ntp altogether should now use `disabled: true`.
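
The zero-value problem can be illustrated with a small sketch. The struct
names are hypothetical, and JSON from the standard library is used here in
place of the YAML config format so the example stays self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oldTimeConfig models the previous shape: an omitted `enabled` decodes to false.
type oldTimeConfig struct {
	Enabled bool     `json:"enabled"`
	Servers []string `json:"servers"`
}

// newTimeConfig models the new shape: an omitted `disabled` decodes to false,
// which means NTP stays on unless the user opts out.
type newTimeConfig struct {
	Disabled bool     `json:"disabled"`
	Servers  []string `json:"servers"`
}

// ntpEnabledOld reports whether NTP is on under the old field semantics.
func ntpEnabledOld(doc []byte) (bool, error) {
	var cfg oldTimeConfig
	if err := json.Unmarshal(doc, &cfg); err != nil {
		return false, err
	}

	return cfg.Enabled, nil
}

// ntpEnabledNew reports whether NTP is on under the new field semantics.
func ntpEnabledNew(doc []byte) (bool, error) {
	var cfg newTimeConfig
	if err := json.Unmarshal(doc, &cfg); err != nil {
		return false, err
	}

	return !cfg.Disabled, nil
}

func main() {
	// A config that only defines time servers.
	doc := []byte(`{"servers": ["time.example.org"]}`)

	oldOn, err := ntpEnabledOld(doc)
	if err != nil {
		panic(err)
	}

	newOn, err := ntpEnabledNew(doc)
	if err != nil {
		panic(err)
	}

	fmt.Println("old semantics, NTP enabled:", oldOn)
	fmt.Println("new semantics, NTP enabled:", newOn)
}
```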

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-10-20 08:58:55 -07:00
Niklas Wik
eb9ee06dbc feat: add support for disabling time
Adds the capability to disable NTP when it cannot be provided in the deployed network.

Signed-off-by: Niklas Wik <niklas.wik@nokia.com>

add document update.

Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
2020-09-30 06:58:33 -07:00
Andrey Smirnov
98443cd0e9 fix: retry container image import
This bug is sometimes reproducible with QEMU/arm64, as it runs really
slow. Looks like multiple concurrent image unpacks sharing some layers
might fail unexpectedly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-28 08:58:47 -07:00
Andrey Smirnov
2085e9220c fix: change apid container image name to expected value
This is what happens when massive find-replace goes wrong...

Change should be cosmetic though, it doesn't affect operations.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-02 14:40:55 -07:00
Andrey Smirnov
f6ecf000c9 refactor: extract packages loadbalancer and retry
This removes in-tree packages in favor of:

* github.com/talos-systems/go-retry
* github.com/talos-systems/go-loadbalancer

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-02 13:46:22 -07:00
Andrew Rynhard
d4f103ffcb fix: pass config via stdin
In order to perform upgrades the way we would like, it is important that
we avoid any bind mounts into containers. This change ensures that all
system services get their config via stdin.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-08-20 15:26:13 -07:00
Andrey Smirnov
bddd4f1bf6 refactor: move external API packages into machinery/
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under `pkg/machinery` umbrella.

And `pkg/machinery` is published as Go module inside Talos repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-17 09:56:14 -07:00
Andrey Smirnov
9379cf9ee1 refactor: expose provision as public package
This change is only moving packages and updating import paths.

Goal: expose `internal/pkg/provision` as `pkg/provision` to enable other
projects to import Talos provisioning library.

As cluster checks are almost always required as part of provisioning
process, package `internal/pkg/cluster` was also made public as
`pkg/cluster`.

Other changes were direct dependencies discovered by `importvet` which
were updated.

Public packages (useful, general purpose packages with stable API):

* `internal/pkg/conditions` -> `pkg/conditions`
* `internal/pkg/tail` -> `pkg/tail`

Private packages (used only on provisioning library internally):

* `internal/pkg/inmemhttp` -> `pkg/provision/internal/inmemhttp`
* `internal/pkg/kernel/vmlinuz` -> `pkg/provision/internal/vmlinuz`
* `internal/pkg/cniutils` -> `pkg/provision/internal/cniutils`

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-12 05:12:05 -07:00
Andrey Smirnov
47608fb874 refactor: make pkg/config not rely on machined/../internal/runtime
This makes `pkg/config` directly importable from other projects.

There should be no functional changes.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-29 12:40:12 -07:00
Andrey Smirnov
41d5f7859a chore: update golangci-lint to 1.28.3
Fixes #2272

`gofumpt` is now included into `golangci-lint`, but not the
`gofumports`, so we keep using it as a separate binary, but we keep
versions in sync with `golangci-lint`.

This contains fixes from:

* `gofumpt` (automated, mostly around octal constants)
* `exhaustive` in `switch` statements
* `noctx` (adding context with default timeout to http requests)

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 08:05:42 -07:00
Andrey Smirnov
c54639e541 feat: implement server-side API for cluster health checks
This implements existing server-side health checks as defined in
`internal/pkg/cluster/checks` in Talos API.

Summary of changes:

* new `cluster` API

* `apid` now listens without auth on local file socket

* `cluster` API is for now implemented in `machined`, but we can move it
to the new service if we find it more appropriate

* `talosctl health` by default now does server-side health check

UX: `talosctl health` without arguments does health check for the
cluster if it has healthy K8s to return master/worker nodes. If needed,
node list can be overridden with flags.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-15 13:52:13 -07:00
Andrey Smirnov
ddbe9cfc2f fix: update timeouts on service startup to match boot timeout
There's a global timeout for all services to be up: it's 5 minutes. We
need to make sure each service startup takes less than that, otherwise
boot sequence is aborted and there's no way to see the error message for
each particular service.

Also propagate contexts correctly and set some default timeouts to make
sure API operations are not hanging forever.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 07:39:36 -07:00
Andrey Smirnov
81d1c2bfe7 chore: enable godot linter
Issues were fixed automatically.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-30 10:39:56 -07:00
Andrey Smirnov
4ad4511b38 chore: enable nolintlint linter
It makes sure our `//nolint:` directives are not redundant.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-30 07:39:19 -07:00
Andrew Rynhard
f21bd4071f fix: skip services when in container mode
This skips running udevd, udevd-trigger, and timed when running in
container mode. Since the containers run as privileged containers
these services will contend with the host's equivalent services.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-06-13 17:59:24 -07:00
Andrey Smirnov
a9766d31bc refactor: implement LoggingManager as central log flow processor
Using this `LoggingManager` all the log flows (reading and writing) were
refactored. Interface of `LoggingManager` should be now generic enough to
replace log handling with almost any implementation - log rotation,
sending logs to remote destination, keeping logs in memory, etc.

There should be no functional changes.

As part of changes, `follow.Reader` was implemented which makes
appending file feel like a stream. `file.NewChunker` was refactored to
use `follow.Reader` and `stream.NewChunker` to do the actual work. So
basically now we have only a single instance of chunker - stream
chunker, as everything is represented as a stream.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-10 14:30:36 -07:00
Andrew Rynhard
92a1ae4f03 fix: make services depend on timed
This adds a dependency on timed to services that depend on time.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-06-03 11:16:12 -07:00
Andrew Rynhard
49307d554d refactor: improve machined
This is a rewrite of machined. It addresses some of the limitations and
complexity in the implementation. This introduces the idea of a
controller. A controller is responsible for managing the runtime, the
sequencer, and a new state type introduced in this PR.

A few highlights are:

- no more event bus
- functional approach to tasks (no more types defined for each task)
  - the task function definition now offers a lot more context, like
    access to raw API requests, the current sequence, a logger, the new
    state interface, and the runtime interface.
- no more panics to handle reboots
- additional initialize and reboot sequences
- graceful gRPC server shutdown on critical errors
- config is now stored at install time to avoid having to download it at
  install time and at boot time
- upgrades now use the local config instead of downloading it
- the upgrade API's preserve option takes precedence over the config's
  install force option

Additionally, this pulls various packages in under machined to make the
code easier to navigate.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-28 08:20:55 -07:00
Andrew Rynhard
9776d7265f refactor: rename system-containerd and containerd services
This changes the name of the `system-containerd` service to `containerd`
and the `containerd` service to `cri`. This is so that the CRI service
is a little more generic. In the future, when we add support for other
CRIs, it will be better to refer to this service generically as "cri."

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-13 13:44:05 -07:00
Spencer Smith
17c8336d20 chore: add service state to postfunc
This PR will allow us to take conditional actions in the postfunc of our
services by passing the state of the service into the postfunc call. We
can use this to do conditional cleanups and finalizers on success.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-06 12:10:06 -05:00
Andrey Smirnov
a068acfbe4 feat: split routerd from apid
New service `routerd` performs exactly single task: based on incoming
API call service name, it routes the requests to the appropriate Talos
service (`networkd`, `osd`, etc.). Service `routerd` listens on a file
socket and routes requests to file sockets.

Service `apid` now does single task as well:

* it either fans out request to other `apid` services running on other
nodes and aggregates responses
* or it forwards requests to local `routerd` (when request destination
is local node)

Cons:

* one more proxying layer on request path

Pros:

* more clear service roles
* `routerd` is part of core Talos, services should register with it to
expose their API; no auth in the service (not exposed to the world)
* `apid` might be replaced with other implementation, it depends on TLS infra,
auth, etc.
* `apid` is better segregated from other Talos services (can only access
`routerd`, can't talk to other Talos services directly, so less exposure
in case of a bug)
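
Routing by service name can be sketched as follows. A gRPC full method name
has the form `/package.Service/Method`, so the service part is enough to pick
a backend. The `router` type, the service names, and the socket paths below
are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceName extracts "package.Service" from "/package.Service/Method".
func serviceName(fullMethod string) (string, error) {
	trimmed := strings.TrimPrefix(fullMethod, "/")

	idx := strings.Index(trimmed, "/")
	if idx <= 0 {
		return "", fmt.Errorf("malformed method name %q", fullMethod)
	}

	return trimmed[:idx], nil
}

// router maps a gRPC service name to the file socket of its backend.
type router struct {
	backends map[string]string
}

// register records which socket serves the given service.
func (r *router) register(service, socket string) {
	if r.backends == nil {
		r.backends = map[string]string{}
	}

	r.backends[service] = socket
}

// route returns the backend socket for a full method name.
func (r *router) route(fullMethod string) (string, error) {
	service, err := serviceName(fullMethod)
	if err != nil {
		return "", err
	}

	socket, ok := r.backends[service]
	if !ok {
		return "", fmt.Errorf("no backend registered for service %q", service)
	}

	return socket, nil
}

func main() {
	r := &router{}
	r.register("example.NetworkService", "/run/example/networkd.sock") // hypothetical

	fmt.Println(r.route("/example.NetworkService/Routes"))
	fmt.Println(r.route("/example.Unknown/Call"))
}
```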

This change is no-op to the end users.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-05 22:05:56 +03:00
Andrey Smirnov
01d696ed10 chore: update golangci-lint-1.23.3
`gomnd` disabled, as it complains about every number used in the code,
and `wsl` became much more thorough.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-04 08:56:39 -08:00
Andrew Rynhard
898cf01f0a refactor: unify generate type and machine type
We have been using two packages that define a config type and a machine
type, when really they are one and the same. This unifies the types down
to one set.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-01-10 16:46:28 -08:00
Andrey Smirnov
0081ac5fac refactor: extract Talos cluster provisioner as common code
This extracts Docker Talos cluster provisioner as common code
which might be shared between `osctl cluster` and integration-test.

There should be almost no functional changes.

As proof of concept, abstract cluster readiness checks were implemented
based on provisioned cluster state. It implements same checks as
`basic-integration.sh` in pure Go via Talos/K8s clients.

`conditions` package was promoted from machined-internal to
`internal/pkg` as it is used to run the checks.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-12-27 12:14:19 -08:00
Andrey Smirnov
5b7bea2471 feat: use grpc-proxy in apid
This replaces codegen version of apid proxying with
talos-systems/grpc-proxy based version. Proxying is transparent, it
doesn't require exact information about methods and response types. It
requires some common layout response to enhance it properly with node
metadata or errors.

There should be no significant changes to the API with the previous
version, but it's worth mentioning a few changes:

1. grpc.ClientConn is established just once per upstream (either local
service or remote apid instance).

2. When called without `-t` (`targets`), apid proxies immediately down
to local service skipping proxying to itself (as before), which results
in empty node metadata in response (before it had local node IP). Might
revert this later to proxy to itself (?).

3. Streaming APIs are now fully supported with multiple targets, but
message definition doesn't contain `ResponseMetadata`, so streaming APIs
are broken now with targets (needs a fix).

4. Errors are now returned as responses with `Error` field set in
`ResponseMetadata`, this requires client library update and `osctl` to
handle it properly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-29 22:57:25 +03:00
Brad Beam
4b3cc34ab0 fix: Disable support for proxy variables for apid.
Since APId/gRPC connections should never go through a proxy, we will explicitly exclude
these environment variables from apid.
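
One way to exclude such variables from a service environment is sketched
below. The helper name `withoutProxyVars` and the exact variable list
(`HTTP_PROXY`, `HTTPS_PROXY`, `NO_PROXY`, matched case-insensitively) are
assumptions; the commit does not spell them out here.

```go
package main

import (
	"fmt"
	"strings"
)

// withoutProxyVars returns env ("KEY=value" entries) with proxy variables removed.
func withoutProxyVars(env []string) []string {
	out := make([]string, 0, len(env))

	for _, kv := range env {
		name := kv
		if i := strings.Index(kv, "="); i >= 0 {
			name = kv[:i]
		}

		switch strings.ToUpper(name) {
		case "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY":
			continue
		}

		out = append(out, kv)
	}

	return out
}

func main() {
	env := []string{
		"PATH=/bin",
		"https_proxy=http://proxy.example:3128",
		"NO_PROXY=localhost",
	}

	fmt.Println(withoutProxyVars(env))
}
```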

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-05 10:34:33 -08:00
Andrew Rynhard
03a09c2294 refactor: rename Helper to Client
The name helper isn't very good. This renames it to Client. A new func
was also added, NewForConfig, that will allow for the creation of the helper
client from an arbitrary Kubernetes REST config.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-04 19:31:27 -08:00
Andrew Rynhard
e81b3d11a8 feat: output machined logs to /dev/kmsg and file
Since dmesg is not streamed, it becomes difficult to debug issues with
machined. This fixes that by setting up the logging of machined to go to
/dev/kmsg and to a log file.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 12:53:13 -08:00
Andrew Rynhard
41619f9016 feat: lock down container permissions
This removes the default privileged mode that all containers were
started with and adds the required capabilities on a per-service basis.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-10-29 11:50:37 -07:00
Andrey Smirnov
d3d011c8d2 chore: replace /* */ comments with // comments in license header
This fixes issues with `// +build` directives not being recognized in
source files.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-10-25 14:15:17 -07:00
Brad Beam
573cce8d18 feat: Add APId
This PR introduces APId. This service replaces the frontend functionality
previously provided by OSD. The main driver for this is twofold:

1. Create a single purpose application to expose the talos api

2. Make use of code generation to DRY api changes

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-10-25 13:02:33 -05:00