113 Commits

Author SHA1 Message Date
Andrey Smirnov
20f4d77d39 fix(init): move directory creation to kubeadm pre-func (#688)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-28 09:51:38 -07:00
Andrey Smirnov
40a5b7c177
feat(init): expose networkd as goroutine-based server (#682)
This adds generic goroutine runner which simply wraps service as process
goroutine. It supports log redirection and basic panic handling.

DHCP-related part of the network package was slightly adjusted to run as
service with logging updates (to redirect logs to a file) and context
canceling.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-27 17:07:28 +03:00
Brad Beam
d8249c8779
refactor(init): Allow kubeadm init on controlplane (#658)
* refactor(init): Allow kubeadm init on controlplane

This shifts the cluster formation from init(bootstrap) and join(control plane)
to init(control plane).

This makes use of the previously implemented initToken to provide a TTL for
cluster initialization to take place and allows us to mostly treat all control
plane nodes equal. This also sets up the path for us to handle master upgrades
and not be concerned with odd behavior when upgrading the previously defined
init node.

To facilitate kubeadm init across all control plane nodes, we make use of the
initToken to run `kubeadm init phase certs` command to generate any missing
certificates once. All other control plane nodes will attempt to sync the
necessary certs/files via all defined trustd endpoints and begin the startup
process.

* feat(init): Add service runner context to PreFunc

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-24 16:05:49 -05:00
Andrey Smirnov
a0188aff73
feat(init): implement service dependencies, correct start and shutdown (#680)
This PR introduces dependencies between the services. Now each service
has two virtual events associated with it: 'up' (running and healthy)
and 'down' (finished or failed). These events are used to establish
correct order via conditions abstraction.

Service image unpacking was moved into 'pre' stage simplifying
`init/main.go`, service images are now closer to the code which runs the
service itself.

Step 'pre' now runs after 'wait' step, and service dependencies are now
mixed into other conditions of 'wait' step on startup.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-24 19:17:52 +03:00
Andrey Smirnov
06bff97a3f
refactor: change conditions to be interface, add descriptions (#677)
Conditions are now implemented as interface with two methods: `Wait` for
condition to be true (cancelable via context) and 'String' which
describes what condition is waiting for.

Generic 'WaitForAll' was implemented to wait for multiple conditions at
once.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-21 21:25:08 +03:00
Brad Beam
b0dab6e021
fix(osd): Sanitize request.id for log streams (#673)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-20 14:46:05 -05:00
Brad Beam
a64de7ed51
feat(init): Add initToken parameter to userdata (#664)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-20 14:23:38 -05:00
Andrey Smirnov
204873e257
refactor: fix filechunker not exiting on context cancel (#668)
This started as a simple unit-test for file chunker, but the first test
hung immediately, so I started looking into the code.

One problem was that when entering inotify() code, ctx cancel wasn't
considered. Another problem is that the fsnotify remove event was never
triggered, but I saw that with unit-test later.

Small nit was that inotify() was initialized every time we got to EOF,
which is not efficient for "follow" mode.

So I moved inotify into the main loop, and plugged context cancel watch
into the place when chunk is delivered. Chunker code is supposed to
block in two places: when it tries to deliver next chunk (as client
might be slow to receive buffers) or when there's no new data (on
inotify). So it makes sense to assert context canceled condition in both
cases.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-20 18:00:40 +03:00
Andrey Smirnov
54168cef1c feat(init): implement healthchecks for the services (#667)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-18 08:44:56 -07:00
Andrey Smirnov
75b2ce7fd2
feat(init): implement services list API and osctl service CLI (#662)
This returns list of all the services registered, with their current
status, past events, health state, etc.

New CLI is `osctl service [<id>]`: without `<id>` it prints list of all
the services, with specific `<id>` it provides details for a service.

I decided to create "parallel" data structures in protobuf as Go
structures don't map nicely onto what protoc generates: pointers vs.
values, additional fields like mutexes, etc. Probably there's a better
approach, I'm open for it.

For CLI, I tried to keep CLI stuff in `cmd/` package, and I also created
simple wrapper to remove duplicated code which sets up client for each
command.

Examples:

```
$ osctl service
SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
containerd   Running   OK       21s ago       Health check successful
kubeadm      Running   ?        2s ago        Started task kubeadm (PID 280) for container kubeadm
kubelet      Running   ?        0s ago        Started task kubelet (PID 383) for container kubelet
ntpd         Running   ?        14s ago       Started task ntpd (PID 129) for container ntpd
osd          Running   ?        14s ago       Started task osd (PID 126) for container osd
proxyd       Waiting   ?        14s ago       Waiting for conditions
trustd       Running   ?        14s ago       Started task trustd (PID 125) for container trustd
udevd        Running   ?        14s ago       Started task udevd (PID 130) for container udevd
```

```
$ osctl service proxyd
ID       proxyd
STATE    Running
HEALTH   ?
EVENTS   [Preparing]: Running pre state (22s ago)
         [Waiting]: Waiting for conditions (22s ago)
         [Preparing]: Creating service runner (6s ago)
         [Running]: Started task proxyd (PID 461) for container proxyd (6s ago)
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-17 18:01:12 +03:00
Brad Beam
dd3d3fac9c
fix(osd): Read talos service logs from file (#663)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-16 20:05:23 -05:00
Andrey Smirnov
d034987f71 fix(init): fix containerd healthcheck leaking memory in init/containerd (#661)
As containerd client API wasn't closed after use, connection was leaking
every time healthcheck was run.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-16 18:35:14 -05:00
Andrew Rynhard
98d76d8198
fix(init): mount /sys into kubelet container (#660)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-05-16 14:42:34 -07:00
Andrew Rynhard
92fb18e3ea
feat: use github.com/mdlayher/kobject (#653)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-05-15 11:18:08 -07:00
Andrey Smirnov
1dde9f8cc0 feat(init): implement health checks for services (#656)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-15 09:30:35 -07:00
Andrey Smirnov
4bf649f14c chore: workaround flaky tests (#651)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-13 15:00:59 -07:00
Brad Beam
0b33280915
feat(init): Add upgrade endpoint (#623)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-13 15:15:25 -05:00
Andrey Smirnov
3dc5606053 fix(init): don't close ACPI listen handle too early (#647)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-13 07:50:21 -07:00
Andrew Rynhard
ff58642d93
feat: improve package for /proc/cmdline parsing and management (#645)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-05-12 09:05:29 -07:00
Brad Beam
a6989db1d1
fix(osd): Use correct context in stats endpoint (#644)
Without this, we never set the namespace for the context, which prevents it from functioning at all.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-11 14:26:23 -05:00
Andrey Smirnov
995f4c6841 feat(init): core health check package (#632)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-09 10:03:05 -07:00
Andrew Rynhard
5160cbc5b6
feat: remove EC2 verification step (#631)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-05-09 08:10:23 -07:00
Andrew Rynhard
86e17c91fb
feat: update partition layout to accommodate upgrades (#621)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-05-07 13:31:34 -07:00
Brad Beam
2c0ec43a0b
feat: Add additional kubernetes certs (#619)
Add support for supplying all of the necessary CA cert and key pairs for
kubeadm use.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-07 11:30:10 -05:00
Andrew Rynhard
7676a31b20
chore: move osinstall to cmd (#620)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-05-07 06:41:03 -07:00
Andrew Rynhard
00eb0658aa
feat: add support for ISO based installations (#606)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-05-02 21:30:06 -07:00
Andrew Rynhard
e4c5385f3d
fix(init): start udevd with parent cgroup devices (#605)
WithParentCgroupDevices uses the default cgroup setup to inherit the container's parent cgroup's allowed and denied devices.
Without this, we get 'operation not permitted' when attempting to read the block devices.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-30 19:03:56 -07:00
Andrew Rynhard
f045b10dd4
fix: add support for trustd username and password auth back in (#604)
We should still support username and password for backwards compatibility.
This also sets us up for implementing auth for users using something like LDAP in the future.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-30 17:50:30 -07:00
Andrew Rynhard
0df1d9ca70
feat(init): run udevd as a container (#601)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-30 08:48:48 -07:00
Tim Jones
4341411c16 refactor(init): add helper for getting specific kernel parameters (#596)
Signed-off-by: Tim Jones <timniverse@gmail.com>
2019-04-29 10:58:51 -07:00
Tim Jones
7127998f56 feat(init): Add support for hostname kernel parameter (#591)
Signed-off-by: Tim Jones <timniverse@gmail.com>
2019-04-29 09:50:43 -07:00
Andrew Rynhard
020d11d4ba
feat(init): enforce KSPP kernel parameters (#585)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-28 13:12:07 -07:00
Andrew Rynhard
ea99788ef1
feat(trustd): use a token instead of username and password (#586)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-28 12:18:56 -07:00
Andrew Rynhard
9b4fec0fa8
feat(osctl): add ability to create docker based clusters (#584)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-28 12:06:03 -07:00
Andrew Rynhard
2a4b56d4a1
feat(init): load only the images required by the node type (#582)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-26 20:13:48 -07:00
Andrey Smirnov
ab2917e833
feat(init): implement init gRPC API, forward reboot to init (#579)
This implements insecure over-file-socket gRPC API for init with two
first simplest APIs: reboot and shutdown (poweroff).

File socket is mounted only to `osd` service, so it is the only service
which can access init API. Osd forwards reboot/shutdown already
implemented APIs to init which actually executes these.

This enables graceful shutdown/reboot with service shutdown, sync, etc.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-04-26 23:04:24 +03:00
Andrew Rynhard
fc05224b4f
feat: add shutdown command (#577)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-26 08:53:12 -07:00
Andrew Rynhard
a8fa1f5cd1
feat(osctl): add df command (#569)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-26 08:24:31 -07:00
Andrey Smirnov
505b5022c4
feat(init): implement graceful shutdown of 'init' (#562)
Most crucial changes in `init/main.go`: on shutdown now Talos tries
to stop gracefully all the services. All the shutdown paths are unified,
including poweroff, reboot and panic handling on startup.

While I was at it, I also fixed bug with containers failing to start
when old snapshot is still around.

Service lifecycle is wrapped with `ServiceRunner` object now which
handles state transitions and captures events related to state changes.
Every change goes to the log as well.

There's no way to capture service state yet, but that is planned to be
implemented as RPC API for `init` which is exposed via `osd` to `osctl`.

Future steps:

1. Implement service dependencies for correct startup order and
shutdown order.

2. Implement service health, so that we can say "start trustd when
containerd is up and healthy".

3. Implement gRPC API for init, expose via osd (service status, restart,
poweroff, ...)

4. Implement 'String()' for conditions, so that we can see what service
is waiting on right now.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-04-26 16:53:19 +03:00
Brad Beam
3f358b12ae
feat(osctl): Add osctl top (#560)
Also adds pkg/proc as the backing package for top data

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-04-23 21:25:41 -05:00
Andrey Smirnov
a858cb4986
refactor: extract 'restart' piece of the runners into wrapper runner (#559)
This changes `runner.Runner` API to support more methods to allow for
containerd runner to create container object only once, and start/stop
tasks to implement restarts.

New API: `Open()` (initialize), `Run()` (run once until exits), `Stop()`
(stop running instance), `Close()` (free resource, no longer available
for new `Run()`).

So the sequence might be: `Open`, `Run`, `Stop`, `Run`, `Stop`, `Close`.

Process and containerd runners were updated for the new API, and
'restart' part was removed, now both runners only run the task once.

Restart piece was implemented in an abstract way for any wrapped
`runner.Runner` in the `runner/restart` package. Restart supports three
restart policies: `Once`, `UntilSuccess` and `Forever`.

Service API was changed slightly to return the `runner.Runner`
interface, and `system.Services` now handles running the service.

For all the services, code was adjusted to either return runner (run
once), or was wrapped with `restart` runner to provide restart policy.
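
A rough sketch of the shape of this API, assuming argument-free methods (the real signatures may take additional parameters). The `Restart` wrapper and the `flaky` runner are hypothetical and only illustrate the three policies:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Runner follows the lifecycle described above:
// Open, then any number of Run/Stop pairs, then Close.
type Runner interface {
	Open() error
	Run() error
	Stop() error
	Close() error
}

// Policy selects when a finished task is run again.
type Policy int

const (
	Once Policy = iota
	UntilSuccess
	Forever
)

// Restart wraps any Runner and re-runs it according to the policy.
// Open and Close are passed through by the embedded Runner.
type Restart struct {
	Runner
	policy   Policy
	stop     chan struct{}
	stopOnce sync.Once
}

func NewRestart(r Runner, p Policy) *Restart {
	return &Restart{Runner: r, policy: p, stop: make(chan struct{})}
}

func (r *Restart) Run() error {
	for {
		err := r.Runner.Run()
		if r.policy == Once || (r.policy == UntilSuccess && err == nil) {
			return err
		}

		select {
		case <-r.stop:
			return nil
		default: // a real wrapper would pause here before restarting
		}
	}
}

func (r *Restart) Stop() error {
	r.stopOnce.Do(func() { close(r.stop) })
	return r.Runner.Stop()
}

// flaky fails a fixed number of times before succeeding.
type flaky struct {
	failures int
	runs     int
}

func (f *flaky) Open() error  { return nil }
func (f *flaky) Stop() error  { return nil }
func (f *flaky) Close() error { return nil }
func (f *flaky) Run() error {
	f.runs++
	if f.runs <= f.failures {
		return errors.New("task exited with error")
	}
	return nil
}

func main() {
	task := &flaky{failures: 2}
	r := NewRestart(task, UntilSuccess)

	fmt.Println(r.Open(), r.Run(), r.Stop(), r.Close())
	fmt.Println("runs:", task.runs)
}
```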

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-04-23 01:25:26 +03:00
Brad Beam
271d28244b fix(osd): Fix k8s.io namespace logs (#557)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-04-18 08:49:33 -07:00
Andrey Smirnov
7da7c8c2ff refactor: add stub unit-tests to non-trivial Go packages (#556)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-04-17 13:25:22 -07:00
Brad Beam
46bdf2371c
fix(osd): Fix osctl ps output (#554)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-04-17 08:51:19 -05:00
Andrey Smirnov
7cbc177a59
refactor: add unit-test for containerd image import (#553)
Just because we can easily do that, this also covers prior work
on converting panics to errors: #518

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-04-17 00:31:33 +03:00
Andrey Smirnov
d29e27ee33 refactor: containerd runner refactoring and unit-tests (#551)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-04-16 13:56:52 -07:00
Andrew Rynhard
a817e744c7
feat: remove blockd (#536)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-14 16:57:37 -07:00
Andrew Rynhard
47d2bbd318
feat: log the xfs_growfs of the data partition (#537)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-14 15:20:56 -07:00
Andrew Rynhard
2faf36bd67
feat: add support for extra disk management (#524)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-04-13 22:41:03 -07:00
Andrey Smirnov
9f12352433
chore: clean up outer variable used in inner func (#519)
Inner function in goroutine was using `err` (return variable) of the
outer function.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-04-10 23:56:15 +03:00