146 Commits

Author SHA1 Message Date
Andrey Smirnov
a2efa44663 chore: enable gci linter
Fixes were applied automatically.

Import ordering might be questionable, but it's strict:

* stdlib
* other packages
* same package imports

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 08:09:48 -08:00
Andrey Smirnov
bddd4f1bf6 refactor: move external API packages into machinery/
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under the `pkg/machinery` umbrella.

`pkg/machinery` is published as a Go module inside the Talos repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-17 09:56:14 -07:00
Andrey Smirnov
4ad4511b38 chore: enable nolintlint linter
It makes sure our `//nolint:` directives are not redundant.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-30 07:39:19 -07:00
Andrey Smirnov
0a4645fe80 feat: implement circular buffer for system logs
This replaces logging to files (followed via inotify) with a pure in-memory
circular buffer which grows on demand, capped at a specified maximum
capacity.

The concern with the previous approach was that logs on tmpfs were growing
without any bound, potentially consuming all of the node's memory.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-26 15:33:54 -07:00
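The idea of a buffer that grows on demand but never beyond a cap can be sketched roughly as follows. This is a minimal illustration and not the Talos implementation; the `Buffer` type, its fields, and the doubling growth policy are assumptions made for the example.

```go
package main

import "fmt"

// Buffer is a hypothetical in-memory circular buffer that grows on demand
// until it reaches maxCap, after which the oldest bytes are overwritten.
type Buffer struct {
	data   []byte
	maxCap int
	start  int // index of the oldest byte
	size   int // number of valid bytes
}

// NewBuffer allocates initialCap bytes up front and allows growth up to maxCap.
func NewBuffer(initialCap, maxCap int) *Buffer {
	if initialCap < 1 {
		initialCap = 1
	}
	if maxCap < initialCap {
		maxCap = initialCap
	}
	return &Buffer{data: make([]byte, initialCap), maxCap: maxCap}
}

// Write appends p, growing the buffer while allowed and wrapping around afterwards.
func (b *Buffer) Write(p []byte) (int, error) {
	for _, c := range p {
		if b.size == len(b.data) && len(b.data) < b.maxCap {
			b.grow()
		}
		if b.size == len(b.data) {
			// Full at maximum capacity: overwrite the oldest byte.
			b.data[b.start] = c
			b.start = (b.start + 1) % len(b.data)
			continue
		}
		b.data[(b.start+b.size)%len(b.data)] = c
		b.size++
	}
	return len(p), nil
}

// grow doubles the capacity (an assumed policy), never exceeding maxCap.
func (b *Buffer) grow() {
	newCap := len(b.data) * 2
	if newCap > b.maxCap {
		newCap = b.maxCap
	}
	next := make([]byte, newCap)
	copy(next, b.Bytes())
	b.data = next
	b.start = 0
}

// Bytes returns the buffered content, oldest byte first.
func (b *Buffer) Bytes() []byte {
	out := make([]byte, b.size)
	for i := 0; i < b.size; i++ {
		out[i] = b.data[(b.start+i)%len(b.data)]
	}
	return out
}

func main() {
	buf := NewBuffer(2, 4)
	buf.Write([]byte("abcdef"))
	fmt.Printf("%s\n", buf.Bytes()) // prints "cdef"
}
```

With a cap in place, memory use is bounded by `maxCap` regardless of how much is logged, at the cost of losing the oldest data.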
Andrew Rynhard
49307d554d refactor: improve machined
This is a rewrite of machined. It addresses some of the limitations and
complexity in the implementation. This introduces the idea of a
controller. A controller is responsible for managing the runtime, the
sequencer, and a new state type introduced in this PR.

A few highlights are:

- no more event bus
- functional approach to tasks (no more types defined for each task)
  - the task function definition now offers a lot more context, like
    access to raw API requests, the current sequence, a logger, the new
    state interface, and the runtime interface.
- no more panics to handle reboots
- additional initialize and reboot sequences
- graceful gRPC server shutdown on critical errors
- config is now stored at install time to avoid having to download it at
  install time and at boot time
- upgrades now use the local config instead of downloading it
- the upgrade API's preserve option takes precedence over the config's
  install force option

Additionally, this pulls various packages in under machined to make the
code easier to navigate.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-28 08:20:55 -07:00
Andrey Smirnov
e38cde9b48 chore: update upgrade tests for new version, split into two tracks
This updates upgrade tests to run two flows with 3+1 clusters:

1. 0.3 -> current (testing upgrade with partition wiping)
2. 0.4-alpha.7 -> current (testing upgrade without partition wiping,
boot-a/boot-b)

There is also a small upgrade with preserve enabled for a single-node cluster.

Provision tests are now split into two parallel tracks in Drone.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-24 15:30:00 -07:00
Spencer Smith
4d5c7e482c fix: ensure printing of panic message
This PR reworks the ordering of our recovery function. It will make sure
we actually show the user the recovery message prior to looking into
whether to auto-reboot.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-17 16:40:47 -04:00
Spencer Smith
853ce16df4 feat: respect panic kernel flag
This PR allows Talos to respect the panic=0 flag if users pass that in
their kernel args. Doing this makes it easier to catch kernel panics in
debug scenarios and allows the user to manually trigger a restart with
ctrl+alt+del when they're ready.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-10 13:21:34 -04:00
Andrew Rynhard
4efccd96ea refactor: rename virtual package to pseudo
This aligns the nomenclature for filesystems like /dev and /proc with
what is used in the kernel code.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-26 22:32:48 -08:00
Andrew Rynhard
e81b3d11a8 feat: output machined logs to /dev/kmsg and file
Since dmesg is not streamed, it becomes difficult to debug issues with
machined. This fixes that by setting up machined logging to go to
/dev/kmsg and to a log file.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 12:53:13 -08:00
Andrey Smirnov
d3d011c8d2 chore: replace /* */ comments with // comments in license header
This fixes issues with `// +build` directives not being recognized in
source files.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-10-25 14:15:17 -07:00
Andrew Rynhard
d430a37e46 refactor: use go 1.13 error wrapping
This removes the github.com/pkg/errors package in favor of the official
error wrapping in go 1.13.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-10-15 22:20:50 -07:00
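As a rough illustration of the Go 1.13 style: where code based on github.com/pkg/errors would call `errors.Wrap`, the standard library equivalent is `fmt.Errorf` with the `%w` verb, and callers inspect the chain with `errors.Is`. The `ErrNotFound` sentinel and the `loadConfig` and `isNotFound` helpers below are hypothetical names invented for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel error used only for this example.
var ErrNotFound = errors.New("not found")

// loadConfig is a hypothetical function that always fails, so that the
// wrapping behaviour can be shown.
func loadConfig(name string) error {
	// %w wraps ErrNotFound so that callers can still detect it.
	return fmt.Errorf("failed to load config %q: %w", name, ErrNotFound)
}

// isNotFound reports whether ErrNotFound is anywhere in err's wrap chain.
func isNotFound(err error) bool {
	return errors.Is(err, ErrNotFound)
}

func main() {
	err := loadConfig("userdata.yaml")
	fmt.Println(err)             // prints: failed to load config "userdata.yaml": not found
	fmt.Println(isNotFound(err)) // prints: true
}
```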
Andrey Smirnov
c2cb0f9778 chore: enable 'wsl' linter and fix all the issues
I wish there were fewer of them :)

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-10-10 01:16:29 +03:00
Andrew Rynhard
5ee554128e chore: move from gofumpt to gofumports
gofumports does everything that gofumpt does and additionally
formats imports. This change proposes the use of the `-local` flag so
that we can have imports separated in the following order:

- standard library
- third party
- Talos specific

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-12 07:49:12 -07:00
Andrew Rynhard
90c91807bd refactor: restructure the project layout
This change moves packages into more appropriate places.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-01 22:19:42 -07:00
Andrew Rynhard
ca35b85300 refactor: improve installation reliability
This change aims to make installations more unified and reliable. It
introduces the concept of a mountpoint manager that is capable of
mounting, unmounting, and moving a set of mountpoints in the correct
order.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-01 11:44:40 -07:00
Andrew Rynhard
e63c882b89 refactor: split machined into phases
This change aims to standardize the boot process. It introduces the
concept of a phase, which is composed of tasks. Phases are run serially, and
the tasks that make up a phase are run concurrently.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-29 12:40:03 -07:00
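The phase model described in this commit (phases in series, tasks within a phase concurrently) could look roughly like the sketch below. It is an illustration under stated assumptions, not the machined code: the `Task` and `Phase` types, `RunPhases`, `ErrTaskFailed`, and the phase names are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Task is a hypothetical unit of boot work.
type Task func() error

// Phase groups tasks that may run at the same time.
type Phase struct {
	Name  string
	Tasks []Task
}

// ErrTaskFailed is a placeholder error for demonstrating failures.
var ErrTaskFailed = errors.New("task failed")

// RunPhases runs phases one after another. Within a phase every task gets
// its own goroutine, and the next phase starts only after all of them return.
func RunPhases(phases []Phase) error {
	for _, phase := range phases {
		errs := make([]error, len(phase.Tasks))

		var wg sync.WaitGroup
		for i, task := range phase.Tasks {
			wg.Add(1)
			go func(i int, task Task) {
				defer wg.Done()
				errs[i] = task() // each goroutine writes to its own slot
			}(i, task)
		}
		wg.Wait()

		// Stop at the first failed task; later phases are not started.
		for _, err := range errs {
			if err != nil {
				return fmt.Errorf("phase %q failed: %w", phase.Name, err)
			}
		}
	}
	return nil
}

func main() {
	var order []string
	phases := []Phase{
		{Name: "mounts", Tasks: []Task{func() error { order = append(order, "mounts"); return nil }}},
		{Name: "services", Tasks: []Task{func() error { order = append(order, "services"); return nil }}},
	}
	err := RunPhases(phases)
	fmt.Println(err, order) // prints "<nil> [mounts services]"
}
```

Waiting on the `sync.WaitGroup` before moving on is what gives the serial ordering between phases while still letting the tasks inside one phase overlap.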
Andrew Rynhard
b7a9acbe88 refactor: move setup logic into machined
The responsibility of init should only be to mount the rootfs. This
change moves Talos specific logic into machined. This will allow us to
define a version of Talos in a single binary instead of split across
two. This will enable cleaner upgrades and helps make the codebase
easier to reason about.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-26 07:48:49 -07:00
Andrew Rynhard
0ec17e4169 feat: run rootfs from squashfs
This change moves the rootfs to a squashfs image.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-25 08:38:31 -07:00
Spencer Smith
c9f0dbbd4c feat: set default mtu for gce platform
This PR is needed so that the eth0 device will have the proper MTU when
coming online in Google Cloud.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-07-17 19:16:50 -04:00
Andrew Rynhard
8e8aae98dd feat: add machined
This commit splits our current init into init and machined.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-16 13:12:21 -07:00
Brad Beam
7adef1ea62 feat(init): Add azure as a supported platform
Update initramfs to interact with azure endpoints for userdata.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-07-16 12:59:53 -07:00
Brad Beam
e9482a4041 fix: Fix integration of extra kernel args
Switch from `StringSliceVar` to `StringArrayVar` to maintain commas
in kernel args.

Update entrypoint script to allow specifying extra kernel args.

Remove default console settings in kernel config.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-07-16 14:38:55 -05:00
Andrew Rynhard
1e9548d149 feat: use new pkgs for initramfs and rootfs
This brings in the newly compiled libraries and binaries from our new
pkg builds.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-15 10:32:29 -07:00
Andrew Rynhard
992c54c667 chore: improve network setup logging
Minor improvements to help when debugging.
Without this, if bringing up the default interface fails, the logs can
be misleading.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-13 15:52:49 -07:00
Andrew Rynhard
c40802b122 fix: return non-nil response in reset
The gRPC response will fail to be decoded because our reply is nil.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-13 15:52:25 -07:00
Andrew Rynhard
d197d5c6cd feat: add install flag for extra kernel args
In addition to adding a flag, this adds a field to the user data that allows
for extra kernel arguments to be specified.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-12 13:27:44 -07:00
Andrey Smirnov
82fe5b55e5 chore: make unit-tests use isolated instances of containerd
This makes each test launch its own isolated instance of containerd with
its own root/state directories and listening socket address. Each test
brings this instance up/down on its own.

Add options to override containerd address in the code (used only in the
tests).

Enable parallel go test runs once again.

P.S. I wish I could share that 'SetupSuite' phase across the tests, but
afaik there's no way in Go to share `_test.go` code across packages. If
we put it as normal package, this might pull in test dependencies (like
`testify`) into production code, which I don't like.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-07-10 19:46:32 +03:00
Brad Beam
551e24e268 fix(init): Dont log an error when context canceled
When we receive all the necessary files from trustd, we cancel the context. This
was treated as an error case and a message was logged accordingly. However,
this case is not really an error but rather a signal to stop trying to fetch a
given file.

Fixes #723

* Add basic FileSet tests.
* Minor refactor to FileSet call to allow easier testing.
* Add context canceled test for download.
* Add config tests and trustd coverage.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-07-06 14:54:02 -07:00
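One way to distinguish a deliberate cancellation from a genuine failure is to check the returned error against `context.Canceled`. The sketch below is only an illustration of that check; `download`, `runCanceled`, and `isRealFailure` are hypothetical names, and the retry loop is an assumption rather than the actual trustd client logic.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// download is a hypothetical stand-in for fetching one file: it retries the
// attempt until it succeeds or the context is done.
func download(ctx context.Context, attempt func() error) error {
	for {
		if err := attempt(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Millisecond):
		}
	}
}

// isRealFailure reports whether err deserves an error log entry.
// Cancellation is treated as a signal to stop, not as a failure.
func isRealFailure(err error) bool {
	return err != nil && !errors.Is(err, context.Canceled)
}

// runCanceled simulates the case where all files have already arrived and
// the context is canceled while a download is still retrying.
func runCanceled() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return download(ctx, func() error { return errors.New("not ready") })
}

func main() {
	err := runCanceled()
	fmt.Println(err)                // prints "context canceled"
	fmt.Println(isRealFailure(err)) // prints "false"
}
```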
Brad Beam
c194621e56 feat(initramfs): Add kernel arg for default interface
Should allow us to handle edge cases where eth0 is not the primary interface

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-07-05 12:17:18 -07:00
Andrew Rynhard
5d8ee0a3a5 fix: use existing logic to perform reset
This PR moves the reset API to the init API definition.
It leverages the same code we use for upgrades.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-04 18:26:14 -07:00
Brad Beam
40d3484469 refactor: Userdata.download supports functional args (#819)
This also adds in support for downloading userdata that is initially encoded in
base64.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-07-03 10:05:20 -05:00
Andrey Smirnov
0662af19d1 chore: seed math.rand PRNG on startup in every service (#801)
This is important, as otherwise `math/rand` outputs the same predictable
sequence on every run.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-28 11:03:15 -07:00
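The predictability mentioned here is easy to reproduce: two generators created from the same seed yield identical sequences. The sketch below uses explicit `rand.New(rand.NewSource(...))` generators; `sequence` and `equal` are hypothetical helpers. Note that since Go 1.20 the global `math/rand` generator is seeded automatically, so the concern applies to older Go versions and to explicitly seeded sources.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// sequence returns n pseudo-random numbers in [0, 100) from a generator
// created with the given seed.
func sequence(seed int64, n int) []int {
	r := rand.New(rand.NewSource(seed))
	out := make([]int, n)
	for i := range out {
		out[i] = r.Intn(100)
	}
	return out
}

// equal reports whether two slices hold the same values in the same order.
func equal(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	// A fixed seed gives the same sequence every time.
	fmt.Println(equal(sequence(1, 5), sequence(1, 5))) // prints "true"

	// Seeding from the clock at startup makes the output vary between runs.
	fmt.Println(sequence(time.Now().UnixNano(), 5))
}
```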
Andrey Smirnov
6b0a66b514 fix(init): secret data at rest encryption key should be truly random (#797)
First, use a cryptographically secure random number generator.

Second, generate 32 random bytes without limiting them to any range, as
they're going to be base64-encoded anyway.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-28 17:57:51 +03:00
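A minimal sketch of the approach described above, assuming standard base64 encoding: read 32 bytes from `crypto/rand` and encode them as-is. The function name `newEncryptionKey` is hypothetical and not taken from the Talos source.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// newEncryptionKey is a hypothetical helper that returns 32 random bytes
// from the cryptographically secure generator, base64-encoded.
func newEncryptionKey() (string, error) {
	key := make([]byte, 32)
	// Every byte value is allowed; the encoding makes the result printable.
	if _, err := rand.Read(key); err != nil {
		return "", fmt.Errorf("failed to read random bytes: %w", err)
	}
	return base64.StdEncoding.EncodeToString(key), nil
}

func main() {
	key, err := newEncryptionKey()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(key)) // prints "44": 32 bytes always encode to 44 base64 characters
}
```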
Andrew Rynhard
fde6b4b6b8 feat: enable debug in udevd service (#783)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-06-26 08:17:13 -07:00
Andrey Smirnov
6d5ee0ca80 feat(init): unify filesystem walkers for ls/cp APIs (#779)
This unifies low-level filesystem walker code for `ls` and `cp`.

New features:

* `ls` now reports relative filenames
* `ls` now prints symlink destination for symlinks
* `cp` now always properly reports errors from the API
* `cp` now reports all the errors back to the client

Example for `ls`:

```
osctl-linux-amd64 --talosconfig talosconfig ls -l /var
MODE          SIZE(B)   LASTMOD       NAME
drwxr-xr-x    4096      Jun 26 2019   .
Lrwxrwxrwx    4         Jun 25 2019   etc -> /etc
drwxr-xr-x    4096      Jun 26 2019   lib
drwxr-xr-x    4096      Jun 21 2019   libexec
drwxr-xr-x    4096      Jun 26 2019   log
drwxr-xr-x    4096      Jun 21 2019   mail
drwxr-xr-x    4096      Jun 26 2019   opt
Lrwxrwxrwx    6         Jun 21 2019   run -> ../run
drwxr-xr-x    4096      Jun 21 2019   spool
dtrwxrwxrwx   4096      Jun 21 2019   tmp
-rw-------    14979     Jun 26 2019   userdata.yaml
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-26 17:43:09 +03:00
Andrew Rynhard
85afe4f828 feat: use eudev for udevd (#780)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-06-25 19:25:57 -07:00
Andrew Rynhard
ebc725afa6 feat: add support for upgrading init nodes (#761)
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-06-24 15:25:32 -07:00
Brad Beam
d935ee0b33 fix(init): Add modules mountpoint for kube services (#767)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-06-24 12:38:57 -07:00
Andrey Smirnov
76071abbb8 feat(init): move 'ls' API to init from osd (#755)
Service `osd` doesn't have access to rootfs, as it is running in a
container, so move API to `init` which has unconstrained access to
rootfs. (This is in line with another API, `osctl cp`).

Fixes: #752

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-21 22:29:39 +03:00
Andrey Smirnov
9ed45f7090 feat(osctl): implement 'cp' to copy files out of the Talos node (#740)
The actual API is implemented in `init`, as it has access to the root
filesystem. `osd` proxies the API back to `init` with some tricks to support
gRPC streaming.

Given some absolute path, `init` produces and streams back a .tar.gz
archive with the filesystem contents.

`osctl cp` works in two modes. The first mode streams data to stdout, so
that we can do e.g.: `osctl cp /etc - | tar tz`. The second mode extracts
the archive to the specified location, dropping ownership info and adjusting
permissions a bit. Timestamps are not preserved.

If a full dump with owner/permissions is required, it's better to stream
the data to `tar xz`; for a quick and dirty look into filesystem contents
as an unprivileged user, in-place extraction is easier.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-20 17:02:58 -07:00
Andrey Smirnov
854395517f chore: improve test stability for containerd tests (#733)
This should be a no-op, but it lets the tests depend less on timing for
concurrent operations.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-15 00:00:06 +03:00
Brad Beam
0d5f521291 feat(init): Add support for kubeadm reset during upgrade (#714)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-06-06 22:41:22 -05:00
Brad Beam
d68e303f27 feat(init): Add service stop api (#708)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-06-05 14:49:03 -05:00
Andrey Smirnov
7a4a677f04 fix(init): use 127.0.0.1 IP in healthchecks to avoid resolver weirdness (#715)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-05 19:30:28 +03:00
Brad Beam
1a01440482 feat(init): Add support for stopping individual services (#706)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-06-04 15:51:30 -05:00
Andrey Smirnov
bf6ef7043c chore: address flaky tests instability (#713)
For #711, this should be a complete fix: waiting for the container to be
started.

For #712, this is more of a workaround: adjusting timeouts to make
the failure less likely. The idea of the test is that the health check
should be aborted on timeout (1ms), while the health check succeeds if not
aborted within 50ms. Before the fix the values were 1ms/10ms, but under
concurrency there was still a chance that the goroutine exited successfully
after 10ms while the 1ms context deadline had not yet been reached.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-04 23:22:05 +03:00
Andrey Smirnov
d9f4f378c2 fix(osd): consistent container ids in stats, ps and reset (#707)
Fixes: #689, #690

Refactor container inspection code into a package of its own with some
rudimentary tests. Use this package consistently in osd commands dealing
with containers.

Improvements for the next PRs:

* implement API to fetch info about container by ID (to avoid fetching
full list)

* handle and display errors on the client side, not in the server
log

* more tests, including k8s containers (how can we do that?)

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-06-03 20:51:01 -05:00
Brad Beam
8537e7eeb6 feat(init): Add support for control plane join config (#700)
Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-05-31 12:21:00 -05:00
Andrey Smirnov
dc79b0ad05 refactor(init): use 'switch' instead of long condition (#701)
Based on feedback from #699

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-05-31 17:39:38 +03:00