Commit Graph

50 Commits

Author SHA1 Message Date
Andrey Smirnov
a2efa44663 chore: enable gci linter
Fixes were applied automatically.

Import ordering might be questionable, but it's strict:

* stdlib
* other packages
* same package imports

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-09 08:09:48 -08:00
Andrey Smirnov
93f6586900 fix: don't abort reboot sequence on bootloader meta failure
If bootloader meta failed to be found/to be reverted, don't abort the
whole sequence of actions leading to reboot, otherwise control returns
back and machined tries to run next sequence in failed state.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-09-07 13:59:22 -07:00
Andrew Rynhard
1a4059a553 feat: add grub bootloader
This moves to using grub instead of syslinux.

BREAKING CHANGE: Single node upgrades will fail in this change. This
will also break the A/B fallback setup since this version introduces
an entirely new partition scheme, that any fallback will not know about.
We plan on addressing these issues in a follow up change.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-09-01 12:06:43 -07:00
Andrey Smirnov
bddd4f1bf6 refactor: move external API packages into machinery/
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under `pkg/machinery` umbrella.

And `pkg/machinery` is published as Go module inside Talos repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-17 09:56:14 -07:00
Andrey Smirnov
74413b1393 fix: ignore sequence lock errors in machined
This prevents reboots when some actions triggers sequence while another
sequence is still running.

Fixes #2209

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-20 14:36:06 -07:00
Andrey Smirnov
4cc074cdba feat: implement API access to event history
1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids
are sortable and unique enough. Xids also encode event publishing
time with a second precision.

2. Add three ways to look back into event history: based on number of
events, on time and ID. Lookup via ID might be used to restart event
polling in case of broken API connection from the same moment.

3. Reimplement core event buffer with positions which are always
incremented instead of generation+index, this implementation is much
more simple (idea from circular buffer).

4. By default, Events API works the same - it shows no history and
starts streaming new events only.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 10:54:50 -07:00
Andrey Smirnov
fb585902a3 chore: replace underlying event implementation with single slice
The idea here is to use single slice of events for all the consumers.
Each consumer keeps its own position within the stream, and stream is
structured as circular buffer to avoid using too much memory.

This implementation allows for one more future: looking "back" into the
event history and returning past event starting with some offset (e.g.
timestamp, event ID, etc.). This feature is not implemented yet.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-05-20 11:12:43 -07:00
Andrew Rynhard
a733a9714f fix: run machined API as a service
In recent refactoring the machined API service was changed to run outside
of the service framework. This brings it back as a service.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-05-15 17:27:19 -07:00
Andrew Rynhard
1902519727 feat: add events API
This adds an event stream to the runtime, and the ability to stream
events via the API.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-05-13 12:18:10 -07:00
Andrew Rynhard
83062f37bd fix: write machined RPC logs to file
This ensures that the machined RPC logs are written to disk so that users
can retrieve them via the log API.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-05-07 14:17:59 -07:00
Andrew Rynhard
49307d554d refactor: improve machined
This is a rewrite of machined. It addresses some of the limitations and
complexity in the implementation. This introduces the idea of a
controller. A controller is responsible for managing the runtime, the
sequencer, and a new state type introduced in this PR.

A few highlights are:

- no more event bus
- functional approach to tasks (no more types defined for each task)
  - the task function definition now offers a lot more context, like
    access to raw API requests, the current sequence, a logger, the new
    state interface, and the runtime interface.
- no more panics to handle reboots
- additional initialize and reboot sequences
- graceful gRPC server shutdown on critical errors
- config is now stored at install time to avoid having to download it at
  install time and at boot time
- upgrades now use the local config instead of downloading it
- the upgrade API's preserve option takes precedence over the config's
  install force option

Additionally, this pulls various packes in under machined to make the
code easier to navigate.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-28 08:20:55 -07:00
Andrew Rynhard
a10acd592a chore: address random CI nits
This PR does the following:

- updates the conform config
- cleans up conform scopes
- moves slash commands to the talos-bot
- adds a check list to the pull request template
- disables codecov comments
- uses `BOT_TOKEN` so all actions are performed as the talos-bot user
- adds a `make conformance` target to make it easy for contributors to
check their commit before creating a PR
- bumps golangci-lint to v1.24.0

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-13 13:01:14 -07:00
Andrew Rynhard
83d0851563 fix: delete tag on revert with empty label
We need to ensure that we delete the upgrade tag from the ADV even if
the tag value is an empty string.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-30 15:15:22 -07:00
Andrew Rynhard
47327eca09 fix: move empty label check
We should always set the fallback tag on an upgrade, and only revert if
the tag value is not an empty string.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-30 13:42:08 -07:00
Andrew Rynhard
6fe5fed6f9 fix: make upgrades work with UEFI
Since the `--once` option of `extlinux` seems to only work with BIOS, we
needed to change to remove any reliance on this option. Instead of
booting the upgraded version once, and then making it the default after
a successful boot, we now make it the default, and then revert on any
boot error.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-26 13:34:00 -07:00
Andrew Rynhard
69fa63a7b2 refactor: perform upgrade upon reboot
This PR introduces a new strategy for upgrades. Instead of attempting to
zap the partition table, create a new one, and then format the
partitions, this change will only update the `vmlinuz`, and
`initramfs.xz` being used to boot. It introduces an A/B style upgrade
process, which will allow for easy rollbacks. One deviation from our
original intention with upgrades is that this change does not completely
reset a node. It falls just short of that and does not reset the
partition table. This forces us to keep the current partition scheme in
mind as we make changes in the future, because an upgrade assumes a
specific partition scheme. We can improve upgrades further in the
future, but this will at least make them more dependable. Finally, one
more feature in this PR is the ability to keep state. This enables
single node clusters to upgrade since we keep the etcd data around.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-20 17:32:18 -07:00
Andrew Rynhard
fe7847e0b8 feat: add reboot flag to reset API
This adds the ability to automatically reboot a machine after a reboot.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-02-19 05:10:58 -08:00
Spencer Smith
8092362098 fix: fix reset command
This PR will fix the reset command to actually wipe the system disk as
expected.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-02-18 16:18:43 -05:00
Andrey Smirnov
565c747582 fix: install sequence stuck on event bus
machined's main.go waits for boot sequence to finish, while metal
platform initializer tries to send a message to the event bus without
any listeners, so this is pure deadlock.

Resolve that by panicking from initializer, this aborts phase and
sequence, and leads to reboot on panic. Not really clean as it leaves
scary stacktraces in the dmesg, but it works. Cleanup might be done by
introducing error value for reboot, and ignoring it when printing the
errors.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-01-21 16:28:00 -06:00
Andrew Rynhard
5b5d171c07 fix: block when handling bus event
If we don't block, there is the potential for multiple shutdown,
reboot, and upgrade requests to be processed.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-01-20 09:19:50 -08:00
Brad Beam
f722adb865 fix(machined): Add additional defaults for http transport
Followup from #1680.

This also moves the setting from phases to machine.init to set it earlier in
the boot sequence to ensure that we get the defaults set properly from the
start and set it only once.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-12-30 08:13:22 -08:00
Andrew Rynhard
031c65be47 feat: add IMA policy
This creates an IMA policy at boot. It uses the default TCB policy with
a dont_measure rule for XFS.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-26 16:49:48 -08:00
Andrey Smirnov
d3d011c8d2 chore: replace /* */ comments with // comments in license header
This fixes issues with `// +build` directives not being recognized in
source files.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-10-25 14:15:17 -07:00
Andrew Rynhard
d430a37e46 refactor: use go 1.13 error wrapping
This removes the github.com/pkg/errors package in favor of the official
error wrapping in go 1.13.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-10-15 22:20:50 -07:00
Andrew Rynhard
94c28657d3 feat: add config validation task
This should provide a better UX around misconfigured Talos nodes. It is
just the start of something we can expand on.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-10-15 20:26:26 -07:00
Andrey Smirnov
c2cb0f9778 chore: enable 'wsl' linter and fix all the issues
I wish there were less of them :)

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-10-10 01:16:29 +03:00
Andrew Rynhard
89789fe0a6 fix: catch panics in boot go routine
The builtin recover func is scoped to the current go routine, and since
our boot sequence is kicked off in its' own go routine, we were failing
to recover from panics.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-10-08 19:57:39 -07:00
Andrew Rynhard
8f10647d3f fix: set extra kernel args for all platforms
This change ensures that the installer has access to the machine config
so that it can set the extra kernel arguments when installing.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-23 11:50:13 -07:00
Andrew Rynhard
3a92537a30 refactor: rename RPCs
The following RPCs have been renamed:

- ps to containers
- top to processes
- df to mounts

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-20 14:33:51 -07:00
Andrew Rynhard
6efd6fbe08 chore: move gRPC API to public
In order for other projects to make use of our APIs, they must not
reside underneath the internal directory. This moves the protobuf
definitions to a top-level "api" directory and scopes them according to
their domain. This change also removes generated code from the gitignore
file so that users don't have to generate the code themseleves.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-19 08:55:13 -07:00
Andrew Rynhard
2955428850 chore: format code with gofumpt
The gofumpt linter is a stricter drop-in replacement for gofmt. The
rules are ones that I strongly agree with and I think it would be better
if we added this linter instead of nit picking every PR.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-09-11 11:03:29 -07:00
Andrey Smirnov
c0698c1815 chore(machined): implement process reaper for PID 1 machined process
In UNIX, any zombies without parent process get re-parented to process
with PID 1 (usually running init), and PID 1 process should take care of
them (usually simply clean them up). Cleaning up zombies is important,
as they still take kerner resources, and having enormous amount of
zombie processes signifcantly degrades system performance.

For Talos, PID 1 process is machined, and machined itself forks to run
other processes in process runner and `pkg/cmd` one-time commands. Naive
solution of running `wait()` loop doesn't work as it might race with
`Process.Wait()` and clean up zombie which wasn't re-parented which
leads to process execution false failure.

After considering other solutions, we decided to go with the simple
approach: machined runs global zombie process reaper which publishes
information about reaped zombies. Any call to `Process.Wait()` (or
`Command.Wait()` which calls it) should be replaced with listening to
reaper's channel for notifications to catch info about the process which
was created in this call.

There are several changes in this PR:

1. Reaper implementation itself, started from machined.

2. Process runner and `pkg/cmd` can either use regular `Command.Wait()`
or use reaper notifications depending on reaper status (running/not
running). This allows using this code outside of machined.

3. Small bug fixes with process log which was affecting the tests.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-09-05 10:01:02 -07:00
Andrew Rynhard
d4770d41ad feat: run installs via container
This moves to performing installs via a container.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 15:01:20 -05:00
Andrew Rynhard
0bdaff1a90 feat: perform upgrades via container
This moves to performing upgrades via a container.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 09:44:50 -07:00
Andrew Rynhard
43e20217e8 feat: add ability to pass data on event bus
We need to support eventing with associated data. This moves the event
bus to an observer design pattern that allows observers to register for
specific events, and to receive the associated data.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-26 13:27:02 -07:00
Andrew Rynhard
9eaa2d8140 feat: add sequencer interface
This adds an interface that can be used to descibe boot, shutdown, and
upgrade events in a set of phases.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-25 12:59:42 -07:00
Andrew Rynhard
be8f58c15d feat: add overlay task
This adds a well defined task for handling all overlay mount points that
are required by the system.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-25 10:47:54 -07:00
Brad Beam
313c118ad0 refactor(networkd): Replace networkd with a standalone app
This is a major rewrite of our network subsystem.

- This changes networkd to run as a standalone app versus internal goroutine
- This changes out the netlink package with the more idiomatic netlink/rtnetlink
  packages
- This changes the initial network bootstrap/discovery from using a single
  interface to attempting to bring up all interfaces
- This moves us back on to the upstream dhcp library

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-21 13:24:51 -05:00
Andrew Rynhard
2e65cff3ce feat: mount /sys/fs/bpf
The BPF filesystem is required to pin BPF objects.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-18 07:37:08 -07:00
Brad Beam
da1f73249f fix(machined): Clean up installation process
This also includes a fix for #955 which had the unintended side effect
of breaking image creation ( since it would attempt to grow the filesystem
always ).

The refactor standardizes around looking for the DATA and ESP labels to
discover any existing installations/filesystems. If none are found, an
installation will proceed -- for both image creation and bare metal.
During bootup, the DATA partition will always attempt to expand/grow.

This also introduces a new phase to verify the installation through the
existance of /boot/installed ( migrated from install stage ).

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-08 22:10:14 -05:00
Andrey Smirnov
71640662e0 chore(init): rearrange phase handling to push shutdown to main
This re-arranges phases a bit so that shutdown actions are pushed back
to the top-level main.go of machined.

Small rudimentary event.Bus is introduce to facilitate event passing
(shutdown/restart) between various machined components and main.go. This
might be not the best implementation, just something to allow this
message passing without global variables or such.

Machined API was refactored to run as goroutine service.

ACPI & signal handlers re-built as phase tasks, and activated for
non-container, container modes respectively.

As part of the fix, now `docker stop` triggers correct shutdown of Talos
(not a big deal, but good for testing).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-08-02 08:42:12 -07:00
Andrew Rynhard
90c91807bd refactor: restructure the project layout
This change moves packages into more appropriate places.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-01 22:19:42 -07:00
Andrew Rynhard
ca35b85300 refactor: improve installation reliability
This change aims to make installations more unified and reliable. It
introduces the concept of a mountpoint manager that is capable of
mounting, unmounting, and moving a set of mountpoints in the correct
order.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-01 11:44:40 -07:00
Andrew Rynhard
835d72b74a fix: create overlay mounts after install
Without running the install task first, /var is read-only. This causes
the overlay phase to fail as it tries to create /var/system.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-01 06:35:12 -07:00
Andrey Smirnov
084378ac04 fix(init): flip concurrency of tasks/services, fix small issues
Phases should run sequentially, while tasks concurrently in a phase.

There are two potential issues fixed:

1. `result` multierror was updated inside goroutine without any
synchronization, so this is a data race
2. panic inside task/phase runner might happen and as unhandled panic in a
goroutine aborts whole process, this might lead to a system halt as
as the 'machined' exits

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-07-31 14:21:07 -07:00
Andrew Rynhard
e63c882b89 refactor: split machined into phases
This change aims to standardize the boot process. It introduces the
concept of a phase, which is comprised of tasks. Phases are ran in serial and
the tasks that make up a phase are ran concurrently.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-29 12:40:03 -07:00
Andrew Rynhard
b7a9acbe88 refactor: move setup logic into machined
The responsibility of init should only be to mount the rootfs. This
change moves Talos specific logic into machined. This will allow us to
define a version of Talos in a single binary instead of split across
two. This will enable cleaner upgrades and helps make the codebase
easier to reason about.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-26 07:48:49 -07:00
Andrew Rynhard
0ec17e4169 feat: run rootfs from squashfs
This change moves the rootfs to a squashfs image.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-25 08:38:31 -07:00
Andrew Rynhard
88bdedf3e6 fix: make /etc/resolv.conf writable
We need /etc/resolv.conf to be writable so that networkd can update it.
This change achieves this by creating a symlink at /etc/resolv.conf that
points to /var/resolv.conf.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-19 20:37:00 -07:00
Andrew Rynhard
8e8aae98dd feat: add machined
This commit splits our current init into init and machined.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-16 13:12:21 -07:00