talos

mirror of https://github.com/siderolabs/talos.git synced 2025-12-15 22:41:55 +01:00

Author	SHA1	Message	Date
Andrey Smirnov	a12eb76734	test: add unit-test for the installer manifest This test only works on local machine (see notes in the file). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-15 13:31:42 -07:00
Andrew Rynhard	4eeef28e90	feat: add etcd API This adds RPCs for basic etcd management tasks. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-10-06 11:30:04 -07:00
Andrey Smirnov	90d0efec48	feat: pull kubeconfig from the cluster on successful `cluster create` Kubeconfig is merged into `~/.kube/config` with rename option (existing configuration is never overwritten). If endpoint was used, it is automatically put into the `kubeconfig`. This should make OS X experience literally `talosctl cluster create` followed by any `kubectl get ...`. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-06 05:45:28 -07:00
Andrey Smirnov	018086d1fa	refactor: extract blockdevice library Library `blockdevice` was extracted as `talos-systems/go-blockdevice`, this PR finalizes the move by removing Talos copy of it. Some functions around `mkfs`/`growfs` were extracted as `makefs` package, as they depend on `cmd` package. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-05 11:18:43 -07:00
Andrey Smirnov	16eb47a1a3	feat: use kubeconfig merge in `talosctl kubeconfig` by default Kubeconfig merge was completely rewritten to be "smarter": * automatically apply renames done at previous stages to avoid asking over and over again (in general should ask just once) * skip checks if parts of the config match exactly * allow overwrite as an option * flexible way to control the output * activating context in the end * custom merged context name Fixes #2578 Fixes #2587 Fixes #2577 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-10-03 05:36:15 -07:00
Seán C McCord	ff92d2a14b	feat: add ApplyConfiguration API Adds the ability to apply (replace) an existing node configuration with a new one via the Machine API. Fixes #2345 Signed-off-by: Seán C McCord <ulexus@gmail.com>	2020-09-29 14:44:06 -07:00
Andrey Smirnov	54887c094d	fix: provide unique username in generate kubeconfig This allows for more clean merge of multiple kubeconfigs from different Talos clusters. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-29 13:10:36 -07:00
Andrey Smirnov	98443cd0e9	fix: retry container image import This bug is sometimes reproducible with QEMU/arm64, as it runs really slow. Looks like multiple concurrent image unpacks sharing some layers might fail unexpectedly. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-28 08:58:47 -07:00
Andrey Smirnov	8236822c90	fix: retry image pulling, stop on 404, no duplicate pulls This uses go-retry feature (https://github.com/talos-systems/go-retry/pull/3) to print errors being retried. If image is not found in the index, abort retries immediately. Don't pull installer image twice (if already pulled by the validation code before). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-22 07:07:45 -07:00
Andrey Smirnov	f6ecf000c9	refactor: extract packages loadbalancer and retry This removes in-tree packages in favor of: * github.com/talos-systems/go-retry * github.com/talos-systems/go-loadbalancer Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-09-02 13:46:22 -07:00
Andrew Rynhard	1a4059a553	feat: add grub bootloader This moves to using grub instead of syslinux. BREAKING CHANGE: Single node upgrades will fail in this change. This will also break the A/B fallback setup since this version introduces an entirely new partition scheme, that any fallback will not know about. We plan on addressing these issues in a follow up change. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-09-01 12:06:43 -07:00
Andrew Rynhard	d4f103ffcb	fix: pass config via stdin In order to perform upgrades the way we would like, it is important that we avoid any bind mounts into containers. This change ensures that all system services get their config via stdin. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-08-20 15:26:13 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	2697b99b7d	refactor: extract `pkg/net` as `github.com/talos-systems/net` This extracts common package as new module/repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-14 11:04:50 -07:00
Andrey Smirnov	52c5911fcd	chore: extract pkg/crypto as external module Package `pkg/crypto` was extracted as `github.com/talos-systems/crypto` repository and Go module. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-14 06:33:30 -07:00
Andrey Smirnov	7474b8ba52	feat: upgrade etcd to 3.4.10 This upgrades etcd to latest v3.4.x version as smooth upgrade from version 3.3.22 in 0.6. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-13 07:33:51 -07:00
Andrey Smirnov	9379cf9ee1	refactor: expose `provision` as public package This change is only moving packages and updating import paths. Goal: expose `internal/pkg/provision` as `pkg/provision` to enable other projects to import Talos provisioning library. As cluster checks are almost always required as part of provisioning process, package `internal/pkg/cluster` was also made public as `pkg/cluster`. Other changes were direct dependencies discovered by `importvet` which were updated. Public packages (useful, general purpose packages with stable API): * `internal/pkg/conditions` -> `pkg/conditions` * `internal/pkg/tail` -> `pkg/tail` Private packages (used only on provisioning library internally): * `internal/pkg/inmemhttp` -> `pkg/provision/internal/inmemhttp` * `internal/pkg/kernel/vmlinuz` -> `pkg/provision/internal/vmlinuz` * `internal/pkg/cniutils` -> `pkg/provision/internal/cniutils` Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-12 05:12:05 -07:00
Andrey Smirnov	050d34275a	chore: integrate importvet This integrates [importvet](https://github.com/talos-systems/importvet) into `lint` target. First rule file was added for public packages `pkg/` which shouldn't depend on other parts of Talos tree (except for the API definitions). Only one change: `internal/cis` was moved under single user - `pkg/config/internal/cis` to satisfy the rules. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-11 13:19:15 -07:00
Andrew Rynhard	92523bc422	refactor: remove structs from config provider This makes the config provider a pure interface definition by removing all concrete internal types, and making them an interface. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-08-06 13:21:41 -07:00
Andrey Smirnov	b5d082de8a	fix: update qemu launcher on arm64 to boot Talos properly This includes better machine args, support for UEFI parallel flash images required as low-level bootloader, and miscellaneous cleanups. Qemu support was enabled for mapping host random source to the guest as entropy source to prevent stalls on the boot waiting for the entropy. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-04 11:45:19 -07:00
Andrey Smirnov	a5d64d97c1	test: update qemu/firecracker provisioners Fixes #2363 #2364 #2370 #2371 Several changes packed together: * use compressed `vmlinuz` everywhere, firecracker provisioner uncompresses it before first use, drop `vmlinux` * handle reboots in qemu launcher to support reset API case, update empty disk check to handle reset behavior (erasing partition table) * make bootloader support default in provisioners, and flag to disable that * early support for target architecture for qemu provisioner This should allow us to use `qemu` in CI/CD (not included into this PR): integration test passes with qemu. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-30 21:17:25 +03:00
Andrew Rynhard	849959fefc	feat: add dynamic config decoder This adds the ability to dynamically decode multi-doc YAML files. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-30 08:07:14 -07:00
Andrey Smirnov	3926442704	feat: taint master nodes with `NoSchedule` taint Fixes #2350 This also brings in a fix for `coredns` tolerations from https://github.com/talos-systems/bootkube-plugin/pull/19. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-29 14:02:41 -07:00
Andrey Smirnov	47608fb874	refactor: make `pkg/config` not rely on `machined/../internal/runtime` This makes `pkg/config` directly importable from other projects. There should be no functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-29 12:40:12 -07:00
Artem Chernyshev	c6eb18eed5	feat: qemu provisioner Starts and stops qemu VMs, has some initial configuration subset. Sets up networking through CNI tools, sets up DHCP server which gives IP addresses to nodes. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-28 14:55:35 -07:00
Andrey Smirnov	76c44ac468	test: remove apid load balancer for firecracker We're not using load balancer for `apid` (always using client-side load balancing), so we can remove this safely. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-28 20:21:21 +03:00
Andrey Smirnov	020aea1b89	fix: fail ntpd service if initial time sync fails Talos depends on accurate time for many actions, so many services depend on timed successful health check. If timed fails to do initial sync, it enters pretty long wait loop for the next attempt which might not come in time for the boot timeout. Instead, fail timed service on initial sync and rely on service restart for another attempt. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-28 08:44:58 -07:00
Andrew Rynhard	d8b689e9d1	fix: generate admin kubeconfig with default namespace This ensures that the generated kubeconfig has a namespace. This fixes an edge case when a user attempts to use the kubeconfig from within a pod of a different kubernetes cluster. If the kubeconfig does not have a namespace, kubectl will use the "in cluster namespace" which is unexpected, especially if the "in cluster namespace" does not exist in the target cluster. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-27 20:08:17 -07:00
Andrey Smirnov	c85608b8d9	test: add an option to bind docker to specific host IP This allows to override default `0.0.0.0` (`*`) to a specific IP to avoid conflicts. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-27 21:13:28 +03:00
Andrey Smirnov	13c0052a6c	test: fix racy test ReaderNoFollow Due to the race between `Read()` and context cancellation, error might be returned which we can safely ignore. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-27 21:13:11 +03:00
Artem Chernyshev	c70c08c8ce	chore: extract loadbalancer, network, crashdump and process from firecracker Second part of refactoring to split common logic for VM provisioners from Firecracker provisioner. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-20 11:03:03 -07:00
Andrey Smirnov	f047c42ae7	test: provide correct installer kernel args for firecracker Firecracker never executes the bootloader, so kernel args passed to the installer aren't used, but if the same disk image is used to boot Talos e.g. in `qemu`, it fails to set up console properly for example. This PR simply provides those kernel args to the installer so that they're persisted in the image. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 16:52:08 +03:00
Artem Chernyshev	19cd46459b	chore: initial extraction of base vm provisioner Created base provisioner struct for all VM based provisioners. Moved state.go and reflect.go to the common module. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-18 15:45:54 -07:00
Artem Chernyshev	3d25ceb13e	chore: move inmemhttp from firecracker provisioner to internal/pkg/ To be reused in qemu provisioner later on. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-18 07:11:50 -07:00
Andrey Smirnov	2f4fb34baf	fix: update container name in docker crashdump Small bug resulted in container names being cut in the wrong way. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 12:49:29 -07:00
Andrey Smirnov	41d5f7859a	chore: update golangci-lint to 1.28.3 Fixes #2272 `gofumpt` is now included into `golangci-lint`, but not the `gofumports`, so we keep using it as a separate binary, but we keep versions in sync with `golangci-lint`. This contains fixes from: * `gofumpt` (automated, mostly around octal constants) * `exhaustive` in `switch` statements * `noctx` (adding context with default timeout to http requests) Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 08:05:42 -07:00
Andrey Smirnov	c54639e541	feat: implement server-side API for cluster health checks This implements existing server-side health checks as defined in `internal/pkg/cluster/checks` in Talos API. Summary of changes: * new `cluster` API * `apid` now listens without auth on local file socket * `cluster` API is for now implemented in `machined`, but we can move it to the new service if we find it more appropriate * `talosctl health` by default now does server-side health check UX: `talosctl health` without arguments does health check for the cluster if it has healthy K8s to return master/worker nodes. If needed, node list can be overridden with flags. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-15 13:52:13 -07:00
Andrey Smirnov	a4a2a3c83a	feat: uncordon nodes automatically on boot Talos will mark node as schedulable if it was previously cordoned by Talos (for upgrade, reset, etc.) If user marked node as not schedulable, Talos won't change it on boot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 15:32:36 -07:00
Andrey Smirnov	5ecddf2866	feat: add round-robin LB policy to Talos client by default Handling of multiple endpoints has already been implemented in #2094. This PR enables round-robin policy so that grpc picks up new endpoint for each call (and not send each request to the first control plane node). Endpoint list is randomized to handle cases when only one request is going to be sent, so that it doesn't go always to the first node in the list. grpc handles dead/unresponsive nodes automatically for us. `talosctl cluster create` and provision tests switched to use client-side load balancer for Talos API. On the additional improvements we got: * `talosctl` now reports correct node IP when using commands without `-n`, not the loadbalancer IP (if using multiple endpoints of course) * loadbalancer can't provide reliable handling of errors when upstream server is unresponsive or there're no upstreams available, grpc returns much more helpful errors Fixes #1641 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 08:35:15 -07:00
Andrey Smirnov	aa687cf8cd	fix: update the control plane cluster health check Include kube-apiserver in the list of daemon sets to be checked, and for each daemon set verify number of pods running and ready, as when control plane is damaged daemon set properties are not updated properly. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 17:53:21 +03:00
Andrey Smirnov	ddbe9cfc2f	fix: update timeouts on service startup to match boot timeout There's a global timeout for all services to be up: it's 5 minutes. We need to make sure each service startup takes less than that, otherwise boot sequence is aborted and there's no way to see the error message for each particular service. Also propagate contexts correctly and set some default timeouts to make sure API operations are not hanging forever. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 07:39:36 -07:00
Andrey Smirnov	219425f629	test: resolve old TODO item I had to copy over some oci stuff from newer package version, but as we have used the newer oci for a long time, we don't need a copy anymore. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-02 11:09:58 -07:00
Andrey Smirnov	ba12095ac7	test: stabilize race unit-tests (circular, events) Fixes #2243 These tests rely on some kind of sync between readers and writers, as if circular buffer is overrun, test no longer runs as expected. We use time-sensitive rate limiter to limit write speed to make sure readers can always catch up. Lowering the rate should slow down writers and make tests more likely to succeed. For #2243, the failure was from buffer overrun: when overrun is detected, `Watch` function closes the channel (and test "receives" zero element). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-01 13:39:49 -07:00
Andrew Rynhard	888c8b948a	feat: add /system directory This adds the `/system` directory to provide a dedicated directory for all system related runtime files. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-01 09:51:56 -07:00
Andrey Smirnov	81d1c2bfe7	chore: enable godot linter Issues were fixed automatically. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-30 10:39:56 -07:00
Andrey Smirnov	4ad4511b38	chore: enable nolintlint linter It makes sure our `//nolint:` directives are not redundant. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-30 07:39:19 -07:00
Andrey Smirnov	0a4645fe80	feat: implement circular buffer for system logs This replaces logging to files with inotify following to pure in-memory circular buffer which grows on demand capped at specified maximum capacity. The concern with previous approach was that logs on tmpfs were growing without any bound potentially consuming all the node memory. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-26 15:33:54 -07:00
Andrew Rynhard	d0d2ac3c74	test: default to using the bootstrap API This moves our test scripts to using the bootstrap API. Some automation around invoking the bootstrap API was also added to give the same ease of use when creating clusters with the CLI. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-06-24 08:46:10 -07:00
Andrew Rynhard	6ea313fa7d	fix: detect if partition table is missing This adds a sentinel error for a missing partition table. This error is used to detect if a partition table already exists when setting up user defined disks. In addition to the fix, this removes a legacy parameter from the `PartitionTable` method that indicated that the partition table should be read. It is safer to just read it every time. Also, I can't think of a case when the block device partition table is nil and we want to read. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2020-06-16 18:26:59 -07:00
Andrey Smirnov	a9766d31bc	refactor: implement LoggingManager as central log flow processor Using this `LoggingManager` all the log flows (reading and writing) were refactored. Interface of `LoggingManager` should be now generic enough to replace log handling with almost any implementation - log rotation, sending logs to remote destination, keeping logs in memory, etc. There should be no functional changes. As part of changes, `follow.Reader` was implemented which makes appending file feel like a stream. `file.NewChunker` was refactored to use `follow.Reader` and `stream.NewChunker` to do the actual work. So basically now we have only a single instance of chunker - stream chunker, as everything is represented as a stream. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-06-10 14:30:36 -07:00

306 Commits