The problem was that some of the health checks sort the list of nodes
in place (via `sort.Strings()`). If the cluster info provider returns
the original slice, it might be mutated in a way that corrupts it.
We never noticed this before CAPI clusters, as in our tests IPs are
assigned sequentially, so the sort operation is a no-op.
Specifically, the problem was with the `Nodes()` function: it returns
the `append(controlPlaneNodes, workerNodes...)` slice, which by definition
might share memory with the `controlPlaneNodes` slice. For example,
if the control plane nodes were `4, 5, 6` and the worker nodes were `3`, the
returned slice is `4, 5, 6, 3`, and it shares memory with the
`controlPlaneNodes` slice (first three items). If we apply `sort` to the
returned slice, it is re-ordered to `3, 4, 5, 6`, but as sorting is done
in place, the `controlPlaneNodes` slice is now `3, 4, 5`, which is
obviously wrong.
Fix that by always returning a copy of the slice from the functions
implementing the `ClusterInfo` interface.
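A minimal standalone sketch of the aliasing problem and the copy-based fix
(illustrative only, not the actual Talos code):

```go
package main

import (
	"fmt"
	"sort"
)

// nodesAliased mimics the buggy Nodes(): the appended slice may share its
// backing array with controlPlaneNodes.
func nodesAliased(controlPlaneNodes, workerNodes []string) []string {
	return append(controlPlaneNodes, workerNodes...)
}

// nodesCopy mimics the fix: always return a fresh copy.
func nodesCopy(controlPlaneNodes, workerNodes []string) []string {
	nodes := make([]string, 0, len(controlPlaneNodes)+len(workerNodes))
	nodes = append(nodes, controlPlaneNodes...)

	return append(nodes, workerNodes...)
}

func main() {
	// Built with spare capacity (as often happens in practice), so the append
	// in nodesAliased reuses the same backing array.
	controlPlaneNodes := make([]string, 0, 4)
	controlPlaneNodes = append(controlPlaneNodes, "10.5.0.4", "10.5.0.5", "10.5.0.6")
	workerNodes := []string{"10.5.0.3"}

	sort.Strings(nodesAliased(controlPlaneNodes, workerNodes))
	fmt.Println(controlPlaneNodes) // [10.5.0.3 10.5.0.4 10.5.0.5] -- corrupted

	sort.Strings(nodesCopy(controlPlaneNodes, workerNodes))
	fmt.Println(controlPlaneNodes) // unchanged: sorting the copy is safe
}
```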
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
It only gets enabled if the output is a terminal. Failures which resolve
themselves are removed from the final output. A small spinner indicates
progress.
While I was at it, I fixed client-side `talosctl health` when init node
is missing.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Kubeconfig merge was completely rewritten to be "smarter":
* automatically apply renames done at previous stages to avoid asking
over and over again (in general, it should ask just once)
* skip checks if parts of the config match exactly
* allow overwrite as an option
* flexible way to control the output
* activate the merged context at the end
* custom merged context name
Fixes #2578
Fixes #2587
Fixes #2577
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This also refactors much of the CLI code for the `talosctl kubeconfig`:
1. Do all the checks before fetching kubeconfig from the server: as
kubeconfig generation takes a few seconds, it doesn't make sense to
generate it if it's not going to be used.
2. Unify most of the "merge" and "write directly" features.
3. Don't use the ExtractTarGz method, to be more flexible.
4. Allow custom paths for kubeconfig, whether it is a directory or full
path to the file to be created.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Adds the ability to apply (replace) an existing node configuration with
a new one via the Machine API.
Fixes #2345
Signed-off-by: Seán C McCord <ulexus@gmail.com>
By default, builds outside of Drone work the same as before: only the
amd64 version is built, images are loaded back into dockerd, etc.
If multiple platforms are given, multi-arch images are built; these can't
be exported to docker or to a `.tar` image, so they're always pushed to the
registry (even for PR builds, to our internal CI registry).
File artifacts (initramfs, kernel) now have an `-arch` suffix:
`vmlinuz-amd64`, `initramfs-amd64.xz`. The "magic" script normalizes output
paths depending on whether a single platform or multiple platforms were
given.
VM provisioners accept the magic `${ARCH}` token in initramfs/kernel paths,
which gets replaced with the cluster architecture.
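For illustration, the substitution itself boils down to something like this
(the helper name is hypothetical, not the actual provisioner code):

```go
package main

import (
	"fmt"
	"strings"
)

// expandArch replaces the magic ${ARCH} token in an artifact path with the
// cluster architecture.
func expandArch(path, arch string) string {
	return strings.ReplaceAll(path, "${ARCH}", arch)
}

func main() {
	fmt.Println(expandArch("_out/vmlinuz-${ARCH}", "amd64"))      // _out/vmlinuz-amd64
	fmt.Println(expandArch("_out/initramfs-${ARCH}.xz", "arm64")) // _out/initramfs-arm64.xz
}
```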
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This adds a command that lists all of the images used by Talos. This is
useful for air-gapped installs, so that users know which images
to pull.
Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
Add Kubernetes upgrade as part of the provisioning (upgrade) tests:
first the K8s control plane is upgraded, then Talos is upgraded (with the
kubelet), and the e2e test is run last.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Add sonobuoy runner code with log fetching on failure. Use a hand-picked
set of e2e tests to run: verify basic pod functionality, verify service
connectivity.
Add a `--run-e2e` option to `talosctl health` to run a quick e2e test
verifying cluster health.
Add an option to run provision tests with a custom CNI; run one track of
provision tests with Cilium.
Bump Cilium to 1.8.2.
Talos 0.6 won't uncordon a node automatically after an upgrade from 0.5, as
0.5 doesn't set the annotation. Work around that in the upgrade tests.
Bump upgrade test version to 0.6.0 release.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This moves to using grub instead of syslinux.
BREAKING CHANGE: Single node upgrades will fail with this change. This
will also break the A/B fallback setup, since this version introduces
an entirely new partition scheme that any fallback will not know about.
We plan on addressing these issues in a follow-up change.
Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
The bootkube recover process (and `talosctl recover`) was actually
regenerating assets on each `recover` run, forcing the control plane back
to the state it was in when the cluster was created. This PR fixes that by
running the recovery process correctly.
Recovery via etcd was fixed to handle encrypted etcd data:
it follows the way the `apiserver` handles encryption at rest, and as
AES-CBC is at the moment the only supported encryption method, the code
simply follows the same path.
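For reference, a rough sketch of decrypting a value written by the
apiserver's `aescbc` provider (values are stored as a
`k8s:enc:aescbc:v1:<key-name>:` prefix followed by IV and ciphertext);
this is an illustration under those assumptions, not the actual Talos code:

```go
package recover // illustrative package name

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// decryptAESCBC decrypts a single etcd value: strip the provider prefix,
// take the leading IV, CBC-decrypt the rest, and remove PKCS#7 padding.
func decryptAESCBC(key, stored []byte, prefix string) ([]byte, error) {
	if !bytes.HasPrefix(stored, []byte(prefix)) {
		return nil, fmt.Errorf("value doesn't carry the expected provider prefix")
	}

	data := stored[len(prefix):]
	if len(data) < 2*aes.BlockSize || len(data)%aes.BlockSize != 0 {
		return nil, fmt.Errorf("invalid ciphertext length %d", len(data))
	}

	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}

	iv, ciphertext := data[:aes.BlockSize], data[aes.BlockSize:]
	plaintext := make([]byte, len(ciphertext))
	cipher.NewCBCDecrypter(block, iv).CryptBlocks(plaintext, ciphertext)

	pad := int(plaintext[len(plaintext)-1])
	if pad == 0 || pad > aes.BlockSize || pad > len(plaintext) {
		return nil, fmt.Errorf("invalid PKCS#7 padding")
	}

	return plaintext[:len(plaintext)-pad], nil
}
```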
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under the `pkg/machinery` umbrella,
and `pkg/machinery` is published as a Go module inside the Talos repository.
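For an external consumer, that means importing the machinery packages
directly; a minimal sketch (the `WithEndpoints` option shown here is an
assumption for illustration, and real usage would supply TLS credentials
from a talosconfig):

```go
package main

import (
	"context"
	"log"

	"github.com/talos-systems/talos/pkg/machinery/client"
)

func main() {
	// Build a Talos API client from the standalone machinery module.
	c, err := client.New(context.Background(), client.WithEndpoints("10.5.0.2"))
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close() //nolint:errcheck
}
```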
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This change is only moving packages and updating import paths.
Goal: expose `internal/pkg/provision` as `pkg/provision` to enable other
projects to import Talos provisioning library.
As cluster checks are almost always required as part of the provisioning
process, the package `internal/pkg/cluster` was also made public as
`pkg/cluster`.
The other changes are direct dependencies discovered by `importvet`,
which were updated accordingly.
Public packages (useful, general purpose packages with stable API):
* `internal/pkg/conditions` -> `pkg/conditions`
* `internal/pkg/tail` -> `pkg/tail`
Private packages (used only by the provisioning library internally):
* `internal/pkg/inmemhttp` -> `pkg/provision/internal/inmemhttp`
* `internal/pkg/kernel/vmlinuz` -> `pkg/provision/internal/vmlinuz`
* `internal/pkg/cniutils` -> `pkg/provision/internal/cniutils`
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes #2363 #2364 #2370 #2371
Several changes packed together:
* use compressed `vmlinuz` everywhere; the firecracker provisioner
uncompresses it before first use; drop `vmlinux`
* handle reboots in the qemu launcher to support the reset API case, and
update the empty disk check to handle reset behavior (erasing the
partition table)
* make bootloader support the default in provisioners, with a flag to
disable it
* early support for target architecture in the qemu provisioner
This should allow us to use `qemu` in CI/CD (not included in this PR):
the integration test passes with qemu.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This makes `pkg/config` directly importable from other projects.
There should be no functional changes.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This bumps the next version to the latest 0.6 alpha and the latest 0.5.
This also enables the single-node preserve test.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes #2330
CLI tests require node discovery, as the `--nodes` flag is enforced for
most `talosctl` commands.
For clusters created via `talosctl cluster create`, the cluster provisioner
state provides all the necessary information, but clusters created via
CAPI don't have that state attached.
API tests rely on the Talos and Kubernetes APIs to fetch the kubeconfig and
access the K8s Nodes API.
CLI tests should rely only on CLI tools, so we use `kubectl get nodes` +
`talosctl kubeconfig` to fetch the list of master and worker nodes.
This discovery method relies on the "bootstrap" node being set in
`talosconfig` (to fetch the `kubeconfig`).
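A rough sketch of that discovery flow with client-go, given a kubeconfig
already fetched via `talosctl kubeconfig` (the file path and the
control-plane label used here are illustrative assumptions):

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// kubeconfig fetched beforehand, e.g. via `talosctl kubeconfig ./kubeconfig`.
	cfg, err := clientcmd.BuildConfigFromFlags("", "./kubeconfig")
	if err != nil {
		log.Fatal(err)
	}

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	var masters, workers []string

	for _, node := range nodes.Items {
		// The control-plane label name varies between Kubernetes versions;
		// "node-role.kubernetes.io/master" was the one in use at the time.
		if _, ok := node.Labels["node-role.kubernetes.io/master"]; ok {
			masters = append(masters, node.Name)
		} else {
			workers = append(workers, node.Name)
		}
	}

	fmt.Println("masters:", masters)
	fmt.Println("workers:", workers)
}
```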
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
With load-balancing enabled by default, running `talosctl` without
`--nodes` is risky, as the request might hit any control plane node.
Only two commands do not enforce this check, as they manage their own node
contexts: `crashdump` and `health` (client-side).
Integration tests were updated to always supply the `--nodes` CLI argument;
while doing that, I refactored the storage for discovered nodes to use the
existing `cluster.Info` interface.
The downside is that with e2e CAPI tests, CLI tests will mostly be
skipped, as we don't support discovery in CLI tests at the moment. This
can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for
node discovery.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes #2316
Simply update dependencies we don't track at the version level to be
compatible with Talos components (like etcd or k8s).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This implements existing server-side health checks as defined in
`internal/pkg/cluster/checks` in Talos API.
Summary of changes:
* new `cluster` API
* `apid` now listens without auth on a local file socket
* the `cluster` API is for now implemented in `machined`, but we can move
it to a new service if we find that more appropriate
* `talosctl health` by default now does a server-side health check
UX: `talosctl health` without arguments does a health check for the
cluster if it has a healthy K8s control plane to return the master/worker
nodes. If needed, the node list can be overridden with flags.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This merges the `osd` API into `machined`. The API was copied from `osd`
into `machined`, and the `osd` API was deprecated.
For backwards compatibility, `machined` still implements the `osd` API, so
older Talos API clients can still talk to the node without changes.
Docs were updated. No functional changes.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Previously we assumed that node 0 is the init node and that it can't be
reset. With the new bootstrap API approach there is no init node, and all
the nodes can be reset. This corrects the check to skip only the init
node; with the bootstrap API no nodes are skipped.
Fixes #2277
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Talos will mark a node as schedulable if it was previously cordoned by
Talos (for upgrade, reset, etc.).
If the user marked the node as not schedulable, Talos won't change it on boot.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
There were three problems:
* CLI tests ran commands in sequence assuming they all hit the same
node, but with load-balancing that's no longer true
* the restart test was affected, as it hit a different node for the check
after the restart and succeeded immediately, while on the original node
the process was still starting, which resulted in failures in the next
tests; the check was replaced to make sure the service is up and healthy,
so that the test leaves the cluster in a good state
* the restart API response had the wrong format (no message returned),
which resulted in failures with the apid proxy (when used with `-n`)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Handling of multiple endpoints was already implemented in #2094.
This PR enables the round-robin policy so that grpc picks a new endpoint
for each call (instead of sending every request to the first control plane
node).
The endpoint list is randomized to handle cases when only one request is
going to be sent, so that it doesn't always go to the first node in the
list.
grpc handles dead/unresponsive nodes automatically for us.
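A rough sketch of the dialing side, with a manual resolver standing in for
the custom Talos resolver and TLS omitted (illustrative, not the actual
client code):

```go
package main

import (
	"log"
	"math/rand"

	"google.golang.org/grpc"
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc/resolver/manual"
)

func main() {
	endpoints := []string{"10.5.0.2:50000", "10.5.0.3:50000", "10.5.0.4:50000"}

	// Shuffle so that a single one-shot request doesn't always land on the
	// first node in the list.
	rand.Shuffle(len(endpoints), func(i, j int) {
		endpoints[i], endpoints[j] = endpoints[j], endpoints[i]
	})

	addrs := make([]resolver.Address, 0, len(endpoints))
	for _, e := range endpoints {
		addrs = append(addrs, resolver.Address{Addr: e})
	}

	// A manual resolver stands in for the custom resolver used by the client.
	r := manual.NewBuilderWithScheme("talos")
	r.InitialState(resolver.State{Addresses: addrs})

	conn, err := grpc.Dial(
		r.Scheme()+":///cluster",
		grpc.WithResolvers(r),
		// round_robin: each call goes to the next ready endpoint in turn.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
		grpc.WithInsecure(), // the real Talos API uses TLS; omitted for brevity
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close() //nolint:errcheck

	_ = conn // pass conn to generated gRPC service clients
}
```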
`talosctl cluster create` and provision tests switched to using the
client-side load balancer for the Talos API.
Additional improvements we got:
* `talosctl` now reports the correct node IP when using commands without
`-n`, not the load balancer IP (if using multiple endpoints, of course)
* the load balancer can't provide reliable error handling when the upstream
server is unresponsive or there are no upstreams available; grpc returns
much more helpful errors
Fixes #1641
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids
are sortable and unique enough. Xids also encode the event publishing
time with one-second precision.
2. Add three ways to look back into event history: based on the number of
events, on time, and on ID. Lookup via ID can be used to restart event
polling from the same moment in case of a broken API connection.
3. Reimplement the core event buffer with positions that are always
incremented instead of generation+index; this implementation is much
simpler (idea borrowed from a circular buffer); see the sketch after this
list.
4. By default, the Events API works the same: it shows no history and
only streams new events.
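A minimal sketch of such a position-based buffer (illustrative, not the
actual Talos implementation):

```go
package events // illustrative

// Event is a placeholder for a published event with its xid-based ID.
type Event struct {
	ID      string
	Payload interface{}
}

// Buffer is a fixed-capacity ring addressed by ever-increasing positions,
// replacing the previous generation+index scheme.
type Buffer struct {
	events []Event
	next   int64 // position the next published event will receive
}

func NewBuffer(capacity int) *Buffer {
	return &Buffer{events: make([]Event, capacity)}
}

// Publish stores the event in the slot derived from its monotonic position.
func (b *Buffer) Publish(e Event) int64 {
	pos := b.next
	b.events[pos%int64(len(b.events))] = e
	b.next++

	return pos
}

// Get returns the event at position pos if it hasn't been overwritten yet.
func (b *Buffer) Get(pos int64) (Event, bool) {
	oldest := b.next - int64(len(b.events))
	if pos < 0 || pos >= b.next || pos < oldest {
		return Event{}, false
	}

	return b.events[pos%int64(len(b.events))], true
}
```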
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This implements service events and adds a test for the events API based on
service events, as they're the easiest to generate on demand.
The validate test for 'metal' was disabled, as it validates the disk device
against the local system, which doesn't make much sense.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
These tests rely on node uptime checks, which are quite flaky.
The following fixes were applied:
* the code was refactored into a common method shared between the
reset/reboot tests (the reboot-all-nodes test does its checks in a
different way, so it wasn't updated)
* each request to read the uptime times out in 5 seconds, so that checks
don't wait forever when a node is down (or the connection is aborted); see
the sketch after this list
* to account for node availability vs. lower uptime at the beginning of
the test, extra elapsed time is added to the check condition
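The per-request timeout boils down to something like this (a sketch with
illustrative names and budgets, not the actual test code):

```go
package uptime // illustrative

import (
	"context"
	"fmt"
	"time"
)

// readUptimeWithRetry keeps retrying, but each attempt gets its own 5-second
// deadline so a single hung request can't consume the whole retry budget.
func readUptimeWithRetry(ctx context.Context, read func(context.Context) (float64, error)) (float64, error) {
	deadline := time.Now().Add(10 * time.Minute) // overall retry budget (illustrative)

	for time.Now().Before(deadline) {
		attemptCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		uptime, err := read(attemptCtx)
		cancel()

		if err == nil {
			return uptime, nil
		}

		select {
		case <-ctx.Done():
			return 0, ctx.Err()
		case <-time.After(time.Second):
		}
	}

	return 0, fmt.Errorf("timed out waiting for uptime")
}
```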
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This replaces logging to files (followed via inotify) with a pure in-memory
circular buffer which grows on demand, capped at a specified maximum
capacity.
The concern with the previous approach was that logs on tmpfs were growing
without any bound, potentially consuming all the node memory.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This moves our test scripts to using the bootstrap API. Some
automation around invoking the bootstrap API was also added
to give the same ease of use when creating clusters with the
CLI.
Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
After a node reboot, the test code keeps trying to read the uptime
until it goes down, but during the actual reboot the API won't be
responsive, and it might happen that this call times out only when the
parent context is canceled; by that time the retry timeout is already
exhausted, so no more attempts are made (while the node has successfully
booted after the reboot).
```
uptime didn't go down: before 219.730000, after 267.020000
uptime didn't go down: before 219.730000, after 268.030000
EOF
rpc error: code = DeadlineExceeded desc = context deadline exceeded
timeout
```
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Create a new `client.New` constructor to make it easier for external
systems to construct API clients.
A new type, `client.OptionFunc`, allows the client to be
extended with specific configuration.
This also makes a first pass at supporting multiple endpoints properly
by creating a custom grpc resolver.
(Proper load balancing support is still a TODO.)
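A minimal sketch of the functional-options pattern this introduces; beyond
the `client.New` and `client.OptionFunc` names, everything here is
illustrative:

```go
package client

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
)

// Options collects configuration applied by OptionFuncs.
type Options struct {
	endpoints []string
	dialOpts  []grpc.DialOption
}

// OptionFunc mutates Options during client construction.
type OptionFunc func(*Options)

// WithEndpoints is an illustrative option setting the target endpoints.
func WithEndpoints(endpoints ...string) OptionFunc {
	return func(o *Options) { o.endpoints = endpoints }
}

// Client wraps the underlying gRPC connection.
type Client struct {
	conn *grpc.ClientConn
}

// New builds a Client, applying each option in order.
func New(ctx context.Context, opts ...OptionFunc) (*Client, error) {
	options := &Options{}

	for _, opt := range opts {
		opt(options)
	}

	if len(options.endpoints) == 0 {
		return nil, fmt.Errorf("at least one endpoint is required")
	}

	// A custom resolver would spread calls across all endpoints; dialing the
	// first one keeps the sketch short. TLS options are omitted for brevity.
	conn, err := grpc.DialContext(ctx, options.endpoints[0],
		append(options.dialOpts, grpc.WithInsecure())...)
	if err != nil {
		return nil, err
	}

	return &Client{conn: conn}, nil
}
```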
Fixes #2093
Signed-off-by: Seán C McCord <ulexus@gmail.com>