65 Commits

Author SHA1 Message Date
Ryan Cragun
84935e4416
[QT-697] enos: add descriptions and quality verification (#27311)
In order to take advantage of enos' ability to outline scenarios and to
inventory what verification they perform, we needed to retrofit all of
that information onto our existing scenarios and steps.

This change introduces an initial set of descriptions and verification
declarations that we can continue to refine over time.

As doing this required that I re-read every scenario in its entirety, I
also updated and fixed a few things that I noticed along the way,
including adding a few small features to enos that we utilize to
determine initial versions programmatically instead of maintaining a
delta between our globals in each branch.

* Update autopilot and in-place upgrade initial versions
* Programmatically determine which initial versions to use based on Vault
  version
* Partially normalize steps between scenarios to make comparisons easier
* Update the MOTD to explain that VAULT_ADDR and VAULT_TOKEN have been
  set
* Add scenario and step descriptions to scenarios
* Add initial scenario quality verification declarations to scenarios
* Unpin Terraform in scenarios as >= 1.8.4 should work fine
2024-06-13 11:16:33 -06:00
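Since this commit, scenario, step, and quality declarations live in the Enos HCL itself. A minimal sketch of the shape, with an invented quality name and module for illustration:

```hcl
quality "vault_api_sys_health_read" {
  description = "The /v1/sys/health endpoint reports initialized, unsealed nodes"
}

scenario "smoke" {
  description = <<-EOF
    Provision a Vault cluster and perform baseline verification against it.
  EOF

  step "verify_health" {
    description = "Verify that every node is initialized and unsealed."
    module      = module.vault_verify_unsealed # illustrative module name
    verifies    = [quality.vault_api_sys_health_read]
  }
}
```

Enos can then outline each scenario along with the qualities its steps verify.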
Rebecca Willett
1f0639a79c
Remove Leap 15.4 from testing matrices and AMI data sources; remove vestiges of Ubuntu 18.04 testing (#27416) 2024-06-10 11:44:32 -04:00
Ryan Cragun
0513545dd8
[VAULT-27917] fix(enos): handle SLES guestregister.service unreliability (#27380)
* [VAULT-27917] fix(enos): handle SLES guestregister.service unreliability

The SLES-provided `guestregister.service` systemd unit is unreliable
enough that it fails roughly one in nine times when provisioning SLES
instances. When this happens the machine never successfully execs
SUSEConnect to enroll, so we get no access to the SLES repositories and
subsequently break our scenarios.

I resolved this by restructuring our `install_packages` module to
separate repository synchronization, repository addition, and package
installation into different scripts and resources, and by adding special-case
handling for SLES and the `guestregister.service`.

I also made a distinction between `dnf` and `yum` because, while they are
effectively the same thing on RHEL, that is not the case on Amazon Linux 2.
I also shimmed out the rest of the support for Apt in case we ever need to add
repos there.

* Revert "Temporarily remove SLES from samples (#27378)"

This reverts commit 490cdd90661a57cf849c7d64aec545e87fb393c8.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2024-06-06 17:37:50 -06:00
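A rough sketch of the SLES special-casing described above, assuming a hypothetical recovery path (`registercloudguest` re-registration) and illustrative retry counts; the real module scripts differ:

```shell
#!/usr/bin/env bash
set -e

# On SLES, wait for the cloud registration oneshot unit to settle before
# touching zypper; if it failed, force a re-registration so the SUSE
# repositories become reachable.
if grep -qi '^ID="sles"' /etc/os-release; then
  for _ in $(seq 1 10); do
    state=$(systemctl show -p ActiveState --value guestregister.service)
    [ "$state" != "activating" ] && break
    sleep 5
  done

  if [ "$(systemctl show -p Result --value guestregister.service)" != "success" ]; then
    sudo registercloudguest --force-new # assumed recovery step
  fi
fi

sudo zypper --non-interactive refresh
```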
Rebecca Willett
c28739512a
Add Amazon Linux, openSUSE Leap, and SUSE SLES support to Enos scenarios and modules (#25983)
Add Consul edition support to Enos scenarios and modules
Add Linux distros and Consul edition to Enos samples
Bump RHEL versions to 9.3 and 8.9
2024-06-05 12:58:35 -04:00
Ryan Cragun
fea81ab8bc
enos: improve artifact:local dev scenario experience (#27095)
* Better handle symlinks in artifact paths.
* Fix a race condition in the local builder where Terraform wouldn't
  wait for local builds to finish before attempting to install vault on
  target nodes.
* Make building the web ui configurable in the dev scenario.
* Rename `vault_artifactory_artifact` to `build_artifactory_artifact` to
  better align with existing "build" modules.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2024-05-17 10:22:08 -06:00
Ryan Cragun
aa2aa0b627
enos: correctly set up profile (#27055)
Handle cases where `vault_cluster` is used more than once on a host,
including cases where we aren't initializing.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2024-05-15 20:07:35 +00:00
Ryan Cragun
dced3ad3f0
[VAULT-26888] Create developer scenarios (#27028)
* [VAULT-26888] Create developer scenarios

Create developer scenarios that have simplified inputs designed for
provisioning clusters and limited verification.

* Migrate the Artifactory installation module from the support-team-focused
  scenarios into the vault repository.
* Migrate the support-focused scenarios into the repo and update them to use
  the latest in-repo modules.
* Fully document and comment scenarios to help users outline, configure,
  and use the scenarios.
* Remove outdated references to the private registry, which is no longer needed.
* Automatically configure the login shell profile to include the path to
  the vault binary and the VAULT_ADDR/VAULT_TOKEN environment variables.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2024-05-15 12:10:27 -06:00
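A minimal sketch of the profile configuration the scenario writes, with placeholder values:

```shell
# /etc/profile.d/vault.sh -- written by the scenario so an operator who
# SSHes into a target node can use the vault CLI immediately.
export PATH="$PATH:/opt/vault/bin"          # hypothetical install path
export VAULT_ADDR="http://127.0.0.1:8200"
export VAULT_TOKEN="<root-token-from-init>" # placeholder
```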
Ryan Cragun
27ab988205
[QT-695] Add config_mode variant to some scenarios (#26380)
Add `config_mode` variant to some scenarios so we can dynamically change
how we primarily configure the Vault cluster, either by a configuration
file or with environment variables.

As part of this change we also:
* Start consuming the Enos Terraform provider from the public Terraform
  registry.
* Remove the old `seal_ha_beta` variant as it is no longer required.
* Add a module that performs a `vault operator step-down` so that we can
  force leader elections in scenarios.
* Wire up an operator step-down into some scenarios to test both the old
  and new multiseal code paths during leader elections.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2024-04-22 12:34:47 -06:00
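The step-down module amounts to running `vault operator step-down` on the active node over SSH. A hedged sketch using the Enos provider's remote exec resource (variable names are illustrative):

```hcl
resource "enos_remote_exec" "vault_step_down" {
  environment = {
    VAULT_ADDR  = "http://127.0.0.1:8200"
    VAULT_TOKEN = var.vault_root_token
  }

  # Force a leader election by asking the active node to step down.
  inline = ["${var.vault_install_dir}/vault operator step-down"]

  transport = {
    ssh = {
      host = var.leader_host.public_ip
    }
  }
}
```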
Ryan Cragun
89c75d3d7c
[QT-637] Streamline our build pipeline (#24892)
Context
-------
Building and testing Vault artifacts on pull requests and merges is
responsible for about 1/3rd of our overall spend on Vault CI. Of the
artifacts that we ship as part of a release, we run Enos test scenarios
against the `linux/amd64` and `linux/arm64` binaries and their derivative
artifacts. The extended build artifacts for non-Linux platforms or less
common machine architectures are not tested at this time. They are built,
notarized, and signed as part of every pull request update and merge. As
we don't actually test these artifacts, the only gain we get from this
rather expensive behavior is that we won't merge a change that would prevent
Vault from building on one of the extended targets. Extended platform or
architecture changes are quite rare, so performing this work as frequently
as we do is costly in both money and developer time for little relative
safety benefit.

Goals
-----
Rethink and implement how and when we build binaries and artifacts of Vault
so that we can spend less money on repetitive work while also reducing
the time it takes for the build and test pipelines to complete.

Solution
--------
Instead of building all release artifacts on every push, we'll opt to build
only our testable (core) artifacts. With this change we are introducing a
bit of risk. We could merge a change that breaks an extended platform and
only find out after the fact when we trigger a complete build for a release.
We'll hedge against that risk by building all of the release targets on a
scheduled cadence to ensure that they are still buildable.

We'll make building all of the targets optional on any pull request by
use of a `build/all` label on the pull request.

Further considerations
----------------------
* We want to reduce the total number of workflows and runners for all of our
  pipelines if possible. As each workflow runner has infrastructure cost and
  runner time penalties, using a single runner over many is often preferred.
* Many of our job runners have been optimized for cost and performance. We
  should simplify the choices of which runners to use.
* CRT requires us to use the same build workflow in both CE and Ent.
  Historically that meant that modifying `build.yml` in CE would result in a
  merge conflict with `build.yml` in Ent, and break our merge workflows.
* Workflow flow control in both `build.yml` and `ci.yml` can be quite
  complicated, as each needs to maintain compatibility whether executed as CE
  or Ent, and when triggered with various Github events like pull_request,
  push, and workflow_call, each with their own requirements.
* Many jobs utilize similar patterns of flow control and metadata but are not
  reusable.
* Workflow call depth has a maximum of four, so we need to be quite
  considerate when calling other workflows.
* Called workflows can only have 10 inputs.

Implementation
--------------
* Refactor the `build.yml` workflow to be agnostic to whether or not it is
  executing in CE or Ent. That makes future updates to the build much easier
  as we won't have to worry about merge conflicts when the change is merged
  downstream.
* Extract common steps in workflows into composite actions that we can reuse.
* Fix bugs where some but not all workflows would use different Git
  references when building and testing a pull request.
* Rewrite the application, docs, and UI change helpers as a composite
  action. This allows us to re-use the same logic to make consistent behavior
  choices across build and CI.
* Combine several `build.yml` and `ci.yml` jobs into our final job.
  This reduces the number of workflows required for the same behavior while
  saving time overall.
* Update most of our action pins.

Results
-------

| Metric            | Before   | After   | Diff  |
|-------------------|----------|---------|-------|
| Duration:         | ~14-18m  | ~15-18m | ~ =   |
| Workflows:        | 43       | 18      | - 58% |
| Billable time:    | ~1h15m   | 16m     | - 79% |
| Saved artifacts:  | 34       | 12      | - 65% |

Infra costs should map closely to billable time.
Network I/O costs should map closely to the workflow count.
Storage costs should map directly with saved artifacts.

We could probably get parity with duration by getting more clever with
our UBI container build, as that's where we're seeing the increase. I'm
not yet concerned as it takes roughly the same time for this job to
complete as it did before.

While the CI workflow was not the focus of this PR, some shared
refactoring does show marginal improvements there.

| Metric            | Before   | After    | Diff   |
|-------------------|----------|----------|--------|
| Duration:         | ~24m     | ~12.75m  | - 47%  |
| Workflows:        | 55       | 47       | - 15%  |
| Billable time:    | ~4h20m   | ~3h36m   | - 17%  |

Further focus on streamlining the CI workflows would likely result in a
few more marginal improvements, but nothing on the order of what we've seen
with the build workflow.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2024-02-06 21:11:33 +00:00
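A sketch of how a `build/all` label can gate the extended targets in a workflow; the job layout here is illustrative, not the actual `build.yml`:

```yaml
jobs:
  build-extended:
    # Build non-Linux and uncommon-arch targets only when a PR opts in
    # with the build/all label, or on the scheduled verification run.
    if: |
      github.event_name == 'schedule' ||
      contains(github.event.pull_request.labels.*.name, 'build/all')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build-all-targets # hypothetical make target
```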
Mike Palmiotto
3389a572b9
enos: Add Default LCQ validation to autopilot upgrade scenario (#24602)
* enos: Add default lcq validation to autopilot upgrade scenario

* Add timeout/retries to default lcq autopilot test
2023-12-20 15:25:20 -07:00
Ryan Cragun
a087f7b267
[QT-627] enos: add pkcs11 seal testing with softhsm (#24349)
Add support for testing `+ent.hsm` and `+ent.hsm.fips1402` Vault editions
with `pkcs11` seal types utilizing a shared `softhsm` token. Softhsm2 is
a software HSM that will load seal keys from a local disk via pkcs11.
The pkcs11 seal implementation is fairly complex, as we have to create
one or more shared tokens with various keys and distribute them to all
nodes in the cluster before starting Vault. We also have to ensure that
each set's labels are unique.

We also make a few quality of life updates by utilizing globals for
variants that don't often change and update base versions for various
scenarios.

* Add `seal_pkcs11` module for creating a `pkcs11` seal key using
  `softhsm2` as our backing implementation.
* Require the latest enos provider to gain access to the `enos_user`
  resource to ensure correct ownership and permissions of the
  `softhsm2` data directory and files.
* Add `pkcs11` seal to all scenarios that support configuring a seal
  type.
* Extract system package installation out of the `vault_cluster` module
  and into its own `install_package` module that we can reuse.
* Fix a bug when using the local builder variant that mangled the path.
  This likely slipped in during the migration to auto-version bumping.
* Fix an issue where restarting Vault nodes with a socket audit device would
  fail because a socket listener wasn't available on all nodes. Now we
  start the socket listener on all nodes to ensure any node can become
  primary and "audit" to the socket listener.
* Remove unused attributes from some verify modules.
* Go back to using cheaper AWS regions.
* Use globals for variants.
* Update initial vault version for `upgrade` and `autopilot` scenarios.
* Update the consul versions for all scenarios that support a consul
  storage backend.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-12-08 14:00:45 -07:00
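For reference, a Vault `pkcs11` seal stanza backed by SoftHSM looks roughly like this; the slot and PIN values are placeholders and the library path varies by distro:

```hcl
seal "pkcs11" {
  lib            = "/usr/lib64/pkcs11/libsofthsm2.so" # distro-dependent path
  slot           = "735906218"                        # placeholder slot id
  pin            = "1234"                             # placeholder user PIN
  key_label      = "vault-hsm-key"
  hmac_key_label = "vault-hsm-hmac-key"
  generate_key   = "true" # create the keys on first start if missing
}
```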
Ryan Cragun
30a8435499
[QT-617] Add seal migration to seal_ha scenario (#23919)
Test HA seal migration in the `seal_ha` scenario by removing the primary seal,
ensuring seal rewrap has completed, and verifying that data written
through the original primary seal is available via the new primary seal.
We also add a verification of the seal type at various stages of the scenario.

* Allow configuring the seal alias and priority in the `start_vault`
  module.
* Add seal migration to `seal_ha` scenario.
* Verify the data written through the original primary seal after the
  seal migration.
* [QT-629] Verify the seal type at various stages in `seal_ha`.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-10-31 19:42:26 +00:00
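Waiting for rewrap to complete can be sketched as a polling loop against the Enterprise rewrap status endpoint; the field names here are assumed from the `sys/sealwrap/rewrap` API, and the in-repo module differs:

```shell
# Poll until the rewrap process has finished with no failed entries
# before verifying data through the new primary seal.
while :; do
  status=$(vault read -format=json sys/sealwrap/rewrap)
  if echo "$status" | jq -e \
    '.data.is_running == false and .data.entries.failed == 0' > /dev/null; then
    break
  fi
  sleep 2
done
```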
Ryan Cragun
a46def288f
[QT-616] Add seal_ha enos scenario (#23812)
Add support for testing Vault Enterprise with HA seal support by adding
a new `seal_ha` scenario that configures more than one seal type for a
Vault cluster. We also extend existing scenarios to support testing
with or without the Seal HA code path enabled.

* Extract starting vault into a separate enos module to allow for better
  handling of complex clusters that need to be started more than once.
* Extract seal key creation into a separate module and provide it to
  target modules. This allows us to create more than one seal key and
  associate it with instances. This also allows us to forego creating
  keys when using shamir seals.
* [QT-615] Add support for configuring more than one seal type to the
  `vault_cluster` module.
* [QT-616] Add `seal_ha` scenario
* [QT-625] Add `seal_ha_beta` variant to existing scenarios to test with
  both code paths.
* Unpin action-setup-terraform
* Add `kms:TagResource` to service user IAM profile

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-10-26 15:13:30 -06:00
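Configuring more than one seal boils down to multiple named `seal` stanzas with distinct priorities in the server config, roughly as below; the key ids are placeholders, and multiseal itself also has to be switched on (a beta flag when this scenario was written):

```hcl
# Priority 1 is the primary seal; lower-priority seals remain available
# for unwrapping and for migration when the primary is removed.
seal "awskms" {
  name       = "awskms_primary"
  priority   = "1"
  kms_key_id = "alias/scenario-seal-primary"   # placeholder
}

seal "awskms" {
  name       = "awskms_secondary"
  priority   = "2"
  kms_key_id = "alias/scenario-seal-secondary" # placeholder
}
```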
Ryan Cragun
9afd5e52ae
[QT-602] Don't fail if scenarios cannot completely destroy infra (#23473)
Sometimes destroying resources in AWS will fail because of unexpected
dependency violations or other such nonsense. When this happens, the
behavior of Vault that we wanted to verify has already been successfully
verified, but the required workflow still fails. This change
allows us to succeed if `enos scenario launch` completes while allowing
`enos scenario destroy` to fail. We still notify our Slack channel on
destroy failures so that we can investigate issues, but it won't
require a PR author to retry.

* Execute `enos scenario launch` instead of `enos scenario run` to allow
  for very occasional issues when tearing down test infrastructure.
* Improve an error message when getting secondary cluster IP addresses.
* Don't race to get secondary cluster IP addresses.
* Add secondary token to replication scenario outputs.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-10-03 13:04:55 -06:00
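The distinction matters because `enos scenario run` is effectively launch plus destroy in one shot, so a destroy-time AWS hiccup fails the whole job. Split apart, the workflow step looks roughly like this (the scenario filter is illustrative):

```shell
# Launch performs the full create-and-verify lifecycle; a non-zero exit
# here is a real failure.
enos scenario launch --chdir ./enos smoke arch:amd64 backend:raft

# Destroy is best-effort: report failures for investigation, but don't
# fail the workflow over a dependency-violation teardown error.
if ! enos scenario destroy --chdir ./enos smoke arch:amd64 backend:raft; then
  echo "destroy failed; notifying the team channel" # placeholder notification
fi
```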
Ryan Cragun
1b321e3e7e
test: restart socket sink if it's not listening (#23397)
Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-09-28 22:20:24 +00:00
Ryan Cragun
460b5de47b
test: increase wait timers in new modules (#23355)
Increase default retries for modules used in replication.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-09-27 17:19:57 -06:00
Ryan Cragun
5cdce48a6a
replication: wait longer for replication to resync (#23336)
Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-09-27 20:50:28 +00:00
Ryan Cragun
391cc1157a
[QT-602] Run proxy and agent test scenarios (#23176)
Update our `proxy` and `agent` scenarios to support new variants and
perform baseline verification along with their scenario-specific verification.
We integrate these updated scenarios into the pipeline by adding them
to the artifact samples.

We've also improved the reliability of the `autopilot` and `replication`
scenarios by refactoring our IP address gathering. Previously, we'd ask
vault for the primary IP address and use some Terraform logic to determine
followers. The leader IP address gathering script was also implicitly
responsible for ensuring that a found leader was within a given group of
hosts, and thus waiting for a given cluster to have a leader, and also for
doing some arithmetic and outputting `replication` specific output data.
We've broken these responsibilities into individual modules, improved their
error messages, and fixed various races and bugs, including:
* Fix a race between creating the file audit device and installing and starting
  vault in the `replication` scenario.
* Fix how we determine our leader and follower IP addresses. We now query
  vault instead of a prior implementation that inferred the followers and sometimes
  did not allow all nodes to be an expected leader.
* Fix a bug where we'd always fail on the first wrong condition
  in the `vault_verify_performance_replication` module.

We also performed some maintenance tasks on Enos scenarios by updating our
references from `oss` to `ce` to handle the naming and license changes. We
also enabled `shellcheck` linting for enos module scripts.

* Rename `oss` to `ce` for license and naming changes.
* Convert template enos scripts to scripts that take environment
  variables.
* Add `shellcheck` linting for enos module scripts.
* Add additional `backend` and `seal` support to `proxy` and `agent`
  scenarios.
* Update scenarios to include all baseline verification.
* Add `proxy` and `agent` scenarios to artifact samples.
* Remove IP address verification from the `vault_get_cluster_ips`
  modules and implement a new `vault_wait_for_leader` module.
* Determine follower IP addresses by querying vault in the
  `vault_get_cluster_ips` module.
* Move replication-specific behavior out of the `vault_get_cluster_ips`
  module and into its own `replication_data` module.
* Extend initial version support for the `upgrade` and `autopilot`
  scenarios.

We also discovered an issue with undo_logs that is described in
VAULT-20259. As such, we've disabled the undo_logs check until
it has been fixed.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-09-26 15:37:28 -06:00
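Querying vault for the leader rather than inferring it can be sketched like this; the in-repo modules wrap the same idea with retries and host-group checks:

```shell
# Ask the cluster who the leader is, then classify hosts by exclusion.
leader_addr=$(vault read -format=json sys/leader | jq -r '.data.leader_address')

# hosts.txt holds "name ip" pairs for the cluster (illustrative input).
while read -r name ip; do
  if echo "$leader_addr" | grep -q "$ip"; then
    echo "leader: $name ($ip)"
  else
    echo "follower: $name ($ip)"
  fi
done < hosts.txt
```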
Ryan Cragun
5449a99aba
test: wait for nc to be listening before enabling auditor (#23142)
Rather than assuming a short sleep will work, we instead wait until netcat is listening on the socket. We've also configured the netcat listener to persist after the first connection, which allows Vault and us to check the connection without the process closing.

As we implemented this we also ran into AWS issues in us-east-1 and us-west-2, so we've changed our deploy regions until those issues are resolved.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-09-18 14:47:13 -06:00
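A hedged sketch of the wait-until-listening pattern (the port is a placeholder, and netcat flag spellings differ between implementations):

```shell
# -k keeps the listener alive after the first connection closes, so the
# socket audit device can reconnect without the process exiting.
nohup nc -lk 127.0.0.1 9090 > /tmp/audit-socket.log 2>&1 &

# Wait until the port is actually listening instead of sleeping and hoping.
until ss -lnt | grep -q ':9090'; do
  sleep 1
done

vault audit enable socket address=127.0.0.1:9090 socket_type=tcp
```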
Marc Boudreau
00bbc0bd65
adjust nc command to ensure ssh session is not blocked (#23139) 2023-09-18 10:14:26 -06:00
Ryan Cragun
464aeebddc
test: fix netcat install and listen for socket audit device (#23134)
Fix an issue where netcat would not be installed correctly with certain
package managers. We also fix an issue where SSH cannot exit because nc
is waiting for SIGHUP, resulting in scenarios running forever.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-09-15 18:33:47 -06:00
Marc Boudreau
e30c50321c
enable all audit devices in Enos's vault_cluster module (#22408) 2023-09-15 10:44:23 -04:00
Ryan Cragun
d634700c9e
artifactory: handle all package lookups (#22963)
Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-09-11 18:05:58 +00:00
Ryan Cragun
8edc24c7e1
test: fix release testing from artifactory (#22941)
Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-09-08 20:47:27 +00:00
Ryan Cragun
5f1d2c56a2
[QT-506] Use enos scenario samples for testing (#22641)
Replace our prior implementation of Enos test groups with the new Enos
sampling feature. With this feature we're able to describe which
scenarios and variant combinations are valid for a given artifact and
allow enos to create a valid sample field (a matrix of all compatible
scenarios) and take an observation (select some to run) for us. This
ensures that every valid scenario and variant combination will
now be a candidate for testing in the pipeline. See QT-504[0] for further
details on the Enos sampling capabilities.

Our prior implementation only tested the amd64 and arm64 zip artifacts,
as well as the Docker container. We now include the following new artifacts
in the test matrix:
* CE Amd64 Debian package
* CE Amd64 RPM package
* CE Arm64 Debian package
* CE Arm64 RPM package

Each artifact includes a sample definition for both pre-merge/post-merge
(build) and release testing.

Changes:
* Remove the hand-crafted `enos-run-matrices` CI matrix targets and replace
  them with per-artifact samples.
* Use enos sampling to generate different sample groups on all pull
  requests.
* Update the enos scenario matrices to handle HSM and FIPS packages.
* Simplify enos scenarios by using shared globals instead of
  cargo-culted locals.

Note: This will require coordination with vault-enterprise to ensure a
smooth migration to the new system. Integrating new scenarios or
modifying existing scenarios/variants should be much smoother after this
initial migration.

[0] https://github.com/hashicorp/enos/pull/102

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-09-08 12:46:32 -06:00
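An Enos sample pairs an artifact with the scenario and variant subsets that are valid for it; schematically (names and variants are illustrative):

```hcl
sample "build_ce_linux_amd64_deb" {
  # Only the scenario/variant combinations enumerated here become
  # candidates when enos takes an observation for this artifact.
  subset "smoke" {
    matrix {
      arch          = ["amd64"]
      artifact_type = ["package"]
      distro        = ["ubuntu"]
      edition       = ["ce"]
    }
  }
}
```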
Sarah Thompson
a9a4b0b9ff
Onboard Vault to CRT version bump automation (#18311)
* adding new version bump refactoring

* address comments

* remove changes used for testing

* add the version bump event!

* fix local enos scenarios

* remove unnecessary local get_local_metadata steps from scenarios
* add version base, pre, and meta to the get_local_metadata module
* use the get_local_metadata module in the local builder for version
  metadata
* update the version verifier to always require a build date

Signed-off-by: Ryan Cragun <me@ryan.ec>

* Update to embed the base version from the VERSION file directly into version.go.
This ensures that any go tests can use the same (valid) version as CI and so can local builds and local enos runs.
We still want to be able to set a default metadata value in version_base.go as this is not something that we set in the VERSION file - we pass this in as an ldflag in CI (matters more for ENT but we want to keep these files in sync across repos).

* update comment

* fixing bad merge

* removing actions-go-build as it won't work with the latest go caching changes

* fix logic for getting version in enos-lint.yml

* fix version number

* removing unneeded module

---------

Signed-off-by: Ryan Cragun <me@ryan.ec>
Co-authored-by: Claire <claire@hashicorp.com>
Co-authored-by: Ryan Cragun <me@ryan.ec>
2023-09-06 17:08:48 +01:00
hashicorp-copywrite[bot]
0b12cdcfd1
[COMPLIANCE] License changes (#22290)
* Adding explicit MPL license for sub-package.

This directory and its subdirectories (packages) contain files licensed with the MPLv2 `LICENSE` file in this directory and are intentionally licensed separately from the BSL `LICENSE` file at the root of this repository.

* Adding explicit MPL license for sub-package.

This directory and its subdirectories (packages) contain files licensed with the MPLv2 `LICENSE` file in this directory and are intentionally licensed separately from the BSL `LICENSE` file at the root of this repository.

* Updating the license from MPL to Business Source License.

Going forward, this project will be licensed under the Business Source License v1.1. Please see our blog post for more details at https://hashi.co/bsl-blog, FAQ at www.hashicorp.com/licensing-faq, and details of the license at www.hashicorp.com/bsl.

* add missing license headers

* Update copyright file headers to BUS-1.1

* Fix test that expected exact offset on hcl file

---------

Co-authored-by: hashicorp-copywrite[bot] <110428419+hashicorp-copywrite[bot]@users.noreply.github.com>
Co-authored-by: Sarah Thompson <sthompson@hashicorp.com>
Co-authored-by: Brian Kassouf <bkassouf@hashicorp.com>
2023-08-10 18:14:03 -07:00
Rebecca Willett
6654c425d2
Pass consul license in Enos scenarios that have backend in the matrix (#22177) 2023-08-07 15:23:47 -04:00
Ryan Cragun
6b21994d76
[QT-588] test: fix drift between enos directories (#21695)
* Sync missing scenarios and modules
* Clean up variables and example vars
* Add a `lint` make target for enos
* Update enos `fmt` workflow to run the `lint` target.
* Always use ipv4 addresses in target security groups.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-07-20 14:09:44 -06:00
Ryan Cragun
fd1683698b
test: always use a unique id for target resources (#21472)
Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-06-27 12:30:56 -04:00
Ryan Cragun
aed2783658
enos: use on-demand targets (#21459)
Add an updated `target_ec2_instances` module that is capable of
dynamically splitting target instances over subnet/az's that are
compatible with the AMI architecture and the associated instance type
for the architecture. Use the `target_ec2_instances` module where
necessary. Ensure that `raft` storage scenarios don't provision
unnecessary infrastructure with a new `target_ec2_shim` module.

After a lot of trial, the state of EC2 spot instance capacity, the
associated APIs, and the current support for different fleet types in the
AWS Terraform provider have proven to make using spot instances for
scenario targets too unreliable.

The current state of each method:
* `target_ec2_fleet`: unusable due to the fact that the `instant` type
  does not guarantee fulfillment of either `spot` or `on-demand`
  instance request types. The module does support both `on-demand` and
  `spot` request types and is capable of bidding across a maximum of
  four availability zones, which makes it an attractive choice if the
  `instant` type would always fulfill requests. Perhaps a `request` type
  with `wait_for_fulfillment` option like `aws_spot_fleet_request` would
  make it more viable for future consideration.
* `target_ec2_spot_fleet`: more reliable if bidding for target instances
  that have capacity in the chosen zone. Issues in the AWS provider
  prevent us from bidding across multiple zones successfully. Over the
  last 2-3 months target capacity for the instance types we'd prefer to
  use has dropped dramatically and the price is near-or-at on-demand.
  The volatility for nearly no cost savings means we should put this
  option on the shelf for now.
* `target_ec2_instances`: the most reliable method we've got. It is now
  capable of automatically determining which subnets and availability
  zones to provision targets in and has been updated to be usable for
  both Vault and Consul targets. By default we use the cheapest medium
  instance types that we've found are reliable for testing Vault.

* Update .gitignore
* enos/modules/create_vpc: create a subnet for every availability zone
* enos/modules/target_ec2_fleet: bid across the maximum of four
  availability zones for targets
* enos/modules/target_ec2_spot_fleet: attempt to make the spot fleet bid
  across more availability zones for targets
* enos/modules/target_ec2_instances: create module to use
  ec2:RunInstances for scenario targets
* enos/modules/target_ec2_shim: create shim module to satisfy the
  target module interface
* enos/scenarios: use target_ec2_shim for backend targets on raft
  storage scenarios
* enos/modules/az_finder: remove unused module

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-06-26 16:06:03 -06:00
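Dynamically splitting targets across compatible availability zones leans on the AWS provider's instance-type-offerings data source; a trimmed sketch:

```hcl
# Which availability zones actually offer the instance type chosen for
# this AMI architecture?
data "aws_ec2_instance_type_offerings" "by_az" {
  filter {
    name   = "instance-type"
    values = [var.instance_type] # e.g. a medium type matched to amd64/arm64
  }

  location_type = "availability-zone"
}

# Spread targets over the usable zones (subnet lookup omitted).
locals {
  availability_zones = data.aws_ec2_instance_type_offerings.by_az.locations
}
```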
Ryan Cragun
8d22142a3e
[QT-572][VAULT-17391] enos: use ec2 fleets for consul storage scenarios (#21400)
Begin the process of migrating away from the "strongly encouraged not to
use"[0] EC2 spot fleet API to the more modern `ec2:CreateFleet`.
Unfortunately, the `instant` type fleet does not guarantee fulfillment
with either on-demand or spot types. We'll need to add a feature similar
to `wait_for_fulfillment` on the `spot_fleet_request` resource[1] to
`ec2_fleet` before we can rely on it.

We also update the existing target fleets to support provisioning generic
targets. This has allowed us to remove our usage of `terraform-enos-aws-consul`
and replace it with a smaller `backend_consul` module in-repo.

We also remove `terraform-enos-aws-infra` and replace it with two smaller
in-repo modules `ec2_info` and `create_vpc`. This has allowed us to simplify
the vpc resources we use for each scenario, which in turn allows us to
not rely on flaky resources.

As part of this refactor we've also made it possible to provision
targets using different distro versions.

[0] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html#which-spot-request-method-to-use
[1] https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/spot_fleet_request#wait_for_fulfillment

* enos/consul: add `backend_consul` module that accepts target hosts.
* enos/target_ec2_spot_fleet: add support for consul networking.
* enos/target_ec2_spot_fleet: add support for customizing cluster tag
  key.
* enos/scenarios: create `target_ec2_fleet` which uses a more modern
  `ec2_fleet` API.
* enos/create_vpc: replace `terraform-enos-aws-infra` with smaller and
  simplified version. Flatten the networking to a single route on the
  default route table and a single subnet.
* enos/ec2_info: add a new module to give us useful ec2 information
  including AMI id's for various arch/distro/version combinations.
* enos/ci: update service user role to allow for managing ec2 fleets.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-06-22 12:42:21 -06:00
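The modern fleet API looks roughly like this in Terraform; capacities and the launch template are placeholders:

```hcl
resource "aws_ec2_fleet" "targets" {
  # "instant" returns instances synchronously but, as noted above, does
  # not guarantee fulfillment for either spot or on-demand requests.
  type = "instant"

  launch_template_config {
    launch_template_specification {
      launch_template_id = aws_launch_template.target.id
      version            = aws_launch_template.target.latest_version
    }
  }

  target_capacity_specification {
    default_target_capacity_type = "spot"
    total_target_capacity        = 3
  }
}
```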
Ryan Cragun
2ec5a28f51
test: handle occasional lower capacity in zone d (#21143)
We've seen instances where we try to schedule a spot fleet in the
us-east-1d zone of the Vault CI AWS account and cannot get capacity for our
instance type. That zone currently supports far fewer instance types, so
we'll bump our max bid to handle cases where slightly more expensive
instances are available. Most of the time we'll be using much cheaper
instances, but it's better to pay a fraction of a cent more than to have to
retry the pipeline. As such, we increase our max bid price to something
that will almost certainly be fulfilled.

We also allow our package installer to proceed when cloud-init does not
update sources as we expect. This should handle occasional failures
where cloud-init doesn't update the sources within a reasonable amount
of time.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-06-12 10:49:58 -06:00
Ryan Cragun
27621e05d6
[QT-527][QT-509] enos: use latest version of enos-provider (#21129)
Use the latest version of enos-provider and upstream consul module.
These changes allow us to configure the vault log level in configuration
and also support configuring consul with an enterprise license.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-06-12 10:00:16 -04:00
Ryan Cragun
b0aa808baa
[QT-509] enos: pin to enos-provider < 0.4.0 (#21108)
Temporarily pin the enos provider to < 0.4.0 to gracefully roll out new
provider changes.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-06-09 13:06:00 -06:00
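The pin itself is a one-line version constraint in `required_providers`; the provider source shown is assumed from the private registry in use at the time:

```hcl
terraform {
  required_providers {
    enos = {
      source  = "app.terraform.io/hashicorp-qti/enos" # assumed source address
      version = "< 0.4.0"
    }
  }
}
```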
Jaymala
8512858583
Fix autopilot scenario failures (#21025)
* Fix autopilot scenario failures

Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>
Signed-off-by: Mike Baum <mike.baum@hashicorp.com>

* use bash instead of sh in create logs dir shell script
* ensure to only enable the file audit device in the upgrade cluster of the autopilot scenario if the variable is enabled

---------

Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>
Signed-off-by: Mike Baum <mike.baum@hashicorp.com>
Co-authored-by: Mike Baum <mike.baum@hashicorp.com>
2023-06-06 17:03:50 -04:00
Mike Baum
dbe41c4fee
[QT-426] Always create the file audit directory (#20997)
* Always create the file audit directory
* Create audit file directory after unsealing the leader
2023-06-05 20:25:58 -04:00
Mike Baum
2c9a75b093
[QT-426] Ensure file audit device is only enabled if the leader is initialized. (#20974) 2023-06-03 13:50:28 -04:00
Mike Baum
0115b5e43a
[QT-426] Add support for enabling the file audit device for enos scenarios (#20552) 2023-06-02 13:07:33 -04:00
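Enabling the device itself is a single CLI call once the leader is initialized and unsealed; the directory handling is what the commits above fix. A sketch with placeholder paths:

```shell
# The log directory must exist and be writable by the vault user before
# the device is enabled, or enabling will fail.
sudo install -d -o vault -g vault /var/log/vault

vault audit enable file file_path=/var/log/vault/vault_audit.log
```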
Ryan Cragun
cb23fcd83f
test: use correct pool allocation for spot strategy (#20593)
Determine the allocation pool size for the spot fleet by the allocation
strategy. This allows us to ensure a consistent attribute plan during
re-runs which avoid rebuilding the target fleets.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-05-16 14:00:20 -06:00
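Schematically, the fix keys the pool count off the strategy so the planned value stays constant between runs (attribute values are illustrative):

```hcl
resource "aws_spot_fleet_request" "targets" {
  allocation_strategy = var.allocation_strategy

  # instance_pools_to_use_count only applies to the lowestPrice strategy;
  # deriving it from the strategy keeps re-run plans stable so the fleet
  # isn't needlessly rebuilt.
  instance_pools_to_use_count = var.allocation_strategy == "lowestPrice" ? 1 : null

  iam_fleet_role  = var.fleet_role_arn
  target_capacity = 3

  # launch specifications omitted
}
```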
Jaymala
1d5325f255
[QT-554] Remove Terraform validations from Enos replication scenario (#20570)
Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>
2023-05-12 16:06:46 -04:00
Ryan Cragun
57661b8da3
[QT-530] enos: allow-list all public IP addresses (#20304)
The security groups that allow access to remote machines in Enos
scenarios have been configured to only allow port 22 (SSH) from the
public IP address of machine executing the Enos scenario. To achieve
this we previously utilized the `enos_environment.public_ip_address`
attribute. Sometime in mid March we started seeing sporadic SSH i/o
timeout errors when attempting to execute Enos resources against SSH
transport targets. We've only ever seen this when communicating from
Azure hosted runners to AWS hosted machines.

While testing we were able to confirm that in some cases the public IP
address resolved using DNS over UDP4 to Google and OpenDNS name servers
did not match what was resolved when using the HTTPS/TCP IP address
service hosted by AWS. The Enos data source was implemented in a way
that we'd attempt resolution of a single name server and only attempt
resolving from the next if previous name server could not get a result.
We'd then allow-list that single IP address. That's a problem if we can
resolve two different public IP addresses depending our endpoint address.

This change utilizes the new `enos_environment.public_ip_addresses`
attribute and subsequent behavior change. Now the data source will
attempt to resolve our public IP address via name servers hosted by
Google, OpenDNS, Cloudflare, and AWS. We then return a unique set of
these IP addresses and allow-list all of them in our security group. It
is our hope that this resolves the i/o timeout errors, which seem to be
caused by the security group black-holing our attempted access
because the IP we resolved does not match the address we're actually
exiting with.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-04-23 16:25:32 -06:00
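The resulting allow-list is just every resolved address expanded to a /32; a trimmed sketch:

```hcl
data "enos_environment" "localhost" {}

resource "aws_security_group_rule" "ssh_ingress" {
  security_group_id = var.security_group_id
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 22
  to_port           = 22

  # Allow every public IP the data source resolved, not just the first,
  # so a DNS-vs-HTTPS resolution mismatch can't black-hole our access.
  cidr_blocks = [
    for ip in data.enos_environment.localhost.public_ip_addresses : "${ip}/32"
  ]
}
```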
Ryan Cragun
1329a6b506
[QT-525] enos: use spot instances for Vault targets (#20037)
The previous strategy for provisioning infrastructure targets was to use
the cheapest instances that could reliably perform as Vault cluster
nodes. With this change we introduce a new model for target node
infrastructure: we've replaced on-demand instances with a spot
fleet. While the spot price fluctuates based on dynamic pricing,
capacity, region, instance type, and platform, cost savings for our
most common combinations range between 20-70%.

This change only includes spot fleet targets for Vault clusters.
We'll be updating our Consul backend bidding in another PR.

* Create a new `vault_cluster` module that handles installation,
  configuration, initializing, and unsealing Vault clusters.
* Create a `target_ec2_instances` module that can provision a group of
  instances on-demand.
* Create a `target_ec2_spot_fleet` module that can bid on a fleet of
  spot instances.
* Extend every Enos scenario to utilize the spot fleet target acquisition
  strategy and the `vault_cluster` module.
* Update our Enos CI modules to handle both the `aws-nuke` permissions
  and also the privileges to provision spot fleets.
* Only use us-east-1 and us-west-2 in our scenario matrices as costs are
  lower than us-west-1.

Signed-off-by: Ryan Cragun <me@ryan.ec>
2023-04-13 15:44:43 -04:00
Mike Baum
5d706c44d0
[QT-523] Remove copyright/license header from raft config used in the Docker/K8S integration test (#19584) 2023-03-16 17:39:59 -04:00
Hamid Ghaf
e55c18ed12
adding copyright header (#19555)
* adding copyright header

* fix fmt and a test
2023-03-15 09:00:52 -07:00
Jaymala
bbc6af3444
Fetch replication status in its own resource (#19132)
* Fix json decode errors for Enos replication verification module

Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>

* Rewrite the pr connection check script

Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>

* Do not fail on get replication status

Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>

---------

Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>
2023-02-14 12:21:29 -05:00
Jaymala
65cb09c75f
Add Vault log level support (#19083)
Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>
2023-02-08 17:41:16 -05:00
Mike Baum
6b7787c86a
[QT-304] Add enos ui scenario (#18518)
* Add enos ui scenario
* Add github action for running the UI scenario
2023-02-03 09:55:06 -05:00
Jaymala
748ee9c7d5
Update replication verification to check connection status (#18921)
* Update replication verification to check connection status

Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>

* Output replication status after verifying connection

Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>

---------

Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>
2023-01-31 16:23:46 -05:00
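Checking the connection state before trusting the rest of the status output can be sketched as follows; the field paths are assumed from the `sys/replication/performance/status` API:

```shell
# Require a connected secondary before asserting anything else about
# replication, then emit the full status for the scenario logs.
status=$(vault read -format=json sys/replication/performance/status)

echo "$status" | jq -e '
  .data.mode == "primary" and
  all(.data.secondaries[]?; .connection_status == "connected")
' > /dev/null || { echo "replication secondary not connected" >&2; exit 1; }

echo "$status" | jq '.data'
```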
Hamid Ghaf
86d356e404
enos: default undo-logs to cluster behavior (#18771)
* enos: default undo-logs to cluster behavior

* change a step dependency

* rearrange steps, wait a bit longer for undo logs
2023-01-20 10:25:14 -05:00