Replace our prior implementation of Enos test groups with the new Enos
sampling feature. With this feature we're able to describe which
scenarios and variant combinations are valid for a given artifact and
allow enos to create a valid sample field (a matrix of all compatible
scenarios) and take an observation (select some to run) for us. This
ensures that every valid scenario and variant combination will
now be a candidate for testing in the pipeline. See QT-504[0] for further
details on the Enos sampling capabilities.
Our prior implementation only tested the amd64 and arm64 zip artifacts,
as well as the Docker container. We now include the following new artifacts
in the test matrix:
* CE Amd64 Debian package
* CE Amd64 RPM package
* CE Arm64 Debian package
* CE Arm64 RPM package
Each artifact includes a sample definition for both pre-merge/post-merge
(build) and release testing.
Changes:
* Remove the hand crafted `enos-run-matrices` ci matrix targets and replace
them with per-artifact samples.
* Use enos sampling to generate different sample groups on all pull
requests.
* Update the enos scenario matrices to handle HSM and FIPS packages.
* Simplify enos scenarios by using shared globals instead of
cargo-culted locals.
Note: This will require coordination with vault-enterprise to ensure a
smooth migration to the new system. Integrating new scenarios or
modifying existing scenarios/variants should be much smoother after this
initial migration.
[0] https://github.com/hashicorp/enos/pull/102
Signed-off-by: Ryan Cragun <me@ryan.ec>
* adding new version bump refactoring
* address comments
* remove changes used for testing
* add the version bump event!
* fix local enos scenarios
* remove unnecessary local get_local_metadata steps from scenarios
* add version base, pre, and meta to the get_local_metadata module
* use the get_local_metadata module in the local builder for version
metadata
* update the version verifier to always require a build date
Signed-off-by: Ryan Cragun <me@ryan.ec>
* Update to embed the base version from the VERSION file directly into version.go.
This ensures that any go tests can use the same (valid) version as CI and so can local builds and local enos runs.
We still want to be able to set a default metadata value in version_base.go as this is not something that we set in the VERSION file - we pass this in as an ldflag in CI (matters more for ENT but we want to keep these files in sync across repos).
* update comment
* fixing bad merge
* removing actions-go-build as it won't work with the latest go caching changes
* fix logic for getting version in enos-lint.yml
* fix version number
* removing unneeded module
---------
Signed-off-by: Ryan Cragun <me@ryan.ec>
Co-authored-by: Claire <claire@hashicorp.com>
Co-authored-by: Ryan Cragun <me@ryan.ec>
* Adding explicit MPL license for sub-package.
This directory and its subdirectories (packages) contain files licensed with the MPLv2 `LICENSE` file in this directory and are intentionally licensed separately from the BSL `LICENSE` file at the root of this repository.
* Adding explicit MPL license for sub-package.
This directory and its subdirectories (packages) contain files licensed with the MPLv2 `LICENSE` file in this directory and are intentionally licensed separately from the BSL `LICENSE` file at the root of this repository.
* Updating the license from MPL to Business Source License.
Going forward, this project will be licensed under the Business Source License v1.1. Please see our blog post for more details at https://hashi.co/bsl-blog, FAQ at www.hashicorp.com/licensing-faq, and details of the license at www.hashicorp.com/bsl.
* add missing license headers
* Update copyright file headers to BUS-1.1
* Fix test that expected exact offset on hcl file
---------
Co-authored-by: hashicorp-copywrite[bot] <110428419+hashicorp-copywrite[bot]@users.noreply.github.com>
Co-authored-by: Sarah Thompson <sthompson@hashicorp.com>
Co-authored-by: Brian Kassouf <bkassouf@hashicorp.com>
* Sync missing scenarios and modules
* Clean up variables and examples vars
* Add a `lint` make target for enos
* Update enos `fmt` workflow to run the `lint` target.
* Always use ipv4 addresses in target security groups.
Signed-off-by: Ryan Cragun <me@ryan.ec>
Add an updated `target_ec2_instances` module that is capable of
dynamically splitting target instances over subnet/az's that are
compatible with the AMI architecture and the associated instance type
for the architecture. Use the `target_ec2_instances` module where
necessary. Ensure that `raft` storage scenarios don't provision
unnecessary infrastructure with a new `target_ec2_shim` module.
After a lot of trial, the state of Ec2 spot instance capacity, their
associated APIs, and current support for different fleet types in AWS
Terraform provider, have proven to make using spot instances for
scenario targets too unreliable.
The current state of each method:
* `target_ec2_fleet`: unusable due to the fact that the `instant` type
does not guarantee fulfillment of either `spot` or `on-demand`
instance request types. The module does support both `on-demand` and
`spot` request types and is capable of bidding across a maximum of
four availability zones, which makes it an attractive choice if the
`instant` type would always fulfill requests. Perhaps a `request` type
with `wait_for_fulfillment` option like `aws_spot_fleet_request` would
make it more viable for future consideration.
* `target_ec2_spot_fleet`: more reliable if bidding for target instances
that have capacity in the chosen zone. Issues in the AWS provider
prevent us from bidding across multiple zones succesfully. Over the
last 2-3 months target capacity for the instance types we'd prefer to
use has dropped dramatically and the price is near-or-at on-demand.
The volatility for nearly no cost savings means we should put this
option on the shelf for now.
* `target_ec2_instances`: the most reliable method we've got. It is now
capable of automatically determing which subnets and availability
zones to provision targets in and has been updated to be usable for
both Vault and Consul targets. By default we use the cheapest medium
instance types that we've found are reliable to test vault.
* Update .gitignore
* enos/modules/create_vpc: create a subnet for every availability zone
* enos/modules/target_ec2_fleet: bid across the maximum of four
availability zones for targets
* enos/modules/target_ec2_spot_fleet: attempt to make the spot fleet bid
across more availability zones for targets
* enos/modules/target_ec2_instances: create module to use
ec2:RunInstances for scenario targets
* enos/modules/target_ec2_shim: create shim module to satisfy the
target module interface
* enos/scenarios: use target_ec2_shim for backend targets on raft
storage scenarios
* enos/modules/az_finder: remove unsed module
Signed-off-by: Ryan Cragun <me@ryan.ec>
We seem to hit occasional capacity issues when attempting to launch spot
fleets with arm64 instance types. After checking pricing in the regions
that we use, it appears that current and older generation amd64 t2 and
t3 instance types are running at quite a discount whereas t4 arm64
instances are barely under on-demand price, suggesting limited capacity
for arm64 spot instances at this time. We'll change our default backend
instance architecture to amd64 to bid for the cheaper t2 and t3
instances and increase our `max_price` globally to that of a RHEL
machine running on-demand with a t3.medium.
Signed-off-by: Ryan Cragun <me@ryan.ec>
Begin the process of migrating away from the "strongly encouraged not to
use"[0] Ec2 spot fleet API to the more modern `ec2:CreateFleet`.
Unfortuantely the `instant` type fleet does not guarantee fulfillment
with either on-demand or spot types. We'll need to add a feature similar
to `wait_for_fulfillment` on the `spot_fleet_request` resource[1] to
`ec2_fleet` before we can rely on it.
We also update the existing target fleets to support provisioning generic
targets. This has allowed us to remove our usage of `terraform-enos-aws-consul`
and replace it with a smaller `backend_consul` module in-repo.
We also remove `terraform-enos-aws-infra` and replace it with two smaller
in-repo modules `ec2_info` and `create_vpc`. This has allowed us to simplify
the vpc resources we use for each scneario, which in turn allows us to
not rely on flaky resources.
As part of this refactor we've also made it possible to provision
targets using different distro versions.
[0] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html#which-spot-request-method-to-use
[1] https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/spot_fleet_request#wait_for_fulfillment
* enos/consul: add `backend_consul` module that accepts target hosts.
* enos/target_ec2_spot_fleet: add support for consul networking.
* enos/target_ec2_spot_fleet: add support for customizing cluster tag
key.
* enos/scenarios: create `target_ec2_fleet` which uses a more modern
`ec2_fleet` API.
* enos/create_vpc: replace `terraform-enos-aws-infra` with smaller and
simplified version. Flatten the networking to a single route on the
default route table and a single subnet.
* enos/ec2_info: add a new module to give us useful ec2 information
including AMI id's for various arch/distro/version combinations.
* enos/ci: update service user role to allow for managing ec2 fleets.
Signed-off-by: Ryan Cragun <me@ryan.ec>
We seen instances where we try to schedule a spot fleet in the
us-east-1d of the vault CI AWS account and cannot get capacity for our
instance type. That zone currently supports far fewer instance types so
we'll bump our max bid to handle cases where slightly more expensive
instances are available. Most of the time we'll be using much cheaper
instances but it's better to pay a fraction of a cent more than have to
retry the pipeline. As such, we increase our max bid price to something
that will almost certainly be fullfilled.
We also allow our package installer to go ahead when cloud init does not
update sources like we expect. This should handle occasional failures
where cloud-init doesn't update the sources within a reasonable amount
of time.
Signed-off-by: Ryan Cragun <me@ryan.ec>
Use the latest version of enos-provider and upstream consul module.
These changes allow us to configure the vault log level in configuration
and also support configuring consul with an enterprise license.
Signed-off-by: Ryan Cragun <me@ryan.ec>
* Fix autopilot scenario failures
Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>
Signed-off-by: Mike Baum <mike.baum@hashicorp.com>
* use bash instead of sh in create logs dir shell script
* ensure to only enable the file audit device in the upgrade cluster of the autopilot scenario if the variable is enabled
---------
Signed-off-by: Jaymala Sinha <jaymala@hashicorp.com>
Signed-off-by: Mike Baum <mike.baum@hashicorp.com>
Co-authored-by: Mike Baum <mike.baum@hashicorp.com>
Determine the allocation pool size for the spot fleet by the allocation
strategy. This allows us to ensure a consistent attribute plan during
re-runs which avoid rebuilding the target fleets.
Signed-off-by: Ryan Cragun <me@ryan.ec>
The security groups that allow access to remote machines in Enos
scenarios have been configured to only allow port 22 (SSH) from the
public IP address of machine executing the Enos scenario. To achieve
this we previously utilized the `enos_environment.public_ip_address`
attribute. Sometime in mid March we started seeing sporadic SSH i/o
timeout errors when attempting to execute Enos resources against SSH
transport targets. We've only ever seen this when communicating from
Azure hosted runners to AWS hosted machines.
While testing we were able to confirm that in some cases the public IP
address resolved using DNS over UDP4 to Google and OpenDNS name servers
did not match what was resolved when using the HTTPS/TCP IP address
service hosted by AWS. The Enos data source was implemented in a way
that we'd attempt resolution of a single name server and only attempt
resolving from the next if previous name server could not get a result.
We'd then allow-list that single IP address. That's a problem if we can
resolve two different public IP addresses depending our endpoint address.
This change utlizes the new `enos_environment.public_ip_addresses`
attribute and subsequent behavior change. Now the data source will
attempt to resolve our public IP address via name servers hosted by
Google, OpenDNS, Cloudflare, and AWS. We then return a unique set of
these IP addresses and allow-list all of them in our security group. It
is our hope that this resolves these i/o timeout errors that seem like
they're caused by the security group black-holing our attempted access
because the IP we resolved does not match what we're actually exiting
with.
Signed-off-by: Ryan Cragun <me@ryan.ec>
The previous strategy for provisioning infrastructure targets was to use
the cheapest instances that could reliably perform as Vault cluster
nodes. With this change we introduce a new model for target node
infrastructure. We've replaced on-demand instances for a spot
fleet. While the spot price fluctuates based on dynamic pricing,
capacity, region, instance type, and platform, cost savings for our
most common combinations range between 20-70%.
This change only includes spot fleet targets for Vault clusters.
We'll be updating our Consul backend bidding in another PR.
* Create a new `vault_cluster` module that handles installation,
configuration, initializing, and unsealing Vault clusters.
* Create a `target_ec2_instances` module that can provision a group of
instances on-demand.
* Create a `target_ec2_spot_fleet` module that can bid on a fleet of
spot instances.
* Extend every Enos scenario to utilize the spot fleet target acquisition
strategy and the `vault_cluster` module.
* Update our Enos CI modules to handle both the `aws-nuke` permissions
and also the privileges to provision spot fleets.
* Only use us-east-1 and us-west-2 in our scenario matrices as costs are
lower than us-west-1.
Signed-off-by: Ryan Cragun <me@ryan.ec>
This uses aws-nuke and awslimitchecker to monitor the new vault CI account to clean up and prevent resource quota exhaustion. AWS-nuke will scan all regions of the accounts for lingering resources enos/terraform didn't clean up, and if they don't match exclusion criteria, delete them every night. By default, we exclude corp-sec created resources, our own CI resources, and when possible, anything created within the past 72 hours. Because this account is dedicated to CI, users should not expect resources to persist beyond this without additional configuration.
* Adding an Enos test for undo logs
* fixing a typo
* feedback
* fixing typo
* running make fmt
* removing a dependency
* var name change
* fixing a variable
* fix builder
* fix product version
* adding required fields
* feedback
* add artifcat bundle back
* fmt check
* point to correct instance
* minor fix
* feedback
* feedback
Introducing a new approach to testing Vault artifacts before merge
and after merge/notorization/signing. Rather than run a few static
scenarios across the artifacts, we now have the ability to run a
pseudo random sample of scenarios across many different build artifacts.
We've added 20 possible scenarios for the AMD64 and ARM64 binary
bundles, which we've broken into five test groups. On any given push to
a pull request branch, we will now choose a random test group and
execute its corresponding scenarios against the resulting build
artifacts. This gives us greater test coverage but lets us split the
verification across many different pull requests.
The post-merge release testing pipeline behaves in a similar fashion,
however, the artifacts that we use for testing have been notarized and
signed prior to testing. We've also reduce the number of groups so that
we run more scenarios after merge to a release branch.
We intend to take what we've learned building this in Github Actions and
roll it into an easier to use feature that is native to Enos. Until then,
we'll have to manually add scenarios to each matrix file and manually
number the test group. It's important to note that Github requires every
matrix to include at least one vector, so every artifact that is being
tested must include a single scenario in order for all workflows to pass
and thus satisfy branch merge requirements.
* Add support for different artifact types to enos-run
* Add support for different runner type to enos-run
* Add arm64 scenarios to build matrix
* Expand build matrices to include different variants
* Update Consul versions in Enos scenarios and matrices
* Refactor enos-run environment
* Add minimum version filtering support to enos-run. This allows us to
automatically exclude scenarios that require a more recent version of
Vault
* Add maximum version filtering support to enos-run. This allows us to
automatically exclude scenarios that require an older version of
Vault
* Fix Node 12 deprecation warnings
* Rename enos-verify-stable to enos-release-testing-oss
* Convert artifactory matrix into enos-release-testing-oss matrices
* Add all Vault editions to Enos scenario matrices
* Fix verify version with complex Vault edition metadata
* Rename the crt-builder to ci-helper
* Add more version helpers to ci-helper and Makefile
* Update CODEOWNERS for quality team
* Add support for filtering matrices by group and version constraints
* Add support for pseudo random test scenario execution
Signed-off-by: Ryan Cragun <me@ryan.ec>