Cache scopes allow other branches to inherit default branch scopes,
which means that a release branch can restore a cache key created on
main. To ensure we never restore artifacts built for an incompatible
Vault version, we now include the Vault version as part of the cache key.
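Purely as an illustration (the key layout below is hypothetical, not our
exact key), a version-scoped key looks like this:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Hypothetical input: in CI this would come from the workflow context.
	vaultVersion := "1.18.0"

	// Because the version is a key component, a release branch can only
	// restore entries built for the same Vault version, even though cache
	// scoping lets it read keys created on main.
	key := fmt.Sprintf("pipeline-%s-%s-%s", runtime.GOOS, runtime.GOARCH, vaultVersion)
	fmt.Println(key) // e.g. pipeline-linux-amd64-1.18.0
}
```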
Signed-off-by: Ryan Cragun <me@ryan.ec>
As the Vault pipeline and release processes evolve over time, so too must the tooling that drives them. Historically we've used a combination of CI features and shell scripts, wrapped in make targets, to drive our CI. While this
approach has worked, it requires careful consideration of which features to use (the bash in CI almost never matches the bash on developer machines, etc.) and often requires a deep understanding of several CLI tools (jq, etc.). `make` itself also has limitations in user experience, e.g. when passing flags.
As we're all in on GitHub Actions as our pipeline coordinator, continuing to build and use CLI tools to perform our pipeline tasks makes sense. This PR adds a new CLI tool called `pipeline` which we can use to build new isolated tasks that we can string together in GitHub Actions. We intend to use this utility as the interface for future release automation work; see VAULT-27514.
For the first tasks in this new `pipeline` tool, I've chosen to build two small sub-commands:
* `pipeline releases list-versions` - Allows us to list Vault versions within a range. The range is configurable either by setting `--upper` and/or `--lower` bounds, or by using `--nminus` to go back N-X versions from the current branch's version. As CE and ENT do not have version parity, we also consider the `--edition` flag, along with zero or more `--skip` flags to exclude specific versions. A rough sketch of the filtering appears after this list.
* `pipeline generate enos-dynamic-config` - Creates dynamic enos configuration based on the branch and the current list of release versions. It takes largely the same flags as `releases list-versions`, but it also expects a `--dir` for the enos directory and a `--file` where the dynamic configuration will be written. This allows us to dynamically feed the latest versions into our sampling algorithm and get coverage over all supported prior versions.
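The following is a rough sketch of the range filtering, assuming the versions are known up front (the function, inputs, and use of `golang.org/x/mod/semver` are illustrative, not the actual implementation):

```go
package main

import (
	"fmt"

	"golang.org/x/mod/semver"
)

// listVersions keeps the known versions that fall within [lower, upper]
// and are not explicitly skipped. Flag handling, edition filtering, and
// the version source are omitted here.
func listVersions(known []string, lower, upper string, skip map[string]bool) []string {
	var out []string
	for _, v := range known {
		if skip[v] {
			continue
		}
		if semver.Compare(v, lower) >= 0 && semver.Compare(v, upper) <= 0 {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	known := []string{"v1.16.0", "v1.16.1", "v1.17.0", "v1.17.3", "v1.18.0"}
	skip := map[string]bool{"v1.17.0": true} // analogous to --skip
	fmt.Println(listVersions(known, "v1.16.0", "v1.18.0", skip))
	// Output: [v1.16.0 v1.16.1 v1.17.3 v1.18.0]
}
```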
We then integrate these new tools into the pipeline itself and cache the dynamic config on a weekly basis. We also cache the `pipeline` tool itself, as it will likely become a home for more pipeline-specific tooling; this caching strategy keeps most workflows that require the tool fast.
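To illustrate the weekly cycle (the key format here is an assumption), keying the cache on the ISO year/week pair rolls it over once per week:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Keying the cache on the ISO year/week means the entry naturally
	// expires once per week, forcing a fresh generation of the config.
	year, week := time.Now().ISOWeek()
	fmt.Printf("enos-dynamic-config-%d-%02d\n", year, week)
}
```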
Signed-off-by: Ryan Cragun <me@ryan.ec>
* VAULT-31402: Add verification for all container images
Add verification for all container images that are generated as part of
the build. Before this change we only ever tested a limited subset of
"default" containers based on Alpine Linux that we publish to Docker
Hub and AWS ECR.
Now we support testing all Alpine- and UBI-based container images. We
also verify the repository and tag information embedded in each by
deploying them and verifying that the repo and tag metadata match our
expectations.
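Conceptually the check is a simple comparison between the metadata we
embed at build time and what the deployed container reports; a minimal
sketch, with hypothetical function and inputs:

```go
package main

import "fmt"

// verifyImageMetadata compares the repo and tag we expect for a scenario
// variant against what the deployed container actually reports.
func verifyImageMetadata(gotRepo, gotTag, wantRepo, wantTag string) error {
	if gotRepo != wantRepo {
		return fmt.Errorf("container repository: got %q, want %q", gotRepo, wantRepo)
	}
	if gotTag != wantTag {
		return fmt.Errorf("container tag: got %q, want %q", gotTag, wantTag)
	}
	return nil
}

func main() {
	err := verifyImageMetadata("hashicorp/vault", "1.18.0", "hashicorp/vault", "1.18.0")
	fmt.Println(err) // <nil>
}
```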
This does change the k8s scenario interface quite a bit. We now take in
an archive image and set image/repo/tag information based on the
scenario variants.
To enable this I also needed to add `tar` to the UBI base image. It was
already available in the Alpine image and is used to copy utilities to
the image when deploying and configuring the cluster via Enos.
Since some images contain multiple tags, we also add samples for each
image and randomly select which variant to test on a given PR.
Signed-off-by: Ryan Cragun <me@ryan.ec>
- If we encounter a deadlock or long-running test it is better to have go
  test time out. As we've noticed, if we hit the GitHub step timeout we
  lose all information about what was running at the time, making
  failures harder to diagnose.
- When go test itself times out a long-running test, it reports which
  test was running and prints a full panic trace with goroutine stacks to
  the logs, which is quite useful for diagnosis (the sketch below
  illustrates the mechanism).
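`go test`'s own `-timeout` flag gives us that panic dump for free. Purely
to illustrate the mechanism, here is a sketch of the same idea as an
in-test guard (the helper is hypothetical, not part of this change):

```go
package testutil

import (
	"runtime"
	"testing"
	"time"
)

// withSoftTimeout runs fn and, if it exceeds d, fails the test with a
// dump of every goroutine's stack while the process is still alive, much
// like the panic output go test emits when its own -timeout fires.
func withSoftTimeout(t *testing.T, d time.Duration, fn func()) {
	t.Helper()
	done := make(chan struct{})
	go func() { defer close(done); fn() }()
	select {
	case <-done:
	case <-time.After(d):
		buf := make([]byte, 1<<20)
		n := runtime.Stack(buf, true) // true = all goroutines
		t.Fatalf("test exceeded %v:\n%s", d, buf[:n])
	}
}
```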
In order for our enterprise nightlies to run the same test-go job across
a matrix of different base references, we need to include the checkout
ref in our failure and summary uploads to prevent an upload race.
We also configure Git with our token before setting up Go so that
enterprise CI workflows can execute without downloading a module cache.
Signed-off-by: Ryan Cragun <me@ryan.ec>
It appears that with the latest runner image[0] we occasionally see
a flaky test with an error related to our fontconfig cache:
```
Error: Browser timeout exceeded: 10s
Error while executing test: Acceptance | wrapped_token query param functionality: it authenticates when used with the with=token query param
Stderr:
Fontconfig error: No writable cache directories
[0822/180212.113587:WARNING:sandbox_linux.cc(430)] InitializeSandbox() called with multiple threads in process gpu-process.
```
This change rebuilds the fontconfig cache on GitHub-hosted runners.
Hopefully we can remove this workaround when a new runner image is
released.
[0] https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20240818.1
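Assuming the standard fontconfig tooling, the workaround amounts to
something like the following (in the actual workflow this would be a
shell step; the Go wrapper is only a sketch):

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Force-rebuild the fontconfig cache before the browser-based tests
	// start, so the browser has a writable, up-to-date cache to read.
	if out, err := exec.Command("fc-cache", "-f").CombinedOutput(); err != nil {
		log.Fatalf("fc-cache failed: %v\n%s", err, out)
	}
}
```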
Signed-off-by: Ryan Cragun <me@ryan.ec>
* Add LTS explanation and clarify other label explanations
* Link to doc containing LTS calendar
* Change order for simpler cognitive load
* A bit simpler based on feedback
Optimize the cost of the Security `scan` workflow by using a
different runner. Previously this workflow used the
`custom-linux-xl` runner in `vault` vs. the `c6a.4xlarge` on-demand runner in
`vault-enterprise`. This resulted in the `vault` workflow costing an
order of magnitude more each month.
I tested the following instance sizes to compare cost to execution time.
The cost score is the estimated time in minutes multiplied by the
relative cost factor, so lower is cheaper for a full run:

| Runner              | Estimated Time | Cost Factor | Cost Score |
|---------------------|----------------|-------------|------------|
| ubuntu-latest       | 19m            | 1           | 19         |
| custom-linux-small  | 21.5m          | 2           | 43         |
| custom-linux-medium | 11.5m          | 4           | 46         |
| custom-linux-xl     | 8.5m           | 16          | 136        |
Currently the `CI` and `build` required workflows take anywhere from
16-20 minutes on `vault`. Our goal is to not exceed that.
At this time we're going to try out `ubuntu-latest`, as it gives us ~85%
savings and is by far the best bang for our buck. If it ends up being a
burden we can switch to `custom-linux-medium` for ~66% cost savings with
a still-reasonable runtime.
Signed-off-by: Ryan Cragun <me@ryan.ec>
* VAULT-29583: Modernize default distributions in enos scenarios
Our scenarios have been running the last gen of distributions in CI.
This updates our default distributions as follows:
- Amazon: 2023
- Leap: 15.6
- RHEL: 8.10, 9.4
- SLES: 15.6
- Ubuntu: 20.04, 24.04
With these changes we also unlock a few new variant combinations:
- `distro:amzn seal:pkcs11`
- `arch:arm64 distro:leap`
We also normalize our distro key for Amazon Linux to `amzn`, which
matches the `uname` output on both versions that we've supported.
Signed-off-by: Ryan Cragun <me@ryan.ec>
* Pin protoc-gen-go-grpc to 1.4.0
They introduced a `replace` statement in the go.mod file, which
causes `go install protoc-gen-go-grpc@latest` to fail.
The workaround for now is to pin to the previous version.
See https://github.com/grpc/grpc-go/issues/7448
* Add the missing `v` prefix: use version `v1.4.0` instead of `1.4.0` within tools/tools.sh
We have seen instances where the `github.event.pull_request.labels.*.name`
context in GitHub Actions doesn't actually include the labels.
We now pull and parse the labels ourselves rather than relying on that
context.
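A minimal sketch of that approach using the GitHub REST API's
issue-labels endpoint (the repository and issue number are placeholders;
a PR's labels live on its corresponding issue):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

// label mirrors the field we need from the GitHub REST response for
// GET /repos/{owner}/{repo}/issues/{number}/labels.
type label struct {
	Name string `json:"name"`
}

func main() {
	// Placeholder URL; in CI the repo and PR number come from the event.
	url := "https://api.github.com/repos/hashicorp/vault/issues/1/labels"

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GITHUB_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var labels []label
	if err := json.NewDecoder(resp.Body).Decode(&labels); err != nil {
		log.Fatal(err)
	}
	for _, l := range labels {
		fmt.Println(l.Name)
	}
}
```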
Signed-off-by: Ryan Cragun <me@ryan.ec>
In order to take advantage of enos' ability to outline scenarios and to
inventory the verification they perform, we needed to retrofit all of
that information onto our existing scenarios and steps.
This change introduces an initial set of descriptions and verification
declarations that we can continue to refine over time.
As doing this required re-reading every scenario in its entirety, I
also updated and fixed a few things that I noticed along the way,
including adding a few small features to enos that we use to determine
initial versions programmatically instead of maintaining a delta between
the globals in each branch.
* Update autopilot and in-place upgrade initial versions
* Programmatically determine which initial versions to use based on the
  Vault version (a rough sketch of the idea appears after this list)
* Partially normalize steps between scenarios to make comparisons easier
* Update the MOTD to explain that VAULT_ADDR and VAULT_TOKEN have been
set
* Add scenario and step descriptions to scenarios
* Add initial scenario quality verification declarations to scenarios
* Unpin Terraform in scenarios as >= 1.8.4 should work fine
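As a rough sketch of the version-selection idea (the function and inputs
are hypothetical; the real logic lives in the enos and pipeline tooling):

```go
package main

import (
	"fmt"

	"golang.org/x/mod/semver"
)

// initialUpgradeVersions derives candidate starting versions for upgrade
// scenarios from the branch's current version, so each branch computes
// its own list instead of carrying hand-edited globals.
func initialUpgradeVersions(current string, known []string) []string {
	var out []string
	for _, v := range known {
		if semver.Compare(v, current) < 0 {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	known := []string{"v1.16.1", "v1.17.3", "v1.18.0"}
	fmt.Println(initialUpgradeVersions("v1.18.0", known))
	// Output: [v1.16.1 v1.17.3]
}
```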