* VAULT-34822: Add `pipeline github list changed-files`
Add a new `github list changed-files` sub-command to `pipeline` command and
integrate it into the pipeline. This replaces our previous
`changed-files.sh` script.
This command works quite a bit differently than the full checkout and
diff based solution we used before. Instead of checking out the base ref
and head ref and comparing a diff, we now provide either a pull request
number or git commit SHA and use the Github REST API to determine the
changed files.
This approach has several benefits:
- Not requiring a local checkout of the repo to get the list of
changed files. This yields a significant perfomance improvement in
`setup` jobs where we typically determine the changed files list.
- The CLI supports both PRs and commit SHAs.
- The implementation is portable and doesn't require any system tools
like `git` or `bash` to be installed.
- A much more advanced system for adding group metadata to the changed
files. These groupings are going to be used heavily in future
pipeline automation work and will be used to make required jobs
smarter.
The theoretical drawbacks:
- It requires a GITHUB_TOKEN and only works for remote branches or
commits in Github. We could eventually add a local diff sub-command
or option to work locally, but that was not required for what we're
trying to achieve here.
While the groupings that I added in this change are quite rudimentary,
the system will allow us to add additional groups with very little
overhead. I tried to make this change more or less a port of the old
system to enable future work. I did include one small change of
behavior, which is that we now build all extended targets if the
`go.mod` or `go.sum` files change. We do this to ensure that dependency
changes don't subtly result in some extended platform breakage.
Signed-off-by: Ryan Cragun <me@ryan.ec>
Verify vault secret integrity in unauthenticated I/O streams (audit log, STDOUT/STDERR via the systemd journal) by scanning the text with Vault Radar. We search for both known and unknown secrets by using an index of KVV2 values and also by radar's built-in heuristics for credentials, secrets, and keys.
The verification has been added to many scenarios where a slight time increase is allowed, as we now have to install Vault Radar and scan the text. In practice this adds less than 10 seconds to the overall duration of a scenario.
In the in-place upgrade scenario we explicitly exclude this verification when upgrading from a version that we know will fail the check. We also make the verification opt-in so as to not require a Vault Radar license to run Enos scenarios, though it will always be enabled in CI.
As part of this we also update our enos workflow to utilize secret values from our self-hosted Vault when executing in the vault-enterprise repo context.
Signed-off-by: Ryan Cragun <me@ryan.ec>
As the Vault pipeline and release processes evolve over time, so too must the tooling that drives them. Historically we've utilized a combination of CI features and shell scripts that are wrapped into make targets to drive our CI. While this
approach has worked, it requires careful consideration of what features to use (bash in CI almost never matches bash in developer machines, etc.) and often requires a deep understanding of several CLI tools (jq, etc). `make` itself also has limitations in user experience, e.g. passing flags.
As we're all in on Github Actions as our pipeline coordinator, continuing to utilize and build CLI tools to perform our pipeline tasks makes sense. This PR adds a new CLI tool called `pipeline` which we can use to build new isolated tasks that we can string together in Github Actions. We intend to use this utility as the interface for future release automation work, see VAULT-27514.
For the first task in this new `pipeline` tool, I've chosen to build two small sub-commands:
* `pipeline releases list-versions` - Allows us to list Vault versions between a range. The range is configurable either by setting `--upper` and/or `--lower` bounds, or by using the `--nminus` to set the N-X to go back from the current branches version. As CE and ENT do not have version parity we also consider the `--edition`, as well as none-to-many `--skip` flags to exclude specific versions.
* `pipeline generate enos-dynamic-config` - Which creates dynamic enos configuration based on the branch and the current list of release versions. It takes largely the same flags as the `release list-versions` command, however it also expects a `--dir` for the enos directory and a `--file` where the dynamic configuration will be written. This allows us to dynamically update and feed the latest versions into our sampling algorithm to get coverage over all supported prior versions.
We then integrate these new tools into the pipeline itself and cache the dynamic config on a weekly basis. We also cache the pipeline tool itself as it will likely become a repository for pipeline specific tooling. The caching strategy for the `pipeline` tool itself will make most workflows that require it super fast.
Signed-off-by: Ryan Cragun <me@ryan.ec>
* VAULT-31402: Add verification for all container images
Add verification for all container images that are generated as part of
the build. Before this change we only ever tested a limited subset of
"default" containers based on Alpine Linux that we publish via the
Docker hub and AWS ECR.
Now we support testing all Alpine and UBI based container images. We
also verify the repository and tag information embedded in each by
deploying them and verifying the repo and tag metadata match our
expectations.
This does change the k8s scenario interface quite a bit. We now take in
an archive image and set image/repo/tag information based on the
scenario variants.
To enable this I also needed to add `tar` to the UBI base image. It was
already available in the Alpine image and is used to copy utilities to
the image when deploying and configuring the cluster via Enos.
Since some images contain multiple tags we also add samples for each
image and randomly select which variant to test on a given PR.
Signed-off-by: Ryan Cragun <me@ryan.ec>
- If we encounter a deadlock/long running test it is better to have go
test timeout. As we've noticed if we hit the GitHub step timeout, we
lose all information about what was running at the time of the timeout
making things harder to diagnose.
- Having the timeout through go test itself on a long running test it
outputs what test was running along with a full panic output within
the logs which is quite useful to diagnose
In order for our enterprise nightlies to run the same test-go job but
across a matrix of different base references we need to consider the
checkout ref in our failure and summary uploads in order to prevent
an upload race.
We also configure Git with our token before setting up Go so that
enterprise CI workflows can execute without downloading a module cache.
Signed-off-by: Ryan Cragun <me@ryan.ec>
It appears that with the latest runner image[0] that we occasionally see
a flaky test with an error related to our fontconfig cache:
```
Error: Browser timeout exceeded: 10s
Error while executing test: Acceptance | wrapped_token query param functionality: it authenticates when used with the with=token query param
Stderr:
Fontconfig error: No writable cache directories
[0822/180212.113587:WARNING:sandbox_linux.cc(430)] InitializeSandbox() called with multiple threads in process gpu-process.
```
This change rebuilds the fontconfig cache on Github hosted runners.
Hopefully we can remove this at some point when a new runner image is
released.
[0] https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20240818.1
Signed-off-by: Ryan Cragun <me@ryan.ec>
Optimize the cost of the Security `scan` workflow by utilizing a
different runner. Previously this workflow would use the
`custom-linux-xl` in `vault` vs. the `c6a.4xlarge` on-demand runner in
`vault-enterprise. This resulted in the `vault` workflow costing an
order of magnitude more each month.
I tested with the following instances sizes to compare cost to execution
time:
| Runnner | Estimated Time | Cost Factor | Cost Score |
|---------|-----------------|-------------|-------------|
|ubuntu-latest|19m|1|19|
|custom-linux-small|21.5m|2|43|
|custom-linux-medium|11.5m|4|46|
|custom-linux-xl|8.5m|16|136|
Currently the `CI` and `build` require workflows take anywhere from
16-20 minutes on `vault`. Our goal is to not exceed that.
At this time we're going to try out `ubuntu-latest` as it gives us ~85%
savings and by far the best bang for our buck. If it ends up being a
burden we can switch to `custom-linux-medium` for ~66% cost savings but
still a reasonable runtime.
Signed-off-by: Ryan Cragun <me@ryan.ec>
* VAULT-29583: Modernize default distributions in enos scenarios
Our scenarios have been running the last gen of distributions in CI.
This updates our default distributions as follows:
- Amazon: 2023
- Leap: 15.6
- RHEL: 8.10, 9.4
- SLES: 15.6
- Ubuntu: 20.04, 24.04
With these changes we also unlock a few new variants combinations:
- `distro:amzn seal:pkcs11`
- `arch:arm64 distro:leap`
We also normalize our distro key for Amazon Linux to `amzn`, which
matches the uname output on both versions that we've supported.
Signed-off-by: Ryan Cragun <me@ryan.ec>
In order to take advantage of enos' ability to outline scenarios and to
inventory what verification they perform we needed to retrofit all of
that information to our existing scenarios and steps.
This change introduces an initial set of descriptions and verification
declarations that we can continue to refine over time.
As doing this required that I re-read every scenanario in its entirety I
also updated and fixed a few things along the way that I noticed,
including adding a few small features to enos that we utilize to make
handling initial versions programtic between versions instead of having a
delta between our globals in each branch.
* Update autopilot and in-place upgrade initial versions
* Programatically determine which initial versions to use based on Vault
version
* Partially normalize steps between scenarios to make comparisons easier
* Update the MOTD to explain that VAULT_ADDR and VAULT_TOKEN have been
set
* Add scenario and step descriptions to scenarios
* Add initial scenario quality verification declarations to scenarios
* Unpin Terraform in scenarios as >= 1.8.4 should work fine
* Automate feature changelog checking
* Add changelog for testing
* Simplify check
* Forgot the end of line thing
* Escape the characters
* More testing
* Last test?
* Delete test changelog
Fix a node deprecation warning by updating our actions-slack-status to
v2.0.1, which pulls in a newer version of the github-script action that
causes the deprecation warning.
Signed-off-by: Ryan Cragun <me@ryan.ec>