When there is no SDK container image in the registry, the fallback
looks at bincache, but bincache isn't backed up and may be cleaned of
old releases. While this won't be the regular case, the container
image registry may be unavailable (or renamed, as happened now), or
people may want to rerun the image job, which relies on the packages
container.
The qemu_update vendor test was downloading the wrong LTS image when
it was testing the old LTS image. This is because it was using the
current symlink, which for the LTS channel always points to the new
LTS. The old LTS is available under the current-${YEAR} symlink. We
can get the year from the lts-info file.
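A rough sketch of the symlink choice described above; the release
server layout shown here is an assumption, not the actual test code:
  # for the LTS channel "current" always points at the newest LTS, so
  # tests of the old LTS must follow the year-scoped symlink instead
  if [ "${channel}" = "lts" ] && [ -n "${lts_year:-}" ]; then
      base_url="https://lts.release.flatcar-linux.net/${arch}-usr/current-${lts_year}"
  else
      base_url="https://${channel}.release.flatcar-linux.net/${arch}-usr/current"
  fi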
FLATCAR_VERSION and FLATCAR_SDK_VERSION are defined in the version
file, so it should be sourced before trying to use them. Here we try
to do it in a limited scope.
Also, the SDK container link should use the dockerized version in the
directory name.
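A minimal sketch of the limited-scope sourcing, assuming the usual
version file location in the scripts repo:
  # source the version file in a subshell so its variables do not leak
  # into the caller's environment
  sdk_version=$(
      source sdk_container/.repo/manifests/version.txt
      echo "${FLATCAR_SDK_VERSION}"
  )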
Currently we skip the nightly build if there are no changes. This
didn't work well because a new run couldn't fix any failure: the
rerun became a no-op.
Check if the main artifacts we expect from a step are found, as a
simple heuristic for whether a rerun is needed.
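A hedged sketch of the heuristic; the artifact path on the build cache
is illustrative:
  url="https://${BUILDCACHE_SERVER}/images/${arch}/${vernum}/flatcar_production_image.bin.bz2"
  if curl --silent --fail --head "${url}" > /dev/null; then
      echo "Main artifact present, skipping rerun"
  else
      echo "Main artifact missing, rerun needed"
  fi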
I found a duplicate function and verified that it's the only one via
comm -12 <(sort ci-automation/ci_automation_common.sh) <(sort sdk_lib/sdk_container_common.sh) | grep function
I'm not sure whether this is due to a case where we can only import
one of the two files but not the other, hence I'm not deleting it now.
This failed when used from ( secret_to_file ... VAR ; cat $VAR )
because ( ) starts a subshell with a new PID, so the /proc/PID/fd/X
path returned by secret_to_file was then using the wrong PID.
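A small illustration of the underlying bash behavior (not the actual
secret_to_file code): $$ keeps the parent's PID inside ( ), while
$BASHPID reflects the subshell that actually runs the commands.
  echo "parent:   \$\$=$$ BASHPID=${BASHPID}"
  ( echo "subshell: \$\$=$$ BASHPID=${BASHPID}" )
  # a /proc/<PID>/fd/X path must be built from the PID that actually
  # owns the file descriptor, otherwise it points into the wrong
  # process's fd table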
The JSON object is passed from the Groovy script to the release
script; we just need to extract the correct AWS Marketplace product ID
based on "<channel>-<arch>".
The exception is stable-amd64, where we also need to get the
stable-pro product ID.
Signed-off-by: Mathieu Tortuyaux <mtortuyaux@microsoft.com>
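A hedged sketch of the product-ID lookup described above, assuming the
JSON simply maps "<channel>-<arch>" keys to product IDs; the variable
and key names are illustrative, the real structure is defined on the
Groovy side and may differ:
  product_id=$(jq -r --arg key "${CHANNEL}-${ARCH}" '.[$key]' <<< "${MARKETPLACE_IDS_JSON}")
  if [ "${CHANNEL}-${ARCH}" = "stable-amd64" ]; then
      pro_product_id=$(jq -r '.["stable-pro-amd64"]' <<< "${MARKETPLACE_IDS_JSON}")
  fi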
The mantle plume tool has two steps: pre-release is only the upload
and release is the publication. In the past this was used to run the
tests in between, but we don't do this anymore.
Run plume pre-release and release in a single job. Since plume can't
push to GCS in our case, we upload the files to bincache. Also do the
CloudFormation update, which was previously done in
flatcar-build-scripts but could only be run after the sync to Origin.
It requires the "aws" tool in the mantle container until we implement
this in plume directly.
I made a mistake and wrote a version like main-3363-0.0-stuff (note a
dash instead of a dot after the first number). Surprisingly, the build
chugged along just fine almost until the end of the image job - it
only detected the invalid version string when the job wanted to create
a version.txt file:
ERROR build_image: script called: build_image '--board=amd64-usr' '--group=developer' '--output_root=/home/sdk/build/images' '--only_store_compressed' '--torcx_root=/home/sdk/build/torcx' 'prodtar' 'container'
ERROR build_image: Backtrace: (most recent call is last)
ERROR build_image: file build_image, line 196, called: split_ver '3363' 'SPLIT'
ERROR build_image: file common.sh, line 192, called: die 'Invalid version string '3363''
ERROR build_image:
ERROR build_image: Error was:
ERROR build_image: Invalid version string '3363'
Let's have a stricter version check at the beginning of the build
process, so it fails sooner rather than later.
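A minimal sketch of such an up-front check; the exact pattern used in
the scripts may be stricter:
  if ! [[ "${vernum}" =~ ^[0-9]+\.[0-9]+\.[0-9]+(-.+)?$ ]]; then
      echo "Invalid version string '${vernum}'" >&2
      exit 1
  fi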
Now URLs for torcx packages are always present in the torcx manifest,
but for releases they may point to the origin server where the
packages will eventually be uploaded. At the time of running the
tests, those packages exist only in the build cache, so change the
URLs to point to the build cache so that the tests can pass.
The torcx manifest may contain paths and URLs as locations of
packages. There are two kinds of packages - vendored and
extra. Vendored packages normally have two locations - path to the
directory inside the image where the package is (which is why it's
called vendored), and a URL to the package on some remote
server. Extra packages only have a URL. But the URLs are added only
when we tell the build_torcx_store script to upload the packages at
the same time, which is what the old build pipeline was doing. With
the new pipeline, the upload happens as a separate step, thus the
upload is disabled when invoking build_torcx_store, and so the
packages are not getting URLs set. This change went unnoticed, because
a kola test checking the generated torcx manifest was only checking if
there is at least one location, either path or URL, and all the new
releases have no extra packages, only vendored ones.
When backporting the new pipeline to the old LTS, the kola tests
started to fail because the old LTS had one extra package, and this is
how I noticed the problem.
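A hedged sketch of the URL rewrite described above, assuming the usual
torcx manifest layout with .value.packages[].versions[].locations[].url
entries:
  jq --arg cache "http://${BUILDCACHE_SERVER}" \
      '(.value.packages[].versions[].locations[] | select(has("url")) | .url)
         |= sub("^https?://[^/]+"; $cache)' \
      torcx_manifest.json > torcx_manifest.bincache.json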
The old pipeline had a release job where mantle's plume release tool
was invoked to publish the cloud images.
Implement a release job in the new pipeline with the same goals and
eventually even more automation.
To review the image changes and the changelog more easily, and to be
able to iterate on fixes without rebuilding the image, move this logic
to its own file that a new job can call.
Instead of printing failed tests like this:
Failed tests: kubeadm.v1.25.0.cilium.base
kubeadm.v1.24.1.cilium.base
Do it like this:
Failed tests:
kubeadm.v1.25.0.cilium.base
kubeadm.v1.24.1.cilium.base
We set success to true when the test cycle was broken, which was a
hacky way to avoid printing the give-up message. But setting success
to true also meant that the script returned with status 0, which is
wrong.
Add another variable for controlling the printing of the give-up
message.
The new m3.small instance type does not have official Flatcar support
yet, but we can already cover it in our PXE boot release tests.
The c3.small instances are legacy, and m3.small is the new smallest
type.
We were running the run_sdk_container script, passing the value of a
variable named version to the script through the -v flag. But the
variable is defined nowhere. This worked under Jenkins, because the
Jenkins job has a version parameter that gets exported into the
environment under the same name. But running the script manually
outside Jenkins revealed the bug.
The script should have been using the vernum variable. The difference
between this variable and the version variable is that "version" was
in the form <channel>-<version>-<build_id>, whereas "vernum" comes
without the channel part. Fortunately, "run_sdk_container" was
stripping the channel part before using this value, so it makes no
difference whether we pass main-3333.0.0-some-id or just
3333.0.0-some-id.
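A small sketch of why both forms work, mirroring the stripping that
run_sdk_container does (the helper name is illustrative):
  strip_channel() {
      local ver="$1"
      # remove a leading channel prefix like "main-" or "alpha-" if present
      case "${ver}" in
          main-*|alpha-*|beta-*|stable-*|lts-*) echo "${ver#*-}" ;;
          *) echo "${ver}" ;;
      esac
  }
  strip_channel "main-3333.0.0-some-id"   # -> 3333.0.0-some-id
  strip_channel "3333.0.0-some-id"        # -> 3333.0.0-some-id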
Recently we changed the region from DA (Dallas) to DC (Washington),
because there are more ARM64 servers available. Reflect this change in
the new pipeline too.
When the build system runs the packages jobs for both architectures in
parallel and has to create a new tag, tagging fails due to a race.
Move the git tagging to its own script that is run from a new top-level
job that starts the packages jobs for both architectures.
The image comparison was done against the old release in the channel
we release to instead of the previous release with the same major
version. This means that when a channel transition happens we see a
large diff instead of the diff against the previous release. While not
bad for finding problems, this is normally not needed. However, when a
transition happens we want to have two changelogs generated: one
against the old release in the channel we release to and one against
the previous release with the same major version. There was no
changelog printing yet, and this is added now.
* Add a SKIP_COPY_TO_BINCACHE environment variable that will skip
uploading test results to bincache. This is useful if we want to
upload test results as artifacts on GitHub.
* Make QEMU_IMAGE_NAME configurable
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
The new build pipeline already compresses images but uploaded both the
compressed and uncompressed files because the whole build folder gets
uploaded.
Add a new flag "--only_store_compressed" to the image generation which
deletes the uncompressed file after compression is done. Uncompressed
images are still supported if specified in the flag
"image_compression_formats".
Closes https://github.com/flatcar-linux/Flatcar/issues/793
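An example invocation combining the new flag with the compression
formats flag; the compression format value is illustrative, the other
arguments are as in the build logs quoted earlier:
  ./build_image --board=amd64-usr --group=developer \
      --only_store_compressed --image_compression_formats=bz2 \
      prodtar container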
The original pipeline has package-diff commands to print out image
differences compared to the last release. This is used for the release
Go/No-Go QA checks.
Add the same logic to the new pipeline.
The image job builds an image container that is multiple GBs big and
takes >10 mins to be loaded in the vms job. The vms job can also do
its work by running from the packages container of the packages job if
it fetches the built image from bincache first, assuming the image job
copies it there.
Skip generating the image container and instead use the packages
container for VM image building by copying the image folder first to
bincache and then retrieving it from there. While reworking this we
also address the issue that the VMs container had used the same name
for both architectures, causing a race when both run in parallel on
the same worker.
It uses the SIGNER environment variable to decide whether the
signatures should be created or not. It expects the key of the SIGNER
to exist in GPGHOME, and that's what gpg_setup.sh is already doing.
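A minimal sketch of the signing condition, assuming the SIGNER key was
imported into the GPG home by gpg_setup.sh:
  if [ -n "${SIGNER:-}" ]; then
      gpg --batch --local-user "${SIGNER}" --armor --detach-sign "${artifact}"
  fi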
In some places we need to recursively change the owner of the
directory that contains artifacts to be signed, otherwise we won't be
able to create new files with signatures there. This is because some
of the artifacts are either created inside the SDK container (so the
created files belong to root outside the container) or are created
with `sudo`.
The functions source other files that define global variables, which
would spill into the caller's shell unnecessarily. We will also add
some functionality that uses traps in follow-up commits, so it's good
to limit the scope of traps too.
The kola test run time shouldn't be longer than the GC duration to
prevent failing tests caused by GC interference.
Align the Azure kola timeout with the GC duration.
The Azure tests use similar logic to the GCE tests, where the instance
type parameter normally used in AWS/Equinix Metal tests is used to
specify whether the VM gets started in Gen V1 or V2 mode.
Signed-off-by: Sayan Chowdhury <schowdhury@microsoft.com>
Co-authored-by: Kai Lüke <pothos@users.noreply.github.com>
When a nightly build that pushes the version file to the branch was
started, it did so only at the end of the build, causing the push to
fail if something else got merged in between.
Push the version file early by generating it the same way it would be
generated by the run_sdk_container/bootstrap_sdk_container scripts.
In the case of the SDK the version file gets the same version for the
OS and the SDK. Add some explanations about the version formats. Note
that the scripts will still rewrite the file but it should be a no-op.
The coreos/portage refs were allowed to be empty strings, but because
of the lack of quoting in the way the function was run from Groovy,
the empty strings ended up as missing parameters.
Since the two parameters are meant to be optional, support omitting
them.
`local -a stuff` does not make `stuff` a bound array variable, so
checking the length of the array will trigger an unbound variable
error. Fortunately, `local stuff=()` does the trick.
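A short reproduction of the behavior described above; whether the
first variant errors depends on the bash version, which is exactly the
trap:
  set -u
  broken() {
      local -a stuff              # declared but unset
      echo "${#stuff[@]}"         # may trip "unbound variable" on some bash versions
  }
  works() {
      local stuff=()              # empty but bound
      echo "${#stuff[@]}"         # prints 0
  }
  works
  broken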
We forgot to clear the array with instance tests to rerun, so the list
grew from one iteration to another when going over all the instance
types. I did not spot it before, because I tested it with only one
extra instance.
The logic we had in some tests for covering different instance types
is now easier to reuse for testing the GVNIC mode in GCE.
Align the GCE test with AWS and DigitalOcean to test an additional
"instance type" (here just changing the NIC), and break out of the
retest spin in case it gets called for arm64.
The test framework from the AWS PR allows us to align the logic, which
also addresses some bugs we had here.
Port the Equinix Metal test over to the new framework (and, while at
it, use different test basenames per architecture, which could
otherwise result in clashes).
The kola test scripts are named by the platforms. The image naming is
also quite difficult to know and remember, e.g., whether "ami" or
"ami_vmdk" is needed for AWS tests and whether it's "vmware" or
"vmware_ova".
To address these problems the vms build stage now accepts the platform
names as format input, and for each platform it will automatically
generate the needed image types to run the tests.
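A hedged sketch of the mapping idea; the actual table in the vms stage
may contain more entries:
  declare -A PLATFORM_TO_FORMAT=(
      [qemu_uefi]="qemu_uefi"
      [aws]="ami_vmdk"
  )
  format="${PLATFORM_TO_FORMAT[${platform}]:-${platform}}"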
The garbage collection job should also clean up kola resources if a
test job failed to do so due to forced termination or misbehavior. The
cleanup is done by "ore", which needs credentials like kola.
Run ore from the mantle container image. Unfortunately, Docker does
not support Podman's --env-host option, so the env vars had to be
passed explicitly. While --env-file=<(env) would work, it passes a lot
of variables that make the container behave a bit weirdly.
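A hedged sketch of passing only the needed credentials explicitly with
Docker's value-less -e form, which forwards the host's value; the
credential variable names and the ore invocation are illustrative:
  docker run --rm \
      -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
      -e AZURE_CREDENTIALS -e GOOGLE_APPLICATION_CREDENTIALS \
      "${MANTLE_IMAGE}" ore aws gc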
The SDK container image does not exist for arm64 and is quite heavy.
We currently also resort to an unconditional rebuild of mantle inside
the SDK.
Use the new mantle container image to run kola tests, and pin its
version through a text file that gets updated by GitHub Actions.
This is required to keep "packet" in the SDK lingo while the user can
use the "equinix_metal" term.
Signed-off-by: Mathieu Tortuyaux <mtortuyaux@microsoft.com>
Co-authored-by: Krzesimir Nowak <knowak@microsoft.com>
The pipeline created two tags if an SDK was built, one for the SDK and
one for the OS build (which was a free-standing tag or a local state
that was equivalent to the existing tag of the same name). The
nightlies created update commits on the main branch even if no change
was done, and on the release branches we lacked these commits.
Create the release tag in the nightly SDK bootstrap already and reuse
it for the nightly OS build. Instead of relying on local state, check
out the existing tags explicitly. Extend the nightly update commit
logic to cover release branches and to detect whether we can skip
building because no changes were done.
The nightly SDK image is not pushed to a registry but has to be
downloaded from the build server as a tarball.
Fall back to the tarball import for a better user experience.
To reuse the CI logic, it had to support the "docker" env variable.
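A hedged sketch of the fallback; the tarball URL layout on the build
server and the variable names are assumptions:
  if ! docker pull "${sdk_image}"; then
      curl -fL "https://${BUILDCACHE_SERVER}/containers/${vernum}/${sdk_image_tarball}" \
          | docker load
  fi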
The pigz container is not needed if the user already has pigz
available locally.
The Jenkins TAP file parser does not process non-printable ASCII
characters but bails out. This change removes all ASCII < 0x1F, so
non-printable characters are not included in the TAP report.
Fixes:
Caused by: unacceptable character '' (0x1B) special characters are not allowed
This change removes all non-ASCII characters from test debug / error
output when ingesting the output for inclusion in the TAP report.
The Jenkins TAP parser does not handle some Unicode characters,
leading to TAP parser errors with e.g. Cilium output (which uses
Unicode).
Signed-off-by: Thilo Fromm <thilo@kinvolk.io>
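A hedged sketch of the kind of filtering both changes above describe
(the real helper may differ): keep tabs, newlines, and carriage
returns, drop other control characters and anything outside printable
ASCII.
  tr -cd '\011\012\015\040-\176' < raw_test_output.txt > tap_safe_output.txt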
Move the final db commit to inside the subshell. Since the while loop
runs inside a subshell, the SQL variable outside of the subshell is
not modified, and so the last contents of the SQL variable are
dropped. This shows up when the last couple of test cases don't have
an error message and simply append the transaction to 'SQL': they are
never written to the db.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
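A short reproduction of the underlying pitfall (table name and data
are illustrative): a while loop fed by a pipe runs in a subshell, so
anything appended to SQL there is lost once the loop ends.
  SQL=""
  printf 'a\nb\n' | while read -r line; do
      SQL+="INSERT INTO t VALUES ('${line}');"
  done
  echo "after the loop: '${SQL}'"   # empty - the subshell's copy of SQL is gone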
The image needs to be set into official mode through a helper script
(see jenkins/images.sh), and the COREOS_OFFICIAL variable needs to be
set for prod_image_util.sh/build_image_util.sh/grub_install.sh.
Until now we had only "developer" images in the new pipeline.
Based on a git tag like "alpha-1234.0.0", set the channel (group) for
the image, and also use this logic when finding the channel in the
QEMU update test.
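A minimal sketch of deriving the channel from such a tag:
  tag="alpha-1234.0.0"
  channel="${tag%%-*}"   # -> alpha
  vernum="${tag#*-}"     # -> 1234.0.0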
This change fixes and adds more escaping of string characters in the
test error/debug output ("\" characters are removed and a bug in
removing '"' is fixed), addressing a parser error the CI encountered
when ingesting TAP output.
Furthermore, line numbering is shortened, and a spurious "-" prefix is
removed from test names.
Signed-off-by: Thilo Fromm <thilo@kinvolk.io>
This change updates the tapfile helper to read test output from a file
instead of passing it inline in the SQL INSERT statement.
This fixes an issue with large error output from tests, which breaks
tap_ingest_tapfile():
ci-automation/tapfile_helper_lib.sh: line 31: /usr/bin/sqlite3: Argument list too long
The error was observed with the cl.toolbox.dnf-install test, which can
generate 8000 lines of output. The fix was tested with the same
output.
Signed-off-by: Thilo Fromm <thilo@kinvolk.io>
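A hedged sketch of the file-based approach, using the sqlite3 shell's
readfile() helper; the table and column names are illustrative, not
the real schema:
  printf '%s' "${test_output}" > /tmp/test_output.txt
  sqlite3 "${db_file}" \
      "INSERT INTO test_results(test_name, output) VALUES ('${test_name}', readfile('/tmp/test_output.txt'));"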
The kola update test was missing. It is performed as an update from
the old image to the newly built payload to ensure that the new image
is compatible with old clients.
This change adds the qemu_uefi.sh vendor test. It reuses most of the
implementation in qemu.sh (qemu_uefi.sh is a symlink to qemu.sh).
This also enables QEMU testing for ARM64.
Signed-off-by: Thilo Fromm <thilo@kinvolk.io>
This change has sdk_bootstrap update the origin branch when run from the
main branch, updating the SDK and OS version in 'main' for each SDK
bootstrap build.
Release / maintenance branches have the SDK version set in the
versionfile at release time. But main is never updated.
Updating the versionfile in main when a new SDK is built ensures that
dev branches based on main will also use the correct SDK version (e.g.
in subsequent CI builds).
This change adds copying test results to the build cache server, and
adds the respective deletion to the garbage collector.
Also, the patch fixes an issue with torcx publishing (manifest
publishing had the arch hard-coded).
Signed-off-by: Thilo Fromm <thilo@kinvolk.io>
Use HTTP instead of HTTPS because Ignition does not recognise Let's
Encrypt certificates, leading to test breakage in
docker.torcx-manifest-pkgs.
Add a note in settings.env to explicitly call out the HTTP requirement
of the build cache server.
Signed-off-by: Thilo Fromm <thilo@kinvolk.io>
This change updates the package build script to publish the torcx
manifest file to the build cache so it can be used by tests.
It also updates the generic test script to use the SDK container instead
of the packages container image, and to download and use the torcx
manifest from the build cache.
Signed-off-by: Thilo Fromm <thilo@kinvolk.io>
- Git author configuration moves to the tagging function and is put
under a condition so as to not pollute people's workspaces.
- curl is now less verbose since it was spamming the logs with TLS
debug information.
Signed-off-by: Thilo Fromm <thilo@kinvolk.io>
The original intention of the "binpkg" prefix in the CI binary package
cache URL was to separate packages from other build artifacts like
containers, images, and SDK tarballs. The motivation was to separate
developer content (binary packages) from CI automation artifacts
(everything else), since binary packages are not used by the CI.
This broke assumptions in scripts which use the binary host URL for
other things than packages - e.g. SDK tarballs or images. These
scripts would get a bincache URL with "binpkg/" prepended, while CI
automation would *not* use that prefix.
This change removes the use of "binpkg/" altogether since it would not
work as intended without more significant changes to build scripts.
garbage_collect.sh was using 'docker_vernum' where it should have been
using 'vernum' (as push_pkgs.sh does).
Also, make sure release directories are removed, not just packages.
Signed-off-by: Thilo Fromm <thilo@kinvolk.io>
This change adds a job to the CI automation for publishing binary
packages to the build cache server.
Also, setup_board is updated to use the buildcache package cache if a
nightly build version is detected.
Signed-off-by: flatcar-ci <infra+ci@flatcar-linux.org>
For test builds the commit that updates the submodules can be
free-standing, but for releases we need to push it to the branch and
also sign the tag.
Add optional arguments that are used by the tag-release script in
flatcar-build-scripts.
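A hedged sketch of what the optional arguments enable; the argument
and variable names are illustrative:
  if [ "${push_to_branch:-false}" = "true" ]; then
      git push origin "HEAD:refs/heads/${branch}"
  fi
  if [ "${sign_tag:-false}" = "true" ]; then
      git tag --sign --message "${tag}" "${tag}"
      git push origin "refs/tags/${tag}"
  fi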