The natlab-integrationtest CI job frequently flakes by exhausting its
3m go test timeout. The root cause is that the QEMU VMs run under
pure software emulation (TCG) with no KVM. Under TCG, the guest
kernel's timer calibration busy-loops are at the mercy of host CPU
scheduling. When two VMs boot simultaneously on a 2-core CI runner,
one VM's calibration gets starved and produces wrong results, leaving
the kernel with broken timers that prevent it from ever completing
boot — even after the other VM finishes and frees up CPU.
Additionally, the microvm machine type doesn't provide HPET hardware,
but the kernel command line specified clocksource=hpet. And the VM
image build (make natlab) ran inside the test itself, consuming most
of the 3m timeout budget before the actual test started.
Fix by:
- Enabling KVM when /dev/kvm is available, so timer calibration
uses real hardware timers unaffected by host CPU scheduling.
- Adding a CI step to set /dev/kvm permissions on the GitHub
Actions runner (ubuntu-latest provides KVM but needs a udev rule).
- Pre-building the VM image in a separate CI step so it doesn't
cut into the go test -timeout budget.
- Replacing the hardcoded 60s context timeout with one derived from
t.Deadline(), so the test uses the full -timeout budget.
- Adding VM boot progress detection (AwaitFirstPacket) and QMP
diagnostics, so boot failures produce clear errors instead of
opaque "context deadline exceeded" messages.
With KVM enabled, the test passes reliably even on a single CPU core
with 3 parallel workers — a scenario that was 100% broken under TCG.
Fixes#18906
Change-Id: I4c87631a9c9678d185b9f30cb05c0f7bfa9f5c62
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Parallelize the SSH integration tests across OS targets and reduce
per-container overhead:
- CI: use GitHub Actions matrix strategy to run all 4 OS containers
(ubuntu:focal, ubuntu:jammy, ubuntu:noble, alpine:latest) in parallel
instead of sequentially (~4x wall-clock improvement)
- Makefile: run docker builds in parallel for local dev too
- Dockerfile: consolidate ~20 separate RUN commands into 5 (one per
test phase), eliminating Docker layer overhead. Combine test binary
invocations where no state mutation is needed between them. Fix a bug
where TestDoDropPrivileges was silently not being run (was passed as a
second positional arg to -test.run instead of using regex alternation).
- TestMain: replace tail -F + 2s sleep with synchronous log read,
eliminating 2s overhead per test binary invocation. Set debugTest once
in TestMain instead of redundantly in each test function.
- session.read(): close channel on EOF so non-shell tests return
immediately instead of waiting for the 1s silence timeout.
Updates #19244
Change-Id: I2cc8588964fbce0dd7b654fb94e7ff33440b8584
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
go.cmd used cmd.exe to invoke PowerShell, which mangled arguments:
cmd.exe treats ^ as an escape character (so -run "^$" became -run "$",
running all tests instead of none) and = signs also caused issues in
the PowerShell→cmd.exe argument passing layer.
Replace it with a tiny no_std Rust binary (19KB, 32-bit x86 for
universal Windows compat: x86/x64/ARM64) that directly invokes the
Tailscale Go toolchain via CreateProcessW. The raw command line from
GetCommandLineW is passed through to CreateProcessW with only argv[0]
replaced, so arguments are never parsed or re-escaped.
The binary also handles first-run toolchain download natively using
curl.exe and tar.exe (both ship with Windows 10+), so PowerShell is
no longer required for normal operation. The PowerShell fallback is
only used for the rare TS_USE_GOCROSS=1 path.
PowerShell prefers go.exe over go.cmd when resolving ./tool/go, so
this is a drop-in replacement.
With go.exe in place, the CI can use the natural -bench=. -benchtime=1x
-run="^$" flags directly.
Also removes tool/go-win.ps1 which is now unused.
Updates #19255
Change-Id: I80da23285b74796e7694b89cff29a9fa0eaa6281
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The -run "^$" flag was being mangled by cmd.exe's argument processing.
The ^ character is cmd.exe's escape character, so go.cmd's cmd.exe layer
eats it, turning -run "^$" into -run "$" which matches all test names.
This caused the benchmark job to run every test, leading to timeouts
and Go runtime crashes.
Use -run XXXXNothingXXXX instead, which avoids special characters
entirely.
Updates #19252
Change-Id: I888c124254dd2767a40b61bcd68dbc9b22ad35a1
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a new vet analyzer that checks t.Run subtest names don't contain
characters requiring quoting when re-running via "go test -run". This
enforces the style guide rule: don't use spaces or punctuation in
subtest names.
The analyzer flags:
- Direct t.Run calls with string literal names containing spaces,
regex metacharacters, quotes, or other problematic characters
- Table-driven t.Run(tt.name, ...) calls where tt ranges over a
slice/map literal with bad name field values
Also fix all 978 existing violations across 81 test files, replacing
spaces with hyphens and shortening long sentence-like names to concise
hyphenated forms.
Updates #19242
Change-Id: Ib0ad96a111bd8e764582d1d4902fe2599454ab65
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
This repo's module is tailscale.com, and the tailscale-client-go-v2 repo
uses tailscale.com/client/tailscale/v2. It seems from #19010 that if we
have the client module as a dependency in this module, go vet will start
to consider the client module as part of tailscale.com/...
I'm not sure if this is a bug in go vet, but for now let's take the easy
fix and specify ./... instead. In my testing, it seems like this is
sufficient to make sure it just walks the file hierarchy and doesn't
find the client module as a sub-path.
Updates tailscale/corp#38418
Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
We did so for Linux and macOS already, so also do so for Windows. We
only didn't already because originally we never produced binaries for
it (due to our corp repo not needing them), and later because we had
no ./tool/go wrapper. But we have both of those things now.
Updates #18884
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
After fixing the flakey tests in #18811 and #18814 we can enable running
the natlab testsuite running on CI generally.
Fixes#18810
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
I was confused when everything I was reading in the CI failure was
saying `go mod tidy`, but the thing that was actually failing was
related to nix flakes. Rename the pipeline and step name to the `make
tidy` that it actually runs.
Updates #16637
Signed-off-by: James Tucker <james@tailscale.com>
This switches our gokrazy builds to use a new variant of cmd/gok called
opinionated about using monorepos: https://github.com/bradfitz/monogok
And with that, we can get rid of all the go.mod files and builddir forests
under gokrazy/**.
Updates #13038
Updates gokrazy/gokrazy#361
Change-Id: I9f18fbe59b8792286abc1e563d686ea9472c622d
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Recently, the golangci-lint workflow has been taking longer and longer
to complete, causing it to timeout after the default of 5 minutes.
Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: context deadline exceeded
Timeout exceeded: try increasing it by passing --timeout option
Although PR #18398 enabled the Go module cache, bootstrapping with a
cold cache still takes too long.
This PR doubles the default 5 minute timeout for golangci-lint to 10
minutes so that golangci-lint can finish downloading all of its
dependencies.
Note that this doesn’t affect the 5 minute timeout configured in
.golangci.yml, since running golangci-lint on your local instance
should still be plenty fast.
Fixes#18366
Signed-off-by: Simon Law <sfllaw@tailscale.com>
Recently, the golangci-lint workflow has been taking longer and longer
to complete, causing it to timeout after the default of 5 minutes.
Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: context deadline exceeded
Timeout exceeded: try increasing it by passing --timeout option
This PR upgrades actions/setup-go to version 6, the latest, and
enables caching for Go modules and build outputs. This should speed up
linting because most packages won’t have to be downloaded over and
over again.
Fixes#18366
Signed-off-by: Simon Law <sfllaw@tailscale.com>
Bump peter-evans/create-pull-request to 8.0.0 to ensure compatibility
with actions/checkout 6.x.
Updates #cleanup
Signed-off-by: Mario Minardi <mario@tailscale.com>
Add flags:
* --cigocached-host to support alternative host resolution in other
environments, like the corp repo.
* --stats to reduce the amount of bash script we need.
* --version to support a caching tool/cigocacher script that will
download from GitHub releases.
Updates tailscale/corp#10808
Change-Id: Ib2447bc5f79058669a70f2c49cef6aedd7afc049
Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
To save rebuilding cigocacher on each CI job, build it on-demand, and
publish a release similar to how we publish releases for tool/go to
consume. Once the first release is done, we can add a new
tool/cigocacher script that pins to a specific release for each branch
to download.
Updates tailscale/corp#10808
Change-Id: I7694b2c2240020ba2335eb467522cdd029469b6c
Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
Add support for pinning specific Tailscale versions during installation
via the TAILSCALE_VERSION environment variable.
Example usage:
curl -fsSL https://tailscale.com/install.sh | TAILSCALE_VERSION=1.88.4 sh
Fixes#17776
Signed-off-by: Raj Singh <raj@tailscale.com>
Implements a new disk put function for cigocacher that does not cause
locking issues on Windows when there are multiple processes reading and
writing the same files concurrently. Integrates cigocacher into test.yml
for Windows where we are running on larger runners that support
connecting to private Azure vnet resources where cigocached is hosted.
Updates tailscale/corp#10808
Change-Id: I0d0e9b670e49e0f9abf01ff3d605cd660dd85ebb
Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
The cache artifacts from a full run of test.yml are 14GB. Only save
artifacts from the main branch to ensure we don't thrash too much. Most
branches should get decent performance with a hit from recent main.
Fixestailscale/corp#34739
Change-Id: Ia83269d878e4781e3ddf33f1db2f21d06ea2130f
Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
Restrict running the golangci-lint workflow to when the workflow file
itself or a .go file, go.mod, or go.sum have actually been modified.
Updates #cleanup
Signed-off-by: Mario Minardi <mario@tailscale.com>
Skip the "request review" workflows for PRs that are in draft to reduce
noise / skip adding reviewers to PRs that are intentionally marked as
not ready to review.
Updates #cleanup
Signed-off-by: Mario Minardi <mario@tailscale.com>
This starts running the jsontags vet checker on the module.
All existing findings are adding to an allowlist.
Updates tailscale/corp#791
Signed-off-by: Joe Tsai <joetsai@digital-static.net>
Drop usage of the branches filter with a single asterisk as this matches
against zero or more characters but not a forward slash, resulting in
PRs to branch names with forwards slashes in them not having these
workflow run against them as expected.
Updates https://github.com/tailscale/corp/issues/33523
Signed-off-by: Mario Minardi <mario@tailscale.com>