mirror of
https://github.com/tailscale/tailscale.git
synced 2026-04-28 08:52:07 +02:00
The natlab-integrationtest CI job frequently flakes by exhausting its 3m go test timeout.

The root cause is that the QEMU VMs run under pure software emulation (TCG) with no KVM. Under TCG, the guest kernel's timer calibration busy-loops are at the mercy of host CPU scheduling. When two VMs boot simultaneously on a 2-core CI runner, one VM's calibration gets starved and produces wrong results, leaving the kernel with broken timers that prevent it from ever completing boot, even after the other VM finishes and frees up CPU.

Additionally, the microvm machine type doesn't provide HPET hardware, but the kernel command line specified clocksource=hpet. And the VM image build (make natlab) ran inside the test itself, consuming most of the 3m timeout budget before the actual test started.

Fix by:

- Enabling KVM when /dev/kvm is available, so timer calibration uses real hardware timers unaffected by host CPU scheduling.
- Adding a CI step to set /dev/kvm permissions on the GitHub Actions runner (ubuntu-latest provides KVM but needs a udev rule).
- Pre-building the VM image in a separate CI step so it doesn't cut into the go test -timeout budget.
- Replacing the hardcoded 60s context timeout with one derived from t.Deadline(), so the test uses the full -timeout budget.
- Adding VM boot progress detection (AwaitFirstPacket) and QMP diagnostics, so boot failures produce clear errors instead of opaque "context deadline exceeded" messages.

With KVM enabled, the test passes reliably even on a single CPU core with 3 parallel workers, a scenario that was 100% broken under TCG.

Fixes #18906

Change-Id: I4c87631a9c9678d185b9f30cb05c0f7bfa9f5c62
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
45 lines
1.5 KiB
YAML
# Run some natlab integration tests.
# See https://github.com/tailscale/tailscale/issues/13038
name: "natlab-integrationtest"

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

on:
  push:
    branches:
      - "main"
      - "release-branch/*"
  pull_request:
    # all PRs on all branches
  merge_group:
    branches:
      - "main"
jobs:
  natlab-integrationtest:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
      - name: Enable KVM
        run: |
          echo 'KERNEL=="kvm", GROUP="kvm", MODE="0666", OPTIONS+="static_node=kvm"' | sudo tee /etc/udev/rules.d/99-kvm4all.rules
          sudo udevadm control --reload-rules
          sudo udevadm trigger --name-match=kvm
      - name: Install qemu
        run: |
          sudo rm -f /var/lib/man-db/auto-update
          sudo apt-get -y update
          sudo apt-get -y remove man-db
          sudo apt-get install -y qemu-system-x86 qemu-utils
      - name: Build VM image
        # The test will build this if missing, but we do it explicitly
        # to avoid cutting into the go test -timeout budget, and to
        # fail earlier with a clearer error if the image build breaks.
        run: |
          make -C gokrazy natlab
      - name: Run natlab integration tests
        run: |
          ./tool/go test -v -run=^TestEasyEasy$ -timeout=3m -count=1 ./tstest/integration/nat --run-vm-tests