Add a vmtest that brings up two gokrazy nodes A and B behind two
One2OneNAT networks (so direct UDP works in both directions and any
slowness can't be blamed on NAT traversal), establishes a WireGuard
tunnel A → B with TSMP, then rotates B's disco key four times and
asserts that the data plane recovers in both directions after each
rotation. All pings are TSMP (the data-plane ping; disco pings would
not exercise the WireGuard tunnel itself).
The five pings:
1. A → B (initial; brings up the tunnel; 30s budget)
2. B → A after rotate (LocalAPI rotate-disco-key debug action)
3. A → B after rotate (LocalAPI)
4. B → A after restart (SIGKILL; gokrazy supervisor respawns)
5. A → B after restart (SIGKILL)
Each post-rotation ping gets a 15-second budget. Two unavoidable
multi-second waits dominate today:
- The rotate-then-a→b phase takes ~10s on main because of LazyWG.
After B's WantRunning bounce, B's wgengine resets its
sentActivityAt/recvActivityAt maps and trims A out of the
wireguard-go config as an "idle peer"; B only re-adds A on
inbound activity, by which point A's first few TSMP packets
have been silently dropped at B's tundev. The
bradfitz/rm_lazy_wg branch removes that trimming entirely
(verified locally: this phase drops to <100ms there).
- The restart phases take ~5s for wireguard-go's RekeyTimeout
handshake retry. After SIGKILL+respawn the first WG handshake
init from the restarted node sometimes goes into the void
(likely the brief peer-removed window in the receiver's
two-step maybeReconfigWireguardLocked reconfig during which
the peer is absent from wireguard-go), and wg-go's 5s+jitter
retransmit timer is the next opportunity to retry. That retry
succeeds and the staged TSMP packet flushes. Intrinsic to the
protocol's retransmit policy.
Once LazyWG is removed and the first-handshake-after-reconfig race
is fixed, the budget should drop to 5s.
Supporting changes:
ipn/ipnlocal: DebugRotateDiscoKey now toggles WantRunning off and
back on after rotating the disco key. magicsock.Conn.RotateDiscoKey
only resets local disco state; without also dropping wireguard-go
session keys, peers keep encrypting with their stale per-peer
session against us until their rekey timer fires (WireGuard has no
data-plane signaling to invalidate sessions). Bouncing WantRunning
runs the engine through Reconfig(empty) → authReconfig, which
drops every peer's WG session so the next packet either way
triggers a fresh handshake.
ipn/ipnlocal, ipn/localapi: add a debug-only "peer-disco-keys"
LocalAPI action ([LocalBackend.DebugPeerDiscoKeys]) that returns
a map[NodePublic]DiscoPublic from the current netmap. Tests reach
it via [local.Client.DebugResultJSON]. We do not surface disco
keys via [ipnstate.PeerStatus] because adding a non-comparable
[key.DiscoPublic] field there breaks reflect-based test helpers
(e.g. TestFilterFormatAndSortExitNodes' use of cmp.Diff), and
general LocalAPI clients have no need for disco keys. Since the
debug LocalAPI is gated behind the ts_omit_debug build tag, this
endpoint is automatically stripped from small binaries.
cmd/tta: add /restart-tailscaled handler (Linux-only, via /proc walk)
to drive the SIGKILL phase. On gokrazy the supervisor respawns
tailscaled within a second.
tstest/integration/testcontrol: add Server.AllOnline. When set,
every peer entry in MapResponses is marked Online=true. Several
disco-key handling fast paths in controlclient and wgengine
(removeUnwantedDiscoUpdates, removeUnwantedDiscoUpdatesFromFull
NetmapUpdate, the wgengine tsmpLearnedDisco fast path) only fire
for online peers; without this flag, tests exercising disco-key
rotation only hit the offline-peer code paths, which mask issues
and are several seconds slower in this scenario. Finer-grained
per-node online tracking can be added later.
tstest/natlab/vmtest: add Env.RotateDiscoKey,
Env.RestartTailscaled, Env.PeerDiscoKey, Node.Name, an
[AllOnline] EnvOption that plumbs through to
testcontrol.Server.AllOnline, and an exported
Env.Ping(from, to, type, timeout). Ping replaces the unexported
helper so callers can specify both a ping type (PingDisco for
warming peer state, PingTSMP for asserting end-to-end
connectivity) and a deadline. PeerDiscoKey returns its LocalAPI
error so callers inside tstest.WaitFor can retry transient
failures rather than fataling the test.
Updates #12639
Updates #13038
Change-Id: I3644f27fc30e52990ba25a3983498cc582ddb958
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Two cloud-platform nodes (e.g. sr-a and sr-b in TestSiteToSite) boot in
parallel via errgroup and both call ensureCompiled and the inline image
preparation block, racing to Begin() the same shared *Step (which is
deduped by name in Env.Step). The second goroutine panics:
panic: Step "Compile linux_amd64 binaries": Begin called in state running
panic: Step "Prepare ubuntu-24.04 image": Begin called in state done
ensureCompiled had a TOCTOU dedup attempt (released compileMu before
doing the work, only added to the compiled set at the end), and image
preparation had no dedup at all.
Replace the compiled set with a per-key map[string]*sync.Once for each
of compile and image preparation, so concurrent callers serialize on
the Once and only the first executes Begin/work/End.
Fixes commit 02ffe5baa8ccb2b81c4cfba3b59653e2cff10e01.
Updates #13038
Change-Id: If710bcc9e0aafebf0ad5b61553bae11458d976d7
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Cache a pre-booted macOS VM snapshot on disk so subsequent test runs
restore from the snapshot instead of cold-booting. The snapshot is keyed
by the Tart base image digest and a code version constant
(macOSSnapshotCodeVersion); bumping either invalidates the cache.
Snapshot preparation (one-time):
- Boot the Tart base image with a NAT NIC (--nat-nic flag)
- Wait for SSH, compile and install cmd/tta as a LaunchDaemon
- TTA polls the host via AF_VSOCK for an IP assignment; during prep
the host replies "wait"
- Disconnect NIC, save VM state via SIGINT
Test fast path (cached, ~7s to agent connected):
- APFS clone the snapshot, write test-specific config.json
- Launch Host.app with --disconnected-nic --attach-network --assign-ip
- VZ restores from SaveFile.vzvmsave (~5s with 4GB RAM)
- TTA's vsock poll gets the IP config, sets static IP via ifconfig
(bypasses DHCP entirely), switches driver addr to the IP directly
(bypasses DNS), and resets the dial context so the reverse-dial
reconnects immediately
- TTA agent connects to test driver within ~2s of IP assignment
Key optimizations:
- 4GB RAM instead of 8GB: halves SaveFile.vzvmsave (1.4GB vs 2.4GB),
halves restore time (5.5s vs 11s)
- AF_VSOCK IP assignment: bypasses macOS DHCP (~5-7s saved)
- Direct IP dial: bypasses DNS resolution for test-driver.tailscale
- Dial context reset: cancels stale in-flight dials from snapshot
- Kill instead of SIGINT for test VM cleanup (no state save needed)
- Parallel VM launches
Also:
- Add TestDriverIPv4/TestDriverPort constants to vnet
- Add --nat-nic and --assign-ip flags to Host.app
- Fix SIGINT handler: retain DispatchSource globally, use dispatchMain()
- Add vsock listener (port 51011) to Host.app for IP config protocol
- Add disconnectNetwork() to VMController for clean snapshot state
- Fix Makefile: set -o pipefail so xcodebuild failures aren't swallowed
Updates #13038
Change-Id: Icbab73b57af7df3ae96136fb49cda2536310f31b
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
When --vmtest-web is set, Host.app is launched with --screenshot-port 0
to start a localhost HTTP server that captures the VZVirtualMachineView
display. The Go test harness parses the SCREENSHOT_PORT=<port> line from
stdout, then polls every 2 seconds for JPEG thumbnails and pushes them
over WebSocket to the web dashboard.
Clicking a screenshot thumbnail opens a full-resolution image proxied
through the web UI's /screenshot/{node} endpoint.
Screenshot events are excluded from the EventBus history (they're large
and only the latest matters, stored in NodeStatus.Screenshot).
Updates #13038
Change-Id: I9bc67ddd1cc72948b33c555d4be3d8db06a41f6d
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add macOS VM support to the vmtest framework using Tart's pre-built
macOS images (ghcr.io/cirruslabs/macos-tahoe-base) instead of building
from IPSW. The Tart image has SIP disabled and SSH enabled.
At test time, the Tart base image's disk, NVRAM, and hardware identity
are APFS-cloned into a tailmac-compatible directory layout, and the VM
is booted headlessly via tailmac's Host.app (Virtualization.framework)
with its NIC connected to vnet's dgram socket.
New features:
- tailmac.go: ensureTartImage (auto-pull), cloneTartToTailmac (format
conversion), startTailMacVM (launch + cleanup)
- NoAgent() node option for VMs without TTA installed
- LANPing() for ICMP reachability testing via TTA's /ping endpoint
- IsMacOS field on OSImage, with GOOS/GOARCH support
- Dgram socket listener in Start() for macOS VMs
- Fix ReadFromUnix error spam on dgram socket close in vnet
TestMacOSAndLinuxCanPing verifies a macOS Tart VM and a gokrazy Linux
VM can ping each other on the same vnet LAN.
Updates #13038
Change-Id: I5e73a27878abf009f780fdf11a346fc857711cff
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a vmtest that brings up two Ubuntu nodes, each behind its own
EasyNAT, joined to the tailnet. The sender pushes a small file via
"tailscale file cp" and the receiver fetches it via "tailscale file
get --wait", asserting that the filename and contents round-trip
unchanged.
To make Taildrop work in vmtest, three small pieces were needed:
The Linux/FreeBSD cloud-init now starts tailscaled with --statedir as
well as --state=mem:, so the daemon has a VarRoot to host Taildrop's
incoming-files directory. State itself remains in-memory (so nothing
persists across reboots); only the var-root scratch space is on disk.
vmtest.New grows a variadic EnvOption parameter and a SameTailnetUser
helper. When the option is passed, Start sets AllNodesSameUser=true
on the embedded testcontrol.Server. Cross-node Taildrop requires the
sender and receiver to share a Tailnet user (or have an explicit
PeerCapabilityFileSharingTarget granted between them, which we don't
plumb here), so TestTaildrop opts in. Existing tests don't.
cmd/tta gains /taildrop-send and /taildrop-recv handlers that wrap
"tailscale file cp" and "tailscale file get --wait", plus
Env.SendTaildropFile and Env.RecvTaildropFile helpers in vmtest that
drive them.
Updates #13038
Change-Id: I8f5f70f88106e6e2ee07780dd46fe00f8efcfdf1
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a vmtest that brings up a Tailscale client, an Ubuntu VM acting
as a Mullvad-style plain-WireGuard exit node, and a non-Tailscale
webserver, each on its own NAT'd vnet network with a distinct WAN
IP. The test exercises Tailscale's IsWireGuardOnly peer code path:
the way the control plane wires Mullvad exit nodes into a client's
netmap, including the per-client SelfNodeV4MasqAddrForThisPeer
source-IP rewrite that lets a Tailscale CGNAT IP egress through a
plain-WireGuard tunnel that has no idea what Tailscale is.
The mullvad VM doesn't run wireguard-tools or kernel WireGuard;
instead, a new TTA endpoint /wg-server-up creates a real Linux TUN
named wg0, drives it with wireguard-go (already vendored), and
configures the kernel side (ip addr/up, ip_forward, iptables NAT
MASQUERADE) so decrypted traffic from the peer egresses with the
mullvad VM's WAN IP. Userspace vs kernel WireGuard makes no
difference on the wire — what's being tested is Tailscale's
plain-WireGuard exit-node code path, not the kernel module — and
this lets the test avoid downloading and installing .deb packages
inside the VM.
Adds Env.BringUpMullvadWGServer (calls /wg-server-up, returns the
generated WG public key as a key.NodePublic), Env.SetExitNodeIP
(EditPrefs ExitNodeIP directly, for exit nodes whose IPs aren't
discoverable via TTA), Env.ControlServer (exposes the underlying
testcontrol.Server so tests can UpdateNode / SetMasqueradeAddresses
to inject custom peers), and Env.Status (fetches a node's tailscale
status, used to read the client's pubkey so we can pin it as the
WG server's only allowed peer).
The test verifies that the webserver's echoed source IP is the
client's WAN with no exit node selected, the mullvad VM's WAN with
the WG-only peer selected as exit, and the client's WAN again after
clearing.
Updates #13038
Change-Id: I5bac4e0d832f05929f12cb77fa9946d7f5fb5ef1
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add an optional --vmtest-web flag that starts an HTTP server showing a
live dashboard for vmtest runs. The dashboard includes:
- Step progress tracker showing all test phases (compile, image prep,
QEMU launch, agent connect, tailscale up, test-specific steps)
with status icons and elapsed times
- Per-VM "virtual monitor" cards showing serial console output
streamed in realtime via WebSocket
- Per-NIC DHCP status (supporting multi-homed VMs like subnet routers)
- Per-node Tailscale status (hidden for non-tailnet VMs)
- Test status badge (Running/Passed/Failed) with live elapsed timer
- Event log showing all lifecycle events chronologically
Architecture follows the existing util/eventbus HTMX+WebSocket pattern:
the server pushes HTML fragments with hx-swap-oob attributes over a
WebSocket, and HTMX routes them to the correct DOM elements by ID.
Key components:
- vmstatus.go: Step tracker (Begin/End lifecycle), EventBus (pub/sub
with history for late joiners), VMEvent types, NodeStatus tracking
- web.go: HTTP server, WebSocket handler, template loading, ANSI-to-HTML
conversion via robert-nix/ansihtml, deterministic port selection
- assets/: HTML templates, CSS, HTMX library (copied from eventbus)
- vnet/vnet.go: DHCP event callback on Server for observing DHCP lifecycle
- qemu.go: Console log file tailing with manual offset-based reading
Usage:
go test ./tstest/natlab/vmtest/ --run-vm-tests --vmtest-web=:0 -v
When using :0, a deterministic port based on the test name is tried
first so re-runs get the same URL, falling back to OS-assigned on
conflict.
Updates #13038
Change-Id: I45281347b3d7af78ed9f4ff896033984f84dcb4d
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a --test-version flag to run the natlab VM tests against
released tailscale/tailscaled binaries downloaded from
pkgs.tailscale.com instead of building from the source tree.
The value can be a concrete release like "1.97.255", or "stable" /
"unstable" which resolve to the latest TarballsVersion on that track
via pkgs.tailscale.com/<track>/?mode=json. The track for a concrete
version is derived from its minor (even=stable, odd=unstable). The
host architecture (amd64 or arm64) selects the tarball.
Tarballs are cached + extracted under
~/.cache/tailscale-vmtest/builds/<version>_<arch>/ so they are not
re-fetched per test. tta is still always built from the local tree.
Cloud VMs (Ubuntu, Debian) pick up the downloaded binaries via the
existing files.tailscale file server. Non-Linux GOOS (FreeBSD) falls
back to building from source since pkgs.tailscale.com only ships
Linux tarballs. Gokrazy nodes continue to use binaries baked into
the gokrazy image; --test-version is a no-op for them.
Updates #13038
Change-Id: I213ef7db362dd17bf69d2685cbf2ab0ec5a3fee1
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add two tests building on TestExitNode's framework:
TestSubnetRouterPublicIP brings up a client, a subnet router, and a
webserver, each on its own NAT'd network with distinct WAN IPs. The
subnet router advertises the webserver's network as a route. The test
toggles the client's --accept-routes preference and asserts that the
webserver's echoed source IP switches between the client's own WAN
(direct dial) and the subnet router's WAN (forwarded through the
router and SNAT'd).
TestSubnetRouterAndExitNode adds a fourth node, an exit node that
advertises 0.0.0.0/0 + ::/0, and uses a table-driven layout with
subtests to cover the four combinations of (exit on/off, subnet
on/off). The case where both are on confirms longest-prefix match
wins: the subnet router's /24 takes precedence over the exit node's
/0. The exit node itself is configured with --accept-routes=off so
that, in the exit-only case, it forwards directly to the simulated
internet rather than re-routing the forwarded traffic via the subnet
router (which would otherwise mask the exit node's WAN as the
observed source).
Adds an Env.SetAcceptRoutes helper for toggling the RouteAll pref via
EditPrefs, used by both tests.
Updates #13038
Change-Id: Ifc2726db1df2f039c477c222484f535bebc40445
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a vmtest TestExitNode that brings up a client, two exit nodes, and a
non-Tailscale webserver, each on its own NAT'd vnet network with a
distinct WAN IP. The test cycles the client's exit node setting between
off, exit1, and exit2 and asserts that the webserver echoes the expected
post-NAT source IP for each.
Three pieces were needed to make this work:
vnet now forwards TCP between simulated networks at the packet level,
mirroring the existing UDP path. When a guest VM sends TCP to another
simulated network's WAN IP, the source network's gateway rewrites src
via doNATOut and routeTCPPacket hands the packet off to the destination
network, which rewrites dst via doNATIn and writes the rewritten frame
onto the destination LAN. The TCP stacks of the two guest VM kernels
talk end-to-end; vnet just NATs the IP/port headers in flight, so all
TCP semantics (handshakes, options, sequence numbers, payload) are
preserved without a gvisor TCP termination in the middle. Adds a
focused TestInterNetworkTCP that exercises this path without any
Tailscale machinery.
cmd/tta binds its outbound dial to the default route's interface using
SO_BINDTODEVICE. Without that, the moment tailscaled installs
0.0.0.0/0 → tailscale0 in response to setting an exit node, TTA's
existing TCP connection to test-driver gets rerouted through the exit
node. From the test driver's perspective the connection's packets then
arrive with the exit node's WAN IP as the source rather than the
client's, so they don't match the existing flow and the connection is
dead — manifesting in the test as a hang on EditPrefs (which had
actually completed in milliseconds on the daemon side, but whose
response never made it back). Pinning the socket to the underlying NIC
keeps TTA's agent connection on a real interface regardless of any
policy routing tailscaled installs later. We bind rather than carry the
Tailscale bypass fwmark because the fwmark approach is conditional on
tailscaled having configured SO_MARK-based policy routing, while
binding is unconditional.
vmtest grows an Env.SetExitNode helper that sets ExitNodeIP via
EditPrefs through the agent, used by the new test.
Updates #13038
Change-Id: I9fc8f91848b7aa2297ef3eaf71fed9d96056a024
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Verifies that site-to-site Tailscale subnet routing with
--snat-subnet-routes=false preserves the original source IP
end-to-end.
Topology: two sites, each with a Linux subnet router on a NATted WAN
plus an internal LAN, and a non-Tailscale backend on each LAN. Backends
are given static routes pointing to their local subnet router for the
remote site's prefix; an HTTP GET from backend-a to backend-b over
Tailscale returns a body containing backend-a's LAN IP.
Adds the supporting vmtest.SNATSubnetRoutes NodeOption and plumbs
snat-subnet-routes through TTA's /up handler. The webserver started by
vmtest.WebServer now also echoes the remote IP, for the preservation
assertion.
Adds a /add-route TTA endpoint (Linux-only for now) and a vmtest
Env.AddRoute helper so the test can install the backend static routes
through TTA rather than needing a host SSH key and debug NIC.
ensureGokrazy now always rebuilds the natlab qcow2 (once per test
process, via sync.Once) so the test picks up the new TTA and webserver
behavior.
This is pulled out of a larger pending change that adds FreeBSD
site-to-site subnet routing support; figured we should have at least
the Linux test covering what works today.
Updates #5573
Change-Id: I881c55b0f118ac9094546b5fbe68dddf179bb042
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
As a warm-up to making natlab support multiple operating systems,
start with an easy one (in that it's also Unixy and open source like
Linux) and add FreeBSD 15.0 as a VM OS option for the vmtest
integration test framework, and add TestSubnetRouterFreeBSD which
tests subnet routing through a FreeBSD VM (Gokrazy → FreeBSD →
Gokrazy).
Key changes:
- Add FreeBSD150 OSImage using the official FreeBSD 15.0
BASIC-CLOUDINIT cloud image (xz-compressed qcow2)
- Add GOOS()/IsFreeBSD() methods to OSImage for cross-compilation
and OS-specific behavior
- Handle xz-compressed image downloads in ensureImage
- Refactor compileBinaries into compileBinariesForOS to support
multiple GOOS targets (linux, freebsd), with binaries registered
at <goos>/<name> paths on the file server VIP
- Add FreeBSD-specific cloud-init (nuageinit) user-data generation:
string-form runcmd (nuageinit doesn't support YAML arrays),
fetch(1) instead of curl, FreeBSD sysctl names for IP forwarding,
mkdir /usr/local/bin, PATH setup for tta
- Skip network-config in cidata ISO for FreeBSD (DHCP via rc.conf)
Updates tailscale/tailscale#13038
Change-Id: Ibeb4f7d02659d5cd8e3a7c3a66ee7b1a92a0110d
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add tstest/natlab/vmtest, a high-level framework for running multi-VM
integration tests with mixed OS types (gokrazy + Ubuntu/Debian cloud
images) connected via natlab's vnet virtual network.
The vmtest package provides:
- Env type that orchestrates vnet, QEMU processes, and agent connections
- OS image support (Gokrazy, Ubuntu2404, Debian12) with download/cache
- QEMU launch per OS type (microvm for gokrazy, q35+KVM for cloud)
- Cloud-init seed ISO generation with network-config for multi-NIC
- Cross-compilation of test binaries for cloud VMs
- Debug SSH NIC on cloud VMs for interactive debugging
- Test helpers: ApproveRoutes, HTTPGet, TailscalePing, DumpStatus,
WaitForPeerRoute, SSHExec
TTA enhancements (cmd/tta):
- Parameterize /up (accept-routes, advertise-routes, snat-subnet-routes)
- Add /set, /start-webserver, /http-get endpoints
- /http-get uses local.Client.UserDial for Tailscale-routed requests
- Fix /ping for non-gokrazy systems
TestSubnetRouter exercises a 3-VM subnet router scenario:
client (gokrazy) → subnet-router (Ubuntu, dual-NIC) → backend (gokrazy)
Verifies HTTP access to the backend webserver through the Tailscale
subnet route. Passes in ~30 seconds.
Updates tailscale/tailscale#13038
Change-Id: I165b64af241d37f5f5870e796a52502fc56146fa
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>