Fixes tailscale/corp#39422
Updates tailscale/certstore for proper macOS support and
builds request signing support into macOS builds. iOS builds and
builds that do not use cgo are omitted.
Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
For debugging purposes, unstable builds will sometimes intentionally panic for
unexpected behaviours. We observed such a panic after loading a cached netmap,
but because we had a valid cached map, the client was unable to recover on its
own and the operator had to manually reset the cache.
As a defensive hedge, when netmap caching is enabled, check for a panic during
installation of a new network map. If one occurs, discard any cached netmaps
before letting the panic unwind, so that we do not lose the panic itself but
reduce the need for manual intervention.
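A minimal sketch of the discard-then-unwind pattern described above. The
netmapCache type and the installNetmap and tryInstall helpers are hypothetical
names for illustration, not the identifiers used in the change. The deferred
function never calls recover, so the original panic keeps unwinding with its
stack intact:

```go
package main

import "fmt"

// netmapCache is a hypothetical stand-in for the on-disk netmap cache.
type netmapCache struct {
	entries []string
}

// discard drops every cached netmap.
func (c *netmapCache) discard() { c.entries = nil }

// installNetmap runs install. If install panics, the deferred function
// discards the cache and returns without recovering, so the panic
// continues to unwind unchanged.
func installNetmap(c *netmapCache, install func()) {
	ok := false
	defer func() {
		if !ok {
			c.discard()
		}
	}()
	install()
	ok = true
}

// tryInstall exists only for the demo: it captures the panic value so
// main can print it instead of crashing.
func tryInstall(c *netmapCache, install func()) (panicked any) {
	defer func() { panicked = recover() }()
	installNetmap(c, install)
	return nil
}

func main() {
	c := &netmapCache{entries: []string{"cached-netmap"}}
	p := tryInstall(c, func() { panic("unexpected behaviour") })
	fmt.Println(p, len(c.entries)) // unexpected behaviour 0
}
```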
Updates #12639
Updates tailscale/corp#27300
Change-Id: I0436889c6bdc2fa728c9cb83630cd7b00a72ce68
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
If we get a 429 response during node registration, use the `Retry-After`
header for backoff instead of the regular exponential backoff.
The rate limiter error is propagated to the user, just like other
registration errors are, e.g.
```
$ tailscale up
backend error: node registration rate limited; will retry after 57s
exit status 1
```
Updates tailscale/corp#39533
Signed-off-by: Anton Tolchanov <anton@tailscale.com>
By adding a server-global parent bucket. Per-client rate limiting is
subject to the parent bucket if global rate limiting is enabled.
This implementation is experimental, and all related APIs should be
considered unstable.
Updates tailscale/corp#40291
Signed-off-by: Jordan Whited <jordan@tailscale.com>
* kube/authkey,cmd/containerboot: extract shared auth key reissue package
Move auth key reissue logic (set marker, wait for new key, clear marker,
read config) into a shared kube/authkey package and update containerboot
to use it. No behaviour change.
Updates #14080
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
* kube/authkey,kube/state,cmd/containerboot: preserve device_id across restarts
Stop clearing device_id, device_fqdn, and device_ips from state on startup.
These keys are now preserved across restarts so the operator can track
device identity. Expand ClearReissueAuthKey to clear device state and
tailscaled profile data when performing a full auth key reissue.
Updates #14080
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
* cmd/containerboot: use root context for auth key reissue wait
Pass the root context instead of bootCtx to setAndWaitForAuthKeyReissue.
The 60-second bootCtx timeout was cancelling the reissue wait before the
operator had time to respond, causing the pod to crash-loop.
Updates #14080
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
* cmd/k8s-proxy: add auth key renewal support
Add auth key reissue handling to k8s-proxy, mirroring containerboot.
When the proxy detects an auth failure (login-state health warning or
NeedsLogin state), it disconnects from control, signals the operator
via the state Secret, waits for a new key, clears stale state, and
exits so Kubernetes restarts the pod with the new key.
A health watcher goroutine runs alongside ts.Up() to short-circuit
the startup timeout on terminal auth failures.
Updates #14080
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
---------
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
Add an opt-in metrics.LabelMap tracking why patchifyPeer fails to
convert a PeersChanged entry into a PeersChangedPatch. The stats are
gated behind the TS_DEBUG_PATCHIFY_PEER_MISS envknob so there is zero
overhead in normal operation.
peerChangeDiff now takes an optional onFalse callback that is called
with the field name on every non-patchable return path. When the
envknob is off, nil is passed and replaced with a no-op at the top of
peerChangeDiff.
The resulting metric renders as:
    counter_patchify_miss{why="Hostinfo"} 2
    counter_patchify_miss{why="peer_not_found"} 1170
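The optional-callback shape can be sketched as follows. The peer struct and
the patchable function are simplified, hypothetical stand-ins for the real
types and for peerChangeDiff; only the nil-to-no-op handling is the point:

```go
package main

import "fmt"

// peer is a hypothetical, heavily trimmed stand-in for a peer node.
type peer struct {
	Name     string
	Hostinfo string
	DERP     int
}

// patchable reports whether the change from was to now can be expressed
// as a patch. On every false return it reports the offending field name
// to onFalse. A nil onFalse is replaced with a no-op up front, so the
// return paths never need a nil check.
func patchable(was, now peer, onFalse func(field string)) bool {
	if onFalse == nil {
		onFalse = func(string) {}
	}
	if was.Name != now.Name {
		onFalse("Name")
		return false
	}
	if was.Hostinfo != now.Hostinfo {
		onFalse("Hostinfo")
		return false
	}
	return true // only patchable fields (here, DERP) differ
}

func main() {
	misses := map[string]int{}
	count := func(field string) { misses[field]++ }

	a := peer{Name: "n1", Hostinfo: "linux", DERP: 1}
	b := peer{Name: "n1", Hostinfo: "darwin", DERP: 1}
	fmt.Println(patchable(a, b, count), misses) // false map[Hostinfo:1]
	fmt.Println(patchable(a, b, nil))           // false
}
```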
Updates tailscale/corp#40088
Change-Id: I2d4b9074bf42ec03ab296c0629a54106bafa873e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
On some nodes (found via natlab), the existing node's last-seen time could be
unset. In these cases, we want to accept the key and record a last-seen
time. This was breaking the cached netmap natlab tests.
Updates #12639
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
pickPort would bind a UDP socket on :0 to get a free port, close
the socket, then hope to rebind to the same port in NewConn. This
is a TOCTOU race that can cause flaky test failures when another
process grabs the port in between.
Instead, pass Port: 0 to NewConn and let the OS assign the port
atomically, then read back the assigned port via conn.LocalPort().
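The same idea using only the standard library, as an illustration: this uses
net.ListenUDP rather than the NewConn and conn.LocalPort() calls named above,
and listenAnyPort is a hypothetical helper. Binding with port 0 and reading
the port from the live socket leaves no window in which another process can
take it:

```go
package main

import (
	"fmt"
	"net"
)

// listenAnyPort binds a loopback UDP socket with port 0 so the OS picks a
// free port, then reads the assigned port back from the open socket.
func listenAnyPort() (*net.UDPConn, int, error) {
	conn, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 0})
	if err != nil {
		return nil, 0, err
	}
	return conn, conn.LocalAddr().(*net.UDPAddr).Port, nil
}

func main() {
	conn, port, err := listenAnyPort()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("assigned port is nonzero:", port != 0)
}
```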
Fixes #19409
Change-Id: Ie44b599fb93c361e29a05f2171ad747c46f82b7a
Co-authored-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>
Clients with the newly added node attribute
`"disable-linux-cgnat-drop-rule"` will not automatically drop inbound
traffic on non-Tailscale network interfaces with the source IP in the
CGNAT IP range. This is an initial proof-of-concept for enabling
connectivity with off-Tailnet CGNAT endpoints.
Fixes tailscale/corp#36270
Signed-off-by: Naman Sood <mail@nsood.in>
reflect.DeepEqual is expensive and allocates heavily. Replace it with
a field-by-field comparison that does zero allocations.
Adds tests and benchmarks for the new Equal method.
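For illustration, a field-by-field Equal on a hypothetical settings struct
(the real type and its fields are not shown here). One behavioural difference
worth keeping in mind: slices.Equal treats a nil slice and an empty slice as
equal, while reflect.DeepEqual does not.

```go
package main

import (
	"fmt"
	"slices"
)

// settings is a hypothetical struct used only to show the pattern.
type settings struct {
	Hostname string
	Port     uint16
	Tags     []string
}

// Equal compares two settings values field by field without reflection.
// Two nil pointers are equal; a nil and a non-nil pointer are not.
func (s *settings) Equal(o *settings) bool {
	if s == o {
		return true
	}
	if s == nil || o == nil {
		return false
	}
	return s.Hostname == o.Hostname &&
		s.Port == o.Port &&
		slices.Equal(s.Tags, o.Tags)
}

func main() {
	a := &settings{Hostname: "h", Port: 41641, Tags: []string{"tag:a"}}
	b := &settings{Hostname: "h", Port: 41641, Tags: []string{"tag:a"}}
	c := &settings{Hostname: "h", Port: 41641, Tags: []string{"tag:b"}}
	fmt.Println(a.Equal(b), a.Equal(c)) // true false
}
```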
Fixes #19363
Signed-off-by: Fernando Serboncini <fserb@tailscale.com>
Fix a panic in getOrCreateChain when the kernel lacks nftables support
(CONFIG_NF_TABLES). When the nftables netlink connection fails, chain
objects returned by getChainFromTable can have nil Hooknum and Priority
fields. Dereferencing these caused tailscaled to SIGSEGV during router
configuration, which manifested as tailscaled silently crashing ~13
seconds after "tailscale up" on arm64 gokrazy (whose kernel.arm64
build doesn't include nftables).
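A sketch of the kind of guard this implies. The chain struct below is a
simplified, hypothetical stand-in with pointer-typed Hooknum and Priority
fields, and chainMatches is an illustrative helper, not the code in
getOrCreateChain:

```go
package main

import (
	"errors"
	"fmt"
)

// chain is a hypothetical stand-in for an nftables chain object whose
// Hooknum and Priority fields are pointers and may be nil.
type chain struct {
	Name     string
	Hooknum  *uint32
	Priority *int32
}

// chainMatches checks the pointer fields before dereferencing them and
// returns an error instead of crashing when either is missing.
func chainMatches(c *chain, hook uint32, prio int32) (bool, error) {
	if c == nil || c.Hooknum == nil || c.Priority == nil {
		return false, errors.New("chain is missing hook or priority")
	}
	return *c.Hooknum == hook && *c.Priority == prio, nil
}

func main() {
	_, err := chainMatches(&chain{Name: "ts-input"}, 1, 0)
	fmt.Println(err) // chain is missing hook or priority
}
```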
Updates #13038
Change-Id: I14433616da5ed57895cad37038921fb4f79c3534
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Use linkat via /proc/self/fd with AT_SYMLINK_FOLLOW to create a
hardlink of the test binary instead of copying it. This avoids
copying ~50MB+ binaries into each test's temp directory, making
test setup faster and reducing disk I/O.
The simpler os.Link(b.Path, ret.Path) can't be used here because
the source binary lives in the first test's TempDir, which may be
cleaned up before later tests call CopyTo. The open FD keeps the
inode alive after the path is deleted, but os.Link needs a valid
path. (See also b9f468240f which tried os.Link but is racy for
this reason.)
The /proc/self/fd approach works without elevated privileges,
unlike AT_EMPTY_PATH which requires CAP_DAC_READ_SEARCH. If the
linkat fails for any reason (e.g. cross-filesystem temp dirs), it
falls back to the existing full-copy path.
Fixes #19397
Change-Id: I4b1f97f7e63a9ae9e09dce36dfbdd1f6cff92320
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The kernel version parser used strings.Cut with "-" to handle versions
like "5.4.0-76-generic", but Debian uses "+" in versions like
"6.12.41+deb13-amd64".
Use strings.IndexAny to find the first "-" or "+" and truncate there.
Fixes TestKernelVersion on Debian systems.
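The truncation step can be sketched like this. The helper name
baseKernelVersion is hypothetical; the two inputs are the examples quoted
above:

```go
package main

import (
	"fmt"
	"strings"
)

// baseKernelVersion drops everything from the first "-" or "+" onward,
// leaving only the dotted numeric part of a kernel release string.
func baseKernelVersion(release string) string {
	if i := strings.IndexAny(release, "-+"); i >= 0 {
		return release[:i]
	}
	return release
}

func main() {
	fmt.Println(baseKernelVersion("5.4.0-76-generic"))    // 5.4.0
	fmt.Println(baseKernelVersion("6.12.41+deb13-amd64")) // 6.12.41
}
```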
Fixes #19395
Change-Id: I70e5f95682d54baf908e51f9f4b51c130b00aaaa
Co-Authored-By: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>
The compare-metrics-stats subtest reset two independent counting
systems (physical connection counters and expvar.Int user metrics)
non-atomically. Background WireGuard keepalives arriving between the
resets could increment one system but not the other, causing
off-by-one packet/byte mismatches in either direction.
Replace the reset-then-compare pattern with snapshot-and-delta:
snapshot both systems before pings, snapshot again after, and compare
the deltas. This eliminates the non-atomic reset window entirely.
As a belt-and-suspenders safety net, tolerate a difference of exactly
one packet (and corresponding bytes) from a stray keepalive that
could still arrive in the narrow window between the two snapshots.
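A simplified sketch of snapshot-and-delta with a one-packet tolerance. The
counters and snapshot types are hypothetical, and the sketch tracks a single
packet count per system rather than the real packet and byte metrics:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counters models two independent counting systems that should move
// together (hypothetical stand-ins for the physical and user metrics).
type counters struct {
	physical atomic.Int64
	user     atomic.Int64
}

// snapshot is a point-in-time copy of both counters.
type snapshot struct {
	physical, user int64
}

func (c *counters) snap() snapshot {
	return snapshot{physical: c.physical.Load(), user: c.user.Load()}
}

// deltasMatch compares how much each system grew between two snapshots,
// allowing the growth to differ by at most tolerance packets.
func deltasMatch(before, after snapshot, tolerance int64) bool {
	diff := (after.physical - before.physical) - (after.user - before.user)
	if diff < 0 {
		diff = -diff
	}
	return diff <= tolerance
}

func main() {
	var c counters
	c.physical.Store(7) // pre-existing, unequal totals are fine:
	c.user.Store(3)     // nothing is ever reset

	before := c.snap()
	c.physical.Add(10) // traffic from the pings
	c.user.Add(10)
	c.physical.Add(1) // stray keepalive seen by only one system so far
	after := c.snap()

	fmt.Println(deltasMatch(before, after, 1)) // true
}
```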
flakestress passes with ~5900 runs (~2800 without -race, ~3100 with
-race), but it also passed previously. This is an annoying one to
repro.
Fixes #11762
Change-Id: I3447ad67e71c8146e85eed38b7a665033ef9e284
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The test had two problems:
1. runFileWatcher passed hardcoded "/etc/" to the inotify watcher,
but the test filesystem uses a temp directory prefix. The watcher
was watching the real /etc/, never seeing the test's file writes.
2. The test's watchFile used gonotify.NewDirWatcher which creates
goroutines that block on real inotify syscalls. These don't work
inside synctest's fake-time bubble. The test only passed standalone
by accident: gonotify walks /etc/ on startup producing fake events
that happened to trigger trample detection at the right time.
Fix the path issue by adding ActualPath to the wholeFileFS interface,
which translates logical paths (like "/etc/resolv.conf") to real
filesystem paths (respecting any test prefix). Use it in
runFileWatcher so the inotify watch targets the correct directory.
Replace gonotify in the test with a one-shot timer that synctest can
advance through fake time, reliably triggering the trample check.
Fixes #19400
Change-Id: Idb252881ec24d0ab3b3c1d154dbdaf532db837d4
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The previous filters allowed a handful of subtle issues, such as
updating the last seen date when the key or online status had not
changed, and having online keys unconditionally trigger an engine update.
These are fixed, and no-change updates from TSMP are now a no-op for
the engine, so we don't have to reconfigure.
Additional tests have been added as well.
Updates #12639
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
And cap WaitN calls to prevent token bucket errors. Frame length is
inclusive of DERP key for FrameSendPacket frames.
Updates tailscale/corp#40171
Signed-off-by: Jordan Whited <jordan@tailscale.com>
When running integration tests over SSH (e.g., in remote development
environments), the SSH_CLIENT environment variable is set. This causes
isSSHOverTailscale() to incorrectly detect an SSH session and change
behavior.
Clear SSH_CLIENT in the test node environment to prevent these false
positives.
Fixes #19393
Change-Id: I1411abf0be9704cce37051476efb04d59beed386
Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>
Avery found a bunch of tests that fail with -count=2.
Updates tailscale/corp#40176 (tracks making our CI detect them)
Change-Id: Ie3e4398070dd92e4fe0146badddf1254749cca20
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Co-authored-by: Avery Pennarun <apenwarr@tailscale.com>
TestLookupMetric was added in e8d140654 (2023-08-17) without
initializing the dnsCache and dnsCacheBytes globals. When run in
isolation, handleBootstrapDNS writes a nil body (from the
uninitialized dnsCacheBytes), causing getBootstrapDNS to fail
decoding an empty response with EOF.
Add a setDNSCache test helper that stores the dnsEntryMap, marshals
dnsCacheBytes, and registers a t.Cleanup to nil both out, so tests
that forget to call it will hit the dnsCache-nil fatal in
getBootstrapDNS rather than silently depending on prior test state.
Also add AssertNotParallel and a dnsCache-nil fatal check to
getBootstrapDNS, the central helper all bootstrap DNS tests flow
through, to prevent future tests from running in parallel (they
all mutate package-level DNS caches and metrics) and to give a
clear error if a test forgets to initialize the DNS caches.
Fixes #19388
Change-Id: I8ad454ec6026c71f13ecfa14d25925df5478b908
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Co-authored-by: Avery Pennarun <apenwarr@tailscale.com>
The natlab-integrationtest CI job frequently flakes by exhausting its
3m go test timeout. The root cause is that the QEMU VMs run under
pure software emulation (TCG) with no KVM. Under TCG, the guest
kernel's timer calibration busy-loops are at the mercy of host CPU
scheduling. When two VMs boot simultaneously on a 2-core CI runner,
one VM's calibration gets starved and produces wrong results, leaving
the kernel with broken timers that prevent it from ever completing
boot — even after the other VM finishes and frees up CPU.
Additionally, the microvm machine type doesn't provide HPET hardware,
but the kernel command line specified clocksource=hpet. And the VM
image build (make natlab) ran inside the test itself, consuming most
of the 3m timeout budget before the actual test started.
Fix by:
- Enabling KVM when /dev/kvm is available, so timer calibration
uses real hardware timers unaffected by host CPU scheduling.
- Adding a CI step to set /dev/kvm permissions on the GitHub
Actions runner (ubuntu-latest provides KVM but needs a udev rule).
- Pre-building the VM image in a separate CI step so it doesn't
cut into the go test -timeout budget.
- Replacing the hardcoded 60s context timeout with one derived from
t.Deadline(), so the test uses the full -timeout budget.
- Adding VM boot progress detection (AwaitFirstPacket) and QMP
diagnostics, so boot failures produce clear errors instead of
opaque "context deadline exceeded" messages.
With KVM enabled, the test passes reliably even on a single CPU core
with 3 parallel workers — a scenario that was 100% broken under TCG.
Fixes #18906
Change-Id: I4c87631a9c9678d185b9f30cb05c0f7bfa9f5c62
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
For tests to loudly declare (and panic on violation) when they're doing
something that's not safe in a parallel test.
Fixes #19385
Change-Id: If79693b0c235c146871a05ed74fa9ea75bb500f9
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The maxInFlightConnectionAttemptsForTest and
maxInFlightConnectionAttemptsPerClientForTest globals were plain ints
read by background gVisor TCP handler goroutines (via
wrapTCPProtocolHandler) and written by tstest.Replace cleanup in
TestTCPForwardLimits_PerClient. When a gVisor goroutine outlived the
test cleanup window, the race detector caught the unsynchronized
access.
The race-prone code was introduced in c5abbcd4b4d8 (2024-02-26,
"wgengine/netstack: add a per-client limit for in-flight TCP
forwards") which added both the plain int globals and the
TestTCPForwardLimits_PerClient test that writes them via
tstest.Replace. It is not obvious why this has only recently started
being detected as a data race; likely some combination of gVisor
version bumps, Go toolchain scheduler changes, and additional
TCP-injecting subtests (e.g. 03461ea7f, 2026-01-30) increased
goroutine churn enough to hit the window.
Change both globals to atomic.Int32 and replace tstest.Replace (which
does non-atomic *target = old on cleanup) with explicit Store/Cleanup
pairs.
Fixes #19118
Change-Id: Id26ba6fbfb2e4ade319976db80af8e16c7c8778e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
When built with the Tailscale Go toolchain, include the toolchain's
git revision in the version output. The non-JSON output shows the
first 10 hex digits:
    go version: go1.26.2 (tailscale/go dfe2a5fd8e)
The JSON output includes the full hash as "tailscaleGoGitHash", or
omits the field when not using tsgo.
The toolchain rev is read via a separate sync.OnceValue rather than
piggybacking on getEmbeddedInfo, because that function discards all
data when VCS fields are absent (e.g. in test binaries), while the
tailscale.toolchain.rev setting is still present.
Also add a CI-only test verifying tailscaleToolchainRev is non-empty
when built with the tailscale_go build tag.
Fixes #19374
Change-Id: Ied0b16d7aead5471d8c614c30cba8b0dcf80c691
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Parallelize the SSH integration tests across OS targets and reduce
per-container overhead:
- CI: use GitHub Actions matrix strategy to run all 4 OS containers
(ubuntu:focal, ubuntu:jammy, ubuntu:noble, alpine:latest) in parallel
instead of sequentially (~4x wall-clock improvement)
- Makefile: run docker builds in parallel for local dev too
- Dockerfile: consolidate ~20 separate RUN commands into 5 (one per
test phase), eliminating Docker layer overhead. Combine test binary
invocations where no state mutation is needed between them. Fix a bug
where TestDoDropPrivileges was silently not being run (was passed as a
second positional arg to -test.run instead of using regex alternation).
- TestMain: replace tail -F + 2s sleep with synchronous log read,
eliminating 2s overhead per test binary invocation. Set debugTest once
in TestMain instead of redundantly in each test function.
- session.read(): close channel on EOF so non-shell tests return
immediately instead of waiting for the 1s silence timeout.
Updates #19244
Change-Id: I2cc8588964fbce0dd7b654fb94e7ff33440b8584
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
I'm not sure how this file got into the repo without gofmt.
Maybe gofmt rules changed in some Go release?
Updates #cleanup
Change-Id: Ia8bd46e29f116f7fbfca11be80c8ef48699cd9f2
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Verify that GODEBUG=gocachehash=1 output from ./tool/go includes the
git revision from go.toolchain.rev, ensuring that bumping the Tailscale
Go fork (without a Go version number change) properly invalidates the
build cache.
The test only runs in CI or when the current Go binary is the Tailscale
toolchain (GOROOT contains /.cache/tsgo/), so open source contributors
using stock Go aren't forced to download tsgo.
Fixes tailscale/corp#36589
Change-Id: Ia98d3a3aa8c7fa67f9a0293066fa02a1997dcb95
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a --headless flag to the Host.app Run subcommand for running
macOS VMs without a GUI, enabling use from test frameworks.
Key changes:
- HostCli.swift: When --headless is set, run the VM via VMController
+ RunLoop.main.run() instead of NSApplicationMain. Using the
RunLoop (not dispatchMain) is required because VZ framework
callbacks depend on RunLoop sources.
- VMController.swift: Add headless parameter to createVirtualMachine
that configures a single socket-based NIC (no NAT NIC). This
matches the NIC configuration used when creating/saving VMs, so
saved state restoration works correctly. A NIC count mismatch
causes VZ to silently fail to execute guest code.
- TailMacConfigHelper.swift: Clean up socket network device logging.
- Config.swift: Move VM storage from ~/VM.bundle to
~/.cache/tailscale/vmtest/macos/.
- TailMac.swift: Fix dispatchMain→RunLoop.main.run() in the create
command (same VZ RunLoop requirement).
Updates #13038
Change-Id: Iea51c043aa92e8fc6257139b9f0e2e7677072fa2
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add natlabapp.arm64 config and gokrazydeps.go for building a gokrazy
natlab appliance image targeting arm64 (Apple Silicon). This is the
arm64 counterpart to the existing natlabapp (amd64) used by vmtest.
The arm64 image uses github.com/gokrazy/kernel.arm64 and is built
with "make natlab-arm64" in the gokrazy directory.
Updates #13038
Change-Id: I0e1f8e5840083a5de5954f2cf46e3babec129d96
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a --rate-config flag pointing to a JSON file for per-client receive
rate limits (bytes/sec and burst bytes). The config is reloaded on SIGHUP,
updating all existing client connections live. The --per-client-rate-limit
and --per-client-rate-burst flags are removed in favor of the config file.
In derpserver, rate limiting uses an atomic.Pointer[xrate.Limiter] per
client: nil when unlimited or mesh (zero overhead), non-nil when
rate-limited.
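A sketch of the nil-means-unlimited pointer pattern. The limiter struct is a
hypothetical stand-in for xrate.Limiter, and the client type and its methods
are illustrative names, not the derpserver code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// limiter is a hypothetical stand-in for a real rate limiter.
type limiter struct {
	bytesPerSec int
	burst       int
}

// client holds its limiter behind an atomic pointer so a config reload
// can swap it while the receive path keeps reading it without a lock.
type client struct {
	mesh bool
	lim  atomic.Pointer[limiter]
}

// applyConfig installs a new limiter, or nil for mesh peers and for an
// unlimited configuration.
func (c *client) applyConfig(bytesPerSec, burst int) {
	if c.mesh || bytesPerSec <= 0 {
		c.lim.Store(nil)
		return
	}
	c.lim.Store(&limiter{bytesPerSec: bytesPerSec, burst: burst})
}

// limited reports whether the receive path needs to consult a limiter.
// In the unlimited case this is a single pointer load and nil check.
func (c *client) limited() bool { return c.lim.Load() != nil }

func main() {
	c := &client{}
	fmt.Println(c.limited()) // false
	c.applyConfig(1<<20, 2<<20)
	fmt.Println(c.limited()) // true
	c.applyConfig(0, 0) // reload with limits removed
	fmt.Println(c.limited()) // false
}
```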
Document that clientSet.activeClient Store operations require Server.mu.
Updates tailscale/corp#38509
Signed-off-by: Mike O'Driscoll <mikeo@tailscale.com>
These test failures were never caught by CI because the package in question
was missing from our privileged tests list. tailscale/corp#40007 covers improving
our process around this.
Fixes #19316
Signed-off-by: Amal Bansode <amal@tailscale.com>
Start using a common helper for tests to declare that they require root.
This is step 1. A later step will then make this helper track which tests were
skipped so a subsequent pass will run these tests as root.
Updates tailscale/corp#40007
Change-Id: I4979e1def0fa3691d38c83f48c89aaa443e7f62e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
This reverts commit b25920dfc07452833895ad00b42db7e581b3cec8.
The `log.Printf` messages are causing panics in corp, in particular:
> panic: please use tailscale.com/logger.Logf instead of the log package
Fixing the TKA code to plumb through a logger properly is going to be
a hassle, so for now remove these logs to unblock merges to corp.
Updates tailscale/corp#39455
Signed-off-by: Alex Chan <alexc@tailscale.com>
ipn/local: add netmap mutations to the ipn bus
updates tailscale/tailscale#1909
This adds a new NotifyWatchOpt that allows watchers to
receive PeerChange events (derived from node mutations)
on the IPN bus in lieu of a complete netmap. We'll continue
to send the full netmap for any map response that includes it,
but for mutations, sending PeerChange events gives the client
the option to manage its own models more selectively and cuts
way down on JSON serialization overhead.
On chatty tailnets, this will vastly reduce the amount of
chatter on the bus.
This change should be backwards compatible, since it is
purely additive. Clients that subscribe to NotifyNetmap will
get the full netmap for every delta. New clients can
omit that and instead opt into NotifyPeerChanges.
Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
On dual-stack clusters defaulting to IPv6, the ProxyGroup egress
service only got an IPv6 address, which causes request failures.
Individual egress proxies already set PreferDualStack correctly.
Fixes: #18768
Signed-off-by: Fernando Serboncini <fserb@tailscale.com>
Validated against a modern Debian install, fixes a typo.
Updates #cleanup
Signed-off-by: Andrew Dunham <andrew@du.nham.ca>
Change-Id: I7b26012f54dbd2f0f9fea98722e8edc2fe97645a
As a warm-up to making natlab support multiple operating systems,
start with an easy one (in that it's also Unixy and open source like
Linux) and add FreeBSD 15.0 as a VM OS option for the vmtest
integration test framework, and add TestSubnetRouterFreeBSD which
tests subnet routing through a FreeBSD VM (Gokrazy → FreeBSD →
Gokrazy).
Key changes:
- Add FreeBSD150 OSImage using the official FreeBSD 15.0
BASIC-CLOUDINIT cloud image (xz-compressed qcow2)
- Add GOOS()/IsFreeBSD() methods to OSImage for cross-compilation
and OS-specific behavior
- Handle xz-compressed image downloads in ensureImage
- Refactor compileBinaries into compileBinariesForOS to support
multiple GOOS targets (linux, freebsd), with binaries registered
at <goos>/<name> paths on the file server VIP
- Add FreeBSD-specific cloud-init (nuageinit) user-data generation:
string-form runcmd (nuageinit doesn't support YAML arrays),
fetch(1) instead of curl, FreeBSD sysctl names for IP forwarding,
mkdir /usr/local/bin, PATH setup for tta
- Skip network-config in cidata ISO for FreeBSD (DHCP via rc.conf)
Updates tailscale/tailscale#13038
Change-Id: Ibeb4f7d02659d5cd8e3a7c3a66ee7b1a92a0110d
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>