The package updates started getting really slow yesterday. We can do
better, but attempt a band aid fix for now, as the test is failing about
a third of the time on PR CI.
Updates tailscale/corp#40465
Change-Id: Icf53292ba83dd1ed76b9bdf9fb94a8f6fb448c07
Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
Add a new control/tsp package providing a client for speaking the
Tailscale protocol to a coordination server over Noise, along with a
cmd/tsp binary exposing it as a low-level composable tool for
generating keys, registering nodes, and issuing map requests.
Previously developed out-of-tree at github.com/bradfitz/tsp; imported
here without git history.
Updates #12542
Change-Id: I6ad21143c4aefe8939d4a46ae65b2184173bf69f
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
modifying DNS responses for domains they are also connectors for
For Connectors 2025, determine if a client is configured as a
connector and what domains it is a connector for. When acting as a
client, don't install Split DNS routes to other connectors for those
domains, and don't alter DNS responses for those domains. The responses
are forwarded back to the original client, which in turn does the alteration,
swapping the real IP for a Magic IP.
A client is also a connector for a domain if it has tags that overlap
with tags in the configured policy, and --advertise-connector=true
in the prefs (not in the self-node Hostinfo from the netmap). We use the prefs
as the source of truth because control only gets a copy from the prefs, and
may drift. And the AppConnector field is currently zeroed out in the
self-node Hostinfo from control.
The extension adds a ProfileStateChange hook to process prefs changes,
and the config type is split into prefs and nodeview sub-configs.
Fixestailscale/corp#39317
Signed-off-by: Michael Ben-Ami <mzb@tailscale.com>
Before:
tka initialized at head 325557575a59525354484e4a534f494b4c4e56575435583737564b5036584c4d4c335534554255344c344c36484c5a444a323341
After:
tka initialized at head 2UWWZYRSTHNJSOIKLNVWT5X77VKP6XLML3U4UBU4L4L6HLZDJ23A
Printing the AUM hash as hex makes it difficult to compare to other AUM
hashes; stringifying it will make it consistent with other printing.
Updates #cleanup
Change-Id: Ic1e23a9ce6a71a53cff7d2190f9fa06eb838ab89
Signed-off-by: Alex Chan <alexc@tailscale.com>
Endpoint's best address was cleared on trustBestAddrUntil expiry
only if it was a udprelay connection. This generalizes invalidation
to also cover direct UDP.
Trust deadline is checked in two cases:
On disco ping timeout from the endpoint's best address.
Traffic goes DERP-only, heartbeats to the old address stop.
The discovery pings are still in flight, handled by the following.
On disco ping success from an alternative. BestAddr switches to the
working path, trust refreshed, eager discovery stops. The still
in flight pongs are handled by betterAddr().
Updates #19407
Change-Id: Ic41ed18edb4a6e4350a2d49271ba01566a6a6964
Signed-off-by: Alex Valiushko <alexvaliushko@tailscale.com>
TestUsedConsistently shells out to git grep to find forbidden
http.Method* uses across the repo. Since the test itself doesn't
open any repo files, Go's test cache considers it unchanged
between commits and serves stale passing results even when new
violations are introduced.
Fix by opening .git/index, which makes Go's test cache track it
as an input. The index file changes on git reset, checkout, pull,
etc., so the cache is properly invalidated when moving between
commits.
Updates tailscale/corp#40359
Change-Id: If1497b992a545351bdd68cff279d60f5591fe70b
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
This commit modifies the `DNSConfig` custom resource to allow the
user to specify affinity rules on the nameserver pods.
Updates: https://github.com/tailscale/tailscale/issues/18556
Signed-off-by: David Bond <davidsbond93@gmail.com>
fixestailscale/corp#39422
Updates tailscale/certstore for properly macOS support and
builds the request signing support into macOS builds. iOS and builds
that do not use cGo are omitted.
Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
For debugging purposes, unstable builds will sometimes intentionally panic for
unexpected behaviours. We observed such a panic after loading a cached netmap,
but because we had a valid cached map, the client was unable to recover on its
own and the operator had to manually reset the cache.
As a defensive hedge, when netmap caching is enabled, check for a panic during
installation of a net network map: If one occurs, discard any cached netmaps
before letting the panic unwind, so that we do not lose the panic itself, but
reduce the need for manual intervention.
Updates #12639
Updates tailscale/corp#27300
Change-Id: I0436889c6bdc2fa728c9cb83630cd7b00a72ce68
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
If we get a 429 response during node registration, use the `Retry-After`
header for backoff instead of the regular exponential backoff.
The rate limiter error is propagated to the user, just like other
registration errors are, e.g.
```
$ tailscale up
backend error: node registration rate limited; will retry after 57s
exit status 1
```
Updates tailscale/corp#39533
Signed-off-by: Anton Tolchanov <anton@tailscale.com>
By adding a server-global parent bucket. Per-client rate limiting is
subject to the parent bucket if global rate limiting is enabled.
This implementation is experimental, and all related APIs should be
considered unstable.
Updates tailscale/corp#40291
Signed-off-by: Jordan Whited <jordan@tailscale.com>
* kube/authkey,cmd/containerboot: extract shared auth key reissue package
Move auth key reissue logic (set marker, wait for new key, clear marker,
read config) into a shared kube/authkey package and update containerboot
to use it. No behaviour change.
Updates #14080
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
* kube/authkey,kube/state,cmd/containerboot: preserve device_id across restarts
Stop clearing device_id, device_fqdn, and device_ips from state on startup.
These keys are now preserved across restarts so the operator can track
device identity. Expand ClearReissueAuthKey to clear device state and
tailscaled profile data when performing a full auth key reissue.
Updates #14080
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
* cmd/containerboot: use root context for auth key reissue wait
Pass the root context instead of bootCtx to setAndWaitForAuthKeyReissue.
The 60-second bootCtx timeout was cancelling the reissue wait before the
operator had time to respond, causing the pod to crash-loop.
Updates #14080
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
* cmd/k8s-proxy: add auth key renewal support
Add auth key reissue handling to k8s-proxy, mirroring containerboot.
When the proxy detects an auth failure (login-state health warning or
NeedsLogin state), it disconnects from control, signals the operator
via the state Secret, waits for a new key, clears stale state, and
exits so Kubernetes restarts the pod with the new key.
A health watcher goroutine runs alongside ts.Up() to short-circuit
the startup timeout on terminal auth failures.
Updates #14080
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
---------
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
Add an opt-in metrics.LabelMap tracking why patchifyPeer fails to
convert a PeersChanged entry into a PeersChangedPatch. The stats are
gated behind the TS_DEBUG_PATCHIFY_PEER_MISS envknob so there is zero
overhead in normal operation.
peerChangeDiff now takes an optional onFalse callback that is called
with the field name on every non-patchable return path. When the
envknob is off, nil is passed and replaced with a no-op at the top of
peerChangeDiff.
The resulting metric renders as:
counter_patchify_miss{why="Hostinfo"} 2
counter_patchify_miss{why="peer_not_found"} 1170
Updates tailscale/corp#40088
Change-Id: I2d4b9074bf42ec03ab296c0629a54106bafa873e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
On some nodes (found via natlab), the existing nodes last seen could be
unset. For these cases, we would want to accept the key and write a last
seen. This was breaking the cached netmap natlab tests.
Updates #12639
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
pickPort would bind a UDP socket on :0 to get a free port, close
the socket, then hope to rebind to the same port in NewConn. This
is a TOCTOU race that can cause flaky test failures when another
process grabs the port in between.
Instead, pass Port: 0 to NewConn and let the OS assign the port
atomically, then read back the assigned port via conn.LocalPort().
Fixes#19409
Change-Id: Ie44b599fb93c361e29a05f2171ad747c46f82b7a
Co-authored-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>
Clients with the newly added node attribute
`"disable-linux-cgnat-drop-rule"` will not automatically drop inbound
traffic on non-Tailscale network interfaces with the source IP in the
CGNAT IP range. This is an initial proof-of-concept for enabling
connectivity with off-Tailnet CGNAT endpoints.
Fixestailscale/corp#36270.
Signed-off-by: Naman Sood <mail@nsood.in>
reflect.DeepEqual is expensive and allocates heavily. Replace it with
a field-by-field comparison that does zero allocations.
Adds tests and benchmarks for the new Equal method.
Fixes#19363
Signed-off-by: Fernando Serboncini <fserb@tailscale.com>
Fix a panic in getOrCreateChain when the kernel lacks nftables support
(CONFIG_NF_TABLES). When the nftables netlink connection fails, chain
objects returned by getChainFromTable can have nil Hooknum and Priority
fields. Dereferencing these caused tailscaled to SIGSEGV during router
configuration, which manifested as tailscaled silently crashing ~13
seconds after "tailscale up" on arm64 gokrazy (whose kernel.arm64
build doesn't include nftables).
Updates #13038
Change-Id: I14433616da5ed57895cad37038921fb4f79c3534
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Use linkat via /proc/self/fd with AT_SYMLINK_FOLLOW to create a
hardlink of the test binary instead of copying it. This avoids
copying ~50MB+ binaries into each test's temp directory, making
test setup faster and reducing disk I/O.
The simpler os.Link(b.Path, ret.Path) can't be used here because
the source binary lives in the first test's TempDir, which may be
cleaned up before later tests call CopyTo. The open FD keeps the
inode alive after the path is deleted, but os.Link needs a valid
path. (See also b9f468240f which tried os.Link but is racy for
this reason.)
The /proc/self/fd approach works without elevated privileges,
unlike AT_EMPTY_PATH which requires CAP_DAC_READ_SEARCH. If the
linkat fails for any reason (e.g. cross-filesystem temp dirs), it
falls back to the existing full-copy path.
Fixes#19397
Change-Id: I4b1f97f7e63a9ae9e09dce36dfbdd1f6cff92320
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The kernel version parser used strings.Cut with "-" to handle versions
like "5.4.0-76-generic", but Debian uses "+" in versions like
"6.12.41+deb13-amd64".
Use strings.IndexAny to find the first "-" or "+" and truncate there.
Fixes TestKernelVersion on Debian systems.
Fixes#19395
Change-Id: I70e5f95682d54baf908e51f9f4b51c130b00aaaa
Co-Authored-By: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>
The compare-metrics-stats subtest reset two independent counting
systems (physical connection counters and expvar.Int user metrics)
non-atomically. Background WireGuard keepalives arriving between the
resets could increment one system but not the other, causing
off-by-one packet/byte mismatches in either direction.
Replace the reset-then-compare pattern with snapshot-and-delta:
snapshot both systems before pings, snapshot again after, and compare
the deltas. This eliminates the non-atomic reset window entirely.
As a belt-and-suspenders safety net, tolerate a difference of exactly
one packet (and corresponding bytes) from a stray keepalive that
could still arrive in the narrow window between the two snapshots.
flakestress passes with ~5900 runs (~2800 without -race, ~3100 with
-race) but it also passed previously too. This is an annoying one to
repro.
Fixes#11762
Change-Id: I3447ad67e71c8146e85eed38b7a665033ef9e284
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The test had two problems:
1. runFileWatcher passed hardcoded "/etc/" to the inotify watcher,
but the test filesystem uses a temp directory prefix. The watcher
was watching the real /etc/, never seeing the test's file writes.
2. The test's watchFile used gonotify.NewDirWatcher which creates
goroutines that block on real inotify syscalls. These don't work
inside synctest's fake-time bubble. The test only passed standalone
by accident: gonotify walks /etc/ on startup producing fake events
that happened to trigger trample detection at the right time.
Fix the path issue by adding ActualPath to the wholeFileFS interface,
which translates logical paths (like "/etc/resolv.conf") to real
filesystem paths (respecting any test prefix). Use it in
runFileWatcher so the inotify watch targets the correct directory.
Replace gonotify in the test with a one-shot timer that synctest can
advance through fake time, reliably triggering the trample check.
Fixes#19400
Change-Id: Idb252881ec24d0ab3b3c1d154dbdaf532db837d4
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The previous filters would allow for a handful of subtle issues such as
updating the last seen date when the key or online status had not
changed, and making online keys unconditionally make an engine update.
These have been fixed along side making no change updates from TSMP into
a no-op for the engine so we don't have to reconfigure.
A bunch of additional testing has been added as well.
Updates #12639
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
And cap WaitN calls to prevent token bucket errors. Frame length is
inclusive of DERP key for FrameSendPacket frames.
Updates tailscale/corp#40171
Signed-off-by: Jordan Whited <jordan@tailscale.com>
When running integration tests over SSH (e.g., in remote development
environments), the SSH_CLIENT environment variable is set. This causes
isSSHOverTailscale() to incorrectly detect an SSH session and change
behavior.
Clear SSH_CLIENT in the test node environment to prevent these false
positives.
Fixes#19393
Change-Id: I1411abf0be9704cce37051476efb04d59beed386
Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>
Avery found a bunch of tests that fail with -count=2.
Updates tailscale/corp#40176 (tracks making our CI detect them)
Change-Id: Ie3e4398070dd92e4fe0146badddf1254749cca20
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Co-authored-by: Avery Pennarun <apenwarr@tailscale.com>
TestLookupMetric was added in e8d140654 (2023-08-17) without
initializing the dnsCache and dnsCacheBytes globals. When run in
isolation, handleBootstrapDNS writes a nil body (from the
uninitialized dnsCacheBytes), causing getBootstrapDNS to fail
decoding an empty response with EOF.
Add a setDNSCache test helper that stores the dnsEntryMap, marshals
dnsCacheBytes, and registers a t.Cleanup to nil both out, so tests
that forget to call it will hit the dnsCache-nil fatal in
getBootstrapDNS rather than silently depending on prior test state.
Also add AssertNotParallel and a dnsCache-nil fatal check to
getBootstrapDNS, the central helper all bootstrap DNS tests flow
through, to prevent future tests from running in parallel (they
all mutate package-level DNS caches and metrics) and to give a
clear error if a test forgets to initialize the DNS caches.
Fixes#19388
Change-Id: I8ad454ec6026c71f13ecfa14d25925df5478b908
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Co-authored-by: Avery Pennarun <apenwarr@tailscale.com>
The natlab-integrationtest CI job frequently flakes by exhausting its
3m go test timeout. The root cause is that the QEMU VMs run under
pure software emulation (TCG) with no KVM. Under TCG, the guest
kernel's timer calibration busy-loops are at the mercy of host CPU
scheduling. When two VMs boot simultaneously on a 2-core CI runner,
one VM's calibration gets starved and produces wrong results, leaving
the kernel with broken timers that prevent it from ever completing
boot — even after the other VM finishes and frees up CPU.
Additionally, the microvm machine type doesn't provide HPET hardware,
but the kernel command line specified clocksource=hpet. And the VM
image build (make natlab) ran inside the test itself, consuming most
of the 3m timeout budget before the actual test started.
Fix by:
- Enabling KVM when /dev/kvm is available, so timer calibration
uses real hardware timers unaffected by host CPU scheduling.
- Adding a CI step to set /dev/kvm permissions on the GitHub
Actions runner (ubuntu-latest provides KVM but needs a udev rule).
- Pre-building the VM image in a separate CI step so it doesn't
cut into the go test -timeout budget.
- Replacing the hardcoded 60s context timeout with one derived from
t.Deadline(), so the test uses the full -timeout budget.
- Adding VM boot progress detection (AwaitFirstPacket) and QMP
diagnostics, so boot failures produce clear errors instead of
opaque "context deadline exceeded" messages.
With KVM enabled, the test passes reliably even on a single CPU core
with 3 parallel workers — a scenario that was 100% broken under TCG.
Fixes#18906
Change-Id: I4c87631a9c9678d185b9f30cb05c0f7bfa9f5c62
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
For tests to loudly declare (and panic on violation) when they're doing
something that's not safe in a parallel test.
Fixes#19385
Change-Id: If79693b0c235c146871a05ed74fa9ea75bb500f9
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The maxInFlightConnectionAttemptsForTest and
maxInFlightConnectionAttemptsPerClientForTest globals were plain ints
read by background gVisor TCP handler goroutines (via
wrapTCPProtocolHandler) and written by tstest.Replace cleanup in
TestTCPForwardLimits_PerClient. When a gVisor goroutine outlived the
test cleanup window, the race detector caught the unsynchronized
access.
The race-prone code was introduced in c5abbcd4b4d8 (2024-02-26,
"wgengine/netstack: add a per-client limit for in-flight TCP
forwards") which added both the plain int globals and the
TestTCPForwardLimits_PerClient test that writes them via
tstest.Replace. It is not obvious why this has only recently started
being detected as a data race; likely some combination of gVisor
version bumps, Go toolchain scheduler changes, and additional
TCP-injecting subtests (e.g. 03461ea7f, 2026-01-30) increased
goroutine churn enough to hit the window.
Change both globals to atomic.Int32 and replace tstest.Replace (which
does non-atomic *target = old on cleanup) with explicit Store/Cleanup
pairs.
Fixes#19118
Change-Id: Id26ba6fbfb2e4ade319976db80af8e16c7c8778e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
When built with the Tailscale Go toolchain, include the toolchain's
git revision in the version output. The non-JSON output shows the
first 10 hex digits:
go version: go1.26.2 (tailscale/go dfe2a5fd8e)
The JSON output includes the full hash as "tailscaleGoGitHash", or
omits the field when not using tsgo.
The toolchain rev is read via a separate sync.OnceValue rather than
piggybacking on getEmbeddedInfo, because that function discards all
data when VCS fields are absent (e.g. in test binaries), while the
tailscale.toolchain.rev setting is still present.
Also add a CI-only test verifying tailscaleToolchainRev is non-empty
when built with the tailscale_go build tag.
Fixes#19374
Change-Id: Ied0b16d7aead5471d8c614c30cba8b0dcf80c691
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Parallelize the SSH integration tests across OS targets and reduce
per-container overhead:
- CI: use GitHub Actions matrix strategy to run all 4 OS containers
(ubuntu:focal, ubuntu:jammy, ubuntu:noble, alpine:latest) in parallel
instead of sequentially (~4x wall-clock improvement)
- Makefile: run docker builds in parallel for local dev too
- Dockerfile: consolidate ~20 separate RUN commands into 5 (one per
test phase), eliminating Docker layer overhead. Combine test binary
invocations where no state mutation is needed between them. Fix a bug
where TestDoDropPrivileges was silently not being run (was passed as a
second positional arg to -test.run instead of using regex alternation).
- TestMain: replace tail -F + 2s sleep with synchronous log read,
eliminating 2s overhead per test binary invocation. Set debugTest once
in TestMain instead of redundantly in each test function.
- session.read(): close channel on EOF so non-shell tests return
immediately instead of waiting for the 1s silence timeout.
Updates #19244
Change-Id: I2cc8588964fbce0dd7b654fb94e7ff33440b8584
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
I'm not sure how this file got into the repo without gofmt.
Maybe gofmt rules changed in some Go release?
Updates #cleanup
Change-Id: Ia8bd46e29f116f7fbfca11be80c8ef48699cd9f2
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Verify that GODEBUG=gocachehash=1 output from ./tool/go includes the
git revision from go.toolchain.rev, ensuring that bumping the Tailscale
Go fork (without a Go version number change) properly invalidates the
build cache.
The test only runs in CI or when the current Go binary is the Tailscale
toolchain (GOROOT contains /.cache/tsgo/), so open source contributors
using stock Go aren't forced to download tsgo.
Fixestailscale/corp#36589
Change-Id: Ia98d3a3aa8c7fa67f9a0293066fa02a1997dcb95
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a --headless flag to the Host.app Run subcommand for running
macOS VMs without a GUI, enabling use from test frameworks.
Key changes:
- HostCli.swift: When --headless is set, run the VM via VMController
+ RunLoop.main.run() instead of NSApplicationMain. Using the
RunLoop (not dispatchMain) is required because VZ framework
callbacks depend on RunLoop sources.
- VMController.swift: Add headless parameter to createVirtualMachine
that configures a single socket-based NIC (no NAT NIC). This
matches the NIC configuration used when creating/saving VMs, so
saved state restoration works correctly. A NIC count mismatch
causes VZ to silently fail to execute guest code.
- TailMacConfigHelper.swift: Clean up socket network device logging.
- Config.swift: Move VM storage from ~/VM.bundle to
~/.cache/tailscale/vmtest/macos/.
- TailMac.swift: Fix dispatchMain→RunLoop.main.run() in the create
command (same VZ RunLoop requirement).
Updates #13038
Change-Id: Iea51c043aa92e8fc6257139b9f0e2e7677072fa2
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add natlabapp.arm64 config and gokrazydeps.go for building a gokrazy
natlab appliance image targeting arm64 (Apple Silicon). This is the
arm64 counterpart to the existing natlabapp (amd64) used by vmtest.
The arm64 image uses github.com/gokrazy/kernel.arm64 and is built
with "make natlab-arm64" in the gokrazy directory.
Updates #13038
Change-Id: I0e1f8e5840083a5de5954f2cf46e3babec129d96
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a --rate-config flag pointing to a JSON file for per-client receive
rate limits (bytes/sec and burst bytes). The config is reloaded on SIGHUP,
updating all existing client connections live. The --per-client-rate-limit
and --per-client-rate-burst flags are removed in favor of the config file.
In derpserver, rate limiting uses an atomic.Pointer[xrate.Limiter] per
client: nil when unlimited or mesh (zero overhead), non-nil when
rate-limited.
Document that clientSet.activeClient Store operations require Server.mu.
Updates tailscale/corp#38509
Signed-off-by: Mike O'Driscoll <mikeo@tailscale.com>
These test failures were never caught by CI because the package in question
was missing from our privileged tests list. tailscale/corp#40007 covers improving
our process around this.
Fixes#19316
Signed-off-by: Amal Bansode <amal@tailscale.com>