This commit replaces crypto/rand challenge generation with a blake2s-256
MAC. This enables the peer relay server to respond to multiple forward
disco.BindUDPRelayEndpoint messages per handshake generation without
sacrificing the proof of IP ownership properties of the handshake.
Responding to multiple forward disco.BindUDPRelayEndpoint messages per
handshake generation improves client address/path selection where
lowest client->server path/addr one-way delay does not necessarily
equate to lowest client<->server round trip delay.
It also improves situations where outbound traffic is filtered
independent of input, and the first reply
disco.BindUDPRelayEndpointChallenge message is dropped on the reply
path, but a later reply using a different source would make it through.
Reduction in serverEndpoint state saves 112 bytes per instance, trading
for slightly more expensive crypto ops: 277ns/op vs 321ns/op on an M1
MacBook Pro.
Updates tailscale/corp#34414
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Adds cmd/cigocacher as the client to cigocached for Go caching over
HTTP. The HTTP cache is best-effort only, and builds will fall back to
disk-only cache if it's not available, much like regular builds.
Not yet used in CI; that will follow in another PR once we have runners
available in this repo with the right network setup for reaching
cigocached.
Updates tailscale/corp#10808
Change-Id: I13ae1a12450eb2a05bd9843f358474243989e967
Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
When the underlying transport returns a network error, the RoundTrip
method returns (nil, error). The defer was attempting to access resp
without checking if it was nil first, causing a panic. Fix this by
checking for nil in the defer.
Also changes driveTransport.tr from *http.Transport to http.RoundTripper
and adds a test.
Fixes #17306
Signed-off-by: Andrew Dunham <andrew@tailscale.com>
Change-Id: Icf38a020b45aaa9cfbc1415d55fd8b70b978f54c
SetSubnetRoutes was not sending update notifications to nodes when their
approved routes changed, causing nodes to not fetch updated netmaps with
PrimaryRoutes populated. This resulted in TestUserMetricsRouteGauges
flaking because it waited for PrimaryRoutes to be set, which only happened
if the node happened to poll for other reasons.
Now send updateSelfChanged notification to affected nodes so they fetch
an updated netmap immediately.
Fixes #17962
Signed-off-by: Andrew Dunham <andrew@tailscale.com>
Linux kernel versions 6.6.102-104 and 6.12.42-45 have a regression
in /proc/net/tcp that causes seek operations to fail with "illegal seek".
This breaks portlist tests on these kernels.
Add kernel version detection for Linux systems and a SkipOnKernelVersions
helper to tstest. Use it to skip affected portlist tests on the broken
kernel versions.
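
As a rough sketch of the version check (the real tstest helper's
signature is not shown in this message, so parseKernelVersion, patchRange
and isBroken below are hypothetical names), matching a kernel release
string against the affected ranges could look like this:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type kernelVersion struct{ major, minor, patch int }

// parseKernelVersion parses a release string such as "6.6.103-generic".
func parseKernelVersion(s string) (kernelVersion, bool) {
	// Drop any distribution suffix such as "-generic" or "+deb13".
	if i := strings.IndexAny(s, "-+ "); i >= 0 {
		s = s[:i]
	}
	parts := strings.Split(s, ".")
	if len(parts) < 3 {
		return kernelVersion{}, false
	}
	var n [3]int
	for i := range n {
		v, err := strconv.Atoi(parts[i])
		if err != nil {
			return kernelVersion{}, false
		}
		n[i] = v
	}
	return kernelVersion{n[0], n[1], n[2]}, true
}

// patchRange is an inclusive range of patch levels within one series.
type patchRange struct{ major, minor, lo, hi int }

// The ranges named in the commit message: 6.6.102-104 and 6.12.42-45.
var brokenProcNetTCP = []patchRange{
	{6, 6, 102, 104},
	{6, 12, 42, 45},
}

// isBroken reports whether release falls in one of the affected ranges.
// Unparseable release strings are treated as not affected.
func isBroken(release string) bool {
	v, ok := parseKernelVersion(release)
	if !ok {
		return false
	}
	for _, r := range brokenProcNetTCP {
		if v.major == r.major && v.minor == r.minor && v.patch >= r.lo && v.patch <= r.hi {
			return true
		}
	}
	return false
}

func main() {
	for _, rel := range []string{"6.6.103-generic", "6.12.41", "6.12.45"} {
		fmt.Println(rel, isBroken(rel))
	}
}
```

A test helper built on a check like this would call t.Skip when the
running kernel matches.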
Thanks to philiptaron for the list of kernels with the issue and fix.
Updates #16966
Signed-off-by: Andrew Dunham <andrew@tailscale.com>
Bounded DeliveredEvent queues reduce memory usage, but they can deadlock under load.
Two common scenarios trigger deadlocks when the number of events published in a short
period exceeds twice the queue capacity (there's a PublishedEvent queue of the same size):
- a subscriber tries to acquire the same mutex as held by a publisher, or
- a subscriber for A events publishes B events
Avoiding these scenarios is not practical and would limit eventbus usefulness and reduce its adoption,
pushing us back to callbacks and other legacy mechanisms. These deadlocks already occurred in customer
devices, dev machines, and tests. They also make it harder to identify and fix slow subscribers and similar
issues we have been seeing recently.
Choosing an arbitrary large fixed queue capacity would only mask the problem. A client running
on a sufficiently large and complex customer environment can exceed any meaningful constant limit,
since event volume depends on the number of peers and other factors. Behavior also changes
based on scheduling of publishers and subscribers by the Go runtime, OS, and hardware, as the issue
is essentially a race between publishers and subscribers. Additionally, on lower-end devices,
an unreasonably high constant capacity is practically the same as using unbounded queues.
Therefore, this PR changes the event queue implementation to be unbounded by default.
The PublishedEvent queue keeps its existing capacity of 16 items, while subscribers'
DeliveredEvent queues become unbounded.
This change fixes known deadlocks and makes the system stable under load,
at the cost of higher potential memory usage, including cases where a queue grows
during an event burst and does not shrink when load decreases.
Further improvements can be implemented in the future as needed.
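
To illustrate the property being relied on (this is a simplified sketch,
not the eventbus implementation; the queue type and its methods are
hypothetical), an unbounded queue lets a publisher make progress even
when no subscriber is draining yet:

```go
package main

import (
	"fmt"
	"sync"
)

// queue is an unbounded FIFO: Push never blocks, and Pop blocks only
// while the queue is empty.
type queue[T any] struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []T
}

func newQueue[T any]() *queue[T] {
	q := &queue[T]{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Push appends v and wakes one waiting consumer, if any.
func (q *queue[T]) Push(v T) {
	q.mu.Lock()
	q.items = append(q.items, v)
	q.mu.Unlock()
	q.cond.Signal()
}

// Pop removes and returns the oldest item, waiting if none is available.
func (q *queue[T]) Pop() T {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait()
	}
	v := q.items[0]
	q.items = q.items[1:]
	return v
}

func main() {
	q := newQueue[int]()
	// Publish a burst far larger than a 16-item bounded queue could hold
	// before any subscriber runs. With a bounded queue this loop would
	// block; here it always completes.
	for i := 0; i < 1000; i++ {
		q.Push(i)
	}
	sum := 0
	for i := 0; i < 1000; i++ {
		sum += q.Pop()
	}
	fmt.Println(sum) // prints 499500
}
```

Memory in a design like this is bounded only by how far publishers get
ahead of the subscriber, which is the trade-off described above.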
Fixes #17973
Fixes #18012
Signed-off-by: Nick Khyl <nickk@tailscale.com>
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to publish events itself.
This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.
Updates #18012
Signed-off-by: Nick Khyl <nickk@tailscale.com>
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to acquire a mutex that can also be held by a publisher.
This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.
Updates #17973
Signed-off-by: Nick Khyl <nickk@tailscale.com>
This is causing confusing panics in tailscale/corp#34485. We'll keep
using the tka.ChonkMem constructor as much as we can, but don't panic
if you create a tka.Mem directly -- we know what the sensible thing is.
Updates #cleanup
Signed-off-by: Alex Chan <alexc@tailscale.com>
Change-Id: I49309f5f403fc26ce4f9a6cf0edc8eddf6a6f3a4
With the introduction of node sealing, store.New fails in some cases due
to the TPM device being reset or unavailable. Currently it results in
tailscaled crashing at startup, which is not obvious to the user until
they check the logs.
Instead of crashing tailscaled at startup, start with an in-memory store
with a health warning about state initialization and a link to (future)
docs on what to do. When this health message is set, also block any
login attempts to avoid masking the problem with an ephemeral node
registration.
Updates #15830
Updates #17654
Signed-off-by: Andrew Lytvynov <awly@tailscale.com>
These validations were previously performed in the CLI frontend. There
are two motivations for moving these to the local backend:
1. The backend controls synchronization around the relevant state, so
only the backend can guarantee many of these validations.
2. Doing these validations in the back-end avoids the need to repeat
them across every frontend (e.g. the CLI and tsnet).
Updates tailscale/corp#27200
Signed-off-by: Harry Harpham <harry@tailscale.com>
This commit adds the `spec.replicas` field to the `Recorder` custom
resource that allows for a highly available deployment of `tsrecorder`
within a Kubernetes cluster.
Many changes were required here because the code hard-coded the
assumption of a single replica. Creating the auth and state secrets now
takes a few loops, similar to what we do for the `Connector` resource.
A check was also added to remove dangling state and auth secrets when
the recorder is scaled down.
Updates: https://github.com/tailscale/tailscale/issues/17965
Signed-off-by: David Bond <davidsbond93@gmail.com>
Fixes tailscale/tailscale#17990
The logging for the netns caps is spammy. Log only on changes
to the values and don't log Darwin-specific stuff on non-Darwin
clients.
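
A small sketch of the log-on-change pattern, assuming the logged value
is comparable. The changeLogger type is hypothetical and is not the
netns code.

```go
package main

import (
	"fmt"
	"sync"
)

// changeLogger logs a value only when it differs from the last one logged.
type changeLogger[T comparable] struct {
	mu   sync.Mutex
	last T
	seen bool
	logf func(format string, args ...any)
}

// Observe logs v if it is the first value seen or differs from the
// previous one, and reports whether it logged.
func (c *changeLogger[T]) Observe(v T) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen && c.last == v {
		return false
	}
	c.last, c.seen = v, true
	c.logf("netns caps changed: %v", v)
	return true
}

func main() {
	n := 0
	cl := &changeLogger[bool]{logf: func(format string, args ...any) {
		n++
		fmt.Printf(format+"\n", args...)
	}}
	for _, v := range []bool{true, true, true, false, false} {
		cl.Observe(v)
	}
	fmt.Println("log lines:", n) // prints "log lines: 2"
}
```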
Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
This commit modifies the Kubernetes operator to use the "stable" version
of `k8s-nameserver` by default.
Updates: https://github.com/tailscale/corp/issues/19028
Signed-off-by: David Bond <davidsbond93@gmail.com>
This commit lets users set a service backend to a remote destination, given as either a
partial or a full URL. It also prevents users from setting remote destinations on Linux systems
where socket marks do not work. Users of the macOS extension, on any version, cannot serve a
service either. Whether socket marks are usable is determined by a new local API.
Fixes tailscale/corp#24783
Signed-off-by: KevinLiang10 <37811973+KevinLiang10@users.noreply.github.com>
Now that we support using an in-memory backend for TKA state (#17946),
this function always returns `nil` – we can always support Network Lock.
We don't need it any more.
Plus, clean up a couple of errant TODOs from that PR.
Updates tailscale/corp#33599
Change-Id: Ief93bb9adebb82b9ad1b3e406d1ae9d2fa234877
Signed-off-by: Alex Chan <alexc@tailscale.com>
Our style guide recommends avoiding Latin abbreviations in technical
documentation, which includes the CLI help text. This is causing linter
issues for the docs site, because this help text is copied into the docs.
See http://go/style-guide/kb/language-and-grammar/abbreviations#latin-abbreviations
Updates #cleanup
Change-Id: I980c28d996466f0503aaaa65127685f4af608039
Signed-off-by: Alex Chan <alexc@tailscale.com>
ArgoCD sends boolean values but the template expects strings, causing
"incompatible types for comparison" errors. Wrap values with toString
so both work.
Fixes #17158
Signed-off-by: Raj Singh <raj@tailscale.com>
Previously a TKA compaction would only run when a node starts, which means a long-running node could use unbounded storage as it accumulates ever-increasing amounts of TKA state. This patch changes TKA so it runs a compaction after every sync.
Updates https://github.com/tailscale/corp/issues/33537
Change-Id: I91df887ea0c5a5b00cb6caced85aeffa2a4b24ee
Signed-off-by: Alex Chan <alexc@tailscale.com>
This commit modifies the helm/static manifest configuration for the
k8s-operator to prefer the stable image tag. This keeps those using
static manifests from seeing unstable behaviour by default if they
do not manually make the change.
This is managed for us when using helm but not when generating the
static manifests.
Updates https://github.com/tailscale/tailscale/issues/10655
Signed-off-by: David Bond <davidsbond93@gmail.com>
(trying to get in smaller obvious chunks ahead of later PRs to make
them smaller)
Updates #17925
Change-Id: I184002001055790484e4792af8ffe2a9a2465b2e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
We now embed node information into network flow logs.
By default, netlogfmt still prints out using Tailscale IP addresses.
Support a "--resolve-addrs=TYPE" flag that can be used to specify
resolving IP addresses as node IDs, hostnames, users, or tags.
Updates tailscale/corp#33352
Signed-off-by: Joe Tsai <joetsai@digital-static.net>
Adds the ability to rotate discovery keys on running clients, needed for
testing upcoming disco key distribution changes.
Introduces key.DiscoKey, an atomic container for a disco private key,
public key, and the public key's ShortString, replacing the prior
separate atomic fields.
magicsock.Conn has a new RotateDiscoKey method, and access to this is
provided via localapi and a CLI debug command.
Note that this implementation is primarily for testing as it stands, and
regular use should likely introduce an additional mechanism that allows
the old key to be used for some time, to provide a seamless key rotation
rather than one that invalidates all sessions.
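
The motivation for a single atomic container can be sketched as follows.
This is an illustration only: discoKeyBox and keySet are hypothetical
names, the keys are opaque byte arrays rather than real key types, and
the short form is a placeholder.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// keySet groups values that must always be observed together.
type keySet struct {
	private [32]byte
	public  [32]byte
	short   string // placeholder short form derived from the public key
}

// discoKeyBox swaps the whole set with one atomic pointer store, so a
// reader can never see a private key from one rotation paired with a
// public key from another, as could happen with separate atomic fields.
type discoKeyBox struct {
	p atomic.Pointer[keySet]
}

// Set publishes a new key set in a single atomic step.
func (b *discoKeyBox) Set(priv, pub [32]byte) {
	b.p.Store(&keySet{
		private: priv,
		public:  pub,
		short:   fmt.Sprintf("d:%x", pub[:8]),
	})
}

// Get returns the current key set, or nil if none has been set.
func (b *discoKeyBox) Get() *keySet { return b.p.Load() }

func main() {
	var box discoKeyBox
	var priv, pub [32]byte
	for i := range pub {
		priv[i] = byte(0xf0 + i%16)
		pub[i] = byte(i + 1)
	}
	box.Set(priv, pub)
	fmt.Println(box.Get().short) // prints d:0102030405060708
}
```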
Updates tailscale/corp#34037
Signed-off-by: James Tucker <james@tailscale.com>
As part of the conn25 work we will want to be able to keep track of a
pool of IP Addresses and know which have been used and which have not.
Fixes tailscale/corp#34247
Signed-off-by: Fran Bull <fran@tailscale.com>
We use `tka.AUMHash` in `netmap.NetworkMap`, and we serialise it as JSON
in the `/debug/netmap` C2N endpoint. If the binary omits Tailnet Lock support,
the debug endpoint returns an error because it's unable to marshal the
AUMHash.
This patch adds a sentinel value so this marshalling works, and we can
use the debug endpoint.
Updates https://github.com/tailscale/tailscale/issues/17115
Signed-off-by: Alex Chan <alexc@tailscale.com>
Change-Id: I51ec1491a74e9b9f49d1766abd89681049e09ce4
Existing compaction logic seems to have had an assumption that
markActiveChain would cover a longer part of the chain than
markYoungAUMs. This prevented long but fresh chains from being
compacted correctly.
Updates tailscale/corp#33537
Signed-off-by: Anton Tolchanov <anton@tailscale.com>
6a73c0bdf55 added a feature tag but didn't re-run go generate on ./feature/buildfeatures.
Updates #9192
Change-Id: I7819450453e6b34c60cad29d2273e3e118291643
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
I added a RemoveAll() method on tka.Chonk in #17946, but it's only used
in the node to purge local AUMs. We don't need it in the SQLite storage,
which currently implements tka.Chonk, so move it to CompactableChonk
instead.
Also add some automated tests, as a safety net.
Updates tailscale/corp#33599
Change-Id: I54de9ccf1d6a3d29b36a94eccb0ebd235acd4ebc
Signed-off-by: Alex Chan <alexc@tailscale.com>
The REST API does not return a node name
with a trailing dot, while the internal node name
reported in the netmap does have one.
In order to be consistent with the API,
strip the dot when recording node information.
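
The normalization amounts to trimming one trailing dot. A sketch, with
apiNodeName as a hypothetical helper name and a made-up node name:

```go
package main

import (
	"fmt"
	"strings"
)

// apiNodeName converts a netmap-style fully qualified name (with a
// trailing dot) to the form without one. Names that have no trailing
// dot are returned unchanged.
func apiNodeName(netmapName string) string {
	return strings.TrimSuffix(netmapName, ".")
}

func main() {
	fmt.Println(apiNodeName("host.example.ts.net.")) // prints host.example.ts.net
}
```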
Updates tailscale/corp#33352
Signed-off-by: Joe Tsai <joetsai@digital-static.net>
Perform a path check first before attempting exec of `true`.
Try /usr/bin/true first, as that is now, and increasingly so, the more
common and more portable path.
Fixes tests on macOS arm64 where exec was returning a different kind of
path error than previously checked.
Updates #16569
Signed-off-by: James Tucker <james@tailscale.com>
DA protection is not super helpful because we don't set an authorization
password on the key. But if authorization fails for other reasons (like
TPM being reset), we will eventually cause DA lockout with tailscaled
trying to load the key. DA lockout then leads to (1) issues for other
processes using the TPM and (2) the underlying authorization error being
masked in logs.
Updates #17654
Signed-off-by: Andrew Lytvynov <awly@tailscale.com>
For manual (human) testing, this lets the user disable control plane
map polls with "tailscale set --sync=false" (which survives restarts)
and "tailscale set --sync" to restore.
A high severity health warning is shown while this is active.
Updates #12639
Updates #17945
Change-Id: I83668fa5de3b5e5e25444df0815ec2a859153a6d
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Let's fix all the typos, which lets the code be more readable, lest we
confuse our readers.
Updates #cleanup
Change-Id: I4954601b0592b1fda40269009647bb517a4457be
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
This requires making the internals of LocalBackend a bit more generic,
and implementing the `tka.CompactableChonk` interface for `tka.Mem`.
Signed-off-by: Alex Chan <alexc@tailscale.com>
Updates https://github.com/tailscale/corp/issues/33599
Pick up a fix for https://pkg.go.dev/vuln/GO-2025-4116 (even though
we're not affected).
Updates #cleanup
Change-Id: I9f2571b17c1f14db58ece8a5a34785805217d9dd
Signed-off-by: Andrew Lytvynov <awly@tailscale.com>