Consolidate the duplicated WebSocket frame-parsing logic from Read
and Write into a shared processFrames loop, fixing several bugs in
the process:
- Mixed control and data frames in a single Read/Write call buffer
were not handled: a control frame would cause merged data frames
to be skipped.
- Multiple data frames in one Write call weren't being correctly
parsed: only the first frame was processed, ignoring the rest in
the buffer.
- msg.isFinalized was being set before confirming the fragment was
complete, so an incomplete msg fragment could sometimes be
marked as finalized.
- Continuation frames without any payload were being treated as if
they didn't have a stream ID, even though the ID is already known
from the initial fragment.
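The shared loop described above can be sketched roughly as follows. This is a minimal illustration of walking every RFC 6455 frame in one buffer, not the code from this change: the frame type, parseFrame, and the return values of processFrames are assumptions made for the example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// frame is a hypothetical, simplified view of one WebSocket frame.
type frame struct {
	fin     bool
	opcode  byte
	payload []byte
}

var errIncomplete = errors.New("incomplete frame")

// parseFrame decodes one RFC 6455 frame from the front of buf and
// returns it along with the number of bytes it occupied.
func parseFrame(buf []byte) (frame, int, error) {
	if len(buf) < 2 {
		return frame{}, 0, errIncomplete
	}
	f := frame{fin: buf[0]&0x80 != 0, opcode: buf[0] & 0x0f}
	masked := buf[1]&0x80 != 0
	n := uint64(buf[1] & 0x7f)
	off := 2
	switch n {
	case 126: // 16-bit extended payload length
		if len(buf) < off+2 {
			return frame{}, 0, errIncomplete
		}
		n = uint64(binary.BigEndian.Uint16(buf[off:]))
		off += 2
	case 127: // 64-bit extended payload length
		if len(buf) < off+8 {
			return frame{}, 0, errIncomplete
		}
		n = binary.BigEndian.Uint64(buf[off:])
		off += 8
	}
	var key []byte
	if masked {
		if len(buf) < off+4 {
			return frame{}, 0, errIncomplete
		}
		key = buf[off : off+4]
		off += 4
	}
	if uint64(len(buf)-off) < n {
		return frame{}, 0, errIncomplete
	}
	// Copy the payload so unmasking does not modify the caller's buffer.
	f.payload = append([]byte(nil), buf[off:off+int(n)]...)
	if masked {
		for i := range f.payload {
			f.payload[i] ^= key[i%4]
		}
	}
	return f, off + int(n), nil
}

// processFrames walks every complete frame in buf. Control frames
// (opcode 0x8 and above) are counted and skipped without abandoning
// the data frames that follow them. It returns the data payloads, the
// number of control frames seen, and any unconsumed trailing bytes.
func processFrames(buf []byte) (data [][]byte, control int, rest []byte) {
	for len(buf) > 0 {
		f, n, err := parseFrame(buf)
		if err != nil {
			break // partial frame: leave it for the next call
		}
		buf = buf[n:]
		if f.opcode >= 0x8 {
			control++
			continue
		}
		data = append(data, f.payload)
	}
	return data, control, buf
}

func main() {
	buf := []byte{
		0x82, 0x01, 'a', // binary data frame
		0x89, 0x00, // ping (control) frame
		0x82, 0x02, 'b', 'c', // second data frame
		0x82, 0x05, 'x', // truncated frame
	}
	data, control, rest := processFrames(buf)
	fmt.Println(len(data), control, len(rest))
}
```

The loop keeps going after a control frame and after each data frame, which is the behavior the first two bullets describe as previously missing.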
Fixes tailscale/corp#39583
Signed-off-by: Fernando Serboncini <fserb@tailscale.com>
Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
Co-authored-by: chaosinthecrd <tom@tmlabs.co.uk>
Move the ipn/desktop blank import from cmd/tailscaled/tailscaled_windows.go
into feature/condregister/maybe_desktop_sessions.go, consistent with how
all other modular features are registered. tailscaled already imports
feature/condregister, so it still gets ipn/desktop on Windows.
Updates #12614
Change-Id: I92418c4bf0e67f0ab40542e47584762ac0ffa2b2
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
GetMessage can call back into Go, triggering stack growth and causing the stack
to be copied to a new memory region, which invalidates the original stack pointer
passed to the syscall. Since GetMessage uses that pointer to write the message
before returning, this leads to memory corruption.
In this PR, we fix this by using runtime.Pinner, which requires the pointer to refer
to heap-allocated memory.
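A minimal, platform-independent sketch of the pinning pattern follows. The msg type and fakeGetMessage are hypothetical stand-ins for the Windows MSG structure and the real syscall; only runtime.Pinner and its Pin and Unpin methods are real API.

```go
package main

import (
	"fmt"
	"runtime"
)

// msg is a hypothetical stand-in for the Windows MSG structure.
type msg struct {
	message uint32
	wParam  uintptr
}

// fakeGetMessage stands in for the real syscall, which writes the
// message through the pointer it was given before returning.
func fakeGetMessage(m *msg) {
	m.message = 0x0012
}

// readMessage allocates the message with new rather than as a plain
// stack variable, then pins it for the duration of the call.
func readMessage() msg {
	m := new(msg)
	var pinner runtime.Pinner
	pinner.Pin(m)
	defer pinner.Unpin()
	fakeGetMessage(m)
	return *m
}

func main() {
	fmt.Printf("%#x\n", readMessage().message)
}
```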
Fixes #19263
Fixes #17832
Signed-off-by: Nick Khyl <nickk@tailscale.com>
Add a new "ipnbus" build feature tag so the watch-ipn-bus LocalAPI
endpoint can be independently controlled, rather than being gated
behind HasDebug || HasServe. Minimal/embedded builds that omit both
debug and serve were getting 404s on watch-ipn-bus, breaking
"tailscale up --authkey=..." and other CLI flows that depend on
WatchIPNBus.
In the CLI, check buildfeatures.HasIPNBus before attempting to watch
the IPN bus in "tailscale up"/"tailscale login", and exit early with
an informational message when the feature is omitted.
Also add the missing NewCounterFunc stub to clientmetric/omit.go,
which caused compilation errors when building with
ts_omit_clientmetrics and netstack enabled.
Fixes #19240
Change-Id: I2e3c69a72fc50fa02542b91b8a54859618a463d1
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
If an entry in the tsmpLearnedDisco does not match the disco key of the
key currently being processed, overwrite the key, and leave the entry in
the map for later processing.
In reality, this should not happen, but is put in as a safety measure
with logging of the situation so we can replicate the behaviour and
correct it should it happen.
Updates #12639
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
After moving locks around in 4334dfa7d5ccbee1daf5acf30b33557bbca66525,
a data race became possible.
Introduce a read-write lock on the mapSession itself for fetching peers.
Fixes #19260
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
When a recording upload fails mid-session, the recording goroutine
cancels the session context. This triggers two concurrent paths:
exec.CommandContext kills the process (causing cmd.Wait to return),
and killProcessOnContextDone tries to write the termination message
via exitOnce.Do. If cmd.Wait returns first, the main goroutine's
exitOnce.Do(func(){}) steals the once, and the termination message
is never written to the client.
Fix by waiting for killProcessOnContextDone to finish writing the
termination message (via <-ss.exitHandled) before claiming exitOnce,
when the context is already done.
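The ordering can be sketched as below. The names exitOnce, exitHandled, and the role of killProcessOnContextDone come from the description above; the session type, its fields, and the finish and run helpers are assumptions made to keep the example self-contained.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// session is a hypothetical, reduced stand-in for the SSH session.
type session struct {
	ctx         context.Context
	exitOnce    sync.Once
	exitHandled chan struct{} // closed once the termination path has run
	out         []string      // messages "written to the client"
}

// killProcessOnContextDone waits for cancellation and tries to claim
// exitOnce in order to write the termination message.
func (s *session) killProcessOnContextDone() {
	<-s.ctx.Done()
	s.exitOnce.Do(func() {
		s.out = append(s.out, "session terminated")
	})
	close(s.exitHandled)
}

// finish is what the main goroutine runs after cmd.Wait returns.
func (s *session) finish() {
	if s.ctx.Err() != nil {
		// The context is already done: let the other goroutine
		// write the termination message before claiming exitOnce.
		<-s.exitHandled
	}
	s.exitOnce.Do(func() {})
}

// run simulates an upload failure that cancels the session context.
func run() []string {
	ctx, cancel := context.WithCancel(context.Background())
	s := &session{ctx: ctx, exitHandled: make(chan struct{})}
	go s.killProcessOnContextDone()
	cancel()
	s.finish()
	return s.out
}

func main() {
	fmt.Println(run())
}
```

Without the wait on exitHandled, finish could claim the once first and the message would be lost, which is the race described above.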
Also fix the fallback path when launchProcess itself fails due to
context cancellation: use SSHTerminationMessage() with the correct
"\r\n\r\n" framing instead of fmt.Fprintf with the internal error
string.
Deflakes TestSSHRecordingCancelsSessionsOnUploadFailure, which was
failing consistently at a low rate due to the exitOnce race. After
this fix, flakestress passes with 8,668 runs, 0 failures.
Fixes #7707 (again. hopefully for good.)
Change-Id: I5ab911c71574db8d3f9d979fb839f273be51ecf9
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Brings in a newer version of Gliderlabs SSH with added socket forwarding support.
Fixes #12409
Fixes #5295
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
While investigating battery costs on a busy tailnet, I noticed a large
number of nodes regularly reconnecting to control and DERP. In one case
I was able to analyze closely, `pmset` reported every-minute wake-ups
triggered by Bluetooth. As a side effect, the node was constantly
reconnecting to control, and this was at times visible to peers as well.
Three changes here improve the situation:
- Short time jumps (less than 10 minutes) no longer produce "major
network change" events, and so do not trigger full rebind/reconnect.
- Many "incidental" fields on interfaces are ignored, like MTU, flags
and so on - if the route is still good, the rest should be manageable.
- Additional log output will provide more detail about the cause of
major network change events.
Updates #3363
Signed-off-by: James Tucker <james@tailscale.com>
Set csrf.Path("/") so the CSRF cookie is available across all routes,
not just the path where it was set.
Add helpers to expose the gorilla/csrf token for use.
Updates #19264
Signed-off-by: Fernando Serboncini <fserb@tailscale.com>
Commit f905871fb moved host key generation from the ipnLocalBackend
interface (GetSSH_HostKeys) to the standalone getHostKeys function,
which requires either system host keys in /etc/ssh/ or a valid
TailscaleVarRoot to generate keys into. The testBackend returned ""
for TailscaleVarRoot, and the Docker test containers only install
openssh-client (no server host keys), so getHostKeys always failed.
When getHostKeys fails, HandleSSHConn returns the error but never
closes the TCP connection, so SSH clients hang forever waiting for
the server hello.
Fix by creating a temp directory in TestMain and returning it from
testBackend.TailscaleVarRoot().
Regression from f905871fb #18949 ("ipn/ipnlocal, feature/ssh: move SSH code
out of LocalBackend to feature").
I was apparently too impatient to wait for the test to complete
and didn't connect the dots: https://github.com/tailscale/tailscale/actions/runs/22930275950
We should make that test faster (#19244) for the patience issue, but
also fail more nicely if this happens in the future.
Updates #19244
Change-Id: If82393b8f35413b04174e6f7d09a1ee3a2125a6b
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The cloner and viewer code generators didn't handle named types
with basic underlying types (map/slice) that have their own Clone
or View methods. For example, a type like:
    type Map map[string]any
    func (m Map) Clone() Map { ... }
    func (m Map) View() MapView { ... }
When used as a struct field, the cloner would descend into the
underlying map[string]any and fail because it can't clone the any
(interface{}) value type. Similarly, the viewer would try to create
a MapFnOf view and fail.
Fix the cloner to check for a Clone method on the named type
before falling through to the underlying type handling.
Fix the viewer to check for a View method on named map/slice types,
so the type author can provide a purpose-built safe view that
doesn't leak raw any values. Named map/slice types without a View
method fall through to normal handling, which correctly rejects
types like map[string]any as unsupported.
Updates tailscale/corp#39502 (needed by tailscale/corp#39594)
Change-Id: Iaef0192a221e02b4b8e409c99ef8398090327744
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
To denoise log output, to make it easier to find real failures.
Updates #19252
Change-Id: Iae64a9278c70de24a236c39e3d181a509a512a0b
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The -run "^$" flag was being mangled by cmd.exe's argument processing.
The ^ character is cmd.exe's escape character, so go.cmd's cmd.exe layer
eats it, turning -run "^$" into -run "$" which matches all test names.
This caused the benchmark job to run every test, leading to timeouts
and Go runtime crashes.
Use -run XXXXNothingXXXX instead, which avoids special characters
entirely.
Updates #19252
Change-Id: I888c124254dd2767a40b61bcd68dbc9b22ad35a1
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The upload-client-metrics handler called metricCapture without
checking if it was nil or if the metrics slice was empty. Most
tests pass nil for metricCapture, so if a metrics upload races
in during the test, it panics.
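The guard amounts to something like the following sketch. The metric type and the handleUpload function are hypothetical; only the idea of checking for a nil callback and an empty slice before calling metricCapture comes from the description above.

```go
package main

import "fmt"

// metric is a hypothetical stand-in for an uploaded client metric.
type metric struct {
	Name  string
	Value int
}

// handleUpload forwards metrics to capture only when a capture
// function is set and there is something to report. It reports
// whether capture was called.
func handleUpload(metrics []metric, capture func([]metric)) bool {
	if capture == nil || len(metrics) == 0 {
		return false
	}
	capture(metrics)
	return true
}

func main() {
	// A nil capture function, as most tests pass, is now a no-op.
	fmt.Println(handleUpload([]metric{{Name: "x", Value: 1}}, nil))
}
```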
Fixes #19252
Change-Id: Ib904d1fe6779067dc2a153d1680b8f50cba9c773
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a new vet analyzer that checks t.Run subtest names don't contain
characters requiring quoting when re-running via "go test -run". This
enforces the style guide rule: don't use spaces or punctuation in
subtest names.
The analyzer flags:
- Direct t.Run calls with string literal names containing spaces,
regex metacharacters, quotes, or other problematic characters
- Table-driven t.Run(tt.name, ...) calls where tt ranges over a
slice/map literal with bad name field values
Also fix all 978 existing violations across 81 test files, replacing
spaces with hyphens and shortening long sentence-like names to concise
hyphenated forms.
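The core check and the mechanical part of the fix can be sketched as below. The character set in badChars is illustrative only and is not the analyzer's actual list; needsQuoting and hyphenate are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// badChars is an illustrative set of characters that need quoting or
// escaping in "go test -run"; the real analyzer's list may differ.
const badChars = " \t\"'`^$.*+?()[]{}|\\/,;:!"

// needsQuoting reports whether a subtest name contains any character
// from badChars.
func needsQuoting(name string) bool {
	return strings.ContainsAny(name, badChars)
}

// hyphenate replaces runs of whitespace with single hyphens.
func hyphenate(name string) string {
	return strings.Join(strings.Fields(name), "-")
}

func main() {
	name := "handles empty input"
	fmt.Println(needsQuoting(name), hyphenate(name))
}
```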
Updates #19242
Change-Id: Ib0ad96a111bd8e764582d1d4902fe2599454ab65
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
TestGocrossWrapper will fail when run inside a git linked worktree
because Go 1.26 and earlier cannot get the current revision hash.
Since this will be fixed in Go 1.27, see golang/go#58218, this patch
skips this test until that release.
Fixes #19217
Signed-off-by: Simon Law <sfllaw@tailscale.com>
The test sets up an HTTP-over-Unix server and a reverse proxy pointed at
this server, but prior to this change did not round-trip anything to the
backing server. This change ensures that we test code paths which proxy
Unix sockets for serve.
Fixes #19232
Signed-off-by: Harry Harpham <harry@tailscale.com>
This is a follow-up to #19117, adding a debug CLI command allowing the operator
to explicitly discard cached netmap data, as a safety and recovery measure.
Updates #12639
Change-Id: I5c3c47c0204754b9c8e526a4ff8f69d6974db6d0
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
When getting a full map from control, disco keys for the nodes will also
be delivered. When communicating with a peer that is running without
being connected to control, and having that connection running based on
a TSMP learned disco key, we need to avoid overwriting the disco key for
that peer with the stale one control knows about.
Add a filter that filters out keys from control and replaces them with
the TSMP learned disco keys.
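In outline, the filter behaves like the sketch below. The map-of-strings representation and the preferTSMPKeys name are assumptions for illustration; the real types for peers and disco keys are not shown here.

```go
package main

import "fmt"

// preferTSMPKeys returns a copy of the control-provided disco keys in
// which any peer that also has a TSMP-learned key keeps the learned
// key. Both maps are keyed by a hypothetical peer identifier.
func preferTSMPKeys(fromControl, tsmpLearned map[string]string) map[string]string {
	out := make(map[string]string, len(fromControl))
	for peer, key := range fromControl {
		if learned, ok := tsmpLearned[peer]; ok {
			key = learned
		}
		out[peer] = key
	}
	return out
}

func main() {
	control := map[string]string{"peerA": "stale", "peerB": "current"}
	learned := map[string]string{"peerA": "fresh"}
	fmt.Println(preferTSMPKeys(control, learned))
}
```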
Updates #12639
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
* cmd/k8s-operator/e2e: add L7 HA ingress test
Change-Id: Ic017e4a7e3affbc3e2a87b9b6b9c38afd65f32ed
Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
* cmd/k8s-operator: add further E2E tests for Ingress (#34833)
This change adds E2E tests for L3 HA Ingress and L7 Ingress (Standalone and
HA). Updates the existing L3 Ingress test to use the Service's Magic DNS
name to test connectivity.
Also refactors test setup to set TS_DEBUG_ACME_DIRECTORY_URL only for tests
running against devcontrol, and updates the Kind node image from v1.30.0 to
v1.35.0.
Fixes tailscale/corp#34833
Signed-off-by: Becky Pauley <becky@tailscale.com>
---------
Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
Signed-off-by: Becky Pauley <becky@tailscale.com>
Co-authored-by: Tom Proctor <tomhjp@users.noreply.github.com>
We have ~2.5k nodes running Void Linux, which report a version string
like `1.96.2_1 (Void Linux)`. Previously these versions would fail to
parse, because we only expect a hyphen and `extraCommits` after the
major/minor/patch numbers.
Fix the version parsing logic to handle this case.
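One way to tolerate such strings is sketched below. This is an assumption about the approach, not the actual parser: parseVersion, the parsed type, and the decision to keep everything after the patch number as an opaque suffix are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// parsed is a hypothetical result type for this sketch.
type parsed struct {
	Major, Minor, Patch int
	Suffix              string // everything after the patch digits
}

// digits converts a non-empty, all-digit string to an int.
func digits(s string) (int, bool) {
	if s == "" {
		return 0, false
	}
	n := 0
	for _, c := range s {
		if c < '0' || c > '9' {
			return 0, false
		}
		n = n*10 + int(c-'0')
	}
	return n, true
}

// parseVersion reads major.minor.patch and keeps whatever follows the
// patch digits (such as "-t123abc" or "_1 (Void Linux)") as a suffix
// instead of rejecting the whole string.
func parseVersion(s string) (parsed, bool) {
	parts := strings.SplitN(s, ".", 3)
	if len(parts) != 3 {
		return parsed{}, false
	}
	end := 0
	for end < len(parts[2]) && parts[2][end] >= '0' && parts[2][end] <= '9' {
		end++
	}
	major, ok1 := digits(parts[0])
	minor, ok2 := digits(parts[1])
	patch, ok3 := digits(parts[2][:end])
	if !ok1 || !ok2 || !ok3 {
		return parsed{}, false
	}
	return parsed{major, minor, patch, parts[2][end:]}, true
}

func main() {
	fmt.Println(parseVersion("1.96.2_1 (Void Linux)"))
}
```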
Updates #19148
Change-Id: Ica4f172d080af266af7f0d69bb31483a095cd199
Signed-off-by: Alex Chan <alexc@tailscale.com>
Add a new tailcfg.NodeCapability (NodeAttrCacheNetworkMaps) to control whether
a node with support for caching network maps will attempt to do so. Update the
capability version to reflect this change (mainly as a safety measure, as the
control plane does not currently need to know about it).
Use the presence (or absence) of the node attribute to decide whether to create
and update a netmap cache for each profile. If caching is disabled, discard the
cached data; this allows us to use the presence of a cached netmap as an
indicator it should be used (unless explicitly overridden). Add a test that
verifies the attribute is respected. Reverse the sense of the environment knob
to be true by default, with an override to disable caching at the client
regardless of what the node attribute says.
Move the creation/update of the netmap cache (when enabled) until after
successfully applying the network map, to reduce the possibility that we will
cache (and thus reuse after a restart) a network map that fails to correctly
configure the client.
Updates #12639
Change-Id: I1df4dd791fdb485c6472a9f741037db6ed20c47e
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Instead of sending out disco keys via TSMP once, send them out at
intervals of 60+ seconds. The trigger is still CallMeMaybe, and the keys
will not be sent if no direct connection needs to be established.
This fixes a case where a node can have stale keys but have communicated
with the other peer before, leading to an infinite DERP state.
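The interval gating can be pictured with the sketch below. The discoSender type, its fields, and the per-peer bookkeeping are assumptions for illustration; only the 60-second minimum interval comes from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// discoSender is a hypothetical tracker of when a disco key was last
// sent to each peer.
type discoSender struct {
	interval time.Duration
	lastSent map[string]time.Time
}

// shouldSend reports whether enough time has passed since the last
// send to peer, and records now as the send time when it has.
func (d *discoSender) shouldSend(peer string, now time.Time) bool {
	if last, ok := d.lastSent[peer]; ok && now.Sub(last) < d.interval {
		return false
	}
	d.lastSent[peer] = now
	return true
}

// demo exercises shouldSend with fixed timestamps.
func demo() []bool {
	s := &discoSender{interval: 60 * time.Second, lastSent: map[string]time.Time{}}
	t0 := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	return []bool{
		s.shouldSend("peerA", t0),                     // first trigger
		s.shouldSend("peerA", t0.Add(30*time.Second)), // too soon
		s.shouldSend("peerA", t0.Add(61*time.Second)), // interval elapsed
		s.shouldSend("peerB", t0.Add(61*time.Second)), // different peer
	}
}

func main() {
	fmt.Println(demo())
}
```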
Updates #12639
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
In #10057, @seigel pointed out an inconsistency in the help text for
`exit-node list` and `set --exit-node`:
1. Use `tailscale exit-node list`, which has a column titled "hostname"
and tells you that you can use a hostname with `set --exit-node`:
```console
$ tailscale exit-node list
IP                  HOSTNAME                        COUNTRY     CITY     STATUS
100.98.193.6        linode-vps.tailfa84dd.ts.net    -           -        -
[…]
100.93.242.75       ua-iev-wg-001.mullvad.ts.net    Ukraine     Kyiv     -
# To view the complete list of exit nodes for a country, use `tailscale exit-node list --filter=` followed by the country name.
# To use an exit node, use `tailscale set --exit-node=` followed by the hostname or IP.
# To have Tailscale suggest an exit node, use `tailscale exit-node suggest`.
```
(This is the same format hostnames are presented in the admin
console.)
2. Try copy/pasting a hostname into `set --exit-node`:
```console
$ tailscale set --exit-node=linode-vps.tailfa84dd.ts.net
invalid value "linode-vps.tailfa84dd.ts.net" for --exit-node; must be IP or unique node name
```
3. Note that the command allows some hostnames, if they're from nodes
in a different tailnet:
```console
$ tailscale set --exit-node=ua-iev-wg-001.mullvad.ts.net
$ echo $?
0
```
This patch addresses the inconsistency in two ways:
1. Allow using `tailscale set --exit-node=` with an FQDN that's missing
the trailing dot, matching the formatting used in `exit-node list`
and the admin console.
2. Make the description of valid exit nodes consistent across commands
("hostname or IP").
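The first point boils down to comparing names with the trailing dot ignored. The sketch below assumes peer names are stored as fully qualified names with a trailing dot; findExitNode and normalize are hypothetical helpers, not the CLI's actual functions.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize lowercases a DNS name and drops one trailing dot.
func normalize(name string) string {
	return strings.ToLower(strings.TrimSuffix(name, "."))
}

// findExitNode returns the stored peer name that matches arg once
// both are normalized.
func findExitNode(peers []string, arg string) (string, bool) {
	want := normalize(arg)
	for _, p := range peers {
		if normalize(p) == want {
			return p, true
		}
	}
	return "", false
}

func main() {
	peers := []string{"linode-vps.tailfa84dd.ts.net."}
	fmt.Println(findExitNode(peers, "linode-vps.tailfa84dd.ts.net"))
}
```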
Updates #10057
Change-Id: If5d74f950cc1a9cc4b0ebc0c2f2d70689ffe4d73
Signed-off-by: Alex Chan <alexc@tailscale.com>
This avoids putting "DisablementSecrets" in the JSON output from
`tailscale lock log`, which is potentially scary to somebody who doesn't
understand the distinction.
AUMs are stored and transmitted in CBOR-encoded format, which uses an
integer rather than a string key, so this doesn't break already-created
TKAs.
Fixes #19189
Change-Id: I15b4e81a7cef724a450bafcfa0b938da223c78c9
Signed-off-by: Alex Chan <alexc@tailscale.com>
Reports whether the current binary was built with Tailscale's
custom Go toolchain (the "tailscale_go" build tag).
For https://github.com/tailscale/go/pull/165
Updates tailscale/corp#39430
Change-Id: Ica437582ddf55d7df48b1453bad03ce14b1c0949
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
* Refer to "tailnet-lock" instead of "network-lock" in log messages
* Log keys as `tlpub:<hex>` rather than as Go structs
Updates tailscale/corp#39455
Updates tailscale/corp#37904
Change-Id: I644407d1eda029ee11027bcc949897aa4ba52787
Signed-off-by: Alex Chan <alexc@tailscale.com>
Prior to this change, closing multiple ServiceListeners concurrently
could result in failures as the independent close operations vie for the
attention of the Server's LocalBackend. The close operations would each
obtain the current ETag of the serve config and try to write new serve
config using this ETag. When one write invalidated the ETag of another,
the latter would fail. Exacerbating the issue, ServiceListener.Close
cannot be retried.
This change resolves the bug by using Server.mu to synchronize across
all ServiceListener.Close operations, ensuring they happen serially.
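The effect of the extra lock can be sketched with a toy config store that rejects writes carrying a stale ETag. The backend and server types and the closeService and closeAll helpers are hypothetical; only the idea of holding one server-wide mutex across the read-modify-write comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// backend is a hypothetical config store with ETag-checked writes.
type backend struct {
	mu       sync.Mutex
	etag     int
	services map[string]bool
}

// read returns the current ETag and a copy of the service set.
func (b *backend) read() (int, map[string]bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	cp := make(map[string]bool, len(b.services))
	for k, v := range b.services {
		cp[k] = v
	}
	return b.etag, cp
}

// write stores services only if etag is still current.
func (b *backend) write(etag int, services map[string]bool) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if etag != b.etag {
		return errors.New("etag mismatch")
	}
	b.services = services
	b.etag++
	return nil
}

type server struct {
	mu sync.Mutex // serializes each close's read-modify-write
	be *backend
}

// closeService removes one service while holding the server mutex, so
// no other close can invalidate the ETag between read and write.
func (s *server) closeService(name string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	etag, svcs := s.be.read()
	delete(svcs, name)
	return s.be.write(etag, svcs)
}

// closeAll closes every named service concurrently and returns the
// number of failed closes and the number of services left behind.
func closeAll(names []string) (failures, remaining int) {
	be := &backend{services: map[string]bool{}}
	for _, n := range names {
		be.services[n] = true
	}
	s := &server{be: be}
	errs := make(chan error, len(names))
	var wg sync.WaitGroup
	for _, n := range names {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			errs <- s.closeService(n)
		}(n)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			failures++
		}
	}
	_, left := be.read()
	return failures, len(left)
}

func main() {
	fmt.Println(closeAll([]string{"svc-a", "svc-b", "svc-c"}))
}
```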
Fixes #19169
Signed-off-by: Harry Harpham <harry@tailscale.com>
This is a regression test for #19166, in which it was discovered that
after calling Server.ListenService for multiple Services, only the
Service from the most recent call would be advertised.
The bug was fixed in 99f8039101036857f088c8b72cac365f80219a27
Updates #19166
Signed-off-by: Harry Harpham <harry@tailscale.com>
This makes the limits easier to find and change, rather than scattering
them across the TKA code.
Updates #cleanup
Change-Id: I2f9b3b83d293eebb2572fa7bb6de2ca1f3d9a192
Signed-off-by: Alex Chan <alexc@tailscale.com>
The disco key subscriber could deadlock in a scenario where a self node
update came through the control path into the mapSession after the disco
key subscriber had taken the lock, but before it had pushed the netmap
change, as both the subscriber and onSelfNodeChanged need the
controlclient lock.
The subscriber can safely take the mapsession as the changequeue has its
own lock for inserting records, and also checks if the queue has been
closed before inserting.
Updates #12639
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
Without this, any test relying on underlying use of magicsock will fail
without network connectivity, even when the test logic has no need for a
network connection. Tests currently in this bucket include many in
tstest/integration and in tsnet.
Further explanation:
ipn only becomes Running when it sees at least one live peer or DERP
connection:
0cc1b2ff76/ipn/ipnlocal/local.go (L5861-L5866)
When tests only use a single node, they will never see a peer, so the
node has to wait to see a DERP server.
magicsock sets the preferred DERP server in updateNetInfo(), but this
function returns early if the network is down.
0cc1b2ff76/wgengine/magicsock/magicsock.go (L1053-L1106)
Because we're checking the real network, this prevents ipn from entering
"Running" and causes the test to fail or hang.
In tests, we can assume the network is up unless we're explicitly testing
the behaviour of tailscaled when the network is down. We do something similar
in magicsock/derp.go, where we assume we're connected to control unless
explicitly testing otherwise:
7d2101f352/wgengine/magicsock/derp.go (L166-L177)
This is the template for the changes to `networkDown()`.
Fixes #17122
Co-authored-by: Alex Chan <alexc@tailscale.com>
Signed-off-by: Harry Harpham <harry@tailscale.com>
When disco keys are learned on a node that is connected to control and
has a mapSession, wgengine will see the key as having changed, and
assume that any existing connections will need to be reset.
For keys learned via TSMP, the connection should not be reset as that
key is learned via an active wireguard connection. If wgengine resets
that connection, a 15s timeout will occur.
This change adds a map to track new keys coming in via TSMP, and removes
them from the list of keys that need to trigger wireguard resets. This
is done with an interface chain from controlclient down via localBackend
to userspaceEngine via the watchdog.
Once a key has been actively used for preventing a wireguard reset, the
key is removed from the map.
If mapSession becomes a long-lived process instead of being dependent on
having a connection to control, this interface chain can be removed, and
the event sequence from wrap->controlClient->userspaceEngine can be
changed to wrap->userspaceEngine->controlClient, as we know the map will
not be gunked up with stale TSMP entries.
Updates #12639
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
AppendTo returns the new slice but the result was discarded,
so only the newly added service was advertised.
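The bug class is easy to reproduce with any append-style helper. In the sketch below, appendTo is a hypothetical function with the same shape as an AppendTo method, and the buggy and fixed functions are illustrations rather than the original code.

```go
package main

import "fmt"

// appendTo is a hypothetical helper shaped like an AppendTo method:
// it returns the extended slice rather than modifying dst in place.
func appendTo(dst, src []string) []string {
	return append(dst, src...)
}

// buggy discards the returned slice, so only the new service remains.
func buggy(existing []string, added string) []string {
	out := []string{added}
	appendTo(out, existing) // result discarded
	return out
}

// fixed keeps the returned slice, so every service is included.
func fixed(existing []string, added string) []string {
	out := []string{added}
	out = appendTo(out, existing)
	return out
}

func main() {
	existing := []string{"svc:a"}
	fmt.Println(len(buggy(existing, "svc:b")), len(fixed(existing, "svc:b")))
}
```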
Signed-off-by: Evan Champion <110177090+evan314159@users.noreply.github.com>
Add riscv64 to the GOARCH list passed to mkctr for all Docker image
builds. Go already cross-compiles for riscv64, so this just adds the
architecture to the container manifest.
Updates #17812
Signed-off-by: Bruno Verachten <gounthar@gmail.com>
Previously, running `add/remove/revoke-keys` without passing any keys
would fail with an unhelpful error:
```console
$ tailscale lock revoke-keys
generation of recovery AUM failed: sending generate-recovery-aum: 500 Internal Server Error: no provided key is currently trusted
```
or
```console
$ tailscale lock revoke-keys
generation of recovery AUM failed: sending generate-recovery-aum: 500 Internal Server Error: network-lock is not active
```
Now they fail with a more useful error:
```console
$ tailscale lock revoke-keys
missing argument, expected one or more tailnet lock keys
```
Fixes #19130
Change-Id: I9d81fe2f5b92a335854e71cbc6928e7e77e537e3
Signed-off-by: Alex Chan <alexc@tailscale.com>
Install the previously uninstalled hooks for the filter and tstun
intercepts. Move the DNS manager hook installation into Init() with all
the others. Protect all implementations with a short-circuit if the node
is not configured to use Connectors 2025. The short-circuit pattern
replaces the previous pattern used in managing the DNS manager hook, of
setting it to nil in response to CapMap changes.
Fixes tailscale/corp#38716
Signed-off-by: Michael Ben-Ami <mzb@tailscale.com>
The tailscale-online.target and tailscale-wait-online.service systemd
units were added in 30e12310f1 but never included in the release
packaging (tarballs, debs, rpms).
Updates #11504
Change-Id: I93e03e1330a7ff8facf845c7ca062ed2f0d35eaa
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>