Various settings can be configured related to the QUIC congestion
controller. This patch duplicates them so that independent values can be
set on the frontend and backend sides.
As with the previous patch, options are renamed to use the unified
"fe/be" prefixes. This is part of the current series of commits which
unify QUIC settings. The older options are deprecated and will be
removed in the 3.5 release.
The congestion window is limited by minimum and maximum values which
can never be exceeded. The minimum value is hardcoded to 2 datagrams as
recommended by the specification. The maximum value is specified via the
haproxy configuration.
These values must be respected each time the congestion window size is
adjusted. However, on some rare occasions, the limits were not enforced.
Fix this by implementing wrappers to set or increment the congestion
window. These functions ensure the limits are always applied after the
operation.
Additionally, the wrappers also ensure that if the window reaches a new
maximum value, it is saved in the <cwnd_last_max> field.
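A minimal sketch of the principle behind these wrappers (the helper
names and the reduced path layout below are illustrative assumptions;
only the <cwnd>, <limit_min>, <limit_max> and <cwnd_last_max> fields
come from this series):

    #include <stdint.h>

    /* Reduced path representation, for this sketch only. */
    struct quic_cc_path {
        uint64_t cwnd;          /* current congestion window */
        uint64_t cwnd_last_max; /* highest value ever reached by cwnd */
        uint64_t limit_min;     /* hard minimum (2 datagrams) */
        uint64_t limit_max;     /* hard maximum (from the configuration) */
    };

    /* Clamp <val> between the two hard limits, then record a new
     * maximum in <cwnd_last_max> if one was reached.
     */
    static inline void quic_cc_path_set_cwnd(struct quic_cc_path *path,
                                             uint64_t val)
    {
        if (val < path->limit_min)
            val = path->limit_min;
        else if (val > path->limit_max)
            val = path->limit_max;
        path->cwnd = val;
        if (path->cwnd > path->cwnd_last_max)
            path->cwnd_last_max = path->cwnd;
    }

    /* Increment variant: the limits are applied after the operation. */
    static inline void quic_cc_path_inc_cwnd(struct quic_cc_path *path,
                                             uint64_t inc)
    {
        quic_cc_path_set_cwnd(path, path->cwnd + inc);
    }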
This should be backported up to 2.6, after a brief period of
observation.
There was some possible confusion between the fields related to the
congestion window min and max limits, which cannot be exceeded, and the
maximum value previously reached by the window.
Fix this by adopting a new naming scheme. The enforced limits are now
named <limit_max>/<limit_min>, while the previously reached maximum
value is renamed <cwnd_last_max>.
This should be backported up to 3.1.
A new structure quic_tune has recently been defined. Its purpose is to
store global options related to QUIC. Previously, only the tunable used
to toggle pacing was stored in it.
This commit moves several QUIC related tunables from the global
structure to quic_tune. This better centralizes QUIC configuration
options and gives room for future generic options.
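As a rough, hedged sketch of the direction (the field names below are
assumptions for illustration, not the actual layout):

    /* Sketch only: one global structure gathering the QUIC tunables
     * that previously lived in the global structure.
     */
    struct quic_tune {
        unsigned int options;     /* boolean tunables, e.g. the pacing toggle */
        unsigned int fe_max_cwnd; /* frontend maximum congestion window */
        unsigned int be_max_cwnd; /* backend maximum congestion window */
    };

    extern struct quic_tune quic_tune;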
Pacing support was previously activated on each bind line individually,
via an optional argument of the quic-cc-algo keyword. Remove this
optional argument and introduce a global setting to enable/disable
pacing. Pacing activation is still flagged as experimental.
One important change is that BBR usage previously activated pacing
support automatically. This is not the case anymore, so users should now
always explicitly activate pacing if BBR is selected. A new warning
message is displayed if this is not the case.
Another consequence of this change is that the pacing_inter callback is
now always defined for every quic_cc_algo type. As such, QUIC MUX uses
global.tune.options to determine whether pacing is required.
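For illustration, the kind of check the MUX side can now rely on (the
flag name and helper below are placeholders, not the actual haproxy
identifiers):

    /* Placeholder option bit standing in for the real pacing tunable. */
    #define GTUNE_QUIC_PACING 0x00000001

    /* Returns non-zero when pacing must be used, based on the global
     * tunable options (global.tune.options in haproxy).
     */
    static inline int qmux_pacing_enabled(unsigned int tune_options)
    {
        return !!(tune_options & GTUNE_QUIC_PACING);
    }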
This should be backported up to 3.1, after a period of observation.
Previously, the congestion window was increased each time a new
acknowledgment was received. However, this did not take the window
filling level into account. Under network conditions with negligible
loss, this causes the window to be incremented up to the maximum value
(by default 480k), even though the application does not have enough data
to fill it.
In most cases, this issue is not noticeable. However, it may lead to
excessive memory consumption when a QUIC connection is suddenly
interrupted, as in this case haproxy will fill the window with
retransmissions. It has even caused OOM crashes when thousands of
clients were interrupted at once on a local network benchmark.
Fix this by checking the window level prior to every increment, via a
new helper function quic_cwnd_may_increase(). It was arbitrarily decided
that the window must be at least 50% full when the ACK is handled before
it can be incremented. This value is a good compromise to keep the
window in check while still allowing fast growth when needed.
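The underlying test can be sketched like this (a simplification; the
real helper works on the haproxy path structure):

    #include <stdint.h>

    /* Sketch only: allow window growth when in-flight data already
     * fills at least half of the current window, i.e. when the window
     * really is the limiting factor.
     */
    static inline int quic_cwnd_may_increase(uint64_t cwnd,
                                             uint64_t in_flight)
    {
        return 2 * in_flight >= cwnd;
    }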
Note that this patch only concerns the cubic and newreno algorithms.
BBR already has its own notion of application-limited delivery, which
ensures the window is only incremented when necessary.
This should be backported up to 2.6.
Extend optional pacing support to the newreno and nocc congestion
algorithms, as already done for cubic.
For better extensibility of the cc algo definition, define a new flags
field in the quic_cc_algo structure. For now, the only value is
QUIC_CC_ALGO_FL_OPT_PACING, which is set if pacing support can be
optionally activated. Cubic, newreno and nocc all support it now.
This new flag is then reused by the QUIC config parser. If it is set,
the extra quic-cc-algo burst parameter is taken into account. If
positive, this will activate pacing support on top of the congestion
algorithm. As with cubic previously, pacing is only supported when
running in experimental mode.
Only BBR is not flagged with this new value, as pacing is built
directly into the algorithm and cannot be turned off. Furthermore, BBR
automatically calculates its own maximum burst value. As such, any
quic-cc-algo burst argument used with BBR is still ignored, with a
warning.
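For illustration, how the flag may gate the burst argument in the
parser (everything except the QUIC_CC_ALGO_FL_OPT_PACING name is an
assumption):

    /* Flag advertised by algorithms supporting optional pacing. */
    #define QUIC_CC_ALGO_FL_OPT_PACING 0x00000001

    /* Reduced view of the algorithm descriptor, for this sketch only. */
    struct quic_cc_algo_sketch {
        const char *name;
        unsigned int flags;
    };

    /* Returns 1 if a positive burst value may activate pacing on top of
     * <algo>, 0 if the argument must be ignored (e.g. BBR, whose pacing
     * is builtin).
     */
    static inline int
    quic_cc_algo_accepts_burst(const struct quic_cc_algo_sketch *algo)
    {
        return !!(algo->flags & QUIC_CC_ALGO_FL_OPT_PACING);
    }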
Since these commits, the 32-bit build is broken with several errors as
follows:
CC src/quic_cli.o
src/quic_cli.c: In function ‘dump_quic_full’:
src/quic_cli.c:285:94: error: format ‘%ld’ expects argument of type ‘long int’,
but argument 5 has type ‘uint64_t’ {aka ‘long long unsigned int’} [-Werror=format=]
285 | chunk_appendf(&trash, " [initl] rx.ackrng=%-6zu tx.inflight=%-6zu(%ld%%)\n",
| ~~^
| |
| long int
| %lld
286 | pktns->rx.arngs.sz, pktns->tx.in_flight,
287 | pktns->tx.in_flight * 100 / qc->path->cwnd);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| uint64_t {aka long long unsigned int}
Replace several %ld conversions with %llu and unsigned long long casts
in quic_cli.c, and a %ld with %lld and a (long long) cast in
quic_cc_cubic.c.
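For example, a portable way to print a uint64_t on both 64-bit and
32-bit targets is to cast it to unsigned long long and use %llu
(simplified here to a plain printf(), not the actual chunk_appendf()
call):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t in_flight = 1234, cwnd = 48000;

        /* the casts guarantee that the arguments match the %llu format
         * on 32-bit builds where uint64_t is 'long long unsigned int'
         */
        printf("tx.inflight=%llu(%llu%%)\n",
               (unsigned long long)in_flight,
               (unsigned long long)(in_flight * 100 / cwnd));
        return 0;
    }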
Thank you to Ilya (@chipitsine) for having reported this issue in GH #2689.
Must be backported to 3.0.
Tests performed between a server connected at 1 Gbps and a client at
100 Mbps, 95 ms apart, showed that:
- we need 1.1 MB in flight to fill the link
- rare but inevitable losses are sufficient to make cubic's window
  collapse fast, and recovery takes long
- a 100 MB object takes 69s to download
- tolerance for 1 loss between two ACKs suffices to shrink the download
time to 20-22s
- 2 losses go to 17-20s
- 4 losses reach 14-17s
At 100 concurrent connections that fill the server's link:
- 0 loss tolerance shows 2-3% losses
- 1 loss tolerance shows 3-5% losses
- 2 loss tolerance shows 10-13% losses
- 4 loss tolerance shows 23-29% losses
As such, while there can sometimes be a significant gain in setting
this tolerance above zero, it can also significantly waste bandwidth by
sending far more than can be received. While it's probably not a
solution to real world problems, it repeatedly proved to be a very
effective troubleshooting tool, helping to figure out different root
causes of low transfer speeds. In spirit it is comparable to the no-cc
congestion algorithm, i.e. it must not be used except for
experimentation.
Add a new ->state_cli() callback to the quic_cc_algo struct to define a
function called by the "show quic (cc|full)" commands to dump some
information about the internal state of the congestion algorithm
currently in use by the QUIC connections.
Implement this callback for the CUBIC algorithm to dump its internal
variables:
  - K: the time to reach the cubic curve inflexion point,
  - last_w_max: the last maximum window value reached before entering
    the last recovery period. This is also the window value at the
    inflexion point of the cubic curve,
  - wdiff: the difference between the current window value and
    last_w_max, so negative before the inflexion point, and positive
    after.
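A hedged sketch of what such a dump may look like (the members are
reduced and the buffer API is replaced by a plain snprintf() for
illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* Reduced view of the cubic private state, for this sketch only. */
    struct cubic_sketch {
        uint32_t K;          /* time in ms to reach the inflexion point */
        uint32_t last_w_max; /* window value at the inflexion point */
    };

    /* Dump the internal cubic variables the way a ->state_cli()
     * callback could, into <buf> of <len> bytes, <cwnd> being the
     * current window value.
     */
    void cubic_state_cli(char *buf, size_t len,
                         const struct cubic_sketch *c, uint32_t cwnd)
    {
        snprintf(buf, len, "  cubic: K=%ums last_w_max=%u wdiff=%lld\n",
                 c->K, c->last_w_max, (long long)cwnd - c->last_w_max);
    }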
The K cubic variable is stored in ms. But the formula used to compute K
took the window difference parameter with seconds as its unit, without
considering the loss of information, and the result was then converted
to ms (K *= 1000). This led to a lack of precision, with values that
were multiples of 1000.
To fix this, use the same formula but with the window difference in ms
as the parameter passed to the cubic function, and remove the
conversion.
Must be backported as far as 2.6.
Add two initcall callbacks with BUG_ON_HOT() to the newreno and cubic
modules to ensure there is no buffer overflow when accessing the private
data of these congestion control algorithm state structures. This is to
ensure that further modifications of these data structures will not lead
to surprises. At this time there is no possible buffer overflow.
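The idea boils down to a startup-time size check; a self-contained
sketch of the principle (the private area size, the struct members and
the plain assert() standing in for the initcall + BUG_ON_HOT() pair are
assumptions):

    #include <assert.h>
    #include <stdint.h>

    /* Placeholder definitions for this sketch only. */
    #define QUIC_CC_PRIV_SZ 40
    struct cubic { uint32_t state; uint32_t K; uint32_t last_w_max; };

    /* Run once at startup: fail early if the algorithm private state
     * outgrows the storage reserved for it in the connection.
     */
    void quic_cc_cubic_size_check(void)
    {
        assert(sizeof(struct cubic) <= QUIC_CC_PRIV_SZ);
    }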
This issue arrived with this commit:
"MINOR: quic: HyStart++ implementation (RFC 9406)"
Thanks to @chipitsine for having reported this issue in GH #2513.
Should be backported where the previous commit will be backported.
This is a simple algorithm to replace the classic slow start phase of the
congestion control algorithms. It should reduce the high packet loss during
this step.
Implemented only for Cubic.
This is the 39th iteration of typo fixes.
The naming issue on the argument called "unsued" instead of "unused"
in two functions from resolvers and stick-tables was put into a second
patch so that it can be omitted if it were to cause backport issues.
The new formula for K CUBIC which arrives with RFC 9438 is as follows:
K = cubic_root((W_max - cwnd_epoch) / C)
Note that W_max is c->last_w_max, and cwnd_epoch is c->cwnd when entering
quic_cubic_update() just after a congestion event.
Must be backported as far as 2.6.
The formula for K CUBIC calculation is as follows:
K = cubic_root(W_max * (1 - beta_quic) / C).
Note that this does not match the comment. But the aim of this patch is
to avoid hiding a bug fix inside another patch updating this K CUBIC
calculation.
The unit of C is bytes/s^3 (or segments/s^3), and we want to store K in
milliseconds. So, the conversion done inside cubic_root() to turn
seconds into milliseconds is wrong: the unit effectively used there is
bytes/(ms/1000)^3, i.e. bytes*1000^3/ms^3. It is preferable to compute K
in seconds, then convert it to milliseconds, as done by this patch.
Must be backported as far as 2.6.
Avoid a loss of precision when computing the K cubic value.
Fix the same kind of issue when computing the congestion window value
from the cubic increase function formula, where an integer variable
could wrap around.
Depends on this commit:
MINOR: quic: Code clarifications for QUIC CUBIC (RFC 9438)
Must be backported as far as 2.6.
The first version of our QUIC CUBIC implementation is confusing because
it relies on the TCP CUBIC Linux kernel implementation, with references
to RFC 8312, which was obsoleted by RFC 9438 (August 2023) after our
implementation. RFC 8312 is a little bit hard to understand. RFC 9438
arrived with many more clarifications.
So, RFC 9438 is about "CUBIC for Fast Long-Distance Networks". Our
implementation for QUIC is not very well documented. As it was difficult
to reread this code, this patch only adds some comments at complicated
locations and renames some macros and variables, without any logic
modification at all.
So, the aim of this patch is to first add some comments and
variable/macro renamings to avoid embedding too many code modifications
in the same big patch.
Some code modifications will come later to adapt this CUBIC
implementation to the new RFC 9438.
Rename some macros:
CUBIC_BETA -> CUBIC_BETA_SCALED
CUBIC_C -> CUBIC_C_SCALED
CUBIC_BETA_SCALE_SHIFT -> CUBIC_SCALE_FACTOR_SHIFT (this is the scaling factor
which is used only for CUBIC_BETA_SCALED)
CUBIC_DIFF_TIME_LIMIT -> CUBIC_TIME_LIMIT
CUBIC_ONE_SCALED was added (scaled value of 1).
These cubic struct members were renamed:
->tcp_wnd -> ->W_est
->origin_point -> ->W_target
->epoch_start -> ->t_epoch
->remaining_tcp_inc -> remaining_W_est_inc
Local variables to quic_cubic_update() were renamed:
t -> elapsed_time
diff -> t
delta -> W_cubic_t
Add a graphic curve of the CUBIC increase function.
Add big copied & pasted RFC 9438 extracts related to the 3 different
increase function regions.
Same thing for the fast convergence.
Fix a typo in the reference to QUIC RFC 9002.
Must be backported as far as 2.6 to ease any further modifications to come.
Very minor modification to replace a hardcoded value in a statement by
a macro.
Should be backported as far as 2.6 to ease any further modifications to
come.
Move the quic_path struct from quic_conn-t.h to quic_cc-t.h and rename
it to quic_cc_path. Update the code consequently.
Also move some inlined functions related to the QUIC path to quic_cc.h.
Make all the congestion control algorithms support the maximum
congestion window set by the configuration. There is nothing special to
explain: for each algo, each time the window is incremented it is also
bounded.
Move all QUIC trace definitions from quic_conn.h to quic_trace-t.h.
Also remove the multiple trace_quic macro definitions, moving them into
quic_trace.h. This forces all QUIC source files which rely on traces to
include it, while reducing the size of quic_conn.h.
As reported by @Tristan971 in GH #2116, the congestion control window
could drop to zero due to an inversion in the code applying the
reduction factor.
On a new loss event, the reduction factor must be applied to the slow
start threshold, and the window should never go below ->min_cwnd
(2*max_udp_payload_sz).
The same issue affected both the newReno and cubic algorithms.
Furthermore, in newReno only the threshold was decremented.
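In other words, the expected sequence on a loss event is roughly the
following (the names and the 0.7 reduction factor are illustrative):

    #include <stdint.h>

    /* Sketch of the intended behaviour only: the reduction factor
     * shrinks the slow start threshold, and the window is floored at
     * min_cwnd, never zeroed.
     */
    void on_loss_event(uint64_t *cwnd, uint64_t *ssthresh, uint64_t min_cwnd)
    {
        *ssthresh = *cwnd * 7 / 10;
        *cwnd = *ssthresh > min_cwnd ? *ssthresh : min_cwnd;
    }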
Must be backported to 2.6 and 2.7.
Add the number of lost packets and the maximum congestion control
window computed by the algorithms to "show quic".
Do the same for the traces of the existing congestion control
algorithms.
Must be backported to 2.7 and 2.6.
As now_ms may be zero, these BUG_ON() calls could be triggered when its
value has wrapped. They may be removed because the values they were
supposed to check are safely handled by the ticks API.
Must be backported to 2.6 and 2.7.
This very old bug has been there since the first implementation of the
newreno congestion algorithm. It was a very bad idea to put a state
variable into the quic_cc_algo struct, which only defines the congestion
control algorithm used by a QUIC listener, typically its type and its
callbacks.
This bug could lead to crashes since BUG_ON() calls have been added to
each algorithm implementation. It was revealed by the interop tests, but
not very often, as several connections rarely run at the same time
during these tests.
Hopefully this was also reported by Tristan in GH #2095.
Move the congestion algorithm state to the correct structures, which are
private to a connection (see the cubic and nr structs).
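Schematically, the split looks like this (the member names and sizes
are illustrative):

    #include <stdint.h>

    /* Shared, read-only algorithm descriptor: type and callbacks only,
     * no per-connection state may live here.
     */
    struct quic_cc_algo_sketch {
        void (*on_ack_rcvd)(void *cc);
        void (*on_pkt_lost)(void *cc);
    };

    /* Per-connection congestion control object: the mutable algorithm
     * state (the cubic or nr structs) is stored here.
     */
    struct quic_cc_sketch {
        const struct quic_cc_algo_sketch *algo;
        uint32_t priv[16];
    };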
Must be backported to 2.7 and 2.6.
When entering a recovery period, the algo state is set by
quic_enter_recovery(). And that's it! These two lines should have been
removed with this commit:
BUG/MINOR: quic: Wrong use of now_ms timestamps (cubic algo)
Take the opportunity of this patch to add a missing TRACE_LEAVE() call in
quic_cc_cubic_ca_cb().
Must be backported to 2.7 and 2.6.
Dump the variables previously displayed by TRACE_ENTER() or
TRACE_LEAVE() with calls to TRACE_PROTO() instead. No variables are
displayed by these two former macros anymore. From now on, this
information is accessible at the proto level.
Add new calls to TRACE_PROTO() at important locations related to the
QUIC transport protocol.
When relevant, try to prefix such traces with a TX or RX keyword to
identify the concerned subpart (transmission or reception) of the
protocol.
Must be backported to 2.7.
This callback was left unimplemented. It should at least display the
algorithm state, the congestion control window, the slow start threshold
and the time of the current recovery period. It should be helpful for
debugging.
Must be backported to 2.7.
As now_ms may wrap, one must use the ticks API to protect the cubic
congestion control algorithm implementation from the resulting side
effects.
Furthermore, to make the cubic congestion control algorithm more
readable and easier to maintain, add a new state ("in recovery period",
new QUIC_CC_ST_RP enum value) which helps reach this goal. Implement
quic_cc_cubic_rp_cb(), which is the callback for this new state.
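For illustration, a wrap-safe check built on the ticks helpers
(tick_add() and tick_is_expired() come from haproxy's ticks API; the
surrounding names are illustrative):

    #include <haproxy/ticks.h>

    /* Sketch only: compare dates through the ticks helpers so that a
     * wrapping now_ms cannot produce a wrong result, instead of doing
     * raw arithmetic on timestamps.
     */
    static inline int recovery_period_over(unsigned int recovery_start_time,
                                           unsigned int period,
                                           unsigned int now)
    {
        return tick_is_expired(tick_add(recovery_start_time, period), now);
    }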
Must be backported to 2.7 and 2.6.
It is illegal to call my_flsl() with 0 as parameter value; this is
undefined behavior. This led cubic_root() to divide by 0 at this line:
x = 2 * x + (uint32_t)(val / ((uint64_t)x * (uint64_t)(x - 1)));
Thank you to Tristan971 for having reported this issue in GH #1808
and Willy for having spotted the root cause of this bug.
Must follow any cubic for QUIC backport (2.6).
Cubic is the congestion control algorithm used by default by the Linux
kernel since version 2.6.15. This algorithm is supposed to achieve good
scalability and fairness between flows using the same network path; it
should also be used by default by QUIC. This patch implements this
algorithm and selects it as the default algorithm for congestion
control.
Must be backported to 2.6.