RFC 9110 states that trailers may be merged into the headers. While this
must be done with special care, it may be a problem for some applications.
To avoid any trouble with such applications, two new options were added to
drop trailers during message forwarding.
On the backend, "http-drop-request-trailers" option can be enabled to drop
trailers from the requests before sending them to the server. And on the
frontend, "http-drop-response-trailers" option can be enabled to drop
trailers from the responses before sending them to the client. The options
can be defined in defaults sections and disabled with the "no" keyword.
This patch should fix the issue #2930.
This commit extends the MUX H2 connection reversal step to properly take
the new idle-ping feature into account. It first ensures that the h2c task
is properly instantiated/freed depending now on both the timers and the
idle-ping configuration. Also, h2c_update_timeout() is now called
instead of manually requeuing the task, which ensures the proper timer
value is selected depending on the new connection side.
If idle-ping is activated and the h2c task expires due to a missing PING
ACK, the peer is considered gone and the connection can be closed
immediately. GOAWAY emission is thus skipped.
A new test is necessary in h2c_update_timeout() when a PING ACK is
currently expected but the next selected timer expiration is not
idle-ping. This may happen if the http-keep-alive/http-request timers are
selected first. In this case, the H2_CF_IDL_PING_SENT flag is reset, so as
not to prevent GOAWAY emission on expiration.
This commit is the counterpart of the previous one, adapted to the
frontend side. "idle-ping" is added as a keyword on bind lines, so that
the client timeout of idle frontend connections can be refreshed.
The H2 MUX behavior remains similar to the previous patch. The only
significant change is in h2c_update_timeout(), as idle-ping is now also
taken into account for frontend connections. The calculated value is
compared with the http-request/http-keep-alive timeout value, and the
shorter delay is used as the expiration date. As the hr/ka timeouts are
based on idle_start, this allows them to run in parallel with an idle-ping
timer.
This commit implements support for idle-ping on the backend side. First,
a new server keyword "idle-ping" is defined in configuration parsing. It
is used to set the corresponding new server member.
The second part of this commit implements idle-ping support in the H2 MUX.
A new inlined function conn_idle_ping() is defined to access the
connection idle-ping value. Two new connection flags are defined:
H2_CF_IDL_PING and H2_CF_IDL_PING_SENT. The first one is set for idle
connections via h2c_update_timeout().
In the h2_timeout_task() handler, if the first flag is set, instead of
releasing the connection as before, the second flag is set and the tasklet
is scheduled. As both flags are now set, h2_process_mux() will proceed
with the PING emission. The timer is also rearmed to the idle-ping value.
If a PING ACK is received before the next timeout, the connection timer is
refreshed. Otherwise, the connection is released, as on a regular timer
expiration.
Also of importance, special care is needed when a backend connection goes
idle. In this case, the idle-ping timer must be rearmed, so a new
invocation of h2c_update_timeout() is performed in h2_detach().
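Below is a minimal standalone sketch of the two-flag sequence described
above; the structure, helper names and flow are simplified stand-ins for
illustration, not the actual MUX code:

  #include <stdbool.h>
  #include <stdio.h>

  /* simplified stand-ins for the two flags described above */
  #define IDL_PING       0x01  /* idle connection, idle-ping applies */
  #define IDL_PING_SENT  0x02  /* a PING was emitted, its ACK is awaited */

  struct conn_sketch {
      unsigned int flags;
  };

  /* called when the connection timer expires; returns true to release */
  static bool on_timeout(struct conn_sketch *c)
  {
      if ((c->flags & (IDL_PING | IDL_PING_SENT)) == IDL_PING) {
          /* first expiration: emit a PING and rearm the timer to the
           * idle-ping value instead of releasing the connection
           */
          c->flags |= IDL_PING_SENT;
          printf("emit PING, rearm timer\n");
          return false;
      }
      /* expiration without a PING ACK: the peer is considered gone */
      printf("no PING ACK, release connection\n");
      return true;
  }

  /* called when a PING ACK arrives before the next timeout */
  static void on_ping_ack(struct conn_sketch *c)
  {
      c->flags &= ~IDL_PING_SENT;
      printf("PING ACK received, refresh timer\n");
  }

  int main(void)
  {
      struct conn_sketch c = { .flags = IDL_PING };

      on_timeout(&c);   /* first expiration: PING emitted */
      on_ping_ack(&c);  /* peer answered: timer refreshed */
      on_timeout(&c);   /* PING emitted again */
      on_timeout(&c);   /* still no ACK: connection released */
      return 0;
  }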
Adapt the already existing h2c_ack_ping() function so that it can also
emit a PING request. It is renamed to h2c_send_ping(), and a new boolean
argument <ack> selects whether a PING request or an ACK is emitted.
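For reference, a PING frame is a fixed-size frame: a 9-byte frame header
(payload length 8, type 0x6, stream 0) followed by 8 bytes of opaque data,
with the ACK flag (0x01) distinguishing a request from an acknowledgement
(RFC 9113). A standalone sketch of such a builder follows; the function
name and buffer handling are illustrative, not the actual MUX code:

  #include <stdint.h>
  #include <string.h>

  /* Build an HTTP/2 PING frame into <out>, which must hold 17 bytes.
   * Returns the number of bytes written.
   */
  size_t build_ping_frame(uint8_t *out, const uint8_t payload[8], int ack)
  {
      out[0] = 0x00; out[1] = 0x00; out[2] = 0x08; /* payload length: 8 */
      out[3] = 0x06;                               /* frame type: PING  */
      out[4] = ack ? 0x01 : 0x00;                  /* flags: ACK or not */
      out[5] = out[6] = out[7] = out[8] = 0x00;    /* stream id: 0      */
      memcpy(out + 9, payload, 8);                 /* opaque data       */
      return 17;
  }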
Reorganize the code for timeout calculation in case the connection is
idle. The objective is to better reflect the relations between the
timeouts, as follows (see the sketch below):
* if GOAWAY was already emitted, use the shut timeout, or if unset fall
  back to the client/server one. However, an already set timeout is never
  erased.
* else, for frontend connections, the http-request or keep-alive timeout
  is applied depending on the current demux state. If the selected value
  is unset, fall back to the client timeout.
* for backend connections, no timeout is set, to allow http-reuse.
This commit is pure refactoring, so no functional change should occur.
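A standalone sketch of the selection order above; the parameter names are
illustrative and 0 stands for an unset timer (TICK_ETERNITY in HAProxy
terms), this is not the real h2c_update_timeout() code:

  /* Pick the expiration date of an idle connection. All timeouts and
   * dates are in ms; 0 means unset.
   */
  unsigned int pick_idle_expire(int goaway_sent, int is_frontend,
                                int waiting_for_request,
                                unsigned int now, unsigned int cur_expire,
                                unsigned int shut_to, unsigned int cli_srv_to,
                                unsigned int httpreq_to, unsigned int ka_to)
  {
      if (goaway_sent) {
          /* GOAWAY already emitted: shut timeout, else client/server
           * timeout, but never erase an already armed timer
           */
          if (cur_expire)
              return cur_expire;
          return now + (shut_to ? shut_to : cli_srv_to);
      }
      if (is_frontend) {
          /* http-request or keep-alive depending on the demux state,
           * falling back to the client timeout when unset
           */
          unsigned int to = waiting_for_request ? httpreq_to : ka_to;
          return now + (to ? to : cli_srv_to);
      }
      /* backend: no timeout, keep the connection around for http-reuse */
      return 0;
  }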
Since the following commit, the MUX H2 timeout function has been slightly
extended:
d38d8c6ccb189e7bc813b3693fec3093c9be55f1
BUG/MEDIUM: mux-h2: make sure control frames do not refresh the idle timeout
A side-effect of this patch is that the expire timer of an idle backend
connection is now not reset if already defined. This means that if a timer
was registered prior to the connection's transition to idle, the
connection would be destroyed on its timeout. If this happens for enough
connections, it may have an impact on the reuse rate.
In practice, this case should be rare, as the h2c timer is set to
TICK_ETERNITY while there are active streams. Most of the time the timer
is not refreshed before the transition to idle, so the connection won't be
deleted on expiration.
The only case where it could occur is if there are still pending data
blocked on emission when the stream detaches. Here, "timeout server" is
applied to the connection. When the emission completes, the connection
goes idle, but the timer is still armed and will thus be triggered on the
idle connection.
To prevent this, explicitly reset the h2c timer to TICK_ETERNITY for idle
backend connections via h2c_update_timeout().
This patch is explicitly not scheduled for backport for now, as it is
difficult to estimate the real impact of the previous code state.
A GOAWAY frame should not be emitted before the preface. Thus, the max_id
field of an h2c acting as a server is initialized to -1, which prevents
its emission until the preface is received from the peer. When acting as a
client, max_id is initialized to a valid value on the first h2s emission.
This causes an issue with reverse HTTP on the active side. It first starts
as a client, so the peer does not emit a preface but instead a simple
SETTINGS frame. As roles are switched, max_id is initialized much later,
when the first h2s response is emitted. Thus, if the connection must be
terminated before any stream transfer, GOAWAY cannot be emitted.
To fix this, ensure max_id is initialized to 0 in h2_conn_reverse() for
the active connect side. This way, a GOAWAY indicating that no stream has
been handled can be generated.
Note that the passive connect side is not impacted, as its max_id is
initialized upon preface reception.
This should be backported up to 2.9.
While reviewing the HTTP/2 MUX timeouts, it appeared that the MUX task may
be requeued via h2c_update_timeout() with an already expired date. This
can happen with idle connections in two cases:
* first with the shut timeout, as the timer is not refreshed if already set
* second with the http-request and keep-alive timers, which are based on
  idle_start
Queuing an already expired task is undefined behavior. Fix this by using
task_wakeup() instead of task_queue() at the end of h2c_update_timeout()
if such a case occurs.
This should be backported up to 2.6.
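The gist of the fix as a standalone sketch; the two helpers below are mere
stand-ins for HAProxy's task API, only the decision matters:

  #include <stdio.h>

  /* stand-ins for illustration only */
  static void queue_task(unsigned int expire) { printf("queue for %u\n", expire); }
  static void wake_task(void)                 { printf("wake up immediately\n"); }

  /* Requeue the connection task: if the computed date is already in the
   * past, wake the task up directly instead of queuing an expired timer.
   */
  static void requeue(unsigned int expire, unsigned int now)
  {
      if (expire && expire <= now)
          wake_task();
      else
          queue_task(expire);
  }

  int main(void)
  {
      requeue(100, 200); /* already expired: immediate wakeup */
      requeue(300, 200); /* future date: normal queuing */
      return 0;
  }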
On the frontend side, when a stream is shut while the response was already
fully sent, it was cancelled by sending a RST_STREAM(CANCEL) frame.
However, this is not accurate: the CANCEL error code must only be used
when the response headers were sent but not the full response. As stated
in RFC 9113, once the response has been fully sent, a RST_STREAM with the
NO_ERROR code must be sent to stop the request.
This patch should solve the issue #1219. It must be backported to all stable
versions.
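The two RFC 9113 error codes involved and the selection described above,
as a small standalone sketch (not the MUX's own constants):

  /* RFC 9113 error codes */
  #define SKETCH_NO_ERROR  0x0
  #define SKETCH_CANCEL    0x8

  /* Pick the RST_STREAM error code when a frontend stream is shut:
   * once the response has been fully sent, NO_ERROR tells the peer to
   * stop sending the request; CANCEL is only meant for a response that
   * was started (headers sent) but not completed.
   */
  unsigned int rst_error_code(int response_fully_sent)
  {
      return response_fully_sent ? SKETCH_NO_ERROR : SKETCH_CANCEL;
  }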
Annika Wickert reported some occasional disconnections between haproxy
and varnish when communicating over HTTP/2, with varnish complaining
about protocol errors while captures looked apparently normal. Nils
Goroll managed to reproduce this on varnish by injecting the capture of
the outgoing haproxy traffic and noticed that haproxy was forwarding a
header value containing a trailing space, which is now explicitly
forbidden since RFC9113.
It turns out that the only way for such a header to pass through haproxy
is to arrive in h2 and not be edited, in which case it will arrive in
HTX with its undesired spaces. Since the code dealing with HTX headers
always trims spaces around them, these are not observable in dumps, but
only when started in debug mode (-d). Conversions to/from h1 also drop
the spaces.
With this patch we trim LWS both on input and on output. This way we
always present clean headers in the whole stack, and even if some are
manually crafted by the configuration or Lua, they will be trimmed on
the output.
This must be backported to all stable versions.
Thanks to Annika for the helpful capture and Nils for the help with
the analysis on the varnish side!
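For illustration, the kind of LWS trimming described above boils down to
stripping spaces and horizontal tabs on both ends of the value; this
standalone sketch is not HAProxy's own helper, whose header representation
differs:

  #include <stddef.h>

  /* Trim leading and trailing spaces/tabs from the <len> bytes at <buf>;
   * the trimmed start is returned via <out> and the new length returned.
   */
  size_t trim_lws(const char *buf, size_t len, const char **out)
  {
      while (len && (*buf == ' ' || *buf == '\t')) {
          buf++;
          len--;
      }
      while (len && (buf[len - 1] == ' ' || buf[len - 1] == '\t'))
          len--;
      *out = buf;
      return len;
  }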
On shut, truncated HTX messages were not properly handled by the H2
multiplexer. Depending on how data were emitted, a chunked HTX message
without the 0-CRLF could be considered full, and an empty DATA frame with
the ES flag set could be emitted instead of a RST_STREAM(CANCEL) frame.
In the H2 multiplexer, when a shut is performed, an HTX message is
considered truncated if more HTX data are still expected. This is based on
the presence or not of the H2_SF_MORE_HTX_DATA flag on the H2 stream.
However, this flag is set or unset depending on the HTX extra field
value. This field states how much data must still be transferred, based on
the announced data length. For a message with a content-length, this
assumption is valid. But for a chunked message it is not: only the length
of the current chunk is announced. So we cannot rely on this field in that
case to know whether a message is full or not.
Instead, we must rely on the HTX start-line flags to know if more HTX data
are expected or not. If the xfer length is known (the HTX_SL_F_XFER_LEN flag
is set on the HTX start-line), it means that more data are always expected,
until the end of message is reached (the HTX_FL_EOM flag is set on the HTX
message). This is true for bodyless messages because the end of message is
reported with the end of headers. This is also true for tunneled messages
because the end of message is received before switching the H2 stream in
tunnel mode.
This patch must be backported as far as 2.8.
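In short, the truncation check becomes something like the following
standalone sketch; the two booleans stand in for the HTX flags named
above, the surrounding structures are simplified:

  #include <stdbool.h>

  struct msg_sketch {
      bool xfer_len_known;  /* HTX_SL_F_XFER_LEN set on the start-line */
      bool eom_seen;        /* HTX_FL_EOM set on the message */
  };

  /* A message shut before completion is truncated whenever the transfer
   * length is known and the end of message has not been reached yet,
   * regardless of the per-chunk "extra" accounting.
   */
  bool shut_is_truncated(const struct msg_sketch *m)
  {
      return m->xfer_len_known && !m->eom_seen;
  }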
When receive or EOS termination events are reported at the SE level, a
truncation was erroneously reported when no error was detected. Of course, it
must be the opposite.
No backport needed.
Termination events dedicated to mux connections and stream-endpoint
descriptors are added in this patch, with events specific to these
locations. Changes for the H1 and H2 multiplexers are reviewed to be more
accurate.
MUX_CTL_TEVTS command is added to get the termination event logs of a mux
connection and MUX_SCTL_TEVTS command to get the termination event logs of a
mux stream.
The termination event logs of the multiplexer connection and stream are
now dumped when the corresponding mux info is dumped. The termination
event logs of the underlying connection are also dumped in the debug
string.
Shutdowns for reads (read0), receive errors, shutdowns for writes and
timeouts are reported, but only for the H2 connection for now.
As for the H1 multiplexer, more events must be added to report protocol
errors, goaways and rst-streams. And of course, all events for the H2
streams must be reported too.
When data was copied from RX buffers to the channel buffer, more data than
expected could be moved because the amount of data copied was never
decremented from the limit. This could lead to a stream deadlock when the
compression filter was in use.
The issue was introduced by commit 4eb3ff1 ("MAJOR: mux-h2: make streams use
the connection's buffers") but revealed by 3816c38 ("MAJOR: mux-h2: permit a
stream to allocate as many buffers as desired").
Because an h2 stream can now have several RX buffers, in h2_rcv_buf() we
loop on these buffers to fill the channel buffer. However, we must still
take care to respect the limit and not copy too much data. But the "count"
variable was never decremented to reflect the amount of data already
copied, so it was possible to exceed the limit.
It was an issue when the compression filter was in use because the channel
buffer could be fully filled, preventing the compression from being
performed. When this happened, the stream was infinitely blocked because
the compression filter was asking for some space but nothing was scheduled
to be forwarded.
This patch should fix the issue #2826. It must be backported to 3.1.
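The shape of the fixed copy loop, as a standalone sketch with simplified
buffer types; the point is only that <count> must shrink by the amount
copied at each turn:

  #include <stddef.h>
  #include <string.h>

  struct rxbuf_sketch {
      const char *area;
      size_t data;
  };

  /* Copy from several rx buffers into <dst> without exceeding <count>
   * bytes in total. Forgetting to decrement <count> after each copy is
   * exactly the bug described above.
   */
  size_t copy_rxbufs(char *dst, size_t count,
                     const struct rxbuf_sketch *rxbufs, size_t nbuf)
  {
      size_t total = 0;

      for (size_t i = 0; i < nbuf && count; i++) {
          size_t chunk = rxbufs[i].data < count ? rxbufs[i].data : count;

          memcpy(dst + total, rxbufs[i].area, chunk);
          total += chunk;
          count -= chunk;  /* the decrement that was missing */
      }
      return total;
  }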
Dedicated HTTP/2 stats proxy counters are available for the current and
total number of HTTP/2 connections on both the frontend and backend sides.
Both counters are simply incremented in h2_init().
This causes issues when using reverse HTTP. First, the increment is not
performed on the expected side, as it is triggered before
h2_conn_reverse(), which switches a connection from frontend to backend or
vice versa. For example, on the active reverse side, h2_total_connections
is incremented only on the backend, even after the connection is reversed
and attached to a listener for the remainder of its lifetime.
h2_open_connections suffers from a similar but arguably worse behavior as
it is also decremented. If the increment and decrement operations are not
performed on the same proxy side, which happens for every connection that
has been successfully reversed, it results in an invalid counter value,
possibly with an integer overflow.
To fix this, delay increment operations on reverse HTTP from h2_init()
to h2_conn_reverse(). Both counters are updated only after reverse has
completed, thus using the expected frontend or backend side.
To prevent an overflow of h2_open_connections, ensure the h2_release()
decrement is not performed if a connection is freed before completing its
reversal, as in this case it would not have been accounted for by the H2
counters.
This should be backported up to 2.9.
This should fix github issue #2821.
Latest commit f0eca8fe7 ("MINOR: mux-h2/glitches: add a description to
the H2 glitches") misplaced the optional glitch description field, with
it appearing at the end of the if () condition and always reporting
an excess of CONTINUATION frames from the first exceeding one.
This needs to be backported along with that commit once it gets backported.
Since we can now list them using "debug counters" and now support a
description, better add the description to all glitches. This patch may
be backported to 3.1, but before this the following patches must also
be picked:
86823c828 MINOR: mux-h2/traces: add a missing trace on negative initial window size
7c8e9420a CLEANUP: mux-h2/traces: reword certain ambiguous traces
When a negative initial window size is reported, we're going to close
the connection, so it's important to report a trace to explain why!
This should be backported at least to 3.1 and possibly 3.0 (adapting the
context since there's no glitches there).
This commit is the counterpart of the previous one for the H2 mux. It
replaces objt_server() with the unsafe __objt_server(), as the conn target
is guaranteed to point to a valid server instance, which can then be used
as the _srv_add_idle() argument.
Since commit 1cc851d9f2 ("MEDIUM: mux-h2: start to update stream when
sending WU") we started storing stream offsets in the h2s struct. These
offsets are updated at a few points where it's safe to write to the
stream, and in h2c_send_strm_wu(), where the h2s->h2c check was not
performed.
Due to this, nothing protects the h2s from being updated when sending a
WU for a closed stream, which might only happen when acknowledging a
frame after resetting that stream, which is quite unlikely. In any case
if this happens, it will crash as in issue #2793 since the closed streams
are purposely read-only to catch such bugs.
The fix is trivial, just check h2s->h2c before deciding to update the
stream.
Thanks to @Wahnes for reporting this, and Christopher for spotting the
cause. This needs to be backported to 3.1 only.
There is no explicit test on the number of headers when a HEADERS frame is
received. It is implicitly limited by the size of the header list, but
that is twice the configured limit, to be sure the frame can be decoded.
So now, a check is performed after the HTX message was created. This way, we
are sure to not exceed the configured limit after the decoding stage. If
there are too many headers, a parsing error is reported.
Note the same is performed on the trailers.
This patch should partially address the issue #2685. It should be backported
to all stable versions.
When a HEADERS frame is encoded to be sent, the maximum number of headers
allowed in the frame is lower than on the receiving path. This can lead to
reporting a sending error although the message was accepted, which can be
confusing.
In addition, the start-line is split into pseudo-headers and thus consumes
some header slots, increasing the difference between HEADERS frame
encoding and decoding. It is even more noticeable because, when a HEADERS
frame is decoded, a margin is used to be able to handle split cookie
headers. Concretely, on the decoding path, a limit of twice the maximum
number of headers allowed in a message (tune.http.maxhdr * 2) is used. On
the encoding path, the exact limit is used. This is not consistent.
Note that when a frame is decoded, we must use a larger limit because the
pseudo-headers are reassembled into the start-line and must count for one,
but also because, most of the time, the cookies are split into several
headers and are reassembled too.
To fix the issue, the same ratio is applied on the sending path. A limit
must be defined because a dynamic allocation is not acceptable. Twice the
configured limit should be good enough to support header manipulation.
This patch should be backported to all stable versions.
On server side, the H2 stream is first created with an unassigned ID (ID ==
0). Its ID is assigned when the request is emitted, before formatting the
HEADERS frame. However, the session may be aborted during that stage. We
must take care to not emit RST_STREAM frame for this stream, because it does
not exist yet for the server.
It is especially important to do so because, depending on the timing, it
may also happen before the H2 preface was sent.
This patch must be backported to all stable versions. It is related to issue
The h2c_report_glitch() function is now replaced with a macro to support
enumerating counters for each individual glitch line. For now this adds
43 such counters. The macro supports an optional description, though that
is not being used for now. It gives outputs like this (note that the last
one was purposely instrumented to pass a description):
> debug dev counters glt all
0 GLT mux_h2.c:5976 h2c_dec_hdrs()
0 GLT mux_h2.c:5960 h2c_dec_hdrs()
(...)
0 GLT mux_h2.c:2207 h2c_frt_recv_preface()
0 GLT mux_h2.c:1954 h2c_frt_stream_new(): new stream too early
As a reminder, this requires to build with -DDEBUG_GLITCHES=1.
When timeout http-keep-alive is very short (e.g. 10ms), it's possible
sometimes for a client to face truncated responses due to an early
close that happens while the system is still pushing the last data,
colliding with the client's WINDOW_UPDATEs that trigger RSTs.
Here we're trying to do better: first we send a GOAWAY on timeout, then
we wait up to clientfin/client timeout for the peer to react so that we
don't immediately close. This is sufficient to avoid truncation as soon
as the timeout is more than a few hundred ms.
It's not certain it should be backported, because it's a bit sensitive
and might possibly fall into certain edge cases.
During the 11th and 12th iterations of the development cycle for the H2
auto rx window, several approaches were attempted to figure out whether
another buffer could be allocated or not. One of them consisted in looping
back to the beginning of the function, requesting a new buffer slot and
getting one if the buffer was either apparently or confirmed full. The
latest one consisted in directly allocating the next buffer from the two
places where it's found to be proven full, instead of checking with the
now defunct h2s_may_get_rxbuf() whether we were allowed to get one and
loop.
That approach was retained. In this case the "full" variable is no longer
needed, so let's get rid of it because the construct looks bogus and
confuses coverity (and possibly code readers, as the intent is unclear
compared to the code).
When data are sent via the zero-copy data forwarding, in h2_done_ff, we
must be sure to remove the H2 stream from the send list if something is
sent. It was only performed if no blocking condition was encountered, but
we must also do it when something is sent. Otherwise the transfer may be
blocked till the timeout.
This patch must be backported as far as 2.9.
Commit cf3fe1eed ("MINOR: mux-h2/traces: print the size of the DATA
frames") added the size of the DATA frame to the traces. Unfortunately
it uses ullong instead of ulong to cast a pointer, which breaks the
build on 32-bit platforms. Let's just switch it to ulong which works
on both.
The traces currently don't contain any info about the amount of data
present in buffers, making it difficult to figure if an empty buffer
is the cause for not demuxing or if a full buffer is the cause for
not reading more data. Let's add them, with the head/tail info as
well.
H2 traces are unusable to detect bugs most of the time because they miss
the h2c and h2s flags, as well as the proxy, which makes it very hard to
figure if the info comes from the client or the server as soon as two
layers are stacked. This commit adds this precious information as well
as the h2s's rx and tx windows.
This could be backported to a few recent branches, but the rx window
calculation will have to be replaced with the static value there.
This reduces the avg wakeup latency of sc_conn_io_cb() from 1900 to 51us.
The L2 cache misses go from 1.4 to 1.2 billion for 20k req, but the perf
is not better. Also there are situations where we must not perform such a
wakeup; these may only be done from h2_io_cb, hence the test on the
next_tasklet pointer and its reset when leaving the function. In practice
all callers of h2s_close() or h2s_destroy() can reach that code, including
h2_detach, h2_snd_buf, h2_shut etc.
Another test with 40 concurrent connections, transferring 40k 1MB objects
at different concurrency levels from 1 to 80 also showed a 21% drop in L2
cache misses, and a 2% perf improvement:
Before:
329,510,887,528 instructions
50,907,966,181 branches
843,515,912 branch-misses
2,753,360,222 cache-misses
19,306,172,474 L1-icache-load-misses
17,321,132,742 L1-dcache-load-misses
951,787,350 LLC-load-misses
44.660469000 seconds user
62.459354000 seconds sys
=> avg perf: 373 MB/s
After:
331,310,219,157 instructions
51,343,396,257 branches
851,567,572 branch-misses
2,183,369,149 cache-misses
19,129,827,134 L1-icache-load-misses
17,441,877,512 L1-dcache-load-misses
906,923,115 LLC-load-misses
42.795458000 seconds user
62.277983000 seconds sys
=> avg perf: 380 MB/s
With small requests, it's the L1 and L3 cache misses which reduced by
3% and 7% respectively, and the performance went up by 3%.
When we stop demuxing in the middle of a frame, we know that there are
other data following. The demux buffer is small and unique, but now we
have rxbufs, so after h2_process_demux() is left, the dbuf is almost
empty and has room to be delivered into another rxbuf.
Let's implement a short loop with a counter and a few conditions around
the demux call. We limit the number of turns to the number of available
rxbufs and no more than 12, since it shows good performance, and the
wakeup is only called once. This has shown a nice 12-20% bandwidth gain
on backend-side H2 transferring 1MB-large objects, and does not affect
the rest (headers, control etc). The number of wakeup calls was divided
by 5 to 8, which is also a nice improvement. The counter is limited to
make sure we don't add processing latency. Tests were run to find the
optimal limit, and it turns out that 16 is just slightly better, but not
worth the +33% increase in peak processing latency.
The h2_process_demux() function just doesn't call the wakeup function
anymore, and solely focuses on transferring from dbuf to rxbuf.
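A standalone toy model of the bounded loop described above; the limit of
12 and the stop conditions come from the text, everything else is
simplified for illustration:

  #include <stdio.h>

  #define MAX_DEMUX_TURNS 12  /* empirical cap discussed above */

  /* Toy model: <pending_frames> waits in the dbuf and each demux turn
   * moves one frame into an rxbuf. The loop stops when the dbuf is
   * drained, no rxbuf is left, or the turn budget is exhausted; the
   * stream wakeup is then performed once, outside the loop.
   */
  static int demux_loop(int pending_frames, int avail_rxbufs)
  {
      int turns = avail_rxbufs < MAX_DEMUX_TURNS ? avail_rxbufs : MAX_DEMUX_TURNS;
      int moved = 0;

      while (turns-- && pending_frames && avail_rxbufs) {
          pending_frames--;
          avail_rxbufs--;
          moved++;
      }
      printf("moved %d frame(s), waking streams once\n", moved);
      return moved;
  }

  int main(void)
  {
      demux_loop(20, 30);  /* capped at 12 turns */
      demux_loop(3, 30);   /* stops when the dbuf is drained */
      return 0;
  }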
Practical measurement: test with h2load producing 4 concurrent connections
with 10 concurrent streams each, downloading 1MB objects (20k total) via
two layers of haproxy stacked, reaching httpterm over H1 (numbers are total
for the 2 h2 front and 1 h2 back). All on a single thread.
Before: 549-553 MB/s (on h2load)
function calls cpu_tot cpu_avg
h2_io_cb 2562340 8.157s 3.183us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
h2_io_cb 30109 840.9ms 27.93us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
h2_io_cb 16105 106.4ms 6.607us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
h2_io_cb 1 11.75us 11.75us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
h2_io_cb 2608555 9.104s 3.490us --total--
perf stat:
153,117,996,214 instructions (71.41%)
22,919,659,027 branches # 14.97% of inst (71.41%)
384,009,600 branch-misses # 1.68% of all branches (71.42%)
44,052,220 cache-misses # 1 inst / 3476 (71.44%)
9,819,232,047 L1-icache-load-misses # 6.4% of inst (71.45%)
8,426,410,306 L1-dcache-load-misses # 5.5% of inst (57.15%)
10,951,949 LLC-load-misses # 1 inst / 13982 (57.13%)
12.372600000 seconds user
23.629506000 seconds sys
After: 660 MB/s (+20%)
function calls cpu_tot cpu_avg
h2_io_cb 244502 4.410s 18.04us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
h2_io_cb 42107 1.062s 25.22us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
h2_io_cb 13703 106.3ms 7.758us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
h2_io_cb 1 13.74us 13.74us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
h2_io_cb 300313 5.578s 18.57us --total--
perf stat:
126,840,441,876 instructions (71.40%)
17,576,059,236 branches # 13.86% of inst (71.40%)
274,136,753 branch-misses # 1.56% of all branches (71.42%)
30,413,562 cache-misses # 1 inst / 4170 (71.45%)
6,665,036,203 L1-icache-load-misses # 5.25% of inst (71.46%)
7,519,037,097 L1-dcache-load-misses # 5.9% of inst (57.15%)
6,702,411 LLC-load-misses # 1 inst / 18925 (57.12%)
10.490097000 seconds user
19.212515000 seconds sys
It's also interesting to see that less total time is spent in these
functions, clearly indicating that the cost of interrupted processing,
and the extraneous cache misses come into play at some point. Indeed,
after the change, the number of instructions went down by 17.2%, while
the L2 cache misses dropped by 31% and the L3 cache misses by 39%!
h2_send() used to report non-zero every time any data were sent, and this
was used from h2_snd_buf() or h2_done_ff() to trigger a wakeup, which may
end up doing nothing. This wakeup is now restricted to either a successful
send() combined with the ability to demux, or an error.
Doing this makes the number of h2_io_cb() wakeups drop from 422k to
245k for 1000 1MB objects delivered over 100 streams between two H2
proxies, without any behavior change nor performance change. In
practice, most send() calls do not result in a wakeup anymore but
synchronous errors still do.
A local test downloading 10k 1MB objects from an H1 server with a single
connection shows this change:
before after caller
1547 1467 h2_process_demux()
2138 0 h2_done_ff() <---
38 1453 ssl_sock_io_cb() <---
18 0 h2_snd_buf()
1 1 h2_init()
3742 2921 -- total --
In practice the ssl_sock_io_cb() wakeups are those notifying about
SUB_RETRY_RECV, which are not accounted for when h2_done_ff() performs
the wakeup because the tasklet is already queued (a counter placed
there shows that it's nonetheless called). So there's no transfer and
h2_done_ff() was only hiding the other one.
Another test involving 4 connections with 10 concurrent streams each
and 20000 1MB objects total shows a total disparition of the wakeups
from h2_snd_buf and h2_done_ff, which used to account together for
50% of the wakeups, resulting in effectively halving the number of
wakeups which, based on their avg process time, were not doing
anything:
Before:
function calls cpu_tot cpu_avg
h2_io_cb 2571208 7.406s 2.880us <- h2c_restart_reading@src/mux_h2.c:940 tasklet_wakeup
h2_io_cb 2536949 251.4ms 99.00ns <- h2_snd_buf@src/mux_h2.c:7573 tasklet_wakeup ###
h2_io_cb 41100 5.622ms 136.0ns <- h2_done_ff@src/mux_h2.c:7779 tasklet_wakeup ###
h2_io_cb 38979 852.8ms 21.88us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
h2_io_cb 12519 90.28ms 7.211us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
h2_io_cb 1 13.81us 13.81us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
h2_io_cb 5200756 8.606s 1.654us --total--
After:
h2_io_cb 2562340 8.157s 3.183us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
h2_io_cb 30109 840.9ms 27.93us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
h2_io_cb 16105 106.4ms 6.607us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
h2_io_cb 1 11.75us 11.75us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
h2_io_cb 2608555 9.104s 3.490us --total--
From the beginning, h2_restart_reading() has always been confusing because
it decides whether or not to wake the tasklet handler up. This
tasklet handler does two things, one is receiving from the socket to the
demux buf, and one is demuxing from the demux buf to the streams' rxbufs.
The conditions are governed by h2_recv_allowed(), which is also called at
a few places to decide whether or not to actually receive from the socket.
It starts to be visible that this leaves some difficulties regarding what
to do with possibly pending data.
In 2.0 with commit 3ca18bf0b ("BUG/MEDIUM: h2: Don't attempt to recv from
h2_process_demux if we subscribed."), we even had to address a special
case where it was possible to endlessly wake up because the conditions
would rely on the demux buffer's contents, though the solution consisted
in passing a flag to decide whether or not to consider the buffer's
contents.
In 2.5 commit b5f7b5296 ("BUG/MEDIUM: mux-h2: Handle remaining read0 cases
on partial frames") introduced a new flag H2_CF_DEM_SHORT_READ which
indicates that the demux had to stop in the middle of a frame and cannot
make progress without more data. More adaptations later came in based on
this but this actually reflected exactly what was needed to solve this
painful situation: a state indicating whether to receive or parse.
Now's about time to definitely address this by reworking h2_restart_reading()
to check two completely independent things:
- the ability to receive more data into the demux buffer, which is
based on its allocation/fill state and the socket's errors
- the ability to demux such data, which is based on the presence of
enough data (i.e. no stuck short read), and ability to find an rx
buf to continue the processing.
Now the conditions are much more understandable, and it's also visible
that the consider_buffer argument, whose value was not trivial for
callers, is not used anymore.
Tests stacking two layers of H2 show strictly no change to the wakeup
cause distributions nor counts.
Now that we're using all available rx buffers for transfers, there's
no point anymore in advertising more than the minimum value we can
safely buffer. Let's be conservative and only rely on the dynamic
buffers to improve speed beyond the configured value, and make sure
that many streams will no longer cause unfairness.
Interestingly, the total number of wakeups has further shrunk down, but
with a different distribution. From 128k for 1000 1M transfers, it went
down to 119k, with 96k from restart_reading, 10k from done_ff and 2.6k
from snd_buf. done_ff went up by 30% and restart_reading went down by
30%.
These settings allow changing the total buffer size allocated to the
backend and frontend respectively. This way it's no longer necessary to
play with tune.bufsize nor increase the number of streams to benefit from
more buffers.
Setting tune.h2.fe.rxbuf to 4m to match a sender's max tcp_wmem resulted
in 257 Mbps for a single stream at 103ms vs 121 Mbps default (or 5.1 Mbps
with a single buffer and 64kB window).
Without using bandwidth estimates, we can already use up to the number
of allocatable rxbufs and share them evenly between receiving streams.
In practice we reserve one buffer for any non-receiving stream, plus
1 per 8 possible new streams, and divide the rest between the number
of receiving streams.
Finally, for front streams, this is rounded up to the buffer size while
for back streams we round it down. The rationale here is that front to
back is very fast to flush and slow to refill so we want to optimise
upload bandwidth regardless of the number of streams, while it's the
opposite in the other way so we try to minimize HoL.
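A standalone sketch of the sharing rule described in the previous
paragraphs (one buffer reserved per non-receiving stream, one per 8
possible new streams, the remainder split between receiving streams,
rounded up on the frontend and down on the backend); the exact accounting
in the MUX differs:

  /* Share <total> allocatable rx buffers between streams. */
  unsigned int rxbufs_per_receiving_stream(unsigned int total,
                                           unsigned int non_recv_streams,
                                           unsigned int possible_new_streams,
                                           unsigned int recv_streams,
                                           int is_frontend)
  {
      unsigned int reserved = non_recv_streams + possible_new_streams / 8;
      unsigned int left = total > reserved ? total - reserved : 0;

      if (!recv_streams)
          return 0;

      if (is_frontend) {
          /* front: round up, favour upload bandwidth */
          return (left + recv_streams - 1) / recv_streams;
      }
      /* back: round down, minimize head-of-line blocking */
      return left / recv_streams;
  }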
That shows good results with a single stream being able to send at 121
Mbps at 103ms using 1.4 MB buffer with default settings, or 8 streams
sharing the bandwidth at 180kB each. Previously the limit was approx
5.1 Mbps per stream.
It also enables better sharing of backend connections: a slow (100 Mbps)
and a fast (1 Gbps) clients were both downloading 2 100MB files each over
a shared H2 connection. The fast one used to show 6.86 to 20.74s with an
avg of 11.45s and an stddev of 5.81s before the patch, and went to a
much more respectable 6.82 to 7.73s with 7.08s avg and 0.336s stddev.
We don't try to increase the window past the remaining content length.
First, this is pointless (though harmless), but in addition it causes
needless emission of WINDOW_UPDATE frames on small uploads that are
smaller than a window, and beyond being useless, it upsets vtest which
expects an RST on some tests. The scheduling is not reliable enough to
insert an expect for a window update first, so in the end with that
extra check we save a few useless frames on small uploads and please
vtest.
A new setting should be added to allow to increase the number of buffers
without having to change the number of streams. At this point it's not
done.
Now we don't enforce allocation limits in h2s_get_rxbuf(): there is no
benefit in not processing pending data, it would only cause HoL blocking
for no saving. The only reason for not allocating is if there are no
buffers available for the connection.
In theory this should not change anything except that it exercises code
paths that support reallocating multiple buffers, which could possibly
uncover a sleeping bug. This is why it's placed in a separate commit.
And one observation worth noting is that it almost cut in half the number
of iocb wakeups: for 1000 1MB transfers over 100 concurrent streams of a
single connection, we used to observe 208k wakeups (110k from
restart_reading, 80k from snd_buf, 11k from done_ff), and now we're
observing 128k (113k from restart_reading, 2.4k from snd_buf, 6.9k from
done_ff), which seems to indicate that pretty often the demuxing was
blocked on a buffer full due to the default advertised window of 64k.
For now it seems to work as before, even when artificially inflating the
number of allocatable buffers per stream. The number of allocated slots is
always the same as the max number of streams, which guarantees that each
stream will find one buffer. We only grant one buffer per stream at this
point, since the goal was to replace the existing single rxbuf.
A new demux blocking flag, H2_CF_DEM_RXBUF, was added to indicate
a failure to get an rxbuf slot from the connection. It was lightly
tested (by forcing bl_init() to a lower number of buffers). It is not
yet certain whether it's more useful to have a new flag or to reuse
the existing H2_CF_DEM_SFULL which indicates the rxbuf is full,
but at least the new flag more accurately translates the condition,
that may make a difference in the future. However, given that when
RXBUF is set, most of the time it results in a failure to find more
room to demux and it sets SFULL, for now we have to always clear
SFULL when clearing RXBUF as well. This means that most of the time
we'll see 3 combinations:
- none: everything's OK
- SFULL: the unique rx buffer is full
- RXBUF || (RXBUF|SFULL): cannot allocate more entries
Note that we need to be super careful in h2_frt_transfer_data() because
the htx_free_data_space() function doesn't guarantee that the room is
usable, so htx_add_data() may still fail despite an apparent room. For
this reason, h2_frt_transfer_data() maintains a "full" flag to indicate
that a transfer attempt failed and that a new buffer is required.
Since commit 485da0b05 ("BUG/MEDIUM: mux_h2: Handle others remaining
read0 cases on partial frames"), H2_CF_DEM_SHORT_READ is set when there
is no blocking flags. However, it checks H2_CF_DEM_BLOCK_ANY which does
not include H2_CF_DEM_DFULL. This results in many cases where both
H2_CF_DEM_DFULL and H2_CF_DEM_SHORT_READ are set together, which makes
no sense, since one says the demux buffer is full while the other one
says an incomplete read was done. This doesn't allow properly deciding
whether to restart reading or processing.
Let's make sure to clear DFULL in h2_process_demux() whenever we
consume incoming data from the dbuf, and check for DFULL before
setting SHORT_READ.
This could probably be considered as a bug fix but it's hard to say if
it has any impact on the current code, probably at worst it might cause
a few useless wakeups, so until there's any proof that it needs to be
backported, better not do it.
The code used to decide when to restart reading is far from being trivial
and will cause trouble after the forthcoming changes: it checks if the
current stream is the same that is being demuxed, and only if so, wakes
the demux to restart reading. Once streams will start to use multiple
buffers, this condition will make no sense anymore. Actually the real
reason is split into two steps:
- detect if the demux is currently blocked on the current stream, and
if so remove SFULL
- detect if any demux blocking flags were removed during the operations,
and if so, wake demuxing.
For now this doesn't change anything.