When a negative initial window size is reported, we're going to close
the connection, so it's important to report a trace to explain why!
This should be backported at least to 3.1 and possibly 3.0 (adapting the
context since there are no glitches there).
This commit is the counterpart of the previous one for the H2 mux. It
replaces objt_server() by the unsafe __objt_server(), as the conn target is
guaranteed to point to a valid server instance, which can then be used as
the _srv_add_idle() argument.
Since commit 1cc851d9f2 ("MEDIUM: mux-h2: start to update stream when
sending WU") we started storing stream offsets in the h2s struct. These
offsets are updated at a few points where it's safe to write to the
stream, and in h2c_send_strm_wu(), where the check on h2s->h2c was not
performed. Due to this, nothing protects the h2s from being updated when
sending a WU for a closed stream. This might only happen when
acknowledging a frame after resetting that stream, which is quite
unlikely, but if it ever happens, it will crash as in issue #2793, since
the closed streams are purposely read-only to catch such bugs.
The fix is trivial, just check h2s->h2c before deciding to update the
stream.
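To illustrate the shape of the fix, here is a minimal sketch with
simplified types (not the literal patch); it relies on the static closed
streams having a NULL h2c pointer:

    struct h2c;                           /* opaque here */
    struct h2s {
        struct h2c *h2c;                  /* NULL for the static closed streams */
        unsigned long long curr_rx_ofs;   /* illustrative rx offset */
    };

    /* only update the offsets of a live stream; the static closed
     * streams are read-only placeholders and must not be written to */
    static inline void h2s_update_rx_ofs(struct h2s *h2s, unsigned long long rcvd)
    {
        if (!h2s->h2c)
            return;
        h2s->curr_rx_ofs += rcvd;
    }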
Thanks to @Wahnes for reporting this, and Christopher for spotting the
cause. This needs to be backported to 3.1 only.
There is no explicit test on the number of headers when a HEADERS frame is
received. It is implicitly limited by the size of the header list, but that
list is twice the configured limit, to be sure the frame can be decoded.
So a check is now performed after the HTX message has been created. This
way, we are sure not to exceed the configured limit after the decoding
stage. If there are too many headers, a parsing error is reported.
Note that the same check is performed on the trailers.
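For the record, the kind of post-decoding check described here looks
roughly like this (standalone sketch, names illustrative, not the actual
haproxy code):

    /* a decoded header list terminated by an entry with a NULL name */
    struct hdr { const char *n, *v; };

    /* return 0 if the list fits within the configured limit
     * (tune.http.maxhdr), -1 to report a parsing error */
    static int check_hdr_count(const struct hdr *list, int maxhdr)
    {
        int count = 0;

        while (list[count].n)
            count++;
        return (count <= maxhdr) ? 0 : -1;
    }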
This patch should partially address issue #2685. It should be backported
to all stable versions.
When a HEADERS frame is encoded to be sent, the maximum number of headers
allowed in the frame is lower than on the receiving path. This can lead to
a sending error being reported for a message that was accepted when it was
received, which can be confusing.
In addition, the start-line is split into pseudo-headers and consumes
some header slots this way, increasing the difference between HEADERS frame
encoding and decoding. It is even more noticeable because when a HEADERS
frame is decoded, a margin is used to be able to handle split cookie
headers. Concretely, on the decoding path, a limit of twice the maximum
number of headers allowed in a message (tune.http.maxhdr * 2) is used. On
the encoding path, the exact limit is used. This is not consistent.
Note that when a frame is decoded, we must use a larger limit because the
pseudo-headers are reassembled into the start-line and must count for one,
but also because, most of the time, the cookies are split into several
headers and are reassembled too.
To fix the issue, the same ratio is applied on the sending path. A limit
must be defined because a dynamic allocation is not acceptable. Twice the
configured limit should be good enough to support header manipulation.
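In short, both directions now derive their hard limit from the same
expression (illustrative define, not the actual code):

    /* same margin on the encoding and decoding paths: twice the
     * configured per-message limit (tune.http.maxhdr) */
    #define H2_FRAME_MAX_HDR(maxhdr)  ((maxhdr) * 2)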
This patch should be backported to all stable versions.
On the server side, the H2 stream is first created with an unassigned ID
(ID == 0). Its ID is assigned when the request is emitted, before formatting
the HEADERS frame. However, the session may be aborted during that stage. We
must take care not to emit an RST_STREAM frame for this stream, because it
does not exist yet for the server.
It is especially important to do so because, depending on the timing, the
abort may also happen before the H2 PREFACE was sent.
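The guard on the emission path amounts to something like this (sketch,
not the literal patch):

    /* a backend stream whose ID was never assigned (id == 0) does not
     * exist on the wire yet, so there is nothing to reset */
    if (!h2s->id)
        return; /* skip the RST_STREAM emission */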
This patch must be backported to all stable versions. It is related to issue
The h2c_report_glitch() function is now replaced with a macro to support
enumerating counters for each individual glitch line. For now this adds
43 such counters. The macro supports an optional description, though that
is not being used for now. It gives output like this (note that the last
one was purposely instrumented to pass a description):
> debug dev counters glt all
0 GLT mux_h2.c:5976 h2c_dec_hdrs()
0 GLT mux_h2.c:5960 h2c_dec_hdrs()
(...)
0 GLT mux_h2.c:2207 h2c_frt_recv_preface()
0 GLT mux_h2.c:1954 h2c_frt_stream_new(): new stream too early
As a reminder, this requires building with -DDEBUG_GLITCHES=1.
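The counter-per-callsite technique behind the macro looks roughly like
this (hedged sketch; haproxy's actual struct, section name and macro
differ):

    struct glitch_counter {
        const char *file;
        int line;
        const char *func;
        const char *desc;       /* optional description, may be NULL */
        unsigned long count;
    };

    /* one static counter per invocation site, grouped in a dedicated
     * section so that a debug handler can enumerate them all */
    #define REPORT_GLITCH(descr) do {                                  \
            static struct glitch_counter                               \
                __attribute__((used, section("glt_cnt")))              \
                _gc = { __FILE__, __LINE__, __func__, (descr), 0 };    \
            _gc.count++;                                               \
        } while (0)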
When timeout http-keep-alive is very short (e.g. 10ms), it's possible
sometimes for a client to face truncated responses due to an early
close that happens while the system is still pushing the last data,
colliding with the client's WINDOW_UPDATEs that trigger RSTs.
Here we're trying to do better: first we send a GOAWAY on timeout, then
we wait up to clientfin/client timeout for the peer to react so that we
don't immediately close. This is sufficient to avoid truncation as soon
as the timeout is more than a few hundred ms.
It's not certain it should be backported, because it's a bit sensitive
and might possibly fall into certain edge cases.
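The timeout sequence roughly becomes (fragment-style sketch; names
borrowed from mux_h2.c but the exact call shape is illustrative):

    /* on http-keep-alive timeout: announce the shutdown first... */
    h2c_send_goaway_error(h2c, NULL);
    h2_send(h2c);
    /* ...then re-arm the task so the peer gets up to the
     * clientfin/client timeout to react before we really close */
    t->expire = tick_add_ifset(now_ms, h2c->shut_timeout);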
During the 11th and 12th iterations of the development cycle for the H2 auto
rx window, several approaches were attempted to figure out whether another
buffer could be allocated or not. One of them consisted in looping back to
the beginning of the function requesting a new buffer slot and getting one
if the buffer was either apparently or confirmed full. The latest one
consisted in directly allocating the next buffer from the two places
where it's found to be proven full, instead of checking with the now
defunct h2s_may_get_rxbuf() whether we were allowed to get one and loop.
That approach was retained. In this case the "full" variable is no
longer needed, so let's get rid of it, because the construct looks bogus
and confuses coverity (and possibly code readers, as the intent is unclear
compared to the code).
When data are sent via the zero-copy data forwarding, in h2_done_ff(), we
must be sure to remove the H2 stream from the send list if something is
sent. It was only performed if no blocking condition was encountered, but
we must also do it if something is sent. Otherwise the transfer may be
blocked until the timeout.
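A simplified view of the corrected condition (the locals are
illustrative, not the literal patch):

    /* leave the send list when nothing blocked us, but also as soon as
     * some data was committed, otherwise the stream may linger there
     * until the timeout strikes */
    if (total || !blocked)
        LIST_DEL_INIT(&h2s->list);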
This patch must be backported as far as 2.9.
Commit cf3fe1eed ("MINOR: mux-h2/traces: print the size of the DATA
frames") added the size of the DATA frame to the traces. Unfortunately
it uses ullong instead of ulong to cast a pointer, which breaks the
build on 32-bit platforms. Let's just switch it to ulong which works
on both.
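The portable pattern is simply (standalone example):

    #include <stdio.h>

    static void trace_ptr(const void *p)
    {
        /* unsigned long matches the pointer width on the ILP32/LP64
         * targets haproxy builds for, while casting a 32-bit pointer
         * to unsigned long long triggers a build-breaking warning */
        printf("%#lx\n", (unsigned long)p);
    }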
The traces currently don't contain any info about the amount of data
present in the buffers, making it difficult to figure out whether an empty
buffer is the cause for not demuxing or a full buffer is the cause for
not reading more data. Let's add them, with the head/tail info as
well.
H2 traces are unusable to detect bugs most of the time because they miss
the h2c and h2s flags, as well as the proxy, which makes it very hard to
figure out whether the info comes from the client or the server as soon as
two layers are stacked. This commit adds this precious information as well
as the h2s's rx and tx windows.
This could be backported to a few recent branches, but the rx window
calculation will have to be replaced with the static value there.
This reduces the avg wakeup latency of sc_conn_io_cb() from 1900us to 51us.
The L2 cache misses went from 1.4 to 1.2 billion for 20k req. But the
perf is not better. Also there are situations where we must not perform
such wakeup, these may only be done from h2_io_cb, hence the test on the
next_tasklet pointer and its reset when leaving the function. In practice
all callers to h2s_close() or h2s_destroy() can reach that code, this
includes h2_detach, h2_snd_buf, h2_shut etc.
Another test with 40 concurrent connections, transferring 40k 1MB objects
at different concurrency levels from 1 to 80 also showed a 21% drop in L2
cache misses, and a 2% perf improvement:
Before:
329,510,887,528 instructions
50,907,966,181 branches
843,515,912 branch-misses
2,753,360,222 cache-misses
19,306,172,474 L1-icache-load-misses
17,321,132,742 L1-dcache-load-misses
951,787,350 LLC-load-misses
44.660469000 seconds user
62.459354000 seconds sys
=> avg perf: 373 MB/s
After:
331,310,219,157 instructions
51,343,396,257 branches
851,567,572 branch-misses
2,183,369,149 cache-misses
19,129,827,134 L1-icache-load-misses
17,441,877,512 L1-dcache-load-misses
906,923,115 LLC-load-misses
42.795458000 seconds user
62.277983000 seconds sys
=> avg perf: 380 MB/s
With small requests, it's the L1 and L3 cache misses which were reduced by
3% and 7% respectively, and the performance went up by 3%.
When we stop demuxing in the middle of a frame, we know that there are
other data following. The demux buffer is small and unique, but now we
have rxbufs, so when h2_process_demux() is left, the dbuf is almost
empty and what remains can be delivered into another rxbuf.
Let's implement a short loop with a counter and a few conditions around
the demux call. We limit the number of turns to the number of available
rxbufs and no more than 12, since it shows good performance, and the
wakeup is only called once. This has shown a nice 12-20% bandwidth gain
on backend-side H2 transferring 1MB-large objects, and does not affect
the rest (headers, control etc). The number of wakeup calls was divided
by 5 to 8, which is also a nice improvement. The counter is limited to
make sure we don't add processing latency. Tests were run to find the
optimal limit, and it turns out that 16 is just slightly better, but not
worth the +33% increase in peak processing latency.
The h2_process_demux() function just doesn't call the wakeup function
anymore, and solely focuses on transferring from dbuf to rxbuf.
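The resulting loop looks approximately like this (sketch; the
availability helper is a hypothetical name):

    /* re-run the demuxer while the dbuf still holds data and rxbuf
     * slots remain; cap the turns to bound processing latency */
    int turns = 12;

    do {
        h2_process_demux(h2c);
    } while (--turns > 0 && b_data(&h2c->dbuf) && h2c_rxbuf_avail(h2c));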
Practical measurement: test with h2load producing 4 concurrent connections
with 10 concurrent streams each, downloading 1MB objects (20k total) via
two layers of haproxy stacked, reaching httpterm over H1 (numbers are total
for the 2 h2 front and 1 h2 back). All on a single thread.
Before: 549-553 MB/s (on h2load)
function calls cpu_tot cpu_avg
h2_io_cb 2562340 8.157s 3.183us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
h2_io_cb 30109 840.9ms 27.93us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
h2_io_cb 16105 106.4ms 6.607us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
h2_io_cb 1 11.75us 11.75us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
h2_io_cb 2608555 9.104s 3.490us --total--
perf stat:
153,117,996,214 instructions (71.41%)
22,919,659,027 branches # 14.97% of inst (71.41%)
384,009,600 branch-misses # 1.68% of all branches (71.42%)
44,052,220 cache-misses # 1 inst / 3476 (71.44%)
9,819,232,047 L1-icache-load-misses # 6.4% of inst (71.45%)
8,426,410,306 L1-dcache-load-misses # 5.5% of inst (57.15%)
10,951,949 LLC-load-misses # 1 inst / 13982 (57.13%)
12.372600000 seconds user
23.629506000 seconds sys
After: 660 MB/s (+20%)
function calls cpu_tot cpu_avg
h2_io_cb 244502 4.410s 18.04us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
h2_io_cb 42107 1.062s 25.22us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
h2_io_cb 13703 106.3ms 7.758us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
h2_io_cb 1 13.74us 13.74us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
h2_io_cb 300313 5.578s 18.57us --total--
perf stat:
126,840,441,876 instructions (71.40%)
17,576,059,236 branches # 13.86% of inst (71.40%)
274,136,753 branch-misses # 1.56% of all branches (71.42%)
30,413,562 cache-misses # 1 inst / 4170 (71.45%)
6,665,036,203 L1-icache-load-misses # 5.25% of inst (71.46%)
7,519,037,097 L1-dcache-load-misses # 5.9% of inst (57.15%)
6,702,411 LLC-load-misses # 1 inst / 18925 (57.12%)
10.490097000 seconds user
19.212515000 seconds sys
It's also interesting to see that less total time is spent in these
functions, clearly indicating that the cost of interrupted processing,
and the extraneous cache misses come into play at some point. Indeed,
after the change, the number of instructions went down by 17.2%, while
the L2 cache misses dropped by 31% and the L3 cache misses by 39%!
h2_send() used to report non-zero every time any data were sent, and
this was used from h2_snd_buf() or h2_done_ff() to trigger a wakeup,
which may possibly do nothing. Let's restrict this wakeup to either a
successful send() combined with the ability to demux, or an error.
Doing this makes the number of h2_io_cb() wakeups drop from 422k to
245k for 1000 1MB objects delivered over 100 streams between two H2
proxies, without any behavior change nor performance change. In
practice, most send() calls do not result in a wakeup anymore but
synchronous errors still do.
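Conceptually the reported value becomes something like this (hedged
sketch; may_demux() is a hypothetical stand-in for whatever condition
lets the demuxer make progress again):

    /* only report a wakeup-worthy event: data went out and demuxing
     * may now make progress, or an error was met */
    return (sent && may_demux(h2c)) || (h2c->flags & H2_CF_ERR_PENDING);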
A local test downloading 10k 1MB objects from an H1 server with a single
connection shows this change:
before after caller
1547 1467 h2_process_demux()
2138 0 h2_done_ff() <---
38 1453 ssl_sock_io_cb() <---
18 0 h2_snd_buf()
1 1 h2_init()
3742 2921 -- total --
In practice the ssl_sock_io_cb() wakeups are those notifying about
SUB_RETRY_RECV, which are not accounted for when h2_done_ff() performs
the wakeup because the tasklet is already queued (a counter placed
there shows that it's nonetheless called). So there's no transfer and
h2_done_ff() was only hiding the other one.
Another test involving 4 connections with 10 concurrent streams each
and 20000 1MB objects total shows a complete disappearance of the wakeups
from h2_snd_buf and h2_done_ff, which used to account together for
50% of the wakeups, resulting in effectively halving the number of
wakeups which, based on their avg process time, were not doing
anything:
Before:
function calls cpu_tot cpu_avg
h2_io_cb 2571208 7.406s 2.880us <- h2c_restart_reading@src/mux_h2.c:940 tasklet_wakeup
h2_io_cb 2536949 251.4ms 99.00ns <- h2_snd_buf@src/mux_h2.c:7573 tasklet_wakeup ###
h2_io_cb 41100 5.622ms 136.0ns <- h2_done_ff@src/mux_h2.c:7779 tasklet_wakeup ###
h2_io_cb 38979 852.8ms 21.88us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
h2_io_cb 12519 90.28ms 7.211us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
h2_io_cb 1 13.81us 13.81us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
h2_io_cb 5200756 8.606s 1.654us --total--
After:
h2_io_cb 2562340 8.157s 3.183us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
h2_io_cb 30109 840.9ms 27.93us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
h2_io_cb 16105 106.4ms 6.607us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
h2_io_cb 1 11.75us 11.75us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
h2_io_cb 2608555 9.104s 3.490us --total--
From the beginning, h2_restart_reading() has always been confusing because
it decides whether or not to wake the tasklet handler up. This
tasklet handler does two things, one is receiving from the socket to the
demux buf, and one is demuxing from the demux buf to the streams' rxbufs.
The conditions are governed by h2_recv_allowed(), which is also called at
a few places to decide whether or not to actually receive from the socket.
It starts to be visible that this leaves some difficulties regarding what
to do with possibly pending data.
In 2.0 with commit 3ca18bf0b ("BUG/MEDIUM: h2: Don't attempt to recv from
h2_process_demux if we subscribed."), we even had to address a special
case where it was possible to endlessly wake up because the conditions
would rely on the demux buffer's contents, though the solution consisted
in passing a flag to decide whether or not to consider the buffer's
contents.
In 2.5 commit b5f7b5296 ("BUG/MEDIUM: mux-h2: Handle remaining read0 cases
on partial frames") introduced a new flag H2_CF_DEM_SHORT_READ which
indicates that the demux had to stop in the middle of a frame and cannot
make progress without more data. More adaptations later came in based on
this but this actually reflected exactly what was needed to solve this
painful situation: a state indicating whether to receive or parse.
It's about time to definitively address this by reworking h2_restart_reading()
to check two completely independent things:
- the ability to receive more data into the demux buffer, which is
based on its allocation/fill state and the socket's errors
- the ability to demux such data, which is based on the presence of
enough data (i.e. no stuck short read), and ability to find an rx
buf to continue the processing.
Now the conditions are much more understandable, and it's also visible
that the consider_buffer argument, whose value was not trivial for
callers, is not used anymore.
Tests stacking two layers of H2 show strictly no change to the wakeup
cause distributions nor counts.
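The resulting structure is roughly the following (sketch with
hypothetical helper names for the two conditions described above):

    static void h2_restart_reading(const struct h2c *h2c)
    {
        /* two independent questions: can more bytes be pulled into
         * the demux buffer, and can what's already there be demuxed? */
        if (h2_may_recv(h2c) || h2_may_demux(h2c))
            tasklet_wakeup(h2c->wait_event.tasklet);
    }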
Now that we're using all available rx buffers for transfers, there's
no point anymore in advertising more than the minimum value we can
safely buffer. Let's be conservative and only rely on the dynamic
buffers to improve speed beyond the configured value, and make sure
that many streams will no longer cause unfairness.
Interestingly, the total number of wakeups has further shrunk down, but
with a different distribution. From 128k for 1000 1M transfers, it went
down to 119k, with 96k from restart_reading, 10k from done_ff and 2.6k
from snd_buf. done_ff went up by 30% and restart_reading went down by
30%.
These settings allow changing the total buffer size allocated to the
backend and frontend respectively. This way it's no longer necessary to
play with tune.bufsize nor increase the number of streams to benefit from
more buffers.
Setting tune.h2.fe.rxbuf to 4m to match a sender's max tcp_wmem resulted
in 257 Mbps for a single stream at 103ms vs 121 Mbps default (or 5.1 Mbps
with a single buffer and 64kB window).
Without using bandwidth estimates, we can already use up to the number
of allocatable rxbufs and share them evenly between receiving streams.
In practice we reserve one buffer for any non-receiving stream, plus
1 per 8 possible new streams, and divide the rest between the number
of receiving streams.
Finally, for front streams, this is rounded up to the buffer size while
for back streams we round it down. The rationale here is that front to
back is very fast to flush and slow to refill so we want to optimise
upload bandwidth regardless of the number of streams, while it's the
opposite in the other way so we try to minimize HoL.
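The sharing rule can be summarized by this standalone sketch (all names
and the exact accounting are illustrative; only the reserve-then-share
logic and the rounding direction matter):

    /* share <total> bytes of rx budget between <nb_recv> receiving
     * streams, after reserving one buffer per non-receiving stream and
     * one per 8 possible new streams; round the per-stream share up to
     * a full buffer on the frontend side (favor upload bandwidth),
     * down on the backend side (minimize HoL) */
    static size_t rx_budget_per_stream(size_t total, size_t bufsize,
                                       int nb_streams, int nb_recv,
                                       int max_streams, int is_front)
    {
        size_t reserved = ((size_t)(nb_streams - nb_recv) +
                           (max_streams - nb_streams + 7) / 8) * bufsize;
        size_t share;

        if (nb_recv <= 0 || total <= reserved + (size_t)nb_recv * bufsize)
            return bufsize;            /* at least one buffer each */
        share = (total - reserved) / nb_recv;
        return is_front ? (share + bufsize - 1) / bufsize * bufsize /* up  */
                        : share / bufsize * bufsize;                /* down */
    }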
That shows good results with a single stream being able to send at 121
Mbps at 103ms using 1.4 MB buffer with default settings, or 8 streams
sharing the bandwidth at 180kB each. Previously the limit was approx
5.1 Mbps per stream.
It also enables better sharing of backend connections: a slow (100 Mbps)
and a fast (1 Gbps) clients were both downloading 2 100MB files each over
a shared H2 connection. The fast one used to show 6.86 to 20.74s with an
avg of 11.45s and a stddev of 5.81s before the patch, and went to a
much more respectable 6.82 to 7.73s with 7.08s avg and 0.336s stddev.
We don't try to increase the window past the remaining content length.
First, this is pointless (though harmless), but in addition it causes
needless emission of WINDOW_UPDATE frames on small uploads that are
smaller than a window, and beyond being useless, it upsets vtest which
expects an RST on some tests. The scheduling is not reliable enough to
insert an expect for a window update first, so in the end, with that
extra check, we save a few useless frames on small uploads and please
vtest.
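That extra check amounts to a simple clamp (sketch):

    /* never advertise more window than the remaining announced body:
     * pointless, and it emits needless WINDOW_UPDATE frames */
    if (h2s->body_len && new_win > h2s->body_len)
        new_win = h2s->body_len;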
A new setting should be added to allow increasing the number of buffers
without having to change the number of streams. At this point it's not
done.
Now we don't enforce allocation limits in h2s_get_rxbuf(), since there
is no benefit in not processing pending data, it would still cause HoL
for no saving. The only reason for not allocating is if there are no
buffers available for the connection.
In theory this should not change anything except that it exercises code
paths that support reallocating multiple buffers, which could possibly
uncover a sleeping bug. This is why it's placed in a separate commit.
And one observation worth noting is that it almost cut in half the number
of iocb wakeups: for 1000 1MB transfers over 100 concurrent streams of a
single connection, we used to observe 208k wakeups (110k from
restart_reading, 80k from snd_buf, 11k from done_ff), and now we're
observing 128k (113k from restart_reading, 2.4k from snd_buf, 6.9k from
done_ff), which seems to indicate that pretty often the demuxing was
blocked on a full buffer due to the default advertised window of 64k.
For now it seems to work as before, and even when artificially inflating
the number of allocatable buffers per stream. The number of allocated
slots is always the same as the max number of streams, which guarantees
that each stream will find one buffer. We only grant one buffer per
stream at this point, since the goal was to replace the existing single
rxbuf.
A new demux blocking flag, H2_CF_DEM_RXBUF, was added to indicate
a failure to get an rxbuf slot from the connection. It was lightly
tested (by forcing bl_init() to a lower number of buffers). It is not
yet certain whether it's more useful to have a new flag or to reuse
the existing H2_CF_DEM_SFULL which indicates the rxbuf is full,
but at least the new flag more accurately translates the condition,
that may make a difference in the future. However, given that when
RXBUF is set, most of the time it results in a failure to find more
room to demux and it sets SFULL, for now we have to always clear
SFULL when clearing RXBUF as well. This means that most of the time
we'll see 3 combinations:
- none: everything's OK
- SFULL: the unique rx buffer is full
- RXBUF || (RXBUF|SFULL): cannot allocate more entries
Note that we need to be super careful in h2_frt_transfer_data() because
the htx_free_data_space() function doesn't guarantee that the room is
usable, so htx_add_data() may still fail despite an apparent room. For
this reason, h2_frt_transfer_data() maintains a "full" flag to indicate
that a transfer attempt failed and that a new buffer is required.
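Concretely the transfer code must trust the actual outcome rather than
the apparent room (sketch built on the real HTX calls):

    /* htx_free_data_space() is only an upper bound, so htx_add_data()
     * may still add less than requested despite apparent room */
    size_t sent = htx_add_data(htx, ist2(b_head(&h2c->dbuf), flen));

    if (sent < flen)
        full = 1; /* a new rx buffer is required */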
Since commit 485da0b05 ("BUG/MEDIUM: mux_h2: Handle others remaining
read0 cases on partial frames"), H2_CF_DEM_SHORT_READ is set when there
are no blocking flags. However, it checks H2_CF_DEM_BLOCK_ANY which does
not include H2_CF_DEM_DFULL. This results in many cases where both
H2_CF_DEM_DFULL and H2_CF_DEM_SHORT_READ are set together, which makes
no sense, since one says the demux buffer is full while the other one
says an incomplete read was done. This doesn't permit properly deciding
whether to restart reading or processing.
Let's make sure to clear DFULL in h2_process_demux() whenever we
consume incoming data from the dbuf, and check for DFULL before
setting SHORT_READ.
This could probably be considered as a bug fix but it's hard to say if
it has any impact on the current code, probably at worst it might cause
a few useless wakeups, so until there's any proof that it needs to be
backported, better not do it.
The code used to decide when to restart reading is far from being trivial
and will cause trouble after the forthcoming changes: it checks if the
current stream is the same that is being demuxed, and only if so, wakes
the demux to restart reading. Once streams start to use multiple
buffers, this condition will make no sense anymore. Actually the real
reason is split into two steps:
- detect if the demux is currently blocked on the current stream, and
if so remove SFULL
- detect if any demux blocking flags were removed during the operations,
and if so, wake demuxing.
For now this doesn't change anything.
The code used to decide what to tell to the upper layer and when to free
the rxbuf is a bit convoluted and difficult to adapt to dynamic rxbufs.
We first need to deal with memory management (b_free) and only then to
decide what to report upwards. Right now it does it the other way around.
This should not change anything.
Now the h2s get their rx_head, rx_tail and rx_count associated with the
shared rxbufs. A few functions are provided to manipulate all this,
essentially allocating/releasing a buffer for the stream, returning a
buffer pointer to the head/tail, counting allocated buffers for the stream
and reporting whether a stream may still allocate.
For now this code is not used.
In preparation for having a shared list of rx bufs, we're now allocating
the array of shared rx bufs in the h2c. The pool is created at the max
size between the front and back max streams for now, and the array is not
used yet.
A stream is receiving data from the moment a HEADERS frame lacking
END_STREAM is received, until the end of the stream or HREM (the presence
of END_STREAM). We're now
adding a flag to the stream that indicates this state, as well as a counter
in the connection of streams currently receiving data. The purpose will be
to gauge at any instant the number of streams that might have to share the
available bandwidth and buffers count in order not to allocate too much flow
control to any single stream. For now the counter is kept up to date, and is
reported in "show fd".
Instead of incrementing the last_max_ofs by the amount of received bytes,
we now start from the new current offset to which we add the static window
size. The result is exactly the same but it prepares the code to use a
window size combined with an offset instead of just refilling the budget
from what was received.
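Both formulations produce the same value for now (illustrative comment):

    /* before: last_max_ofs += rcvd;
     * after:  last_max_ofs  = curr_rx_ofs + iws;   (iws = static window)
     * identical today, but the second form lets the window size vary
     * independently of what was just received */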
It was even verified that changing h2_fe_settings_initial_window_size in
the middle of a transfer using gdb does indeed allow the transfer speed
to adapt accordingly.
The rationale here is that we don't absolutely need to update the
stream offset live, there's already the rcvd_s counter to remind
us we've received data. So we can continue to exploit the current
check points for this.
Now we know that rcvd_s indicates the amount of newly received bytes
for the stream since last call to h2c_send_strm_wu() so we can update
our stream offsets within that function. The wu_s counter is set to
the difference between next_max_ofs and last_adv_ofs, which are
resynchronized once the frame is sent.
If the stream suddenly disappears with unacked data (aborted upload),
the presence of the last update in h2c->wu_s is sufficient to let the
connection ack the data alone, and upon subsequent calls with new
rcvd_s, the received counter will be used to ack, like before. We
don't need to do more anyway since the goal is to let the client
abort ASAP when it gets an RST.
At this point, the stream knows its current rx offset, the computed
max offset and the last advertised one.
In H2, everything is accounted as budget. But if we want to moderate
the rcv window that's not very convenient, and we'd rather have offsets
instead so that we know where we are in the stream. Let's first add
the fields to the struct and initialize them. The curr_rx_ofs indicates
the position in the stream where next incoming bytes will be stored.
last_adv_ofs tells which offset was last advertised as the
window limit, and next_max_ofs is the one that will need to be
advertised, which is curr_rx_ofs plus the current window. next_max_ofs
will have to cause a WINDOW_UPDATE to be emitted when it's higher than
last_adv_ofs, and once the WU is sent, its value will have to be copied
over last_adv_ofs.
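The invariant between the three fields can be sketched as follows
(fragment; "win" stands for the stream's current rx window):

    /* what we may advertise is the current position plus the window */
    h2s->next_max_ofs = h2s->curr_rx_ofs + win;

    if (h2s->next_max_ofs > h2s->last_adv_ofs) {
        /* a WINDOW_UPDATE must be emitted; once it is sent: */
        h2s->last_adv_ofs = h2s->next_max_ofs;
    }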
The problem is, for now wherever we emit a stream WU, we have no notion
of stream (the stream might even not exist anymore, e.g. after aborting
an upload), because we currently keep a counter of stream window to be
acked for the current stream ID (h2c->dsi) in the connection (rcvd_s).
Similarly there are a few places early in the frame header processing
where rcvd_s is incremented without knowing the stream yet. Thus, lookups
will be needed for that, unless such a connection-level counter remains
used and poured into the stream's count once known (delicate).
Thus for now this commit only creates the fields and initializes them.
We'll need to keep track of the total amount of data received for the
current stream, and the amount of data to ack for the current stream,
which might soon diverge once we have to update the stream's
offset with received data, which are different from those to be ACKed.
One reason is that in case a stream doesn't exist anymore (e.g. an
aborted upload), the rcvd_s info might get lost after updating the stream,
so we do need to have an in-connection counter for that.
What's done here is that the rcvd_s count is transferred to wu_s in
h2c_send_strm_wu(), to be used as the counter to send, and both are
considered as sufficient when non-null to call the function.
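So the function conceptually starts with (sketch):

    /* move the per-connection received count into the to-be-acked
     * counter; either one being non-zero justifies the call */
    h2c->wu_s  += h2c->rcvd_s;
    h2c->rcvd_s = 0;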
Since 2.7 with commit 8522348482 ("BUG/MAJOR: conn-idle: fix hash indexing
issues on idle conns"), we've been using eb64 trees instead of ebmb trees,
and later we dropped all that to centralize the operations in
the server. Let's remove the ebmbtree.h includes from the muxes that do
not use them.
The local "rxbuf" buffer was passed to the trace instead of h2s->rxbuf
that is used when decoding trailers. The impact is essentially the
impossibility to present some buffer contents in some rare cases. It
may be backported but it's unlikely that anyone will ever notice the
difference.
Since 1d2d77b27 ("MEDIUM: mux-h1: Return a 501-not-implemented for upgrade
requests with a body"), it is no longer possible to perform a protocol
upgrade for requests with a payload. The main reason was to be able to
support protocol upgrade for H1 client requesting a H2 server. In that case,
the upgrade request is converted to a CONNECT request. So, it is not
possible to convey a payload in that case.
But, it is a problem for anyone wanting to perform upgrades on H1 server
using requests with a payload. It is uncommon but valid. So, now, it is the
H2 multiplexer responsibility to reject upgrade requests, on server side, if
there is a payload. An INTERNAL_ERROR is returned for the H2S in that
case. On H1 side, the upgrade is now allowed, but only if the server waits
for the end of the request to return the 101-Switching-protocol
response. Indeed, it is quite hard to synchronise the frontend side and the
backend side in that case. Asking servers to fully consume the request
payload before returning the response seems reasonable.
This patch should fix the issue #2684. It could be backported after a period
of observation, as far as 2.4 if possible. But only if it is not too
hard. It depends on "MINOR: mux-h1: Set EOI on SE during demux when both
side are in DONE state".
There exists an extremely tricky code path that was revealed in 3.0 by
the glitches feature, though it might theoretically have existed before.
TL;DR: a mux mbuf may be full after successfully sending a GOAWAY, and
have its remaining contents discarded without clearing H2_CF_MUX_MFULL and
H2_CF_DEM_MROOM, which then makes h2_send() loop endlessly, until the
watchdog takes care of it.
What can happen is the following: Some data are received, h2_io_cb() is
called. h2_recv() is called to receive the incoming data. Then
h2_process() is called and in turn calls h2_process_demux() to process
input data. At some point, a glitch limit is reached and h2c_error() is
called to close the connection. The input frame was incomplete, so some
data are left in the demux buffer. Then h2_send() is called, which in
turn calls h2_process_mux(), which manages to queue the GOAWAY frame,
turning the state to H2_CS_ERROR2. The frame is sent, and h2_process()
calls h2_send() a last time (doing nothing) and leaves. The streams
are all woken up to notify about the error.
Multiple backend streams were waiting to be scheduled and are woken up
in turn, before their parents are notified, and communicate with the
h2 mux in zero-copy-forward mode, request a buffer via h2_nego_ff(),
fill it, and commit it with h2_done_ff(). At some point the mux's output
buffer is full, and gets flags H2_CF_MUX_MFULL.
The io_cb is called again to process more incoming data. h2_send() isn't
called (polled) or does nothing (e.g. TCP socket buffers full). h2_recv()
may or may not do anything (doesn't matter). h2_process() is called since
some data remain in the demux buf. It goes till the end, where it finds
st0 == H2_CS_ERROR2 and clears the mbuf. We're now in a situation where
the mbuf is empty and MFULL is still present.
Then it calls h2_send(), which doesn't call h2_process_mux() due to
MFULL, doesn't enter the for() loop since all buffers are empty, then
keeps sent=0, which doesn't allow the MFULL flag to be cleared, and since
"done" was not reset, it loops forever there.
Note that the glitches make the issue more reproducible but theoretically
it could happen with any other GOAWAY (e.g. PROTOCOL_ERROR). What makes
it not happen with the data produced on the parsing side is that we
process a single buffer of input at once, and there's no way to amplify
this to 30 buffers of responses (RST_STREAM, GOAWAY, SETTINGS ACK,
WINDOW_UPDATE, PING ACK etc are all quite small), and since the mbuf is
cleared upon every exit from h2_process() once the error was sent, it is
not possible to accumulate response data across multiple calls. And the
regular h2_snd_buf() path checks for st0 >= H2_CS_ERROR so it will not
produce any data there either.
h2_nego_ff() should probably check for H2_CS_ERROR before accepting
to deliver a buffer, but this needs to be carefully studied. In the mean
time the real problem is that the MFULL flag was kept when clearing the
buffer, making the two inconsistent.
Since it doesn't seem possible to trigger this sequence without the
zero-copy-forward mechanism, this fix needs to be backported as far as
2.9, along with previous commit "MINOR: mux-h2: try to clear DEM_MROOM
and MUX_MFULL at more places" which will strengthen the consistency
between these checks.
Many thanks to Annika Wickert for her detailed report that allowed us to
diagnose this problem. CVE-2024-45506 was assigned to this problem.
The code leading to H2_CF_MUX_MFULL and H2_CF_DEM_MROOM being cleared
is quite complex and assumptions about its state are extremely difficult
to make when reading the code. There are indeed long sequences where the mux
buffer might possibly be empty, still having the flag set until it reaches
h2_send(), which will clear it after the last send. Even then it's not
obvious whether it's always guaranteed to release the flag when invoked in
multiple passes.
Let's just simplify the condition so that h2_send() does not depend on
"sent" anymore and that h2_timeout_task() doesn't leave the flags set on
an empty buffer. While it doesn't seem to fix anything, it will
make the code more robust against future changes.
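The simplification boils down to releasing the room flags on emptiness
alone (sketch; mbuf_is_empty() is a hypothetical helper checking the
whole mbuf ring):

    /* no data left anywhere in the mux buffer ring: the full/room
     * flags cannot legitimately remain set */
    if (mbuf_is_empty(h2c))
        h2c->flags &= ~(H2_CF_MUX_MFULL | H2_CF_DEM_MROOM);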
When DATA frames are sent via the 0-copy data forwarding, we must take care
to set the ES flag on the last DATA frame. It should be performed in
h2_done_ff() when the IOBUF_FL_EOI flag was set by the producer. This flag is
here to know when the producer has reached the end of input. When this
happens, the h2s state is also updated. It is switched to "half-closed
local" or "closed" state depending on its previous state.
It is mainly an issue on uploads because the server may be blocked waiting
for the end of the request. A workaround is to disable the 0-copy forwarding
support for the H2 mux by setting the "tune.h2.zero-copy-fwd-send" directive to off
in your global section.
This patch should fix the issue #2665. It must be backported as far as 2.9.
Logging below the developer level doesn't always yield very convenient
traces as we don't know well where streams are allocated or released.
Let's just make that more explicit by using state-level traces for these
important steps.
This helper is able to find a connection, a session, a stream, a
frontend or a backend from its args.
Note that this required always making sure that h2s->sess is reset on
allocation because it's normally initialized later for backend streams,
and producing traces between the two could pre-fill a bad pointer in
the trace_ctx.
When a stream is explicitly woken up by the H2 connection, if an error
condition is detected, the corresponding error flag is set on the SE, i.e.
SE_FL_ERROR or SE_FL_ERR_PENDING, depending on whether the end of stream was
reported or not.
However, there is no attempt to propagate other termination flags. We must
be sure to properly set SE_FL_EOI and SE_FL_EOS when appropriate to be able
to switch a pending error to a fatal error.
Because of this bug, the SE remains with a pending error and no end of
stream, preventing the applicative stream from truly aborting it. It means
that in some abort scenarios, it is possible to block a stream infinitely.
This patch must be backported at least as far as 2.8. No bug was observed on
older versions while the same code is in use.
A risk of truncated packet was addressed in 2.9 by commit 19fb19976f
("BUG/MEDIUM: mux-h2: Only Report H2C error on read error if demux
buffer is empty") by ignoring CO_FL_ERROR after a recv() call as long
as some data remained present in the buffer. However it has a side
effect due to the fact that some frame processors only deal with full
frames, for example, HEADERS. The side effect is that an incomplete
frame will not be processed and will remain in the buffer, preventing
the error from being taken into account, so the I/O handler wakes up
the H2 parser to handle the error, and that one just subscribes for
more data, and this loops forever wasting CPU cycles.
Note that this only happens with errors at the SSL layer exclusively,
otherwise we'd have a read0 pending that would properly be detected:
conn->flags = CO_FL_XPRT_TRACKED | CO_FL_ERROR | CO_FL_XPRT_READY | CO_FL_CTRL_READY
conn->err_code = CO_ERR_SSL_FATAL
h2c->flags = H2_CF_ERR_PENDING | H2_CF_WINDOW_OPENED | H2_CF_MBUF_HAS_DATA | H2_CF_DEM_IN_PROGRESS | H2_CF_DEM_SHORT_READ
The condition to report the error in h2_recv() needs to be refined, so
that connection errors are taken into account either when the buffer is
empty, or when there's an incomplete frame, since we're certain it will
never be completed. We're certain to enter that function because
H2_CF_DEM_SHORT_READ implies too short a frame, and earlier there's a
protocol check to validate that no frame size is larger than bufsize,
hence a H2_CF_DEM_SHORT_READ implies there's some room left in the
buffer and we're allowed to try to receive.
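The refined condition looks approximately like this (sketch):

    /* report the connection error either when the demux buffer is
     * empty, or when it's stuck on an incomplete frame that we know
     * will never be completed */
    if ((conn->flags & CO_FL_ERROR) &&
        (!b_data(&h2c->dbuf) || (h2c->flags & H2_CF_DEM_SHORT_READ)))
        goto error;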
The condition to reproduce the bug seems super hard to meet but was
observed once by Patrick Hemmer, who had the reflex to capture lots of
information that allowed the problem to be explained. In order to reproduce
it, the SSL code had to be significantly modified to alter received
contents at very empiric places, but that was sufficient to reproduce
it and confirm that the current patch works as expected.
The bug was tagged MAJOR because when it triggers there's no other
solution to get rid of it but to restart the process. However given how
hard it is to trigger in a lab, it does not seem very likely to occur
in the field.
This needs to be backported to 2.9.
It is a small change, but it is cleaner to not include the stconn-t.h header
in connection-t.h, mainly to avoid circular definitions.
The related issue is #2502.
The ->takeover() is quite tricky. It didn't take care of the possibility
that the original thread's connection handler had been woken up to handle
an event (e.g. read0), failed to get a buffer, registered against its own
thread's buffer_wait queue and left the connection in an idle state.
A new thread could then come by, perform a takeover(), and when a buffer
was available, the new thread's tasklet would be woken up by the old one
via *_buf_available(), causing all sorts of problems. These problems are
easy to reproduce, by running with shared backend connections and few
buffers (tune.buffers.limit=20, 8 threads, 500 connections, transfer
64kB objects and wait 2-5s for a crash to appear).
A first envisioned solution consisted in removing the connection from the
idle list but it turns out that it would be worse for the delete stuff
(the connection no longer appearing as idle, making it impossible to find
it in order to close it). Also, idle counts wouldn't match anymore the
list's state, and the special case of private connections could be
difficult to handle as the connection could be forcefully re-added to the
idle list after allocation despite being private.
After multiple attempts to address the problem in various ways, it appears
that the only reliable solution for now (without starting to turn many
lists to mt_lists) is to have the takeover() function handle the buf_wait
detection or unregistration itself:
- when doing a regular takeover aiming at finding an idle connection
for a new request, connections that are blocked in a buffer_wait
queue are quite rare and not interesting at all (since not immediately
usable), so skipping them is sufficient. For this we detect that the
desired connection belongs to a buffer_wait list by checking its
buf_wait.list element. Note that this check is *not* thread-safe! The
LIST_DEL_INIT() is performed by __offer_buffers() after the callback
was called. But this is sufficient as it is now because the only way
for the element to be seen as not in a list is after the element was
last touched by __offer_buffers(), so the situation for this connection
will not change in a different way later.
- when doing a server delete, we're running under thread isolation.
The connection might get taken over to be killed. The only trick is
that private connections not belonging to any idle list may also
experience this, and in this case even the idle_conns lock will not
offer any protection against anything. But since we're run under
thread isolation, we're certain not to compete with the other thread,
so it's safe to directly unregister the connection from its owner
thread. Normally this is already handled by conn_release() in
cli_parse_delete_server(), which calls mux->destroy(), but this would
actually update the current thread's queue instead of the origin
thread's, thus we do need to perform an explicit dequeue before
completing the takeover.
With this, the problem now looks solved for HTTP/1, HTTP/2 and FCGI,
though extensive tests were essentially run on HTTP/1 and HTTP/2.
While the problem has been there for a very long time, there should be
no reason to backport it since buffer_wait didn't practically work
before 3.0-dev and the process used to freeze hard very quickly before
we'd even have a chance to meet that race.
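For the H2 mux, the first case reduces to a quick membership test at the
top of takeover() (sketch; see the thread-safety caveat above):

    /* the h2c is still queued in its owner thread's buffer_wait list:
     * not immediately usable, skip this connection (rare anyway) */
    if (LIST_INLIST(&h2c->buf_wait.list))
        return -1;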
The code places that used to manipulate the buffer_wq manually now just
call b_queue() or b_requeue(). This will simplify the multiple
list management later.