1005 Commits

Author SHA1 Message Date
Christopher Faulet
5200203677 MINOR: proxy: Add options to drop HTTP trailers during message forwarding
In RFC9110, it is stated that trailers could be merged with the
headers. While it should be performed with a speicial care, it may be a
problem for some applications. To avoid any trouble with such applications,
two new options were added to drop trailers during the message forwarding.

On the backend, "http-drop-request-trailers" option can be enabled to drop
trailers from the requests before sending them to the server. And on the
frontend, "http-drop-response-trailers" option can be enabled to drop
trailers from the responses before sending them to the client. The options
can be defined in defaults sections and disabled with "no" keyword.

This patch should fix the issue #2930.
2025-04-22 16:14:46 +02:00
Amaury Denoyelle
bd1d02e2b3 MINOR: mux-h2: handle idle-ping on conn reverse
This commit extends MUX H2 connection reversal step to properly take
into account the new idle-ping feature. It first ensures that h2c task
is properly instantiated/freed depending now on both timers and
idle-ping configuration. Also, h2c_update_timeout() is now called
instead of manually requeuing the task, which ensures the proper timer
value is selected depending on the new connection side.
2025-04-17 14:49:36 +02:00
Amaury Denoyelle
cc5a7a760f MINOR: mux-h2: do not emit GOAWAY on idle ping expiration
If idle-ping is activated and h2c task is expired due to missing PING
ACK, consider that the peer is away and the connection can be closed
immediately. GOAWAY emission is thus skipped.

A new test is necessary in h2c_update_timeout() when PING ACK is
currently expected, but the next timer expiration selected is not
idle-ping. This may happen if http-keep-alive/http-request timers are
selected first. In this case, H2_CF_IDL_PING_SENT flag is resetted. This
is necessary to not prevent GOAWAY emission on expiration.
2025-04-17 14:49:36 +02:00
Amaury Denoyelle
52246249ab MEDIUM: listener/mux-h2: implement idle-ping on frontend side
This commit is the counterpart of the previous one, adapted on the
frontend side. "idle-ping" is added as keyword to bind lines, to be able
to refresh client timeout of idle frontend connections.

H2 MUX behavior remains similar as the previous patch. The only
significant change is in h2c_update_timeout(), as idle-ping is now taken
into account also for frontend connection. The calculated value is
compared with http-request/http-keep-alive timeout value. The shorter
delay is then used as expired date. As hr/ka timeout are based on
idle_start, this allows to run them in parallel with an idle-ping timer.
2025-04-17 14:49:36 +02:00
Amaury Denoyelle
a78a04cfae MEDIUM: server/mux-h2: implement idle-ping on backend side
This commit implements support for idle-ping on the backend side. First,
a new server keyword "idle-ping" is defined in configuration parsing. It
is used to set the corresponding new server member.

The second part of this commit implements idle-ping support on H2 MUX. A
new inlined function conn_idle_ping() is defined to access connection
idle-ping value. Two new connection flags are defined H2_CF_IDL_PING and
H2_CF_IDL_PING_SENT. The first one is set for idle connections via
h2c_update_timeout().

On h2_timeout_task() handler, if first flag is set, instead of releasing
the connection as before, the second flag is set and tasklet is
scheduled. As both flags are now set, h2_process_mux() will proceed to
PING emission. The timer has also been rearmed to the idle-ping value.
If a PING ACK is received before next timeout, connection timer is
refreshed. Else, the connection is released, as with timer expiration.

Also of importance, special care is needed when a backend connection is
going to idle. In this case, idle-ping timer must be rearmed. Thus a new
invokation of h2c_update_timeout() is performed on h2_detach().
2025-04-17 14:49:36 +02:00
Amaury Denoyelle
4dcfe098a6 MINOR: mux-h2: prepare to support PING emission
Adapt the already existing function h2c_ack_ping(). The objective is to
be able to emit a PING request. First, it is renamed as h2c_send_ping().
A new boolean argument <ack> is used to emit either a PING request or
ack.
2025-04-17 14:49:36 +02:00
Amaury Denoyelle
99b2e52f89 MINOR: mux-h2: refactor idle timeout calculation
Reorganize code for timeout calculation in case the connection is idle.
The objective is to better reflect the relations between each timeouts
as follow :

* if GOAWAY already emitted, use shut-timeout, or if unset fallback to
  client/server one. However, an already set timeout is never erased.

* else, for frontend connection, http-request or keep-alive timeout is
  applied depending on the current demux state. If the selected value is
  unset, fallback to client timeout

* for backend connection, no timeout is set to perform http-reuse

This commit is pure refactoring, so no functional change should occur.
2025-04-17 14:49:36 +02:00
Amaury Denoyelle
243bc95de0 BUG/MINOR: mux-h2: do not apply timer on idle backend connection
Since the following commit, MUX H2 timeout function has been slightly
exetended.

  d38d8c6ccb189e7bc813b3693fec3093c9be55f1
  BUG/MEDIUM: mux-h2: make sure control frames do not refresh the idle timeout

A side-effect of this patch is that now backend idle connection expire
timer is not reset if already defined. This means that if a timer was
registered prior to the connection transition to idle, the connection
would be destroyed on its timeout. If this happens for enough
connection, this may have an impact on the reuse rate.

In practice, this case should be rare, as h2c timer is set to
TICK_ETERNITY while there is active streams. The timer is not refreshed
most of the time before going the transition to idle, so the connection
won't be deleted on expiration.

The only case where it could occur is if there is still pending data
blocked on emission on stream detach. Here, timeout server is applied on
the connection. When the emission completes, the connection goes to
idle, but the timer will still armed, and thus will be triggered on the
idle connection.

To prevent this, explicitely reset h2c timer to TICK_ETERNITY for idle
backend connection via h2c_update_timeout().

This patch is explicitely not scheduled for backport for now, as it is
difficult to estimate the real impact of the previous code state.
2025-04-17 14:49:36 +02:00
Amaury Denoyelle
9e6f8ce328 BUG/MINOR: rhttp: ensure GOAWAY can be emitted after reversal
GOAWAY emission should not be emitted before preface. Thus, max_id field
from h2c acting as a server is initialized to -1, which prevents its
emission until preface is received from the peer. If acting as a client,
max_id is initialized to a valid value on the first h2s emission.

This causes an issue with reverse HTTP on the active side. First, it
starts as a client, so the peer does not emit a preface but instead a
simple SETTINGS frame. As role are switched, max_id is initialized much
later when the first h2s response is emitted. Thus, if the connection
must be terminated before any stream transfer, GOAWAY cannot be emitted.

To fix this, ensure max_id is initialized to 0 on h2_conn_reverse() for
active connect side. Thus, a GOAWAY indicating that no stream has been
handled can be generated.

Note that passive connect side is not impacted, as it max_id is
initialized thanks to preface reception.

This should be backported up to 2.9.
2025-04-17 14:49:36 +02:00
Amaury Denoyelle
3ebdd3ae50 BUG/MINOR: mux-h2: prevent past scheduling with idle connections
While reviewing HTTP/2 MUX timeout, it seems there is a possibility that
MUX task is requeued via h2c_update_timeout() with an already expired
date. This can happens with idle connections on two cases :
* first with shut timeout, as timer is not refreshed if already set
* second with http-request and keep-alive timers, which are based on
  idle_start

Queuing an already expired task is an undefined behavior. Fix this by
using task_wakeup() instead of task_queue() at the end of
h2c_update_timeout() if such case occurs.

This should be backported up to 2.6.
2025-04-17 14:49:36 +02:00
Ilia Shipitsin
27a6353ceb CLEANUP: assorted typo fixes in the code, commits and doc 2025-04-03 11:37:25 +02:00
Ilia Shipitsin
78b849b839 CLEANUP: assorted typo fixes in the code and comments
code, comments and doc actually.
2025-04-02 11:12:20 +02:00
Christopher Faulet
e87397bc7d BUG/MINOR: mux-h2: Reset streams with NO_ERROR code if full response was already sent
On frontend side, when a stream is shut while the response was already fully
sent, it was cancelled by sending a RST_STREAM(CANCEL) frame. However, it is
not accurrate. CANCEL error code must only be used if the response headers
were sent, but not the full response. As stated in the RFC 9113, when the
response was fully sent, to stop the request sending, a RST_STREAM with an
error code of NO_ERROR must be sent.

This patch should solve the issue #1219. It must be backported to all stable
versions.
2025-03-20 08:36:06 +01:00
Willy Tarreau
bbf824933f BUG/MINOR: h2: always trim leading and trailing LWS in header values
Annika Wickert reported some occasional disconnections between haproxy
and varnish when communicating over HTTP/2, with varnish complaining
about protocol errors while captures looked apparently normal. Nils
Goroll managed to reproduce this on varnish by injecting the capture of
the outgoing haproxy traffic and noticed that haproxy was forwarding a
header value containing a trailing space, which is now explicitly
forbidden since RFC9113.

It turns out that the only way for such a header to pass through haproxy
is to arrive in h2 and not be edited, in which case it will arrive in
HTX with its undesired spaces. Since the code dealing with HTX headers
always trims spaces around them, these are not observable in dumps, but
only when started in debug mode (-d). Conversions to/from h1 also drop
the spaces.

With this patch we trim LWS both on input and on output. This way we
always present clean headers in the whole stack, and even if some are
manually crafted by the configuration or Lua, they will be trimmed on
the output.

This must be backported to all stable versions.

Thanks to Annika for the helpful capture and Nils for the help with
the analysis on the varnish side!
2025-02-24 09:39:57 +01:00
Christopher Faulet
b70921f2c1 BUG/MINOR: mux-h2: Properly handle full or truncated HTX messages on shut
On shut, truncated HTX messages were not properly handled by the H2
multiplexer. Depending on how data were emitted, a chunked HTX message
without the 0-CRLF could be considered as full and an empty data with ES
flag set could be emitted instead of a RST_STREAM(CANCEL) frame.

In the H2 multiplexer, when a shut is performed, an HTX message is
considered as truncated if more HTX data are still expected. It is based on
the presence or not of the H2_SF_MORE_HTX_DATA flag on the H2 stream.
However, this flag is set or unset depending on the HTX extra field
value. This field is used to state how much data that must still be
transferred, based on the announced data length. For a message with a
content-length, this assumption is valid. But for a chunked message, it is
not true. Only the length of the current chunk is announced. So we cannot
rely on this field in that case to know if a message is full or not.

Instead, we must rely on the HTX start-line flags to know if more HTX data
are expected or not. If the xfer length is known (the HTX_SL_F_XFER_LEN flag
is set on the HTX start-line), it means that more data are always expected,
until the end of message is reached (the HTX_FL_EOM flag is set on the HTX
message). This is true for bodyless message because the end of message is
reported with the end of headers. This is also true for tunneled messages
because the end of message is received before switching the H2 stream in
tunnel mode.

This patch must be backported as far as 2.8.
2025-02-18 17:34:59 +01:00
Christopher Faulet
54a09dfe0f BUG/MINOR: tevt/mux-h2: Set truncated receive/eos events at SE level on error
When receive or EOS termination events are reported at the SE level, a
truncation was erroneously reported when no error was detected. Of course, it
must be the opposite.

No backport needed.
2025-02-06 11:19:32 +01:00
Christopher Faulet
a58e650ad1 MEDIUM: tevt/muxes: Add dedicated termination events for muxc/se locations
Termination events dedicated to mux connection and stream-endpoint
descriptors are added in this patch. Specific events to these locations are
thus added. Changes for the H1 and H2 multiplexers are reviewed to be more
accurate.
2025-01-31 10:41:50 +01:00
Christopher Faulet
a4c281a190 MINOR: tevt/muxes: Add CTL and SCTL command to get the termination event logs
MUX_CTL_TEVTS command is added to get the termination event logs of a mux
connection and MUX_SCTL_TEVTS command to get the termination event logs of a
mux stream.
2025-01-31 10:41:50 +01:00
Christopher Faulet
95029305d3 MINOR: tevt/mux-h1/mux-h2: Add termination events log when dumping mux info
The termiantion events logs of the multiplexer connection and stream are now
dumped when corresponding mux info are dumped. The termination event logs of
the underlying connection is also dumped in the debug string.
2025-01-31 10:41:50 +01:00
Christopher Faulet
147b6d3d4d MINOR: tevt/mux-h2: Report termination events for the H2C
shutdown for reads (read0), receive errors, shutdown for writes and timeouts
are reported, but only for the H2 connection for now.

As for the H1 multiplexer, more events must be added to report protocol
errors, goaways and rst-streams. And of course, all events for the H2
streams must be reported too.
2025-01-31 10:41:50 +01:00
Christopher Faulet
22f8d2c99e BUG/MEDIUM: mux-h2: Count copied data when looping on RX bufs in h2_rcv_buf()
When data was copied from RX buffers to the channel buffer, more data than
expected could be moved because amount of data copied was never decremented
from the limit. This could lead to a stream dead lock when the compression
filter was inuse.

The issue was introduced by commit 4eb3ff1 ("MAJOR: mux-h2: make streams use
the connection's buffers") but revealed by 3816c38 ("MAJOR: mux-h2: permit a
stream to allocate as many buffers as desired").

Because a h2 stream can now have several RX buffers, in h2_rcv_buf(), we
loop on these buffers to fill the channel buffer. However, we must still
take care to respect the limit to not copy to much data. However, the
"count" variable was never decremented to reflect amount of data already
copied. So, it was possible to exceed the limit.

It was an issue when the compression filter was inuse because the channel
buffer could be fully filled, preventing the compression to be
performed. When this happened, the stream was infinitly blocked because the
compression filter was asking for some space but nothing was scheduled to be
forwarded.

This patch should fix the issue #2826. It must be backported to 3.1.
2025-01-02 09:58:23 +01:00
Amaury Denoyelle
8633446337 BUG/MINOR: h2/rhttp: fix HTTP2 conn counters on reverse
Dedicated HTTP/2 stats proxy counters are available for current and
total number of HTTP/2 connection on both frontend and backend sides.
Both counters are simply incremented into h2_init().

This causes issues when using reverse HTTP. First, increment is not
performed on the expected side, as it is triggered before
h2_conn_reverse() which switches a connection from frontend to backend
or vice versa. For example on active revers side, h2_total_connections
is incremented on the backend only even after connection is reversed and
attached to a listener for the remainder of its lifetime.

h2_open_connections suffers from a similar but arguably worst behavior
as it is also decremented. If increment and decrement operations are not
performed on the same proxy side, which happens for every connection
which has been successfully reversed, it causes an invalid counter
value, possibly with an integer overflow.

To fix this, delay increment operations on reverse HTTP from h2_init()
to h2_conn_reverse(). Both counters are updated only after reverse has
completed, thus using the expected frontend or backend side.

To prevent overflow on h2_open_connections, ensure h2_release()
decrement is not performed if a connection is freed before achieving its
reversal, as in this case it would not have been accounted by H2
counters.

This should be backported up to 2.9.

This should fix github issue #2821.
2024-12-19 17:32:01 +01:00
Willy Tarreau
cb21db04c7 BUG/MINOR: mux-h2: fix expression when detecting excess of CONTINUATION frames
Latest commit f0eca8fe7 ("MINOR: mux-h2/glitches: add a description to
the H2 glitches") misplaced the optional glitch description field, with
it appearing at the end of the if () condition and always reporting
an excess of CONTINUATION frames from the first exceeding one.

This needs to be backported along with that commit once it gets backported.
2024-12-06 18:53:19 +01:00
Willy Tarreau
f0eca8fe73 MINOR: mux-h2/glitches: add a description to the H2 glitches
Since we can now list them using "debug counters" and now support a
description, better add the description to all glitches. This patch may
be backported to 3.1, but before this the following patches must also
be picked:

    86823c828 MINOR: mux-h2/traces: add a missing trace on negative initial window size
    7c8e9420a CLEANUP: mux-h2/traces: reword certain ambiguous traces
2024-12-06 18:49:07 +01:00
Willy Tarreau
7c8e9420a2 CLEANUP: mux-h2/traces: reword certain ambiguous traces
Some h2 traces were not very clear, let's reword them a bit.
2024-12-06 18:45:46 +01:00
Willy Tarreau
86823c828f MINOR: mux-h2/traces: add a missing trace on negative initial window size
When a negative initial windows size is reported, we're going to close
the connection, so it's important to report a trace to explain why!
This should be backported at least to 3.1 and possibly 3.0 (adapting the
context since there's no glitches there).
2024-12-06 18:45:46 +01:00
Amaury Denoyelle
762d0764d7 MINOR: mux-h2: use explicit __objt_server on idle conn reinsert
This commit is the counterpart of the previous one for H2 mux. It
replaces objt_server() by unsafe __objt_server(), as conn target is
guarantee to point to a valid server instance, which can then be used as
_srv_add_idle() argument.
2024-12-06 18:02:55 +01:00
Willy Tarreau
d649278fce BUG/MEDIUM: mux-h2: make sure not to touch dummy streams when sending WU
Since commit 1cc851d9f2 ("MEDIUM: mux-h2: start to update stream when
sending WU") we started storing stream offsets in the h2s struct. These
offsets are updated at a few points, where it's safe to write to the
stream, and in h2c_send_strm_wu(), where the h2s->h2c was not performed.

Due to this, nothing protects the h2s from being updated when sending a
WU for a closed stream, which might only happen when acknowledging a
frame after resetting that stream, which is quite unlikely. In any case
if this happens, it will crash as in issue #2793 since the closed streams
are purposely read-only to catch such bugs.

The fix is trivial, just check h2s->h2c before deciding to update the
stream.

Thanks to @Wahnes for reporting this, and Christopher for spotting the
cause. This needs to be backported to 3.1 only.
2024-12-05 15:25:09 +01:00
Christopher Faulet
63d2760dfa BUG/MEDIUM: mux-h2: Check the number of headers in HEADERS frame after decoding
There is no explicit test on the number of headers when a HEADERS frame is
received. It is implicitely limited by the size of the header list. But it
is twice the configured limit to be sure to decode the frame.

So now, a check is performed after the HTX message was created. This way, we
are sure to not exceed the configured limit after the decoding stage. If
there are too many headers, a parsing error is reported.

Note the same is performed on the trailers.

This patch should patially address the issue #2685. It should be backported
to all stable versions.
2024-11-20 17:44:22 +01:00
Christopher Faulet
e415e3cb7a BUG/MEDIUM: mux-h2: Increase max number of headers when encoding HEADERS frames
When a HEADERS frame is encoded to be sent, the maximum number of headers
allowed in the frame is lower than on receiving path. This can lead to
report a sending error while the message was accepted. It could be
confusing.

In addition, the start-line is splitted into pseudo-headers and consummes
this way some header slots, increasing the difference between HEADERS frames
encoding and decoding. It is even more noticeable because when a HEADERS
frame is decoded, a margin is used to be able to handle splitted cookie
headers. Concretly, on decoding path, a limit of twice the maxumum number of
headers allowed in a message (tune.http.maxhdr * 2) is used. On encoding
path, the exact limit is used. It is not consistent.

Note that when a frame is decoded, we must use a larger limit because the
pseudo headers are reassembled in the start-line and must count for one. But
also because, most of time, the cookies are splitted into several headers
and are reassembled too.

To fix the issue, the same ratio is applied on sending path. A limit must be
defined because an dynamic allocation is not acceptable. Twice of the
configured limit should be good enough to support headers manipulation.

This patch should be backported to all stable versions.
2024-11-20 17:44:22 +01:00
Christopher Faulet
f065d00098 BUG/MEDIUM: mux-h2: Don't send RST_STREAM frame for streams with no ID
On server side, the H2 stream is first created with an unassigned ID (ID ==
0). Its ID is assigned when the request is emitted, before formatting the
HEADERS frame. However, the session may be aborted during that stage. We
must take care to not emit RST_STREAM frame for this stream, because it does
not exist yet for the server.

It is especially important to do so because, depending on the timing, it may
also happens before the H2 PREFACE was sent.

This patch must be backported to all stable versions. It is related to issue
2024-11-15 10:34:47 +01:00
Willy Tarreau
df93cf72b9 MINOR: mux-h2: count glitches when they're reported
The h2c_report_glitch() function is now replaced with a macro to support
enumerating counters for each individual glitch line. For now this adds
43 such counters. The macro supports an optional description, though that
is not being used for now. It gives outputs like this (note that the last
one was purposely instrumented to pass a description):

   > debug dev counters glt all
   0          GLT mux_h2.c:5976 h2c_dec_hdrs()
   0          GLT mux_h2.c:5960 h2c_dec_hdrs()
   (...)
   0          GLT mux_h2.c:2207 h2c_frt_recv_preface()
   0          GLT mux_h2.c:1954 h2c_frt_stream_new(): new stream too early

As a reminder, this requires to build with -DDEBUG_GLITCHES=1.
2024-11-14 09:01:57 +01:00
Willy Tarreau
3ed9361688 BUG/MEDIUM: mux-h2: try to wait for the peer to read the GOAWAY
When timeout http-keep-alive is very short (e.g. 10ms), it's possible
sometimes for a client to face truncated responses due to an early
close that happens while the system is still pushing the last data,
colliding with the client's WINDOW_UPDATEs that trigger RSTs.

Here we're trying to do better: first we send a GOAWAY on timeout, then
we wait up to clientfin/client timeout for the peer to react so that we
don't immediately close. This is sufficient to avoid truncation as soon
as the timeout is more than a few hundred ms.

It's not certain it should be backported, because it's a bit sensistive
and might possibly fall into certain edge cases.
2024-11-08 14:31:07 +01:00
Willy Tarreau
db76949cff CLEANUP: mux-h2: remove the unused "full" variable in h2_frt_transfer_data()
During 11th and 12th iteration of the development cycle for the H2 auto
rx window, several approaches were attempted to figure if another buffer
could be allocated or not. One of them consisted in looping back to the
beginning of the function requesting a new buffer slot and getting one
if the buffer was either apparently or confirmed full. The latest one
consisted in directly allocating the next buffer from the two places
where it's found to be proven full, instead of checking with the now
defunct h2s_may_get_rxbuf() if we were allowed to get once an loop.
That approach was retained. In this case the "full" variabled is no
longer needed, so let's get rid of it because the construct looks bogus
and confuses coverity (and possibly code readers as the intent is unclear
compared to the code).
2024-10-24 16:12:46 +02:00
Willy Tarreau
1eb31d30fe Revert "OPTIM: mux-h2: make h2_send() report more accurate wake up conditions"
This reverts commit 9fbc01710a313968c90e72537a5906432f438062.

In 3.1-dev10, commit 9fbc01710a ("OPTIM: mux-h2: make h2_send() report
more accurate wake up conditions") leveraged the more accurate distinction
between demux and recv to decide when to wake the tasklet up after a send.
But other cases are needed. When we just need to wake the processing task
up so that it itself wakes up other streams, for example because these ones
are blocked. Indeed, a temporarily blocked stream may block other ones,
which will never be woken up if the demux has nothing to do.

In an ideal world we would check all cases where blocking flags were
dropped. However it looks like this case after a send is probably the
only one that deserves waking up the connection again. It's likely that
in practice the MUX_MFULL flag was dropped and that it was that one that
was blocking the send.

In addition, dealing with these cases was not sufficient, as one case was
encountered where dbuf was empty, subs=0, short_read still present while
in FRH state... and the timeouts were still there (easily found with
halog -tcn cD at a rate of 1-2 every 2 minutes roughly).

Interestingly, in a dump, some MBUF_HAS_DATA were seen on an empty mbuf,
so it means that certain conditions must be taken very carefully in the
wakeup conditions.

So overall this indicates that there remain subtle inconsistencies that
this optimization is sensitive to. It may have to be revisited later but
for now better revert it.

No backport is needed.

Annex:
  - first dump showing a dependency on WAIT_INLIST after h2_send():

    0x6dc2800: [23/Oct/2024:18:07:22.861247] id=1696 proto=tcpv4
      flags=0x100c4a, conn_retries=0, conn_exp=<NEVER> conn_et=0x000 srv_conn=0x597a900, pend_pos=(nil) waiting=0 epoch=0
      frontend=public (id=2 mode=http), listener=SSL (id=5)
      backend=gitweb-haproxy (id=6 mode=http)
      task=0x6e1d090 (state=0x00 nice=0 calls=23 rate=0 exp=2s tid=0(1/0) age=57s)
      txn=0x6e3f7c0 flags=0x43000 meth=1 status=200 req.st=MSG_DONE rsp.st=MSG_DATA req.f=0x4c rsp.f=0x2e
      scf=0x6dc33a0 flags=0x00002482 ioto=1m state=EST endp=CONN,0x6dc6c20,0x40405001 sub=3 rex=<NEVER> wex=3s rto=3s wto=3s
        iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0
          h2s=0x6dc6c20 h2s.id=59 .st=HCR .flg=0x7001 .rxwin=32712 .rxbuf.c=0 .t=0@(nil)+0/0 .h=0@(nil)+0/0
           .sc=0x6dc33a0(.flg=0x00002482 .app=0x6dc2800) .sd=0x6e83fd0(.flg=0x40405001)
           .subs=0x6dc33b8(ev=3 tl=0x6e22a20 tl.calls=10 tl.ctx=0x6dc33a0 tl.fct=sc_conn_io_cb)
           h2c=0x6e66570 h2c.st0=FRH .err=0 .maxid=77 .lastid=-1 .flg=0x2000e00 .nbst=2 .nbsc=2 .nbrcv=0 .glitches=0
           .fctl_cnt=0 .send_cnt=2 .tree_cnt=2 .orph_cnt=0 .sub=1 .dsi=77 .dbuf=0@(nil)+0/0
           .mbuf=[4..4|32],h=[0@(nil)+0/0],t=[0@(nil)+0/0] .task=0x6dbdc60 .exp=<NEVER>
          co0=0x7f84881614b0 ctrl=tcpv4 xprt=SSL mux=H2 data=STRM target=LISTENER:0x2acb7c0
          flags=0x80000300 fd=19 fd.state=121 updt=0 fd.tmask=0x1
      scb=0x2a8da90 flags=0x00001211 ioto=1m state=EST endp=CONN,0x6e5a530,0x106c0001 sub=0 rex=<NEVER> wex=<NEVER> rto=3s wto=<NEVER>
        iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0
          h1s=0x6e5a530 h1s.flg=0x14094 .sd.flg=0x106c0001 .req.state=MSG_DONE .res.state=MSG_DATA
           .meth=GET status=200 .sd.flg=0x106c0001 .sc.flg=0x00001211 .sc.app=0x6dc2800 .subs=(nil)
           h1c=0x7f84880f5f40 h1c.flg=0x80000020 .sub=0 .ibuf=32704@0x6ddef30+16262/32768 .obuf=0@(nil)+0/0 .task=0x6e131d0 .exp=<NEVER>
          co1=0x7f8488172b70 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=SERVER:0x597a900
          flags=0x00000300 fd=31 fd.state=10122 updt=0 fd.tmask=0x1
      filters={0x6e49f30="cache store filter", 0x6e67ad0="compression filter"}
      req=0x6dc2828 (f=0x21840000 an=0x48000 tofwd=0 total=224)
          an_exp=<NEVER> buf=0x6dc2830 data=(nil) o=0 p=0 i=0 size=0
          htx=0x104d2c0 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
      res=0x6dc2870 (f=0xa0040000 an=0x24000000 tofwd=0 total=309982)
          an_exp=<NEVER> buf=0x6dc2878 data=0x6dceef0 o=16333 p=16333 i=16435 size=32768
          htx=0x6dceef0 flags=0x0 size=32720 data=16333 used=1 wrap=NO extra=0
      -----------------------------------
      strm.flg       0x100c4a  SF_SRV_REUSED SF_HTX SF_REDIRECTABLE SF_CURR_SESS SF_BE_ASSIGNED SF_ASSIGNED
      task.state            0  0
      txn.meth              1  GET
      txn.flg         0x43000  TX_NOT_FIRST TX_CACHE_COOK TX_CACHEABLE
      txn.req.flg        0x4c  HTTP_MSGF_BODYLESS HTTP_MSGF_VER_11 HTTP_MSGF_XFER_LEN
      txn.rsp.flg        0x2e  HTTP_MSGF_COMPRESSING HTTP_MSGF_VER_11 HTTP_MSGF_XFER_LEN HTTP_MSGF_TE_CHNK
      f.sc.flg         0x2482  SC_FL_SND_EXP_MORE SC_FL_RCV_ONCE SC_FL_WONT_READ SC_FL_EOI
      f.sc.sd.flg  0x40405001  SE_FL_HAVE_NO_DATA SE_FL_MAY_FASTFWD_CONS SE_FL_EOI SE_FL_NOT_FIRST SE_FL_T_MUX
      f.h2s.flg        0x7001  H2_SF_HEADERS_RCVD H2_SF_OUTGOING_DATA H2_SF_HEADERS_SENT H2_SF_ES_RCVD
      f.h2s.sd.flg 0x40405001  SE_FL_HAVE_NO_DATA SE_FL_MAY_FASTFWD_CONS SE_FL_EOI SE_FL_NOT_FIRST SE_FL_T_MUX
      f.h2c.flg     0x2000e00  H2_CF_MBUF_HAS_DATA H2_CF_DEM_IN_PROGRESS H2_CF_DEM_SHORT_READ H2_CF_WAIT_INLIST
      f.co.flg     0x80000300  CO_FL_XPRT_TRACKED CO_FL_XPRT_READY CO_FL_CTRL_READY
      f.co.fd.st        0x121  FD_POLL_IN FD_EV_READY_W FD_EV_ACTIVE_R
      b.sc.flg         0x1211  SC_FL_SND_NEVERWAIT SC_FL_NEED_ROOM SC_FL_NOHALF SC_FL_ISBACK
      b.sc.sd.flg  0x106c0001  SE_FL_WAIT_DATA SE_FL_MAY_FASTFWD_CONS SE_FL_MAY_FASTFWD_PROD SE_FL_WANT_ROOM SE_FL_RCV_MORE SE_FL_T_MUX
      b.h1s.sd.flg 0x106c0001  SE_FL_WAIT_DATA SE_FL_MAY_FASTFWD_CONS SE_FL_MAY_FASTFWD_PROD SE_FL_WANT_ROOM SE_FL_RCV_MORE SE_FL_T_MUX
      b.h1s.flg       0x14094  H1S_F_HAVE_CLEN H1S_F_HAVE_O_CONN H1S_F_NOT_FIRST H1S_F_WANT_KAL H1S_F_RX_CONGESTED
      b.h1c.flg    0x80000020  H1C_F_IS_BACK H1C_F_IN_FULL
      b.co.flg          0x300  CO_FL_XPRT_READY CO_FL_CTRL_READY
      b.co.fd.st       0x278a  FD_POLL_OUT FD_POLL_PRI FD_POLL_IN FD_EV_ERR_RW FD_EV_READY_R 0x2008
      req.flg      0x21840000  CF_FLT_ANALYZE CF_DONT_READ CF_AUTO_CONNECT CF_WROTE_DATA
      req.ana         0x48000  AN_REQ_FLT_END AN_REQ_HTTP_XFER_BODY
      req.htx.flg           0  0
      res.flg      0xa0040000  CF_ISRESP CF_FLT_ANALYZE CF_WROTE_DATA
      res.ana      0x24000000  AN_RES_FLT_END AN_RES_HTTP_XFER_BODY
      res.htx.flg           0  0
      -----------------------------------

  - second example of stuck connection after properly checking for WAIT_INLIST
    as well:

    0x73438d0: [23/Oct/2024:18:46:57.235709] id=3963 proto=tcpv4
      flags=0x100c4a, conn_retries=0, conn_exp=<NEVER> conn_et=0x000 srv_conn=0x5dd3f50, pend_pos=(nil) waiting=0 epoch=0x13
      p_stc=25 p_req=29 p_res=29 p_prp=29
      frontend=public (id=2 mode=http), listener=SSL (id=5)
      backend=gitweb-haproxy (id=6 mode=http)
      task=0x72a13e0 (state=0x00 nice=0 calls=24 rate=0 exp=7s tid=0(1/0) age=53s)
      txn=0x7287260 flags=0x43000 meth=1 status=200 req.st=MSG_DONE rsp.st=MSG_DATA req.f=0x4c rsp.f=0x2e
      scf=0x729e520 flags=0x00042082 ioto=1m state=EST endp=CONN,0x737ffd0,0x4040d001 sub=2 rex=<NEVER> wex=46s rto=46s wto=46s
        iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0
          h2s=0x737ffd0 h2s.id=57 .st=HCR .flg=0x7001 .rxwin=32712 .rxbuf.c=0 .t=0@(nil)+0/0 .h=0@(nil)+0/0
           .sc=0x729e520(.flg=0x00042082 .app=0x73438d0) .sd=0x72afd50(.flg=0x4040d001)
           .subs=0x729e538(ev=2 tl=0x72af760 tl.calls=10 tl.ctx=0x729e520 tl.fct=sc_conn_io_cb)
           h2c=0x72555a0 h2c.st0=FRH .err=0 .maxid=77 .lastid=-1 .flg=0x60e00 .nbst=1 .nbsc=1 .nbrcv=0 .glitches=0
           .fctl_cnt=0 .send_cnt=1 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=77 .dbuf=0@(nil)+0/0
           .mbuf=[2..2|32],h=[0@(nil)+0/0],t=[0@(nil)+0/0] .task=0x725e660 .exp=<NEVER>
          co0=0x7378e00 ctrl=tcpv4 xprt=SSL mux=H2 data=STRM target=LISTENER:0x2f24800
          flags=0x80040300 fd=23 fd.state=1122 updt=0 fd.tmask=0x1
      scb=0x2ee74c0 flags=0x00001211 ioto=1m state=EST endp=CONN,0x7287190,0x106c0001 sub=0 rex=<NEVER> wex=<NEVER> rto=46s wto=<NEVER>
        iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0
          h1s=0x7287190 h1s.flg=0x14094 .sd.flg=0x106c0001 .req.state=MSG_DONE .res.state=MSG_DATA
           .meth=GET status=200 .sd.flg=0x106c0001 .sc.flg=0x00001211 .sc.app=0x73438d0 .subs=(nil)
           h1c=0x7373920 h1c.flg=0x80000020 .sub=0 .ibuf=32704@0x7272700+318/32768 .obuf=0@(nil)+0/0 .task=0x729e700 .exp=<NEVER>
          co1=0x72f5290 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=SERVER:0x5dd3f50
          flags=0x00000300 fd=19 fd.state=10122 updt=0 fd.tmask=0x1
      filters={0x728f1f0="cache store filter" [3], 0x728fea0="compression filter" [28]}
      req=0x73438f8 (f=0x21840000 an=0x48000 tofwd=0 total=224)
          an_exp=<NEVER> buf=0x7343900 data=(nil) o=0 p=0 i=0 size=0
          htx=0x105f440 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
      res=0x7343940 (f=0xa0040000 an=0x24000000 tofwd=0 total=359574)
          an_exp=<NEVER> buf=0x7343948 data=0x72b1b30 o=16333 p=16333 i=16435 size=32768
          htx=0x72b1b30 flags=0x8 size=32720 data=16333 used=1 wrap=NO extra=0
      -----------------------------------
      strm.flg       0x100c4a  SF_SRV_REUSED SF_HTX SF_REDIRECTABLE SF_CURR_SESS SF_BE_ASSIGNED SF_ASSIGNED
      task.state            0  0
      txn.meth              1  GET
      txn.flg         0x43000  TX_NOT_FIRST TX_CACHE_COOK TX_CACHEABLE
      txn.req.flg        0x4c  HTTP_MSGF_BODYLESS HTTP_MSGF_VER_11 HTTP_MSGF_XFER_LEN
      txn.rsp.flg        0x2e  HTTP_MSGF_COMPRESSING HTTP_MSGF_VER_11 HTTP_MSGF_XFER_LEN HTTP_MSGF_TE_CHNK
      f.sc.flg        0x42082  SC_FL_EOS SC_FL_SND_EXP_MORE SC_FL_WONT_READ SC_FL_EOI
      f.sc.sd.flg  0x4040d001  SE_FL_HAVE_NO_DATA SE_FL_MAY_FASTFWD_CONS SE_FL_EOS SE_FL_EOI SE_FL_NOT_FIRST SE_FL_T_MUX
      f.h2s.flg        0x7001  H2_SF_HEADERS_RCVD H2_SF_OUTGOING_DATA H2_SF_HEADERS_SENT H2_SF_ES_RCVD
      f.h2s.sd.flg 0x4040d001  SE_FL_HAVE_NO_DATA SE_FL_MAY_FASTFWD_CONS SE_FL_EOS SE_FL_EOI SE_FL_NOT_FIRST SE_FL_T_MUX
      f.h2c.flg       0x60e00  H2_CF_END_REACHED H2_CF_RCVD_SHUT H2_CF_MBUF_HAS_DATA H2_CF_DEM_IN_PROGRESS H2_CF_DEM_SHORT_READ
      f.co.flg     0x80040300  CO_FL_XPRT_TRACKED CO_FL_SOCK_RD_SH CO_FL_XPRT_READY CO_FL_CTRL_READY
      f.co.fd.st       0x1122  FD_POLL_HUP FD_POLL_IN FD_EV_READY_W FD_EV_READY_R
      b.sc.flg         0x1211  SC_FL_SND_NEVERWAIT SC_FL_NEED_ROOM SC_FL_NOHALF SC_FL_ISBACK
      b.sc.sd.flg  0x106c0001  SE_FL_WAIT_DATA SE_FL_MAY_FASTFWD_CONS SE_FL_MAY_FASTFWD_PROD SE_FL_WANT_ROOM SE_FL_RCV_MORE SE_FL_T_MUX
2024-10-23 19:17:10 +02:00
Christopher Faulet
ded28f6e5c BUG/MEDIUM: mux-h2: Remove H2S from send list if data are sent via 0-copy FF
When data are sent via the zero-copy data forwarding, in h2_done_ff, we must
be sure to remove the H2 stream from the send list if something is send. It
was only performed if no blocking condition was encountered. But we must
also do it if something is sent. Otherwise the transfer may be blocked till
timeout.

This patch must be backported as far as 2.9.
2024-10-22 08:00:32 +02:00
Willy Tarreau
9aa86b9dbd BUILD: mux-h2/traces: fix build on 32-bit due to size of the DATA frame
Commit cf3fe1eed ("MINOR: mux-h2/traces: print the size of the DATA
frames") added the size of the DATA frame to the traces. Unfortunately
it uses ullong instead of ulong to cast a pointer, which breaks the
build on 32-bit platforms. Let's just switch it to ulong which works
on both.
2024-10-21 04:17:59 +02:00
Willy Tarreau
e4cb0ad632 MINOR: mux-h2/traces: add buffer-related info to h2s and h2c
The traces currently don't contain any info about the amount of data
present in buffers, making it difficult to figure if an empty buffer
is the cause for not demuxing or if a full buffer is the cause for
not reading more data. Let's add them, with the head/tail info as
well.
2024-10-12 18:07:21 +02:00
Willy Tarreau
a8f907a459 MINOR: mux-h2/traces: add missing flags and proxy ID in traces
H2 traces are unusable to detect bugs most of the time because they miss
the h2c and h2s flags, as well as the proxy, which makes it very hard to
figure if the info comes from the client or the server as soon as two
layers are stacked. This commit adds these precious information as well
as the h2s's rx and tx windows.

This could be backported to a few recent branches, but the rx window
calculation will have to be replaced with the static value there.
2024-10-12 17:45:51 +02:00
Willy Tarreau
fcab647613 OPTIM: mux-h2: use tasklet_wakeup_after() in h2s_notify_recv()
This reduces the avg wakeup latency of sc_conn_io_cb() from 1900 to 51us.
The L2 cache misses from from 1.4 to 1.2 billion for 20k req. But the
perf is not better. Also there are situations where we must not perform
such wakeup, these may only be done from h2_io_cb, hence the test on the
next_tasklet pointer and its reset when leaving the function. In practice
all callers to h2s_close() or h2s_destroy() can reach that code, this
includes h2_detach, h2_snd_buf, h2_shut etc.

Another test with 40 concurrent connections, transferring 40k 1MB objects
at different concurrency levels from 1 to 80 also showed a 21% drop in L2
cache misses, and a 2% perf improvement:

Before:
   329,510,887,528  instructions
    50,907,966,181  branches
       843,515,912  branch-misses
     2,753,360,222  cache-misses
    19,306,172,474  L1-icache-load-misses
    17,321,132,742  L1-dcache-load-misses
       951,787,350  LLC-load-misses

      44.660469000 seconds user
      62.459354000 seconds sys

   => avg perf: 373 MB/s

After:
   331,310,219,157  instructions
    51,343,396,257  branches
       851,567,572  branch-misses
     2,183,369,149  cache-misses
    19,129,827,134  L1-icache-load-misses
    17,441,877,512  L1-dcache-load-misses
       906,923,115  LLC-load-misses

      42.795458000 seconds user
      62.277983000 seconds sys

   => avg perf: 380 MB/s

With small requests, it's the L1 and L3 cache misses which reduced by
3% and 7% respectively, and the performance went up by 3%.
2024-10-12 17:17:51 +02:00
Willy Tarreau
04ce6536e1 OPTIM: mux-h2: try to continue reading after demuxing when useful
When we stop demuxing in the middle of a frame, we know that there are
other data following. The demux buffer is small and unique, but now we
have rxbufs, so after h2_process_demux() is left, the dbuf is almost
empty and has room to be delivered into another rxbuf.

Let's implement a short loop with a counter and a few conditions around
the demux call. We limit the number of turns to the number of available
rxbufs and no more than 12, since it shows good performance, and the
wakeup is only called once. This has shown a nice 12-20% bandwidth gain
on backend-side H2 transferring 1MB-large objects, and does not affect
the rest (headers, control etc). The number of wakeup calls was divided
by 5 to 8, which is also a nice improvement. The counter is limited to
make sure we don't add processing latency. Tests were run to find the
optimal limit, and it turns out that 16 is just slightly better, but not
worth the +33% increase in peak processing latency.

The h2_process_demux() function just doens't call the wakeup function
anymore, and solely focuses on transferring from dbuf to rxbuf.

Practical measurement: test with h2load producing 4 concurrent connections
with 10 concurrent streams each, downloading 1MB objects (20k total) via
two layers of haproxy stacked, reaching httpterm over H1 (numbers are total
for the 2 h2 front and 1 h2 back). All on a single thread.

Before: 549-553 MB/s (on h2load)
  function    calls  cpu_tot  cpu_avg
  h2_io_cb  2562340  8.157s   3.183us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
  h2_io_cb    30109  840.9ms  27.93us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
  h2_io_cb    16105  106.4ms  6.607us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
  h2_io_cb        1  11.75us  11.75us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
  h2_io_cb  2608555  9.104s   3.490us --total--

  perf stat:
   153,117,996,214 instructions                             (71.41%)
    22,919,659,027 branches       # 14.97% of inst          (71.41%)
       384,009,600 branch-misses  #  1.68% of all branches  (71.42%)
        44,052,220 cache-misses          # 1 inst / 3476    (71.44%)
     9,819,232,047 L1-icache-load-misses # 6.4% of inst     (71.45%)
     8,426,410,306 L1-dcache-load-misses # 5.5% of inst     (57.15%)
        10,951,949 LLC-load-misses       # 1 inst / 13982   (57.13%)

      12.372600000 seconds user
      23.629506000 seconds sys

After: 660 MB/s (+20%)
  function    calls  cpu_tot  cpu_avg
  h2_io_cb   244502  4.410s   18.04us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
  h2_io_cb    42107  1.062s   25.22us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
  h2_io_cb    13703  106.3ms  7.758us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
  h2_io_cb        1  13.74us  13.74us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
  h2_io_cb   300313  5.578s   18.57us --total--

  perf stat:
   126,840,441,876 instructions                             (71.40%)
    17,576,059,236 branches       # 13.86% of inst          (71.40%)
       274,136,753 branch-misses  #  1.56% of all branches  (71.42%)
        30,413,562 cache-misses          # 1 inst / 4170    (71.45%)
     6,665,036,203 L1-icache-load-misses # 5.25% of inst    (71.46%)
     7,519,037,097 L1-dcache-load-misses # 5.9% of inst     (57.15%)
         6,702,411 LLC-load-misses       # 1 inst / 18925   (57.12%)

      10.490097000 seconds user
      19.212515000 seconds sys

It's also interesting to see that less total time is spent in these
functions, clearly indicating that the cost of interrupted processing,
and the extraneous cache misses come into play at some point. Indeed,
after the change, the number of instructions went down by 17.2%, while
the L2 cache misses dropped by 31% and the L3 cache misses by 39%!
2024-10-12 16:38:36 +02:00
Willy Tarreau
9fbc01710a OPTIM: mux-h2: make h2_send() report more accurate wake up conditions
h2_send() used to report non-zero every time any data were sent, and
this was used from h2_snd_buf() or h2_done_ff() to trigger a wakeup,
which possibly can do nothing. Restricting this wakeup to either a
successful send() combined with the ability to demux, or an error.

Doing this makes the number of h2_io_cb() wakeups drop from 422k to
245k for 1000 1MB objects delivered over 100 streams between two H2
proxies, without any behavior change nor performance change. In
practice, most send() calls do not result in a wakeup anymore but
synchronous errors still do.

A local test downloading 10k 1MB objects from an H1 server with a single
connection shows this change:

   before      after    caller
   1547        1467     h2_process_demux()
   2138           0     h2_done_ff()       <---
     38        1453     ssl_sock_io_cb()   <---
     18           0     h2_snd_buf()
      1           1     h2_init()
   3742        2921     -- total --

In practice the ssl_sock_io_cb() wakeups are those notifying about
SUB_RETRY_RECV, which are not accounted for when h2_done_ff() performs
the wakeup because the tasklet is already queued (a counter placed
there shows that it's nonetheless called). So there's no transfer and
h2_done_ff() was only hiding the other one.

Another test involving 4 connections with 10 concurrent streams each
and 20000 1MB objects total shows a total disparition of the wakeups
from h2_snd_buf and h2_done_ff, which used to account together for
50% of the wakeups, resulting in effectively halving the number of
wakeups which, based on their avg process time, were not doing
anything:

Before:
  function   calls     cpu_tot   cpu_avg
  h2_io_cb   2571208   7.406s    2.880us <- h2c_restart_reading@src/mux_h2.c:940 tasklet_wakeup
  h2_io_cb   2536949   251.4ms   99.00ns <- h2_snd_buf@src/mux_h2.c:7573 tasklet_wakeup ###
  h2_io_cb     41100   5.622ms   136.0ns <- h2_done_ff@src/mux_h2.c:7779 tasklet_wakeup ###
  h2_io_cb     38979   852.8ms   21.88us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
  h2_io_cb     12519   90.28ms   7.211us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
  h2_io_cb         1   13.81us   13.81us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
  h2_io_cb   5200756   8.606s    1.654us --total--

After:
  h2_io_cb   2562340   8.157s    3.183us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
  h2_io_cb     30109   840.9ms   27.93us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
  h2_io_cb     16105   106.4ms   6.607us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
  h2_io_cb         1   11.75us   11.75us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
  h2_io_cb   2608555   9.104s    3.490us --total--
2024-10-12 16:38:36 +02:00
Willy Tarreau
633c41c621 MEDIUM: mux-h2: rework h2_restart_reading() to differentiate recv and demux
From the beginning, h2_restart_reading() has always been confusing because
it decides whether or not to wake the tasklet handler up or not. This
tasklet handler does two things, one is receiving from the socket to the
demux buf, and one is demuxing from the demux buf to the streams' rxbufs.

The conditions are governed by h2_recv_allowed(), which is also called at
a few places to decide whether or not to actually receive from the socket.
It starts to be visible that this leaves some difficulties regarding what
to do with possibly pending data.

In 2.0 with commit 3ca18bf0b ("BUG/MEDIUM: h2: Don't attempt to recv from
h2_process_demux if we subscribed."), we even had to address a special
case where it was possibly to endlessly wake up because the conditions
would rely on the demux buffer's contents, though the solution consisted
in passing a flag to decide whether or not to consider the buffer's
contents.

In 2.5 commit b5f7b5296 ("BUG/MEDIUM: mux-h2: Handle remaining read0 cases
on partial frames") introduced a new flag H2_CF_DEM_SHORT_READ which
indicates that the demux had to stop in the middle of a frame and cannot
make progress without more data. More adaptations later came in based on
this but this actually reflected exactly what was needed to solve this
painful situation: a state indicating whether to receive or parse.

Now's about time to definitely address this by reworking h2_restart_reading()
to check two completely independent things:
  - the ability to receive more data into the demux buffer, which is
    based on its allocation/fill state and the socket's errors
  - the ability to demux such data, which is based on the presence of
    enough data (i.e. no stuck short read), and ability to find an rx
    buf to continue the processing.

Now the conditions are much more understandable, and it's also visible
that the consider_buffer argument, whose value was not trivial for
callers, is not used anymore.

Tests stacking two layers of H2 show strictly no change to the wakeup
cause distributions nor counts.
2024-10-12 16:38:36 +02:00
Willy Tarreau
0fd66703c2 MEDIUM: mux-h2: change the default initial window to 16kB
Now that we're using all available rx buffers for transfers, there's
no point anymore in advertising more than the minimum value we can
safely buffer. Let's be conservative and only rely on the dynamic
buffers to improve speed beyond the configured value, and make sure
than many streams will no longer cause unfairness.

Interestingly, the total number of wakeups has further shrunk down, but
with a different distribution. From 128k for 1000 1M transfers, it went
down to 119k, with 96k from restart_reading, 10k from done_ff and 2.6k
from snd_buf. done_ff went up by 30% and restart_reading went down by
30%.
2024-10-12 16:38:26 +02:00
Willy Tarreau
1ed9d37c88 MINOR: mux-h2: add tune.h2.be.rxbuf and tune.h2.fe.rxbuf global settings
These settings allow to change the total buffer size allocated to the
backend and frontend respectively. This way it's no longer necessary to
play with tune.bufsize nor increase the number of streams to benefit from
more buffers.

Setting tune.h2.fe.rxbuf to 4m to match a sender's max tcp_wmem resulted
in 257 Mbps for a single stream at 103ms vs 121 Mbps default (or 5.1 Mbps
with a single buffer and 64kB window).
2024-10-12 16:29:16 +02:00
Willy Tarreau
e018d9a0cf MAJOR: mux-h2: make the rxbuf allocation algorithm a bit smarter
Without using bandwidth estimates, we can already use up to the number
of allocatable rxbufs and share them evenly between receiving streams.
In practice we reserve one buffer for any non-receiving stream, plus
1 per 8 possible new streams, and divide the rest between the number
of receiving streams.

Finally, for front streams, this is rounded up to the buffer size while
for back streams we round it down. The rationale here is that front to
back is very fast to flush and slow to refill so we want to optimise
upload bandwidth regardless of the number of streams, while it's the
opposite in the other way so we try to minimize HoL.

That shows good results with a single stream being able to send at 121
Mbps at 103ms using 1.4 MB buffer with default settings, or 8 streams
sharing the bandwidth at 180kB each. Previously the limit was approx
5.1 Mbps per stream.

It also enables better sharing of backend connections: a slow (100 Mbps)
and a fast (1 Gbps) clients were both downloading 2 100MB files each over
a shared H2 connection. The fast one used to show 6.86 to 20.74s with an
avg of 11.45s and an stddev of 5.81s before the patch, and went to a
much more respectable 6.82 to 7.73s with 7.08s avg and 0.336s stddev.

We don't try to increase the window past the remaining content length.
First, this is pointless (though harmless), but in addition it causes
needless emission of WINDOW_UPDATE frames on small uploads that are
smaller than a window, and beyond being useless, it upsets vtest which
expects an RST on some tests. The scheduling is not reliable enough to
insert an expect for a window update first, so in the end wich that
extra check we save a few useless frames on small uploads and please
vtest.

A new setting should be added to allow to increase the number of buffers
without having to change the number of streams. At this point it's not
done.
2024-10-12 16:29:16 +02:00
Willy Tarreau
3816c38601 MAJOR: mux-h2: permit a stream to allocate as many buffers as desired
Now we don't enforce allocation limits in h2s_get_rxbuf(), since there
is no benefit in not processing pending data, it would still cause HoL
for no saving. The only reason for not allocating is if there are no
buffers available for the connection.

In theory this should not change anything except that it excerts code
paths that support reallocating multiple buffers, which could possibly
uncover a sleeping bug. This is why it's placed in a separate commit.

And one observation worth noting is that it almost cut in half the number
of iocb wakeups: for 1000 1MB transfers over 100 concurrent streams of a
single connection, we used to observe 208k wakeups (110 from restart_reading,
80 from snd_buf, 11 from done_ff), and now we're observing 128k (113 from
restart_reading, 2.4 from snd_buf, 6.9k from done_ff), which seems to
indicate that pretty often the demuxing was blocked on a buffer full due
to the default advertised window of 64k.
2024-10-12 16:29:16 +02:00
Willy Tarreau
4eb3ff1d3b MAJOR: mux-h2: make streams use the connection's buffers
For now it seems to work as before, and even when artificially inflating
the number of allocatable buffers per stream. The number of allocated
slots is always the same as the max number of streams, which guarantees
that each stream will find one buffer. we only grant one buffer per
stream at this point, since the goal was to replace the existing single
rxbuf.

A new demux blocking flag, H2_CF_DEM_RXBUF, was added to indicate
a failure to get an rxbuf slot from the connection. It was lightly
tested (by forcing bl_init() to a lower number of buffers). It is not
yet certain whether it's more useful to have a new flag or to reuse
the existing H2_CF_DEM_SFULL which indicates the rxbuf is full,
but at least the new flag more accurately translates the condition,
that may make a difference in the future. However, given that when
RXBUF is set, most of the time it results in a failure to find more
room to demux and it sets SFULL, for now we have to always clear
SFULL when clearing RXBUF as well. This means that most of the time
we'll see 3 combinations:
  - none: everything's OK
  - SFULL: the unique rx buffer is full
  - RXBUF || (RXBUF|SFULL): cannot allocate more entries

Note that we need to be super careful in h2_frt_transfer_data() because
the htx_free_data_space() function doesn't guarantee that the room is
usable, so htx_add_data() may still fail despite an apparent room. For
this reason, h2_frt_transfer_data() maintains a "full" flag to indicate
that a transfer attempt failed and that a new buffer is required.
2024-10-12 16:29:16 +02:00
Willy Tarreau
6279cbc9e9 MINOR: mux-h2: clear up H2_CF_DEM_DFULL and H2_CF_DEM_SHORT_READ ambiguity
Since commit 485da0b05 ("BUG/MEDIUM: mux_h2: Handle others remaining
read0 cases on partial frames"), H2_CF_DEM_SHORT_READ is set when there
is no blocking flags. However, it checks H2_CF_DEM_BLOCK_ANY which does
not include H2_CF_DEM_DFULL. This results in many cases where both
H2_CF_DEM_DFULL and H2_CF_DEM_SHORT_READ are set together, which makes
no sense, since one says the demux buffer is full while the other one
says an incomplete read was done. This doesn't permit to properly
decide whether to restart reading or processing.

Let's make sure to clear DFULL in h2_process_demux() whenever we
consume incoming data from the dbuf, and check for DFULL before
setting SHORT_READ.

This could probably be considered as a bug fix but it's hard to say if
it has any impact on the current code, probably at worst it might cause
a few useless wakeups, so until there's any proof that it needs to be
backported, better not do it.
2024-10-12 16:29:16 +02:00
Willy Tarreau
b74bedf157 MINOR: mux-h2: simplify the wake up code in h2_rcv_buf()
The code used to decide when to restart reading is far from being trivial
and will cause trouble after the forthcoming changes: it checks if the
current stream is the same that is being demuxed, and only if so, wakes
the demux to restart reading. Once streams will start to use multiple
buffers, this condition will make no sense anymore. Actually the real
reason is split into two steps:
  - detect if the demux is currently blocked on the current stream, and
    if so remove SFULL
  - detect if any demux blocking flags were removed during the operations,
    and if so, wake demuxing.

For now this doesn't change anything.
2024-10-12 16:29:16 +02:00