Let's just see on a diagram how the receiver can detect that the
window is large enough for the remote sender to fill the link. Here
it seems that a first criterion is that data are accumulating in
the rxbuf, indicating that the next hop doesn't consume them fast
enough. On the diagram it's visible when the blue arrows (incoming data)
are on average more frequent than the magenta ones (outgoing data),
which happens when silence moments become less frequent and no longer allow
the reader to catch up. It's also visible that there are two phases
alternating in the transfer:
- measure round trip time (i.e. how long it takes to restart
sending after a WU sent following a long silence)
- measure the lowest rxbuf size during the previous round trip
It's worth noting that a window size change only has *observable* effect
after two RTT: the first RTT is to restart sending (opening or enlarging
the window), the second RTT to measure the lowest rxbuf size over the
period.
By turning the advertised window into an offset and comparing it to
the received quantity, it's possible to measure the RTT of the whole
chain (including the client possibly producing the data). Note that
when multiple streams compete for BW this can become tricky. Limiting
the window to the available buffers and counting the number of sending
streams on a connection could work (i.e. split the total buffers into
1+#senders, the first one being used for tx).
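A minimal sketch of such a split, using purely illustrative names (none
of these helpers exist in the code):

  /* illustrative: reserve the first share for tx and split the
   * remaining buffers evenly between the sending streams */
  static inline uint32_t bufs_per_sender(uint32_t total_bufs,
                                         uint32_t nb_senders)
  {
          return total_bufs / (1 + nb_senders);
  }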
Now that we're using all available rx buffers for transfers, there's
no point anymore in advertising more than the minimum value we can
safely buffer. Let's be conservative and only rely on the dynamic
buffers to improve speed beyond the configured value, and make sure
that many streams will no longer cause unfairness.
Interestingly, the total number of wakeups has further shrunk, but
with a different distribution. From 128k for 1000 1M transfers, it went
down to 119k, with 96k from restart_reading, 10k from done_ff and 2.6k
from snd_buf. done_ff went up by 30% and restart_reading went down by
30%.
These settings allow changing the total buffer size allocated to the
backend and the frontend respectively. This way it's no longer necessary
to play with tune.bufsize nor to increase the number of streams to benefit from
more buffers.
Setting tune.h2.fe.rxbuf to 4m to match a sender's max tcp_wmem resulted
in 257 Mbps for a single stream at 103ms vs 121 Mbps default (or 5.1 Mbps
with a single buffer and 64kB window).
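For example, assuming a sender whose max tcp_wmem is 4 MB, the test
above corresponds to a global section like this:

  global
      tune.h2.fe.rxbuf 4m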
Without using bandwidth estimates, we can already use up to the number
of allocatable rxbufs and share them evenly between receiving streams.
In practice we reserve one buffer for any non-receiving stream, plus
1 per 8 possible new streams, and divide the rest among the
receiving streams.
Finally, for front streams this is rounded up to the buffer size while
for back streams we round it down. The rationale here is that front to
back is very fast to flush and slow to refill, so we want to optimize
upload bandwidth regardless of the number of streams, while it's the
opposite in the other direction, so there we try to minimize HoL blocking.
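A rough sketch of this computation, with illustrative names only (the
actual code differs):

  /* <avail> is the number of allocatable rxbufs on the connection */
  static uint32_t rxbuf_budget(uint32_t avail, uint32_t nb_streams,
                               uint32_t nb_recv, uint32_t new_slots)
  {
          /* one buffer per non-receiving stream, plus 1 per 8
           * possible new streams */
          uint32_t reserved = (nb_streams - nb_recv) + new_slots / 8;
          uint32_t left = (avail > reserved) ? avail - reserved : 1;

          /* the per-stream share; the resulting window is then
           * rounded up to the buffer size for front streams and
           * rounded down for back streams */
          return left / (nb_recv ? nb_recv : 1);
  }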
That shows good results, with a single stream being able to send at 121
Mbps at 103ms using a 1.4 MB buffer with default settings, or 8 streams
sharing the bandwidth at 180kB each. Previously the limit was
approximately 5.1 Mbps per stream.
It also enables better sharing of backend connections: a slow (100 Mbps)
client and a fast (1 Gbps) one were each downloading two 100MB files over
a shared H2 connection. The fast one used to show 6.86 to 20.74s with an
avg of 11.45s and a stddev of 5.81s before the patch, and went to a
much more respectable 6.82 to 7.73s with a 7.08s avg and 0.336s stddev.
We don't try to increase the window past the remaining content length.
First, this is pointless (though harmless), but in addition it causes
needless emission of WINDOW_UPDATE frames on small uploads that are
smaller than a window, and beyond being useless, it upsets vtest, which
expects an RST on some tests. The scheduling is not reliable enough to
insert an expect for a window update first, so in the end, with that
extra check, we save a few useless frames on small uploads and please
vtest.
A new setting should be added to allow increasing the number of buffers
without having to change the number of streams. At this point it's not
done.
Now we don't enforce allocation limits in h2s_get_rxbuf() anymore, since
there is no benefit in not processing pending data: it would still cause
HoL blocking for no saving. The only reason for not allocating is if there are no
buffers available for the connection.
In theory this should not change anything except that it exercises code
paths that support reallocating multiple buffers, which could possibly
uncover a sleeping bug. This is why it's placed in a separate commit.
And one observation worth noting is that it almost cut in half the number
of iocb wakeups: for 1000 1MB transfers over 100 concurrent streams of a
single connection, we used to observe 208k wakeups (110k from restart_reading,
80k from snd_buf, 11k from done_ff), and now we're observing 128k (113k from
restart_reading, 2.4k from snd_buf, 6.9k from done_ff), which seems to
indicate that pretty often the demuxing was blocked on a buffer full due
to the default advertised window of 64k.
For now it seems to work as before, and even when artificially inflating
the number of allocatable buffers per stream. The number of allocated
slots is always the same as the max number of streams, which guarantees
that each stream will find one buffer. We only grant one buffer per
stream at this point, since the goal was to replace the existing single
rxbuf.
A new demux blocking flag, H2_CF_DEM_RXBUF, was added to indicate
a failure to get an rxbuf slot from the connection. It was lightly
tested (by forcing bl_init() to a lower number of buffers). It is not
yet certain whether it's more useful to have a new flag or to reuse
the existing H2_CF_DEM_SFULL which indicates the rxbuf is full,
but at least the new flag more accurately translates the condition,
which may make a difference in the future. However, given that when
RXBUF is set, most of the time it results in a failure to find more
room to demux and it sets SFULL, for now we have to always clear
SFULL when clearing RXBUF as well. This means that most of the time
we'll see 3 combinations:
- none: everything's OK
- SFULL: the unique rx buffer is full
- RXBUF or (RXBUF|SFULL): cannot allocate more entries
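In practice, the clearing side thus looks like this (sketch):

  /* RXBUF almost always comes with SFULL, so both must be
   * cleared together once a buffer slot becomes available */
  h2c->flags &= ~(H2_CF_DEM_RXBUF | H2_CF_DEM_SFULL);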
Note that we need to be super careful in h2_frt_transfer_data() because
the htx_free_data_space() function doesn't guarantee that the room is
usable, so htx_add_data() may still fail despite an apparent room. For
this reason, h2_frt_transfer_data() maintains a "full" flag to indicate
that a transfer attempt failed and that a new buffer is required.
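A condensed sketch of that protection (variable names are illustrative):

  /* htx_add_data() may add less than requested despite
   * htx_free_data_space() reporting room, so record the failure */
  sent = htx_add_data(htx, ist2(b_head(&h2c->dbuf), flen));
  if (sent < flen)
          full = 1; /* this rxbuf is done, a new buffer is required */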
Since commit 485da0b05 ("BUG/MEDIUM: mux_h2: Handle others remaining
read0 cases on partial frames"), H2_CF_DEM_SHORT_READ is set when there
is no blocking flags. However, it checks H2_CF_DEM_BLOCK_ANY which does
not include H2_CF_DEM_DFULL. This results in many cases where both
H2_CF_DEM_DFULL and H2_CF_DEM_SHORT_READ are set together, which makes
no sense, since one says the demux buffer is full while the other one
says an incomplete read was done. This doesn't make it possible to
properly decide whether to restart reading or processing.
Let's make sure to clear DFULL in h2_process_demux() whenever we
consume incoming data from the dbuf, and check for DFULL before
setting SHORT_READ.
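The resulting logic is essentially this (sketch):

  /* in h2_process_demux(): data were just consumed from dbuf */
  h2c->flags &= ~H2_CF_DEM_DFULL;

  /* later, when checking for an incomplete read */
  if (!(h2c->flags & (H2_CF_DEM_BLOCK_ANY | H2_CF_DEM_DFULL)))
          h2c->flags |= H2_CF_DEM_SHORT_READ;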
This could probably be considered as a bug fix but it's hard to say if
it has any impact on the current code, probably at worst it might cause
a few useless wakeups, so until there's any proof that it needs to be
backported, better not do it.
The code used to decide when to restart reading is far from being trivial
and will cause trouble after the forthcoming changes: it checks if the
current stream is the same that is being demuxed, and only if so, wakes
the demux to restart reading. Once streams start to use multiple
buffers, this condition will no longer make sense. The real logic is
actually split into two steps:
- detect if the demux is currently blocked on the current stream, and
if so remove SFULL
- detect if any demux blocking flags were removed during the operations,
and if so, wake demuxing.
For now this doesn't change anything.
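A sketch of these two steps (simplified; the exact flag handling in
mux-h2 differs):

  uint32_t old_flags = h2c->flags;

  /* step 1: demux blocked on the current stream? then unblock it */
  if (h2c->dsi == h2s->id)
          h2c->flags &= ~H2_CF_DEM_SFULL;

  /* step 2: any demux blocking flag removed? then wake demuxing */
  if ((old_flags & H2_CF_DEM_BLOCK_ANY) &&
      !(h2c->flags & H2_CF_DEM_BLOCK_ANY))
          tasklet_wakeup(h2c->wait_event.tasklet);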
The code used to decide what to tell to the upper layer and when to free
the rxbuf is a bit convoluted and difficult to adapt to dynamic rxbufs.
We first need to deal with memory management (b_free) and only then to
decide what to report upwards. Right now it does it the other way around.
This should not change anything.
It's not convenient to have this flag in the middle of the demux flags,
it easily hides other ones that need to be added. Let's move it after
the other ones.
Now the h2s gets its rx_head, rx_tail and rx_count associated with the
shared rxbufs. A few functions are provided to manipulate all this:
essentially allocating/releasing a buffer for the stream, returning a
buffer pointer to the head/tail, counting the allocated buffers for a
stream, and reporting whether a stream may still allocate.
For now this code is not used.
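For illustration, the new h2s state is roughly shaped like this (the
real layout may differ):

  struct h2s {
          /* ... existing fields ... */
          uint32_t rx_head;  /* index of the oldest allocated rxbuf slot */
          uint32_t rx_tail;  /* index of the newest allocated rxbuf slot */
          uint32_t rx_count; /* number of rxbuf slots currently held */
  };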
In preparation for having a shared list of rx bufs, we're now allocating
the array of shared rx bufs in the h2c. The pool is created at the max
size between the front and back max streams for now, and the array is not
used yet.
A stream is receiving data from the HEADERS frame lacking END_STREAM
until the end of the stream or HREM (the reception of END_STREAM). We're
now adding a flag to the stream that indicates this state, as well as a
counter in the connection of streams currently receiving data. The
purpose will be to gauge at any instant the number of streams that might
have to share the available bandwidth and buffer count, in order not to
allocate too much flow control to any single stream. For now the counter
is kept up to date, and is
reported in "show fd".
Instead of incrementing the last_max_ofs by the amount of received bytes,
we now start from the new current offset to which we add the static window
size. The result is exactly the same but it prepares the code to use a
window size combined with an offset instead of just refilling the budget
from what was received.
It was even verified that changing h2_fe_settings_initial_window_size in
the middle of a transfer using gdb does indeed allow the transfer speed
to adapt accordingly.
The rationale here is that we don't absolutely need to update the
stream offset live, there's already the rcvd_s counter to remind
us we've received data. So we can continue to exploit the current
check points for this.
Now we know that rcvd_s indicates the amount of newly received bytes
for the stream since last call to h2c_send_strm_wu() so we can update
our stream offsets within that function. The wu_s counter is set to
the difference between next_adv_ofs and last_adv_ofs, which are
resynchronized once the frame is sent.
If the stream suddenly disappears with unacked data (aborted upload),
the presence of the last update in h2c->wu_s is sufficient to let the
connection ack the data alone, and upon subsequent calls with new
rcvd_s, the received counter will be used to ack, like before. We
don't need to do more anyway since the goal is to let the client
abort ASAP when it gets an RST.
At this point, the stream knows its current rx offset, the computed
max offset and the last advertised one.
In H2, everything is accounted as budget. But if we want to moderate
the rcv window that's not very convenient, and we'd rather have offsets
instead so that we know where we are in the stream. Let's first add
the fields to the struct and initialize them. The curr_rx_ofs indicates
the position in the stream where next incoming bytes will be stored.
last_adv_ofs tells what's the offset that was last advertised as the
window limit, and next_max_ofs is the one that will need to be
advertised, which is curr_rx_ofs plus the current window. next_max_ofs
will have to cause a WINDOW_UPDATE to be emitted when it's higher than
last_adv_ofs, and once the WU is sent, its value will have to be copied
over last_adv_ofs.
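The relation between these offsets can be sketched as follows (<win>
stands for whatever window size is in effect; it is not an actual
field):

  /* next_max_ofs is where the advertised window should now end */
  h2s->next_max_ofs = h2s->curr_rx_ofs + win;

  /* a WINDOW_UPDATE is needed when the limit may be pushed further */
  if (h2s->next_max_ofs > h2s->last_adv_ofs) {
          /* ... emit a WU for (next_max_ofs - last_adv_ofs) ... */
          h2s->last_adv_ofs = h2s->next_max_ofs; /* once the WU is sent */
  }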
The problem is, for now wherever we emit a stream WU, we have no notion
of stream (the stream might even not exist anymore, e.g. after aborting
an upload), because we currently keep a counter of stream window to be
acked for the current stream ID (h2c->dsi) in the connection (rcvd_s).
Similarly there are a few places early in the frame header processing
where rcvd_s is incremented without knowing the stream yet. Thus, lookups
will be needed for that, unless such a connection-level counter remains
used and poured into the stream's count once known (delicate).
Thus for now this commit only creates the fields and initializes them.
We'll need to keep track of the total amount of data received for the
current stream, and the amount of data to ack for the current stream,
which will diverge as soon as we have to update the stream's
offset with received data, which are different from those to be ACKed.
One reason is that in case a stream doesn't exist anymore (e.g. an
aborted upload), the rcvd_s info might get lost after updating the
stream, so we do need an in-connection counter for that.
What's done here is that the rcvd_s count is transferred to wu_s in
h2c_send_strm_wu(), to be used as the counter to send, and both are
considered as sufficient when non-null to call the function.
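The transfer itself boils down to this (sketch):

  /* caller side: either counter being non-null justifies the call */
  if (h2c->rcvd_s || h2c->wu_s)
          h2c_send_strm_wu(h2c);

  /* inside h2c_send_strm_wu(): move pending bytes to the WU counter */
  h2c->wu_s  += h2c->rcvd_s;
  h2c->rcvd_s = 0;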
The buffer ring is problematic in multiple aspects, one of which being
that it is only usable by one entity. With multiplexed protocols, we need
to have shared buffers used by many entities (streams and connection),
and the only way to use the buffer ring model in this case is to have
each entity store its own array, and keep a shared counter on allocated
entries. But even with the default 32 buffers and 100 streams per HTTP/2
connection, we're speaking about 32*101*32 bytes = 103424 bytes per H2
connection, just to store up to 32 shared buffers, spread randomly in
these tables. Some users might want to achieve much higher than default
rates over high speed links (e.g. 30-50 MB/s at 100ms), which is 3 to 5
MB storage per connection, hence 180 to 300 buffers. There it starts to
cost a lot, up to 1 MB per connection, just to store buffer indexes.
Instead this patch introduces a variant which we call a buffer list.
That's basically just a free list encoded in an array. Each cell
contains a buffer structure, a next index, and a few flags. The index
could be reduced to 16 bits if needed, in order to make room for a new
struct member. The design permits initializing a whole freelist at once
using memset(0).
The list pointer is stored at a single location (e.g. the connection)
and all users (the streams) will just have indexes referencing their
first and last assigned entries (head and tail). This means that with
a single table we can now have all our buffers shared between multiple
streams, irrespective of the number of potential streams that would want
to use them. Now the 180 to 300 entries array only costs 7.2 to 12 kB,
or 80 times less.
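The cell layout can be sketched like this (illustrative; see buf.h for
the exact definition):

  struct bl_elem {
          struct buffer buf;  /* the shared buffer itself */
          uint32_t next;      /* index of the next cell; could be
                               * reduced to 16 bits if needed */
          uint32_t flags;     /* a few flags */
  };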
Two large functions (bl_deinit() & bl_get()) were implemented in buf.c.
A basic doc was added to explain how it works.
Over time, some of the buffer management functions grew quite a bit,
and were still forced to remain inlined since they were all defined in buf.h.
Let's create buf.c and move the heaviest ones there. All those moved
here were above 200 bytes.
Since 2.7 with commit 8522348482 ("BUG/MAJOR: conn-idle: fix hash indexing
issues on idle conns"), we've been using eb64 trees and not ebmb trees
anymore, and later we dropped all that to centralize the operations in
the server. Let's remove the ebmbtree.h includes from the muxes that do
not use them.
The local "rxbuf" buffer was passed to the trace instead of h2s->rxbuf
that is used when decoding trailers. The impact is essentially the
impossibility to present some buffer contents in some rare cases. It
may be backported but it's unlikely that anyone will ever notice the
difference.
Building with gcc-12.2 -Og yields this incorrect warning in cache.c:
In function 'release_entry_unlocked',
inlined from 'http_action_store_cache' at src/cache.c:1449:4:
src/cache.c:330:9: warning: 'object' may be used uninitialized [-Wmaybe-uninitialized]
330 | release_entry(cache, entry, 1);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/cache.c: In function 'http_action_store_cache':
src/cache.c:1200:29: note: 'object' was declared here
1200 | struct cache_entry *object, *old;
| ^~~~~~
This is wrong, the only way to reach the function is with first!=NULL
and the gotos that reach there are all those made with first==NULL.
Let's just preset object to NULL to silence it.
This command pauses the configuration parser for <timeout>
milliseconds. This is useful for development or for testing timeouts of
init scripts, particularly to simulate a very long reload. It requires
expose-experimental-directives to be set.
If a small request is received on QUIC MUX frontend, it can be
transmitted directly with the FIN on attach operation. rcv_buf is
skipped by the stream layer. Thus, it is necessary to ensure that there
is similar behavior when FIN is reported either on attach or rcv_buf.
One difference was that se_expect_data() was called only for rcv_buf but
not on attach. The most obvious effect is that the stream timeout was
deactivated for this request: the client timeout was disabled on EOI but
the server one was not armed, due to the previous se_expect_no_data().
This prevents the early closure of too-long requests.
To fix this, add an invocation of se_expect_data() on the attach operation.
This bug can simply be detected using httpterm with delay request (for
example /?t=10000) and using smaller client/server timeouts. The bug is
present if the request is not aborted on timeout but instead continues
until its proper HTTP 200 termination.
This has been introduced by the following commit :
85eabfbf67
MEDIUM: mux-quic: Don't expect data from server as long as request is unfinished
This must be backported up to 2.8.
debug() converter used to resolve sink names during parsing time. Because
of this, we were unable to specify sink names that were defined after
the debug() converter was placed.
Like in the previous commit, let's implement proper postparsing for the
debug() converter, in order to be able to use sink names that are about
to be defined later in the config file.
A previous known limitation about traces was that parsing was performed on
the fly, meaning that when using "sink" keyword, only sinks that were
either internal or previously defined in the config could be used. Indeed,
it was not possible to use a ring section defined AFTER the traces section
when using the 'sink' keyword from traces.
This limitation was also mentioned in the config file.
Let's get rid of that limitation by implementing proper postparsing for
the sink parameter in traces section. To do this, make use of the new
sink_find_early() helper to start referencing sink by their names even
if they don't exist yet (if they are about to be defined later in the
config).
Traces commands on the cli are not concerned by this change.
sink_find_early() is a convenient function that can be used instead of
sink_find() during parsing time in order to try to find a matching
sink even if the sink is not defined yet.
Indeed, if the sink is not defined, sink_find_early() will try to create
it and mark it as forward-declared. It will also save information from
the caller to better identify it in case of errors.
If the sink happens to be found in the config, it will transition from
forward-declared type to its final type. Otherwise, it means that the
sink was not found in the config; in this case, during postresolve, we
raise an error to indicate that the sink was not found in the configuration.
It should help solve postresolving issues with rings, because for now only
log targets implement proper ring postresolving, but rings may be used
at different places in the code, such as the debug() converter or the
"traces" section.
Patch 64a77e3ea5 disabled the CRL check when no CRL file was provided, but
it only did it on the bind side. Add the same fix on the server context
initialization side.
This makes it possible to enable peer verification (verify required) on
a server using TLS without having to provide a CRL file.
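For example, peer verification can now be enabled on a server without
providing a crl-file (sketch):

  backend be_tls
      server srv1 192.0.2.10:443 ssl verify required ca-file /etc/haproxy/ca.pem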
Out-of-order STREAM ACKs are buffered in their related streambuf tree. On
insertion, overlapping or contiguous ranges are merged together. The
total size of buffered ack range is stored in <room> streambuf member
and reported to QUIC MUX layer on streambuf release. The objective is to
ensure QUIC MUX layer can allocate Tx buffers conveniently to preserve a
good transfer throughput.
Streamdesc is the overall container of many streambufs. It may also be
released when its upper QCS instance is freed, after all stream data
have been emitted. In this case, the active streambuf is also released
via custom code. However, in this code path, <room> was not reported to
the QUIC MUX layer.
This bug caused a wrong estimation of the QUIC MUX txbuf window, with
bytes remaining even after all ACKs were received. This could cause
transfer freezes on other connection streams, with RESET_STREAM emission
on client timeout.
To fix this, reuse the existing qc_stream_buf_release() function on
streamdesc release. This ensures that notify_room is correctly used.
No need to backport.
To properly decount out-of-order acked data range, contiguous or
overlapping ranges are first merged before their insertion in a tree.
The first step ensures that a newly reported range is not completely
covered by the existing tree ranges. However, one of the conditions was
incorrect. Fix this to ensure that the final range tree does not contain
duplicated entries.
The impact of this bug is unknown. However, it may have allowed the
insertion of overlapping ranges, which could in turn cause an error in
QUIC MUX txbuf window, with a possible transfer freeze.
No need to backport.
To execute sample fetches and converters from Lua, the hlua API leverages
the sample API. Prior to executing the sample func, the arg checker is
called from hlua_run_sample_{fetch,conv}() to detect potential errors.
However, hlua_run_sample_{fetch,conv}() both pass NULL as <err> argument,
but it is wrong for two reasons. First, we miss an opportunity to report
precise error messages to help the user know what went wrong during the
check. More importantly, some val check functions consider that the
<err> pointer is never NULL. This is the case for example with
check_crypto_hmac(). Because of this, when such val check functions
encounter an error, they will crash the process because they will try
to de-reference NULL.
This bug was discovered and reported by GH user @JB0925 on #2745.
Perhaps val check functions should make sure that the provided <err>
pointer is not NULL prior to de-referencing it. But since there are
multiple occurrences found in the code and the API isn't clear about that,
it is easier to fix the hlua part (the caller) for now.
To fix the issue, let's always provide a valid <err> pointer when
leveraging the val_arg() check function pointer, and make use of it in
case of error to report a relevant message to the user before freeing it.
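A condensed sketch of the caller-side fix (simplified from the actual
hlua code):

  char *err = NULL;

  /* always pass a valid <err>: some check functions, e.g.
   * check_crypto_hmac(), dereference it unconditionally */
  if (f->val_args && !f->val_args(args, &err)) {
          lua_pushfstring(L, "error in arguments: %s",
                          err ? err : "invalid argument");
          ha_free(&err);
          WILL_LJMP(lua_error(L));
  }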
It should be backported to all stable versions.
hlua_ctx_renew() is called from unsafe places where the caller doesn't
expect it to LJMP. However, hlua_ctx_renew() makes use of Lua library
functions that could potentially raise errors, such as lua_newthread(),
and it does nothing to catch errors. Because of this, haproxy could
unexpectedly crash. This was discovered and reported by GH user
@JB0925 on #2745.
To fix the issue, let's simply make hlua_ctx_renew() safe by applying
the same logic implemented for hlua_ctx_init() or hlua_ctx_destroy(),
which is catching Lua errors by leveraging SET_SAFE_LJMP_PARENT() helper.
It should be backported to all stable versions.
Now that 'do-log' action may be used for all existing action contexts,
let's add some tests in reg-tests/log/log_profile.vtc to ensure it works
as expected. quic-initial is not tested as it may not be built in
depending on build options.
Thanks to the two previous commits, we can now expose the do-log action
on all available action contexts, including the new quic-init context.
Each context is responsible for exposing the do-log action by registering
the relevant log steps, saving the identifier, and storing it in the
rule's context so that do_log_action() automatically uses it to produce
the log during runtime.
To use the feature, simply use "do-log" (without argument) on an action
directive, for example:
tcp-request connection do-log
As mentioned before, each context where the action is exposed has its own
log step identifier. Currently known identifiers are:
quic-initial: quic-init
tcp-request connection: tcp-req-conn
tcp-request session: tcp-req-sess
tcp-request content: tcp-req-cont
tcp-response content: tcp-res-cont
http-request: http-req
http-response: http-res
http-after-response: http-after-res
Thus, these "additional" logging steps can be used as-is under log-profile
section (after "on" keyword). However, although the parser will accept
them, it makes no sense to use them with the "log-steps" proxy keyword,
since the only path for these origins to trigger a log generation is
through the explicit use of "do-log" action.
This need was described in GH #401; it should help to conditionally
trigger logs using ACLs at specific key points, and may either be used
alone or combined with "log-steps" to add additional log "trackers" during
transaction handling.
Documentation was updated and some examples were added.
Function may be used from places where per-context actions are usually
registered (tcp_act.c, http_act.c, quic_rules.c.. to name a few) in
order to expose the do_log() action.
do_log() is quite similar to sess_log() or strm_log(), except that it
may be called at any time during session handling in an opportunistic
way as long as the session exists (the stream may or may not exist).
Also, it will try to emit the log as INFO by default, unless set-log-level
is used on the stream, or error origin flag is set.
This commit is the last one of a series whose objective is to restore
QUIC transfer throughput performance to the state prior to the recent
QUIC MUX buffer allocator rework.
This gain is obtained by reporting received out-of-order ACK data range
to the QUIC MUX which can then decount room in its txbuf window. This is
implemented in QUIC streamdesc layer by adding a new invokation of
notify_room callback. This is done in qc_stream_buf_store_ack(), which
handles out-of-order ACK data ranges.
The previous commit introduced merging of overlapping ACK data ranges. As
such, it's easy to report only the newly acknowledged data range.
As with in-order ACKs, this new notification is only performed on
released streambuf. As such, when a streambuf instance is released,
notify_room notification now also reports the total length of
out-of-order ACK data range currently stored. This value is stored in a
new streambuf member <room> to avoid unnecessary tree lookup.
This <room> member also serves on in-order ACK notification to reduce
the notified room. This prevents reporting invalid values when
overlapping ranges are treated first out-of-order and then in-order,
which would cause an invalid QUIC MUX txbuf window value.
After this change has been implemented, performance has been
significantly improved, both with ngtcp2-client rate usage and on
interop goodput test. These values are now similar to the rate observed
on older haproxy version before QUIC MUX buffer allocator rework.
Transfer throughput was deteriorated since the recent rework of the QUIC
MUX txbuf allocator. This was partially restored with the commit to
decount individual in-order ACKs from the MUX buffer window.
To fully retrieve the old performance level, all ACKs must be decounted
when handled by the QUIC streamdesc layer, even out-of-order ranges.
However, this is not easily implemented as several ranges may exist in
parallel with overlap on the underlying data. It would cause
miscalculation for QUIC MUX buffer window if such ranges were blindly
reported.
The proper solution is to first implement merge of contiguous or
overlapping ACK data ranges to reduce the number of stored ranges to the
minimal. This is the purpose of this patch. This is implemented in a new
static function named qc_stream_buf_store_ack() into streamdesc layer.
The merge algorithm is simple enough. First, it ensures the newly added
range is not already fully covered by a preexisting entry. Then, it
checks if there is contiguity/overlap with one or several ranges
starting at the same or a greater offset. If true, the newly added entry
is extended to cover them all, and all contiguous/overlapped ranges are
removed. Finally, if there is contiguity or overlap with an entry
starting at a smaller offset, no new range is instantiated and instead
the smaller offset is extended.
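The decision logic can be sketched as follows for a new range
[off, off+len), with <prev> and <next> standing for the tree neighbours
around the insertion point (simplified; helpers are illustrative):

  /* 1. nothing to do if fully covered by an existing entry */
  if (prev && off >= prev->off && off + len <= prev->off + prev->len)
          return;

  /* 2. swallow contiguous/overlapping entries starting at the same
   * or a greater offset, extending the new range over them */
  while (next && next->off <= off + len) {
          if (next->off + next->len > off + len)
                  len = next->off + next->len - off;
          next = delete_and_next(next);   /* illustrative helper */
  }

  /* 3. contiguity/overlap with a smaller offset: extend that entry
   * instead of inserting a new one */
  if (prev && prev->off + prev->len >= off)
          prev->len = MAX(prev->len, off + len - prev->off);
  else
          insert_range(off, len);         /* illustrative helper */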
Now that contiguous or overlapping ranges cannot exist anymore, the ACK
data ranges tree instantiation can use EB_ROOT_UNIQUE.
Outside of the longer term objective which is to decount out-of-order
ACKs from MUX txbuf window, this commit could also improve some
performance and/or memory usage for connections where stream data
fragmentation and packet reordering is high.
QUIC streamdesc layer is responsible for handling ACK reception for
streams. It removes stream data from the underlying buffers on ACK
reception.
Streamdesc layer treats ACKs in order at the stream level. Out-of-order
ACKs are buffered in a tree until they can be handled, once older data
are acknowledged. Previously, the qf_stream instance which comes from
the quic_tx_packet was used as the tree node to buffer such ranges.
Introduce a new type dedicated to represent out of order stream ack data
range. This type is named qc_stream_ack. It contains only minimal info
relative to the acknowledged stream data range.
This allows reducing the size of the frequently used quic_frame with the
removal of the tree node from qf_stream. Another side effect of this change
is that quic_frame instances are now always released immediately on ACK
reception, both in-order and out-of-order. This allows to also release the
quic_tx_packet instance, which should reduce memory consumption.
The drawback of this change is that qc_stream_ack instance must be
allocated on out-of-order ACK reception. As such, qc_stream_desc_ack()
may fail if an error happens on allocation. For the moment, such an error
is silently recovered up to qc_treat_rx_pkts() by dropping the
received packet containing the ACK frame. In the future, it may be
useful to close the connection, as this error may only happen under low
memory conditions.
Recently, a new allocation mechanism was implemented for Tx buffers used
by QUIC MUX. Now, underlying congestion window size is used to determine
if it is still possible or not to allocate a new buffer when necessary.
This mechanism has rendered the QUIC stack more flexible. However, it
also brought some performance degradation, with longer transfer times in
certain environments. It was first discovered in the measurement results
of the interop. It can also easily be reproduced using the following
ngtcp2-client example which forces a very small congestion window due to
frequent losses:
$ ngtcp2-client -q --no-quic-dump --no-http-dump --exit-on-all-streams-close -r 0.1 127.0.0.1 20443 "https://[::]:20443/?s=10m"
This performance decrease is caused by the allocator which is now too
strict. It may frequently cause buffer underruns at the MUX layer when
the congestion window is too small, as new buffers cannot be allocated
until the current one is fully acknowledged. This results in transfers
with very bad throughput. The objective of this new series of
patches is to relax some restrictions to permit QUIC MUX to allocate new
buffers more quickly, while preserving the initial limitation based on
congestion window size.
An interesting method for this is to notify the QUIC MUX about newly
available room on individual ACK reception, without waiting for the full
buffer acknowledgement. This is easily implemented by adding a new
notify_room invocation in the QUIC streamdesc layer on ACK reception.
However, ACK reception is handled in order at the stream level. Out-of-order
ACKs are buffered and are not decounted for now. This will be
implemented in a future commit.
Note that for a single buffer instance, data can in parallel be written
by QUIC MUX and removed on ACK reception. This could cause the room
notification to the QUIC MUX layer to report invalid values. As such,
ACK reception is only accounted for released buffers. This ensures that
such buffers won't receive any new data. At the same time, buffer room
is notified on the release operation as it does not need acknowledgement.
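The guiding rule can be summed up like this (sketch, illustrative names):

  /* only decount ACKs against the MUX txbuf window once the buffer
   * was released: no new data can be appended to it anymore, so the
   * room value cannot race with ongoing writes */
  if (streambuf_is_released(buf))          /* illustrative check */
          notify_room(stream_desc, acked); /* per-ACK decount */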
This commit improved performance for the ngtcp2-client scenario above.
However, it is not yet sufficient for the interop goodput test.
quic_frame is the type used to represent frames emitted in a QUIC Tx
packet. Each frame is attached to a packet, and can also be linked to
other frames from the same packet, or duplicated frames for
retransmission. As such, quic_frame free operation is a tedious process.
qc_release_frm() has been implemented to ensure quic_frame is always
properly freed after detaching from all its list attach point. One
particular point is to ensure that when a frame is released, the frame
origin and all origin copies, including the current <frm> are flagged as
acked and detached from the reflist. Add a BUG_ON() to ensure this loop
is properly conducted when dealing with the current <frm> instance.
Most of the time STREAM frames emitted by QUIC MUX have some data in it.
However, it is possible to use an empty frame when a delayed FIN must be
transferred.
Recently, QUIC MUX send callback notification has been refactored. Now,
this callback is blindly called by quic_conn lower layer each time a
STREAM frame is built into a newly Tx packet. QUIC MUX is responsible to
ensure the notified frame corresponds to newly emitted data or
retransmission. Offsets are used for this comparison, but this requires
special care for empty FIN frames.
Sadly, the comparison written to determine if an empty FIN frame was
sent for the first time or retransmitted is not correct. This caused
such frames to always be dismissed as retransmissions in the QUIC MUX
sent callback. This prevented the related QCS instance from being removed
from the send_list, causing qcc_io_send() to retry a new emission. This was
finally interrupted by the BUG_ON() assertion to prevent an infinite
loop.
Fix this crash by updating the condition in QUIC MUX send callback. For
empty STREAM frame, it is sufficient to check if QC_SF_FIN_STREAM was
already removed or not to detect a retransmission. Indeed, empty STREAM
frames are never used outside of delayed FIN reporting.
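The corrected condition boils down to this (sketch):

  if (!frm_len) {
          /* empty STREAM frames only ever carry a delayed FIN: if
           * QC_SF_FIN_STREAM is still set this is the first emission,
           * otherwise it is a retransmission to be ignored */
          if (!(qcs->flags & QC_SF_FIN_STREAM))
                  return;
          qcs->flags &= ~QC_SF_FIN_STREAM;
          /* ... account the FIN, remove qcs from the send_list ... */
  }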
No need to backport. This crash was introduced in the current dev branch
by the following commit.
d7f4e5abf0
MEDIUM: quic: strengthen MUX send notification
Released version 3.1-dev9 with the following main changes :
- MINOR: tools: add minimal file name management
- CLEANUP: stick-table: make the file location point to a global file name
- MINOR: proxy: use the global file names for conf->file
- CLEANUP: cfgparse: factor proxy vs log-forward collisions
- BUG/MINOR: cfgparse: detect another uncaught case of duplicate defaults
- MINOR: proxy: add a list of orphaned defaults sections
- MEDIUM: cfgparse: drop duplicate named defaults sections after use
- OPTIM: cfgparse: speed up duplicate server detection
- MEDIUM: cfgparse: warn about deprecated use of duplicate server names
- BUG/MINOR: server: shut down streams under thread isolation
- BUG/MINOR: proxy: also make the cli and resolvers use the global name
- REGTESTS: log: fix log-profile.vtc
- MEDIUM: mailers: warn about deprecated legacy mailers
- BUG/MEDIUM: cli: Be sure to catch immediate client abort
- DEV: flags/applet: decode appctx flags
- BUG/MEDIUM: cli: Deadlock when setting frontend maxconn
- MINOR: log: fix indent in strm_log()
- MINOR: log: introduce extra log profile steps
- MINOR: log: handle extra log origins in _process_send_log_override()
- MINOR: log: introduce log_orig flags
- MINOR: log: explicitly handle extra log origins as error when relevant
- MINOR: log: support extra log origins for '%OG' alias
- MINOR: proxy: add log_steps struct member
- MINOR: log: introduce "log-steps" proxy keyword
- MINOR: log: add log_orig_proxy() helper function
- MEDIUM: log: consider log-steps proxy setting for existing log origins
- DOC: config: document proxy "log-steps" keyword
- REGTESTS: add a test for proxy "log-steps"
- Revert "BUG/MINOR: server: shut down streams under thread isolation"
- MINOR: task: define two new one-shot events for use with WOKEN_OTHER or MSG
- BUG/MEDIUM: stream: make stream_shutdown() async-safe
- BUG/MINOR: server: make sure the HMAINT state is part of MAINT
- BUG/MINOR: queue: make sure that maintenance redispatches server queue
- MINOR: server: make srv_shutdown_sessions() call pendconn_redistribute()
- BUILD: tools: only include execinfo.h for the real backtrace() function
- MINOR: tools: do not attempt to use backtrace() on linux without glibc
- OPTIM: channel: speed up co_getline()'s search of the end of line
- OPTIM: stconn: Don't pretend mux have more data to deliver on EOI/EOS/ERROR
- BUG/MINOR: mcli: Pretend the mux have more data to deliver between two commands
- MINOR: action: Export release_expr_int_action() release function
- MINOR: stream: Rely on a per-stream max connection retries value
- MINOR: stream: Support dynamic changes of the number of connection retries
- MINOR: stream/stats: Expose the current number of streams in stats
- MINOR: stream/stats: Expose the total number of streams ever created in stats
- BUG/MINOR: cfgparse-global: fix allowed args number for setenv
- MINOR: cfgparse-global: add dedicated parser for *env keywords
- MINOR: mux-quic: complete Tx infos for QCS dump
- MINOR: quic: ensure txbuf realloc is only performed on empty buffer
- MINOR: mux-quic: strengthen qcs_send_metadata() usage
- MINOR: quic: remove unneeded notification of txbuf room
- MINOR: quic: refactor MUX send notification
- MEDIUM: quic: strengthen MUX send notification
- MINOR: quic: refactor STREAM room notification
- MINOR: quic: do not remove qc_stream_desc automatically on ACK handling
- MINOR: quic: store streambuf in a streamdesc tree
- MINOR: quic: move buffered ACK to streambuf
- MEDIUM: quic: handle out-of-order ACK at streamdesc layer
- MEDIUM: quic: refactor buffered STREAM ACK consuming
- BUG/MEDIUM: queue: always dequeue the backend when redistributing the last server
- MINOR: config/trace: Add a 'traces' section to declare debug traces
- MINOR: trace: Be able to chain commands for a source in one line
- MINOR: tcpcheck: Add support for an option host header value for httpchk option
- BUG/MINOR: mux-h1: Fix condition to set EOI on SE during zero-copy forwarding
- MINOR: mux-h1: Use a dedicated function to conditionnaly set EOI flag on SE
- BUG/MINOR: http-ana: Disable fast-fwd for unfinished req waiting for upgrade
- BUG/MINOR: mux-quic: fix crash on qcc_init() early return
- BUG/MINOR: quic: fix trace on releasing STREAM frame after ack
Fix a NULL argument passed to qc_release_frm(). This allows giving more
context in the traces inside it. Note that no crash occurred as QUIC
traces always check the validity of the first arg before dereferencing it.
No backport needed.
qcc_release() may be used in case qcc_init() cannot complete. In this
case, connection instance is NULL. As such, it cannot be dereferenced
without testing it first.
This should fix github coverity report #2739.
No backport needed.
If a request is waiting for a protocol upgrade but it is not finished, the
data fast-forwarding is disabled. Otherwise, the request analyzers will miss
the end of the message.
This case is possible since the commit 01fb1a54 ("BUG/MEDIUM: mux-h1/mux-h2:
Reject upgrades with payload on H2 side only"). Indeed, before, a protocol
upgrade was not allowed for request with payload. But it is now possible and
this comes with a side-effect. It is not really satisfying but for now there
is no other way to sync the muxes and the applicative stream. It seems to be
a reasonable fix for now, waiting for a deeper refactoring.
This patch must be backported with the commit above.
The same conditions are evaluated in h1_process_demux() and h1_fastfwd() to
know if SE_FL_EOI flag must be set or not on the sedesc. So now, a dedicated
function is used.
During zero-copy data forwarding, the producer must set the EOI flag on the SE
when the end of the message is reached. It is already done, but there is a case
where this flag is set while it should not be. When a request wants to perform a protocol
upgrade and it is waiting for the server response, the flag must not be set
because the HTTP message is finished but some data are possibly still expected,
depending on the server response. On a 101-switching-protocol, more data will be
sent because the producer is switch to TUNNEL state.
So, now, the right condition is used. In DONE state, SE_FL_EOI flag is set on the sedesc iff:
- it is the response
- it is the request and the response is also in DONNE state
- it is a request but no a protocol upgrade nor a CONNECT
This patch must be backported as far as 2.9.