This will make the pools size and alignment automatically inherit
the type declaration. It was done like this:
sed -i -e 's:DECLARE_POOL(\([^,]*,[^,]*,\s*\)sizeof(\([^)]*\))):DECLARE_TYPED_POOL(\1\2):g' $(git grep -lw DECLARE_POOL src addons)
sed -i -e 's:DECLARE_STATIC_POOL(\([^,]*,[^,]*,\s*\)sizeof(\([^)]*\))):DECLARE_STATIC_TYPED_POOL(\1\2):g' $(git grep -lw DECLARE_STATIC_POOL src addons)
81 replacements were made. The only remaining ones are those which set
their own size without depending on a structure. The few ones with an
extra size were manually handled.
It also means that the requested alignments are now checked against the
type's. Given that none is specified for now, no issue is reported.
It was verified with "show pools detailed" that the definitions are
exactly the same, and that the binaries are similar.
Previously quic_conn <target> member was used to determine if quic_conn
was used on the frontend (as server) or backend side (as client). A new
helper function can now be used to directly check flag
QUIC_FL_CONN_IS_BACK.
This reduces the dependency between quic_conn and their relative
listener/server instances.
QUIC emission can use GSO to emit multiple datagrams with a single
syscall invokation. However, this feature relies on several kernel
parameters which are checked on haproxy process startup.
Even if these checks report no issue, GSO may still be unable due to the
underlying network adapter underneath. Thus, if a EIO occured on
sendmsg() with GSO, listener is flagged to mark GSO as unsupported. This
allows every other QUIC connections to share the status and avoid using
GSO when using this listener.
Previously, listener flag was checked for every QUIC emission. This was
done using an atomic operation to prevent races. Improve this by
duplicating GSO unsupported status as the connection level. This is done
on qc_new_conn() and also on thread rebinding if a new listener instance
is used.
The main benefit from this patch is to reduce the dependency between
quic_conn and listener instances.
This patch impacts the QUIC frontends. It reverts this patch
MINOR: quic-be: add a "CC connection" backend TX buffer pool
which adds <pool_head_quic_be_cc_buf> new pool to allocate CC (connection closed state)
TX buffers with bigger object size than the one for <pool_head_quic_cc_buf>.
Indeed the QUIC backends must be able to send at least 1200 bytes Initial packets.
For now on, both the QUIC frontends and backend use the same pool with
MAX(QUIC_INITIAL_IPV6_MTU, QUIC_INITIAL_IPV4_MTU)(1252 bytes) as object size.
Replace all calls to qc_is_listener() (resp. !qc_is_listener()) by calls to
objt_listener() (resp. objt_server()).
Remove qc_is_listener() implement and QUIC_FL_CONN_LISTENER the flag it
relied on.
This bug fix completes this patch which was not sufficient:
MINOR: quic-be: Allow sending 1200 bytes Initial datagrams
This patch could not allow the build of well formed Initial packets coalesced to
others (Handshake) packets. Indeed, the <padding> parameter passed to qc_build_pkt()
is deduced from a first value: <padding> value and must be set to 1 for
the last encryption level. As a client, the last encryption level is always
the Handshake encryption level. But <padding> was always set to 1 for a QUIC
client, leading the first Initial packet to be malformed because considered
as the second one into the same datagram.
So, this patch sets <padding> value passed to qc_build_pkt() to 1 only when there
is no last encryption level at all, to allow the build of Initial only packets
(not coalesced) or when it frames to send (coalesced packets).
No need to backport.
- Add ->retry_token and ->retry_token_len new quic_conn struct members to store
the retry tokens. These objects are allocated by quic_rx_packet_parse() and
released by quic_conn_release().
- Add <pool_head_quic_retry_token> new pool for these tokens.
- Implement quic_retry_packet_check() to check the integrity tag of these tokens
upon RETRY packets receipt. quic_tls_generate_retry_integrity_tag() is called
by this new function. It has been modified to pass the address where the tag
must be generated
- Add <resend> new parameter to quic_pktns_discard(). This function is called
to discard the packet number spaces where the already TX packets and frames are
attached to. <resend> allows the caller to prevent this function to release
the in flight TX packets/frames. The frames are requeued to be resent.
- Modify quic_rx_pkt_parse() to handle the RETRY packets. What must be done upon
such packets receipt is:
- store the retry token,
- store the new peer SCID as the DCID of the connection. Note that the peer will
modify again its SCID. This is why this SCID is also stored as the ODCID
which must be matched with the peer retry_source_connection_id transport parameter,
- discard the Initial packet number space without flagging it as discarded and
prevent retransmissions calling qc_set_timer(),
- modify the TLS cryptographic cipher contexts (RX/TX),
- wakeup the I/O handler to send new Initial packets asap.
- Modify quic_transport_param_decode() to handle the retry_source_connection_id
transport parameter as a QUIC client. Then its caller is modified to
check this transport parameter matches with the SCID sent by the peer with
the RETRY packet.
This easy to understand patch is not intrusive at all and cannot break the QUIC
listeners.
The QUIC client MUST always pad its datagrams with Initial packets. A "!l" (not
a listener) test OR'ed with the existing ones is added to satisfy the condition
to allow the build of such datagrams.
There is no need to limit the size of the TX buffer to QUIC_MIN_CC_PKTSIZE bytes
when the connection is in closing state. There is already a test which limits the
number of bytes to be used from this TX buffer after this useless test removed.
It limits this number of bytes to the size of the TX buffer itself:
if (end > (unsigned char *)b_wrap(buf))
end = (unsigned char *)b_wrap(buf);
This is exactly what is needed when the connection is in closing state. Indeed,
the size of the TX buffers are limited to reduce the memory usage. The connection
only needs to send short datagrams with at most 2 packets with a CONNECTION_CLOSE*
frames. They are built only one time and backed up into small TX buffer allocated
from a dedicated pool.
The size of this TX buffer is QUIC_MAX_CC_BUFSIZE which depends on QUIC_MIN_CC_PKTSIZE:
#define QUIC_MIN_CC_PKTSIZE 128
#define QUIC_MAX_CC_BUFSIZE (2 * (QUIC_MIN_CC_PKTSIZE + QUIC_DGRAM_HEADLEN))
This size is smaller than an MTU.
This patch should be backported as far as 2.9 to ease further backports to come.
A QUIC client must be able to close a connection sending Initial packets. But
QUIC client Initial packets must always be at least 1200 bytes long. To reduce
the memory use of TX buffers of a connection when in "closing" state, a pool
was dedicated for this purpose but with a too much reduced TX buffer size
(QUIC_MAX_CC_BUFSIZE).
This patch adds a "closing state connection" TX buffer pool with the same role
for QUIC backends.
This is an old bug which was there since this commit:
MINOR: quic: Avoid zeroing frame structures
It seems QUIC_FT_CONNECTION_CLOSE was confused with QUIC_FT_CONNECTION_CLOSE_APP
which does not include a "frame type" field. This field was not initialized
(so with a random value) which prevent the packet to be built because the
packet builder supposes the packet with such frames are very short.
Must be backported as far as 2.6.
Replace ->li quic_conn pointer to struct listener member by ->target which is
an object type enum and adapt the code.
Use __objt_(listener|server)() where the object type is known. Typically
this is were the code which is specific to one connection type (frontend/backend).
Remove <server> parameter passed to qc_new_conn(). It is redundant with the
<target> parameter.
GSO is not supported at this time for QUIC backend. qc_prep_pkts() is modified
to prevent it from building more than an MTU. This has as consequence to prevent
qc_send_ppkts() to use GSO.
ssl_clienthello.c code is run only by listeners. This is why __objt_listener()
is used in place of ->li.
A new structure quic_tune has recently been defined. Its purpose is to
store global options related to QUIC. Previously, only the tunable to
toggle pacing was stored in it.
This commit moves several QUIC related tunable from global to quic_tune
structure. This better centralizes QUIC configuration option and gives
room for future generic options.
This patch is the direct follow-up of the previous one which refactor
STREAM frame encoding. Reuse the newly defined quic_strm_frm_fillbuf()
and quic_strm_frm_split() functions for CRYPTO frame encoding.
The code for CRYPTO and STREAM frames encoding should now be clearer as
it is mostly identical.
CRYPTO and STREAM frames encoding is similar. If payload is too large,
frame will be splitted and only the first payload part will be written
in the output QUIC packet. This process is complexified by the presence
of a variable-length integer Length field prior to the payload.
This commit aims at refactor these operations. Define two functions to
simplify the code :
* quic_strm_frm_fillbuf() which is used to calculate the optimal frame
length of a STREAM/CRYPTO frame with its payload in a buffer
* quic_strm_frm_split() which is used to split the frame payload if
buffer is too small
With this patch, both functions are now implemented for STREAM encoding.
STREAM and CRYPTO frames have a similar encoding format. In particular,
both of them have a variable-length integer Length field just before the
frame payload.
It is complex to determine the optimal Length value before copying the
payload data in the remaining buffer space. As such, helper functions
were implemented to calculate this. However, CRYPTO and STREAM frames
encoding implementation were not completely aligned, which renders the
code harder to follow.
The purpose of this commit is to simplify CRYPTO and STREAM frames
encoding. First, a new helper quic_int_cap_length() is defined which is
useful to determine the optimal buffer room available if prefixed by a
variable-length integer as Length field. Then, processing of both CRYPTO
and STREAM frames is now nearly identical, based on this new helper
function. Functions max_available_room() and max_stream_data_size() are
now unused and are removed.
Function max_stream_data_size() is used to determine the payload length
of a CRYPTO frame. It takes into account that the CRYPTO length field is
a variable length integer.
Implemented calcul was incorrect as it reserved too much space as a
frame header. This error is mostly due because max_stream_data_size()
reuses max_available_room() which also reserve space for a variable
length integer. This results in CRYPTO frames shorter of 1 to 2 bytes
than the maximum achievable value, which produces in the end datagram
shorter than the MTU.
Fix max_stream_data_size() implementation. It is now merely a wrapper on
max_available_room(). This ensures that CRYPTO frame encoding is now
properly optimized to use the MTU available.
This should be backported up to 2.6.
Long header packets have a mandatory Length field, which contains the
size of Packet number and payload, encoded as a variable-length integer.
Its value can thus only be determined after the payload size is known,
which depends on the remaining buffer space after this variable-length
field.
Packet payload are encoded in two steps. First, a list of input frames
is processed until the packet buffer is full. CRYPTO and STREAM frames
payload can be splitted if need to fill the buffer. Real encoding is
then performed as a second stage operation, first with Length field,
then with the selected frames themselves.
Before this patch, no space was reserved in the buffer for Length field
when attaching the frames to the packet. This could result in a error as
the packet payload would be too large for the remaining space.
In practice, this issue was rarely encounted, mostly as a side-effect
from another issue linked to CRYPTO frame encoding. Indeed, a wrong
calculation is performed on CRYPTO splitting, which results in frame
payload shorter by a few bytes than expected. This however ensured there
would be always enough room for the Length field and payload during
encoding. As CRYPTO frames are the only big enough content emitted with
a Long header packet, this renders the current issue mostly non
reproducible.
Fix the original issue by reserving some space for Length field prior to
frame payload calculation, using a maximum value based on the remaining
room space. Packet length is then reduced if needed when encoding is
performed, which ensures there is always enough room for the selected
frames.
Note that the other issue impacting CRYPTO frame encoding is not yet
fixed. This could result in datagrams with Long header packets not
completely extended to the full MTU. The issue will be addressed in
another patch.
This should be backported up to 2.6.
Implement a new method for QUIC pacing emission based on credit. This
represents the number of packets which can be emitted in a single burst.
After emission, decrement from the credit the number of emitted packets.
Several emission can be conducted in the same sequence until the credit
is completely decremented.
When a new emission sequence is initiated (i.e. under a new QMUX tasklet
invokation), credit is refilled according to the delay which occured
between the last and current emission context.
This new mechanism main advantage is that it allows to conduct several
emission in the same task context without having to wait between each
invokation. Wait is only forced if pacing is expired, which is now
equivalent to having a null credit.
Furthermore, if delay between two emissions sequence would have been
smaller than expected, credit is only partially refilled. This allows to
restart emission without having to wait for the whole credit to be
available.
On the implementation side, a new field <credit> is avaiable in
quic_pacer structure. It is automatically decremented on
quic_pacing_sent_done() invokation. Also, a new function
quic_pacing_reload() must be used by QUIC MUX when a new emission
sequence is initiated to refill credit. <next> field from quic_pacer has
been removed.
For the moment, credit is based on the burst configured via quic-cc-algo
keyword, or directly reported by BBR.
This should be backported up to 3.1.
This patch fixes the loss of information when computing the delivery rate
(quic_cc_drs.c) on links with very low latency due to usage of 32bits
variables with the millisecond as precision.
Initialize the quic_conn task with TASK_F_WANTS_TIME flag ask it to ask
the scheduler to update the call date of this task. This allows this task to get
a nanosecond resolution on the call date calling task_mono_time(). This is enabled
only for congestion control algorithms with delivery rate estimation support
(BBR only at this time).
Store the send date with nanosecond precision of each TX packet into
->time_sent_ns new quic_tx_packet struct member to store the date a packet was
sent in nanoseconds thanks to task_mono_time().
Make use of this new timestamp by the delivery rate estimation algorithm (quic_cc_drs.c).
Rename current ->time_sent member from quic_tx_packet struct to ->time_sent_ms to
distinguish the unit used by this variable (millisecond) and update the code which
uses this variable. The logic found in quic_loss.c is not modified at all.
Must be backported to 3.1.
A UDP datagram cannot be greater than 65535 bytes, as UDP length header
field is encoded on 2 bytes. As such, sendmsg() will reject a bigger
input with error EMSGSIZE. By default, this does not cause any issue as
QUIC datagrams are limited to 1.252 bytes and sent individually.
However, with GSO support, value bigger than 1.252 bytes are specified
on sendmsg(). If using a bufsize equal to or greater than 65535, syscall
could reject the input buffer with EMSGSIZE. As this value is not
expected, the connection is immediately closed by haproxy and the
transfer is interrupted.
This bug can easily reproduced by requesting a large object on loopback
interface and using a bufsize of 65535 bytes. In fact, the limit is
slightly less than 65535, as extra room is also needed for IP + UDP
headers.
Fix this by reducing the count of datagrams encoded in a single GSO
invokation via qc_prep_pkts(). Previously, it was set to 64 as specified
by man 7 udp. However, with 1252 datagrams, this is still too many.
Reduce it to a value of 52. Input to sendmsg will thus be restricted to
at most 65.104 bytes if last datagram is full.
If there is still data available for encoding in qc_prep_pkts(), they
will be written in a separate batch of datagrams. qc_send_ppkts() will
then loop over the whole QUIC Tx buffer and call sendmsg() for each
series of at most 52 datagrams.
This does not need to be backported.
If a packet build was asked to probe the peer with frames which have just
been acked, the frames build run by qc_build_frms() could be cancelled by
qc_stream_frm_is_acked() whose aim is to check that current frames to
be built have not been already acknowledged. In this case the packet build run
by qc_do_build_pkt() is not interrupted, leading to the build of an empty packet
which should be ack-eliciting.
This is a bug detected by the BUG_ON() statement in qc_do_build_pk():
BUG_ON(qel->pktns->tx.pto_probe &&
!(pkt->flags & QUIC_FL_TX_PACKET_ACK_ELICITING));
Thank you to @Tristan971 for having reported this issue in GH #2709
This is an old bug which must be backported as far as 2.6.
qc_prep_pkts() is a QUIC transport level function which encodes one or
several datagrams in a buffer before sending them. It returns the number
of encoded datagram. This is especially important when pacing is used to
limit packet bursts.
This datagram accounting was not trivial as qc_prep_pkts() used several
code paths depending on the condition of the current encoded packet.
Thus, there were several places were the local variable dgram_cnt could
have been incremented. This was implemented by the following commit :
commit 5cb8f8a6224db96f4386277c41ddae4a29a4130d
MINOR: quic: support a max number of built packet per send iteration
However, there is a bug due to a missing increment when all frames from
the current QEL have been encoded. In this case, the encoding continue
in the same datagram to coalesce a futur packet. However, if this is the
last QEL, encoding loop will then break. As first_pkt is not NULL,
qc_txb_store() is called outside but dgram_cnt is yet not incremented.
In particular, this causes qc_prep_pkts() to return 0 when there is only
small STREAM frames to emit for application QEL. In qc_send(), this is
interpreted as a value which prevents further emission for the current
invokation. Thus, it may hurts performance, both without and with
pacing.
To fix this, removing multiple dgram_cnt increment. Now, it is modified
only in a single place which should cover every case, and render the
code easier to validate.
The most notable case where the bug is visible is when using cubic with
pacing without any burst, with quic-cc-algo cubic(,1). First, transfer
bandwidth in average was suboptimal, with significant variation. Worst,
it could sometimes fall dramatically for a particular stream without
recovering before returning to an expected level on the next one.
No need to backport.
The ->app_limited member of the delivery rate struct (quic_cc_drs) aim is to
store the index of the last transmitted byte marked as application-limited
so that to track the application-limited phases. During these phases,
BBR must ignore delivery rate samples to properly estimate the delivery rate.
Without such a patch, the Startup phase could be exited very quickly with
a very low estimated bottleneck bandwidth. This had a very bad impact
on little objects with download times smaller than the expected Startup phase
duration. For such objects, with enough bandwith, BBR should stay in the Startup
state.
No need to be backported, as BBR is implemented in the current developement version.
Very few modifications: call ->on_transmit() and ->drs_on_transmit() congestion
control algorithm (quic_cc) callbacks from qc_send_ppkts() just after having
sents some packets.
qc_send_mux() has been extended previously to support pacing emission.
This will ensure that no more than one datagram will be emitted during
each invokation. However, to achieve better performance, it may be
necessary to emit a batch of several datagrams one one turn.
A so-called burst value can be specified by the user in the
configuration. However, some congestion control algos may defined their
owned dynamic value. As such, a new CC callback pacing_burst is defined.
quic_cc_default_pacing_burst() can be used for algo without pacing
interaction, such as cubic. It will returns a static value based on user
selected configuration.
Pacing will be implemented for STREAM frames emission. As such,
qc_send_mux() API has been extended to add an argument to a quic_pacer
engine.
If non NULL, engine will be used to pace emission. In short, no more
than one datagram will be emitted for each qc_send_mux() invokation.
Pacer is then notified about the emission and a timer for a future
emission is calculated. qc_send_mux() will return PACING error value, to
inform QUIC MUX layer that it will be responsible to retry emission
after some delay.
This commit is part of a adjustment on QUIC transport send API to
support pacing. Here, qc_send_mux() return type has been changed to use
a new enum quic_tx_err.
This is useful to explain different failure causes of emission. For now,
only two values have been defined : NONE and FATAL. When pacing will be
implemented, a new value would be added to specify that emission was
interrupted on pacing. This won't be a fatal error as this allows to
retry emission but not immediately.
Extend QUIC transport emission function to support a maximum datagram
argument. The purpose is to ensure that qc_send() won't emit more than
the specified value, unless it is 0 which is considered as unlimited.
In qc_prep_pkts(), a counter of built datagram has been added to support
this. The packet building loop is interrupted if it reaches a specified
maximum value. Also, its return value has been changed to the number of
prepared datagrams. This is reused by qc_send() to interrupt its work if
a specified max datagram argument value is reached over one or several
iteration of prepared/sent datagrams.
This change is necessary to support pacing emission. Note that ideally,
the total length in bytes of emitted datagrams should be taken into
account instead of the raw number of datagrams. However, for a first
implementation, it was deemed easier to implement it with the latter.
To prepare pacing support, qc_prep_pkts() exit path have been rewritten
to be easily modified. This is purely refactoring which should not have
any functional change :
* a dedicated error path has been added
* ensure qc_txb_store() is always called to finalize datagram on normal
exit path if first_pkt is not NULL. Needed to support breaking from
packet building loop in a easier way.
This bug arrived with this commit:
cdfceb10a MINOR: quic: refactor qc_prep_pkts() loop
which prevents haproxy from sending PING only packets/datagrams (some
packets/datagrams with only PING frame as ack-eliciting frames inside).
Such packets/datagrams are useful in rare cases during retransmissions
when one wants to probe the peer without exceeding the anti-amplification
limit.
Modify the condition passed to qc_build_pkt() to add padding to the current
datagram. One does not want to do that when probing the peer without ack-eliciting
frames passed as <frms> parameter. Indeed qc_build_pkt() calls qc_do_build_pkt()
which supports this case: if <probe> is true (probing required), qc_do_build_pkt()
handles the case where some padding must be added to a PING only packet/datagram.
This is the case when probing with an empty <frms> frame list of ack-eliciting
frames without exceeding the anti-amplification limit from qc_dgrams_retransmit().
Add some comments to qc_build_pkt() and qc_do_build_pkt() to clarify this
as this code is easy to break!
Thank you for @Tristan971 for having reported this issue in GH #2709.
Must be backported to 3.0.
Add a BUG_ON() to detect some malformed packets which are supposed to probe the
peer without being ack-eliciting: the peer would not acknowledged such packets.
QUIC streamdesc layer is responsible to handle reception of ACK for
streams. It removes stream data from the underlying buffers on ACK
reception.
Streamdesc layer treats ACK in order at the stream level. Out of order
ACKs are buffered in a tree until they can be handled on older data
acknowledgement reception. Previously, qf_stream instance which comes
from the quic_tx_packet was used as tree node to buffer such ranges.
Introduce a new type dedicated to represent out of order stream ack data
range. This type is named qc_stream_ack. It contains minimal infos only
relative to the acknowledged stream data range.
This allows to reduce size of frequently used quic_frame with the
removal of tree node from qf_stream. Another side effect of this change
is that now quic_frame are always released immediately on ACK reception,
both in-order and out-of-order. This allows to also release the
quic_tx_packet instance which should reduce memory consumption.
The drawback of this change is that qc_stream_ack instance must be
allocated on out-of-order ACK reception. As such, qc_stream_desc_ack()
may fail if an error happens on allocation. For the moment, such error
is silenly recovered up to qc_treat_rx_pkts() with the dropping of the
received packet containing the ACK frame. In the future, it may be
useful to close the connection as this error may only happens on low
memory usage.
Previous commit implement a refactor of MUX send notification from
quic_conn layer. With this new architecture, a proper callback is
defined for each qc_stream_desc instance.
This architecture change allows to simplify notification from quic_conn
layer. First, ensure the MUX callback to properly ignore retransmission
of an already emitted frame. Luckily, this can be handled easily by
comparing offsets and FIN status. Also, each QCS instance can now be
unregistered from send notification just prior qc_stream_desc releasing.
This ensures a QCS is never manipulated from quic_conn after its
emission ending. Both these changes render the send notification more
robust. As a nice effect, flag QUIC_FL_CONN_TX_MUX_CONTEXT can be
removed as it is now unneeded.
For STREAM emission, MUX QUIC generates one or several frames and emit
them via qc_send_mux(). Lower layer may use them as-is, or split them to
lower chunk to fit in a QUIC packet. It is then responsible to notify
the MUX to report the amount of data sent.
Previously, this was done via a direct call from quic_conn to MUX using
qcc_streams_sent_done(). Modify this to have a better isolation accross
layers. Define a send callback handled by the qc_stream_desc instance.
This allows the MUX to register each QCS instance individually to the
renamved qmux_ctrl_send() which replaces qcc_streams_sent_done().
At quic_conn layer, qc_stream_desc_send() can be used now. This is a
wrapper to qc_stream_desc layer to invoke the send callback if
registered.
This mechanism of qc_stream_desc callback should be extended later to
implement other notifications accross the QUIC stack.
When a STREAM frame is retransmitted, a check is performed to remove
range of data already acked from it. This is useful when STREAM frames
are duplicated and splitted to cover different data ranges. The newly
retransmitted frame contains only unacked data.
This process is performed similarly in qc_dup_pkt_frms() and
qc_build_frms(). Refactor the code into a new function named
qc_stream_frm_is_acked(). It returns true if frame data are already
fully acked and retransmission can be avoided. If only a partial range
of data is acknowledged, frame content is updated to only cover the
unacked data.
This patch does not have any functional change. However, it simplifies
retransmission for STREAM frames. Also, it will be reused to fix
retransmission for empty STREAM frames with FIN set from the following
patch :
BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM
As such, it must be backported prior to it.
This issue was reported by Ilya (@Chipitsine) when building haproxy against
aws-lc in GH #2663 where handshakeloss and handshakecorruption interop tests could
lead haproxy to crash after having built too short datagrams:
FATAL: bug condition "first_pkt->type == QUIC_PACKET_TYPE_INITIAL && (first_pkt->flags & (1UL << 0)) && length < 1200" matched at src/quic_tx.c:163
call trace(13):
| 0x55f4ee4dcc02 [ba d9 00 00 00 48 8d 35]: main-0x195bf2
| 0x55f4ee4e3112 [83 3d 2f 16 35 00 00 0f]: qc_send+0x11f3/0x1b5d
| 0x55f4ee4e9ab4 [85 c0 0f 85 00 f6 ff ff]: quic_conn_io_cb+0xab1/0xf1c
| 0x55f4ee6efa82 [48 c7 c0 f8 55 ff ff 64]: run_tasks_from_lists+0x173/0x9c2
| 0x55f4ee6f05d3 [8b 7d a0 29 c7 85 ff 0f]: process_runnable_tasks+0x302/0x6e6
| 0x55f4ee671bb7 [83 3d 86 72 44 00 01 0f]: run_poll_loop+0x6e/0x57b
| 0x55f4ee672367 [48 8b 1d 22 d4 1d 00 48]: main-0x48d
| 0x55f4ee6755e0 [b8 00 00 00 00 e8 08 61]: main+0x2dec/0x335d
This could happen after Handshake packet building failures which follow a successful
Initial packet into the same datagram. In this case, the datagram could be emitted
with a too short length (<1200 bytes).
To fix this, store the datagram only if the first packet is not an Initial packet
or if its length is big enough (>=1200 bytes).
Must be backported as far as 2.6.
In order to prepare the code for using Chacha20 with the EVP_AEAD API,
both quic_tls_hp_decrypt() and quic_tls_hp_encrypt() need an extra key
argument.
Indeed Chacha20 does not exists as an EVP_CIPHER in AWS-LC, so the key
won't be embedded into the EVP_CIPHER_CTX, so we need an extra parameter
to use it.
Some of the crypto functions used for headers protection in QUIC are
named with an "aes" name even thought they are not used for AES
encryption only.
This patch renames these "aes" to "hp" so it is clearer.
Add a sent bytes counter for each quic_conn instance. A secondary field
which only account bytes sent via GSO which is useful to ensure if this
is activated.
For the moment, these counters are reported on "show quic" but not
aggregated on proxy quic module stats.
UDP GSO on Linux is not implemented in every network devices. For
example, this is not available for veth devices frequently used in
container environment. In such case, EIO is reported on send()
invocation.
It is impossible to test at startup for proper GSO support in this case
as a listener may be bound on multiple network interfaces. Furthermore,
network interfaces may change during haproxy lifetime.
As such, the only option is to react on send syscall error when GSO is
used. The purpose of this patch is to implement a fallback when
encountering such conditions. Emission can be retried immediately by
trying to send each prepared datagrams individually.
To support this, qc_send_ppkts() is able to iterate over each datagram
in a so-called non-GSO fallback mode. Between each emission, a datagram
header is rewritten in front of the buffer which allows the sending loop
to proceed until last datagram is emitted.
To complement this, quic_conn listener is flagged on first GSO send
error with value LI_F_UDP_GSO_NOTSUPP. This completely disables GSO for
all future emission with QUIC connections using this listener.
For the moment, non-GSO fallback mode is activated when EIO is reported
after GSO has been set. This is the error reported for the veth usage
described above.
QUIC datagrams are encoded during emission via the function
qc_prep_pkts(). By default, if GSO is not used, each datagram is
prefixed by a metadata header which specify its length and address of
its first quic_tx_packet instance.
If GSO is activated, metadata header won't be inserted for datagrams
following the first one sent in a single syscall. Length field will
contain the total size of these datagrams. This allows to support both
GSO and non-GSO prepared datagram in the same Tx buffer.
qc_send_ppkts() is invoked just after datagrams encoding. It iterates
over each metadata header in Tx buffer to sent each datagram
individually. If length field is bigger than network MTU, GSO usage is
assumed and qc_snd_buf() GSO parameter will be set.
Another important point to note regarding GSO implementation is that
during datagram encoding, packets from the same datagram instance are
attached together. However, if using GSO, consecutive packets from
different datagrams are also linked, but without
QUIC_FL_TX_PACKET_COALESCED flag. This allows to properly update
quic_conn status with all sent packets in qc_send_ppkts(). Packets from
different datagrams are then unlinked to treat them separately when
receiving corresponding ACK frames.
Add <gso_size> parameter to qc_snd_buf(). When non-null, this specifies
the value for socket option SOL_UDP/UDP_SEGMENT. This allows to send
several datagrams in a single call by splitting data multiple times at
<gso_size> boundary.
For now, <gso_size> remains set to 0 by caller, as such there should not
be any functional change.
This issue was revealed by chacha20 interop test which very often fails with
ngtcp2 as client. This was due to the fact that 2 application level packets could
be coalesced into the same datagram as revealed by such a capture:
Frame 380: 255 bytes on wire (2040 bits), 255 bytes captured (2040 bits)
Point-to-Point Protocol
Internet Protocol Version 4, Src: 193.167.100.100, Dst: 193.167.0.100
User Datagram Protocol
QUIC IETF
QUIC Connection information
[Connection Number: 0]
[Packet Length: 187]
QUIC Short Header DCID=ec523fe99840f9c17c868a88d649147814 PKN=333
0... .... = Header Form: Short Header (0)
.1.. .... = Fixed Bit: True
..0. .... = Spin Bit: False
[...0 0... = Reserved: 0]
[.... .0.. = Key Phase Bit: False]
[.... ..00 = Packet Number Length: 1 bytes (0)]
Destination Connection ID: ec523fe99840f9c17c868a88d649147814
[Packet Number: 333]
Protected Payload […]: 43537d43a3c83e47db6891bd6a4fd7d7fa31941badcb87a540e843341d6a5e493ed4c3f6e6bbff094804ee0ab06830dc1a1bbf52ace4323d2e4f6e0bd4eea73df0721d2949d05a058d3afb974e814494ebf44d1375b0e7f1fd5bcf634cf32ef9a9b4018758a49d39a24c40
STREAM id=0 fin=0 off=294768 len=144 dir=Bidirectional origin=Client-initiated
Frame Type: STREAM (0x000000000000000e)
.... ...0 = Fin: False
.... ..1. = Len(gth): True
.... .1.. = Off(set): True
Stream ID: 0
.... .... .... .... .... .... .... .... .... .... .... .... .... .... .... ...0 = Stream initiator: Client-initiated (0)
.... .... .... .... .... .... .... .... .... .... .... .... .... .... .... ..0. = Stream direction: Bidirectional (0)
Offset: 294768
Length: 144
Stream Data […]: 63eef6ccee0d2ab602db3682d0e7cc09b72db6adc307d7699a211144b4b6c029cbed9beae1491c10a5fe0678d815a5303843d33c0593fedc9b64068fd0207e280d05aac2c0054fe9ab30857bc3669ee51d34756cfd2e098eb1ab31a03911f6a103f0a16f8f984d9861efdcf4433c
QUIC IETF
[Packet Length: 38]
QUIC Short Header DCID=ec523fe99840f9c17c868a88d649147814 PKN=334
0... .... = Header Form: Short Header (0)
.1.. .... = Fixed Bit: True
..0. .... = Spin Bit: False
[...0 0... = Reserved: 0]
[.... .0.. = Key Phase Bit: False]
[.... ..00 = Packet Number Length: 1 bytes (0)]
Destination Connection ID: ec523fe99840f9c17c868a88d649147814
[Packet Number: 334]
Protected Payload: b9c0e6dc3fc523574f8164c31b6cd156496212
PING
Frame Type: PING (0x0000000000000001)
PADDING Length: 2
Frame Type: PADDING (0x0000000000000000)
[Padding Length: 2]
On the peer side these two packet are considered as a unique one
because there may be only one packet by datagram at application encryption
level and reported as a STREAM frame encoding error:
I00000332 0xec523fe99840f9c17c868a88d649147814 con recv packet len=225
mask=b2c69c7827 sample=43a3c83e47db6891bd6a4fd7d7fa3194
I00000332 0xec523fe99840f9c17c868a88d649147814 pkt rx pkn=333 dcid=0xec523fe99840f9c17c868a88d649147814 type=1RTT k=0
I00000332 0xec523fe99840f9c17c868a88d649147814 frm rx 333 1RTT STREAM(0x0e) id=0x0 fin=0 offset=294768 len=144 uni=0
ngtcp2_conn_read_pkt: ERR_FRAME_ENCODING
I00000332 0xec523fe99840f9c17c868a88d649147814 pkt tx pkn=1531039643 dcid=0xae79dfc99d6c65d6 type=1RTT k=0
I00000332 0xec523fe99840f9c17c868a88d649147814 frm tx 1531039643 1RTT CONNECTION_CLOSE(0x1c) error_code=FRAME_ENCODING_ERROR(0x7) frame_type=0 reason_len=0 reason=[]
I00000332 0xec523fe99840f9c17c868a88d649147814 frm tx 1531039643 1RTT PADDING(0x00) len=9
Note here that the sum of the two packet sizes (from capture) is the same as the
packet length reporte by ngtcp2: 187+38 = 225. It also seems that wireshark tries
to parse as much as packet into the same datagram, regardless of the QUIC protocol
rules.
Haproxy traces revealed that this could happen at least when probing the peer.
The recent low level packet building modifications aim was to build
as much as datagrams into the same buffer. But it seems that the
probing packet case treatment has been broken. That said, I have not
identified impacted commit. This issue could be reproduced inside
interop test environment (no possible git bisection).
To fix this, rely on the <probe> variable value to identify if the last
packet built by qc_prep_pkts() was a probing one, then try to
coalesce some others packet into the same datagram if this was not the case.
Of course the test on <probe> value has to be done before setting it
for the next packet.
Must be backported to 3.0.
On quic_tx_packet allocation failure, it is possible to trigger BUG_ON()
crash on INITIAL packet building. This statement is responsible to
ensure INITIAL packets are padded to 1.200 bytes as required. If a
packet on higher encryption level allocation fails, PADDING frame cannot
properly encoded, despite the INITIAL packet properly built.
This crash happens due to qc_txb_store() invokation after quic_tx_packet
allocation failure to validate already built packets. However, this
statement is unneeded as qc_purge_tx_buf() is called just after. Simply
remove qc_txb_store() to fix this issue.
This was detected using -dMfail.
This should be backported up to 2.6.
To emit CONNECTION_CLOSE frame, a special buffer is allocated via
qc_txb_store(). This is due to QUIC_FL_CONN_IMMEDIATE_CLOSE flag.
However this flag is reset after qc_send_ppkts() invocation to prevent
reemission of CONNECTION_CLOSE frame.
qc_send() can invoke multiple times a series of qc_prep_pkts() +
qc_send_ppkts() to emit several datagrams. However, this may cause a
crash if on first loop a CONNECTION_CLOSE is emitted. On the next loop
iteration, QUIC_FL_CONN_IMMEDIATE_CLOSE is resetted, thus qc_prep_pkts()
will use the wrong buffer size as end delimiter. In some cases, this may
cause a BUG_ON() crash due to b_add() outside of buffer.
This bug can be reproduced by using a while loop of ngtcp2-client and
interrupting them randomly via Ctrl+C.
Here is the patch which introduce this regression :
cdfceb10ae136b02e51f9bb346321cf0045d58e0
MINOR: quic: refactor qc_prep_pkts() loop