Extend QUIC server configuration so that congestion algorithm and
maximum window size can be set on the server line. This can be achieved
using quic-cc-algo keyword with a syntax similar to a bind line.
This should be backported up to 3.3 as this feature is considered as
necessary for full QUIC backend support. Note that this relies on the
serie of previous commits which should be picked first.
Each QUIC congestion algorithm is defined as a structure with callbacks
in it. Every quic_conn has a member pointing to the configured
algorithm, inherited from the bind-conf keyword or to the default CUBIC
value.
Convert all these definitions to const. This ensures that there never
will be an accidental modification of a globally shared structure. This
also requires to mark quic_cc_algo field in bind_conf and quic_cc as
const.
Each quic_conn instance is stored in a global list. Its purpose is to be
able to loop over all known connections during "show quic".
Split this into two separate lists for frontend and backend usage.
Another change is that closing backend connections do not move into
quic_conns_clo list. They remain instead in their original list. The
objective of this patch is to reduce the contention between the two
sides.
Note that this prevents backend connections to be listed in "show quic"
now. This will be adjusted in a future patch.
QUIC CIDs are stored in a global tree. Prior to this patch, CIDs used on
both frontend and backend sides were mixed together.
This patch implement CID storage separation between FE and BE sides. The
original tre quic_cid_trees is splitted as
quic_fe_cid_trees/quic_be_cid_trees.
This patch should reduce contention between frontend and backend usages.
Also, it should reduce the risk of random CID collision.
All of the other bandwidth-limiting code stores limits and intermediate
(byte) counters as unsigned integers. The exception here is
freq_ctr_overshoot_period which takes in unsigned values but returns a
signed value. While this has the benefit of letting the caller know how
far away from overshooting they are, this is not currently leveraged
anywhere in the codebase, and it has the downside of halving the positive
range of the result.
More concretely though, returning a signed integer when all intermediate
values are unsigned (and boundaries are not checked) could result in an
overflow, producing values that are at best unexpected. In the case of
flt_bwlim (the only usage of freq_ctr_overshoot_period in the codebase at
the time of writing), an overflow could cause the filter to wait for a
large number of milliseconds when in fact it shouldn't wait at all.
This is a niche possibility, because it requires that a bandwidth limit is
defined in the range [2^31, 2^32). In this case, the raw limit value would
not fit into a signed integer, and close to the end of the period, the
`(elapsed * freq)/period` calculation could produce a value which also
doesn't fit into a signed integer.
If at the same time `curr` (the number of events counted so far in the
current period) is small, then we could get a very large negative value
which overflows. This is undefined behaviour and could produce surprising
results. The most obvious outcome is flt_bwlim sometimes waiting for a
large amount of time in a case where it shouldn't wait at all, thereby
incorrectly slowing down the flow of data.
Converting just the return type from signed to unsigned (and checking for
the overflow) prevents this undefined behaviour. It also makes the range
of valid values consistent between the input and output of
freq_ctr_overshoot_period and with the input and output of other freq_ctr
functions, thereby reducing the potential for surprise in intermediate
calculations: now everything supports the full 0 - 2^32 range.
When a multiplexer protocol is defined, it is now possible to specify the
ALPN it supports, in binary format. This info is optionnal. For now only the
h2 and the h1 multiplexers define an ALPN because this will be mandatory for
a fix. But this could be used in future for different purpose.
This patch will be mandatory for the next fix.
It's always a pain to guess the number of FDs that can be needed by
listeners, checks, threads, pollers etc. We have this estimate in
global.maxsock before calling set_global_maxconn(), but we lose it
the line after. Let's copy it into global.est_fd_usage and keep it.
This will be helpful to try to provide more accurate suggestions for
maxconn.
On the frontend side, QUIC transfer can be performed either via a
connection owned FD or multiplex on the listener one. When a quic_conn
is freed and converted to quic_conn_closed instance, its FD if open is
closed and all exchanges are now multiplex via the listener FD.
This is different for the backend as connections only has the choice to
use their owned FD. Thus, special care care must be taken when freeing a
connection and converting it to a quic_conn_closed instance. In this
case, qc_release_fd() is delayed to the quic_conn_closed release.
Furthermore, when the FD is transferred, its iocb and owner fields are
updated to the new quic_conn_closed instance. Without it, a crash will
occur when accessing the freed quic_conn tasklet. A newly dedicated
handler quic_conn_closed_sock_fd_iocb is used to ensure access to
quic_conn_closed members only.
Properly implement support for max-reuse server keyword. This is done by
adding a total count of streams seen for the whole connection. This
value is used in avail_streams callback.
Remove <ipv4> argument from qc_new_conn(). This parameter is unnecessary
as it can be derived from the family type of the addresses also passed
as argument.
The objective of this patch is to streamline qc_new_conn() usage so that
it is similar for frontend and backend sides.
Previously, several parameters were set only for frontend connections.
These arguments are replaced by a single quic_rx_packet argument, which
represents the INITIAL packet triggering the connection allocation on
the server side. For a QUIC client endpoint, it remains NULL. This usage
is consider more explicit.
As a minor change, <target> is moved as the first argument of the
function. This is considered useful as this argument determines whether
the connection is a frontend or backend entry.
Along with these changes, qc_new_conn() documentation has been reworded
so that it is now up-to-date with the newest usage.
quic_conn has two fields named <dcid> and <scid>. It may cause confusion
as it is not obvious how these fields are related to the connection
direction. Try to improve this by extending the documentation of these
two fields.
quic_newcid_from_hash64 is an external callback. If defined, it serves
as a CID method generation, as an alternative to the default random
implementation.
This mechanism was not correctly implemented on the backend side.
Indeed, <hash64> quic_conn member is only setted for frontend
connections. The simplest solution would be to properly define it also
for backend ones. However, quic_newcid_from_hash64 derivation is really
only useful for the frontend side for now. Thus, this patch disables
using it on the backend side in favor of the default random generator.
To implement this, quic_cid_generate() is splitted in two functions, for
both methods of CIDs generation. This is the responsibility of the
caller to select the proper method. On backend side, only random
implementation is now used.
The shard keyword is already used by the peers and on the server lines. And
it is unrelated with the session keys distribution. So instead of talking
about shard for the session key hashing, we now use the term "bucket".
The purpose of this new BUG_ON is beyond BUG_ON_HOT(). While BUG_ON_HOT()
is meant to be light but placed on very hot code paths, BUG_ON_STRESS()
might be heavy and only used under stress-testing, to try to detect early
that something bad is starting to happen. This one is not even type-checked
when not defined because we don't want to risk the compiler emitting the
slightest piece of code there in production mode, so as to give enough
freedom to the developers.
DEBUG_STRESS is currently used only to expose "stress-level". With this
patch, we go a bit further, by automatically forcing DEBUG_STRICT and
DEBUG_STRICT_ACTION to their highest values in order to enable all
BUG_ON levels, and make all of them result in a crash. In addition,
care is taken to always only have 0 or 1 in the macro, so that it can be
tested using "#if DEBUG_STRESS > 0" as well as "if (DEBUG_STRESS) { }"
everywhere.
The goal will be to ease insertion of extra tests for builds dedicated
to stress-testing that enable possibly expensive extra checks on certain
code paths that cannot reasonably be compiled in for production code
right now.
The target patch fixes a rare race condition which happen when a MUX IO
handler is working on a connection already moved into the purge list. In
this case, the handler will incorrectly moved back the connection into
the idle list.
To fix this, conn_delete_from_tree() was extended to remove flags along
with the connection from the idle list. This was performed when the
connection is moved into the purge list. However, it introduces another
issue related to the idle server connection accounting. Thus it is
necessary to revert it prior to the incoming newer fix.
This patch must be backported to every version where the original commit
is.
AWS-LC features are not easily tested with just the openssl version
constant. AWS-LC uses its own API versioning stored in the
AWSLC_API_VERSION constant.
This patch add the two awslc_api_atleast and awslc_api_before predicates
that help to check the AWS-LC API.
check-reuse-pool can only perform as expected if reuse policy on the
backend is set to aggressive or higher. Update the documentation to
reflect this and implement a server diag warning.
The current ACME scheduler suffers from problems due to the way the
tasks are stored:
- MT_LIST are not scalables when having a lot of ACME tasks and having
to look for a specific one.
- the acme_task pointer was stored in the ckch_store in order to not
passing through the whole list. But a ckch_store can be updated and
the pointer lost in the previous one.
- when a task fails, the ptr in the ckch_store was not removed because
we only work with a copy of the original ckch_store, it would need to
lock the ckchs_tree and remove this pointer.
This patch fixes the issues by removing the MT_LIST-based architecture,
and replacing it by a simple ebmbtree + rwlock design.
The pointer to the task is not stored anymore in the ckch_store, but
instead it is stored in the acme_tasks tree. Finding a task is done by
doing a lookup on this tree with a RDLOCK.
Instead of checking if store->acme_task is not NULL, a lookup is also
done.
This allow to remove the stuck "acme_task" pointer in the store, which
was preventing to restart an acme task when the previous failed for this
specific certificate.
Must be backported in 3.2.
During 0-RTT sessions, some server transport parameters are reused after having
been save from previous sessions. These parameters must not be reduced
when it resends them. The client must check this is the case when some early data
are accepted by the server. This is what is implemented by this patch.
Implement qc_early_tranport_params_validate() which checks the new server parameters
are not reduced.
Also implement qc_ssl_eary_data_accepted() which was not implemented for TLS
stack without 0-RTT support (for instance wolfssl). That said this function
was no more used. This is why the compilation against wolfssl could not fail.
This patch allows the use of 0-RTT feature on QUIC server lines with "allow-0rtt"
option. In fact 0-RTT is really enabled only if ssl_sock_srv_try_reuse_sess()
successfully manages to reuse the SSL session and the chosen application protocol
from previous connections.
Note that, at this time, 0-RTT works only with quictls and aws-lc as TLS stack.
(0-RTT does not work at all (even for QUIC frontends) with libressl).
This patch is required to make 0-RTT work. It modifies the prototype of
quic_build_post_handshake_frames() to send post handshake frames from a
list of frames in place of the application encryption level (used
as <qc->ael> local variable).
This patch does not modify at all the current QUIC stack behavior (even for
QUIC frontends). It must be considered as a preparation for the code
to come about 0-RTT support for QUIC backends.
This function is called for both TCP and QUIC connections to reuse SSL sessions
saved by ssl_sess_new_srv_cb() callback called upon new SSL session creation.
In addition to this, a QUIC SSL session must reuse the ALPN and some specific QUIC
transport parameters. This is what is added by this patch for QUIC 0-RTT sessions.
Note that for now on, ssl_sock_srv_try_reuse_sess() may fail for QUIC connections
if it did not managed to reuse the ALPN. The caller must be informed of such an
issue. It must not enable 0-RTT for the current session in this case. This is
impossible without ALPN which is required to start a mux.
ssl_sock_srv_try_reuse_sess() is modified to always succeeds for TCP connections.
For both TCP and QUIC connections, this is ssl_sess_new_srv_cb() callback which
is called when a new SSL session is created. Its role is to save the session to
be reused for the next sessions.
This patch modifies this callback to save the QUIC parameters to be reused
for the next 0-RTT sessions (or during SSL session resumption).
The already existing path_params->nego_alpn member is used to store the ALPN as
this is done for TCP alongside path_params->tps new quic_early_transport_params
struct used to save the QUIC transport parameters to be reused for 0-RTT sessions.
Implement quic_reuse_srv_params() whose role is to reuse the ALPN negotiated
during a first connection to a QUIC backend alongside its transport parameters.
Define quic_early_transport_params new struct for QUIC transport parameters
in relation with 0-RTT. This parameters must be saved during a first session to
be reused for 0-RTT next sessions.
qc_early_transport_params_cpy() copies the 0-RTT transport parameters to be
saved during a first connection to a backend. The copy is made from
a quic_transport_params struct to a quic_ealy_transport_params struct.
On the contrary, qc_early_transport_params_reuse() copies the transport parameters
to be reused for a 0-RTT session from a previous one. The copy is made
from a quic_early_transport_params strcut to a quic_transport_params struct.
Also add QUIC_EV_EARLY_TRANSP_PARAMS trace event to dump such 0-RTT
transport parameters from traces.
Add a per thread ist struct to srv_per_thread struct to store the QUIC token to
be reused for subsequent sessions.
Parse at packet level (from qc_parse_ptk_frms()) these tokens and store
them calling qc_try_store_new_token() newly implemented function. This is
this new function which does its best (may fail) to update the tokens.
Modify qc_do_build_pkt() to resend these tokens calling quic_enc_token()
implemented by this patch.
Rename ->data qf_new_token struct field to ->w_data to distinguish it from
->r_data new field used to parse the NEW_TOKEN frame. Indeed to build the
NEW_TOKEN we need to write it to a static buffer into the frame struct. To
parse it we only need to store the address of the token field into the
RX buffer.
On Initial packet parsing, a new quic_conn instance is allocated via
qc_new_conn(). Then a CID is allocated with its value derivated from
client ODCID. On CID tree insert, a collision can occur if another
thread was already parsing an Initial packet from the same client. In
this case, the connection is released and the packet will be requeued to
the other thread.
Originally, CID collision check was performed prior to quic_conn
allocation. This was changed by the commit below, as this could cause
issue on quic_conn alloc failure.
commit 4ae29be18c5b212dd2a1a8e9fa0ee2fcb9dbb4b3
BUG/MINOR: quic: Possible endless loop in quic_lstnr_dghdlr()
However, this procedure is less optimal. Indeed, qc_new_conn() performs
many steps, thus it could be better to skip it on Initial CID collision,
which can happen frequently. This patch restores the older order of
operations, with CID collision check prior to quic_conn allocation.
To ensure this does not cause again the same bug, the CID is removed in
case of quic_conn alloc failure. This should prevent any loop as it
ensures that a CID found in the global tree does not point to a NULL
quic_conn, unless if CID is attach to a foreign thread. When this thread
will parse a re-enqueued packet, either the quic_conn is already
allocated or the CID has been removed, triggering a fresh CID and
quic_conn allocation procedure.
CIDs are provided by haproxy so that the peer can use them as DCID of
its packets. Their value is set via a random generator. It happens on
several occasions during connection lifetime:
* via ODCID derivation if haproxy is the server
* on quic_conn init if haproxy is the client
* during post-handshake if haproxy is the server
* on RETIRE_CONNECTION_ID frame parsing
CIDs are stored in a global tree. On ODCID derivation, a check is
performed to ensure the CID is not a duplicate value. This is mandatory
to properly handle multiple INITIAL packets from the same client on
different thread.
However, for the other cases, no check is performed for CID collision.
As _quic_cid_insert() is silent, the issue is not detected at all. This
results in a CID advertized to the peer but not stored in the global
one. In the end, this may cause two issues. The first one is that
packets from the client which use the new CID will be rejected by
haproxy, most probably with a STATELESS_RESET. The second issue is that
it can cause a crash during quic_conn release. Indeed, the CID is stored
in the quic_conn local tree and thus eb_delete() for the global tree
will be performed. As <leaf_p> member is uninit, this results in a
segfault.
Note that this issue is pretty rare. It can only be observed if running
with a high number of concurrent connections in parallel, so that the
random generator will provide duplicate values. Patch is still labelled
as MEDIUM as this modifies code paths used frequently.
To fix this, _quic_cid_insert() unsafe function is completely removed.
Instead, quic_cid_insert() can be used, which reports an error code if a
collision happens. CID are then stored in the quic_conn tree only after
global tree insert success. Here is the solution for each steps if a
collision occurs :
* on init as client: the connection is completely released
* post-handshake: the CID is immediately released. The connection is
kept, but it will miss an extra CID.
* on RETIRE_CONNECTION_ID parsing: a loop is implemented to retry random
generation. It it fails several times, the connection is closed in
error.
A small convenience change is made to quic_cid_insert(). Output
parameter <new_tid> can now be NULL, which is useful as most of the
times caller do not care about it.
This must be backported up to 2.6.
Split new_quic_cid() function into multiple ones. This patch should not
introduce any visible change. The objective is to render CID allocation
and generation more modular.
The first advantage of this patch is to bring code simplication. In
particular, conn CID sequence number increment and insertion into
connection tree is simpler than before. Another improvment is also that
errors could now be handled easier at each different steps of the CID
init.
This patch is a prerequisite for the fix on CID collision, thus it must
be backported prior to it to every affected version.
This stick-table field was atomically updated with the last update id pushed
and dumped on the CLI but never used otherwise. And all peer sessions share
the same id because it is a stick-table info. So the info in peers dump is
pretty limited.
So, let's remove it.
This is the same as the equivalent fix in ebtree:
The C standard specifies that it's undefined behavior to dereference
NULL (even if you use & right after). The hand-rolled offsetof idiom
&(((s*)NULL)->f) is thus technically undefined. This clutters the
output of UBSan and is simple to fix: just use the real offsetof when
it's available.
This is cebtree commit 2d08958858c2b8a1da880061aed941324e20e748.
The purpose here is to look in the environment for a variable whose
name looks like the provided one. This will be used to try to auto-
correct misspelled environment variables that would silently be turned
to an empty string.
The word fingerprinting functions are used to compare similar words to
suggest a correctly spelled one that looks like what the user proposed.
Currently the functions only support const char*, but there's no reason
for this, and it would be convenient to support substrings extracted
from random pieces of configurations. Here we're adding new variants
"_with_len" that take these ISTs and which are in fact a slight change
of the original ones that the old ones now rely on.
When a server is being disabled or deleted, in case it matches the
backend's ready_srv, this one is reset. However it's currently done in
a non-atomic way when the server goes down, and that could occasionally
reset the entry matching another server, but more importantly if in
parallel some requests are dequeued for that server, it may re-appear
there after having been removed, leading to a possible crash once it
is fully removed, as shown in issue #3177.
Let's make sure we reset the pointer when detaching the server from
the proxy, and use a CAS in both cases to only reset this server.
This fix needs to be backported to 3.2. There, srv_detach() is in
server.c instead of server.h. Thanks to Basha Mougamadou for the
detailed report and the useful backtraces.
The <total> field in the channel structure is now useless, so it can be
removed. The <bytes_in> field from the SC is used instead.
This patch is related to issue #1617.
The helper function applet_output_data() returns the amount of data in the
output buffer of an applet. For applets using the new API, it is based on
data present in the outbuf buffer. For legacy applets, it is based on input
data present in the input channel's buffer. The HTX version,
applet_htx_output_data(), is also available
This patch is related to issue #1617.
In previous patches, these counters were added per frontend, backend, server
and listener. With this patch, these counters are reported on stats,
including promex.
Note that the stats file minor version was incremented by one because the
shm_stats_file_object struct size has changed.
This patch is related to issue #1617.
bytes_in and bytes_out counters per frontend, backend, listener and server
were removed and we now rely on, respectively on, req_in and res_in
counters.
This patch is related to issue #1617.
per-stream bytes_in and bytes_out counters was removed and replaced by
req.in and res.in. Coorresponding samples still exists but replies on new
counters.
This patch is related to issue #1617.
Thanks to the previous patch, and based on info available on the stream, it
is now possible to have counters for frontends, backends, servers and
listeners to report number of bytes received and sent on both sides.
This patch is related to issue #1617.
req.in and req.out samples can now be used to get the number of bytes
received by a client and send to the server. And res.in and res.out samples
can be used to get the number of bytes received by a server and send to the
client. These info are stored in the logs structure inside a stream.
This patch is related to issue #1617.
<bytes_in> and <bytes_out> counters were added to SC to count, respectively,
the number of bytes received from an endpoint or sent to an endpoint. These
counters are updated for connections and applets.
This patch is related to issue #1617.
Almost all callers of _srv_add_idle() lock the list then call the
function. It's not the most efficient and it requires some care from
the caller to take care of that lock. Let's change this a little bit by
having srv_add_idle() that takes the lock and calls _srv_add_idle() that
is now inlined. This way callers don't have to handle the lock themselves
anymore, and the lock is only taken around the sensitive parts, not the
function call+return.
Interestingly, perf tests show a small perf increase from 2.28-2.32M RPS
to 2.32-2.37M RPS on a 128-thread system.