The code in srv_add_to_idle_list() has its roots in 2.0 with commit
9ea5d361ae ("MEDIUM: servers: Reorganize the way idle connections are
cleaned."). At this era we didn't yet have the current set of atomic
load/store operations and we used to perform loads using volatile casts
after a barrier. It turns out that this function has kept this schema
over the years, resulting in a big mfence stalling all the pipeline
in the function:
| static __inline void
| __ha_barrier_full(void)
| {
| __asm __volatile("mfence" ::: "memory");
27.08 | mfence
| if ((volatile void *)srv->idle_node.node.leaf_p == NULL) {
0.84 | cmpq $0x0,0x158(%r15)
0.74 | je 35f
| return 1;
Switching these for a pair of atomic loads got rid of this and brought
0.5 to 3% extra performance depending on the tests due to variations
elsewhere, but it has never been below 0.5%. Note that the second load
doesn't need to be atomic since it's protected by the lock, but it's
cleaner from an API and code review perspective. That's also why it's
relaxed.
This was the last user of _ha_barrier_full(), let's try not to
reintroduce it now!
Function proxy_preset_defaults() purpose has evolved over time.
Originally, it was only used to initialize defaults proxies instances.
Until today, it was extended so that all proxies use it. Its objective
is to initialize settings to common default values.
To remove the confusion, this function is now removed. Its content is
integrated directly into init_new_proxy().
There remained some cases (on error paths) were a server could be freed
while still attached on the parent proxy server list. In 3.3 this can be
problematic because new_server() automatically adds the server to the
parent proxy list.
The bug is insignificant because it is on errors paths during init and
often haproxy exits right after. But let's fix that to ensure no UAF or
undefined behavior occurs because of that.
This patch depends on ("MINOR: cli: use srv_drop() when server was created using new_server()")
It must be backported in 3.3 with the above mentioned patch.
Now that new_server() is becoming more and more complex, we need to
take care that servers created using new_server() must be released
using the corresponding release function srv_drop() which takes care
of properly de-initing the server and its members.
Contrarily to what was previously believed, there are corner cases where
the counters may not be allocated, and we may want to make them optional
at a later date, so we have to check if those counters are there.
However, just checking that shared.tg is non-NULL is enough, we can then
assume that shared.tg[tgid - 1] has properly been allocated too.
Also modify the various COUNTER_SHARED_* macros to make sure they check
for that too.
Before updating counters, a few tests are made to check if the counters
exits. but those counters should always exist at this point, so just
remmove them.
This commit should have no impact, but can easily be reverted with no
functional impact if various crashes appear.
Instead of statically allocating the per-thread group counters,
based on the max number of thread groups available, allocate
them dynamically, based on the number of thread groups actually
used. That way we can increase the maximum number of thread
groups without using an unreasonable amount of memory.
Extend QUIC server configuration so that congestion algorithm and
maximum window size can be set on the server line. This can be achieved
using quic-cc-algo keyword with a syntax similar to a bind line.
This should be backported up to 3.3 as this feature is considered as
necessary for full QUIC backend support. Note that this relies on the
serie of previous commits which should be picked first.
A recent patch has introduced free operation for QUIC tokens stored in a
server. These values are located in <per_thr> server array.
However, a server instance may be released prior to its full
initialization in case of a failure during "add server" CLI command. The
mentionned patch would cause a srv_drop() crash due to an invalid usage
of NULL <per_thr> member.
Fix this by adding a check on <per_thr> prior to dereference it in
srv_drop().
No need to backport.
A recent patch has implemented caching of QUIC token received from a
NEW_TOKEN frame into the server cache. This value is stored per thread
into a <quic_retry_token> field.
This field is an ist, first set to an empty string. Via
qc_try_store_new_token(), it is reallocated to fit the size of the newly
stored token. Prior to this patch, the field was never freed so this
causes a memory leak.
Fix this by using istfree() on <quic_retry_token> field during
srv_drop().
No need to backport.
The latest idle conns fix 9481cef948 ("BUG/MEDIUM: connection: do not
reinsert a purgeable conn in idle list") addresses a very hard-to-hit
case which manifests itself with an attempt to reuse a connection fails
because conn->mux is NULL:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000655410b8642c in conn_backend_get (reuse_mode=4, srv=srv@entry=0x6554378a7140,
sess=sess@entry=0x7cfe140948a0, is_safe=is_safe@entry=0,
hash=hash@entry=910818338996668161) at src/backend.c:1390
1390 if (conn->mux->takeover && conn->mux->takeover(conn, i, 0) == 0) {
However the condition that leads to this situation can be detected
earlier, by the presence of the connection in the toremove_list, whose
race window is much larger and easier to detect.
This patch adds a few BUG_ON_STRESS() at selected places that an detect
this condition. When built with -DDEBUG_STRESS and run under stress with
two distinct processes communicating over H2 over SSL, under a stress of
400-500k req/s, the front process usually crashes in the first 10-30s
triggering in _srv_add_idle() if the fix above is reverted (and it does
not crash with the fix).
This is mainly included to serve as an illustration of how to instrument
the code for seamless stress testing.
The target patch fixes a rare race condition which happen when a MUX IO
handler is working on a connection already moved into the purge list. In
this case, the handler will incorrectly moved back the connection into
the idle list.
To fix this, conn_delete_from_tree() was extended to remove flags along
with the connection from the idle list. This was performed when the
connection is moved into the purge list. However, it introduces another
issue related to the idle server connection accounting. Thus it is
necessary to revert it prior to the incoming newer fix.
This patch must be backported to every version where the original commit
is.
When a server is being disabled or deleted, in case it matches the
backend's ready_srv, this one is reset. However it's currently done in
a non-atomic way when the server goes down, and that could occasionally
reset the entry matching another server, but more importantly if in
parallel some requests are dequeued for that server, it may re-appear
there after having been removed, leading to a possible crash once it
is fully removed, as shown in issue #3177.
Let's make sure we reset the pointer when detaching the server from
the proxy, and use a CAS in both cases to only reset this server.
This fix needs to be backported to 3.2. There, srv_detach() is in
server.c instead of server.h. Thanks to Basha Mougamadou for the
detailed report and the useful backtraces.
Almost all callers of _srv_add_idle() lock the list then call the
function. It's not the most efficient and it requires some care from
the caller to take care of that lock. Let's change this a little bit by
having srv_add_idle() that takes the lock and calls _srv_add_idle() that
is now inlined. This way callers don't have to handle the lock themselves
anymore, and the lock is only taken around the sensitive parts, not the
function call+return.
Interestingly, perf tests show a small perf increase from 2.28-2.32M RPS
to 2.32-2.37M RPS on a 128-thread system.
There's currently a function conn_delete_from_tree() which is used to
detach an idle connection from the tree it's currently attached to so
that it is no longer found. This function is used in three circumstances:
- when picking a new connection that no longer has any avail stream
- when temporarily working on the connection from an I/O handler,
in which case it's re-added at the end
- when killing a connection
The 2nd case above is quite specific, as it requires to preserve the
CO_FL_LIST_MASK flags so that the connection can be re-inserted into
the proper tree when leaving the handler. However, there's a catch.
When killing a connection, we want to be certain it will not be
reinserted into the tree. The flags preservation is causing a tiny
race if an I/O happens while the connection is in the kill list,
because in this case the I/O handler will note the connection flags,
do its work, then reinsert the connection where it believed it was,
then the connection gets purged, and another user can find it in the
tree.
The issue is very difficult to reproduce. On a 128-thread machine it
happens in H2 around 500k req/s after around 50M requests. In H1 it
happens after around 1 billion requests.
The fix here consists in passing an extra argument to the function to
indicate if the removal is permanent or not. When it's permanent, the
function will clear the associated flags. The callers were adjusted
so that all those dequeuing a connection in order to kill it do it
permanently and all other ones do it only temporarily.
A slightly different approach could have worked: the function could
always remove all flags, and the callers would need to restore them.
But this would require trickier modifications of the various call
places, compared to only passing 0/1 to indicate the permanent status.
This will need to be backported to all stable versions. The issue was
at least reproduced since 3.1 (not tested before). The patch will need
to be adjusted for 3.2 and older, because a 2nd argument "thr" was
added in 3.3, so the patch will not apply to older versions as-is.
Also call srv_reset_path_parameters() when the server changed states,
and got up. It is not enough to do it when the server goes down, because
there's a small race condition, and a connection could get established
just after we did it, and could have set the path parameters.
This does not need to be backported.
Add a rwlock to control the server's path_parameter, to make sure
multiple threads don't set it at the same time, and it can't be seen in
an inconsistent state.
Also don't set the parameter every time, only set them if they have
changed, to prevent needless writes.
This does not need to be backported.
If a QUIC server is declared without ALPN, "h3" value is automatically
set during _srv_parse_finalize().
This patch adjusts this operation. Instead of relying on
ssl_sock_parse_alpn(), a plain strdup() is used. This is considered more
efficient as the ALPN string is constant in this case. This method is
already used for listeners on the frontend side.
Ensure that QUIC support is compiled into haproxy when a QUIC server is
configured. This check is performed during _srv_parse_finalize() so that
it is detected both on configuration parsing and when adding a dynamic
server via the CLI.
Note that this changes the behavior of srv_is_quic() utility function.
Previously, it always returned false when QUIC support wasn't compiled.
With this new check introduced, it is now guaranteed that a QUIC server
won't exist if compilation support is not active. Hence srv_is_quic()
does not rely anymore on USE_QUIC define.
Previously, QUIC servers were rejected if SSL was not explicitely
activated using 'ssl' configuration keyword.
Change this behavior : now SSL is automatically activated for QUIC
servers when the keyword is missing. A warning is displayed as it is
considered better to explicitely note that SSL is in use.
We normally taint the process when using experimental directives, but
a handful of places were missed so we don't always know that they are
in use. Let's fix these places (hint for future directives, just look
for places checking for "experimental_directives_allowed", and add
"mark_tainted(TAINTED_CONFIG_EXP_KW_DECLARED);").
On creation, schedule the server requeue once it's been created.
It is possible that when the server went up, it tried to queue itself
into the lb specific code, failed to do so, and expect the tasklet to
run to take care of that.
This should be backported to 3.2.
This is part of an attempt to fix github issue #3143.
This patch looks huge, but it has a very simple goal: protect all
accessed to shared stats pointers (either read or writes), because
we know consider that these pointers may be NULL.
The reason behind this is despite all precautions taken to ensure the
pointers shouldn't be NULL when not expected, there are still corner
cases (ie: frontends stats used on a backend which no FE cap and vice
versa) where we could try to access a memory area which is not
allocated. Willy stumbled on such cases while playing with the rings
servers upon connection error, which eventually led to process crashes
(since 3.3 when shared stats were implemented)
Also, we may decide later that shared stats are optional and should
be disabled on the proxy to save memory and CPU, and this patch is
a step further towards that goal.
So in essence, this patch ensures shared stats pointers are always
initialized (including NULL), and adds necessary guards before shared
stats pointers are de-referenced. Since we already had some checks
for backends and listeners stats, and the pointer address retrieval
should stay in cpu cache, let's hope that this patch doesn't impact
stats performance much.
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with outgoing connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Unknown or
forbidden algorithms will be ignored and the default one will continue to
be used.
Previously the conn_hash_node was placed outside the connection due
to the big size of the eb64_node that could have negatively impacted
frontend connections. But having it outside also means that one
extra allocation is needed for each backend connection, and that one
memory indirection is needed for each lookup.
With the compact trees, the tree node is smaller (16 bytes vs 40) so
the overhead is much lower. By integrating it into the connection,
We're also eliminating one pointer from the connection to the hash
node and one pointer from the hash node to the connection (in addition
to the extra object bookkeeping). This results in saving at least 24
bytes per total backend connection, and only inflates connections by
16 bytes (from 240 to 256), which is a reasonable compromise.
Tests on a 64-core EPYC show a 2.4% increase in the request rate
(from 2.08 to 2.13 Mrps).
Idle connection trees currently require a 56-byte conn_hash_node per
connection, which can be reduced to 32 bytes by moving to ceb64. While
ceb64 is theoretically slower, in practice here we're essentially
dealing with trees that almost always contain a single key and many
duplicates. In this case, ceb64 insert and lookup functions become
faster than eb64 ones because all duplicates are a list accessed in
O(1) while it's a subtree for eb64. In tests it is impossible to tell
the difference between the two, so it's worth reducing the memory
usage.
This commit brings the following memory savings to conn_hash_node
(one per backend connection), and to srv_per_thread (one per thread
and per server):
struct before after delta
conn_hash_nodea 56 32 -24
srv_per_thread 96 72 -24
The delicate part is conn_delete_from_tree(), because we need to
know the tree root the connection is attached to. But thanks to
recent cleanups, it's now clear enough (i.e. idle/safe/avail vs
session are easy to distinguish).
We'll soon need to choose the server's root based on the connection's
flags, and for this we'll need the thread it's attached to, which is
not always the current one. This patch simply passes the thread number
from all callers. They know it because they just set the idle_conns
lock on it prior to calling the function.
Probably due to older code, there's a boolean variable used to set
another one which is then checked. Also the first check is made under
the lock, which is unnecessary. Let's simplify this and use a single
variable. This only makes the code clearer, it doesn't change the output
code.
We'll need to have access to the srv_per_thread element soon from this
function, and there's no particular reason for passing it list pointers
so let's pass the server and the thread so that it is autonomous. It
also makes the calling code simpler.
There were a few leftovers from an earlier version of the conn_hash_node
that was using ebmb nodes. A few calls to ebmb_first() and ebmb_entry()
were still present while acting on an eb64 tree. These are harmless as
one is just eb_first() and the other container_of(), but it's confusing
so let's clean them up.
The server ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct server for this node, plus 8 bytes in
struct proxy. The server struct is still 3904 bytes large (due to
alignment) and the proxy struct is 3072.
In srv_parse_id(), there's no point doing all the low-level work with
the tree functions to check for the existence of an ID, we already have
server_find_by_id() which does exactly this, so let's use it.
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated server_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
This is used to index the server name and it contains a copy of the
pointer to the server's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct server, which remains 3968
bytes large due to alignment. The proxy struct shrinks by 8 bytes to 3144.
It's worth noting that the current way duplicate names are handled remains
based on the previous mechanism where dups were permitted. Ideally we
should now reject them during insertion and use unique key trees instead.
This contains the text representation of the server's address, for use
with stick-tables with "srvkey addr". Switching them to a compact node
saves 24 more bytes from this structure. The key was moved to an external
pointer "addr_key" right after the node.
The server struct is now 3968 bytes (down from 4032) due to alignment, and
the proxy struct shrinks by 8 bytes to 3152.
Add a new field in struct server, path parameters. It will contain
connection informations for the server that are not expected to change.
For now, just store the ALPN negociated with the server. Each time an
handhskae is done, we'll update it, even though it is not supposed to
change. This will be useful when trying to send early data, that way
we'll know which mux to use.
Each time the server goes down or is disabled, those informations are
erased, as we can't be sure those parameters will be the same once the
server will be back up.
For HTTPS outgoing connections, the SNI is now automatically set using the
Host header value if no other value is already set (via the "sni" server
keyword). It is now the default behavior. It could be disabled with the
"no-sni-auto" server keyword. And eventually "sni-auto" server keyword may
be used to reset any previous "no-sni-auto" setting. This option can be
inherited from "default-server" settings. Finally, if no connection name is
set via "pool-conn-name" setting, the selected value is used.
The automatic selection of the SNI is enabled by default for all outgoing
connections. But it is concretely used for HTTPS connections only. The
expression used is "req.hdr(host),host_only".
This patch should paritally fix the issue #3081. It only covers the server
part. Another patch will add the feature for HTTP health-checks.
not all changes are concerned. But when the SSL is enabled or disabled for a
server, the healthcheck xprt must be eventually be updated too. This happens
when the healthcheck relies on the server settings.
In the same spirit, when the healthcheck address and port are updated, we
must fallback on the raw xprt if the SSL is not explicitly enabled for the
healthcheck with a "check-ssl" parameter.
This patch should be backported to all stable versions.
By default, for a given server, when no pool-conn-name is specified, the
configured sni is used. However, this must only be done when SSL is in-use
for the server. Of course, it is uncommon to have a sni expression for
now-ssl server. But this may happen.
In addition, the SSL may be disabled via the CLI. In that case, the
pool-conn-name must be discarded if it was copied from the sni. And, we must
of course take care to set it if the ssl is enabled.
Finally, when the attac-srv action is checked, we now checked the
pool-conn-name expression.
This patch should be backported as far as 3.0. It relies on "MINOR: server:
Parse sni and pool-conn-name expressions in a dedicated function" which
should be backported too.
This change is mandatory to fix an issue. The parsing of sni and
pool-conn-name expressions (from string to expression) is now handled in a
dedicated function. This will avoid to duplicate the same code at different
places.
counters_{fe,be}_shared_prepare now take an extra <errmsg> parameter
that contains additional hints about the error in case of failure.
It must be freed accordingly since it is allocated using memprintf
It is not really an issue, but the "check-sni" value inerited from a default
server is not duplicated while the paramter value is duplicated during the
parsing. So here there is a small leak if several "check-sni" parameters are
used on the same server line. The previous value is never released. But to
fix this issue, the value inherited from the default server must also be
duplicated. At the end it is safer this way and consistant with the parsing
of the "sni" parameter.
It is harmless so there is no reason to backport this patch.
When "check-alpn" parameter is inherited from the default server, the value
is not duplicated, the pointer of the default server is used. However, when
this parameter is overridden, the old value is released. So the "check-alpn"
value of the default server is released. So it is possible to have a UAF if
if another server inherit from the same the default server.
To fix the issue, the "check-alpn" parameter must be handled the same way
the "alpn" is. The default value is duplicated. So it could be safely
released if it is forced on the server line.
This patch should fix the issue #3096. It must be backported to all stable
versions.
Do not remove anymore idle and purgeable connections directly under the
"del server" handler. The main objective of this patch is to reduce the
amount of work performed under thread isolation. This should improve
"del server" scheduling with other haproxy tasks.
Another objective is to be able to properly support dynamic servers with
QUIC. Indeed, takeover is not yet implemented for this protocol, hence
it is not possible to rely on cleanup of idle connections performed by a
single thread under "del server" handler.
With this change it is not possible anymore to remove a server if there
is still idle connections referencing it. To ensure this cannot be
performed, srv_check_for_deletion() has been extended to check server
counters for idle and idle private connections.
Server deletion should still remain a viable procedure, as first it is
mandatory to put the targetted server into maintenance. This step forces
the cleanup of its existing idle connections. Thanks to a recent change,
all finishing connections are also removed immediately instead of
becoming idle. In short, this patch transforms idle connections removal
from a synchronous to an asynchronous procedure. However, this should
remain a steadfast and quick method achievable in less than a second.
This patch is considered major as some users may notice this change when
removing a server. In particular with the following CLI commands
pipeline:
"disable server <X>; shutdown sessions server <X>; del server <X>"
Server deletion will now probably fail, as idle connections purge cannot
be completed immediately. Thus, it is now highly advise to always use a
small delay "wait srv-removable" before "del server" to ensure that idle
connections purge is executed prior.
Along with this change, documentation for "del server" and related
"shutdown sessions server" has been refined, in particular to better
highlight under what conditions a server can be removed.
This patch adds a new member <curr_sess_idle_conns> on the server. It
serves as a counter of idle connections attached on a session instead of
regular idle/safe trees. This is used only for private connections.
The objective is to provide a method to detect if there is idle
connections still referencing a server.
This will be particularly useful to ensure that a server is removable.
Currently, this is not yet necessary as idle connections are directly
freed via "del server" handler under thread isolation. However, this
procedure will be replaced by an asynchronous mechanism outside of
thread isolation.
Careful: connections attached to a session but not idle will not be
accounted by this counter. These connections can still be detected via
srv_has_streams() so "del server" will be safe.
This counter is maintain during the whole lifetime of a private
connection. This is mandatory to guarantee "del server" safety and is
conform with other idle server counters. What this means it that
decrement is performed only when the connection transitions from idle to
in use, or just prior to its deletion. For the first case, this is
covered by session_get_conn(). The second case is trickier. It cannot be
done via session_unown_conn() as a private connection may still live a
little longer after its removal from session, most notably when
scheduled for idle purging.
Thus, conn_free() has been adjusted to handle the final decrement. Now,
conn_backend_deinit() is also called for private connections if
CO_FL_SESS_IDLE flag is present. This results in a call to
srv_release_conn() which is responsible to decrement server idle
counters.
When a server goes into maintenance, or if its IP address is changed,
idle connections attached to it are scheduled for deletion via the purge
mechanism. Connections are moved from server idle/safe list to the purge
list relative to their thread. Connections are freed on their owned
thread by the scheduled purge task.
This patch extends this procedure to also handle private idle
connections stored in sessions instead of servers. This is possible
thanks via <sess_conns> list server member. A call to the newly
defined-function session_purge_conns() is performed on each list
element. This moves private connections from their session to the purge
list alongside other server idle connections.
This change relies on the serie of previous commits which ensure that
access to private idle connections is now thread-safe, with idle_conns
lock usage and careful manipulation of private idle conns in
input/output handlers.
The main benefit of this patch is that now all idle connections
targetting a server set in maintenance are removed. Previously, private
connections would remain until their attach sessions were closed.
When a server goes into maintenance mode, its idle connections are
scheduled for an immediate purge. However, this is not the case if the
server is already in stopped state, for example due to a health check
failure.
Adjust _srv_update_status_adm() to ensure that idle connections are
always scheduled for purge when going into maintenance in both cases.
The main advantage of this patch is to ensure consistent behavior for
server maintenance mode.
Note that it will also become necessary as server deletion will be
adjusted with a future patch. Idle connection closure won't be performed
by "del server" handler anymore, so it's important to ensure that a full
cleanup is always performed prior to executing it, else the server may
not be removable during a certain delay.
Currently, when a server is set on maintenance mode, its idle connection
are scheduled for purge. However, this does not prevent currently used
connection to become idle later on, even if the server is still off.
Change this behavior : an idle connection is now rejected by the server
if it is in maintenance. This is implemented with a new condition in
srv_add_to_idle_list() which returns an error value. In this case, muxes
stream detach callback will immediately free the connection.
A similar change is also performed in each MUX and SSL I/O handlers and
in conn_notify_mux(). An idle connection is not reinserted in its idle
list if server is in maintenance, but instead it is immediately freed.
Server member <sess_conns> is a mt_list which contains every backend
connections attached to a session which targets this server. These
connecions are not present in idle server trees.
The main utility of this list is to be able to cleanup these connections
prior to removing a server via "del server" CLI. However, this procedure
will be adjusted by a future patch. As such, <sess_conns> member must be
moved into srv_per_thread struct. Effectively, this duplicates a list
for every threads.
This commit does not introduce functional change. Its goal is to ensure
that these connections are now ordered by their owning thread, which
will allow to implement a purge, similarly to idle connections attached
to servers.