25410 Commits

Author SHA1 Message Date
Willy Tarreau
2d6b5c7a60 MEDIUM: connection: reintegrate conn_hash_node into connection
Previously the conn_hash_node was placed outside the connection due
to the big size of the eb64_node that could have negatively impacted
frontend connections. But having it outside also means that one
extra allocation is needed for each backend connection, and that one
memory indirection is needed for each lookup.

With the compact trees, the tree node is smaller (16 bytes vs 40) so
the overhead is much lower. By integrating it into the connection,
We're also eliminating one pointer from the connection to the hash
node and one pointer from the hash node to the connection (in addition
to the extra object bookkeeping). This results in saving at least 24
bytes per total backend connection, and only inflates connections by
16 bytes (from 240 to 256), which is a reasonable compromise.

Tests on a 64-core EPYC show a 2.4% increase in the request rate
(from 2.08 to 2.13 Mrps).
2025-09-16 09:23:46 +02:00
Willy Tarreau
ceaf8c1220 MEDIUM: connection: move idle connection trees to ceb64
Idle connection trees currently require a 56-byte conn_hash_node per
connection, which can be reduced to 32 bytes by moving to ceb64. While
ceb64 is theoretically slower, in practice here we're essentially
dealing with trees that almost always contain a single key and many
duplicates. In this case, ceb64 insert and lookup functions become
faster than eb64 ones because all duplicates are a list accessed in
O(1) while it's a subtree for eb64. In tests it is impossible to tell
the difference between the two, so it's worth reducing the memory
usage.

This commit brings the following memory savings to conn_hash_node
(one per backend connection), and to srv_per_thread (one per thread
and per server):

     struct       before  after  delta
  conn_hash_nodea   56     32     -24
  srv_per_thread    96     72     -24

The delicate part is conn_delete_from_tree(), because we need to
know the tree root the connection is attached to. But thanks to
recent cleanups, it's now clear enough (i.e. idle/safe/avail vs
session are easy to distinguish).
2025-09-16 09:23:46 +02:00
Willy Tarreau
95b8adff67 MINOR: connection: pass the thread number to conn_delete_from_tree()
We'll soon need to choose the server's root based on the connection's
flags, and for this we'll need the thread it's attached to, which is
not always the current one. This patch simply passes the thread number
from all callers. They know it because they just set the idle_conns
lock on it prior to calling the function.
2025-09-16 09:23:46 +02:00
Willy Tarreau
efe519ab89 CLEANUP: backend: use a single variable for removed in srv_cleanup_idle_conns()
Probably due to older code, there's a boolean variable used to set
another one which is then checked. Also the first check is made under
the lock, which is unnecessary. Let's simplify this and use a single
variable. This only makes the code clearer, it doesn't change the output
code.
2025-09-16 09:23:46 +02:00
Willy Tarreau
f7d1fc2b08 MINOR: server: pass the server and thread to srv_migrate_conns_to_remove()
We'll need to have access to the srv_per_thread element soon from this
function, and there's no particular reason for passing it list pointers
so let's pass the server and the thread so that it is autonomous. It
also makes the calling code simpler.
2025-09-16 09:23:46 +02:00
Willy Tarreau
d1c5df6866 CLEANUP: server: use eb64_entry() not ebmb_entry() to convert an eb64
There were a few leftovers from an earlier version of the conn_hash_node
that was using ebmb nodes. A few calls to ebmb_first() and ebmb_entry()
were still present while acting on an eb64 tree. These are harmless as
one is just eb_first() and the other container_of(), but it's confusing
so let's clean them up.
2025-09-16 09:23:46 +02:00
Willy Tarreau
3d18a0d4c2 CLEANUP: backend: factor the connection lookup loop
The connection lookup loop is made of two identical blocks, one looking
in the idle or safe lists and the other one looking into the safe list
only. The second one is skipped if a connection was found or if the request
looks for a safe one (since already done). Also the two are slightly
different due to leftovers from earlier versions in that the second one
checks for safe connections and not the first one, and the second one
sets is_safe which is not used later.

Let's just rationalize all this by placing them in a loop which checks
first from the idle conns and second from the safe ones, or skips the
first step if the request wants a safe connection. This reduces the
code and shortens the time spent under the lock.
2025-09-16 09:23:46 +02:00
Willy Tarreau
7773d87ea6 CLEANUP: proxy: slightly reorganize fields to plug some holes
The proxy struct has several small holes that deserved being plugged by
moving a few fields around. Now we're down to 3056 from 3072 previously,
and the remaining holes are small.

At the moment, compared to before this series, we're seeing these
sizes:

    type\size   7d554ca62   current  delta
    listener       752        704     -48  (-6.4%)
    server        4032       3840    -192  (-4.8%)
    proxy         3184       3056    -128  (-4%)
    stktable      3392       3328     -64  (-1.9%)

Configs with many servers have shrunk by about 4% in RAM and configs
with many proxies by about 3%.
2025-09-16 09:23:46 +02:00
Willy Tarreau
8df81b6fcc CLEANUP: server: slightly reorder fields in the struct to plug holes
The struct server still has a lot of holes and padding that make it
quite big. By moving a few fields aronud between areas which do not
interact (e.g. boot vs aligned areas), it's quite easy to plug some
of them and/or to arrange larger ones which could be reused later with
a bit more effort. Here we've reduced holes by 40 bytes, allowing the
struct to shrink by one more cache line (64 bytes). The new size is
3840 bytes.
2025-09-16 09:23:46 +02:00
Willy Tarreau
d18d972b1f MEDIUM: server: index server ID using compact trees
The server ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct server for this node, plus 8 bytes in
struct proxy. The server struct is still 3904 bytes large (due to
alignment) and the proxy struct is 3072.
2025-09-16 09:23:46 +02:00
Willy Tarreau
66191584d1 MEDIUM: listener: index listener ID using compact trees
The listener ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct listener for this node, plus 8 bytes
in struct proxy. The struct listener is now 704 bytes large, and the
struct proxy 3080.
2025-09-16 09:23:46 +02:00
Willy Tarreau
1a95bc42c7 MEDIUM: proxy: index proxy ID using compact trees
The proxy ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct proxy for this node, plus 8 bytes in
the root (which is static so not much relevant here). Now the proxy is
3088 bytes large.
2025-09-16 09:23:46 +02:00
Willy Tarreau
eab5b89dce MINOR: proxy: add proxy_index_id() to index a proxy by its ID
This avoids needlessly exposing the tree's root and the mechanics outside
of the low-level code.
2025-09-16 09:23:46 +02:00
Willy Tarreau
5e4b6714e1 MINOR: listener: add listener_index_id() to index a listener by its ID
This avoids needlessly exposing the tree's root and the mechanics outside
of the low-level code.
2025-09-16 09:23:46 +02:00
Willy Tarreau
5a5cec4d7a MINOR: server: add server_index_id() to index a server by its ID
This avoids needlessly exposing the tree's root and the mechanics outside
of the low-level code.
2025-09-16 09:23:46 +02:00
Willy Tarreau
4ed4cdbf3d CLEANUP: server: use server_find_by_id() when looking for already used IDs
In srv_parse_id(), there's no point doing all the low-level work with
the tree functions to check for the existence of an ID, we already have
server_find_by_id() which does exactly this, so let's use it.
2025-09-16 09:23:46 +02:00
Willy Tarreau
0b0aefe19b MINOR: server: add server_get_next_id() to find next free server ID
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated server_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
2025-09-16 09:23:46 +02:00
Willy Tarreau
23605eddb1 MINOR: listener: add listener_get_next_id() to find next free listener ID
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated listener_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
2025-09-16 09:23:46 +02:00
Willy Tarreau
b2402d67b7 MINOR: proxy: add proxy_get_next_id() to find next free proxy ID
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated proxy_get_next_id().
2025-09-16 09:23:46 +02:00
Willy Tarreau
f4059ea42f MEDIUM: stktable: index table names using compact trees
Here we're saving 64 bytes per stick-table, from 3392 to 3328, and the
change was really straightforward so there's no reason not to do it.
2025-09-16 09:23:46 +02:00
Willy Tarreau
d0d60a007d MEDIUM: proxy: switch conf.name to cebis_tree
This is used to index the proxy's name and it contains a copy of the
pointer to the proxy's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct proxy, which is now 3112
bytes large.

Here we need to continue to support duplicates since they're still
allowed between type-incompatible proxies.

Interestingly, the use of cebis_next_dup() instead of cebis_next() in
proxy_find_by_name() allows us to get rid of an strcmp() that was
performed for each use_backend rule. A test with a large config
(100k backends) shows that we can get 3% extra performance on a
config involving a static use_backend rule (3.09M to 3.18M rps),
and even 4.5% on a dynamic rule selecting a random backend (2.47M
to 2.59M).
2025-09-16 09:23:46 +02:00
Willy Tarreau
fdf6fd5b45 MEDIUM: server: switch the host_dn member to cebis_tree
This member is used to index the hostname_dn contents for DNS resolution.
Let's replace it with a cebis_tree to save another 32 bytes (24 for the
node + 8 by avoiding the duplication of the pointer). The struct server is
now at 3904 bytes.
2025-09-16 09:23:46 +02:00
Willy Tarreau
413e903a22 MEDIUM: server: switch conf.name to cebis_tree
This is used to index the server name and it contains a copy of the
pointer to the server's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct server, which remains 3968
bytes large due to alignment. The proxy struct shrinks by 8 bytes to 3144.

It's worth noting that the current way duplicate names are handled remains
based on the previous mechanism where dups were permitted. Ideally we
should now reject them during insertion and use unique key trees instead.
2025-09-16 09:23:46 +02:00
Willy Tarreau
0e99f64fc6 MEDIUM: server: switch addr_node to cebis_tree
This contains the text representation of the server's address, for use
with stick-tables with "srvkey addr". Switching them to a compact node
saves 24 more bytes from this structure. The key was moved to an external
pointer "addr_key" right after the node.

The server struct is now 3968 bytes (down from 4032) due to alignment, and
the proxy struct shrinks by 8 bytes to 3152.
2025-09-16 09:23:46 +02:00
Willy Tarreau
91258fb9d8 MEDIUM: guid: switch guid to more compact cebuis_tree
The current guid struct size is 56 bytes. Once reduced using compact
trees, it goes down to 32 (almost half). We're not on a critical path
and size matters here, so better switch to this.

It's worth noting that the name part could also be stored in the
guid_node at the end to save 8 extra byte (no pointer needed anymore),
however the purpose of this struct is to be embedded into other ones,
which is not compatible with having a dynamic size.

Affected struct sizes in bytes:

           Before     After   Diff
  server    4032       4032     0*
  proxy     3184       3160    -24
  listener   752        728    -24

*: struct server is full of holes and padding (176 bytes) and is
64-byte aligned. Moving the guid_node elsewhere such as after sess_conn
reduces it to 3968, or one less cache line. There's no point in moving
anything now because forthcoming patches will arrange other parts.
2025-09-16 09:23:46 +02:00
Willy Tarreau
e36b3b60b3 MEDIUM: migrate the patterns reference to cebs_tree
cebs_tree are 24 bytes smaller than ebst_tree (16B vs 40B), and pattern
references are only used during map/acl updates, so their storage is
pure loss between updates (which most of the time never happen). By
switching their indexing to compact trees, we can save 16 to 24 bytes
per entry depending on alightment (here it's 24 per struct but 16
practical as malloc's alignment keeps 8 unused).

Tested on core i7-8650U running at 3.0 GHz, with a file containing
17.7M IP addresses (16.7M different):

   $ time  ./haproxy -c -f acl-ip.cfg

Save 280 MB RAM for 17.7M IP addresses, and slightly speeds up the
startup (5.8%, from 19.2s to 18.2s), a part of which possible being
attributed to having to write less memory. Note that this is on small
strings. On larger ones such as user-agents, ebtree doesn't reread
the whole key and might be more efficient.

Before:
  RAM (VSZ/RSS): 4443912 3912444

  real    0m19.211s
  user    0m18.138s
  sys     0m1.068s

  Overhead  Command         Shared Object      Symbol
    44.79%  haproxy  haproxy            [.] ebst_insert
    25.07%  haproxy  haproxy            [.] ebmb_insert_prefix
     3.44%  haproxy  libc-2.33.so       [.] __libc_calloc
     2.71%  haproxy  libc-2.33.so       [.] _int_malloc
     2.33%  haproxy  haproxy            [.] free_pattern_tree
     1.78%  haproxy  libc-2.33.so       [.] inet_pton4
     1.62%  haproxy  libc-2.33.so       [.] _IO_fgets
     1.58%  haproxy  libc-2.33.so       [.] _int_free
     1.56%  haproxy  haproxy            [.] pat_ref_push
     1.35%  haproxy  libc-2.33.so       [.] malloc_consolidate
     1.16%  haproxy  libc-2.33.so       [.] __strlen_avx2
     0.79%  haproxy  haproxy            [.] pat_idx_tree_ip
     0.76%  haproxy  haproxy            [.] pat_ref_read_from_file
     0.60%  haproxy  libc-2.33.so       [.] __strrchr_avx2
     0.55%  haproxy  libc-2.33.so       [.] unlink_chunk.constprop.0
     0.54%  haproxy  libc-2.33.so       [.] __memchr_avx2
     0.46%  haproxy  haproxy            [.] pat_ref_append

After:
  RAM (VSZ/RSS): 4166108 3634768

  real    0m18.114s
  user    0m17.113s
  sys     0m0.996s

  Overhead  Command  Shared Object       Symbol
    38.99%  haproxy  haproxy             [.] cebs_insert
    27.09%  haproxy  haproxy             [.] ebmb_insert_prefix
     3.63%  haproxy  libc-2.33.so        [.] __libc_calloc
     3.18%  haproxy  libc-2.33.so        [.] _int_malloc
     2.69%  haproxy  haproxy             [.] free_pattern_tree
     1.99%  haproxy  libc-2.33.so        [.] inet_pton4
     1.74%  haproxy  libc-2.33.so        [.] _IO_fgets
     1.73%  haproxy  libc-2.33.so        [.] _int_free
     1.57%  haproxy  haproxy             [.] pat_ref_push
     1.48%  haproxy  libc-2.33.so        [.] malloc_consolidate
     1.22%  haproxy  libc-2.33.so        [.] __strlen_avx2
     1.05%  haproxy  libc-2.33.so        [.] __strcmp_avx2
     0.80%  haproxy  haproxy             [.] pat_idx_tree_ip
     0.74%  haproxy  libc-2.33.so        [.] __memchr_avx2
     0.69%  haproxy  libc-2.33.so        [.] __strrchr_avx2
     0.69%  haproxy  libc-2.33.so        [.] _IO_getline_info
     0.62%  haproxy  haproxy             [.] pat_ref_read_from_file
     0.56%  haproxy  libc-2.33.so        [.] unlink_chunk.constprop.0
     0.56%  haproxy  libc-2.33.so        [.] cfree@GLIBC_2.2.5
     0.46%  haproxy  haproxy             [.] pat_ref_append

If the addresses are totally disordered (via "shuf" on the input file),
we see both implementations reach exactly 68.0s (slower due to much
higher cache miss ratio).

On large strings such as user agents (1 million here), it's now slightly
slower (+9%):

Before:
  real    0m2.475s
  user    0m2.316s
  sys     0m0.155s

After:
  real    0m2.696s
  user    0m2.544s
  sys     0m0.147s

But such patterns are much less common than short ones, and the memory
savings do still count.

Note that while it could be tempting to get rid of the list that chains
all these pat_ref_elt together and only enumerate them by walking along
the tree to save 16 extra bytes per entry, that's not possible due to
the problem that insertion ordering is critical (think overlapping regex
such as /index.* and /index.html). Currently it's not possible to proceed
differently because patterns are first pre-loaded into the pat_ref via
pat_ref_read_from_file_smp() and later indexed by pattern_read_from_file(),
which has to only redo the second part anyway for maps/acls declared
multiple times.
2025-09-16 09:23:46 +02:00
Willy Tarreau
ddf900a0ce IMPORT: cebtree: import version 0.5.0 to support duplicates
The support for duplicates is necessary for various use cases related
to config names, so let's upgrade to the latest version which brings
this support. This updates the cebtree code to commit 808ed67 (tag
0.5.0). A few tiny adaptations were needed:
  - replace a few ceb_node** with ceb_root** since pointers are now
    tagged ;
  - replace cebu*.h with ceb*.h since both are now merged in the same
    include file. This way we can drop the unused cebu*.h files from
    cebtree that are provided only for compatibility.
  - rename immediate storage functions to cebXX_imm_XXX() as per the API
    change in 0.5 that makes immediate explicit rather than implicit.
    This only affects vars and tools.c:copy_file_name().

The tests continue to work.
2025-09-16 09:23:46 +02:00
Willy Tarreau
90b70b61b1 BUILD: makefile: implement support for running a command in range
When running "make range", it would be convenient to support running
reg tests or anything else such as "size", "pahole" or even benchmarks.
Such commands are usually specific to the developer's environment, so
let's just pass a generic variable TEST_CMD that is executed as-is if
not empty.

This way it becomes possible to run "make range RANGE=... TEST_CMD=...".
2025-09-16 09:23:46 +02:00
Valentine Krasnobaeva
f8acac653e BUG/MINOR: resolvers: always normalize FQDN from response
RFC1034 states the following:

By convention, domain names can be stored with arbitrary case, but
domain name comparisons for all present domain functions are done in a
case-insensitive manner, assuming an ASCII character set, and a high
order zero bit. This means that you are free to create a node with
label "A" or a node with label "a", but not both as brothers; you could
refer to either using "a" or "A".

In practice, most DNS resolvers normalize domain labels (i.e., convert
them to lowercase) before performing searches or comparisons to ensure
this requirement is met.

While HAProxy normalizes the domain name in the request, it currently
does not do so for the response. Commit 75cc653 ("MEDIUM: resolvers:
replace bogus resolv_hostname_cmp() with memcmp()") intentionally
removed the `tolower()` conversion from `resolv_hostname_cmp()` for
safety and performance reasons.

This commit re-introduces the necessary normalization for FQDNs received
in the response. The change is made in `resolv_read_name()`, where labels
are processed as an unsigned char string, allowing `tolower()` to be
applied safely. Since a typical FQDN has only 3-4 labels, replacing
`memcpy()` with an explicit copy that also applies `tolower()` should
not introduce a significant performance degradation.

This patch addresses the rare edge case, as most resolvers perform this
normalization themselves.

This fixes the GitHub issue #3102. This fix may be backported in all stable
versions since 2.5 included 2.5.
2025-09-15 18:02:16 +02:00
Remi Tricot-Le Breton
257df69fbd BUG/MINOR: ocsp: Crash when updating CA during ocsp updates
If an ocsp response is set to be updated automatically and some
certificate or CA updates are performed on the CLI, if the CLI update
happens while the OCSP response is being updated and is then detached
from the udapte tree, it might be wrongly inserted into the update tree
in 'ssl_sock_load_ocsp', and then reinserted when the update finishes.

The update tree then gets corrupted and we could end up crashing when
accessing other nodes in the ocsp response update tree.

This patch must be backported up to 2.8.
This patch fixes GitHub #3100.
2025-09-15 15:34:36 +02:00
Aurelien DARRAGON
6a92b14cc1 MEDIUM: log/proxy: store log-steps selection using a bitmask, not an eb tree
An eb tree was used to anticipate for infinite amount of custom log steps
configured at a proxy level. In turns out this makes no sense to configure
that much logging steps for a proxy, and the cost of the eb tree is non
negligible in terms of memory footprint, especially when used in a default
section.

Instead, let's use a simple bitmask, which allows up to 64 logging steps
configured at proxy level. If we lack space some day (and need more than
64 logging steps to be configured), we could simply modify
"struct log_steps" to spread the bitmask over multiple 64bits integers,
minor some adjustments where the mask is set and checked.
2025-09-15 10:29:02 +02:00
Aurelien DARRAGON
be417c1db2 BUG/MEDIUM: http_ana: fix potential NULL deref in http_process_req_common()
As reported by @kenballus in GH #3118, a potential NULL-deref was
introduced in 3da1d63 ("BUG/MEDIUM: http_ana: handle yield for "stats
http-request" evaluation")

Indeed, px->uri_auth may be NULL when stats directive is not involved in
the current proxy section.

The bug went unnoticed because it didn't seem to cause any side-effect
so far and valgrind didn't catch it. However ASAN did, so let's fix it
before it causes harm.

It should be backported with 3da1d63.
2025-09-15 10:28:59 +02:00
Christopher Faulet
b582fd41c2 Revert "BUG/MINOR: ocsp: Crash when updating CA during ocsp updates"
This reverts commit 167ea8fc7b0cf9d1bf71ec03d7eac3141fbe0080.

The patch was backported by mistake.
2025-09-15 10:16:20 +02:00
Remi Tricot-Le Breton
167ea8fc7b BUG/MINOR: ocsp: Crash when updating CA during ocsp updates
If an ocsp response is set to be updated automatically and some
certificate or CA updates are performed on the CLI, if the CLI update
happens while the OCSP response is being updated and is then detached
from the udapte tree, it might be wrongly inserted into the update tree
in 'ssl_sock_load_ocsp', and then reinserted when the update finishes.

The update tree then gets corrupted and we could end up crashing when
accessing other nodes in the ocsp response update tree.

This patch must be backported up to 2.8.
This patch fixes GitHub #3100.
2025-09-15 08:20:16 +02:00
Christopher Faulet
157852ce99 BUG/MEDIUM: resolvers: Wake resolver task up whne unlinking a stream requester
Another regression introduced with the commit 3023e9819 ("BUG/MINOR:
resolvers: Restore round-robin selection on records in DNS answers"). Stream
requesters are unlinked from any theards. So we must not try to queue the
resolver's task here because it is not allowed to do so from another thread
than the task thread. Instead, we can simply wake the resolver's task up. It
is only performed when the last stream requester is unlink from the
resolution.

This patch should fix the issue #3119. It must be backported with the commit
above.
2025-09-15 07:57:29 +02:00
Christopher Faulet
e6a9192af6 BUG/MEDIUM: resolvers: Accept to create resolution without hostname
A regression was introduced by commit 6cf2401ed ("BUG/MEDIUM: resolvers:
Make resolution owns its hostname_dn value"). In fact, it is possible (an
allowed ?!) to create a resolution without hostname (hostname_dn ==
NULL). It only happens on startup for a server relying on a resolver but
defined with an IP address and not a hostname

Because of the patch above, an error is triggered during the configuration
parsing when this happens, while it should be accepted.

This patch must be backported with the commit above.
2025-09-12 11:52:06 +02:00
Christopher Faulet
6cf2401eda BUG/MEDIUM: resolvers: Make resolution owns its hostname_dn value
The commit 37abe56b1 ("BUG/MEDIUM: resolvers: Properly cache do-resolv
resolution") introduced a regression. A resolution does not own its
hostname_dn value, it is a pointer on the first request value. But since the
commit above, it is possible to have orphan resolution, with no
requester. So it is important to modify the resolutions to make it owns its
hostname_dn value by duplicating it when it is created.

This patch must be backported with the commit above.
2025-09-12 11:09:19 +02:00
Christopher Faulet
f6dfbbe870 BUG/MEDIUM: resolvers: Test for empty tree when getting a record from DNS answer
In the previous fix 5d1d93fad ("BUG/MEDIUM: resolvers: Properly handle empty
tree when getting a record from the DNS answer"), I missed the fact the
answer tree can be empty.

So, to avoid crashes, when the answer tree is empty, we immediately exit
from resolv_get_ip_from_response() function with RSLV_UPD_NO_IP_FOUND. In
addition, when a record is removed from the tree, we take care to reset the
next node saved if necessary.

This patch must be backported with the commit above.
2025-09-12 11:09:19 +02:00
Collison, Steven
d738fa4ec0 DOC: proxy-protocol: Add TLS group and sig scheme TLVs
This change adds the PP2_SUBTYPE_SSL_GROUP and PP2_SUBTYPE_SSL_SIG_SCHEME
code point reservations in proxy_protocol.txt. The motivation for adding
these two TLVs is for backend visibility into the negotiated TLS key
exchange group and handshake signature scheme.

Demand for visibility is expected to increase as endpoints migrate to use
new Post-Quantum resistant algorithms for key exchange and signatures.
2025-09-12 09:25:14 +02:00
Willy Tarreau
8fb5ae5cc6 MINOR: activity/memory: count allocations performed under a lock
By checking the current thread's locking status, it becomes possible
to know during a memory allocation whether it's performed under a lock
or not. Both pools and memprofile functions were instrumented to check
for this and to increment the memprofile bin's locked_calls counter.

This one, when not zero, is reported on "show profiling memory" with a
percentage of all allocations that such locked allocations represent.
This way it becomes possible to try to target certain code paths that
are particularly expensive. Example:

  $ socat - /tmp/sock1 <<< "show profiling memory"|grep lock
     20297301           0     2598054528              0|   0x62a820fa3991 sockaddr_alloc+0x61/0xa3 p_alloc(128) [pool=sockaddr] [locked=54962 (0.2 %)]
            0    20297301              0     2598054528|   0x62a820fa3a24 sockaddr_free+0x44/0x59 p_free(-128) [pool=sockaddr] [locked=34300 (0.1 %)]
      9908432           0     1268279296              0|   0x62a820eb8524 main+0x81974 p_alloc(128) [pool=task] [locked=9908432 (100.0 %)]
      9908432           0      554872192              0|   0x62a820eb85a6 main+0x819f6 p_alloc(56) [pool=tasklet] [locked=9908432 (100.0 %)]
       263001           0       63120240              0|   0x62a820fa3c97 conn_new+0x37/0x1b2 p_alloc(240) [pool=connection] [locked=20662 (7.8 %)]
        71643           0       47307584              0|   0x62a82105204d pool_get_from_os_noinc+0x12d/0x161 posix_memalign(660) [locked=5393 (7.5 %)]
2025-09-11 16:32:34 +02:00
Willy Tarreau
9d8c2a888b MINOR: activity: collect CPU time spent on memory allocations for each task
When task profiling is enabled, the pool alloc/free code will measure the
time it takes to perform memory allocation after a cache miss or memory
freeing to the shared cache or OS. The time taken with the thread-local
cache is never measured as measuring that time is very expensive compared
to the pool access time. Here doing so costs around 2% performance at 2M
req/s, only when task profiling is enabled, so this remains reasonable.
The scheduler takes care of collecting that time and updating the
sched_activity entry corresponding to the current task when task profiling
is enabled.

The goal clearly is to track places that are wasting CPU time allocating
and releasing too often, or causing large evictions. This appears like
this in "show profiling tasks aggr":

  Tasks activity over 11.428 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   mem_avg   lat_avg
    process_stream             44183891   16.47m    22.36us   491.0ns   1.154us   1.000ns   101.1us
    h1_io_cb                   57386064   4.011m    4.193us   20.00ns   16.00ns      -      29.47us
    sc_conn_io_cb              42088024   49.04s    1.165us      -         -         -      54.67us
    h1_timeout_task              438171   196.5ms   448.0ns      -         -         -      100.1us
    srv_cleanup_toremove_conns       65   1.468ms   22.58us   184.0ns   87.00ns      -      101.3us
    task_process_applet               3   508.0us   169.3us      -      107.0us   1.847us   29.67us
    srv_cleanup_idle_conns            6   225.3us   37.55us   15.74us   36.84us      -      49.47us
    accept_queue_process              2   45.62us   22.81us      -         -      4.949us   54.33us
2025-09-11 16:32:34 +02:00
Willy Tarreau
195794eb59 MINOR: activity: add a new mem_avg column to show profiling stats
This new column will be used for reporting the average time spent
allocating or freeing memory in a task when task profiling is enabled.
For now it is not updated.
2025-09-11 16:32:34 +02:00
Willy Tarreau
98cc815e3e MINOR: activity: collect time spent with a lock held for each task
When DEBUG_THREAD > 0 and task profiling enabled, we'll now measure the
time spent with at least one lock held for each task. The time is
collected by locking operations when locks are taken raising the level
to one, or released resetting the level. An accumulator is updated in
the thread_ctx struct that is collected by the scheduler when the task
returns, and updated in the sched_activity entry of the related task.

This allows to observe figures like this one:

  Tasks activity over 259.516 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   lat_avg
    h1_io_cb                   15466589   2.574m    9.984us      -         -      33.45us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
    sc_conn_io_cb               8047994   8.325s    1.034us      -         -      870.1us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
    process_stream              7734689   4.356m    33.79us   1.990us   1.641us   1.554ms <- sc_notify@src/stconn.c:1206 task_wakeup
    process_stream              7734292   46.74m    362.6us   278.3us   132.2us   972.0us <- stream_new@src/stream.c:585 task_wakeup
    sc_conn_io_cb               7733158   46.88s    6.061us      -         -      68.78us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
    task_process_applet         6603593   4.484m    40.74us   16.69us   34.00us   96.47us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
    task_process_applet         4761796   3.420m    43.09us   18.79us   39.28us   138.2us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
    process_table_expire        4710662   4.880m    62.16us   9.648us   53.95us   158.6us <- run_tasks_from_lists@src/task.c:671 task_queue
    stktable_add_pend_updates   4171868   6.786s    1.626us      -      1.487us   47.94us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
    h1_io_cb                    2871683   1.198s    417.0ns   70.00ns   69.00ns   1.005ms <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup
    process_peer_sync           2304957   5.368s    2.328us      -      1.156us   68.54us <- stktable_add_pend_updates@src/stick_table.c:873 task_wakeup
    process_peer_sync           1388141   3.174s    2.286us      -      1.130us   52.31us <- run_tasks_from_lists@src/task.c:671 task_queue
    stktable_add_pend_updates    463488   3.530s    7.615us   2.000ns   7.134us   771.2us <- stktable_touch_with_exp@src/stick_table.c:654 tasklet_wakeup

Here we see that almost the entirety of stktable_add_pend_updates() is
spent under a lock, that 1/3 of the execution time of process_stream()
was performed under a lock and that 2/3 of it was spent waiting for a
lock (this is related to the 10 track-sc present in this config), and
that the locking time in process_peer_sync() has now significantly
reduced. This is more visible with "show profiling tasks aggr":

  Tasks activity over 475.354 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   lat_avg
    h1_io_cb                   25742539   3.699m    8.622us   11.00ns   10.00ns   188.0us
    sc_conn_io_cb              22565666   1.475m    3.920us      -         -      473.9us
    process_stream             21665212   1.195h    198.6us   140.6us   67.08us   1.266ms
    task_process_applet        16352495   11.31m    41.51us   17.98us   36.55us   112.3us
    process_peer_sync           7831923   17.15s    2.189us      -      1.107us   41.27us
    process_table_expire        6878569   6.866m    59.89us   9.359us   51.91us   151.8us
    stktable_add_pend_updates   6602502   14.77s    2.236us      -      2.060us   119.8us
    h1_timeout_task                 801   703.4us   878.0ns      -         -      185.7us
    srv_cleanup_toremove_conns      347   12.43ms   35.82us   240.0ns   70.00ns   1.924ms
    accept_queue_process            142   1.384ms   9.743us      -         -      340.6us
    srv_cleanup_idle_conns           74   475.0us   6.418us   896.0ns   5.667us   114.6us
2025-09-11 16:32:34 +02:00
Willy Tarreau
95433f224e MINOR: activity: add a new lkd_avg column to show profiling stats
This new column will be used for reporting the average time spent
in a task with at least one lock held. It will only have a non-zero
value when DEBUG_THREAD > 0. For now it is not updated.
2025-09-11 16:32:34 +02:00
Willy Tarreau
4b23b2ed32 MINOR: thread: add a lock level information in the thread_ctx
The new lock_level field indicates the number of cumulated locks that
are held by the current thread. It's fed as soon as DEBUG_THREAD is at
least 1. In addition, thread_isolate() adds 128, so that it's even
possible to check for combinations of both. The value is also reported
in thread dumps (warnings and panics).
2025-09-11 16:32:34 +02:00
Willy Tarreau
503084643f MINOR: activity: collect time spent waiting on a lock for each task
When DEBUG_THREAD > 0, and if task profiling is enabled, then each
locking attempt will measure the time it takes to obtain the lock, then
add that time to a thread_ctx accumulator that the scheduler will then
retrieve to update the current task's sched_activity entry. The value
will then appear avearaged over the number of calls in the lkw_avg column
of "show profiling tasks", such as below:

  Tasks activity over 48.298 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lat_avg
    h1_io_cb                    3200170   26.81s    8.377us      -      32.73us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
    sc_conn_io_cb               1657841   1.645s    992.0ns      -      853.0us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
    process_stream              1600450   49.16s    30.71us   1.936us   1.392ms <- sc_notify@src/stconn.c:1206 task_wakeup
    process_stream              1600321   7.770m    291.3us   209.1us   901.6us <- stream_new@src/stream.c:585 task_wakeup
    sc_conn_io_cb               1599928   7.975s    4.984us      -      65.77us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
    task_process_applet          997609   46.37s    46.48us   16.80us   113.0us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
    process_table_expire         922074   48.79s    52.92us   7.275us   181.1us <- run_tasks_from_lists@src/task.c:670 task_queue
    stktable_add_pend_updates    705423   1.511s    2.142us      -      56.81us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
    task_process_applet          683511   34.75s    50.84us   18.37us   153.3us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
    h1_io_cb                     535395   198.1ms   370.0ns   72.00ns   930.4us <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup

It now makes it pretty obvious which tasks (hence call chains) spend their
time waiting on a lock and for what share of their execution time.
2025-09-11 16:32:34 +02:00
Willy Tarreau
1956c544b5 MINOR: activity: add a new lkw_avg column to show profiling stats
This new column will be used for reporting the average time spent waiting
for a lock. It will only have a non-zero value when DEBUG_THREAD > 0. For
now it is not updated.
2025-09-11 16:32:34 +02:00
Willy Tarreau
9f7ce9e807 MINOR: activity: don't report the lat_tot column for show profiling tasks
This column is pretty useless, as the total latency experienced by tasks
is meaningless, what matters is the average per call. Since we'll add more
columns and we need to keep all of this readable, let's get rid of this
column.
2025-09-11 16:32:34 +02:00
Christopher Faulet
3023e98199 BUG/MINOR: resolvers: Restore round-robin selection on records in DNS answers
Since the commit dcb696cd3 ("MEDIUM: resolvers: hash the records before
inserting them into the tree"), When several records are found in a DNS
answer, the round robin selection over these records is no longer performed.

Indeed, before a list of records was used. To ensure each records was
selected one after the other, at each selection, the first record of the
list was moved at the end. When this list was replaced bu a tree, the same
mechanism was preserved. However, the record is indexed using its key, a
hash of the record. So its position never changes. When it is removed and
reinserted in the tree, its position remains the same. When we walk though
the tree, starting from the root, the records are always evaluated in the
same order. So, even if there are several records in a DNS answer, the same
IP address is always selected.

It is quite easy to trigger the issue with a do-resolv action.

To fix the issue, the node to perform the next selection is now saved. So
instead of restarting from the root each time, we can restart from the next
node of the previous call.

Thanks to Damien Claisse for the issue analysis and for the reproducer.

This patch should fix the issue #3116. It must be backported as far as 2.6.
2025-09-11 15:46:45 +02:00
Christopher Faulet
37abe56b18 BUG/MEDIUM: resolvers: Properly cache do-resolv resolution
As stated by the documentation, when a do-resolv resolution is performed,
the result should be cached for <hold.valid> milliseconds. However, the only
way to cache the result is to always have a requester. When the last
requester is unlink from the resolution, the resolution is released. So, for
a do-resolv resolution, it means it could only work by chance if the same
FQDN is requested enough to always have at least two streams waiting for the
resolution. And because in that case, the cached result is used, it means
the traffic must be quite high.

In fact, a good approach to fix the issue is to keep orphan resolutions to
be able cache the result and only release them after hold.valid milliseconds
after the last real resolution. The resolver's task already releases orphan
resolutions. So we only need to check the expiration date and take care to
not release the resolution when the last stream is unlink from it.

This patch should be backported to all stable versions. We can start to
backport it as far as 3.1 and then wait a bit.
2025-09-11 15:46:45 +02:00