25446 Commits

Aurelien DARRAGON
5c299dee5a MEDIUM: stats: consider that shared stats pointers may be NULL
This patch looks huge, but it has a very simple goal: protect all
accesses to shared stats pointers (both reads and writes), because
we now consider that these pointers may be NULL.

The reason behind this is that despite all precautions taken to ensure
the pointers shouldn't be NULL when not expected, there are still corner
cases (i.e. frontend stats used on a backend which has no FE cap and vice
versa) where we could try to access a memory area which is not
allocated. Willy stumbled on such cases while playing with the ring
servers upon connection error, which eventually led to process crashes
(since 3.3, when shared stats were implemented).

Also, we may decide later that shared stats are optional and should
be disabled on the proxy to save memory and CPU, and this patch is
a step further towards that goal.

So in essence, this patch ensures shared stats pointers are always
initialized (possibly to NULL), and adds the necessary guards before
shared stats pointers are dereferenced. Since we already had some checks
for backend and listener stats, and the pointer address retrieval
should stay in the CPU cache, let's hope that this patch doesn't impact
stats performance much.
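
A minimal sketch of the guard pattern, with a hypothetical, simplified
field layout (the real counters structures differ):

    #include <stddef.h>

    /* hypothetical layout: one shared block per thread group, whose
     * pointer may legitimately be NULL (e.g. missing proxy capability) */
    struct fe_counters_shared_tg { unsigned long long cum_conn; };
    struct fe_counters_shared    { struct fe_counters_shared_tg *tg[64]; };

    static inline void count_cum_conn(struct fe_counters_shared *shared, int tgid)
    {
        struct fe_counters_shared_tg *s = shared ? shared->tg[tgid - 1] : NULL;

        if (s)  /* guard: the shared area may not be allocated for this side */
            __atomic_add_fetch(&s->cum_conn, 1, __ATOMIC_RELAXED);
    }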
2025-09-18 16:49:51 +02:00
Aurelien DARRAGON
40eb1dd135 BUG/MEDIUM: sink: fix unexpected double postinit of sink backend
Willy experienced an unexpected behavior with the config below:

    global
        stats socket :1514

    ring buf1
        server srv1 127.0.0.1:1514

Indeed, haproxy would connect to the ring server twice since commit 23e5f18b
("MEDIUM: sink: change the sink mode type to PR_MODE_SYSLOG"), and one of the
connections would report errors.

The reason is that, despite the above commit saying no change of behavior
was expected, with the sink forward_px proxy now being set with PR_MODE_SYSLOG,
postcheck_log_backend() was automatically executed in addition to the
manual cfg_post_parse_ring() function for each "ring" section. The consequence
is that sink_finalize() was called twice for a given "ring" section, which
means the connection init was triggered twice, which in turn resulted in
the behavior described above, plus possible unexpected side effects.

To fix the issue, when we create the forward_px proxy, we now set the
PR_CAP_INT capability on it to tell haproxy not to automatically manage the
proxy (ie: to skip the automatic log backend postinit), because we are about
to manually manage the proxy from the sink API.

No backport needed, this bug is specific to 3.3
2025-09-18 16:49:29 +02:00
Willy Tarreau
79ef362d9e OPTIM: ring: avoid reloading the tail_ofs value before the CAS in ring_write()
The load followed by the CAS seems to cause two bus cycles, one to
retrieve the cache line in shared state and a second one to get
exclusive ownership of it. Tests show that on x86 it's much better
to just rely on the previous value and preset it to zero before
entering the loop. We just mask the ring lock in case of failure
so as to challenge it on the next iteration, and that's done.

This little change brings 2.3% extra performance (11.34M msg/s) on
a 64-core AMD.
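
A rough sketch of the idea with assumed names (a RING_TAIL_LOCK bit stored
in tail_ofs): start from a guessed value instead of reloading, let the
failed CAS refresh it, and only mask the lock bit before retrying:

    #include <stdint.h>

    #define RING_TAIL_LOCK  (1ULL << 63)   /* assumed: lock bit kept in tail_ofs */

    /* take the tail lock without a separate load before each attempt */
    static uint64_t lock_tail(uint64_t *tail_ofs)
    {
        uint64_t tail = 0;   /* optimistic guess: offset 0, unlocked */

        while (!__atomic_compare_exchange_n(tail_ofs, &tail, tail | RING_TAIL_LOCK,
                                            0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
            /* the failed CAS already refreshed <tail>; just drop the lock
             * bit so the lock is challenged again on the next iteration */
            tail &= ~RING_TAIL_LOCK;
        }
        return tail;         /* offset at which we locked */
    }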
2025-09-18 15:27:32 +02:00
Willy Tarreau
a727c6eaa5 OPTIM: ring: check the queue's owner using a CAS on x86
In the loop where the queue's leader tries to get the tail lock,
we also need to check if another thread took ownership of the queue
the current thread is currently working for. This is currently done
using an atomic load.

Tests show that on x86, using a CAS for this is much more efficient
because it allows keeping the cache line in exclusive state for a
few more cycles, which permits the queue release call after the loop
to complete without having to wait again. The measured gain is +5%
for 128 threads on a 64-core AMD system (11.08M msg/s vs 10.56M).
However, ARM loses about 1% on this, and we cannot afford that on
machines without a fast CAS anyway, so the load is performed using
a CAS only on x86_64. It might not be as efficient on low-end models
but we don't care since they are not the ones dealing with high
contention.
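
A sketch of the idea with hypothetical names, where the check is whether
<me> still owns the queue:

    #include <stdint.h>

    static inline uint32_t read_owner(uint32_t *owner, uint32_t me)
    {
        uint32_t curr;

    #if defined(__x86_64__)
        /* a CAS that rewrites the current value acts as a load but keeps
         * the cache line exclusive for the release that follows the loop */
        curr = me;
        __atomic_compare_exchange_n(owner, &curr, curr, 0,
                                    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
        /* on failure, <curr> was refreshed with the real owner */
    #else
        curr = __atomic_load_n(owner, __ATOMIC_RELAXED);
    #endif
        return curr;
    }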
2025-09-18 15:08:12 +02:00
Willy Tarreau
d25099b359 OPTIM: ring: always relax in the ring lock and leader wait loop
Tests have shown that AMD systems really need to use a cpu_relax()
in these two loops. The performance improves from 10.03 to 10.56M
messages per second (+5%) on a 128-thread system, without affecting
intel nor ARM, so let's do this.
2025-09-18 15:07:56 +02:00
Willy Tarreau
eca1f90e16 CLEANUP: ring: rearrange the wait loop in ring_write()
The loop is constructed in a complicated way with a single break
statement in the middle and many continue statements everywhere,
making it hard to better factor between variants. Let's first
reorganize it so as to make it easier to escape when the ring
tail lock is obtained. The sequence of instructions remains the
same, it's only better organized.
2025-09-18 14:58:38 +02:00
Willy Tarreau
08c6bbb542 OPTIM: sink: don't waste time calling sink_announce_dropped() if busy
If we see that another thread is already busy trying to announce the
dropped counter, there's no point going there, so let's just skip all
that operation from sink_write() and avoid disturbing the other thread.
This results in a boost from 244 to 262k req/s.
2025-09-18 09:07:35 +02:00
Willy Tarreau
4431e3bd26 OPTIM: sink: reduce contention on sink_announce_dropped()
perf top shows that sink_announce_dropped() consumes most of the CPU
on a 128-thread x86 system. Digging further reveals that the atomic
fetch_or() on the dropped field used to detect the presence of another
thread is entirely responsible for this. Indeed, the compiler implements
it using a CAS that loops without relaxing and makes all threads wait
until they can synchronize on this one, only to discover later that
another thread is there and they need to give up.

Let's just replace this with a hand-crafted CAS loop that will detect
*before* attempting the CAS if another thread is there. Doing so
achieves the same goal without forcing threads to agree. With this
simple change, the sustained request rate on h1 with all traces on
bumped from 110k/s to 244k/s!

This should be backported to stable releases where it's often needed
to help debugging.
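
A sketch of such a hand-crafted loop, with an assumed "busy" marker bit
standing in for the real mechanism:

    #include <stdint.h>

    #define SINK_DROPPED_BUSY  (1u << 31)   /* assumed "announce in progress" bit */

    /* return 0 if another thread is already announcing, 1 if we claimed it */
    static int claim_dropped(uint32_t *dropped, uint32_t *count)
    {
        uint32_t curr = __atomic_load_n(dropped, __ATOMIC_RELAXED);

        do {
            if (curr & SINK_DROPPED_BUSY)   /* someone else is on it: give up early */
                return 0;
        } while (!__atomic_compare_exchange_n(dropped, &curr, curr | SINK_DROPPED_BUSY,
                                              0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
        *count = curr;                      /* dropped count we now have to announce */
        return 1;
    }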
2025-09-18 08:38:34 +02:00
Willy Tarreau
361c227465 MINOR: trace: don't call strlen() on the function's name
Currently there's a small mistake in the way the trace function and
macros interact. The calling function name is known as a constant up to
the macro and passed as-is to the __trace() function. That one needs to
know its length and will call ist() on it, resulting in a real call
to strlen() while that length was known before the call. Let's use
an ist instead of a const char* for __trace() and __trace_enabled()
so that we can now completely avoid calling strlen() during this
operation. This has significantly reduced the importance of
__trace_enabled() in perf top.
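
A self-contained illustration of the principle, using a minimal stand-in
for the ist type rather than the real API:

    #include <string.h>

    /* minimal stand-in for the ist type (pointer + length) */
    struct ist { const char *ptr; size_t len; };
    static inline struct ist ist(const char *s)            { return (struct ist){ s, strlen(s) }; }
    static inline struct ist ist2(const void *p, size_t l) { return (struct ist){ p, l }; }

    /* taking a const char * forces an strlen() inside the callee ... */
    static void trace_by_ptr(const char *func) { struct ist f = ist(func); (void)f; }

    /* ... while taking an ist lets the call site, where the name is a
     * constant, provide the length at compile time */
    static void trace_by_ist(struct ist func) { (void)func; }

    /* __func__ is a constant array, so its length needs no strlen() */
    #define TRACE_HERE()  trace_by_ist(ist2(__func__, sizeof(__func__) - 1))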
2025-09-18 08:31:57 +02:00
Willy Tarreau
06fa9f717f MINOR: trace: don't call strlen() on the thread-id numeric encoding
In __trace(), we're building a short numeric string for the thread id,
but it is passed through strlen() in the call to ist() because it's not a
constant. We do know that it's exactly 3 chars long so we can manage
this using ist2() and pass it the length instead in order to reduce
the number of calls to strlen().

Also let's note that the thread number will no longer be numeric for
thread numbers above 100.
2025-09-18 08:02:59 +02:00
Willy Tarreau
d53ad49ad1 BUG/MEDIUM: ring: invert the length check to avoid an int overflow
Vincent Gramer reported in GH issue #3125 a case of crash on a BUG_ON()
condition in the rings. What happens is that a message that is one byte
less than the maximum ring size is emitted, and it passes all the checks,
but once inflated by the extra +1 for the refcount, it no longer fits. But
the check was made based on the message size compared to the space left,
except that this space left can now be negative, which wraps to a huge
positive value for a size_t, so the check remained valid and triggered a
BUG_ON() later.

Let's compute the size the other way around instead (i.e. current +
needed) since we can't have rings as large as half of the memory space
anyway, thus we have no risk of overflow on this one.
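
A small sketch of the two forms of the check (generic names, not the
actual ring code):

    #include <stddef.h>

    /* before: when <used> exceeds <max>, the subtraction wraps around and
     * yields a huge size_t, so the check wrongly passes */
    static int fits_by_subtraction(size_t used, size_t needed, size_t max)
    {
        return needed <= max - used;
    }

    /* after: sum on the left side; it cannot overflow in practice because
     * a ring cannot come close to half of the address space */
    static int fits_by_addition(size_t used, size_t needed, size_t max)
    {
        return used + needed <= max;
    }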

This needs to be backported to all versions supporting multi-threaded
rings (3.0 and above).

Thanks to Vincent for the easy and working reproducer.
2025-09-17 18:45:13 +02:00
Willy Tarreau
8c077c17eb MINOR: server: add the "cc" keyword to set the TCP congestion controller
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with outgoing connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Unknown or
forbidden algorithms will be ignored and the default one will continue to
be used.
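
On Linux and FreeBSD this maps to the TCP_CONGESTION socket option; a
minimal sketch of how such a setting is presumably applied to a socket
(not the actual keyword code):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* try to switch the congestion controller of <fd>; unknown or forbidden
     * algorithms make setsockopt() fail and the kernel default is kept */
    static int set_cc(int fd, const char *algo)
    {
        return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo));
    }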
2025-09-17 17:19:33 +02:00
Willy Tarreau
4ed3cf295d MINOR: listener: add the "cc" bind keyword to set the TCP congestion controller
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with incoming connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Permission issues
might be reported (as warnings).
2025-09-17 17:03:42 +02:00
Ben Kallus
31d0695a6a IMPORT: ebtree: replace hand-rolled offsetof to avoid UB
The C standard specifies that it's undefined behavior to dereference
NULL (even if you use & right after). The hand-rolled offsetof idiom
&(((s*)NULL)->f) is thus technically undefined. This clutters the
output of UBSan and is simple to fix: just use the real offsetof when
it's available.

Note that there's no clear statement about this point in the spec,
only several points which together converge to this:

- From N3220, 6.5.3.4:
  A postfix expression followed by the -> operator and an identifier
  designates a member of a structure or union object. The value is
  that of the named member of the object to which the first expression
  points, and is an lvalue.

- From N3220, 6.3.2.1:
  An lvalue is an expression (with an object type other than void) that
  potentially designates an object; if an lvalue does not designate an
  object when it is evaluated, the behavior is undefined.

- From N3220, 6.5.4.4 p3:
  The unary & operator yields the address of its operand. If the
  operand has type "type", the result has type "pointer to type". If
  the operand is the result of a unary * operator, neither that operator
  nor the & operator is evaluated and the result is as if both were
  omitted, except that the constraints on the operators still apply and
  the result is not an lvalue. Similarly, if the operand is the result
  of a [] operator, neither the & operator nor the unary * that is
  implied by the [] is evaluated and the result is as if the & operator
  were removed and the [] operator were changed to a + operator.

=> In short, this is saying that C guarantees these identities:
    1. &(*p) is equivalent to p
    2. &(p[n]) is equivalent to p + n

As a consequence, &(*p) doesn't result in the evaluation of *p, only
the evaluation of p (and similar for []). There is no corresponding
special carve-out for ->.

See also: https://pvs-studio.com/en/blog/posts/cpp/0306/

After this patch, HAProxy can run without crashing after building w/
clang-19 -fsanitize=undefined -fno-sanitize=function,alignment
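
For illustration, the idiom in question next to the standard alternative
(a generic sketch, not the exact ebtree change):

    #include <stddef.h>

    /* hand-rolled idiom: forms an lvalue through a NULL pointer, which is
     * what UBSan complains about */
    #define offsetof_ub(type, field)  ((size_t)&(((type *)NULL)->field))

    /* preferred: the compiler builtin when available, else the standard macro */
    #if defined(__GNUC__)
    # define offsetof_ok(type, field)  __builtin_offsetof(type, field)
    #else
    # define offsetof_ok(type, field)  offsetof(type, field)
    #endif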

This is ebtree commit bd499015d908596f70277ddacef8e6fa998c01d5.
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit 5211c2f71d78bf546f5d01c8d3c1484e868fac13.
2025-09-17 14:30:32 +02:00
Willy Tarreau
a31da78685 IMPORT: ebtree: add a definition of offsetof()
We'll use this to improve the definition of container_of(). Let's define
it if it does not exist. We can rely on __builtin_offsetof() on recent
enough compilers.

This is ebtree commit 1ea273e60832b98f552b9dbd013e6c2b32113aa5.
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit 69b2ef57a8ce321e8de84486182012c954380401.
2025-09-17 14:30:32 +02:00
Ben Kallus
ddbff4e235 IMPORT: ebtree: Fix UB from clz(0)
From 'man gcc': passing 0 as the argument to "__builtin_ctz" or
"__builtin_clz" invokes undefined behavior. This triggers UBsan
in HAProxy.

[wt: tested in treebench and verified not to cause any performance
 regression with opstime-u32 nor stress-u32]
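
A generic illustration of the hazard and a simple guard (not necessarily
the exact fix applied in ebtree):

    /* count leading zeroes with the zero case handled explicitly, since
     * __builtin_clz(0) is undefined behavior */
    static inline unsigned clz32_safe(unsigned x)
    {
        return x ? (unsigned)__builtin_clz(x) : 32;
    }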
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit 8c29daf9fa6e34de8c7684bb7713e93dcfe09029.
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit cf3b93736cb550038325e1d99861358d65f70e9a.
2025-09-17 14:30:32 +02:00
Willy Tarreau
52c6dd773d IMPORT: ebst: use prefetching in lookup() and insert()
While the previous optimizations couldn't be preserved due to the
possibility of out-of-bounds accesses, at least the prefetch is useful.
A test on treebench shows that for 64k short strings, the lookup time
falls from 276 to 199ns per lookup (28% savings), and the insert falls
from 311 to 296ns (4.9% savings), which are pretty respectable, so
let's do this.

This is ebtree commit b44ea5d07dc1594d62c3a902783ed1fb133f568d.
2025-09-17 14:30:32 +02:00
Willy Tarreau
fef4cfbd21 IMPORT: ebtree: only use __builtin_prefetch() when supported
It looks like __builtin_prefetch() appeared in gcc-3.1 as there's no
mention of it in 3.0's doc. Let's replace it with eb_prefetch() which
maps to __builtin_prefetch() on supported compilers and falls back to
the usual do{}while(0) on other ones. It was tested to properly build
with tcc as well as gcc-2.95.
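
A sketch of such a wrapper, assuming the gcc >= 3.1 availability mentioned
above:

    /* map to the builtin on compilers that have it, else to a no-op */
    #if defined(__GNUC__) && (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1))
    # define eb_prefetch(addr)  __builtin_prefetch(addr)
    #else
    # define eb_prefetch(addr)  do { } while (0)
    #endif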

This is ebtree commit 7ee6ede56a57a046cb552ed31302b93ff1a21b1a.
2025-09-17 14:30:32 +02:00
Willy Tarreau
3dda813d54 IMPORT: eb32/64: optimize insert for modern CPUs
Similar to previous patches, let's improve the insert() descent loop to
avoid discovering mandatory data too late. The change here is even
simpler than the previous ones: a prefetch was installed and troot is
calculated before the last instruction in a speculative way. This was
enough to gain +50% insertion rate on random data.

This is ebtree commit e893f8cc4d44b10f406b9d1d78bd4a9bd9183ccf.
2025-09-17 14:30:32 +02:00
Willy Tarreau
61654c07bd IMPORT: ebmb: optimize the lookup for modern CPUs
This is the same principles as for the latest improvements made on
integer trees. Applying the same recipes made the ebmb_lookup()
function jump from 10.07 to 12.25 million lookups per second on a
10k random values tree (+21.6%).

It's likely that the ebmb_lookup_longest() code could also benefit
from this, though this was neither explored nor tested.

This is ebtree commit a159731fd6b91648a2fef3b953feeb830438c924.
2025-09-17 14:30:32 +02:00
Willy Tarreau
6c54bf7295 IMPORT: eb32/eb64: place an unlikely() on the leaf test
In the loop we can help the compiler build slightly more efficient code
by placing an unlikely() around the leaf test. This shows a consistent
0.5% performance gain both on eb32 and eb64.
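
For reference, unlikely() is the usual __builtin_expect() wrapper:

    /* the usual shape of such a hint */
    #define unlikely(x)  __builtin_expect(!!(x), 0)

    /* e.g. in the descent loop (sketch): if (unlikely(leaf_reached)) break; */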

This is ebtree commit 6c9cdbda496837bac1e0738c14e42faa0d1b92c4.
2025-09-17 14:30:32 +02:00
Willy Tarreau
384907f4e7 IMPORT: eb32: drop the now useless node_bit variable
This one was previously used to preload from the node and keep a copy
in a register on i386 machines with few registers. With the new more
optimal code it's totally useless, so let's get rid of it. By the way
the 64 bit code didn't use it at all anyway.

This is ebtree commit 1e219a74cfa09e785baf3637b6d55993d88b47ef.
2025-09-17 14:30:31 +02:00
Willy Tarreau
c9e4adf608 IMPORT: eb32/eb64: use a more parallelizable check for lack of common bits
Instead of shifting the XOR value right and comparing it to 1, which
roughly requires 2 sequential instructions, better test if the XOR has
any bit above the current bit, which means any bit set among those
strictly higher, or in other words that XOR & (-bit << 1) is non-zero.
This is one less instruction in the fast path and gives another nice
performance gain on random keys (in million lookups/s):

    eb32   1k:  33.17 -> 37.30   +12.5%
          10k:  15.74 -> 17.08   +8.51%
         100k:   8.00 ->  9.00   +12.5%
    eb64   1k:  34.40 -> 38.10   +10.8%
          10k:  16.17 -> 17.10   +5.75%
         100k:   8.38 ->  8.87   +5.85%
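
A minimal sketch of the two equivalent tests, using hypothetical variable
names with bit == 1 << node_bit:

    #include <stdint.h>

    /* does <xorval> contain any bit strictly above <node_bit>? */
    static inline int higher_bit_shift(uint32_t xorval, unsigned node_bit)
    {
        return (xorval >> node_bit) > 1;      /* shift, then compare: serialized */
    }

    static inline int higher_bit_mask(uint32_t xorval, uint32_t bit)
    {
        /* -bit sets <node_bit> and everything above it; shifting left once
         * drops <node_bit> itself, leaving only the strictly higher bits */
        return (xorval & (-bit << 1)) != 0;
    }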

This is ebtree commit c942a2771758eed4f4584fe23cf2914573817a6b.
2025-09-17 14:30:31 +02:00
Willy Tarreau
6af17d491f IMPORT: eb32/eb64: reorder the lookup loop for modern CPUs
The current code derives the next troot from a calculation.
This was efficient when the algorithm was developed many years ago
on K6 and K7 CPUs running at low frequencies with few registers and
limited branch prediction units but nowadays with ultra-deep pipelines
and high latency memory that's no longer efficient, because the CPU
needs to have completed multiple operations before knowing which
address to start fetching from. It's sad because we only have two
branches each time but the CPU cannot know it. In addition, the
calculation is performed late in the loop, which does not help the
address generation unit to start prefetching next data.

Instead we should help the CPU by preloading data early from the node
and calculating troot as soon as possible. The CPU will be able to
postpone that processing until the dependencies are available and it
really needs to dereference it. In addition we must absolutely avoid
serializing instructions such as "(a >> b) & 1" because there's no
way for the compiler to parallelize that code nor for the CPU to pre-
process some early data.

What this patch does is relatively simple:

  - we try to prefetch the next two branches as soon as the
    node is known, which will help dereference the selected node in
    the next iteration; it was shown that it only works with the next
    changes though, otherwise it can reduce the performance instead.
    In practice the prefetching will start a bit later once the node
    is really in the cache, but since there's no dependency between
    these instructions and any other one, we let the CPU optimize as
    it wants.

  - we preload all important data from the node (next two branches,
    key and node.bit) very early even if not immediately needed.
    This is cheap, it doesn't cause any pipeline stall and speeds
    up later operations.

  - we pre-calculate 1<<bit that we assign into a register, so as
    to avoid serializing instructions when deciding which branch to
    take.

  - we assign the troot based on a ternary operation (or if/else) so
    that the CPU knows upfront the two possible next addresses without
    waiting for the end of a calculation and can prefetch their contents
    every time the branch prediction unit guesses right.
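
A very simplified descent sketch (hypothetical node layout, not the real
ebtree code) putting these four points together:

    #include <stddef.h>

    struct n32 {
        struct n32 *b[2];   /* the two branches */
        unsigned    key;
        unsigned    bit;    /* split bit of this node */
    };

    static struct n32 *lookup(struct n32 *troot, unsigned x)
    {
        while (troot) {
            struct n32 *l, *r;
            unsigned key, msk;

            __builtin_prefetch(troot->b[0]);  /* start fetching both children early */
            __builtin_prefetch(troot->b[1]);

            l   = troot->b[0];                /* preload everything the iteration needs */
            r   = troot->b[1];
            key = troot->key;
            msk = 1u << troot->bit;           /* precomputed, kept in a register */

            if (key == x)                     /* equality test helps random lookups */
                return troot;

            troot = (x & msk) ? r : l;        /* ternary: both targets known upfront */
        }
        return NULL;
    }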

Just doing this provides significant gains at various tree sizes on
random keys (in million lookups per second):

  eb32   1k:  29.07 -> 33.17  +14.1%
        10k:  14.27 -> 15.74  +10.3%
       100k:   6.64 ->  8.00  +20.5%
  eb64   1k:  27.51 -> 34.40  +25.0%
        10k:  13.54 -> 16.17  +19.4%
       100k:   7.53 ->  8.38  +11.3%

The performance is now much closer to the sequential keys. This was
done for all variants ({32,64}{,i,le,ge}).

Another point, the equality test in the loop improves the performance
when looking up random keys (since we don't need to reach the leaf),
but is counter-productive for sequential keys, which can gain ~17%
without that test. However sequential keys are normally not used with
exact lookups, but rather with lookup_ge() that spans a time frame,
and which does not have that test for this precise reason, so in the
end both use cases are served optimally.

It's interesting to note that everything here is solely based on data
dependencies, and that trying to perform *less* operations upfront
always ends up with lower performance (typically the original one).

This is ebtree commit 05a0613e97f51b6665ad5ae2801199ad55991534.
2025-09-17 14:30:31 +02:00
Willy Tarreau
dcd4d36723 IMPORT: ebtree: delete unusable ebpttree.c
Since commit 21fd162 ("[MEDIUM] make ebpttree rely solely on eb32/eb64
trees") it was no longer used and no longer builds. The commit message
mentions that the file is no longer needed; probably a rebase failed
and left the file there.

This is ebtree commit fcfaf8df90e322992f6ba3212c8ad439d3640cb7.
2025-09-17 14:30:31 +02:00
Aurelien DARRAGON
b72225dee2 DOC: internals: document the shm-stats-file format/mapping
Add some documentation about the shm stats file structure to help write
tools that can parse the file to use the shared stats counters.

This file was written for shm stats file version 1.0 specifically;
it may need to be updated when the shm stats file structure changes
in the future.
2025-09-17 11:32:58 +02:00
Aurelien DARRAGON
644b6b9925 MINOR: counters: document that tg shared counters are tied to shm-stats-file mapping
Let's explicitly mention that fe_counters_shared_tg and
be_counters_shared_tg structs are embedded in shm_stats_file_object
struct so any change in those structs will result in shm stats file
incompatibility between processes, thus extra precaution must be
taken when making changes to them.

Note that the provisioning made in the shm_stats_file_object struct could
be used to add members to {fe,be}_counters_shared_tg without changing
shm_stats_file_object struct size if needed in order to preserve
shm stats file version.
2025-09-17 11:31:29 +02:00
Aurelien DARRAGON
31b3be7aae CLEANUP: log: remove deadcode in px_parse_log_steps()
When logsteps proxy storage was migrated from eb nodes to bitmasks in
6a92b14 ("MEDIUM: log/proxy: store log-steps selection using a bitmask,
not an eb tree"), some unused eb node related code was left over in
px_parse_log_steps()

Not only is this code unused, it also resulted in wasted memory since
an eb node was allocated for nothing.

This should fix GH #3121
2025-09-17 11:31:17 +02:00
Willy Tarreau
3d73e6c818 BUG/MEDIUM: pattern: fix possible infinite loops on deletion (try 2)
Commit e36b3b60b3 ("MEDIUM: migrate the patterns reference to cebs_tree")
changed the construction of the loops used to look up matching nodes, and
since we don't need two elements anymore, the "continue" statement now
loops on the same element when deleting. Let's fix this to make sure it
passes through the next one.

While this bug is 3.3 only, it turns out that 3.2 is also affected by
the incorrect loop construct in pat_ref_set_from_node(), where it's
possible to run an infinite loop since commit 010c34b8c7 ("MEDIUM:
pattern: consider gen_id in pat_ref_set_from_node()") due to the
"continue" statement being placed before the ebmb_next_dup() call.

As such the relevant part of this fix (pat_ref_set_from_elt) will
need to be backported to 3.2.
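
The general shape of the pitfall and of the fix, with made-up helper names
(not the actual pattern code):

    /* iterate over duplicate nodes and delete the matching ones: the next
     * element must be fetched *before* the deletion, otherwise "continue"
     * comes back to the element that was just removed */
    struct dup_node { struct dup_node *next_dup; int gen_id; };

    static void delete_matching(struct dup_node *node, int gen,
                                void (*delete_node)(struct dup_node *))
    {
        struct dup_node *next;

        for (; node; node = next) {
            next = node->next_dup;      /* advance first */
            if (node->gen_id != gen)
                continue;               /* safely moves on to <next> */
            delete_node(node);
        }
    }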
2025-09-16 16:32:39 +02:00
Willy Tarreau
f1b1d3682a Revert "BUG/MEDIUM: pattern: fix possible infinite loops on deletion"
This reverts commit 359a829ccb8693e0b29808acc0fa7975735c0353.
The fix is neither sufficient nor correct (it triggers ASAN). Better
redo it cleanly rather than accumulate invalid fixes.
2025-09-16 16:32:39 +02:00
William Lallemand
6b6c03bc0d CI: scripts: mkdir BUILDSSL_TMPDIR
Create the BUILDSSL_TMPDIR at the beginning of the script instead of
having to create it in each download function.
2025-09-16 15:35:35 +02:00
William Lallemand
9517116f63 CI: github: add an OpenSSL + ECH job
The upcoming ECH feature needs a patched OpenSSL with the "feature/ech"
branch.

This daily job launches an OpenSSL build, as well as an haproxy build with
reg-tests.
2025-09-16 15:05:44 +02:00
William Lallemand
31319ff7f0 CI: scripts: add support for git in openssl builds
Add support for git releases downloaded from github in openssl builds:

- GIT_TYPE variable allows you to choose between "branch" or "commit"
- OPENSSL_VERSION variable supports a "git-" prefix
- "git-${commit_id}" is stored in .openssl_version instead of the branch
  name for version comparison.
2025-09-16 15:05:44 +02:00
Willy Tarreau
359a829ccb BUG/MEDIUM: pattern: fix possible infinite loops on deletion
Commit e36b3b60b3 ("MEDIUM: migrate the patterns reference to cebs_tree")
changed the construction of the loops used to look up matching nodes, and
since we don't need two elements anymore, the "continue" statement now
loops on the same element when deleting. Let's fix this to make sure it
passes through the next one.

No backport is needed, this is only 3.3.
2025-09-16 11:49:01 +02:00
Willy Tarreau
4edff4a2cc CLEANUP: vars: use the item API for the variables trees
The variables trees use the immediate cebtree API, better use the
item one which is more expressive and safer. The "node" field was
renamed to "name_node" to avoid any ambiguity.
2025-09-16 10:51:23 +02:00
Willy Tarreau
c058cc5ddf CLEANUP: tools: use the item API for the file names tree
The file names tree uses the immediate cebtree API, better use the
item one which is more expressive and safer.
2025-09-16 10:41:19 +02:00
Willy Tarreau
2d6b5c7a60 MEDIUM: connection: reintegrate conn_hash_node into connection
Previously the conn_hash_node was placed outside the connection due
to the big size of the eb64_node that could have negatively impacted
frontend connections. But having it outside also means that one
extra allocation is needed for each backend connection, and that one
memory indirection is needed for each lookup.

With the compact trees, the tree node is smaller (16 bytes vs 40) so
the overhead is much lower. By integrating it into the connection,
We're also eliminating one pointer from the connection to the hash
node and one pointer from the hash node to the connection (in addition
to the extra object bookkeeping). This results in saving at least 24
bytes per total backend connection, and only inflates connections by
16 bytes (from 240 to 256), which is a reasonable compromise.

Tests on a 64-core EPYC show a 2.4% increase in the request rate
(from 2.08 to 2.13 Mrps).
2025-09-16 09:23:46 +02:00
Willy Tarreau
ceaf8c1220 MEDIUM: connection: move idle connection trees to ceb64
Idle connection trees currently require a 56-byte conn_hash_node per
connection, which can be reduced to 32 bytes by moving to ceb64. While
ceb64 is theoretically slower, in practice here we're essentially
dealing with trees that almost always contain a single key and many
duplicates. In this case, ceb64 insert and lookup functions become
faster than eb64 ones because all duplicates are a list accessed in
O(1) while it's a subtree for eb64. In tests it is impossible to tell
the difference between the two, so it's worth reducing the memory
usage.

This commit brings the following memory savings to conn_hash_node
(one per backend connection), and to srv_per_thread (one per thread
and per server):

     struct       before  after  delta
  conn_hash_node    56     32     -24
  srv_per_thread    96     72     -24

The delicate part is conn_delete_from_tree(), because we need to
know the tree root the connection is attached to. But thanks to
recent cleanups, it's now clear enough (i.e. idle/safe/avail vs
session are easy to distinguish).
2025-09-16 09:23:46 +02:00
Willy Tarreau
95b8adff67 MINOR: connection: pass the thread number to conn_delete_from_tree()
We'll soon need to choose the server's root based on the connection's
flags, and for this we'll need the thread it's attached to, which is
not always the current one. This patch simply passes the thread number
from all callers. They know it because they just set the idle_conns
lock on it prior to calling the function.
2025-09-16 09:23:46 +02:00
Willy Tarreau
efe519ab89 CLEANUP: backend: use a single variable for removed in srv_cleanup_idle_conns()
Probably due to older code, there's a boolean variable used to set
another one which is then checked. Also the first check is made under
the lock, which is unnecessary. Let's simplify this and use a single
variable. This only makes the code clearer, it doesn't change the output
code.
2025-09-16 09:23:46 +02:00
Willy Tarreau
f7d1fc2b08 MINOR: server: pass the server and thread to srv_migrate_conns_to_remove()
We'll need to have access to the srv_per_thread element soon from this
function, and there's no particular reason for passing it list pointers
so let's pass the server and the thread so that it is autonomous. It
also makes the calling code simpler.
2025-09-16 09:23:46 +02:00
Willy Tarreau
d1c5df6866 CLEANUP: server: use eb64_entry() not ebmb_entry() to convert an eb64
There were a few leftovers from an earlier version of the conn_hash_node
that was using ebmb nodes. A few calls to ebmb_first() and ebmb_entry()
were still present while acting on an eb64 tree. These are harmless as
one is just eb_first() and the other container_of(), but it's confusing
so let's clean them up.
2025-09-16 09:23:46 +02:00
Willy Tarreau
3d18a0d4c2 CLEANUP: backend: factor the connection lookup loop
The connection lookup loop is made of two nearly identical blocks, one
looking in the idle or safe lists and the other one looking into the safe
list only. The second one is skipped if a connection was found or if the
request looks for a safe one (since that was already done). Also, the two
are slightly different due to leftovers from earlier versions: the second
one checks for safe connections while the first one does not, and the
second one sets is_safe, which is not used later.

Let's just rationalize all this by placing them in a loop which checks
first from the idle conns and second from the safe ones, or skips the
first step if the request wants a safe connection. This reduces the
code and shortens the time spent under the lock.
2025-09-16 09:23:46 +02:00
Willy Tarreau
7773d87ea6 CLEANUP: proxy: slightly reorganize fields to plug some holes
The proxy struct has several small holes that deserved being plugged by
moving a few fields around. Now we're down to 3056 from 3072 previously,
and the remaining holes are small.

At the moment, compared to before this series, we're seeing these
sizes:

    type\size   7d554ca62   current  delta
    listener       752        704     -48  (-6.4%)
    server        4032       3840    -192  (-4.8%)
    proxy         3184       3056    -128  (-4%)
    stktable      3392       3328     -64  (-1.9%)

Configs with many servers have shrunk by about 4% in RAM and configs
with many proxies by about 3%.
2025-09-16 09:23:46 +02:00
Willy Tarreau
8df81b6fcc CLEANUP: server: slightly reorder fields in the struct to plug holes
The struct server still has a lot of holes and padding that make it
quite big. By moving a few fields around between areas which do not
interact (e.g. boot vs aligned areas), it's quite easy to plug some
of them and/or to arrange larger ones which could be reused later with
a bit more effort. Here we've reduced holes by 40 bytes, allowing the
struct to shrink by one more cache line (64 bytes). The new size is
3840 bytes.
2025-09-16 09:23:46 +02:00
Willy Tarreau
d18d972b1f MEDIUM: server: index server ID using compact trees
The server ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct server for this node, plus 8 bytes in
struct proxy. The server struct is still 3904 bytes large (due to
alignment) and the proxy struct is 3072.
2025-09-16 09:23:46 +02:00
Willy Tarreau
66191584d1 MEDIUM: listener: index listener ID using compact trees
The listener ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct listener for this node, plus 8 bytes
in struct proxy. The struct listener is now 704 bytes large, and the
struct proxy 3080.
2025-09-16 09:23:46 +02:00
Willy Tarreau
1a95bc42c7 MEDIUM: proxy: index proxy ID using compact trees
The proxy ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct proxy for this node, plus 8 bytes in
the root (which is static so not much relevant here). Now the proxy is
3088 bytes large.
2025-09-16 09:23:46 +02:00
Willy Tarreau
eab5b89dce MINOR: proxy: add proxy_index_id() to index a proxy by its ID
This avoids needlessly exposing the tree's root and the mechanics outside
of the low-level code.
2025-09-16 09:23:46 +02:00
Willy Tarreau
5e4b6714e1 MINOR: listener: add listener_index_id() to index a listener by its ID
This avoids needlessly exposing the tree's root and the mechanics outside
of the low-level code.
2025-09-16 09:23:46 +02:00