25390 Commits

Author SHA1 Message Date
Willy Tarreau
d0d60a007d MEDIUM: proxy: switch conf.name to cebis_tree
This is used to index the proxy's name and it contains a copy of the
pointer to the proxy's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct proxy, which is now 3112
bytes large.

Here we need to continue to support duplicates since they're still
allowed between type-incompatible proxies.

Interestingly, the use of cebis_next_dup() instead of cebis_next() in
proxy_find_by_name() allows us to get rid of an strcmp() that was
performed for each use_backend rule. A test with a large config
(100k backends) shows that we can get 3% extra performance on a
config involving a static use_backend rule (3.09M to 3.18M rps),
and even 4.5% on a dynamic rule selecting a random backend (2.47M
to 2.59M).
2025-09-16 09:23:46 +02:00
Willy Tarreau
fdf6fd5b45 MEDIUM: server: switch the host_dn member to cebis_tree
This member is used to index the hostname_dn contents for DNS resolution.
Let's replace it with a cebis_tree to save another 32 bytes (24 for the
node + 8 by avoiding the duplication of the pointer). The struct server is
now at 3904 bytes.
2025-09-16 09:23:46 +02:00
Willy Tarreau
413e903a22 MEDIUM: server: switch conf.name to cebis_tree
This is used to index the server name and it contains a copy of the
pointer to the server's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct server, which remains 3968
bytes large due to alignment. The proxy struct shrinks by 8 bytes to 3144.

It's worth noting that the current way duplicate names are handled remains
based on the previous mechanism where dups were permitted. Ideally we
should now reject them during insertion and use unique key trees instead.
2025-09-16 09:23:46 +02:00
Willy Tarreau
0e99f64fc6 MEDIUM: server: switch addr_node to cebis_tree
This contains the text representation of the server's address, for use
with stick-tables with "srvkey addr". Switching them to a compact node
saves 24 more bytes from this structure. The key was moved to an external
pointer "addr_key" right after the node.

The server struct is now 3968 bytes (down from 4032) due to alignment, and
the proxy struct shrinks by 8 bytes to 3152.
2025-09-16 09:23:46 +02:00
Willy Tarreau
91258fb9d8 MEDIUM: guid: switch guid to more compact cebuis_tree
The current guid struct size is 56 bytes. Once reduced using compact
trees, it goes down to 32 (almost half). We're not on a critical path
and size matters here, so better switch to this.

It's worth noting that the name part could also be stored in the
guid_node at the end to save 8 extra byte (no pointer needed anymore),
however the purpose of this struct is to be embedded into other ones,
which is not compatible with having a dynamic size.

Affected struct sizes in bytes:

           Before     After   Diff
  server    4032       4032     0*
  proxy     3184       3160    -24
  listener   752        728    -24

*: struct server is full of holes and padding (176 bytes) and is
64-byte aligned. Moving the guid_node elsewhere such as after sess_conn
reduces it to 3968, or one less cache line. There's no point in moving
anything now because forthcoming patches will arrange other parts.
2025-09-16 09:23:46 +02:00
Willy Tarreau
e36b3b60b3 MEDIUM: migrate the patterns reference to cebs_tree
cebs_tree are 24 bytes smaller than ebst_tree (16B vs 40B), and pattern
references are only used during map/acl updates, so their storage is
pure loss between updates (which most of the time never happen). By
switching their indexing to compact trees, we can save 16 to 24 bytes
per entry depending on alightment (here it's 24 per struct but 16
practical as malloc's alignment keeps 8 unused).

Tested on core i7-8650U running at 3.0 GHz, with a file containing
17.7M IP addresses (16.7M different):

   $ time  ./haproxy -c -f acl-ip.cfg

Save 280 MB RAM for 17.7M IP addresses, and slightly speeds up the
startup (5.8%, from 19.2s to 18.2s), a part of which possible being
attributed to having to write less memory. Note that this is on small
strings. On larger ones such as user-agents, ebtree doesn't reread
the whole key and might be more efficient.

Before:
  RAM (VSZ/RSS): 4443912 3912444

  real    0m19.211s
  user    0m18.138s
  sys     0m1.068s

  Overhead  Command         Shared Object      Symbol
    44.79%  haproxy  haproxy            [.] ebst_insert
    25.07%  haproxy  haproxy            [.] ebmb_insert_prefix
     3.44%  haproxy  libc-2.33.so       [.] __libc_calloc
     2.71%  haproxy  libc-2.33.so       [.] _int_malloc
     2.33%  haproxy  haproxy            [.] free_pattern_tree
     1.78%  haproxy  libc-2.33.so       [.] inet_pton4
     1.62%  haproxy  libc-2.33.so       [.] _IO_fgets
     1.58%  haproxy  libc-2.33.so       [.] _int_free
     1.56%  haproxy  haproxy            [.] pat_ref_push
     1.35%  haproxy  libc-2.33.so       [.] malloc_consolidate
     1.16%  haproxy  libc-2.33.so       [.] __strlen_avx2
     0.79%  haproxy  haproxy            [.] pat_idx_tree_ip
     0.76%  haproxy  haproxy            [.] pat_ref_read_from_file
     0.60%  haproxy  libc-2.33.so       [.] __strrchr_avx2
     0.55%  haproxy  libc-2.33.so       [.] unlink_chunk.constprop.0
     0.54%  haproxy  libc-2.33.so       [.] __memchr_avx2
     0.46%  haproxy  haproxy            [.] pat_ref_append

After:
  RAM (VSZ/RSS): 4166108 3634768

  real    0m18.114s
  user    0m17.113s
  sys     0m0.996s

  Overhead  Command  Shared Object       Symbol
    38.99%  haproxy  haproxy             [.] cebs_insert
    27.09%  haproxy  haproxy             [.] ebmb_insert_prefix
     3.63%  haproxy  libc-2.33.so        [.] __libc_calloc
     3.18%  haproxy  libc-2.33.so        [.] _int_malloc
     2.69%  haproxy  haproxy             [.] free_pattern_tree
     1.99%  haproxy  libc-2.33.so        [.] inet_pton4
     1.74%  haproxy  libc-2.33.so        [.] _IO_fgets
     1.73%  haproxy  libc-2.33.so        [.] _int_free
     1.57%  haproxy  haproxy             [.] pat_ref_push
     1.48%  haproxy  libc-2.33.so        [.] malloc_consolidate
     1.22%  haproxy  libc-2.33.so        [.] __strlen_avx2
     1.05%  haproxy  libc-2.33.so        [.] __strcmp_avx2
     0.80%  haproxy  haproxy             [.] pat_idx_tree_ip
     0.74%  haproxy  libc-2.33.so        [.] __memchr_avx2
     0.69%  haproxy  libc-2.33.so        [.] __strrchr_avx2
     0.69%  haproxy  libc-2.33.so        [.] _IO_getline_info
     0.62%  haproxy  haproxy             [.] pat_ref_read_from_file
     0.56%  haproxy  libc-2.33.so        [.] unlink_chunk.constprop.0
     0.56%  haproxy  libc-2.33.so        [.] cfree@GLIBC_2.2.5
     0.46%  haproxy  haproxy             [.] pat_ref_append

If the addresses are totally disordered (via "shuf" on the input file),
we see both implementations reach exactly 68.0s (slower due to much
higher cache miss ratio).

On large strings such as user agents (1 million here), it's now slightly
slower (+9%):

Before:
  real    0m2.475s
  user    0m2.316s
  sys     0m0.155s

After:
  real    0m2.696s
  user    0m2.544s
  sys     0m0.147s

But such patterns are much less common than short ones, and the memory
savings do still count.

Note that while it could be tempting to get rid of the list that chains
all these pat_ref_elt together and only enumerate them by walking along
the tree to save 16 extra bytes per entry, that's not possible due to
the problem that insertion ordering is critical (think overlapping regex
such as /index.* and /index.html). Currently it's not possible to proceed
differently because patterns are first pre-loaded into the pat_ref via
pat_ref_read_from_file_smp() and later indexed by pattern_read_from_file(),
which has to only redo the second part anyway for maps/acls declared
multiple times.
2025-09-16 09:23:46 +02:00
Willy Tarreau
ddf900a0ce IMPORT: cebtree: import version 0.5.0 to support duplicates
The support for duplicates is necessary for various use cases related
to config names, so let's upgrade to the latest version which brings
this support. This updates the cebtree code to commit 808ed67 (tag
0.5.0). A few tiny adaptations were needed:
  - replace a few ceb_node** with ceb_root** since pointers are now
    tagged ;
  - replace cebu*.h with ceb*.h since both are now merged in the same
    include file. This way we can drop the unused cebu*.h files from
    cebtree that are provided only for compatibility.
  - rename immediate storage functions to cebXX_imm_XXX() as per the API
    change in 0.5 that makes immediate explicit rather than implicit.
    This only affects vars and tools.c:copy_file_name().

The tests continue to work.
2025-09-16 09:23:46 +02:00
Willy Tarreau
90b70b61b1 BUILD: makefile: implement support for running a command in range
When running "make range", it would be convenient to support running
reg tests or anything else such as "size", "pahole" or even benchmarks.
Such commands are usually specific to the developer's environment, so
let's just pass a generic variable TEST_CMD that is executed as-is if
not empty.

This way it becomes possible to run "make range RANGE=... TEST_CMD=...".
2025-09-16 09:23:46 +02:00
Valentine Krasnobaeva
f8acac653e BUG/MINOR: resolvers: always normalize FQDN from response
RFC1034 states the following:

By convention, domain names can be stored with arbitrary case, but
domain name comparisons for all present domain functions are done in a
case-insensitive manner, assuming an ASCII character set, and a high
order zero bit. This means that you are free to create a node with
label "A" or a node with label "a", but not both as brothers; you could
refer to either using "a" or "A".

In practice, most DNS resolvers normalize domain labels (i.e., convert
them to lowercase) before performing searches or comparisons to ensure
this requirement is met.

While HAProxy normalizes the domain name in the request, it currently
does not do so for the response. Commit 75cc653 ("MEDIUM: resolvers:
replace bogus resolv_hostname_cmp() with memcmp()") intentionally
removed the `tolower()` conversion from `resolv_hostname_cmp()` for
safety and performance reasons.

This commit re-introduces the necessary normalization for FQDNs received
in the response. The change is made in `resolv_read_name()`, where labels
are processed as an unsigned char string, allowing `tolower()` to be
applied safely. Since a typical FQDN has only 3-4 labels, replacing
`memcpy()` with an explicit copy that also applies `tolower()` should
not introduce a significant performance degradation.

This patch addresses the rare edge case, as most resolvers perform this
normalization themselves.

This fixes the GitHub issue #3102. This fix may be backported in all stable
versions since 2.5 included 2.5.
2025-09-15 18:02:16 +02:00
Remi Tricot-Le Breton
257df69fbd BUG/MINOR: ocsp: Crash when updating CA during ocsp updates
If an ocsp response is set to be updated automatically and some
certificate or CA updates are performed on the CLI, if the CLI update
happens while the OCSP response is being updated and is then detached
from the udapte tree, it might be wrongly inserted into the update tree
in 'ssl_sock_load_ocsp', and then reinserted when the update finishes.

The update tree then gets corrupted and we could end up crashing when
accessing other nodes in the ocsp response update tree.

This patch must be backported up to 2.8.
This patch fixes GitHub #3100.
2025-09-15 15:34:36 +02:00
Aurelien DARRAGON
6a92b14cc1 MEDIUM: log/proxy: store log-steps selection using a bitmask, not an eb tree
An eb tree was used to anticipate for infinite amount of custom log steps
configured at a proxy level. In turns out this makes no sense to configure
that much logging steps for a proxy, and the cost of the eb tree is non
negligible in terms of memory footprint, especially when used in a default
section.

Instead, let's use a simple bitmask, which allows up to 64 logging steps
configured at proxy level. If we lack space some day (and need more than
64 logging steps to be configured), we could simply modify
"struct log_steps" to spread the bitmask over multiple 64bits integers,
minor some adjustments where the mask is set and checked.
2025-09-15 10:29:02 +02:00
Aurelien DARRAGON
be417c1db2 BUG/MEDIUM: http_ana: fix potential NULL deref in http_process_req_common()
As reported by @kenballus in GH #3118, a potential NULL-deref was
introduced in 3da1d63 ("BUG/MEDIUM: http_ana: handle yield for "stats
http-request" evaluation")

Indeed, px->uri_auth may be NULL when stats directive is not involved in
the current proxy section.

The bug went unnoticed because it didn't seem to cause any side-effect
so far and valgrind didn't catch it. However ASAN did, so let's fix it
before it causes harm.

It should be backported with 3da1d63.
2025-09-15 10:28:59 +02:00
Christopher Faulet
b582fd41c2 Revert "BUG/MINOR: ocsp: Crash when updating CA during ocsp updates"
This reverts commit 167ea8fc7b0cf9d1bf71ec03d7eac3141fbe0080.

The patch was backported by mistake.
2025-09-15 10:16:20 +02:00
Remi Tricot-Le Breton
167ea8fc7b BUG/MINOR: ocsp: Crash when updating CA during ocsp updates
If an ocsp response is set to be updated automatically and some
certificate or CA updates are performed on the CLI, if the CLI update
happens while the OCSP response is being updated and is then detached
from the udapte tree, it might be wrongly inserted into the update tree
in 'ssl_sock_load_ocsp', and then reinserted when the update finishes.

The update tree then gets corrupted and we could end up crashing when
accessing other nodes in the ocsp response update tree.

This patch must be backported up to 2.8.
This patch fixes GitHub #3100.
2025-09-15 08:20:16 +02:00
Christopher Faulet
157852ce99 BUG/MEDIUM: resolvers: Wake resolver task up whne unlinking a stream requester
Another regression introduced with the commit 3023e9819 ("BUG/MINOR:
resolvers: Restore round-robin selection on records in DNS answers"). Stream
requesters are unlinked from any theards. So we must not try to queue the
resolver's task here because it is not allowed to do so from another thread
than the task thread. Instead, we can simply wake the resolver's task up. It
is only performed when the last stream requester is unlink from the
resolution.

This patch should fix the issue #3119. It must be backported with the commit
above.
2025-09-15 07:57:29 +02:00
Christopher Faulet
e6a9192af6 BUG/MEDIUM: resolvers: Accept to create resolution without hostname
A regression was introduced by commit 6cf2401ed ("BUG/MEDIUM: resolvers:
Make resolution owns its hostname_dn value"). In fact, it is possible (an
allowed ?!) to create a resolution without hostname (hostname_dn ==
NULL). It only happens on startup for a server relying on a resolver but
defined with an IP address and not a hostname

Because of the patch above, an error is triggered during the configuration
parsing when this happens, while it should be accepted.

This patch must be backported with the commit above.
2025-09-12 11:52:06 +02:00
Christopher Faulet
6cf2401eda BUG/MEDIUM: resolvers: Make resolution owns its hostname_dn value
The commit 37abe56b1 ("BUG/MEDIUM: resolvers: Properly cache do-resolv
resolution") introduced a regression. A resolution does not own its
hostname_dn value, it is a pointer on the first request value. But since the
commit above, it is possible to have orphan resolution, with no
requester. So it is important to modify the resolutions to make it owns its
hostname_dn value by duplicating it when it is created.

This patch must be backported with the commit above.
2025-09-12 11:09:19 +02:00
Christopher Faulet
f6dfbbe870 BUG/MEDIUM: resolvers: Test for empty tree when getting a record from DNS answer
In the previous fix 5d1d93fad ("BUG/MEDIUM: resolvers: Properly handle empty
tree when getting a record from the DNS answer"), I missed the fact the
answer tree can be empty.

So, to avoid crashes, when the answer tree is empty, we immediately exit
from resolv_get_ip_from_response() function with RSLV_UPD_NO_IP_FOUND. In
addition, when a record is removed from the tree, we take care to reset the
next node saved if necessary.

This patch must be backported with the commit above.
2025-09-12 11:09:19 +02:00
Collison, Steven
d738fa4ec0 DOC: proxy-protocol: Add TLS group and sig scheme TLVs
This change adds the PP2_SUBTYPE_SSL_GROUP and PP2_SUBTYPE_SSL_SIG_SCHEME
code point reservations in proxy_protocol.txt. The motivation for adding
these two TLVs is for backend visibility into the negotiated TLS key
exchange group and handshake signature scheme.

Demand for visibility is expected to increase as endpoints migrate to use
new Post-Quantum resistant algorithms for key exchange and signatures.
2025-09-12 09:25:14 +02:00
Willy Tarreau
8fb5ae5cc6 MINOR: activity/memory: count allocations performed under a lock
By checking the current thread's locking status, it becomes possible
to know during a memory allocation whether it's performed under a lock
or not. Both pools and memprofile functions were instrumented to check
for this and to increment the memprofile bin's locked_calls counter.

This one, when not zero, is reported on "show profiling memory" with a
percentage of all allocations that such locked allocations represent.
This way it becomes possible to try to target certain code paths that
are particularly expensive. Example:

  $ socat - /tmp/sock1 <<< "show profiling memory"|grep lock
     20297301           0     2598054528              0|   0x62a820fa3991 sockaddr_alloc+0x61/0xa3 p_alloc(128) [pool=sockaddr] [locked=54962 (0.2 %)]
            0    20297301              0     2598054528|   0x62a820fa3a24 sockaddr_free+0x44/0x59 p_free(-128) [pool=sockaddr] [locked=34300 (0.1 %)]
      9908432           0     1268279296              0|   0x62a820eb8524 main+0x81974 p_alloc(128) [pool=task] [locked=9908432 (100.0 %)]
      9908432           0      554872192              0|   0x62a820eb85a6 main+0x819f6 p_alloc(56) [pool=tasklet] [locked=9908432 (100.0 %)]
       263001           0       63120240              0|   0x62a820fa3c97 conn_new+0x37/0x1b2 p_alloc(240) [pool=connection] [locked=20662 (7.8 %)]
        71643           0       47307584              0|   0x62a82105204d pool_get_from_os_noinc+0x12d/0x161 posix_memalign(660) [locked=5393 (7.5 %)]
2025-09-11 16:32:34 +02:00
Willy Tarreau
9d8c2a888b MINOR: activity: collect CPU time spent on memory allocations for each task
When task profiling is enabled, the pool alloc/free code will measure the
time it takes to perform memory allocation after a cache miss or memory
freeing to the shared cache or OS. The time taken with the thread-local
cache is never measured as measuring that time is very expensive compared
to the pool access time. Here doing so costs around 2% performance at 2M
req/s, only when task profiling is enabled, so this remains reasonable.
The scheduler takes care of collecting that time and updating the
sched_activity entry corresponding to the current task when task profiling
is enabled.

The goal clearly is to track places that are wasting CPU time allocating
and releasing too often, or causing large evictions. This appears like
this in "show profiling tasks aggr":

  Tasks activity over 11.428 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   mem_avg   lat_avg
    process_stream             44183891   16.47m    22.36us   491.0ns   1.154us   1.000ns   101.1us
    h1_io_cb                   57386064   4.011m    4.193us   20.00ns   16.00ns      -      29.47us
    sc_conn_io_cb              42088024   49.04s    1.165us      -         -         -      54.67us
    h1_timeout_task              438171   196.5ms   448.0ns      -         -         -      100.1us
    srv_cleanup_toremove_conns       65   1.468ms   22.58us   184.0ns   87.00ns      -      101.3us
    task_process_applet               3   508.0us   169.3us      -      107.0us   1.847us   29.67us
    srv_cleanup_idle_conns            6   225.3us   37.55us   15.74us   36.84us      -      49.47us
    accept_queue_process              2   45.62us   22.81us      -         -      4.949us   54.33us
2025-09-11 16:32:34 +02:00
Willy Tarreau
195794eb59 MINOR: activity: add a new mem_avg column to show profiling stats
This new column will be used for reporting the average time spent
allocating or freeing memory in a task when task profiling is enabled.
For now it is not updated.
2025-09-11 16:32:34 +02:00
Willy Tarreau
98cc815e3e MINOR: activity: collect time spent with a lock held for each task
When DEBUG_THREAD > 0 and task profiling enabled, we'll now measure the
time spent with at least one lock held for each task. The time is
collected by locking operations when locks are taken raising the level
to one, or released resetting the level. An accumulator is updated in
the thread_ctx struct that is collected by the scheduler when the task
returns, and updated in the sched_activity entry of the related task.

This allows to observe figures like this one:

  Tasks activity over 259.516 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   lat_avg
    h1_io_cb                   15466589   2.574m    9.984us      -         -      33.45us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
    sc_conn_io_cb               8047994   8.325s    1.034us      -         -      870.1us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
    process_stream              7734689   4.356m    33.79us   1.990us   1.641us   1.554ms <- sc_notify@src/stconn.c:1206 task_wakeup
    process_stream              7734292   46.74m    362.6us   278.3us   132.2us   972.0us <- stream_new@src/stream.c:585 task_wakeup
    sc_conn_io_cb               7733158   46.88s    6.061us      -         -      68.78us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
    task_process_applet         6603593   4.484m    40.74us   16.69us   34.00us   96.47us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
    task_process_applet         4761796   3.420m    43.09us   18.79us   39.28us   138.2us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
    process_table_expire        4710662   4.880m    62.16us   9.648us   53.95us   158.6us <- run_tasks_from_lists@src/task.c:671 task_queue
    stktable_add_pend_updates   4171868   6.786s    1.626us      -      1.487us   47.94us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
    h1_io_cb                    2871683   1.198s    417.0ns   70.00ns   69.00ns   1.005ms <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup
    process_peer_sync           2304957   5.368s    2.328us      -      1.156us   68.54us <- stktable_add_pend_updates@src/stick_table.c:873 task_wakeup
    process_peer_sync           1388141   3.174s    2.286us      -      1.130us   52.31us <- run_tasks_from_lists@src/task.c:671 task_queue
    stktable_add_pend_updates    463488   3.530s    7.615us   2.000ns   7.134us   771.2us <- stktable_touch_with_exp@src/stick_table.c:654 tasklet_wakeup

Here we see that almost the entirety of stktable_add_pend_updates() is
spent under a lock, that 1/3 of the execution time of process_stream()
was performed under a lock and that 2/3 of it was spent waiting for a
lock (this is related to the 10 track-sc present in this config), and
that the locking time in process_peer_sync() has now significantly
reduced. This is more visible with "show profiling tasks aggr":

  Tasks activity over 475.354 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   lat_avg
    h1_io_cb                   25742539   3.699m    8.622us   11.00ns   10.00ns   188.0us
    sc_conn_io_cb              22565666   1.475m    3.920us      -         -      473.9us
    process_stream             21665212   1.195h    198.6us   140.6us   67.08us   1.266ms
    task_process_applet        16352495   11.31m    41.51us   17.98us   36.55us   112.3us
    process_peer_sync           7831923   17.15s    2.189us      -      1.107us   41.27us
    process_table_expire        6878569   6.866m    59.89us   9.359us   51.91us   151.8us
    stktable_add_pend_updates   6602502   14.77s    2.236us      -      2.060us   119.8us
    h1_timeout_task                 801   703.4us   878.0ns      -         -      185.7us
    srv_cleanup_toremove_conns      347   12.43ms   35.82us   240.0ns   70.00ns   1.924ms
    accept_queue_process            142   1.384ms   9.743us      -         -      340.6us
    srv_cleanup_idle_conns           74   475.0us   6.418us   896.0ns   5.667us   114.6us
2025-09-11 16:32:34 +02:00
Willy Tarreau
95433f224e MINOR: activity: add a new lkd_avg column to show profiling stats
This new column will be used for reporting the average time spent
in a task with at least one lock held. It will only have a non-zero
value when DEBUG_THREAD > 0. For now it is not updated.
2025-09-11 16:32:34 +02:00
Willy Tarreau
4b23b2ed32 MINOR: thread: add a lock level information in the thread_ctx
The new lock_level field indicates the number of cumulated locks that
are held by the current thread. It's fed as soon as DEBUG_THREAD is at
least 1. In addition, thread_isolate() adds 128, so that it's even
possible to check for combinations of both. The value is also reported
in thread dumps (warnings and panics).
2025-09-11 16:32:34 +02:00
Willy Tarreau
503084643f MINOR: activity: collect time spent waiting on a lock for each task
When DEBUG_THREAD > 0, and if task profiling is enabled, then each
locking attempt will measure the time it takes to obtain the lock, then
add that time to a thread_ctx accumulator that the scheduler will then
retrieve to update the current task's sched_activity entry. The value
will then appear avearaged over the number of calls in the lkw_avg column
of "show profiling tasks", such as below:

  Tasks activity over 48.298 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lat_avg
    h1_io_cb                    3200170   26.81s    8.377us      -      32.73us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
    sc_conn_io_cb               1657841   1.645s    992.0ns      -      853.0us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
    process_stream              1600450   49.16s    30.71us   1.936us   1.392ms <- sc_notify@src/stconn.c:1206 task_wakeup
    process_stream              1600321   7.770m    291.3us   209.1us   901.6us <- stream_new@src/stream.c:585 task_wakeup
    sc_conn_io_cb               1599928   7.975s    4.984us      -      65.77us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
    task_process_applet          997609   46.37s    46.48us   16.80us   113.0us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
    process_table_expire         922074   48.79s    52.92us   7.275us   181.1us <- run_tasks_from_lists@src/task.c:670 task_queue
    stktable_add_pend_updates    705423   1.511s    2.142us      -      56.81us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
    task_process_applet          683511   34.75s    50.84us   18.37us   153.3us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
    h1_io_cb                     535395   198.1ms   370.0ns   72.00ns   930.4us <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup

It now makes it pretty obvious which tasks (hence call chains) spend their
time waiting on a lock and for what share of their execution time.
2025-09-11 16:32:34 +02:00
Willy Tarreau
1956c544b5 MINOR: activity: add a new lkw_avg column to show profiling stats
This new column will be used for reporting the average time spent waiting
for a lock. It will only have a non-zero value when DEBUG_THREAD > 0. For
now it is not updated.
2025-09-11 16:32:34 +02:00
Willy Tarreau
9f7ce9e807 MINOR: activity: don't report the lat_tot column for show profiling tasks
This column is pretty useless, as the total latency experienced by tasks
is meaningless, what matters is the average per call. Since we'll add more
columns and we need to keep all of this readable, let's get rid of this
column.
2025-09-11 16:32:34 +02:00
Christopher Faulet
3023e98199 BUG/MINOR: resolvers: Restore round-robin selection on records in DNS answers
Since the commit dcb696cd3 ("MEDIUM: resolvers: hash the records before
inserting them into the tree"), When several records are found in a DNS
answer, the round robin selection over these records is no longer performed.

Indeed, before a list of records was used. To ensure each records was
selected one after the other, at each selection, the first record of the
list was moved at the end. When this list was replaced bu a tree, the same
mechanism was preserved. However, the record is indexed using its key, a
hash of the record. So its position never changes. When it is removed and
reinserted in the tree, its position remains the same. When we walk though
the tree, starting from the root, the records are always evaluated in the
same order. So, even if there are several records in a DNS answer, the same
IP address is always selected.

It is quite easy to trigger the issue with a do-resolv action.

To fix the issue, the node to perform the next selection is now saved. So
instead of restarting from the root each time, we can restart from the next
node of the previous call.

Thanks to Damien Claisse for the issue analysis and for the reproducer.

This patch should fix the issue #3116. It must be backported as far as 2.6.
2025-09-11 15:46:45 +02:00
Christopher Faulet
37abe56b18 BUG/MEDIUM: resolvers: Properly cache do-resolv resolution
As stated by the documentation, when a do-resolv resolution is performed,
the result should be cached for <hold.valid> milliseconds. However, the only
way to cache the result is to always have a requester. When the last
requester is unlink from the resolution, the resolution is released. So, for
a do-resolv resolution, it means it could only work by chance if the same
FQDN is requested enough to always have at least two streams waiting for the
resolution. And because in that case, the cached result is used, it means
the traffic must be quite high.

In fact, a good approach to fix the issue is to keep orphan resolutions to
be able cache the result and only release them after hold.valid milliseconds
after the last real resolution. The resolver's task already releases orphan
resolutions. So we only need to check the expiration date and take care to
not release the resolution when the last stream is unlink from it.

This patch should be backported to all stable versions. We can start to
backport it as far as 3.1 and then wait a bit.
2025-09-11 15:46:45 +02:00
William Lallemand
fb832e1e52 BUILD: ssl: functions defined but not used
Previous patch 50d191b ("MINOR: ssl: set functions as static when no
protypes in the .h") broke the WolfSSL function with unused functions.

This patch add __maybe_unused to ssl_sock_sctl_parse_cbk(),
ssl_sock_sctl_add_cbk() and ssl_sock_msgcbk()
2025-09-11 15:32:59 +02:00
William Lallemand
50d191b8a3 MINOR: ssl: set functions as static when no protypes in the .h
Check with -Wmissing-prototypes what should be static.

src/ssl_sock.c:1572:5: error: no previous prototype for ‘ssl_sock_sctl_add_cbk’ [-Werror=missing-prototypes]
 1572 | int ssl_sock_sctl_add_cbk(SSL *ssl, unsigned ext_type, const unsigned char **out, size_t *outlen, int *al, void *add_arg)
      |     ^~~~~~~~~~~~~~~~~~~~~
src/ssl_sock.c:1582:5: error: no previous prototype for ‘ssl_sock_sctl_parse_cbk’ [-Werror=missing-prototypes]
 1582 | int ssl_sock_sctl_parse_cbk(SSL *s, unsigned int ext_type, const unsigned char *in, size_t inlen, int *al, void *parse_arg)
      |     ^~~~~~~~~~~~~~~~~~~~~~~
src/ssl_sock.c:1604:6: error: no previous prototype for ‘ssl_sock_infocbk’ [-Werror=missing-prototypes]
 1604 | void ssl_sock_infocbk(const SSL *ssl, int where, int ret)
      |      ^~~~~~~~~~~~~~~~
src/ssl_sock.c:2107:6: error: no previous prototype for ‘ssl_sock_msgcbk’ [-Werror=missing-prototypes]
 2107 | void ssl_sock_msgcbk(int write_p, int version, int content_type, const void *buf, size_t len, SSL *ssl, void *arg)
      |      ^~~~~~~~~~~~~~~
src/ssl_sock.c:3936:5: error: no previous prototype for ‘sh_ssl_sess_new_cb’ [-Werror=missing-prototypes]
 3936 | int sh_ssl_sess_new_cb(SSL *ssl, SSL_SESSION *sess)
      |     ^~~~~~~~~~~~~~~~~~
src/ssl_sock.c:3990:14: error: no previous prototype for ‘sh_ssl_sess_get_cb’ [-Werror=missing-prototypes]
 3990 | SSL_SESSION *sh_ssl_sess_get_cb(SSL *ssl, __OPENSSL_110_CONST__ unsigned char *key, int key_len, int *do_copy)
      |              ^~~~~~~~~~~~~~~~~~
src/ssl_sock.c:4043:6: error: no previous prototype for ‘sh_ssl_sess_remove_cb’ [-Werror=missing-prototypes]
 4043 | void sh_ssl_sess_remove_cb(SSL_CTX *ctx, SSL_SESSION *sess)
      |      ^~~~~~~~~~~~~~~~~~~~~
src/ssl_sock.c:4075:6: error: no previous prototype for ‘ssl_set_shctx’ [-Werror=missing-prototypes]
 4075 | void ssl_set_shctx(SSL_CTX *ctx)
      |      ^~~~~~~~~~~~~
src/ssl_sock.c:4103:6: error: no previous prototype for ‘SSL_CTX_keylog’ [-Werror=missing-prototypes]
 4103 | void SSL_CTX_keylog(const SSL *ssl, const char *line)
      |      ^~~~~~~~~~~~~~
src/ssl_sock.c:5167:6: error: no previous prototype for ‘ssl_sock_deinit’ [-Werror=missing-prototypes]
 5167 | void ssl_sock_deinit()
      |      ^~~~~~~~~~~~~~~
src/ssl_sock.c:6976:6: error: no previous prototype for ‘ssl_sock_close’ [-Werror=missing-prototypes]
 6976 | void ssl_sock_close(struct connection *conn, void *xprt_ctx) {
      |      ^~~~~~~~~~~~~~
src/ssl_sock.c:7846:17: error: no previous prototype for ‘ssl_action_wait_for_hs’ [-Werror=missing-prototypes]
 7846 | enum act_return ssl_action_wait_for_hs(struct act_rule *rule, struct proxy *px,
      |                 ^~~~~~~~~~~~~~~~~~~~~~
2025-09-11 15:23:59 +02:00
William Lallemand
19daee6549 MINOR: ocsp: put internal functions as static ones
-Wmissing-prototypes let us check which functions can be made static and
is not used elsewhere.

rc/ssl_ocsp.c:1079:5: error: no previous prototype for ‘ssl_ocsp_update_insert_after_error’ [-Werror=missing-prototypes]
 1079 | int ssl_ocsp_update_insert_after_error(struct certificate_ocsp *ocsp)
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ssl_ocsp.c:1116:6: error: no previous prototype for ‘ocsp_update_response_stline_cb’ [-Werror=missing-prototypes]
 1116 | void ocsp_update_response_stline_cb(struct httpclient *hc)
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ssl_ocsp.c:1127:6: error: no previous prototype for ‘ocsp_update_response_headers_cb’ [-Werror=missing-prototypes]
 1127 | void ocsp_update_response_headers_cb(struct httpclient *hc)
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ssl_ocsp.c:1138:6: error: no previous prototype for ‘ocsp_update_response_body_cb’ [-Werror=missing-prototypes]
 1138 | void ocsp_update_response_body_cb(struct httpclient *hc)
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ssl_ocsp.c:1149:6: error: no previous prototype for ‘ocsp_update_response_end_cb’ [-Werror=missing-prototypes]
 1149 | void ocsp_update_response_end_cb(struct httpclient *hc)
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ssl_ocsp.c:2095:5: error: no previous prototype for ‘ocsp_update_postparser_init’ [-Werror=missing-prototypes]
 2095 | int ocsp_update_postparser_init()
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
2025-09-11 15:18:48 +02:00
William Lallemand
0224d60de6 BUG/MINOR: ocsp: prototype inconsistency
Inconsistencies between the .h and the .c can't be catched because the
.h is not included in the .c.

ocsp_update_init() does not have the right prototype and lacks a const
attribute.

Must be backported in all previous stable versions.
2025-09-11 15:18:10 +02:00
Remi Tricot-Le Breton
e0844a305c BUG/MINOR: ssl: Fix potential NULL deref in trace callback
'conn' might be NULL in the trace callback so the calls to
conn_err_code_str must be covered by a proper check.

This issue was found by Coverity and raised in GitHub #3112.
The patch must be backported to 3.2.
2025-09-11 14:31:32 +02:00
Remi Tricot-Le Breton
a316342ec6 BUG/MINOR: ssl: Potential NULL deref in trace macro
'ctx' might be NULL when we exit 'ssl_sock_handshake', it can't be
dereferenced without check in the trace macro.

This was found by Coverity andraised in GitHub #3113.
This patch should be backported up to 3.2.
2025-09-11 14:31:32 +02:00
William Lallemand
e52e6f66ac BUG/MEDIUM: jws: return size_t in JWS functions
JWS functions are supposed to return 0 upon error or when nothing was
produced. This was done in order to put easily the return value in
trash->data without having to check the return value.

However functions like a2base64url() or snprintf() could return a
negative value, which would be casted in a unsigned int if this happen.

This patch add checks on the JWS functions to ensure that no negative
value can be returned, and change the prototype from int to size_t.

This is also related to issue #3114.

Must be backported to 3.2.
2025-09-11 14:31:32 +02:00
William Lallemand
66a7ebfeef BUG/MINOR: acme: null pointer dereference upon allocation failure
Reported in issue #3115:

     	11. var_compare_op: Comparing task to null implies that task might be null.
681                if (!task) {
682                        ret++;
683                        ha_alert("acme: couldn't start the scheduler!\n");
684                }
CID 1609721: (#1 of 1): Dereference after null check (FORWARD_NULL)
12. var_deref_op: Dereferencing null pointer task.
685                task->nice = 0;
686                task->process = acme_scheduler;
687
688                task_wakeup(task, TASK_WOKEN_INIT);
689        }
690

Task would be dereferenced upon allocation failure instead of falling
back to the end of the function after the error.

Should be backported in 3.2.
2025-09-11 14:31:32 +02:00
Amaury Denoyelle
c15129f7dc DOC: quic: clarifies limited-quic support
This patch extends the documentation for "limited-quic" global keyword.
It mentions first that it relies on USE_QUIC_OPENSSL_COMPAT=1 build
option.

Compatibility with TLS libraries is now clearly exposed. In particular,
it highlights the fact that it is mostly targetted at OpenSSL version
prior to 3.5.2, and that it should be disabled if a recent OpenSSL
release is available. It also states that limited-quic does nothing if
USE_QUIC_OPENSSL_COMPAT is not set during compilation.
2025-09-11 10:11:12 +02:00
Amaury Denoyelle
d293cc62dc MINOR: quic: display build warning for compat layer on recent OpenSSL
Build option USE_QUIC_OPENSSL_COMPAT=1 must be set to activate QUIC
support for OpenSSL prior to version 3.5.2. This compiles an internal
compatibility layer, which must be then activated at runtime with global
option limited-quic.

Starting from OpenSSL version 3.5.2, a proper QUIC TLS API is now
exposed. Thus, the compatibility layer is unneeded. However it can still
be compiled against newer OpenSSL releases and activated at runtime,
mostly for test purpose.

As this compatibility layer has some limitations, (no support for QUIC
0-RTT), it's important that users notice this situation and disable it
if possible. Thus, this patch adds a notice warning when
USE_QUIC_OPENSSL_COMPAT=1 is set when building against OpenSSL 3.5.2 and
above. This should be sufficient for users and packagers to understand
that this option is not necessary anymore.

Note that USE_QUIC_OPENSSL_COMPAT=1 is incompatible with others TLS
library which exposed a QUIC API based on original BoringSSL patches
set. A build error will prevent the compatibility layer to be built.
limited-quic option is thus silently ignored.
2025-09-11 10:11:12 +02:00
Frederic Lecaille
5027ba36a9 MINOR: quic-be: make SSL/QUIC objects use their own indexes (ssl_qc_app_data_index)
This index is used to retrieve the quic_conn object from its SSL object, the same
way the connection is retrieved from its SSL object for SSL/TCP connections.

This patch implements two helper functions to avoid the ugly code with such blocks:

   #ifdef USE_QUIC
   else if (qc) { .. }
   #endif

Implement ssl_sock_get_listener() to return the listener from an SSL object.
Implement ssl_sock_get_conn() to return the connection from an SSL object
and optionally a pointer to the ssl_sock_ctx struct attached to the connections
or the quic_conns.

Use this functions where applicable:
   - ssl_tlsext_ticket_key_cb() calls ssl_sock_get_listener()
   - ssl_sock_infocbk() calls ssl_sock_get_conn()
   - ssl_sock_msgcbk() calls ssl_sock_get_ssl_conn()
   - ssl_sess_new_srv_cb() calls ssl_sock_get_conn()
   - ssl_sock_srv_verifycbk() calls ssl_sock_get_conn()

Also modify qc_ssl_sess_init() to initialize the ssl_qc_app_data_index index for
the QUIC backends.
2025-09-11 09:51:28 +02:00
Frederic Lecaille
47bb15ca84 MINOR: quic: get rid of ->target quic_conn struct member
The ->li (struct listener *) member of quic_conn struct was replaced by a
->target (struct obj_type *) member by this commit:

    MINOR: quic-be: get rid of ->li quic_conn member

to abstract the connection type (front or back) when implementing QUIC for the
backends. In these cases, ->target was a pointer to the ojb_type of a server
struct. This could not work with the dynamic servers contrary to the listeners
which are not dynamic.

This patch almost reverts the one mentioned above. ->target pointer to obj_type member
is replaced by ->li pointer to listener struct member. As the listener are not
dynamic, this is easy to do this. All one has to do is to replace the
objt_listener(qc->target) statement by qc->li where applicable.

For the backend connection, when needed, this is always qc->conn->target which is
used only when qc->conn is initialized. The only "problematic" case is for
quic_dgram_parse() which takes a pointer to an obj_type as third argument.
But this obj_type is only used to call quic_rx_pkt_parse(). Inside this function
it is used to access the proxy counters of the connection thanks to qc_counters().
So, this obj_type argument may be null for now on with this patch. This is the
reason why qc_counters() is modified to take this into consideration.
2025-09-11 09:51:28 +02:00
Christopher Faulet
5354c24c76 BUG/MAJOR: stream: Force channel analysis on successful synchronous send
This patchs reverts commit a498e527b ("BUG/MAJOR: stream: Remove READ/WRITE
events on channels after analysers eval") because of a regression. It was an
attempt to properly detect synchronous sends, even when the stream was woken
up on a write event. However, the fix was wrong because it could mask
shutdowns performed during process_stream() and block the stream.

Indeed, when a shutdown is performed, because an error occurred for
instance, a write event is reported. The commit above could mask this event
while the shutdown prevent any synchronous sends. In such case, the stream
could remain blocked infinitly because an I/O event was missed.

So to properly fix the original issue (#3070), the write event must not be
masked before a synchronous send. Instead, we now force the channel analysis
by setting explicitly CF_WAKE_ONCE flags on the corresponding channel if a
write event is reported after the synchronous send. CF_WRITE_EVENT flag is
remove explicitly just before, so it is quite easy to detect.

This patch must be backport to all stable version in same time of the commit
above.
2025-09-11 09:47:47 +02:00
Willy Tarreau
ded2110ec6 MEDIUM: peers: move process_peer_sync() to a single thread
The remaining half of the task_queue() and task_wakeup() contention
is caused by this function when peers are in use, because just like
process_table_expire(), it's created using task_new_anywhere() and
is woken up for local updates. Let's turn it to single thread by
rotating the assigned threads during initialization so that a table
only runs on one thread at a time.

Here we go backwards to assign the threads, so that on small setups
they don't end up on the same CPUs as the ones used by the stick-tables.
This way this will make an even better use of large machines. The
performance remains the same as with previous patch, even slightly
better (1-3% on avg).

At this point there's almost no multi-threaded task activity anymore
(only srv_cleanup_idle_server once in a while). This should improve
the situation described by Felipe in issues #3084 and #3101.

This should be backported to 3.2 after some extended checks.
2025-09-10 19:14:05 +02:00
Willy Tarreau
e05afda249 MEDIUM: stick-table: move process_table_expire() to a single thread
A big deal of the task_queue() contention is caused by this function
because it's created using task_new_anywhere() and is subject to
heavy updates. Let's turn it to single thread by rotating the assigned
threads during initialization so that a table only runs on one thread
at a time.

However there's a trick: the function used to call task_queue() to
requeue the task if it had advanced its timer (may only happen when
learning an entry from a peer). We can't do that anymore since we can't
queue another thread's task. Thus instead of the task needs to be
scheduled earlier than previously planned, we simply perform a wakeup.
It will likely do nothing and will self-adjust its next wakeup timer.

Doing so halves the number of multi-thread task wakeups. In addition
the request rate at saturation increased by 12% with 16 peers and 40
tables on a 16 8-thread processes. This should improve the situation
described by Felipe in issues #3084 and #3101.

This should be backported to 3.2 after some extended checks.
2025-09-10 19:13:33 +02:00
Willy Tarreau
2831cb104f BUG/MINOR: stick-table: make sure never to miss a process_table_expire update
In stktable_requeue_exp(), there's a tiny race at the beginning during
which we check the task's expiration date to decide whether or not to
wake process_table_expire() up. During this race, the task might just
have finished running on its owner thread and we can miss a task_queue()
opportunity, which probably explains why during testing it seldom happens
that a few entries are left at the end.

Let's perform a CAS to confirm the value is still the same before
leaving. This way we're certain that our value has been seen at least
once.

This should be backported to 3.2.
2025-09-10 18:45:01 +02:00
Willy Tarreau
2ce5e0edcc MEDIUM: resolvers: make the process_resolvers() task single-threaded
This task is sometimes caught triggering the watchdog while waiting for
the infamous resolvers lock, or the scheduler's wait queue lock in
task_queue(). Both are caused by its multi-threaded capability. The
task may indeed start on a thread that's different from the one that
is currently receiving a response and that holds the resolvers lock,
and when being queued back, it requires to lock the wait queue. Both
problems disappear when sticking it to a single thread. But for configs
running multiple resolvers sections, it would be suboptimal to run them
all on the same thread. In order to avoid this, we implement a counter
in the resolvers_finalize_config() section that rotates the thread for
each resolvers section.

This was sufficient to further improve the performance here, making the
CPU usage drop to about 7% (from 11 previously or 38 initially) and not
showing any resolvers lock contention anymore in perf top output.

The change was kept fairly minimal to permit a backport once enough
testing is conducted on it. It could address a significant part of
the trouble reported by Felipe in GH issue #3101.
2025-09-10 16:51:14 +02:00
Willy Tarreau
d624aceaef MEDIUM: dns: bind the nameserver sockets to the initiating thread
There's still a big architectural limitation in the dns/resolvers code
regarding threads: resolvers run as a task that is scheduled to run
anywhere, and each NS dgram socket is bound to any thread of the same
thread group as the initiating thread. This becomes a big problem when
dealing with multiple nameservers because responses arrive on any thread,
start by locking the resolvers section, and other threads dealing with
responses are just stuck waiting for the lock to disappear. This means
that most of the time is exclusively spent causing contention. The
process_resolvers() function also also suffers from this contention
but apparently less often.

It turns out that the nameserver sockets are created during emission
of the first packet, triggered from the resolvers task. The present
patch exploits this to stick all sockets to the calling thread instead
of any thread. This way there is no longer any contention between
multiple nameservers of a same resolvers section. Tests with a section
having 10 name servers showed that the CPU usage dropped from 38 to
about 10%, or almost by a factor of 4.

Note that TCP resolvers do not offer this possibility because the
tasks that manage the applets are created earlier to run anywhere
during config parsing. This might possibly be refined later, e.g.
by changing the task's affinity when it first runs.

The change was kept fairly minimal to permit a backport once enough
testing is conducted on it. It could address a significant part of
the trouble reported by Felipe in GH issue #3101.
2025-09-10 16:48:09 +02:00
Olivier Houchard
07c10ec2f1 BUG/MEDIUM: ssl: Fix a crash if we failed to create the mux
In ssl_sock_io_cb(), if we failed to create the mux, we may have
destroyed the connection, so only attempt to access it to get the ALPN
if conn_create_mux() was successful.
This fixes crashes that may happen when using ssl.
2025-09-10 12:02:53 +02:00
Olivier Houchard
1759c97255 BUG/MEDIUM: ssl: Fix a crash when using QUIC
Commit 5ab9954faa9c815425fa39171ad33e75f4f7d56f introduced a new flag in
ssl_sock_ctx, to know that an ALPN was negociated, however, the way to
get the ssl_sock_ctx was wrong for QUIC. If we're using QUIC, get it
from the quic_conn.
This should fix crashes when attempting to use QUIC.
2025-09-10 11:45:03 +02:00