8665 Commits

Author SHA1 Message Date
Willy Tarreau
ceaf8c1220 MEDIUM: connection: move idle connection trees to ceb64
Idle connection trees currently require a 56-byte conn_hash_node per
connection, which can be reduced to 32 bytes by moving to ceb64. While
ceb64 is theoretically slower, in practice here we're essentially
dealing with trees that almost always contain a single key and many
duplicates. In this case, ceb64 insert and lookup functions become
faster than eb64 ones because all duplicates are a list accessed in
O(1) while it's a subtree for eb64. In tests it is impossible to tell
the difference between the two, so it's worth reducing the memory
usage.

This commit brings the following memory savings to conn_hash_node
(one per backend connection), and to srv_per_thread (one per thread
and per server):

     struct       before  after  delta
  conn_hash_nodea   56     32     -24
  srv_per_thread    96     72     -24

The delicate part is conn_delete_from_tree(), because we need to
know the tree root the connection is attached to. But thanks to
recent cleanups, it's now clear enough (i.e. idle/safe/avail vs
session are easy to distinguish).
2025-09-16 09:23:46 +02:00
Willy Tarreau
95b8adff67 MINOR: connection: pass the thread number to conn_delete_from_tree()
We'll soon need to choose the server's root based on the connection's
flags, and for this we'll need the thread it's attached to, which is
not always the current one. This patch simply passes the thread number
from all callers. They know it because they just set the idle_conns
lock on it prior to calling the function.
2025-09-16 09:23:46 +02:00
Willy Tarreau
7773d87ea6 CLEANUP: proxy: slightly reorganize fields to plug some holes
The proxy struct has several small holes that deserved being plugged by
moving a few fields around. Now we're down to 3056 from 3072 previously,
and the remaining holes are small.

At the moment, compared to before this series, we're seeing these
sizes:

    type\size   7d554ca62   current  delta
    listener       752        704     -48  (-6.4%)
    server        4032       3840    -192  (-4.8%)
    proxy         3184       3056    -128  (-4%)
    stktable      3392       3328     -64  (-1.9%)

Configs with many servers have shrunk by about 4% in RAM and configs
with many proxies by about 3%.
2025-09-16 09:23:46 +02:00
Willy Tarreau
8df81b6fcc CLEANUP: server: slightly reorder fields in the struct to plug holes
The struct server still has a lot of holes and padding that make it
quite big. By moving a few fields aronud between areas which do not
interact (e.g. boot vs aligned areas), it's quite easy to plug some
of them and/or to arrange larger ones which could be reused later with
a bit more effort. Here we've reduced holes by 40 bytes, allowing the
struct to shrink by one more cache line (64 bytes). The new size is
3840 bytes.
2025-09-16 09:23:46 +02:00
Willy Tarreau
d18d972b1f MEDIUM: server: index server ID using compact trees
The server ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct server for this node, plus 8 bytes in
struct proxy. The server struct is still 3904 bytes large (due to
alignment) and the proxy struct is 3072.
2025-09-16 09:23:46 +02:00
Willy Tarreau
66191584d1 MEDIUM: listener: index listener ID using compact trees
The listener ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct listener for this node, plus 8 bytes
in struct proxy. The struct listener is now 704 bytes large, and the
struct proxy 3080.
2025-09-16 09:23:46 +02:00
Willy Tarreau
1a95bc42c7 MEDIUM: proxy: index proxy ID using compact trees
The proxy ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct proxy for this node, plus 8 bytes in
the root (which is static so not much relevant here). Now the proxy is
3088 bytes large.
2025-09-16 09:23:46 +02:00
Willy Tarreau
eab5b89dce MINOR: proxy: add proxy_index_id() to index a proxy by its ID
This avoids needlessly exposing the tree's root and the mechanics outside
of the low-level code.
2025-09-16 09:23:46 +02:00
Willy Tarreau
5e4b6714e1 MINOR: listener: add listener_index_id() to index a listener by its ID
This avoids needlessly exposing the tree's root and the mechanics outside
of the low-level code.
2025-09-16 09:23:46 +02:00
Willy Tarreau
5a5cec4d7a MINOR: server: add server_index_id() to index a server by its ID
This avoids needlessly exposing the tree's root and the mechanics outside
of the low-level code.
2025-09-16 09:23:46 +02:00
Willy Tarreau
0b0aefe19b MINOR: server: add server_get_next_id() to find next free server ID
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated server_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
2025-09-16 09:23:46 +02:00
Willy Tarreau
23605eddb1 MINOR: listener: add listener_get_next_id() to find next free listener ID
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated listener_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
2025-09-16 09:23:46 +02:00
Willy Tarreau
b2402d67b7 MINOR: proxy: add proxy_get_next_id() to find next free proxy ID
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated proxy_get_next_id().
2025-09-16 09:23:46 +02:00
Willy Tarreau
f4059ea42f MEDIUM: stktable: index table names using compact trees
Here we're saving 64 bytes per stick-table, from 3392 to 3328, and the
change was really straightforward so there's no reason not to do it.
2025-09-16 09:23:46 +02:00
Willy Tarreau
d0d60a007d MEDIUM: proxy: switch conf.name to cebis_tree
This is used to index the proxy's name and it contains a copy of the
pointer to the proxy's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct proxy, which is now 3112
bytes large.

Here we need to continue to support duplicates since they're still
allowed between type-incompatible proxies.

Interestingly, the use of cebis_next_dup() instead of cebis_next() in
proxy_find_by_name() allows us to get rid of an strcmp() that was
performed for each use_backend rule. A test with a large config
(100k backends) shows that we can get 3% extra performance on a
config involving a static use_backend rule (3.09M to 3.18M rps),
and even 4.5% on a dynamic rule selecting a random backend (2.47M
to 2.59M).
2025-09-16 09:23:46 +02:00
Willy Tarreau
fdf6fd5b45 MEDIUM: server: switch the host_dn member to cebis_tree
This member is used to index the hostname_dn contents for DNS resolution.
Let's replace it with a cebis_tree to save another 32 bytes (24 for the
node + 8 by avoiding the duplication of the pointer). The struct server is
now at 3904 bytes.
2025-09-16 09:23:46 +02:00
Willy Tarreau
413e903a22 MEDIUM: server: switch conf.name to cebis_tree
This is used to index the server name and it contains a copy of the
pointer to the server's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct server, which remains 3968
bytes large due to alignment. The proxy struct shrinks by 8 bytes to 3144.

It's worth noting that the current way duplicate names are handled remains
based on the previous mechanism where dups were permitted. Ideally we
should now reject them during insertion and use unique key trees instead.
2025-09-16 09:23:46 +02:00
Willy Tarreau
0e99f64fc6 MEDIUM: server: switch addr_node to cebis_tree
This contains the text representation of the server's address, for use
with stick-tables with "srvkey addr". Switching them to a compact node
saves 24 more bytes from this structure. The key was moved to an external
pointer "addr_key" right after the node.

The server struct is now 3968 bytes (down from 4032) due to alignment, and
the proxy struct shrinks by 8 bytes to 3152.
2025-09-16 09:23:46 +02:00
Willy Tarreau
91258fb9d8 MEDIUM: guid: switch guid to more compact cebuis_tree
The current guid struct size is 56 bytes. Once reduced using compact
trees, it goes down to 32 (almost half). We're not on a critical path
and size matters here, so better switch to this.

It's worth noting that the name part could also be stored in the
guid_node at the end to save 8 extra byte (no pointer needed anymore),
however the purpose of this struct is to be embedded into other ones,
which is not compatible with having a dynamic size.

Affected struct sizes in bytes:

           Before     After   Diff
  server    4032       4032     0*
  proxy     3184       3160    -24
  listener   752        728    -24

*: struct server is full of holes and padding (176 bytes) and is
64-byte aligned. Moving the guid_node elsewhere such as after sess_conn
reduces it to 3968, or one less cache line. There's no point in moving
anything now because forthcoming patches will arrange other parts.
2025-09-16 09:23:46 +02:00
Willy Tarreau
e36b3b60b3 MEDIUM: migrate the patterns reference to cebs_tree
cebs_tree are 24 bytes smaller than ebst_tree (16B vs 40B), and pattern
references are only used during map/acl updates, so their storage is
pure loss between updates (which most of the time never happen). By
switching their indexing to compact trees, we can save 16 to 24 bytes
per entry depending on alightment (here it's 24 per struct but 16
practical as malloc's alignment keeps 8 unused).

Tested on core i7-8650U running at 3.0 GHz, with a file containing
17.7M IP addresses (16.7M different):

   $ time  ./haproxy -c -f acl-ip.cfg

Save 280 MB RAM for 17.7M IP addresses, and slightly speeds up the
startup (5.8%, from 19.2s to 18.2s), a part of which possible being
attributed to having to write less memory. Note that this is on small
strings. On larger ones such as user-agents, ebtree doesn't reread
the whole key and might be more efficient.

Before:
  RAM (VSZ/RSS): 4443912 3912444

  real    0m19.211s
  user    0m18.138s
  sys     0m1.068s

  Overhead  Command         Shared Object      Symbol
    44.79%  haproxy  haproxy            [.] ebst_insert
    25.07%  haproxy  haproxy            [.] ebmb_insert_prefix
     3.44%  haproxy  libc-2.33.so       [.] __libc_calloc
     2.71%  haproxy  libc-2.33.so       [.] _int_malloc
     2.33%  haproxy  haproxy            [.] free_pattern_tree
     1.78%  haproxy  libc-2.33.so       [.] inet_pton4
     1.62%  haproxy  libc-2.33.so       [.] _IO_fgets
     1.58%  haproxy  libc-2.33.so       [.] _int_free
     1.56%  haproxy  haproxy            [.] pat_ref_push
     1.35%  haproxy  libc-2.33.so       [.] malloc_consolidate
     1.16%  haproxy  libc-2.33.so       [.] __strlen_avx2
     0.79%  haproxy  haproxy            [.] pat_idx_tree_ip
     0.76%  haproxy  haproxy            [.] pat_ref_read_from_file
     0.60%  haproxy  libc-2.33.so       [.] __strrchr_avx2
     0.55%  haproxy  libc-2.33.so       [.] unlink_chunk.constprop.0
     0.54%  haproxy  libc-2.33.so       [.] __memchr_avx2
     0.46%  haproxy  haproxy            [.] pat_ref_append

After:
  RAM (VSZ/RSS): 4166108 3634768

  real    0m18.114s
  user    0m17.113s
  sys     0m0.996s

  Overhead  Command  Shared Object       Symbol
    38.99%  haproxy  haproxy             [.] cebs_insert
    27.09%  haproxy  haproxy             [.] ebmb_insert_prefix
     3.63%  haproxy  libc-2.33.so        [.] __libc_calloc
     3.18%  haproxy  libc-2.33.so        [.] _int_malloc
     2.69%  haproxy  haproxy             [.] free_pattern_tree
     1.99%  haproxy  libc-2.33.so        [.] inet_pton4
     1.74%  haproxy  libc-2.33.so        [.] _IO_fgets
     1.73%  haproxy  libc-2.33.so        [.] _int_free
     1.57%  haproxy  haproxy             [.] pat_ref_push
     1.48%  haproxy  libc-2.33.so        [.] malloc_consolidate
     1.22%  haproxy  libc-2.33.so        [.] __strlen_avx2
     1.05%  haproxy  libc-2.33.so        [.] __strcmp_avx2
     0.80%  haproxy  haproxy             [.] pat_idx_tree_ip
     0.74%  haproxy  libc-2.33.so        [.] __memchr_avx2
     0.69%  haproxy  libc-2.33.so        [.] __strrchr_avx2
     0.69%  haproxy  libc-2.33.so        [.] _IO_getline_info
     0.62%  haproxy  haproxy             [.] pat_ref_read_from_file
     0.56%  haproxy  libc-2.33.so        [.] unlink_chunk.constprop.0
     0.56%  haproxy  libc-2.33.so        [.] cfree@GLIBC_2.2.5
     0.46%  haproxy  haproxy             [.] pat_ref_append

If the addresses are totally disordered (via "shuf" on the input file),
we see both implementations reach exactly 68.0s (slower due to much
higher cache miss ratio).

On large strings such as user agents (1 million here), it's now slightly
slower (+9%):

Before:
  real    0m2.475s
  user    0m2.316s
  sys     0m0.155s

After:
  real    0m2.696s
  user    0m2.544s
  sys     0m0.147s

But such patterns are much less common than short ones, and the memory
savings do still count.

Note that while it could be tempting to get rid of the list that chains
all these pat_ref_elt together and only enumerate them by walking along
the tree to save 16 extra bytes per entry, that's not possible due to
the problem that insertion ordering is critical (think overlapping regex
such as /index.* and /index.html). Currently it's not possible to proceed
differently because patterns are first pre-loaded into the pat_ref via
pat_ref_read_from_file_smp() and later indexed by pattern_read_from_file(),
which has to only redo the second part anyway for maps/acls declared
multiple times.
2025-09-16 09:23:46 +02:00
Willy Tarreau
ddf900a0ce IMPORT: cebtree: import version 0.5.0 to support duplicates
The support for duplicates is necessary for various use cases related
to config names, so let's upgrade to the latest version which brings
this support. This updates the cebtree code to commit 808ed67 (tag
0.5.0). A few tiny adaptations were needed:
  - replace a few ceb_node** with ceb_root** since pointers are now
    tagged ;
  - replace cebu*.h with ceb*.h since both are now merged in the same
    include file. This way we can drop the unused cebu*.h files from
    cebtree that are provided only for compatibility.
  - rename immediate storage functions to cebXX_imm_XXX() as per the API
    change in 0.5 that makes immediate explicit rather than implicit.
    This only affects vars and tools.c:copy_file_name().

The tests continue to work.
2025-09-16 09:23:46 +02:00
Remi Tricot-Le Breton
257df69fbd BUG/MINOR: ocsp: Crash when updating CA during ocsp updates
If an ocsp response is set to be updated automatically and some
certificate or CA updates are performed on the CLI, if the CLI update
happens while the OCSP response is being updated and is then detached
from the udapte tree, it might be wrongly inserted into the update tree
in 'ssl_sock_load_ocsp', and then reinserted when the update finishes.

The update tree then gets corrupted and we could end up crashing when
accessing other nodes in the ocsp response update tree.

This patch must be backported up to 2.8.
This patch fixes GitHub #3100.
2025-09-15 15:34:36 +02:00
Aurelien DARRAGON
6a92b14cc1 MEDIUM: log/proxy: store log-steps selection using a bitmask, not an eb tree
An eb tree was used to anticipate for infinite amount of custom log steps
configured at a proxy level. In turns out this makes no sense to configure
that much logging steps for a proxy, and the cost of the eb tree is non
negligible in terms of memory footprint, especially when used in a default
section.

Instead, let's use a simple bitmask, which allows up to 64 logging steps
configured at proxy level. If we lack space some day (and need more than
64 logging steps to be configured), we could simply modify
"struct log_steps" to spread the bitmask over multiple 64bits integers,
minor some adjustments where the mask is set and checked.
2025-09-15 10:29:02 +02:00
Christopher Faulet
b582fd41c2 Revert "BUG/MINOR: ocsp: Crash when updating CA during ocsp updates"
This reverts commit 167ea8fc7b0cf9d1bf71ec03d7eac3141fbe0080.

The patch was backported by mistake.
2025-09-15 10:16:20 +02:00
Remi Tricot-Le Breton
167ea8fc7b BUG/MINOR: ocsp: Crash when updating CA during ocsp updates
If an ocsp response is set to be updated automatically and some
certificate or CA updates are performed on the CLI, if the CLI update
happens while the OCSP response is being updated and is then detached
from the udapte tree, it might be wrongly inserted into the update tree
in 'ssl_sock_load_ocsp', and then reinserted when the update finishes.

The update tree then gets corrupted and we could end up crashing when
accessing other nodes in the ocsp response update tree.

This patch must be backported up to 2.8.
This patch fixes GitHub #3100.
2025-09-15 08:20:16 +02:00
Willy Tarreau
8fb5ae5cc6 MINOR: activity/memory: count allocations performed under a lock
By checking the current thread's locking status, it becomes possible
to know during a memory allocation whether it's performed under a lock
or not. Both pools and memprofile functions were instrumented to check
for this and to increment the memprofile bin's locked_calls counter.

This one, when not zero, is reported on "show profiling memory" with a
percentage of all allocations that such locked allocations represent.
This way it becomes possible to try to target certain code paths that
are particularly expensive. Example:

  $ socat - /tmp/sock1 <<< "show profiling memory"|grep lock
     20297301           0     2598054528              0|   0x62a820fa3991 sockaddr_alloc+0x61/0xa3 p_alloc(128) [pool=sockaddr] [locked=54962 (0.2 %)]
            0    20297301              0     2598054528|   0x62a820fa3a24 sockaddr_free+0x44/0x59 p_free(-128) [pool=sockaddr] [locked=34300 (0.1 %)]
      9908432           0     1268279296              0|   0x62a820eb8524 main+0x81974 p_alloc(128) [pool=task] [locked=9908432 (100.0 %)]
      9908432           0      554872192              0|   0x62a820eb85a6 main+0x819f6 p_alloc(56) [pool=tasklet] [locked=9908432 (100.0 %)]
       263001           0       63120240              0|   0x62a820fa3c97 conn_new+0x37/0x1b2 p_alloc(240) [pool=connection] [locked=20662 (7.8 %)]
        71643           0       47307584              0|   0x62a82105204d pool_get_from_os_noinc+0x12d/0x161 posix_memalign(660) [locked=5393 (7.5 %)]
2025-09-11 16:32:34 +02:00
Willy Tarreau
9d8c2a888b MINOR: activity: collect CPU time spent on memory allocations for each task
When task profiling is enabled, the pool alloc/free code will measure the
time it takes to perform memory allocation after a cache miss or memory
freeing to the shared cache or OS. The time taken with the thread-local
cache is never measured as measuring that time is very expensive compared
to the pool access time. Here doing so costs around 2% performance at 2M
req/s, only when task profiling is enabled, so this remains reasonable.
The scheduler takes care of collecting that time and updating the
sched_activity entry corresponding to the current task when task profiling
is enabled.

The goal clearly is to track places that are wasting CPU time allocating
and releasing too often, or causing large evictions. This appears like
this in "show profiling tasks aggr":

  Tasks activity over 11.428 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   mem_avg   lat_avg
    process_stream             44183891   16.47m    22.36us   491.0ns   1.154us   1.000ns   101.1us
    h1_io_cb                   57386064   4.011m    4.193us   20.00ns   16.00ns      -      29.47us
    sc_conn_io_cb              42088024   49.04s    1.165us      -         -         -      54.67us
    h1_timeout_task              438171   196.5ms   448.0ns      -         -         -      100.1us
    srv_cleanup_toremove_conns       65   1.468ms   22.58us   184.0ns   87.00ns      -      101.3us
    task_process_applet               3   508.0us   169.3us      -      107.0us   1.847us   29.67us
    srv_cleanup_idle_conns            6   225.3us   37.55us   15.74us   36.84us      -      49.47us
    accept_queue_process              2   45.62us   22.81us      -         -      4.949us   54.33us
2025-09-11 16:32:34 +02:00
Willy Tarreau
195794eb59 MINOR: activity: add a new mem_avg column to show profiling stats
This new column will be used for reporting the average time spent
allocating or freeing memory in a task when task profiling is enabled.
For now it is not updated.
2025-09-11 16:32:34 +02:00
Willy Tarreau
98cc815e3e MINOR: activity: collect time spent with a lock held for each task
When DEBUG_THREAD > 0 and task profiling enabled, we'll now measure the
time spent with at least one lock held for each task. The time is
collected by locking operations when locks are taken raising the level
to one, or released resetting the level. An accumulator is updated in
the thread_ctx struct that is collected by the scheduler when the task
returns, and updated in the sched_activity entry of the related task.

This allows to observe figures like this one:

  Tasks activity over 259.516 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   lat_avg
    h1_io_cb                   15466589   2.574m    9.984us      -         -      33.45us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
    sc_conn_io_cb               8047994   8.325s    1.034us      -         -      870.1us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
    process_stream              7734689   4.356m    33.79us   1.990us   1.641us   1.554ms <- sc_notify@src/stconn.c:1206 task_wakeup
    process_stream              7734292   46.74m    362.6us   278.3us   132.2us   972.0us <- stream_new@src/stream.c:585 task_wakeup
    sc_conn_io_cb               7733158   46.88s    6.061us      -         -      68.78us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
    task_process_applet         6603593   4.484m    40.74us   16.69us   34.00us   96.47us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
    task_process_applet         4761796   3.420m    43.09us   18.79us   39.28us   138.2us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
    process_table_expire        4710662   4.880m    62.16us   9.648us   53.95us   158.6us <- run_tasks_from_lists@src/task.c:671 task_queue
    stktable_add_pend_updates   4171868   6.786s    1.626us      -      1.487us   47.94us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
    h1_io_cb                    2871683   1.198s    417.0ns   70.00ns   69.00ns   1.005ms <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup
    process_peer_sync           2304957   5.368s    2.328us      -      1.156us   68.54us <- stktable_add_pend_updates@src/stick_table.c:873 task_wakeup
    process_peer_sync           1388141   3.174s    2.286us      -      1.130us   52.31us <- run_tasks_from_lists@src/task.c:671 task_queue
    stktable_add_pend_updates    463488   3.530s    7.615us   2.000ns   7.134us   771.2us <- stktable_touch_with_exp@src/stick_table.c:654 tasklet_wakeup

Here we see that almost the entirety of stktable_add_pend_updates() is
spent under a lock, that 1/3 of the execution time of process_stream()
was performed under a lock and that 2/3 of it was spent waiting for a
lock (this is related to the 10 track-sc present in this config), and
that the locking time in process_peer_sync() has now significantly
reduced. This is more visible with "show profiling tasks aggr":

  Tasks activity over 475.354 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   lat_avg
    h1_io_cb                   25742539   3.699m    8.622us   11.00ns   10.00ns   188.0us
    sc_conn_io_cb              22565666   1.475m    3.920us      -         -      473.9us
    process_stream             21665212   1.195h    198.6us   140.6us   67.08us   1.266ms
    task_process_applet        16352495   11.31m    41.51us   17.98us   36.55us   112.3us
    process_peer_sync           7831923   17.15s    2.189us      -      1.107us   41.27us
    process_table_expire        6878569   6.866m    59.89us   9.359us   51.91us   151.8us
    stktable_add_pend_updates   6602502   14.77s    2.236us      -      2.060us   119.8us
    h1_timeout_task                 801   703.4us   878.0ns      -         -      185.7us
    srv_cleanup_toremove_conns      347   12.43ms   35.82us   240.0ns   70.00ns   1.924ms
    accept_queue_process            142   1.384ms   9.743us      -         -      340.6us
    srv_cleanup_idle_conns           74   475.0us   6.418us   896.0ns   5.667us   114.6us
2025-09-11 16:32:34 +02:00
Willy Tarreau
95433f224e MINOR: activity: add a new lkd_avg column to show profiling stats
This new column will be used for reporting the average time spent
in a task with at least one lock held. It will only have a non-zero
value when DEBUG_THREAD > 0. For now it is not updated.
2025-09-11 16:32:34 +02:00
Willy Tarreau
4b23b2ed32 MINOR: thread: add a lock level information in the thread_ctx
The new lock_level field indicates the number of cumulated locks that
are held by the current thread. It's fed as soon as DEBUG_THREAD is at
least 1. In addition, thread_isolate() adds 128, so that it's even
possible to check for combinations of both. The value is also reported
in thread dumps (warnings and panics).
2025-09-11 16:32:34 +02:00
Willy Tarreau
503084643f MINOR: activity: collect time spent waiting on a lock for each task
When DEBUG_THREAD > 0, and if task profiling is enabled, then each
locking attempt will measure the time it takes to obtain the lock, then
add that time to a thread_ctx accumulator that the scheduler will then
retrieve to update the current task's sched_activity entry. The value
will then appear avearaged over the number of calls in the lkw_avg column
of "show profiling tasks", such as below:

  Tasks activity over 48.298 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lat_avg
    h1_io_cb                    3200170   26.81s    8.377us      -      32.73us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
    sc_conn_io_cb               1657841   1.645s    992.0ns      -      853.0us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
    process_stream              1600450   49.16s    30.71us   1.936us   1.392ms <- sc_notify@src/stconn.c:1206 task_wakeup
    process_stream              1600321   7.770m    291.3us   209.1us   901.6us <- stream_new@src/stream.c:585 task_wakeup
    sc_conn_io_cb               1599928   7.975s    4.984us      -      65.77us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
    task_process_applet          997609   46.37s    46.48us   16.80us   113.0us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
    process_table_expire         922074   48.79s    52.92us   7.275us   181.1us <- run_tasks_from_lists@src/task.c:670 task_queue
    stktable_add_pend_updates    705423   1.511s    2.142us      -      56.81us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
    task_process_applet          683511   34.75s    50.84us   18.37us   153.3us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
    h1_io_cb                     535395   198.1ms   370.0ns   72.00ns   930.4us <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup

It now makes it pretty obvious which tasks (hence call chains) spend their
time waiting on a lock and for what share of their execution time.
2025-09-11 16:32:34 +02:00
Willy Tarreau
1956c544b5 MINOR: activity: add a new lkw_avg column to show profiling stats
This new column will be used for reporting the average time spent waiting
for a lock. It will only have a non-zero value when DEBUG_THREAD > 0. For
now it is not updated.
2025-09-11 16:32:34 +02:00
Christopher Faulet
3023e98199 BUG/MINOR: resolvers: Restore round-robin selection on records in DNS answers
Since the commit dcb696cd3 ("MEDIUM: resolvers: hash the records before
inserting them into the tree"), When several records are found in a DNS
answer, the round robin selection over these records is no longer performed.

Indeed, before a list of records was used. To ensure each records was
selected one after the other, at each selection, the first record of the
list was moved at the end. When this list was replaced bu a tree, the same
mechanism was preserved. However, the record is indexed using its key, a
hash of the record. So its position never changes. When it is removed and
reinserted in the tree, its position remains the same. When we walk though
the tree, starting from the root, the records are always evaluated in the
same order. So, even if there are several records in a DNS answer, the same
IP address is always selected.

It is quite easy to trigger the issue with a do-resolv action.

To fix the issue, the node to perform the next selection is now saved. So
instead of restarting from the root each time, we can restart from the next
node of the previous call.

Thanks to Damien Claisse for the issue analysis and for the reproducer.

This patch should fix the issue #3116. It must be backported as far as 2.6.
2025-09-11 15:46:45 +02:00
William Lallemand
e52e6f66ac BUG/MEDIUM: jws: return size_t in JWS functions
JWS functions are supposed to return 0 upon error or when nothing was
produced. This was done in order to put easily the return value in
trash->data without having to check the return value.

However functions like a2base64url() or snprintf() could return a
negative value, which would be casted in a unsigned int if this happen.

This patch add checks on the JWS functions to ensure that no negative
value can be returned, and change the prototype from int to size_t.

This is also related to issue #3114.

Must be backported to 3.2.
2025-09-11 14:31:32 +02:00
Amaury Denoyelle
d293cc62dc MINOR: quic: display build warning for compat layer on recent OpenSSL
Build option USE_QUIC_OPENSSL_COMPAT=1 must be set to activate QUIC
support for OpenSSL prior to version 3.5.2. This compiles an internal
compatibility layer, which must be then activated at runtime with global
option limited-quic.

Starting from OpenSSL version 3.5.2, a proper QUIC TLS API is now
exposed. Thus, the compatibility layer is unneeded. However it can still
be compiled against newer OpenSSL releases and activated at runtime,
mostly for test purpose.

As this compatibility layer has some limitations, (no support for QUIC
0-RTT), it's important that users notice this situation and disable it
if possible. Thus, this patch adds a notice warning when
USE_QUIC_OPENSSL_COMPAT=1 is set when building against OpenSSL 3.5.2 and
above. This should be sufficient for users and packagers to understand
that this option is not necessary anymore.

Note that USE_QUIC_OPENSSL_COMPAT=1 is incompatible with others TLS
library which exposed a QUIC API based on original BoringSSL patches
set. A build error will prevent the compatibility layer to be built.
limited-quic option is thus silently ignored.
2025-09-11 10:11:12 +02:00
Frederic Lecaille
5027ba36a9 MINOR: quic-be: make SSL/QUIC objects use their own indexes (ssl_qc_app_data_index)
This index is used to retrieve the quic_conn object from its SSL object, the same
way the connection is retrieved from its SSL object for SSL/TCP connections.

This patch implements two helper functions to avoid the ugly code with such blocks:

   #ifdef USE_QUIC
   else if (qc) { .. }
   #endif

Implement ssl_sock_get_listener() to return the listener from an SSL object.
Implement ssl_sock_get_conn() to return the connection from an SSL object
and optionally a pointer to the ssl_sock_ctx struct attached to the connections
or the quic_conns.

Use this functions where applicable:
   - ssl_tlsext_ticket_key_cb() calls ssl_sock_get_listener()
   - ssl_sock_infocbk() calls ssl_sock_get_conn()
   - ssl_sock_msgcbk() calls ssl_sock_get_ssl_conn()
   - ssl_sess_new_srv_cb() calls ssl_sock_get_conn()
   - ssl_sock_srv_verifycbk() calls ssl_sock_get_conn()

Also modify qc_ssl_sess_init() to initialize the ssl_qc_app_data_index index for
the QUIC backends.
2025-09-11 09:51:28 +02:00
Frederic Lecaille
47bb15ca84 MINOR: quic: get rid of ->target quic_conn struct member
The ->li (struct listener *) member of quic_conn struct was replaced by a
->target (struct obj_type *) member by this commit:

    MINOR: quic-be: get rid of ->li quic_conn member

to abstract the connection type (front or back) when implementing QUIC for the
backends. In these cases, ->target was a pointer to the ojb_type of a server
struct. This could not work with the dynamic servers contrary to the listeners
which are not dynamic.

This patch almost reverts the one mentioned above. ->target pointer to obj_type member
is replaced by ->li pointer to listener struct member. As the listener are not
dynamic, this is easy to do this. All one has to do is to replace the
objt_listener(qc->target) statement by qc->li where applicable.

For the backend connection, when needed, this is always qc->conn->target which is
used only when qc->conn is initialized. The only "problematic" case is for
quic_dgram_parse() which takes a pointer to an obj_type as third argument.
But this obj_type is only used to call quic_rx_pkt_parse(). Inside this function
it is used to access the proxy counters of the connection thanks to qc_counters().
So, this obj_type argument may be null for now on with this patch. This is the
reason why qc_counters() is modified to take this into consideration.
2025-09-11 09:51:28 +02:00
Olivier Houchard
ff47ae60f3 MEDIUM: server: Introduce the concept of path parameters
Add a new field in struct server, path parameters. It will contain
connection informations for the server that are not expected to change.
For now, just store the ALPN negociated with the server. Each time an
handhskae is done, we'll update it, even though it is not supposed to
change. This will be useful when trying to send early data, that way
we'll know which mux to use.
Each time the server goes down or is disabled, those informations are
erased, as we can't be sure those parameters will be the same once the
server will be back up.
2025-09-09 19:01:24 +02:00
Olivier Houchard
5ab9954faa MINOR: ssl: Add a flag to let it known we have an ALPN negociated
Add a new flag to the ssl_sock_ctx, to be set as soon as the ALPN has
been negociated.
This happens before the handshake has been completed, and that
information will let us know that, when we receive early data, if the
ALPN has been negociated, then we can immediately create a mux, as the
ALPN will tell us which mux to use.
2025-09-09 19:01:24 +02:00
Willy Tarreau
f87cf8b76e MEDIUM: stick-tables: relax stktable_trash_oldest() to only purge what is needed
stktable_trash_oldest() does insist a lot on purging what was requested,
only limited by STKTABLE_MAX_UPDATES_AT_ONCE. This is called in two
conditions, one to allocate a new stksess, and the other one to purge
entries of a stopping process. The cost of iterating over all shards
is huge, and a shard lock is taken each time before looking up entries.

Moreover, multiple threads can end up doing the same and looking hard for
many entries to purge when only one is needed. Furthermore, all threads
start from the same shard, hence synchronize their locks. All of this
costs a lot to other operations such as access from peers.

This commit simplifies the approach by ignoring the budget, starting
from a random shard number, and using a trylock so as to be able to
give up early in case of contention. The approach chosen here consists
in trying hard to flush at least one entry, but once at least one is
evicted or at least one trylock failed, then a failure on the trylock
will result in finishing.

The function now returns a success as long as one entry was freed.

With this, tests no longer show watchdog warnings during tests, though
a few still remain when stopping the tests (which are not related to
this function but to the contention from process_table_expire()).

With this change, under high contention some entries' purge might be
postponed and the table may occasionally contain slightly more entries
than their size (though this already happens since stksess_new() first
increments ->current before decrementing it).

Measures were made on a 64-core system with 8 peers
of 16 threads each, at CPU saturation (350k req/s each doing 10
track-sc) for 10M req, with 3 different approaches:

  - this one resulted in 1500 failures to find an entry (0.015%
    size overhead), with the lowest contention and the fairest
    peers distibution.

  - leaving only after a success resulted in 229 failures (0.0029%
    size overhead) but doubled the time spent in the function (on
    the write lock precisely).

  - leaving only when both a success and a failed lock were met
    resulted in 31 failures (0.00031% overhead) but the contention
    was high enough again so that peers were not all up to date.

Considering that a saturated machine might exceed its entries by
0.015% is pretty minimal, the mechanism is kept.

This should be backported to 3.2 after a bit more testing as it
resolves some watchdog warnings and panics. It requires precedent
commit "MINOR: stick-table: permit stksess_new() to temporarily
allocate more entries" to over-allocate instead of failing in case
of contention.
2025-09-09 17:56:37 +02:00
Willy Tarreau
c3f94fbd9b DEBUG: stream: count the number of passes in the connect loop
Normally the connect loop cannot loop, but some recent traces can easily
convince one of the opposite. Let's add a counter, including in panic
dumps, in order to avoid the repeated long head scratching sessions
starting with "and what if...". In addition, if it's found to loop, this
time it will be certain and will indicate what to zoom in. This should
be backported to 3.2.
2025-09-09 17:56:14 +02:00
Amaury Denoyelle
0b6908385e BUG/MINOR: quic: properly support GSO on backend side
Previously, GSO emission was explicitely disabled on backend side. This
is not true since the following patch, thus GSO can be used, for example
when transfering large POST requests to a HTTP/3 backend.

  commit e064e5d46171d32097a84b8f84ccc510a5c211db
  MINOR: quic: duplicate GSO unsupp status from listener to conn

However, GSO on the backend side may cause crash when handling EIO. In
this case, GSO must be completely disabled. Previously, this was
performed by flagging listener instance. In backend side, this would
cause a crash as listener is NULL.

This patch fixes it by supporting GSO disable flag for servers. Thus, in
qc_send_ppkts(), EIO can be converted either to a listener or server
flag depending on the quic_conn proxy side. On backend side, server
instance is retrieved via <qc.conn.target>. This is enough to guarantee
that server is not deleted.

This does not need to be backported.
2025-09-08 16:18:05 +02:00
Christopher Faulet
e653dc304e MINOR: pools: Don't dump anymore info about pools when purge is forced
Historically, when the purge of pools was forced by sending a SIGQUIT to
haproxy, information about the pools were first dumped. It is now totally
pointless because these info can be retrieved via the CLI. It is even less
relevant now because the purge is forced typically when there are memroy
issues and to dump pools information, data must be allocated.

dump_pools_info() function was simplified because it is now called only from
an applet. No reason to still try to dump info on stderr.
2025-09-08 16:04:40 +02:00
Amaury Denoyelle
f645cd3c74 MINOR: quic: restore QUIC_HP_SAMPLE_LEN constant
The below patch fixes padding emission for small packets, which is
required to ensure that header protection removal can be performed by
the recipient.

  commit d7dea408c64c327cab6aebf4ccad93405b675565
  BUG/MINOR: quic: too short PADDING frame for too short packets

In addition to the proper fix, constant QUIC_HP_SAMPLE_LEN was removed
and replaced by QUIC_TLS_TAG_LEN. However, it still makes sense to have
a dedicated constant which represent the size of the sample used for
header protection. Thus, this patch restores it.

Special instructions for backport : above patch mentions that no
backport is needed. However, this is incorrect, as bug is introduced by
another patch scheduled for backport up to 2.6. Thus, it is first
mandatory to schedule d7dea408c64c327cab6aebf4ccad93405b675565 after it.
Then, this patch can also be used for the sake of code clarity.
2025-09-08 14:49:03 +02:00
Frederic Lecaille
6f9fccec1f MINOR: quic: SSL session reuse for QUIC
Mimic the same behavior as the one for SSL/TCP connetion to implement the
SSL session reuse.

Extract the code which try to reuse the SSL session for SSL/TCP connections
to implement ssl_sock_srv_try_reuse_sess().
Call this function from QUIC ->init() xprt callback (qc_conn_init()) as this
done for SSL/TCP connections.
2025-09-08 11:46:26 +02:00
Frederic Lecaille
d7dea408c6 BUG/MINOR: quic: too short PADDING frame for too short packets
This bug arrvived with this commit:

    MINOR: quic: centralize padding for HP sampling on packet building

What was missed is the fact that at the centralization point for the
PADDING frame to add for too short packet, <len> payload length  already includes
<*pn_len> the packet number field length value.

So when computing the length of the PADDING frame, the packet field length must
not be considered and added to the payload length (<len>).

This bug leaded too short PADDING frame to too short packets. This was the case,
most of times with Application level packets with a 1-byte packet number field
followed by a 1-byte PING frame. A 1-byte PADDING frame was added in this case
in place of a correct 2-bytes PADDINF frame. The header packet protection of
such packet could not be removed by the clients as for instance for ngtcp2 with
such traces:

    I00001828 0x5a135c81e803f092c74bac64a85513b657 pkt could not decrypt packet number

As the header protection could no be removed, the header keyupdate bit could also
not be read by packet analyzers such as pyshark used during the keyupdate tests.

No need to backport.
2025-09-05 16:17:11 +02:00
Christopher Faulet
f9a6ae727c OPTIM: tcpcheck: Reorder tcpchek_connect structure fields to fill holes
Thanks to this patch, two 4-bytes holes are now filled in the
tcpchek_connect structure.
2025-09-05 15:56:42 +02:00
Christopher Faulet
ffc1f096e0 MEDIUM: httpcheck/ssl: Base the SNI value on the HTTP host header by default
Similarly to the automic SNI selection for regulat SSL traffic, the SNI of
health-checks HTTPS connection is now automatically set by default by using
the host header value. "check-sni-auto" and "no-check-sni-auto" server
settings were added to change this behavior.

Only implicit HTTPS health-checks can take advantage of this feature. In
this case, the host header value from the "option httpchk" directive is used
to extract the SNI. It is disabled if http-check rules are used. So, the SNI
must still be explicitly specified via a "http-check connect" rule.

This patch with should paritally fix the issue #3081.
2025-09-05 15:56:42 +02:00
Christopher Faulet
668916c1a2 MEDIUM: server/ssl: Base the SNI value to the HTTP host header by default
For HTTPS outgoing connections, the SNI is now automatically set using the
Host header value if no other value is already set (via the "sni" server
keyword). It is now the default behavior. It could be disabled with the
"no-sni-auto" server keyword. And eventually "sni-auto" server keyword may
be used to reset any previous "no-sni-auto" setting. This option can be
inherited from "default-server" settings. Finally, if no connection name is
set via "pool-conn-name" setting, the selected value is used.

The automatic selection of the SNI is enabled by default for all outgoing
connections. But it is concretely used for HTTPS connections only. The
expression used is "req.hdr(host),host_only".

This patch should paritally fix the issue #3081. It only covers the server
part. Another patch will add the feature for HTTP health-checks.
2025-09-05 15:56:42 +02:00