The listener ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct listener for this node, plus 8 bytes
in struct proxy. The struct listener is now 704 bytes large, and the
struct proxy 3080.
The proxy ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct proxy for this node, plus 8 bytes in
the root (which is static so not very relevant here). Now the proxy is
3088 bytes large.
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated server_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated listener_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
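
What such a dedicated helper has to do (find the first hole in the set of
IDs already assigned) can be sketched as follows. This is only an
illustration: next_free_id() is a hypothetical name, and a sorted array
stands in for the ordered tree walk, so it does not reflect the actual
server_get_next_id() or listener_get_next_id() code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the first unused ID >= from, given the IDs
 * already in use, sorted in ascending order without duplicates. An ordered
 * tree would perform the same walk with lookup/next operations.
 * Wrap-around at UINT_MAX is deliberately not handled in this sketch.
 */
static unsigned int next_free_id(const unsigned int *used, size_t count,
                                 unsigned int from)
{
    size_t i;

    assert(used != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (used[i] < from)
            continue;   /* below the search start, ignore */
        if (used[i] > from)
            break;      /* a hole exists before this entry */
        from++;         /* this ID is taken, try the next one */
    }
    return from;
}
```

With the IDs 1, 2, 3, 5 and 6 in use, a search starting at 1 yields 4, and a
search starting at 5 yields 7.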
This is used to index the proxy's name and it contains a copy of the
pointer to the proxy's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct proxy, which is now 3112
bytes large.
Here we need to continue to support duplicates since they're still
allowed between type-incompatible proxies.
Interestingly, the use of cebis_next_dup() instead of cebis_next() in
proxy_find_by_name() allows us to get rid of an strcmp() that was
performed for each use_backend rule. A test with a large config
(100k backends) shows that we can get 3% extra performance on a
config involving a static use_backend rule (3.09M to 3.18M rps),
and even 4.5% on a dynamic rule selecting a random backend (2.47M
to 2.59M).
This member is used to index the hostname_dn contents for DNS resolution.
Let's replace it with a cebis_tree to save another 32 bytes (24 for the
node + 8 by avoiding the duplication of the pointer). The struct server is
now at 3904 bytes.
This is used to index the server name and it contains a copy of the
pointer to the server's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct server, which remains 3968
bytes large due to alignment. The proxy struct shrinks by 8 bytes to 3144.
It's worth noting that the current way duplicate names are handled remains
based on the previous mechanism where dups were permitted. Ideally we
should now reject them during insertion and use unique key trees instead.
This contains the text representation of the server's address, for use
with stick-tables with "srvkey addr". Switching them to a compact node
saves 24 more bytes from this structure. The key was moved to an external
pointer "addr_key" right after the node.
The server struct is now 3968 bytes (down from 4032) due to alignment, and
the proxy struct shrinks by 8 bytes to 3152.
The current guid struct size is 56 bytes. Once reduced using compact
trees, it goes down to 32 (almost half). We're not on a critical path
and size matters here, so better switch to this.
It's worth noting that the name part could also be stored in the
guid_node at the end to save 8 extra bytes (no pointer needed anymore),
however the purpose of this struct is to be embedded into other ones,
which is not compatible with having a dynamic size.
Affected struct sizes in bytes:
            Before   After   Diff
  server      4032    4032     0*
  proxy       3184    3160   -24
  listener     752     728   -24
*: struct server is full of holes and padding (176 bytes) and is
64-byte aligned. Moving the guid_node elsewhere such as after sess_conn
reduces it to 3968, or one less cache line. There's no point in moving
anything now because forthcoming patches will arrange other parts.
cebs_tree are 24 bytes smaller than ebst_tree (16B vs 40B), and pattern
references are only used during map/acl updates, so their storage is
pure loss between updates (which most of the time never happen). By
switching their indexing to compact trees, we can save 16 to 24 bytes
per entry depending on alignment (here it's 24 per struct but 16 in
practice, as malloc's alignment keeps 8 unused).
Tested on core i7-8650U running at 3.0 GHz, with a file containing
17.7M IP addresses (16.7M different):
$ time ./haproxy -c -f acl-ip.cfg
This saves 280 MB of RAM for 17.7M IP addresses, and slightly speeds up the
startup (5.8%, from 19.2s to 18.2s), part of which can possibly be
attributed to having to write less memory. Note that this is on small
strings. On larger ones such as user-agents, ebtree doesn't reread
the whole key and might be more efficient.
Before:
RAM (VSZ/RSS): 4443912 3912444
real 0m19.211s
user 0m18.138s
sys 0m1.068s
Overhead Command Shared Object Symbol
44.79% haproxy haproxy [.] ebst_insert
25.07% haproxy haproxy [.] ebmb_insert_prefix
3.44% haproxy libc-2.33.so [.] __libc_calloc
2.71% haproxy libc-2.33.so [.] _int_malloc
2.33% haproxy haproxy [.] free_pattern_tree
1.78% haproxy libc-2.33.so [.] inet_pton4
1.62% haproxy libc-2.33.so [.] _IO_fgets
1.58% haproxy libc-2.33.so [.] _int_free
1.56% haproxy haproxy [.] pat_ref_push
1.35% haproxy libc-2.33.so [.] malloc_consolidate
1.16% haproxy libc-2.33.so [.] __strlen_avx2
0.79% haproxy haproxy [.] pat_idx_tree_ip
0.76% haproxy haproxy [.] pat_ref_read_from_file
0.60% haproxy libc-2.33.so [.] __strrchr_avx2
0.55% haproxy libc-2.33.so [.] unlink_chunk.constprop.0
0.54% haproxy libc-2.33.so [.] __memchr_avx2
0.46% haproxy haproxy [.] pat_ref_append
After:
RAM (VSZ/RSS): 4166108 3634768
real 0m18.114s
user 0m17.113s
sys 0m0.996s
Overhead Command Shared Object Symbol
38.99% haproxy haproxy [.] cebs_insert
27.09% haproxy haproxy [.] ebmb_insert_prefix
3.63% haproxy libc-2.33.so [.] __libc_calloc
3.18% haproxy libc-2.33.so [.] _int_malloc
2.69% haproxy haproxy [.] free_pattern_tree
1.99% haproxy libc-2.33.so [.] inet_pton4
1.74% haproxy libc-2.33.so [.] _IO_fgets
1.73% haproxy libc-2.33.so [.] _int_free
1.57% haproxy haproxy [.] pat_ref_push
1.48% haproxy libc-2.33.so [.] malloc_consolidate
1.22% haproxy libc-2.33.so [.] __strlen_avx2
1.05% haproxy libc-2.33.so [.] __strcmp_avx2
0.80% haproxy haproxy [.] pat_idx_tree_ip
0.74% haproxy libc-2.33.so [.] __memchr_avx2
0.69% haproxy libc-2.33.so [.] __strrchr_avx2
0.69% haproxy libc-2.33.so [.] _IO_getline_info
0.62% haproxy haproxy [.] pat_ref_read_from_file
0.56% haproxy libc-2.33.so [.] unlink_chunk.constprop.0
0.56% haproxy libc-2.33.so [.] cfree@GLIBC_2.2.5
0.46% haproxy haproxy [.] pat_ref_append
If the addresses are totally disordered (via "shuf" on the input file),
we see both implementations reach exactly 68.0s (slower due to much
higher cache miss ratio).
On large strings such as user agents (1 million here), it's now slightly
slower (+9%):
Before:
real 0m2.475s
user 0m2.316s
sys 0m0.155s
After:
real 0m2.696s
user 0m2.544s
sys 0m0.147s
But such patterns are much less common than short ones, and the memory
savings do still count.
Note that while it could be tempting to get rid of the list that chains
all these pat_ref_elt together and only enumerate them by walking along
the tree to save 16 extra bytes per entry, that's not possible due to
the problem that insertion ordering is critical (think overlapping regex
such as /index.* and /index.html). Currently it's not possible to proceed
differently because patterns are first pre-loaded into the pat_ref via
pat_ref_read_from_file_smp() and later indexed by pattern_read_from_file(),
which has to only redo the second part anyway for maps/acls declared
multiple times.
The support for duplicates is necessary for various use cases related
to config names, so let's upgrade to the latest version which brings
this support. This updates the cebtree code to commit 808ed67 (tag
0.5.0). A few tiny adaptations were needed:
- replace a few ceb_node** with ceb_root** since pointers are now
tagged ;
- replace cebu*.h with ceb*.h since both are now merged in the same
include file. This way we can drop the unused cebu*.h files from
cebtree that are provided only for compatibility.
- rename immediate storage functions to cebXX_imm_XXX() as per the API
change in 0.5 that makes immediate explicit rather than implicit.
This only affects vars and tools.c:copy_file_name().
The tests continue to work.
If an ocsp response is set to be updated automatically and some
certificate or CA updates are performed on the CLI, if the CLI update
happens while the OCSP response is being updated and is then detached
from the update tree, it might be wrongly inserted into the update tree
in 'ssl_sock_load_ocsp', and then reinserted when the update finishes.
The update tree then gets corrupted and we could end up crashing when
accessing other nodes in the ocsp response update tree.
This patch must be backported up to 2.8.
This patch fixes GitHub #3100.
An eb tree was used to anticipate an unbounded number of custom log steps
configured at the proxy level. It turns out that it makes no sense to
configure that many logging steps for a proxy, and the cost of the eb tree
is non-negligible in terms of memory footprint, especially when used in a
default section.
Instead, let's use a simple bitmask, which allows up to 64 logging steps
configured at the proxy level. If we lack space some day (and need more than
64 logging steps to be configured), we could simply modify
"struct log_steps" to spread the bitmask over multiple 64-bit integers,
modulo some adjustments where the mask is set and checked.
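
The bitmask approach described above can be sketched as follows. The struct
and function names are hypothetical and only illustrate the idea of one bit
per logging step; they are not the actual "struct log_steps" layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: one bit per logging step, up to 64 steps. */
struct log_steps_sketch {
    uint64_t steps;
};

/* Enable logging at step <step> (0..63). */
static inline void log_steps_set(struct log_steps_sketch *ls, unsigned int step)
{
    assert(step < 64);
    ls->steps |= UINT64_C(1) << step;
}

/* Return 1 if logging is enabled at step <step>, otherwise 0. */
static inline int log_steps_isset(const struct log_steps_sketch *ls,
                                  unsigned int step)
{
    assert(step < 64);
    return (int)((ls->steps >> step) & 1);
}
```

The whole set fits in 8 bytes regardless of how many steps are enabled,
whereas a tree needs a root plus one node per configured step.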
By checking the current thread's locking status, it becomes possible
to know during a memory allocation whether it's performed under a lock
or not. Both pools and memprofile functions were instrumented to check
for this and to increment the memprofile bin's locked_calls counter.
This one, when not zero, is reported on "show profiling memory" with a
percentage of all allocations that such locked allocations represent.
This way it becomes possible to try to target certain code paths that
are particularly expensive. Example:
$ socat - /tmp/sock1 <<< "show profiling memory"|grep lock
20297301 0 2598054528 0| 0x62a820fa3991 sockaddr_alloc+0x61/0xa3 p_alloc(128) [pool=sockaddr] [locked=54962 (0.2 %)]
0 20297301 0 2598054528| 0x62a820fa3a24 sockaddr_free+0x44/0x59 p_free(-128) [pool=sockaddr] [locked=34300 (0.1 %)]
9908432 0 1268279296 0| 0x62a820eb8524 main+0x81974 p_alloc(128) [pool=task] [locked=9908432 (100.0 %)]
9908432 0 554872192 0| 0x62a820eb85a6 main+0x819f6 p_alloc(56) [pool=tasklet] [locked=9908432 (100.0 %)]
263001 0 63120240 0| 0x62a820fa3c97 conn_new+0x37/0x1b2 p_alloc(240) [pool=connection] [locked=20662 (7.8 %)]
71643 0 47307584 0| 0x62a82105204d pool_get_from_os_noinc+0x12d/0x161 posix_memalign(660) [locked=5393 (7.5 %)]
When task profiling is enabled, the pool alloc/free code will measure the
time it takes to perform memory allocation after a cache miss or memory
freeing to the shared cache or OS. The time taken with the thread-local
cache is never measured as measuring that time is very expensive compared
to the pool access time. Here doing so costs around 2% performance at 2M
req/s, only when task profiling is enabled, so this remains reasonable.
The scheduler takes care of collecting that time and updating the
sched_activity entry corresponding to the current task when task profiling
is enabled.
The goal clearly is to track places that are wasting CPU time allocating
and releasing too often, or causing large evictions. This appears like
this in "show profiling tasks aggr":
Tasks activity over 11.428 sec till 0.000 sec ago:
function calls cpu_tot cpu_avg lkw_avg lkd_avg mem_avg lat_avg
process_stream 44183891 16.47m 22.36us 491.0ns 1.154us 1.000ns 101.1us
h1_io_cb 57386064 4.011m 4.193us 20.00ns 16.00ns - 29.47us
sc_conn_io_cb 42088024 49.04s 1.165us - - - 54.67us
h1_timeout_task 438171 196.5ms 448.0ns - - - 100.1us
srv_cleanup_toremove_conns 65 1.468ms 22.58us 184.0ns 87.00ns - 101.3us
task_process_applet 3 508.0us 169.3us - 107.0us 1.847us 29.67us
srv_cleanup_idle_conns 6 225.3us 37.55us 15.74us 36.84us - 49.47us
accept_queue_process 2 45.62us 22.81us - - 4.949us 54.33us
This new column will be used for reporting the average time spent
allocating or freeing memory in a task when task profiling is enabled.
For now it is not updated.
When DEBUG_THREAD > 0 and task profiling is enabled, we'll now measure the
time spent with at least one lock held for each task. The time is
collected by locking operations when locks are taken raising the level
to one, or released resetting the level. An accumulator is updated in
the thread_ctx struct that is collected by the scheduler when the task
returns, and updated in the sched_activity entry of the related task.
This allows to observe figures like this one:
Tasks activity over 259.516 sec till 0.000 sec ago:
function calls cpu_tot cpu_avg lkw_avg lkd_avg lat_avg
h1_io_cb 15466589 2.574m 9.984us - - 33.45us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
sc_conn_io_cb 8047994 8.325s 1.034us - - 870.1us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
process_stream 7734689 4.356m 33.79us 1.990us 1.641us 1.554ms <- sc_notify@src/stconn.c:1206 task_wakeup
process_stream 7734292 46.74m 362.6us 278.3us 132.2us 972.0us <- stream_new@src/stream.c:585 task_wakeup
sc_conn_io_cb 7733158 46.88s 6.061us - - 68.78us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
task_process_applet 6603593 4.484m 40.74us 16.69us 34.00us 96.47us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
task_process_applet 4761796 3.420m 43.09us 18.79us 39.28us 138.2us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
process_table_expire 4710662 4.880m 62.16us 9.648us 53.95us 158.6us <- run_tasks_from_lists@src/task.c:671 task_queue
stktable_add_pend_updates 4171868 6.786s 1.626us - 1.487us 47.94us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
h1_io_cb 2871683 1.198s 417.0ns 70.00ns 69.00ns 1.005ms <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup
process_peer_sync 2304957 5.368s 2.328us - 1.156us 68.54us <- stktable_add_pend_updates@src/stick_table.c:873 task_wakeup
process_peer_sync 1388141 3.174s 2.286us - 1.130us 52.31us <- run_tasks_from_lists@src/task.c:671 task_queue
stktable_add_pend_updates 463488 3.530s 7.615us 2.000ns 7.134us 771.2us <- stktable_touch_with_exp@src/stick_table.c:654 tasklet_wakeup
Here we see that almost the entirety of stktable_add_pend_updates() is
spent under a lock, that 1/3 of the execution time of process_stream()
was performed under a lock and that 2/3 of it was spent waiting for a
lock (this is related to the 10 track-sc present in this config), and
that the locking time in process_peer_sync() has now significantly
reduced. This is more visible with "show profiling tasks aggr":
Tasks activity over 475.354 sec till 0.000 sec ago:
function calls cpu_tot cpu_avg lkw_avg lkd_avg lat_avg
h1_io_cb 25742539 3.699m 8.622us 11.00ns 10.00ns 188.0us
sc_conn_io_cb 22565666 1.475m 3.920us - - 473.9us
process_stream 21665212 1.195h 198.6us 140.6us 67.08us 1.266ms
task_process_applet 16352495 11.31m 41.51us 17.98us 36.55us 112.3us
process_peer_sync 7831923 17.15s 2.189us - 1.107us 41.27us
process_table_expire 6878569 6.866m 59.89us 9.359us 51.91us 151.8us
stktable_add_pend_updates 6602502 14.77s 2.236us - 2.060us 119.8us
h1_timeout_task 801 703.4us 878.0ns - - 185.7us
srv_cleanup_toremove_conns 347 12.43ms 35.82us 240.0ns 70.00ns 1.924ms
accept_queue_process 142 1.384ms 9.743us - - 340.6us
srv_cleanup_idle_conns 74 475.0us 6.418us 896.0ns 5.667us 114.6us
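
The accounting behind the lkd_avg column (time measured from the moment the
lock level rises to one until it drops back to zero, accumulated per thread
and collected by the scheduler) can be sketched like this. All names are
hypothetical and timestamps are passed explicitly to keep the example
deterministic; the real code reads a clock and lives in the locking
primitives.

```c
#include <assert.h>
#include <stdint.h>

struct lock_time_sketch {
    unsigned int lock_level;   /* number of locks currently held */
    uint64_t lock_start_ns;    /* date at which the level went from 0 to 1 */
    uint64_t locked_total_ns;  /* accumulator collected by the scheduler */
};

/* Called once a lock was obtained. */
static void sketch_lock_taken(struct lock_time_sketch *th, uint64_t now_ns)
{
    if (th->lock_level++ == 0)
        th->lock_start_ns = now_ns;  /* first lock: start measuring */
}

/* Called once a lock was released. */
static void sketch_lock_released(struct lock_time_sketch *th, uint64_t now_ns)
{
    assert(th->lock_level > 0);
    if (--th->lock_level == 0)       /* last lock: stop measuring */
        th->locked_total_ns += now_ns - th->lock_start_ns;
}

/* Called by the scheduler when the task returns: fetch and reset. */
static uint64_t sketch_collect_locked(struct lock_time_sketch *th)
{
    uint64_t ret = th->locked_total_ns;

    th->locked_total_ns = 0;
    return ret;
}
```

Nested locks are only counted once: the time spent holding a second lock
while the first one is still held does not add to the total.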
This new column will be used for reporting the average time spent
in a task with at least one lock held. It will only have a non-zero
value when DEBUG_THREAD > 0. For now it is not updated.
The new lock_level field indicates the number of cumulated locks that
are held by the current thread. It's fed as soon as DEBUG_THREAD is at
least 1. In addition, thread_isolate() adds 128, so that it's even
possible to check for combinations of both. The value is also reported
in thread dumps (warnings and panics).
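
A minimal sketch of how such a combined counter can be read, assuming fewer
than 128 locks are ever nested. The struct and helper names are hypothetical;
only the value 128 comes from the description above.

```c
#include <assert.h>

#define SKETCH_ISOLATED 128   /* amount added while isolated, per the text */

struct lock_level_sketch {
    unsigned int lock_level;
};

static void sketch_lock(struct lock_level_sketch *th)
{
    th->lock_level++;
}

static void sketch_unlock(struct lock_level_sketch *th)
{
    assert(th->lock_level & (SKETCH_ISOLATED - 1)); /* must hold a lock */
    th->lock_level--;
}

static void sketch_isolate(struct lock_level_sketch *th)
{
    th->lock_level += SKETCH_ISOLATED;
}

static void sketch_release(struct lock_level_sketch *th)
{
    assert(th->lock_level >= SKETCH_ISOLATED);
    th->lock_level -= SKETCH_ISOLATED;
}

/* Number of locks held, ignoring the isolation part. */
static unsigned int sketch_locks_held(const struct lock_level_sketch *th)
{
    return th->lock_level & (SKETCH_ISOLATED - 1);
}

/* Non-zero if the thread is currently isolated. */
static int sketch_is_isolated(const struct lock_level_sketch *th)
{
    return th->lock_level >= SKETCH_ISOLATED;
}
```

With two locks held under isolation the raw value reads 130, which makes
both conditions visible at once in a dump.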
When DEBUG_THREAD > 0, and if task profiling is enabled, then each
locking attempt will measure the time it takes to obtain the lock, then
add that time to a thread_ctx accumulator that the scheduler will then
retrieve to update the current task's sched_activity entry. The value
will then appear averaged over the number of calls in the lkw_avg column
of "show profiling tasks", such as below:
Tasks activity over 48.298 sec till 0.000 sec ago:
function calls cpu_tot cpu_avg lkw_avg lat_avg
h1_io_cb 3200170 26.81s 8.377us - 32.73us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
sc_conn_io_cb 1657841 1.645s 992.0ns - 853.0us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
process_stream 1600450 49.16s 30.71us 1.936us 1.392ms <- sc_notify@src/stconn.c:1206 task_wakeup
process_stream 1600321 7.770m 291.3us 209.1us 901.6us <- stream_new@src/stream.c:585 task_wakeup
sc_conn_io_cb 1599928 7.975s 4.984us - 65.77us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
task_process_applet 997609 46.37s 46.48us 16.80us 113.0us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
process_table_expire 922074 48.79s 52.92us 7.275us 181.1us <- run_tasks_from_lists@src/task.c:670 task_queue
stktable_add_pend_updates 705423 1.511s 2.142us - 56.81us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
task_process_applet 683511 34.75s 50.84us 18.37us 153.3us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
h1_io_cb 535395 198.1ms 370.0ns 72.00ns 930.4us <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup
It now makes it pretty obvious which tasks (hence call chains) spend their
time waiting on a lock and for what share of their execution time.
This new column will be used for reporting the average time spent waiting
for a lock. It will only have a non-zero value when DEBUG_THREAD > 0. For
now it is not updated.
Since the commit dcb696cd3 ("MEDIUM: resolvers: hash the records before
inserting them into the tree"), when several records are found in a DNS
answer, the round robin selection over these records is no longer performed.
Indeed, a list of records was used before. To ensure each record was
selected one after the other, at each selection, the first record of the
list was moved to the end. When this list was replaced by a tree, the same
mechanism was preserved. However, the record is indexed using its key, a
hash of the record, so its position never changes: when it is removed and
reinserted in the tree, it lands at the same place. When we walk through
the tree, starting from the root, the records are always evaluated in the
same order. So, even if there are several records in a DNS answer, the same
IP address is always selected.
It is quite easy to trigger the issue with a do-resolv action.
To fix the issue, the node to perform the next selection is now saved. So
instead of restarting from the root each time, we can restart from the next
node of the previous call.
Thanks to Damien Claisse for the issue analysis and for the reproducer.
This patch should fix the issue #3116. It must be backported as far as 2.6.
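
The principle of the fix (remember where the previous selection stopped
instead of restarting from the root) can be sketched as follows. An array
ordered by key stands in for the tree walk, and all names are hypothetical;
this is not the resolvers code.

```c
#include <assert.h>
#include <stddef.h>

struct answer_sketch {
    const char **records;  /* records in key order, as a tree walk returns them */
    size_t count;          /* number of records in the answer */
    size_t next;           /* position saved by the previous selection */
};

/* Return the next record in round robin order. */
static const char *pick_record(struct answer_sketch *ans)
{
    const char *rec;

    assert(ans->count > 0);
    if (ans->next >= ans->count)
        ans->next = 0;             /* end of the walk reached, wrap around */
    rec = ans->records[ans->next];
    ans->next++;                   /* the next call restarts after this one */
    return rec;
}
```

Since the saved position moves while the records keep their key order, each
call returns a different record until the walk wraps around.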
JWS functions are supposed to return 0 upon error or when nothing was
produced. This was done in order to easily store the return value in
trash->data without having to check it.
However, functions like a2base64url() or snprintf() can return a
negative value, which would then be cast to an unsigned int.
This patch adds checks to the JWS functions to ensure that no negative
value can be returned, and changes the prototype from int to size_t.
This is also related to issue #3114.
Must be backported to 3.2.
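
The pattern can be illustrated with a small snprintf() wrapper. The function
name and the output format are hypothetical and are not taken from the JWS
code; the point is only that negative or truncated results are turned into 0
before being exposed as a size_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write "key":"val" into <dst> and return the number
 * of bytes produced, or 0 on error or if the output did not fit.
 */
static size_t sketch_emit_kv(char *dst, size_t size,
                             const char *key, const char *val)
{
    int ret;

    assert(dst != NULL);
    ret = snprintf(dst, size, "\"%s\":\"%s\"", key, val);
    if (ret < 0 || (size_t)ret >= size)
        return 0;   /* error or truncation: report "nothing produced" */
    return (size_t)ret;
}
```

The returned value can then be assigned to an unsigned length field without
any risk of a negative number being converted into a huge length.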
Build option USE_QUIC_OPENSSL_COMPAT=1 must be set to activate QUIC
support for OpenSSL prior to version 3.5.2. This compiles an internal
compatibility layer, which must be then activated at runtime with global
option limited-quic.
Starting from OpenSSL version 3.5.2, a proper QUIC TLS API is now
exposed. Thus, the compatibility layer is unneeded. However it can still
be compiled against newer OpenSSL releases and activated at runtime,
mostly for testing purposes.
As this compatibility layer has some limitations (no support for QUIC
0-RTT), it's important that users notice this situation and disable it
if possible. Thus, this patch adds a notice warning when
USE_QUIC_OPENSSL_COMPAT=1 is set when building against OpenSSL 3.5.2 and
above. This should be sufficient for users and packagers to understand
that this option is not necessary anymore.
Note that USE_QUIC_OPENSSL_COMPAT=1 is incompatible with other TLS
libraries which expose a QUIC API based on the original BoringSSL patch
set. A build error will prevent the compatibility layer from being built.
The limited-quic option is thus silently ignored.
This index is used to retrieve the quic_conn object from its SSL object, the same
way the connection is retrieved from its SSL object for SSL/TCP connections.
This patch implements two helper functions to avoid the ugly code with such blocks:
#ifdef USE_QUIC
else if (qc) { .. }
#endif
Implement ssl_sock_get_listener() to return the listener from an SSL object.
Implement ssl_sock_get_conn() to return the connection from an SSL object
and optionally a pointer to the ssl_sock_ctx struct attached to the connections
or the quic_conns.
Use these functions where applicable:
- ssl_tlsext_ticket_key_cb() calls ssl_sock_get_listener()
- ssl_sock_infocbk() calls ssl_sock_get_conn()
- ssl_sock_msgcbk() calls ssl_sock_get_ssl_conn()
- ssl_sess_new_srv_cb() calls ssl_sock_get_conn()
- ssl_sock_srv_verifycbk() calls ssl_sock_get_conn()
Also modify qc_ssl_sess_init() to initialize the ssl_qc_app_data_index index for
the QUIC backends.
The ->li (struct listener *) member of quic_conn struct was replaced by a
->target (struct obj_type *) member by this commit:
MINOR: quic-be: get rid of ->li quic_conn member
to abstract the connection type (front or back) when implementing QUIC for the
backends. In these cases, ->target was a pointer to the obj_type of a server
struct. This could not work with dynamic servers, contrary to listeners
which are not dynamic.
This patch almost reverts the one mentioned above. The ->target pointer to
obj_type member is replaced by a ->li pointer to listener struct member. As
listeners are not dynamic, this is easy to do. All one has to do is to replace the
objt_listener(qc->target) statement by qc->li where applicable.
For the backend connection, when needed, this is always qc->conn->target which is
used only when qc->conn is initialized. The only "problematic" case is for
quic_dgram_parse() which takes a pointer to an obj_type as third argument.
But this obj_type is only used to call quic_rx_pkt_parse(). Inside this function
it is used to access the proxy counters of the connection thanks to qc_counters().
So, this obj_type argument may be NULL from now on with this patch. This is the
reason why qc_counters() is modified to take this into consideration.
Add a new field in struct server, path parameters. It will contain
connection information for the server that is not expected to change.
For now, just store the ALPN negotiated with the server. Each time a
handshake is done, we'll update it, even though it is not supposed to
change. This will be useful when trying to send early data, that way
we'll know which mux to use.
Each time the server goes down or is disabled, this information is
erased, as we can't be sure those parameters will be the same once the
server is back up.
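
The life cycle of these parameters can be sketched as follows. The struct,
its field and the helper names are hypothetical, as is the 15-byte limit on
the ALPN string; the actual field layout in struct server may differ.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_ALPN_MAX 15   /* arbitrary limit chosen for this sketch */

struct path_params_sketch {
    char nego_alpn[SKETCH_ALPN_MAX + 1];  /* empty string when unknown */
};

/* Called after each handshake: remember the negotiated ALPN. */
static void path_params_set_alpn(struct path_params_sketch *pp,
                                 const char *alpn, size_t len)
{
    assert(len <= SKETCH_ALPN_MAX);
    memcpy(pp->nego_alpn, alpn, len);
    pp->nego_alpn[len] = '\0';
}

/* Called when the server goes down or is disabled: forget everything. */
static void path_params_reset(struct path_params_sketch *pp)
{
    memset(pp, 0, sizeof(*pp));
}

/* Non-zero if an ALPN is known, i.e. the mux can be chosen early. */
static int path_params_has_alpn(const struct path_params_sketch *pp)
{
    return pp->nego_alpn[0] != '\0';
}
```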
Add a new flag to the ssl_sock_ctx, to be set as soon as the ALPN has
been negotiated.
This happens before the handshake has been completed, and that
information will let us know that, when we receive early data, if the
ALPN has been negotiated, then we can immediately create a mux, as the
ALPN will tell us which mux to use.
stktable_trash_oldest() does insist a lot on purging what was requested,
only limited by STKTABLE_MAX_UPDATES_AT_ONCE. This is called in two
conditions, one to allocate a new stksess, and the other one to purge
entries of a stopping process. The cost of iterating over all shards
is huge, and a shard lock is taken each time before looking up entries.
Moreover, multiple threads can end up doing the same and looking hard for
many entries to purge when only one is needed. Furthermore, all threads
start from the same shard, hence synchronize their locks. All of this
costs a lot to other operations such as access from peers.
This commit simplifies the approach by ignoring the budget, starting
from a random shard number, and using a trylock so as to be able to
give up early in case of contention. The approach chosen here consists
in trying hard to flush at least one entry, but once at least one is
evicted or at least one trylock failed, then a failure on the trylock
will result in finishing.
The function now returns a success as long as one entry was freed.
With this, tests no longer show watchdog warnings while running, though
a few still remain when stopping the tests (which are not related to
this function but to the contention from process_table_expire()).
With this change, under high contention some entries' purge might be
postponed and the table may occasionally contain slightly more entries
than its size (though this already happens since stksess_new() first
increments ->current before decrementing it).
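
The give-up rule described above can be sketched as follows. This is a loose
interpretation, not the stktable_trash_oldest() code: an atomic_flag stands
in for each shard lock, each shard only carries a counter of expirable
entries, at most one entry is evicted per visited shard, and the starting
shard is passed by the caller (who would pick it randomly). All names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_NB_SHARDS 8

struct shard_sketch {
    atomic_flag busy;      /* stand-in for the shard lock */
    unsigned int entries;  /* number of expirable entries in this shard */
};

/* Try to evict entries, visiting shards from <start>. Returns 1 if at
 * least one entry was freed, otherwise 0.
 */
static int sketch_trash_oldest(struct shard_sketch *shards, unsigned int start)
{
    int done = 0, failed = 0;
    unsigned int i;

    assert(start < SKETCH_NB_SHARDS);
    for (i = 0; i < SKETCH_NB_SHARDS; i++) {
        struct shard_sketch *s = &shards[(start + i) % SKETCH_NB_SHARDS];

        if (atomic_flag_test_and_set(&s->busy)) {
            /* trylock failed: finish if something was already evicted
             * or if a previous trylock already failed.
             */
            if (done || failed)
                break;
            failed = 1;
            continue;
        }
        if (s->entries) {
            s->entries--;
            done = 1;
        }
        atomic_flag_clear(&s->busy);
    }
    return done;
}
```

Starting each caller on a different shard spreads the threads over the
locks, and the trylock keeps a contended shard from stalling the caller.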
Measurements were made on a 64-core system with 8 peers
of 16 threads each, at CPU saturation (350k req/s each doing 10
track-sc) for 10M req, with 3 different approaches:
- this one resulted in 1500 failures to find an entry (0.015%
size overhead), with the lowest contention and the fairest
peers distribution.
- leaving only after a success resulted in 229 failures (0.0029%
size overhead) but doubled the time spent in the function (on
the write lock precisely).
- leaving only when both a success and a failed lock were met
resulted in 31 failures (0.00031% overhead) but the contention
was high enough again so that peers were not all up to date.
Considering that a saturated machine exceeding its entries by
0.015% is pretty minimal, the mechanism is kept.
This should be backported to 3.2 after a bit more testing as it
resolves some watchdog warnings and panics. It requires the preceding
commit "MINOR: stick-table: permit stksess_new() to temporarily
allocate more entries" to over-allocate instead of failing in case
of contention.
Normally the connect loop cannot loop, but some recent traces can easily
convince one of the opposite. Let's add a counter, also reported in panic
dumps, in order to avoid the repeated long head scratching sessions
starting with "and what if...". In addition, if it's found to loop, this
time it will be certain and will indicate what to zoom in. This should
be backported to 3.2.
Previously, GSO emission was explicitly disabled on the backend side. This
is no longer true since the following patch, thus GSO can be used, for example
when transferring large POST requests to an HTTP/3 backend.
commit e064e5d46171d32097a84b8f84ccc510a5c211db
MINOR: quic: duplicate GSO unsupp status from listener to conn
However, GSO on the backend side may cause a crash when handling EIO. In
this case, GSO must be completely disabled. Previously, this was
performed by flagging the listener instance. On the backend side, this would
cause a crash as the listener is NULL.
This patch fixes it by supporting GSO disable flag for servers. Thus, in
qc_send_ppkts(), EIO can be converted either to a listener or server
flag depending on the quic_conn proxy side. On backend side, server
instance is retrieved via <qc.conn.target>. This is enough to guarantee
that server is not deleted.
This does not need to be backported.
Historically, when the purge of pools was forced by sending a SIGQUIT to
haproxy, information about the pools was first dumped. It is now totally
pointless because this info can be retrieved via the CLI. It is even less
relevant now because the purge is typically forced when there are memory
issues, and dumping pools information requires allocating data.
The dump_pools_info() function was simplified because it is now called only
from an applet. There is no reason to still try to dump info on stderr.
The below patch fixes padding emission for small packets, which is
required to ensure that header protection removal can be performed by
the recipient.
commit d7dea408c64c327cab6aebf4ccad93405b675565
BUG/MINOR: quic: too short PADDING frame for too short packets
In addition to the proper fix, constant QUIC_HP_SAMPLE_LEN was removed
and replaced by QUIC_TLS_TAG_LEN. However, it still makes sense to have
a dedicated constant which represents the size of the sample used for
header protection. Thus, this patch restores it.
Special instructions for backport : above patch mentions that no
backport is needed. However, this is incorrect, as bug is introduced by
another patch scheduled for backport up to 2.6. Thus, it is first
mandatory to schedule d7dea408c64c327cab6aebf4ccad93405b675565 after it.
Then, this patch can also be used for the sake of code clarity.
Mimic the same behavior as the one for SSL/TCP connections to implement
SSL session reuse.
Extract the code which tries to reuse the SSL session for SSL/TCP connections
to implement ssl_sock_srv_try_reuse_sess().
Call this function from the QUIC ->init() xprt callback (qc_conn_init()) as this
is done for SSL/TCP connections.
This bug arrived with this commit:
MINOR: quic: centralize padding for HP sampling on packet building
What was missed is the fact that at the centralization point for the
PADDING frame to add for too short packet, <len> payload length already includes
<*pn_len> the packet number field length value.
So when computing the length of the PADDING frame, the packet field length must
not be considered and added to the payload length (<len>).
This bug led to too short PADDING frames for too short packets. This was the case,
most of the time, with Application level packets with a 1-byte packet number field
followed by a 1-byte PING frame. A 1-byte PADDING frame was added in this case
in place of a correct 2-byte PADDING frame. The header protection of
such packets could not be removed by clients, as for instance with ngtcp2, which
emitted such traces:
I00001828 0x5a135c81e803f092c74bac64a85513b657 pkt could not decrypt packet number
As the header protection could not be removed, the header keyupdate bit could also
not be read by packet analyzers such as pyshark used during the keyupdate tests.
No need to backport.
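
The arithmetic can be sketched as follows. It relies on the rule from
RFC 9001 section 5.4.2 that the packet number and the payload must together
span at least 4 bytes before the sample; the function and macro names are
hypothetical and do not come from the QUIC packet building code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimum combined size of the packet number field and the frames so that
 * the header protection sample can be taken (RFC 9001, 5.4.2).
 */
#define SKETCH_HP_MIN_LEN 4

/* <len> is the payload length and ALREADY includes the packet number field
 * length, so the latter must not be added a second time.
 */
static size_t sketch_hp_padding_len(size_t len)
{
    assert(len >= 1);  /* a packet number field is at least 1 byte long */
    return len < SKETCH_HP_MIN_LEN ? SKETCH_HP_MIN_LEN - len : 0;
}
```

For a 1-byte packet number followed by a 1-byte PING frame, <len> is 2 and
the result is 2 bytes of PADDING. Adding the packet number length again would
give 4 - 3 = 1, which is the too short frame described above.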
Similarly to the automatic SNI selection for regular SSL traffic, the SNI of
health-check HTTPS connections is now automatically set by default by using
the host header value. "check-sni-auto" and "no-check-sni-auto" server
settings were added to change this behavior.
Only implicit HTTPS health-checks can take advantage of this feature. In
this case, the host header value from the "option httpchk" directive is used
to extract the SNI. It is disabled if http-check rules are used, in which case
the SNI must still be explicitly specified via a "http-check connect" rule.
This patch should partially fix the issue #3081.
For HTTPS outgoing connections, the SNI is now automatically set using the
Host header value if no other value is already set (via the "sni" server
keyword). It is now the default behavior. It can be disabled with the
"no-sni-auto" server keyword, and the "sni-auto" server keyword may
be used to reset any previous "no-sni-auto" setting. This option can be
inherited from "default-server" settings. Finally, if no connection name is
set via the "pool-conn-name" setting, the selected value is used.
The automatic selection of the SNI is enabled by default for all outgoing
connections. But it is concretely used for HTTPS connections only. The
expression used is "req.hdr(host),host_only".
This patch should partially fix the issue #3081. It only covers the server
part. Another patch will add the feature for HTTP health-checks.
By default, for a given server, when no pool-conn-name is specified, the
configured sni is used. However, this must only be done when SSL is in use
for the server. Of course, it is uncommon to have a sni expression for a
non-ssl server. But this may happen.
In addition, SSL may be disabled via the CLI. In that case, the
pool-conn-name must be discarded if it was copied from the sni. And, we must
of course take care to set it if SSL is enabled.
Finally, when the attach-srv action is checked, we now check the
pool-conn-name expression.
This patch should be backported as far as 3.0. It relies on "MINOR: server:
Parse sni and pool-conn-name expressions in a dedicated function" which
should be backported too.
We may need additional struct members in shm_stats_file_object and
shm_stats_file_hdr, yet since these structs are exported they should
not change in size nor ordering, else it would require a version change
to break compatibility on purpose since the mapping would differ.
Here we reserve 64 additional bytes in shm_stats_file_object, and
128 bytes in shm_stats_file_hdr for future usage.
Document some byte holes and fix some potential alignment issues
between 32 and 64 bits architectures to ensure the shm_stats_file memory
mapping is consistent between operating systems.
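
A minimal sketch of the technique, using a made-up header: holes are turned
into explicit pad members, room is reserved for future members, and the
layout is checked at build time. The struct name, its members and the sizes
below are hypothetical and are not the real shm_stats_file_hdr layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct shm_hdr_sketch {
    uint8_t  major;
    uint8_t  minor;
    uint8_t  pad0[2];        /* documented hole: keeps <slots> 4-byte aligned */
    uint32_t slots;
    uint64_t clock_off;      /* sits at offset 8 on 32-bit and 64-bit ABIs */
    uint8_t  reserved[128];  /* room for future members, size stays the same */
};

/* Build-time checks: any accidental layout change breaks the build. */
static_assert(offsetof(struct shm_hdr_sketch, slots) == 4, "unexpected hole");
static_assert(offsetof(struct shm_hdr_sketch, clock_off) == 8, "unexpected hole");
static_assert(sizeof(struct shm_hdr_sketch) == 144, "size must not change");
```

A future member can later be carved out of <reserved> without moving any
existing field nor changing the total size.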
This is the same as THREAD_PAD() but it doesn't depend on haproxy being
compiled with thread support. It may be useful for memory (or files) that
may be shared between multiple processes.
This is the last patch of the shm stats file series, in this patch we
implement the logic to store and fetch shm stats objects and associate
them to existing shared counters on the current process.
Shm objects are stored in the same memory location as the shm stats file
header. In fact they are stored right after it. All objects (struct
shm_stats_file_object) have the same size (no matter their type), which
allows for easy object traversal without having to check the object's
type, and could permit the use of external tools to scan the SHM in the
future. Each object stores a guid (of GUID_MAX_LEN+1 size) and tgid
which allows to match corresponding shared counters indexes. Also,
as stated before, each object stores the list of users making use of
it. Objects are never released (the map can only grow), but when an object
becomes unused (no more users or active users are found in objects->users),
it is automatically recycled. Also, each object stores its
type which defines how the object generic data member should be handled.
Upon startup (or reload), haproxy first tries to scan existing shm to
find objects that could be associated to frontends, backends, listeners
or servers in the current config based on GUID. For associations that
couldn't be made, haproxy will automatically create missing objects in
the SHM during late startup. When haproxy matches with an existing object,
it means the counter from an older process is preserved in the new
process, so multiple processes temporarily share the same counter for as
long as required for older processes to eventually exit.
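
The lookup, sharing and recycling logic can be sketched as follows. The
object layout, the 127-byte GUID limit, the users bitmask and the function
name are all assumptions made for this illustration; they do not reflect the
real struct shm_stats_file_object nor the way users are tracked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GUID_MAX 127   /* arbitrary limit chosen for this sketch */

/* All objects have the same size, so they can be walked as an array. */
struct shm_obj_sketch {
    char     guid[SKETCH_GUID_MAX + 1];
    uint8_t  tgid;
    uint8_t  type;
    uint8_t  pad[6];
    uint64_t users;           /* one bit per process; 0 means recyclable */
};

/* Attach the process identified by <user_bit> to the object matching
 * <guid>/<tgid>. An existing object is shared, otherwise an unused one is
 * recycled. Returns NULL if no slot is available (the map has to grow).
 */
static struct shm_obj_sketch *shm_obj_attach(struct shm_obj_sketch *objs,
                                             size_t count, const char *guid,
                                             uint8_t tgid, uint64_t user_bit)
{
    struct shm_obj_sketch *spare = NULL;
    size_t i;

    assert(strlen(guid) <= SKETCH_GUID_MAX);
    for (i = 0; i < count; i++) {
        if (!objs[i].users) {
            if (!spare)
                spare = &objs[i];   /* remember the first unused slot */
            continue;
        }
        if (objs[i].tgid == tgid && strcmp(objs[i].guid, guid) == 0) {
            objs[i].users |= user_bit;  /* counter shared with older process */
            return &objs[i];
        }
    }
    if (spare) {
        memset(spare, 0, sizeof(*spare));
        strcpy(spare->guid, guid);
        spare->tgid = tgid;
        spare->users = user_bit;
    }
    return spare;
}
```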