heartbeat messages are sent to keep a connection active. However, a
heartbeat messages was sent periodically, even if some other messages were
sent during this period. It is not an issue but it is useless. heartbeat
messages should only be sent if nothing was sent to the peer since a
moment. So, in this patch, the heartbeat timer is rearmed each time a message
is sent.
On the receiver side, the reconnect timer was only rearmed when a heartbeat
message was received instead of rearming it for any messages. Again, it is
not an issue because the inactivity is managed with PEER_F_ALIVE flag. This
flag is removed when the reconnect timer timed out but it is reinserted when
something is received. But an periodic wakeup may be uselessly performed.
So, in this patch, the reconnect timer is rearmed each time a message is
received.
Instead of looking for new updates in each updates lists to wake a peer
applet up, we now only detect that some updates should have been inserted by
comparing the date of the last update inserted in the list and the last
update sent to the peer.
It is not 100% accurrate of course. Some extra wakeups may be observed. But
this should not lead to any spinning loop because the operation is performed
by the sync task. This task is woken up when a timeout is fired or when an
update was inserted. However, this saves several loops on the updates lists.
In this patch, the update tree is replaced by a mt-list. It is a huge patch
with several changes. Main ones are in the function sending updates.
By using a list instead of a tree, we loose the order between updates and
the ability to restart for a given update using its id (the key to index
updates in the tree). However, to use the tree, it had to be locked and it
was a cause of contention between threads, leading the watchdog to kill the
process in worst cases. Because the idea it to split the updates by buckets
to divide the contention on updated, the order between updates will be lost
anyway. So, the tree can be replaced by a list. By using a mt-list, we can
also remove the update lock.
To be able to use a list instead of a tree, each peer must save its position
in the list, to be able to process new entries only at each loop. These
marker are "special" sticky session of type STKSESS_UPDT_MARKER. Of course,
these marker are not in any stick-tables but only in updates lists. And only
the ownr of a marker can move it in the list. Each peer owns two markers for
each list (so two markers per shared table). The first one used a start
point for a loop, and the other one used as stop point. The way these marker
are moved in the list is not obvious, especially for the first one.
Updates sent during a full resync are now handled exactly a the same way
than other updates. Only the moment the stop marker is set is different.
Instead of using a boolean to know if an entry in the updates tree is local
or not, an enum is used. This change will be mandatory when updates tree
will be replaced by a list to be able to add markers owned by each peer.
So now a sticky sessin has no type (STKSESS_UPDT_NONE) if it is not in the
updates tree. STKSESS_UPDT_LOCAL is used for local entries and
STKSESS_UPDT_REMOTE for remote ones. STKSESS_UPDT_MARKER is not used for
now.
Now the updates are no longer tracked by stick-table and we are no longer
use their id to detect missed updates, there is no reason to have a matching
between the internal update id in the id used in updated messages.
So, now, for a given peer, id of the last update messages sent is saved in
each shared table and it is incremented when a new message is updated.
ACK messages received by a peer sending updates during a full resync are
ignored. So, on the other side, there is no reason to still send these ACK
messages. Let's skip them.
This patch is quite small but the change is really important. Thanks to the
previous patch, we can use PEER_F_SYNCHED flag to know if a peer is
synchronized or not. So instead of tracking last ack messages for each table
to be able to restart at a given point when the peer reconnects, we decided
to restart from the begining if a peer is not synchronized when a new
connection is established.
So, it is a huge change because, on reconnect, instead of pushing some
missed updates, all local updates are pushed again. Most of time, it is not
a problem because nowadays, connection are quite stable, especially because
a heartbeat message is sent to keep it active. The only drawback is when a
peer is restarted. In that case, we have no way to know it is synchronized
because he learned table contents from it old local peer.
This change is mandatory. First to replace the update tree by a mt-list and
remove the update lock. Then to split this list by buckets to reduce
contention.
Info about the last update message sent are now saved for each peer. The
shared-table and the update message id are saved. These information are used
when a ack message is received to know if it matches the last update message
sent. When this matches, we are sure the peer as received all updates sent
and is synchronized. This information is saved thanks to the flag
PEER_F_SYNCHED.
So, at any time, we know if a peer is synchronized or not.
Trace messages for peers were only protocol oriented and information
provided were quite light. With this patch, the traces were
improved. information about the peer, its applet and the section are
dumped. Several verbosities are now available and messages are dumped at
different levels depending on the context. It should easier to track issues
in the peers.
The SSL counters were not handled at all for QUIC connections. This patch
implement ssl_sock_update_counters() extracting the code from ssl_sock.c
and call this function where applicable both in TLS/TCP and QUIC parts.
Must be backported as far as 2.8.
This bug impacts only the backends.
When entering the closing state, a quic_closed_conn is used to replace the quic_conn.
In this state, the ->fd value was reset to -1 value calling qc_init_fd(). This value
is used by qc_may_use_saddr() which supposes it cannot be -1 for a backend, leading
->li to be dereferencd, which is legal only for a listener.
This bug impacts only the backend but with possible crash when qc_may_use_saddr()
is called: qc_test_fd() is false leading qc->li to be dereferenced. This is legal
only for a listener.
This patch prevents such fd value resettings for backends.
No need to backport because the QUIC backends support arrived with 3.3.
A quic_conn_closed struct is initialized to replace the quic_conn when the
connection enters the closing to reduce the connection memory footprint.
->max_udp_payload quic_conn_close was not initialized leading to possible
BUG_ON()s in qc_rcv_buf() when comparing the RX buf size to this payload.
->cntrs counters were alon not initialized with the only consequence
to generate wrong values for these counters.
Must be backported as far as 2.9.
Emeric reported that he can't build haproxy anymore since 9bc6a034
("BUG/MINOR: ssl: Free global_ssl structure contents during deinit").
src/ssl_sock.c:7020:40: error: comparison with string literal results in unspecified behavior [-Werror=address]
7020 | if (global_ssl.listen_default_ciphers != LISTEN_DEFAULT_CIPHERS)
| ^~
src/ssl_sock.c:7023:41: error: comparison with string literal results in unspecified behavior [-Werror=address]
7023 | if (global_ssl.connect_default_ciphers != CONNECT_DEFAULT_CIPHERS)
| ^~
src/ssl_sock.c: At top level:
Indeed the mentionned patch is checking the pointer in order to free
something freeable, but that can't work because these constant are
strings literal which can be passed from the compiler and not pointers.
Also the test is not useful, because these strings are strdup() in
__ssl_sock_init, so they can be free directly.
Must be backported in every stable branches with 9bc6a034.
Fix quic_tx unittest module by adding an explicit define for <mtu> const
member of quic_cc_path.
This should fix coverity report from github issue #3162.
This can be backported up to 3.2.
Ensure applet_putchk() return value is checked when outputing via the
CLI 'show quic' header line.
This is only to align with other usages of the same function, as trash
output buffer should always be large enough for it. As such, the command
is simply aborted if this is not the case.
This should fix coverity report from github issue #3139.
This could be backported up to 2.8.
In stksess_new(), if we failed to allocate memory for the new stksess,
don't forget to decrement the table entry count, as nobody else will
do it for us.
An artificially high count could lead to at least purging entries while
there is no need to.
This should be backported up to 2.8.
WIP decrement current on allocation failure
A subtle regression was introduced in 3.0 by commit faa8c3e02 ("MEDIUM:
lb-chash: Deterministic node hashes based on server address"). When keys
are calculated from the server's ID (which is the default), due to the
reorganisation of the code, the key ended up being hashed twice instead
of being multiplied by the scaling range.
While most users will never notice it, it is blocking some large cache
users from upgrading from 2.8 to 3.0 or 3.2 because the keys are
redistributed.
After a check with users on the mailing list [1] it was estimated that
keep the current situation is the worst choice because those who have
not yet upgraded will face the problem while by fixing it, those who
already have and for whom it happened smoothly will handle it just
right again.
As such this fix must be backported to 3.0 without waiting (in order
to preserve those who upgrade from two redistributions). Please note
that only configurations featuring "hash-type consistent" and not
having "hash-key" present with a value other than "id" are affected,
others are not (e.g. "hash-key addr" is unaffected).
[1] https://www.mail-archive.com/haproxy@formilux.org/msg46115.html
With the fix in commit 982805e6a3 ("BUG/MINOR: pools: Fix the dump of
pools info to deal with buffers limitations"), the max count is now
compared to the number of dumped pools instead of the configured
numbered, and keeping >= is no longer valid because maxcnt is set by
default to the same value when not set, so this means that since this
patch we're always displaying "limited to the first X entries" where X
is the number of dumped entries even in the absence of any limitation.
Let's just fix the comparison to only show this when the limit is lower.
This must be backported to 3.2 where the patch above already is.
The truncation of pools output that was adressed in commit 982805e6a3
("BUG/MINOR: pools: Fix the dump of pools info to deal with buffers
limitations") required to split the pools filling from dumping. However
there is a problem when a limit is passed that is lower than the number
of pools or if a pool name is specified or if pool caches are disabled,
because in this case the number of filled slots will be lower than the
initially allocated one, and empty entries will be visited either by the
sort functions when filling the entries if "byxxx" is specified, or by
the dump function after the last entry, but none of these functions was
expecting to be passed a NULL entry.
Let's just re-adjust nbpools to match the number of filled entries at
the end. Anyway the totals are calculated on the number of dumped
entries.
This must be backported to 3.2 since the fix above was backported there
as well.
The third parameter passed to b_quic_dec_int() is unitialized. This is not a bug.
But this disturbs coverity for an unknown reason as revealed by GH issue #3154.
This patch takes the opportunity to use NULL as passed value to avoid using such
an uneeded third parameter.
Should be backported to 3.2 where this unit test was introduced.
Better check that munmap() always works, otherwise it means we might
have miscalculated an address, and if it fails silently, it will eat
all the memory extremely quickly. Let's add a BUG_ON() on munmap's
return.
As reported by Christopher, in UAF mode memory release of aligned
objects as introduced in commit ef915e672a ("MEDIUM: pools: respect
pool alignment in allocations") does not work. The padding calculation
in the freeing code is no longer correct since it now depends on the
alignment, so munmap() fails on EINVAL. Fortunately we don't care much
about it since we know it's the low bits of the passed address, which
is much simpler to compute, since all mmaps are page-aligned.
There's no need to backport this, as this was introduced in 3.3.
The pcre2 matching requires an array of matches for grouping, that is
allocated when executing the rule by pre-processing it, and that is
immediately freed after use. This is quite inefficient and results in
annoying patterns in "show profiling" that attribute the allocations
to libpcre2 and the releases to haproxy.
A good suggestion from Dragan is to pre-allocate these per thread,
since the entry is not specific to a regex. In addition we're already
limited to MAX_MATCH matches so we don't even have the problem of
having to grow it while parsing nor processing.
The current patch adds a per-thread pair of init/deinit functions to
allocate a thread-local entry for that, and gets rid of the dynamic
allocations. It will result in cleaner memory management patterns and
slightly higher performance (+2.5%) when using pcre2.
'ctx' might be NULL when we exit 'ssl_sock_handshake', it can't be
dereferenced without check in the trace macro.
This was found by Coverity andraised in GitHub #3113.
This patch should be backported up to 3.2
The new "add/del ssl jwt <file>" commands allow to change the "jwt" flag
of an already loaded certificate. It allows to delete certificates used
for JWT validation, which was not yet possible.
The "show ssl jwt" command iterates over all the ckch_stores and dumps
the ones that have the option set.
Add information about the new "jwt_verify_cert" converter and update the
existing "jwt_converter" doc to remove mentions of certificates from it.
Add information about the new "jwt" certificate option.
A certificate that does not have the 'jwt' flag enabled cannot be used
for JWT validation. We now raise a specific return value so that such a
case can be identified.
This option can be used to enable the use of a given certificate for JWT
verification. It defaults to 'off' so certificates that are declared in
a crt-store and will be used for JWT verification must have a
"jwt on" option in the configuration.
This converter will be in charge of performing the same operation as the
'jwt_verify' one except that it takes a full-on pem certificate path
instead of a public key path as parameter.
The certificate path can be either provided directly as a string or via
a variable. This allows to use certificates that are not known during
init to perform token validation.
The jwt_verify converter will not take full-on certificates anymore
in favor of a new soon to come jwt_verify_cert. We might end up with a
new jwt_verify_hmac in the future as well which would allow to deprecate
the jwt_verify converter and remove the need for a specific internal
tree for public keys.
The logic to always look into the internal jwt tree by default and
resolve to locking the ckch tree as little as possible will also be
removed. This allows to get rid of the duplicated reference to
EVP_PKEYs, the one in the jwt tree entry and the one in the ckch_store.
The key_base field of the global_ssl structure is an strdup'ed field
(when set) which was never free'd during deinit.
This patch can be backported up to branch 3.0.
Some fields of the global_ssl structure are strings that are strdup'ed
but never freed. There is only one static global_ssl structure so not
much memory is used but we might as well free it during deinit.
This patch can be backported to all stable branches.
Conditions to detect the spinning loop for applets based on the new API are
not accurrate. We cannot continue to check the channel's buffers state to
know if an applet has made some progress. At least, we must also check the
applet's buffers.
After digging to find the right way to do, it was clear that the best is to
use something similar to what is performed for the streams, namely, checking
read and write events. And in fact, it is quite easy to do with the new
API. So let's do so.
This patch must be backported as far as 3.0.
The purpose of memory profiling precisely is to figure what function
allocates and what function frees for specific objects. It turns out
that a non-negligible number of release callbacks basically do nothing
but a free() or pool_free() call and return, which the compiler happily
turns into a jump, making the caller of that callback appear as the
real one. That's how we can see libcrypto release to pools such as
ssl-capture for example, which also makes the per-DSO calls appear
wrong:
10000 0 10720000 0| 0x448c8d ssl_async_fd_free+0x3b9d p_alloc(1072) [pool=ssl-capture]
50000 0 6800000 0| 0x4456b9 ssl_async_fd_free+0x5c9 p_alloc(136) [pool=ssl-keylogf]
10072 0 644608 0| 0x447f14 ssl_async_fd_free+0x2e24 p_alloc(64) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445987 ssl_async_fd_free+0x897 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x4459b8 ssl_async_fd_free+0x8c8 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x4459e9 ssl_async_fd_free+0x8f9 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445a1a ssl_async_fd_free+0x92a p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445a4b ssl_async_fd_free+0x95b p_free(-136) [pool=ssl-keylogf]
0 20072 0 11364608| 0x7f5f1397db62 libcrypto:CRYPTO_free_ex_data+0xf2/0x261 p_free(-566) [pool=ssl-keylogf] [locked=72 (0.3 %)]
Worse, as can be seen on the last line above, there can be a single pool
per call place (since we don't release to arbitrary pools), and the stats
are misleading by reporting the first used pool only when a same function
can call multiple release callbacks. This is why the free call totals
10k ssl-capture and 10072 ssl-keylogfile.
Let's just disable tail call optimization when using memory profiling.
The gains are only very marginal and complicate so much the debugging
that it's not worth it. Now the output is correct, and no longer claims
that libcrypto is the caller:
10000 0 10720000 0| 0x448c9f ssl_async_fd_free+0x3b9f p_alloc(1072) [pool=ssl-capture]
0 10000 0 10720000| 0x445af0 ssl_async_fd_free+0x9f0 p_free(-1072) [pool=ssl-capture]
50000 0 6800000 0| 0x4456c9 ssl_async_fd_free+0x5c9 p_alloc(136) [pool=ssl-keylogf]
10177 0 1221240 0| 0x45543d ssl_async_fd_handler+0xb51d p_alloc(120) [pool=ssl_sock_ct] [locked=165 (1.6 %)]
10061 0 643904 0| 0x447f1c ssl_async_fd_free+0x2e1c p_alloc(64) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445987 ssl_async_fd_free+0x887 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x4459b8 ssl_async_fd_free+0x8b8 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x4459e9 ssl_async_fd_free+0x8e9 p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445a1a ssl_async_fd_free+0x91a p_free(-136) [pool=ssl-keylogf]
0 10000 0 1360000| 0x445a4b ssl_async_fd_free+0x94b p_free(-136) [pool=ssl-keylogf]
0 10188 0 1222560| 0x44f518 ssl_async_fd_handler+0x55f8 p_free(-120) [pool=ssl_sock_ct] [locked=176 (1.7 %)]
0 10072 0 644608| 0x445aa6 ssl_async_fd_free+0x9a6 p_free(-64) [pool=ssl-keylogf] [locked=72 (0.7 %)]
An attempt was made to only instrument pool_free() to place a compiler
barrier, but that resulted in much larger code and wouldn't cover
functions ending with a simple "free()" call. "ha_free()" however is
already immune against tail call optimization since it has to write
the NULL when returning from free().
This should be backported to recent stable releases that are still
regularly being debugged.