On today's large systems, it's not always desirable to run on all
threads for light loads, so users usually enforce nbthread to a lower
value (e.g. 8). The problem is that this is a fixed value, so moving
such configs to smaller machines keeps enforcing it, which becomes
extremely unproductive due to having more threads than CPUs. This also
happens quite a bit in VMs, containers, and cloud instances of various
sizes.
This commit introduces the thread-hard-limit setting that only sets an
upper bound on the number of threads, without raising a lower value.
This means that using "thread-hard-limit 8" will make sure that no more
than 8 threads are used when that many are available, but the count will
remain at two when run on a dual-core machine.
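For example, a minimal global section using the setting described above
would look like this:

  global
      thread-hard-limit 8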
As diagnosed in GH issue #2569, there's currently an issue in LibreSSL's
CHACHA20 in-place implementation that makes haproxy discard incoming QUIC
packets encrypted with it. It's not very easy to observe the issue because:
- QUIC recommends that CHACHA20 be used in priority
- on x86 with AES-NI, LibreSSL prefers AES-GCM for performance
  reasons, so the problem is only observed there if a client
  explicitly forces TLS_CHACHA20_POLY1305_SHA256 only.
- discarded packets cause retransmits showing some apparent activity,
  and the handshake succeeds, so it's not easy to analyze from the
  client, which thinks that the server is slow to respond.
Thus in practice, on non-x86 machines running LibreSSL, requests made over
QUIC freeze for a long time, unless the client explicitly forces algos
excluding TLS_CHACHA20_POLY1305_SHA256. That's typically the case by
default on modern OpenBSD systems, and was reported in the issue above
for an arm64 machine running OpenBSD -current, and was also observed on a
mips64 one running OpenBSD 7.5.
There is no simple solution to this problem due to some of the
protocol's constraints, short of digging too low into the stack (and
risking breaking more). Here we're taking a pragmatic approach
consisting in making the connection fail hard when
TLS_CHACHA20_POLY1305_SHA256 is selected, regardless of the availability
of other ciphers. This means that every time a connection would have
hung, it will instead fail fast, allowing the client to retry over
TLS/TCP.
Theo Buehler recommends that we limit this protection to all LibreSSL
versions before 4.0, since that is where the fix will be implemented.
Older stable versions will just see TLS_CHACHA20_POLY1305_SHA256
disabled, which should be sufficient to make QUIC work there again as
well.
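To illustrate the shape of such a guard, here is a hedged sketch (the
function name and placement are illustrative assumptions, not the actual
patch; 0x1303 is the IANA value of TLS_CHACHA20_POLY1305_SHA256):

  #include <openssl/ssl.h>

  #if defined(LIBRESSL_VERSION_NUMBER) && LIBRESSL_VERSION_NUMBER < 0x4000000fL
  /* LibreSSL < 4.0: in-place CHACHA20 is broken for QUIC (GH #2569),
   * so the connection must fail hard if this suite was negotiated.
   */
  static int quic_tls_has_broken_chacha20(SSL *ssl)
  {
      const SSL_CIPHER *ciph = SSL_get_current_cipher(ssl);

      return ciph && (SSL_CIPHER_get_id(ciph) & 0xffff) == 0x1303;
  }
  #endif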
The following config is sufficient to reproduce the issue (on a non-x86
machine, both arm64 & mips64 were confirmed to reproduce it):
  global
      limited-quic

  frontend stats
      mode http
      #bind :8181
      #bind :8443 ssl crt rsa+dh2048.pem
      bind quic4@:8443 ssl crt rsa+dh2048.pem alpn h3
      timeout client 5s
      stats uri /
And the following commands will trigger the problem on affected LibreSSL
versions:
curl --tls13-ciphers TLS_CHACHA20_POLY1305_SHA256 -v --http3 -k https://127.0.0.1:8443/
curl -v --http3 -k https://127.0.0.1:8443/
while these ones must work:
curl --tls13-ciphers TLS_AES_128_GCM_SHA256 -v --http3 -k https://127.0.0.1:8443/
curl --tls13-ciphers TLS_AES_256_GCM_SHA384 -v --http3 -k https://127.0.0.1:8443/
Normally all of them will work with LibreSSL 4, and only the first one
should fail with stable LibreSSL versions higher than 3.9.2. An haproxy
version without this workaround will show an unresponsive command after
the GET is sent, while a version with the workaround will close the
connection on error. On a version with this workaround, if the TCP
listeners are uncommented, curl will automatically fall back to TCP and
attempt the request again over HTTP/2. Finally, on OpenSSL 1.1.1 in
compat mode (hence the limited-quic option above) all of them must work.
Many thanks to github user @lgv5 for the detailed report, tests, and
for spotting the issue, and to @botovq (Theo Buehler) for the quick
analysis, patch and help on this workaround.
This needs to be backported to versions 2.6 and above.
Update the API for PROXY protocol header encoding. Previously, it
required the stream parameter to be set. Change make_proxy_line() and
associated functions to take an extra session parameter. This is useful
in contexts where no stream is instantiated; for example, this is the
case for rhttp preconnect.
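A hedged sketch of the adjusted prototype (the pre-existing parameters
reflect the current API; the position of the new one is an assumption):

  int make_proxy_line(char *buf, int buf_len, struct server *srv,
                      struct connection *remote, struct stream *strm,
                      struct session *sess);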
This change makes it possible to extend PROXY v2 TLV encoding. Replace
build_logline(), which requires a stream instance, with a direct call to
sess_build_logline().
Note that the stream parameter is kept, as it is necessary for unique ID
encoding.
This brings no functional change for standard connections. However, it
is necessary to support TLV encoding on rhttp preconnect.
Modify rhttp preconnect to instantiate a new session for each connection
attempt. The connection is thus linked to a session directly upon its
instantiation, contrary to before, where no session existed until
listener_accept().
This patch will allow rhttp usage to be extended. Most notably, it will
be useful to use various sample fetches on the server line and to extend
logging capabilities.
Changes are minimal, yet the consequences are not trivial, as for the
first time a FE connection session is instantiated before
listener_accept(). This requires an extra explicit check in
session_accept_fd() to not overwrite an existing session. Also, the flag
SESS_FL_RELEASE_LI is not set immediately, as listener counters must not
be decremented if the connection and its session are freed before
reversal is completed, or else listener counters would be invalid.
conn_session_free() is used as connection destroy callback to ensure the
session will be freed automatically on connection release.
When a session is allocated for a FE connection, session_free() is
responsible for calling listener_release() to decrement listener
connection counters and resume listening.
Until now, the <listener> member of the session was tested inside
session_free() before invoking listener_release(). To highlight the
relation between sessions and listeners more explicitly, introduce a new
flag SESS_FL_RELEASE_LI. Only sessions with this flag set will invoke
listener_release() on their cleanup. The flag is set inside
session_accept_fd() on success.
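A minimal sketch of the resulting cleanup logic (surrounding details
omitted and assumed):

  void session_free(struct session *sess)
  {
      /* ... other cleanup ... */
      if (sess->flags & SESS_FL_RELEASE_LI)
          /* decrement listener counters and resume listening */
          listener_release(sess->listener);
      /* ... */
  }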
This patch has no functional change. However, it will be useful to
implement session creation for rHTTP preconnect.
Ensure "disable frontend" on a reverse HTTP listener is forbidden by
returing -1 on suspend callback. Suspending such a listener has unknown
effect and so is not properly implemented for now.
This should be backported up to 2.9.
This fixes an fd leak introduced in commit d3fc982cd7
("MEDIUM: proto: make common fd checks in sock_create_server_socket").
Initially sock_create_server_socket() was designed to return only the
created socket FD or -1. Its callers from upper protocol layers had to
test the returned errno and then apply different configuration-related
checks to the obtained positive sock_fd. A lot of this code was
duplicated among protocol implementations.
The new refactored version of sock_create_server_socket() gathers all
the duplicated checks in one place, but in order to be compliant with
the upper protocol layers, it needs a 3rd parameter: 'stream_err', in
which it sets the Stream Error Flag for upper levels, if the obtained
sock_fd has passed all additional checks.
No backport needed since this was introduced in 3.0-dev10.
In commit 55e9e9591 ("MEDIUM: ssl: temporarily load files by detecting
their presence in crt-store"), ssl_sock_load_pem_into_ckch() was
replaced by ssl_sock_load_files_into_ckch() in the crt-store loading.
But the side effect was that we would always try to autodetect, and this
is not what we want. This patch reverses this, and adds specific code in
the crt-list loading, so we can autodetect in crt-list like it was done
before, but still try to load files when a crt-store filename keyword is
specified.
Example:
These crt-list lines won't autodetect files:
foobar.crt [key foobar.key issuer foobar.issuer ocsp-update on] *.foo.bar
foobar.crt [key foobar.key] *.foo.bar
These crt-list lines will autodetect files:
foobar.pem [ocsp-update on] *.foo.bar
foobar.pem
Following David Carlier's work in 98d22f21 ("MEDIUM: shctx: Naming shared
memory context"), let's provide a helper function to set a name hint on
a virtual memory area (i.e. an anonymous map created using mmap(), or a
memory area returned by malloc()).
Naming will only occur if available, and naming errors will be ignored.
The function takes mandatory <type> and <name> parameters to build the
map name as follows: "type:name". When looking at /proc/<pid>/maps, VMAs
named using this helper function will show up this way (provided that
the kernel has prctl support for PR_SET_VMA_ANON_NAME):
Example, using mmap + MAP_SHARED|MAP_ANONYMOUS:
  7364c4fff000-736508000000 rw-s 00000000 00:01 3540 [anon_shmem:type:name]
Another example, using mmap + MAP_PRIVATE|MAP_ANONYMOUS or using
glibc/malloc() above MMAP_THRESHOLD:
  7364c4fff000-736508000000 rw-p 00000000 00:00 0 [anon:type:name]
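A minimal sketch of such a helper, assuming a vma_set_name() name and
ignoring failures as described (the actual implementation may differ):

  #include <stdio.h>
  #include <sys/prctl.h>

  static void vma_set_name(void *addr, size_t size,
                           const char *type, const char *name)
  {
  #if defined(PR_SET_VMA) && defined(PR_SET_VMA_ANON_NAME)
      char fullname[256];

      snprintf(fullname, sizeof(fullname), "%s:%s", type, name);
      /* best effort: kernels without support simply return an error */
      prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
            (unsigned long)addr, size, fullname);
  #endif
  }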
Since 40d1c84bf0 ("BUG/MAJOR: ring: free the ring storage not the ring
itself when using maps"), the munmap() call for startup_logs's ring and
file-backed rings fails to work (EINVAL) and causes memory leaks during
process cleanup.
munmap() fails because it is called with the ring's usable area pointer,
which is an offset from the underlying original memory block allocated
using mmap(). Indeed, the ring_area() helper function was misused
because it didn't explicitly mention that the returned address
corresponds to the usable storage's area, not the allocated one.
To fix the issue, we add an explicit ring_allocated_area() helper to
return the allocated area for the ring, just like we already have
ring_allocated_size() for the allocated size, and we properly use both
the allocated size and allocated area to manipulate them using munmap()
and msync().
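In short, the fix amounts to the following at the affected call sites
(illustrative; ring_size() stands for the usable-size counterpart):

  /* before (buggy): ring_area() returns the usable area, an offset
   * into the block that was actually mmap()'d, so this fails (EINVAL):
   */
  munmap(ring_area(ring), ring_size(ring));

  /* after: unmap exactly the block that was allocated */
  munmap(ring_allocated_area(ring), ring_allocated_size(ring));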
No backport needed.
crt-store is meant to be stricter than your common crt argument on a
bind line, and is supposed to be a declarative format.
However, since 'ocsp-update' was migrated from ssl_conf to ckch_conf,
the .issuer file is no longer autodetected when adding an ocsp-update
keyword in a crt-list file, which breaks backward compatibility.
This patch is a quick fix that will disappear once we are able to be
strict on a crt-store and autodetect on a crt-list.
The ckch_conf_cmp() function allows comparing multiple ckch_conf
structures in order to check that multiple usages of the same crt in the
configuration use the same ckch_conf definition.
A crt-list allows using "crt-store" keywords that define a ckch_store,
which can lead to inconsistencies when a crt is called multiple times
with different parameters.
This function compares and dumps a list of differences in the err
variable to be output as an error.
The variant ckch_conf_cmp_empty() compares the ckch_conf structure to an
empty one, which is useful for bind lines, which are not able to have
crt-store keywords.
These functions are used when a crt-store is already initialized and we
need to verify whether the parameters are compatible.
ckch_conf_cmp() handles multiple cases:
- When the previous ckch_conf was declared with CKCH_CONF_SET_EMPTY, we
  can't define any new keyword in the next initialization
- When the previous ckch_conf was declared with keywords in a crt-list
  (CKCH_CONF_SET_CRTLIST), the next initialization must have the exact
  same keywords.
- When the previous ckch_conf was declared in a "crt-store"
  (CKCH_CONF_SET_CRTSTORE), the next initialization may use either no
  keyword at all or the exact same keywords.
This patch adds support for crt-store keywords from the crt-list on the
CLI:
- keywords from crt-store can be used over the CLI when inserting a
  certificate in a crt-list
- keywords from crt-store are dumped when showing a crt-list content
  over the CLI
The ckch_conf_kws.func function pointer needed a new "cli" parameter, in
order to differentiate loading that comes from the CLI from loading done
at startup, as they don't behave the same. For example it must not try
to load a file from the filesystem when loading a crt-list line from the
CLI.
dump_crtlist_sslconf() was renamed to dump_crtlist_conf() and takes a
new ckch_conf parameter in order to dump the relevant crt-store
keywords.
This option allows disabling ocsp-update completely.
To achieve this, the ocsp-update.mode global keyword no longer relies on
SSL_SOCK_OCSP_UPDATE_OFF during parsing to call
ssl_create_ocsp_update_task().
Instead, we will inherit the SSL_SOCK_OCSP_UPDATE_* value from
ocsp-update.mode for each certificate which does not specify its own
mode.
To disable ocsp completely without editing all crt entries,
ocsp-update.disable is used instead of "ocsp-update.mode", which is now
only used as the default value for crt.
Use the ocsp-update keyword in the crt-store section. This is no longer
handled as an exception in the crtlist code.
This patch introduces the "ocsp_update_mode" variable in the ckch_conf
structure.
The SSL_SOCK_OCSP_UPDATE_* enum was changed to defines to match the
ckch_conf on/off parser, so that 'off' can map to -1.
The callback used by ckch_store_load_files() only works with
PARSE_TYPE_STR.
This patch allows using a callback which will use an integer type for
PARSE_TYPE_INT and PARSE_TYPE_ONOFF.
This requires changing the type of the callback argument to void *, to
pass either a char * or an int depending on the parsing type.
The ssl_sock_load_* functions were encapsulated in ckch_conf_load_*
functions just to match the type.
This will allow handling crt-store keywords that are ONOFF or INT
types.
Remove the "ocsp-update" keyword handling from the crt-list.
The code was made as an exception everywhere so we could activate the
ocsp-update for an individual certificate.
The feature will still exists but will be parsed as a "crt-store"
keyword which will still be usable in a "crt-list". This will appear in
future commits.
This commit also disable the reg-tests for now.
This patch allows the usage of "crt-store" keywords from a "crt-list".
The crtstore_parse_load() function was split into 2 functions, so the
keyword parsing is done in ckch_conf_parse().
With this patch, crts are loaded with ckch_store_new_load_files_conf()
or ckch_store_new_load_files_path() depending on whether or not there is
a "crt-store" keyword.
More checks need to be done on "crt" bind keywords to ensure that
keywords are compatible.
This patch does not introduce the feature on the CLI.
ckch_store_new_load_files_conf() is the equivalent of
ckch_store_new_load_files_path(), but instead of trying to find the
files using a base filename, it will load them from a list of files.
This mask value is unused, so we can safely remove it. That's fortunate,
because its value was wrong. But there is no bug here, even in stable
versions, because it is not used in any version.
There was a flag to skip the response payload on output, if any, by
stating it is bodyless. It is used for responses to HEAD requests or for
204/304 responses. This allows rewrites during analysis. For instance a
HEAD request can be rewritten into a GET request for any reason (e.g., a
server not supporting HEAD requests). In this case, the server will send
a response with a payload. On the frontend side, the payload will be
skipped and a valid response (without payload) will be sent to the
client.
With this patch we introduce the corresponding flag for the request. It
will be used to skip the request payload. In addition, when the payload
must be skipped for a request or a response, the zero-copy data
forwarding is now disabled.
After every release we say that MIN/MAX should be changed to expressions
that only evaluate each operand once, and before every version we forget
to change them and just recheck that the code doesn't misuse them. Let's
fix them now.
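The classic single-evaluation form relies on GCC statement expressions,
along these lines (a sketch, not necessarily the exact macros that were
committed):

  #undef  MIN
  #define MIN(a, b) ({ typeof(a) _a = (a); typeof(a) _b = (b); _a < _b ? _a : _b; })

  #undef  MAX
  #define MAX(a, b) ({ typeof(a) _a = (a); typeof(a) _b = (b); _a > _b ? _a : _b; })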
Aurélien reported that clang's build was broken by the recent fix
845fb846c7 ("BUG/MEDIUM: stick-tables: properly mark stktable_data
as packed"), because it now wants to use a helper for some atomic
ops (to increment std_t_uint). While it makes no sense to do
something that slow on modern architectures like x86 and arm64, which
are fine with unaligned accesses, we can actually simply mark the
struct as aligned to its smallest element, which is 32-bit (but still
packed). With this, it was verified that it is enough for clang to
see that its 32-bit operations will always be aligned, while making
64-bit operations safe on 64-bit platforms that do not support unaligned
accesses.
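The resulting attribute combination looks like this (illustrative
members, not the real layout):

  #include <stdint.h>

  /* packed so that 64-bit members may sit on a 4-byte boundary, but
   * aligned(4) so the compiler knows 32-bit members are always aligned */
  union stktable_data_example {
      uint32_t std_t_uint;
      uint64_t std_t_ull;
  } __attribute__((packed, aligned(4)));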
This should be backported wherever the patch above is backported.
Implement basic support for glitches on the QUIC multiplexer. This is
mostly identical to glitches for HTTP/2.
A new configuration option named tune.quic.frontend.glitches-threshold
is defined to limit the number of glitches on a connection before
closing it.
The glitches counter is incremented via qcc_report_glitch(). A new
qcc_app_ops callback <report_susp> is defined. When the threshold is
reached, it allows setting an application error code to close the
connection. For HTTP/3, the value H3_EXCESSIVE_LOAD is returned. If not
defined, the default code INTERNAL_ERROR is used.
For the moment, no glitches are reported for QUIC or HTTP/3 usage. This
will be added in future patches as needed.
Rename the enum values used for HTTP/3 and QPACK RFC defined codes. The
first ones use a prefix H3_ERR_* which serves as an identifier between
them. Also separate QPACK values into a new dedicated enum qpack_err.
This is deemed cleaner.
There are two distinct enums both related to QPACK error management. The
first one is dedicated to RFC defined codes. The other one is a set of
internal values returned by qpack_decode_fs(). Issues have been
discovered recently due to the confusion between them.
Rename internal values with the prefix QPACK_RET_*. The older name
QPACK_ERR_* will be used in a future commit for the first enum.
In order to forcefully unregister a buffer waiter during an inter-thread
takeover under isolation, we'll need the function to work not on th_ctx
but on the target thread's ctx instead. Let's implement this by passing
the target thread as an argument. Now b_dequeue() simply calls this one
with tid. That's OK since it's not on that critical a path, especially
since the list has been checked for existence before performing the
call.
The stktable_data union is made of types of varying sizes, and depending
on which types are stored in a table, some offsets might not necessarily
be aligned. This results in a bus error for certain regtests (e.g.
lb-services) on MIPS64. This bug may impact MIPS64, SPARC64, armv7 when
accessing a 64-bit counter (e.g. bytes) and depending on how the compiler
emitted the operation, and cause a trap that's emulated by the OS on RISCV
(heavy cost). x86_64 and armv8 are not affected at all.
Let's properly mark the struct with __attribute__((packed)) so that the
compiler emits the suitable unaligned-compatible instructions when
accessing the fields.
This should be backported to all versions where it applies.
A test on MIPS64 revealed that the following reg tests would all
fail at the same place in htx_replace_stline() when updating
parts of the request line:
reg-tests/cache/if-modified-since.vtc
reg-tests/http-rules/h1or2_to_h1c.vtc
reg-tests/http-rules/http_after_response.vtc
reg-tests/http-rules/normalize_uri.vtc
reg-tests/http-rules/path_and_pathq.vtc
While the status line is normally aligned since it's the first block
of the HTX, it may become unaligned once replaced. The problem is, it
is a structure which contains some u16 and u32, and dereferencing them
on machines not natively supporting unaligned accesses makes them crash
or handle crap. Typically, MIPS/MIPS64/SPARC will crash, ARMv5 will
either crash or (more likely) return swapped values and do crap, and
RISCV will trap and turn to slow emulation.
We can assign the htx_sl struct the packed attribute, but then this
also causes the ints to fill the 2-byte gap before them, always causing
unaligned accesses for this part on such machines. The patch does a bit
better, by explicitly filling this two-byte hole, and packing the
struct.
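Schematically, the idea is the following (placeholder field names, not
the real htx_sl layout):

  #include <stdint.h>

  struct htx_sl_example {
      uint16_t some_u16;   /* 16-bit field followed by a 2-byte hole */
      uint8_t  pad[2];     /* hole filled explicitly, survives packing */
      uint32_t some_u32;   /* keeps its 4-byte offset once packed */
  } __attribute__((packed));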
This should be backported to all versions.
qpack_decode_fs() is used to decode a QPACK field section during HTTP/3
headers parsing. Its return value is incoherent as it returns either
QPACK_DECOMPRESSION_FAILED, defined in RFC 9204, or any other internal
value defined in qpack-dec.h. On failure, such a return code is reused
by the HTTP/3 layer to be reported via a CONNECTION_CLOSE frame. This is
incorrect if an internal error value was reported, as it is not defined
by any specification.
Fix the return values of qpack_decode_fs() in two ways. First, fix
invalid usages of QPACK_DECOMPRESSION_FAILED when the decoded content is
too large, for which the correct internal error is QPACK_ERR_TOO_LARGE.
Second, adjust the qpack_decode_fs() API to only return internal code
values. A new internal enum value QPACK_ERR_DECOMP is defined to replace
QPACK_DECOMPRESSION_FAILED. The caller is responsible for converting it
to a suitable error value. For other internal values, H3_INTERNAL_ERROR
is used. This is done through a set of convert functions.
This should be backported up to 2.6. Note that trailers are not
supported in 2.6, so the chunk related to h3_trailers_to_htx() can be
safely skipped.
Now, if a pool_alloc() fails for a buffer and if conditions are met
based on the queue number, we'll try to get an emergency buffer.
Thanks to this the situation is way more stable now. With only 4 reserve
buffers and 1 buffer it's possible to reliably serve 500 concurrent end-
to-end H1 connections and consult stats in parallel in loops showing the
growing number of buf_wait events in "show activity" without facing an
instant stall like in the past. Lower values still cause quick stalls
though.
It's also apparent that some subsystems do not seem to detach from the
buffer_wait lists when leaving. For example several crashes in the H1
part showed list elements still present after a free(), so maybe some
operations performed inside h1_release() after the b_dequeue() call
can sometimes result in a new allocation. Same for streams, where
the dequeue is done relatively early.
The buffer reserve set by tune.buffers.reserve has long been unused, and
in order to deal gracefully with failed memory allocations we'll need to
resort to a few emergency buffers that are pre-allocated per thread.
These buffers are only for emergency use, so every time their count is
below the configured number a b_free() will refill them. For this reason
their count can remain pretty low. We changed the default number from 2
to 4 per thread, and the minimum value is now zero (e.g. for low-memory
systems). The tune.buffers.limit setting has always been a problem when
trying to deal with the reserve but now we could simplify it by simply
pushing the limit (if set) to match the reserve. That was already done in
the past with a static value, but now with threads it was a bit trickier,
which is why the per-thread allocators increment the limit on the fly
before allocating their own buffers. This also means that the configured
limit is saner and now corresponds to the regular buffers that can be
allocated on top of emergency buffers.
At the moment these emergency buffers are not used upon allocation
failure. The only reason is to ease bisecting later if needed, since
this commit only has to deal with resource management.
Now when trying to allocate a channel buffer, we can check whether we've
been notified of availability via the producer stream connector
callback, in which case we should not consult the queue, or whether
we're doing a first allocation, in which case we check the queue.
When the buffer allocation callback is notified of a buffer availability,
it will now set a MAYALLOC flag in addition to clearing the ALLOC one, for
each of the 3 levels where we may fail an allocation. The flag will be
cleared upon a successful allocation. This will soon be used to decide to
re-allocate without waiting again in the queue. For now it has no effect.
There's just a trick, we need to clear the various *_ALLOC flags before
testing h1_recv_allowed() otherwise it will return false!
When appctx_buf_available() is called, it now sets APPCTX_FL_IN_MAYALLOC
or APPCTX_FL_OUT_MAYALLOC depending on the reportedly permitted buffer
allocation, and these flags are cleared when the said buffers are
allocated. For now they're not used for anything else.
When the buffer allocation callback is notified of a buffer availability,
it will now set a MAYALLOC flag on the stream so that the stream knows it
is allowed to bypass the queue checks. For now this is not used.
We used to have two states for the channel's input buffer used by the
SC, NEED_BUFF or not, flipped by sc_need_buff() and sc_have_buff(). We
want to have a 3rd state, indicating that we've just got a desired
buffer. Let's add a HAVE_BUFF flag that is set by sc_have_buff() and
cleared by sc_used_buff(). This way, by looking at HAVE_BUFF, we know
that we're coming back from the allocation callback and that the offered
buffer has not yet been used.
Now b_alloc() will check the queues at the same and higher criticality
levels before allocating a buffer, and will refrain from allocating one
if these are not empty. The purpose is to put some priorities in the
allocation order so that most critical allocators are offered a chance
to complete.
However in order to permit a freshly dequeued task to allocate again while
siblings are still in the queue, there is a special DB_F_NOQUEUE flag to
pass to b_alloc() that will take care of this special situation.
When we want to allocate an in buffer, it's in order to pass data to
the applet, which will consume it, so it must be seen as the same as
a send() from the higher level, i.e. MUX_TX. And for the outbuf, it's
a stream endpoint returning data, i.e. DB_SE_RX.
Instead of having each caller of appctx_get_buf() think about setting
the blocking flag, better have the function do it, since it's already
handling the queue anyway. This way we're sure that both are consistent.
Now that we need to keep the bitmap in sync with the list heads, we
don't want tasks to leave by just doing a LIST_DEL_INIT() without
updating the map. Let's provide a b_dequeue() function for that purpose.
The function detects when it's going to remove the last element and
figures out the queue number based on the pointer, since it then points
to the root. It's not used yet.
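A sketch of the logic being described (field and map names are
assumptions, <bw> is assumed to be currently queued, and LIST_ATMOST1()
is the helper described further down):

  static inline void b_dequeue(struct buffer_wait *bw)
  {
      uint q;

      /* last queued element: its <n> pointer necessarily points to the
       * queue's root, whose array index gives the bit to clear */
      if (LIST_ATMOST1(&bw->list)) {
          q = bw->list.n - &th_ctx->buffer_wq[0];
          th_ctx->bufq_map &= ~(1 << q);
      }
      LIST_DEL_INIT(&bw->list);
  }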
The introduction of buffer_wq[] in thread_ctx pushed a few fields around
and the cache line alignment is less satisfying. And more importantly, even
before this, all the lists in the local parts were 8-aligned, with the first
one split across two cache lines.
We can do better:
- sched_profile_entry is not atomic at all, the data it points to is
  atomic so it doesn't need to be in the atomic-only region, and it can
  fill the 8-byte hole before the lists
- the align(2*void) that was only before tasklets[] moves before all
lists (and it's a nop for now)
This now makes the lists and buffer_wq[] start on a cache line boundary,
leaves 48 bytes after the lists before the atomic-only cache line, and
leaves a full cache line at the end for 128-alignment. This way we still
have plenty of room in both parts with better aligned fields.
Let's turn the buffer_wq into an array of 4 list heads. These are chosen
by criticality. The DB_CRIT_TO_QUEUE() macro maps each criticality level
into one of these 4 queues. The goal here clearly is to make it possible
to wake up the most critical queues in priority in order to let some tasks
finish their job and release buffers that others can use.
In order to avoid having to look up all queues, a bitmap indicates which
queues are in use, which also allows avoiding looping in the most common
case where queues are empty.
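For illustration, enqueuing with this scheme presumably boils down to
something like this (the bitmap field name and exact types are
assumptions):

  static inline void b_queue(enum dynbuf_crit crit, struct buffer_wait *bw)
  {
      uint q = DB_CRIT_TO_QUEUE(crit);

      LIST_APPEND(&th_ctx->buffer_wq[q], &bw->list);
      th_ctx->bufq_map |= 1 << q;   /* mark this queue as non-empty */
  }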
The code places that were used to manipulate the buffer_wq manually
now just call b_queue() or b_requeue(). This will simplify the multiple
list management later.
When failing an allocation we always do the same dance: add the
buffer_wait struct to a list if it's not already in one, and return.
Let's add dedicated functions to centralize this; this will be useful to
implement a bit more complex logic.
For now they're not used.
The goal is to indicate how critical the allocation is, between the
least one (growing an existing buffer ring) and the topmost one (boot
time allocation for the life of the process).
The 3 TCP-based muxes (h1, h2, fcgi) use a common allocation function
to try to allocate, otherwise subscribe. There's currently no
distinction of direction nor of which part tries to allocate, and this
should be revisited to improve the situation, particularly when we
consider that mux-h2 can reduce its Tx allocations if needed.
For now, 4 main levels are planned, to translate how the data travels
inside haproxy from a producer to a consumer:
- MUX_RX: buffer used to receive data from the OS
- SE_RX: buffer used to place a transformation of the RX data for
a mux, or to produce a response for an applet
- CHANNEL: the channel buffer for sync recv
- MUX_TX: buffer used to transfer data from the channel to the outside,
generally a mux but there can be a few specificities (e.g.
http client's response buffer passed to the application,
which also gets a transformation of the channel data).
The other levels are a bit different in that the first two don't
strictly need to allocate, and the last one is permanent (used by
compression).
There are 2 new ctl commands that may be used to retrieve the current
number of streams opened for a connection and its limit (the maximum
number of streams a mux connection supports).
For the PT and H1 muxes, the limit is always 1 and the current number of
streams is 0 for idle connections, otherwise 1 is returned.
For the H2 and the FCGI muxes, the info is already available in the mux
connection.
For the QUIC mux, the limit is also directly available. It is the
maximum initial sub-ID of bidirectional streams allowed for the
connection. For the current number of streams, it is the number of SCs
attached to the connection plus the number of not-yet-attached streams
present in the "opening_list" list.
A reason is now passed as a parameter to mux shutdowns to convey
additional info about the abort, if any. No info means no abort or only
a generic one.
For now, the reason is composed of 2 32-bit integers. The first one
represents the abort code and the other one carries info about the code
(for instance the source). The code should be interpreted according to
the associated info.
One info is the source, encoded on 5 bits. Other bits are reserved for
now. For now, the muxes are the only supported source. But we can
imagine extending it to applets, streams, health-checks...
The current design is quite simple and will most probably evolve. But
the idea is to let the opposite side forward some errors and let a mux
know why its stream was aborted. At first glance, an abort reason must
only be evaluated if the SE_SHW_SILENT flag is set.
The main goal in the short term is to forward some H2 RST_STREAM codes,
because this is mandatory for gRPC applications, mainly to forward gRPC
cancellations from an H2 client to an H2 server. But we can imagine
altering this reason at the applicative level to enrich it. It would
also be used to report more accurate errors in logs.
Instead of chaining 2 switch-cases and performing encoding checks for
all nodes, let's actually split the logic in 2: first handle simple node
types (text/separator), and then handle dynamic node types (tag, expr).
Encoding options are only evaluated for dynamic node types.
Also, last_isspace is always set to 0 after the next_fmt label, since
the next_fmt label is only used for dynamic nodes, thus !=
LOG_FMT_SEPARATOR.
Since the LF_NODE_WITH_OPT() macro (which was introduced recently) is
now unused, let's get rid of it.
No functional change should be expected.
(Use diff -w to check patch changes since reindentation makes the patch
look heavy, but in fact it remains fairly small)
Split the code related to looping on the proxies list in
cli_parse_clear_counters() into a new dedicated function. This function
is placed in the new module stats-proxy.
Create a new module stats-proxy. Move the stats functions related to
looping on the proxies list into it. This allows reducing the stats
source file, dividing its size by half.
Convert FN_AGE entries in stat_cols_px[] to generic columns. These
values will be automatically used for dump/preload of a stats-file.
Remove the srv_lastsession() / be_lastsession() functions which are now
useless, as last_sess is calculated via me_generate_field().
last_change was a member present in both the proxy and server structs.
It is used as an age statistic to report the last update of the object.
Move last_change into fe_counters/be_counters. This is necessary to be
able to manipulate it through a generic stat column and report it into
the stats-file.
Note that there is a change for the proxy structure, with now 2
different last_change values, on the frontend and backend sides. Special
care was taken to ensure that the value is initialized only on the proxy
side. The other value is set to 0 unless a listen proxy is instantiated.
For the moment, only the backend counter is reported in stats. However,
with now two distinct values, stats could be extended to report it on
both sides.
Implement support for the FN_RATE stat column in the stats-file.
For the output part, only a minimal change is required. Reuse the
function read_freq_ctr() to print the same value in both the stats
output and the stats-file dump.
For counter preloading, define a new utility function
preload_freq_ctr(). This can be used to initialize a freq-ctr type by
preloading the previous period value. Reuse this function in load_ctr()
during stats-file parsing.
At the moment, no rate column is defined as generic. Thus, this commit
does not bring any functional change. This will change as soon as
FN_RATE columns are converted to generic columns.
Move the freq-ctr counters defined in the proxy and server structures
into their dedicated fe_counters/be_counters structs.
Functionally, no change here. This commit will allow converting rate
stats columns to generic ones, which is mandatory to manipulate them in
the stats-file.
Currently, only FN_COUNTER stats are dumped and preloaded via a
stats-file. Thus in several places we relied on the assumption that only
FN_COUNTER are valid in the stats-file context.
New stats types will soon be implemented, as they are also eligible for
statistics reloading on process startup. Thus, prepare the stats-file
functions to remove any FN_COUNTER restriction.
As one of these changes, generate_stat_tree() now uses
stcol_is_generic() for stats name tree indexing before stats-file
parsing.
Also related to stats-file parsing, the individual counter preloading
step has been extracted from line parsing into a dedicated new function
load_ctr(). This will allow extending it to support multiple mechanisms
of counter preloading depending on the stats type.
If the 'namespace' keyword is used in the backend server settings and/or
in the bind string, it means that the haproxy process will call setns()
to change its default namespace to the configured one and will then
create a socket in this new namespace. The setns() syscall requires the
CAP_SYS_ADMIN capability in the process Effective set (see man 2 setns).
Otherwise, the process must be run as root.
To avoid running haproxy as root, let's add the cap_sys_admin capability
in the same way as we already added the support for some other network
capabilities.
As CAP_SYS_ADMIN belongs to the CAP_SYS_* capabilities type, let's add a
separate flag LSTCHK_SYSADM for it. This flag is set if the 'namespace'
keyword was found during configuration parsing. The flag may be unset
only in prepare_caps_for_setuid() or in
prepare_caps_from_permitted_set(), which inspect the process EUID/RUID
and its Effective and Permitted capability sets.
If the system doesn't support Linux capabilities or 'cap_sys_admin' was
not set via 'setcap', but the 'namespace' keyword is present in the
configuration, we keep the previous strict behaviour. A process that has
changed its uid to a non-privileged user will terminate with an alert.
This alert invites the user to recheck the configuration.
In the case when haproxy starts and runs under a non-root user and
'cap_sys_admin' is not set, but the 'namespace' keyword is present, this
patch does not change the previous behaviour either. We'll still let the
user try the configuration, but we inform via a warning that unexpected
things, like socket creation errors, may occur.
quic_connect_server(), tcp_connect_server() and uxst_connect_server()
duplicate the same code to check the different ERRNOs that socket() and
setns() may return. They also duplicate some runtime condition checks
applied to the obtained server socket fd.
So, in order to remove these duplications and to improve code
readability, let's encapsulate the socket() and setns() ERRNO handling
in sock_handle_system_err(). It must be called just before the fd's
runtime condition checks, which we also move into
sock_create_server_socket() for the same reason.
The SO_MARK, SO_USER_COOKIE and SO_RTABLE socket options (used to set a
special mark/ID on the socket, in order to perform mark-based routing)
are only supported by AF_INET sockets. So, let's check the socket
address family when entering this function.
In 98b44e8 ("BUG/MINOR: log: fix global lf_expr node options behavior"),
I properly restored global node options behavior for when encoding is
not used, however the fix is not optimal when encoding is involved:
Indeed, the encoding logic in sess_build_logline() relies on global node
options to know if encoding must be handled expression-wide or
individually. However, because of the above fix, if an expression is
made of one or multiple nodes that all set an encoding option manually
(without '%o'), we consider that the option was set globally, but
that's probably not what the user intended. Instead we should only
evaluate global options from '%o', so that it remains possible to
skip global encoding when needed.
No backport needed.
LF_NODE_WITH_OPT(node) returns true if the node's option may be set and
thus should be considered. Logic is based on logformat node's type:
for now only TAG and FMT nodes can be configured.
Rename e_byte_fct to e_fct_byte and e_fct_byte_ctx to e_fct_ctx, and
adjust some comments to make it clear that e_fct_ctx is here to provide
additional user-ctx to the custom cbor encode function pointers.
For now, only e_fct_byte function may be provided, but we could imagine
having e_fct_int{16,32,64}() one day to speed up the encoding when we
know we can encode multiple bytes at a time, but for now it's not worth
the hassle.
The new LIST_ATMOST1() test verifies that the designated element is either
alone or points on both sides to the same element. This is used to detect
that a list has at most a single element, or that an element about to be
deleted was the last one of a list.
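With haproxy's struct list, whose elements carry <n> and <p> pointers,
this boils down to a one-line check (a sketch; the committed macro may
differ):

  #define LIST_ATMOST1(el) ((el)->n == (el)->p)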
In this patch, we make use of the CBOR (RFC 8949) encode helper
functions from the previous commit to implement the '+cbor' encoding
option for log-formats. The logic behind it is pretty similar to the
'+json' encoding option, except that the produced output is a CBOR
payload written in HEX format so that it remains compatible with regular
syslog endpoints.
Example:
log-format "%{+cbor}o %[int(4)] test %(named_field)[str(ok)]"
Will produce:
BF6B6E616D65645F6669656C64626F6BFF
Detailed view (from cbor.me):
BF # map(*)
6B # text(11)
6E616D65645F6669656C64 # "named_field"
62 # text(2)
6F6B # "ok"
FF # primitive(*)
If the option isn't set globally, but on a specific node instead, then
only the value will be encoded according to CBOR specification.
Example:
log-format "test cbor bool: %{+cbor}[bool(true)]"
Will produce:
test cbor bool: F5
Add CBOR helpers to encode strings (bytes/text) and integers according
to RFC 8949, and also add a cbor_encode_ctx struct to pass encoding
options such as how to encode a single byte.
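For reference, RFC 8949 unsigned integers (major type 0) use the
shortest argument encoding, along these lines (an illustrative helper,
not necessarily haproxy's actual API):

  #include <stddef.h>
  #include <stdint.h>

  /* Writes the RFC 8949 encoding of unsigned integer <v> (major type 0)
   * into <out>, using the shortest possible form. Returns the number of
   * bytes written, or 0 if <room> is too small.
   */
  static size_t cbor_put_uint(uint8_t *out, size_t room, uint64_t v)
  {
      size_t len = (v < 24) ? 1 : (v <= 0xff) ? 2 : (v <= 0xffff) ? 3 :
                   (v <= 0xffffffffULL) ? 5 : 9;
      size_t i;

      if (room < len)
          return 0;
      if (len == 1)
          out[0] = v;           /* value encoded in the initial byte */
      else {
          /* additional info 24..27 = 1/2/4/8-byte argument follows */
          static const uint8_t ai[] = { 0, 24, 25, 0, 26, 0, 0, 0, 27 };

          out[0] = ai[len - 1];
          for (i = 1; i < len; i++)
              out[i] = v >> (8 * (len - 1 - i)); /* big-endian argument */
      }
      return len;
  }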
In this patch, we add the "+json" log format option that can be set
globally or per log format node.
What it does is set the LOG_OPT_ENCODE_JSON flag for the current
context, which is provided to all lf_* log building functions. This way,
all lf_* functions are now aware of this option and try to comply with
the JSON specification when the option is set.
If the option is set globally, then sess_build_logline() will produce a
map-like object with key=val pairs for named logformat nodes
(logformat nodes that don't have a name are simply ignored).
Example:
log-format "%{+json}o %[int(4)] test %(named_field)[str(ok)]"
Will produce:
{"named_field": "ok"}
If the option isn't set globally, but on a specific node instead, then
only the value will be encoded according to JSON specification.
Example:
log-format "{ \"manual_key\": %(named_field){+json}[bool(true)] }"
Will produce:
{"manual_key": true}
When the option is set, +E option will be ignored, and partial numerical
values (ie: because of logasap) will be encoded as-is.
Support '+bin' option argument on logformat nodes to try to preserve
binary output type with binary sample expressions.
For this, we rely on the log/sink API which is capable of conveying binary
data since all related functions don't search for a terminating NULL byte
in provided log payload as they take a string pointer and a string length
as argument.
Example:
log-format "%{+bin}o %[bin(00AABB)]"
Will produce:
00aabb
(output was piped to `hexdump -ve '1/1 "%.2x"'` to dump raw bytes as HEX
characters)
This should be used carefully, because many syslog endpoints don't expect
binary data (especially NULL bytes). This is mainly intended for use with
set-var-fmt actions or with ring/udp log endpoints that know how to deal
with such binary payloads.
Also, this option is only supported globally (for use with '%o'), it will
not have any effect when set on an individual node. (it makes no sense to
have binary data in the middle of log payload that was started without
binary data option)
There is no need to expose such functions since they are only involved in
the log building process that occurs inside sess_build_logline().
Make the functions static and remove their public prototypes to ease
code maintenance.
This patch implements the parsing of header lines from a stats-file.
A header line is defined as starting with the '#' character, directly
followed by a domain name. For the moment, either 'fe' or 'be' is
allowed. The following lines will contain counter values relative to the
domain context until the next header line.
This is implemented via the static function parse_header_line(). It
first sets the domain context used during apply_stats_file(). A stats
column array is generated to contain the order in which columns are
stored. This will be reused to parse the values on the following lines.
If an invalid line is found and no header was parsed, consider the
stats-file as ill-formatted and stop parsing. This allows immediately
interrupting parsing if a garbage file was used, without emitting a ton
of warnings to the user.
This commit is the first one of a series to implement preloading of
haproxy counters via stats-file parsing.
This patch defines a basic apply_stats_file() function. It implements
reading a stats-file line by line, without any parsing for the moment.
It is called automatically on process startup via init().
Extract GUID format validation into a dedicated function named
guid_is_valid_fmt(). For the moment, it is only used by guid_insert().
This will be reused when parsing a stats-file, to ensure a GUID has a
valid format before tree lookup.
Define a new CLI command "dump stats-file" with its handler
cli_parse_dump_stat_file(). It will loop twice on proxies_list to dump
first frontend and then backend side. It reuses the common function
stats_dump_stat_to_buffer(), using STAT_F_BOUND to restrict on the
correct side.
A new module stats-file.c is added to regroup functions specific to
stats-file. It defines two main functions:
* stats_dump_file_header() to generate the list of columns prefixed
  by the line context, either "#fe" or "#be"
* stats_dump_fields_file() to generate each stat line. Objects without
  a GUID are skipped. Each stat entry is separated by a comma.
For the moment, stats-file does not support statistics modules. As such,
the stats_dump_*_line() functions are updated to prevent looping over
stats modules on stats-file output.
Prepare the stats functions to handle a new format labelled
"stats-file". Its purpose is to generate a statistics dump with a format
close to the CSV output. Such output will then be used to preload
haproxy internal counters on process startup.
The stats-file output differs from a standard CSV on several points.
First, only an excerpt of all statistics is output. All values that do
not make sense to preload are excluded. For the moment, the stats-file
only lists stats fully defined via the "struct stat_col" method.
Contrary to a CSV, all columns of a stats-file will be filled. As such,
an empty field value is used to mark stats which should not be output.
Some adaptations specific to the stats-file are necessary in
me_generate_field(). First, the stats-file will output values from the
frontend and backend sides separately, with their own respective sets of
columns. As such, an empty field value is returned if a stat is not
defined for either frontend/listener, or backend/server, when outputting
the other side. Also, as the stats-file does not support empty columns,
stcol_hide() is not used for it.
A minor adjustment was necessary for stats_fill_fe_line() to pass
context flags. This is necessary to detect the stat output format. All
other listener/server/backend corresponding functions already have it.
Convert most of the proxy counter statistics to the new
"struct stat_col" definition. Remove their corresponding switch..case
entries in the stats_fill_*_line() functions. Their values are
automatically calculated via me_generate_field() invocation.
Along with this, also complete stcol_hide() for the cases where some
stats should be hidden.
Only a few counters were not converted. This is because they rely on
values stored outside of the fe/be_counters structure, which
me_generate_field() cannot use for now.
This commit is a direct follow-up of the previous one, which defined a
new type "struct stat_col" to fully define a statistic entry.
Define a new function metric_generate(). For metric statistics, it is
able to automatically calculate a stat value field from the "offsets" of
a "struct stat_col". Use it in the stats_fill_*_stats() functions.
Maintain a fallback to the previously used switch-case for old-style
statistics.
This commit does not introduce any functional change as currently no
statistic is defined as a "struct stat_col". This will be the subject of
a future commit.
Previously, statistics were simply defined as a list of name_desc, as
for example "stat_cols_px" for proxy stats. No notion of type was fixed
for each stat definition. This correspondence was done individually
inside the stats_fill_*_line() functions. This renders the process of
defining new statistics tedious.
Implement a more expressive stat definition method via a new API. A new
type "struct stat_col", for stat column, is defined to replace name_desc
usage. It contains a field to store the stat nature and format. A <cap>
field is also defined to be able to define a proxy stat only for certain
types of objects.
This new type is further extended to include counter offsets. This
allows defining a method to automatically generate a stat value field
from a "struct stat_col". This will be the subject of a future commit.
The new type "struct stat_col" is fully compatible with name_desc. This
allows gradually converting stats definitions. The focus will first be
on the proxy counters, to implement statistics preservation on reload.
The name "metrics" was chosen to represent the various list of haproxy
exposed statistics. However, it is deemed as ambiguous as some stats are
indeed metric in the true sense, but some are not, as highlighted by
various "enum field_origin" values.
Replace it by the new name "stat_cols" for statistic columns. Along with
the already existing notion of stat lines it should better reflect its
purpose.
When a process is reloaded, the old process must perform a
synchronization with the new process. To do so, the sync task notifies
the local peer to proceed and waits. Internally, the sync task used the
PEERS_F_DONOTSTOP flag to know it should wait. However, this flag was
only set/unset in a single function. There is no real reason to use a
flag for that. A static variable set to 1 when the resync starts and to
0 when it is finished is enough.
Some flags were used to define the learn state of a peer. It was a bit
confusing, especially because the learn state of a peer is manipulated
from the peer applet but also from the sync task. It is harder to
understand the transitions if they are based on flags rather than on a
dedicated state based on an enum. That is the purpose of this patch.
Now, we can define the following rules regarding this learn state:
* A peer is assigned to learn by the sync task
* The learn state is then changed by the peer itself, to notify that
  learning is in progress and when it is finished.
* Finally, when the peer has finished learning, the sync task must
  acknowledge it by unassigning the peer.
This patch is a cleanup of the recent change about the relation between
a peer and the applet used to deal with I/O. Three flags were introduced
to reflect the peer applet state as seen from outside (from the sync
task in fact). Using flags instead of true states was in fact a bad
idea. This works, but it is confusing, especially because it was mixed
with the LEARN and TEACH peer flags.
So, now, to make it clearer, we are using a dedicated state for this
purpose. From the outside, the peer may be in one of the following
states with respect to its applet:
* the peer has no applet, it is stopped (PEER_APP_ST_STOPPED).
* the peer applet was created with a validated connection from the
  protocol perspective. But the sync task must synchronize it with the
  peers section. It is in the starting state (PEER_APP_ST_STARTING).
* The starting state was acknowledged by the sync task, the peer applet
  can start to process messages. It is in the running state
  (PEER_APP_ST_RUNNING).
* The last peer applet was released and the associated connection
  closed. But the sync task must synchronize it with the peers section.
  It is in the stopping state (PEER_APP_ST_STOPPING).
Functionally speaking, there is no true change here. But it should be
easier to understand now.
In addition to these changes, the __process_peer_state() function was
renamed sync_peer_app_state().
The appctx_is_back() function may be used to know if an applet was
created on the frontend side or on the backend side. It may be handy for
some applets that may exist on both sides, like peer applets.
These new functions is_char4_outside() and is_char8_outside() are meant
to verify whether any of the 4 or 8 chars represented respectively by a
uint32_t or a uint64_t is outside of the min,max byte range passed in
argument. This is the simplified, fast version of the function, so it is
restricted to less than a 0x80 distance between min and max (sufficient
to validate chars). Extra functions are also provided to check for min
or max alone as well, with the same restriction.
The use case typically is to check that the output of read_u32() or
read_u64() contains exclusively certain bytes.
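For the 4-char variant, one way to implement this branchlessly (a sketch
consistent with the description; the committed code may differ) is:

  #include <stdint.h>

  /* Returns non-zero if any byte of <x> is outside [min8, max8].
   * Requires max8 - min8 < 0x80: with that restriction, a borrow in
   * either subtraction can only be caused by an out-of-range byte, so
   * checking the high bit of every byte of both differences suffices.
   */
  static inline int is_char4_outside(uint32_t x, uint8_t min8, uint8_t max8)
  {
      uint32_t min32 = 0x01010101U * min8;
      uint32_t max32 = 0x01010101U * max8;

      /* (x - min32): high bit set where a byte is below min8;
       * (max32 - x): high bit set where a byte is above max8 */
      return !!(((x - min32) | (max32 - x)) & 0x80808080U);
  }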
From Linux 5.17, anonymous regions can be named via prctl/PR_SET_VMA,
so caches can be identified when looking at HAProxy process memory
mappings.
The most likely error is lack of kernel support; as a result we ignore
it. If the naming fails, the mapping of the memory context ought to
still occur.
Since 3.0-dev7 with commit 1a088da7c2 ("MAJOR: stktable: split the keys
across multiple shards to reduce contention"), building without threads
yields a warning about the shard not being used. This is because the
locks API does nothing with its arguments, and that is the only place
where the shard was used. We cannot modify the lock API to pretend to
consume its argument because quite often it's not even instantiated.
Let's just pretend we consume shard using an explicit ALREADY_CHECKED()
statement instead. While we're at it, let's make sure that XXH32() is
not called when there is a single bucket!
No backport is needed.
Some flags are defined during statistics generation and output. They use
the prefix STAT_* which is also used for other purposes. Rename them
with the new prefix STAT_F_* to differentiate them from the other
usages.
Several unique names were used for different purposes in the statistics
implementation. This made the code difficult to understand.
* the stat/stats name is removed when a more specific name can be used
* restrict "field" usage to purely refer to <struct field>, which
  represents a raw stat value.
* use the "line" naming to represent an array of <struct field>