During development, everything related to CPU binding and the CPU topology
is debugged using state dumps at various places, but it does make sense to
have a real command line option so that this remains usable in production
to help users figure why some CPUs are not used by default. Let's add
"-dc" for this. Since the list of global.tune.options values is almost
full and does not 100% match this option, let's add a new "tune.debug"
field for this.
For now the function refrains from detecting the CPU topology when a
restrictive taskset or cpu-map was already performed on the process,
and it's documented as such, the reason being that until we're able
to automatically create groups, better not change user settings. But
we'll need to be able to detect bound CPUs and to process them as
desired by the user, so we now need to move that detection into the
function itself. It changes nothing to the logic, just gives more
freedom to the function.
The function is not convenient because it doesn't allow us to undo the
startup changes, and depending on where it's being used, we don't know
whether the values read have already been altered (this is not the case
right now but it's going to evolve).
Let's just compute the status during cpu_detect_usable() and set a
variable accordingly. This way we'll always read the init value, and
if needed we can even afford to reset it. Also, placing it in cpu_topo.c
limits cross-file dependencies (e.g. threads without affinity etc).
With this patch we're also NUMA node IDs to each CPU when the info is
found. The code is highly inspired from the one in commit f5d48f8b3
("MEDIUM: cfgparse: numa detect topology on FreeBSD."), the difference
being that we're just setting the value in ha_cpu_topo[].
With this patch we're also assigning NUMA node IDs to each CPU when one
is found. The code is highly inspired from the one in commit b56a7c89a
("MEDIUM: cfgparse: detect numa and set affinity if needed") that already
did the job, except that it could be simplified since we're just collecting
info to fill the ha_cpu_topo[] array.
The sibling ID was not reported because it's not directly accessible
but we don't care, what matters is that we assign numbers to all the
threads we find using the same CPU so that some strategies permit to
allocate one thread at a time if we want to use few threads with max
performance.
This uses the publicly available information from /sys to figure the cache
and package arrangements between logical CPUs and fill ha_cpu_topo[], as
well as their SMT capabilities and relative capacity for those which expose
this. The functions clearly have to be OS-specific.
When possible, the offline CPUs are detected at boot and their OFFLINE
flag is set in the ha_cpu_topo[] array. When the detection is not
possible (e.g. not linux, /sys not mounted etc), we just mark none of
them as being offline, as we don't want to infer wrong info that could
hinder automatic CPU placement detection. When valid, we take this
opportunity for refining cpu_topo_lastcpu so that we don't need to
manipulate CPUs beyond this value.
On FreeBSD we can detect online CPUs at least by doing the bitwise-OR of
the CPUs of all domains, so we're using this and adding this detection
to ha_cpuset_detect_online(). If we find simpler later, we can always
rework it, but it's reasonably inexpensive since we only check existing
domains.
This adds a generic function ha_cpuset_detect_online() which for now
only supports linux via /sys. It fills a cpuset with the list of online
CPUs that were detected (or returns a failure).
The cpuset files are normally used only for cpu manipulations. It happens
that the initial CPU binding detection was initially placed there since
there was no better place, but in practice, being OS-specific, it should
really be in cpu-topo. This simplifies cpuset which doesn't need to know
about the OS anymore.
Now before trying to resolve the thread assignment to groups, we detect
which CPUs are not bound at boot so that we can mark them with
HA_CPU_F_EXCLUDED. This will be useful to better know on which CPUs we
can count later. Note that we purposely ignore cpu-map here as we
don't know how threads and groups will map to cpu-map entries, hence
which CPUs will really be used.
It's important to proceed this way so that when we have no info we
assume they're all available.
The new function cpu_dump_topology() will centralize most debugging
calls, and it can make efforts of not dumping some possibly irrelevant
fields (e.g. non-existing cache levels).
We don't want to constantly deal with as many CPUs as a cpuset can hold,
so let's first try to trim the value to what the system claims to support
via _SC_NPROCESSORS_CONF. It is obviously still subject to the limit of
the cpuset size though. The value is stored globally so that we can
reuse it elsewhere after initialization.
This structure will be used to store information about each CPU's
topology (package ID, L3 cache ID, NUMA node ID etc). This will be used
in conjunction with CPU affinity setting to try to perform a mostly
optimal binding between threads and CPU numbers by default. Since it
was noticed during tests that absolutely none of the many machines
tested reports different die numbers, the die_id is not stored.
Also, it was found along experiments that the cluster ID will be used
a lot, half of the time as a node-local identifier, and half of the
time as a global identifier. So let's store the two versions at once
(cl_gid, cl_lid).
Some flags are added to indicate causes of exclusion (offline, excluded
at boot, excluded by rules, ignored by policy).
let's just clean up the thread_cpus_enabled() code a little bit
by removing the OS-specific code and rely on ha_cpuset_detect_bound()
instead. On macos we continue to use sysconf() for now.
Negative IDs are very convenient to mean "not set", so let's just make
the cpuset API robust against this, especially with ha_cpuset_isset()
so that we don't have to manually add this check everywhere when a
value is not known.
Lots of collected data and observations aggregated into a single commit
so as not to lose them. Some parts below come from several commit
messages and are incremental.
Add captures and analysis of intel 14900 where it's not easy to draw
the line between the desired P and E cores.
The 14900 raises some questions (imagine a dual-die variant in multi-socket).
That's the start of an algorithmic distribution of performance cores into
thread groups.
cpu-map currently conflicts a lot with the choices after auto-detection
but it doesn't have to. The problem is the inability to configure the
threads for the whole process like taskset does. By offering this ability
we can also start to designate groups of CPUs symbolically (package, die,
ccx, cores, smt).
It can also be useful to exploit the info from cpuinfo that is not
available in /sys, such as the model number. At least on arm, higher
numbers indicate bigger cores and can be useful to distinguish cores
inside a cluster. It will not indicate big vs medium ones of the same
type (e.g. a78 3.0 vs 2.4 GHz) but can still be effective at identifying
the efficient ones.
In short, infos such as cluster ID not always reliable, and are
local to the package. die_id as well. die number is not reported
here but should definitely be used, as a higher priority than L3.
We're still missing a discriminant between the l3 and cluster number
in order to address heterogenous CPUs (e.g. intel 14900), though in
terms of locality that's currently done correctly.
CPU selection is also a full topic, and some thoughts were noted
regarding sorting by perf vs locality so as never to mix inter-
socket CPUs due to sorting.
The proposed cpu-selection cannot work as-is, because it acts both on
restriction and preference, and these two are not actions but a sequence.
First restrictions must be enforced, and second the remaining CPUs are
sorted according to the preferred criterion, and a number of threads are
selected.
Currently we refine the OS-exposed cluster number but it's not correct
as we can end up with something poorly numbered. We need to respect the
LLC in any case so let's explain the approach.
This is also needed in order to make the requested number of CPUs
appear. For now we don't reroute to the original sysconf() call so
we return -1,EINVAL for all other info.
A build warning is emitted with gcc-4.8 in tools.c since commit
e920d73f59 ("MINOR: tools: improve symbol resolution without dl_addr")
because the compiler doesn't see that <size> is necessarily initialized.
Let's just preset it.
This adds run_poll_loop, run_tasks_from_lists, process_runnable_tasks,
ha_dump_backtrace and cli_io_handler which are fairly common in
backtraces. This will be less relative symbols when dladdr is not
usable.
Let's have a macro that declares both the symbol and its name, it will
avoid the risk of introducing typos, and encourages adding more when
needed. The macro also takes an optional second argument to permit an
inline declaration of an extern symbol.
When dl_addr is not usable or fails, better fall back to the closest
symbol among the known ones instead of providing everything relative
to main. Most often, the location of the function will give some hints
about what it can be. Thus now we can emit fct+0xXXX in addition to
main+0xXXX or main-0xXXX. We keep a margin of +256kB maximum after a
function for a match, which is around the maximum size met in an object
file, otherwise it becomes pointless again.
Performing a diff on stats output before vs after commit 66152526
("MEDIUM: stats: convert counters to new column definition") revealed
that some metrics were not properly ported to to the new API. Namely,
"lbtot", "cli_abrt" and "srv_abrt" are now exposed on frontend and
listeners while it was not the case before.
Also, "hrsp_other" is exposed even when "mode http" wasn't set on the
proxy.
In this patch we restore original behavior by fixing the capabilities
and hide settings.
As this could be considered as a minor regression (looking at the commit
message it doesn't seem intended), better tag this as a bug. It should be
backported in 3.0 with 66152526.
This is a complementary patch to cf913c2f9 ("DOC: management: rename show
stats domain cli "dns" to "resolvers"). The doc still refered to the
legacy "dns" domain filter for stat command. Let's rename those occurences
to "resolvers".
It may be backported to all stable versions.
Since commit 8de8ed4f48 ("MEDIUM: connections: Allow taking over
connections from other tgroups.") we got this partially absurd
build warning when disabling threads:
src/backend.c: In function 'conn_backend_get':
src/backend.c:1371:27: warning: array subscript [0, 0] is outside array bounds of 'struct tgroup_info[1]' [-Warray-bounds]
The reason is that gcc sees that curtgid is not equal to tgid which is
defined as 1 in this case, thus it figures that tgroup_info[curtgid-1]
will be anything but zero and that doesn't fit. It is ridiculous as it
is a perfect case of dead code elimination which should not warrant a
warning. Nevertheless we know we don't need to do this when threads are
disabled and in this case there will not be more than 1 thread group, so
we can happily use that preliminary test to help the compiler eliminate
the dead condition and avoid spitting this warning.
No backport is needed.
The dladdr_lock that was added to avoid re-entering into dladdr is
conditioned by threads, but the way it's declared causes a build
warning if threads are disabled due to the insertion of a lone semi
colon in the variables block. Let's switch to __decl_thread_var()
for this.
This can be backported wherever commit eb41d768f9 ("MINOR: tools:
use only opportunistic symbols resolution") is backported. It relies
on these previous two commits:
bb4addabb7 ("MINOR: compiler: add a simple macro to concatenate resolved strings")
69ac4cd315 ("MINOR: compiler: add a new __decl_thread_var() macro to declare local variables")
__decl_thread() already exists but is more suited for struct members.
When using it in a variables block, it appends the final trailing
semi-colon which is a statement that ends the variable block. Better
clean this up and have one precisely for variable blocks. In this
case we can simply define an unused enum value that will consume the
semi-colon. That's what the new macro __decl_thread_var() does.
It's often useful to be able to concatenate strings after resolving
them (e.g. __FILE__, __LINE__ etc). Let's just have a CONCAT() macro
to do that, which calls _CONCAT() with the same arguments to make
sure the contents are resolved before being concatenated.
A bug was uncovered by the work on NUMA. It only triggers in the CI
with libmusl due to a race condition. What happens is that the call
to set_thread_cpu_affinity() is done very early in the polling loop,
and that it relies on ha_pthread[tid] instead of pthread_self(). The
problem is that ha_pthread[tid] is only set by the return from
pthread_create(), which might happen later depending on the number of
CPUs available to run the starting thread.
Let's just use pthread_self() here. ha_pthread[] is only used to send
signals between threads, there's no point in using it here.
This can be backported to 2.6.
Historically, log-forward proxy used to preserve host field from input
message as much as possible, and if syslog host wasn't provided
(rfc5424 '-' or bad rfc3164 or rfc5424 message) then "localhost" or "-"
would be used as host when outputting message using rfc3164 or rfc5424.
We change that behavior (which corresponds to "keep" host option), so that
log-forward now uses "fill" strategy as default: if the host is provided
in input message, it is preserved. However if it is missing and IP address
from sender is available, we use it.
Following previous patch, we know implement the logic for the host
option under log-forward section. Possible strategies are:
replace If input message already contains a value for the host
field, we replace it by the source IP address from the
sender.
If input message doesn't contain a value for the host field
(ie: '-' as input rfc5424 message or non compliant rfc3164
or rfc5424 message), we use the source IP address from the
sender as host field.
fill If input message already contains a value for the host field,
we keep it.
If input message doesn't contain a value for the host field
(ie: '-' as input rfc5424 message or non compliant rfc3164
or rfc5424 message), we use the source IP address from the
sender as host field.
keep If input message already contains a value for the host field,
we keep it.
If input message doesn't contain a value for the host field,
we set it to localhost (rfc3164) or '-' (rfc5424).
(This is the default)
append If input message already contains a value for the host field,
we append a comma followed by the IP address from the sender.
If input message doesn't contain a value for the host field,
we use the source IP address from the sender.
Default value (unchanged) is "keep" strategy. option host is only relevant
with rfc3164 or rfc5424 format on log targets. Also, if the source address
is not available (ie: UNIX socket), default behavior prevails.
Documentation was updated.
Migrate recently added log-forward section options, currently stored under
proxy->options2 to proxy->options3 since proxy->options2 is running out of
space and we plan on adding more log-forward options.
proxy->options2 is almost full, yet we will add new log-forward options
in upcoming patches so we anticipate that by adding a new {no_}options3
and cfg_opts3[] to further extend proxy options
Prevent code duplication under syslog_fd_handler() and syslog_io_handler()
by merging common code path in a single syslog_process_message() helper
that processed a single message stored in <buf> according to <frontend>
settings.
This test returns a JWS payload signed a specified private key in the
PEM format, and uses the "jose" command tool to check if the signature
is correct against the jwk public key.
The test could be improved later by using the code from jwt.c allowing
to check a signature.
This commits implement JWS signing, this is divided in 3 parts:
- jws_b64_protected() creates a JWS "protected" header, which takes the
algorithm, kid or jwk, nonce and url as input, and fill a destination
buffer with the base64url version of the header
- jws_b64_payload() just encode a payload in base64url
- jws_b64_signature() generates a signature using as input the protected
header and the payload, it supports ES256, ES384 and ES512 for ECDSA
keys, and RS256 for RSA ones. The RSA signature just use the
EVP_DigestSign() API with its result encoded in base64url. For ECDSA
it's a little bit more complicated, and should follow section 3.4 of
RFC7518, R and S should be padded to byte size.
Then the JWS can be output with jws_flattened() which just formats the 3
base64url output in a JSON representation with the 3 fields, protected,
payload and signature.
Released version 3.2-dev7 with the following main changes :
- BUG/MEDIUM: applet: Don't handle EOI/EOS/ERROR is applet is waiting for room
- BUG/MEDIUM: spoe/mux-spop: Introduce an NOOP action to deal with empty ACK
- BUG/MINOR: cfgparse: fix NULL ptr dereference in cfg_parse_peers
- BUG/MEDIUM: uxst: fix outgoing abns address family in connect()
- REGTESTS: fix reg-tests/server/abnsz.vtc
- BUG/MINOR: log: fix outgoing abns address family
- BUG/MINOR: sink: add tempo between 2 connection attempts for sft servers
- MINOR: clock: always use atomic ops for global_now_ms
- CI: QUIC Interop: clean old docker images
- BUG/MINOR: stream: do not call co_data() from __strm_dump_to_buffer()
- BUG/MINOR: mux-h1: always make sure h1s->sd exists in h1_dump_h1s_info()
- MINOR: tinfo: add a new thread flag to indicate a call from a sig handler
- BUG/MEDIUM: stream: never allocate connection addresses from signal handler
- MINOR: freq_ctr: provide non-blocking read functions
- BUG/MEDIUM: stream: use non-blocking freq_ctr calls from the stream dumper
- MINOR: tools: use only opportunistic symbols resolution
- CLEANUP: task: move the barrier after clearing th_ctx->current
- MINOR: compression: Introduce minimum size
- BUG/MINOR: h2: always trim leading and trailing LWS in header values
- MINOR: tinfo: split the signal handler report flags into 3
- BUG/MEDIUM: stream: don't use localtime in dumps from a signal handler
- OPTIM: connection: don't try to kill other threads' connection when !shared
- BUILD: add possibility to use different QuicTLS variants
- MEDIUM: fd: Wait if locked in fd_grab_tgid() and fd_take_tgid().
- MINOR: fd: Add fd_lock_tgid_cur().
- MEDIUM: epoll: Make sure we can add a new event
- MINOR: pollers: Add a fixup_tgid_takeover() method.
- MEDIUM: pollers: Drop fd events after a takeover to another tgid.
- MEDIUM: connections: Allow taking over connections from other tgroups.
- MEDIUM: servers: Add strict-maxconn.
- BUG/MEDIUM: server: properly initialize PROXY v2 TLVs
- BUG/MINOR: server: fix the "server-template" prefix memory leak
- BUG/MINOR: h3: do not report transfer as aborted on preemptive response
- CLEANUP: h3: fix documentation of h3_rcv_buf()
- MINOR: hq-interop: properly handle incomplete request
- BUG/MEDIUM: mux-fcgi: Try to fully fill demux buffer on receive if not empty
- MINOR: h1: permit to relax the websocket checks for missing mandatory headers
- BUG/MINOR: hq-interop: fix leak in case of rcv_buf early return
- BUG/MINOR: server: check for either proxy-protocol v1 or v2 to send hedaer
- MINOR: jws: implement a JWK public key converter
- DEBUG: init: add a way to register functions for unit tests
- TESTS: add a unit test runner in the Makefile
- TESTS: jws: register a unittest for jwk
- CI: github: run make unit-tests on the CI
- TESTS: add config smoke checks in the unit tests
- MINOR: jws: conversion to NIST curves name
- CI: github: remove smoke tests from vtest.yml
- TESTS: ist: fix wrong array size
- TESTS: ist: use the exit code to return a verdict
- TESTS: ist: add a ist.sh to launch in make unit-tests
- CI: github: fix h2spec.config proxy names
- DEBUG: init: Add a macro to register unit tests
- MINOR: sample: allow custom date format in error-log-format
- CLEANUP: log: removing "log-balance" references
- BUG/MINOR: log: set proper smp size for balance log-hash
- MINOR: log: use __send_log() with exact payload length
- MEDIUM: log: postpone the decision to send or not log with empty messages
- MINOR: proxy: make pr_mode enum bitfield compatible
- MINOR: cfgparse-listen: add and use cfg_parse_listen_match_option() helper
- MINOR: log: add options eval for log-forward
- MINOR: log: detach prepare from parse message
- MINOR: log: add dont-parse-log and assume-rfc6587-ntf options
- BUG/MEIDUM: startup: return to initial cwd only after check_config_validity()
- TESTS: change the output of run-unittests.sh
- TESTS: unit-tests: store sh -x in a result file
- CI: github: show results of the Unit tests
- BUG/MINOR: cfgparse/peers: fix inconsistent check for missing peer server
- BUG/MINOR: cfgparse/peers: properly handle ignored local peer case
- BUG/MINOR: server: dont return immediately from parse_server() when skipping checks
- MINOR: cfgparse/peers: provide more info when ignoring invalid "peer" or "server" lines
- BUG/MINOR: stream: fix age calculation in "show sess" output
- MINOR: stream/cli: rework "show sess" to better consider optional arguments
- MINOR: stream/cli: make "show sess" support filtering on front/back/server
- TESTS: quic: create first quic unittest
- MINOR: h3/hq-interop: restore function for standalone FIN receive
- MINOR/OPTIM: mux-quic: do not allocate rxbuf on standalone FIN
- MINOR: mux-quic: refine reception of standalone STREAM FIN
- MINOR: mux-quic: define globally stream rxbuf size
- MINOR: mux-quic: define rxbuf wrapper
- MINOR: mux-quic: store QCS Rx buf in a single-entry tree
- MINOR: mux-quic: adjust Rx data consumption API
- MINOR: mux-quic: adapt return value of qcc_decode_qcs()
- MAJOR: mux-quic: support multiple QCS RX buffers
- MEDIUM: mux-quic: handle too short data splitted on multiple rxbuf
- MAJOR: mux-quic: increase stream flow-control for multi-buffer alloc
- BUG/MINOR: cfgparse-tcp: relax namespace bind check
- MINOR: startup: adjust alert messages, when capabilities are missed
CAP_SYS_ADMIN support was added, in order to access sockets in namespaces. So
let's adjust the alert at startup, where we check preserved capabilities from
global.last_checks. Let's mention here cap_sys_admin as well.
Commit 5cbb278 introduced cap_sys_admin support, and enforced checks for
both binds and servers. However, when binding into a namespace, the bind
is done before dropping privileges. Hence, checking that we have
cap_sys_admin capability set in this case is not needed (and it would
decrease security to add it).
For users starting haproxy with other user than root and without
cap_sys_admin, bind should have already failed.
As a consequence, relax runtime check for binds into a namespace.
Support for multiple Rx buffers per QCS instance has been introduced by
previous patches. However, due to flow-control initial values, client
were still unable to fully used this to increase their upload
throughput.
This patch increases max-stream-data-bidi-remote flow-control initial
values. A new define QMUX_STREAM_RX_BUF_FACTOR will fix the number of
concurrent buffers allocable per QCS. It is set to 90.
Note that connection flow-control initial value did not changed. It is
still configured to be equivalent to bufsize multiplied by the maximum
concurrent streams. This ensures that Rx buffers allocation is still
constrained per connection, so that it won't be possible to have all
active QCS instances using in parallel their maximum Rx buffers count.
Previous commit introduces support for multiple Rx buffers per QCS
instance. Contiguous data may be splitted accross multiple buffers
depending on their offset.
A particular issue could arise with this new model. Indeed, app_ops
rcv_buf callback can still deal with a single buffer at a time. This may
cause a deadlock in decoding if app_ops layer cannot proceed due to
partial data, but such data are precisely divided on two buffers. This
can for example intervene during HTTP/3 frame header parsing.
To deal with this, a new function is implemented to force data realign
between two contiguous buffers. This is called only when app_ops rcv_buf
returned 0 but data is available in the next buffer after the current
one. In this case, data are transferred from the next into the current
buffer via qcs_transfer_rx_data(). Decoding is then restarted, which
should ensure that app_ops layer has enough data to advance.
During this operation, special care is ensure to removed both
qc_stream_rxbuf entries, as their offset are adjusted. The next buffer
is only reinserted if there is remaining data in it, else it can be
freed.
This case is not easily reproducible as it depends on the HTTP/3 framing
used by the client. It seems to be easily reproduced though with quiche.
$ quiche-client --http-version HTTP/3 --method POST --body /tmp/100m \
"https://127.0.0.1:20443/post"