The current BTS/BTR operations on x86 are ugly because they rely on a
CAS, so they may be unfair and take time to converge. Fortunately,
where they are currently used (mostly FDs) the contention is expected
to be rare (mostly listeners). But this also limits their use to such
few low-load cases.
On x86 there is a set of BTS/BTR instructions which help for this,
but before the FD's state migrated to 32 bits there was little use of
them since they do not exist in 8 bits.
Now at least it makes sense to use them, at the very least in order
to significantly reduce the code size (one BTS instead of a CMPXCHG
loop). The implementation relies on modern gcc's ability to return
condition flags and limit code inflation and register spilling. The
fall back is retained on the old implementation for all other situations
(inappropriate target size or non-capable compiler). The code shrank
by 1.6 kB on the fast path.
As expected, for now on up to 4 threads there is no measurable difference
of performance.
Probably due to the result of an old copy-paste, HA_ATOMIC_BTS/BTR were
still implemented using the __sync_* builtins instead of the more
modern __atomic_* which allow to specify the memory model. Let's update
this to use the newer there and also implement the relaxed variants
(which are not used for now).
This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1)
or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and
HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.
Most ADD/SUB callers use them for a single unit (e.g. refcounts) and
it's a pain to always pass ",1". Let's add them to simplify the API.
However we currently don't add any return value. If needed in the future
better report zero/non-zero than a real value for the sake of efficiency
at the instruction level.
The fetch_and_xxx variant is often missing for add/sub/and/or. In fact
it was only provided for ADD under the name XADD which corresponds to
the x86 instruction name. But for destructive operations like AND and
OR it's missing even more as it's not possible to know the value before
modifying it.
This patch explicitly adds HA_ATOMIC_FETCH_{OR,AND,ADD,SUB} which
cover these standard operations, and renames XADD to FETCH_ADD (there
were only 6 call places).
In the future, backport of fixes involving such operations could simply
remap FETCH_ADD(x) to XADD(x), FETCH_SUB(x) to XADD(-x), and for the
OR/AND if needed, these could possibly be done using BTS/BTR.
It's worth noting that xchg could have been renamed to fetch_and_store()
but xchg already has well understood semantics and it wasn't needed to
go further.
In order to make sure these ones will not be used anymore in an expression,
let's make them always void. New callers will now be forced to use the
explicit _FETCH variant if required.
Currently our atomic ops return a value but it's never known whether
the fetch is done before or after the operation, which causes some
confusion each time the value is desired. Let's create an explicit
variant of these operations suffixed with _FETCH to explicitly mention
that the fetch occurs after the operation, and make use of it at the
few call places.
Gcc 10.2 implements outline atomics on aarch64. The replace all inline
atomic ops with a function call that checks if the machine supports LSE
atomics. This comes with a small cost but allows modern machines to scale
much better than with the old LL/SC ones even when built for full 8.0
compatibility.
This patch enables the use of the __atomic_compare_exchange() builtin
for the double-word CAS when detected as available instead of using the
hand-written LL/SC version. The extra cost is negligible because we do
very few DWCAS operations (essentially FD migrations and shared pools)
so the cost is low but under high contention it can still be beneficial.
As expected no performance difference was measured in either direction
on 4-core machines with this change.
This could be backported to 2.3 if it was shown that FD migrations were
representing a significant source of contention, but for now it does
not appear to be needed.
There is a function called fd_write_frag_line() that's essentially used
by loggers and that is used to write an atomic message line over a file
descriptor using writev(). However a lock is required around the writev()
call to prevent messages from multiple threads from being interleaved.
Till now a SPIN_TRYLOCK was used on a dedicated lock that was common to
all FDs. This is quite not pretty as if there are multiple output pipes
to collect logs, there will be quite some contention. Now that there
are empty flags left in the FD state and that we can finally use atomic
ops on them, let's add a flag to indicate the FD is locked for exclusive
access by a syscall. At least the locking will now be on an FD basis and
not the whole process, so we can remove the log_lock.
No need to keep this flag apart any more, let's merge it into the global
state. The bit was not cleared in fd_insert() because the only user is
the function used to create and atomically send a log message to a pipe
FD, which never registers the fd. Here we clear it nevertheless for the
sake of clarity.
Note that with an extra cleaning pass we could have a bit number
here and simply use a BTS to test and set it.
No need to keep this flag apart any more, let's merge it into the global
state. The CLI's output state was extended to 6 digits and the linger/cloned
flags moved inside the parenthesis.
For a long time we've had fdtab[].ev and fdtab[].state which contain two
arbitrary sets of information, one is mostly the configuration plus some
shutdown reports and the other one is the latest polling status report
which also contains some sticky error and shutdown reports.
These ones used to be stored into distinct chars, complicating certain
operations and not even allowing to clearly see concurrent accesses (e.g.
fd_delete_orphan() would set the state to zero while fd_insert() would
only set the event to zero).
This patch creates a single uint with the two sets in it, still delimited
at the byte level for better readability. The original FD_EV_* values
remained at the lowest bit levels as they are also known by their bit
value. The next step will consist in merging the remaining bits into it.
The whole bits are now cleared both in fd_insert() and _fd_delete_orphan()
because after a complete check, it is certain that in both cases these
functions are the only ones touching these areas. Indeed, for
_fd_delete_orphan(), the thread_mask has already been zeroed before a
poller can call fd_update_event() which would touch the state, so it
is certain that _fd_delete_orphan() is alone. Regarding fd_insert(),
only one thread will get an FD at any moment, and it as this FD has
already been released by _fd_delete_orphan() by definition it is certain
that previous users have definitely stopped touching it.
Strictly speaking there's no need for clearing the state again in
fd_insert() but it's cheap and will remove some doubts during some
troubleshooting sessions.
In preparation of merging FD_POLL* and FD_EV*, this only changes the
value of FD_POLL_* to use bits 8-15 (the second byte). The size of the
field has been temporarily extended to 32 bits already, as well as
the temporary variables that carry the new composite value inside
fd_update_events(). The resulting fdtab entry becomes temporarily
unaligned. All places making access to .ev or FD_POLL_* were carefully
inspected to make sure they were safe regarding this change. Only one
temporary update was needed for the "show fd" code. The code was only
slightly inflated at this step.
The former was not used and the second was used only as a positive mask
of the flags to keep instead of having the flags that are updated. Both
were removed in favor of a new FD_POLL_UPDT_MASK that only mentions the
updated flags. This will ease merging of state and ev later.
This patch registers the parsed file and the line where a log server
is declared to make those information available in configuration
post check.
Those new informations were added on error messages probed resolving
ring names on post configuration check.
Define MODE_DIAG which is used to run haproxy in diagnostic mode. This
mode is used to output extra warnings about possible configuration
blunder or sub-optimal usage. It can be activated with argument '-dD'.
A new output function ha_diag_warning is implemented reserved for
diagnostic output. It serves to standardize the format of diagnostic
messages.
A macro HA_DIAG_WARN_COND is also available to automatically check if
diagnostic mode is on before executing the diagnostic check.
Historically, an option was added to wait for the request payload (option
http-buffer-request). This option has 2 drawbacks. First, it is an ON/OFF
option for the whole proxy. It cannot be enabled on demand depending on the
message. Then, as its name suggests, it only works on the request side. The
only option to wait for the response payload was to write a dedicated
filter. While it is an acceptable solution for complex applications, it is a
bit overkill to simply match strings in the body.
To make everyone happy, this patch adds a dedicated HTTP action to wait for
the message payload, for the request or the response depending it is used in
an http-request or an http-response ruleset. The time to wait is
configurable and, optionally, the minimum payload size to have before stop
to wait.
Both the http action and the old http analyzer rely on the same internal
function.
L6 sample fetches are now ignored when called from an HTTP proxy. Thus, a
warning is emitted during the startup if such usage is detected. It is true
for most ACLs and for log-format strings. Unfortunately, it is a bit painful
to do so for sample expressions.
This patch relies on the commit "MINOR: action: Use a generic function to
check validity of an action rule list".
The check_action_rules() function is now used to check the validity of an
action rule list. It is used from check_config_validity() function to check
L5/6/7 rulesets.
It is now possible to perform HTTP upgrades on a TCP stream from the
frontend side. To do so, a tcp-request content rule must be defined with the
switch-mode action, specifying the mode (for now, only http is supported)
and optionnaly the proto (h1 or h2).
This way it could be possible to set HTTP directives on a TCP frontend which
will only be evaluated if an upgrade is performed. This new way to perform
HTTP upgrades should replace progressively the old way, consisting to route
the request to an HTTP backend. And it should be also a good start to remove
all HTTP processing from tcp-request content rules.
This action is terminal, it stops the ruleset evaluation. It is only
available on proxy with the frontend capability.
The configuration manual has been updated accordingly.
The code responsible to perform an HTTP upgrade from a TCP stream is moved
in a dedicated function, stream_set_http_mode().
The stream_set_backend() function is slightly updated, especially to
correctly set the request analysers.
Now allocation and initialization of HTTP transactions are performed in a
unique function. Historically, there were two functions because the same TXN
was reset for K/A connections in the legacy HTTP mode. Now, in HTX, K/A
connections are handled at the mux level. A new stream, and thus a new TXN,
is created for each request. In addition, the function responsible to end
the TXN is now also reponsible to release it.
So, now, http_create_txn() and http_destroy_txn() must be used to create and
destroy an HTTP transaction.
MX_FL_NO_UPG flag may now be set on a multiplexer to explicitly disable
upgrades from this mux. For now, it is set on the FCGI multiplexer because
it is not supported and there is no upgrade on backend-only multiplexers. It
is also set on the H2 multiplexer because it is clearly not supported.
Historically we've used SOL_IP/SOL_IPV6/SOL_TCP everywhere as the socket
level value in getsockopt() and setsockopt() but as we've seen over time
it regularly broke the build and required to have them defined to their
IPPROTO_* equivalent. The Linux ip(7) man page says:
Using the SOL_IP socket options level isn't portable; BSD-based
stacks use the IPPROTO_IP level.
And it indeed looks like a pure linuxism inherited from old examples and
documentation. strace also reports SOL_* instead of IPPROTO_*, which does
not help... A check to linux/in.h shows they have the same values. Only
SOL_SOCKET and other non-IP values make sense since there is no IPPROTO
equivalent.
Let's get rid of this annoying confusion by removing all redefinitions of
SOL_IP/IPV6/TCP and using IPPROTO_* instead, just like any other operating
system. This also removes duplicated tests for the same value.
Note that this should not result in exposing syscalls to other OSes
as the only ones that were still conditionned to SOL_IPV6 were for
IPV6_UNICAST_HOPS which already had an IPPROTO_IPV6 equivalent, and
IPV6_TRANSPARENT which is Linux-specific.
Very often we use "int" where negative numbers are not needed (and can
further cause trouble) just because it's painful to type "unsigned int"
or "unsigned", or ugly to use in function arguments. Similarly sometimes
chars would absolutely need to be signed but nobody types "signed char".
Let's add a few aliases for such types and make them part of the standard
internal API so that over time we can get used to them and get rid of
horrible definitions. A comment also reminds some commonly available
types and their properties regarding other types.
In order to process samples from the command line interface we'll need
rules as well, and these rules will have to be marked as coming from
the CLI parser. This new origin is used for this.
In order to prepare for supporting calling sample expressions from the
CLI, let's create a new CLI_PARSER parsing context. This one supports
constants and internal samples only.
In order to process samples from the config file we'll need rules as
well, and these rules will have to be marked as coming from the
config parser. This new origin is used for this.
We'd sometimes like to be able to process samples while parsing
the configuration based on purely internal thing but that's not
possible right now. Let's add a new CFG_PARSER context for samples
which only permits constant samples (i.e. those which do not change
in the process' life and which are stable during config parsing).
This level indicates that everything it constant in the expression during
the whole process' life and that it may safely be used at config parsing
time.
For now smp_resolve_args() complains on stderr via ha_alert(), but if we
want to make it a bit more dynamic, we need it to return errors in an
allocated message. Let's pass it an error pointer and have it fill it.
On return we indent the output if it contains more than one line.
Define a new cap PR_CAP_LUA. It can be used to allocate the internal
proxy for lua Socket class. This cap overrides default settings for
preferable values in the lua context.
Move all liberation code related to a proxy in a dedicated function
free_proxy in proxy.c. For now, this function is only called in
haproxy.c. In the future, it will be used to free the lua proxy.
This helps to clean up haproxy.c.
Create a new function parse_new_proxy specifically designed to allocate
a new proxy from the configuration file and copy settings from the
default proxy.
The function alloc_new_proxy is reduced to a minimal allocation. It is
used for default proxy allocation and could also be used for internal
proxies such as the lua Socket proxy.
Move deinit_acl_cond and deinit_act_rules from haproxy.c respectively in
acl.c and action.c. The name of the functions has been slightly altered,
replacing the prefix deinit_* by free_* to reflect their purpose more
clearly.
This change has been made in preparation to the implementation of a free
proxy function. As a side-effect, it helps to clean up haproxy.c.
Create a new module init which contains code related to REGISTER_*
macros for initcalls. init.h is included in api.h to make init code
available to all modules.
It's a step to clean up a bit haproxy.c/global.h.
The default SSL_CTX used by a specific frontend is the one of the first
ckch instance created for this frontend. If this instance has SNIs, then
the SSL context is linked to the instance through the list of SNIs
contained in it. If the instance does not have any SNIs though, then the
SSL_CTX is only referenced by the bind_conf structure and the instance
itself has no link to it.
When trying to update a certificate used by the default instance through
a cli command, a new version of the default instance was rebuilt but the
default SSL context referenced in the bind_conf structure would not be
changed, resulting in a buggy behavior in which depending on the SNI
used by the client, he could either use the new version of the updated
certificate or the original one.
This patch adds a reference to the default SSL context in the default
ckch instances so that it can be hot swapped during a certificate
update.
This should fix GitHub issue #1143.
It can be backported as far as 2.2.
Christopher discovered an issue mostly affecting 2.2 and to a less extent
2.3 and above, which is that it's possible to deadlock a soft-stop when
several threads are using a same listener:
thread1 thread2
unbind_listener() fd_set_running()
lock(listener) listener_accept()
fd_delete() lock(listener)
while (running_mask); -----> deadlock
unlock(listener)
This simple case disappeared from 2.3 due to the removal of some locked
operations at the end of listener_accept() on the regular path, but the
architectural problem is still here and caused by a lock inversion built
around the loop on running_mask in fd_clr_running_excl(), because there
are situations where the caller of fd_delete() may hold a lock that is
preventing other threads from dropping their bit in running_mask.
The real need here is to make sure the last user deletes the FD. We have
all we need to know the last one, it's the one calling fd_clr_running()
last, or entering fd_delete() last, both of which can be summed up as
the last one calling fd_clr_running() if fd_delete() calls fd_clr_running()
at the end. And we can prevent new threads from appearing in running_mask
by removing their bits in thread_mask.
So what this patch does is that it sets the running_mask for the thread
in fd_delete(), clears the thread_mask, thus marking the FD as orphaned,
then clears the running mask again, and completes the deletion if it was
the last one. If it was not, another thread will pass through fd_clr_running
and will complete the deletion of the FD.
The bug is easily reproducible in 2.2 under high connection rates during
soft close. When the old process stops its listener, occasionally two
threads will deadlock and the old process will then be killed by the
watchdog. It's strongly believed that similar situations do exist in 2.3
and 2.4 (e.g. if the removal attempt happens during resume_listener()
called from listener_accept()) but if so, they should be much harder to
trigger.
This should be backported to 2.2 as the issue appeared with the FD
migration. It requires previous patches "fd: make fd_clr_running() return
the remaining running mask" and "MINOR: fd: remove the unneeded running
bit from fd_insert()".
Notes for backport: in 2.2, the fd_dodelete() function requires an extra
argument "do_close" indicating whether we want to remove and close the FD
(fd_delete) or just delete it (fd_remove). While this information is not
conveyed along the chain, we know that late calls always imply do_close=1
become do_close=0 exclusively results from fd_remove() which is only used
by the config parser and the master, both of which are single-threaded,
hence are always the last ones in the running_mask. Thus it is safe to
assume that a postponed FD deletion always implies do_close=1.
Thanks to Olivier for his help in designing this optimal solution.
There's no point taking the running bit in fd_insert() since by
definition there will never be more than one thread inserting the FD,
and that fd_insert() may only be done after the fd was allocated by
the system, indicating the end of use by any other thread.
This will need to be backported to 2.2 to fix an issue.
We'll need to know that a thread is the last one to use an fd, so let's
make fd_clr_running() return the remaining bits after removal. Note that
in practice we're only interested in knowing if it's zero but the compiler
doesn't make use of the clags after the AND and emits a CMPXCHG anyway :-/
This will need to be backported to 2.2 to fix an issue.
The commit reverts following commits:
* 83926a04 BUG/MEDIUM: debug/lua: Don't dump the lua stack if not dumpable
* a61789a1 MEDIUM: lua: Use a per-thread counter to track some non-reentrant parts of lua
Instead of relying on a Lua function to print the lua traceback into the
debugger, we are now using our own internal function (hlua_traceback()).
This one does not allocate memory and use a chunk instead. This avoids any
issue with a possible deadlock in the memory allocator because the thread
processing was interrupted during a memory allocation.
This patch relies on the commit "BUG/MEDIUM: debug/lua: Use internal hlua
function to dump the lua traceback". Both must be backported wherever the
patches above are backported, thus as far as 2.0
The separator string is now configurable, passing it as parameter when the
function is called. In addition, the message have been slightly changed to
be a bit more readable.
If an unknown CA file was first mentioned in an "add ssl crt-list" CLI
command, it would result in a call to X509_STORE_load_locations which
performs a disk access which is forbidden during runtime. The same would
happen if a "ca-verify-file" or "crl-file" was specified. This was due
to the fact that the crt-list file parsing and the crt-list related CLI
commands parsing use the same functions.
The patch simply adds a new parameter to all the ssl_bind parsing
functions so that they know if the call is made during init or by the
CLI, and the ssl_store_load_locations function can then reject any new
cafile_entry creation coming from a CLI call.
It can be backported as far as 2.2.
str2sa_range function options PA_O_DGRAM and PA_O_STREAM are used to
define the supported address types but also to set the default type
if it is not explicit. If the used address support both STREAM and DGRAM,
the default was always set to STREAM.
This patch introduce a new option PA_O_DEFAULT_DGRAM to force the
default to DGRAM type if it is not explicit in the address field
and both STREAM and DGRAM are supported. If only DGRAM or only STREAM
is supported, it continues to be considered as the default.
In commit a1ecbca0a ("BUG/MINOR: freq_ctr/threads: make use of the last
updated global time"), for period-based counters, the millisecond part
of the global_now variable was used as the date for the new period. But
it's wrong, it only works with sub-second periods as it wraps every
second, and for other periods the counters never rotate anymore.
Let's make use of the newly introduced global_now_ms variable instead,
which contains the global monotonic time expressed in milliseconds.
This patch needs to be backported wherever the patch above is backported.
It depends on previous commit "MINOR: time: also provide a global,
monotonic global_now_ms timer".
The period-based freq counters need the global date in milliseconds,
so better calculate it and expose it rather than letting all call
places incorrectly retrieve it.
Here what we do is that we maintain a new globally monotonic timer,
global_now_ms, which ought to be very close to the global_now one,
but maintains the monotonic approach of now_ms between all threads
in that global_now_ms is always ahead of any now_ms.
This patch is made simple to ease backporting (it will be needed for
a subsequent fix), but it also opens the way to some simplifications
on the time handling: instead of computing the local time and trying
to force it to the global one, we should soon be able to proceed in
the opposite way, that is computing the new global time an making the
local one just the latest snapshot of it. This will bring the benefit
of making sure that the global time is always ahead of the local one.
The pool_alloc_dirty() function was renamed to __pool_alloc() and now
takes a set of flags indicating whether poisonning is permitted or not
and whether zeroing the area is needed or not. The pool_alloc() function
is now just a wrapper calling __pool_alloc(pool, 0).
This one used to maintain a shortcut in the pools allocation path that
was only justified by b_alloc_fast() which was not used! Let's get rid
of it as well so that the allocator becomes a bit more straight forward.
It is never used anymore since 1.7 where it was used by b_alloc_margin()
then replaced by direct calls to the pools function, and it maintains a
dependency on the exposed pools functions. It's time to get rid of it,
as it's not even certain it still works.
The channel's buffer allocator, channel_alloc_buffer(), was still relying
on the principle of a margin for the request and not for the response.
But this margin stopped working around 1.7 with the introduction of the
content filters such as SPOE, and was completely anihilated with the
local pools that came with threads. Let's simplify this and just use
b_alloc().
Right now there is a discrepancy beteween b_alloc() and b_allow_margin():
the former forcefully overwrites the target pointer while the latter tests
it and returns it as-is if already allocated.
As a matter of fact, all callers of b_alloc() either preliminary test the
buffer, or assume it's already null.
Let's remove this pain and make the function test the buffer's allocation
before doing it again, and match call places' expectations.
Some parts of the Lua are non-reentrant. We must be sure to carefully track
these parts to not dump the lua stack when it is interrupted inside such
parts. For now, we only identified the custom lua allocator. If the thread
is interrupted during the memory allocation, we must not try to print the
lua stack wich also allocate memory. Indeed, realloc() is not
async-signal-safe.
In this patch we introduce a thread-local counter. It is incremented before
entering in a non-reentrant part and decremented when exiting. It is only
performed in hlua_alloc() for now.
Now that connections aren't being reused when they failed, remove the
reset() method. It was unimplemented anywhere, except for H1 where it did
nothing, anyway.
Introduce a new XPRT method, start(). The init() method will now only
initialize whatever is needed for the XPRT to run, but any action the XPRT
has to do before being ready, such as handshakes, will be done in the new
start() method. That way, we will be sure the full stack of xprt will be
initialized before attempting to do anything.
The init() call is also moved to conn_prepare(). There's no longer any reason
to wait for the ctrl to be ready, any action will be deferred until start(),
anyway. This means conn_xprt_init() is no longer needed.
Remove static qualifier on stats_allocate_proxy_counters_internal. This
function will be used to allocate extra counters at runtime for dynamic
servers.
Prepare the server parsing API to support dynamic servers.
- define a new parsing flag to be used for dynamic servers
- each keyword contains a new field dynamic_ok to indicate if it can be
used for a dynamic server. For now, no keyword are supported.
- do not copy settings from the default server for a new dynamic server.
- a dynamic server is created in a maintenance mode and requires an
explicit 'enable server' command.
- a new server flag named SRV_F_DYNAMIC is created. This flag is set for
all servers created at runtime. It might be useful later, for example
to know if a server can be purged.
Modify the API of parse_server function. Use flags to describe the type
of the parsed server instead of discrete arguments. These flags can be
used to specify if a server/default-server/server-template is parsed.
Additional parameters are also specified (parsing of the address
required, resolve of a name must be done immediately).
It is now unneeded to use strcmp on args[0] in parse_server. Also, the
calls to parse_server are more explicit thanks to the flags.
The freq counters were using the thread's own time as the start of the
current period. The problem is that in case of contention, it was
occasionally possible to perform non-monotonic updates on the edge of
the next second, because if the upfront thread updates a counter first,
it causes a rotation, then the second thread loses the race from its
older time, and tries again, and detects a different time again, but
in the past so it only updates the counter, then a third thread on the
new date would detect a change again, thus provoking a rotation again.
The effect was triple:
- rare loss of stored values during certain transitions from one
period to the next one, causing counters to report 0
- half of the threads forced to go through the slow path every second
- difficult convergence when using many threads where the CAS can fail
a lot and we can observe N(N-1) attempts for N threads to complete
This patch fixes this issue in two ways:
- first, it now makes use og the monotonic global_now value which also
happens to be volatile and to carry the latest known time; this way
time will never jump backwards anymore and only the first thread
updates it on transition, the other ones do not need to.
- second, re-read the time in the loop after each failure, because
if the date changed in the counter, it means that one thread knows
a more recent one and we need to update. In this case if it matches
the new current second, the fast path is usable.
This patch relies on previous patch "MINOR: time: export the global_now
variable" and must be backported as far as 1.8.
This is the process-wide monotonic time that is used to update each
thread's own time. It may be required at a few places where a strictly
monotonic clock is required such as freq_ctr. It will be have to be
backported as a dependency of a forthcoming fix.
Some are not always easy to spot with "chk" vs "check" or hyphens at
some places and not at others. Now entering "option http-close" properly
suggests "httpclose" and "option tcp-chk" suggests "tcp-check". There's
no need to consider the proxy's capabilities, what matters is to figure
what related word the user tried to spell, and there are not that many
options anyway.
The distance between two words can be high due to a sub-word being missing
and in this case it happens that other totally unrealted words are proposed
because their average score looks lower thanks to being shorter. Here we're
introducing the notion of presence of each character so that word sequences
that contain existing sub-words are favored against the shorter ones having
nothing in common. In addition we do not distinguish being/end from a
regular delimitor anymore. That made it harder to spot inverted words.
When tasklets were derived from tasks, there was no immediate need for
the scheduler to know their status after execution, and in a spirit of
simplicity they just started to always return NULL. The problem is that
it simply prevents the scheduler from 1) accounting their execution time,
and 2) keeping track of their current execution status. Indeed, a remote
wake-up could very well end up manipulating a tasklet that's currently
being executed. And this is the reason why those handlers have to take
the idle lock before checking their context.
In 2.5 we'll take care of making tasklets and tasks work more similarly,
but trouble is to be expected if we continue to propagate the trend of
returning NULL everywhere, especially if some fixes relying on a stricter
model later need to be backported. For this reason this patch updates all
known tasklet handlers to make them return NULL only when the tasklet was
freed. It has no effect for now and isn't even guaranteed to always be
100% safe but it puts the code into the right direction for this.
There were still a very small list of functions, variables and fields
called "stats_" while they were really purely CLI-centric. There's the
frontend called "stats_fe" in the global section, which instantiates a
"cli_applet" called "<CLI>" so it was renamed "cli_fe".
The "alloc_stats_fe" function cas renamed to "cli_alloc_fe" which also
better matches the naming convention of all cli-specific functions.
Finally the "stats_permission_denied_msg" used to return an error on
the CLI was renamed "cli_permission_denied_msg".
Now there's no more "stats_something" that designates the CLI.
This is the number of args accepted on a command received on the CLI,
is has long been totally independent of stats and should not carry
this misleading "stats" name anymore.
Instead of making a new one from scratch, let's support not wiping the
existing fingerprint and updating it, and to do the same char by char.
The word-by-word one will still result in multiple beginnings and ends,
but that will accurately translate word boundaries. The char-based one
has more flexibility and requires that the caller maintains the previous
char to indicate the transition, which also allows to insert delimiters
for example.
Entering "show tls" would still emit 35 entries. By measuring the distance
between all unknown words and the candidates, we can sort them and pick the
10 most likely candidates. This works reasonably well, as now "show tls"
only proposes "show tls-keys", "show threads", "show pools" and "show tasks".
If the distance is still too high or if a word is missing, the whole
prefix list continues to be dumped, thus "show" alone will still report
the entire list of commands beginning with "show".
It's still impossible to skip a word, for example "show conn" will not
propose "show servers conn" because the distance is calculated for each
word individually. Some changes to the distance calculation to support
updating an existing map could easily address this. But this is already
a great improvement.
It was mentioned that ACCESS_MASTER_ONLY as for workers only instead of
master-only. And it wasn't clear that all ACCESS_* would belong to the
same thing.
The last time when an item was seen in a resolver responses is now stored in
milliseconds instead of seconds. This avoid some corner-cases at the
edges. This also simplifies time comparisons.
Another way to say it: "Safely unlink requester from a requester callbacks".
Requester callbacks must never try to unlink a requester from a resolution, for
the current requester or another one. First, these callback functions are called
in a loop on a request list, not necessarily safe. Thus unlink resolution at
this place, may be unsafe. And it is useless to try to make these loops safe
because, all this stuff is placed in a loop on a resolution list. Unlink a
requester may lead to release a resolution if it is the last requester.
However, the unkink is necessary because we cannot reset the server state
(hostname and IP) with some pending DNS resolution on it. So, to workaround
this issue, we introduce the "safe" unlink. It is only performed from a
requester callback. In this case, the unlink function never releases the
resolution, it only reset it if necessary. And when a resolution is found
with an empty requester list, it is released.
This patch depends on the following commits :
* MINOR: resolvers: Purge answer items when a SRV resolution triggers an error
* MINOR: resolvers: Use a function to remove answers attached to a resolution
* MINOR: resolvers: Directly call srvrq_update_srv_state() when possible
* MINOR: resolvers: Add function to change the srv status based on SRV resolution
All the series must be backported as far as 2.2. It fixes a regression
introduced by the commit b4badf720 ("BUG/MINOR: resolvers: new callback to
properly handle SRV record errors").
don't release resolution from requester cb
srvrq_update_srv_status() update the server status based on result of SRV
resolution. For now, it is only used from snr_update_srv_status() when
appropriate.
resolv_purge_resolution_answer_records() must be used to removed all answers
attached to a resolution. For now, it is only used when a resolution is
released.
This function search for a SRV answer item associated to a requester
whose type is server.
This is mainly useful to "link" a server to its SRV record when no
additional record were found to configure the IP address.
This patch is required by a bug fix.
action_suggest() will return a pointer to an action whose keyword more or
less ressembles the passed argument. It also accepts to be more tolerant
against prefixes (since actions taking arguments are handled as prefixes).
This will be used to suggest approaching words.
Just like with the server keywords, now's the turn of "bind" keywords.
The difference is that 100% of the bind keywords are registered, thus
we do not need the list of extra keywords.
There are multiple bind line parsers today, all were updated:
- peers
- log
- dgram-bind
- cli
$ printf "listen f\nbind :8000 tcut\n" | ./haproxy -c -f /dev/stdin
[NOTICE] 070/101358 (25146) : haproxy version is 2.4-dev11-7b8787-26
[NOTICE] 070/101358 (25146) : path to executable is ./haproxy
[ALERT] 070/101358 (25146) : parsing [/dev/stdin:2] : 'bind :8000' unknown keyword 'tcut'; did you mean 'tcp-ut' maybe ?
[ALERT] 070/101358 (25146) : Error(s) found in configuration file : /dev/stdin
[ALERT] 070/101358 (25146) : Fatal errors found in configuration.
Instead of just reporting "unknown keyword", let's provide a function which
will look through a list of registered keywords for a similar-looking word
to the one that wasn't matched. This will help callers suggest correct
spelling. Also, given that a large part of the config parser still relies
on a long chain of strcmp(), we'll need to be able to pass extra candidates.
Thus the function supports an optional extra list for this purpose.
This introduces two functions, one which creates a fingerprint of a word,
and one which computes a distance between two words fingerprints. The
fingerprint is made by counting the transitions between one character and
another one. Here we consider the 26 alphabetic letters regardless of
their case, then any digit as a digit, and anything else as "other". We
also consider the first and last locations as transitions from begin to
first char, and last char to end. The distance is simply the sum of the
squares of the differences between two fingerprints. This way, doubling/
missing a letter has the same cost, however some repeated transitions
such as "e"->"r" like in "server" are very unlikely to match against
situations where they do not exist. This is a naive approach but it seems
to work sufficiently well for now. It may be refined in the future if
needed.
It is possible to have a session without a listener. It happens for applets
on the client side. Thus all accesses to the listener info from the session
must be guarded. It was the purpose of the commit 36119de18 ("BUG/MEDIUM:
session: NULL dereference possible when accessing the listener"). However,
some tests on the session's listener existence are missing in proxy_inc_*
functions.
This patch should fix the issues #1171, #1172, #1173, #1174 and #1175. It
must be backported with the above commit as far as 1.8.
Since commit f8fb4f75f ("MINOR: atomic: implement a more efficient arm64
__ha_cas_dw() using pairs"), on some modern arm64 (armv8.1+) compiled
with -march=armv8.1-a under gcc-7.5.0, a build error may appear on ev_poll.o :
/tmp/ccHD2lN8.s:1771: Error: reg pair must start from even reg at operand 1 -- `casp x27,x28,x22,x23,[x12]'
Makefile:927: recipe for target 'src/ev_poll.o' failed
It appears that the compiler cannot always assign register pairs there
for a structure made of two u64. It was possibly later addressed since
gcc-9.3 never caused this, but there's no trivially available info on
the subject in the changelogs. Unsuprizingly, using a u128 instead does
fix this, but it significantly inflates the code (+4kB for just 6 places,
very likely that it loaded some extra stubs) and the comparison is ugly,
involving two slower conditional jumps instead of a single one and a
conditional comparison. For example, ha_random64() grew from 144 bytes
to 232.
However, simply forcing the base register does work pretty well, and
makes the code even cleaner and more efficient by further reducing it
by about 4.5kB, possibly because it helps the compiler to pick suitable
registers for the pair there. And the perf on 64-cores looks steadily
0.5% above the previous one, so let's do this.
Note that the commit above was backported to 2.3 to fix scalability
issues on AWS Graviton2 platform, so this one will need to be as well.
The QUIC connection struct connection member was not initialized. This may
make randomly haproxy handle TLS connections as QUIC ones only when QUIC support
is enabled leading to such OpenSSL errors (captured from a reg test output, TLS
Client-Hello callback failed):
OpenSSL error[0x10000085] OPENSSL_internal: CONNECTION_REJECTED
OpenSSL error[0x10000410] OPENSSL_internal: SSLV3_ALERT_HANDSHAKE_FAILURE
OpenSSL error[0x1000009a] OPENSSL_internal: HANDSHAKE_FAILURE_ON_CLIENT_HELLO
This patch should fix#1168 github issue.
The recent default runqueue size reduction appeared to have significantly
lowered performance on low-thread count configs. Testing various values
runqueue values on different workloads under thread counts ranging from
1 to 64, it appeared that lower values are more optimal for high thread
counts and conversely. It could even be drawn that the optimal value for
various workloads sits around 280/sqrt(nbthread), and probably has to do
with both the L3 cache usage and how to optimally interlace the threads'
activity to minimize contention. This is much easier to optimally
configure, so let's do this by default now.
The recently introduced Financial Information eXchange (FIX)
converters have some hard coded tags based on the specification that
were misspelled. Specifically, SenderComID and TargetComID should
be SenderCompID and TargetCompID according to the specification [1][2].
This patch updates all references, which includes the converters
themselves, the regression test, and the documentation.
[1] https://fiximate.fixtrading.org/en/FIX.5.0SP2_EP264/tag49.html
[2] https://fiximate.fixtrading.org/en/FIX.5.0SP2_EP264/tag56.html
Parameter "accepted_payload_size" is currently considered regardless
the used nameserver is using TCP or UDP. It remains mandatory to annouce
such capability to support e-dns, so a value have to be announced also
in TCP. Maximum DNS message size in TCP is limited by protocol to 65535
and so for UDP (65507) if system supports such UDP messages. But
the maximum value for this option was arbitrary forced to 8192.
This patch change this maximum to 65535 to allow user to set bigger value
for UDP if its system supports. It also sets accepted_payload_size
in TCP allowing to retrieve huge responses if the configuration uses
TCP nameservers.
The request announcing the accepted_payload_size capability is currently
built at resolvers level and is common to all used nameservers of the
section regardess transport protocol used. A further patch should be
made to at least specify a different payload size depending of the
transport, and perhaps could be forced to 65535 in case of TCP and
maximum would be forced back to 65507 matching UDP max.
This patch is appliable since 2.4 version
These are some leftovers from the ancient code where they were still
called sessions, but these areas in the code remain confusing due to
this naming. They were now called "strm" which will not even affect
indenting nor alignment.
It was brought by commit c44b8de99 ("CLEANUP: connection: Use `VAR_ARRAY`
in `struct tlv` definition") but breaks the build with clang. Actually it
had already been done 6 months ago by commit 4987a4744 ("CLEANUP: tree-wide:
use VAR_ARRAY instead of [0] in various definitions") then reverted by
commit 441b6c31e ("BUILD: connection: fix build on clang after the VAR_ARRAY
cleanup") which explained the same thing but didn't place a comment in the
code to justify this (in short it's just an end of struct marker).
The default proxy was passed as a variable to all parsers instead of a
const, which is not without risk, especially when some timeout parsers used
to make some int pointers point to the default values for comparisons. We
want to be certain that none of these parsers will modify the defaults
sections by accident, so it's important to mark this proxy as const.
This patch touches all occurrences found (89).
TCC happens to define __OPTIMIZE__ at -O2 but doesn't proceed with dead
code elimination, resulting in ha_free() to always reference the link
error symbol. Let's condition this test on __GCC__ which others like
Clang also define.
ha_free() uses code that attempts to set a non-existant variable to provoke
a link-time error, with the expectation that the compiler will not omit that
if the code is unreachable. However, clang will emit it when compiling with
no optimization, so only do that if __OPTIMIZE__ is defined.
There's currently quite some thread contention in the server struct
because frequently fields accessed fields are mixed with those being
often written to by any thread. Let's split this a little bit to
separate a few areas:
- pure config / admin / operating status (almost never changes)
- idle and queuing (fast changes, done almost together)
- LB (fast changes, not necessarily dependent on the above)
- counters (fast changes, at a different instant again)
The actconns list creates massive contention on low server counts because
it's in fact a list of streams using a server, all threads compete on the
list's head and it's still possible to see some watchdog panics on 48
threads under extreme contention with 47 threads trying to add and one
thread trying to delete.
Moving this list per thread is trivial because it's only used by
srv_shutdown_streams(), which simply required to iterate over the list.
The field was renamed to "streams" as it's really a list of streams
rather than a list of connections.
There are multiple per-thread lists in the listeners, which isn't the
most efficient in terms of cache, and doesn't easily allow to store all
the per-thread stuff.
Now we introduce an srv_per_thread structure which the servers will have an
array of, and place the idle/safe/avail conns tree heads into. Overall this
was a fairly mechanical change, and the array is now always initialized for
all servers since we'll put more stuff there. It's worth noting that the Lua
code still has to deal with its own deinit by itself despite being in a
global list, because its server is not dynamically allocated.
It's a real pain not to have access to the list of all registered servers,
because whenever there is a need to late adjust their configuration, only
those attached to regular proxies are seen, but not the peers, lua, logs
nor DNS.
What this patch does is that new_server() will automatically add the newly
created server to a global list, and it does so as well for the 1 or 2
statically allocated servers created for Lua. This way it will be possible
to iterate over all of them.
Some entries are atomically updated by various threads, such as the
global counters, and they're mixed with others which are read all the
time like the mode. This explains why "perf" was seeing a huge access
cost on global.mode in process_stream()! Let's reorder them so that
the static config stuff is at the beginning and the live stuff is at
the end.
These functions are used on the mux layer to indicate that the connection
is becoming idle and that the xprt ought to be careful before checking the
context or that it's not idle anymore and that the context is safe. The
purpose is to allow a mux which is going to release a connection to tell
the xprt to be careful when touching it. At the moment, the xprt are
always careful and that's costly so we want to have the ability to relax
this a bit.
No xprt layer uses this yet.
This flag will be usable by any application. It will be preserved across
wakeups so the application can use it to do various stuff. Some I/O
handlers will soon benefit from this.
It's been too short for quite a while now and is now full. It's still
time to extend it to 32-bits since we have room for this without
wasting any space, so we now gained 16 new bits for future flags.
The values were not reassigned just in case there would be a few
hidden u16 or short somewhere in which these flags are placed (as
it used to be the case with stream->pending_events).
The patch is tagged MEDIUM because this required to update the task's
process() prototype to use an int instead of a short, that's quite a
bunch of places.
The nice field isn't needed anymore for the tasklet so we can move it
from the TASK_COMMON area into the struct task which already has a
hole around the expire entry.
It's cleaner to use a flag from the task's state to detect a tasklet
and it's even cheaper. One of the best benefits is that this will
allow to get the nice field out of the common part since the tasklet
doesn't need it anymore. This commit uses the last task bit available
but that's temporary as the purpose of the change is to extend this.
We frequently need to access a simple and fast PRNG for statistical
purposes. The debug_prng() function did exactly this using a xorshift
generator but its use was limited to debug only. Let's move this to
tools.h and tools.c to make it accessible everywhere. Since it needs to
be fast, its state is thread-local. An initialization function starts a
different initial value for each thread for better distribution.
In stream_add_srv_conn() MT_LIST_ADD() is used instead of MT_LIST_ADDQ(),
resulting in the stream being queued at the end of the server list. This
has no particular effect since we cannot dump the streams on a server,
and this is only used by "shutdown sessions" on a server. But it also
turns out to be significantly faster due to the shorter recovery from
the conflict with an adjacent MT_LIST_DEL(), thus it remains desirable
to use it, but at least it deserves a comment.
In addition to this, it's worth mentioning that this list should creates
extreme contention with threads while almost never used. It should be
made per-thread just like the global streams list.
The reason is that H2 can already require 32 16kB buffers for the mux
output at once, which will deplete the local cache. Thus it makes sense
to go further to leave some time to other connection to release theirs.
In addition, the L2 cache on modern CPUs is already 1 MB, so this change
is welcome in any case.
We've reached a point where the global pools represent a significant
bottleneck with threads. On a 64-core machine, the performance was
divided by 8 between 32 and 64 H2 connections only because there were
not enough entries in the local caches to avoid picking from the global
pools, and the contention on the list there was very high. It becomes
obvious that we need to have an array of lists, but that will require
more changes.
In parallel, standard memory allocators have improved, with tcmalloc
and jemalloc finding their ways through mainstream systems, and glibc
having upgraded to a thread-aware ptmalloc variant, keeping this level
of contention here isn't justified anymore when we have both the local
per-thread pool caches and a fast process-wide allocator.
For these reasons, this patch introduces a new compile time setting
CONFIG_HAP_NO_GLOBAL_POOLS which is set by default when threads are
enabled with thread local pool caches, and we know we have a fast
thread-aware memory allocator (currently set for glibc>=2.26). In this
case we entirely bypass the global pool and directly use the standard
memory allocator when missing objects from the local pools. It is also
possible to force it at compile time when a good allocator is used with
another setup.
It is still possible to re-enable the global pools using
CONFIG_HAP_GLOBAL_POOLS, if a corner case is discovered regarding the
operating system's default allocator, or when building with a recent
libc but a different allocator which provides other benefits but does
not scale well with threads.
There finally is a way to support register pairs on aarch64 assembly
under gcc, it's just undocumented, like many of the options there :-(
As indicated below, it's possible to pass "%H" to mention the high
part of a register pair (e.g. "%H0" to go with "%0"):
https://patchwork.ozlabs.org/project/gcc/patch/59368A74.2060908@foss.arm.com/
By making local variables from pairs of registers via a struct (as
is used in IST for example), we can let gcc choose the correct register
pairs and avoid a few moves in certain situations. The code is now slightly
more efficient than the previous one on AWS' Graviton2 platform, and
noticeably smaller (by 4.5kB approx). A few tests on older releases show
that even Linaro's gcc-4.7 used to support such register pairs and %H,
and by then ATOMICS were not supported so this should not cause build
issues, and as such this patch replaces the earlier implementation.
This variant uses the CASP instruction available on armv8.1-a CPU cores,
which is detected when __ARM_FEATURE_ATOMICS is set (gcc-linaro >= 7,
mainline >= 9). This one was tested on cortex-A55 (S905D3) and on AWS'
Graviton2 CPUs.
The instruction performs way better on high thread counts since it
guarantees some forward progress when facing extreme contention while
the original LL/SC approach is light on low-thread counts but doesn't
guarantee progress.
The implementation is not the most optimal possible. In particular since
the instruction requires to work on register pairs and there doesn't seem
to be a way to force gcc to emit register pairs, we have to decide to force
to use the pair (x0,x1) to store the old value, and (x2,x3) to store the
new one, and this necessarily involves some extra moves. But at least it
does improve the situation with 16 threads and more. See issue #958 for
more context.
Note, a first implementation of this function was making use of an
input/output constraint passed using "+Q"(*(void**)target), which was
resulting in smaller overall code than passing "target" as an input
register only. It turned out that the cause was directly related to
whether the function was inlined or not, hence the "forceinline"
attribute. Any changes to this code should still pay attention to this
important factor.
On highly threaded machines it is possible to occasionally trigger the
watchdog on certain contended areas like the server's connection list,
because while the mechanism inherently cannot guarantee a constant
progress, it lacks CPU relax calls which are absolutely necessary in
this situation to let a thread finish its job.
The loop's "while (1)" was changed to use a "for" statement calling
__ha_cpu_relax() as its continuation expression. This way the "continue"
statements jump to the unique place containing the pause without
excessively inflating the code.
This was sufficient to definitely fix the problem on 64-core ARM Graviton2
machines. This patch should probably be backported once it's confirmed it
also helps on many-cores x86 machines since some people are facing
contention in these environments. This patch depends on previous commit
"REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h".
An attempt was made to first read the value before exchanging, and it
significantly degraded the performance. It's very likely that this caused
other cores to lose exclusive ownership on their line and slow down their
next xchg operation.
In addition it was found that MT_LIST_ADD is significantly faster than
MT_LIST_ADDQ under high contention, because it fails one step earlier
when conflicting with an adjacent MT_LIST_DEL(). It might be worth
switching some operations' order to favor MT_LIST_ADDQ() instead.
There is some confusion here as we need to place some cpu_relax statements
in some loops where it's not easily possible to condition them on the use
of threads. That's what atomic.h already does. So let's take the various
pl_cpu_relax() implementations from there and place them in atomic.h under
the name __ha_cpu_relax() and let them adapt to the presence or absence of
threads and to the architecture (currently only x86 and aarch64 use a barrier
instruction), though it's very likely that arm would work well with a cache
flushing ISB instruction as well).
This time they were implemented as expressions returning 1 rather than
statements, in order to ease their placement as the loop condition or the
continuation expression inside "for" loops. We should probably do the same
with barriers and a few such other ones.
If dispatch mode or transparent backend is used, the backend connection
target is a proxy instead of a server. In these cases, the reuse of
backend connections is not consistent.
With the default behavior, no reuse is done and every new request uses a
new connection. However, if http-reuse is set to never, the connection
are stored by the mux in the session and can be reused for future
requests in the same session.
As no server is used for these connections, no reuse can be made outside
of the session, similarly to http-reuse never mode. A different
http-reuse config value should not have an impact. To achieve this, mark
these connections as private to have a defined behavior.
For this feature to properly work, the connection hash has been slightly
adjusted. The server pointer as an input as been replaced by a generic
target pointer to refer to the server or proxy instance. The hash is
always calculated on connect_server even if the connection target is not
a server. This also requires to allocate the connection hash node for
every backend connections, not just the one with a server target.
Fix such compilation issues:
include/haproxy/quic_tls.h:157:10: error: implicit conversion from
'enum ssl_encryption_level_t' to 'enum quic_tls_enc_level'
[-Werror=enum-conversion]
157 | return ssl_encryption_application;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
src/xprt_quic.c: In function 'quic_conn_enc_level_init':
src/xprt_quic.c:2358:13: error: implicit conversion from
'enum quic_tls_enc_level' to 'enum ssl_encryption_level_t'
[-Werror=enum-conversion]
2358 | qel->level = quic_to_ssl_enc_level(level);
| ^
Not detected by all the compilators.
This makes the code more readable and less prone to copy-paste errors.
In addition, it allows to place some __builtin_constant_p() predicates
to trigger a link-time error in case the compiler knows that the freed
area is constant. It will also produce compile-time error if trying to
free something that is not a regular pointer (e.g. a function).
The DEBUG_MEM_STATS macro now also defines an instance for ha_free()
so that all these calls can be checked.
178 occurrences were converted. The vast majority of them were handled
by the following Coccinelle script, some slightly refined to better deal
with "&*x" or with long lines:
@ rule @
expression E;
@@
- free(E);
- E = NULL;
+ ha_free(&E);
It was verified that the resulting code is the same, more or less a
handful of cases where the compiler optimized slightly differently
the temporary variable that holds the copy of the pointer.
A non-negligible amount of {free(str);str=NULL;str_len=0;} are still
present in the config part (mostly header names in proxies). These
ones should also be cleaned for the same reasons, and probably be
turned into ist strings.
A network may be specified to avoid header addition for "forwardfor" and
"orignialto" option via the "except" parameter. However, only IPv4
networks/addresses are supported. This patch adds the support of IPv6.
To do so, the net_addr structure is used to store the parameter value in the
proxy structure. And ipcmp2net() function is used to perform the comparison.
This patch should fix the issue #1145. It depends on the following commit:
* c6ce0ab MINOR: tools: Add function to compare an address to a network address
* 5587287 MINOR: tools: Add net_addr structure describing a network addess
ipcmp2net() function may be used to compare an addres (struct
sockaddr_storage) to a network address (struct net_addr). Among other
things, this function will be used to add support of IPv6 for "except"
parameter of "forwardfor" and "originalto" options.
The net_addr structure describes a IPv4 or IPv6 address. Its ip and mask are
represented. Among other things, this structure will be used to add support
of IPv6 for "except" parameter of "forwardfor" and "originalto" options.
This class will be used exclusively for heavy processing tasklets. It
will be cleaner than mixing them with the bulk ones. For now it's
allocated ~1% of the CPU bandwidth.
The largest part of the patch consists in re-arranging the fields in the
task_per_thread structure to preserve a clean alignment with one more
list head. Since we're now forced to increase the struct past a second
cache line, it now uses 4 cache lines (for easy multiplying) with the
first two ones being exclusively used by local operations and the third
one mostly by atomic operations. Interestingly, this better arrangement
causes less stress and reduced the response time by 8 microseconds at
1 million requests per second.
These function names are unbearably long, they don't even fit into the
screen in "show profiling", let's trim the "_connections" to "_conns",
which happens to match the name of the lists there.
While the scheduler is priority-aware and class-aware, and consistently
tries to maintain fairness between all classes, it doesn't make use of a
fine execution budget to compensate for high-latency tasks such as TLS
handshakes. This can result in many subsequent calls adding multiple
milliseconds of latency between the various steps of other tasklets that
don't even depend on this.
An ideal solution would be to add a 4th queue, have all tasks announce
their estimated cost upfront and let the scheduler maintain an auto-
refilling budget to pick from the most suitable queue.
But it turns out that a very simplified version of this already provides
impressive gains with very tiny changes and could easily be backported.
The principle is to reserve a new task flag "TASK_HEAVY" that indicates
that a task is expected to take a lot of time without yielding (e.g. an
SSL handshake typically takes 700 microseconds of crypto computation).
When the scheduler sees this flag when queuing a tasklet, it will place
it into the bulk queue. And during dequeuing, we accept only one of
these in a full round. This means that the first one will be accepted,
will not prevent other lower priority tasks from running, but if a new
one arrives, then the queue stops here and goes back to the polling.
This will allow to collect more important updates for other tasks that
will be batched before the next call of a heavy task.
Preliminary tests consisting in placing this flag on the SSL handshake
tasklet show that response times under SSL stress fell from 14 ms
before the patch to 3.0 ms with the patch, and even 1.8 ms if
tune.sched.low-latency is set to "on".
Some static functions are now exported and renamed to follow the same
pattern of other exported functions. Here is the list :
* update_server_fqdn: Renamed to srv_update_fqdn and exported
* update_server_check_addr_port: renamed to srv_update_check_addr_port and exported
* update_server_agent_addr_port: renamed to srv_update_agent_addr_port and exported
* update_server_addr: renamed to srv_update_addr
* update_server_addr_potr: renamed to srv_update_addr_port
* srv_prepare_for_resolution: exported
This change is mandatory to move all functions dealing with the server-state
files in a separate file.
Parsed parameters are now stored in the tree of server-state lines. This
way, a line from the global server-state file is only parsed once. Before,
it was parsed a first time to store it in the tree and one more time to load
the server state. To do so, the server-state line object must be allocated
before parsing a line. This means its size must no longer depend on the
length of first parsed parameters (backend and server names). Thus the node
type was changed to use a hashed key instead of a string.
The structure used to store a server-state line in an eb-tree has a too
generic name. Instead of state_line, the structure is renamed as
server_state_line.
<state_line.name_name> field is a node in an eb-tree. Thus, instead of
"name_name", we now use "node" to name this field. If is a more explicit
name and not too strange.
It is extremely useful to be able to observe the wakeup latency of some
important I/O operations, so let's accept to inflate the tasklet struct
by 8 extra bytes when DEBUG_TASK is set. With just this we have enough
to get live reports like this:
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on # set profiling tasks {on|auto|off}
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
si_cs_io_cb 8099492 4.833s 596.0ns 8.974m 66.48us
h1_io_cb 7460365 11.55s 1.548us 2.477m 19.92us
process_stream 7383828 22.79s 3.086us 18.39m 149.5us
h1_timeout_task 4157 - - 348.4ms 83.81us
srv_cleanup_toremove_connections751 39.70ms 52.86us 10.54ms 14.04us
srv_cleanup_idle_connections 21 1.405ms 66.89us 30.82us 1.467us
task_run_applet 16 1.058ms 66.13us 446.2us 27.89us
accept_queue_process 7 34.53us 4.933us 333.1us 47.58us
Instead of decrementing grq_total once per task picked from the global
run queue, let's do it at once after the loop like we do for other
counters. This simplifies the code everywhere. It is not expected to
bring noticeable improvements however, since global tasks tend to be
less common nowadays.
The function htx_reserve_max_data() should be used to get an HTX DATA block
with the max possible size. A current block may be extended or a new one
created, depending on the HTX message state. But the idea is to let the
caller to copy a bunch of data without requesting many new blocks. It is its
responsibility to resize the block at the end, to set the final block size.
This function will be used to parse messages with small chunks. Indeed, we
can have more than 2700 1-byte chunks in a 16Kb of input data. So it is easy
to understand how this function may help to improve the parsing of chunk
messages.
When a SRV record was created, it used to register the regular server name
resolution callbacks. That said, SRV records and regular server name
resolution don't work the same way, furthermore on error management.
This patch introduces a new call back to manage DNS errors related to
the SRV queries.
this fixes github issue #50.
Backport status: 2.3, 2.2, 2.1, 2.0
In fd_set_running_excl() we don't reset the old mask in the CAS loop,
so if we fail on the first round, we'll forcefully take the FD on the
next one.
In practice it's used bu fd_insert() and fd_delete() only, none of which
is supposed to be passed an FD which is still in use since in practice,
given that for now only listeners may be enabled on multiple threads at
once.
This can be backported to 2.2 but shouldn't result in fixing any user
visible bug for now.
This function has become large with the multi-queue scheduler. We need
to keep the fast path and the debugging parts inlined, but the rest now
moves to task.c just like was done for task_wakeup(). This has reduced
the code size by 6kB due to less inlining of large parts that are always
context-dependent, and as a side effect, has increased the overall
performance by 1%.
The nb_tasks counter was still global and gets incremented and decremented
for each task_new()/task_free(), and was read in process_runnable_tasks().
But it's only used for stats reporting, so doing this this often is
pointless and expensive. Let's move it to the task_per_thread struct and
have the stats sum it when needed.
Historically we used to call __task_wakeup() with a known tree root but
this is not the case and the code has remained needlessly complicated
with the root calculation in task_wakeup() passed in argument to
__task_wakeup() which compares it again.
Let's get rid of this and just move the detection code there. This
eliminates some ifdefs and allows to simplify the test conditions quite
a bit.
This one is systematically misunderstood due to its unclear name. It
is in fact the number of tasks in the local tasklet list. Let's call
it "tasks_in_list" to remove some of the confusion.
This one is exclusively used as a boolean nowadays and is non-zero only
when the thread-local run queue is not empty. Better check the root tree's
pointer and avoid updating this counter all the time.
This counter is solely used for reporting in the stats and is the hottest
thread contention point to date. Moving it to the scheduler and having a
separate one for the global run queue dramatically improves the performance,
showing a 12% boost on the request rate on 16 threads!
In addition, the thread debugging output which used to rely on rqueue_size
was not totally accurate as it would only report task counts. Now we can
return the exact thread's run queue length.
It is also interesting to note that there are still a few other task/tasklet
counters in the scheduler that are not efficiently updated because some cover
a single area and others cover multiple areas. It looks like having a distinct
counter for each of the following entries would help and would keep the code
a bit cleaner:
- global run queue (tree)
- per-thread run queue (tree)
- per-thread shared tasklets list
- per-thread local lists
Maybe even splitting the shared tasklets lists between pure tasklets and
tasks instead of having the whole and tasks would simplify the code because
there remain a number of places where several counters have to be updated.
The lock was still used exclusively to deal with the concurrency between
the "show sess" release handler and a stream_new() or stream_free() on
another thread. All other accesses made by "show sess" are already done
under thread isolation. The release handler only requires to unlink its
node when stopping in the middle of a dump (error, timeout etc). Let's
just isolate the thread to deal with this case so that it's compatible
with the dump conditions, and remove all remaining locking on the streams.
This effectively kills the streams lock. The measured gain here is around
1.6% with 4 threads (374krps -> 380k).
The global streams list is exclusively used for "show sess", to look up
a stream to shut down, and for the hard-stop. Having all of them in a
single list is extremely expensive in terms of locking when using threads,
with performance losses as high as 7% having been observed just due to
this.
This patch makes the list per-thread, since there's no need to have a
global one in this situation. All call places just iterate over all
threads. The most "invasive" changes was in "show sess" where the end
of list needs to go back to the beginning of next thread's list until
the last thread is seen. For now the lock was maintained to keep the
code auditable but a next commit should get rid of it.
The observed performance gain here with only 4 threads is already 7%
(350krps -> 374krps).
The "show sess" CLI command currently lists all streams and needs to
stop at a given position to avoid dumping forever. Since 2.2 with
commit c6e7a1b8e ("MINOR: cli: make "show sess" stop at the last known
session"), a hack consists in unlinking the stream running the applet
and linking it again at the current end of the list, in order to serve
as a delimiter. But this forces the stream list to be global, which
affects scalability.
This patch introduces an epoch, which is a global 32-bit counter that
is incremented by the "show sess" command, and which is copied by newly
created streams. This way any stream can know whether any other one is
newer or older than itself.
For now it's only stored and not exploited.
RAND_keep_random_devices_open is OpenSSL specific function, not
implemented in LibreSSL and BoringSSL. Let us define guard
HAVE_SSL_RAND_KEEP_RANDOM_DEVICES_OPEN in include/haproxy/openssl-compat.h
That guard does not depend anymore on HA_OPENSSL_VERSION
The runqueue_ticks counts the number of task wakeups and is used to
position new tasks in the run queue, but since we've had per-thread
run queues, the values there are not very relevant anymore and the
nice value doesn't apply well if some threads are more loaded than
others. In addition, letting all threads compete over a shared counter
is not smart as this may cause some excessive contention.
Let's move this index close to the run queues themselves, i.e. one per
thread and a global one. In addition to improving fairness, this has
increased global performance by 2% on 16 threads thanks to the lower
contention on rqueue_ticks.
Fairness issues were not observed, but if any were to be, this patch
could be backported as far as 2.0 to address them.
Historically this function would try to wake the most accurate number of
process_stream() waiters. But since the introduction of filters which could
also require buffers (e.g. for compression), things started not to be as
accurate anymore. Nowadays muxes and transport layers also use buffers, so
the runqueue size has nothing to do anymore with the number of supposed
users to come.
In addition to this, the threshold was compared to the number of free buffer
calculated as allocated minus used, but this didn't work anymore with local
pools since these counts are not updated upon alloc/free!
Let's clean this up and pass the number of released buffers instead, and
consider that each waiter successfully called counts as one buffer. This
is not rocket science and will not suddenly fix everything, but at least
it cannot be as wrong as it is today.
This could have been marked as a bug given that the current situation is
totally broken regarding this, but this probably doesn't completely fix
it, it only goes in a better direction. It is possible however that it
makes sense in the future to backport this as part of a larger series if
the situation significantly improves.
The buffer wait queue used to be global historically but this doest not
make any sense anymore given that the most common use case is to have
thread-local pools. Thus there's no point waking up waiters of other
threads after releasing an entry, as they won't benefit from it.
Let's move the queue head to the thread_info structure and use
ti->buffer_wq from now on.
This revert the commit 63e6cba12 ("MEDIUM: server: add server-states version
2"), but keeping all recent features added to the server-sate file. Instead
of adding a 2nd version for the server-state file format to handle the 5 new
fields added during the 2.4 development, these fields are considered as
optionnal during the parsing. So it is possible to load a server-state file
from HAProxy 2.3. However, from 2.4, these new fields are always dumped in
the server-state file. But it should not be a problem to load it on the 2.3.
This patch seems a bit huge but the diff ignoring the space is much smaller.
The version 2 of the server-state file format is reserved for a real
refactoring to address all issues of the current format.
Remove ebmb_node entry from struct connection and create a dedicated
struct conn_hash_node. struct connection contains now only a pointer to
a conn_hash_node, allocated only for connections where target is of type
OBJ_TYPE_SERVER. This will reduce memory footprints for every
connections that does not need http-reuse such as frontend connections.
In MT_LIST_TRY_ADDQ(), deal with the "prev" field of the element before the
"next". If the element is the first in the list, then its next will
already have been locked when we locked list->prev->next, so locking it
again will fail, and we'll start over and over.
This should be backported to 2.3.
The maximum number of connections accepted at once by a thread for a single
listener used to default to 64 divided by the number of processes but the
tasklet-based model is much more scalable and benefits from smaller values.
Experimentation has shown that 4 gives the highest accept rate for all
thread values, and that 3 and 5 come very close, as shown below (HTTP/1
connections forwarded per second at multi-accept 4 and 64):
ac\thr| 1 2 4 8 16
------+------------------------------
4| 80k 106k 168k 270k 336k
64| 63k 89k 145k 230k 274k
Some tests were also conducted on SSL and absolutely no change was observed.
The value was placed into a define because it used to be spread all over the
code.
It might be useful at some point to backport this to 2.3 and 2.2 to help
those who observed some performance regressions from 1.6.
Since a lot of internal callbacks were turned to tasklets, the runqueue
depth had not been readjusted from the default 200 which was initially
used to favor batched processing. But nowadays it appears too large
already based on the following tests conducted on a 8c16t machine with
a simple config involving "balance leastconn" and one server. The setup
always involved the two threads of a same CPU core except for 1 thread,
and the client was running over 1000 concurrent H1 connections. The
number of requests per second is reported for each (runqueue-depth,
nbthread) couple:
rq\thr| 1 2 4 8 16
------+------------------------------
32| 120k 159k 276k 477k 698k
40| 122k 160k 276k 478k 722k
48| 121k 159k 274k 482k 720k
64| 121k 160k 274k 469k 710k
200| 114k 150k 247k 415k 613k <-- default
It's possible to save up to about 18% performance by lowering the
default value to 40. One possible explanation to this is that checking
I/Os more frequently allows to flush buffers faster and to smooth the
I/O wait time over multiple operations instead of alternating phases
of processing, waiting for locks and waiting for new I/Os.
The total round trip time also fell from 1.62ms to 1.40ms on average,
among which at least 0.5ms is attributed to the testing tools since
this is the minimum attainable on the loopback.
After some observation it would be nice to backport this to 2.3 and
2.2 which observe similar improvements, since some users have already
observed some perf regressions between 1.6 and 2.2.
SCTL (signed certificate timestamp list) specified in RFC6962
was implemented in c74ce24cd22e8c683ba0e5353c0762f8616e597d, let
us introduce macro HAVE_SSL_SCTL for the HAVE_SSL_SCTL sake,
which in turn is based on SN_ct_cert_scts, which comes in the same commit
smp_is_safe() function is used to be sure a sample may be safely
modified. For string samples, a test is performed to verify if there is a
null-terminated byte. If not, one is added, if possible. It means if the
sample is not const and if there is some free space in the buffer, after
data. However, we must not try to read the null-terminated byte if the
string sample is too long (data >= size) or if the size is equal to
zero. This last test was not performed. Thus it was possible to consider a
string sample as safe by testing a byte outside the buffer.
Now, a zero size string sample is always considered as unsafe and is
duplicated when smp_make_safe() is called.
This patch must be backported in all stable versions.
It's pretty easy to pre-initialize the index, change it on free() and check
it during the wakeup, so let's do this to ease detection of any accidental
task_wakeup() after a task_free() or tasklet_wakeup() after a tasklet_free().
If this would ever happen we'd then get a backtrace and a core now. The
index's parity is respected so that the call history remains exploitable.
The idea is to know who woke a task up, by recording the last two
callers in a rotating mode. For now it's trivial with task_wakeup()
but tasklet_wakeup_on() will require quite some more changes.
This typically gives this from the debugger:
(gdb) p t->debug
$2 = {
caller_file = {0x0, 0x8c0d80 "src/task.c"},
caller_line = {0, 260},
caller_idx = 1
}
or this:
(gdb) p t->debug
$6 = {
caller_file = {0x7fffe40329e0 "", 0x885feb "src/stream.c"},
caller_line = {284, 284},
caller_idx = 1
}
But it also provides a trivial macro allowing to simply place a call in
a task/tasklet handler that needs to be observed:
DEBUG_TASK_PRINT_CALLER(t);
Then starting haproxy this way would trivially yield such info:
$ ./haproxy -db -f test.cfg | sort | uniq -c | sort -nr
199992 h1_io_cb woken up from src/sock.c:797
51764 h1_io_cb woken up from src/mux_h1.c:3634
65 h1_io_cb woken up from src/connection.c:169
45 h1_io_cb woken up from src/sock.c:777
The two algos defining these functions (first and leastconn) do not need the
server's lock. However it's already present in pendconn_process_next_strm()
so the API must be updated so that the functions may take it if needed and
that the callers indicate whether they already own it.
As such, the call places (backend.c and stream.c) now do not take it
anymore, queue.c was unchanged since it's already held, and both "first"
and "leastconn" were updated to take it if not already held.
A quick test on the "first" algo showed a jump from 432 to 565k rps by
just dropping the lock in stream.c!
This reverts commit 8f1f177ed0.
Repeated tests have shown a small perforamnce degradation of ~1.8%
caused by this patch at high request rates on 16 threads. The exact
cause is not yet perfectly known but it probably stems in slower
accesses for non-64-bit aligned atomic accesses.
The remaining contention on the server lock solely comes from
sess_change_server() which takes the lock to add and remove a
stream from the server's actconn list. This is both expensive
and pointless since we have mt-lists, and this list is only
used by the CLI's "shutdown server sessions" command!
Let's migrate to an mt-list and remove the need for this costly
lock. By doing so, the request rate increased by ~1.8%.
Since OTHER_LOCK is commonly used it's become much more difficult to
profile lock contention by temporarily changing a lock label. Let's
add DEBUG1..5 to serve only for debugging. These ones must not be
used in committed code. We could decide to only define them when
DEBUG_THREAD is set but that would complicate attempts at measuring
performance with debugging turned off.
The server lock was taken preventively for anything in health_adjust(),
including the static config checks needed to detect that the lock was not
needed, while the function is always called on the response path to update
a server's status. This was responsible for huge contention causing a
performance drop of about 17% on 16 threads. Let's move the lock only
where it should be, i.e. inside the function around the critical sections
only. By doing this, a 16-thread process jumped back from 575 to 675 krps.
This should be backported to 2.3 as the situation degraded there, and
maybe later to 2.2.
Always try to remove a connexion from its toremove_list in conn_free.
This prevents a double-free in case the connection is freed but was
already added in toremove_list.
This bug was easily reproduced by running 4-5 runs of inject on a
single-thread instance of haproxy :
$ inject -u 10000 -d 10 -G 127.0.0.1:20080
A crash would soon be triggered in srv_cleanup_toremove_connections.
This does not need to be backported.
move listen status to a helper, defining both status enum and string
definition.
this will be helpful to be reused in prometheus code. It also removes
this hard-to-read nested ternary.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
prometheus approach requires to output all values for a given metric
name; meaning we iterate through all metrics, and then iterate in the
inner loop on all objects for this metric.
In order to allow more code reuse, adapt the stats API to be able to
select one field or fill them all otherwise.
From this patch it should be possible to add support for listen stats in
prometheus.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
This patch introduce the "dns_stream_nameserver" to use DNS over
TCP on strict nameservers. For the upper layer it is analog to
the api used with udp nameservers except that the user que switch
the name server in "stream" mode at the init using "dns_stream_init".
The fallback from UDP to TCP is not handled and this is not the
purpose of this feature. This is done to choose the transport layer
during the initialization.
Currently there is a hardcoded limit of 4 pipelined transactions
per TCP connections. A batch of idle connections is expired every 5s.
This code is designed to support a maximum DNS message size on TCP: 64k.
Note: this code won't perform retry on unanswered queries this
should be handled by the upper layer
This patch splits current dns.c into two files:
The first dns.c contains code related to DNS message exchange over UDP
and in future other TCP. We try to remove depencies to resolving
to make it usable by other stuff as DNS load balancing.
The new resolvers.c inherit of the code specific to the actual
resolvers.
Note:
It was really difficult to obtain a clean diff dur to the amount
of moved code.
Note2:
Counters and stuff related to stats is not cleany separated because
currently counters for both layers are merged and hard to separate
for now.
This patch splits recv and send functions in two layers. the
lowest is responsible of DNS message transactions over
the network. Doing this we could use DNS message layer
for something else than resolving. Load balancing for instance.
This patch also re-works the way to init a nameserver and
introduce the new struct dns_dgram_server to prepare the arrival
of dns_stream_server and the support of DNS over TCP.
The way to retry a send failure of a request because of EAGAIN
was re-worked. Previously there was no control and all "pending"
queries were re-played each time it reaches a EAGAIN. This
patch introduce a ring to stack messages in case of sent
failure. This patch is emptied if poller shows that the
socket is ready again to push messages.
Counters are currently stored into lowlevel nameservers struct but
most of them are resolving layer data and increased in the upper layer
So this patch renames the prototype used to allocate/dump them with prefix
'resolv' waiting for a clean split.
Some types are specific to resolver code and a renamed using
the 'resolv' prefix instead 'dns'.
-struct dns_query_item {
+struct resolv_query_item {
-struct dns_answer_item {
+struct resolv_answer_item {
-struct dns_response_packet {
+struct resolv_response {
This patch adds the attribute packed on struct dns_question
because it is directly memcpy to network building a response.
This patch also removes the commented line:
// struct list options; /* list of option records */
because it is also used directly using memcpy to build a request
and must not contain host data.
Resolv callbacks are also updated to rely on counters and not on
nameservers.
"show stat domain dns" will now show the parent id (i.e. resolvers
section name).
The "show peers" output has become huge due to the dictionaries making it
less readable. Now this feature has reached a certain level of maturity
which doesn't warrant to dump it all the time, given that it was essentially
needed by developers. Let's make it optional, and disabled by default, only
when "show peers dict" is requested. The default output reminds about the
command. The output has been divided by 5 :
$ socat - /tmp/sock1 <<< "show peers dict" | wc -l
125
$ socat - /tmp/sock1 <<< "show peers" | wc -l
26
It could be useful to backport this to recent stable versions.
Now default proxies are stored into a dedicated tree, sorted by name.
Only unnamed entries are not kept upon new section creation. The very
first call to cfg_parse_listen() will automatically allocate a dummy
defaults section which corresponds to the previous static one, since
the code requires to have one at a few places.
The first immediately visible benefit is that it allows to reuse
alloc_new_proxy() to allocate a defaults section instead of doing it by
hand. And the secret goal is to allow to keep multiple named defaults
section in memory to reuse them from various proxies.
Now we'll have a tree of named defaults sections. The regular insertion
and lookup functions take care of the capability in order to select the
appropriate tree. A new function proxy_destroy_defaults() removes a
proxy from this tree and frees it entirely.
We don't want to expose this one anymore as we'll soon keep multiple
default proxies. Let's move it inside the parser which is the only
place which still uses it, and initialize it on the fly once needed
instead of doing it at boot time.