When threads are disabled, the compiler complains that we might be
accessing tg->abs[] out of bounds since the array is of size 1. It
cannot know that the condition to do this is never met, and given
that it's not in a fast path, we can make it more obvious.
In case too many thread groups are needed for the threads, we emit
an error indicating the problem. Unfortunately the threads and groups
counts were reversed.
This can be backported to 2.6.
The NUMA detection code tries not to interfere with any taskset the user
could have specified in init scripts. For this it compares the number of
CPUs available with the number the process is bound to. However, the CPU
count is retrieved after an upper bound of MAX_THREADS has been applied
to it, so if the machine has more than 64 CPUs, the comparison always
fails and makes haproxy think the user has already enforced a binding,
and it no longer pins the process to a single NUMA node.
This can be verified by issuing:
$ socat /path/to/sock - <<< "show info" | grep thread
On a dual 48-CPU machine it reports 64, implying that threads are allowed
to run on the second socket:
Nbthread: 64
With this fix, the function properly reports 96, and the output shows 48,
indicating that a single NUMA node was used:
Nbthread: 48
Of course nothing is changed when "no numa-cpu-mapping" is specified:
Nbthread: 64
This can be backported to 2.4.
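The effect of clamping before comparing can be illustrated with a minimal
sketch. The function names are hypothetical, MAX_THREADS is assumed to be
64 as implied above, and which side of the comparison gets clamped is an
assumption made for the illustration:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_THREADS 64  /* assumed value, matching the 64-CPU limit above */

/* Hypothetical helper: apply the MAX_THREADS upper bound to a CPU count. */
static int clamp_to_max_threads(int cpus)
{
    return cpus > MAX_THREADS ? MAX_THREADS : cpus;
}

/* Behaviour described as faulty: one count is clamped before the
 * comparison (here the bound count, which is an assumption), so on a
 * machine with more than MAX_THREADS CPUs both counts always differ.
 */
static bool looks_user_bound_clamped(int cpus_online, int cpus_bound)
{
    assert(cpus_online > 0 && cpus_bound > 0);
    return clamp_to_max_threads(cpus_bound) != cpus_online;
}

/* Comparison on raw counts: the process is considered bound by the user
 * only when it really runs on a different number of CPUs.
 */
static bool looks_user_bound(int cpus_online, int cpus_bound)
{
    assert(cpus_online > 0 && cpus_bound > 0);
    return cpus_bound != cpus_online;
}
```

With 96 online CPUs and no taskset, the clamped variant compares 64 with
96 and wrongly concludes that a binding exists, while the raw comparison
does not.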
thread_set_first_group() and thread_set_first_tmask() were modified
and renamed to instead return the number and mask of the nth group.
Passing zero continues to return the first one, but it will be more
convenient to use this way when building shards.
Aurélien reported a bug making a statement such as "thread 2-2" fail for
a config made of exactly 2 threads. What happens is that the parser for
the "thread" keyword scans a range of thread numbers from either 1..64
or 0,-1,-2 for special values, and presets the bit masks accordingly in
the thread set, except that due to the 1..64 range, the shift length must
be reduced by one. Not doing this causes empty masks for single-bit values
that are exactly equal to the number of threads in the group and fails to
properly parse.
No backport is needed as this was introduced in 2.8-dev3 by commit
bef43dfa6 ("MINOR: thread: add a simple thread_set API").
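As an illustration of the off-by-one, here is a minimal sketch of a range
to mask conversion for thread numbers in 1..64. It is not the code of
parse_thread_set(); the function names and the LONGBITS value are
assumptions, and the second function only mimics the effect of a missing
"- 1":

```c
#include <assert.h>
#include <stdint.h>

#define LONGBITS 64  /* assumed width of a thread mask */

/* Build a mask with bits set for threads <first>..<last>, both numbered
 * from 1 to LONGBITS. Thread N maps to bit N-1, hence the "- 1".
 */
static uint64_t thread_range_mask(unsigned first, unsigned last)
{
    assert(first >= 1 && first <= last && last <= LONGBITS);
    uint64_t upto_last  = ~UINT64_C(0) >> (LONGBITS - last); /* bits 0..last-1 */
    uint64_t from_first = ~UINT64_C(0) << (first - 1);       /* bits first-1 and up */
    return upto_last & from_first;
}

/* Same computation without the "- 1" on the left shift, for illustration
 * only: a single-thread range such as 2-2 yields an empty mask.
 */
static uint64_t thread_range_mask_off_by_one(unsigned first, unsigned last)
{
    assert(first >= 1 && first <= last && last < LONGBITS);
    uint64_t upto_last  = ~UINT64_C(0) >> (LONGBITS - last);
    uint64_t from_first = ~UINT64_C(0) << first;
    return upto_last & from_first;
}
```

For "thread 2-2" the first function returns 0x2 while the second one
returns 0, which matches the empty mask described above.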
Instead of reading and storing a single group and a single mask for a
"thread" directive on a bind line, we now store the complete range in
a thread set that's stored in the bind_conf. The bind_parse_thread()
function now just calls parse_thread_set() to complete the current set,
which starts empty, and thread_resolve_group_mask() was updated to
support retrieving thread group numbers or absolute thread numbers
directly from the pre-filled thread_set, and continue to feed bind_tgroup
and bind_thread. The CLI parsers which were pre-initialized to set the
bind_tgroup to 1 cannot do it anymore as it would prevent one from
restricting the thread set. Instead check_config_validity() now detects
the CLI frontend and passes the info down to thread_resolve_group_mask()
that will automatically use only the group 1's threads for these
listeners. The same is done for the peers listeners for now.
At this step it's already possible to start with all previous valid
configs as well as extended ones supporting comma-delimited thread
sets. In addition the parser already accepts large ranges spanning
multiple groups, but since the underlying listeners infrastructure
is not ready, for now we're maintaining a specific check against this
at the higher level of the config validity check.
The patch is a bit large because thread resolution is performed in
multiple steps, so we need to adjust all of them at once to preserve
functional and technical consistency.
The purpose is to be able to store large thread sets, defined by ranges
that may cross group boundaries, as well as define lists of groups and
masks. The thread_set struct implements the storage, and the parser is
in parse_thread_set(), with a focus on "bind" lines, but not only.
idle and harmless bits in the tgroup_ctx structure were not explicitly
set during boot.
| struct tgroup_ctx ha_tgroup_ctx[MAX_TGROUPS] = { };
As the structure is first statically initialized,
.threads_harmless and .threads_idle are automatically zero-
initialized by the compiler.
Unfortunately, this means that such threads are not considered idle
nor harmless by thread_isolate(_full)() functions until they enter
the polling loop (thread_harmless_now() and thread_idle_now() are
respectively called before entering the polling loop).
Because of this, any attempt to call thread_isolate() or thread_isolate_full()
during a startup phase with nbthread >= 2 will cause thread_isolate() to
loop until every secondary thread makes it through its first polling loop.
If the startup phase is aborted during boot (e.g. "-c" option to check the
configuration), secondary threads may be initialized but will never be started
(i.e. they won't enter the polling loop), thus thread_isolate()
would loop forever in such cases.
We can easily reveal the bug with this patch reproducer:
| diff --git a/src/haproxy.c b/src/haproxy.c
| index e91691658..0b733f6ee 100644
| --- a/src/haproxy.c
| +++ b/src/haproxy.c
| @@ -2317,6 +2317,10 @@ static void init(int argc, char **argv)
| if (pr || px) {
| /* At least one peer or one listener has been found */
| qfprintf(stdout, "Configuration file is valid\n");
| + printf("haproxy will loop...\n");
| + thread_isolate();
| + printf("we will never reach this\n");
| + thread_release();
| deinit_and_exit(0);
| }
| qfprintf(stdout, "Configuration file has no error but will not start (no listener) => exit(2).\n");
Now we start haproxy with a valid config:
$> haproxy -c -f valid.conf
Configuration file is valid
haproxy will loop...
^C
------------------------------------------------------------------------------
This did not cause any issue so far because no early deinit paths require
full thread isolation. But this may change when new features or requirements
are introduced, so we should fix this before it becomes a real issue.
To fix this, we explicitly assign .threads_harmless and .threads_idle
to .threads_enabled value in thread_map_to_groups() function during boot.
This is the proper place to do this since as long as .threads_enabled is not
explicitly set, its default value is also 0 (zero-initialized by the compiler).
Code snippet from the thread_isolate() function:
ulong te = _HA_ATOMIC_LOAD(&ha_tgroup_info[tgrp].threads_enabled);
ulong th = _HA_ATOMIC_LOAD(&ha_tgroup_ctx[tgrp].threads_harmless);
if ((th & te) == te)
break;
Thus thread_isolate(_full)() won't loop forever even if it were to be
used before thread_map_to_groups() is executed.
No backport needed unless this is a requirement.
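The interaction between the zero-initialized masks and the exit condition
can be sketched as follows. The struct and function names are simplified
stand-ins, not the real tgroup_info/tgroup_ctx definitions, and only the
fields discussed above are modelled:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the per-group structures named above; only
 * the fields used in this sketch are present.
 */
struct tgroup_info_sketch { unsigned long threads_enabled; };
struct tgroup_ctx_sketch  { unsigned long threads_harmless, threads_idle; };

/* The exit condition quoted from thread_isolate(): every enabled thread
 * of the group must have its harmless bit set.
 */
static bool group_is_harmless(const struct tgroup_info_sketch *tgi,
                              const struct tgroup_ctx_sketch *tgc)
{
    return (tgc->threads_harmless & tgi->threads_enabled) == tgi->threads_enabled;
}

/* Boot-time assignment described above: once the enabled mask is known,
 * the same threads are marked idle and harmless until they start polling.
 */
static void group_enable_threads(struct tgroup_info_sketch *tgi,
                                 struct tgroup_ctx_sketch *tgc,
                                 unsigned long enabled)
{
    assert(tgi && tgc);
    tgi->threads_enabled  = enabled;
    tgc->threads_harmless = enabled;
    tgc->threads_idle     = enabled;
}
```

With all masks at zero the condition is trivially true, with only
threads_enabled set it is false, and after the boot-time assignment it is
true again.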
A few loops waiting for threads to synchronize such as thread_isolate()
rightfully filter the thread masks via the threads_enabled field that
contains the list of enabled threads. However, they don't use an atomic
load on it. Before 2.7, the equivalent variables were marked as volatile
and were always reloaded. In 2.7 they're fields in ha_tgroup_ctx[], and
the risk that the compiler keeps them in a register inside a loop is not
null at all. In practice when ha_thread_relax() calls sched_yield() or
an x86 PAUSE instruction, it could be verified that the variable is
always reloaded. If these are avoided (e.g. architecture providing
neither solution), it's visible in asm code that the variables are not
reloaded. In this case, if a thread exits just between the moment the
two values are read, the loop could spin forever.
This patch adds the required _HA_ATOMIC_LOAD() on the relevant
threads_enabled fields. It must be backported to 2.7.
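A minimal sketch of such a wait loop, using C11 atomics as a stand-in for
_HA_ATOMIC_LOAD(). The struct, the function name and the bounded spin
count are assumptions made so that the example terminates:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-group state; C11 atomics stand in for the
 * _HA_ATOMIC_LOAD() macro mentioned above.
 */
struct group_state {
    atomic_ulong threads_enabled;
    atomic_ulong threads_harmless;
};

/* Spin for at most <max_spins> iterations until all enabled threads are
 * harmless. Both masks are re-read with an atomic load on every
 * iteration, so the compiler cannot keep a stale copy in a register and
 * a thread that exits meanwhile (clearing its enabled bit) is noticed.
 */
static bool wait_group_harmless(struct group_state *g, unsigned max_spins)
{
    assert(max_spins > 0);
    for (unsigned i = 0; i < max_spins; i++) {
        unsigned long te = atomic_load(&g->threads_enabled);
        unsigned long th = atomic_load(&g->threads_harmless);
        if ((th & te) == te)
            return true;
    }
    return false;
}
```

A real loop would relax the CPU between iterations; that part is left
out here.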
The tree that contains OCSP responses is never locked despite being used
at runtime for OCSP stapling as well as the CLI through "set ssl cert"
and "set ssl ocsp-response" commands.
Everything works though because the certificate_ocsp structure is
refcounted and the tree's entries are cleaned up when SSL_CTXs are
destroyed (thanks to an ex_data entry in which the certificate_ocsp
pointer is stored).
This new lock will come to use when the OCSP auto update mechanism is
fully implemented because this new feature will be based on another tree
that stores the same certificate_ocsp members and updates their contents
periodically.
nbthread, thread-group and thread-groups directives must only be defined in
the very first global sections. It means no other section must have been parsed
before. Indeed, some parts of the configuration depend on the value of these
settings and it is undefined to change them afterwards.
Since version 1.1.0, OpenSSL's libcrypto ignores the provided locking
mechanism and uses pthread's rwlocks instead. The problem is that for
some code paths (e.g. async engines) this results in a huge amount of
syscalls on systems facing a bit of contention, to the point where more
than 80% of the CPU can be spent in the system dealing with spinlocks
just for futex_wake().
This patch provides an alternative by redefining the relevant pthread
rwlocks from the low-overhead version of the progressive rw locks. This
way there will be no more syscalls in case of contention, and CPU will
be burnt in userland. Doing this saves massive amounts of CPU, where
the locks only take 12-15% vs 80% before, which allows SSL to work much
faster on large thread counts (e.g. 24 or more).
The tryrdlock and trywrlock variants have been implemented using a CAS
since their goal is only to succeed on no contention and never to wait.
The pthread_rwlock API is complete except that the timed versions of
the rdlock and wrlock do not wait and simply fall back to trylock
versions.
Since the gains have only been observed with async engines for now,
this option remains disabled by default. It can be enabled at build
time using USE_PTHREAD_EMULATION=1.
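The CAS-based "try" variants can be sketched as below. This is not the
progressive rw locks code nor the actual emulation layer: the lock word
layout (writer flag in the top bit, reader count below) and all names
are assumptions chosen for the illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RW_WRITER UINT32_C(0x80000000)  /* top bit: a writer holds the lock */

/* Hypothetical lock word: writer flag in the top bit, reader count below. */
struct rw_sketch { _Atomic uint32_t state; };

/* Succeeds only if no writer holds the lock; never waits. */
static bool rw_try_rdlock(struct rw_sketch *l)
{
    uint32_t cur = atomic_load(&l->state);

    if (cur & RW_WRITER)
        return false;
    /* fails if the word changed since the load: treated as contention */
    return atomic_compare_exchange_strong(&l->state, &cur, cur + 1);
}

/* Succeeds only if the lock is completely free; never waits. */
static bool rw_try_wrlock(struct rw_sketch *l)
{
    uint32_t expected = 0;

    return atomic_compare_exchange_strong(&l->state, &expected, RW_WRITER);
}

static void rw_rdunlock(struct rw_sketch *l)
{
    uint32_t prev = atomic_fetch_sub(&l->state, 1);

    assert((prev & ~RW_WRITER) > 0);  /* caller must hold a read lock */
    (void)prev;
}

static void rw_wrunlock(struct rw_sketch *l)
{
    uint32_t prev = atomic_fetch_and(&l->state, ~RW_WRITER);

    assert(prev & RW_WRITER);         /* caller must hold the write lock */
    (void)prev;
}
```

Since a failed CAS is simply reported as a failure, no syscall is ever
needed on these paths, which is the property the text above relies on.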
Since these are not used anymore, let's now remove them. Given the
number of places where we're using ti->ltid_bit, maybe an equivalent
might be useful though.
It used to turn group+local to global but now we're doing the exact
opposite as we want to stick to group-local masks. This means that
"thread 3-4" might very well emit what "thread 2/1-2" used to emit
till now for 2 groups and 4 threads. This is needed because we'll
have to support group-local thread masks in receivers.
However the rest of the code (receivers) is not ready yet for this,
so using this code with more than one thread group will definitely
break some bindings.
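The "thread 3-4" example can be reproduced with a minimal sketch of a
global-to-local conversion. The group description struct and the function
name are hypothetical, and rejecting ranges that span several groups is
an assumption of the sketch:

```c
#include <assert.h>

/* Hypothetical description of a group: first global thread (0-based)
 * and number of threads.
 */
struct grp_sketch { unsigned base, count; };

/* Convert a global thread range <first>..<last> (1-based, inclusive)
 * into a group number (1-based) and a group-local mask. Returns 0 on
 * success, -1 if the range is not fully contained in a single group.
 */
static int global_range_to_local(const struct grp_sketch *grps, unsigned nbgrp,
                                 unsigned first, unsigned last,
                                 unsigned *tgid, unsigned long *lmask)
{
    assert(first >= 1 && first <= last);
    for (unsigned g = 0; g < nbgrp; g++) {
        unsigned lo = grps[g].base + 1;              /* first thread of group */
        unsigned hi = grps[g].base + grps[g].count;  /* last thread of group */

        if (first < lo || last > hi)
            continue;
        *tgid  = g + 1;
        *lmask = 0;
        for (unsigned t = first; t <= last; t++)
            *lmask |= 1UL << (t - lo);
        return 0;
    }
    return -1;
}
```

With 2 groups of 2 threads each, the range 3-4 resolves to group 2 and
local mask 0x3, i.e. the same as "thread 2/1-2".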
The principle remains the same, but instead of having a single process
and ignoring extra ones, now we set the affinity masks for the respective
threads of all groups.
The doc was updated with a few extra examples.
Since we have to use masks to verify owners/waiters, we have no other
option but to have them per group. This definitely inflates the size
of the locks, but this is only used for extreme debugging anyway so
that's not dramatic.
Thus as of now, all masks in the lock stats are local bit masks, derived
from ti->ltid_bit. Since at boot ltid_bit might not be set, we just take
care of this situation (since some structs are initialized under lock
during boot), and use bit 0 from group 0 only.
In ha_tkillall(), the current thread's group was used to check for the
thread being running instead of using the target thread's group mask.
Most of the time it would not have any effect unless some groups are
uneven where it can lead to incomplete thread dumps for example.
No backport is needed, this is purely 2.7.
In thread_resolve_group_mask(), if a global thread number is passed
and it belongs to a group greater than 1, an incorrect shift resulted
in shifting that ID again which made it appear nowhere or in a wrong
group possibly. The bug was introduced in 2.5 with commit 627def9e5
("MINOR: threads: add a new function to resolve config groups and
masks") though the groups only start to be usable in 2.7, so there
is no impact for this bug, hence no backport is needed.
thread_isolate() and thread_isolate_full() were relying on a set of thread
masks for all threads in different states (rdv, harmless, idle). This cannot
work anymore when the number of threads increases beyond LONGBITS so we need
to change the mechanism.
What is done here is to have a counter of requesters and the number of the
current isolated thread. Threads which want to isolate themselves increment
the request counter and wait for all threads to be marked harmless (or idle)
by scanning all groups and watching the respective masks. This is possible
because threads cannot escape once they discover this counter, unless they
also want to isolate and possibly pass first. Once all threads are harmless,
the requesting thread tries to self-assign the isolated thread number, and
if it fails it loops back to checking all threads. If it wins it's guaranteed
to be alone, and can drop its harmless bit, so that other competing threads
go back to the loop waiting for all threads to be harmless. The benefit of
proceeding this way is that there's very little write contention on the
thread number (none during work), hence no cache line moves between caches,
thus frozen threads do not slow down the isolated one.
Once it's done, the isolated thread resets the thread number (hence lets
another thread take the place) and decrements the requester count, thus
possibly releasing all harmless threads.
With this change there's no more need for any global mask to synchronize
any thread, and we only need to loop over a number of groups to check
64 threads at a time per iteration. As such, tinfo's threads_want_rdv
could be dropped.
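The request counter and the isolated thread number can be sketched as
follows. This is a heavily simplified, single-attempt model and not the
code of thread_isolate(): all names, the fixed number of groups and the
NO_THREAD marker are assumptions, and the retry loop is left to the
caller:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NB_GROUPS 2      /* assumed for the sketch */
#define NO_THREAD (~0U)  /* "nobody is isolated" marker, an assumption */

struct iso_group {
    atomic_ulong enabled;   /* threads that exist in the group */
    atomic_ulong harmless;  /* threads currently marked harmless */
};

static struct iso_group iso_groups[NB_GROUPS];
static atomic_uint iso_requests;            /* number of isolation requesters */
static atomic_uint iso_thread = NO_THREAD;  /* currently isolated thread */

/* True when every enabled thread of every group is harmless. */
static bool iso_all_harmless(void)
{
    for (int g = 0; g < NB_GROUPS; g++) {
        unsigned long te = atomic_load(&iso_groups[g].enabled);
        unsigned long th = atomic_load(&iso_groups[g].harmless);

        if ((th & te) != te)
            return false;
    }
    return true;
}

/* Announce that one more thread wants to be isolated. */
static void iso_request(void)
{
    atomic_fetch_add(&iso_requests, 1);
}

/* One isolation attempt for thread <tid> of group <grp>, which must
 * already be marked harmless. On success the winner drops its harmless
 * bit so that competitors go back to waiting.
 */
static bool iso_try(unsigned tid, unsigned grp, unsigned long ltid_bit)
{
    unsigned expected = NO_THREAD;

    assert(grp < NB_GROUPS);
    if (!iso_all_harmless())
        return false;
    if (!atomic_compare_exchange_strong(&iso_thread, &expected, tid))
        return false;
    atomic_fetch_and(&iso_groups[grp].harmless, ~ltid_bit);
    return true;
}

/* Leave isolation: free the slot and withdraw the request. */
static void iso_release(unsigned tid)
{
    assert(atomic_load(&iso_thread) == tid);
    (void)tid;
    atomic_store(&iso_thread, NO_THREAD);
    atomic_fetch_sub(&iso_requests, 1);
}
```

The only contended write is the CAS on the isolated thread number, which
happens once per successful isolation.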
This was tested with 64 threads spread into 2 groups, running 64 tasks
(from the debug dev command), 20 "show sess" (thread_isolate()), 20
"add server blah/blah" (thread_isolate()), and 20 "del server blah/blah"
(thread_isolate_full()). The load remained very low (limited by external
socat forks) and no stuck nor starved thread was found.
The thread group info is not sufficient to represent a thread group's
current state as it's read-only. We also need something comparable to
the thread context to represent the aggregate state of the threads in
that group. This patch introduces ha_tgroup_ctx[] and tg_ctx for this.
It's indexed on the group id and must be cache-line aligned. The thread
masks that were global and that do not need to remain global were moved
there (want_rdv, harmless, idle).
Given that all the masks placed there now become group-specific, the
associated thread mask (tid_bit) now switches to the thread's local
bit (ltid_bit). Both are the same for nbtgroups 1 but will differ for
other values.
There's also a tg_ctx pointer in the thread so that it can be reached
from other threads.
This function was added in 2.0 when reworking the thread isolation
mechanism to make it more reliable. However it is fundamentally
incompatible with the full isolation mechanism provided by
thread_isolate_full() since that one will wait for all threads to
become idle while the former will wait for all threads to finish
waiting, causing a deadlock.
Given that it's not used, let's just drop it entirely before it gets
used by accident.
In order to kill all_threads_mask we'll need to have an equivalent for
the thread groups. The all_tgroups_mask does just this, it keeps one bit
set per enabled group.
Since commit cc7a11ee3 ("MINOR: threads: set the tid, ltid and their bit
in thread_cfg") we ought not use (1UL << thr) to get the group mask for
thread <thr>, but (ha_thread_info[thr].ltid_bit). ha_tkillall() needs
this.
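The difference between (1UL << thr) and the local bit shows up as soon as
a thread belongs to a group that does not start at thread 0. A minimal
sketch, using reduced hypothetical structs rather than the real
thread_info and tgroup_info:

```c
#include <assert.h>

/* Hypothetical, reduced versions of the structures mentioned above. */
struct tg_sketch { unsigned base, count; };
struct ti_sketch { unsigned tid, ltid; unsigned long ltid_bit; };

/* Fill the global ID, the local ID and the local bit for global thread
 * <tid> belonging to group <tg>. The local bit is relative to the
 * group's first thread, so it always fits in one mask even when <tid>
 * itself is 64 or more.
 */
static void ti_set_ids(struct ti_sketch *ti, const struct tg_sketch *tg,
                       unsigned tid)
{
    assert(tid >= tg->base && tid < tg->base + tg->count);
    ti->tid      = tid;
    ti->ltid     = tid - tg->base;
    ti->ltid_bit = 1UL << ti->ltid;
}
```

For global thread 65 in a group starting at thread 64, the local bit is
0x2, whereas shifting 1UL by 65 would not even be a valid operation on a
64-bit mask.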
In order to replace the global "all_threads_mask" we'll need to have an
equivalent per group. Take this opportunity for calling it threads_enabled
and make sure which ones are counted there (in case in the future we allow
to stop some).
Now that the tgid is accessible from the thread, it's pointless to have
it in the group, and it was only set but never used. However we'll soon
frequently need the mask corresponding to the group ID and the risk of
getting it wrong with the +1 or to shift 1 instead of 1UL is important,
so let's store the tgid_bit there.
At several places we're dereferencing the thread group just to catch
the group number, and this will become even more required once we start
to use per-group contexts. Let's just add the tgid in the thread_info
struct to make this easier.
Each thread has its own local thread id and its own global thread id,
in addition to the masks corresponding to each. Once the global thread
ID can go beyond 64 it will not be possible to have a global thread ID
bit anymore, so better start to remove it and use only the local one
from the struct thread_info.
There were plenty of leftovers from old code that were never removed
and that are not needed at all since these files do not use any
definition depending on fcntl.h, let's drop them.
The constructor present there could be replaced with an initcall.
This one is set at level STG_PREPARE because it also zeroes the
lock_stats, and it's a bit odd that it could possibly have been
scheduled to run after other constructors that might already
preset some of these locks by accident.
In the configuration sometimes we'll omit a thread group number to designate
a global thread number range, and sometimes we'll mention the group and
designate IDs within that group. The operation is more complex than it
seems due to the need to check for ranges spanning between multiple groups
and determining groups from threads from bit masks and remapping bit masks
between local/global.
This patch adds a function to perform this operation, it takes a group and
mask on input and updates them on output. It's designed to be used by "bind"
lines but will likely be usable at other places if needed.
For situations where specified threads do not exist in the group, we have
the choice in the code between silently fixing the thread set or failing
with a message. For now the better option seems to be to return an error, but if
it turns out to be an issue we can easily change that in the future. Note
that it should only happen with "x/even" when group x only has one thread.
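The "x/even" corner case can be illustrated with a minimal sketch that
restricts a requested group-local mask to the threads that exist. The
function name and the two mask constants are hypothetical, and failing
only when no requested thread exists is an assumption about the policy:

```c
#include <assert.h>

/* Hypothetical masks for the "odd" and "even" keywords: thread N maps to
 * bit N-1, so odd threads use even bit positions and vice versa.
 */
#define ODD_THREADS  0x5555555555555555ULL
#define EVEN_THREADS 0xAAAAAAAAAAAAAAAAULL

/* Restrict a requested group-local mask to the <count> threads that
 * exist in the group. Returns 0 and updates <mask> on success, or -1 if
 * none of the requested threads exists.
 */
static int restrict_mask_to_group(unsigned count, unsigned long long *mask)
{
    assert(count >= 1 && count <= 64);
    unsigned long long valid = ~0ULL >> (64 - count);

    if (!(*mask & valid))
        return -1;
    *mask &= valid;
    return 0;
}
```

With a group made of a single thread, "even" leaves no valid thread and
the function fails, while "odd" resolves to thread 1.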
This is the equivalent of "tid" for ease of access. In the future if we
make th_cfg a pure thread-local array (not a pointer), it may make sense
to move it there.
ha_set_tid() was randomly used either to explicitly set thread 0 or to
set any possibly incomplete thread during boot. Let's replace it with
a pointer to a valid thread or NULL for any thread. This allows us to
check that the designated threads are always valid, and to ignore the
thread 0's mapping when setting it to NULL, and always use group 0 with
it during boot.
The initialization code is also cleaner, as we don't pass ugly casts
of a thread ID to a pointer anymore.
This will be a convenient way to communicate the thread ID and its
local ID in the group, as well as their respective bits when creating
the threads or when only a pointer is given.
This will ease the reporting of the current thread group ID when coming
from the thread itself, especially since it returns the visible ID,
starting at 1.
This takes care of unassigned threads groups and places unassigned
threads there, in a more or less balanced way. Too sparse allocations
may still fail though. For now with a maximum group number fixed to 1
nothing can really fail.
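One possible balancing policy is sketched below. The message above does
not state the exact algorithm, so this is only an assumption: threads are
split evenly and the remainder goes to the first groups, and groups that
were already assigned in the configuration are not handled:

```c
#include <assert.h>

/* Spread <nbthread> threads over <nbgrp> groups as evenly as possible:
 * every group gets nbthread / nbgrp threads and the first
 * nbthread % nbgrp groups get one more. <base> and <count> receive the
 * first thread (0-based) and the size of each group.
 */
static void spread_threads(unsigned nbthread, unsigned nbgrp,
                           unsigned *base, unsigned *count)
{
    assert(nbgrp >= 1 && nbthread >= nbgrp);
    unsigned next = 0;

    for (unsigned g = 0; g < nbgrp; g++) {
        count[g] = nbthread / nbgrp + (g < nbthread % nbgrp ? 1 : 0);
        base[g]  = next;
        next    += count[g];
    }
}
```

For 10 threads and 4 groups this yields groups of 3, 3, 2 and 2 threads
starting at threads 0, 3, 6 and 8.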
This registers a mapping of threads to groups by enumerating for each thread
what group it belongs to, and marking the group as assigned. It takes care of
checking for redefinitions, overlaps, and holes. It supports both individual
numbers and ranges. The thread group is referenced from the thread config.
This creates a struct tgroup_info which knows the thread ID of the first
thread in a group, and the number of threads in it. For now there's only
one thread group supported in the configuration, but it may be forced to
other values for development purposes by defining MAX_TGROUPS, and it's
enabled even when threads are disabled and will need to remain accessible
during boot to keep a simple enough internal API.
For the purpose of easing the configurations which do not specify a thread
group, we're starting group numbering at 1 so that thread group 0 can be
"undefined" (i.e. for "bind" lines or when binding tasks).
The goal will be to later move there some global items that must be
made per-group.
We want to make sure that the current thread_info accessed via "ti" will
remain constant, so that we don't accidentally place new variable parts
there and so that the compiler knows that info retrieved from there is
not expected to have changed between two function calls.
Only a few init locations had to be adjusted to use the array and the
rest is unaffected.
The scheduler contains a lot of stuff that is thread-local and not
exclusively tied to the scheduler. Other parts (namely thread_info)
contain similar thread-local context that ought to be merged with
it but that is even less related to the scheduler. However moving
more data into this structure isn't possible since task.h is high
level and cannot be included everywhere (e.g. activity) without
causing include loops.
In the end, it appears that the task_per_thread represents most of
the per-thread context defined with generic types and should simply
move to tinfo.h so that everyone can use them.
The struct was renamed to thread_ctx and the variable "sched" was
renamed to "th_ctx". "sched" used to be initialized manually from
run_thread_poll_loop(), now it's initialized by ha_set_tid() just
like ti, tid, tid_bit.
The memset() in init_task() was removed in favor of a bss initialization
of the array, so that other subsystems can put their stuff in this array.
Since the tasklet array has TL_CLASSES elements, the TL_* definitions
were moved there as well, but it's not a problem.
The vast majority of the change in this patch is caused by the
renaming of the structures.
There is currently a problem related to time keeping. We're mixing
the functions to perform calculations with the os-dependent code
needed to retrieve and adjust the local time.
This patch extracts from time.{c,h} the parts that are solely dedicated
to time keeping. These are the "now" or "before_poll" variables for
example, as well as the various now_*() functions that make use of
gettimeofday() and clock_gettime() to retrieve the current time.
The "tv_*" functions moved there were also more appropriately renamed
to "clock_*".
Other parts used to compute stolen time are in other files, they will
have to be picked next.
These ones are rarely used or only to waste CPU cycles waiting, and are
the last ones requiring system includes in thread.h. Let's uninline them
and move them to thread.c.
This removes the thread identifiers from struct thread_info and moves
them into a static array in thread.c since it's now the only file that
needs to touch them. It's also the only file that needs to include
pthread.h, beyond haproxy.c which needs it to start the poll loop. As
a result, far fewer system includes are needed and the LoC is reduced by
around 3%.