Allow haproxy to take over idle connections from thread groups other
than our own. To control that, add a new tunable,
tune.takeover-other-tg-connections. It can have 3 values, "none", where
we won't attempt to get connections from the other thread group (the
default), "restricted", where we only will try to get idle connections
from other thread groups when we're using reverse HTTP, and "full",
where we always try to get connections from other thread groups.
Unless there is a special need, it is advised to use "none" (or
"restricted" when using reverse HTTP), as taking over connections from
other thread groups may have a performance impact.
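As an illustration, assuming a version that ships this tunable, a reverse
HTTP setup would typically enable the restricted mode in the global section
(values other than the directive name shown in the text are examples):

    global
        tune.takeover-other-tg-connections restricted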
That's essentially in order to help with debugging strange cases like
the occasional epoll issues/races, by keeping a counter of how many
times an FD was taken over since it was last inserted. The room is
available so let's use it; if that space is needed later, this patch can
easily be reverted.
The counter is also reported in "show fd" as "tkov".
During tests, it's pretty visible that with many threads and a large
number of FDs, the process may take time to be ready. The reason for
this is that the full fdtab array is scanned by each and every thread
at boot in fd_reregister_all() in order to make each thread-local
poller adopt the FDs that are relevant to it. The problem is that
when dealing with 1-2M FDs and 64+ threads, it starts to represent
quite a number of loops, and usually the fdtab array doesn't entirely
fit in the CPU's L3 cache, causing extra memory accesses.
It's particularly visible when issuing debugging commands to the CLI
because the first one usually fails while the CPU is at 100% for half
a second (which is also socat's timeout). A quick test with this:
    global
        stats socket /tmp/sock1 level admin mode 666
        stats timeout 1h
        maxconn 2000000
And the following script started in another window:
    while ! time socat -t5 - /tmp/sock1 <<< "show version";do date -Ins;done
shows that the socat instance that succeeds takes 1.58s on an Ampere
Altra with 80 cores; this requires changing the timeout (which defaults
to half a second), otherwise it returns nothing. In addition, it also
means that some CPU spikes will be noticed during reloads.
Adding a prefetch of the current FD + 16 improves the startup time by 30%
but that's far from being sufficient.
In practice all of this is performed at boot time, a moment at which we
know that extremely few FDs are registered (basically just the listeners),
so FD numbers are usually very low and the rest of the table is scanned
for no benefit. Ideally, knowing upfront how many FDs we have should be
sufficient.
A first approach would consist in counting the entries on a single thread
before registering pollers. It's not necessarily efficient and would take
time anyway.
This patch takes a different approach. It consists in keeping a thread-local
max ("fd_highest") that is updated whenever fd_insert() is called with a
larger number. Of course this is not correct once all threads have started,
but it remains valid during boot since the same value is used during
startup and is cloned for each thread, and no scheduling happens anywhere
during this period. Thus all threads are aware of the highest FD they've
seen registered, even if it was registered in some init code, and this
without having to deal with a shared variable.
Here on the test platform, the script gets its response in 10ms vs 1580ms
before.
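A minimal standalone sketch of the idea (names are illustrative; the text
calls HAProxy's variable "fd_highest"): each thread keeps a local maximum
that gets bumped on insertion, so the boot-time scan can stop at that bound
instead of walking the whole fdtab:

    /* illustrative standalone sketch, not HAProxy's exact code */
    static __thread int fd_highest_seen = -1;

    static inline void note_fd_inserted(int fd)
    {
        if (fd > fd_highest_seen)
            fd_highest_seen = fd;   /* thread-local: no atomics needed */
    }

    /* the boot-time re-registration loop then only scans 0..fd_highest_seen
     * instead of the entire fdtab array.
     */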
As raise_rlim_nofile() was moved to the limits compilation unit, limits.h
now includes the system <sys/resource.h>. So this definition of the rlimit
system structure type is no longer needed to compile the fd unit.
fd_migrate_on() can be used to migrate an existing FD to any thread, even
one belonging to a different group from the current one and from the
caller's. All that is needed is to make sure the FD is still valid when
the operation is performed (which is the case when such operations happen).
This is potentially slightly expensive since it locks the tgid during the
delicate operation, but it is normally performed only from an owning
thread to offer the FD to another one (e.g. reassign a better thread upon
accept()).
In order to permit to migrate FDs from one thread group to another,
we'll need to be able to set a TGID that is compatible with no other
thread group. Either we use a special value or we dedicate a special
bit. Given that we already have way more bits than needed, let's just
sacrifice the topmost one to serve as a lock bit, indicating the tgid
is not valid anymore. This will make all fd_grab_tgid() fail to grab
it.
The new fd_lock_tgid() function now tries to assign a locked tgid to
an idle FD, and fd_unlock_tgid() simply drops the lock bit, revealing
the target tgid.
For now it's still unused so it must not have any effect.
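A standalone sketch of the locking scheme (the constant values and helper
names are illustrative, not the real fd.h macros): the tgid occupies the
lower 16 bits of a 32-bit word, the refcount the upper 16 bits, and the
topmost tgid bit acts as the lock:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define TGID_LOCK 0x8000u   /* topmost bit of the tgid half */

    /* assign a locked target tgid to an idle FD (old tgid, refcount == 0) */
    static bool tgid_lock(_Atomic uint32_t *word, uint16_t old_tgid, uint16_t new_tgid)
    {
        uint32_t expected = old_tgid;   /* no refs held, lock bit clear */
        return atomic_compare_exchange_strong(word, &expected,
                                              (uint32_t)new_tgid | TGID_LOCK);
    }

    /* reveal the target tgid once the migration is finished */
    static void tgid_unlock(_Atomic uint32_t *word)
    {
        atomic_fetch_and(word, ~(uint32_t)TGID_LOCK);
    }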
fd_claim_tgid() uses a CAS to set the desired TGID on the FD. It's only
called from fd_insert() where in the vast majority of cases, the tgid
and refcount are zero before the call. However the loop was optimized
for the case where it was equal to the desired TGID, systematically
causing one extra round in the loop there. Better start assuming a
zero value.
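As a standalone illustration of that point (hypothetical helper, not the
real fd_claim_tgid()), the CAS loop now starts from the value the word most
likely holds, i.e. zero:

    #include <stdatomic.h>
    #include <stdint.h>

    static void claim_word(_Atomic uint32_t *word, uint32_t desired)
    {
        uint32_t expected = 0;   /* most probable value at fd_insert() time */

        /* spin until any previous holder releases the word, then claim it */
        while (!atomic_compare_exchange_weak(word, &expected, desired))
            expected = 0;
    }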
A few loops waiting for threads to synchronize such as thread_isolate()
rightfully filter the thread masks via the threads_enabled field that
contains the list of enabled threads. However, it doesn't use an atomic
load on it. Before 2.7, the equivalent variables were marked as volatile
and were always reloaded. In 2.7 they're fields in ha_tgroup_ctx[], and
the risk that the compiler keeps them in a register inside a loop is not
null at all. In practice when ha_thread_relax() calls sched_yield() or
an x86 PAUSE instruction, it could be verified that the variable is
always reloaded. If these are avoided (e.g. an architecture providing
neither solution), it's visible in the asm code that the variables are not
reloaded. In this case, if a thread exits just between the moment the
two values are read, the loop could spin forever.
This patch adds the required _HA_ATOMIC_LOAD() on the relevant
threads_enabled fields. It must be backported to 2.7.
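A standalone illustration of the pitfall and the fix (not HAProxy's actual
loop): the atomic load forces the value to be re-read on every iteration,
so the compiler cannot keep it cached in a register:

    #include <stdatomic.h>

    static void wait_until_cleared(_Atomic unsigned long *enabled,
                                   unsigned long waited_mask)
    {
        /* atomic_load() is re-issued each iteration; a plain read could be
         * hoisted out of the loop and spin forever once a thread exits.
         */
        while (atomic_load(enabled) & waited_mask)
            ;   /* the real code calls ha_thread_relax() here */
    }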
In issue #1866 an issue was reported under docker, by which a user cannot
lower the number of FDs needed. It looks like a restriction imposed by this
environment, but it results in an error while it ought not to in the
case of shrinking.
This patch adds a new function raise_rlim_nofile() that takes the desired
new setting, compares it to the current one, and only calls setrlimit() if
one of the values in the new setting is larger than the older one. As such
it will continue to emit warnings and errors in case of failure to raise
the limit but will never shrink it.
This patch is only preliminary to another one, but will have to be
backported where relevant (likely only 2.6).
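A hedged sketch of that logic (the function name and exact error handling
below are illustrative, only the intent is taken from the text): compare
the requested values against the current limit and only ever raise it:

    #include <sys/resource.h>

    static int raise_nofile_sketch(const struct rlimit *new)
    {
        struct rlimit cur;

        if (getrlimit(RLIMIT_NOFILE, &cur) == 0 &&
            new->rlim_cur <= cur.rlim_cur && new->rlim_max <= cur.rlim_max)
            return 0;                          /* shrinking: silently skip */
        return setrlimit(RLIMIT_NOFILE, new);  /* raising: may still warn/fail */
    }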
Since the API is still a bit young, let's make sure nobody tries to
assign an FD to a group not strictly within 1..MAX_TGROUPS, as that would
indicate a bug.
Note: some of these might be relaxed to BUG_ON_HOT() in the future.
When a new fd is inserted in the fdtab array, its state is initialized. The
"newstate" variable is used to compute the right state (0 by default, but
FD_ET_POSSIBLE flag is set if edge-triggered is supported for the fd).
However, this variable was never used and the fd state was always set to 0.
Now, the fd state is initialized with the "newstate" variable.
This bug was introduced by commit ddedc1662 ("MEDIUM: fd: make
fd_insert/fd_delete atomically update fd.tgid"). No backport needed.
At boot the pollers are allocated for each thread and they need to
reprogram updates for all FDs they will manage. This code is not
trivial, especially when trying to respect thread groups, so we'd
rather avoid duplicating it.
Let's centralize this into fd.c with this function. It avoids closed
FDs, those whose thread mask doesn't match the requested one or whose
thread group doesn't match the requested one, and performs the update
if required under thread-group protection.
These functions need to set/reset the FD's tgid but when they're called
there may still be wakeups on other threads that discover late updates
and have to touch the tgid at the same time. As such, it is not possible
to just read/write the tgid there. It must only be done using operations
that are compatible with what other threads may be doing.
As we're using inc/dec on the refcount, it's safe to AND the area to zero
the lower part when resetting the value. However, in order to set the
value, there's no other choice but fd_claim_tgid() which will assign it
only if possible (via a CAS). This is convenient in the end because it
protects the FD's masks from being modified by late threads, so while
we hold this refcount we can safely reset the thread_mask and a few other
elements. A debug test for non-null masks was added to fd_insert() as it
must not be possible to face this situation thanks to the protection
offered by the tgid.
With the change that was started on other masks, the thread mask was
still not fully converted, sometimes being used as a global mask and
sometimes as a local one. This finishes the code modifications so that
the mask is always considered as a group-local mask. This doesn't
change anything as long as there's a single group, but is necessary
for groups 2 and above since it's used against running_mask and so on.
It's an AND, so it destroys information, and due to this there's one call
place where we have to perform two reads to know the previous value before
changing it. With a fetch-and-and instead, a single operation tells us
whether the bit was previously present, which is more efficient.
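A standalone illustration of the difference (standard C11 atomics rather
than HAProxy's macros): the fetch-and-and clears the bit and reports its
previous state in one operation:

    #include <stdatomic.h>

    /* returns non-zero if the bit was set before being cleared */
    static int clear_and_test(_Atomic unsigned long *mask, unsigned long bit)
    {
        /* previously: one atomic load to read the old value, then a second
         * atomic AND to clear the bit; now a single fetch-and-and does both.
         */
        return (atomic_fetch_and(mask, ~bit) & bit) != 0;
    }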
From now on, the FD's running_mask only refers to local thread IDs. However,
there remains a limitation: in updt_fd_polling(), we temporarily have to
check and set shared FDs against .thread_mask, which still contains global
ones. As such, nbtgroups > 1 may break (but this is not yet supported without
special build options).
From now on, the FD's update_mask only refers to local thread IDs. However,
there remains a limitation: in updt_fd_polling(), we temporarily have to
check and set shared FDs against .thread_mask, which still contains global
ones. As such, nbtgroups > 1 may break (but this is not yet supported without
special build options).
The running mask is only valid if the tgid is the expected one. This
function takes a reference on the tgid before reading the running mask,
so that both are checked at once. It returns either the mask or zero if
the tgid differs, thus providing a simple way for a caller to check if
it still holds the FD.
The FD's tgid is refcounted and must be atomically manipulated. Function
fd_grab_tgid() will increase the refcount but only if the tgid matches the
one in argument (likely the current one). fd_claim_tgid() will be used to
self-assign the tgid after waiting for its refcount to reach zero.
fd_drop_tgid() will be used to drop a temporarily held tgid. All of these
are needed to prevent an FD from being reassigned to another group, either
when inspecting/modifying the running_mask, or when checking for updates,
in order to be certain that the mask being seen corresponds to the desired
group. Note that once at least one bit is set in the running mask of an
active FD, it cannot be closed, thus not migrated, thus the reference does
not need to be held long.
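A standalone sketch of how such a refcounted grab/drop pair can work (the
16/16 bit split and constants are illustrative and mirror the tgid/refcount
layout mentioned below, not the exact HAProxy code):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define REF_ONE 0x10000u   /* one reference in the upper 16 bits */

    /* take a reference only if the tgid (lower 16 bits) still matches */
    static bool grab_tgid(_Atomic uint32_t *word, uint16_t tgid)
    {
        uint32_t old = atomic_fetch_add(word, REF_ONE);

        if ((old & 0xffffu) == tgid)
            return true;
        atomic_fetch_sub(word, REF_ONE);   /* wrong group: undo and fail */
        return false;
    }

    /* release a temporarily held reference */
    static void drop_tgid(_Atomic uint32_t *word)
    {
        atomic_fetch_sub(word, REF_ONE);
    }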
The file descriptors will need to know the thread group ID in addition
to the mask. This extends fd_insert() to take the tgid, and will store
it into the FD.
In the FD, the tgid is stored as a combination of tgid on the lower 16
bits and a refcount on the higher 16 bits. This allows to know when it's
really possible to trust the tgid and the running mask. If the refcount is
higher than 1, it indeed indicates that another thread might be in the
process of updating these values.
Since a closed FD must necessarily have a zero refcount, a test was
added to fd_insert() to make sure that it is the case.
It's a bit ugly to see that half of the callers of fd_insert() have to
apply all_threads_mask themselves to the bit field they're passing,
because usually it comes from a listener that may have other bits set.
Let's make the function apply the mask itself.
The update-list needs to be per-group because its inspection is based
on a mask and we need to be certain when scanning it if a mask is for
the same thread or another one. Once per-group there's no doubt about
it, even if the FD's polling changes, the entry remains valid. It will
be needed to check the tgid though.
Note that a soft-stop or pause/resume might not necessarily work here
with tgroups>1, because the operation might be delivered to a thread
that doesn't belong to the group and whose update mask will not reflect
one that is interesting here. We can't do better at this stage.
Since commit d2494e048 ("BUG/MEDIUM: peers/config: properly set the
thread mask") there must not remain any single case of a receiver that
is bound nowhere, so there's no need anymore for thread_mask().
We're adding a test in fd_insert() to make sure this doesn't happen by
accident though. The function was removed and its rare uses were
replaced with the original value of the bind_thread mask.
They were initially made to deal with both the cache and the update list
but there's no cache anymore and keeping them for the update list adds a
lot of obfuscation that is really not desired. Let's get rid of them now.
Their purpose was simply to get a pointer to fdtab[fd].update.{,next,prev}
in order to perform atomic tests and modifications. The offset passed in
argument to the functions (fd_add_to_fd_list() and fd_rm_from_fd_list())
was the offset of the ->update field in fdtab, and as it's not used anymore
it was removed. This also removes a number of casts, though those used by
the atomic ops have to remain since only scalars are supported.
This patch adds two BUG_ON() checks to fd_insert(), verifying that the fd
has been correctly re-initialized in the fdtab before a new insert.
It will raise a BUG if we try to insert the same fd multiple times
without an intermediate fd_delete().
The first one checks that the owner for this fd in fdtab was reset to NULL.
The second one checks that the state flags for this fd in fdtab were reset
to 0.
This patch could be backported to versions >= 2.4.
Every single place where sleeping_thread_mask was still used was to test
or set a single thread. We can now add a per-thread flag to indicate a
thread is sleeping, and remove this shared mask.
The wake_thread() function now always performs an atomic fetch-and-or
instead of a first load then an atomic OR. That's cleaner and more
reliable.
This is not easy to test, as broadcast FD events are rare. The good
way to test for this is to run a very low rate-limited frontend with
a listener that listens to the fewest possible threads (2), and to
send it only 1 connection at a time. The listener will periodically
pause and the wakeup task will sometimes wake up on a random thread
and will call wake_thread():
    frontend test
        bind :8888 maxconn 10 thread 1-2
        rate-limit sessions 5
Alternately, disabling/enabling a frontend in loops via the CLI also
broadcasts such events, but they're more difficult to observe since
this is causing connection failures.
Right now, when an inter-thread wakeup happens, we first check whether the
thread was asleep, and if so we wake the poller up and remove its bit from
the sleeping mask. That's not very clean since the sleeping mask cannot be
entirely trusted: a thread that's about to wake up will already have
its sleeping bit removed.
This patch adds a new per-thread flag (TH_FL_NOTIFIED) to remember that a
thread was notified to wake up. It's cleared before checking the task lists
last, so that new wakeups can be considered again (since wake_thread() is
only used to notify about task wakeups and FD polling changes). This way
we do not need to modify a remote thread's sleeping mask anymore. As such
wake_thread() now only tests and sets the TH_FL_NOTIFIED flag but doesn't
clear sleeping anymore.
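A standalone sketch of the resulting wake_thread() behaviour (the flag value
and the poller notification helper are illustrative): a single atomic
test-and-set ensures only the first notifier actually wakes the target
poller:

    #include <stdatomic.h>

    #define TH_FL_NOTIFIED 0x4u   /* illustrative bit value */

    /* hypothetical stand-in, e.g. a write to the thread's wakeup pipe */
    static void notify_poller(int thr) { (void)thr; }

    static void wake_thread_sketch(_Atomic unsigned int *th_flags, int thr)
    {
        /* set the flag and learn whether it was already set, in one op */
        if (!(atomic_fetch_or(th_flags, TH_FL_NOTIFIED) & TH_FL_NOTIFIED))
            notify_poller(thr);
    }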
When returning from the polling syscall, all pollers have a certain
dance to follow, made of wall clock updates, thread harmless updates,
idle time management and sleeping mask updates. Let's have a centralized
function to deal with all of this boring stuff: fd_leaving_poll(), and
make all the pollers use it.
Almost every call place of wake_thread() checks for sleeping threads and
clears the sleeping mask itself, while the function is solely used for
that. Let's move the check and the clearing of the bit inside the function
itself. Note that updt_fd_polling() still performs the check because its
rules are a bit different.
Instead of seeing each location manipulate the fcntl() themselves and
often forget to check previous flags, let's centralize the functions to
do this. It also allows to drop fcntl.h from most call places and will
ease the adoption of different OS-specific mechanisms if needed. Note
that the fd_set_nonblock() function purposely doesn't check the previous
flags as it's meant to be used on new FDs only.
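A hedged sketch of what such helpers boil down to (plain fcntl() calls; the
names below intentionally differ from the real fd_set_nonblock() and
friends):

    #include <fcntl.h>

    /* for freshly created FDs only: no need to preserve previous flags */
    static int set_nonblock_new(int fd)
    {
        return fcntl(fd, F_SETFL, O_NONBLOCK);
    }

    /* for existing FDs: read-modify-write to keep the flags already set */
    static int set_cloexec(int fd)
    {
        int flags = fcntl(fd, F_GETFD);

        return flags < 0 ? flags : fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
    }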
It's among the cases that would provoke memory corruption, let's add
some tests against negative FDs and those larger than the table. This
must never ever happen and would currently result in silent corruption
or a crash. Better have a noticeable one exhibiting the call chain if
that were to happen.
It's not needed to inline it at all (one call per loop) and it introduces
dependencies, let's move it to fd.c.
Removing the few remaining includes that came with it further reduced
the LoC by ~0.2%, and the build time is now below 6s.
"f(void)" is the correct and preferred form for a function taking no
argument, while some places use the older "f()". These were reported
by clang's -Wmissing-prototypes, for example:
    src/cpuset.c:111:5: warning: no previous prototype for function 'ha_cpuset_size' [-Wmissing-prototypes]
    int ha_cpuset_size()
    include/haproxy/cpuset.h:42:5: note: this declaration is not a prototype; add 'void' to make it a prototype for a zero-parameter function
    int ha_cpuset_size();
                       ^
                        void
This aggregate patch fixes this for the following functions:
ha_backtrace_to_stderr(), ha_cpuset_size(), ha_panic(), ha_random64(),
ha_thread_dump_all_to_trash(), get_exec_path(), check_config_validity(),
mworker_child_nb(), mworker_cli_proxy_(create|stop)(),
mworker_cleantasks(), mworker_cleanlisteners(), mworker_ext_launch_all(),
mworker_reload(), mworker_(env|proc_list)_to_(proc_list|env)(),
mworker_(un|)block_signals(), proxy_adjust_all_maxconn(),
proxy_destroy_all_defaults(), get_tainted(),
pool_total_(allocated|used)(), thread_isolate(_full|)(),
thread(_sync|)_release(), thread_harmless_till_end(),
thread_cpu_mask_forced(), dequeue_all_listeners(), next_timer_expiry(),
wake_expired_tasks(), process_runnable_tasks(), init_acl(),
init_buffer(), (de|)init_log_buffers(), (de|)init_pollers(),
fork_poller(), pool_destroy_all(), pool_evict_from_local_caches(),
pool_total_failures(), dump_pools_to_trash(), cfg_run_diagnostics(),
tv_init_(process|thread)_date(), __signal_process_queue(),
deinit_signals(), haproxy_unblock_signals()
This function already performs a number of checks prior to calling the
IOCB, and detects the change of thread (FD migration). Half of the
controls are still in each poller, and these pollers also maintain
activity counters for various cases.
Note that the unreliable test on thread_mask was removed so that only
the one performed by fd_set_running() is now used, since this one is
reliable.
Let's centralize all that fd-specific logic into the function and make
it return a status among:
    FD_UPDT_DONE,     // update done, nothing else to be done
    FD_UPDT_DEAD,     // FD was already dead, ignore it
    FD_UPDT_CLOSED,   // FD was closed
    FD_UPDT_MIGRATED, // FD was migrated, ignore it now
Some pollers already used to call it last and have nothing to do after
it, regardless of the result. epoll has to delete the FD in case a
migration is detected. Overall this removes more code than it adds.
Since 2.4 with commit f50906519 ("MEDIUM: fd: merge fdtab[].ev and state
for FD_EV_* and FD_POLL_* into state") we can merge all flag updates at
once in fd_update_events(). Previously this was performed in 1 to 3 steps,
setting the polling state, then setting READY_R if in/err/hup, and setting
READY_W if out/err. But since the commit above, all flags are stored
together in the same structure field that is being updated with the new
flags, thus we can simply update the flags altogether and avoid multiple
atomic operations. This even removes the need for atomic ops for FDs that
are not shared.
There's a theoretical race (that we failed to trigger) in function
fd_update_events(), which could strike on idle connections. The "locked"
variable will most often be 0 as the FD is bound to the current thread
only. Another thread could take it over once "locked" is set, change
the thread and running masks. Then the first thread updates the FD's
state non-atomically and possibly overwrites what the other thread was
preparing. It still looks like the FD's state will ultimately converge
though.
The solution against this is to set the running flag earlier so that a
takeover() attempt cannot succeed, or that the fd_set_running() attempt
fails, indicating that nothing needs to be done on this FD.
While this is sufficient for a simple fix to be backported, it leaves
the FD actively polled in the calling thread; this will trigger a second
wakeup which will notice the absence of tid_bit in the thread_mask and
get rid of it.
A more elaborate solution would consist in calling fd_set_running()
directly from the pollers before calling fd_update_events(), getting
rid of the thread_mask test and letting the caller eliminate that FD
from its list if needed.
Interestingly, this code also proves to be suboptimal in that it sets
the FD state twice instead of calculating the new state at once and
always using a CAS to set it. This is a leftover of a simplification
that went into 2.4 and which should be explored in a future patch.
This may be backported as far as 2.2.
With latest commit f50906519 ("MEDIUM: fd: merge fdtab[].ev and state
for FD_EV_* and FD_POLL_* into state") one occurrence of a pair of
chars was missed in fd_stop_both(), resulting in the operation failing
if the upper flags were set. Interestingly it managed to fail
2 tests in all setups in the CI while all used to work fine on my
local machines. Probably the reason is that the chars had enough
room above them for the CAS to fail, then refill "old", overwriting the
upper parts of the stack, and that thanks to this the subsequent tests
worked. With ASAN being used on lots of tests, it very likely caught
it but used to only report failed tests with no more info.
No backport is needed, as this was never released nor backported.
This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1)
or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and
HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.
Currently our atomic ops return a value but it's never known whether
the fetch is done before or after the operation, which causes some
confusion each time the value is desired. Let's create an explicit
variant of these operations suffixed with _FETCH to explicitly mention
that the fetch occurs after the operation, and make use of it at the
few call places.
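A small standalone example of what the suffix conveys (using the GCC/clang
builtin such macros typically map to, which is an assumption here):

    /* the "_FETCH" form returns the value observed after the operation */
    static unsigned int inc_and_report(unsigned int *ctr)
    {
        return __atomic_add_fetch(ctr, 1, __ATOMIC_SEQ_CST);
    }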