If we've stopped consulting the local wait queue due to too many tasks
(max_processed <= 0), there's no point starting to lock the shared WQ,
check the first task's expiration date, upgrading the lock just to
refrain from doing the work because of the limit. All this does is
increase contention on an already contended system.
Note that there is still a fairness issue in this WQ dequeuing code. If
each thread is busy with expired tasks, no thread will dequeue the global
ones. In practice it doesn't make much sense and should quickly resorb,
but it could be nice to have an alternating flag indicating where to
start from on next call to improve this.
This macro was used both for binding and for lookups. When binding tasks
or FDs, using all_threads_mask instead is better as it will later be per
group. For lookups, ~0UL always does the job. Thus in practice the macro
was already almost not used anymore since the rest of the code could run
fine with a constant of all ones there.
Instead of having a global mask of all the profiled threads, let's have
one flag per thread in each thread's flags. They are never accessed more
than one at a time an are better located inside the threads' contexts for
both performance and scalability.
Since the relaxation of the run-queue locks in 2.0 there has been a
very small but existing race between expired tasks and running tasks:
a task might be expiring and being woken up at the same time, on
different threads. This is protected against via the TASK_QUEUED and
TASK_RUNNING flags, but just after the task finishes executing, it
releases it TASK_RUNNING bit an only then it may go to task_queue().
This one will do nothing if the task's ->expire field is zero, but
if the field turns to zero between this test and the call to
__task_queue() then three things may happen:
- the task may remain in the WQ until the 24 next days if it's in
the future;
- the task may prevent any other task after it from expiring during
the 24 next days once it's queued
- if DEBUG_STRICT is set on 2.4 and above, an abort may happen
- since 2.2, if the task got killed in between, then we may
even requeue a freed task, causing random behaviour next time
it's found there, or possibly corrupting the tree if it gets
reinserted later.
The peers code is one call path that easily reproduces the case with
the ->expire field being reset, because it starts by setting it to
TICK_ETERNITY as the first thing when entering the task handler. But
other code parts also use multi-threaded tasks and rightfully expect
to be able to touch their expire field without causing trouble. No
trivial code path was found that would destroy such a shared task at
runtime, which already limits the risks.
This must be backported to 2.0.
There were a few casts of list* to mt_list* that were upsetting some
old compilers (not sure about the effect on others). We had created
list_to_mt_list() purposely for this, let's use it instead of applying
this cast.
This applicationn specific flag was added in 2.4-dev by commit 6fa8bcdc7
("MINOR: task: add an application specific flag to the state: TASK_F_USR1")
to help preserve a the idle connections status across wakeup calls. While
the code to do this was OK for tasklets, it was wrong for tasks, as in an
effort not to lose it when setting the RUNNING flag (that tasklets don't
have), it ended up being inconditionally set. It just happens that for now
no regular tasks use it, only tasklets.
This fix makes sure we always atomically perform (state & flags | running)
there, using a CAS. It also does it for tasklets because it was possible
to lose some such flags if set by another thread, even though this should
not happen with current code. In order to make the code more readable (and
avoid the previous mistake of repeated flags in the bit field), a new
TASK_PERSISTENT aggregate was declared in task.h for this.
In practice the CAS is cheap here because task states are stable or
convergent so the loop will almost never be taken.
This should be backported to 2.4.
The TI_FL_STUCK flag is manipulated by the watchdog and scheduler
and describes the apparent life/death of a thread so it changes
all the time and it makes sense to move it to the thread's context
for an active thread.
The scheduler contains a lot of stuff that is thread-local and not
exclusively tied to the scheduler. Other parts (namely thread_info)
contain similar thread-local context that ought to be merged with
it but that is even less related to the scheduler. However moving
more data into this structure isn't possible since task.h is high
level and cannot be included everywhere (e.g. activity) without
causing include loops.
In the end, it appears that the task_per_thread represents most of
the per-thread context defined with generic types and should simply
move to tinfo.h so that everyone can use them.
The struct was renamed to thread_ctx and the variable "sched" was
renamed to "th_ctx". "sched" used to be initialized manually from
run_thread_poll_loop(), now it's initialized by ha_set_tid() just
like ti, tid, tid_bit.
The memset() in init_task() was removed in favor of a bss initialization
of the array, so that other subsystems can put their stuff in this array.
Since the tasklet array has TL_CLASSES elements, the TL_* definitions
was moved there as well, but it's not a problem.
The vast majority of the change in this patch is caused by the
renaming of the structures.
The entering_poll/leaving_poll/measure_idle functions that were hard
to classify and used to move to various locations have now been placed
into clock.c since it's precisely about time-keeping. The functions
were renamed to clock_*. The samp_time and idle_time values are now
static since there is no reason for them to be read from outside.
There is currently a problem related to time keeping. We're mixing
the functions to perform calculations with the os-dependent code
needed to retrieve and adjust the local time.
This patch extracts from time.{c,h} the parts that are solely dedicated
to time keeping. These are the "now" or "before_poll" variables for
example, as well as the various now_*() functions that make use of
gettimeofday() and clock_gettime() to retrieve the current time.
The "tv_*" functions moved there were also more appropriately renamed
to "clock_*".
Other parts used to compute stolen time are in other files, they will
have to be picked next.
It's pointless to inline this, it's called exactly once per poll loop,
and it depends on time.h which is quite deep. Let's move that to task.c
along with sched_report_idle().
The idle time calculation stuff was moved to task.h by commit 6dfab112e
("REORG: sched: move idle time calculation from time.h to task.h") but
these two variables that are only maintained by task.{c,h} were still
left in time.{c,h}. They have to move as well.
A number of files currently access activity counters but rely on their
definitions to be inherited from other files (task.c, backend.c hlua.c,
sock.c, pool.c, stats.c, fd.c).
I don't know why I inlined this one, this makes no sense given that it's
only used for stats, and it starts a circular dependency on tinfo.h which
can be problematic in the future. In addition, all the stuff related to
idle time calculation should be with the rest of the scheduler, which
currently is in task.{c,h}, so let's move it there.
Work lists were a mechanism introduced in 1.8 to asynchronously delegate
some work to be performed on another thread via a dedicated task.
The only user was the listeners, to deal with the queue. Nowadays
the tasklets have made this much more convenient, and have replaced
work_lists in the listeners. It seems there will be no valid use case
of work lists anymore, so better get rid of them entirely and keep the
scheduler code cleaner.
__task_queue() must absolutely not be called with TICK_ETERNITY or it
will place a never-expiring node upfront in the timers queue, preventing
any timer from expiring until the process is restarted. Code was found
to cause this using "task_schedule(task, now_ms)" which does this one
millisecond every 49.7 days, so let's add a condition against this. It
must never trigger since any process susceptible to trigger it would
already accumulate tasks until it dies.
An extra test was added in wake_expired_tasks() to detect tasks whose
timeout would have been changed after being queued.
An improvement over this could be in the future to use a non-scalar
type (union/struct) for expiration dates so as to avoid the risk of
using them directly like this. But now_ms is already such a valid
time and this specific construct would still not be caught.
This could even be backported to stable versions to help detect other
occurrences if any.
Implement an equivalent of task_kill for tasklets. This function can be
used to request a tasklet deletion in a thread-safe way.
Currently this function is unused.
This one comes with a very deep dependency hell, only to know that
process_stream() is a function. Dropping it reduces the preprocessed
output from 1.5MB to 640kB.
The current "ADD" vs "ADDQ" is confusing because when thinking in terms
of appending at the end of a list, "ADD" naturally comes to mind, but
here it does the opposite, it inserts. Several times already it's been
incorrectly used where ADDQ was expected, the latest of which was a
fortunate accident explained in 6fa922562 ("CLEANUP: stream: explain
why we queue the stream at the head of the server list").
Let's use more explicit (but slightly longer) names now:
LIST_ADD -> LIST_INSERT
LIST_ADDQ -> LIST_APPEND
LIST_ADDED -> LIST_INLIST
LIST_DEL -> LIST_DELETE
The same is true for MT_LISTs, including their "TRY" variant.
LIST_DEL_INIT keeps its short name to encourage to use it instead of the
lazier LIST_DELETE which is often less safe.
The change is large (~674 non-comment entries) but is mechanical enough
to remain safe. No permutation was performed, so any out-of-tree code
can easily map older names to new ones.
The list doc was updated.
This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1)
or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and
HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.
Currently our atomic ops return a value but it's never known whether
the fetch is done before or after the operation, which causes some
confusion each time the value is desired. Let's create an explicit
variant of these operations suffixed with _FETCH to explicitly mention
that the fetch occurs after the operation, and make use of it at the
few call places.
Instead of setting a hard-limit on runqueue-depth and keeping it short
to maintain fairness, let's allow the scheduler to automatically cut
the existing one in two equal halves if its size is between the configured
size and its double. This will allow to increase the default value while
keeping a low latency.
The default proxy was passed as a variable to all parsers instead of a
const, which is not without risk, especially when some timeout parsers used
to make some int pointers point to the default values for comparisons. We
want to be certain that none of these parsers will modify the defaults
sections by accident, so it's important to mark this proxy as const.
This patch touches all occurrences found (89).
grq_total was incremented when picking tasks from the global run queue,
but this variable was not defined with threads disabled, and the code
was optimized away at -O2. No backport is needed.
This flag will be usable by any application. It will be preserved across
wakeups so the application can use it to do various stuff. Some I/O
handlers will soon benefit from this.
It's been too short for quite a while now and is now full. It's still
time to extend it to 32-bits since we have room for this without
wasting any space, so we now gained 16 new bits for future flags.
The values were not reassigned just in case there would be a few
hidden u16 or short somewhere in which these flags are placed (as
it used to be the case with stream->pending_events).
The patch is tagged MEDIUM because this required to update the task's
process() prototype to use an int instead of a short, that's quite a
bunch of places.
It's cleaner to use a flag from the task's state to detect a tasklet
and it's even cheaper. One of the best benefits is that this will
allow to get the nice field out of the common part since the tasklet
doesn't need it anymore. This commit uses the last task bit available
but that's temporary as the purpose of the change is to extend this.
The preliminary approach to dealing with heavy tasks forced us to quit
the poller after meeting one. Now instead we process at most one per poll
loop and ignore the next ones, so that we get more bandwidth to process
all other classes.
Doing so further reduced the induced HTTP request latency at 100k req/s
under the stress of 1000 concurrent SSL handshakes in the following
proportions:
| default | low-latency
---------+------------+--------------
before | 2.75 ms | 2.0 ms
after | 1.38 ms | 0.98 ms
In both cases, the latency is roughly halved. It's worth noting that
both values are now exactly 10 times better than in 2.4-dev9. Even the
percentiles have much improved. For 16 HTTP connections (1 per thread)
competing with 1000 SSL handshakes, we're seeing these long-tail
latencies (in milliseconds) :
| 99.5% | 99.9% | 100%
-----------+---------+---------+--------
2.4-dev9 | 48.4 | 58.1 | 78.5
previous | 6.2 | 11.4 | 67.8
this patch | 2.8 | 2.9 | 6.1
The task latency profiling report now shows this in default mode:
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on # set profiling tasks {on|auto|off}
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
si_cs_io_cb 3061966 2.224s 726.0ns 42.03s 13.72us
h1_io_cb 3061960 6.418s 2.096us 18.76m 367.6us
process_stream 3059982 9.137s 2.985us 15.52m 304.3us
ssl_sock_io_cb 602657 4.265m 424.7us 4.736h 28.29ms
h1_timeout_task 202973 - - 6.254s 30.81us
accept_queue_process 135547 1.179s 8.699us 16.29s 120.1us
srv_cleanup_toremove_conns 81 15.64ms 193.1us 30.87ms 381.1us
task_run_applet 10 758.7us 75.87us 51.77us 5.176us
srv_cleanup_idle_conns 4 375.3us 93.83us 54.52us 13.63us
And this in low-latency mode, showing that both si_cs_io_cb() and process_stream()
have significantly benefitted from the improvement, with values 50 to 200 times
smaller than 2.4-dev9:
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on # set profiling tasks {on|auto|off}
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
h1_io_cb 6407006 11.86s 1.851us 31.14m 291.6us
process_stream 6403890 18.40s 2.873us 2.134m 20.00us
si_cs_io_cb 6403866 4.139s 646.0ns 1.773m 16.61us
ssl_sock_io_cb 894326 6.407m 429.9us 7.326h 29.49ms
h1_timeout_task 301189 - - 8.440s 28.02us
accept_queue_process 211989 1.691s 7.977us 21.48s 101.3us
srv_cleanup_toremove_conns 220 23.46ms 106.7us 65.61ms 298.2us
task_run_applet 16 1.219ms 76.17us 181.7us 11.36us
srv_cleanup_idle_conns 12 713.3us 59.44us 168.4us 14.03us
The changes are slightly more invasive than previous ones and depend on
recent patches so they are not likely well suited for backporting.
Instead of placing heavy tasklets into the TL_BULK queue, we now place
them into the TL_HEAVY one, which is assigned a default weight of ~1%
load at once. This way heavy tasks will not block TL_BULK anymore.
This class will be used exclusively for heavy processing tasklets. It
will be cleaner than mixing them with the bulk ones. For now it's
allocated ~1% of the CPU bandwidth.
The largest part of the patch consists in re-arranging the fields in the
task_per_thread structure to preserve a clean alignment with one more
list head. Since we're now forced to increase the struct past a second
cache line, it now uses 4 cache lines (for easy multiplying) with the
first two ones being exclusively used by local operations and the third
one mostly by atomic operations. Interestingly, this better arrangement
causes less stress and reduced the response time by 8 microseconds at
1 million requests per second.
While the scheduler is priority-aware and class-aware, and consistently
tries to maintain fairness between all classes, it doesn't make use of a
fine execution budget to compensate for high-latency tasks such as TLS
handshakes. This can result in many subsequent calls adding multiple
milliseconds of latency between the various steps of other tasklets that
don't even depend on this.
An ideal solution would be to add a 4th queue, have all tasks announce
their estimated cost upfront and let the scheduler maintain an auto-
refilling budget to pick from the most suitable queue.
But it turns out that a very simplified version of this already provides
impressive gains with very tiny changes and could easily be backported.
The principle is to reserve a new task flag "TASK_HEAVY" that indicates
that a task is expected to take a lot of time without yielding (e.g. an
SSL handshake typically takes 700 microseconds of crypto computation).
When the scheduler sees this flag when queuing a tasklet, it will place
it into the bulk queue. And during dequeuing, we accept only one of
these in a full round. This means that the first one will be accepted,
will not prevent other lower priority tasks from running, but if a new
one arrives, then the queue stops here and goes back to the polling.
This will allow to collect more important updates for other tasks that
will be batched before the next call of a heavy task.
Preliminary tests consisting in placing this flag on the SSL handshake
tasklet show that response times under SSL stress fell from 14 ms
before the patch to 3.0 ms with the patch, and even 1.8 ms if
tune.sched.low-latency is set to "on".
First, we don't want to measure wakeup times if the call date had not
been set before profiling was enabled at run time. And second, we may
only collect the value before clearing the TASK_IN_LIST bit, otherwise
another wakeup might happen on another thread and replace the call date
we're about to use, hence artificially lower the wakeup times.
It is extremely useful to be able to observe the wakeup latency of some
important I/O operations, so let's accept to inflate the tasklet struct
by 8 extra bytes when DEBUG_TASK is set. With just this we have enough
to get live reports like this:
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on # set profiling tasks {on|auto|off}
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
si_cs_io_cb 8099492 4.833s 596.0ns 8.974m 66.48us
h1_io_cb 7460365 11.55s 1.548us 2.477m 19.92us
process_stream 7383828 22.79s 3.086us 18.39m 149.5us
h1_timeout_task 4157 - - 348.4ms 83.81us
srv_cleanup_toremove_connections751 39.70ms 52.86us 10.54ms 14.04us
srv_cleanup_idle_connections 21 1.405ms 66.89us 30.82us 1.467us
task_run_applet 16 1.058ms 66.13us 446.2us 27.89us
accept_queue_process 7 34.53us 4.933us 333.1us 47.58us
Instead of decrementing grq_total once per task picked from the global
run queue, let's do it at once after the loop like we do for other
counters. This simplifies the code everywhere. It is not expected to
bring noticeable improvements however, since global tasks tend to be
less common nowadays.
Now we don't need to decrement rq_total when we pick a tack in the tree
to immediately increment it again after installing it into the local
list. Instead, we simply add to the local queue count the number of
globally picked tasks. Avoiding this shows ~0.5% performance gains at
1Mreq/s (2M task switches/s).
As indicated in previous commit, this function tries to guess which tree
the task is in to figure what counters to update, while we already have
that info in the caller. Let's just pick the relevant parts to place them
in the caller.
In process_runnable_tasks() we're still calling __task_unlink_rq() to
pick a task, and this function tries to guess where to pick the task
from and which counter to update while the caller's context already
has everything. Worse, the number of local tasks is decremented then
recredited, doubling the operations. In order to avoid this we first
need to keep separate counters for local and global tasks that were
picked. This is what this patch does.
This function has become large with the multi-queue scheduler. We need
to keep the fast path and the debugging parts inlined, but the rest now
moves to task.c just like was done for task_wakeup(). This has reduced
the code size by 6kB due to less inlining of large parts that are always
context-dependent, and as a side effect, has increased the overall
performance by 1%.
The nb_tasks counter was still global and gets incremented and decremented
for each task_new()/task_free(), and was read in process_runnable_tasks().
But it's only used for stats reporting, so doing this this often is
pointless and expensive. Let's move it to the task_per_thread struct and
have the stats sum it when needed.
The test in __task_wakeup() to figure if the remote threads are sleeping
doesn't make sense outside of the global runqueue test, since there are
only two possibilities here: local runqueue or global runqueue, hence a
sleeping thread is another one and can only happen when sending to the
global run queue. Let's move the test inside the "if" block.
Historically we used to call __task_wakeup() with a known tree root but
this is not the case and the code has remained needlessly complicated
with the root calculation in task_wakeup() passed in argument to
__task_wakeup() which compares it again.
Let's get rid of this and just move the detection code there. This
eliminates some ifdefs and allows to simplify the test conditions quite
a bit.
This one is systematically misunderstood due to its unclear name. It
is in fact the number of tasks in the local tasklet list. Let's call
it "tasks_in_list" to remove some of the confusion.
This one is exclusively used as a boolean nowadays and is non-zero only
when the thread-local run queue is not empty. Better check the root tree's
pointer and avoid updating this counter all the time.
This counter is solely used for reporting in the stats and is the hottest
thread contention point to date. Moving it to the scheduler and having a
separate one for the global run queue dramatically improves the performance,
showing a 12% boost on the request rate on 16 threads!
In addition, the thread debugging output which used to rely on rqueue_size
was not totally accurate as it would only report task counts. Now we can
return the exact thread's run queue length.
It is also interesting to note that there are still a few other task/tasklet
counters in the scheduler that are not efficiently updated because some cover
a single area and others cover multiple areas. It looks like having a distinct
counter for each of the following entries would help and would keep the code
a bit cleaner:
- global run queue (tree)
- per-thread run queue (tree)
- per-thread shared tasklets list
- per-thread local lists
Maybe even splitting the shared tasklets lists between pure tasklets and
tasks instead of having the whole and tasks would simplify the code because
there remain a number of places where several counters have to be updated.
The runqueue_ticks counts the number of task wakeups and is used to
position new tasks in the run queue, but since we've had per-thread
run queues, the values there are not very relevant anymore and the
nice value doesn't apply well if some threads are more loaded than
others. In addition, letting all threads compete over a shared counter
is not smart as this may cause some excessive contention.
Let's move this index close to the run queues themselves, i.e. one per
thread and a global one. In addition to improving fairness, this has
increased global performance by 2% on 16 threads thanks to the lower
contention on rqueue_ticks.
Fairness issues were not observed, but if any were to be, this patch
could be backported as far as 2.0 to address them.
Now when the profiling is enabled, the scheduler wlil update per-function
task-level statistics on number of calls, cpu usage and lateny, that could
later be checked using "show profiling". This will immediately make it
obvious what functions are responsible for others' high latencies or which
ones are suffering from others, and should help spot issues like undesired
wakeups. For now the stats are only collected but not reported (though they
are readable from sched_activity[] under gdb).
In issue #958 Ashley Penney reported intermittent crashes on AWS's ARM
nodes which would not happen on x86 nodes. After investigation it turned
out that the Neoverse N1 CPU cores used in the Graviton2 CPU are much
more aggressive than the usual Cortex A53/A72/A55 or any x86 regarding
memory ordering.
The issue that was triggered there is that if a tasklet_wakeup() call
is made on a tasklet scheduled to run on a foreign thread and that
tasklet is just being dequeued to be processed, there can be a race at
two places:
- if MT_LIST_TRY_ADDQ() happens between MT_LIST_BEHEAD() and
LIST_SPLICE_END_DETACHED() if the tasklet is alone in the list,
because the emptiness tests matches ;
- if MT_LIST_TRY_ADDQ() happens during LIST_DEL_INIT() in
run_tasks_from_lists(), then depending on how LIST_DEL_INIT() ends
up being implemented, it may even corrupt the adjacent nodes while
they're being reused for the in-tree storage.
This issue was introduced in 2.2 when support for waking up remote
tasklets was added. Initially the attachment of a tasklet to a list
was enough to know its status and this used to be stable information.
Now it's not sufficient to rely on this anymore, thus we need to use
a different information.
This patch solves this by adding a new task flag, TASK_IN_LIST, which
is atomically set before attaching a tasklet to a list, and is only
removed after the tasklet is detached from a list. It is checked
by tasklet_wakeup_on() so that it may only be done while the tasklet
is out of any list, and is cleared during the state switch when calling
the tasklet. Note that the flag is not set for pure tasks as it's not
needed.
However this introduces a new special case: the function
tasklet_remove_from_tasklet_list() needs to keep both states in sync
and cannot check both the state and the attachment to a list at the
same time. This function is already limited to being used by the thread
owning the tasklet, so in this case the test remains reliable. However,
just like its predecessors, this function is wrong by design and it
should probably be replaced with a stricter one, a lazy one, or be
totally removed (it's only used in checks to avoid calling a possibly
scheduled event, and when freeing a tasklet). Regardless, for now the
function exists so the flag is removed only if the deletion could be
done, which covers all cases we're interested in regarding the insertion.
This removal is safe against a concurrent tasklet_wakeup_on() since
MT_LIST_DEL() guarantees the atomic test, and will ultimately clear
the flag only if the task could be deleted, so the flag will always
reflect the last state.
This should be carefully be backported as far as 2.2 after some
observation period. This patch depends on previous patch
"MINOR: task: remove __tasklet_remove_from_tasklet_list()".
This function is only used at a single place directly within the
scheduler in run_tasks_from_lists() and it really ought not be called
by anything else, regardless of what its comment says. Let's delete
it, move the two lines directly into the call place, and take this
opportunity to factor the atomic decrement on tasks_run_queue. A comment
was added on the remaining one tasklet_remove_from_tasklet_list() to
mention the risks in using it.
In process_runnable_tasks(), we walk the run queue and pick tasks to
insert them into the local list. And for each of these operations we
perform a few increments, some of which are atomic, and they're even
performed under the runqueue's lock. This is useless inside the loop,
better do them at the end, since we don't use these values inside the
loop and they're not used anywhere else either during this time. The
only one is task_list_size which is accessed in parallel by other
threads performing remote tasklet wakeups, but it's already
approximative and is used to decide to get out of the loop when the
limit is reached. So now we compute it first as an initial budget
instead.