In issue #958, Ashley Penney reported intermittent crashes on AWS's ARM
nodes that would not happen on x86 nodes. After investigation it turned
out that the Neoverse N1 cores used in the Graviton2 CPU are much more
aggressive about memory ordering than the usual Cortex A53/A72/A55
cores or any x86.
The issue triggered there is that if a tasklet_wakeup() call is made
on a tasklet scheduled to run on a foreign thread while that tasklet
is just being dequeued to be processed, a race can happen at two
places:
- if MT_LIST_TRY_ADDQ() happens between MT_LIST_BEHEAD() and
  LIST_SPLICE_END_DETACHED() while the tasklet is alone in the list,
  because the emptiness test then matches;
- if MT_LIST_TRY_ADDQ() happens during LIST_DEL_INIT() in
  run_tasks_from_lists(), then depending on how LIST_DEL_INIT() ends
  up being implemented, it may even corrupt the adjacent nodes while
  they're being reused for the in-tree storage.
This issue was introduced in 2.2 when support for waking up remote
tasklets was added. Initially, a tasklet's attachment to a list was
enough to know its status, and that information used to be stable.
It is no longer sufficient to rely on this, so a different piece of
state is needed.
This patch solves the problem by adding a new task flag, TASK_IN_LIST,
which is atomically set before attaching a tasklet to a list and only
removed after the tasklet is detached from it. The flag is checked by
tasklet_wakeup_on() so that the wakeup may only be performed while the
tasklet is out of any list, and it is cleared during the state switch
when the tasklet is called. Note that the flag is not set for pure
tasks, as it is not needed there.
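The principle of the guard can be sketched with standard C11 atomics
instead of HAProxy's internal _HA_ATOMIC_* macros; the bit value and
structure below are illustrative only, not the actual code:

    /* Minimal sketch of the TASK_IN_LIST guard, using C11 atomics
     * rather than HAProxy's macros. The bit value is hypothetical.
     */
    #include <stdatomic.h>

    #define TASK_IN_LIST  0x0800   /* hypothetical bit position */

    struct tasklet_sk {
        _Atomic unsigned short state;
        /* list element, handler, tid... omitted */
    };

    /* Returns 1 if the caller won the right to attach the tasklet
     * to a list, 0 if it is already attached or being attached.
     */
    static int tasklet_claim_for_queuing(struct tasklet_sk *tl)
    {
        unsigned short prev = atomic_fetch_or(&tl->state, TASK_IN_LIST);
        return !(prev & TASK_IN_LIST);
    }

Only the thread observing the clear-to-set transition performs the list
insertion, so a concurrent wakeup can no longer insert a node that is
in the middle of being dequeued.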
However this introduces a new special case: the function
tasklet_remove_from_tasklet_list() needs to keep both pieces of state
in sync, yet cannot check the state and the attachment to a list at
the same time. Since this function is already restricted to the thread
owning the tasklet, the test remains reliable in that case. Still,
just like its predecessors, this function is wrong by design and
should probably be replaced with a stricter one, a lazy one, or be
removed entirely (it is only used in checks to avoid calling a
possibly scheduled event, and when freeing a tasklet). Regardless, for
now the function remains, and the flag is removed only if the deletion
could actually be performed, which covers all the cases we're
interested in regarding the insertion. This removal is safe against a
concurrent tasklet_wakeup_on() since MT_LIST_DEL() guarantees an
atomic test, and the flag is ultimately cleared only if the task could
be deleted, so it always reflects the latest state.
This should be carefully backported as far as 2.2 after some
observation period. This patch depends on the previous patch
"MINOR: task: remove __tasklet_remove_from_tasklet_list()".
This function is only used at a single place, directly within the
scheduler in run_tasks_from_lists(), and it really ought not to be
called by anything else, regardless of what its comment says. Let's
delete it, move its two lines directly to the call site, and take this
opportunity to factor out the atomic decrement of tasks_run_queue. A
comment was added to the remaining tasklet_remove_from_tasklet_list()
to mention the risks of using it.
This function is only called at a single place and adds more confusion
than it removes. It also suggests it could be used outside of the
scheduler, whereas it absolutely must not be. Let's just move its two
lines to the call site, making the code more readable there. In
addition, this clearly shows that the preliminary LIST_INIT() is
useless since the entry is immediately overwritten.
This counter is only ever updated and never consumed, and in addition
the updates are not atomic, so it's very unlikely to be correct on
multi-CPU systems! Let's just remove it since it's not used.
__task_free() cannot be called with a task still in the queue. This
test adds a check confirming there is no concurrency issue in such a
case, where another thread could requeue or wake up a task being freed.
This aims at catching calls to task_unlink_wq() performed by the wrong
thread based on the task's shared status, as well as calls to
__task_queue() with the wrong timer queue being used based on the
task's capabilities. This will at least help eliminate some hypotheses
during debugging sessions when suspecting that a wrong thread has
attempted to queue a task at the wrong place.
The BUG_ON() test in task_queue() only tests for the case where
we're queuing a task that doesn't run on the current thread. Let's
refine it a bit further to catch all cases where the task does not
run *exactly* on the current thread alone.
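A minimal sketch of the refinement, using assert() in place of
HAProxy's BUG_ON() and assuming the task's affinity is stored in a
thread_mask bitfield with tid_bit being the calling thread's bit (both
names exist in HAProxy's sources):

    #include <assert.h>

    /* t_mask: the task's thread affinity mask;
     * tid_bit: the single bit of the calling thread.
     */
    static void check_queue_affinity(unsigned long t_mask,
                                     unsigned long tid_bit)
    {
        /* old test: only caught tasks not runnable on this thread */
        assert(t_mask & tid_bit);
        /* refined test: the task must run on this thread *alone* */
        assert(t_mask == tid_bit);
    }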
Initially when mt_lists were added, their purpose was to be used with
the scheduler, where anyone may concurrently add the same tasklet, so
it sounded natural to implement a check in MT_LIST_ADD{,Q}. Later their
usage was extended and MT_LIST_ADD{,Q} started to be used in situations
where the element to be added was exclusively owned by the one
performing the operation, so a conflict was impossible. This became
more obvious with the idle connections, and the new macro was called
MT_LIST_ADDQ_NOCHECK.
But this remains confusing: at many places it's not expected that an
MT_LIST_ADD could possibly fail, and worse, at some places we start by
initializing the element before adding it (making the test
superfluous), so let's rename them to something more conventional to
denote the presence or absence of the check:
   MT_LIST_ADD{,Q}    : unconditional operation; the caller owns the
                        element and doesn't care about its current
                        state (exactly like LIST_ADD)
   MT_LIST_TRY_ADD{,Q}: only perform the operation if the element is
                        not already added or in the process of being
                        added.
This means that the previously "safe" MT_LIST_ADD{,Q} are not "safe"
anymore. This also means that in case of backport mistakes in the
future causing this to be overlooked, the slower and safer functions
will still be used by default.
Note that the missing unchecked MT_LIST_ADD macro was added.
The rest of the code will have to be reviewed so that a number of
callers of MT_LIST_TRY_ADDQ are changed to MT_LIST_ADDQ to remove
the unneeded test.
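Under the new names, the intended call-site shapes look like this
(illustrative fragments, not actual HAProxy call sites):

    /* the caller exclusively owns 'el' and knows it is detached:
     * unconditional and cheap, like LIST_ADDQ */
    MT_LIST_ADDQ(&queue, &el->list);

    /* 'el' may be queued concurrently by other threads: the checked
     * variant only succeeds if the element is not already attached */
    if (MT_LIST_TRY_ADDQ(&queue, &el->list)) {
        /* we are the ones who queued it */
    }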
tasklet_wakeup() only checks tl->tid to know whether the task is
programmed to run on the current thread or on a specific one. We'll
have to ease this selection in a subsequent patch, preferably without
modifying tl->tid, so let's add a new tasklet_wakeup_on() function
to specify the thread number to run on. Note that the logic has not
changed at all.
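The relationship between the two functions can be summed up as follows
(a sketch; the real inline functions live in the task headers):

    /* tasklet_wakeup() keeps its historic behaviour by delegating
     * to the new function with the tasklet's own thread id */
    static inline void tasklet_wakeup(struct tasklet *tl)
    {
        tasklet_wakeup_on(tl, tl->tid);
    }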
task_kill() may be used by any thread to kill any task with less overhead
than a regular wakeup. In order to achieve this, it bypasses the priority
tree and inserts the task directly into the shared tasklets list, cast as
a tasklet. The task_list_size is updated to make sure it is properly
decremented after execution of this task. The task will thus be picked by
process_runnable_tasks() after checking the tree and sent to the TL_URGENT
list, where it will be processed and killed.
If the task is bound to more than one thread, its first thread will be the
one notified.
If the task was already queued or running, nothing is done; only the
flag is added so that it gets killed before or after execution. Of
course it's the caller's responsibility to make sure any resources
allocated by this task were already cleaned up or taken over.
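The decision logic can be sketched with C11 atomics (the bit values
are hypothetical, the enqueue step is reduced to a comment, and
HAProxy uses its own atomic macros rather than stdatomic.h):

    #include <stdatomic.h>

    #define TASK_RUNNING  0x0001   /* hypothetical bit values */
    #define TASK_QUEUED   0x0002
    #define TASK_KILLED   0x0004

    struct task_sk { _Atomic unsigned short state; };

    static void task_kill_sk(struct task_sk *t)
    {
        unsigned short state = atomic_load(&t->state);

        for (;;) {
            if (state & (TASK_RUNNING | TASK_QUEUED)) {
                /* already queued or running: just set the flag, the
                 * task will be reaped around its next execution */
                if (atomic_compare_exchange_weak(&t->state, &state,
                                                 state | TASK_KILLED))
                    return;
            }
            else if (atomic_compare_exchange_weak(&t->state, &state,
                              state | TASK_QUEUED | TASK_KILLED)) {
                /* we won the queueing: insert into the first
                 * thread's shared tasklet list, cast as a tasklet
                 * (omitted here) */
                return;
            }
            /* a failed CAS refreshed 'state'; loop and retry */
        }
    }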
Now process_runnable_tasks() is responsible for calculating the
budgets for each queue and dequeuing from the tree, then it calls
run_tasks_from_lists(). The latter scans the queues, picking tasks
there while respecting the budgets. Note that its name gained a
plural "s" for this reason.
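A simplified sketch of the resulting division of work; the helper
names are hypothetical, and TL_URGENT/TL_NORMAL/TL_BULK stand for the
scheduler's service classes:

    enum { TL_URGENT, TL_NORMAL, TL_BULK, TL_CLASSES };

    extern int  queue_not_empty(int q);   /* hypothetical helper */
    extern void run_one_tasklet(int q);   /* hypothetical helper */

    /* drain each class queue within the budget computed upstream
     * by process_runnable_tasks() */
    static void run_tasks_from_lists_sk(int budget[TL_CLASSES])
    {
        for (int q = 0; q < TL_CLASSES; q++)
            while (budget[q] > 0 && queue_not_empty(q)) {
                run_one_tasklet(q);
                budget[q]--;
            }
    }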
It is neither convenient nor scalable to check each and every tasklet
queue to figure out whether it's empty when we often need to check
them all at once. This patch introduces a tasklet class mask in which
one bit is set for each queue representing a class of service. A
single test on the mask is enough to know whether there's still some
work to be done. It will later be usable to better factor the runqueue
code.
Bits are set when tasklets are queued and cleared when queues are
emptied. It is possible for a queue to be empty while its bit is still
set, if a tasklet was added then removed, but this is not a problem as
this case is properly checked for in run_tasks_from_lists().
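The mechanism reduces to a few bit operations; a minimal sketch (the
mask is a per-thread field in the real code):

    static unsigned int tl_class_mask;

    static void mark_class_queued(int q)
    {
        tl_class_mask |= 1U << q;     /* set when a tasklet is queued */
    }

    static void mark_class_emptied(int q)
    {
        tl_class_mask &= ~(1U << q);  /* cleared once the queue drains */
    }

    static int may_have_work(void)
    {
        return tl_class_mask != 0;    /* one test instead of N checks */
    }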
It will soon be convenient to have the tasklet queue number, so better
make current_queue an index rather than a pointer to the queue. When
not currently running a tasklet (e.g. from I/O), the index is -1.
A test on large objects revealed a big performance loss compared to
2.1. The cause was found to be related to cache locality between
scheduled operations that are batched using tasklets. It happens that
we now have several layers of tasklets, and queuing all these
operations leaves time for memory objects to cool down in the CPU
cache, effectively halving the performance.
A quick test consisting of putting most unknown tasklets into the BULK
queue almost fixed the performance regression, but this is the wrong
approach as it can also slow down some low-latency transfers or access
to applets like the CLI.
What this patch does instead is to queue unknown tasklets into the
same queue as the current one when tasklet_wakeup() is itself called
from a task/tasklet, and otherwise to use the urgent queue for real
I/O (when sched->current is NULL). This results in the called tasklet
being woken up much sooner, often at the end of the current batch of
tasklets.
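The selection itself boils down to a one-liner; a sketch assuming the
per-thread sched->current task pointer and the current_queue index
introduced above:

    /* pick the class for a tasklet of unknown class: inherit the
     * caller's queue when woken from a task/tasklet, otherwise
     * treat it as urgent I/O work */
    int q = sched->current ? sched->current_queue : TL_URGENT;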
By doing so, a test on 2 cores 4 threads with 256 concurrent H1 conns
transferring 16m objects with 256kB buffers jumped from 55 to 88 Gbps.
It's even possible to go as high as 101 Gbps by evaluating the URGENT
queue after the BULK one, though this was not done as it was
considered dangerous for latency-sensitive operations.
This reinforces the importance of getting the CPU transfer mechanisms
based on tasklet_wakeup_after() to work at the tasklet level by
supporting an immediate wakeup in certain cases.
No backport is needed, this is strictly 2.2.
The TASK_IS_TASKLET() macro was moved to the proto file instead of the
type one. The proto part was slightly reordered to remove a number of
ugly forward declarations of static inline functions. About ten C and
H files had their task.h dependency dropped since they were not using
anything from it.