As reported in issue #816, when building task.o at -O1 with gcc 4.7 or
4.8, we get the following warning:
CC src/task.o
In file included from include/haproxy/proxy.h:31:0,
from include/haproxy/cfgparse.h:27,
from src/task.c:19:
src/task.c: In function 'next_timer_expiry':
include/haproxy/ticks.h:121:10: warning: 'key' may be used uninitialized in this function [-Wmaybe-uninitialized]
src/task.c:349:2: note: 'key' was declared here
It is wrong since the condition to use 'key' is exactly the same as
the one used to set it. This warning disappears at -O2 and disappeared
from gcc 5 and above. Let's just initialize 'key' there, it only adds
16 bytes of code and remains cheap enough for this function.
This should be backported to 2.2.
This aims at catching calls to task_unlink_wq() performed by the wrong
thread based on the shared status for the task, as well as calls to
__task_queue() with the wrong timer queue being used based on the task's
capabilities. This will at least help eliminate some hypothesis during
debugging sessions when suspecting that a wrong thread has attempted to
queue a task at the wrong place.
A bug was introduced by commit 77015abe0 ("MEDIUM: tasks: clean up the
front side of the wait queue in wake_expired_tasks()"): front tasks
that are not yet expired were incorrectly requeued into the local
wait queue instead of the global one. Because of this, the same task
could be found by the same thread on next invocation and be unlinked
without locking, allowing another thread to requeue it in parallel,
and conversely another thread could unlink it while the task was being
walked over, causing all sorts of crashes and endless loops in
wake_expired_tasks() and affiliates.
This bug can easily be triggered by stressing the do_resolve action
in multi-thread (after applying the fixes required to get do_resolve
to work with threads). It certainly is the cause of issue #758.
This must be backported to 2.2 only.
In run_tasks_from_task_list() we may free some tasks that have been
killed. Before doing so we unlink them from the wait queue. But if such
a task is in the global wait queue, the queue isn't locked so this can
result in corrupting the global task list and causing loops or crashes.
It's very likely one cause of issue #758.
This must be backported to 2.2. For 2.1 there doesn't seem to be any
case where a task could be freed this way while in the global queue,
but it doesn't cost much to apply the same change (the code is in
process_runnable_task there).
A bug in task_kill() was fixed by commy 54d31170a ("BUG/MAJOR: sched:
make sure task_kill() always queues the task") which added a list
initialization before adding an element. But in fact an inconditional
addition would have done the same and been simpler than first
initializing then checking the element was initialized. Let's use
MT_LIST_ADDQ() there to add the task to kill into the shared queue
and kill the dirty LIST_INIT().
Initially when mt_lists were added, their purpose was to be used with
the scheduler, where anyone may concurrently add the same tasklet, so
it sounded natural to implement a check in MT_LIST_ADD{,Q}. Later their
usage was extended and MT_LIST_ADD{,Q} started to be used on situations
where the element to be added was exclusively owned by the one performing
the operation so a conflict was impossible. This became more obvious with
the idle connections and the new macro was called MT_LIST_ADDQ_NOCHECK.
But this remains confusing and at many places it's not expected that
an MT_LIST_ADD could possibly fail, and worse, at some places we start
by initializing it before adding (and the test is superflous) so let's
rename them to something more conventional to denote the presence of the
check or not:
MT_LIST_ADD{,Q} : inconditional operation, the caller owns the
element, and doesn't care about the element's
current state (exactly like LIST_ADD)
MT_LIST_TRY_ADD{,Q}: only perform the operation if the element is not
already added or in the process of being added.
This means that the previously "safe" MT_LIST_ADD{,Q} are not "safe"
anymore. This also means that in case of backport mistakes in the
future causing this to be overlooked, the slower and safer functions
will still be used by default.
Note that the missing unchecked MT_LIST_ADD macro was added.
The rest of the code will have to be reviewed so that a number of
callers of MT_LIST_TRY_ADDQ are changed to MT_LIST_ADDQ to remove
the unneeded test.
Sadly, the fix from commit 54d31170a ("BUG/MAJOR: sched: make sure
task_kill() always queues the task") broke the builds without DEBUG_STRICT
as, in order to be careful, it plcaed a BUG_ON() around the previously
failing condition to check for any new possible failure, but this BUG_ON
strips the condition when DEBUG_STRICT is not set. We don't want BUG_ON
to evaluate any condition either as some debugging code calls possibly
expensive ones (e.g. in htx_get_stline). Let's just drop the useless
BUG_ON().
No backport is needed, this is 2.2-dev.
task_kill() may fail to queue a task if this task has never ever run,
because its equivalent (tasklet->list) member has never been "emptied"
since it didn't pass through the LIST_DEL_INIT() that's performed by
run_tasks_from_lists(). This results in these tasks to never be freed.
It happens during the mux takeover since the target task usually is
the timeout task which, by definition, has never run yet.
This fixes commit eb8c2c69f ("MEDIUM: sched: implement task_kill() to
kill a task") which was introduced after 2.2-dev11 and doesn't need to
be backported.
task_kill() may be used by any thread to kill any task with less overhead
than a regular wakeup. In order to achieve this, it bypasses the priority
tree and inserts the task directly into the shared tasklets list, cast as
a tasklet. The task_list_size is updated to make sure it is properly
decremented after execution of this task. The task will thus be picked by
process_runnable_tasks() after checking the tree and sent to the TL_URGENT
list, where it will be processed and killed.
If the task is bound to more than one thread, its first thread will be the
one notified.
If the task was already queued or running, nothing is done, only the flag
is added so that it gets killed before or after execution. Of course it's
the caller's responsibility to make sur any resources allocated by this
task were already cleaned up or taken over.
This flag, when set, will be used to indicate that the task must die.
At the moment this may only be placed by the task itself or by the
scheduler when placing it into the TL_NORMAL queue.
In commit 3ef7a190b ("MEDIUM: tasks: apply a fair CPU distribution
between tasklet classes") we compute a total weight to be used to
split the CPU time between queues. There is a mention that the
total cannot be null, wihch is based on the fact that we only get
there if thread_has_task() returns non-zero. But there is a very
small race which can break this assumption: if two threads conflict
on MT_LIST_ADDQ() on an empty shared list and both roll back before
trying again, there is the possibility that a first call to
MT_LIST_ISEMPTY() sees the first thread install itself, then the
second call will see the list empty when both roll back. Thus we
could proceed with the queue while it's temporarily empty and
compute max lengths using a divide by zero. This case is very
hard to trigger, it seldom happens on 16 threads at 400k req/s.
Let's simply test for max_total and leave the loop when we've not
found any work.
No backport is needed, that's 2.2-only.
Now that all tasklet queues are scanned at once by run_tasks_from_lists(),
it becomes possible to always check for lower priority classes and jump
back to them when they exist.
This patch adds tune.sched.low-latency global setting to enable this
behavior. What it does is stick to the lowest ranked priority list in
which tasks are still present with an available budget, and leave the
loop to refill the tasklet lists if the trees got new tasks or if new
work arrived into the shared urgent queue.
Doing so allows to cut the latency in half when running with extremely
deep run queues (10k-100k), thus allowing forwarding of small and large
objects to coexist better. It remains off by default since it does have
a small impact on large traffic by default (shorter batches).
Now process_runnable_tasks is responsible for calculating the budgets
for each queue, dequeuing from the tree, and calling run_tasks_from_lists().
This latter one scans the queues, picking tasks there and respecting budgets.
Note that its name was updated with a plural "s" for this reason.
It is neither convenient nor scalable to check each and every tasklet
queue to figure whether it's empty or not while we often need to check
them all at once. This patch introduces a tasklet class mask which gets
a bit 1 set for each queue representing one class of service. A single
test on the mask allows to figure whether there's still some work to be
done. It will later be usable to better factor the runqueue code.
Bits are set when tasklets are queued. They're cleared when queues are
emptied. It is possible that a queue is empty but has a bit if a tasklet
was added then removed, but this is not a problem as this is properly
checked for in run_tasks_from_list().
It will be convenient to have the tasklet queue number soon, better make
current_queue an index rather than a pointer to the queue. When not currently
running (e.g. from I/O), the index is -1.
Till now in process_runnable_tasks() we used to reserve a fixed portion
of max_processed to urgent tasks, then a portion of what remains for
normal tasks, then what remains for bulk tasks. This causes two issues:
- the current budget for processed tasks could be drained once for
all by higher level tasks so that they couldn't have enough left
for the next run. For example, if bulk tasklets cause task wakeups,
the required share to run them could be eaten by other bulk tasklets.
- it forces the urgent tasks to be run before scanning the tree so that
we know how many tasks to pick from the tree, and this isn't very
efficient cache-wise.
This patch changes this so that we compute upfront how max_processed will
be shared between classes that require so. We can then decide in advance
to pick a certain number of tasks from the tree, then execute all tasklets
in turn. When reaching the end, if there's still some budget, we can go
back and do the same thing again, improving chances to pick new work
before the global budget is depleted.
The default weights have been set to 50% for urgent tasklets, 37% for
normal ones and 13% for the bulk ones. In practice, there are not that
many urgent tasklets but when they appear they are cheap and must be
processed in as large batches as possible. Every time there is nothing
to pick there, the unused budget is shared between normal and bulk and
this allows bulk tasklets to still have quite some CPU to run on.
In task_per_thread[] we now have current_queue which is a pointer to
the current tasklet_list entry being evaluated. This will be used to
know the class under which the current task/tasklet is currently
running.
We want to be sure not to exceed max_processed. It can actually go
slightly negative due to the rounding applied to ratios, but we must
refrain from processing too many tasks if it's already low.
This became particularly relevant since recent commit 5c8be272c ("MEDIUM:
tasks: also process late wakeups in process_runnable_tasks()") which was
merged into 2.2-dev10. No backport is needed.
Since version 1.8, we've started to use tasks and tasklets more
extensively to defer I/O processing. Originally with the simple
scheduler, a task waking another one up using task_wakeup() would
have caused it to be processed right after the list of runnable ones.
With the introduction of tasklets, we've started to spill running
tasks from the run queues to the tasklet queues, so if a task wakes
another one up, it will only be executed on the next call to
process_runnable_task(), which means after yet another round of
polling loop.
This is particularly visible with I/Os hitting muxes: poll() reports
a read event, the connection layer performs a tasklet_wakeup() on the
mux subscribed to this I/O, and this mux in turn signals the upper
layer stream using task_wakeup(). The process goes back to poll() with
a null timeout since there's one active task, then back to checking all
possibly expired events, and finally back to process_runnable_tasks()
again. Worse, when there is high I/O activity, doing so will make the
task's execution further apart from the tasklet and will both increase
the total processing latency and reduce the cache hit ratio.
This patch brings back to the original spirit of process_runnable_tasks()
which is to execute runnable tasks as long as the execution budget is not
exhausted. By doing so, we're immediately cutting in half the number of
calls to all functions called by run_poll_loop(), and halving the number
of calls to poll(). Furthermore, calling poll() less often also means
purging FD updates less often and offering more chances to merge them.
This also has the nice effect of making tune.runqueue-depth effective
again, as in the past it used to be quickly bounded by this artificial
event horizon which was preventing from executing remaining tasks. On
certain workloads we can see a 2-3% performance increase.
Due to the way the wait queue works, some tasks might be postponed but not
requeued. However when we exit wake_expired_tasks() on a not-yet-expired
task and leave it in this situation, the next call to next_timer_expiry()
will use this first task's key in the tree as an expiration date, but this
date might be totally off and cause needless wakeups just to reposition it.
This patch makes sure that we leave wake_expired_tasks with a clean state
of frontside tasks and that their tree's key matches their expiration date.
Doing so we can already observe a ~15% reduction of the number of wakeups
when dealing with large numbers of health checks.
The patch looks large because the code was rearranged but the real change
is to take the wakeup/requeue decision on the task's expiration date instead
of the tree node's key, the rest is unchanged.
This patch fixes all the leftovers from the include cleanup campaign. There
were not that many (~400 entries in ~150 files) but it was definitely worth
doing it as it revealed a few duplicates.
This one was not easy because it was embarking many includes with it,
which other files would automatically find. At least global.h, arg.h
and tools.h were identified. 93 total locations were identified, 8
additional includes had to be added.
In the rare files where it was possible to finalize the sorting of
includes by adjusting only one or two extra lines, it was done. But
all files would need to be rechecked and cleaned up now.
It was the last set of files in types/ and proto/ and these directories
must not be reused anymore.
This one is particularly difficult to split because it provides all the
functions used to manipulate a proxy state and to retrieve names or IDs
for error reporting, and as such, it was included in 73 files (down to
68 after cleanup). It would deserve a small cleanup though the cut points
are not obvious at the moment given the number of structs involved in
the struct proxy itself.
The TASK_IS_TASKLET() macro was moved to the proto file instead of the
type one. The proto part was a bit reordered to remove a number of ugly
forward declaration of static inline functions. About a tens of C and H
files had their dependency dropped since they were not using anything
from task.h.
A few includes were missing in each file. A definition of
struct polled_mask was moved to fd-t.h. The MAX_POLLERS macro was
moved to defaults.h
Stdio used to be silently inherited from whatever path but it's needed
for list_pollers() which takes a FILE* and which can thus not be
forward-declared.
And also rename standard.c to tools.c. The original split between
tools.h and standard.h dates from version 1.3-dev and was mostly an
accident. This patch moves the files back to what they were expected
to be, and takes care of not changing anything else. However this
time tools.h was split between functions and types, because it contains
a small number of commonly used macros and structures (e.g. name_desc)
which in turn cause the massive list of includes of tools.h to conflict
with the callers.
They remain the ugliest files of the whole project and definitely need
to be cleaned and split apart. A few types are defined there only for
functions provided there, and some parts are even OS-specific and should
move somewhere else, such as the symbol resolution code.
Now the file is ready to be stored into its final destination. A few
minor reorderings were performed to keep the file properly organized,
making the various sections more visible (cache & lockless).
In addition and to stay consistent, memory.c was renamed to pool.c.
types/freq_ctr.h was moved to haproxy/freq_ctr-t.h and proto/freq_ctr.h
was moved to haproxy/freq_ctr.h. Files were updated accordingly, no other
change was applied.
This one is included almost everywhere and used to rely on a few other
.h that are not needed (unistd, stdlib, standard.h). It could possibly
make sense to split it into multiple parts to distinguish operations
performed on timers and the internal time accounting, but at this point
it does not appear much important.
Half of the users of this include only need the type definitions and
not the manipulation macros nor the inline functions. Moves the various
types into mini-clist-t.h makes the files cleaner. The other one had all
its includes grouped at the top. A few files continued to reference it
without using it and were cleaned.
In addition it was about time that we'd rename that file, it's not
"mini" anymore and contains a bit more than just circular lists.
All files that were including one of the following include files have
been updated to only include haproxy/api.h or haproxy/api-t.h once instead:
- common/config.h
- common/compat.h
- common/compiler.h
- common/defaults.h
- common/initcall.h
- common/tools.h
The choice is simple: if the file only requires type definitions, it includes
api-t.h, otherwise it includes the full api.h.
In addition, in these files, explicit includes for inttypes.h and limits.h
were dropped since these are now covered by api.h and api-t.h.
No other change was performed, given that this patch is large and
affects 201 files. At least one (tools.h) was already freestanding and
didn't get the new one added.
This is where other imported components are located. All files which
used to directly include ebtree were touched to update their include
path so that "import/" is now prefixed before the ebtree-related files.
The ebtree.h file was slightly adjusted to read compiler.h from the
common/ subdirectory (this is the only change).
A build issue was encountered when eb32sctree.h is loaded before
eb32tree.h because only the former checks for the latter before
defining type u32. This was addressed by adding the reverse ifdef
in eb32tree.h.
No further cleanup was done yet in order to keep changes minimal.
When tasklet were introduced, it has been decided not to provide the tasklet
to the callback, but NULL instead. While it may have been reasonable back
then, maybe to be able to differentiate a task from a tasklet from the
callback, it also means that we can't access the tasklet from the handler if
the context provided can't be trusted.
As no handler is shared between a task and a tasklet, and there are now
other means of distinguishing between task and tasklet, just pass the
tasklet pointer too.
This may be backported to 2.1, 2.0 and 1.9 if needed.
We can't clear flags on tasklets because we don't know if they're still
present upon return (they all return NULL, maybe that could change in
the future). As a side effect, once TASK_RUNNING is set, it's never
cleared anymore, which is misleading and resulted in some incorrect
flagging of bulk tasks in the recent scheduler changes. And the only
reason for setting TASK_RUNNING on tasklets was to detect self-wakers,
which is not done using a dedicated flag. So instead of setting this
flags for no opportunity to clear it, let's simply not set it.
Now that we can more accurately watch which connection is really
being woken up from itself, it was desirable to re-adjust the CPU BW
thresholds based on measurements. New tests with 60000 concurrent
connections were run at 100 Gbps with unbounded queues and showed
the following distribution:
scenario TC0 TC1 TC2 observation
-------------------+---+---+----+---------------------------
TCP conn rate : 32, 51, 17
HTTP conn rate : 34, 41, 25
TCP byte rate : 2, 3, 95 (2 MB objets)
splicing byte rate: 11, 6, 83 (2 MB objets)
H2 10k object : 44, 23, 33 client-limited
mixed traffic : 18, 10, 72 2*1m+1*0: 11kcps, 36 Gbps
The H2 experienced a huge change since it uses a persistent connection
that was accidently flagged in the previous test. The splicing test
exhibits a higher need for short tasklets, so does the mixed traffic
test. Given that latency mainly matters for conn rate and H2 here,
the ratios were readjusted as 33% for TC0, 50% for TC1 and 17% for
TC2, keeping in mind that whatever is not consumed by one class is
automatically shared in equal propertions by the next one(s). This
setting immediately provided a nice improvement as with the default
settings (maxpollevents=200, runqueue-depth=200), the same ratios as
above are still reported, while the time to request "show activity"
on the CLI dropped to 30-50ms. The average loop time is around 5.7ms
on the mixed traffic.
In addition, one extra stress test at 90.5 Gbps with 5100 conn/s shows
70-100ms CLI request time, with an average loop time of 17 ms.
sched->current is used to know the current task/tasklet, and is currently
only used by the panic dump code. However it turns out it was not set for
tasklets, which prevents us from using it for more usages, despite the
panic handling code already handling this case very well. Let's make sure
it's now set.
Commit a17664d829 ("MEDIUM: tasks: automatically requeue into the bulk
queue an already running tasklet") tried to inflict a penalty to
self-requeuing tasks/tasklets which correspond to those involved in
large, high-latency data transfers, for the benefit of all other
processing which requires a low latency. However, it turns out that
while it ought to do this on a case-by-case basis, basing itself on
the RUNNING flag isn't accurate because this flag doesn't leave for
tasklets, so we'd rather need a distinct flag to tag such tasklets.
This commit introduces TASK_SELF_WAKING to mark tasklets acting like
this. For now it's still set when TASK_RUNNING is present but this
will have to change. The flag is kept across wakeups.
Measures with unbounded execution ratios under 40000 concurrent
connections at 100 Gbps showed the following CPU bandwidth
distribution between task classes depending on traffic scenarios:
scenario TC0 TC1 TC2 observation
-------------------+---+---+----+---------------------------
TCP conn rate : 29, 48, 23 221 kcps
HTTP conn rate : 29, 47, 24 200 kcps
TCP byte rate : 3, 5, 92 53 Gbps
splicing byte rate: 5, 10, 85 70 Gbps
H2 10k object : 10, 21, 74 client-limited
mixed traffic : 4, 7, 89 2*1m+1*0: 11kcps, 36 Gbps
Thus it seems that we always need a bit of bulk tasks even for short
connections, which seems to imply a suboptimal processing somewhere,
and that there are roughly twice as many tasks (TC1=normal) as regular
tasklets (TC0=urgent). This ratio stands even when data forwarding
increases. So at first glance it looks reasonable to enforce the
following ratio by default:
- 16% for TL_URGENT
- 33% for TL_NORMAL
- 50% for TL_BULK
With this, the TCP conn rate climbs to ~225 kcps, and the mixed traffic
pattern shows a more balanced 17kcps + 35 Gbps with 35ms CLI request
time time instead of 11kcps + 36 Gbps and 400 ms response time. The
byte rate tests (1M objects) are not affected at all. This setting
looks "good enough" to allow immediate merging, and could be refined
later.
It's worth noting that it resists very well to massive increase of
run queue depth and maxpollevents: with the run queue depth changed
from 200 to 10000 and maxpollevents to 10000 as well, the CLI's
request time is back to the previous ~400ms, but the mixed traffic
test reaches 52 Gbps + 7500 CPS, which was never met with the previous
scheduling model, while the CLI used to show ~1 minute response time.
The reason is that in the bulk class it becomes possible to perform
multiple rounds of recv+send and eliminate objects at once, increasing
the L3 cache hit ratio, and keeping the connection count low, without
degrading too much the latency.
Another test with mixed traffic involving 2/3 splicing on huge objects
and 1/3 on empty objects without touching any setting reports 51 Gbps +
5300 cps and 35ms CLI request time.
We used to mix high latency tasks and low latency tasklets in the same
list, and to even refill bulk tasklets there, causing some unfairness
in certain situations (e.g. poll-less transfers between many connections
saturating the machine with similarly-sized in and out network interfaces).
This patch changes the mechanism to split the load into 3 lists depending
on the task/tasklet's desired classes :
- URGENT: this is mainly for tasklets used as deferred callbacks
- NORMAL: this is for regular tasks
- BULK: this is for bulk tasks/tasklets
Arbitrary ratios of max_processed are picked from each of these lists in
turn, with the ability to complete in one list from what was not picked
in the previous one. After some quick tests, the following setup gave
apparently good results both for raw TCP with splicing and for H2-to-H1
request rate:
- 0 to 75% for urgent
- 12 to 50% for normal
- 12 to what remains for bulk
Bulk is not used yet.
New function run_tasks_from_list() will run over a tasklet list and will
run all the tasks and tasklets it finds there within a limit of <max>
that is passed in arggument. This is a preliminary work for scheduler QoS
improvements.
Since 1.9 with commit b20aa9eef3 ("MAJOR: tasks: create per-thread wait
queues") a task bound to a single thread will not use locks when being
queued or dequeued because the wait queue is assumed to be the owner
thread's.
But there exists a rare situation where this is not true: the health
check tasks may be running on one thread waiting for a response, and
may in parallel be requeued by another thread calling health_adjust()
after a detecting a response error in traffic when "observe l7" is set,
and "fastinter" is lower than "inter", requiring to shorten the running
check's timeout. In this case, the task being requeued was present in
another thread's wait queue, thus opening a race during task_unlink_wq(),
and gets requeued into the calling thread's wait queue instead of the
running one's, opening a second race here.
This patch aims at protecting against the risk of calling task_unlink_wq()
from one thread while the task is queued on another thread, hence unlocked,
by introducing a new TASK_SHARED_WQ flag.
This new flag indicates that a task's position in the wait queue may be
adjusted by other threads than then one currently executing it. This means
that such WQ manipulations must be performed under a lock. There are two
types of such tasks:
- the global ones, using the global wait queue (technically speaking,
those whose thread_mask has at least 2 bits set).
- some local ones, which for now will be placed into the global wait
queue as well in order to benefit from its lock.
The flag is automatically set on initialization if the task's thread mask
indicates more than one thread. The caller must also set it if it intends
to let other threads update the task's expiration delay (e.g. delegated
I/Os), or if it intends to change the task's affinity over time as this
could lead to the same situation.
Right now only the situation described above seems to be affected by this
issue, and it is very difficult to trigger, and even then, will often have
no visible effect beyond stopping the checks for example once the race is
met. On my laptop it is feasible with the following config, chained to
httpterm:
global
maxconn 400 # provoke FD errors, calling health_adjust()
defaults
mode http
timeout client 10s
timeout server 10s
timeout connect 10s
listen px
bind :8001
option httpchk /?t=50
server sback 127.0.0.1:8000 backup
server-template s 0-999 127.0.0.1:8000 check port 8001 inter 100 fastinter 10 observe layer7
This patch will automatically address the case for the checks because
check tasks are created with multiple threads bound and will get the
TASK_SHARED_WQ flag set.
If in the future more tasks need to rely on this (multi-threaded muxes
for example) and the use of the global wait queue becomes a bottleneck
again, then it should not be too difficult to place locks on the local
wait queues and queue the task on its bound thread.
This patch needs to be backported to 2.1, 2.0 and 1.9. It depends on
previous patch "MINOR: task: only check TASK_WOKEN_ANY to decide to
requeue a task".
Many thanks to William Dauchy for providing detailed traces allowing to
spot the problem.
After processing a task, its RUNNING bit is cleared and at the same time
we check for other bits to decide whether to requeue the task or not. It
happens that we only want to check the TASK_WOKEN_* bits, because :
- TASK_RUNNING was just cleared
- TASK_GLOBAL and TASK_QUEUE cannot be set yet as the task was running,
preventing it from being requeued
It's important not to catch yet undefined flags there because it would
prevent addition of new task flags. This also shows more clearly that
waking a task up with flags 0 is not something safe to do as the task
will not be woken up if it's already running.
We used to have wake_expired_tasks() wake up tasks and return the next
expiration delay. The problem this causes is that we have to call it just
before poll() in order to consider latest timers, but this also means that
we don't wake up all newly expired tasks upon return from poll(), which
thus systematically requires a second poll() round.
This is visible when running any scheduled task like a health check, as there
are systematically two poll() calls, one with the interval, nothing is done
after it, and another one with a zero delay, and the task is called:
listen test
bind *:8001
server s1 127.0.0.1:1111 check
09:37:38.200959 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8696843}) = 0
09:37:38.200967 epoll_wait(3, [], 200, 1000) = 0
09:37:39.202459 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8712467}) = 0
>> nothing run here, as the expired task was not woken up yet.
09:37:39.202497 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8715766}) = 0
09:37:39.202505 epoll_wait(3, [], 200, 0) = 0
09:37:39.202513 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8719064}) = 0
>> now the expired task was woken up
09:37:39.202522 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
09:37:39.202537 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
09:37:39.202565 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
09:37:39.202577 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
09:37:39.202585 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
09:37:39.202659 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
09:37:39.202673 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8814713}) = 0
09:37:39.202683 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
09:37:39.202693 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8818617}) = 0
09:37:39.202701 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
09:37:39.202715 close(7) = 0
Let's instead split the function in two parts:
- the first part, wake_expired_tasks(), called just before
process_runnable_tasks(), wakes up all expired tasks; it doesn't
compute any timeout.
- the second part, next_timer_expiry(), called just before poll(),
only computes the next timeout for the current thread.
Thanks to this, all expired tasks are properly woken up when leaving
poll, and each poll call's timeout remains up to date:
09:41:16.270449 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10223556}) = 0
09:41:16.270457 epoll_wait(3, [], 200, 999) = 0
09:41:17.270130 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10238572}) = 0
09:41:17.270157 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
09:41:17.270194 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
09:41:17.270204 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
09:41:17.270216 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
09:41:17.270224 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
09:41:17.270299 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
09:41:17.270314 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10337841}) = 0
09:41:17.270323 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
09:41:17.270332 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10341860}) = 0
09:41:17.270340 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
09:41:17.270367 close(7) = 0
This may be backported to 2.1 and 2.0 though it's unlikely to bring any
user-visible improvement except to clarify debugging.
As using an mt_list for the tasklet list is costly, instead use a regular list,
but add an mt_list for tasklet woken up by other threads, to be run on the
current thread. At the beginning of process_runnable_tasks(), we just take
the new list, and merge it into the task_list.
This should give us performances comparable to before we started using a
mt_list, but allow us to use tasklet_wakeup() from other threads.
When executing tasks, don't forget to decrement tasks_run_queue once we
popped one task from the task_list. tasks_run_queue used to be decremented
by __tasklet_remove_from_tasklet_list(), but we now call MT_LIST_POP().