2025-02-13 - Details of the watchdog's internals
------------------------------------------------

1. The watchdog timer
---------------------

The watchdog sets up a timer that triggers every 1 to 1000ms. This is
pre-initialized by init_wdt() which positions wdt_handler() as the signal
handler of signal WDTSIG (SIGALRM).

But this is not sufficient, an alarm actually has to be set. This is done for
each thread by init_wdt_per_thread() which calls clock_setup_signal_timer()
which in turn enables a ticking timer for the current thread, that delivers
the WDTSIG signal (SIGALRM) to the process. Since there's no notion of thread
at this point, there are as many timers as there are threads, and each signal
comes with an integer value which in fact contains the thread number as passed
to clock_setup_signal_timer() during initialization.

The timer preferably uses CLOCK_THREAD_CPUTIME_ID if available, otherwise
falls back to CLOCK_REALTIME. The former is more accurate as it really counts
the time spent in the process, while the latter might also account for time
spent stuck on paging in, etc.

Then wdt_ping() is called to arm the timer. It's set to trigger every
<wdt_warn_blocked_traffic_ns> interval. It is also called by wdt_handler()
to reprogram a new wakeup after it has ticked.

When wdt_handler() is called, it reads the thread number in si_value.sival_int,
as positioned during initialization. Most of the time the signal lands on the
wrong thread (typically thread 1 regardless of the reported thread). From this
point, the function retrieves the various info related to that thread's recent
activity (its current time and flags), and ignores corner cases such as when
that thread is already dumping another one, being dumped, in the poller, has
quit, etc.

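The way the thread number travels with the signal can be tried outside of
HAProxy. The sketch below is only an illustration: all demo_* names are made
up, and sigqueue() stands in for a per-thread timer expiry so that no timer
has to be armed. Only the use of SA_SIGINFO and si_value.sival_int reflects
what is described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Thread number seen by the last handler invocation, -1 if none yet. */
static volatile sig_atomic_t demo_last_thr = -1;

/* Plays the role of wdt_handler(): the thread number is recovered from the
 * integer payload that comes with the signal, not from the thread the
 * handler happens to run on.
 */
static void demo_wdt_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	demo_last_thr = si->si_value.sival_int;
}

/* Installs the handler with SA_SIGINFO so that si_value is passed to it.
 * Returns 0 on success, -1 on error like sigaction().
 */
static int demo_wdt_install(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = demo_wdt_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGALRM, &sa, NULL);
}

/* Hypothetical stand-in for a timer expiry: queues SIGALRM to the process
 * with <thr> as the payload.
 */
static int demo_wdt_fake_tick(int thr)
{
	union sigval v;

	assert(thr >= 0); /* thread numbers are never negative */
	v.sival_int = thr;
	return sigqueue(getpid(), SIGALRM, v);
}
```

In a single-threaded process where SIGALRM is not blocked, calling
demo_wdt_install() then demo_wdt_fake_tick(3) leaves 3 in demo_last_thr.
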
If the thread was not marked as stuck, it's verified that no progress was made
for at least one second, in which case the TH_FL_STUCK flag is set. The lack of
progress is measured by the distance between the thread's current cpu_time and
its prev_cpu_time. If the lack of progress is at least as large as the warning
threshold, then the signal is bounced to the faulty thread if it's not the
current one. Since this bounce is based on the time spent without update, it
does not happen often in the first place.

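The two thresholds above can be modeled with a small pure function. This is
only one reading of the paragraph, not the code from wdt.c: the demo_* names,
the struct and the way the result is reported are all assumptions, and the
warning threshold is passed as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_STUCK_NS 1000000000ULL /* one second, as stated above */

struct demo_wdt_verdict {
	bool mark_stuck; /* TH_FL_STUCK would be set on the thread */
	bool escalate;   /* warning threshold reached, bounce the signal */
};

/* <cpu_ns> and <prev_cpu_ns> stand for the thread's cpu_time and
 * prev_cpu_time, <warn_ns> for the warning threshold, and <already_stuck>
 * for the state of the stuck flag when entering the handler.
 */
static struct demo_wdt_verdict
demo_wdt_check(uint64_t cpu_ns, uint64_t prev_cpu_ns, uint64_t warn_ns,
               bool already_stuck)
{
	struct demo_wdt_verdict v = { false, false };
	uint64_t stalled;

	assert(cpu_ns >= prev_cpu_ns); /* CPU time does not go backwards */
	stalled = cpu_ns - prev_cpu_ns;

	if (!already_stuck && stalled >= DEMO_STUCK_NS)
		v.mark_stuck = true;
	if (stalled >= warn_ns)
		v.escalate = true;
	return v;
}
```

With an arbitrary 100ms warning threshold, a thread stalled for 150ms is only
escalated, while one stalled for 1.5s is both marked as stuck and escalated.
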
Once on the faulty thread, two checks are performed:
  1) if the thread was already marked as stuck, then the thread is considered
     as definitely stuck, and ha_panic() is called. It will not return.

  2) a check is made to verify if the scheduler is still ticking, by reading
     and setting a variable that only the scheduler can clear when leaving a
     task. If the scheduler didn't make any progress, ha_stuck_warning() is
     called to emit a warning about that thread.

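The "read and set" variable of the second check behaves like a flag that the
watchdog raises and that only the scheduler lowers. A minimal sketch of the
two checks follows, assuming C11 atomics. The demo_* names and the struct
layout are hypothetical and do not match HAProxy's thread context.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum demo_wdt_action { DEMO_WDT_NONE, DEMO_WDT_WARN, DEMO_WDT_PANIC };

struct demo_thread {
	bool was_stuck;            /* stuck flag as found when entering */
	atomic_bool sched_checked; /* set by the watchdog, cleared by sched */
};

/* What the scheduler would do when leaving a task: prove it's alive. */
static void demo_sched_leave_task(struct demo_thread *t)
{
	atomic_store(&t->sched_checked, false);
}

/* Runs the two checks in order and reports what the watchdog would do. */
static enum demo_wdt_action demo_wdt_on_thread(struct demo_thread *t)
{
	assert(t);

	/* 1) already marked as stuck: definitely stuck */
	if (t->was_stuck)
		return DEMO_WDT_PANIC;

	/* 2) read and set in a single step: if the flag was still set, the
	 * scheduler did not leave any task since the previous check.
	 */
	if (atomic_exchange(&t->sched_checked, true))
		return DEMO_WDT_WARN;

	return DEMO_WDT_NONE;
}
```

Two consecutive calls without demo_sched_leave_task() in between return
DEMO_WDT_NONE then DEMO_WDT_WARN. The exchange avoids a window between the
read and the write during which the scheduler could clear the flag unnoticed.
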
Most of the time there's no panic of course, and a wdt_ping() is performed
before leaving the handler to reprogram a check for that thread.

2. The debug handler
--------------------

Both ha_panic() and ha_stuck_warning() are quite similar. In both cases, they
will first verify that no panic is in progress and just return if so. This is
verified using mark_tainted() which atomically sets a tainted bit and returns
the previous value. ha_panic() sets TAINTED_PANIC while ha_stuck_warning() will
set TAINTED_WARN_BLOCKED_TRAFFIC.

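The "set a bit and return the previous value" pattern maps directly to an
atomic fetch-or. The sketch below assumes C11 atomics and uses made-up demo_*
names and bit values. It only shows how the previous value lets a caller
detect that a panic is already in progress.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_TAINTED_WARN  0x1u /* hypothetical bit values */
#define DEMO_TAINTED_PANIC 0x2u

static atomic_uint demo_tainted;

/* Atomically sets <flag> and returns the bits as they were before. */
static unsigned int demo_mark_tainted(unsigned int flag)
{
	assert(flag != 0); /* a null mask would not record anything */
	return atomic_fetch_or(&demo_tainted, flag);
}

/* Returns 1 if the caller may go on with its report, or 0 if a panic had
 * already been started, in which case it must just return.
 */
static int demo_may_report(unsigned int flag)
{
	return !(demo_mark_tainted(flag) & DEMO_TAINTED_PANIC);
}
```

The first caller passing DEMO_TAINTED_PANIC gets 1 and proceeds. Any later
caller gets 0, whichever flag it passes, so only one panic can be running.
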
ha_panic() uses the current thread's trash buffer to produce the messages, as
we don't care about its contents since that thread will never return. However
ha_stuck_warning() instead uses a local 8kB buffer in the thread's stack.
ha_panic() will call ha_thread_dump_fill() for each thread, to complete the
buffer being filled with each thread's dump messages. ha_stuck_warning() only
calls ha_thread_dump_one(), which works on the current thread. In both cases
the message is then directly sent to fd #2 (stderr) and ha_thread_dump_done()
is called to release the dumped thread.

Both print a few extra messages, but ha_panic() just ends by looping on abort()
until the process dies.

ha_thread_dump_fill() uses a locking mechanism to make sure that each thread is
only dumped once at a time. For this it atomically sets its thread_dump_buffer
to point to the target buffer. The thread_dump_buffer has 4 possible values:
  - NULL: no dump in progress
  - a valid, even, pointer: this is the pointer to the buffer that's currently
    in the process of being filled by the thread
  - a valid pointer + 1: this is the pointer of the now filled buffer, that the
    caller can consume. The atomic |1 at the end marks the end of the dump.
  - 0x2: this indicates to the dumping function that it is responsible for
    assigning its own buffer itself (used by the debug_handler to pick one of
    its own trash buffers during a panic). The idea here is that each thread
    will keep their own copy of their own dump so that it can be later found in
    the core file for inspection.

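The four values above can be told apart by looking at the pointer's low bits
only. The helper below is a sketch with hypothetical demo_* names, working on
the pointer converted to an integer. It assumes that buffers are at least
2-byte aligned, which is what makes the low bit usable as an end-of-dump
marker.

```c
#include <assert.h>
#include <stdint.h>

enum demo_dump_state {
	DEMO_DUMP_IDLE,       /* NULL: no dump in progress */
	DEMO_DUMP_FILLING,    /* even pointer: buffer being filled */
	DEMO_DUMP_DONE,       /* pointer + 1: buffer ready for the caller */
	DEMO_DUMP_SELF_ALLOC  /* 0x2: dumper must pick its own buffer */
};

/* Decodes a thread_dump_buffer-like value into one of the four states. */
static enum demo_dump_state demo_dump_classify(uintptr_t v)
{
	if (v == 0)
		return DEMO_DUMP_IDLE;
	if (v == 0x2)
		return DEMO_DUMP_SELF_ALLOC;
	if (v & 1)
		return DEMO_DUMP_DONE;
	return DEMO_DUMP_FILLING;
}

/* Marks the end of a dump, like the "|1" described above. */
static uintptr_t demo_dump_mark_done(uintptr_t v)
{
	assert(demo_dump_classify(v) == DEMO_DUMP_FILLING);
	return v | 1;
}

/* Returns the buffer's address with the end-of-dump marker removed. */
static void *demo_dump_buffer(uintptr_t v)
{
	return (void *)(v & ~(uintptr_t)1);
}
```

A caller waiting for a dump would poll until demo_dump_classify() reports
DEMO_DUMP_DONE, then read the buffer returned by demo_dump_buffer().
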
A copy of the last valid thread_dump_buffer used is kept in last_dump_buffer,
for easier post-mortem analysis. This one may be NULL or even invalid, but
usually during a panic it will be valid, and may reveal useful hints even if it
still contains the dump of the last warning. Usually this will point to a trash
buffer or to a stack area.

ha_thread_dump_fill() then either directly calls ha_thread_dump_one() if the
target thread is the current thread, or sends the target thread DEBUGSIG
(SIGURG) if it's a different thread. This signal is initialized at boot time
by init_debug() to call debug_handler().

debug_handler() then operates on the target thread and recognizes that it must
allocate its own buffer if the pointer is 0x2, calls ha_thread_dump_one(), then
waits forever (it does not return from the signal handler so as to make sure
the dumped thread will not badly interact with other ones).

ha_thread_dump_one() collects some info, that it prints all along into the
target buffer. Depending on the situation, it will dump current tasks or not,
may mark that Lua is involved by tainting the process with TAINTED_LUA_STUCK,
and if running in shared mode, also with TAINTED_LUA_STUCK_SHARED. It calls
ha_dump_backtrace() before returning.

ha_dump_backtrace() produces a backtrace into a local buffer (100 entries max),
then dumps the code bytes near the crashing instruction, dumps pointers and
tries to resolve function names, and sends all of that into the target buffer.
On some architectures (x86_64, arm64), it will also try to detect and decode
call instructions and resolve them to called functions.

3. Improvements
---------------

Symbol resolution is extremely expensive, particularly for the warnings
which should be fast. But we need it, it's just unfortunate that it strikes at
the wrong moment. At least ha_dump_backtrace() does disable signals while it's
resolving, in order to avoid unwanted re-entrance. In addition, the called
function resolve_sym_name() uses some locking and refrains from calling the
dladdr family of functions in a re-entrant way (in the worst case only well
known symbols will be resolved).

In an ideal case, ha_dump_backtrace() would dump the pointers to a local array,
which would then later be resolved asynchronously in a tasklet. This can work
because the code bytes will not change either so the dump can be done at once
there.

However the task dumps are not really compatible with this. For example
ha_task_dump() makes a number of tests and itself will call hlua_traceback() if
needed, so it might still need to be dumped in real time synchronously and
buffered. But then it's difficult to reassemble chunks of text between the
backtrace (that needs to be resolved later) and the tasks/lua parts. Or maybe
we can afford to disable Lua trace dumps in warnings and keep them only for
panics (where the asynchronous resolution is not needed)?

Also, differentiating the call paths for warnings and panics is not easy
either.