haproxy/doc/internals/watchdog.txt

2025-02-13 - Details of the watchdog's internals
------------------------------------------------

1. The watchdog timer
---------------------

The watchdog sets up a timer that triggers every 1 to 1000ms. This is pre-
initialized by init_wdt() which positions wdt_handler() as the signal handler
of signal WDTSIG (SIGALRM).

But this is not sufficient, an alarm actually has to be set. This is done for
each thread by init_wdt_per_thread() which calls clock_setup_signal_timer()
which in turn enables a ticking timer for the current thread, that delivers
the WDTSIG signal (SIGALRM) to the process. Since there's no notion of thread
at this point, there are as many timers as there are threads, and each signal
comes with an integer value which in fact contains the thread number as passed
to clock_setup_signal_timer() during initialization.

The timer preferably uses CLOCK_THREAD_CPUTIME_ID if available, otherwise
falls back to CLOCK_REALTIME. The former is more accurate as it really counts
the time spent in the process, while the latter might also account for time
stuck on paging in etc.

Then wdt_ping() is called to arm the timer. It's set to trigger every
<wdt_warn_blocked_traffic_ns> interval. It is also called by wdt_handler()
to reprogram a new wakeup after it has ticked.

When wdt_handler() is called, it reads the thread number in si_value.sival_int,
as positioned during initialization. Most of the time the signal lands on the
wrong thread (typically thread 1 regardless of the reported thread). From this
point, the function retrieves the various info related to that thread's recent
activity (its current time and flags), ignores corner cases such as if that
thread is already dumping another one, being dumped, in the poller, has quit,
etc.

If the thread was not marked as stuck, it's verified that no progress was made
for at least one second, in which case the TH_FL_STUCK flag is set. The lack of
progress is measured by the distance between the thread's current cpu_time and
its prev_cpu_time. If the lack of progress is at least as large as the warning
threshold, then the signal is bounced to the faulty thread if it's not the
current one. Since this bounce is based on the time spent without update, it
already doesn't happen often.

Once on the faulty thread, two checks are performed:
  1) if the thread was already marked as stuck, then the thread is considered
     as definitely stuck, and ha_panic() is called. It will not return.

  2) a check is made to verify if the scheduler is still ticking, by reading
     and setting a variable that only the scheduler can clear when leaving a
     task. If the scheduler didn't make any progress, ha_stuck_warning() is
     called to emit a warning about that thread.

Most of the time there's no panic of course, and a wdt_ping() is performed
before leaving the handler to reprogram a check for that thread.

2. The debug handler
--------------------

Both ha_panic() and ha_stuck_warning() are quite similar. In both cases, they
will first verify that no panic is in progress and just return if so. This is
verified using mark_tained() which atomically sets a tainted bit and returns
the previous value. ha_panic() sets TAINTED_PANIC while ha_stuck_warning() will
set TAINTED_WARN_BLOCKED_TRAFFIC.

ha_panic() uses the current thread's trash buffer to produce the messages, as
we don't care about its contents since that thread will never return. However
ha_stuck_warning() instead uses a local 8kB buffer in the thread's stack.
ha_panic() will call ha_thread_dump_fill() for each thread, to complete the
buffer being filled with each thread's dump messages. ha_stuck_warning() only
calls ha_thread_dump_one(), which works on the current thread. In both cases
the message is then directly sent to fd #2 (stderr) and ha_thread_dump_done()
is called to release the dumped thread.

Both print a few extra messages, but ha_panic() just ends by looping on abort()
until the process dies.

ha_thread_dump_fill() uses a locking mechanism to make sure that each thread is
only dumped once at a time. For this it atomically sets is thread_dump_buffer
to point to the target buffer. The thread_dump_buffer has 4 possible values:
  - NULL: no dump in progress
  - a valid, even, pointer: this is the pointer to the buffer that's currently
    in the process of being filled by the thread
  - a valid pointer + 1: this is the pointer of the now filled buffer, that the
    caller can consume. The atomic |1 at the end marks the end of the dump.
  - 0x2: this indicates to the dumping function that it is responsible for
    assigning its own buffer itself (used by the debug_handler to pick one of
    its own trash buffers during a panic). The idea here is that each thread
    will keep their own copy of their own dump so that it can be later found in
    the core file for inspection.

A copy of the last valid thread_dump_buffer used is kept in last_dump_buffer,
for easier post-mortem analysis. This one may be NULL or even invalid, but
usually during a panic it will be valid, and may reveal useful hints even if it
still contains the dump of the last warning. Usually this will point to a trash
buffer or to stack area.

ha_thread_dump_fill() then either directly calls ha_thread_dump_one() if the
target thread is the current thread, or sends the target thread DEBUGSIG
(SIGURG) if it's a different thread. This signal is initialized at boot time
by init_debug() to call handler debug_handler().

debug_handler() then operates on the target thread and recognizes that it must
allocate its own buffer if the pointer is 0x2, calls ha_thread_dump_one(), then
waits forever (it does not return from the signal handler so as to make sure
the dumped thread will not badly interact with other ones).

ha_thread_dump_one() collects some info, that it prints all along into the
target buffer. Depending on the situation, it will dump current tasks or not,
may mark that Lua is involved and TAINTED_LUA_STUCK, and if running in shared
mode, also taint the process with TAINTED_LUA_STUCK_SHARED. It calls
ha_dump_backtrace() before returning.

ha_dump_backtrace() produces a backtrace into a local buffer (100 entries max),
then dumps the code bytes nearby the crashing instrution, dumps pointers and
tries to resolve function names, and sends all of that into the target buffer.
On some architectures (x86_64, arm64), it will also try to detect and decode
call instructions and resolve them to called functions.

3. Improvements
---------------

The symbols resolution is extremely expensive, particularly for the warnings
which should be fast. But we need it, it's just unfortunate that it strikes at
the wrong moment. At least ha_dump_backtrace() does disable signals while it's
resolving, in order to avoid unwanted re-entrance. In addition, the called
function resolve_sym_name() uses some locking and refrains from calling the
dladdr family of functions in a re-entrant way (in the worst case only well
known symbols will be resolved)..

In an ideal case, ha_dump_backtrace() would dump the pointers to a local array,
which would then later be resolved asynchronously in a tasklet. This can work
because the code bytes will not change either so the dump can be done at once
there.

However the tasks dumps are not much compatible with this. For example
ha_task_dump() makes a number of tests and itself will call hlua_traceback() if
needed, so it might still need to be dumped in real time synchronously and
buffered. But then it's difficult to reassemble chunks of text between the
backtrace (that needs to be resolved later) and the tasks/lua parts. Or maybe
we can afford to disable Lua trace dumps in warnings and keep them only for
panics (where the asynchronous resolution is not needed) ?

Also differentiating the call paths for warnings and panics is not something
easy either.