BUG/MEDIUM: wdt: always ignore the first watchdog wakeup

Since commit a06c215f08 ("MEDIUM: wdt: always make the faulty thread
report its own warnings"), once the TH_FL_STUCK flag was turned on,
we'd then proceed to the panic code instead of giving the thread a
second chance as was done before that commit. This can trigger in rare
cases that only happen under moderate loads, like the one addressed by
commit 24ce001771 ("BUG/MEDIUM: wdt: fix the stuck detection for
warnings"). It is in fact due to the loss of the common
"goto update_and_leave" that used to serve both the warning code and
the flag setting for probation, and it's apparently what hit Christian
in issue #2980.

Let's make sure we exit naturally when turning the bit on for the
first time. Let's also update the confusing comment at the end of
the check, which was left over by the latest change.
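
The two-step probation described above can be illustrated with a small
standalone model. This is only a sketch and not the HAProxy code: FL_STUCK,
wdt_check() and the wdt_action values are hypothetical names, the elapsed
time is passed in directly, and the clearing of the flag by a thread that
makes progress is not modelled.

```c
/* Hypothetical stand-in for TH_FL_STUCK, for illustration only. */
#define FL_STUCK 0x1u

/* Outcome of one simulated watchdog wakeup (hypothetical names). */
enum wdt_action {
	WDT_LEAVE,     /* refresh the timer and return: nothing reported */
	WDT_ESCALATE   /* continue to the warning/panic handling */
};

/* Simplified model of the check for one thread. <elapsed_ns> plays the
 * role of "n - p" in the diff and <warn_ns> the role of the warning
 * threshold.
 */
static enum wdt_action wdt_check(unsigned int *flags,
                                 unsigned long long elapsed_ns,
                                 unsigned long long warn_ns)
{
	if (!(*flags & FL_STUCK)) {
		/* after one second it's clear that we're stuck */
		if (elapsed_ns >= 1000000000ULL) {
			/* first wakeup: only mark the thread and leave, so
			 * that it gets a second chance before any panic.
			 */
			*flags |= FL_STUCK;
			return WDT_LEAVE;
		}
		else if (elapsed_ns < warn_ns) {
			/* below the warning boundary: nothing to report */
			return WDT_LEAVE;
		}
	}
	/* either the flag was already set on a previous wakeup, or the
	 * warning boundary was crossed.
	 */
	return WDT_ESCALATE;
}
```

In this model, a first call with more than one second elapsed only sets the
flag and returns WDT_LEAVE. A second call with the flag still set returns
WDT_ESCALATE, which corresponds to the second chance the commit restores.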

Since the first commit was backported to 3.1, this commit should be
backported there as well.
Willy Tarreau 2025-05-20 15:52:44 +02:00
parent dcdf27af70
commit 0a8bfb5b90

@@ -122,14 +122,17 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
*/
if (!(_HA_ATOMIC_LOAD(&ha_thread_ctx[thr].flags) & TH_FL_STUCK)) {
/* after one second it's clear that we're stuck */
-if (n - p >= 1000000000ULL)
+if (n - p >= 1000000000ULL) {
_HA_ATOMIC_OR(&ha_thread_ctx[thr].flags, TH_FL_STUCK);
+goto update_and_leave;
+}
else if (n - p < (ullong)wdt_warn_blocked_traffic_ns) {
/* if we haven't crossed the warning boundary,
* let's just refresh the reporting thread's timer.
*/
goto update_and_leave;
}
}
/* OK so we've crossed the warning boundary and possibly the
* panic one as well. This may only be reported by the original
@@ -138,9 +141,6 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
* check the ctxsw count and decide whether to do nothing, to
* warn, or either panic.
*/
}
/* No doubt now, there's no hop to recover, die loudly! */
break;
#if defined(USE_THREAD) && defined(SI_TKILL) /* Linux uses this */