BUG/MEDIUM: wdt: always ignore the first watchdog wakeup

With commit a06c215f08 ("MEDIUM: wdt: always make the faulty thread report its own warnings"), when the TH_FL_STUCK flag was flipped on, we'd then go to the panic code instead of giving a second chance like before the commit. This can trigger rare cases that only happen with moderate loads, like the one addressed by commit 24ce001771 ("BUG/MEDIUM: wdt: fix the stuck detection for warnings"). This is in fact due to the loss of the common "goto update_and_leave" that used to serve both the warning code and the flag setting for probation, and it's apparently what hit Christian in issue #2980.

Let's make sure we exit naturally when turning the bit on for the first time. Let's also update the confusing comment at the end of the check that was left over by the latest change.

Since the first commit was backported to 3.1, this commit should be backported there as well.
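
The probation logic described above can be sketched in isolation as follows. This is only an illustrative model, not the code from src/wdt.c: the names FL_STUCK, STUCK_NS, struct thr_state, enum wdt_action and wdt_check() are hypothetical stand-ins, and the real handler works on ha_thread_ctx[] with atomic operations inside a signal handler.

```c
#include <stdint.h>

/* Hypothetical stand-ins for illustration, not HAProxy's real definitions. */
#define FL_STUCK 0x1u             /* plays the role of TH_FL_STUCK */
#define STUCK_NS 1000000000ULL    /* one second, as in the patched check */

enum wdt_action {
	WDT_LEAVE,   /* refresh the timer and return (update_and_leave) */
	WDT_REPORT   /* go on to the warn-or-panic decision */
};

struct thr_state {
	unsigned int flags; /* FL_STUCK is assumed to be cleared elsewhere on progress */
};

/* Models one watchdog wakeup. <blocked_ns> is the time elapsed without
 * progress and <warn_ns> is the warning threshold.
 */
enum wdt_action wdt_check(struct thr_state *t, uint64_t blocked_ns, uint64_t warn_ns)
{
	if (!(t->flags & FL_STUCK)) {
		if (blocked_ns >= STUCK_NS) {
			/* first sighting: put the thread on probation and
			 * exit naturally instead of reporting right away.
			 */
			t->flags |= FL_STUCK;
			return WDT_LEAVE;
		}
		else if (blocked_ns < warn_ns) {
			/* warning boundary not crossed yet */
			return WDT_LEAVE;
		}
	}

	/* either the flag was already set by a previous wakeup, or the
	 * warning boundary was crossed.
	 */
	return WDT_REPORT;
}
```

In this model, two consecutive calls with the same blocked time of at least one second return WDT_LEAVE and then WDT_REPORT, which is the "second chance" the fix restores.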
2025-12-16 23:21:01 +01:00 · 2025-05-20 15:52:44 +02:00 · 2025-05-20 15:52:44 +02:00 · 0a8bfb5b90
commit 0a8bfb5b90
parent dcdf27af70
1 changed files with 10 additions and 10 deletions
--- a/src/wdt.c
+++ b/src/wdt.c
@@ -122,14 +122,17 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
 		 */
 		if (!(_HA_ATOMIC_LOAD(&ha_thread_ctx[thr].flags) & TH_FL_STUCK)) {
 			/* after one second it's clear that we're stuck */
-			if (n - p >= 1000000000ULL)
+			if (n - p >= 1000000000ULL) {
 				_HA_ATOMIC_OR(&ha_thread_ctx[thr].flags, TH_FL_STUCK);
+				goto update_and_leave;
+			}
 			else if (n - p < (ullong)wdt_warn_blocked_traffic_ns) {
 				/* if we haven't crossed the warning boundary,
 				 * let's just refresh the reporting thread's timer.
 				 */
 				goto update_and_leave;
 			}
+		}

 		/* OK so we've crossed the warning boundary and possibly the
 		 * panic one as well. This may only be reported by the original
@@ -138,9 +141,6 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
 		 * check the ctxsw count and decide whether to do nothing, to
 		 * warn, or either panic.
 		 */
-		}
-
-		/* No doubt now, there's no hop to recover, die loudly! */
 		break;

 #if defined(USE_THREAD) && defined(SI_TKILL) /* Linux uses this */