In master-worker mode the master CLI proxy (mworker_proxy) has a
hardcoded maxconn of 10. When a client connects to the master CLI
socket and issues a command that gets forwarded to an unresponsive
worker (e.g. one that is stuck or very slow), the connection hangs
waiting for the worker's response. If the client then disconnects
(timeout, Ctrl-C, etc.), the connection slot is never released because
the client-side FIN is never acknowledged by the unresponsive worker.
After 10 such leaked slots the master CLI socket becomes completely
unreachable, returning "Resource temporarily unavailable" to any new
connection attempt. In containerized deployments this means readiness
probes start failing and the pod gets restarted.
The fix adds a timeout server-fin of 1s to the mworker_proxy. When
the client disconnects while waiting for a worker response, this
timeout ensures the dangling backend connection is cleaned up after
1s, freeing the connection slot. This does not affect normal CLI
operations since the timeout only starts after the client has already
closed its side of the connection.
A regression test is included that blocks the worker CLI thread using
"debug dev delay" with nbthread 1, fills all 10 master CLI slots,
waits for client-side timeouts, then verifies a new connection still
succeeds.
This fixes GH issue #3351.
This should be backported to all stable branches.
Co-authored-by: Martin Strenge <github@trixer.net>
Co-authored-by: William Lallemand <wlallemand@haproxy.com>
When threads are enabled and running on a machine with multiple CCX
or multiple nodes, thread groups are now enabled since 3.3-dev2, causing
load-balancing algorithms to randomly fail due to incoming connections
spreading over multiple groups and using different load balancing indexes.
Let's just force "thread-groups 1" into all configs when threads are
enabled to avoid this.
This patch removes completely the support for the program section, the
parsing of the section as well as the internals in the mworker does not
support it anymore.
The program section was considered dysfonctional and not fully
compatible with the "mworker V3" model. Users that want to run an
external program must use their init system.
The documentation is cleaned up in another patch.
-Ws on VTest is not working correctly for an unknown reason, the polling
of the NOTIFY_SOCKET seems to timeout, and VTest never receives the
READY message.
This patch disables the reg-tests using -Ws on OS X.
The -W mode implemented in VTest is not reliable anymore, because VTest
waits for the pidfile to be created. But with the new master-worker
mode, this file is created long before haproxy is ready. This can lead
to the test being started too soon, and failing from time to time.
The -Ws option allows to wait for haproxy to deliver a message to VTest
once it is ready.
Every reg-test now runs without any warning, so let's acivate -dW by
default so the new ones will inheritate the option.
This patch reverts 9d511b3c ("REGTESTS: enable -dW on almost all tests
to fail on warnings") and adds -dW in the default HAPROXY_ARGS of
scripts/run-regtests.sh instead.
As master parses now expose-deprecated-directives option, let's emit warning
about deprecated 'progam' section only in case, if this option wasn't set in
the 'global' section. This allows to people, who don't prefer to remove the
'program' section immediately to continue to start the process in zero-warning
mode.
Adjust the warning message accordingly and mcli_start_progs.vtc test. As
expose-deprecated-directives option is a 'global' section keyword, this section
must always precede any 'program' section, if users still continue to keep
'program' section.
This doesn't need to be backported, as related to the latest changes in
the master-worker architecture.
Now that warnings were almost all removed, let's enable zero-warning
via -dW. All tests were adjusted, but two:
- mcli/mcli_start_progs.vtc:
the programs section currently cannot be silenced
- stats/stats-file.vtc:
the warning comes from the stats file itself on comment lines.
All other ones are now OK.
When vtest starts haproxy process, it loops until the moment, when haproxy
pidfile is created. When pidfile is created, vtest considers that haproxy
process is ready and it starts to perform test commands, in particular, it
connects to CLI. It's not very reliable approach to base the check of the
process readiness on the PID file. After master-worker architecture
refactoring pidfile is created in the early init stage, but master and worker
are not yet finished its initialization routines. So, all mcli tests and some
tests where we sent commands to CLI start to fail regularly.
In vtest at the moment there is no any other approach to check that the
process is really ready. So let's add a delay 0.1s before connecting to CLI in
all mcli tests and in acl_cli_spaces test.
A recent fix broke the pipelined command on the master CLI, this
reg-tests implement a simple test that allow to check its right
behavior.
This could be backported as far as 2.6.
With the CI occasionally slowing down, we're starting to see again some
spurious failures despite the long 1-second timeouts. This reports false
positives that are disturbing and doesn't provide as much value as this
could. However at this delay it already becomes a pain for developers
to wait for the tests to complete.
This commit adds support for the new environment variable
HAPROXY_TEST_TIMEOUT that will allow anyone to modify the connect,
client and server timeouts. It was set to 5 seconds by default, which
should be plenty for quite some time in the CI. All relevant values
that were 200ms or above were replaced by this one. A few larger
values were left as they are special. One test for the set-timeout
action that used to rely on a fixed 1-sec value was extended to a
fixed 5-sec, as the timeout is normally not reached, but it needs
to be known to compare the old and new values.
This one was deprecated in 2.3 and marked for removal in 2.5. It suffers
too many limitations compared to threads, and prevents some improvements
from being engaged. Instead of a bypassable startup error, there is now
a hard error.
The parsing code was removed, and very few obvious cases were as well.
The code is deeply rooted at certain places (e.g. "for" loops iterating
from 0 to nbproc) so it will not be that trivial to remove everywhere.
The "bind" and "bind-process" parsers will have to be adjusted, though
maybe not completely changed if we later want to support thread groups
for large NUMA machines. Some stats socket restrictions were removed,
and the doc was updated according to what was done. A few places in the
doc still refer to nbproc and will have to be revisited. The master-worker
code also refers to the process number to distinguish between master and
workers and will have to be carefully adjusted. The MAX_PROCS macro was
reset to 1, this will at least reduce the size of some remaining arrays.
Two regtests were dependieng on this directive, one with an explicit
"nbproc 1" and another one testing the master's CLI using nbproc 4.
Both were adapted.
This regtest tests the issue #446 by starting 2 programs and checking if
they exist in the "show proc" of the master CLI.
Should be backported as far as 2.0.
VTest can now enable mworker and mcli with separate flags so lets
update vtc files that need it. This also allows to revert the change
made with 1545a59c ("REGTESTS: make seamless-reload depend on 1.9
and above").
This test launches a HAProxy process in master worker with 'nbproc 4'.
It sends a "show info" to the process 3 and verify that the right
process replied.
This regtest depends on the support of the master CLI for VTest.