After the master-worker refactoring, the master performs a re-exec only once,
upon receiving the "reload" command or the USR2 signal. There is no longer a
second re-exec of the master to free unused memory. Thus, there is no longer any
need to export the HAPROXY_LOAD_SUCCESS environment variable with the worker
process load status. This status can simply be saved in a global variable,
load_status.
Using a nested 'if' while checking whether the "reload" sockpair needs to be
allocated again does not degrade performance, as mworker_create_master_cli is
a startup routine.
This nested 'if' (one condition is checked per statement) makes it more visible
that the "reload" sockpair is allocated only once, when the master process
starts, and that it is not re-allocated (hence, its FDs are not closed) during
reloads. Checking the conditions this way makes this fact easier to spot when
analysing the code in order to investigate FD leaks between master and worker.
Before this patch, when a wrong argument was provided in the configuration for
the mworker-max-reloads keyword, the parser showed the error below on stderr:
[WARNING] (1820317) : config : parsing [haproxy.cfg:154] : (null)parsing [haproxy.cfg:154] : 'mworker-max-reloads' expects an integer argument.
When two arguments were provided by mistake instead of one, this also
triggered a buggy error message:
[ALERT] (1820668) : config : parsing [haproxy.cfg:154] : 'mworker-max-reloads' cannot handle unexpected argument '45'.
[WARNING] (1820668) : config : parsing [haproxy.cfg:154] : (null)
So, as 'mworker-max-reloads' is parsed in discovery mode by the master process,
let's align its parser with all the others that can be called in this mode.
This way, when there are too many arguments or the argument isn't a valid
integer, we return proper error codes to the global section parser and the
messages are formatted properly.
This fix should be backported to all stable versions.
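As a rough illustration of the two checks described above (too many arguments,
argument not a valid integer), a single-argument integer keyword parser could
look like the sketch below. The function name, the return codes and the exact
message wording are hypothetical and are not taken from the haproxy sources.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical return codes, for illustration only. */
enum { KW_OK = 0, KW_ERR = -1 };

/* Parse a keyword that takes exactly one non-negative integer argument.
 * args[0] is the keyword and args[1] its value; an empty string marks the
 * end of the arguments. On error, a complete message is written to <err>.
 */
static int parse_one_int_kw(char **args, int *value, char *err, size_t errlen)
{
	char *end = NULL;
	long v;

	assert(args && value && err && errlen > 0);

	if (!*args[1]) {
		snprintf(err, errlen, "'%s' expects an integer argument.", args[0]);
		return KW_ERR;
	}
	if (*args[2]) {
		snprintf(err, errlen, "'%s' cannot handle unexpected argument '%s'.",
		         args[0], args[2]);
		return KW_ERR;
	}

	errno = 0;
	v = strtol(args[1], &end, 10);
	if (errno || end == args[1] || *end != '\0' || v < 0 || v > INT_MAX) {
		snprintf(err, errlen, "'%s' expects an integer argument.", args[0]);
		return KW_ERR;
	}

	*value = (int)v;
	return KW_OK;
}
```

The point of the sketch is that every failure path fills the message buffer
and returns an error code, so the caller never prints an empty message.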
It's convenient for testing and for usage to produce different warning
messages when the former worker exits because max reloads was exceeded and when
it was terminated by the master.
This patch is part of a series to reintroduce program support in the new
master-worker architecture.
For the moment we keep the order of program and worker forks the same as before
the refactoring, as we need to be sure that this won't introduce regressions.
So, programs are forked before the new worker process.
Before forking the programs we already need the deserialized processes list, in
order to find the programs launched before the reload and to stop them. The
processes list is saved before the reload in the HAPROXY_PROCESSES variable. It
should be deserialized before the first configuration read in discovery mode,
because the resetenv keyword could be present in the global section.
So, let's move mworker_env_to_proc_list() from mworker_create_master_cli() to
main(). We need to call it only after a reload in master-worker mode, thus
HAPROXY_MWORKER_REEXEC and HAPROXY_PROCESSES should still be present in the
environment of the re-executing process before the first configuration read.
This patch is part of a series to reintroduce program support in the new
master-worker architecture.
We only launch and stop external programs, and there is no communication
between the master process and the started program binary. So, ipc_fd[0] and
ipc_fd[1] are not used and are kept at -1 for program processes. Because of
this, there is no need for the exiting program process to call fd_delete on
these fds. Doing so would trigger a BUG_ON.
With the refactored master-worker architecture, the master and worker processes
each parse their own part of the configuration. A worker could have a huge
configuration, so it may take some time to load. Since HAPROXY_LOAD_SUCCESS is
now set to 1 only after the READY status has been received from the new worker,
cli_io_handler_show_loadstatus() may exit very quickly, showing load status 0,
and in that case the mcli socket will be closed.
This already breaks some regression tests and can confuse some APIs. So, let's
slow down the load status delivery. If there is still a process in the process
list that is loading (PROC_O_INIT), the appctx task sleeps for 50ms and the
handler returns 0. cli_io_handler_show_loadstatus() is called in a loop, so
with such pacing there is a high chance that all processes will be in the READY
state the next time we enter it. This way the master CLI connection socket
won't be closed until the loading of the new worker has really finished, so the
reload status and logs (Success=1/0) will be shown synchronously.
When reloads arrive very often (sent by some APIs), newly forked workers
barely have time to load completely and to send their READY status to the
master, which then allows it to stop the previous worker (launched before the
reload). As a result, the number of workers increases very quickly, previous
workers are still alive and the memory consumption is very high.
To avoid such situations, let's return reload status 0 from cli_parse_reload()
with the text "Another reload is still in progress" if there is still a
process with the PROC_O_INIT flag in the processes list.
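The check described above can be sketched as follows. This is a simplified
model, not the haproxy code: the structure, the flag value and the function
names are assumptions made for the example, and only the PROC_O_INIT name and
the message text come from the description.

```c
#include <assert.h>
#include <stddef.h>

#define PROC_O_INIT 0x01 /* assumed value: the process is still loading */

/* Simplified stand-in for an entry of the processes list. */
struct proc_sketch {
	int options;
	struct proc_sketch *next;
};

/* Return 1 if at least one process in the list is still loading. */
static int reload_in_progress(const struct proc_sketch *head)
{
	for (; head; head = head->next)
		if (head->options & PROC_O_INIT)
			return 1;
	return 0;
}

/* Return the reload status: 0 (refused, <msg> is set) or 1 (accepted). */
static int try_reload(const struct proc_sketch *head, const char **msg)
{
	assert(msg != NULL);

	if (reload_in_progress(head)) {
		*msg = "Another reload is still in progress";
		return 0;
	}
	*msg = NULL;
	return 1;
}
```

Refusing early keeps at most one worker in the loading state at any time,
which is what bounds the number of live workers under frequent reloads.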
In the new master-worker architecture, when a worker process is forked and
successfully initialized, it needs to communicate its "READY" state to the
master, so that the master can terminate the previous worker and any workers
that might have exceeded the max_reloads counter.
So, let's implement a new master CLI command, _send_status, for this. A new
worker can send its status string "READY" to the master when it is about to
enter the run poll loop, and thus can start to receive data.
In _send_status(), in the master context, we update the status of the new
worker: the PROC_O_INIT flag is removed.
When a TERM signal is sent to a worker, the worker terminates and this triggers
the mworker_catch_sigchld() handler in the master. This handler deletes the
exiting process entry from the processes list.
In _send_status() we loop over the processes list twice: the first time to stop
workers that exceeded the max_reloads counter, and the second time to stop the
worker forked before the last reload. In the corner case where max_reloads=1,
we avoid sending SIGTERM twice to the same worker by setting the sigterm_sent
flag during the first loop.
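The two passes can be modelled roughly as below. This is only a sketch under
several assumptions: the structure layout, the per-entry placement of the
sigterm_sent flag, the "reloads > max_reloads" comparison and the function
names are all hypothetical, and the signal is simulated by a counter instead of
a real kill().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a worker entry of the processes list. */
struct worker_sketch {
	int reloads;      /* number of reloads this worker went through */
	int leaving;      /* non-zero if forked before the last reload */
	int sigterm_sent; /* set once a SIGTERM was issued to this worker */
	int signals;      /* simulated SIGTERM counter */
};

/* Stand-in for kill(pid, SIGTERM). */
static void stop_worker(struct worker_sketch *w)
{
	w->signals++;
	w->sigterm_sent = 1;
}

/* Called in the master once the new worker reported "READY". */
static void on_new_worker_ready(struct worker_sketch *w, size_t n, int max_reloads)
{
	size_t i;

	assert(w != NULL || n == 0);

	/* first pass: workers that exceeded the max_reloads counter */
	for (i = 0; i < n; i++)
		if (max_reloads >= 0 && w[i].reloads > max_reloads)
			stop_worker(&w[i]);

	/* second pass: the previous worker, unless already signalled */
	for (i = 0; i < n; i++)
		if (w[i].leaving && !w[i].sigterm_sent)
			stop_worker(&w[i]);
}
```

A worker that matches both conditions is signalled only once, which is the
purpose of the flag.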
When the master performs a reexec, it should set the PROC_O_LEAVING flag on the
already existing worker. This marks the existing worker as the previous one,
which will be terminated after the reload.
In the previous implementation the master process needed to do the reexec
twice (the first time to parse its configuration and the second time to free
unused resources). So the logic for setting PROC_O_LEAVING was based on
comparing the number of reloads performed by each process in the processes
list, except the master.
Now, as mentioned before, the reexec is performed only once. So, in this case
we need to set the PROC_O_LEAVING flag when we deserialize the list. This is
done for all processes whose number of reloads is strictly positive.
The case where the new worker fails while parsing its configuration or while
trying to apply it can be considered a new one, because the master process no
longer needs to reexec again. The master simply keeps the previous worker
(forked before the reload) and lets the new one exit with a failure.
When the new worker exits, in the master process context (mworker_catch_sigchld)
we need to stop a MASTER proxy listener and to drop the server attached to the
new worker's CLI sockpair (it's inherited in the master). Then we explicitly
delete the master's end of this sockpair (child->ipc_fd[0]) from the fdtab and
we free the memory allocated for the worker process.
on_new_child_failure() is called before the cleanup to signal to systemd that
the reload/load failed.
If the new worker fails during the first start, so that there is no previous
worker, the master process should exit immediately in order to keep the same
behaviour as before this architecture change.
If the worker exits due to a failure or due to receiving a TERM signal, in the
master context we can no longer simply close the master's fd (ipc_fd[0]) of
the inherited master CLI sockpair.
When the worker is created, the MASTER proxy listener is bound to ipc_fd[0] in
the master process context. When this worker fails or exits, the master process
is always in its polling loop. So, closing an fd in its context immediately
triggers the BUG_ON(fd->owner), as the poller tries to reinsert the "freed" fd
into the fdtab and tries to reuse it. We must call fd_delete in this case.
This deinitializes the fd's auxiliary data and closes it properly.
For the master process we always need to create a MASTER proxy, even if
master CLI settings were not provided on the command line, because we now bind
a listener at ipc_fd[0] in the master process context. So, the MASTER proxy
should already be allocated at this moment.
This is the first commit in a series to add support for the 4 primary reload
use-cases of the new master-worker architecture:
1. Newly forked worker process dies before any reload, due to some errors in
the configuration. Newly forked worker process crashes before any reload
after sending its "READY" state to master.
2. Newly forked worker process dies due to some errors in the new
configuration. This happens after reload, when this new configuration was
supplied, so the previous worker process is still here.
3. Newly forked worker process crashes after sending its "READY" state to
master due to some bugs. This happens after reload, so the previous worker
process is still here.
4. Newly forked worker process has sent its "READY" state to master and starts
to receive traffic. This happens after reload, the old worker hasn't
terminated yet, as it is waiting on some idle connection and it crashes.
In this commit, let's rename mworker_cli_proxy_new_listener() to
mworker_cli_master_proxy_new_listener() to make it clear that this function
creates the "master-socket" bind conf and allocates a listener. This listener
is attached to the MASTER proxy and is bound to the ipc_fd[0] of the sockpair
inherited in the master and in worker processes (master CLI sockpair).
This commit is part of the series to add support for discovery mode in the
configuration parser and in the initialization sequence.
Some keyword parsers tagged with KWF_DISCOVERY (for example those which parse
runtime modes, poller types or the pidfile) should not be called a second time
when the configuration is read again after the discovery mode.
This is redundant and could trigger parser errors in standalone mode. In
master-worker mode the worker process inherits the parsed settings from the
master.
This commit is part of the series to add support for discovery mode in the
configuration parser and in the initialization sequence.
So, let's add the KWF_DISCOVERY flag here to distinguish the keywords which
should be parsed in "discovery" mode, and which are needed by the master
process, from all the others. Keywords that should be parsed in "discovery"
mode have their own dedicated parser functions. Let's tag these functions with
the KWF_DISCOVERY flag in the keywords list. This way, only these keyword
parsers may be called during the first configuration read in discovery mode.
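A minimal sketch of such flag-based dispatching is shown below. Only the
KWF_DISCOVERY name comes from the description; its value, the keyword table,
the mode enum and the parser functions are made up for the example and do not
reflect the real haproxy keyword registration API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KWF_DISCOVERY 0x01 /* assumed value */

/* Hypothetical parsing passes: first read (discovery), then full read. */
enum parse_mode { MODE_DISCOVERY, MODE_FULL };

struct kw_sketch {
	const char *name;
	int (*parse)(const char *arg);
	unsigned int flags;
};

static int calls_daemon, calls_maxconn;

static int parse_daemon(const char *arg)  { (void)arg; calls_daemon++;  return 0; }
static int parse_maxconn(const char *arg) { (void)arg; calls_maxconn++; return 0; }

static const struct kw_sketch kw_list[] = {
	{ "daemon",  parse_daemon,  KWF_DISCOVERY },
	{ "maxconn", parse_maxconn, 0 },
};

/* Return 1 if a parser was called, 0 if the keyword was skipped in this
 * pass, -1 if the keyword is unknown.
 */
static int dispatch_kw(const char *name, const char *arg, enum parse_mode mode)
{
	size_t i;

	assert(name != NULL);

	for (i = 0; i < sizeof(kw_list) / sizeof(kw_list[0]); i++) {
		const struct kw_sketch *k = &kw_list[i];
		int discovery_kw = !!(k->flags & KWF_DISCOVERY);

		if (strcmp(k->name, name) != 0)
			continue;
		/* discovery pass: tagged keywords only; later pass: the others */
		if ((mode == MODE_DISCOVERY) != discovery_kw)
			return 0;
		k->parse(arg);
		return 1;
	}
	return -1;
}
```

With this layout each parser runs in exactly one of the two passes, so a
tagged keyword is never parsed a second time.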
Let's encapsulate the logic that creates the 'reload' sockpair and the master
CLI listeners into a separate function, as this is only needed in master-worker
runtime mode. This makes the code of init() more readable.
Given the xz drama which allowed liblzma to be linked to openssh, let's remove
libsystemd to get rid of useless dependencies.
The sd_notify API seems to be stable and is now documented. This patch replaces
the sd_notify() and sd_notifyf() functions with a reimplementation inspired by
the systemd documentation.
This should not change anything functionally. The functions will be built when
haproxy is built using USE_SYSTEMD=1.
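For readers unfamiliar with the notification protocol, the sketch below shows
the general idea as it is publicly documented by systemd: the service sends a
datagram such as "READY=1" to the AF_UNIX socket named by the NOTIFY_SOCKET
environment variable. This is not the code added by the patch; the function
name and the return value convention are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Send <message> to the socket named by $NOTIFY_SOCKET.
 * Returns 1 if sent, 0 if the variable is unset, a negative errno on error.
 */
static int notify_sketch(const char *message)
{
	struct sockaddr_un addr;
	const char *path = getenv("NOTIFY_SOCKET");
	size_t path_len, msg_len;
	socklen_t addr_len;
	ssize_t written;
	int fd, ret;

	assert(message != NULL);

	if (!path)
		return 0;
	if (!*message)
		return -EINVAL;

	path_len = strlen(path);
	/* only absolute paths and abstract sockets ('@' prefix) are accepted */
	if ((path[0] != '/' && path[0] != '@') || path_len < 2 ||
	    path_len >= sizeof(addr.sun_path))
		return -EINVAL;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	memcpy(addr.sun_path, path, path_len);
	if (addr.sun_path[0] == '@')
		addr.sun_path[0] = '\0'; /* abstract namespace (Linux) */
	addr_len = offsetof(struct sockaddr_un, sun_path) + path_len;

	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (fd < 0)
		return -errno;

	msg_len = strlen(message);
	written = sendto(fd, message, msg_len, 0,
	                 (struct sockaddr *)&addr, addr_len);
	if (written == (ssize_t)msg_len)
		ret = 1;
	else
		ret = (written < 0) ? -errno : -EPROTO;
	close(fd);
	return ret;
}
```

A caller would typically invoke notify_sketch("READY=1") once startup is
complete; when the program is not started by systemd the variable is unset and
the call does nothing.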
References:
https://github.com/systemd/systemd/issues/32028
https://www.freedesktop.org/software/systemd/man/devel/sd_notify.html#Notes
Before:
wla@kikyo:~% ldd /usr/sbin/haproxy
linux-vdso.so.1 (0x00007ffcfaf65000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x000074637fef4000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x000074637fe4f000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x000074637f400000)
liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x000074637fe0d000)
libsystemd.so.0 => /lib/x86_64-linux-gnu/libsystemd.so.0 (0x000074637f92a000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x000074637f365000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000074637f000000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000074637f27a000)
libcap.so.2 => /lib/x86_64-linux-gnu/libcap.so.2 (0x000074637fdff000)
libgcrypt.so.20 => /lib/x86_64-linux-gnu/libgcrypt.so.20 (0x000074637eeb8000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x000074637fdcd000)
libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x000074637ee01000)
liblz4.so.1 => /lib/x86_64-linux-gnu/liblz4.so.1 (0x000074637fda8000)
/lib64/ld-linux-x86-64.so.2 (0x000074637ff5d000)
libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x000074637f904000)
After:
wla@kikyo:~% ldd /usr/sbin/haproxy
linux-vdso.so.1 (0x00007ffd51901000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f758d6c0000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f758d61b000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007f758ca00000)
liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x00007f758d5d9000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f758d365000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f758d5ba000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f758c600000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f758c915000)
/lib64/ld-linux-x86-64.so.2 (0x00007f758d729000)
A backport to all stable versions could be considered at some point.
The main CLI I/O handler is responsible for interrupting the processing on
shutdown/abort. It is not the responsibility of the I/O handlers of CLI
commands to take care of it.
The mworker mode never had a proper 'hard-stop' (-st) for the reload. This is
a mode which was commonly used with the daemon mode, but it was never
implemented in mworker mode.
This patch fixes the problem by implementing a "hard-reload" command
over the master CLI. It does the same as the "reload" command, but
instead of waiting for the connections to stop in the previous process,
it immediately quits the previous process after binding.
Should fix issue #1034.
Display a more accessible message explaining what to do when a worker crashes.
Example:
$ ./haproxy -W -f haproxy.cfg
[NOTICE] (308877) : New worker (308884) forked
[NOTICE] (308877) : Loading success.
[NOTICE] (308877) : haproxy version is 2.9-dev4-d90d3b-58
[NOTICE] (308877) : path to executable is ./haproxy
[ALERT] (308877) : Current worker (308884) exited with code 139 (Segmentation fault)
[WARNING] (308877) : A worker process unexpectedly died and this can only be explained by a bug in haproxy or its dependencies.
Please check that you are running an up to date and maintained version of haproxy and open a bug report.
HAProxy version 2.9-dev4-d90d3b-58 2023/09/05 - https://haproxy.org/
Status: development branch - not safe for use in production.
Known bugs: https://github.com/haproxy/haproxy/issues?q=is:issue+is:open
Running on: Linux 6.2.0-31-generic #31-Ubuntu SMP PREEMPT_DYNAMIC Mon Aug 14 13:42:26 UTC 2023 x86_64
[ALERT] (308877) : exit-on-failure: killing every processes with SIGTERM
[WARNING] (308877) : All workers exited. Exiting... (139)
Aurelien Darragon found a case of leak when working on ticket #2184.
When a reexec_on_failure() happens *BEFORE* protocol_bind_all(), the
worker is not forked and the mworker_proc struct is still there with
its 2 socketpairs.
The socketpair that is supposed to be in the master is already closed in
mworker_cleanup_proc(), the one for the worker was supposed to
be cleaned up in mworker_cleanlisteners().
However, since the fd is not bound during this failure, the fd is never
closed.
This patch fixes the problem by setting the fd to -1 in the mworker_proc
after the fork, so we ensure that it won't be closed if everything
was done right, and then we try to close it in mworker_cleanup_proc()
when it's not set to -1.
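The "-1 means not owned" convention can be illustrated with the small sketch
below. The structure and the function names are hypothetical and much simpler
than the real mworker_proc handling; the sketch only shows why an early failure
no longer leaks the worker-side fd.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Simplified stand-in for the per-process structure. */
struct proc_fd_sketch {
	int ipc_fd[2]; /* -1 when the fd is not owned by this structure */
};

/* Allocate the socketpair; returns 0 on success, -1 on error. */
static int proc_open_sockpair(struct proc_fd_sketch *p)
{
	assert(p != NULL);
	p->ipc_fd[0] = p->ipc_fd[1] = -1;
	return socketpair(AF_UNIX, SOCK_STREAM, 0, p->ipc_fd);
}

/* Mark the worker-side fd as handed over (for example after a successful
 * fork), so that the cleanup below will not touch it.
 */
static void proc_handover_worker_fd(struct proc_fd_sketch *p)
{
	assert(p != NULL);
	p->ipc_fd[1] = -1;
}

/* Close whatever is still owned; returns the number of fds closed. */
static int proc_cleanup(struct proc_fd_sketch *p)
{
	int i, closed = 0;

	assert(p != NULL);
	for (i = 0; i < 2; i++) {
		if (p->ipc_fd[i] != -1) {
			close(p->ipc_fd[i]);
			p->ipc_fd[i] = -1;
			closed++;
		}
	}
	return closed;
}
```

On the early-failure path the handover never happens, so the cleanup closes
both fds; on the normal path it closes only the one that is still owned.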
This could be triggered with the script in ticket #2184 and a `ulimit -H
-n 300`. This will fail before the protocol_bind_all() when trying to
increase the nofile setrlimit.
In recent versions of haproxy, there is a BUG_ON() in fd_insert() that
could be triggered by this bug because of the global.maxsock check.
Must be backported as far as 2.6.
The problem could exist in previous versions but the code is different
and this won't be triggered easily without other consequences in the
master.
In ticket #2184, HAProxy is crashing in a BUG_ON() after a lot of reloads
when the previous processes did not exit.
Each worker has a socketpair which is an FD in the master. When reloading,
this FD still exists until the process leaves. But the global.maxsock
value is not incremented for each of these FDs. So when there are too many
workers and the number of FDs reaches maxsock, the next FD inserted in
the poller will crash the process.
This patch fixes the issue by increasing the maxsock for each remaining
worker.
Must be backported in every maintained version.
The purpose of this patch is only a one-to-one replacement, as far as
possible.
CF_SHUTR(_NOW) and CF_SHUTW(_NOW) flags are now carried by the
stream-connector. The CF_ prefix is replaced by the SC_FL_ one. Of course, it
is not so simple because in many places we were testing whether a channel was
shut for reads and writes at the same time. To do the same, shut for reads must
be tested on one side on the SC and shut for writes on the other side on the
opposite SC. Special care was taken with process_stream(): the flags of the SCs
must be saved to be able to detect changes, just like for the channels.
This patch handles the case where the fd could be -1 when proc_self was
lost for some reason (environment variable corrupted or upgrade from < 1.9).
This could result in an out-of-bounds array access fdtab[-1] and would crash.
Must be backported in every maintained version.
Previous versions ( < 1.9 ) of the master-worker process didn't have the
"HAPROXY_PROCESSES" environment variable which contains the list of
processes, fd etc.
The part which describes the master is created at first startup so if
you started the master with an old version you would never have
it.
Since patch 68836740 ("MINOR: mworker: implement a reload failure
counter"), the failedreloads member of the proc_self structure for the
master is set to 0. However if this structure does not exist, it will
result in a NULL dereference and crash the master.
This patch fixes the issue by creating the proc_self structure for the
master when it does not exist. It also shows a warning which states to
restart the master if that is the case, because we can't guarantee that
it will be working correctly.
This MUST be backported as far as 2.5, and could be backported in every
other stable branch.
When parsing the HAPROXY_PROCESSES environment variable, strtok was
called directly on the pointer returned by getenv(), which replaces each ;
with \0 and shows confusing environment variables when debugging in /proc
or in a corefile.
Example:
(gdb) x/39s *environ
[...]
0x7fff6935af64: "HAPROXY_PROCESSES=|type=w"
0x7fff6935af7e: "fd=3"
0x7fff6935af83: "pid=4444"
0x7fff6935af8d: "rpid=1"
0x7fff6935af94: "reloads=0"
0x7fff6935af9e: "timestamp=1676338060"
0x7fff6935afb3: "id="
0x7fff6935afb7: "version=2.4.0-8076da-1010+11"
This patch fixes the issue by doing a strdup on the variable.
Could be backported in previous versions (mworker_proc_to_env_list
exists since 1.9)
Since the recent changes on the clocks, now.tv_sec is not to be used
between processes because it's a clock which is local to the process and
does not contain a real unix timestamp. This patch fixes the issue by
using "date.tv_sec", which is the wall clock, instead of "now.tv_sec".
It prevents having incoherent timestamps.
It also introduces some checks on negative values in order to never
display a negative value if it was computed from a wrong value set by a
previous haproxy version.
It must be backported as far as 2.0.
In applets, we stop processing when a write error (CF_WRITE_ERROR) or a shutdown
for writes (CF_SHUTW) is detected. However, any write error leads to an
immediate shutdown for writes. Thus, it is enough to only test if CF_SHUTW is
set.
This cleanup is a follow up of "CLEANUP: peers: unused code path in
process_peer_sync"
There are some remnants of 1.6 peers-specific code in mworker_cleanlisteners()
that were introduced with this patch series:
f83d3fe00a MEDIUM: init: stop any peers section not bound to the correct process
47c8c029db MEDIUM: init: completely deallocate unused peers
Back then, nbthread did not exist, nbproc was used instead.
Updating some comments to make them more relevant to current haproxy design.
(multithreaded single process)
Moreover, in 47c8c029db, task_free() was performed on peers_fe->task.
But by looking at the code, from 1.6 until now, peers_fe->task
is never used for peers proxies, it is only used for main proxies (referenced
in proxies_list).
Removing this extra task cleanup because it is misleading.
During an early failure of the mworker mode, the
mworker_cleanlisteners() function is called and tries to cleanup the
peers, however the peers are in a semi-initialized state and will use
NULL pointers.
The fix checks the variables before trying to use them.
Bug revealed in issue #1956.
Could be backported as far as 2.0.
When haproxy is compiled without USE_SHM_OPEN, do not try to dump the
startup-logs in the "reload" output, because it won't show anything
interesting.
Change the output of the "reload" command: it now displays "Success=0"
if the reload failed and "Success=1" if it succeeded.
If the startup-logs is available (USE_SHM_OPEN=1), the command will
print a "--\n" line, followed by the content of the startup-logs.
Example:
$ echo "reload" | socat /tmp/master.sock -
Success=1
--
[NOTICE] (482713) : haproxy version is 2.7-dev7-4827fb-69
[NOTICE] (482713) : path to executable is ./haproxy
[WARNING] (482713) : config : 'http-request' rules ignored for proxy 'frt1' as they require HTTP mode.
[NOTICE] (482713) : New worker (482720) forked
[NOTICE] (482713) : Loading success.
$ echo "reload" | socat /tmp/master.sock -
Success=0
--
[NOTICE] (482886) : haproxy version is 2.7-dev7-4827fb-69
[NOTICE] (482886) : path to executable is ./haproxy
[ALERT] (482886) : config : parsing [test3.cfg:1]: unknown keyword 'Aglobal' out of section.
[ALERT] (482886) : config : Fatal errors found in configuration.
[WARNING] (482886) : Loading failure!
$
The environment variable HAPROXY_LOAD_SUCCESS stores "1" if the configuration
was successfully loaded and haproxy started, "0" otherwise.
The "_loadstatus" master CLI command displays either
"Loading failure!\n" or "Loading success.\n"
When using the "reload" command over the master CLI, all connections to
the master CLI were cut. This was unfortunate because it could have been
used to implement a synchronous reload command.
This patch implements an architecture to keep the connection alive after
the reload.
The master CLI is now equipped with a listener which uses a socketpair.
The 2 FDs of this socketpair are stored in the mworker_proc of the
master, which the master keeps via the environment variable.
ipc_fd[1] is used as a listener for the master CLI. During the "reload"
command, the CLI will send the FD of the current session over ipc_fd[0],
then the reload is achieved, so the master won't handle the recv of the
FD. Once reloaded, ipc_fd[1] receives the FD of the session, so the
connection is preserved. Of course it is a new context, so everything
like the "prompt mode" is lost.
Only the FD which performs the reload is kept.
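Passing an open FD across a UNIX socketpair relies on SCM_RIGHTS ancillary
data. The sketch below shows that general mechanism with two hypothetical
helpers, send_fd() and recv_fd(); it is not the haproxy implementation, and the
single dummy payload byte is an arbitrary choice for the example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer large enough for one fd, aligned for struct cmsghdr. */
union fd_ctl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send <fd> over the UNIX socket <sock>. Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
	char dummy = 'F'; /* at least one byte of regular payload */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);

	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from <sock>. Returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

Since the message stays queued in the socketpair until it is read, the sending
side can be replaced by a re-exec before the other end picks up the FD, which
is the property the reload relies on.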
Since commit 2be557f ("MEDIUM: mworker: seamless reload use the internal
sockpair"), we are using the PROC_O_LEAVING flag to determine which
sockpair worker will be used with -x during the next reload.
However in mworker_reexec(), the PROC_O_LEAVING flag is not updated, it
is only updated at startup in mworker_env_to_proc_list().
This could be a problem when a remaining process is still in the list:
it could be selected as the current worker, and its socket would be used
even if _getsocks doesn't work anymore on it. (bug #1803)
This patch fixes the issue by updating the PROC_O_LEAVING flag in
mworker_proc_list_to_env() just before using it in mworker_reexec().
Must be backported to 2.6.
The function mworker_pipe_register_per_thread() is called this way
because the master first used pipes instead of socketpairs.
Rename mworker_pipe_register_per_thread() to
mworker_sockpair_register_per_thread() in order to be more consistent.
Also update a comment inside the function.
The worker was previously changing the iocb of the socketpair in the
worker by mworker_accept_wrapper(). However, it was done using
fd_insert() instead of changing directly the callback in the
fdtab[].iocb pointer.
This patch partly cleans this up by removing fd_insert().
It also stops setting tid_bit on the thread mask, the socketpair will be
handled by any thread from now on.
There's no more reason for keeping the code and definitions in conn_stream,
let's move all that to stconn. The alphabetical ordering of include files
was adjusted.