# Master Worker

2024-06-12

## History

### haproxy-systemd-wrapper

Back in 2013, distributions were discussing the adoption of systemd as the
default init. This was controversial, but Fedora and Arch Linux were already
using it. At this time HAProxy still had a multi-process model, and the way
its daemon mode worked was incompatible with systemd.

Systemd is compatible with traditional forking services, but HAProxy is
somewhat different. To work correctly, systemd needs a main PID: the PID of
the process that systemd will supervise.

With `nbproc 1` that could work, since systemd is able to guess the main PID,
and even to read a PID file. But HAProxy does something uncommon during a
reload which is not supported by systemd: the reload is in fact a new haproxy
process which asks the old one to leave. This means the main PID is supposed
to change, but systemd does not support this, so it will just see the previous
process leaving, consider that the service broke, and kill every other
process, including the new haproxy.

With `nbproc > 1` this is worse: systemd is confused by all the processes,
because they are independent, so there is not really a main process to
supervise.

The systemd-wrapper appeared in HAProxy 1.5. It is a separate binary which
starts haproxy, so systemd can use the wrapper as the main PID, and the
wrapper never changes PID. Upon a reload, which is triggered with a SIGUSR2
signal, the wrapper launches a `haproxy -sf`. This was non-intrusive work and
a first step towards deploying in systemd environments. Later contributions
added support for upgrading the wrapper binary upon a reload.

However the wrapper suffered from several problems:

- It needed an intermediate haproxy process: it is basically a daemon mode,
  but instead of the first process leaving to daemonize, it is kept in the
  foreground to waitpid() on all workers. This means you need the wrapper +
  the -Ds process + the haproxy workers, and each reload starts a new -Ds
  process.
- It was difficult to integrate new features since it wasn't in haproxy
  itself.
- There were multiple issues with handling failures during a reload.

### mworker V1

HAProxy 1.8 got rid of the wrapper, which was replaced by the master-worker
mode. This first version was basically a reintegration of the wrapper features
within HAProxy. HAProxy is launched with the -W flag, reads the configuration
and then forks. In mworker mode, the master is usually launched as a root
process, and the chroot and setuid operations are performed in the workers.

Like the wrapper, the master handles the SIGUSR2 signal to reload. It is also
able to forward the SIGUSR1 signal to the workers, to ask for a soft stop.
The reload uses the same logic as the standard `-sf` method, but instead of
starting a new process, it exec()s with `-sf` in the same PID, which means
that haproxy can upgrade its binary during the reload.

Once the SIGUSR2 signal is received, the master blocks signals and unregisters
the signal handlers so that no signal can interrupt the reload: receiving a
SIGUSR2 after the exec() but before the handler is registered again would kill
the master.

When doing the exec() upon a reload, a new argv array is constructed by
copying the current argv and adding `-sf` followed by the PIDs found in the
children list, as well as the oldpids list.
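
For illustration, assuming the master was started with
`-W -f /etc/haproxy/haproxy.cfg` and currently supervises a worker with PID
1234 while PID 1230 is still draining from a previous reload, the command line
rebuilt for the exec() could look like this (PIDs and path are hypothetical):

```
haproxy -W -f /etc/haproxy/haproxy.cfg -sf 1234 1230
```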

When the workers are started, the master first deinits the poller and cleans
the FDs that are not needed anymore (inherited FDs need to be kept, however),
then the master runs a wait() loop instead of the haproxy polling loop, which
waits for its workers to leave, or for a signal.

When reloading haproxy, a non-working configuration could make the master
exit, which would end up killing all previous workers. This is a complex
situation to handle, since the configuration parsing code was not written to
keep a process alive upon a failure. To handle this problem, an atexit()
callback was used, so haproxy would reexec() upon a configuration loading
failure, without any configuration, and without trying to fork new workers.
This is called the master-worker "wait mode".

The master-worker mode also comes with a feature which automates the seamless
reload (-x), meaning it selects a stats socket from the configuration to be
added to the -x parameter for the next reload, so the FDs of the binds can be
retrieved automatically.
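
For example, with a stats socket such as
`stats socket /run/haproxy/admin.sock expose-fd listeners` declared in the
configuration (the path is hypothetical), the master could rebuild a command
line like the following for the next reload, so the listening FDs are
transferred over that socket:

```
haproxy -W -f /etc/haproxy/haproxy.cfg -x /run/haproxy/admin.sock -sf 1234
```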

The master supervises the workers: when a current worker (not a previous one
from before the reload) exits without having been asked to, the master emits
an "exit-on-failure" error, kills every worker with a SIGTERM and exits with
the same error code as the failed worker. This behavior can be changed by
using the "no-exit-on-failure" option in the global section.

While the master is supervising the workers using the wait() function, the
workers are also supervising the master. To achieve this, there is a pipe
between the master and the workers. The FD of the worker side of the pipe is
inserted in the poller so it can watch for a close. When the pipe is closed,
it means the master left, which is not supposed to happen, so it could have
crashed. When it happens, all workers leave. To survive the reloads of the
master, the FDs are saved in environment variables
(HAPROXY_MWORKER_PIPE_{RD,WR}).

The master-worker mode can be activated by using either "-W" or
"master-worker" in the global section of the configuration, but it is
preferred to use "-W".

The pidfile is usable in master-worker mode; instead of writing the PIDs of
all the workers, it will only contain the PID of the master.

A systemd mode (-Ws) can also be used. It behaves the same way as -W, but
keeps the master in the foreground and sends status messages to systemd using
the sd_notify API.
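
As an illustration, a minimal systemd unit for this mode could look like the
sketch below (paths and options are hypothetical; the real unit shipped with
HAProxy contains more settings):

```
[Unit]
Description=HAProxy Load Balancer
After=network-online.target

[Service]
# -Ws keeps the master in the foreground and reports status through
# sd_notify(), which is what Type=notify expects.
Type=notify
ExecStart=/usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
# A reload is requested by sending SIGUSR2 to the master.
ExecReload=/bin/kill -USR2 $MAINPID
KillMode=mixed
Restart=always

[Install]
WantedBy=multi-user.target
```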

### mworker V2

HAProxy 1.9 went a little bit further with the master-worker: instead of using
the mworker_wait() function from V1, it uses the haproxy polling loop, so the
signals are handled directly by the haproxy polling loop, removing the
specific code.

Instead of using one pipe per haproxy instance, V2 uses a socketpair per
worker, and the polling loop allows real network communication over these
socketpairs. The master needs to keep one FD per worker so they can be reused
after a reload. The master keeps a linked list of processes, mworker_proc,
containing the socketpair FDs, the PID, the relative PID... This list is then
serialized in the HAPROXY_PROCESSES environment variable, to be unserialized
upon a reload so the FDs can be reinserted in the poller.

Since these FDs are in the poller, there is a special flag on the listeners,
LI_O_WORKER, which specifies that some FDs must not be used in the worker;
these FDs are unbound once in the worker.

Meanwhile, thread support was implemented in haproxy. Since mworker shares
more code than before when using the polling loop, the nbthread configuration
variable is not used for instantiating the master, and the master always
remains with only one thread.

The HAPROXY_PROCESSES structures allow storing a lot more things: the number
of reloads for each worker is kept, the PID, etc.

The socketpairs are useful for bi-directional communication, so each
socketpair is connected to a stats applet on the worker side, giving the
master access to a stats socket for each worker.

The master implements a CLI proxy, an analyzer which is able to parse CLI
input, split individual CLI commands and redirect them to the right worker.
This is implemented like HTTP pipelining, with commands being sent and
responded to one after another. This proxy can be accessed by using the master
CLI, which is only bound using the -S option of the haproxy command. A special
prefix using the @ syntax is used to select the right worker.

The master CLI implements its own command set, like `show proc` which shows
the content of the HAPROXY_PROCESSES structure.
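
As a usage illustration (the socket path and worker number are hypothetical),
the master CLI is bound with -S, then commands can be sent either to the
master itself or, with the @ prefix, to a given worker:

```
# Start the master with a master CLI socket.
haproxy -W -S /var/run/haproxy-master.sock -f /etc/haproxy/haproxy.cfg

# List the processes known to the master (HAPROXY_PROCESSES content).
echo "show proc" | socat /var/run/haproxy-master.sock -

# Send a command to the current worker (relative PID 1) through the CLI proxy.
echo "@1 show info" | socat /var/run/haproxy-master.sock -
```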

A `reload` command was implemented so the reload can be requested from the
master CLI without using the SIGUSR2 signal.

### more features in mworker V2

HAProxy 2.0 implements a new configuration section called `program`. This
section allows the master-worker to handle the start and stop of external
executables; one could launch the Data Plane API from haproxy, for example.
The programs are shown in the `show proc` command and are added to the
HAPROXY_PROCESSES structure. The 'start-on-reload' option configures the
behavior of a program during an haproxy reload: it can either start a new
instance of the program or keep the previous one.
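
A minimal sketch of such a section could look like this (the command and its
arguments are hypothetical):

```
program api
    # Started and supervised by the master, listed in "show proc".
    command /usr/local/bin/dataplaneapi --port 5555
    # Keep the running instance across haproxy reloads instead of starting
    # a new one.
    no option start-on-reload
```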

A `mworker-max-reloads` keyword was added to the global section; it allows
limiting the number of reloads a worker can endure, which helps limiting the
number of remaining worker processes. Once a worker reaches this value, the
reload sends it a SIGTERM instead of a SIGUSR1, so any stuck worker is killed.
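
For instance (the value is arbitrary):

```
global
    master-worker
    # Workers that already survived 50 reloads are sent SIGTERM at the next
    # reload instead of SIGUSR1.
    mworker-max-reloads 50
```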

The version and the start time were added to HAPROXY_PROCESSES so they can be
displayed in `show proc`.

HAProxy 2.1 added user/group settings to the program section so programs can
change their uid after the fork.

HAProxy 2.5 added the reexec of haproxy in wait mode after a successful
loading, instead of doing it only after a configuration failure. This is
useful to clear the memory of the master, because loading the configuration
from the master can take a lot of RAM, and there is no simple way to free
everything and decrease the memory footprint of the process.

In HAProxy 2.6, the seamless reload with the master-worker changed: instead of
using a stats socket declared in the configuration, it uses the internal
socketpair of the previous worker. The change is actually simple: instead of
doing a `-x /path/to/previous/socket`, it does a `-x sockpair@FD` using the FD
number that can be found in HAPROXY_PROCESSES. With this change the stats
socket in the configuration is less useful and everything can be done from the
master CLI.
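
Compared to the earlier seamless reload example, the rebuilt command line now
references the worker's socketpair FD directly (the FD number and PID below
are hypothetical):

```
haproxy -W -f /etc/haproxy/haproxy.cfg -x sockpair@5 -sf 1234
```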

With 2.7, the reload mechanism of the master CLI evolved. In previous
versions, this mechanism was asynchronous: once the `reload` command was
received, the master would reload, the active master CLI connection was
closed, and there was no way to return a status as a response to the `reload`
command. To achieve a synchronous reload, a dedicated socketpair is used: one
side uses a master CLI applet and the other side waits to receive a socket.
When the master CLI receives the `reload` command, it takes the FD of the
active master CLI session, sends it over the socketpair and then does an
exec(). The FD is then stuck in the kernel during the reload, because the
poller is disabled. Once haproxy has reloaded and the poller is active again,
the FD of the master CLI connection is received, so HAProxy can reply with a
success or failure status for the reload. When built with USE_SHM_OPEN=1, a
shm is used to keep the warnings and errors emitted while loading the
configuration in a shared buffer, so they can survive the reexec in wait mode
and then be dumped as a response to the `reload` command after the status.
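
In practice this means a reload can be driven and checked over the master CLI,
for example (the socket path is hypothetical, and the exact output depends on
the version and on whether USE_SHM_OPEN was set at build time):

```
# Ask for a reload and read the status (and startup logs, if available)
# on the same connection.
echo "reload" | socat /var/run/haproxy-master.sock -
```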

In 2.9 the master CLI command `hard-reload` was implemented. It works the same
way as the `reload` command, but instead of doing the exec() with -sf for a
soft stop, it starts with -st to achieve a hard stop of the previous worker.

Version 3.0 got rid of the libsystemd dependency for sd_notify() after the
xz/openssh events; the function is now implemented directly in haproxy in
src/systemd.c.