haproxy

mirror of https://git.haproxy.org/git/haproxy.git/ synced 2025-08-07 15:47:01 +02:00

Author	SHA1	Message	Date
Valentine Krasnobaeva	b73a278df4	MINOR: mworker/cli: add _send_status to support state transition In the new master-worker architecture, when a worker process is forked and successfully initialized it needs somehow to communicate its "READY" state to the master, in order to terminate the previous worker and workers, that might exceeded max_reloads counter. So, let's implement for this a new master CLI _send_status command. A new worker can send its status string "READY" to the master, when it's about entering to the run poll loop, thus it can start to receive data. In _send_status() in the master context we update the status of the new worker: PROC_O_INIT flag is withdrawn. When TERM signal is sent to a worker, worker terminates and this triggers the mworker_catch_sigchld() handler in master. This handler deletes the exiting process entry from the processes list. In _send_status() we loop over the processes list twice. At the first time, in order to stop workers that exceeded the max_reloads counter. At the second time, in order to stop the worker forked before the last reload. In the corner case, when max_reloads=1, we avoid to send SIGTERM twice to the same worker by setting sigterm_sent flag during the first loop.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	154848a314	MINOR: mworker: simplify the code that sets PROC_O_LEAVING When master performs a reexec it should set for an already existed worker the flag PROC_O_LEAVING. It means that existed worked is marked as the previous one and will be terminated after the reload. In the previous implementation master process was need to do the reexec twice (the first time for parsing its configuration and the second time to free unused ressources). So the logic of setting PROC_O_LEAVING was based on comparing the number of reloads, performed by each process from the processes list, except the master. Now, as being mentioned before, reexec is performed only once. So, in this case we need to set PROC_O_LEAVING flag, when we deserialize the list. It is done for all processes, which have the number of reloads stricly positive.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	c8aac63893	MINOR: mworker: add support for case when new worker dies The case, when the new worker fails while it parses its configuration or while it tries to apply it, could be considered as the new one, because the master process is no longer need to reexec again. The master simply keeps the previous worker (forked before the reload) and it let the new one to exit with failure. When the new worker exits, in the master process context (mworker_catch_sigchld) we need to stop a MASTER proxy listener and we need to drop the server, attached to new worker's CLI sockpair (it's inherited in master). Then we explicitly delete master's end of this sockpair (child->ipc_fd[0]) from the fdtab and we free the memory allocated for the worker process. on_new_child_failure() is called before the clean up to signal systemd that reload/load was failed. If the new worker fails during the first start, so there is no any previous worker, master process should exit immediately in order to keep the same behaviour, as it was before this architecture change.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	2bb07b913d	MINOR: startup: rename and adapt reexec_on_failure Previously reexec_on_failure() was called in cases when the process has failed after reload, while it was parsing its configuration or it was trying to apply it. reexec_on_failure() has called mworker_reexec() and the master process has been reexecuted. With the new architecture in such cases there is no longer need to reexecute the master process after its reload again. It simply keeps the previous worker, forked before the reload, and it lets the new one to exit with an error. But we still need the code, which increments the number of failed reloads and which notifies systemd with new "Reload failed!" status. So, let's reuse and adapt for this reexec_on_failure() and let's rename it to on_new_child_failure().	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	9b27f82da3	MINOR: mworker: mworker_catch_sigchld: use fd_delete instead of close If the worker exits due to failure or due to receiving TERM signal, in the master context, we can't now simply close the master's fd (ipc_fd[0]) of the inherited master CLI sockpair. When the worker is created, in the master process context MASTER proxy listener is bound to ipc_fd[0]. When this worker fails or exits, master process is always in its polling loop. So, closing some fd in its context immediately triggers the BUG_ON(fd->owner), as the poller try to reinsert the "freed" fd into fdtab and try to reuse it. We must call fd_delete in this case. This will deinitializes fd auxilary data and closes its properly.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	cf150fd73d	MEDIUM: mworker/cli: close child and parent fds, setup listeners Basically, this is the continuation of the previous commits. So, here after the fork, worker process closes the "master" end of the copied CLI sockpair and binds its end, ipc_fd[1], to the GLOBAL proxy listener. mworker_cli_global_proxy_new_listener() guarantees that GLOBAL proxy will be created, if it wasn't the case before. Master process, at first, allocates the MASTER proxy, creates master CLI listener (-S command line option) and reload sockpair and then closes the "worker" end of the copied CLI sockpair and binds its end, ipc_fd[0], to the created MASTER proxy. Usage of the new PROC_O_INIT state helps to reduce test conditions to find the newly forked worker.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	646299fc95	MINOR: mworker: add and set state PROC_O_INIT for new worker Here, to distinguish between the new worker and the previous one let's add a new process state PROC_O_INIT and let's set it, when the memory is allocated for the new worker in the processes list.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	26ad5465cc	MINOR: mworker/cli: create MASTER proxy before mcli listeners For the master process we always need to create a MASTER proxy, even if master cli settings were not provided via command line, because now we bind a listener in the master process context at ipc_fd[0]. So, MASTER proxy should be already allocated at this moment.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	6ec38c9a74	MINOR: mworker/cli: create master CLI sockpair before fork The main idea here is to create a master CLI inherited sockpair just before the master-worker fork. And only then after the fork let each process to bind a needed listener to the its end of this sockpair. Like this master and worker processes can close unused "ends" of its sockpair copy (ipc_fd[0] for worker and and ipc_fd[1] for master). When this sockpair creation happens inside the mworker_cli_global_proxy_new_listener() is not possible for the master to close ipc_fd[1] bound to the GLOBAL proxy listener, as this triggers a BUG_ON(fd->owner) in fd_insert() in master context, because master process has alredy entered in its polling loop and poller in its turn tries to reused closed fd.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	cc1a631beb	MINOR: mworker/cli: rename and clean mworker_cli_sockpair_new Let's rename mworker_cli_sockpair_new() to mworker_cli_global_proxy_new_listener() to outline that this function creates the GLOBAL proxy, allocates the listener with "master-socket" bind conf and attaches this listener to this GLOBAL proxy. Listener is bound to ipc_fd[1] of the sockpair inherited in master and in worker (master CLI sockpair).	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	0fbf1973ad	MINOR: mworker/cli: rename mworker_cli_proxy_new_listener This is the first commit in a series to add the support of 4 primary reload use-cases for the new master-worker architecture: 1. Newly forked worker process dies before any reload, due to some errors in the configuration. Newly forked worker process crashes before any reload after sending its "READY" state to master. 2. Newly forked worker process dies due to some errors in the new configuration. This happens after reload, when this new configuration was supplied, so the previous worker process is still here. 3. Newly forked worker process crashes after sending its "READY" state to master due to some bugs. This happens after reload, so the previous worker process is still here. 4. Newly forked worker process has sent its "READY" state to master and starts to receive traffic. This happens after reload, the old worker hasn't terminated yet, as it is waiting on some idle connection and it crashes. Let's rename in this commit mworker_cli_proxy_new_listener() to mworker_cli_master_proxy_new_listener() to outline, that this function creates "master-socket" bind conf and allocates a listener. This listener is attached to the MASTER proxy and it's bound to the ipc_fd[0] of the sockpair, inherited in master and in worker processes (master CLI sockpair).	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	223caab96f	MINOR: cfgparse: fix thread keywords sensitive to global section position thread keywords parsers are sensitive to global section position. If they are present there, the global section must be the first section in the configuration. *thread parsers logic is based on non_global_section_parsed counter. So, we need to reset it explicitly before the second configuration read done by worker or in a standalone mode.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	0ed262d7bf	MEDIUM: startup: introduce load_cfg and read_cfg This commit is a part of the series to add a support of discovery mode in the configuration parser and in initialization sequence. In order to support discovery mode, we need to read the configuration twice. So, we need to split the stage, when we load all configuration files, from the stage when we parse it. To do this, let's encapsulate in read_cfg() the part, where we load the configuration files in a separate function, load_cfg(). Like this we can call only the parsing part as many times as we need. Before reading configuration at the first time we set MODE_DISCOVERY. After the reading this mode is immediately unset, as the real runtime mode has been already set by discovery keywords parsers. Second read is performed when all primary runtime modes (daemon, master-worker) are applied, because we should not read the configuration twice in the master process.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	e2b4768224	MEDIUM: cfgparse: parse only "global" section in MODE_DISCOVERY This commit is a part of the series to add a support of discovery mode in the configuration parser and in initialization sequence. So, in discovery mode, when we read the configuration the first time, we parse for the moment only the "global" section. Unknown section names will be ignored.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	699be6a55d	MEDIUM: cfgparse-global: parse only KWF_DISCOVERY keywords in MODE_DISCOVERY This commit is a part of the series to add a support of discovery mode in the configuration parser and in initialization sequence. Global section parser parses the majority of keywords in its function, so those keywords don't have any dedicated parsers yet. Only after this parsing block cfg_parse_global() starts to call dedicated parsers for any other discovered keywords, which were not found in the block. As all keywords, which should be parsed in MODE_DISCOVERY have its own parser funtions, we can skip this block with goto discovery_kw and start directly from the part, where we call parsers from the keywords list. KWF_DISCOVERY flag helps to call in MODE_DISCOVERY only the parsers, which we are needed at this mode. All unknown keywords and garbage will be ignored at this stage.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	48371e6a30	MEDIUM: cfgparse: call some parsers only in MODE_DISCOVERY This commit is a part of the series to add a support of discovery mode in the configuration parser and in initialization sequence. Some keyword parsers tagged with KWF_DISCOVERY (for example those, which parse runtime modes, poller types, pidfile), should not be called twice when the configuration will be read the second time after the discovery mode. It's redundant and could trigger parser's errors in standalone mode. In master-worker mode the worker process inherits parsed settings from the master.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	f9123e2183	MEDIUM: cfgparse: add KWF_DISCOVERY keyword flag This commit is a part of the series to add a support of discovery mode in the configuration parser and in initialization sequence. So, let's add here KWF_DISCOVERY flag to distinguish the keywords, which should be parsed in "discovery" mode and which are needed for master process, from all others. Keywords, that should be parsed in "discovery" mode have its dedicated parser funtions. Let's tag these functions with KWF_DISCOVERY flag in keywords list. Like this, only these keyword parsers might be called during the first configuration read in discovery mode.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	6769745fe5	MINOR: global: add MODE_DISCOVERY flag This is the first commit from a series to add a support of discovery mode in the configuration parser and in initialization sequence. Discovery mode is the mode, when we read the configuration at the first time and we parse and set runtime modes: daemon, zero-warning, master-worker. In this mode we also parse some parameters needed for the master process to start, in case if we are in the master-worker mode. Like this the master process doesn't allocate any additional resources, which it doesn't use and it quickly finishes its initialization and enters to its polling loop. The worker process after its fork reads the rest of the configuration. So, let's add in this commit MODE_DISCOVERY flag to check it in configuration parser functions.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	fe75c1e12d	MEDIUM: startup: remove MODE_MWORKER_WAIT MODE_MWORKER_WAIT becames redundant with MODE_MWORKER, due to moving master-worker fork in init(). This change allows master no longer perform reexec just after forking in order to free additional memory. As after the fork in the master process we set 'master' variable, we can replace now MODE_MWORKER_WAIT in some 'if' statements by simple check of this 'master' variable. Let's also continue to get rid of HAPROXY_MWORKER_WAIT_ONLY environment variable, as it's no longer needed as well. In cfg_program_postparser(), which is used to check if cmdline is defined to launch a program, we completely remove the check of mode for now, because the master process does not parse the configuration for the moment. 'program' section parsing will be reintroduced in master later in the next commits.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	fb7bef781d	MINOR: defaults: update MASTER_MAXCONN description This is a one of the commits to prepare the removal of MODE_MWORKER_WAIT support, as it became redundant with MODE_MWORKER due to moving master-worker fork in init().	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	3f5f57845b	MINOR: startup: rename exit_on_waitmode_failure to exit_on_failure As we no longer support MODE_MWORKER_WAIT for master (it became redundant with MODE_MWORKER after moving master-worker fork in init()), let's rename exit_on_waitmode_failure() callback in just exit_on_failure().	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	7795d49ae6	CLEANUP: haproxy: rm no longer used mworker_reexec_waitmode This a first commit to prepare the removal of MODE_MWORKER_WAIT support. It has became redundant with MODE_MWORKER, due to moving master-worker fork in init(). Master process does no longer perform reexec to free additional memory after forking and does no longer changing its mode to MODE_MWORKER_WAIT, where it has entered to its wait polling loop and has handled signals. Now, master enters in this loop almost immediately after forking a worker and being always in mode MODE_MWORKER. So, we can remove mworker_reexec_waitmode() wrapper, which was used to set HAPROXY_MWORKER_WAIT_ONLY variable and to call mworker_reexec(). But let's keep for the moment the logic of reexec_on_failure() atexit callback for master in order if in the future we will need to support this case again.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	cb0f1f42e1	MINOR: startup: only worker gets capabilities from bin Due to moving the master-worker fork in init(), we need to protect prepare_caps_from_permitted_set() call, which is executed after init(). This call makes sense only for worker, daemon and for foreground mono process modes. prepare_caps_from_permitted_set() allows to read Linux capabilities from haproxy binary and to move some of them in process Effective set, if 'setcap' keyword lists needed capabilities in the global section.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	fe04c2ad37	MEDIUM: startup: do set_identity() if needed in one place There are two set_identity() calls, both under quite same: 'if ((global.mode & (MODE_MWORKER\|MODE_DAEMON...)...' The first call serves to change uid/gid and set some needed Linux capabilities only for process in the foreground mode. The second comes after master-worker fork and allows to do the same in daemon and in worker modes. Due to moving the master-worker fork in init() in some previous commit, the second set_identity() now is no longer under the 'if'. So, it is executed for all modes, except MODE_MWORKER. Now in MODE_MWORKER process enters in its wait polling loop just after forking a worker and it terminates almost immediately, if it exits this loop. Worker, daemon and process in a foreground mode will perform set_identity() as before, but now it will be called in a one place at main(). global.last_checks should be verified just after set_identity() call. As it's stated in comments some configuration options may require full privileges or some Linux capabilities need to be granted to process. set_identity() via prepare_caps_for_setuid() may put configured capabilities in process Effective set and, hence, remove respective flag from global.last_checks.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	02af1fe067	MEDIUM: startup: call chroot() if needed in one place There are two 'chroot' code blocks, both under quite same: 'if ((global.mode & (MODE_MWORKER\|MODE_DAEMON...)...' The first block serves to perform chroot only for process in the foreground mode. The second comes after master-worker fork and allows to do chroot in daemon and in worker modes. Due to moving the master-worker fork in init() in some previous commit, the second 'chroot' code block now is no longer under the 'if'. So, it is executed for all modes, except MODE_MWORKER. Now in MODE_MWORKER process enters in its wait polling loop just after forking a worker and it terminates almost immediately, if it exits this loop. Worker, daemon and process in a foreground mode will perform the chroot as before, but now it will be done in a one place at main().	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	7a2ee10d71	REORG: mworker: move mworker_create_master_cli in master 'case' Let's move mworker_create_master_cli() call in 'master' case just above and get rid of redundant global.mode tests.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	e4c10a704d	MINOR: init: check MODE_MWORKER before creating master CLI mworker_create_master_cli() creates MASTER proxy and allocates listeners, which are attached to this proxy. It also creates a reload sockpair. So, it's more appropriate to do the check, that we are in a MODE_MWORKER, if master CLI settings were provided via command line, just after the config parsing. And only then, if runtime mode and command line settings are coherent, try to perform master-worker fork and try to create master CLI.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	26e53e2e8c	REORG: mworker: set nbthread=1 for master after fork After moving master-worker fork into init() and reintroducing it into a switch-case (see the previous commit), it is more appropriate to set nbthread=1 and nbtgroups=1 immediately in the 'case' for the parent process.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	ae84f06025	BUG/MINOR: mworker: fix memory leak due to master-worker fork Before this fix, startup logs ring was duplicated before the fork(), so master and worker had both the original startup_logs ring and the duplicated one. In the worker context we freed the original ring and used a duplicated one. In the master context we did nothing, but we still create a duplicated copy again and again during the reload. So, let's duplicate startup logs ring only in the worker context. Master continues to use the original ring initialized in init() before its fork().	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	8dd4efe42f	MAJOR: mworker: move master-worker fork in init() This refactoring allows to simplify 'master-worker' logic. The master process with this change will fork a worker very early at the initialization stage, which allows to perform a configuration parsing only for the worker. In reality only the worker process needs to parse and to apply the whole configuration. Master process just polls master CLI sockets, watches worker status, catches its termination state and handles the signals. With this refactoring there is no longer need for master to perform re-execution after reading the whole configuration file to free additional memory. And there is no longer need for worker to register atexit callbacks, in order to free the memory, when it fails to apply the new configuration. In contrast, we now need to set proc_self pointer to the new worker entry in processes list just after the fork in the worker process context. proc_self is dereferenced in mworker_sockpair_register_per_thread(), which is called when worker enters in its polling loop. Following patches will try to gather more 'worker' and 'master' specific' code in the dedicated cases of this new fork() switch, or in a separate functions.	2024-10-16 22:00:58 +02:00
Valentine Krasnobaeva	4cbfcc60f4	MEDIUM: startup: move PID handling in init() Let's move PID handling in init() from the main() code. It is more appropriate to open and to write the PID of the process just after daemonization fork. In case of daemon monoprocess mode, we will simply write a PID of the process, which is already in the background. In case of 'master-worker' mode, we keep the previous behaviour and we write only a PID of the master process. This allows to remove redundant tests of the process execution mode, tests of the pidfd value and consequent writes to this pidfd. This patch prepares the refactoring of master-worker fork by moving it in init() function as well.	2024-10-16 22:00:58 +02:00
Valentine Krasnobaeva	95c19be2ab	MINOR: startup: refactor "daemonization" fork Let's put "daemonization" fork into a switch-case. This is more readable and we don't need to allocate memory for the fork() return value here.	2024-10-16 22:00:58 +02:00
Valentine Krasnobaeva	90b8181c0a	MEDIUM: startup: move daemonization fork in init Let's move daemonization fork in init(). We need to perform this fork always before forking a worker process, in order to be able to launch master and then its worker in daemon, i.e. background mode, if haproxy was started with '-D' option. This refactoring is a preparation step, needed for replacing then master-worker fork in init() as well. This allows the master process not to read the whole configuration file and not to do re-execution in order to free additional memory, when worker was forked. In the new refactored design only the worker process will read and apply a new configuration, while the master will arrive very fast in its polling loop to wait worker's termination and to handle signals. See more details in the following commits.	2024-10-16 22:00:58 +02:00
Valentine Krasnobaeva	df12791da3	MINOR: startup: add O_CLOEXEC flag to open /dev/null As master process performs execvp() syscall to handle USR2 and HUP signals in mworker_reexec(), let's add O_CLOEXEC flag, when we open '/dev/null' in order to avoid fd leak. This a preparation step to refactor master-worker logic. See more details in the next commits.	2024-10-16 22:00:58 +02:00
Valentine Krasnobaeva	5bbcdc003a	REGTESTS: cli: add delay 0.1 before connect to cli When vtest starts haproxy process, it loops until the moment, when haproxy pidfile is created. When pidfile is created, vtest considers that haproxy process is ready and it starts to perform test commands, in particular, it connects to CLI. It's not very reliable approach to base the check of the process readiness on the PID file. After master-worker architecture refactoring pidfile is created in the early init stage, but master and worker are not yet finished its initialization routines. So, all mcli tests and some tests where we sent commands to CLI start to fail regularly. In vtest at the moment there is no any other approach to check that the process is really ready. So let's add a delay 0.1s before connecting to CLI in all mcli tests and in acl_cli_spaces test.	2024-10-16 22:00:58 +02:00
Willy Tarreau	2c2dac77aa	DEBUG: mux-h2/flags: add H2_CF_DEM_RXBUF & H2_SF_EXPECT_RXDATA for the decoder Both flags were recently added but missing from the decoders flags, so they appeared in hex in dev/flags/flags output. No backport needed.	2024-10-16 18:32:52 +02:00
Willy Tarreau	ca275d99ce	BUG/MEDIUM: queue: make sure never to queue when there's no more served conns Since commit `53f52e67a0` ("BUG/MEDIUM: queue: always dequeue the backend when redistributing the last server"), we've got two reports again still showing the theoretically impossible condition in pendconn_add(), including a single threaded one. Thanks to the traces, the issue could be tracked down to the redispatch part. In fact, in non-determinist LB algorithms (RR, LC, FAS), we don't perform the LB if there are pending connections in the backend, since it indicates that previous attempts already failed, so we directly return SRV_STATUS_FULL. And contrary to a previous belief, it is possible to meet this condition with be->served==0 when redispatching (and likely with maxconn not greater than the number of threads). The problem is that in this case, the entry is queued and then the pendconn_must_try_again() function checks if any connections are currently being served to detect if we missed a race, and tries again, but that situation is not caused by a concurrent thread and will never fix itself, resulting in the loop. All that part around pendconn_must_try_again() is still quite brittle, and a safer approach would involve a sequence counter to detect new arrivals and dequeues during the pendconn_add() call. But it's more sensitive work, probably for a later fix. This fix must be backported wherever the fix above was backported. Thanks to Patrick Hemmer, as well as Damien Claisse and Basha Mougamadou from Criteo for their help on tracking this one!	2024-10-16 18:08:39 +02:00
Aurelien DARRAGON	85298189bf	BUG/MEDIUM: server: server stuck in maintenance after FQDN change Pierre Bonnat reported that SRV-based server-template recently stopped to work properly. After reviewing the changes, it was found that the regression was caused by `a4d04c6` ("BUG/MINOR: server: make sure the HMAINT state is part of MAINT") Indeed, HMAINT is not a regular maintenance flag. It was implemented in `b418c122` `a4d04c6` ("BUG/MINOR: server: make sure the HMAINT state is part of MAINT"). This flag is only set (and never removed) when the server FQDN is changed from its initial config-time value. This can happen with "set server fqdn" command as well as SRV records updates from the DNS. This flag should ideally belong to server flags.. but it was stored under srv_admin enum because cur_admin is properly exported/imported via server state-file while regular server's flags are not. Due to `a4d04c6`, when a server FQDN changes, the server is considered in maintenance, and since the HMAINT flag is never removed, the server is stuck in maintenance. To fix the issue, we partially revert `a4d04c6`. But this latter commit is right on one point: HMAINT flag was way too confusing and mixed-up between regular MAINT flags, thus there's nothing to blame about `a4d04c6` as it was error-prone anyway.. To prevent such kind of bugs from happening again, let's rename HMAINT to something more explicit (SRV_ADMF_FQDN_CHANGED) and make it stand out under srv_admin enum so we're not tempted to mix it with regular maintenance flags anymore. Since `a4d04c6` was set to be backported in all versions, this patch must be backported there as well.	2024-10-16 14:26:57 +02:00
Amaury Denoyelle	0918c41ef6	BUG/MEDIUM: quic: support wait-for-handshake wait-for-handshake http-request action was completely ineffective with QUIC protocol. This commit implements its support for QUIC. QUIC MUX layer is extended to support wait-for-handshake. A new function qcc_handle_wait_for_hs() is executed during qcc_io_process(). It detects if MUX processing occurs after underlying QUIC handshake completion. If this is the case, it indicates that early data may be received. As such, connection is flagged with CO_FL_EARLY_SSL_HS, which is necessary to block stream processing on wait-for-handshake action. After this, qcc subscribs on quic_conn layer for RECV notification. This is used to detect QUIC handshake completion. Thus, qcc_handle_wait_for_hs() can be reexecuted one last time, to remove CO_FL_EARLY_SSL_HS and notify every streams flagged as SE_FL_WAIT_FOR_HS. This patch must be backported up to 2.6, after a mandatory period of observation. Note that it relies on the backport of the two previous patches : - MINOR: quic: notify connection layer on handshake completion - BUG/MINOR: stream: unblock stream on wait-for-handshake completion	2024-10-16 11:51:35 +02:00
Amaury Denoyelle	73031e81cd	BUG/MINOR: stream: unblock stream on wait-for-handshake completion wait-for-handshake is an http-request action which permits to delay the processing of content received as TLS early data. The action yields as long as connection handshake is in progress. In the meantime, stconn is flagged with SE_FL_WAIT_FOR_HS. When the handshake is finished, MUX layer is responsible to woken up SE_FL_WAIT_FOR_HS flagged stconn instances to restart the stream processing. On sc_conn_process(), SE_FL_WAIT_FOR_HS flag is removed and stream layer is woken up. However, there may be a blocking after MUX notification. sc_conn_recv() may return 0 due to no new data reception, which prevents sc_conn_process() execution. The stream is thus blocked until its timeout. To fix this, checks in sc_conn_recv() about the handshake termination condition. If true, explicitely returns 1 to ensure sc_conn_process() will be executed. Note that this bug is not reproducible due to various conditions related to early data implementation in haproxy. Indeed, connection layer instantiation is always delayed until SSL handshake completion, which prevents the handling of early data as expected. This fix will be necessary to implement wait-for-handshake support for QUIC. As such, it must be backported with the next commit up to 2.6, after a mandatory period of observation.	2024-10-16 11:44:31 +02:00
Amaury Denoyelle	5a5950e42d	MINOR: quic: notify connection layer on handshake completion Wake up connection layer on QUIC handshake completion via quic_conn_io_cb. Select SUB_RETRY_RECV as this was previously unused by QUIC MUX layer. For the moment, QUIC MUX never subscribes for handshake completion. However, this will be necessary for features such as the delaying of early data forwarding via wait-for-handshake. This patch will be necessary to implement wait-for-handshake support for QUIC. As such, it must be backported with next commits up to 2.6, after a mandatory period of observation.	2024-10-16 11:42:06 +02:00
Willy Tarreau	5091f90479	MINOR: activity/memprofile: always return "other" bin on NULL return address It was found in a large "show profiling memory" output that a few entries have a NULL return address, which causes confusion because this address will be reused by the next new allocation caller, possibly resulting in inconsistencies such as "free() ... pool=trash" which makes no sense. The cause is in fact that the first caller had an entry->info pointing to the trash pool from a p_alloc/p_free with a NULL return address, and the second had a different type and reused that entry. Let's make sure undecodable stacks causing an apparent NULL return address all lead to the "other" bin. While this is not exactly a bug, it would make sense to backport it to the recent branches where the feature is used (probably at least as far as 2.8).	2024-10-15 08:12:34 +02:00
Willy Tarreau	93c9f19af7	REGTESTS: fix a reload race in abns_socket.vtc This test issues a reload over the master CLI, but it is totally possible that the master has not yet finished starting up the master CLI when the command is issued, resulting in a failure. This was much more visible on the new master-worker model, but definitely affects the old one and could be the reason for this test to occasionally fail on the CI.	2024-10-14 19:15:21 +02:00
William Lallemand	0302adf996	CI: cirrus-ci: bump FreeBSD image to 14-1 FreeBSD CI since to be broken for a while, try to upgrade the image to the latest 14.1 version.	2024-10-14 14:28:26 +02:00
Willy Tarreau	e4cb0ad632	MINOR: mux-h2/traces: add buffer-related info to h2s and h2c The traces currently don't contain any info about the amount of data present in buffers, making it difficult to figure if an empty buffer is the cause for not demuxing or if a full buffer is the cause for not reading more data. Let's add them, with the head/tail info as well.	2024-10-12 18:07:21 +02:00
Willy Tarreau	a8f907a459	MINOR: mux-h2/traces: add missing flags and proxy ID in traces H2 traces are unusable to detect bugs most of the time because they miss the h2c and h2s flags, as well as the proxy, which makes it very hard to figure if the info comes from the client or the server as soon as two layers are stacked. This commit adds these precious information as well as the h2s's rx and tx windows. This could be backported to a few recent branches, but the rx window calculation will have to be replaced with the static value there.	2024-10-12 17:45:51 +02:00
Willy Tarreau	fcab647613	OPTIM: mux-h2: use tasklet_wakeup_after() in h2s_notify_recv() This reduces the avg wakeup latency of sc_conn_io_cb() from 1900 to 51us. The L2 cache misses from from 1.4 to 1.2 billion for 20k req. But the perf is not better. Also there are situations where we must not perform such wakeup, these may only be done from h2_io_cb, hence the test on the next_tasklet pointer and its reset when leaving the function. In practice all callers to h2s_close() or h2s_destroy() can reach that code, this includes h2_detach, h2_snd_buf, h2_shut etc. Another test with 40 concurrent connections, transferring 40k 1MB objects at different concurrency levels from 1 to 80 also showed a 21% drop in L2 cache misses, and a 2% perf improvement: Before: 329,510,887,528 instructions 50,907,966,181 branches 843,515,912 branch-misses 2,753,360,222 cache-misses 19,306,172,474 L1-icache-load-misses 17,321,132,742 L1-dcache-load-misses 951,787,350 LLC-load-misses 44.660469000 seconds user 62.459354000 seconds sys => avg perf: 373 MB/s After: 331,310,219,157 instructions 51,343,396,257 branches 851,567,572 branch-misses 2,183,369,149 cache-misses 19,129,827,134 L1-icache-load-misses 17,441,877,512 L1-dcache-load-misses 906,923,115 LLC-load-misses 42.795458000 seconds user 62.277983000 seconds sys => avg perf: 380 MB/s With small requests, it's the L1 and L3 cache misses which reduced by 3% and 7% respectively, and the performance went up by 3%.	2024-10-12 17:17:51 +02:00
Willy Tarreau	04ce6536e1	OPTIM: mux-h2: try to continue reading after demuxing when useful When we stop demuxing in the middle of a frame, we know that there are other data following. The demux buffer is small and unique, but now we have rxbufs, so after h2_process_demux() is left, the dbuf is almost empty and has room to be delivered into another rxbuf. Let's implement a short loop with a counter and a few conditions around the demux call. We limit the number of turns to the number of available rxbufs and no more than 12, since it shows good performance, and the wakeup is only called once. This has shown a nice 12-20% bandwidth gain on backend-side H2 transferring 1MB-large objects, and does not affect the rest (headers, control etc). The number of wakeup calls was divided by 5 to 8, which is also a nice improvement. The counter is limited to make sure we don't add processing latency. Tests were run to find the optimal limit, and it turns out that 16 is just slightly better, but not worth the +33% increase in peak processing latency. The h2_process_demux() function just doens't call the wakeup function anymore, and solely focuses on transferring from dbuf to rxbuf. Practical measurement: test with h2load producing 4 concurrent connections with 10 concurrent streams each, downloading 1MB objects (20k total) via two layers of haproxy stacked, reaching httpterm over H1 (numbers are total for the 2 h2 front and 1 h2 back). All on a single thread. Before: 549-553 MB/s (on h2load) function calls cpu_tot cpu_avg h2_io_cb 2562340 8.157s 3.183us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup h2_io_cb 30109 840.9ms 27.93us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup h2_io_cb 16105 106.4ms 6.607us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup h2_io_cb 1 11.75us 11.75us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup h2_io_cb 2608555 9.104s 3.490us --total-- perf stat: 153,117,996,214 instructions (71.41%) 22,919,659,027 branches # 14.97% of inst (71.41%) 384,009,600 branch-misses # 1.68% of all branches (71.42%) 44,052,220 cache-misses # 1 inst / 3476 (71.44%) 9,819,232,047 L1-icache-load-misses # 6.4% of inst (71.45%) 8,426,410,306 L1-dcache-load-misses # 5.5% of inst (57.15%) 10,951,949 LLC-load-misses # 1 inst / 13982 (57.13%) 12.372600000 seconds user 23.629506000 seconds sys After: 660 MB/s (+20%) function calls cpu_tot cpu_avg h2_io_cb 244502 4.410s 18.04us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup h2_io_cb 42107 1.062s 25.22us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup h2_io_cb 13703 106.3ms 7.758us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup h2_io_cb 1 13.74us 13.74us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup h2_io_cb 300313 5.578s 18.57us --total-- perf stat: 126,840,441,876 instructions (71.40%) 17,576,059,236 branches # 13.86% of inst (71.40%) 274,136,753 branch-misses # 1.56% of all branches (71.42%) 30,413,562 cache-misses # 1 inst / 4170 (71.45%) 6,665,036,203 L1-icache-load-misses # 5.25% of inst (71.46%) 7,519,037,097 L1-dcache-load-misses # 5.9% of inst (57.15%) 6,702,411 LLC-load-misses # 1 inst / 18925 (57.12%) 10.490097000 seconds user 19.212515000 seconds sys It's also interesting to see that less total time is spent in these functions, clearly indicating that the cost of interrupted processing, and the extraneous cache misses come into play at some point. Indeed, after the change, the number of instructions went down by 17.2%, while the L2 cache misses dropped by 31% and the L3 cache misses by 39%!	2024-10-12 16:38:36 +02:00
Willy Tarreau	9fbc01710a	OPTIM: mux-h2: make h2_send() report more accurate wake up conditions h2_send() used to report non-zero every time any data were sent, and this was used from h2_snd_buf() or h2_done_ff() to trigger a wakeup, which possibly can do nothing. Restricting this wakeup to either a successful send() combined with the ability to demux, or an error. Doing this makes the number of h2_io_cb() wakeups drop from 422k to 245k for 1000 1MB objects delivered over 100 streams between two H2 proxies, without any behavior change nor performance change. In practice, most send() calls do not result in a wakeup anymore but synchronous errors still do. A local test downloading 10k 1MB objects from an H1 server with a single connection shows this change: before after caller 1547 1467 h2_process_demux() 2138 0 h2_done_ff() <--- 38 1453 ssl_sock_io_cb() <--- 18 0 h2_snd_buf() 1 1 h2_init() 3742 2921 -- total -- In practice the ssl_sock_io_cb() wakeups are those notifying about SUB_RETRY_RECV, which are not accounted for when h2_done_ff() performs the wakeup because the tasklet is already queued (a counter placed there shows that it's nonetheless called). So there's no transfer and h2_done_ff() was only hiding the other one. Another test involving 4 connections with 10 concurrent streams each and 20000 1MB objects total shows a total disparition of the wakeups from h2_snd_buf and h2_done_ff, which used to account together for 50% of the wakeups, resulting in effectively halving the number of wakeups which, based on their avg process time, were not doing anything: Before: function calls cpu_tot cpu_avg h2_io_cb 2571208 7.406s 2.880us <- h2c_restart_reading@src/mux_h2.c:940 tasklet_wakeup h2_io_cb 2536949 251.4ms 99.00ns <- h2_snd_buf@src/mux_h2.c:7573 tasklet_wakeup ### h2_io_cb 41100 5.622ms 136.0ns <- h2_done_ff@src/mux_h2.c:7779 tasklet_wakeup ### h2_io_cb 38979 852.8ms 21.88us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup h2_io_cb 12519 90.28ms 7.211us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup h2_io_cb 1 13.81us 13.81us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup h2_io_cb 5200756 8.606s 1.654us --total-- After: h2_io_cb 2562340 8.157s 3.183us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup h2_io_cb 30109 840.9ms 27.93us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup h2_io_cb 16105 106.4ms 6.607us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup h2_io_cb 1 11.75us 11.75us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup h2_io_cb 2608555 9.104s 3.490us --total--	2024-10-12 16:38:36 +02:00
Willy Tarreau	633c41c621	MEDIUM: mux-h2: rework h2_restart_reading() to differentiate recv and demux From the beginning, h2_restart_reading() has always been confusing because it decides whether or not to wake the tasklet handler up or not. This tasklet handler does two things, one is receiving from the socket to the demux buf, and one is demuxing from the demux buf to the streams' rxbufs. The conditions are governed by h2_recv_allowed(), which is also called at a few places to decide whether or not to actually receive from the socket. It starts to be visible that this leaves some difficulties regarding what to do with possibly pending data. In 2.0 with commit `3ca18bf0b` ("BUG/MEDIUM: h2: Don't attempt to recv from h2_process_demux if we subscribed."), we even had to address a special case where it was possibly to endlessly wake up because the conditions would rely on the demux buffer's contents, though the solution consisted in passing a flag to decide whether or not to consider the buffer's contents. In 2.5 commit `b5f7b5296` ("BUG/MEDIUM: mux-h2: Handle remaining read0 cases on partial frames") introduced a new flag H2_CF_DEM_SHORT_READ which indicates that the demux had to stop in the middle of a frame and cannot make progress without more data. More adaptations later came in based on this but this actually reflected exactly what was needed to solve this painful situation: a state indicating whether to receive or parse. Now's about time to definitely address this by reworking h2_restart_reading() to check two completely independent things: - the ability to receive more data into the demux buffer, which is based on its allocation/fill state and the socket's errors - the ability to demux such data, which is based on the presence of enough data (i.e. no stuck short read), and ability to find an rx buf to continue the processing. Now the conditions are much more understandable, and it's also visible that the consider_buffer argument, whose value was not trivial for callers, is not used anymore. Tests stacking two layers of H2 show strictly no change to the wakeup cause distributions nor counts.	2024-10-12 16:38:36 +02:00

... 2 3 4 5 6 ...

23295 Commits