haproxy

mirror of https://git.haproxy.org/git/haproxy.git/ synced 2025-08-17 20:46:58 +02:00

Author	SHA1	Message	Date
Willy Tarreau	5496d06b2b	IMPORT: plock: export the uninlined version of the lock wait function The inlining of the lock waiting function was made more easily configurable with commit 7505c2e ("plock: always expose the inline version of the lock wait function"). However, the standard one remained static, but in order to resolve the symbols in "perf top", it's much better to export it, so let's move "static" with "inline" and leave it exported when PLOCK_INLINE_EBO is not set. This is plock commit 3bea7812ec705b9339bbb0ed482a2cd8aa6c185c.	2025-02-07 18:04:29 +01:00
Christopher Faulet	eb4e517489	CLEANUP: mux-spop: Remove useless comments Just a small cleanup to remove some comments added during the development of the mux.	2025-02-06 11:19:32 +01:00
Christopher Faulet	d16c534511	MINOR: mux-spop: Report EOI on the SE when a ACK is received for a stream The spop stream now reports the end of input when the ACK is transferred to the SPOE applet. To do so, the flag SPOP_SF_ACK_RCVD was added. It is set on the SPOP stream when its ACK is received by the SPOP connection. In addition when SPOP stream flags are propagated to the SE, the error is now reported if end of input was not reached instead of testing the connection error code. It is more accurate. This patch should be backported to 3.1.	2025-02-06 11:19:32 +01:00
Frederic Lecaille	85cb1cc7f4	BUILD: ssl: remove a boringssl definition defined by recent boringssl libs This is the case for AWS-LC which derives from boringssl, where X509_OBJECT_get0_X509_CRL() is already defined. There is definitively no more need to define this function to build haproxy against TLS libs derived from boringssl.	2025-02-06 10:48:25 +01:00
Aurelien DARRAGON	0846638f7f	MEDIUM: stream: interrupt costly rulesets after too many evaluations It is not rare to see configurations with a large number of "tcp-request content" or "http-request" rules for instance. A large number of rules combined with cpu-demanding actions (e.g.: actions that work on content) may create thread contention as all the rules from a given ruleset are evaluated under the same polling loop if the evaluation is not interrupted. Thus, in this patch we add extra logic around "tcp-request content", "tcp-response content", "http-request" and "http-response" rulesets, so that when a certain number of rules are evaluated under the single polling loop, we force the evaluating function to yield. As such, the rule which was about to be evaluated is saved, and the function starts evaluating rules from the saved pointer when it returns (in the next polling loop). We use task_wakeup(task, TASK_WOKEN_MSG) to explicitly wake the task so that no time is wasted and the processing is resumed ASAP. TASK_WOKEN_MSG is mandatory here because process_stream() expects TASK_WOKEN_MSG for explicit analyzers re-evaluation. rules_bcount stream's attribute was added to count how many rules were evaluated since last interruption (yield). Also, SF_RULE_FYIELD flag was added to know that the s->current_rule was assigned due to forced yield and not regular yield. By default haproxy will enforce a yield every 50 rules, this behavior can be configured using the "tune.max-rules-at-once" global keyword. There is a limitation though: for now, if the ACT_OPT_FINAL flag is set on act_opts, we consider it is not safe to yield (as it is already the case for automatic yield). In this case instead of yielding and taking the risk of not being called back, we skip the yield and hope it will not create contention. This is something we should ideally try to improve in order to yield in all conditions.	2025-02-03 17:09:48 +01:00
Christopher Faulet	5f927f603a	BUG/MEDIUM: mux-fcgi: Properly handle read0 on partial records A Read0 event could be ignored by the FCGI multiplexer if it is blocked on a partial record. Instead of handling the event, it remained blocked, waiting for the end of the record. To fix the issue, the same solution as the H2 multiplexer is used. Two flags are introduced. The first one, FCGI_CF_END_REACHED, is used to acknowledge a read0. This flag is set when a read0 was received AND the FCGI multiplexer must handle it. The second one, FCGI_CF_DEM_SHORT_READ, is set when the demux is interrupted on a partial record. A short read and a read0 lead to the FCGI_CF_END_REACHED flag being set. With these changes, the FCGI mux should be able to properly handle read0 on partial records. This patch should be backported to all stable versions after a period of observation.	2025-02-03 07:49:50 +01:00
Christopher Faulet	71320fc9c1	MINOR: tevt/connection: Add support for POLL_HUP/POLL_ERR events Connection errors can be detected via connect/recv/send syscall, but also because it was reported by the poller. So dedicated events, at the FD level, are introduced to make the difference. term_events tool was updated accordingly.	2025-01-31 10:41:50 +01:00
Christopher Faulet	990854ee0d	REORG: tevt/connection: Move enums at the end of the header file Enums used to report events were placed in the connection header for convenience. But they are not specifically related to connections. So, they are moved to the end of the file for better isolation.	2025-01-31 10:41:50 +01:00
Christopher Faulet	487d6b09f1	MINOR: tevt: Improve function to convert a termination events log to string The function is now responsible for handling an empty log when no event was reported. In that case, an empty string is returned. It is also responsible for handling the case where the termination events log is not supported for a given entity (for instance the quic mux for now). In that case, a dash ("-") is returned.	2025-01-31 10:41:50 +01:00
Christopher Faulet	cbd898c42b	MINOR: tevt: Don't duplicate termination event during reporting It is hard to never detect the same event several times without painful tests. In other words, the same termination event can be reported several times and this must be handled. To do so, the "tevt_report_event" macro is updated to ignore an event if the last reported one is of the same type, for the same location. Of course, if the same event is reported several times at different moments, it will not be detected.	2025-01-31 10:41:50 +01:00
Christopher Faulet	2dc02f75b1	MEDIUM: tevt/stconn/stream: Add dedicated termination events for stream location It is the last patch to introduce dedicated termination events for each location. In this one, events for the stream location are introduced. The old enum is also removed because it is now unused. Here, more accurate events are added. The "intercepted" event was split.	2025-01-31 10:41:50 +01:00
Christopher Faulet	a58e650ad1	MEDIUM: tevt/muxes: Add dedicated termination events for muxc/se locations Termination events dedicated to mux connection and stream-endpoint descriptors are added in this patch. Specific events to these locations are thus added. Changes for the H1 and H2 multiplexers are reviewed to be more accurate.	2025-01-31 10:41:50 +01:00
Christopher Faulet	f2778ccc7d	MINOR: tevt/connection: Add dedicated termination events for lower locations To be able to add more accurate termination events for each location, the enum will be split by location. Indeed, there are at most 16 possible events. It would be pretty confusing to use the same termination events for the different locations. So the best is to split them. In this patch, the termination events for the fd, hs and xprt locations are introduced. For now some holes are added to keep similar events aligned across enums. But this may change in the future.	2025-01-31 10:41:50 +01:00
Christopher Faulet	a4c281a190	MINOR: tevt/muxes: Add CTL and SCTL command to get the termination event logs MUX_CTL_TEVTS command is added to get the termination event logs of a mux connection and MUX_SCTL_TEVTS command to get the termination event logs of a mux stream.	2025-01-31 10:41:50 +01:00
Christopher Faulet	00a07c8b54	MINOR: tevt/stream/stconn: Report termination events for stream and sc In this patch, events for the stream location are reported. These events are first reported on the corresponding stream-connector. So front events on scf and back events on scb. Then all events are merged in the stream. But only 4 events are saved on the stream. Several internal events are for now grouped with the type "tevt_type_intercepted". More events will be added to have a better resolution. But at least the places to report these events are identified. For now, when an event is reported on a SC, it is also reported on the stream and vice versa.	2025-01-31 10:41:50 +01:00
Christopher Faulet	992b4b9726	MINOR: tevt/stconn: Add a termination events log in the SE descriptor This termination events log will be used to report events from the mux streams. The location will be "tevt_loc_se" and the muxes will be responsible to report the corresponding events.	2025-01-31 10:41:50 +01:00
Christopher Faulet	e944944990	MINOR: tevt: Add the termination events log's foundations Termination events logs will be used to report the events that led to closing a connection. Unlike flags, that reflect a state, the idea here is to store a log to preserve the order of the events. Most of the time, when debugging an issue, the order of the events is crucial to be able to understand the root cause of the issue. The traces are truly helpful to do so. But it is not always possible to activate them because they are pretty verbose. On heavily loaded platforms, it is not acceptable. We hope that the termination events logs will help us in those situations. One termination events log will be stored at each layer (connection, mux connection, mux stream...) as a 32-bit integer. Each event will be stored on 8 bits, 4 bits for the location and 4 bits for the type. So only the first four events will be stored for each layer. It should be enough to understand why a connection is closed. In this patch, the enums defining the termination event locations and types are added. The macro to report a new event is also added and a function to convert a termination events log to a string that could be displayed in log messages for instance.	2025-01-31 10:41:49 +01:00
Christopher Faulet	e56e718c82	MINOR: mux-h1: Add masks to group H1S DEMUX and MUX errors It is just a small patch to clean up mux/demux functions. Instead of listing the H1S errors that must be handled during demux of mux operations, masks of flags are used. It is more readable.	2025-01-31 10:41:49 +01:00
Willy Tarreau	d155924efe	MINOR: fd: add a generation number to file descriptors This patch adds a counter of close() on file descriptors in the fdtab. The goal is to better detect if reported events concern the current or a previous file descriptor. For now the counter is only added, and is showed in "show fd" as "gen". We're reusing unused space at the end of the struct. If it's needed for something more important later, this patch can be reverted.	2025-01-30 19:45:34 +01:00
Willy Tarreau	44ac7a7e73	DEBUG: fd: add a counter of takeovers of an FD since it was last opened That's essentially in order to help with debugging strange cases like the occasional epoll issues/races, by keeping a counter of how many times an FD was taken over since last inserted. The room is available so let's use it. If it's needed later, this patch can easily be reverted. The counter is also reported in "show fd" as "tkov".	2025-01-30 19:45:34 +01:00
Amaury Denoyelle	b849ee5fa3	BUILD: quic: fix overflow in global tune A new global option was recently introduced to disable pacing. However, the value used (1<<31) caused issue with some compiler as options field used for storage is declared as int. Move pacing deactivation flag outside into the newly defined quic_tune to fix this. This should be backported up to 3.1 after a period of observation. Note that it relied on the previous patch which defined new quic_tune type.	2025-01-30 18:12:53 +01:00
Amaury Denoyelle	09e9c7d5b7	MINOR: quic: define quic_tune Define a new structure quic_tune. It will be useful to regroup various configuration settings and tunable related to QUIC, instead of defining them into the global structure.	2025-01-30 18:12:40 +01:00
Amaury Denoyelle	0c8b54b2d1	MINOR: quic: transform pacing settings into a global option Pacing support was previously activated on each bind line individually, via an optional argument of quic-cc-algo keyword. Remove this optional argument and introduce a global setting to enable/disable pacing. Pacing activation is still flagged as experimental. One important change is that previously BBR usage automatically activated pacing support. This is not the case anymore, so users should now always explicitly activate pacing if BBR is selected. A new warning message will be displayed if this is not the case. Another consequence of this change is that now pacing_inter callback is always defined for every quic_cc_algo types. As such, QUIC MUX uses global.tune.options to determine if pacing is required. This should be backported up to 3.1, after a period of observation.	2025-01-30 17:19:38 +01:00
William Lallemand	b43e5d8c16	BUILD: ssl: more cleaner approach to WolfSSL without renegotiation Patch discussed in https://github.com/wolfSSL/wolfssl/issues/6834 When building Wolfssl without renegotiation options, WolfSSL still defines the macros about it, which warns during the build. This patch completes the previous one by undefining the macros so haproxy could build without any warning.	2025-01-28 20:55:20 +01:00
William Lallemand	c6a8279cdf	BUILD: ssl: allow to build without the renegotiation API of WolfSSL In ticket https://github.com/wolfSSL/wolfssl/issues/6834, it was suggested to push --enable-haproxy within --enable-distro. WolfSSL does not want to include the renegotiation support in --enable-distro. To achieve this, let haproxy build without SSL_renegotiate_pending() when wolfssl does not define HAVE_SECURE_RENEGOCIATION or HAVE_SERVER_RENEGOCIATION_INFO.	2025-01-28 18:31:32 +01:00
Willy Tarreau	f17b0a994b	BUILD: tools: fix build on BSD by dropping the ETIME check Commit `44537379fc` ("MINOR: tools: add errname to print errno macro name") brought a facility to report errno using a symbolic string when known instead of showing only the value. However, among the listed options, ETIME is mentioned but is unknown from FreeBSD where it breaks the build. Let's simply drop it, we don't use ETIME anyway and even if it would be reported, the default code path still reports the numeric value so there's no harm. If other ones fail to build in the future, they could be handled the same way.	2025-01-28 15:58:57 +01:00
Christopher Faulet	36d151dc10	MEDIUM: stream: No longer use TASK_F_UEVT* to shut a stream down Thanks to the previous patch, it is now possible to explicitly rely on stream's events to shut it down. The right event is set in stream_shutdown(), before waking up the stream, via an atomic operation. In process_stream(), this event will be handled as expected. Thus, TASK_F_UEVT* are no longer used, but not removed since still usable for other tasks. This patch depends on "MEDIUM: stream: Map task wake up reasons to dedicated stream events".	2025-01-28 14:53:37 +01:00
Christopher Faulet	6048460102	MEDIUM: stream: Map task wake up reasons to dedicated stream events To fix thread-safety issues when a stream must be shut, three new task states were added. These states are generic (UEVT1, UEVT2 and UEVT3), the task callback function is responsible to know what to do with them. However, it is not really scalable. The best is to use an atomic field in the stream structure itself to deal with these dedicated events. There is already the "pending_events" field that saves wake up reasons (TASK_WOKEN_) to not lose them if process_stream() is interrupted before it had a chance to handle them. So the idea is to introduce a new field to handle streams dedicated events and merge them with the task's wake up reasons used by the stream. This means a mapping must be performed between some task wake up reasons and streams events. Note that not all task wake up reasons will be mapped. In this patch, the "new_events" field is introduced. It is an atomic bit-field. Streams events (STRM_EVT_) are also introduced to map the task wake up reasons used by process_stream(). Only TASK_WOKEN_TIMER and TASK_WOKEN_MSG are mapped, in addition to TASK_F_UEVT* flags. In process_stream(), "pending_events" field is now filled with new stream events and the mapping of the wake up reasons.	2025-01-28 14:53:37 +01:00
Christopher Faulet	0a52a75ef7	BUG/MINOR: stream: Properly handle "on-marked-up shutdown-backup-sessions" shutdown-backup-sessions action for on-marked-up directive does not work anymore since the stream_shutdown() function was modified to be async-safe. When stream_shutdown() was modified to be async-safe, dedicated task events were added to map the reasons to shut a stream down. SF_ERR_DOWN was mapped to TASK_F_EVT1 and SF_ERR_KILLED was mapped to TASK_F_EVT2. The reverse mapping was performed by process_stream() to shut the stream with the appropriate reason. However, SF_ERR_UP reason, used by shutdown-backup-sessions action to shut a stream down because a preferred server became available, was not mapped in the same way. So since commit `b8e3b0a18d` ("BUG/MEDIUM: stream: make stream_shutdown() async-safe"), this action is ignored and does not work anymore. To fix the issue, and to be able to backport the fix, a third task event was added. TASK_F_EVT3 is now mapped on SF_ERR_UP. This patch should fix the issue #2848. It must be backported as far as 2.6.	2025-01-28 14:53:37 +01:00
Olivier Houchard	26b3e5236f	MEDIUM: servers/proxies: Switch to using per-tgroup queues. For both servers and proxies, use one connection queue per thread-group, instead of only one. Having only one can lead to severe performance issues on NUMA machines, it is actually trivial to get the watchdog to trigger on an AMD machine, having a server with a maxconn of 96, and an injector that uses 160 concurrent connections. We now have one queue per thread-group, however when dequeueing, we're dequeuing MAX_SELF_USE_QUEUE (currently 9) pendconns from our own queue, before dequeueing one from another thread group, if available, to make sure everybody is still running.	2025-01-28 12:49:41 +01:00
Olivier Houchard	583303c48b	MINOR: proxies/servers: Calculate queueslength and use it. For both proxies and servers, properly calculate queueslength, which is the total number of elements across all queues (as they currently use only one queue, it is equivalent to the number of elements in that queue), and use it instead of the queue's length.	2025-01-28 12:49:41 +01:00
Olivier Houchard	59eddabe16	MINOR: Add fields to the per-thread group field in struct server. Add a per-thread group queue and associated fields in per-thread group field in struct server, as well as a new field, queues length. This is currently unused, so should change nothing.	2025-01-28 12:49:41 +01:00
Olivier Houchard	f879b9a18a	MINOR: proxies: Add a per-thread group field to struct proxy. Add a per-thread group field to struct proxy, that will contain a struct queue, as well as a new field, "queueslength". This is currently unused, so should change nothing. Please note that proxy_init_per_thr() must now be called for each proxy once the thread groups number is known.	2025-01-28 12:49:41 +01:00
Aurelien DARRAGON	e768a531b7	CLEANUP: tree-wide: define and use acl_match_cond() helper acl_match_cond() combines acl_exec_cond() + acl_pass() and a check on the condition->pol (to check if the cond is inverted) in order to return either 0 if the cond doesn't match or 1 if it matches (or NULL). Thanks to this we can actually simplify some redundant constructs that iterate over rules and evaluate if the condition matches or not. Conditions for tcp-request inspect-content and tcp-response inspect-content couldn't be simplified because they perform an extra check for missing data, and thus still need to leverage acl_exec_cond() It's best to display the patch using "-w", like "git show xxxx -w", because some blocks had to be re-indented after the cleanup, which makes the patch hard to review by default.	2025-01-27 11:11:43 +01:00
Valentine Krasnobaeva	94d3b7375a	CLEANUP: ssl: move ssl_sock_gencert_load_ca declaration in ssl_gencert.h As ssl_sock_gencert_load_ca and ssl_sock_gencert_free_ca are compiled only if SSL_NO_GENERATE_CERTIFICATES is not defined, let's align it and move these declarations in ssl_gencert.h.	2025-01-24 12:31:07 +01:00
Valentine Krasnobaeva	846819b316	CLEANUP: ssl: rename ssl_sock_load_ca to ssl_sock_gencert_load_ca ssl_sock_load_ca is defined in ssl_gencert.c and compiled only if SSL_NO_GENERATE_CERTIFICATES is not defined. Its name is a bit confusing, as we may think at first glance that it is a generic function, which is also used to load the CA file provided via the 'ca-file' keyword. ssl_set_verify_locations_file is used in this case. So let's rename ssl_sock_load_ca into ssl_sock_gencert_load_ca. The same is applied to ssl_sock_free_ca.	2025-01-24 12:31:07 +01:00
Valentine Krasnobaeva	44537379fc	MINOR: tools: add errname to print errno macro name Add helper to print the name of errno's corresponding macro, for example "EINVAL" for errno=22. This may be helpful for debugging and for using in some CLI commands output. The switch-case in errname() contains only the errnos currently used in the code. So, it needs to be extended, if one starts to use new syscalls.	2025-01-24 09:54:57 +01:00
Amaury Denoyelle	7896edccdc	MINOR: quic: remove unused pacing burst in bind_conf/quic_cc_path Pacing burst size is now dynamic. As such, configuration value has been removed and related fields in bind_conf and quic_cc_path structures can be safely removed. This should be backported up to 3.1.	2025-01-23 17:40:48 +01:00
Amaury Denoyelle	cb91ccd8a8	MEDIUM: quic: use dynamic credit for pacing Major improvements have been introduced in pacing recently. Most notably, QMUX schedules emission on a millisecond resolution, which allows the use of passive wait to be much more CPU friendly. However, an issue remains with the pacing max credit. Unless BBR is used, it is fixed to the configured value from quic-cc-algo bind statement. This is not practical as if too low, it may drastically reduce performance due to 1ms sleep resolution. If too high, some clients will suffer from too much packet loss. This commit fixes the issue by implementing a dynamic maximum credit value based on the network condition specific to each client. Calculation is done to fix a maximum value which should allow QMUX current tasklet context to emit enough data to cover the delay with the next tasklet invocation. As such, avg_loop_us is used to detect the process load. If too small, 1.5ms is used as minimal value, to cover the extra delay incurred by the system which will happen for a default 1ms sleep. This should be backported up to 3.1.	2025-01-23 17:40:48 +01:00
Amaury Denoyelle	8098be1fdc	MEDIUM: mux-quic: reduce pacing CPU usage with passive wait Pacing algorithm has been revamped in the previous commit to implement a credit based solution. This is a far more adaptive solution, which in particular allows catching up in case the pause between pacing emissions was longer than expected. This allows QMUX to remove the active loop based on tasklet wake-up. Instead, a new task is used when emission should be paced. The main advantage is that CPU usage is drastically reduced. New pacing task timer is reset each time qcc_io_send() is invoked. Timer will be set only if pacing engine reports that emission must be interrupted. In this case timer is set via qcc_wakeup_pacing() to the delay reported by congestion algorithm, or 1ms if delay is too short. At the end of qcc_io_cb(), pacing task is queued if timer has been set. Pacing task execution is simple enough: it immediately wakes up QCC I/O handler. Note that to have decent performance, it requires a large enough burst defined in the configuration of quic-cc-algo. However, this value is common to all clients of a listener, which may cause too much loss under some network conditions. This will be addressed in a future patch. This should be backported up to 3.1.	2025-01-23 17:40:22 +01:00
Amaury Denoyelle	4489a61585	MEDIUM: quic: implement credit based pacing Implement a new method for QUIC pacing emission based on credit. This represents the number of packets which can be emitted in a single burst. After emission, decrement from the credit the number of emitted packets. Several emissions can be conducted in the same sequence until the credit is completely decremented. When a new emission sequence is initiated (i.e. under a new QMUX tasklet invocation), credit is refilled according to the delay which occurred between the last and current emission context. The main advantage of this new mechanism is that it allows conducting several emissions in the same task context without having to wait between each invocation. Wait is only forced if pacing is expired, which is now equivalent to having a null credit. Furthermore, if the delay between two emission sequences is smaller than expected, credit is only partially refilled. This allows restarting emission without having to wait for the whole credit to be available. On the implementation side, a new field <credit> is available in quic_pacer structure. It is automatically decremented on quic_pacing_sent_done() invocation. Also, a new function quic_pacing_reload() must be used by QUIC MUX when a new emission sequence is initiated to refill credit. <next> field from quic_pacer has been removed. For the moment, credit is based on the burst configured via quic-cc-algo keyword, or directly reported by BBR. This should be backported up to 3.1.	2025-01-23 17:40:20 +01:00
Amaury Denoyelle	bbaa7aef7b	BUG/MINOR: quic: do not increase congestion window if app limited Previously, congestion window was increased each time a new acknowledgment was received. However, it did not take into account the window filling level. In a network condition with negligible loss, this will cause the window to be incremented until the maximum value (by default 480k), even though the application does not have enough data to fill it. In most cases, this issue is not noticeable. However, it may lead to excessive memory consumption when a QUIC connection is suddenly interrupted, as in this case haproxy will fill the window with retransmission. It even has caused OOM crash when thousands of clients were interrupted at once on a local network benchmark. Fix this by first checking window level prior to every incrementation via a new helper function quic_cwnd_may_increase(). It was arbitrarily decided that the window must be at least 50% full when the ACK is handled prior to incrementing it. This value is a good compromise to keep window in check while still allowing fast increment when needed. Note that this patch only concerns cubic and newreno algorithms. BBR already has its notion of application limited which ensures the window is only incremented when necessary. This should be backported up to 2.6.	2025-01-23 14:49:35 +01:00
Amaury Denoyelle	7c0820892f	MINOR: quic: rename pacing_rate cb to pacing_inter Rename one of the congestion algorithms pacing callback from pacing_rate to pacing_inter. This better reflects that this function returns a delay (in nanoseconds) which should be applied between each packet emission to fill the congestion window with a perfectly smoothed emission. This should be backported up to 3.1.	2025-01-23 14:49:35 +01:00
Amaury Denoyelle	2178bf1192	CLEANUP: quic: remove unused prototype Remove undefined quic_pacing_send() function prototype from quic_pacing module. This should be backported up to 3.1.	2025-01-23 14:49:35 +01:00
Frederic Lecaille	4f38c4bfd8	MINOR: quic: Add a BUG_ON() on quic_tx_packet refcount It is definitely a bug to call quic_tx_packet_refdec() to decrement the reference counter of a TX packet, and possibly to release its memory, when this counter is already negative or null. This counter is incremented when a TX frm is attached to it with some allocated memory and when the packet is inserted into a data structure, if needed (list or tree). Should be easily backported as far as 2.6 to ease any further backport around this code part.	2025-01-21 22:01:34 +01:00
Frederic Lecaille	cb729fb64d	BUG/MINOR: quic: ensure a detached coalesced packet can't access its neighbours Reset ->prev and ->next fields of a coalesced TX packet to ensure it cannot access its neighbours several times after it is supposed to be detached from them by calling quic_tx_packet_dgram_detach(). There are two cases where a packet can be coalesced to another previously built one: this is when it is built into the same datagram without GSO (and flagged with QUIC_FL_TX_PACKET_COALESCED) or when sent from the same sendto() syscall with GSO (not flagged with QUIC_FL_TX_PACKET_COALESCED). This fix may be in relation with GH #2839. Must be backported as far as 2.6.	2025-01-21 22:01:34 +01:00
Willy Tarreau	b066c0affb	REORG: version: move the remaining BUILD_* stuff from haproxy.c to version.c version.c tries to centralize all variables conveying version information, but there's still an issue with the BUILD_* variables which are only passed to haproxy.o and are only updated when that one is rebuilt. This is not very logical given that we can end up with values there which contradict info from version.c. Better move all of these to version.c which is systematically rebuilt. Most of these variables only end up as string concatenation at the moment. Some of them are even duplicated. In version.c we now have one variable (or constant) for each of them and haproxy.c references them in messages. This is much more logical and easier to maintain in a consistent state. The patch looks a bit large but it really only moves the ifdefed string assignment from one file to another, placing them into variables.	2025-01-20 17:53:55 +01:00
Amaury Denoyelle	a50dd07c16	MINOR: trace: ensure -dt priority over traces config section Traces can be activated on startup either via -dt command line argument or via the traces configuration section. This can cause confusion as it may not be clear how a trace source can be completed or overridden by one or the other. Fix the precedence to give the priority to the command line argument. Now, each trace source configured via -dt is first reset to a default state before applying new settings. Then, it is impossible to change a trace source via the configuration file if it was already targeted via -dt argument.	2025-01-10 14:50:59 +01:00
Willy Tarreau	b25850f25b	MINOR: tools: add a few functions to simply check for a file's existence At many places we'd like to be able to simply construct a path from a format string and check if that path corresponds to an existing file, directory etc. Here we add 3 functions, a generic one to test that a path corresponds to a given file mode (e.g. S_IFDIR, S_IFREG etc), and two other ones specifically checking for a file or a dir for easier use.	2025-01-09 09:18:49 +01:00
Willy Tarreau	bd06502b22	BUILD: makefile: add a qinfo macro to pass info in quiet mode Some commands such as $(cmd_CC) etc already handle the quiet vs verbose mode in the makefile, but sometimes we may want to pass other info. The new "qinfo" macro can be called with a 9-char string argument (spaces included) as a prefix for some commands, to emit that string when in quiet mode. The caller must fill the spaces needed for alignment. E.g: $(call qinfo, CC )$(CC) ...	2025-01-08 11:26:05 +01:00
Amaury Denoyelle	af00be8e0f	MINOR: mux-quic: change return value of qcs_attach_sc() A recent fix was introduced to ensure that a streamdesc instance won't be attached to an already completed QCS which is eligible to purging. This was performed by skipping application protocol decoding if a QCS is in such a state. Here is the patch responsible for this change. `caf60ac696` BUG/MEDIUM: mux-quic: do not attach on already closed stream However, this is too restrictive, in particular for unidirectional streams where no streamdesc is ever attached. To fix this behavior, first qcs_attach_sc() API has been modified. Instead of returning a streamdesc instance, it returns either 0 on success or a negative error code. There should be no functional changes with this patch. It is only to be able to extend qcs_attach_sc() with the possibility of skipping streamdesc instantiation while still keeping a success return value. This should be backported wherever the above patch has been merged. For the record, it was scheduled for immediate backport on 3.1, plus merging on older releases up to 2.8 after a period of observation.	2025-01-03 17:19:21 +01:00
Willy Tarreau	f486f976c7	BUILD: limits: make normalize_rlim() take an rlim_t to fix build on m68k As can be seen here, the build fails on m68k since commit `665dde648` ("MINOR: debug: use LIM2A to show limits") in 3.1: https://github.com/haproxy/haproxy/actions/runs/12440234399/job/34735360177 The reason is the comparison between a ulong limit and RLIM_INFINITY. Indeed, on m68k, rlim_t is an unsigned long long. Let's just change the function's input type to take an rlim_t instead. This also allows to get rid of the casts in the call place. This can be backported to 3.1 though it's not important given the low prevalence of this platform for such use cases.	2024-12-25 12:33:06 +01:00
Willy Tarreau	f78121dd32	BUILD: compat: add missing fcntl.h before defining F_SETPIPE_SZ In 1.5-dev8, 13 years ago, support for setting pipe size was added by commit `bd9a0a778` ("OPTIM/MINOR: make it possible to change pipe size (tune.pipesize)"). For compatibility purposes, it was defining F_SETPIPE_SZ in compat.h if it was not set. It apparently always had F_SETPIPE_SZ defined before being included. Now in 3.2-dev1, commit `fbc534a6f` ("REORG: startup: move nofile limit checks in limits.c") reordered a few includes and ended up with mworker-prog.c including compat.h before fcntl.h, causing a redefinition error on certain libcs: CC src/mworker-prog.o In file included from /usr/include/bits/fcntl.h:61:0, from /usr/include/fcntl.h:35, from include/haproxy/limits.h:11, from include/haproxy/mworker.h:18, from src/mworker-prog.c:27: /usr/include/bits/fcntl-linux.h:203:0: warning: "F_SETPIPE_SZ" redefined [enabled by default] In file included from include/haproxy/api-t.h:35:0, from include/haproxy/api.h:33, from src/mworker-prog.c:23: include/haproxy/compat.h:161:0: note: this is the location of the previous definition Let's simply include fcntl.h in compat.h before the macro is redefined. There's normally no need to backport this, though it's harmless to do it if needed.	2024-12-25 11:53:11 +01:00
Olivier Houchard	505480eeef	CLEANUP: Remove pendconn_must_try_again(). Remove pendconn_must_try_again(), now that it no longer is used.	2024-12-24 14:10:06 +01:00
Olivier Houchard	cda7275ef5	MEDIUM: queue: Handle the race condition between queue and dequeue differently There is a small race condition between a server checking whether there is something left in the proxy queue and a stream being added to that queue. If the server checks just before the stream is added to the queue, and it no longer has any stream to deal with, then nothing will take care of the stream, that may stay in the queue forever. This was worked around with commit `5541d4995d`, by checking for that exact condition after adding the stream to the queue, and trying again to get a server assigned if it is detected. That fix led to multiple infinite loops, that got fixed, but it is not unlikely that it could happen again. So let's fix the initial problem differently : a single server may mark itself as ready, and it removes itself once used. The principle is that when we discover that the just queued stream is alone with no active request anywhere to dequeue it, instead of rebalancing it, it will be assigned to that current "ready" server that is available to handle it. The extra cost of the atomic ops is negligible since the situation is super rare.	2024-12-24 14:10:06 +01:00
Olivier Houchard	5b8899b6cc	BUG/MEDIUM: queue: Make process_srv_queue return the number of streams Make process_srv_queue() return the number of streams unqueued, as pendconn_grab_from_px() did, as that number is used by srv_update_status() to generate logs. This should be backported up to 2.6 with `111ea83ed4`	2024-12-23 15:03:40 +01:00
William Lallemand	056ec51c26	MEDIUM: ssl/ocsp: counters for OCSP stapling Add 2 counters in the SSL stats module for OCSP stapling. - ssl_ocsp_staple is the number of OCSP responses successfully stapled with the handshake - ssl_failed_ocsp_stapled is the number of OCSP responses that we couldn't staple, it could be because of an error or because the response is expired. These counters are incremented in the OCSP stapling callback, so if no OCSP was configured they will never increase. Also they are only working in frontends. This was discussed in github issue #2822.	2024-12-23 11:23:00 +01:00
William Lallemand	0e6af97233	MINOR: ssl: change visibility of ssl_stats_module In order to add stats from other files, the ssl_stats_module need to be visible from other files. This moves the ssl_counters definition in ssl_sock-t.h and removes the static of ssl_stats_module.	2024-12-23 11:23:00 +01:00
William Lallemand	acb2c9eb8b	MINOR: ssl: improve HAVE_SSL_OCSP ifdef Allow to build correctly without OCSP. It could be disabled easily with OpenSSL build with OPENSSL_NO_OCSP. Or even with DEFINE="-DOPENSSL_NO_OCSP" on haproxy make line.	2024-12-19 10:53:05 +01:00
Remi Tricot-Le Breton	93f2c73423	MINOR: ssl/ocsp: Add extra details in error logs when possible When the ocsp response auto update process fails during insertion or while validating the received ocsp response, we call ssl_sock_update_ocsp_response or ssl_ocsp_check_response respectively and both these functions take an 'err' parameter in which detailed error messages can be written. Until now, those error messages were discarded and the only information given to the user was a generic error (ERR_CHECK or ERR_INSERT) which does not help much. We now keep a pointer to the last error message in the certificate_ocsp structure and dump its content in the update logs as well as in the "show ssl ocsp-updates" cli command. This issue was raised in GitHub #2817.	2024-12-18 10:41:16 +01:00
Amaury Denoyelle	9d155ca706	MINOR: trace: implement tracing disabling API Define a set of functions to temporarily disable/reactivate tracing for the current thread. This could be useful when wanting to quickly remove tracing output for some code parts. The API relies on a disable/resume set of functions, with a thread-local counter. This counter is tested under __trace_enabled(). It is a cumulative value so that the same count of resume must be issued after several disable usage. There is also the possibility to force reset the counter to 0 before restoring the old value. This should be backported up to 3.1.	2024-12-18 09:52:06 +01:00
Amaury Denoyelle	e296585ae9	MEDIUM/OPTIM: mux-quic: implement purg_list This commit is part of the current series which aims to refactor and improve overall performance of QUIC MUX I/O handler. qcc_io_process() is responsible for performing some internal operations on QUIC MUX after I/O completion. It is notably called on every qcc_io_cb() tasklet handler. The most intensive work on it is the purging of QCS instances after transfer completion. This was implemented by looping on QCC streams tree and inspecting the state of every QCS. The purpose of this commit is to optimize this processing. A new purg_list QCC member is defined. It is responsible for listing every QCS instance whose transfer has been completed. It is thus safe to reuse <el_send> QCS list attach point. Stream purging will thus only loop on purg_list instead of every known QCS. This should be backported up to 3.1.	2024-12-18 09:33:52 +01:00
Amaury Denoyelle	4b42dd4ae0	MEDIUM/OPTIM: mux-quic: define a recv_list for demux resumption This commit is part of the current series which aims to refactor and improve overall performance of QUIC MUX I/O handler. Define a recv_list element into qcc structure. This is used to register every qcs instance which is currently blocked on demuxing, which happens when there is no more space in <rx.appbuf>. The purpose of this patch is to reduce qcc_io_recv() CPU usage. Now, only recv_list iteration is performed, instead of the previous looping over every qcs instance. This is useful as qcc_io_recv() is called each time qcc_io_cb() is scheduled, even if only sending condition was the wakeup origin. A qcs is not inserted into recv_list immediately after blocking on demux full buffer. Instead, this is only done after unblocking via stream rcv_buf callback, which ensures that new buffer space is available. This should be backported up to 3.1.	2024-12-18 09:23:41 +01:00
Amaury Denoyelle	0a53a008d0	MINOR: mux-quic: refactor wait-for-handshake support This commit refactors wait-for-handshake support from QUIC MUX. The flag logic QC_CF_WAIT_HS is inverted : it is now set only if MUX is instantiated before handshake completion. When the handshake is completed, the flag is removed. The flag is now set directly on initialization via qmux_init(). Removal via qcc_wait_for_hs() is moved from qcc_io_process() to qcc_io_recv(). This is deemed more logical as QUIC MUX is scheduled on RECV to be notified by the transport layer about handshake termination. Moreover, qcc_wait_for_hs() is now called if recv subscription is still active. This commit is the first of a series which aims to refactor QUIC MUX I/O handler and improve its overall performance. The ultimate objective is to be able to streamline qcc_io_cb() by removing pacing specific code path via qcc_purge_sending(). This should be backported up to 3.1.	2024-12-18 09:23:41 +01:00
Amaury Denoyelle	17bfe93768	CLEANUP: mux-quic: remove unused qcc member send_retry_list Remove unused fields send_retry_list from qcc and its corresponding attach element el from qcs. This should be backported up to 3.1.	2024-12-18 09:20:20 +01:00
Willy Tarreau	7b6acb6a51	MINOR: bug: make BUG_ON() fall back to ASSUME When the strict level is zero and BUG_ON() is not implemented, some possible null-deref warnings are emitted again because some were covering for these cases. Let's make it fall back to ASSUME() so that the compiler continues to know that the tested expression never happens. It also allows to further optimize certain functions by helping the compiler eliminate certain tests for impossible values. However it requires that the expression is really evaluated before passing the result through ASSUME() otherwise it was shown that gcc-11 and above will fail to evaluate its implications and will continue to emit the null-deref warnings in case the expression is non-trivial (e.g. it has multiple terms). We don't do it for BUG_ON_HOT() however because the extra cost of evaluating the condition is generally not welcome in fast paths, particularly when that BUG_ON_HOT() was kept disabled for performance reasons.	2024-12-17 17:39:12 +01:00
Willy Tarreau	63798088b3	MINOR: compiler: add ASSUME_NONNULL() to tell the compiler a pointer is valid At plenty of places we have ALREADY_CHECKED() or DISGUISE() on a pointer just to avoid "possibly null-deref" warnings. These ones have the side effect of weakening optimizations by passing through an assembly step. Using ASSUME_NONNULL() we can avoid that extra step. And when the __builtin_unreachable() builtin is not present, we fall back to the old method using assembly. The macro returns the input value so that it may be used both as a declarative way to claim non-nullity or directly inside an expression like DISGUISE().	2024-12-17 16:46:46 +01:00
Willy Tarreau	2ce63b7b17	MINOR: compiler: also enable __builtin_assume() for ASSUME() Clang apparently has __builtin_assume() which does exactly the same as our macro, since at least v3.8. Let's enable it, in case it may even better detect assumptions vs unreachable code.	2024-12-17 16:46:46 +01:00
Willy Tarreau	efc897484b	MINOR: compiler: add a new "ASSUME" macro to help the compiler This macro takes an expression, tests it and calls an unreachable statement if false. This allows the compiler to know that such a combination does not happen, and totally eliminate tests that would be related to this condition. When the statement is not available in the compiler, we just perform a break from a do {} while loop so that the expression remains evaluated if needed (e.g. function call).	2024-12-17 16:46:46 +01:00
Willy Tarreau	41fc18b1d1	MINOR: compiler: rely on builtin detection for __builtin_unreachable() Due to __builtin_unreachable() only being associated to gcc 4.5 and above, it turns out it was not enabled for clang. It's not used that much but still a little bit, so let's enable it now. This reduces the code size by 0.2% and makes it a bit more efficient.	2024-12-17 16:46:46 +01:00
Willy Tarreau	96cfcb1df3	MINOR: compiler: add a __has_builtin() macro to detect features more easily We already have a __has_attribute() macro to detect when the compiler supports a specific attribute, but we didn't have the equivalent for builtins. clang-3 and gcc-10 have __has_builtin() for this. Let's just bring it using the same mechanism as __has_attribute(), which will allow us to simply define the macro's value for older compilers. It will save us from keeping that many compiler-specific tests that are incomplete (e.g. the __builtin_unreachable() test currently doesn't cover clang).	2024-12-17 16:46:46 +01:00
Olivier Houchard	b3cd5a4b86	CLEANUP: queues: Remove pendconn_grab_from_px(). pendconn_grab_from_px() is now unused, so just remove it.	2024-12-17 16:05:44 +01:00
William Lallemand	bb88f68cf7	MINOR: ssl: add utils functions to extract X509 notAfter date Add ASN1_to_time_t() which converts an ASN1_TIME to a time_t and x509_get_notafter_time_t() which returns the notAfter date in time_t format.	2024-12-16 14:54:53 +01:00
Valentine Krasnobaeva	fbc534a6fa	REORG: startup: move nofile limit checks in limits.c Let's encapsulate the code, which checks the applied nofile limit into a separate helper check_nofile_lim_and_prealloc_fd(). Let's keep in this new function scope the block, which tries to create a copy of FD with the highest number, if prealloc-fd is set in the configuration.	2024-12-16 10:44:01 +01:00
Valentine Krasnobaeva	14f5e00d38	REORG: startup: move code that applies limits to limits.c In step_init_3() we try to apply provided or calculated earlier haproxy maxsock and memmax limits. Let's encapsulate these code blocks in dedicated functions: apply_nofile_limit() and apply_memory_limit() and let's move them into limits.c. Limits.c gathers now all the logic for calculating and setting system limits in dependency of the provided configuration.	2024-12-16 10:44:01 +01:00
Valentine Krasnobaeva	1332e9b58d	REORG: startup: move global.maxconn calculations in limits.c Let's encapsulate the code, which calculates global.maxconn and global.maxsslconn into a dedicated function set_global_maxconn() and let's move this function in limits.c. In limits.c we keep helpers to calculate and check haproxy internal limits, based on the system nofile and memory limits.	2024-12-16 10:44:01 +01:00
Frederic Lecaille	e1d25cdbdd	CLEANUP: quic: remove a wrong comment about ->app_limited (drs) ->app_limited quic_drs struct member is not a boolean. This is the index of the last transmitted packet marked as application-limited, or 0 if the connection is not currently application-limited (see C.app_limited definition in BBR v3 draft).	2024-12-13 14:42:43 +01:00
Frederic Lecaille	eeaeb412dc	MINOR: quic: reduce the private data size of QUIC cc algos After these commits: BUG/MINOR: quic: remove max_bw filter from delivery rate sampling BUG/MINOR: quic: fix BBB max bandwidth oscillation issue where some members were removed from bbr struct, the private data size of QUIC cc algorithms may be reduced from 160 to 144 uint32_t. Should be easily backported to 3.1 alongside the commits mentioned above.	2024-12-13 14:42:43 +01:00
Frederic Lecaille	22ab45a3a8	BUG/MINOR: quic: remove max_bw filter from delivery rate sampling This filter is no more needed after this commit: BUG/MINOR: quic: fix BBB max bandwidth oscillation issue. Indeed, one added this filter at delivery rate sampling level to filter the BBR max bandwidth estimations and was inspired from ngtcp2 code source when trying to fix the oscillation issue. But this BBR max bandwidth oscillation issue was fixed by the aforementioned commit. Furthermore this code tends to always increment the BBR max bandwidth. From my point of view, this is not a good idea at all. Must be backported to 3.1.	2024-12-13 14:42:43 +01:00
Frederic Lecaille	a9a2f98f86	MINOR: window_filter: rely on the time to update the filter samples (QUIC/BBR) The windowed filters are used only by the BBR implementation for QUIC to filter the maximum bandwidth samples for its estimation over a virtual time interval tracked by counting the cyclical progression through ProbeBW cycles. ngtcp2 and quiche use such windowed filters in their BBR implementation. But in a slightly different way. When updating the 2nd or 3rd filter samples, this is done based on their values in place of the time they have been sampled. It seems more logical to rely on the sample timestamps even if this has no implication because when a sample is updated using another sample because it has the same value, they have both the same timestamps! This patch modifies two statements which compare two consecutive filter samples based on their values (smp[]->v) by statements which compare them based on the virtual time they have been sampled (smp[]->t). This fully complies with the code used by the Linux kernel in lib/win_minmax.c. Also take the opportunity of this patch to shorten some statements using <smp> local variable value to update smp[2] sample in place of initializing its two members with the <smp> member values. This patch SHOULD be easily backported to 3.1 where BBR was first implemented.	2024-12-13 14:42:43 +01:00
Amaury Denoyelle	1f458b3ea8	MINOR: applet: define applet_putchk_stress() alternative Previous patch introduced stress mode to be able to easily test alternative code paths. The first point would be to force interruption of stats dump on every line and check reentrant paths, in particular while adding and removing servers instances. The purpose of this patch is to be able to use applet_putchk_stress() during stats dump while not impacting other applets. To support this, extract applet_putchk() into an internal _applet_putchk() which has a new argument stress. Define two helpers applet_putchk() and applet_putchk_stress(), the latter to set the stress argument to true. For the moment, applet_putchk_stress() is not used. This will be the subject of the next patch.	2024-12-12 11:26:33 +01:00
Amaury Denoyelle	9d19fc4cf7	MINOR: build: define DEBUG_STRESS Define a new build mode DEBUG_STRESS. This will be used to stress some code parts which cannot be reproduced easily with an alternative suboptimal code. First, a global <mode_stress> is set either to 1 or 0 depending on DEBUG_STRESS compilation. A new global keyword "stress-level" is also defined. It allows to specify a level from 0 to 9, to increase the stress incurred on the code. Helper macro STRESS_RUN* are defined for each stress level. This allows to easily specify an instruction in default execution and a stress counterpart if running on the corresponding stress level.	2024-12-12 11:19:10 +01:00
Aurelien DARRAGON	358166ae6a	BUG/MINOR: hlua_fcn: restore server pairs iterator pointer consistency Since `9c91b30` ("MINOR: server: remove prev_deleted server list"), hlua server pair iterator may use and return invalid (stale) server pointer if multiple servers were deleted between two iterations. Indeed, the server refcount mechanism (using srv_take()) is no longer sufficient as the prev_deleted mitigation was removed. To ensure server pointer consistency between two yields, the new watcher mechanism must be used (as is already the case for stats dumping). Thus in this patch we slightly change the server iteration logic: hlua_server_list_iterator_context struct now stores the next valid server pointer, and a watcher is added to ensure this pointer is never stale. Then in hlua_listable_servers_pairs_iterator(), this next pointer is used to create the Lua server object, and the next valid pointer is obtained by leveraging watcher_next(). No backport needed unless `9c91b30` ("MINOR: server: remove prev_deleted server list") is. Please note that dynamic servers were not supported in Lua prior to 2.8, so it doesn't make sense to backport this patch further than 2.8.	2024-12-11 10:52:11 +01:00
Amaury Denoyelle	9c91b30139	MINOR: server: remove prev_deleted server list This patch is a direct follow-up to the previous one. Thanks to watcher type, it is now safe to assume that servers manipulated via stats dump were not targeted by a "delete server" CLI command. As such, prev_deleted list server member is now unneeded. This patch thus removes any reference to it.	2024-12-10 16:19:33 +01:00
Amaury Denoyelle	071ae8ce3d	BUG/MEDIUM: stats/server: use watcher to track server during stats dump If a server A is deleted while a stats dump is currently on it, deletion is delayed thanks to reference counting. Server A is nonetheless removed from the proxy list. However, this list is a single linked list. If the next server B is deleted and freed immediately, server A would still point to it. This problem has been solved by the prev_deleted list in servers. This model seems correct, but it is difficult to ensure completely its validity. In particular, it implies when stats dump is resumed, server A elements will be accessed despite the server being in a half-deleted state. Thus, it has been decided to completely ditch the refcount mechanism for stats dump. Instead, use the watcher element to register every stats dump currently tracking a server instance. Each time a server is deleted on the CLI, every stats dump element which may point to it is updated to access the next server instance, or NULL if this is the last server. This ensures that a server which was deleted via CLI but not completely freed is never accessed on stats dump resumption. Currently, no race condition related to dynamic servers and stats dump is known. However, as described above, the previous model is deemed too fragile, as such this patch is labelled as bug-fix. It should be backported up to 2.6, after a reasonable period of observation. It relies on the following patch : MINOR: list: define a watcher type	2024-12-10 16:19:33 +01:00
Amaury Denoyelle	eafa8a32bb	MINOR: list: define a watcher type Define a new watcher type into list module. This type is similar to bref and can be used to register an element which is currently tracking a dynamic target. Contrary to bref, if the target is freed, every watcher element is updated to point to a next valid entry or NULL. This type will simplify handling of dynamic servers deletion, in particular while stats dumps are performed. This patch is not a bug-fix. However, it is mandatory to fix a race condition in dynamic servers. Thus, it should be backported along the next commit up to 2.6.	2024-12-10 16:04:11 +01:00
Valentine Krasnobaeva	1f63a53955	BUG/MINOR: mworker: detach from tty when received READY from worker Some master process' initialization steps are conditioned by receiving the READY message from worker (pidfile creation, forwarding READY message to the launching parent). So, master process can not do these initialization routines before. If the master process fails, while creating pid or forwarding the READY to the parent in daemon mode, he exits with a proper alert message. In daemon mode we no longer see such message, as process is already detached from the tty. To fix this, as these alerts could be very useful, let's detach the master process from the tty after his last initialization steps in _send_status.	2024-12-09 21:32:54 +01:00
Valentine Krasnobaeva	663d75e7a0	BUG/MEDIUM: startup: report status if daemonized process fails Due to master-worker rework, daemonization fork happens now before parsing and applying the configuration. This makes it impossible to report correctly all warnings and alerts to shell's stdout. If the daemonized process fails while already in the background, the exit code reported by the shell via '$?' equals 0, as it's the exit code of its parent. To fix this, let's create a pipe between parent and daemonized child. The child will send into this pipe a "READY" message, when it finishes its initialization. The parent will wait on the "read" end of the pipe until receiving something. If read() fails, parent obtains the status of the exited child with waitpid(). So, the parent can correctly report the error to the stdout and it can exit with child's exitcode. This fix should be backported only in 3.1.	2024-12-09 21:32:44 +01:00
William Lallemand	5454824e31	MINOR: ssl: add notBefore and notAfter utility functions Extracting notBefore and notAfter as a string can be bothersome, add 2 utility functions that returns the value in a static buffer.	2024-12-09 18:29:23 +01:00
Willy Tarreau	c3ee4e375b	MINOR: tools: make fddebug() automatically emit the location fddebug() is sometimes quite helpful, but annoying to use when following a call path because it's a pain to always repeat the function name and call place. Let's have it automatically prepend the function name, the file name and the line number, and make its arguments optional, replacing them by a simple LF when all absent. This way, simply placing: fddebug(); is sufficient to emit a location following "[%s@%s:%d]\n". This function must not be used in production (and even call places with it shouldn't be committed) and it should only be used by developers, so the simpler the better.	2024-12-09 18:05:09 +01:00
Willy Tarreau	d6dc8120c0	BUILD: debug: fix build issues in COUNT_IF() with -Wunused-value Commit `7f64bb79fd` ("BUG/MINOR: debug: COUNT_IF() should return true/false") allowed the COUNT_IF() macro to return the evaluated value. This is handy to place it in "if ()" conditions and count them at the same time. When glitches are disabled, the condition is just returned as-is, but most call places do not use the result, making some compilers complain. In addition, while reviewing this, it was noticed that when DEBUG_STRICT=0, the macro would still be replaced by a "do { } while (0)" statement, which not only does not evaluate the expression, but also cannot return anything. Ditto for COUNT_IF_HOT(). Let's make sure both are always properly evaluated now.	2024-12-09 18:04:51 +01:00
Willy Tarreau	7f64bb79fd	BUG/MINOR: debug: COUNT_IF() should return true/false The COUNT_IF() macro was initially meant to return true/false to be used in if() conditions but had an extra do { } while(0) that prevents it from doing so. Let's get rid of the do { } while(0) before the code generalizes to too many places. There's no impact on existing code, but may have to be backported if future fixes rely on it.	2024-12-06 18:45:46 +01:00
Valentine Krasnobaeva	cd0b58e23e	BUG/MINOR: startup: fix error path for master, if can't open pidfile If master process can't open a pidfile, there is no sense to send SIGTTIN to oldpids, as it will exit. So, old workers will terminate as well. It's better to send the last alert to the log about unrecoverable error, because master is already in its polling loop. For the standalone mode we should keep the previous logic in this case: send SIGTTIN to old process and unbind listeners for the new one. So, it's better to put this error path in main(), as it's done when other configuration settings can't be applied. This patch should be backported only in 3.1.	2024-12-06 12:00:22 +01:00
Aurelien DARRAGON	ae9d8d40d0	CLEANUP: stktable: add some stktable flags polishing Better late than never, commit `1f73d35` ("MINOR: stktable: implement "recv-only" table option") implemented stktable flags and initial definitions, but it lacks some comments plus the flag is stored as 16bits but the STK_FL_ definition width allows for only 8bits so it is a bit confusing, let's fix that	2024-12-05 13:14:21 +01:00
Aurelien DARRAGON	9f44c5f9be	CLEANUP: stktable: replace nopurge attribute with flag Thanks to previous commit stktable struct now have a "flags" struct member Let's take this opportunity to remove the isolated "nopurge" attribute in stktable struct and rely on a flag named STK_FL_NOPURGE instead. This helps to better organize stktable struct members.	2024-12-05 12:15:31 +01:00
Aurelien DARRAGON	1f73d3524d	MINOR: stktable: implement "recv-only" table option When "recv-only" keyword is added on a stick table declaration (in peers or proxy section), haproxy considers that the table is only used for data retrieval from a remote location and not used to perform local updates. As such, it enables the retrieval of local-only values such as conn_cur that are ignored by default. This can be useful in some contexts where we want to know about local-values such as conn_cur from a remote peer. To do this, add stktable struct flags which default to NONE and enable the RECV_ONLY flag on the table when "recv-only" keyword is found in the table declaration. Then, when in peer_treat_updatemsg(), when handling table updates, don't ignore data updates for local-only values if the flag is set.	2024-12-05 12:15:24 +01:00
Willy Tarreau	e6f4f15929	MINOR: tasklet: set TASK_WOKEN_OTHER on tasklets by default Now when tasklets are woken up via tasklet_wakeup(), tasklet_wakeup_on() or tasklet_wakeup_after(), either the optional wakeup flags will be used, or TASK_WOKEN_OTHER will be used. This allows tasklet handlers waking up for any given cause to notice whether or not they were also woken for another reason. For example, a mux handler could skip heavy parts when seeing that TASK_WOKEN_OTHER is absent, proving that no standard tasklet_wakeup() was done, for example in response to a subscribe(). The benefit of the TASK_WOKEN_* flags is that they're purged during the wakeup, and that they're easy to check for using TASK_WOKEN_ANY. TASK_F_UEVT1 and TASK_F_UEVT2 are also usable for private use (e.g. wakeup from a stream to a connection inside a mux). Probably that in the future, code dealing with subscribe events should start to place TASK_WOKEN_IO like is done for upper layers.	2024-12-03 19:45:08 +01:00
Willy Tarreau	6322c9fbbf	MINOR: tools: add a new macro DEFVAL() to provide a default argument This is like DEFZERO and DEFNULL, but this one allows to specify the default value to be used as the first argument.	2024-12-03 19:45:08 +01:00
Valentine Krasnobaeva	a33977da48	BUG/MINOR: startup: close pidfd and free global.pidfile in handle_pidfile() After master-worker mode refactoring, global.pidfile is only used in handle_pidfile(), which opens the provided file and writes the PID into it. So, it's more appropriate to perform the close(pidfd) and ha_free(&global.pidfile) also in this function. This commit prepares the fix of the pidfile creation, as it's created now very early, when we are not sure, that process has successfully started. In master-worker mode handle_pidfile() can be called in the master process context. So, let's make it accessible from other compilation units via global.h. This should be backported only in 3.1.	2024-12-02 17:28:04 +01:00
Aurelien DARRAGON	8bce7ff854	MINOR: hlua_fcn: add Patref:commit() method commit() method may be used to commit pending updates on the local patref object: hlua_patref flags were added: HLUA_PATREF_FL_GEN means the patref object has been updated and it is associated to a new revision (curr_gen) in order to prepare and commit the pending updates. upon commit, the pattern API is leveraged with curr_gen as revision to commit new object items. Once commit is performed, previous (pending) revisions that are older than the committed one are cleaned up (similar to what's done with commit on the cli). Also, Patref function APIs now take into account curr_gen to perform lookups.	2024-11-29 07:23:08 +01:00
Aurelien DARRAGON	e769d8f426	MINOR: pattern: add pat_ref_may_commit() helper function pat_ref_may_commit() may be used to know if a given generation ID is still valid, which means it may still be committed at some point. Else it means that another pending generation ID older than the tested one was already committed and thus other generations ID below this one are stale and must be regenerated.	2024-11-29 07:23:01 +01:00
Aurelien DARRAGON	43ab25f007	MINOR: hlua_fcn: wrap pat_ref struct for patref class In order to extend the patref class features, let's wrap the pat_ref struct into hlua_patref struct. This way we may add additional data alongside the pat_ref pointer to store additional context required for pat_ref data manipulation from lua. Since the wrapper (hlua_patref) is an allocated object, we declare the _gc metamethod for patref class in order to properly cleanup resources when they are out of scope.	2024-11-29 07:22:54 +01:00
Aurelien DARRAGON	2021072391	MINOR: hlua_fcn: implement index and pair metamethods for patref class patref object may now leverage index and pair metamethods to list and access patref elements at a specific index (=key). Also, patref:is_map() method may be used to know if the patref stores acl (key only) or map-style (key:value) patterns.	2024-11-29 07:22:46 +01:00
Aurelien DARRAGON	956a25cf60	MINOR: hlua: add patref class Implement patref class to expose pat_ref struct internal pattern struct in lua. This is some prerequisite work needed to be able to manipulate existing generic pattern object lists (acl/map) from Lua, because the Map class can only be used to perform matching ops on Map files.	2024-11-29 07:22:32 +01:00
Aurelien DARRAGON	f72a66eef2	MINOR: pattern: publish event_hdl events on pat_ref updates Now that PAT_REF events were defined in previous commit, let's actually publish them from pattern API where relevant. Unlike server events, pattern reference events are only published in the pat_ref subscriber's list on purpose, because in some setups patref updates (updates performed on a map for instance from action or cli) are very frequent, and we don't want to impact pattern API performance just for that. Moreover, as the main use case is to be able to subscribe to maps updates from Lua, allowing a per-pattern reference registration is already enough. No additional data is provided for such events (also for performance reason) Care was taken not to publish events when the update doesn't affect the live subset (the one targeted by curr_gen).	2024-11-29 07:22:25 +01:00
Aurelien DARRAGON	f7267bd315	MINOR: event_hdl: add PAT_REF events This is some prerequisite work for implementing PAT_REF events. In this commit we define the PAT_REF event_hdl family (which gets family slot id #2), with the following supported events: - EVENT_HDL_SUB_PAT_REF_ADD: element was added to the current version of the pattern ref - EVENT_HDL_SUB_PAT_REF_DEL: element was deleted from the current version of the pattern ref - EVENT_HDL_SUB_PAT_REF_SET: element was modified in the current version of the pattern ref - EVENT_HDL_SUB_PAT_REF_COMMIT: pending element(s) was/were committed in the current version of the pattern ref - EVENT_HDL_SUB_PAT_REF_CLEAR: all elements were cleared from the current version of the pattern ref The goal is to be able to track a pat_ref struct in order to be notified when it is updated. For performance reasons, events from this family won't provide any additional info, and will only be published in the pat_ref subscription list. Indeed, pat_ref may be updated at a relatively high frequency (or worse, batch work), so we cannot afford doing expensive treatment for each update.	2024-11-29 07:22:18 +01:00
Frederic Lecaille	f8b697c19b	BUG/MINOR: improve BBR throughput on very fast links This patch fixes the loss of information when computing the delivery rate (quic_cc_drs.c) on links with very low latency due to the use of 32-bit variables with millisecond precision. Initialize the quic_conn task with the TASK_F_WANTS_TIME flag to ask the scheduler to update the call date of this task. This allows this task to get a nanosecond resolution on the call date by calling task_mono_time(). This is enabled only for congestion control algorithms with delivery rate estimation support (BBR only at this time). Store the send date of each TX packet with nanosecond precision, obtained thanks to task_mono_time(), into the new ->time_sent_ns quic_tx_packet struct member. Make use of this new timestamp in the delivery rate estimation algorithm (quic_cc_drs.c). Rename current ->time_sent member from quic_tx_packet struct to ->time_sent_ms to distinguish the unit used by this variable (millisecond) and update the code which uses this variable. The logic found in quic_loss.c is not modified at all. Must be backported to 3.1.	2024-11-28 21:39:05 +01:00
Christopher Faulet	bc66d31985	MINOR: proxy: Add support of 421-Misdirected-Request in retry-on status The "421" status can now be specified on retry-on directives. PR_RE_* flags were updated to remain sorted. This patch should fix the issue #2794. It is quite simple so it may safely be backported to 3.1 if necessary.	2024-11-28 11:47:40 +01:00
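The precision loss described in the commit above can be illustrated with a minimal Python sketch. This is not haproxy code: the function names and the sample figures (12000 bytes delivered, 150 µs and 1.9 ms intervals) are assumptions chosen for illustration only. It compares a delivery rate computed from an interval truncated to whole milliseconds with one computed from nanoseconds.

```python
def rate_from_ms(delivered_bytes, interval_ns):
    """Delivery rate in bytes/s when the interval is truncated to whole ms."""
    interval_ms = interval_ns // 1_000_000  # millisecond precision only
    if interval_ms == 0:
        return None  # interval rounds down to zero: no usable sample
    return delivered_bytes * 1000 // interval_ms


def rate_from_ns(delivered_bytes, interval_ns):
    """Delivery rate in bytes/s using the full nanosecond interval."""
    return delivered_bytes * 1_000_000_000 // interval_ns


# Hypothetical sample on a very low latency link: 12000 bytes in 150 us.
print(rate_from_ms(12000, 150_000))    # None
print(rate_from_ns(12000, 150_000))    # 80000000

# Hypothetical sample with a 1.9 ms interval: truncation to 1 ms inflates the rate.
print(rate_from_ms(12000, 1_900_000))  # 12000000
print(rate_from_ns(12000, 1_900_000))  # 6315789
```

On sub-millisecond intervals the millisecond variant either has no sample at all or misestimates the rate, which is consistent with the motivation given for the nanosecond timestamp.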
Willy Tarreau	97d33abb23	MINOR: version: this is development again (3.2) This basically reverts commit `b629f366a7` ("MINOR: version: mention that 3.1 is stable now").	2024-11-26 17:21:16 +01:00
Aurelien DARRAGON	4792f27892	MINOR: pattern: add pat_ref_gen_delete() function pat_ref_gen_delete(ref, gen_id, key) tries to delete all samples belonging to <gen_id> and matching <key> under <ref> The goal is to be able to target a single subset from <ref>	2024-11-26 16:12:21 +01:00
Aurelien DARRAGON	a131c542a6	MINOR: pattern: add pat_ref_gen_find_elt() function pat_ref_gen_find_elt(ref, gen_id, key) tries to find <elt> element belonging to <gen_id> and matching <key> in <ref> reference. The goal is to be able to target a single subset from <ref>	2024-11-26 16:12:16 +01:00
Aurelien DARRAGON	c9d6af3c6d	MINOR: pattern: add pat_ref_gen_set() function pat_ref_gen_set(ref, gen_id, value, err) modifies to <value> the sample of all patterns matching <key> and belonging to <gen_id> (generation id) under <ref> The goal is to be able to target a single subset from <ref>	2024-11-26 16:12:11 +01:00
Aurelien DARRAGON	3d250b3be8	MINOR: pattern: split pat_ref_set() split pat_ref_set() function in 2 distinct functions. Indeed, since `0844bed7d3` ("MEDIUM: map/acl: Improve pat_ref_set() efficiency (for "set-map", "add-acl" action perfs)"), pat_ref_set() prototype was updated to include an extra <elt> argument. But the logic behind is not explicit because the function will not only try to set <elt>, but also its duplicate (unlike pat_ref_set_elt() which only tries to update <elt>). Thus, to make it clearer and better distinguish between the key-based lookup version and the elt-based one, restore pat_ref_set() previous prototype and add a dedicated pat_ref_set_elt_duplicate() that takes <elt> as argument and tries to update <elt> and all duplicates.	2024-11-26 16:12:05 +01:00
Willy Tarreau	4d58f521ee	[RELEASE] Released version 3.2-dev0 Released version 3.2-dev0 with the following main changes : - exact copy of 3.1.0	2024-11-26 15:33:57 +01:00
Christopher Faulet	b629f366a7	MINOR: version: mention that 3.1 is stable now This version will be maintained up to around Q1 2026. The INSTALL file also mentions it.	2024-11-26 15:23:54 +01:00
Amaury Denoyelle	2fffd85b97	BUG/MEDIUM: quic: prevent EMSGSIZE with GSO for larger bufsize A UDP datagram cannot be greater than 65535 bytes, as UDP length header field is encoded on 2 bytes. As such, sendmsg() will reject a bigger input with error EMSGSIZE. By default, this does not cause any issue as QUIC datagrams are limited to 1252 bytes and sent individually. However, with GSO support, values bigger than 1252 bytes are specified on sendmsg(). If using a bufsize equal to or greater than 65535, the syscall could reject the input buffer with EMSGSIZE. As this value is not expected, the connection is immediately closed by haproxy and the transfer is interrupted. This bug can easily be reproduced by requesting a large object on loopback interface and using a bufsize of 65535 bytes. In fact, the limit is slightly less than 65535, as extra room is also needed for IP + UDP headers. Fix this by reducing the count of datagrams encoded in a single GSO invocation via qc_prep_pkts(). Previously, it was set to 64 as specified by man 7 udp. However, with 1252-byte datagrams, this is still too many. Reduce it to a value of 52. Input to sendmsg will thus be restricted to at most 65104 bytes if last datagram is full. If there is still data available for encoding in qc_prep_pkts(), they will be written in a separate batch of datagrams. qc_send_ppkts() will then loop over the whole QUIC Tx buffer and call sendmsg() for each series of at most 52 datagrams. This does not need to be backported.	2024-11-26 11:49:30 +01:00
Valentine Krasnobaeva	3500865bc1	REORG: startup: move mworker_apply_master_worker_mode in mworker.c mworker_apply_master_worker_mode() is called only in master-worker mode, so let's move it to mworker.c	2024-11-25 15:20:24 +01:00
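The arithmetic behind the choice of 52 datagrams in the commit above can be checked with a short Python sketch. The constant and function names are illustrative, not haproxy identifiers, and the header sizes assume IPv4 without options (20 bytes) plus the 8-byte UDP header.

```python
UDP_MAX_LEN = 65535      # 16-bit length field
QUIC_DGRAM_SIZE = 1252   # datagram size quoted in the commit message
IPV4_HDR = 20            # assumption: IPv4 header without options
UDP_HDR = 8


def gso_batch_bytes(count, dgram_size=QUIC_DGRAM_SIZE):
    """Total payload handed to a single sendmsg() for a GSO batch."""
    return count * dgram_size


# Room left for payload once IP and UDP headers are accounted for.
max_payload = UDP_MAX_LEN - IPV4_HDR - UDP_HDR

print(gso_batch_bytes(64))             # 80128, exceeds the limit
print(gso_batch_bytes(52))             # 65104, fits
print(max_payload)                     # 65507
print(max_payload // QUIC_DGRAM_SIZE)  # 52
```

Under these assumptions, 52 is the largest count of full 1252-byte datagrams that still fits in one UDP send, while the previous value of 64 overshoots by more than 14 kB.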
Valentine Krasnobaeva	dee247c14e	REORG: startup: move mworker_reexec and mworker_reload in mworker.c Let's move mworker_reexec() and mworker_reload() in mworker.c. mworker_reload() is called only within the functions, which are already in mworker.c. So, this reorganization allows to declare mworker_reload() as a static.	2024-11-25 15:20:24 +01:00
Valentine Krasnobaeva	0c7b93eb1d	REORG: startup: move mworker_run_master and mworker_loop in mworker.c mworker_run_master() is called only in master mode. mworker_loop() is static and called only in mworker_run_master(). So let's move both of these functions to mworker.c. We also need here to make run_thread_poll_loop() accessible from other units, as it's used in mworker_loop().	2024-11-25 15:20:24 +01:00
Valentine Krasnobaeva	7974089ac6	REORG: startup: move mworker_prepare_master in mworker.c mworker_prepare_master() performs some preparation routines for the new worker process, which will be forked during the startup. It's called only in master-worker mode, so let's move it in mworker.c.	2024-11-25 15:20:24 +01:00
Valentine Krasnobaeva	af642420b4	REORG: startup: move on_new_child_failure in mworker.c mworker_on_new_child_failure() performs some routines for the worker process, if it has failed the reload. As it's called only in mworker_catch_sigchld() from mworker.c, let's move mworker_on_new_child_failure() in mworker.c as well. Like this it could also be declared as a static.	2024-11-25 15:20:24 +01:00
Valentine Krasnobaeva	321c021a83	MINOR: startup: rename on_new_child_failure to mworker_on_new_child_failure This patch prepares the moving of on_new_child_failure definition into mworker.c. So, let's rename it accordingly and let's also update its description.	2024-11-25 15:20:24 +01:00
Amaury Denoyelle	22bd92a87f	MINOR: mux-quic: use sched call time for pacing QUIC pacing was recently implemented to limit burst and improve overall bandwidth. This is used only for MUX STREAM emission. Pacing requires nanosecond resolution. As such, it used now_cpu_time() which relies on clock_gettime() syscall. The usage of clock_gettime() has several drawbacks : * it is a syscall and thus requires a context-switch which may hurt performance * it is not available on all systems * timestamp is retrieved multiple times during a single task execution, thus yielding different values which may tamper pacing calculation Improve this by using task_mono_time() instead. This returns task call time from the scheduler thread context. It requires the flag TASK_F_WANTS_TIME on QUIC MUX tasklet to force the scheduler to update call time with now_mono_time(). This solves all the limitations listed above : * syscall invocation is only performed once before tasklet execution, thus reducing context-switch impact * on non compatible system, a millisecond timer is used as a fallback which should ensure that pacing works decently for them * timer value is now guaranteed to be fixed during task execution	2024-11-25 11:21:45 +01:00
Willy Tarreau	670507a66e	MINOR: tools: add a new function "resolve_dso_name" to find a symbol's DSO In the memprofile summary per DSO, we currently have to pay a high price by calling dladdr() on each symbol when doing the summary per DSO at the end, while we're not interested in these details, we just want the DSO name which can be made cheaper to obtain, and easier to manipulate. So let's create resolve_dso_name() to only extract minimal information from an address. At the moment it still uses dladdr() though it avoids all the extra expensive work, and will further be able to leverage the same mechanism as "show libs" to instantly spot DSO from address ranges.	2024-11-21 19:58:06 +01:00
Willy Tarreau	5ddc8b3ad4	MINOR: activity/memprofile: monitor non-portable calls as well Some dependencies might very well rely on posix_memalign(), strndup() or other less portable calls, making us miss them when chasing memory leaks, resulting in negative global allocation counters. Let's provide the handlers for the following functions: strndup() // _POSIX_C_SOURCE >= 200809L || glibc >= 2.10 valloc() // _BSD_SOURCE || _XOPEN_SOURCE>=500 || glibc >= 2.12 aligned_alloc() // _ISOC11_SOURCE posix_memalign() // _POSIX_C_SOURCE >= 200112L memalign() // obsolete pvalloc() // obsolete This time we don't fail if they're not found, we just silently forward the calls.	2024-11-21 19:58:06 +01:00
Willy Tarreau	33c0ce299d	MINOR: activity/memprofile: also monitor strdup() activity Some memory profiling outputs have shown negative counters, very likely due to some libs calling strdup(). Let's add it to the list of monitored activities. Actually even haproxy itself uses some. Having "profiling.memory on" in the config reveals 35 call places.	2024-11-21 19:58:06 +01:00
Willy Tarreau	623a2c4e19	CLEANUP: activity: better use a mask to tests freeing methods In "show profiling memory", we need to distinguish methods which really free memory from those which do not so that we don't account for the free value twice. However for now it's done using multiple tests, which are going to complicate the addition of new methods. Let's switch to a bit field defined as a mask in a single place instead, as we don't intend to use more than 32/64 methods!	2024-11-21 19:58:06 +01:00
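The mask approach described in the commit above can be sketched in a few lines of Python. The method names and their numbering below are hypothetical and are not taken from the haproxy sources; the point is only that a single mask, defined in one place, replaces a chain of per-method tests.

```python
from enum import IntEnum


class Method(IntEnum):
    """Hypothetical allocation/free methods tracked by a memory profiler."""
    UNKNOWN = 0
    MALLOC = 1
    CALLOC = 2
    REALLOC = 3
    FREE = 4
    POOL_ALLOC = 5
    POOL_FREE = 6


# One bit per method that really frees memory, defined in a single place.
FREE_MASK = (1 << Method.FREE) | (1 << Method.POOL_FREE)


def is_freeing(method):
    """Return True if the method belongs to the set of freeing methods."""
    return bool((1 << method) & FREE_MASK)
```

Adding a new freeing method then only requires a new enum entry and one extra bit in `FREE_MASK`; callers of `is_freeing()` remain unchanged.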
Willy Tarreau	859341c1ec	MINOR: activity/memprofile: offer a function to unregister stale info There's actually a problem with memprofiles: the pool pointer is stored in ->info but some pools are replaced during startup, such as the trash pool, leaving a dangling pointer there. Let's complete the API with a new function memprof_remove_stale_info() that will remove all stale references to this info pointer. It's also present when USE_MEMORY_PROFILING is not set so as to ease the job on callers.	2024-11-21 19:58:06 +01:00
Valentine Krasnobaeva	bfe0f9d02d	MINOR: startup: use global progname variable Let's store progname in the global variable, as it is handy to use it in different parts of code to format messages sent to stdout. This reduces the number of arguments, which we should pass to some functions.	2024-11-21 19:55:21 +01:00
Valentine Krasnobaeva	ef154a49e1	MINOR: capabilities: rename program_name argument to progname This commit prepares the usage of the global progname variable. prepare_caps_from_permitted_set() use progname value in warning messages. So, let's rename program_name argument to progname.	2024-11-21 19:55:21 +01:00
Frederic Lecaille	01fcbd6c08	BUG/MINOR: quic: Missing application limitations tracking for BBR The aim of the ->app_limited member of the delivery rate struct (quic_cc_drs) is to store the index of the last transmitted byte marked as application-limited in order to track the application-limited phases. During these phases, BBR must ignore delivery rate samples to properly estimate the delivery rate. Without such a patch, the Startup phase could be exited very quickly with a very low estimated bottleneck bandwidth. This had a very bad impact on little objects with download times smaller than the expected Startup phase duration. For such objects, with enough bandwidth, BBR should stay in the Startup state. No need to be backported, as BBR is implemented in the current development version.	2024-11-21 19:23:53 +01:00
Amaury Denoyelle	95d3edd68f	MINOR: quic: support pacing for newreno and nocc Extend extra pacing support for newreno and nocc congestion algorithms, as with cubic. For better extensibility of cc algo definition, define a new flags field in quic_cc_algo structure. For now, the only value is QUIC_CC_ALGO_FL_OPT_PACING which is set if pacing support can be optionally activated. Cubic, newreno and nocc now all support this. This new flag is then reused by QUIC config parser. If set, extra quic-cc-algo burst parameter is taken into account. If positive, this will activate pacing support on top of the congestion algorithm. As with cubic previously, pacing is only supported if running under experimental mode. Only BBR is not flagged with this new value as pacing is directly builtin in the algorithm and cannot be turned off. Furthermore, BBR calculates automatically its value for maximum burst. As such, any quic-cc-algo burst argument used with BBR is still ignored with a warning.	2024-11-21 11:33:44 +01:00
Christopher Faulet	e58a30d369	MINOR: cfgparse: Emit a warning for misplaced "tcp-response content" rules When a "tcp-response content" rule is placed after a "http-response" rule, a warning is now emitted, just like for rules applied on the requests.	2024-11-21 09:55:04 +01:00
Christopher Faulet	5dcd3b0d99	CLEANUP: cfgparse: Add direction in functions name that warn on misplaced rules This only concerns functions emitting warnings about misplaced tcp-request rules. The direction is now specified in the functions name. For instance "warnif_misplaced_tcp_conn" is replaced by "warnif_misplaced_tcp_req_conn".	2024-11-21 09:51:37 +01:00
Christopher Faulet	7710580428	MINOR: config: Improve warnings on misplaced rules by adding an optional arg In warnings about misplaced rules, only the first keyword is mentioned. It works well for http-request or quic-initial rules for instance. But it is a bit confusing for tcp-request rules, because the layer is missing (session or content). To make it a bit systematic (and generic), the second argument can now be provided. It can be set to NULL if there is no layer or scope. But otherwise, it may be specified and it will be reported in the warning. So the following snippet: tcp-request content reject if FALSE tcp-request session reject if FALSE tcp-request connection reject if FALSE Will now emit the following warnings: a 'tcp-request session' rule placed after a 'tcp-request content' rule will still be processed before. a 'tcp-request connection' rule placed after a 'tcp-request session' rule will still be processed before. This patch should fix the issue #2596.	2024-11-21 09:28:42 +01:00
Frederic Lecaille	d85eb127e9	MINOR: quic: quic_loss modifications to support BBR The aim of qc_packet_loss_lookup() is to detect packet losses. It is this function which must call the ->on_pkt_lost() BBR specific callback. It also sets the <bytes_lost> passed parameter to the total number of bytes detected as lost upon an ACK frame receipt for its caller. Modify qc_release_lost_pkts() to call ->congestion_event() with the send time from the newest packet detected as lost. Modify qc_release_lost_pkts() to call ->slow_start() callback only if defined by the congestion control algorithm. This is not the case for BBR.	2024-11-20 17:34:22 +01:00
Frederic Lecaille	af75665cb7	MINOR: quic: quic_cc modifications to support BBR Add several callbacks to quic_cc_algo struct which are only called by BBR. ->get_drs() may be used to retrieve the delivery rate sampling information from a congestion algorithm struct (quic_cc). ->on_transmit() must be called by a QUIC sender before sending any packet. ->on_ack_rcvd() must be called after having received an ACK. ->on_pkt_lost() must be called after having detected a packet loss. ->congestion_event() must be called after any congestion event detection. Modify quic_cc.c to call ->event only if defined. This is not the case for BBR.	2024-11-20 17:34:22 +01:00
Frederic Lecaille	d04adf44dc	MINOR: quic: implement BBR congestion control algorithm for QUIC Implement the version 3 of BBR for QUIC specified by the IETF in this draft: https://datatracker.ietf.org/doc/draft-ietf-ccwg-bbr/ Here is an extract from the Abstract part to sum up the capabilities of BBR: BBR ("Bottleneck Bandwidth and Round-trip propagation time") uses recent measurements of a transport connection's delivery rate, round-trip time, and packet loss rate to build an explicit model of the network path. BBR then uses this model to control both how fast it sends data and the maximum volume of data it allows in flight in the network at any time. Relative to loss-based congestion control algorithms such as Reno [RFC5681] or CUBIC [RFC9438], BBR offers substantially higher throughput for bottlenecks with shallow buffers or random losses, and substantially lower queueing delays for bottlenecks with deep buffers (avoiding "bufferbloat"). BBR can be implemented in any transport protocol that supports packet-delivery acknowledgment. Thus far, open source implementations are available for TCP [RFC9293] and QUIC [RFC9000]. In haproxy, this implementation is considered as still experimental. It depends on the newly implemented pacing feature. BBR was asked in GH #2516 by @KazuyaKanemura, @osevan and @kennyZ96.	2024-11-20 17:34:22 +01:00
Frederic Lecaille	472d575950	MINOR: quic: implement delivery rate sampling algorithm This patch implements an algorithm which may be used by congestion algorithms for QUIC to estimate the current delivery rate of a sender. It is at least used by BBR and could be used by other congestion algorithms such as cubic. This algorithm was specified by an RFC draft here: https://datatracker.ietf.org/doc/html/draft-cheng-iccrg-delivery-rate-estimation before being merged into BBR v3 here: https://datatracker.ietf.org/doc/html/draft-cardwell-ccwg-bbr#section-4.5.2.2	2024-11-20 17:34:22 +01:00
Frederic Lecaille	c08b877657	MINOR: window_filter: Implement windowed filter (only max) Implement the Kathleen Nichols' algorithm used by several congestion control algorithm implementations (TCP/BBR in Linux kernel, QUIC/BBR in quiche) to track the maximum value of a data type during a fixed time interval. In this implementation, counters which are periodically reset are used in place of timestamps. Only the max part has been implemented. (see lib/minmax.c implementation for Linux kernel).	2024-11-20 17:34:22 +01:00
Frederic Lecaille	7bbe8828ba	MINOR: quic: Add the congestion window initial value to QUIC path Add ->initial_wnd new member to quic_cc_path struct to keep the initial value of the congestion window. This member is initialized as soon as a QUIC connection is allocated. This modification is required for BBR congestion control algorithm.	2024-11-20 17:34:22 +01:00
Willy Tarreau	1171a23aec	BUILD: makefile: make ERR apply to build options as well Once in a while we find some makefiles ignoring some outdated arguments and just emit a warning. What's annoying is that if users (say, distro packagers), have purposely added ERR=1 to their build scripts to make sure to fail on any warning, these ones will be ignored and the build can continue with invalid or missing options. William rightfully suggested that ERR=1 should also catch make's warnings so this patch implements this, by creating a new "complain" variable that points either to "error" or "warning" depending on $(ERR), and that is used to send the messages using $(call $(complain),...). This does the job right at little effort (tested from GNU make 3.82 to 4.3). Note that for this purpose the ERR declaration was upped in the makefile so that it appears before the new errors.mk file is included.	2024-11-20 14:58:35 +01:00
Willy Tarreau	12fcd65468	MINOR: tasklet: support an optional set of wakeup flags to tasklet_wakeup_on() tasklet_wakeup_on() and its derivatives (tasklet_wakeup_after() and tasklet_wakeup()) do not support passing a wakeup cause like task_wakeup(). This is essentially due to an API limitation caused by the fact that for a very long time the only reason for waking up was to process pending I/O. But with the growing complexity of mux tasks, it is becoming important to be able to skip certain heavy processing when not strictly needed. One possibility is to permit the caller of tasklet_wakeup() to pass flags like task_wakeup(). Instead of going with a complex naming scheme, let's simply make the flags optional and be zero when not specified. This means that tasklet_wakeup_on() now takes either 2 or 3 args, and that the third one is the optional flags to be passed to the callee. Eligible flags are essentially the non-persistent ones (TASK_F_UEVT* and TASK_WOKEN_*) which are cleared when the tasklet is executed. This way the handler will find them in its <state> argument and will be able to distinguish various causes for the call.	2024-11-19 20:13:41 +01:00
Willy Tarreau	0334cb28a9	MINOR: tasklet: make the low-level tasklet API take a flag Everything in the tasklet layer supports flags, except that they are just not implemented in the wakeup functions, while they are in the task_wakeup functions. Initially it was not considered useful to pass wakeup causes because these were essentially I/O, but with the growing number of I/O handlers having to deal with various types of operations (typically cheap I/O notifications on subscribe vs heavy parsing on application-level wakeups), it would be nice to start to make this distinction possible. This commit extends _tasklet_wakeup_on() and _tasklet_wakeup_after() to pass a set of flags that continues to be set as zero. For now this changes nothing, but new functions will come.	2024-11-19 20:13:41 +01:00
Willy Tarreau	e57581d76d	MINOR: tools: add new macro DEFZERO to provide a default zero argument This is the equivalent of DEFNULL except that it sets a zero value instead of a NULL for a missing argument.	2024-11-19 20:13:41 +01:00
Willy Tarreau	c5052bad8a	MINOR: sched: add TASK_F_WANTS_TIME to make the scheduler update the call date Currently tasks being profiled have th_ctx->sched_call_date set to the current nanosecond in monotonic time. But there's no other way to have this, despite the scheduler being capable of it. Let's just declare a new task flag, TASK_F_WANTS_TIME, that makes the scheduler take the time just before calling the handler. This way, a task that needs nanosecond resolution on the call date will be able to be called with an up-to-date date without having to abuse now_mono_time() if not needed. In addition, if CLOCK_MONOTONIC is not supported (now_mono_time() always returns 0), the date is set to the most recently known now_ns, which is guaranteed to be atomic and is only updated once per poll loop. This date can be more conveniently retrieved using task_mono_time(). This can be useful, e.g. for pacing. The code was slightly adjusted so as to merge the common parts between the profiling case and this one.	2024-11-19 20:13:41 +01:00
Willy Tarreau	12969c1b17	MINOR: tinfo/clock: turn sched_call_date to 64-bits We used to store it in 32-bits since we'd only use it for latency and CPU usage calculation but usages will evolve so let's not truncate the value anymore. Now we store the full 64 bits. Note that this doesn't even increase the storage size due to alignment. The 3 usage places were verified to still be valid (most were already cast to 32 bits anyway).	2024-11-19 20:13:41 +01:00
Willy Tarreau	973c81ceec	CLEANUP: tinfo: move sched__date/_mono_time to the thread-local area These ones are never atomically accessed, they have nothing to do in the atomic ops cache line, let's move them to the thread-local area.	2024-11-19 20:13:41 +01:00
Amaury Denoyelle	5a29fd6c61	MINOR: mux_quic/pacing: display pacing info on show quic To improve debugging, extend "show quic" output to report if pacing is activated on a connection. Two values will be displayed for pacing : * a new counter paced_sent_ctr is defined in QCC structure. It will be incremented each time an emission is interrupted due to pacing. * pacing engine now saves the number of datagrams sent in the last paced emission. This will be helpful to ensure burst parameter is valid.	2024-11-19 16:21:05 +01:00
Amaury Denoyelle	24cea66e07	MEDIUM: quic: define cubic-pacing congestion algorithm Define a new QUIC congestion algorithm token 'cubic-pacing' for quic-cc-algo bind keyword. This is identical to default cubic implementation, except that pacing is used for STREAM frames emission. This algorithm supports an extra argument to specify a burst size. This is stored into a new bind_conf member named quic_pacing_burst which can be reused to initialize quic path. Pacing support is still considered experimental. As such, 'cubic-pacing' can only be used with expose-experimental-directives set.	2024-11-19 16:20:58 +01:00
Amaury Denoyelle	796446a15e	MAJOR: mux-quic: support pacing emission Support pacing emission for STREAM frames at the QUIC MUX layer. This is implemented by adding a quic_pacer engine into QCC structure. The main changes have been written into qcc_io_send(). It now differentiates cases when some frames have been rejected by transport layer. This can occur as previously due to congestion or FD buffer full, which requires subscribing on transport layer. The new case is when emission has been interrupted due to pacing timing. In this case, QUIC MUX I/O tasklet is rescheduled to run with the flag TASK_F_USR1. On tasklet execution, if TASK_F_USR1 is set, all standard processing for emission and reception is skipped. Instead, a new function qcc_purge_sending() is called. Its purpose is to retry emission with the saved STREAM frames list. Either all remaining frames can now be sent, subscribe is done on transport error or tasklet must be rescheduled for pacing purging. In the meantime, if tasklet is rescheduled due to other conditions, TASK_F_USR1 is reset. This will trigger a full regeneration of STREAM frames. In this case, pacing expiration must be checked before calling qcc_send_frames() to ensure emission is now allowed.	2024-11-19 16:16:48 +01:00
Amaury Denoyelle	ede4cd4c2e	MINOR: mux-quic: encapsulate QCC tasklet wakeup QUIC MUX will be responsible to drive emission with pacing. This will be implemented via setting TASK_F_USR1 before I/O tasklet wakeup. To prepare this, encapsulate each I/O tasklet wakeup into a new function qcc_wakeup(). This commit is purely refactoring prior to pacing implementation into QUIC MUX.	2024-11-19 16:16:48 +01:00
Amaury Denoyelle	4a94a018f0	MINOR: mux-quic: define a tx STREAM frame list member For STREAM emission, MUX QUIC previously used a local list defined under qcc_io_send(). This was suitable as either all frames were sent, or emission must be interrupted due to transport congestion or fatal error. In the latter case, the list was emptied anyway and a new frame list was built on future qcc_io_send() invocation. For pacing, MUX QUIC may have to save the frame list if pacing should be applied across emission. This is necessary to avoid unnecessarily rebuilding the stream frame list between each paced emission. To support this, STREAM list is now stored as a member of QCC structure. Ensure frame list is always deleted, even on QCC release, using newly defined utility function qcc_tx_frms_free().	2024-11-19 16:16:48 +01:00
Amaury Denoyelle	886a7c475c	MINOR: quic/pacing: add burst support qc_send_mux() has been extended previously to support pacing emission. This will ensure that no more than one datagram will be emitted during each invocation. However, to achieve better performance, it may be necessary to emit a batch of several datagrams in one turn. A so-called burst value can be specified by the user in the configuration. However, some congestion control algos may define their own dynamic value. As such, a new CC callback pacing_burst is defined. quic_cc_default_pacing_burst() can be used for algo without pacing interaction, such as cubic. It will return a static value based on user selected configuration.	2024-11-19 16:16:48 +01:00
Amaury Denoyelle	8039fe43e6	MINOR: quic/pacing: support pacing emission on quic_conn layer Pacing will be implemented for STREAM frames emission. As such, qc_send_mux() API has been extended to add an argument to a quic_pacer engine. If non NULL, engine will be used to pace emission. In short, no more than one datagram will be emitted for each qc_send_mux() invocation. Pacer is then notified about the emission and a timer for a future emission is calculated. qc_send_mux() will return PACING error value, to inform QUIC MUX layer that it will be responsible to retry emission after some delay.	2024-11-19 16:16:48 +01:00
Amaury Denoyelle	ab82fab442	MINOR: quic/pacing: implement quic_pacer engine Extend quic_pacer engine to support pacing emission. Several functions are defined. * quic_pacing_sent_done() to notify engine about an emission of one or several datagrams * quic_pacing_expired() to check if emission should be delayed or can be conducted immediately	2024-11-19 16:16:48 +01:00
Amaury Denoyelle	3e11492c99	MINOR: quic: define quic_pacing module Add a new module quic_pacing. A new structure quic_pacer is defined. This will be used as a pacing engine to implement smooth emission of QUIC data.	2024-11-19 16:16:48 +01:00
Amaury Denoyelle	7fd48a5723	MINOR: quic: extend qc_send_mux() return type with a dedicated enum This commit is part of an adjustment of the QUIC transport send API to support pacing. Here, the qc_send_mux() return type has been changed to use a new enum quic_tx_err. This is useful to explain the different failure causes of emission. For now, only two values have been defined: NONE and FATAL. When pacing is implemented, a new value will be added to specify that emission was interrupted on pacing. This won't be a fatal error, as emission can be retried, but not immediately.	2024-11-19 16:16:48 +01:00
Amaury Denoyelle	5cb8f8a622	MINOR: quic: support a max number of built packet per send iteration Extend the QUIC transport emission function to support a maximum datagram argument. The purpose is to ensure that qc_send() won't emit more than the specified value, unless it is 0 which is considered as unlimited. In qc_prep_pkts(), a counter of built datagrams has been added to support this. The packet building loop is interrupted if it reaches the specified maximum value. Also, its return value has been changed to the number of prepared datagrams. This is reused by qc_send() to interrupt its work if the specified max datagram argument value is reached over one or several iterations of prepared/sent datagrams. This change is necessary to support pacing emission. Note that ideally, the total length in bytes of emitted datagrams should be taken into account instead of the raw number of datagrams. However, for a first implementation, it was deemed easier to implement it with the latter.	2024-11-19 16:16:48 +01:00
Amaury Denoyelle	4069873403	MINOR: mux-quic: add missing values for show flags Add QCC QC_CF_WAIT_FOR_HS and QCS QC_SF_TXBUB_OOB flags to their respective show_flags to be able to decipher them via dev flags utility. These values have been added in the current dev version, thus no need to backport this patch.	2024-11-19 16:16:48 +01:00
Christopher Faulet	bc967758a2	MINOR: mux-h1: Return 414 or 431 when appropriate When the request is too large to fit in a buffer, a 414 or a 431 error message is returned depending on the error state of the request parser. A 414 is returned if the URI is too long, otherwise a 431 is returned. This patch should fix the issue #1309.	2024-11-19 15:29:40 +01:00
Christopher Faulet	62dc8750a9	MINOR: http: Add support for HTTP 414/431 status codes 414-Uri-Too-Long and 431-Request-Header-Fields-Too-Large are now part of the supported status codes that can be defined as error files. The hash table defined in http_get_status_idx() was updated accordingly.	2024-11-19 15:29:40 +01:00
Christopher Faulet	1be7140ade	MINOR: http-ana: Add support for "set-cookie-fmt" option to redirect rules It is now possible to use a log-format string to define the "Set-Cookie" header value of a response generated by a redirect rule. No special check is performed on the resulting format, and none is possible during configuration parsing. It is probably not a big deal because the already existing "set-cookie" and "clear-cookie" options don't perform any check either. Here is an example: http-request redirect location https://someurl.com/ set-cookie-fmt haproxy="%[var(txn.var)]" This patch should fix the issue #1784.	2024-11-19 15:20:02 +01:00
Christopher Faulet	b2877db47c	MINOR: http-ana: Add option to keep query-string on a location-based redirect On a prefix-based redirect, there is an option to drop the query-string of the location. Here it is the opposite: an option is added to preserve the query-string of the original URI for a location-based redirect. By setting the "keep-query" option, for a location-based redirect only, the query-string of the original URI is appended to the location. If there is no query-string, nothing is added (no empty '?'). If there is already a non-empty query-string on the location, the original one is appended with a '&' separator. This patch should fix issue #2728.	2024-11-19 15:20:02 +01:00
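The "keep-query" rules described in b2877db47c can be sketched in C as follows. This is only an illustration of the three cases listed in the message, not HAProxy's implementation: the function name build_location_keep_query() is hypothetical, and treating an empty query-string ("?" with nothing after it) as absent is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not HAProxy code): writes into <out> the redirect
 * target for a location-based redirect with "keep-query" semantics.
 * Returns the output length, or -1 if <out> is too small.
 */
static int build_location_keep_query(const char *location, const char *orig_uri,
                                     char *out, size_t outlen)
{
	const char *orig_qs = strchr(orig_uri, '?');
	const char *loc_qs = strchr(location, '?');
	const char *sep;
	int ret;

	if (!orig_qs || !orig_qs[1]) {
		/* no (or empty, by assumption) query-string: nothing is added */
		ret = snprintf(out, outlen, "%s", location);
	}
	else {
		/* '&' if the location already carries a non-empty query-string,
		 * nothing if it ends with a bare '?', and '?' otherwise
		 */
		sep = loc_qs ? (loc_qs[1] ? "&" : "") : "?";
		ret = snprintf(out, outlen, "%s%s%s", location, sep, orig_qs + 1);
	}
	return (ret < 0 || (size_t)ret >= outlen) ? -1 : ret;
}
```

With a location of "https://a.example/?y=2" and an original URI of "/p?x=1", the sketch produces "https://a.example/?y=2&x=1".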
Willy Tarreau	82f190f882	MINOR: tools: make parse_size_err() support 32/64 bits parse_size_err() currently is a function working only on an uint. It's not convenient for certain elements such as rings on large machines. This commit addresses this by having one function for uints and one for ullong, and making parse_size_err() a macro that automatically calls one or the other. It also has the benefit of automatically supporting compatible types (long, size_t etc).	2024-11-19 10:50:42 +01:00
Willy Tarreau	9c6ccb8dbb	MEDIUM: config: warn on unitless timeouts < 100 ms From time to time we face a configuration with very small timeouts which look accidental because there could be expectations that they're expressed in seconds and not milliseconds. This commit adds a check for non-null unitless values smaller than 100 and emits a warning suggesting to append an explicit unit if that was the intent. Only the common timeouts, the server check intervals and the resolvers hold and timeout values are covered for now. All the code needs to be manually reviewed to verify if it supports emitting warnings. This may break some configs using "zero-warning", but greps in existing configs indicate that these are extremely rare and solely intentionally done during tests. At least even if a user leaves that after a test, it will be more obvious when reading 10ms that something's probably not correct.	2024-11-19 10:33:20 +01:00
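The condition checked by 9c6ccb8dbb (a non-zero value, no unit, smaller than 100) can be sketched as below. The function name timeout_looks_suspicious() is hypothetical, and the sketch says nothing about how HAProxy's parser actually reports the warning.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical check (not HAProxy code): returns 1 if <arg> is a purely
 * numeric, non-zero value below 100, i.e. a timeout that was possibly
 * meant in seconds but will be read as milliseconds. Returns 0 otherwise.
 */
static int timeout_looks_suspicious(const char *arg)
{
	char *end;
	unsigned long v;

	if (!isdigit((unsigned char)*arg))
		return 0;

	v = strtoul(arg, &end, 10);
	if (*end != '\0')
		return 0; /* a unit (or garbage) follows: not unitless */

	return v > 0 && v < 100;
}
```

Here "10" is flagged, whereas "10ms", "5s", "0" and "100" are not.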
Willy Tarreau	e72b525832	MINOR: cfgparse: parse tune.bufsize.small as a size Till now this value was parsed as raw integer using atol() and would silently ignore any trailing suffix, causing unexpected behaviors when set, e.g. to "4k". Let's make use of parse_size_err() on it so that units are supported. This requires to turn it to uint as well, which was verified to be OK.	2024-11-18 19:07:05 +01:00
Willy Tarreau	a344d37fad	MINOR: cfgparse: parse tune.bufsize as a size Till now this value was parsed as raw integer using atol() and would silently ignore any trailing suffix, preventing from starting when set e.g. to "64k". Let's make use of parse_size_err() on it so that units are supported. This requires to turn it to uint as well, and to explicitly limit its range to INT_MAX - 2*sizeof(void*), which was previously partially handled as part of the sign check.	2024-11-18 19:06:25 +01:00
Willy Tarreau	2f0c6ff3a5	MINOR: cfgparse: parse tune.recv_enough as a size Till now this value was parsed as raw integer using atol() and would silently ignore any trailing suffix, causing unexpected behaviors when set, e.g. to "512k". Let's make use of parse_size_err() on it so that units are supported. This requires to turn it to uint as well, and since it's sometimes compared to an int, we limit its range to 0..INT_MAX.	2024-11-18 19:01:28 +01:00
Willy Tarreau	a90a7d4d60	MINOR: cfgparse: parse tune.pipesize as a size Till now this value was parsed as raw integer using atol() and would silently ignore any trailing suffix, causing unexpected behaviors when set, e.g. to "512k". Let's make use of parse_size_err() on it so that units are supported. This requires to turn it to uint as well, which was verified to be OK.	2024-11-18 18:51:31 +01:00
Willy Tarreau	f9f28b7584	MINOR: cfgparse: parse tune.{rcvbuf,sndbuf}.{frontend,backend} as sizes Till now these values were parsed as raw integer using atol() and would silently ignore any trailing suffix, causing unexpected behaviors when set, e.g. to "512k". Let's make use of parse_size_err() on them so that units are supported. This requires to turn them to uint as well, which is OK.	2024-11-18 18:50:02 +01:00
Willy Tarreau	a923c72357	MINOR: cfgparse: parse tune.{rcvbuf,sndbuf}.{client,server} as sizes Till now these values were parsed as raw integer using atol() and would silently ignore any trailing suffix, causing unexpected behaviors when set, e.g. to "512k". Let's make use of parse_size_err() on them so that units are supported. This requires to turn them to uint as well, which is OK.	2024-11-18 18:49:01 +01:00
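The problem addressed by the tune.* size-parsing commits above is that atol("4k") silently returns 4. A minimal sketch of a suffix-aware parser follows. It is not parse_size_err(): the name parse_size_sketch(), the accepted suffixes (k, m, g as powers of 1024) and the return convention are assumptions made for illustration.

```c
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical parser (not parse_size_err()): accepts a decimal number
 * optionally followed by one of k/m/g (powers of 1024) and stores the
 * result in <out>. Returns 0 on success, -1 on syntax error or overflow.
 */
static int parse_size_sketch(const char *text, unsigned int *out)
{
	char *end;
	unsigned long long v;
	unsigned int shift = 0;

	if (!isdigit((unsigned char)*text))
		return -1;

	v = strtoull(text, &end, 10);
	switch (*end) {
	case '\0':                               break;
	case 'k': case 'K': shift = 10; end++;   break;
	case 'm': case 'M': shift = 20; end++;   break;
	case 'g': case 'G': shift = 30; end++;   break;
	default:
		return -1; /* unknown suffix: reject instead of ignoring it */
	}
	if (*end != '\0')
		return -1; /* trailing characters after the suffix */
	if (v > (UINT_MAX >> shift))
		return -1; /* result would not fit in an unsigned int */

	*out = (unsigned int)(v << shift);
	return 0;
}
```

With this sketch, "4k" yields 4096 and "512k" yields 524288, while an unknown suffix such as "4x" is rejected rather than silently truncated to 4.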
Willy Tarreau	00fcda1ff2	MINOR: acl: export find_acl_default() It will be needed in a future patch, so let's export it (it was static).	2024-11-18 15:15:54 +01:00
Willy Tarreau	4fd6d15344	MINOR: mux-quic/h3: count glitches when they're reported The qcc_report_glitch() function is now replaced with a macro to support enumerating counters for each individual glitch line. For now this adds 36 such counters. The macro supports an optional description, though that is not being used for now. As a reminder, this requires to build with -DDEBUG_GLITCHES=1.	2024-11-14 20:43:33 +01:00
Aurelien DARRAGON	42710b7320	MEDIUM: uri_auth: implement clean uri_auth cleaning The proxy auth_uri struct was manually cleaned up during deinit, but the logic behind it was kind of awkward because it was required to find out which ones were shared or not. Instead, let's switch to a proper refcount mechanism and free the auth_uri struct directly in proxy_free_common().	2024-11-14 15:03:38 +01:00
Aurelien DARRAGON	e1ec37ea51	MINOR: uri_auth: add stats_uri_auth_free helper Let's now leverage stats_uri_auth_free() helper to free uri_auth struct instead of manually performing the cleanup, which is error-prone.	2024-11-14 15:03:33 +01:00
Willy Tarreau	502790ed7e	MINOR: debug: add a new counter type for glitches COUNT_GLITCH() will implement an unconditional counter on its declaration line when DEBUG_GLITCHES is set, and do nothing otherwise. The output will be reported as "GLT" and can be filtered as "glt" on the CLI. The purpose is to help figure what's happening if some glitches counters start going through the roof. The macro supports an optional string argument to describe the cause of the glitch (e.g. "truncated header"), which is then reported in the dump. For now this is conditioned by DEBUG_GLITCHES but if it turns out to be light enough, maybe we'll keep it enabled full time. In this case it might have to be moved away from debug dev, or at least documented (or done as debug counters maybe so that dev can remain undocumented and updatable within a branch?).	2024-11-14 08:49:38 +01:00
Willy Tarreau	e119095290	MINOR: debug: explicitly permit the counter condition to be empty In order to count new event types, we'll need to support empty conditions so that we don't have to fake if (1) that would pollute the output. This change checks if #cond is an empty string before concatenating it with the optional var args, and avoids dumping the colon on the dump if the whole description is empty.	2024-11-14 08:47:00 +01:00
Valentine Krasnobaeva	d5d41dee3d	MINOR: startup: replace HAPROXY_LOAD_SUCCESS with global load_status After the master-worker refactoring, the master performs a re-exec only once, until it receives a "reload" command or a USR2 signal. The master's second re-exec, which was used to free unused memory, no longer exists. Thus, there is no longer any need to export the environment variable HAPROXY_LOAD_SUCCESS with the worker process load status. This status can simply be saved in a global variable load_status.	2024-11-13 09:50:05 +01:00
Amaury Denoyelle	8e0e7d9d1a	BUG/MINOR: guid/server: ensure thread-safety on GUID insert/delete Since 3.0, it is possible to assign a GUID to proxies, listeners and servers. These objects are stored in a global tree guid_tree. Proxies and listeners are static. However, servers may be added or deleted at runtime, which implies that guid_tree must be protected. Fix this by declaring a read-write lock to protect tree access. For now, only guid_insert() and guid_remove() are protected using a write lock. Outside of these, the GUID tree is not accessed at runtime. If server CLI commands are extended to support GUID as server identifier, the lookup operation should be extended with a read lock protection. Note that during stat-file preloading, the GUID tree is accessed for lookup. However, as it is performed on startup, which is single threaded, there is no need for a lock here. A BUG_ON() has been added to ensure this precondition remains true. This bug could cause a segfault when using dynamic servers with GUID. However, it has never been reproduced so far. This must be backported up to 3.0. To avoid a conflict issue, the previous cleanup patch can be merged before it.	2024-11-07 18:17:03 +01:00
Amaury Denoyelle	b70880cdc9	CLEANUP: guid: remove global tree export guid_tree is not directly used outside of functions provided by the guid module. Remove its export from the include file.	2024-11-07 17:20:00 +01:00
Aurelien DARRAGON	79a346aa28	MINOR: event_hdl: add event_hdl_sub_list_empty() helper func event_hdl_sub_list_empty() may be used to know if the subscription list passed as argument is empty or not (i.e. whether there currently are any subscribers). It can be useful to know if the subscription list is empty in order to avoid unnecessary preparation work and skip event publishing to save CPU time if we already know that no one is interested in tracking the changes for a given subscription list.	2024-11-07 11:35:55 +01:00
Willy Tarreau	84dd05e7d8	DEBUG: wdt: add a stats counter "BlockedTrafficWarnings" in show info Every time a warning is issued about traffic being blocked, let's increment a global counter so that we can check for this situation in "show info".	2024-11-06 18:35:42 +01:00
Willy Tarreau	148eb5875f	DEBUG: wdt: better detect apparently locked up threads and warn about them In order to help users detect when threads are behaving abnormally, let's try to emit a warning when one is no longer making any progress. This will allow to catch faulty situations more accurately, instead of occasionally triggering just after the long task. It will also let users know that there is something wrong with their configuration, and inspect the call trace to figure whether they're using excessively long rules or Lua for example (the usual warnings about lua-load vs lua-load-per-thread are still reported). The warning will only be emitted for threads not yet marked as stuck so as not to interfere with panic dumps and avoid sending a warning just before a panic. A tainted flag is set when this happens however (0x2000).	2024-11-06 18:35:42 +01:00
Willy Tarreau	0950778b3a	MINOR: debug: add a function to dump a stuck thread There's currently no way to just emit a warning informing that a thread is stuck without crashing. This is a problem because sometimes users would benefit from this info to clean up their configuration (e.g. abuse of map_regm, lua-load etc). This commit adds a new function ha_stuck_warning() that will emit a warning indicating that the designated thread has been stuck for XX milliseconds, with a number of streams blocked, and will make that thread dump its own state. The warning will then be sent to stderr, along with some reminders about the impacts of such situations to encourage users to fix their configuration. In order not to disrupt operations, a local 4kB buffer is allocated in the stack. This should be quite sufficient. For now the function is not used.	2024-11-06 18:35:42 +01:00
Willy Tarreau	1f34a0fd27	CLEANUP: stats: fix misleading comment on top of stat_idx_info The comment asks to update the "metrics_info" array, which does not exist, instead it's called stat_cols_info[] and is in stats.c. Let's mention all that to save time searching for the needed info. While no version seems to have ever known that "metrics_info", it's not needed to backport this as it's only a comment.	2024-11-06 18:35:42 +01:00
Amaury Denoyelle	1767196d5b	BUG/MINOR: quic: repeat packet parsing to deal with fragmented CRYPTO A ClientHello may be split across several different CRYPTO frames, then mixed in a single QUIC packet. This is used notably by clients such as chrome to render the first Initial packet opaque to middleboxes. Each packet frame is handled sequentially. Out-of-order CRYPTO frames are buffered in a ncbuf, until gaps are filled and data is transferred to the SSL stack. If CRYPTO frames are heavily split into small fragments, buffering may fail as ncbuf does not support small gaps. This causes the whole packet to be rejected and unacknowledged. It could be solved if the client reemits its ClientHello after remixing its CRYPTO frames. This patch is written to improve CRYPTO frame parsing. Each CRYPTO frame which cannot be buffered due to the ncbuf limitation is now stored in a temporary list. Packet parsing is completed until all frames have been handled. If the temporary list is not empty, reparsing is done on the stored frames. With the newly buffered CRYPTO frames, the ncbuf insert operation may succeed this time if the frame now covers a whole gap. Reparsing will loop until either no progress can be made or it has been done at least 3 times, to limit CPU utilization. This patch should fix github issue #2776. This should be backported up to 2.6, after a period of observation. Note that it relies on the following refactor patches: "MINOR: quic: extend return value of CRYPTO parsing", "MINOR: quic: use dynamically allocated frame on parsing", "MINOR: quic: simplify qc_parse_pkt_frms() return path".	2024-11-06 14:29:14 +01:00
Amaury Denoyelle	d65e782c8c	MINOR: quic: extend return value of CRYPTO parsing qc_handle_crypto_frm() is the function used to handle a newly received CRYPTO frame. Change its API to use a new dedicated return type. This allows to report if the frame was properly handled, ignored if already parsed previously, or rejected after a fatal error. This commit does not have any functional changes. However, it allows to simplify the qc_handle_crypto_frm() API by removing <fast_retrans> as output parameter. Also, this patch will be necessary to support multiple iterations of packet parsing for CRYPTO frames.	2024-11-06 14:28:14 +01:00
Aurelien DARRAGON	24dd7154a6	MINOR: http: don't %-encode the payload when not relevant As reported by Pierre Maoui in GH #2477, it's not possible to render control chars from variables or expressions verbatim in the payload part of http-return statements. That's a problem because this part should not require to be encoded at all (we could even imagine building favicons on the fly for example). In fact it is the LOG_OPT_HTTP option when passed as default options on parse_logformat_string() which tells the log encoder that the payload should be http-encoded using lf_chunk() instead of being printed using the per-type encoder. This option was set when parsing logformat expressions for lf-string expression under http-return statements, as well as logformat expressions for set-map action. While it is true that those actions may only be used under http context, the LOG_OPT_HTTP logformat option is not relevant there, because the payload is expected to be used without being encoded. So let's simply get rid of this option when parsing logformat expressions for set-map action key/value and lf-string from http-request return action, and add a note next to LOG_OPT_HTTP option to indicate that it is used to tell the log encoder that the payload should be HTTP-encoded. Thanks to Pierre for having reported the issue and Willy for the analysis and patch proposal.	2024-11-06 10:21:15 +01:00
Willy Tarreau	601b34fe7b	MINOR: connection: add new sample fetch functions fc_err_name and bc_err_name These functions return a symbolic error code such as ECONNRESET to keep logs compact while making them human-readable. It's a good alternative to the numeric code in that it's more expressive, and a good one to the full message since it's shorter and more precise (some codes even match errno names). The doc was updated so that the symbolic names appear in the table. It could be useful to backport this feature to help with troubleshooting some issues, though backporting the doc might possibly be more annoying in case users have local patches already, so maybe the table update does not need to be backported in this case.	2024-11-05 18:57:43 +01:00
Willy Tarreau	00c383ff65	MINOR: connection: add more connection error codes to cover common errno While we get reports of connection setup errors in fc_err/bc_err, we don't have the equivalent for the recv/send/splice syscalls. Let's add provisions for new codes that cover the common errno values that recv/send/splice can return, i.e. ECONNREFUSED, ENOMEM, EBADF, EFAULT, EINVAL, ENOTCONN, ENOTSOCK, ENOBUFS, EPIPE. We also add a special case for when the poller reported the error itself. It's worth noting that EBADF/EFAULT/EINVAL will generally indicate serious bugs in the code and should not be reported. The only thing is that it's quite hard to forcefully (and reliably) trigger these errors in automated tests as the timing is critical. Using iptables to manually reset established connections in the middle of large transfers at least permits to see some ECONNRESET and/or EPIPE, but the other ones are harder to trigger.	2024-11-05 18:57:43 +01:00
Willy Tarreau	393957908b	CLEANUP: connection: properly name the CO_ER_SSL_FATAL enum entry It was the only one prefixed with "CO_ERR_", making it harder to batch process and to look up. It was added in 2.5 by commit `61944f7a73` ("MINOR: ssl: Set connection error code in case of SSL read or write fatal failure") so it can be backported as far as 2.6 if needed to help integrate other patches.	2024-11-05 18:57:42 +01:00
Willy Tarreau	b300db55f6	BUILD: compiler: define __builtin_prefetch() for tcc We're using a few occurrences of __builtin_prefetch() but tcc doesn't know about it so let's give it a dummy definition. Now the code builds and works again with tcc without thread support.	2024-11-05 15:43:17 +01:00
Willy Tarreau	033db091fc	BUILD: import/mt_list: support building with TCC TCC is often convenient to quickly test builds, run CI tests etc. It has limited thread support (e.g. no thread-local stuff) but that is often sufficient for testing. TCC lacks __atomic_exchange_n() but has the exactly equivalent __atomic_exchange(), and doesn't have any barrier. For this reason we force the atomic_exchange to use the stricter SEQ_CST mem ordering that allows to ignore the barrier. [wt: that's upstream commit ca8b865 ("BUILD: support building with TCC")]	2024-11-05 15:43:17 +01:00
William Lallemand	e75a019fba	MINOR: startup: tune.renice.{startup,runtime} allow to change priorities This commit introduces the tune.renice.startup and tune.renice.runtime global keywords that allow to change the priority with setpriority(). tune.renice.startup is parsed and applied in the worker or the standalone process for configuration parsing. If this keyword is used alone, the nice value is changed back to the previous one after configuration parsing. tune.renice.runtime is applied after configuration parsing, so in the worker or a standalone process. Combined with tune.renice.startup it allows to have a different nice value during configuration parsing and during runtime. The feature was discussed in github issue #1919. Example: global tune.renice.startup 15 tune.renice.runtime 0	2024-11-04 17:48:58 +01:00
Christopher Faulet	64554a55f4	MINOR: stream: Add http-buffer-request option in the waiting entities When the http-buffer-request option is set on a proxy, the processing will be paused to wait for the full request payload or a full buffer. So it is an entity that blocks the processing, just like a rule or a filter that yields. So now, it is reported as a waiting entity if an error or a timeout occurred. To do so, a stream entity type is added for this option. There is no pointer, and the "waiting_entity" sample fetch returns the option name.	2024-10-31 20:24:50 +01:00
Christopher Faulet	c64712b085	MINOR: stream: Use an enum to identify last and waiting entities for streams Instead of using 1 for last/waiting rule and 2 for last/waiting filter, an enum is used. It is less ambiguous this way.	2024-10-31 20:24:37 +01:00
Christopher Faulet	537f20eb3e	MINOR: stream: Save the entity waiting to continue its processing When a rule or a filter yields because it waits for something to be able to continue its processing, this entity is saved in the stream. If an error or a timeout occurred, info on this entity may be retrieved via the "waiting_entity" sample fetch, for instance to dump it in the logs. This info may be useful to find the root cause of some bugs because it is a way to know the processing was temporarily stopped. This may explain timeouts for instance. The sample fetch is not documented yet.	2024-10-31 16:40:09 +01:00
Christopher Faulet	53de6da1c0	MINOR: stream: Save the last filter evaluated interrupting the processing It is very similar to the last evaluated rule. When a filter returns an error that interrupts the processing, it is saved in the stream, in the last_entity field, with the type 2. The pointer on the filter config is saved. This pointer never changes during runtime and is part of the proxy's structure. It is an element of the filter_configs list in the proxy structure. The "last_entity" sample fetch was updated accordingly. The filter identifier is returned, if defined. Otherwise the saved pointer is returned.	2024-10-31 16:39:04 +01:00
Christopher Faulet	c9fa78e747	MINOR: stream: Replace last_rule_file/line fields by a more generic field The last evaluated rule is now saved in a generic structure, named last_entity, with a type to identify it. The idea is to be able to store other kinds of entities that may interrupt a specific processing. The type of the last evaluated rule is set to 1. It will be replaced later by an enum to be more explicit. In addition, the pointer to the rule itself is saved instead of its location. The sample fetch "last_entity" was added to retrieve the information about it. In this case, it is the rule location: the config file containing the rule followed by the line where the rule is defined, separated by a colon. This sample fetch is not documented yet.	2024-10-31 16:36:39 +01:00
Amaury Denoyelle	dcf334168c	MINOR: quic: move qc_send_mux() prototype into quic_tx.h qc_send_mux() is defined in quic_tx.c. As such, its prototype is moved from quic_conn.h to quic_tx.h.	2024-10-31 15:35:31 +01:00
Tristan	18582ede05	MEDIUM: socket: add zero-terminated ABNS alternative When an abstract unix socket is bound by HAProxy (using "abns@" prefix), NUL bytes are appended at the end of its path until sun_path is filled (for a total of 108 characters). Here we add an alternative to pass only the non-NUL length of that path to connect/bind calls, such that the effective path of the socket's name is as humanly written. This may be useful to interconnect with existing software that implements abstract sockets with this logic instead of the default haproxy one. This is achieved by implementing the "abnsz" socket prefix (instead of "abns"), which stands for "zero-terminated ABNS". The "abnsz" prefix may be used anywhere "abns" is. Internally, haproxy uses the custom socket family (AF_CUST_ABNS vs AF_CUST_ABNSZ) to differentiate default abns sockets from zero-terminated ones. Documentation was updated and a regtest was added. Fixes GH issues #977 and #2479 Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>	2024-10-29 12:15:24 +01:00
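The difference between "abns" and "abnsz" described in 18582ede05 comes down to the address length passed to bind()/connect(). The sketch below illustrates it for Linux abstract sockets. It is not HAProxy's code: the helper abstract_addr() and its zero_terminated argument are hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper (not HAProxy code): fills <addr> for the abstract
 * socket <name> and returns the address length to pass to bind() or
 * connect(), or 0 if <name> does not fit in sun_path.
 */
static socklen_t abstract_addr(struct sockaddr_un *addr, const char *name,
                               int zero_terminated)
{
	size_t len = strlen(name);

	if (len > sizeof(addr->sun_path) - 1)
		return 0;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	/* sun_path[0] stays '\0': on Linux this marks an abstract socket */
	memcpy(addr->sun_path + 1, name, len);

	if (zero_terminated) {
		/* "abnsz" style: only the leading NUL plus the name count */
		return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + len);
	}
	/* "abns" style: the whole NUL-padded sun_path is part of the name */
	return (socklen_t)sizeof(*addr);
}
```

Because the kernel treats every byte covered by the address length as part of an abstract name, the two lengths designate two different sockets even though the visible name is the same.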
Aurelien DARRAGON	43861e3234	MEDIUM: sock_unix: use per-family addrcmp function Thanks to previous commit, we may now use dedicated addrcmp functions for each UNIX address family. This allows to simplify sock_unix_addrcmp() function and avoid useless checks in order to try to guess the socket type. In this patch we implement sock_abns_addrcmp() and sock_abnsz_addrcmp() functions, which are respectively used for ABNS and ABNSZ custom families sock_unix_addrcmp() now only holds regular UNIX socket comparing logic.	2024-10-29 12:15:09 +01:00
Willy Tarreau	d24768ab44	MINOR: protocol: create abnsz socket address family For now it's the same as abns. We'll need to modify sock_unix_addrcmp(), and a few other ones to support effective path length when dealing with the \0. Let's check with Tristan's patch for this (upcoming patch). Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>	2024-10-29 12:14:50 +01:00
Willy Tarreau	78ac312bbd	MEDIUM: protocol: make abns a custom unix socket address family This is a pre-requisite to adding the abnsz socket address family: in this patch we make use of protocol API rework started by `732913f` ("MINOR: protocol: properly assign the sock_domain and sock_family") in order to implement a dedicated address family for ABNS sockets (based on UNIX parent family). Thanks to this, it will become trivial to implement a new ABNSZ (for abns zero) family which is essentially the same as ABNS but with a slight difference when it comes to path handling (ABNS uses the whole sun_path length, while ABNSZ's path is zero terminated and evaluation stops at 0) It was verified that this patch doesn't break reg-tests and behaves properly (tests performed on the CLI with show sess and show fd). Anywhere relevant, AF_CUST_ABNS is handled alongside AF_UNIX. If no distinction needs to be made, real_family() is used to fetch the proper real family type to handle it properly. Both stream and dgram were converted, so no functional change should be expected for this "internal" rework, except that proto will be displayed as "abns_{stream,dgram}" instead of "unix_{stream,dgram}". Before ("show sess" output): 0x64c35528aab0: proto=unix_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=00 epoch=0 age=0s calls=1 rate=0 cpu=0 lat=0 rq[f=848000h,i=0,an=00h,ax=] rp[f=80008000h,i=0,an=00h,ax=] scf=[8,0h,fd=21,rex=10s,wex=] scb=[8,1h,fd=-1,rex=,wex=] exp=10s rc=0 c_exp= After: 0x619da7ad74c0: proto=abns_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=00 epoch=0 age=0s calls=1 rate=0 cpu=0 lat=0 rq[f=848000h,i=0,an=00h,ax=] rp[f=80008000h,i=0,an=00h,ax=] scf=[8,0h,fd=22,rex=10s,wex=] scb=[8,1h,fd=-1,rex=,wex=] exp=10s rc=0 c_exp= Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>	2024-10-29 12:14:25 +01:00
William Lallemand	596db3ef86	BUG/MINOR: trace: stop rewriting argv with -dt When using trace with -dt, the trace_parse_cmd() function calls strtok(), which writes \0 into the argv string. When using the mworker mode and reloading, argv was modified and the trace wouldn't work anymore because the first ':' was replaced by a '\0'. This patch fixes the issue by allocating a temporary string so that the source string is not modified directly. It also replaces strtok by its reentrant version strtok_r. Must be backported as far as 2.9.	2024-10-29 11:01:47 +01:00
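The fix pattern described in 596db3ef86 (tokenize a private copy with strtok_r() instead of running strtok() on the caller's string) can be sketched as below. The function count_fields() is a hypothetical stand-in and does not reproduce trace_parse_cmd() or the -dt syntax.

```c
#define _POSIX_C_SOURCE 200809L /* for strdup() and strtok_r() */
#include <stdlib.h>
#include <string.h>

/* Hypothetical example (not trace_parse_cmd()): counts the ':'-separated
 * fields of <arg> without modifying it. Returns -1 on allocation failure.
 */
static int count_fields(const char *arg)
{
	char *copy = strdup(arg); /* strtok_r() will write '\0' into this copy */
	char *save = NULL;
	char *tok;
	int n = 0;

	if (!copy)
		return -1;

	for (tok = strtok_r(copy, ":", &save); tok; tok = strtok_r(NULL, ":", &save))
		n++;

	free(copy);
	return n;
}
```

Since only the copy is altered, the original string can be parsed again after a reload.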
Aurelien DARRAGON	24131dee30	MINOR: tools: add strnlen2() helper strnlen2() is functionally equivalent to strnlen(). Goal is to provide an alternative to strnlen() which is not portable since it requires _POSIX_C_SOURCE >= 200809L	2024-10-28 14:59:35 +01:00
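A portable strnlen() equivalent such as the one described in 24131dee30 can be written with memchr() alone. The sketch below is only one possible way to do it, under the name strnlen_sketch(); it is not claimed to match how strnlen2() is implemented.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a portable strnlen() replacement (not HAProxy's strnlen2()):
 * returns the length of <s>, but never scans more than <maxlen> bytes.
 */
static size_t strnlen_sketch(const char *s, size_t maxlen)
{
	const char *end = memchr(s, '\0', maxlen);

	return end ? (size_t)(end - s) : maxlen;
}
```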
Willy Tarreau	fba48e1c40	MINOR: pools: export the pools variable We want it to be accessible from debuggers for inspection and it's currently unavailable. Let's start by exporting it as a first step.	2024-10-24 16:12:46 +02:00
Christopher Faulet	362de90f3e	BUG/MINOR: stconn: Don't disable 0-copy FF if EOS was reported on consumer side There is no reason to disable the 0-copy data forwarding if an end-of-stream was reported on the consumer side. Indeed, the consumer will send data in this case. So there is no reason to check the read side here. This patch may be backported as far as 2.9.	2024-10-24 12:07:50 +02:00
Amaury Denoyelle	7a02fcaf20	BUG/MEDIUM: server: fix race on servers_list during server deletion Each server is inserted in a global list named servers_list on new_server(). This list is then only used to finalize servers initialization after parsing. On dynamic server creation, there is no issue as new_server() is under thread isolation. However, when a server is deleted after its refcount reached zero, srv_drop() removes it from servers_list without lock protection. In the longterm, this can cause list corruption and crashes, especially if multiple adjacent servers are removed in parallel. To fix this, convert servers_list to a mt_list. This should not impact performance as servers_list is not used during runtime outside of server creation/deletion. This should fix github issue #2733. Thanks to Chris Staite who first found the issue here. This must be backported up to 2.6.	2024-10-24 11:35:57 +02:00
Valentine Krasnobaeva	ddb829bb51	MINOR: mworker/cli: split mworker_cli_proxy_create There are two parts in mworker_cli_proxy_create(): allocating and setting up MASTER proxy and allocating and setting up servers on ipc_fd[0] of the sockpairs shared with workers. So, let's split mworker_cli_proxy_create() into two functions respectively. Each of them takes **errmsg as an argument to write an error message, which may be triggered by some subcalls. The content of this errmsg will allow to extend the final alert message shown to user, if these new functions fail. The main goal of this split is to allow moving these two parts independently in the future and to make the code of haproxy initialization in haproxy.c more transparent.	2024-10-24 11:32:20 +02:00
Willy Tarreau	19e4ec43b9	MINOR: filters: add per-filter call counters The idea here is to record how many times a filter is being called on a stream. We're incrementing the same counter all along, regardless of the type of event, since the purpose is essentially to detect one that might be misbehaving. The number of calls is reported in "show sess all" next to the filter name. It may also help detect suboptimal processing. For example compressing 1GB shows 138k calls to the compression filter, which is roughly two calls per buffer. Maybe we wake up with incomplete buffers and compress less. That's left for a future analysis.	2024-10-22 20:13:00 +02:00
Willy Tarreau	37d5c6fe3a	MINOR: stream: maintain per-stream counters of the number of passes on code Process_stream() is a complex function and a few times some loops were either witnessed or suspected. Each time this happens it's extremely difficult to figure why because it involves combinations of analysers, filters, errors etc. Let's at least maintain a set of 4 counters per stream that report the number of times we've been through each of the 4 most important blocks (stconn changes, request analysers, response analysers, and propagation of changes down). These ones are stored in the stream and reported in "show sess all", just like they will be reported in panic dumps.	2024-10-22 20:13:00 +02:00
Willy Tarreau	da66c42f65	MINOR: debug: add a new debug macro COUNT_IF() This macro works exactly like BUG_ON() except that it never logs anything nor crashes, it only implements an atomic counter that is incremented on every call. This can be used to count a number of unlikely events that are worth checking at run time on setups showing unusual and unreproducible behaviors.	2024-10-21 19:14:07 +02:00
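The idea of a check that only counts matches can be illustrated with C11 atomics. This is a simplified sketch, not the real COUNT_IF(): the macro name `COUNT_WHEN`, the explicit counter argument and the sample functions are all assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical simplified macro: when <cond> is true, atomically add one to
 * <counter>. It never logs and never aborts.
 */
#define COUNT_WHEN(cond, counter)                                          \
	do {                                                               \
		if (cond)                                                  \
			atomic_fetch_add_explicit((counter), 1,            \
			                          memory_order_relaxed);   \
	} while (0)

/* counter for one unlikely event, shared by all threads */
static atomic_uint negative_inputs;

/* sample function with an unlikely condition worth counting */
int clamp_to_zero(int v)
{
	COUNT_WHEN(v < 0, &negative_inputs);
	return v < 0 ? 0 : v;
}

/* read the number of matches seen so far */
unsigned int negative_inputs_seen(void)
{
	return atomic_load_explicit(&negative_inputs, memory_order_relaxed);
}
```

A relaxed ordering is enough here because the counter is only a statistic and does not synchronize any other data.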
Willy Tarreau	776fd03509	MEDIUM: debug: add match counters for BUG_ON/WARN_ON/CHECK_IF These macros do not always kill the process, and sometimes it would be nice to know if some match or not, and how many times (especially for the CHECK_IF one). This commit adds a new section "dbg_cnt" made of structs that contain function name, file name, line number, check type, condition and match count. A new macro __DBG_COUNT() adds one to the counter, and is placed inside _BUG_ON() and _BUG_ON_ONCE(). It's worth noting that the exact type of the check is not very precise but in practice we don't care, as most checks will cause the process to die anyway unless they're of type _BUG_ON_ONCE() (used by CHECK_IF by default). All of this is limited to !defined(USE_OBSOLETE_LINKER) because we're creating a section, thus we need a modern linker to be able to scan this section later. Doing so adds ~50kB to the executable due to the ~1266 BUG_ON() and others placed there. That's not huge in comparison to the visibility it can provide.	2024-10-21 19:14:07 +02:00
Willy Tarreau	8844ed2009	CLEANUP: debug: make the BUG_ON() macros check the condition in the outer one The BUG_ON() macros are made of two levels so as to resolve the condition to a string. However this doesn't offer much flexibility for performing other operations when the condition is validated, so let's adjust them so that the condition is checked in the outer macro and the operations are performed in the inner one.	2024-10-21 18:17:25 +02:00
Willy Tarreau	278b9613a3	MEDIUM: debug: on panic, make the target thread automatically allocate its buf One main problem with panic dumps is that they're filling the dumping thread's trash, and that the global thread_dump_buffer is too small to catch enough of them. Here we're proceeding differently. When dumping threads for a panic, we're passing the magic value 0x2 as the buffer, and it will instruct the target thread to allocate its own buffer using get_trash_chunk() (which is signal safe), so that each thread dumps into its own buffer. Then the thread will wait for the buffer to be consumed, and will assign its own thread_dump_buffer to it. This way we can simply dump all threads' buffers from gdb like this: (gdb) set $t=0 while ($t < global.nbthread) printf "%s\n", ha_thread_ctx[$t].thread_dump_buffer.area set $t=$t+1 end For now we make it wait forever since it's only called on panic and we want to make sure the thread doesn't leave and continues to use that trash buffer or do other nasty stuff. That way the dumping thread will make all of them die. This would be useful to backport to the most recent branches to help troubleshooting. It backports well to 2.9, except for some trivial context in tinfo-t.h for an updated comment. 2.8 and older would also require TAINTED_PANIC. The following previous patches are required: MINOR: debug: make mark_tainted() return the previous value MINOR: chunk: drop the global thread_dump_buffer MINOR: debug: split ha_thread_dump() in two parts MINOR: debug: slightly change the thread_dump_pointer signification MINOR: debug: make ha_thread_dump_done() take the pointer to be used MINOR: debug: replace ha_thread_dump() with its two components	2024-10-19 16:01:52 +02:00
Willy Tarreau	afeac4bc02	MINOR: debug: replace ha_thread_dump() with its two components At the few places we were calling ha_thread_dump(), now we're calling separately ha_thread_dump_fill() and ha_thread_dump_done() once the data are consumed.	2024-10-19 15:42:34 +02:00
Willy Tarreau	8e048603d1	MINOR: debug: make mark_tainted() return the previous value Since mark_tainted() uses atomic ops to update the tainted status, let's make it return the prior value, which will allow the caller to detect if it's the first one to set it or not.	2024-10-19 15:13:47 +02:00
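Returning the prior value of an atomic OR lets a caller tell whether it was the first to set a bit. The sketch below shows the pattern with C11 atomics; `set_flag()`, `is_first_setter()` and `status_flags` are hypothetical names and do not reproduce mark_tainted() itself.

```c
#include <assert.h>
#include <stdatomic.h>

/* hypothetical process-wide status word */
static atomic_uint status_flags;

/* Atomically set <flag> and return the value the word had before. */
unsigned int set_flag(unsigned int flag)
{
	return atomic_fetch_or(&status_flags, flag);
}

/* Returns 1 only for the caller that actually turned <flag> on. */
int is_first_setter(unsigned int flag)
{
	return (set_flag(flag) & flag) == 0;
}
```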
Willy Tarreau	84340d108b	OPTIM: buffers: avoid a useless wrapping check for ofs == 0 As mentioned in previous commit, b_peek_ofs() performs a wrapping check but is often called with ofs == 0 as a constant. We can detect this case with __builtin_constant_p() so it makes sense to use it. A test shows a size reduction of about 320 bytes, which is not much, but it happens in hot code paths, and each 16 bytes reduction indicates an eliminated conditional branch. Some clear winners are ci_getblk_nc() (-48 bytes), h2c_dec_hdrs (-141B), h1_copy_msg_data (-124B), tcpcheck_spop_expect_hello (-80B), h1_parse_msg_data (-44B). These ones will definitely benefit from doing less conditional jumps.	2024-10-18 18:42:47 +02:00
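The optimization relies on the GCC/Clang builtin `__builtin_constant_p()`, which lets an inlined function skip work when an argument is a compile-time constant. The following is a minimal sketch on a hypothetical ring structure, assuming `head < size` and `ofs <= size`; it is not the b_peek_ofs() code.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical wrapping storage area; invariant: head < size */
struct ring {
	char *area;
	size_t size;
	size_t head;
};

/* Return a pointer to the byte located <ofs> bytes after the head,
 * wrapping at the end of the area. <ofs> must not exceed <size>.
 */
static inline char *ring_peek(const struct ring *r, size_t ofs)
{
	size_t pos = r->head + ofs;

#if defined(__GNUC__)
	/* When the compiler can prove ofs == 0, head + 0 cannot wrap, so
	 * the comparison below is dropped from the generated code.
	 */
	if (__builtin_constant_p(ofs) && ofs == 0)
		return r->area + pos;
#endif
	if (pos >= r->size)
		pos -= r->size;
	return r->area + pos;
}
```

The result is identical with or without the shortcut; only the emitted code differs, and the builtin simply evaluates to 0 when the value is not known at compile time.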
Willy Tarreau	8b5a1fd1fc	BUILD: buffers: keep b_getblk_nc() and b_peek_varint() in buf.h Some large functions were moved to buf.c by commit `ac66df4e2` ("REORG: buffers: move some of the heavy functions from buf.h to buf.c"). However, as found by Amaury, haring doesn't build anymore. Upon close inspection, b_getblk_nc() isn't that big since it's very much inlinable, and a part of its apparently large size comes from the BUG_ON_HOT() that were implemented. Regarding b_peek_varint(), it doesn't have any dependency and is used only at 4 places in the DNS code, so its loop will not have big impacts, and the rest around can be optimised away by the compiler so it remains relevant to keep it inlined. Also it can serve as a base to deduplicate the code in b_get_varint(). No backport needed.	2024-10-18 17:53:25 +02:00
Dragan Dosen	f33e9079a9	MINOR: arg: add an argument type for identifier The ARGT_ID argument type may now be used to set a custom resolve function in order to help resolve the argument string value. If the custom resolve function is not set, the behavior is the same as of type ARGT_STR.	2024-10-18 14:30:24 +02:00
Frederic Lecaille	b1af5dabf0	BUG/MEDIUM: quic: avoid freezing 0RTT connections This issue came with this commit: `f627b92` BUG/MEDIUM: quic: always validate sender address on 0-RTT and could be easily reproduced with picoquic QUIC client with -Q option which splits a big ClientHello TLS message into two Initial datagrams. A second condition must be fulfilled to reproduce this issue: picoquic must not send the token provided by haproxy (NEW_TOKEN). To do that, haproxy must be patched to prevent it from sending such tokens. Under these conditions, if haproxy has enough time to reply to the first Initial datagrams, when it receives the second Initial datagram it sends a Retry packet. Then the client ignores the Retry packet as mentioned by RFC 9000: 17.2.5.2. Handling a Retry Packet A client MUST accept and process at most one Retry packet for each connection attempt. After the client has received and processed an Initial or Retry packet from the server, it MUST discard any subsequent Retry packets that it receives. On its side, haproxy has closed the connection. When it receives the second Initial datagram, it opens a new connection but with Initial packets it cannot decrypt (wrong ODCID) leaving the client without response. To fix this, as the aim of the token (NEW_TOKEN) sent by haproxy is to validate the peer address, in place of closing the connection when no token was received for a 0RTT connection, one leaves this validation to the handshake process. Indeed, the peer address is validated during the handshake when a valid handshake packet is received by the listener. But as one does not want haproxy to process 0RTT data when no token was received, one does not accept the connection before the successful handshake completion. In addition to this, the 0RTT packets are not released after successful handshake completion when no token was received to leave a chance to haproxy to process these 0RTT data in such case (see quic_conn_io_cb()). Must be backported as far as 2.9.	2024-10-17 15:04:06 +02:00
Christopher Faulet	52a3d807fc	BUG/MAJOR: filters/htx: Add a flag to state the payload is altered by a filter When a filter is registered on the data, it means it may change the payload length by rewriting data. It means consumers of the message cannot trust the expected length of payload as announced by the producer. The commit `8bd835b2d2` ("MEDIUM: filters/htx: Don't rely on HTX extra field if payload is filtered") was pushed to solve this issue. When the HTTP payload of a message is filtered, the extra field is set to 0 to be sure it will never be used by error by any consumer. However, it is not enough. Indeed, the filters must be called before forwarding some data. They cannot be by-passed. But if a consumer is unable to flush the HTX message, some outgoing data can remain blocked in the channel's buffer. If some new data are then pushed because there is some room in the channel's buffer, the producer will set the HTX extra field. At this stage, if the consumer is unblocked and can send again data, it is possible to call it to forward outgoing data blocked in the channel's buffer before waking the stream up to filter new input data. It is the purpose of the data fast-forwarding. In this case, the HTX extra field will be seen by the consumer. It is unexpected and leads to undefined behavior. One consequence of this bug is to perform a wrong chunking on compressed messages, leading to processing errors at the end of the message, reported as "ID--" in logs. To fix the bug, a HTX flag is added to state the payload of the current HTX message is altered. When this flag is set (HTX_FL_ALTERED_PAYLOAD), the HTX extra field must not be trusted. And to keep things simple, when this flag is set, the HTX extra field is automatically set to 0 when the HTX message is loaded, in htxbuf() function. It is probably the least intrusive way to fix the bug for now. But this part must be reviewed to save meta-info of the HTX message outside of the message itself. This commit should solve the issue #2741. It must be backported as far as 2.9.	2024-10-17 13:54:54 +02:00
Valentine Krasnobaeva	b73a278df4	MINOR: mworker/cli: add _send_status to support state transition In the new master-worker architecture, when a worker process is forked and successfully initialized it needs somehow to communicate its "READY" state to the master, in order to terminate the previous worker and workers, that might exceeded max_reloads counter. So, let's implement for this a new master CLI _send_status command. A new worker can send its status string "READY" to the master, when it's about entering to the run poll loop, thus it can start to receive data. In _send_status() in the master context we update the status of the new worker: PROC_O_INIT flag is withdrawn. When TERM signal is sent to a worker, worker terminates and this triggers the mworker_catch_sigchld() handler in master. This handler deletes the exiting process entry from the processes list. In _send_status() we loop over the processes list twice. At the first time, in order to stop workers that exceeded the max_reloads counter. At the second time, in order to stop the worker forked before the last reload. In the corner case, when max_reloads=1, we avoid to send SIGTERM twice to the same worker by setting sigterm_sent flag during the first loop.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	2bb07b913d	MINOR: startup: rename and adapt reexec_on_failure Previously reexec_on_failure() was called in cases when the process has failed after reload, while it was parsing its configuration or it was trying to apply it. reexec_on_failure() has called mworker_reexec() and the master process has been reexecuted. With the new architecture in such cases there is no longer need to reexecute the master process after its reload again. It simply keeps the previous worker, forked before the reload, and it lets the new one to exit with an error. But we still need the code, which increments the number of failed reloads and which notifies systemd with new "Reload failed!" status. So, let's reuse and adapt for this reexec_on_failure() and let's rename it to on_new_child_failure().	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	646299fc95	MINOR: mworker: add and set state PROC_O_INIT for new worker Here, to distinguish between the new worker and the previous one let's add a new process state PROC_O_INIT and let's set it, when the memory is allocated for the new worker in the processes list.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	cc1a631beb	MINOR: mworker/cli: rename and clean mworker_cli_sockpair_new Let's rename mworker_cli_sockpair_new() to mworker_cli_global_proxy_new_listener() to outline that this function creates the GLOBAL proxy, allocates the listener with "master-socket" bind conf and attaches this listener to this GLOBAL proxy. Listener is bound to ipc_fd[1] of the sockpair inherited in master and in worker (master CLI sockpair).	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	0fbf1973ad	MINOR: mworker/cli: rename mworker_cli_proxy_new_listener This is the first commit in a series to add the support of 4 primary reload use-cases for the new master-worker architecture: 1. Newly forked worker process dies before any reload, due to some errors in the configuration. Newly forked worker process crashes before any reload after sending its "READY" state to master. 2. Newly forked worker process dies due to some errors in the new configuration. This happens after reload, when this new configuration was supplied, so the previous worker process is still here. 3. Newly forked worker process crashes after sending its "READY" state to master due to some bugs. This happens after reload, so the previous worker process is still here. 4. Newly forked worker process has sent its "READY" state to master and starts to receive traffic. This happens after reload, the old worker hasn't terminated yet, as it is waiting on some idle connection and it crashes. Let's rename in this commit mworker_cli_proxy_new_listener() to mworker_cli_master_proxy_new_listener() to outline, that this function creates "master-socket" bind conf and allocates a listener. This listener is attached to the MASTER proxy and it's bound to the ipc_fd[0] of the sockpair, inherited in master and in worker processes (master CLI sockpair).	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	f9123e2183	MEDIUM: cfgparse: add KWF_DISCOVERY keyword flag This commit is a part of the series to add a support of discovery mode in the configuration parser and in initialization sequence. So, let's add here KWF_DISCOVERY flag to distinguish the keywords, which should be parsed in "discovery" mode and which are needed for master process, from all others. Keywords that should be parsed in "discovery" mode have their dedicated parser functions. Let's tag these functions with KWF_DISCOVERY flag in keywords list. Like this, only these keyword parsers might be called during the first configuration read in discovery mode.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	6769745fe5	MINOR: global: add MODE_DISCOVERY flag This is the first commit from a series to add a support of discovery mode in the configuration parser and in initialization sequence. Discovery mode is the mode, when we read the configuration at the first time and we parse and set runtime modes: daemon, zero-warning, master-worker. In this mode we also parse some parameters needed for the master process to start, in case if we are in the master-worker mode. Like this the master process doesn't allocate any additional resources, which it doesn't use and it quickly finishes its initialization and enters to its polling loop. The worker process after its fork reads the rest of the configuration. So, let's add in this commit MODE_DISCOVERY flag to check it in configuration parser functions.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	fe75c1e12d	MEDIUM: startup: remove MODE_MWORKER_WAIT MODE_MWORKER_WAIT becomes redundant with MODE_MWORKER, due to moving master-worker fork in init(). This change allows the master to no longer perform a reexec just after forking in order to free additional memory. As after the fork in the master process we set 'master' variable, we can replace now MODE_MWORKER_WAIT in some 'if' statements by simple check of this 'master' variable. Let's also continue to get rid of HAPROXY_MWORKER_WAIT_ONLY environment variable, as it's no longer needed as well. In cfg_program_postparser(), which is used to check if cmdline is defined to launch a program, we completely remove the check of mode for now, because the master process does not parse the configuration for the moment. 'program' section parsing will be reintroduced in master later in the next commits.	2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva	fb7bef781d	MINOR: defaults: update MASTER_MAXCONN description This is one of the commits to prepare the removal of MODE_MWORKER_WAIT support, as it became redundant with MODE_MWORKER due to moving master-worker fork in init().	2024-10-16 22:02:39 +02:00
Willy Tarreau	2c2dac77aa	DEBUG: mux-h2/flags: add H2_CF_DEM_RXBUF & H2_SF_EXPECT_RXDATA for the decoder Both flags were recently added but missing from the decoders flags, so they appeared in hex in dev/flags/flags output. No backport needed.	2024-10-16 18:32:52 +02:00
Aurelien DARRAGON	85298189bf	BUG/MEDIUM: server: server stuck in maintenance after FQDN change Pierre Bonnat reported that SRV-based server-template recently stopped to work properly. After reviewing the changes, it was found that the regression was caused by `a4d04c6` ("BUG/MINOR: server: make sure the HMAINT state is part of MAINT") Indeed, HMAINT is not a regular maintenance flag. It was implemented in `b418c122` `a4d04c6` ("BUG/MINOR: server: make sure the HMAINT state is part of MAINT"). This flag is only set (and never removed) when the server FQDN is changed from its initial config-time value. This can happen with "set server fqdn" command as well as SRV records updates from the DNS. This flag should ideally belong to server flags.. but it was stored under srv_admin enum because cur_admin is properly exported/imported via server state-file while regular server's flags are not. Due to `a4d04c6`, when a server FQDN changes, the server is considered in maintenance, and since the HMAINT flag is never removed, the server is stuck in maintenance. To fix the issue, we partially revert `a4d04c6`. But this latter commit is right on one point: HMAINT flag was way too confusing and mixed-up between regular MAINT flags, thus there's nothing to blame about `a4d04c6` as it was error-prone anyway.. To prevent such kind of bugs from happening again, let's rename HMAINT to something more explicit (SRV_ADMF_FQDN_CHANGED) and make it stand out under srv_admin enum so we're not tempted to mix it with regular maintenance flags anymore. Since `a4d04c6` was set to be backported in all versions, this patch must be backported there as well.	2024-10-16 14:26:57 +02:00
Amaury Denoyelle	0918c41ef6	BUG/MEDIUM: quic: support wait-for-handshake wait-for-handshake http-request action was completely ineffective with QUIC protocol. This commit implements its support for QUIC. QUIC MUX layer is extended to support wait-for-handshake. A new function qcc_handle_wait_for_hs() is executed during qcc_io_process(). It detects if MUX processing occurs after underlying QUIC handshake completion. If this is the case, it indicates that early data may be received. As such, connection is flagged with CO_FL_EARLY_SSL_HS, which is necessary to block stream processing on wait-for-handshake action. After this, qcc subscribes on quic_conn layer for RECV notification. This is used to detect QUIC handshake completion. Thus, qcc_handle_wait_for_hs() can be reexecuted one last time, to remove CO_FL_EARLY_SSL_HS and notify every stream flagged as SE_FL_WAIT_FOR_HS. This patch must be backported up to 2.6, after a mandatory period of observation. Note that it relies on the backport of the two previous patches : - MINOR: quic: notify connection layer on handshake completion - BUG/MINOR: stream: unblock stream on wait-for-handshake completion	2024-10-16 11:51:35 +02:00
Willy Tarreau	4eb3ff1d3b	MAJOR: mux-h2: make streams use the connection's buffers For now it seems to work as before, and even when artificially inflating the number of allocatable buffers per stream. The number of allocated slots is always the same as the max number of streams, which guarantees that each stream will find one buffer. We only grant one buffer per stream at this point, since the goal was to replace the existing single rxbuf. A new demux blocking flag, H2_CF_DEM_RXBUF, was added to indicate a failure to get an rxbuf slot from the connection. It was lightly tested (by forcing bl_init() to a lower number of buffers). It is not yet certain whether it's more useful to have a new flag or to reuse the existing H2_CF_DEM_SFULL which indicates the rxbuf is full, but at least the new flag more accurately translates the condition, that may make a difference in the future. However, given that when RXBUF is set, most of the time it results in a failure to find more room to demux and it sets SFULL, for now we have to always clear SFULL when clearing RXBUF as well. This means that most of the time we'll see 3 combinations: - none: everything's OK - SFULL: the unique rx buffer is full - RXBUF || (RXBUF|SFULL): cannot allocate more entries Note that we need to be super careful in h2_frt_transfer_data() because the htx_free_data_space() function doesn't guarantee that the room is usable, so htx_add_data() may still fail despite an apparent room. For this reason, h2_frt_transfer_data() maintains a "full" flag to indicate that a transfer attempt failed and that a new buffer is required.	2024-10-12 16:29:16 +02:00
Willy Tarreau	3b5ac2b553	MINOR: mux-h2: move H2_CF_WAIT_IN_LIST flag away from the demux flags It's not convenient to have this flag in the middle of the demux flags, it easily hides other ones that need to be added. Let's move it after the other ones.	2024-10-12 16:29:16 +02:00
Willy Tarreau	721ea5b06c	MINOR: mux-h2: count within a connection, how many streams are receiving data A stream is receiving data from after the HEADERS frame missing END_STREAM, to the end of the stream or HREM (the presence of END_STREAM). We're now adding a flag to the stream that indicates this state, as well as a counter in the connection of streams currently receiving data. The purpose will be to gauge at any instant the number of streams that might have to share the available bandwidth and buffers count in order not to allocate too much flow control to any single stream. For now the counter is kept up to date, and is reported in "show fd".	2024-10-12 16:29:16 +02:00
Willy Tarreau	8f09bdce10	MINOR: buffer: add a buffer list type with functions The buffer ring is problematic in multiple aspects, one of which being that it is only usable by one entity. With multiplexed protocols, we need to have shared buffers used by many entities (streams and connection), and the only way to use the buffer ring model in this case is to have each entity store its own array, and keep a shared counter on allocated entries. But even with the default 32 buf and 100 streams per HTTP/2 connection, we're speaking about 32 * 101 * 32 bytes = 103424 bytes per H2 connection, just to store up to 32 shared buffers, spread randomly in these tables. Some users might want to achieve much higher than default rates over high speed links (e.g. 30-50 MB/s at 100ms), which is 3 to 5 MB storage per connection, hence 180 to 300 buffers. There it starts to cost a lot, up to 1 MB per connection, just to store buffer indexes. Instead this patch introduces a variant which we call a buffer list. That's basically just a free list encoded in an array. Each cell contains a buffer structure, a next index, and a few flags. The index could be reduced to 16 bits if needed, in order to make room for a new struct member. The design permits initializing a whole freelist at once using memset(0). The list pointer is stored at a single location (e.g. the connection) and all users (the streams) will just have indexes referencing their first and last assigned entries (head and tail). This means that with a single table we can now have all our buffers shared between multiple streams, irrelevant to the number of potential streams which would want to use them. Now the 180 to 300 entries array only costs 7.2 to 12 kB, or 80 times less. Two large functions (bl_deinit() & bl_get()) were implemented in buf.c. A basic doc was added to explain how it works.	2024-10-12 16:29:15 +02:00
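A free list encoded in an array, with each user keeping only head and tail indexes, can be sketched as below. This is a deliberately simplified illustration and not the bl_*() API: all names (`cell_pool`, `pool_get()`, `pool_put_head()`, ...) are hypothetical, cell 0 is reserved to mean "no cell", and the free list is chained by an explicit init loop rather than by the memset(0) trick mentioned in the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_CELLS 8  /* cell 0 is reserved and means "no cell" */

struct cell {
	unsigned short next;  /* next cell of the same chain, 0 = end */
	char data[16];        /* stand-in for a buffer structure */
};

/* single table shared by all users (e.g. stored in the connection) */
struct cell_pool {
	struct cell cells[POOL_CELLS];
	unsigned short free_head;  /* first free cell, 0 = exhausted */
};

/* per-user view (e.g. one per stream): first and last assigned cells */
struct user {
	unsigned short head, tail;
};

/* Chain cells 1..POOL_CELLS-1 into the free list. */
void pool_init(struct cell_pool *p)
{
	unsigned short i;

	for (i = 1; i < POOL_CELLS; i++)
		p->cells[i].next = (i + 1 < POOL_CELLS) ? i + 1 : 0;
	p->free_head = 1;
}

/* Take one free cell and append it to <u>'s chain.
 * Returns its index, or 0 if the pool is exhausted.
 */
unsigned short pool_get(struct cell_pool *p, struct user *u)
{
	unsigned short idx = p->free_head;

	if (!idx)
		return 0;
	p->free_head = p->cells[idx].next;
	p->cells[idx].next = 0;
	if (u->tail)
		p->cells[u->tail].next = idx;
	else
		u->head = idx;
	u->tail = idx;
	return idx;
}

/* Release <u>'s oldest cell back to the free list.
 * Returns the released index, or 0 if <u> owns no cell.
 */
unsigned short pool_put_head(struct cell_pool *p, struct user *u)
{
	unsigned short idx = u->head;

	if (!idx)
		return 0;
	u->head = p->cells[idx].next;
	if (!u->head)
		u->tail = 0;
	p->cells[idx].next = p->free_head;
	p->free_head = idx;
	return idx;
}
```

With this layout the per-user cost is two small indexes, whatever the size of the shared table, which is the property the commit message highlights.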
Willy Tarreau	ac66df4e2e	REORG: buffers: move some of the heavy functions from buf.h to buf.c Over time, some of the buffer management functions grew quite a bit, and were still forced to remain inlined since all defined in buf.h. Let's create buf.c and move the heaviest ones there. All those moved here were above 200 bytes.	2024-10-12 16:29:15 +02:00
Aurelien DARRAGON	1bdf6e884a	MEDIUM: sink: implement sink_find_early() sink_find_early() is a convenient function that can be used instead of sink_find() during parsing time in order to try to find a matching sink even if the sink is not defined yet. Indeed, if the sink is not defined, sink_find_early() will try to create it and mark it as forward-declared. It will also save information from the caller to better identify it in case of errors. If the sink happens to be found in the config, it will transition from forward-declared type to its final type. Else, it means that the sink was not found in the config, in this case, during postresolve, we raise an error to indicate that the sink was not found in the configuration. It should help solve postresolving issue with rings, because for now only log targets implement proper ring postresolving.. but rings may be used at different places in the code, such as debug() converter or in "traces" section.	2024-10-10 16:55:15 +02:00
Aurelien DARRAGON	0e271f1d2a	MINOR: log: add do_log_parse_act() helper func Function may be used from places where per-context actions are usually registered (tcp_act.c, http_act.c, quic_rules.c.. to name a few) in order to expose the do_log() action.	2024-10-04 21:38:08 +02:00
Aurelien DARRAGON	e63c7da508	MINOR: log: add do_log() logging helper do_log() is quite similar to sess_log() or strm_log(), except that it may be called at any time during session handling in an opportunistic way as long as the session exists (the stream may or may not exist). Also, it will try to emit the log as INFO by default, unless set-log-level is used on the stream, or error origin flag is set.	2024-10-04 21:38:02 +02:00
Amaury Denoyelle	f6599cf5a6	MEDIUM: quic: decount out-of-order ACK data range for MUX txbuf window This commit is the last one of a series whose objective is to restore QUIC transfer throughput performance to the state prior to the recent QUIC MUX buffer allocator rework. This gain is obtained by reporting received out-of-order ACK data range to the QUIC MUX which can then decount room in its txbuf window. This is implemented in QUIC streamdesc layer by adding a new invocation of notify_room callback. This is done into qc_stream_buf_store_ack() which handles out-of-order ACK data range. Previous commit has introduced merging of overlapping ACK data range. As such, it's easy to only report the newly acknowledged data range. As with in-order ACKs, this new notification is only performed on released streambuf. As such, when a streambuf instance is released, notify_room notification now also reports the total length of out-of-order ACK data range currently stored. This value is stored in a new streambuf member <room> to avoid unnecessary tree lookup. This <room> member also serves on in-order ACK notification to reduce the notified room. This prevents to report invalid values when overlap ranges are treated first out-of-order and then in-order, which would cause an invalid QUIC MUX txbuf window value. After this change has been implemented, performance has been significantly improved, both with ngtcp2-client rate usage and on interop goodput test. These values are now similar to the rate observed on older haproxy version before QUIC MUX buffer allocator rework.	2024-10-04 18:09:51 +02:00
Amaury Denoyelle	e7578084b0	MINOR: quic: implement dedicated type for out-of-order stream ACK QUIC streamdesc layer is responsible to handle reception of ACK for streams. It removes stream data from the underlying buffers on ACK reception. Streamdesc layer treats ACK in order at the stream level. Out of order ACKs are buffered in a tree until they can be handled on older data acknowledgement reception. Previously, qf_stream instance which comes from the quic_tx_packet was used as tree node to buffer such ranges. Introduce a new type dedicated to represent out of order stream ack data range. This type is named qc_stream_ack. It contains minimal infos only relative to the acknowledged stream data range. This allows to reduce size of frequently used quic_frame with the removal of tree node from qf_stream. Another side effect of this change is that now quic_frame are always released immediately on ACK reception, both in-order and out-of-order. This allows to also release the quic_tx_packet instance which should reduce memory consumption. The drawback of this change is that qc_stream_ack instance must be allocated on out-of-order ACK reception. As such, qc_stream_desc_ack() may fail if an error happens on allocation. For the moment, such error is silently recovered up to qc_treat_rx_pkts() with the dropping of the received packet containing the ACK frame. In the future, it may be useful to close the connection as this error may only happen on low memory usage.	2024-10-04 17:56:45 +02:00
Christopher Faulet	15a520d474	MINOR: config/trace: Add a 'traces' section to declare debug traces It is no longer supported to declare debug traces, via 'trace' directive, in a global section. A 'traces' directive must be used instead. The syntax of the 'trace' directive in these sections remains the same. But it is no longer experimental. The main reason for this change is to avoid to have a ring section defined before a global one. Indeed, for now, forward declarations of ring sections are not supported. So to configure traces, you had to add a ring section before the global one defining the traces. Most of time, that meant to have two global sections : global [...] # global settings ring <name> [...] global [...] # trace config In addition, it will be possible to easily extend the traces section by adding some new directives.	2024-10-02 10:22:51 +02:00
Amaury Denoyelle	cc4384aeb7	MEDIUM: quic: handle out-of-order ACK at streamdesc layer qc_stream_desc_ack() is the entrypoint for streamdesc layer to handle a new acknowledgement of previously emitted STREAM data. Previously, it was only able to deal with in-order ACK offset. The caller was responsible to buffer out-of-order ACKs. Change this by dealing with the latter case directly in qc_stream_desc_ack(). This notably simplify ACK handling in quic_rx module.	2024-10-01 16:22:20 +02:00
Amaury Denoyelle	62558a9285	MINOR: quic: move buffered ACK to streambuf QUIC streamdesc layer is used to manage QUIC MUX stream txbuf data storage until acknowledgment. Currently, it only supports in-order acknowledgment at the stream level. This requires to be able to buffer out-of-order ACKs until they can be handled. Previously, these ACKs were stored in a tree to the streamdesc instance. Move this indexed storage at the streambuf instance. This commit is purely an architecture change. However, it will allow to extend ACK management in future patches, such as the ability to merge overlapping out-of-order ACKs.	2024-10-01 16:19:42 +02:00
Amaury Denoyelle	943e48dadd	MINOR: quic: store streambuf in a streamdesc tree qc_stream_desc layer is used by QUIC MUX to store emitted STREAM data until their acknowledgement. Each stream with Tx capability can allocate its own qc_stream_desc. In turn, each stream desc can have one or multiple data buffers. This is useful when a MUX stream releases a buffer and allocates a new one, to preserve bandwidth without waiting to receive all acknowledgement of the previous buffer. Each buffer is encapsulated in a qc_stream_buf structure. Previously, it was stored as a list into qc_stream_desc. Change this storage to use a tree instead. Each buffer is indexed by its offset. This commit does not introduce functional changes. However, this rearchitecture will be necessary for future commits to extend ACK management which requires fetching individual buffer instance, not just the first or last element of a streamdesc, by their offset.	2024-10-01 16:19:41 +02:00
Amaury Denoyelle	f4a83fbb14	MINOR: quic: do not remove qc_stream_desc automatically on ACK handling qc_stream_desc_ack() is used to handle ACK received for STREAM frame. It removes acknowledged data from their underlying buffer. If all data were removed after ACK handling, qc_stream_desc instance would automatically be freed at the end of qc_stream_desc_ack(). However, this renders the function complicated to use. Simplify this by removing this automatic removal. Now, caller is responsible to check after ACK handling if qc_stream_desc instance can be removed. This is easily done using qc_stream_desc_done() helper.	2024-10-01 16:19:25 +02:00
Amaury Denoyelle	db68f8ed86	MINOR: quic: refactor STREAM room notification qc_stream_desc is an intermediary layer between QUIC MUX and quic_conn. It is a facility which permits to store data to emit and keep them for retransmission until acknowledgment. This layer is responsible to notify QUIC MUX each time a buffer is freed. This is necessary as MUX buffer allocation is limited by the underlying congestion window size. Refactor this to use a mechanism similar to send notification. A new callback notify_room can now be registered to qc_stream_desc instance. This is set by QUIC MUX to qmux_ctrl_room(). On MUX QUIC free, special care is now taken to reset notify_room callback to NULL. Thanks to this refactoring, further adjustment have been made to refine the architecture. One of them is the removal of qc_stream_desc QC_SD_FL_OOB_BUF, which is now converted to a MUX layer flag QC_SF_TXBUF_OOB.	2024-10-01 16:19:25 +02:00
Amaury Denoyelle	d7f4e5abf0	MEDIUM: quic: strengthen MUX send notification Previous commit implement a refactor of MUX send notification from quic_conn layer. With this new architecture, a proper callback is defined for each qc_stream_desc instance. This architecture change allows to simplify notification from quic_conn layer. First, ensure the MUX callback to properly ignore retransmission of an already emitted frame. Luckily, this can be handled easily by comparing offsets and FIN status. Also, each QCS instance can now be unregistered from send notification just prior qc_stream_desc releasing. This ensures a QCS is never manipulated from quic_conn after its emission ending. Both these changes render the send notification more robust. As a nice effect, flag QUIC_FL_CONN_TX_MUX_CONTEXT can be removed as it is now unneeded.	2024-10-01 16:19:25 +02:00
Amaury Denoyelle	6ad99af0a9	MINOR: quic: refactor MUX send notification For STREAM emission, MUX QUIC generates one or several frames and emits them via qc_send_mux(). Lower layer may use them as-is, or split them into smaller chunks to fit in a QUIC packet. It is then responsible to notify the MUX to report the amount of data sent. Previously, this was done via a direct call from quic_conn to MUX using qcc_streams_sent_done(). Modify this to have a better isolation across layers. Define a send callback handled by the qc_stream_desc instance. This allows the MUX to register each QCS instance individually to the renamed qmux_ctrl_send() which replaces qcc_streams_sent_done(). At quic_conn layer, qc_stream_desc_send() can be used now. This is a wrapper to qc_stream_desc layer to invoke the send callback if registered. This mechanism of qc_stream_desc callback should be extended later to implement other notifications across the QUIC stack.	2024-10-01 16:19:25 +02:00
Christopher Faulet	273d322b6f	MINOR: stream/stats: Expose the total number of streams ever created in stats A shared counter is added in the thread context to track the total number of streams created on the thread. This number is then reported in stats. It will be a useful information to diagnose some bugs.	2024-09-30 16:55:53 +02:00
Christopher Faulet	18ee22ff76	MINOR: stream/stats: Expose the current number of streams in stats A shared counter is added in the thread context to track the current number of streams. This number is then reported in stats. It will be a useful information to diagnose some bugs.	2024-09-30 16:55:53 +02:00
Christopher Faulet	6a94b7419e	MINOR: stream: Support dynamic changes of the number of connection retries Thanks to the previous patch, it is now possible to add an action to dynamically change the maximum number of connection retries for a stream. "set-retries" action may now be used to do so, from a "tcp-request content" or a "http-request" rule. This action accepts an expression or an integer between 0 and 100. The integer value is checked during the configuration parsing and leads to an error if it is not in the expected range. However, for the expression, the value is retrieved at runtime. So, invalid values are just ignored. Too high a value is forbidden to avoid any trouble. 100 retries already seems to be an amazingly high value. In addition, the option is only available on backend or listen sections. Because the max retries is limited to 100 at most, it can be stored as an unsigned short. This saves some space in the stream structure.	2024-09-30 16:55:53 +02:00
Christopher Faulet	91e785edc9	MINOR: stream: Rely on a per-stream max connection retries value Instead of directly relying on the backend parameter to limit the number of connection retries, we now use a per-stream value. This value is by default inherited from the backend value when it is set. So for now, there is no change except the stream value is used instead of the backend value. But thanks to this change, it will be possible to dynamically change this value.	2024-09-30 16:55:53 +02:00
Christopher Faulet	0d91de2be4	MINOR: action: Export release_expr_int_action() release function This function was only used by TCP actions and was private to tcp_act.c file. However, it make sense to make it public to be used by any action relying on an int-or-expression argument.	2024-09-30 16:55:53 +02:00
Willy Tarreau	7caf073faa	MINOR: tools: do not attempt to use backtrace() on linux without glibc The function is provided by glibc. Nothing prevents us from using our own outside of glibc there (tested on aarch64 with musl). We still do not enable it by default as we don't yet know if all archs work well, but it's sufficient to pass USE_BACKTRACE=1 when building with musl to verify it's OK.	2024-09-29 09:52:23 +02:00
Willy Tarreau	1c4776dbc3	BUILD: tools: only include execinfo.h for the real backtrace() function No need to include this possibly non-existing file when using our own backtrace() implementation, it's only needed for the libc-provided one. Because of this it's currently not possible to build musl with backtrace enabled.	2024-09-29 09:52:23 +02:00
Willy Tarreau	a4d04c649a	BUG/MINOR: server: make sure the HMAINT state is part of MAINT In 1.8 when adding "set server fqdn" with commit `b418c1228c` ("MINOR: server: cli: Add server FQDNs to server-state file and stats socket."), the HMAINT flag was not made part of the MAINT ones, so technically speaking when changing the FQDN, the server is not completely considered as in maintenance mode. In its defense, the code location around that was completely messy, with the aggregator flag being hidden between other values and purposely but discretely ignoring one of the flags, so the comments were updated to make the intent clearer (particularly regarding CMAINT which looked like it was also forgotten while it was on purpose). This can be backported anywhere.	2024-09-27 18:40:15 +02:00
Willy Tarreau	b8e3b0a18d	BUG/MEDIUM: stream: make stream_shutdown() async-safe The solution found in commit `b500e84e24` ("BUG/MINOR: server: shut down streams under thread isolation") to deal with inter-thread stream shutdown doesn't work fine because there exists code paths involving a server lock which can then deadlock on thread_isolate(). A better solution then consists in deferring the shutdown to the stream itself and just wake it up for that. The only thing is that TASK_WOKEN_OTHER is a bit too generic and we need to pass at least 2 types of events (SF_ERR_DOWN and SF_ERR_KILLED), so we're now leveraging the new TASK_F_UEVT1 and _UEVT2 flags on the task's state to convey these info. The caller only needs to wake the task up with these flags set, and the stream handler will then finish the job locally using stream_shutdown_self(). This needs to be carefully backported to all branches affected by the dequeuing issue and containing any of the `5541d4995d` ("BUG/MEDIUM: queue: deal with a rare TOCTOU in assign_server_and_queue()"), and/or `b11495652e` ("BUG/MEDIUM: queue: implement a flag to check for the dequeuing").	2024-09-27 12:15:41 +02:00
Willy Tarreau	b5281283bb	MINOR: task: define two new one-shot events for use with WOKEN_OTHER or MSG TASK_WOKEN_MSG only says "someone sent you a message" but doesn't convey any info about the message. TASK_WOKEN_OTHER says "you're woken for another reason" but doesn't tell which one. Most often they're used as-is by the task handlers to report very specific situations. For some important control notifications, having the ability to modulate the message a little bit is useful, so let's define two user event types UEVT1 and UEVT2 to be used in conjunction with TASK_WOKEN_MSG or _OTHER so that the application can know that a specific condition was explicitly requested. It will be used this way: task_wakeup(s->task, TASK_WOKEN_MSG | TASK_F_UEVT1); or: task_wakeup(s->task, TASK_WOKEN_OTHER | TASK_F_UEVT2); Since events are cumulative, keep in mind not to consider a 3rd value as the combination of EVT1+EVT2; these really mean that the two events appeared (though in unspecified order).	2024-09-27 11:56:10 +02:00
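The cumulative behaviour described in the commit above can be illustrated with a minimal sketch. The bit values and the `ex_` names below are hypothetical and chosen only for this example; the real TASK_* constants and task_wakeup() live in HAProxy's own headers and are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values for illustration only; not HAProxy's real ones. */
#define EX_WOKEN_MSG   0x01u
#define EX_WOKEN_OTHER 0x02u
#define EX_F_UEVT1     0x10u
#define EX_F_UEVT2     0x20u

struct ex_task {
    uint32_t state; /* pending wake-up reasons and user events */
};

/* Wake-ups accumulate: each call ORs its flags into the pending state. */
static void ex_task_wakeup(struct ex_task *t, uint32_t flags)
{
    assert(t != 0);
    t->state |= flags;
}

/* Handler side: each user event bit is tested on its own. Seeing both bits
 * means both events were posted, not that a third event exists.
 * Returns the number of user events seen and clears the pending state. */
static int ex_handle_events(struct ex_task *t, int *saw_evt1, int *saw_evt2)
{
    assert(t != 0);
    *saw_evt1 = !!(t->state & EX_F_UEVT1);
    *saw_evt2 = !!(t->state & EX_F_UEVT2);
    t->state = 0;
    return *saw_evt1 + *saw_evt2;
}
```

Two wake-ups posted before the handler runs are merged into one state word, so the handler has to treat UEVT1 and UEVT2 as two independent conditions.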
Aurelien DARRAGON	4189eb7aca	MINOR: log: add log_orig_proxy() helper function Function may be used on proxy where log-steps are used to check if a given log origin should be handled or not.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	c043d5d372	MINOR: log: introduce "log-steps" proxy keyword For now it is only available for proxies with frontend capability because log-steps are only evaluated under sess_log() or strm_log() which essentially focus on the frontend side when it comes to log settings so it's better to keep it this way for better consistency, at least for now. For now the setting does nothing (it is not considered during runtime), it will be implemented and documented in upcoming commits.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	9341792baf	MINOR: proxy: add log_steps struct member add proxy->conf.log_steps eb32 root tree which will be used to store the log origin identifiers that should result in haproxy emitting a log as configured by the user using upcoming "log-steps" proxy keyword. It was chosen to use eb32 tree instead of simple bitfield because despite the slight overhead it is more future-proof given that we already implemented the prerequisites for seamless custom log origins registration that will also be usable from "log-steps" proxy keyword.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	b882402a29	MINOR: log: support extra log origins for '%OG' alias Following previous commits, let's improve log_orig_to_str() so that extra log origins (registered through log_orig_register()) can be translated to string from origin ID. For that, it is required to add eb_32 tree node to log_origin struct in order to enable quick integer lookup during runtime. Slow name lookup using the list is acceptable for config parsing, but it is not the case during runtime when log_orig_to_str() is expected to be used. Also, to prevent duplicated info, get rid of ->id field and use ->tree.key instead	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	f8bb9d5c57	MINOR: log: explicitly handle extra log origins as error when relevant Thanks to previous commit, we can now check for log_orig optional flags in functions taking struct log_orig as parameter. Let's take this opportunity to add the LOG_ORIG_FL_ERROR flag and check this flag at a few places to handle the log message differently because if the flag is set then the caller expects the log to be handled as an error explicitly. e.g.: in _process_send_log_override(), if the flag is set, use the error log format instead of the dedicated one.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	3c15ee05e9	MINOR: log: introduce log_orig flags Rename 'enum log_orig' to 'enum log_orig_id', since this enum specifically contains the log origin ids. Add 'struct log_orig' which wraps 'enum log_orig' with optional flags (no flags defined for now). Add log_orig() helper func that takes id and flags as parameter and returns log_orig struct initialized with input arguments. Update functions taking log origin as parameter so they explicitly take log orig id or log orig wrapper as argument depending on the level of context expected by the function.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	818475c5cc	MINOR: log: introduce extra log profile steps add a way to register additional log origins using log_origin_register() that may be used as log profile steps from log profile sections. For now this does nothing as no extra origins are registered and extra log origins are not yet considered for runtime logging paths. When specifying an extra logging step for on <step> under log-profile section, the logging step is stored within a binary tree for efficient lookup during runtime. No performance impact should be expected if extra log origins are not being used, and slight performance impact if extra log origins are used. Don't forget to update the documentation when new log origins are added (both %OG log alias and on <step> log-profile keyword are concerned).	2024-09-26 16:53:07 +02:00
Oliver Dala	a889413f5e	BUG/MEDIUM: cli: Deadlock when setting frontend maxconn The proxy lock state isn't passed down to relax_listener through dequeue_proxy_listeners, which causes a deadlock in relax_listener when it tries to get that lock. Backporting: Older versions didn't have relax_listener and directly called resume_listener in dequeue_proxy_listeners. lpx should just be passed directly to resume_listener then. The bug was introduced in commit `001328873c` [cf: This patch should fix the issue #2726. It must be backported as far as 2.4]	2024-09-25 17:12:11 +02:00
Christopher Faulet	96edacc546	DEV: flags/applet: decode appctx flags Decode APPCTX flags via appctx_show_flags() function.	2024-09-24 18:26:36 +02:00
Willy Tarreau	ccd1ecba1d	MEDIUM: cfgparse: drop duplicate named defaults sections after use It has never been permitted to explicitly reference named defaults sections for which there are duplicate names. This means that when a duplicate defaults section is found, there's no point in keeping it since it will never be used for lookups, so it can be dropped. However, some such defaults sections might have some rules in them that are implicitly referenced by proxies placed after them. In this case they cannot be removed. What is done here is that upon each new named section creation, if another one is found with the same name, its config location is stored into the new proxy's {prev_file,prev_line} pair, and the old section is either destroyed if its refcount is null, or just unindexed. The dup check when creating a new proxy now consists in checking the prev_line instead of performing a dup lookup on the defaults section. This will guarantee that we can't find duplicate defaults sections in their tree anymore, while still keeping track of what's allocated and releasing everything upon exit. Beyond the consistency gain, there are nice savings for large configs involving many defaults sections: a test with 300k sections saved about 1.9 GB of RAM, and started 25% faster likely thanks to spending less time allocating memory.	2024-09-20 16:35:32 +02:00
Willy Tarreau	c8b813771d	MINOR: proxy: add a list of orphaned defaults sections We'll soon delete unreferenced and duplicated named defaults sections from the list of proxies. The problem with this is that this list (in fact a name-based tree) is used to release all of them at the end. Let's add a list of orphaned defaults sections, typically those containing "http-check send" statements or various other rules, and that are implicitly inherited by a proxy hence have a non-zero refcount while also having a name. These now makes it possible to remove them from the name index while still keeping their memory around for the lifetime of the process, and cleaning it at the end.	2024-09-20 15:59:04 +02:00
Willy Tarreau	b325453c36	MINOR: proxy: use the global file names for conf->file Proxy file names are assigned a bit everywhere (resolvers, peers, cli, logs, proxy). All these elements were enumerated and now use copy_file_name(). The only ha_free() call was turned to drop_file_name(). As a bonus side effect, a 300k backend config saved 14 MB of RAM.	2024-09-19 15:38:19 +02:00
Willy Tarreau	9ab21a3c2d	CLEANUP: stick-table: make the file location point to a global file name The file name used to point to the calling function's stack for stick tables, which was OK during parsing but remained dangling afterwards. At least it was already marked const so as not to accidentally free it. Let's make it point to a file_name_node now.	2024-09-19 15:38:19 +02:00
Willy Tarreau	d6c060c5ae	MINOR: tools: add minimal file name management In proxies, stick-tables, servers, etc... at plenty of places we store a file name and a line number. Some file names are the result of strdup() (e.g. in proxies), others not (e.g. stick-tables) and leave dangling pointers at the end of parsing. The risk of double-free is not null either. In order to stop this, let's first add a simple tool that allows to register short strings inside a global list, these strings happening to be server names. The strings are either duplicated and stored upon failure to find them, or just added to this storage. Since file names are not expected to disappear before the end of the process, for now we don't even implement refcounting, and we free them all at the end. There's already a drop_file_name() function to reset the pointer like ha_free() used to do, and even if not strictly needed it's a good habit to get used to doing it. The strings are returned as const so that they're stored as-is in structs, and that nasty free() calls are easily caught. The pointer points to the char[] storage inside the node itself. This way later if we want to implement refcounting, it will be trivial to just look up a string and change its associated node's refcount. If needed, comparisons can also be made on pointers. For now they're not used yet and are released on deinit().	2024-09-19 15:36:58 +02:00
Willy Tarreau	8df44eea6d	BUILD: cebtree: silence a bogus gcc warning on impossible code paths gcc-12 and above report a wrong warning about a negative length being passed to memcmp() on an impossible code path when built at -O0. The pattern is the same at a few places, basically: int foo(int op, const void *a, const void *b, size_t size, size_t arg) { if (op == 1) // arg is a strict multiple of size return memcmp(a, b, arg - size); return 0; } ... int bar() { return foo(0, a, b, sizeof(something), 0); } It might be possible to invent dummy values for the "len" argument above in the real code, but that significantly complexifies it and as usual can easily result in introducing undesired bugs. Here we take a different approach consisting in shutting the -Wstringop-overread warning on gcc>=12 at -O0 since that's the only condition that triggers it. The issue was reported to and confirmed by the gcc team here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114622 No backport needed, but this should be upstreamed into cebtree after checking that all involved macros are available.	2024-09-18 17:42:52 +02:00
Willy Tarreau	f793845f4a	MEDIUM: clock: collect the monotonic time in clock_local_update_date() Now we collect this clock in clock_local_update_date(), the closest from the poller, which is also used when busy-polling, and the values is set into the thread's curr_mono_time which did not exist before. Later, clock_leaving_poll() just sets the prev_mono_time value from the curr_ one instead of retrieving the time at this specific point. It also means that the monotonic time will now also cover the time needed to update the global time, which should be negligible. Note that we don't collect the CPU time in the clock_local_update_date() function even though it's tempting, because when doing busy-polling, it would be collected on each round while being useless. Doing so will make sure that the local time always knows the monotonic time when it is available.	2024-09-17 09:08:10 +02:00
Christopher Faulet	5fc12b0afd	BUG/MEDIUM: sc_strm/applet: Wake applet after a successful synchronous send On a synchronous send from the stream to an applet, if some data were sent, we must take care to wake the applet up. It is important because if everything was sent at this stage, there is no other chance to wake the applet up, mainly because SE_FL_WAIT_DATA flag is set on the applet's sedesc in sc_update_tx() at the end of process_stream(). This flag prevents any wakeup of the applet for a send event. It is not necessary for a mux because the mux stream is called when a synchronous send from the stream is performed. So it is responsible to wake the mux connection if necessary. This patch must be backported to 3.0.	2024-09-16 22:55:40 +02:00
Willy Tarreau	5d350d1e50	OPTIM: vars: use multiple name heads in the vars struct Given that the original list-based version was using a list head as the root of the variables, while the tree is using a single pointer, it made sense to reuse that space to place multiple roots, indexed on the lower bits of the name hash. Two roots slightly increase the performance level, but the best gain is obtained with 4 roots. The performance is now always above that of the list, even with small counts, and with 100 vars, it's 21% higher than before, or 67% higher than with the list. We keep the same lock (it could have made sense to use one lock per head), because most of the variables in large configs are attached to a stream or a session, hence are not shared between threads. Thus there's no point in sharding the pointer.	2024-09-15 23:51:51 +02:00
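The root selection described in the commit above (several roots, indexed on the lower bits of the name hash) can be sketched as follows. This is an illustration under stated assumptions, not HAProxy's code: all `ex_` names are hypothetical, and each root here is a plain singly linked chain standing in for the cebtree used by the real implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_NB_ROOTS 4 /* power of two; the commit reports the best gain with 4 roots */

struct ex_var {
    uint64_t name_hash;  /* 64-bit hash of the variable's name */
    int value;
    struct ex_var *next; /* simple chain; stands in for a tree node */
};

struct ex_vars {
    struct ex_var *root[EX_NB_ROOTS];
};

/* Select a root from the lower bits of the name hash. */
static unsigned ex_root_idx(uint64_t name_hash)
{
    return (unsigned)(name_hash & (EX_NB_ROOTS - 1));
}

/* Insert a variable under the root selected by its hash. */
static void ex_var_insert(struct ex_vars *vars, struct ex_var *v)
{
    unsigned idx;

    assert(vars != NULL && v != NULL);
    idx = ex_root_idx(v->name_hash);
    v->next = vars->root[idx];
    vars->root[idx] = v;
}

/* Look a variable up by hash; only one root has to be visited. */
static struct ex_var *ex_var_lookup(const struct ex_vars *vars, uint64_t name_hash)
{
    struct ex_var *v;

    assert(vars != NULL);
    for (v = vars->root[ex_root_idx(name_hash)]; v; v = v->next)
        if (v->name_hash == name_hash)
            return v;
    return NULL;
}
```

Each lookup touches only the entries whose hash shares the same low bits, which is where the gain from splitting one root into several comes from.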
Willy Tarreau	47ec7c681e	OPTIM: vars: use a cebtree instead of a list for variable names Configs involving many variables can start to eat a lot of CPU in name lookups. The reason is that the names themselves are dynamic in that they are relative to dynamic objects (sessions, streams, etc), so there's no fixed index for example. The current implementation relies on a standard linked list, and in order to speed up lookups and avoid comparing strings, only a 64-bit hash of the variable's name is stored and compared everywhere. But with just 100 variables and 1000 accesses in a config, it's clearly visible that variable name lookup can reach 56% CPU with a config generated this way: for i in {0..100}; do printf "\thttp-request set-var(txn.var%04d) int(%d)" $i $i; for j in {1..10}; do [ $i -lt $j ] \|\| printf ",add(txn.var%04d)" $((i-j)); done; echo; done The performance and a 4-core skylake 4.4 GHz reaches 85k RPS with a perf profile showing: Samples: 170K of event 'cycles', Event count (approx.): 142378815419 Overhead Shared Object Symbol 56.39% haproxy [.] var_to_smp 6.65% haproxy [.] var_set.part.0 5.76% haproxy [.] sample_process_cnv 3.23% haproxy [.] sample_conv_var2smp 2.88% haproxy [.] sample_conv_arith_add 2.33% haproxy [.] __pool_alloc 2.19% haproxy [.] action_store 2.13% haproxy [.] vars_get_by_desc 1.87% haproxy [.] smp_dup [above, var_to_smp() calls var_get() under the read lock]. By switching to a binary tree, the cost is significantly lower, the performance reaches 117k RPS (+37%) with this profile: Samples: 170K of event 'cycles', Event count (approx.): 142323631229 Overhead Shared Object Symbol 40.22% haproxy [.] cebu64_lookup 7.12% haproxy [.] sample_process_cnv 6.15% haproxy [.] var_to_smp 4.75% haproxy [.] cebu64_insert 3.79% haproxy [.] sample_conv_var2smp 3.40% haproxy [.] cebu64_delete 3.10% haproxy [.] sample_conv_arith_add 2.36% haproxy [.] action_store 2.32% haproxy [.] __pool_alloc 2.08% haproxy [.] vars_get_by_desc 1.96% haproxy [.] 
smp_dup 1.75% haproxy [.] var_set.part.0 1.74% haproxy [.] cebu64_first 1.07% [kernel] [k] aq_hw_read_reg 1.03% haproxy [.] pool_put_to_cache 1.00% haproxy [.] sample_process The performance lowers a bit earlier than with the list however. What can be seen is that the performance maintains a plateau till 25 vars, starts degrading a little bit for the tree while it remains stable till 28 vars for the list. Then both cross at 42 vars and the list continues to degrade doing a hyperbole while the tree resists better. The biggest loss is at around 32 variables where the list stays 10% higher. Regardless, given the extremely narrow band where the list is better, it looks relevant to switch to this in order to preserve the almost linear performance of large setups. For example at 1000 variables and 10k lookups, the tree is 18 times faster than the list. In addition this reduces the size of the struct vars by 8 bytes since there's a single pointer, though it could make sense to re-invest them into a secondary head for example.	2024-09-15 23:49:01 +02:00
Willy Tarreau	a0205f9de4	IMPORT: import cebtree (compact elastic binary trees) This is an import of the compact elastic binary trees at commit a9cd84a ("OPTIM: descent: better prefetch less and for writes when deleting") These will be used to replace certain lists (and possibly certain tree nodes as well). They're as fast (or even faster) than ebtrees for lookups, as fast for insertion and slower for deletion, and a node only uses 2 pointers (like a list). The only changes were cebtree.h where common/tools.h was replaced with ebtree.h which we already have and already provides the needed functions and macros, and the addition of a wrapper cebtree-prv.h in src/ to redirect to import/cebtree-prv.h.	2024-09-15 23:44:59 +02:00
Willy Tarreau	6e92988e20	MINOR: vars: remove the emptiness tests in callers before pruning All callers of vars_prune_* currently check the list for emptiness. Let's leave that to vars_prune() itself, it will ease some changes in the code. Thanks to the previous inlining of the vars_prune() function, there's no performance loss, and even a very tiny 0.1% gain.	2024-09-15 23:44:16 +02:00
Willy Tarreau	2c1a9c3a43	OPTIM: vars: inline vars_prune() to avoid many calls Many configs don't have variables and call it for no reason, and even configs with variables don't necessarily have some in all scopes.	2024-09-15 23:42:09 +02:00
Willy Tarreau	b11495652e	BUG/MEDIUM: queue: implement a flag to check for the dequeuing As unveiled in GH issue #2711, commit `5541d4995d` ("BUG/MEDIUM: queue: deal with a rare TOCTOU in assign_server_and_queue()") does have some side effects in that it can occasionally cause an endless loop. As Christopher analysed it, the problem is that process_srv_queue(), which uses a trylock in order to leave only one thread in charge of the dequeueing process, can lose the lock race against pendconn_add(). If this happens on the last served request, then there's no more thread to deal with the dequeuing, and assign_server_and_queue() will loop forever on a condition that was initially expected to be extremely rare (and still is, except that now it can become sticky). Previously what was happening is that such queued requests would just time out and since that was very rare, nobody would notice. The root of the problem really is that trylock. It was added so that only one thread dequeues at a time but it doesn't offer only that guarantee since it also prevents a thread from dequeuing if another one is in the process of queuing. We need a different criterion. What we're doing now is to set a flag "dequeuing" in the server, which indicates that one thread is currently in the process of dequeuing requests. This one is atomically tested, and only if no thread is in this process, then the thread grabs the queue's lock and dequeues. This way it will be serialized with pendconn_add() and no request addition will be missed. It is not certain whether the original race covered by the fix above can still happen with this change, so better keep that fix for now. Thanks to @Yenya (Jan Kasprzak) for the precise and complete report allowing to spot the problem. This patch should be backported wherever the patch above was backported.	2024-09-13 08:35:47 +02:00
Aurelien DARRAGON	68cfb222b5	BUG/MEDIUM: pattern: prevent UAF on reused pattern expr Since `c5959fd` ("MEDIUM: pattern: merge same pattern"), UAF (leading to crash) can be experienced if the same pattern file (and match method) is used in two default sections and the first one is not referenced later in the config. In this case, the first default section will be cleaned up. However, due to an unhandled case in the above optimization, the original expr which the second default section relies on is mistakenly freed. This issue was discovered while trying to reproduce GH #2708. The issue was particularly tricky to reproduce given the config and sequence required to make the UAF happen. Hopefully, GitHub user @asmnek not only provided useful information, but since he was able to consistently trigger the crash in his environment he was able to nail down the crash to the use of pattern file involved with 2 named default sections. Big thanks to him. To fix the issue, let's push the logic from `c5959fd` a bit further. Instead of relying on "do_free" variable to know if the expression should be freed or not (which proved to be insufficient in our case), let's switch to a simple refcounting logic. This way, no matter who owns the expression, the last one attempting to free it will be responsible for freeing it. Refcount is implemented using a 32bit value which fills a previous 4 bytes structure gap: int mflags; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int lock; /* 88 8 */ (output from pahole) Even though it was not reproduced in 2.6 or below by @asmnek (the bug was revealed thanks to another bugfix), this issue theoretically affects all stable versions (up to `c5959fd`), thus it should be backported to all stable versions.	2024-09-09 16:07:05 +02:00
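The "dequeuing" flag approach described in the commit above can be sketched with C11 atomics. This is only an illustration: the `ex_` names and the counters are hypothetical, the queue lock taken by the real code is reduced to a comment, and HAProxy uses its own atomic primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

struct ex_server {
    atomic_flag dequeuing; /* set while one thread is dequeuing */
    int queued;            /* pending requests (protected by the queue lock in real code) */
    int served;
};

/* Returns the number of requests dequeued, or -1 if another thread is
 * already in charge of dequeuing. Unlike a trylock on the queue itself,
 * the flag does not reject the caller merely because another thread is
 * adding a request to the queue. */
static int ex_process_queue(struct ex_server *s)
{
    int n = 0;

    assert(s != 0);
    if (atomic_flag_test_and_set(&s->dequeuing))
        return -1;

    /* The real code would grab the queue's lock here, which serializes
     * the dequeuer with concurrent additions. */
    while (s->queued > 0) {
        s->queued--;
        s->served++;
        n++;
    }

    atomic_flag_clear(&s->dequeuing);
    return n;
}
```

Only the thread that flips the flag from clear to set proceeds; any other caller returns at once and leaves the work to the thread already dequeuing.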
Aaron Kuehler	50322dff81	MEDIUM: server: add init-state Allow the user to set the "initial state" of a server. Context: Servers are always set in an UP status by default. In some cases, further checks are required to determine if the server is ready to receive client traffic. This introduces the "init-state {up\|down}" configuration parameter to the server. - when set to 'fully-up', the server is considered immediately available and can turn to the DOWN state when ALL health checks fail. - when set to 'up' (the default), the server is considered immediately available and will initiate a health check that can turn it to the DOWN state immediately if it fails. - when set to 'down', the server initially is considered unavailable and will initiate a health check that can turn it to the UP state immediately if it succeeds. - when set to 'fully-down', the server is initially considered unavailable and can turn to the UP state when ALL health checks succeed. The server's init-state is considered when the HAProxy instance is (re)started, a new server is detected (for example via service discovery / DNS resolution), a server exits maintenance, etc. Link: https://github.com/haproxy/haproxy/issues/51	2024-09-05 11:13:10 +02:00
Ilya Shipitsin	1f6e5f7a61	CLEANUP: assorted typo fixes in the code and comments This is the 43rd iteration of typo fixes	2024-09-03 17:49:21 +02:00
Christopher Faulet	a7f6b0ac03	MEDIUM: stick-table: Add support of a factor for IN/OUT bytes rates Add a factor parameter to stick-tables, called "brates-factor", that is applied to in/out bytes rates to work around the 32-bit limit of the frequency counters. Thanks to this factor, it is possible to have bytes rates beyond 4GB. Instead of counting each byte, we count blocks of bytes. Among other things, it will be useful for the bwlim filter, to be able to configure shared limits exceeding 4GB/s. For now, this parameter must be in the range ]0-1024].	2024-09-02 15:50:25 +02:00
Aperence	20efb856e1	MEDIUM: protocol: add MPTCP per address support Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension that enables a TCP connection to use different paths. Multipath TCP has been used for several use cases. On smartphones, MPTCP enables seamless handovers between cellular and Wi-Fi networks while preserving established connections. This use-case is what pushed Apple to use MPTCP since 2013 in multiple applications [2]. On dual-stack hosts, Multipath TCP enables the TCP connection to automatically use the best performing path, either IPv4 or IPv6. If one path fails, MPTCP automatically uses the other path. To benefit from MPTCP, both the client and the server have to support it. Multipath TCP is a backward-compatible TCP extension that is enabled by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...). Multipath TCP is included in the Linux kernel since version 5.6 [3]. To use it on Linux, an application must explicitly enable it when creating the socket. No need to change anything else in the application. This attached patch adds MPTCP per address support, to be used with: mptcp{,4,6}@<address>[:port1[-port2]] MPTCP v4 and v6 protocols have been added: they are mainly a copy of the TCP ones, with small differences: names, proto, and receivers lists. These protocols are stored in __protocol_by_family, as an alternative to TCP, similar to what has been done with QUIC. By doing that, the size of __protocol_by_family has not been increased, and it behaves like TCP. MPTCP is both supported for the frontend and backend sides. Also added an example of configuration using mptcp along with a backend allowing to experiment with it. Note that this is a re-implementation of Björn's work from 3 years ago [4], when haproxy's internals were probably less ready to deal with this, causing his work to be left pending for a while. Currently, the TCP_MAXSEG socket option doesn't seem to be supported with MPTCP [5]. 
This results in a warning when trying to set the MSS of sockets in proto_tcp:tcp_bind_listener. This can be resolved by adding two new variables: sock_inet(6)_mptcp_maxseg_default that will hold the default value of the TCP_MAXSEG option. Note that for the moment, this will always be -1 as the option isn't supported. However, in the future, when the support for this option will be added, it should contain the correct value for the MSS, allowing to correctly set the TCP_MAXSEG option. Link: https://www.rfc-editor.org/rfc/rfc8684.html [1] Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2] Link: https://www.mptcp.dev [3] Link: https://github.com/haproxy/haproxy/issues/1028 [4] Link: https://github.com/multipath-tcp/mptcp_net-next/issues/515 [5] Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be> Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>	2024-08-30 18:53:49 +02:00
Aperence	38618822e1	MINOR: server: add a alt_proto field for server Add a new field alt_proto to the server structures that specifies whether an alternate protocol should be used for this server. This field can be transparently passed to protocol_lookup to get an appropriate protocol structure. This change thus allows creating servers with different protocols, and no longer only TCP.	2024-08-30 18:53:49 +02:00
Aperence	a7b04e383a	MINOR: tools: extend str2sa_range to add an alt parameter Add a new parameter "alt" that will store whether this configuration uses an alternate protocol. This alt pointer will contain a value that can be transparently passed to protocol_lookup to obtain an appropriate protocol structure. This change is needed to allow, for example, the servers to know whether they need to use an alternate protocol or not.	2024-08-30 18:53:49 +02:00
Frederic Lecaille	f627b9272b	BUG/MEDIUM: quic: always validate sender address on 0-RTT Wedl Michael, a student at the University of Applied Sciences St. Poelten, reported a potential vulnerability in haproxy, as described below. An attacker could have obtained a TLS session ticket after having established a connection to an haproxy QUIC listener, using its real IP address. The attacker does not even have to send an application level request (HTTP3). Then the attacker could open a 0-RTT session with a spoofed IP address trusted by the QUIC listener to bypass IP allow/block lists and send HTTP3 requests. To mitigate this vulnerability, it was decided to use a token which can be provided to the client each time it successfully manages to connect to haproxy. These tokens may be reused for future connections to validate the address/path of the remote peer, as is done with the Retry token, which is used for the current connection, not the next one. Such tokens are transported by NEW_TOKEN frames, which were not used by haproxy until now. So, each time a client connects to an haproxy QUIC listener with 0-RTT enabled, it is provided with such a token which can be reused for the next 0-RTT session. If no such token is presented by the client, haproxy checks if the session is a 0-RTT one, that is, with early-data presented by the client. Contrary to the Retry token, the decision to refuse the connection is made only when the TLS stack has been provided with enough early-data from the Initial ClientHello TLS message and when these data have been accepted. Hopefully, this event arrives fast enough to allow haproxy to kill the connection if some early-data have been accepted without a token presented by the client. quic_build_post_handshake_frames() has been modified to build a NEW_TOKEN frame with this newly implemented token to be transported inside. 
quic_tls_derive_retry_token_secret() was renamed to quic_do_tls_derive_token_secret() and modified to be reused and derive the secret for the new token implementation. quic_token_validate() has been implemented to validate both the Retry and the new token implemented by this patch. When this is a non-retry token which could not be validated, the datagram received is marked as requiring a Retry packet to be sent, and no connection is created. When the Initial packet does not embed any non-retry token and if 0-RTT is enabled the connection is marked with this new flag: QUIC_FL_CONN_NO_TOKEN_RCVD. As soon as the TLS stack detects that some early-data have been provided and accepted by the client, the connection is marked to be killed (QUIC_FL_CONN_TO_KILL) from ha_quic_add_handshake_data(). This is done by calling the new qc_ssl_eary_data_accepted() function. The secret TLS handshake is interrupted as soon as possible by returning 0 from ha_quic_add_handshake_data(). The connection is also marked as requiring a Retry packet to be sent (QUIC_FL_CONN_SEND_RETRY) from ha_quic_add_handshake_data(). Then the handshake I/O handler (quic_conn_io_cb()) knows how to behave: kill the connection after having sent a Retry packet. About TLS stack compatibility, this patch is supported by aws-lc. It is disabled for wolfssl which does not support 0-RTT at this time thanks to HAVE_SSL_0RTT_QUIC. This patch depends on these commits: MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event. MINOR: quic: Implement qc_ssl_eary_data_accepted(). MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct) BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder MINOR: quic: Token for future connections implementation. MINOR: quic: Implement quic_tls_derive_token_secret(). MINOR: tools: Implement ipaddrcpy(). Must be backported as far as 2.6.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	609b124561	MINOR: quic: Implement qc_ssl_eary_data_accepted(). This function is a wrapper around SSL_get_early_data_status() for OpenSSL derived stacks and SSL_early_data_accepted() for boringSSL derived stacks like AWS-LC. It returns true for a TLS server if it has accepted the early data received from a client. Also implement quic_ssl_early_data_status_str() which is dedicated to be used for debugging purposes (traces). This function converts the enum returned by the two functions mentioned above to a human readable string.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	e926378375	MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct) Modify qf_new_token structure to use a static buffer with QUIC_TOKEN_LEN as size as defined by the token for future connections (quic_token.c). Modify consequently the NEW_TOKEN frame parser (see quic_parse_new_token_frame()). Also add comments to denote that the NEW_TOKEN parser function is used only by clients and that its builder is used only by servers.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	f5b09dc452	MINOR: quic: Token for future connections implementation. There exist two sorts of token used by QUIC. They are both used to validate the peer address (path validation). Retry tokens are used for the current connection the client wants to open. This patch implements the other sort of token which, after having been received from a connection, may be provided for the next connection from the same IP address to validate it (or validate the network path between the client and the server). The token generation is implemented by quic_generate_token(), and the token validation by quic_token_chek(). The same method is used as for Retry tokens to build such tokens to be reused for future connections. The format is very simple: one byte for the format identifier to distinguish these new tokens from the Retry token, followed by a 32-bit timestamp. As this part is ciphered with AEAD as cryptographic algorithm, 16 bytes are needed for the AEAD tag. 16 more random bytes are added to this token as a salt to derive the AEAD secret used to cipher the token. In addition to this salt, the client IP address is also used as AAD to derive the AEAD secret. So, the length of the token is fixed: 37 bytes.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	74caa0eece	MINOR: quic: Implement quic_tls_derive_token_secret(). This function is similar to quic_tls_derive_retry_token_secret(). Its aim is to derive the secret used to cipher the token to be used for future connections. This patch renames quic_tls_derive_retry_token_secret() and reuses its code to produce a more generic function: quic_do_tls_derive_token_secret(). Two arguments are added to the latter to produce both quic_tls_derive_retry_token_secret() and the new quic_tls_derive_token_secret() function, which call quic_do_tls_derive_token_secret().	2024-08-30 17:04:09 +02:00
Frederic Lecaille	fb7a092203	MINOR: tools: Implement ipaddrcpy(). Implement ipaddrcpy() new function to copy only the IP address from a sockaddr_storage struct object into a buffer.	2024-08-30 17:04:09 +02:00
Nicolas CARPi	a33407b499	CLEANUP: mqtt: fix typo in MQTT_REMAINING_LENGHT_MAX_SIZE There was a typo in the macro name, where LENGTH was incorrectly written. This didn't cause any issue because the typo appeared in all occurrences in the codebase.	2024-08-30 14:58:59 +02:00
Christopher Faulet	62c9d51ca4	BUG/MINIR: proxy: Match on 429 status when trying to perform a L7 retry Support for 429 was recently added to L7 retries (`0d142e075` "MINOR: proxy: Add support of 429-Too-Many-Requests in retry-on status"). But the l7_status_match() function was not properly updated. The switch statement must match the 429 status to be able to perform a L7 retry. This patch must be backported if the commit above is backported. It is related to #2687.	2024-08-30 12:13:32 +02:00
Christopher Faulet	0d142e0756	MINOR: proxy: Add support of 429-Too-Many-Requests in retry-on status The "429" status can now be specified on retry-on directives. PR_RE_* flags were updated to remain sorted. This patch should fix the issue #2687. It is quite simple so it may safely be backported to 3.0 if necessary.	2024-08-28 10:05:34 +02:00
William Lallemand	e8fecef0ff	MEDIUM: ssl: capture the signature_algorithms extension from Client Hello Activate the capture of the TLS signature_algorithms extension from the Client Hello. This list is stored in the ssl_capture buffer when the global option "tune.ssl.capture-cipherlist-size" is enabled.	2024-08-26 15:17:40 +02:00
William Lallemand	ce7fb6628e	MEDIUM: ssl: capture the supported_versions extension from Client Hello Activate the capture of the TLS supported_versions extension from the Client Hello. This list is stored in the ssl_capture buffer when the global option "tune.ssl.capture-cipherlist-size" is enabled.	2024-08-26 15:12:42 +02:00
Valentine Krasnobaeva	7b78e1571b	MINOR: mworker: restore initial env before wait mode This patch is the follow-up of `1811d2a6ba` (MINOR: tools: add helpers to backup/clean/restore env). In order to avoid unexpected behaviour in master-worker mode during the process reload with a new configuration, when the old one contained '*env' keywords, let's backup its initial environment before calling parse_cfg() and let's clean and restore it in the context of the master process, just before it enters a wait polling loop. This will guarantee that new workers will have a new updated environment and not the previous one inherited from the master, which does not read the configuration when it's in wait mode.	2024-08-23 17:06:59 +02:00
Valentine Krasnobaeva	1811d2a6ba	MINOR: tools: add helpers to backup/clean/restore env 'setenv', 'presetenv', 'unsetenv', 'resetenv' keywords in configuration could modify the process runtime environment. In master-worker mode this creates a problem, as the configuration is read only once before forking a worker and then the master process does the reexec without reading any config files, just to free the memory. So, during the reload a new worker process will be created, but it will inherit the previous unchanged environment from the master in wait mode, thus it won't benefit from the changes in configuration related to '*env' keywords. This may cause unexpected behavior or some parser errors in master-worker mode. So, let's add a helper to backup all process env variables just before it reads its configuration. And let's also add helpers to clean up the current runtime environment and to restore it to its initial state (as it was before parsing the config).	2024-08-23 17:06:33 +02:00
Willy Tarreau	2a799b64b0	MINOR: protocol: add the real address family to the protocol For custom families, there's sometimes an underlying real address and it would be nice to be able to directly use the real family in calls to bind() and connect() without having to add explicit checks for exceptions everywhere. Let's add a .real_family field to struct proto_fam for this. For now it's always equal to the family except for non-transferable ones such as rhttp where it's equal to the custom one (anything else could fit).	2024-08-21 17:37:46 +02:00
Willy Tarreau	ba4a416c66	MINOR: protocol: add a family lookup At plenty of places we have access to an address family which may include some custom addresses but we cannot simply convert them to the real families without performing some random protocol lookups. Let's simply add a proto_fam table like we have for the protocols. The protocols could even be indexed there, but for now it's not worth it.	2024-08-21 16:46:15 +02:00
Willy Tarreau	732913f848	MINOR: protocol: properly assign the sock_domain and sock_family When we finally split sock_domain from sock_family in 2.3, something was not cleanly finished. The family is what should be stored in the address while the domain is what is supposed to be passed to socket(). But for the custom addresses, we did the opposite, just because the protocol_lookup() function was acting on the domain, not the family (both of which are equal for non-custom addresses). This is an API bug but there's no point backporting it since it does not have visible effects. It was visible in the code since a few places were using PF_UNIX while others were comparing the domain against AF_MAX instead of comparing the family. This patch clarifies this in the comments on top of proto_fam, addresses the indexing issue and properly reconfigures the two custom families.	2024-08-21 16:46:15 +02:00
Willy Tarreau	67bf1d6c9e	MINOR: quic: support a tolerance for spurious losses Tests performed between a 1 Gbps connected server and a 100 Mbps client, distant by 95ms showed that: - we need 1.1 MB in flight to fill the link - rare but inevitable losses are sufficient to make cubic's window collapse fast and long to recover - a 100 MB object takes 69s to download - tolerance for 1 loss between two ACKs suffices to shrink the download time to 20-22s - 2 losses go to 17-20s - 4 losses reach 14-17s At 100 concurrent connections that fill the server's link: - 0 loss tolerance shows 2-3% losses - 1 loss tolerance shows 3-5% losses - 2 loss tolerance shows 10-13% losses - 4 loss tolerance shows 23-29% losses As such while there can be a significant gain sometimes in setting this tolerance above zero, it can also significantly waste bandwidth by sending far more than can be received. While it's probably not a solution to real world problems, it repeatedly proved to be a very effective troubleshooting tool helping to figure different root causes of low transfer speeds. In spirit it is comparable to the no-cc congestion algorithm, i.e. it must not be used except for experimentation.	2024-08-21 08:34:30 +02:00
Willy Tarreau	fab0e99aa1	MINOR: quic: store the lost packets counter in the quic_cc_event element Upon loss detection, qc_release_lost_pkts() notifies congestion controllers about the event and its final time. However it does not pass the number of lost packets, that can provide useful hints for some controllers. Let's just pass this option.	2024-08-21 08:02:44 +02:00
Amaury Denoyelle	0d6112b40b	MINOR: mux-quic: retry after small buf alloc failure The previous commit switched to small buffers for HTTP/3 HEADERS emission. This ensures that several parallel streams can allocate their own buffer without hitting the connection buffer limit, now based on the congestion window size. However, this prevents the transmission of responses with uncommonly large headers. Indeed, if all headers cannot be encoded in a single buffer, an error is reported which causes the whole connection to be closed. Adjust this by implementing a realloc API exposed by QUIC MUX. This allows the application layer to switch from a small to a default buffer and restart its processing. This guarantees that, again, headers not longer than bufsize can be properly transferred.	2024-08-20 18:12:27 +02:00
Amaury Denoyelle	885e4c5cf8	MINOR: quic: support sbuf allocation in quic_stream This patch extends qc_stream_desc API to be able to allocate small buffers. QUIC MUX API is similarly updated as ultimately each application protocol is responsible to choose between a default or a smaller buffer. Internally, the type of allocated buffer is remembered via qc_stream_buf instance. This is mandatory to ensure that the buffer is released in the correct pool, in particular as small and standard buffers can be configured with the same size. This commit is purely an API change. For the moment, small buffers are not used. This will change in a dedicated patch.	2024-08-20 18:12:27 +02:00
Amaury Denoyelle	d0d8e57d47	MINOR: quic: define sbuf pool Define a new buffer pool reserved to allocate smaller memory areas. For the moment, its usage will be restricted to QUIC, as such it is declared in quic_stream module. Add a new config option "tune.bufsize.small" to specify the size of the allocated objects. A special check ensures that it is not greater than the default bufsize to avoid unexpected effects.	2024-08-20 18:12:27 +02:00
Amaury Denoyelle	1de5f718cf	MINOR: quic/config: adapt settings to new conn buffer limit QUIC MUX buffer allocation limit is now directly based on the underlying congestion window size. The previous static limit based on conn-tx-buffers is now unused. As such, this commit adds a warning to inform users that it is now obsolete. Secondly, update max-window-size setting. It is now the main entrypoint to limit both the maximum congestion window size and the number of QUIC MUX allocated buffers on emission. Remove its special value '0' which was used to automatically adjust it based on the now unused conn-tx-buffers.	2024-08-20 17:59:35 +02:00
Amaury Denoyelle	aeb8c1ddc3	MAJOR: mux-quic: allocate Tx buffers based on congestion window Each QUIC MUX may allocate buffers for MUX stream emission. These buffers are then shared with quic_conn to handle ACK reception and retransmission. A limit on the number of concurrent buffers used per connection has been defined statically and can be updated via a configuration option. This commit replaces the limit to instead use the current underlying congestion window size. The purpose of this change is to remove the artificial static buffer count limit, which may be difficult to choose. Indeed, if a connection performs with minimal loss rate, the buffer count would severely limit its throughput. It could be increased to fix this, but it also impacts other connections, even with less optimal performance, causing too much extra data buffering on the MUX layer. By using the dynamic congestion window size, haproxy ensures that MUX buffering corresponds roughly to the network conditions. Using QCC <buf_in_flight>, a new buffer can be allocated if it is less than the current window size. If not, QCS emission is interrupted and haproxy stream layer will subscribe until a new buffer is ready. One of the critical parts is to ensure that MUX layer previously blocked on buffer allocation is properly woken up when sending can be retried. This occurs on two occasions: * after an already used Tx buffer is cleared on ACK reception. This case is already handled by qcc_notify_buf() via quic_stream layer. * on congestion window increase. A new qcc_notify_buf() invocation is added into qc_notify_send(). Finally, remove <avail_bufs> QCC field which is now unused. This commit is labelled MAJOR as it may have unexpected effect and could cause significant behavior change. For example, in previous implementation QUIC MUX would be able to buffer more data even if the congestion window is small. 
With this patch, data cannot be transferred from the stream layer which may cause more streams to be shut down on client timeout. Another effect may be more CPU consumption as the connection limit would be hit more often, causing more streams to be interrupted and woken up in cycle.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	000976af58	MINOR: mux-quic: define buf_in_flight Define a new QCC counter named <buf_in_flight>. Its purpose is to account the current sum of all allocated stream buffer sizes used on emission. For the moment, this counter is updated on buffer allocation and deallocation. It will be used to replace <avail_bufs> once congestion window is used as limit for buffer allocation in a future commit.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	4c4bf26f44	MEDIUM: mux-quic: implement API to ignore txbuf limit for some streams Define a new qc_stream_desc flag QC_SD_FL_OOB_BUF. This is to mark streams which are not subject to the connection limit on allocated MUX stream buffer. The purpose is to simplify handling of QUIC MUX streams which do not transfer data and as such are not driven by haproxy layer, for example HTTP/3 control stream. These streams interact synchronously with QUIC MUX and cannot retry emission in case of temporary failure. This commit will be useful once connection buffer allocation limit is reimplemented to directly rely on the congestion window size. This will probably cause the buffer limit to be reached more frequently, maybe even on QUIC MUX initialization. As such, it will be possible to mark control streams and prevent them from being subject to the buffer limit. QUIC MUX exposes a new function qcs_send_metadata(). It can be used by an application protocol to specify which streams are used for control exchanges. For the moment, no such stream uses this mechanism.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	f4d1bd0b76	MINOR: mux-quic: account stream txbuf in QCC A limit per connection is put on the number of buffers allocated by QUIC MUX for emission across all its streams. This ensures memory consumption remains under control. This limit is simply explained as a count of buffers which can be concurrently allocated for each connection. As such, quic_conn structure was used to account currently allocated buffers. However, a quic_conn never allocates new stream buffers. This is only done at QUIC MUX layer. As such, this commit moves buffer accounting inside QCC structure. This simplifies the API, most notably qc_stream_buf_alloc() usage. Note that this commit inverts the accounting. Previously, it was initially set to 0 and incremented for each allocated buffer. Now, it is set to the maximum value and decremented for each buf usage. This is considered as clearer to use.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	c24c8667b2	MINOR: quic: define max-window-size config setting Define a new global keyword tune.quic.frontend.max-window-size. This allows to set globally the maximum congestion window size for each QUIC frontend connection. The default value is 0. It is a special value which automatically derives the size from the configured QUIC connection buffer limit. This is similar to the previous "quic-cc-algo" behavior, which can be used to override the maximum window size per bind line.	2024-08-20 17:02:29 +02:00
Valentine Krasnobaeva	8b1dfa9def	MINOR: cfgparse: limit file size loaded via /dev/stdin load_cfg_in_mem() can continuously reallocate memory in order to load an extremely large input from /dev/stdin, until it fails with ENOMEM, which means that process has consumed all available RAM. In case of containers and virtualized environments it's not very good. So, in order to prevent this, let's introduce MAX_CFG_SIZE as 10MB, which will limit the size of input supplied via /dev/stdin.	2024-08-20 14:28:34 +02:00
Nathan Wehrman	fd48b28315	MINOR: Implements new log format of option tcplog clf Some systems require log formats in the CLF format and that meant that I could not send my logs for proxies in mode tcp to those servers. This implements a format that uses log variables that are compatible with TCP mode frontends and replaces traditional HTTP values in the CLF format to make them stand out. Instead of logging method and URI like this "GET /example HTTP/1.1" it will log "TCP " and for a response code I used "000" so it would be easy to separate from legitimate HTTP traffic. Now your log servers that require a CLF format can see the timings for TCP traffic as well as HTTP.	2024-08-20 07:46:34 +02:00
Aurelien DARRAGON	f8299bc5ea	MINOR: log: "drop" support for log-profile steps It is now possible to use "drop" keyword for "on" lines under a log-profile section to specify that no log at all should be emitted for the specified step (setting an empty format was not sufficient to do so because only the log payload would be empty, not the log header, thus the log would still be emitted). It may be useful to selectively disable logging at specific steps for a given log target (since the log profile may be set on log directives): log-profile myprof on request format "blabla" sd "custom sd" on response drop New testcase was added to reg-tests/log/log_profiles.vtc	2024-08-19 18:53:01 +02:00
William Lallemand	b2a8e8731d	MINOR: channel: implement ci_insert() function ci_insert() is a function which allows to insert a string <str> of size <len> at <pos> of the input buffer. This is the equivalent of ci_insert_line2() but without inserting '\r\n'	2024-08-08 17:29:37 +02:00
Valentine Krasnobaeva	c6cfa7cb4a	MINOR: startup: rename readcfgfile in parse_cfg As readcfgfile no longer opens configuration files and reads them with fgets, but performs only the parsing of provided data, let's rename it to parse_cfg by analogy with read_cfg in haproxy.c.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	5b52df4c4d	MEDIUM: startup: load and parse configs from memory Let's call load_cfg_in_ram() helper for each configuration file to load its content in some area in memory. Adapt readcfgfile() parser function respectively. In order to limit changes in its scope we give as an argument a cfgfile structure, already filled in init_args() and in load_cfg_in_ram() with file metadata and content. Parser function (readcfgfile()) now uses fgets_from_mem() instead of standard fgets from libc implementations. SPOE filter parses its own configuration file, pointed by 'config' keyword in the configuration already loaded in memory. So, let's allocate and fill for this a supplementary cfgfile structure, which is not referenced in cfg_cfgfiles list. This structure and the memory with content of SPOE filter configuration are freed immediately in parse_spoe_flt(), when readcfgfile() returns. HAProxy OpenTracing filter also uses its own configuration file. So, let's follow the same logic as we do for SPOE filter.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	007f7f2f02	MINOR: tools: add fgets_from_mem Add fgets_from_mem() helper to read lines from configuration files, stored now as memory chunks. In order to limit changes in the first-level parser code (readcfgfile()), it is better to reimplement the standard fgets, i.e. to have a fgets, which can read the serialized data line by line from some memory area, instead of file stream, and can keep the same behaviour as libc implementations fgets.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	5b9ed6e4be	MINOR: cfgparse: add load_cfg_in_mem Add load_cfg_in_mem() helper, which allows to store the content of a given file in memory.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	bafb0ce272	MINOR: startup: adapt list_append_word to use cfgfile list_append_word() helper was used before only to chain configuration file names in a list. As now we start to use cfgfile structure which represents entire file in memory and its metadata, let's adapt this helper to use this structure and let's rename it to list_append_cfgfile(). Adapt functions, which process configuration files and directories to use cfgfile structure and list_append_cfgfile() instead of wordlist.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	39f2a19620	REORG: tools: move list_append_word to cfgparse Let's move list_append_word to cfgparse.c as it is used only to fill cfg_cfgfiles list with configuration file names.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	70b842e847	MINOR: cfgparse: add struct cfgfile to represent config in memory This and following commits serve to prepare loading configuration files in memory, before parsing them, as we may need to parse some parts of configuration at different moments of the startup sequence. This is the case for the new master-worker initialization process. Here we need to read at first only the global and the program sections and only after some steps (forking worker, etc) the rest of the configuration. Add a new structure cfgfile to keep configuration files metadata and content, loaded somewhere in memory. Instances of filled cfgfile structures could be chained in a list, as the order in which they were loaded is important.	2024-08-07 18:41:41 +02:00
Willy Tarreau	10c8baca44	MINOR: trace: add a per-source helper to pre-fill the context Now sources which want to do it can provide a helper that can pre-fill some fields in the context based on their knowledge (e.g. mux streams).	2024-08-07 16:02:59 +02:00
Willy Tarreau	7d55a70f5a	MINOR: trace: move the known trace context into a dedicated struct We now have a trace_ctx to hold the sess, conn, qc, stream and so on. This will allow us to pass it across layers so that other helpers can help fill them. Ideally it should be passed as an argument to __trace_enabled() by __trace() so that it can be passed back to the trace callback. But it seems that trace callbacks are smart enough to figure all their info when they need them.	2024-08-07 16:02:59 +02:00
Willy Tarreau	d465610ec3	MEDIUM: trace: implement a "follow" mechanism With "follow" from one source to another, it becomes possible for a source to automatically follow another source's tracked pointer. The best example is the session: - the "session" source is enabled and has a "lockon session" -> its lockon_ptr is equal to the session when valid - other sources (h1,h2,h3 etc) are configured for "follow session" and will then automatically check if session's lockon_ptr matches its own session, in which case tracing will be enabled for that trace (no state change). It's not necessary to start/pause/stop traces when using this, only "follow" followed by a source with lockon enabled is needed. Some combinations might work better than others. At the moment the session is almost never known from the backend, but this may improve. The meta-source "all" is supported for the follower so that all sources will follow the tracked one.	2024-08-07 16:02:59 +02:00
Amaury Denoyelle	9f829ea3f3	MINOR: mux-quic: measure QCS lifetime and its blocking state Reuse newly defined tot_time structure to measure various values related to a QCS lifetime. First, a timer is used to account for the total QCS lifetime. Then, two other timers are used to account for the total time during which Tx from stream layer to MUX is blocked, either on lack of buffer or due to flow-control. These three timers are reported in qmux_dump_qcs_info(). Thus, they are available in traces and for QUIC MUX debug string sample.	2024-08-07 15:40:52 +02:00
Amaury Denoyelle	a6e2523ca1	MINOR: time: define tot_time structure Define a new utility type tot_time. Its purpose is to be able to account elapsed time across multiple periods. Functions are defined to easily start and stop measures, and return the current value.	2024-08-07 15:40:52 +02:00
Amaury Denoyelle	663416b4ef	MINOR: quic: dump quic_conn debug string for logs Define a new xprt_ops callback named dump_info. This can be used to extend MUX debug string with infos from the lower layer. Implement dump_info for QUIC stack. For now, only minimal info are reported : bytes in flight and size of the sending window. This should allow to detect if the congestion controller is fine. These info are reported via QUIC MUX debug string sample.	2024-08-07 15:40:52 +02:00
Amaury Denoyelle	eb4dfa3b36	MINOR: mux-quic: define dump functions for QCC and QCS Extract trace code to dump QCC and QCS instances into dedicated functions named qmux_dump_qc{c,s}_info(). This will allow to easily print QCC/QCS infos outside of traces.	2024-08-07 15:40:52 +02:00
Willy Tarreau	921e04bf87	MINOR: stconn: add a new pair of sf functions {bs,fs}.debug_str These are passed to the underlying mux to retrieve debug information at the mux level (stream/connection) as a string that's meant to be added to logs. The API is quite complex just because we can't pass any info to the bottom function. So we construct a union and pass the argument as an int, and expect the callee to fill that with its buffer in return. Most likely the mux->ctl and ->sctl API should be reworked before the release to simplify this. The functions take an optional argument that is a bit mask of the layers to dump: muxs=1 muxc=2 xprt=4 conn=8 sock=16 The default (0) logs everything available.	2024-08-07 14:07:41 +02:00
Amaury Denoyelle	e177cf341c	BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM STREAM frames have dedicated handling on retransmission. A special check is done to remove data already acked in case of duplicated frames, thus only unacked data are retransmitted. This handling is faulty in case of an empty STREAM frame with FIN set. On retransmission, this frame does not cover any unacked range as it is empty and is thus discarded. This may cause the transfer to freeze with the client waiting indefinitely for the FIN notification. To handle retransmission of empty FIN STREAM frame, the qc_stream_desc layer has been extended. A new flag QC_SD_FL_WAIT_FOR_FIN is set by MUX QUIC when FIN has been transmitted. If set, it prevents qc_stream_desc from being freed until FIN is acknowledged. On retransmission side, qc_stream_frm_is_acked() has been updated. It now reports false if FIN bit is set on the frame and qc_stream_desc has QC_SD_FL_WAIT_FOR_FIN set. This must be backported up to 2.6. However, this heavily modifies the critical section for ACK handling and retransmission. As such, it must be backported only after a period of observation. This issue can be reproduced by using the following socat command as server to add delay between the response and connection closure : $ socat TCP-LISTEN:<port>,fork,reuseaddr,crlf SYSTEM:'echo "HTTP/1.1 200 OK"; echo ""; sleep 1;' On the client side, ngtcp2 can be used to simulate packet drop. Without this patch, connection will be interrupted on QUIC idle timeout or haproxy client timeout with ERR_DRAINING on ngtcp2 : $ ngtcp2-client --exit-on-all-streams-close -r 0.3 <host> <port> "http://<host>:<port>/?s=32o" Alternatively to ngtcp2 random loss, an extra haproxy patch can also be used to force skipping the emission of the empty STREAM frame : diff --git a/include/haproxy/quic_tx-t.h b/include/haproxy/quic_tx-t.h index efbdfe687..1ff899acd 100644 --- a/include/haproxy/quic_tx-t.h +++ b/include/haproxy/quic_tx-t.h @@ -26,6 +26,8 @@ extern struct pool_head pool_head_quic_cc_buf; /* Flag a sent packet as being probing with old data */ #define QUIC_FL_TX_PACKET_PROBE_WITH_OLD_DATA (1UL << 5) +#define QUIC_FL_TX_PACKET_SKIP_SENDTO (1UL << 6) + /* Structure to store enough information about TX QUIC packets. */ struct quic_tx_packet { /* List entry point. */ diff --git a/src/quic_tx.c b/src/quic_tx.c index 2f199ac3c..2702fc9b9 100644 --- a/src/quic_tx.c +++ b/src/quic_tx.c @@ -318,7 +318,7 @@ static int qc_send_ppkts(struct buffer buf, struct ssl_sock_ctx ctx) tmpbuf.size = tmpbuf.data = dglen; TRACE_PROTO("TX dgram", QUIC_EV_CONN_SPPKTS, qc); - if (!skip_sendto) { + if (!skip_sendto && !(first_pkt->flags & QUIC_FL_TX_PACKET_SKIP_SENDTO)) { int ret = qc_snd_buf(qc, &tmpbuf, tmpbuf.data, 0, gso); if (ret < 0) { if (gso && ret == -EIO) { @@ -354,6 +354,7 @@ static int qc_send_ppkts(struct buffer buf, struct ssl_sock_ctx ctx) qc->cntrs.sent_bytes_gso += ret; } } + first_pkt->flags &= ~QUIC_FL_TX_PACKET_SKIP_SENDTO; b_del(buf, dglen + QUIC_DGRAM_HEADLEN); qc->bytes.tx += tmpbuf.data; @@ -2066,6 +2067,17 @@ static int qc_do_build_pkt(unsigned char pos, const unsigned char *end, continue; } + switch (cf->type) { + case QUIC_FT_STREAM_8 ... QUIC_FT_STREAM_F: + if (!cf->stream.len && (qc->flags & QUIC_FL_CONN_TX_MUX_CONTEXT)) { + TRACE_USER("artificially drop packet with empty STREAM frame", QUIC_EV_CONN_TXPKT, qc); + pkt->flags |= QUIC_FL_TX_PACKET_SKIP_SENDTO; + } + break; + default: + break; + } + quic_tx_packet_refinc(pkt); cf->pkt = pkt; }	2024-08-07 11:03:32 +02:00
Amaury Denoyelle	714009b7bc	MINOR: quic: implement function to check if STREAM is fully acked When a STREAM frame is retransmitted, a check is performed to remove range of data already acked from it. This is useful when STREAM frames are duplicated and split to cover different data ranges. The newly retransmitted frame contains only unacked data. This process is performed similarly in qc_dup_pkt_frms() and qc_build_frms(). Refactor the code into a new function named qc_stream_frm_is_acked(). It returns true if frame data are already fully acked and retransmission can be avoided. If only a partial range of data is acknowledged, frame content is updated to only cover the unacked data. This patch does not have any functional change. However, it simplifies retransmission for STREAM frames. Also, it will be reused to fix retransmission for empty STREAM frames with FIN set from the following patch : BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM As such, it must be backported prior to it.	2024-08-07 10:57:10 +02:00
Amaury Denoyelle	bb9ac256a1	MINOR: quic: convert qc_stream_desc release field to flags qc_stream_desc had a field <release> used as a boolean. Convert it with a new <flags> field and QC_SD_FL_RELEASE value as equivalent. The purpose of this patch is to be able to extend qc_stream_desc by adding newer flags values. This patch is required for the following patch BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM As such, it must be backported prior to it.	2024-08-06 18:00:17 +02:00
Amaury Denoyelle	7b89aa5b19	BUG/MINOR: h1: do not forward h2c upgrade header token haproxy supports tunnel establishment through HTTP Upgrade mechanism. Since the following commit, extended CONNECT is also supported for HTTP/2 both on frontend and backend side. commit `9bf957335e` MEDIUM: mux_h2: generate Extended CONNECT from htx upgrade As specified by the HTTP/2 RFC, "h2c" can be used by an HTTP/1.1 client to request an upgrade to HTTP/2. In haproxy, this is not supported so it silently ignores this. However, Connection and Upgrade headers are forwarded as-is on the backend side. If using HTTP/1 on the backend side and the server supports this upgrade mechanism, haproxy won't be able to parse the HTTP response. If using HTTP/2, mux backend tries to incorrectly convert the request to an Extended CONNECT with h2c protocol, which may also prevent the response from being transmitted. To fix this, flag HTTP/1 request with "h2c" or "h2" token in an upgrade header. On converting the header list to HTX, the upgrade header is skipped if any of these tokens is present and the H1_MF_CONN_UPG flag is removed. This issue can easily be reproduced using curl --http2 argument to connect to an HTTP/1 frontend. This must be backported up to 2.4 after a period of observation.	2024-08-01 18:23:32 +02:00
Amaury Denoyelle	4b0bda42f7	MINOR: flags/mux-quic: decode qcc and qcs flags Decode QUIC MUX connection and stream elements via qcc_show_flags() and qcs_show_flags(). Flags definition have been moved outside of USE_QUIC to ease compilation of flags binary.	2024-07-31 17:59:35 +02:00
Frederic Lecaille	1733dff42a	MINOR: tcp_sample: Move TCP low level sample fetch function to control layer Add ->get_info() new control layer callback definition to protocol struct to retrieve statistical counter information at transport layer (TCPv4/TCPv6) identified by an integer into a long long int. Move the TCP specific code from get_tcp_info() to the tcp_get_info() control layer function (src/proto_tcp.c) and define it as the ->get_info() callback for TCPv4 and TCPv6. Note that get_tcp_info() is called for several TCP sample fetches. This patch is useful to support some of these sample fetches for QUIC and to keep the code simple and easy to maintain.	2024-07-31 10:29:42 +02:00
William Lallemand	f76e8e50f4	BUILD: ssl: replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC Replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC in the code source, so we won't need to set USE_OPENSSL_AWSLC in the Makefile on the long term.	2024-07-30 18:53:08 +02:00
William Lallemand	56eefd6827	BUG/MEDIUM: ssl: reactivate 0-RTT for AWS-LC Then reactivate HAVE_SSL_0RTT and HAVE_SSL_0RTT_QUIC for AWS-LC, which were wrongly deactivated in `f5353f2c` ("MINOR: ssl: add HAVE_SSL_0RTT constant"). Must be backported to 3.0.	2024-07-30 18:53:08 +02:00
Willy Tarreau	1a8f3a368f	MINOR: queue: add a function to check for TOCTOU after queueing There's a rare TOCTOU case that happens from time to time with maxconn 1 and multiple threads. Between the moment we see the queue full and the moment we queue a request, it's possible that the last request on the server or proxy ended and that no other one is left to offer it its place. Given that all this code path is performance-critical and we cannot afford to increase the lock duration, better recheck for the condition after queueing. For this we need to be able to check for the condition and cleanly dequeue a request. That's what this patch provides via the new function pendconn_must_try_again(). It will catch more requests than absolutely needed though it will catch them all. It may find that around 1/1000 of requests are at risk, though testing shows that in practice, it's around 1 per million that really gets stuck (other ones benefit from timing and finishing late requests). Maybe in the future some conditions might be refined but it's harmless. What happens to such requests is that they're dequeued and their pendconn freed, so that the caller can decide to try to LB or queue them again. For now the function is not used, it's just added separately for easier tracking.	2024-07-29 09:27:01 +02:00
Frederic Lecaille	76ff8afa2d	MINOR: quic: Add information to "show quic" for CUBIC cc. Add ->state_cli() new callback to quic_cc_algo struct to define a function called by the "show quic (cc|full)" commands to dump some information about the congestion algorithm internal state currently in use by the QUIC connections. Implement this callback for CUBIC algorithm to dump its internal variables: - K: (the time to reach the cubic curve inflexion point), - last_w_max: the last maximum window value reached before entering the last recovery period. This is also the window value at the inflexion point of the cubic curve, - wdiff: the difference between the current window value and last_w_max. So negative before the inflexion point, and positive after.	2024-07-26 16:42:44 +02:00
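
Commit 007f7f2f02 in the list above describes fgets_from_mem() as an fgets look-alike that reads lines from a memory area instead of a file stream. The following is a minimal sketch of that idea only; the function name mem_fgets(), its signature and the cursor convention are assumptions made for illustration and are not taken from the HAProxy sources.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not HAProxy's fgets_from_mem()): copies at most
 * size-1 bytes from the area [*pos, end) into dst, stopping after the
 * first '\n', always NUL-terminates dst, and advances *pos past the
 * bytes consumed. Returns dst, or NULL once the area is exhausted,
 * which mirrors how fgets() reports end of file.
 */
char *mem_fgets(char *dst, int size, const char **pos, const char *end)
{
	size_t len, max;
	const char *nl;

	if (size <= 0 || *pos >= end)
		return NULL;

	len = (size_t)(end - *pos);   /* bytes still available */
	max = (size_t)size - 1;       /* room left once the NUL is reserved */
	if (len > max)
		len = max;

	/* stop right after the newline if one is within reach */
	nl = memchr(*pos, '\n', len);
	if (nl)
		len = (size_t)(nl - *pos) + 1;

	memcpy(dst, *pos, len);
	dst[len] = '\0';
	*pos += len;
	return dst;
}
```

A caller would keep a cursor into the loaded configuration and call the helper in a loop until it returns NULL, exactly as it would loop over fgets() on a FILE pointer.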
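
Commit a6e2523ca1 in the list above introduces tot_time, a utility type that accounts elapsed time across multiple periods with start, stop and read functions. The sketch below shows one way such an accumulator could work; the struct layout, the function names and the choice to pass the current time explicitly in milliseconds are all assumptions for illustration, not HAProxy's actual definition.

```c
#include <stdint.h>

/* Hypothetical accumulator, loosely modelled on the commit description. */
struct tot_time_sketch {
	uint64_t total_ms;   /* time accumulated over all finished periods */
	uint64_t start_ms;   /* start of the running period, if any */
	int running;         /* non-zero while a period is being measured */
};

static void tts_reset(struct tot_time_sketch *t)
{
	t->total_ms = 0;
	t->start_ms = 0;
	t->running = 0;
}

/* Opens a new period; a second start without a stop is ignored. */
static void tts_start(struct tot_time_sketch *t, uint64_t now_ms)
{
	if (t->running)
		return;
	t->start_ms = now_ms;
	t->running = 1;
}

/* Closes the running period and adds its duration to the total. */
static void tts_stop(struct tot_time_sketch *t, uint64_t now_ms)
{
	if (!t->running)
		return;
	t->total_ms += now_ms - t->start_ms;
	t->running = 0;
}

/* Returns the total so far, including the period still running. */
static uint64_t tts_read(const struct tot_time_sketch *t, uint64_t now_ms)
{
	uint64_t tot = t->total_ms;

	if (t->running)
		tot += now_ms - t->start_ms;
	return tot;
}
```

Taking the clock value as a parameter keeps the sketch deterministic; a real implementation would more likely read the process-wide current time itself.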
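
Commits 714009b7bc and e177cf341c in the list above describe how a retransmitted STREAM frame is trimmed to its unacked range, and why an empty frame carrying FIN must not be treated as already acked while the FIN is still awaited. The sketch below illustrates that decision under simplifying assumptions: acknowledgements are modelled as a single contiguous offset, and every type, field and flag name here is hypothetical rather than copied from HAProxy.

```c
#include <stdint.h>

#define SK_SD_FL_WAIT_FOR_FIN 0x1u  /* hypothetical: FIN sent, not yet acked */

struct sk_stream_desc {
	uint64_t ack_offset;  /* assumption: all bytes below this are acked */
	unsigned int flags;
};

struct sk_stream_frm {
	uint64_t offset;      /* first byte carried by the frame */
	uint64_t len;         /* number of bytes carried, may be 0 */
	int fin;              /* non-zero if the FIN bit is set */
};

/* Returns 1 if the frame no longer needs to be retransmitted, else 0.
 * When only the head of the frame is acked, the frame is narrowed so
 * that it covers the unacked tail only.
 */
static int sk_stream_frm_is_acked(const struct sk_stream_desc *sd,
                                  struct sk_stream_frm *frm)
{
	uint64_t end = frm->offset + frm->len;

	if (end > sd->ack_offset) {
		/* some data is still unacked: drop the acked prefix, if any */
		if (frm->offset < sd->ack_offset) {
			frm->len = end - sd->ack_offset;
			frm->offset = sd->ack_offset;
		}
		return 0;
	}

	/* All data is acked (or the frame is empty). A pending FIN still
	 * has to reach the peer, so such a frame must be kept.
	 */
	if (frm->fin && (sd->flags & SK_SD_FL_WAIT_FOR_FIN))
		return 0;

	return 1;
}
```

Without the FIN test at the end, an empty FIN frame always looks fully acked and is dropped, which matches the freeze described in the bug report.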


8446 Commits