Commit Graph

150 Commits

Christopher Faulet
8bca3cc8c7 MEDIUM: checks: Stop scheduling healthchecks during stopping stage
When the process is stopping, the health-checks are suspended. However the
task is still periodically woken up for nothing. If there is a huge number
of health-checks and if they are woken up at the same time, it may lead to a
noticeable CPU consumption for no reason.

To avoid this extra CPU cost, we stop scheduling the health-check tasks
when the proxy is disabled or stopped.

This patch should partially solve the issue #2145.
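A minimal self-contained sketch of the idea (illustrative names only, not
the actual HAProxy code): a periodic job only rearms its timer while its
owner is still active, so a stopping instance no longer wakes it up.

  #include <stdbool.h>
  #include <stdint.h>

  #define EXPIRE_NEVER 0   /* sentinel: the task is not requeued */

  struct owner { bool disabled, stopped; };

  /* returns the next wakeup date in ms, or EXPIRE_NEVER to stop rearming */
  static uint32_t next_wakeup(const struct owner *o, uint32_t now_ms,
                              uint32_t interval_ms)
  {
          if (o->disabled || o->stopped)
                  return EXPIRE_NEVER;      /* no point waking up anymore */
          return now_ms + interval_ms;      /* normal periodic rearm */
  }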
2023-05-17 14:57:10 +02:00
Willy Tarreau
c7b9308f20 BUG/MINOR: clock: automatically adjust the internal clock with the boot time
This is a better and more general solution to the problem described in
this commit:

    BUG/MINOR: checks: postpone the startup of health checks by the boot time

Now we're updating the now_offset that is used to compute now_ms at the
few points where we update the ready date during boot. This ensures that
now_ms, while remaining stable during the whole boot process, will be
correct and will start from the boot value right after the boot is
finished. As such the patch above is rolled back (we don't want to count
the boot time twice).

This must not be backported because it relies on the more flexible clock
architecture in 2.8.
2023-05-17 09:33:54 +02:00
Willy Tarreau
8e978a094d BUG/MINOR: checks: postpone the startup of health checks by the boot time
When health checks are started at boot, now_ms could be off by the boot
time. In general it's not even noticeable, but with very large configs
taking up to one or even a few seconds to start, this can result in a
part of the servers' checks being scheduled slightly in the past. As
such all of them will start grouped, partially defeating the purpose of
the spread-checks setting. For example, this can cause a burst of
connections for the network, or an excess of CPU usage during SSL
handshakes, possibly even causing some timeouts to expire early.

Here in order to compensate for this, we simply add the known boot time
to the computed delay when scheduling the startup of checks. That's very
simple and particularly efficient. For example, a config with 5k servers
in 800 backends, checked every 5 seconds and taking 3.8 seconds to start,
previously showed this distribution of health checks despite
spread-checks 50:

   3690 08:59:25
    417 08:59:26
    213 08:59:27
     71 08:59:28
    428 08:59:29
    860 08:59:30
    918 08:59:31
    938 08:59:32
   1124 08:59:33
    904 08:59:34
    647 08:59:35
    890 08:59:36
    973 08:59:37
    856 08:59:38
    893 08:59:39
    154 08:59:40

Now with the fix it shows this:
    470 08:59:59
    929 09:00:00
    896 09:00:01
    937 09:00:02
    854 09:00:03
    827 09:00:04
    906 09:00:05
    863 09:00:06
    913 09:00:07
    873 09:00:08
    162 09:00:09

This should be backported to all supported versions. It depends on
this commit:

    MINOR: clock: measure the total boot time

For 2.8, where the internal clock is now totally independent of the human
one, a more generic fix will consist in simply updating now_ms to reflect
the startup time.
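A small sketch of the compensation (illustrative names, not the actual
patch): the initial check delay is computed as usual, then shifted by the
measured boot time so that none of it lands in the past.

  #include <stdint.h>

  /* first wakeup of a check, expressed as an offset from "now":
   * spread the start over the inter period, then add the boot time
   * so that no check is scheduled in the past after a long boot. */
  static uint32_t first_check_delay_ms(uint32_t spread_ms, uint32_t inter_ms,
                                       uint32_t boot_time_ms)
  {
          uint32_t spread = inter_ms ? spread_ms % inter_ms : 0;

          return spread + boot_time_ms;
  }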
2023-05-17 09:33:54 +02:00
Christopher Faulet
cb76030356 CLEANUP: check: Remove some useless assignments to NULL
In process_chk_conn(), some assignments to NULL are useless and are reported
by Coverity as unused values. While harmless, these assignments can be
removed.

This patch should fix Coverity report #2158.
2023-05-17 09:28:23 +02:00
Willy Tarreau
b93758cec9 MINOR: checks: make sure spread-checks is used also at boot time
This makes use of spread-checks also for the startup of the check tasks.
This provides a smoother load on startup for uneven configurations which
tend to enable only *some* servers. Below is the connection distribution
per second of the SSL checks of a config with 5k servers spread over 800
backends, with a check inter of 5 seconds:

- default:
    682 08:00:50
    826 08:00:51
    773 08:00:52
   1016 08:00:53
    885 08:00:54
    889 08:00:55
    825 08:00:56
    773 08:00:57
   1016 08:00:58
    884 08:00:59
    888 08:01:00
    491 08:01:01

- with spread-checks 50:
    437 08:01:19
    866 08:01:20
    777 08:01:21
   1023 08:01:22
   1118 08:01:23
    923 08:01:24
    641 08:01:25
    859 08:01:26
    962 08:01:27
    860 08:01:28
    929 08:01:29
    909 08:01:30
    866 08:01:31
    849 08:01:32
    114 08:01:33

- with spread-checks 50 + this patch:
    680 08:01:55
    922 08:01:56
    962 08:01:57
    899 08:01:58
    819 08:01:59
    843 08:02:00
    916 08:02:01
    896 08:02:02
    886 08:02:03
    846 08:02:04
    903 08:02:05
    894 08:02:06
    178 08:02:07

The load is much smoother from the start; this can help initial health
checks succeed when many target the same overloaded server, for example.
This could be backported as it should make borderline configs more
reliable across reloads.
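A hedged sketch of the startup spreading (illustrative, not the actual
code): each check's first wakeup is picked at a random point within a
window derived from the spread-checks percentage, instead of starting
them all at once.

  #include <stdint.h>
  #include <stdlib.h>

  /* pick the first wakeup date of a check within a random window derived
   * from the spread-checks percentage (e.g. 50 -> up to 50% of "inter") */
  static uint32_t first_check_date(uint32_t now_ms, uint32_t inter_ms,
                                   unsigned int spread_pct)
  {
          uint32_t window = inter_ms * spread_pct / 100;

          return now_ms + (window ? (uint32_t)rand() % window : 0);
  }

In HAProxy the random value comes from a lighter thread-local PRNG (see
the statistical_prng commit below) rather than rand(); only the shape of
the computation matters here.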
2023-05-17 08:10:40 +02:00
Aurelien DARRAGON
dcbc2d2cac MINOR: checks/event_hdl: SERVER_CHECK event
Adding a new event type: SERVER_CHECK.

This event is published when a server's check state ought to be reported
(check status change or check result).

The SERVER_CHECK event is provided as a server event with additional data
carrying relevant check context, such as the check's result and health.
2023-05-05 16:28:32 +02:00
Willy Tarreau
69530f59ae MEDIUM: clock: replace timeval "now" with integer "now_ns"
This puts an end to the occasional confusion between the "now" date
that is internal, monotonic and not synchronized with the system's
date, and "date" which is the system's date and not necessarily
monotonic. Variable "now" was removed and replaced with a 64-bit
integer "now_ns" which is a counter of nanoseconds. It wraps every
585 years, so if all goes well (i.e. if humanity does not need
haproxy anymore in 500 years), it will just never wrap. This implies
that now_ns is never null and that the zero value can reliably be used
as "not set yet" for a timestamp if needed. This will also simplify
date checks where it becomes possible again to do "date1<date2".

All occurrences of "tv_to_ns(&now)" were simply replaced by "now_ns".
Due to the intricacies between now, global_now and now_offset, all 3
had to be turned to nanoseconds at once. It's not a problem since all
of them were solely used in 3 functions in clock.c, but they make the
patch look bigger than it really is.

The clock_update_local_date() and clock_update_global_date() functions
are now much simpler as there's no need anymore to perform conversions
nor to round the timeval up or down.

The wrapping continues to happen by presetting the internal offset in
the short future so that the 32-bit now_ms continues to wrap 20 seconds
after boot.

The start_time used to calculate uptime can still be turned to
nanoseconds now. One open question concerns global_now_ms, which is used
only for the freq counters. It's unclear whether there's more value in
using two variables that need to be synchronized sequentially like today
or in just using global_now_ns divided by 1 million. Both approaches will
work equally well on modern systems; the difference might come from
smaller ones. Better not to change anything for now.

One benefit of the new approach is that we now have an internal date
with a resolution of the nanosecond and the precision of the microsecond,
which can be useful to extend some measurements given that timestamps
also have this resolution.
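A short, self-contained sketch of the arithmetic behind the new counter
(helper names are illustrative): a 64-bit nanosecond counter wraps only
after roughly 584 years, so zero can safely serve as a "not set" sentinel
and two dates can be compared directly.

  #include <stdint.h>

  /* timeval -> nanoseconds; 2^64 ns / (1e9 * 86400 * 365.25) ~= 584 years */
  static inline uint64_t tv_to_ns_sketch(int64_t sec, int64_t usec)
  {
          return (uint64_t)sec * 1000000000ULL + (uint64_t)usec * 1000ULL;
  }

  /* nanoseconds -> milliseconds, as used for the 32-bit now_ms view */
  static inline uint32_t ns_to_ms_sketch(uint64_t ns)
  {
          return (uint32_t)(ns / 1000000ULL);
  }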
2023-04-28 16:08:08 +02:00
Willy Tarreau
eed5da1037 MINOR: clock: do not use now.tv_sec anymore
Instead we're using ns_to_sec(tv_to_ns(&now)) which allows the tv_sec
part to disappear. At this point, "now" is only used as a timeval in
clock.c where it is updated.
2023-04-28 16:08:08 +02:00
Willy Tarreau
e8e4712771 MINOR: checks: use a nanosecond counter instead of timeval for checks->start
Now we store the check's start date as a nanosecond timestamp instead
of a timeval; this will simplify the operations with "now" in the near
future.
2023-04-28 16:08:08 +02:00
Willy Tarreau
76d343d3d3 MINOR: time: replace calls to tv_ms_elapsed() with a linear subtract
Instead of operating on {sec, usec}, we now convert both operands to
ns, then subtract them and convert to ms. This is a first step towards
dropping timeval from these timestamps.

Interestingly, tv_ms_elapsed() and tv_ms_remain() are no longer used at
all and could be removed.
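The linear form is essentially this (sketch with illustrative names):

  #include <stdint.h>

  /* elapsed milliseconds between two nanosecond timestamps: a single
   * subtraction and division, no carry handling on {sec, usec} pairs */
  static inline uint64_t ms_elapsed_sketch(uint64_t start_ns, uint64_t now_ns)
  {
          return (now_ns - start_ns) / 1000000ULL;
  }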
2023-04-28 16:08:08 +02:00
Aurelien DARRAGON
1746b56e68 MINOR: server: change srv_op_st_chg_cause storage type
This one is greatly inspired by "MINOR: server: change adm_st_chg_cause storage type".

While looking at current srv_op_st_chg_cause usage, it was clear that
the struct needed some cleanup, since some leftovers from asynchronous
server state change updates were left behind, resulting in some useless
code duplication and making the whole thing harder to maintain.

Two observations were made:

- by tracking down srv_set_{running, stopped, stopping} usage,
  we can see that the <reason> argument is always a fixed statically
  allocated string.
- check-related state change context (duration, status, code...) is
  not used anymore since srv_append_status() directly extracts the
  values from the server->check. This is pure legacy from when
  the state changes were applied asynchronously.

To prevent code duplication and useless string copies, and to make the
reason/cause more exportable, we now store it as an enum and provide the
srv_op_st_chg_cause() function to fetch the related description string.
HEALTH and AGENT causes (check related) are now explicitly identified to
make consumers like srv_append_op_chg_cause() able to fetch checks info
from the server itself if they need to.
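A minimal sketch of the pattern (the enum values and strings shown are
illustrative; only the HEALTH and AGENT causes are named by this commit):
the cause is stored as an enum and the description string is resolved on
demand.

  enum op_st_chg_cause_sketch {
          STCHGC_NONE = 0,
          STCHGC_HEALTH,   /* reported by a health check */
          STCHGC_AGENT,    /* reported by an agent check */
  };

  /* resolve the human-readable description only when it is needed */
  static const char *op_st_chg_cause_str(enum op_st_chg_cause_sketch cause)
  {
          switch (cause) {
          case STCHGC_HEALTH: return "health check";
          case STCHGC_AGENT:  return "agent check";
          default:            return "";
          }
  }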
2023-04-21 14:36:45 +02:00
Aurelien DARRAGON
f3b48a808e MINOR: server: srv_append_status refacto
srv_append_status() has become a swiss-knife function over time.
It is used from server code and also from checks code, with various
inputs and distinct code paths, making it very hard to guess the
actual behavior of the function (resulting string output).

To simplify the logic behind it, we're dividing it into multiple contextual
functions that take simple inputs and do explicit things, making them
more predictable and easier to maintain.
2023-04-21 14:36:45 +02:00
Aurelien DARRAGON
32483ecaac MINOR: server: correctly free servers on deinit()
The srv_drop() function is responsible for freeing the server when the
refcount reaches 0.
There is one exception: when global.mode has the MODE_STOPPING flag set,
srv_drop() will ignore the refcount and free the server on first
invocation.

This logic has been implemented with 13f2e2ce ("BUG/MINOR: server: do
not use refcount in free_server in stopping mode") and back then doing
so was not a problem since dynamic server API was just implemented and
srv_take() and srv_drop() were not widely used.

Now that the dynamic server API is starting to get more popular, we cannot
afford to keep the current logic: some modules or lua scripts may hold
references to existing servers and also do their cleanup in deinit phases.

In this kind of situation, it would be easy to trigger double-frees
since every call to srv_drop() on a specific server will try to free it.

To fix this, we take a different approach and try to fix the issue at
the source: we now properly drop server references involved with
checks/agent_checks in deinit_srv_check() and deinit_srv_agent_check().

While this could theoretically be backported up to 2.6, it is not very
relevant for now since srv_drop() usage in older versions is very
limited and we're only starting to face the issue in mid-2.8
developments (i.e. lua core updates).
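A self-contained sketch of the intended reference-counting contract (names
are illustrative): each holder takes a reference and drops it in its own
cleanup path, so the server is freed exactly once, with no special
stopping-mode shortcut.

  #include <stdatomic.h>
  #include <stdlib.h>

  struct srv_sketch { atomic_uint refcount; /* ... server fields ... */ };

  static void srv_take_sketch(struct srv_sketch *s)
  {
          atomic_fetch_add(&s->refcount, 1);
  }

  static void srv_drop_sketch(struct srv_sketch *s)
  {
          /* the last holder frees; every holder drops exactly once */
          if (atomic_fetch_sub(&s->refcount, 1) == 1)
                  free(s);
  }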
2023-04-05 08:58:16 +02:00
Aurelien DARRAGON
81b7c9518c MINOR: check: use atomic for s->consecutive_errors
Properly use atomic operations when dealing with s->consecutive_errors as
we're using it outside of the server's lock.

Race is negligible, no backport needed.
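A sketch of the access pattern with C11 atomics (HAProxy has its own
HA_ATOMIC_* wrappers; this is only an illustration):

  #include <stdatomic.h>

  static _Atomic unsigned int consecutive_errors;

  static void note_check_failure(void)
  {
          /* updated without holding the server lock */
          atomic_fetch_add_explicit(&consecutive_errors, 1, memory_order_relaxed);
  }

  static void note_check_success(void)
  {
          atomic_store_explicit(&consecutive_errors, 0, memory_order_relaxed);
  }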
2022-12-07 17:04:08 +01:00
Aurelien DARRAGON
7d541a91ec BUG/MINOR: checks: restore legacy on-error fastinter behavior
With previous commit, 9e080bf ("BUG/MINOR: checks: make sure fastinter is used
even on forced transitions"), on-error mark-down|sudden-death|fail-check are
now working as expected.

However, on-error fastinter remains broken because srv_getinter(), used in
the above commit to check the expiration date, won't return the fastinter
interval if the server's health is maxed out (which is the case with the
on-error fastinter mode).

To fix this, we introduce a check flag named CHK_ST_FASTINTER.
This flag is set when on-error is triggered. This way we can force
srv_getinter() to return fastinter interval whenever the flag is set.
The flag is automatically cleared as soon as the new check task expiry is
recalculated in process_chk_conn().
This restores original behavior prior to d114f4a ("MEDIUM: checks: spread the
checks load over random threads").

It must be backported to 2.7 along with the aforementioned commits.
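A simplified sketch of the interval selection (field and flag names are
hypothetical): the new flag forces the fast interval even when the health
counter is maxed out.

  #define CHK_FL_FASTINTER 0x01   /* hypothetical flag: on-error triggered */

  struct chk_sketch {
          unsigned int flags;
          int health, rise, fall;        /* health in [0, rise + fall - 1] */
          int inter_ms, fastinter_ms;
  };

  static int getinter_sketch(const struct chk_sketch *c)
  {
          /* fast interval while transitioning, or when forced by the flag */
          if ((c->flags & CHK_FL_FASTINTER) ||
              (c->health > 0 && c->health < c->rise + c->fall - 1))
                  return c->fastinter_ms;
          return c->inter_ms;
  }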
2022-12-07 17:03:55 +01:00
Willy Tarreau
9e080bf375 BUG/MINOR: checks: make sure fastinter is used even on forced transitions
Aurelien also found that while the previous commit a56798ea4 ("BUG/MEDIUM:
checks: do not reschedule a possibly running task on state change")
addressed one specific case where the check's task had to be woken up
quickly, it is not always sufficient as the check will not yet be
considered as expired regarding the fastinter.

Let's make sure we do consider this specific case to update the timer
based on the new state if the new value is shorter. This particularly
means that even if the timer is not expired yet during a wakeup when
nothing is in progress, we need to check whether applying the currently
effective interval to the current date would expire earlier than what is
programmed; if so, the timer needs to be updated. I.e. make sure we never
miss fastinter during a state transition before the end of the current
period.

The approach is not pretty, but it forces a repass through the existing
block dedicated to updating the timer when the current one is expired
and the updated one would appear earlier.

This must be backported to 2.7 along with the commit above.
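Reduced to its core, the check looks like this (sketch with plain
millisecond timestamps, not HAProxy's tick API):

  #include <stdint.h>

  /* on a state-change wakeup, pull the task's expiry forward if applying
   * the currently effective interval right now would expire earlier */
  static void maybe_advance_expiry(uint32_t *expire_ms, uint32_t now_ms,
                                   uint32_t effective_interval_ms)
  {
          uint32_t candidate = now_ms + effective_interval_ms;

          if ((int32_t)(candidate - *expire_ms) < 0)
                  *expire_ms = candidate;
  }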
2022-12-06 18:48:22 +01:00
Willy Tarreau
a56798ea4d BUG/MEDIUM: checks: do not reschedule a possibly running task on state change
Aurelien found an issue introduced in 2.7-dev8 with commit d114f4a68
("MEDIUM: checks: spread the checks load over random threads"), but which
in fact has deeper roots.

When a server's state is changed via __health_adjust(), if a fastinter
setting is set, the task gets rescheduled to run at the new date. The
way it's done is not thread safe, as nothing prevents another thread
where the task is already running from also updating the expire field
in parallel. But since such events are quite rare, this statistically
never happens. However, with the commit above, the tasks are no longer
required to go to the shared wait queue and are no longer marked as
shared between multiple threads. It's just that *any* thread may run
them at a time without implying that all of them are allowed to modify
them. And this change is sufficient to trigger the BUG_ON() condition
in the scheduler that detects the inconsistency between a task queued
in one thread and being manipulated in parallel by another one:

  FATAL: bug condition "task->tid != tid" matched at
  include/haproxy/task.h:670
    call trace(13):
    | 0x55f61cf520c9 [c6 04 25 01 00 00 00 00]: main-0x2ee7
    | 0x55f61d0646e8 [8b 45 08 a8 40 0f 85 65]: back_handle_st_cer+0x78/0x4d7
    | 0x55f61cff3e72 [41 0f b6 4f 01 e9 c8 df]: process_stream+0x2252/0x364f
    | 0x55f61d0d2fab [48 89 c3 48 85 db 74 75]: run_tasks_from_lists+0x34b/0x8c4
    | 0x55f61d0d38ad [29 44 24 18 8b 54 24 18]: process_runnable_tasks+0x37d/0x6c6
    | 0x55f61d0a22fa [83 3d 0b 63 1e 00 01 0f]: run_poll_loop+0x13a/0x536
    | 0x55f61d0a28c9 [48 8b 1d f0 46 19 00 48]: main+0x14d919
    | 0x55f61cf56dfe [31 c0 e8 eb 93 1b 00 31]: main+0x1e4e/0x2d5d

At first glance it looked like it could be addressed in the scheduler
only, but in fact the problem clearly is at the application level, since
some shared fields are manipulated without protection. At a minimum, the
task's expiry ought to be touched only under the server's lock. While
it's arguable that the scheduler could make such updates easier, changing
it alone will not be sufficient here.

Looking at the sequencing closer, it becomes obvious that we do not need
this task_schedule() at all: a simple task_wakeup() is sufficient for the
callee to update its timers. Indeed, the process_chk_conn() function already
deals with spurious wakeups, and already uses srv_getinter() to calculate
the next wakeup date based on the current state. So here, instead of
having to queue the task from __health_adjust() to anticipate a new check,
we can simply wake the task up and let it decide when it needs to run
next. This is much cleaner as the expiry calculation remains performed at
a single place, from the task itself, as it should be, and it fixes the
problem above.

This should be backported to 2.7, but not to older versions where the
risks of breakage are higher than the chance to fix something that
ever happened.
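The essence of the change, as a hedged before/after sketch (the exact
wake-up reason flag may differ in the real code):

  /* before (racy): __health_adjust() rewrites the expiry of a task that
   * may already be running on another thread:
   *
   *     task_schedule(check->task, tick_add(now_ms, fastinter));
   *
   * after: only wake the task; process_chk_conn() already copes with
   * spurious wakeups and recomputes its own expiry via srv_getinter():
   *
   *     task_wakeup(check->task, TASK_WOKEN_MSG);
   */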
2022-12-06 14:14:41 +01:00
Willy Tarreau
36a73439f9 BUILD: check: use __fallthrough in __health_adjust()
This avoids one build warning when preprocessing happens before compiling
with gcc >= 7.
2022-11-14 11:14:02 +01:00
Willy Tarreau
d114f4a68f MEDIUM: checks: spread the checks load over random threads
The CPU usage pattern was found to be high (5%) on a machine with
48 threads and only 100 servers checked every second. That was
supposed to be only 100 connections per second, which should be very
cheap. It was figured that, due to the check tasks unbinding from any
thread when going back to sleep, they're queued into the shared queue.

Not only does this require manipulating the global queue lock, but it
also means that all threads have to check the global queue before going
to sleep (hence take a lock again) to figure out how long to sleep, and
that they would all sleep only for the shortest amount of time to the
next check: one would pick it up and all the other ones would go back to
sleep waiting for the next check.

That's perfectly visible in time-to-first-byte measurements. A quick
test consisting in retrieving the stats page in CSV over a 48-thread
process checking 200 servers every 2 seconds shows the following tail:

  percentile   ttfb(ms)
  99.98        2.43
  99.985       5.72
  99.99       32.96
  99.995     82.176
  99.996     82.944
  99.9965    83.328
  99.997      83.84
  99.9975    84.288
  99.998      85.12
  99.9985    86.592
  99.999         88
  99.9995    89.728
  99.9999   100.352

One solution could consist in forcefully binding checks to threads at
boot time, but that's annoying, will cause trouble for dynamic servers
and may cause some skew in the load depending on some server patterns.

Instead here we take a different approach. A check remains bound to its
thread for as long as possible, but upon every wakeup, the thread's load
is compared with another random thread's load. If it's found that that
other thread's load is less than half of the current one's, the task is
bounced to that thread. In order to prevent that new thread from doing
the same, we set a flag "CHK_ST_SLEEPING" that indicates that it just
woke up and we're bouncing the task only on this condition.

Tests have shown that the initial load was very unfair before, with a few
check threads having a load of 15-20 and the vast majority having zero.
With this modification, after two "inter" delays, the load is either zero
or one everywhere when checks start. The same test shows a CPU usage that
significantly drops, between 0.5 and 1%. The same latency tail measurement
is much better, roughly 10 times smaller:

  percentile   ttfb(ms)
  99.98        1.647
  99.985       1.773
  99.99        4.912
  99.995        8.76
  99.996        8.88
  99.9965      8.944
  99.997       9.016
  99.9975      9.104
  99.998       9.224
  99.9985      9.416
  99.999         9.8
  99.9995      10.04
  99.9999     10.432

In fact one difference here is that many threads work while in the past
they were waking up and going down to sleep after having perturbed the
shared lock. Thus it is anticipated that this will scale much more
smoothly than before. Under strace it's clearly visible that all threads
are sleeping for the time it takes to relaunch a check; there are no more
thundering-herd wakeups.

However it is also possible that in some rare cases, such as very short
check intervals smaller than a scheduler's timeslice (such as 4ms),
some users might have benefited from the work being concentrated on
fewer threads, and would instead observe a small increase of apparent
CPU usage due to more threads waking up in total, even if that's for
less work each and less total work. That's visible with 200 servers at
4ms, where "show activity" shows that a few threads were overloaded and
others were doing nothing. It's not a problem, though, as in practice
checks are not supposed to eat much CPU nor to wake up often enough to
represent a significant load anyway, and the main issue they could have
been causing (aside from the global lock) is an increase in
last-percentile latency.
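A self-contained sketch of the bouncing rule described above (names are
illustrative): compare the local load against a randomly picked thread's
load and migrate the check only right after it woke up, and only when the
other thread is much less loaded.

  #include <stdbool.h>
  #include <stdlib.h>

  struct thr_sketch { int load; };

  /* returns the thread the check should run on for its next wakeup */
  static int pick_check_thread(const struct thr_sketch *thr, int nbthread,
                               int cur, bool just_woke_up)
  {
          int other = rand() % nbthread;

          if (just_woke_up && thr[other].load < thr[cur].load / 2)
                  return other;   /* bounce to the much less loaded thread */
          return cur;             /* otherwise stay bound to the same thread */
  }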
2022-10-12 21:49:30 +02:00
Willy Tarreau
a840b4a39b MINOR: checks: use the lighter PRNG for spread checks
There's no point using ha_random32(), which is heavy and uses shared
variables, to calculate a random timer when we have statistical_prng(),
which does the same and was made exactly for this.
2022-10-12 21:49:30 +02:00
Christopher Faulet
871dd82117 BUG/MINOR: tcpcheck: Disable QUICKACK only if data should be sent after connect
It is only a real problem for agent-checks when there is no agent string to
send. The condition to disable TCP_QUICKACK was only based on the action
type following the connect one, but that is not always accurate. Indeed, for
agent-checks, there is always a SEND action; but if there is no "agent-send"
string defined, nothing is sent. In this case, this adds 200ms of latency
for no reason.

To fix the bug, a flag is now used on the CONNECT action to indicate that
data should be sent after the connect. For health-checks, this flag is set
if the action following the connect is a SEND action. For agent-checks, it
is set if an "agent-send" string is defined.

This patch should fix the issue #1836. It must be backported as far as 2.2.
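A Linux-specific sketch of the idea (simplified, not the actual tcpcheck
code): quick ACKs are turned off only when data will really follow the
connect, so the handshake's last ACK can be merged with that data;
otherwise leaving them on avoids the extra latency.

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  static void tune_quickack(int fd, int data_follows_connect)
  {
  #ifdef TCP_QUICKACK
          if (data_follows_connect) {
                  int zero = 0;

                  /* delay the ACK so it rides along with the first data */
                  setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &zero, sizeof(zero));
          }
  #endif
  }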
2022-08-24 11:59:04 +02:00
Ilya Shipitsin
3b64a28e15 CLEANUP: assorted typo fixes in the code and comments
This is 31st iteration of typo fixes
2022-08-06 17:12:51 +02:00
Willy Tarreau
eed3911a54 MINOR: task: replace task_set_affinity() with task_set_thread()
The latter passes a thread ID instead of a mask, making the code simpler.
2022-07-01 19:15:14 +02:00
Willy Tarreau
3ccb14d60d MINOR: thread: get rid of MAX_THREADS_MASK
This macro was used both for binding and for lookups. When binding tasks
or FDs, using all_threads_mask instead is better as it will later be per
group. For lookups, ~0UL always does the job. Thus in practice the macro
was already almost not used anymore since the rest of the code could run
fine with a constant of all ones there.
2022-06-14 11:18:40 +02:00
Christopher Faulet
e4b4019280 CLEANUP: check: Remove useless tests on check's stream-connector
Since the conn-stream refactoring, the stream-connector is always defined
while a health-check is in progress. So, some tests on it are useless and
can be removed.

This patch should fix the issue #1739.
2022-06-13 08:04:10 +02:00
Christopher Faulet
4f1825c5db BUG/MINOR: checks: Properly handle email alerts in trace messages
There is no server for email alerts, so the trace messages must be adapted
to handle this case. Information related to the server is now skipped for
email alerts and the "[EMAIL]" prefix is used.

This patch must be backported as far as 2.4.
2022-06-08 15:28:38 +02:00
Willy Tarreau
bde14ad499 CLEANUP: check: rename all occurrences of stconn "cs" to "sc"
The check struct had a "cs" field renamed to "sc", which also required
a tiny update to a few functions using it to distinguish a check from
a stream (log.c, payload.c, ssl_sample.c, tcp_sample.c, tcpcheck.c,
connection.c).

Function arguments and local variables called "cs" were renamed to "sc".
The presence of one "cs=" in the debugging traces was also turned to
"sc=" for consistency.
2022-05-27 19:33:35 +02:00
Willy Tarreau
19c65a9ded CLEANUP: stconn: rename remaining management functions from cs_* to sc_*
This is the end of the renaming for the generic SC management functions
and macros:

cs_applet_process() -> sc_applet_process()
cs_attach_applet()  -> sc_attach_applet()
cs_attach_mux()     -> sc_attach_mux()
cs_attach_strm()    -> sc_attach_strm()
cs_detach_app()     -> sc_detach_app()
cs_detach_endp()    -> sc_detach_endp()
cs_notify()         -> sc_notify()
cs_reset_endp()     -> sc_reset_endp()
cs_state_in()       -> sc_state_in()
cs_update()         -> sc_update()
cs_update_rx()      -> sc_update_rx()
cs_update_tx()      -> sc_update_tx()
IS_HTX_CS()         -> IS_HTX_SC()
2022-05-27 19:33:35 +02:00
Willy Tarreau
a0b58b537d CLEANUP: stconn: rename cs_{new,create,free,destroy}_* to sc_*
This renames the following functions:

cs_new_from_endp()  -> sc_new_from_endp()
cs_new_from_strm()  -> sc_new_from_strm()
cs_new_from_check() -> sc_new_from_check()
cs_applet_create()  -> sc_applet_create()
cs_destroy()        -> sc_destroy()
cs_free()           -> sc_free()
2022-05-27 19:33:35 +02:00
Willy Tarreau
462b989d4c CLEANUP: stconn: rename cs_conn_*() to sc_conn_*()
The following functions which act on a connection-based stream connector
were renamed to sc_conn_* (~60 places):

  cs_conn_drain_and_shut
  cs_conn_process
  cs_conn_read0
  cs_conn_ready
  cs_conn_recv
  cs_conn_send
  cs_conn_shut
  cs_conn_shutr
  cs_conn_shutw
2022-05-27 19:33:34 +02:00
Willy Tarreau
fd9417ba3f CLEANUP: stconn: rename cs_conn() to sc_conn()
It's mostly used from upper layers. Both the checked and unchecked
functions were updated, around 150 entries in total.
2022-05-27 19:33:34 +02:00
Willy Tarreau
ea27f48c5a CLEANUP: stconn: rename cs_{check,strm,strm_task} to sc_strm_*
These functions return the app-layer associated with an stconn, which
is a check, a stream or a stream's task. They're used a lot to access
channels, flags and for waking up tasks. Let's just name them
appropriately for the stream connector.
2022-05-27 19:33:34 +02:00
Willy Tarreau
2f2318df87 MEDIUM: stconn: merge the app_ops and the data_cb fields
For historical reasons (stream-interface and connections), we used to
require two independent fields for the application level callbacks and
the transport-level functions. Over time the distinction faded away so
much that the low-level functions became specific to the application
and conversely. For example, applets may only work with streams on top
since they rely on the channels, and the stream-level functions differ
between applets and connections. Right now the application level only
contains a wake() callback and the low-level ones contain the functions
that act at the lower level to perform the shutr/shutw and at the upper
level to notify about readability and writability. Let's just merge them
together into a single set and get rid of this confusing distinction.
Note that the check ops do not define any app-level function since these
are only called by streams.
2022-05-27 19:33:34 +02:00
Willy Tarreau
f3ae34b67d MINOR: check: export wake_srv_chk()
We'll need it to centralize the stream connectors definitions.
2022-05-27 19:33:34 +02:00
Willy Tarreau
cb04166525 CLEANUP: stconn: tree-wide rename stream connector flags CS_FL_* to SC_FL_*
This follows the natural naming. There are roughly 100 changes, all
totally trivial.
2022-05-27 19:33:34 +02:00
Willy Tarreau
4596fe20d9 CLEANUP: conn_stream: tree-wide rename to stconn (stream connector)
This renames the "struct conn_stream" to "struct stconn" and updates
the descriptions in all comments (and the rare help descriptions) to
"stream connector" or "connector". This touches a lot of files but
the change is minimal. The local variables were not even renamed, so
there's still a lot of "cs" everywhere.
2022-05-27 19:33:34 +02:00
Willy Tarreau
b605c4213f CLEANUP: conn_stream: rename the stream endpoint flags CS_EP_* to SE_FL_*
Let's now use the new flag names for the stream endpoint.
2022-05-27 19:33:34 +02:00
Willy Tarreau
0cfcc40812 CLEANUP: conn_stream: apply cs_endp_flags.cocci tree-wide
This changes all main uses of cs->endp->flags to the sc_ep_*() equivalent
by applying coccinelle script cs_endp_flags.cocci.

Note: 143 locations were touched, manually reviewed and found to be OK,
except a single one that was adjusted in cs_reset_endp() where the flags
are read and filtered to be used as-is and not as a boolean, hence it was
replaced with sc_ep_get() & $FLAGS.

The script was applied with all includes:

  spatch --in-place --recursive-includes -I include --sp-file $script $files
2022-05-27 19:33:34 +02:00
Christopher Faulet
c95eaefbfd MEDIUM: check: Use the CS to handle subscriptions for read/write events
Instead of using the health-check to subscribe to read/write events, we now
rely on the conn-stream. Indeed, on the server side, the conn-stream's
endpoint is a multiplexer. Thus it seems appropriate to handle subscriptions
for read/write events the same way as for the streams. Of course, the I/O
callback function is not the same. We use srv_chk_io_cb() instead of
cs_conn_io_cb().
2022-05-19 10:12:38 +02:00
Christopher Faulet
361417f9b4 REORG: check: Rename and export I/O callback function
The event_srv_chk_io() function is renamed srv_chk_io_cb() to be consistent
with the I/O callback function of connections. In addition, this function is
exported. It will be required to use the conn-stream's subscriptions.
2022-05-19 10:12:38 +02:00
Christopher Faulet
08c8f8e20d MEDIUM: check: No longer shutdown the connection in .wake callback function
The connection is already closed by the health-check itself, thus there is
no reason to duplicate this part in the .wake callback function. It is
enough to wake the health-check and wait.
2022-05-19 10:12:38 +02:00
Christopher Faulet
6d781f612a BUG/MINOR: check: Reinit the buffer wait list at the end of a check
The buffer wait list is used to deal with buffer allocation failures. But at
the end of a health-check, it must be reinitialized: there is no reason to
keep a buffer between two health-check runs. And in fact, the associated
flags, CHK_ST_IN_ALLOC and CHK_ST_OUT_ALLOC, are already cleared at the end
of a health-check.

This patch must be backported as far as 2.2. On 2.2, MT_LIST_ADDED and
MT_LIST_DEL must be used instead of LIST_INLIST and LIST_DEL_INIT.
2022-05-19 10:12:38 +02:00
Christopher Faulet
a6c4a48341 BUG/MEDIUM: conn-stream: Don't erase endpoint flags on reset
Only the CS_EP_ERROR flag is now removed from the endpoint when a reset is
performed. When a new endpoint is allocated, flags are preserved. It is
the caller's responsibility to remove other flags, depending on its needs.

Concretely, during a connection retry or an L7 retry, we must preserve the
flags. In tcpcheck and the CLI, we reset the flags.

This patch is 2.6-specific. No backport needed.
2022-04-29 14:12:42 +02:00
Willy Tarreau
7e2e4f8401 CLEANUP: tree-wide: remove 25 occurrences of unneeded fcntl.h
There were plenty of leftovers from old code that were never removed
and that are not needed at all since these files do not use any
definition depending on fcntl.h, let's drop them.
2022-04-26 10:59:48 +02:00
Willy Tarreau
acef5e27b0 MINOR: tree-wide: always consider EWOULDBLOCK in addition to EAGAIN
Some older systems may routinely return EWOULDBLOCK for some syscalls
while we tend to check only for EAGAIN nowadays. Modern systems define
EWOULDBLOCK as EAGAIN, so that solves it, but on a few older ones (AIX,
VMS etc.) both are different, and for portability we'd need to test for
both, or we never know if we risk confusing some status codes with
plain errors.

There were only a few entries; the most annoying ones are the switch/case
statements because they require adding the entry only when it differs,
but the other ones are really trivial.
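The resulting pattern is essentially this (sketch):

  #include <errno.h>

  /* On most systems EWOULDBLOCK == EAGAIN, but on a few older ones they
   * differ, so both must be tested. In a switch, the extra label has to
   * be guarded, otherwise the duplicate case value would not compile. */
  static int is_wouldblock(int err)
  {
          switch (err) {
          case EAGAIN:
  #if defined(EWOULDBLOCK) && EWOULDBLOCK != EAGAIN
          case EWOULDBLOCK:
  #endif
                  return 1;
          default:
                  return 0;
          }
  }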
2022-04-25 20:32:15 +02:00
Christopher Faulet
eb50c01fef MINOR: conn-stream: Make cs_detach_* private and use cs_destroy() from outside
A conn-stream is never detached from an endpoint or an application alone,
except on a reset. Thus, to avoid any error, these functions are now
private. And the cs_destroy() function is added to destroy a conn-stream. This
function is called when a stream is released, on the front and back
conn-streams, and when a health-check is finished.
2022-04-22 14:32:30 +02:00
Christopher Faulet
ff022a2b8c CLEANUP: conn-stream: Rename cs_conn_close() and cs_conn_drain_and_close()
These functions don't close the connection but only perform a shutdown for
reads and writes at the mux level, which is a bit ambiguous. Thus,
cs_conn_close() is renamed cs_conn_shut() and cs_conn_drain_and_close() is
renamed cs_conn_drain_and_shut(). Both functions rely on cs_conn_shutw()
and cs_conn_shutr().
2022-04-22 14:14:27 +02:00
Christopher Faulet
177a0e60ee MEDIUM: check: Use a new conn-stream for each health-check run
It is a partial revert of 54e85cbfc ("MAJOR: check: Use a persistent
conn-stream for health-checks"). But with the CS refactoring, the result is
cleaner now. A CS is allocated when a new health-check run is started. The
same CS is then used throughout the run. If there are several connections,
the endpoint is just reset. At the end of the run, the CS is released. It
means, in the tcp-check part, the CS is always defined.
2022-04-13 15:10:16 +02:00
Christopher Faulet
6b0a0fb2f9 CLEANUP: tree-wide: Remove any ref to stream-interfaces
Stream-interfaces are gone. Corresponding files can safely be removed. In
addition, comments are updated accordingly.
2022-04-13 15:10:16 +02:00
Christopher Faulet
69ef6c9ef4 MINOR: conn-stream: Rename CS functions dedicated to connections
Some conn-stream functions are only used when there is a connection. Thus,
they were renamed with the "cs_conn_" prefix. In addition, we expect to have
a connection, so a BUG_ON is added to make sure these functions are never
called in another context.
2022-04-13 15:10:15 +02:00