haproxy

mirror of https://git.haproxy.org/git/haproxy.git/ synced 2025-08-16 03:56:56 +02:00

Author	SHA1	Message	Date
Willy Tarreau	11ef0837af	MINOR: pollers: add a new flag to indicate pollers reporting ERR & HUP In practice it's all pollers except select(). It turns out that we're keeping some legacy code only for select and enforcing it on all pollers, let's offer the pollers the ability to declare that they do not need that.	2019-12-27 14:04:33 +01:00
Willy Tarreau	6b3089856f	MEDIUM: fd: do not use the FD_POLL_* flags in the pollers anymore As mentioned in previous commit, these flags do not map well to modern poller capabilities. Let's use the FD_EV_*_{R,W} flags instead. This first patch only performs a 1-to-1 mapping making sure that the previously reported flags are still reported identically while using the closest possible semantics in the pollers. It's worth noting that kqueue will now support improvements such as returning distinctions between shut and errors on each direction, though this is not exploited for now.	2019-09-06 19:09:56 +02:00
Willy Tarreau	5bee3e2f47	MEDIUM: fd: remove the FD_EV_POLLED status bit Since commit `7ac0e35f2` in 1.9-dev1 ("MAJOR: fd: compute the new fd polling state out of the fd lock") we've started to update the FD POLLED bit a bit more aggressively. Lately with the removal of the FD cache, this bit is always equal to the ACTIVE bit. There's no point continuing to watch it and update it anymore, all it does is create confusion and complicate the code. One interesting side effect is that it now becomes visible that all fd_*_{send,recv}() operations systematically call updt_fd_polling(), except fd_cant_recv()/fd_cant_send() which never saw it change.	2019-09-05 09:31:18 +02:00
Olivier Houchard	53055055c5	MEDIUM: pollers: Remember the state for read and write for each threads. In the poller code, instead of just remembering if we're currently polling a fd or not, remember if we're polling it for writing and/or for reading, that way, we can avoid to modify the polling if it's already polled as needed.	2019-07-31 14:54:41 +02:00
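The idea of remembering the read and write polling state separately can be sketched as below. This is a minimal illustration, not HAProxy's code: the `POLL_R`/`POLL_W` bits, the `poll_op` enum and `poll_op_for()` are hypothetical names. The point is that comparing the remembered state with the wanted one lets a poller skip the system call entirely when nothing changes.

```c
/* Hypothetical per-thread polling bits for one FD (not HAProxy's names). */
#define POLL_R 0x1u  /* FD is currently polled for reading */
#define POLL_W 0x2u  /* FD is currently polled for writing */

/* Operation a poller would have to issue to reach the wanted state. */
enum poll_op { OP_NONE, OP_ADD, OP_MOD, OP_DEL };

/* Compare the remembered state with the wanted one and pick the cheapest
 * operation. OP_NONE means the poller does not need to be touched at all.
 */
enum poll_op poll_op_for(unsigned int polled, unsigned int wanted)
{
    if (polled == wanted)
        return OP_NONE;   /* already polled as needed */
    if (!polled)
        return OP_ADD;    /* not registered yet */
    if (!wanted)
        return OP_DEL;    /* nothing left to watch */
    return OP_MOD;        /* registered, but for a different direction set */
}
```

With epoll, the three non-trivial results would presumably map to `EPOLL_CTL_ADD`, `EPOLL_CTL_MOD` and `EPOLL_CTL_DEL`.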
Olivier Houchard	305d5ab469	MAJOR: fd: Get rid of the fd cache. Now that the architecture was changed so that attempts to receive/send data always come from the upper layers, instead of them only trying to do so when the lower layer let them know they could try, we can finally get rid of the fd cache. We don't really need it anymore, and removing it gives us a small performance boost.	2019-07-31 14:12:55 +02:00
Willy Tarreau	2ae84e445d	MEDIUM: poller: separate the wait time from the wake events We have been abusing the do_poll()'s timeout for a while, making it zero whenever there is some known activity. The problem this poses is that it complicates activity diagnostic by incrementing the poll_exp field for each known activity. It also requires extra computations that could be avoided. This change passes a "wake" argument to say that the poller must not sleep. This simplifies the operations and allows one to differentiate expirations from activity.	2019-05-28 17:25:21 +02:00
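A rough sketch of the separation described above, under stated assumptions: `poll_wait_ms()`, `account_poll()` and `struct poll_counters` are hypothetical, and the accounting rule (only a sleep that ends with no event counts as an expiration) is an assumption made for illustration rather than something the commit message spells out.

```c
/* Hypothetical counters, loosely inspired by the poll_exp field mentioned
 * in the commit message. The "woken" field is an invented companion.
 */
struct poll_counters {
    unsigned int poll_exp;  /* poller slept and its timeout expired */
    unsigned int woken;     /* poller was told not to sleep */
};

/* Wait time in milliseconds: the "wake" flag, not a zeroed timeout,
 * is what says "do not sleep".
 */
int poll_wait_ms(int now_ms, int exp_ms, int wake)
{
    if (wake)
        return 0;
    return exp_ms > now_ms ? exp_ms - now_ms : 0;
}

/* Assumed accounting: an expiration is counted only when the poller was
 * allowed to sleep and returned without any event.
 */
void account_poll(struct poll_counters *c, int wake, int nb_events)
{
    if (wake)
        c->woken++;
    else if (nb_events == 0)
        c->poll_exp++;
}
```

Keeping the two inputs apart means known activity no longer inflates the expiration counter.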
Olivier Houchard	cb6c9274ae	MEDIUM: pollers: Use the new _HA_ATOMIC_* macros. Use the new _HA_ATOMIC_* macros and add barriers where needed.	2019-03-11 17:02:38 +01:00
Willy Tarreau	beb859abce	MINOR: polling: add an option to support busy polling In some situations, especially when dealing with low latency on processors supporting a variable frequency or when running inside virtual machines, each time the process waits for an I/O using the poller, the processor goes back to sleep or is offered to another VM for a long time, and it causes excessively high latencies. A solution to this provided by this patch is to enable busy polling using a global option. When busy polling is enabled, the pollers never sleep and loop over themselves waiting for an I/O event to happen or for a timeout to occur. On multi-processor machines it can significantly overheat the processor but it usually results in much lower latencies. A typical test consisting in injecting traffic over a single connection at a time over the loopback shows a bump from 4640 to 8540 connections per second on forwarded connections, indicating a latency reduction of 98 microseconds for each connection, and a bump from 12500 to 21250 for locally terminated connections (redirects), indicating a reduction of 33 microseconds. It is only usable with epoll and kqueue because select() and poll()'s API is not convenient for such usages, and the level of performance they are used in doesn't benefit from this anyway. The option, which obviously remains disabled by default, can be turned on using "busy-polling" in the global section, and turned off later using "no busy-polling". Its status is reported in "show info" to help troubleshooting suspicious CPU spikes.	2018-11-22 19:47:30 +01:00
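The latency figures quoted above follow from the connection rates: with a single connection in flight at a time, the time spent per connection is the inverse of the rate. The helper below is only a worked check of that arithmetic; the function name is made up for the example.

```c
/* Time saved per connection, in microseconds, when the rate of strictly
 * sequential connections goes from cps_before to cps_after.
 */
double saved_us_per_conn(double cps_before, double cps_after)
{
    return 1e6 / cps_before - 1e6 / cps_after;
}
```

For the forwarded case, 1e6/4640 is about 215.5 us and 1e6/8540 is about 117.1 us, a difference of roughly 98 us. For redirects, 1e6/12500 is 80 us and 1e6/21250 is about 47.1 us, a difference of roughly 33 us. Both match the numbers in the commit message.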
Willy Tarreau	48f8bc1368	MINOR: poller: move the call of tv_update_date() back to the pollers The reason behind this will be to be able to compute a timeout when busy polling.	2018-11-22 18:57:37 +01:00
Willy Tarreau	609aad9e73	REORG: time/activity: move activity measurements to activity.{c,h} At the moment the situation with activity measurement is quite tricky because the struct activity is defined in global.h and declared in haproxy.c, with operations made in time.h and relying on freq_ctr which are defined in freq_ctr.h which itself includes time.h. It's barely possible to touch any of these files without breaking all the circular dependency. Let's move all this stuff to activity.{c,h} and be done with it. The measurement of active and stolen time is now done in a dedicated function called just after tv_before_poll() instead of mixing the two, which used to be a lazy (but convenient) decision. No code was changed, stuff was just moved around.	2018-11-22 11:48:41 +01:00
Willy Tarreau	7e9c4ae4de	MINOR: poller: move time and date computation out of the pollers By placing this code into time.h (tv_entering_poll() and tv_leaving_poll()) we can remove the logic from the pollers and prepare for extending this to offer more accurate time measurements.	2018-10-17 19:59:43 +02:00
Willy Tarreau	f37ba94768	MINOR: fd: centralize poll timeout computation in compute_poll_timeout() The 4 pollers all contain the same code used to compute the poll timeout. This is pointless, let's centralize this into fd.h. This also gets rid of the useless SCHEDULER_RESOLUTION macro which used to work around a very old linux 2.2 bug causing select() to wake up slightly before the timeout.	2018-10-17 19:59:43 +02:00
Willy Tarreau	60b639ccbe	MEDIUM: hathreads: implement a more flexible rendez-vous point The current synchronization point enforces certain restrictions which are hard to workaround in certain areas of the code. The fact that the critical code can only be called from the sync point itself is a problem for some callback-driven parts. The "show fd" command for example is fragile regarding this. Also it is expensive in terms of CPU usage because it wakes every other thread just to be sure all of them join to the rendez-vous point. It's a problem because the sleeping threads would not need to be woken up just to know they're doing nothing. Here we implement a different approach. We keep track of harmless threads, which are defined as those either doing nothing, or doing harmless things. The rendez-vous is used "for others" as a way for a thread to isolate itself. A thread then requests to be alone using thread_isolate() when approaching the dangerous area, and then waits until all other threads are either doing the same or are doing something harmless (typically polling). The function only returns once the thread is guaranteed to be alone, and the critical section is terminated using thread_release().	2018-08-02 17:51:45 +02:00
Olivier Houchard	cb92f5cae4	MINOR: pollers: move polled_mask outside of struct fdtab. The polled_mask is only used in the pollers, and removing it from the struct fdtab makes it fit in one 64B cacheline again, on a 64bits machine, so make it a separate array.	2018-05-06 06:27:34 +02:00
Olivier Houchard	6b96f7289c	BUG/MEDIUM: pollers: Use a global list for fd shared between threads. With the old model, any fd shared by multiple threads, such as listeners or dns sockets, would only be updated on one thread, so that could lead to missed events, or spurious wakeups. To avoid this, add a global list for fd that are shared, using the same implementation as the fd cache, and only remove entries from this list when every thread has updated its poller. [wt: this will need to be backported to 1.8 but differently so this patch must not be backported as-is]	2018-05-06 06:27:09 +02:00
Olivier Houchard	8ef1a6b0d8	BUG/MINOR: fd: Don't clear the update_mask in fd_insert. Clearing the update_mask bit in fd_insert may lead to duplicate insertion of fd in fd_updt, that could lead to a write past the end of the array. Instead, make sure the update_mask bit is cleared by the pollers no matter what. This should be backported to 1.8. [wt: warning: 1.8 doesn't have the lockless fdcache changes and will require some careful changes in the pollers]	2018-04-03 19:38:15 +02:00
Willy Tarreau	62a627ac19	MEDIUM: poller: use atomic ops to update the fdtab mask We don't need to lock the fdtab[].lock anymore since we only have one modification left (update update_mask). Let's use an atomic AND instead.	2018-02-05 16:02:22 +01:00
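Clearing one thread's bit with an atomic AND, instead of taking a lock, might look like the sketch below. It uses C11 `<stdatomic.h>` rather than HAProxy's own atomic macros, and `update_mask_t` and `clear_update_bit()` are hypothetical names standing in for the per-FD update mask.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-FD update mask: one bit per thread. */
typedef _Atomic unsigned long update_mask_t;

/* Atomically clear the bit of thread `tid` and return the previous mask.
 * No lock is needed because this is a single read-modify-write operation.
 */
unsigned long clear_update_bit(update_mask_t *mask, unsigned int tid)
{
    return atomic_fetch_and(mask, ~(1UL << tid));
}
```

The returned previous value lets the caller tell whether its bit was actually set before the call.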
Willy Tarreau	038e54cb3c	MINOR: epoll: get rid of the now useless fd_compute_new_polled_status() Do not call it anymore and avoid updating the fdstate. We're not very far from removing the fd lock it seems.	2018-02-05 16:02:22 +01:00
Willy Tarreau	4979592907	BUG/MINOR: epoll/threads: only call epoll_ctl(DEL) on polled FDs Commit `d9e7e36` ("BUG/MEDIUM: epoll/threads: use one epoll_fd per thread") addressed an issue with the polling and required that cloned FDs are removed from all polling threads on close. But in fact it does it for all bound threads, some of which may not necessarily poll the FD. This is harmless, but it may also make it harder later to deal with FD migration between threads. Better use polled_mask which only reports threads still aware of the FD instead of thread_mask. This fix should be backported to 1.8.	2018-01-31 09:49:29 +01:00
Willy Tarreau	745c60eac6	CLEANUP: fd: remove the unused "new" field This field has been unused since 1.6, it's only updated and never tested. Let's remove it.	2018-01-29 16:02:59 +01:00
Willy Tarreau	ce036bc2da	MINOR: polling: make epoll and kqueue not depend on maxfd anymore Maxfd is really only useful to poll() and select(), yet epoll and kqueue reference it almost by mistake : - cloning of the initial FDs (maxsock should be used here) - max polled events, it's maxpollevents which should be used here. Let's fix these places.	2018-01-29 15:18:54 +01:00
Christopher Faulet	3e805ed08e	BUILD: epoll/threads: Add test on MAX_THREADS to avoid warnings when compiled without threads When HAProxy is compiled without threads, gcc throws the following warnings: src/ev_epoll.c:222:3: warning: array subscript is outside array bounds [-Warray-bounds] ... src/ev_epoll.c:199:11: warning: array subscript is outside array bounds [-Warray-bounds] ... Of course, this is not a bug. In such case, tid is always equal to 0. But to avoid the noise, a check on MAX_THREADS in "if (tid)" lines makes gcc happy. This patch should be backported in 1.8 with the commit `d9e7e36c` ("BUG/MEDIUM: epoll/threads: use one epoll_fd per thread").	2018-01-25 17:52:57 +01:00
Willy Tarreau	d9e7e36c6e	BUG/MEDIUM: epoll/threads: use one epoll_fd per thread
There currently is a problem regarding epoll(). While select() and poll() compute their polling state on the fly upon each call, epoll() keeps a shared state between all threads via the epoll_fd. The problem is that once an fd is registered on any thread, all other threads receive events for that FD as well. It is clearly visible when binding a listener to a single thread like in the configuration below where all 4 threads will work, 3 of them simply spinning to skip the event :

    global
        nbthread 4

    frontend foo
        bind :1234 process 1/1

The worst case happens when some slow operations are in progress on a busy thread, preventing it from processing its task and causing the other ones to wake up not being able to do anything with this event. Typically computing a large TLS key will delay processing of next events on the same thread while others will still wake up.

All this simply shows that the poller must remain thread-specific, with its own events and its own ability to sleep when it doesn't have anything to do. This patch does exactly this. For this, it proceeds like this :
  - have one epoll_fd per thread instead of one per process
  - initialize these epoll_fd when threads are created.
  - mark all known FDs as updated so that the next invocation of _do_poll() recomputes their polling status (including a possible removal of undesired polling from the original FD) ;
  - use each fd's polled_mask to maintain an accurate status of the current polling activity for this FD.
  - when scanning updates, only focus on events whose new polling status differs from the existing one
  - during updates, always verify the thread_mask to resist migration
  - on __fd_clo(), for cloned FDs (typically listeners inherited from the parent during a graceful shutdown), run epoll_ctl(DEL) on all epoll_fd. This is the reason why epoll_fd is stored in a shared array and not in a thread_local storage. Note: maybe this can be moved to an update instead.

Interestingly, this shows that we don't need the FD's old state anymore and that we only use it to convert it to the new state based on stable information. It appears clearly that the FD code can be further improved by computing the final state directly when manipulating it.

With this change, the config above goes from 22000 cps at 380% CPU to 43000 cps at 100% CPU : not only the 3 unused threads are not activated, but they do not disturb the activity anymore.

The output of "show activity" before and after the patch on a 4-thread config where a first listener on thread 2 forwards over SSL to threads 3 & 4 shows a much smaller amount of undesired events (thread 1 doesn't wake up anymore, poll_skip remains zero, fd_skip stays low) :

    // before: 400% CPU, 7700 cps, 13 seconds
    loops: 11380717 65879 5733468 5728129
    wake_cache: 0 63986 317547 314174
    wake_tasks: 0 0 0 0
    wake_applets: 0 0 0 0
    wake_signal: 0 0 0 0
    poll_exp: 0 63986 317547 314174
    poll_drop: 1 0 49981 48893
    poll_dead: 65514 0 31334 31934
    poll_skip: 46293690 34071 22867786 22858208
    fd_skip: 66068135 174157 33732685 33825727
    fd_lock: 0 2 2809 2905
    fd_del: 0 494361 80890 79464
    conn_dead: 0 0 0 0
    stream: 0 407747 50526 49474
    empty_rq: 11380718 1914 5683023 5678715
    long_rq: 0 0 0 0

    // after: 200% cpu, 9450 cps, 11 seconds
    loops: 17 66147 1001631 450968
    wake_cache: 0 66119 865139 321227
    wake_tasks: 0 0 0 0
    wake_applets: 0 0 0 0
    wake_signal: 0 0 0 0
    poll_exp: 0 66119 865139 321227
    poll_drop: 6 5 38279 60768
    poll_dead: 0 0 0 0
    poll_skip: 0 0 0 0
    fd_skip: 54 172661 4411407 2008198
    fd_lock: 0 0 10890 5394
    fd_del: 0 492829 58965 105091
    conn_dead: 0 0 0 0
    stream: 0 406223 38663 61338
    empty_rq: 18 40 962999 390549
    long_rq: 0 0 0 0

This patch presents a few risks but fixes a real problem with threads, and as such it needs be backported to 1.8. It depends on previous patch ("MINOR: fd: add a bitmask to indicate that an FD is known by the poller").

Special thanks go to Samuel Reed for providing a large amount of useful debugging information and for testing fixes.	2018-01-23 15:48:08 +01:00
Willy Tarreau	ebc78d78a2	BUG/MEDIUM: fd: maintain a per-thread update mask Since the fd update tables are per-thread, we need to have a bit per thread to indicate whether an update exists, otherwise this can lead to lost update events every time multiple threads want to update the same FD. In practice for now, it only happens at start time when listeners are enabled and ask for polling after facing their first EAGAIN. But since the pollers are still shared, a lost event is still recovered by a neighbor thread. This will not reliably work anymore with per-thread pollers, where it has been observed a few times on startup that a single-threaded listener would not always accept incoming connections upon startup. It's worth noting that during this code review it appeared that the "new" flag in the fdtab isn't used anymore. This fix should be backported to 1.8.	2018-01-23 15:41:19 +01:00
Willy Tarreau	d80cb4ee13	MINOR: global: add some global activity counters to help debugging A number of counters have been added at special places helping better understanding certain bug reports. These counters are maintained per thread and are shown using "show activity" on the CLI. The "clear counters" commands also reset these counters. The output is sent as a single write(), which currently produces up to about 7 kB of data for 64 threads. If more counters are added, it may be necessary to write into multiple buffers, or to reset the counters. To backport to 1.8 to help collect more detailed bug reports.	2018-01-23 15:38:33 +01:00
Christopher Faulet	2a944ee16b	BUILD: threads: Rename SPIN/RWLOCK macros using HA_ prefix This remove any name conflicts, especially on Solaris.	2017-11-07 11:10:24 +01:00
Willy Tarreau	f65610a83d	CLEANUP: threads: rename process_mask to thread_mask It was a leftover from the last cleaning session; this mask applies to threads and calling it process_mask is a bit confusing. It's the same in fd, task and applets.	2017-10-31 16:06:06 +01:00
Christopher Faulet	cd7879adc2	BUG/MEDIUM: threads: Run the poll loop on the main thread too There was a flaw in the way the threads were created. The main one was just used to create all the others and just wait to exit. Now, it is used to run a poll loop. So we only create nbthread-1 threads. This also fixes a bug about the compression filter when there is only 1 thread (nbthread == 1 or no threads support). The bug was in the way thread-local resources were initialized. Per-thread init/deinit callbacks were never called for the main process. So, with nbthread set to 1, some buffers remained uninitialized.	2017-10-31 13:58:33 +01:00
Christopher Faulet	63e2ce61a8	MINOR: threads/polling: pollers now handle FDs depending on the process mask	2017-10-31 13:58:30 +01:00
Christopher Faulet	d4604adeaa	MAJOR: threads/fd: Make fd stuffs thread-safe Many changes have been made to do so. First, the fd_updt array, where all pending FDs for polling are stored, is now a thread-local array. Then 3 locks have been added to protect, respectively, the fdtab array, the fd_cache array and poll information. In addition, a lock for each entry in the fdtab array has been added to protect all accesses to a specific FD or its information. For pollers, according to the poller, the way to manage the concurrency is different. There is a poller loop on each thread. So the set of monitored FDs may need to be protected. epoll and kqueue are thread-safe per-se, so there are few things to do to protect these pollers. This is not possible with select and poll, so there is no sharing between the threads. The poller on each thread is independent from others. Finally, per-thread init/deinit functions are used for each poller and for the FD part to manage thread-local resources. Now, you must be careful when an FD is created during the HAProxy startup. All updates on the FD state must be made in the threads' context and never before their creation. This is mandatory because fd_updt array is thread-local and initialized only for threads. Because there is no poller for the main one, this array remains uninitialized in this context. For this reason, listeners are now enabled in run_thread_poll_loop function, just like the worker pipe.	2017-10-31 13:58:30 +01:00
Christopher Faulet	ab62f51959	MINOR: polling: Use fd_update_events to update events seen for a fd Now, the same function is used by all pollers to update events seen for a fd. This will ease the threads support integration.	2017-09-05 15:45:11 +02:00
Willy Tarreau	9fab7bedfb	BUG/MEDIUM: epoll: ensure we always consider HUP and ERR Since commit `5be2f35` ("MAJOR: polling: centralize calls to I/O callbacks") that came into 1.6-dev1, each poller deals with its own events and decides to signal ability to receive or send on a file descriptor based on the active events on the file descriptor. The commit above was incorrectly done for the epoll code. Instead of checking the active events on the fd, it checks for the new events. In general these ones are the same for POLL_IN and POLL_OUT since they are always cleared prior to being computed, but it is possible that POLL_HUP and POLL_ERR were initially reported and are not reported again (especially for HUP). This could happen for example if POLL_HUP and POLL_IN were received together, the pending data exactly correspond to a full buffer which is read at once, preventing the POLL_HUP from being dealt with in the same call, and on the next call only POLL_OUT is reported (eg: to emit some response or peers protocol ACKs). In this case fd_may_recv() will not be enabled anymore and the close event will be missed. It seems quite hard to trigger this case, though it might explain some of the rare missed close events that were detected in the past on the peers. This fix needs to be backported to 1.6 and 1.7.	2017-09-05 15:32:56 +02:00
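The scenario described above can be reproduced with a small model. Everything here is illustrative: the `EV_*` values, `merge_events()` and `may_recv()` are hypothetical and only mimic the behaviour the commit message describes, where IN/OUT are recomputed on each report while HUP/ERR stay recorded on the FD.

```c
/* Hypothetical event bits (values chosen for the example only). */
#define EV_IN   0x01u
#define EV_OUT  0x04u
#define EV_ERR  0x08u
#define EV_HUP  0x10u

/* Events that remain recorded once seen. */
#define EV_STICKY (EV_ERR | EV_HUP)

/* Active events on the FD after a new report: IN/OUT come from the new
 * report only, ERR/HUP are kept from earlier reports as well.
 */
unsigned int merge_events(unsigned int stored, unsigned int reported)
{
    return (stored & EV_STICKY) | reported;
}

/* Receiving is worth trying on data, hangup or error. */
int may_recv(unsigned int events)
{
    return (events & (EV_IN | EV_HUP | EV_ERR)) != 0;
}
```

If the first report is `EV_IN | EV_HUP` and the second is only `EV_OUT`, testing the new report alone gives `may_recv(EV_OUT) == 0` and the close is missed, whereas testing the merged, active events still sees the hangup.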
Willy Tarreau	5a767693b5	MINOR: fd: add a new flag HAP_POLL_F_RDHUP to struct poller We'll need to differentiate between pollers which can report hangup at the same time as read (POLL_RDHUP) from the other ones, because only these ones may benefit from the fd_done_recv() optimization. Epoll has had support for EPOLLRDHUP since Linux 2.6.17 and has always been used this way in haproxy, so now we only set the flag once we've observed it once in a response. It means that some initial requests may try to perform a second recv() call, but after the first closed connection it will be enough to know that the second call is not needed anymore. Later we may extend these flags to designate event-triggered pollers.	2017-03-21 16:30:35 +01:00
Willy Tarreau	10146c9c51	CLEANUP: poll: move the conditions for waiting out of the poll functions The poll() functions have become a bit dirty because they now check the size of the signal queue, the FD cache and the number of tasks. It's not their job, this must be moved to the caller. In the end it simplifies the code because the expiration date is now set to now_ms if we must not wait, and this achieves in exactly the same result and is cleaner. The change looks large due to the change of indent for blocks which were inside an "if" block.	2015-04-13 20:47:51 +02:00
Godbach	d39ae7ddc9	CLEANUP: epoll: epoll_events should be allocated according to global.tune.maxpollevents Willy: commit `f2e8ee2b` introduced an optimization in the old speculative epoll code, which implemented its own event cache. It was needed to store that many events (it was bound to maxsock/4 btw). Now the event cache lives on its own and we don't need this anymore. And since events are allocated on the kernel side, we only need to allocate the events we want to return. As a result, absmaxevents will be not used anymore. Just remove the definition and the comment of it, replace it with global.tune.maxpollevents. It is also an optimization of memory usage for large amounts of sockets. Signed-off-by: Godbach <nylzhaowei@gmail.com>	2014-12-17 17:04:53 +01:00
Willy Tarreau	5be2f35231	MAJOR: polling: centralize calls to I/O callbacks In order for HTTP/2 not to eat too much memory, we'll have to support on-the-fly buffer allocation, since most streams will have an empty request buffer at some point. Supporting allocation on the fly means being able to sleep inside I/O callbacks if a buffer is not available. Till now, the I/O callbacks were called from two locations : - when processing the cached events - when processing the polled events from the poller This change cleans up the design a bit further than what was started in 1.5. It now ensures that we never call any iocb from the poller itself and that instead, events learned by the poller are put into the cache. The benefit is important in terms of stability : we don't have to care anymore about the risk that new events are added into the poller while processing its events, and we're certain that updates are processed at a single location. To achieve this, we now modify all the fd_* functions so that instead of creating updates, they add/remove the fd to/from the cache depending on its state, and only create an update when the polling status reaches a state where it will have to change. Since the pollers make use of these functions to notify readiness (using fd_may_recv/fd_may_send), the cache is always up to date with the poller. Creating updates only when the polling status needs to change saves a significant amount of work for the pollers : a benchmark showed that on a typical TCP proxy test, the amount of updates per connection dropped from 11 to 1 on average. This also means that the update list is smaller and has more chances of not thrashing too many CPU cache lines. The first observed benefit is a net 2% performance gain on the connection rate. 
A second benefit is that when a connection is accepted, it's only when we're processing the cache, and the recv event is automatically added into the cache after the current one, resulting in this event being processed immediately during the same loop. Previously we used to have a second run over the updates to detect if new events were added to catch them before waking up tasks. The next gain will be offered by the next steps on this subject consisting in implementing an I/O queue containing all cached events ordered by priority just like the run queue, and to be able to leave some events pending there as long as needed. That will allow us not to perform some FD processing if it's not the proper time for this (typically keep waiting for a buffer to be allocated if none is available for a recv()). And by only processing a small bunch of them, we'll allow priorities to take place even at the I/O level. As a result of this change, functions fd_alloc_or_release_cache_entry() and fd_process_polled_events() have disappeared, and the code dedicated to checking for new fd events after the callback during the poll() loop was removed as well. Despite the patch looking large, it's mostly a change of what function is called upon fd_*() and almost nothing was added.	2014-11-21 20:37:32 +01:00
Conrad Hoffmann	041751c13a	BUG/MEDIUM: polling: fix possible CPU hogging of worker processes after receiving SIGUSR1. When run in daemon mode (i.e. with at least one forked process) and using the epoll poller, sending USR1 (graceful shutdown) to the worker processes can cause some workers to start running at 100% CPU. Precondition is having an established HTTP keep-alive connection when the signal is received. The cloned (during fork) listening sockets do not get closed in the parent process, thus they do not get removed from the epoll set automatically (see man 7 epoll). This can lead to the process receiving epoll events that it doesn't feel responsible for, resulting in an endless loop around epoll_wait() delivering these events. The solution is to explicitly remove these file descriptors from the epoll set. To not degrade performance, care was taken to only do this when necessary, i.e. when the file descriptor was cloned during fork. Signed-off-by: Conrad Hoffmann <conrad@soundcloud.com> [wt: a backport to 1.4 could be studied though chances to catch the bug are low]	2014-05-20 14:57:36 +02:00
Willy Tarreau	25002d206b	MINOR: polling: create function fd_compute_new_polled_status() This function is used to compute the new polling state based on the previous state. All pollers have to do this in their update loop, so better centralize the logic for it.	2014-01-26 00:42:32 +01:00
Willy Tarreau	e852545594	MEDIUM: polling: centralize polled events processing Currently, each poll loop handles the polled events the same way, resulting in a lot of duplicated, complex code. Additionally, epoll was the only one to handle newly created FDs immediately. So instead, let's move that code to fd.c in a new function dedicated to this task : fd_process_polled_events(). All pollers now use this function.	2014-01-26 00:42:32 +01:00
Willy Tarreau	f817e9f473	MAJOR: polling: rework the whole polling system
This commit heavily changes the polling system in order to definitely fix the frequent breakage of SSL which needs to remember the last EAGAIN before deciding whether to poll or not. Now we have a state per direction for each FD, as opposed to a previous and current state previously. An FD can have up to 8 different states for each direction, each of which being the result of a 3-bit combination. These 3 bits indicate a wish to access the FD, the readiness of the FD and the subscription of the FD to the polling system.

This means that it will now be possible to remember the state of a file descriptor across disable/enable sequences that generally happen during forwarding, where enabling reading on a previously disabled FD would result in forgetting the EAGAIN flag it met last time.

Several new state manipulation functions have been introduced or adapted :
  - fd_want_{recv,send} : enable receiving/sending on the FD regardless of its state (sets the ACTIVE flag) ;
  - fd_stop_{recv,send} : stop receiving/sending on the FD regardless of its state (clears the ACTIVE flag) ;
  - fd_cant_{recv,send} : report a failure to receive/send on the FD corresponding to EAGAIN (clears the READY flag) ;
  - fd_may_{recv,send} : report the ability to receive/send on the FD as reported by poll() (sets the READY flag) ;

Some functions are used to report the current FD status :
  - fd_{recv,send}_active
  - fd_{recv,send}_ready
  - fd_{recv,send}_polled

Some functions were removed :
  - fd_ev_clr(), fd_ev_set(), fd_ev_rem(), fd_ev_wai()

The POLLHUP/POLLERR flags are now reported as ready so that the I/O layers know they can try to access the file descriptor to get this information.

In order to simplify the conditions to add/remove cache entries, a new function fd_alloc_or_release_cache_entry() was created to be used from pollers while scanning for updates.

The following pollers have been updated :
  ev_select() : done, built, tested on Linux 3.10
  ev_poll() : done, built, tested on Linux 3.10
  ev_epoll() : done, built, tested on Linux 3.10 & 3.13
  ev_kqueue() : done, built, tested on OpenBSD 5.2	2014-01-26 00:42:30 +01:00
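The three bits per direction and the four state manipulation families described above can be modelled as follows. This is a simplified sketch for the receive direction only; the `SK_*` flag names, `struct sk_fd_state` and the `sk_*` helpers are hypothetical and deliberately not the real HAProxy identifiers.

```c
/* Hypothetical 3-bit state for one direction of an FD. */
#define SK_ACTIVE 0x1u  /* the upper layer wishes to access the FD */
#define SK_READY  0x2u  /* the FD is believed to be ready */
#define SK_POLLED 0x4u  /* the FD is subscribed to the poller */

struct sk_fd_state {
    unsigned char recv;  /* 3-bit state for the receive direction */
    unsigned char send;  /* 3-bit state for the send direction */
};

/* Counterparts of the want/stop/cant/may families, receive side only. */
void sk_want_recv(struct sk_fd_state *s) { s->recv |= SK_ACTIVE; }
void sk_stop_recv(struct sk_fd_state *s) { s->recv &= (unsigned char)~SK_ACTIVE; }
void sk_cant_recv(struct sk_fd_state *s) { s->recv &= (unsigned char)~SK_READY; }
void sk_may_recv(struct sk_fd_state *s)  { s->recv |= SK_READY; }

/* The FD can be read right now only if it is both wanted and ready. */
int sk_recv_now(const struct sk_fd_state *s)
{
    return (s->recv & (SK_ACTIVE | SK_READY)) == (SK_ACTIVE | SK_READY);
}
```

Because ACTIVE and READY are independent bits, a stop followed by a want after an EAGAIN leaves READY cleared, so the EAGAIN is remembered across the disable/enable sequence, which is the property the commit message is after.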
Willy Tarreau	899d95757e	REORG: polling: rename the cache allocation functions - alloc_spec_entry() becomes fd_alloc_cache_entry() - release_spec_entry() becomes fd_release_cache_entry()	2014-01-26 00:42:29 +01:00
Willy Tarreau	16f649c82c	REORG: polling: rename "fd_spec" to "fd_cache" So fd_spec was renamed "fd_cache" as it's becoming an event cache, and fd_nbspec becomes fd_cache_num.	2014-01-26 00:42:29 +01:00
Willy Tarreau	15a4dec87e	REORG: polling: rename "spec_e" to "state" and "spec_p" to "cache" We're completely changing the way FDs will be polled. There will be no more speculative I/O since we'll know the exact FD state, so these will only be cached events. First, let's fix a few field names which become confusing. "spec_e" was used to store a speculative I/O event state. Now we'll store the whole R/W states for the FD there. "spec_p" was used to store a speculative I/O cache position. Now let's clearly call it "cache".	2014-01-26 00:42:29 +01:00
Willy Tarreau	69a41fa8a3	CLEANUP: polling: rename "spec_e" to "state" We're completely changing the way FDs will be polled. First, let's fix a few field names which become confusing. "spec_e" was used to store a speculative I/O event state. Now we'll store the whole R/W states for the FD there.	2014-01-26 00:42:28 +01:00
Willy Tarreau	3ef5af3dcc	BUG: Revert "OPTIM/MEDIUM: epoll: fuse active events into polled ones during polling changes" This reverts commit `2f877304ef`. This commit is OK for clear text traffic but causes trouble with SSL when buffers are smaller than SSL buffers. Since the issue it addresses will be gone once the polling redesign is complete, there's no reason for trying to workaround temporary inefficiencies. Better remove it.	2013-12-20 16:03:41 +01:00
Willy Tarreau	2f877304ef	OPTIM/MEDIUM: epoll: fuse active events into polled ones during polling changes
When trying to speculatively send data to a server being connected to, we see the following pattern :

    connect() = EINPROGRESS
    send() = EAGAIN
    epoll_ctl(add, W)
    epoll_wait() = EPOLLOUT
    send() = success
  > epoll_ctl(del, W)
  > recv() = EAGAIN
  > epoll_ctl(add, R)
    recv() = success
    epoll_ctl(del, R)

The reason for the failed recv() call is that the reading was marked as speculative while we already have a polled I/O there. So we already know when removing send write poll that the read is pending. Thus, let's improve this by merging speculative I/O into polled I/O when polled state changes. The result is now the following as expected :

    connect() = EINPROGRESS
    send() = EAGAIN
    epoll_ctl(add, W)
    epoll_wait() = EPOLLOUT
    send() = success
    epoll_ctl(mod, R)
    recv() = success
    epoll_ctl(del, R)

This is specific to epoll(), it doesn't make much sense at the moment to do so for other pollers, because the cost of updating them is very small.

The average performance gain on small requests is of 1.6% in TCP mode, which is easily explained with the syscall stats below for 10000 forwarded connections :

Before :

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     91.02    0.024608           0     60000         1 epoll_wait
      2.19    0.000593           0     20000           shutdown
      1.52    0.000412           0     10000     10000 connect
      1.36    0.000367           0     29998      9998 sendto
      1.09    0.000294           0     49993           epoll_ctl
      0.93    0.000252           0     50004     20002 recvfrom
      0.79    0.000214           0     20005           close
      0.62    0.000167           0     20001     10001 accept4
      0.25    0.000067           0     20002           setsockopt
      0.13    0.000035           0     10001           socket
      0.10    0.000028           0     10001           fcntl

After:

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     87.59    0.024269           0     50012         1 epoll_wait
      3.19    0.000884           0     20000           shutdown
      2.33    0.000646           0     29996      9996 sendto
      2.02    0.000560           0     10005     10003 connect
      1.40    0.000387           0     40013     10013 recvfrom
      1.35    0.000374           0     40000           epoll_ctl
      0.64    0.000178           0     20001     10001 accept4
      0.55    0.000152           0     20005           close
      0.45    0.000124           0     20002           setsockopt
      0.31    0.000086           0     10001           fcntl
      0.17    0.000047           0     10001           socket

Overall : -16.6% epoll_wait -20% recvfrom -20% epoll_ctl

On HTTP, the gain is even better :

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     80.43    0.015386           0     60006         1 epoll_wait
      4.61    0.000882           0     30000     10000 sendto
      3.74    0.000715           0     20001     10001 accept4
      3.35    0.000640           0     10000     10000 connect
      2.66    0.000508           0     20005           close
      1.34    0.000257           0     30002     10002 recvfrom
      1.27    0.000242           0     30005           epoll_ctl
      1.20    0.000230           0     10000           shutdown
      0.62    0.000119           0     20003           setsockopt
      0.40    0.000077           0     10001           socket
      0.39    0.000074           0     10001           fcntl

willy@wtap:haproxy$ head -15 apres.txt

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     83.47    0.020301           0     50008         1 epoll_wait
      4.26    0.001036           0     20005           close
      3.30    0.000803           0     30000     10000 sendto
      2.55    0.000621           0     20001     10001 accept4
      1.76    0.000428           0     10000     10000 connect
      1.20    0.000292           0     10000           shutdown
      1.14    0.000278           0     20001         1 recvfrom
      0.86    0.000210           0     20003           epoll_ctl
      0.71    0.000173           0     20003           setsockopt
      0.49    0.000120           0     10001           socket
      0.25    0.000060           0     10001           fcntl

Overall : -16.6% epoll_wait -33% recvfrom -33% epoll_ctl	2013-11-15 23:15:10 +01:00
Willy Tarreau	cf181c9d40	BUG/MINOR: epoll: use a fixed maxevents argument in epoll_wait() epoll_wait() takes a number of returned events, not the number of fds to consider. We must not pass it the number of the smallest fd, as it leads to value zero being used, which is invalid in epoll_wait(). The effect may sometimes be observed with peers sections trying to connect and causing 2-second CPU loops upon a soft reload because epoll_wait() immediately returns -1 EINVAL instead of waiting for the timeout to happen. This fix should be backported to 1.4 too (into ev_epoll and ev_sepoll).	2013-01-18 15:31:03 +01:00
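The failure mode described in this commit can be reproduced in isolation. The following is a minimal, Linux-only sketch and is not HAProxy code: `demo_epoll_wait_errno()` is a hypothetical helper that calls `epoll_wait()` on an empty epoll set with a zero timeout and reports the resulting errno, so that a `maxevents` of zero can be compared with a positive value.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (not part of HAProxy): calls epoll_wait() on an
 * empty epoll set with a zero timeout and returns the errno it produced,
 * or 0 if the call succeeded. <maxevents> must not exceed 8 here since
 * that is the size of the local event array.
 */
int demo_epoll_wait_errno(int maxevents)
{
	struct epoll_event evs[8];
	int epfd, ret, err = 0;

	epfd = epoll_create(1);
	if (epfd < 0)
		return errno;

	ret = epoll_wait(epfd, evs, maxevents, 0);
	if (ret < 0)
		err = errno;

	close(epfd);
	return err;
}
```

With `maxevents` set to 0 the helper is expected to return `EINVAL` immediately, which matches the busy loop described above, while any value from 1 to 8 simply returns 0 because no event is pending.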
Willy Tarreau	1c07b0755d	OPTIM: epoll: make use of EPOLLRDHUP epoll may report pending shutdowns using EPOLLRDHUP. Since this flag is missing from a number of libcs despite being available since kernel 2.6.17, let's define it ourselves. Doing so saves one syscall by allowing us to avoid the read()==0 when the server closes with the response.	2013-01-07 16:39:47 +01:00
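As an illustration of the idea (a Linux-only sketch, not the code from this commit), the fallback definition below uses 0x2000, the value Linux assigns to EPOLLRDHUP. The hypothetical `demo_rdhup_reported()` helper then checks on a local socket pair that a half-close by the peer is reported through epoll without having to call read().

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fallback for libcs lacking the flag; 0x2000 is the Linux value. */
#ifndef EPOLLRDHUP
#define EPOLLRDHUP 0x2000
#endif

/* Hypothetical helper (not part of HAProxy): returns 1 if epoll reports
 * EPOLLRDHUP after the peer shuts down its write side, 0 if it does not,
 * and -1 if any setup step failed.
 */
int demo_rdhup_reported(void)
{
	struct epoll_event ev = { 0 }, out;
	int sv[2], epfd, n, seen = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	epfd = epoll_create(1);
	if (epfd >= 0) {
		ev.events = EPOLLIN | EPOLLRDHUP;
		ev.data.fd = sv[0];
		if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) == 0) {
			/* the peer closes its sending direction */
			shutdown(sv[1], SHUT_WR);
			n = epoll_wait(epfd, &out, 1, 0);
			if (n >= 0)
				seen = (n == 1 && (out.events & EPOLLRDHUP)) ? 1 : 0;
		}
		close(epfd);
	}
	close(sv[0]);
	close(sv[1]);
	return seen;
}
```

A caller that sees the flag already knows the peer has finished sending, so it can skip the extra read() that would only return 0.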
Willy Tarreau	39ebef82aa	BUG/MINOR: poll: the I/O handler was called twice for polled I/Os When a polled I/O event is detected, the event is added to the updates list and the I/O handler is called. Upon return, if the event handler did not experience an EAGAIN, the event remains in the updates list so that it will be processed later. But if the event was already in the spec list, its state is updated and it will be called again immediately upon exit, by fd_process_spec_events(), so this creates unfairness between speculative events and polled events. So don't call the I/O handler upon I/O detection when the FD already is in the spec list. The fd events are still updated so that the spec list is up to date with the possible I/O change.	2012-12-14 00:17:03 +01:00
Willy Tarreau	fb5470d144	OPTIM: epoll: current fd does not count as a new one The epoll loop checks for newly appeared FDs in order to process them early if they're accepted sockets. Since the introduction of the fd_ev_set() calls before the iocb(), the current FD is always in the update list, and we don't want to check it again, so we must assign the old_updt index just before calling the I/O handler.	2012-12-14 00:13:23 +01:00
Willy Tarreau	6320c3cb46	OPTIM: epoll: use a temp variable for intermediary flag computations Playing with fdtab[fd].ev makes gcc constantly reload the pointers because it does not know they don't alias. Use a temporary variable instead. This saves a few operations in the fast path.	2012-12-13 23:52:58 +01:00
Willy Tarreau	db9cb0b9b7	CLEANUP: poll: remove a useless double-check on fdtab[fd].owner This check is already performed a few lines above in the same loop, remove it from the condition.	2012-12-13 23:41:12 +01:00
Willy Tarreau	462c7206bc	CLEANUP: polling: gcc doesn't always optimize constants away In ev_poll and ev_epoll, we have a bit-to-bit mapping between the POLL_ constants and the FD_POLL_ constants. A comment said that gcc was able to detect this and to automatically apply a mask. Things have possibly changed since the output assembly doesn't always reflect this. So let's perform an explicit assignment when bits are equal.	2012-12-13 22:30:17 +01:00
Willy Tarreau	26d7cfce32	BUG/MAJOR: polling: do not set speculative events on ERR nor HUP Errors and Hangups are sticky events, which means that once they're detected, we never clear them, allowing them to be handled later if needed. Till now when an error was reported, it used to register a speculative I/O event for both recv and send. Since the connection had not requested such events, it was not able to detect a change and did not clear them, so the events were called in loops until a timeout caused their owner task to die. So this patch does two things : - stop registering spec events when no I/O activity was requested, so that we don't end up with non-disablable polling state ; - keep the sticky polling flags (ERR and HUP) when leaving the connection handler so that an error notification doesn't magically become a normal recv() or send() report once the event is converted to a spec event. It is normally not needed to make the connection handler emit an error when it detects POLL_ERR because either a registered data handler will have done it, or the event will be disabled by the wake() callback.	2012-12-07 00:09:43 +01:00
Willy Tarreau	70c6fd82c3	MAJOR: polling: remove unused callbacks from the poller struct Since no poller uses poller->{set,clr,wai,is_set,rem} anymore, let's remove them and remove the associated pointer tests in proto/fd.h.	2012-11-11 21:02:34 +01:00
Willy Tarreau	e9f49e78fe	MAJOR: polling: replace epoll with sepoll and remove sepoll Now that all pollers make use of speculative I/O, there is no point having two epoll implementations, so replace epoll with the sepoll code and remove sepoll which has just become the standard epoll method.	2012-11-11 20:53:30 +01:00
Willy Tarreau	f8cfa447c6	BUG/MINOR: epoll: correctly disable FD polling in fd_rem() When calling fd_rem(), the polling was not correctly disabled because the ->prev state was set to zero instead of the previous value. fd_rem() is very rarely used, only just before closing a socket. The effect is that upon an error reported at the connection level, if the task assigned to the connection was too slow to be woken up because of too many other tasks in the run queue, the FD was still not disabled and caused the connection handler to be called again with the same event until the task was finally executed to close the fd. This issue only affects the epoll poller, not the sepoll variant nor any of the other ones. It was already present in 1.4 and even 1.3 with the same almost unnoticeable effects. The bug can in fact only be discovered during development where it emphasizes other bugs. It should be backported anyway.	2012-10-04 22:26:09 +02:00
Willy Tarreau	babd05a6c6	MEDIUM: fd: add fd_poll_{recv,send} for use when explicit polling is required The old EV_FD_SET() macro was confusing, as it would enable receipt but there was no way to indicate that EAGAIN was received, hence the recently added FD_WAIT_* flags. They're not enough as we're still facing a conflict between EV_FD_* and FD_WAIT_*. So let's offer I/O functions what they need to explicitly request polling.	2012-09-02 21:53:11 +02:00
Willy Tarreau	3788e4c874	MEDIUM: fd: remove the EV_FD_COND_* primitives These primitives were initially introduced so that callers were able to conditionally set/disable polling on a file descriptor and check in return what the state was. It's been long since we last had an "if" on this, and all pollers' functions were the same for cond_* and their systematic counterparts, except that this required a check and a specific return value that are not always necessary. So let's simplify the FD API by removing this now unused distinction and by making all specific functions return void.	2012-09-02 21:53:10 +02:00
Willy Tarreau	076be25ab8	CLEANUP: remove the now unused fdtab direct I/O callbacks They were all left to NULL since last commit so we can safely remove them all now and remove the temporary dual polling logic in pollers.	2012-09-02 21:51:29 +02:00
Willy Tarreau	9845e75d23	MEDIUM: polling: prepare to call the iocb() function when defined. We will need this to centralize I/O callbacks. Nobody sets it right now so the code should have no impact.	2012-09-02 21:51:27 +02:00
Willy Tarreau	db3b32610f	REORG/MEDIUM: fd: remove FD_STCLOSE from struct fdtab In an attempt to get rid of fdtab[].state, and to move the relevant parts to the connection struct, we remove the FD_STCLOSE state which can easily be deduced from the <owner> pointer as there is a 1:1 match.	2012-09-02 21:51:25 +02:00
Willy Tarreau	491c498d97	BUG/MINOR: polling: some events were not set in various pollers fdtab[].ev was only set in ev_sepoll. Unfortunately, some I/O handling functions now rely on this, so depending on the polling mechanism, some useless operations might have been performed, such as performing a useless recv() when a HUP was reported. This is a very old issue, the flags were only added to the fdtab and not propagated into any poller. Then they were used in ev_sepoll which needed them for the cache. It is unsure whether a backport to 1.4 is appropriate or not.	2012-07-31 07:55:31 +02:00
Willy Tarreau	45a1251515	[MEDIUM] poll: add a measurement of idle vs work time We now measure the work and idle times in order to report the idle time in the stats. It's expected that we'll be able to use it at other places later.	2011-09-10 18:01:41 +02:00
Willy Tarreau	43d8fb2d3a	[REORG] build: move syscall redefinition to specific places Some older libc don't define splice() and don't define _syscall() either, which causes build errors if splicing is enabled. To solve this, we now split the syscall redefinition into two layers : - one file per syscall (epoll, splice) - one common file to declare the _syscall() macros The code is cleaner because files using the syscalls just have to include their respective file. It's not advised to merge multiple syscall families into the same file if all are not intended to be used simultaneously, because defining unused static functions causes warnings to be emitted during build. As a result, the new USE_MY_SPLICE parameter was added in order to be able to define the splice() syscall separately.	2011-08-23 00:11:25 +02:00
Willy Tarreau	d79e79b436	[BUG] O(1) pollers should check their FD before closing it epoll, sepoll and kqueue pollers should check that their fd is not closed before attempting to close it, otherwise we can end up with multiple closes of fd #0 upon exit, which is harmless but dirty.	2009-05-10 10:18:54 +02:00
Willy Tarreau	332740dab2	[MEDIUM] pollers: don't wait if a signal is pending If an asynchronous signal is received outside of the poller, we don't want the poller to wait for a timeout to occur before processing it, so we set its timeout to zero, just like we do with pending tasks in the run queue.	2009-05-10 09:57:21 +02:00
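The rule shared by this commit and the earlier run-queue fix can be summarised in a few lines. This is only a sketch under assumed names: `demo_poll_timeout()` and its parameters are hypothetical and do not reflect HAProxy's actual variables.

```c
/* Hypothetical helper (not part of HAProxy): computes the timeout to pass
 * to the poller. <next_expiry_ms> is the delay until the next timer fires,
 * <run_queue_len> the number of tasks ready to run, and <signal_pending>
 * is non-zero when an asynchronous signal was caught but not yet handled.
 */
int demo_poll_timeout(int next_expiry_ms, int run_queue_len, int signal_pending)
{
	/* work is already waiting: poll for events but do not sleep */
	if (run_queue_len > 0 || signal_pending)
		return 0;

	return next_expiry_ms;
}
```

Passing a zero timeout still lets the poller collect events that are already ready; it only prevents it from sleeping while tasks or signals wait to be processed.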
Willy Tarreau	a534fea478	[CLEANUP] remove 65 useless NULL checks before free C specification clearly states that free(NULL) is a no-op. So remove useless checks before calling free.	2008-08-03 20:48:50 +02:00
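A small sketch of the resulting style, using a hypothetical `struct demo_conf` that is not taken from HAProxy: since `free(NULL)` is defined to do nothing, the release function needs no guards.

```c
#include <stdlib.h>

/* Hypothetical structure owning two optional heap strings. */
struct demo_conf {
	char *name;
	char *path;
};

/* Releases both members. No "if (ptr)" test is needed before free()
 * because free(NULL) is a no-op per the C standard. Pointers are reset
 * so that the function may safely be called twice.
 */
void demo_conf_release(struct demo_conf *c)
{
	free(c->name);
	c->name = NULL;
	free(c->path);
	c->path = NULL;
}
```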
Willy Tarreau	ec6c5df018	[CLEANUP] remove many #include <types/xxx> from C files It should be stated as a rule that a C file should never include types/xxx.h when proto/xxx.h exists, as it gives less exposure to declaration conflicts (one of which was caught and fixed here) and it complicates the file headers for nothing. Only types/global.h, types/capture.h and types/polling.h have been found to be valid includes from C files.	2008-07-16 10:30:42 +02:00
Willy Tarreau	0c303eec87	[MAJOR] convert all expiration timers from timeval to ticks This is the first attempt at moving all internal parts from using struct timeval to integer ticks. These provide simpler and faster code due to simplified operations, and this change also saved about 64 bytes per session. A new header file has been added : include/common/ticks.h. It is possible that some functions should finally not be inlined because they're used quite a lot (eg: tick_first, tick_add_ifset and tick_is_expired). More measurements are required in order to decide whether this is interesting or not. Some function and variable names are still subject to change for better overall logic.	2008-07-07 00:09:58 +02:00
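To give an idea of why integer ticks are cheaper than struct timeval, here is a minimal sketch of wrap-safe tick arithmetic. It is not the content of include/common/ticks.h: the `demo_` names, the choice of 0 as the "not set" value and the millisecond unit are assumptions made for the example.

```c
/* Assumed convention for this sketch: 0 means "no timer set". */
#define DEMO_TICK_ETERNITY 0u

static inline int demo_tick_isset(unsigned int t)
{
	return t != DEMO_TICK_ETERNITY;
}

/* Wrap-safe "a is before or equal to b". Relies on the usual
 * two's-complement conversion to int performed by gcc and clang.
 */
static inline int demo_tick_is_le(unsigned int a, unsigned int b)
{
	return (int)(a - b) <= 0;
}

/* Adds a timeout to a date, never returning the reserved value. */
unsigned int demo_tick_add(unsigned int now, unsigned int timeout_ms)
{
	unsigned int t = now + timeout_ms;

	if (t == DEMO_TICK_ETERNITY)
		t++;
	return t;
}

/* An unset timer never expires; a set one expires once reached. */
int demo_tick_is_expired(unsigned int timer, unsigned int now)
{
	return demo_tick_isset(timer) && demo_tick_is_le(timer, now);
}

/* Returns the earliest of two timers, ignoring unset ones. */
unsigned int demo_tick_first(unsigned int t1, unsigned int t2)
{
	if (!demo_tick_isset(t1))
		return t2;
	if (!demo_tick_isset(t2))
		return t1;
	return demo_tick_is_le(t1, t2) ? t1 : t2;
}
```

Each comparison is a single subtraction and sign test, whereas the timeval equivalent has to compare seconds and microseconds separately.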
Willy Tarreau	b0b37bcd65	[MEDIUM] further improve monotonic clock by checking forward jumps The first implementation of the monotonic clock did not verify forward jumps. The consequence is that a fast changing time may expire a lot of tasks. While it does seem minor, in fact it is problematic because most machines which boot with a wrong date are in the past and suddenly see their time jump by several years in the future. The solution is to check if we spent more apparent time in a poller than allowed (with a margin applied). The margin is currently set to 1000 ms. It should be large enough for any poll() to complete. Tests with randomly jumping clock show that the result is quite accurate (error less than 1 second at every change of more than one second).	2008-06-23 14:00:57 +02:00
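The check described above can be sketched as follows. This is a simplified model and not HAProxy's implementation: the `demo_` names and the millisecond integers are assumptions, and on a detected jump the sketch simply leaves the internal clock where it was, which is one possible policy among others.

```c
#define DEMO_MAX_DELAY_MS 1000 /* margin mentioned in the commit message */

/* Hypothetical clock pair, both in milliseconds: <date> is the wall clock
 * as last observed, <now> is the internal clock used for scheduling.
 */
struct demo_clock {
	long long date;
	long long now;
};

/* Called after the poller returns. <new_date> is the freshly read wall
 * clock and <max_wait> the timeout that was passed to the poller. If the
 * wall clock went backwards, or forwards by more than the poller could
 * possibly have slept plus the margin, the move is treated as a jump and
 * the internal clock is not advanced.
 */
void demo_clock_update(struct demo_clock *c, long long new_date, long long max_wait)
{
	long long delta = new_date - c->date;

	if (delta < 0 || delta > max_wait + DEMO_MAX_DELAY_MS)
		delta = 0; /* assumed policy: ignore the apparent elapsed time */

	c->now += delta;
	c->date = new_date;
}
```

With this model, timers based on `now` are never expired in bulk by a clock correction, while `date` keeps following what the user expects to see in logs.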
Willy Tarreau	b7f694f20e	[MEDIUM] implement a monotonic internal clock If the system date is set backwards while haproxy is running, some scheduled events are delayed by the amount of time the clock went backwards. This is particularly problematic on systems where the date is set at boot, because it seldom happens that health-checks do not get sent for a few hours. Before switching to use clock_gettime() on systems which provide it, we can at least ensure that the clock is not going backwards and maintain two clocks : the "date" which represents what the user wants to see (mostly for logs), and an internal date stored in "now", used for scheduled events.	2008-06-22 17:18:02 +02:00
Willy Tarreau	3a6281199a	[BUG] event pollers must not wait if a task exists in the run queue Under some circumstances, a task may already lie in the run queue (eg: inter-task wakeup). It is disastrous to wait for an event in this case because some processing gets delayed.	2008-06-20 15:05:56 +02:00
Willy Tarreau	70bcfb77a7	[OPTIM] GCC4's builtin_expect() is suboptimal GCC4 is stupid (unbelievable news!). When some code uses __builtin_expect(x != 0, 1), it really performs the check of x != 0 then tests that the result is not zero! This is a double check when only one was expected. Some performance drops of 10% in the HTTP parser code have been observed due to this bug. GCC 3.4 is fine though. A solution consists in expecting that the tested value is 1. In this case, it emits the correct code, but it's still not optimal it seems. Finally the best solution is to ignore likely() and to pray for the compiler to emit correct code. However, we still have to fix unlikely() to remove the test there too, and to fix all code which passed pointers over there to pass integers instead.	2008-02-14 23:14:33 +01:00
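One way to read the conclusion of this message is sketched below, for GCC-compatible compilers only. The `demo_` macros are hypothetical and are not copied from HAProxy's headers: the likely variant is a plain pass-through, and the unlikely variant hands the integer expression to `__builtin_expect()` without an added `!= 0` test, which is why callers must pass comparisons rather than raw pointers.

```c
#include <stddef.h>

/* Sketch only: no hint at all for the likely case, as the message suggests. */
#define demo_likely(x)   (x)

/* Sketch only: the expression is passed as-is, with no extra "!= 0". */
#define demo_unlikely(x) (__builtin_expect((x), 0))

/* Hypothetical user of the macros: sums the strictly positive entries,
 * or returns -1 when no array is given. The pointer is turned into an
 * integer condition by the explicit comparison with NULL.
 */
int demo_sum_positive(const int *v, int n)
{
	int i, sum = 0;

	if (demo_unlikely(v == NULL))
		return -1;

	for (i = 0; i < n; i++)
		if (demo_likely(v[i] > 0))
			sum += v[i];
	return sum;
}
```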
Willy Tarreau	1db37710dc	[MEDIUM] limit the number of events returned by poll By default, epoll/kqueue used to return as many events as possible. This could sometimes cause huge latencies (latencies of up to 400 ms have been observed with many thousands of fds at once). Limiting the number of events returned also reduces the latency by avoiding too much blind processing. The value is set to 200 by default and can be changed in the global section using the tune.maxpollevents parameter.	2007-06-03 17:16:49 +02:00
Willy Tarreau	fb8983f21b	[BUG] the epoll FD must not be shared between processes Recreate the epoll file descriptor after a fork(). It will ensure that all processes will not share their epoll_fd. Some side effects were encountered because of this, such as epoll_wait() returning an FD which was previously deleted, in multi-process mode.	2007-06-03 16:40:44 +02:00
Willy Tarreau	bdefc513a0	[BUG] fix null timeouts in poll-based pollers Introduction of timeval timers broke poll-based pollers, because the call to tv_ms_remain may return 0 while the event is not elapsed yet. Now we carefully check for those cases and round the result up by 1 ms.	2007-05-14 02:02:04 +02:00
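The rounding problem can be shown with a short sketch. `demo_ms_remain_roundup()` is a hypothetical function and not HAProxy's tv_ms_remain(): it converts the time left before an expiration date into a poll() timeout and rounds any fraction of a millisecond up, so that 0 is only returned when the date has really been reached.

```c
#include <sys/time.h>

/* Hypothetical helper (not part of HAProxy): returns the number of
 * milliseconds left between <now> and <exp>, rounded up, or 0 if <exp>
 * is already reached. Very large gaps are not handled in this sketch.
 */
int demo_ms_remain_roundup(const struct timeval *now, const struct timeval *exp)
{
	long long us;

	us  = (long long)(exp->tv_sec - now->tv_sec) * 1000000LL;
	us += (long long)exp->tv_usec - (long long)now->tv_usec;

	if (us <= 0)
		return 0;

	/* 1..1000 us remaining must yield 1 ms, not 0 */
	return (int)((us + 999) / 1000);
}
```

Truncating instead of rounding up would make the poller return immediately for the last fraction of a millisecond and spin until the timer finally expires.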
Willy Tarreau	d825eef9c5	[MAJOR] replaced all timeouts with struct timeval The timeout functions were difficult to manipulate because they were rounding results to the millisecond. Thus, it was difficult to compare and to check what expired and what did not. Also, the comparison functions were heavy with multiplies and divides by 1000. Now, all timeouts are stored in timevals, reducing the number of operations for updates and leading to cleaner and more efficient code.	2007-05-12 22:35:00 +02:00
Willy Tarreau	ef1d1f859b	[MAJOR] auto-registering of pollers at load time Gcc provides __attribute__((constructor)) which is very convenient to execute functions at startup right before main(). All the pollers have been converted to have their register() function declared like this, so that it is not necessary anymore to call them from a centralized file.	2007-04-16 00:25:25 +02:00
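The mechanism can be sketched as follows for GCC-compatible compilers. Everything here is hypothetical and much simpler than HAProxy's poller structure: the `demo_` names, the fixed-size table and the preference values are assumptions chosen for the example.

```c
#include <stddef.h>

/* Hypothetical, simplified poller descriptor. */
struct demo_poller {
	const char *name;
	int pref; /* higher means preferred; values below are made up */
};

#define DEMO_MAX_POLLERS 8
static struct demo_poller demo_pollers[DEMO_MAX_POLLERS];
static int demo_nbpollers;

static void demo_register(const char *name, int pref)
{
	if (demo_nbpollers < DEMO_MAX_POLLERS) {
		demo_pollers[demo_nbpollers].name = name;
		demo_pollers[demo_nbpollers].pref = pref;
		demo_nbpollers++;
	}
}

/* Each poller registers itself before main() runs, so no central file
 * has to know the list of pollers compiled in.
 */
__attribute__((constructor))
static void demo_select_register(void)
{
	demo_register("select", 150);
}

__attribute__((constructor))
static void demo_poll_register(void)
{
	demo_register("poll", 200);
}

/* Returns the registered poller with the highest preference, or NULL. */
const struct demo_poller *demo_best_poller(void)
{
	const struct demo_poller *best = NULL;
	int i;

	for (i = 0; i < demo_nbpollers; i++)
		if (!best || demo_pollers[i].pref > best->pref)
			best = &demo_pollers[i];
	return best;
}
```

Because the selection is based on the preference value, the result does not depend on the order in which the constructors happen to run.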
Willy Tarreau	b40d42006c	[BUILD] declare epoll_* as static when using our own functions We will have to share this code among several implementations.	2007-04-15 23:57:41 +02:00
Willy Tarreau	58094f2fd9	[MAJOR] ev_epoll: do not rely on fd_sets anymore The new epoll-based poller uses a list of changes in order to process only the fds which have changed.	2007-04-10 01:43:43 +02:00
Willy Tarreau	2ff7622c0c	[MAJOR] delay registering of listener sockets at startup Some pollers such as kqueue lose their FD across fork(), meaning that the registered file descriptors are lost too. Now when the proxies are started by start_proxies(), the file descriptors are not registered yet, leaving enough time for the fork() to take place and to get a new pollfd. It will be the first call to maintain_proxies that will register them.	2007-04-09 19:29:56 +02:00
Willy Tarreau	63455a9be5	[MINOR] use 'is_set' instead of 'isset' in struct poller 'isset' was defined as a macro in /usr/include/sys/param.h, and it breaks build on at least OpenBSD.	2007-04-09 15:34:49 +02:00
Willy Tarreau	69801b8e77	[MINOR] removed proto/polling.h which was not used anymore	2007-04-09 15:28:51 +02:00
Willy Tarreau	e54e9176a3	[MINOR] ev_* : moved the poll function closer to fd_*	2007-04-09 09:23:31 +02:00
Willy Tarreau	97129b5408	[MINOR] changed fd_set/fd_clr functions to return ints The fd_* functions now return ints so that they can be factored when appropriate.	2007-04-09 00:54:46 +02:00
Willy Tarreau	28d86862bc	[MEDIUM] pollers: store the events in arrays Instead of managing StaticReadEvent/StaticWriteEvent, use evts[dir]	2007-04-08 17:42:27 +02:00
Willy Tarreau	4f60f16dd3	[MAJOR] modularize the polling mechanisms select, poll and epoll now have their dedicated functions and have been split into distinct files. Several FD manipulation primitives have been provided with each poller. The rest of the code needs to be cleaned to remove traces of StaticReadEvent/StaticWriteEvent. A trick involving a macro has temporarily been used right now. Some work needs to be done to factorize tests and sets everywhere.	2007-04-08 16:39:58 +02:00


138 Commits