haproxy

mirror of https://git.haproxy.org/git/haproxy.git/ synced 2026-01-23 02:51:12 +01:00

Author	SHA1	Message	Date
Willy Tarreau	8715dec6f9	MEDIUM: pools: remove the locked pools implementation Now that the modified lockless variant does not need a DWCAS anymore, there's no reason to keep the much slower locked version, so let's just get rid of it.	2021-06-10 17:46:50 +02:00
Willy Tarreau	2a4523f6f4	BUG/MAJOR: pools: fix possible race with free() in the lockless variant In GH issue #1275, Fabiano Nunes Parente provided a nicely detailed report showing reproducible crashes under musl. Musl is one of the libs coming with a simple allocator for which we prefer to keep the shared cache. On x86 we have a DWCAS so the lockless implementation is enabled for such libraries. And this implementation has had a small race since day one: the allocator will need to read the first object's <next> pointer to place it into the free list's head. If another thread picks the same element and immediately releases it, while both the local and the shared pools are too crowded, it will be freed to the OS. If the libc's allocator immediately releases it, the memory area is unmapped and we can have a crash while trying to read that pointer. However there is no problem as long as the item remains mapped in memory because whatever value found there will not be placed into the head since the counter will have changed. The probability for this to happen is extremely low, but as analyzed by Fabiano, it increases with the buffer size. On 16 threads it's relatively easy to reproduce with 2MB buffers above 200k req/s, where it should happen within the first 20 seconds of traffic usually. This is a structural issue for which there are two non-trivial solutions: - place a read lock in the alloc call and a barrier made of lock/unlock in the free() call to force to serialize operations; this will have a big performance impact since free() is already one of the contention points; - change the allocator to use a self-locked head, similar to what is done in the MT_LISTS. This requires two memory writes to the head instead of a single one, thus the overhead is exactly one memory write during alloc and one during free; This patch implements the second option. A new POOL_DUMMY pointer was defined for the locked pointer value, allowing to both read and lock it with a single xchg call. The code was carefully optimized so that the locked period remains the shortest possible and that bus writes are avoided as much as possible whenever the lock is held. Tests show that while a bit slower than the original lockless implementation on large buffers (2MB), it's 2.6 times faster than both the no-cache and the locked implementation on such large buffers, and remains as fast or faster than the all implementations when buffers are 48k or higher. Tests were also run on arm64 with similar results. Note that this code is not used on modern libcs featuring a fast allocator. A nice benefit of this change is that since it removes a dependency on the DWCAS, it will be possible to remove the locked implementation and replace it with this one, that is then usable on all systems, thus significantly increasing their performance with large buffers. Given that lockless pools were introduced in 1.9 (not supported anymore), this patch will have to be backported as far as 2.0. The code changed several times in this area and is subject to many ifdefs which will complicate the backport. What is important is to remove all the DWCAS code from the shared cache alloc/free lockless code and replace it with this one. The pool_flush() code is basically the same code as the allocator, retrieving the whole list at once. If in doubt regarding what barriers to use in older versions, it's safe to use the generic ones. This patch depends on the following previous commits: - MINOR: pools: do not maintain the lock during pool_flush() - MINOR: pools: call malloc_trim() under thread isolation - MEDIUM: pools: use a single pool_gc() function for locked and lockless The last one also removes one occurrence of an unneeded DWCAS in the code that was incompatible with this fix. The removal of the now unused seq field will happen in a future patch. Many thanks to Fabiano for his detailed report, and to Olivier for his help on this issue.	2021-06-10 17:46:50 +02:00
Willy Tarreau	9a7aa3b4a1	BUG/MINOR: pools: make DEBUG_UAF always write to the to-be-freed location Since the code was reorganized, DEBUG_UAF was still tested in the locked pool code despite pools being disabled when DEBUG_UAF is used. Let's move the test to pool_put_to_os() which is the one that is always called in this condition. The impact is only a possible misleading analysis during a troubleshooting session due to a missing double-frees or free of const area test that is normally already dealt with by the underlying code anyway. In practice it's unlikely anyone will ever notice. This should only be backported to 2.4.	2021-06-10 17:46:50 +02:00
Willy Tarreau	2b71810cb3	CLEANUP: lists/tree-wide: rename some list operations to avoid some confusion The current "ADD" vs "ADDQ" is confusing because when thinking in terms of appending at the end of a list, "ADD" naturally comes to mind, but here it does the opposite, it inserts. Several times already it's been incorrectly used where ADDQ was expected, the latest of which was a fortunate accident explained in 6fa922562 ("CLEANUP: stream: explain why we queue the stream at the head of the server list"). Let's use more explicit (but slightly longer) names now: LIST_ADD -> LIST_INSERT LIST_ADDQ -> LIST_APPEND LIST_ADDED -> LIST_INLIST LIST_DEL -> LIST_DELETE The same is true for MT_LISTs, including their "TRY" variant. LIST_DEL_INIT keeps its short name to encourage to use it instead of the lazier LIST_DELETE which is often less safe. The change is large (~674 non-comment entries) but is mechanical enough to remain safe. No permutation was performed, so any out-of-tree code can easily map older names to new ones. The list doc was updated.	2021-04-21 09:20:17 +02:00
Willy Tarreau	942b89f7dc	BUILD: pools: fix build with DEBUG_FAIL_ALLOC Amaury noticed that I managed to break the build of DEBUG_FAIL_ALLOC for the second time with 207c09509 ("MINOR: pools: move the fault injector to __pool_alloc()"). The joy of endlessly reworking patch sets... No backport is needed, that was in the just merged cleanup series.	2021-04-19 18:36:48 +02:00
Willy Tarreau	096b6cf581	CLEANUP: pools: declare dummy pool functions to remove some ifdefs By having a pair of dummy pool_get_from_cache() and pool_put_to_cache() we can remove some ugly ifdefs, so let's do this. We've already done it for the shared cache.	2021-04-19 15:24:33 +02:00
Willy Tarreau	b2a853d5f0	CLEANUP: pools: uninline pool_put_to_cache() This function has become too big (251 bytes) and is now hurting performance a lot, with up to 4% request rate being lost over the last pool changes. Let's move it to pool.c as a regular function. Other attempts were made to cut it in half but it's still inefficient. Doing this results in saving ~90kB of object code, and even 112kB since the pool changes, with code that is even slightly faster! Conversely, pool_get_from_cache(), which remains half of this size, is still faster inlined, likely in part due to the immediate use of the returned pointer afterwards.	2021-04-19 15:24:33 +02:00
Willy Tarreau	43d4ed548f	CLEANUP: pools: merge pool_{get_from,put_to}_local_caches with generic ones Since pool_get_from_cache() and pool_put_to_cache() were now only wrappers to the local cache versions which do all the job, let's merge them together so that there is no more local-cache specific function.	2021-04-19 15:24:33 +02:00
Willy Tarreau	d56db11447	CLEANUP: pools: make the local cache allocator fall back to the shared cache Now when pool_get_from_local_cache() fails, it automatically falls back to pool_get_from_shared_cache(), which used to always be done in pool_get_from_cache(). Thus now the API is simpler as we always allocate and free from/to the local caches.	2021-04-19 15:24:33 +02:00
Willy Tarreau	fa19d20ac4	MEDIUM: pools: make pool_put_to_cache() always call pool_put_to_local_cache() Till now it used to call it only if there were not too many objects into the local cache otherwise would send the latest one directly into the shared cache. Now it always sends to the local cache and it's up to the local cache to free its oldest objects. From a cache freshness perspective it's better this way since we always evict cold objects instead of hot ones. From an API perspective it's better because it will help make the shared cache invisible to the public API.	2021-04-19 15:24:33 +02:00
Willy Tarreau	147e1fa385	MINOR: pools: create unified pool_{get_from,put_to}_cache() These two functions are now responsible for allocating directly from the cache and releasing to the cache. Now the pool_alloc() function simply does this: if cache enabled return pool_alloc_from_cache() if no NULL return pool_alloc_nocache() otherwise and the pool_free() function does this: if cache enabled pool_put_to_cache() else pool_free_nocache() For now this only introduces these two functions without changing anything else, but the goal is to soon allow to make them implementation-specific.	2021-04-19 15:24:33 +02:00
Willy Tarreau	b8498e961a	MEDIUM: pools: make CONFIG_HAP_POOLS control both local and shared pools Continuing the unification of local and shared pools, now the usage of pools is governed by CONFIG_HAP_POOLS without which allocations and releases are performed directly from the OS using pool_alloc_nocache() and pool_free_nocache().	2021-04-19 15:24:33 +02:00
Willy Tarreau	45e4e28161	MINOR: pools: factor the release code into pool_put_to_os() There are two levels of freeing to the OS: - code that wants to keep the pool's usage counters updated uses pool_free_area() and handles the counters itself. That's what pool_put_to_shared_cache() does in the no-global-pools case. - code that does not want to update the counters because they were already updated only calls pool_free_area(). Let's extract these calls to establish the symmetry with pool_get_from_os() and pool_alloc_nocache(), resulting in pool_put_to_os() (which only updates the allocated counter) and pool_free_nocache() (which also updates the used counter). This will later allow to simplify the generic code.	2021-04-19 15:24:33 +02:00
Willy Tarreau	acf0c54491	MINOR: pools: move pool_free_area() out of the lock in the locked version Calling pool_free_area() inside a lock in pool_put_to_shared_cache() is a very bad idea. Fortunately this only happens on the lowest end platforms which almost never use threads or in very small counts. This change consists in zeroing the pointer once already released to the cache in the first test so that the second stage knows if it needs to pass it to the OS or not. This has slightly reduced the length of the	2021-04-19 15:24:33 +02:00
Willy Tarreau	2b5579f6da	MINOR: pools: always use atomic ops to maintain counters A part of the code cannot be factored out because it still uses non-atomic inc/dec for pool->used and pool->allocated as these are located under the pool's lock. While it can make sense in terms of bus cycles, it does not make sense in terms of code normalization. Further, some operations were still performed under a lock that could be totally removed via the use of atomic ops. There is still one occurrence in pool_put_to_shared_cache() in the locked code where pool_free_area() is called under the lock, which must absolutely be fixed.	2021-04-19 15:24:33 +02:00
Willy Tarreau	13843641e5	MINOR: pools: split the OS-based allocator in two Now there's one part dealing with the allocation itself and keeping counters up to date, and another one on top of it to return such an allocated pointer to the user and update the use count and stats. This is in anticipation for being able to group cache-related parts. The release code is still done at once.	2021-04-19 15:24:33 +02:00
Willy Tarreau	207c095098	MINOR: pools: move the fault injector to __pool_alloc() Till now it was limited to objects allocated from the OS which means it had little use as soon as pools were enabled. Let's move it upper in the layers so that any code can benefit from fault injection. In addition this allows to pass a new flag POOL_F_NO_FAIL to disable it if some callers prefer a no-failure approach.	2021-04-19 15:24:33 +02:00
Willy Tarreau	635cced32f	CLEANUP: pools: rename __pool_free() to pool_put_to_shared_cache() Now the multi-level cache becomes more visible: pool_get_from_local_cache() pool_put_to_local_cache() pool_get_from_shared_cache() pool_put_to_shared_cache()	2021-04-19 15:24:33 +02:00
Willy Tarreau	8c77ee5ae5	CLEANUP: pools: rename pool__{from,to}_cache() to _local_cache() The functions were rightfully called from/to_cache when the thread-local cache was considered as the only cache, but this is getting terribly confusing. Let's call them from/to local_cache to make it clear that it is not related with the shared cache. As a side note, since pool_evict_from_cache() used not to work for a particular pool but for all of them at once, it was renamed to pool_evict_from_local_caches() (plural form).	2021-04-19 15:24:33 +02:00
Willy Tarreau	2f03dcde91	CLEANUP: pools: rename __pool_get_first() to pool_get_from_shared_cache() This is exactly what it is, the entry is retrieved from the shared cache when it is defined. The implementation that is enabled with CONFIG_HAP_NO_GLOBAL_POOLS continues to return NULL.	2021-04-19 15:24:33 +02:00
Willy Tarreau	2543211830	CLEANUP: pools: move the lock to the only __pool_get_first() that needs it Now that __pool_alloc() only surrounds __pool_get_first() with the lock, let's move it to the only variant that requires it and remove the ugly ifdefs from the function. This is safe because nobody else calls this function.	2021-04-19 15:24:33 +02:00
Willy Tarreau	8ee9df57db	MINOR: pools: call pool_alloc_nocache() out of the pool's lock In __pool_alloc(), historically we used to use factor out the pool's lock between __pool_get_first() and __pool_refill_alloc(), resulting in real malloc() or mmap() calls being performed under the pool lock (for platforms using the locked shared pools). As this is not needed anymore, let's move the call out of the lock, it may improve allocation patterns on some platforms. This also makes __pool_alloc() cleaner as we see a first attempt to allocate from the local cache, then a second from the shared cache then a reall allocation.	2021-04-19 15:24:33 +02:00
Willy Tarreau	8fe726f118	CLEANUP: pools: re-merge pool_refill_alloc() and __pool_refill_alloc() They were strictly equivalent, let's remerge them and rename them to pool_alloc_nocache() as it's the call which performs a real allocation which does not check nor update the cache. The only difference in the past was the former taking the lock and not the second but now the lock is not needed anymore at this stage since the pool's list is not touched. In addition, given that the "avail" argument is no longer used by the function nor by its callers, let's drop it.	2021-04-19 15:24:33 +02:00
Willy Tarreau	64383b8181	MINOR: pools: make the basic pool_refill_alloc()/pool_free() update needed_avg This is a first step towards unifying all the fallback code. Right now these two functions are the only ones which do not update the needed_avg rate counter since there's currently no shared pool kept when using them. But their code is similar to what could be used everywhere except for this one, so let's make them capable of maintaining usage statistics. As a side effect the needed field in "show pools" will now be populated.	2021-04-19 15:24:33 +02:00
Willy Tarreau	2d6f628d34	MINOR: pools: rename CONFIG_HAP_LOCAL_POOLS to CONFIG_HAP_POOLS We're going to make the local pool always present unless pools are completely disabled. This means that pools are always enabled by default, regardless of the use of threads. Let's drop this notion of "local" pools and make it just "pool". The equivalent debug option becomes DEBUG_NO_POOLS instead of DEBUG_NO_LOCAL_POOLS. For now this changes nothing except the option and dropping the dependency on USE_THREAD.	2021-04-19 15:24:33 +02:00
Willy Tarreau	d5140e7c6f	MINOR: pool: remove the size field from pool_cache_head Everywhere we have access to the pool so we don't need to cache a copy of the pool's size into the pool_cache_head. Let's remove it.	2021-04-19 15:24:33 +02:00
Willy Tarreau	9f3129e583	MEDIUM: pools: move the cache into the pool header Initially per-thread pool caches were stored into a fixed-size array. But this was a bit ugly because the last allocated pools were not able to benefit from the cache at all. As a work around to preserve performance, a size of 64 cacheable pools was set by default (there are 51 pools at the moment, excluding any addon and debugging code), so all in-tree pools were covered, at the expense of higher memory usage. In addition an index had to be calculated for each pool, and was used to acces the pool cache head into that array. The pool index was not even stored into the pools so it was required to determine it to access the cache when the pool was already known. This patch changes this by moving the pool cache head into the pool head itself. This way it is certain that each pool will have its own cache. This removes the need for index calculation. The pool cache head is 32 bytes long so it was aligned to 64B to avoid false sharing between threads. The extra cost is not huge (~2kB more per pool than before), and we'll make better use of that space soon. The pool cache head contains the size, which should probably be removed since it's already in the pool's head.	2021-04-19 15:24:33 +02:00
Willy Tarreau	fff96b441f	CLEANUP: pools: remove unused arguments to pool_evict_from_cache() In commit fb117e6a8 ("MEDIUM: memory: don't let pool_put_to_cache() free the objects itself") pool_evict_from_cache() was introduced with no argument, yet the only call place passes it the pool, the pointer and the index number! Let's remove these as they even let the reader think that the function does something specific to the current pool while it's not the case.	2021-04-19 15:24:33 +02:00
Willy Tarreau	ff88270ef9	MINOR: pool: move pool declarations to read_mostly All pool heads are accessed via a pointer and should not be shared with highly written variables. Move them to the read_mostly section.	2021-04-10 19:27:41 +02:00
Willy Tarreau	4781b1521a	CLEANUP: atomic/tree-wide: replace single increments/decrements with inc/dec This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1) or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.	2021-04-07 18:18:37 +02:00
Willy Tarreau	18759079b6	MINOR: pools: add pool_zalloc() to return a zeroed area It's like pool_alloc() but the output is zeroed before being returned and is never poisonned.	2021-03-22 22:05:05 +01:00
Willy Tarreau	de749a9333	MINOR: pools: make the pool allocator support a few flags The pool_alloc_dirty() function was renamed to __pool_alloc() and now takes a set of flags indicating whether poisonning is permitted or not and whether zeroing the area is needed or not. The pool_alloc() function is now just a wrapper calling __pool_alloc(pool, 0).	2021-03-22 20:54:15 +01:00
Willy Tarreau	a213b683f7	CLEANUP: pools: remove the unused pool_get_first() function This one used to maintain a shortcut in the pools allocation path that was only justified by b_alloc_fast() which was not used! Let's get rid of it as well so that the allocator becomes a bit more straight forward.	2021-03-22 16:28:08 +01:00
Willy Tarreau	0bae075928	MEDIUM: pools: add CONFIG_HAP_NO_GLOBAL_POOLS and CONFIG_HAP_GLOBAL_POOLS We've reached a point where the global pools represent a significant bottleneck with threads. On a 64-core machine, the performance was divided by 8 between 32 and 64 H2 connections only because there were not enough entries in the local caches to avoid picking from the global pools, and the contention on the list there was very high. It becomes obvious that we need to have an array of lists, but that will require more changes. In parallel, standard memory allocators have improved, with tcmalloc and jemalloc finding their ways through mainstream systems, and glibc having upgraded to a thread-aware ptmalloc variant, keeping this level of contention here isn't justified anymore when we have both the local per-thread pool caches and a fast process-wide allocator. For these reasons, this patch introduces a new compile time setting CONFIG_HAP_NO_GLOBAL_POOLS which is set by default when threads are enabled with thread local pool caches, and we know we have a fast thread-aware memory allocator (currently set for glibc>=2.26). In this case we entirely bypass the global pool and directly use the standard memory allocator when missing objects from the local pools. It is also possible to force it at compile time when a good allocator is used with another setup. It is still possible to re-enable the global pools using CONFIG_HAP_GLOBAL_POOLS, if a corner case is discovered regarding the operating system's default allocator, or when building with a recent libc but a different allocator which provides other benefits but does not scale well with threads.	2021-03-05 08:30:08 +01:00
Willy Tarreau	20dc3cd4a6	MINOR: pools: move the LRU cache heads to thread_info The LRU cache head was an array of list, which causes false sharing between 4 to 8 threads in the same cache line. Let's move it to the thread_info structure instead. There's no need to do the same for the pool_cache[] array since it's already quite large (32 pointers each). By doing this the request rate increased by 1% on a 16-thread machine.	2020-06-29 10:36:37 +02:00
Willy Tarreau	d0ef439699	REORG: include: move common/memory.h to haproxy/pool.h Now the file is ready to be stored into its final destination. A few minor reorderings were performed to keep the file properly organized, making the various sections more visible (cache & lockless). In addition and to stay consistent, memory.c was renamed to pool.c.	2020-06-11 10:18:57 +02:00

36 Commits