8 Commits

Author SHA1 Message Date
Willy Tarreau
fef4cfbd21 IMPORT: ebtree: only use __builtin_prefetch() when supported
It looks like __builtin_prefetch() appeared in gcc-3.1 as there's no
mention of it in 3.0's doc. Let's replace it with eb_prefetch() which
maps to __builtin_prefetch() on compilers that support it, and falls back
to the usual empty do{}while(0) on the others. It was verified to build
properly with both tcc and gcc-2.95.
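
A minimal sketch of the kind of wrapper described above, assuming a
gcc-3.1 version check (the exact test and macro body in the actual
commit may differ):

  /* eb_prefetch(): use the compiler builtin when it is known to exist
   * (gcc >= 3.1 here, as an assumption), otherwise expand to a harmless
   * empty statement so older compilers like gcc-2.95 or tcc still build. */
  #if defined(__GNUC__) && ((__GNUC__ > 3) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1))
  # define eb_prefetch(addr) __builtin_prefetch(addr)
  #else
  # define eb_prefetch(addr) do { } while (0)
  #endif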

This is ebtree commit 7ee6ede56a57a046cb552ed31302b93ff1a21b1a.
2025-09-17 14:30:32 +02:00
Willy Tarreau
6c54bf7295 IMPORT: eb32/eb64: place an unlikely() on the leaf test
In the loop we can help the compiler build slightly more efficient code
by placing an unlikely() around the leaf test. This shows a consistent
0.5% performance gain both on eb32 and eb64.
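
For reference, the usual GCC idiom behind such a hint looks like the
sketch below; the project's real unlikely() definition lives in its own
compiler header and may differ:

  /* Tell the compiler the condition is expected to be false most of the
   * time, so the fall-through (non-leaf) path stays the hot one. */
  #if defined(__GNUC__)
  # define unlikely(x) __builtin_expect(!!(x), 0)
  #else
  # define unlikely(x) (x)
  #endif

  /* typical use in the descent loop: the leaf is only reached once per lookup */
  /* if (unlikely(eb_gettag(troot) == EB_LEAF)) { ... handle the leaf ... }   */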

This is ebtree commit 6c9cdbda496837bac1e0738c14e42faa0d1b92c4.
2025-09-17 14:30:32 +02:00
Willy Tarreau
6af17d491f IMPORT: eb32/eb64: reorder the lookup loop for modern CPUs
The current code derives the next troot from an arithmetic calculation.
This was efficient when the algorithm was developed many years ago
on K6 and K7 CPUs running at low frequencies with few registers and
limited branch prediction units but nowadays with ultra-deep pipelines
and high latency memory that's no longer efficient, because the CPU
needs to have completed multiple operations before knowing which
address to start fetching from. It's sad because we only have two
branches each time but the CPU cannot know it. In addition, the
calculation is performed late in the loop, which does not help the
address generation unit start prefetching the next data.

Instead we should help the CPU by preloading data early from the node
and calculating troot as soon as possible. The CPU will be able to
postpone that processing until the dependencies are available and it
really needs to dereference it. In addition we must absolutely avoid
serializing instructions such as "(a >> b) & 1" because there's no
way for the compiler to parallelize that code nor for the CPU to pre-
process some early data.

What this patch does is relatively simple (a sketch follows the list):

  - we try to prefetch the next two branches as soon as the
    node is known, which will help dereference the selected node in
    the next iteration; it was shown that it only works with the next
    changes though, otherwise it can reduce the performance instead.
    In practice the prefetching will start a bit later once the node
    is really in the cache, but since there's no dependency between
    these instructions and any other one, we let the CPU optimize as
    it wants.

  - we preload all important data from the node (next two branches,
    key and node.bit) very early even if not immediately needed.
    This is cheap, it doesn't cause any pipeline stall and speeds
    up later operations.

  - we pre-calculate 1<<bit and assign it to a register, so as to avoid
    serializing instructions when deciding which branch to take.

  - we assign the troot based on a ternary operation (or if/else) so
    that the CPU knows upfront the two possible next addresses without
    waiting for the end of a calculation and can prefetch their contents
    every time the branch prediction unit guesses right.
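
Purely as an illustration of that pattern, the sketch below shows a
self-contained descent loop on a toy binary tree applying the same four
ideas (early loads, two prefetches, pre-computed mask, ternary branch
selection). The node layout and names are assumptions, not the actual
eb32/eb64 code:

  #include <stdint.h>
  #include <stddef.h>

  /* Toy node used only to illustrate the technique (not the ebtree layout).
   * Leaves keep bit == 0 so the shift below remains well-defined. */
  struct toy_node {
          struct toy_node *branch[2];   /* [0] = left, [1] = right */
          uint32_t key;
          int bit;                      /* bit tested at this level, 0..31 */
          int is_leaf;
  };

  #if defined(__GNUC__)
  # define toy_prefetch(p) __builtin_prefetch(p)
  #else
  # define toy_prefetch(p) do { } while (0)
  #endif

  struct toy_node *toy_lookup(struct toy_node *troot, uint32_t x)
  {
          while (troot) {
                  /* preload everything we may need from this node as early
                   * as possible; these loads do not depend on the branch
                   * decision and can be scheduled freely by the CPU */
                  struct toy_node *left  = troot->branch[0];
                  struct toy_node *right = troot->branch[1];
                  uint32_t key  = troot->key;
                  uint32_t mask = (uint32_t)1 << troot->bit; /* pre-computed mask */

                  if (troot->is_leaf)
                          return (key == x) ? troot : NULL;

                  /* hint the CPU about both possible next nodes */
                  toy_prefetch(left);
                  toy_prefetch(right);

                  /* both candidate addresses are known before the test
                   * resolves; no (x >> bit) & 1 dependency chain */
                  troot = (x & mask) ? right : left;
          }
          return NULL;
  }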

Just doing this provides significant gains at various tree sizes on
random keys (in million lookups per second):

  eb32   1k:  29.07 -> 33.17  +14.1%
        10k:  14.27 -> 15.74  +10.3%
       100k:   6.64 ->  8.00  +20.5%
  eb64   1k:  27.51 -> 34.40  +25.0%
        10k:  13.54 -> 16.17  +19.4%
       100k:   7.53 ->  8.38  +11.3%

The performance is now much closer to that of sequential keys. This was
done for all variants ({32,64}{,i,le,ge}).

Another point: the equality test in the loop improves performance when
looking up random keys (since we don't need to reach the leaf), but is
counter-productive for sequential keys, which can gain ~17% without that
test. However, sequential keys are normally not used with exact lookups,
but rather with lookup_ge() spanning a time frame, which does not have
that test for this precise reason, so in the end both use cases are
served optimally.
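
As a short illustration of that lookup_ge() pattern, a scan over a key
range (e.g. a time frame) typically looks like the hypothetical helper
below; eb32_lookup_ge() and eb32_next() are the existing ebtree
primitives, the rest is assumed for the example:

  #include <import/eb32tree.h>

  /* Visit all entries whose key lies in [start, end]: begin at the first
   * key >= start, then walk forward in key order. */
  static void scan_range(struct eb_root *root, u32 start, u32 end)
  {
          struct eb32_node *node;

          for (node = eb32_lookup_ge(root, start);
               node && node->key <= end;
               node = eb32_next(node)) {
                  /* process the entry carrying this node here */
          }
  }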

It's interesting to note that everything here is solely based on data
dependencies, and that trying to perform *fewer* operations upfront
always ends up with lower performance (typically the original one).

This is ebtree commit 05a0613e97f51b6665ad5ae2801199ad55991534.
2025-09-17 14:30:31 +02:00
Willy Tarreau
8d2b777fe3 REORG: ebtree: move the include files from ebtree to include/import/
This is where other imported components are located. All files which
used to directly include ebtree were touched to update their include
path so that "import/" is now prefixed before the ebtree-related files.

The ebtree.h file was slightly adjusted to read compiler.h from the
common/ subdirectory (this is the only change).

A build issue was encountered when eb32sctree.h is loaded before
eb32tree.h because only the former checks for the latter before
defining type u32. This was addressed by adding the reverse ifdef
in eb32tree.h.
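
A rough sketch of the kind of guard this refers to, with assumed include
guard names (the real macro names in the headers may differ):

  /* in eb32sctree.h: only define u32 when eb32tree.h was not seen first */
  #ifndef _EB32TREE_H
  typedef unsigned int u32;
  #endif

  /* in eb32tree.h: the reverse check added here */
  #ifndef _EB32SCTREE_H
  typedef unsigned int u32;
  #endif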

No further cleanup was done yet in order to keep changes minimal.
2020-06-11 09:31:11 +02:00
Willy Tarreau
ff0e8a44a4 REORG: ebtree: move the C files from ebtree/ to src/
As part of the include files cleanup, we're going to kill the ebtree
directory. For this we need to host its C files in a different location
and src/ is the right one.
2020-06-11 09:31:11 +02:00
Willy Tarreau
3e79479348 [CLEANUP] ebtree: remove old unused files 2009-10-26 21:15:10 +01:00
Willy Tarreau
5804434a0f [MINOR] update ebtree to version 4.1
Ebtree version 4.1 brings lookup by ranges. This will be useful for
the scheduler.
2009-03-21 10:23:36 +01:00
Willy Tarreau
e6d2e4dbdf [MINOR] merge ebtree version 3.0
Version 3.0 of ebtree has been merged in but is not used yet.
2007-11-28 14:20:44 +01:00