mirror of https://git.haproxy.org/git/haproxy.git/ synced 2025-12-23 10:31:20 +01:00


Willy Tarreau 6af17d491f IMPORT: eb32/eb64: reorder the lookup loop for modern CPUs

The current code derives the next troot from an arithmetic calculation.
This was efficient when the algorithm was developed many years ago
on K6 and K7 CPUs running at low frequencies with few registers and
limited branch prediction units but nowadays with ultra-deep pipelines
and high latency memory that's no longer efficient, because the CPU
needs to have completed multiple operations before knowing which
address to start fetching from. It's sad because we only have two
branches each time but the CPU cannot know it. In addition, the
calculation is performed late in the loop, which does not help the
address generation unit to start prefetching next data.

Instead we should help the CPU by preloading data early from the node
and calculating troot as soon as possible. The CPU will be able to
postpone that processing until the dependencies are available and it
really needs to dereference it. In addition we must absolutely avoid
serializing instructions such as "(a >> b) & 1" because there's no
way for the compiler to parallelize that code nor for the CPU to pre-
process some early data.

What this patch does is relatively simple:

  - we try to prefetch the next two branches as soon as the
    node is known, which will help dereference the selected node in
    the next iteration; it was shown that this only helps together with
    the following changes though, otherwise it can reduce performance.
    In practice the prefetching will start a bit later once the node
    is really in the cache, but since there's no dependency between
    these instructions and any other one, we let the CPU optimize as
    it wants.

  - we preload all important data from the node (next two branches,
    key and node.bit) very early even if not immediately needed.
    This is cheap, it doesn't cause any pipeline stall and speeds
    up later operations.

  - we pre-calculate 1<<bit that we assign into a register, so as
    to avoid serializing instructions when deciding which branch to
    take.

  - we assign the troot based on a ternary operation (or if/else) so
    that the CPU knows upfront the two possible next addresses without
    waiting for the end of a calculation and can prefetch their contents
    every time the branch prediction unit guesses right.
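
The four points above can be illustrated with a minimal sketch. It is not
the actual eb32 code: `struct demo_node`, `lookup_calc()` and
`lookup_reordered()` are hypothetical names, the node layout is a
simplified stand-in (no tagged pointers, no duplicates, no in-loop
equality test), and `__builtin_prefetch()` assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified node, NOT the real eb32 layout. */
struct demo_node {
	struct demo_node *branch[2]; /* both NULL on a leaf */
	uint32_t key;                /* only meaningful on a leaf */
	int bit;                     /* bit tested at this node, -1 on a leaf */
};

/* Old style: the next address is the result of a serialized
 * shift-and-mask calculation performed at the end of the loop.
 */
static struct demo_node *lookup_calc(struct demo_node *node, uint32_t x)
{
	while (node && node->bit >= 0)
		node = node->branch[(x >> node->bit) & 1];
	return (node && node->key == x) ? node : NULL;
}

/* Reordered style: preload both branches and the bit early, turn the
 * bit into a mask, prefetch both candidates, then pick one of two
 * already known addresses.
 */
static struct demo_node *lookup_reordered(struct demo_node *node, uint32_t x)
{
	while (node && node->bit >= 0) {
		struct demo_node *b0 = node->branch[0];
		struct demo_node *b1 = node->branch[1];
		uint32_t mask = (uint32_t)1 << node->bit;

		/* prefetching is only a hint, it never faults */
		__builtin_prefetch(b0);
		__builtin_prefetch(b1);

		node = (x & mask) ? b1 : b0;
	}
	return (node && node->key == x) ? node : NULL;
}
```

Both functions return the same node for any key; only the order in which
the CPU sees the loads and the branch decision differs. Whether the
second form is faster depends on the CPU and the compiler, so it should
be measured rather than assumed.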

Just doing this provides significant gains at various tree sizes on
random keys (in million lookups per second):

  eb32   1k:  29.07 -> 33.17  +14.1%
        10k:  14.27 -> 15.74  +10.3%
       100k:   6.64 ->  8.00  +20.5%
  eb64   1k:  27.51 -> 34.40  +25.0%
        10k:  13.54 -> 16.17  +19.4%
       100k:   7.53 ->  8.38  +11.3%

The performance is now much closer to the sequential keys. This was
done for all variants ({32,64}{,i,le,ge}).

Another point: the equality test in the loop improves the performance
when looking up random keys (since we don't need to reach the leaf),
but is counter-productive for sequential keys, which can gain ~17%
without that test. However sequential keys are normally not used with
exact lookups, but rather with lookup_ge() that spans a time frame,
and which does not have that test for this precise reason, so in the
end both use cases are served optimally.

It's interesting to note that everything here is solely based on data
dependencies, and that trying to perform *fewer* operations upfront
always ends up with lower performance (typically the original one).

This is ebtree commit 05a0613e97f51b6665ad5ae2801199ad55991534.

2025-09-17 14:30:31 +02:00

.github

CI: github: add an OpenSSL + ECH job

2025-09-16 15:05:44 +02:00

addons

MINOR: applet: Add a flag to know an applet is using HTX buffers

2025-08-25 11:11:05 +02:00

admin

BUG/MINOR: halog: Add OOM checks for calloc() in filter_count_srv_status() and filter_count_url()

2025-09-02 07:29:54 +02:00

dev

DEV: gdb: add a memprofile decoder to the debug tools

2025-07-16 15:33:33 +02:00

doc

DOC: internals: document the shm-stats-file format/mapping

2025-09-17 11:32:58 +02:00

examples

MINOR: mailers: warn if mailers are configured but not actually used

2025-06-27 16:41:18 +02:00

include

IMPORT: eb32/eb64: reorder the lookup loop for modern CPUs

2025-09-17 14:30:31 +02:00

reg-tests

REGTESTS: ssl: Fix the script about automatic SNI selection

2025-09-08 15:55:56 +02:00

scripts

CI: scripts: mkdir BUILDSSL_TMPDIR

2025-09-16 15:35:35 +02:00

src

IMPORT: eb32/eb64: reorder the lookup loop for modern CPUs

2025-09-17 14:30:31 +02:00

tests

TESTS: quic: add unit-tests for QUIC TX part

2025-09-08 14:49:03 +02:00

.cirrus.yml

CI: cirrus-ci: bump FreeBSD image to 14-2

2025-02-12 13:18:55 +01:00

.gitattributes

MINOR: Configure the cpp userdiff driver for *.[ch] in .gitattributes

2021-02-22 18:17:57 +01:00

.gitignore

MINOR: tevt/dev: Add term_events tool

2025-01-31 10:41:50 +01:00

.mailmap

DOC: update Tim's address in .mailmap

2021-09-16 09:14:14 +02:00

.travis.yml

MEDIUM: mworker: remove USE_SYSTEMD requirement for -Ws

2024-11-20 12:07:38 +01:00

BRANCHES

DOC: fix some spelling issues over multiple files

2021-01-08 14:53:47 +01:00

BSDmakefile

BUILD: makefile: commit the tiny FreeBSD makefile stub

2023-05-24 17:17:36 +02:00

CHANGELOG

[RELEASE] Released version 3.3-dev8

2025-09-05 09:54:34 +02:00

CONTRIBUTING

CLEANUP: assorted typo fixes in the code and comments

2025-04-02 11:12:20 +02:00

INSTALL

BUILD: makefile: bump the default minimum linux version to 4.17

2025-09-05 09:44:56 +02:00

LICENSE

LICENSE: add licence exception for OpenSSL

2012-09-07 13:52:26 +02:00

MAINTAINERS

MAJOR: spoe: Let the SPOE back into the game

2024-05-22 09:04:38 +02:00

Makefile

IMPORT: cebtree: import version 0.5.0 to support duplicates

2025-09-16 09:23:46 +02:00

README.md

DOC: change the link to the FreeBSD CI in README.md

2024-06-03 15:21:29 +02:00

SUBVERS

BUILD: use format tags in VERDATE and SUBVERS files

2013-12-10 11:22:49 +01:00

VERDATE

[RELEASE] Released version 3.3-dev8

2025-09-05 09:54:34 +02:00

VERSION

[RELEASE] Released version 3.3-dev8

2025-09-05 09:54:34 +02:00

README.md

HAProxy

HAProxy is a free, very fast and reliable reverse-proxy offering high availability, load balancing, and proxying for TCP and HTTP-based applications.

Installation

The INSTALL file describes how to build HAProxy. A list of packages is also available on the wiki.

Getting help

The discourse and the mailing-list are available for questions or configuration assistance. You can also use the slack or IRC channel. Please don't use the issue tracker for these.

The issue tracker is only for bug reports or feature requests.

Documentation

The HAProxy documentation has been split into a number of different files for ease of use. It is available in text format as well as HTML. The wiki is also meant to replace the old architecture guide.

Please refer to the following files depending on what you're looking for:

INSTALL for instructions on how to build and install HAProxy
BRANCHES to understand the project's life cycle and what version to use
LICENSE for the project's license
CONTRIBUTING for the process to follow to submit contributions

The more detailed documentation is located in the doc/ directory:

doc/intro.txt for a quick introduction on HAProxy
doc/configuration.txt for the configuration's reference manual
doc/lua.txt for the Lua reference manual
doc/SPOE.txt for how to use the SPOE engine
doc/network-namespaces.txt for how to use network namespaces under Linux
doc/management.txt for the management guide
doc/regression-testing.txt for how to use the regression testing suite
doc/peers.txt for the peers protocol reference
doc/coding-style.txt for how to adopt HAProxy's coding style
doc/internals for developer-specific documentation (not all up to date)

License

HAProxy is licensed under GPL 2 or any later version, the headers under LGPL 2.1. See the LICENSE file for a more detailed explanation.

Languages

C 98%

Shell 0.9%

Makefile 0.5%

Lua 0.2%

Python 0.2%