From 'man gcc': passing 0 as the argument to "__builtin_ctz" or
"__builtin_clz" invokes undefined behavior. This triggers UBsan
in HAProxy.
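For illustration, a zero check of this kind is the usual way to avoid the
undefined behavior (the helper name below is hypothetical, not the actual
fix):

    /* Hypothetical helper, for illustration only: never pass 0 to
     * __builtin_ctz(), whose result is undefined for a zero argument. */
    static inline unsigned int ctz32_safe(unsigned int x)
    {
        return x ? (unsigned int)__builtin_ctz(x) : 32;
    }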
[wt: tested in treebench and verified not to cause any performance
regression with either opstime-u32 or stress-u32]
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit 8c29daf9fa6e34de8c7684bb7713e93dcfe09029.
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit cf3b93736cb550038325e1d99861358d65f70e9a.
While the previous optimizations couldn't be preserved due to the
possibility of out-of-bounds accesses, at least the prefetch is useful.
A test on treebench shows that for 64k short strings, the lookup time
falls from 276 to 199ns per lookup (28% savings), and the insert falls
from 311 to 296ns (4.9% savings), which are pretty respectable, so
let's do this.
This is ebtree commit b44ea5d07dc1594d62c3a902783ed1fb133f568d.
It looks like __builtin_prefetch() appeared in gcc-3.1 as there's no
mention of it in 3.0's doc. Let's replace it with eb_prefetch() which
maps to __builtin_prefetch() on supported compilers and falls back to
the usual do{}while(0) on other ones. It was tested to properly build
with tcc as well as gcc-2.95.
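A minimal sketch of such a fallback, assuming the usual gcc version macros
(the exact condition used in the real header may differ):

    /* Use the builtin when the compiler is known to provide it (gcc >= 3.1
     * here, which is an assumption), otherwise expand to a no-op statement. */
    #if defined(__GNUC__) && (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1))
    #define eb_prefetch(addr) __builtin_prefetch(addr)
    #else
    #define eb_prefetch(addr) do { } while (0)
    #endif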
This is ebtree commit 7ee6ede56a57a046cb552ed31302b93ff1a21b1a.
Similar to previous patches, let's improve the insert() descent loop to
avoid discovering mandatory data too late. The change here is even
simpler than the previous ones: a prefetch was installed and troot is
calculated speculatively before the last instruction. This was enough
to gain +50% insertion rate on random data.
This is ebtree commit e893f8cc4d44b10f406b9d1d78bd4a9bd9183ccf.
These are the same principles as in the latest improvements made on
integer trees. Applying the same recipes made the ebmb_lookup()
function jump from 10.07 to 12.25 million lookups per second on a
10k random values tree (+21.6%).
It's likely that the ebmb_lookup_longest() code could also benefit
from this, though this was neither explored nor tested.
This is ebtree commit a159731fd6b91648a2fef3b953feeb830438c924.
In the loop we can help the compiler build slightly more efficient code
by placing an unlikely() around the leaf test. This shows a consistent
0.5% performance gain both on eb32 and eb64.
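For reference, this is the usual way such a hint is expressed on
GCC-compatible compilers (an assumed definition, not copied from the tree
code):

    /* Tell the compiler the condition is expected to be false, so the
     * leaf-handling path is laid out off the hot path. */
    #define unlikely(x) __builtin_expect(!!(x), 0)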
This is ebtree commit 6c9cdbda496837bac1e0738c14e42faa0d1b92c4.
This one was previously used to preload from the node and keep a copy
in a register on i386 machines with few registers. With the new, more
optimal code it's totally useless, so let's get rid of it. Incidentally,
the 64-bit code never used it at all anyway.
This is ebtree commit 1e219a74cfa09e785baf3637b6d55993d88b47ef.
Instead of shifting the XOR value right and comparing it to 1, which
roughly requires 2 sequential instructions, it's better to test whether
the XOR has any bit set above the current one, i.e. any bit among those
strictly higher, or in other words that XOR & (-bit << 1) is non-zero.
This is one less instruction in the fast path and gives another nice
performance gain on random keys (in million lookups/s):
  eb32    1k: 33.17 -> 37.30  +12.5%
         10k: 15.74 -> 17.08  +8.51%
        100k:  8.00 ->  9.00  +12.5%
  eb64    1k: 34.40 -> 38.10  +10.8%
         10k: 16.17 -> 17.10  +5.75%
        100k:  8.38 ->  8.87  +5.85%
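For illustration, with the 1<<bit value precomputed as described earlier,
the test can be written as below (names are illustrative, not the actual
tree code):

    #include <stdint.h>

    /* Returns non-zero when <key> and <node_key> differ on a bit strictly
     * above <node_bit>. With bit = 1 << node_bit, the mask (-bit << 1) has
     * exactly all bits above node_bit set. */
    static inline uint32_t differs_above(uint32_t key, uint32_t node_key,
                                         unsigned int node_bit)
    {
        uint32_t bit = 1u << node_bit;

        return (key ^ node_key) & (-bit << 1);
    }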
This is ebtree commit c942a2771758eed4f4584fe23cf2914573817a6b.
The current code derives the next troot from an arithmetic calculation.
This was efficient when the algorithm was developed many years ago
on K6 and K7 CPUs running at low frequencies with few registers and
limited branch prediction units, but nowadays, with ultra-deep pipelines
and high-latency memory, that's no longer efficient, because the CPU
needs to have completed multiple operations before knowing which
address to start fetching from. It's sad because we only have two
branches each time but the CPU cannot know it. In addition, the
calculation is performed late in the loop, which does not help the
address generation unit to start prefetching next data.
Instead we should help the CPU by preloading data early from the node
and calculating troot as soon as possible. The CPU will be able to
postpone that processing until the dependencies are available and it
really needs to dereference it. In addition we must absolutely avoid
serializing instructions such as "(a >> b) & 1" because there's no
way for the compiler to parallelize that code nor for the CPU to pre-
process some early data.
What this patch does is relatively simple (a simplified sketch of the
resulting loop is shown after the list):
  - we try to prefetch the next two branches as soon as the
    node is known, which will help dereference the selected node in
    the next iteration; it was shown that it only works with the next
    changes though, otherwise it can reduce the performance instead.
    In practice the prefetching will start a bit later once the node
    is really in the cache, but since there's no dependency between
    these instructions and any other one, we let the CPU optimize as
    it wants.
  - we preload all important data from the node (next two branches,
    key and node.bit) very early even if not immediately needed.
    This is cheap, it doesn't cause any pipeline stall and speeds
    up later operations.
  - we pre-calculate 1<<bit that we assign into a register, so as
    to avoid serializing instructions when deciding which branch to
    take.
  - we assign the troot based on a ternary operation (or if/else) so
    that the CPU knows upfront the two possible next addresses without
    waiting for the end of a calculation and can prefetch their contents
    every time the branch prediction unit guesses right.
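As mentioned above, here is a simplified, self-contained sketch of the
resulting loop shape; the node layout and names are hypothetical stand-ins
and not the actual eb32 code (which uses tagged pointers and a distinct
leaf type):

    #include <stddef.h>

    #ifdef __GNUC__
    #define prefetch(p) __builtin_prefetch(p)
    #else
    #define prefetch(p) do { } while (0)
    #endif

    /* Hypothetical node type for this sketch only. */
    struct xnode {
        struct xnode *b[2];   /* left/right branches, NULL below a leaf */
        unsigned int key;
        int bit;              /* split bit; < 0 marks a leaf here */
    };

    /* Prefetch both branches early, preload the node's fields, precompute
     * the bit mask, and pick the next node with a ternary so both candidate
     * addresses are known without extra arithmetic. */
    static struct xnode *sketch_lookup(struct xnode *node, unsigned int key)
    {
        while (node) {
            struct xnode *l, *r;
            unsigned int y, mask;
            int bit;

            prefetch(node->b[0]);
            prefetch(node->b[1]);

            y   = node->key;
            bit = node->bit;
            l   = node->b[0];
            r   = node->b[1];

            if (key == y)
                return node;   /* equality short-cut for random keys */

            if (bit < 0)
                break;         /* reached a leaf without an exact match */

            mask = 1u << bit;
            node = (key & mask) ? r : l;
        }
        return NULL;
    }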
Just doing this provides significant gains at various tree sizes on
random keys (in million lookups per second):
  eb32    1k: 29.07 -> 33.17  +14.1%
         10k: 14.27 -> 15.74  +10.3%
        100k:  6.64 ->  8.00  +20.5%
  eb64    1k: 27.51 -> 34.40  +25.0%
         10k: 13.54 -> 16.17  +19.4%
        100k:  7.53 ->  8.38  +11.3%
The performance is now much closer to the sequential keys. This was
done for all variants ({32,64}{,i,le,ge}).
Another point: the equality test in the loop improves the performance
when looking up random keys (since we don't need to reach the leaf),
but is counter-productive for sequential keys, which can gain ~17%
without that test. However sequential keys are normally not used with
exact lookups, but rather with lookup_ge() that spans a time frame,
and which does not have that test for this precise reason, so in the
end both use cases are served optimally.
It's interesting to note that everything here is solely based on data
dependencies, and that trying to perform *less* operations upfront
always ends up with lower performance (typically the original one).
This is ebtree commit 05a0613e97f51b6665ad5ae2801199ad55991534.
Since commit 21fd162 ("[MEDIUM] make ebpttree rely solely on eb32/eb64
trees") it was no longer used and no longer builds. The commit message
mentions that the file is no longer needed, probably that a rebase failed
and left the file there.
This is ebtree commit fcfaf8df90e322992f6ba3212c8ad439d3640cb7.
Add some documentation about the shm stats file structure to help write
tools that can parse the file and make use of the shared stats counters.
This file was written specifically for shm stats file version 1.0; it
may need to be updated when the shm stats file structure changes in the
future.
Let's explicitly mention that the fe_counters_shared_tg and
be_counters_shared_tg structs are embedded in the shm_stats_file_object
struct, so any change in those structs will result in shm stats file
incompatibility between processes; thus extra precaution must be taken
when making changes to them.
Note that the provisioning made in the shm_stats_file_object struct could
be used to add members to {fe,be}_counters_shared_tg without changing
the shm_stats_file_object struct size if needed, in order to preserve
the shm stats file version.
When logsteps proxy storage was migrated from eb nodes to bitmasks in
6a92b14 ("MEDIUM: log/proxy: store log-steps selection using a bitmask,
not an eb tree"), some unused eb node related code was left over in
px_parse_log_steps().
Not only is this code unused, it also resulted in wasted memory since
an eb node was allocated for nothing.
This should fix GH #3121.
Commit e36b3b60b3 ("MEDIUM: migrate the patterns reference to cebs_tree")
changed the construction of the loops used to look up matching nodes, and
since we don't need two elements anymore, the "continue" statement now
loops on the same element when deleting. Let's fix this to make sure it
passes through the next one.
While this bug is 3.3 only, it turns out that 3.2 is also affected by
the incorrect loop construct in pat_ref_set_from_node(), where it's
possible to run an infinite loop since commit 010c34b8c7 ("MEDIUM:
pattern: consider gen_id in pat_ref_set_from_node()") due to the
"continue" statement being placed before the ebmb_next_dup() call.
As such the relevant part of this fix (pat_ref_set_from_elt) will
need to be backported to 3.2.
This reverts commit 359a829ccb8693e0b29808acc0fa7975735c0353.
The fix is neither sufficient nor correct (it triggers ASAN). Better
redo it cleanly rather than accumulate invalid fixes.
The upcoming ECH feature needs a patched OpenSSL with the "feature/ech"
branch.
This daily job launches an openssl build, as well as a haproxy build
with reg-tests.
Add support for git releases downloaded from github in openssl builds:
  - GIT_TYPE variable allows you to choose between "branch" or "commit"
  - OPENSSL_VERSION variable supports a "git-" prefix
  - "git-${commit_id}" is stored in .openssl_version instead of the branch
    name for version comparison.
Commit e36b3b60b3 ("MEDIUM: migrate the patterns reference to cebs_tree")
changed the construction of the loops used to look up matching nodes, and
since we don't need two elements anymore, the "continue" statement now
loops on the same element when deleting. Let's fix this to make sure it
passes through the next one.
No backport is needed, this is only 3.3.
The variables trees use the immediate cebtree API; better to use the
item one, which is more expressive and safer. The "node" field was
renamed to "name_node" to avoid any ambiguity.
Previously the conn_hash_node was placed outside the connection due
to the big size of the eb64_node that could have negatively impacted
frontend connections. But having it outside also means that one
extra allocation is needed for each backend connection, and that one
memory indirection is needed for each lookup.
With the compact trees, the tree node is smaller (16 bytes vs 40) so
the overhead is much lower. By integrating it into the connection,
we're also eliminating one pointer from the connection to the hash
node and one pointer from the hash node to the connection (in addition
to the extra object bookkeeping). This results in saving at least 24
bytes in total per backend connection, while only inflating connections
by 16 bytes (from 240 to 256), which is a reasonable compromise.
Tests on a 64-core EPYC show a 2.4% increase in the request rate
(from 2.08 to 2.13 Mrps).
Idle connection trees currently require a 56-byte conn_hash_node per
connection, which can be reduced to 32 bytes by moving to ceb64. While
ceb64 is theoretically slower, in practice here we're essentially
dealing with trees that almost always contain a single key and many
duplicates. In this case, ceb64 insert and lookup functions become
faster than eb64 ones because all duplicates are a list accessed in
O(1) while it's a subtree for eb64. In tests it is impossible to tell
the difference between the two, so it's worth reducing the memory
usage.
This commit brings the following memory savings to conn_hash_node
(one per backend connection), and to srv_per_thread (one per thread
and per server):
  struct            before   after   delta
  conn_hash_node        56      32     -24
  srv_per_thread        96      72     -24
The delicate part is conn_delete_from_tree(), because we need to
know the tree root the connection is attached to. But thanks to
recent cleanups, it's now clear enough (i.e. idle/safe/avail vs
session are easy to distinguish).
We'll soon need to choose the server's root based on the connection's
flags, and for this we'll need the thread it's attached to, which is
not always the current one. This patch simply passes the thread number
from all callers. They know it because they just set the idle_conns
lock on it prior to calling the function.
Probably due to older code, there's a boolean variable used to set
another one which is then checked. Also the first check is made under
the lock, which is unnecessary. Let's simplify this and use a single
variable. This only makes the code clearer, it doesn't change the output
code.
We'll need to have access to the srv_per_thread element soon from this
function, and there's no particular reason for passing it list pointers
so let's pass the server and the thread so that it is autonomous. It
also makes the calling code simpler.
There were a few leftovers from an earlier version of the conn_hash_node
that was using ebmb nodes. A few calls to ebmb_first() and ebmb_entry()
were still present while acting on an eb64 tree. These are harmless as
one is just eb_first() and the other container_of(), but it's confusing
so let's clean them up.
The connection lookup loop is made of two nearly identical blocks, one
looking into the idle or safe lists and the other one looking into the
safe list only. The second one is skipped if a connection was found or
if the request asks for a safe one (since that was already done). The
two also differ slightly due to leftovers from earlier versions: the
second one checks for safe connections while the first one does not,
and the second one sets is_safe, which is not used later.
Let's just rationalize all this by placing them in a loop which checks
first from the idle conns and second from the safe ones, or skips the
first step if the request wants a safe connection. This reduces the
code and shortens the time spent under the lock.
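A minimal sketch of the resulting control flow, with hypothetical helpers
standing in for the real per-thread lookups performed under the idle_conns
lock:

    struct connection;

    /* Hypothetical helpers for this sketch only. */
    struct connection *sketch_take_idle_conn(void);
    struct connection *sketch_take_safe_conn(void);

    /* Try idle first then safe, or start directly at the safe step when
     * the request requires a safe connection. */
    static struct connection *sketch_pick_conn(int want_safe)
    {
        struct connection *conn = NULL;
        int step;

        for (step = want_safe ? 1 : 0; !conn && step < 2; step++)
            conn = (step == 0) ? sketch_take_idle_conn()
                               : sketch_take_safe_conn();
        return conn;
    }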
The proxy struct has several small holes that deserved being plugged by
moving a few fields around. Now we're down to 3056 from 3072 previously,
and the remaining holes are small.
At the moment, compared to before this series, we're seeing these
sizes:
  type\size    7d554ca62   current    delta
  listener           752       704      -48  (-6.4%)
  server            4032      3840     -192  (-4.8%)
  proxy             3184      3056     -128  (-4%)
  stktable          3392      3328      -64  (-1.9%)
Configs with many servers have shrunk by about 4% in RAM and configs
with many proxies by about 3%.
The struct server still has a lot of holes and padding that make it
quite big. By moving a few fields around between areas which do not
interact (e.g. boot vs aligned areas), it's quite easy to plug some
of them and/or to arrange larger ones which could be reused later with
a bit more effort. Here we've reduced holes by 40 bytes, allowing the
struct to shrink by one more cache line (64 bytes). The new size is
3840 bytes.
The server ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct server for this node, plus 8 bytes in
struct proxy. The server struct is still 3904 bytes large (due to
alignment) and the proxy struct is 3072.
The listener ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct listener for this node, plus 8 bytes
in struct proxy. The struct listener is now 704 bytes large, and the
struct proxy 3080.
The proxy ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct proxy for this node, plus 8 bytes in
the root (which is static so not much relevant here). Now the proxy is
3088 bytes large.
In srv_parse_id(), there's no point doing all the low-level work with
the tree functions to check for the existence of an ID, we already have
server_find_by_id() which does exactly this, so let's use it.
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated server_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated listener_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
This is used to index the proxy's name and it contains a copy of the
pointer to the proxy's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct proxy, which is now 3112
bytes large.
Here we need to continue to support duplicates since they're still
allowed between type-incompatible proxies.
Interestingly, the use of cebis_next_dup() instead of cebis_next() in
proxy_find_by_name() allows us to get rid of an strcmp() that was
performed for each use_backend rule. A test with a large config
(100k backends) shows that we can get 3% extra performance on a
config involving a static use_backend rule (3.09M to 3.18M rps),
and even 4.5% on a dynamic rule selecting a random backend (2.47M
to 2.59M).
This member is used to index the hostname_dn contents for DNS resolution.
Let's replace it with a cebis_tree to save another 32 bytes (24 for the
node + 8 by avoiding the duplication of the pointer). The struct server is
now at 3904 bytes.
This is used to index the server name and it contains a copy of the
pointer to the server's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct server, which remains 3968
bytes large due to alignment. The proxy struct shrinks by 8 bytes to 3144.
It's worth noting that the current way duplicate names are handled remains
based on the previous mechanism where dups were permitted. Ideally we
should now reject them during insertion and use unique key trees instead.
This contains the text representation of the server's address, for use
with stick-tables with "srvkey addr". Switching them to a compact node
saves 24 more bytes from this structure. The key was moved to an external
pointer "addr_key" right after the node.
The server struct is now 3968 bytes (down from 4032) due to alignment, and
the proxy struct shrinks by 8 bytes to 3152.
The current guid struct size is 56 bytes. Once reduced using compact
trees, it goes down to 32 (almost half). We're not on a critical path
and size matters here, so better switch to this.
It's worth noting that the name part could also be stored in the
guid_node at the end to save 8 extra bytes (no pointer needed anymore),
however the purpose of this struct is to be embedded into other ones,
which is not compatible with having a dynamic size.
Affected struct sizes in bytes:
              Before   After   Diff
  server        4032    4032     0*
  proxy         3184    3160    -24
  listener       752     728    -24
*: struct server is full of holes and padding (176 bytes) and is
64-byte aligned. Moving the guid_node elsewhere such as after sess_conn
reduces it to 3968, or one less cache line. There's no point in moving
anything now because forthcoming patches will arrange other parts.
cebs_tree are 24 bytes smaller than ebst_tree (16B vs 40B), and pattern
references are only used during map/acl updates, so their storage is
pure loss between updates (which most of the time never happen). By
switching their indexing to compact trees, we can save 16 to 24 bytes
per entry depending on alignment (here it's 24 per struct but 16 in
practice, as malloc's alignment keeps 8 bytes unused).
Tested on core i7-8650U running at 3.0 GHz, with a file containing
17.7M IP addresses (16.7M different):
$ time ./haproxy -c -f acl-ip.cfg
This saves 280 MB of RAM for the 17.7M IP addresses, and slightly speeds
up the startup (5.8%, from 19.2s to 18.2s), part of which may be
attributed to having less memory to write. Note that this is on small
strings. On larger ones such as user-agents, ebtree doesn't reread
the whole key and might be more efficient.
Before:
RAM (VSZ/RSS): 4443912 3912444
real 0m19.211s
user 0m18.138s
sys 0m1.068s
Overhead Command Shared Object Symbol
44.79% haproxy haproxy [.] ebst_insert
25.07% haproxy haproxy [.] ebmb_insert_prefix
3.44% haproxy libc-2.33.so [.] __libc_calloc
2.71% haproxy libc-2.33.so [.] _int_malloc
2.33% haproxy haproxy [.] free_pattern_tree
1.78% haproxy libc-2.33.so [.] inet_pton4
1.62% haproxy libc-2.33.so [.] _IO_fgets
1.58% haproxy libc-2.33.so [.] _int_free
1.56% haproxy haproxy [.] pat_ref_push
1.35% haproxy libc-2.33.so [.] malloc_consolidate
1.16% haproxy libc-2.33.so [.] __strlen_avx2
0.79% haproxy haproxy [.] pat_idx_tree_ip
0.76% haproxy haproxy [.] pat_ref_read_from_file
0.60% haproxy libc-2.33.so [.] __strrchr_avx2
0.55% haproxy libc-2.33.so [.] unlink_chunk.constprop.0
0.54% haproxy libc-2.33.so [.] __memchr_avx2
0.46% haproxy haproxy [.] pat_ref_append
After:
RAM (VSZ/RSS): 4166108 3634768
real 0m18.114s
user 0m17.113s
sys 0m0.996s
Overhead Command Shared Object Symbol
38.99% haproxy haproxy [.] cebs_insert
27.09% haproxy haproxy [.] ebmb_insert_prefix
3.63% haproxy libc-2.33.so [.] __libc_calloc
3.18% haproxy libc-2.33.so [.] _int_malloc
2.69% haproxy haproxy [.] free_pattern_tree
1.99% haproxy libc-2.33.so [.] inet_pton4
1.74% haproxy libc-2.33.so [.] _IO_fgets
1.73% haproxy libc-2.33.so [.] _int_free
1.57% haproxy haproxy [.] pat_ref_push
1.48% haproxy libc-2.33.so [.] malloc_consolidate
1.22% haproxy libc-2.33.so [.] __strlen_avx2
1.05% haproxy libc-2.33.so [.] __strcmp_avx2
0.80% haproxy haproxy [.] pat_idx_tree_ip
0.74% haproxy libc-2.33.so [.] __memchr_avx2
0.69% haproxy libc-2.33.so [.] __strrchr_avx2
0.69% haproxy libc-2.33.so [.] _IO_getline_info
0.62% haproxy haproxy [.] pat_ref_read_from_file
0.56% haproxy libc-2.33.so [.] unlink_chunk.constprop.0
0.56% haproxy libc-2.33.so [.] cfree@GLIBC_2.2.5
0.46% haproxy haproxy [.] pat_ref_append
If the addresses are totally disordered (via "shuf" on the input file),
we see both implementations reach exactly 68.0s (slower due to much
higher cache miss ratio).
On large strings such as user agents (1 million here), it's now slightly
slower (+9%):
Before:
real 0m2.475s
user 0m2.316s
sys 0m0.155s
After:
real 0m2.696s
user 0m2.544s
sys 0m0.147s
But such patterns are much less common than short ones, and the memory
savings do still count.
Note that while it could be tempting to get rid of the list that chains
all these pat_ref_elt together and only enumerate them by walking along
the tree to save 16 extra bytes per entry, that's not possible due to
the problem that insertion ordering is critical (think overlapping regex
such as /index.* and /index.html). Currently it's not possible to proceed
differently because patterns are first pre-loaded into the pat_ref via
pat_ref_read_from_file_smp() and later indexed by pattern_read_from_file(),
which has to only redo the second part anyway for maps/acls declared
multiple times.
The support for duplicates is necessary for various use cases related
to config names, so let's upgrade to the latest version which brings
this support. This updates the cebtree code to commit 808ed67 (tag
0.5.0). A few tiny adaptations were needed:
  - replace a few ceb_node** with ceb_root** since pointers are now
    tagged;
  - replace cebu*.h with ceb*.h since both are now merged in the same
    include file. This way we can drop the unused cebu*.h files from
    cebtree that are provided only for compatibility.
  - rename immediate storage functions to cebXX_imm_XXX() as per the API
    change in 0.5 that makes immediate explicit rather than implicit.
    This only affects vars and tools.c:copy_file_name().
The tests continue to work.
When running "make range", it would be convenient to support running
reg tests or anything else such as "size", "pahole" or even benchmarks.
Such commands are usually specific to the developer's environment, so
let's just pass a generic variable TEST_CMD that is executed as-is if
not empty.
This way it becomes possible to run "make range RANGE=... TEST_CMD=...".
RFC1034 states the following:
By convention, domain names can be stored with arbitrary case, but
domain name comparisons for all present domain functions are done in a
case-insensitive manner, assuming an ASCII character set, and a high
order zero bit. This means that you are free to create a node with
label "A" or a node with label "a", but not both as brothers; you could
refer to either using "a" or "A".
In practice, most DNS resolvers normalize domain labels (i.e., convert
them to lowercase) before performing searches or comparisons to ensure
this requirement is met.
While HAProxy normalizes the domain name in the request, it currently
does not do so for the response. Commit 75cc653 ("MEDIUM: resolvers:
replace bogus resolv_hostname_cmp() with memcmp()") intentionally
removed the `tolower()` conversion from `resolv_hostname_cmp()` for
safety and performance reasons.
This commit re-introduces the necessary normalization for FQDNs received
in the response. The change is made in `resolv_read_name()`, where labels
are processed as an unsigned char string, allowing `tolower()` to be
applied safely. Since a typical FQDN has only 3-4 labels, replacing
`memcpy()` with an explicit copy that also applies `tolower()` should
not introduce a significant performance degradation.
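A minimal sketch of the kind of copy described above (the helper and its
arguments are illustrative, not the actual resolv_read_name() code):

    #include <ctype.h>
    #include <stddef.h>

    /* Copy one DNS label byte by byte, lower-casing it on the fly instead
     * of using a plain memcpy(); tolower() is safe here because the bytes
     * are handled as unsigned char. */
    static void copy_label_lowercase(unsigned char *dst,
                                     const unsigned char *src,
                                     size_t label_len)
    {
        size_t i;

        for (i = 0; i < label_len; i++)
            dst[i] = (unsigned char)tolower(src[i]);
    }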
This patch addresses the rare edge case, as most resolvers perform this
normalization themselves.
This fixes GitHub issue #3102. This fix may be backported to all stable
versions, 2.5 included.