The per-cpu capacity of a cluster was taken into account since 3.2 with
commit 6c88e27cf4 ("MEDIUM: cpu-topo: change "performance" to consider
per-core capacity").
In cpu_policy_performance() and cpu_policy_efficiency(), we're trying
to figure which cores have more capacity than others by comparing their
cluster's average capacity. However, contrary to what the comment says,
we're not averaging per core but per cpu, which makes a difference for
CPUs mixing SMT with non-SMT cores on the same SoC, such as intel's 14th
gen CPUs. Indeed, on a machine where cpufreq is not enabled, all CPUs
can be reported with a capacity of 1024, resulting in a big cluster of
16*1024, and 4 small clusters of 4*1024 each, giving an average of 1024
per CPU, making it impossible to distinguish one from the other. In this
situation, both "cpu-policy performance" and "cpu-policy efficiency"
enable all cores.
But this is wrong, what needs to be taken into account in the divide is
the number of *cores*, not *cpus*, that allows to distinguish big from
little clusters. This was not noticeable on the ARM machines the commit
above aimed at fixing because there, the number of CPUs equals the number
of cores. And on an x86 machine with cpu_freq enabled, the frequencies
continue to help spotting which ones are big/little.
By using nb_cores instead of nb_cpus in the comparison and in the avg_capa
compare function, it properly works again on x86 without affecting other
machines with 1 CPU per core.
This can be backported to 3.2.
When using per-group affinity, add an optional new directive. It accepts
the values of "auto", where when multiple thread groups are created, the
available CPUs are split equally across the groups, and is the new
default, and "loose", where all groups are bound to all available CPUs,
this is the old default.
Rename "visited_tsid" and "visited_ccx" to "touse_tsid" and
"touse_ccx". They are not there to remember which tsid/ccx we
alreaday visited, contrarily to visited_ccx_set and
visited_cl_set, they are there to know which tsid/ccx we should
use, so make that clear.
We want to reset visited_ccx, as introduced by commit
8aef5bec1ef57eac449298823843d6cc08545745, each time we run the loop,
otherwise the chances of its content being correct are very low, and
will likely end up being bound to the wrong threads.
This was reported in github issue #3224.
src/cpu_topo.c:1325:15: warning: logical not is only applied to the left hand side of this bitwise operator [-Wlogical-not-parentheses]
1325 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^ ~
src/cpu_topo.c:1325:15: note: add parentheses after the '!' to evaluate the bitwise operator first
1325 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^
| ( )
src/cpu_topo.c:1325:15: note: add parentheses around left hand side expression to silence this warning
1325 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^
| ( )
src/cpu_topo.c:1533:15: warning: logical not is only applied to the left hand side of this bitwise operator [-Wlogical-not-parentheses]
1533 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^ ~
src/cpu_topo.c:1533:15: note: add parentheses after the '!' to evaluate the bitwise operator first
1533 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^
| ( )
src/cpu_topo.c:1533:15: note: add parentheses around left hand side expression to silence this warning
1533 | } else if (!cpu_policy_conf.flags & CPU_POLICY_ONE_THREAD_PER_CORE)
| ^
| ( )
No backport needed.
Add a new cpu-affinity keyword, "per-thread".
If used, each thread will be bound to only one hardware thread of the
thread group.
If used in conjonction with the "threads-per-core 1" cpu_policy, then
each thread will be bound on a different core.
Add a new global keyword, max-threads-per-group. It sets the maximum number of
threads a thread group can contain. Unless the number of thread groups
is fixed with "thread-groups", haproxy will just create more thread
groups as needed.
The default and maximum value is 64.
Add a new global option, "cpu-affinity", which controls how threads are
bound.
It currently accepts three values, "per-core", which will bind one thread to
each hardware thread of a given core, and "per-group" which will use all
the available hardware threads of the thread group, and "auto", the
default, which will use "per-group", unless "threads-per-core 1" has
been specified in cpu_policy, in which case it will use per-core.
Add a new, optional key-word to "cpu-policy", "threads-per-core".
It takes one argument, "1" or "auto". If "1" is used, then only one
thread per core will be created, no matter how many hardware thread each
core has. If "auto" is used, then one thread will be created per
hardware thread, as is the case by default.
for example: cpu-policy performance threads-per-core 1
Turn the cpu policy configuration into a struct. Right now it just
contains an int, that represents the policy used, but will get more
information soon.
cpu_dump_topology() prints details about each enabled CPU and a summary with
clusters info and thread-cpu bindings. The latter is often usefull for
debugging and we want to add it in the 'show dev' output.
So, let's split cpu_dump_topology() in two parts: cpu_topo_debug() to print the
details about each enabled CPU; and cpu_topo_dump_summary() to print only the
summary.
In the next commit we will modify cpu_topo_dump_summary() to write into local
trash buffer and it could be easily called from debug_parse_cli_show_dev().
As mentioned during the NUMA series development, the goal is to use
all available cores in the most efficient way by default, which
normally corresponds to "cpu-policy performance". The previous default
choice of "cpu-policy first-usable-node" was only meant to stay 100%
identical to before cpu-policy.
So let's switch the default cpu-policy to "performance" right now.
The doc was updated to reflect this.
Most of the time, machines made of multiple CPU types use the same L3
for them, and grouping CPUs by frequencies to form groups doesn't bring
any value and on the opposite can impair the incoming connection balancing.
This choice of grouping by cluster was made in order to constitute a good
choice on homogenous machines as well, so better rely on the per-CCX
grouping than the per-cluster one in this case. This will create less
clusters on machines where it counts without affecting other ones.
It doesn't seem necessary to change anything for the "resource" policy
since it selects a single cluster.
This is similar to the previous change to the "performance" policy but
it applies to the "efficiency" one. Here we're changing the sorting
method to sort CPU clusters by average per-CPU capacity, and we evict
clusters whose per-CPU capacity is above 125% of the previous one.
Per-core capacity allows to detect discrepancies between CPU cores,
and to continue to focus on efficient ones as a priority.
Running the "performance" policy on highly heterogenous systems yields
bad choices when there are sufficiently more small than big cores,
and/or when there are multiple cluster types, because on such setups,
the higher the frequency, the lower the number of cores, despite small
differences in frequencies. In such cases, we quickly end up with
"performance" only choosing the small or the medium cores, which is
contrary to the original intent, which was to select performance cores.
This is what happens on boards like the Orion O6 for example where only
the 4 medium cores and 2 big cores are choosen, evicting the 2 biggest
cores and the 4 smallest ones.
Here we're changing the sorting method to sort CPU clusters by average
per-CPU capacity, and we evict clusters whose per-CPU capacity falls
below 80% of the previous one. Per-core capacity allows to detect
discrepancies between CPU cores, and to continue to focus on high
performance ones as a priority.
The current per-capacity sorting function acts on a whole cluster, but
in some setups having many small cores and few big ones, it becomes
easy to observe an inversion of metrics where the many small cores show
a globally higher total capacity than the few big ones. This does not
necessarily fit all use cases. Let's add new a function to sort clusters
by their per-cpu average capacity to cover more use cases.
This cpu-policy will only consider CCX and not clusters. This makes
a difference on machines with heterogenous CPUs that generally share
the same L3 cache, where it's not desirable to create multiple groups
based on the CPU types, but instead create one with the different CPU
types. The variants "group-by-2/3/4-ccx" have also been added.
Let's also add some text explaining the difference between cluster
and CCX.
Some (rare) boards have their clusters in an erratic order. This is
the case for the Radxa Orion O6 where one of the big cores appears as
CPU0 due to booting from it, then followed by the small cores, then the
medium cores, then the remaining big cores. This results in clusters
appearing this order: 0,2,1,0.
The core in cpu_policy_group_by_cluster() expected ordered clusters,
and performs ordered comparisons to decide whether a CPU's cluster has
already been taken care of. On the board above this doesn't work, only
clusters 0 and 2 appear and 1 is skipped.
Let's replace the cluster number comparison with a cpuset to record
which clusters have been taken care of. Now the groups properly appear
like this:
Tgrp/Thr Tid CPU set
1/1-2 1-2 2: 0,11
2/1-4 3-6 4: 1-4
3/1-6 7-12 6: 5-10
No backport is needed, this is purely 3.2.
This adds "group-by-{2,3,4}-clusters", which, as its name implies,
create one thread group per X clusters. This can be useful when CPUs
are split into too small clusters, as well as when the total number
of assigned cores is not even between the clusters, to try to spread
the load between less different ones.
When emitting the CPU topology info with -dc, also emit a list of
thread-to-CPU mapping. The group/thread and thread ID are emitted
with the list of their CPUs on each line. The count of CPUs is shown
to ease comparisons, and as much as possible, we try to pack identical
lines within a group by showing thread ranges.
Coverity has reported that cpu2 seems sometimes unused in
cpu_fixup_topology():
*** CID 1593776: Code maintainability issues (UNUSED_VALUE)
/src/cpu_topo.c: 690 in cpu_fixup_topology()
684 continue;
685
686 if (ha_cpu_topo[cpu].cl_gid != curr_id) {
687 if (curr_id >= 0 && cl_cpu <= 2)
688 small_cl++;
689 cl_cpu = 0;
>>> CID 1593776: Code maintainability issues (UNUSED_VALUE)
>>> Assigning value from "cpu" to "cpu2" here, but that stored value is overwritten before it can be used.
690 cpu2 = cpu;
691 curr_id = ha_cpu_topo[cpu].cl_gid;
692 }
693 cl_cpu++;
694 }
695
That's it. 'cpu2' automatic/stack variable is used only in for() loop scopes to
save cpus ID in which we are interested in. In the loop pointed by coverity
this variable is not used for further processing within the loop's scope.
Then it is always reinitialized to 0 in the another following loops.
This fixes GitHUb issue #2895.
This cpu policy keeps the smallest CPU cluster. This can
be used to limit the resource usage to the strict minimum
that still delivers decent performance, for example to
try to further reduce power consumption or minimize the
number of cores needed on some rented systems for a
sidecar setup, in order to scale the system down more
easily. Note that if a single cluster is present, it
will still be fully used.
When started on a 64-core EPYC gen3, it uses only one CCX
with 8 cores and 16 threads, all in the same group.
This cpu policy tries to evict performant core clusters and only
focuses on efficiency-oriented ones. On an intel i9-14900k, we can
get 525k rps using 8 performance cores, versus 405k when using all
24 efficiency cores. In some cases the power savings might be more
desirable (e.g. scalability tests on a developer's laptop), or the
performance cores might be better suited for another component
(application or security component).
This cpu policy tries to evict efficient core clusters and only
focuses on performance-oriented ones. On an intel i9-14900k, we can
get 525k rps using only 8 cores this way, versus 594k when using all
24 cores. The gains from using all these codes are not significant
enough to waste them on this. Also these cores can be much slower
at doing SSL handshakes so it can make sense to evict them. Better
keep the efficiency cores for network interrupts for example.
Also, on a developer's machine it can be convenient to keep all these
cores for the local tasks and extra tools (load generators etc).
When a cluster is too large to fit into a single group, let's split it
into two equal groups, which will still be allowed to use all the CPUs
of the cluster. This allows haproxy to start all the threads with a
minimum number of groups (e.g. 2x40 for 80 cores).
This policy forms thread groups from the CPU clusters, and bind all the
threads in them to all the CPUs of the cluster. This is recommended on
system with bad inter-CCX latencies. It was shown to simply triple the
performance with queuing on a 64-core EPYC without having to manually
assign the cores with cpu-map.
This now turns the cpu-policy to "first-usable-node" by default, so that
we preserve the current default behavior consisting in binding to the
first node if nothing was forced. If a second node is found,
global.nbthread is set and the previous code will be skipped.
This is a reimplemlentation of the current default policy. It binds to
the first node having usable CPUs if found, and drops CPUs from the
second and next nodes.
We'll need to let the user decide what's best for their workload, and in
order to do this we'll have to provide tunable options. For that, we're
introducing struct ha_cpu_policy which contains a name, a description
and a function pointer. The purpose will be to use that function pointer
to choose the best CPUs to use and now to set the number of threads and
thread-groups, that will be called during the thread setup phase. The
only supported policy for now is "none" which doesn't set/touch anything
(i.e. all available CPUs are used).
These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated hardware
cluster number(s). It can be used to bind to only some clusters, such
as CCX or different energy efficiency cores. For this reason, here we
use the cluster's local ID (local to the node).
These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated hardware
core number(s). It can be used to bind to only some clusters as well
as to evict efficient cores whose number is known.
These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated hardware
thread number(s). It can be used to reserve even threads for HW IRQs
and odd threads for haproxy for example, or to evict efficient cores
that do only have thread #0.
On some Arm systems (typically A76/N1) where CPUs can be associated in
pairs, clusters are reported while they have no incidence on I/O etc.
Yet it's possible to have tens of clusters of 2 CPUs each, which is
counter productive since it does not even allow to start enough threads.
Let's detect this situation as soon as there are at least 4 clusters
having each 2 CPUs or less, which is already very suspcious. In this
case, all these clusters will be reset as meaningless. In the worst
case if needed they'll be re-assigned based on L2/L3.
The goal here is to keep an array of the known CPU clusters, because
we'll use that often to decide of the performance of a cluster and
its relevance compared to other ones. We'll store the number of CPUs
in it, the total capacity etc. For the capacity, we count one unit
per core, and 1/3 of it per extra SMT thread, since this is roughly
what has been measured on modern CPUs.
In order to ease debugging, they're also dumped with -dc.
The purpose here is to detect heterogenous clusters which are not
properly reported, based on the exposed information about the cores
capacity. The algorithm here consists in sorting CPUs by capacity
within a cluster, and considering as equal all those which have 5%
or less difference in capacity with the previous one. This allows
large clusters of more than 5% total between extremities, while
keeping apart those where the limit is more pronounced. This is
quite common in embedded environments with big.little systems, as
well as on some laptops.
Due to the way core numbers are assigned and the presence of SMT on
some of them, some holes may remain in the array. Let's renumber them
to plug holes once they're known, following pkg/node/die/llc etc, so
that they're local to a (pkg,node) set. Now an i7-14700 shows cores
0 to 19, not 0 to 27.
On some machines, L3 is not always reported (e.g. on some lx2 or some
armada8040). But some also don't have L3 (core 2 quad). However, no L3
when there are more than 2 L2 is quite unheard of, and while we don't
really care about firing 2 thread groups for 2 L2, we'd rather avoid
doing this if there are 8! In this case we'll declare an L3 instance
to fix the situation. This allows small machines to continue to start
with two groups while not derivating on large ones.
It's important that we don't leave unassigned IDs in the topology,
because the selection mechanism is based on index-based masks, so an
unassigned ID will never be kept. This is particularly visible on
systems where we cannot access the CPU topology, the package id, node id
and even thread id are set to -1, and all CPUs are evicted due to -1 not
being set in the "only-cpu" sets.
Here in new function "cpu_fixup_topology()", we assign them with the
smallest unassigned value. This function will be used to assign IDs
where missing in general.
Due to the previous commit we can end up with cores not assigned
any cluster ID. For this, at the end we sort the CPUs by topology
and assign cluster IDs to remaining CPUs based on pkg/node/llc.
For example an 14900 now shows 5 clusters, one for the 8 p-cores,
and 4 of 4 e-cores each.
The local cluster numbers are per (node,pkg) ID so that any rule could
easily be applied on them, but we also keep the global numbers that
will help with thread group assignment.
We still need to force to assign distinct cluster IDs to cores
running on a different L3. For example the EPYC 74F3 is reported
as having 8 different L3s (which is true) and only one cluster.
Here we introduce a new function "cpu_compose_clusters()" that is called
from the main init code just after cpu_detect_topology() so that it's
not OS-dependent. It deals with this renumbering of all clusters in
topology order, taking care of considering any distinct LLC as being
on a distinct cluster.
Some platforms (several armv7, intel 14900 etc) report one distinct
cluster per core. This is problematic as it cannot let clusters be
used to distinguish real groups of cores, and cannot be used to build
thread groups.
Let's just compare the cluster cpus to the siblings, and ignore it if
they exactly match. We must also take care of not falling back to
core_cpus_list, which can enumerate cores that already have their
cluster assigned (e.g. intel 14900 has 4 4-Ecore clusters in addition
to the 8 Pcores).
This will be used to detect and fix incorrect setups which report
the same cluster ID for multiple L3 instances.
The arrangement of functions in this file is becoming a real problem.
Maybe we should move all this to cpu_topo for example, and better
distinguish OS-specific and generic code.