53 Commits

Valentine Krasnobaeva
fb7bef781d MINOR: defaults: update MASTER_MAXCONN description
This is one of the commits preparing the removal of MODE_MWORKER_WAIT
support, as it became redundant with MODE_MWORKER after moving the
master-worker fork into init().
2024-10-16 22:02:39 +02:00
Amaury Denoyelle
d0d8e57d47 MINOR: quic: define sbuf pool
Define a new buffer pool reserved for allocating smaller memory areas. For
the moment its usage is restricted to QUIC, so it is declared in the
quic_stream module.

Add a new config option "tune.bufsize.small" to specify the size of the
allocated objects. A special check ensures that it is not greater than
the default bufsize to avoid unexpected effects.
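
For illustration, enabling it from the global section could look like
this (the value shown is only an example, not a recommended setting):

  global
      tune.bufsize.small 1024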
2024-08-20 18:12:27 +02:00
Valentine Krasnobaeva
8b1dfa9def MINOR: cfgparse: limit file size loaded via /dev/stdin
load_cfg_in_mem() can continuously reallocate memory in order to load an
extremely large input from /dev/stdin, until it fails with ENOMEM, meaning
the process has consumed all available RAM. In containers and virtualized
environments this is particularly harmful.

So, in order to prevent this, let's introduce MAX_CFG_SIZE as 10MB, which will
limit the size of input supplied via /dev/stdin.
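
A minimal sketch of the guard (the local variable names are assumed; the
real check in load_cfg_in_mem() may differ):

  #define MAX_CFG_SIZE (10 * 1024 * 1024)  /* 10MB cap for /dev/stdin input */

  if (total_size + bytes_read > MAX_CFG_SIZE) {
          ha_alert("Input from /dev/stdin exceeds %d bytes.\n", MAX_CFG_SIZE);
          goto err;
  }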
2024-08-20 14:28:34 +02:00
Valentine Krasnobaeva
41275a6918 MEDIUM: init: set default for fd_hard_limit via DEFAULT_MAXFD
Let's provide a default value for fd_hard_limit if it's not set in the
configuration. With this patch a specific default can also be set via the
compile-time variable DEFAULT_MAXFD, which should be helpful for haproxy
package maintainers.

    make -j 8 TARGET=linux-glibc DEBUG=-DDEFAULT_MAXFD=50000

If haproxy is compiled without DEFAULT_MAXFD defined, the default will be set
to 1048576.
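
For illustration, the fallback in defaults.h would take the usual form
(a sketch; the exact surrounding code may differ):

  #ifndef DEFAULT_MAXFD
  #define DEFAULT_MAXFD 1048576
  #endif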

This is done to avoid the process being killed by its watchdog when it is
started without any limit in its configuration or on the command line while
the hard RLIMIT_NOFILE is extremely large (~1000000000). In this case
compute_ideal_maxconn() is used to calculate maxconn and maxsock; maxsock
defines the size of the internal fdtab, which becomes very large as well.
When the process starts to simply loop over this fdtab (O(n)), it takes so
much time that the watchdog does its job.

To avoid this, maxconn is now always reduced to some reasonable value,
either by an explicit global.fd-hard-limit from the configuration or by its
default. The default may be changed at build time and then overridden by
global.fd-hard-limit at runtime. An explicit global.fd-hard-limit from the
configuration always takes precedence over DEFAULT_MAXFD, if set.

Must be backported to all stable versions down to and including v2.6.0.
2024-07-04 07:52:42 +02:00
Willy Tarreau
290659ffd3 MINOR: activity: make the memory profiling hash size configurable at build time
The MEMPROF_HASH_BITS variable was set to 10 without a possibility to
change it (beyond patching the code). After seeing a few reports already
with "other" being listed and a list with close to 1024 entries, it looks
like it's about time to either increase the hash size, or at least make
it configurable for special cases. As a reminder, in order to remain
fast, the algorithm searches no more than 16 places after the hash, so
when a table is almost full, searches are long and new places are rare.

The present patch just makes it possible to redefine it by passing
"-DMEMPROF_HASH_BITS=11" or "-DMEMPROF_HASH_BITS=12" in CFLAGS, and
moves the definition to defaults.h to make it easier to find. Such
values should be more than sufficient for the vast majority of use cases.
Maybe in the future we'd change the default. At least this version
should be backported to ease rebuilds, say, till 2.8 or so.
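
A sketch of the overridable definition in defaults.h (the bucket-count
macro name is an assumption here):

  #ifndef MEMPROF_HASH_BITS
  #define MEMPROF_HASH_BITS 10
  #endif
  #define MEMPROF_HASH_BUCKETS (1U << MEMPROF_HASH_BITS)   /* 1024 by default */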
2024-06-27 18:01:27 +02:00
Willy Tarreau
0ce51dc93b MEDIUM: dynbuf: implement emergency buffers
The buffer reserve set by tune.buffers.reserve has long been unused, and
in order to deal gracefully with failed memory allocations we'll need to
resort to a few emergency buffers that are pre-allocated per thread.

These buffers are only for emergency use, so every time their count is
below the configured number a b_free() will refill them. For this reason
their count can remain pretty low. We changed the default number from 2
to 4 per thread, and the minimum value is now zero (e.g. for low-memory
systems). The tune.buffers.limit setting has always been a problem when
trying to deal with the reserve, but now we can simplify it by simply
pushing the limit (if set) up to match the reserve. That was already done in
the past with a static value, but now with threads it was a bit trickier,
which is why the per-thread allocators increment the limit on the fly
before allocating their own buffers. This also means that the configured
limit is saner and now corresponds to the regular buffers that can be
allocated on top of emergency buffers.
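
For illustration, the related global settings now behave like this (the
values shown are the new defaults; a sketch, not a recommendation):

  global
      tune.buffers.reserve 4   # per-thread emergency buffers; 0 allowed on low-memory systems
      tune.buffers.limit   0   # unlimited; if set, it is pushed up to match the reserve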

At the moment these emergency buffers are not used upon allocation
failure. The only reason is to ease bisecting later if needed, since
this commit only has to deal with resource management.
2024-05-10 17:18:13 +02:00
Willy Tarreau
772f9a5874 BUILD: pools: make DEBUG_MEMORY_POOLS=1 the default option
This option has been set by default for a very long time and also
complicates the manipulation of the DEBUG variable. Let's make it
the official default and permit disabling it by setting it to zero.
The other pool-related DEBUG options were adjusted to also explicitly
check for the zero value for consistency.
2024-04-11 17:25:45 +02:00
Willy Tarreau
b70981532a BUILD: debug: make DEBUG_STRICT=1 the default
We continue to carry it in the makefile, which adds to the difficulty
of passing new options. Let's make DEBUG_STRICT=1 the default so that
one has to explicitly pass DEBUG_STRICT=0 to disable it. This allows us
to remove the option from the default DEBUG variable in the makefile.
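
Together with the previous commit, a build explicitly disabling both
defaults would now look roughly like this:

  $ make -j 8 TARGET=linux-glibc DEBUG="-DDEBUG_STRICT=0 -DDEBUG_MEMORY_POOLS=0"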
2024-04-11 17:25:45 +02:00
Willy Tarreau
1a088da7c2 MAJOR: stktable: split the keys across multiple shards to reduce contention
In order to reduce the contention on the table when keys expire quickly,
we're spreading the load over multiple trees, for both the keys and the
expiration dates. The shard number is calculated from the key value
itself, both when looking up and when setting it.

The "show table" dump on the CLI iterates over all shards so that the
output is not fully sorted, it's only sorted within each shard. The Lua
table dump just does the same. It was verified with a Lua program to
count stick-table entries that it works as intended (the test case is
reproduced here as it's clearly not easy to automate as a vtc):

  function dump_stk()
    local dmp = core.proxies['tbl'].stktable:dump({});
    local count = 0
    for _, __ in pairs(dmp) do
        count = count + 1
    end
    core.Info('Total entries: ' .. count)
  end

  core.register_action("dump_stk", {'tcp-req', 'http-req'}, dump_stk, 0);

  ##
  global
    tune.lua.log.stderr on
    lua-load-per-thread lua-cnttbl.lua

  listen front
    bind :8001
    http-request lua.dump_stk if { path_beg /stk }
    http-request track-sc1 rand(),upper,hex table tbl
    http-request redirect location /

  backend tbl
    stick-table size 100k type string len 12 store http_req_cnt

  ##
  $ h2load -c 16 -n 10000 0:8001/
  $ curl 0:8001/stk

  ## A count close to 100k appears on haproxy's stderr
  ## On the CLI, "show table tbl" | wc will show the same.

Some large parts were reindented only to add a top-level loop to iterate
over shards (e.g. process_table_expire()). Better check the diff using
git show -b.

The number of shards is decided just like for the pools, at build time
based on the max number of threads, so that we can keep a constant. Maybe
this should be done differently. For now CONFIG_HAP_TBL_BUCKETS is used,
and defaults to CONFIG_HAP_POOL_BUCKETS to keep the benefits of all the
measurements made for the pools. It turns out that this value seems to
be the most reasonable one without inflating the struct stktable too
much. By default for 1024 threads the value is 32 and delivers 980k RPS
in a test involving 80 threads, while adding 1kB to the struct stktable
(roughly doubling it). The same test at 64 gives 1008 kRPS and at 128
it gives 1040 kRPS for 8 times the initial size. 16 would be too low
however, with 675k RPS.

The stksess already has a shard number; it's the one used to decide which
peer connection to send the entry over. Maybe we should also store the one
associated with the entry itself instead of recalculating it, though this
does not happen that often. The operation is done by hashing the key using
XXH32().
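
A minimal sketch of that derivation (the exact reduction from hash to
shard index may differ in the real code):

  /* spread entries over CONFIG_HAP_TBL_BUCKETS shards based on the key */
  uint shard = XXH32(key, key_len, 0) % CONFIG_HAP_TBL_BUCKETS;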

The peers also take and release the table's lock, but the way it's used
is not very clear yet, so at this point it's certain this will not work.

At this point, this allowed us to completely unlock the performance on an
80-thread setup:

 before: 5.4 Gbps, 150k RPS, 80 cores
  52.71%  haproxy    [.] stktable_lookup_key
  36.90%  haproxy    [.] stktable_get_entry.part.0
   0.86%  haproxy    [.] ebmb_lookup
   0.18%  haproxy    [.] process_stream
   0.12%  haproxy    [.] process_table_expire
   0.11%  haproxy    [.] fwrr_get_next_server
   0.10%  haproxy    [.] eb32_insert
   0.10%  haproxy    [.] run_tasks_from_lists

 after: 36 Gbps, 980k RPS, 80 cores
  44.92%  haproxy    [.] stktable_get_entry
   5.47%  haproxy    [.] ebmb_lookup
   2.50%  haproxy    [.] fwrr_get_next_server
   0.97%  haproxy    [.] eb32_insert
   0.92%  haproxy    [.] process_stream
   0.52%  haproxy    [.] run_tasks_from_lists
   0.45%  haproxy    [.] conn_backend_get
   0.44%  haproxy    [.] __pool_alloc
   0.35%  haproxy    [.] process_table_expire
   0.35%  haproxy    [.] connect_server
   0.35%  haproxy    [.] h1_headers_to_hdr_list
   0.34%  haproxy    [.] eb_delete
   0.31%  haproxy    [.] srv_add_to_idle_list
   0.30%  haproxy    [.] h1_snd_buf

WIP: uint64_t -> long

WIP: ulong -> uint

code is much smaller
2024-04-03 17:34:47 +02:00
Willy Tarreau
6c1b29d06f MINOR: ring: make the number of queues configurable
Now the rings have one wait queue per group. This should limit the
contention on systems such as EPYC CPUs where the performance drops
dramatically when using more than one CCX.

Tests were run with different numbers and it was shown that the value
6 outperforms all other ones at 12, 24, 48, 64 and 80 threads on an
EPYC, a Xeon and an Ampere CPU. Value 7 sometimes comes close and
anything around these values degrades quickly. The value has been
left tunable in the global section.

This commit only introduces everything needed to set up the queue count
so that it's easier to adjust it in the forthcoming patches, but it was
initially added after the series, making it harder to compare.

It was also shown that trying to group the threads in queues by their
thread groups is counter-productive and that it was more efficient to
do that by applying a modulo on the thread number. As surprising as it
seems, it does have the benefit of evenly balancing any number of threads.
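
A sketch of the assignment described above (the tunable's internal name
is assumed):

  /* assign each thread to a ring wait queue by simple modulo */
  queue = tid % global.tune.ring_queues;   /* default: 6 */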
2024-03-25 17:34:19 +00:00
Willy Tarreau
0a0041d195 BUILD: tree-wide: fix a few missing includes in a few files
Some include files, mostly type definitions, are missing a few includes
to define the types they're using, causing include ordering dependencies
between files, which are most often not seen due to the alphabetical
order of includes. Let's just fix them.

These were spotted by building pre-compiled headers for all these files
to .h.gch.
2024-03-05 11:50:34 +01:00
Aurelien DARRAGON
cb3ec978fd MINOR: event_hdl: add global tunables
The local variable "event_hdl_async_max_notif_at_once" which was
introduced with the event_hdl API was left as-is, with a TODO note
saying that we should make it a global tunable.

Well, we're doing this now. To prepare for upcoming tunables related to
event_hdl API, we add a dedicated struct named event_hdl_tune which is
globally exposed through the event_hdl header file so that it may be used
from everywhere. The struct is automatically initialized in
event_hdl_init() according to defaults.h.

"event_hdl_async_max_notif_at_once" now becomes
"event_hdl_tune.max_events_at_once" with it's dedicated
configuation keyword: "tune.events.max-events-at-once".

We're also taking this opportunity to raise the default value from 10
to 100 since it seems quite reasonable given existing async event_hdl
users.
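
For illustration, overriding the new default from the configuration:

  global
      tune.events.max-events-at-once 100   # new default; was 10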

The documentation was updated accordingly.
2023-11-29 08:59:27 +01:00
Remi Tricot-Le Breton
48f81ec09d MAJOR: cache: Delay cache entry delete in reserve_hot function
A reference counter on the cache_entry was added in a previous commit.
Its value is atomically increased and decreased via the retain_entry and
release_entry functions.
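
A minimal sketch of these helpers (the field name is assumed, and the
real release_entry also handles freeing when the count reaches zero):

  static inline void retain_entry(struct cache_entry *entry)
  {
          HA_ATOMIC_INC(&entry->refcount);
  }

  static inline void release_entry(struct cache_entry *entry)
  {
          HA_ATOMIC_DEC(&entry->refcount);
  }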

This is needed because of the latest cache and shared_context
modifications that introduced two separate locks instead of the
preexisting single shctx_lock one.
With the new logic, we have two main blocks competing for the two locks:
- the one in the http_action_req_cache_use that performs a lookup in the
  cache tree (locked by the cache lock) and then tries to remove the
  corresponding blocks from the shared_context's 'avail' list until the
  response is sent to the client by the cache applet,
- the shctx_row_reserve_hot that traverses the 'avail' list and gives
  them back to the caller, while removing previous row heads from the
  cache tree
Those two blocks require the two locks but one of them would take the
cache lock first, and the other one the shctx_lock first, which would
end in a deadlock without the current patch.

The way this conflict is resolved in this patch is by ensuring that at
least one of those uses works without taking the two locks at the same
time.
The solution found was to keep taking the two locks in the cache_use
case. We first lock the cache to look up an entry and then take
the shctx lock as well to detach the corresponding blocks from the
'avail' list. The subtlety is that between the cache lookup and the
actual locking of the shctx, another thread might have called the
reserve_hot function in which we only take the shctx lock.
In this function we traverse the 'avail' list to remove blocks that are
then given to the caller. If one of those blocks corresponds to a
previous row head, we call the 'free_blocks' callback that used to
delete the cache entry from the tree.
We now avoid deleting the cache entries directly in reserve_hot; instead
we set the cache entry's 'complete' param to 0 so that no other
thread tries to work with this entry. This way, when we release the
shctx lock in reserve_hot, the first thread that had performed the cache
lookup and had found an entry that we just gave to another thread will
see that the 'complete' field is 0 and it won't try to work with this
response.

The actual removal of entries from the cache tree will now be performed
in the new 'reserve_finish' callback called at the end of the
shctx_row_reserve_hot function. It will iterate over all the row heads
that were inserted in a dedicated list in the 'free_block' callback and
perform the actual deletion.
2023-11-16 19:35:10 +01:00
Willy Tarreau
06885aaea7 MINOR: pools: introduce the use of multiple buckets
On many threads and without the shared cache, there can be extreme
contention on the ->allocated counter, the ->free_list pointer, and
the ->used counter. It's possible to limit this contention by spreading
the counters a little bit over multiple entries, that are summed up when
a consultation is needed. The criterion used to spread the values cannot
be related to the thread ID due to migrations, since we need to keep
consistent stats (allocated vs used).

Instead we'll just hash the pointer; it provides an index that does the
job and that is consistent for the object. With just a few entries
(16 here, as this showed almost identical performance between global and
non-global pools), even iterations should be short enough during
measurements to not be a problem.

A pair of functions designed to ease pointer hash bucket calculation
was added, one of them doing it for thread IDs because allocation failures
will be associated with a thread and not a pointer.
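
A sketch of such a pointer-to-bucket hash (the exact mixing constant and
shifts may differ in the real functions):

  static inline unsigned int ptr_hash(const void *ptr, unsigned int bits)
  {
          unsigned long long x = (unsigned long)ptr >> 3; /* drop alignment bits */

          x *= 0x9e3779b97f4a7c15ULL;       /* golden-ratio multiply */
          return x >> (64 - bits);          /* keep the top <bits> bits */
  }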

For now this patch only brings in the relevant parts of the infrastructure,
the CONFIG_HAP_POOL_BUCKETS_BITS macro that defaults to 6 bits when 512
threads or more are supported, 5 bits when 128 or more are supported, 4
bits when 16 or more are supported, otherwise 3 bits for small setups.
The array in the pool_head and the two utility functions are already
added. It should have no measurable impact beyond inflating the pool_head
structure.
2023-08-12 19:04:34 +02:00
Willy Tarreau
59c347c15e BUILD: defaults: use __WORDSIZE not LONGBITS for MAX_THREADS_PER_GROUP
LONGBITS was defined long ago with old compilers that didn't provide the
word size. It's still present as being referenced in various places in the
code, but we must not use it to define other macros that may be evaluated
at pre-processing time since it contains sizeof() and casts that are not
compatible with preprocessor conditions. Let's switch MAX_THREADS_PER_GROUP
to __WORDSIZE so that we can condition blocks of code on it if needed.
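
For illustration, the difference (a sketch):

  /* LONGBITS expands to sizeof()-based arithmetic which the preprocessor
   * cannot evaluate, so "#if MAX_THREADS_PER_GROUP > 32" would fail if it
   * were defined from LONGBITS. __WORDSIZE is a plain integer constant:
   */
  #define MAX_THREADS_PER_GROUP __WORDSIZE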

LONGBITS should really be removed by now, given that we don't support
compilers not providing __WORDSIZE anymore (gcc < 4.2).
2023-08-12 19:04:34 +02:00
Patrick Hemmer
7fccccccea MINOR: acl: add acl() sample fetch
This provides a sample fetch which returns the evaluation result
of the conjunction of named ACLs.
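
For illustration, a hypothetical usage (the ACL names are examples):

  acl host_ok  hdr(host) -i example.com
  acl path_api path_beg /api
  # true only when both named ACLs match
  http-request set-var(txn.api_ok) acl(host_ok,path_api)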
2023-08-01 10:49:06 +02:00
William Lallemand
2078d4b1f7 BUG/MINOR: mworker: use MASTER_MAXCONN as default maxconn value
In environments where SYSTEM_MAXCONN is defined when compiling, the
master will use this value instead of the original minimal value which
was set to 100. When this happens, the master process could allocate
RAM excessively since it does not need to have an high maxconn. (For
example if SYSTEM_MAXCONN was set to 100000 or more)

This patch fixes the issue by using the new define MASTER_MAXCONN which
define a default maxconn of 100 for the master process.

Must be backported as far as 2.5.
2023-03-09 14:28:44 +01:00
Willy Tarreau
28360dc53f MEDIUM: clock: force internal time to wrap early after boot
GH issue #2034 clearly indicates yet another case of time roll-over
that went badly. Issues that happen only once every 50 days are hard
to detect and debug, and are usually reported more or less synchronized
from multiple sources. This patch finally does what had long been planned
but never done yet, which is to force the time to wrap early after boot
so that any such remaining issue can be spotted quicker. The margin delay
here is 20s (it may be changed by setting BOOT_TIME_WRAP_SEC to another
value). This value seems sufficient to permit failed health checks to
succeed and traffic to come in and possibly start to update some time
stamps (accept dates in logs, freq counters, stick-tables expiration
dates etc).

It could theoretically be helpful to have this in 2.7, but as can be
seen with the two patches below, we've already had incorrect use cases
of the internal monotonic time when the wall-clock one was needed, so
we could expect to detect other ones in the future. Note that this will
*not* induce bugs, it will only make them happen much faster (i.e. no
need to wait for 50 days before seeing them). If it were to eventually
be backported, these two previous patches must also be backported:

    BUG/MINOR: clock: use distinct wall-clock and monotonic start dates
    BUG/MEDIUM: cache: use the correct time reference when comparing dates
2023-02-08 11:10:33 +01:00
Willy Tarreau
b8b243ac6a MINOR: trace: add the long awaited TRACE_PRINTF()
TRACE_PRINTF() can be used to produce arbitrary trace contents at any
trace level. It uses the exact same arguments as other TRACE_* macros,
but here they are mandatory since they are followed by the format-string,
though they may be filled with zeroes. The reason for the arguments is to
match tracking or filtering and not pollute other non-inspected objects.

It will probably be used inside loops, in which case there are a few
points to be careful about:
  - output atomicity is only per-message, so competing threads may see
    their messages interleaved. As such, it is recommended that the
    caller places a recognizable unique context at the beginning of the
    message such as a connection pointer.
  - iterating over arrays or lists for all requests could be very
    expensive. In order to avoid this it is best to condition the call
    via TRACE_ENABLED() with the same arguments, which will return the
    same decision.
  - messages longer than TRACE_MAX_MSG-1 (1023 by default) will be
    truncated.

For example, in order to dump the list of HTTP headers between hpack
and h2:

  if (outlen > 0 &&
      TRACE_ENABLED(TRACE_LEVEL_DEVELOPER,
      H2_EV_RX_FRAME|H2_EV_RX_HDR, h2c->conn, 0, 0, 0)) {
          int i;
          for (i = 0; list[i].n.len; i++)
              TRACE_PRINTF(TRACE_LEVEL_DEVELOPER, H2_EV_RX_FRAME|H2_EV_RX_HDR,
                           h2c->conn, 0, 0, 0, "h2c=%p hdr[%d]=%s:%s", h2c, i,
                           list[i].n.ptr, list[i].v.ptr);
  }

In addition, a lower-level TRACE_PRINTF_LOC() macro is provided, that takes
two extra arguments, the caller's location and the caller's function name.
This will allow emitting composite traces from central functions on
behalf of another one.
2023-01-26 15:49:43 +01:00
Willy Tarreau
4da51bd190 CLEANUP: pools: get rid of CONFIG_HAP_POOLS
This one was set in defaults.h only when neither DEBUG_NO_POOLS nor
DEBUG_UAF were set. This was not the most convenient location to look
for it, and it was only used in pool.c to decide on the initial value
of POOL_DBG_NO_CACHE.

Let's just use DEBUG_NO_POOLS || DEBUG_UAF directly on this flag and
get rid of the intermediary condition. This also has the benefit of
removing a double inversion, which is always nice for understanding.
2022-12-08 17:45:08 +01:00
William Lallemand
eba6a54cd4 MINOR: logs: startup-logs can use a shm for logging the reload
When compiled with USE_SHM_OPEN=1 the startup-logs are now able to use
a shm which is used to keep the logs when switching to mworker wait
mode. This allows keeping the logs of a failed reload.

When allocating the startup-logs at first start of the process, haproxy
does a shm_open() with a unique path based on the PID of the process; the
file is unlinked immediately so we don't leave unwelcome files behind. The
fd resulting from this shm is stored in the HAPROXY_STARTUPLOGS_FD
environment variable so it can be mmapped again when switching to wait
mode.
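
A minimal sketch of that pattern (error handling omitted):

  char name[64], fdstr[16];
  int fd;

  snprintf(name, sizeof(name), "/haproxy_startup_logs_%d", (int)getpid());
  fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
  shm_unlink(name);             /* keep only the fd, leave no file behind */
  snprintf(fdstr, sizeof(fdstr), "%d", fd);
  setenv("HAPROXY_STARTUPLOGS_FD", fdstr, 1);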

When forking children, the process copies the mmapped area to a
malloc'ed ring so the master and the workers never share the same memory
section. When switching to wait mode, the shm is not used anymore as it
is also copied to a malloc'ed structure.

This allows using the "show startup-logs" command over the master CLI
to get the logs of the latest startup or reload. This way the logs of
the latest failed reload are also kept.

This is only activated on the linux-glibc target for now.
2022-10-13 16:50:22 +02:00
Willy Tarreau
693688e734 MINOR: config: automatically preset MAX_THREADS based on MAX_TGROUPS
MAX_THREADS was not changed when setting MAX_TGROUPS, which still limits
some possibilities. Let's preset it to 4 * LONGBITS when MAX_TGROUPS is
larger than 1, or LONGBITS when it's set to 1. This means that the new
default value is 256 threads.
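
For illustration, the preset (a sketch of the defaults.h logic):

  #if MAX_TGROUPS > 1
  #define MAX_THREADS (4 * LONGBITS)   /* 256 on 64-bit platforms */
  #else
  #define MAX_THREADS LONGBITS
  #endif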

The rationale behind this is that the main use of thread groups is
mostly to address NUMA issues and that we don't necessarily need large
thread counts when using many groups, and 256 threads is already plenty
even on quite large systems.

For now it's important not to go too far because some internal structs
are arrays of MAX_THREADS entries, for example accept_queue_ring, which
is around 8kB per thread. Such structures will need to become dynamic
before defaulting to large thread counts (at 4096 threads max the
accept queues would require 32 MB RAM alone).
2022-08-06 16:51:20 +02:00
Willy Tarreau
856d56d2d2 MINOR: config: change default MAX_TGROUPS to 16
This will allow nbtgroups > 1 to be declared in the config without
recompiling. The theoretical limit is 64, though we'd rather not push
it too far for now as some structures might be enlarged to be indexed
per group. Let's start with 16 groups max, allowing experimentation with
dual-socket machines suffering from up to 8 loosely coupled L3 caches.
It's a good start and doesn't engage us too far.
2022-07-15 21:51:48 +02:00
Willy Tarreau
3ccb14d60d MINOR: thread: get rid of MAX_THREADS_MASK
This macro was used both for binding and for lookups. When binding tasks
or FDs, using all_threads_mask instead is better as it will later be per
group. For lookups, ~0UL always does the job. Thus in practice the macro
was already almost not used anymore since the rest of the code could run
fine with a constant of all ones there.
2022-06-14 11:18:40 +02:00
Willy Tarreau
197715ae21 CLEANUP: compression: move the default setting of maxzlibmem to defaults
__comp_fetch_init() only presets the maxzlibmem, and only when both
USE_ZLIB and DEFAULT_MAXZLIBMEM are set. The intent is to preset a
default value to protect the system against excessive memory usage
when no setting is set by the user.

Nowadays the entry in the global struct is always there so there's no
point anymore in passing via a constructor to possibly set this value.
Let's go the cleaner way by always presetting DEFAULT_MAXZLIBMEM to 0
in defaults.h unless these conditions are met, and always assigning it
instead of pre-setting the entry to zero. This is more straightforward
and removes some ifdefs and the last constructor. In addition, now the
setting has a chance of being found.
2022-04-25 19:42:43 +02:00
Remi Tricot-Le Breton
1d6338ea96 MEDIUM: ssl: Disable DHE ciphers by default
DHE ciphers do not present a security risk if the key is big enough but
they are slow and mostly obsoleted by ECDHE. This patch removes any
default DH parameters. This will effectively disable all DHE ciphers
unless a global ssl-dh-param-file is defined, or
tune.ssl.default-dh-param is set, or a frontend has DH parameters
included in its PEM certificate. In this latter case, only the frontends
that have DH parameters will have DHE ciphers enabled.
Explicitly adding DHE ciphers on a "bind" line will not be enough to
actually enable DHE; we would still need to know which DH parameters to
use, so one of the three conditions described above must be met.

This request was described in GitHub issue #1604.
2022-04-20 17:30:55 +02:00
Willy Tarreau
b61fccdc3f CLEANUP: init: remove the ifdef on HAPROXY_MEMMAX
It's ugly, let's move it to defaults.h with all other ones and preset
it to zero if not defined.
2022-02-23 17:11:33 +01:00
Remi Tricot-Le Breton
55d7e782ee MINOR: ssl: Set default dh size to 2048
Starting from OpenSSLv3, we won't rely on the
SSL_CTX_set_tmp_dh_callback mechanism so we will need to know the DH
size we want to use during init. In order for the default DH param size
to be used when no RSA or DSA private key can be found for a given bind
line, we will need to know the default size we want to use (which was
not possible the way the code was built, since the global default dh
size was set too late).
2022-02-14 10:07:14 +01:00
Willy Tarreau
39fd546d4b MINOR: pools: enable pools with DEBUG_FAIL_ALLOC as well
During 2.4-dev, fault injection was enabled for cached pools with commit
207c09509 ("MINOR: pools: move the fault injector to __pool_alloc()"),
except that the condition for CONFIG_HAP_POOLS still depended on
DEBUG_FAIL_ALLOC not being set, which limits the usability to cases
where the define is set by hand. Let's remove it from the equation as
this is not a constraint anymore. While a bit old, there's no need to
backport this as it's only used during development.
2022-01-12 17:31:01 +01:00
Willy Tarreau
f5e94b2f47 OPTIM: pools: reduce local pool cache size to 512kB
Now that we support batched allocations/releases, it appears that we can
reach the same performance on H2 with shared pools and 256kB thread-local
cache as without shared pools, a fast allocator and 1MB thread-local cache.
With 512kB we're up to 10% faster on highly multiplexed H2 than without the
shared cache. This was tested on a 16-core ARM machine. Thus it's time to
slightly reduce the per-thread memory cost, which may also improve the
performance on machines with smaller L2 caches. It essentially reverts
commit f587003fe ("MINOR: pools: double the local pool cache size to 1 MB").
2022-01-02 19:52:15 +01:00
Willy Tarreau
43937e920f MEDIUM: pools: start to batch eviction from local caches
Since the previous patch we can forcefully evict multiple objects from the
local cache, even when evicting based on the LRU entries. Let's define
a compile-time configurable setting to batch releasing of objects. For
now we set this value to 8 items per round.

This is marked medium because eviction from the LRU will slightly change
in order to group the last items that are freed within a single cache
instead of accurately scanning only the oldest ones exactly in their
order of appearance. But this is required in order to evolve towards
batched removals.
2022-01-02 19:35:26 +01:00
Willy Tarreau
f9662848f2 MINOR: threads: introduce a minimalistic notion of thread-group
This creates a struct tgroup_info which knows the thread ID of the first
thread in a group, and the number of threads in it. For now there's only
one thread group supported in the configuration, but it may be forced to
other values for development purposes by defining MAX_TGROUPS, and it's
enabled even when threads are disabled and will need to remain accessible
during boot to keep a simple enough internal API.

For the purpose of easing the configurations which do not specify a thread
group, we're starting group numbering at 1 so that thread group 0 can be
"undefined" (i.e. for "bind" lines or when binding tasks).

The goal will be to later move there some global items that must be
made per-group.
2021-10-08 17:22:26 +02:00
Willy Tarreau
561958c17c CLEANUP: time: move a few configurable defines to defaults.h
TV_ETERNITY, TV_ETERNITY_MS and MAX_DELAY_MS may be configured and
ought to be in defaults.h so that they can be inherited from everywhere
without including time.h and could also be redefined if needed
(particularly for MAX_DELAY_MS).
2021-10-07 01:41:14 +02:00
Willy Tarreau
8de6dc9926 REORG: pools: move default settings to defaults.h
There's no reason CONFIG_HAP_POOLS and its opposite are located in
pools-t.h; it forces those that depend on them to include the file.
Other similar options are normally dealt with in defaults.h, which is
part of the default API, so let's do that.
2021-09-28 19:31:16 +02:00
Tim Duesterhus
d5fc8fcb86 CLEANUP: Add haproxy/xxhash.h to avoid modifying import/xxhash.h
This solves setting XXH_INLINE_ALL in a cleaner way, because the imported
header is not modified, easing future updates.

see 6f7cc11e6dd0f01b437fba893da2edd2362660a2
2021-09-11 19:58:45 +02:00
Willy Tarreau
33056436c7 BUILD/MINOR: defaults: eliminate warning on MAXHOSTNAMELEN with -Wundef
As reported in GH issue #1369, there is a single case of #if with a
possibly undefined value in defaults.h which is on MAXHOSTNAMELEN. Let's
turn it to a #ifdef.
2021-08-28 12:05:32 +02:00
Willy Tarreau
dc70c18ddc BUG/MEDIUM: cfgcond: limit recursion level in the condition expression parser
Oss-fuzz reports in issue 36328 that we can recurse too far by passing
extremely deep expressions to the ".if" parser. I thought we were still
limited to the 1024 chars per line, that would be highly sufficient, but
we don't have any limit now :-/

Let's just pass a maximum recursion counter to the recursive parsers.
It's decremented for each call and the expression fails if it reaches
zero. On the most complex paths it can add 3 levels per parenthesis,
so with a limit of 1024, that's roughly 341 nested sub-expressions that
are supported in the worst case. That's more than sufficient, for just
a few kB of RAM.
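
A minimal sketch of the guard added to the recursive parser (the function
and argument names are assumed):

  static int cfg_eval_condition_expr(const char *args, int maxdepth)
  {
          if (maxdepth <= 0)
                  return -1;   /* too many nested sub-expressions */

          /* ... recursive calls pass maxdepth - 1 ... */
  }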

No backport is needed.
2021-07-20 18:03:08 +02:00
Willy Tarreau
06987f4238 CLEANUP: global: remove unused definition of MAX_PROCS
This one was forced to 1 and the only reference was a test to verify it
was between 1 and LONGBITS.
2021-06-15 16:52:42 +02:00
Willy Tarreau
b63dbb7b2e MAJOR: config: remove parsing of the global "nbproc" directive
This one was deprecated in 2.3 and marked for removal in 2.5. It suffers
too many limitations compared to threads, and prevents some improvements
from being engaged. Instead of a bypassable startup error, there is now
a hard error.

The parsing code was removed, and very few obvious cases were as well.
The code is deeply rooted at certain places (e.g. "for" loops iterating
from 0 to nbproc) so it will not be that trivial to remove everywhere.
The "bind" and "bind-process" parsers will have to be adjusted, though
maybe not completely changed if we later want to support thread groups
for large NUMA machines. Some stats socket restrictions were removed,
and the doc was updated according to what was done. A few places in the
doc still refer to nbproc and will have to be revisited. The master-worker
code also refers to the process number to distinguish between master and
workers and will have to be carefully adjusted. The MAX_PROCS macro was
reset to 1, this will at least reduce the size of some remaining arrays.

Two regtests were depending on this directive, one with an explicit
"nbproc 1" and another one testing the master's CLI using nbproc 4.
Both were adapted.
2021-06-11 17:02:13 +02:00
Amaury Denoyelle
b56a7c89a8 MEDIUM: cfgparse: detect numa and set affinity if needed
On process startup, the CPU topology of the machine is inspected. If a
multi-socket CPU machine is detected, the process affinity is automatically
set to the first node with active cpus. This is done to prevent an
impact on the overall performance of the process in case the topology of
the machine is unknown to the user.

This step is not executed in any of the following conditions:
- a non-null nbthread statement is present
- a restrictive 'cpu-map' statement is present
- the process affinity is already restricted, for example via a taskset
  call

For the record, benchmarks were executed on a machine with 2 CPUs
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. In both the clear and ssl
scenarios, performance was sub-optimal without the automatic rebinding
on a single node.
2021-04-23 16:06:49 +02:00
Willy Tarreau
f14c7570d6 CLEANUP: cli: rename MAX_STATS_ARGS to MAX_CLI_ARGS
This is the number of args accepted in a command received on the CLI;
it has long been totally independent of stats and should not carry
this misleading "stats" name anymore.
2021-03-13 10:59:23 +01:00
Willy Tarreau
060a761248 OPTIM: task: automatically adjust the default runqueue-depth to the threads
The recent default runqueue size reduction appeared to have significantly
lowered performance on low-thread count configs. Testing various
runqueue values on different workloads under thread counts ranging from
1 to 64, it appeared that lower values are more optimal for high thread
counts and conversely. It could even be drawn that the optimal value for
various workloads sits around 280/sqrt(nbthread), and probably has to do
with both the L3 cache usage and how to optimally interlace the threads'
activity to minimize contention. This is much easier to optimally
configure, so let's do this by default now.
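
A sketch of the resulting default computation (the global field names are
assumed):

  /* if tune.runqueue-depth is not set, derive it from the thread count */
  if (!global.tune.runqueue_depth)
          global.tune.runqueue_depth = 280 / sqrt(global.nbthread);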
2021-03-10 11:15:34 +01:00
Willy Tarreau
f587003fe9 MINOR: pools: double the local pool cache size to 1 MB
The reason is that H2 can already require 32 16kB buffers for the mux
output at once, which will deplete the local cache. Thus it makes sense
to go further to leave some time for other connections to release theirs.
In addition, the L2 cache on modern CPUs is already 1 MB, so this change
is welcome in any case.
2021-03-05 08:30:08 +01:00
Willy Tarreau
66161326fd MINOR: listener: refine the default MAX_ACCEPT from 64 to 4
The maximum number of connections accepted at once by a thread for a single
listener used to default to 64 divided by the number of processes but the
tasklet-based model is much more scalable and benefits from smaller values.
Experimentation has shown that 4 gives the highest accept rate for all
thread values, and that 3 and 5 come very close, as shown below (HTTP/1
connections forwarded per second at multi-accept 4 and 64):

 ac\thr|    1     2    4     8     16
 ------+------------------------------
      4|   80k  106k  168k  270k  336k
     64|   63k   89k  145k  230k  274k

Some tests were also conducted on SSL and absolutely no change was observed.

The value was placed into a define because it used to be spread all over the
code.

It might be useful at some point to backport this to 2.3 and 2.2 to help
those who observed some performance regressions from 1.6.
2021-02-19 16:02:04 +01:00
Willy Tarreau
4327d0ac00 MINOR: tasks: refine the default run queue depth
Since a lot of internal callbacks were turned to tasklets, the runqueue
depth had not been readjusted from the default 200 which was initially
used to favor batched processing. But nowadays it appears too large
already based on the following tests conducted on an 8c16t machine with
a simple config involving "balance leastconn" and one server. The setup
always involved the two threads of the same CPU core except for 1 thread,
and the client was running over 1000 concurrent H1 connections. The
number of requests per second is reported for each (runqueue-depth,
nbthread) couple:

 rq\thr|    1     2     4     8    16
 ------+------------------------------
     32|  120k  159k  276k  477k  698k
     40|  122k  160k  276k  478k  722k
     48|  121k  159k  274k  482k  720k
     64|  121k  160k  274k  469k  710k
    200|  114k  150k  247k  415k  613k  <-- default

It's possible to save up to about 18% performance by lowering the
default value to 40. One possible explanation to this is that checking
I/Os more frequently allows to flush buffers faster and to smooth the
I/O wait time over multiple operations instead of alternating phases
of processing, waiting for locks and waiting for new I/Os.

The total round trip time also fell from 1.62ms to 1.40ms on average,
among which at least 0.5ms is attributed to the testing tools since
this is the minimum attainable on the loopback.

After some observation it would be nice to backport this to 2.3 and
2.2 which observe similar improvements, since some users have already
observed some perf regressions between 1.6 and 2.2.
2021-02-19 16:01:55 +01:00
Dragan Dosen
6f7cc11e6d MEDIUM: xxhash: use the XXH_INLINE_ALL macro to inline all functions
This way we make all xxhash functions inline, with implementations being
directly included within xxhash.h.
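
For illustration, consumers only need this before including the header
(see the haproxy/xxhash.h wrapper commit d5fc8fcb86 above):

  #define XXH_INLINE_ALL
  #include <import/xxhash.h>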

Makefile is updated as well, since we don't need to compile and link
xxhash.o anymore.

Inlining should improve performance on small data inputs.
2020-12-23 06:39:21 +01:00
Willy Tarreau
ca8b069aa7 REORG: include: move MAX_THREADS to defaults.h
That's already where MAX_PROCS is set, and we already handle the case of
the default value so there is no reason for placing it in thread.h given
that most call places don't need the rest of the threads definitions. The
include was removed from global-t.h and activity.c.
2020-06-11 10:18:59 +02:00
Willy Tarreau
aeed4a85d6 REORG: include: move log.h to haproxy/log{,-t}.h
The current state of the logging is a real mess. The main problem is
that almost all files include log.h just in order to have access to
the alert/warning functions like ha_alert() etc, and don't care about
logs. But log.h also deals with real logging as well as log-format and
depends on stream.h and various other things. As such it forces a few
heavy files like stream.h to be loaded early and to hide missing
dependencies depending where it's loaded. Among the missing ones is
syslog.h which was often automatically included resulting in no less
than 3 users missing it.

Among 76 users, only 5 could be removed, and probably 70 don't need the
full set of dependencies.

A good approach would consist in splitting that file in 3 parts:
  - one for error output ("errors" ?).
  - one for log_format processing
  - and one for actual logging.
2020-06-11 10:18:58 +02:00
Willy Tarreau
ac13aeaa89 REORG: include: move auth.h to haproxy/auth{,-t}.h
The STATS_DEFAULT_REALM and STATS_DEFAULT_URI were moved to defaults.h.
It was required to include types/pattern.h and types/sample.h since they
are mentioned in function prototypes.

It would be wise to merge this with uri_auth.h later.
2020-06-11 10:18:57 +02:00
Willy Tarreau
0f6ffd652e REORG: include: move fd.h to haproxy/fd{,-t}.h
A few includes were missing in each file. A definition of
struct polled_mask was moved to fd-t.h. The MAX_POLLERS macro was
moved to defaults.h.

Stdio used to be silently inherited from whatever path but it's needed
for list_pollers() which takes a FILE* and which can thus not be
forward-declared.
2020-06-11 10:18:57 +02:00