QUIC connections are pushed manually into a dedicated listener queue
when they are ready to be accepted. This happens after handshake
finalization or on 0-RTT packet reception. The listener is then woken up
to dequeue them with listener_accept().
This patch counts the number of connections currently stored in the
accept queue. If a certain limit is reached, INITIAL packets are
dropped on reception to prevent further QUIC connection allocation.
This should help to preserve system resources.
This limit is automatically derived from the listener backlog. Half of
its value is reserved for handshakes and the other half for accept
queues. By default, backlog is equal to maxconn, which guarantees that
there can't be more than maxconn connections in handshake or waiting
to be accepted.
Implement a limit per listener for concurrent number of QUIC
connections. When reached, INITIAL packets for new connections are
automatically dropped until the number of handshakes is reduced.
The limit value is automatically based on listener backlog, which itself
defaults to maxconn.
This feature is important to ensure CPU and memory resources are not
consumed if too many handshake attempts are started in parallel.
Special care is taken if a connection is released before handshake
completion. In this case, the counter must be decremented. This requires
ensuring that the <qc.state> member is set early in qc_new_conn(), before
any quic_conn_release() invocation.
As a result of a copy-paste error in 1b8e68e ("MEDIUM: stick-table: Stop
handling stick-tables as proxies."), postparsing stktable_init() failures
were reported as such for named peer tables:
"Proxy 'table_name': failed to initialize stick table."
Now they are correctly reported like this:
"Parsing [file:line]: failed to initialize 'table_name' stick-table."
This should be backported to every stable version.
Simplify the postparsing of "stick" and "store" sticktable proxy rules by
adding a dedicated sticking-rule resolution (postparsing) function.
This will ease code maintenance.
Increment actconn and check maxconn limit when a quic_conn is
instantiated. This is necessary because prior to this patch, quic_conn
instances were not counted. The global actconn was only incremented
after the handshake had been completed and the connection structure
allocated.
The increment is done using increment_actconn() on INITIAL packet
parsing if a new connection is about to be created. If the limit is
reached, the allocation is cancelled and the INITIAL packet is dropped.
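A simplified, illustrative sketch of this call site (the exact control
flow differs in the real code):

    /* on INITIAL packet parsing, before allocating a new connection */
    if (!increment_actconn()) {
        /* global maxconn reached: drop the INITIAL packet and do not
         * allocate a quic_conn
         */
        goto drop;
    }
    qc = qc_new_conn(/* ... */);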
The decrement is done under quic_conn_release(). This means that
quic_cc_conn instances are not taken into account. This seems safe
enough because quic_cc_conn instances are only put to minimal use.
The counterpart of this change is that maxconn must not be checked a
second time when listener_accept() is done over a QUIC connection. For
this, a new bind_conf flag BC_O_XPRT_MAXCONN is set for listeners when
maxconn is already counted by the lower layer. For the moment, it is
positioned only for QUIC listeners.
Without this patch, haproxy process could suffer from heavy memory/CPU
load if the number of concurrent handshakes is high.
This patch is not considered a bug fix per se. However, it has a major
benefit in protecting against too many QUIC handshakes. As such, it should
be backported up to 2.6. For this, it relies on the following patch:
"MINOR: frontend: implement a dedicated actconn increment function"
This function allows safely mapping a proxy mode to the corresponding
proto_mode. This will allow for easier code maintenance and prevent
mixups between proxy mode and proto mode.
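As a purely illustrative sketch (the function name and the exact mapping
are assumptions, not necessarily the patch's code), such a helper could
look like this:

    static inline enum proto_mode proxy_mode_to_proto_mode(int pr_mode)
    {
        switch (pr_mode) {
        case PR_MODE_TCP:
        case PR_MODE_HTTP:
            return PROTO_MODE_STREAM;
        default:
            /* e.g. datagram-oriented log targets */
            return PROTO_MODE_DGRAM;
        }
    }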
Reverse HTTP listeners are very specific and share only a very limited
subset of keywords with other listeners. As such, it is probably
meaningless to mix standard and reverse addresses on the same bind line.
This patch emits a fatal error during configuration parsing if this is
the case.
The hash lb algorithm can be configured with the "log-balance hash <cnv_list>"
directive. With this algorithm, the user specifies a converter list with
<cnv_list>.
The produced log message will be passed as-is to the provided converter
list, and the resulting hash will be used to select the log server that
will receive the log message.
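For illustration, a possible configuration could look like this (the
converter list and server addresses are only examples):

| backend mylog
|     mode log
|     log-balance hash field(3,' '),sdbm
|     server s1 udp@10.0.0.1:514
|     server s2 udp@10.0.0.2:514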
Using "mode log" in a backend section turns the proxy in a log backend
which can be used to log-balance logs between multiple log targets
(udp or tcp servers)
Log backends can be used as regular log targets using the log directive
with the "backend@be_name" prefix, like so:
| log backend@mybackend local0
A log backend will distribute log messages to servers according to the
log load-balancing algorithm that can be set using the "log-balance"
option from the log backend section. For now, only the roundrobin
algorithm is supported and set by default.
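A minimal example of such a section, with illustrative names and
addresses:

| backend mybackend
|     mode log
|     # roundrobin is the default log load-balancing algorithm
|     server s1 udp@127.0.0.1:514
|     server s2 tcp@127.0.0.2:514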
When 'log' directive was implemented, the internal representation was
named 'struct logsrv', because the 'log' directive would directly point
to the log target, which used to be a (UDP) log server exclusively at
that time, hence the name.
But things have become more complex, since today 'log' directive can point
to ring targets (implicit, or named) for example.
Indeed, a 'log' directive no longer references the "final" server to
which the log will be sent, but instead it describes which log API and
parameters to use for transporting the log messages to the proper log
destination.
So now the term 'logsrv' is rather confusing and prevents us from
introducing a new level of abstraction, because new constructs would be
mixed up with logsrv.
So in order to better designate this 'log' directive, and make it more
generic, we chose the word 'logger' which now replaces logsrv everywhere
it was used in the code (including related comments).
This is internal rewording, so no functional change should be expected
on the user side.
These ones were still in cfgparse.c but they're not specific to the
config at all and may actually be used even when parsing cpu list
entries in /sys. Better move them where they can be reused.
Do not force affinity on the process; instead, let's just apply it to
cpu-map, and it will automatically be used later in the init process. We
can do this because we know that cpu-map was not set when we're using
this detection code.
This is much saner, as we don't need to manipulate the process' affinity
at this point in time, and just update the info that the user omitted to
set by themselves, which guarantees a better long-term consistency with
the documented feature.
It's easier to use this function now to natively support variable
fields in the file's path. This also removes read_file_from_trash()
that was only used here and was static.
The function generate_random_cluster_secret(), which initializes the
cluster secret when not supplied by configuration, is buggy. There is a
1/256 chance that the cluster secret string is empty.
To fix this, the cluster secret is now stored as the first 128 bits of
its own SHA1 (160-bit) digest when it is defined by configuration. If
this is not the case, it is initialized with a 128-bit random value.
This way, the cluster secret is always initialized.
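A hedged sketch of the new initialization; SHA1() and RAND_bytes() are
standard OpenSSL calls, while the variable names are illustrative:

    unsigned char secret[16];

    if (cfg_cluster_secret) {
        unsigned char digest[20];

        /* reduce the configured string to the first 128 bits of its
         * 160-bit SHA1 digest
         */
        SHA1((const unsigned char *)cfg_cluster_secret,
             strlen(cfg_cluster_secret), digest);
        memcpy(secret, digest, sizeof(secret));
    }
    else {
        /* no configured secret: always fill it with 128 random bits */
        RAND_bytes(secret, sizeof(secret));
    }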
As the cluster secret is always initialized, there are several tests which
are from now on useless. This patch removes such tests (if(global.cluster_secret))
in the QUIC code part and at parsing time: no need to check that a cluster
secret was initialized with "quic-force-retry" option.
Must be backported as far as 2.6.
Implement parsing for "rev@" addresses on bind line. On config parsing,
server name is stored on the bind_conf.
Several new callbacks are defined on reverse_connect protocol to
complete parsing. The listen callback is used to retrieve the server
instance from the bind_conf server name. If found, the server instance
is stored on the receiver. Checks are implemented to ensure that only
the HTTP/2 protocol is used by the server.
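As an illustration, such a bind line might look like the following, where
the exact server reference format is an assumption based on the
description above:

| frontend fe-reverse
|     bind rev@mybackend/myserver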
Implement reverse-connect server. This server type cannot instantiate
its own connection on transfer. Instead, it can only reuse connections
from its idle pool. This pool will be populated using the future
'tcp-request session attach-srv' rule.
A reverse-connect has no address. Instead, it uses a new custom server
notation with '@' character prefix. For the moment, only '@reverse' is
defined. An extra check is implemented to ensure the server is used in
an HTTP proxy.
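A hedged configuration sketch (backend and server names are
illustrative):

| backend edge
|     mode http
|     server dev @reverse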
Add a check for limited-quic in check_config_validity() when compiled
with USE_QUIC_OPENSSL_COMPAT so that we prevent a config from starting
accidentally with limited QUIC support. If a QUIC listener is found
when using the compatibility mode and limited-quic is not set, an error
message is reported explaining that the SSL library is not compatible
and suggesting that the user enable limited-quic if that's what they want,
and the startup fails.
This partially reverts commit 7c730803d ("MINOR: quic: Warning for
OpenSSL wrapper QUIC bindings without "limited-quic"") since a warning
was not sufficient.
Surprisingly, commit 00e00fb42 ("REORG: cfgparse: extract curproxy as a
global variable") caused a build breakage on the CI but not on two
developers' machines. It looks like it's dependent on the linker version
used. What happens is that flt_spoe.c already has a curproxy struct which
already is a copy of the one passed by the parser because it also needed
it to be exported, so they now conflict. Let's just drop this unused copy.
parse_cpu_set() stopped returning the undocumented -1, which was a
leftover from an earlier attempt, and was changed from ulong to int since
it only returns a success/failure and no longer a mask. Thus it must
not return -1 and its callers must only test for != 0, as is
documented.
As documented, the NUMA auto-detection is not supposed to be used when
the CPU affinity was set either by taskset (already checked) or by a
cpu-map directive. However this check was missing, so that configs
having cpu-map entries would still first bind to a single node. In
practice it has no impact on correct configs since bindings will be
replaced. However for those where the cpu-map directives are not
exhaustive it will have the impact of binding those threads to one node,
which disagrees with the doc (and makes future evolutions significantly
more complicated).
This could be backported to 2.4 where numa-cpu-mapping was added, though
if nobody encountered this by then maybe we should only focus on recent
versions that are more NUMA-friendly (e.g. 2.8 only). This patch depends
on this previous commit that brings the function we rely on:
MINOR: cpuset: add cpu_map_configured() to know if a cpu-map was found
When a stick-table is defined within a peers section, the name is
prefixed with the peers section name. However when checking for
duplicate table names, the check was using the table name without
the prefix, and would thus never match.
Must be backported as far as 2.6.
There are several misuses in peers sections that are not detected during the
configuration parsing and that could lead to undefined behaviors or crashes.
First, only one listener is expected for a peers section. If several bind
lines or local peer definitions are used, an error is triggered. However, if
multiple addresses are set on the same bind line, there is no error while
only the last listener is properly configured. On 2.8, there is no crash
but side effects are hardly predictable. On older versions, HAProxy
crashes if an unconfigured listener is used.
Then, there is no check on remote peer names. It is unexpected to have
the same name for several remote peers. There is now a test, performed
during the
post-parsing, to verify all remote peer names are unique.
Finally, server parsing options for the peers sections are changed to be
sure a port is always defined, and not a port range or a port offset.
This patch fixes the issue #2066. It could be backported to all stable
versions.
Some huge configs take a significant amount of time to start and this
can cause some trouble (e.g. health checks getting delayed and grouped,
the process not responding to the CLI, etc). For example, some configs might
start fast in certain environments and slowly in other ones just due to
the use of a wrong DNS server that delays all libc's resolutions. Let's
first start by measuring it by keeping a copy of the most recently known
ready date, once before calling check_config_validity() and then refine
it when leaving this function. A last call is finally performed just
before deciding to split between master and worker processes, and it covers
the whole boot. It's trivial to collect and even allows us to get rid of a
call to clock_update_date() in function check_config_validity() that was
used in the hope of better scheduling future events.
The function that cpu-map uses to parse CPU sets, parse_cpu_set(), was
extended in 2.4 with commit a80823543 ("MINOR: cfgparse: support the
comma separator on parse_cpu_set") to support commas between ranges.
But since it was quite late in the development cycle, by then it was
decided not to add a last-minute surprise and not to magically support
commas in cpu-map, hence the "comma_allowed" argument.
Since then we know that it was not the best choice, because the comma
is silently ignored in the cpu-map syntax, causing all sorts of
surprises in the field with threads running on a single node for example.
In addition it's quite common to copy-paste a taskset line and put it
directly into the haproxy configuration.
This commit relaxes this rule and finally allows cpu-map to support
commas between ranges. It simply consists in removing the comma_allowed
argument in the parse_cpu_set() function. The doc was updated to
reflect this.
This puts an end to the occasional confusion between the "now" date
that is internal, monotonic and not synchronized with the system's
date, and "date" which is the system's date and not necessarily
monotonic. Variable "now" was removed and replaced with a 64-bit
integer "now_ns" which is a counter of nanoseconds. It wraps every
585 years, so if all goes well (i.e. if humanity does not need
haproxy anymore in 500 years), it will just never wrap. This implies
that now_ns is never zero and that the zero value can reliably be used
as "not set yet" for a timestamp if needed. This will also simplify
date checks where it becomes possible again to do "date1<date2".
All occurrences of "tv_to_ns(&now)" were simply replaced by "now_ns".
Due to the intricacies between now, global_now and now_offset, all 3
had to be turned to nanoseconds at once. It's not a problem since all
of them were solely used in 3 functions in clock.c, but they make the
patch look bigger than it really is.
The clock_update_local_date() and clock_update_global_date() functions
are now much simpler as there's no need anymore to perform conversions
nor to round the timeval up or down.
The wrapping continues to happen by presetting the internal offset in
the short future so that the 32-bit now_ms continues to wrap 20 seconds
after boot.
The start_time used to calculate uptime can still be turned to
nanoseconds now. One open question concerns global_now_ms, which is used
only for the freq counters. It's unclear whether there's more value in
using two variables that need to be synchronized sequentially like today
or to just use global_now_ns divided by 1 million. Both approaches will
work equally well on modern systems, the difference might come from
smaller ones. Better not change anything for now.
One benefit of the new approach is that we now have an internal date
with a resolution of the nanosecond and the precision of the microsecond,
which can be useful to extend some measurements given that timestamps
also have this resolution.
Instead we're using ns_to_sec(tv_to_ns(&now)), which allows the tv_sec
part to disappear. At this point, "now" is only used as a timeval in
clock.c where it is updated.
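For reference, a sketch of the obvious implementation of these two
helpers:

    static inline uint64_t tv_to_ns(const struct timeval *tv)
    {
        return (uint64_t)tv->tv_sec * 1000000000ULL
               + (uint64_t)tv->tv_usec * 1000ULL;
    }

    static inline uint64_t ns_to_sec(uint64_t ns)
    {
        return ns / 1000000000ULL;
    }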
The listeners in peers sections were still not handling the thread
groups fine. Shards were silently ignored and if a listener was bound
to more than one group, it would simply fail. Now we can call the
dedicated function to resolve all this and possibly create the missing
extra listeners.
bind_complete_thread_setup() was adjusted to use the proxy_type_str()
instead of writing "proxy" at the only place where this word was still
hard-coded so that we continue to speak about peers sections when
relevant.
What used to be only two lines to apply a mask in a loop in
check_config_validity() grew into a 130-line block that performs deeply
listener-specific operations that do not have their place there anymore.
In addition it's worth noting that the peers code still doesn't support
shards nor being bound to more than one group, which is a second reason
for moving that code to its own function. Nothing was changed except
recreating the missing variables from the bind_conf itself (the fe only).
In 2.6-dev1, NUMA topology detection was enabled on FreeBSD with commit
f5d48f8b3 ("MEDIUM: cfgparse: numa detect topology on FreeBSD."). But
it suffers from a minor bug which is that it forgets to check for the
number of domains and always emits a confusing warning indicating that
multiple sockets were found while it's not the case.
This can be backported to 2.6.
Now it's possible for a bind line to span multiple thread groups. When
this happens, the first one will become the reference and will be entirely
set up, and the subsequent ones will be duplicated from this reference,
so that they can be registered in distinct groups. The reference is
always setup and started first so it is always available when the other
ones are started.
The doc was updated to reflect this new possibility with its limitations
and impacts, and the differences with the "shards" option.
Commit 5003ac7fe ("MEDIUM: config: set useful ALPN defaults for HTTPS
and QUIC") revealed a build dependency bug: if QUIC is not enabled,
cfgparse doesn't have any dependency on the SSL stack, so the various
ifdefs that try to check special conditions such as rejecting an H2
config with too small a bufsize, are silently ignored. This was
detected because the default ALPN string was not set and caused the
alpn regtest to fail without QUIC support. Adding openssl-compat to
the list of includes seems to be sufficient to have what we need.
It's unclear when this dependency was broken, it seems that even 2.2
didn't have an explicit dependency on anything SSL-related, though it
could have been inherited through other files (as happens with QUIC
here). It would be safe to backport it to all stable branches. The
impact is very low anyway.
This commit makes sure that if there is no "alpn", "npn" nor "no-alpn"
setting on a "bind" line which corresponds to an HTTPS or QUIC frontend,
we automatically turn on "h2,http/1.1" as an ALPN default for an HTTP
listener, and "h3" for a QUIC listener. This simplifies the configuration
for end users since they won't have to explicitly configure the ALPN
string to enable H2, considering that at the time of writing, HTTP/1.1
represents less than 7% of the traffic on large infrastructures. The
doc and regtests were updated. For more info, refer to the following
thread:
https://www.mail-archive.com/haproxy@formilux.org/msg43410.html
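For illustration (addresses and certificate names are examples), these
two bind lines now get the new defaults:

| bind :443 ssl crt site.pem          # implies "alpn h2,http/1.1"
| bind quic4@:443 ssl crt site.pem    # implies "alpn h3"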
It's possible to replace a previously set ALPN but not to disable ALPN
if it was previously set. The new "no-alpn" setting allows disabling
a previously set ALPN setting by preparing an empty one that will be
replaced and freed when the config is validated.
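For example, with an illustrative bind line, a previously set ALPN can
now be turned off:

| bind :443 ssl crt site.pem alpn h2 no-alpn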
Setting "shards by-group" will create one shard per thread group. This
can often be a reasonable tradeoff between a single one that can be
suboptimal on CPUs with many cores, and too many that will eat a lot
of file descriptors. It was shown to provide good results on a 224
thread machine, with a distribution that was even smoother than the
system's since here it can take into account the number of connections
per thread in the group. Depending on how popular it becomes, it could
even become the default setting in a future version.
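For example (illustrative address):

| bind :443 shards by-group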
Instead of artificially setting the shards count to MAX_THREAD when
"by-thread" is used, let's reserve special values for symbolic names
so that we can add more in the future. For now we use value -1 for
"by-thread", which requires to turn the type to signed int but it was
already used as such everywhere anyway.
We're only checking for 0, 1, or >1 groups enabled there, and we'll soon
need to be more precise and know quickly which groups are non-empty.
Let's just replace the count with a mask of enabled groups. This will
allow to quickly spot the presence of any such group in a set.
thread_set_first_group() and thread_set_first_tmask() were modified
and renamed to instead return the number and mask of the nth group.
Passing zero continues to return the first one, but it will be more
convenient to use this way when building shards.
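A hedged sketch of the nth-group accessor (the name and the field and
constant names are illustrative):

    static inline int thread_set_nth_group(const struct thread_set *ts, int n)
    {
        int i;

        for (i = 0; i < MAX_TGROUPS; i++)
            if ((ts->grps & (1UL << i)) && !n--)
                return i + 1; /* group numbers are 1-based */
        return 0; /* fewer than n+1 groups in the set */
    }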
The ssl_bind_kw structure is exclusively used for the crt-list keywords;
it must be named otherwise to remove the confusion.
The structure was renamed ssl_crtlist_kws.
In the check_config_validity() function, we performed some consistency
checks to adjust minconn/maxconn attributes for each declared server.
We move this logic into a dedicated function named srv_minmax_conn_apply()
to be able to perform those checks later in the process life when needed
(ie: dynamic servers)
Aurélien reported that the BUG_ON(!new_ts.nbgrp) added in 2.8-dev3 by
commit 50440457e ("MEDIUM: config: restrict shards, not bind_conf to one
group each") can trigger on some invalid configs where the thread_set on
the "bind" line couldn't be resolved. The reason is that we still enter
the parsing loop (as it was done previously) and we possibly have no
group to work on (which was the purpose of this assertion). We thus need
to bypass this whole block in such a condition.
No backport is needed.
Now that we're using thread_sets there's no need to restrict an entire
bind_conf to 1 group, the real concern being the FD, we can move that
restriction to the shard only. This means that as long as we have enough
shards and they're properly aligned on group boundaries (i.e. shards
are an integer divisor of the number of threads), we can support "bind"
lines spanning more than one group.
The check that a shard does not span more than one group is still
performed, and an error is emitted when this happens. But at least now
it becomes
possible to have this:
  global
      nbthread 256

  frontend foo
      bind :1111 shards 4
      bind :2222 shards by-thread
Let's now retrieve the first thread group and its mask from the
thread_set so that we don't need these fields in the bind_conf anymore.
For now we're still limited to the first group (like before) but that
allows to get rid of these fields and to make sure that there's nothing
"special" being done there anymore.
Instead of reading and storing a single group and a single mask for a
"thread" directive on a bind line, we now store the complete range in
a thread set that's stored in the bind_conf. The bind_parse_thread()
function now just calls parse_thread_set() to complete the current set,
which starts empty, and thread_resolve_group_mask() was updated to
support retrieving thread group numbers or absolute thread numbers
directly from the pre-filled thread_set, and continue to feed bind_tgroup
and bind_thread. The CLI parsers which were pre-initialized to set the
bind_tgroup to 1 cannot do it anymore as it would prevent one from
restricting the thread set. Instead check_config_validity() now detects
the CLI frontend and passes the info down to thread_resolve_group_mask()
that will automatically use only the group 1's threads for these
listeners. The same is done for the peers listeners for now.
At this step it's already possible to start with all previous valid
configs as well as extended ones supporting comma-delimited thread
sets. In addition the parser already accepts large ranges spanning
multiple groups, but since the underlying listeners infrastructure
is not ready, for now we're maintaining a specific check against this
at the higher level of the config validity check.
The patch is a bit large because thread resolution is performed in
multiple steps, so we need to adjust all of them at once to preserve
functional and technical consistency.