These values are obviously wrong. There is an extra zero at the end for both
defines. By chance, it is harmless. But it is better to fix it.
This patch should be backported as far as 2.6.
qc_send() was systematically called by quic_conn IO handlers with all
instantiated quic_enc_level. Change this to only register quic_enc_level
for send if needed. Do not call at all qc_send() if no qel registered.
A new function qel_need_sending() is defined to detect if sending is
required. First, it checks if quic_enc_level has prepared frames or
probing is set. It can also returns true if ACK required either on
quic_enc_level itself or because of quic_conn ack timer fired. Finally,
a CONNECTION_CLOSE emission for quic_conn is also a valid case.
This should reduce the number of invocations of qc_send(). This could
improve slightly performance, as well as simplify traces debugging.
A series of previous patches have clean up sending function for
handshake case. Their new exposed API is now flexible enough to convert
app case to use the same functions.
As such, qc_send_hdshk_pkts() is renamed qc_send() and become the single
entry point for QUIC emission. It is used during application packets
emission in quic_conn_app_io_cb(), qc_send_mux(). Also the internal
function qc_prep_hpkts() is renamed qc_prep_pkts().
Remove the new unneeded qc_send_app_pkts() and qc_prep_app_pkts().
Also removed qc_send_app_probing(). It was a simple wrapper over other
application send functions. Now, default qc_send() can be reuse for such
cases with <old_data> argument set to true.
An adjustment was needed when converting qc_send_hdshk_pkts() to the
general qc_send() version. Previously, only a single packets
encoding/emission cycle was performed. This was enough as handshake
packets are always smaller than Tx buffer. However, it may be possible
to emit more application data. As such, a loop is necessary to perform
multiple encoding/emission cycles, as this was already the case in
qc_send_app_pkts().
No functional difference should happen with this commit. However, as
these are critcal functions with a lot of changes, this patch is
labelled as medium.
quic_conn_io_cb() manually implements emission by using lower level
functions qc_prep_pkts() and qc_send_ppkts(). Replace this by using the
higher level function qc_send_hdshk_pkts() which notably handle buffer
allocation and purging.
This allows to clean up send API by flagging qc_prep_pkts() and
qc_send_ppkts() as static. They are now used in a single location inside
qc_send_hdshk_pkts().
qc_send_hdshk_pkts() is a wrapper for qc_prep_hpkts() used on
retransmission. It was restricted to use two quic_enc_level pointers as
distinct arguments. Adapt it to directly use the same list of
quic_enc_level which is passed then to qc_prep_hpkts().
Now for retransmission quic_enc_level send list is built directly into
qc_dgrams_retransmit() which calls qc_send_hdshk_pkts().
Along this change, a new utility function qel_register_send() is
defined. It is an helper to build the quic_enc_level send list. It
enfores that each quic_enc_level instance is only registered in a single
list to prevent memory issues. It is both used in qc_dgrams_retransmit()
and quic_conn_io_cb().
Emission of packets during handshakes was implemented via an API which
uses two alternative ways to specify the list of frames.
The first one uses a NULL list of quic_enc_level as argument for
qc_prep_hpkts(). This was an implicit method to iterate on all qels
stored in quic_conn instance, with frames already inserted in their
corresponding quic_pktns.
The second method was used for retransmission. It uses a custom local
quic_enc_level list specified by the caller as input to qc_prep_hpkts().
Frames were accessible through <retransmit> list pointers of each
quic_enc_level used in an implicit mechanism.
This commit clarifies the API by using a single common method. Now
quic_enc_level list must always be specified by the caller. As for
frames list, each qels must set its new field <send_frms> pointer to the
list of frames to send. Callers of qc_prep_hpkts() are responsible to
always clear qels send list. This prevent a single instance of
quic_enc_level to be inserted while being attached to another list.
This allows notably to clean up some unnecessary code. First,
<retransmit> list of quic_enc_level is removed as it is replaced by new
<send_frms>. Also, it's now possible to use proper list_for_each_entry()
inside qc_prep_hpkts() to loop over each qels. Internal functions for
quic_enc_level selection is now removed.
encode_{chunk,string}() is often found to be used this way:
ret = encode_{chunk,string}(start, stop...)
if (ret == NULL || *ret != '\0') {
//error
}
//success
Indeed, encode_{chunk,string} will always try to add terminating NULL byte
to the output string, unless no space is available for even 1 byte.
However, it means that for the caller to be able to spot an error, then it
must provide a buffer (here: start) which is already initialized.
But this is wrong: not only this is very tricky to use, but since those
functions don't return NULL on failure, then if the output buffer was not
properly initialized prior to calling the function, the caller will
perform invalid reads when checking for failure this way. Moreover, even
if the buffer is initialized, we cannot reliably tell if the function
actually failed this way because if the buffer was previously initialized
with NULL byte, then the caller might think that the call actually
succeeded (since the function didn't return NULL and didn't update the
buffer).
Also, sess_build_logline() relies lf_encode_{chunk,string}() functions
which are in fact wrappers for encode_{chunk,string}() functions and thus
exhibit the same error handling mechanism. It turns out that
sess_build_logline() makes unsafe use of those functions because it uses
the error-checking logic mentionned above while buffer (tmplog) is not
guaranteed to be initialized when entering the function. This may
ultimately cause malfunctions or invalid reads if the output buffer is
lacking space.
To fix the issue once and for all and prevent similar bugs from being
introduced, we make it so encode_{string, chunk} and escape_string()
(based on encode_string()) now explicitly return NULL on failure
(when the function failed to write at least the ending NULL byte)
lf_encode_{string,chunk}() helpers had to be patched as well due to code
duplication.
This should be backported to all stable versions.
[ada: for 2.4 and 2.6 the patch won't apply as-is, it might be helpful to
backport ae1e14d65 ("CLEANUP: tools: removing escape_chunk() function")
first, considering it's not very relevant to maintain a dead function]
Since the Linux capabilities support add-on (see the commit bd84387beb
("MEDIUM: capabilities: enable support for Linux capabilities")), we can also
check haproxy process effective and permitted capabilities sets, when it
starts and runs as non-root.
Like this, if needed network capabilities are presented only in the process
permitted set, we can get this information with capget and put them in the
process effective set via capset. To do this properly, let's introduce
prepare_caps_from_permitted_set().
First, it checks if binary effective set has CAP_NET_ADMIN or CAP_NET_RAW. If
there is a match, LSTCHK_NETADM is removed from global.last_checks list to
avoid warning, because in the initialization sequence some last configuration
checks are based on LSTCHK_NETADM flag and haproxy process euid may stay
unpriviledged.
If there are no CAP_NET_ADMIN and CAP_NET_RAW in the effective set, permitted
set will be checked and only capabilities given in 'setcap' keyword will be
promoted in the process effective set. LSTCHK_NETADM will be also removed in
this case by the same reason. In order to be transparent, we promote from
permitted set only capabilities given by user in 'setcap' keyword. So, if
caplist doesn't include CAP_NET_ADMIN or CAP_NET_RAW, LSTCHK_NETADM would not
be unset and warning about missing priviledges will be emitted at
initialization.
Need to call it before protocol_bind_all() to allow binding to priviledged
ports under non-root and 'setcap cap_net_bind_service' must be set in the
global section in this case.
This commit is similar with the two previous ones. Its purpose is to add
GUID support on listeners. Due to bind_conf and listeners configuration,
some specifities were required.
Its possible to define several listeners on a single bind line, for
example by specifying multiple addresses. As such, it's impossible to
support a "guid" keyword on a bind line. The problem is exacerbated by
the cloning of listeners when sharding is used.
To resolve this, a new keyword "guid-prefix" is defined for bind lines.
It allows to specify a string which will be used as a prefix for
automatically generated GUID for each listeners attached to a bind_conf.
Automatic GUID listeners generation is implemented via a new function
bind_generate_guid(). It is called on post-parsing, after
bind_complete_thread_setup(). For each listeners on a bind_conf, a new
GUID is generated with bind_conf prefix and the index of the listener
relative to other listeners in the bind_conf. This last value is stored
in a new bind_conf field named <guid_idx>. If a GUID cannot be inserted,
for example due to a non-unique value, an error is returned, startup is
interrupted with configuration rejected.
This commit is similar to previous one, except that it implements GUID
support for server instances. A guid_node field is inserted into server
structure. A new "guid" server keyword is defined.
Implement proxy identiciation through GUID. As such, a guid_node member
is inserted into proxy structure. A proxy keyword "guid" is defined to
allow user to fix its value.
GUID format is unspecified to allow users to choose the naming scheme.
Some restrictions however are added by this patch, mainly to ensure
coherence and memory usage.
The first restriction is on the length of GUID. No more than 127
characters can be used to prevent memory over consumption.
The second restriction is on the character set allowed in GUID. Utility
function invalid_char() is used for this : it allows alphanumeric
values and '-', '_', '.' and ':'.
Define a new module guid. Its purpose is to be able to attach a global
identifier for various objects such as proxies, servers and listeners.
A new type guid_node is defined. It will be stored in the objects which
can be referenced by such GUID. Several functions are implemented to
properly initialized, insert, remove and lookup GUID in a global tree.
Modification operations should only be conducted under thread isolation.
Currently, the way proxy-oriented logformat directives are handled is way
too complicated. Indeed, "log-format", "log-format-error", "log-format-sd"
and "unique-id-format" all rely on preparsing hints stored inside
proxy->conf member struct. Those preparsing hints include the original
string that should be compiled once the proxy parameters are known plus
the config file and line number where the string was found to generate
precise error messages in case of failure during the compiling process
that happens within check_config_validity().
Now that lf_expr API permits to compile a lf_expr struct that was
previously prepared (with original string and config hints), let's
leverage lf_expr_compile() from check_config_validity() and instead
of relying on individual proxy->conf hints for each logformat expression,
store string and config hints in the lf_expr struct directly and use
lf_expr helpers funcs to handle them when relevant (ie: original
logformat string freeing is now done at a central place inside
lf_expr_deinit(), which allows for some simplifications)
Doing so allows us to greatly simplify the preparsing logic for those 4
proxy directives, and to finally save some space in the proxy struct.
Also, since httpclient proxy has its "logformat" automatically compiled
in check_config_validity(), we now use the file hint from the logformat
expression struct to set an explicit name that will be reported in case
of error ("parsing [httpclient:0] : ...") and remove the extraneous check
in httpclient_precheck() (logformat was parsed twice previously..)
split parse_logformat_string() into two functions:
parse_logformat_string() sticks to the same behavior, but now becomes an
helper for lf_expr_compile() which uses explicit arguments so that it
becomes possible to use lf_expr_compile() without a proxy, but also
compile an expression which was previously prepared for compiling (set
string and config hints within the logformat expression to avoid manually
storing string and config context if the compiling step happens later).
lf_expr_dup() may be used to duplicate an expression before it is
compiled, lf_expr_xfer() now makes sure that the input logformat is
already compiled.
This is some prerequisite works for log-profiles implementation, no
functional change should be expected.
This patch tries to address a design flaw with how logformat expressions
are parsed from config. Indeed, some parse_logformat_string() calls are
performed during config parsing when the proxy mode is not yet known.
Here's a config example that illustrates the issue:
defaults
mode tcp
listen test
bind :8888
http-response set-header custom-hdr "%trl" # needs http
mode http
The above config should work, because the effective proxy mode is http,
yet haproxy fails with this error:
[ALERT] (99051) : config : parsing [repro.conf:6] : error detected in proxy 'test' while parsing 'http-response set-header' rule : format tag 'trl' is reserved for HTTP mode.
To fix the issue once and for all, let's implement smart postparsing for
logformat expressions encountered during config parsing:
- split parse_logformat_string() (and subfonctions) in order to create a
new lf_expr_postcheck() function that must be called to finish
preparing and checking the logformat expression once the proxy type is
known.
- save some config hints info during parse_logformat_string() to
generate more precise error messages during lf_expr_postcheck(), if
needed, we rely on curpx->conf.args.{file,line} hints for that because
parse_logformat_string() doesn't know about current file and line
number.
- lf_expr_postcheck() uses PR_FL_CHECKED proxy flag to know if the
function may try to make the proxy compatible with the expression, or
if it should simply fail as soon as an incompatibility is detected.
- if parse_logformat_string() is called from an unchecked proxy, then
schedule the expression for postparsing, else (ie: during runtime),
run the postcheck right away.
This change will also allow for some logformat expression error handling
simplifications in the future.
PR_FL_CHECKED is set on proxy once the proxy configuration was fully
checked (including postparsing checks).
This information may be useful to functions that need to know if some
config-related proxy properties are likely to change or not due to parsing
or postparsing/check logics. Also, during runtime, except for some rare cases
config-related proxy properties are not supposed to be changed.
log format expressions are broadly used within the code: once they are
parsed from input string, they are converted to a linked list of
logformat nodes.
We're starting to face some limitations because we're simply storing the
converted expression as a generic logformat_node list.
The first issue we're facing is that storing logformat expressions that
way doesn't allow us to add metadata alongside the list, which is part
of the prerequites for implementing log-profiles.
Another issue with storing logformat expressions as generic lists of
logformat_node elements is that it's starting to become really hard to
tell when we rely on logformat expressions or not in the code given that
there isn't always a comment near the list declaration or manipulation
to indicate that it's relying on logformat expressions under the hood,
so this adds some complexity for code maintenance.
This patch looks quite impressive due to changes in a lot of header and
source files (since logformat expressions are broadly used), but it does
a simple thing: it defines the lf_expr structure which itself holds a
generic list of logformat nodes, and then declares some helpers to
manipulate lf_expr elements and fixes the code so that we now exclusively
manipulate logformat_node lists as lf_expr elements outside of log.c.
For now, lf_expr struct only contains the list of logformat nodes (no
additional metadata), but now that we have dedicated type and helpers,
doing so in the future won't be problematic at all and won't require
extensive code changes.
This is a pretty simple patch despite requiring to make some visible
changes in the code:
When parsing a logformat string, log tags (ie: '%tag', AKA log tags) are
turned into logformat nodes with their type set to the type of the
corresponding logformat_tag element which was matched by name. Thus, when
"compiling" a logformat tag, we only keep a reference to the tag type
from the original logformat_tag.
For example, for "%B" log tag, we have the following logformat_tag
element:
{
.name = "B",
.type = LOG_FMT_BYTES,
.mode = PR_MODE_TCP,
.lw = LW_BYTES,
.config_callback = NULL
}
When parsing "%B" string, we search for a matching logformat tag
inside logformat_tags[] array using the provided name, once we find a
matching element, we craft a logformat node whose type will be
LOG_FMT_BYTES, but from the node itself, we no longer have access to
other informations that are set in the logformat_tag struct element.
Thus from a logformat_node resulting from a log tag, with current
implementation, we cannot easily get back to matching logformat_tag
struct element as it would require us to scan the whole logformat_tags
array at runtime using node->type to find the matching element.
Let's take a simpler path and consider all tag-specific LOG_FMT_*
subtypes as being part of the same logformat node type: LOG_FMT_TAG.
Thanks to that, we're now able to distinguish logformat nodes made
from logformat tag from other logformat nodes, and link them to
their corresponding logformat_tag element from logformat_tags[] array. All
it costs is a simple indirection and an extra pointer in logformat_node
struct.
While at it, all LOG_FMT_* types related to logformat tags were moved
inside log.c as they have no use outside of it since they are simply
lookup indexes for sess_build_logline() and could even be replaced by
function pointers some day...
rename logformat_type internal struct to logformat_tag to to make it less
confusing, then expose logformat_tag struct through header file so that it
can be referenced in other structs.
also rename logformat_keywords[] to logformat_tags[] for better
consistency.
What we use to call logformat variable in the code is referred as
log-format tag in the documentation. Having both 'var' and 'tag' labels
referring to the same thing is really confusing. Let's make the code
comply with the documentation by replacing all logformat var/variable/VAR
occurences with either tag or TAG.
No functional change should be expected, the only visible side-effect from
user point of view is that "variable" was replaced by "tag" in some error
messages.
In order to reduce the contention on the table when keys expire quickly,
we're spreading the load over multiple trees. That counts for keys and
expiration dates. The shard number is calculated from the key value
itself, both when looking up and when setting it.
The "show table" dump on the CLI iterates over all shards so that the
output is not fully sorted, it's only sorted within each shard. The Lua
table dump just does the same. It was verified with a Lua program to
count stick-table entries that it works as intended (the test case is
reproduced here as it's clearly not easy to automate as a vtc):
function dump_stk()
local dmp = core.proxies['tbl'].stktable:dump({});
local count = 0
for _, __ in pairs(dmp) do
count = count + 1
end
core.Info('Total entries: ' .. count)
end
core.register_action("dump_stk", {'tcp-req', 'http-req'}, dump_stk, 0);
##
global
tune.lua.log.stderr on
lua-load-per-thread lua-cnttbl.lua
listen front
bind :8001
http-request lua.dump_stk if { path_beg /stk }
http-request track-sc1 rand(),upper,hex table tbl
http-request redirect location /
backend tbl
stick-table size 100k type string len 12 store http_req_cnt
##
$ h2load -c 16 -n 10000 0:8001/
$ curl 0:8001/stk
## A count close to 100k appears on haproxy's stderr
## On the CLI, "show table tbl" | wc will show the same.
Some large parts were reindented only to add a top-level loop to iterate
over shards (e.g. process_table_expire()). Better check the diff using
git show -b.
The number of shards is decided just like for the pools, at build time
based on the max number of threads, so that we can keep a constant. Maybe
this should be done differently. For now CONFIG_HAP_TBL_BUCKETS is used,
and defaults to CONFIG_HAP_POOL_BUCKETS to keep the benefits of all the
measurements made for the pools. It turns out that this value seems to
be the most reasonable one without inflating the struct stktable too
much. By default for 1024 threads the value is 32 and delivers 980k RPS
in a test involving 80 threads, while adding 1kB to the struct stktable
(roughly doubling it). The same test at 64 gives 1008 kRPS and at 128
it gives 1040 kRPS for 8 times the initial size. 16 would be too low
however, with 675k RPS.
The stksess already have a shard number, it's the one used to decide which
peer connection to send the entry. Maybe we should also store the one
associated with the entry itself instead of recalculating it, though it
does not happen that often. The operation is done by hashing the key using
XXH32().
The peers also take and release the table's lock but the way it's used
it not very clear yet, so at this point it's sure this will not work.
At this point, this allowed to completely unlock the performance on a
80-thread setup:
before: 5.4 Gbps, 150k RPS, 80 cores
52.71% haproxy [.] stktable_lookup_key
36.90% haproxy [.] stktable_get_entry.part.0
0.86% haproxy [.] ebmb_lookup
0.18% haproxy [.] process_stream
0.12% haproxy [.] process_table_expire
0.11% haproxy [.] fwrr_get_next_server
0.10% haproxy [.] eb32_insert
0.10% haproxy [.] run_tasks_from_lists
after: 36 Gbps, 980k RPS, 80 cores
44.92% haproxy [.] stktable_get_entry
5.47% haproxy [.] ebmb_lookup
2.50% haproxy [.] fwrr_get_next_server
0.97% haproxy [.] eb32_insert
0.92% haproxy [.] process_stream
0.52% haproxy [.] run_tasks_from_lists
0.45% haproxy [.] conn_backend_get
0.44% haproxy [.] __pool_alloc
0.35% haproxy [.] process_table_expire
0.35% haproxy [.] connect_server
0.35% haproxy [.] h1_headers_to_hdr_list
0.34% haproxy [.] eb_delete
0.31% haproxy [.] srv_add_to_idle_list
0.30% haproxy [.] h1_snd_buf
WIP: uint64_t -> long
WIP: ulong -> uint
code is much smaller
Right now we're taking the stick-tables update lock for reads just for
the sake of checking if the update index is past it or not. That's
costly because even taking the read lock is sufficient to provoke a
cache line write, while when under load or attack it's frequent that
the update has not yet been propagated and wouldn't require anything.
This commit brings a new field to the stksess, "seen", which is zeroed
when the entry is updated, and set to one as soon as at least one peer
starts to consult it. This way it will reflect that the entry must be
updated again so that this peer can see it. Otherwise no update will
be necessary. For now the flag is only set/reset but not exploited.
A great care is taken to avoid writes whenever possible.
Given the xz drama which allowed liblzma to be linked to openssh, lets remove
libsystemd to get rid of useless dependencies.
The sd_notify API seems to be stable and is now documented. This patch replaces
the sd_notify() and sd_notifyf() function by a reimplementation inspired by the
systemd documentation.
This should not change anything functionnally. The function will be built when
haproxy is built using USE_SYSTEMD=1.
References:
https://github.com/systemd/systemd/issues/32028https://www.freedesktop.org/software/systemd/man/devel/sd_notify.html#Notes
Before:
wla@kikyo:~% ldd /usr/sbin/haproxy
linux-vdso.so.1 (0x00007ffcfaf65000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x000074637fef4000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x000074637fe4f000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x000074637f400000)
liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x000074637fe0d000)
libsystemd.so.0 => /lib/x86_64-linux-gnu/libsystemd.so.0 (0x000074637f92a000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x000074637f365000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000074637f000000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000074637f27a000)
libcap.so.2 => /lib/x86_64-linux-gnu/libcap.so.2 (0x000074637fdff000)
libgcrypt.so.20 => /lib/x86_64-linux-gnu/libgcrypt.so.20 (0x000074637eeb8000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x000074637fdcd000)
libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x000074637ee01000)
liblz4.so.1 => /lib/x86_64-linux-gnu/liblz4.so.1 (0x000074637fda8000)
/lib64/ld-linux-x86-64.so.2 (0x000074637ff5d000)
libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x000074637f904000)
After:
wla@kikyo:~% ldd /usr/sbin/haproxy
linux-vdso.so.1 (0x00007ffd51901000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f758d6c0000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f758d61b000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007f758ca00000)
liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x00007f758d5d9000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f758d365000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f758d5ba000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f758c600000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f758c915000)
/lib64/ld-linux-x86-64.so.2 (0x00007f758d729000)
A backport to all stable versions could be considered at some point.
This is a simple algorithm to replace the classic slow start phase of the
congestion control algorithms. It should reduce the high packet loss during
this step.
Implemented only for Cubic.
Motivation: When services are discovered through DNS resolution, the order in
which DNS records get resolved and assigned to servers is arbitrary. Therefore,
even though two HAProxy instances using chash balancing might agree that a
particular request should go to server3, it is likely the case that they have
assigned different IPs and ports to the server in that slot.
This patch adds a server option, "hash-key <key>" which can be set to "id" (the
existing behaviour, default), "addr", or "addr-port". By deriving the keys for
the chash tree nodes from a server's address and port we ensure that independent
HAProxy instances will agree on routing decisions. If an address is not known
then the key is derived from the server's puid as it was previously.
When adjusting a server's weight, we now check whether the server's hash has
changed. If it has, we have to remove all its nodes first, since the node keys
will also have to change.
log load-balancing implementation was not seamlessly integrated within
lbprm API. The consequence is that it could become harder to maintain
over time since it added some specific cases just for the log backend.
Moreover, it resulted in some code duplication since balance algorithms
that are common to logs and regular (tcp, http) backends were specifically
rewritten for log backends.
Thanks to the previous commit, we now have all the prerequisites to make
log load-balancing fully leverage lbprm logic. Thus in this patch we make
__do_send_log_backend() use existing lbprm algorithms, and we no longer
require log-specific lbprm initialization in cfgparse.c and in
postcheck_log_backend().
As a bonus, for log backends this allows weighed algorithms to properly
support weights (ie: roundrobin, random and log-hash) since we now
leverage the same lb algorithms that we use for tcp/http backends
(doc was updated).
As previously mentioned in cd352c0db ("MINOR: log/balance: rename
"log-sticky" to "sticky""), let's define a sticky algorithm that may be
used from any protocol. Sticky algorithm sticks on the same server as
long as it remains available.
The documentation was updated accordingly.
The CLI applet is now using its own snd_buf callback function. Instead of
copying as most output data as possible, only one command is copied at a
time.
To do so, a new state CLI_ST_PARSEREQ is added for the CLI applet. In this
state, the CLI I/O handle knows a full command was copied into its input
buffer and it must parse this command to evaluate it.
This flag can be use by endpoints to know the data to send, via .snd_buf
callback function are the last ones. It is useful to know a shutdown is
pending but it cannot be delivered while sedning data are not consumed.
applet_putchk() and other similar functions are now testing the applet's
type to use the applet's outbuf instead of the channel's buffer. This will
ease applets convertion because most of them relies on these functions.
These functions are very similar to co_getline() and co_getdelim(). The
first one retrieves the longest part of the buffer that is composed
exclusively of characters not in the a delimiter set. The second one stops
on LF only and thus returns a line.
sc_sync_recv() and sc_sync_send() were added to use connection or applet
versions, depending on the endpoint type. For now these functions are not
used. But this will be used by process_stream() to replace the connection
version.
This option can be used to set a default ocsp-update mode for all
certificates of a given conf file. It allows to activate ocsp-update on
certificates without the need to create separate crt-lists. It can still
be superseded by the crt-list 'ocsp-update' option. It takes either "on"
or "off" as value and defaults to "off".
Since setting this new parameter to "on" would mean that we try to
enable ocsp-update on any certificate, and also certificates that don't
have an OCSP URI, the checks performed in ssl_sock_load_ocsp were
softened. We don't systematically raise an error when trying to enable
ocsp-update on a certificate that does not have an OCSP URI, be it via
the global option or the crt-list one. We will still raise an error when
a user tries to load a certificate that does have an OCSP URI but a
missing issuer certificate (if ocsp-update is enabled).
Now the rings have one wait queue per group. This should limit the
contention on systems such as EPYC CPUs where the performance drops
dramatically when using more than one CCX.
Tests were run with different numbers and it was showed that value
6 outperforms all other ones at 12, 24, 48, 64 and 80 threads on an
EPYC, a Xeon and an Ampere CPU. Value 7 sometimes comes close and
anything around these values degrades quickly. The value has been
left tunable in the global section.
This commit only introduces everything needed to set up the queue count
so that it's easier to adjust it in the forthcoming patches, but it was
initially added after the series, making it harder to compare.
It was also shown that trying to group the threads in queues by their
thread groups is counter-productive and that it was more efficient to
do that by applying a modulo on the thread number. As surprising as it
seems, it does have the benefit of well balancing any number of threads.
It's inefficient and counter-productive that each ring writer iterates
over all readers to wake them up. Let's just have one in charge of this,
it strongly limits contention. The only thing is that since the thread
is iterating over a list, we want to be sure that if the first readers
have already completed their job, they will be woken up again. For this
we keep a counter of messages delivered after the wakeup started, and
the waking thread will check it before going back to sleep. In order to
avoid looping forever, it will also drop its waking flag soon enough to
possibly let another one take it.
There used to be a few cases of watchdogs before this on a 24-core AMD
EPYC platform on the list iteration those never appeared anymore.
The perf has dropped a bit on 3C6T on the EPYC, from 6.61 to 6.0M but
remains unchanged at 24C48T.
It was only used to protect the list which is now an mt_list so it
doesn't provide any required protection anymore. It obviously also
used to provide strict ordering between the writer and the reader
when the writer started to update the messages, but that's now
covered by the oredered tail updates and updates to the readers
count to protect the area.
The message rate on small thread counts (up to 12) saw a boost of
roughly 5% while on large counts while for large counts it lost
about 2% due to some contention now becoming visible elsewhere.
Typical measures are 6.13M -> 6.61M at 3C6T, and 1.88 -> 1.92M at
24C48T on the EPYC.
Rings are keeping a lock only for the list, which apparently doesn't
need anything more than an mt_list, so let's first turn it into that
before dropping the lock. There should be no visible effect.
We're now locking the tail while looking for some room in the ring. In
fact it's still while writing to it, but the goal definitely is to get
rid of the lock ASAP. For this we reserve the topmost bit of the tail
as a lock, which may have as a possible visible effect that buffers will
be limited to 2GB instead of 4GB on 32-bit machines (though in practise,
good luck for allocating more than 2GB contiguous on 32-bit), but in
practice since the size is read with atol() and some operating systems
limit it to LONG_MAX unless passing negative numbers, the limit is
already there.
For now the impact on x86_64 is significant (drop from 2.35 to 1.4M/s
on 48 threads on EPYC 24 cores) but this situation is only temporary
so that changes can be reviewable and bisectable.
Other approaches were attempted, such as using XCHG instead, which is
slightly faster on x86 with low thread counts (but causes more write
contention), and forces readers to stall under heavy traffic because
they can't access a valid value for the queue anymore. A CAS requires
preloading the value and is les good on ARMv8.1. XADD could also be
considered with 12-13 upper bits of the offset dedicated to locking,
but that looks overkill.
We really want to let the readers and writers act on different areas, so
we want to have the tail and the head on separate cache lines, themselves
separate from the rest of the ring. Doing so improves the performance from
2.15 to 2.35M msg/s at 48 threads on a 24-core EPYC.
This increases the header space from 32 to 192 bytes when threads are
enabled. But since we already have the header size available in the file,
haring remains able to detect the aligned vs unaligned formats and call
dump_v2a() when aligned is detected.
The purpose is to store a head and a tail that are independent so that
we can further improve the API to update them independently from each
other.
The struct was arranged like the original one so that as long as a ring
has its head set to zero (i.e. no recycling) it will continue to work.
The new format is already detectable thanks to the "rsvd" field which
indicates the number of reserved bytes at the beginning. It's located
where the buffer's area pointer previously was, so that older versions
of haring can continue to open the ring in repair mode, and newer ones
can use the fact that the upper bits of that variable are zero to guess
that it's working with the new format instead of the old one. Also let's
keep in mind that the layout will further change to place some alignment
constraints.
The haring tool will thus updated based on this and it detects that the
rsvd field is smaller than a page and that the sum of it with the size
equals the mapped size, in which case it uses the new dump_v2() function
instead of dump_v1(). The new function also creates a buffer from the
ring's area, size, head and tail and calls the generic one so that no
other code had to be adapted.
The code now looks cleaner and more easily shows what still needs to be
addressed. There are not that many changes in practice, these are mostly
mechanical, essentially hiding the buffer from the callers.
We'll need to add more complex structures in the ring, such as wait
queues. That's far too much to be stored into the area in case of
file-backed contents, so let's split the ring definition and its
storage once for all.
This patch introduces a struct ring_storage which is assigned to
ring->storage, which contains minimal information to represent the
storage layout, i.e. for now only the buffer, and all the rest
remains in the ring itself. The storage is appended immediately after
it and the buffer's pointer always points to that area. It has the
benefit of remaining 100% compatible with the existing file-backed
layout. In memory, the allocation loses the size of a struct buffer.
It's not even certain it's worth placing the size there, given that it's
constant and that a dump of a ring wouldn't really need it (the file size
is sufficient). But for now everything comes with the struct buffer, and
later this will change once split into head and tail. Also this area may
be completed with more information in the future (e.g. storage version,
format, endianness, word size etc).
Till now we used to rely on a heuristic pointer comparison to check if
a ring was mapped or allocated. Better assign a flag to clarify this
because it's going to become difficult otherwise.
This will mostly be used during reallocation and boot-time duplicates,
the purpose is simply to save the caller from having to know the details
of the internal representation.
Many ring-based APIs need a tail and a head, with some extra assumption
that the user takes care of not filling the ring so that tail==head is
unambiguous. Vectors are particularly suited to this usage so here we
create 4 functions to create vectors representing free room or data
from a ring, as well as updating rings based on a pair of vectors that
represents either free space or data.
The buffers API defines both a storage layout and how to handle the
data. The storage is shared with the chunks API which only deals with
non-wrapping messages while buffers support wrapping both of the data
and of the free space. As such, most of the buffers code already makes
special cases of two parts in a buffer, the first one before wrapping
and the optional second one after the wrapping occurred.
The thing is, there are plenty of other places (e.g. rings) where the
code dealing with wrapping is desirable but with a different storage
layout. Let's export the existing buffer handling code related to
reading/writing wrapping data and make it work with arbitrary vector
pairs instead. This will handle wrapping and holes in messages if
desired, and it will be up to the caller to decide how its messages
are arranged and to pass the relevant ptr,len elements.
The code is limited to two vectors because this is sufficient to deal
with wrapping without making the code needlessly complex. I.e. this will
not reassemble an iovec. For vectors, since we already had the ist type,
there's no point inventing a new type, and it's even possible that over
time some callers will find benefits in using this unified API (i.e. no
NOP translation layer). It also allows to pass inputs as direct arguments
and outputs as pointers. Not only this is more efficient code-wise, but
it also avoids the accidental use of a wrong function. It was indeed
found that naming functions is even harder than with the buffer as the
notion of from/to is even fuzzier here.
The API will likely continue to evolve and some functions might get
renamed to more explicit ones over time to limit confusion. For now
the code provides anything needed to reset/create/fill/erase/read/peek
or measure vector pairs and to manipulate chars/blocks/varints to/from
there.
In order to support concurrent writers we'll need to lock areas in the
buffer. For this we'll use one special value of the single-byte readers
count. Let's reserve it now and use the macro instead of the hardcoded
255.
For some concurrently accessed buffers we can't rely on head/data etc,
but sometimes the access patterns guarantees that the buffer contents
are there. Let's implement a function to read contents from a fixed
offset, which never checks head nor data, only the area and its size.
It's the caller's job to get this offset.
This new function b_putblk_ofs() puts one full block of data of length
<len> from <blk> into the buffer, starting from absolute offset <offset>
after the buffer's area. As a convenience to avoid complex checks in
callers, the offset is allowed to exceed a valid one by no more than one
buffer size, and will automatically be wrapped. The caller is responsible
for ensuring that <len> doesn't exceed the known length of the available
room at this position, otherwise data may be overwritten. The buffer's
length is *not* updated, so generally the caller will have updated it
before calling this function. This is meant to be used on concurrently
accessed buffers, so that a writer can append data while a reader is
blocked by other means from reaching the current area The function
guarantees never to use ->head nor ->data.
This new function is made around the loop that scans a ring for new
messages and dispatches them to a message handler. It also takes
ring flags (WAIT, NEW, etc) and offset pointers that the caller will
use to initialize/reuse/update the current processing offset. The
caller is still responsible for presetting it to ~0 before the
first call if it wants the function to automatically adjust it (or set
it to the correct value). The function may also return the last_ofs
that was known before releasing the lock so that the caller knows
what to compare against and if it needs to restart processing or not.
The context remains a void* so that should not necessarily depend on
an appctx.
The current "show ring" code was ported to this and it continues to
work as expected.
A ring is used for the DNS code but slightly differently from the generic
one, which prevents some important changes from being made to the generic
code without breaking DNS. As the use cases differ, it's better to just
split them apart for now and have the DNS code use its own ring that we
rename dns_ring and let the generic code continue to live on its own.
The unused parts such as CLI registration were dropped, resizing and
allocation from a mapped area were dropped. dns_ring_detach_appctx() was
kept despite not being used, so as to stay consistent with the comments
that say it must be called, despite the DNS code explicitly mentioning
that it skips it for now (i.e. this may change in the future).
Hopefully after the generic rings are converted the DNS code can migrate
back to them, though this is really not necessary.
This function takes a buffer on input, and offset and a length, and
consumes the block from that buffer to send it to the appctx's output
buffer. Contrary to its sibling applet_append_line(), instead of just
appending an LF at the end of the line, it prepends the message size
in decimal and a space before the message, as expected by syslog TCP
implementaions. This will be used to simplify the ring reader code.
This function takes a buffer on input, and offset and a length, and
consumes the block from that buffer to send it to the appctx's output
buffer. This will be used to simplify the ring reader code.
Tests on various systems show that x86 prefers not to wait at all inside
read loops while aarch64 prefers to wait a little bit. Instead of having
to stuff ifdefs around __ha_cpu_relax() inside plenty of such loops
waiting for a condition to appear, better implement a new variant that
we call __ha_cpu_relax_for_read() which honors each architecture's
preferences and is the same as __ha_cpu_relax() for other ones.
Willy reported that since 3ac79b504 ("MEDIUM: server:
make server_set_inetaddr() updater serializable"), haproxy fails to
compile on some older compilers such as gcc-4.4 with this kind of error:
src/server.c: In function 'snr_resolution_cb':
src/server.c:4471: error: unknown field 'dns_resolver' specified in initializer
compilation terminated due to -Wfatal-errors.
make: *** [Makefile:1006: src/server.o] Error 1
This is due to referencing a member inside anonymous union from a compound
literal assignment. Apparently such use of anonymous union wasn't properly
supported back then on older compilers. To fix the issue, we give "u" name
to the parent union use this name to explicitly refer to the union where
relevant in the code (only a few changes fortunately).
The fix itself was verified to restore build compatibility with gcc 4.4
(and even 4.2).
As 3ac79b504 is used as a prerequisite for 64c9c8ef3 ("BUG/MINOR:
server/dns: use server_set_inetaddr() to unset srv addr from DNS"), please
consider backporting this patch too if 64c9c8ef3 happens to be backported
in 2.9.
This commit similar to the following one :
65ae241dcfe710e1cdd3ec4e7a9bde38d2e4c116
MEDIUM: server: close idle conn before server deletion
This patch implements a similar logic, this time to close private idle
connections stored in sessions. The principle is identical to the above
commit : conn_release() is used on idle connections after a takeover to
ensure thread safety.
An extra change was required to be able to execute takeover on such
connections. Their original thread ID was unknown, contrary to non
private connections which are stored in sharded lists. As such, a new
tid member has been added under sess_priv_conns chaining element.
Extend takeover API both for MUX and XPRT with a new boolean argument
<release>. Its purpose is to signal if the connection will be freed
immediately after the takeover, rendering new resources allocation
unnecessary.
For the moment, release argument is always false. However, it will be
set to true on delete server CLI handler to proactively close server
idle connections.
Several places reuse the same code to ensure a connection is properly
freed, either via its MUX or by calling the proper set of functions.
Factorize all of this in a new function conn_release().
This new function is now called via session_free() and
session_accept_fd(). It will also be reused on delete server to
proactively close idle connections.
The CLI command "update ssl ocsp-response" was forcefully removing an
OCSP response from the update tree regardless of whether it used to be
in it beforehand or not. But since the main OCSP upate task works by
removing the entry being currently updated from the update tree and then
reinserting it when the update process is over, it meant that in the CLI
command code we were modifying a structure that was already being used.
These concurrent accesses were not properly locked on the "regular"
update case because it was assumed that once an entry was removed from
the update tree, the update task was the only one able to work on it.
Rather than locking the whole update process, an "updating" flag was
added to the certificate_ocsp in order to prevent the "update ssl
ocsp-response" command from trying to update a response already being
updated.
An easy way to reproduce this crash was to perform two "simultaneous"
calls to "update ssl ocsp-response" on the same certificate. It would
then crash on an eb64_delete call in the main ocsp update task function.
This patch can be backported up to 2.8. Wait a little bit before
backporting.
With the current way OCSP responses are stored, a single OCSP response
is stored (in a certificate_ocsp structure) when it is loaded during a
certificate parsing, and each SSL_CTX that references it increments its
refcount. The reference to the certificate_ocsp is kept in the SSL_CTX
linked to each ckch_inst, in an ex_data entry that gets freed when the
context is freed.
One of the downsides of this implementation is that if every ckch_inst
referencing a certificate_ocsp gets detroyed, then the OCSP response is
removed from the system. So if we were to remove all crt-list lines
containing a given certificate (that has an OCSP response), and if all
the corresponding SSL_CTXs were destroyed (no ongoing connection using
them), the OCSP response would be destroyed even if the certificate
remains in the system (as an unused certificate).
In such a case, we would want the OCSP response not to be "usable",
since it is not used by any ckch_inst, but still remain in the OCSP
response tree so that if the certificate gets reused (via an "add ssl
crt-list" command for instance), its OCSP response is still known as
well.
But we would also like such an entry not to be updated automatically
anymore once no instance uses it. An easy way to do it could have been
to keep a reference to the certificate_ocsp structure in the ckch_store
as well, on top of all the ones in the ckch_instances, and to remove the
ocsp response from the update tree once the refcount falls to 1, but it
would not work because of the way the ocsp response tree keys are
calculated. They are decorrelated from the ckch_store and are the actual
OCSP_CERTIDs, which is a combination of the issuer's name hash and key
hash, and the certificate's serial number. So two copies of the same
certificate but with different names would still point to the same ocsp
response tree entry.
The solution that answers to all the needs expressed aboved is actually
to have two reference counters in the certificate_ocsp structure, one
actual reference counter corresponding to the number of "live" pointers
on the certificate_ocsp structure, incremented for every SSL_CTX using
it, and one for the ckch stores.
If the ckch_store reference counter falls to 0, the corresponding
certificate must have been removed via CLI calls ('set ssl cert' for
instance).
If the actual refcount falls to 0, then no live SSL_CTX uses the
response anymore. It could happen if all the corresponding crt-list
lines were removed and there are no live SSL sessions using the
certificate anymore.
If any of the two refcounts becomes 0, we will always remove the
response from the auto update tree, because there's no point in spending
time updating an OCSP response that no new SSL connection will be able
to use. But the certificate_ocsp object won't be removed from the tree
unless both refcounts are 0.
Must be backported up to 2.8. Wait a little bit before backporting.
A crash could occured if a session_add_conn() would temporarily failed
when called via h2_detach(). In this case, connection owner is reset to
NULL. However, if this wasn't the last connection stream, the connection
won't be destroyed. When h2_detach() is recalled for another stream and
this time session_add_conn() succeeds, a crash will occur due to
session_check_idle_conn() invocation with a NULL connection owner.
To fix this, ensure connection owner is always set after
session_add_conn() success.
This bug is considered as minor as the only failure reason for
session_add_conn() is a pool allocation issue.
This should be backported up to all stable releases.
Similarly to "expose-exprimental-directives" option, there is no a global
option to expose some deprecated directives. Idea is to have a way to silent
warnings about deprecated directives when there is no alternative solution.
Of course, deprecated directives covered by this option are not listed and
may change. It is only a best effort to let users upgrade smoothly.
A server can only be deleted if there is no elements which reference it.
This is taken care via srv_check_for_deletion(), most notably for active
and idle connections.
A special case occurs for connections directly managed by a session.
This is for so-called private connections, when using http-reuse never
or H2 + http-reuse safe for example. In this case. server does not
account these connections into its idle lists. This caused a bug as the
server is deleted despite the session still being able to access it.
To properly fix this, add a new referencing element into the server for
these session connections. A mt_list has been chosen for this. On
default http-reuse, private connections are typically not used so it
won't make any difference. If using H2 servers, or more generally when
dealing with private connections, insert/delete should typically occur
only once per session lifetime so impact on performance should be
minimal.
This should be backported up to 2.4. Note that srv_check_for_deletion()
was introduced in 3.0 dev tree. On backport, the extra condition in it
should be placed in cli_parse_delete_server() instead.
By default, backend connections are attached to a server instance. This
allows to implement connection reuse. However, in some particular cases,
connection cannot be shared accross several clients. These connections
are considered and private and are attached to the session instance
instead.
These private connections are also indexed by the target server to not
mix them. All of this is implemented via a dedicated structure
previously named struct sess_srv_list.
Rename it to better reflect its usage to struct sess_priv_conns. Also
rename its internal members and all of the associated functions.
This commit is only a renaming, thus no functional impact is expected.
While trying to reproduce another crash case involving lua filters
reported by @bgrooot on GH #2467, we found out that mixing filters loaded
from different contexts ('lua-load' vs 'lua-load-per-thread') for the same
stream isn't supported and may even cause the process to crash.
Historically, mixing lua-load and lua-load-per-threads for a stream wasn't
supported, but this changed thanks to 0913386 ("BUG/MEDIUM: hlua: streams
don't support mixing lua-load with lua-load-per-thread").
However, the above fix didn't consider lua filters's use-case properly:
unlike lua fetches, actions or even services, lua filters don't simply
use the stream hlua context as a "temporary" hlua running context to
process some hlua code. For fetches, actions.. hlua executions are
processed sequentially, so we simply reuse the hlua context from the
previous action/fetch to run the next one (this allows to bypass memory
allocations and initialization, thus it increases performance), unless
we need to run on a different hlua state-id, in which case we perform a
reset of the hlua context.
But this cannot work with filters: indeed, once registered, a filter will
last for the whole stream duration. It means that the filter will rely
on the stream hlua context from ->attach() to ->detach(). And here is the
catch, if for the same stream we register 2 lua filters from different
contexts ('lua-load' + 'lua-load-per-thread'), then we have an issue,
because the hlua stream will be re-created each time we switch between
runtime contexts, which means each time we switch between the filters (may
happen for each stream processing step), and since lua filters rely on the
stream hlua to carry context between filtering steps, this context will be
lost upon a switch. Given that lua filters code was not designed with that
in mind, it would confuse the code and cause unexpected behaviors ranging
from lua errors to crashing process.
So here we take another approach: instead of re-creating the stream hlua
context each time we switch between "global" and "per-thread" runtime
context, let's have both of them inside the stream directly as initially
suggested by Christopher back then when talked about the original issue.
For this we leverage hlua_stream_ctx_prepare() and hlua_stream_ctx_get()
helper functions which return the proper hlua context for a given stream
and state_id combination.
As for debugging infos reported after ha_panic(), we check for both hlua
runtime contexts to check if one of them was active when the panic occured
(only 1 runtime ctx per stream may be active at a given time).
This should be backported to all stable versions with 0913386
("BUG/MEDIUM: hlua: streams don't support mixing lua-load with lua-load-per-thread")
This commit depends on:
- "DEBUG: lua: precisely identify if stream is stuck inside lua or not"
[for versions < 2.9 the ha_thread_dump_one() part should be skipped]
- "MINOR: hlua: use accessors for stream hlua ctx"
For 2.4, the filters API didn't exist. However it may be a good idea to
backport it anyway because ->set_priv()/->get_priv() from tcp/http lua
applets may also be affected by this bug, plus it will ease code
maintenance. Of course, filters-related parts should be skipped in this
case.
When ha_panic() is called by the watchdog, we try to guess from
ha_task_dump() and ha_thread_dump_one() if the thread was stuck while
executing lua from the stream context. However we consider this is the
case by simply checking if the stream hlua context was set, but this is
not very precise because if the hlua context is set, then it simply means
that at least one lua instruction was executed at the stream level, not
that the stuck was currently executing lua when the panic occured.
This is especially true with filters, one could simply register a lua
filter that does nothing but this will still end up initializing the
stream hlua context for each stream. If the thread end up being stuck
during the stream handling, then debug dumping functions will report
that the stream was stuck while handling lua, which is not necessarilly
true, and could in fact confuse us even more.
So here we take another approach, we add the BUSY flag to hlua context:
this flag is set by hlua_ctx_resume() around lua_resume() call, this way
we can precisely tell if the thread was handling lua when it was
interrupted, and we rely on this flag in debug functions to check if the
thread was effectively stuck inside lua or not while processing the stream
No backport needed unless a commit depends on it.
The new "ssl-security-level" option allows one to change the OpenSSL
security level without having to change the openssl.cnf global file of
your distribution. This directives applies on every SSL_CTX context.
People sometimes change their security level directly in the ciphers
directive, however there are some cases when the security level change
is not applied in the right order (for example when applying a DH
param).
Before this patch, it was to possible to trick by using a specific
openssl.cnf file and start haproxy this way:
OPENSSL_CONF=./openssl.cnf ./haproxy -f bug-2468.cfg
Values for the security level can be found there:
https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_security_level.html
This was discussed in github issue #2468.
This commit removes qc_treat_rx_crypto_frms(). This function was used in
a single place inside qc_ssl_provide_all_quic_data(). Besides, its
naming was confusing as conceptually it is directly linked to quic_ssl
module instead of quic_rx.
Thus, body of qc_treat_rx_crypto_frms() is inlined directly inside
qc_ssl_provide_all_quic_data(). Also, qc_ssl_provide_quic_data() is now
only used inside quic_ssl to its scope is set to static. Overall, API
for CRYPTO frame handling is now cleaner.
Compilation on solaris fails because of usage of names reserved on that
platform, i.e. 'queue' and 's_addr'.
This patch redefines 'queue' as '_queue' and renames 's_addr' to
'srv_addr' which fixes compilation for now.
Future plan: rename 'queue' in code base so define can be removed again.
Backporting: 2.9, 2.8
The sink lock was made to prevent event producers from passing while
there were other threads trying to print a "dropped" message, in order
to guarantee the absence of reordering. It has a serious impact however,
which is that all threads need to take the read lock when producing a
regular trace even when there's no reader.
This patch takes a different approach. The drop counter is shifted left
by one so that the lowest bit is used to indicate that one thread is
already taking care of trying to dump the counter. Threads only read
this value normally, and will only try to change it if it's non-null,
in which case they'll first check if they are the first ones trying to
dump it, otherwise will simply count another drop and leave. This has
a large benefit. First, it will avoid the locking that causes stalls
as soon as a slow reader is present. Second, it avoids any write on the
fast path as long as there's no drop. And it remains very lightweight
since we just need to add +2 or subtract 2*dropped in operations, while
offering the guarantee that the sink_write() has succeeded before
unlocking the counter.
While a reader was previously limiting the traffic to 11k RPS under
4C/8T, now we reach 36k RPS vs 14k with no reader, so readers will no
longer slow the traffic down and will instead even speed it up due to
avoiding the contention down the chain in the ring. The locking cost
dropped from ~75% to ~60% now (it's in ring_write now).
Amaury reported that previous commit 08ac282375 ("MINOR: Add aes_gcm_enc
converter") broke the CI on OpenSSL 1.0.2 due to the define above not
existing there. Let's just map it to its older name when not existing.
For reference, these were renamed when switching to 1.1.0:
https://marc.info/?l=openssl-cvs&m=142244190907706&w=2
No backport is needed.
The previous patch fix the handling of in-order CRYPTO frames which
requires the usage of a new buffer for these data as their handling is
delayed to run under TASK_HEAVY.
In fact, as now all CRYPTO frames handling must be delayed, their
handling can be unify. This is the purpose of this commit, which removes
the just introduced new buffer. Now, all CRYPTO frames are buffered
inside the ncbuf. Unused elements such as crypto_frms member for
encryption level are also removed.
This commit is not a bugcfix but is a direct follow-up to the last one.
As such, it can probably be backported with it to 2.9 to reduce code
differences between these versions.
QUIC relies on SSL_do_hanshake() to be able to validate handshake. As
this function is computation heavy, it is since 2.9 called only under
TASK_HEAVY. This has been implemented by the following patch :
94d20be138
MEDIUM: quic: Heavy task mode during handshake
Instead of handling CRYPTO frames immediately during reception, this
patch delays the process to run under TASK_HEAVY tasklet. A frame copy
is stored in qel.rx.crypto_frms list. However, this frame still
reference the receive buffer. If the receive buffer is cleared before
the tasklet is rescheduled, it will point to garbage data, resulting in
haproxy decryption error. This happens if a fair amount of data is
received constantly to preempt the quic_conn tasklet execution.
This bug can be reproduced with a fair amount of clients. It is
exhibited by 'show quic full' which can report connections blocked on
handshake. Using the following commands result in h2load non able to
complete the last connections.
$ h2load --alpn-list h3 -t 8 -c 800 -m 10 -w 10 -n 8000 "https://127.0.0.1:20443/?s=10k"
Also, haproxy QUIC listener socket mode was active to trigger the issue.
This forces several connections to share the same reception buffer,
rendering the bug even more plausible to occur. It should be possible to
reproduce it with connection socket if increasing the clients amount.
To fix this bug, define a new buffer under quic_cstream. It is used
exclusively to copy CRYPTO data for in-order frame if ncbuf is empty.
This ensures data remains accessible even if receive buffer is cleared.
Note that this fix is only a temporary step. Indeed, a ncbuf is also
already used for out-of-order data. It should be possible to unify its
usage for both in and out-of-order data, rendering this new buffer
instance unnecessary. In this case, several unneeded elements will
become obsolete such as qel.rx.crypto_frms list. This will be done in a
future refactoring patch.
This must be backported up to 2.9.
In 2.7 with commit 35df34223b ("MINOR: buffers: split b_force_xfer() into
b_cpy() and b_force_xfer()"), b_ncat() was extracted from b_force_xfer()
but kept its source variable instead of constant, making it unusable for
calls from a const source. Let's just fix it.
Some include files, mostly types definitions, are missing a few includes
to define the types they're using, causing include ordering dependencies
between files, which are most often not seen due to the alphabetical
order of includes. Let's just fix them.
These were spotted by building pre-compiled headers for all these files
to .h.gch.
Extend "show quic" to be able to dump MUX related information. This is
done via the new function qcc_show_quic(). This replaces the old streams
dumping list which was incomplete.
These info are displayed on full output or by specifying "mux" field.
When a response was returned by HAProxy, a dedicated HTX flag was
set. Thanks to this flag, it was possible to add a "connection: close"
header to the response if the request was not fully received and to close
the connection. In the same way, when a redirect rule was applied,
keep-alive was forcefully disabled for unfinished requests.
All these mechanisms are now useless because the H1 mux is able to drain the
response. So HTX_FL_PROXY_RESP flag is removed and no special processing is
performed on HAProxy response when the request is unfinished.
unlike for H2 and H3, there is no mechanism in H1 to notify the client it
must stop to upload data when a response is replied before the end of the
request without closing the connection. There is no RST_STREAM frame
equivalent.
Thus, there is only two ways to deal with this situation: closing the
connection or draining the request. Until now, HAProxy didn't support
draining H1 messages. Closing the connection in this case has however a
major drawback. It leads to send a TCP reset, dropping this way all in-fly
data. There is no warranty the client has fully received the response.
Draining H1 messages was never implemented because in old versions it was a
bit tricky to implement. However, it is now far simplier to support this
feature because it is possible to have a H1 stream without any applicative
stream. It is the purpose of this patch. Now, when a shutdown is requested
and the stream is detached from the connection, if the request is unfinished
while the response was fully sent, the request in drained.
To do so, in this case the shutdown and the detach are delayed. From the
upper layer point of view, there is no changes. The endpoint is shut down
and detached as usual. But on H1 mux point of view, the H1 stream is still
alive and is being able to drain data. However the stream-endpoint
descriptor is orphan. Once the request is fully received (and drained), the
connection is shut down if it cannot be reused for a new transaction and the
H1 stream is destroyed.
Contrary to static servers, dynamic servers does not initialize their
settings from a default server instance. As such, _srv_parse_init() was
responsible to set a set of minimal values to have a correct behavior.
However, some settings were not properly initialized. This caused
dynamic servers to not behave as static ones without explicit
parameters.
Currently, the main issue detected is connection reuse which was
completely impossible. This is due to incorrect pool_purge_delay and
max_reuse settings incompatible with srv_add_to_idle_list().
To fix the connection reuse, but also more generally to ensure dynamic
servers are aligned with other server instances, define a new function
srv_settings_init(). This is used to set initial values for both default
servers and dynamic servers. For static servers, srv_settings_cpy() is
kept instead, using their default server as reference.
This patch could have unexpected effects on dynamic servers behavior as
it restored proper initial settings. Previously, they were set to 0 via
calloc() invocation from new_server().
This should be backported up to 2.6, after a brief period of
observation.
This patch reverts 2 fixes that were made in an attempt to fix the
ocsp-update feature used with the 'commit ssl cert' command.
The patches crash the worker when doing a soft-stop when the 'set ssl
ocsp-response' command was used, or during runtime if the ocsp-update
was used.
This was reported in issue #2462 and #2442.
The last patch reverted is the associated reg-test.
Revert "BUG/MEDIUM: ssl: Fix crash when calling "update ssl ocsp-response" when an update is ongoing"
This reverts commit 5e66bf26ec.
Revert "BUG/MEDIUM: ocsp: Separate refcount per instance and per store"
This reverts commit 04b77f84d1b52185fc64735d7d81137479d68b00.
Revert "REGTESTS: ssl: Add OCSP related tests"
This reverts commit acd1b85d3442fc58164bd0fb96e72f3d4b501d15.
The trailing NUL added at the end of istdup() by recent commit de0216758
("BUG/MINOR: ist: allocate nul byte on istdup") was placed outside of
the pointer validity test, rightfully showing null deref warnings. This
fix should be backported along with the fix above, to the same versions.
Due to the possibility of calling a control process after adding CRLs, the
ssl_commit_crlfile_cb variable was added. It is actually a pointer to the
callback function, which is called if defined after initial loading of CRL
data from disk and after committing CRL data via CLI command
'commit ssl crl-file ..'.
If the callback function returns an error, then the CLI commit operation
is terminated.
Also, one case was added to the CLI context used by "commit cafile" and
"commit crlfile": CACRL_ST_CRLCB in which the callback function is called.
Signed-off-by: William Lallemand <wlallemand@haproxy.com>
The issue was decribed in commit "BUG/MEDIUM: cli: Warn if pipelined commands
are delimited by a \n". In non-interactive mode, it was possible to use a
newline character as delimiter for pipelined commands. As a consequence, it was
possible to stop commands processing on the middle.
With the above commit, a warning is emitted to notify users. With this one,
we restore the expected behavior, as documented in the management guide.
Only the first line of commands is parsed. This commit will not be
backported to avoid breaking changes on stable versions.
This commit has of course some visible effects. All script using a newline
character as delimiter to pipeline commands in non-interactive mode will
stop working. Only the first command will be evaluated, all others will be
ignored. Pipelined commands MUST now be separated by a semi-colon.
But there is a more subtle and probably more annoying change. It is no
longer possible to pipeline commands with a payload ! A command with a
payload will always be the last one evaluated because it must be finished by
a newline (eventually preceeded by a custom pattern).
It is really annoying to introduce such breaking change. But, on the long
term, it is mandatory. The 2.8 will be the last LST version supporting the
old behavior (with some warning however). This will let 4 years to users to
adapt their scripts.
No backport needed.
This was broken since commit 0011c25144 ("BUG/MINOR: cli: avoid O(bufsize)
parsing cost on pipelined commands"). It is not really a bug fix but it is
labelled as is to make it more visible.
Before, a full line was first retrieved from the request buffer before
extracting the first command to eval it. Now, only one command is retrieved.
But we rely on the request buffer state to interrupt processing in
non-interactive mode. After a command processing, if output of the request
buffer is empty, we leave. Before the above commit, this was not a problem.
But since then, it is obviously a bad statement. First because some input
data may still be there. It is not true today, but it might change. Then,
there is no warranty to receive all commands in same time. For small list of
commands, it will be most of time the case, but it is a dangerous
assumption. For long list of commands, it is almost always false.
To be an issue, commands must be chunked exactly between two commands. But
in this case, remaining commands are skipped. A good way to reproduce the
issue is to wait a bit between two commands, for instance:
(printf "show info;"; sleep 2; printf "show stat\n") | socat ...
In fact, to properly fix the issue, we should exit on the first command
finished by a newline. Indeed, as stated in the documentation, in
non-interactive mode, a single line is processed. To pipeline commands,
commands must be separated by a semi-colon. Unfortunately, the above commit
introduced another change. It is possible to pipeline commands delimited by
a newline. It was pushed 2 years ago and backported to all stable versions.
Several scripts may rely on this behavior.
So, on stable version, the bug will not be fixed. However a warning will be
emitted to notify users their scripts don't respect the documentation and
they must adapt it. Mainly because the cli behavior on this point will be
changed in 3.0 to stick to the doc. This warning will only be emitted once
over the whole worker process life. Idea is to not flood the logs with the
same warning for every offending commands.
This commit should probably be backported to all stable versions. But with
some cautions because the CLI was often modified.
istdup() is documented as having the same behavior as strdup(). However,
it may cause confusion as it allocates a block of input length, without
an extra byte for \0 delimiter. This behavior is incoherent as in case
of an empty string however a single \0 is allocated.
This API inconsistency could cause a bug anywhere an IST is used as a
C-string after istdup() invocation. Currently, the only found issue is
with 'wait' CLI command using 'srv-unused'. This causes a buffer
overflow due to ist0() invocation after istdup() for be_name and
sv_name.
Backport should be done to all stable releases. Even if no bug has been
found outside of wait CLI implementation, it ensures the code is more
consistent on every releases.
Since 3d6350e10 ("MINOR: log: Remove log-error-via-logformat option"),
PR_O_ERR_LOGFMT flag is not used anymore, but it was left in the proxy-t.h
header file. Simply removing it and adding a comment to indicate that the
corresponding bit is now unused.
This patch is the direct followup of the previous one :
MINOR: quic: remove sendto() usage variant
This finalizes qc_snd_buf() simplification by removing send() syscall
usage for quic-conn owned socket. Syscall invocation is merged in a
single code location to the sendmsg() variant.
The only difference for owned socket is that destination address for
sendmsg() is set to NULL. This usage is documented in man 2 sendmsg as
valid for connected sockets. This allows maximum performance by avoiding
unnecessary lookups on kernel socket address tables.
As the previous patch, no functional change should happen here. However,
it will be simpler to extend qc_snd_buf() for GSO usage.
Add the ability to manually specify desired output type after a custom
field name for logformat nodes. Forcing the type can be useful to ensure
value is stored with the proper type representation. (i.e.: forcing
numerical to string to work around the limited resolution of JS number
types)
By default, type is set to SMP_T_SAME, which means the original type will
be preserved.
Currently supported types are: bool, str, sint
type_to_smp(type) does the reverse operation of smp_to_type[smp]: it takes
a type name as input string and tries to return the corresponding SMP_T_*
smp type or SMP_TYPES if not found.
Add the ability to specify custom name (will be used for representation
in verbose output types such as json) to logformat nodes.
For now, a custom name should be composed by characters [a-zA-Z0-9-_]*
The SE_FL_MAY_FASTFWD_CONS is added and it will be used by endpoints to
announce their support for the zero-copy forwarding on the consumer
side. The flag is not necessarily permanent. However, it will be used this
way for now.
To fix a bug, a flag to announce the capabitlity to support the zero-copy
forwarding on the consumer side will be added on the SE descriptor. So the
old flag SE_FL_MAY_FASTFWD is renamed to indicate it concerns the producer
side. It is now SE_FL_MAY_FASTFWD_PROD. And to prepare addition of the new
flag, the bitfield is a bit reordered.
To fix a bug, some SE flags must be added or renamed. To avoid mixing flags
set by the endpoint and flags set by the app, the second set of flags are
moved at the end of the bitfield, leaving the holes on the middle.
This revert of commit 0b93ff8c87 ("BUG/MEDIUM: stconn: Wake applets on
sending path if there is a pending shutdown") and 9e394d34e0 ("BUG/MINOR:
stconn: Don't report blocked sends during connection establishment") because
it was not the right fixes.
We must not wake an applet up when a shutdown is pending because it means
output some data are still blocked in the channel buffer. The applet does
not necessarily consume these data. In this case, the applet may be woken up
infinitly, except if it explicitly reports it wont consume datay yet.
This patch must be backported as far as 2.8. For older versions, as far as
2.2, it may be backported. If so, a previous fix must be pushed to prevent
an HTTP applet to be stuck. In http_ana.c, in http_end_request() and
http_end_reponse(), the call to channel_htx_truncate() on the request
channel in case of MSG_ERROR must be replace by a call to
channel_htx_erase().
When a READ or a WRITE activity is reported on a channel, the corresponding
date is updated. the last-read-activity date (lra) is updated and the
first-send-block date (fsb) is reset. The event is also reported at the
channel level by setting CF_READ_EVENT or CF_WRITE_EVENT flags. When one of
these flags is set, this prevent the update of the stream's task expiration
date from sc_notify(). It also prevents corresponding timeout to be reported
from process_stream().
But it is a problem during fast-forwarding stage if no expiration date was
set by the stream. Only process_stream() resets these flags. So a first READ
or WRITE event will prevent any stream's expiration date update till a new
call to process_stream(). But with no expiration date, this will only happen
on shutdown/abort event, blocking the stream for a while.
It is for instance possible to block the stats applet or the cli applet if a
client does not consume the response. The stream may be blocked, the client
timeout is not respected and the stream can only be closed on a client
abort.
So now, we update the stream's expiration date, regardless of reported
READ/WRITE events. It is not a big deal because lra and fsb date are
properly updated. It also means an old READ/WRITE event will no prevent the
stream to report a timeout and it is expected too.
This patch must be backported as far as 2.8. On older versions, timeouts and
stream's expiration date are not updated in the same way and this works as
expected.
In fact there is already flags on the SE to state a shutdown for reads or
writes was performed. But for applets, this notion does not exist. Both
flags are set in same time when the applet is released. But at the SC level,
there are functions to perform a shutdown (formely the shutw) and an abort
(formely the shutr). For applets, when a shutdown is performed on the SC, if
the applet is not immediately released, nothing is acknowledge at the SE
level.
With old way to implement applets, this was not an real issue until recently
because applets accessed to the channel/SC flags. It was thus possible to
catch the shutdowns. But the "wait" command on the CLI reveals the
flaw. Indeed, when this command is executed, nothing is read or sent. So, it
is not possible to detect the shutdowns. As a workaround, a dedicated test
on the SC flags was added at the end of the wait command I/O handler. But it
is pretty ugly.
With new way to implement applets, there is no longer access to the channel
or SC. So we must add a way to acknowledge shutdown into the SE.
This patch solves the both sides of the issue. The shutw notion is added for
applets. Its only purpose is to set SE_FL_SHWN flags. This flag is tested by
all applets, so, it solves the issue quite simply.
Note that it is described as a bug fix but there is no real issue, just a
design flaw. However, if the "wait" command is backported, this patch must
be backported too. Unfortinately it will require an adaptation because there
is no appctx flags on older versions.
This case does not exist yet with the H1 multiplexer, but applets may decide to
not produce data if there is not enough room in the destination buffer (the
applet's outbuf or the opposite SE buffer). It is true for the stats applets for
instance. However this case is not properly handled when the zero-copy
forwarding is in-use.
To fix the issue, the se_done_ff() function was modified to return the number of
bytes really forwarded and to subs for sends if nothing was forwarded while the
zero-copy forwarding was blocked by the producer. On the applet side, we take
care to block the zero-copy forwarding if the applet requests more room. At the
end, zero-copy forwarding is unblocked if something was forwarded.
This way, it is now possible for the stats applet to report a full buffer and
block the zero-copy forwarding, even if the buffer is not really full, by
requesting more room.
No backport needed.
An issue was introduced when zero-copy forwarding was added to the stats and
cache applets. There is no test to be sure the upper layer is ready to use
the zero-copy forwarding. So these applets refuse to deliver the response
into the applet's output buffer if the zero-copy forwarding is supported by
the opposite endpoint. It is especially an issue when a filter, like the
compression, is in-use on the response channel.
Because of this bug, the response is not delivered and the applet is woken
up in loop to produce data.
To fix the issue, an appctx flag was added, APPCTX_FL_FASTFWD, to know when
the zero-copy forwarding is in-use. We rely on this flag to not fill the
outbuf in the applet's I/O handler.
No backport needed.
A packet is considered as reordered when it is detected as lost because its packet
number is above the largest acknowledeged packet number by at least the
packet reordering threshold value.
Add ->nb_reordered_pkt new quic_loss struct member at the same location that
the number of lost packets to count such packets.
Should be backported to 2.6.
Let's say that the largest packet number acknowledged by the peer is #10, when inspecting
the non already acknowledged packets to detect if they are lost or not, this is the
case a least if the difference between this largest packet number and and their
packet numbers are bigger or equal to the packet reordering threshold as defined
by the RFC 9002. This latter must not be less than QUIC_LOSS_PACKET_THRESHOLD(3).
Which such a value, packets #7 and oldest are detected as lost if non acknowledged,
contrary to packet number #8 or #9.
So, the packet loss detection is very sensitive to such a network characteristic
where non acknowledged packets are distant from each others by their packet number
differences.
Do not use this static value anymore for the packet reordering threshold which is used
as a criteria to detect packet loss. In place, make it depend on the difference
between the number of the last transmitted packet and the number of the oldest
one among the packet which are still in flight before being inspected to be
deemed as lost.
Add new tune.quic.reorder-ratio setting to apply a ratio in percent to this
dynamic packet reorder threshold.
Should be backported to 2.6.
The CLI command "update ssl ocsp-response" was forcefully removing an
OCSP response from the update tree regardless of whether it used to be
in it beforehand or not. But since the main OCSP upate task works by
removing the entry being currently updated from the update tree and then
reinserting it when the update process is over, it meant that in the CLI
command code we were modifying a structure that was already being used.
These concurrent accesses were not properly locked on the "regular"
update case because it was assumed that once an entry was removed from
the update tree, the update task was the only one able to work on it.
Rather than locking the whole update process, an "updating" flag was
added to the certificate_ocsp in order to prevent the "update ssl
ocsp-response" command from trying to update a response already being
updated.
An easy way to reproduce this crash was to perform two "simultaneous"
calls to "update ssl ocsp-response" on the same certificate. It would
then crash on an eb64_delete call in the main ocsp update task function.
This patch can be backported up to 2.8.
The "wait" command now supports a condition, "srv-unused", which waits
for the designated server to become totally unused, indicating that it
is removable. Upon each wakeup it calls srv_check_for_deletion() to
verify if conditions are met, if not if it's recoverable, or if it's
not recoverable, and proceeds according to this, never waiting for a
final decision longer than the configured delay.
The purpose is to make it possible to remove servers from the CLI after
waiting for their sessions to be terminated:
$ socat -t5 /path/to/socket - <<< "
disable server px/srv1
shutdown sessions server px/srv1
wait 2s srv-unused px/srv1
del server px/srv1"
Or even wait for connections to terminate themselves:
$ socat -t70 /path/to/socket - <<< "
disable server px/srv1
wait 1m srv-unused px/srv1
del server px/srv1"
Conditions will need to have context, arguments etc from the command line.
Since these will vary with time (otherwise we wouldn't wait), let's just
pass them as text (possibly pre-processed). We're starting with 4 strings
that are expected to be allocated by strdup() and are always sent to free()
upon release.
Since we'll support waiting for an action to succeed or permanently
fail, we need the ability to return an unrecoverable failure. Let's
add CLI_WAIT_ERR_FAIL for this. A static error message may be placed
into ctx->msg to report to the user why the failure is unrecoverable.
We'll need to be able to verify whether or not a server may be deleted.
For now, both the verification and the action are performed in the same
function, at once under thread isolation. The goal here is to extract
the verification code into a new function that will perform these checks,
return a status between success/recoverable/non-recoverable failure, and
will also return a message for the caller.
This allows to insert delays between commands, i.e. to collect a same
set of metrics at a fixed interval. E.g:
$ socat -t20 /path/to/socket <<< "show activity; wait 10s; show activity"
The goal will be to extend the feature to optionally support waiting on
certain conditions. For this reason the struct definitions and enums were
placed into cli-t.h.
This adds a new pair of stored types in the stick-tables:
- glitch_cnt
- glitch_rate
These keep count of the number of glitches reported on a front connection,
in order to decide how to act with a badly defective client or a potential
attacker. For now nothing updates these counters, but all the infrastructure
needed to configure, update and retrieve them was added, including the doc.
No regtest was added yet since they're not filled yet.
With the current way OCSP responses are stored, a single OCSP response
is stored (in a certificate_ocsp structure) when it is loaded during a
certificate parsing, and each ckch_inst that references it increments
its refcount. The reference to the certificate_ocsp is actually kept in
the SSL_CTX linked to each ckch_inst, in an ex_data entry that gets
freed when he context is freed.
One of the downside of this implementation is that is every ckch_inst
referencing a certificate_ocsp gets detroyed, then the OCSP response is
removed from the system. So if we were to remove all crt-list lines
containing a given certificate (that has an OCSP response), the response
would be destroyed even if the certificate remains in the system (as an
unused certificate). In such a case, we would want the OCSP response not
to be "usable", since it is not used by any ckch_inst, but still remain
in the OCSP response tree so that if the certificate gets reused (via an
"add ssl crt-list" command for instance), its OCSP response is still
known as well. But we would also like such an entry not to be updated
automatically anymore once no instance uses it. An easy way to do it
could have been to keep a reference to the certificate_ocsp structure in
the ckch_store as well, on top of all the ones in the ckch_instances,
and to remove the ocsp response from the update tree once the refcount
falls to 1, but it would not work because of the way the ocsp response
tree keys are calculated. They are decorrelated from the ckch_store and
are the actual OCSP_CERTIDs, which is a combination of the issuer's name
hash and key hash, and the certificate's serial number. So two copies of
the same certificate but with different names would still point to the
same ocsp response tree entry.
The solution that answers to all the needs expressed aboved is actually
to have two reference counters in the certificate_ocsp structure, one
for the actual ckch instances and one for the ckch stores. If the
instance refcount becomes 0 then we remove the entry from the auto
update tree, and if the store reference becomes 0 we can then remove the
OCSP response from the tree. This would allow to chain some "del ssl
crt-list" and "add ssl crt-list" CLI commands without losing any
functionality.
Must be backported to 2.8.
The only useful information taken out of the ckch_store in order to copy
an OCSP certid into a buffer (later used as a key for entries in the
OCSP response tree) is the ocsp_certid field of the ckch_data structure.
We then don't need to pass a pointer to the full ckch_store to
ckch_store_build_certid or even any information related to the store
itself.
The ckch_store_build_certid is then converted into a helper function
that simply takes an OCSP_CERTID and converts it into a char buffer.
At the beginning of the 3.0-dev cycle, the zero-copy forwarding support was
added only for the cache applet with an option to disable it. This was a
hack, waiting for a better integration with applets. It is now possible to
implement the zero-copy forwarding for any applets. So the specific option
for the cache applet was renamed to be used for all applets. And this option
is now also checked for the stats applet.
Concretely, 'tune.cache.zero-copy-forwarding' was renamed to
'tune.applet.zero-copy-forwarding'.
Default .rcv_buf and .snd_buf functions that applets can use are now
specialized to manipulate raw buffers or HTX buffers.
Thus a TCP applet should use appctx_raw_rcv_buf() and appctx_raw_snd_buf()
while HTTP applet should use appctx_htx_rcv_buf() and appctx_htx_snd_buf().
Note that the appctx is now directly passed to these functions instead of
the SC.
Just like for the cache applet, it is now possible to send response to the
opposite side using the zero-copy forwarding. Internal functions were
slightly updated but there is nothing special to say. Except the requested
size during the nego stage is not exact.
It is now possible to use a flag during zero-copy forwarding negotiation to
specify the requested size is exact, it means the producer really expect to
receive at least this amount of data.
It can be used by consumer to prepare some processing at this stage, based
on the requested size. For instance, in the H1 mux, it is used to write the
next chunk size.
During zero-copy forwarding negotiation, a pseudo flag was already used to
notify the consummer if the producer is able to use kernel splicing or not. But
this was not extensible. So, now we use a true bitfield to be able to pass flags
during the negotiation. NEGO_FF_FL_* flags may be used now.
Of course, for now, there is only one flags, the kernel splicing support on
producer side (NEGO_FF_FL_MAY_SPLICE).
Thanks to this patch, it is possible to an applet to directly send data to
the opposite endpoint. To do so, it must implement <fastfwd> appctx callback
function and set SE_FL_MAY_FASTFWD flag.
Everything will be handled by appctx_fastfwd() function. The applet is only
responsible to transfer data. If it sets <to_forward> value, it is used to
limit the amount of data to forward.
This patch introduces the support for the callback function responsible to
produce data via the zero-copy forwarding mechanism. There is no
implementation for now. But <to_forward> field was added in the appctx
structure to let an applet inform how much data it want to forward. It is
not mandatory but it will be used during the zero-copy forwarding
negociation.
There is no shutdown for reads and send with applets. Both are performed
when the appctx is released. So instead of 2 flags, like for
muxes/connections, only one flag is used. But the idea is the same:
acknowledge the event at the applet level.
The appctx state was never really used as a state. It is only used to know
when an applet should be freed on the next wakeup. This can be converted to
a flag and the state can be removed. This is what this patch does.
Dedicated appctx flags to report EOI, EOS and errors (pending or terminal) were
added with the functions to set these flags. It is pretty similar to what it
done on most of muxes.
Till now, we've extended the appctx state to add some flags. However, the
field name is misleading. So a bitfield was added to handle real flags. And
helper functions to manipulate this bitfield were added.
A dedicated function to run applets was introduced, in addition to the old
one, to deal with applets that use their own buffers. The main differnce
here is that this handler does not use channels at all. It performs a
synchronous send before calling the applet and performs a synchronous
receive just after.
No applets are plugged on this handler for now.
There is no tasklet to handle I/O subscriptions for applets, but functions
to deal with receives and sends from the SC layer were added. it meanse a
function to retrieve data from an applet with this synchronous version and a
function to push data to an applet wit this synchronous version.
It is pretty similar to the functions used for muxes but there are some
differences. So for now, we keep them separated.
Zero-copy forwarding is not supported for now. In addition, there is no
subscription mechanism.
In this patch, we add default functions to copy data from a channel to the
<inbuf> buffer of an applet (appctx_rcv_buf) and another on to copy data
from <outbuf> buffer of an applet to a channel (appctx_snd_buf).
These functions are not used for now, but they will be used by applets to
define their <rcv_buf> and <snd_buf> callback functions. Of course, it will
be possible for a specific applet to implement its own functions but these
ones should be good enough for most of applets. HTX and RAW buffers are
supported.
For now, it is not usable, but this patch introduce the support of callback
functions, in the applet structure, to exchange data between channels and
applets. It is pretty similar to callback functions defined by muxes.
It is the first patch of a series aimed to align applets on connections.
Here, dedicated buffers are added for applets. For now, buffers are
initialized and helpers function to deal with allocation are added. In
addition, flags to report allocation failures or full buffers are also
introduced. <inbuf> will be used to push data to the applet from the stream
and <outbuf> will be used to push data from the applet to the stream.
IS_HXT_SC() macro is only usable if the stream-connector is attached to a
connection. It is a bit restrictive because this cannot work if the SC is
attached to an applet. So let's fix that be adding the support of applets
too.
wait_event structure was in connection header file because it is only used
by connections and muxes. But, this may change. For instance applets may be
good candidates to use it too. So, the structure is moved to the task header
file instead.
This commit adds support for an optional second argument to BUG_ON(),
WARN_ON(), CHECK_IF(), that can be a constant string. When such an
argument is given, it will be printed on a second line after the
existing first message that contains the condition.
This can be used to provide more human-readable explanations about
what happened, such as "too low on memory" or "memory corruption
detected" that may help a user resolve the incident by themselves.
The ABORT_NOW() macro is not much used since we have BUG_ON(), but
there are situations where it makes sense, typically if the program
must always die regardless od DEBUG_STRICT, or if the condition must
always be evaluated (e.g. decompress something and check it).
It's not convenient not to have any hint about what happened there. But
providing too much info also results in wiping some registers, making
the trace less exploitable, so a compromise must be found.
What this patch does is to provide the support for an optional argument
to ABORT_NOW(). When an argument is passed (a string), then a message
will be emitted with the file name, line number, the message and a
trailing LF, before the stack dump and the crash. It should be used
reasonably, for example in functions that have multiple calls that need
to be more easily distinguished.
As seen in previous commit 59acb27001 ("BUILD: quic: Variable name typo
inside a BUG_ON()."), it can sometimes happen that with DEBUG forced
without DEBUG_STRICT, BUG_ON() statements are ignored. Sadly, it means
that typos there are not even build-tested.
This patch makes these statements reference sizeof(cond) to make sure
the condition is parsed. This doesn't result in any code being emitted,
but makes sure the expression is correct so that an issue such as the one
above will fail to build (which was verified).
This may be backported as it can help spot failed backports as well.
Since d480b7b ("MINOR: debug: make ABORT_NOW() store the caller's line
number when using abort"), building with 'DEBUG_USE_ABORT' fails with:
|In file included from include/haproxy/api.h:35,
| from include/haproxy/activity.h:26,
| from src/ev_poll.c:20:
|include/haproxy/thread.h: In function ‘ha_set_thread’:
|include/haproxy/bug.h:107:47: error: expected ‘;’ before ‘_with_line’
| 107 | #define ABORT_NOW() do { DUMP_TRACE(); abort()_with_line(__LINE__); } while (0)
| | ^~~~~~~~~~
|include/haproxy/bug.h:129:25: note: in expansion of macro ‘ABORT_NOW’
| 129 | ABORT_NOW(); \
| | ^~~~~~~~~
|include/haproxy/bug.h:123:9: note: in expansion of macro ‘__BUG_ON’
| 123 | __BUG_ON(cond, file, line, crash, pfx, sfx)
| | ^~~~~~~~
|include/haproxy/bug.h:174:30: note: in expansion of macro ‘_BUG_ON’
| 174 | # define BUG_ON(cond) _BUG_ON (cond, __FILE__, __LINE__, 3, "FATAL: bug ", "")
| | ^~~~~~~
|include/haproxy/thread.h:201:17: note: in expansion of macro ‘BUG_ON’
| 201 | BUG_ON(!thr->ltid_bit);
| | ^~~~~~
|compilation terminated due to -Wfatal-errors.
|make: *** [Makefile:1006: src/ev_poll.o] Error 1
This is because of a leftover: abort()_with_line(__LINE__);
^^
Fixing it by removing the extra parentheses after 'abort' since the
abort() call is now performed under abort_with_line() helper function.
This was raised by Ilya in GH #2440.
No backport is needed, unless the above commit gets backported.
Placing DO_NOT_FOLD() before abort() only works in -O2 but not in -Os which
continues to place only 5 calls to abort() in h3.o for call places. The
approach taken here is to replace abort() with a new function that wraps
it and stores the line number in the stack. This slightly increases the
code size (+0.1%) but when unwinding a crash, the line number remains
present now. This is a very low cost, especially if we consider that
DEBUG_USE_ABORT is almost only used by code coverage tools and occasional
debugging sessions.
As indicated in previous commit, we don't want calls to ha_crash_now()
to be merged, since it will make gdb return a wrong line number. This
was found to happen with gcc 4.7 and 4.8 in h3.c where 26 calls end up
as only 5 to 18 "ud2" instructions depending on optimizations. By
calling DO_NOT_FOLD() just before provoking the trap, we can reliably
avoid this folding problem. Note that this does not address the case
where abort() is used instead (DEBUG_USE_ABORT).
Modern compilers sometimes perform function tail merging and identical
code folding, which consist in merging identical occurrences of same
code paths, generally final ones (e.g. before a return, a jump or an
unreachable statement). In the case of ABORT_NOW(), it can happen that
the compiler merges all of them into a single one in a function,
defeating the purpose of the check which initially was to figure where
the bug occurred.
Here we're creating a DO_NO_FOLD() macro which makes use of the line
number and passes it as an integer argument to an empty asm() statement.
The effect is a code position dependency which prevents the compiler
from merging the code till that point (though it may still merge the
following code). In practice it's efficient at stopping the compilers
from merging calls to ha_crash_now(), which was the initial purpose.
It may also be used to force certain optimization constructs since it
gives more control to the developer.
It is now possible to selectively retrieve extra counters from stats
modules. H1, H2, QUIC and H3 fill_stats() callback functions are updated to
return a specific counter.
The list of modules registered on the stats to expose extra counters is now
public. It is required to export these counters into the Prometheus
exporter.
set-bc-{mark,tos} actions are pretty similar to set-fc-{mark,tos} to set
mark/tos on packets sent from haproxy to server: set-bc-{mark,tos} actions
act on the whole backend/srv connection: from connect() to connection
teardown, thus they may only be used before the connection to the server
is instantiated, meaning that they are only relevant for request-oriented
rules such as tcp-request or http-request rules. For now their use is
limited to content request rules, because tos and mark informations are
stored directly within the stream, thus it is required that the stream
already exists.
stream flags are used in combination with dedicated stream struct members
variables to pass 'tos' and 'mark' informations so that they are correctly
considered during stream connection assignment logic (prior to connecting
to actually connecting to the server)
'tos' and 'mark' fd sockopts are taken into account in conn hash
parameters for connection reuse mechanism.
The documentation was updated accordingly.
In this patch we add the possibility to use sample expression as argument
for set-fc-{mark,tos} actions. To make it backward compatible with
previous behavior, during parsing we first try to parse the value as
as integer (decimal or hex notation), and then fallback to expr parsing
in case of failure.
The documentation was updated accordingly.
Some CPU time is needlessly wasted in conn_calculate_hash(), because all
params are first copied into a temporary buffer before computing the
hash on the whole buffer. Instead, let's leverage the XXH progressive
hash update functions to avoid expensive memcpys.
0x00000008 bit for CO_FL_* flags is no more unused since 8cc3fc73f1
("MINOR: connection: update rhttp flags usage"). Removing the comment
that says otherwise.
A major reorganization of QUIC MUX sending has been implemented. Now
data transfer occur over a single QCS buffer. This has improve
performance but at the cost of restrictions on snd_buf. Indeed, buffer
instances are now shared from stream callback snd_buf up to quic-conn
layer.
As such, snd_buf cannot manipulate freely already present data buffer.
In particular, realign has been completely removed by the previous
patches.
This commit reintroduces a partial realign support. This is only done if
the buffer contains only unsent data, via a new MUX function
qcc_realign_stream_txbuf() which is called during snd_buf.
This commit is a direct follow-up on the major rearchitecture of send
buffering. This patch implements the proper handling of connection pool
buffer temporary exhaustion.
The first step is to be able to differentiate a fatal allocation error
from a temporary pool exhaustion. This is done via a new output argument
on qcc_get_stream_txbuf(). For a fatal error, application protocol layer
will schedule the immediate connection closing. For a pool exhaustion,
QCC is flagged with QC_CF_CONN_FULL and stream sending process is
interrupted. QCS instance is also registered in a new list
<qcc.buf_wait_list>.
A new connection buffer can become available when all ACKs are received
for an older buffer. This process is taken in charge by quic-conn layer.
It uses qcc_notify_buf() function to clear QC_CF_CONN_FULL and to wake
up every streams registered on buf_wait_list to resume sending process.
This commit is a direct follow-up on the major rearchitecture of send
buffering. It allows application protocol to react if current QCS
sending buffer space is too small. In this case, the buffer can be
released to the quic-conn layer. This allows to allocate a new QCS
buffer and retry HTX parsing, unless connection buffer pool is already
depleted.
A new function qcc_release_stream_txbuf() serves as API for app protocol
to release the QCS sending buffer. This operation fails if there is
unsent data in it. In this case, MUX has to keep it to finalize transfer
of unsent data to quic-conn layer. QCS is thus flagged with
QC_SF_BLK_MROOM to interrupt snd_buf operation.
When all data are sent to the quic-conn layer, QC_SF_BLK_MROOM is
cleared via qcc_streams_sent_done() and stream layer is woken up to
restart snd_buf.
Note that a new function qcc_stream_can_send() has been defined. It
allows app proto to check if sending is currently blocked for the
current QCS. For now, it checks QC_SF_BLK_MROOM flag. However, it will
be extended to other conditions with the following patches.
The previous commit was a major rework for QUIC MUX sending process.
Following this, this patch cleans up a few elements that remains but can
be removed as they are duplicated.
Of notable changes, offset fields from QCS and QCC are removed. They are
both equivalent to flow control soft offsets.
A new function qcs_prep_bytes() is implemented. Its purpose is to return
the count of prepared data bytes not yet sent. It also replaces
qcs_need_sending().
Previously, QUIC MUX sending was implemented with data transfered along
two different buffer instances per stream.
The first QCS buffer was used for HTX blocks conversion into H3 (or
other application protocol) during snd_buf stream callback. QCS instance
is then registered for sending via qcc_io_cb().
For each sending QCS, data memcpy is performed from the first to a
secondary buffer. A STREAM frame is produced for each QCS based on the
content of their secondary buffer.
This model is useful for QUIC MUX which has a major difference with
other muxes : data must be preserved longer, even after sent to the
lower layer. Data references is shared with quic-conn layer which
implements retransmission and data deletion on ACK reception.
This double buffering stages was the first model implemented and remains
active until today. One of its major drawbacks is that it requires
memcpy invocation for every data transferred between the two buffers.
Another important drawback is that the first buffer was is allocated by
each QCS individually without restriction. On the other hand, secondary
buffers are accounted for the connection. A bottleneck can appear if
secondary buffer pool is exhausted, causing unnecessary haproxy
buffering.
The purpose of this commit is to completely break this model. The first
buffer instance is removed. Now, application protocols will directly
allocate buffer from qc_stream_desc layer. This removes completely the
memcpy invocation.
This commit has a lot of code modifications. The most obvious one is the
removal of <qcs.tx.buf> field. Now, qcc_get_stream_txbuf() returns a
buffer instance from qc_stream_desc layer. qcs_xfer_data() which was
responsible for the memcpy between the two buffers is also completely
removed. Offset fields of QCS and QCC are now incremented directly by
qcc_send_stream(). These values are used as boundary with flow control
real offset to delimit the STREAM frames built.
As this change has a big impact on the code, this commit is only the
first part to fully support single buffer emission. For the moment, some
limitations are reintroduced and will be fixed in the next patches :
* on snd_buf if QCS sent buffer in used has room but not enough for the
application protocol to store its content
* on snd_buf if QCS sent buffer is NULL and allocation cannot succeeds
due to connection pool exhaustion
One final important aspect is that extra care is necessary now in
snd_buf callback. The same buffer instance is referenced by both the
stream and quic-conn layer. As such, some operation such as realign
cannot be done anymore freely.
Both QCS and QCC have their owned sent offset field. These fields store
the newest offset sent to the quic-conn layer. It is similar to QCS/QCC
flow control real offset. This patch removes them and replaces them by
the latter for code clarification.
MINOR: mux-quic: remove unneeded qcc.tx.sent_offsets field
This commit as a similar purpose as previous, except that it removes QCC
<sent_offsets> field, now equivalent to connection flow control real
offset.
This commit is a direct follow-up on the previous one. This time, it
deals with connection level flow control. Process is similar to stream
level : soft offset is incremented during snd_buf and real offset during
STREAM frame emission.
On MAX_DATA reception, both stream layer and QMUX is woken up if
necessary. One extra feature for conn level is the introduction of a new
QCC list to reference QCS instances. It will store instances for which
snd_buf callback has been interrupted on QCC soft offset reached. Every
stream instances is woken up on MAX_DATA reception if soft_offset is
unblocked.
This patch is the first of two to reimplement flow control emission
limits check. The objective is to account flow control earlier during
snd_buf stream callback. This should smooth transfers and prevent over
buffering on haproxy side if flow control limit is reached.
The current patch deals with stream level flow control. It reuses the
newly defined flow control type. Soft offset is incremented after HTX to
data conversion. If limit is reached, snd_buf is interrupted and stream
layer will subscribe on QCS.
On qcc_io_cb(), generation of STREAM frames is restricted as previously
to ensure to never surpass peer limits. Finally, flow control real
offset is incremented on lower layer send notification. Thus, it will
serve as a base offset for built STREAM frames. If limit is reached,
STREAM frames generation is suspended.
Each time QCS data flow control limit is reached, soft and real offsets
are reconsidered.
Finally, special care is used when flow control limit is incremented via
MAX_STREAM_DATA reception. If soft value is unblocked, stream layer
snd_buf is woken up. If real value is unblocked, qcc_io_cb() is
rescheduled.
Create a new module dedicated to flow control handling. It will be used
to implement earlier flow control update on snd_buf stream callback.
For the moment, only Tx part is implemented (i.e. limit set by the peer
that haproxy must respect for sending). A type quic_fctl is defined to
count emitted data bytes. Two offsets are used : a real one and a soft
one. The difference is that soft offset can be incremented beyond limit
unless it is already in excess.
Soft offset will be used for HTX to H3 parsing. As size of generated H3
is unknown before parsing, it allows to surpass the limit one time. Real
offset will be used during STREAM frame generation : this time the limit
must not be exceeded to prevent protocol violation.
Add a new argument to qcc_send_stream() to specify the count of sent
bytes.
For the moment this argument is unused. This commit is in fact a step to
implement earlier flow control update during stream layer snd_buf.
Previous patches have reorganize define definitions for SSL 0RTT
support. However a typo was introduced. This caused haproxy to disable
0RTT support announcement and report of an erroneous warning for no
support on the SSL library side when using quictls/openssl compat layer.
This was detected by using ngtcp2-client. No 0RTT packet were emitted by
the client due to haproxy missing support advertisement.
The faulty commit is the following one :
commit 5c45199347
MEDIUM: ssl/quic: always compile the ssl_conf.early_data test
This must be backported wherever the above patch is.
Add the HAVE_SSL_0RTT constant which define if the SSL library supports
0RTT. Which is different from HA_OPENSSL_HAVE_0RTT_SUPPORT which was
used only in the context of QUIC
It is similar to the previous fix but for the chunk size parsing. But this
one is more annoying because a poorly coded application in front of haproxy
may ignore the last digit before the LF thinking it should be a CR. In this
case it may be out of sync with HAProxy and that could be exploited to
perform some sort or request smuggling attack.
While it seems unlikely, it is safer to forbid LF with CR at the end of a
chunk size.
This patch must be backported to 2.9 and probably to all stable versions
because there is no reason to still support LF without CR in this case.
When the message is chunked, all chunks must ends with a CRLF. However, on
old versions, to support bad client or server implementations, the LF only
was also accepted. Nowadays, it seems useless and can even be considered as
an issue. Just forbid LF only at the end of chunks, it seems reasonnable.
This patch must be backported to 2.9 and probably to all stable versions
because there is no reason to still support LF without CR in this case.
In several places in the source, there was the same block of code that was
used to deinitialize the log buffer. There were even two functions that
did this, but they were called only from the code that is in the same
source file (free_tcpcheck_fmt() in src/tcpcheck.c and free_logformat_list()
in src/proxy.c - they were both static functions).
The function free_logformat_list() was moved from the file src/proxy.c to
src/log.c, and a check of the list before freeing the memory was added to
that function.
Device Atlas' dummy lib will use a C++ file when built with cache
support, so for completeness we'll have to pretty-print it as well.
Let's define cmd_CXX.
QCS instances use qc_stream_desc for data buffering on emission. On
stream reset, its Tx channel is closed earlier than expected. This may
leave unsent data into qc_stream_desc.
Before this patch, these unsent data would remain after QCS freeing.
This prevents the buffer to be released as no ACK reception will remove
them. The buffer is only freed when the whole connection is closed. As
qc_stream_desc buffer is limited per connection, this reduces the buffer
pool for other streams of the same connection. In the worst case if
several streams are resetted, this may completely freeze the transfer of
the remaining connection streams.
This bug was reproduced by reducing the connection buffer pool to a
single buffer instance by using the following global statement :
tune.quic.frontend.conn-tx-buffers.limit 1.
Then a QUIC client is used which opens a stream for a large enough
object to ensure data are buffered. The client them emits a STOP_SENDING
before reading all data, which forces the corresponding QCS instance to
be resetted. The client then opens a new request but the transfer is
freezed due to this bug.
To fix this, adjust qc_stream_desc API. Add a new argument <final_size>
on qc_stream_desc_release() function. Its value is compared to the
currently buffered offset in latest qc_stream_desc buffer. If
<final_size> is inferior, it means unsent data are present in the
buffer. As such, qc_stream_desc_release() removes them to ensure the
buffer will finally be freed when all ACKs are received. It is also
possible that no data remains immediately, indicating that ACK were
already received. As such, buffer instance is immediately removed by
qc_stream_buf_free().
This must be backported up to 2.6. As this code section is known to
regression, a period of observation could be reserved before
distributing it on LTS releases.
This previous commit was not sufficient to completely fix the building issue
in relation with the TLS stack 0-RTT support. LibreSSL was the last TLS
stack to refuse to compile because of undefined a QUIC specific function
for 0-RTT: SSL_set_quic_early_data_enabled().
To get rid of such compilation issues, define HA_OPENSSL_HAVE_0RTT_SUPPORT
only when building against TLS stack with 0-RTT support.
No need to backport.
The initial purpose of CSV stats through CLI was to make it easely
parsable by scripts. But in some specific cases some error or warning
messages strings containing LF were dumped into cells of this CSV.
This made some parsing failure on several tools. In addition, if a
warning or message contains to successive LF, they will be dumped
directly but double LFs tag the end of the response on CLI and the
client may consider a truncated response.
This patch extends the 'csv_enc_append' and 'csv_enc' functions used
to format quoted string content according to RFC with an additionnal
parameter to convert multi-lines strings to one line: CRs are skipped,
and LFs are replaced with spaces. In addition and optionally, it is
also possible to remove resulting trailing spaces.
The call of this function to fill strings into stat's CSV output is
updated to force this conversion.
This patch should be backported on all supported branches (issue was
already present in v2.0)
The 'generate-certificates' option does not need its dedicated SSL_CTX
*, it only needs the default SSL_CTX.
Use the default SSL_CTX found in the sni_ctx to generate certificates.
It allows to remove all the specific default_ctx initialization, as
well as the default_ssl_conf and 'default_inst'.
This patch follows the previous one about default certificate selection
("MEDIUM: ssl: allow multiple fallback certificate to allow ECDSA/RSA
selection").
This patch generates '*" SNI filters for the first certificate of a
bind line, it will be used to match default certificates. Instead of
setting the default_ctx pointer in the bind line.
Since the filters are in the SNI tree, it allows to have multiple
default certificate and restore the ecdsa/rsa selection with a
multi-cert bundle.
This configuration:
# foobar.pem.ecdsa and foobar.pem.rsa
bind *:8443 ssl crt foobar.pem crt next.pem
will use "foobar.pem.ecdsa" and "foobar.pem.rsa" as default
certificates.
Note: there is still cleanup needed around default_ctx.
This was discussed in github issue #2392.
At the moment, http_err_cnt and http_fail_cnt are incremented on a
well-defined set of status codes, which are checked at various places.
Over time, there have been some complains about 404, 401 or 407
triggering errors, or 500 triggering failures in SOAP environments
for example. With a small bit field that fits in a cache line we
can match the presence of a status code from 100 to 599, so that
remains cheap.
This patch adds two such bit fields, one per code class, and the
accompanying functions to set/clear/test the codes. The arrays are
preset at boot time. For now they are not used and it's not possible
to adjust them.
This function is defined in the RX part (quic_rx.c) and declared in quic_rx.h
header. This is its correct place.
Remove the useless declaration of this function in quic_conn.h.
Should be backported in 2.9 where this double declaration was introduced when
moving quic_dgram_parse() from quic_conn.c to quic_rx.c.
It used to return ssize_t for -1 but in fact we're using this -1 as
the largest possible value and the result is generally cast to signed
to check if the end was reached, so better make it clearly return an
unsigned value here.
This is cbtree commit e1e58a2b2ced2560d4544abaefde595273089704.
This is ebtree commit d7531a7475f8ba8e592342ef1240df3330d0ab47.
There's no reason to return signed values there. And it turns out that
the compiler manages to improve the performance by ~2%.
This is cbtree commit ab3fd53b8d6bbe15c196dfb4f47d552c3441d602.
This is ebtree commit 0ebb1d7411d947de55fa5913d3ab17d089ea865c.
With flsnz() instead of flsnz_long() we're now getting a better
performance on both x86 and ARM. The difference is that previously
we were relying on a function that was forcing the use of register
%eax for the 8-bit version and that was preventing the compiler
from keeping the code optimized. The gain is roughly 5% on ARM and
1% on x86.
This is cbtree commit 19cf39b2514bea79fed94d85e421e293be097a0e.
This is ebtree commit a9aaf2d94e2c92fa37aa3152c2ad8220a9533ead.
The definitions were a bit of a mess and there wasn't even a fall back to
__builtin_clz() on compilers supporting it. Now we instead define a macro
for each implementation that is set on an arch-dependent case by case,
and add the fall back ones only when not defined. This also allows the
flsnz8() to automatically fall back to the 32-bit arch-specific version
if available. This shows a consistent 33% speedup on arm for strings.
This is cbtree commit c6075742e8d0a6924e7183d44bd93dec20ca8049.
This is ebtree commit f452d0f83eca72f6c3484ccb138d341ed6fd27ed.
Let's use these in order to avoid 32-64 bit casts on 64 bit platforms.
This is cbtree commit e4f4c10fcb5719b626a1ed4f8e4e94d175468c34.
This is ebtree commit cc10507385c784d9a9e74ea9595493317d3da99e.
The asm code shows multiple conversions. Gcc has always been terribly
bad at dealing with chars, which are constantly converted to ints for
every operation and zero-extended after each operation. But here in
addition there are conversions before and after the flsnz(). Let's
just mark the variables as long and use flsnz_long() to process them
without any conversion. This shortens the code and makes it slightly
faster.
Note that the fls operations could make use of __builtin_clz() on
gcc 4.6 and above, and it would be useful to implement native support
for ARM as well.
This is cbtree commit 1f0f83ba26f2279c8bba0080a2e09a803dddde47.
This is ebtree commit 9c38dcae22a84f0b0d9c5a56facce1ca2ad0aaef.
During the zero-copy forwarding, if the consumer side reports it is blocked,
it means it is blocked on send. At the stream-connector level, the event
must be reported to be sure to set/update the fsb date. Otherwise, write
timeouts cannot be properly reported. If this happens when no other timeout
is armed, this freezes the stream.
This patch must be backported to 2.9.
Such "#ifdef USE_QUIC" prepocessor statements are used by QUIC C header
to avoid inclusion of QUIC headers when the QUIC support is not enabled
(by USE_QUIC make variable). Furthermore, this allows inclusions of QUIC
header from C file without having to protect them with others "#ifdef USE_QUIC"
statements as follows:
#ifdef USE_QUIC
#include <a QUIC header>
#include <another one QUIC header>
#endif /* USE_QUIC */
So, here if this quic_ssl.h header was included by a C file, and compiled without
QUIC support, this will lead to build errrors as follows:
In file included from <a C file...>:
include/haproxy/quic_ssl.h:35:35: warning: ‘enum ssl_encryption_level_t’
declared inside parameter list will not be visible outside of this
definition or declaration
Should be backported to 2.9 to avoid such building issues to come.
Remove some QUIC definitions of members from server structure as the haproxy QUIC
stack does not support at all the server part (QUIC client) as this time.
Remove the statements in relation with their initializations.
This patch should be backported as far as 2.6 to save memory.
The new function hap_get_next_build_opt() will iterate over the list of
build options. This will be used for debugging, so that the build options
can be retrieved from the CLI.
A new flag RX_F_PASS_PKTINFO is now available, whose purpose is to mark
that the destination address is about to be retrieved on some listeners.
The address can be retrieved from the first received datagram, and
relies on the IP_PKTINFO, IP_RECVDSTADDR and IPV6_RECVPKTINFO support.
RSLV_UPD_CNAME and RSLV_UPD_NAME_ERROR flags have now become useless since
3cf7f987 ("MINOR: dns: proper domain name validation when receiving DNS
response") as they are never set, but we forgot to remove them.
RSLV_UPD_OBSOLETE_IP was introduced with commit a8c6db8d2 ("MINOR: dns:
Cache previous DNS answers.") but the commit didn't make any use of it,
and today the flag is still unused. Since we have no valid use for it,
better remove it to prevent confusions.
This bug impacts only the QUIC OpenSSL compatibility module (USE_QUIC_OPENSSL_COMPAT).
The TLS capture of information from client hello enabled by
tune.ssl.capture-buffer-size could not work with USE_QUIC_OPENSSL_COMPAT. This
is due to the fact the callback set for this feature was replaced by
quic_tls_compat_msg_callback(). In fact this called must be registered by
ssl_sock_register_msg_callback() as this done for the TLS client hello capture.
A call to this function appends the function passed as parameter to a list of
callbacks to be called when the TLS stack parse a TLS message.
quic_tls_compat_msg_callback() had to be modified to return if it is called
for a non-QUIC TLS session.
Must be backported to 2.8.
Previously, if snd_buf operation was conducted despite QCS already
locally closed, the input buffer was silently dropped. This situation
could happen if a RESET_STREAM was emitted butemission not reported to
the stream layer. Resetting silently the buffer ensure QUIC MUX remain
compliant with RFC 9000 which forbid emission after RESET_STREAM.
Since previous commit, it is now ensured that RESET_STREAM sending will
always be reported to stream-layer. Thus, there is no need anymore to
silently reset the buffer. A BUG_ON() statement is added to ensure this
assumption will remain valid.
The new code is deemed cleaner as it does not hide a missing error
notification on the stconn-layer. Previously, if an error was missing,
sending would continue unnecessarily with a false success status
reported for the stream.
Note that the BUG_ON() statement was also added into nego_ff callback.
This is necessary to ensure both sending path remains consistent.
This patch is labelled as MEDIUM as issues were already encountered in
snd_buf/nego_ff implementation and it's not easy to cover all occurences
during test. If the BUG_ON() is triggered without any apparent
stream-layer issue, this commit should be reverted.
Similarly to the previous commit, we get rid of unused peer member.
peer->addr was only used to save a copy of the sever's addr at parsing
time. But instead of relying on an intermediate variable, we can actually
use server's address directly when initiating the peer session.
As with other streams created from server's settings (tcp/http, log, ring),
we should rely on srv->svc_port for the port part of the address. This
shouldn't change anything for peers since the address is fully resolved
at parsing time and runtime changes are not supported, but this should
help to make the code future-proof.
peer->proto and peer->xprt struct members are now pure legacy: they are
only set during parsing but never used afterwards.
This is due to commit 02efedac ("MINOR: peers: now remove the remote
connection setup code") which made some cleanup in the past, but the
unused proto and xprt members were probably left unused by mistake.
Since we don't have valid uses for them, we remove them.
Also, peer_xprt() helper function was removed since it was related to
peer->xprt struct member.
Historically, we used the internal peer proxy as stream target, because
then we only cared about initiating a basic tcp connection with the
endpoint, and relying on parent proxy settings was enough.
But later, we introduced the possibility to connect to an SSL peer by
taking server's SSL parameters into acount. This was done in commit
1055e687 ("MINOR: peers: Make outgoing connection to SSL/TLS peers work.")
However, the above commit introduced an ambiguity:
peer_session_target() function was introduced, and the function will
either return the peers proxy's object or the current server's object
depending if ssl is configured or not.
While this works fine to ensure proper SSL handling while being
conservative with historical behavior, this cause other server transport
related settings to only work when ssl settings are provided, which is
quite debatable.
Indeed, while we're there, why not always using the server's object as
a stream target, to ensure all transport related options are properly
handled? Moreover, the peers documentation tells this:
... "support for all "server" parameters found in 5.2 paragraph that
are related to transport settings" ...
To remove the ambiguity and fully comply with the documentation, we make
peer_session_target() always return the server's object.
snr_update_srv_status() and srvrq_update_srv_status() will both set or
clear the server RMAINT state depending of the result of the current dns
resolution.
This used to work pretty well in the past, but now that addr:svc_port
changes are changed atomically through a dedicated task, the change is
performed asynchronously, so this can cause some flapping issues if the
server is put out of maintenance while the server's address is still
unassigned.
To prevent errors, the resolver's code is now only allowed to put the
server under maintenance but not to remove it from maintenance:
the decision to remove a server from maintenance is performed by the task
responsible for updating the server's addr: if the addr resolves again
thanks to a valid DNS resolution and the server was previously under
RMAINT, then it cleared from RMAINT state.
srvrq_update_srv_status() was renamed srvrq_set_srv_down(), since it is
only called to put the server in maintenance as a result of a failing
SRV entry.
snr_update_srv_status() was renamed srv_set_srv_down() and slightly
modified so that it only takes care of putting the server under
maintenance when needed.
The cli command "set server x/y addr" does not need to remove the RMAINT
flag anymore.
server_set_inetaddr() updater argument is a simple char * string
containing infos about the caller responsible for the update.
In this patch, we try to make this argument serializable, that is, make
it so that we can easily export it without having to keep the original
pointer passed by the caller or having to work with strings of variable
lengths.
This was a prerequisite for exposing more updater information through
SERVER_INETADDR event (upcoming patch).
Static strings were simply mapped to a fixed ID that can be converted back
to a string when needed using server_inetaddr_updater_by_to_str(). One
special case one made for the SERVER_INETADDR_UPDATER_DNS_RESOLVER updater
since in this case the updater hint has to be generated from the
corresponding resolver id / nameserver id combination. This was achieved
by saving the nameserver id within the updater struct. Knowing that the
resolver id can be guessed from the server struct directly, it was not
exposed through the updater struct.
This patch depends on:
- "MINOR: resolvers: add unique numeric id to nameservers"
No functional change should be expected.
When we want to avoid keeping pointers on a nameserver struct, it's not
always convenient to refer as a nameserver using it's text-based unique
identifier since it's not limited in length thus it cannot be serialized
and deserialized safely.
To address this limitation, we add a new ->puid member in dns_nameserver
struct which is a parent-unique numeric value that can be used to refer
to the dns nameserver within its parent resolver context.
To achieve this, we reused the resolver->nb_nameserver member that wasn't
used. Each time we add a new nameserver to a resolver: we set ns->puid to
the current number of nameservers within the resolver and we increment
this number right away.
Public helper function find_nameserver_by_resolvers_and_id() was added to
help retrieve nameserver pointer from (resolver X nameserver puid)
combination.
dns_dgram_init() function prototype was found in both resolvers and dns
header files, but it should belong to the dns header file, so the
duplicate entry was simply removed.
server_parse_addr_change_request() was completely replaced by the newer
srv_update_addr_port() function. Considering the function doesn't offer
useful features that srv_update_addr_port() couldn't do, we simply
remove the function.
Both functions are performing the similar tasks, except that the _port()
version is doing a bit more work.
In this patch, we add the server_set_inetaddr() function that works like
the srv_update_addr_port() but it takes parsed inputs instead of raw
strings as arguments.
Then, server_set_inetaddr() is used as underlying helper function for
both srv_update_addr() and srv_update_addr_port() to make them easier
to maintain.
Also, helper functions were added:
- server_set_inetaddr_warn() -> same as server_set_inetaddr() but report
a warning on updates.
- server_get_inetaddr() -> fills a struct server_inetaddr from srv
Since the feedback message generation part was slightly reworked, some
minor changes in the way addr:svc_port updates are reported in the logs
or cli messages should be expected (no loss of information though).
server addr:svc_port updates during runtime might set or clear the
SRV_F_MAPPORTS flag. Unfortunately, the flag update is still directly
performed by srv_update_addr_port() function while the addr:svc_port
update is being scheduled for atomic update. Given that existing readers
don't take server's lock to read addr:svc_port, they also check the
SRV_F_MAPPORTS flag right after without the lock.
So we could cause the readers to incorrectly interpret the svc_port from
the server struct because the mapport information is not published
atomically, resulting in inconsistencies between svc_port / mapport flag.
(MAPPORTS flag causes svc_port to be used differently by the reader)
To fix this, we publish the mapport information within the INETADDR server
event and we let the task responsible for updating server's addr and port
position or clear the flag depending on the mapport hint.
This patch depends on:
- MINOR: server/event_hdl: add server_inetaddr struct to facilitate event data usage
- MINOR: server/event_hdl: update _srv_event_hdl_prepare_inetaddr prototype
This should be backported in 2.9 with 683b2ae01 ("MINOR: server/event_hdl:
add SERVER_INETADDR event")
event_hdl_cb_data_server_inetaddr struct had some anonymous structs
defined in it, making it impossible to pass as a function argument and
harder to maintain since changes must be performed at multiple places
at once. So instead we define a storage struct named server_inetaddr
that helps to save addr:port server information in INET context.
4e5e2664 ("MINOR: proxy: add findserver_unique_id() and findserver_unique_name()")
added findserver_unique_id() and findserver_unique_name() functions that
were inspired from the historical findserver() function, so unfortunately
they don't perform well when used on large backend farms because they scan
the whole server list linearly.
I was about to provide a patch to optimize such functions when I stumbled
on Baptiste's work:
19a106d24 ("MINOR: server: server_find functions: id, name, best_match")
It turns out Baptiste already implemented helper functions to supersed
the unoptimized findserver() function (at least at runtime when servers
have been assigned their final IDs and inserted in the lookup trees): they
offer more matching options and rely on eb lookups so they are much more
suitable for fast queries. I don't know how I missed that, but they are a
perfect base for the server rid matching functions.
So in this patch, we essentially revert 4e5e2664 to provide the optimized
equivalent functions named server_find_by_id_unique() and
server_find_by_name_unique(), then we force existing findserver_unique_*()
callers to switch to the new functions.
This patch depends on:
- "OPTIM: server: eb lookup for server_find_by_name()"
This could be backported up to 2.8.
When a TCP frontend uses an HTTP backend, the stream is automatically
upgraded and it results in a similar behavior as if a switch-mode http
rule was evaluated since stream_set_http_mode() gets called in both
situations and minimal HTTP analyzers are set.
In the current implementation, some postparsing checks are generating
errors or warnings when the frontend is in TCP mode with some HTTP options
set and no upgrade is expected (no switch-rule http). But as you can guess,
unfortunately this leads in issues when such "HTTP" only options are used
in a frontend that has implicit switching rules (that is, when the
frontend uses an HTTP backend for example), because in this case the
PR_O_HTTP_UPG will not be set, so the postparsing checks will consider
that some options are not relevant and will raise some warnings.
Consider the following example:
backend back
mode http
server s1 git.haproxy.org:80
frontend front
mode tcp
bind localhost:8080
http-request set-var(txn.test) str(TRUE),debug(WORKING,stderr)
use_backend back
By starting an haproxy instance with the above example conf, we end up
having this warning:
[WARNING] (400280) : config : 'http-request' rules ignored for frontend 'front' as they require HTTP mode.
However, by making a request on the frontend, we notice that the request
rules are still executed, and that's because the stream is effectively
upgraded as a result of an implicit upgrade:
[debug] WORKING: type=str <TRUE>
So this confirms the previous description: since implicit and explicit
upgrades result in approximately the same behavior on the frontend side,
we should consider them both when doing postparsing checks.
This is what we try to address in the following commit: PR_O_HTTP_UPG
flag is now more generic in the sense that it refers to either implicit
(through default_backend or use_backend rules) or explicit (switch-mode
rules) upgrades. Indeed, everytime an HTTP or dynamic backend (where the
mode cannot be assumed during parsing) is encountered in default_backend
directive or use_backend rules, we explicitly position the upgrade flag
so that further checks that depend on the proxy being in HTTP context
don't report false warnings.
Some HTTP related stats functions need to know the parent proxy, mainly
to get a pointer on the related uri_auth set by the proxy or to check
scope settings.
The current design (probably historical as only the http context existed
by then) took the other approach: it propagates the uri pointer from the
http context deep down the calling stack up to the relevant functions.
For non-http contexts (cli), the pointer is set to NULL.
Doing so is not very pretty and not easy to maintain. Moreover, there were
still some places in the code were the uri pointer was learned directly
from the stream proxy because the argument was not available as argument
from those functions. This is error-prone, because if one day we decide to
change the source proxy in the parent function, we might still have some
functions down the stack that ignore the top most argument and still do
on their own, and we'll probably end up with inconsistencies.
So in this patch, we take a safer approach: the caller responsible for
creating the stats applet should set the http_px pointer so that any stats
function running under the applet that needs to know if it's running in
http context or needs to access parent proxy info may do so thanks to
the dedicated ctx->http_px pointer.
A regression was introduced by commit 2421c6fa7d ("BUG/MEDIUM: stconn: Block
zero-copy forwarding if EOS/ERROR on consumer side"). When zero-copy
forwarding is inuse and the consumer side is shut or in error, we declare it
as blocked and it is woken up. The idea is to handle this state at the
stream-connector level. However this definitly blocks receives on the
producer side. So if the mux is unable to close by itself, but instead wait
the peer to shut, this can lead to a wake up loop. And indeed, with the
passthrough multiplexer this may happen.
To fix the issue and prevent any loop, instead of blocking the zero-copy
forwarding, we now disable it. This way, the stream-connector on producer
side will fallback on classical receives and will be able to handle peer
shutdown properly. In addition, the wakeup of the consumer side was
removed. This will be handled, if necessary, by sc_notify().
This patch should fix the issue #2395. It must be backported to 2.9.
When the producer side (h1 for now) negociates with the consumer side to
perform a zero-copy forwarding, we now consider the consumer side as blocked
if it is closed and this was reported to the SE via a end-of-stream or a
(pending) error.
It is performed before calling ->nego_ff callback function, in se_nego_ff().
This way, all consumer are concerned automatically. The aim of this patch is
to fix an issue with the QUIC mux. Indeed, it is unexpected to send a frame
on an closed stream. This triggers a BUG_ON(). Other muxes are not affected
but it remains useless to try to send data if the stream is closed.
This patch should fix the issue #2372. It must be backported to 2.9.
qcc_app_ops is a set of callbacks used to unify application protocol
running over QUIC. This commit introduces some changes to clarify its
API :
* write simple comment to reflect each callback purpose
* rename decode_qcs to rcv_buf as this name is more common and is
similar to already existing snd_buf
* finalize is moved up as it is used during connection init stage
All these changes are ported to HTTP/3 layer. Also function comments
have been extended to highlight HTTP/3 special characteristics.
This function is similar to the previous one, but this time for QCS
sending buffer.
Previously, each application layer redefine their own version of
mux_get_buf() which was used to allocate <qcs.tx.buf>. Unify it under a
single function renamed qcc_get_stream_txbuf().
Replaces qcs_get_buf() function which naming does not reflect its
purpose. Add a new function qcc_get_stream_rxbuf() which allocate if
needed <qcs.rx.app_buf> and returns the buffer pointer. This function is
reserved for application protocol layer. This buffer is then accessed by
stconn layer.
For other qcs_get_buf() invocation which was used in effect for a local
buffer, replace these by a plain b_alloc().
"set severity-output" is one of these command that changes the appctx
state so the next commands are affected.
Unfortunately the master CLI works with pipelining and server close
mode, which means the connection between the master and the worker is
closed after each response, so for the next command this is a new appctx
state.
To fix the problem, 2 new flags are added ACCESS_MCLI_SEVERITY_STR and
ACCESS_MCLI_SEVERITY_NB which are used to prefix each command sent to
the worker with the right "set severity-output" command.
This patch fixes issue #2350.
It could be backported as far as 2.6.
Before this patch, it was not possible to use a list of patterns, map or a
list of acls, without an existing file. However, it could be handy to just
use an ID, with no file on the disk. It is pretty useful for everyone
managing dynamically these lists. It could also be handy to try to load a
list from a file if it exists without failing if not. This way, it could be
possible to make a cold start without any file (instead of empty file),
dynamically add and del patterns, dump the list to the file periodically to
reuse it on reload (via an external process).
In this patch, we uses some prefixes to be able to use virtual or optional
files.
The default case remains unchanged. regular files are used. A filename, with
no prefix, is used as reference, and it must exist on the disk. With the
prefix "file@", the same is performed. Internally this prefix is
skipped. Thus the same file, with ou without "file@" prefix, references the
same list of patterns.
To use a virtual map, "virt@" prefix must be used. No file is read, even if
the following name looks like a file. It is just an ID. The prefix is part
of ID and must always be used.
To use a optional file, ie a file that may or may not exist on a disk at
startup, "opt@" prefix must be used. If the file exists, its content is
loaded. But HAProxy doesn't complain if not. The prefix is not part of
ID. For a given file, optional files and regular files reference the same
list of patterns.
This patch should fix the issue #2202.
tune.cache.zero-copy-forwarding parameter can now be used to enable or
disable the zero-copy fast-forwarding for the cache applet only. It is
enabled ('on') by default. It can be disabled by setting the parameter to
'off'.
For now, CF_STREAMER and CF_STREAMER_FAST flags are set in sc_conn_recv()
function. The logic is moved in dedicated functions.
First, channel_check_idletimer() function is now responsible to check the
channel's last read date against the idle timer value to be sure the
producer is still streaming data. Otherwise, it removes STREAMER flags.
Then, channel_check_xfer() function is responsible to check amount of data
transferred avec a receive, to eventually update STREAMER flags.
In sc_conn_recv(), we now use these functions.
Zero-copy fast-forwading feature is a quite new and is a bit sensitive.
There is an option to disable it globally. However, all protocols have not
the same maturity. For instance, for the PT multiplexer, there is nothing
really new. The zero-copy fast-forwading is only another name for the kernel
splicing. However, for the QUIC/H3, it is pretty new, not really optimized
and it will evolved. And soon, the support will be added for the cache
applet.
In this context, it is usefull to be able to enable/disable zero-copy
fast-forwading per-protocol and applet. And when it is applicable, on sends
or receives separately. So, instead of having one flag to disable it
globally, there is now a dedicated bitfield, global.tune.no_zero_copy_fwd.
It is possible that a server's addr family is temporarily set to AF_UNSPEC
even if we're certain to be in INET context (ipv4, ipv6).
Indeed, as soon as IP address resolving is involved, srv->addr family will
be set to AF_UNSPEC when the resolution fails (could happen at anytime).
However, _srv_event_hdl_prepare_inetaddr() wrongly assumed that it would
only be called with AF_INET or AF_INET6 families. Because of that, the
function will handle AF_UNSPEC address as an IPV6 address: not only
we could risk reading from an unititialized area, but we would then
propagate false information when publishing the event.
In this patch we make sure to properly handle the AF_UNSPEC family in
both the "prev" and the "next" part for SERVER_INETADDR event and that
every members are explicitly initialized.
This bug was introduced by 6fde37e046 ("MINOR: server/event_hdl: add
SERVER_INETADDR event"), no backport needed.
preferred_address is a transport parameter specify by the server. It
specified both an IPv4 and IPv6 address. These addresses were defined as
plain array in <struct tp_preferred_address>.
Convert these adressees to use the common types in_addr/in6_addr. With
this change, dumping of preferred_address is extended. It now displays
the addresses using inet_ntop() and CID value.
retrieve_qc_conn_from_cid() requires listener as argument whereas it is
unused. This is an artifact from the old architecture where CID trees
where stored on listener instances instead of globally. Remove it to
better reflect this change.
Just like the ->ctl() callback function, used to send commands to mux
connections, the ->sctl() callback function can now be used to send commands
to mux streams. The first command, MUX_SCTL_SID, is a way to request the mux
stream ID.
It will be implemented later for each mux.
Instead of the generic MUX_, we now use MUX_CTL_ prefix for all mux_ctl_type
value. This will avoid any ambiguities with other enums, especially with a
new one that will be added to get information on mux streams.
It is now possible to retrieve the session terminate state, using
"txn.sess_term_state". The sample fetch returns the 2-character session
termation state.
Of course, the result of this sample fetch is volatile. It is subject to
change. It is also most of time useless because no termation state is set
except at the end. It should only be useful in http-after-response rule
sets. It may also be used to customize the logs using a log-format
directive.
This patch should fix the issue #2221.
The code returned by the "status" sample fetch is the one in the HTTP
response at the moment the sample is evaluated. It may be the status code in
the server response or the one of the HAProxy reply in case of error, deny,
redirect...
However, it could be handy to retrieve the status code returned by the
server, when a HTTP response was really received from it. It is the purpose
of the "server_status" sample fetch. The server status code itself is stored
in the HTTP txn.
In previous log backend implementation, we created a pseudo log target
for each declared log server, and we made the log target's address point
to the actual server address to save some time and prevent unecessary
copies.
But this was done without knowing that when FQDN is involved (more broadly
when dns/resolution is involved), the "port" part of server addr should
not be relied upon, and we should explicitly use ->svc_port for that
purpose.
With that in mind and thanks to the previous commit, some changes were
required: we allocate a dedicated addr within the log target when target
is in DGRAM mode. The addr is first initialized with known values and it
is then updated automatically by _srv_set_inetaddr() during runtime.
(the change is atomic so readers don't need to worry about it)
addr from server "log target" (INET/DGRAM mode) is made of the combination
of server's address (lacking the port part) and server's svc_port.
The local variable "event_hdl_async_max_notif_at_once" which was
introduced with the event_hdl API was left as is but with a TODO note
telling that we should make it a global tunable.
Well, we're doing this now. To prepare for upcoming tunables related to
event_hdl API, we add a dedicated struct named event_hdl_tune which is
globally exposed through the event_hdl header file so that it may be used
from everywhere. The struct is automatically initialized in
event_hdl_init() according to defaults.h.
"event_hdl_async_max_notif_at_once" now becomes
"event_hdl_tune.max_events_at_once" with it's dedicated
configuation keyword: "tune.events.max-events-at-once".
We're also taking this opportunity to raise the default value from 10
to 100 since it's seems quite reasonnable given existing async event_hdl
users.
The documentation was updated accordingly.
Implements the customized payload pattern for the master CLI.
The pattern is stored in the stream in char pcli_payload_pat[8].
The principle is basically the same as the CLI one, it looks for '<<'
then stores what's between '<<' and '\n', and look for it to exit the
payload mode.
The CLI payload syntax has some limitation, it can't handle payloads
with empty lines, which is a common problem when uploading a PEM file
over the CLI.
This patch implements a way to customize the ending pattern of the CLI,
so we can't look for other things than empty lines.
A char cli_payload_pat[8] is used in the appctx to store the customized
pattern. The pattern can't be more than 7 characters and can still empty
to match an empty line.
The cli_io_handler() identifies the pattern and stores it, and
cli_parse_request() identifies the end of the payload.
If the customized pattern between "<<" and "\n" is more than 7
characters, it is not considered as a pattern.
This patch only implements the parser for the 'stats socket', another
patch is needed for the 'master CLI'.
Move qc_notify_send() from quic_tx.c to quic_conn.c. Note that it was already
exported from both quic_conn.h and quic_tx.h. Modify this latter header
to fix the duplication.
Such a warning appeared after having added quic_retry.h which includes only
headers for types (quic_cid-t.h, clock-t.h...)
In file included from include/haproxy/quic_retry.h:12,
from src/quic_retry.c:5:
include/haproxy/quic_cid-t.h:26:26: error: field ‘seq_num’ has incomplete type
26 | struct eb64_node seq_num;
Add quic_retry.c new C file for the QUIC retry feature:
quic_saddr_cpy() moved from quic_tx.c,
quic_generate_retry_token_aad() moved from
quic_generate_retry_token() moved from
parse_retry_token() moved from
quic_retry_token_check() moved from
quic_retry_token_check() moved from
This function is in relation with the Initial packet number space which is
more linked to the QUIC TLS specifications. Let's move it to quic_tls.h
to be inlined.
Move quic_path struct from quic_conn-t.h to quic_cc-t.h and rename it to quic_cc_path.
Update the code consequently.
Also some inlined functions in relation with QUIC path to quic_cc.h
Move quic_pkt_type(), quic_saddr_cpy(), quic_write_uint32(), max_available_room(),
max_stream_data_size(), quic_packet_number_length(), quic_packet_number_encode()
and quic_compute_ack_delay_us() to quic_tx.c because only used in this file.
Also move quic_ack_delay_ms() and quic_read_uint32() to quic_tx.c because they
are used only in this file.
Move quic_rx_packet_refinc() and quic_rx_packet_refdec() to quic_rx.h header.
Move qc_el_rx_pkts(), qc_el_rx_pkts_del() and qc_list_qel_rx_pkts() to quic_tls.h
header.
Move quic_cstream struct definition from quic_conn-t.h to quic_tls-t.h.
Its pool is also moved from quic_conn module to quic_tls. Same thing for
quic_cstream_new() and quic_cstream_free().
Fix such building issues:
In file included from src/quic_tx.c:15:
include/haproxy/quic_tx.h:51:23: warning: ‘struct quic_rx_packet’
Do not know why the compiler warns about such missing header inclusions
just now. It should have complained a long time ago during the big QUIC
source code split.
Move quic_cid and quic_connnection_id from quic_conn-t.h to new quic_cid-t.h header.
Move defintions of quic_stateless_reset_token_init(), quic_derive_cid(),
new_quic_cid(), quic_get_cid_tid() and retrieve_qc_conn_from_cid() to quic_cid.c
new C file.
When zero-copy data fast-forwarding is inuse, if the opposite SC is blocked,
there is no reason to try to fast-forward more data. Worst, in some cases,
this can lead to a receive loop of the producer side while the consumer side
is blocked.
No backport needed.
Add an optional argument for "-dt". This argument is interpreted as a
list of several trace statement separated by comma. For each statement,
a specific trace name can be specifed, or none to act on all sources.
Using double-colon separator, it is possible to add specifications on
the wanted level and verbosity.
Add '-dt' haproxy process argument. This will automatically activate all
trace sources on stderr with the error level. This could be useful to
troubleshoot issues such as protocol violations.
In the pat_ref_elt struct, the pattern string is stored outside of the
node element, using a pointer to an strdup(). Not only this needlessly
wastes at least 16-24 bytes per entry (8 for the pointer, 8-16 for the
allocator), it also makes the tree descent less efficient since both
the node and the string have to be visited for each layer (hence at least
two cache lines). Let's use an ebmb storage and place the pattern right
at the end of the pat_ref_elt, making it a variable-sized element instead.
The set-map test below jumps from 173 to 182 kreq/s/core, and the memory
usage drops from 356 MB to 324 MB:
http-request set-map(/dev/null) %[rand(1000000)] 1
This is even more visible with large maps: after loading 16M IP addresses
into a map, the process uses this amount of memory:
- 3.15 GB with haproxy-2.8
- 4.21 GB with haproxy-2.9-dev11
- 3.68 GB with this patch
So that's a net saving of 32 bytes per entry here, which cuts in half the
extra cost of the tree, and loading a large map takes about 20% less time.
Task_drop_running() is used to remove the RUNNING bit and check if
while the task was running it got a new wakeup from itself. Thus
each time task_drop_running() marks itself as a caller, it in fact
removes the previous caller that woke up the task, such as below:
Tasks activity over 10.439 sec till 0.000 sec ago:
function calls cpu_tot cpu_avg lat_tot lat_avg
task_run_applet 57895273 6.396m 6.628us 2.733h 170.0us <- run_tasks_from_lists@src/task.c:658 task_drop_running
Better not mark this function as a caller and keep the original one:
Tasks activity over 13.834 sec till 0.000 sec ago:
function calls cpu_tot cpu_avg lat_tot lat_avg
task_run_applet 62424582 5.825m 5.599us 5.717h 329.7us <- sc_app_chk_rcv_applet@src/stconn.c:952 appctx_wakeup
The mworker mode never had a proper 'hard-stop' (-st) for the reload,
this is a mode which was commonly used with the daemon mode, but it was
never implemented in mworker mode.
This patch fixes the problem by implementing a "hard-reload" command
over the master CLI. It does the same as the "reload" command, but
instead of waiting for the connections to stop in the previous process,
it immediately quits the previous process after binding.
Take the px->server_rules freeing part out of free_proxy() and make it
a dedicated helper function so that it becomes possible to use it from
anywhere.
In this patch we fix the prototype for ipcmp() and ipcpy() functions so
that input pointers that are used exclusively for reads are used as const
pointers. This way, the compiler can safely assume that those variables
won't be altered by the function.
In this patch we add the support for a new SERVER event in the
event_hdl API.
SERVER_INETADDR is implemented as an advanced server event.
It is published each time the server's ip address or port is
about to change. (ie: from the cli, dns, lua...)
SERVER_INETADDR data is an event_hdl_cb_data_server_inetaddr struct
that provides additional info related to the server inet addr change,
but can be casted as a regular event_hdl_cb_data_server struct if
additional info is not needed.
When possible, we try send DATA frame without copying data. To do so, we
swap the input buffer with QCS tx buffer. It is only possible iff:
* There is only one HTX block of data at the beginning of the message
* Amount of data to send is equal to the size of the HTX data block
* The QCS tx buffer is empty
In this case, both buffers are swapped. The frame metadata are written at
the begining of the buffer, before data and where the HTX structure is
stored.
Add a new member <nb_rhttp_conns> in thread_ctx structure. Its purpose
is to count the current number of opened reverse HTTP connections
regarding from their listeners membership.
This patch will be useful to support multi-thread for active reverse
HTTP, in order to select the less loaded thread.
Note that despite access to <nb_rhttp_conns> are only done by the
current thread, atomic operations are used. This is because once
multi-thread support will be added, external threads will also retrieve
values from others.
Previous commit renames 'proto_reverse_connect' module to 'proto_rhttp'.
This commits follows this by replacing various custom prefix by 'rhttp_'
to make the code uniform.
Note that 'reverse_' prefix was kept in connection module. This is
because if a new reversable protocol not based on HTTP is implemented,
it may be necessary to reused the same connection function which are
protocol agnostic.
In Github issue #2128, @jvincze84 explained the complexity of using
external checks in some advanced setups due to the systematic purge of
environment variables, and expressed the desire to preserve the
existing environment. During the discussion an agreement was found
around having an option to "external-check" to do that and that
solution was tested and confirmed to work by user @nyxi.
This patch just cleans this up, implements the option as
"preserve-env" and documents it. The default behavior does not change,
the environment is still purged, unless "preserve-env" is passed. The
choice of not using "import-env" instead was made so that we could
later use it to name specific variables that have to be imported
instead of keeping the whole environment.
The patch is simple enough that it could be backported if needed (and
was in fact tested on 2.6 first).
Here the idea is to collect components' versions and build options. The
main component is haproxy, but the API is made so that any sub-system
can easily add a component there (for example the detailed version of a
device detection lib, or some info about a lib loaded from Lua).
The elements are stored as a pointer to an array of structs and its count
so that it's sufficient to issue this in gdb to list them all at once:
print *post_mortem.components@post_mortem.nb_components
For now we collect name, version, toolchain, toolchain options, build
options and path. Maybe more could be useful in the future.
When debugging a core, it's difficult to match a given gdb thread number
against an internal thread. Let's just store the pthread ID and the stack
pointer in each tinfo. This could help in the future by allowing to just
glance over them and pick the right one depending what info is found
first.
Add missing CO_FL_REVERSED and CO_FL_ACT_REVERSING flag definitions in
conn_show_flags(). These flags were introduced in this release with
reverse HTTP support.
No need to backport
On CONNECTION_CLOSE reception/emission, QUIC connections enter CLOSING
state. At this stage, only CONNECTION_CLOSE can be reemitted and all
other exchanges are stopped.
Previously, on haproxy process stopping, if all QUIC connections were in
CLOSING state, they were released before their closing timer expiration
to not block the process shutdown. However, since a recent commit, the
closing timer has been shorten to a more reasonable delay. It is now
consider viable to respect connections closing state even on process
shutdown. As such, stopping specific code in QUIC connections idle timer
task was removed.
A specific function quic_handle_stopping() was implemented to notify
QUIC connections on shutdown from main() function. It should have been
deleted along the removal in QUIC idle timer task. This patch just does
this.
In 2.3, we started to get a cleaner socket unbinding mechanism with
commit f58b8db47 ("MEDIUM: receivers: add an rx_unbind() method in
the protocols"). This mechanism rightfully refrains from unbinding
when sockets are expected to be transferrable to another worker via
"expose-fd listeners", but this is not compatible with ABNS sockets,
which do not support reuseport, unbinding nor being renamed: in short
they will always prevent a new process from binding.
It turns out that this is not much visible because by pure accident,
GTUNE_SOCKET_TRANSFER is only set in the code dealing with master mode
and deamons, so it's never set in foreground mode nor in tests even if
present on the stats socket. However with master mode, it is now always
set even when not present on the stats socket, and will always conflict.
The only reasonable approach seems to consist in marking these abns
sockets as non-suspendable so that the generic sock_unbind() code can
decide to just unbind them regardless of GTUNE_SOCKET_TRANSFER.
This should carefully be backported as far as 2.4.
add proxy_cfg_ensure_no_log() function (similar to
proxy_cfg_ensure_no_http()) to ensure at the end of proxy parsing that
no log exclusive options are found if the proxy is not in log mode.
"log-balance" directive was recently introduced to configure the
balancing algorithm to use when in a log backend. However, it is
confusing and it causes issues when used in default section.
In this patch, we take another approach: first we remove the
"log-balance" directive, and instead we rely on existing "balance"
directive to configure log load balancing in log backend.
Some algorithms such as roundrobin can be used as-is in a log backend,
and for log-only algorithms, they are implemented as "log-$name" inside
the "backend" directive.
The documentation was updated accordingly.
Make sure lbprm.algo can store 32bits by declaring it as uint32_t
Then, use all 32 available bits to offer 4 extra bits for the BE_LB_NEED
inputs. This will allow new required inputs to be easily added (up to 4
new ones, plus one that wasn't used yet if we keep them exclusive)
This required some cleanup: all ALGO bitfields were rewritten in the
32bits format and the high ones were shifted to make room for the
new BE_LB_NEED bits.
BE_LB_HASH_RND was introduced with 760e81d35 ("MINOR: backend: implement
random-based load balancing") but was never used since. Removing it
to regain an extra slot for future types.
A dummy connect() function previously had to be installed for the log
server so that a reverse-http address could be referenced on a "server"
line, but after the recent rework of the server line parsing, this is
no longer needed, and this is actually annoying as it makes one believe
there is a way to connect outside, which is not true. Let's now get rid
of this function.
The idle timer task may be used to trigger the client handshake timeout.
The hanshake timeout expiration date (qc->hs_expire) is initialized when the
connection is allocated. Obviously, this timeout is taken into an account only
during the handshake by qc_idle_timer_do_rearm() whose job is to rearm the idle timer.
The idle timer expiration date could be initialized only one time, then
never updated until the hanshake completes. But this only works if the
handshake timeout is smaller than the idle timer task timeout. If the handshake
timeout is set greater than the idle timeout, this latter may expire before the
handshake timeout.
This patch may have an impact on the L1/C1 interop tests (with heavy packet loss
or corruption). This is why I guess some implementations with a hanshake timeout
support set a big timeout during this test. This is at least the case for ngtcp2
which sets a 180s hanshake timeout! haproxy will certainly have to proceed the
same way if it wants to have a chance to pass this test as before this handshake
timeout.
Add a new timeout for the handshake, on the frontend side only. Such a hanshake
will be typically used for TLS hanshakes during client connections to TLS/TCP or
QUIC frontends.
Partial sends is an activity, not a full blocking. Thus a read activity must
be reported for non-independent stream. It is especially important for very
congested stream where full sends are uncommon.
This patch must be backported to 2.8.
This patch adds HXT-aware versions of the functions c_data(), ci_data() and
c_empty(). channel_data() function returns the amount of data in the
channel, channel_input_data() returns the amount of input data and
channel_empty() returns true if the channel's buffer is empty. These
functions handles HTX buffers.
In addition, channel_data_limit() function, still HTX-aware, can be used to
get the maximum absolute amount of data that can be copied in a buffer,
independently on data already present in the buffer.
The overhead induced by the HTX format was set to the HTX structure itself
and two HTX blocks. It was set this way to optimize zero-copy during
transfers. This value may (and will) be used at different places. Thus we
now use a macro, called HTX_BUF_OVERHEAD.
The first-send-blocked date was originally designed to save the date of the
first send of a series where some data remain blocked. It was relaxed
recently (3083fd90e "BUG/MEDIUM: stconn: Report a send activity everytime
data were sent") to save the date of the first full blocked send. However,
it is not accurrate.
When all data are sent, the fsb value must be reset to TICK_ETERNITY. When
nothing is sent and if it is not already set, it must be set. But when data
are partially sent, the value must be updated and not reset. Otherwise the
write timeout may be ignored because fsb date is never set.
So, changes brought by the patch above are reverted and
sc_ep_report_blocked_send() was changed to know if some data were sent or
not. This way we are able to update fsb value.
l
This patch must be backported to 2.8.
This global variable was used to avoid using locks on shared_contexts in
the unlikely case of nbthread==1. Since the locks do not do anything
when USE_THREAD is not defined, it will be more beneficial to simply
remove this variable and the systematic test on its value in the shared
context locking functions.
A reference counter on the cache_entry was added in a previous commit.
Its value is atomically increased and decreased via the retain_entry and
release_entry functions.
This is needed because of the latest cache and shared_context
modifications that introduced two separate locks instead of the
preexisting single shctx_lock one.
With the new logic, we have two main blocks competing for the two locks:
- the one in the http_action_req_cache_use that performs a lookup in the
cache tree (locked by the cache lock) and then tries to remove the
corresponding blocks from the shared_context's 'avail' list until the
response is sent to the client by the cache applet,
- the shctx_row_reserve_hot that traverses the 'avail' list and gives
them back to the caller, while removing previous row heads from the
cache tree
Those two blocks require the two locks but one of them would take the
cache lock first, and the other one the shctx_lock first, which would
end in a deadlock without the current patch.
The way this conflict is resolved in this patch is by ensuring that at
least one of those uses works without taking the two locks at the same
time.
The solution found was to keep taking the two locks in the cache_use
case. We first lock the cache to lookup for an entry and we then take
the shctx lock as well to detach the corresponding blocks from the
'avail' list. The subtlety is that between the cache lookup and the
actual locking of the shctx, another thread might have called the
reserve_hot function in which we only take the shctx lock.
In this function we traverse the 'avail' list to remove blocks that are
then given to the caller. If one of those blocks corresponds to a
previous row head, we call the 'free_blocks' callback that used to
delete the cache entry from the tree.
We now avoid deleting directly the cache entries in reserve_hot and we
rather set the cache entries 'complete' param to 0 so that no other
thread tries to work with this entry. This way, when we release the
shctx lock in reserve_hot, the first thread that had performed the cache
lookup and had found an entry that we just gave to another thread will
see that the 'complete' field is 0 and it won't try to work with this
response.
The actual removal of entries from the cache tree will now be performed
in the new 'reserve_finish' callback called at the end of the
shctx_row_reserve_hot function. It will iterate on all the row head that
were inserted in a dedicated list in the 'free_block' callback and
perform the actual delete.
This patch adds a reserve_finish callback that can be defined by the
subsystems that require a shared_context. It is called at the end of
shctx_row_reserve_hot after the shared_context lock is released.
Since a lock on the cache tree was added in the latest cache changes, we
do not need to use the shared_context's lock to lock more than pure
shared_context related data anymore. This already existing lock will now
only cover the 'avail' list from the shared_context. It can then be
changed to a rwlock instead of a spinlock because we might want to only
run through the avail list sometimes.
Apart form changing the type of the shctx lock, the main modification
introduced by this patch is to limit the amount of code covered by the
shctx lock. This lock does not need to cover any code strictly related
to the cache tree anymore.
The "hot" list stored in a shared_context was used to keep a reference
to shared blocks that were currently being used and were thus removed
from the available list (so that they don't get reused for another cache
response). This 'hot' list does not ever need to be shared across
threads since every one of them only works on their current row.
The main need behind this 'hot' list was to detach the corresponding
blocks from the 'avail' list and to have a known list root when calling
list_for_each_entry_from in shctx_row_data_append (for instance).
Since we actually never need to iterate over all members of the 'hot'
list, we can remove it and replace the inc_hot/dec_hot logic by a
detach/reattach one.
Every use of the cache tree was covered by the shctx lock even when no
operations were performed on the shared_context lists (avail and hot).
This patch adds a dedicated RW lock for the cache so that blocks of code
that work on the cache tree only can use this lock instead of the
superseding shctx one. This is useful for operations during which the
concerned blocks are already in the hot list.
When the two locks need to be taken at the same time, in
http_action_req_cache_use and in shctx_row_reserve_hot, the shctx one
must be taken first.
A new parameter needed to be added to the shared_context's free_block
callback prototype so that cache_free_block can take the cache lock and
release it afterwards.
Change the flags used for reversed connection :
* CO_FL_REVERSED is now put after reversal for passive connect. For
active connect, it is delayed when accept is completed after reversal.
* CO_FL_ACT_REVERSING replace the old CO_FL_REVERSED. It is put only for
active connect on reversal and removes once accept is done.
This allows to identify a connection as reversed during its whole
lifetime. This should be useful to extend reverse connect.
Make all the congestion support the maximum congestion control window
set by configuration. There is nothing special to explain. For each
each algo, each time the window is incremented it is also bounded.
Add a new ->max_cwnd member to bind_conf struct to store the maximum
congestion control window value for each QUIC binding.
Modify the "quic-cc-algo" keyword parsing to add an optional parameter
to its value: the maximum congestion window value between parentheses
as follows:
ex: quic-cc-algo cubic(10m)
This value must be bounded, greater than 10k and smaller than 1g.
This bug was introduced with 29b76ca ("BUG/MEDIUM: server/log: "mode log"
after server keyword causes crash ")
Indeed, we cannot safely rely on addr_proto being set when str2sa_range()
returns in parse_server() (even if SRV_PARSE_PARSE_ADDR is set), because
proto lookup might be bypassed when FQDN addresses are involved.
Unfortunately, the above patch wrongly assumed that proto would always
be set when SRV_PARSE_PARSE_ADDR was passed to parse_server() (so when
str2sa_range() was called), resulting in invalid postparsing checks being
performed, which could as well lead to crashes with log backends
("mode log" set) because some postparsing init was skipped as a result of
proto not being set and this wasn't expected later in the init code.
To fix this, we now make use of the previous patch to perform server's
address compatibility checks on hints that are always set when
str2sa_range() succesfully returns.
For log backend, we're also adding a complementary test to check if the
address family is of expected type, else we report an error, plus we're
moving the postinit logic in log api since _srv_check_proxy_mode() is
only meant to check proxy mode compatibility and we were abusing it.
This patch depends on:
- "MINOR: tools: make str2sa_range() directly return type hints"
No backport required unless 29b76ca gets backported.
str2sa_range() already allows the caller to provide <proto> in order to
get a pointer on the protocol matching with the string input thanks to
5fc9328a ("MINOR: tools: make str2sa_range() directly return the protocol")
However, as stated into the commit message, there is a trick:
"we can fail to return a protocol in case the caller
accepts an fqdn for use later. This is what servers do and in this
case it is valid to return no protocol"
In this case, we're unable to return protocol because the protocol lookup
depends on both the [proto type + xprt type] and the [family type] to be
known.
While family type might not be directly resolved when fqdn is involved
(because family type might be discovered using DNS queries), proto type
and xprt type are already known. As such, the caller might be interested
in knowing those address related hints even if the address family type is
not yet resolved and thus the matching protocol cannot be looked up.
Thus in this patch we add the optional net_addr_type (custom type)
argument to str2sa_range to enable the caller to check the protocol type
and transport type when the function succeeds.
It's common to see process_stream() being woken up by wake_expired_tasks
in the profiling output, without knowing which timeout was set to cause
this. By making it possible to record the call places of task_queue()
and task_schedule(), and by making wake_expired_tasks() explicitly not
replace it, we'll be able to know which task_queue() or task_schedule()
was triggered for a given wakeup.
For example below:
process_stream 51200 311.4ms 6.081us 34.59s 675.6us <- run_tasks_from_lists@src/task.c:659 task_queue
process_stream 19227 70.00ms 3.640us 9.813m 30.62ms <- sc_notify@src/stconn.c:1136 task_wakeup
process_stream 6414 102.3ms 15.95us 8.093m 75.70ms <- stream_new@src/stream.c:578 task_wakeup
It's visible that it's the run_tasks_from_lists() which in fact applies
on the task->expire returned by the ->process() function itself.
This is used for tracing and profiling. By permitting to have a NULL
caller, we allow a caller to explicitly pass zero to state that the
current caller must not be replaced. This will soon be used by
wake_expired_tasks() to avoid replacing a caller in the expire loop.
QUIC connections are pushed manually into a dedicated listener queue
when they are ready to be accepted. This happens after handshake
finalization or on 0-RTT packet reception. Listener is then woken up to
dequeue them with listener_accept().
This patch comptabilizes the number of connections currently stored in
the accept queue. If reaching a certain limit, INITIAL packets are
dropped on reception to prevent further QUIC connections allocation.
This should help to preserve system resources.
This limit is automatically derived from the listener backlog. Half of
its value is reserved for handshakes and the other half for accept
queues. By default, backlog is equal to maxconn which guarantee that
there can't be no more than maxconn connections in handshake or waiting
to be accepted.
Implement a limit per listener for concurrent number of QUIC
connections. When reached, INITIAL packets for new connections are
automatically dropped until the number of handshakes is reduced.
The limit value is automatically based on listener backlog, which itself
defaults to maxconn.
This feature is important to ensure CPU and memory resources are not
consume if too many handshakes attempt are started in parallel.
Special care is taken if a connection is released before handshake
completion. In this case, counter must be decremented. This forces to
ensure that member <qc.state> is set early in qc_new_conn() before any
quic_conn_release() invocation.
Accounting is implemented for half open connections which represent QUIC
connections waiting for handshake completion. When reaching a certain
limit, Retry mechanism is automatically activated prior to instantiate
new connections.
The issue with this behavior is that two notions are mixed : QUIC
connection handshake phase and Retry which is mechanism against
amplification attacks. As such, only peer address validation should be
taken into account to activate Retry protection.
This patch chooses to reduce the scope of half_open_conn. Now only
connection waiting to validate the peer address are now accounted for.
Most notably, connections instantiated with a validated Retry token
check are not accounted.
One impact of this patch is that it should prevent to activate Retry
mechanism too early, in particular in case if multiple handshakes are
too slow. Another limitation should be implemented to protect against
this scenario.
When a new QUIC connection is created, server considers peer address as
not yet validated. The server must limit its sending up to 3 times the
content already received. This is a defensive measure to avoid flooding
a remote host victim of address spoofing.
This patch adjust the condition to consider the peer address as
validated. Two conditions are now considered :
* successful handling of a received HANDSHAKE packet. This was already
done before although implemented in a different way.
* validation of a Retry token. This was not considered prior this patch
despite RFC recommandation.
This patch also adjusts how a connection is internally labelled as using
a validated peer address. Before, above conditions were checked via
quic_peer_validated_addr(). Now, a flag QUIC_FL_CONN_PEER_VALIDATED_ADDR
is set to labelled this. It already existed prior this patch but was
only used for quic_cc_conn. This should now be more explicit.
This bug could be reproduced with -dMfail option and detected by libasan.
During the TLS secrets allocations, when failed, quic_tls_ctx_secs_free()
is called. It resets the already initialized secrets. Some were detected
as initialized when not, or with a non initialized length, which leads
to big "memset(0)" detected by libsasan.
Ensure that all the secrets are really initialized with correct lengths.
No need to be backported.
This was no reason not to release as soon as possible the TLS/SSL QUIC connection
context from quic_conn_release() before allocating a "closing connection" connection
(quic_cc_conn struct).
This patch sets the handshake task in heavy task mode when receiving in disorder
CRYPTO data which results in in order bufferized CRYPTO data. This is done
thanks to a non-contiguous buffer and from qc_handle_crypto_frm() after having
potentially bufferized CRYPTO data in this buffer.
qc_treat_rx_crypto_frms() is no more called from qc_treat_rx_pkts() but instead
this is where the task is set in heavy task mode. Consequently,
this is the job of qc_ssl_provide_all_quic_data() to call directly
qc_treat_rx_crypto_frms() to provide the in order bufferized CRYPTO data to the
TLS stack. As this function releases the non-contiguous buffer for the CRYPTO
data, if possible, there is no need to do that from qc_treat_rx_crypto_frms()
anymore.
Add a new pool for the CRYPTO data frames received in order.
Add ->rx.crypto_frms list to each encryption level to store such frames
when they are received in order from qc_handle_crypto_frm().
Also set the handshake task (qc_conn_io_cb()) in heavy task mode from
this function after having received such frames. When this task
detects that it is set in heavy mode, it calls qc_ssl_provide_all_quic_data()
newly implemented function to provide the CRYPTO data to the TLS task.
Modify quic_conn_enc_level_uninit() to release these CRYPTO frames
when releasing the encryption level they are in relation with.
IOBUF_FL_EOI iobuf flag is now set by the producer to notify the consumer
that the end of input was reached. Thanks to this flag, we can remove the
ugly ack in h2_done_ff() to test the opposite SE flags.
Of course, for now, it works and it is good enough. But we must keep in mind
that EOI is always forwarded from the producer side to the consumer side in
this case. But if this change, a new CO_RFL_ flag will have to be added to
instruct the producer if it can forward EOI or not.
In the mux-to-mux data forwarding, we now try, as far as possible to send at
least a buffer. Of course, if the consumer side is congested or if nothing
more can be received, we leave. But the idea is to retry to fast-forward
data if less than a buffer was forwarded. It is only performed for buffer
fast-forwarding, not splicing.
The idea behind this patch is to optimise the forwarding, when a first
forward was performed to complete a buffer with some existing data. In this
case, the amount of data forwarded is artificially limited because we are
using a non-empty buffer. But without this limitation, it is highly probable
that a full buffer could have been sent. And indeed, with H2 client, a
significant improvement was observed during our test.
To do so, .done_fastfwd() callback function must be able to deal with
interim forwards. Especially for the H2 mux, to remove H2_SF_NOTIFIED flags
on the H2S on the last call only. Otherwise, the H2 stream can be blocked by
itself because it is in the send_list. IOBUF_FL_INTERIM_FF iobuf flag is
used to notify the consumer it is not the last call. This flag is then
removed on the last call.
In 2.6-dev1, the method used to decide how many pool entries could be
released at once was revisited to support releases in batches. This was
done with commits 91a8e28f9 ("MINOR: pool: add a function to estimate
how many may be released at once") and 361e31e3f ("MEDIUM: pool: compute
the number of evictable entries once per pool").
The first commit takes care of the possible inconsistency between the
moment the allocated count and the used count are read, but unfortunately
fixed it the wrong way, by adjusting "used" to match "alloc" whenever it
was lower (i.e. almost always). This results in a nasty case which is that
as soon as the allocated value becomes higher than the estimated count of
needed entries, we end up returning pool->minavail, which causes very
small batches to be released, starting from commit 1513c5479 ("MEDIUM:
pools: release cached objects in batches").
The problem was further amplified in 2.9-dev3 with commit 7bf829ace
("MAJOR: pools: move the shared pool's free_list over multiple buckets")
because it now becomes possible for a thread to allocate from one bucket
and release into a few other different ones, causing an accumulation of
entries in that bucket.
The fix is trivial, simply adjust the alloc counter if the used one is
higher, before performing operations.
This must be backported to 2.6.
QUIC connections are accounted inside global sslconns. As with QUIC
actconn, it suffered from a similar issue if an intermediary allocation
failed inside qc_new_conn().
Fix this similarly by moving increment operation inside qc_new_conn().
Increment and error path are now centralized and much easier to
validate.
The consequences are similar to the actconn fix : on memory allocation
global sslconns may wrap, this time blocking any future QUIC or SSL
connections on the process.
This must be backported up to 2.6.
Recent fixes have shown <lra> and <fsb> uses were not prettu clear. So let's
try to improve documentation about these value. Especially when <lra> is
updated and how to used it.
In sc_need_room(), we compute the maximum room that can be requested to
restarted reading to be sure to be able to unblock the SC. At worst when the
buffer is emptied. Here, the buffer reserve is considered but it is an issue.
Counting the reserve can lead to a wicked bug with the H1 multiplexer, when
small amount of data are found at the end of the HTX buffer. In this case,
to not wrap, the H1 mux requests more room. It is an optim to be able to
resync the buffer with the consumer side and to be able to perform zero-copy
transfers. However, if this amount of data is smaller than the reserve and
if the consumer is congested, we fall in a loop because the wrong value is
used to request more room. The H1 mux continues to pretend there is not
enough space in the buffer, while the effective requested value is lower
than the free space in the buffer. While the consumer is congested and does
not consume these data, the is no way to stop the loop.
We can fix the function by removing the buffer reserve from the
computation. But it remains a dangerous decision to apply a max value on
room_needed. It is safer to require the caller must set a correct value. For
now, it is true. But at the end, it is totally unexepected to wait for more
room than an empty buffer can contain.
This patch must be backported to 2.8.
When receive or send expiration date of a stream-connector is retrieved, we
now automatically check if it may expire. If not, TICK_ETERNITY is returned.
The expiration dates of the frontend and backend stream-connectors are used
to compute the stream expiration date. This operation is performed at 2
places: at the end of process_stream() and in sc_notify() if the stream is
not woken up.
With this patch, there is no special changes for process_stream() because it
was already handled. It make thing a little simpler. However, it fixes
sc_notify() by avoiding to erroneously compute an expiration date in
past. This highly reduce the stream wakeups when there is contention on the
consumer side.
The bug was introduced with the commit 8073094bf ("NUG/MEDIUM: stconn:
Always update stream's expiration date after I/O"). It was an error to
unconditionnaly set the stream expiration data, without testing blocking
conditions on both SC.
This patch must be backported to 2.8.
When data are directly forwarded from a mux to the opposite one, we must not
forget to report send activity when data are successfully sent or report a
blocked send with data are blocked. It is important because otherwise, if
the transfer is quite long, longer than the client or server timeout, an
error may be triggered because the write timeout is reached.
H1, H2 and PT muxes are concerned. To fix the issue, The done_fastword()
callback now returns the amount of data consummed. This way it is possible
to update/reset the FSB data accordingly.
No backport needed.
This commit introduces a generic server-side parsing of type-value pair
arguments and allocation of a TLV list via a new keyword called
set-proxy-v2-tlv-fmt.
This allows to 1) forward any TLV type with the help of fc_pp_tlv,
2) generally, send out any TLV type and value via a log format expression.
To have this fully working the connection will need to be updated in
a follow-up commit to actually respect the new server TLV list.
default-server support has also been implemented.
In this patch, we add the possibility to declare on a table definition
("table" in peer section, or "stick-table" in proxy section) that we
want the remote/peer updates on that table to be pushed on a local
haproxy table in addition to the source table.
Consider this example:
|peers mypeers
| peer local 127.0.0.1:3334
| peer clust 127.0.0.1:3333
| table t1.local type string size 10m store server_id,server_key expire 30s
| table t1.clust type string size 10m store server_id,server_key write-to mypeers/t1.local expire 30s
With this setup, we consider haproxy uses t1.local as cache/local table
for read and write operations, and that t1.clust is a remote table
containing datas processed from t1.local and similar tables from other
haproxy peers in a cluster setup. The t1.clust table will be used to
refresh the local/cache one via the "write-to" statement.
What will happen, is that every time haproxy will see entry updates for
the t1.clust table: it will overwrite t1.local table with fresh data and
will update the entry expiration timer. If t1.local entry doesn't exist
yet (key doesn't exist), it will automatically create it. Note that only
types that cannot be used for arithmetic ops will be handled, and this
to prevent processed values from the remote table from interfering with
computations based on values from the local table. (ie: prevent
cumulative counters from growing indefinitely).
"write-to" will only push supported types if they both exist in the source
and the target table. Be careful with server_id and server_key storage
because they are often declared implicitly when referencing a table in
sticking rules but it is required to declare them explicitly for them to
be pushed between a remote and a local table through "write-to" option.
Also note that the "write-to" target table should have the same type as
the source one, and that the key length should be strictly equal,
otherwise haproxy will raise an error due to the tables being
incompatibles. A table that is already being written to cannot be used
as a source table for a "write-to" target.
Thanks to this patch, it will now be possible to use sticking rules in
peer cluster context by using a local table as a local cache which
will be automatically refreshed by one or multiple remote table(s).
This commit depends on:
- "MINOR: stktable: stktable_init() sets err_msg on error"
- "MINOR: stktable: check if a type should be used as-is"
stick table types now have an extra bit named 'as_is' that allows us to
check if such type should be used as-is or if it may be involved in
arithmetic operations such as counters. This can be useful since those
types are not common and may require specific handling.
e.g.: stktable_data_types[data_type].as_is will be set to 1 if the type
cannot be used in arithmetic operations.
Simplify stick and store sticktable proxy rules postparsing by adding
a sticking rule entry resolve (postparsing) function.
This will ease code maintenance.
This new fetcher can be used to extract the list of cookie names from
Cookie request header or from Set-Cookie response header depending on
the stream direction. There is an optional argument that can be used
as the delimiter (which is assumed to be the first character of the
argument) between cookie names. The default delimiter is comma (,).
Note that we will treat the Cookie request header as a semi-colon
separated list of cookies and each Set-Cookie response header as
a single cookie and extract the cookie names accordingly.
Similar to the previous commit which check for maxconn before allocating
a QUIC connection, this patch checks for maxsslconn at the same step.
This is necessary as a QUIC connection cannot run without a SSL context.
This should be backported up to 2.6. It relies on the following patch :
"BUG/MINOR: ssl: use a thread-safe sslconns increment"
Increment actconn and check maxconn limit when a quic_conn is
instantiated. This is necessary because prior to this patch, quic_conn
instances where not counted. Global actconn was only incremented after
the handshake has been completed and the connection structure is
allocated.
The increment is done using increment_actconn() on INITIAL packet
parsing if a new connection is about to be created. If the limit is
reached, the allocation is cancelled and the INITIAL packet is dropped.
The decrement is done under quic_conn_release(). This means that
quic_cc_conn instances are not taken into account. This seems safe
enough because quic_cc_conn are only used for minimal usage.
The counterpart of this change is that maxconn must not be checked a
second time when listener_accept() is done over a QUIC connection. For
this, a new bind_conf flag BC_O_XPRT_MAXCONN is set for listeners when
maxconn is already counted by the lower layer. For the moment, it is
positionned only for QUIC listeners.
Without this patch, haproxy process could suffer from heavy memory/CPU
load if the number of concurrent handshake is high.
This patch is not considered a bug fix per-se. However, it has a major
benefit to protect against too many QUIC handshakes. As such, it should
be backported up to 2.6. For this, it relies on the following patch :
"MINOR: frontend: implement a dedicated actconn increment function"
Each time a new SSL context is allocated, global.sslconns is
incremented. If global.maxsslconn is reached, the allocation is
cancelled.
This procedure was not entirely thread-safe due to the check and
increment operations conducted at different stage. This could lead to
global.maxsslconn slightly exceeded when several threads allocate SSL
context while sslconns is near the limit.
To fix this, use a CAS operation in a do/while loop. This code is
similar to the actconn/maxconn increment for connection.
A new function increment_sslconn() is defined for this operation. For
the moment, only SSL code is using it. However, it is expected that QUIC
will also use it to count QUIC connections as SSL ones.
This should be backported to all stable releases. Note that prior to the
2.6, sslconns was outside of global struct, so this commit should be
slightly adjusted.
When a new frontend connection is instantiated, actconn global counter
is incremented. If global maxconn value is reached, the connection is
cancelled. This ensures that system limit are under control.
Prior to this patch, the atomic check/increment operations were done
directly into listener_accept(). Move them in a dedicated function
increment_actconn() in frontend module. This will be useful when QUIC
connections will be counted in actconn counter.
Now when calling ha_panic() with a thread still under malloc_trim(),
we'll set a new tainted flag to easily report it, and the output
trace will report that this condition happened and will suggest to
use no-memory-trimming to avoid it in the future.
William suggested that since we can detect the presence of Lua in the
stack, let's combine it with stuck detection to set a new pair of flags
indicating a stuck Lua context and a stuck Lua shared context.
Now, executing an infinite loop in a Lua sample fetch function with
yield disabled crashes with tainted=0xe40 if loaded from a lua-load
statement, or tainted=0x640 from a lua-load-per-thread statement.
In addition, at the end of the panic dump, we can check if Lua was
seen stuck and emit recommendations about lua-load-per-thread and
the choice of dependencies depending on the presence of threads
and/or shared context.
This will make it easier to know that the panic function was called,
for the occasional case where the dump crashes and/or the stack is
corrupted and not much exploitable. Now at least it will be sufficient
to check the tainted value to know that someone called ha_panic(), and
it will also be usable to condition extra analysis.
This function allows to safely map proxy mode to corresponding proto_mode
This will allow for easier code maintenance and prevent mixups between
proxy mode and proto mode.
In 9a74a6c ("MAJOR: log: introduce log backends"), a mistake was made:
it was assumed that the proxy mode was already known during server
keyword parsing in parse_server() function, but this is wrong.
Indeed, "mode log" can be declared late in the proxy section. Due to this,
a simple config like this will cause the process to crash:
|backend test
|
| server name 127.0.0.1:8080
| mode log
In order to fix this, we relax some checks in _srv_parse_init() and store
the address protocol from str2sa_range() in server struct, then we set-up
a postparsing function that is to be called after config parsing to
finish the server checks/initialization that depend on the proxy mode
to be known. We achieve this by checking the PR_CAP_LB capability from
the parent proxy to know if we're in such case where the effective proxy
mode is not yet known (it is assumed that other proxies which are implicit
ones don't provide this possibility and thus don't suffer from this
constraint).
Only then, if the capability is not found, we immediately perform the
server checks that depend on the proxy mode, else the check is postponed
and it will automatically be performed during postparsing thanks to the
REGISTER_POST_SERVER_CHECK() hook.
Note that we remove the SRV_PARSE_IN_LOG_BE flag because it was introduced
in the above commit and it is no longer relevant.
No backport needed unless 9a74a6c gets backported.
The two recent commits below each added one flag to h2c but omitted to
update the __APPEND_FLAG macro used by dev/flags so they are not
properly decoded:
3dd963b35 ("BUG/MINOR: mux-h2: fix http-request and http-keep-alive timeouts again")
68d02e5fa ("BUG/MINOR: mux-h2: make up other blocked streams upon removal from list")
This can be backported along with these commits.
Define a new function srv_add_to_avail_list(). This function is used to
centralize connection insertion in available tree. It reuses a BUG_ON()
statement to ensure the connection is not present in the idle list.
Reverse HTTP bind is very specific in that in rely on a server to
initiate connection. All connection settings are defined on the server
line and ignored from the bind line.
Before this patch, most of keywords were silently ignored. This could
result in a configuration from doing unexpected things from the user
point of view. To improve this situation, add a new 'rhttp_ok' field in
bind_kw structure. If not set, the keyword is forbidden on a reverse
bind line and will cause a fatal config error.
For the moment, only the following keywords are usable with reverse bind
'id', 'name' and 'nbconn'.
This change is safe as it's already forbidden to mix reverse and
standard addresses on the same bind line.
Previously, maxconn keyword was reused for a specific usage on reverse
HTTP binds to specify the number of active connect to proceed. To avoid
confusion, introduce a new dedicated keyword 'nbconn' which is specific
to reverse HTTP bind.
This new keyword is forbidden for non-reverse listener. A fatal error is
emitted during config parsing if this rule is not respected. It's safe
because it's also forbidden to mix standard and reverse addresses on the
same bind line.
Internally, nbconn value will be reassigned to 'maxconn' member of
bind_conf structure. This ensures that listener layer will automatically
reenable the preconnect task each time a connection is closed.
Reverse HTTP listeners are very specific and share only a very limited
subset of keywords with other listeners. As such, it is probable
meaningless to mix standard and reverse addresses on the same bind line.
This patch emits a fatal error during configuration parsing if this is
the case.
The number of updates sent at once was limited to not loop too long to emit
updates when the buffer size is huge or when the number of sync tables is
huge. The limit can be configured and is set to 200 by default. However,
this fix introduced a bug. It is impossible to syncrhonize two peers if the
number of tables is higher than this limit. Thus by default, it is not
possible to sync two peers if there are more than 200 tables to sync.
Technically speacking, a teaching process is finished if we loop on all tables
with no new update messages sent. Because we are limited at each call, the loop
is splitted on several calls. However the restart point for the next loop is
always the last table for which we emitted an update message. Thus with more
tables than the limit, the loop never reachs the end point.
Worse, in conjunction with the bug fixed by "BUG/MEDIUM: peers: Be sure to
always refresh recconnect timer in sync task", it is possible to trigger the
watchdog because the applets may be woken up in loop and leave requesting
more room while its buffer is empty.
To fix the issue, restart conditions for a teaching loop were changed. If
the teach process is interrupted, we now save the restart point, called
stop_local_table. It is the last evaluated table on the previous loop. This
restart point is reset when the teach process is finished.
In additionn, the updates_sent variable in peer_send_msgs() was renamed to
updates to avoid ambiguities. Indeed, the variable is incremented, whether
messages were sent or not.
This patch must be backported as far as 2.6.
Stefan Behte reported that since commit f279a2f14 ("BUG/MINOR: mux-h2:
refresh the idle_timer when the mux is empty"), the http-request and
http-keep-alive timeouts don't work anymore on H2. Before this patch,
and since 3e448b9b64 ("BUG/MEDIUM: mux-h2: make sure control frames do
not refresh the idle timeout"), they would only be refreshed after stream
frames were sent (HEADERS or DATA) but the patch above that adds more
refresh points broke these so they don't expire anymore as long as
there's some activity.
We cannot just revert the fix since it also addressed an isse by which
sometimes the timeout would trigger too early and provoque truncated
responses. The right approach here is in fact to only use refresh the
idle timer when the mux buffer was flushed from any such stream frames.
In order to achieve this, we're now setting a flag on the connection
whenever we write a stream frame, and we consider that flag when deciding
to refresh the buffer after it's emptied. This way we'll only clear that
flag once the buffer is empty and there were stream data in it, not if
there were no such stream data. In theory it remains possible to leave
the flag on if some control data is appended after the buffer and it's
never cleared, but in practice it's not a problem as a buffer will always
get sent in large blocks when the window opens. Even a large buffer should
be emptied once in a while as control frames will not fill it as much as
data frames could.
Given the patch above was backported as far as 2.6, this patch should
also be backported as far as 2.6.
tune.rcvbuf.client and tune.rcvbuf.server are not suitable for shared
dgram sockets because they're per connection so their units are not the
same. However, QUIC's listener and log servers are not connected and
take per-thread or per-process traffic where a socket log buffer might
be too small, causing undesirable packet losses and retransmits in the
case of QUIC. This essentially manifests in listener mode with new
connections taking a lot of time to set up under heavy traffic due to
the small queues causing delays. Let's add a few new settings allowing
to set these shared socket sizes on the frontend and backend side (which
reminds that these are per-front/back and not per client/server hence
not per connection).
Instead of speaking of an initialisation stage for each data
fast-forwarding, we now use the negociate term. Thus init_ff/init_fastfwd
functions were renamed nego_ff/nego_fastfwd.
The zero-copy forwarding or the mux-to-mux forwarding is a way to
fast-forward data without using the channels buffers. Data are transferred
from a mux to the other one. The kernel splicing is an optimization of the
zero-copy forwarding. But it can also use normal buffers (but not channels
ones). This way, it could be possible to fast-forward data with muxes not
supporting the kernel splicing (H2 and H3 muxes) but also with applets.
However, this mode can introduce regressions or bugs in future (just like
the kernel splicing). Thus, It could be usefull to disable this optim. To do
so, in configuration, the global tune settting
'tune.disable-zero-copy-forwarding' may be set in a global section or the
'-dZ' command line parameter may be used to start HAProxy. Of course, this
also disables the kernel splicing.
Because channel_is_empty() function does now only check the channel's
buffer, we can remove it and rely on co_data() instead. Of course, all tests
must be inverted.
channel_is_empty() is thus removed.
It is important to split channels and I/O buffers. When data are pushed in
an I/O buffer, we consider them as forwarded. The channel never sees
them. Fast-forwarded data are now handled in the SE only.
When data were sent using the kernel splicing, we tried to send all data
with no restriction. Most of time it is valid. However, because the payload
representation may differ between the producer and the consumer, it is
important to be able to specify how must data to send via the splicing.
Of course, for performance reason, it is important to maximize amount of
data send via splicing at each call. However, on edge-cases, this now can be
limited.
The kernel splicing support was totally remove waiting for the mux-to-mux
fast-forward implementation. So corresponding mux callbacks can be removed
now.
mux-to-mux fast-forwarding will be added. To avoid mix with the splicing and
simplify the commits, the kernel splicing support is removed from the
stconn. CF_KERN_SPLICING flag is removed and the support is no longer tested
in process_stream().
In the stconn part, rcv_pipe() callback function is no longer called.
Reg-tests scripts testing the kernel splicing are temporarly marked as
broken.
To perform the mux-to-mux data fast-forwarding, 4 new callbacks were added
into the mux_ops structure. 2 callbacks will be used from the stconn for
fast-forward data. The 2 other callbacks will be used by the endpoint to
request an iobuf to the opposite endpoint.
* fastfwd() callback function is used by a producer to forward data
* resume_fastfwd() callback function is used by a consumer if some data are
blocked in the iobuf, to resume the data forwarding.
* init_fastfwd() must be used by an endpoint (the producer one), inside the
fastfwd() callback to request an iobuf to the opposite side (the consumer
one).
* done_fastfwd() must be used by an endpoint (the producer one) at the end
of fastfwd() to notify the opposite endpoint (the consumer one) if data
were forwarded or not.
This API is still under development, so it may evolved. Especially when the
fast-forward will be extended to applets.
2 helper functions were also added into the SE api to wrap init_fastfwd()
and done_fastfwd() callback function of the underlying endpoint.
For now, this API is unsed and not implemented at all in muxes.
It is unused for now, but the iobuf structure now owns a pointer to a
buffer. This buffer will be used to perform mux-to-mux fast-forwarding when
splicing is not supported or unusable. This pointer should be filled by an
endpoint to let the opposite one forward data.
Extra fields, in addition to the buffer, are mandatory because the buffer
may already contains some data. the ".offset" field may be used may be used
as the position to start to copy data. Finally, the amount of data copied in
this buffer must be saved in ".data" field.
Some flags are also added to prepare next changes. And helper stconn
fnuctions are updated to also count data in the buffer. For a first
implementation, it is not planned to handle data in the buffer and in the
pipe in same time. But it will be possible to do so.
Instead of talking about kernel splicing at stconn/sedesc level, we now try
to talk about mux-to-mux fast-forwarding. To do so, 2 functions were added
to know if there are fast-forwarded data and to retrieve this amount of
data. Of course, for now, there is only data in a pipe.
In addition, some flags were renamed to reflect this notion. Note the
channel's documentation was not updated yet.
The pipes used to put data when the kernel splicing is in used are moved in
the SE descriptors. For now, it is just a simple remplacement but there is a
major difference with the pipes in the channel. The data are pushed in the
consumer's pipe while it was pushed in the producer's pipe. So it means the
request data are now pushed in the pipe of the backend SE descriptor and
response data are pushed in the pipe of the frontend SE descriptor.
The idea is to hide the pipe from the channel/SC side and to be able to
handle fast-forwading in pipe but also in buffer. To do so, the pipe is
inside a new entity, called iobuf. This entity will be extended.
An interesting issue was met when testing the mux-to-mux forwarding code.
In order to preserve fairness, in h2_snd_buf() if other streams are waiting
in send_list or fctl_list, the stream that is attempting to send also goes
to its list, and will be woken up by h2_process_mux() or h2_send() when
some space is released. But on rare occasions, there are only a few (or
even a single) streams waiting in this list, and these streams are just
quickly removed because of a timeout or a quick h2_detach() that calls
h2s_destroy(). In this case there's no even to wake up the other waiting
stream in its list, and this will possibly resume processing after some
client WINDOW_UPDATE frames or even new streams, so usually it doesn't
last too long and it not much noticeable, reason why it was left that
long. In addition, measures have shown that in heavy network-bound
benchmark, this exact situation happens on less than 1% of the streams
(reached 4% with mux-mux).
The fix here consists in replacing these LIST_DEL_INIT() calls on
h2s->list with a function call that checks if other streams were queued
to the send_list recently, and if so, which also tries to resume them
by calling h2_resume_each_sending_h2s(). The detection of late additions
is made via a new flag on the connection, H2_CF_WAIT_INLIST, which is set
when a stream is queued due to other streams being present, and which is
cleared when this is function is called.
It is particularly difficult to reproduce this case which is particularly
timing-dependent, but in a constrained environment, a test involving 32
conns of 20 streams each, all downloading a 10 MB object previously
showed a limitation of 17 Gbps with lots of idle CPU time, and now
filled the cable at 25 Gbps.
This should be backported to all versions where it applies.
"log-bufsize" may now be used for a log server (in a log backend) to
configure the bufsize of implicit ring associated to the server (which
defaults to BUFSIZE).
hash lb algorithm can be configured with the "log-balance hash <cnv_list>"
directive. With this algorithm, the user specifies a converter list with
<cnv_list>.
The produced log message will be passed as-is to the provided converter
list, and the resulting hash will be used to select the log server that
will receive the log message.
split sample_process() in 2 parts in order to be able to only process
the converter part of a sample expression from an existing input sample
struct passed as parameter.
Allow the use of the "none" hash-type function so that the key resulting
from the sample expression is directly used as the hash.
This can be useful to do the hashing manually using available hashing
converters, or even custom ones, and then inform haproxy that it can
directly rely on the sample expression result which is explictly handled
as an integer in this case.
Using "mode log" in a backend section turns the proxy in a log backend
which can be used to log-balance logs between multiple log targets
(udp or tcp servers)
log backends can be used as regular log targets using the log directive
with "backend@be_name" prefix, like so:
| log backend@mybackend local0
A log backend will distribute log messages to servers according to the
log load-balancing algorithm that can be set using the "log-balance"
option from the log backend section. For now, only the roundrobin
algorithm is supported and set by default.
This helper function can be used to create a new sink from an existing
server struct (and thus existing proxy as well), in order to spare some
resources when possible.
implicit rings were automatically forced to the parent logger format, but
this was done upon ring creation.
This is quite restrictive because we might want to choose the desired
format right before generating the log header (ie: when producing the
log message), depending on the logger (log directive) that is
responsible for the log message, and with current logic this is not
possible. (To this day, we still have dedicated implicit ring per log
directive, but this might change)
In ring_write(), we check if the sink->fmt is specified:
- defined: we use it since it is the most precise format
(ie: for named rings)
- undefined: then we fallback to the format from the logger
With this change, implicit rings' format is now set to UNSPEC upon
creation. This is safe because the log header building function
automatically enforces the "raw" format when UNSPEC is set. And since
logger->format also defaults to "raw", no change of default behavior
should be expected.
Introduce log_header struct to easily pass log header data between
functions and use that to simplify the logic around log header
handling.
While at it, some outdated comments were updated as well.
No change in behavior should be expected.
log targets were immediately embedded in logger struct (previously
named logsrv) and could not be used outside of this context.
In this patch, we're introducing log_target type with the associated
helper functions so that it becomes possible to declare and use log
targets outside of loggers scope.
When 'log' directive was implemented, the internal representation was
named 'struct logsrv', because the 'log' directive would directly point
to the log target, which used to be a (UDP) log server exclusively at
that time, hence the name.
But things have become more complex, since today 'log' directive can point
to ring targets (implicit, or named) for example.
Indeed, a 'log' directive does no longer reference the "final" server to
which the log will be sent, but instead it describes which log API and
parameters to use for transporting the log messages to the proper log
destination.
So now the term 'logsrv' is rather confusing and prevents us from
introducing a new level of abstraction because they would be mixed
with logsrv.
So in order to better designate this 'log' directive, and make it more
generic, we chose the word 'logger' which now replaces logsrv everywhere
it was used in the code (including related comments).
This is internal rewording, so no functional change should be expected
on user-side.
CIDs tree is now allocated dynamically since the following commit :
276697438d
MINOR: quic: Use a pool for the connection ID tree.
This can caused a crash if qc_new_conn() is interrupted due to an
intermediary failed allocation. When freeing all connection members,
free_quic_conn_cids() is used. However, this function does not support a
NULL cids.
To fix this, simply check that cids is NULL during free_quic_conn_cids()
prologue.
This bug was reproduced using -dMfail.
No need to backport.
Since commit 5afcb686b ("MAJOR: connection: purge idle conn by last usage")
in 2.9-dev4, the test on conn->toremove_list added to conn_get_idle_flag()
in 2.8 by commit 3a7b539b1 ("BUG/MEDIUM: connection: Preserve flags when a
conn is removed from an idle list") becomes misleading. Indeed, now both
toremove_list and idle_list are shared by a union since the presence in
these lists is mutually exclusive. However, in conn_get_idle_flag() we
check for the presence in the toremove_list to decide whether or not to
delete the connection from the tree. This test now fails because instead
it sees the presence in the idle or safe list via the union, and concludes
the element must not be removed. Thus the element remains in the tree and
can be found later after the connection is released, causing crashes that
Tristan reported in issue #2292.
The following config is sufficient to reproduce it with 2 threads:
defaults
mode http
timeout client 5s
timeout server 5s
timeout connect 1s
listen front
bind :8001
server next 127.0.0.1:8002
frontend next
bind :8002
timeout http-keep-alive 1
http-request redirect location /
Sending traffic with a few concurrent connections and some short timeouts
suffices to instantly crash it after ~10k reqs:
$ h2load -t 4 -c 16 -n 10000 -m 1 -w 1 http://0:8001/
With Amaury we analyzed the conditions in which the function is called
in order to figure a better condition for the test and concluded that
->toremove_list is never filled there so we can safely remove that part
from the test and just move the flag retrieval back to what it was prior
to the 2.8 patch above. Note that the patch is not reverted though, as
the parts that would drop the unexpected flags removal are unchanged.
This patch must NOT be backported. The code in 2.8 works correctly, it's
only the change in 2.9 that makes it misbehave.
Move all QUIC trace definitions from quic_conn.h to quic_trace-t.h. Also
remove multiple definition trace_quic macro definition into
quic_trace.h. This forces all QUIC source files who relies on trace to
include it while reducing the size of quic_conn.h.
This bug was detected when compiling haproxy against aws-lc TLS stack
during QUIC interop runner tests. Some algorithms could be negotiated by haproxy
through the TLS stack but not fully supported by haproxy QUIC implentation.
This leaded tls_aead() to return NULL (same thing for tls_md(), tls_hp()).
As these functions returned values were never checked, they could triggered
segfaults.
To fix this, one closes the connection as soon as possible with a
handshake_failure(40) TLS alert. Note that as the TLS stack successfully
negotiates an algorithm, it provides haproxy with CRYPTO data before entering
->set_encryption_secrets() callback. This is why this callback
(ha_set_encryption_secrets() on haproxy side) is modified to release all
the CRYPTO frames before triggering a CONNECTION_CLOSE with a TLS alert. This is
done calling qc_release_pktns_frms() for all the packet number spaces.
Modify some quic_tls_keys_hexdump to avoid crashes when the ->aead or ->hp EVP_CIPHER
are NULL.
Modify qc_release_pktns_frms() to do nothing if the packet number space passed
as parameter is not intialized.
This bug does not impact the QUIC TLS compatibily mode (USE_QUIC_OPENSSL_COMPAT).
Thank you to @ilia-shipitsin for having reported this issue in GH #2309.
Must be backported as far as 2.6.
The openssl-compat.h file has some function which were implemented in
order to provide compatibility with openssl < 1.0.0. Most of them where
to support the 0.9.8 version, but we don't support this version anymore.
This patch removes the deprecated code from openssl-compat.h
Many actions take arguments after a parenthesis. When this happens, they
have to be tagged in the parser with KWF_MATCH_PREFIX so that a sub-word
is sufficient (since by default the whole block including the parenthesis
is taken).
The problem with this is that the parser stops on the first match. This
was OK years ago when there were very few actions, but over time new ones
were added and many actions are the prefix of another one (e.g. "set-var"
is the prefix of "set-var-fmt"). And what happens in this case is that the
first word is picked. Most often that doesn't cause trouble because such
similar-looking actions involve the same custom parser so actually the
wrong selection of the first entry results in the correct parser to be
used anyway and the error to be silently hidden.
But it's getting worse when accidentally declaring prefixes in multiple
files, because in this case it will solely depend on the object file link
order: if the longest name appears first, it will be properly detected,
but if it appears last, its other prefix will be detected and might very
well not be related at all and use a distinct parser. And this is random
enough to make some actions succeed or fail depending on the build options
that affect the linkage order. Worse: what if a keyword is the prefix of
another one, with a different parser but a compatible syntax ? It could
seem to work by accident but not do the expected operations.
The correct solution is to always look for the longest matching name.
This way the correct keyword will always be matched and used and there
will be no risk to randomly pick the wrong anymore.
This fix must be backported to the relevant stable releases.
sc_need_room() function may be called with a negative value. In this case,
the intent is to be notified if any space was made in the channel buffer. In
the function, we get the min between the requested room and the maximum
possible room in the buffer, considering it may be an HTX buffer.
However this max value is unsigned and leads to an unsigned comparison,
casting the negative value to an unsigned value. Of course, in this case,
this always leads to the wrong result. This bug seems to have no effect but
it is hard to be sure.
To fix the issue, we take care to respect the requested room sign by casting
the max value to a signed integer.
This patch must be backported to 2.8.
now forward_px only serves as a hint to know if a proxy was created
specifically for the sink, in which case the sink is responsible for it.
Everywhere forward_px was used in appctx context: get the parent proxy from
the sft->srv instead.
This permits to finally get rid of the double link dependency between sink
and proxy.
The regtests are using the "feature()" predicate but this one can only
rely on build-time options. It would be nice if some runtime-specific
options could be detected at boot time so that regtests could more
flexibly adapt to what is supported (capabilities, splicing, etc).
Similarly, certain features that are currently enabled with USE_XXX
could also be automatically detected at build time using ifdefs and
would simplify the configuration, but then we'd lose the feature
report in the feature list which is convenient for regtests.
This patch makes sure that haproxy -vv shows the variable's contents
and not the macro's contents, and adds a new hap_register_feature()
to allow the code to register a new keyword.
This patch add a hash of the Origin header to the cache's secondary key.
This enables to manage store responses that have a "Vary: Origin" header
in the cache when vary is enabled.
This cannot be considered as a means to manage CORS requests though, it
only processes the Origin header and hashes the presented value without
any form of URI normalization.
This need was expressed by Philipp Hossner in GitHub issue #251.
Co-Authored-by: Philipp Hossner <philipp.hossner@posteo.de>
This patch fixes the build with AWSLC and USE_QUIC=1, this is only meant
to be able to build for now and it's not feature complete.
The set_encryption_secrets callback has been split in set_read_secret
and set_write_secret.
Missing features:
- 0RTT was disabled.
- TLS1_3_CK_CHACHA20_POLY1305_SHA256, TLS1_3_CK_AES_128_CCM_SHA256 were disabled
- clienthello callback is missing, certificate selection could be
limited (RSA + ECDSA at the same time)
If a "Content-length" or "Transfer-Encoding; chunked" headers is found or
inserted in an outgoing message, a specific flag is now set on the H1
stream. H1S_F_HAVE_CLEN is set for "Content-length" header and
H1S_F_HAVE_CHNK for "Transfer-Encoding: chunked".
This will be useful to properly format outgoing messages, even if one of
these headers was removed by hand (with no update of the message meta-data).
Refactor alloc_bind_address() function which is used to allocate a
sockaddr if a connection to a target server relies on a specific source
address setting.
The main objective of this change is to be able to use this function
outside of backend module, namely for preconnections using a reverse
server. As such, this function is now exported globally.
For reverse connect, there is no stream instance. As such, the function
parts which relied on it were reduced to the minimal. Now, stream is
only used if a non-static address is configured which is useful for
usesrc client|clientip|hdr_ip. These options have no sense for reverse
connect so it should be safe to use the same function.
Improve EACCES permission errors encounterd when using QUIC connection
socket at runtime :
* First occurence of the error on the process will generate a log
warning. This should prevent users from using a privileged port
without mandatory access rights.
* Socket mode will automatically fallback to listener socket for the
receiver instance. This requires to duplicate the settings from the
bind_conf to the receiver instance to support configurations with
multiple addresses on the same bind line.
Define a new bind option quic-socket :
quic-socket [ connection | listener ]
This new setting works in conjunction with the existing configuration
global tune.quic.socket-owner and reuse the same semantics.
The purpose of this setting is to allow to disable connection socket
usage on listener instances individually. This will notably be useful
when needing to deactivating it when encountered a fatal permission
error on bind() at runtime.
When EBO was brought to pl_take_w() by plock commit 60d750d ("plock: use
EBO when waiting for readers to leave in take_w() and stow()"), a mistake
was made: the mask against which the current value of the lock is tested
excludes the first reader like in stow(), but it must not because it was
just obtained via an ldadd() which means that it doesn't count itself.
The problem this causes is that if there is exactly one reader when a
writer grabs the lock, the writer will not wait for it to leave before
starting its operations.
The solution consists in checking for any reader in the IF. However the
mask passed to pl_wait_unlock_*() must still exclude the lowest bit as
it's verified after a subsequent load.
Kudos to Remi Tricot-Le Breton for reporting and bisecting this issue
with a reproducer.
No backport is needed since this was brought in 2.9-dev3 with commit
8178a5211 ("MAJOR: threads/plock: update the embedded library again").
The code is now on par with plock commit ada70fe.
Add a new MUX flag MX_FL_REVERSABLE. This value is used to indicate that
MUX instance supports connection reversal. For the moment, only HTTP/2
multiplexer is flagged with it.
This allows to dynamically check if reversal can be completed during MUX
installation. This will allow to relax requirement on config writing for
'tcp-request session attach-srv' which currently cannot be used mixed
with non-http/2 listener instances, even if used conditionnally with an
ACL.
Define a new error code for connection CO_ER_REVERSE. This will be used
to report an issue which happens on a connection targetted for reversal
before reverse process is completed.
This reverts commit 072e774939.
Doing h2load with h3 tests we notice this behavior:
Client ---- INIT no token SCID = a , DCID = A ---> Server (1)
Client <--- RETRY+TOKEN DCID = a, SCID = B ---- Server (2)
Client ---- INIT+TOKEN SCID = a , DCID = B ---> Server (3)
Client <--- INIT DCID = a, SCID = C ---- Server (4)
Client ---- INIT+TOKEN SCID = a, DCID = C ---> Server (5)
With (5) dropped by haproxy due to token validation.
Indeed the previous patch adds SCID of retry packet sent to the aad
of the token ciphering aad. It was useful to validate the next INIT
packets including the token are sent by the client using the new
provided SCID for DCID as mantionned into the RFC 9000.
But this stateless information is lost on received INIT packets
following the first outgoing INIT packet from the server because
the client is also supposed to re-use a second time the lastest
received SCID for its new DCID. This will break the token validation
on those last packets and they will be dropped by haproxy.
It was discussed there:
https://mailarchive.ietf.org/arch/msg/quic/7kXVvzhNCpgPk6FwtyPuIC6tRk0/
To resume: this is not the role of the server to verify the re-use of
retry's SCID for DCID in further client's INIT packets.
The previous patch must be reverted in all versions where it was
backported (supposed until 2.6)
When a stream is caught looping, we produce some output to help figure
its internal state explaining why it's looping. The problem is that this
debug output is quite old and the info it provides are quite insufficient
to debug a modern process, and since such bugs happen only once or twice
a year the situation doesn't improve.
On the other hand the output of "show sess all" is extremely detailed
and kept up to date with code evolutions since it's a heavily used
debugging tool.
This commit replaces the call to the totally outdated stream_dump() with
a call to strm_dump_to_buffer(), and removes the filters dump since they
are already emitted there, and it now produces much more exploitable
output:
[ALERT] (5936) : A bogus STREAM [0x7fa8dc02f660] is spinning at 5653514 calls per second and refuses to die, aborting now! Please report this error to developers:
0x7fa8dc02f660: [28/Sep/2023:09:53:08.811818] id=2 proto=tcpv4 source=127.0.0.1:58306
flags=0xc4a, conn_retries=0, conn_exp=<NEVER> conn_et=0x000 srv_conn=0x133f220, pend_pos=(nil) waiting=0 epoch=0x1
frontend=public (id=2 mode=http), listener=? (id=1) addr=127.0.0.1:4080
backend=public (id=2 mode=http) addr=127.0.0.1:61932
server=s1 (id=1) addr=127.0.0.1:7443
task=0x7fa8dc02fa40 (state=0x01 nice=0 calls=5749559 rate=5653514 exp=3s tid=1(1/1) age=1s)
txn=0x7fa8dc02fbf0 flags=0x3000 meth=1 status=-1 req.st=MSG_DONE rsp.st=MSG_RPBEFORE req.f=0x4c rsp.f=0x00
scf=0x7fa8dc02f5f0 flags=0x00000482 state=EST endp=CONN,0x7fa8dc02b4b0,0x05004001 sub=1 rex=58s wex=<NEVER>
h1s=0x7fa8dc02b4b0 h1s.flg=0x100010 .sd.flg=0x5004001 .req.state=MSG_DONE .res.state=MSG_RPBEFORE
.meth=GET status=0 .sd.flg=0x05004001 .sc.flg=0x00000482 .sc.app=0x7fa8dc02f660
.subs=0x7fa8dc02f608(ev=1 tl=0x7fa8dc02fae0 tl.calls=0 tl.ctx=0x7fa8dc02f5f0 tl.fct=sc_conn_io_cb)
h1c=0x7fa8dc0272d0 h1c.flg=0x0 .sub=0 .ibuf=0@(nil)+0/0 .obuf=0@(nil)+0/0 .task=0x7fa8dc0273f0 .exp=<NEVER>
co0=0x7fa8dc027040 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=LISTENER:0x12840c0
flags=0x00000300 fd=32 fd.state=20 updt=0 fd.tmask=0x2
scb=0x7fa8dc02fb30 flags=0x00001411 state=EST endp=CONN,0x7fa8dc0300c0,0x05000001 sub=1 rex=58s wex=<NEVER>
h1s=0x7fa8dc0300c0 h1s.flg=0x4010 .sd.flg=0x5000001 .req.state=MSG_DONE .res.state=MSG_RPBEFORE
.meth=GET status=0 .sd.flg=0x05000001 .sc.flg=0x00001411 .sc.app=0x7fa8dc02f660
.subs=0x7fa8dc02fb48(ev=1 tl=0x7fa8dc02feb0 tl.calls=2 tl.ctx=0x7fa8dc02fb30 tl.fct=sc_conn_io_cb)
h1c=0x7fa8dc02ff00 h1c.flg=0x80000000 .sub=1 .ibuf=0@(nil)+0/0 .obuf=0@(nil)+0/0 .task=0x7fa8dc030020 .exp=<NEVER>
co1=0x7fa8dc02fcd0 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=SERVER:0x133f220
flags=0x10000300 fd=33 fd.state=10421 updt=0 fd.tmask=0x2
req=0x7fa8dc02f680 (f=0x1840000 an=0x8000 pipe=0 tofwd=0 total=79)
an_exp=<NEVER> buf=0x7fa8dc02f688 data=(nil) o=0 p=0 i=0 size=0
htx=0xc18f60 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
res=0x7fa8dc02f6d0 (f=0x80000000 an=0x1400000 pipe=0 tofwd=0 total=0)
an_exp=<NEVER> buf=0x7fa8dc02f6d8 data=(nil) o=0 p=0 i=0 size=0
htx=0xc18f60 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
call trace(10):
| 0x59f2b7 [0f 0b 0f 1f 80 00 00 00]: stream_dump_and_crash+0x1f7/0x2bf
| 0x5a0d71 [e9 af e6 ff ff ba 40 00]: process_stream+0x19f1/0x3a56
| 0x68d7bb [49 89 c7 4d 85 ff 74 77]: run_tasks_from_lists+0x3ab/0x924
| 0x68e0b4 [29 44 24 14 8b 4c 24 14]: process_runnable_tasks+0x374/0x6d6
| 0x656f67 [83 3d f2 75 84 00 01 0f]: run_poll_loop+0x127/0x5a8
| 0x6575d7 [48 8b 1d 42 50 5c 00 48]: main+0x1b22f7
| 0x7fa8e0f35e45 [64 48 89 04 25 30 06 00]: libpthread:+0x7e45
| 0x7fa8e0e5a4af [48 89 c7 b8 3c 00 00 00]: libc:clone+0x3f/0x5a
Note that the output is subject to the global anon key so that IPs and
object names can be anonymized if required. It could make sense to
backport this and the few related previous patches next time such an
issue is reported.
There used to be two working modes for this function, a single-line one
and a multi-line one, the difference being made on the "eol" argument
which could contain either a space or an LF (and with the prefix being
adjusted accordingly). Let's get rid of the single-line mode as it's
what limits the output contents because it's difficult to produce
exploitable structured data this way. It was only used in the rare case
of spinning streams and applets and these are the ones lacking info. Now
a spinning stream produces:
[ALERT] (3511) : A bogus STREAM [0x227e7b0] is spinning at 5581202 calls per second and refuses to die, aborting now! Please report this error to developers:
strm=0x227e7b0,c4a src=127.0.0.1 fe=public be=public dst=s1
txn=0x2041650,3000 txn.req=MSG_DONE,4c txn.rsp=MSG_RPBEFORE,0
rqf=1840000 rqa=8000 rpf=80000000 rpa=1400000
scf=0x24af280,EST,482 scb=0x24af430,EST,1411
af=(nil),0 sab=(nil),0
cof=0x7fdb28026630,300:H1(0x24a6f60)/RAW((nil))/tcpv4(33)
cob=0x23199f0,10000300:H1(0x24af630)/RAW((nil))/tcpv4(32)
filters={}
call trace(11):
(...)
Since 2.4-dev18 with commit b4476c6a8 ("CLEANUP: freq_ctr: make
arguments of freq_ctr_total() const"), most of the freq_ctr readers
should be fine with a const, except that they were not updated to
reflect this and they continue to force variable on some functions
that call them. Let's update this. This could even be backported if
needed.
Added set-timeout for frontend side of session, so it can be used to set
custom per-client timeouts if needed. Added cur_client_timeout to fetch
client timeout samples.
Add reporting using send_log() for preconnect operation. This is minimal
to ensure we understand the current status of listener in active reverse
connect.
To limit logging quantity, only important transition are considered.
This requires to implement a minimal state machine as a new field in
receiver structure.
Here are the logs produced :
* Initiating : first time preconnect is enabled on a listener
* Error : last preconnect attempt interrupted on a connection error
* Reaching maxconn : all necessary connections were reversed and are
operational on a listener
When a connection is freed during preconnect before reversal, the error
must be notified to the listener to remove any connection reference and
rearm a new preconnect attempt. Currently, this can occur through 2 code
paths :
* conn_free() called directly by H2 mux
* error during conn_create_mux(). For this case, connection is flagged
with CO_FL_ERROR and reverse_connect task is woken up. The process
task handler is then responsible to call conn_free() for such
connection.
Duplicated steps where done both in conn_free() and process task
handler. These are now removed. To facilitate code maintenance,
dedicated operation have been centralized in a new function
rev_notify_preconn_err() which is called by conn_free().
This patch adds the ability to externalize and customize the code
of the computation of extra CIDs after the first one was derived from
the ODCID.
This is to prepare interoperability with extra components such as
different QUIC proxies or routers for instance.
To process the patch defines two function callbacks:
- the first one to compute a hash 64bits from the first generated CID
(itself continues to be derived from ODCID). Resulting hash is stored
into the 'quic_conn' and 64bits is chosen large enought to be able to
store an entire haproxy's CID.
- the second callback re-uses the previoulsy computed hash to derive
an extra CID using the custom algorithm. If not set haproxy will
continue to choose a randomized CID value.
Those two functions have also the 'cluster_secret' passed as an argument:
this way, it is usable for obfuscation or ciphering.
Function comments were outdated, probably because they have not been
updated during the previous refactors.
Fixing comments to better reflect the current behavior.
This may be backported up to 2.2, or even 2.0 by slightly adapting the
patch (in 2.0, such functions are documented in proto/pattern.h)
The ring lock was initially mostly used for the logs and used to inherit
its name in lock stats. Now that it's exclusively used by rings, let's
rename it accordingly.
The log server lock is pretty visible in perf top when using log samples
because it's taken for each server in turn while trying to validate and
update the log server's index. Let's change this for a CAS, since we have
the index and the range at hand now. This allow us to remove the logsrv
lock.
The test on 4 servers now shows a 3.7 times improvement thanks to much
lower contention. Without log sampling a test producing 4.4M logs/s
delivers 4.4M logs/s at 21 CPUs used, everything spent in the kernel.
After enabling 4 samples (1:4, 2:4, 3:4 and 4:4), the throughput would
previously drop to 1.13M log/s with 37 CPUs used and 75% spent in
process_send_log(). Now with this change, 4.25M logs/s are emitted,
using 26 CPUs and 22% in process_send_log(). That's a 3.7x throughput
improvement for a 30% global CPU usage reduction, but in practice it
mostly shows that the performance drop caused by having samples is much
less noticeable (each of the 4 servers has its index updated for each
log).
Note that in order to even avoid incrementing an index for each log srv
that is consulted, it would be more convenient to have a single index
per frontend and apply the modulus on each log server in turn to see if
the range has to be updated. It would then only perform one write per
range switch. However the place where this is done doesn't have access
to a frontend, so some changes would need to be performed for this, and
it would require to update the current range independently in each
logsrv, which is not necessarily easier since we don't know yet if we
can commit it.
By using a single long long to store both the current range and the
next index, we'll make it possible to perform atomic operations instead
of locking. Let's only regroup them for now under a new "curr_rg_idx".
The upper word is the range, the lower is the index.
This index is useless because it only serves to know when the global
index reached the end, while the global one already knows it. Let's
just drop it and perform the test on the global range.
It was verified with the following config that the first server continues
to take 1/10 of the traffic, the 2nd one 2/10, the 3rd one 3/10 and the
4th one 4/10:
log 127.0.0.1:10001 sample 1:10 local0
log 127.0.0.1:10002 sample 2,5:10 local0
log 127.0.0.1:10003 sample 3,7,9:10 local0
log 127.0.0.1:10004 sample 4,6,8,10:10 local0
The test of the log range is not very clear, in part due to the
reuse of the "curr_idx" name that happens at two levels. The call
to in_smp_log_range() applies to the smp_info's index to which 1 is
added: it verifies that the next index is still within the current
range.
Let's just have a local variable "next_index" in process_send_log()
that gets assigned the next index (current+1) and compare it to the
current range's boundaries. This makes the test much clearer. We can
then simply remove in_smp_log_range() that's no longer needed.
This reverts commit c618ed5ff4.
The list iterator is broken. As found by Fred, running QUIC single-
threaded shows that only the first connection is accepted because the
accepter relies on the element being initialized once detached (which
is expected and matches what MT_LIST_DELETE_SAFE() used to do before).
However while doing this in the quic_sock code seems to work, doing it
inside the macro show total breakage and the unit test doesn't work
anymore (random crashes). Thus it looks like the fix is not trivial,
let's roll this back for the time it will take to fix the loop.
In 1.9 with commit 627505d36 ("MINOR: freq_ctr: add swrate_add_scaled()
to work with large samples") we got the ability to indicate when adding
some values that they represent a number of samples. However there is an
issue in the calculation which is that the number of samples that is
added to the sum before the division in order to avoid fading away too
fast, is multiplied by the scale. The problem it causes is that this is
done in the negative part of the expression, and that as soon if the sum
of old_sum and v*s is too small (e.g. zero), we end up with a negative
value of -s.
This is visible in "show pools" which occasionally report a very large
value on "needed_avg" since 2.9, though the bug has been there for longer.
Indeed in 2.9 since they're hashed in buckets, it suffices that any
thread got one such error once for the sum to be wrong. One possible
impact is memory usage not shrinking after a short burst due to pools
refraining from releasing objects, believing they don't have enough.
This must be backported to all versions. Note that the opportunistic
version can be dropped before 2.8.
The new mt_list code supports exponential back-off on conflict, which
is important for use cases where there is contention on a large number
of threads. The API evolved a little bit and required some updates:
- mt_list_for_each_entry_safe() is now in upper case to explicitly
show that it is a macro, and only uses the back element, doesn't
require a secondary pointer for deletes anymore.
- MT_LIST_DELETE_SAFE() doesn't exist anymore, instead one just has
to set the list iterator to NULL so that it is not re-inserted
into the list and the list is spliced there. One must be careful
because it was usually performed before freeing the element. Now
instead the element must be nulled before the continue/break.
- MT_LIST_LOCK_ELT() and MT_LIST_UNLOCK_ELT() have always been
unclear. They were replaced by mt_list_cut_around() and
mt_list_connect_elem() which more explicitly detach the element
and reconnect it into the list.
- MT_LIST_APPEND_LOCKED() was only in haproxy so it was left as-is
in list.h. It may however possibly benefit from being upstreamed.
This required tiny adaptations to event_hdl.c and quic_sock.c. The
test case was updated and the API doc added. Note that in order to
keep include files small, the struct mt_list definition remains in
list-t.h (par of the internal API) and was ifdef'd out in mt_list.h.
A test on QUIC with both quictls 1.1.1 and wolfssl 5.6.3 on ARM64 with
80 threads shows a drastic reduction of CPU usage thanks to this and
the refined memory barriers. Please note that the CPU usage on OpenSSL
3.0.9 is significantly higher due to the excessive use of atomic ops
by openssl, but 3.1 is only slightly above 1.1.1 though:
- before: 35 Gbps, 3.5 Mpps, 7800% CPU
- after: 41 Gbps, 4.2 Mpps, 2900% CPU
This bug arrived with this commit in 2.9-dev3:
MEDIUM: quic: Allow the quic_conn memory to be asap released.
When sending packets from quic_cc_conn_io_cb(), e.g. when the quic_conn
object has been released and replaced by a lighter one (quic_cc_conn),
some counters may have to be incremented. But they were not reachable
because not shared between quic_conn and quic_cc_conn struct.
To fix this, one has only to move the ->cntrs counters from quic_conn
to QUIC_CONN_COMMON struct which is shared between both quic_cc_conn
Thank you to Tristan for having reported this in GH #2247.
No need to backport.
It's a bit frustrating sometimes to see pool checks catch a bug but not
provide exploitable information without a core.
Here we're adding a function "pool_inspect_item()" which is called just
before aborting in pool_check_pattern() and POOL_DEBUG_CHECK_MARK() and
which will display the error type, the pool's pointer and name, and will
try to check if the item's tag matches the pool, and if not, will iterate
over all pools to see if one would be a better candidate, then will try
to figure the last known caller and possibly other likely candidates if
the pool's tag is not sufficiently trusted. This typically helps better
diagnose corruption in use-after-free scenarios, or freeing to a pool
that differs from the one the object was allocated from, and will also
indicate calling points that may help figure where an object was last
released or allocated. The info is printed on stderr just before the
backtrace.
For example, the recent off-by-one test in the PPv2 changes would have
produced the following output in vtest logs:
*** h1 debug|FATAL: pool inconsistency detected in thread 1: tag mismatch on free().
*** h1 debug| caller: 0x62bb87 (conn_free+0x147/0x3c5)
*** h1 debug| pool: 0x2211ec0 ('pp_tlv_256', size 304, real 320, users 1)
*** h1 debug|Tag does not match. Possible origin pool(s):
*** h1 debug| tag: @0x2565530 = 0x2216740 (pp_tlv_128, size 176, real 192, users 1)
*** h1 debug|Recorded caller if pool 'pp_tlv_128':
*** h1 debug| @0x2565538 (+0184) = 0x62c76d (conn_recv_proxy+0x4cd/0xa24)
A mismatch in the allocated/released pool is already visible, and the
callers confirm it once resolved, where the allocator indeed allocates
from pp_tlv_128 and conn_free() releases to pp_tlv_256:
$ addr2line -spafe ./haproxy <<< $'0x62bb87\n0x62c76d'
0x000000000062bb87: conn_free at connection.c:568
0x000000000062c76d: conn_recv_proxy at connection.c:1177
In preparation for more detailed pool error reports, let's pass the
caller pointers to the check functions. This will be useful to produce
messages indicating where the issue happened.
When recording the caller of a pool_alloc(), we currently store it only
when the object comes from the cache and never when it comes from the
heap. There's no valid reason for this except that the caller's pointer
was not passed to pool_alloc_nocache(), so it used to set NULL there.
Let's just pass it down the chain.
These ones were still in cfgparse.c but they're not specific to the
config at all and may actually be used even when parsing cpu list
entries in /sys. Better move them where they can be reused.
cpu_map is 8.2kB/entry and there's one such entry per group, that's
~520kB total. In addition, the init code is still in haproxy.c enclosed
in ifdefs. Let's make this a dynamically allocated array in the cpuset
code and remove that init code.
Later we may even consider reallocating it once the number of threads
and groups is known, in order to shrink it a little bit, as the typical
setup with a single group will only need 8.2kB, thus saving half a MB
of RAM. This would require that the upper bound is placed in a variable
though.
This function takes on input a printf format for the file name, making
it particularly suitable for /proc or /sys entries which take a lot of
numbers. It also automatically trims the trailing CR and/or LF chars.
The function generate_random_cluster_secret() which initializes the cluster secret
when not supplied by configuration is buggy. There 1/256 that the cluster secret
string is empty.
To fix this, one stores the cluster as a reduced size first 128 bits of its own
SHA1 (160 bits) digest, if defined by configuration. If this is not the case, it
is initialized with a 128 bits random value. Furthermore, thus the cluster secret
is always initialized.
As the cluster secret is always initialized, there are several tests which
are for now on useless. This patch removes such tests (if(global.cluster_secret))
in the QUIC code part and at parsing time: no need to check that a cluster
secret was initialized with "quic-force-retry" option.
Must be backported as far as 2.6.
This patch implements the 'curves' keyword on server lines as well as
the 'ssl-default-server-curves' keyword in the global section.
It also add the keyword on the server line in the ssl_curves reg-test.
These keywords allow the configuration of the curves list for a server.
We currently know the number of tasks in the run queue that are niced,
and we don't expose it. It's too bad because it can give a hint about
what share of the load is relevant. For example if one runs a Lua
script that was purposely reniced, or if a stats page or the CLI is
hammered with slow operations, seeing them appear there can help
identify what part of the load is not caused by the traffic, and
improve monitoring systems or autoscalers.
When building the secondary signature for cache entries when vary is
enabled, the referer part of the signature was a simple crc32 of the
first referer header.
This patch changes it to a 64bits hash based of xxhash algorithm with a
random seed built during init. This will prevent "malicious" hash
collisions between entries of the cache.
We previously had postparsing logic but only for logsrv sinks, but now we
need to make this operation on logsrv directly instead of sinks to prepare
for additional postparsing logic that is not sink-specific.
To do this, we migrated post_sink_resolve() and sink_postresolve_logsrvs()
to their postresolve_logsrvs() and postresolve_logsrv_list() equivalents.
Then, we split postresolve_logsrv_list() so that the sink-only logic stays
in sink.c (sink_resolve_logsrv_buffer() function), and the "generic"
target part stays in log.c as resolve_logsrv().
Error messages formatting was preserved as far as possible but some slight
variations are to be expected.
As for the functional aspect, no change should be expected.
All applets only check the -1 error value (need room) for applet_put*
functions while the underlying functions may also return -2 if the input is
closed or -3 if the data length is invalid. It means applets already handle
other cases by their own.
The API should be fixed but for now, to ease backports, we only fix
applet_put* functions to always return -1 on error. This way, at least for
the applets point of view, the API is consistent.
This patch should be backported to 2.8. Probably not further. Except if we
suspect it could fix a bug.
Due to the fact that several variable values (rtt_var, srtt) were stored as multiple
of their real values, some calculations were less accurate as expected.
Stop storing 4*rtt_var values, and 8*srtt values.
Adjust all the impacted statements.
Must be backported as far as 2.6.
This detects when there are more threads bound via cpu-map than CPUs
enabled in cpu-map, or when there are more total threads than the total
number of CPUs available at boot (for unbound threads) and configured
for bound threads. In this case, a warning is emitted to explain the
problems it will cause, and explaining how to address the situation.
Note that some configurations will not be detected as faulty because
the algorithmic complexity to resolve all arrangements grows in O(N!).
This means that having 3 threads on 2 CPUs and one thread on 2 CPUs
will not be detected as it's 4 threads for 4 CPUs. But at least configs
such as T0:(1,4) T1:(1,4) T2:(2,4) T3:(3,4) will not trigger a warning
since they're valid.
It's very easy to mess up with some cpu-map directives and to leave
some thread unbound. Let's add a test that checks that either all
threads are bound or none are bound, but that we do not face the
intermediary situation where some are pinned and others are left
wandering around, possibly on the same CPUs as bound ones.
Note that this should not be backported, or maybe turned into a
notice only, as it appears that it will easily catch invalid
configs and that may break updates for some users.
Till now the CPUs that were bound were only retrieved in
thread_cpus_enabled() in order to count the number of CPUs allowed,
and it relied on arch-specific code.
Let's slightly arrange this into ha_cpuset_detect_bound() that
reuses the ha_cpuset struct and the accompanying code. This makes
the code much clearer without having to carry along some arch-specific
stuff out of this area.
Note that the macos-specific code used in thread.c to only count
online CPUs but not retrieve a mask, so for now we can't infer
anything from it and can't implement it.
In addition and more importantly, this function is reliable in that
it will only return a value when the detection is accurate, and will
not return incomplete sets on operating systems where we don't have
an exact list, such as online CPUs.
When building without threads, the recently introduced BUG_ON(tid != 0)
turns to a constant expression that evaluates to 0 and that is not used,
resulting in this warning:
src/connection.c: In function 'conn_free':
src/connection.c:584:3: warning: statement with no effect [-Wunused-value]
This is because the whole thing is declared as an expression for clarity.
Make it return void to avoid this. No backport is needed.
This adds a new option for the Makefile USE_OPENSSL_AWSLC, and
update the documentation with instructions to use HAProxy with
AWS-LC.
Update the type of the OCSP callback retrieved with
SSL_CTX_get_tlsext_status_cb with the actual type for
libcrypto versions greater than 1.0.2. This doesn't affect
OpenSSL which casts the callback to void* in SSL_CTX_ctrl.
For the same reason than the previous patch, we must not block the sends
when there is a pending shutdown. In other words, we must consider the sends
are allowed when there is a pending shutdown.
This patch must slowly be backported as far as 2.2. It should partially fix
issue #2249.
The progressive adoption of OpenSSL 3 and its abysmal handshake
performance has started to reveal situations where it simply isn't
possible anymore to succesfully run health checks on many servers,
because between the moment all the checks are started and the moment
the handshake finally completes, the timeout has expired!
This also has consequences on production traffic which gets
significantly delayed as well, all that for lots of checks. While it's
possible to increase the check delays, it doesn't solve everything as
checks still take a huge amount of time to converge in such conditions.
Here we take a different approach by permitting to enforce the maximum
concurrent checks per thread limitation and implementing an ordered
queue. Thanks to this, if a thread about to start a check has reached
its limit, it will add the check at the end of a queue and it will be
processed once another check is finished. This proves to be extremely
efficient, with all checks completing in a reasonable amount of time
and not being disturbed by the rest of the traffic from other checks.
They're just cycling slower, but at the speed the machine can handle.
One must understand however that if some complex checks perform multiple
exchanges, they will take a check slot for all the required duration.
This is why the limit is not enforced by default.
Tests on SSL show that a limit of 5-50 checks per thread on local
servers gives excellent results already, so that could be a good starting
point.