This function is particularly useful to dump unknown areas watching
for opportunistic symbols, so let's move it to tools.c so that we can
reuse it a little bit more.
When a corruption was detected in an object, it's often said that the
tag doesn't match the pool, but it should also check if it matches the
location of an earlier pool_free() call, which happens when -dMcaller
is used. That's what we're doing now.
In 2.9 with commit 7968fe3889 ("MEDIUM: stick-table: change the ref_cnt
atomically") we significantly relaxed the stick-tables locking when
dealing with peers by adjusting the ref_cnt atomically and moving it
out of the lock.
However it opened a tiny window that became problematic in 3.0-dev7
when the table's contention was lowered by commit 1a088da7c2 ("MAJOR:
stktable: split the keys across multiple shards to reduce contention").
What happens is that some peers may access the entry for reading at
the moment it's about to expire, and while the read accesses to push
the data remain unnoticed (possibly that from time to time we push
crap), but the releasing of the refcount causes a new write that may
damage anything else. The scenario is the following:
process_table_expire() peer_send_teachmsgs()
RDLOCK(&updt_lock);
tick_is_expired() != 0
ebmb_delete(ts->key);
if (ts->upd.node.leaf_p) {
HA_ATOMIC_INC(&ts->ref_cnt);
RDUNLOCK(&updt_lock);
WRLOCK(&updt_lock);
eb32_delete(&ts->upd);
}
__stksess_free(t, ts);
peer_send_updatemsg(ts);
RDLOCK(&updt_lock);
HA_ATOMIC_DEC(&ts->ref_cnt);
Here it's clear that the bottom part of peer_send_teachmsgs() believes
to be protected but may act on freed data.
This is more visible when enabling -dMtag,no-merge,integrity because
the ATOMIC_DEC(&ref_cnt) decrements one byte in the area, that makes
the eviction check fail while the tag has the address of the left
__stksess_free(), proving a completed pool_free() before the decrement,
and the anomaly there is pretty visible in the crash dump. Changing
INC()/DEC() with ADD(2)/DEC(2) shows that the byte is now off by two,
confirming that the operation happened there.
The solution is not very hard, it consists in checking for the ref_cnt
on the left after grabbing the lock, and doing both before deleting the
element, so that we have the guarantee that either the peer will not
take it or that it has already started taking it.
This was proven to be sufficient, as instead of crashing after 3s of
injection with 4 peers, 16 threads and 130k RPS, it survived for 15mn.
In order to stress the setup, a config involving 4+ peers, tracking
HTTP request with randoms and applying a bwlim-out filter with a
random key, with a client made of 160 h2 conns downloading 10 streams
of 4MB objects in parallel managed to trigger it within a few seconds:
frontend ft
http-request track-sc0 rand(100000) table tbl
filter bwlim-out lim-out limit 2047m key rand(100000000),ipmask(32) min-size 1 table tbl
http-request set-bandwidth-limit lim-out
use_backend bk
backend bk
server s1 198.18.0.30:8000
server s2 198.18.0.34:8000
backend tbl
stick-table type ip size 1000k expire 1s store http_req_cnt,bytes_in_rate(1s),bytes_out_rate(1s) peers peers
This seems to be very dependent on the timing and setup though.
This will need to be backported to 2.9. This part of the code was
reindented with shards but the block should remain mostly unchanged.
The logic to apply is the same.
Sending "trace peers event" on the CLI crashes because the event list
in the peers is not finished. This was introduced in 2.4 by commit
d865935f32 ("MINOR: peers: Add traces to peer_treat_updatemsg().")
so this must be backported to 2.4.
When adding the shards support to tables with commit 1a088da7c ("MAJOR:
stktable: split the keys across multiple shards to reduce contention"),
the condition to stop eliminating entries based on the batch size being
reached is based on a pre-decrement of the max_search counter, but now
it goes back into the outer loop which doesn't check it, so next time
it does it when entering the next shard, it will become even more
negative and will properly stop, but at first glance it looks like an
int overflow (which it is not). Let's make sure the outer loop stops
on this condition so that we don't continue searching when the limit
is reached.
While changing the stick-table indexing that led to commit 1a088da7c
("MAJOR: stktable: split the keys across multiple shards to reduce
contention"), I met a problem with the task's expiration date being
incorrectly updated, I fixed it and apparently I committed the wrong
version :-/
The effect is that the task's date is only correctly reset if the
table is empty, otherwise the task wakes up again and is queued at
the previous date, eating 100% CPU. The tick_isfirst() must not be
used when storing the last result.
No backport is needed as this was only merged in 3.0-dev7.
crt-list will be enhanced with ckch_conf keywords, however these keywords
does not fill the 'ssl_conf' structure. So we don't need to allocate the
ssl_conf for every options between [ ] but only when we found a relevant
one.
'crt-store' is a new section useful to define the struct ckch_store.
The "load" keyword in the "crt-store" section allows to define which
files you want to load for a specific certificate definition.
Ex:
crt-store
load crt "site1.crt" key "site1.key"
load crt "site2.crt" key "site2.key"
frontend in
bind *:443 ssl crt "site1.crt" crt "site2.crt"
This is part of the certificate loading which was discussed in #785.
When cache and stats applets were changed to use their own buffers, a change
was also performed to no longer access the stream from the I/O
handller. Among other things, the HTTP start-line of the request is now
retrieved to get the method. But, when these changes were brought, the inbuf
buffer allocation failures were not handled.
It is of course not so common. But if this happens, a crash may be
experienced. To fix the issue, we now check for inbuf allocation failures
before accessing it.
No backported needed.
We observed that a dynamic server which health check is down for longer
than slowstart delay at startup doesn't trigger the warmup phase, it
receives full traffic immediately. This has been confirmed by checking
haproxy UI, weight is immediately the full one (e.g. 75/75), without any
throttle applied. Further tests showed that it was similar if it was in
maintenance, and even when entering a down or maintenance state after
being up.
Another issue is that if the server is down for less time than
slowstart, when it comes back up, it briefly has a much higher weight
than expected for a slowstart.
An easy way to reproduce is to do the following:
- Add a server with e.g. a 20s slowstart and a weight of 10 in config
file
- Put it in maintenance using CLI (set server be1/srv1 state maint)
- Wait more than 20s, enable it again (set server be1/srv1 state ready)
- Observe UI, weight will show 10/10 immediately.
If server was down for less than 20s, you'd briefly see a weight and
throttle value that is inconsistent, e.g. 50% throttle value and a
weight of 5 if server comes back up after 10s before going back to
6% after a second or two.
Code analysis shows that the logic in server_recalc_eweight stops the
warmup task by setting server's next state to SRV_ST_RUNNING if it
didn't change state for longer than the slowstart duration, regardless
of its current state. As a consequence, a server being down or disabled
for longer than the slowstart duration will never enter the warmup phase
when it will be up again.
Regarding the weight when server comes back up, issue is that even if
the server is down, we still compute its next weight as if it was up,
hence when it comes back up, it can briefly have a much higher weight
than expected during slowstart, until the warmup task is called again
after last_change is updated.
This patch aims to fix both issues.
The CPU variable, when used, is almost always exclusively used with
"generic" to disable any CPU-specific optimizations, or "native" to
enable "-march=native". Other options are not used and are just making
CPU_CFLAGS more confusing.
This commit just drops all pre-configured variants and replaces them
with documentation about examples of supported options. CPU_CFLAGS is
preserved as it appears that it's mostly used as a proxy to inject the
distro's CFLAGS, and it's just empty by default.
The CPU variable is checked, and if set to anything but "generic", it
emits a warning about its deprecation and invites the user to read
INSTALL.
Users who would just set CPU_CFLAGS will be able to continue to do so,
those who were using CPU=native will have to pass CPU_CFLAGS=-march=native
and those who were passing one of the other options will find it in the
doc as well.
Note that this also removes the "CPU=" line from haproxy -vv, that most
users got used to seeing set to "generic" or occasionally "native"
anyway, thus that didn't provide any useful information.
This option has been set by default for a very long time and also
complicates the manipulation of the DEBUG variable. Let's make it
the official default and permit to unset it by setting it to zero.
The other pool-related DEBUG options were adjusted to also explicitly
check for the zero value for consistency.
Some compilers report this on the cache:
src/cache.c:235: warning: 'release_entry_locked' declared inline after being called
src/cache.c:235: warning: previous declaration of 'release_entry_locked' was here
And indeed, the function is first declared non-inline and later inline.
Let's just set the inline status from the beginning. It's not really
needed to backport this.
Using an invalid GUID for guid_insert() causes a crash. This is easily
reproducible using for example an invalid character with "guid" keyword.
Here is the provided backtrace :
Thread 1 "haproxy" received signal SIGSEGV, Segmentation fault.
0x00005555561fda95 in guid_insert (objt=0x520000002080, uid=0x519000002dac "@foo2", errmsg=0x7ffff4c0a7a0)
at src/guid.c:83
83 ha_free(&guid->node.key);
This error is present in guid_insert() cleanup parts. GUID node is not
allocated in case of an early error so it's impossible to dereference it
to free guid.node.key. Fix this simply by using an intermediary pointer
key.
This does not need to be backported.
In HTTP keep-alive, if we face a connection error to the server while
sending the request, the error should not be reported, and the client-side
connection should simply be closed, so that client knows it can retry.
If the error happens during the connection stage, there is two cases. We
have a connection timeout or an allocation error. In this case, the 503
response must be skipped if it is not the first request on the client-side
connection. Or we have a connection error. In this case, the 503 response
must be skipped if it is a reused server connection. Otherwise, during the
connection stage, the 503-Service-unavailable response is delivered to the
client. The part works properly.
If the error happens after this stage, the 502-Bad-gateway response
delivering should only be based on the server-side connection status. For a
reused server connection, the client-side connection must be closed with no
reponses. However, for a fresh server-side connection, a 502-Bad-gateway
response must be delivered to the client. Unfortunately, This part is
buggy. Only the client-side connection state is considered and the response
is skipped if it is not the first request for the same client connection.
The bug is not so visbile in HTTP/1.1 but in H2 and H3 it is pretty annoying
because for a connection, requests are multiplexed, in parallels. It means
there is no first request. So, because of this bug, for H2 and H3,
502-Bad-gateway responses because of a connection error before receiveing
the response are always skipped.
To fix the issue, in http_wait_for_response() analyser, we must only rely on
SF_SRV_REUSED stream flag to skip the 502 response or not. This flag is set
if the server connection was reused.
The bug is their since a while. SF_SRV_REUSED flag was added in the version
1.5 especially to fix this kind of bug. But only the 503 case was fixed.
This patch should fix the issue #2285. It must be backported to every stable
versions.
qc_send() is implemented as a loop to repeatedly invoke
qc_prep_pkts()/qc_send_ppkts(). This ensures that all data are emitted
even if bigger that a single Tx buffer instance. This is useful if
congestion window is empty but big enough for application data.
Looping is interrupted if qc_prep_pkts() returns a negative error
code, for example due to no space left in congestion window. It can also
returns 0 if no input data to sent, which also interrupt the loop.
To limit this last case, removed quic_enc_level from send_list each time
everything already send via qc_prep_pkts(). Loop can then be interrupted
as soon as send_list is empty, avoiding an extra superfluous call to
qc_prep_pkts().
qc_send() was systematically called by quic_conn IO handlers with all
instantiated quic_enc_level. Change this to only register quic_enc_level
for send if needed. Do not call at all qc_send() if no qel registered.
A new function qel_need_sending() is defined to detect if sending is
required. First, it checks if quic_enc_level has prepared frames or
probing is set. It can also returns true if ACK required either on
quic_enc_level itself or because of quic_conn ack timer fired. Finally,
a CONNECTION_CLOSE emission for quic_conn is also a valid case.
This should reduce the number of invocations of qc_send(). This could
improve slightly performance, as well as simplify traces debugging.
A series of previous patches have clean up sending function for
handshake case. Their new exposed API is now flexible enough to convert
app case to use the same functions.
As such, qc_send_hdshk_pkts() is renamed qc_send() and become the single
entry point for QUIC emission. It is used during application packets
emission in quic_conn_app_io_cb(), qc_send_mux(). Also the internal
function qc_prep_hpkts() is renamed qc_prep_pkts().
Remove the new unneeded qc_send_app_pkts() and qc_prep_app_pkts().
Also removed qc_send_app_probing(). It was a simple wrapper over other
application send functions. Now, default qc_send() can be reuse for such
cases with <old_data> argument set to true.
An adjustment was needed when converting qc_send_hdshk_pkts() to the
general qc_send() version. Previously, only a single packets
encoding/emission cycle was performed. This was enough as handshake
packets are always smaller than Tx buffer. However, it may be possible
to emit more application data. As such, a loop is necessary to perform
multiple encoding/emission cycles, as this was already the case in
qc_send_app_pkts().
No functional difference should happen with this commit. However, as
these are critcal functions with a lot of changes, this patch is
labelled as medium.
quic_conn_io_cb() manually implements emission by using lower level
functions qc_prep_pkts() and qc_send_ppkts(). Replace this by using the
higher level function qc_send_hdshk_pkts() which notably handle buffer
allocation and purging.
This allows to clean up send API by flagging qc_prep_pkts() and
qc_send_ppkts() as static. They are now used in a single location inside
qc_send_hdshk_pkts().
qc_send_hdshk_pkts() is a wrapper for qc_prep_hpkts() used on
retransmission. It was restricted to use two quic_enc_level pointers as
distinct arguments. Adapt it to directly use the same list of
quic_enc_level which is passed then to qc_prep_hpkts().
Now for retransmission quic_enc_level send list is built directly into
qc_dgrams_retransmit() which calls qc_send_hdshk_pkts().
Along this change, a new utility function qel_register_send() is
defined. It is an helper to build the quic_enc_level send list. It
enfores that each quic_enc_level instance is only registered in a single
list to prevent memory issues. It is both used in qc_dgrams_retransmit()
and quic_conn_io_cb().
Emission of packets during handshakes was implemented via an API which
uses two alternative ways to specify the list of frames.
The first one uses a NULL list of quic_enc_level as argument for
qc_prep_hpkts(). This was an implicit method to iterate on all qels
stored in quic_conn instance, with frames already inserted in their
corresponding quic_pktns.
The second method was used for retransmission. It uses a custom local
quic_enc_level list specified by the caller as input to qc_prep_hpkts().
Frames were accessible through <retransmit> list pointers of each
quic_enc_level used in an implicit mechanism.
This commit clarifies the API by using a single common method. Now
quic_enc_level list must always be specified by the caller. As for
frames list, each qels must set its new field <send_frms> pointer to the
list of frames to send. Callers of qc_prep_hpkts() are responsible to
always clear qels send list. This prevent a single instance of
quic_enc_level to be inserted while being attached to another list.
This allows notably to clean up some unnecessary code. First,
<retransmit> list of quic_enc_level is removed as it is replaced by new
<send_frms>. Also, it's now possible to use proper list_for_each_entry()
inside qc_prep_hpkts() to loop over each qels. Internal functions for
quic_enc_level selection is now removed.
Clean up trailer of qc_send_hdshk_pkts() by removing label "leave". Only
"out" label is now used. This operation is safe as LIST_DEL_INIT() is
idempotent. Caller of qc_send_hdshk_pkts() also ensures input frame
lists are freed, so it's better to always reset quic_enc_level
<retrans_frms> member.
Also take the opportunity to reset QUIC_FL_CONN_RETRANS_OLD_DATA only if
already set. This is considered more robust and will also remove
unneeded trace occurences.
No functional change. The main objective of this commit is to clean up
code in preparation of a refactoring on send functions.
In c83684519 ("MEDIUM: log: add the ability to include samples in logs")
we checked the return value of lf_text_len() as an integer instead of
comparing the pointer with NULL explicitly. Since this may be confusing,
let's test the return value against NULL.
[ada: for backports, the patch needs to be applied manually because of
c6a713842 ("MINOR: log: simplify last_isspace in sess_build_logline()")]
According to snprintf() man page:
The functions snprintf() and vsnprintf() do not write more than
size bytes (including the terminating null byte ('\0')). If the
output was truncated due to this limit, then the return value is
the number of characters (excluding the terminating null byte)
which would have been written to the final string if enough space
had been available. Thus, a return value of size or more means
that the output was truncated.
However, in sess_build_logline(), each time we need to check the return
value of snprintf(), here is how we proceed:
iret = snprintf(tmplog, max, ...);
if (iret < 0 || iret > max)
// error
// success
tmplog += iret;
Here is the issue: if snprintf() lacks 1 byte space to write the
terminating NULL byte, it will return max. Which means in this case
that we fail to know that snprintf() truncated the output in reality,
and we still add iret to tmplog pointer. Considering sess_build_logline()
should NOT write more than <maxsize> bytes (including the terminating NULL
byte) as per the function description, in this case the function would
write <maxsize>+1 byte (to write the terminating NULL byte upon return),
which may lead to invalid write if <dst> was meant to hold <maxsize> bytes
at maximum.
Hopefully, this bug wasn't triggered so far because sess_build_logline()
is called with logline as <dst> argument and <global.max_syslog_len> as
<maxsize> argument, logline being initialized with 1 extra byte upon
startup.
But we better fix this to comply with the function description and prevent
any side-effect since some sess_build_logline() helpers may assume that
'tmplog-dst < maxsize' is always true. Also sess_build_logline() users
probably don't expect NULL-byte to be accounted for in the produced
logline length.
This should be backported to all stable versions.
[ada: for backports, the patch needs to be applied manually because of
c6a713842 ("MINOR: log: simplify last_isspace in sess_build_logline()")]
encode_{chunk,string}() is often found to be used this way:
ret = encode_{chunk,string}(start, stop...)
if (ret == NULL || *ret != '\0') {
//error
}
//success
Indeed, encode_{chunk,string} will always try to add terminating NULL byte
to the output string, unless no space is available for even 1 byte.
However, it means that for the caller to be able to spot an error, then it
must provide a buffer (here: start) which is already initialized.
But this is wrong: not only this is very tricky to use, but since those
functions don't return NULL on failure, then if the output buffer was not
properly initialized prior to calling the function, the caller will
perform invalid reads when checking for failure this way. Moreover, even
if the buffer is initialized, we cannot reliably tell if the function
actually failed this way because if the buffer was previously initialized
with NULL byte, then the caller might think that the call actually
succeeded (since the function didn't return NULL and didn't update the
buffer).
Also, sess_build_logline() relies lf_encode_{chunk,string}() functions
which are in fact wrappers for encode_{chunk,string}() functions and thus
exhibit the same error handling mechanism. It turns out that
sess_build_logline() makes unsafe use of those functions because it uses
the error-checking logic mentionned above while buffer (tmplog) is not
guaranteed to be initialized when entering the function. This may
ultimately cause malfunctions or invalid reads if the output buffer is
lacking space.
To fix the issue once and for all and prevent similar bugs from being
introduced, we make it so encode_{string, chunk} and escape_string()
(based on encode_string()) now explicitly return NULL on failure
(when the function failed to write at least the ending NULL byte)
lf_encode_{string,chunk}() helpers had to be patched as well due to code
duplication.
This should be backported to all stable versions.
[ada: for 2.4 and 2.6 the patch won't apply as-is, it might be helpful to
backport ae1e14d65 ("CLEANUP: tools: removing escape_chunk() function")
first, considering it's not very relevant to maintain a dead function]
In c5bff8e550 ("BUG/MINOR: log: improper behavior when escaping log data")
we fixed lf_text_len() behavior with +E (escape) option.
However we introduced an inconsistency if output buffer is too small to
hold the whole output and truncation occurs: indeed without +E option up
to <size> bytes (including NULL byte) will be used whereas with +E option
only <size-1> bytes will be used. Fixing the function and related comment
so that the function behaves the same in regards to truncation whether +E
option is used or not.
This should be backported to all stable versions.
When sharded listeners were introdcued in 2.5 with commit 6dfbef4145
("MEDIUM: listener: add the "shards" bind keyword"), a point was
overlooked regarding how IDs are assigned to listeners: they are just
duplicated! This means that if a "option socket-stats" is set and a
shard is configured, or multiple thread groups are enabled, then a stats
dump will produce several lines with exactly the same socket name and ID.
This patch tries to address this by trying to assign consecutive numbers
to these sockets. The usual algo is maintained, but with a preference for
the next number in a shard. This will help users reserve ranges for each
socket, for example by using multiples of 100 or 1000 on each bind line,
leaving enough room for all shards to be assigned.
The mechanism however is quite tricky, because the configured listener
currently ends up being the last one of the shard. This helps insert them
before the current position without having to revisit them. But here it
causes a difficulty which is that we'd like to restart from the current
ID and assign new ones on top of it. What is done is that the number is
passed between shards and the current one is cleared (and removed from
the tree) so that we instead insert the new one. It's tricky because of
the situation which depends whether it's the listener that was already
assigned on the bind line or not. But overall, always removing the entry,
always adding the new one when the ID is not zero, and passing them from
the reference to the next one does the trick.
This may be backported to all versions till 2.6.
When a command is too big to fit in a buffer, a error is returned before
closing. However, the error is also returned if the command is small enough
but incomplete. It happens on abort. In this case, the error must not be
reported. The regression was introduced when a dedicated sn_buf callbac
function was added.
To fix the issue, both cases are now handled separately.
No backport needed.
There are several places where the QUIC low-level code performs unaligned
accesses by casting unaligned char* pointers to uint32_t, but this is
totally forbidden as it only works on machines that support unaligned
accesses, and either crashes on other ones (SPARC, MIPS), can result in
reading garbage (ARMv5) or be very slow due to the access being emulated
(RISC-V). We do have functions for this, such as read_u32() and write_u32()
that rely on the compiler's knowledge of the machine's capabilities to
either perform an unaligned access or do it one byte at a time.
This must be backported at least as far as 2.6. Some of the code moved a
few times since, so in order to figure the points that need to be fixed,
one may look for a forced pointer cast without having verified that either
the machine is compatible or that the pointer is aligned using this:
$ git grep 'uint[36][24]_t \*)'
Or build and run the code on a MIPS or SPARC and perform requests using
curl to see if they work or crash with a bus error. All the places fixed
in this commit were found thanks to an immediate crash on the first
request.
This was tagged medium because the affected archs are not the most common
ones where QUIC will be found these days.
Since the Linux capabilities support add-on (see the commit bd84387beb
("MEDIUM: capabilities: enable support for Linux capabilities")), we can also
check haproxy process effective and permitted capabilities sets, when it
starts and runs as non-root.
Like this, if needed network capabilities are presented only in the process
permitted set, we can get this information with capget and put them in the
process effective set via capset. To do this properly, let's introduce
prepare_caps_from_permitted_set().
First, it checks if binary effective set has CAP_NET_ADMIN or CAP_NET_RAW. If
there is a match, LSTCHK_NETADM is removed from global.last_checks list to
avoid warning, because in the initialization sequence some last configuration
checks are based on LSTCHK_NETADM flag and haproxy process euid may stay
unpriviledged.
If there are no CAP_NET_ADMIN and CAP_NET_RAW in the effective set, permitted
set will be checked and only capabilities given in 'setcap' keyword will be
promoted in the process effective set. LSTCHK_NETADM will be also removed in
this case by the same reason. In order to be transparent, we promote from
permitted set only capabilities given by user in 'setcap' keyword. So, if
caplist doesn't include CAP_NET_ADMIN or CAP_NET_RAW, LSTCHK_NETADM would not
be unset and warning about missing priviledges will be emitted at
initialization.
Need to call it before protocol_bind_all() to allow binding to priviledged
ports under non-root and 'setcap cap_net_bind_service' must be set in the
global section in this case.
Linux capabilities support and ability to preserve it for running process
after switching to a global.uid was added recently by the commit bd84387beb
("MEDIUM: capabilities: enable support for Linux capabilities")).
This new feature hasn't yet been taken into account by last config checks,
which are performed at initialization stage.
So, to update it, let's perform it after set_identity() call. Like this,
current EUID is already changed to a global.uid and prepare_caps_for_setuid()
would unset LSTCHK_NETADM flag, only if capabilities given in the 'setcap'
keyword in the configuration file were preserved.
Otherwise, if system doesn't support Linux capabilities or they were not set
via 'setcap', we keep the previous strict behaviour: process will terminate
with an alert, in order to insist that user: either needs to change
run UID (worst case: start and run as root), or he needs to set/recheck
capabilities listed as 'setcap' arguments.
In the case, when haproxy will start and run under a non-root user this patch
doesn't change the previous behaviour: we'll still let him try the
configuration, but we inform via warning that unexpected things may occur.
Need to be backported until v2.9, including v2.9.
This commit is similar with the two previous ones. Its purpose is to add
GUID support on listeners. Due to bind_conf and listeners configuration,
some specifities were required.
Its possible to define several listeners on a single bind line, for
example by specifying multiple addresses. As such, it's impossible to
support a "guid" keyword on a bind line. The problem is exacerbated by
the cloning of listeners when sharding is used.
To resolve this, a new keyword "guid-prefix" is defined for bind lines.
It allows to specify a string which will be used as a prefix for
automatically generated GUID for each listeners attached to a bind_conf.
Automatic GUID listeners generation is implemented via a new function
bind_generate_guid(). It is called on post-parsing, after
bind_complete_thread_setup(). For each listeners on a bind_conf, a new
GUID is generated with bind_conf prefix and the index of the listener
relative to other listeners in the bind_conf. This last value is stored
in a new bind_conf field named <guid_idx>. If a GUID cannot be inserted,
for example due to a non-unique value, an error is returned, startup is
interrupted with configuration rejected.
This commit is similar to previous one, except that it implements GUID
support for server instances. A guid_node field is inserted into server
structure. A new "guid" server keyword is defined.
Implement proxy identiciation through GUID. As such, a guid_node member
is inserted into proxy structure. A proxy keyword "guid" is defined to
allow user to fix its value.
GUID format is unspecified to allow users to choose the naming scheme.
Some restrictions however are added by this patch, mainly to ensure
coherence and memory usage.
The first restriction is on the length of GUID. No more than 127
characters can be used to prevent memory over consumption.
The second restriction is on the character set allowed in GUID. Utility
function invalid_char() is used for this : it allows alphanumeric
values and '-', '_', '.' and ':'.
Define a new module guid. Its purpose is to be able to attach a global
identifier for various objects such as proxies, servers and listeners.
A new type guid_node is defined. It will be stored in the objects which
can be referenced by such GUID. Several functions are implemented to
properly initialized, insert, remove and lookup GUID in a global tree.
Modification operations should only be conducted under thread isolation.
Currently, the way proxy-oriented logformat directives are handled is way
too complicated. Indeed, "log-format", "log-format-error", "log-format-sd"
and "unique-id-format" all rely on preparsing hints stored inside
proxy->conf member struct. Those preparsing hints include the original
string that should be compiled once the proxy parameters are known plus
the config file and line number where the string was found to generate
precise error messages in case of failure during the compiling process
that happens within check_config_validity().
Now that lf_expr API permits to compile a lf_expr struct that was
previously prepared (with original string and config hints), let's
leverage lf_expr_compile() from check_config_validity() and instead
of relying on individual proxy->conf hints for each logformat expression,
store string and config hints in the lf_expr struct directly and use
lf_expr helpers funcs to handle them when relevant (ie: original
logformat string freeing is now done at a central place inside
lf_expr_deinit(), which allows for some simplifications)
Doing so allows us to greatly simplify the preparsing logic for those 4
proxy directives, and to finally save some space in the proxy struct.
Also, since httpclient proxy has its "logformat" automatically compiled
in check_config_validity(), we now use the file hint from the logformat
expression struct to set an explicit name that will be reported in case
of error ("parsing [httpclient:0] : ...") and remove the extraneous check
in httpclient_precheck() (logformat was parsed twice previously..)
split parse_logformat_string() into two functions:
parse_logformat_string() sticks to the same behavior, but now becomes an
helper for lf_expr_compile() which uses explicit arguments so that it
becomes possible to use lf_expr_compile() without a proxy, but also
compile an expression which was previously prepared for compiling (set
string and config hints within the logformat expression to avoid manually
storing string and config context if the compiling step happens later).
lf_expr_dup() may be used to duplicate an expression before it is
compiled, lf_expr_xfer() now makes sure that the input logformat is
already compiled.
This is some prerequisite works for log-profiles implementation, no
functional change should be expected.
This patch tries to address a design flaw with how logformat expressions
are parsed from config. Indeed, some parse_logformat_string() calls are
performed during config parsing when the proxy mode is not yet known.
Here's a config example that illustrates the issue:
defaults
mode tcp
listen test
bind :8888
http-response set-header custom-hdr "%trl" # needs http
mode http
The above config should work, because the effective proxy mode is http,
yet haproxy fails with this error:
[ALERT] (99051) : config : parsing [repro.conf:6] : error detected in proxy 'test' while parsing 'http-response set-header' rule : format tag 'trl' is reserved for HTTP mode.
To fix the issue once and for all, let's implement smart postparsing for
logformat expressions encountered during config parsing:
- split parse_logformat_string() (and subfonctions) in order to create a
new lf_expr_postcheck() function that must be called to finish
preparing and checking the logformat expression once the proxy type is
known.
- save some config hints info during parse_logformat_string() to
generate more precise error messages during lf_expr_postcheck(), if
needed, we rely on curpx->conf.args.{file,line} hints for that because
parse_logformat_string() doesn't know about current file and line
number.
- lf_expr_postcheck() uses PR_FL_CHECKED proxy flag to know if the
function may try to make the proxy compatible with the expression, or
if it should simply fail as soon as an incompatibility is detected.
- if parse_logformat_string() is called from an unchecked proxy, then
schedule the expression for postparsing, else (ie: during runtime),
run the postcheck right away.
This change will also allow for some logformat expression error handling
simplifications in the future.
PR_FL_CHECKED is set on proxy once the proxy configuration was fully
checked (including postparsing checks).
This information may be useful to functions that need to know if some
config-related proxy properties are likely to change or not due to parsing
or postparsing/check logics. Also, during runtime, except for some rare cases
config-related proxy properties are not supposed to be changed.
log format expressions are broadly used within the code: once they are
parsed from input string, they are converted to a linked list of
logformat nodes.
We're starting to face some limitations because we're simply storing the
converted expression as a generic logformat_node list.
The first issue we're facing is that storing logformat expressions that
way doesn't allow us to add metadata alongside the list, which is part
of the prerequites for implementing log-profiles.
Another issue with storing logformat expressions as generic lists of
logformat_node elements is that it's starting to become really hard to
tell when we rely on logformat expressions or not in the code given that
there isn't always a comment near the list declaration or manipulation
to indicate that it's relying on logformat expressions under the hood,
so this adds some complexity for code maintenance.
This patch looks quite impressive due to changes in a lot of header and
source files (since logformat expressions are broadly used), but it does
a simple thing: it defines the lf_expr structure which itself holds a
generic list of logformat nodes, and then declares some helpers to
manipulate lf_expr elements and fixes the code so that we now exclusively
manipulate logformat_node lists as lf_expr elements outside of log.c.
For now, lf_expr struct only contains the list of logformat nodes (no
additional metadata), but now that we have dedicated type and helpers,
doing so in the future won't be problematic at all and won't require
extensive code changes.
This is a pretty simple patch despite requiring to make some visible
changes in the code:
When parsing a logformat string, log tags (ie: '%tag', AKA log tags) are
turned into logformat nodes with their type set to the type of the
corresponding logformat_tag element which was matched by name. Thus, when
"compiling" a logformat tag, we only keep a reference to the tag type
from the original logformat_tag.
For example, for "%B" log tag, we have the following logformat_tag
element:
{
.name = "B",
.type = LOG_FMT_BYTES,
.mode = PR_MODE_TCP,
.lw = LW_BYTES,
.config_callback = NULL
}
When parsing "%B" string, we search for a matching logformat tag
inside logformat_tags[] array using the provided name, once we find a
matching element, we craft a logformat node whose type will be
LOG_FMT_BYTES, but from the node itself, we no longer have access to
other informations that are set in the logformat_tag struct element.
Thus from a logformat_node resulting from a log tag, with current
implementation, we cannot easily get back to matching logformat_tag
struct element as it would require us to scan the whole logformat_tags
array at runtime using node->type to find the matching element.
Let's take a simpler path and consider all tag-specific LOG_FMT_*
subtypes as being part of the same logformat node type: LOG_FMT_TAG.
Thanks to that, we're now able to distinguish logformat nodes made
from logformat tag from other logformat nodes, and link them to
their corresponding logformat_tag element from logformat_tags[] array. All
it costs is a simple indirection and an extra pointer in logformat_node
struct.
While at it, all LOG_FMT_* types related to logformat tags were moved
inside log.c as they have no use outside of it since they are simply
lookup indexes for sess_build_logline() and could even be replaced by
function pointers some day...
rename logformat_type internal struct to logformat_tag to to make it less
confusing, then expose logformat_tag struct through header file so that it
can be referenced in other structs.
also rename logformat_keywords[] to logformat_tags[] for better
consistency.
What we use to call logformat variable in the code is referred as
log-format tag in the documentation. Having both 'var' and 'tag' labels
referring to the same thing is really confusing. Let's make the code
comply with the documentation by replacing all logformat var/variable/VAR
occurences with either tag or TAG.
No functional change should be expected, the only visible side-effect from
user point of view is that "variable" was replaced by "tag" in some error
messages.
When support for dynamic names was added for use_backend rules in
702d44f2f ("MEDIUM: proxy: support use_backend with dynamic names"), the
sample expression resulting from parse_logformat_string() was only freed
for non dynamic rules (when the expression resolved to a simple string
node). But for complex expressions (ie: multiple nodes), rule->dynamic
was set but the expression was never released, resulting in a small
memory leak when freeing the parent proxy.
To fix the issue, in free_proxy(), we free the switching rule expression
if the switching rule is dynamic.
This should be backported to every stable versions.
[ada: prior to 2.9, free_logformat_list() helper did not exist: we may
use the same manual sample expr freeing logic as in server_rules pruning
right above it]
As per the `sd_notify` manual:
> A field carrying the monotonic timestamp (as per CLOCK_MONOTONIC) formatted
> in decimal in μs, when the notification message was generated by the client.
> This is typically used in combination with "RELOADING=1", to allow the
> service manager to properly synchronize reload cycles. See systemd.service(5)
> for details, specifically "Type=notify-reload".
Thus this change allows users with a recent systemd to switch to
`Type=notify-reload`, should they desire to do so. Correct behavior was
verified with a Fedora 39 VM.
see systemd/systemd#25916
[wla: the service file should be updated this way:]
diff --git a/admin/systemd/haproxy.service.in b/admin/systemd/haproxy.service.in
index 22a53d8aab..8c6dadb5e5 100644
--- a/admin/systemd/haproxy.service.in
+++ b/admin/systemd/haproxy.service.in
@@ -8,12 +8,11 @@ EnvironmentFile=-/etc/default/haproxy
EnvironmentFile=-/etc/sysconfig/haproxy
Environment="CONFIG=/etc/haproxy/haproxy.cfg" "PIDFILE=/run/haproxy.pid" "EXTRAOPTS=-S /run/haproxy-master.sock"
ExecStart=@SBINDIR@/haproxy -Ws -f $CONFIG -p $PIDFILE $EXTRAOPTS
-ExecReload=@SBINDIR@/haproxy -Ws -f $CONFIG -c $EXTRAOPTS
-ExecReload=/bin/kill -USR2 $MAINPID
KillMode=mixed
Restart=always
SuccessExitStatus=143
-Type=notify
+Type=notify-reload
+ReloadSignal=SIGUSR2
# The following lines leverage SystemD's sandboxing options to provide
# defense in depth protection at the expense of restricting some flexibility
Signed-off-by: William Lallemand <wlallemand@haproxy.com>
This bug arrived with this commit:
MAJOR: stktable: split the keys across multiple shards to reduce contention
At this time, there are no callers which call stktable_get_entry() without checking
the nullity of <key> passed as parameter. But the documentation of this function
says it supports this case where the <key> passed as parameter could be null.
Move the nullity test on <key> at first statement of this function.
Thanks to @chipitsine for having reported this issue in GH #2518.
In order to reduce the contention on the table when keys expire quickly,
we're spreading the load over multiple trees. That counts for keys and
expiration dates. The shard number is calculated from the key value
itself, both when looking up and when setting it.
The "show table" dump on the CLI iterates over all shards so that the
output is not fully sorted, it's only sorted within each shard. The Lua
table dump just does the same. It was verified with a Lua program to
count stick-table entries that it works as intended (the test case is
reproduced here as it's clearly not easy to automate as a vtc):
function dump_stk()
local dmp = core.proxies['tbl'].stktable:dump({});
local count = 0
for _, __ in pairs(dmp) do
count = count + 1
end
core.Info('Total entries: ' .. count)
end
core.register_action("dump_stk", {'tcp-req', 'http-req'}, dump_stk, 0);
##
global
tune.lua.log.stderr on
lua-load-per-thread lua-cnttbl.lua
listen front
bind :8001
http-request lua.dump_stk if { path_beg /stk }
http-request track-sc1 rand(),upper,hex table tbl
http-request redirect location /
backend tbl
stick-table size 100k type string len 12 store http_req_cnt
##
$ h2load -c 16 -n 10000 0:8001/
$ curl 0:8001/stk
## A count close to 100k appears on haproxy's stderr
## On the CLI, "show table tbl" | wc will show the same.
Some large parts were reindented only to add a top-level loop to iterate
over shards (e.g. process_table_expire()). Better check the diff using
git show -b.
The number of shards is decided just like for the pools, at build time
based on the max number of threads, so that we can keep a constant. Maybe
this should be done differently. For now CONFIG_HAP_TBL_BUCKETS is used,
and defaults to CONFIG_HAP_POOL_BUCKETS to keep the benefits of all the
measurements made for the pools. It turns out that this value seems to
be the most reasonable one without inflating the struct stktable too
much. By default for 1024 threads the value is 32 and delivers 980k RPS
in a test involving 80 threads, while adding 1kB to the struct stktable
(roughly doubling it). The same test at 64 gives 1008 kRPS and at 128
it gives 1040 kRPS for 8 times the initial size. 16 would be too low
however, with 675k RPS.
The stksess already have a shard number, it's the one used to decide which
peer connection to send the entry. Maybe we should also store the one
associated with the entry itself instead of recalculating it, though it
does not happen that often. The operation is done by hashing the key using
XXH32().
The peers also take and release the table's lock but the way it's used
it not very clear yet, so at this point it's sure this will not work.
At this point, this allowed to completely unlock the performance on a
80-thread setup:
before: 5.4 Gbps, 150k RPS, 80 cores
52.71% haproxy [.] stktable_lookup_key
36.90% haproxy [.] stktable_get_entry.part.0
0.86% haproxy [.] ebmb_lookup
0.18% haproxy [.] process_stream
0.12% haproxy [.] process_table_expire
0.11% haproxy [.] fwrr_get_next_server
0.10% haproxy [.] eb32_insert
0.10% haproxy [.] run_tasks_from_lists
after: 36 Gbps, 980k RPS, 80 cores
44.92% haproxy [.] stktable_get_entry
5.47% haproxy [.] ebmb_lookup
2.50% haproxy [.] fwrr_get_next_server
0.97% haproxy [.] eb32_insert
0.92% haproxy [.] process_stream
0.52% haproxy [.] run_tasks_from_lists
0.45% haproxy [.] conn_backend_get
0.44% haproxy [.] __pool_alloc
0.35% haproxy [.] process_table_expire
0.35% haproxy [.] connect_server
0.35% haproxy [.] h1_headers_to_hdr_list
0.34% haproxy [.] eb_delete
0.31% haproxy [.] srv_add_to_idle_list
0.30% haproxy [.] h1_snd_buf
WIP: uint64_t -> long
WIP: ulong -> uint
code is much smaller
Thanks to the previous commit, we can now simply perform an atomic read
on stksess->seen and take the write lock to recreate the entry only if
at least one peer has seen it, otherwise leave it untouched. On a test
on 40 cores, the performance used to drop from 2.10 to 1.14M RPS when
one peer was connected, now it drops to 2.05, thus there's basically
no impact of connecting a peer vs ~45% previously, all spent in the
read lock. This can be particularly important when often updating the
same entries (user-agent, source address during an attack etc).
Right now we're taking the stick-tables update lock for reads just for
the sake of checking if the update index is past it or not. That's
costly because even taking the read lock is sufficient to provoke a
cache line write, while when under load or attack it's frequent that
the update has not yet been propagated and wouldn't require anything.
This commit brings a new field to the stksess, "seen", which is zeroed
when the entry is updated, and set to one as soon as at least one peer
starts to consult it. This way it will reflect that the entry must be
updated again so that this peer can see it. Otherwise no update will
be necessary. For now the flag is only set/reset but not exploited.
A great care is taken to avoid writes whenever possible.
Some bwlim error messages at parsing time were missing the trailing '\n'
in commit 2b6777021d ("MEDIUM: bwlim: Add support of bandwith limitation
at the stream level"). This commit can be backported wherever the commit
above is (likely as far as 2.7).
Thanks to previous commit, we can now build with USE_SYSTEMD=1 on any
system without requiring any parts from systemd. It just turns our that
there was one remaining include in haproxy.c that needed to be replaced
with haproxy/systemd.h to build correctly. That's what this commit does.
Given the xz drama which allowed liblzma to be linked to openssh, lets remove
libsystemd to get rid of useless dependencies.
The sd_notify API seems to be stable and is now documented. This patch replaces
the sd_notify() and sd_notifyf() function by a reimplementation inspired by the
systemd documentation.
This should not change anything functionnally. The function will be built when
haproxy is built using USE_SYSTEMD=1.
References:
https://github.com/systemd/systemd/issues/32028https://www.freedesktop.org/software/systemd/man/devel/sd_notify.html#Notes
Before:
wla@kikyo:~% ldd /usr/sbin/haproxy
linux-vdso.so.1 (0x00007ffcfaf65000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x000074637fef4000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x000074637fe4f000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x000074637f400000)
liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x000074637fe0d000)
libsystemd.so.0 => /lib/x86_64-linux-gnu/libsystemd.so.0 (0x000074637f92a000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x000074637f365000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000074637f000000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000074637f27a000)
libcap.so.2 => /lib/x86_64-linux-gnu/libcap.so.2 (0x000074637fdff000)
libgcrypt.so.20 => /lib/x86_64-linux-gnu/libgcrypt.so.20 (0x000074637eeb8000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x000074637fdcd000)
libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x000074637ee01000)
liblz4.so.1 => /lib/x86_64-linux-gnu/liblz4.so.1 (0x000074637fda8000)
/lib64/ld-linux-x86-64.so.2 (0x000074637ff5d000)
libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x000074637f904000)
After:
wla@kikyo:~% ldd /usr/sbin/haproxy
linux-vdso.so.1 (0x00007ffd51901000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f758d6c0000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f758d61b000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007f758ca00000)
liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x00007f758d5d9000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f758d365000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f758d5ba000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f758c600000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f758c915000)
/lib64/ld-linux-x86-64.so.2 (0x00007f758d729000)
A backport to all stable versions could be considered at some point.
Since faa8c3e ("MEDIUM: lb-chash: Deterministic node hashes based on
server address") the following configuration will cause haproxy to crash:
backend test1
mode http
balance hash int(1)
server s1 haproxy.org:80
This is because lbprm.update_server_eweight() method is now systematically
called in _srv_set_inetaddr_port() upon srv addr/port change (and with the
above config it happens during startup after initial dns resolution).
However, depending on the chosen lbprm algo, update_server_eweight function
may not be set (it is not a mandatory method, some lb implementations don't
define it).
Thus, using 'balance hash' with map-based hashing or 'balance sticky' will
cause a crash due to a NULL de-reference in _srv_set_inetaddr_port(). To
fix the issue, we first check that the update_server_eweight() method is
set before using it.
No backport needed unless faa8c3e ("MEDIUM: lb-chash: Deterministic node
hashes based on server address") gets backported.
This issue arrived with this commit:
"MINOR: quic: HyStart++ implementation (RFC 9406)"
Thanks to @chipitsine for having reported this issue in GH #2513.
Should be backported where the previous commit will be backported.
In peer_send_msg(), we take a lock before calling
peer_send_teach_process_msgs because of the check on the flags and update
indexes, and the function then drops it then takes it again just to resume
in the same situation, so that on return we can drop it again! Not only
this is absurd because it doubles the costs of taking the lock, it's also
totally inefficient because it takes a write lock while the only usage
that is done with it is to read the indexes! Let's drop the lock from
peer_send_teach_process_msgs() and move it explicitly in its only caller
around the condition, and turn it into a read lock only.
The MAX() macro was used to limit the count of bytes to be transferred
in appctx_raw_rcv_buf() by commit ee53d8421f ("MEDIUM: applet: Simplify
a bit API to exchange data with applets") instead of MIN(). It didn't
seem to have any consequences until commit f37ddbeb4b ("MAJOR: cli:
Update the CLI applet to handle its own buffers") that triggers a BUG_ON()
in __b_putblk() when the other side is slow to read, because we're trying
to append a full buffer on top of a non-empty one. A way to reproduce it
is to dump a heavy stick table on the CLI with a screen scrolling.
No backport is needed since this was introduced in 3.0-dev3 and revealed
after dev5 only.
In 2.9, the stick-tables' locking was split between the lock used to
manipulate the contents (->lock) and the lock used to manipulate the
list of updates and the update indexes (->updt_lock). This was done
with commit 87e072eea5 ("MEDIUM: stick-table: use a distinct lock for
the updates tree"). However a part was overlooked in the peers code,
the parts that consult (and update) the indexes use the table's lock
instead of the update lock. It's surprising that it hasn't caused more
trouble. It's likely due to the fact that the tree nodes are not often
immediately freed and that their memory area remains connected to valid
nodes in the tree during peer_stksess_lookup(), while other parts only
check or update indexes, thus are not that critical.
This needs to be backported wherever the commit above is, thus logically
2.9.
It is only an issue when the kernel splicing is used. The zero-copy
forwarding via the buffers is not affected. When a shutdown is received on
the producer side and some data are blocked in the pipe for a while, the
shutdown may be forwarded to the other side. Usually, in this case, the
shutdown must be scheduled, waiting all output data (from the channel and
the consumer's iobuf) are sent. But only the channel was considered.
The bug was introduced by commit 20c463955d ("MEDIUM: channel: don't look at
iobuf to report an empty channel"). To fix the issue, we must also check
data blocked in the consummer iobuf.
This patch should solve the issue #2505. It must be backported to 2.9.
This is a simple algorithm to replace the classic slow start phase of the
congestion control algorithms. It should reduce the high packet loss during
this step.
Implemented only for Cubic.
According to the documentation, "option redispatch 0" is expected to
disable redispatch just like "no option redispatch", but due to the
fact that it keeps PR_O_REDISP set, it doesn't actually work. Let's
make sure value 0 is properly handled and drops PR_O_REDISP. This can
be backported to all versions since it seems it has been broken since
its introduction in 1.6 with commit 726ab7145c ("MEDIUM: backend: Allow
redispatch on retry intervals").
As a workaround, "no option redispatch" does work though.
In 2.7 we addressed a race condition in the stick tables expiration task
with commit fbb934d ("BUG/MEDIUM: stick-table: fix a race condition when
updating the expiration task"). The issue was that the task could be
running on another thread which would destroy its expiration timer
while one had just recalculated it and prepares to queue it, causing
a bug due to the attempt to queue an expired task. The fix consisted in
enclosing the change into the stick-table's lock, which had a very low
cost since it's done only after having checked that the date changed,
i.e. no more than once every millisecond.
But as reported by Ricardo and Felipe from Taghos in github issue #2508,
a tiny race remained after the fix: the unlock() was done before the call
to task_queue(), leaving a tiny window for another thread to run between
unlock() and task_queue() and erase the timer. As confirmed, it's
sufficient to also protect the task_queue() call.
But overall this raises a point regarding the task_queue() API on tasks
that may run anywhere. A while ago an attempt was made at removing the
timer for woken up tasks, but something like this would be deserved
with more atomicity on the timer manipulation (e.g. atomically use
task_schedule() instead maybe). This should be backported to all
stable branches.
Motivation: When services are discovered through DNS resolution, the order in
which DNS records get resolved and assigned to servers is arbitrary. Therefore,
even though two HAProxy instances using chash balancing might agree that a
particular request should go to server3, it is likely the case that they have
assigned different IPs and ports to the server in that slot.
This patch adds a server option, "hash-key <key>" which can be set to "id" (the
existing behaviour, default), "addr", or "addr-port". By deriving the keys for
the chash tree nodes from a server's address and port we ensure that independent
HAProxy instances will agree on routing decisions. If an address is not known
then the key is derived from the server's puid as it was previously.
When adjusting a server's weight, we now check whether the server's hash has
changed. If it has, we have to remove all its nodes first, since the node keys
will also have to change.
A compilation error occurs when using DEBUG_MEM_STATS due to a variable
now being unused in debug_iohandler_memstats() :
src/debug.c: In function ‘debug_iohandler_memstats’:
src/debug.c:1862:24: error: unused variable ‘sc’ [-Werror=unused-variable]
1862 | struct stconn *sc = appctx_sc(appctx);
| ^~
This is caused since the following commit :
94b8ed446f
MEDIUM: cli/applet: Stop to test opposite SC in I/O handler of CLI commands
This must not be backported.
log load-balancing implementation was not seamlessly integrated within
lbprm API. The consequence is that it could become harder to maintain
over time since it added some specific cases just for the log backend.
Moreover, it resulted in some code duplication since balance algorithms
that are common to logs and regular (tcp, http) backends were specifically
rewritten for log backends.
Thanks to the previous commit, we now have all the prerequisites to make
log load-balancing fully leverage lbprm logic. Thus in this patch we make
__do_send_log_backend() use existing lbprm algorithms, and we no longer
require log-specific lbprm initialization in cfgparse.c and in
postcheck_log_backend().
As a bonus, for log backends this allows weighed algorithms to properly
support weights (ie: roundrobin, random and log-hash) since we now
leverage the same lb algorithms that we use for tcp/http backends
(doc was updated).
As previously mentioned in cd352c0db ("MINOR: log/balance: rename
"log-sticky" to "sticky""), let's define a sticky algorithm that may be
used from any protocol. Sticky algorithm sticks on the same server as
long as it remains available.
The documentation was updated accordingly.
b61147fd ("MEDIUM: log/balance: merge tcp/http algo with log ones")
introduced some ambiguities, because while it shares some algos with the
ones from mode {tcp,http}, we forgot report an error when the user tries
to use an algorithm that is not available in this mode (as per the doc).
Because of that, haproxy would silently drop log messages during runtime.
To fix that, we ensure that algo is one of the supported ones during log
backend postparsing. If the algo is not supported, we raise an error.
This should be backported in 2.9 with b61147fd
The CLI applet is now using its own snd_buf callback function. Instead of
copying as most output data as possible, only one command is copied at a
time.
To do so, a new state CLI_ST_PARSEREQ is added for the CLI applet. In this
state, the CLI I/O handle knows a full command was copied into its input
buffer and it must parse this command to evaluate it.
This flag can be use by endpoints to know the data to send, via .snd_buf
callback function are the last ones. It is useful to know a shutdown is
pending but it cannot be delivered while sedning data are not consumed.
It is now the responsbility of applets .snd_buf callback function to notify
the input buffer is full. This will allow the applets to not consume all
data waiting for more data. Of course, it is only useful for applets using a
custom .snd_buf callback function.
It is the third applet to be refactored to use its own buffers. In addition to
the CLI applet, some I/O handlers of CLI commands were also updated, especially
the stats ones.
Some command I/O handlers were updated to use applet's buffers instead of
channels ones.
It is an harmless bug for now because only stats and cache applets are using
their own buffers and it is not possible to trigger this bug with these
applets. However, it remains important to try a receive if EOI, EOS or ERROR
is reached by the applet while no data was produced. Otherwise, it is not
possible to ack these events at the SE level.
No backport needed.
The main CLI I/O handle is responsible to interrupt the processing on
shutdown/abort. It is not the responsibility of the I/O handler of CLI
commands to take care of it.
Instead of using connection versions, we now use generic versions. It means
we will also perfom sync receives and sync sends on applets too, but only
for applets using their own buffers. Old applets are not concerned.
sc_sync_recv() and sc_sync_send() were added to use connection or applet
versions, depending on the endpoint type. For now these functions are not
used. But this will be used by process_stream() to replace the connection
version.
Too big command, larger than a buffer, was silently rejected by the CLI
applet. It was handled as an error and the connection was closed, but no
error message was reported to user to notify him. Now an error is reported
before closing. It is only displayed if the chunk buffer used by the CLI
applet is full and no delimiter (\n or ;) is found to mark the end of the
command. It works for a simple command but also for a command with a huge
payload.
This patch could be backported to all stable versions.
This commit allows "cookie" keyword for dynamic servers. After code
review, nothing was found which could prevent a dynamic server to use
it. An extra warning is added under cli_parse_add_server() if cookie
value is ignored due to a non HTTP backend.
This patch is not considered a bugfix. However, it may backported if
needed as its impact seems minimal.
When adding a server dynamically, we observe that when a backend has a
dynamic persistence cookie, the new server has no cookie as we receive
the following HTTP header:
set-cookie: test-cookie=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/
Whereas we were expecting to receive something like the following, which
is what we receive for a server added in the config file:
set-cookie: test-cookie=abcdef1234567890; path=/
After investigating code path, srv_set_dyncookie() is never called when
adding a server through CLI, it is only called when parsing config file
or using "set server bkd1/srv1 addr".
To fix this, call srv_set_dyncookie() inside cli_parse_add_server().
This patch must be backported up to 2.4.
Since their first implementation, dynamic servers are created into
maintenance state. This has been done purposely to avoid immediate
activation of a newly inserted server.
However, this principle is incompatible if "enabled" keyword is used on
"add server". The newly created instance will be unreacheable as proxy
load-balancing algorithm is not informed of its presence via
srv_lb_propagate(). The new server could be unblocked by toggling its
state with "disable server" / "enable server" commands, which will
trigger srv_lb_propagate() invocation.
To avoid this unexpected state, simply forbid "enabled" keyword for
dynamic servers. In the long-term, it could be possible to re authorize
it but at least this requires to call srv_lb_propagate() on dynamic
server creation.
This should fix github issue #2497.
This patch should not be backported as-is, to avoid breaking dynamic
servers API on stable versions. "enabled" should instead be ignored for
them. This will be implemented in a dedicated patch on top of 2.9.
This option can be used to set a default ocsp-update mode for all
certificates of a given conf file. It allows to activate ocsp-update on
certificates without the need to create separate crt-lists. It can still
be superseded by the crt-list 'ocsp-update' option. It takes either "on"
or "off" as value and defaults to "off".
Since setting this new parameter to "on" would mean that we try to
enable ocsp-update on any certificate, and also certificates that don't
have an OCSP URI, the checks performed in ssl_sock_load_ocsp were
softened. We don't systematically raise an error when trying to enable
ocsp-update on a certificate that does not have an OCSP URI, be it via
the global option or the crt-list one. We will still raise an error when
a user tries to load a certificate that does have an OCSP URI but a
missing issuer certificate (if ocsp-update is enabled).
The inconsistencies in 'ocsp-update' parameter were only checked when
parsing a crt-list line so if a certificate was used on a bind line
after being used in a crt-list with 'ocsp-update' set to 'on', then no
error would be raised. This patch helps detect such inconsistencies.
This patch can be backported up to branch 2.8.
In a crt-list such as the following:
foo.pem [ocsp-update off] foo.com
foo.pem bar.com
we would get a wrong "Incompatibilities found in OCSP update mode ..."
error message during init when the two lines are actually saying the
same thing since the default for 'ocsp-update' option is 'off'.
This patch can be backported up to branch 2.8.
A recent issue was uncovered by the CI which started to randomly report
segfaults on a few tests, and more systematically on FreeBSD. It turn
out that it was introduced by recent commit 03816ccfa9 ("MAJOR: ring:
insert an intermediary ring_storage level"), which overlooked the munmap()
path of the sink and startup logs: once the ring and its storage were
split, it was no longer correct to munmap() the ring, only its storage
area needs to be unmapped, and the ring must always be freed separately.
Thanks to Christopher and William for their help at trying to reproduce
it and figure the circumstances that triggers it.
No backport is needed.
Sebastien Gross reported that 'interface' keyword ('source' subargument)
is silently ignored when used from 'default-server' directive despite the
documentation implicitly stating that the keyword should be supported
there.
When support for 'source' keyword was added to 'default-server' directive
in dba97077 ("MINOR: server: Make 'default-server' support 'source'
keyword."), we properly duplicated the conn iface_name from the default-
server but we forgot to copy the conn iface_len which must be set as well
since it is used as setsockopt()'s 'optlen' argument in
tcp_connect_server().
It should be backported to all stable versions.
OpenSSL 3.2 triggers the code part added by commit 25da217 ("MINOR: ssl:
Update ssl_fc_curve/ssl_bc_curve to use SSL_get0_group_name") which
contains a variable declaration in the for() statement and breaks on
older compilers, as reported in GH issues #2501.
Let's just declare it normally to fix the problem. This must be
backported wherever the commit above is (at least 2.9).
We don't care in what order the threads are released, so we can write
their sent value using relaxed atomic stores. This brings a 3-5% perf
boost on ARM with 80 cores, reaching 7.25M/s, and doesn't change
anything on x86 since it keeps using strict ordering.
It has been found that performing a first pass consisting in copying
all messages, and a second one to notify about releases is more efficient
on AMD than updating all of them on the fly using a CAS, despite making
writers wait longer to be released.
Maybe it's related to the ability for the CPU to prefetch the contents
during a simple load while it wouldn't do it for an XCHG, it's unsure
at this point. This will also mater permit to use relaxed stores to
release threads.
On ARM the performance increased to 7.0M/s. If this patch is applied
before the dropping of the intermediary step, instead it drops to
3.9M/s. This shows the dependency between such changes that strive to
limit the number of writes on the fast path.
On x86_64, the EPYC at 3C6T saw a small drop from 4.57M to 4.45M, but
the 24C48T setup saw a nice 33% boost from 3.33M to 4.44M, i.e. we
get stable perf at 3 and 24 cores, despite having 8 CCX involved and
fighting with each other.
Other possibilities are:
- use of HA_ATOMIC_XCHG() instead of FETCH_OR()
=> slightly faster (4.62/7.37 vs 4.58/7.34). Pb: requires to
modify the readers to wait much longer since the tail value
won't be valid in this case during updates, and it will have
to wait by looping over it.
- use other conditions to release a cell
=> to be tested
Archs relying on CAS benefit from a read prior to FETCH_OR, so it's
not just x86 that benefits from this. Let's just change the condition
to only exclude __ARM_FEATURE_ATOMICS which is the only one faster
without.
The loop was cleaned up a little bit so that the inner loops are more
readable and that the ifdef'd parts are whole blocks and not just an
"if" condition. A few conditions were adjusted to benefit from "break"
and "continue".
This is mostly a cleanup in that it turns the two-level loop into a
single one, but it also simplifies the code a little bit and brings
some performance savings again, which are mostly noticeable on ARM,
but don't change anything for x86.
x86_64 doesn't have a native atomic FETCH_OR(), it's implemented using
a CAS, which will always cause a write cycle. Here we know we can just
wait as long as the lock bit is held so better loop on a load, and only
attempt the CAS on success. This requires a tiny ifdef and brings nice
benefits. This brings the performance back from 3.33M to 3.75M at 24C48T
while doing no change at 3C6T.
By doing that and placing the cpu_relax at the right places, the ARM
reaches 6.0M/s on 80 threads. On x86_64, at 3C6T the EPYC sees a small
increase from 4.45M to 4.57M but at 24C48T it sees a drop from 3.82M
to 3.33M due to the write contention hidden behind the CAS that
implements the FETCH_OR(), that we'll address next.
The queue-based approach consists in forcing threads to wait away from
the work area so as not to disturb the current writer, and to prepare
the work by grouping them in a queue. The last arrived takes the head
of the queue by placing its preinitialized ring cell there, becomes the
queue's leader, informs itself about the amount of previously accumulated
bytes so that when its turn comes, it immediately knows how much room is
needed to be released.
It can then take the whole queue with it, leaving an empty one for new
threads to come while it's releasing the room needed to copy everything.
By doing so we're cascading contention areas so that multiple parts can
work in parallel.
Note that we must never leave a write counter set to 0xFF at tail, and
this happens when a message cannot fit and we give up, because in this
case we're writing back tail_ofs, and only later we restore the counter.
The solution here is to make a special case when we're going to drop
the messages, and to write the readers count before restoring tail.
This already shows a tremendous performance gain on ARM (385k -> 4.8M),
thanks to the fact that now all waiting threads wait on the queue's
head instead of polluting the tail lock. On x86_64, the EPYC sees a big
boost at 24C48T (1.88M -> 3.82M) and a slowdown at 3C6T (6.0->4.45)
though this one is much less of a concern as so few threads need less
bandwidth than bigger counts.
Now the rings have one wait queue per group. This should limit the
contention on systems such as EPYC CPUs where the performance drops
dramatically when using more than one CCX.
Tests were run with different numbers and it was showed that value
6 outperforms all other ones at 12, 24, 48, 64 and 80 threads on an
EPYC, a Xeon and an Ampere CPU. Value 7 sometimes comes close and
anything around these values degrades quickly. The value has been
left tunable in the global section.
This commit only introduces everything needed to set up the queue count
so that it's easier to adjust it in the forthcoming patches, but it was
initially added after the series, making it harder to compare.
It was also shown that trying to group the threads in queues by their
thread groups is counter-productive and that it was more efficient to
do that by applying a modulo on the thread number. As surprising as it
seems, it does have the benefit of well balancing any number of threads.
Code disassembly shows that ring->storage->tail and ring->queue are
accessed a lot and reloaded a lot due to aliasing. Let's just have
variables for them in the local stack. It makes the code smaller and
slightly faster.
It's inefficient and counter-productive that each ring writer iterates
over all readers to wake them up. Let's just have one in charge of this,
it strongly limits contention. The only thing is that since the thread
is iterating over a list, we want to be sure that if the first readers
have already completed their job, they will be woken up again. For this
we keep a counter of messages delivered after the wakeup started, and
the waking thread will check it before going back to sleep. In order to
avoid looping forever, it will also drop its waking flag soon enough to
possibly let another one take it.
There used to be a few cases of watchdogs before this on a 24-core AMD
EPYC platform on the list iteration those never appeared anymore.
The perf has dropped a bit on 3C6T on the EPYC, from 6.61 to 6.0M but
remains unchanged at 24C48T.
If there's nothing to read, it's pointless for a reader to try to update
the offset pointer, that's two atomic ops to replace a value by itself
twice. Let's just stop this.
It was only used to protect the list which is now an mt_list so it
doesn't provide any required protection anymore. It obviously also
used to provide strict ordering between the writer and the reader
when the writer started to update the messages, but that's now
covered by the oredered tail updates and updates to the readers
count to protect the area.
The message rate on small thread counts (up to 12) saw a boost of
roughly 5% while on large counts while for large counts it lost
about 2% due to some contention now becoming visible elsewhere.
Typical measures are 6.13M -> 6.61M at 3C6T, and 1.88 -> 1.92M at
24C48T on the EPYC.
The writer is using tags 0xFF instead of readers count at the front of
messages that are undergoing an update, while the tail has already been
updated. The reader needs to take care of this because it can face these
messages and mistakenly parse data that's still being written, leading
to corruption (especially if this happens while the size is changing).
Let's just stop reading when facing reserved codes, since they indicate
that the end of usable messages was reached.
Since we're going to remove the lock, there's no more way to prevent the
ring from being fed while we're attaching a client to it. We need to
freeze the buffer while looking at its head so that we can attach there
and have a trustable one. We could do it by setting the lock bit on the
tail offset but quite frankly we don't need to bother with that, attaching
a client is rare enough to permit a thread_isolate().
Rings are keeping a lock only for the list, which apparently doesn't
need anything more than an mt_list, so let's first turn it into that
before dropping the lock. There should be no visible effect.
There's no point looking for freshly attached readers if there are none,
taking this lock requires an atomic write to a shared area, something we
clearly want to avoid.
A general test with 213-byte messages on different thread counts shows
how the performance degrades across CCX and how this patch improves the
situation:
Before After
3C6T/1CCX: 6.39 Mmsg/s 6.35 Mmsg/s
6C12T/2CCX: 2.90 Mmsg/s 3.16 Mmsg/s
12C24T/4CCX: 2.14 Mmsg/s 2.33 Mmsg/s
24C48T/8CCX: 1.75 Mmsg/s 1.92 Mmsg/s
This tends to confirm that the queues will really be needed and that
they'll have to be per-ccx hence per thread-group. They will amortize
the number of updates on head & tail (one per multiple messages).
We know we can continue to protect the message area so we can unlock the
tail as soon as we know its new value. Now we're seeing ~6.4M msg/s vs
5.4M previously on 3C6T of a 3rd gen EPYC, and 1.88M vs 1.54M for 24C48T
threads, which is a significant gain!
This requires to carefully write the new head counter before releasing
the writers, and to change the calculation of the work area from
tail..head to tail...new_tail while writing the message.
Now the lock is only taken around the readers list. With careful
ordering of writes to head/tail, the ring remains protected.
The perf is a bit better, though (1.54M msg/s vs 1.4M at 48T on
a 3rd gen EPYC, and 5.4M vs 5.3M for a 3C6T setup).
We're now locking the tail while looking for some room in the ring. In
fact it's still while writing to it, but the goal definitely is to get
rid of the lock ASAP. For this we reserve the topmost bit of the tail
as a lock, which may have as a possible visible effect that buffers will
be limited to 2GB instead of 4GB on 32-bit machines (though in practise,
good luck for allocating more than 2GB contiguous on 32-bit), but in
practice since the size is read with atol() and some operating systems
limit it to LONG_MAX unless passing negative numbers, the limit is
already there.
For now the impact on x86_64 is significant (drop from 2.35 to 1.4M/s
on 48 threads on EPYC 24 cores) but this situation is only temporary
so that changes can be reviewable and bisectable.
Other approaches were attempted, such as using XCHG instead, which is
slightly faster on x86 with low thread counts (but causes more write
contention), and forces readers to stall under heavy traffic because
they can't access a valid value for the queue anymore. A CAS requires
preloading the value and is les good on ARMv8.1. XADD could also be
considered with 12-13 upper bits of the offset dedicated to locking,
but that looks overkill.
The reader now needs to protect the positions it's reading. This is
already done via the readers counter at the beginning of messages,
but as long as the lock is present, this counter is decremented
before starting to parse messages, and incremented at the end.
We must now do that in reverse, first protect the end of the messages,
and only then remove ourselves from the already processed messages, so
that at no point could a writer pass over and possibly overwrite data
we're currently watching.
The goal here is to start to protect the writing area inside the area
itself so that we'll later be able to release the ring's lock. We're not
there yet, but at least the tail is marked as protected for as long as the
message is not fully written.
We'll want to reserve some special values for the readers count to
temporary lock the following message, but for this it will be mandatory
that readers check for them before incrementing/decrementing the counter.
Let'sdo that using a CAS. The readers performance is not as critical as
the writer's anyway so the slight overhead is not a problem.
The purpose is to store a head and a tail that are independent so that
we can further improve the API to update them independently from each
other.
The struct was arranged like the original one so that as long as a ring
has its head set to zero (i.e. no recycling) it will continue to work.
The new format is already detectable thanks to the "rsvd" field which
indicates the number of reserved bytes at the beginning. It's located
where the buffer's area pointer previously was, so that older versions
of haring can continue to open the ring in repair mode, and newer ones
can use the fact that the upper bits of that variable are zero to guess
that it's working with the new format instead of the old one. Also let's
keep in mind that the layout will further change to place some alignment
constraints.
The haring tool will thus updated based on this and it detects that the
rsvd field is smaller than a page and that the sum of it with the size
equals the mapped size, in which case it uses the new dump_v2() function
instead of dump_v1(). The new function also creates a buffer from the
ring's area, size, head and tail and calls the generic one so that no
other code had to be adapted.
The code now looks cleaner and more easily shows what still needs to be
addressed. There are not that many changes in practice, these are mostly
mechanical, essentially hiding the buffer from the callers.
This is the start of the replacement of the buffer API calls. Only the
ring_write() function was touched. Instead of manipulating a buffer all
along, we now extract the ring buffer's head and tail upon entry, store
them locally and use them using the vec<->ring API until the last moment
where we can update the buffer with the new values. One subtle point is
that we must never fill the buffer past the last byte otherwise the
vec-to-ring conversion gets lost and there's no more possibility to know
where's the beginning nor the end (just like when dealing with head+tail
in fact), because it then becomes impossible to distinguish between an
empty and a full buffer.
In ring_resize() we used to check if the new ring was at least as large
as the previous one before resizing it, but what counts is that it's as
large as the previous one's contents. Initially it was thought this
would not really matter, but given that rings are initially created as
BUFSIZE, it's currently not possible to shrink them for debugging
purposes. Now with this change it is.
The ring resizing was already quite tricky, but when facing atomic
writes it will no longer be possible and we definitely do not want to
have to deal with a lock there. Since it's only done at boot time, and
possibly later from the CLI, let's just do it under thread isolation.
We'll need to add more complex structures in the ring, such as wait
queues. That's far too much to be stored into the area in case of
file-backed contents, so let's split the ring definition and its
storage once for all.
This patch introduces a struct ring_storage which is assigned to
ring->storage, which contains minimal information to represent the
storage layout, i.e. for now only the buffer, and all the rest
remains in the ring itself. The storage is appended immediately after
it and the buffer's pointer always points to that area. It has the
benefit of remaining 100% compatible with the existing file-backed
layout. In memory, the allocation loses the size of a struct buffer.
It's not even certain it's worth placing the size there, given that it's
constant and that a dump of a ring wouldn't really need it (the file size
is sufficient). But for now everything comes with the struct buffer, and
later this will change once split into head and tail. Also this area may
be completed with more information in the future (e.g. storage version,
format, endianness, word size etc).
Till now we used to rely on a heuristic pointer comparison to check if
a ring was mapped or allocated. Better assign a flag to clarify this
because it's going to become difficult otherwise.
Some open-coded constructs were updated to make use of the ring accessors
instead. This allows to remove some direct dependencies on the buffers
API a bit more.
In startup_logs_dup() we currently need to reference the ring's buffer,
better not do this as it will complicate operations when switching to
other types.
The ring_write() function uses confusing variable names: totlen is in
fact the length of the message, not the total length that is going to
be written. Let's rename it msglen and have a real "needed" that
corresponds to the total size we're going to write. We also add a
BUG_ON_HOT() to catch mistakes causing discrepancies.
In order to support concurrent writers we'll need to lock areas in the
buffer. For this we'll use one special value of the single-byte readers
count. Let's reserve it now and use the macro instead of the hardcoded
255.
The goal is to remove references to the buffer's head and tail in the
fast path so that we can release the lock during some reads. This means
no more comparisons with b_data() nor operations relative to b_head()
will be possible anymore. As a first step we need to have an absolute
offset in the buffer, and to use b_getblk_ofs() in the applet callbacks
to retrieve the data based on this.
This code becomes even simpler and almost does not need any knowledge
of the structure of the ring anymore. It even highlighted that an old
race had not been fixed due to code duplication, but that's now done.
This new function is made around the loop that scans a ring for new
messages and dispatches them to a message handler. It also takes
ring flags (WAIT, NEW, etc) and offset pointers that the caller will
use to initialize/reuse/update the current processing offset. The
caller is still responsible for presetting it to ~0 before the
first call if it wants the function to automatically adjust it (or set
it to the correct value). The function may also return the last_ofs
that was known before releasing the lock so that the caller knows
what to compare against and if it needs to restart processing or not.
The context remains a void* so that should not necessarily depend on
an appctx.
The current "show ring" code was ported to this and it continues to
work as expected.
A ring is used for the DNS code but slightly differently from the generic
one, which prevents some important changes from being made to the generic
code without breaking DNS. As the use cases differ, it's better to just
split them apart for now and have the DNS code use its own ring that we
rename dns_ring and let the generic code continue to live on its own.
The unused parts such as CLI registration were dropped, resizing and
allocation from a mapped area were dropped. dns_ring_detach_appctx() was
kept despite not being used, so as to stay consistent with the comments
that say it must be called, despite the DNS code explicitly mentioning
that it skips it for now (i.e. this may change in the future).
Hopefully after the generic rings are converted the DNS code can migrate
back to them, though this is really not necessary.
The rink reader code was duplicated as-is in 2.2 for the ring forwarding
code in commits 494c505703 ("MEDIUM: ring: add server statement to forward
messages from a ring") and 975564784f ("MEDIUM: ring: add new srv statement
to support octet counting forward") (which only differs by using a prefix
instead of a suffix to delimit messages).
Unfortunately, that makes it almost impossible to rework the core ring
code because all these parts rely on it. This first commit aims at
restoring a common structure for the core loop by just calling a distinct
function based on the use case. The functions are either
applet_append_line() when a whole line is to be emitted followed by an LF
character, or syslog_applet_appent_event() when trying to send a TCP
syslog line prepended with its size in decimal.
There is no functional change beyond this.
This function takes a buffer on input, and offset and a length, and
consumes the block from that buffer to send it to the appctx's output
buffer. Contrary to its sibling applet_append_line(), instead of just
appending an LF at the end of the line, it prepends the message size
in decimal and a space before the message, as expected by syslog TCP
implementaions. This will be used to simplify the ring reader code.
This function takes a buffer on input, and offset and a length, and
consumes the block from that buffer to send it to the appctx's output
buffer. This will be used to simplify the ring reader code.
This new command, enabled only with "DEBUG_DEV", sends 2 or 20 traces
per task wakeup (depending on the verbosity level), and stops after 1M
wakeups per thread in order not to have to stop/start the process each
time it's fired.
We have two small messages and 18 larger ones from 20 to 270 bytes
each, so that the average size is approx 213 bytes counting headers
(the header adds approx 82 bytes), which matches what's generally
observed on average when traces are enabled in all muxes.
Typical figures show varations between 5.7M and 6.2M msg/s on an EPYC
in a 3C6T setup (single CCX), and 2.12M - 2.22M in a 24C48T setup
(across 8 CCX, with 8 thread groups).
In http_7239_extract_{ipv4,ipv6}, we declare a local buffer in order to
use inet_pton() since it requires a valid destination argument (cannot be
NULL). Then, if the caller provided <ip> argument, we copy inet_pton()
result (from local buffer to <ip>).
In fact when the caller provides <ip>, we may directly use <ip> as
inet_pton() dst argument to avoid an useless copy. Thus the local buffer
is only relevant when the user doesn't provide <ip>.
While at it, let's add a missing testcase for the rfc7239_n2nn converter
(to check that http_7239_extract_ipv4() with <ip> provided works properly)
This could be backported in 2.8 with b2bb925 ("MINOR: proxy/http_ext:
introduce proxy forwarded option")
Willy reported that since 3ac79b504 ("MEDIUM: server:
make server_set_inetaddr() updater serializable"), haproxy fails to
compile on some older compilers such as gcc-4.4 with this kind of error:
src/server.c: In function 'snr_resolution_cb':
src/server.c:4471: error: unknown field 'dns_resolver' specified in initializer
compilation terminated due to -Wfatal-errors.
make: *** [Makefile:1006: src/server.o] Error 1
This is due to referencing a member inside anonymous union from a compound
literal assignment. Apparently such use of anonymous union wasn't properly
supported back then on older compilers. To fix the issue, we give "u" name
to the parent union use this name to explicitly refer to the union where
relevant in the code (only a few changes fortunately).
The fix itself was verified to restore build compatibility with gcc 4.4
(and even 4.2).
As 3ac79b504 is used as a prerequisite for 64c9c8ef3 ("BUG/MINOR:
server/dns: use server_set_inetaddr() to unset srv addr from DNS"), please
consider backporting this patch too if 64c9c8ef3 happens to be backported
in 2.9.
Trailers are skipped by the FCGI multiplexer. However empty chunked messages
are not properly handled. It may be a chunked H1 request with no payload or
a H2/H3 POST request with no payload. In that caes, the EOT HTX block is
just ignored. The issue is that the EOM flag is thus ignored too. It means
no empty STDIN record is sent to mark the end of the request to the server.
To fix the issue, when a EOT htx block is found and it is the last HTX block
of the message (and it should be), the EOM flag is tested. If it is found,
an empty STDIN record is emitted.
This patch should fix the issue #2499. It must be backported as far as 2.4.
QUIC MUX is freed via qcc_release(). This in turn liberate all the
remaining QCS instances. For each one of them, their corresponding
stream-desc is released via qc_stream_desc_release().
This last function may itself notifies QUIC MUX when new buffers are
available. This is useful when QCS are closed individually without the
whole connection. However, when the connection is closed through
qcc_release(), this may cause issue as some elements of QUIC MUX are
already freed.
In 2.9.6, a bug was detected directly linked to this. Indeed, QCC
instance may be woken up on stream-desc release. If called through
qcc_release(), this is an issue because QCC tasklet is freed before QCS
instances. However, this bug is not systematic and relies on prior
conditions : in particular, QUIC MUX must be under Tx buffers exhaustion
prior to the qcc_release() invocation.
The current dev tree is not impacted by this bug, thanks to QUIC MUX
refactoring. Indeed, notifying accross layers have changed and now
stream-desc release notifies individual QCS instances instead of the QCC
element, which is a safer mechanism. However, to simplify backport
process, bugfix is introduced in the current dev tree as it does not
have any impact.
Note that a proper fix would be to set quic-conn MUX state to
QC_MUX_RELEASED. However, it is not possible to call quic_close()
without having releasing all stream-desc elements first. The simpler
solution was chosen to prevent other breaking issues during backports.
This should fix github issue #2494.
It should be backported up to 2.6. Note that prior to 2.7 qcc_release()
was named qc_release().
This commit similar to the following one :
65ae241dcfe710e1cdd3ec4e7a9bde38d2e4c116
MEDIUM: server: close idle conn before server deletion
This patch implements a similar logic, this time to close private idle
connections stored in sessions. The principle is identical to the above
commit : conn_release() is used on idle connections after a takeover to
ensure thread safety.
An extra change was required to be able to execute takeover on such
connections. Their original thread ID was unknown, contrary to non
private connections which are stored in sharded lists. As such, a new
tid member has been added under sess_priv_conns chaining element.
When a backend connection is marked as idle, a special flag TASK_F_USR1
is set on MUX tasklet. When MUX tasklet is reactivated, extra checks are
executed under this flag to ensure no takeover occurred in the meantime.
Previously, only non private connections could be targetted by a
takeover. However, this will change when implementing private idle
connections closure on "delete server" CLI handler. As such, TASK_F_USR1
is now also set for private connections in MUX detach callbacks.
To be able to delete a server, a number of preconditions must be
validated to ensure it is not in used anymore. Previously, if idle
connections were stored in the server, the deletion was cancelled. No
action was implemented to force idle connection closure, the only
solution was to wait for the periodic purging to be achieved.
This is an extra burden to be able to delete a server. Indeed, idle
connections are by definition inactive and can be closed prior to delete
a server. This is the exact purpose of this patch.
Idle connections removal is implemented inside "delete server" handler,
once it has been determined that the server can be freely removed. A
simple loop is run to call conn_release() over each idle connections.
Takeover is also executed before conn_release() to ensure tasks/tasklets
or any other sensible elements are not deleted from a foreign thread.
This patch should reduce the occurence of rejected "delete server"
execution, especially when connection reuse is high.
Extend takeover API both for MUX and XPRT with a new boolean argument
<release>. Its purpose is to signal if the connection will be freed
immediately after the takeover, rendering new resources allocation
unnecessary.
For the moment, release argument is always false. However, it will be
set to true on delete server CLI handler to proactively close server
idle connections.
Several places reuse the same code to ensure a connection is properly
freed, either via its MUX or by calling the proper set of functions.
Factorize all of this in a new function conn_release().
This new function is now called via session_free() and
session_accept_fd(). It will also be reused on delete server to
proactively close idle connections.
Those fetchess were undocumented and were just here so that the
ocsp-update log could be made through a regular log format. But since
the logging is now "handmade" (since BUG/MEDIUM: ssl: Fix crash in
ocsp-update log function), we don't need those anymore.
Since commit "BUG/MEDIUM: ssl: Fix crash in ocsp-update log function",
some information from the log line are "faked" because they can be
actually retrieved anymore (or never could). We should then remove them
from the logline all along instead of providing some useless fields.
We then only keep pure OCSP-update information in the log line:
"<certname> <status> <status str> <fail count> <success count>"
The ocsp-update logging mechanism was built around the 'sess_log'
function which required to keep a pointer to the said session until the
logging function could be called. This was made by keeping a pointer to
the appctx returned by the 'httpclient_start' function. But this appctx
lives its life on its own and might be destroyed before
'ssl_ocsp_send_log' is called, which could result in a crash (UAF).
Fixing this crash requires to stop using the 'sess_log' function to emit
the ocsp-update logs. The log line will then need to be built by hand
out of the information actually available when 'ssl_ocsp_send_log' is
called. Since we don't use the "regular" logging functions anymore, we
don't need to use the error_logformat anymore. In order to keep a
consistent behavior than before, we will keep the same format for the
logs but replace the fields that required a 'sess' pointer by fake
values (the %ci:%cp for instance, which was never filled anyway).
This crash was raised in GitHub issue #2442.
It should be backported up to branch 2.8.
The CLI command "update ssl ocsp-response" was forcefully removing an
OCSP response from the update tree regardless of whether it used to be
in it beforehand or not. But since the main OCSP upate task works by
removing the entry being currently updated from the update tree and then
reinserting it when the update process is over, it meant that in the CLI
command code we were modifying a structure that was already being used.
These concurrent accesses were not properly locked on the "regular"
update case because it was assumed that once an entry was removed from
the update tree, the update task was the only one able to work on it.
Rather than locking the whole update process, an "updating" flag was
added to the certificate_ocsp in order to prevent the "update ssl
ocsp-response" command from trying to update a response already being
updated.
An easy way to reproduce this crash was to perform two "simultaneous"
calls to "update ssl ocsp-response" on the same certificate. It would
then crash on an eb64_delete call in the main ocsp update task function.
This patch can be backported up to 2.8. Wait a little bit before
backporting.
With the current way OCSP responses are stored, a single OCSP response
is stored (in a certificate_ocsp structure) when it is loaded during a
certificate parsing, and each SSL_CTX that references it increments its
refcount. The reference to the certificate_ocsp is kept in the SSL_CTX
linked to each ckch_inst, in an ex_data entry that gets freed when the
context is freed.
One of the downsides of this implementation is that if every ckch_inst
referencing a certificate_ocsp gets detroyed, then the OCSP response is
removed from the system. So if we were to remove all crt-list lines
containing a given certificate (that has an OCSP response), and if all
the corresponding SSL_CTXs were destroyed (no ongoing connection using
them), the OCSP response would be destroyed even if the certificate
remains in the system (as an unused certificate).
In such a case, we would want the OCSP response not to be "usable",
since it is not used by any ckch_inst, but still remain in the OCSP
response tree so that if the certificate gets reused (via an "add ssl
crt-list" command for instance), its OCSP response is still known as
well.
But we would also like such an entry not to be updated automatically
anymore once no instance uses it. An easy way to do it could have been
to keep a reference to the certificate_ocsp structure in the ckch_store
as well, on top of all the ones in the ckch_instances, and to remove the
ocsp response from the update tree once the refcount falls to 1, but it
would not work because of the way the ocsp response tree keys are
calculated. They are decorrelated from the ckch_store and are the actual
OCSP_CERTIDs, which is a combination of the issuer's name hash and key
hash, and the certificate's serial number. So two copies of the same
certificate but with different names would still point to the same ocsp
response tree entry.
The solution that answers to all the needs expressed aboved is actually
to have two reference counters in the certificate_ocsp structure, one
actual reference counter corresponding to the number of "live" pointers
on the certificate_ocsp structure, incremented for every SSL_CTX using
it, and one for the ckch stores.
If the ckch_store reference counter falls to 0, the corresponding
certificate must have been removed via CLI calls ('set ssl cert' for
instance).
If the actual refcount falls to 0, then no live SSL_CTX uses the
response anymore. It could happen if all the corresponding crt-list
lines were removed and there are no live SSL sessions using the
certificate anymore.
If any of the two refcounts becomes 0, we will always remove the
response from the auto update tree, because there's no point in spending
time updating an OCSP response that no new SSL connection will be able
to use. But the certificate_ocsp object won't be removed from the tree
unless both refcounts are 0.
Must be backported up to 2.8. Wait a little bit before backporting.
By default, backend connections are accounted by the server. This allows
to determine the number of idle connections to keep. A backend
connection can also be marked as private to prevent its reuse. It is
thus removed from server lists into the session list. As such, a private
connection is not accounted into server : conn_set_private() uses
srv_release_conn() to ensure this.
When using HTTP/2 on backend side with default http-reuse safe, the
above principle are mixed. Indeed, when a connection is first used, or
switches from idle to used, it is moved into the session list but it is
not flagged as private. This is done to prevent its sharing by different
clients to prevent head-of-line blocking issue. When all streams are
closed, the connection becomes idle again and is reinserted in the
server list. This has been introduced by the following patch :
0d21deaded
MEDIUM: backend: add reused conn to sess if mux marked as HOL blocking
When freeing a backend connection, special care is taken to ensure
server used counter is decremented. This is implemented into
conn_backend_deinit(). However, this function does this only if the
connection is not present in a session list. This is valid for private
connections. However, if a connection is non-private and present only
temporarily into a session list, the decrement operation won't be
executed despite the connection being accounted by the server.
This bug has several impacts. The server used counter won't be able to
reach its initial null value, even when all its connections are closed.
This can result in a wrong estimation of necessary idle connections,
which may cause unnecessary new connection usage. Also, this will
prevent definitely the server from being removed via "delete server" CLI
command.
This should be backported up to 2.4. Note that conn_backend_deinit() was
introduced in 2.9. For lesser versions, the change should be done
directly into conn_free().
Backend connections can be marked as private to prevent their sharing by
multiple clients. Now, this has become an exception as only two reasons
for data traffic can trigger this (checks are ignored here) :
* http-reuse never
* HTTP response with NTLM header
The first case is easy to manage as the connection is flagged as private
since its inception. However, the second case is dynamic as the
connection can be flagged anytime during its lifetime. When using a
backend protocol such as HTTP/2 with reuse mode aggressive or always, we
face a design issue as the connection would be marked as private,
despite potentially being shared by several clients at the same time.
This is conceptually invalid, but worst it can trigger crashes on MUX
stream detach callback depending on the order of release of the streams,
by calling session_check_idle_conn() with a NULL session. It could also
be possible to have several NTLM responses on a single connection for
different sessions. In this case, connection owner is still being
updated without attaching the connection to its correct session, which
ultimately would cause a crash on session_check_idle_conn with an
invalid session.
Here are two backtrace examples from GDB for such cases :
Thread 1 (Thread 0x7ff73e9fc700 (LWP 648859)):
#0 session_check_idle_conn (conn=0x7ff72f597800, sess=0x0) at include/haproxy/session.h:209
#1 h2_detach (sd=<optimized out>) at src/mux_h2.c:4520
#2 0x000056151742be24 in sc_detach_endp (scp=scp@entry=0x7ff73e9f0f18) at src/stconn.c:376
#3 0x000056151742c208 in sc_destroy (sc=<optimized out>) at src/stconn.c:444
#4 0x0000561517370871 in stream_free (s=s@entry=0x7ff72a2dbd80) at src/stream.c:728
#5 0x000056151737541f in process_stream (t=t@entry=0x7ff72d5e2620, context=0x7ff72a2dbd80, state=<optimized out>) at src/stream.c:2645
#6 0x0000561517456cbb in run_tasks_from_lists (budgets=budgets@entry=0x7ff73e9f10d0) at src/task.c:632
#7 0x00005615174576b9 in process_runnable_tasks () at src/task.c:876
#8 0x000056151742275a in run_poll_loop () at src/haproxy.c:2996
#9 0x0000561517422db1 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3195
#10 0x00007ff789e081ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007ff789a39e73 in clone () from /lib64/libc.so.6
(gdb)
Thread 1 (Thread 0x7ff52e7fc700 (LWP 681458)):
#0 0x0000556ebd6e7e69 in session_check_idle_conn (conn=0x7ff5787ff100, sess=0x7ff51d2539a0) at include/haproxy/session.h:209
#1 h2_detach (sd=<optimized out>) at src/mux_h2.c:4520
#2 0x0000556ebd7f3e24 in sc_detach_endp (scp=scp@entry=0x7ff52e7f0f18) at src/stconn.c:376
#3 0x0000556ebd7f4208 in sc_destroy (sc=<optimized out>) at src/stconn.c:444
#4 0x0000556ebd738871 in stream_free (s=s@entry=0x7ff520e28200) at src/stream.c:728
#5 0x0000556ebd73d41f in process_stream (t=t@entry=0x7ff565783700, context=0x7ff520e28200, state=<optimized out>) at src/stream.c:2645
#6 0x0000556ebd81ecbb in run_tasks_from_lists (budgets=budgets@entry=0x7ff52e7f10d0) at src/task.c:632
#7 0x0000556ebd81f6b9 in process_runnable_tasks () at src/task.c:876
#8 0x0000556ebd7ea75a in run_poll_loop () at src/haproxy.c:2996
#9 0x0000556ebd7eadb1 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3195
#10 0x00007ff5752081ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007ff574e39e73 in clone () from /lib64/libc.so.6
(gdb)
To solve this issue, simply ignore NTLM responses when using a
multiplexer with streams support and the connection is not already
attached to the session. The connection is not marked as private and
will continue to be shared freely accross clients. This is considered
conceptually valid as NTLM usage (rfc 4559) with HTTP is broken and was
designed only with HTTP/1.1 in mind. A side-effect of the change is that
SESS_FL_PREFER_LAST is also not set anymore on NTLM detection, which
allows following requests to be load-balanced accross several server
instances.
The original behavior is kept for HTTP/1 or if the connection is already
attached to the session. This last case happens when using HTTP/2 with
default http-reuse safe mode since the following patch :
0d21deaded
MEDIUM: backend: add reused conn to sess if mux marked as HOL blocking
This should be backported up to all stable releases. Up until 2.4, it
can be taken as-is. For lesser versions, above patch is not present. In
this case the condition should be restricted only to HTTP/1 usage :
if (srv_conn && strcmp(srv_conn->mux->name, "H1") == 0) {
Frames with a too small size must be detected on receive and an error must
be triggered. It is especially important for frames of size 0. Otherwise,
because the frame length is used as return value, the frame is ignored (0 is
the return value to state the frame must be ignored). It is an issue because
in this case, outgoing data, the 4 bytes representing the frame size, are
never consumed. If the agent also closes the connection, this leads to a
wakeup loop because outgoing data are stuck and a shutdown is pending.
In addition, all pending outgoing data are systematcially skipped when the
applet is in SPOE_APPCTX_ST_END state.
The patch should fix the issue #2490. It must be backported to all stable
versions.
It is the first deprecated directive exposed via the
'expose-deprecated-directives' global option. This way, it is possible to
silent the warning about the SPOE uses.
Similarly to "expose-exprimental-directives" option, there is no a global
option to expose some deprecated directives. Idea is to have a way to silent
warnings about deprecated directives when there is no alternative solution.
Of course, deprecated directives covered by this option are not listed and
may change. It is only a best effort to let users upgrade smoothly.
As announced on the ML few weeks (months ?) ago and on several GH issues,
the SPOE is now deprecated. Sadly, this filter should be refactored to work
properly. It was implemented as a functionnal PoC for the 1.7 and since
then, no time was invest to improve it and make it truly maintainable in
time. Worst, other parts of HAProxy evolve, especially applets part, making
maintenance ever more expensive.
Instead of keeping the SPOE filter in a this state and always reply to users
encountering issues or limitations that it is far from perfect but we cannot
work on it for now, we decided to deprecate it.
We can still change our mind before the 3.0.0 release if the situation
evolves. Otherwise the filter will be removed or marked as unmaintained for
the 3.1. If the situation does not change, it means the 3.0 will be the last
version with a true SPOE support.
On soft-stop, we try, as far as possible, to process all pending messages
before closing SPOE applets. However, in sync mode, when an applets waiting
for a response receives the ACK frame, it is switched to IDLE state without
checking if it may be closed. In this case, we will wait the idle timeout
before closing de applet, delaying the soft-stop.
To reduce this delay, on soft-stop, IDLE applets are woken up. On the next
wakeup, the applet will try to process pending messages or will be
closed.
This patch should be backported to all stable versions.
On stream side, the SPOE filter relied on the stream's expiration date to be
woken up and be able to detect processing timeout. However, the stream
expiration date must not be updated this way. Mainly because it may be
overwritten at the end of process_stream(). In the worst case, it is set to
TICK_ETERNITY for any reason. In this case, it is impossible to detect the
SPOE filter must time out and abort the processing.
The right way to do is to set an analysis expiration date on the
corresponding channel, depending on the direction. This expiration date will
be used to compute the stream's expiration date at the end of
process_stream().
This patch may be related to issue #2478. It must be backported to all
stable versions.
A server can only be deleted if there is no elements which reference it.
This is taken care via srv_check_for_deletion(), most notably for active
and idle connections.
A special case occurs for connections directly managed by a session.
This is for so-called private connections, when using http-reuse never
or H2 + http-reuse safe for example. In this case. server does not
account these connections into its idle lists. This caused a bug as the
server is deleted despite the session still being able to access it.
To properly fix this, add a new referencing element into the server for
these session connections. A mt_list has been chosen for this. On
default http-reuse, private connections are typically not used so it
won't make any difference. If using H2 servers, or more generally when
dealing with private connections, insert/delete should typically occur
only once per session lifetime so impact on performance should be
minimal.
This should be backported up to 2.4. Note that srv_check_for_deletion()
was introduced in 3.0 dev tree. On backport, the extra condition in it
should be placed in cli_parse_delete_server() instead.
By default, backend connections are attached to a server instance. This
allows to implement connection reuse. However, in some particular cases,
connection cannot be shared accross several clients. These connections
are considered and private and are attached to the session instance
instead.
These private connections are also indexed by the target server to not
mix them. All of this is implemented via a dedicated structure
previously named struct sess_srv_list.
Rename it to better reflect its usage to struct sess_priv_conns. Also
rename its internal members and all of the associated functions.
This commit is only a renaming, thus no functional impact is expected.
null pointer dereference was reported by Coverity in listener_release()
function. Indeed, we must not try to schedule frontend without task when a
limit is still blocking the frontend. This issue was introduced by commit
65ae1347c7 ("BUG/MINOR: listener: Wake proxy's mngmt task up if necessary on
session release")
This patch should fix issue #2488. It must be backported to all stable
version with the commit above.
When a session is released, listener_release() function is called to notify
the listener. It is an opportunity to resume limited/full listeners. We
first try to resume the listener owning the released session, then all
limited listeners in the global queue and finally all limited listeners in
the frontend's waiting queue. This last step is only performed if there is
no limit applied on the frontend. Nothing is performed if the session rate is
still limited. And it is an issue because if this happens for the last
listener's session, there is no other event to wake the frontend's managment
task up and the listener remains in the limited state.
To fix the issue, when a limit is still applied on the frontent, we must
compute the new wake up date from the sessions rate and schedule the
frontend's managment task.
It is easy to reproduce the issue in SSL by setting a maxconn and a rate
limit on sessions.
This patch should fix the issue #2476. It must be backported to all stable
versions.
-dI allow to enable "insure-fork-wanted" directly from the command line,
which is useful when you want to run ASAN with addr2line with a lot of
configuration files without editing them.
While trying to reproduce another crash case involving lua filters
reported by @bgrooot on GH #2467, we found out that mixing filters loaded
from different contexts ('lua-load' vs 'lua-load-per-thread') for the same
stream isn't supported and may even cause the process to crash.
Historically, mixing lua-load and lua-load-per-threads for a stream wasn't
supported, but this changed thanks to 0913386 ("BUG/MEDIUM: hlua: streams
don't support mixing lua-load with lua-load-per-thread").
However, the above fix didn't consider lua filters's use-case properly:
unlike lua fetches, actions or even services, lua filters don't simply
use the stream hlua context as a "temporary" hlua running context to
process some hlua code. For fetches, actions.. hlua executions are
processed sequentially, so we simply reuse the hlua context from the
previous action/fetch to run the next one (this allows to bypass memory
allocations and initialization, thus it increases performance), unless
we need to run on a different hlua state-id, in which case we perform a
reset of the hlua context.
But this cannot work with filters: indeed, once registered, a filter will
last for the whole stream duration. It means that the filter will rely
on the stream hlua context from ->attach() to ->detach(). And here is the
catch, if for the same stream we register 2 lua filters from different
contexts ('lua-load' + 'lua-load-per-thread'), then we have an issue,
because the hlua stream will be re-created each time we switch between
runtime contexts, which means each time we switch between the filters (may
happen for each stream processing step), and since lua filters rely on the
stream hlua to carry context between filtering steps, this context will be
lost upon a switch. Given that lua filters code was not designed with that
in mind, it would confuse the code and cause unexpected behaviors ranging
from lua errors to crashing process.
So here we take another approach: instead of re-creating the stream hlua
context each time we switch between "global" and "per-thread" runtime
context, let's have both of them inside the stream directly as initially
suggested by Christopher back then when talked about the original issue.
For this we leverage hlua_stream_ctx_prepare() and hlua_stream_ctx_get()
helper functions which return the proper hlua context for a given stream
and state_id combination.
As for debugging infos reported after ha_panic(), we check for both hlua
runtime contexts to check if one of them was active when the panic occured
(only 1 runtime ctx per stream may be active at a given time).
This should be backported to all stable versions with 0913386
("BUG/MEDIUM: hlua: streams don't support mixing lua-load with lua-load-per-thread")
This commit depends on:
- "DEBUG: lua: precisely identify if stream is stuck inside lua or not"
[for versions < 2.9 the ha_thread_dump_one() part should be skipped]
- "MINOR: hlua: use accessors for stream hlua ctx"
For 2.4, the filters API didn't exist. However it may be a good idea to
backport it anyway because ->set_priv()/->get_priv() from tcp/http lua
applets may also be affected by this bug, plus it will ease code
maintenance. Of course, filters-related parts should be skipped in this
case.
Change hlua_stream_ctx_prepare() prototype so that it now returns the
proper hlua ctx on success instead of returning a boolean.
Add hlua_stream_ctx_get() to retrieve hlua ctx out of a given stream.
This way we may easily change the storage mechanism for hlua stream in
the future without extensive code changes.
No backport needed unless a commit depends on it.
When ha_panic() is called by the watchdog, we try to guess from
ha_task_dump() and ha_thread_dump_one() if the thread was stuck while
executing lua from the stream context. However we consider this is the
case by simply checking if the stream hlua context was set, but this is
not very precise because if the hlua context is set, then it simply means
that at least one lua instruction was executed at the stream level, not
that the stuck was currently executing lua when the panic occured.
This is especially true with filters, one could simply register a lua
filter that does nothing but this will still end up initializing the
stream hlua context for each stream. If the thread end up being stuck
during the stream handling, then debug dumping functions will report
that the stream was stuck while handling lua, which is not necessarilly
true, and could in fact confuse us even more.
So here we take another approach, we add the BUSY flag to hlua context:
this flag is set by hlua_ctx_resume() around lua_resume() call, this way
we can precisely tell if the thread was handling lua when it was
interrupted, and we rely on this flag in debug functions to check if the
thread was effectively stuck inside lua or not while processing the stream
No backport needed unless a commit depends on it.
hlua_filter_delete() calls hlua_unref() on the stream hlua stack, but
we should own the lock prior to manipulating the stack.
This should be backported up to 2.6.
This is a complementary patch to 8670db7 ("BUG/MAJOR: hlua: improper lock
usage with hlua_ctx_resume()") for hlua_filter_new().
Indeed, the HLUA_E_ERRMSG case still relies on the lua stack but didn't
take the lock to do so.
This should be backported up to 2.6.
Trying to register the same lua filter from global and per-thread context
(using 'lua-load' + 'lua-load-per-thread') causes a segmentation fault in
hlua_post_init().
This is due to a simple copy paste error as we try to print the function
name in the error message (like we do when loading the same lua function
from different contexts) instead of the filter name.
This should be backported up to 2.6.
The new "ssl-security-level" option allows one to change the OpenSSL
security level without having to change the openssl.cnf global file of
your distribution. This directives applies on every SSL_CTX context.
People sometimes change their security level directly in the ciphers
directive, however there are some cases when the security level change
is not applied in the right order (for example when applying a DH
param).
Before this patch, it was to possible to trick by using a specific
openssl.cnf file and start haproxy this way:
OPENSSL_CONF=./openssl.cnf ./haproxy -f bug-2468.cfg
Values for the security level can be found there:
https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_security_level.html
This was discussed in github issue #2468.
In issue #2448, users are complaining that FIPS is not working correctly
since the removal of SSL_library_init().
This was removed because SSL_library_init() is deprecated with OpenSSL
3.x and emits a warning. But the initialization was not needed anymore
because it is done at the first openssl API call.
However it some cases it is needed. SSL_library_init() is now a define
to OPENSSL_init_ssl(0, NULL). This patch adds OPENSSL_init_ssl(0, NULL)
to the init.
This could be backported in every stable branches, however let's wait
before backporting it.
3.0-dev1 introduced a small regression with commit b4db3be86e ("BUG/MINOR:
server: fix server_find_by_name() usage during parsing"). By changing the
way servers are indexed and moving it into the server template loop, the
first one is no longer indexed because the loop starts at low+1 since it
focuses on duplication. Let's index the first one explicitly now.
This should not be backported, unless the commit above is backported.
This was not useful and was using uninitialized value. Introduced with
the commit 08ac28237 ("MINOR: Add aes_gcm_enc converter").
Must be backported wherever the commit 08ac28237 was backported.
The issue was introduced with the commit c31499d74 ("MINOR: ssl: Add
aes_gcm_dec converter").
This must be backported to all stable branches where the above converter
is present, but it may need to be adjusted for older branches because of
code refactoring.
Where possible (FreeBSD 13+), use the public, documented interface to
the ELF auxiliary argument vector: elf_aux_info().
__elf_aux_vector is a private interface exported so that the runtime
linker can set its value during process startup and not intended for
public consumption. In FreeBSD 15 it has been removed from libc and
moved to libsys.
This commit removes qc_treat_rx_crypto_frms(). This function was used in
a single place inside qc_ssl_provide_all_quic_data(). Besides, its
naming was confusing as conceptually it is directly linked to quic_ssl
module instead of quic_rx.
Thus, body of qc_treat_rx_crypto_frms() is inlined directly inside
qc_ssl_provide_all_quic_data(). Also, qc_ssl_provide_quic_data() is now
only used inside quic_ssl to its scope is set to static. Overall, API
for CRYPTO frame handling is now cleaner.
On CRYPTO frames reception, tasklet is rescheduled with TASK_HEAVY to
limit CPU consumption. This commit slighly simplifies this by regrouping
TASK_HEAVY setting and tasklet_wakeup() instructions in a single
location in qc_handle_crypto_frm(). All other unnecessary
tasklet_wakeup() are removed.
Till now it was still needed to write rules to eliminate bad behaving
H2 clients, while most of the time it would be desirable to just be able
to set a threshold on the level of anomalies on a connection.
This is what this patch does. By setting a glitches threshold for frontend
and backend, it allows to automatically turn a connection to the error
state when the threshold is reached so that the connection dies by itself
without having to write possibly complex rules.
One subtlety is that we still have the error state being exclusive to the
parser's state so this requires the h2c_report_glitches() function to return
a status indicating if the threshold was reached or not so that processing
can instantly stop and bypass the state update, otherwise the state could
be turned back to a valid one (e.g. after parsing CONTINUATION); we should
really contemplate the possibility to use H2_CF_ERROR for this. Fortunately
there were very few places where a glitch was reported outside of an error
path so the changes are quite minor.
Now by setting the front value to 1000, a client flooding with short
CONTINUATION frames is instantly stopped.
The function aims at centralizing counter measures but due to the fact
that it only increments the counter by one unit, sometimes it was not
used and the value was calculated directly. Let's pass the increment in
argument so that it can be used everywhere.
Compilation on solaris fails because of usage of names reserved on that
platform, i.e. 'queue' and 's_addr'.
This patch redefines 'queue' as '_queue' and renames 's_addr' to
'srv_addr' which fixes compilation for now.
Future plan: rename 'queue' in code base so define can be removed again.
Backporting: 2.9, 2.8
The sink lock was made to prevent event producers from passing while
there were other threads trying to print a "dropped" message, in order
to guarantee the absence of reordering. It has a serious impact however,
which is that all threads need to take the read lock when producing a
regular trace even when there's no reader.
This patch takes a different approach. The drop counter is shifted left
by one so that the lowest bit is used to indicate that one thread is
already taking care of trying to dump the counter. Threads only read
this value normally, and will only try to change it if it's non-null,
in which case they'll first check if they are the first ones trying to
dump it, otherwise will simply count another drop and leave. This has
a large benefit. First, it will avoid the locking that causes stalls
as soon as a slow reader is present. Second, it avoids any write on the
fast path as long as there's no drop. And it remains very lightweight
since we just need to add +2 or subtract 2*dropped in operations, while
offering the guarantee that the sink_write() has succeeded before
unlocking the counter.
While a reader was previously limiting the traffic to 11k RPS under
4C/8T, now we reach 36k RPS vs 14k with no reader, so readers will no
longer slow the traffic down and will instead even speed it up due to
avoiding the contention down the chain in the ring. The locking cost
dropped from ~75% to ~60% now (it's in ring_write now).
When a reader doesn't read fast enough and causes drops, subsequent
threads try to produce a "dropped" message. But it takes time to
produce and emit this message, in part due to the use of chunk_printf()
that relies on vfprintf() which has to parse the printf format, and
during this time other threads may continue to increment the counter.
This is the reason why this is currently performed in a loop. When
reading what is received, it's common to see a large count followed
by one or two single-digit counts, indicating that we could possibly
have improved that by writing faster.
Let's improve the situation a little bit. First we're now using a
static message prefixed with enough space to write the digits, and a
call to ultoa_r() fills these digits from right to left so that we
don't have to process a format string nor perform a copy of the message.
Second, we now re-check the counter immediately after having prepared
the message so that we still get an opportunity for updating it. In
order to avoid too long loops, this is limited to 10 iterations.
Tests show that the number of single-digit "dropped" counters on output
now dropped roughly by 15-30%. Also, it was observed that with 8 threads,
there's almost never more than one retry.
The previous patch fix the handling of in-order CRYPTO frames which
requires the usage of a new buffer for these data as their handling is
delayed to run under TASK_HEAVY.
In fact, as now all CRYPTO frames handling must be delayed, their
handling can be unify. This is the purpose of this commit, which removes
the just introduced new buffer. Now, all CRYPTO frames are buffered
inside the ncbuf. Unused elements such as crypto_frms member for
encryption level are also removed.
This commit is not a bugcfix but is a direct follow-up to the last one.
As such, it can probably be backported with it to 2.9 to reduce code
differences between these versions.
QUIC relies on SSL_do_hanshake() to be able to validate handshake. As
this function is computation heavy, it is since 2.9 called only under
TASK_HEAVY. This has been implemented by the following patch :
94d20be138
MEDIUM: quic: Heavy task mode during handshake
Instead of handling CRYPTO frames immediately during reception, this
patch delays the process to run under TASK_HEAVY tasklet. A frame copy
is stored in qel.rx.crypto_frms list. However, this frame still
reference the receive buffer. If the receive buffer is cleared before
the tasklet is rescheduled, it will point to garbage data, resulting in
haproxy decryption error. This happens if a fair amount of data is
received constantly to preempt the quic_conn tasklet execution.
This bug can be reproduced with a fair amount of clients. It is
exhibited by 'show quic full' which can report connections blocked on
handshake. Using the following commands result in h2load non able to
complete the last connections.
$ h2load --alpn-list h3 -t 8 -c 800 -m 10 -w 10 -n 8000 "https://127.0.0.1:20443/?s=10k"
Also, haproxy QUIC listener socket mode was active to trigger the issue.
This forces several connections to share the same reception buffer,
rendering the bug even more plausible to occur. It should be possible to
reproduce it with connection socket if increasing the clients amount.
To fix this bug, define a new buffer under quic_cstream. It is used
exclusively to copy CRYPTO data for in-order frame if ncbuf is empty.
This ensures data remains accessible even if receive buffer is cleared.
Note that this fix is only a temporary step. Indeed, a ncbuf is also
already used for out-of-order data. It should be possible to unify its
usage for both in and out-of-order data, rendering this new buffer
instance unnecessary. In this case, several unneeded elements will
become obsolete such as qel.rx.crypto_frms list. This will be done in a
future refactoring patch.
This must be backported up to 2.9.
The converter can be used to encrypt the raw byte input using the
AES-GCM algorithm, using provided nonce and key.
Co-authored-by: Dragan Dosen (ddosen@haproxy.com)
When a parsing error occurs inside a log-format-sd expression, we report
the location of the log-format directive (which may not be set) instead
of reporting the proper log-format-sd directive location where the parsing
error occured.
1|listen test
2| log-format "%B" # no error
3| log-format-sd "%bad" # error
| [ALERT] (322261) : config : Parsing [empty.conf:2]: failed to parse log-format-sd : no such format variable 'bad'. If you wanted to emit the '%' character verbatim, you need to use '%%'.
The fix consists in using the config hints dedicated to log-format-sd
directive instead of the log-format one.
The bug was introduced in 8a4e4420 ("MEDIUM: log-format: Use standard
HAProxy log system to report errors").
This should be backported to every stable versions.
httpclient_precheck(), ssl_ocsp_update_precheck(), and
resolvers_create_default() functions are registered through
REGISTER_PRE_CHECK() macro to be called by haproxy during init from the
pre_check_list list. When calling functions registered in pre_check_list,
haproxy expects ERR_* return values. However those 3 functions currently
use raw return values, so we better use explicit ERR_* macros to prevent
breakage in the future if ERR_* values mapping were to change.
Since 833cc794 ("MEDIUM: sample: handle comma-delimited converter list")
logformat expressions now support having a comma-delimited converter list
right after the fetch. Let's remove a leftover comment from the initial
implementation that says otherwise.
A remote unidirectional stream can be aborted prematurely if application
layers cannot identify its type. In this case, a STOP_SENDING frame is
emitted.
Since QUIC MUX refactoring, a crash would occur in this scenario due to
2 specific characteristics of remote uni streams :
* qcs.tx.fctl was not initialized completely. This cause a crash due to
BUG_ON() statement inside qcs_destroy().
* qcs.stream is never allocated. This caused qcs_prep_bytes() to crash
inside qcc_io_send().
This bug is considered minor as it happens only on very specific QUIC
clients. It was detected when using s2n-quic over interop.
This does not need to be backported.
After handshake completion, QUIC server is responsible to emit
HANDSHAKE_DONE frame. Some clients wait for it to begin STREAM
transfers.
Previously, there was no explicit tasklet_wakeup() after handshake
completion, which is necessary to emit post-handshake frames. In most
cases, this was undetected as most client continue emission which will
reschedule the tasklet. However, as there is no tasklet_wakeup(), this
is not a consistent behavior. If this bug occurs, it causes a connection
freeze, preventing the client to emit any request. The connection is
finally closed on idle timeout.
To fix this, add an explicit tasklet_wakeup() after handshake
completion. It sounds simple enough but in fact it's difficult to find
the correct location efor tasklet_wakeup() invocation, as post-handshake
is directly linked to connection accept, with different orderings.
Notably, if 0-RTT is used, connection can be accepted prior handshake
completion. Another major point is that along HANDSHAKE_DONE frame, a
series of NEW_CONNECTION_ID frames are emitted. However, these new CIDs
allocation must occur after connection is migrated to its new thread as
these CIDs are tied to it. A BUG_ON() is present to check this in
qc_set_tid_affinity().
With all this in mind, 2 locations were selected for the necessary
tasklet_wakeup() :
* on qc_xprt_start() : this is useful for standard case without 0-RTT.
This ensures that this is done only after connection thread migration.
* on qc_ssl_provide_all_quic_data() : this is done on handshake
completion with 0-RTT used. In this case only, connection is already
accepted and migrated, so tasklet_wakeup() is safe.
Note that as a side-change, quic_accept_push_qc() API has evolved to
better reflect differences between standard and 0-RTT usages. It is now
forbidden to call it multiple times on a single quic_conn instance. A
BUG_ON() has been added.
This issue is labelled as medium even though it seems pretty rare. It
was only reproducible using QUIC interop runner, with haproxy compiled
with LibreSSL with quic-go as client. However, affected code parts are
pretty sensible, which justify the chosen severity.
This should fix github issue #2418.
It should be backported up to 2.6, after a brief period of observation.
Note that the extra comment added in qc_set_tid_affinity() can be
removed in 2.6 as thread migration is not implemented for this version.
Other parts should apply without conflict.
In resolvers.c:rslv_promex_next_ts() and in
stick-tables.c:stk_promex_next_ts(), an unused argument was mistakenly
called "unsued" instead of "unused". Let's fix this in a separate patch
so that it can be omitted from backports if this causes build problems.
This is 39th iteration of typo fixes
The naming issue on the argument called "unsued" instead of "unused"
in two functions from resolvers and stick-tables was put into a second
patch so that it can be omitted if it were to cause backport issues.
That's exactly the same as commit 53bfab080c ("BUG/MINOR: sink: fix a race
condition between the writer and the reader") that went into 2.7 and was
backported as far as 2.4, except that since the code was duplicated, the
second instance was not noticed, leaving the race present. The race has
a limited impact, if a forwarder reaches the end of the logs and a new
message arrives before it leaves, the forwarder will only wake up after
yet another new message will be sent. In practice it remains unnoticeable
because for the race to trigger, one needs to have a steady flow of logs,
which means the wakeup will happen anyway.
This should be backported, but no need to insist on it if it resists.
Instead of reporting lua errors using ha_alert(), let's use SEND_ERR()
helper which will also try to generate a log message according to lua
log settings.
hlua_event_subscribe() is meant to be called from a protected lua env
during init and/or runtime. As such, only hlua_event_sub() makes uses
of it: when an error happens hlua_event_sub() will already raise a Lua
exception. Thus it's not relevant to use ha_alert() there as it could
generate log pollution (error is relevant from Lua script point of view,
not from haproxy one).
This could be backported in 2.8.
hlua_ctx_resume() itself can safely be used as-is in a multithreading
context because it takes care of taking the lua lock.
However, when hlua_ctx_resume() returns, the lock is released and it is
thus the caller's responsibility to ensure it owns the lock prior to
performing additional manipulations on the Lua stack. Unfortunately, since
early haproxy lua implementation, we used to do it wrong:
The most common hlua_ctx_resume() pattern we can find in the code (because
it was duplicated over and over over time) is the following:
|ret = hlua_ctx_resume()
|switch (ret) {
| case HLUA_E_OK:
| break;
| case HLUA_E_ERRMSG:
| break;
| [...]
|}
Problem is: for some of the switch cases, we still perform lua stack
manipulations. This is the case for the HLUA_E_ERRMSG for instance where
we often use lua_tostring() to retrieve last lua error message on the top
of the stack, or sometimes for the HLUA_E_OK case, when we need to perform
some lua cleanup logic once the resume ended. But all of this is done
WITHOUT the lua lock, so this means that the main lua stack could be
accessed simultaneously by concurrent threads when a script was loaded
using 'lua-load'.
While it is not critical for switch-cases dedicated to error handling,
(those are not supposed to happen very often), it can be very problematic
for stack manipulations occuring in the HLUA_E_OK case under heavy load
for instance. In this case, main lua stack corruptions will eventually
happen. This is especially true inside hlua_filter_new(), where this bug
was known to cause lua stack corruptions under load, leading to lua errors
and even crashing the process as reported by @bgrooot in GH #2467.
The fix is relatively simple, once hlua_ctx_resume() returns: we should
consider that ANY lua stack access should be lua-lock protected. If the
related lua calls may raise lua errors, then (RE)SET_SAFE_LJMP
combination should be used as usual (it allows to lock the lua stack and
catch lua exceptions at the same time), else hlua_{lock,unlock} may be
used if no exceptions are expected.
This patch should fix GH #2467.
It should be backported to all stable versions.
[ada: some ctx adj will be required for older versions as event_hdl
doesn't exist prior to 2.8 and filters were implemented in 2.5, thus
some chunks won't apply]
When we want to perform some unsafe lua stack manipulations from an
unprotected lua environment, we use SET_SAFE_LJMP() RESET_SAFE_LJMP()
combination to lock lua stack and catch potential lua exceptions that
may occur between the two.
Hence, we regularly find this pattern (duplicated over and over):
|if (!SET_SAFE_LJMP(hlua)) {
| const char *error;
|
| if (lua_type(hlua->T, -1) == LUA_TSTRING)
| error = hlua_tostring_safe(hlua->T, -1);
| else
| error = "critical error";
| SEND_ERR(NULL, "*: %s.\n", error);
|}
This is wrong because when SET_SAFE_LJMP() returns false (meaning that an
exception was caught), then the lua lock was released already, thus the
caller is not expected to perform lua stack manipulations (because the
main lua stack may be shared between multiple threads). In the pattern
above we only want to retrieve the lua exception message which may be
found at the top of the stack, to do so we now explicitly take the lua
lock before accessing the lua stack. Note that hlua_lock() doesn't catch
lua exceptions so only safe lua functions are expected to be used there
(lua functions that may NOT raise exceptions).
It should be backported to every stable versions.
[ada: some ctx adj will be required for older versions as event_hdl
doesn't exist prior to 2.8 and filters were implemented in 2.5, thus
some chunks won't apply, but other fixes should stay relevant]
In hlua_filter_new(), after each hlua resume, we systematically try to
empty the stack by calling lua_settop(). However we're doing this without
locking the lua context, so it is unsafe in multithreading context if the
script is loaded using 'lua-load'. To fix the issue, we protect the call
with hlua_{lock,unlock}() helpers.
This should be backported up to 2.6.
In hlua_filter_callback(), some lua stack work is performed under
SET_SAFE_LJMP() guard which also takes care of locking the hlua context
when needed. However, a lua_gettop() call is performed out of the guard,
thus it is unsafe in multithreading context if the script is loaded using
'lua-load' because in this case the main lua stack is shared between
threads and each access to a lua stack must be performed under the lock,
thus we move lua_gettop() call under the lock.
It should be backported up to 2.6.
hlua_filter_new() handles memory allocation errors by jumping to the
"end:" cleanup label in case of errors. Such errors may happen when the
system is heavily loaded for instance.
In hlua_filter_new(), we try to allocate two hlua contexts in a row before
checking if one of them failed (in which case we jump to the cleanup part
of the function), and only then we initialize them both.
If a memory allocation failure happens for only one out of the two
flt_ctx->hlua[] contexts pair, we still jump to the cleanup part.
It means that the hlua context that was successfully allocated and wasn't
initialized yet will be passed to hlua_ctx_destroy(), resulting in invalid
reads in the cleanup function, which may ultimately cause the process to
crash.
To fix the issue: we make sure flt_ctx hlua contexts are initialized right
after they are allocated, that is before any error handling condition that
may force the cleanup.
This bug was discovered when trying to reproduce GH #2467 with haproxy
started with "-dMfail" argument.
It should be backported up to 2.6.
As per lua documentation, lua_tostring() may raise a memory error.
However, we're often using it to fetch the error message at the top of
the stack (ie: after a failing lua call) from unprotected environments.
In practise, lua_tostring() has rare chances of failing, but still, if
it happens to be the case, it could crash the process and we better not
risk it.
So here, we add hlua_tostring_safe() function, which works exactly as
lua_tostring(), but the function cannot LJMP as it will catch
lua_tostring() exceptions to return NULL instead.
Everywhere lua_tostring() was used to retrieve error string from such
unprotected contexts, we now rely on hlua_tostring_safe().
This should be backported to all stable versions.
[ada: ctx adj will be required, for versions prior to 2.8 event_hdl
API didn't exist so some chunks won't apply, and prior to 2.5 filters
API didn't exist either, so again, some chunks should be ignored]
Lua documentation says that lua_tostring() returns a pointer that remains
valid as long as the object is not removed from the stack.
However there are some places were we use the returned string AFTER the
corresponding object is removed from the stack. In practise this doesn't
seem to cause visible bugs (probably because the pointer remains valid
waiting for a GC cycle), but let's fix that to comply with the
documentation and avoid undefined behavior.
It should be backported in all stable versions.
Thomas Baroux reported a very interesting issue. "balance random" would
systematically assign the same server first upon restart. That comes from
its use of statistical_prng() which is only seeded with the thread number,
and since at low loads threads are assigned to incoming connections in
round robin order, practically speaking, the same thread always gets the
same request and will produce the same random number.
We already have a much better RNG that's also way more expensive, but we
can use it at boot time to seed the PRNG instead of using the thread ID
only.
This needs to be backported to 2.4.
Add core.silent (-1) value to be able to disable logging via
TXN:set_loglevel() call. Otherwise, there is no way to do so and it may be
handy. This special value cannot be used with TXN:log() function.
This patch may be backported if necessary.
When the log level is changed in lua, by calling TXN:set_loglevel function,
it must be incremented by one because it is decremented in strm_log()
function.
This patch must be backport to all stable versions.
PROXY procotol is not supported on QUIC for now. Thus return an error during
configuration parsing if 'accept-proxy' option is used for a QUIC listener.
This patch should fix the issue #2186. It should be backport as far as 2.6.
2 return values are specified in the h2s_make_data() function comment. Both
are more or less equivalent but the later is probably more accurate. So,
keep the right one and remove the other one.
This patch should fix the issue #2175.
Extend "show quic" to be able to dump MUX related information. This is
done via the new function qcc_show_quic(). This replaces the old streams
dumping list which was incomplete.
These info are displayed on full output or by specifying "mux" field.
Add the possibility to customize show quic full output with only a
specific set of printed fields. This is specified as a comma-separated
list. Here are the currently supported values :
* tp: transport parameters
* sock: connection addresses and socket FD
* pktns: packet number space with ack ranges and in flight bytes
* cc: congestion controler and loss information
Note that streams output is not filtered by this mechanism. It's because
it will be replaced soon by an output generated from the MUX which will
use its owned field name.
Add the possibilty to restrict show quic output to only a single
connection. This is done by specifying a quic_conn address pointer.
Default format selection has evolved with it. Indeed, it seems more
fitting to use full format by default when filtering on a connection.
However, it's still possible to revert to the original oneline format
with it by specifying it explicitely.
When a response was returned by HAProxy, a dedicated HTX flag was
set. Thanks to this flag, it was possible to add a "connection: close"
header to the response if the request was not fully received and to close
the connection. In the same way, when a redirect rule was applied,
keep-alive was forcefully disabled for unfinished requests.
All these mechanisms are now useless because the H1 mux is able to drain the
response. So HTX_FL_PROXY_RESP flag is removed and no special processing is
performed on HAProxy response when the request is unfinished.
unlike for H2 and H3, there is no mechanism in H1 to notify the client it
must stop to upload data when a response is replied before the end of the
request without closing the connection. There is no RST_STREAM frame
equivalent.
Thus, there is only two ways to deal with this situation: closing the
connection or draining the request. Until now, HAProxy didn't support
draining H1 messages. Closing the connection in this case has however a
major drawback. It leads to send a TCP reset, dropping this way all in-fly
data. There is no warranty the client has fully received the response.
Draining H1 messages was never implemented because in old versions it was a
bit tricky to implement. However, it is now far simplier to support this
feature because it is possible to have a H1 stream without any applicative
stream. It is the purpose of this patch. Now, when a shutdown is requested
and the stream is detached from the connection, if the request is unfinished
while the response was fully sent, the request in drained.
To do so, in this case the shutdown and the detach are delayed. From the
upper layer point of view, there is no changes. The endpoint is shut down
and detached as usual. But on H1 mux point of view, the H1 stream is still
alive and is being able to drain data. However the stream-endpoint
descriptor is orphan. Once the request is fully received (and drained), the
connection is shut down if it cannot be reused for a new transaction and the
H1 stream is destroyed.
All code from h1_detach() function was moved in a internal function,
h1s_finish_detach(). It will be used to defer the detach and be able to
drain the requests payload.
Checks performed in h1_shutw() to determine if the connection must be
shutdown now or not was move in a dedicated function. This will be used to
be able to drain the requests payload.
During a zero-copy forwarding negociation, if the H1 mux is blocked for any
reason, the IOBUF_FL_FF_BLOCKED flag must be set on its iobuf to notfiy the
producer it must wait. However, there were two places where it was not
performed: when the output buffer allocation failed and when the chunk
formatting failed.
This patch fixes the issue. It must be backported to 2.9.
There is still an issue with zero-copy forwarding of chunks with an unknown
size. It is possible for a producer to fill the sapce reserved for the CRLF
at the end of the chunk. The root cause is that this space is not accounted
in the iobuf offset. So, from the producer point of view, the space may be
used. We can also argue the current design for iobuf is not well suited for
this case. Instead of using a pointer on the consumer's buffer, it could be
easier to use a custom buffer built on top of the consumer one, via a call
to b_make(), with the size, head and data field reflecting the avaialble
space the producer can use.
By the way, because of this bug, it is possible to trigger a BUG_ON() when
we try to write the CRLF at the end of the chunk because the buffer is
full. It is unexpected. Only the stats applet may hit this bug.
To fix the issue, instead of writting this CRLF when the current chunk is
consumed, it is written before consuming the next one. This way, all space
reserved to create the chunk formatting is always placed before forwarding
data.
No backport needed.
This is a followup of the previous commit: GH user @songliumeng initially
reported an issue with the GPL license version for event_hdl source file
which was fixed by the previous commit. It turns out the same mistake was
made in http_ext source file: due to a mixup between LGPL and GPL, GPL
version '2.1' was referenced instead of '2'.
Again, clarify that this is indeed GPL by making use of the banner
provided in doc/gpl.txt
This should be backported in 2.8 with b2bb925 ("MINOR: proxy/http_ext:
introduce proxy forwarded option")
As spotted by user @songliumeng in GH #2463, there was a mixup between
LGPL and GPL in event_hdl source file: GPL version '2.1' was referenced
instead of '2'. Clarify that this is indeed GPL by making use of the
banner provided in doc/gpl.txt.
This should be backported in 2.8 with 68e692d ("MINOR: event_hdl: add
event handler base api")
Since 23cab33 ("BUG/MINOR: ssl: Clear the ckch instance when deleting a
crt-list line"), LIST_DELETE is done twice, one time in
cli_parse_del_crtlist() and another time in ckch_inst_free().
It could trigger a crash with -DDEBUG_LIST.
This isn't a major problem since the ptr is not freed in the meantime so
it will only trigger with the debug.
This patch removes the LIST_DELETE as well as the loop done on link_ref
which is also don in ckch_inst_free()
Could be backported as far as 2.4. 2.4 version does not have a link_ref
loop.
Contrary to static servers, dynamic servers does not initialize their
settings from a default server instance. As such, _srv_parse_init() was
responsible to set a set of minimal values to have a correct behavior.
However, some settings were not properly initialized. This caused
dynamic servers to not behave as static ones without explicit
parameters.
Currently, the main issue detected is connection reuse which was
completely impossible. This is due to incorrect pool_purge_delay and
max_reuse settings incompatible with srv_add_to_idle_list().
To fix the connection reuse, but also more generally to ensure dynamic
servers are aligned with other server instances, define a new function
srv_settings_init(). This is used to set initial values for both default
servers and dynamic servers. For static servers, srv_settings_cpy() is
kept instead, using their default server as reference.
This patch could have unexpected effects on dynamic servers behavior as
it restored proper initial settings. Previously, they were set to 0 via
calloc() invocation from new_server().
This should be backported up to 2.6, after a brief period of
observation.
This patch reverts 2 fixes that were made in an attempt to fix the
ocsp-update feature used with the 'commit ssl cert' command.
The patches crash the worker when doing a soft-stop when the 'set ssl
ocsp-response' command was used, or during runtime if the ocsp-update
was used.
This was reported in issue #2462 and #2442.
The last patch reverted is the associated reg-test.
Revert "BUG/MEDIUM: ssl: Fix crash when calling "update ssl ocsp-response" when an update is ongoing"
This reverts commit 5e66bf26ec.
Revert "BUG/MEDIUM: ocsp: Separate refcount per instance and per store"
This reverts commit 04b77f84d1b52185fc64735d7d81137479d68b00.
Revert "REGTESTS: ssl: Add OCSP related tests"
This reverts commit acd1b85d3442fc58164bd0fb96e72f3d4b501d15.
In appctx_htx_rcv_buf(), HTX blocks found in the appctx output buffer are
copied into the channel buffer. At the end, the state of the underlying
buffer must be updated. If everything was copied, the buffer is reset. This
way, it will be released later, at the end of the applet process function.
However, here there was a typo. We do it on the input buffer instead of the
output buffer. As side effect, an empty HTX message remained stuck in the
appctx outbut buffer, blocking the applet and leading to blocked session
with no expiration date.
No backport needed.
Due to the possibility of calling a control process after adding CRLs, the
ssl_commit_crlfile_cb variable was added. It is actually a pointer to the
callback function, which is called if defined after initial loading of CRL
data from disk and after committing CRL data via CLI command
'commit ssl crl-file ..'.
If the callback function returns an error, then the CLI commit operation
is terminated.
Also, one case was added to the CLI context used by "commit cafile" and
"commit crlfile": CACRL_ST_CRLCB in which the callback function is called.
Signed-off-by: William Lallemand <wlallemand@haproxy.com>
Output of 'show quic' is messed up since the introduction of reordered
packets counter in the following commit. The new counter is mixed up
with the first stream line. This is due to the wrong placement of the
newline delimiter.
167e38e0e0
MINOR: quic: Add a counter for reordered packets
This should be backported up to 2.6.
The issue was decribed in commit "BUG/MEDIUM: cli: Warn if pipelined commands
are delimited by a \n". In non-interactive mode, it was possible to use a
newline character as delimiter for pipelined commands. As a consequence, it was
possible to stop commands processing on the middle.
With the above commit, a warning is emitted to notify users. With this one,
we restore the expected behavior, as documented in the management guide.
Only the first line of commands is parsed. This commit will not be
backported to avoid breaking changes on stable versions.
This commit has of course some visible effects. All script using a newline
character as delimiter to pipeline commands in non-interactive mode will
stop working. Only the first command will be evaluated, all others will be
ignored. Pipelined commands MUST now be separated by a semi-colon.
But there is a more subtle and probably more annoying change. It is no
longer possible to pipeline commands with a payload ! A command with a
payload will always be the last one evaluated because it must be finished by
a newline (eventually preceeded by a custom pattern).
It is really annoying to introduce such breaking change. But, on the long
term, it is mandatory. The 2.8 will be the last LST version supporting the
old behavior (with some warning however). This will let 4 years to users to
adapt their scripts.
No backport needed.
This was broken since commit 0011c25144 ("BUG/MINOR: cli: avoid O(bufsize)
parsing cost on pipelined commands"). It is not really a bug fix but it is
labelled as is to make it more visible.
Before, a full line was first retrieved from the request buffer before
extracting the first command to eval it. Now, only one command is retrieved.
But we rely on the request buffer state to interrupt processing in
non-interactive mode. After a command processing, if output of the request
buffer is empty, we leave. Before the above commit, this was not a problem.
But since then, it is obviously a bad statement. First because some input
data may still be there. It is not true today, but it might change. Then,
there is no warranty to receive all commands in same time. For small list of
commands, it will be most of time the case, but it is a dangerous
assumption. For long list of commands, it is almost always false.
To be an issue, commands must be chunked exactly between two commands. But
in this case, remaining commands are skipped. A good way to reproduce the
issue is to wait a bit between two commands, for instance:
(printf "show info;"; sleep 2; printf "show stat\n") | socat ...
In fact, to properly fix the issue, we should exit on the first command
finished by a newline. Indeed, as stated in the documentation, in
non-interactive mode, a single line is processed. To pipeline commands,
commands must be separated by a semi-colon. Unfortunately, the above commit
introduced another change. It is possible to pipeline commands delimited by
a newline. It was pushed 2 years ago and backported to all stable versions.
Several scripts may rely on this behavior.
So, on stable version, the bug will not be fixed. However a warning will be
emitted to notify users their scripts don't respect the documentation and
they must adapt it. Mainly because the cli behavior on this point will be
changed in 3.0 to stick to the doc. This warning will only be emitted once
over the whole worker process life. Idea is to not flood the logs with the
same warning for every offending commands.
This commit should probably be backported to all stable versions. But with
some cautions because the CLI was often modified.
This loop was added to detect pipelined commands when only co_getline() was
used to get commands. Now, co_getdelim() is used and the semi-colon is also
considered as a command delimiter.
As side effet, the last semi-colon, if any, is no longer replaced by a
newline. Thus, we must take care to adapt the test to detect partial
commands.
On qcs_destroy(), a BUG_ON() statement check that QCS does not have
anymore prepared data. This is to ensure connection flow control is
always coherent and prevent transfer freeze.
However, this BUG_ON() may cause a spurrious crash in case QCC is
considered on error. Indeed, in this case, all transfers are interrupted
and qmux_strm_detach() will proceed to immediate QCS free before
releasing the connection. In this situation, connection flow control is
irrelevant so the BUG_ON() should be ignored.
This crash occurs since the MUX refactoring via the following patch.
Previously, a similar BUG_ON() was used but it was incorrectly
implemented rendering it immune even to targetted cause.
3fe3251593
MEDIUM: mux-quic: simplify sending API
This should fix github issue #2456.
This does not need to be backported.
Before a dynamic server can be deleted, a set of preconditions must be
validated to ensure it is not referenced naymore by a stream or a
connection. This is implemented in srv_check_for_deletion().
The various criteria specified were incomplete. This allows a server
instance to be deleted while still be referenced by a stream and a
connection.
This bug was reproduced by using ASAN compilation. A script was used to
add and delete a server every second, while using h2load to generate
traffic with download of 1k objects. Here is the ASAN error.
==140916==ERROR: AddressSanitizer: heap-use-after-free on address 0x520000020080 at pc 0x63cb25679537 bp 0x701529ff5070 sp 0x701529ff5060
READ of size 1 at 0x520000020080 thread T7
#0 0x63cb25679536 in objt_server include/haproxy/obj_type.h:99
#1 0x63cb2568f465 in process_stream src/stream.c:1823
#2 0x63cb25a4a4a2 in run_tasks_from_lists src/task.c:632
#3 0x63cb25a4bf62 in process_runnable_tasks src/task.c:876
#4 0x63cb2596a220 in run_poll_loop src/haproxy.c:3050
#5 0x63cb2596b192 in run_thread_poll_loop src/haproxy.c:3252
#6 0x701539aa9559 (/usr/lib/libc.so.6+0x8b559) (BuildId: c0caa0b7709d3369ee575fcd7d7d0b0fc48733af)
#7 0x701539b26a3b (/usr/lib/libc.so.6+0x108a3b) (BuildId: c0caa0b7709d3369ee575fcd7d7d0b0fc48733af)
To fix this, add <curr_used_conns> to the counters checked in
srv_check_for_deletion().
Outside of this bug, one case which remains sensible is for SF_DIRECT
streams which referenced a server instance early in process_stream()
before connect_server(). This occurs with use-server directive,
force-persist rule or cookie persistence. However, after code
reexamination, the code is considered reliable as process_stream() is
not rescheduled before connect_server() invocation. These observations
have been saved in sess_change_server() documentation to ensure it
remains valid in the future.
This must be backported up to 2.6.
Server refcount is used to protect from server deletion while dumping a
server instance, for stats dump on both CLI and HTTP applet. However,
dump can be aborted prematurely before reaching the end. In this case,
server refcount is never decremented.
This bug can cause an inconsistency on servers refcount, preventing them
to be deleted even after "del server" success.
To fix this, implement release handler for both stats CLI and HTTP
applet. Drop server reference if dump was interrupted during servers
loop.
This should be backported up to 2.6.
Recent commit 2ed6068 ("MINOR: log: custom name for logformat node")
introduced a potential memory leak because when custom name is provided,
lf->name value is allocated using strdup(), thus is expected to be freed
alongside the node when the node is released.
However lf->name was only freed in some common places within log.c
cleanups and helpers func, but in reality there are still cases where
lf nodes are manually freed without making use of freeing helpers.
So this is what this patch does, it makes sure all lf freeing places now
leverage the free_logformat_node() helper function that takes care of
freeing all known allocated elements within the node, including custom
name.
This commit depends on:
- "MINOR: log: add free_logformat_node() helper function"
No backport needed unless 2ed6068 gets backported.
This is a follow up for 24a5e42db6 ("CLEANUP: log: deinitialization of
the log buffer in one function") as there was another opportunity to
make use of the new cleanup function.
A chunk message transferred via zero-copy forwarding in H1 may be
corrupted. This only happens when the chunk size is not known during the
nego stage and when there is nothing to forward when h1_donn_ff() is
called. In this case, we always emit a chunk. Because there is nothing to
forward, a 0-CRLF is emitted in the middle of the message.
The issue occurred with the HTTP stats applet only.
A simple fix is to check the size of data in the iobuf before emitting a new
chunk in h1_done_ff(). However, we still try to send outgoing data because
when this happens, it is most of time because the H1 output buffer is almost
full.
This patch should fix the issue #2453. No backport needed.
Previously, msghdr struct used for sendmsg was memset to 0. This was
updated for performance reason with each members individually defined.
This is done by the following commit :
commit 107d6d7546
OPTIM: quic: improve slightly qc_snd_buf() internal
msg_flags is the only member unset, as sendmsg manual page reports that
it is unused. However, this caused a coverity report. In the end, it is
better to explicitely set it to 0 to avoid any future interrogations,
compiler warning or even portability issues.
This should fix coverity report from github issue #2455.
No need to backport unless above patch is.
The to_forward field was added to debugging output of applets with commit
62a81cb6a ("MINOR: applet: Add callback function to deal with zero-copy
forwarding"), though it's a size_t printed as %lu, which causes complaints
on 32-bit archs. Let's just cast as %lu.
No backport is needed.
This patch is the direct followup of the previous one :
MINOR: quic: remove sendto() usage variant
This finalizes qc_snd_buf() simplification by removing send() syscall
usage for quic-conn owned socket. Syscall invocation is merged in a
single code location to the sendmsg() variant.
The only difference for owned socket is that destination address for
sendmsg() is set to NULL. This usage is documented in man 2 sendmsg as
valid for connected sockets. This allows maximum performance by avoiding
unnecessary lookups on kernel socket address tables.
As the previous patch, no functional change should happen here. However,
it will be simpler to extend qc_snd_buf() for GSO usage.
qc_snd_buf() is a wrapper around emission syscalls. Given QUIC
configuration, a different variant is used. When using connection
socket, send() is the only used. For listener sockets, sendmsg() and
sendto() are possible. The first one is used only if local address has
been retrieved prior. This allows to fix it on sending to guarantee the
source address selection. Finally, sendto() is used for systems which do
not support local address retrieval.
All of these variants render the code too complex. As such, this patch
simplifies this by removing sendto() alternative. Now, sendmsg() is
always used for listener sockets. Source address is then specified only
if supported by the system.
This patch should not exhibit functional behavior changes. It will be
useful when implementing GSO as the code is now simpler.
When using listener socket, source address for emission is explicitely
set using ancillary data for sendmsg(). This is useful to guarantee the
correct address is used when binding on a non-explicit address.
This code was implemented directly under qc_snd_buf(). However, it is
quite complex due to portability issue. For IPv4, two parallel
implementations coexist, defined under IP_PKTINFO or IP_RECVDSTADDR. For
IPv6, another option is defined under IPV6_RECVPKTINFO. Each variant
uses its distinct name which increase the code complexity.
Extract ancillary data filling in a dedicated function named
cmsg_set_saddr(). This reduces greatly the body of qc_snd_buf(). Such
functions can be replicated when other ancillary data type will be
implemented. This will notably be useful for GSO implementation.
qc_snd_buf() is a wrapper for sendmsg() syscall (or its derivatives)
used for all QUIC emissions. This patch aims at removing several
non-optimal code sections :
* fd_send_ready() for connected sockets is only checked on the function
preambule instead of inside the emission loop
* zero-ing msghdr structure for unconnected sockets is removed. This is
unnecessary as all fields are properly initialized then.
* extra memcpy/memset invocations when using IP_PKTINFO/IPV6_RECVPKTINFO
are removed by setting directly the address value into cmsg buffer
Binding on multiple addresses for QUIC is safe only if IP_PKTINFO or
equivalent is available. Else, the behavior may be undefined as the
system is responsible to choose the network interface and source address
on response.
This commit adds a warning on boot if no or partial support for
IP_PKTINFO or equivalent is detected and configuration contains UDP
binding on multiple addresses.
This should be backported up to 2.6. Special backport recommdations :
* change ha_warning() to ha_diag_warning() to ensure no spurrious
warnings will be triggered on stable releases
* IP_PKTINFO usage was introduced on 2.7. For 2.6, multiple addresses
QUIC binding is always unreliable. As such, preprocessor condition
must simply be removed so that the warning is always active regarding
of the system. Warning message should also be truncated to suppress
IP_PKTINFO reference.
make it so string array construction is performed by dedicated macro
helpers instead of manual char insertion between string members.
The goal is to easily be able to support multiple forms of array
construction depending on the data encoding format (raw, json..).
Only %hrl and %hsl logformats are concerned.
Some log variables may be prefixed with specific chars that represent
extra informations that are relevant with it but are are not directly
part of the "raw" value.
ie: '+' char is prepended before some values when "option logasap" is
used to indicate that the value has not yet reached its final value.
However, as those "metadata" are printed using the general purpose
LOGCHAR() printing helper, it's not easy to tell if they are part of the
base value or not.
In this patch we add the LOGMETACHAR() helper that is a wrapper for
LOGCHAR(). The goal is to prepare for adding some logic to prevent such
additional infos from being generated when not relevant or needed.
quotes building for some log formats is directly performed under each
switch case statement so it would become painful to add other conditions
to prevent the quotes from being generated when it's not supported by the
the data encoding format for instance (ie: JSON).
Let's centralize and simplify quotes handling by adding LOGQUOTE_START()
and LOGQUOTE_END() helper macros. If a quotation is started and not
explicitly ended, it will be automatically ended at the end of the current
logformat node:
LOGQUOTE_START() sets 'quote' variable to 1, this way LOGQUOTE_END() only
prints the ending quote when needed. LOGQUOTE_END() is systematically
called after each node switch-case (after each value). LOGQUOTE_START()
does nothing if LOG_OPT_QUOTE isn't set, so does LOGQUOTE_END().
Some rare cases such as %hsl (list of captured headers) required special
handling: in this case multiple quoted texts are generated for the same
field value so explicit LOGQUOTE_START() + LOGQUOTE_END() combination was
needed.
last_isspace variable is explicitly set to 0 in all cases except
LOG_FMT_SEPARATOR case. So we can actually simplify the code by setting
last_isspace to 0 by default and skipping the assignment for the
LOG_FMT_SEPARATOR case.
Add the ability to manually specify desired output type after a custom
field name for logformat nodes. Forcing the type can be useful to ensure
value is stored with the proper type representation. (i.e.: forcing
numerical to string to work around the limited resolution of JS number
types)
By default, type is set to SMP_T_SAME, which means the original type will
be preserved.
Currently supported types are: bool, str, sint
type_to_smp(type) does the reverse operation of smp_to_type[smp]: it takes
a type name as input string and tries to return the corresponding SMP_T_*
smp type or SMP_TYPES if not found.
Add the ability to specify custom name (will be used for representation
in verbose output types such as json) to logformat nodes.
For now, a custom name should be composed by characters [a-zA-Z0-9-_]*
Transient send errors is handled differentely if using connection or
listener socket for QUIC transfers. In the first case, proper poller
subscription is used via fd_cant_send()/fd_want_send(). For the listener
socket case, error is ignored by qc_snd_buf() caller and retransmission
mechanism will allow to reemit the data.
For listener socket, transient error code handling is buggy. It blindly
uses fd_cand_send() with <qc.fd> member which is set to -1 for listener
socket usage. This results in an invalid fdtab access, with a possible
crash or a modification of a totally unrelated FD.
This bug is simply fixed by using qc_test_fd() before using
fd_cant_send()/fd_want_send(). This ensures <qc.fd> is used only if
initialized which is only the case when using connection socket.
No crash was reported yet for this bug. However, it is reproducible by
using ASAN compilation and the following strace sendmsg() errno command
injection :
# strace -qq -yy -p $(pgrep haproxy) -f -e trace=%network \
-e inject=sendto,sendmsg:error=EAGAIN:when=20+20
This must be backported up to 2.7.
If some data are received for a lua socket while the lua script responsible
to consume these data is not ready to do so, for instance because it is
sleeping, the applet is woken up in loop because it never states it will not
consume these data yet.
To fix the issue, in the applet I/O handle, when there are outgoing data, we
always pretend the applet will not consume it. It is the responsibility to
the lua script to reactivate receives by calling Socket.receive() function.
This patch must be backported to every stable version. For 2.4 and older,
si_want_get()/si_cant_get() must be used instead of
applet_will_consume()/applet_wont_consume().
It is poosible to create a lua socket without performing any connect. In
this case, the lua socket is released because of the garbage collector.
However, the garbarge collector does not release the applet, it wakes it
up. Since commit 751b59c40b ("BUG/MEDIUM: hlua: Initialize appctx used by a
lua socket on connect only"), the applet initialization is performed on
connect. So, here, it is possible to wake an uninitialized applet. It is an
unexpected case for the applet's I/O handler, leading to a segfault because
some resources are not initialized (the stream's target in this case).
So, now, in the lua socket GC function, we take care to immediately release
uninitialized applets. At worst, the release itself is delayed. But it is
safe because we are sure the applet's I/O handler will never be executed.
In addition, we take case to increment the GC counter when the lua socket is
created. The way, uninitialized lua socket are released more quickly.
This patch should fix the issue #2451. It must be backported as far as 2.6.
When an error is triggered during the applet initialization, a dedicated
function is called to release it. Indeed, in this case, because the applet
was not initialized, the ->release callback must not be called. However,
because the init stage may be delayed to be performed during the first
applet wakeup, we must also take care to not rely on the default
appctx_free() function, to immediately release the applet. Otherwise, if the
error happens in a delayed init stage, the applet is never released.
This patch partially fix the issue #2451. It must be backported as far as
2.6.
Currently haproxy does not implement dynamic table support for QPACK. As
such, dynamic table capacity advertized via H3 SETTINGS is 0. When
receiving a non-null Set Dynamic Table Capacity instruction, close
immediately the connection using QPACK_ENCODER_STREAM_ERROR.
Prior to this patch, such instructions were simply ignored. This is non
conform to QUIC specification.
This should be backported up to 2.6. Note that on 2.6 qcc_set_error()
must be replaced by function qcc_emit_cc_app().
Close the connection using QPACK_DECODER_STREAM_ERROR when receiving an
invalid insert count increment. As haproxy does not use dynamic table,
this instruction must never be emitted by the peer.
Prior to this patch, haproxy silently ignored such instruction which is
not conform to the QUIC specification.
This should be backported up to 2.6. Note that on 2.6 qcc_set_error()
must be replaced by function qcc_emit_cc_app().
As specified in RFC 9000, a client must never emit a HANDSHAKE_DONE
frame. If this happens, the server must close the connection with error
PROTOCOL VIOLATION.
Previously, such a frame was silently discarded on server side. The
connection remained opened which is not conformant to the specification.
This should be backported up to 2.6.
Ensure every frame types are handled in qc_parse_pkt_frms. Add an
ABORT_NOW on the default case. This is safe as an unknown frame must be
rejected prior via qc_parse_frm().
As specified by RFC 9000, connection is closed on error if an unknown
QUIC frame type is received.
Previously, a frame with unknown type was silently discarded. The
connection remained opened which is not conformant to the specification.
This should be backported up to 2.6.
Global options to disable for zero-copy forwarding are now tested outside
callbacks responsible to perform the forwarding itself. It is cleaner this
way because we don't try at all zero-copy forwarding if at least one side
does not support it. It is equivalent to what was performed before, but it
is simplier this way.
There is a nego stage when a producer is ready to forward data to the other
side. At this stage, the zero-copy forwarding may be disabled if the
consumer does not support it. However, there is a flaw with this way to
proceed. If the channel buffer is not empty, we delay the zero-copy
forwarding to flush all data from the channel first. During this delay,
receives on the endpoint (at connection level for muxes), are blocked to be
sure to have the opportunity to switch on zero-copy forwarding. It is a
problem if the consumer cannot flush data from the channel's buffer, waiting
for more data for instance.
It is especially annoying with the CLI applet, because this scenario can
happen if a command is partially received. For instance without the LF at
the end. In this case, the CLI applet is blocked because it waits more
data. The frontend connexion is also blocked because channel's data must be
flushed before trying to receive more data. Worst, this happen at where no
timeout is armed. Thus the session is stuck infinitly, client aborts cannot
be detected because receives are blocked, and the applet cannot abort on its
side because there are pending outgoing data. It is clearly a situation
where it is easy to consume all CLI slots.
To fix the issue, thanks to previous commits, we now check zero-copy
forwarding support on both sides before proceeding.
This patch relies on the following commits:
* MINOR: muxes: Announce support for zero-copy forwarding on consumer side
* MINOR: stconn: Add SE flag to announce zero-copy forwarding on consumer side
* MINOR: stconn: Rename SE_FL_MAY_FASTFWD and reorder bitfield
* CLEANUP: stconn: Move SE flags set by app layer at the end of the bitfield
All the series must be backported to 2.9.
It is unused for now, but the muxes announce their support of the zero-copy
forwarding on consumer side. All muxes, except the fgci one, are supported
it.
To fix a bug, a flag to announce the capabitlity to support the zero-copy
forwarding on the consumer side will be added on the SE descriptor. So the
old flag SE_FL_MAY_FASTFWD is renamed to indicate it concerns the producer
side. It is now SE_FL_MAY_FASTFWD_PROD. And to prepare addition of the new
flag, the bitfield is a bit reordered.
This revert of commit 0b93ff8c87 ("BUG/MEDIUM: stconn: Wake applets on
sending path if there is a pending shutdown") and 9e394d34e0 ("BUG/MINOR:
stconn: Don't report blocked sends during connection establishment") because
it was not the right fixes.
We must not wake an applet up when a shutdown is pending because it means
output some data are still blocked in the channel buffer. The applet does
not necessarily consume these data. In this case, the applet may be woken up
infinitly, except if it explicitly reports it wont consume datay yet.
This patch must be backported as far as 2.8. For older versions, as far as
2.2, it may be backported. If so, a previous fix must be pushed to prevent
an HTTP applet to be stuck. In http_ana.c, in http_end_request() and
http_end_reponse(), the call to channel_htx_truncate() on the request
channel in case of MSG_ERROR must be replace by a call to
channel_htx_erase().
Thanks to the previous patch ("MEDIUM: applet: Add notion of shutdown for
write for applets"), it is no longer necessary to check SC flags to detect
shutdowns to interrupt the wait command. It is possible to remove this ugly
workaround. In addition, we only test the SE for shutdown because end of
stream and error are already checked by the CLI I/O handler. And it is no
longer necessary to remove output data from the channel's buffer because
shutdown are not reported if there are remaining outgoing data.
Of course, if the "wait" command is backported, the commit above and this
one must be backported too.
In fact there is already flags on the SE to state a shutdown for reads or
writes was performed. But for applets, this notion does not exist. Both
flags are set in same time when the applet is released. But at the SC level,
there are functions to perform a shutdown (formely the shutw) and an abort
(formely the shutr). For applets, when a shutdown is performed on the SC, if
the applet is not immediately released, nothing is acknowledge at the SE
level.
With old way to implement applets, this was not an real issue until recently
because applets accessed to the channel/SC flags. It was thus possible to
catch the shutdowns. But the "wait" command on the CLI reveals the
flaw. Indeed, when this command is executed, nothing is read or sent. So, it
is not possible to detect the shutdowns. As a workaround, a dedicated test
on the SC flags was added at the end of the wait command I/O handler. But it
is pretty ugly.
With new way to implement applets, there is no longer access to the channel
or SC. So we must add a way to acknowledge shutdown into the SE.
This patch solves the both sides of the issue. The shutw notion is added for
applets. Its only purpose is to set SE_FL_SHWN flags. This flag is tested by
all applets, so, it solves the issue quite simply.
Note that it is described as a bug fix but there is no real issue, just a
design flaw. However, if the "wait" command is backported, this patch must
be backported too. Unfortinately it will require an adaptation because there
is no appctx flags on older versions.
These both flags are set after releasing the applet, in
appctx_shut(). Concretly, it means the applet is shutdown for reads and
writes. Once set, the applet's I/O handler was no longer called. Tests on
these flags are useless. There is no chance to match them.
This case does not exist yet with the H1 multiplexer, but applets may decide to
not produce data if there is not enough room in the destination buffer (the
applet's outbuf or the opposite SE buffer). It is true for the stats applets for
instance. However this case is not properly handled when the zero-copy
forwarding is in-use.
To fix the issue, the se_done_ff() function was modified to return the number of
bytes really forwarded and to subs for sends if nothing was forwarded while the
zero-copy forwarding was blocked by the producer. On the applet side, we take
care to block the zero-copy forwarding if the applet requests more room. At the
end, zero-copy forwarding is unblocked if something was forwarded.
This way, it is now possible for the stats applet to report a full buffer and
block the zero-copy forwarding, even if the buffer is not really full, by
requesting more room.
No backport needed.
An issue was introduced when zero-copy forwarding was added to the stats and
cache applets. There is no test to be sure the upper layer is ready to use
the zero-copy forwarding. So these applets refuse to deliver the response
into the applet's output buffer if the zero-copy forwarding is supported by
the opposite endpoint. It is especially an issue when a filter, like the
compression, is in-use on the response channel.
Because of this bug, the response is not delivered and the applet is woken
up in loop to produce data.
To fix the issue, an appctx flag was added, APPCTX_FL_FASTFWD, to know when
the zero-copy forwarding is in-use. We rely on this flag to not fill the
outbuf in the applet's I/O handler.
No backport needed.
This simplifies a bit the stats applet. Because the CLI part was not
refactored yet to use the applet's buffers, there are 3 ways to produce
data:
* the HTX message for the HTTP stats when zero-copy forwarding is not
used
* raw data in the opposite endpoint buffer for the HTTP stats when
zero-copy forwarding is used
* the channel buffer when the CLI "show stat" command is evaluated
There is already a dedicated function to take care to copy data at the right
place. There is now also a dedicated function to check us the output buffer
is almost full.
Commit 91b77c1632 ("MEDIUM: mux-h1: Support zero-copy forwarding for chunks with
an unknown size") was recently pushed but it contains 3 bugs. The first one is
during the nego. The extra size reserved for the CRLF at the end of the chunk
must not be added to the offset value. Indeed, the CRLF will be appended after
the data and not prepended to them.
The second one, still during the nego, is an integer overflow when the available
room in the output buffer is computed.
Finally, the last one is when the chunk itself is formatted. This part was
totally buggy if the output buffer was not empty at the beginning.
No backport needed.
A packet is considered as reordered when it is detected as lost because its packet
number is above the largest acknowledeged packet number by at least the
packet reordering threshold value.
Add ->nb_reordered_pkt new quic_loss struct member at the same location that
the number of lost packets to count such packets.
Should be backported to 2.6.
Let's say that the largest packet number acknowledged by the peer is #10, when inspecting
the non already acknowledged packets to detect if they are lost or not, this is the
case a least if the difference between this largest packet number and and their
packet numbers are bigger or equal to the packet reordering threshold as defined
by the RFC 9002. This latter must not be less than QUIC_LOSS_PACKET_THRESHOLD(3).
Which such a value, packets #7 and oldest are detected as lost if non acknowledged,
contrary to packet number #8 or #9.
So, the packet loss detection is very sensitive to such a network characteristic
where non acknowledged packets are distant from each others by their packet number
differences.
Do not use this static value anymore for the packet reordering threshold which is used
as a criteria to detect packet loss. In place, make it depend on the difference
between the number of the last transmitted packet and the number of the oldest
one among the packet which are still in flight before being inspected to be
deemed as lost.
Add new tune.quic.reorder-ratio setting to apply a ratio in percent to this
dynamic packet reorder threshold.
Should be backported to 2.6.
The new formula for K CUBIC which arrives with RFC 9438 is as follows:
K = cubic_root((W_max - cwnd_epoch) / C)
Note that W_max is c->last_w_max, and cwnd_epoch is c->cwnd when entering
quic_cubic_update() just after a congestion event.
Must be backported as far as 2.6.
The formula for K CUBIC calculation is as follows:
K = cubic_root(W_max * (1 - beta_quic) / C).
Note that this does not match the comment. But the aim of this patch is to not
hide a bug inside another patch to update this K CUBIC calculation.
The unit of C is bytes/s^3 (or segments/s^3). And we want to store K as
milliseconds. So, the conversion inside the cubic_root() to convert seconds in
milliseconds is wrong. The unit used here is bytes/(ms/1000)^3 or
bytes*1000^3/ms^3. That said, it is preferable to compute K as seconds, then
convert to milliseconds as done by this patch.
Must be backported as far as 2.6.
The CLI command "update ssl ocsp-response" was forcefully removing an
OCSP response from the update tree regardless of whether it used to be
in it beforehand or not. But since the main OCSP upate task works by
removing the entry being currently updated from the update tree and then
reinserting it when the update process is over, it meant that in the CLI
command code we were modifying a structure that was already being used.
These concurrent accesses were not properly locked on the "regular"
update case because it was assumed that once an entry was removed from
the update tree, the update task was the only one able to work on it.
Rather than locking the whole update process, an "updating" flag was
added to the certificate_ocsp in order to prevent the "update ssl
ocsp-response" command from trying to update a response already being
updated.
An easy way to reproduce this crash was to perform two "simultaneous"
calls to "update ssl ocsp-response" on the same certificate. It would
then crash on an eb64_delete call in the main ocsp update task function.
This patch can be backported up to 2.8.
As reported by github user @JB0925 in issue #2427, there is a possible
crash in pool_flush(). The problem is that if the free_list is not empty
in the first test, and is empty at the moment the xchg() is performed,
for example because another thread called it in parallel, we place a
POOL_BUSY there that is never removed later, causing the next thread to
wait forever.
This was introduced in 2.5 with commit 2a4523f6f ("BUG/MAJOR: pools: fix
possible race with free() in the lockless variant"). It has probably
very rarely been detected, because:
- pool_flush() is only called when stopping is set
- the function does nothing if global pools are disabled, which is
the case on most modern systems with a fast memory allocator.
It's possible to reproduce it by modifying __task_free() to call
pool_flush() on 1% of the calls instead of only when stopping.
The fix is quite simple, it consists in moving the zeroing of the
entry in the break path after verifying that the entry was not already
busy.
This must be backported wherever commit 2a4523f6f is.
In issue #2427 Ilya reports that gcc-14 rightfully complains about
sizeof() being placed in the left term of calloc(). There's no impact
but it's a bad pattern that gets copy-pasted over time. Let's fix the
few remaining occurrences (debug.c, halog, udp-perturb).
This can be backported to all branches, and the irrelevant parts dropped.
The "wait" command now supports a condition, "srv-unused", which waits
for the designated server to become totally unused, indicating that it
is removable. Upon each wakeup it calls srv_check_for_deletion() to
verify if conditions are met, if not if it's recoverable, or if it's
not recoverable, and proceeds according to this, never waiting for a
final decision longer than the configured delay.
The purpose is to make it possible to remove servers from the CLI after
waiting for their sessions to be terminated:
$ socat -t5 /path/to/socket - <<< "
disable server px/srv1
shutdown sessions server px/srv1
wait 2s srv-unused px/srv1
del server px/srv1"
Or even wait for connections to terminate themselves:
$ socat -t70 /path/to/socket - <<< "
disable server px/srv1
wait 1m srv-unused px/srv1
del server px/srv1"
Conditions will need to have context, arguments etc from the command line.
Since these will vary with time (otherwise we wouldn't wait), let's just
pass them as text (possibly pre-processed). We're starting with 4 strings
that are expected to be allocated by strdup() and are always sent to free()
upon release.
Since we'll support waiting for an action to succeed or permanently
fail, we need the ability to return an unrecoverable failure. Let's
add CLI_WAIT_ERR_FAIL for this. A static error message may be placed
into ctx->msg to report to the user why the failure is unrecoverable.
We'll need to be able to verify whether or not a server may be deleted.
For now, both the verification and the action are performed in the same
function, at once under thread isolation. The goal here is to extract
the verification code into a new function that will perform these checks,
return a status between success/recoverable/non-recoverable failure, and
will also return a message for the caller.
When an applet is using its own buffers, it is important to release them, if
empty, after processing to recycle unsued buffers. It is not a leak because
these buffers are necessarily released when the applet is released. But this
leads to an excess of buffer allocations.
No need to backport.
This allows to insert delays between commands, i.e. to collect a same
set of metrics at a fixed interval. E.g:
$ socat -t20 /path/to/socket <<< "show activity; wait 10s; show activity"
The goal will be to extend the feature to optionally support waiting on
certain conditions. For this reason the struct definitions and enums were
placed into cli-t.h.
The CLI applet doesn't make use of its timeout at all, only the stream
does. That's a wonder because it allows any command's I/O handler to
trivially set a wakeup timer by simply touching the task's ->expire
field, and the I/O handler will automatically be woken up again. The
only condition for this is that we properly take care of clearing that
timeout whenever we finish processing a command and switch back to the
PROMPT state. That's what this patch does.
If a release handler produces a final message, it's currently left
pending in the CLI context and needs another I/O event to be dumped
because immediately after calling ->release, we check for states
OUTPUT and above and we wait until more data arrives.
This patch adds continue statement to go back to the loop immediately
after leaving the release handler in order to attempt to emit the
output message.
At this point it's not sure whether any release handlers are producing
messages, so it's probably not needed to backport this.
Some commands are still missing their trailing LF, and very few were even
already spotted in the past emitting more than one. The risk of missing
this LF is particularly high, especially when tests are run in non-
interactive mode where the output looks good at first glance. The problem
is that once run in interactive mode, the missing empty line makes the
command not being complete, and scripts can wait forever.
Let's tackle the problem at its root: messages emitted at the end must
always end with an LF and we know some miss it. Thus, in cli_output_msg()
we now start by removing the trailing LFs from the string, and we always
add exactly one. This way the trailing LF added by correct functions are
silently ignored and all functions are now correct.
This would need to be progressively backported to all supported versions
in order to address them all at once, though the risk of breaking a legacy
script relying on the wrong output is never zero. At first it should at
least go as far as the lastest LTS (2.8), and maybe another one depending
on user demands. Note that it also requires previous patch ("BUG/MINOR:
vars/cli: fix missing LF after "get var" output") because it fixes a test
for a bogus output for "get var" in a VTC.
"get var" on the CLI was also missing an LF, and the vtest as well, so
that fixing only the code breaks the vtest. This must be backported to
2.4 as the issue was brought with commit c35eb38f1d ("MINOR: vars/cli:
add a "get var" CLI command to retrieve global variables").
Some cli_err(), cli_msg() or even ha_error() etc are missing the trailing
LF, which breaks the continuity of the CLI parsing: the extra LF that serves
to mark the end of the command is in fact taken as the missing LF and no
extra one is added.
This patch adds the missing LF on identified messages. It might be worth
trying to proceed in a more generic way with this, given the amount of
code that is possibly at risk.
We now update the session's tracked counters with the observed glitches.
In order to avoid incurring a high cost, e.g. if many small frames contain
issues, we batch the updates around h2_process_demux() by directly passing
the difference. Indeed, for now all functions that increment glitches are
called from h2_process_demux(). If that were to change, we'd just need to
keep the value of the last synced counter in the h2c struct instead of the
stack.
The regtest was updated to verify that the 3rd client that does not cause
issue still sees the counter resulting from client 2's mistakes. The rate
is also verified, considering it shouldn't fail since the period is very
long (1m).
This adds a new pair of stored types in the stick-tables:
- glitch_cnt
- glitch_rate
These keep count of the number of glitches reported on a front connection,
in order to decide how to act with a badly defective client or a potential
attacker. For now nothing updates these counters, but all the infrastructure
needed to configure, update and retrieve them was added, including the doc.
No regtest was added yet since they're not filled yet.
It's quite uncommon for a client to decide to change the connection's
initial window size after the settings exchange phase, unless it tries
to increase it. One of the impacts depending is that it updates all
streams, so it can be expensive, depending on the stacks, and may even
be used to construct an attack. For this reason, we now count a glitch
when this happens.
A test with h2spec shows that it triggers 9 across a full test.
Here we consider that if a HEADERS frame is made of more than 4 fragments
whose average size is lower than 1kB, that's very likely an abuse so we
count a glitch per 16 fragments, which means 1 glitch per 1kB frame in a
16kB buffer. This means that an abuser sending 1600 1-byte frames would
increase the counter by 100, and that sending 100 headers per request in
individual frames each results in a count of ~7 to be added per request.
A test consisting in sending 100M requests made of 101 frames each over
a connection resulted in ~695M glitches to be counted for this connection.
Note that no special care is taken to avoid wrapping since it already takes
a very long time to reach 100M and there's no particular impact of wrapping
here (roughly 1M/s).